3. Methodology¶

Overview¶

This chapter has two parts. Section 3.1 documents how raw HDB transaction records, school datasets, and amenity layers are cleaned, geocoded, enriched, and merged into the final analytic file data/processed/all_hdb_final_processed.csv. Section 3.2 then describes the four modelling components used in the analysis: pooled hedonic OLS, local-boundary RDD around the 1 km and 2 km cutoffs, a spatial error robustness model, and town-level heterogeneity analysis.

3.1 Data Processing¶

3.1.1 Data Summary¶

Dataset	Source	Coverage	Used for
HDB resale transactions	data.gov.sg resale flat price releases	2000-2025 resale transactions across four raw CSV files	Core transaction sample, resale price outcome, and flat-level structural variables
School general information	data.gov.sg General Information of Schools	Singapore school registry snapshot, filtered to primary schools	School master table, school attributes, and school addresses
School subjects, programmes, and CCA	MOE / data.gov.sg CSV releases	School-level enrichment files	Subject, programme, and CCA features
School popularity / ballot history	`sgschooling.com` annual P1 ballot history pages	School-year phase data, 2009-2025	Annual popularity measures and yearly school rankings
Amenity layers	data.gov.sg GeoJSON/catalogues, LTA DataMall, and Wikipedia	MRT, malls, parks, supermarkets, hawkers, and bus stops	Amenity reference layers for spatial feature construction
Geocoding service	OneMap Search API	All unique HDB and school addresses used in the study	Latitude and longitude for spatial joins

Table 1. Data Sources

3.1.2 Data Pipeline¶

Subsection	Scripts	Main input	Main output	Purpose
HDB transaction preparation	`01a_hdb_collect.py`, `01b_hdb_geocode.py`	Raw HDB resale CSV files	`all_hdb_geocoded.csv`	Harmonize schemas, remove exact duplicates, derive time, storey, and lease fields, and geocode HDB addresses.
School data enrichment and annual ranking construction	`02a_school_general_information.py`, `02b_school_subjects.py`, `02c_school_cca.py`, `02d_school_geocode.py`, `02e_school_popularity.py`, `02f_school_rankings.py`	Raw school information, subject/programme/CCA files, and ballot history pages	`school_general_info_enriched.csv`, `school_ranking/yearly_school_rankings.csv`	Build a primary-school reference table with attributes and coordinates, then generate year-specific school rankings.
Amenity and school exposure feature engineering	`03a_hdb_amenities.py`, `03b_hdb_facility_features.py`, `03c_hdb_school_exposure_features.py`, `03d_hdb_final_processing.py`	Geocoded HDB data, enriched school table, annual school rankings, and amenity layers	`all_hdb_final_processed.csv`	Compute amenity proximity/count features, construct year-matched school exposure variables, and create final modelling covariates.

Table 2. Data Pipeline

3.1.3 P1 Cycle Year Alignment¶

School exposure features are matched by p1_cycle_year rather than by calendar transaction year alone. In the feature-engineering pipeline, the P1 cycle is defined on a July-to-June basis to better reflect the timing of the P1 registration exercise. The mapping is therefore:

2015-07 to 2016-06 -> p1_cycle_year = 2015
2016-07 to 2017-06 -> p1_cycle_year = 2016
...
2025-07 to 2026-03 -> p1_cycle_year = 2025

3.2 Modelling¶

3.2.1 Model Workflow¶

Pipeline diagram

Figure 1. Model Workflow

3.2.2 School Ranking and Dataset Integration¶

3.2.2.1 Popularity Index and Ridge Regression¶

School desirability is measured using a Popularity Index derived from Phase 2B and Phase 2C registration data. We calculate a 3-year rolling average of the applicant-to-vacancy ratio to ensure the metric reflects sustained demand rather than annual statistical noise.

To determine final rankings, we employ Ridge Regression ($L2$ penalty, $\alpha = 1.0$), which effectively handles multicollinearity among highly correlated school features: * Academic/Special Programs: SAP, GEP, and Autonomous status. * Offerings: Number of CCA and Higher Mother Tongue subjects. * Controls: School zone, gender composition (nature), and session types.

The model generates a Predicted Score, from which we derive an Annual Rank. We define "Good Schools" as the top 20% of schools in each academic year, creating a binary indicator that captures the tier of schools most likely to trigger residential price premiums.

3.2.2.2 Spatial-Temporal Data Alignment¶

The final dataset merges HDB transactions with granular proximity-based features, including distance to MRT stations, bus stops, hawker centers, and shopping malls.

To eliminate look-ahead bias, we implemented a strict temporal join for MRT stations and malls: * Dynamic Verification: Each transaction is matched only with infrastructure that was officially operational at the time of sale. * Accuracy: This ensures the "school premium" is isolated from general price appreciation driven by the gradual expansion of the transport network (e.g., opening of the Thomson-East Coast Line).

3.2.3 Hedonic Regression Framework¶

3.2.3.1 Treatment Partition¶

School exposure is partitioned into three mutually exclusive binary indicators. The partition is constructed from the ring-count variables produced by the BallTree spatial join:

Indicator	Condition	Omitted Baseline
`has_top20_0_1km_only`	`top_20_percent_count_0_1km >= 1` AND `top_20_percent_count_1_2km = 0`	No top-20% school within 2 km
`has_top20_1_2km_only`	`top_20_percent_count_0_1km = 0` AND `top_20_percent_count_1_2km >= 1`	No top-20% school within 2 km
`has_top20_both`	`top_20_percent_count_0_1km >= 1` AND `top_20_percent_count_1_2km >= 1`	No top-20% school within 2 km

Table 3. Treatment Partition

Mutual exclusivity is enforced by construction so that each transaction belongs either to one of the three exposure groups or to the omitted baseline category.

3.2.3.2 Estimating Equation¶

See Appendix B. The three school-exposure coefficients are semi-elasticities on log price: a coefficient of 0.05 implies an approximate 5% premium relative to the omitted baseline.

3.2.3.3 Why Cluster by Address¶

Multiple transactions occur at the same block-and-unit address over the sample period. Clustering by address produces conservative standard errors that account for within-building correlation in resale prices.

3.2.4 Regression Discontinuity Design¶

3.2.4.1 Design¶

The RDD component is used as a local boundary-effect test around school-distance cutoffs. Rather than estimating an overall market-wide premium, it asks whether log resale prices jump discontinuously when a flat moves from just inside to just outside a given school-distance ring.

Element	Value
Outcome	`Y = log_resale_price`
Running variable	`X = nearest_top_20_percent_school_distance_m`
Main cutoffs	`c = 1000` and `c = 2000`
Supplementary scan cutoffs	`c in {300, 500, 800}` with `1000` repeated in the multi-cutoff scan as the main benchmark

Table 4. RDD Design

For each cutoff c, the estimand is the local discontinuity:

tau(c) = lim E[Y | X = c-] - lim E[Y | X = c+]

where c- denotes observations just inside the boundary and c+ denotes observations just outside it. Under this sign convention, a negative estimate implies that flats just outside the cutoff are cheaper than flats just inside.

3.2.4.2 Estimation¶

The current implementation uses rdrobust to estimate a local-linear (p = 1) RDD at each cutoff, see Appendix D.

3.2.4.3 Bandwidth Specifications¶

Each main cutoff is estimated in two ways:

Optimal-bandwidth specification selected automatically by rdrobust
Manual sensitivity checks at h in {200, 300, 500, 750} metres

The optimal bandwidth is treated as the main specification. The manual windows are used to check whether the sign and rough magnitude of the estimated jump remain stable as the local comparison set widens.

3.2.4.4 Interpretation¶

The RDD outputs are interpreted as local discontinuity evidence, not as overall market effects. The 1km and 2km results should be read together with the bandwidth sensitivity checks and the supplementary cutoff scan. Because the current implementation does not include formal density or covariate continuity tests, the resulting evidence is described as suggestive local boundary evidence rather than stand-alone definitive proof of a clean causal policy effect.

3.2.5 Spatial Regression Robustness Check¶

method1_spatial_regression-2.ipynb implements a pooled Spatial Error Model (SEM) as a robustness check on the main hedonic specification. The purpose is not to replace the baseline OLS model, but to test whether the school-exposure coefficients remain broadly similar after explicitly accounting for residual spatial dependence among nearby resale transactions.

3.2.5.1 Design¶

The SEM uses the same outcome and the same core school-treatment variables as the pooled hedonic model:

Component	Specification
Outcome	`log_resale_price`
School variables	`has_top20_0_1km_only`, `has_top20_1_2km_only`, `has_top20_both`
Continuous controls	`floor_area_sqm`, `storey_mid`, `remaining_lease_year`
Facility controls	Same 12 distance-and-count controls used in the main hedonic OLS
Fixed effects	Dummy variables for `flat_type`, `town`, and `p1_cycle_year`

Table 5. Spatial Regression Variables

All fixed effects are expanded explicitly using one-hot encoding with one omitted category per factor. The estimation sample is defined by listwise deletion on the full SEM variable set, including latitude and longitude, so that the spatial model and its diagnostic OLS benchmark are run on the same transactions.

3.2.5.2 Same-Year KNN Spatial Weights¶

The spatial weights matrix is constructed using a block-diagonal same-year K-nearest-neighbour graph:

Sort the data by transaction_year.
Within each year, connect each transaction to up to k = 10 nearest neighbours using longitude and latitude.
Convert each year-specific KNN matrix into a sparse block.
Stack the yearly blocks into a block-diagonal matrix, so that no observation is linked to neighbours from a different transaction year.
Row-standardize the resulting matrix before estimation.

Restricting neighbours to the same transaction year avoids creating artificial spatial links across distinct market regimes.

3.2.5.3 Diagnostic Step: OLS Residual Moran's I¶

Before fitting the SEM, the notebook estimates a pooled OLS model using the same covariate set and computes Moran's I on the OLS residuals using the same-year KNN weights. This diagnostic checks whether residual spatial autocorrelation remains after the standard hedonic controls and fixed effects have already been added.

If Moran's I remains large and statistically significant, then standard pooled OLS may still leave location-related shocks in the error term, which motivates the SEM robustness check.

3.2.5.4 Spatial Error Model Specification¶

The SEM is estimated using spreg.GM_Error_Het, a heteroskedasticity-robust generalized moments estimator for the global spatial error model:

y = X beta + u
u = lambda W u + epsilon

where y is log_resale_price, X contains the school variables and controls, W is the same-year row-standardized KNN weights matrix, and lambda captures residual spatial error dependence.

3.2.5.5 Interpretation¶

This specification is interpreted as a spatial robustness check on the pooled hedonic model. If the school-exposure coefficients remain directionally similar after the SEM correction, then the main findings are less likely to be driven purely by omitted neighbourhood-level spatial shocks. The SEM remains an associational model and is not treated as a separate causal design.

3.2.6 Town-Level Heterogeneity Analysis¶

town_premium.py estimates whether the school-related premium varies systematically across towns. This is a descriptive heterogeneity exercise rather than a separate causal identification strategy: the goal is to compare conditional school premiums across towns after holding constant the same structural and amenity controls used in the main hedonic model.

3.2.6.1 Model Design¶

See Appendix C for the full model specification strings. The 0 + C(town) specification removes the global intercept so that each eligible town receives its own intercept and its own town-specific slope on the relevant school-count variable. Model A uses the count of top-20% schools within 0-1 km. Model B uses top_20_percent_count_0_2km = top_20_percent_count_0_1km + top_20_percent_count_1_2km.

3.2.6.2 Eligibility Rule¶

Town-specific premiums are estimated only for towns that satisfy both conditions:

At least ELIGIBILITY_MIN_N = 100 transactions.
Within-town variation in the focal school-count variable (count_nunique >= 2).

Towns that fail the variation criterion are labelled not_estimable_count_no_variation rather than assigned a premium of zero.

3.2.6.3 Interpretation¶

The town-specific interaction coefficient is a semi-elasticity: it measures the percentage change in resale price associated with one additional top-20% school within the relevant radius, conditional on controls. These are count-based marginal effects, not binary treatment effects.

For benchmarking, each town's estimate is compared against the sample-weighted average premium across eligible towns in the same model. The cross-model comparison then identifies towns that remain above average in both specifications and towns whose estimated premium changes most when the radius expands from 0-1 km to 0-2 km.

3.2.7 Design Summary¶

Dimension	Hedonic OLS	RDD	Spatial SEM
Estimand	Average premium across all transactions	Local discontinuity in log price at selected distance cutoffs	Average premium across all transactions after explicit correction for spatially correlated errors
Identification	Conditional on controls (selection on observables)	Local continuity of the price-distance relationship around each cutoff	Conditional on controls plus a structured spatial error process
Sample	Full regression sample	Transactions within the selected bandwidth around each cutoff	Listwise-complete pooled sample with valid coordinates
Inference	Clustered by `address`	Robust bias-corrected `rdrobust` inference with NN variance estimation	`GM_Error_Het` GMM-based z-statistics under a spatial error model
Key confounding threat	Unobserved neighbourhood characteristics	Sorting near cutoffs and non-unique nonlinear distance gradients	Misspecified neighbourhood structure or remaining non-spatial omitted variables

Table 6. Design Summary

3.3 LLM Appplication¶

Our web application helps users explore resale HDB prices in Singapore while providing convenient access to nearby school information. Through an interactive map, users can view housing price trends alongside surrounding amenities, especially primary schools. The platform includes a school search and analysis feature that provides key details such as school locations, distance from selected flats, and relevant school-level information. A town-level explorer also allows users to compare different areas, while an integrated chatbot offers personalized assistance in finding nearby schools and answering related queries. LLM Application