diff --git a/README.md b/README.md
index d6ca677..342fd59 100644
--- a/README.md
+++ b/README.md
@@ -40,12 +40,23 @@ Compared to the US average, data center host communities are:
 - **Better connected**: 94.9% broadband (vs. 89%)
 
 ### Infrastructure Insights
-- **89% of data centers are in metropolitan tracts** (vs. 80% of all US tracts)
+- **89% of data centers are in metropolitan tracts** (vs. 80% of all US tracts) - only 1.11× over-index
 - **Non-metro data centers (11%)** are dominated by hyperscalers:
   - AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities
   - 66% are in Oregon + Washington (Columbia River hydro corridor)
-- **Energy infrastructure**: 4 states have >2/3 of generation within 50 km of a data center:
+- **Grid saturation**: 4 states have >2/3 of generation within 50 km of a data center:
   - New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%
+- **Hyperscaler energy strategies** (non-metro sites):
+  - AWS: 114 GW wind + 66 GW hydro
+  - Microsoft: 13 GW nuclear (Palo Verde co-location)
+  - Meta: 16 GW solar
+
+### Clustered vs. Isolated Facilities
+Facilities in DBSCAN clusters differ significantly from isolated sites:
+- **$35K income gap**: Clustered sites in tracts with median income $108K vs. $73K for isolated
+- **+18 pp education**: 51% bachelor's+ vs. 33%
+- **More diverse**: 25 pp less non-Hispanic white
+- **2× energy infrastructure**: 89 vs. 40 generators within 50 km
 
 ### Submarine Cables
 - **Data centers are NOT systematically closer to cables** than ordinary US cities
@@ -194,10 +205,18 @@ python3 make_internet_cables_map.py
 ## Data Quality Notes
 
 1. **Incomplete power ratings**: Only 5.9% of data centers have power ratings (108/1,833)
-2. **Operator fragmentation**: String variations ("Meta" vs. "Meta, Inc.") inflate distinct-operator counts
+2. **Operator fragmentation**: String variations ("Meta" vs. "Meta, Inc.", AWS variants) inflate distinct-operator counts
 3. **45 facilities** use city-precision fallback coordinates (approximate tract assignment)
 4. **7 facilities** failed RUCA join (Puerto Rico / non-US)
 5. **Broadband subscribers** are a coarse benefit proxy (actual cloud users are global)
+6. **EIA longitude correction**: 2008-2010 generator coordinates had sign errors, corrected in flat-table build
+
+## Known Limitations
+
+- **Power capacity**: Only 5.9% populated - nearby EIA generator capacity used as proxy
+- **Operator strings**: Need deduplication (50 of 190 non-metro facilities have null operator)
+- **Benefit measurement**: Broadband subscribers are an imperfect proxy for cloud computing benefits
+- **Universe**: Limited to 46 DC-host states (excludes DC-free states from ACS comparison)
 
 ## Research Ideas & Future Work
 
diff --git a/database-tables.md b/database-tables.md
index 116e226..89987f2 100644
--- a/database-tables.md
+++ b/database-tables.md
@@ -412,6 +412,93 @@ Tables are organized into four categories:
 
 ---
 
+### Other Tables
+
+#### `opposition_cases_geocoded`
+**Rows**: 18
+**Purpose**: Geocoded community-opposition cases against data center builds
+
+**Key Columns**:
+- `case_id` (TEXT) - Unique identifier
+- `developer` (TEXT) - Proposed developer/operator
+- `investment_billions` (DOUBLE PRECISION) - Investment amount in billions
+- `outcome` (TEXT) - Case outcome (approved, rejected, pending)
+- `governance_response` (TEXT) - Government response
+- `latitude`, `longitude`, `geom`
+
+**Source**: Compiled from news archives
+
+**Notes**: Loaded but currently unused - see research-ideas.md for proposed analyses
+
+#### `census_tract_huc8_link`
+**Rows**: 806
+**Purpose**: Tract↔HUC8 spatial overlap table
+
+**Key Columns**:
+- `geoid` (TEXT) - Census tract GEOID
+- `huc8` (TEXT) - HUC8 watershed code
+- `overlap_pct` (DOUBLE PRECISION) - Percentage of tract overlapping watershed
+
+**Notes**: Useful for downstream tract-level water-stress joins; limited to tracts containing data centers
+
+#### `im3_state_projected_moderate_50`
+**Rows**: 328
+**Purpose**: PNNL IM3 projected data center siting (moderate growth, gravity weight 0.50)
+
+**Key Columns**:
+- `facility_id` (TEXT)
+- `state` (TEXT)
+- `cost_millions` (DOUBLE PRECISION)
+- `it_mw` (DOUBLE PRECISION) - IT load in megawatts
+- `cooling_water_demand_gal_per_day` (DOUBLE PRECISION)
+- `latitude`, `longitude`, `geom`
+
+**Source**: PNNL Integrated Multisector Multiscale Modeling (IM3)
+
+**Notes**: Loaded but unused - potential for forward-projection analysis
+
+#### `im3_projected_state_demand_summary`
+**Rows**: 31
+**Purpose**: State-level rollup of IM3 projected facility counts, IT MW, and cooling demand
+
+**Key Columns**:
+- `state` (TEXT)
+- `facility_count` (INTEGER)
+- `total_it_mw` (DOUBLE PRECISION)
+- `total_cooling_demand_mgd` (DOUBLE PRECISION) - Million gallons per day
+
+**Source**: IM3 model outputs
+
+#### `utility_rate_tracker_2025_2028`
+**Rows**: 374
+**Purpose**: Utility rate-increase tracker by provider × state × service type
+
+**Key Columns**:
+- `provider` (TEXT) - Utility provider name
+- `state` (TEXT)
+- `service_type` (TEXT)
+- `effective_date` (DATE)
+- `monthly_increase_dollars` (DOUBLE PRECISION)
+- `percent_increase` (DOUBLE PRECISION)
+
+**Source**: Utility rate tracker database
+
+**Notes**: Loaded but unused in demographic/energy analysis
+
+#### `energy_atlas_layers_catalog`
+**Rows**: ~5
+**Purpose**: Metadata catalog of EIA layers ingested
+
+**Key Columns**:
+- `table_name` (TEXT)
+- `source_url` (TEXT)
+- `import_timestamp` (TIMESTAMP)
+- `row_count` (INTEGER)
+
+**Notes**: Created by `ingest_eia_energy_layers.py`
+
+---
+
 ## Commonly Used Joins
 
 ### Data Center to Demographics
diff --git a/research-ideas.md b/research-ideas.md
index 0a82127..8e368aa 100644
--- a/research-ideas.md
+++ b/research-ideas.md
@@ -61,6 +61,8 @@ canonical_map = {
 ### 3. Water Stress Overlay
 **Status**: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities
 
+**Priority**: HIGH - Critical for environmental impact analysis
+
 **Approach**:
 - Join to USGS WaterWatch streamflow data
 - Add USGS Drought Watch indicators by HUC8
@@ -69,10 +71,18 @@ canonical_map = {
 - Surface water withdrawal permits
 - Drought frequency/severity (USDM historical data)
 
+**Key Watersheds for Focus**:
+- **Middle Potomac-Catoctin** (HUC8 02070008): 235 DCs (12.8% of US total) - Loudoun/Ashburn
+- **Middle Potomac-Anacostia-Occoquan** (02070010): 111 DCs - Fairfax/inner Loudoun
+- **Coyote** (18050003): 88 DCs - Silicon Valley
+- **Upper Scioto** (05060001): 73 DCs - Columbus OH
+- **Umatilla** (17070103): 29 DCs - AWS-exclusive watershed
+
 **Research Questions**:
 - Are data centers sited in water-stressed watersheds?
 - Do high-density clusters (Loudoun County, Columbus OH) face water constraints?
 - Compare water stress in hyperscaler non-metro sites (Columbia River corridor) vs. metro clusters
+- Does single-operator watershed capture (Umatilla = AWS only) correlate with water availability?
 
 **Tables to Create**:
 - `watershed_water_stress` - HUC8-level water stress indicators
@@ -83,27 +93,38 @@ canonical_map = {
 ---
 
 ### 4. Opposition Cases Overlay
-**Status**: Anecdotal evidence of community opposition to new data centers
+**Status**: 18 geocoded opposition cases in `opposition_cases_geocoded` table (loaded but unused)
 
 **Approach**:
-- Compile cases of rejected/delayed data center proposals (news archive scraping)
-- Geocode opposition cases, join to demographics/hazards
+- Expand dataset: Compile additional cases of rejected/delayed data center proposals from news archives
+- Geocode all opposition cases, join to demographics/hazards
 - Test hypotheses:
   - Do wealthier/more educated communities successfully block projects?
   - Are opposition cases more common in water-stressed or drought-prone areas?
   - Do smaller non-metro communities have less bargaining power?
+  - Does clustered vs. isolated location predict opposition likelihood?
 
 **Research Questions**:
 - What predicts opposition success?
 - Are opposition cases spatially clustered?
 - Do demographics differ between accepted vs. rejected sites?
+- Correlation with FEMA hazard exposure scores?
+
+**Analysis Plan**:
+```sql
+-- Join opposition cases to demographics
+SELECT o.*, ct.median_household_income, ct.bachelors_or_higher_pct
+FROM opposition_cases_geocoded o
+JOIN _dc_census_tract_acs_2024 ct
+  ON ST_Contains(ct.geom, o.geom);
+```
 
 **Output**: `opposition_cases_analysis.md`
 
 ---
 
 ### 5. IM3 Forward Projection Integration
-**Status**: IM3 model includes projected data center demand growth
+**Status**: IM3 model data loaded in `im3_state_projected_moderate_50` (328 rows) and `im3_projected_state_demand_summary` (31 rows)
 
 **Approach**:
 - Load IM3 projected demand scenarios (2030, 2040, 2050)
@@ -113,10 +134,34 @@ canonical_map = {
 - Land availability (zoned industrial parcels)
 - Identify regions where projected demand may exceed infrastructure capacity
 
+**Grid Saturation Context** (from current analysis):
+- **New Jersey**: 83% of grid within 50 km of DC
+- **Nevada**: 75%
+- **Tennessee**: 70%
+- **Oregon**: 68%
+- **Arizona**: 56%
+- **Virginia**: 50%
+
 **Research Questions**:
 - Which states face grid saturation from data center growth?
 - Are projected sites in water-stressed watersheds?
 - Does IM3 assume spatial distribution patterns consistent with current siting?
+- Can states with >50% grid saturation accommodate projected demand?
+
+**Implementation**:
+```sql
+-- Compare current saturation to IM3 projected demand
+SELECT
+  current.state,
+  current.dc_count,
+  current.pct_grid_saturated,
+  proj.facility_count AS projected_new_facilities,
+  proj.total_it_mw AS projected_new_mw
+FROM state_grid_saturation current
+JOIN im3_projected_state_demand_summary proj ON current.state = proj.state
+WHERE current.pct_grid_saturated > 50
+ORDER BY current.pct_grid_saturated DESC;
+```
 
 **Notebook**: `im3_projection_overlay.ipynb`