updated map and cluster analysis
@@ -1,10 +1,10 @@
# US Data Centers — Demographic, Urban-Rural & Energy Context Analysis

**Date:** 2026-05-18
**Notebook:** [cluster_analysis.ipynb](../cluster_analysis.ipynb)
**Universe:** 1,833 data centers in `public.master_data_centers`, joined to ACS-2024 demographics, USDA RUCA-2020 codes, and EIA operating-generator capacity (50 km radius, latest period 2026-02, status=OP).

> **Update 2026-05-18**: 196 previously-null `state` values were backfilled from `geoid` (first 2 chars = state FIPS). All 1,833 DCs now have a state; all state-level numbers below reflect the corrected attribution.
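
The backfill described in the update note can be sketched as follows. This is a minimal illustration, not the script that was actually run: `backfill_state`, the truncated `FIPS_TO_STATE` lookup, and the sample rows (including their GEOIDs) are all hypothetical, and only a handful of state FIPS codes are included.

```python
# Hypothetical, truncated lookup: state FIPS code -> USPS abbreviation.
FIPS_TO_STATE = {
    "04": "AZ", "06": "CA", "39": "OH", "41": "OR",
    "48": "TX", "51": "VA", "53": "WA",
}

def backfill_state(state, geoid):
    """Keep an existing state; otherwise derive it from the first 2 chars of geoid."""
    if state:  # non-null, non-empty values are left untouched
        return state
    if not geoid or len(geoid) < 2:
        return None  # nothing to derive from
    return FIPS_TO_STATE.get(geoid[:2])

# Illustrative rows only; these are not records from master_data_centers.
rows = [
    {"master_id": 1, "state": None, "geoid": "51107611005"},
    {"master_id": 2, "state": "OR", "geoid": "41059950100"},
    {"master_id": 3, "state": None, "geoid": None},
]
for row in rows:
    row["state"] = backfill_state(row["state"], row["geoid"])
```

Rows with no `geoid` stay null in this sketch, so a backfill of this shape only reaches full coverage if every DC already carries a tract GEOID.
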
**Notebook:** [cluster_analysis.ipynb](../cluster_analysis.ipynb)

**Universe:** 1,833 data centers in `public.master_data_centers`, joined to ACS-2024 demographics, USDA RUCA-2020 codes, USGS HUC8 watersheds, and EIA operating-generator capacity (50 km radius, latest period 2026-02, status=OP).

---

@@ -15,6 +15,7 @@
3. **The non-metro tail is overwhelmingly hyperscale and Pacific Northwest.** Of 190 DCs outside metropolitan tracts (RUCA 4–10), AWS owns 67, Meta 22, Microsoft 10, Google 4, and Yahoo 2: a combined 105 DCs, or 55% of the non-metro footprint. Oregon (86) and Washington (40) alone hold 66% of non-metro DCs, anchored to the Columbia River hydropower corridor.
4. **Clustered DCs are demographically distinct from isolated ones.** DCs in DBSCAN clusters (n=1,583) sit in tracts with $108K median income vs. $73K for isolated DCs (n=250), a $35K gap. Tracts hosting clustered DCs are more educated (+18 pp bachelor's), more diverse (–25 pp non-Hispanic white), and embedded in much denser energy infrastructure (89 vs. 40 generators within 50 km).
5. **Microsoft co-locates with the largest US nuclear plant.** Microsoft's Goodyear, AZ campus has 14.6 GW of generation within 50 km, including 4.2 GW from Palo Verde Nuclear, the largest in the US. Although the campus sits in a RUCA-2 "Metro high-commute" tract (not strictly metro core), the surrounding grid is the densest by capacity in our analysis.
6. **Extreme watershed concentration: nearly half (48.4%) of all US DCs sit in just 15 of 2,139 HUC8 watersheds.** A single watershed, Middle Potomac-Catoctin (Loudoun County), holds 235 DCs (12.8% of the US total). The top 2 (both DC-Alley watersheds) hold 18.9%; the top 10 hold 40%. Water stress in any one of these basins propagates to a disproportionate share of national DC capacity.

---

@@ -26,6 +27,7 @@
| `master_data_center_spatial_clusters` | 1,831 | `master_id` | 99.9% |
| `_dc_census_tract_acs_2024` | ~73,000 tracts | `geoid` | 1,807 matched (98.6%) |
| `ruca_codes_2020_tract` | 85,528 tracts | `tract_fips_20 = geoid` | 1,826 matched (99.6%) |
| `watershed_huc8` | 2,139 watersheds | `ST_Contains(w.geom, m.geom)` | 1,831 matched (99.9%) |
| `energy_eia_operating_generator_capacity_flat` | 4.7M rows | `ST_DWithin(geom, 50km)` | 1,831 DCs have ≥1 nearby gen |

Energy aggregation uses period `2026-02` only with `status='OP'`, summing `nameplate_capacity_mw` for operating generators within 50 km of each DC. Note: EIA capacity columns were added to this table on 2026-05-17 — prior to that the `_flat` table had no MW values despite its name.
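
The aggregation rule above can be sketched in plain Python. This is an approximation under stated assumptions: the analysis itself uses a PostGIS `ST_DWithin` join, whereas this sketch substitutes a haversine great-circle distance, and the function names, dictionary keys other than `nameplate_capacity_mw`, and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby_capacity_mw(dc, generators, radius_km=50.0, period="2026-02"):
    """Count and sum nameplate MW of operating generators within radius_km of dc."""
    count, total = 0, 0.0
    for g in generators:
        if g["period"] != period or g["status"] != "OP":
            continue  # only the chosen period, only operating units
        if g["nameplate_capacity_mw"] is None:
            continue  # rows without a capacity value contribute nothing
        if haversine_km(dc["lat"], dc["lon"], g["lat"], g["lon"]) <= radius_km:
            count += 1
            total += g["nameplate_capacity_mw"]
    return count, total

# Illustrative inputs only (not real EIA rows).
dc = {"lat": 33.45, "lon": -112.40}
gens = [
    {"lat": 33.45, "lon": -112.60, "period": "2026-02", "status": "OP", "nameplate_capacity_mw": 100.0},
    {"lat": 33.50, "lon": -112.40, "period": "2026-02", "status": "OP", "nameplate_capacity_mw": 250.5},
    {"lat": 33.45, "lon": -112.60, "period": "2026-02", "status": "RE", "nameplate_capacity_mw": 80.0},
    {"lat": 33.45, "lon": -112.60, "period": "2026-01", "status": "OP", "nameplate_capacity_mw": 60.0},
    {"lat": 34.45, "lon": -112.40, "period": "2026-02", "status": "OP", "nameplate_capacity_mw": 500.0},
]
result = nearby_capacity_mw(dc, gens)
```

Because each DC sums every generator inside its own radius, generators near several DCs are counted once per DC; totals summed across sites therefore measure exposure, not distinct capacity.
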
@@ -49,7 +51,7 @@ Energy aggregation uses period `2026-02` only with `status='OP'`, summing `namep

**Interpretation.** DC tracts skew toward high-income, highly educated, technically connected, and racially diverse (specifically Asian-heavy) populations. Notably, DC tracts are *less* non-Hispanic white than the national average, not more. This reflects DC siting in mixed-race coastal and exurban tech corridors (Bay Area, Northern Virginia, Seattle) rather than in homogeneous suburbs.

**Data quality note.** `avg_household_size` contains sentinel-value pollution (min: −666,666,666), so the mean is unusable; the median (2.55) is sensible.
**Data quality note.** `avg_household_size` previously contained ACS sentinel-value pollution (−666,666,666) for 1,089 zero-population tracts in `_dc_census_tract_acs_2024` (29 of which contained DCs) plus 16 rows in `data_center_census_tracts_2024`. As of 2026-05-18, those sentinels have been replaced with `NULL`. The column now has plausible ranges (min 1.00, max 9.33) and a usable mean.
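
The effect of the sentinel on the mean, and of replacing it with a null, can be shown with a small sketch. The sentinel value is the one cited above; the helper name `clean_sentinel` and the sample values are hypothetical and chosen only to illustrate the mechanics.

```python
from statistics import mean

ACS_SENTINEL = -666666666  # sentinel value cited in the data quality note

def clean_sentinel(value, sentinel=ACS_SENTINEL):
    """Map the ACS sentinel to None (SQL NULL); pass every other value through."""
    return None if value == sentinel else value

# Illustrative household sizes; one zero-population tract carries the sentinel.
raw = [2.55, -666666666, 1.00, 9.33]
cleaned = [clean_sentinel(v) for v in raw]

# Aggregates skip nulls, mirroring how SQL AVG() ignores NULL.
usable = [v for v in cleaned if v is not None]
polluted_mean = mean(raw)    # dominated by the sentinel, hugely negative
clean_mean = mean(usable)    # a plausible household size
```

A single sentinel row is enough to push the mean far below zero, which is why the median was the only usable statistic before the cleanup.
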

---

@@ -125,6 +127,7 @@ National baseline of all US tracts: 80% Metropolitan, 9% Micropolitan, 3% Small
| 10 | Rural area | 77 | $93,820 | 12 | 42 |

**Two surprises:**

- Rural DCs (RUCA 10) sit in tracts with $93.8K median income, *higher* than micropolitan DCs ($63.7K–$72.7K). The rural DC sites are not poor rural America; they are wealthy-by-rural-standards counties chosen for power and water access.
- Micropolitan-core DCs (RUCA 4) have the *lowest* median income at $63.7K, the closest thing to "economic-development DC siting" in the dataset.

@@ -184,6 +187,7 @@ Aggregated across DCs in RUCA 2–10 (i.e. anything outside dense metro core, n=
| Yahoo | 2 | 1 | 7 | 3.5 | 6.4 | 0 | 0 | 0 | 0.7 |

**Distinct hyperscaler strategies, visible in the fuel mix:**

- **AWS** aggregates 114 GW of *wind* exposure across its 93 sites, by far the most renewable-coupled portfolio. It also has heavy hydro exposure (66 GW) from its OR/WA footprint and 201 GW of natural gas as baseline.
- **Microsoft** has the highest *nuclear* exposure (12.6 GW), almost entirely from its Goodyear, AZ campuses near Palo Verde Nuclear.
- **Meta** has the most *solar* (16 GW) among the named hyperscalers but minimal nuclear or wind, consistent with its New Mexico (Los Lunas) and Iowa builds.
@@ -201,15 +205,86 @@ Aggregated across DCs in RUCA 2–10 (i.e. anything outside dense metro core, n=

---

## 7. Watershed (HUC8) concentration

Each DC sits in exactly one USGS HUC8 watershed (8-digit hydrologic unit, subbasin scale, median ~3,000 sq km). Cooling-water draw and wastewater discharge happen at watershed scale, not state scale — a single stressed basin can cap an entire DC corridor regardless of how big the state's overall water budget is.
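
The one-watershed-per-DC assignment is a point-in-polygon test. The join itself runs in PostGIS (`ST_Contains`); the sketch below swaps in a plain even-odd ray-casting test to show the idea. The function names, the toy square "watersheds", and their IDs are hypothetical, and real HUC8 polygons (multi-part, with holes) need a proper geometry library.

```python
def point_in_polygon(x, y, ring):
    """Even-odd ray casting: ring is a list of (x, y) vertices of a simple polygon."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_watershed(point, watersheds):
    """Return the ID of the first watershed containing the point, else None."""
    x, y = point
    for huc8, ring in watersheds.items():
        if point_in_polygon(x, y, ring):
            return huc8
    return None

# Toy, non-overlapping squares standing in for watershed polygons.
toy_watersheds = {
    "toy-A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "toy-B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
hits = [assign_watershed(p, toy_watersheds) for p in [(1, 1), (3, 1), (5, 5)]]
```

Points that fall outside every polygon return `None`, which is the likely shape of the 2 DCs (1,833 minus 1,831) that the watershed join leaves unmatched.
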

### Where the 1,831 matched DCs land

- **257 distinct HUC8s** hold at least one DC. That is only **12% of the 2,139 US watersheds**; the other 88% have zero data centers.
- **The single largest watershed (Middle Potomac-Catoctin) holds 235 DCs**, 12.8% of the entire US data-center footprint.
- DC concentration is much more extreme at the watershed level than at the state level: the whole state of Virginia has 20.6% of US DCs, while the single Loudoun watershed holds 12.8%.

### Cumulative concentration

| Top N watersheds | DCs (cumulative) | Share of matched DCs (n=1,831) |
|---:|---:|---:|
| 1 | 235 | 12.8% |
| 2 | 346 | 18.9% |
| 3 | 434 | 23.7% |
| 5 | 551 | 30.1% |
| **10** | **736** | **40.2%** |
| **15** | **887** | **48.4%** |
| 20 | 1,012 | 55.3% |
| 30 | 1,186 | 64.8% |
| 50 | 1,380 | 75.4% |
| 100 | 1,611 | 88.0% |

**Half of all US data centers live in just 15 watersheds.** Three-quarters in 50. Water stress, drought policy, or thermal-discharge limits in any one of these basins propagates to a large share of the national footprint.
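
The cumulative shares can be reproduced from per-watershed counts. The sketch below uses the 15 largest watershed counts listed in this section; dividing by 1,831 (the matched-DC count) rather than 1,833 is an assumption, chosen because it is the denominator that reproduces the published percentages. The function name `cumulative_share` is hypothetical.

```python
from itertools import accumulate

# DC counts for the 15 largest watersheds, as listed in this section.
TOP15_COUNTS = [235, 111, 88, 73, 44, 40, 39, 37, 36, 33, 32, 31, 30, 29, 29]
MATCHED_DCS = 1831  # assumption: shares are relative to DCs matched to a HUC8

def cumulative_share(counts, total):
    """Return (rank, cumulative DCs, cumulative share in %) for counts sorted descending."""
    ordered = sorted(counts, reverse=True)
    return [
        (rank, running, round(100 * running / total, 1))
        for rank, running in enumerate(accumulate(ordered), start=1)
    ]

table = cumulative_share(TOP15_COUNTS, MATCHED_DCS)
for rank, dcs, pct in table:
    if rank in (1, 2, 3, 5, 10, 15):
        print(f"top {rank:>2}: {dcs:>4} DCs, {pct}%")
```

Extending the list to all 257 populated watersheds would give the remaining rows (top 20 through top 100) the same way.
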

### Top 15 watersheds by DC count

| HUC8 | Name | States | DCs | Cluster |
|---|---|---|---:|---|
| 02070008 | Middle Potomac-Catoctin | DC, MD, VA, WV | 235 | Loudoun / Ashburn (DC-Alley) |
| 02070010 | Middle Potomac-Anacostia-Occoquan | DC, MD, VA | 111 | Fairfax + inner Loudoun + DC |
| 18050003 | Coyote | CA | 88 | Silicon Valley / San Jose |
| 05060001 | Upper Scioto | OH | 73 | Columbus (fastest-growing market) |
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | Boardman / Hermiston (hyperscale hydro) |
| 17020015 | Lower Crab | WA | 40 | Quincy / Moses Lake (hyperscale hydro) |
| 17090010 | Tualatin | OR | 39 | Hillsboro (Intel / Google) |
| 12030105 | Upper Trinity | TX | 37 | DFW |
| 10230006 | Big Papillion-Mosquito | IA, NE | 36 | Council Bluffs / Omaha (Meta) |
| 07120004 | Des Plaines | IL, WI | 33 | Chicago metro |
| 12100302 | Medina | TX | 32 | San Antonio |
| 02030105 | Raritan | NJ | 31 | Central NJ carrier hotels |
| 15050100 | Middle Gila | AZ | 30 | Phoenix metro |
| 02030103 | Hackensack-Passaic | NJ, NY | 29 | NYC metro east |
| 17070103 | Umatilla | OR | 29 | AWS-only (all 29 DCs); pure single-operator basin |

The non-metro / hyperscale Pacific Northwest story is visible at watershed scale: three Columbia-system watersheds (**Middle Columbia-Lake Wallula, Lower Crab, Umatilla**) hold 113 DCs combined, all hyperscale-dominated. Umatilla is operationally **AWS-exclusive** — all 29 DCs in that basin are AWS.

### Non-metro watersheds (RUCA 4–10): where hyperscalers cluster

| HUC8 | Name | States | DCs | Operators |
|---|---|---|---:|---|
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | AWS (multiple variants), Rowan Green Data |
| 17020015 | Lower Crab | WA | 40 | CyrusOne, Intuit, Microsoft, NTT, Sabey, Yahoo |
| 17070103 | Umatilla | OR | 29 | AWS only |
| 17070305 | Lower Crooked | OR | 8 | Meta (Prineville) |
| 13020203 | Rio Grande-Albuquerque | NM | 7 | Meta (Los Lunas) |
| 03050105 | Upper Broad | NC, SC | 6 | Meta |
| 13070001 | Lower Pecos-Red Bluff Reservoir | NM, TX | 5 | IONIC Digital |
| 17070105 | Middle Columbia-Hood | OR, WA | 4 | Google (The Dalles) |
| 02050107 | Upper Susquehanna-Lackawanna | PA | 3 | AWS |
| 03070103 | Upper Ocmulgee | GA | 2 | Meta |

This view is the cleanest evidence yet of the *hyperscale geographic strategy* — single-operator capture of individual watersheds (Umatilla = AWS, Lower Crooked = Meta, Middle Columbia-Hood = Google, Rio Grande-Albuquerque = Meta). Each of these basins has effectively been claimed by one player.
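
Single-operator capture can be detected by grouping DCs by watershed and checking how many distinct operators appear. The sketch below is illustrative only: `single_operator_basins` is a hypothetical helper, the rows are toy data that reuse HUC8 codes from the table above, and operator names are assumed to be already normalized.

```python
from collections import defaultdict

def single_operator_basins(records):
    """Map HUC8 -> (operator, DC count) for basins with exactly one named operator.

    records: iterable of (huc8, operator) pairs, one per DC.
    Caveat: DCs with a null operator are counted but cannot break exclusivity,
    so a basin with unattributed DCs is only provisionally single-operator.
    """
    operators = defaultdict(set)
    dc_counts = defaultdict(int)
    for huc8, operator in records:
        dc_counts[huc8] += 1
        if operator:
            operators[huc8].add(operator)
    return {
        huc8: (next(iter(names)), dc_counts[huc8])
        for huc8, names in operators.items()
        if len(names) == 1
    }

# Toy rows, not the real per-DC records.
toy_records = [
    ("17070103", "AWS"), ("17070103", "AWS"), ("17070103", "AWS"),
    ("17020015", "Microsoft"), ("17020015", "Yahoo"),
    ("17070105", "Google"), ("17070105", None),
]
captured = single_operator_basins(toy_records)
```

Given that 26% of non-metro DCs have a null `operator`, the caveat in the docstring matters: exclusivity claims hold only for the DCs that carry an operator name.
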

### Implications for water-stress analysis

This watershed view is a **boundary set** for downstream water-stress analysis. Pull WaterWatch streamflow data, USGS water-use estimates, or EPA drought indicators against just these 257 HUC8s (or against just the top 15 for the highest-leverage story). A single-pull stress index against this set would size the "water exposure" of the entire US DC fleet.

---

## Data quality flags

1. ~~196 of 1,833 DCs (10.7%) have null `state`~~ **Resolved 2026-05-18** by backfilling from `geoid` first-2-chars (state FIPS).
2. **`master_data_centers.power_mw` is populated for only 108 / 1,833 DCs (5.9%).** Useless as a sizing metric without imputation or alternative source. Nearby EIA capacity is the more reliable proxy (used as the per-DC scale in this analysis). A grant-funded scrape of Baxtel / Data Center Map would close this gap.
3. **50 of 190 non-metro DCs (26%) have null `operator`.** Likely hyperscalers based on geography (OR/WA) but unconfirmed.
4. **Operator-string fragmentation**: "Meta" vs. "Meta, Inc."; "Amazon Web Services" vs. "Amazon AWS" vs. "amazon web services"; "Microsoft" vs. "Microsoft Azure". Inflates distinct-operator counts and fragments per-operator totals.
5. **`avg_household_size` column has sentinel pollution** (min: −666,666,666). Use median or filter before using.
6. **7 DCs failed RUCA join**: Puerto Rico tracts or non-US locations; trivial.
7. **EIA generator coordinates had a longitude sign error for 2008-01 through 2010-11** (~11K rows with positive lower-48 longitudes). The flat-table build at [ingest_eia_energy_layers.py:839-870](../ingest_eia_energy_layers.py#L839-L870) corrects this in `longitude` and `geom`, so spatial joins are unaffected.
1. **`master_data_centers.power_mw` is populated for only 108 / 1,833 DCs (5.9%).** Useless as a sizing metric without imputation or alternative source. Nearby EIA capacity is the more reliable proxy (used as the per-DC scale in this analysis). A grant-funded scrape of Baxtel / Data Center Map would close this gap.
2. **50 of 190 non-metro DCs (26%) have null `operator`.** Likely hyperscalers based on geography (OR/WA) but unconfirmed.
3. **Operator-string fragmentation**: "Meta" vs. "Meta, Inc."; "Amazon Web Services" vs. "Amazon AWS" vs. "amazon web services"; "Microsoft" vs. "Microsoft Azure". Inflates distinct-operator counts and fragments per-operator totals.
4. ~~`avg_household_size` column has sentinel pollution~~ **Resolved 2026-05-18**: 1,089 sentinel values (−666,666,666) in `_dc_census_tract_acs_2024` and 16 in `data_center_census_tracts_2024` replaced with `NULL`. Affected 29 DCs.
5. **7 DCs failed RUCA join**: Puerto Rico tracts or non-US locations; trivial.
6. **EIA generator coordinates had a longitude sign error for 2008-01 through 2010-11** (~11K rows with positive lower-48 longitudes). The flat-table build at [ingest_eia_energy_layers.py:839-870](../ingest_eia_energy_layers.py#L839-L870) corrects this in `longitude` and `geom`, so spatial joins are unaffected.
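
A correction of the kind described in the longitude flag might look like the sketch below. It is an assumption about the logic, not a copy of `ingest_eia_energy_layers.py`: the function name and the rule (negate any positive longitude inside the affected period window) are hypothetical, and a production fix would also need to confirm the row belongs to a lower-48 state, since a few US territories legitimately have positive longitudes.

```python
def fix_longitude_sign(longitude, period):
    """Negate positive longitudes reported during the affected window.

    period is a 'YYYY-MM' string; zero-padded strings of this shape
    compare correctly with plain lexicographic ordering.
    Assumption: the caller has already restricted rows to lower-48 generators.
    """
    if longitude is None:
        return None
    if "2008-01" <= period <= "2010-11" and longitude > 0:
        return -longitude
    return longitude

# Illustrative values only.
fixed = [
    fix_longitude_sign(112.86, "2009-06"),   # inside window, wrong sign
    fix_longitude_sign(-112.86, "2009-06"),  # inside window, already correct
    fix_longitude_sign(112.86, "2011-01"),   # outside window, left alone
    fix_longitude_sign(None, "2009-06"),     # missing coordinate
]
```

Any geometry column derived from the coordinates has to be rebuilt after the sign flip, which matches the note that both `longitude` and `geom` are corrected.
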

---

@@ -217,6 +292,6 @@ Aggregated across DCs in RUCA 2–10 (i.e. anything outside dense metro core, n=

1. **Backfill `power_mw`** from Baxtel / Data Center Map (paid scrape; grant work).
2. **Operator-string deduplication**: collapse "Meta"/"Meta, Inc.", "AWS" variants, etc., before any per-operator analysis.
3. **Watershed (HUC8) join**: `public.watershed_huc8` is loaded but unused; would let us look at water stress overlap, particularly for the 190 non-metro DCs.
3. **Water-stress overlay against the 257 watersheds**: now that the HUC8 join is in place, pull USGS WaterWatch streamflow data, USGS water-use estimates, or EPA drought-status indicators against this watershed set. A single stress index per HUC8 would size the entire US fleet's water exposure.
4. **State-level energy demand context**: `im3_state_projected_moderate_50` and `seds_state_msn_year` are loaded; joining these would let us compute "DC nearby capacity as share of state grid" rather than absolute MW.
5. **Opposition cases overlay**: `opposition_cases_geocoded` is loaded but unused; could test whether cluster-vs-isolated demographic differences predict community opposition.
5. **Opposition cases overlay**: `opposition_cases_geocoded` is loaded but unused; could test whether cluster-vs-isolated demographic differences (or watershed concentration) predict community opposition.
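
For the operator-string deduplication step, one possible shape is a small alias table applied after light text normalization. This is a sketch under assumptions: `ALIASES` and `normalize_operator` are hypothetical, the alias list covers only the variants named in the data quality flags, and whether "Microsoft Azure" should collapse into "Microsoft" is an analyst decision rather than a given.

```python
import re

# Hypothetical alias table: normalized key -> canonical operator name.
ALIASES = {
    "meta": "Meta",
    "meta inc": "Meta",
    "amazon web services": "AWS",
    "amazon aws": "AWS",
    "aws": "AWS",
    "microsoft": "Microsoft",
    "microsoft azure": "Microsoft",
}

def normalize_operator(raw):
    """Collapse known spelling variants; leave unknown operators as-is (trimmed)."""
    if raw is None:
        return None  # null operators stay null rather than being guessed
    key = re.sub(r"[^a-z0-9 ]+", "", raw.lower())  # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()          # collapse whitespace
    return ALIASES.get(key, raw.strip())

variants = ["Meta", "Meta, Inc.", "Amazon Web Services", "Amazon AWS",
            "amazon web services", "Microsoft Azure", " Sabey ", None]
canonical = [normalize_operator(v) for v in variants]
```

Keeping the mapping in an explicit table, rather than fuzzy matching, makes every merge reviewable before per-operator totals are recomputed.
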