# US Data Centers — Demographic, Urban-Rural & Energy Context Analysis
**Date:** 2026-05-18
**Notebook:** [cluster_analysis.ipynb](../cluster_analysis.ipynb)
**Universe:** 1,833 data centers in `public.master_data_centers`, joined to ACS-2024 demographics, USDA RUCA-2020 codes, USGS HUC8 watersheds, and EIA operating-generator capacity (50 km radius, latest period 2026-02, status=OP).
---
## Headline findings
1. **DC tracts are richer, more educated, and more diverse than the US average.** Median household income $103,623 vs. national $78,538 (+32%); 49% bachelor's+ vs. 35% (+14 pp); poverty rate 7.2% vs. 12.4%. Non-Hispanic white share is *below* national (50% vs. 58%), driven by Asian-heavy (mean 13% vs. 6%) and Hispanic-significant tracts.
2. **The metro skew is more modest than expected: 1.11×.** 89% of DCs sit in metropolitan tracts, but 80% of *all* US tracts are metropolitan, so DCs are only slightly more concentrated in metro areas than the underlying tract distribution would predict.
3. **The non-metro tail is overwhelmingly hyperscale and Pacific Northwest.** Of 190 DCs outside metropolitan tracts (RUCA 4–10), AWS owns 67, Meta 22, Microsoft 10, Google 4, and Yahoo 2, a combined 55% of the non-metro footprint. Oregon (86) and Washington (40) alone hold 66% of non-metro DCs, anchored to the Columbia River hydropower corridor.
4. **Clustered DCs are demographically distinct from isolated ones.** DCs in DBSCAN clusters (n=1,583) sit in tracts with $108K median income vs. $73K for isolated DCs (n=250), a $35K gap. Clustered DCs are more educated (+18 pp bachelor's), more diverse (−25 pp non-Hispanic white), and embedded in much denser energy infrastructure (89 vs. 40 generators within 50 km).
5. **Microsoft co-locates with the largest US nuclear plant.** Microsoft's Goodyear, AZ campus has 14.6 GW of generation within 50 km — including 4.2 GW from Palo Verde Nuclear, the largest in the US. Despite the campus being in a RUCA-2 "Metro high-commute" tract (not strictly metro core), the surrounding grid is the densest by capacity in our analysis.
6. **Extreme watershed concentration: nearly half (48.4%) of US DCs sit in just 15 of 2,139 HUC8 watersheds.** A single watershed, Middle Potomac-Catoctin (Loudoun County), holds 235 DCs (12.8% of the US total). The top 2 (both DC-Alley watersheds) hold 18.9%; the top 10 hold 40%. Water stress in any one of these basins would affect a large share of the national DC footprint.
---
## Dataset coverage and joins
| Source table | Rows | Join key | Coverage |
|---|---|---|---|
| `master_data_centers` | 1,833 | base | — |
| `master_data_center_spatial_clusters` | 1,831 | `master_id` | 99.9% |
| `_dc_census_tract_acs_2024` | ~73,000 tracts | `geoid` | 1,807 matched (98.6%) |
| `ruca_codes_2020_tract` | 85,528 tracts | `tract_fips_20 = geoid` | 1,826 matched (99.6%) |
| `watershed_huc8` | 2,139 watersheds | `ST_Contains(w.geom, m.geom)` | 1,831 matched (99.9%) |
| `energy_eia_operating_generator_capacity_flat` | 4.7M rows | `ST_DWithin(geom, 50km)` | 1,831 DCs have ≥1 nearby gen |
Energy aggregation uses period `2026-02` only with `status='OP'`, summing `nameplate_capacity_mw` for operating generators within 50 km of each DC. Note: EIA capacity columns were added to this table on 2026-05-17 — prior to that the `_flat` table had no MW values despite its name.
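
The aggregation described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's code: the real join runs in PostGIS with `ST_DWithin`, and the record keys (`lat`, `lon`, `period`, `status`, `nameplate_capacity_mw`) and function names here are assumptions made for the example. The haversine distance only approximates the geodesic test.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby_capacity_mw(dc_lat, dc_lon, generators, radius_km=50.0,
                       period="2026-02", status="OP"):
    """Return (generator count, summed nameplate MW) within radius_km of a DC."""
    count, total = 0, 0.0
    for g in generators:
        # Keep only the chosen reporting period and operating generators.
        if g["period"] != period or g["status"] != status:
            continue
        # Skip rows that carry no MW value.
        if g["nameplate_capacity_mw"] is None:
            continue
        if haversine_km(dc_lat, dc_lon, g["lat"], g["lon"]) <= radius_km:
            count += 1
            total += g["nameplate_capacity_mw"]
    return count, total
```

Filtering on a single period matters because the flat table holds one row per generator per period; summing across periods would count the same generator many times.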
---
## 1. Demographic profile of DC tracts (n=1,807 with non-null ACS)
| Metric | DC tract (median) | DC tract (mean) | US avg | Δ mean vs. US |
|---|---:|---:|---:|---:|
| Median household income | $103,623 | $114,543 | $78,538 | **+$36,005** |
| Per-capita income | $51,283 | $55,725 | $43,313 | +$12,412 |
| Poverty rate | 7.2% | 10.1% | 12.4% | −2.3 pp |
| Unemployment rate | 3.5% | 4.4% | 5.4% | −1.0 pp |
| Bachelor's+ % | 49.3% | 46.2% | 35.0% | **+11.2 pp** |
| Broadband subscription % | 94.9% | 93.5% | 89.0% | +4.5 pp |
| Non-Hispanic white % | 50.2% | 51.0% | 58.4% | **−7.4 pp** |
| Hispanic / Latino % | 12.8% | 19.5% | 19.5% | 0.0 pp |
| Non-Hispanic Black % | 5.9% | 10.6% | 12.1% | −1.5 pp |
| Non-Hispanic Asian % | 6.4% | 13.4% | 6.4% | **+7.0 pp** |
**Interpretation.** DC tracts skew high-income, highly educated, well connected (broadband), and racially diverse, specifically Asian-heavy. Notably, DC tracts are *less* non-Hispanic white than the national average, not more. This reflects DC siting in racially mixed coastal and exurban tech corridors (Bay Area, Northern Virginia, Seattle) rather than in homogeneous suburbs.
**Data quality note.** `avg_household_size` previously contained ACS sentinel-value pollution (−666,666,666) for 1,089 zero-population tracts in `_dc_census_tract_acs_2024` (29 of which contained DCs) plus 16 rows in `data_center_census_tracts_2024`. As of 2026-05-18, those sentinels have been replaced with `NULL`. The column now has plausible ranges (min 1.00, max 9.33) and a usable mean.
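
The cleanup described above amounts to mapping the sentinel to a missing value before computing any statistic. A minimal sketch, assuming rows are plain dictionaries; the function name and row layout are hypothetical, and the comparison is on magnitude so the sentinel is caught under either sign convention:

```python
ACS_SENTINEL_MAGNITUDE = 666_666_666  # magnitude of the sentinel named in the note


def clean_household_size(rows, column="avg_household_size"):
    """Return copies of rows with the ACS sentinel replaced by None (SQL NULL)."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not mutated
        value = row.get(column)
        if value is not None and abs(value) == ACS_SENTINEL_MAGNITUDE:
            row[column] = None
        cleaned.append(row)
    return cleaned


def mean_ignoring_missing(rows, column="avg_household_size"):
    """Mean of the non-missing values, or None if there are none."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return sum(values) / len(values) if values else None
```

Without this step a single sentinel row dominates the mean, which is why the column was unusable before the fix.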
---
## 2. Geographic concentration (top 15 states)
| State | DC count | Total power_mw (where known) | Median HH income | Median bachelor's % | Median % white | Notes |
|---|---:|---:|---:|---:|---:|---|
| **VA** | **378** | 255 | $141,250 | 62.6% | 42.5% | Loudoun / DC-Alley dominance (20.6% of all US DCs) |
| TX | 162 | 597 | $88,228 | 46.2% | 32.0% | DFW + Austin + San Antonio |
| CA | 147 | 130 | $164,928 | 56.4% | 22.4% | Bay Area + LA basin |
| OR | 145 | 125 | $72,719 | 20.0% | 63.2% | Columbia River hydro corridor (rural) |
| OH | 103 | 135 | $128,875 | 47.0% | 74.5% | Columbus boom — fastest-rising market |
| WA | 93 | 70 | $91,623 | 21.9% | 40.3% | Quincy/Wenatchee + Seattle |
| AZ | 69 | 54 | $85,335 | 35.2% | 51.6% | Phoenix/Goodyear hyperscale |
| IA | 65 | 0 | $93,393 | 34.3% | 88.1% | 88% white (rural Midwest) |
| NJ | 62 | 98 | $147,321 | 59.4% | 32.9% | NYC-metro carrier hotels |
| IL | 61 | 128 | $96,191 | 52.9% | 52.0% | Chicago metro |
| GA | 50 | 241 | $101,176 | 51.4% | 31.6% | Atlanta + high-power rural builds |
| NY | 48 | 47 | $77,465 | 47.6% | 74.8% | NYC + upstate |
| NV | 41 | 0 | $93,409 | 31.2% | 34.6% | Reno + Las Vegas |
| TN | 32 | 0 | — | — | 54.8% | Nashville + Memphis (newly visible after state backfill) |
| NC | 31 | 56 | $82,708 | 44.7% | 59.6% | Charlotte + Catawba (nuclear-adjacent) |
**Virginia alone holds 20.6% of all US DCs** (378 of 1,833), with the most affluent tract profile in the top 15 — a Loudoun County effect. The state backfill substantially elevated **Ohio (76 → 103)** and **Texas (135 → 162)**, pushing TX into the #2 slot. The previously-uncounted **Tennessee (32) now appears in the top 15**.
Oregon and Washington tracts look notably different from the urban-heavy states (lower income, lower education, lower broadband, higher Hispanic share), reflecting their rural Columbia River siting.
---
## 3. Spatially clustered DCs vs. isolated DCs
DBSCAN cluster assignment from `master_data_center_spatial_clusters` (1,583 clustered, 250 isolated):
| Metric (median) | Isolated | In cluster | Δ |
|---|---:|---:|---:|
| Median household income | $73,500 | $108,359 | **+$34,859** |
| Bachelor's+ % | 33.2 | 51.2 | **+18.0 pp** |
| Poverty rate | 11.6 | 6.9 | −4.7 pp |
| Non-Hispanic white % | 71.0 | 45.9 | **−25.1 pp** |
| EIA generators within 50 km | 40 | 89 | +49 |
| EIA capacity within 50 km (MW) | 2,176 | 3,300 | +1,125 |
**Reading.** A clustered data center sits, at the median, in a tract that is ~$35K richer, 18 pp more educated, and 25 pp less non-Hispanic white than an isolated one — and is surrounded by twice as much energy infrastructure (and 50% more generation capacity). The isolated set looks like rural / small-town America (whiter, poorer, less educated); the clustered set looks like coastal exurban tech corridors.
---
## 4. RUCA (urban-rural) distribution
National baseline of all US tracts: 80% Metropolitan, 9% Micropolitan, 3% Small town, 8% Rural.
| RUCA band | DCs | DC % | US tract % | Over-index |
|---|---:|---:|---:|---:|
| Metropolitan (1–3) | 1,636 | 89.3% | 80.1% | **1.11×** |
| Micropolitan (4–6) | 98 | 5.3% | 9.0% | 0.59× |
| Small town (7–9) | 15 | 0.8% | 2.9% | 0.28× |
| Rural (10) | 77 | 4.2% | 7.6% | 0.55× |
| Unknown / missed match | 7 | 0.4% | — | — |
**Reading.** The metro skew is real but mild: 1.11×. The eye-catching pattern is in the non-metro tail: **rural tracts (RUCA 10) hold 77 DCs, about five times the small-town count (15) and not far below the micropolitan count (98)**, because the hyperscale greenfield model deliberately bypasses small-city economies in favor of remote, cheap-power, low-population sites.
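
The over-index column is the DC share in a band divided by the US tract share in that band. A short sketch that reproduces the table's ratios from its own counts and baseline shares (the function and variable names are illustrative, not taken from the notebook):

```python
def over_index(dc_count, dc_total, us_tract_share):
    """Ratio of the DC share in a band to the US tract share in that band."""
    return (dc_count / dc_total) / us_tract_share


DC_TOTAL = 1833  # all data centers, including the 7 with no RUCA match

# (DC count, US tract share) per RUCA band, copied from the table above
bands = {
    "Metropolitan": (1636, 0.801),
    "Micropolitan": (98, 0.090),
    "Small town": (15, 0.029),
    "Rural": (77, 0.076),
}

ratios = {
    name: round(over_index(count, DC_TOTAL, share), 2)
    for name, (count, share) in bands.items()
}
```

A ratio above 1 means the band holds more DCs than its share of tracts would suggest; only the metropolitan band clears that bar.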
### Per-RUCA-code drilldown
| RUCA | Description | DCs | Median HH income | Median pop density | Median EIA gens (50km) |
|---:|---|---:|---:|---:|---:|
| 1 | Metro core | 1,425 | $110,333 | 1,859 / sq mi | 97 |
| 2 | Metro high-commute | 206 | $105,404 | 96 | 49 |
| 3 | Metro low-commute | 5 | $119,495 | 22 | 23 |
| 4 | Micropolitan core | 54 | $63,698 | 312 | 53 |
| 5 | Micropolitan high-commute | 22 | $72,465 | 191 | 51 |
| 6 | Micropolitan low-commute | 22 | $72,719 | 69 | 59 |
| 7 | Small town core | 14 | $87,522 | 2,336 | 40 |
| 8 | Small town high-commute | 1 | $69,074 | 36 | 41 |
| 10 | Rural area | 77 | $93,820 | 12 | 42 |
**Two surprises:**
- Rural DCs (RUCA 10) sit in tracts with $93.8K median income, *higher* than micropolitan DCs ($63.7K–$72.7K). The rural DC sites are not poor rural America; they are wealthy-by-rural-standards counties chosen for power and water access.
- Micropolitan-core DCs (RUCA 4) have the *lowest* median income at $63.7K — the closest thing to "economic-development DC siting" in the dataset.
---
## 5. Non-metro deep dive (190 DCs, RUCA 4–10)
### Operators
| Operator | Non-metro DCs |
|---|---:|
| Amazon Web Services | 67 |
| *(null operator)* | 50 |
| Meta | 20 (+ 2 as "Meta, Inc.") |
| Microsoft | 10 |
| Google | 4 |
| Rowan Green Data | 4 |
| NTT | 2 |
| Yahoo | 2 |
| Amazon AWS *(dupe)* | 2 |
**The five hyperscalers (AWS, Meta, Microsoft, Google, Yahoo) account for 105 of 190 non-metro DCs (55%).** If the 50 null-operator rows skew similarly hyperscale (likely — they're disproportionately in OR/WA), the share is probably closer to 75%.
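
The range implied by the null-operator bucket can be made explicit with a little arithmetic. This sketch uses only the counts in the operator table; treating the null rows as a single fraction that is or is not hyperscale is a simplifying assumption:

```python
NON_METRO_TOTAL = 190
NAMED_HYPERSCALE = 67 + 22 + 10 + 4 + 2  # AWS, Meta, Microsoft, Google, Yahoo
NULL_OPERATOR = 50


def hyperscale_share(null_fraction_hyperscale):
    """Hyperscale share of non-metro DCs if the given fraction of
    null-operator rows turns out to be hyperscale."""
    hyperscale = NAMED_HYPERSCALE + null_fraction_hyperscale * NULL_OPERATOR
    return hyperscale / NON_METRO_TOTAL


lower = hyperscale_share(0.0)  # no null rows are hyperscale
upper = hyperscale_share(1.0)  # every null row is hyperscale

# Fraction of null rows that must be hyperscale to reach a 75% share
needed_for_75 = (0.75 * NON_METRO_TOTAL - NAMED_HYPERSCALE) / NULL_OPERATOR
```

On these counts the share is bounded between about 55% and 82%, and the 75% figure corresponds to three quarters of the null-operator rows being hyperscale.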
### States (post-backfill)
| State | Non-metro DCs |
|---|---:|
| Oregon | 86 |
| Washington | 40 |
| Texas | 9 |
| New Mexico | 7 |
| North Carolina | 6 |
| Pennsylvania | 5 |
| Wisconsin | 4 |
| New York | 3 |
| Tennessee | 3 |
| Georgia | 3 |
**Oregon + Washington = 126 (66%) of all non-metro DCs.** This is the Columbia River basin: Prineville / Hermiston / Boardman / The Dalles (OR) and Quincy / East Wenatchee / Moses Lake (WA). The pull is hydroelectric power (cheap, low-carbon, abundant) and cool dry climate (free-cooling).
The state backfill clarified the rest of the non-metro tail: **Texas (9)** and **Pennsylvania (5)** were previously hidden in the null bucket. These likely represent shale-gas-adjacent builds (Permian and Marcellus respectively).
---
## 6. Energy footprint by operator (using EIA capacity within 50 km)
Aggregated across DCs in RUCA 2–10 (i.e. anything outside dense metro core, n=401):
| Operator | DCs | States | Total nearby capacity (GW) | Median per site (GW) | Hydro (GW) | Nuclear (GW) | NG (GW) | Solar (GW) | Wind (GW) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| AWS | 93 | 5 | 397 | 4.8 | 66 | 2.5 | 201 | 4.6 | **114** |
| *(Unknown)* | 118 | 26 | 339 | 2.3 | 86 | 35 | 135 | 23 | 19 |
| Meta | 51 | 11 | 120 | 2.0 | 4.9 | 0 | 61 | 16 | 0.3 |
| Microsoft | 26 | 6 | 113 | 3.4 | 28 | **13** | 39 | 9.1 | 8.1 |
| Google | 31 | 5 | 100 | 3.9 | 14 | 0 | 43 | 3.6 | 4.7 |
| Apple | 5 | 2 | 4 | 0.6 | 1.6 | 0 | 1.1 | 0.9 | 0.4 |
| Yahoo | 2 | 1 | 7 | 3.5 | 6.4 | 0 | 0 | 0 | 0.7 |
**Distinct hyperscaler strategies, visible in the fuel mix:**
- **AWS** has aggregated 114 GW of *wind* exposure across its 93 sites — by far the most renewable-coupled portfolio. Also heavy hydro (66 GW) from its OR/WA footprint and 201 GW of natural gas as baseline.
- **Microsoft** has the highest *nuclear* exposure (12.6 GW) — almost entirely from its Goodyear, AZ campuses near Palo Verde Nuclear.
- **Meta** has the most *solar* (16 GW) among the named hyperscalers, but minimal nuclear or wind — consistent with its New Mexico (Los Lunas) and Iowa builds.
- **Google** is split — moderate hydro and natural gas, modest renewables.
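
Fuel-mix comparisons are easier to read as shares of each operator's total nearby capacity than as absolute GW. The sketch below computes those shares from the table values. One caveat, stated as an assumption rather than a checked fact: because sites close to each other fall within 50 km of the same plants, totals summed across sites probably count some generators more than once, so shares are likely more comparable across operators than raw GW.

```python
# (total nearby GW, GW by fuel) copied from the operator table above
portfolios = {
    "AWS": (397, {"hydro": 66, "nuclear": 2.5, "gas": 201, "solar": 4.6, "wind": 114}),
    "Meta": (120, {"hydro": 4.9, "nuclear": 0, "gas": 61, "solar": 16, "wind": 0.3}),
    "Microsoft": (113, {"hydro": 28, "nuclear": 13, "gas": 39, "solar": 9.1, "wind": 8.1}),
    "Google": (100, {"hydro": 14, "nuclear": 0, "gas": 43, "solar": 3.6, "wind": 4.7}),
}


def fuel_shares(total_gw, by_fuel):
    """Each fuel's share of an operator's total nearby capacity."""
    return {fuel: gw / total_gw for fuel, gw in by_fuel.items()}


def leading_fuel(by_fuel):
    """Fuel with the largest nearby capacity."""
    return max(by_fuel, key=by_fuel.get)


shares = {op: fuel_shares(total, fuels) for op, (total, fuels) in portfolios.items()}
leaders = {op: leading_fuel(fuels) for op, (_, fuels) in portfolios.items()}
```

The listed fuels do not sum to the total column because other fuel types (coal, for example) are not broken out in the table.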
### Largest grid neighborhoods outside the metro core (top sites by surrounding capacity)
| DC | Operator | Location | Nearby capacity | Fuel highlight |
|---|---|---|---:|---|
| PHX70 / PHX-10 / PHX-11 | Microsoft (Azure) | Goodyear, AZ (RUCA 2) | 14.0–14.6 GW | **4.2 GW nuclear (Palo Verde)** + 6.4 GW gas + 2.2 GW solar |
| Stream PHX-1 | Stream Data Centers | Goodyear, AZ | 13.4 GW | Same Palo Verde / gas mix |
| T5 Charlotte Campus | T5 | Kings Mountain, NC (RUCA 6) | 12.9 GW | **4.9 GW nuclear** (Catawba) + 5.5 GW gas + 1.5 GW coal |
| Apple Maiden | Apple | Maiden, NC (RUCA 2) | 9.1 GW | 2.4 GW nuclear + 4.6 GW gas |
| Percheron DC | Rowan Green Data | (Texas, RUCA 10) | 6.7 GW | **3.0 GW wind** + 0.9 GW hydro + 2.4 GW gas |
---
## 7. Watershed (HUC8) concentration
Each DC sits in exactly one USGS HUC8 watershed (8-digit hydrologic unit, subbasin scale, median ~3,000 sq km). Cooling-water draw and wastewater discharge happen at watershed scale, not state scale — a single stressed basin can cap an entire DC corridor regardless of how big the state's overall water budget is.
### Where the 1,831 matched DCs land
- **257 distinct HUC8s** hold at least one DC — that's only **12% of the 2,139 US watersheds** (the other 88% have zero data centers).
- **The top 1 watershed alone (Middle Potomac-Catoctin) holds 235 DCs** — 12.8% of the entire US data-center footprint.
- DC concentration is much more extreme at the watershed level than at the state level. Virginia has 20.6% of US DCs; the single Loudoun watershed holds 12.8%.
### Cumulative concentration
| Top N watersheds | DCs | Share of matched DCs (n=1,831) |
|---:|---:|---:|
| 1 | 235 | 12.8% |
| 2 | 346 | 18.9% |
| 3 | 434 | 23.7% |
| 5 | 551 | 30.1% |
| **10** | **736** | **40.2%** |
| **15** | **887** | **48.4%** |
| 20 | 1,012 | 55.3% |
| 30 | 1,186 | 64.8% |
| 50 | 1,380 | 75.4% |
| 100 | 1,611 | 88.0% |
**Nearly half of all US data centers (48.4%) sit in just 15 watersheds,** and three-quarters sit in 50. Water stress, drought policy, or thermal-discharge limits in any one of these basins would affect a large share of the national footprint.
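
The cumulative table is a running sum over watersheds ranked by DC count, divided by the number of DCs matched to a HUC8. A minimal sketch using the five largest per-watershed counts from this report (the function name is illustrative):

```python
from itertools import accumulate


def cumulative_share(counts, total):
    """Cumulative share held by the top-N groups, for N = 1..len(counts)."""
    ranked = sorted(counts, reverse=True)
    return [running / total for running in accumulate(ranked)]


MATCHED_DCS = 1831  # DCs that matched a HUC8 watershed

# DC counts for the five largest watersheds, as listed in this report
top5_counts = [235, 111, 88, 73, 44]

shares = cumulative_share(top5_counts, MATCHED_DCS)
percentages = [round(s * 100, 1) for s in shares]
```

The resulting top-1, top-2, top-3, and top-5 values agree with the 12.8%, 18.9%, 23.7%, and 30.1% rows of the table when the denominator is the 1,831 matched DCs.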
### Top 15 watersheds by DC count
| HUC8 | Name | States | DCs | Cluster |
|---|---|---|---:|---|
| 02070008 | Middle Potomac-Catoctin | DC, MD, VA, WV | 235 | Loudoun / Ashburn (DC-Alley) |
| 02070010 | Middle Potomac-Anacostia-Occoquan | DC, MD, VA | 111 | Fairfax + inner Loudoun + DC |
| 18050003 | Coyote | CA | 88 | Silicon Valley / San Jose |
| 05060001 | Upper Scioto | OH | 73 | Columbus (fastest-growing market) |
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | Boardman / Hermiston (hyperscale hydro) |
| 17020015 | Lower Crab | WA | 40 | Quincy / Moses Lake (hyperscale hydro) |
| 17090010 | Tualatin | OR | 39 | Hillsboro (Intel / Google) |
| 12030105 | Upper Trinity | TX | 37 | DFW |
| 10230006 | Big Papillion-Mosquito | IA, NE | 36 | Council Bluffs / Omaha (Meta) |
| 07120004 | Des Plaines | IL, WI | 33 | Chicago metro |
| 12100302 | Medina | TX | 32 | San Antonio |
| 02030105 | Raritan | NJ | 31 | Central NJ carrier hotels |
| 15050100 | Middle Gila | AZ | 30 | Phoenix metro |
| 02030103 | Hackensack-Passaic | NJ, NY | 29 | NYC metro east |
| 17070103 | Umatilla | OR | 29 | AWS-only (all 29 DCs) — pure single-operator basin |
The non-metro / hyperscale Pacific Northwest story is visible at watershed scale: three Columbia-system watersheds (**Middle Columbia-Lake Wallula, Lower Crab, Umatilla**) hold 113 DCs combined, all hyperscale-dominated. Umatilla is operationally **AWS-exclusive** — all 29 DCs in that basin are AWS.
### Non-metro watersheds (RUCA 4–10): where hyperscalers cluster
| HUC8 | Name | States | DCs | Operators |
|---|---|---|---:|---|
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | AWS (multiple variants), Rowan Green Data |
| 17020015 | Lower Crab | WA | 40 | CyrusOne, Intuit, Microsoft, NTT, Sabey, Yahoo |
| 17070103 | Umatilla | OR | 29 | AWS only |
| 17070305 | Lower Crooked | OR | 8 | Meta (Prineville) |
| 13020203 | Rio Grande-Albuquerque | NM | 7 | Meta (Los Lunas) |
| 03050105 | Upper Broad | NC, SC | 6 | Meta |
| 13070001 | Lower Pecos-Red Bluff Reservoir | NM, TX | 5 | IONIC Digital |
| 17070105 | Middle Columbia-Hood | OR, WA | 4 | Google (The Dalles) |
| 02050107 | Upper Susquehanna-Lackawanna | PA | 3 | AWS |
| 03070103 | Upper Ocmulgee | GA | 2 | Meta |
This view is the cleanest evidence yet of the *hyperscale geographic strategy* — single-operator capture of individual watersheds (Umatilla = AWS, Lower Crooked = Meta, Middle Columbia-Hood = Google, Rio Grande-Albuquerque = Meta). Each of these basins has effectively been claimed by one player.
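
Single-operator capture can be tested mechanically by computing each basin's dominant operator and its share of that basin's DCs. The sketch below is one way to do it; the 90% threshold, the `(huc8, operator)` record shape, and the function name are assumptions for illustration, and operator strings should be normalized first so variants of the same company are not split.

```python
from collections import Counter, defaultdict

UNKNOWN = "(unknown)"


def captured_basins(records, min_share=0.9):
    """Map each HUC8 to (operator, share) when one named operator holds
    at least min_share of the DCs in that watershed."""
    by_basin = defaultdict(Counter)
    for huc8, operator in records:
        by_basin[huc8][operator or UNKNOWN] += 1

    captured = {}
    for huc8, operators in by_basin.items():
        operator, count = operators.most_common(1)[0]
        share = count / sum(operators.values())
        # Null-operator rows count toward the denominator but cannot "capture" a basin.
        if operator != UNKNOWN and share >= min_share:
            captured[huc8] = (operator, share)
    return captured
```

Keeping null-operator rows in the denominator is the conservative choice: a basin with many unattributed DCs is not reported as captured until those rows are resolved.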
### Implications for water-stress analysis
This watershed view is a **boundary set** for downstream water-stress analysis. Pull WaterWatch streamflow data, USGS water-use estimates, or EPA drought indicators against just these 257 HUC8s (or against just the top 15 for the highest-leverage story). A single-pull stress index against this set would size the "water exposure" of the entire US DC fleet.
---
## Data quality flags
1. **`master_data_centers.power_mw` is populated for only 108 / 1,833 DCs (5.9%).** Useless as a sizing metric without imputation or alternative source. Nearby EIA capacity is the more reliable proxy (used as the per-DC scale in this analysis). A grant-funded scrape of Baxtel / Data Center Map would close this gap.
2. **50 of 190 non-metro DCs (26%) have null `operator`.** Likely hyperscalers based on geography (OR/WA) but unconfirmed.
3. **Operator-string fragmentation**: "Meta" vs. "Meta, Inc."; "Amazon Web Services" vs. "Amazon AWS" vs. "amazon web services"; "Microsoft" vs. "Microsoft Azure". Inflates distinct-operator counts and fragments per-operator totals.
4. ~~`avg_household_size` column has sentinel pollution~~ **Resolved 2026-05-18:** 1,089 sentinel values (−666,666,666) in `_dc_census_tract_acs_2024` and 16 in `data_center_census_tracts_2024` replaced with `NULL`. Affected 29 DCs.
5. **7 DCs failed RUCA join** — Puerto Rico tracts or non-US locations; trivial.
6. **EIA generator coordinates had a longitude sign error for 2008-01 through 2010-11** (~11K rows with positive lower-48 longitudes). The flat-table build at [ingest_eia_energy_layers.py:839-870](../ingest_eia_energy_layers.py#L839-L870) corrects this in `longitude` and `geom`, so spatial joins are unaffected.
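
For the longitude sign error in item 6, a correction of this kind can be sketched as follows. This is not the code in `ingest_eia_energy_layers.py`, which is not reproduced here; the function name and the 66 to 125 degree band (an approximate lower-48 longitude range) are assumptions chosen so that legitimately positive longitudes elsewhere are left alone.

```python
def fix_longitude_sign(lon, period, start="2008-01", end="2010-11"):
    """Negate a positive longitude that looks like a mis-signed lower-48
    value, but only for reporting periods inside the affected window."""
    if lon is None:
        return None
    # "YYYY-MM" strings sort chronologically, so string comparison is safe here.
    in_window = start <= period <= end
    looks_like_lower_48 = 66.0 <= lon <= 125.0
    if in_window and looks_like_lower_48:
        return -lon
    return lon
```

Restricting the fix to the affected periods keeps it from touching later rows whose coordinates were already correct.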
---
## Suggested next steps
1. **Backfill `power_mw`** from Baxtel / Data Center Map (paid scrape — grant work).
2. **Operator-string deduplication** — collapse "Meta"/"Meta, Inc.", "AWS" variants, etc., before any per-operator analysis.
3. **Water-stress overlay against the 257 watersheds** — now that the HUC8 join is in place, pull USGS WaterWatch streamflow data, USGS water-use estimates, or EPA drought-status indicators against this watershed set. A single stress index per HUC8 would size the entire US fleet's water exposure.
4. **State-level energy demand context**: `im3_state_projected_moderate_50` and `seds_state_msn_year` are loaded; joining these would let us compute "DC nearby capacity as share of state grid" rather than absolute MW.
5. **Opposition cases overlay**: `opposition_cases_geocoded` is loaded but unused; could test whether cluster-vs-isolated demographic differences (or watershed concentration) predict community opposition.
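
As a starting point for the operator-string deduplication in step 2, the variants named in the data quality flags can be collapsed with a small alias table. This is a hedged sketch: the alias table is hypothetical and covers only the variants mentioned in this report, and whether "Microsoft Azure" should merge into "Microsoft" is a judgment call to confirm before use.

```python
import re

# Hypothetical alias table; extend after reviewing the distinct operator strings.
ALIASES = {
    "meta": "Meta",
    "meta inc": "Meta",
    "amazon web services": "Amazon Web Services",
    "amazon aws": "Amazon Web Services",
    "aws": "Amazon Web Services",
    "microsoft": "Microsoft",
    "microsoft azure": "Microsoft",
}


def normalize_operator(raw):
    """Return a canonical operator name, or None for null or blank input.
    Strings with no alias are returned unchanged apart from outer whitespace."""
    if raw is None or not raw.strip():
        return None
    key = re.sub(r"[^a-z0-9 ]", "", raw.lower())  # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()        # collapse whitespace
    return ALIASES.get(key, raw.strip())
```

Applying this before any per-operator grouping would fold "Meta, Inc." into "Meta" and the lowercase and "Amazon AWS" variants into "Amazon Web Services", while leaving unlisted operators untouched.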