data-centers/output/data_center_demographic_ruca_energy_summary.md
dadams eccfbdbad9 Add data center demographic, RUCA, and energy capacity analysis
Adds three coordinated changes:

- Request nameplate, summer, and winter capacity from the EIA
  operating-generator-capacity endpoint and project them as typed columns
  on energy_eia_operating_generator_capacity_flat. The original ingest
  only pulled latitude and longitude, leaving the flat table with no MW
  values despite its name.
- New cluster_analysis.ipynb joins master_data_centers to ACS-2024
  demographics, USDA RUCA-2020 codes (loaded from new/), and EIA
  generation capacity within 50 km of each site.
- Summary doc consolidates the headline findings: DC tracts skew higher
  income / more educated / more racially diverse than US average, the
  metro over-index is only 1.11x, the non-metro tail is dominated by
  hyperscalers in the Columbia River corridor (OR+WA = 66% of non-metro
  DCs), and Microsoft co-locates with Palo Verde Nuclear in Goodyear AZ.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 08:14:57 -07:00

# US Data Centers — Demographic, Urban-Rural & Energy Context Analysis
**Date:** 2026-05-18
**Notebook:** [cluster_analysis.ipynb](../cluster_analysis.ipynb)
**Universe:** 1,833 data centers in `public.master_data_centers`, joined to ACS-2024 demographics, USDA RUCA-2020 codes, and EIA operating-generator capacity (50 km radius, latest period 2026-02, status=OP).
> **Update 2026-05-18**: 196 previously-null `state` values were backfilled from `geoid` (first 2 chars = state FIPS). All 1,833 DCs now have a state; all state-level numbers below reflect the corrected attribution.
---
## Headline findings
1. **DC tracts are richer, more educated, and more diverse than the US average.** Median household income $103,623 vs. national $78,538 (+32%); 49% bachelor's+ vs. 35% (+14 pp); poverty rate 7.2% vs. 12.4%. Non-Hispanic white share is *below* national (50% vs. 58%), driven mainly by a higher Asian share (mean 13% vs. 6%); the mean Hispanic share matches the national 19.5%.
2. **The metro skew is more modest than expected: 1.11×.** 89% of DCs sit in metropolitan tracts, but 80% of *all* US tracts are metropolitan, so DCs are only slightly more concentrated in metro areas than the national distribution of tracts would predict.
3. **The non-metro tail is overwhelmingly hyperscale and Pacific Northwest.** Of 190 DCs outside metropolitan tracts (RUCA 4–10), AWS owns 67, Meta 22, Microsoft 10, Google 4, Yahoo 2: combined, 55% of the non-metro footprint. Oregon (86) and Washington (40) alone hold 66% of non-metro DCs, anchored to the Columbia River hydropower corridor.
4. **Clustered DCs are demographically distinct from isolated ones.** DCs in DBSCAN clusters (n=1,583) sit in tracts with $108K median income vs. $73K for isolated DCs (n=250), a $35K gap. Clustered DCs are more educated (+18 pp bachelor's), more diverse (−25 pp non-Hispanic white), and embedded in much denser energy infrastructure (89 vs. 40 generators within 50 km).
5. **Microsoft co-locates with the largest US nuclear plant.** Microsoft's Goodyear, AZ campus has 14.6 GW of generation within 50 km — including 4.2 GW from Palo Verde Nuclear, the largest in the US. Despite the campus being in a RUCA-2 "Metro high-commute" tract (not strictly metro core), the surrounding grid is the densest by capacity in our analysis.
---
## Dataset coverage and joins
| Source table | Rows | Join key | Coverage |
|---|---|---|---|
| `master_data_centers` | 1,833 | base | — |
| `master_data_center_spatial_clusters` | 1,831 | `master_id` | 99.9% |
| `_dc_census_tract_acs_2024` | ~73,000 tracts | `geoid` | 1,807 matched (98.6%) |
| `ruca_codes_2020_tract` | 85,528 tracts | `tract_fips_20 = geoid` | 1,826 matched (99.6%) |
| `energy_eia_operating_generator_capacity_flat` | 4.7M rows | `ST_DWithin(geom, 50km)` | 1,831 DCs have ≥1 nearby gen |
Energy aggregation uses period `2026-02` only with `status='OP'`, summing `nameplate_capacity_mw` for operating generators within 50 km of each DC. Note: EIA capacity columns were added to this table on 2026-05-17 — prior to that the `_flat` table had no MW values despite its name.
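
The aggregation rule above can be sketched in plain Python. This is only an illustration under stated assumptions: the join in the coverage table uses PostGIS `ST_DWithin`, whereas this sketch substitutes a haversine great-circle distance, and the names `Generator`, `haversine_km`, and `nearby_capacity_mw` are hypothetical, not taken from the notebook.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass
class Generator:
    """Hypothetical stand-in for one row of the flat EIA capacity table."""
    lat: float
    lon: float
    period: str   # e.g. "2026-02"
    status: str   # "OP" means operating
    nameplate_capacity_mw: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby_capacity_mw(dc_lat, dc_lon, generators, radius_km=50.0, period="2026-02"):
    """Return (generator count, summed nameplate MW) for operating
    generators of the given period that lie within radius_km of the DC."""
    hits = [
        g for g in generators
        if g.period == period
        and g.status == "OP"
        and haversine_km(dc_lat, dc_lon, g.lat, g.lon) <= radius_km
    ]
    return len(hits), sum(g.nameplate_capacity_mw for g in hits)
```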
---
## 1. Demographic profile of DC tracts (n=1,807 with non-null ACS)
| Metric | DC tract (median) | DC tract (mean) | US avg | Δ mean vs. US |
|---|---:|---:|---:|---:|
| Median household income | $103,623 | $114,543 | $78,538 | **+$36,005** |
| Per-capita income | $51,283 | $55,725 | $43,313 | +$12,412 |
| Poverty rate | 7.2% | 10.1% | 12.4% | −2.3 pp |
| Unemployment rate | 3.5% | 4.4% | 5.4% | −1.0 pp |
| Bachelor's+ % | 49.3% | 46.2% | 35.0% | **+11.2 pp** |
| Broadband subscription % | 94.9% | 93.5% | 89.0% | +4.5 pp |
| Non-Hispanic white % | 50.2% | 51.0% | 58.4% | **−7.4 pp** |
| Hispanic / Latino % | 12.8% | 19.5% | 19.5% | 0.0 pp |
| Non-Hispanic Black % | 5.9% | 10.6% | 12.1% | −1.5 pp |
| Non-Hispanic Asian % | 6.4% | 13.4% | 6.4% | **+7.0 pp** |
**Interpretation.** DC tracts skew high-income, highly educated, well connected (broadband), and racially diverse, with a notably high Asian share. DC tracts are *less* non-Hispanic white than the national average, not more. This reflects DC siting in racially mixed coastal and exurban tech corridors (Bay Area, Northern Virginia, Seattle) rather than in homogeneous suburbs.
**Data quality note.** `avg_household_size` contains sentinel-value pollution (min: 666,666,666), so the mean is unusable; the median (2.55) is sensible.
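
To illustrate why the sentinel wrecks the mean but not the median, here is a small sketch on made-up values. The cutoff of 100 persons per household is an assumed plausibility bound chosen for this example, and `clean_household_sizes` is a hypothetical helper, not something from the notebook.

```python
from statistics import mean, median

# Assumed plausibility bound: no real tract averages 100+ people per household.
SENTINEL_CUTOFF = 100


def clean_household_sizes(values):
    """Drop missing values and implausible sentinel codes."""
    return [v for v in values if v is not None and 0 < v < SENTINEL_CUTOFF]


# Made-up tract values plus one sentinel of the size seen in the column.
sizes = [2.1, 2.4, 2.55, 2.7, 3.0, 666_666_666.0]

raw_mean = mean(sizes)      # dominated by the single sentinel
raw_median = median(sizes)  # barely moved by it
clean_mean = mean(clean_household_sizes(sizes))  # usable again after filtering
```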
---
## 2. Geographic concentration (top 15 states)
| State | DC count | Total power_mw (where known) | Median HH income | Median bachelor's % | Median % white | Notes |
|---|---:|---:|---:|---:|---:|---|
| **VA** | **378** | 255 | $141,250 | 62.6% | 42.5% | Loudoun / DC-Alley dominance (20.6% of all US DCs) |
| TX | 162 | 597 | $88,228 | 46.2% | 32.0% | DFW + Austin + San Antonio |
| CA | 147 | 130 | $164,928 | 56.4% | 22.4% | Bay Area + LA basin |
| OR | 145 | 125 | $72,719 | 20.0% | 63.2% | Columbia River hydro corridor (rural) |
| OH | 103 | 135 | $128,875 | 47.0% | 74.5% | Columbus boom — fastest-rising market |
| WA | 93 | 70 | $91,623 | 21.9% | 40.3% | Quincy/Wenatchee + Seattle |
| AZ | 69 | 54 | $85,335 | 35.2% | 51.6% | Phoenix/Goodyear hyperscale |
| IA | 65 | 0 | $93,393 | 34.3% | 88.1% | 88% white (rural Midwest) |
| NJ | 62 | 98 | $147,321 | 59.4% | 32.9% | NYC-metro carrier hotels |
| IL | 61 | 128 | $96,191 | 52.9% | 52.0% | Chicago metro |
| GA | 50 | 241 | $101,176 | 51.4% | 31.6% | Atlanta + high-power rural builds |
| NY | 48 | 47 | $77,465 | 47.6% | 74.8% | NYC + upstate |
| NV | 41 | 0 | $93,409 | 31.2% | 34.6% | Reno + Las Vegas |
| TN | 32 | 0 | — | — | 54.8% | Nashville + Memphis (newly visible after state backfill) |
| NC | 31 | 56 | $82,708 | 44.7% | 59.6% | Charlotte + Catawba (nuclear-adjacent) |
**Virginia alone holds 20.6% of all US DCs** (378 of 1,833), with the most affluent tract profile in the top 15 — a Loudoun County effect. The state backfill substantially elevated **Ohio (76 → 103)** and **Texas (135 → 162)**, pushing TX into the #2 slot. The previously-uncounted **Tennessee (32) now appears in the top 15**.
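
The state backfill behind these revised counts takes the first two characters of a tract `geoid` as the state FIPS code. A minimal sketch, assuming `geoid` is stored as a zero-padded string; the lookup table here is deliberately partial and `backfill_state` is a hypothetical helper name:

```python
from typing import Optional

# Partial state-FIPS lookup for illustration only; a real backfill
# needs the full Census list of state FIPS codes.
FIPS_TO_STATE = {
    "04": "AZ", "06": "CA", "39": "OH", "41": "OR",
    "47": "TN", "48": "TX", "51": "VA", "53": "WA",
}


def backfill_state(state: Optional[str], geoid: Optional[str]) -> Optional[str]:
    """Keep an existing state; otherwise derive it from the geoid prefix."""
    if state:  # any non-empty existing value wins
        return state
    if not geoid or len(geoid) < 2:
        return None
    return FIPS_TO_STATE.get(geoid[:2])
```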
Oregon and Washington tracts look notably different from the urban-heavy states (lower income, lower education, lower broadband, higher Hispanic share), reflecting their rural Columbia River siting.
---
## 3. Spatially clustered DCs vs. isolated DCs
DBSCAN cluster assignment from `master_data_center_spatial_clusters` (1,583 clustered, 250 isolated):
| Metric (median) | Isolated | In cluster | Δ |
|---|---:|---:|---:|
| Median household income | $73,500 | $108,359 | **+$34,859** |
| Bachelor's+ % | 33.2 | 51.2 | **+18.0 pp** |
| Poverty rate | 11.6 | 6.9 | −4.7 pp |
| Non-Hispanic white % | 71.0 | 45.9 | **−25.1 pp** |
| EIA generators within 50 km | 40 | 89 | +49 |
| EIA capacity within 50 km (MW) | 2,176 | 3,300 | +1,125 |
**Reading.** A clustered data center sits, at the median, in a tract that is ~$35K richer, 18 pp more educated, and 25 pp less non-Hispanic white than an isolated one, and is surrounded by more than twice as many generators (89 vs. 40) and about 50% more generation capacity. The isolated set looks like rural / small-town America (whiter, poorer, less educated); the clustered set looks like coastal exurban tech corridors.
---
## 4. RUCA (urban-rural) distribution
National baseline of all US tracts: 80% Metropolitan, 9% Micropolitan, 3% Small town, 8% Rural.
| RUCA band | DCs | DC % | US tract % | Over-index |
|---|---:|---:|---:|---:|
| Metropolitan (1–3) | 1,636 | 89.3% | 80.1% | **1.11×** |
| Micropolitan (4–6) | 98 | 5.3% | 9.0% | 0.59× |
| Small town (7–9) | 15 | 0.8% | 2.9% | 0.28× |
| Rural (10) | 77 | 4.2% | 7.6% | 0.55× |
| Unknown / missed match | 7 | 0.4% | — | — |
**Reading.** The metro skew is real but only mild: 1.11×. The eye-catching pattern is that **rural tracts (RUCA 10) hold five times as many DCs as small-town tracts (77 vs. 15) and more than any single micropolitan code** (the largest, RUCA 4, has 54), consistent with a hyperscale greenfield model that favors remote, cheap-power, low-population sites over small-city economies.
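
The over-index column is the DC share of a band divided by that band's share of all US tracts. A short sketch that recomputes it from the counts and baseline shares in the table above (the function name `over_index` is chosen here for illustration):

```python
def over_index(dc_count, dc_total, us_share_pct):
    """DC share of a band divided by the band's share of all US tracts."""
    dc_share_pct = 100 * dc_count / dc_total
    return dc_share_pct / us_share_pct


DC_TOTAL = 1833

# band: (DC count, US tract share in %), copied from the table above
bands = {
    "Metropolitan": (1636, 80.1),
    "Micropolitan": (98, 9.0),
    "Small town": (15, 2.9),
    "Rural": (77, 7.6),
}

ratios = {
    name: round(over_index(count, DC_TOTAL, us_pct), 2)
    for name, (count, us_pct) in bands.items()
}
```

A ratio of 1.0 would mean DCs are distributed exactly like tracts; values below 1.0 mark bands with fewer DCs than their tract share would suggest.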
### Per-RUCA-code drilldown
| RUCA | Description | DCs | Median HH income | Median pop density (per sq mi) | Median EIA gens (50 km) |
|---:|---|---:|---:|---:|---:|
| 1 | Metro core | 1,425 | $110,333 | 1,859 | 97 |
| 2 | Metro high-commute | 206 | $105,404 | 96 | 49 |
| 3 | Metro low-commute | 5 | $119,495 | 22 | 23 |
| 4 | Micropolitan core | 54 | $63,698 | 312 | 53 |
| 5 | Micropolitan high-commute | 22 | $72,465 | 191 | 51 |
| 6 | Micropolitan low-commute | 22 | $72,719 | 69 | 59 |
| 7 | Small town core | 14 | $87,522 | 2,336 | 40 |
| 8 | Small town high-commute | 1 | $69,074 | 36 | 41 |
| 10 | Rural area | 77 | $93,820 | 12 | 42 |
**Two surprises:**
- Rural DCs (RUCA 10) sit in tracts with $93.8K median income, *higher* than micropolitan DCs ($63.7K–$72.7K). The rural DC sites are not poor rural America; they are wealthy-by-rural-standards counties chosen for power and water access.
- Micropolitan-core DCs (RUCA 4) have the *lowest* median income at $63.7K — the closest thing to "economic-development DC siting" in the dataset.
---
## 5. Non-metro deep dive (190 DCs, RUCA 4–10)
### Operators
| Operator | Non-metro DCs |
|---|---:|
| Amazon Web Services | 67 |
| *(null operator)* | 50 |
| Meta | 20 (+ 2 as "Meta, Inc.") |
| Microsoft | 10 |
| Google | 4 |
| Rowan Green Data | 4 |
| NTT | 2 |
| Yahoo | 2 |
| Amazon AWS *(dupe)* | 2 |
**The five hyperscalers (AWS, Meta, Microsoft, Google, Yahoo) account for 105 of 190 non-metro DCs (55%).** If the 50 null-operator rows skew similarly hyperscale (likely — they're disproportionately in OR/WA), the share is probably closer to 75%.
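
The 55% figure and the "closer to 75%" estimate can be bracketed with simple arithmetic on the counts from the operator table. The sketch below only bounds the share; it makes no claim about which null-operator rows are actually hyperscale.

```python
NON_METRO_TOTAL = 190
NAMED_HYPERSCALE = 67 + 22 + 10 + 4 + 2  # AWS, Meta, Microsoft, Google, Yahoo
NULL_OPERATOR = 50


def share_pct(count, total=NON_METRO_TOTAL):
    """Percentage of non-metro DCs represented by count."""
    return 100 * count / total


lower = share_pct(NAMED_HYPERSCALE)                  # no null rows are hyperscale
upper = share_pct(NAMED_HYPERSCALE + NULL_OPERATOR)  # every null row is hyperscale

# Number of null-operator rows that would have to be hyperscale
# for the overall share to reach 75%.
needed_for_75 = 0.75 * NON_METRO_TOTAL - NAMED_HYPERSCALE
```

Under these counts the share lies between roughly 55% and 82%, and a 75% share corresponds to about 38 of the 50 null-operator rows being hyperscale.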
### States (post-backfill)
| State | Non-metro DCs |
|---|---:|
| Oregon | 86 |
| Washington | 40 |
| Texas | 9 |
| New Mexico | 7 |
| North Carolina | 6 |
| Pennsylvania | 5 |
| Wisconsin | 4 |
| New York | 3 |
| Tennessee | 3 |
| Georgia | 3 |
**Oregon + Washington = 126 (66%) of all non-metro DCs.** This is the Columbia River basin: Prineville / Hermiston / Boardman / The Dalles (OR) and Quincy / East Wenatchee / Moses Lake (WA). The pull is hydroelectric power (cheap, low-carbon, abundant) and cool dry climate (free-cooling).
The state backfill clarified the rest of the non-metro tail: **Texas (9)** and **Pennsylvania (5)** were previously hidden in the null bucket. These likely represent shale-gas-adjacent builds (Permian and Marcellus respectively).
---
## 6. Energy footprint by operator (using EIA capacity within 50 km)
Aggregated across DCs in RUCA 2–10 (i.e. anything outside dense metro core, n=401):
| Operator | DCs | States | Total nearby capacity (GW) | Median per site (GW) | Hydro (GW) | Nuclear (GW) | NG (GW) | Solar (GW) | Wind (GW) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| AWS | 93 | 5 | 397 | 4.8 | 66 | 2.5 | 201 | 4.6 | **114** |
| *(Unknown)* | 118 | 26 | 339 | 2.3 | 86 | 35 | 135 | 23 | 19 |
| Meta | 51 | 11 | 120 | 2.0 | 4.9 | 0 | 61 | 16 | 0.3 |
| Microsoft | 26 | 6 | 113 | 3.4 | 28 | **13** | 39 | 9.1 | 8.1 |
| Google | 31 | 5 | 100 | 3.9 | 14 | 0 | 43 | 3.6 | 4.7 |
| Apple | 5 | 2 | 4 | 0.6 | 1.6 | 0 | 1.1 | 0.9 | 0.4 |
| Yahoo | 2 | 1 | 7 | 3.5 | 6.4 | 0 | 0 | 0 | 0.7 |
**Distinct hyperscaler strategies, visible in the fuel mix:**
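- **AWS** has 114 GW of summed *wind* exposure across its 93 sites, by far the most renewable-coupled portfolio. It also has heavy hydro (66 GW) from its OR/WA footprint and 201 GW of natural gas as baseline. These totals appear to sum each site's 50 km neighborhood, so a generator near several sites would be counted once per site.
- **Microsoft** has the highest *nuclear* exposure (12.6 GW), almost entirely from its Goodyear, AZ campuses near Palo Verde Nuclear; the figure is consistent with Palo Verde's 4.2 GW being counted once for each of three Goodyear campuses.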
- **Meta** has the most *solar* (16 GW) among the named hyperscalers, but minimal nuclear or wind — consistent with its New Mexico (Los Lunas) and Iowa builds.
- **Google** is split — moderate hydro and natural gas, modest renewables.
### Largest grid neighborhoods outside the metro core (top sites by surrounding capacity)
| DC | Operator | Location | Nearby capacity | Fuel highlight |
|---|---|---|---:|---|
| PHX70 / PHX-10 / PHX-11 | Microsoft (Azure) | Goodyear, AZ (RUCA 2) | 14.0–14.6 GW | **4.2 GW nuclear (Palo Verde)** + 6.4 GW gas + 2.2 GW solar |
| Stream PHX-1 | Stream Data Centers | Goodyear, AZ | 13.4 GW | Same Palo Verde / gas mix |
| T5 Charlotte Campus | T5 | Kings Mountain, NC (RUCA 6) | 12.9 GW | **4.9 GW nuclear** (Catawba) + 5.5 GW gas + 1.5 GW coal |
| Apple Maiden | Apple | Maiden, NC (RUCA 2) | 9.1 GW | 2.4 GW nuclear + 4.6 GW gas |
| Percheron DC | Rowan Green Data | (Texas, RUCA 10) | 6.7 GW | **3.0 GW wind** + 0.9 GW hydro + 2.4 GW gas |
---
## Data quality flags
1. ~~196 of 1,833 DCs (10.7%) have null `state`~~ **Resolved 2026-05-18** by backfilling from `geoid` first-2-chars (state FIPS).
2. **`master_data_centers.power_mw` is populated for only 108 / 1,833 DCs (5.9%).** Useless as a sizing metric without imputation or alternative source. Nearby EIA capacity is the more reliable proxy (used as the per-DC scale in this analysis). A grant-funded scrape of Baxtel / Data Center Map would close this gap.
3. **50 of 190 non-metro DCs (26%) have null `operator`.** Likely hyperscalers based on geography (OR/WA) but unconfirmed.
4. **Operator-string fragmentation**: "Meta" vs. "Meta, Inc."; "Amazon Web Services" vs. "Amazon AWS" vs. "amazon web services"; "Microsoft" vs. "Microsoft Azure". Inflates distinct-operator counts and fragments per-operator totals.
5. **`avg_household_size` column has sentinel pollution** (min: 666,666,666). Use median or filter before using.
6. **7 DCs failed RUCA join** — Puerto Rico tracts or non-US locations; trivial.
7. **EIA generator coordinates had a longitude sign error for 2008-01 through 2010-11** (~11K rows with positive lower-48 longitudes). The flat-table build at [ingest_eia_energy_layers.py:839-870](../ingest_eia_energy_layers.py#L839-L870) corrects this in `longitude` and `geom`, so spatial joins are unaffected.
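
The longitude correction in flag 7 can be pictured roughly as follows. This is a hedged sketch, not the logic at the cited lines of `ingest_eia_energy_layers.py`: the bounding box used to decide what counts as "lower-48" is an assumption made for this example, and `fix_longitude` is a hypothetical name.

```python
# Assumed rough bounding box for the contiguous US (not from the source).
LOWER48_LAT = (24.0, 50.0)
LOWER48_LON = (-125.0, -66.0)


def fix_longitude(period, latitude, longitude):
    """Negate a positive longitude from the affected periods when the
    negated value would fall inside the contiguous-US box."""
    # "YYYY-MM" strings compare in chronological order.
    in_window = "2008-01" <= period <= "2010-11"
    flipped = -longitude
    looks_lower48 = (
        LOWER48_LAT[0] <= latitude <= LOWER48_LAT[1]
        and LOWER48_LON[0] <= flipped <= LOWER48_LON[1]
    )
    if in_window and longitude > 0 and looks_lower48:
        return flipped
    return longitude
```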
---
## Suggested next steps
1. **Backfill `power_mw`** from Baxtel / Data Center Map (paid scrape — grant work).
2. **Operator-string deduplication** — collapse "Meta"/"Meta, Inc.", "AWS" variants, etc., before any per-operator analysis.
3. **Watershed (HUC8) join**: `public.watershed_huc8` is loaded but unused; joining it would let us look at water stress overlap, particularly for the 190 non-metro DCs.
4. **State-level energy demand context**: `im3_state_projected_moderate_50` and `seds_state_msn_year` are loaded; joining these would let us compute "DC nearby capacity as share of state grid" rather than absolute MW.
5. **Opposition cases overlay**: `opposition_cases_geocoded` is loaded but unused; it could test whether cluster-vs-isolated demographic differences predict community opposition.
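
For the operator-string deduplication in step 2, one possible starting point is a small alias table keyed on a whitespace-collapsed, lowercased name. The alias list below is an assumption built only from the variants named under the data quality flags, and `normalize_operator` is a hypothetical helper:

```python
import re

# Hypothetical alias table covering only the variants listed in this document.
ALIASES = {
    "meta": "Meta",
    "meta, inc.": "Meta",
    "amazon web services": "Amazon Web Services",
    "amazon aws": "Amazon Web Services",
    "aws": "Amazon Web Services",
    "microsoft": "Microsoft",
    "microsoft azure": "Microsoft",
}


def normalize_operator(raw):
    """Map an operator string to a canonical name; unknown names pass
    through with only surrounding whitespace removed."""
    if raw is None:
        return None
    key = re.sub(r"\s+", " ", raw).strip().lower()
    return ALIASES.get(key, raw.strip())
```

Null operators are left as `None` on purpose, so the 50 unattributed non-metro rows stay visible instead of being folded into a named operator.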