Add data center demographic, RUCA, and energy capacity analysis
Adds three coordinated changes:

- Request nameplate, summer, and winter capacity from the EIA operating-generator-capacity endpoint and project them as typed columns on energy_eia_operating_generator_capacity_flat. The original ingest only pulled latitude and longitude, leaving the flat table with no MW values despite its name.
- New cluster_analysis.ipynb joins master_data_centers to ACS-2024 demographics, USDA RUCA-2020 codes (loaded from new/), and EIA generation capacity within 50 km of each site.
- Summary doc consolidates the headline findings: DC tracts skew higher income / more educated / more racially diverse than US average, the metro over-index is only 1.11x, the non-metro tail is dominated by hyperscalers in the Columbia River corridor (OR+WA = 66% of non-metro DCs), and Microsoft co-locates with Palo Verde Nuclear in Goodyear AZ.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
222
output/data_center_demographic_ruca_energy_summary.md
Normal file
@@ -0,0 +1,222 @@
# US Data Centers: Demographic, Urban-Rural & Energy Context Analysis

**Date:** 2026-05-18
**Notebook:** [cluster_analysis.ipynb](../cluster_analysis.ipynb)
**Universe:** 1,833 data centers in `public.master_data_centers`, joined to ACS-2024 demographics, USDA RUCA-2020 codes, and EIA operating-generator capacity (50 km radius, latest period 2026-02, status=OP).

> **Update 2026-05-18**: 196 previously null `state` values were backfilled from `geoid` (first 2 chars = state FIPS). All 1,833 DCs now have a state; all state-level numbers below reflect the corrected attribution.
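The backfill described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `backfill_state`, the row layout, and the sample GEOIDs are hypothetical, and the lookup is only an excerpt of the state FIPS table. It assumes `geoid` is stored as a string so that leading zeros (for example `06` for California) survive.

```python
# Excerpt of the state FIPS table; a real backfill needs every state and territory.
STATE_FIPS = {
    "04": "AZ", "06": "CA", "13": "GA", "17": "IL", "19": "IA",
    "32": "NV", "34": "NJ", "36": "NY", "37": "NC", "39": "OH",
    "41": "OR", "47": "TN", "48": "TX", "51": "VA", "53": "WA",
}


def backfill_state(state, geoid):
    """Return the existing state, or derive it from the first two GEOID characters."""
    if state:  # keep any value that is already populated
        return state
    if not geoid or len(geoid) < 2:
        return None  # nothing to derive from
    return STATE_FIPS.get(geoid[:2])  # None when the prefix is not in the excerpt


# Hypothetical rows; the GEOIDs are made up for illustration.
rows = [
    {"master_id": 1, "state": None, "geoid": "51107611029"},
    {"master_id": 2, "state": "TX", "geoid": "48113019204"},
    {"master_id": 3, "state": None, "geoid": "06085504601"},
    {"master_id": 4, "state": None, "geoid": None},
]

for row in rows:
    row["state"] = backfill_state(row["state"], row["geoid"])
    print(row["master_id"], row["state"])
```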

---

## Headline findings

1. **DC tracts are richer, more educated, and more diverse than the US average.** Median household income is $103,623 vs. $78,538 nationally (+32%); 49% hold a bachelor's degree or higher vs. 35% (+14 pp); the poverty rate is 7.2% vs. 12.4%. Non-Hispanic white share is *below* the national figure (50% vs. 58%), driven by Asian-heavy (mean 13% vs. 6%) and Hispanic-significant tracts.
2. **The metro skew is more modest than expected: 1.11×.** 89% of DCs sit in metropolitan tracts, but 80% of *all* US tracts are metropolitan, so DCs are only slightly more concentrated in metro tracts than the national tract distribution would predict.
3. **The non-metro tail is overwhelmingly hyperscale and Pacific Northwest.** Of 190 DCs outside metropolitan tracts (RUCA 4–10), AWS owns 67, Meta 22, Microsoft 10, Google 4, and Yahoo 2: a combined 55% of the non-metro footprint. Oregon (86) and Washington (40) alone hold 66% of non-metro DCs, anchored to the Columbia River hydropower corridor.
4. **Clustered DCs are demographically distinct from isolated ones.** DCs in DBSCAN clusters (n=1,583) sit in tracts with $108K median income vs. $73K for isolated DCs (n=250), a $35K gap. Clustered DCs are more educated (+18 pp bachelor's), more diverse (−25 pp non-Hispanic white), and embedded in much denser energy infrastructure (89 vs. 40 generators within 50 km).
5. **Microsoft co-locates with Palo Verde, one of the largest US nuclear plants.** Microsoft's Goodyear, AZ campus has 14.6 GW of generation within 50 km, including 4.2 GW from Palo Verde Nuclear. The campus sits in a RUCA-2 "Metro high-commute" tract rather than the metro core, yet its surrounding grid is the densest by capacity in our analysis.

---

## Dataset coverage and joins

| Source table | Rows | Join key | Coverage |
|---|---|---|---|
| `master_data_centers` | 1,833 | base | n/a |
| `master_data_center_spatial_clusters` | 1,831 | `master_id` | 99.9% |
| `_dc_census_tract_acs_2024` | ~73,000 tracts | `geoid` | 1,807 matched (98.6%) |
| `ruca_codes_2020_tract` | 85,528 tracts | `tract_fips_20 = geoid` | 1,826 matched (99.6%) |
| `energy_eia_operating_generator_capacity_flat` | 4.7M rows | `ST_DWithin(geom, 50km)` | 1,831 DCs have ≥1 nearby gen |

Energy aggregation uses period `2026-02` only with `status='OP'`, summing `nameplate_capacity_mw` for operating generators within 50 km of each DC. Note: EIA capacity columns were added to this table on 2026-05-17; before that the `_flat` table had no MW values despite its name.
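To make the aggregation rule concrete, here is a hedged pure-Python sketch of the same filter and sum. The analysis itself uses PostGIS `ST_DWithin`; the haversine distance below is a stand-in for it, and the function names, the dict layout, and the sample coordinates are assumptions made for illustration. Only the column names `period`, `status`, `nameplate_capacity_mw`, `latitude`, and `longitude` come from the text. With the sample data, only the first generator passes all three filters.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby_capacity(dc, generators, radius_km=50.0, period="2026-02"):
    """Return (generator count, summed nameplate MW) for operating units within the radius."""
    count, total_mw = 0, 0.0
    for g in generators:
        if g["period"] != period or g["status"] != "OP":
            continue  # keep only the latest period and operating status
        if g["nameplate_capacity_mw"] is None:
            continue  # skip rows without a capacity value
        dist = haversine_km(dc["latitude"], dc["longitude"], g["latitude"], g["longitude"])
        if dist <= radius_km:
            count += 1
            total_mw += g["nameplate_capacity_mw"]
    return count, total_mw


# Hypothetical site and generators, chosen so the distances are easy to check by hand.
dc = {"latitude": 45.0, "longitude": -120.0}
generators = [
    {"period": "2026-02", "status": "OP", "nameplate_capacity_mw": 500.0, "latitude": 45.1, "longitude": -120.0},  # ~11 km
    {"period": "2026-02", "status": "OP", "nameplate_capacity_mw": 900.0, "latitude": 46.0, "longitude": -120.0},  # ~111 km
    {"period": "2026-02", "status": "RE", "nameplate_capacity_mw": 300.0, "latitude": 45.0, "longitude": -120.1},  # not operating
    {"period": "2026-01", "status": "OP", "nameplate_capacity_mw": 200.0, "latitude": 45.0, "longitude": -120.1},  # older period
]

print(nearby_capacity(dc, generators))
```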

---

## 1. Demographic profile of DC tracts (n=1,807 with non-null ACS)

| Metric | DC tract (median) | DC tract (mean) | US avg | Δ mean vs. US |
|---|---:|---:|---:|---:|
| Median household income | $103,623 | $114,543 | $78,538 | **+$36,005** |
| Per-capita income | $51,283 | $55,725 | $43,313 | +$12,412 |
| Poverty rate | 7.2% | 10.1% | 12.4% | −2.3 pp |
| Unemployment rate | 3.5% | 4.4% | 5.4% | −1.0 pp |
| Bachelor's+ % | 49.3% | 46.2% | 35.0% | **+11.2 pp** |
| Broadband subscription % | 94.9% | 93.5% | 89.0% | +4.5 pp |
| Non-Hispanic white % | 50.2% | 51.0% | 58.4% | **−7.4 pp** |
| Hispanic / Latino % | 12.8% | 19.5% | 19.5% | 0.0 pp |
| Non-Hispanic Black % | 5.9% | 10.6% | 12.1% | −1.5 pp |
| Non-Hispanic Asian % | 6.4% | 13.4% | 6.4% | **+7.0 pp** |

**Interpretation.** DC tracts skew high-income, highly educated, well connected (broadband), and racially diverse, specifically Asian-heavy. Notably, DC tracts are *less* non-Hispanic white than the national average, not more. This reflects DC siting in racially mixed coastal and exurban tech corridors (Bay Area, Northern Virginia, Seattle) rather than in homogeneous suburbs.

**Data quality note.** `avg_household_size` contains sentinel-value pollution (min: −666,666,666), so the mean is unusable; the median (2.55) is sensible.
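A small sketch of why the mean breaks and how a filter restores it. The helper name `clean_household_sizes`, the zero floor, and the sample values are illustrative assumptions; the only fact taken from the data is that the sentinel is a very large negative number, so any positive floor separates it from real household sizes.

```python
from statistics import mean, median


def clean_household_sizes(values, floor=0.0):
    """Drop missing values and anything at or below `floor` (sentinels are large negatives)."""
    return [v for v in values if v is not None and v > floor]


# Illustrative values: three plausible tracts, one sentinel, one missing.
raw = [2.1, 2.55, 3.0, -666666666.0, None]

present = [v for v in raw if v is not None]
cleaned = clean_household_sizes(raw)

print("mean with sentinel:  ", mean(present))    # dominated by the sentinel
print("median with sentinel:", median(present))  # barely affected
print("mean after filter:   ", mean(cleaned))
print("median after filter: ", median(cleaned))
```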

---

## 2. Geographic concentration (top 15 states)

| State | DC count | Total power_mw (where known) | Median HH income | Median bachelor's % | Median % white | Notes |
|---|---:|---:|---:|---:|---:|---|
| **VA** | **378** | 255 | $141,250 | 62.6% | 42.5% | Loudoun / DC-Alley dominance (20.6% of all US DCs) |
| TX | 162 | 597 | $88,228 | 46.2% | 32.0% | DFW + Austin + San Antonio |
| CA | 147 | 130 | $164,928 | 56.4% | 22.4% | Bay Area + LA basin |
| OR | 145 | 125 | $72,719 | 20.0% | 63.2% | Columbia River hydro corridor (rural) |
| OH | 103 | 135 | $128,875 | 47.0% | 74.5% | Columbus boom; fastest-rising market |
| WA | 93 | 70 | $91,623 | 21.9% | 40.3% | Quincy/Wenatchee + Seattle |
| AZ | 69 | 54 | $85,335 | 35.2% | 51.6% | Phoenix/Goodyear hyperscale |
| IA | 65 | 0 | $93,393 | 34.3% | 88.1% | 88% white (rural Midwest) |
| NJ | 62 | 98 | $147,321 | 59.4% | 32.9% | NYC-metro carrier hotels |
| IL | 61 | 128 | $96,191 | 52.9% | 52.0% | Chicago metro |
| GA | 50 | 241 | $101,176 | 51.4% | 31.6% | Atlanta + high-power rural builds |
| NY | 48 | 47 | $77,465 | 47.6% | 74.8% | NYC + upstate |
| NV | 41 | 0 | $93,409 | 31.2% | 34.6% | Reno + Las Vegas |
| TN | 32 | 0 | n/a | n/a | 54.8% | Nashville + Memphis (newly visible after state backfill) |
| NC | 31 | 56 | $82,708 | 44.7% | 59.6% | Charlotte + Catawba (nuclear-adjacent) |

**Virginia alone holds 20.6% of all US DCs** (378 of 1,833). Its tracts have the highest bachelor's share in the top 15 (62.6%) and the third-highest median household income ($141,250, behind CA and NJ), a Loudoun County effect. The state backfill substantially elevated **Ohio (76 → 103)** and **Texas (135 → 162)**, pushing TX into the #2 slot. The previously uncounted **Tennessee (32) now appears in the top 15**.

Oregon and Washington tracts look notably different from the urban-heavy states (lower income, lower education, lower broadband, higher Hispanic share), reflecting their rural Columbia River siting.

---

## 3. Spatially clustered DCs vs. isolated DCs

DBSCAN cluster assignment from `master_data_center_spatial_clusters` (1,583 clustered, 250 isolated):

| Metric (median) | Isolated | In cluster | Δ |
|---|---:|---:|---:|
| Median household income | $73,500 | $108,359 | **+$34,859** |
| Bachelor's+ % | 33.2 | 51.2 | **+18.0 pp** |
| Poverty rate | 11.6 | 6.9 | −4.7 pp |
| Non-Hispanic white % | 71.0 | 45.9 | **−25.1 pp** |
| EIA generators within 50 km | 40 | 89 | +49 |
| EIA capacity within 50 km (MW) | 2,176 | 3,300 | +1,125 |

**Reading.** A clustered data center sits, at the median, in a tract that is ~$35K richer, 18 pp more educated, and 25 pp less non-Hispanic white than an isolated one, and is surrounded by roughly twice as many generators (89 vs. 40) and about 50% more generation capacity. The isolated set looks like rural / small-town America (whiter, poorer, less educated); the clustered set looks like coastal exurban tech corridors.

---

## 4. RUCA (urban-rural) distribution

National baseline of all US tracts: 80% Metropolitan, 9% Micropolitan, 3% Small town, 8% Rural.

| RUCA band | DCs | DC % | US tract % | Over-index |
|---|---:|---:|---:|---:|
| Metropolitan (1–3) | 1,636 | 89.3% | 80.1% | **1.11×** |
| Micropolitan (4–6) | 98 | 5.3% | 9.0% | 0.59× |
| Small town (7–9) | 15 | 0.8% | 2.9% | 0.28× |
| Rural (10) | 77 | 4.2% | 7.6% | 0.55× |
| Unknown / missed match | 7 | 0.4% | n/a | n/a |

**Reading.** The metro skew is real but only mild: 1.11×. The eye-catching pattern is that **rural tracts (RUCA 10) hold five times as many DCs as small-town tracts (77 vs. 15) and about 80% as many as micropolitan tracts (98)**, because the hyperscale greenfield model often bypasses small-city economies in favor of remote, cheap-power, low-population sites.
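The over-index column is the DC share of a RUCA band divided by that band's share of all US tracts. The sketch below recomputes it from the counts and baseline shares in the table; the variable and function names are illustrative, not taken from the notebook.

```python
TOTAL_DCS = 1833  # includes the 7 DCs with no RUCA match

# DC counts and national tract shares, copied from the table above.
dc_counts = {
    "Metropolitan (1-3)": 1636,
    "Micropolitan (4-6)": 98,
    "Small town (7-9)": 15,
    "Rural (10)": 77,
}
us_tract_share = {
    "Metropolitan (1-3)": 0.801,
    "Micropolitan (4-6)": 0.090,
    "Small town (7-9)": 0.029,
    "Rural (10)": 0.076,
}


def over_index(dc_count, total_dcs, baseline_share):
    """Ratio of the DC share in a band to the band's share of all US tracts."""
    return (dc_count / total_dcs) / baseline_share


ratios = {
    band: over_index(n, TOTAL_DCS, us_tract_share[band])
    for band, n in dc_counts.items()
}

for band, ratio in ratios.items():
    print(f"{band:<20} {dc_counts[band]:>5} DCs  over-index {ratio:.2f}x")
```

A value above 1 means the band holds more DCs than its tract share alone would predict; only the metropolitan band clears that bar.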

### Per-RUCA-code drilldown

| RUCA | Description | DCs | Median HH income | Median pop density (per sq mi) | Median EIA gens (50 km) |
|---:|---|---:|---:|---:|---:|
| 1 | Metro core | 1,425 | $110,333 | 1,859 | 97 |
| 2 | Metro high-commute | 206 | $105,404 | 96 | 49 |
| 3 | Metro low-commute | 5 | $119,495 | 22 | 23 |
| 4 | Micropolitan core | 54 | $63,698 | 312 | 53 |
| 5 | Micropolitan high-commute | 22 | $72,465 | 191 | 51 |
| 6 | Micropolitan low-commute | 22 | $72,719 | 69 | 59 |
| 7 | Small town core | 14 | $87,522 | 2,336 | 40 |
| 8 | Small town high-commute | 1 | $69,074 | 36 | 41 |
| 10 | Rural area | 77 | $93,820 | 12 | 42 |

**Two surprises:**
- Rural DCs (RUCA 10) sit in tracts with $93.8K median income, *higher* than micropolitan DCs ($63.7K–$72.7K). The rural DC sites are not poor rural America; they are areas that are wealthy by rural standards, chosen for power and water access.
- Micropolitan-core DCs (RUCA 4) have the *lowest* median income at $63.7K, the closest thing to "economic-development DC siting" in the dataset.

---

## 5. Non-metro deep dive (190 DCs, RUCA 4–10)

### Operators

| Operator | Non-metro DCs |
|---|---:|
| Amazon Web Services | 67 |
| *(null operator)* | 50 |
| Meta | 20 (+ 2 as "Meta, Inc.") |
| Microsoft | 10 |
| Google | 4 |
| Rowan Green Data | 4 |
| NTT | 2 |
| Yahoo | 2 |
| Amazon AWS *(dupe)* | 2 |

**The five hyperscalers (AWS, Meta, Microsoft, Google, Yahoo) account for 105 of 190 non-metro DCs (55%)**, or 107 (56%) once the two "Amazon AWS" duplicate rows are folded into AWS. Excluding the 50 null-operator rows, the hyperscaler share of DCs with a known operator is 75% (105 of 140). The null-operator rows are disproportionately in OR/WA, so they likely skew hyperscale as well, which would put the true share closer to 75% than 55%.
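The effect of the duplicate operator spellings on these shares can be checked with a short sketch. The alias map and the `normalize_operator` helper are hypothetical (the notebook may handle this differently); the counts are copied from the operator table, and the remaining non-metro DCs belong to operators the table does not list.

```python
import re

# Hypothetical alias map covering the variants that appear in this document.
CANONICAL = {
    "amazon web services": "AWS",
    "amazon aws": "AWS",
    "aws": "AWS",
    "meta": "Meta",
    "meta inc": "Meta",
    "microsoft": "Microsoft",
    "microsoft azure": "Microsoft",
    "google": "Google",
    "yahoo": "Yahoo",
}
HYPERSCALERS = {"AWS", "Meta", "Microsoft", "Google", "Yahoo"}
NON_METRO_TOTAL = 190


def normalize_operator(name):
    """Lower-case, strip punctuation, collapse spaces, then map known aliases."""
    if name is None:
        return None
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    key = re.sub(r"\s+", " ", key).strip()
    return CANONICAL.get(key, name.strip())


# Raw operator strings and counts as listed in the table (None = null operator).
raw_counts = {
    "Amazon Web Services": 67, None: 50, "Meta": 20, "Meta, Inc.": 2,
    "Microsoft": 10, "Google": 4, "Rowan Green Data": 4, "NTT": 2,
    "Yahoo": 2, "Amazon AWS": 2,
}

merged = {}
for raw, n in raw_counts.items():
    op = normalize_operator(raw)
    merged[op] = merged.get(op, 0) + n

hyperscale = sum(n for op, n in merged.items() if op in HYPERSCALERS)
known = NON_METRO_TOTAL - merged.get(None, 0)

print(f"hyperscale share of all non-metro DCs: {hyperscale / NON_METRO_TOTAL:.1%}")
print(f"hyperscale share of known operators:   {hyperscale / known:.1%}")
```

With both Meta and AWS variants merged, the hyperscaler count is 107 rather than 105, which nudges each share up by about one point.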

### States (post-backfill)

| State | Non-metro DCs |
|---|---:|
| Oregon | 86 |
| Washington | 40 |
| Texas | 9 |
| New Mexico | 7 |
| North Carolina | 6 |
| Pennsylvania | 5 |
| Wisconsin | 4 |
| New York | 3 |
| Tennessee | 3 |
| Georgia | 3 |

**Oregon + Washington = 126 (66%) of all non-metro DCs.** This is the Columbia River basin: Prineville / Hermiston / Boardman / The Dalles (OR) and Quincy / East Wenatchee / Moses Lake (WA). The pull is hydroelectric power (cheap, low-carbon, abundant) and a cool, dry climate (free cooling).

The state backfill clarified the rest of the non-metro tail: **Texas (9)** and **Pennsylvania (5)** were previously hidden in the null bucket. These likely represent shale-gas-adjacent builds (Permian and Marcellus respectively).

---

## 6. Energy footprint by operator (using EIA capacity within 50 km)

Aggregated across DCs in RUCA 2–10 (i.e. anything outside dense metro core, n=401). Totals appear to sum each site's 50 km figure, so a generator near several sites of the same operator is counted once per site (Microsoft's 12.6 GW of nuclear matches three Goodyear sites at 4.2 GW each). Read the totals as exposure summed over sites, not as unique capacity.

| Operator | DCs | States | Total nearby capacity (GW) | Median per site (GW) | Hydro (GW) | Nuclear (GW) | NG (GW) | Solar (GW) | Wind (GW) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| AWS | 93 | 5 | 397 | 4.8 | 66 | 2.5 | 201 | 4.6 | **114** |
| *(Unknown)* | 118 | 26 | 339 | 2.3 | 86 | 35 | 135 | 23 | 19 |
| Meta | 51 | 11 | 120 | 2.0 | 4.9 | 0 | 61 | 16 | 0.3 |
| Microsoft | 26 | 6 | 113 | 3.4 | 28 | **13** | 39 | 9.1 | 8.1 |
| Google | 31 | 5 | 100 | 3.9 | 14 | 0 | 43 | 3.6 | 4.7 |
| Apple | 5 | 2 | 4 | 0.6 | 1.6 | 0 | 1.1 | 0.9 | 0.4 |
| Yahoo | 2 | 1 | 7 | 3.5 | 6.4 | 0 | 0 | 0 | 0.7 |

**Distinct hyperscaler strategies, visible in the fuel mix:**
- **AWS** has 114 GW of *wind* exposure summed across its 93 sites, by far the most renewable-coupled portfolio. It is also heavy in hydro (66 GW) from its OR/WA footprint, with 201 GW of natural gas as the baseline.
- **Microsoft** has the highest *nuclear* exposure (12.6 GW), almost entirely from its Goodyear, AZ campuses near Palo Verde Nuclear.
- **Meta** has the most *solar* (16 GW) among the named hyperscalers, but minimal nuclear or wind, consistent with its New Mexico (Los Lunas) and Iowa builds.
- **Google** is split: moderate hydro and natural gas, modest renewables.
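One caveat when reading these operator totals: Microsoft's 12.6 GW of nuclear equals three Goodyear sites at 4.2 GW each, which suggests the totals are per-site figures summed across sites. Under that assumption, the sketch below contrasts summed exposure with unique capacity. The site names, generator IDs, and MW values are made up for illustration, and both function names are hypothetical.

```python
def summed_exposure_mw(site_generators):
    """Sum capacity per site, then across sites; a shared generator counts once per site."""
    return sum(mw for gens in site_generators.values() for mw in gens.values())


def unique_capacity_mw(site_generators):
    """Count each generator once, however many sites it is near."""
    unique = {}
    for gens in site_generators.values():
        unique.update(gens)  # same generator ID maps to the same capacity
    return sum(unique.values())


# Hypothetical operator with three sites that all sit near the same large plant.
sites = {
    "site_a": {"gen_1": 4200.0, "gen_2": 600.0},
    "site_b": {"gen_1": 4200.0},
    "site_c": {"gen_1": 4200.0, "gen_3": 150.0},
}

print("summed exposure (MW):", summed_exposure_mw(sites))
print("unique capacity (MW):", unique_capacity_mw(sites))
```

If unique capacity is the quantity of interest, deduplicating on a generator identifier before summing would be the natural adjustment; whether the flat table exposes a stable identifier for that purpose is not stated here.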

### Largest grid neighborhoods outside the metro core (top sites by surrounding capacity)

| DC | Operator | Location | Nearby capacity | Fuel highlight |
|---|---|---|---:|---|
| PHX70 / PHX-10 / PHX-11 | Microsoft (Azure) | Goodyear, AZ (RUCA 2) | 14.0–14.6 GW | **4.2 GW nuclear (Palo Verde)** + 6.4 GW gas + 2.2 GW solar |
| Stream PHX-1 | Stream Data Centers | Goodyear, AZ | 13.4 GW | Same Palo Verde / gas mix |
| T5 Charlotte Campus | T5 | Kings Mountain, NC (RUCA 6) | 12.9 GW | **4.9 GW nuclear** (Catawba) + 5.5 GW gas + 1.5 GW coal |
| Apple Maiden | Apple | Maiden, NC (RUCA 2) | 9.1 GW | 2.4 GW nuclear + 4.6 GW gas |
| Percheron DC | Rowan Green Data | Texas (RUCA 10) | 6.7 GW | **3.0 GW wind** + 0.9 GW hydro + 2.4 GW gas |

---

## Data quality flags

1. ~~196 of 1,833 DCs (10.7%) have null `state`~~ **Resolved 2026-05-18** by backfilling from the first two characters of `geoid` (state FIPS).
2. **`master_data_centers.power_mw` is populated for only 108 / 1,833 DCs (5.9%).** It is unusable as a sizing metric without imputation or an alternative source. Nearby EIA capacity is the more reliable proxy (used as the per-DC scale in this analysis). A grant-funded scrape of Baxtel / Data Center Map would close this gap.
3. **50 of 190 non-metro DCs (26%) have null `operator`.** Likely hyperscalers based on geography (OR/WA) but unconfirmed.
4. **Operator-string fragmentation**: "Meta" vs. "Meta, Inc."; "Amazon Web Services" vs. "Amazon AWS" vs. "amazon web services"; "Microsoft" vs. "Microsoft Azure". This inflates distinct-operator counts and fragments per-operator totals.
5. **`avg_household_size` column has sentinel pollution** (min: −666,666,666). Use the median, or filter out sentinel values before aggregating.
6. **7 DCs failed the RUCA join**: Puerto Rico tracts or non-US locations; trivial.
7. **EIA generator coordinates had a longitude sign error for 2008-01 through 2010-11** (~11K rows with positive lower-48 longitudes). The flat-table build at [ingest_eia_energy_layers.py:839-870](../ingest_eia_energy_layers.py#L839-L870) corrects this in `longitude` and `geom`, so spatial joins are unaffected.

---

## Suggested next steps

1. **Backfill `power_mw`** from Baxtel / Data Center Map (paid scrape; grant work).
2. **Operator-string deduplication**: collapse "Meta"/"Meta, Inc.", "AWS" variants, etc., before any per-operator analysis.
3. **Watershed (HUC8) join**: `public.watershed_huc8` is loaded but unused; it would let us look at water stress overlap, particularly for the 190 non-metro DCs.
4. **State-level energy demand context**: `im3_state_projected_moderate_50` and `seds_state_msn_year` are loaded; joining these would let us compute "DC nearby capacity as share of state grid" rather than absolute MW.
5. **Opposition cases overlay**: `opposition_cases_geocoded` is loaded but unused; it could test whether cluster-vs-isolated demographic differences predict community opposition.