# US Data Centers: Demographic, Urban-Rural & Energy Context Analysis

**Date:** 2026-05-18

**Notebook:** [cluster_analysis.ipynb](../cluster_analysis.ipynb)

**Universe:** 1,833 data centers in `public.master_data_centers`, joined to ACS-2024 demographics, USDA RUCA-2020 codes, USGS HUC8 watersheds, and EIA operating-generator capacity (50 km radius, latest period 2026-02, status=OP).

---

## Headline findings

1. **DC tracts are richer, more educated, and more diverse than the US average.** Median household income $103,623 vs. national $78,538 (+32%); 49% bachelor's+ vs. 35% (+14 pp); poverty rate 7.2% vs. 12.4%. Non-Hispanic white share is *below* national (50% vs. 58%), driven by Asian-heavy (mean 13% vs. 6%) and Hispanic-significant tracts.
2. **The metro skew is more modest than expected: 1.11×.** 89% of DCs sit in metropolitan tracts, but 80% of *all* US tracts are metropolitan, so DCs are only slightly more concentrated than the underlying tract distribution would predict.
3. **The non-metro tail is overwhelmingly hyperscale and Pacific Northwest.** Of 190 DCs outside metropolitan tracts (RUCA 4–10), AWS owns 67, Meta 22, Microsoft 10, Google 4, Yahoo 2, a combined 55% of the non-metro footprint. Oregon (86) and Washington (40) alone hold 66% of non-metro DCs, anchored to the Columbia River hydropower corridor.
4. **Clustered DCs are demographically distinct from isolated ones.** DCs in DBSCAN clusters (n=1,583) sit in tracts with $108K median income vs. $73.5K for isolated DCs (n=250), a $35K gap. Clustered DCs are more educated (+18 pp bachelor's), more diverse (−25 pp non-Hispanic white), and embedded in much denser energy infrastructure (89 vs. 40 generators within 50 km).
5. **Microsoft co-locates with one of the largest US nuclear plants.** Microsoft's Goodyear, AZ campus has 14.6 GW of generation within 50 km, including 4.2 GW from Palo Verde Nuclear, one of the largest nuclear plants in the US. The campus sits in a RUCA-2 "Metro high-commute" tract (metropolitan, but not metro core), and its surrounding grid is the densest by capacity in our analysis.
6. **Extreme watershed concentration: nearly half of all US DCs sit in just 15 of 2,139 HUC8 watersheds.** A single watershed, Middle Potomac-Catoctin (Loudoun County), holds 235 DCs (12.8% of the US total). The top 2 (both DC-Alley watersheds) hold 18.9%; the top 10 hold 40%. Water stress in any one of these basins propagates to a large share of national DC capacity.

---

## Dataset coverage and joins

| Source table | Rows | Join key | Coverage |
|---|---|---|---|
| `master_data_centers` | 1,833 | base | — |
| `master_data_center_spatial_clusters` | 1,831 | `master_id` | 99.9% |
| `_dc_census_tract_acs_2024` | ~73,000 tracts | `geoid` | 1,807 matched (98.6%) |
| `ruca_codes_2020_tract` | 85,528 tracts | `tract_fips_20 = geoid` | 1,826 matched (99.6%) |
| `watershed_huc8` | 2,139 watersheds | `ST_Contains(w.geom, m.geom)` | 1,831 matched (99.9%) |
| `energy_eia_operating_generator_capacity_flat` | 4.7M rows | `ST_DWithin(geom, 50km)` | 1,831 DCs have ≥1 nearby gen |

Energy aggregation uses period `2026-02` only with `status='OP'`, summing `nameplate_capacity_mw` for operating generators within 50 km of each DC.

Note: EIA capacity columns were added to this table on 2026-05-17. Before that date the `_flat` table had no MW values despite its name.

---

## 1. Demographic profile of DC tracts (n=1,807 with non-null ACS)

| Metric | DC tract (median) | DC tract (mean) | US avg | Δ mean vs. US |
|---|---:|---:|---:|---:|
| Median household income | $103,623 | $114,543 | $78,538 | **+$36,005** |
| Per-capita income | $51,283 | $55,725 | $43,313 | +$12,412 |
| Poverty rate | 7.2% | 10.1% | 12.4% | −2.3 pp |
| Unemployment rate | 3.5% | 4.4% | 5.4% | −1.0 pp |
| Bachelor's+ % | 49.3% | 46.2% | 35.0% | **+11.2 pp** |
| Broadband subscription % | 94.9% | 93.5% | 89.0% | +4.5 pp |
| Non-Hispanic white % | 50.2% | 51.0% | 58.4% | **−7.4 pp** |
| Hispanic / Latino % | 12.8% | 19.5% | 19.5% | 0.0 pp |
| Non-Hispanic Black % | 5.9% | 10.6% | 12.1% | −1.5 pp |
| Non-Hispanic Asian % | 6.4% | 13.4% | 6.4% | **+7.0 pp** |

**Interpretation.** DC tracts skew high-income, highly educated, well connected, and racially diverse (specifically Asian-heavy). Notably, DC tracts are *less* non-Hispanic white than the national average, not more. This reflects DC siting in mixed-race coastal/exurban tech corridors (Bay Area, Northern Virginia, Seattle) rather than in homogeneous suburbs.

**Data quality note.** `avg_household_size` previously contained ACS sentinel-value pollution (−666,666,666) for 1,089 zero-population tracts in `_dc_census_tract_acs_2024` (29 of which contained DCs) plus 16 rows in `data_center_census_tracts_2024`. As of 2026-05-18, those sentinels have been replaced with `NULL`. The column now has plausible ranges (min 1.00, max 9.33) and a usable mean.

---

## 2. Geographic concentration (top 15 states)

| State | DC count | Total power_mw (where known) | Median HH income | Median bachelor's % | Median % white | Notes |
|---|---:|---:|---:|---:|---:|---|
| **VA** | **378** | 255 | $141,250 | 62.6% | 42.5% | Loudoun / DC-Alley dominance (20.6% of all US DCs) |
| TX | 162 | 597 | $88,228 | 46.2% | 32.0% | DFW + Austin + San Antonio |
| CA | 147 | 130 | $164,928 | 56.4% | 22.4% | Bay Area + LA basin |
| OR | 145 | 125 | $72,719 | 20.0% | 63.2% | Columbia River hydro corridor (rural) |
| OH | 103 | 135 | $128,875 | 47.0% | 74.5% | Columbus boom, fastest-rising market |
| WA | 93 | 70 | $91,623 | 21.9% | 40.3% | Quincy/Wenatchee + Seattle |
| AZ | 69 | 54 | $85,335 | 35.2% | 51.6% | Phoenix/Goodyear hyperscale |
| IA | 65 | 0 | $93,393 | 34.3% | 88.1% | 88% white (rural Midwest) |
| NJ | 62 | 98 | $147,321 | 59.4% | 32.9% | NYC-metro carrier hotels |
| IL | 61 | 128 | $96,191 | 52.9% | 52.0% | Chicago metro |
| GA | 50 | 241 | $101,176 | 51.4% | 31.6% | Atlanta + high-power rural builds |
| NY | 48 | 47 | $77,465 | 47.6% | 74.8% | NYC + upstate |
| NV | 41 | 0 | $93,409 | 31.2% | 34.6% | Reno + Las Vegas |
| TN | 32 | 0 | — | — | 54.8% | Nashville + Memphis (newly visible after state backfill) |
| NC | 31 | 56 | $82,708 | 44.7% | 59.6% | Charlotte + Catawba (nuclear-adjacent) |

**Virginia alone holds 20.6% of all US DCs** (378 of 1,833), with the most educated tract profile in the top 15 (62.6% bachelor's+) and the third-highest median household income after CA and NJ, a Loudoun County effect. The state backfill substantially elevated **Ohio (76 → 103)** and **Texas (135 → 162)**, pushing TX into the #2 slot. The previously uncounted **Tennessee (32) now appears in the top 15**. Oregon and Washington tracts look notably different from the urban-heavy states (lower income, lower education, lower broadband, higher Hispanic share), reflecting their rural Columbia River siting.

---

## 3. Spatially clustered DCs vs. isolated DCs

DBSCAN cluster assignment from `master_data_center_spatial_clusters` (1,583 clustered, 250 isolated):

| Metric (median) | Isolated | In cluster | Δ |
|---|---:|---:|---:|
| Median household income | $73,500 | $108,359 | **+$34,859** |
| Bachelor's+ % | 33.2 | 51.2 | **+18.0 pp** |
| Poverty rate | 11.6 | 6.9 | −4.7 pp |
| Non-Hispanic white % | 71.0 | 45.9 | **−25.1 pp** |
| EIA generators within 50 km | 40 | 89 | +49 |
| EIA capacity within 50 km (MW) | 2,176 | 3,300 | +1,125 |

**Reading.** A clustered data center sits, at the median, in a tract that is ~$35K richer, 18 pp more educated, and 25 pp less non-Hispanic white than an isolated one, and is surrounded by roughly twice as many generators (and about 50% more generation capacity). The isolated set looks like rural / small-town America (whiter, poorer, less educated); the clustered set looks like coastal exurban tech corridors.

---

## 4. RUCA (urban-rural) distribution

National baseline of all US tracts: 80% Metropolitan, 9% Micropolitan, 3% Small town, 8% Rural.

| RUCA band | DCs | DC % | US tract % | Over-index |
|---|---:|---:|---:|---:|
| Metropolitan (1–3) | 1,636 | 89.3% | 80.1% | **1.11×** |
| Micropolitan (4–6) | 98 | 5.3% | 9.0% | 0.59× |
| Small town (7–9) | 15 | 0.8% | 2.9% | 0.28× |
| Rural (10) | 77 | 4.2% | 7.6% | 0.55× |
| Unknown / missed match | 7 | 0.4% | — | — |

**Reading.** The metro skew is real but only mild: 1.11×. The eye-catching pattern is that **rural tracts (RUCA 10) hold five times as many DCs as small-town tracts (77 vs. 15) and nearly as many as micropolitan tracts (98)**, because the hyperscale greenfield model deliberately bypasses small-city economies in favor of remote, cheap-power, low-population sites.
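The over-index column is the DC share of each RUCA band divided by that band's share of all US tracts. A minimal sketch of the arithmetic, using only the counts and baseline percentages from the table above (the variable and function names are illustrative, not taken from the notebook):

```python
# DC counts per RUCA band and the share of all US tracts in each band,
# copied from the RUCA distribution table in this section.
dc_counts = {
    "Metropolitan (1-3)": 1636,
    "Micropolitan (4-6)": 98,
    "Small town (7-9)": 15,
    "Rural (10)": 77,
}
us_tract_share = {
    "Metropolitan (1-3)": 0.801,
    "Micropolitan (4-6)": 0.090,
    "Small town (7-9)": 0.029,
    "Rural (10)": 0.076,
}
TOTAL_DCS = 1833  # full universe, including the 7 DCs with no RUCA match


def over_index(counts, baseline_share, total):
    """Return (DC share) / (baseline tract share) for each band."""
    return {
        band: (n / total) / baseline_share[band]
        for band, n in counts.items()
    }


ratios = over_index(dc_counts, us_tract_share, TOTAL_DCS)
for band, ratio in ratios.items():
    print(f"{band}: {ratio:.2f}x")
```

A ratio above 1 means the band holds more DCs than its share of tracts would suggest; only the metropolitan band clears that bar.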
### Per-RUCA-code drilldown

| RUCA | Description | DCs | Median HH income | Median pop density (per sq mi) | Median EIA gens (50 km) |
|---:|---|---:|---:|---:|---:|
| 1 | Metro core | 1,425 | $110,333 | 1,859 | 97 |
| 2 | Metro high-commute | 206 | $105,404 | 96 | 49 |
| 3 | Metro low-commute | 5 | $119,495 | 22 | 23 |
| 4 | Micropolitan core | 54 | $63,698 | 312 | 53 |
| 5 | Micropolitan high-commute | 22 | $72,465 | 191 | 51 |
| 6 | Micropolitan low-commute | 22 | $72,719 | 69 | 59 |
| 7 | Small town core | 14 | $87,522 | 2,336 | 40 |
| 8 | Small town high-commute | 1 | $69,074 | 36 | 41 |
| 10 | Rural area | 77 | $93,820 | 12 | 42 |

**Two surprises:**

- Rural DCs (RUCA 10) sit in tracts with $93.8K median income, *higher* than micropolitan DCs ($63.7K–$72.7K). The rural DC sites are not poor rural America; they are wealthy-by-rural-standards counties chosen for power and water access.
- Micropolitan-core DCs (RUCA 4) have the *lowest* median income at $63.7K, the closest thing to "economic-development DC siting" in the dataset.

---

## 5. Non-metro deep dive (190 DCs, RUCA 4–10)

### Operators

| Operator | Non-metro DCs |
|---|---:|
| Amazon Web Services | 67 |
| *(null operator)* | 50 |
| Meta | 20 (+ 2 as "Meta, Inc.") |
| Microsoft | 10 |
| Google | 4 |
| Rowan Green Data | 4 |
| NTT | 2 |
| Yahoo | 2 |
| Amazon AWS *(dupe)* | 2 |

**The five hyperscalers (AWS, Meta, Microsoft, Google, Yahoo) account for 105 of 190 non-metro DCs (55%).** If the 50 null-operator rows skew similarly hyperscale (likely, since they are disproportionately in OR/WA), the share is probably closer to 75%.
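The operator counts above depend on how label variants are merged, and the 75% estimate depends on how many null-operator rows turn out to be hyperscale. The sketch below illustrates both steps. The alias map is hypothetical (it is not the notebook's actual cleaning logic), and the source does not say whether the two "Amazon AWS" rows are distinct facilities, so the merged AWS figure is only what label-merging would produce.

```python
from collections import Counter

# Raw operator labels and non-metro DC counts from the Operators table.
raw_counts = {
    "Amazon Web Services": 67,
    None: 50,            # null operator
    "Meta": 20,
    "Meta, Inc.": 2,
    "Microsoft": 10,
    "Google": 4,
    "Rowan Green Data": 4,
    "NTT": 2,
    "Yahoo": 2,
    "Amazon AWS": 2,
}

# Hypothetical alias map: lower-cased label -> canonical operator name.
ALIASES = {
    "amazon web services": "AWS",
    "amazon aws": "AWS",
    "meta": "Meta",
    "meta, inc.": "Meta",
}


def canonical(label):
    """Collapse whitespace/case variants and known aliases; keep None as None."""
    if label is None:
        return None
    key = " ".join(label.split()).lower()
    return ALIASES.get(key, label.strip())


merged = Counter()
for label, n in raw_counts.items():
    merged[canonical(label)] += n

NON_METRO_TOTAL = 190
REPORTED_HYPERSCALE = 105  # figure stated in the text
NULL_ROWS = 50


def hyperscale_share(null_fraction):
    """Share of non-metro DCs that are hyperscale if `null_fraction`
    of the null-operator rows are also hyperscale."""
    return (REPORTED_HYPERSCALE + null_fraction * NULL_ROWS) / NON_METRO_TOTAL


print(merged["AWS"], merged["Meta"])
print(f"{hyperscale_share(0.0):.1%}", f"{hyperscale_share(0.75):.1%}")
```

Under these assumptions, reaching a 75% hyperscale share requires about three-quarters of the null-operator rows to be hyperscale sites.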
### States (post-backfill)

| State | Non-metro DCs |
|---|---:|
| Oregon | 86 |
| Washington | 40 |
| Texas | 9 |
| New Mexico | 7 |
| North Carolina | 6 |
| Pennsylvania | 5 |
| Wisconsin | 4 |
| New York | 3 |
| Tennessee | 3 |
| Georgia | 3 |

**Oregon + Washington = 126 (66%) of all non-metro DCs.** This is the Columbia River basin: Prineville / Hermiston / Boardman / The Dalles (OR) and Quincy / East Wenatchee / Moses Lake (WA). The pull is hydroelectric power (cheap, low-carbon, abundant) and a cool, dry climate (free cooling).

The state backfill clarified the rest of the non-metro tail: **Texas (9)** and **Pennsylvania (5)** were previously hidden in the null bucket. These likely represent shale-gas-adjacent builds (Permian and Marcellus respectively).

---

## 6. Energy footprint by operator (using EIA capacity within 50 km)

Aggregated across DCs in RUCA 2–10 (i.e., anything outside the dense metro core, n=401). Totals sum each site's 50 km capacity, so a generator that lies within 50 km of several sites is counted once per site:

| Operator | DCs | States | Total nearby capacity (GW) | Median per site (GW) | Hydro (GW) | Nuclear (GW) | NG (GW) | Solar (GW) | Wind (GW) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| AWS | 93 | 5 | 397 | 4.8 | 66 | 2.5 | 201 | 4.6 | **114** |
| *(Unknown)* | 118 | 26 | 339 | 2.3 | 86 | 35 | 135 | 23 | 19 |
| Meta | 51 | 11 | 120 | 2.0 | 4.9 | 0 | 61 | 16 | 0.3 |
| Microsoft | 26 | 6 | 113 | 3.4 | 28 | **13** | 39 | 9.1 | 8.1 |
| Google | 31 | 5 | 100 | 3.9 | 14 | 0 | 43 | 3.6 | 4.7 |
| Apple | 5 | 2 | 4 | 0.6 | 1.6 | 0 | 1.1 | 0.9 | 0.4 |
| Yahoo | 2 | 1 | 7 | 3.5 | 6.4 | 0 | 0 | 0 | 0.7 |

**Distinct hyperscaler strategies, visible in the fuel mix:**

- **AWS** has aggregated 114 GW of *wind* exposure across its 93 sites, by far the most renewable-coupled portfolio. It also has heavy hydro (66 GW) from its OR/WA footprint and 201 GW of natural gas as baseline.
- **Microsoft** has the highest *nuclear* exposure among named operators (12.6 GW), almost entirely from its Goodyear, AZ campuses near Palo Verde Nuclear.
- **Meta** has the most *solar* (16 GW) among the named hyperscalers, but minimal nuclear or wind, consistent with its New Mexico (Los Lunas) and Iowa builds.
- **Google** is split: moderate hydro and natural gas, modest solar and wind.

### Largest grid neighborhoods outside the metro core (top sites by surrounding capacity)

| DC | Operator | Location | Nearby capacity | Fuel highlight |
|---|---|---|---:|---|
| PHX70 / PHX-10 / PHX-11 | Microsoft (Azure) | Goodyear, AZ (RUCA 2) | 14.0–14.6 GW | **4.2 GW nuclear (Palo Verde)** + 6.4 GW gas + 2.2 GW solar |
| Stream PHX-1 | Stream Data Centers | Goodyear, AZ | 13.4 GW | Same Palo Verde / gas mix |
| T5 Charlotte Campus | T5 | Kings Mountain, NC (RUCA 6) | 12.9 GW | **4.9 GW nuclear** (Catawba) + 5.5 GW gas + 1.5 GW coal |
| Apple Maiden | Apple | Maiden, NC (RUCA 2) | 9.1 GW | 2.4 GW nuclear + 4.6 GW gas |
| Percheron DC | Rowan Green Data | (Texas, RUCA 10) | 6.7 GW | **3.0 GW wind** + 0.9 GW hydro + 2.4 GW gas |

---

## 7. Watershed (HUC8) concentration

Each DC sits in exactly one USGS HUC8 watershed (8-digit hydrologic unit, subbasin scale, median ~3,000 sq km). Cooling-water draw and wastewater discharge happen at watershed scale, not state scale, so a single stressed basin can cap an entire DC corridor regardless of how big the state's overall water budget is.

### Where the 1,831 matched DCs land

- **257 distinct HUC8s** hold at least one DC. That is only **12% of the 2,139 US watersheds** (the other 88% have zero data centers).
- **The top 1 watershed alone (Middle Potomac-Catoctin) holds 235 DCs**, 12.8% of the entire US data-center footprint.
- DC concentration is much more extreme at the watershed level than at the state level. Virginia has 20.6% of US DCs; the single Loudoun watershed holds 12.8%.
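The concentration figures in this section are cumulative sums over watersheds ranked by DC count. A minimal sketch that reproduces them from the top-15 counts reported in this section follows. It assumes shares are taken over the 1,831 DCs that matched a watershed; that denominator is an inference from the reported percentages, not something the source states.

```python
from itertools import accumulate

# DC counts for the 15 largest HUC8 watersheds, as reported in this section.
top15_counts = [235, 111, 88, 73, 44, 40, 39, 37, 36, 33, 32, 31, 30, 29, 29]
MATCHED_DCS = 1831  # assumed denominator: DCs with a watershed match


def cumulative_share(counts, total):
    """Rank watersheds by DC count and return (rank, cumulative DCs, share)."""
    ordered = sorted(counts, reverse=True)
    return [
        (rank, running, running / total)
        for rank, running in enumerate(accumulate(ordered), start=1)
    ]


table = cumulative_share(top15_counts, MATCHED_DCS)
for rank, running, share in table:
    if rank in (1, 2, 3, 5, 10, 15):
        print(f"top {rank:>2}: {running:>4} DCs  {share:.1%}")
```

Extending the list with the remaining watershed counts would give the deeper rows (top 20, 30, 50, 100) in the same way.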
### Cumulative concentration

| Top N watersheds | DCs | Share of matched US DCs |
|---:|---:|---:|
| 1 | 235 | 12.8% |
| 2 | 346 | 18.9% |
| 3 | 434 | 23.7% |
| 5 | 551 | 30.1% |
| **10** | **736** | **40.2%** |
| **15** | **887** | **48.4%** |
| 20 | 1,012 | 55.3% |
| 30 | 1,186 | 64.8% |
| 50 | 1,380 | 75.4% |
| 100 | 1,611 | 88.0% |

**Nearly half (48.4%) of all US data centers sit in just 15 watersheds.** Three-quarters sit in 50. Water stress, drought policy, or thermal-discharge limits in any one of these basins propagates to a large share of the national footprint.

### Top 15 watersheds by DC count

| HUC8 | Name | States | DCs | Cluster |
|---|---|---|---:|---|
| 02070008 | Middle Potomac-Catoctin | DC, MD, VA, WV | 235 | Loudoun / Ashburn (DC-Alley) |
| 02070010 | Middle Potomac-Anacostia-Occoquan | DC, MD, VA | 111 | Fairfax + inner Loudoun + DC |
| 18050003 | Coyote | CA | 88 | Silicon Valley / San Jose |
| 05060001 | Upper Scioto | OH | 73 | Columbus (fastest-growing market) |
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | Boardman / Hermiston (hyperscale hydro) |
| 17020015 | Lower Crab | WA | 40 | Quincy / Moses Lake (hyperscale hydro) |
| 17090010 | Tualatin | OR | 39 | Hillsboro (Intel / Google) |
| 12030105 | Upper Trinity | TX | 37 | DFW |
| 10230006 | Big Papillion-Mosquito | IA, NE | 36 | Council Bluffs / Omaha (Meta) |
| 07120004 | Des Plaines | IL, WI | 33 | Chicago metro |
| 12100302 | Medina | TX | 32 | San Antonio |
| 02030105 | Raritan | NJ | 31 | Central NJ carrier hotels |
| 15050100 | Middle Gila | AZ | 30 | Phoenix metro |
| 02030103 | Hackensack-Passaic | NJ, NY | 29 | NYC metro east |
| 17070103 | Umatilla | OR | 29 | AWS-only (all 29 DCs), a pure single-operator basin |

The non-metro / hyperscale Pacific Northwest story is visible at watershed scale: three Columbia-system watersheds (**Middle Columbia-Lake Wallula, Lower Crab, Umatilla**) hold 113 DCs combined, all hyperscale-dominated. Umatilla is operationally **AWS-exclusive**: all 29 DCs in that basin are AWS.

### Non-metro watersheds (RUCA 4–10): where hyperscalers cluster

| HUC8 | Name | States | DCs | Operators |
|---|---|---|---:|---|
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | AWS (multiple variants), Rowan Green Data |
| 17020015 | Lower Crab | WA | 40 | CyrusOne, Intuit, Microsoft, NTT, Sabey, Yahoo |
| 17070103 | Umatilla | OR | 29 | AWS only |
| 17070305 | Lower Crooked | OR | 8 | Meta (Prineville) |
| 13020203 | Rio Grande-Albuquerque | NM | 7 | Meta (Los Lunas) |
| 03050105 | Upper Broad | NC, SC | 6 | Meta |
| 13070001 | Lower Pecos-Red Bluff Reservoir | NM, TX | 5 | IONIC Digital |
| 17070105 | Middle Columbia-Hood | OR, WA | 4 | Google (The Dalles) |
| 02050107 | Upper Susquehanna-Lackawanna | PA | 3 | AWS |
| 03070103 | Upper Ocmulgee | GA | 2 | Meta |

This view is the cleanest evidence yet of the *hyperscale geographic strategy*: single-operator capture of individual watersheds (Umatilla = AWS, Lower Crooked = Meta, Middle Columbia-Hood = Google, Rio Grande-Albuquerque = Meta). Each of these basins has effectively been claimed by one player.

### Implications for water-stress analysis

This watershed view is a **boundary set** for downstream water-stress analysis. Pull WaterWatch streamflow data, USGS water-use estimates, or EPA drought indicators against just these 257 HUC8s (or against just the top 15 for the highest-leverage story). A single-pull stress index against this set would size the "water exposure" of the entire US DC fleet.

---

## Data quality flags

1. **`master_data_centers.power_mw` is populated for only 108 / 1,833 DCs (5.9%).** It is unusable as a sizing metric without imputation or an alternative source. Nearby EIA capacity is the more reliable proxy (used as the per-DC scale in this analysis). A grant-funded scrape of Baxtel / Data Center Map would close this gap.
2. **50 of 190 non-metro DCs (26%) have null `operator`.** Likely hyperscalers based on geography (OR/WA) but unconfirmed.
3. **Operator-string fragmentation**: "Meta" vs. "Meta, Inc."; "Amazon Web Services" vs. "Amazon AWS" vs. "amazon web services"; "Microsoft" vs. "Microsoft Azure". This inflates distinct-operator counts and fragments per-operator totals.
4. ~~`avg_household_size` column has sentinel pollution~~ **Resolved 2026-05-18**: 1,089 sentinel values (−666,666,666) in `_dc_census_tract_acs_2024` and 16 in `data_center_census_tracts_2024` replaced with `NULL`. Affected 29 DCs.
5. **7 DCs failed RUCA join**: Puerto Rico tracts or non-US locations; trivial.
6. **EIA generator coordinates had a longitude sign error for 2008-01 through 2010-11** (~11K rows with positive lower-48 longitudes). The flat-table build at [ingest_eia_energy_layers.py:839-870](../ingest_eia_energy_layers.py#L839-L870) corrects this in `longitude` and `geom`, so spatial joins are unaffected.

---

## Suggested next steps

1. **Backfill `power_mw`** from Baxtel / Data Center Map (paid scrape, grant work).
2. **Operator-string deduplication**: collapse "Meta"/"Meta, Inc.", "AWS" variants, etc., before any per-operator analysis.
3. **Water-stress overlay against the 257 watersheds**: now that the HUC8 join is in place, pull USGS WaterWatch streamflow data, USGS water-use estimates, or EPA drought-status indicators against this watershed set. A single stress index per HUC8 would size the entire US fleet's water exposure.
4. **State-level energy demand context**: `im3_state_projected_moderate_50` and `seds_state_msn_year` are loaded; joining these would let us compute "DC nearby capacity as share of state grid" rather than absolute MW.
5. **Opposition cases overlay**: `opposition_cases_geocoded` is loaded but unused; it could test whether cluster-vs-isolated demographic differences (or watershed concentration) predict community opposition.
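One way the water-stress overlay in step 3 could roll up to a fleet-level number is a DC-weighted mean of a per-HUC8 stress score. The sketch below is illustrative only: the function name is hypothetical, the DC counts come from the top-watersheds table, and the stress scores are placeholders, not real measurements.

```python
def fleet_water_exposure(dc_counts, stress_index):
    """DC-weighted mean stress over watersheds that have a stress score.

    Returns (weighted_mean, covered_dcs, total_dcs) so the caller can see
    how much of the fleet the score actually covers.
    """
    total_dcs = sum(dc_counts.values())
    covered = {huc: n for huc, n in dc_counts.items() if huc in stress_index}
    covered_dcs = sum(covered.values())
    if covered_dcs == 0:
        raise ValueError("no watershed has both a DC count and a stress score")
    weighted_mean = (
        sum(n * stress_index[huc] for huc, n in covered.items()) / covered_dcs
    )
    return weighted_mean, covered_dcs, total_dcs


# DC counts per HUC8 from the top-watersheds table (first four rows).
dc_counts = {
    "02070008": 235,  # Middle Potomac-Catoctin
    "02070010": 111,  # Middle Potomac-Anacostia-Occoquan
    "18050003": 88,   # Coyote
    "05060001": 73,   # Upper Scioto
}

# PLACEHOLDER stress scores on a 0-1 scale. These are made-up values for
# illustration, NOT real data; one watershed is deliberately left unscored.
placeholder_stress = {"02070008": 0.2, "02070010": 0.4, "18050003": 0.9}

score, covered_dcs, total_dcs = fleet_water_exposure(dc_counts, placeholder_stress)
print(f"exposure={score:.3f} over {covered_dcs}/{total_dcs} DCs")
```

Reporting coverage alongside the score matters because any real stress source is unlikely to cover all 257 watersheds.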