Extends the demographic/RUCA/energy summary with two new sections:

- §7 quantifies each top-DC state's "share of state capacity within 50 km of a DC," surfacing NJ (83%), NV (75%), TN (70%), and OR (68%) as the most DC-saturated grids, reframing the canonical VA-centric story by structural entanglement rather than raw count.
- §9 inventories every table in the data_centers schema with a one-line description, flagging cleanup candidates and unused layers for downstream work.

Also renumbers watershed analysis to §8, adds the SEDS row to the dataset coverage table, and narrows next-step #4 to the IM3 projection overlay (now that the SEDS join is complete).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# US Data Centers: Demographic, Urban-Rural & Energy Context Analysis

**Date:** 2026-05-18

**Notebook:** [cluster_analysis.ipynb](../cluster_analysis.ipynb)

**Universe:** 1,833 data centers in `public.master_data_centers`, joined to ACS-2024 demographics, USDA RUCA-2020 codes, USGS HUC8 watersheds, and EIA operating-generator capacity (50 km radius, latest period 2026-02, status=OP).

---
## Headline findings

1. **DC tracts are richer, more educated, and more diverse than the US average.** Median household income $103,623 vs. national $78,538 (+32%); 49% bachelor's+ vs. 35% (+14 pp); poverty rate 7.2% vs. 12.4%. Non-Hispanic white share is *below* national (50% vs. 58%), driven by Asian-heavy (mean 13% vs. 6%) and Hispanic-significant tracts.
2. **The metro skew is more modest than expected: 1.11×.** 89% of DCs sit in metropolitan tracts, but 80% of *all* US tracts are metropolitan, so DCs are only slightly more concentrated than the underlying tract distribution would predict.
3. **The non-metro tail is overwhelmingly hyperscale and Pacific Northwest.** Of 190 DCs outside metropolitan tracts (RUCA 4–10), AWS owns 67, Meta 22, Microsoft 10, Google 4, and Yahoo 2, a combined 55% of the non-metro footprint. Oregon (86) and Washington (40) alone hold 66% of non-metro DCs, anchored to the Columbia River hydropower corridor.
4. **Clustered DCs are demographically distinct from isolated ones.** DCs in DBSCAN clusters (n=1,583) sit in tracts with $108K median income vs. $73K for isolated DCs (n=250), a $35K gap. Clustered DCs are more educated (+18 pp bachelor's), more diverse (−25 pp non-Hispanic white), and embedded in much denser energy infrastructure (89 vs. 40 generators within 50 km).
5. **Microsoft co-locates with the largest US nuclear plant.** Microsoft's Goodyear, AZ campus has 14.6 GW of generation within 50 km, including 4.2 GW from Palo Verde Nuclear, the largest in the US. Although the campus sits in a RUCA-2 "Metro high-commute" tract (not strictly metro core), the surrounding grid is the densest by capacity in our analysis.
6. **Extreme watershed concentration: nearly half of all US DCs (48.4%) sit in just 15 of 2,139 HUC8 watersheds.** A single watershed, Middle Potomac-Catoctin (Loudoun County), holds 235 DCs (12.8% of the US total). The top 2 (both DC-Alley watersheds) hold 18.9%; the top 10 hold 40%. Water stress in any one of these basins propagates to a large share of national DC capacity.

---
## Dataset coverage and joins

| Source table | Rows | Join key | Coverage |
|---|---|---|---|
| `master_data_centers` | 1,833 | base | — |
| `master_data_center_spatial_clusters` | 1,831 | `master_id` | 99.9% |
| `_dc_census_tract_acs_2024` | ~73,000 tracts | `geoid` | 1,807 matched (98.6%) |
| `ruca_codes_2020_tract` | 85,528 tracts | `tract_fips_20 = geoid` | 1,826 matched (99.6%) |
| `watershed_huc8` | 2,139 watersheds | `ST_Contains(w.geom, m.geom)` | 1,831 matched (99.9%) |
| `energy_eia_operating_generator_capacity_flat` | 4.7M rows | `ST_DWithin(geom, 50km)` | 1,831 DCs have ≥1 nearby gen |
| `energy_eia_seds_flat` (annual, 1960–2024) | 2.57M rows | `state_id` | Used in §7 for state electricity consumption (series `ESTCB`, 2024) |

Energy aggregation uses period `2026-02` only with `status='OP'`, summing `nameplate_capacity_mw` for operating generators within 50 km of each DC. Note: EIA capacity columns were added to `energy_eia_operating_generator_capacity_flat` on 2026-05-17; before that, the `_flat` table had no MW values despite its name. SEDS was backfilled on 2026-05-18 (the initial smoke test had only 50 rows).

---
## 1. Demographic profile of DC tracts (n=1,807 with non-null ACS)

| Metric | DC tract (median) | DC tract (mean) | US avg | Δ mean vs. US |
|---|---:|---:|---:|---:|
| Median household income | $103,623 | $114,543 | $78,538 | **+$36,005** |
| Per-capita income | $51,283 | $55,725 | $43,313 | +$12,412 |
| Poverty rate | 7.2% | 10.1% | 12.4% | −2.3 pp |
| Unemployment rate | 3.5% | 4.4% | 5.4% | −1.0 pp |
| Bachelor's+ % | 49.3% | 46.2% | 35.0% | **+11.2 pp** |
| Broadband subscription % | 94.9% | 93.5% | 89.0% | +4.5 pp |
| Non-Hispanic white % | 50.2% | 51.0% | 58.4% | **−7.4 pp** |
| Hispanic / Latino % | 12.8% | 19.5% | 19.5% | 0.0 pp |
| Non-Hispanic Black % | 5.9% | 10.6% | 12.1% | −1.5 pp |
| Non-Hispanic Asian % | 6.4% | 13.4% | 6.4% | **+7.0 pp** |

**Interpretation.** DC tracts skew high-income, highly educated, broadband-connected, and racially diverse (specifically Asian-heavy). The racial composition is notable: DC tracts are *less* non-Hispanic white than the national average, not more. This reflects DC siting in mixed-race coastal and exurban tech corridors (Bay Area, Northern Virginia, Seattle) rather than in homogeneous suburbs.

**Data quality note.** `avg_household_size` previously contained ACS sentinel values (−666,666,666) for 1,089 zero-population tracts in `_dc_census_tract_acs_2024` (29 of which contained DCs), plus 16 rows in `data_center_census_tracts_2024`. As of 2026-05-18, those sentinels have been replaced with `NULL`. The column now has a plausible range (min 1.00, max 9.33) and a usable mean.

---
## 2. Geographic concentration (top 15 states)

| State | DC count | Total power_mw (where known) | Median HH income | Median bachelor's % | Median % white | Notes |
|---|---:|---:|---:|---:|---:|---|
| **VA** | **378** | 255 | $141,250 | 62.6% | 42.5% | Loudoun / DC-Alley dominance (20.6% of all US DCs) |
| TX | 162 | 597 | $88,228 | 46.2% | 32.0% | DFW + Austin + San Antonio |
| CA | 147 | 130 | $164,928 | 56.4% | 22.4% | Bay Area + LA basin |
| OR | 145 | 125 | $72,719 | 20.0% | 63.2% | Columbia River hydro corridor (rural) |
| OH | 103 | 135 | $128,875 | 47.0% | 74.5% | Columbus boom (fastest-rising market) |
| WA | 93 | 70 | $91,623 | 21.9% | 40.3% | Quincy/Wenatchee + Seattle |
| AZ | 69 | 54 | $85,335 | 35.2% | 51.6% | Phoenix/Goodyear hyperscale |
| IA | 65 | 0 | $93,393 | 34.3% | 88.1% | 88% white (rural Midwest) |
| NJ | 62 | 98 | $147,321 | 59.4% | 32.9% | NYC-metro carrier hotels |
| IL | 61 | 128 | $96,191 | 52.9% | 52.0% | Chicago metro |
| GA | 50 | 241 | $101,176 | 51.4% | 31.6% | Atlanta + high-power rural builds |
| NY | 48 | 47 | $77,465 | 47.6% | 74.8% | NYC + upstate |
| NV | 41 | 0 | $93,409 | 31.2% | 34.6% | Reno + Las Vegas |
| TN | 32 | 0 | — | — | 54.8% | Nashville + Memphis (newly visible after state backfill) |
| NC | 31 | 56 | $82,708 | 44.7% | 59.6% | Charlotte + Catawba (nuclear-adjacent) |

**Virginia alone holds 20.6% of all US DCs** (378 of 1,833), with one of the most affluent tract profiles in the top 15 (highest bachelor's share, third-highest median income after CA and NJ), a Loudoun County effect. The state backfill substantially elevated **Ohio (76 → 103)** and **Texas (135 → 162)**, pushing TX into the #2 slot. The previously uncounted **Tennessee (32) now appears in the top 15**.

Oregon and Washington tracts look notably different from the urban-heavy states (lower income, lower education, lower broadband, higher Hispanic share), reflecting their rural Columbia River siting.

---
## 3. Spatially clustered DCs vs. isolated DCs

DBSCAN cluster assignment from `master_data_center_spatial_clusters` (1,583 clustered, 250 isolated):

| Metric (median) | Isolated | In cluster | Δ |
|---|---:|---:|---:|
| Median household income | $73,500 | $108,359 | **+$34,859** |
| Bachelor's+ % | 33.2 | 51.2 | **+18.0 pp** |
| Poverty rate | 11.6 | 6.9 | −4.7 pp |
| Non-Hispanic white % | 71.0 | 45.9 | **−25.1 pp** |
| EIA generators within 50 km | 40 | 89 | +49 |
| EIA capacity within 50 km (MW) | 2,176 | 3,300 | +1,125 |

**Reading.** A clustered data center sits, at the median, in a tract that is ~$35K richer, 18 pp more educated, and 25 pp less non-Hispanic white than an isolated one, and is surrounded by more than twice as many generators (89 vs. 40) and about 50% more generation capacity. The isolated set looks like rural / small-town America (whiter, poorer, less educated); the clustered set looks like coastal exurban tech corridors.

---
## 4. RUCA (urban-rural) distribution

National baseline of all US tracts: 80% Metropolitan, 9% Micropolitan, 3% Small town, 8% Rural.

| RUCA band | DCs | DC % | US tract % | Over-index |
|---|---:|---:|---:|---:|
| Metropolitan (1–3) | 1,636 | 89.3% | 80.1% | **1.11×** |
| Micropolitan (4–6) | 98 | 5.3% | 9.0% | 0.59× |
| Small town (7–9) | 15 | 0.8% | 2.9% | 0.28× |
| Rural (10) | 77 | 4.2% | 7.6% | 0.55× |
| Unknown / missed match | 7 | 0.4% | — | — |

**Reading.** The metro skew is real but mild: 1.11×. The eye-catching pattern is in the tail: **rural tracts (RUCA 10) hold five times as many DCs as small-town tracts (77 vs. 15) and nearly as many as micropolitan tracts (98)**, consistent with a hyperscale greenfield model that bypasses small-city economies in favor of remote, cheap-power, low-population sites.
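The over-index column is each band's share of DCs divided by its share of US tracts. A minimal sketch that recomputes it from the counts and tract shares in the table above (the function name `over_index` is illustrative and not taken from the notebook):

```python
# Counts and US tract shares copied from the RUCA band table above.
TOTAL_DCS = 1_833

bands = {
    "Metropolitan (1-3)": (1_636, 80.1),
    "Micropolitan (4-6)": (98, 9.0),
    "Small town (7-9)": (15, 2.9),
    "Rural (10)": (77, 7.6),
}


def over_index(dc_count: int, us_tract_pct: float, total: int = TOTAL_DCS) -> float:
    """Ratio of a band's DC share to its share of all US tracts."""
    dc_pct = 100 * dc_count / total
    return dc_pct / us_tract_pct


ratios = {name: round(over_index(n, pct), 2) for name, (n, pct) in bands.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio above 1 means the band holds more DCs than its tract share alone would suggest; only the metropolitan band clears that bar.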
### Per-RUCA-code drilldown

| RUCA | Description | DCs | Median HH income | Median pop density (per sq mi) | Median EIA gens (50km) |
|---:|---|---:|---:|---:|---:|
| 1 | Metro core | 1,425 | $110,333 | 1,859 | 97 |
| 2 | Metro high-commute | 206 | $105,404 | 96 | 49 |
| 3 | Metro low-commute | 5 | $119,495 | 22 | 23 |
| 4 | Micropolitan core | 54 | $63,698 | 312 | 53 |
| 5 | Micropolitan high-commute | 22 | $72,465 | 191 | 51 |
| 6 | Micropolitan low-commute | 22 | $72,719 | 69 | 59 |
| 7 | Small town core | 14 | $87,522 | 2,336 | 40 |
| 8 | Small town high-commute | 1 | $69,074 | 36 | 41 |
| 10 | Rural area | 77 | $93,820 | 12 | 42 |

**Two surprises:**

- Rural DCs (RUCA 10) sit in tracts with $93.8K median income, *higher* than micropolitan DCs ($63.7K–$72.7K). The rural DC sites are not poor rural America; they are wealthy-by-rural-standards areas chosen for power and water access.
- Micropolitan-core DCs (RUCA 4) have the *lowest* median income at $63.7K, the closest thing to "economic-development DC siting" in the dataset.

---
## 5. Non-metro deep dive (190 DCs, RUCA 4–10)

### Operators

| Operator | Non-metro DCs |
|---|---:|
| Amazon Web Services | 67 |
| *(null operator)* | 50 |
| Meta | 20 (+ 2 as "Meta, Inc.") |
| Microsoft | 10 |
| Google | 4 |
| Rowan Green Data | 4 |
| NTT | 2 |
| Yahoo | 2 |
| Amazon AWS *(dupe)* | 2 |

**The five hyperscalers (AWS, Meta, Microsoft, Google, Yahoo) account for 105 of 190 non-metro DCs (55%).** If the 50 null-operator rows skew similarly hyperscale (likely, since they are disproportionately in OR/WA), the share is probably closer to 75%.
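The 55% and ~75% figures can be bracketed with simple arithmetic. The sketch below uses only the counts in the operator table; treating each null-operator row as either hyperscale or not is a simplifying assumption, and the two "Amazon AWS" rows are left out to match the 105 total quoted above.

```python
NON_METRO_TOTAL = 190
NULL_OPERATOR = 50

# Named hyperscaler rows from the operator table
# (Meta includes the 2 rows labeled "Meta, Inc.").
hyperscaler_counts = {"AWS": 67, "Meta": 22, "Microsoft": 10, "Google": 4, "Yahoo": 2}
named = sum(hyperscaler_counts.values())


def share(hyperscale_null_rows: int) -> float:
    """Hyperscaler share if this many null-operator rows are hyperscale."""
    return (named + hyperscale_null_rows) / NON_METRO_TOTAL


lower = share(0)              # no null rows are hyperscale
upper = share(NULL_OPERATOR)  # every null row is hyperscale

# Null rows that would have to be hyperscale to reach a 75% share.
needed_for_75 = 0.75 * NON_METRO_TOTAL - named

print(f"lower bound: {lower:.1%}")
print(f"upper bound: {upper:.1%}")
print(f"null rows needed for 75%: {needed_for_75:.1f} of {NULL_OPERATOR}")
```

Under these assumptions the share lies between about 55% and 82%, and the 75% estimate corresponds to roughly three-quarters of the null rows (about 38 of 50) being hyperscale.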
### States (post-backfill)

| State | Non-metro DCs |
|---|---:|
| Oregon | 86 |
| Washington | 40 |
| Texas | 9 |
| New Mexico | 7 |
| North Carolina | 6 |
| Pennsylvania | 5 |
| Wisconsin | 4 |
| New York | 3 |
| Tennessee | 3 |
| Georgia | 3 |

**Oregon + Washington = 126 (66%) of all non-metro DCs.** This is the Columbia River basin: Prineville / Hermiston / Boardman / The Dalles (OR) and Quincy / East Wenatchee / Moses Lake (WA). The pull is hydroelectric power (cheap, low-carbon, abundant) and a cool, dry climate (free cooling).

The state backfill clarified the rest of the non-metro tail: **Texas (9)** and **Pennsylvania (5)** were previously hidden in the null bucket. These likely represent shale-gas-adjacent builds (Permian and Marcellus respectively).

---
## 6. Energy footprint by operator (using EIA capacity within 50 km)

Aggregated across DCs in RUCA 2–10 (i.e., anything outside the dense metro core, n=401):

| Operator | DCs | States | Total nearby capacity (GW) | Median per site (GW) | Hydro (GW) | Nuclear (GW) | NG (GW) | Solar (GW) | Wind (GW) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| AWS | 93 | 5 | 397 | 4.8 | 66 | 2.5 | 201 | 4.6 | **114** |
| *(Unknown)* | 118 | 26 | 339 | 2.3 | 86 | 35 | 135 | 23 | 19 |
| Meta | 51 | 11 | 120 | 2.0 | 4.9 | 0 | 61 | 16 | 0.3 |
| Microsoft | 26 | 6 | 113 | 3.4 | 28 | **13** | 39 | 9.1 | 8.1 |
| Google | 31 | 5 | 100 | 3.9 | 14 | 0 | 43 | 3.6 | 4.7 |
| Apple | 5 | 2 | 4 | 0.6 | 1.6 | 0 | 1.1 | 0.9 | 0.4 |
| Yahoo | 2 | 1 | 7 | 3.5 | 6.4 | 0 | 0 | 0 | 0.7 |

These totals sum each site's 50 km neighborhood, so a generator near several sites is counted once per site. Read them as aggregate exposure rather than distinct capacity.

**Distinct hyperscaler strategies, visible in the fuel mix:**

- **AWS** has aggregated 114 GW of *wind* exposure across its 93 sites, by far the most renewable-coupled portfolio. It also has heavy hydro (66 GW) from its OR/WA footprint and 201 GW of natural gas as baseline.
- **Microsoft** has the highest *nuclear* exposure among named operators (12.6 GW), almost entirely from its Goodyear, AZ campuses near Palo Verde Nuclear.
- **Meta** has the most *solar* (16 GW) among the named hyperscalers, but minimal nuclear or wind, consistent with its New Mexico (Los Lunas) and Iowa builds.
- **Google** is split: moderate hydro and natural gas, modest renewables.

### Largest non-metro grid neighborhoods (top sites by surrounding capacity)

| DC | Operator | Location | Nearby capacity | Fuel highlight |
|---|---|---|---:|---|
| PHX70 / PHX-10 / PHX-11 | Microsoft (Azure) | Goodyear, AZ (RUCA 2) | 14.0–14.6 GW | **4.2 GW nuclear (Palo Verde)** + 6.4 GW gas + 2.2 GW solar |
| Stream PHX-1 | Stream Data Centers | Goodyear, AZ | 13.4 GW | Same Palo Verde / gas mix |
| T5 Charlotte Campus | T5 | Kings Mountain, NC (RUCA 6) | 12.9 GW | **4.9 GW nuclear** (Catawba) + 5.5 GW gas + 1.5 GW coal |
| Apple Maiden | Apple | Maiden, NC (RUCA 2) | 9.1 GW | 2.4 GW nuclear + 4.6 GW gas |
| Percheron DC | Rowan Green Data | (Texas, RUCA 10) | 6.7 GW | **3.0 GW wind** + 0.9 GW hydro + 2.4 GW gas |

---
## 7. State grid context: how DC-saturated is each top state?

Section 6 shows DC-adjacent capacity in absolute terms, which is hard to interpret without knowing the size of the state grid. Using the EIA operating-generator-capacity table (OGC, `energy_eia_operating_generator_capacity_flat`) for state-total generating capacity (period `2026-02`, status `OP`) and SEDS series `ESTCB` for 2024 in-state electricity consumption, we can express each state's DC footprint as a **share of its own grid**.

The "DC-adjacent capacity" column sums distinct in-state generators (i.e., no double-counting) whose 50 km neighborhood includes at least one in-state data center.
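One way to read that definition: take the union of generators that fall within 50 km of any in-state DC, sum each generator's capacity once, and divide by the state total. The sketch below illustrates this on made-up coordinates with a haversine distance. The report's own join runs in PostGIS (`ST_DWithin`), so every function name, coordinate, and MW value here is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
RADIUS_KM = 50.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def dc_adjacent_share(dcs, generators, radius_km=RADIUS_KM):
    """Share of total capacity from generators within radius_km of any DC.

    dcs: list of (lat, lon); generators: list of (gen_id, lat, lon, mw).
    Each generator is counted at most once, however many DCs are nearby.
    """
    total_mw = sum(mw for _, _, _, mw in generators)
    adjacent_mw = sum(
        mw
        for _, glat, glon, mw in generators
        if any(haversine_km(glat, glon, dlat, dlon) <= radius_km for dlat, dlon in dcs)
    )
    return adjacent_mw / total_mw


# Hypothetical toy state: two DCs close together, three generators.
toy_dcs = [(40.00, -75.00), (40.05, -75.05)]
toy_generators = [
    ("g1", 40.10, -75.10, 600.0),  # within 50 km of both DCs
    ("g2", 40.30, -75.00, 200.0),  # roughly 33 km north of the first DC
    ("g3", 42.00, -75.00, 200.0),  # more than 200 km away
]

print(f"{dc_adjacent_share(toy_dcs, toy_generators):.0%}")
```

In the toy data, `g1` is near both DCs but contributes its 600 MW once, so the share is 800 / 1,000 = 80%.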
| State | DCs | State grid (GW) | State elec. consumption (TWh, 2024) | DC-adjacent capacity (GW) | **% of state capacity within 50 km of a DC** |
|---|---:|---:|---:|---:|---:|
| VA | 378 | 30.8 | 138.0 | 15.4 | 50% |
| TX | 162 | 194.2 | 505.3 | 61.4 | 32% |
| CA | 147 | 105.1 | 245.6 | 51.6 | 49% |
| **OR** | 145 | 17.2 | 59.7 | 11.7 | **68%** |
| OH | 103 | 34.4 | 153.7 | 12.7 | 37% |
| WA | 93 | 29.6 | 90.0 | 7.9 | 27% |
| AZ | 69 | 40.1 | 90.8 | 22.5 | 56% |
| IA | 65 | 24.6 | 54.9 | 4.9 | 20% |
| **NJ** | 62 | 17.8 | 73.5 | 14.7 | **83%** |
| IL | 61 | 51.7 | 133.2 | 17.4 | 34% |
| GA | 50 | 42.3 | 150.0 | 14.2 | 34% |
| NY | 48 | 42.7 | 140.5 | 25.8 | 61% |
| **NV** | 41 | 18.7 | 40.7 | 14.0 | **75%** |
| **TN** | 32 | 23.3 | 102.9 | 16.4 | **70%** |
| NC | 31 | 38.9 | 136.9 | 17.4 | 45% |

**The DC-saturation reordering.** Virginia leads in raw DC count (378), but four states have grids where *more than two-thirds* of all in-state generating capacity sits within 50 km of a data center:

- **New Jersey (83%).** Most of the state's generating capacity is DC-adjacent. NJ's 62 DCs are NYC-metro carrier hotels concentrated in a small geographic footprint relative to a small state grid (17.8 GW).
- **Nevada (75%).** Las Vegas and Reno DCs co-locate with the gas-and-solar generation that serves the Las Vegas metro. NV has a small grid (18.7 GW) and most of it serves the same two metros.
- **Tennessee (70%).** Nashville + Memphis DCs sit near TVA's central generation belt.
- **Oregon (68%).** Even though OR's DC cluster is mostly non-metro (Boardman / Hermiston / The Dalles), the Columbia hydro corridor serving them accounts for two-thirds of OR's 17.2 GW grid. Of these four states, it is the only one where the saturation comes from rural hyperscale builds rather than urban carrier hotels.

**The opposite end.** **Iowa (20%)** has 65 DCs, but they all cluster around Council Bluffs / Des Moines, leaving the rural wind belt that dominates IA's grid outside any DC's 50 km neighborhood. **Washington (27%)** is similar: the Quincy hyperscale cluster is small relative to WA's Columbia hydro and Puget-area generation.

**Why the proportional view matters.** A 1 GW DC load lands very differently on the NJ grid (5.6% of total capacity) than on the TX grid (0.5%). Reliability, interconnection-queue waits, and political pushback all scale with the proportional draw, not the absolute MW. By that yardstick, the canonical "VA dominates US DCs" story is incomplete: VA, NJ, OR, NV, TN, NY, and AZ (all at 50% or more) are the states where the DC industry is *structurally entangled* with the grid, and where any large new build runs into capacity-share constraints first.

---
## 8. Watershed (HUC8) concentration

Each DC sits in exactly one USGS HUC8 watershed (8-digit hydrologic unit, subbasin scale, median ~3,000 sq km). Cooling-water draw and wastewater discharge happen at watershed scale, not state scale: a single stressed basin can cap an entire DC corridor regardless of how big the state's overall water budget is.

### Where the 1,831 matched DCs land

- **257 distinct HUC8s** hold at least one DC. That is only **12% of the 2,139 US watersheds** (the other 88% have zero data centers).
- **The top watershed alone (Middle Potomac-Catoctin) holds 235 DCs**, 12.8% of the entire US data-center footprint.
- DC concentration is much more extreme at the watershed level than at the state level. Virginia has 20.6% of US DCs; the single Loudoun watershed holds 12.8%.

### Cumulative concentration

| Top N watersheds | DCs | Share of all US DCs |
|---:|---:|---:|
| 1 | 235 | 12.8% |
| 2 | 346 | 18.9% |
| 3 | 434 | 23.7% |
| 5 | 551 | 30.1% |
| **10** | **736** | **40.2%** |
| **15** | **887** | **48.4%** |
| 20 | 1,012 | 55.3% |
| 30 | 1,186 | 64.8% |
| 50 | 1,380 | 75.4% |
| 100 | 1,611 | 88.0% |

**Nearly half of all US data centers (48.4%) sit in just 15 watersheds,** and three-quarters in 50. Water stress, drought policy, or thermal-discharge limits in any one of these basins propagates to a large share of the national footprint.
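The cumulative shares can be reproduced from the per-watershed counts. The sketch below uses the top-15 counts listed in this section and assumes the denominator is the 1,831 matched DCs rather than all 1,833. That is an inference, not something the notebook states: the 20- and 30-watershed rows (55.3%, 64.8%) only round correctly with 1,831.

```python
from itertools import accumulate

MATCHED_DCS = 1_831  # assumed denominator: DCs with a HUC8 match

# DC counts for the 15 largest watersheds, in descending order.
top15_counts = [235, 111, 88, 73, 44, 40, 39, 37, 36, 33, 32, 31, 30, 29, 29]

# Running total of DCs as watersheds are added from largest to smallest.
cumulative = list(accumulate(top15_counts))
shares = {
    n: round(100 * cumulative[n - 1] / MATCHED_DCS, 1)
    for n in (1, 2, 3, 5, 10, 15)
}

for n, pct in shares.items():
    print(f"top {n:>2}: {cumulative[n - 1]:>3} DCs, {pct}% of matched DCs")
```

The same running-total pattern extends to the full 257-watershed list once the per-HUC8 counts are exported.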
### Top 15 watersheds by DC count

| HUC8 | Name | States | DCs | Cluster |
|---|---|---|---:|---|
| 02070008 | Middle Potomac-Catoctin | DC, MD, VA, WV | 235 | Loudoun / Ashburn (DC-Alley) |
| 02070010 | Middle Potomac-Anacostia-Occoquan | DC, MD, VA | 111 | Fairfax + inner Loudoun + DC |
| 18050003 | Coyote | CA | 88 | Silicon Valley / San Jose |
| 05060001 | Upper Scioto | OH | 73 | Columbus (fastest-growing market) |
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | Boardman / Hermiston (hyperscale hydro) |
| 17020015 | Lower Crab | WA | 40 | Quincy / Moses Lake (hyperscale hydro) |
| 17090010 | Tualatin | OR | 39 | Hillsboro (Intel / Google) |
| 12030105 | Upper Trinity | TX | 37 | DFW |
| 10230006 | Big Papillion-Mosquito | IA, NE | 36 | Council Bluffs / Omaha (Meta) |
| 07120004 | Des Plaines | IL, WI | 33 | Chicago metro |
| 12100302 | Medina | TX | 32 | San Antonio |
| 02030105 | Raritan | NJ | 31 | Central NJ carrier hotels |
| 15050100 | Middle Gila | AZ | 30 | Phoenix metro |
| 02030103 | Hackensack-Passaic | NJ, NY | 29 | NYC metro east |
| 17070103 | Umatilla | OR | 29 | AWS-only (all 29 DCs); pure single-operator basin |

The non-metro / hyperscale Pacific Northwest story is visible at watershed scale: three Columbia-system watersheds (**Middle Columbia-Lake Wallula, Lower Crab, Umatilla**) hold 113 DCs combined, all hyperscale-dominated. Umatilla is operationally **AWS-exclusive**: all 29 DCs in that basin are AWS.
### Non-metro watersheds (RUCA 4–10): where hyperscalers cluster

| HUC8 | Name | States | DCs | Operators |
|---|---|---|---:|---|
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | AWS (multiple variants), Rowan Green Data |
| 17020015 | Lower Crab | WA | 40 | CyrusOne, Intuit, Microsoft, NTT, Sabey, Yahoo |
| 17070103 | Umatilla | OR | 29 | AWS only |
| 17070305 | Lower Crooked | OR | 8 | Meta (Prineville) |
| 13020203 | Rio Grande-Albuquerque | NM | 7 | Meta (Los Lunas) |
| 03050105 | Upper Broad | NC, SC | 6 | Meta |
| 13070001 | Lower Pecos-Red Bluff Reservoir | NM, TX | 5 | IONIC Digital |
| 17070105 | Middle Columbia-Hood | OR, WA | 4 | Google (The Dalles) |
| 02050107 | Upper Susquehanna-Lackawanna | PA | 3 | AWS |
| 03070103 | Upper Ocmulgee | GA | 2 | Meta |

This view is the cleanest evidence yet of the *hyperscale geographic strategy*: single-operator capture of individual watersheds (Umatilla = AWS, Lower Crooked = Meta, Middle Columbia-Hood = Google, Rio Grande-Albuquerque = Meta). Each of these basins has effectively been claimed by one player.
### Implications for water-stress analysis

This watershed view is a **boundary set** for downstream water-stress analysis. Pull WaterWatch streamflow data, USGS water-use estimates, or EPA drought indicators against just these 257 HUC8s (or against just the top 15 for the highest-leverage story). A single-pull stress index against this set would size the "water exposure" of the entire US DC fleet.

---
## 9. Database inventory (`data_centers` database, `public` schema)

All tables in the working database as of 2026-05-18. "Used here" = referenced in §1–§8 of this report. PostGIS internal tables (`spatial_ref_sys`, `geography_columns`, `geometry_columns`) are omitted.

### Data center inventory and clustering

| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `master_data_centers` | 1,833 | ✓ | Unified, deduplicated DC inventory: the canonical row-per-DC table joining curated, OSM, and sample sources via `master_id`. |
| `osm_data_centers` | 1,549 | — | Raw OSM-derived DC features (nodes/ways tagged as data centers), one of the inputs to `master_data_centers`. |
| `us_dc_sample_geocoded` | 1,489 | — | Earlier sample-list DC inventory with geocoding lineage (Nominatim + Census TIGER), superseded by `master_data_centers` but retained for provenance. |
| `data_centers_union` (view) | — | — | Convenience view unioning the curated and OSM source rows with a `source` discriminator. |
| `master_data_center_spatial_clusters` | 1,831 | ✓ | DBSCAN cluster assignment per DC (`cluster_id`, noise flag), used in §3. |
### Per-DC join tables

| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `data_center_census_tracts_2024` | 1,815 | ✓ | One row per DC with attached ACS-2024 demographics from its containing tract; the master demographic join. |
| `data_center_watershed_huc8` | 1,833 | ✓ | One row per DC with its containing USGS HUC8 watershed (`huc8`, name, states, area), built 2026-05-18 via `ST_Within`. |
### Base geographic / demographic layers

| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `_dc_census_tract_acs_2024` | 85,382 | ✓ | Staging: ACS-2024 5-year profile attributes for every US tract that contains a DC (and surrounding tracts for context). |
| `_dc_census_tract_boundaries_2024` | 85,058 | — | Staging: TIGER 2024 tract polygons for the DC-tract universe. |
| `ruca_codes_2020_tract` | 85,528 | ✓ | USDA RUCA 2020 codes per tract, the metro/micropolitan/rural classification used in §4–§5. |
| `watershed_huc8` | 2,139 | ✓ | USGS Watershed Boundary Dataset HUC8 subbasin polygons (median ~3,000 km²) covering CONUS + AK. |
| `_watershed_huc8_stage` | 369 | — | Staging table from an earlier partial WBD load, superseded by `watershed_huc8`. Candidate for cleanup. |
| `census_tract_huc8_link` | 806 | — | Tract↔HUC8 spatial overlap table (with overlap %) for the subset of tracts containing a DC. Useful for downstream tract-level water-stress joins. |
### Energy data

| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `energy_eia_operating_generator_capacity_flat` | 4.7M | ✓ | EIA Form-860 operating generator inventory, monthly 2008–2026, with nameplate / summer / winter MW and point geometry. Source for §6 and §7 capacity figures. |
| `energy_eia_seds_flat` | 2.57M | ✓ | EIA SEDS annual state energy series 1960–2024 (consumption, prices, expenditures by sector / fuel). Source for §7 state electricity consumption (`ESTCB`, 2024). Backfilled 2026-05-18. |
| `energy_atlas_layers_catalog` | ~5 | — | Metadata catalog of EIA layers ingested by `ingest_eia_energy_layers.py` (table name, source URL, import timestamp). |
| `im3_state_projected_moderate_50` | 328 | — | PNNL IM3 projected DC siting under the moderate-growth scenario at gravity-weight 0.50; one row per projected facility (cost, IT MW, cooling-water demand, lat/lon). Loaded but unused. |
| `im3_projected_state_demand_summary` | 31 | — | State-level rollup of IM3 projected facility counts, IT MW, and cooling demand. Loaded but unused. |
| `seds_national_msn_year` | 0 | — | Empty placeholder for national SEDS time-series; superseded by `energy_eia_seds_flat`. Drop candidate. |
| `seds_state_msn_year` | 0 | — | Empty placeholder for state SEDS time-series; superseded by `energy_eia_seds_flat`. Drop candidate. |
| `utility_rate_tracker_2025_2028` | 374 | — | Utility rate-increase tracker by provider × state × service type, with effective dates and monthly $ + % increases. Loaded but unused in the demographic/energy analysis. |
### Connectivity (submarine cables, exchange capacity)

| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `internet_cables` | 693 | — | Submarine cable routes (geometry, RFS year, decommission year, owners, length km) from TeleGeography-style data. |
| `internet_cable_landing_points` | 3,361 | — | Cable landing points (country, name, TBD flag); endpoint nodes for `internet_cables`. |
| `internet_cable_meta` | 2 | — | Source-provenance metadata for the cable dataset (key/value). |
| `internet_cable_year_summaries` | 58 | — | Year-by-year narrative descriptions of cable activity. |
| `internet_city_dominance` | 4,552 | — | City-level physical capacity (Tbps), logical-dominance IP count, and top ASNs; a proxy for internet-hub strength of each candidate DC city. |
### Other

| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `opposition_cases_geocoded` | 18 | — | Geocoded community-opposition cases against DC builds (developer, investment $B, outcome, governance response). Loaded but unused; see next-steps item #5. |

**Cleanup candidates.** `_watershed_huc8_stage`, `seds_national_msn_year`, `seds_state_msn_year`, and possibly `us_dc_sample_geocoded` are superseded by their canonical counterparts and could be dropped to reduce confusion.

---
## Data quality flags

1. **`master_data_centers.power_mw` is populated for only 108 / 1,833 DCs (5.9%).** It is unusable as a sizing metric without imputation or an alternative source. Nearby EIA capacity is the more reliable proxy (used as the per-DC scale in this analysis). A grant-funded scrape of Baxtel / Data Center Map would close this gap.
2. **50 of 190 non-metro DCs (26%) have null `operator`.** Likely hyperscalers based on geography (OR/WA) but unconfirmed.
3. **Operator-string fragmentation**: "Meta" vs. "Meta, Inc."; "Amazon Web Services" vs. "Amazon AWS" vs. "amazon web services"; "Microsoft" vs. "Microsoft Azure". This inflates distinct-operator counts and fragments per-operator totals.
4. ~~`avg_household_size` column has sentinel pollution~~ **Resolved 2026-05-18**: 1,089 sentinel values (−666,666,666) in `_dc_census_tract_acs_2024` and 16 in `data_center_census_tracts_2024` replaced with `NULL`. Affected 29 DCs.
5. **7 DCs failed the RUCA join**: Puerto Rico tracts or non-US locations; trivial.
6. **EIA generator coordinates had a longitude sign error for 2008-01 through 2010-11** (~11K rows with positive lower-48 longitudes). The flat-table build at [ingest_eia_energy_layers.py:839-870](../ingest_eia_energy_layers.py#L839-L870) corrects this in `longitude` and `geom`, so spatial joins are unaffected.

---
## Suggested next steps

1. **Backfill `power_mw`** from Baxtel / Data Center Map (paid scrape; grant work).
2. **Operator-string deduplication**: collapse "Meta"/"Meta, Inc.", "AWS" variants, etc., before any per-operator analysis.
3. **Water-stress overlay against the 257 watersheds**: now that the HUC8 join is in place, pull USGS WaterWatch streamflow data, USGS water-use estimates, or EPA drought-status indicators against this watershed set. A single stress index per HUC8 would size the entire US fleet's water exposure.
4. **Forward-projected demand overlay**: the static SEDS / OGC capacity-share view in §7 is a snapshot. Joining `im3_state_projected_moderate_50` against the §7 saturation table would let us flag which already-saturated states (NJ, NV, TN, OR) are projected to need the most additional generation before 2050.
5. **Opposition cases overlay**: `opposition_cases_geocoded` is loaded but unused; it could test whether cluster-vs-isolated demographic differences (or watershed concentration) predict community opposition.
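For next step 2, a lookup-based normalizer is probably enough to start. This sketch collapses the variants named in the data quality flags. The canonical labels and the choice to fold "Microsoft Azure" into "Microsoft" are assumptions, and `normalize_operator` is a hypothetical helper rather than existing project code.

```python
from collections import Counter
from typing import Optional

# Variant -> canonical label. Keys are lowercase; lookups strip whitespace first.
CANONICAL = {
    "meta": "Meta",
    "meta, inc.": "Meta",
    "amazon web services": "Amazon Web Services",
    "amazon aws": "Amazon Web Services",
    "microsoft": "Microsoft",
    "microsoft azure": "Microsoft",
}


def normalize_operator(raw: Optional[str]) -> Optional[str]:
    """Map a raw operator string to its canonical label; keep unknowns as-is."""
    if raw is None:
        return None  # null operators stay null so they remain visible in counts
    cleaned = raw.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)


raw_operators = [
    "Meta",
    "Meta, Inc.",
    "Amazon AWS",
    "amazon web services",
    "Microsoft Azure",
    None,
    "Google",
]
counts = Counter(normalize_operator(op) for op in raw_operators)
print(counts)
```

Keeping unmapped strings unchanged means new variants show up as their own rows in per-operator counts, where they are easy to spot and add to the map.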