data-centers/output/data_center_demographic_ruca_energy_summary.md
dadams a7121d601b Add state grid context and database inventory to DC summary
Extends the demographic/RUCA/energy summary with two new sections:
- §7 quantifies each top-DC state's "share of state capacity within
  50 km of a DC," surfacing NJ (83%), NV (75%), TN (70%), and OR (68%)
  as the most DC-saturated grids — reframing the canonical VA-centric
  story by structural entanglement rather than raw count.
- §9 inventories every table in the data_centers schema with a
  one-line description, flagging cleanup candidates and unused layers
  for downstream work.

Also renumbers watershed analysis to §8, adds the SEDS row to the
dataset coverage table, and narrows next-step #4 to the IM3 projection
overlay (now that the SEDS join is complete).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 09:04:56 -07:00

# US Data Centers — Demographic, Urban-Rural & Energy Context Analysis
**Date:** 2026-05-18
**Notebook:** [cluster_analysis.ipynb](../cluster_analysis.ipynb)
**Universe:** 1,833 data centers in `public.master_data_centers`, joined to ACS-2024 demographics, USDA RUCA-2020 codes, USGS HUC8 watersheds, and EIA operating-generator capacity (50 km radius, latest period 2026-02, status=OP).
---
## Headline findings
1. **DC tracts are richer, more educated, and more diverse than the US average.** Median household income $103,623 vs. national $78,538 (+32%); 49% bachelor's+ vs. 35% (+14 pp); poverty rate 7.2% vs. 12.4%. Non-Hispanic white share is *below* national (50% vs. 58%), driven by Asian-heavy (mean 13% vs. 6%) and Hispanic-significant tracts.
2. **The metro skew is more modest than expected: 1.11×.** 89% of DCs sit in metropolitan tracts, but 80% of *all* US tracts are metropolitan — so DCs are only slightly more concentrated than the underlying population distribution would predict.
3. **The non-metro tail is overwhelmingly hyperscale and Pacific Northwest.** Of 190 DCs outside metropolitan tracts (RUCA 4–10), AWS owns 67, Meta 22, Microsoft 10, Google 4, Yahoo 2, a combined 55% of the non-metro footprint. Oregon (86) and Washington (40) alone hold 66% of non-metro DCs, anchored to the Columbia River hydropower corridor.
4. **Clustered DCs are demographically distinct from isolated ones.** DCs in DBSCAN clusters (n=1,583) sit in tracts with $108K median income vs. $73K for isolated DCs (n=250), a $35K gap. Clustered DCs are more educated (+18 pp bachelor's), more diverse (non-Hispanic white share 25 pp lower), and embedded in much denser energy infrastructure (89 vs. 40 generators within 50 km).
5. **Microsoft co-locates with one of the largest US nuclear plants.** Microsoft's Goodyear, AZ campus has 14.6 GW of generation within 50 km, including 4.2 GW from Palo Verde Nuclear. Despite the campus being in a RUCA-2 "Metro high-commute" tract (not strictly metro core), the surrounding grid is the densest by capacity in our analysis.
6. **Extreme watershed concentration: half of all US DCs sit in just 15 of 2,139 HUC8 watersheds.** A single watershed — Middle Potomac-Catoctin (Loudoun County) — holds 235 DCs (12.8% of the US total). The top 2 (both DC-Alley watersheds) hold 18.9%; the top 10 hold 40%. Water stress in any one of these basins propagates to a huge share of national DC capacity.
---
## Dataset coverage and joins
| Source table | Rows | Join key | Coverage |
|---|---|---|---|
| `master_data_centers` | 1,833 | base | — |
| `master_data_center_spatial_clusters` | 1,831 | `master_id` | 99.9% |
| `_dc_census_tract_acs_2024` | ~73,000 tracts | `geoid` | 1,807 matched (98.6%) |
| `ruca_codes_2020_tract` | 85,528 tracts | `tract_fips_20 = geoid` | 1,826 matched (99.6%) |
| `watershed_huc8` | 2,139 watersheds | `ST_Contains(w.geom, m.geom)` | 1,831 matched (99.9%) |
| `energy_eia_operating_generator_capacity_flat` | 4.7M rows | `ST_DWithin(geom, 50km)` | 1,831 DCs have ≥1 nearby gen |
| `energy_eia_seds_flat` (annual, 1960–2024) | 2.57M rows | `state_id` | Used in §7 for state electricity consumption (series `ESTCB`, 2024) |
Energy aggregation uses period `2026-02` only with `status='OP'`, summing `nameplate_capacity_mw` for operating generators within 50 km of each DC. Note: EIA capacity columns were added to this table on 2026-05-17 — prior to that the `_flat` table had no MW values despite its name. SEDS was backfilled 2026-05-18 (initial smoke-test had only 50 rows).
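The aggregation rule above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's query, which runs in PostGIS via `ST_DWithin`. The `Generator` record, the `nearby_capacity_mw` helper, and the sample rows are hypothetical, and great-circle distance stands in for the database's distance test.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass
class Generator:
    # Hypothetical record; field names mirror the columns named in the text.
    period: str
    status: str
    nameplate_capacity_mw: float
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearby_capacity_mw(dc_lat, dc_lon, generators, period="2026-02", radius_km=50.0):
    """Sum nameplate MW of operating generators within radius_km of one DC."""
    total = 0.0
    for g in generators:
        if g.period != period or g.status != "OP":
            continue  # keep only the chosen period and operating units
        if haversine_km(dc_lat, dc_lon, g.lat, g.lon) <= radius_km:
            total += g.nameplate_capacity_mw
    return total


# Made-up sample rows around a made-up DC location.
gens = [
    Generator("2026-02", "OP", 100.0, 39.05, -77.50),  # about 1 km away
    Generator("2026-02", "OP", 250.0, 39.30, -77.40),  # about 30 km away
    Generator("2026-02", "SB", 500.0, 39.05, -77.50),  # not operating
    Generator("2025-12", "OP", 75.0, 39.05, -77.50),   # wrong period
    Generator("2026-02", "OP", 900.0, 41.00, -77.50),  # more than 200 km away
]
dc = (39.04, -77.49)
print(nearby_capacity_mw(*dc, gens))
```

Only the first two sample generators pass all three filters (period, status, distance), so the sketch reports their combined nameplate capacity.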
---
## 1. Demographic profile of DC tracts (n=1,807 with non-null ACS)
| Metric | DC tract (median) | DC tract (mean) | US avg | Δ mean vs. US |
|---|---:|---:|---:|---:|
| Median household income | $103,623 | $114,543 | $78,538 | **+$36,005** |
| Per-capita income | $51,283 | $55,725 | $43,313 | +$12,412 |
| Poverty rate | 7.2% | 10.1% | 12.4% | −2.3 pp |
| Unemployment rate | 3.5% | 4.4% | 5.4% | −1.0 pp |
| Bachelor's+ % | 49.3% | 46.2% | 35.0% | **+11.2 pp** |
| Broadband subscription % | 94.9% | 93.5% | 89.0% | +4.5 pp |
| Non-Hispanic white % | 50.2% | 51.0% | 58.4% | **−7.4 pp** |
| Hispanic / Latino % | 12.8% | 19.5% | 19.5% | 0.0 pp |
| Non-Hispanic Black % | 5.9% | 10.6% | 12.1% | −1.5 pp |
| Non-Hispanic Asian % | 6.4% | 13.4% | 6.4% | **+7.0 pp** |
**Interpretation.** DC tracts skew toward high-income, highly-educated, technically connected, and racially diverse (specifically Asian-heavy). The race composition is interesting: DC tracts are *less* non-Hispanic white than national average, not more. This reflects DC siting in mixed-race coastal/exurban tech corridors (Bay Area, Northern Virginia, Seattle) rather than in homogeneous suburbs.
**Data quality note.** `avg_household_size` previously contained ACS sentinel-value pollution (−666,666,666) for 1,089 zero-population tracts in `_dc_census_tract_acs_2024` (29 of which contained DCs) plus 16 rows in `data_center_census_tracts_2024`. As of 2026-05-18, those sentinels have been replaced with `NULL`. The column now has plausible ranges (min 1.00, max 9.33) and a usable mean.
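A minimal sketch of the sentinel cleanup, assuming rows are held as plain dictionaries. The actual fix was applied to the database tables; the helper names and sample rows here are hypothetical. The check compares on magnitude because ACS publishes this sentinel as a negative number, so the sketch works whichever sign is stored.

```python
ACS_SENTINEL_MAGNITUDE = 666_666_666  # sentinel magnitude reported in the text


def is_sentinel(value):
    """True when a value is the ACS 'not computable' sentinel (either sign)."""
    return value is not None and abs(value) == ACS_SENTINEL_MAGNITUDE


def clean_household_size(rows):
    """Return copies of rows with sentinel avg_household_size set to None."""
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if is_sentinel(row.get("avg_household_size")):
            row["avg_household_size"] = None
        cleaned.append(row)
    return cleaned


def mean_ignoring_nulls(rows, key):
    """Mean of a column, skipping None values; None if nothing is left."""
    values = [r[key] for r in rows if r[key] is not None]
    return sum(values) / len(values) if values else None


# Made-up rows: one zero-population tract carrying the sentinel.
sample = [
    {"geoid": "A", "avg_household_size": 2.5},
    {"geoid": "B", "avg_household_size": -666_666_666},
    {"geoid": "C", "avg_household_size": 3.5},
]
cleaned = clean_household_size(sample)
print(mean_ignoring_nulls(sample, "avg_household_size"))   # polluted mean
print(mean_ignoring_nulls(cleaned, "avg_household_size"))  # usable mean
```

A single sentinel row is enough to push the polluted mean to a nine-digit magnitude, which is why the column was unusable before the cleanup.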
---
## 2. Geographic concentration (top 15 states)
| State | DC count | Total power_mw (where known) | Median HH income | Median bachelor's % | Median % white | Notes |
|---|---:|---:|---:|---:|---:|---|
| **VA** | **378** | 255 | $141,250 | 62.6% | 42.5% | Loudoun / DC-Alley dominance (20.6% of all US DCs) |
| TX | 162 | 597 | $88,228 | 46.2% | 32.0% | DFW + Austin + San Antonio |
| CA | 147 | 130 | $164,928 | 56.4% | 22.4% | Bay Area + LA basin |
| OR | 145 | 125 | $72,719 | 20.0% | 63.2% | Columbia River hydro corridor (rural) |
| OH | 103 | 135 | $128,875 | 47.0% | 74.5% | Columbus boom — fastest-rising market |
| WA | 93 | 70 | $91,623 | 21.9% | 40.3% | Quincy/Wenatchee + Seattle |
| AZ | 69 | 54 | $85,335 | 35.2% | 51.6% | Phoenix/Goodyear hyperscale |
| IA | 65 | 0 | $93,393 | 34.3% | 88.1% | 88% white (rural Midwest) |
| NJ | 62 | 98 | $147,321 | 59.4% | 32.9% | NYC-metro carrier hotels |
| IL | 61 | 128 | $96,191 | 52.9% | 52.0% | Chicago metro |
| GA | 50 | 241 | $101,176 | 51.4% | 31.6% | Atlanta + high-power rural builds |
| NY | 48 | 47 | $77,465 | 47.6% | 74.8% | NYC + upstate |
| NV | 41 | 0 | $93,409 | 31.2% | 34.6% | Reno + Las Vegas |
| TN | 32 | 0 | — | — | 54.8% | Nashville + Memphis (newly visible after state backfill) |
| NC | 31 | 56 | $82,708 | 44.7% | 59.6% | Charlotte + Catawba (nuclear-adjacent) |
**Virginia alone holds 20.6% of all US DCs** (378 of 1,833), with the most affluent tract profile in the top 15 — a Loudoun County effect. The state backfill substantially elevated **Ohio (76 → 103)** and **Texas (135 → 162)**, pushing TX into the #2 slot. The previously-uncounted **Tennessee (32) now appears in the top 15**.
Oregon and Washington tracts look notably different from the urban-heavy states (lower income, lower education, lower broadband, higher Hispanic share), reflecting their rural Columbia River siting.
---
## 3. Spatially clustered DCs vs. isolated DCs
DBSCAN cluster assignment from `master_data_center_spatial_clusters` (1,583 clustered, 250 isolated):
| Metric (median) | Isolated | In cluster | Δ |
|---|---:|---:|---:|
| Median household income | $73,500 | $108,359 | **+$34,859** |
| Bachelor's+ % | 33.2 | 51.2 | **+18.0 pp** |
| Poverty rate | 11.6 | 6.9 | −4.7 pp |
| Non-Hispanic white % | 71.0 | 45.9 | **−25.1 pp** |
| EIA generators within 50 km | 40 | 89 | +49 |
| EIA capacity within 50 km (MW) | 2,176 | 3,300 | +1,125 |
**Reading.** A clustered data center sits, at the median, in a tract that is ~$35K richer, 18 pp more educated, and 25 pp less non-Hispanic white than an isolated one, and is surrounded by roughly twice as many generators (and about 50% more generation capacity). The isolated set looks like rural / small-town America (whiter, poorer, less educated); the clustered set looks like coastal exurban tech corridors.
---
## 4. RUCA (urban-rural) distribution
National baseline of all US tracts: 80% Metropolitan, 9% Micropolitan, 3% Small town, 8% Rural.
| RUCA band | DCs | DC % | US tract % | Over-index |
|---|---:|---:|---:|---:|
| Metropolitan (1–3) | 1,636 | 89.3% | 80.1% | **1.11×** |
| Micropolitan (4–6) | 98 | 5.3% | 9.0% | 0.59× |
| Small town (7–9) | 15 | 0.8% | 2.9% | 0.28× |
| Rural (10) | 77 | 4.2% | 7.6% | 0.55× |
| Unknown / missed match | 7 | 0.4% | — | — |
**Reading.** The metro skew is real but only mild (1.11×). The eye-catching pattern is that **rural tracts (RUCA 10) hold five times as many DCs as small-town tracts (77 vs. 15) and nearly as many as micropolitan tracts (98)**. Rural tracts over-index at about twice the small-town rate (0.55× vs. 0.28×), because the hyperscale greenfield model deliberately bypasses small-city economies in favor of remote, cheap-power, low-population sites.
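The over-index column is the DC share of a RUCA band divided by that band's share of all US tracts. A short sketch that recomputes it from the counts and baseline percentages in the table above (the dictionary and function names are illustrative only):

```python
# DC counts per RUCA band and the national tract baseline, from the table above.
dc_counts = {
    "Metropolitan": 1636,
    "Micropolitan": 98,
    "Small town": 15,
    "Rural": 77,
    "Unknown": 7,
}
us_tract_pct = {
    "Metropolitan": 80.1,
    "Micropolitan": 9.0,
    "Small town": 2.9,
    "Rural": 7.6,
}

total_dcs = sum(dc_counts.values())


def over_index(band):
    """DC share of a band divided by the band's share of all US tracts."""
    dc_pct = 100 * dc_counts[band] / total_dcs
    return dc_pct / us_tract_pct[band]


for band in us_tract_pct:
    print(f"{band}: {over_index(band):.2f}x")
```

A value above 1 means DCs are more concentrated in that band than tracts are; only the metropolitan band clears that bar.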
### Per-RUCA-code drilldown
| RUCA | Description | DCs | Median HH income | Median pop density (per sq mi) | Median EIA gens (50km) |
|---:|---|---:|---:|---:|---:|
| 1 | Metro core | 1,425 | $110,333 | 1,859 | 97 |
| 2 | Metro high-commute | 206 | $105,404 | 96 | 49 |
| 3 | Metro low-commute | 5 | $119,495 | 22 | 23 |
| 4 | Micropolitan core | 54 | $63,698 | 312 | 53 |
| 5 | Micropolitan high-commute | 22 | $72,465 | 191 | 51 |
| 6 | Micropolitan low-commute | 22 | $72,719 | 69 | 59 |
| 7 | Small town core | 14 | $87,522 | 2,336 | 40 |
| 8 | Small town high-commute | 1 | $69,074 | 36 | 41 |
| 10 | Rural area | 77 | $93,820 | 12 | 42 |
**Two surprises:**
- Rural DCs (RUCA 10) sit in tracts with $93.8K median income, *higher* than micropolitan DCs ($63.7K–$72.7K). The rural DC sites are not poor rural America; they are wealthy-by-rural-standards counties chosen for power and water access.
- Micropolitan-core DCs (RUCA 4) have the *lowest* median income at $63.7K — the closest thing to "economic-development DC siting" in the dataset.
---
## 5. Non-metro deep dive (190 DCs, RUCA 4–10)
### Operators
| Operator | Non-metro DCs |
|---|---:|
| Amazon Web Services | 67 |
| *(null operator)* | 50 |
| Meta | 20 (+ 2 as "Meta, Inc.") |
| Microsoft | 10 |
| Google | 4 |
| Rowan Green Data | 4 |
| NTT | 2 |
| Yahoo | 2 |
| Amazon AWS *(dupe)* | 2 |
**The five hyperscalers (AWS, Meta, Microsoft, Google, Yahoo) account for 105 of 190 non-metro DCs (55%).** If the 50 null-operator rows skew similarly hyperscale (likely — they're disproportionately in OR/WA), the share is probably closer to 75%.
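The operator table shows the same company under several spellings, so any share figure depends on how variants are folded together. The sketch below uses a hypothetical alias map (`ALIASES`) and the counts from the table above. It assumes the two "Amazon AWS" rows are distinct facilities rather than duplicate records; under that assumption the hyperscaler count rises from the 105 quoted above to 107.

```python
from collections import Counter

# Hypothetical alias map; keys are lower-cased, whitespace-normalized strings.
ALIASES = {
    "amazon web services": "AWS",
    "amazon aws": "AWS",
    "aws": "AWS",
    "meta": "Meta",
    "meta, inc.": "Meta",
    "microsoft": "Microsoft",
    "microsoft azure": "Microsoft",
}


def canonical_operator(raw):
    """Map a raw operator string to a canonical name; None stays None."""
    if raw is None or not raw.strip():
        return None
    key = " ".join(raw.split()).lower()
    return ALIASES.get(key, raw.strip())


# Non-metro DC counts per raw operator string, from the table above.
raw_counts = [
    ("Amazon Web Services", 67), (None, 50), ("Meta", 20), ("Meta, Inc.", 2),
    ("Microsoft", 10), ("Google", 4), ("Rowan Green Data", 4), ("NTT", 2),
    ("Yahoo", 2), ("Amazon AWS", 2),
]

merged = Counter()
for name, n in raw_counts:
    merged[canonical_operator(name)] += n

HYPERSCALERS = {"AWS", "Meta", "Microsoft", "Google", "Yahoo"}
NON_METRO_TOTAL = 190  # all non-metro DCs, including operators not listed above

hyperscale = sum(merged[op] for op in HYPERSCALERS)
share_pct = 100 * hyperscale / NON_METRO_TOTAL
print(merged["AWS"], merged["Meta"], hyperscale, round(share_pct, 1))
```

Running the normalization before any per-operator rollup keeps totals such as Meta's 22 from being split across spelling variants.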
### States (post-backfill)
| State | Non-metro DCs |
|---|---:|
| Oregon | 86 |
| Washington | 40 |
| Texas | 9 |
| New Mexico | 7 |
| North Carolina | 6 |
| Pennsylvania | 5 |
| Wisconsin | 4 |
| New York | 3 |
| Tennessee | 3 |
| Georgia | 3 |
**Oregon + Washington = 126 (66%) of all non-metro DCs.** This is the Columbia River basin: Prineville / Hermiston / Boardman / The Dalles (OR) and Quincy / East Wenatchee / Moses Lake (WA). The pull is hydroelectric power (cheap, low-carbon, abundant) and cool dry climate (free-cooling).
The state backfill clarified the rest of the non-metro tail: **Texas (9)** and **Pennsylvania (5)** were previously hidden in the null bucket. These likely represent shale-gas-adjacent builds (Permian and Marcellus respectively).
---
## 6. Energy footprint by operator (using EIA capacity within 50 km)
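Aggregated across DCs in RUCA 2–10 (i.e. anything outside dense metro core, n=401). Totals are sums of per-site 50 km capacity, so a generator near several sites is counted once per site: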
| Operator | DCs | States | Total nearby capacity (GW) | Median per site (GW) | Hydro (GW) | Nuclear (GW) | NG (GW) | Solar (GW) | Wind (GW) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| AWS | 93 | 5 | 397 | 4.8 | 66 | 2.5 | 201 | 4.6 | **114** |
| *(Unknown)* | 118 | 26 | 339 | 2.3 | 86 | 35 | 135 | 23 | 19 |
| Meta | 51 | 11 | 120 | 2.0 | 4.9 | 0 | 61 | 16 | 0.3 |
| Microsoft | 26 | 6 | 113 | 3.4 | 28 | **13** | 39 | 9.1 | 8.1 |
| Google | 31 | 5 | 100 | 3.9 | 14 | 0 | 43 | 3.6 | 4.7 |
| Apple | 5 | 2 | 4 | 0.6 | 1.6 | 0 | 1.1 | 0.9 | 0.4 |
| Yahoo | 2 | 1 | 7 | 3.5 | 6.4 | 0 | 0 | 0 | 0.7 |
**Distinct hyperscaler strategies, visible in the fuel mix:**
- **AWS** has aggregated 114 GW of *wind* exposure across its 93 sites — by far the most renewable-coupled portfolio. Also heavy hydro (66 GW) from its OR/WA footprint and 201 GW of natural gas as baseline.
- **Microsoft** has the highest *nuclear* exposure among named operators (12.6 GW), almost entirely from its Goodyear, AZ campuses near Palo Verde Nuclear.
- **Meta** has the most *solar* (16 GW) among the named hyperscalers, but minimal nuclear or wind — consistent with its New Mexico (Los Lunas) and Iowa builds.
- **Google** is split — moderate hydro and natural gas, modest renewables.
### Largest non-metro grid neighborhoods (top sites by surrounding capacity)
| DC | Operator | Location | Nearby capacity | Fuel highlight |
|---|---|---|---:|---|
| PHX70 / PHX-10 / PHX-11 | Microsoft (Azure) | Goodyear, AZ (RUCA 2) | 14.0–14.6 GW | **4.2 GW nuclear (Palo Verde)** + 6.4 GW gas + 2.2 GW solar |
| Stream PHX-1 | Stream Data Centers | Goodyear, AZ | 13.4 GW | Same Palo Verde / gas mix |
| T5 Charlotte Campus | T5 | Kings Mountain, NC (RUCA 6) | 12.9 GW | **4.9 GW nuclear** (Catawba) + 5.5 GW gas + 1.5 GW coal |
| Apple Maiden | Apple | Maiden, NC (RUCA 2) | 9.1 GW | 2.4 GW nuclear + 4.6 GW gas |
| Percheron DC | Rowan Green Data | (Texas, RUCA 10) | 6.7 GW | **3.0 GW wind** + 0.9 GW hydro + 2.4 GW gas |
---
## 7. State grid context — how DC-saturated is each top state?
Section 6 shows DC-adjacent capacity in absolute GW, which is hard to interpret without knowing the size of the state grid. Using the EIA operating-generator capacity table (OGC; period `2026-02`, status `OP`) for state-total generating capacity and SEDS series `ESTCB` for 2024 in-state electricity consumption, we can express each state's DC footprint as a **share of its own grid**.
The "DC-adjacent capacity" column sums distinct in-state generators (i.e., no double-counting) whose 50 km neighborhood includes at least one in-state data center.
| State | DCs | State grid (GW) | State elec. consumption (TWh, 2024) | DC-adjacent capacity (GW) | **% of state capacity within 50 km of a DC** |
|---|---:|---:|---:|---:|---:|
| VA | 378 | 30.8 | 138.0 | 15.4 | 50% |
| TX | 162 | 194.2 | 505.3 | 61.4 | 32% |
| CA | 147 | 105.1 | 245.6 | 51.6 | 49% |
| **OR** | 145 | 17.2 | 59.7 | 11.7 | **68%** |
| OH | 103 | 34.4 | 153.7 | 12.7 | 37% |
| WA | 93 | 29.6 | 90.0 | 7.9 | 27% |
| AZ | 69 | 40.1 | 90.8 | 22.5 | 56% |
| IA | 65 | 24.6 | 54.9 | 4.9 | 20% |
| **NJ** | 62 | 17.8 | 73.5 | 14.7 | **83%** |
| IL | 61 | 51.7 | 133.2 | 17.4 | 34% |
| GA | 50 | 42.3 | 150.0 | 14.2 | 34% |
| NY | 48 | 42.7 | 140.5 | 25.8 | 61% |
| **NV** | 41 | 18.7 | 40.7 | 14.0 | **75%** |
| **TN** | 32 | 23.3 | 102.9 | 16.4 | **70%** |
| NC | 31 | 38.9 | 136.9 | 17.4 | 45% |
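The no-double-counting rule behind the DC-adjacent capacity column can be illustrated with a set union: each generator contributes its capacity once, however many data centers it sits near. The generator IDs, capacities, and function names below are made up for illustration and are not the report's query.

```python
def dc_adjacent_capacity_mw(nearby_by_dc, capacity_mw):
    """Sum capacity over the union of generators near any DC (each counted once)."""
    distinct = set()
    for gen_ids in nearby_by_dc.values():
        distinct |= set(gen_ids)
    return sum(capacity_mw[g] for g in distinct)


def per_site_sum_mw(nearby_by_dc, capacity_mw):
    """Sum per-DC totals; shared generators are counted once per DC."""
    return sum(capacity_mw[g] for ids in nearby_by_dc.values() for g in ids)


# Made-up state with four generators and three DCs.
capacity_mw = {"g1": 500.0, "g2": 300.0, "g3": 200.0, "g4": 1000.0}
nearby_by_dc = {
    "dc_a": ["g1", "g2"],
    "dc_b": ["g2", "g3"],
    "dc_c": ["g1"],
}

state_total = sum(capacity_mw.values())
adjacent = dc_adjacent_capacity_mw(nearby_by_dc, capacity_mw)
print(adjacent, per_site_sum_mw(nearby_by_dc, capacity_mw))
print(f"{100 * adjacent / state_total:.0f}% of state capacity is DC-adjacent")
```

The per-site sum exceeds the distinct sum whenever DCs share generators, which is why only the distinct version can be compared against a state total.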
**The DC-saturation reordering.** Virginia leads in raw DC count (378), but four states have grids where *more than two-thirds* of all in-state generating capacity sits within 50 km of a data center:
- **New Jersey — 83%.** Effectively the entire state's electrical economy is DC-adjacent. NJ's 62 DCs are NYC-metro carrier hotels concentrated in a small geographic footprint relative to a small state grid (17.8 GW).
- **Nevada — 75%.** Las Vegas and Reno DCs co-locate with the gas-and-solar generation that serves Las Vegas urbanization. NV has a small grid (18.7 GW) and most of it serves the same two metros.
- **Tennessee — 70%.** Nashville + Memphis DCs sit near TVA's central generation belt.
- **Oregon — 68%.** Even though OR's DC cluster is mostly non-metro (Boardman / Hermiston / The Dalles), the Columbia hydro corridor serving them accounts for two-thirds of OR's 17.2 GW grid. This is the only state where the saturation comes from rural hyperscale builds rather than urban carrier hotels.
**The opposite end.** **Iowa (20%)** has 65 DCs but they all cluster around Council Bluffs / Des Moines, leaving the rural wind belt that dominates IA's grid unrelated to DC siting. **Washington (27%)** is similar — the Quincy hyperscale cluster is small relative to WA's Columbia hydro and Puget-area generation.
**Why the proportional view matters.** A 1 GW DC load lands very differently on the NJ grid (5.6% of total capacity) than on the TX grid (0.5%). Reliability, transmission-queue interconnection waits, and political pushback all scale with the proportional draw, not the absolute MW. By that yardstick, the canonical "VA dominates US DCs" story is incomplete — VA, NJ, OR, NV, TN, NY, and AZ are the states where the DC industry is *structurally entangled* with the grid, and where any large new build runs into capacity-share constraints first.
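The proportional view is simple arithmetic on the §7 table. The sketch below copies the state grid and DC-adjacent capacity columns, recomputes the saturation share, and expresses a hypothetical 1 GW load as a percentage of each grid. Function names are illustrative.

```python
# (state grid GW, DC-adjacent capacity GW), copied from the §7 table.
STATES = {
    "VA": (30.8, 15.4), "TX": (194.2, 61.4), "CA": (105.1, 51.6),
    "OR": (17.2, 11.7), "OH": (34.4, 12.7), "WA": (29.6, 7.9),
    "AZ": (40.1, 22.5), "IA": (24.6, 4.9), "NJ": (17.8, 14.7),
    "IL": (51.7, 17.4), "GA": (42.3, 14.2), "NY": (42.7, 25.8),
    "NV": (18.7, 14.0), "TN": (23.3, 16.4), "NC": (38.9, 17.4),
}


def saturation(state):
    """Fraction of state capacity within 50 km of a data center."""
    grid_gw, adjacent_gw = STATES[state]
    return adjacent_gw / grid_gw


def load_share_pct(state, load_gw=1.0):
    """A new load expressed as a percentage of the state's total capacity."""
    grid_gw, _ = STATES[state]
    return 100 * load_gw / grid_gw


over_two_thirds = sorted(
    (s for s in STATES if saturation(s) > 2 / 3), key=saturation, reverse=True
)
print(over_two_thirds)
print(round(load_share_pct("NJ"), 1), round(load_share_pct("TX"), 1))
```

The same 1 GW is roughly an order of magnitude larger relative to the NJ grid than relative to the TX grid, which is the asymmetry the paragraph above describes.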
---
## 8. Watershed (HUC8) concentration
Each DC sits in exactly one USGS HUC8 watershed (8-digit hydrologic unit, subbasin scale, median ~3,000 sq km). Cooling-water draw and wastewater discharge happen at watershed scale, not state scale — a single stressed basin can cap an entire DC corridor regardless of how big the state's overall water budget is.
### Where the 1,831 matched DCs land
- **257 distinct HUC8s** hold at least one DC — that's only **12% of the 2,139 US watersheds** (the other 88% have zero data centers).
- **The top 1 watershed alone (Middle Potomac-Catoctin) holds 235 DCs** — 12.8% of the entire US data-center footprint.
- DC concentration is much more extreme at the watershed level than at the state level. Virginia has 20.6% of US DCs; the single Loudoun watershed holds 12.8%.
### Cumulative concentration
| Top N watersheds | DCs | Share of matched DCs (n=1,831) |
|---:|---:|---:|
| 1 | 235 | 12.8% |
| 2 | 346 | 18.9% |
| 3 | 434 | 23.7% |
| 5 | 551 | 30.1% |
| **10** | **736** | **40.2%** |
| **15** | **887** | **48.4%** |
| 20 | 1,012 | 55.3% |
| 30 | 1,186 | 64.8% |
| 50 | 1,380 | 75.4% |
| 100 | 1,611 | 88.0% |
**Nearly half of all US data centers (48.4%) live in just 15 watersheds.** Three-quarters in 50. Water stress, drought policy, or thermal-discharge limits in any one of these basins propagates to a large share of the national footprint.
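The cumulative rows can be reproduced from per-watershed counts with a running sum. The sketch below uses the fifteen largest counts listed in the top-15 table of this section. It assumes the denominator is the 1,831 matched DCs, since that is the value that reproduces the 20-, 30-, 50-, and 100-watershed rows of the table above.

```python
from itertools import accumulate

# DC counts for the 15 largest watersheds, in descending order.
top15 = [235, 111, 88, 73, 44, 40, 39, 37, 36, 33, 32, 31, 30, 29, 29]
MATCHED_DCS = 1831  # assumed denominator: DCs matched to a HUC8

cumulative = list(accumulate(top15))


def top_n_share_pct(n):
    """Share of matched DCs held by the n largest watersheds (n <= 15 here)."""
    return 100 * cumulative[n - 1] / MATCHED_DCS


for n in (1, 2, 3, 5, 10, 15):
    print(n, cumulative[n - 1], f"{top_n_share_pct(n):.1f}%")
```

Extending the list beyond fifteen entries would reproduce the remaining rows in the same way.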
### Top 15 watersheds by DC count
| HUC8 | Name | States | DCs | Cluster |
|---|---|---|---:|---|
| 02070008 | Middle Potomac-Catoctin | DC, MD, VA, WV | 235 | Loudoun / Ashburn (DC-Alley) |
| 02070010 | Middle Potomac-Anacostia-Occoquan | DC, MD, VA | 111 | Fairfax + inner Loudoun + DC |
| 18050003 | Coyote | CA | 88 | Silicon Valley / San Jose |
| 05060001 | Upper Scioto | OH | 73 | Columbus (fastest-growing market) |
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | Boardman / Hermiston (hyperscale hydro) |
| 17020015 | Lower Crab | WA | 40 | Quincy / Moses Lake (hyperscale hydro) |
| 17090010 | Tualatin | OR | 39 | Hillsboro (Intel / Google) |
| 12030105 | Upper Trinity | TX | 37 | DFW |
| 10230006 | Big Papillion-Mosquito | IA, NE | 36 | Council Bluffs / Omaha (Meta) |
| 07120004 | Des Plaines | IL, WI | 33 | Chicago metro |
| 12100302 | Medina | TX | 32 | San Antonio |
| 02030105 | Raritan | NJ | 31 | Central NJ carrier hotels |
| 15050100 | Middle Gila | AZ | 30 | Phoenix metro |
| 02030103 | Hackensack-Passaic | NJ, NY | 29 | NYC metro east |
| 17070103 | Umatilla | OR | 29 | AWS-only (all 29 DCs) — pure single-operator basin |
The non-metro / hyperscale Pacific Northwest story is visible at watershed scale: three Columbia-system watersheds (**Middle Columbia-Lake Wallula, Lower Crab, Umatilla**) hold 113 DCs combined, all hyperscale-dominated. Umatilla is operationally **AWS-exclusive** — all 29 DCs in that basin are AWS.
### Non-metro watersheds (RUCA 4–10): where hyperscalers cluster
| HUC8 | Name | States | DCs | Operators |
|---|---|---|---:|---|
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | AWS (multiple variants), Rowan Green Data |
| 17020015 | Lower Crab | WA | 40 | CyrusOne, Intuit, Microsoft, NTT, Sabey, Yahoo |
| 17070103 | Umatilla | OR | 29 | AWS only |
| 17070305 | Lower Crooked | OR | 8 | Meta (Prineville) |
| 13020203 | Rio Grande-Albuquerque | NM | 7 | Meta (Los Lunas) |
| 03050105 | Upper Broad | NC, SC | 6 | Meta |
| 13070001 | Lower Pecos-Red Bluff Reservoir | NM, TX | 5 | IONIC Digital |
| 17070105 | Middle Columbia-Hood | OR, WA | 4 | Google (The Dalles) |
| 02050107 | Upper Susquehanna-Lackawanna | PA | 3 | AWS |
| 03070103 | Upper Ocmulgee | GA | 2 | Meta |
This view is the cleanest evidence yet of the *hyperscale geographic strategy* — single-operator capture of individual watersheds (Umatilla = AWS, Lower Crooked = Meta, Middle Columbia-Hood = Google, Rio Grande-Albuquerque = Meta). Each of these basins has effectively been claimed by one player.
### Implications for water-stress analysis
This watershed view is a **boundary set** for downstream water-stress analysis. Pull WaterWatch streamflow data, USGS water-use estimates, or EPA drought indicators against just these 257 HUC8s (or against just the top 15 for the highest-leverage story). A single-pull stress index against this set would size the "water exposure" of the entire US DC fleet.
---
## 9. Database inventory (`data_centers` schema `public`)
All tables in the working database as of 2026-05-18. "Used here" = referenced in §1–§8 of this report. PostGIS internal tables (`spatial_ref_sys`, `geography_columns`, `geometry_columns`) are omitted.
### Data center inventory and clustering
| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `master_data_centers` | 1,833 | ✓ | Unified, deduplicated DC inventory — the canonical row-per-DC table joining curated, OSM, and sample sources via `master_id`. |
| `osm_data_centers` | 1,549 | — | Raw OSM-derived DC features (nodes/ways tagged as data centers), one of the inputs to `master_data_centers`. |
| `us_dc_sample_geocoded` | 1,489 | — | Earlier sample-list DC inventory with geocoding lineage (Nominatim + Census TIGER), superseded by `master_data_centers` but retained for provenance. |
| `data_centers_union` (view) | — | — | Convenience view unioning the curated and OSM source rows with a `source` discriminator. |
| `master_data_center_spatial_clusters` | 1,831 | ✓ | DBSCAN cluster assignment per DC (`cluster_id`, noise flag), used in §3. |
### Per-DC join tables
| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `data_center_census_tracts_2024` | 1,815 | ✓ | One row per DC with attached ACS-2024 demographics from its containing tract — the master demographic join. |
| `data_center_watershed_huc8` | 1,833 | ✓ | One row per DC with its containing USGS HUC8 watershed (`huc8`, name, states, area), built 2026-05-18 via `ST_Within`. |
### Base geographic / demographic layers
| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `_dc_census_tract_acs_2024` | 85,382 | ✓ | Staging: ACS-2024 5-year profile attributes for every US tract that contains a DC (and surrounding tracts for context). |
| `_dc_census_tract_boundaries_2024` | 85,058 | — | Staging: TIGER 2024 tract polygons for the DC-tract universe. |
| `ruca_codes_2020_tract` | 85,528 | ✓ | USDA RUCA 2020 codes per tract, the metro/micropolitan/rural classification used in §4–§5. |
| `watershed_huc8` | 2,139 | ✓ | USGS Watershed Boundary Dataset HUC8 subbasin polygons (median ~3,000 km²) covering CONUS + AK. |
| `_watershed_huc8_stage` | 369 | — | Staging table from an earlier partial WBD load, superseded by `watershed_huc8`. Candidate for cleanup. |
| `census_tract_huc8_link` | 806 | — | Tract↔HUC8 spatial overlap table (with overlap %) for the subset of tracts containing a DC. Useful for downstream tract-level water-stress joins. |
### Energy data
| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `energy_eia_operating_generator_capacity_flat` | 4.7M | ✓ | EIA Form-860 operating generator inventory, monthly 2008–2026, with nameplate / summer / winter MW and point geometry. Source for §6 and §7 capacity figures. |
| `energy_eia_seds_flat` | 2.57M | ✓ | EIA SEDS annual state energy series 1960–2024 (consumption, prices, expenditures by sector / fuel). Source for §7 state electricity consumption (`ESTCB`, 2024). Backfilled 2026-05-18. |
| `energy_atlas_layers_catalog` | ~5 | — | Metadata catalog of EIA layers ingested by `ingest_eia_energy_layers.py` (table name, source URL, import timestamp). |
| `im3_state_projected_moderate_50` | 328 | — | PNNL IM3 projected DC siting under the moderate-growth scenario at gravity-weight 0.50 — one row per projected facility (cost, IT MW, cooling-water demand, lat/lon). Loaded but unused. |
| `im3_projected_state_demand_summary` | 31 | — | State-level rollup of IM3 projected facility counts, IT MW, and cooling demand. Loaded but unused. |
| `seds_national_msn_year` | 0 | — | Empty placeholder for national SEDS time-series; superseded by `energy_eia_seds_flat`. Drop candidate. |
| `seds_state_msn_year` | 0 | — | Empty placeholder for state SEDS time-series; superseded by `energy_eia_seds_flat`. Drop candidate. |
| `utility_rate_tracker_2025_2028` | 374 | — | Utility rate-increase tracker by provider × state × service type, with effective dates and monthly $ + % increases. Loaded but unused in the demographic/energy analysis. |
### Connectivity (submarine cables, exchange capacity)
| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `internet_cables` | 693 | — | Submarine cable routes (geometry, RFS year, decommission year, owners, length km) from TeleGeography-style data. |
| `internet_cable_landing_points` | 3,361 | — | Cable landing points (country, name, TBD flag) — endpoint nodes for `internet_cables`. |
| `internet_cable_meta` | 2 | — | Source-provenance metadata for the cable dataset (key/value). |
| `internet_cable_year_summaries` | 58 | — | Year-by-year narrative descriptions of cable activity. |
| `internet_city_dominance` | 4,552 | — | City-level physical capacity (Tbps), logical-dominance IP count, and top ASNs — proxy for internet-hub strength of each candidate DC city. |
### Other
| Table | Rows | Used here | Description |
|---|---:|:-:|---|
| `opposition_cases_geocoded` | 18 | — | Geocoded community-opposition cases against DC builds (developer, investment $B, outcome, governance response). Loaded but unused — see next-steps item #5. |
**Cleanup candidates.** `_watershed_huc8_stage`, `seds_national_msn_year`, `seds_state_msn_year`, and possibly `us_dc_sample_geocoded` are superseded by their canonical counterparts and could be dropped to reduce confusion.
---
## Data quality flags
1. **`master_data_centers.power_mw` is populated for only 108 / 1,833 DCs (5.9%).** Useless as a sizing metric without imputation or alternative source. Nearby EIA capacity is the more reliable proxy (used as the per-DC scale in this analysis). A grant-funded scrape of Baxtel / Data Center Map would close this gap.
2. **50 of 190 non-metro DCs (26%) have null `operator`.** Likely hyperscalers based on geography (OR/WA) but unconfirmed.
3. **Operator-string fragmentation**: "Meta" vs. "Meta, Inc."; "Amazon Web Services" vs. "Amazon AWS" vs. "amazon web services"; "Microsoft" vs. "Microsoft Azure". Inflates distinct-operator counts and fragments per-operator totals.
4. ~~`avg_household_size` column has sentinel pollution~~ **Resolved 2026-05-18**: 1,089 sentinel values (−666,666,666) in `_dc_census_tract_acs_2024` and 16 in `data_center_census_tracts_2024` replaced with `NULL`. Affected 29 DCs.
5. **7 DCs failed RUCA join** — Puerto Rico tracts or non-US locations; trivial.
6. **EIA generator coordinates had a longitude sign error for 2008-01 through 2010-11** (~11K rows with positive lower-48 longitudes). The flat-table build at [ingest_eia_energy_layers.py:839-870](../ingest_eia_energy_layers.py#L839-L870) corrects this in `longitude` and `geom`, so spatial joins are unaffected.
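For flag 6, the kind of correction described can be sketched as below. This is not the code in `ingest_eia_energy_layers.py`; the function name and the lower-48 longitude bounds are assumptions made for illustration. Periods are `YYYY-MM` strings, which compare correctly as plain strings.

```python
AFFECTED_START, AFFECTED_END = "2008-01", "2010-11"  # window stated in flag 6
# Assumed magnitude bounds for lower-48 longitudes (roughly 66 to 125 degrees W).
LOWER48_ABS_LON = (66.0, 125.0)


def corrected_longitude(longitude, period):
    """Flip the sign of a positive lower-48 longitude inside the affected window."""
    in_window = AFFECTED_START <= period <= AFFECTED_END
    looks_flipped = LOWER48_ABS_LON[0] <= longitude <= LOWER48_ABS_LON[1]
    if in_window and looks_flipped:
        return -longitude
    return longitude


print(corrected_longitude(77.5, "2009-06"))   # flipped to the western hemisphere
print(corrected_longitude(-77.5, "2009-06"))  # already negative, left alone
print(corrected_longitude(77.5, "2011-01"))   # outside the window, left alone
```

Restricting the flip to the affected window and to plausible lower-48 magnitudes avoids touching rows that were already correct.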
---
## Suggested next steps
1. **Backfill `power_mw`** from Baxtel / Data Center Map (paid scrape — grant work).
2. **Operator-string deduplication** — collapse "Meta"/"Meta, Inc.", "AWS" variants, etc., before any per-operator analysis.
3. **Water-stress overlay against the 257 watersheds** — now that the HUC8 join is in place, pull USGS WaterWatch streamflow data, USGS water-use estimates, or EPA drought-status indicators against this watershed set. A single stress index per HUC8 would size the entire US fleet's water exposure.
4. **Forward-projected demand overlay** — the static SEDS / OGC capacity-share view in §7 is a snapshot. Joining `im3_state_projected_moderate_50` against the §7 saturation table would let us flag which already-saturated states (NJ, NV, TN, OR) are projected to need the most additional generation before 2050.
5. **Opposition cases overlay**: `opposition_cases_geocoded` is loaded but unused; could test whether cluster-vs-isolated demographic differences (or watershed concentration) predict community opposition.