data-centers/output/data_center_demographic_ruca_energy_summary.md
dadams a7121d601b Add state grid context and database inventory to DC summary
Extends the demographic/RUCA/energy summary with two new sections:
- §7 quantifies each top-DC state's "share of state capacity within
  50 km of a DC," surfacing NJ (83%), NV (75%), TN (70%), and OR (68%)
  as the most DC-saturated grids — reframing the canonical VA-centric
  story by structural entanglement rather than raw count.
- §9 inventories every table in the data_centers schema with a
  one-line description, flagging cleanup candidates and unused layers
  for downstream work.

Also renumbers watershed analysis to §8, adds the SEDS row to the
dataset coverage table, and narrows next-step #4 to the IM3 projection
overlay (now that the SEDS join is complete).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 09:04:56 -07:00


US Data Centers — Demographic, Urban-Rural & Energy Context Analysis

Date: 2026-05-18

Notebook: cluster_analysis.ipynb

Universe: 1,833 data centers in public.master_data_centers, joined to ACS-2024 demographics, USDA RUCA-2020 codes, USGS HUC8 watersheds, and EIA operating-generator capacity (50 km radius, latest period 2026-02, status=OP).


Headline findings

  1. DC tracts are richer, more educated, and more diverse than the US average. Median household income $103,623 vs. national $78,538 (+32%); 49% bachelor's+ vs. 35% (+14 pp); poverty rate 7.2% vs. 12.4%. Non-Hispanic white share is below national (50% vs. 58%), driven by Asian-heavy (mean 13% vs. 6%) and Hispanic-significant tracts.
  2. The metro skew is more modest than expected: 1.11×. 89% of DCs sit in metropolitan tracts, but 80% of all US tracts are metropolitan — so DCs are only slightly more concentrated than the underlying population distribution would predict.
  3. The non-metro tail is overwhelmingly hyperscale and Pacific Northwest. Of 190 DCs outside metropolitan tracts (RUCA 4–10), AWS owns 67, Meta 22, Microsoft 10, Google 4, and Yahoo 2, a combined 55% of the non-metro footprint. Oregon (86) and Washington (40) alone hold 66% of non-metro DCs, anchored to the Columbia River hydropower corridor.
  4. Clustered DCs are demographically distinct from isolated ones. DCs in DBSCAN clusters (n=1,583) sit in tracts with $108K median income vs. $73K for isolated DCs (n=250), a $35K gap. Clustered DCs are more educated (+18 pp bachelor's), more diverse (-25 pp non-Hispanic white), and embedded in much denser energy infrastructure (89 vs. 40 generators within 50 km).
  5. Microsoft co-locates with the largest US nuclear plant. Microsoft's Goodyear, AZ campus has 14.6 GW of generation within 50 km — including 4.2 GW from Palo Verde Nuclear, the largest in the US. Despite the campus being in a RUCA-2 "Metro high-commute" tract (not strictly metro core), the surrounding grid is the densest by capacity in our analysis.
  6. Extreme watershed concentration: nearly half of all US DCs (48.4%) sit in just 15 of 2,139 HUC8 watersheds. A single watershed, Middle Potomac-Catoctin (Loudoun County), holds 235 DCs (12.8% of the US total). The top 2 (both DC-Alley watersheds) hold 18.9%; the top 10 hold 40%. Water stress in any one of these basins propagates to a large share of national DC capacity.

Dataset coverage and joins

| Source table | Rows | Join key | Coverage |
| --- | --- | --- | --- |
| master_data_centers | 1,833 | | base |
| master_data_center_spatial_clusters | 1,831 | master_id | 99.9% |
| _dc_census_tract_acs_2024 | ~73,000 tracts | geoid | 1,807 matched (98.6%) |
| ruca_codes_2020_tract | 85,528 tracts | tract_fips_20 = geoid | 1,826 matched (99.6%) |
| watershed_huc8 | 2,139 watersheds | ST_Contains(w.geom, m.geom) | 1,831 matched (99.9%) |
| energy_eia_operating_generator_capacity_flat | 4.7M rows | ST_DWithin(geom, 50km) | 1,831 DCs have ≥1 nearby gen |
| energy_eia_seds_flat (annual, 1960–2024) | 2.57M rows | state_id | Used in §7 for state electricity consumption (series ESTCB, 2024) |

Energy aggregation uses period 2026-02 only with status='OP', summing nameplate_capacity_mw for operating generators within 50 km of each DC. Note: EIA capacity columns were added to this table on 2026-05-17 — prior to that the _flat table had no MW values despite its name. SEDS was backfilled 2026-05-18 (initial smoke-test had only 50 rows).
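The aggregation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the PostGIS query behind the report: the great-circle distance stands in for ST_DWithin, the `lat`/`lon` keys are assumed field names, and the sample generators are made up. Only `period`, `status`, and `nameplate_capacity_mw` come from the text.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby_capacity_mw(dc, generators, period="2026-02", radius_km=50.0):
    """Count and sum nameplate MW of operating generators near one DC."""
    count, total = 0, 0.0
    for g in generators:
        # Keep only the chosen reporting period and operating status.
        if g["period"] != period or g["status"] != "OP":
            continue
        if g["nameplate_capacity_mw"] is None:
            continue
        if haversine_km(dc["lat"], dc["lon"], g["lat"], g["lon"]) <= radius_km:
            count += 1
            total += g["nameplate_capacity_mw"]
    return count, total

# Hypothetical sample rows (not real EIA records).
dc = {"lat": 33.40, "lon": -112.40}
generators = [
    {"period": "2026-02", "status": "OP", "nameplate_capacity_mw": 100.0, "lat": 33.40, "lon": -112.40},
    {"period": "2026-02", "status": "OP", "nameplate_capacity_mw": 50.0, "lat": 33.70, "lon": -112.40},   # ~33 km away
    {"period": "2026-02", "status": "OP", "nameplate_capacity_mw": 500.0, "lat": 34.40, "lon": -112.40},  # ~111 km away
    {"period": "2026-02", "status": "RE", "nameplate_capacity_mw": 75.0, "lat": 33.40, "lon": -112.40},   # not operating
    {"period": "2026-01", "status": "OP", "nameplate_capacity_mw": 80.0, "lat": 33.40, "lon": -112.40},   # older period
]
count, total_mw = nearby_capacity_mw(dc, generators)
print(count, total_mw)
```

Filtering on a single period matters because the flat table is monthly; without it, each generator would be counted once per month it appears.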


1. Demographic profile of DC tracts (n=1,807 with non-null ACS)

| Metric | DC tract (median) | DC tract (mean) | US avg | Δ mean vs. US |
| --- | --- | --- | --- | --- |
| Median household income | $103,623 | $114,543 | $78,538 | +$36,005 |
| Per-capita income | $51,283 | $55,725 | $43,313 | +$12,412 |
| Poverty rate | 7.2% | 10.1% | 12.4% | -2.3 pp |
| Unemployment rate | 3.5% | 4.4% | 5.4% | -1.0 pp |
| Bachelor's+ % | 49.3% | 46.2% | 35.0% | +11.2 pp |
| Broadband subscription % | 94.9% | 93.5% | 89.0% | +4.5 pp |
| Non-Hispanic white % | 50.2% | 51.0% | 58.4% | -7.4 pp |
| Hispanic / Latino % | 12.8% | 19.5% | 19.5% | 0.0 pp |
| Non-Hispanic Black % | 5.9% | 10.6% | 12.1% | -1.5 pp |
| Non-Hispanic Asian % | 6.4% | 13.4% | 6.4% | +7.0 pp |

Interpretation. DC tracts skew toward high-income, highly-educated, technically connected, and racially diverse (specifically Asian-heavy). The race composition is interesting: DC tracts are less non-Hispanic white than national average, not more. This reflects DC siting in mixed-race coastal/exurban tech corridors (Bay Area, Northern Virginia, Seattle) rather than in homogeneous suburbs.

Data quality note. avg_household_size previously contained ACS sentinel-value pollution (-666,666,666) for 1,089 zero-population tracts in _dc_census_tract_acs_2024 (29 of which contained DCs) plus 16 rows in data_center_census_tracts_2024. As of 2026-05-18, those sentinels have been replaced with NULL. The column now has plausible ranges (min 1.00, max 9.33) and a usable mean.
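A minimal sketch of the sentinel cleanup, assuming the values are handled in Python rather than in SQL. The Census Bureau uses -666666666 to mean "estimate not available"; the check below compares magnitudes so it catches the value however the sign was stored. The function name and sample values are illustrative.

```python
ACS_SENTINEL_MAGNITUDE = 666_666_666  # ACS "estimate not available" marker

def clean_household_size(value):
    """Return None for missing or sentinel values, else the value as a float."""
    if value is None:
        return None
    v = float(value)
    # A household size can never be zero or negative, so treat those as missing too.
    if abs(v) == ACS_SENTINEL_MAGNITUDE or v <= 0:
        return None
    return v

# Hypothetical column values, including one sentinel and one NULL.
raw = [2.41, -666666666, None, 9.33, 1.00]
cleaned = [clean_household_size(v) for v in raw]
valid = [v for v in cleaned if v is not None]
mean_size = sum(valid) / len(valid)
print(cleaned, round(mean_size, 2))
```

Leaving even one sentinel in place would drag the column mean to a large negative number, which is why the mean was unusable before the fix.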


2. Geographic concentration (top 15 states)

State DC count Total power_mw (where known) Median HH income Median bachelor's % Median % white Notes
VA 378 255 $141,250 62.6% 42.5% Loudoun / DC-Alley dominance (20.6% of all US DCs)
TX 162 597 $88,228 46.2% 32.0% DFW + Austin + San Antonio
CA 147 130 $164,928 56.4% 22.4% Bay Area + LA basin
OR 145 125 $72,719 20.0% 63.2% Columbia River hydro corridor (rural)
OH 103 135 $128,875 47.0% 74.5% Columbus boom — fastest-rising market
WA 93 70 $91,623 21.9% 40.3% Quincy/Wenatchee + Seattle
AZ 69 54 $85,335 35.2% 51.6% Phoenix/Goodyear hyperscale
IA 65 0 $93,393 34.3% 88.1% 88% white (rural Midwest)
NJ 62 98 $147,321 59.4% 32.9% NYC-metro carrier hotels
IL 61 128 $96,191 52.9% 52.0% Chicago metro
GA 50 241 $101,176 51.4% 31.6% Atlanta + high-power rural builds
NY 48 47 $77,465 47.6% 74.8% NYC + upstate
NV 41 0 $93,409 31.2% 34.6% Reno + Las Vegas
TN 32 0 54.8% Nashville + Memphis (newly visible after state backfill)
NC 31 56 $82,708 44.7% 59.6% Charlotte + Catawba (nuclear-adjacent)

Virginia alone holds 20.6% of all US DCs (378 of 1,833), with the most affluent tract profile in the top 15 — a Loudoun County effect. The state backfill substantially elevated Ohio (76 → 103) and Texas (135 → 162), pushing TX into the #2 slot. The previously-uncounted Tennessee (32) now appears in the top 15.

Oregon and Washington tracts look notably different from the urban-heavy states (lower income, lower education, lower broadband, higher Hispanic share), reflecting their rural Columbia River siting.


3. Spatially clustered DCs vs. isolated DCs

DBSCAN cluster assignment from master_data_center_spatial_clusters (1,583 clustered, 250 isolated):

| Metric (median) | Isolated | In cluster | Δ |
| --- | --- | --- | --- |
| Median household income | $73,500 | $108,359 | +$34,859 |
| Bachelor's+ % | 33.2 | 51.2 | +18.0 pp |
| Poverty rate | 11.6 | 6.9 | -4.7 pp |
| Non-Hispanic white % | 71.0 | 45.9 | -25.1 pp |
| EIA generators within 50 km | 40 | 89 | +49 |
| EIA capacity within 50 km (MW) | 2,176 | 3,300 | +1,125 |

Reading. A clustered data center sits, at the median, in a tract that is ~$35K richer, 18 pp more educated, and 25 pp less non-Hispanic white than an isolated one — and is surrounded by twice as much energy infrastructure (and 50% more generation capacity). The isolated set looks like rural / small-town America (whiter, poorer, less educated); the clustered set looks like coastal exurban tech corridors.


4. RUCA (urban-rural) distribution

National baseline of all US tracts: 80% Metropolitan, 9% Micropolitan, 3% Small town, 8% Rural.

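| RUCA band | DCs | DC % | US tract % | Over-index |
| --- | --- | --- | --- | --- |
| Metropolitan (1–3) | 1,636 | 89.3% | 80.1% | 1.11× |
| Micropolitan (4–6) | 98 | 5.3% | 9.0% | 0.59× |
| Small town (7–9) | 15 | 0.8% | 2.9% | 0.28× |
| Rural (10) | 77 | 4.2% | 7.6% | 0.55× |
| Unknown / missed match | 7 | 0.4% | | |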

Reading. The metro skew is real but only mild (1.11×). The eye-catching pattern is that rural tracts (RUCA 10) hold five times as many DCs as small-town tracts (77 vs. 15) and nearly as many as all micropolitan tracts (98), because the hyperscale greenfield model deliberately bypasses small-city economies in favor of remote, cheap-power, low-population sites.
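The over-index column is the DC share of a band divided by that band's share of all US tracts. A short sketch that reproduces it from the counts in the table (the dictionary keys are just labels chosen for this example):

```python
# DC counts per RUCA band and the national tract share (%) from the table.
dc_counts = {
    "Metropolitan (1-3)": 1636,
    "Micropolitan (4-6)": 98,
    "Small town (7-9)": 15,
    "Rural (10)": 77,
    "Unknown": 7,
}
us_tract_share = {
    "Metropolitan (1-3)": 80.1,
    "Micropolitan (4-6)": 9.0,
    "Small town (7-9)": 2.9,
    "Rural (10)": 7.6,
}

def over_index(dc_counts, us_tract_share):
    """Ratio of DC share to national tract share, per band, rounded to 2 places."""
    total = sum(dc_counts.values())  # includes unmatched DCs, as in the table
    result = {}
    for band, baseline_pct in us_tract_share.items():
        dc_pct = 100.0 * dc_counts[band] / total
        result[band] = round(dc_pct / baseline_pct, 2)
    return result

ratios = over_index(dc_counts, us_tract_share)
print(ratios)
```

A ratio above 1 means the band holds more DCs than its share of tracts would predict; below 1 means fewer.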

Per-RUCA-code drilldown

| RUCA | Description | DCs | Median HH income | Median pop density (per sq mi) | Median EIA gens (50km) |
| --- | --- | --- | --- | --- | --- |
| 1 | Metro core | 1,425 | $110,333 | 1,859 | 97 |
| 2 | Metro high-commute | 206 | $105,404 | 96 | 49 |
| 3 | Metro low-commute | 5 | $119,495 | 22 | 23 |
| 4 | Micropolitan core | 54 | $63,698 | 312 | 53 |
| 5 | Micropolitan high-commute | 22 | $72,465 | 191 | 51 |
| 6 | Micropolitan low-commute | 22 | $72,719 | 69 | 59 |
| 7 | Small town core | 14 | $87,522 | 2,336 | 40 |
| 8 | Small town high-commute | 1 | $69,074 | 36 | 41 |
| 10 | Rural area | 77 | $93,820 | 12 | 42 |

Two surprises:

  • Rural DCs (RUCA 10) sit in tracts with $93.8K median income, higher than micropolitan DCs ($63.7K–$72.7K). The rural DC sites are not poor rural America; they are wealthy-by-rural-standards counties chosen for power and water access.
  • Micropolitan-core DCs (RUCA 4) have the lowest median income at $63.7K — the closest thing to "economic-development DC siting" in the dataset.

5. Non-metro deep dive (190 DCs, RUCA 4–10)

Operators

| Operator | Non-metro DCs |
| --- | --- |
| Amazon Web Services | 67 |
| (null operator) | 50 |
| Meta | 20 (+ 2 as "Meta, Inc.") |
| Microsoft | 10 |
| Google | 4 |
| Rowan Green Data | 4 |
| NTT | 2 |
| Yahoo | 2 |
| Amazon AWS (dupe) | 2 |

The five hyperscalers (AWS, Meta, Microsoft, Google, Yahoo) account for 105 of 190 non-metro DCs (55%). If the 50 null-operator rows skew similarly hyperscale (likely — they're disproportionately in OR/WA), the share is probably closer to 75%.
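The sensitivity of that share to the null-operator rows is simple arithmetic. The sketch below treats the fraction of null rows that are really hyperscale as a free parameter (an assumption, since the source leaves it unconfirmed) and shows which fraction yields the 75% figure.

```python
NON_METRO_TOTAL = 190
NULL_OPERATOR = 50
named = {"AWS": 67, "Meta": 22, "Microsoft": 10, "Google": 4, "Yahoo": 2}

def hyperscale_share(null_fraction_hyperscale):
    """Hyperscale share of non-metro DCs if this fraction of null rows is hyperscale."""
    known = sum(named.values())  # 105 confirmed hyperscale DCs
    return (known + null_fraction_hyperscale * NULL_OPERATOR) / NON_METRO_TOTAL

lower_bound = hyperscale_share(0.0)   # no null rows are hyperscale
upper_bound = hyperscale_share(1.0)   # every null row is hyperscale
three_quarters = hyperscale_share(0.75)
print(round(lower_bound, 3), round(three_quarters, 3), round(upper_bound, 3))
```

Under these numbers the share ranges from about 55% to about 82%, and the 75% estimate corresponds to roughly three in four null-operator rows being hyperscale.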

States (post-backfill)

| State | Non-metro DCs |
| --- | --- |
| Oregon | 86 |
| Washington | 40 |
| Texas | 9 |
| New Mexico | 7 |
| North Carolina | 6 |
| Pennsylvania | 5 |
| Wisconsin | 4 |
| New York | 3 |
| Tennessee | 3 |
| Georgia | 3 |

Oregon + Washington = 126 (66%) of all non-metro DCs. This is the Columbia River basin: Prineville / Hermiston / Boardman / The Dalles (OR) and Quincy / East Wenatchee / Moses Lake (WA). The pull is hydroelectric power (cheap, low-carbon, abundant) and cool dry climate (free-cooling).

The state backfill clarified the rest of the non-metro tail: Texas (9) and Pennsylvania (5) were previously hidden in the null bucket. These likely represent shale-gas-adjacent builds (Permian and Marcellus respectively).


6. Energy footprint by operator (using EIA capacity within 50 km)

Aggregated across DCs in RUCA 2–10 (i.e. anything outside dense metro core, n=401):

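| Operator | DCs | States | Total nearby capacity (GW) | Median per site (GW) | Hydro (GW) | Nuclear (GW) | NG (GW) | Solar (GW) | Wind (GW) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AWS | 93 | 5 | 397 | 4.8 | 66 | 2.5 | 201 | 4.6 | 114 |
| (Unknown) | 118 | 26 | 339 | 2.3 | 86 | 35 | 135 | 23 | 19 |
| Meta | 51 | 11 | 120 | 2.0 | 4.9 | 0 | 61 | 16 | 0.3 |
| Microsoft | 26 | 6 | 113 | 3.4 | 28 | 13 | 39 | 9.1 | 8.1 |
| Google | 31 | 5 | 100 | 3.9 | 14 | 0 | 43 | 3.6 | 4.7 |
| Apple | 5 | 2 | 4 | 0.6 | 1.6 | 0 | 1.1 | 0.9 | 0.4 |
| Yahoo | 2 | 1 | 7 | 3.5 | 6.4 | 0 | 0 | 0 | 0.7 |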

Distinct hyperscaler strategies, visible in the fuel mix:

  • AWS has aggregated 114 GW of wind exposure across its 93 sites — by far the most renewable-coupled portfolio. Also heavy hydro (66 GW) from its OR/WA footprint and 201 GW of natural gas as baseline.
  • Microsoft has the highest nuclear exposure among named operators (12.6 GW), almost entirely from its Goodyear, AZ campuses near Palo Verde Nuclear.
  • Meta has the most solar (16 GW) among the named hyperscalers, but minimal nuclear or wind — consistent with its New Mexico (Los Lunas) and Iowa builds.
  • Google is split — moderate hydro and natural gas, modest renewables.

Largest non-metro grid neighborhoods (top sites by surrounding capacity)

| DC | Operator | Location | Nearby capacity | Fuel highlight |
| --- | --- | --- | --- | --- |
| PHX70 / PHX-10 / PHX-11 | Microsoft (Azure) | Goodyear, AZ (RUCA 2) | 14.0–14.6 GW | 4.2 GW nuclear (Palo Verde) + 6.4 GW gas + 2.2 GW solar |
| Stream PHX-1 | Stream Data Centers | Goodyear, AZ | 13.4 GW | Same Palo Verde / gas mix |
| T5 Charlotte Campus | T5 | Kings Mountain, NC (RUCA 6) | 12.9 GW | 4.9 GW nuclear (Catawba) + 5.5 GW gas + 1.5 GW coal |
| Apple Maiden | Apple | Maiden, NC (RUCA 2) | 9.1 GW | 2.4 GW nuclear + 4.6 GW gas |
| Percheron DC | Rowan Green Data | (Texas, RUCA 10) | 6.7 GW | 3.0 GW wind + 0.9 GW hydro + 2.4 GW gas |

7. State grid context — how DC-saturated is each top state?

Section 6 shows DC-adjacent capacity in absolute terms (GW), which is hard to interpret without knowing the size of the state grid. Using the EIA operating-generator-capacity table (OGC) for state-total generating capacity (period 2026-02, status OP) and SEDS series ESTCB for 2024 in-state electricity consumption, we can express each state's DC footprint as a share of its own grid.

The "DC-adjacent capacity" column sums distinct in-state generators (i.e., no double-counting) whose 50 km neighborhood includes at least one in-state data center.
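The "distinct generators" rule is what keeps this column from double-counting a plant that sits near several DCs. A minimal sketch, assuming the spatial join has already produced one row per (generator, DC) match within 50 km; the tuple layout and IDs are hypothetical.

```python
def dc_adjacent_capacity_mw(matches):
    """Sum capacity over distinct generators.

    matches: iterable of (generator_id, capacity_mw, dc_id) rows, one per
    in-state generator/DC pair found within 50 km.
    """
    capacity_by_generator = {}
    for generator_id, capacity_mw, _dc_id in matches:
        # A generator matched to many DCs is stored once, keyed by its ID.
        capacity_by_generator[generator_id] = capacity_mw
    return sum(capacity_by_generator.values())

def share_of_state_pct(adjacent, state_total):
    """DC-adjacent capacity as a percentage of the state total (same units)."""
    return 100.0 * adjacent / state_total

# Hypothetical join output: generator g1 is within 50 km of two DCs.
matches = [
    ("g1", 500.0, "dcA"),
    ("g1", 500.0, "dcB"),
    ("g2", 200.0, "dcA"),
]
naive_total = sum(mw for _gid, mw, _dc in matches)
distinct_total = dc_adjacent_capacity_mw(matches)
print(naive_total, distinct_total)
```

The naive sum counts g1 twice, which is why a per-DC total can exceed the size of the grid it draws on while the distinct-generator total cannot.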

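| State | DCs | State grid (GW) | State elec. consumption (TWh, 2024) | DC-adjacent capacity (GW) | % of state capacity within 50 km of a DC |
| --- | --- | --- | --- | --- | --- |
| VA | 378 | 30.8 | 138.0 | 15.4 | 50% |
| TX | 162 | 194.2 | 505.3 | 61.4 | 32% |
| CA | 147 | 105.1 | 245.6 | 51.6 | 49% |
| OR | 145 | 17.2 | 59.7 | 11.7 | 68% |
| OH | 103 | 34.4 | 153.7 | 12.7 | 37% |
| WA | 93 | 29.6 | 90.0 | 7.9 | 27% |
| AZ | 69 | 40.1 | 90.8 | 22.5 | 56% |
| IA | 65 | 24.6 | 54.9 | 4.9 | 20% |
| NJ | 62 | 17.8 | 73.5 | 14.7 | 83% |
| IL | 61 | 51.7 | 133.2 | 17.4 | 34% |
| GA | 50 | 42.3 | 150.0 | 14.2 | 34% |
| NY | 48 | 42.7 | 140.5 | 25.8 | 61% |
| NV | 41 | 18.7 | 40.7 | 14.0 | 75% |
| TN | 32 | 23.3 | 102.9 | 16.4 | 70% |
| NC | 31 | 38.9 | 136.9 | 17.4 | 45% |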

The DC-saturation reordering. Virginia leads in raw DC count (378), but four states have grids where more than two-thirds of all in-state generating capacity sits within 50 km of a data center:

  • New Jersey — 83%. Effectively the entire state's electrical economy is DC-adjacent. NJ's 62 DCs are NYC-metro carrier hotels concentrated in a small geographic footprint relative to a small state grid (17.8 GW).
  • Nevada — 75%. Las Vegas and Reno DCs co-locate with the gas-and-solar generation that serves Las Vegas urbanization. NV has a small grid (18.7 GW) and most of it serves the same two metros.
  • Tennessee — 70%. Nashville + Memphis DCs sit near TVA's central generation belt.
  • Oregon — 68%. Even though OR's DC cluster is mostly non-metro (Boardman / Hermiston / The Dalles), the Columbia hydro corridor serving them accounts for two-thirds of OR's 17.2 GW grid. This is the only state where the saturation comes from rural hyperscale builds rather than urban carrier hotels.

The opposite end. Iowa (20%) has 65 DCs but they all cluster around Council Bluffs / Des Moines, leaving the rural wind belt that dominates IA's grid unrelated to DC siting. Washington (27%) is similar — the Quincy hyperscale cluster is small relative to WA's Columbia hydro and Puget-area generation.

Why the proportional view matters. A 1 GW DC load lands very differently on the NJ grid (5.6% of total capacity) than on the TX grid (0.5%). Reliability, transmission-queue interconnection waits, and political pushback all scale with the proportional draw, not the absolute MW. By that yardstick, the canonical "VA dominates US DCs" story is incomplete — VA, NJ, OR, NV, TN, NY, and AZ are the states where the DC industry is structurally entangled with the grid, and where any large new build runs into capacity-share constraints first.
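The proportional-draw figures can be checked directly from the §7 table. The sketch below hard-codes the state grid sizes and DC-adjacent capacities reported there; the threshold helper is an illustrative addition, and it rounds to whole percentages the same way the table does.

```python
# State grid size and DC-adjacent capacity in GW, copied from the §7 table.
state_grid_gw = {
    "VA": 30.8, "TX": 194.2, "CA": 105.1, "OR": 17.2, "OH": 34.4,
    "WA": 29.6, "AZ": 40.1, "IA": 24.6, "NJ": 17.8, "IL": 51.7,
    "GA": 42.3, "NY": 42.7, "NV": 18.7, "TN": 23.3, "NC": 38.9,
}
dc_adjacent_gw = {
    "VA": 15.4, "TX": 61.4, "CA": 51.6, "OR": 11.7, "OH": 12.7,
    "WA": 7.9, "AZ": 22.5, "IA": 4.9, "NJ": 14.7, "IL": 17.4,
    "GA": 14.2, "NY": 25.8, "NV": 14.0, "TN": 16.4, "NC": 17.4,
}

def load_share_pct(load_gw, state):
    """A new load as a percentage of the state's total generating capacity."""
    return 100.0 * load_gw / state_grid_gw[state]

def states_at_or_above(threshold_pct):
    """States whose rounded DC-adjacent share meets the threshold."""
    return sorted(
        s for s in state_grid_gw
        if round(100.0 * dc_adjacent_gw[s] / state_grid_gw[s]) >= threshold_pct
    )

nj_share = round(load_share_pct(1.0, "NJ"), 1)
tx_share = round(load_share_pct(1.0, "TX"), 1)
entangled = states_at_or_above(50)
print(nj_share, tx_share, entangled)
```

With a 50% threshold this returns the same seven states named in the paragraph above; raising it to 67 narrows the list to NJ, NV, OR, and TN.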


8. Watershed (HUC8) concentration

Each DC sits in exactly one USGS HUC8 watershed (8-digit hydrologic unit, subbasin scale, median ~3,000 sq km). Cooling-water draw and wastewater discharge happen at watershed scale, not state scale — a single stressed basin can cap an entire DC corridor regardless of how big the state's overall water budget is.

Where the 1,831 matched DCs land

  • 257 distinct HUC8s hold at least one DC — that's only 12% of the 2,139 US watersheds (the other 88% have zero data centers).
  • The top 1 watershed alone (Middle Potomac-Catoctin) holds 235 DCs — 12.8% of the entire US data-center footprint.
  • DC concentration is much more extreme at the watershed level than at the state level. Virginia has 20.6% of US DCs; the single Loudoun watershed holds 12.8%.

Cumulative concentration

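| Top N watersheds | DCs | Share of all US DCs |
| --- | --- | --- |
| 1 | 235 | 12.8% |
| 2 | 346 | 18.9% |
| 3 | 434 | 23.7% |
| 5 | 551 | 30.1% |
| 10 | 736 | 40.2% |
| 15 | 887 | 48.4% |
| 20 | 1,012 | 55.3% |
| 30 | 1,186 | 64.8% |
| 50 | 1,380 | 75.4% |
| 100 | 1,611 | 88.0% |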

Nearly half of all US data centers (48.4%) live in just 15 watersheds, and three-quarters in 50. Water stress, drought policy, or thermal-discharge limits in any one of these basins propagates to a large share of the national footprint.
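The cumulative shares follow from ranking watersheds by DC count and taking a running sum. A short sketch using the 15 largest per-watershed counts reported in this section; the denominator of 1,831 matched DCs is an assumption that reproduces the published percentages.

```python
# DC counts for the 15 largest HUC8 watersheds, as reported in this section.
top_watershed_counts = [235, 111, 88, 73, 44, 40, 39, 37, 36, 33, 32, 31, 30, 29, 29]
MATCHED_DCS = 1831  # assumed denominator: DCs with a watershed match

def cumulative_share_pct(counts, total, top_n):
    """Share of all DCs held by the top_n watersheds, in percent."""
    ranked = sorted(counts, reverse=True)
    return 100.0 * sum(ranked[:top_n]) / total

curve = {n: round(cumulative_share_pct(top_watershed_counts, MATCHED_DCS, n), 1)
         for n in (1, 2, 3, 5, 10, 15)}
print(curve)
```

The same function works for any cutoff once the full list of 257 watershed counts is supplied.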

Top 15 watersheds by DC count

| HUC8 | Name | States | DCs | Cluster |
| --- | --- | --- | --- | --- |
| 02070008 | Middle Potomac-Catoctin | DC, MD, VA, WV | 235 | Loudoun / Ashburn (DC-Alley) |
| 02070010 | Middle Potomac-Anacostia-Occoquan | DC, MD, VA | 111 | Fairfax + inner Loudoun + DC |
| 18050003 | Coyote | CA | 88 | Silicon Valley / San Jose |
| 05060001 | Upper Scioto | OH | 73 | Columbus (fastest-growing market) |
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | Boardman / Hermiston (hyperscale hydro) |
| 17020015 | Lower Crab | WA | 40 | Quincy / Moses Lake (hyperscale hydro) |
| 17090010 | Tualatin | OR | 39 | Hillsboro (Intel / Google) |
| 12030105 | Upper Trinity | TX | 37 | DFW |
| 10230006 | Big Papillion-Mosquito | IA, NE | 36 | Council Bluffs / Omaha (Meta) |
| 07120004 | Des Plaines | IL, WI | 33 | Chicago metro |
| 12100302 | Medina | TX | 32 | San Antonio |
| 02030105 | Raritan | NJ | 31 | Central NJ carrier hotels |
| 15050100 | Middle Gila | AZ | 30 | Phoenix metro |
| 02030103 | Hackensack-Passaic | NJ, NY | 29 | NYC metro east |
| 17070103 | Umatilla | OR | 29 | AWS-only (all 29 DCs); pure single-operator basin |

The non-metro / hyperscale Pacific Northwest story is visible at watershed scale: three Columbia-system watersheds (Middle Columbia-Lake Wallula, Lower Crab, Umatilla) hold 113 DCs combined, all hyperscale-dominated. Umatilla is operationally AWS-exclusive — all 29 DCs in that basin are AWS.

Non-metro watersheds (RUCA 4–10): where hyperscalers cluster

| HUC8 | Name | States | DCs | Operators |
| --- | --- | --- | --- | --- |
| 17070101 | Middle Columbia-Lake Wallula | OR, WA | 44 | AWS (multiple variants), Rowan Green Data |
| 17020015 | Lower Crab | WA | 40 | CyrusOne, Intuit, Microsoft, NTT, Sabey, Yahoo |
| 17070103 | Umatilla | OR | 29 | AWS only |
| 17070305 | Lower Crooked | OR | 8 | Meta (Prineville) |
| 13020203 | Rio Grande-Albuquerque | NM | 7 | Meta (Los Lunas) |
| 03050105 | Upper Broad | NC, SC | 6 | Meta |
| 13070001 | Lower Pecos-Red Bluff Reservoir | NM, TX | 5 | IONIC Digital |
| 17070105 | Middle Columbia-Hood | OR, WA | 4 | Google (The Dalles) |
| 02050107 | Upper Susquehanna-Lackawanna | PA | 3 | AWS |
| 03070103 | Upper Ocmulgee | GA | 2 | Meta |

This view is the cleanest evidence yet of the hyperscale geographic strategy — single-operator capture of individual watersheds (Umatilla = AWS, Lower Crooked = Meta, Middle Columbia-Hood = Google, Rio Grande-Albuquerque = Meta). Each of these basins has effectively been claimed by one player.

Implications for water-stress analysis

This watershed view is a boundary set for downstream water-stress analysis. Pull WaterWatch streamflow data, USGS water-use estimates, or EPA drought indicators against just these 257 HUC8s (or against just the top 15 for the highest-leverage story). A single-pull stress index against this set would size the "water exposure" of the entire US DC fleet.


9. Database inventory (data_centers schema public)

All tables in the working database as of 2026-05-18. "Used here" = referenced in §1–§8 of this report. PostGIS internal tables (spatial_ref_sys, geography_columns, geometry_columns) are omitted.

Data center inventory and clustering

Table Rows Used here Description
master_data_centers 1,833 Unified, deduplicated DC inventory — the canonical row-per-DC table joining curated, OSM, and sample sources via master_id.
osm_data_centers 1,549 Raw OSM-derived DC features (nodes/ways tagged as data centers), one of the inputs to master_data_centers.
us_dc_sample_geocoded 1,489 Earlier sample-list DC inventory with geocoding lineage (Nominatim + Census TIGER), superseded by master_data_centers but retained for provenance.
data_centers_union (view) Convenience view unioning the curated and OSM source rows with a source discriminator.
master_data_center_spatial_clusters 1,831 DBSCAN cluster assignment per DC (cluster_id, noise flag), used in §3.

Per-DC join tables

Table Rows Used here Description
data_center_census_tracts_2024 1,815 One row per DC with attached ACS-2024 demographics from its containing tract — the master demographic join.
data_center_watershed_huc8 1,833 One row per DC with its containing USGS HUC8 watershed (huc8, name, states, area), built 2026-05-18 via ST_Within.

Base geographic / demographic layers

Table Rows Used here Description
_dc_census_tract_acs_2024 85,382 Staging: ACS-2024 5-year profile attributes for every US tract that contains a DC (and surrounding tracts for context).
_dc_census_tract_boundaries_2024 85,058 Staging: TIGER 2024 tract polygons for the DC-tract universe.
ruca_codes_2020_tract 85,528 USDA RUCA 2020 codes per tract, the metro/micropolitan/rural classification used in §4–§5.
watershed_huc8 2,139 USGS Watershed Boundary Dataset HUC8 subbasin polygons (median ~3,000 km²) covering CONUS + AK.
_watershed_huc8_stage 369 Staging table from an earlier partial WBD load, superseded by watershed_huc8. Candidate for cleanup.
census_tract_huc8_link 806 Tract↔HUC8 spatial overlap table (with overlap %) for the subset of tracts containing a DC. Useful for downstream tract-level water-stress joins.

Energy data

Table Rows Used here Description
energy_eia_operating_generator_capacity_flat 4.7M EIA Form-860 operating generator inventory, monthly 2008–2026, with nameplate / summer / winter MW and point geometry. Source for §6 and §7 capacity figures.
energy_eia_seds_flat 2.57M EIA SEDS annual state energy series 1960–2024 (consumption, prices, expenditures by sector / fuel). Source for §7 state electricity consumption (ESTCB, 2024). Backfilled 2026-05-18.
energy_atlas_layers_catalog ~5 Metadata catalog of EIA layers ingested by ingest_eia_energy_layers.py (table name, source URL, import timestamp).
im3_state_projected_moderate_50 328 PNNL IM3 projected DC siting under the moderate-growth scenario at gravity-weight 0.50 — one row per projected facility (cost, IT MW, cooling-water demand, lat/lon). Loaded but unused.
im3_projected_state_demand_summary 31 State-level rollup of IM3 projected facility counts, IT MW, and cooling demand. Loaded but unused.
seds_national_msn_year 0 Empty placeholder for national SEDS time-series; superseded by energy_eia_seds_flat. Drop candidate.
seds_state_msn_year 0 Empty placeholder for state SEDS time-series; superseded by energy_eia_seds_flat. Drop candidate.
utility_rate_tracker_2025_2028 374 Utility rate-increase tracker by provider × state × service type, with effective dates and monthly $ + % increases. Loaded but unused in the demographic/energy analysis.

Connectivity (submarine cables, exchange capacity)

Table Rows Used here Description
internet_cables 693 Submarine cable routes (geometry, RFS year, decommission year, owners, length km) from TeleGeography-style data.
internet_cable_landing_points 3,361 Cable landing points (country, name, TBD flag) — endpoint nodes for internet_cables.
internet_cable_meta 2 Source-provenance metadata for the cable dataset (key/value).
internet_cable_year_summaries 58 Year-by-year narrative descriptions of cable activity.
internet_city_dominance 4,552 City-level physical capacity (Tbps), logical-dominance IP count, and top ASNs — proxy for internet-hub strength of each candidate DC city.

Other

Table Rows Used here Description
opposition_cases_geocoded 18 Geocoded community-opposition cases against DC builds (developer, investment $B, outcome, governance response). Loaded but unused — see next-steps item #5.

Cleanup candidates. _watershed_huc8_stage, seds_national_msn_year, seds_state_msn_year, and possibly us_dc_sample_geocoded are superseded by their canonical counterparts and could be dropped to reduce confusion.


Data quality flags

  1. master_data_centers.power_mw is populated for only 108 / 1,833 DCs (5.9%). Useless as a sizing metric without imputation or alternative source. Nearby EIA capacity is the more reliable proxy (used as the per-DC scale in this analysis). A grant-funded scrape of Baxtel / Data Center Map would close this gap.
  2. 50 of 190 non-metro DCs (26%) have null operator. Likely hyperscalers based on geography (OR/WA) but unconfirmed.
  3. Operator-string fragmentation: "Meta" vs. "Meta, Inc."; "Amazon Web Services" vs. "Amazon AWS" vs. "amazon web services"; "Microsoft" vs. "Microsoft Azure". Inflates distinct-operator counts and fragments per-operator totals.
  4. avg_household_size column had sentinel pollution. Resolved 2026-05-18: 1,089 sentinel values (-666,666,666) in _dc_census_tract_acs_2024 and 16 in data_center_census_tracts_2024 replaced with NULL. Affected 29 DCs.
  5. 7 DCs failed RUCA join — Puerto Rico tracts or non-US locations; trivial.
  6. EIA generator coordinates had a longitude sign error for 2008-01 through 2010-11 (~11K rows with positive lower-48 longitudes). The flat-table build at ingest_eia_energy_layers.py:839-870 corrects this in longitude and geom, so spatial joins are unaffected.
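The longitude correction in flag 6 amounts to flipping the sign of positive longitudes for generators in the lower 48. The sketch below is an illustration of that idea only; it is not the code at ingest_eia_energy_layers.py:839-870, and the function name, state-code argument, and exclusion list are assumptions.

```python
# States and territories outside the lower 48, left untouched because a
# positive longitude can be legitimate there (e.g. the western Aleutians, Guam).
NON_CONUS = {"AK", "HI", "PR", "GU", "VI", "AS", "MP"}

def fix_lower48_longitude(lon, state):
    """Return a western-hemisphere longitude for lower-48 generators."""
    if lon is None:
        return None
    if state not in NON_CONUS and lon > 0:
        # Every point in the lower 48 has a negative longitude, so a
        # positive value can only be a sign error.
        return -lon
    return lon

print(fix_lower48_longitude(112.4, "AZ"))   # flipped
print(fix_lower48_longitude(-77.5, "VA"))   # unchanged
print(fix_lower48_longitude(179.1, "AK"))   # unchanged
```

Any point geometry derived from the coordinates has to be rebuilt after the fix, which matches the note that both longitude and geom are corrected.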

Suggested next steps

  1. Backfill power_mw from Baxtel / Data Center Map (paid scrape — grant work).
  2. Operator-string deduplication — collapse "Meta"/"Meta, Inc.", "AWS" variants, etc., before any per-operator analysis.
  3. Water-stress overlay against the 257 watersheds — now that the HUC8 join is in place, pull USGS WaterWatch streamflow data, USGS water-use estimates, or EPA drought-status indicators against this watershed set. A single stress index per HUC8 would size the entire US fleet's water exposure.
  4. Forward-projected demand overlay — the static SEDS / OGC capacity-share view in §7 is a snapshot. Joining im3_state_projected_moderate_50 against the §7 saturation table would let us flag which already-saturated states (NJ, NV, TN, OR) are projected to need the most additional generation before 2050.
  5. Opposition cases overlay: opposition_cases_geocoded is loaded but unused; could test whether cluster-vs-isolated demographic differences (or watershed concentration) predict community opposition.
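For next step 2, a lookup-based normalizer is probably enough to start with. This sketch covers only the operator-string variants named in the data quality flags; the canonical labels and the decision to fold "Microsoft Azure" into "Microsoft" are assumptions, and unknown strings pass through unchanged.

```python
import re
from collections import Counter

# Lowercased variant -> canonical label (covers only the variants listed above).
CANONICAL_OPERATOR = {
    "meta": "Meta",
    "meta, inc.": "Meta",
    "amazon web services": "AWS",
    "amazon aws": "AWS",
    "aws": "AWS",
    "microsoft": "Microsoft",
    "microsoft azure": "Microsoft",
}

def normalize_operator(raw):
    """Map an operator string to its canonical label; None stays None."""
    if raw is None or not raw.strip():
        return None
    key = re.sub(r"\s+", " ", raw.strip().lower())  # collapse case and spacing
    return CANONICAL_OPERATOR.get(key, raw.strip())

# Non-metro counts by raw operator string, from the §5 operator table.
raw_counts = {
    "Amazon Web Services": 67, "Amazon AWS": 2,
    "Meta": 20, "Meta, Inc.": 2,
    "Microsoft": 10, "Google": 4,
}
merged = Counter()
for name, n in raw_counts.items():
    merged[normalize_operator(name)] += n
print(dict(merged))
```

An explicit lookup is easier to audit than fuzzy matching, at the cost of needing a new entry for each variant that turns up.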