US Data Centers - Geospatial Research Infrastructure
A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.
Documentation
- Database Tables - Complete database schema with table descriptions, column definitions, and SQL examples
- Database Table Previews - Research-team-friendly Markdown previews showing the first rows of each documented table
- Research Ideas - Future research directions, data improvements, and potential collaborations
- SQL Queries - Pre-built legislative analysis queries
Project Overview
This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:
- Spatial concentration patterns: Where are data centers located and why?
- Infrastructure dependencies: How do data centers relate to submarine cables, power grids, and watersheds?
- Equity and impact: Do data center host communities bear localized burdens while benefits are nationally dispersed?
- Demographics: Who lives in data center-hosting census tracts?
- Environmental exposure: What are the water, energy, and natural hazard exposures?
Key Research Question
Do data centers represent "concentrated costs / dispersed benefits"? The hypothesis is that host communities bear localized infrastructure burdens (power, water, land use) while the benefits of cloud computing are dispersed nationally.
Major Findings
Spatial Concentration
- State level: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
  - Virginia alone: 20.6% (378 of 1,833 facilities)
- Tract level: Top 1% of data center-hosting census tracts (7 tracts) hold 14.5% of all facilities
  - Only 2.9% of residents of data center-hosting states live in a hosting tract
  - Per-capita burden is ~35× higher in host tracts than the state average
  - Top host tract: Loudoun County CT 6110.20 (69 DCs, MHI $141K, 64% BA+)
- Watershed level: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
  - Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
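The shares above are cumulative-share calculations over per-unit facility counts (per state, tract, or watershed). A minimal sketch of that arithmetic, assuming a hypothetical mapping from unit ID to facility count; the toy IDs and counts are illustrative, not project data, and the function names are not taken from the repository:

```python
def top_k_share(counts, k):
    """Share of all facilities held by the k units with the most facilities."""
    ranked = sorted(counts.values(), reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


def units_holding_share(counts, share=0.5):
    """Smallest number of top-ranked units whose combined count reaches `share`."""
    ranked = sorted(counts.values(), reverse=True)
    total = sum(ranked)
    running = 0
    for n, count in enumerate(ranked, start=1):
        running += count
        if running >= share * total:
            return n
    return len(ranked)


# Toy counts per watershed (hypothetical IDs and values).
toy = {"huc_a": 50, "huc_b": 30, "huc_c": 10, "huc_d": 6, "huc_e": 4}
print(top_k_share(toy, 2))        # share held by the two largest units
print(units_holding_share(toy))   # units needed to reach half of all facilities

# Virginia's share, from the figures quoted above (378 of 1,833 facilities).
print(round(378 / 1833 * 100, 1))
```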
Demographics of Host Communities
Compared to the US average, data center host communities are:
- Wealthier: Median household income $103,623 (vs. $78,538, +32%)
- More educated: 49% bachelor's+ (vs. 35%, +14 pp)
- More diverse: 51% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
- Better connected: 94.9% broadband (vs. 89%)
- Politically blue: 59.3% Biden 2020 — affluent tech-corridor suburbs, not resentment communities
- Income gradient by urbanicity: Metro avg MHI $119K → Rural/Small Town $83K → Micropolitan $67K
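The comparisons above mix two kinds of gap: the income figure is a relative difference (percent), while the education figure is an absolute difference between two percentages (percentage points). A short sketch of both, using only the numbers quoted in the list; the helper names are illustrative:

```python
def pct_change(value, baseline):
    """Relative difference of `value` from `baseline`, in percent."""
    return (value - baseline) / baseline * 100


def pp_diff(share, baseline_share):
    """Absolute difference between two percentages, in percentage points."""
    return share - baseline_share


print(round(pct_change(103_623, 78_538)))  # income gap, relative (percent)
print(pp_diff(49, 35))                     # bachelor's+ gap, absolute (pp)
```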
Infrastructure Insights
- ~90% of data centers are in metropolitan tracts (RUCA 2020); ~10% non-metro
- Non-metro data centers (~10%) are dominated by hyperscalers:
  - AWS, Meta, Microsoft, Google = 55.8% of non-metro facilities
  - 66.4% are in Oregon + Washington (45.3% OR + 21.1% WA; Columbia River hydro corridor)
- Grid saturation: 4 states have >2/3 of generation within 50 km of a data center:
  - New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%
- Hyperscaler energy strategies (non-metro sites):
  - AWS: 114 GW wind + 66 GW hydro
  - Microsoft: 13 GW nuclear (Palo Verde co-location)
  - Meta: 16 GW solar
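The metro / non-metro split at the top of this list depends on how RUCA primary codes are grouped. A minimal sketch, assuming the conventional grouping of primary codes (1-3 metropolitan, 4-6 micropolitan, 7-10 small town and rural, anything else unclassified); the exact grouping used by the project's scripts is not shown in this README, and the function names are illustrative:

```python
from collections import Counter


def ruca_class(primary_code):
    """Map a numeric primary RUCA code to a coarse urbanicity class."""
    if primary_code is None:
        return "Unclassified"      # e.g. a facility whose RUCA join failed
    code = int(primary_code)       # drops any secondary-code decimal (1.1 -> 1)
    if 1 <= code <= 3:
        return "Metro"
    if 4 <= code <= 6:
        return "Micropolitan"
    if 7 <= code <= 10:
        return "Rural/Small Town"
    return "Unclassified"          # e.g. 99 (not coded)


def class_shares(codes):
    """Fraction of facilities in each urbanicity class."""
    counts = Counter(ruca_class(c) for c in codes)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Toy codes, one per facility (illustrative only).
toy_codes = [1, 1, 2, 1, 4, 10, 1, 1, 1, 99]
print(class_shares(toy_codes))
```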
Clustered vs. Isolated Facilities
Facilities in DBSCAN clusters differ significantly from isolated sites:
- $35K income gap: Clustered sites in tracts with median income $108K vs. $73K for isolated
- +18 pp education: 51% bachelor's+ vs. 33%
- More diverse: 25 pp less non-Hispanic white
- 2× energy infrastructure: 89 vs. 40 generators within 50 km
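The clustered / isolated split comes from DBSCAN: facilities that belong to a dense group get a cluster ID, and the rest are labelled as noise (isolated). The notebooks presumably rely on scikit-learn, which is among the listed dependencies; the dependency-free sketch below only illustrates the technique with great-circle distances. The `eps_km` and `min_samples` values and the toy coordinates are assumptions, not the project's settings.

```python
import math

EARTH_RADIUS_KM = 6371.0088


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def dbscan(points, eps_km, min_samples):
    """Return one label per point: a cluster ID (0, 1, ...) or -1 for isolated."""
    n = len(points)
    # Brute-force neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if haversine_km(points[i], points[j]) <= eps_km]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1               # noise for now; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels


# Three nearby toy points in northern Virginia plus one distant point in Oregon.
toy_points = [(39.04, -77.49), (39.02, -77.46), (39.00, -77.45), (45.60, -121.20)]
print(dbscan(toy_points, eps_km=10, min_samples=3))
```

The brute-force neighbour search is quadratic in the number of facilities, which is acceptable for a few thousand points but not a substitute for an indexed implementation.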
Submarine Cables
- Data centers are NOT systematically closer to cables than ordinary US cities
- Only 21.4% of data centers are within 100 km of a submarine cable landing point
- Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables
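The 100 km figure is a nearest-landing-point test applied to every facility. A minimal sketch of that test, assuming facilities and landing points are available as (lat, lon) pairs in degrees; the function names and toy coordinates are illustrative, and the project itself may compute this inside PostGIS instead:

```python
import math

EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    p1, l1, p2, l2 = map(math.radians, (lat1, lon1, lat2, lon2))
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin((l2 - l1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def share_within(facilities, landings, radius_km=100.0):
    """Fraction of facilities with at least one landing point within radius_km."""
    if not facilities:
        return 0.0
    near = sum(
        1 for f in facilities
        if any(haversine_km(f[0], f[1], l[0], l[1]) <= radius_km for l in landings)
    )
    return near / len(facilities)


# Toy data: one inland facility, one coastal facility, one coastal landing point.
toy_facilities = [(39.04, -77.49), (36.85, -75.98)]
toy_landings = [(36.85, -75.97)]
print(share_within(toy_facilities, toy_landings))
```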
Data Sources
Primary Data Center Inventories
- Curated Sample: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
- OpenStreetMap: 1,549 OSM features tagged as data centers (via Overpass API)
- IM3 Model Data: PNNL's Integrated Multisector Multiscale Modeling existing facilities
- Master Table: 1,833 deduplicated facilities merging all sources
Geospatial Context Layers
- US Census: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
- USDA RUCA 2020: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
- USGS HUC8 Watersheds: 2,139 subbasin polygons for water-stress analysis
- FEMA NRI: National Risk Index with 18 natural hazard risk scores by census tract
Infrastructure Layers
- Submarine Cables: 693 cables, 3,361 landing points (TeleGeography-style)
- EIA Energy Data: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
- FCC Broadband Data: Provider availability by location/block
Additional Data
- RDH Precinct Vote Data: Election results for political-economy analysis
- NOAA HMS Smoke Data: Wildfire smoke exposure (2005-2025)
- USDM Drought Data: Drought severity
- Utility Rate Tracker: State-level electricity rate increases
- LegiScan Legislative Data: All US state + federal bills 2016–2026 (1.3M bills, 646 sessions), tagged for data center, ratepayer, grid, tax, and siting topics
Repository Structure
Core Python Scripts
Data Ingestion (scripts/)
- scripts/load_postgis_data_centers.py - Load curated data center CSV into PostGIS
- scripts/load_postgis_osm_data_centers.py - Fetch OSM data centers via Overpass API
- scripts/build_master_data_centers.py - Deduplicate & merge curated + OSM sources
- scripts/load_postgis_internet_cables.py - Load submarine cables and landing points
- scripts/ingest_eia_energy_layers.py - Ingest EIA energy data via API
- scripts/build_watershed_huc8_tables.py - Load USGS HUC8 watersheds
- scripts/ingest_legiscan.py - Download all US state/federal bills 2016–2026 via LegiScan API, tag for data center research topics
Enrichment
- scripts/create_data_center_census_tract_table.py - Join data centers to Census tracts with ACS demographics
- scripts/build_fcc_bdc_broadband_connection_table.py - Build per-facility broadband provider table
- scripts/build_fcc_bdc_location_provider_aggregates.py - Aggregate FCC BDC data by county/tract
Analysis
- scripts/analyze_dc_tract_concentration.py - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
- scripts/analyze_cables_concentration.py - Test if data centers cluster near submarine cables
- scripts/make_data_center_map.py - Generate Leaflet map of data centers
- scripts/make_internet_cables_map.py - Generate Leaflet map of data centers + cables
Key Jupyter Notebooks
- spatial_clustering_master_data_centers.ipynb - DBSCAN clustering of data centers
- cluster_analysis.ipynb - Main demographic/energy/watershed analysis
- fema_nri_data_centers.ipynb - Join data centers to FEMA hazard scores
- rdh_precinct_vote_data_centers.ipynb - Join data centers to election data
- usdm_drought_data_centers.ipynb - Drought exposure analysis
- hms_smoke_data_centers.ipynb - Wildfire smoke exposure
- enhanced_data_center_cluster_map.ipynb - Generate enhanced cluster visualization
Output Files
- output/data_center_demographic_ruca_energy_summary.md - Flagship analysis report
- cables_concentration_report.md - Cable proximity + cost/benefit concentration analysis
- data_center_map.html - Basic data center locations (Leaflet)
- data_centers_cables_map.html - Data centers + submarine cables (Leaflet)
- output/enhanced_master_data_center_spatial_clusters_map.html - Enhanced cluster visualization
Technical Architecture
Database
- PostgreSQL 13+ with PostGIS 3.x
- Database name: data_centers
- See database-tables.md for complete schema documentation and database-table-heads.md for readable sample rows
Python Environment
- Python 3.10+
- Key libraries: psycopg2, geopandas, shapely, scikit-learn, pandas, numpy, requests, folium
Data Formats
- CSV (raw data exports)
- GeoJSON (watershed/cluster geometries)
- Shapefiles (Census, USGS, FEMA inputs)
- HTML (interactive Leaflet maps)
Configuration
Credentials stored in ~/.zsh_secrets, loaded via environment variables:
- PGWEB_*: PostgreSQL connection
- EIA_API_KEY: EIA energy data
- FCC_USERNAME, FCC_API_KEY: FCC broadband data
- RDH_USERNAME, RDH_PASSWORD: Redistricting Data Hub
- CENSUS_API_KEY: Census ACS API
- LEGISCAN_API_KEY: LegiScan legislative data
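Before running the ingestion scripts it can help to check that these variables are actually present in the environment. A minimal sketch, assuming the secrets file has already been sourced into the shell; the README does not spell out the individual PGWEB_* names, so the sketch collects whatever carries that prefix rather than guessing them, and both helper functions are illustrative, not part of the repository:

```python
import os

# API credentials named in this section.
REQUIRED_KEYS = (
    "EIA_API_KEY", "FCC_USERNAME", "FCC_API_KEY",
    "RDH_USERNAME", "RDH_PASSWORD", "CENSUS_API_KEY", "LEGISCAN_API_KEY",
)


def pg_settings(environ=os.environ):
    """Collect every PGWEB_* variable, keyed by the lowercased part after the prefix."""
    prefix = "PGWEB_"
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


def missing_keys(environ=os.environ):
    """Names of required credentials that are unset or empty."""
    return [key for key in REQUIRED_KEYS if not environ.get(key)]


if __name__ == "__main__":
    print("PostgreSQL settings found:", sorted(pg_settings()))
    print("Missing credentials:", missing_keys())
```

Only setting names are printed, never values, so the output is safe to paste into an issue or log.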
Quick Start
Basic Rebuild Sequence
# 1. Load base data center data
python3 scripts/load_postgis_data_centers.py
python3 scripts/load_postgis_osm_data_centers.py
python3 scripts/build_master_data_centers.py
# 2. Enrich with context layers
python3 scripts/create_data_center_census_tract_table.py --replace-final
python3 scripts/load_postgis_internet_cables.py
python3 scripts/ingest_eia_energy_layers.py --category power
python3 scripts/build_watershed_huc8_tables.py
# 3. Run analyses
python3 scripts/analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 scripts/analyze_cables_concentration.py > output/cables_analysis.txt
# 4. Execute notebooks
jupyter notebook cluster_analysis.ipynb
# 5. Load legislation (all states, 2016-2026)
python3 scripts/ingest_legiscan.py --all
# Weekly refresh (skips unchanged sessions):
python3 scripts/ingest_legiscan.py --fetch --load
Generate Maps
python3 scripts/make_data_center_map.py
python3 scripts/make_internet_cables_map.py
Key Outputs
Research Reports
- Demographic, Energy & Watershed Analysis: output/data_center_demographic_ruca_energy_summary.md
- Submarine Cable Proximity: cables_concentration_report.md
Interactive Maps
- Data center locations with cluster assignments
- Submarine cable routes and landing points
- Energy infrastructure proximity
- Watershed boundaries with data center counts
Data Exports
- master_data_center_spatial_cluster_points.csv - Data center points with cluster IDs
- master_data_center_spatial_cluster_summary.csv - Cluster-level statistics
- output/master_data_center_huc8_watersheds.geojson - Watershed polygons
- output/master_data_center_map_context.csv - Full context for mapping
- output/master_data_center_state_energy_context.csv - State-level energy statistics
Data Quality Notes
- Incomplete power ratings: Only 5.9% of data centers have power ratings (108/1,833)
- Operator fragmentation: String variations ("Meta" vs. "Meta, Inc.", AWS variants) inflate distinct-operator counts
- 45 facilities use city-precision fallback coordinates (approximate tract assignment)
- 7 facilities failed RUCA join (Puerto Rico / non-US)
- Broadband subscribers are a coarse benefit proxy (actual cloud users are global)
- EIA longitude correction: 2008-2010 generator coordinates had sign errors, corrected in flat-table build
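The longitude correction in the last note can be pictured as a sign flip on affected records. This is a sketch under stated assumptions, not the flat-table build itself: it assumes the error was a missing minus sign on western-hemisphere longitudes and that only the 2008-2010 report years are affected. A real build would need to exempt sites that legitimately have positive longitudes (for example Guam or the western Aleutians).

```python
def fix_longitude_sign(lon, year, affected_years=range(2008, 2011)):
    """Flip a positive longitude to negative for records from affected years."""
    if lon is None:
        return None                 # leave missing coordinates untouched
    if year in affected_years and lon > 0:
        return -lon
    return lon


print(fix_longitude_sign(77.49, 2009))    # affected year, wrong sign
print(fix_longitude_sign(-77.49, 2009))   # affected year, already correct
print(fix_longitude_sign(77.49, 2015))    # outside the affected years
```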
Known Limitations
- Power capacity: Only 5.9% populated - nearby EIA generator capacity used as proxy
- Operator strings: Need deduplication (50 of 190 non-metro facilities have null operator)
- Benefit measurement: Broadband subscribers are an imperfect proxy for cloud computing benefits
- Universe: Limited to 46 DC-host states (excludes DC-free states from ACS comparison)
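Operator deduplication could start from a small normalization step that strips corporate suffixes and maps known variants to one canonical name. A minimal sketch; the suffix list and the alias table are hypothetical starting points, not the project's actual mapping:

```python
import re

# Trailing corporate suffixes, with an optional comma and period ("Meta, Inc.").
_SUFFIX = re.compile(r"[,\s]+(inc|llc|corp|corporation|co|ltd)\.?$", re.IGNORECASE)

# Hypothetical alias table: lowercased variant -> canonical operator name.
ALIASES = {
    "amazon web services": "AWS",
    "amazon": "AWS",
    "aws": "AWS",
    "meta": "Meta",
    "meta platforms": "Meta",
    "facebook": "Meta",
}


def normalize_operator(raw):
    """Return a canonical operator name, or None for null/blank input."""
    if raw is None or not raw.strip():
        return None
    name = re.sub(r"\s+", " ", raw.strip())   # collapse internal whitespace
    name = _SUFFIX.sub("", name)              # drop one trailing suffix
    return ALIASES.get(name.lower(), name)


for raw in ["Meta", "Meta, Inc.", "Amazon Web Services, Inc.", "  ", None]:
    print(repr(raw), "->", repr(normalize_operator(raw)))
```

Null operators stay null rather than being folded into a placeholder, so the count of facilities with no operator remains visible after normalization.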
Research Ideas & Future Work
See research-ideas.md for detailed next steps and potential research directions.
Project Status
This is a mature, publication-ready geospatial analysis infrastructure combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.
The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with spatial concentration statistics: Gini coefficients, Herfindahl-Hirschman indices (HHI), and Mann-Whitney tests.
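Two of the concentration measures named above can be computed directly from per-unit facility counts. The sketch below uses the rank-weighted form of the Gini coefficient and the sum-of-squared-shares form of the HHI; the function names and toy inputs are illustrative, the project's script may use a different formulation, and the Mann-Whitney test is not sketched here.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 when equal, (n-1)/n when one unit holds all."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula on ascending-sorted values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def hhi(values):
    """Herfindahl-Hirschman index on shares in [0, 1]: 1/n when equal, 1 when one unit holds all."""
    total = sum(values)
    if total == 0:
        return 0.0
    return sum((v / total) ** 2 for v in values)


even = [5, 5, 5, 5]        # facilities spread evenly over four units
skewed = [0, 0, 0, 20]     # all facilities in a single unit
print(gini(even), hhi(even))
print(gini(skewed), hhi(skewed))
```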
License
Research data compiled from public sources. Please cite appropriately if used in publications.
Contact
For questions about this research project, please contact the repository owner.