# US Data Centers - Geospatial Research Infrastructure

A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.

## Documentation

- **[Database Tables](docs/database-tables.md)** - Complete database schema with table descriptions, column definitions, and SQL examples
- **[Research Ideas](docs/research-ideas.md)** - Future research directions, data improvements, and potential collaborations
- **[SQL Queries](docs/query_legiscan_bills.sql)** - Pre-built legislative analysis queries

## Project Overview

This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:

- **Spatial concentration patterns**: Where are data centers located and why?
- **Infrastructure dependencies**: How do data centers relate to submarine cables, power grids, and watersheds?
- **Equity and impact**: Do data center host communities bear localized burdens while benefits are nationally dispersed?
- **Demographics**: Who lives in data center-hosting census tracts?
- **Environmental exposure**: What are the water, energy, and natural hazard exposures?

## Key Research Question

**Do data centers represent "concentrated costs / dispersed benefits"?**

Host communities bear localized infrastructure burdens (power, water, land use) while cloud computing benefits are nationally dispersed.

## Major Findings

### Spatial Concentration

- **State level**: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
  - Virginia alone: 20.6% (378 of 1,833 facilities)
- **Tract level**: Top 1% of data center-hosting census tracts hold 14.6% of all facilities
  - Only 0.86% of data center-state residents live in a hosting tract
  - Per-capita burden is **115× higher** in host tracts vs.
    state average
- **Watershed level**: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
  - Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities

### Demographics of Host Communities

Compared to the US average, data center host communities are:

- **Wealthier**: Median household income $103,623 (vs. $78,538, +32%)
- **More educated**: 49% bachelor's+ (vs. 35%, +14 pp)
- **More diverse**: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
- **Better connected**: 94.9% broadband (vs. 89%)

### Infrastructure Insights

- **89% of data centers are in metropolitan tracts** (vs. 80% of all US tracts) - only 1.11× over-index
- **Non-metro data centers (11%)** are dominated by hyperscalers:
  - AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities
  - 66% are in Oregon + Washington (Columbia River hydro corridor)
- **Grid saturation**: 4 states have >2/3 of generation within 50 km of a data center:
  - New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%
- **Hyperscaler energy strategies** (non-metro sites):
  - AWS: 114 GW wind + 66 GW hydro
  - Microsoft: 13 GW nuclear (Palo Verde co-location)
  - Meta: 16 GW solar

### Clustered vs. Isolated Facilities

Facilities in DBSCAN clusters differ significantly from isolated sites:

- **$35K income gap**: Clustered sites in tracts with median income $108K vs. $73K for isolated
- **+18 pp education**: 51% bachelor's+ vs. 33%
- **More diverse**: 25 pp less non-Hispanic white
- **2× energy infrastructure**: 89 vs.
  40 generators within 50 km

### Submarine Cables

- **Data centers are NOT systematically closer to cables** than ordinary US cities
- Only 21.4% of data centers are within 100 km of a submarine cable landing point
- Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables

## Data Sources

### Primary Data Center Inventories

- **Curated Sample**: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
- **OpenStreetMap**: 1,549 OSM features tagged as data centers (via Overpass API)
- **IM3 Model Data**: PNNL's Integrated Multisector Multiscale Modeling existing facilities
- **Master Table**: 1,833 deduplicated facilities merging all sources

### Geospatial Context Layers

- **US Census**: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
- **USDA RUCA 2020**: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
- **USGS HUC8 Watersheds**: 2,139 subbasin polygons for water-stress analysis
- **FEMA NRI**: National Risk Index with 18 natural hazard risk scores by census tract

### Infrastructure Layers

- **Submarine Cables**: 693 cables, 3,361 landing points (TeleGeography-style)
- **EIA Energy Data**: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
- **FCC Broadband Data**: Provider availability by location/block

### Additional Data

- **RDH Precinct Vote Data**: Election results for political-economy analysis
- **NOAA HMS Smoke Data**: Wildfire smoke exposure (2005-2025)
- **USDM Drought Data**: Drought severity
- **Utility Rate Tracker**: State-level electricity rate increases
- **LegiScan Legislative Data**: All US state + federal bills 2016–2026 (1.3M bills, 646 sessions), tagged for data center, ratepayer, grid, tax, and siting topics

## Repository Structure

### Core Python Scripts

**Data Ingestion** (`scripts/`)

- `scripts/load_postgis_data_centers.py` - Load curated data center CSV into
  PostGIS
- `scripts/load_postgis_osm_data_centers.py` - Fetch OSM data centers via Overpass API
- `scripts/build_master_data_centers.py` - Deduplicate & merge curated + OSM sources
- `scripts/load_postgis_internet_cables.py` - Load submarine cables and landing points
- `scripts/ingest_eia_energy_layers.py` - Ingest EIA energy data via API
- `scripts/build_watershed_huc8_tables.py` - Load USGS HUC8 watersheds
- `scripts/ingest_legiscan.py` - Download all US state/federal bills 2016–2026 via LegiScan API, tag for data center research topics

**Enrichment**

- `scripts/create_data_center_census_tract_table.py` - Join data centers to Census tracts with ACS demographics
- `scripts/build_fcc_bdc_broadband_connection_table.py` - Build per-facility broadband provider table
- `scripts/build_fcc_bdc_location_provider_aggregates.py` - Aggregate FCC BDC data by county/tract

**Analysis**

- `scripts/analyze_dc_tract_concentration.py` - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
- `scripts/analyze_cables_concentration.py` - Test if data centers cluster near submarine cables
- `scripts/make_data_center_map.py` - Generate Leaflet map of data centers
- `scripts/make_internet_cables_map.py` - Generate Leaflet map of data centers + cables

### Key Jupyter Notebooks

- `spatial_clustering_master_data_centers.ipynb` - DBSCAN clustering of data centers
- `cluster_analysis.ipynb` - Main demographic/energy/watershed analysis
- `fema_nri_data_centers.ipynb` - Join data centers to FEMA hazard scores
- `rdh_precinct_vote_data_centers.ipynb` - Join data centers to election data
- `usdm_drought_data_centers.ipynb` - Drought exposure analysis
- `hms_smoke_data_centers.ipynb` - Wildfire smoke exposure
- `enhanced_data_center_cluster_map.ipynb` - Generate enhanced cluster visualization

### Output Files

- `output/data_center_demographic_ruca_energy_summary.md` - Flagship analysis report
- `cables_concentration_report.md` - Cable proximity + cost/benefit concentration
  analysis
- `data_center_map.html` - Basic data center locations (Leaflet)
- `data_centers_cables_map.html` - Data centers + submarine cables (Leaflet)
- `output/enhanced_master_data_center_spatial_clusters_map.html` - Enhanced cluster visualization

## Technical Architecture

### Database

- **PostgreSQL 13+** with **PostGIS 3.x**
- Database name: `data_centers`
- See [database-tables.md](docs/database-tables.md) for complete schema documentation

### Python Environment

- **Python 3.10+**
- Key libraries: `psycopg2`, `geopandas`, `shapely`, `scikit-learn`, `pandas`, `numpy`, `requests`, `folium`

### Data Formats

- CSV (raw data exports)
- GeoJSON (watershed/cluster geometries)
- Shapefiles (Census, USGS, FEMA inputs)
- HTML (interactive Leaflet maps)

### Configuration

Credentials are stored in `~/.zsh_secrets` and loaded via environment variables:

- `PGWEB_*`: PostgreSQL connection
- `EIA_API_KEY`: EIA energy data
- `FCC_USERNAME`, `FCC_API_KEY`: FCC broadband data
- `RDH_USERNAME`, `RDH_PASSWORD`: Redistricting Data Hub
- `CENSUS_API_KEY`: Census ACS API
- `LEGISCAN_API_KEY`: LegiScan legislative data

## Quick Start

### Basic Rebuild Sequence

```bash
# 1. Load base data center data
python3 scripts/load_postgis_data_centers.py
python3 scripts/load_postgis_osm_data_centers.py
python3 scripts/build_master_data_centers.py

# 2. Enrich with context layers
python3 scripts/create_data_center_census_tract_table.py --replace-final
python3 scripts/load_postgis_internet_cables.py
python3 scripts/ingest_eia_energy_layers.py --category power
python3 scripts/build_watershed_huc8_tables.py

# 3. Run analyses
python3 scripts/analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 scripts/analyze_cables_concentration.py > output/cables_analysis.txt

# 4. Execute notebooks
jupyter notebook cluster_analysis.ipynb

# 5. Load legislation (all states, 2016-2026)
python3 scripts/ingest_legiscan.py --all

# Weekly refresh (skips unchanged sessions):
python3 scripts/ingest_legiscan.py --fetch --load
```

### Generate Maps

```bash
python3 scripts/make_data_center_map.py
python3 scripts/make_internet_cables_map.py
```

## Key Outputs

### Research Reports

- **Demographic, Energy & Watershed Analysis**: `output/data_center_demographic_ruca_energy_summary.md`
- **Submarine Cable Proximity**: `cables_concentration_report.md`

### Interactive Maps

- Data center locations with cluster assignments
- Submarine cable routes and landing points
- Energy infrastructure proximity
- Watershed boundaries with data center counts

### Data Exports

- `master_data_center_spatial_cluster_points.csv` - Data center points with cluster IDs
- `master_data_center_spatial_cluster_summary.csv` - Cluster-level statistics
- `output/master_data_center_huc8_watersheds.geojson` - Watershed polygons
- `output/master_data_center_map_context.csv` - Full context for mapping
- `output/master_data_center_state_energy_context.csv` - State-level energy statistics

## Data Quality Notes

1. **Incomplete power ratings**: Only 5.9% of data centers have power ratings (108/1,833)
2. **Operator fragmentation**: String variations ("Meta" vs. "Meta, Inc.", AWS variants) inflate distinct-operator counts
3. **45 facilities** use city-precision fallback coordinates (approximate tract assignment)
4. **7 facilities** failed RUCA join (Puerto Rico / non-US)
5. **Broadband subscribers** are a coarse benefit proxy (actual cloud users are global)
6. **EIA longitude correction**: 2008-2010 generator coordinates had sign errors, corrected in flat-table build

## Known Limitations

- **Power capacity**: Only 5.9% populated - nearby EIA generator capacity used as proxy
- **Operator strings**: Need deduplication (50 of 190 non-metro facilities have null operator)
- **Benefit measurement**: Broadband subscribers are an imperfect proxy for cloud computing benefits
- **Universe**: Limited to the 46 states that host data centers (states without data centers are excluded from the ACS comparison)

## Research Ideas & Future Work

See [research-ideas.md](docs/research-ideas.md) for detailed next steps and potential research directions.

## Project Status

This is a **mature, publication-ready geospatial analysis infrastructure** combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses. The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with rigorous spatial statistics (Gini coefficients, HHI indices, Mann-Whitney tests).

## License

Research data compiled from public sources. Please cite appropriately if used in publications.

## Contact

For questions about this research project, please contact the repository owner.
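
## Appendix: Concentration Metrics (Illustrative)

The Project Status section names Gini coefficients and HHI indices as the concentration statistics. The sketch below shows one common way to compute both from per-tract facility counts. It is not taken from `scripts/analyze_dc_tract_concentration.py`: the function names `gini` and `hhi` and the `toy_counts` values are hypothetical and serve only to illustrate the formulas.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even).

    Uses the rank-weighted formula on ascending-sorted values:
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n.
    """
    x = sorted(float(c) for c in counts)
    n = len(x)
    total = sum(x)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * value for rank, value in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def hhi(counts):
    """Herfindahl-Hirschman Index on shares in [0, 1].

    Equals 1/n when every unit holds the same share and 1.0 when a
    single unit holds everything.
    """
    x = [float(c) for c in counts]
    total = sum(x)
    if total == 0:
        return 0.0
    return sum((value / total) ** 2 for value in x)


# Hypothetical facility counts per census tract (made-up numbers,
# not results from this project).
toy_counts = [0, 0, 1, 1, 2, 12]
print(f"Gini: {gini(toy_counts):.3f}")
print(f"HHI:  {hhi(toy_counts):.3f}")
```

With real data, the counts would presumably come from a per-tract aggregation of the master facility table. Whether tracts with zero facilities are included changes the Gini value substantially, so that choice should be stated alongside any reported figure.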