US Data Centers - Geospatial Research Infrastructure
A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.
Project Overview
This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:
- Spatial concentration patterns: Where are data centers located and why?
- Infrastructure dependencies: How do data centers relate to submarine cables, power grids, and watersheds?
- Equity and impact: Do data center host communities bear localized burdens while benefits are nationally dispersed?
- Demographics: Who lives in data center-hosting census tracts?
- Environmental exposure: What are the water, energy, and natural hazard exposures?
Key Research Question
Do data centers represent "concentrated costs / dispersed benefits"? The hypothesis is that host communities bear localized infrastructure burdens (power, water, land use) while the benefits of cloud computing are dispersed nationally.
Major Findings
Spatial Concentration
- State level: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
  - Virginia alone: 20.6% (378 of 1,833 facilities)
- Tract level: Top 1% of data center-hosting census tracts hold 14.6% of all facilities
  - Only 0.86% of data center-state residents live in a hosting tract
  - Per-capita burden is 115× higher in host tracts vs. state average
- Watershed level: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
  - Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
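As a rough illustration of how a per-capita burden ratio like the one above can be computed, here is a minimal sketch. The function name and the input numbers are hypothetical, and the exact definition used in the project's analysis is an assumption. If every facility sits in a hosting tract, the ratio reduces to one over the hosting tracts' population share, so a 0.86% share gives about 116, which is in line with the reported 115×.

```python
def per_capita_ratio(host_facilities, host_population,
                     total_facilities, total_population):
    """Facilities per resident in host tracts, divided by the overall average."""
    host_rate = host_facilities / host_population
    overall_rate = total_facilities / total_population
    return host_rate / overall_rate


# Illustrative numbers only (not taken from the project data):
# 1,000 facilities, all located in tracts holding 1% of 10,000,000 residents.
ratio = per_capita_ratio(1_000, 100_000, 1_000, 10_000_000)
print(round(ratio, 1))
```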
Demographics of Host Communities
Compared to the US average, data center host communities are:
- Wealthier: Median household income $103,623 (vs. $78,538, +32%)
- More educated: 49% bachelor's+ (vs. 35%, +14 pp)
- More diverse: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
- Better connected: 94.9% broadband (vs. 89%)
Infrastructure Insights
- 89% of data centers are in metropolitan tracts (vs. 80% of all US tracts)
- Non-metro data centers (11%) are dominated by hyperscalers:
  - AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities
  - 66% are in Oregon + Washington (Columbia River hydro corridor)
- Energy infrastructure: 4 states have >2/3 of generation within 50 km of a data center:
  - New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%
Submarine Cables
- Data centers are NOT systematically closer to cables than ordinary US cities
- Only 21.4% of data centers are within 100 km of a submarine cable landing point
- Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables
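The "within 100 km of a landing point" share can be sketched in plain Python as follows. This is a standalone illustration using great-circle (haversine) distance, not the code in analyze_cables_concentration.py; the function names and the sample coordinates are hypothetical and only approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within(facilities, landing_points, radius_km=100.0):
    """Fraction of facilities with at least one landing point within radius_km."""
    if not facilities:
        return 0.0
    near = sum(
        1
        for lat, lon in facilities
        if any(haversine_km(lat, lon, plat, plon) <= radius_km
               for plat, plon in landing_points)
    )
    return near / len(facilities)


# Approximate, illustrative coordinates (not project data).
facilities = [(39.04, -77.49),   # inland site in northern Virginia
              (36.80, -76.00)]   # coastal site near Virginia Beach
landing_points = [(36.85, -75.98)]
print(share_within(facilities, landing_points))
```

Inside PostGIS, the same check is usually expressed with ST_DWithin on geography columns, which avoids the pairwise loop.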
Data Sources
Primary Data Center Inventories
- Curated Sample: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
- OpenStreetMap: 1,549 OSM features tagged as data centers (via Overpass API)
- IM3 Model Data: PNNL's Integrated Multisector Multiscale Modeling existing facilities
- Master Table: 1,833 deduplicated facilities merging all sources
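One simple way to merge overlapping inventories is a greedy proximity pass: keep a record only if no already-kept record lies within some distance threshold. The sketch below illustrates that idea only. The 150 m threshold, the record fields, and the "curated first" ordering are assumptions for illustration, not the rules implemented in build_master_data_centers.py.

```python
import math


def distance_m(a, b):
    """Haversine distance in metres between two records with 'lat'/'lon' keys."""
    phi1, phi2 = math.radians(a["lat"]), math.radians(b["lat"])
    dphi = phi2 - phi1
    dlmb = math.radians(b["lon"] - a["lon"])
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6_371_008.8 * math.asin(math.sqrt(h))


def deduplicate(records, threshold_m=150.0):
    """Greedy merge: earlier records win; later near-duplicates are dropped."""
    kept = []
    for rec in records:
        if all(distance_m(rec, k) > threshold_m for k in kept):
            kept.append(rec)
    return kept


# Hypothetical records; listing curated rows first gives them priority.
records = [
    {"source": "curated", "name": "Site A", "lat": 39.0400, "lon": -77.4900},
    {"source": "osm", "name": "Site A (OSM)", "lat": 39.0403, "lon": -77.4900},
    {"source": "osm", "name": "Site B", "lat": 45.6000, "lon": -121.1800},
]
for rec in deduplicate(records):
    print(rec["source"], rec["name"])
```

A greedy pass is order-dependent, so the input ordering encodes which source is trusted most.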
Geospatial Context Layers
- US Census: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
- USDA RUCA 2020: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
- USGS HUC8 Watersheds: 2,139 subbasin polygons for water-stress analysis
- FEMA NRI: National Risk Index with 18 natural hazard risk scores by census tract
Infrastructure Layers
- Submarine Cables: 693 cables, 3,361 landing points (TeleGeography-style)
- EIA Energy Data: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
- FCC Broadband Data: Provider availability by location/block
Additional Data
- RDH Precinct Vote Data: Election results for political-economy analysis
- NOAA HMS Smoke Data: Wildfire smoke exposure (2005-2025)
- USDM Drought Data: Drought severity
- Utility Rate Tracker: State-level electricity rate increases
Repository Structure
Core Python Scripts
Data Ingestion
- load_postgis_data_centers.py - Load curated data center CSV into PostGIS
- load_postgis_osm_data_centers.py - Fetch OSM data centers via Overpass API
- build_master_data_centers.py - Deduplicate & merge curated + OSM sources
- load_postgis_internet_cables.py - Load submarine cables and landing points
- ingest_eia_energy_layers.py - Ingest EIA energy data via API
- build_watershed_huc8_tables.py - Load USGS HUC8 watersheds
Enrichment
- create_data_center_census_tract_table.py - Join data centers to Census tracts with ACS demographics
- build_fcc_bdc_broadband_connection_table.py - Build per-facility broadband provider table
- build_fcc_bdc_location_provider_aggregates.py - Aggregate FCC BDC data by county/tract
Analysis
- analyze_dc_tract_concentration.py - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
- analyze_cables_concentration.py - Test if data centers cluster near submarine cables
- make_data_center_map.py - Generate Leaflet map of data centers
- make_internet_cables_map.py - Generate Leaflet map of data centers + cables
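For readers unfamiliar with the concentration measures named above, here is a minimal, standalone sketch of a Gini coefficient, a Herfindahl-Hirschman Index (on a 0 to 1 scale), and a top-share statistic over per-unit facility counts. The function names and sample counts are hypothetical; the analysis scripts may define or scale these measures differently.

```python
import math


def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    values = sorted(counts)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared shares, between 1/n and 1."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) ** 2 for c in counts)


def top_share(counts, fraction):
    """Share of the total held by the top `fraction` of units (at least one unit)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    k = max(1, math.ceil(len(counts) * fraction))
    return sum(sorted(counts, reverse=True)[:k]) / total


# Hypothetical facility counts for four tracts.
tract_counts = [10, 5, 3, 2]
print(gini(tract_counts), hhi(tract_counts), top_share(tract_counts, 0.25))
```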
Key Jupyter Notebooks
- spatial_clustering_master_data_centers.ipynb - DBSCAN clustering of data centers
- cluster_analysis.ipynb - Main demographic/energy/watershed analysis
- fema_nri_data_centers.ipynb - Join data centers to FEMA hazard scores
- rdh_precinct_vote_data_centers.ipynb - Join data centers to election data
- usdm_drought_data_centers.ipynb - Drought exposure analysis
- hms_smoke_data_centers.ipynb - Wildfire smoke exposure
- enhanced_data_center_cluster_map.ipynb - Generate enhanced cluster visualization
Output Files
- output/data_center_demographic_ruca_energy_summary.md - Flagship analysis report
- cables_concentration_report.md - Cable proximity + cost/benefit concentration analysis
- data_center_map.html - Basic data center locations (Leaflet)
- data_centers_cables_map.html - Data centers + submarine cables (Leaflet)
- output/enhanced_master_data_center_spatial_clusters_map.html - Enhanced cluster visualization
Technical Architecture
Database
- PostgreSQL 13+ with PostGIS 3.x
- Database name: data_centers
- See database-tables.md for complete schema documentation
Python Environment
- Python 3.10+
- Key libraries: psycopg2, geopandas, shapely, scikit-learn, pandas, numpy, requests, folium
Data Formats
- CSV (raw data exports)
- GeoJSON (watershed/cluster geometries)
- Shapefiles (Census, USGS, FEMA inputs)
- HTML (interactive Leaflet maps)
Configuration
Credentials stored in ~/.zsh_secrets, loaded via environment variables:
- PGWEB_*: PostgreSQL connection
- EIA_API_KEY: EIA energy data
- FCC_USERNAME, FCC_API_KEY: FCC broadband data
- RDH_USERNAME, RDH_PASSWORD: Redistricting Data Hub
- CENSUS_API_KEY: Census ACS API
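Before running the pipeline it can help to confirm that the variables listed above are actually present in the environment. The following is a small sketch of such a check; the function name is hypothetical. Because the individual PGWEB_* variable names are not listed here, the sketch only reports which variables with that prefix are set rather than assuming specific names.

```python
import os

# Variable names taken from the list above.
REQUIRED_KEYS = (
    "EIA_API_KEY",
    "FCC_USERNAME",
    "FCC_API_KEY",
    "RDH_USERNAME",
    "RDH_PASSWORD",
    "CENSUS_API_KEY",
)


def check_environment(env=None):
    """Return (missing required keys, names of PGWEB_* variables that are set)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    pg_settings = sorted(key for key in env if key.startswith("PGWEB_"))
    return missing, pg_settings


if __name__ == "__main__":
    missing, pg_settings = check_environment()
    print("Missing:", ", ".join(missing) or "none")
    print("PostgreSQL settings found:", ", ".join(pg_settings) or "none")
```

This assumes the secrets file has already been sourced in the current shell, so that its values are visible to Python as environment variables.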
Quick Start
Basic Rebuild Sequence
# 1. Load base data center data
python3 load_postgis_data_centers.py
python3 load_postgis_osm_data_centers.py
python3 build_master_data_centers.py
# 2. Enrich with context layers
python3 create_data_center_census_tract_table.py --replace-final
python3 load_postgis_internet_cables.py
python3 ingest_eia_energy_layers.py --category power
python3 build_watershed_huc8_tables.py
# 3. Run analyses
python3 analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 analyze_cables_concentration.py > output/cables_analysis.txt
# 4. Open the main analysis notebook and run its cells
jupyter notebook cluster_analysis.ipynb
Generate Maps
python3 make_data_center_map.py
python3 make_internet_cables_map.py
Key Outputs
Research Reports
- Demographic, Energy & Watershed Analysis: output/data_center_demographic_ruca_energy_summary.md
- Submarine Cable Proximity: cables_concentration_report.md
Interactive Maps
- Data center locations with cluster assignments
- Submarine cable routes and landing points
- Energy infrastructure proximity
- Watershed boundaries with data center counts
Data Exports
- master_data_center_spatial_cluster_points.csv - Data center points with cluster IDs
- master_data_center_spatial_cluster_summary.csv - Cluster-level statistics
- output/master_data_center_huc8_watersheds.geojson - Watershed polygons
- output/master_data_center_map_context.csv - Full context for mapping
- output/master_data_center_state_energy_context.csv - State-level energy statistics
Data Quality Notes
- Incomplete power ratings: Only 5.9% of data centers have power ratings (108/1,833)
- Operator fragmentation: String variations ("Meta" vs. "Meta, Inc.") inflate distinct-operator counts
- 45 facilities use city-precision fallback coordinates (approximate tract assignment)
- 7 facilities failed RUCA join (Puerto Rico / non-US)
- Broadband subscribers are a coarse benefit proxy (actual cloud users are global)
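The operator fragmentation noted above can be reduced with light string normalization before counting distinct operators. This is a minimal sketch, not part of the repository; the function name and the suffix list are assumptions, and the approach will not merge genuinely different spellings such as "Meta" and "Meta Platforms".

```python
import re

# Trailing corporate suffixes to strip (illustrative list, not exhaustive).
_SUFFIX = re.compile(r"[,\s]+(inc|llc|corp|corporation|ltd|co)\.?$", re.IGNORECASE)


def normalize_operator(name):
    """Collapse whitespace, strip trailing corporate suffixes, and casefold."""
    cleaned = " ".join(name.split())
    previous = None
    while cleaned != previous:  # repeat to handle stacked suffixes
        previous = cleaned
        cleaned = _SUFFIX.sub("", cleaned).strip(" ,")
    return cleaned.casefold()


operators = ["Meta", "Meta, Inc.", "meta  inc", "Equinix LLC", "Equinix"]
distinct = {normalize_operator(name) for name in operators}
print(len(operators), "raw names ->", len(distinct), "normalized operators")
```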
Research Ideas & Future Work
See research-ideas.md for detailed next steps and potential research directions.
Project Status
This is a mature, publication-ready geospatial analysis infrastructure combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.
The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with spatial concentration statistics (Gini coefficients, the Herfindahl-Hirschman Index) and Mann-Whitney tests.
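As a reminder of what the Mann-Whitney test compares, here is a minimal sketch of the U statistic computed directly from its pairwise definition. The function name and sample values are hypothetical, and which variables the project compares this way is not specified here. For p-values, scipy.stats.mannwhitneyu is the usual choice.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b): pairwise wins for each sample, with ties counted as 0.5."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Hypothetical values, e.g. an indicator for host vs. non-host tracts.
host = [5, 6, 7]
non_host = [1, 2, 3]
print(mann_whitney_u(host, non_host))
```

The pairwise loop is quadratic in sample size, so it is only suitable for small illustrative inputs.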
License
Research data compiled from public sources. Please cite appropriately if used in publications.
Contact
For questions about this research project, please contact the repository owner.