data-centers/README.md
dadams f29755faba Update README with corrected numbers from master dataset
- Per-capita burden: 115× → 35× (master dataset adds 104 larger suburban tracts)
- Host pop share: 0.86% → 2.9% of host-state residents
- Non-metro: 11% → ~10% (RUCA 2020)
- Add: 59.3% Biden 2020 in host communities; income gradient by urbanicity
- Add: top host tract (Loudoun CT 6110.20, 69 DCs, MHI $141K)
- Correct hyperscaler shares to exact figures from live DB

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-13 17:44:22 -07:00

US Data Centers - Geospatial Research Infrastructure

A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.

Documentation

  • Database Tables - Complete database schema with table descriptions, column definitions, and SQL examples
  • Database Table Previews - Research-team-friendly Markdown previews showing the first rows of each documented table
  • Research Ideas - Future research directions, data improvements, and potential collaborations
  • SQL Queries - Pre-built legislative analysis queries

Project Overview

This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:

  • Spatial concentration patterns: Where are data centers located and why?
  • Infrastructure dependencies: How do data centers relate to submarine cables, power grids, and watersheds?
  • Equity and impact: Do data center host communities bear localized burdens while benefits are nationally dispersed?
  • Demographics: Who lives in data center-hosting census tracts?
  • Environmental exposure: What are the water, energy, and natural hazard exposures?

Key Research Question

Do data centers represent "concentrated costs / dispersed benefits"? The hypothesis is that host communities bear localized infrastructure burdens (power, water, land use) while the benefits of cloud computing are dispersed nationally.

Major Findings

Spatial Concentration

  • State level: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
    • Virginia alone: 20.6% (378 of 1,833 facilities)
  • Tract level: Top 1% of data center-hosting census tracts hold 14.5% of all facilities (7 tracts)
    • 2.9% of residents in data center host states live in a hosting tract
    • Per-capita burden is ~35× higher in host tracts than the state average
    • Top host tract: Loudoun County CT 6110.20 — 69 DCs, MHI $141K, 64% BA+
  • Watershed level: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
    • Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
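
The ~35× figure follows from the population share under one reading of "per-capita burden": facilities per resident in host tracts divided by facilities per resident across the host states. Because every facility sits in a host tract by definition, the ratio reduces to total population over host population, roughly 1 / 0.029 ≈ 34.5. A minimal sketch under that assumption (the function name and the round numbers are illustrative, not taken from the project's scripts):

```python
def per_capita_burden_ratio(facilities, host_population, total_population):
    """Facilities per resident in host tracts, relative to the overall rate.

    Assumes every facility is located in a host tract, so the facility
    count cancels and the result equals total_population / host_population.
    """
    if facilities <= 0 or host_population <= 0:
        raise ValueError("facilities and host_population must be positive")
    if total_population < host_population:
        raise ValueError("host_population cannot exceed total_population")
    host_rate = facilities / host_population
    overall_rate = facilities / total_population
    return host_rate / overall_rate


# Illustrative round numbers, not project data: 2.9% of 1,000,000 residents.
ratio = per_capita_burden_ratio(
    facilities=100, host_population=29_000, total_population=1_000_000
)
print(round(ratio, 1))
```

If the project defines burden differently (for example, weighting by facility power draw), the ratio would not reduce to the population share in this way.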

Demographics of Host Communities

Compared to the US average, data center host communities are:

  • Wealthier: Median household income $103,623 (vs. $78,538, +32%)
  • More educated: 49% bachelor's+ (vs. 35%, +14 pp)
  • More diverse: 51% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
  • Better connected: 94.9% broadband (vs. 89%)
  • Politically blue: 59.3% Biden 2020 — affluent tech-corridor suburbs, not resentment communities
  • Income gradient by urbanicity: Metro avg MHI $119K → Rural/Small Town $83K → Micropolitan $67K

Infrastructure Insights

  • ~90% of data centers are in metropolitan tracts (RUCA 2020)
  • The remaining ~10% (non-metro) are dominated by hyperscalers:
    • AWS, Meta, Microsoft, Google = 55.8% of non-metro facilities
    • 66.4% are in Oregon + Washington (45.3% OR + 21.1% WA; Columbia River hydro corridor)
  • Grid saturation: 4 states have >2/3 of generation within 50 km of a data center:
    • New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%
  • Hyperscaler energy strategies (non-metro sites):
    • AWS: 114 GW wind + 66 GW hydro
    • Microsoft: 13 GW nuclear (Palo Verde co-location)
    • Meta: 16 GW solar
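
One way to compute a grid saturation share like the state figures above is to sum generator capacity within 50 km of any data center and divide by the state total. The sketch below assumes "generation" means generator capacity in MW and uses great-circle distance; the function names and sample coordinates are hypothetical, and the project may well compute this in PostGIS rather than in Python.

```python
import math

EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def capacity_share_near_data_centers(generators, data_centers, radius_km=50.0):
    """Share of capacity within radius_km of at least one data center.

    generators: iterable of (lat, lon, capacity_mw)
    data_centers: iterable of (lat, lon)
    """
    data_centers = list(data_centers)
    total = near = 0.0
    for lat, lon, capacity_mw in generators:
        total += capacity_mw
        if any(haversine_km(lat, lon, d_lat, d_lon) <= radius_km
               for d_lat, d_lon in data_centers):
            near += capacity_mw
    if total == 0:
        raise ValueError("no generator capacity supplied")
    return near / total


# Made-up inputs: one data center, one nearby generator, one distant generator.
share = capacity_share_near_data_centers(
    generators=[(39.1, -77.5, 400.0), (41.0, -77.5, 600.0)],
    data_centers=[(39.0, -77.5)],
)
print(f"{share:.0%} of capacity lies within 50 km")
```

The brute-force loop is quadratic in the number of points; for the full generator table a spatial index would be the more practical choice.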

Clustered vs. Isolated Facilities

Facilities in DBSCAN clusters differ significantly from isolated sites:

  • $35K income gap: Clustered sites sit in tracts with a median income of $108K vs. $73K for isolated sites
  • +18 pp education: 51% bachelor's+ vs. 33%
  • More diverse: Non-Hispanic white share is 25 pp lower
  • 2× energy infrastructure: 89 vs. 40 generators within 50 km
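
For readers unfamiliar with how DBSCAN separates clustered from isolated sites, the following dependency-free sketch shows the core logic: a point with at least min_samples neighbors within eps_km seeds a cluster, and points that no cluster reaches are labeled -1 (isolated). The eps_km and min_samples values are assumptions for illustration; this README does not state the parameters used in the notebooks, which would more likely rely on scikit-learn.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) tuples in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0088 * math.asin(math.sqrt(h))


def dbscan(points, eps_km, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for isolated."""
    n = len(points)
    # Brute-force neighbor lists; each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if haversine_km(points[i], points[j]) <= eps_km]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Made-up coordinates: three sites a few km apart plus one distant site.
sites = [(39.04, -77.49), (39.05, -77.48), (39.03, -77.50), (45.60, -121.20)]
labels = dbscan(sites, eps_km=10, min_samples=3)
print(labels)
```

Facilities labeled -1 correspond to the "isolated" group in the comparison above; everything else belongs to a cluster.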

Submarine Cables

  • Data centers are NOT systematically closer to cables than ordinary US cities
  • Only 21.4% of data centers are within 100 km of a submarine cable landing point
  • Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables
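
A claim such as "not systematically closer" can be checked by comparing two distance distributions with a rank-based test. The sketch below implements a Mann-Whitney U test with a normal approximation; whether the project applies this particular test to the cable comparison is an assumption, and the distances are made-up illustrative values.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U statistic for sample_a, approximate two-sided p-value).

    Uses the normal approximation without tie or continuity correction,
    so the p-value is only a rough guide for small or heavily tied samples.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both samples must be non-empty")
    ranks = average_ranks(list(sample_a) + list(sample_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    return u_a, math.erfc(abs(z) / math.sqrt(2))


# Illustrative distances (km) to the nearest cable landing point.
data_center_km = [620.0, 35.0, 410.0, 880.0, 95.0, 540.0]
city_km = [300.0, 50.0, 700.0, 120.0, 450.0, 910.0]
u_stat, p_value = mann_whitney_u(data_center_km, city_km)
print(f"U = {u_stat}, p = {p_value:.3f}")
```

A large p-value would be consistent with the finding that data centers are no closer to cables than cities are, though it would not prove the two distributions are identical.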

Data Sources

Primary Data Center Inventories

  • Curated Sample: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
  • OpenStreetMap: 1,549 OSM features tagged as data centers (via Overpass API)
  • IM3 Model Data: PNNL's Integrated Multisector Multiscale Modeling existing facilities
  • Master Table: 1,833 deduplicated facilities merging all sources

Geospatial Context Layers

  • US Census: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
  • USDA RUCA 2020: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
  • USGS HUC8 Watersheds: 2,139 subbasin polygons for water-stress analysis
  • FEMA NRI: National Risk Index with 18 natural hazard risk scores by census tract

Infrastructure Layers

  • Submarine Cables: 693 cables, 3,361 landing points (TeleGeography-style)
  • EIA Energy Data: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
  • FCC Broadband Data: Provider availability by location/block

Additional Data

  • RDH Precinct Vote Data: Election results for political-economy analysis
  • NOAA HMS Smoke Data: Wildfire smoke exposure (2005-2025)
  • USDM Drought Data: Drought severity
  • Utility Rate Tracker: State-level electricity rate increases
  • LegiScan Legislative Data: All US state + federal bills 2016-2026 (1.3M bills, 646 sessions), tagged for data center, ratepayer, grid, tax, and siting topics

Repository Structure

Core Python Scripts

Data Ingestion (scripts/)

  • scripts/load_postgis_data_centers.py - Load curated data center CSV into PostGIS
  • scripts/load_postgis_osm_data_centers.py - Fetch OSM data centers via Overpass API
  • scripts/build_master_data_centers.py - Deduplicate & merge curated + OSM sources
  • scripts/load_postgis_internet_cables.py - Load submarine cables and landing points
  • scripts/ingest_eia_energy_layers.py - Ingest EIA energy data via API
  • scripts/build_watershed_huc8_tables.py - Load USGS HUC8 watersheds
  • scripts/ingest_legiscan.py - Download all US state/federal bills 2016-2026 via LegiScan API, tag for data center research topics

Enrichment

  • scripts/create_data_center_census_tract_table.py - Join data centers to Census tracts with ACS demographics
  • scripts/build_fcc_bdc_broadband_connection_table.py - Build per-facility broadband provider table
  • scripts/build_fcc_bdc_location_provider_aggregates.py - Aggregate FCC BDC data by county/tract

Analysis

  • scripts/analyze_dc_tract_concentration.py - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
  • scripts/analyze_cables_concentration.py - Test if data centers cluster near submarine cables
  • scripts/make_data_center_map.py - Generate Leaflet map of data centers
  • scripts/make_internet_cables_map.py - Generate Leaflet map of data centers + cables

Key Jupyter Notebooks

  • spatial_clustering_master_data_centers.ipynb - DBSCAN clustering of data centers
  • cluster_analysis.ipynb - Main demographic/energy/watershed analysis
  • fema_nri_data_centers.ipynb - Join data centers to FEMA hazard scores
  • rdh_precinct_vote_data_centers.ipynb - Join data centers to election data
  • usdm_drought_data_centers.ipynb - Drought exposure analysis
  • hms_smoke_data_centers.ipynb - Wildfire smoke exposure
  • enhanced_data_center_cluster_map.ipynb - Generate enhanced cluster visualization

Output Files

  • output/data_center_demographic_ruca_energy_summary.md - Flagship analysis report
  • cables_concentration_report.md - Cable proximity + cost/benefit concentration analysis
  • data_center_map.html - Basic data center locations (Leaflet)
  • data_centers_cables_map.html - Data centers + submarine cables (Leaflet)
  • output/enhanced_master_data_center_spatial_clusters_map.html - Enhanced cluster visualization

Technical Architecture

Database

  • PostgreSQL with the PostGIS extension

Python Environment

  • Python 3.10+
  • Key libraries: psycopg2, geopandas, shapely, scikit-learn, pandas, numpy, requests, folium

Data Formats

  • CSV (raw data exports)
  • GeoJSON (watershed/cluster geometries)
  • Shapefiles (Census, USGS, FEMA inputs)
  • HTML (interactive Leaflet maps)

Configuration

Credentials are stored in ~/.zsh_secrets and loaded as environment variables:

  • PGWEB_*: PostgreSQL connection
  • EIA_API_KEY: EIA energy data
  • FCC_USERNAME, FCC_API_KEY: FCC broadband data
  • RDH_USERNAME, RDH_PASSWORD: Redistricting Data Hub
  • CENSUS_API_KEY: Census ACS API
  • LEGISCAN_API_KEY: LegiScan legislative data
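
A script can fail fast when a credential is missing instead of erroring partway through an ingest. The sketch below is one possible pattern, not the project's actual loader: load_settings and REQUIRED_KEYS are hypothetical names, and because the PGWEB_* suffixes are not listed here, the code simply collects every variable with that prefix. Variables in ~/.zsh_secrets must be exported for a Python process to see them.

```python
import os

# Variable names taken from the list above; PGWEB_* is handled by prefix.
REQUIRED_KEYS = (
    "EIA_API_KEY",
    "FCC_USERNAME",
    "FCC_API_KEY",
    "RDH_USERNAME",
    "RDH_PASSWORD",
    "CENSUS_API_KEY",
    "LEGISCAN_API_KEY",
)


def load_settings(environ=os.environ):
    """Split the environment into PGWEB_* settings and named API credentials."""
    postgres = {k: v for k, v in environ.items() if k.startswith("PGWEB_")}
    missing = [k for k in REQUIRED_KEYS if not environ.get(k)]
    if not postgres:
        missing.append("PGWEB_*")
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {"postgres": postgres, "api": {k: environ[k] for k in REQUIRED_KEYS}}


# Demonstration with a fake environment; PGWEB_EXAMPLE is a hypothetical suffix.
example_env = {key: "placeholder" for key in REQUIRED_KEYS}
example_env["PGWEB_EXAMPLE"] = "placeholder"
settings = load_settings(example_env)
print(sorted(settings["postgres"]))
```

Scripts that need only one source (for example, EIA ingestion) could check a narrower set of keys rather than requiring all of them.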

Quick Start

Basic Rebuild Sequence

# 1. Load base data center data
python3 scripts/load_postgis_data_centers.py
python3 scripts/load_postgis_osm_data_centers.py
python3 scripts/build_master_data_centers.py

# 2. Enrich with context layers
python3 scripts/create_data_center_census_tract_table.py --replace-final
python3 scripts/load_postgis_internet_cables.py
python3 scripts/ingest_eia_energy_layers.py --category power
python3 scripts/build_watershed_huc8_tables.py

# 3. Run analyses
python3 scripts/analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 scripts/analyze_cables_concentration.py > output/cables_analysis.txt

# 4. Open the main analysis notebook and run its cells
jupyter notebook cluster_analysis.ipynb

# 5. Load legislation (all states, 2016-2026)
python3 scripts/ingest_legiscan.py --all
# Weekly refresh (skips unchanged sessions):
python3 scripts/ingest_legiscan.py --fetch --load

Generate Maps

python3 scripts/make_data_center_map.py
python3 scripts/make_internet_cables_map.py

Key Outputs

Research Reports

  • Demographic, Energy & Watershed Analysis: output/data_center_demographic_ruca_energy_summary.md
  • Submarine Cable Proximity: cables_concentration_report.md

Interactive Maps

  • Data center locations with cluster assignments
  • Submarine cable routes and landing points
  • Energy infrastructure proximity
  • Watershed boundaries with data center counts

Data Exports

  • master_data_center_spatial_cluster_points.csv - Data center points with cluster IDs
  • master_data_center_spatial_cluster_summary.csv - Cluster-level statistics
  • output/master_data_center_huc8_watersheds.geojson - Watershed polygons
  • output/master_data_center_map_context.csv - Full context for mapping
  • output/master_data_center_state_energy_context.csv - State-level energy statistics

Data Quality Notes

  1. Incomplete power ratings: Only 5.9% of data centers have power ratings (108/1,833)
  2. Operator fragmentation: String variations ("Meta" vs. "Meta, Inc.", AWS variants) inflate distinct-operator counts
  3. Geocoding precision: 45 facilities use city-precision fallback coordinates (approximate tract assignment)
  4. RUCA coverage: 7 facilities failed the RUCA join (Puerto Rico / non-US)
  5. Broadband subscribers are a coarse benefit proxy (actual cloud users are global)
  6. EIA longitude correction: 2008-2010 generator coordinates had sign errors, corrected in flat-table build
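
The operator fragmentation in note 2 could be reduced by mapping each raw string to a canonical key before counting operators. The following is a sketch of that idea only: the suffix list and alias table are hypothetical and would need to be built from a review of the actual operator values.

```python
import re

# Hypothetical alias table; a real mapping would come from reviewing the data.
ALIASES = {
    "amazon web services": "aws",
}

# Common legal suffixes, matched as whole words with an optional trailing dot.
_SUFFIXES = re.compile(
    r"\b(incorporated|inc|llc|corporation|corp|ltd)\b\.?", re.IGNORECASE
)


def normalize_operator(raw):
    """Return a canonical lowercase operator key, or None for missing values."""
    if raw is None or not raw.strip():
        return None
    name = _SUFFIXES.sub("", raw)
    name = re.sub(r"[^\w\s]", " ", name)  # drop punctuation such as commas
    name = re.sub(r"\s+", " ", name).strip().lower()
    return ALIASES.get(name, name)


raw_operators = ["Meta", "Meta, Inc.", "AWS", "Amazon Web Services, Inc.", None]
distinct = {normalize_operator(op) for op in raw_operators} - {None}
print(sorted(distinct))
```

Keeping null operators as None, rather than an empty string, leaves them out of distinct-operator counts while still allowing them to be reported separately.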

Known Limitations

  • Power capacity: Only 5.9% populated - nearby EIA generator capacity used as proxy
  • Operator strings: Need deduplication (50 of 190 non-metro facilities have null operator)
  • Benefit measurement: Broadband subscribers are an imperfect proxy for cloud computing benefits
  • Universe: Limited to the 46 states that host data centers (states without data centers are excluded from the ACS comparison)

Research Ideas & Future Work

See research-ideas.md for detailed next steps and potential research directions.

Project Status

This is a mature, publication-ready geospatial analysis infrastructure combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.

The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with rigorous spatial statistics (Gini coefficients, Herfindahl-Hirschman Index (HHI) values, and Mann-Whitney tests).
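
For reference, the two concentration measures named above can be computed from a list of facility counts per unit (tract, watershed, or state). This is a minimal sketch using the standard textbook formulas; it is not taken from scripts/analyze_dc_tract_concentration.py, and the sample counts are invented.

```python
def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared shares, with shares in [0, 1]."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return sum((c / total) ** 2 for c in counts)


def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("counts must be non-empty and sum to a positive number")
    # Rank-weighted formula for sorted values x_1 <= ... <= x_n:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Invented counts: one unit hosts most facilities, the rest host few or none.
facilities_per_unit = [69, 12, 5, 3, 1, 0, 0, 0]
print(f"HHI = {hhi(facilities_per_unit):.3f}, Gini = {gini(facilities_per_unit):.3f}")
```

With shares expressed as fractions, HHI ranges from 1/n (evenly spread) to 1 (everything in one unit); some sources report it on a 0 to 10,000 scale by using percentage shares instead.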

License

Research data compiled from public sources. Please cite appropriately if used in publications.

Contact

For questions about this research project, please contact the repository owner.