US Data Centers - Geospatial Research Infrastructure

A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.

Documentation

  • Database Tables - Complete database schema with table descriptions, column definitions, and SQL examples
  • Database Table Previews - Research-team-friendly Markdown previews showing the first rows of each documented table
  • Research Ideas - Future research directions, data improvements, and potential collaborations
  • SQL Queries - Pre-built legislative analysis queries

Project Overview

This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:

  • Spatial concentration patterns: Where are data centers located and why?
  • Infrastructure dependencies: How do data centers relate to submarine cables, power grids, and watersheds?
  • Equity and impact: Do data center host communities bear localized burdens while benefits are nationally dispersed?
  • Demographics: Who lives in data center-hosting census tracts?
  • Environmental exposure: What are the water, energy, and natural hazard exposures?

Key Research Question

Do data centers represent "concentrated costs / dispersed benefits"? The hypothesis is that host communities bear localized infrastructure burdens (power, water, land use) while the benefits of cloud computing are nationally dispersed.

Major Findings

Spatial Concentration

  • State level: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
    • Virginia alone: 20.6% (378 of 1,833 facilities)
  • Tract level: Top 1% of data center-hosting census tracts hold 14.6% of all facilities
    • Only 0.86% of residents in data center-hosting states live in a hosting tract
    • Per-capita burden is 115× higher in host tracts vs. state average
  • Watershed level: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
    • Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
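
The shares above are ratios of facility counts per spatial unit. The following is a minimal sketch of how a top-k share, a Herfindahl-Hirschman Index, and a Gini coefficient could be computed from such counts. The function names and the example counts are hypothetical and are not taken from the repository's scripts; only the Virginia figure (378 of 1,833) comes from the text.

```python
def top_k_share(counts, k):
    """Fraction of all facilities held by the k largest units."""
    total = sum(counts)
    return sum(sorted(counts, reverse=True)[:k]) / total


def hhi(counts):
    """Herfindahl-Hirschman Index on fractional shares (ranges from 1/n to 1.0)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def gini(counts):
    """Gini coefficient of non-negative counts (0 means a perfectly even spread)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted formula applied to values sorted in ascending order.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical facility counts for five spatial units (illustrative only).
example_counts = [40, 30, 20, 5, 5]
print(f"top-1 share: {top_k_share(example_counts, 1):.1%}")
print(f"HHI:  {hhi(example_counts):.3f}")
print(f"Gini: {gini(example_counts):.3f}")

# Figure quoted in the list above: Virginia holds 378 of 1,833 facilities.
print(f"Virginia share: {378 / 1833:.1%}")
```

The same three functions apply unchanged whether the unit is a state, a census tract, or a HUC8 watershed; only the list of counts differs.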

Demographics of Host Communities

Compared to the US average, data center host communities are:

  • Wealthier: Median household income $103,623 (vs. $78,538, +32%)
  • More educated: 49% bachelor's+ (vs. 35%, +14 pp)
  • More diverse: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
  • Better connected: 94.9% broadband (vs. 89%)

Infrastructure Insights

  • 89% of data centers are in metropolitan tracts (vs. 80% of all US tracts), an over-index of only 1.11×
  • Non-metro data centers (11%) are dominated by hyperscalers:
    • AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities
    • 66% are in Oregon + Washington (Columbia River hydro corridor)
  • Grid saturation: 4 states have >2/3 of generation within 50 km of a data center:
    • New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%
  • Hyperscaler energy strategies (non-metro sites):
    • AWS: 114 GW wind + 66 GW hydro
    • Microsoft: 13 GW nuclear (Palo Verde co-location)
    • Meta: 16 GW solar

Clustered vs. Isolated Facilities

Facilities in DBSCAN clusters differ significantly from isolated sites:

  • $35K income gap: Clustered sites sit in tracts with a median income of $108K vs. $73K for isolated sites
  • +18 pp education: 51% bachelor's+ vs. 33%
  • More diverse: Non-Hispanic white share is 25 pp lower
  • 2× energy infrastructure: 89 vs. 40 generators within 50 km

Submarine Cables

  • Data centers are NOT systematically closer to cables than ordinary US cities
  • Only 21.4% of data centers are within 100 km of a submarine cable landing point
  • Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables

Data Sources

Primary Data Center Inventories

  • Curated Sample: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
  • OpenStreetMap: 1,549 OSM features tagged as data centers (via Overpass API)
  • IM3 Model Data: Existing-facility data from PNNL's Integrated Multisector Multiscale Modeling (IM3) project
  • Master Table: 1,833 deduplicated facilities merging all sources

Geospatial Context Layers

  • US Census: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
  • USDA RUCA 2020: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
  • USGS HUC8 Watersheds: 2,139 subbasin polygons for water-stress analysis
  • FEMA NRI: National Risk Index with 18 natural hazard risk scores by census tract

Infrastructure Layers

  • Submarine Cables: 693 cables, 3,361 landing points (TeleGeography-style)
  • EIA Energy Data: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
  • FCC Broadband Data: Provider availability by location/block

Additional Data

  • RDH Precinct Vote Data: Election results for political-economy analysis
  • NOAA HMS Smoke Data: Wildfire smoke exposure (2005-2025)
  • USDM Drought Data: Drought severity
  • Utility Rate Tracker: State-level electricity rate increases
  • LegiScan Legislative Data: All US state + federal bills 2016-2026 (1.3M bills, 646 sessions), tagged for data center, ratepayer, grid, tax, and siting topics

Repository Structure

Core Python Scripts

Data Ingestion (scripts/)

  • scripts/load_postgis_data_centers.py - Load curated data center CSV into PostGIS
  • scripts/load_postgis_osm_data_centers.py - Fetch OSM data centers via Overpass API
  • scripts/build_master_data_centers.py - Deduplicate & merge curated + OSM sources
  • scripts/load_postgis_internet_cables.py - Load submarine cables and landing points
  • scripts/ingest_eia_energy_layers.py - Ingest EIA energy data via API
  • scripts/build_watershed_huc8_tables.py - Load USGS HUC8 watersheds
  • scripts/ingest_legiscan.py - Download all US state/federal bills 2016-2026 via LegiScan API, tag for data center research topics

Enrichment

  • scripts/create_data_center_census_tract_table.py - Join data centers to Census tracts with ACS demographics
  • scripts/build_fcc_bdc_broadband_connection_table.py - Build per-facility broadband provider table
  • scripts/build_fcc_bdc_location_provider_aggregates.py - Aggregate FCC BDC data by county/tract

Analysis

  • scripts/analyze_dc_tract_concentration.py - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
  • scripts/analyze_cables_concentration.py - Test if data centers cluster near submarine cables
  • scripts/make_data_center_map.py - Generate Leaflet map of data centers
  • scripts/make_internet_cables_map.py - Generate Leaflet map of data centers + cables

Key Jupyter Notebooks

  • spatial_clustering_master_data_centers.ipynb - DBSCAN clustering of data centers
  • cluster_analysis.ipynb - Main demographic/energy/watershed analysis
  • fema_nri_data_centers.ipynb - Join data centers to FEMA hazard scores
  • rdh_precinct_vote_data_centers.ipynb - Join data centers to election data
  • usdm_drought_data_centers.ipynb - Drought exposure analysis
  • hms_smoke_data_centers.ipynb - Wildfire smoke exposure
  • enhanced_data_center_cluster_map.ipynb - Generate enhanced cluster visualization
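
To illustrate what the DBSCAN clustering notebook listed above does conceptually, here is a dependency-free sketch of the algorithm on geographic points. It is a stand-in, not the notebook's code: scikit-learn, listed among the key libraries, provides sklearn.cluster.DBSCAN, and the notebook's actual parameters are not stated in this README. The eps_km and min_samples values and the coordinates below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def dbscan(points, eps_km, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for isolated."""
    n = len(points)
    # Brute-force neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if haversine_km(points[i], points[j]) <= eps_km]
        for i in range(n)
    ]
    labels = [None] * n
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return labels


# Hypothetical coordinates: three nearby sites and one far away (illustrative only).
sites = [(39.04, -77.49), (39.05, -77.46), (39.02, -77.45), (44.0, -120.5)]
print(dbscan(sites, eps_km=10, min_samples=3))
```

Points labeled -1 correspond to the "isolated" facilities discussed in the findings, while any non-negative label marks membership in a cluster.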

Output Files

  • output/data_center_demographic_ruca_energy_summary.md - Flagship analysis report
  • cables_concentration_report.md - Cable proximity + cost/benefit concentration analysis
  • data_center_map.html - Basic data center locations (Leaflet)
  • data_centers_cables_map.html - Data centers + submarine cables (Leaflet)
  • output/enhanced_master_data_center_spatial_clusters_map.html - Enhanced cluster visualization

Technical Architecture

Database

Python Environment

  • Python 3.10+
  • Key libraries: psycopg2, geopandas, shapely, scikit-learn, pandas, numpy, requests, folium

Data Formats

  • CSV (raw data exports)
  • GeoJSON (watershed/cluster geometries)
  • Shapefiles (Census, USGS, FEMA inputs)
  • HTML (interactive Leaflet maps)

Configuration

Credentials are stored in ~/.zsh_secrets and loaded via environment variables:

  • PGWEB_*: PostgreSQL connection
  • EIA_API_KEY: EIA energy data
  • FCC_USERNAME, FCC_API_KEY: FCC broadband data
  • RDH_USERNAME, RDH_PASSWORD: Redistricting Data Hub
  • CENSUS_API_KEY: Census ACS API
  • LEGISCAN_API_KEY: LegiScan legislative data
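
A script can check up front that the variables listed above are present before any ingestion starts. The sketch below is one hedged way to do that. The load_settings helper is hypothetical and not part of the repository, and because the exact PGWEB_* suffixes are not documented here, the PostgreSQL variables are collected by prefix rather than by name.

```python
import os

# Variable names taken from the Configuration list above.
REQUIRED_KEYS = [
    "EIA_API_KEY",
    "FCC_USERNAME",
    "FCC_API_KEY",
    "RDH_USERNAME",
    "RDH_PASSWORD",
    "CENSUS_API_KEY",
    "LEGISCAN_API_KEY",
]


def load_settings(environ=None):
    """Collect settings from the environment and list any missing credentials."""
    env = os.environ if environ is None else environ
    # The PGWEB_* suffixes are not documented, so gather whatever shares the prefix.
    pg_settings = {k: v for k, v in env.items() if k.startswith("PGWEB_")}
    api_keys = {k: env[k] for k in REQUIRED_KEYS if env.get(k)}
    missing = [k for k in REQUIRED_KEYS if k not in api_keys]
    return pg_settings, api_keys, missing


pg_settings, api_keys, missing = load_settings()
# Print variable names only, never the secret values themselves.
print("PGWEB_* variables found:", sorted(pg_settings))
if missing:
    print("Missing credentials:", ", ".join(missing))
```

Passing a plain dictionary as environ makes the helper easy to exercise without touching the real shell environment.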

Quick Start

Basic Rebuild Sequence

# 1. Load base data center data
python3 scripts/load_postgis_data_centers.py
python3 scripts/load_postgis_osm_data_centers.py
python3 scripts/build_master_data_centers.py

# 2. Enrich with context layers
python3 scripts/create_data_center_census_tract_table.py --replace-final
python3 scripts/load_postgis_internet_cables.py
python3 scripts/ingest_eia_energy_layers.py --category power
python3 scripts/build_watershed_huc8_tables.py

# 3. Run analyses
python3 scripts/analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 scripts/analyze_cables_concentration.py > output/cables_analysis.txt

# 4. Execute notebooks
jupyter notebook cluster_analysis.ipynb

# 5. Load legislation (all states, 2016-2026)
python3 scripts/ingest_legiscan.py --all
# Weekly refresh (skips unchanged sessions):
python3 scripts/ingest_legiscan.py --fetch --load

Generate Maps

python3 scripts/make_data_center_map.py
python3 scripts/make_internet_cables_map.py

Key Outputs

Research Reports

  • Demographic, Energy & Watershed Analysis: output/data_center_demographic_ruca_energy_summary.md
  • Submarine Cable Proximity: cables_concentration_report.md

Interactive Maps

  • Data center locations with cluster assignments
  • Submarine cable routes and landing points
  • Energy infrastructure proximity
  • Watershed boundaries with data center counts

Data Exports

  • master_data_center_spatial_cluster_points.csv - Data center points with cluster IDs
  • master_data_center_spatial_cluster_summary.csv - Cluster-level statistics
  • output/master_data_center_huc8_watersheds.geojson - Watershed polygons
  • output/master_data_center_map_context.csv - Full context for mapping
  • output/master_data_center_state_energy_context.csv - State-level energy statistics

Data Quality Notes

  1. Incomplete power ratings: Only 5.9% of data centers have power ratings (108/1,833)
  2. Operator fragmentation: String variations ("Meta" vs. "Meta, Inc.", AWS variants) inflate distinct-operator counts
  3. 45 facilities use city-precision fallback coordinates (approximate tract assignment)
  4. 7 facilities failed RUCA join (Puerto Rico / non-US)
  5. Broadband subscribers are a coarse benefit proxy (actual cloud users are global)
  6. EIA longitude correction: 2008-2010 generator coordinates had sign errors, corrected in flat-table build

Known Limitations

  • Power capacity: Only 5.9% populated - nearby EIA generator capacity used as proxy
  • Operator strings: Need deduplication (50 of 190 non-metro facilities have null operator)
  • Benefit measurement: Broadband subscribers are an imperfect proxy for cloud computing benefits
  • Universe: Limited to the 46 data center-hosting states (states without data centers are excluded from the ACS comparison)

Research Ideas & Future Work

See research-ideas.md for detailed next steps and potential research directions.

Project Status

This is a mature, publication-ready geospatial analysis infrastructure combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.

The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with rigorous spatial statistics (Gini coefficients, the Herfindahl-Hirschman Index, and Mann-Whitney tests).

License

Research data compiled from public sources. Please cite appropriately if used in publications.

Contact

For questions about this research project, please contact the repository owner.
