US Data Centers - Geospatial Research Infrastructure

A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.

Project Overview

This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:

  • Spatial concentration patterns: Where are data centers located and why?
  • Infrastructure dependencies: How do data centers relate to submarine cables, power grids, and watersheds?
  • Equity and impact: Do data center host communities bear localized burdens while benefits are nationally dispersed?
  • Demographics: Who lives in data center-hosting census tracts?
  • Environmental exposure: What are the water, energy, and natural hazard exposures?

Key Research Question

Do data centers represent "concentrated costs / dispersed benefits"? The hypothesis is that host communities bear localized infrastructure burdens (power, water, land use) while the benefits of cloud computing are dispersed nationally.

Major Findings

Spatial Concentration

  • State level: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
    • Virginia alone: 20.6% (378 of 1,833 facilities)
  • Tract level: Top 1% of data center-hosting census tracts hold 14.6% of all facilities
    • Only 0.86% of residents in states with data centers live in a hosting tract
    • Per-capita burden is 115× higher in host tracts vs. state average
  • Watershed level: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
    • Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
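Concentration figures like "half of all facilities sit in 15 watersheds" reduce to a cumulative-share calculation over per-unit counts. The following is a minimal sketch of that arithmetic, not the repository's code; the function names and the toy counts are assumptions made for illustration.

```python
def units_needed_for_share(counts, target_share=0.5):
    """Smallest number of units (tracts, watersheds, states) whose
    combined count reaches target_share of the total."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    running = 0
    # Walk the units from largest to smallest and stop at the threshold.
    for n_units, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= target_share * total:
            return n_units
    return len(counts)


def top_share(counts, k):
    """Share of the total held by the k largest units."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return sum(sorted(counts, reverse=True)[:k]) / total


# Toy facility counts per watershed (illustrative, not project data).
facility_counts = [40, 25, 15, 10, 5, 3, 2]
print(units_needed_for_share(facility_counts, 0.5))  # 2
print(round(top_share(facility_counts, 1), 3))       # 0.4
```

Applied to real per-watershed or per-tract counts, the same two functions would yield the "N units hold half" and "top unit holds X%" style of statistic listed above.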

Demographics of Host Communities

Compared to the US average, data center host communities are:

  • Wealthier: Median household income $103,623 (vs. $78,538, +32%)
  • More educated: 49% bachelor's+ (vs. 35%, +14 pp)
  • More diverse: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
  • Better connected: 94.9% broadband (vs. 89%)
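The comparisons above mix two kinds of gap: a relative difference (income, +32%) and a percentage-point difference (education, +14 pp). A short sketch of both calculations, using the figures quoted in the list; the helper names are assumptions.

```python
def relative_difference(value, baseline):
    """Relative gap between value and baseline, as a fraction of baseline."""
    return (value - baseline) / baseline


def percentage_point_gap(share_pct, baseline_pct):
    """Absolute gap between two percentages, in percentage points."""
    return share_pct - baseline_pct


# Median household income: host communities vs. US average.
print(f"{relative_difference(103_623, 78_538):+.0%}")  # +32%

# Bachelor's degree or higher: 49% vs. 35%.
print(percentage_point_gap(49, 35))  # 14
```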

Infrastructure Insights

  • 89% of data centers are in metropolitan tracts (vs. 80% of all US tracts)
  • Non-metro data centers (11%) are dominated by hyperscalers:
    • AWS (67), Meta (22), Microsoft (10), and Google (4) together account for 103 facilities, 55% of the non-metro total
    • 66% are in Oregon + Washington (Columbia River hydro corridor)
  • Energy infrastructure: 4 states have >2/3 of generation within 50 km of a data center:
    • New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%

Submarine Cables

  • Data centers are NOT systematically closer to cables than ordinary US cities
  • Only 21.4% of data centers are within 100 km of a submarine cable landing point
  • Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables
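A proximity statistic such as "share of data centers within 100 km of a landing point" can be approximated with great-circle distances. The sketch below is a plain-Python stand-in and makes no claim about how the repository computes it (a PostGIS distance query is an equally plausible route); the function names and toy coordinates are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within(points, targets, radius_km=100.0):
    """Fraction of points lying within radius_km of at least one target."""
    if not points:
        raise ValueError("points must not be empty")
    near = sum(
        1
        for lat, lon in points
        if any(haversine_km(lat, lon, t_lat, t_lon) <= radius_km
               for t_lat, t_lon in targets)
    )
    return near / len(points)


# Toy coordinates (illustrative only): one point near the target, one far away.
data_centers = [(0.0, 0.0), (0.0, 5.0)]
landing_points = [(0.0, 0.5)]
print(round(haversine_km(0, 0, 0, 1), 1))          # 111.2
print(share_within(data_centers, landing_points))  # 0.5
```

The brute-force loop is quadratic in the number of points and targets, which is fine for a few thousand of each; a spatial index would be the natural next step for larger inputs.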

Data Sources

Primary Data Center Inventories

  • Curated Sample: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
  • OpenStreetMap: 1,549 OSM features tagged as data centers (via Overpass API)
  • IM3 Model Data: PNNL's Integrated Multisector Multiscale Modeling existing facilities
  • Master Table: 1,833 deduplicated facilities merging all sources

Geospatial Context Layers

  • US Census: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
  • USDA RUCA 2020: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
  • USGS HUC8 Watersheds: 2,139 subbasin polygons for water-stress analysis
  • FEMA NRI: National Risk Index with 18 natural hazard risk scores by census tract

Infrastructure Layers

  • Submarine Cables: 693 cables, 3,361 landing points (TeleGeography-style)
  • EIA Energy Data: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
  • FCC Broadband Data: Provider availability by location/block

Additional Data

  • RDH Precinct Vote Data: Election results for political-economy analysis
  • NOAA HMS Smoke Data: Wildfire smoke exposure (2005-2025)
  • USDM Drought Data: Drought severity
  • Utility Rate Tracker: State-level electricity rate increases

Repository Structure

Core Python Scripts

Data Ingestion

  • load_postgis_data_centers.py - Load curated data center CSV into PostGIS
  • load_postgis_osm_data_centers.py - Fetch OSM data centers via Overpass API
  • build_master_data_centers.py - Deduplicate & merge curated + OSM sources
  • load_postgis_internet_cables.py - Load submarine cables and landing points
  • ingest_eia_energy_layers.py - Ingest EIA energy data via API
  • build_watershed_huc8_tables.py - Load USGS HUC8 watersheds

Enrichment

  • create_data_center_census_tract_table.py - Join data centers to Census tracts with ACS demographics
  • build_fcc_bdc_broadband_connection_table.py - Build per-facility broadband provider table
  • build_fcc_bdc_location_provider_aggregates.py - Aggregate FCC BDC data by county/tract

Analysis

  • analyze_dc_tract_concentration.py - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
  • analyze_cables_concentration.py - Test if data centers cluster near submarine cables
  • make_data_center_map.py - Generate Leaflet map of data centers
  • make_internet_cables_map.py - Generate Leaflet map of data centers + cables

Key Jupyter Notebooks

  • spatial_clustering_master_data_centers.ipynb - DBSCAN clustering of data centers
  • cluster_analysis.ipynb - Main demographic/energy/watershed analysis
  • fema_nri_data_centers.ipynb - Join data centers to FEMA hazard scores
  • rdh_precinct_vote_data_centers.ipynb - Join data centers to election data
  • usdm_drought_data_centers.ipynb - Drought exposure analysis
  • hms_smoke_data_centers.ipynb - Wildfire smoke exposure
  • enhanced_data_center_cluster_map.ipynb - Generate enhanced cluster visualization

Output Files

  • output/data_center_demographic_ruca_energy_summary.md - Flagship analysis report
  • cables_concentration_report.md - Cable proximity + cost/benefit concentration analysis
  • data_center_map.html - Basic data center locations (Leaflet)
  • data_centers_cables_map.html - Data centers + submarine cables (Leaflet)
  • output/enhanced_master_data_center_spatial_clusters_map.html - Enhanced cluster visualization

Technical Architecture

Database

  • PostgreSQL 13+ with PostGIS 3.x
  • Database name: data_centers
  • See database-tables.md for complete schema documentation

Python Environment

  • Python 3.10+
  • Key libraries: psycopg2, geopandas, shapely, scikit-learn, pandas, numpy, requests, folium

Data Formats

  • CSV (raw data exports)
  • GeoJSON (watershed/cluster geometries)
  • Shapefiles (Census, USGS, FEMA inputs)
  • HTML (interactive Leaflet maps)

Configuration

Credentials are stored in ~/.zsh_secrets and exposed to the scripts as environment variables:

  • PGWEB_*: PostgreSQL connection
  • EIA_API_KEY: EIA energy data
  • FCC_USERNAME, FCC_API_KEY: FCC broadband data
  • RDH_USERNAME, RDH_PASSWORD: Redistricting Data Hub
  • CENSUS_API_KEY: Census ACS API
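Before a long rebuild it can help to confirm that the variables listed above are actually set. This is a hedged sketch of such a check, not something the repository is known to ship: the helper names are assumptions, and because the README only gives the PGWEB_* prefix, the code collects whatever variables carry that prefix rather than guessing their names.

```python
import os

# Variable names taken from the list above (PGWEB_* is handled by prefix).
REQUIRED_KEYS = (
    "EIA_API_KEY",
    "FCC_USERNAME",
    "FCC_API_KEY",
    "RDH_USERNAME",
    "RDH_PASSWORD",
    "CENSUS_API_KEY",
)


def missing_credentials(environ=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [key for key in required if not env.get(key)]


def prefixed_settings(prefix, environ=None):
    """Collect variables sharing a prefix, keyed by the lowercased remainder."""
    env = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }


# Report names only; never print the secret values themselves.
print("Missing variables:", missing_credentials())
print("PGWEB_* settings found:", sorted(prefixed_settings("PGWEB_")))
```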

Quick Start

Basic Rebuild Sequence

# 1. Load base data center data
python3 load_postgis_data_centers.py
python3 load_postgis_osm_data_centers.py
python3 build_master_data_centers.py

# 2. Enrich with context layers
python3 create_data_center_census_tract_table.py --replace-final
python3 load_postgis_internet_cables.py
python3 ingest_eia_energy_layers.py --category power
python3 build_watershed_huc8_tables.py

# 3. Run analyses
python3 analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 analyze_cables_concentration.py > output/cables_analysis.txt

# 4. Open the main analysis notebook and run its cells
jupyter notebook cluster_analysis.ipynb
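The load and enrichment steps above depend on each other, so a rebuild should stop at the first failure. The following is a minimal sketch of a wrapper that does this; it is not part of the repository, and `run_steps` and `STEPS` are hypothetical names. The script names and flags are copied from the sequence above.

```python
import subprocess
import sys

# Steps 1 and 2 of the rebuild sequence, in order.
STEPS = [
    ["load_postgis_data_centers.py"],
    ["load_postgis_osm_data_centers.py"],
    ["build_master_data_centers.py"],
    ["create_data_center_census_tract_table.py", "--replace-final"],
    ["load_postgis_internet_cables.py"],
    ["ingest_eia_energy_layers.py", "--category", "power"],
    ["build_watershed_huc8_tables.py"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each script with the current interpreter, in order.

    Raises RuntimeError on the first non-zero exit code so later steps
    never run against a partially built database. Returns the names of
    the scripts that completed.
    """
    completed = []
    for step in steps:
        result = runner([sys.executable, *step])
        if result.returncode != 0:
            raise RuntimeError(
                f"step failed: {' '.join(step)} (exit {result.returncode})"
            )
        completed.append(step[0])
    return completed


# From the repository root, with credentials loaded:
# run_steps(STEPS)
```

The `runner` parameter exists only so the function can be exercised without launching real processes; in normal use the default `subprocess.run` is what executes each script.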

Generate Maps

python3 make_data_center_map.py
python3 make_internet_cables_map.py

Key Outputs

Research Reports

  • Demographic, Energy & Watershed Analysis: output/data_center_demographic_ruca_energy_summary.md
  • Submarine Cable Proximity: cables_concentration_report.md

Interactive Maps

  • Data center locations with cluster assignments
  • Submarine cable routes and landing points
  • Energy infrastructure proximity
  • Watershed boundaries with data center counts

Data Exports

  • master_data_center_spatial_cluster_points.csv - Data center points with cluster IDs
  • master_data_center_spatial_cluster_summary.csv - Cluster-level statistics
  • output/master_data_center_huc8_watersheds.geojson - Watershed polygons
  • output/master_data_center_map_context.csv - Full context for mapping
  • output/master_data_center_state_energy_context.csv - State-level energy statistics

Data Quality Notes

  1. Incomplete power ratings: Only 5.9% of data centers have power ratings (108/1,833)
  2. Operator fragmentation: String variations ("Meta" vs. "Meta, Inc.") inflate distinct-operator counts
  3. Approximate coordinates: 45 facilities use city-precision fallback coordinates, so their tract assignment is approximate
  4. RUCA join failures: 7 facilities failed the RUCA join (Puerto Rico / non-US)
  5. Broadband subscribers are a coarse benefit proxy (actual cloud users are global)
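The operator-fragmentation issue in note 2 can be reduced by normalizing names before counting distinct operators. This is an illustrative sketch only: `normalize_operator` is a hypothetical helper, and the suffix list is an assumption that would need tuning against the real operator strings.

```python
import re

# Hypothetical list of corporate suffixes to strip; extend as needed.
_SUFFIX = re.compile(
    r"[\s,]+(inc|llc|corp|corporation|co|ltd)\.?$",
    re.IGNORECASE,
)


def normalize_operator(name):
    """Collapse whitespace, strip trailing corporate suffixes, and casefold."""
    cleaned = " ".join(name.split())
    previous = None
    # Repeat so stacked suffixes such as "Co., Ltd." are all removed.
    while cleaned != previous:
        previous = cleaned
        cleaned = _SUFFIX.sub("", cleaned).strip(" ,")
    return cleaned.casefold()


raw_names = ["Meta", "Meta, Inc.", "META  Inc"]
print({normalize_operator(n) for n in raw_names})  # {'meta'}
```

Suffix stripping handles punctuation and legal-form variants, but not renamed or subsidiary brands; those would still need a manually curated alias table.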

Research Ideas & Future Work

See research-ideas.md for detailed next steps and potential research directions.

Project Status

This is a mature, publication-ready geospatial analysis infrastructure combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.

The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with concentration statistics (Gini coefficients, Herfindahl-Hirschman indices) and Mann-Whitney tests.
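For readers unfamiliar with the two concentration measures named above, here is a minimal sketch of the textbook definitions applied to a list of per-unit counts. It is not the code in analyze_dc_tract_concentration.py, and the function names are assumptions.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even).

    Uses G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    with x sorted ascending and i running from 1 to n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("values must be non-empty with a positive sum")
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def hhi(values):
    """Herfindahl-Hirschman Index: sum of squared shares, from 1/n to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return sum((v / total) ** 2 for v in values)


even = [1, 1, 1, 1]          # facilities spread evenly over four units
concentrated = [0, 0, 0, 4]  # all facilities in one unit
print(gini(even), hhi(even))                  # 0.0 0.25
print(gini(concentrated), hhi(concentrated))  # 0.75 1.0
```

With n units the Gini coefficient in this form tops out at (n - 1) / n rather than 1, which is why the fully concentrated four-unit example gives 0.75.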

License

Research data compiled from public sources. Please cite appropriately if used in publications.

Contact

For questions about this research project, please contact the repository owner.