US Data Centers - Geospatial Research Infrastructure

A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.

Documentation

  • Database Tables - Complete database schema with table descriptions, column definitions, and SQL examples
  • Research Ideas - Future research directions, data improvements, and potential collaborations

Project Overview

This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:

  • Spatial concentration patterns: Where are data centers located and why?
  • Infrastructure dependencies: How do data centers relate to submarine cables, power grids, and watersheds?
  • Equity and impact: Do data center host communities bear localized burdens while benefits are nationally dispersed?
  • Demographics: Who lives in data center-hosting census tracts?
  • Environmental exposure: What are the water, energy, and natural hazard exposures?

Key Research Question

Do data centers represent "concentrated costs / dispersed benefits"? The hypothesis is that host communities bear localized infrastructure burdens (power, water, land use) while the benefits of cloud computing are dispersed nationally.

Major Findings

Spatial Concentration

  • State level: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
    • Virginia alone: 20.6% (378 of 1,833 facilities)
  • Tract level: Top 1% of data center-hosting census tracts hold 14.6% of all facilities
    • Only 0.86% of data center-state residents live in a hosting tract
    • Per-capita burden is 115× higher in host tracts vs. state average
  • Watershed level: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
    • Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
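The "top X% hold Y%" and "half of all facilities sit in N units" figures above are concentration shares over ranked counts. A minimal sketch of that arithmetic follows; the function names and the sample counts are made up for illustration and are not taken from the project's scripts.

```python
import math


def top_share(counts, top_fraction):
    """Share of the total held by the top `top_fraction` of units (at least one unit)."""
    ranked = sorted(counts, reverse=True)
    k = max(1, math.ceil(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


def units_to_reach(counts, target_share):
    """Smallest number of top-ranked units whose combined count reaches `target_share`."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    running = 0
    for n, count in enumerate(ranked, start=1):
        running += count
        if running >= target_share * total:
            return n
    return len(ranked)


# Made-up facility counts per hosting unit (tract or watershed), not project data.
counts = [40, 25, 10, 5, 5, 5, 4, 3, 2, 1]
print(top_share(counts, 0.10))       # share held by the top 10% of units
print(units_to_reach(counts, 0.50))  # units needed to cover half of all facilities
```

Applied to per-tract or per-HUC8 facility counts, the same two functions would produce figures of the kind reported in this section.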

Demographics of Host Communities

Compared to the US average, data center host communities are:

  • Wealthier: Median household income $103,623 (vs. $78,538, +32%)
  • More educated: 49% bachelor's+ (vs. 35%, +14 pp)
  • More diverse: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
  • Better connected: 94.9% broadband (vs. 89%)
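The comparisons above mix two kinds of difference: a relative difference in percent (income) and a difference in percentage points (education). A small sketch of both, using the figures from the list; the helper names are hypothetical.

```python
def relative_diff_pct(host, baseline):
    """Relative difference of `host` over `baseline`, in percent."""
    return (host - baseline) / baseline * 100


def point_diff(host_pct, baseline_pct):
    """Difference between two percentages, in percentage points (pp)."""
    return host_pct - baseline_pct


# Median household income: host communities vs. US average.
print(round(relative_diff_pct(103_623, 78_538)))
# Bachelor's degree or higher: host communities vs. US average.
print(point_diff(49, 35))
```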

Infrastructure Insights

  • 89% of data centers are in metropolitan tracts (vs. 80% of all US tracts)
  • Non-metro data centers (11%) are dominated by hyperscalers:
    • AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities
    • 66% are in Oregon + Washington (Columbia River hydro corridor)
  • Energy infrastructure: In 4 states, more than two-thirds of generation is within 50 km of a data center:
    • New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%

Submarine Cables

  • Data centers are NOT systematically closer to cables than ordinary US cities
  • Only 21.4% of data centers are within 100 km of a submarine cable landing point
  • Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables
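A proximity share such as "within 100 km of a landing point" can be approximated with great-circle distances. The sketch below is pure Python and only illustrates the idea; the project itself may well compute this in PostGIS instead. The coordinates are rough illustrative values, not rows from the project tables, and the function names are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within(points, landings, radius_km):
    """Fraction of `points` whose nearest landing point is within `radius_km`."""
    near = sum(
        1
        for lat, lon in points
        if min(haversine_km(lat, lon, la, lo) for la, lo in landings) <= radius_km
    )
    return near / len(points)


# Approximate, illustrative coordinates only.
facilities = [(39.04, -77.49), (39.96, -83.00), (36.85, -76.00)]  # Ashburn, Columbus, near Virginia Beach
landings = [(36.85, -75.98)]  # Virginia Beach area
print(share_within(facilities, landings, 100))
```

A brute-force nearest-landing search like this is fine for a few thousand facilities and landing points; a spatial index would be the usual choice at larger scale.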

Data Sources

Primary Data Center Inventories

  • Curated Sample: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
  • OpenStreetMap: 1,549 OSM features tagged as data centers (via Overpass API)
  • IM3 Model Data: Existing-facility records from PNNL's Integrated Multisector Multiscale Modeling (IM3) project
  • Master Table: 1,833 deduplicated facilities merging all sources

Geospatial Context Layers

  • US Census: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
  • USDA RUCA 2020: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
  • USGS HUC8 Watersheds: 2,139 subbasin polygons for water-stress analysis
  • FEMA NRI: National Risk Index with 18 natural hazard risk scores by census tract

Infrastructure Layers

  • Submarine Cables: 693 cables, 3,361 landing points (TeleGeography-style)
  • EIA Energy Data: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
  • FCC Broadband Data: Provider availability by location/block

Additional Data

  • RDH Precinct Vote Data: Election results for political-economy analysis
  • NOAA HMS Smoke Data: Wildfire smoke exposure (2005-2025)
  • USDM Drought Data: Drought severity
  • Utility Rate Tracker: State-level electricity rate increases

Repository Structure

Core Python Scripts

Data Ingestion

  • load_postgis_data_centers.py - Load curated data center CSV into PostGIS
  • load_postgis_osm_data_centers.py - Fetch OSM data centers via Overpass API
  • build_master_data_centers.py - Deduplicate & merge curated + OSM sources
  • load_postgis_internet_cables.py - Load submarine cables and landing points
  • ingest_eia_energy_layers.py - Ingest EIA energy data via API
  • build_watershed_huc8_tables.py - Load USGS HUC8 watersheds

Enrichment

  • create_data_center_census_tract_table.py - Join data centers to Census tracts with ACS demographics
  • build_fcc_bdc_broadband_connection_table.py - Build per-facility broadband provider table
  • build_fcc_bdc_location_provider_aggregates.py - Aggregate FCC BDC data by county/tract

Analysis

  • analyze_dc_tract_concentration.py - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
  • analyze_cables_concentration.py - Test if data centers cluster near submarine cables
  • make_data_center_map.py - Generate Leaflet map of data centers
  • make_internet_cables_map.py - Generate Leaflet map of data centers + cables

Key Jupyter Notebooks

  • spatial_clustering_master_data_centers.ipynb - DBSCAN clustering of data centers
  • cluster_analysis.ipynb - Main demographic/energy/watershed analysis
  • fema_nri_data_centers.ipynb - Join data centers to FEMA hazard scores
  • rdh_precinct_vote_data_centers.ipynb - Join data centers to election data
  • usdm_drought_data_centers.ipynb - Drought exposure analysis
  • hms_smoke_data_centers.ipynb - Wildfire smoke exposure
  • enhanced_data_center_cluster_map.ipynb - Generate enhanced cluster visualization

Output Files

  • output/data_center_demographic_ruca_energy_summary.md - Flagship analysis report
  • cables_concentration_report.md - Cable proximity + cost/benefit concentration analysis
  • data_center_map.html - Basic data center locations (Leaflet)
  • data_centers_cables_map.html - Data centers + submarine cables (Leaflet)
  • output/enhanced_master_data_center_spatial_clusters_map.html - Enhanced cluster visualization

Technical Architecture

Database

  • PostgreSQL 13+ with PostGIS 3.x
  • Database name: data_centers
  • See database-tables.md for complete schema documentation

Python Environment

  • Python 3.10+
  • Key libraries: psycopg2, geopandas, shapely, scikit-learn, pandas, numpy, requests, folium

Data Formats

  • CSV (raw data exports)
  • GeoJSON (watershed/cluster geometries)
  • Shapefiles (Census, USGS, FEMA inputs)
  • HTML (interactive Leaflet maps)

Configuration

Credentials are stored in ~/.zsh_secrets and loaded via environment variables:

  • PGWEB_*: PostgreSQL connection
  • EIA_API_KEY: EIA energy data
  • FCC_USERNAME, FCC_API_KEY: FCC broadband data
  • RDH_USERNAME, RDH_PASSWORD: Redistricting Data Hub
  • CENSUS_API_KEY: Census ACS API
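Before running the pipeline it can help to confirm that the variables listed above are actually set. The following is a minimal pre-flight sketch, not part of the repository. Because the individual PGWEB_* names are not spelled out here, it only reports which variables with that prefix exist rather than assuming specific ones.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = [
    "EIA_API_KEY",
    "FCC_USERNAME",
    "FCC_API_KEY",
    "RDH_USERNAME",
    "RDH_PASSWORD",
    "CENSUS_API_KEY",
]
PG_PREFIX = "PGWEB_"  # the exact PGWEB_* names are not assumed


def check_environment(env=None):
    """Return (missing required names, sorted names of PGWEB_* variables found)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    pg_vars = sorted(name for name in env if name.startswith(PG_PREFIX))
    return missing, pg_vars


missing, pg_vars = check_environment()
print("missing:", missing)
print("PostgreSQL variables found:", pg_vars)  # names only, never values
```

Only variable names are printed, so the output is safe to paste into an issue or log.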

Quick Start

Basic Rebuild Sequence

# 1. Load base data center data
python3 load_postgis_data_centers.py
python3 load_postgis_osm_data_centers.py
python3 build_master_data_centers.py

# 2. Enrich with context layers
python3 create_data_center_census_tract_table.py --replace-final
python3 load_postgis_internet_cables.py
python3 ingest_eia_energy_layers.py --category power
python3 build_watershed_huc8_tables.py

# 3. Run analyses
python3 analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 analyze_cables_concentration.py > output/cables_analysis.txt

# 4. Open the main analysis notebook and run its cells
jupyter notebook cluster_analysis.ipynb
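The ingestion and enrichment steps (1 and 2) of the rebuild sequence can also be driven from one small runner that stops at the first failure. This is a hypothetical convenience wrapper, not a script in the repository; the script names and flags are copied from the sequence above.

```python
import subprocess
import sys

# Steps 1 and 2 of the rebuild sequence, in order.
STEPS = [
    ["load_postgis_data_centers.py"],
    ["load_postgis_osm_data_centers.py"],
    ["build_master_data_centers.py"],
    ["create_data_center_census_tract_table.py", "--replace-final"],
    ["load_postgis_internet_cables.py"],
    ["ingest_eia_energy_layers.py", "--category", "power"],
    ["build_watershed_huc8_tables.py"],
]


def build_commands(python=sys.executable):
    """Prefix each step with the interpreter to produce argv lists."""
    return [[python, *step] for step in STEPS]


def run_all(commands, runner=subprocess.run):
    """Run commands in order; check=True raises on the first non-zero exit."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        runner(cmd, check=True)


# Usage, from the repository root with credentials loaded:
#     run_all(build_commands())
```

Stopping at the first failure matters here because later steps read tables that earlier steps create.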

Generate Maps

python3 make_data_center_map.py
python3 make_internet_cables_map.py

Key Outputs

Research Reports

  • Demographic, Energy & Watershed Analysis: output/data_center_demographic_ruca_energy_summary.md
  • Submarine Cable Proximity: cables_concentration_report.md

Interactive Maps

  • Data center locations with cluster assignments
  • Submarine cable routes and landing points
  • Energy infrastructure proximity
  • Watershed boundaries with data center counts

Data Exports

  • master_data_center_spatial_cluster_points.csv - Data center points with cluster IDs
  • master_data_center_spatial_cluster_summary.csv - Cluster-level statistics
  • output/master_data_center_huc8_watersheds.geojson - Watershed polygons
  • output/master_data_center_map_context.csv - Full context for mapping
  • output/master_data_center_state_energy_context.csv - State-level energy statistics

Data Quality Notes

  1. Incomplete power ratings: Only 5.9% of data centers have power ratings (108/1,833)
  2. Operator fragmentation: String variations ("Meta" vs. "Meta, Inc.") inflate distinct-operator counts
  3. Approximate coordinates: 45 facilities use city-precision fallback coordinates, so their tract assignment is approximate
  4. RUCA join failures: 7 facilities failed the RUCA join (Puerto Rico / non-US)
  5. Coarse benefit proxy: Broadband subscribers only roughly approximate beneficiaries (actual cloud users are global)
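The operator fragmentation noted in item 2 can be reduced by normalizing names before counting distinct operators. The sketch below strips common corporate suffixes and folds case; the suffix list and function name are assumptions for illustration, and true aliases (for example a parent company and its brand) would still need a hand-maintained mapping.

```python
import re

# Trailing corporate suffixes to strip; an illustrative list, not exhaustive.
_SUFFIX = re.compile(r"[,\s]+(inc|llc|corp|corporation|co|ltd)\.?$", re.IGNORECASE)


def normalize_operator(name):
    """Collapse whitespace, strip trailing corporate suffixes, and fold case."""
    cleaned = " ".join(name.split())
    previous = None
    while previous != cleaned:  # repeat in case suffixes are stacked
        previous = cleaned
        cleaned = _SUFFIX.sub("", cleaned).strip()
    return cleaned.casefold()


raw = ["Meta", "Meta, Inc.", "meta inc"]
print({normalize_operator(name) for name in raw})
```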

Research Ideas & Future Work

See research-ideas.md for detailed next steps and potential research directions.

Project Status

This is a mature, publication-ready geospatial analysis infrastructure combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.

The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with rigorous spatial statistics (Gini coefficients, the Herfindahl-Hirschman Index (HHI), Mann-Whitney tests).
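For readers unfamiliar with the two concentration measures named above, here is a minimal sketch using their standard textbook definitions. Whether the analysis scripts use exactly these formulas (for example, HHI on a 0 to 1 scale rather than 0 to 10,000) is an assumption.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 = perfectly even, (n-1)/n = one unit holds all."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def hhi(values):
    """Herfindahl-Hirschman Index: sum of squared shares, on a 0 to 1 scale."""
    total = sum(values)
    return sum((v / total) ** 2 for v in values)


even = [1, 1, 1, 1]          # facilities spread evenly over four units
concentrated = [0, 0, 0, 4]  # all facilities in one unit
print(gini(even), gini(concentrated))
print(hhi(even), hhi(concentrated))
```

Both measures rise as facilities pile into fewer units; Gini is sensitive to the whole distribution, while HHI is dominated by the largest shares.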

License

Research data compiled from public sources. Please cite appropriately if used in publications.

Contact

For questions about this research project, please contact the repository owner.
