# US Data Centers - Geospatial Research Infrastructure

A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.

## Project Overview

This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:

- **Spatial concentration patterns**: Where are data centers located and why?
- **Infrastructure dependencies**: How do data centers relate to submarine cables, power grids, and watersheds?
- **Equity and impact**: Do data center host communities bear localized burdens while benefits are nationally dispersed?
- **Demographics**: Who lives in data center-hosting census tracts?
- **Environmental exposure**: What are the water, energy, and natural hazard exposures?

## Key Research Question

**Do data centers represent "concentrated costs / dispersed benefits"?** Host communities bear localized infrastructure burdens (power, water, land use) while cloud computing benefits are nationally dispersed.

## Major Findings

### Spatial Concentration
- **State level**: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
  - Virginia alone: 20.6% (378 of 1,833 facilities)
- **Tract level**: Top 1% of data center-hosting census tracts hold 14.6% of all facilities
  - Only 0.86% of residents in data center-hosting states live in a hosting tract
  - Per-capita burden is **115× higher** in host tracts vs. state average
- **Watershed level**: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
  - Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
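
The concentration figures above rest on a few simple statistics. As a minimal sketch (not the project's actual implementation in `analyze_dc_tract_concentration.py`), the top-k share, the Herfindahl-Hirschman index, and the Gini coefficient can be computed from a list of per-unit facility counts. The function names and the toy distribution are hypothetical; only the Virginia figure (378 of 1,833) comes from this README.

```python
from typing import Sequence


def top_share(counts: Sequence[int], k: int) -> float:
    """Share of the total held by the k largest units."""
    total = sum(counts)
    return sum(sorted(counts, reverse=True)[:k]) / total


def hhi(counts: Sequence[int]) -> float:
    """Herfindahl-Hirschman index on a 0-1 scale (sum of squared shares)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def gini(counts: Sequence[int]) -> float:
    """Gini coefficient of a non-negative distribution (0 = perfectly even)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Figure from this README: Virginia holds 378 of 1,833 facilities.
virginia_share = f"{378 / 1833:.1%}"
print("Virginia share:", virginia_share)

# Hypothetical toy distribution of facilities across five units (NOT project data).
toy_counts = [40, 25, 20, 10, 5]
print("top-1 share:", top_share(toy_counts, 1))
print("HHI:", hhi(toy_counts))
print("Gini:", gini(toy_counts))
```

The same three functions apply unchanged to state, tract, or watershed counts; only the input list differs.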

### Demographics of Host Communities
Compared to the US average, data center host communities are:
- **Wealthier**: Median household income $103,623 (vs. $78,538, +32%)
- **More educated**: 49% bachelor's+ (vs. 35%, +14 pp)
- **More diverse**: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
- **Better connected**: 94.9% broadband (vs. 89%)

### Infrastructure Insights
- **89% of data centers are in metropolitan tracts** (vs. 80% of all US tracts)
- **Non-metro data centers (11%)** are dominated by hyperscalers:
  - AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities
  - 66% are in Oregon + Washington (Columbia River hydro corridor)
- **Energy infrastructure**: 4 states have >2/3 of generation within 50 km of a data center:
  - New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%

### Submarine Cables
- **Data centers are NOT systematically closer to cables** than ordinary US cities
- Only 21.4% of data centers are within 100 km of a submarine cable landing point
- Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables
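
The "within 100 km of a landing point" share can be illustrated with a great-circle distance check. This is a sketch under stated assumptions, not the method used by `analyze_cables_concentration.py`: it uses a plain haversine formula rather than PostGIS (where a geography-based `ST_DWithin` would be the more natural tool), and the coordinates below are approximate, illustrative values rather than project data.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within(facilities, landings, threshold_km: float = 100.0) -> float:
    """Fraction of facilities whose nearest landing point is within threshold_km."""
    near = 0
    for f_lat, f_lon in facilities:
        nearest = min(haversine_km(f_lat, f_lon, l_lat, l_lon) for l_lat, l_lon in landings)
        if nearest <= threshold_km:
            near += 1
    return near / len(facilities)


# Approximate, illustrative coordinates (assumptions, not project data).
ashburn = (39.04, -77.49)          # inland cluster in Loudoun County, VA
coastal_site = (36.80, -76.00)     # hypothetical facility near the coast
virginia_beach = (36.85, -75.98)   # approximate cable landing area

print(round(haversine_km(*ashburn, *virginia_beach)), "km from Ashburn to the landing")
print(share_within([ashburn, coastal_site], [virginia_beach]))
```

A brute-force nearest-point search like this is quadratic in the number of points; with 1,833 facilities and 3,361 landing points that is still only a few million distance evaluations.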

## Data Sources

### Primary Data Center Inventories
- **Curated Sample**: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
- **OpenStreetMap**: 1,549 OSM features tagged as data centers (via Overpass API)
- **IM3 Model Data**: Existing-facility records from PNNL's Integrated Multisector Multiscale Modeling (IM3) project
- **Master Table**: 1,833 deduplicated facilities merging all sources
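
To make the OpenStreetMap source concrete, here is a hedged sketch of an Overpass API request for US data center features. The tag filter (`telecom=data_center`), the public endpoint URL, and the helper names are assumptions for illustration; this README does not state which tags or endpoint `load_postgis_osm_data_centers.py` actually uses, and the project lists `requests` rather than the standard-library `urllib` used here.

```python
import json
import urllib.parse
import urllib.request

# Public Overpass endpoint (assumption; the project may use a different instance).
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_query(timeout_s: int = 180) -> str:
    """Overpass QL: nodes/ways/relations tagged telecom=data_center inside the US."""
    return (
        f"[out:json][timeout:{timeout_s}];\n"
        'area["ISO3166-1"="US"][admin_level=2]->.us;\n'
        'nwr["telecom"="data_center"](area.us);\n'
        "out center;"
    )


def fetch_elements(query: str) -> list:
    """POST the query and return the raw list of OSM elements (network call)."""
    payload = urllib.parse.urlencode({"data": query}).encode()
    with urllib.request.urlopen(OVERPASS_URL, data=payload, timeout=300) as resp:
        return json.load(resp)["elements"]


def to_point(element: dict):
    """Return (lat, lon): a node's own position, or the center of a way/relation."""
    if element["type"] == "node":
        return element["lat"], element["lon"]
    center = element.get("center")
    return (center["lat"], center["lon"]) if center else None


# Usage (commented out because it performs a live network request):
# points = [to_point(e) for e in fetch_elements(build_query())]
print(build_query())
```

Requesting `out center;` gives polygon features a single representative coordinate, which is convenient when the features are later merged with point-based inventories.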

### Geospatial Context Layers
- **US Census**: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
- **USDA RUCA 2020**: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
- **USGS HUC8 Watersheds**: 2,139 subbasin polygons for water-stress analysis
- **FEMA NRI**: National Risk Index with 18 natural hazard risk scores by census tract
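
As an illustration of how RUCA codes translate into the metro/non-metro split used in the findings, the sketch below groups primary RUCA codes in the standard USDA way (1-3 metropolitan, 4-6 micropolitan, 7-9 small town, 10 rural, anything else unclassified). Whether the project collapses the categories exactly like this is an assumption, and the function names and sample codes are hypothetical.

```python
def ruca_class(primary_code: int) -> str:
    """Map a primary RUCA code to a broad settlement class."""
    if 1 <= primary_code <= 3:
        return "metropolitan"
    if 4 <= primary_code <= 6:
        return "micropolitan"
    if 7 <= primary_code <= 9:
        return "small town"
    if primary_code == 10:
        return "rural"
    return "unclassified"  # e.g. 99 = not coded, or a failed join


def metro_share(primary_codes) -> float:
    """Fraction of classified records that fall in metropolitan tracts."""
    classes = [ruca_class(c) for c in primary_codes]
    classified = [c for c in classes if c != "unclassified"]
    return sum(c == "metropolitan" for c in classified) / len(classified)


# Hypothetical tract codes for five facilities (NOT project data).
sample_codes = [1, 1, 2, 4, 10, 99]
print([ruca_class(c) for c in sample_codes])
print(metro_share(sample_codes))
```

Excluding unclassified records from the denominator mirrors the idea that facilities without a RUCA match cannot be assigned to either side of the split.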

### Infrastructure Layers
- **Submarine Cables**: 693 cables, 3,361 landing points (TeleGeography-style)
- **EIA Energy Data**: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
- **FCC Broadband Data**: Provider availability by location/block

### Additional Data
- **RDH Precinct Vote Data**: Election results for political-economy analysis
- **NOAA HMS Smoke Data**: Wildfire smoke exposure (2005-2025)
- **USDM Drought Data**: Drought severity
- **Utility Rate Tracker**: State-level electricity rate increases

## Repository Structure

### Core Python Scripts

**Data Ingestion**
- `load_postgis_data_centers.py` - Load curated data center CSV into PostGIS
- `load_postgis_osm_data_centers.py` - Fetch OSM data centers via Overpass API
- `build_master_data_centers.py` - Deduplicate & merge curated + OSM sources
- `load_postgis_internet_cables.py` - Load submarine cables and landing points
- `ingest_eia_energy_layers.py` - Ingest EIA energy data via API
- `build_watershed_huc8_tables.py` - Load USGS HUC8 watersheds

**Enrichment**
- `create_data_center_census_tract_table.py` - Join data centers to Census tracts with ACS demographics
- `build_fcc_bdc_broadband_connection_table.py` - Build per-facility broadband provider table
- `build_fcc_bdc_location_provider_aggregates.py` - Aggregate FCC BDC data by county/tract

**Analysis**
- `analyze_dc_tract_concentration.py` - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
- `analyze_cables_concentration.py` - Test if data centers cluster near submarine cables
- `make_data_center_map.py` - Generate Leaflet map of data centers
- `make_internet_cables_map.py` - Generate Leaflet map of data centers + cables

### Key Jupyter Notebooks
- `spatial_clustering_master_data_centers.ipynb` - DBSCAN clustering of data centers
- `cluster_analysis.ipynb` - Main demographic/energy/watershed analysis
- `fema_nri_data_centers.ipynb` - Join data centers to FEMA hazard scores
- `rdh_precinct_vote_data_centers.ipynb` - Join data centers to election data
- `usdm_drought_data_centers.ipynb` - Drought exposure analysis
- `hms_smoke_data_centers.ipynb` - Wildfire smoke exposure
- `enhanced_data_center_cluster_map.ipynb` - Generate enhanced cluster visualization

### Output Files
- `output/data_center_demographic_ruca_energy_summary.md` - Flagship analysis report
- `cables_concentration_report.md` - Cable proximity + cost/benefit concentration analysis
- `data_center_map.html` - Basic data center locations (Leaflet)
- `data_centers_cables_map.html` - Data centers + submarine cables (Leaflet)
- `output/enhanced_master_data_center_spatial_clusters_map.html` - Enhanced cluster visualization

## Technical Architecture

### Database
- **PostgreSQL 13+** with **PostGIS 3.x**
- Database name: `data_centers`
- See [database-tables.md](database-tables.md) for complete schema documentation

### Python Environment
- **Python 3.10+**
- Key libraries: `psycopg2`, `geopandas`, `shapely`, `scikit-learn`, `pandas`, `numpy`, `requests`, `folium`

### Data Formats
- CSV (raw data exports)
- GeoJSON (watershed/cluster geometries)
- Shapefiles (Census, USGS, FEMA inputs)
- HTML (interactive Leaflet maps)

### Configuration
Credentials stored in `~/.zsh_secrets`, loaded via environment variables:
- `PGWEB_*`: PostgreSQL connection
- `EIA_API_KEY`: EIA energy data
- `FCC_USERNAME`, `FCC_API_KEY`: FCC broadband data
- `RDH_USERNAME`, `RDH_PASSWORD`: Redistricting Data Hub
- `CENSUS_API_KEY`: Census ACS API
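
Before running the ingestion scripts it can help to confirm that the variables listed above are actually exported. The following preflight check is a hypothetical helper, not part of the repository. Because the individual `PGWEB_*` variable names are not spelled out here, the sketch only reports which variables with that prefix are present rather than assuming specific names.

```python
import os

# Variable names taken from the Configuration list above.
REQUIRED_VARS = [
    "EIA_API_KEY",
    "FCC_USERNAME",
    "FCC_API_KEY",
    "RDH_USERNAME",
    "RDH_PASSWORD",
    "CENSUS_API_KEY",
]
PG_PREFIX = "PGWEB_"  # individual PostgreSQL variable names are not documented here


def check_environment(env=None):
    """Return (missing required variables, names of PGWEB_* variables found)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    pg_vars = sorted(name for name in env if name.startswith(PG_PREFIX))
    return missing, pg_vars


missing, pg_vars = check_environment()
print("Missing variables:", ", ".join(missing) or "none")
print("PostgreSQL variables found:", ", ".join(pg_vars) or "none")
```

The check prints only variable names, never values, so its output is safe to paste into an issue or log.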

## Quick Start

### Basic Rebuild Sequence

```bash
# 1. Load base data center data
python3 load_postgis_data_centers.py
python3 load_postgis_osm_data_centers.py
python3 build_master_data_centers.py

# 2. Enrich with context layers
python3 create_data_center_census_tract_table.py --replace-final
python3 load_postgis_internet_cables.py
python3 ingest_eia_energy_layers.py --category power
python3 build_watershed_huc8_tables.py

# 3. Run analyses
python3 analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 analyze_cables_concentration.py > output/cables_analysis.txt

# 4. Open the main analysis notebook and run its cells
jupyter notebook cluster_analysis.ipynb
```

### Generate Maps

```bash
python3 make_data_center_map.py
python3 make_internet_cables_map.py
```

## Key Outputs

### Research Reports
- **Demographic, Energy & Watershed Analysis**: `output/data_center_demographic_ruca_energy_summary.md`
- **Submarine Cable Proximity**: `cables_concentration_report.md`

### Interactive Maps
- Data center locations with cluster assignments
- Submarine cable routes and landing points
- Energy infrastructure proximity
- Watershed boundaries with data center counts

### Data Exports
- `master_data_center_spatial_cluster_points.csv` - Data center points with cluster IDs
- `master_data_center_spatial_cluster_summary.csv` - Cluster-level statistics
- `output/master_data_center_huc8_watersheds.geojson` - Watershed polygons
- `output/master_data_center_map_context.csv` - Full context for mapping
- `output/master_data_center_state_energy_context.csv` - State-level energy statistics

## Data Quality Notes

1. **Incomplete power ratings**: Only 5.9% of data centers have power ratings (108/1,833)
2. **Operator fragmentation**: String variations ("Meta" vs. "Meta, Inc.") inflate distinct-operator counts
3. **45 facilities** use city-precision fallback coordinates (approximate tract assignment)
4. **7 facilities** failed RUCA join (Puerto Rico / non-US)
5. **Broadband subscribers** are a coarse benefit proxy (actual cloud users are global)
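
The operator-fragmentation issue in note 2 can be reduced with light string normalization. The sketch below is one possible approach, not something the repository is known to implement: it collapses whitespace, strips common corporate suffixes, and case-folds the result. The suffix list and function name are assumptions, and true aliases (for example a parent company versus a brand name) would still need a manual lookup table.

```python
import re

# Common trailing corporate suffixes (illustrative, not exhaustive).
_SUFFIX = re.compile(r"[,\s]+(inc|llc|corp|corporation|co|ltd)\.?$", re.IGNORECASE)


def normalize_operator(name: str) -> str:
    """Collapse whitespace, drop trailing corporate suffixes, and case-fold."""
    cleaned = " ".join(name.split())
    while True:
        stripped = _SUFFIX.sub("", cleaned)
        if stripped == cleaned:  # nothing left to strip
            break
        cleaned = stripped
    return cleaned.casefold()


# Hypothetical raw operator strings (NOT project data).
raw_names = ["Meta", "Meta, Inc.", "meta  inc", "Example Hosting LLC"]
distinct_before = len(set(raw_names))
distinct_after = len({normalize_operator(n) for n in raw_names})
print(distinct_before, "distinct operators before,", distinct_after, "after")
```

Keeping the raw string alongside the normalized key makes it possible to audit every merge the normalization causes.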

## Research Ideas & Future Work

See [research-ideas.md](research-ideas.md) for detailed next steps and potential research directions.

## Project Status

This is a **mature, publication-ready geospatial analysis infrastructure** combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.

The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with rigorous spatial statistics (Gini coefficients, HHI indices, Mann-Whitney tests).

## License

Research data compiled from public sources. Please cite appropriately if used in publications.

## Contact

For questions about this research project, please contact the repository owner.