dadams 46c8c58545 Enhance documentation with detailed findings from analysis report
- Add clustered vs isolated facility comparison to README
- Expand infrastructure insights with hyperscaler energy strategies
- Document additional database tables (opposition cases, IM3 projections, utility rates)
- Enhance research ideas with specific watershed names and grid saturation data
- Add data quality notes about EIA longitude corrections
- Reference loaded but unused tables for future analysis
2026-05-27 11:36:50 -07:00

# US Data Centers - Geospatial Research Infrastructure
A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.
## Documentation
- **[Database Tables](database-tables.md)** - Complete database schema with table descriptions, column definitions, and SQL examples
- **[Research Ideas](research-ideas.md)** - Future research directions, data improvements, and potential collaborations
## Project Overview
This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:
- **Spatial concentration patterns**: Where are data centers located and why?
- **Infrastructure dependencies**: How do data centers relate to submarine cables, power grids, and watersheds?
- **Equity and impact**: Do data center host communities bear localized burdens while benefits are nationally dispersed?
- **Demographics**: Who lives in data center-hosting census tracts?
- **Environmental exposure**: What are the water, energy, and natural hazard exposures?
## Key Research Question
**Do data centers represent "concentrated costs / dispersed benefits"?** The hypothesis is that host communities bear localized infrastructure burdens (power, water, land use) while the benefits of cloud computing are dispersed nationally.
## Major Findings
### Spatial Concentration
- **State level**: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
  - Virginia alone: 20.6% (378 of 1,833 facilities)
- **Tract level**: Top 1% of data center-hosting census tracts hold 14.6% of all facilities
  - Only 0.86% of residents of data center-hosting states live in a hosting tract
  - Per-capita burden is **115× higher** in host tracts vs. state average
- **Watershed level**: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
  - Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
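
Figures such as "half of all facilities in 15 of 2,139 watersheds" come down to a cumulative-share calculation. The following is a minimal sketch, assuming each facility carries a group key such as a HUC8 code or a state abbreviation. The function name and the toy data are illustrative and are not taken from the repository's scripts.

```python
from collections import Counter


def groups_needed_for_share(group_keys, share=0.5):
    """Return the smallest number of groups whose facilities reach `share` of the total.

    `group_keys` holds one key per facility (e.g. a HUC8 code or a state abbreviation).
    """
    # Largest groups first, so the running total reaches the target as fast as possible.
    counts = sorted(Counter(group_keys).values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0
    running = 0
    for n_groups, count in enumerate(counts, start=1):
        running += count
        if running >= share * total:
            return n_groups
    return len(counts)


# Toy data (hypothetical): 10 facilities spread over 4 watersheds.
toy = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
print(groups_needed_for_share(toy, share=0.5))  # 1: watershed A alone holds 5 of 10
print(groups_needed_for_share(toy, share=0.8))  # 2: A and B together hold 8 of 10
```

The same function applies to states or census tracts by swapping the key.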
### Demographics of Host Communities
Compared to the US average, data center host communities are:
- **Wealthier**: Median household income $103,623 (vs. $78,538, +32%)
- **More educated**: 49% bachelor's+ (vs. 35%, +14 pp)
- **More diverse**: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
- **Better connected**: 94.9% broadband (vs. 89%)
### Infrastructure Insights
- **89% of data centers are in metropolitan tracts** (vs. 80% of all US tracts) - only 1.11× over-index
- **Non-metro data centers (11%)** are dominated by hyperscalers:
  - AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities
  - 66% are in Oregon + Washington (Columbia River hydro corridor)
- **Grid saturation**: 4 states have >2/3 of generation within 50 km of a data center:
  - New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%
- **Hyperscaler energy strategies** (non-metro sites):
  - AWS: 114 GW wind + 66 GW hydro
  - Microsoft: 13 GW nuclear (Palo Verde co-location)
  - Meta: 16 GW solar
### Clustered vs. Isolated Facilities
Facilities in DBSCAN clusters differ significantly from isolated sites:
- **$35K income gap**: Clustered sites in tracts with median income $108K vs. $73K for isolated
- **+18 pp education**: 51% bachelor's+ vs. 33%
- **More diverse**: 25 pp less non-Hispanic white
- **2× energy infrastructure**: 89 vs. 40 generators within 50 km
### Submarine Cables
- **Data centers are NOT systematically closer to cables** than ordinary US cities
- Only 21.4% of data centers are within 100 km of a submarine cable landing point
- Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables
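
The 100 km figure above is a nearest-distance test. This README does not say how the distance is computed (PostGIS is the likely tool), so the sketch below is a plain-Python stand-in using the haversine formula. The function names and the approximate coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within(facilities, landing_points, radius_km=100.0):
    """Fraction of facilities with at least one landing point within `radius_km`."""
    if not facilities:
        return 0.0
    near = sum(
        1
        for flat, flon in facilities
        if any(haversine_km(flat, flon, plat, plon) <= radius_km for plat, plon in landing_points)
    )
    return near / len(facilities)


# Approximate, illustrative coordinates: one inland site and one coastal site.
facilities = [(39.04, -77.49), (36.85, -75.98)]
landing_points = [(36.85, -75.98)]
print(share_within(facilities, landing_points))  # 0.5: only the coastal site is within 100 km
```

A brute-force loop like this is fine for a few thousand points. For larger inputs a spatial index, or a distance query in the database, would be the more natural choice.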
## Data Sources
### Primary Data Center Inventories
- **Curated Sample**: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
- **OpenStreetMap**: 1,549 OSM features tagged as data centers (via Overpass API)
- **IM3 Model Data**: PNNL's Integrated Multisector Multiscale Modeling existing facilities
- **Master Table**: 1,833 deduplicated facilities merging all sources
### Geospatial Context Layers
- **US Census**: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
- **USDA RUCA 2020**: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
- **USGS HUC8 Watersheds**: 2,139 subbasin polygons for water-stress analysis
- **FEMA NRI**: National Risk Index with 18 natural hazard risk scores by census tract
### Infrastructure Layers
- **Submarine Cables**: 693 cables, 3,361 landing points (TeleGeography-style)
- **EIA Energy Data**: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
- **FCC Broadband Data**: Provider availability by location/block
### Additional Data
- **RDH Precinct Vote Data**: Election results for political-economy analysis
- **NOAA HMS Smoke Data**: Wildfire smoke exposure (2005-2025)
- **USDM Drought Data**: Drought severity
- **Utility Rate Tracker**: State-level electricity rate increases
## Repository Structure
### Core Python Scripts
**Data Ingestion**
- `load_postgis_data_centers.py` - Load curated data center CSV into PostGIS
- `load_postgis_osm_data_centers.py` - Fetch OSM data centers via Overpass API
- `build_master_data_centers.py` - Deduplicate & merge curated + OSM sources
- `load_postgis_internet_cables.py` - Load submarine cables and landing points
- `ingest_eia_energy_layers.py` - Ingest EIA energy data via API
- `build_watershed_huc8_tables.py` - Load USGS HUC8 watersheds
**Enrichment**
- `create_data_center_census_tract_table.py` - Join data centers to Census tracts with ACS demographics
- `build_fcc_bdc_broadband_connection_table.py` - Build per-facility broadband provider table
- `build_fcc_bdc_location_provider_aggregates.py` - Aggregate FCC BDC data by county/tract
**Analysis**
- `analyze_dc_tract_concentration.py` - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
- `analyze_cables_concentration.py` - Test if data centers cluster near submarine cables
- `make_data_center_map.py` - Generate Leaflet map of data centers
- `make_internet_cables_map.py` - Generate Leaflet map of data centers + cables
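
For readers unfamiliar with the concentration metrics named above, here is a minimal sketch of a Gini coefficient and a Herfindahl-Hirschman Index (HHI) over per-tract facility counts. It illustrates the standard formulas only. It is not the code in `analyze_dc_tract_concentration.py`, and the function names are assumptions.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 = perfectly even, near 1 = concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def hhi(values):
    """Herfindahl-Hirschman Index on a 0-1 scale: sum of squared shares."""
    total = sum(values)
    if total == 0:
        return 0.0
    return sum((v / total) ** 2 for v in values)


even = [1, 1, 1, 1]        # four tracts, one facility each
skewed = [0, 0, 0, 4]      # all four facilities in a single tract
print(gini(even), hhi(even))      # 0.0 0.25
print(gini(skewed), hhi(skewed))  # 0.75 1.0
```

HHI is sometimes reported on a 0-10,000 scale (percentage shares squared). Multiply the value above by 10,000 to convert.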
### Key Jupyter Notebooks
- `spatial_clustering_master_data_centers.ipynb` - DBSCAN clustering of data centers
- `cluster_analysis.ipynb` - Main demographic/energy/watershed analysis
- `fema_nri_data_centers.ipynb` - Join data centers to FEMA hazard scores
- `rdh_precinct_vote_data_centers.ipynb` - Join data centers to election data
- `usdm_drought_data_centers.ipynb` - Drought exposure analysis
- `hms_smoke_data_centers.ipynb` - Wildfire smoke exposure
- `enhanced_data_center_cluster_map.ipynb` - Generate enhanced cluster visualization
### Output Files
- `output/data_center_demographic_ruca_energy_summary.md` - Flagship analysis report
- `cables_concentration_report.md` - Cable proximity + cost/benefit concentration analysis
- `data_center_map.html` - Basic data center locations (Leaflet)
- `data_centers_cables_map.html` - Data centers + submarine cables (Leaflet)
- `output/enhanced_master_data_center_spatial_clusters_map.html` - Enhanced cluster visualization
## Technical Architecture
### Database
- **PostgreSQL 13+** with **PostGIS 3.x**
- Database name: `data_centers`
- See [database-tables.md](database-tables.md) for complete schema documentation
### Python Environment
- **Python 3.10+**
- Key libraries: `psycopg2`, `geopandas`, `shapely`, `scikit-learn`, `pandas`, `numpy`, `requests`, `folium`
### Data Formats
- CSV (raw data exports)
- GeoJSON (watershed/cluster geometries)
- Shapefiles (Census, USGS, FEMA inputs)
- HTML (interactive Leaflet maps)
### Configuration
Credentials are stored in `~/.zsh_secrets` and loaded via environment variables:
- `PGWEB_*`: PostgreSQL connection
- `EIA_API_KEY`: EIA energy data
- `FCC_USERNAME`, `FCC_API_KEY`: FCC broadband data
- `RDH_USERNAME`, `RDH_PASSWORD`: Redistricting Data Hub
- `CENSUS_API_KEY`: Census ACS API
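
Before a rebuild it can help to confirm that the variables listed above are set. The following is a small sketch of such a check. The helper name is hypothetical, and because the individual `PGWEB_*` variable names are not listed here, the code only reports which variables with that prefix exist rather than assuming specific names.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = [
    "EIA_API_KEY",
    "FCC_USERNAME",
    "FCC_API_KEY",
    "RDH_USERNAME",
    "RDH_PASSWORD",
    "CENSUS_API_KEY",
]
PG_PREFIX = "PGWEB_"  # the individual PostgreSQL variable names are not assumed


def check_credentials(env=None):
    """Return (missing required variable names, sorted names of PGWEB_* variables found)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    pg_vars = sorted(key for key in env if key.startswith(PG_PREFIX))
    return missing, pg_vars


missing, pg_vars = check_credentials()
print("Missing variables:", ", ".join(missing) if missing else "none")
print("PostgreSQL variables found:", ", ".join(pg_vars) if pg_vars else "none")
```

The check reports names only, never values, so its output is safe to paste into logs.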
## Quick Start
### Basic Rebuild Sequence
```bash
# 1. Load base data center data
python3 load_postgis_data_centers.py
python3 load_postgis_osm_data_centers.py
python3 build_master_data_centers.py
# 2. Enrich with context layers
python3 create_data_center_census_tract_table.py --replace-final
python3 load_postgis_internet_cables.py
python3 ingest_eia_energy_layers.py --category power
python3 build_watershed_huc8_tables.py
# 3. Run analyses
python3 analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 analyze_cables_concentration.py > output/cables_analysis.txt
# 4. Open the main analysis notebook (run its cells interactively)
jupyter notebook cluster_analysis.ipynb
```
### Generate Maps
```bash
python3 make_data_center_map.py
python3 make_internet_cables_map.py
```
## Key Outputs
### Research Reports
- **Demographic, Energy & Watershed Analysis**: `output/data_center_demographic_ruca_energy_summary.md`
- **Submarine Cable Proximity**: `cables_concentration_report.md`
### Interactive Maps
- Data center locations with cluster assignments
- Submarine cable routes and landing points
- Energy infrastructure proximity
- Watershed boundaries with data center counts
### Data Exports
- `master_data_center_spatial_cluster_points.csv` - Data center points with cluster IDs
- `master_data_center_spatial_cluster_summary.csv` - Cluster-level statistics
- `output/master_data_center_huc8_watersheds.geojson` - Watershed polygons
- `output/master_data_center_map_context.csv` - Full context for mapping
- `output/master_data_center_state_energy_context.csv` - State-level energy statistics
## Data Quality Notes
1. **Incomplete power ratings**: Only 5.9% of data centers have power ratings (108/1,833)
2. **Operator fragmentation**: String variations ("Meta" vs. "Meta, Inc.", AWS variants) inflate distinct-operator counts
3. **City-precision coordinates**: 45 facilities use city-level fallback coordinates, so their tract assignment is approximate
4. **RUCA join failures**: 7 facilities failed the RUCA join (Puerto Rico / non-US)
5. **Broadband subscribers** are a coarse benefit proxy (actual cloud users are global)
6. **EIA longitude correction**: 2008-2010 generator coordinates had sign errors, corrected in flat-table build
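
The flat-table build that applies the longitude correction is not shown here, so the following is only a sketch of what a sign correction could look like. It assumes the error is a positive longitude on a site that belongs in the western hemisphere. The bounding window and the function name are illustrative assumptions, not values from the repository.

```python
# Illustrative window: latitudes covering Puerto Rico through northern Alaska, and
# longitude magnitudes from Puerto Rico (about 66) to western mainland Alaska (about 170).
LAT_MIN, LAT_MAX = 17.0, 72.0
LON_ABS_MIN, LON_ABS_MAX = 64.0, 170.0


def fix_longitude_sign(lon, lat):
    """Flip a positive longitude to negative when the point plausibly belongs in the US.

    Values outside the window are returned unchanged, so legitimately eastern-hemisphere
    sites such as Guam are left alone. Aleutian sites near the 180th meridian fall
    outside this simple window and would need separate handling.
    """
    if lon is None or lat is None:
        return lon
    if LAT_MIN <= lat <= LAT_MAX and LON_ABS_MIN <= lon <= LON_ABS_MAX:
        return -lon
    return lon


print(fix_longitude_sign(77.49, 39.04))   # -77.49 (sign error corrected)
print(fix_longitude_sign(-77.49, 39.04))  # -77.49 (already correct, unchanged)
print(fix_longitude_sign(144.8, 13.4))    # 144.8 (Guam-like point, unchanged)
```

Restricting the flip to a window rather than negating every positive value avoids moving sites that are correctly recorded in the eastern hemisphere.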
## Known Limitations
- **Power capacity**: Only 5.9% populated - nearby EIA generator capacity used as proxy
- **Operator strings**: Need deduplication; in addition, 50 of 190 non-metro facilities have a null operator
- **Benefit measurement**: Broadband subscribers are an imperfect proxy for cloud computing benefits
- **Universe**: Limited to 46 data center-hosting states (states without data centers are excluded from the ACS comparison)
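
The operator-string problem noted above ("Meta" vs. "Meta, Inc.", AWS variants) can be approached with a normalization pass before counting operators. The following is a minimal sketch. The suffix pattern and the alias table are hypothetical, and a real alias table would come from reviewing the distinct operator strings in the master table.

```python
import re

# Trailing corporate suffixes to strip, e.g. ", Inc." or " LLC".
_SUFFIX = re.compile(r"[,\s]+(inc|llc|corp|corporation|co|ltd)\.?$", re.IGNORECASE)

# Hypothetical alias table, keyed by the lower-cased, suffix-stripped name.
ALIASES = {
    "amazon web services": "AWS",
    "aws": "AWS",
    "meta platforms": "Meta",
    "meta": "Meta",
}


def normalize_operator(raw):
    """Return a canonical operator name, or None for missing or blank input."""
    if raw is None:
        return None
    name = " ".join(raw.split())  # trim and collapse internal whitespace
    if not name:
        return None
    previous = None
    while previous != name:  # strip stacked suffixes such as "Co., Ltd."
        previous = name
        name = _SUFFIX.sub("", name).strip()
    return ALIASES.get(name.lower(), name)


for raw in ["Meta", "Meta, Inc.", "Meta Platforms, Inc.", "Amazon Web Services", "Equinix"]:
    print(repr(raw), "->", normalize_operator(raw))
```

Names without an alias entry pass through unchanged apart from suffix stripping, so the table can grow incrementally without affecting operators that are already clean.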
## Research Ideas & Future Work
See [research-ideas.md](research-ideas.md) for detailed next steps and potential research directions.
## Project Status
This is a **mature, publication-ready geospatial analysis infrastructure** combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.
The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with spatial statistics: Gini coefficients, Herfindahl-Hirschman indices (HHI), and Mann-Whitney tests.
## License
Research data compiled from public sources. Please cite appropriately if used in publications.
## Contact
For questions about this research project, please contact the repository owner.