Add comprehensive documentation: README, database tables, and research ideas

2026-05-27 11:28:14 -07:00
parent 98f6e6e237
commit 423e11083d
3 changed files with 1271 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,213 @@
# US Data Centers - Geospatial Research Infrastructure
A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.
## Project Overview
This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:
- **Spatial concentration patterns**: Where are data centers located and why?
- **Infrastructure dependencies**: How do data centers relate to submarine cables, power grids, and watersheds?
- **Equity and impact**: Do data center host communities bear localized burdens while benefits are nationally dispersed?
- **Demographics**: Who lives in data center-hosting census tracts?
- **Environmental exposure**: What are the water, energy, and natural hazard exposures?
## Key Research Question
**Do data centers represent "concentrated costs / dispersed benefits"?** The hypothesis is that host communities bear localized infrastructure burdens (power, water, land use) while the benefits of cloud computing are dispersed nationally.
## Major Findings
### Spatial Concentration
- **State level**: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
  - Virginia alone: 20.6% (378 of 1,833 facilities)
- **Tract level**: Top 1% of data center-hosting census tracts hold 14.6% of all facilities
  - Only 0.86% of data center-state residents live in a hosting tract
  - Per-capita burden is **115× higher** in host tracts vs. state average
- **Watershed level**: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
  - Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
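
The shares above are top-k shares of a count distribution. The sketch below shows one way to compute such a share, assuming a plain mapping from unit (state, tract, or watershed) to facility count. The function name `top_k_share` and the sample counts are hypothetical and are not taken from the repository's scripts; only the Virginia figure (378 of 1,833) comes from this README.

```python
def top_k_share(counts, k):
    """Return the fraction of all facilities held by the k largest units."""
    values = sorted(counts.values(), reverse=True)
    total = sum(values)
    if total == 0:
        raise ValueError("counts must contain at least one facility")
    return sum(values[:k]) / total


# Hypothetical counts for four units, used only to illustrate the call.
sample_counts = {"A": 50, "B": 30, "C": 15, "D": 5}
top_two = top_k_share(sample_counts, 2)  # share held by units A and B

# The Virginia share quoted above, recomputed from the README's own numbers.
virginia_pct = round(378 / 1833 * 100, 1)
```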
### Demographics of Host Communities
Compared to the US average, data center host communities are:
- **Wealthier**: Median household income $103,623 (vs. $78,538, +32%)
- **More educated**: 49% bachelor's+ (vs. 35%, +14 pp)
- **More diverse**: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
- **Better connected**: 94.9% broadband (vs. 89%)
### Infrastructure Insights
- **89% of data centers are in metropolitan tracts** (vs. 80% of all US tracts)
- **Non-metro data centers (11%)** are dominated by hyperscalers:
  - AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities
  - 66% are in Oregon + Washington (Columbia River hydro corridor)
- **Energy infrastructure**: 4 states have >2/3 of generation within 50 km of a data center:
  - New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%
### Submarine Cables
- **Data centers are NOT systematically closer to cables** than ordinary US cities
- Only 21.4% of data centers are within 100 km of a submarine cable landing point
- Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables
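
A proximity share like the 100 km figure above can be sketched in plain Python. This is only an illustration, assuming great-circle (haversine) distance and brute-force search; the repository most likely computes distances in PostGIS. The function names and the coordinates are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def share_within(facilities, landing_points, threshold_km=100.0):
    """Fraction of facilities with at least one landing point within threshold_km."""
    if not facilities:
        raise ValueError("facilities must not be empty")
    near = sum(
        1
        for flat, flon in facilities
        if any(haversine_km(flat, flon, plat, plon) <= threshold_km
               for plat, plon in landing_points)
    )
    return near / len(facilities)


# Made-up coordinates: one facility far inland, one close to the landing point.
facilities = [(39.04, -77.49), (36.90, -76.00)]
landing_points = [(36.85, -75.98)]
share = share_within(facilities, landing_points)
```

The brute-force loop is quadratic, which is acceptable for a few thousand facilities and landing points; a spatial index would be the usual choice at larger scale.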
## Data Sources
### Primary Data Center Inventories
- **Curated Sample**: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
- **OpenStreetMap**: 1,549 OSM features tagged as data centers (via Overpass API)
- **IM3 Model Data**: PNNL's Integrated Multisector Multiscale Modeling existing facilities
- **Master Table**: 1,833 deduplicated facilities merging all sources
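
How the master table deduplicates overlapping sources is not described here. As a rough sketch of one common approach, the code below keeps a record only if no previously kept record lies within a distance threshold. The greedy strategy, the 150 m threshold, the record fields, and the function names are all assumptions for illustration, not the logic of `build_master_data_centers.py`.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres


def distance_m(a, b):
    """Haversine distance in metres between two (lat, lon) pairs."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlam = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def merge_sources(records, threshold_m=150.0):
    """Greedy dedup: records listed earlier win over later ones within threshold_m."""
    kept = []
    for rec in records:
        point = (rec["lat"], rec["lon"])
        if all(distance_m(point, (k["lat"], k["lon"])) > threshold_m for k in kept):
            kept.append(rec)
    return kept


# Hypothetical records; list order encodes source priority (curated first).
records = [
    {"source": "curated", "lat": 39.0000, "lon": -77.5},
    {"source": "osm", "lat": 39.0005, "lon": -77.5},  # about 56 m from the first
    {"source": "osm", "lat": 39.1000, "lon": -77.5},  # about 11 km away
]
master = merge_sources(records)
```

A purely spatial rule can merge distinct facilities on the same campus, so a real pipeline would likely also compare names or operators before dropping a record.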
### Geospatial Context Layers
- **US Census**: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
- **USDA RUCA 2020**: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
- **USGS HUC8 Watersheds**: 2,139 subbasin polygons for water-stress analysis
- **FEMA NRI**: National Risk Index with 18 natural hazard risk scores by census tract
### Infrastructure Layers
- **Submarine Cables**: 693 cables, 3,361 landing points (TeleGeography-style)
- **EIA Energy Data**: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
- **FCC Broadband Data**: Provider availability by location/block
### Additional Data
- **RDH Precinct Vote Data**: Election results for political-economy analysis
- **NOAA HMS Smoke Data**: Wildfire smoke exposure (2005-2025)
- **USDM Drought Data**: Drought severity
- **Utility Rate Tracker**: State-level electricity rate increases
## Repository Structure
### Core Python Scripts
**Data Ingestion**
- `load_postgis_data_centers.py` - Load curated data center CSV into PostGIS
- `load_postgis_osm_data_centers.py` - Fetch OSM data centers via Overpass API
- `build_master_data_centers.py` - Deduplicate & merge curated + OSM sources
- `load_postgis_internet_cables.py` - Load submarine cables and landing points
- `ingest_eia_energy_layers.py` - Ingest EIA energy data via API
- `build_watershed_huc8_tables.py` - Load USGS HUC8 watersheds
**Enrichment**
- `create_data_center_census_tract_table.py` - Join data centers to Census tracts with ACS demographics
- `build_fcc_bdc_broadband_connection_table.py` - Build per-facility broadband provider table
- `build_fcc_bdc_location_provider_aggregates.py` - Aggregate FCC BDC data by county/tract
**Analysis**
- `analyze_dc_tract_concentration.py` - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
- `analyze_cables_concentration.py` - Test if data centers cluster near submarine cables
- `make_data_center_map.py` - Generate Leaflet map of data centers
- `make_internet_cables_map.py` - Generate Leaflet map of data centers + cables
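
For reference, the two concentration measures named above can be written in a few lines. This is a minimal sketch using the standard textbook definitions of the Gini coefficient and the Herfindahl-Hirschman Index on a list of per-unit counts; it is not copied from `analyze_dc_tract_concentration.py`, and the function names are hypothetical.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("counts must be non-empty with a positive sum")
    # Rank-weighted sum over values sorted in ascending order (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared shares, between 1/n and 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must have a positive sum")
    return sum((c / total) ** 2 for c in counts)


# Hypothetical facility counts per tract.
even = [1, 1, 1, 1]
skewed = [0, 0, 0, 1]
results = {"even": (gini(even), hhi(even)), "skewed": (gini(skewed), hhi(skewed))}
```

With n units, the Gini coefficient of a fully concentrated distribution is (n - 1) / n rather than exactly 1, which matters when comparing groups with different numbers of tracts.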
### Key Jupyter Notebooks
- `spatial_clustering_master_data_centers.ipynb` - DBSCAN clustering of data centers
- `cluster_analysis.ipynb` - Main demographic/energy/watershed analysis
- `fema_nri_data_centers.ipynb` - Join data centers to FEMA hazard scores
- `rdh_precinct_vote_data_centers.ipynb` - Join data centers to election data
- `usdm_drought_data_centers.ipynb` - Drought exposure analysis
- `hms_smoke_data_centers.ipynb` - Wildfire smoke exposure
- `enhanced_data_center_cluster_map.ipynb` - Generate enhanced cluster visualization
### Output Files
- `output/data_center_demographic_ruca_energy_summary.md` - Flagship analysis report
- `cables_concentration_report.md` - Cable proximity + cost/benefit concentration analysis
- `data_center_map.html` - Basic data center locations (Leaflet)
- `data_centers_cables_map.html` - Data centers + submarine cables (Leaflet)
- `output/enhanced_master_data_center_spatial_clusters_map.html` - Enhanced cluster visualization
## Technical Architecture
### Database
- **PostgreSQL 13+** with **PostGIS 3.x**
- Database name: `data_centers`
- See [database-tables.md](database-tables.md) for complete schema documentation
### Python Environment
- **Python 3.10+**
- Key libraries: `psycopg2`, `geopandas`, `shapely`, `scikit-learn`, `pandas`, `numpy`, `requests`, `folium`
### Data Formats
- CSV (raw data exports)
- GeoJSON (watershed/cluster geometries)
- Shapefiles (Census, USGS, FEMA inputs)
- HTML (interactive Leaflet maps)
### Configuration
Credentials are stored in `~/.zsh_secrets` and loaded as environment variables:
- `PGWEB_*`: PostgreSQL connection
- `EIA_API_KEY`: EIA energy data
- `FCC_USERNAME`, `FCC_API_KEY`: FCC broadband data
- `RDH_USERNAME`, `RDH_PASSWORD`: Redistricting Data Hub
- `CENSUS_API_KEY`: Census ACS API
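
A script can fail fast when a variable is missing rather than partway through a load. The sketch below assumes every API key listed above is required and collects whatever `PGWEB_`-prefixed variables exist without guessing their suffixes. The function name `load_config` and the returned dictionary layout are hypothetical, not part of the repository.

```python
import os

# Variable names taken from the list above; treating all as required is an assumption.
REQUIRED_KEYS = (
    "EIA_API_KEY",
    "FCC_USERNAME",
    "FCC_API_KEY",
    "RDH_USERNAME",
    "RDH_PASSWORD",
    "CENSUS_API_KEY",
)
PG_PREFIX = "PGWEB_"


def load_config(environ=None):
    """Collect credentials from the environment and raise if any key is missing."""
    env = os.environ if environ is None else environ
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Keep every PGWEB_* variable, keyed by its lower-cased suffix.
    postgres = {
        name[len(PG_PREFIX):].lower(): value
        for name, value in env.items()
        if name.startswith(PG_PREFIX)
    }
    return {"postgres": postgres, "api": {key: env[key] for key in REQUIRED_KEYS}}
```

Calling `load_config()` with no argument reads `os.environ`; passing a dictionary makes the function easy to exercise without touching real credentials.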
## Quick Start
### Basic Rebuild Sequence
```bash
# 1. Load base data center data
python3 load_postgis_data_centers.py
python3 load_postgis_osm_data_centers.py
python3 build_master_data_centers.py
# 2. Enrich with context layers
python3 create_data_center_census_tract_table.py --replace-final
python3 load_postgis_internet_cables.py
python3 ingest_eia_energy_layers.py --category power
python3 build_watershed_huc8_tables.py
# 3. Run analyses
python3 analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 analyze_cables_concentration.py > output/cables_analysis.txt
# 4. Open the main analysis notebook (run its cells interactively)
jupyter notebook cluster_analysis.ipynb
```
### Generate Maps
```bash
python3 make_data_center_map.py
python3 make_internet_cables_map.py
```
## Key Outputs
### Research Reports
- **Demographic, Energy & Watershed Analysis**: `output/data_center_demographic_ruca_energy_summary.md`
- **Submarine Cable Proximity**: `cables_concentration_report.md`
### Interactive Maps
- Data center locations with cluster assignments
- Submarine cable routes and landing points
- Energy infrastructure proximity
- Watershed boundaries with data center counts
### Data Exports
- `master_data_center_spatial_cluster_points.csv` - Data center points with cluster IDs
- `master_data_center_spatial_cluster_summary.csv` - Cluster-level statistics
- `output/master_data_center_huc8_watersheds.geojson` - Watershed polygons
- `output/master_data_center_map_context.csv` - Full context for mapping
- `output/master_data_center_state_energy_context.csv` - State-level energy statistics
## Data Quality Notes
1. **Incomplete power ratings**: Only 5.9% of data centers have power ratings (108/1,833)
2. **Operator fragmentation**: String variations ("Meta" vs. "Meta, Inc.") inflate distinct-operator counts
3. **45 facilities** use city-precision fallback coordinates (approximate tract assignment)
4. **7 facilities** failed the RUCA join (Puerto Rico / non-US)
5. **Broadband subscribers** are a coarse benefit proxy (actual cloud users are global)
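
The operator fragmentation in note 2 can be reduced by normalizing names before counting. This is a minimal sketch, assuming that stripping common corporate suffixes and case-folding is enough to collapse variants such as "Meta" and "Meta, Inc."; the suffix list and the function name are hypothetical, and the repository does not currently describe such a step.

```python
import re

# Trailing corporate suffixes to strip; this list is illustrative, not exhaustive.
_SUFFIX = re.compile(r"[,\s]+(inc|llc|corp|corporation|ltd|co)\.?$", re.IGNORECASE)


def normalize_operator(name):
    """Return a case-folded operator name with trailing corporate suffixes removed."""
    cleaned = " ".join(name.split())  # collapse runs of whitespace
    previous = None
    while previous != cleaned:  # repeat to handle stacked suffixes like "Co., LLC"
        previous = cleaned
        cleaned = _SUFFIX.sub("", cleaned).rstrip(" ,.")
    return cleaned.casefold()


# Hypothetical raw operator strings.
raw = ["Meta", "Meta, Inc.", "META  Inc", "Acme Co., LLC"]
distinct = sorted({normalize_operator(name) for name in raw})
```

Suffix stripping does not unify different trade names for the same company (for example a parent and a subsidiary), so a hand-maintained alias table would still be needed for those cases.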
## Research Ideas & Future Work
See [research-ideas.md](research-ideas.md) for detailed next steps and potential research directions.
## Project Status
This is a **mature, publication-ready geospatial analysis infrastructure** combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.
The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with spatial concentration statistics (Gini coefficients, the Herfindahl-Hirschman Index (HHI), and Mann-Whitney tests).
## License
Research data compiled from public sources. Please cite appropriately if used in publications.
## Contact
For questions about this research project, please contact the repository owner.