Add comprehensive documentation: README, database tables, and research ideas
213
README.md
Normal file
@@ -0,0 +1,213 @@
# US Data Centers - Geospatial Research Infrastructure

A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations.

## Project Overview

This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine:

- **Spatial concentration patterns**: Where are data centers located and why?
- **Infrastructure dependencies**: How do data centers relate to submarine cables, power grids, and watersheds?
- **Equity and impact**: Do data center host communities bear localized burdens while benefits are nationally dispersed?
- **Demographics**: Who lives in data center-hosting census tracts?
- **Environmental exposure**: What are the water, energy, and natural hazard exposures?

## Key Research Question

**Do data centers represent "concentrated costs / dispersed benefits"?** Host communities bear localized infrastructure burdens (power, water, land use) while cloud computing benefits are nationally dispersed.

## Major Findings

### Spatial Concentration
- **State level**: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers
  - Virginia alone: 20.6% (378 of 1,833 facilities)
- **Tract level**: Top 1% of data center-hosting census tracts hold 14.6% of all facilities
  - Only 0.86% of data center-state residents live in a hosting tract
  - Per-capita burden is **115× higher** in host tracts vs. state average
- **Watershed level**: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds
  - Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities
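
The shares above are concentration statistics over per-unit facility counts (states, tracts, or watersheds). As a minimal sketch of how such measures can be computed, assuming only a list of counts, the following uses made-up numbers rather than project data; the function names are illustrative and are not taken from `analyze_dc_tract_concentration.py`.

```python
from typing import Sequence


def top_k_share(counts: Sequence[int], k: int) -> float:
    """Share of all facilities held by the k largest units."""
    total = sum(counts)
    return sum(sorted(counts, reverse=True)[:k]) / total


def hhi(counts: Sequence[int]) -> float:
    """Herfindahl-Hirschman index on a 0-1 scale: the sum of squared shares."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def gini(counts: Sequence[int]) -> float:
    """Gini coefficient of the counts (0 means perfectly even)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    # Rank-weighted formula for sorted, non-negative values.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Made-up facility counts for five units (NOT project data).
toy_counts = [40, 30, 20, 5, 5]
print(top_k_share(toy_counts, 2))   # 0.7
print(round(hhi(toy_counts), 3))    # 0.295
print(round(gini(toy_counts), 3))   # 0.38

# The Virginia share quoted above follows directly from the reported counts.
print(round(378 / 1833 * 100, 1))   # 20.6
```

The same three functions apply unchanged at any geography, since they only see a list of counts.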

### Demographics of Host Communities
Compared to the US average, data center host communities are:
- **Wealthier**: Median household income $103,623 (vs. $78,538, +32%)
- **More educated**: 49% bachelor's+ (vs. 35%, +14 pp)
- **More diverse**: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%)
- **Better connected**: 94.9% broadband (vs. 89%)

### Infrastructure Insights
- **89% of data centers are in metropolitan tracts** (vs. 80% of all US tracts)
- **Non-metro data centers (11%)** are dominated by hyperscalers:
  - AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities
  - 66% are in Oregon + Washington (Columbia River hydro corridor)
- **Energy infrastructure**: 4 states have >2/3 of generation within 50 km of a data center:
  - New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68%

### Submarine Cables
- **Data centers are NOT systematically closer to cables** than ordinary US cities
- Only 21.4% of data centers are within 100 km of a submarine cable landing point
- Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables
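
The "within 100 km of a landing point" figure is a nearest-neighbour distance test. The sketch below shows one way to express it in plain Python with great-circle (haversine) distances. It is not the code in `analyze_cables_concentration.py`, and the coordinates are rough, illustrative values for an inland site, a coastal site, and a coastal landing point.

```python
import math
from typing import Sequence, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two lat/lon points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def share_within(facilities: Sequence[Point], landings: Sequence[Point],
                 radius_km: float = 100.0) -> float:
    """Fraction of facilities whose nearest landing point is within radius_km."""
    near = sum(
        1 for f in facilities
        if min(haversine_km(f, p) for p in landings) <= radius_km
    )
    return near / len(facilities)


# Illustrative coordinates only (approximate, not rows from the database).
facilities = [(39.04, -77.49),   # inland site in northern Virginia
              (36.85, -76.29)]   # coastal site in southeastern Virginia
landings = [(36.85, -75.98)]     # coastal landing point

print(share_within(facilities, landings))  # 0.5
```

Inside the database, the equivalent test would normally be a PostGIS `ST_DWithin` on `geography` values with a 100,000 m radius.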

## Data Sources

### Primary Data Center Inventories
- **Curated Sample**: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim
- **OpenStreetMap**: 1,549 OSM features tagged as data centers (via Overpass API)
- **IM3 Model Data**: Existing-facility records from PNNL's Integrated Multisector Multiscale Modeling (IM3) project
- **Master Table**: 1,833 deduplicated facilities merging all sources

### Geospatial Context Layers
- **US Census**: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts)
- **USDA RUCA 2020**: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification
- **USGS HUC8 Watersheds**: 2,139 subbasin polygons for water-stress analysis
- **FEMA NRI**: National Risk Index with 18 natural hazard risk scores by census tract

### Infrastructure Layers
- **Submarine Cables**: 693 cables, 3,361 landing points (TeleGeography-style)
- **EIA Energy Data**: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data
- **FCC Broadband Data**: Provider availability by location/block

### Additional Data
- **RDH Precinct Vote Data**: Election results for political-economy analysis
- **NOAA HMS Smoke Data**: Wildfire smoke exposure (2005-2025)
- **USDM Drought Data**: Drought severity
- **Utility Rate Tracker**: State-level electricity rate increases

## Repository Structure

### Core Python Scripts

**Data Ingestion**
- `load_postgis_data_centers.py` - Load curated data center CSV into PostGIS
- `load_postgis_osm_data_centers.py` - Fetch OSM data centers via Overpass API
- `build_master_data_centers.py` - Deduplicate & merge curated + OSM sources
- `load_postgis_internet_cables.py` - Load submarine cables and landing points
- `ingest_eia_energy_layers.py` - Ingest EIA energy data via API
- `build_watershed_huc8_tables.py` - Load USGS HUC8 watersheds

**Enrichment**
- `create_data_center_census_tract_table.py` - Join data centers to Census tracts with ACS demographics
- `build_fcc_bdc_broadband_connection_table.py` - Build per-facility broadband provider table
- `build_fcc_bdc_location_provider_aggregates.py` - Aggregate FCC BDC data by county/tract

**Analysis**
- `analyze_dc_tract_concentration.py` - Tract-level cost concentration analysis (Gini, HHI, demographic deltas)
- `analyze_cables_concentration.py` - Test if data centers cluster near submarine cables
- `make_data_center_map.py` - Generate Leaflet map of data centers
- `make_internet_cables_map.py` - Generate Leaflet map of data centers + cables

### Key Jupyter Notebooks
- `spatial_clustering_master_data_centers.ipynb` - DBSCAN clustering of data centers
- `cluster_analysis.ipynb` - Main demographic/energy/watershed analysis
- `fema_nri_data_centers.ipynb` - Join data centers to FEMA hazard scores
- `rdh_precinct_vote_data_centers.ipynb` - Join data centers to election data
- `usdm_drought_data_centers.ipynb` - Drought exposure analysis
- `hms_smoke_data_centers.ipynb` - Wildfire smoke exposure
- `enhanced_data_center_cluster_map.ipynb` - Generate enhanced cluster visualization

### Output Files
- `output/data_center_demographic_ruca_energy_summary.md` - Flagship analysis report
- `cables_concentration_report.md` - Cable proximity + cost/benefit concentration analysis
- `data_center_map.html` - Basic data center locations (Leaflet)
- `data_centers_cables_map.html` - Data centers + submarine cables (Leaflet)
- `output/enhanced_master_data_center_spatial_clusters_map.html` - Enhanced cluster visualization

## Technical Architecture

### Database
- **PostgreSQL 13+** with **PostGIS 3.x**
- Database name: `data_centers`
- See [database-tables.md](database-tables.md) for complete schema documentation

### Python Environment
- **Python 3.10+**
- Key libraries: `psycopg2`, `geopandas`, `shapely`, `scikit-learn`, `pandas`, `numpy`, `requests`, `folium`

### Data Formats
- CSV (raw data exports)
- GeoJSON (watershed/cluster geometries)
- Shapefiles (Census, USGS, FEMA inputs)
- HTML (interactive Leaflet maps)

### Configuration
Credentials stored in `~/.zsh_secrets`, loaded via environment variables:
- `PGWEB_*`: PostgreSQL connection
- `EIA_API_KEY`: EIA energy data
- `FCC_USERNAME`, `FCC_API_KEY`: FCC broadband data
- `RDH_USERNAME`, `RDH_PASSWORD`: Redistricting Data Hub
- `CENSUS_API_KEY`: Census ACS API
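
As a hedged illustration of how a script might consume the `PGWEB_*` variables, the helper below collects them into keyword arguments for a PostgreSQL connection. The individual variable names follow the list in `database-tables.md`; the function name `pg_conn_params` and the fallback port of 5432 are assumptions made for this sketch, not code from the repository.

```python
import os
from typing import Dict, Mapping

# Variables that must be present; PGWEB_PORT is optional in this sketch.
REQUIRED = ("PGWEB_HOST", "PGWEB_USER", "PGWEB_PASSWORD", "PGWEB_DATABASE")


def pg_conn_params(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build connection keyword arguments from PGWEB_* environment variables."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["PGWEB_HOST"],
        "port": env.get("PGWEB_PORT", "5432"),  # assumed default
        "user": env["PGWEB_USER"],
        "password": env["PGWEB_PASSWORD"],
        "dbname": env["PGWEB_DATABASE"],
    }


# Typical use with psycopg2 (not executed here):
#     import psycopg2
#     conn = psycopg2.connect(**pg_conn_params())
```

Failing early with the list of missing names tends to be easier to debug than a connection error raised deep inside an ingestion script.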

## Quick Start

### Basic Rebuild Sequence

```bash
# 1. Load base data center data
python3 load_postgis_data_centers.py
python3 load_postgis_osm_data_centers.py
python3 build_master_data_centers.py

# 2. Enrich with context layers
python3 create_data_center_census_tract_table.py --replace-final
python3 load_postgis_internet_cables.py
python3 ingest_eia_energy_layers.py --category power
python3 build_watershed_huc8_tables.py

# 3. Run analyses
python3 analyze_dc_tract_concentration.py > output/tract_analysis.txt
python3 analyze_cables_concentration.py > output/cables_analysis.txt

# 4. Execute notebooks
jupyter notebook cluster_analysis.ipynb
```

### Generate Maps

```bash
python3 make_data_center_map.py
python3 make_internet_cables_map.py
```

## Key Outputs

### Research Reports
- **Demographic, Energy & Watershed Analysis**: `output/data_center_demographic_ruca_energy_summary.md`
- **Submarine Cable Proximity**: `cables_concentration_report.md`

### Interactive Maps
- Data center locations with cluster assignments
- Submarine cable routes and landing points
- Energy infrastructure proximity
- Watershed boundaries with data center counts

### Data Exports
- `master_data_center_spatial_cluster_points.csv` - Data center points with cluster IDs
- `master_data_center_spatial_cluster_summary.csv` - Cluster-level statistics
- `output/master_data_center_huc8_watersheds.geojson` - Watershed polygons
- `output/master_data_center_map_context.csv` - Full context for mapping
- `output/master_data_center_state_energy_context.csv` - State-level energy statistics

## Data Quality Notes

1. **Incomplete power ratings**: Only 5.9% of data centers have power ratings (108/1,833)
2. **Operator fragmentation**: String variations ("Meta" vs. "Meta, Inc.") inflate distinct-operator counts
3. **45 facilities** use city-precision fallback coordinates (approximate tract assignment)
4. **7 facilities** failed RUCA join (Puerto Rico / non-US)
5. **Broadband subscribers** are a coarse benefit proxy (actual cloud users are global)
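
The operator fragmentation in note 2 can be reduced with a simple normalization pass before counting distinct operators. The sketch below is one possible approach, not something the repository is known to do: the suffix list and the function name `normalize_operator` are assumptions, and it will not merge genuinely different spellings such as "Meta" and "Meta Platforms".

```python
import re

# Trailing corporate suffixes to strip (an illustrative, incomplete list).
_SUFFIX = re.compile(r"[,\s]+(inc|llc|corp|corporation|co|ltd)\.?$", re.IGNORECASE)


def normalize_operator(name: str) -> str:
    """Collapse whitespace, drop trailing corporate suffixes, and casefold."""
    cleaned = " ".join(name.split())
    while True:
        stripped = _SUFFIX.sub("", cleaned)
        if stripped == cleaned:
            break
        cleaned = stripped
    return cleaned.casefold()


raw_operators = ["Meta", "Meta, Inc.", "meta  inc", "Equinix LLC"]
distinct = {normalize_operator(op) for op in raw_operators}
print(sorted(distinct))  # ['equinix', 'meta']
```

Keeping the raw string alongside the normalized key makes it possible to audit which variants were merged.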

## Research Ideas & Future Work

See [research-ideas.md](research-ideas.md) for detailed next steps and potential research directions.

## Project Status

This is a **mature, publication-ready geospatial analysis infrastructure** combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses.

The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with concentration measures and nonparametric tests (Gini coefficients, Herfindahl-Hirschman indices, Mann-Whitney U tests).

## License

Research data compiled from public sources. Please cite appropriately if used in publications.

## Contact

For questions about this research project, please contact the repository owner.
534
database-tables.md
Normal file
@@ -0,0 +1,534 @@
# Database Tables Documentation

## Database Configuration

**Database Name**: `data_centers`
**Type**: PostgreSQL with PostGIS extension
**Connection**: Environment variables from `~/.zsh_secrets`
  - `PGWEB_HOST`: Database host
  - `PGWEB_PORT`: Database port (typically 5432)
  - `PGWEB_USER`: Database user
  - `PGWEB_PASSWORD`: Database password
  - `PGWEB_DATABASE`: Database name (`data_centers`)

## Table Organization

Tables are organized into four categories:
1. **Core Data Center Tables** - Master inventories and source data
2. **Enrichment Tables** - Data centers joined with contextual data
3. **Base Layer Tables** - Geographic and demographic reference layers
4. **Infrastructure Tables** - Energy and connectivity infrastructure

---

## Core Data Center Tables

### `master_data_centers`
**Rows**: 1,833
**Purpose**: Canonical data center inventory - deduplicated merge of curated + OSM sources

**Key Columns**:
- `id` (INTEGER) - Unique identifier
- `name` (TEXT) - Facility name
- `address` (TEXT) - Street address
- `city` (TEXT) - City
- `state` (TEXT) - State code
- `latitude` (DOUBLE PRECISION) - Latitude
- `longitude` (DOUBLE PRECISION) - Longitude
- `geom` (GEOMETRY) - PostGIS point geometry (EPSG:4326)
- `operator` (TEXT) - Operator/owner
- `power_mw` (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
- `source` (TEXT) - Data source (`curated`, `osm`, or `both`)
- `osm_id` (TEXT) - OpenStreetMap ID if applicable
- `geocode_method` (TEXT) - Geocoding provenance

**Notes**:
- 108 of 1,833 facilities have power ratings
- 45 facilities use city-precision fallback coordinates
- Operator strings have fragmentation issues ("Meta" vs. "Meta, Inc.")

### `us_dc_sample_geocoded`
**Rows**: 1,489
**Purpose**: Original curated sample with geocoding provenance (superseded by `master_data_centers`)

**Key Columns**:
- `name`, `address`, `city`, `state`, `zip`
- `latitude`, `longitude`, `geom`
- `operator`, `power_mw`
- `census_lat`, `census_lon` - Census TIGER geocode results
- `nominatim_lat`, `nominatim_lon` - Nominatim fallback results
- `geocode_source` - Which geocoder was used

### `osm_data_centers`
**Rows**: 1,549
**Purpose**: Raw OpenStreetMap-derived facilities

**Key Columns**:
- `osm_id` (TEXT) - OSM element ID
- `osm_type` (TEXT) - `node`, `way`, or `relation`
- `name` (TEXT) - OSM name tag
- `latitude`, `longitude`, `geom`
- `tags` (JSONB) - All OSM tags as JSON
- `operator` (TEXT) - Extracted from OSM tags
- `city`, `state`, `country`

**Notes**: Fetched via Overpass API with query for `telecom=data_center` or `building=data_center`
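
For orientation, an Overpass QL query matching the two tags in the note could be assembled as below. This is a sketch under assumptions: the exact query used by `load_postgis_osm_data_centers.py` is not shown in this document, the country-area filter and timeout are illustrative choices, and `build_overpass_query` is a hypothetical helper name.

```python
# Public Overpass endpoint (assumed here; the script may use a different mirror).
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_overpass_query(country_iso: str = "US", timeout_s: int = 180) -> str:
    """Return Overpass QL selecting data-center nodes, ways and relations."""
    return f"""
[out:json][timeout:{timeout_s}];
area["ISO3166-1"="{country_iso}"][admin_level=2]->.a;
(
  nwr["telecom"="data_center"](area.a);
  nwr["building"="data_center"](area.a);
);
out center tags;
""".strip()


query = build_overpass_query()
print(query)

# Sending the query (not executed here):
#     import requests
#     resp = requests.post(OVERPASS_URL, data={"data": query})
#     elements = resp.json()["elements"]
```

`out center` asks Overpass to attach a single representative point to ways and relations, which is one way to obtain a `latitude`/`longitude` pair for non-node features.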

### `master_data_center_spatial_clusters`
**Rows**: 1,831
**Purpose**: DBSCAN cluster assignments for master data centers

**Key Columns**:
- All columns from `master_data_centers`
- `cluster_id` (INTEGER) - Cluster assignment (-1 = noise/singleton)
- `cluster_size` (INTEGER) - Number of facilities in cluster
- `cluster_label` (TEXT) - Human-readable cluster name

**Notes**: DBSCAN parameters: eps=15 km, min_samples=2
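
With `min_samples=2` (counting the point itself, as scikit-learn does), every facility that has at least one neighbour within `eps` is a core point, so DBSCAN reduces to finding connected components of the "within 15 km" graph, with isolated facilities labelled `-1`. The dependency-free sketch below illustrates that equivalence on made-up coordinates. It is not the notebook's code, which presumably calls scikit-learn's `DBSCAN`, and cluster numbering may differ from a library implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple

EARTH_RADIUS_KM = 6371.0088
Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two lat/lon points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cluster_min_samples_2(points: Sequence[Point], eps_km: float = 15.0) -> List[int]:
    """DBSCAN labels for the special case min_samples=2 (union-find on eps links)."""
    n = len(points)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of points that lie within eps of each other (O(n^2)).
    for i in range(n):
        for j in range(i + 1, n):
            if haversine_km(points[i], points[j]) <= eps_km:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    sizes: Dict[int, int] = {}
    for r in roots:
        sizes[r] = sizes.get(r, 0) + 1

    ids: Dict[int, int] = {}
    labels: List[int] = []
    for r in roots:
        if sizes[r] < 2:
            labels.append(-1)            # isolated point: noise
        else:
            labels.append(ids.setdefault(r, len(ids)))
    return labels


# Made-up coordinates: three nearby sites and one far away.
sites = [(39.04, -77.49), (39.02, -77.46), (39.00, -77.43), (45.60, -121.20)]
print(cluster_min_samples_2(sites))  # [0, 0, 0, -1]
```

The pairwise loop is quadratic, which is acceptable for a table of roughly 1,800 points; a spatial index would be the usual choice for much larger inputs.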

---

## Enrichment Tables

### `data_center_census_tracts_2024`
**Rows**: 1,815
**Purpose**: Per-facility demographics from containing Census tract

**Key Columns**:
- All columns from `master_data_centers`
- `geoid` (TEXT) - 11-digit Census tract GEOID
- `state_fips`, `county_fips`, `tract`
- **Population**: `total_population`, `population_density_sq_mi`
- **Age**: `median_age`, `under_18_pct`, `over_65_pct`
- **Race/Ethnicity**: `white_nh_pct`, `black_nh_pct`, `asian_nh_pct`, `hispanic_pct`
- **Economics**: `median_household_income`, `per_capita_income`, `poverty_rate`
- **Education**: `bachelors_or_higher_pct`, `high_school_or_higher_pct`
- **Housing**: `median_home_value`, `median_rent`, `homeownership_rate`
- **Broadband**: `broadband_pct` - Households with broadband subscription

**Source**: ACS 2024 5-year estimates

**Notes**:
- 18 of 1,833 facilities failed tract join (geocoding issues)
- Data from `_dc_census_tract_acs_2024` base table
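
An 11-digit tract GEOID is the concatenation of a 2-digit state FIPS code, a 3-digit county FIPS code, and a 6-digit tract code, which is why it is stored as TEXT: leading zeros are significant. A small sketch of the decomposition follows; the helper name is illustrative, the example tract code is made up, and whether the table's `county_fips` column holds the 3-digit or the 5-digit form is not stated in this document.

```python
from typing import NamedTuple


class TractGeoid(NamedTuple):
    state_fips: str   # 2 digits
    county_fips: str  # 3 digits
    tract: str        # 6 digits


def split_tract_geoid(geoid: str) -> TractGeoid:
    """Split an 11-digit census tract GEOID into its components."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected 11 digits, got {geoid!r}")
    return TractGeoid(geoid[:2], geoid[2:5], geoid[5:])


# State 06 (California) keeps its leading zero only because GEOID is text.
print(split_tract_geoid("06037206300"))
```

Casting GEOIDs to integers before a join silently breaks matches for states with FIPS codes below 10.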

### `data_center_watershed_huc8`
**Rows**: 1,833
**Purpose**: Per-facility watershed assignment

**Key Columns**:
- All columns from `master_data_centers`
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code
- `watershed_name` (TEXT) - Watershed name
- `watershed_area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - States intersecting watershed

**Source**: USGS Watershed Boundary Dataset

**Notes**: 257 unique HUC8 watersheds contain at least one data center

### `data_center_nri_exposure`
**Rows**: 1,833
**Purpose**: Per-facility FEMA National Risk Index hazard exposure scores

**Key Columns**:
- All columns from `master_data_centers`
- `nri_id` (TEXT) - Census tract GEOID (matches `geoid` from demographics)
- `risk_score` (DOUBLE PRECISION) - Overall NRI risk score
- `social_vulnerability` (DOUBLE PRECISION) - Social vulnerability index
- **Hazard-specific risk scores** (18 hazards):
  - `avalanche_risk`, `coastal_flooding_risk`, `cold_wave_risk`
  - `drought_risk`, `earthquake_risk`, `hail_risk`
  - `heat_wave_risk`, `hurricane_risk`, `ice_storm_risk`
  - `landslide_risk`, `lightning_risk`, `riverine_flooding_risk`
  - `strong_wind_risk`, `tornado_risk`, `tsunami_risk`
  - `volcanic_activity_risk`, `wildfire_risk`, `winter_weather_risk`

**Source**: FEMA National Risk Index (December 2025 release)

### `data_center_rdh_precinct_vote_matches`
**Rows**: Varies
**Purpose**: Per-facility precinct-level election results

**Key Columns**:
- Data center identifiers
- `precinct_name`, `precinct_id`
- `election_year`, `office`
- `candidate`, `party`, `votes`
- `vote_share_pct`

**Source**: Redistricting Data Hub precinct shapefiles

**Notes**: Spatial join to voting precincts (point-in-polygon)

---

## Base Layer Tables

### `_dc_census_tract_acs_2024`
**Rows**: 85,382
**Purpose**: ACS 2024 demographics for all Census tracts in states with data centers

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract`
- **Full ACS 5-year estimates** (85+ columns):
  - Population by age, sex, race/ethnicity
  - Households, families, housing units
  - Income, poverty, education, employment
  - Housing values, rents, costs
  - Broadband, computer access
  - Commuting, vehicles

**Source**: Census ACS 2024 5-year estimates API

**Notes**: Universe limited to the 46 states with at least one data center (states with no data centers are excluded)

### `_dc_census_tract_boundaries_2024`
**Rows**: 85,058
**Purpose**: TIGER 2024 tract polygons for data center states

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract_code`
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_land_sq_m` (DOUBLE PRECISION) - Land area in square meters
- `area_water_sq_m` (DOUBLE PRECISION) - Water area in square meters

**Source**: Census TIGER/Line 2024

### `ruca_codes_2020_tract`
**Rows**: 85,528
**Purpose**: USDA Rural-Urban Commuting Area codes for metro/rural classification

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (matches Census tracts)
- `ruca_code` (TEXT) - Primary RUCA code (1-10)
- `ruca_category` (TEXT) - Simplified category:
  - `Metropolitan` (codes 1-3)
  - `Micropolitan` (codes 4-6)
  - `Small town` (codes 7-9)
  - `Rural` (code 10)
- `ruca_description` (TEXT) - Full RUCA code description
- `population_2020` (INTEGER)

**Source**: USDA Economic Research Service RUCA 2020

**Notes**:
- Based on 2020 Census tracts and 2010-2020 commuting patterns
- 7 data centers failed RUCA join (Puerto Rico / non-US)
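
The `ruca_category` grouping described above is a direct lookup on the primary code. A minimal sketch of that mapping is shown here; the function name and the `"Unknown"` fallback for missing or out-of-range codes are assumptions, not values documented for the table.

```python
def ruca_category(ruca_code: str) -> str:
    """Map a primary RUCA code (stored as text) to the simplified category."""
    try:
        primary = int(float(ruca_code))
    except (TypeError, ValueError):
        return "Unknown"
    if 1 <= primary <= 3:
        return "Metropolitan"
    if 4 <= primary <= 6:
        return "Micropolitan"
    if 7 <= primary <= 9:
        return "Small town"
    if primary == 10:
        return "Rural"
    return "Unknown"  # e.g. uncoded tracts or facilities that failed the join


for code in ["1", "5", "8", "10", "99", ""]:
    print(repr(code), "->", ruca_category(code))
```

Parsing through `float` first tolerates codes exported as `"4.0"`, which can happen when the column passes through a spreadsheet or a pandas float dtype.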

### `watershed_huc8`
**Rows**: 2,139
**Purpose**: USGS HUC8 subbasin polygons for water-stress analysis

**Key Columns**:
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
- `name` (TEXT) - Watershed name
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - Comma-separated state codes
- `dc_count` (INTEGER) - Number of data centers in watershed

**Source**: USGS Watershed Boundary Dataset

**Notes**:
- 257 of 2,139 watersheds contain at least one data center
- Top 15 watersheds contain 50% of all US data centers

### `nri_census_tracts`
**Rows**: ~84,000
**Purpose**: Full FEMA National Risk Index by Census tract

**Key Columns**:
- `nri_id` (TEXT) - Census tract GEOID
- `state_name`, `county_name`, `tract_name`
- **460+ columns** including:
  - Overall risk scores and ratings
  - Expected annual loss (dollars and building value %)
  - Social vulnerability components (15 factors)
  - Community resilience score
  - Individual hazard risk scores (18 hazards)
  - Exposure, annualized frequency, historic loss ratios per hazard

**Source**: FEMA National Risk Index v2.1 (December 2025)

**Notes**:
- Very wide table with comprehensive natural hazard risk data
- Join to data centers on `nri_id`, which holds the tract GEOID and matches `geoid` in the demographics tables
- See [FEMA NRI Technical Documentation](https://hazards.fema.gov/nri/)

---

## Infrastructure Tables

### Energy Infrastructure

#### `energy_eia_operating_generator_capacity_flat`
**Rows**: 4.7 million
**Purpose**: EIA generator inventory with lat/lon/MW (monthly 2008-2026)

**Key Columns**:
- `plant_id` (INTEGER) - EIA plant ID
- `generator_id` (TEXT) - Generator unit ID
- `plant_name` (TEXT)
- `latitude`, `longitude`, `geom`
- `state`, `county`
- `utility_name`, `operator_name`
- `nameplate_capacity_mw` (DOUBLE PRECISION)
- `technology` (TEXT) - Generation technology
- `energy_source_1`, `energy_source_2` - Primary fuel codes
- `operating_month`, `operating_year` - When unit became operational
- `status` (TEXT) - Operating, standby, retired, etc.
- `report_month`, `report_year` - Data snapshot date

**Source**: EIA Form 860 via API

**Notes**:
- "Flat" means denormalized for fast spatial queries
- Each generator-month is a row (4.7M rows from monthly snapshots)
- Use for proximity analysis (e.g., "all generators within 50 km of data center")

#### `energy_eia_facility_fuel_flat`
**Rows**: Varies
**Purpose**: Monthly generation by plant/fuel

**Key Columns**:
- `plant_id`, `plant_name`
- `report_month`, `report_year`
- `energy_source` (TEXT) - Fuel code
- `net_generation_mwh` (DOUBLE PRECISION)
- `fuel_consumed_mmbtu` (DOUBLE PRECISION)

**Source**: EIA Form 923 via API

#### `energy_eia_seds_flat`
**Rows**: 2.57 million
**Purpose**: Annual state energy consumption/production (1960-2024)

**Key Columns**:
- `state_code` (TEXT)
- `year` (INTEGER)
- `msn` (TEXT) - Mnemonic series names (e.g., `TETCB` = total energy consumption)
- `value` (DOUBLE PRECISION) - Energy in trillion BTU
- `unit` (TEXT)
- `description` (TEXT) - Human-readable MSN description

**Source**: EIA State Energy Data System (SEDS)

**Notes**:
- Annual aggregates by state
- Use for state-level energy context analysis

---

### Connectivity Infrastructure

#### `internet_cables`
**Rows**: 693
**Purpose**: Submarine cable routes

**Key Columns**:
- `cable_id` (TEXT) - Unique cable identifier
- `cable_name` (TEXT) - Official cable name
- `geom` (GEOMETRY) - LineString geometry (EPSG:4326)
- `rfs_year` (INTEGER) - Ready For Service year
- `length_km` (DOUBLE PRECISION)
- `owners` (TEXT[]) - Array of owner names
- `landing_points` (TEXT[]) - Array of landing point names

**Source**: TeleGeography-style cable database

**Notes**:
- 693 unique submarine cables
- Geometry is approximate route (not exact seabed path)

#### `internet_cable_landing_points`
**Rows**: 3,361
**Purpose**: Cable landing points (where cables come ashore)

**Key Columns**:
- `landing_point_id` (TEXT) - Unique identifier
- `name` (TEXT) - Landing point name
- `city`, `country`
- `latitude`, `longitude`, `geom`
- `cables` (TEXT[]) - Array of cable names landing at this point
- `cable_count` (INTEGER)

**Source**: TeleGeography-style cable database

**Notes**:
- Used for proximity analysis (how close are data centers to cable landings?)
- **Key finding**: Data centers are NOT systematically closer to cables than ordinary US cities

#### `internet_city_dominance`
**Rows**: 4,552
**Purpose**: City-level IPs/capacity (internet hub strength proxy)

**Key Columns**:
- `city` (TEXT)
- `country` (TEXT)
- `latitude`, `longitude`, `geom`
- `ip_addresses` (INTEGER) - Number of routable IP addresses
- `capacity_rank` (INTEGER) - Relative capacity ranking

**Source**: Internet topology datasets

**Notes**: Proxy for "internet hub" strength (not directly used in main analyses)

---

### Broadband

#### `fcc_bdc_location_provider_aggregates`
**Rows**: Varies
**Purpose**: FCC BDC provider availability aggregated by county/tract

**Key Columns**:
- `geoid` (TEXT) - County or tract GEOID
- `geography_level` (TEXT) - `county` or `tract`
- `provider_count` (INTEGER)
- `technology_counts` (JSONB) - Count by technology type
- `max_download_mbps`, `max_upload_mbps`

**Source**: FCC Broadband Data Collection (BDC)

#### `fcc_bdc_broadband_connection_table`
**Rows**: Varies
**Purpose**: Per-data-center broadband provider availability

**Key Columns**:
- Data center identifiers
- `provider_id`, `provider_name`
- `technology` (TEXT)
- `max_advertised_download_speed`, `max_advertised_upload_speed`
- `low_latency` (BOOLEAN)

**Source**: FCC BDC, joined to data center locations

**Notes**: Built by `build_fcc_bdc_broadband_connection_table.py`

---

## Commonly Used Joins

### Data Center to Demographics
```sql
SELECT
    dc.*,
    ct.median_household_income,
    ct.bachelors_or_higher_pct,
    ct.broadband_pct
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct
    ON dc.id = ct.id;
```

### Data Center to Watershed
```sql
SELECT
    dc.*,
    w.huc8,
    w.name AS watershed_name  -- watershed_huc8 stores the name in `name`
FROM master_data_centers dc
JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
JOIN watershed_huc8 w ON dw.huc8 = w.huc8;
```

### Data Center to Energy Infrastructure (50 km radius)
```sql
-- The generator table holds one row per generator-month, so restrict the join
-- to a single snapshot (here the latest) to avoid counting each unit repeatedly.
WITH latest AS (
    SELECT report_year, report_month
    FROM energy_eia_operating_generator_capacity_flat
    ORDER BY report_year DESC, report_month DESC
    LIMIT 1
)
SELECT
    dc.id,
    dc.name,
    SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
FROM master_data_centers dc
JOIN energy_eia_operating_generator_capacity_flat eg
    ON ST_DWithin(
        dc.geom::geography,
        eg.geom::geography,
        50000  -- 50 km in meters
    )
JOIN latest l
    ON eg.report_year = l.report_year
   AND eg.report_month = l.report_month
WHERE eg.status = 'OP'  -- Operating only
GROUP BY dc.id, dc.name;
```

### Data Center to FEMA Hazard Risk
```sql
SELECT
    dc.*,
    nri.risk_score,
    nri.wildfire_risk,
    nri.drought_risk,
    nri.heat_wave_risk
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;
```

---

## Table Naming Conventions

- **`master_*`** - Canonical, deduplicated tables (use these for analysis)
- **`data_center_*`** - Data center-specific enrichment tables
- **`_dc_*`** - Base layers scoped to data center states (underscore prefix = private/internal)
- **`energy_eia_*`** - EIA energy data
- **`internet_*`** - Connectivity infrastructure
- **`fcc_bdc_*`** - FCC Broadband Data Collection
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Indexes and Performance

All tables have spatial indexes on `geom` columns for fast spatial joins:
```sql
CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);
```

Key `geoid` columns are indexed for fast demographic joins:
```sql
CREATE INDEX idx_tablename_geoid ON tablename(geoid);
```

---

## Maintenance Notes

### Updating Data Centers
1. Run `load_postgis_osm_data_centers.py` to refresh OSM data
2. Run `build_master_data_centers.py` to rebuild master table
3. Run enrichment scripts to update joins

### Updating Demographics
1. Update `_dc_census_tract_acs_2024` from Census API
2. Run `create_data_center_census_tract_table.py --replace-final`

### Updating Energy Data
```bash
python3 ingest_eia_energy_layers.py --category power --update
```

---

## Schema Export

To export the full schema:
```bash
pg_dump -h $PGWEB_HOST -U $PGWEB_USER -d data_centers --schema-only > schema.sql
```

To list all tables:
```sql
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```

---

## Contact

For database access or questions, contact the repository owner.
524
research-ideas.md
Normal file
@@ -0,0 +1,524 @@
# Research Ideas & Future Work

This document outlines potential research directions, data improvements, and analyses that could extend the current US Data Centers geospatial research infrastructure.

## Priority Next Steps

### 1. Backfill Power Capacity Data
**Status**: Only 5.9% of facilities have `power_mw` values (108/1,833)

**Approach**:
- Scrape Baxtel data center database (requires paid subscription)
- Use Data Center Map API/scraping
- Cross-reference with utility interconnection queue filings
- File public-records (FOIA-style) requests with state utility commissions for large loads

**Research Impact**:
- Enable capacity-weighted concentration metrics (current analyses are facility-count only)
- Correlate power capacity with demographic/environmental variables
- Identify "hyperscale" facilities (>100 MW) vs. edge/enterprise (<10 MW)

**Implementation**:
```python
# Add capacity-weighted HHI calculation to analyze_dc_tract_concentration.py
# tract_capacities: total power_mw per hosting tract
total_mw = sum(tract_capacities)
capacity_weighted_hhi = sum((mw_i / total_mw) ** 2 for mw_i in tract_capacities)
```
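
The capacity-weighted metric is most informative when set beside the existing count-based one. The following is a minimal sketch, assuming facilities are available as `(tract_geoid, power_mw)` pairs; the function names and the sample values are hypothetical and are not taken from `analyze_dc_tract_concentration.py`. Facilities without a `power_mw` value contribute to the count-based index only.

```python
from collections import defaultdict

def hhi(weights):
    """Herfindahl-Hirschman Index of non-negative weights (result in 0..1)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    return sum((w / total) ** 2 for w in weights)

def tract_hhi(facilities):
    """Return (count_hhi, capacity_hhi) for (tract_geoid, power_mw) pairs.

    Facilities with power_mw=None count toward the facility-count HHI only.
    """
    counts = defaultdict(int)
    capacity = defaultdict(float)
    for geoid, power_mw in facilities:
        counts[geoid] += 1
        if power_mw is not None:
            capacity[geoid] += power_mw
    return hhi(list(counts.values())), hhi(list(capacity.values()))

# Hypothetical sample: tract "A" hosts one large site, tract "B" three small ones
sample = [("A", 120.0), ("B", 5.0), ("B", 5.0), ("B", None)]
count_hhi, capacity_hhi = tract_hhi(sample)
```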

---

### 2. Operator Name Deduplication
**Status**: String fragmentation inflates operator counts ("Meta" vs. "Meta, Inc.", AWS variants)

**Approach**:
- Create `operator_mapping` table with canonical names
- Use fuzzy matching (e.g., `fuzzywuzzy` library, now published as `thefuzz`) to standardize
- Add `operator_canonical` column to `master_data_centers`

**Research Impact**:
- Accurate hyperscaler market share analysis
- Study operator-specific siting strategies (AWS hydro, Microsoft nuclear, Meta solar)
- Enable "operator power" political economy analyses

**Script**:
```python
# operators_dedupe.py
import pandas as pd
from fuzzywuzzy import process  # renamed upstream to `thefuzz`

# Load unique operators
# `conn` must be an open database connection (e.g., SQLAlchemy engine or psycopg2 connection)
operators = pd.read_sql("SELECT DISTINCT operator FROM master_data_centers", conn)

# Manual + fuzzy matching to canonical names
canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
    # ... etc.
}
```
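
To illustrate how the manual map and fuzzy matching could combine, here is a small sketch that swaps in the standard library's `difflib` for `fuzzywuzzy` so it runs without extra dependencies. The `canonicalize` helper and the 0.85 cutoff are assumptions for illustration; a real run would need the cutoff tuned against the actual operator strings.

```python
import difflib

CANONICAL = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
}

# Reverse lookup: lower-cased alias -> canonical name
ALIAS_TO_CANONICAL = {
    alias.lower(): canonical
    for canonical, aliases in CANONICAL.items()
    for alias in aliases
}

def canonicalize(raw, cutoff=0.85):
    """Map a raw operator string to a canonical name, or None if no alias is close."""
    key = raw.strip().lower()
    if key in ALIAS_TO_CANONICAL:  # exact alias hit
        return ALIAS_TO_CANONICAL[key]
    close = difflib.get_close_matches(key, list(ALIAS_TO_CANONICAL), n=1, cutoff=cutoff)
    return ALIAS_TO_CANONICAL[close[0]] if close else None
```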

---

### 3. Water Stress Overlay
**Status**: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities

**Approach**:
- Join to USGS WaterWatch streamflow data
- Add USGS Drought Watch indicators by HUC8
- Correlate data center density with:
  - Groundwater depletion rates
  - Surface water withdrawal permits
  - Drought frequency/severity (USDM historical data)

**Research Questions**:
- Are data centers sited in water-stressed watersheds?
- Do high-density clusters (Loudoun County, Columbus OH) face water constraints?
- Compare water stress in hyperscaler non-metro sites (Columbia River corridor) vs. metro clusters

**Tables to Create**:
- `watershed_water_stress` - HUC8-level water stress indicators
- `data_center_water_risk` - Per-facility water-stress exposure

**Notebook**: `water_stress_analysis.ipynb`
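
The "15 watersheds hold 50% of facilities" statistic is a cumulative-share calculation that a water-stress overlay could reuse (for example, restricted to stressed watersheds only). A minimal sketch, assuming one HUC8 code per facility as input; the function name and the placeholder codes are hypothetical:

```python
from collections import Counter

def watersheds_for_share(huc8_per_facility, share=0.5):
    """Smallest number of top-ranked HUC8 watersheds that together hold `share` of facilities."""
    counts = Counter(huc8_per_facility)
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= share * total:
            return rank
    return len(counts)

# Placeholder codes, not real HUC8 identifiers
facility_hucs = ["HUC_A"] * 5 + ["HUC_B"] * 3 + ["HUC_C"] * 2
```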

---

### 4. Opposition Cases Overlay
**Status**: Anecdotal evidence of community opposition to new data centers

**Approach**:
- Compile cases of rejected/delayed data center proposals (news archive scraping)
- Geocode opposition cases, join to demographics/hazards
- Test hypotheses:
  - Do wealthier/more educated communities successfully block projects?
  - Are opposition cases more common in water-stressed or drought-prone areas?
  - Do smaller non-metro communities have less bargaining power?

**Research Questions**:
- What predicts opposition success?
- Are opposition cases spatially clustered?
- Do demographics differ between accepted vs. rejected sites?

**Output**: `opposition_cases_analysis.md`

---

### 5. IM3 Forward Projection Integration
**Status**: IM3 model includes projected data center demand growth

**Approach**:
- Load IM3 projected demand scenarios (2030, 2040, 2050)
- Overlay projected growth with:
  - Current grid saturation (% of generation within 50 km)
  - Water stress indicators
  - Land availability (zoned industrial parcels)
- Identify regions where projected demand may exceed infrastructure capacity

**Research Questions**:
- Which states face grid saturation from data center growth?
- Are projected sites in water-stressed watersheds?
- Does IM3 assume spatial distribution patterns consistent with current siting?

**Notebook**: `im3_projection_overlay.ipynb`
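
The "% of generation within 50 km" indicator can be prototyped outside the database before it is wired into the overlay. The sketch below is one possible reading of that metric, using a haversine distance in place of PostGIS `ST_DWithin`; the function names, input tuple layouts, and any coordinates passed in are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def share_of_generation_near_dcs(generators, data_centers, radius_km=50.0):
    """Fraction of nameplate MW located within radius_km of at least one data center.

    generators: iterable of (lat, lon, nameplate_mw)
    data_centers: iterable of (lat, lon)
    """
    dcs = list(data_centers)
    total = near = 0.0
    for lat, lon, mw in generators:
        total += mw
        if any(haversine_km(lat, lon, dlat, dlon) <= radius_km for dlat, dlon in dcs):
            near += mw
    return near / total if total else 0.0
```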

---

## Methodological Extensions

### 6. Time-Series Analysis of Cluster Growth
**Approach**:
- Use `rfs_year` (ready for service) from cable data and EIA generator vintage
- Reconstruct data center siting over time (requires RFS dates for facilities)
- Animate cluster formation in interactive map

**Research Questions**:
- Did Ashburn VA become dominant before or after major cable landings?
- Do clusters grow via agglomeration (new facilities near existing) or simultaneous build-out?
- Correlation between energy infrastructure build-out and data center growth

**Data Needed**:
- Facility RFS dates (scrape from press releases, Baxtel historical data)
- Historical tract demographics (decennial Census + ACS back to 2000)

---

### 7. Network Effects: Fiber Route Proximity
**Status**: Current analysis tests submarine cable proximity (negative result)

**Approach**:
- Obtain fiber optic backbone route GIS data (from FCC, carriers, or Infrapedia)
- Test proximity to long-haul fiber routes (not just submarine cables)
- Hypothesis: Data centers cluster near terrestrial long-haul fiber routes rather than submarine cables

**Data Sources**:
- FCC Form 477 fiber deployment data (legacy; superseded by the FCC Broadband Data Collection)
- Infrapedia fiber route database
- State-level fiber maps (e.g., Virginia Broadband Map)

**Expected Result**: Positive correlation (unlike submarine cables)

---

### 8. Land Use & Zoning Analysis
**Approach**:
- Join data centers to local zoning classifications (industrial, commercial, etc.)
- Analyze land prices in data center tracts before/after facility construction
- Correlate with property tax revenues

**Research Questions**:
- Do data centers drive local property value increases?
- Are they preferentially sited in already-zoned industrial areas?
- Do host communities capture tax base growth?

**Data Sources**:
- Zillow Home Value Index (ZHVI) by ZIP
- ATTOM property tax assessments
- Municipal zoning GIS layers (city-specific, requires scraping/FOIA)

---

### 9. Environmental Justice Scoring
**Approach**:
- Compare data center host tracts to EPA's EJScreen indices
- Add CalEnviroScreen-style burden/benefit framework
- Test if data centers increase cumulative environmental burdens

**Metrics**:
- Air quality (PM2.5, ozone)
- Hazardous waste proximity
- Superfund site proximity
- Heat island effect (LST from Landsat)
- Noise pollution (traffic, cooling systems)

**Expected Challenge**: Data centers may improve local metrics (compared to heavy industry) but increase water/energy consumption

---

## Policy & Political Economy Research

### 10. Tax Incentive Analysis
**Approach**:
- Compile state/local tax incentives for data center siting (property tax abatements, sales tax exemptions)
- Create `data_center_incentives` table with per-facility incentive details
- Correlate incentive generosity with:
  - State fiscal health
  - Local government bargaining power
  - Facility size/operator

**Research Questions**:
- Do weaker fiscal states offer larger incentives?
- Are incentives regressive (larger for hyperscalers)?
- Do incentives predict siting decisions (natural experiment approach)?

**Data Sources**:
- Good Jobs First Subsidy Tracker
- State economic development agency press releases
- Local news archives

---

### 11. Employment & Labor Market Effects
**Approach**:
- Join to BLS Quarterly Census of Employment and Wages (QCEW) by county (QCEW is not published at ZIP level)
- Identify "data center construction boom" periods (before/after major facility openings)
- Analyze employment effects in:
  - Construction (NAICS 23)
  - Transportation/warehousing (NAICS 48-49)
  - Professional services (NAICS 54)

**Research Questions**:
- Do data centers create durable local employment?
- Are jobs filled by local residents or commuters?
- Are there measurable wage effects in host tracts?

**Data Sources**:
- BLS QCEW
- Census LEHD Origin-Destination Employment Statistics (LODES)

---

### 12. Energy Cost Pass-Through
**Approach**:
- Join to state-level electricity rate data (EIA, utility rate tracker)
- Test if data center density correlates with residential rate increases
- Natural experiment: Compare rate trajectories in high-DC vs. low-DC states

**Research Questions**:
- Do data centers drive residential rate increases (capacity cost allocation)?
- Are rate increases concentrated in utility service territories with large data center loads?
- Do states with retail choice (deregulated markets) see different effects?

**Data Sources**:
- EIA Form 861 (retail rates by state/utility)
- Utility rate case filings (state public utility commissions)
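
One simple way to frame the high-DC vs. low-DC comparison is a difference-in-differences estimate. This is a sketch of the arithmetic only, assuming average residential rates (for example in cents/kWh) for each group before and after a chosen cutoff year; it relies on the usual parallel-trends assumption, and the function name is hypothetical.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: (change in treated group) minus (change in control group)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change
```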

---

## Data Quality & Infrastructure Improvements

### 13. Address Validation & Geocoding Refinement
**Approach**:
- Re-geocode the 45 facilities currently located only to city precision (fallback geocode)
- Use USPS address validation API
- Cross-reference with Google Maps satellite imagery (manual review)

**Implementation**:
```bash
# Re-run geocoding with stricter thresholds
python3 load_postgis_data_centers.py --revalidate-addresses
```

---

### 14. OSM Continuous Monitoring
**Approach**:
- Set up automated Overpass API queries (daily/weekly)
- Detect new OSM data center tags
- Alert for review + merge into `master_data_centers`

**Implementation**:
- Cron job running `load_postgis_osm_data_centers.py --update-only`
- Slack/email notification on new facilities
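
The detection step could look roughly like the sketch below: build an Overpass QL query for a bounding box, then keep only elements not already known. The tag filters (`telecom=data_center`, `building=data_center`) are an assumption about which OSM tags matter; the existing loader may select features differently. Both helper names are hypothetical, and the sketch deliberately leaves the HTTP request to the caller.

```python
def build_overpass_query(bbox):
    """Overpass QL for data-center features inside bbox = (south, west, north, east)."""
    south, west, north, east = bbox
    box = f"({south},{west},{north},{east})"
    return (
        "[out:json][timeout:180];\n"
        "(\n"
        f'  nwr["telecom"="data_center"]{box};\n'
        f'  nwr["building"="data_center"]{box};\n'
        ");\n"
        "out center;"
    )

def new_elements(overpass_json, known_ids):
    """Return elements whose (type, id) pair is not already in known_ids."""
    return [
        el for el in overpass_json.get("elements", [])
        if (el["type"], el["id"]) not in known_ids
    ]
```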

---

### 15. Broadband Speed Validation
**Approach**:
- Cross-reference FCC BDC provider data with Ookla Speedtest results
- Test if data center host tracts have faster actual speeds (not just availability)

**Hypothesis**: Data center presence correlates with infrastructure investment → higher speeds

**Data Sources**:
- Ookla Open Data (aggregated Speedtest results by tile)
- FCC Measuring Broadband America

---

## Visualization & Communication

### 16. Interactive Story Map
**Approach**:
- Build Scrollama.js narrative map
- Sections:
  1. National overview (cluster map)
  2. Ashburn VA zoom (dominance of single region)
  3. Demographics comparison (host vs. national)
  4. Water stress hot spots
  5. Energy infrastructure saturation

**Output**: `story_map.html` (standalone web page)

---

### 17. Policy Brief Generation
**Approach**:
- Auto-generate policy briefs from analysis outputs
- Targeted audiences:
  - State legislators (energy/water policy)
  - Local governments (tax incentive negotiation)
  - Environmental justice advocates

**Template**:
```markdown
# Data Center Siting in [STATE]: Key Facts for Policymakers

- **[STATE] hosts X% of US data centers** (rank: #Y)
- **Host communities are Z% wealthier** than state average
- **A% of state generation is within 50 km of a data center**
- **Top watershed holds B facilities** (water stress: [HIGH/MEDIUM/LOW])
```
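
Filling the template can be as simple as string substitution. The sketch below renders only the headline and the first bullet, using the standard library's `string.Template`; the `render_brief` helper and its parameter names are hypothetical, and the remaining bullets would follow the same pattern once their inputs exist.

```python
from string import Template

BRIEF = Template(
    "# Data Center Siting in $state: Key Facts for Policymakers\n\n"
    "- **$state hosts $share_pct% of US data centers** (rank: #$rank)\n"
)

def render_brief(state, facilities_in_state, facilities_us, rank):
    """Render the headline and first bullet of the brief for one state."""
    share_pct = round(100 * facilities_in_state / facilities_us, 1)
    return BRIEF.substitute(state=state, share_pct=share_pct, rank=rank)

# Facility counts from the README (378 of 1,833); the rank argument is illustrative
brief = render_brief("Virginia", 378, 1833, 1)
```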

---

### 18. Comparative International Analysis
**Approach**:
- Extend methodology to EU, Canada, Australia
- Compare siting patterns (e.g., Nordic countries = renewable energy, cold climate)
- Test if "concentrated costs / dispersed benefits" holds internationally

**Data Sources**:
- OpenStreetMap (global coverage)
- Eurostat demographics
- IEA energy data
- TeleGeography global cable data (already available)

**Research Questions**:
- Are US patterns unique (tax-driven siting) vs. EU (regulatory constraints)?
- Do Nordic countries see more equitable distribution?

---

## Speculative / Long-Term Ideas

### 19. AI Demand Forecasting
**Approach**:
- Train ML model to predict data center siting
- Features: demographics, energy capacity, fiber proximity, tax rates, water availability
- Test on historical data (train on pre-2015, test on 2015-2025)

**Use Case**:
- Identify "likely future sites" for proactive policy intervention
- Warn communities of potential incoming projects

---

### 20. Cooling Technology Analysis
**Approach**:
- Classify facilities by cooling type (air, water, hybrid)
- Correlate with:
  - Climate (CDD: cooling degree days)
  - Water availability
  - Facility size

**Data Sources**:
- Manual classification from news/press releases
- FOIA requests to water utilities (cooling water withdrawal permits)

**Research Questions**:
- Are water-cooled facilities concentrated in water-stressed regions (paradox)?
- Do hyperscalers use more efficient cooling (e.g., Meta's Prineville OR evaporative cooling)?

---

### 21. Bitcoin Mining Facilities
**Approach**:
- Overlay cryptocurrency mining facilities (subset of "data centers")
- Compare siting patterns: Bitcoin mines prefer low electricity costs (WA, TX, NY hydro)
- Test if Bitcoin mines face more opposition (negative perception)

**Data Sources**:
- Cambridge Bitcoin Electricity Consumption Index (mining map reports hashrate share by country and US state; facility-level locations would need other sources)
- News archives of mining farm proposals/rejections

---

### 22. Disaster Resilience & Redundancy
**Approach**:
- Model simultaneous hazard exposure across data center clusters
  - E.g., "What % of US data centers are in wildfire risk zones?"
- Identify single points of failure (e.g., the Ashburn VA cluster; Virginia alone holds 20.6% of US facilities by count)

**Research Questions**:
- Is the current spatial distribution resilient to climate change?
- Should policy incentivize geographic diversification?

**Output**: `disaster_resilience_report.md`

---

### 23. Edge Data Center Network
**Approach**:
- Separately analyze edge facilities (<1 MW) vs. hyperscale (>100 MW)
- Test if edge DCs follow different siting logic (population density > energy cost)

**Data Challenge**: Current inventory does not distinguish edge vs. hyperscale (need `power_mw` backfill)
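
Once `power_mw` is backfilled, the split could be a small classification step. A minimal sketch using the thresholds named above (<1 MW edge, >100 MW hyperscale); the `"mid-size"` and `"unknown"` labels and the function name are assumptions, not part of the current schema.

```python
from collections import Counter

def size_tier(power_mw):
    """Classify a facility by power capacity; None means power_mw is not yet backfilled."""
    if power_mw is None:
        return "unknown"
    if power_mw < 1:
        return "edge"
    if power_mw > 100:
        return "hyperscale"
    return "mid-size"

def tier_counts(power_values):
    """Tally facilities per tier for a list of power_mw values (None allowed)."""
    return Counter(size_tier(mw) for mw in power_values)
```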

---

### 24. Carbon Intensity of Host Grids
**Approach**:
- Join to EPA eGRID subregion carbon intensity (lb CO₂/MWh)
- Calculate per-facility estimated carbon footprint (if `power_mw` available)
- Compare to corporate renewable energy procurement (RECs, PPAs)

**Research Questions**:
- Are data centers disproportionately in high-carbon grids?
- Do hyperscaler renewable commitments offset grid carbon?

**Data Sources**:
- EPA eGRID
- Corporate sustainability reports (Google, Microsoft, Meta, AWS)
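
The per-facility footprint estimate reduces to a unit conversion: capacity (MW) × hours per year × utilization × grid intensity (lb CO₂/MWh), converted to metric tons. The sketch below is a rough upper-bound style estimate under stated assumptions: `power_mw` is treated as average draw scaled by a caller-supplied `utilization`, and the function name is hypothetical. Any intensity value passed in should come from eGRID rather than from this example.

```python
HOURS_PER_YEAR = 8760
LB_PER_METRIC_TON = 2204.62

def annual_co2_tonnes(power_mw, egrid_lb_per_mwh, utilization=1.0):
    """Estimated annual CO2 in metric tons for one facility."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    mwh = power_mw * HOURS_PER_YEAR * utilization  # annual energy in MWh
    return mwh * egrid_lb_per_mwh / LB_PER_METRIC_TON
```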

---

## Collaboration Opportunities

### Academic Partnerships
- **Energy researchers**: Joint analysis of grid saturation + IM3 projections
- **Environmental justice scholars**: EJScreen overlay + opposition case studies
- **Political scientists**: Tax incentive analysis + local government bargaining power

### Policy Stakeholders
- **State energy offices**: Share grid saturation maps
- **Water resource agencies**: Watershed analysis for permitting
- **Local governments**: Demographic/tax revenue analysis for negotiation leverage

### Industry Engagement
- **Data center operators**: Validate facility data, discuss siting criteria
- **Colocation providers**: Access to tenant mix data (multi-tenant vs. single-tenant)

---

## Tools & Infrastructure Improvements

### Database Enhancements
- Add `version` column to track data vintage
- Implement `audit_log` table for data lineage
- Set up automated backups to S3/Azure Blob

### Code Quality
- Add unit tests for geocoding functions
- Create `config.yaml` for database credentials (replace hardcoded env vars)
- Dockerize analysis environment for reproducibility

### Documentation
- Add docstrings (e.g., Google or NumPy style) to all Python functions
- Create `CONTRIBUTING.md` for external collaborators
- Record Jupyter notebook walkthroughs (video tutorials)

---

## Unfunded / Ambitious Ideas

### 25. Real-Time Energy Monitoring
- Partner with utility to get live load data from data center substations
- Build dashboard showing real-time energy consumption by facility
- Correlate with AWS/Azure/GCP service outages (reverse-engineer capacity from brownouts)

### 26. Social Media Sentiment Analysis
- Scrape Twitter/Reddit for mentions of local data center projects
- NLP sentiment analysis: support vs. opposition
- Correlate sentiment with facility approval outcomes

### 27. LIDAR Analysis of Cooling Infrastructure
- Use aerial LIDAR to measure rooftop cooling equipment volume
- Proxy for facility size (cooling = f(IT load))
- Build predictive model: cooling equipment → power capacity

---

## Contact & Contributions

If you're interested in collaborating on any of these research directions, please contact the repository owner.

**Priorities for external collaboration**:
1. Power capacity data acquisition
2. Water stress/drought overlay
3. Opposition cases database compilation
4. International comparative analysis

---

## References for Future Work

### Data Sources to Explore
- **Department of Energy**: Grid resilience reports, interconnection queues
- **NREL**: Renewable energy potential by HUC (solar, wind)
- **USDA**: Agricultural water use by county (competition for water)
- **NOAA**: Climate normals + projections by grid cell
- **BLS**: QCEW employment data, wage data
- **EPA**: eGRID, EJScreen, Superfund sites

### Academic Literature Gaps
- Limited peer-reviewed research on data center spatial concentration
- No published studies on water stress exposure of data centers
- Opportunity for "first mover" publication in major geography/planning journals

### Policy Levers to Investigate
- State renewable portfolio standards (RPS) → data center siting
- Federal infrastructure investment (IRA, CHIPS Act) → energy grid capacity
- Local zoning reform (industrial land use restrictions)

---

**Last Updated**: May 2026