From 423e11083dc519cbec7c5ab6709f4bcdd75d8ace Mon Sep 17 00:00:00 2001 From: dadams Date: Wed, 27 May 2026 11:28:14 -0700 Subject: [PATCH] Add comprehensive documentation: README, database tables, and research ideas --- README.md | 213 ++++++++++++++++++ database-tables.md | 534 +++++++++++++++++++++++++++++++++++++++++++++ research-ideas.md | 524 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 1271 insertions(+) create mode 100644 README.md create mode 100644 database-tables.md create mode 100644 research-ideas.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..89aefce --- /dev/null +++ b/README.md @@ -0,0 +1,213 @@ +# US Data Centers - Geospatial Research Infrastructure + +A comprehensive geospatial research project investigating the spatial concentration, infrastructure dependencies, and socioeconomic/environmental impacts of US data center locations. + +## Project Overview + +This repository implements a PostGIS-based analytical framework that integrates multiple data sources to examine: + +- **Spatial concentration patterns**: Where are data centers located and why? +- **Infrastructure dependencies**: How do data centers relate to submarine cables, power grids, and watersheds? +- **Equity and impact**: Do data center host communities bear localized burdens while benefits are nationally dispersed? +- **Demographics**: Who lives in data center-hosting census tracts? +- **Environmental exposure**: What are the water, energy, and natural hazard exposures? + +## Key Research Question + +**Do data centers represent "concentrated costs / dispersed benefits"?** Host communities bear localized infrastructure burdens (power, water, land use) while cloud computing benefits are nationally dispersed. 
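The "concentrated" half of this question can be made concrete with two standard measures that the tract-level analysis reports, the Herfindahl-Hirschman Index (HHI) and the Gini coefficient. The sketch below is illustrative only: the facility counts are made up, and the helper names `hhi` and `gini` are hypothetical rather than taken from `analyze_dc_tract_concentration.py`.

```python
def hhi(counts):
    """Herfindahl-Hirschman Index: sum of squared shares, in (0, 1]."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def gini(counts):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    # Rank-weighted formula applied to ascending-sorted values.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Made-up facility counts per census tract, for illustration only.
even = [5, 5, 5, 5]
skewed = [17, 1, 1, 1]

print("even:  ", hhi(even), gini(even))
print("skewed:", hhi(skewed), gini(skewed))
```

Both measures rise as facilities pile into fewer tracts. HHI is driven mostly by the largest shares, while Gini reflects inequality across the whole distribution, so reporting the two together guards against reading too much into either one.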
+ +## Major Findings + +### Spatial Concentration +- **State level**: Top 5 states (VA, TX, CA, OR, OH) hold 51% of all US data centers + - Virginia alone: 20.6% (378 of 1,833 facilities) +- **Tract level**: Top 1% of data center-hosting census tracts hold 14.6% of all facilities + - Only 0.86% of data center-state residents live in a hosting tract + - Per-capita burden is **115× higher** in host tracts vs. state average +- **Watershed level**: Half of all US data centers sit in just 15 of 2,139 HUC8 watersheds + - Single watershed (Middle Potomac-Catoctin / Loudoun County): 12.8% of US facilities + +### Demographics of Host Communities +Compared to the US average, data center host communities are: +- **Wealthier**: Median household income $103,623 (vs. $78,538, +32%) +- **More educated**: 49% bachelor's+ (vs. 35%, +14 pp) +- **More diverse**: 50% non-Hispanic white (vs. 58%), driven by high Asian share (13% vs. 6%) +- **Better connected**: 94.9% broadband (vs. 89%) + +### Infrastructure Insights +- **89% of data centers are in metropolitan tracts** (vs. 
80% of all US tracts) +- **Non-metro data centers (11%)** are dominated by hyperscalers: + - AWS (67), Meta (22), Microsoft (10), Google (4) = 55% of non-metro facilities + - 66% are in Oregon + Washington (Columbia River hydro corridor) +- **Energy infrastructure**: 4 states have >2/3 of generation within 50 km of a data center: + - New Jersey: 83%, Nevada: 75%, Tennessee: 70%, Oregon: 68% + +### Submarine Cables +- **Data centers are NOT systematically closer to cables** than ordinary US cities +- Only 21.4% of data centers are within 100 km of a submarine cable landing point +- Largest clusters (Ashburn VA, Columbus OH, Iowa) are inland, driven by fiber/power/tax incentives, not cables + +## Data Sources + +### Primary Data Center Inventories +- **Curated Sample**: 1,489 facilities from web scraping + manual curation, geocoded via Census TIGER + Nominatim +- **OpenStreetMap**: 1,549 OSM features tagged as data centers (via Overpass API) +- **IM3 Model Data**: PNNL's Integrated Multisector Multiscale Modeling existing facilities +- **Master Table**: 1,833 deduplicated facilities merging all sources + +### Geospatial Context Layers +- **US Census**: 2024 TIGER tract boundaries, ACS 2024 5-year demographics (85k+ tracts) +- **USDA RUCA 2020**: Rural-Urban Commuting Area codes for metro/micropolitan/rural classification +- **USGS HUC8 Watersheds**: 2,139 subbasin polygons for water-stress analysis +- **FEMA NRI**: National Risk Index with 18 natural hazard risk scores by census tract + +### Infrastructure Layers +- **Submarine Cables**: 693 cables, 3,361 landing points (TeleGeography-style) +- **EIA Energy Data**: Operating generator capacity (4.7M monthly records, 2008-2026), facility fuel, state energy data +- **FCC Broadband Data**: Provider availability by location/block + +### Additional Data +- **RDH Precinct Vote Data**: Election results for political-economy analysis +- **NOAA HMS Smoke Data**: Wildfire smoke exposure (2005-2025) +- **USDM Drought Data**: 
Drought severity +- **Utility Rate Tracker**: State-level electricity rate increases + +## Repository Structure + +### Core Python Scripts + +**Data Ingestion** +- `load_postgis_data_centers.py` - Load curated data center CSV into PostGIS +- `load_postgis_osm_data_centers.py` - Fetch OSM data centers via Overpass API +- `build_master_data_centers.py` - Deduplicate & merge curated + OSM sources +- `load_postgis_internet_cables.py` - Load submarine cables and landing points +- `ingest_eia_energy_layers.py` - Ingest EIA energy data via API +- `build_watershed_huc8_tables.py` - Load USGS HUC8 watersheds + +**Enrichment** +- `create_data_center_census_tract_table.py` - Join data centers to Census tracts with ACS demographics +- `build_fcc_bdc_broadband_connection_table.py` - Build per-facility broadband provider table +- `build_fcc_bdc_location_provider_aggregates.py` - Aggregate FCC BDC data by county/tract + +**Analysis** +- `analyze_dc_tract_concentration.py` - Tract-level cost concentration analysis (Gini, HHI, demographic deltas) +- `analyze_cables_concentration.py` - Test if data centers cluster near submarine cables +- `make_data_center_map.py` - Generate Leaflet map of data centers +- `make_internet_cables_map.py` - Generate Leaflet map of data centers + cables + +### Key Jupyter Notebooks +- `spatial_clustering_master_data_centers.ipynb` - DBSCAN clustering of data centers +- `cluster_analysis.ipynb` - Main demographic/energy/watershed analysis +- `fema_nri_data_centers.ipynb` - Join data centers to FEMA hazard scores +- `rdh_precinct_vote_data_centers.ipynb` - Join data centers to election data +- `usdm_drought_data_centers.ipynb` - Drought exposure analysis +- `hms_smoke_data_centers.ipynb` - Wildfire smoke exposure +- `enhanced_data_center_cluster_map.ipynb` - Generate enhanced cluster visualization + +### Output Files +- `output/data_center_demographic_ruca_energy_summary.md` - Flagship analysis report +- `cables_concentration_report.md` - Cable proximity + 
cost/benefit concentration analysis +- `data_center_map.html` - Basic data center locations (Leaflet) +- `data_centers_cables_map.html` - Data centers + submarine cables (Leaflet) +- `output/enhanced_master_data_center_spatial_clusters_map.html` - Enhanced cluster visualization + +## Technical Architecture + +### Database +- **PostgreSQL 13+** with **PostGIS 3.x** +- Database name: `data_centers` +- See [database-tables.md](database-tables.md) for complete schema documentation + +### Python Environment +- **Python 3.10+** +- Key libraries: `psycopg2`, `geopandas`, `shapely`, `scikit-learn`, `pandas`, `numpy`, `requests`, `folium` + +### Data Formats +- CSV (raw data exports) +- GeoJSON (watershed/cluster geometries) +- Shapefiles (Census, USGS, FEMA inputs) +- HTML (interactive Leaflet maps) + +### Configuration +Credentials stored in `~/.zsh_secrets`, loaded via environment variables: +- `PGWEB_*`: PostgreSQL connection +- `EIA_API_KEY`: EIA energy data +- `FCC_USERNAME`, `FCC_API_KEY`: FCC broadband data +- `RDH_USERNAME`, `RDH_PASSWORD`: Redistricting Data Hub +- `CENSUS_API_KEY`: Census ACS API + +## Quick Start + +### Basic Rebuild Sequence + +```bash +# 1. Load base data center data +python3 load_postgis_data_centers.py +python3 load_postgis_osm_data_centers.py +python3 build_master_data_centers.py + +# 2. Enrich with context layers +python3 create_data_center_census_tract_table.py --replace-final +python3 load_postgis_internet_cables.py +python3 ingest_eia_energy_layers.py --category power +python3 build_watershed_huc8_tables.py + +# 3. Run analyses +python3 analyze_dc_tract_concentration.py > output/tract_analysis.txt +python3 analyze_cables_concentration.py > output/cables_analysis.txt + +# 4. 
Execute notebooks +jupyter notebook cluster_analysis.ipynb +``` + +### Generate Maps + +```bash +python3 make_data_center_map.py +python3 make_internet_cables_map.py +``` + +## Key Outputs + +### Research Reports +- **Demographic, Energy & Watershed Analysis**: `output/data_center_demographic_ruca_energy_summary.md` +- **Submarine Cable Proximity**: `cables_concentration_report.md` + +### Interactive Maps +- Data center locations with cluster assignments +- Submarine cable routes and landing points +- Energy infrastructure proximity +- Watershed boundaries with data center counts + +### Data Exports +- `master_data_center_spatial_cluster_points.csv` - Data center points with cluster IDs +- `master_data_center_spatial_cluster_summary.csv` - Cluster-level statistics +- `output/master_data_center_huc8_watersheds.geojson` - Watershed polygons +- `output/master_data_center_map_context.csv` - Full context for mapping +- `output/master_data_center_state_energy_context.csv` - State-level energy statistics + +## Data Quality Notes + +1. **Incomplete power ratings**: Only 5.9% of data centers have power ratings (108/1,833) +2. **Operator fragmentation**: String variations ("Meta" vs. "Meta, Inc.") inflate distinct-operator counts +3. **45 facilities** use city-precision fallback coordinates (approximate tract assignment) +4. **7 facilities** failed RUCA join (Puerto Rico / non-US) +5. **Broadband subscribers** are a coarse benefit proxy (actual cloud users are global) + +## Research Ideas & Future Work + +See [research-ideas.md](research-ideas.md) for detailed next steps and potential research directions. + +## Project Status + +This is a **mature, publication-ready geospatial analysis infrastructure** combining authoritative government datasets (Census, EIA, USGS, FEMA) with novel data center location data to test political economy and environmental justice hypotheses. 
+ +The "concentrated costs / dispersed benefits" hypothesis is operationalized and tested with rigorous spatial statistics (Gini coefficients, HHI indices, Mann-Whitney tests). + +## License + +Research data compiled from public sources. Please cite appropriately if used in publications. + +## Contact + +For questions about this research project, please contact the repository owner. diff --git a/database-tables.md b/database-tables.md new file mode 100644 index 0000000..116e226 --- /dev/null +++ b/database-tables.md @@ -0,0 +1,534 @@ +# Database Tables Documentation + +## Database Configuration + +**Database Name**: `data_centers` +**Type**: PostgreSQL with PostGIS extension +**Connection**: Environment variables from `~/.zsh_secrets` +- `PGWEB_HOST`: Database host +- `PGWEB_PORT`: Database port (typically 5432) +- `PGWEB_USER`: Database user +- `PGWEB_PASSWORD`: Database password +- `PGWEB_DATABASE`: Database name (`data_centers`) + +## Table Organization + +Tables are organized into four categories: +1. **Core Data Center Tables** - Master inventories and source data +2. **Enrichment Tables** - Data centers joined with contextual data +3. **Base Layer Tables** - Geographic and demographic reference layers +4. 
**Infrastructure Tables** - Energy and connectivity infrastructure + +--- + +## Core Data Center Tables + +### `master_data_centers` +**Rows**: 1,833 +**Purpose**: Canonical data center inventory - deduplicated merge of curated + OSM sources + +**Key Columns**: +- `id` (INTEGER) - Unique identifier +- `name` (TEXT) - Facility name +- `address` (TEXT) - Street address +- `city` (TEXT) - City +- `state` (TEXT) - State code +- `latitude` (DOUBLE PRECISION) - Latitude +- `longitude` (DOUBLE PRECISION) - Longitude +- `geom` (GEOMETRY) - PostGIS point geometry (EPSG:4326) +- `operator` (TEXT) - Operator/owner +- `power_mw` (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated) +- `source` (TEXT) - Data source (`curated`, `osm`, or `both`) +- `osm_id` (TEXT) - OpenStreetMap ID if applicable +- `geocode_method` (TEXT) - Geocoding provenance + +**Notes**: +- 108 of 1,833 facilities have power ratings +- 45 facilities use city-precision fallback coordinates +- Operator strings have fragmentation issues ("Meta" vs. 
"Meta, Inc.") + +### `us_dc_sample_geocoded` +**Rows**: 1,489 +**Purpose**: Original curated sample with geocoding provenance (superseded by `master_data_centers`) + +**Key Columns**: +- `name`, `address`, `city`, `state`, `zip` +- `latitude`, `longitude`, `geom` +- `operator`, `power_mw` +- `census_lat`, `census_lon` - Census TIGER geocode results +- `nominatim_lat`, `nominatim_lon` - Nominatim fallback results +- `geocode_source` - Which geocoder was used + +### `osm_data_centers` +**Rows**: 1,549 +**Purpose**: Raw OpenStreetMap-derived facilities + +**Key Columns**: +- `osm_id` (TEXT) - OSM element ID +- `osm_type` (TEXT) - `node`, `way`, or `relation` +- `name` (TEXT) - OSM name tag +- `latitude`, `longitude`, `geom` +- `tags` (JSONB) - All OSM tags as JSON +- `operator` (TEXT) - Extracted from OSM tags +- `city`, `state`, `country` + +**Notes**: Fetched via Overpass API with query for `telecom=data_center` or `building=data_center` + +### `master_data_center_spatial_clusters` +**Rows**: 1,831 +**Purpose**: DBSCAN cluster assignments for master data centers + +**Key Columns**: +- All columns from `master_data_centers` +- `cluster_id` (INTEGER) - Cluster assignment (-1 = noise/singleton) +- `cluster_size` (INTEGER) - Number of facilities in cluster +- `cluster_label` (TEXT) - Human-readable cluster name + +**Notes**: DBSCAN parameters: eps=15 km, min_samples=2 + +--- + +## Enrichment Tables + +### `data_center_census_tracts_2024` +**Rows**: 1,815 +**Purpose**: Per-facility demographics from containing Census tract + +**Key Columns**: +- All columns from `master_data_centers` +- `geoid` (TEXT) - 11-digit Census tract GEOID +- `state_fips`, `county_fips`, `tract` +- **Population**: `total_population`, `population_density_sq_mi` +- **Age**: `median_age`, `under_18_pct`, `over_65_pct` +- **Race/Ethnicity**: `white_nh_pct`, `black_nh_pct`, `asian_nh_pct`, `hispanic_pct` +- **Economics**: `median_household_income`, `per_capita_income`, `poverty_rate` +- **Education**: 
`bachelors_or_higher_pct`, `high_school_or_higher_pct` +- **Housing**: `median_home_value`, `median_rent`, `homeownership_rate` +- **Broadband**: `broadband_pct` - Households with broadband subscription + +**Source**: ACS 2024 5-year estimates + +**Notes**: +- 18 of 1,833 facilities failed tract join (geocoding issues) +- Data from `_dc_census_tract_acs_2024` base table + +### `data_center_watershed_huc8` +**Rows**: 1,833 +**Purpose**: Per-facility watershed assignment + +**Key Columns**: +- All columns from `master_data_centers` +- `huc8` (TEXT) - 8-digit Hydrologic Unit Code +- `watershed_name` (TEXT) - Watershed name +- `watershed_area_sq_km` (DOUBLE PRECISION) +- `states` (TEXT) - States intersecting watershed + +**Source**: USGS Watershed Boundary Dataset + +**Notes**: 257 unique HUC8 watersheds contain at least one data center + +### `data_center_nri_exposure` +**Rows**: 1,833 +**Purpose**: Per-facility FEMA National Risk Index hazard exposure scores + +**Key Columns**: +- All columns from `master_data_centers` +- `nri_id` (TEXT) - Census tract GEOID (matches `geoid` from demographics) +- `risk_score` (DOUBLE PRECISION) - Overall NRI risk score +- `social_vulnerability` (DOUBLE PRECISION) - Social vulnerability index +- **Hazard-specific risk scores** (18 hazards): + - `avalanche_risk`, `coastal_flooding_risk`, `cold_wave_risk` + - `drought_risk`, `earthquake_risk`, `hail_risk` + - `heat_wave_risk`, `hurricane_risk`, `ice_storm_risk` + - `landslide_risk`, `lightning_risk`, `riverine_flooding_risk` + - `strong_wind_risk`, `tornado_risk`, `tsunami_risk` + - `volcanic_activity_risk`, `wildfire_risk`, `winter_weather_risk` + +**Source**: FEMA National Risk Index (December 2025 release) + +### `data_center_rdh_precinct_vote_matches` +**Rows**: Varies +**Purpose**: Per-facility precinct-level election results + +**Key Columns**: +- Data center identifiers +- `precinct_name`, `precinct_id` +- `election_year`, `office` +- `candidate`, `party`, `votes` +- 
`vote_share_pct` + +**Source**: Redistricting Data Hub precinct shapefiles + +**Notes**: Spatial join to voting precincts (point-in-polygon) + +--- + +## Base Layer Tables + +### `_dc_census_tract_acs_2024` +**Rows**: 85,382 +**Purpose**: ACS 2024 demographics for all Census tracts in states with data centers + +**Key Columns**: +- `geoid` (TEXT) - 11-digit tract GEOID (PRIMARY KEY) +- `name` (TEXT) - Tract name +- `state_fips`, `county_fips`, `tract` +- **Full ACS 5-year estimates** (85+ columns): + - Population by age, sex, race/ethnicity + - Households, families, housing units + - Income, poverty, education, employment + - Housing values, rents, costs + - Broadband, computer access + - Commuting, vehicles + +**Source**: Census ACS 2024 5-year estimates API + +**Notes**: Universe limited to 46 states with data centers (states without any data center are excluded) + +### `_dc_census_tract_boundaries_2024` +**Rows**: 85,058 +**Purpose**: TIGER 2024 tract polygons for data center states + +**Key Columns**: +- `geoid` (TEXT) - 11-digit tract GEOID +- `name` (TEXT) - Tract name +- `state_fips`, `county_fips`, `tract_code` +- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326) +- `area_land_sq_m` (DOUBLE PRECISION) - Land area in square meters +- `area_water_sq_m` (DOUBLE PRECISION) - Water area in square meters + +**Source**: Census TIGER/Line 2024 + +### `ruca_codes_2020_tract` +**Rows**: 85,528 +**Purpose**: USDA Rural-Urban Commuting Area codes for metro/rural classification + +**Key Columns**: +- `geoid` (TEXT) - 11-digit tract GEOID (matches Census tracts) +- `ruca_code` (TEXT) - Primary RUCA code (1-10) +- `ruca_category` (TEXT) - Simplified category: + - `Metropolitan` (codes 1-3) + - `Micropolitan` (codes 4-6) + - `Small town` (codes 7-9) + - `Rural` (code 10) +- `ruca_description` (TEXT) - Full RUCA code description +- `population_2020` (INTEGER) + +**Source**: USDA Economic Research Service RUCA 2020 + +**Notes**: +- Based on 2020 Census tracts and 2010-2020 commuting patterns +-
7 data centers failed RUCA join (Puerto Rico / non-US) + +### `watershed_huc8` +**Rows**: 2,139 +**Purpose**: USGS HUC8 subbasin polygons for water-stress analysis + +**Key Columns**: +- `huc8` (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY) +- `name` (TEXT) - Watershed name +- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326) +- `area_sq_km` (DOUBLE PRECISION) +- `states` (TEXT) - Comma-separated state codes +- `dc_count` (INTEGER) - Number of data centers in watershed + +**Source**: USGS Watershed Boundary Dataset + +**Notes**: +- 257 of 2,139 watersheds contain at least one data center +- Top 15 watersheds contain 50% of all US data centers + +### `nri_census_tracts` +**Rows**: ~84,000 +**Purpose**: Full FEMA National Risk Index by Census tract + +**Key Columns**: +- `nri_id` (TEXT) - Census tract GEOID +- `state_name`, `county_name`, `tract_name` +- **460+ columns** including: + - Overall risk scores and ratings + - Expected annual loss (dollars and building value %) + - Social vulnerability components (15 factors) + - Community resilience score + - Individual hazard risk scores (18 hazards) + - Exposure, annualized frequency, historic loss ratios per hazard + +**Source**: FEMA National Risk Index v2.1 (December 2025) + +**Notes**: +- Very wide table with comprehensive natural hazard risk data +- Join to data centers on `nri_id`, which holds the tract GEOID (matches `geoid` in `data_center_census_tracts_2024`) +- See [FEMA NRI Technical Documentation](https://hazards.fema.gov/nri/) + +--- + +## Infrastructure Tables + +### Energy Infrastructure + +#### `energy_eia_operating_generator_capacity_flat` +**Rows**: 4.7 million +**Purpose**: EIA generator inventory with lat/lon/MW (monthly 2008-2026) + +**Key Columns**: +- `plant_id` (INTEGER) - EIA plant ID +- `generator_id` (TEXT) - Generator unit ID +- `plant_name` (TEXT) +- `latitude`, `longitude`, `geom` +- `state`, `county` +- `utility_name`, `operator_name` +- `nameplate_capacity_mw` (DOUBLE PRECISION) +- `technology` (TEXT) - Generation technology +- `energy_source_1`,
`energy_source_2` - Primary fuel codes +- `operating_month`, `operating_year` - When unit became operational +- `status` (TEXT) - Operating, standby, retired, etc. +- `report_month`, `report_year` - Data snapshot date + +**Source**: EIA Form 860 via API + +**Notes**: +- "Flat" means denormalized for fast spatial queries +- Each generator-month is a row (4.7M rows from monthly snapshots) +- Use for proximity analysis (e.g., "all generators within 50 km of data center") + +#### `energy_eia_facility_fuel_flat` +**Rows**: Varies +**Purpose**: Monthly generation by plant/fuel + +**Key Columns**: +- `plant_id`, `plant_name` +- `report_month`, `report_year` +- `energy_source` (TEXT) - Fuel code +- `net_generation_mwh` (DOUBLE PRECISION) +- `fuel_consumed_mmbtu` (DOUBLE PRECISION) + +**Source**: EIA Form 923 via API + +#### `energy_eia_seds_flat` +**Rows**: 2.57 million +**Purpose**: Annual state energy consumption/production (1960-2024) + +**Key Columns**: +- `state_code` (TEXT) +- `year` (INTEGER) +- `msn` (TEXT) - Mnemonic series names (e.g., `TETCB` = total energy consumption) +- `value` (DOUBLE PRECISION) - Energy in trillion BTU +- `unit` (TEXT) +- `description` (TEXT) - Human-readable MSN description + +**Source**: EIA State Energy Data System (SEDS) + +**Notes**: +- Annual aggregates by state +- Use for state-level energy context analysis + +--- + +### Connectivity Infrastructure + +#### `internet_cables` +**Rows**: 693 +**Purpose**: Submarine cable routes + +**Key Columns**: +- `cable_id` (TEXT) - Unique cable identifier +- `cable_name` (TEXT) - Official cable name +- `geom` (GEOMETRY) - LineString geometry (EPSG:4326) +- `rfs_year` (INTEGER) - Ready For Service year +- `length_km` (DOUBLE PRECISION) +- `owners` (TEXT[]) - Array of owner names +- `landing_points` (TEXT[]) - Array of landing point names + +**Source**: TeleGeography-style cable database + +**Notes**: +- 693 unique submarine cables +- Geometry is approximate route (not exact seabed path) + +#### 
`internet_cable_landing_points` +**Rows**: 3,361 +**Purpose**: Cable landing points (where cables come ashore) + +**Key Columns**: +- `landing_point_id` (TEXT) - Unique identifier +- `name` (TEXT) - Landing point name +- `city`, `country` +- `latitude`, `longitude`, `geom` +- `cables` (TEXT[]) - Array of cable names landing at this point +- `cable_count` (INTEGER) + +**Source**: TeleGeography-style cable database + +**Notes**: +- Used for proximity analysis (how close are data centers to cable landings?) +- **Key finding**: Data centers are NOT systematically closer to cables than ordinary US cities + +#### `internet_city_dominance` +**Rows**: 4,552 +**Purpose**: City-level IPs/capacity (internet hub strength proxy) + +**Key Columns**: +- `city` (TEXT) +- `country` (TEXT) +- `latitude`, `longitude`, `geom` +- `ip_addresses` (INTEGER) - Number of routable IP addresses +- `capacity_rank` (INTEGER) - Relative capacity ranking + +**Source**: Internet topology datasets + +**Notes**: Proxy for "internet hub" strength (not directly used in main analyses) + +--- + +### Broadband + +#### `fcc_bdc_location_provider_aggregates` +**Rows**: Varies +**Purpose**: FCC BDC provider availability aggregated by county/tract + +**Key Columns**: +- `geoid` (TEXT) - County or tract GEOID +- `geography_level` (TEXT) - `county` or `tract` +- `provider_count` (INTEGER) +- `technology_counts` (JSONB) - Count by technology type +- `max_download_mbps`, `max_upload_mbps` + +**Source**: FCC Broadband Data Collection (BDC) + +#### `fcc_bdc_broadband_connection_table` +**Rows**: Varies +**Purpose**: Per-data-center broadband provider availability + +**Key Columns**: +- Data center identifiers +- `provider_id`, `provider_name` +- `technology` (TEXT) +- `max_advertised_download_speed`, `max_advertised_upload_speed` +- `low_latency` (BOOLEAN) + +**Source**: FCC BDC, joined to data center locations + +**Notes**: Built by `build_fcc_bdc_broadband_connection_table.py` + +--- + +## Commonly Used Joins + 
+### Data Center to Demographics +```sql +SELECT + dc.*, + ct.median_household_income, + ct.bachelors_or_higher_pct, + ct.broadband_pct +FROM master_data_centers dc +JOIN data_center_census_tracts_2024 ct + ON dc.id = ct.id; +``` + +### Data Center to Watershed +```sql +SELECT + dc.*, + w.huc8, + w.name AS watershed_name -- watershed_huc8 stores the name in `name` +FROM master_data_centers dc +JOIN data_center_watershed_huc8 dw ON dc.id = dw.id +JOIN watershed_huc8 w ON dw.huc8 = w.huc8; +``` + +### Data Center to Energy Infrastructure (50 km radius) +```sql +SELECT + dc.id, + dc.name, + SUM(eg.nameplate_capacity_mw) AS total_capacity_50km +FROM master_data_centers dc +JOIN energy_eia_operating_generator_capacity_flat eg + ON ST_DWithin( + dc.geom::geography, + eg.geom::geography, + 50000 -- 50 km in meters + ) +WHERE eg.status = 'OP' -- Operating only + -- Use a single monthly snapshot: the table has one row per generator-month, + -- so summing across all snapshots would count each generator many times. + -- Assumes report_year/report_month sort chronologically. + AND (eg.report_year, eg.report_month) = ( + SELECT report_year, report_month + FROM energy_eia_operating_generator_capacity_flat + ORDER BY report_year DESC, report_month DESC + LIMIT 1 + ) +GROUP BY dc.id, dc.name; +``` + +### Data Center to FEMA Hazard Risk +```sql +SELECT + dc.*, + nri.risk_score, + nri.wildfire_risk, + nri.drought_risk, + nri.heat_wave_risk +FROM master_data_centers dc +JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id +JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id; +``` + +--- + +## Table Naming Conventions + +- **`master_*`** - Canonical, deduplicated tables (use these for analysis) +- **`data_center_*`** - Data center-specific enrichment tables +- **`_dc_*`** - Base layers scoped to data center states (underscore prefix = private/internal) +- **`energy_eia_*`** - EIA energy data +- **`internet_*`** - Connectivity infrastructure +- **`fcc_bdc_*`** - FCC Broadband Data Collection + +--- + +## Indexes and Performance + +All tables with a `geom` column have a spatial index on it for fast spatial joins: +```sql +CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom); +``` + +Key `geoid` columns are indexed for fast demographic joins: +```sql +CREATE INDEX idx_tablename_geoid ON tablename(geoid); +``` + +--- + +## Maintenance Notes + +### Updating Data Centers +1. Run `load_postgis_osm_data_centers.py` to refresh OSM data +2.
Run `build_master_data_centers.py` to rebuild master table +3. Run enrichment scripts to update joins + +### Updating Demographics +1. Update `_dc_census_tract_acs_2024` from Census API +2. Run `create_data_center_census_tract_table.py --replace-final` + +### Updating Energy Data +```bash +python3 ingest_eia_energy_layers.py --category power --update +``` + +--- + +## Schema Export + +To export the full schema: +```bash +pg_dump -h $PGWEB_HOST -U $PGWEB_USER -d data_centers --schema-only > schema.sql +``` + +To list all tables: +```sql +SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) +FROM pg_tables +WHERE schemaname = 'public' +ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC; +``` + +--- + +## Contact + +For database access or questions, contact the repository owner. diff --git a/research-ideas.md b/research-ideas.md new file mode 100644 index 0000000..0a82127 --- /dev/null +++ b/research-ideas.md @@ -0,0 +1,524 @@ +# Research Ideas & Future Work + +This document outlines potential research directions, data improvements, and analyses that could extend the current US Data Centers geospatial research infrastructure. + +## Priority Next Steps + +### 1. Backfill Power Capacity Data +**Status**: Only 5.9% of facilities have `power_mw` values (108/1,833) + +**Approach**: +- Scrape Baxtel data center database (requires paid subscription) +- Use Data Center Map API/scraping +- Cross-reference with utility interconnection queue filings +- FOIA requests to state utility commissions for large loads + +**Research Impact**: +- Enable capacity-weighted concentration metrics (current analyses are facility-count only) +- Correlate power capacity with demographic/environmental variables +- Identify "hyperscale" facilities (>100 MW) vs. 
edge/enterprise (<10 MW) + +**Implementation**: +```python +# Add capacity-weighted HHI calculation to analyze_dc_tract_concentration.py +# tract_capacities: summed power_mw per host tract (facilities without a power rating excluded) +total_mw = sum(tract_capacities) +capacity_weighted_hhi = sum((mw_i / total_mw) ** 2 for mw_i in tract_capacities) +``` + +--- + +### 2. Operator Name Deduplication +**Status**: String fragmentation inflates operator counts ("Meta" vs. "Meta, Inc.", AWS variants) + +**Approach**: +- Create `operator_mapping` table with canonical names +- Use fuzzy matching (e.g., the `fuzzywuzzy` library, since renamed `thefuzz`) to standardize +- Add `operator_canonical` column to `master_data_centers` + +**Research Impact**: +- Accurate hyperscaler market share analysis +- Study operator-specific siting strategies (AWS hydro, Microsoft nuclear, Meta solar) +- Enable "operator power" political economy analyses + +**Script**: +```python +# operators_dedupe.py +import os + +import pandas as pd +import psycopg2 +from fuzzywuzzy import process + +# Connect using the PGWEB_* environment variables +conn = psycopg2.connect( + host=os.environ["PGWEB_HOST"], + port=os.environ["PGWEB_PORT"], + user=os.environ["PGWEB_USER"], + password=os.environ["PGWEB_PASSWORD"], + dbname=os.environ["PGWEB_DATABASE"], +) + +# Load unique operators +operators = pd.read_sql("SELECT DISTINCT operator FROM master_data_centers", conn) + +# Manual + fuzzy matching to canonical names +canonical_map = { + "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"], + "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"], + # ... etc. +} +``` + +--- + +### 3. Water Stress Overlay +**Status**: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities + +**Approach**: +- Join to USGS WaterWatch streamflow data +- Add USGS Drought Watch indicators by HUC8 +- Correlate data center density with: + - Groundwater depletion rates + - Surface water withdrawal permits + - Drought frequency/severity (USDM historical data) + +**Research Questions**: +- Are data centers sited in water-stressed watersheds? +- Do high-density clusters (Loudoun County, Columbus OH) face water constraints? +- Compare water stress in hyperscaler non-metro sites (Columbia River corridor) vs.
metro clusters + +**Tables to Create**: +- `watershed_water_stress` - HUC8-level water stress indicators +- `data_center_water_risk` - Per-facility water-stress exposure + +**Notebook**: `water_stress_analysis.ipynb` + +--- + +### 4. Opposition Cases Overlay +**Status**: Anecdotal evidence of community opposition to new data centers + +**Approach**: +- Compile cases of rejected/delayed data center proposals (news archive scraping) +- Geocode opposition cases, join to demographics/hazards +- Test hypotheses: + - Do wealthier/more educated communities successfully block projects? + - Are opposition cases more common in water-stressed or drought-prone areas? + - Do smaller non-metro communities have less bargaining power? + +**Research Questions**: +- What predicts opposition success? +- Are opposition cases spatially clustered? +- Do demographics differ between accepted vs. rejected sites? + +**Output**: `opposition_cases_analysis.md` + +--- + +### 5. IM3 Forward Projection Integration +**Status**: IM3 model includes projected data center demand growth + +**Approach**: +- Load IM3 projected demand scenarios (2030, 2040, 2050) +- Overlay projected growth with: + - Current grid saturation (% of generation within 50 km) + - Water stress indicators + - Land availability (zoned industrial parcels) +- Identify regions where projected demand may exceed infrastructure capacity + +**Research Questions**: +- Which states face grid saturation from data center growth? +- Are projected sites in water-stressed watersheds? +- Does IM3 assume spatial distribution patterns consistent with current siting? + +**Notebook**: `im3_projection_overlay.ipynb` + +--- + +## Methodological Extensions + +### 6. 
Time-Series Analysis of Cluster Growth +**Approach**: +- Use `rfs_year` (ready for service) from cable data and EIA generator vintage +- Reconstruct data center siting over time (requires RFS dates for facilities) +- Animate cluster formation in interactive map + +**Research Questions**: +- Did Ashburn VA become dominant before or after major cable landings? +- Do clusters grow via agglomeration (new facilities near existing) or simultaneous build-out? +- Correlation between energy infrastructure build-out and data center growth + +**Data Needed**: +- Facility RFS dates (scrape from press releases, Baxtel historical data) +- Historical tract demographics (decennial Census + ACS back to 2000) + +--- + +### 7. Network Effects: Fiber Route Proximity +**Status**: Current analysis tests submarine cable proximity (negative result) + +**Approach**: +- Obtain fiber optic backbone route GIS data (from FCC, carriers, or Infrapedia) +- Test proximity to long-haul fiber routes (not just submarine cables) +- Hypothesis: Data centers cluster near fiber, not cables + +**Data Sources**: +- FCC Form 477 fiber deployment data +- Infrapedia fiber route database +- State-level fiber maps (e.g., Virginia Broadband Map) + +**Expected Result**: Positive correlation (unlike submarine cables) + +--- + +### 8. Land Use & Zoning Analysis +**Approach**: +- Join data centers to local zoning classifications (industrial, commercial, etc.) +- Analyze land prices in data center tracts before/after facility construction +- Correlate with property tax revenues + +**Research Questions**: +- Do data centers drive local property value increases? +- Are they preferentially sited in already-zoned industrial areas? +- Do host communities capture tax base growth? + +**Data Sources**: +- Zillow Home Value Index (ZHVI) by ZIP +- ATTOM property tax assessments +- Municipal zoning GIS layers (city-specific, requires scraping/FOIA) + +--- + +### 9. 
Environmental Justice Scoring +**Approach**: +- Compare data center host tracts to EPA's EJScreen indices +- Add CalEnviroScreen-style burden/benefit framework +- Test if data centers increase cumulative environmental burdens + +**Metrics**: +- Air quality (PM2.5, ozone) +- Hazardous waste proximity +- Superfund site proximity +- Heat island effect (LST from Landsat) +- Noise pollution (traffic, cooling systems) + +**Expected Challenge**: Data centers may improve local metrics (compared to heavy industry) but increase water/energy consumption + +--- + +## Policy & Political Economy Research + +### 10. Tax Incentive Analysis +**Approach**: +- Compile state/local tax incentives for data center siting (property tax abatements, sales tax exemptions) +- Create `data_center_incentives` table with per-facility incentive details +- Correlate incentive generosity with: + - State fiscal health + - Local government bargaining power + - Facility size/operator + +**Research Questions**: +- Do fiscally weaker states offer larger incentives? +- Are incentives regressive (larger for hyperscalers)? +- Do incentives predict siting decisions (natural experiment approach)? + +**Data Sources**: +- Good Jobs First Subsidy Tracker +- State economic development agency press releases +- Local news archives + +--- + +### 11. Employment & Labor Market Effects +**Approach**: +- Join to BLS Quarterly Census of Employment and Wages (QCEW) by county +- Identify "data center construction boom" periods (before/after major facility openings) +- Analyze employment effects in: + - Construction (NAICS 23) + - Transportation/warehousing (NAICS 48-49) + - Professional services (NAICS 54) + +**Research Questions**: +- Do data centers create durable local employment? +- Are jobs filled by local residents or commuters? +- Are there measurable wage effects in host tracts? + +**Data Sources**: +- BLS QCEW +- Census LEHD Origin-Destination Employment Statistics (LODES) + +--- + +### 12. 
Energy Cost Pass-Through +**Approach**: +- Join to state-level electricity rate data (EIA, utility rate tracker) +- Test if data center density correlates with residential rate increases +- Natural experiment: Compare rate trajectories in high-DC vs. low-DC states + +**Research Questions**: +- Do data centers drive residential rate increases (capacity cost allocation)? +- Are rate increases concentrated in utility service territories with large data center loads? +- Do states with retail choice (deregulated markets) see different effects? + +**Data Sources**: +- EIA Form 861 (retail rates by state/utility) +- Utility rate case filings (state public utility commissions) + +--- + +## Data Quality & Infrastructure Improvements + +### 13. Address Validation & Geocoding Refinement +**Approach**: +- Re-geocode the 45 facilities that currently rely on the city-precision fallback +- Use USPS address validation API +- Cross-reference with Google Maps satellite imagery (manual review) + +**Implementation**: +```bash +# Re-run geocoding with stricter thresholds +python3 load_postgis_data_centers.py --revalidate-addresses +``` + +--- + +### 14. OSM Continuous Monitoring +**Approach**: +- Set up automated Overpass API queries (daily/weekly) +- Detect new OSM data center tags +- Alert for review + merge into `master_data_centers` + +**Implementation**: +- Cron job running `load_postgis_osm_data_centers.py --update-only` +- Slack/email notification on new facilities + +--- + +### 15. Broadband Speed Validation +**Approach**: +- Cross-reference FCC BDC provider data with Ookla Speedtest results +- Test if data center host tracts have faster actual speeds (not just availability) + +**Hypothesis**: Data center presence correlates with infrastructure investment → higher speeds + +**Data Sources**: +- Ookla Open Data (aggregated Speedtest results by tile) +- FCC Measuring Broadband America + +--- + +## Visualization & Communication + +### 16. 
Interactive Story Map +**Approach**: +- Build Scrollama.js narrative map +- Sections: + 1. National overview (cluster map) + 2. Ashburn VA zoom (dominance of single region) + 3. Demographics comparison (host vs. national) + 4. Water stress hot spots + 5. Energy infrastructure saturation + +**Output**: `story_map.html` (standalone web page) + +--- + +### 17. Policy Brief Generation +**Approach**: +- Auto-generate policy briefs from analysis outputs +- Targeted audiences: + - State legislators (energy/water policy) + - Local governments (tax incentive negotiation) + - Environmental justice advocates + +**Template**: +```markdown +# Data Center Siting in [STATE]: Key Facts for Policymakers + +- **[STATE] hosts X% of US data centers** (rank: #Y) +- **Host communities are Z% wealthier** than state average +- **A% of state generation is within 50 km of a data center** +- **Top watershed holds B facilities** (water stress: [HIGH/MEDIUM/LOW]) +``` + +--- + +### 18. Comparative International Analysis +**Approach**: +- Extend methodology to EU, Canada, Australia +- Compare siting patterns (e.g., Nordic countries = renewable energy, cold climate) +- Test if "concentrated costs / dispersed benefits" holds internationally + +**Data Sources**: +- OpenStreetMap (global coverage) +- Eurostat demographics +- IEA energy data +- TeleGeography global cable data (already available) + +**Research Questions**: +- Are US patterns unique (tax-driven siting) vs. EU (regulatory constraints)? +- Do Nordic countries see more equitable distribution? + +--- + +## Speculative / Long-Term Ideas + +### 19. AI Demand Forecasting +**Approach**: +- Train ML model to predict data center siting +- Features: demographics, energy capacity, fiber proximity, tax rates, water availability +- Test on historical data (train on pre-2015, test on 2015-2025) + +**Use Case**: +- Identify "likely future sites" for proactive policy intervention +- Warn communities of potential incoming projects + +--- + +### 20. 
Cooling Technology Analysis +**Approach**: +- Classify facilities by cooling type (air, water, hybrid) +- Correlate with: + - Climate (CDD: cooling degree days) + - Water availability + - Facility size + +**Data Sources**: +- Manual classification from news/press releases +- FOIA requests to water utilities (cooling water withdrawal permits) + +**Research Questions**: +- Are water-cooled facilities concentrated in water-stressed regions (paradox)? +- Do hyperscalers use more efficient cooling (e.g., Meta's Prineville OR evaporative cooling)? + +--- + +### 21. Bitcoin Mining Facilities +**Approach**: +- Overlay cryptocurrency mining facilities (subset of "data centers") +- Compare siting patterns: Bitcoin mines prefer low electricity costs (WA, TX, NY hydro) +- Test if Bitcoin mines face more opposition (negative perception) + +**Data Sources**: +- Cambridge Bitcoin Electricity Consumption Index (has facility locations) +- News archives of mining farm proposals/rejections + +--- + +### 22. Disaster Resilience & Redundancy +**Approach**: +- Model simultaneous hazard exposure across data center clusters +- E.g., "What % of US data centers are in wildfire risk zones?" +- Identify single points of failure (e.g., Virginia holds 20.6% of US facilities, 12.8% in the single watershed covering Loudoun County) + +**Research Questions**: +- Is the current spatial distribution resilient to climate change? +- Should policy incentivize geographic diversification? + +**Output**: `disaster_resilience_report.md` + +--- + +### 23. Edge Data Center Network +**Approach**: +- Separately analyze edge facilities (<1 MW) vs. hyperscale (>100 MW) +- Test if edge DCs follow different siting logic (population density > energy cost) + +**Data Challenge**: Current inventory does not distinguish edge vs. hyperscale (need `power_mw` backfill) + +--- + +### 24. 
Carbon Intensity of Host Grids +**Approach**: +- Join to EPA eGRID subregion carbon intensity (lb CO₂/MWh) +- Calculate per-facility estimated carbon footprint (if `power_mw` available) +- Compare to corporate renewable energy procurement (RECs, PPAs) + +**Research Questions**: +- Are data centers disproportionately in high-carbon grids? +- Do hyperscaler renewable commitments offset grid carbon? + +**Data Sources**: +- EPA eGRID +- Corporate sustainability reports (Google, Microsoft, Meta, AWS) + +--- + +## Collaboration Opportunities + +### Academic Partnerships +- **Energy researchers**: Joint analysis of grid saturation + IM3 projections +- **Environmental justice scholars**: EJScreen overlay + opposition case studies +- **Political scientists**: Tax incentive analysis + local government bargaining power + +### Policy Stakeholders +- **State energy offices**: Share grid saturation maps +- **Water resource agencies**: Watershed analysis for permitting +- **Local governments**: Demographic/tax revenue analysis for negotiation leverage + +### Industry Engagement +- **Data center operators**: Validate facility data, discuss siting criteria +- **Colocation providers**: Access to tenant mix data (multi-tenant vs. single-tenant) + +--- + +## Tools & Infrastructure Improvements + +### Database Enhancements +- Add `version` column to track data vintage +- Implement `audit_log` table for data lineage +- Set up automated backups to S3/Azure Blob + +### Code Quality +- Add unit tests for geocoding functions +- Create `config.yaml` for database credentials (replace hardcoded values) +- Dockerize analysis environment for reproducibility + +### Documentation +- Add docstrings (e.g., Google or NumPy style) to all Python functions +- Create `CONTRIBUTING.md` for external collaborators +- Record Jupyter notebook walkthroughs (video tutorials) + +--- + +## Unfunded / Ambitious Ideas + +### 25. 
Real-Time Energy Monitoring +- Partner with utility to get live load data from data center substations +- Build dashboard showing real-time energy consumption by facility +- Correlate with AWS/Azure/GCP service outages (reverse-engineer capacity from brownouts) + +### 26. Social Media Sentiment Analysis +- Scrape Twitter/Reddit for mentions of local data center projects +- NLP sentiment analysis: support vs. opposition +- Correlate sentiment with facility approval outcomes + +### 27. LIDAR Analysis of Cooling Infrastructure +- Use aerial LIDAR to measure rooftop cooling equipment volume +- Proxy for facility size (cooling = f(IT load)) +- Build predictive model: cooling equipment → power capacity + +--- + +## Contact & Contributions + +If you're interested in collaborating on any of these research directions, please contact the repository owner. + +**Priorities for external collaboration**: +1. Power capacity data acquisition +2. Water stress/drought overlay +3. Opposition cases database compilation +4. 
International comparative analysis + +--- + +## References for Future Work + +### Data Sources to Explore +- **Department of Energy**: Grid resilience reports, interconnection queues +- **NREL**: Renewable energy potential by HUC (solar, wind) +- **USDA**: Agricultural water use by county (competition for water) +- **NOAA**: Climate normals + projections by grid cell +- **BLS**: QCEW employment data, wage data +- **EPA**: eGRID, EJScreen, Superfund sites + +### Academic Literature Gaps +- Limited peer-reviewed research on data center spatial concentration +- No published studies on water stress exposure of data centers +- Opportunity for "first mover" publication in major geography/planning journals + +### Policy Levers to Investigate +- State renewable portfolio standards (RPS) → data center siting +- Federal infrastructure investment (IRA, CHIPS Act) → energy grid capacity +- Local zoning reform (industrial land use restrictions) + +--- + +**Last Updated**: May 2026