Reorganize project into scripts/, docs/, data/, output/ directories
Move all Python scripts to scripts/, documentation to docs/, raw input data to data/, and generated HTML/CSV outputs to output/. Update path references in 8 scripts to use Path(__file__).parent.parent as project root so they work correctly from the new location. Update README links and quick-start commands accordingly. Notebooks remain at root.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BIN
docs/cables_concentration_report.docx
Normal file
Binary file not shown.
194
docs/cables_concentration_report.md
Normal file
@@ -0,0 +1,194 @@
# Data Centers, Submarine Cables, and the Concentrated-Costs / Dispersed-Benefits Frame

**Author:** David Adams · **Date:** 2026-05-17
**Data:** PostGIS `data_centers` DB: `us_dc_sample_geocoded` (1,489 DCs),
`data_center_census_tracts_2024` (611 tracts, ACS 2024 5-yr enriched),
`internet_cables` (693 cables), `internet_city_dominance` (4,552 cities),
`census_tract_acs_2024_selected_states.csv` (83,811 tracts, 46 states).

---

## 1. Are US data centers spatially tied to submarine cables?

Distance from each point to the nearest submarine cable line (km):

| Group | n | Mean | p10 | p25 | **p50** | p75 | p90 | ≤10 km | ≤50 km | ≤100 km | ≤250 km |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| US data centers | 1,489 | 358.7 | 21.6 | 163.1 | **276.1** | 477.4 | 867.4 | 5.2% | 16.8% | 21.4% | 32.2% |
| US population cities | 1,291 | 339.7 | 18.7 | 61.2 | **256.1** | 528.0 | 811.0 | 6.8% | 22.5% | 31.8% | 49.5% |

Mann-Whitney U two-sided: **z = 2.66, p ≈ 0.008**. The difference is
significant, but in the *opposite* direction: the median DC sits 276.1 km
from the nearest cable versus 256.1 km for cities. DCs are **not**
systematically closer to cables than ordinary US cities.
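
As a rough illustration of how a z statistic of this kind can be obtained, here is a minimal pure-Python sketch of the Mann-Whitney U test using the normal approximation. It omits the tie correction, and the function names and toy distances are assumptions for illustration; the report does not state which implementation or sign convention produced its figures.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def mann_whitney_z(x, y):
    """Return (U for x, z) using the normal approximation, no tie correction."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values so they share an average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u_x, (u_x - mean_u) / sd_u


# Toy distances in km (hypothetical, not the report's data).
dc_km = [300.0, 450.0, 500.0]
city_km = [20.0, 60.0, 250.0]
u, z = mann_whitney_z(dc_km, city_km)
```

With the DC distances passed first, a positive z means DCs tend to rank farther from cables than cities do. `two_sided_p(2.66)` comes out at roughly 0.008, in line with the p-value reported above.
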

**Interpretation.** At the national level the "cables drive DC siting" story
fails. The largest clusters (Loudoun County VA (Ashburn), central
Washington, Hillsboro OR, Columbus OH, Iowa) are inland, anchored to
terrestrial fiber, cheap power, and tax incentives rather than submarine
landings. Only 21.4% of DCs sit within 100 km of any cable, compared with
31.8% of cities.

---

## 2. Cost concentration at the state level

| Measure | Value |
|---|---:|
| States covered | 46 |
| Gini of DC counts across states | 0.648 |
| HHI of state shares | 0.080 |

Top states by share of US data centers:

| State | DCs | Share | Cumulative |
|---|---:|---:|---:|
| VA | 319 | 21.4% | 21.4% |
| CA | 129 | 8.7% | 30.1% |
| TX | 120 | 8.1% | 38.1% |
| OR | 102 | 6.9% | 45.0% |
| WA | 90 | 6.0% | 51.0% |
| OH | 69 | 4.6% | 55.7% |
| AZ | 60 | 4.0% | 59.7% |
| IA | 58 | 3.9% | 63.6% |

Five states hold **half** of all US data centers.
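
The share and cumulative columns can be reproduced from the counts alone. The sketch below is a minimal recomputation using the eight counts shown in the table and the 1,489-DC total; the function and variable names are illustrative assumptions, not taken from the analysis scripts.

```python
# State counts copied from the table above; the total is the report's 1,489 DCs.
TOTAL_DCS = 1489
state_counts = [("VA", 319), ("CA", 129), ("TX", 120), ("OR", 102),
                ("WA", 90), ("OH", 69), ("AZ", 60), ("IA", 58)]


def cumulative_shares(counts, total):
    """Return (state, share %, cumulative %) rows in the order given."""
    rows, running = [], 0
    for state, n in counts:
        running += n
        rows.append((state, 100 * n / total, 100 * running / total))
    return rows


rows = cumulative_shares(state_counts, TOTAL_DCS)
top5_cumulative = rows[4][2]  # cumulative share through WA
```

Rounded to one decimal, `top5_cumulative` is 51.0, which is the basis for the "five states hold half" statement.
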

---

## 3. Cost concentration at the tract level

Concentration is much sharper at the tract level than at the state level:

| Measure | Value |
|---|---:|
| DC-hosting tracts | 611 |
| DCs in those tracts | 1,489 |
| Gini of DC counts across DC-hosting tracts | 0.499 |
| HHI of DC shares across DC-hosting tracts | 0.0069 |
| **Top 1% of host tracts (6 tracts) hold** | **14.6% of all DCs** |
| Top 5% of host tracts (30 tracts) hold | 33.3% of all DCs |
| Top 20% of host tracts (122 tracts) hold | 60.6% of all DCs |

Population scaling:

| Metric | Value |
|---|---:|
| Population living in a DC-hosting tract | 2,868,863 |
| Total population (DC-state ACS universe) | 332,343,349 |
| **% of DC-host-state residents in a DC-hosting tract** | **0.86%** |
| DCs per resident, DC-hosting tracts | 1 per 1,927 |
| DCs per resident, DC-state average | 1 per 223,199 |
| **Per-capita DC burden, host vs. average** | **~115×** |
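
The population-scaling rows follow from three inputs in the table. The arithmetic can be sketched as below; the variable names are illustrative. Because the DC count cancels, the burden ratio is simply total population divided by host-tract population.

```python
host_population = 2_868_863      # residents of DC-hosting tracts
total_population = 332_343_349   # residents of the DC-state ACS universe
total_dcs = 1_489

host_share_pct = 100 * host_population / total_population
residents_per_dc_host = host_population / total_dcs
residents_per_dc_avg = total_population / total_dcs

# Ratio of average residents-per-DC to host-tract residents-per-DC.
burden_ratio = residents_per_dc_avg / residents_per_dc_host
```

This gives a host share of about 0.86%, roughly 1,927 and 223,199 residents per DC, and a ratio between 115 and 116.
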

---

## 4. Who bears the costs? (ACS profile of DC tracts vs. peer tracts in same states)

| Field | DC tracts (median) | Non-DC peers (median) | Δ (DC − peer) |
|---|---:|---:|---:|
| Median household income ($) | 91,082 | 76,637 | **+14,446** |
| Per-capita income ($) | 48,111 | 38,546 | +9,565 |
| Broadband subscription (%) | 94.2 | 92.0 | +2.2 |
| Poverty rate (%) | 8.8 | 10.8 | −2.0 |
| Non-Hispanic White (%) | 52.4 | 64.7 | −12.3 |
| Non-Hispanic Black (%) | 6.7 | 3.9 | +2.8 |
| Hispanic/Latino (%) | 11.9 | 9.8 | +2.1 |
| Non-Hispanic Asian (%) | 5.2 | 1.5 | +3.7 |

Population-weighted means in DC tracts: MHI **$109,145**, broadband **93.2%**,
poverty 11.1%. The actual residents of host communities are concentrated in
affluent tech corridors (Loudoun, Silicon Valley, Seattle eastside,
Hillsboro OR).

Primary-industry mix of host tracts (count of tracts):

| Tracts | Primary industry |
|---:|---|
| 351 | Educational services, and health care and social assistance |
| 133 | Professional, scientific, management, administrative, and waste management services |
| 35 | Manufacturing |
| 26 | Arts, entertainment, recreation, accommodation, and food services |
| 22 | Retail trade |
| 14 | Agriculture, forestry, fishing and hunting, and mining |
| 10 | Finance and insurance, and real estate and rental and leasing |
| 9 | Construction |
| 4 | Transportation and warehousing, and utilities |
| 3 | Public administration |

---

## 5. Cable-adjacent vs. inland DC tracts

| | ≤100 km from a cable | >100 km from a cable |
|---|---:|---:|
| Tracts | 159 | 452 |
| Data centers | 319 | 1,170 |
| Median household income ($) | 106,406 | 86,289 |
| Median broadband (%) | 95.2 | 93.9 |
| Median DC count | 1 | 1 |

Inland DCs outnumber cable-adjacent DCs roughly **3.7** to 1 (1,170 vs. 319).
Coastal/cable tracts skew even wealthier than inland DC tracts.

---

## 6. Benefit dispersion (broadband subscribers as a benefit proxy)

| Measure | Value |
|---|---:|
| Estimated broadband subscribers (DC states) | 119,719,313 |
| Tracts with subscriber data | 81,839 |
| Gini of subscribers across tracts | 0.253 |
| HHI of subscribers across tracts | 0.00001 |

Side-by-side concentration:

| Series | HHI |
|---|---:|
| DCs across DC-hosting tracts | 0.0069 |
| Broadband subscribers across DC-state tracts | 0.00001 |
| **Concentration ratio** | **~464× more concentrated for DCs** |
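
For readers unfamiliar with the two indices used throughout, here is a minimal sketch of how HHI and Gini are commonly computed from a list of counts. The function names and toy counts are assumptions for illustration and are not taken from the analysis scripts.

```python
def hhi(counts):
    """Herfindahl-Hirschman index: sum of squared shares (0-1 scale)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly equal)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical DC counts per tract, for illustration only.
toy_counts = [1, 1, 1, 1, 6]
toy_hhi = hhi(toy_counts)
toy_gini = gini(toy_counts)

# With n equal shares the HHI bottoms out at 1/n.
hhi_floor_host_tracts = 1 / 611
hhi_floor_all_tracts = 1 / 81_839
```

The HHI has a floor of 1/n, so the two series start from different baselines (about 0.0016 for 611 host tracts and about 0.000012 for 81,839 tracts), which is worth keeping in mind when reading the ratio.
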

---

## 7. Verdict

| Element of the frame | Holds? |
|---|---|
| Costs concentrated geographically | **Yes.** The top 6 tracts carry 15% of DCs; <1% of host-state population lives in a DC tract; per-capita burden is ~115× the average. |
| Driven by submarine cable infrastructure | **No, broadly.** The proximity test fails nationally; submarine cables matter for a coastal subset only. Terrestrial fiber, power, water, land, and tax incentives dominate. |
| Benefits dispersed among users | **Yes.** Broadband subscribers are ~464× more dispersed (by HHI) than DCs. |
| Classic political failure mode (weak losers vs. diffuse winners) | **No.** Host tracts skew higher-income, higher-broadband, and lower-poverty than peers. The cost-bearing communities are affluent tech corridors with strong bargaining capacity; they tend to convert concentrated costs into concentrated *rents* (tax base, jobs, infrastructure concessions). |

**Bottom line.** The structural asymmetry that defines "concentrated costs /
dispersed benefits" is unambiguous in the data: DC siting is hyper-local
while benefits are continental. But the predicted political dynamic doesn't
fit cleanly, because the loser side here is not weak. A more targeted test
would split host tracts into power-stressed exurban tracts (parts of
Loudoun's edges, central Oregon, Iowa) and urban-suburban tech-corridor
tracts, and look at whether the *exurban* subset shows the weak-loser
pattern (lower income, lower broadband subscription, higher poverty than
its neighbors).

---

## Caveats

- The ACS universe is the 46 DC-host states (already DC-heavy); excludes
  states with no DCs in the sample.
- `data_center_census_tracts_2024` only contains tracts that host at least
  one DC, by construction.
- Broadband-subscription rate is a coarse benefit proxy; cloud services
  benefit any internet user globally, not just local subscribers.
- 45 of 1,489 DCs use city-precision fallback coordinates, so a small share
  of tract assignments are approximate.
- The `logical_dominance_ips` field in `internet_city_dominance` measures
  IP blocks routed/hosted at each city, a supply-side measure that
  duplicates the DC signal, not a demand-side user-location measure. It
  was excluded from the benefit-dispersion calculation for that reason.

## Reproducible scripts

- `load_postgis_internet_cables.py`: ingest cables/landings/cities
- `make_internet_cables_map.py`: render the combined Leaflet map
- `analyze_cables_concentration.py`: state-level + cable-proximity analysis
- `analyze_dc_tract_concentration.py`: tract-level analysis used here
701
docs/database-tables.md
Normal file
@@ -0,0 +1,701 @@
# Database Tables Documentation

## Database Configuration

**Database Name**: `data_centers`
**Type**: PostgreSQL with PostGIS extension
**Connection**: Environment variables from `~/.zsh_secrets`
- `PGWEB_HOST`: Database host
- `PGWEB_PORT`: Database port (5433)
- `PGWEB_USER`: Database user
- `PGWEB_PASSWORD`: Database password
- `PGWEB_DATABASE`: Database name (`data_centers`)
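
A minimal sketch of reading these variables in Python follows. The helper name `connection_kwargs` is hypothetical, and which database driver the project's scripts use is not stated here; the returned keys follow the standard libpq keyword names (`host`, `port`, `user`, `password`, `dbname`).

```python
import os

REQUIRED_VARS = ("PGWEB_HOST", "PGWEB_PORT", "PGWEB_USER",
                 "PGWEB_PASSWORD", "PGWEB_DATABASE")


def connection_kwargs(env=os.environ):
    """Build keyword arguments for a PostgreSQL driver from PGWEB_* variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "host": env["PGWEB_HOST"],
        "port": int(env["PGWEB_PORT"]),
        "user": env["PGWEB_USER"],
        "password": env["PGWEB_PASSWORD"],
        "dbname": env["PGWEB_DATABASE"],
    }


# Example with placeholder values; real values come from the shell environment.
example = connection_kwargs({
    "PGWEB_HOST": "localhost", "PGWEB_PORT": "5433", "PGWEB_USER": "analyst",
    "PGWEB_PASSWORD": "secret", "PGWEB_DATABASE": "data_centers",
})
```

Failing early on a missing variable tends to give a clearer error than a driver-level connection failure.
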

## Table Organization

Tables are organized into five categories:
1. **Core Data Center Tables** - Master inventories and source data
2. **Enrichment Tables** - Data centers joined with contextual data
3. **Base Layer Tables** - Geographic and demographic reference layers
4. **Infrastructure Tables** - Energy and connectivity infrastructure
5. **Legislation Tables** - LegiScan state and federal bill data (2016-2026)

---

## Core Data Center Tables

### `master_data_centers`
**Rows**: 1,833
**Purpose**: Canonical data center inventory - deduplicated merge of curated + OSM sources

**Key Columns**:
- `id` (INTEGER) - Unique identifier
- `name` (TEXT) - Facility name
- `address` (TEXT) - Street address
- `city` (TEXT) - City
- `state` (TEXT) - State code
- `latitude` (DOUBLE PRECISION) - Latitude
- `longitude` (DOUBLE PRECISION) - Longitude
- `geom` (GEOMETRY) - PostGIS point geometry (EPSG:4326)
- `operator` (TEXT) - Operator/owner
- `power_mw` (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
- `source` (TEXT) - Data source (`curated`, `osm`, or `both`)
- `osm_id` (TEXT) - OpenStreetMap ID if applicable
- `geocode_method` (TEXT) - Geocoding provenance

**Notes**:
- 108 of 1,833 facilities have power ratings
- 45 facilities use city-precision fallback coordinates
- Operator strings have fragmentation issues ("Meta" vs. "Meta, Inc.")

### `us_dc_sample_geocoded`
**Rows**: 1,489
**Purpose**: Original curated sample with geocoding provenance (superseded by `master_data_centers`)

**Key Columns**:
- `name`, `address`, `city`, `state`, `zip`
- `latitude`, `longitude`, `geom`
- `operator`, `power_mw`
- `census_lat`, `census_lon` - Census TIGER geocode results
- `nominatim_lat`, `nominatim_lon` - Nominatim fallback results
- `geocode_source` - Which geocoder was used

### `osm_data_centers`
**Rows**: 1,549
**Purpose**: Raw OpenStreetMap-derived facilities

**Key Columns**:
- `osm_id` (TEXT) - OSM element ID
- `osm_type` (TEXT) - `node`, `way`, or `relation`
- `name` (TEXT) - OSM name tag
- `latitude`, `longitude`, `geom`
- `tags` (JSONB) - All OSM tags as JSON
- `operator` (TEXT) - Extracted from OSM tags
- `city`, `state`, `country`

**Notes**: Fetched via Overpass API with query for `telecom=data_center` or `building=data_center`

### `master_data_center_spatial_clusters`
**Rows**: 1,831
**Purpose**: DBSCAN cluster assignments for master data centers

**Key Columns**:
- All columns from `master_data_centers`
- `cluster_id` (INTEGER) - Cluster assignment (-1 = noise/singleton)
- `cluster_size` (INTEGER) - Number of facilities in cluster
- `cluster_label` (TEXT) - Human-readable cluster name

**Notes**: DBSCAN parameters: eps=15 km, min_samples=2
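
To make the `eps=15 km` parameter concrete, the sketch below computes great-circle distance with the haversine formula and shows the neighbor test that a distance threshold implies. It assumes a haversine metric on latitude/longitude; the metric and clustering library actually used are not stated here, and the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


EPS_KM = 15
# A haversine-metric clusterer working in radians would need eps in radians.
eps_radians = EPS_KM / EARTH_RADIUS_KM


def are_neighbors(a, b, eps_km=EPS_KM):
    """True if two (lat, lon) points fall within eps of each other."""
    return haversine_km(a[0], a[1], b[0], b[1]) <= eps_km


one_degree_lon_at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
```

Two points 0.1 degrees of latitude apart (about 11 km) count as neighbors under this threshold, while points 0.2 degrees apart (about 22 km) do not.
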

---

## Enrichment Tables

### `data_center_census_tracts_2024`
**Rows**: 1,815
**Purpose**: Per-facility demographics from containing Census tract

**Key Columns**:
- All columns from `master_data_centers`
- `geoid` (TEXT) - 11-digit Census tract GEOID
- `state_fips`, `county_fips`, `tract`
- **Population**: `total_population`, `population_density_sq_mi`
- **Age**: `median_age`, `under_18_pct`, `over_65_pct`
- **Race/Ethnicity**: `white_nh_pct`, `black_nh_pct`, `asian_nh_pct`, `hispanic_pct`
- **Economics**: `median_household_income`, `per_capita_income`, `poverty_rate`
- **Education**: `bachelors_or_higher_pct`, `high_school_or_higher_pct`
- **Housing**: `median_home_value`, `median_rent`, `homeownership_rate`
- **Broadband**: `broadband_pct` - Households with broadband subscription

**Source**: ACS 2024 5-year estimates

**Notes**:
- 18 of 1,833 facilities failed tract join (geocoding issues)
- Data from `_dc_census_tract_acs_2024` base table
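
An 11-digit tract GEOID is the concatenation of a 2-digit state FIPS code, a 3-digit county FIPS code, and a 6-digit tract code, which is how `geoid` relates to the `state_fips`, `county_fips`, and `tract` columns. A minimal sketch, with a hypothetical helper name and a made-up tract code:

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into (state_fips, county_fips, tract)."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected 11 digits, got {geoid!r}")
    return geoid[:2], geoid[2:5], geoid[5:]


# Illustrative GEOID: state FIPS 51 (Virginia), county FIPS 107 (Loudoun County);
# the 6-digit tract code is made up.
state_fips, county_fips, tract = split_tract_geoid("51107611000")
```

Storing the GEOID as TEXT preserves leading zeros (for example, state FIPS `06`), which an integer column would drop.
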

### `data_center_watershed_huc8`
**Rows**: 1,833
**Purpose**: Per-facility watershed assignment

**Key Columns**:
- All columns from `master_data_centers`
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code
- `watershed_name` (TEXT) - Watershed name
- `watershed_area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - States intersecting watershed

**Source**: USGS Watershed Boundary Dataset

**Notes**: 257 unique HUC8 watersheds contain at least one data center

### `data_center_nri_exposure`
**Rows**: 1,833
**Purpose**: Per-facility FEMA National Risk Index hazard exposure scores

**Key Columns**:
- All columns from `master_data_centers`
- `nri_id` (TEXT) - Census tract GEOID (matches `geoid` from demographics)
- `risk_score` (DOUBLE PRECISION) - Overall NRI risk score
- `social_vulnerability` (DOUBLE PRECISION) - Social vulnerability index
- **Hazard-specific risk scores** (18 hazards):
  - `avalanche_risk`, `coastal_flooding_risk`, `cold_wave_risk`
  - `drought_risk`, `earthquake_risk`, `hail_risk`
  - `heat_wave_risk`, `hurricane_risk`, `ice_storm_risk`
  - `landslide_risk`, `lightning_risk`, `riverine_flooding_risk`
  - `strong_wind_risk`, `tornado_risk`, `tsunami_risk`
  - `volcanic_activity_risk`, `wildfire_risk`, `winter_weather_risk`

**Source**: FEMA National Risk Index (December 2025 release)

### `data_center_rdh_precinct_vote_matches`
**Rows**: Varies
**Purpose**: Per-facility precinct-level election results

**Key Columns**:
- Data center identifiers
- `precinct_name`, `precinct_id`
- `election_year`, `office`
- `candidate`, `party`, `votes`
- `vote_share_pct`

**Source**: Redistricting Data Hub precinct shapefiles

**Notes**: Spatial join to voting precincts (point-in-polygon)

---

## Base Layer Tables

### `_dc_census_tract_acs_2024`
**Rows**: 85,382
**Purpose**: ACS 2024 demographics for all Census tracts in states with data centers

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract`
- **Full ACS 5-year estimates** (85+ columns):
  - Population by age, sex, race/ethnicity
  - Households, families, housing units
  - Income, poverty, education, employment
  - Housing values, rents, costs
  - Broadband, computer access
  - Commuting, vehicles

**Source**: Census ACS 2024 5-year estimates API

**Notes**: Universe limited to 46 states with data centers (excludes DC-free states)

### `_dc_census_tract_boundaries_2024`
**Rows**: 85,058
**Purpose**: TIGER 2024 tract polygons for data center states

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract_code`
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_land_sq_m` (DOUBLE PRECISION) - Land area in square meters
- `area_water_sq_m` (DOUBLE PRECISION) - Water area in square meters

**Source**: Census TIGER/Line 2024

### `ruca_codes_2020_tract`
**Rows**: 85,528
**Purpose**: USDA Rural-Urban Commuting Area codes for metro/rural classification

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (matches Census tracts)
- `ruca_code` (TEXT) - Primary RUCA code (1-10)
- `ruca_category` (TEXT) - Simplified category:
  - `Metropolitan` (codes 1-3)
  - `Micropolitan` (codes 4-6)
  - `Small town` (codes 7-9)
  - `Rural` (code 10)
- `ruca_description` (TEXT) - Full RUCA code description
- `population_2020` (INTEGER)

**Source**: USDA Economic Research Service RUCA 2020

**Notes**:
- Based on 2020 Census tracts and 2010-2020 commuting patterns
- 7 data centers failed RUCA join (Puerto Rico / non-US)
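
The mapping from `ruca_code` to `ruca_category` listed above can be sketched as a small function. The helper name is hypothetical, and how the loader treats codes outside 1-10 is an assumption; this version raises an error for them.

```python
def ruca_category(ruca_code):
    """Map a primary RUCA code (1-10) to the simplified category listed above."""
    code = int(ruca_code)  # the column is TEXT, so accept "1" as well as 1
    if 1 <= code <= 3:
        return "Metropolitan"
    if 4 <= code <= 6:
        return "Micropolitan"
    if 7 <= code <= 9:
        return "Small town"
    if code == 10:
        return "Rural"
    raise ValueError(f"unexpected RUCA code: {ruca_code!r}")


categories = [ruca_category(c) for c in ("1", "5", "9", "10")]
```
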

### `watershed_huc8`
**Rows**: 2,139
**Purpose**: USGS HUC8 subbasin polygons for water-stress analysis

**Key Columns**:
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
- `name` (TEXT) - Watershed name
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - Comma-separated state codes
- `dc_count` (INTEGER) - Number of data centers in watershed

**Source**: USGS Watershed Boundary Dataset

**Notes**:
- 257 of 2,139 watersheds contain at least one data center
- Top 15 watersheds contain 50% of all US data centers

### `nri_census_tracts`
**Rows**: ~84,000
**Purpose**: Full FEMA National Risk Index by Census tract

**Key Columns**:
- `nri_id` (TEXT) - Census tract GEOID
- `state_name`, `county_name`, `tract_name`
- **460+ columns** including:
  - Overall risk scores and ratings
  - Expected annual loss (dollars and building value %)
  - Social vulnerability components (15 factors)
  - Community resilience score
  - Individual hazard risk scores (18 hazards)
  - Exposure, annualized frequency, historic loss ratios per hazard

**Source**: FEMA National Risk Index v2.1 (December 2025)

**Notes**:
- Massive table with comprehensive natural hazard risk data
- Join to data centers by matching `nri_id` to the facility tables' `geoid` field
- See [FEMA NRI Technical Documentation](https://hazards.fema.gov/nri/)

---

## Infrastructure Tables

### Energy Infrastructure

#### `energy_eia_operating_generator_capacity_flat`
**Rows**: 4.7 million
**Purpose**: EIA generator inventory with lat/lon/MW (monthly 2008-2026)

**Key Columns**:
- `plant_id` (INTEGER) - EIA plant ID
- `generator_id` (TEXT) - Generator unit ID
- `plant_name` (TEXT)
- `latitude`, `longitude`, `geom`
- `state`, `county`
- `utility_name`, `operator_name`
- `nameplate_capacity_mw` (DOUBLE PRECISION)
- `technology` (TEXT) - Generation technology
- `energy_source_1`, `energy_source_2` - Primary fuel codes
- `operating_month`, `operating_year` - When unit became operational
- `status` (TEXT) - Operating, standby, retired, etc.
- `report_month`, `report_year` - Data snapshot date

**Source**: EIA Form 860 via API

**Notes**:
- "Flat" means denormalized for fast spatial queries
- Each generator-month is a row (4.7M rows from monthly snapshots)
- Use for proximity analysis (e.g., "all generators within 50 km of data center")

#### `energy_eia_facility_fuel_flat`
**Rows**: Varies
**Purpose**: Monthly generation by plant/fuel

**Key Columns**:
- `plant_id`, `plant_name`
- `report_month`, `report_year`
- `energy_source` (TEXT) - Fuel code
- `net_generation_mwh` (DOUBLE PRECISION)
- `fuel_consumed_mmbtu` (DOUBLE PRECISION)

**Source**: EIA Form 923 via API

#### `energy_eia_seds_flat`
**Rows**: 2.57 million
**Purpose**: Annual state energy consumption/production (1960-2024)

**Key Columns**:
- `state_code` (TEXT)
- `year` (INTEGER)
- `msn` (TEXT) - Mnemonic series names (e.g., `TETCB` = total energy consumption)
- `value` (DOUBLE PRECISION) - Energy in trillion BTU
- `unit` (TEXT)
- `description` (TEXT) - Human-readable MSN description

**Source**: EIA State Energy Data System (SEDS)

**Notes**:
- Annual aggregates by state
- Use for state-level energy context analysis

---

### Connectivity Infrastructure

#### `internet_cables`
**Rows**: 693
**Purpose**: Submarine cable routes

**Key Columns**:
- `cable_id` (TEXT) - Unique cable identifier
- `cable_name` (TEXT) - Official cable name
- `geom` (GEOMETRY) - LineString geometry (EPSG:4326)
- `rfs_year` (INTEGER) - Ready For Service year
- `length_km` (DOUBLE PRECISION)
- `owners` (TEXT[]) - Array of owner names
- `landing_points` (TEXT[]) - Array of landing point names

**Source**: TeleGeography-style cable database

**Notes**:
- 693 unique submarine cables
- Geometry is approximate route (not exact seabed path)

#### `internet_cable_landing_points`
**Rows**: 3,361
**Purpose**: Cable landing points (where cables come ashore)

**Key Columns**:
- `landing_point_id` (TEXT) - Unique identifier
- `name` (TEXT) - Landing point name
- `city`, `country`
- `latitude`, `longitude`, `geom`
- `cables` (TEXT[]) - Array of cable names landing at this point
- `cable_count` (INTEGER)

**Source**: TeleGeography-style cable database

**Notes**:
- Used for proximity analysis (how close are data centers to cable landings?)
- **Key finding**: Data centers are NOT systematically closer to cables than ordinary US cities

#### `internet_city_dominance`
**Rows**: 4,552
**Purpose**: City-level IPs/capacity (internet hub strength proxy)

**Key Columns**:
- `city` (TEXT)
- `country` (TEXT)
- `latitude`, `longitude`, `geom`
- `ip_addresses` (INTEGER) - Number of routable IP addresses
- `capacity_rank` (INTEGER) - Relative capacity ranking

**Source**: Internet topology datasets

**Notes**: Proxy for "internet hub" strength (not directly used in main analyses)

---

### Broadband

#### `fcc_bdc_location_provider_aggregates`
**Rows**: Varies
**Purpose**: FCC BDC provider availability aggregated by county/tract

**Key Columns**:
- `geoid` (TEXT) - County or tract GEOID
- `geography_level` (TEXT) - `county` or `tract`
- `provider_count` (INTEGER)
- `technology_counts` (JSONB) - Count by technology type
- `max_download_mbps`, `max_upload_mbps`

**Source**: FCC Broadband Data Collection (BDC)

#### `fcc_bdc_broadband_connection_table`
**Rows**: Varies
**Purpose**: Per-data-center broadband provider availability

**Key Columns**:
- Data center identifiers
- `provider_id`, `provider_name`
- `technology` (TEXT)
- `max_advertised_download_speed`, `max_advertised_upload_speed`
- `low_latency` (BOOLEAN)

**Source**: FCC BDC, joined to data center locations

**Notes**: Built by `build_fcc_bdc_broadband_connection_table.py`

---

### Other Tables

#### `opposition_cases_geocoded`
**Rows**: 18
**Purpose**: Geocoded community-opposition cases against data center builds

**Key Columns**:
- `case_id` (TEXT) - Unique identifier
- `developer` (TEXT) - Proposed developer/operator
- `investment_billions` (DOUBLE PRECISION) - Investment amount in billions
- `outcome` (TEXT) - Case outcome (approved, rejected, pending)
- `governance_response` (TEXT) - Government response
- `latitude`, `longitude`, `geom`

**Source**: Compiled from news archives

**Notes**: Loaded but currently unused - see research-ideas.md for proposed analyses

#### `census_tract_huc8_link`
**Rows**: 806
**Purpose**: Tract↔HUC8 spatial overlap table

**Key Columns**:
- `geoid` (TEXT) - Census tract GEOID
- `huc8` (TEXT) - HUC8 watershed code
- `overlap_pct` (DOUBLE PRECISION) - Percentage of tract overlapping watershed

**Notes**: Useful for downstream tract-level water-stress joins; limited to tracts containing data centers

#### `im3_state_projected_moderate_50`
**Rows**: 328
**Purpose**: PNNL IM3 projected data center siting (moderate growth, gravity weight 0.50)

**Key Columns**:
- `facility_id` (TEXT)
- `state` (TEXT)
- `cost_millions` (DOUBLE PRECISION)
- `it_mw` (DOUBLE PRECISION) - IT load in megawatts
- `cooling_water_demand_gal_per_day` (DOUBLE PRECISION)
- `latitude`, `longitude`, `geom`

**Source**: PNNL Integrated Multisector Multiscale Modeling (IM3)

**Notes**: Loaded but unused - potential for forward-projection analysis

#### `im3_projected_state_demand_summary`
**Rows**: 31
**Purpose**: State-level rollup of IM3 projected facility counts, IT MW, and cooling demand

**Key Columns**:
- `state` (TEXT)
- `facility_count` (INTEGER)
- `total_it_mw` (DOUBLE PRECISION)
- `total_cooling_demand_mgd` (DOUBLE PRECISION) - Million gallons per day

**Source**: IM3 model outputs

#### `utility_rate_tracker_2025_2028`
**Rows**: 374
**Purpose**: Utility rate-increase tracker by provider × state × service type

**Key Columns**:
- `provider` (TEXT) - Utility provider name
- `state` (TEXT)
- `service_type` (TEXT)
- `effective_date` (DATE)
- `monthly_increase_dollars` (DOUBLE PRECISION)
- `percent_increase` (DOUBLE PRECISION)

**Source**: Utility rate tracker database

**Notes**: Loaded but unused in demographic/energy analysis

#### `energy_atlas_layers_catalog`
**Rows**: ~5
**Purpose**: Metadata catalog of EIA layers ingested

**Key Columns**:
- `table_name` (TEXT)
- `source_url` (TEXT)
- `import_timestamp` (TIMESTAMP)
- `row_count` (INTEGER)

**Notes**: Created by `ingest_eia_energy_layers.py`

---

## Legislation Tables

Populated by `ingest_legiscan.py` using the [LegiScan API](https://legiscan.com/legiscan).
Covers all 50 states + DC + US Congress, sessions from 2016 through 2026.
Data licensed [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/); attribute LegiScan LLC.

### `legiscan_sessions`
**Rows**: 646
**Purpose**: One row per legislative session dataset downloaded from LegiScan

**Key Columns**:
- `session_id` (INTEGER) - LegiScan session ID (PRIMARY KEY)
- `state_abbr` (VARCHAR) - Two-letter state code (`CA`, `TX`, `US`, etc.)
- `state_id` (INTEGER) - LegiScan numeric state ID
- `year_start`, `year_end` (INTEGER) - Session year range
- `session_title` (TEXT) - Full session name
- `session_tag` (TEXT) - Short tag (e.g., "Regular Session", "1st Special Session")
- `is_special` (BOOLEAN) - True for special/extraordinary sessions
- `is_prior` (BOOLEAN) - True for completed/sine-die sessions
- `dataset_hash` (VARCHAR) - MD5 of dataset ZIP; used to detect updates
- `dataset_date` (DATE) - Date dataset was last published by LegiScan
- `dataset_size_mb` (FLOAT) - Compressed ZIP size
- `bill_count` (INTEGER) - Number of bills loaded from this session
- `imported_at` (TIMESTAMPTZ) - When this session was last imported

### `legiscan_bills`
**Rows**: ~1,313,000
**Purpose**: All bills from all sessions; tagged for relevance to data center research topics

**Key Columns**:
- `bill_id` (INTEGER) - LegiScan bill ID (PRIMARY KEY)
- `session_id` (INTEGER) - FK → `legiscan_sessions`
- `state` (VARCHAR) - Two-letter state code
- `bill_number` (VARCHAR) - Bill number (e.g., `SB 1000`, `HB 233`)
- `bill_type` (VARCHAR) - `B`=Bill, `R`=Resolution, `CR`=Concurrent Resolution, etc.
- `title` (TEXT) - Short title
- `description` (TEXT) - Longer description
- `status` (INTEGER) - Current status code (see below)
- `status_date` (DATE) - Date of last status change
- `completed` (INTEGER) - 1 if bill is in a terminal state
- `body` (VARCHAR) - Originating chamber (`H`=House, `S`=Senate, `C`=Council, etc.)
- `url` (TEXT) - LegiScan bill page URL
- `state_link` (TEXT) - Official state legislature URL
- `change_hash` (VARCHAR) - MD5 used to detect bill-level updates
- `subjects` (TEXT[]) - LegiScan subject tags (GIN indexed)
- `sponsor_count` (INTEGER) - Number of sponsors
- `vote_count` (INTEGER) - Number of recorded votes
- `text_count` (INTEGER) - Number of bill text versions
- `is_relevant` (BOOLEAN) - True if any relevance tag matched (GIN indexed)
- `relevance_tags` (TEXT[]) - Matched topic tags (GIN indexed)
- `imported_at` (TIMESTAMPTZ) - When this bill was last upserted

**Status codes**: 1=Introduced, 2=Engrossed, 3=Enrolled, 4=Passed, 5=Vetoed, 6=Failed, 7=Override, 8=Chaptered, 9=Referred, 12=Draft
|
||||
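As a minimal sketch (not part of the ingestion scripts), the status codes above can be kept in a small Python lookup so that query results read as labels rather than integers. The names `STATUS_LABELS`, `ENACTED_STATUSES`, `status_label`, and `is_enacted` are hypothetical; treating codes 4 and 8 as "enacted" follows the convention used in `query_legiscan_bills.sql`.

```python
# Status codes exactly as listed in this document.
STATUS_LABELS = {
    1: "Introduced", 2: "Engrossed", 3: "Enrolled", 4: "Passed", 5: "Vetoed",
    6: "Failed", 7: "Override", 8: "Chaptered", 9: "Referred", 12: "Draft",
}

# Assumption: "enacted" means Passed (4) or Chaptered (8).
ENACTED_STATUSES = frozenset({4, 8})


def status_label(code):
    """Return a readable label for a status code, flagging unknown codes."""
    return STATUS_LABELS.get(code, f"Unknown ({code})")


def is_enacted(code):
    """True when the status code counts as enacted under the assumption above."""
    return code in ENACTED_STATUSES
```

Keeping the enacted set in one place avoids repeating `IN (4, 8)` with slightly different meanings across notebooks.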
|
||||
**Relevance tags** (keyword-matched against title + description + subjects):
|
||||
|
||||
| Tag | What it captures |
|
||||
|-----|-----------------|
|
||||
| `data_center` | Data centers, hyperscale, colocation, AI campuses, HPC facilities |
|
||||
| `large_load` | Crypto mining, large industrial loads, extraordinary load |
|
||||
| `ratepayer_protection` | Cost shifting, cross-subsidy, rate design, affordability, rate class |
|
||||
| `grid_impact` | Grid reliability, transmission, interconnection queue, IRP |
|
||||
| `tax_incentive` | Tax exemptions, abatements, credits for facilities |
|
||||
| `energy_policy` | Renewable PPAs, green tariffs, clean electricity, decarbonization |
|
||||
| `water_use` | Cooling water, evaporative cooling, water footprint |
|
||||
| `siting_permitting` | Zoning, conditional use permits, local control, preemption |
|
||||
|
||||
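The keyword matching described above could look roughly like the following sketch. The keyword lists here are illustrative placeholders taken from the table's descriptions, not the actual lists in `ingest_legiscan.py`, and the names `KEYWORDS` and `tag_bill` are hypothetical.

```python
import re

# Placeholder keyword lists (assumption); the real lists live in ingest_legiscan.py.
KEYWORDS = {
    "data_center": ["data center", "hyperscale", "colocation"],
    "water_use": ["cooling water", "evaporative cooling", "water footprint"],
}


def tag_bill(title, description, subjects):
    """Return sorted relevance tags whose keywords appear in title, description, or subjects."""
    text = " ".join([title or "", description or "", *(subjects or [])]).lower()
    tags = []
    for tag, keywords in KEYWORDS.items():
        # Leading word boundary only, so plurals such as "data centers" still match.
        if any(re.search(r"\b" + re.escape(kw), text) for kw in keywords):
            tags.append(tag)
    return sorted(tags)
```

Under this sketch, a bill would count as relevant whenever `tag_bill(...)` returns a non-empty list, which mirrors how `is_relevant` is described.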
**Notes**:
|
||||
- ~60,000 relevant bills out of 1.3M total (~4.6%)
|
||||
- `data_center` tag: ~2,182 bills; `ratepayer_protection`: ~49,000
|
||||
- GIN indexes on `subjects`, `relevance_tags`, and full-text (`title || description`)
|
||||
- Use `query_legiscan_bills.sql` for pre-built research queries
|
||||
- Re-run `python ingest_legiscan.py --fetch --load` weekly to pick up dataset updates
|
||||
- Re-run `python ingest_legiscan.py --tag` after editing keyword lists
|
||||
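The weekly refresh relies on `dataset_hash` to skip sessions whose dataset has not changed. A minimal sketch of that check is shown below, assuming the stored hashes are available as a dictionary keyed by `session_id`; the function names `md5_hex` and `needs_import` are hypothetical and the real logic in `ingest_legiscan.py` may differ.

```python
import hashlib


def md5_hex(payload: bytes) -> str:
    """Hex MD5 digest of a downloaded dataset archive."""
    return hashlib.md5(payload).hexdigest()


def needs_import(session_id: int, published_hash: str, stored_hashes: dict) -> bool:
    """True when the session is new or its published hash differs from the stored one."""
    return stored_hashes.get(session_id) != published_hash
```

A session missing from `stored_hashes` is treated as new, so first-time loads and updates go through the same path.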
|
||||
---
|
||||
|
||||
## Commonly Used Joins
|
||||
|
||||
### Data Center to Demographics
|
||||
```sql
|
||||
SELECT
|
||||
dc.*,
|
||||
ct.median_household_income,
|
||||
ct.bachelors_or_higher_pct,
|
||||
ct.broadband_pct
|
||||
FROM master_data_centers dc
|
||||
JOIN data_center_census_tracts_2024 ct
|
||||
ON dc.id = ct.id;
|
||||
```
|
||||
|
||||
### Data Center to Watershed
|
||||
```sql
|
||||
SELECT
|
||||
dc.*,
|
||||
w.huc8,
|
||||
w.watershed_name
|
||||
FROM master_data_centers dc
|
||||
JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
|
||||
JOIN watershed_huc8 w ON dw.huc8 = w.huc8;
|
||||
```
|
||||
|
||||
### Data Center to Energy Infrastructure (50 km radius)
|
||||
```sql
|
||||
SELECT
|
||||
dc.id,
|
||||
dc.name,
|
||||
SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
|
||||
FROM master_data_centers dc
|
||||
JOIN energy_eia_operating_generator_capacity_flat eg
|
||||
ON ST_DWithin(
|
||||
dc.geom::geography,
|
||||
eg.geom::geography,
|
||||
50000 -- 50 km in meters
|
||||
)
|
||||
WHERE eg.status = 'OP' -- Operating only
|
||||
GROUP BY dc.id, dc.name;
|
||||
```
|
||||
|
||||
### Data Center to FEMA Hazard Risk
|
||||
```sql
|
||||
SELECT
|
||||
dc.*,
|
||||
nri.risk_score,
|
||||
nri.wildfire_risk,
|
||||
nri.drought_risk,
|
||||
nri.heat_wave_risk
|
||||
FROM master_data_centers dc
|
||||
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
|
||||
JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Table Naming Conventions
|
||||
|
||||
- **`master_*`** - Canonical, deduplicated tables (use these for analysis)
|
||||
- **`data_center_*`** - Data center-specific enrichment tables
|
||||
- **`_dc_*`** - Base layers scoped to data center states (underscore prefix = private/internal)
|
||||
- **`energy_eia_*`** - EIA energy data
|
||||
- **`internet_*`** - Connectivity infrastructure
|
||||
- **`fcc_bdc_*`** - FCC Broadband Data Collection
|
||||
|
||||
---
|
||||
|
||||
## Indexes and Performance
|
||||
|
||||
All tables have spatial indexes on `geom` columns for fast spatial joins:
|
||||
```sql
|
||||
CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);
|
||||
```
|
||||
|
||||
Key `geoid` columns are indexed for fast demographic joins:
|
||||
```sql
|
||||
CREATE INDEX idx_tablename_geoid ON tablename(geoid);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Maintenance Notes
|
||||
|
||||
### Updating Data Centers
|
||||
1. Run `load_postgis_osm_data_centers.py` to refresh OSM data
|
||||
2. Run `build_master_data_centers.py` to rebuild master table
|
||||
3. Run enrichment scripts to update joins
|
||||
|
||||
### Updating Demographics
|
||||
1. Update `_dc_census_tract_acs_2024` from Census API
|
||||
2. Run `create_data_center_census_tract_table.py --replace-final`
|
||||
|
||||
### Updating Energy Data
|
||||
```bash
|
||||
python3 ingest_eia_energy_layers.py --category power --update
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Schema Export
|
||||
|
||||
To export the full schema:
|
||||
```bash
|
||||
pg_dump -h $PGWEB_HOST -U $PGWEB_USER -d data_centers --schema-only > schema.sql
|
||||
```
|
||||
|
||||
To list all tables:
|
||||
```sql
|
||||
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size
|
||||
FROM pg_tables
|
||||
WHERE schemaname = 'public'
|
||||
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Contact
|
||||
|
||||
For database access or questions, contact the repository owner.
|
||||
217
docs/query_legiscan_bills.sql
Normal file
@@ -0,0 +1,217 @@
|
||||
-- ============================================================
|
||||
-- LegiScan Legislative Analysis Queries
|
||||
-- Database: data_centers Schema: public
|
||||
-- ============================================================
|
||||
--
|
||||
-- SETUP
|
||||
-- Populate the database first:
|
||||
-- python ingest_legiscan.py --all
|
||||
-- This downloads ~646 sessions (2016-2026, all US states + federal),
|
||||
-- loads ~1.3M bills, and tags ~60K as relevant.
|
||||
--
|
||||
-- To refresh (weekly dataset updates from LegiScan):
|
||||
-- python ingest_legiscan.py --fetch --load
|
||||
-- Already-imported sessions with unchanged dataset_hash are skipped.
|
||||
--
|
||||
-- To retag after editing keyword lists in ingest_legiscan.py:
|
||||
-- python ingest_legiscan.py --tag
|
||||
--
|
||||
-- RELEVANCE TAGS (stored in legiscan_bills.relevance_tags[]):
|
||||
-- data_center - Bills naming data centers, hyperscale, colocation, AI campuses
|
||||
-- large_load - Crypto mining, large industrial loads, extraordinary load
|
||||
-- ratepayer_protection- Cost shifting, cross-subsidy, rate design, affordability
|
||||
-- grid_impact - Grid reliability, transmission, interconnection queue
|
||||
-- tax_incentive - Tax exemptions/abatements/credits for facilities
|
||||
-- energy_policy - Renewable PPAs, green tariffs, clean electricity
|
||||
-- water_use - Cooling water, evaporative cooling, water footprint
|
||||
-- siting_permitting - Zoning, conditional use permits, local control
|
||||
--
|
||||
-- STATUS CODES (legiscan_bills.status):
|
||||
-- 1=Introduced 2=Engrossed 3=Enrolled 4=Passed 5=Vetoed
|
||||
-- 6=Failed 7=Override 8=Chaptered 9=Referred 12=Draft
|
||||
-- ============================================================
|
||||
|
||||
-- ── Quick overview ──────────────────────────────────────────
|
||||
|
||||
SELECT
|
||||
COUNT(*) AS total_bills,
|
||||
COUNT(*) FILTER (WHERE is_relevant) AS relevant_bills,
|
||||
COUNT(DISTINCT state) AS states,
|
||||
MIN(ls.year_start) AS year_from,
|
||||
MAX(ls.year_end) AS year_to
|
||||
FROM legiscan_bills lb
|
||||
JOIN legiscan_sessions ls USING (session_id);
|
||||
|
||||
-- ── Bills per relevance tag ─────────────────────────────────
|
||||
|
||||
SELECT
|
||||
tag,
|
||||
COUNT(*) AS bill_count,
|
||||
COUNT(*) FILTER (WHERE lb.status = 4) AS passed,
|
||||
COUNT(*) FILTER (WHERE lb.status IN (4,8)) AS enacted
|
||||
FROM legiscan_bills lb, unnest(relevance_tags) AS tag
|
||||
GROUP BY tag
|
||||
ORDER BY bill_count DESC;
|
||||
|
||||
-- ── Top states for relevant legislation ────────────────────
|
||||
|
||||
SELECT
|
||||
state,
|
||||
COUNT(*) AS relevant_bills,
|
||||
COUNT(*) FILTER (WHERE 'data_center' = ANY(relevance_tags)) AS data_center,
|
||||
COUNT(*) FILTER (WHERE 'large_load' = ANY(relevance_tags)) AS large_load,
|
||||
COUNT(*) FILTER (WHERE 'ratepayer_protection' = ANY(relevance_tags)) AS ratepayer,
|
||||
COUNT(*) FILTER (WHERE 'tax_incentive' = ANY(relevance_tags)) AS tax_incentive,
|
||||
COUNT(*) FILTER (WHERE 'grid_impact' = ANY(relevance_tags)) AS grid_impact
|
||||
FROM legiscan_bills
|
||||
WHERE is_relevant
|
||||
GROUP BY state
|
||||
ORDER BY relevant_bills DESC
|
||||
LIMIT 20;
|
||||
|
||||
-- ── Trend by year ───────────────────────────────────────────
|
||||
|
||||
SELECT
|
||||
ls.year_start AS year,
|
||||
COUNT(lb.bill_id) AS total_bills,
|
||||
COUNT(lb.bill_id) FILTER (WHERE lb.is_relevant) AS relevant_bills,
|
||||
COUNT(lb.bill_id) FILTER (WHERE lb.is_relevant AND lb.status IN (4,8)) AS enacted,
|
||||
ROUND(100.0 * COUNT(lb.bill_id) FILTER (WHERE lb.is_relevant)
|
||||
/ NULLIF(COUNT(lb.bill_id), 0), 1) AS pct_relevant
|
||||
FROM legiscan_bills lb
|
||||
JOIN legiscan_sessions ls USING (session_id)
|
||||
GROUP BY ls.year_start
|
||||
ORDER BY ls.year_start;
|
||||
|
||||
-- ── Data center bills specifically ─────────────────────────
|
||||
|
||||
SELECT
|
||||
lb.state,
|
||||
lb.bill_number,
|
||||
ls.year_start AS year,
|
||||
lb.status,
|
||||
lb.title,
|
||||
lb.relevance_tags,
|
||||
lb.url
|
||||
FROM legiscan_bills lb
|
||||
JOIN legiscan_sessions ls USING (session_id)
|
||||
WHERE 'data_center' = ANY(lb.relevance_tags)
|
||||
ORDER BY
|
||||
CASE lb.status WHEN 4 THEN 0 WHEN 8 THEN 1 WHEN 3 THEN 2 ELSE 3 END,
|
||||
ls.year_start DESC,
|
||||
lb.state;
|
||||
|
||||
-- ── Ratepayer protection bills ──────────────────────────────
|
||||
|
||||
SELECT
|
||||
lb.state,
|
||||
lb.bill_number,
|
||||
ls.year_start AS year,
|
||||
lb.status,
|
||||
lb.title,
|
||||
lb.relevance_tags,
|
||||
lb.url
|
||||
FROM legiscan_bills lb
|
||||
JOIN legiscan_sessions ls USING (session_id)
|
||||
WHERE 'ratepayer_protection' = ANY(lb.relevance_tags)
|
||||
ORDER BY
|
||||
CASE lb.status WHEN 4 THEN 0 WHEN 8 THEN 1 WHEN 3 THEN 2 ELSE 3 END,
|
||||
ls.year_start DESC,
|
||||
lb.state;
|
||||
|
||||
-- ── Bills at intersection of data center + ratepayer ───────
|
||||
|
||||
SELECT
|
||||
lb.state,
|
||||
lb.bill_number,
|
||||
ls.year_start AS year,
|
||||
lb.status,
|
||||
lb.title,
|
||||
lb.relevance_tags,
|
||||
lb.url
|
||||
FROM legiscan_bills lb
|
||||
JOIN legiscan_sessions ls USING (session_id)
|
||||
WHERE 'data_center' = ANY(lb.relevance_tags)
|
||||
AND 'ratepayer_protection' = ANY(lb.relevance_tags)
|
||||
ORDER BY ls.year_start DESC, lb.state;
|
||||
|
||||
-- ── Large load + grid impact ────────────────────────────────
|
||||
|
||||
SELECT
|
||||
lb.state,
|
||||
lb.bill_number,
|
||||
ls.year_start AS year,
|
||||
lb.status,
|
||||
lb.title,
|
||||
lb.relevance_tags,
|
||||
lb.url
|
||||
FROM legiscan_bills lb
|
||||
JOIN legiscan_sessions ls USING (session_id)
|
||||
WHERE 'large_load' = ANY(lb.relevance_tags)
|
||||
AND 'grid_impact' = ANY(lb.relevance_tags)
|
||||
ORDER BY ls.year_start DESC, lb.state;
|
||||
|
||||
-- ── Tax incentive bills passed/enacted ─────────────────────
|
||||
|
||||
SELECT
|
||||
lb.state,
|
||||
lb.bill_number,
|
||||
ls.year_start AS year,
|
||||
lb.status,
|
||||
lb.title,
|
||||
lb.url
|
||||
FROM legiscan_bills lb
|
||||
JOIN legiscan_sessions ls USING (session_id)
|
||||
WHERE 'tax_incentive' = ANY(lb.relevance_tags)
|
||||
AND lb.status IN (4, 8) -- Passed or Chaptered
|
||||
ORDER BY ls.year_start DESC, lb.state;
|
||||
|
||||
-- ── Join to data centers: states with both DCs and active legislation ──
|
||||
|
||||
SELECT
|
||||
dc.state,
|
||||
COUNT(DISTINCT dc.id) AS data_centers,
|
||||
COUNT(DISTINCT lb.bill_id) AS relevant_bills,
|
||||
COUNT(DISTINCT lb.bill_id)
|
||||
FILTER (WHERE 'ratepayer_protection' = ANY(lb.relevance_tags)) AS ratepayer_bills,
|
||||
COUNT(DISTINCT lb.bill_id)
|
||||
FILTER (WHERE 'data_center' = ANY(lb.relevance_tags)) AS dc_specific_bills,
|
||||
COUNT(DISTINCT lb.bill_id)
|
||||
FILTER (WHERE lb.status IN (4,8)) AS enacted_bills
|
||||
FROM master_data_centers dc
|
||||
LEFT JOIN legiscan_bills lb ON dc.state = lb.state AND lb.is_relevant
|
||||
GROUP BY dc.state
|
||||
ORDER BY relevant_bills DESC;
|
||||
|
||||
-- ── Full-text search: find bills mentioning specific terms ──
|
||||
-- Replace 'hyperscale' with any keyword of interest
|
||||
|
||||
SELECT
|
||||
lb.state,
|
||||
lb.bill_number,
|
||||
ls.year_start AS year,
|
||||
lb.status,
|
||||
lb.title,
|
||||
lb.description,
|
||||
lb.url
|
||||
FROM legiscan_bills lb
|
||||
JOIN legiscan_sessions ls USING (session_id)
|
||||
-- to_tsquery requires explicit operators; <-> matches the phrase "data center"
WHERE to_tsvector('english', COALESCE(lb.title,'') || ' ' || COALESCE(lb.description,''))
@@ to_tsquery('english', 'hyperscale | colocation | data <-> center')
ORDER BY ts_rank(
to_tsvector('english', COALESCE(lb.title,'') || ' ' || COALESCE(lb.description,'')),
to_tsquery('english', 'hyperscale | colocation | data <-> center')
) DESC
|
||||
LIMIT 50;
|
||||
|
||||
-- ── Session coverage check ──────────────────────────────────
|
||||
|
||||
SELECT
|
||||
state_abbr,
|
||||
COUNT(*) AS sessions_loaded,
|
||||
SUM(bill_count) AS total_bills,
|
||||
MIN(year_start) AS earliest,
|
||||
MAX(year_end) AS latest
|
||||
FROM legiscan_sessions
|
||||
GROUP BY state_abbr
|
||||
ORDER BY state_abbr;
|
||||
623
docs/research-ideas.md
Normal file
@@ -0,0 +1,623 @@
|
||||
# Research Ideas & Future Work
|
||||
|
||||
This document outlines potential research directions, data improvements, and analyses that could extend the current US Data Centers geospatial research infrastructure.
|
||||
|
||||
## Priority Next Steps
|
||||
|
||||
### 1. Backfill Power Capacity Data
|
||||
**Status**: Only 5.9% of facilities have `power_mw` values (108/1,833)
|
||||
|
||||
**Approach**:
|
||||
- Scrape Baxtel data center database (requires paid subscription)
|
||||
- Use Data Center Map API/scraping
|
||||
- Cross-reference with utility interconnection queue filings
|
||||
- FOIA requests to state utility commissions for large loads
|
||||
|
||||
**Research Impact**:
|
||||
- Enable capacity-weighted concentration metrics (current analyses are facility-count only)
|
||||
- Correlate power capacity with demographic/environmental variables
|
||||
- Identify "hyperscale" facilities (>100 MW) vs. edge/enterprise (<10 MW)
|
||||
|
||||
**Implementation**:
|
||||
```python
|
||||
# Add capacity-weighted HHI calculation to analyze_dc_tract_concentration.py
|
||||
total_mw = sum(tract_capacities)  # tract_capacities: total MW per tract (placeholder name)
capacity_weighted_hhi = sum((mw_i / total_mw) ** 2 for mw_i in tract_capacities)
|
||||
```
|
||||
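Because most facilities currently lack `power_mw`, a capacity-weighted HHI has to tolerate missing values. The sketch below is one way to do that; it assumes per-tract capacities arrive as a list that may contain `None` or zero, and the function name is hypothetical rather than something already in `analyze_dc_tract_concentration.py`.

```python
def capacity_weighted_hhi(tract_capacities):
    """Herfindahl-Hirschman index over tract capacity shares (0 to 1 scale).

    Missing (None) and non-positive capacities are ignored. Returns None when
    no usable capacity remains, so callers can tell "no data" from "no concentration".
    """
    caps = [mw for mw in tract_capacities if mw is not None and mw > 0]
    total_mw = sum(caps)
    if total_mw == 0:
        return None
    return sum((mw / total_mw) ** 2 for mw in caps)
```

A value of 1.0 means all known capacity sits in a single tract, while `n` equal tracts give `1 / n`.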
|
||||
---
|
||||
|
||||
### 2. Operator Name Deduplication
|
||||
**Status**: String fragmentation inflates operator counts ("Meta" vs. "Meta, Inc.", AWS variants)
|
||||
|
||||
**Approach**:
|
||||
- Create `operator_mapping` table with canonical names
|
||||
- Use fuzzy matching (e.g., `fuzzywuzzy` library) to standardize
|
||||
- Add `operator_canonical` column to `master_data_centers`
|
||||
|
||||
**Research Impact**:
|
||||
- Accurate hyperscaler market share analysis
|
||||
- Study operator-specific siting strategies (AWS hydro, Microsoft nuclear, Meta solar)
|
||||
- Enable "operator power" political economy analyses
|
||||
|
||||
**Script**:
|
||||
```python
# operators_dedupe.py
import os

import pandas as pd
import psycopg2
from fuzzywuzzy import process

# Connection settings are an assumption; they reuse the PGWEB_* variables
# from the schema export command in the database docs.
conn = psycopg2.connect(
    host=os.environ["PGWEB_HOST"],
    user=os.environ["PGWEB_USER"],
    dbname="data_centers",
)

# Load unique operators
operators = pd.read_sql("SELECT DISTINCT operator FROM master_data_centers", conn)

# Manual + fuzzy matching to canonical names
canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
    # ... etc.
}
```
|
||||
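The mapping step itself can be sketched without a database. The example below swaps in the standard library's `difflib` in place of `fuzzywuzzy` so that it runs without extra installs; the 0.85 cutoff and the names `CANONICAL_MAP`, `ALIAS_TO_CANONICAL`, and `canonical_operator` are assumptions for illustration.

```python
import difflib

# Same example aliases as in the script above.
CANONICAL_MAP = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
}

# Invert to a lowercase alias -> canonical name lookup.
ALIAS_TO_CANONICAL = {
    alias.lower(): canonical
    for canonical, aliases in CANONICAL_MAP.items()
    for alias in aliases
}


def canonical_operator(raw, cutoff=0.85):
    """Map a raw operator string to its canonical name.

    Tries an exact (case-insensitive) alias match first, then the closest
    alias by difflib similarity. Unmatched names are returned unchanged.
    """
    if not raw:
        return None
    key = raw.strip().lower()
    if key in ALIAS_TO_CANONICAL:
        return ALIAS_TO_CANONICAL[key]
    close = difflib.get_close_matches(key, list(ALIAS_TO_CANONICAL), n=1, cutoff=cutoff)
    return ALIAS_TO_CANONICAL[close[0]] if close else raw.strip()
```

Returning unmatched names unchanged keeps the step safe to re-run: operators outside the map are left for manual review instead of being forced into a wrong bucket.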
|
||||
---
|
||||
|
||||
### 3. Water Stress Overlay
|
||||
**Status**: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities
|
||||
|
||||
**Priority**: HIGH - Critical for environmental impact analysis
|
||||
|
||||
**Approach**:
|
||||
- Join to USGS WaterWatch streamflow data
|
||||
- Add USGS Drought Watch indicators by HUC8
|
||||
- Correlate data center density with:
|
||||
- Groundwater depletion rates
|
||||
- Surface water withdrawal permits
|
||||
- Drought frequency/severity (USDM historical data)
|
||||
|
||||
**Key Watersheds for Focus**:
|
||||
- **Middle Potomac-Catoctin** (HUC8 02070008): 235 DCs (12.8% of US total) - Loudoun/Ashburn
|
||||
- **Middle Potomac-Anacostia-Occoquan** (02070010): 111 DCs - Fairfax/inner Loudoun
|
||||
- **Coyote** (18050003): 88 DCs - Silicon Valley
|
||||
- **Upper Scioto** (05060001): 73 DCs - Columbus OH
|
||||
- **Umatilla** (17070103): 29 DCs - AWS-exclusive watershed
|
||||
|
||||
**Research Questions**:
|
||||
- Are data centers sited in water-stressed watersheds?
|
||||
- Do high-density clusters (Loudoun County, Columbus OH) face water constraints?
|
||||
- Compare water stress in hyperscaler non-metro sites (Columbia River corridor) vs. metro clusters
|
||||
- Does single-operator watershed capture (Umatilla = AWS only) correlate with water availability?
|
||||
|
||||
**Tables to Create**:
|
||||
- `watershed_water_stress` - HUC8-level water stress indicators
|
||||
- `data_center_water_risk` - Per-facility water-stress exposure
|
||||
|
||||
**Notebook**: `water_stress_analysis.ipynb`
|
||||
|
||||
---
|
||||
|
||||
### 4. Opposition Cases Overlay
|
||||
**Status**: 18 geocoded opposition cases in `opposition_cases_geocoded` table (loaded but unused)
|
||||
|
||||
**Approach**:
|
||||
- Expand dataset: Compile additional cases of rejected/delayed data center proposals from news archives
|
||||
- Geocode all opposition cases, join to demographics/hazards
|
||||
- Test hypotheses:
|
||||
- Do wealthier/more educated communities successfully block projects?
|
||||
- Are opposition cases more common in water-stressed or drought-prone areas?
|
||||
- Do smaller non-metro communities have less bargaining power?
|
||||
- Does clustered vs. isolated location predict opposition likelihood?
|
||||
|
||||
**Research Questions**:
|
||||
- What predicts opposition success?
|
||||
- Are opposition cases spatially clustered?
|
||||
- Do demographics differ between accepted vs. rejected sites?
|
||||
- Correlation with FEMA hazard exposure scores?
|
||||
|
||||
**Analysis Plan**:
|
||||
```sql
|
||||
-- Join opposition cases to demographics
|
||||
SELECT o.*, ct.median_household_income, ct.bachelors_or_higher_pct
|
||||
FROM opposition_cases_geocoded o
|
||||
JOIN _dc_census_tract_acs_2024 ct
|
||||
ON ST_Contains(ct.geom, o.geom);
|
||||
```
|
||||
|
||||
**Output**: `opposition_cases_analysis.md`
|
||||
|
||||
---
|
||||
|
||||
### 5. IM3 Forward Projection Integration
|
||||
**Status**: IM3 model data loaded in `im3_state_projected_moderate_50` (328 rows) and `im3_projected_state_demand_summary` (31 rows)
|
||||
|
||||
**Approach**:
|
||||
- Load IM3 projected demand scenarios (2030, 2040, 2050)
|
||||
- Overlay projected growth with:
|
||||
- Current grid saturation (% of generation within 50 km)
|
||||
- Water stress indicators
|
||||
- Land availability (zoned industrial parcels)
|
||||
- Identify regions where projected demand may exceed infrastructure capacity
|
||||
|
||||
**Grid Saturation Context** (from current analysis):
|
||||
- **New Jersey**: 83% of grid within 50 km of DC
|
||||
- **Nevada**: 75%
|
||||
- **Tennessee**: 70%
|
||||
- **Oregon**: 68%
|
||||
- **Arizona**: 56%
|
||||
- **Virginia**: 50%
|
||||
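One reading of the saturation percentages above is the share of a state's generating capacity located within 50 km of at least one data center. The sketch below computes that share with a great-circle distance; it is an assumed definition for illustration, not necessarily the exact method behind the listed figures, and the function names are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def pct_capacity_near_dc(generators, data_centers, radius_km=50.0):
    """Percent of generator capacity within radius_km of any data center.

    generators: iterable of (lat, lon, capacity_mw)
    data_centers: iterable of (lat, lon)
    Returns None when total capacity is zero.
    """
    dcs = list(data_centers)
    total = near = 0.0
    for lat, lon, mw in generators:
        total += mw
        if any(haversine_km(lat, lon, dlat, dlon) <= radius_km for dlat, dlon in dcs):
            near += mw
    return None if total == 0 else 100.0 * near / total
```

For the full inventory, the PostGIS `ST_DWithin` join is the better tool; this version is mainly useful for spot-checking a single state in a notebook.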
|
||||
**Research Questions**:
|
||||
- Which states face grid saturation from data center growth?
|
||||
- Are projected sites in water-stressed watersheds?
|
||||
- Does IM3 assume spatial distribution patterns consistent with current siting?
|
||||
- Can states with >50% grid saturation accommodate projected demand?
|
||||
|
||||
**Implementation**:
|
||||
```sql
|
||||
-- Compare current saturation to IM3 projected demand
|
||||
SELECT
|
||||
current.state,
|
||||
current.dc_count,
|
||||
current.pct_grid_saturated,
|
||||
proj.facility_count AS projected_new_facilities,
|
||||
proj.total_it_mw AS projected_new_mw
|
||||
FROM state_grid_saturation current
|
||||
JOIN im3_projected_state_demand_summary proj ON current.state = proj.state
|
||||
WHERE current.pct_grid_saturated > 50
|
||||
ORDER BY current.pct_grid_saturated DESC;
|
||||
```
|
||||
|
||||
**Notebook**: `im3_projection_overlay.ipynb`
|
||||
|
||||
---
|
||||
|
||||
## Methodological Extensions
|
||||
|
||||
### 6. Time-Series Analysis of Cluster Growth
|
||||
**Approach**:
|
||||
- Use `rfs_year` (ready for service) from cable data and EIA generator vintage
|
||||
- Reconstruct data center siting over time (requires RFS dates for facilities)
|
||||
- Animate cluster formation in interactive map
|
||||
|
||||
**Research Questions**:
|
||||
- Did Ashburn VA become dominant before or after major cable landings?
|
||||
- Do clusters grow via agglomeration (new facilities near existing) or simultaneous build-out?
|
||||
- Correlation between energy infrastructure build-out and data center growth
|
||||
|
||||
**Data Needed**:
|
||||
- Facility RFS dates (scrape from press releases, Baxtel historical data)
|
||||
- Historical tract demographics (decennial Census + ACS back to 2000)
|
||||
|
||||
---
|
||||
|
||||
### 7. Network Effects: Fiber Route Proximity
|
||||
**Status**: Current analysis tests submarine cable proximity (negative result)
|
||||
|
||||
**Approach**:
|
||||
- Obtain fiber optic backbone route GIS data (from FCC, carriers, or Infrapedia)
|
||||
- Test proximity to long-haul fiber routes (not just submarine cables)
|
||||
- Hypothesis: Data centers cluster near terrestrial long-haul fiber rather than submarine cables
|
||||
|
||||
**Data Sources**:
|
||||
- FCC Form 477 fiber deployment data
|
||||
- Infrapedia fiber route database
|
||||
- State-level fiber maps (e.g., Virginia Broadband Map)
|
||||
|
||||
**Expected Result**: Positive correlation (unlike submarine cables)
|
||||
|
||||
---
|
||||
|
||||
### 8. Land Use & Zoning Analysis
|
||||
**Approach**:
|
||||
- Join data centers to local zoning classifications (industrial, commercial, etc.)
|
||||
- Analyze land prices in data center tracts before/after facility construction
|
||||
- Correlate with property tax revenues
|
||||
|
||||
**Research Questions**:
|
||||
- Do data centers drive local property value increases?
|
||||
- Are they preferentially sited in already-zoned industrial areas?
|
||||
- Do host communities capture tax base growth?
|
||||
|
||||
**Data Sources**:
|
||||
- Zillow Home Value Index (ZHVI) by ZIP
|
||||
- ATTOM property tax assessments
|
||||
- Municipal zoning GIS layers (city-specific, requires scraping/FOIA)
|
||||
|
||||
---
|
||||
|
||||
### 9. Environmental Justice Scoring
|
||||
**Approach**:
|
||||
- Compare data center host tracts to EPA's EJScreen indices
|
||||
- Add CalEnviroScreen-style burden/benefit framework
|
||||
- Test if data centers increase cumulative environmental burdens
|
||||
|
||||
**Metrics**:
|
||||
- Air quality (PM2.5, ozone)
|
||||
- Hazardous waste proximity
|
||||
- Superfund site proximity
|
||||
- Heat island effect (LST from Landsat)
|
||||
- Noise pollution (traffic, cooling systems)
|
||||
|
||||
**Expected Challenge**: Data centers may improve local metrics (compared to heavy industry) but increase water/energy consumption
|
||||
|
||||
---
|
||||
|
||||
## Policy & Political Economy Research
|
||||
|
||||
### 10. Tax Incentive Analysis
|
||||
**Approach**:
|
||||
- Compile state/local tax incentives for data center siting (property tax abatements, sales tax exemptions)
|
||||
- Create `data_center_incentives` table with per-facility incentive details
|
||||
- Correlate incentive generosity with:
|
||||
- State fiscal health
|
||||
- Local government bargaining power
|
||||
- Facility size/operator
|
||||
|
||||
**Research Questions**:
|
||||
- Do weaker fiscal states offer larger incentives?
|
||||
- Are incentives regressive (larger for hyperscalers)?
|
||||
- Do incentives predict siting decisions (natural experiment approach)?
|
||||
|
||||
**Data Sources**:
|
||||
- Good Jobs First Subsidy Tracker
|
||||
- State economic development agency press releases
|
||||
- Local news archives
|
||||
|
||||
---
|
||||
|
||||
### 11. Employment & Labor Market Effects
|
||||
**Approach**:
|
||||
- Join to BLS Quarterly Census of Employment and Wages (QCEW) by ZIP/county
|
||||
- Identify "data center construction boom" periods (before/after major facility openings)
|
||||
- Analyze employment effects in:
|
||||
- Construction (NAICS 23)
|
||||
- Transportation/warehousing (NAICS 48-49)
|
||||
- Professional services (NAICS 54)
|
||||
|
||||
**Research Questions**:
|
||||
- Do data centers create durable local employment?
|
||||
- Are jobs filled by local residents or commuters?
|
||||
- Wage effects in host tracts?
|
||||
|
||||
**Data Sources**:
|
||||
- BLS QCEW
|
||||
- Census LEHD Origin-Destination Employment Statistics (LODES)
|
||||
|
||||
---
|
||||
|
||||
### 12. Energy Cost Pass-Through
|
||||
**Approach**:
|
||||
- Join to state-level electricity rate data (EIA, utility rate tracker)
|
||||
- Test if data center density correlates with residential rate increases
|
||||
- Natural experiment: Compare rate trajectories in high-DC vs. low-DC states
|
||||
|
||||
**Research Questions**:
|
||||
- Do data centers drive residential rate increases (capacity cost allocation)?
|
||||
- Are rate increases concentrated in utility service territories with large data center loads?
|
||||
- Do states with retail choice (deregulated markets) see different effects?
|
||||
|
||||
**Data Sources**:
|
||||
- EIA Form 861 (retail rates by state/utility)
|
||||
- Utility rate case filings (state public utility commissions)
|
||||
|
||||
---
|
||||
|
||||
## Data Quality & Infrastructure Improvements
|
||||
|
||||
### 13. Address Validation & Geocoding Refinement
|
||||
**Approach**:
|
||||
- Re-geocode the 45 facilities that were placed using the city-precision fallback
|
||||
- Use USPS address validation API
|
||||
- Cross-reference with Google Maps satellite imagery (manual review)
|
||||
|
||||
**Implementation**:
|
||||
```bash
|
||||
# Re-run geocoding with stricter thresholds
|
||||
python3 load_postgis_data_centers.py --revalidate-addresses
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 14. OSM Continuous Monitoring
|
||||
**Approach**:
|
||||
- Set up automated Overpass API queries (daily/weekly)
|
||||
- Detect new OSM data center tags
|
||||
- Alert for review + merge into `master_data_centers`
|
||||
|
||||
**Implementation**:
|
||||
- Cron job running `load_postgis_osm_data_centers.py --update-only`
|
||||
- Slack/email notification on new facilities
|
||||
|
||||
---
|
||||
|
||||
### 15. Broadband Speed Validation
|
||||
**Approach**:
|
||||
- Cross-reference FCC BDC provider data with Ookla Speedtest results
|
||||
- Test if data center host tracts have faster actual speeds (not just availability)
|
||||
|
||||
**Hypothesis**: Data center presence correlates with infrastructure investment → higher speeds
|
||||
|
||||
**Data Sources**:
|
||||
- Ookla Open Data (aggregated Speedtest results by tile)
|
||||
- FCC Measuring Broadband America
|
||||
|
||||
---
|
||||
|
||||
## Visualization & Communication
|
||||
|
||||
### 16. Interactive Story Map
|
||||
**Approach**:
|
||||
- Build Scrollama.js narrative map
|
||||
- Sections:
|
||||
1. National overview (cluster map)
|
||||
2. Ashburn VA zoom (dominance of single region)
|
||||
3. Demographics comparison (host vs. national)
|
||||
4. Water stress hot spots
|
||||
5. Energy infrastructure saturation
|
||||
|
||||
**Output**: `story_map.html` (standalone web page)
|
||||
|
||||
---
|
||||
|
||||
### 17. Policy Brief Generation
|
||||
**Approach**:
|
||||
- Auto-generate policy briefs from analysis outputs
|
||||
- Targeted audiences:
|
||||
- State legislators (energy/water policy)
|
||||
- Local governments (tax incentive negotiation)
|
||||
- Environmental justice advocates
|
||||
|
||||
**Template**:
|
||||
```markdown
|
||||
# Data Center Siting in [STATE]: Key Facts for Policymakers
|
||||
|
||||
- **[STATE] hosts X% of US data centers** (rank: #Y)
|
||||
- **Host communities are Z% wealthier** than state average
|
||||
- **A% of state generation is within 50 km of a data center**
|
||||
- **Top watershed holds B facilities** (water stress: [HIGH/MEDIUM/LOW])
|
||||
```
|
||||
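Filling the template above could be automated along these lines. This is a sketch only: the field names (`share_pct`, `wealth_gap_pct`, and so on) are hypothetical stand-ins for the template's X, Y, Z, A, and B placeholders, and the values would come from the analysis outputs.

```python
# Template mirrors the brief outline above, with named fields in place of X/Y/Z/A/B.
BRIEF_TEMPLATE = """# Data Center Siting in {state}: Key Facts for Policymakers

- **{state} hosts {share_pct:.1f}% of US data centers** (rank: #{rank})
- **Host communities are {wealth_gap_pct:.0f}% wealthier** than state average
- **{gen_within_50km_pct:.0f}% of state generation is within 50 km of a data center**
- **Top watershed holds {top_watershed_count} facilities** (water stress: {water_stress})
"""


def render_brief(**fields):
    """Render one state brief; raises KeyError if a required field is missing."""
    return BRIEF_TEMPLATE.format(**fields)
```

Letting a missing field raise an error is deliberate here, since a brief with a silently blank statistic would be worse than no brief.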
|
||||
---
|
||||
|
||||
### 18. Comparative International Analysis
|
||||
**Approach**:
|
||||
- Extend methodology to EU, Canada, Australia
|
||||
- Compare siting patterns (e.g., Nordic countries = renewable energy, cold climate)
|
||||
- Test if "concentrated costs / dispersed benefits" holds internationally
|
||||
|
||||
**Data Sources**:
|
||||
- OpenStreetMap (global coverage)
|
||||
- Eurostat demographics
|
||||
- IEA energy data
|
||||
- TeleGeography global cable data (already available)
|
||||
|
||||
**Research Questions**:
|
||||
- Are US patterns unique (tax-driven siting) vs. EU (regulatory constraints)?
|
||||
- Do Nordic countries see more equitable distribution?
|
||||
|
||||
---
|
||||
|
||||
## Speculative / Long-Term Ideas
|
||||
|
||||
### 19. AI Demand Forecasting
|
||||
**Approach**:
|
||||
- Train ML model to predict data center siting
|
||||
- Features: demographics, energy capacity, fiber proximity, tax rates, water availability
|
||||
- Test on historical data (train on pre-2015, test on 2015-2025)
|
||||
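The train/test design above is a temporal split: the model only sees facilities that existed before the cutoff and is scored on those that came later. A minimal sketch follows, assuming each facility is a dictionary with an `rfs_year` key (a field the inventory does not yet carry for data centers); the function name is hypothetical.

```python
def temporal_split(facilities, cutoff_year=2015, end_year=2025):
    """Split facilities into (train, test) by ready-for-service year.

    train: rfs_year before cutoff_year
    test:  cutoff_year <= rfs_year <= end_year
    Facilities with no rfs_year, or outside the window, are dropped.
    """
    train, test = [], []
    for facility in facilities:
        year = facility.get("rfs_year")
        if year is None:
            continue
        if year < cutoff_year:
            train.append(facility)
        elif year <= end_year:
            test.append(facility)
    return train, test
```

Splitting by time rather than at random avoids leaking later siting decisions into the training set.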
|
||||
**Use Case**:
|
||||
- Identify "likely future sites" for proactive policy intervention
|
||||
- Warn communities of potential incoming projects
|
||||
|
||||
---
|
||||
|
||||
### 20. Cooling Technology Analysis
|
||||
**Approach**:
|
||||
- Classify facilities by cooling type (air, water, hybrid)
|
||||
- Correlate with:
|
||||
- Climate (CDD: cooling degree days)
|
||||
- Water availability
|
||||
- Facility size
|
||||
|
||||
**Data Sources**:
|
||||
- Manual classification from news/press releases
|
||||
- FOIA requests to water utilities (cooling water withdrawal permits)
|
||||
|
||||
**Research Questions**:
|
||||
- Are water-cooled facilities concentrated in water-stressed regions (paradox)?
|
||||
- Do hyperscalers use more efficient cooling (e.g., Meta's Prineville OR evaporative cooling)?
|
||||
|
||||
---
|
||||
|
||||
### 21. Bitcoin Mining Facilities
|
||||
**Approach**:
|
||||
- Overlay cryptocurrency mining facilities (subset of "data centers")
|
||||
- Compare siting patterns: Bitcoin mines prefer low electricity costs (WA, TX, NY hydro)
|
||||
- Test if Bitcoin mines face more opposition (negative perception)
|
||||
|
||||
**Data Sources**:
|
||||
- Cambridge Bitcoin Electricity Consumption Index (has facility locations)
|
||||
- News archives of mining farm proposals/rejections
|
||||
|
||||
---
|
||||
|
||||
### 22. Disaster Resilience & Redundancy
|
||||
**Approach**:
|
||||
- Model simultaneous hazard exposure across data center clusters
|
||||
- E.g., "What % of US data centers are in wildfire risk zones?"
|
||||
- Identify single points of failure (e.g., Ashburn VA = 20% of US capacity)
|
||||
|
||||
**Research Questions**:
|
||||
- Is the current spatial distribution resilient to climate change?
|
||||
- Should policy incentivize geographic diversification?
|
||||
|
||||
**Output**: `disaster_resilience_report.md`
|
||||
|
||||
---
|
||||
|
||||
### 23. Edge Data Center Network
|
||||
**Approach**:
|
||||
- Separately analyze edge facilities (<1 MW) vs. hyperscale (>100 MW)
|
||||
- Test if edge DCs follow different siting logic (population density > energy cost)
|
||||
|
||||
**Data Challenge**: Current inventory does not distinguish edge vs. hyperscale (need `power_mw` backfill)
|
||||
|
||||
---
|
||||
|
||||
### 24. Carbon Intensity of Host Grids
|
||||
**Approach**:
|
||||
- Join to EPA eGRID subregion carbon intensity (lb CO₂/MWh)
|
||||
- Calculate per-facility estimated carbon footprint (if `power_mw` available)
|
||||
- Compare to corporate renewable energy procurement (RECs, PPAs)
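
The per-facility estimate is simple arithmetic: annual energy (MWh) times the subregion emission rate (lb CO₂/MWh). The sketch below assumes `power_mw` represents average drawable capacity and applies a caller-supplied load factor. Both the load factor and the example inputs are assumptions, not measured values.

```python
import math

HOURS_PER_YEAR = 8760
KG_PER_LB = 0.45359237


def annual_co2_metric_tons(power_mw, egrid_lb_per_mwh, load_factor=1.0):
    """Estimate annual CO2 emissions in metric tons.

    power_mw         -- facility capacity in MW
    egrid_lb_per_mwh -- eGRID subregion emission rate (lb CO2/MWh)
    load_factor      -- assumed average utilization, between 0 and 1
    """
    if not 0.0 <= load_factor <= 1.0:
        raise ValueError("load_factor must be between 0 and 1")
    annual_mwh = power_mw * load_factor * HOURS_PER_YEAR
    pounds = annual_mwh * egrid_lb_per_mwh
    return pounds * KG_PER_LB / 1000.0


# Illustrative inputs only: 10 MW at full load on a 1,000 lb/MWh grid.
example_tons = annual_co2_metric_tons(10.0, 1000.0)
```

This is a location-based estimate. It does not credit RECs or PPAs, so comparing it against reported market-based figures is one way to approach the offset question.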
|
||||
|
||||
**Research Questions**:
|
||||
- Are data centers disproportionately in high-carbon grids?
|
||||
- Do hyperscaler renewable commitments offset grid carbon?
|
||||
|
||||
**Data Sources**:
|
||||
- EPA eGRID
|
||||
- Corporate sustainability reports (Google, Microsoft, Meta, AWS)
|
||||
|
||||
---
|
||||
|
||||
## Collaboration Opportunities
|
||||
|
||||
### Academic Partnerships
|
||||
- **Energy researchers**: Joint analysis of grid saturation + IM3 projections
|
||||
- **Environmental justice scholars**: EJScreen overlay + opposition case studies
|
||||
- **Political scientists**: Tax incentive analysis + local government bargaining power
|
||||
|
||||
### Policy Stakeholders
|
||||
- **State energy offices**: Share grid saturation maps
|
||||
- **Water resource agencies**: Watershed analysis for permitting
|
||||
- **Local governments**: Demographic/tax revenue analysis for negotiation leverage
|
||||
|
||||
### Industry Engagement
|
||||
- **Data center operators**: Validate facility data, discuss siting criteria
|
||||
- **Colocation providers**: Access to tenant mix data (multi-tenant vs. single-tenant)
|
||||
|
||||
---
|
||||
|
||||
## Tools & Infrastructure Improvements
|
||||
|
||||
### Database Enhancements
|
||||
- Add `version` column to track data vintage
|
||||
- Implement `audit_log` table for data lineage
|
||||
- Set up automated backups to S3/Azure Blob
|
||||
|
||||
### Code Quality
|
||||
- Add unit tests for geocoding functions
|
||||
- Create `config.yaml` for database connection settings (replacing values currently hardcoded in scripts)
|
||||
- Dockerize analysis environment for reproducibility
|
||||
|
||||
### Documentation
|
||||
- Add docstrings (e.g., Google- or NumPy-style) to all Python functions
|
||||
- Create `CONTRIBUTING.md` for external collaborators
|
||||
- Record Jupyter notebook walkthroughs (video tutorials)
|
||||
|
||||
---
|
||||
|
||||
## Unfunded / Ambitious Ideas
|
||||
|
||||
### 25. Real-Time Energy Monitoring
|
||||
- Partner with utility to get live load data from data center substations
|
||||
- Build dashboard showing real-time energy consumption by facility
|
||||
- Correlate with AWS/Azure/GCP service outages (reverse-engineer capacity from brownouts)
|
||||
|
||||
### 26. Social Media Sentiment Analysis
|
||||
- Scrape Twitter/Reddit for mentions of local data center projects
|
||||
- NLP sentiment analysis: support vs. opposition
|
||||
- Correlate sentiment with facility approval outcomes
|
||||
|
||||
### 27. LIDAR Analysis of Cooling Infrastructure
|
||||
- Use aerial LIDAR to measure rooftop cooling equipment volume
|
||||
- Proxy for facility size (cooling = f(IT load))
|
||||
- Build predictive model: cooling equipment → power capacity
|
||||
|
||||
---
|
||||
|
||||
## Contact & Contributions
|
||||
|
||||
If you're interested in collaborating on any of these research directions, please contact the repository owner.
|
||||
|
||||
**Priorities for external collaboration**:
|
||||
1. Power capacity data acquisition
|
||||
2. Water stress/drought overlay
|
||||
3. Opposition cases database compilation
|
||||
4. International comparative analysis
|
||||
|
||||
---
|
||||
|
||||
## References for Future Work
|
||||
|
||||
### Data Sources to Explore
|
||||
- **Department of Energy**: Grid resilience reports, interconnection queues
|
||||
- **NREL**: Renewable energy potential by HUC (solar, wind)
|
||||
- **USDA**: Agricultural water use by county (competition for water)
|
||||
- **NOAA**: Climate normals + projections by grid cell
|
||||
- **BLS**: QCEW employment data, wage data
|
||||
- **EPA**: eGRID, EJScreen, Superfund sites
|
||||
|
||||
### Academic Literature Gaps
|
||||
- Limited peer-reviewed research on data center spatial concentration
|
||||
- No published studies on water stress exposure of data centers
|
||||
- Opportunity for "first mover" publication in major geography/planning journals
|
||||
|
||||
### Policy Levers to Investigate
|
||||
- State renewable portfolio standards (RPS) → data center siting
|
||||
- Federal infrastructure investment (IRA, CHIPS Act) → energy grid capacity
|
||||
- Local zoning reform (industrial land use restrictions)
|
||||
|
||||
---
|
||||
|
||||
## Legislative Analysis (LegiScan Data)
|
||||
|
||||
**Status**: Data loaded — 1.3M bills across all US states + federal, 2016–2026; ~60K tagged relevant.
|
||||
**Tables**: `legiscan_sessions`, `legiscan_bills`
|
||||
**Query file**: `query_legiscan_bills.sql`
|
||||
|
||||
### Research Questions
|
||||
|
||||
**1. Ratepayer Cost Shifting**
|
||||
Do states with high data center density show more legislative activity on ratepayer protection?
|
||||
- Join `legiscan_bills WHERE 'ratepayer_protection' = ANY(relevance_tags)` to `master_data_centers` counts by state
|
||||
- Test correlation between DC concentration and the number of ratepayer bills introduced/passed
|
||||
- Compare outcomes: do high-DC states pass or fail more ratepayer protections?
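
The correlation test could be prototyped in plain Python once per-state counts are exported. The sketch below computes a Pearson coefficient over states present in both inputs. The state codes and counts are placeholders, and a rank-based measure may suit skewed count data better.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder per-state counts (hypothetical state codes, not real data).
dc_counts = {"AA": 10, "BB": 20, "CC": 30}
ratepayer_bills = {"AA": 1, "BB": 2, "CC": 3}

states = sorted(dc_counts.keys() & ratepayer_bills.keys())
r = pearson_r([dc_counts[s] for s in states],
              [ratepayer_bills[s] for s in states])
```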
|
||||
|
||||
**2. Data Center Legislative Wave**
|
||||
Is there a measurable increase in DC-specific legislation after 2022 (AI boom)?
|
||||
- Trend `data_center` and `large_load` tagged bills by `year_start`
|
||||
- Cross-reference with major AI facility announcements (2022–2025)
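
A yearly trend can be tallied from exported bill rows before any plotting. This sketch assumes each row is a dict with `year_start` and a `relevance_tags` list, mirroring the column names used above. The sample rows are placeholders.

```python
from collections import Counter

TARGET_TAGS = frozenset({"data_center", "large_load"})


def bills_per_year(bills, tags=TARGET_TAGS):
    """Count bills per year_start that carry at least one target tag."""
    counts = Counter()
    for bill in bills:
        if tags & set(bill["relevance_tags"]):
            counts[bill["year_start"]] += 1
    return dict(sorted(counts.items()))


# Placeholder rows for illustration only.
sample_bills = [
    {"year_start": 2021, "relevance_tags": ["data_center"]},
    {"year_start": 2023, "relevance_tags": ["large_load"]},
    {"year_start": 2023, "relevance_tags": ["data_center", "tax_incentive"]},
    {"year_start": 2023, "relevance_tags": ["tax_incentive"]},
]

trend = bills_per_year(sample_bills)
```

A bill tagged with both `data_center` and `large_load` is counted once, which keeps the yearly totals comparable to distinct bill counts.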
|
||||
|
||||
**3. Tax Incentive Geography**
|
||||
Which states enacted tax incentives that may have influenced DC location decisions?
|
||||
- `tax_incentive` bills with `status IN (4,8)` (passed/chaptered)
|
||||
- Overlay with `master_data_centers` growth by state over the same period
|
||||
- Candidate for difference-in-differences analysis
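
In its simplest two-group, two-period form, the difference-in-differences estimate is the change in the treated group minus the change in the control group. The sketch below shows only that arithmetic. The input values are placeholders, and a real analysis would need to check the parallel-trends assumption and use a regression framework.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences estimate.

    Each argument is a group mean (e.g. average data center count per
    state) before or after the incentive took effect.
    """
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change


# Placeholder group means for illustration only.
effect = did_estimate(treated_pre=20, treated_post=30,
                      control_pre=22, control_post=25)
```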
|
||||
|
||||
**4. Grid Interconnection Policy**
|
||||
Do states with `grid_impact` legislation show different EIA capacity expansion patterns?
|
||||
- Join relevant bills to `energy_eia_operating_generator_capacity_flat` by state
|
||||
- Look for correlations between grid policy activity and nameplate MW additions
|
||||
|
||||
**5. Siting Preemption vs. Local Control**
|
||||
Are states passing bills to streamline or restrict local siting authority?
|
||||
- Full-text search within `siting_permitting` bills for "preemption" vs. "local control"
|
||||
- Map bill outcomes by state political environment (cross-ref RDH vote data)
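
A first pass at the text search could label each bill by which phrase appears. The patterns and label names below are assumptions for illustration. Keyword matching cannot tell whether a bill grants or removes the authority it mentions, so results would need manual review.

```python
import re

# Matches "preempt", "preemption", "pre-empted", etc.
PREEMPTION_RE = re.compile(r"\bpre-?empt\w*", re.IGNORECASE)
LOCAL_CONTROL_RE = re.compile(r"\blocal\s+control\b", re.IGNORECASE)


def classify_siting_text(text):
    """Label bill text by which siting-authority phrases it contains."""
    has_preemption = bool(PREEMPTION_RE.search(text))
    has_local = bool(LOCAL_CONTROL_RE.search(text))
    if has_preemption and has_local:
        return "both"
    if has_preemption:
        return "preemption"
    if has_local:
        return "local_control"
    return "neither"


labels = [
    classify_siting_text("Provides for preemption of county ordinances."),
    classify_siting_text("Preserves local control over facility siting."),
    classify_siting_text("State law pre-empts rules limiting Local Control."),
    classify_siting_text("Establishes a sales tax exemption."),
]
```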
|
||||
|
||||
### Suggested Joins
|
||||
|
||||
```sql
-- States with DCs and legislative activity by topic.
-- The join pairs every data center with every relevant bill in the same
-- state, so COUNT(DISTINCT ...) is required to avoid inflated counts.
SELECT
    dc.state,
    COUNT(DISTINCT dc.id) AS data_centers,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'data_center' = ANY(lb.relevance_tags)) AS dc_bills,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'ratepayer_protection' = ANY(lb.relevance_tags)) AS ratepayer_bills,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'tax_incentive' = ANY(lb.relevance_tags)
                                         AND lb.status IN (4, 8)) AS tax_incentives_passed
FROM master_data_centers dc
LEFT JOIN legiscan_bills lb ON dc.state = lb.state AND lb.is_relevant
GROUP BY dc.state
ORDER BY data_centers DESC;
```
|
||||
|
||||
---
|
||||
|
||||
**Last Updated**: May 2026
|
||||