# Database Tables Documentation

## Database Configuration

**Database Name**: `data_centers`
**Type**: PostgreSQL with PostGIS extension
**Connection**: Environment variables from `~/.zsh_secrets`
- `PGWEB_HOST`: Database host
- `PGWEB_PORT`: Database port (5433)
- `PGWEB_USER`: Database user
- `PGWEB_PASSWORD`: Database password
- `PGWEB_DATABASE`: Database name (`data_centers`)
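
A minimal sketch of assembling a libpq connection URI from these variables. The helper name `build_dsn` is hypothetical and not part of this repository; it assumes the variables are already exported into the process environment (for example, by sourcing `~/.zsh_secrets`).

```python
import os
from urllib.parse import quote


def build_dsn(env=None):
    """Assemble a postgresql:// URI from the PGWEB_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    user = quote(env["PGWEB_USER"], safe="")
    # Percent-encode the password so characters such as '@' or '/' survive.
    password = quote(env["PGWEB_PASSWORD"], safe="")
    host = env["PGWEB_HOST"]
    port = env.get("PGWEB_PORT", "5433")  # documented port for this database
    database = env.get("PGWEB_DATABASE", "data_centers")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

The resulting URI is accepted by libpq-based clients such as `psql` and `psycopg`.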

## Table Organization

Tables are organized into six categories:
1. **Core Data Center Tables** - Master inventories and source data
2. **Enrichment Tables** - Data centers joined with contextual data
3. **Environmental and Election Source Tables** - Long-form climate, drought, fire/smoke, and precinct-election source layers
4. **Base Layer Tables** - Geographic and demographic reference layers
5. **Infrastructure Tables** - Energy and connectivity infrastructure
6. **Legislation Tables** - LegiScan state and federal bill data (2016-2026)

---

## Core Data Center Tables

### `master_data_centers`
**Rows**: 1,833
**Purpose**: Canonical data center inventory - deduplicated merge of curated + OSM sources

**Key Columns**:
- `id` (INTEGER) - Unique identifier
- `name` (TEXT) - Facility name
- `address` (TEXT) - Street address
- `city` (TEXT) - City
- `state` (TEXT) - State code
- `latitude` (DOUBLE PRECISION) - Latitude
- `longitude` (DOUBLE PRECISION) - Longitude
- `geom` (GEOMETRY) - PostGIS point geometry (EPSG:4326)
- `operator` (TEXT) - Operator/owner
- `power_mw` (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
- `source` (TEXT) - Data source (`curated`, `osm`, or `both`)
- `osm_id` (TEXT) - OpenStreetMap ID if applicable
- `geocode_method` (TEXT) - Geocoding provenance

**Notes**:
- 108 of 1,833 facilities have power ratings
- 45 facilities use city-precision fallback coordinates
- Operator names are not normalized: the same company can appear under several spellings (e.g., "Meta" vs. "Meta, Inc.")
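
One possible way to group operator spellings before aggregating is to derive a comparison key. This is only a sketch: `normalize_operator` is a hypothetical helper, and the suffix list is illustrative rather than derived from the actual `operator` values.

```python
import re

# Illustrative suffix list only; extend it after inspecting the real data.
_SUFFIXES = re.compile(
    r"[,\s]+(inc|llc|ltd|corp|corporation|co)\.?$",
    re.IGNORECASE,
)


def normalize_operator(raw):
    """Return a comparison key for an operator string (hypothetical helper)."""
    name = " ".join(raw.split())  # collapse repeated whitespace
    previous = None
    while previous != name:  # strip stacked suffixes such as "Co., Inc."
        previous = name
        name = _SUFFIXES.sub("", name).rstrip(" ,")
    return name.casefold()
```

With this key, `"Meta"` and `"Meta, Inc."` collapse to the same value. Distinct legal entities that share a brand name would also be merged, so review the groups before relying on them.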

### `us_dc_sample_geocoded`
**Rows**: 1,489
**Purpose**: Original curated sample with geocoding provenance (superseded by `master_data_centers`)

**Key Columns**:
- `name`, `address`, `city`, `state`, `zip`
- `latitude`, `longitude`, `geom`
- `operator`, `power_mw`
- `census_lat`, `census_lon` - Census TIGER geocode results
- `nominatim_lat`, `nominatim_lon` - Nominatim fallback results
- `geocode_source` - Which geocoder was used

### `osm_data_centers`
**Rows**: 1,549
**Purpose**: Raw OpenStreetMap-derived facilities

**Key Columns**:
- `osm_id` (TEXT) - OSM element ID
- `osm_type` (TEXT) - `node`, `way`, or `relation`
- `name` (TEXT) - OSM name tag
- `latitude`, `longitude`, `geom`
- `tags` (JSONB) - All OSM tags as JSON
- `operator` (TEXT) - Extracted from OSM tags
- `city`, `state`, `country`

**Notes**: Fetched via the Overpass API with a query for `telecom=data_center` or `building=data_center`

### `master_data_center_spatial_clusters`
**Rows**: 1,831
**Purpose**: DBSCAN cluster assignments for master data centers

**Key Columns**:
- All columns from `master_data_centers`
- `cluster_id` (INTEGER) - Cluster assignment (-1 = noise/singleton)
- `cluster_size` (INTEGER) - Number of facilities in cluster
- `cluster_label` (TEXT) - Human-readable cluster name

**Notes**: DBSCAN parameters: eps=15 km, min_samples=2
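
The sketch below illustrates what these parameters imply; it is not the project's clustering code. It assumes the scikit-learn convention that `min_samples` counts the point itself. Under that assumption, with `min_samples=2` every point that has at least one neighbor within `eps` is a core point, so the clusters coincide with the connected components of the "within 15 km" graph and isolated points are labeled -1. The function names are hypothetical, and the cluster numbering will not necessarily match `cluster_id` in the table.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cluster_min_samples_2(points, eps_km=15.0):
    """Label (lat, lon) points; -1 marks points with no neighbor within eps_km."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union every pair of points within eps (O(n^2), fine for ~2,000 rows).
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(*points[i], *points[j]) <= eps_km:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(points)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1

    labels, next_label, root_to_label = [], 0, {}
    for i in range(len(points)):
        root = find(i)
        if sizes[root] < 2:
            labels.append(-1)  # singleton, treated as noise
            continue
        if root not in root_to_label:
            root_to_label[root] = next_label
            next_label += 1
        labels.append(root_to_label[root])
    return labels
```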

---

## Enrichment Tables

### `data_center_census_tracts_2024`
**Rows**: 1,815
**Purpose**: Per-facility demographics from containing Census tract

**Key Columns**:
- All columns from `master_data_centers`
- `geoid` (TEXT) - 11-digit Census tract GEOID
- `state_fips`, `county_fips`, `tract`
- **Population**: `total_population`, `population_density_sq_mi`
- **Age**: `median_age`, `under_18_pct`, `over_65_pct`
- **Race/Ethnicity**: `white_nh_pct`, `black_nh_pct`, `asian_nh_pct`, `hispanic_pct`
- **Economics**: `median_household_income`, `per_capita_income`, `poverty_rate`
- **Education**: `bachelors_or_higher_pct`, `high_school_or_higher_pct`
- **Housing**: `median_home_value`, `median_rent`, `homeownership_rate`
- **Broadband**: `broadband_pct` - Households with broadband subscription

**Source**: ACS 2024 5-year estimates

**Notes**:
- 18 of 1,833 facilities failed tract join (geocoding issues)
- Data from `_dc_census_tract_acs_2024` base table
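
An 11-digit tract GEOID concatenates the 2-digit state FIPS code, the 3-digit county FIPS code, and the 6-digit tract code. A small sketch of splitting it follows; `split_tract_geoid` is a hypothetical helper, and whether the `county_fips` column stores the 3-digit or the 5-digit (state + county) form is not stated here, so the sketch returns the 3-digit form.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into (state_fips, county_fips, tract)."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected 11 digits, got {geoid!r}")
    # Keep the parts as strings so leading zeros (e.g. "06") are preserved.
    return geoid[:2], geoid[2:5], geoid[5:]
```

Keeping GEOIDs as TEXT, as the tables here do, avoids losing leading zeros for states such as California (`06`).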

### `data_center_watershed_huc8`
**Rows**: 1,833
**Purpose**: Per-facility watershed assignment

**Key Columns**:
- All columns from `master_data_centers`
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code
- `watershed_name` (TEXT) - Watershed name
- `watershed_area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - States intersecting watershed

**Source**: USGS Watershed Boundary Dataset

**Notes**: 257 unique HUC8 watersheds contain at least one data center

### `data_center_nri_exposure`
**Rows**: 1,833
**Purpose**: Per-facility FEMA National Risk Index hazard exposure scores

**Key Columns**:
- All columns from `master_data_centers`
- `nri_id` (TEXT) - Census tract GEOID (matches `geoid` from demographics)
- `risk_score` (DOUBLE PRECISION) - Overall NRI risk score
- `social_vulnerability` (DOUBLE PRECISION) - Social vulnerability index
- **Hazard-specific risk scores** (18 hazards):
  - `avalanche_risk`, `coastal_flooding_risk`, `cold_wave_risk`
  - `drought_risk`, `earthquake_risk`, `hail_risk`
  - `heat_wave_risk`, `hurricane_risk`, `ice_storm_risk`
  - `landslide_risk`, `lightning_risk`, `riverine_flooding_risk`
  - `strong_wind_risk`, `tornado_risk`, `tsunami_risk`
  - `volcanic_activity_risk`, `wildfire_risk`, `winter_weather_risk`

**Source**: FEMA National Risk Index (December 2025 release)

### `data_center_historical_climate`
**Rows**: 1,833
**Purpose**: One-row-per-facility historical climate summary for data center locations

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `source`, `name`, `operator`, `city`, `state`, `country`
- `latitude`, `longitude`, `geom`
- `daymet_dataset_version`, `gridmet_dataset_version`
- `climate_period_start`, `climate_period_end` - Current period: 1991-01-01 to 2020-12-31
- **Temperature**: `mean_annual_temperature_c`, `mean_summer_temperature_c`, `max_daily_temperature_c`, `min_daily_temperature_c`
- **Humidity / wet bulb**: `mean_relative_humidity_pct`, `mean_wet_bulb_temperature_c`, `max_wet_bulb_temperature_c`, `extreme_wet_bulb_days`
- **Cooling / heat**: `cooling_degree_days_c`, `annual_cooling_degree_days_c_mean`, `extreme_heat_days`, `annual_extreme_heat_days_mean`
- **Precipitation**: `precipitation_total_mm`, `annual_precipitation_mm_mean`, `annual_precipitation_cv`, `wet_day_precipitation_p95_mm`
- **Wind**: `mean_wind_speed_ms`, `max_daily_mean_wind_speed_ms`, `sustained_wind_days`, `annual_sustained_wind_days_mean`

**Source**: Daymet + gridMET historical climate data

**Notes**: Built by `historical_climate_data_centers.ipynb` / `open_meteo_historical_data_centers.ipynb`

### `data_center_usdm_drought_exposure`
**Rows**: 1,833
**Purpose**: Per-facility drought exposure summary from weekly U.S. Drought Monitor polygons

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `source`, `name`, `operator`, `city`, `state`, `country`
- `latitude`, `longitude`, `geom`
- `usdm_status` - `covered` or `no_coverage`
- `drought_period_start`, `drought_period_end` - Current period: 2000-01-04 to 2025-12-30
- `weeks_observed`
- `weeks_in_d0_or_worse`, `weeks_in_d1_or_worse`, `weeks_in_d2_or_worse`, `weeks_in_d3_or_worse`, `weeks_in_d4`
- `pct_weeks_in_d0_or_worse`, `pct_weeks_in_d1_or_worse`, `pct_weeks_in_d2_or_worse`, `pct_weeks_in_d3_or_worse`, `pct_weeks_in_d4`
- `worst_dm_category`, `mean_dm_category`
- `longest_d0_streak_weeks`, `longest_d2_streak_weeks`, `longest_d3_streak_weeks`

**Source**: U.S. Drought Monitor weekly spatial data

**Notes**:
- Summary table is rolled up from `data_center_usdm_drought_dc_week`
- `dm_category` scale: D0-D4, stored as 0-4
- 1,830 facilities have covered status; 3 have no coverage

### `data_center_hms_smoke_exposure`
**Rows**: 1,833
**Purpose**: Per-facility wildfire-smoke exposure summary from NOAA HMS smoke polygons

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `source`, `name`, `operator`, `city`, `state`, `country`
- `latitude`, `longitude`, `geom`
- `hms_status`
- `smoke_period_start`, `smoke_period_end` - Current period: 2005-08-05 to 2026-05-22
- `days_observed`
- `days_with_any_smoke`, `days_with_light_or_worse`, `days_with_medium_or_worse`, `days_with_heavy_smoke`
- `pct_days_with_any_smoke`, `pct_days_with_light_or_worse`, `pct_days_with_medium_or_worse`, `pct_days_with_heavy_smoke`
- `worst_density_rank`, `worst_density`, `mean_density_rank`
- `longest_any_smoke_streak_days`, `longest_medium_or_heavy_streak_days`, `longest_heavy_smoke_streak_days`

**Source**: NOAA Hazard Mapping System (HMS) smoke polygons

**Notes**:
- Summary table is rolled up from `data_center_hms_smoke_dc_day`
- Density rank: 0 = no smoke observed over the facility, 1 = Light, 2 = Medium, 3 = Heavy
- HMS product path uses NOAA's `/FIRE/web/HMS/Smoke_Polygons/` archive

### `data_center_election_context`
**Rows**: 1,833
**Purpose**: Standardized one-row-per-facility election context derived from RDH precinct matches

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `name`, `city`, `state`
- `rdh_layer_title`
- `precinct_identifier_name`
- `election_year`, `office`
- `democratic_votes`, `republican_votes`, `total_votes`
- `turnout_or_vote_share`
- `updated_at`

**Source**: Redistricting Data Hub precinct election shapefiles

**Notes**:
- Built from `data_center_rdh_precinct_vote_matches` plus RDH feature properties
- Current rows cover 2020-2024 election layers; 1,829 facilities have non-null election year context

### `data_center_rdh_precinct_vote_matches`
**Rows**: 3,330
**Purpose**: Spatial join bridge between data centers and RDH precinct vote features

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `feature_id` (TEXT) - FK to `rdh_precinct_vote_features`
- `layer_id` (TEXT) - FK to `rdh_precinct_vote_layers`
- `state_code`
- `join_method`
- `match_distance_m`
- `matched_at`

**Source**: Redistricting Data Hub precinct shapefiles

**Notes**: Spatial join to voting precincts (point-in-polygon, with nearest/fallback logic where needed)

---

## Environmental and Election Source Tables

### `usdm_drought_weekly`
**Rows**: 12,080
**Purpose**: Raw weekly U.S. Drought Monitor polygons by drought category

**Key Columns**:
- `id` (BIGINT) - Primary key
- `week_date` (DATE)
- `dm_category` (SMALLINT) - Drought Monitor category D0-D4 stored as 0-4
- `objectid`, `shape_leng`, `shape_area`
- `geom` (GEOMETRY) - Drought polygon geometry

**Source**: U.S. Drought Monitor spatial archive

**Notes**: Source table for `data_center_usdm_drought_dc_week`

### `data_center_usdm_drought_dc_week`
**Rows**: ~2.48 million
**Purpose**: Long-form weekly drought exposure for each covered data center

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `week_date` (DATE)
- `worst_dm` (SMALLINT) - Worst drought category covering the facility that week

**Source**: Spatial join of `master_data_centers` to `usdm_drought_weekly`

**Notes**:
- Primary key: (`master_id`, `week_date`)
- `worst_dm = -1` indicates an observed week with no drought polygon covering the facility
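
As a rough illustration of how the weekly rows could roll up into the summary columns of `data_center_usdm_drought_exposure`, the sketch below counts weeks at or above each category for one facility. The function name is hypothetical, and expressing percentages on a 0-100 scale is an assumption; the actual rollup may differ.

```python
def summarize_drought(worst_dm_by_week):
    """Roll weekly worst_dm values (-1, 0..4) up into summary-style counts."""
    weeks_observed = len(worst_dm_by_week)
    summary = {"weeks_observed": weeks_observed}
    for level in range(5):  # D0 through D4
        suffix = f"d{level}" if level == 4 else f"d{level}_or_worse"
        # -1 (no drought polygon) never satisfies dm >= level.
        count = sum(1 for dm in worst_dm_by_week if dm >= level)
        summary[f"weeks_in_{suffix}"] = count
        summary[f"pct_weeks_in_{suffix}"] = (
            100.0 * count / weeks_observed if weeks_observed else None
        )
    return summary
```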

### `hms_smoke_days`
**Rows**: 7,075
**Purpose**: One row per observed NOAA HMS smoke product day, including zero-polygon days

**Key Columns**:
- `smoke_date` (DATE) - Primary key
- `source`, `source_file`, `source_url`
- `feature_count` (INTEGER) - Number of smoke polygons for the day
- `fetched_at`, `updated_at`

**Source**: NOAA HMS smoke polygon archive

**Notes**: Denominator table for daily smoke-exposure percentages

### `hms_smoke_daily`
**Rows**: 536,286
**Purpose**: Raw daily NOAA HMS smoke polygons with density categories

**Key Columns**:
- `id` (BIGINT) - Primary key
- `smoke_date` (DATE) - FK to `hms_smoke_days`
- `satellite`
- `start_raw`, `end_raw`, `start_utc`, `end_utc`
- `density`, `density_rank`
- `source`, `source_file`, `source_url`
- `geom` (GEOMETRY) - Smoke polygon geometry

**Source**: NOAA Hazard Mapping System (HMS) smoke polygons

**Notes**: Density rank 1-3 corresponds to Light, Medium, Heavy

### `data_center_hms_smoke_dc_day`
**Rows**: ~13.9 million
**Purpose**: Long-form daily smoke exposure for each data center and observed HMS product day

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `smoke_date` (DATE) - FK to `hms_smoke_days`
- `max_density_rank` (SMALLINT) - Maximum smoke density covering the facility on that date
- `polygon_hits` (INTEGER)

**Source**: Spatial join of `master_data_centers` to `hms_smoke_daily`

**Notes**:
- Primary key: (`master_id`, `smoke_date`)
- `max_density_rank = 0` indicates an observed HMS day with no smoke polygon covering the facility
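
The streak columns in `data_center_hms_smoke_exposure` can be thought of as the longest run of consecutive days at or above a density rank. The sketch below is one way to compute that for a single facility. `longest_streak` is a hypothetical helper, and treating a missing (unobserved) calendar day as a break in the streak is an assumption, not a documented rule.

```python
from datetime import date, timedelta


def longest_streak(days, min_rank=1):
    """Longest run of consecutive calendar dates with max_density_rank >= min_rank.

    `days` is an iterable of (smoke_date, max_density_rank) pairs for one facility.
    """
    best = current = 0
    previous = None
    for day, rank in sorted(days):
        if rank >= min_rank:
            consecutive = previous is not None and day - previous == timedelta(days=1)
            current = current + 1 if consecutive else 1
            best = max(best, current)
            previous = day
        else:
            previous = None  # a smoke-free observed day ends the streak
    return best


observed = [
    (date(2023, 6, 5), 1),
    (date(2023, 6, 6), 3),
    (date(2023, 6, 7), 2),
    (date(2023, 6, 8), 0),
    (date(2023, 6, 9), 2),
]
print(longest_streak(observed))              # any smoke: 3
print(longest_streak(observed, min_rank=2))  # medium or heavy: 2
```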

### `rdh_precinct_vote_layers`
**Rows**: 69
**Purpose**: Metadata for downloaded RDH precinct election layers

**Key Columns**:
- `layer_id` (TEXT) - Primary key
- `state_code`
- `title`
- `format`
- `datasetid`
- `source_url`
- `filename`, `local_path`, `spatial_path`
- `metadata` (JSONB)
- `loaded_at`

**Source**: Redistricting Data Hub precinct election datasets

**Notes**: Current loaded layers cover 45 distinct state codes

### `rdh_precinct_vote_features`
**Rows**: 260,953
**Purpose**: Staged RDH precinct polygons and source attributes

**Key Columns**:
- `feature_id` (TEXT) - Primary key
- `layer_id` (TEXT) - FK to `rdh_precinct_vote_layers`
- `state_code`
- `source_row`
- `properties` (JSONB) - Raw RDH election attributes
- `geom` (GEOMETRY) - Precinct polygon geometry

**Source**: Redistricting Data Hub precinct election shapefiles

**Notes**: Source feature table for `data_center_rdh_precinct_vote_matches`

---

## Base Layer Tables

### `_dc_census_tract_acs_2024`
**Rows**: 85,382
**Purpose**: ACS 2024 demographics for all Census tracts in states with data centers

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract`
- **Full ACS 5-year estimates** (85+ columns):
  - Population by age, sex, race/ethnicity
  - Households, families, housing units
  - Income, poverty, education, employment
  - Housing values, rents, costs
  - Broadband, computer access
  - Commuting, vehicles

**Source**: Census ACS 2024 5-year estimates API

**Notes**: Universe limited to the 46 states that have at least one data center

### `_dc_census_tract_boundaries_2024`
**Rows**: 85,058
**Purpose**: TIGER 2024 tract polygons for data center states

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract_code`
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_land_sq_m` (DOUBLE PRECISION) - Land area in square meters
- `area_water_sq_m` (DOUBLE PRECISION) - Water area in square meters

**Source**: Census TIGER/Line 2024

### `ruca_codes_2020_tract`
**Rows**: 85,528
**Purpose**: USDA Rural-Urban Commuting Area codes for metro/rural classification

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (matches Census tracts)
- `ruca_code` (TEXT) - Primary RUCA code (1-10)
- `ruca_category` (TEXT) - Simplified category:
  - `Metropolitan` (codes 1-3)
  - `Micropolitan` (codes 4-6)
  - `Small town` (codes 7-9)
  - `Rural` (code 10)
- `ruca_description` (TEXT) - Full RUCA code description
- `population_2020` (INTEGER)

**Source**: USDA Economic Research Service RUCA 2020

**Notes**:
- Based on 2020 Census tracts and 2010-2020 commuting patterns
- 7 data centers failed RUCA join (Puerto Rico / non-US)
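
The `ruca_category` grouping listed above can be expressed as a small lookup. This is a sketch with a hypothetical function name; returning `None` for codes outside 1-10 (USDA uses 99 for tracts that are not coded) is an assumption about how such rows should be handled.

```python
def ruca_category(ruca_code):
    """Map a primary RUCA code (1-10) to the simplified category."""
    code = int(float(ruca_code))  # accepts "4", 4, or "4.0"
    if 1 <= code <= 3:
        return "Metropolitan"
    if 4 <= code <= 6:
        return "Micropolitan"
    if 7 <= code <= 9:
        return "Small town"
    if code == 10:
        return "Rural"
    return None  # outside the documented 1-10 range
```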

### `watershed_huc8`
**Rows**: 2,139
**Purpose**: USGS HUC8 subbasin polygons for water-stress analysis

**Key Columns**:
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
- `name` (TEXT) - Watershed name
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - Comma-separated state codes
- `dc_count` (INTEGER) - Number of data centers in watershed

**Source**: USGS Watershed Boundary Dataset

**Notes**:
- 257 of 2,139 watersheds contain at least one data center
- Top 15 watersheds contain 50% of all US data centers

### `nri_census_tracts`
**Rows**: ~84,000
**Purpose**: Full FEMA National Risk Index by Census tract

**Key Columns**:
- `nri_id` (TEXT) - Census tract GEOID
- `state_name`, `county_name`, `tract_name`
- **460+ columns** including:
  - Overall risk scores and ratings
  - Expected annual loss (dollars and building value %)
  - Social vulnerability components (15 factors)
  - Community resilience score
  - Individual hazard risk scores (18 hazards)
  - Exposure, annualized frequency, historic loss ratios per hazard

**Source**: FEMA National Risk Index v2.1 (December 2025)

**Notes**:
- Wide table (460+ columns) covering the NRI hazard, loss, and vulnerability fields
- Join to tract-level tables by matching `nri_id` to the tract `geoid`
- See [FEMA NRI Technical Documentation](https://hazards.fema.gov/nri/)

---

## Infrastructure Tables

### Energy Infrastructure

#### `energy_eia_operating_generator_capacity_flat`
**Rows**: 4.7 million
**Purpose**: EIA generator inventory with lat/lon/MW (monthly 2008-2026)

**Key Columns**:
- `plant_id` (INTEGER) - EIA plant ID
- `generator_id` (TEXT) - Generator unit ID
- `plant_name` (TEXT)
- `latitude`, `longitude`, `geom`
- `state`, `county`
- `utility_name`, `operator_name`
- `nameplate_capacity_mw` (DOUBLE PRECISION)
- `technology` (TEXT) - Generation technology
- `energy_source_1`, `energy_source_2` - Primary fuel codes
- `operating_month`, `operating_year` - When unit became operational
- `status` (TEXT) - Operating, standby, retired, etc.
- `report_month`, `report_year` - Data snapshot date

**Source**: EIA Form 860 via API

**Notes**:
- "Flat" means denormalized for fast spatial queries
- Each generator-month is a row (4.7M rows from monthly snapshots)
- Use for proximity analysis (e.g., "all generators within 50 km of data center")
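
Because each generator appears once per monthly snapshot, summing `nameplate_capacity_mw` across all rows counts the same unit many times. A minimal sketch of restricting a sum to the most recent snapshot follows. The function name is hypothetical, and it assumes `report_year` and `report_month` are integers.

```python
def capacity_in_latest_snapshot(rows):
    """Sum nameplate MW over the most recent (report_year, report_month) only.

    `rows` is a non-empty list of dicts keyed like the columns listed above.
    """
    latest = max((r["report_year"], r["report_month"]) for r in rows)
    return sum(
        r["nameplate_capacity_mw"]
        for r in rows
        if (r["report_year"], r["report_month"]) == latest
    )
```

The same idea applies in SQL: filter to a single `report_year` / `report_month` pair before aggregating.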

#### `energy_eia_facility_fuel_flat`
**Rows**: Not loaded yet
**Purpose**: Monthly generation by plant/fuel

**Key Columns**:
- `plant_id`, `plant_name`
- `report_month`, `report_year`
- `energy_source` (TEXT) - Fuel code
- `net_generation_mwh` (DOUBLE PRECISION)
- `fuel_consumed_mmbtu` (DOUBLE PRECISION)

**Source**: EIA Form 923 via API

**Notes**: Target table defined in `ingest_eia_energy_layers.py`; current database does not yet have `public.energy_eia_facility_fuel_flat`.

#### `energy_eia_seds_flat`
**Rows**: 2.57 million
**Purpose**: Annual state energy consumption/production (1960-2024)

**Key Columns**:
- `state_code` (TEXT)
- `year` (INTEGER)
- `msn` (TEXT) - Mnemonic Series Name (e.g., `TETCB` = total energy consumption)
- `value` (DOUBLE PRECISION) - Series value in the units given by `unit` (energy series such as `TETCB` are reported by EIA in billion Btu)
- `unit` (TEXT)
- `description` (TEXT) - Human-readable MSN description

**Source**: EIA State Energy Data System (SEDS)

**Notes**:
- Annual aggregates by state
- Use for state-level energy context analysis

---

### Connectivity Infrastructure

#### `internet_cables`
**Rows**: 693
**Purpose**: Submarine cable routes

**Key Columns**:
- `cable_id` (TEXT) - Unique cable identifier
- `cable_name` (TEXT) - Official cable name
- `geom` (GEOMETRY) - LineString geometry (EPSG:4326)
- `rfs_year` (INTEGER) - Ready For Service year
- `length_km` (DOUBLE PRECISION)
- `owners` (TEXT[]) - Array of owner names
- `landing_points` (TEXT[]) - Array of landing point names

**Source**: TeleGeography-style cable database

**Notes**:
- 693 unique submarine cables
- Geometry is approximate route (not exact seabed path)

#### `internet_cable_landing_points`
**Rows**: 3,361
**Purpose**: Cable landing points (where cables come ashore)

**Key Columns**:
- `landing_point_id` (TEXT) - Unique identifier
- `name` (TEXT) - Landing point name
- `city`, `country`
- `latitude`, `longitude`, `geom`
- `cables` (TEXT[]) - Array of cable names landing at this point
- `cable_count` (INTEGER)

**Source**: TeleGeography-style cable database

**Notes**:
- Used for proximity analysis (how close are data centers to cable landings?)
- **Key finding**: Data centers are NOT systematically closer to cables than ordinary US cities

#### `internet_city_dominance`
**Rows**: 4,552
**Purpose**: City-level IPs/capacity (internet hub strength proxy)

**Key Columns**:
- `city` (TEXT)
- `country` (TEXT)
- `latitude`, `longitude`, `geom`
- `ip_addresses` (INTEGER) - Number of routable IP addresses
- `capacity_rank` (INTEGER) - Relative capacity ranking

**Source**: Internet topology datasets

**Notes**: Proxy for "internet hub" strength (not directly used in main analyses)

---

### Broadband

#### `fcc_bdc_location_provider_aggregates`
**Rows**: Varies
**Purpose**: FCC BDC provider availability aggregated by county/tract

**Key Columns**:
- `geoid` (TEXT) - County or tract GEOID
- `geography_level` (TEXT) - `county` or `tract`
- `provider_count` (INTEGER)
- `technology_counts` (JSONB) - Count by technology type
- `max_download_mbps`, `max_upload_mbps`

**Source**: FCC Broadband Data Collection (BDC)

#### `fcc_bdc_broadband_connection_table`
**Rows**: Varies
**Purpose**: Per-data-center broadband provider availability

**Key Columns**:
- Data center identifiers
- `provider_id`, `provider_name`
- `technology` (TEXT)
- `max_advertised_download_speed`, `max_advertised_upload_speed`
- `low_latency` (BOOLEAN)

**Source**: FCC BDC, joined to data center locations

**Notes**: Built by `build_fcc_bdc_broadband_connection_table.py`

---

### Other Tables

#### `opposition_cases_geocoded`
**Rows**: 18
**Purpose**: Geocoded community-opposition cases against data center builds

**Key Columns**:
- `case_id` (TEXT) - Unique identifier
- `developer` (TEXT) - Proposed developer/operator
- `investment_billions` (DOUBLE PRECISION) - Investment amount in billions
- `outcome` (TEXT) - Case outcome (approved, rejected, pending)
- `governance_response` (TEXT) - Government response
- `latitude`, `longitude`, `geom`

**Source**: Compiled from news archives

**Notes**: Loaded but currently unused; see `research-ideas.md` for proposed analyses

#### `census_tract_huc8_link`
**Rows**: 806
**Purpose**: Tract↔HUC8 spatial overlap table

**Key Columns**:
- `geoid` (TEXT) - Census tract GEOID
- `huc8` (TEXT) - HUC8 watershed code
- `overlap_pct` (DOUBLE PRECISION) - Percentage of tract overlapping watershed

**Notes**: Useful for downstream tract-level water-stress joins; limited to tracts containing data centers

#### `im3_state_projected_moderate_50`
**Rows**: 328
**Purpose**: PNNL IM3 projected data center siting (moderate growth, gravity weight 0.50)

**Key Columns**:
- `facility_id` (TEXT)
- `state` (TEXT)
- `cost_millions` (DOUBLE PRECISION)
- `it_mw` (DOUBLE PRECISION) - IT load in megawatts
- `cooling_water_demand_gal_per_day` (DOUBLE PRECISION)
- `latitude`, `longitude`, `geom`

**Source**: PNNL Integrated Multisector Multiscale Modeling (IM3)

**Notes**: Loaded but unused - potential for forward-projection analysis

#### `im3_projected_state_demand_summary`
**Rows**: 31
**Purpose**: State-level rollup of IM3 projected facility counts, IT MW, and cooling demand

**Key Columns**:
- `state` (TEXT)
- `facility_count` (INTEGER)
- `total_it_mw` (DOUBLE PRECISION)
- `total_cooling_demand_mgd` (DOUBLE PRECISION) - Million gallons per day

**Source**: IM3 model outputs
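
A state-level rollup of this shape could be derived from facility rows as sketched below. The function name is hypothetical, and it is an assumption that the summary is built from `im3_state_projected_moderate_50`; the only conversion involved is gallons per day to million gallons per day (MGD).

```python
from collections import defaultdict

GALLONS_PER_MILLION_GALLONS = 1_000_000


def rollup_by_state(facilities):
    """Aggregate facility dicts into per-state counts, IT MW, and cooling MGD."""
    totals = defaultdict(
        lambda: {"facility_count": 0, "total_it_mw": 0.0, "total_cooling_demand_mgd": 0.0}
    )
    for facility in facilities:
        state_total = totals[facility["state"]]
        state_total["facility_count"] += 1
        state_total["total_it_mw"] += facility["it_mw"]
        state_total["total_cooling_demand_mgd"] += (
            facility["cooling_water_demand_gal_per_day"] / GALLONS_PER_MILLION_GALLONS
        )
    return dict(totals)
```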

#### `utility_rate_tracker_2025_2028`
**Rows**: 374
**Purpose**: Utility rate-increase tracker by provider × state × service type

**Key Columns**:
- `provider` (TEXT) - Utility provider name
- `state` (TEXT)
- `service_type` (TEXT)
- `effective_date` (DATE)
- `monthly_increase_dollars` (DOUBLE PRECISION)
- `percent_increase` (DOUBLE PRECISION)

**Source**: Utility rate tracker database

**Notes**: Loaded but unused in demographic/energy analysis

#### `energy_atlas_layers_catalog`
**Rows**: ~5
**Purpose**: Metadata catalog of EIA layers ingested

**Key Columns**:
- `table_name` (TEXT)
- `source_url` (TEXT)
- `import_timestamp` (TIMESTAMP)
- `row_count` (INTEGER)

**Notes**: Created by `ingest_eia_energy_layers.py`

---

## Legislation Tables

Populated by `ingest_legiscan.py` using the [LegiScan API](https://legiscan.com/legiscan).
Covers all 50 states + DC + US Congress, sessions from 2016 through 2026.
Data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/); attribute LegiScan LLC.

### `legiscan_sessions`
**Rows**: 646
**Purpose**: One row per legislative session dataset downloaded from LegiScan

**Key Columns**:
- `session_id` (INTEGER) - LegiScan session ID (PRIMARY KEY)
- `state_abbr` (VARCHAR) - Two-letter state code (`CA`, `TX`, `US`, etc.)
- `state_id` (INTEGER) - LegiScan numeric state ID
- `year_start`, `year_end` (INTEGER) - Session year range
- `session_title` (TEXT) - Full session name
- `session_tag` (TEXT) - Short tag (e.g., "Regular Session", "1st Special Session")
- `is_special` (BOOLEAN) - True for special/extraordinary sessions
- `is_prior` (BOOLEAN) - True for completed/sine-die sessions
- `dataset_hash` (VARCHAR) - MD5 of dataset ZIP; used to detect updates
- `dataset_date` (DATE) - Date dataset was last published by LegiScan
- `dataset_size_mb` (FLOAT) - Compressed ZIP size
- `bill_count` (INTEGER) - Number of bills loaded from this session
- `imported_at` (TIMESTAMPTZ) - When this session was last imported

### `legiscan_bills`
**Rows**: ~1,313,000
**Purpose**: All bills from all sessions; tagged for relevance to data center research topics

**Key Columns**:
- `bill_id` (INTEGER) - LegiScan bill ID (PRIMARY KEY)
- `session_id` (INTEGER) - FK → `legiscan_sessions`
- `state` (VARCHAR) - Two-letter state code
- `bill_number` (VARCHAR) - Bill number (e.g., `SB 1000`, `HB 233`)
- `bill_type` (VARCHAR) - `B`=Bill, `R`=Resolution, `CR`=Concurrent Resolution, etc.
- `title` (TEXT) - Short title
- `description` (TEXT) - Longer description
- `status` (INTEGER) - Current status code (see below)
- `status_date` (DATE) - Date of last status change
- `completed` (INTEGER) - 1 if bill is in a terminal state
- `body` (VARCHAR) - Originating chamber (`H`=House, `S`=Senate, `C`=Council, etc.)
- `url` (TEXT) - LegiScan bill page URL
- `state_link` (TEXT) - Official state legislature URL
- `change_hash` (VARCHAR) - MD5 used to detect bill-level updates
- `subjects` (TEXT[]) - LegiScan subject tags (GIN indexed)
- `sponsor_count` (INTEGER) - Number of sponsors
- `vote_count` (INTEGER) - Number of recorded votes
- `text_count` (INTEGER) - Number of bill text versions
- `is_relevant` (BOOLEAN) - True if any relevance tag matched (indexed)
- `relevance_tags` (TEXT[]) - Matched topic tags (GIN indexed)
- `imported_at` (TIMESTAMPTZ) - When this bill was last upserted

**Status codes**: 1=Introduced, 2=Engrossed, 3=Enrolled, 4=Passed, 5=Vetoed, 6=Failed, 7=Override, 8=Chaptered, 9=Referred, 12=Draft
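
For reports, the integer `status` values can be decoded with a small lookup built from the list above. The names `LEGISCAN_STATUS` and `status_label` are hypothetical; codes not listed here fall back to an "Unknown" label instead of being guessed.

```python
# Mapping copied from the status-code list in this section.
LEGISCAN_STATUS = {
    1: "Introduced",
    2: "Engrossed",
    3: "Enrolled",
    4: "Passed",
    5: "Vetoed",
    6: "Failed",
    7: "Override",
    8: "Chaptered",
    9: "Referred",
    12: "Draft",
}


def status_label(code):
    """Return a readable label for a legiscan_bills.status value."""
    return LEGISCAN_STATUS.get(code, f"Unknown ({code})")
```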

**Relevance tags** (keyword-matched against title + description + subjects):

| Tag | What it captures |
|-----|-----------------|
| `data_center` | Data centers, hyperscale, colocation, AI campuses, HPC facilities |
| `large_load` | Crypto mining, large industrial loads, extraordinary load |
| `ratepayer_protection` | Cost shifting, cross-subsidy, rate design, affordability, rate class |
| `grid_impact` | Grid reliability, transmission, interconnection queue, IRP |
| `tax_incentive` | Tax exemptions, abatements, credits for facilities |
| `energy_policy` | Renewable PPAs, green tariffs, clean electricity, decarbonization |
| `water_use` | Cooling water, evaporative cooling, water footprint |
| `siting_permitting` | Zoning, conditional use permits, local control, preemption |
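
The keyword matching can be pictured roughly as below. This is a sketch only: the phrases are illustrative examples taken from the table descriptions, not the keyword lists actually used by `ingest_legiscan.py --tag`, and `tag_bill` is a hypothetical function.

```python
import re

# Illustrative phrases only; the real keyword lists are not reproduced here.
KEYWORDS = {
    "data_center": ["data center", "hyperscale", "colocation"],
    "water_use": ["cooling water", "evaporative cooling", "water footprint"],
    "siting_permitting": ["zoning", "conditional use permit", "preemption"],
}


def tag_bill(title, description="", subjects=()):
    """Return the tags whose phrases occur in title + description + subjects."""
    haystack = " ".join([title or "", description or "", *subjects]).lower()
    tags = []
    for tag, phrases in KEYWORDS.items():
        # A leading word boundary only, so plurals such as "data centers" match.
        if any(re.search(rf"\b{re.escape(phrase)}", haystack) for phrase in phrases):
            tags.append(tag)
    return tags
```

Under this sketch, `is_relevant` would correspond to `bool(tag_bill(...))`. Broad phrases produce many false positives, which is worth keeping in mind when a tag matches tens of thousands of bills.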

**Notes**:
- ~60,000 relevant bills out of 1.3M total (~4.6%)
- `data_center` tag: ~2,182 bills; `ratepayer_protection`: ~49,000
- GIN indexes on `subjects`, `relevance_tags`, and full-text (`title || description`)
- Use `query_legiscan_bills.sql` for pre-built research queries
- Re-run `python ingest_legiscan.py --fetch --load` weekly to pick up dataset updates
- Re-run `python ingest_legiscan.py --tag` after editing keyword lists

---

## Commonly Used Joins

### Data Center to Demographics
```sql
SELECT
    dc.*,
    ct.median_household_income,
    ct.bachelors_or_higher_pct,
    ct.broadband_pct
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct
    ON dc.id = ct.id;
```

### Data Center to Watershed
```sql
SELECT
    dc.*,
    w.huc8,
    w.name AS watershed_name
FROM master_data_centers dc
JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
JOIN watershed_huc8 w ON dw.huc8 = w.huc8;
```

### Data Center to Energy Infrastructure (50 km radius)
```sql
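-- Restrict to the most recent monthly snapshot; otherwise each generator
-- is counted once per report month. Assumes report_year/report_month
-- sort chronologically.
WITH latest AS (
    SELECT report_year, report_month
    FROM energy_eia_operating_generator_capacity_flat
    ORDER BY report_year DESC, report_month DESC
    LIMIT 1
)
SELECT
    dc.id,
    dc.name,
    SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
FROM master_data_centers dc
JOIN energy_eia_operating_generator_capacity_flat eg
    ON ST_DWithin(
        dc.geom::geography,
        eg.geom::geography,
        50000  -- 50 km in meters
    )
JOIN latest l
    ON eg.report_year = l.report_year
   AND eg.report_month = l.report_month
WHERE eg.status = 'OP'  -- Operating only
GROUP BY dc.id, dc.name;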
```

### Data Center to FEMA Hazard Risk
```sql
SELECT
    dc.*,
    nri.risk_score,
    nri.wildfire_risk,
    nri.drought_risk,
    nri.heat_wave_risk
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;
```

---

## Table Naming Conventions

- **`master_*`** - Canonical, deduplicated tables (use these for analysis)
- **`data_center_*`** - Data center-specific enrichment tables
- **`_dc_*`** - Base layers scoped to data center states (underscore prefix = private/internal)
- **`energy_eia_*`** - EIA energy data
- **`internet_*`** - Connectivity infrastructure
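- **`fcc_bdc_*`** - FCC Broadband Data Collection
- **`usdm_*`**, **`hms_*`** - U.S. Drought Monitor and NOAA HMS smoke source layers
- **`rdh_*`** - Redistricting Data Hub precinct election layers
- **`im3_*`** - PNNL IM3 projections
- **`legiscan_*`** - LegiScan legislation data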

---

## Indexes and Performance

Tables with a `geom` column have a GiST spatial index on it for fast spatial joins:
```sql
CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);
```

Key `geoid` columns are indexed for fast demographic joins:
```sql
CREATE INDEX idx_tablename_geoid ON tablename(geoid);
```

---

## Maintenance Notes

### Updating Data Centers
1. Run `load_postgis_osm_data_centers.py` to refresh OSM data
2. Run `build_master_data_centers.py` to rebuild master table
3. Run enrichment scripts to update joins

### Updating Demographics
1. Update `_dc_census_tract_acs_2024` from Census API
2. Run `create_data_center_census_tract_table.py --replace-final`

### Updating Energy Data
```bash
python3 ingest_eia_energy_layers.py --category power --update
```

---

## Schema Export

To export the full schema:
```bash
pg_dump -h "$PGWEB_HOST" -p "$PGWEB_PORT" -U "$PGWEB_USER" -d "$PGWEB_DATABASE" --schema-only > schema.sql
```

To list all tables:
```sql
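SELECT
    schemaname,
    tablename,
    -- format('%I.%I') quotes identifiers so unusual table names still resolve
    pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC;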
```

---

## Contact

For database access or questions, contact the repository owner.