# Database Tables Documentation

## Database Configuration

**Database Name**: `data_centers`

**Type**: PostgreSQL with PostGIS extension

**Connection**: Environment variables from `~/.zsh_secrets`
- `PGWEB_HOST`: Database host
- `PGWEB_PORT`: Database port (5433)
- `PGWEB_USER`: Database user
- `PGWEB_PASSWORD`: Database password
- `PGWEB_DATABASE`: Database name (`data_centers`)

## Table Organization

Tables are organized into six categories:
1. **Core Data Center Tables** - Master inventories and source data
2. **Enrichment Tables** - Data centers joined with contextual data
3. **Environmental and Election Source Tables** - Long-form climate, drought, fire/smoke, and precinct-election source layers
4. **Base Layer Tables** - Geographic and demographic reference layers
5. **Infrastructure Tables** - Energy and connectivity infrastructure
6. **Legislation Tables** - LegiScan state and federal bill data (2016-2026)

---

## Core Data Center Tables

### `master_data_centers`

**Rows**: 1,833

**Purpose**: Canonical data center inventory - deduplicated merge of curated + OSM sources

**Key Columns**:
- `id` (INTEGER) - Unique identifier
- `name` (TEXT) - Facility name
- `address` (TEXT) - Street address
- `city` (TEXT) - City
- `state` (TEXT) - State code
- `latitude` (DOUBLE PRECISION) - Latitude
- `longitude` (DOUBLE PRECISION) - Longitude
- `geom` (GEOMETRY) - PostGIS point geometry (EPSG:4326)
- `operator` (TEXT) - Operator/owner
- `power_mw` (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
- `source` (TEXT) - Data source (`curated`, `osm`, or `both`)
- `osm_id` (TEXT) - OpenStreetMap ID if applicable
- `geocode_method` (TEXT) - Geocoding provenance

**Notes**:
- 108 of 1,833 facilities have power ratings
- 45 facilities use city-precision fallback coordinates
- Operator strings have fragmentation issues ("Meta" vs. "Meta, Inc.")

### `us_dc_sample_geocoded`

**Rows**: 1,489

**Purpose**: Original curated sample with geocoding provenance (superseded by `master_data_centers`)

**Key Columns**:
- `name`, `address`, `city`, `state`, `zip`
- `latitude`, `longitude`, `geom`
- `operator`, `power_mw`
- `census_lat`, `census_lon` - Census TIGER geocode results
- `nominatim_lat`, `nominatim_lon` - Nominatim fallback results
- `geocode_source` - Which geocoder was used

### `osm_data_centers`

**Rows**: 1,549

**Purpose**: Raw OpenStreetMap-derived facilities

**Key Columns**:
- `osm_id` (TEXT) - OSM element ID
- `osm_type` (TEXT) - `node`, `way`, or `relation`
- `name` (TEXT) - OSM name tag
- `latitude`, `longitude`, `geom`
- `tags` (JSONB) - All OSM tags as JSON
- `operator` (TEXT) - Extracted from OSM tags
- `city`, `state`, `country`

**Notes**: Fetched via Overpass API with query for `telecom=data_center` or `building=data_center`

### `master_data_center_spatial_clusters`

**Rows**: 1,831

**Purpose**: DBSCAN cluster assignments for master data centers

**Key Columns**:
- All columns from `master_data_centers`
- `cluster_id` (INTEGER) - Cluster assignment (-1 = noise/singleton)
- `cluster_size` (INTEGER) - Number of facilities in cluster
- `cluster_label` (TEXT) - Human-readable cluster name

**Notes**: DBSCAN parameters: eps=15 km, min_samples=2

---

## Enrichment Tables

### `data_center_census_tracts_2024`

**Rows**: 1,815

**Purpose**: Per-facility demographics from containing Census tract

**Key Columns**:
- All columns from `master_data_centers`
- `geoid` (TEXT) - 11-digit Census tract GEOID
- `state_fips`, `county_fips`, `tract`
- **Population**: `total_population`, `population_density_sq_mi`
- **Age**: `median_age`, `under_18_pct`, `over_65_pct`
- **Race/Ethnicity**: `white_nh_pct`, `black_nh_pct`, `asian_nh_pct`, `hispanic_pct`
- **Economics**: `median_household_income`, `per_capita_income`, `poverty_rate`
- **Education**: `bachelors_or_higher_pct`, `high_school_or_higher_pct`
- **Housing**: `median_home_value`, `median_rent`, `homeownership_rate`
- **Broadband**: `broadband_pct` - Households with broadband subscription

**Source**: ACS 2024 5-year estimates

**Notes**:
- 18 of 1,833 facilities failed tract join (geocoding issues)
- Data from `_dc_census_tract_acs_2024` base table

### `data_center_watershed_huc8`

**Rows**: 1,833

**Purpose**: Per-facility watershed assignment

**Key Columns**:
- All columns from `master_data_centers`
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code
- `watershed_name` (TEXT) - Watershed name
- `watershed_area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - States intersecting watershed

**Source**: USGS Watershed Boundary Dataset

**Notes**: 257 unique HUC8 watersheds contain at least one data center

### `data_center_nri_exposure`

**Rows**: 1,833

**Purpose**: Per-facility FEMA National Risk Index hazard exposure scores

**Key Columns**:
- All columns from `master_data_centers`
- `nri_id` (TEXT) - Census tract GEOID (matches `geoid` from demographics)
- `risk_score` (DOUBLE PRECISION) - Overall NRI risk score
- `social_vulnerability` (DOUBLE PRECISION) - Social vulnerability index
- **Hazard-specific risk scores** (18 hazards):
  - `avalanche_risk`, `coastal_flooding_risk`, `cold_wave_risk`
  - `drought_risk`, `earthquake_risk`, `hail_risk`
  - `heat_wave_risk`, `hurricane_risk`, `ice_storm_risk`
  - `landslide_risk`, `lightning_risk`, `riverine_flooding_risk`
  - `strong_wind_risk`, `tornado_risk`, `tsunami_risk`
  - `volcanic_activity_risk`, `wildfire_risk`, `winter_weather_risk`

**Source**: FEMA National Risk Index (December 2025 release)

### `data_center_historical_climate`

**Rows**: 1,833

**Purpose**: One-row-per-facility historical climate summary for data center locations

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `source`, `name`, `operator`, `city`, `state`, `country`
- `latitude`, `longitude`, `geom`
- `daymet_dataset_version`, `gridmet_dataset_version`
- `climate_period_start`,
`climate_period_end` - Current period: 1991-01-01 to 2020-12-31
- **Temperature**: `mean_annual_temperature_c`, `mean_summer_temperature_c`, `max_daily_temperature_c`, `min_daily_temperature_c`
- **Humidity / wet bulb**: `mean_relative_humidity_pct`, `mean_wet_bulb_temperature_c`, `max_wet_bulb_temperature_c`, `extreme_wet_bulb_days`
- **Cooling / heat**: `cooling_degree_days_c`, `annual_cooling_degree_days_c_mean`, `extreme_heat_days`, `annual_extreme_heat_days_mean`
- **Precipitation**: `precipitation_total_mm`, `annual_precipitation_mm_mean`, `annual_precipitation_cv`, `wet_day_precipitation_p95_mm`
- **Wind**: `mean_wind_speed_ms`, `max_daily_mean_wind_speed_ms`, `sustained_wind_days`, `annual_sustained_wind_days_mean`

**Source**: Daymet + gridMET historical climate data

**Notes**: Built by `historical_climate_data_centers.ipynb` / `open_meteo_historical_data_centers.ipynb`

### `data_center_usdm_drought_exposure`

**Rows**: 1,833

**Purpose**: Per-facility drought exposure summary from weekly U.S. Drought Monitor polygons

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `source`, `name`, `operator`, `city`, `state`, `country`
- `latitude`, `longitude`, `geom`
- `usdm_status` - `covered` or `no_coverage`
- `drought_period_start`, `drought_period_end` - Current period: 2000-01-04 to 2025-12-30
- `weeks_observed`
- `weeks_in_d0_or_worse`, `weeks_in_d1_or_worse`, `weeks_in_d2_or_worse`, `weeks_in_d3_or_worse`, `weeks_in_d4`
- `pct_weeks_in_d0_or_worse`, `pct_weeks_in_d1_or_worse`, `pct_weeks_in_d2_or_worse`, `pct_weeks_in_d3_or_worse`, `pct_weeks_in_d4`
- `worst_dm_category`, `mean_dm_category`
- `longest_d0_streak_weeks`, `longest_d2_streak_weeks`, `longest_d3_streak_weeks`

**Source**: U.S.
Drought Monitor weekly spatial data

**Notes**:
- Summary table is rolled up from `data_center_usdm_drought_dc_week`
- `dm_category` scale: D0-D4, stored as 0-4
- 1,830 facilities have covered status; 3 have no coverage

### `data_center_hms_smoke_exposure`

**Rows**: 1,833

**Purpose**: Per-facility wildfire-smoke exposure summary from NOAA HMS smoke polygons

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `source`, `name`, `operator`, `city`, `state`, `country`
- `latitude`, `longitude`, `geom`
- `hms_status`
- `smoke_period_start`, `smoke_period_end` - Current period: 2005-08-05 to 2026-05-22
- `days_observed`
- `days_with_any_smoke`, `days_with_light_or_worse`, `days_with_medium_or_worse`, `days_with_heavy_smoke`
- `pct_days_with_any_smoke`, `pct_days_with_light_or_worse`, `pct_days_with_medium_or_worse`, `pct_days_with_heavy_smoke`
- `worst_density_rank`, `worst_density`, `mean_density_rank`
- `longest_any_smoke_streak_days`, `longest_medium_or_heavy_streak_days`, `longest_heavy_smoke_streak_days`

**Source**: NOAA Hazard Mapping System (HMS) smoke polygons

**Notes**:
- Summary table is rolled up from `data_center_hms_smoke_dc_day`
- Density rank: 0 = observed no smoke, 1 = Light, 2 = Medium, 3 = Heavy
- HMS product path uses NOAA's `/FIRE/web/HMS/Smoke_Polygons/` archive

### `data_center_election_context`

**Rows**: 1,833

**Purpose**: Standardized one-row-per-facility election context derived from RDH precinct matches

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `name`, `city`, `state`
- `rdh_layer_title`
- `precinct_identifier_name`
- `election_year`, `office`
- `democratic_votes`, `republican_votes`, `total_votes`
- `turnout_or_vote_share`
- `updated_at`

**Source**: Redistricting Data Hub precinct election shapefiles

**Notes**:
- Built from `data_center_rdh_precinct_vote_matches` plus RDH feature properties
- Current rows cover 2020-2024 election layers; 1,829 facilities have non-null election year context

### `data_center_rdh_precinct_vote_matches`

**Rows**: 3,330

**Purpose**: Spatial join bridge between data centers and RDH precinct vote features

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `feature_id` (TEXT) - FK to `rdh_precinct_vote_features`
- `layer_id` (TEXT) - FK to `rdh_precinct_vote_layers`
- `state_code`
- `join_method`
- `match_distance_m`
- `matched_at`

**Source**: Redistricting Data Hub precinct shapefiles

**Notes**: Spatial join to voting precincts (point-in-polygon, with nearest/fallback logic where needed)

---

## Environmental and Election Source Tables

### `usdm_drought_weekly`

**Rows**: 12,080

**Purpose**: Raw weekly U.S. Drought Monitor polygons by drought category

**Key Columns**:
- `id` (BIGINT) - Primary key
- `week_date` (DATE)
- `dm_category` (SMALLINT) - Drought Monitor category D0-D4 stored as 0-4
- `objectid`, `shape_leng`, `shape_area`
- `geom` (GEOMETRY) - Drought polygon geometry

**Source**: U.S. Drought Monitor spatial archive

**Notes**: Source table for `data_center_usdm_drought_dc_week`

### `data_center_usdm_drought_dc_week`

**Rows**: ~2.48 million

**Purpose**: Long-form weekly drought exposure for each covered data center

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `week_date` (DATE)
- `worst_dm` (SMALLINT) - Worst drought category covering the facility that week

**Source**: Spatial join of `master_data_centers` to `usdm_drought_weekly`

**Notes**:
- Primary key: (`master_id`, `week_date`)
- `worst_dm = -1` indicates an observed week with no drought polygon covering the facility

### `hms_smoke_days`

**Rows**: 7,075

**Purpose**: One row per observed NOAA HMS smoke product day, including zero-polygon days

**Key Columns**:
- `smoke_date` (DATE) - Primary key
- `source`, `source_file`, `source_url`
- `feature_count` (INTEGER) - Number of smoke polygons for the day
- `fetched_at`, `updated_at`

**Source**: NOAA HMS smoke polygon archive

**Notes**: Denominator table for daily smoke-exposure
percentages

### `hms_smoke_daily`

**Rows**: 536,286

**Purpose**: Raw daily NOAA HMS smoke polygons with density categories

**Key Columns**:
- `id` (BIGINT) - Primary key
- `smoke_date` (DATE) - FK to `hms_smoke_days`
- `satellite`
- `start_raw`, `end_raw`, `start_utc`, `end_utc`
- `density`, `density_rank`
- `source`, `source_file`, `source_url`
- `geom` (GEOMETRY) - Smoke polygon geometry

**Source**: NOAA Hazard Mapping System (HMS) smoke polygons

**Notes**: Density rank 1-3 corresponds to Light, Medium, Heavy

### `data_center_hms_smoke_dc_day`

**Rows**: ~13.9 million

**Purpose**: Long-form daily smoke exposure for each data center and observed HMS product day

**Key Columns**:
- `master_id` (TEXT) - FK to `master_data_centers`
- `smoke_date` (DATE) - FK to `hms_smoke_days`
- `max_density_rank` (SMALLINT) - Maximum smoke density covering the facility on that date
- `polygon_hits` (INTEGER)

**Source**: Spatial join of `master_data_centers` to `hms_smoke_daily`

**Notes**:
- Primary key: (`master_id`, `smoke_date`)
- `max_density_rank = 0` indicates an observed HMS day with no smoke polygon covering the facility

### `rdh_precinct_vote_layers`

**Rows**: 69

**Purpose**: Metadata for downloaded RDH precinct election layers

**Key Columns**:
- `layer_id` (TEXT) - Primary key
- `state_code`
- `title`
- `format`
- `datasetid`
- `source_url`
- `filename`, `local_path`, `spatial_path`
- `metadata` (JSONB)
- `loaded_at`

**Source**: Redistricting Data Hub precinct election datasets

**Notes**: Current loaded layers cover 45 distinct state codes

### `rdh_precinct_vote_features`

**Rows**: 260,953

**Purpose**: Staged RDH precinct polygons and source attributes

**Key Columns**:
- `feature_id` (TEXT) - Primary key
- `layer_id` (TEXT) - FK to `rdh_precinct_vote_layers`
- `state_code`
- `source_row`
- `properties` (JSONB) - Raw RDH election attributes
- `geom` (GEOMETRY) - Precinct polygon geometry

**Source**: Redistricting Data Hub precinct election shapefiles

**Notes**: Source feature
table for `data_center_rdh_precinct_vote_matches`

---

## Base Layer Tables

### `_dc_census_tract_acs_2024`

**Rows**: 85,382

**Purpose**: ACS 2024 demographics for all Census tracts in states with data centers

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract`
- **Full ACS 5-year estimates** (85+ columns):
  - Population by age, sex, race/ethnicity
  - Households, families, housing units
  - Income, poverty, education, employment
  - Housing values, rents, costs
  - Broadband, computer access
  - Commuting, vehicles

**Source**: Census ACS 2024 5-year estimates API

**Notes**: Universe limited to the 46 states that have data centers (states with no data centers are excluded)

### `_dc_census_tract_boundaries_2024`

**Rows**: 85,058

**Purpose**: TIGER 2024 tract polygons for data center states

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract_code`
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_land_sq_m` (DOUBLE PRECISION) - Land area in square meters
- `area_water_sq_m` (DOUBLE PRECISION) - Water area in square meters

**Source**: Census TIGER/Line 2024

### `ruca_codes_2020_tract`

**Rows**: 85,528

**Purpose**: USDA Rural-Urban Commuting Area codes for metro/rural classification

**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (matches Census tracts)
- `ruca_code` (TEXT) - Primary RUCA code (1-10)
- `ruca_category` (TEXT) - Simplified category:
  - `Metropolitan` (codes 1-3)
  - `Micropolitan` (codes 4-6)
  - `Small town` (codes 7-9)
  - `Rural` (code 10)
- `ruca_description` (TEXT) - Full RUCA code description
- `population_2020` (INTEGER)

**Source**: USDA Economic Research Service RUCA 2020

**Notes**:
- Based on 2020 Census tracts and 2010-2020 commuting patterns
- 7 data centers failed RUCA join (Puerto Rico / non-US)

### `watershed_huc8`

**Rows**: 2,139

**Purpose**: USGS HUC8 subbasin polygons for water-stress analysis

**Key
Columns**:
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
- `name` (TEXT) - Watershed name
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - Comma-separated state codes
- `dc_count` (INTEGER) - Number of data centers in watershed

**Source**: USGS Watershed Boundary Dataset

**Notes**:
- 257 of 2,139 watersheds contain at least one data center
- Top 15 watersheds contain 50% of all US data centers

### `nri_census_tracts`

**Rows**: ~84,000

**Purpose**: Full FEMA National Risk Index by Census tract

**Key Columns**:
- `nri_id` (TEXT) - Census tract GEOID
- `state_name`, `county_name`, `tract_name`
- **460+ columns** including:
  - Overall risk scores and ratings
  - Expected annual loss (dollars and building value %)
  - Social vulnerability components (15 factors)
  - Community resilience score
  - Individual hazard risk scores (18 hazards)
  - Exposure, annualized frequency, historic loss ratios per hazard

**Source**: FEMA National Risk Index v2.1 (December 2025)

**Notes**:
- Massive table with comprehensive natural hazard risk data
- Join to data centers by matching `nri_id` to the tract `geoid` from `data_center_census_tracts_2024`
- See [FEMA NRI Technical Documentation](https://hazards.fema.gov/nri/)

---

## Infrastructure Tables

### Energy Infrastructure

#### `energy_eia_operating_generator_capacity_flat`

**Rows**: 4.7 million

**Purpose**: EIA generator inventory with lat/lon/MW (monthly 2008-2026)

**Key Columns**:
- `plant_id` (INTEGER) - EIA plant ID
- `generator_id` (TEXT) - Generator unit ID
- `plant_name` (TEXT)
- `latitude`, `longitude`, `geom`
- `state`, `county`
- `utility_name`, `operator_name`
- `nameplate_capacity_mw` (DOUBLE PRECISION)
- `technology` (TEXT) - Generation technology
- `energy_source_1`, `energy_source_2` - Primary fuel codes
- `operating_month`, `operating_year` - When unit became operational
- `status` (TEXT) - Operating, standby, retired, etc.
- `report_month`, `report_year` - Data snapshot date

**Source**: EIA Form 860 via API

**Notes**:
- "Flat" means denormalized for fast spatial queries
- Each generator-month is a row (4.7M rows from monthly snapshots)
- Use for proximity analysis (e.g., "all generators within 50 km of data center")

#### `energy_eia_facility_fuel_flat`

**Rows**: Not loaded yet

**Purpose**: Monthly generation by plant/fuel

**Key Columns**:
- `plant_id`, `plant_name`
- `report_month`, `report_year`
- `energy_source` (TEXT) - Fuel code
- `net_generation_mwh` (DOUBLE PRECISION)
- `fuel_consumed_mmbtu` (DOUBLE PRECISION)

**Source**: EIA Form 923 via API

**Notes**: Target table defined in `ingest_eia_energy_layers.py`; current database does not yet have `public.energy_eia_facility_fuel_flat`.

#### `energy_eia_seds_flat`

**Rows**: 2.57 million

**Purpose**: Annual state energy consumption/production (1960-2024)

**Key Columns**:
- `state_code` (TEXT)
- `year` (INTEGER)
- `msn` (TEXT) - Mnemonic series names (e.g., `TETCB` = total energy consumption)
- `value` (DOUBLE PRECISION) - Energy in trillion BTU
- `unit` (TEXT)
- `description` (TEXT) - Human-readable MSN description

**Source**: EIA State Energy Data System (SEDS)

**Notes**:
- Annual aggregates by state
- Use for state-level energy context analysis

---

### Connectivity Infrastructure

#### `internet_cables`

**Rows**: 693

**Purpose**: Submarine cable routes

**Key Columns**:
- `cable_id` (TEXT) - Unique cable identifier
- `cable_name` (TEXT) - Official cable name
- `geom` (GEOMETRY) - LineString geometry (EPSG:4326)
- `rfs_year` (INTEGER) - Ready For Service year
- `length_km` (DOUBLE PRECISION)
- `owners` (TEXT[]) - Array of owner names
- `landing_points` (TEXT[]) - Array of landing point names

**Source**: TeleGeography-style cable database

**Notes**:
- 693 unique submarine cables
- Geometry is approximate route (not exact seabed path)

#### `internet_cable_landing_points`

**Rows**: 3,361

**Purpose**: Cable landing points (where cables
come ashore)

**Key Columns**:
- `landing_point_id` (TEXT) - Unique identifier
- `name` (TEXT) - Landing point name
- `city`, `country`
- `latitude`, `longitude`, `geom`
- `cables` (TEXT[]) - Array of cable names landing at this point
- `cable_count` (INTEGER)

**Source**: TeleGeography-style cable database

**Notes**:
- Used for proximity analysis (how close are data centers to cable landings?)
- **Key finding**: Data centers are NOT systematically closer to cables than ordinary US cities

#### `internet_city_dominance`

**Rows**: 4,552

**Purpose**: City-level IPs/capacity (internet hub strength proxy)

**Key Columns**:
- `city` (TEXT)
- `country` (TEXT)
- `latitude`, `longitude`, `geom`
- `ip_addresses` (INTEGER) - Number of routable IP addresses
- `capacity_rank` (INTEGER) - Relative capacity ranking

**Source**: Internet topology datasets

**Notes**: Proxy for "internet hub" strength (not directly used in main analyses)

---

### Broadband

#### `fcc_bdc_location_provider_aggregates`

**Rows**: Varies

**Purpose**: FCC BDC provider availability aggregated by county/tract

**Key Columns**:
- `geoid` (TEXT) - County or tract GEOID
- `geography_level` (TEXT) - `county` or `tract`
- `provider_count` (INTEGER)
- `technology_counts` (JSONB) - Count by technology type
- `max_download_mbps`, `max_upload_mbps`

**Source**: FCC Broadband Data Collection (BDC)

#### `fcc_bdc_broadband_connection_table`

**Rows**: Varies

**Purpose**: Per-data-center broadband provider availability

**Key Columns**:
- Data center identifiers
- `provider_id`, `provider_name`
- `technology` (TEXT)
- `max_advertised_download_speed`, `max_advertised_upload_speed`
- `low_latency` (BOOLEAN)

**Source**: FCC BDC, joined to data center locations

**Notes**: Built by `build_fcc_bdc_broadband_connection_table.py`

---

### Other Tables

#### `opposition_cases_geocoded`

**Rows**: 18

**Purpose**: Geocoded community-opposition cases against data center builds

**Key Columns**:
- `case_id` (TEXT) - Unique identifier
- `developer` (TEXT) - Proposed developer/operator
- `investment_billions` (DOUBLE PRECISION) - Investment amount in billions
- `outcome` (TEXT) - Case outcome (approved, rejected, pending)
- `governance_response` (TEXT) - Government response
- `latitude`, `longitude`, `geom`

**Source**: Compiled from news archives

**Notes**: Loaded but currently unused - see research-ideas.md for proposed analyses

#### `census_tract_huc8_link`

**Rows**: 806

**Purpose**: Tract↔HUC8 spatial overlap table

**Key Columns**:
- `geoid` (TEXT) - Census tract GEOID
- `huc8` (TEXT) - HUC8 watershed code
- `overlap_pct` (DOUBLE PRECISION) - Percentage of tract overlapping watershed

**Notes**: Useful for downstream tract-level water-stress joins; limited to tracts containing data centers

#### `im3_state_projected_moderate_50`

**Rows**: 328

**Purpose**: PNNL IM3 projected data center siting (moderate growth, gravity weight 0.50)

**Key Columns**:
- `facility_id` (TEXT)
- `state` (TEXT)
- `cost_millions` (DOUBLE PRECISION)
- `it_mw` (DOUBLE PRECISION) - IT load in megawatts
- `cooling_water_demand_gal_per_day` (DOUBLE PRECISION)
- `latitude`, `longitude`, `geom`

**Source**: PNNL Integrated Multisector Multiscale Modeling (IM3)

**Notes**: Loaded but unused - potential for forward-projection analysis

#### `im3_projected_state_demand_summary`

**Rows**: 31

**Purpose**: State-level rollup of IM3 projected facility counts, IT MW, and cooling demand

**Key Columns**:
- `state` (TEXT)
- `facility_count` (INTEGER)
- `total_it_mw` (DOUBLE PRECISION)
- `total_cooling_demand_mgd` (DOUBLE PRECISION) - Million gallons per day

**Source**: IM3 model outputs

#### `utility_rate_tracker_2025_2028`

**Rows**: 374

**Purpose**: Utility rate-increase tracker by provider × state × service type

**Key Columns**:
- `provider` (TEXT) - Utility provider name
- `state` (TEXT)
- `service_type` (TEXT)
- `effective_date` (DATE)
- `monthly_increase_dollars` (DOUBLE PRECISION)
- `percent_increase` (DOUBLE PRECISION)

**Source**:
Utility rate tracker database

**Notes**: Loaded but unused in demographic/energy analysis

#### `energy_atlas_layers_catalog`

**Rows**: ~5

**Purpose**: Metadata catalog of EIA layers ingested

**Key Columns**:
- `table_name` (TEXT)
- `source_url` (TEXT)
- `import_timestamp` (TIMESTAMP)
- `row_count` (INTEGER)

**Notes**: Created by `ingest_eia_energy_layers.py`

---

## Legislation Tables

Populated by `ingest_legiscan.py` using the [LegiScan API](https://legiscan.com/legiscan). Covers all 50 states + DC + US Congress, sessions from 2016 through 2026. Data is licensed [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/); attribute LegiScan LLC.

### `legiscan_sessions`

**Rows**: 646

**Purpose**: One row per legislative session dataset downloaded from LegiScan

**Key Columns**:
- `session_id` (INTEGER) - LegiScan session ID (PRIMARY KEY)
- `state_abbr` (VARCHAR) - Two-letter state code (`CA`, `TX`, `US`, etc.)
- `state_id` (INTEGER) - LegiScan numeric state ID
- `year_start`, `year_end` (INTEGER) - Session year range
- `session_title` (TEXT) - Full session name
- `session_tag` (TEXT) - Short tag (e.g., "Regular Session", "1st Special Session")
- `is_special` (BOOLEAN) - True for special/extraordinary sessions
- `is_prior` (BOOLEAN) - True for completed/sine-die sessions
- `dataset_hash` (VARCHAR) - MD5 of dataset ZIP; used to detect updates
- `dataset_date` (DATE) - Date dataset was last published by LegiScan
- `dataset_size_mb` (FLOAT) - Compressed ZIP size
- `bill_count` (INTEGER) - Number of bills loaded from this session
- `imported_at` (TIMESTAMPTZ) - When this session was last imported

### `legiscan_bills`

**Rows**: ~1,313,000

**Purpose**: All bills from all sessions; tagged for relevance to data center research topics

**Key Columns**:
- `bill_id` (INTEGER) - LegiScan bill ID (PRIMARY KEY)
- `session_id` (INTEGER) - FK → `legiscan_sessions`
- `state` (VARCHAR) - Two-letter state code
- `bill_number` (VARCHAR) - Bill number (e.g., `SB 1000`, `HB 233`)
- `bill_type` (VARCHAR) - `B`=Bill, `R`=Resolution, `CR`=Concurrent Resolution, etc.
- `title` (TEXT) - Short title
- `description` (TEXT) - Longer description
- `status` (INTEGER) - Current status code (see below)
- `status_date` (DATE) - Date of last status change
- `completed` (INTEGER) - 1 if bill is in a terminal state
- `body` (VARCHAR) - Originating chamber (`H`=House, `S`=Senate, `C`=Council, etc.)
- `url` (TEXT) - LegiScan bill page URL
- `state_link` (TEXT) - Official state legislature URL
- `change_hash` (VARCHAR) - MD5 used to detect bill-level updates
- `subjects` (TEXT[]) - LegiScan subject tags (GIN indexed)
- `sponsor_count` (INTEGER) - Number of sponsors
- `vote_count` (INTEGER) - Number of recorded votes
- `text_count` (INTEGER) - Number of bill text versions
- `is_relevant` (BOOLEAN) - True if any relevance tag matched (GIN indexed)
- `relevance_tags` (TEXT[]) - Matched topic tags (GIN indexed)
- `imported_at` (TIMESTAMPTZ) - When this bill was last upserted

**Status codes**: 1=Introduced, 2=Engrossed, 3=Enrolled, 4=Passed, 5=Vetoed, 6=Failed, 7=Override, 8=Chaptered, 9=Referred, 12=Draft

**Relevance tags** (keyword-matched against title + description + subjects):

| Tag | What it captures |
|-----|-----------------|
| `data_center` | Data centers, hyperscale, colocation, AI campuses, HPC facilities |
| `large_load` | Crypto mining, large industrial loads, extraordinary load |
| `ratepayer_protection` | Cost shifting, cross-subsidy, rate design, affordability, rate class |
| `grid_impact` | Grid reliability, transmission, interconnection queue, IRP |
| `tax_incentive` | Tax exemptions, abatements, credits for facilities |
| `energy_policy` | Renewable PPAs, green tariffs, clean electricity, decarbonization |
| `water_use` | Cooling water, evaporative cooling, water footprint |
| `siting_permitting` | Zoning, conditional use permits, local control, preemption |

**Notes**:
- ~60,000 relevant bills out of 1.3M total (~4.6%)
- `data_center` tag: ~2,182
bills; `ratepayer_protection`: ~49,000
- GIN indexes on `subjects`, `relevance_tags`, and full-text (`title || description`)
- Use `query_legiscan_bills.sql` for pre-built research queries
- Re-run `python ingest_legiscan.py --fetch --load` weekly to pick up dataset updates
- Re-run `python ingest_legiscan.py --tag` after editing keyword lists

---

## Commonly Used Joins

### Data Center to Demographics

```sql
SELECT dc.*,
       ct.median_household_income,
       ct.bachelors_or_higher_pct,
       ct.broadband_pct
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id;
```

### Data Center to Watershed

```sql
SELECT dc.*,
       w.huc8,
       w.name AS watershed_name  -- watershed_huc8 stores the name in `name`
FROM master_data_centers dc
JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
JOIN watershed_huc8 w ON dw.huc8 = w.huc8;
```

### Data Center to Energy Infrastructure (50 km radius)

```sql
-- The generator table holds one row per generator-month, so also filter on
-- report_year / report_month to avoid summing a generator once per snapshot.
SELECT dc.id,
       dc.name,
       SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
FROM master_data_centers dc
JOIN energy_eia_operating_generator_capacity_flat eg
  ON ST_DWithin(
       dc.geom::geography,
       eg.geom::geography,
       50000  -- 50 km in meters
     )
WHERE eg.status = 'OP'  -- Operating only
GROUP BY dc.id, dc.name;
```

### Data Center to FEMA Hazard Risk

```sql
SELECT dc.*,
       nri.risk_score,
       nri.wildfire_risk,
       nri.drought_risk,
       nri.heat_wave_risk
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;
```

---

## Table Naming Conventions

- **`master_*`** - Canonical, deduplicated tables (use these for analysis)
- **`data_center_*`** - Data center-specific enrichment tables
- **`_dc_*`** - Base layers scoped to data center states (underscore prefix = private/internal)
- **`energy_eia_*`** - EIA energy data
- **`internet_*`** - Connectivity infrastructure
- **`fcc_bdc_*`** - FCC Broadband Data Collection

---

## Indexes and Performance

Every table with a `geom` column has a GiST index on it for fast spatial joins:

```sql
CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);
```

Key `geoid` columns are indexed for fast demographic joins:

```sql
CREATE INDEX idx_tablename_geoid ON tablename(geoid);
```

---

## Maintenance Notes

### Updating Data Centers

1. Run `load_postgis_osm_data_centers.py` to refresh OSM data
2. Run `build_master_data_centers.py` to rebuild master table
3. Run enrichment scripts to update joins

### Updating Demographics

1. Update `_dc_census_tract_acs_2024` from Census API
2. Run `create_data_center_census_tract_table.py --replace-final`

### Updating Energy Data

```bash
python3 ingest_eia_energy_layers.py --category power --update
```

---

## Schema Export

To export the full schema:

```bash
# -p is needed because the database listens on a non-default port (5433)
pg_dump -h "$PGWEB_HOST" -p "$PGWEB_PORT" -U "$PGWEB_USER" -d data_centers --schema-only > schema.sql
```

To list all tables:

```sql
SELECT schemaname,
       tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```

---

## Contact

For database access or questions, contact the repository owner.
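
---

## Example: Building Connection Settings in Python

The `PGWEB_*` variables listed under Database Configuration can be turned into connection settings along the lines of the sketch below. It is an illustration, not code from the repository. The helper name `pgweb_conn_params` is hypothetical, the fallback to port 5433 and database `data_centers` is an assumption based on the documented values, and the example credentials are placeholders.

```python
import os

# Hypothetical helper (not part of the repository): collects the PGWEB_*
# variables into libpq-style keyword arguments.
REQUIRED_VARS = ("PGWEB_HOST", "PGWEB_USER", "PGWEB_PASSWORD")


def pgweb_conn_params(env=None):
    """Return connection keyword arguments built from PGWEB_* variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "host": env["PGWEB_HOST"],
        # Assumption: fall back to the documented port and database name
        # when the optional variables are unset.
        "port": int(env.get("PGWEB_PORT", "5433")),
        "user": env["PGWEB_USER"],
        "password": env["PGWEB_PASSWORD"],
        "dbname": env.get("PGWEB_DATABASE", "data_centers"),
    }


# Placeholder values for illustration only; real values come from the shell
# environment after sourcing ~/.zsh_secrets.
example_env = {
    "PGWEB_HOST": "localhost",
    "PGWEB_USER": "analyst",
    "PGWEB_PASSWORD": "secret",
}
params = pgweb_conn_params(example_env)
print(params["port"], params["dbname"])  # prints: 5433 data_centers
```

Calling `pgweb_conn_params()` with no argument reads the live environment. The returned dictionary uses libpq parameter names, so it should be usable as keyword arguments with a PostgreSQL driver such as psycopg (`psycopg.connect(**params)`), assuming that package is installed.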