Database Tables Documentation

Database Configuration

Database Name: data_centers
Type: PostgreSQL with PostGIS extension
Connection: Environment variables from ~/.zsh_secrets

  • PGWEB_HOST: Database host
  • PGWEB_PORT: Database port (5433)
  • PGWEB_USER: Database user
  • PGWEB_PASSWORD: Database password
  • PGWEB_DATABASE: Database name (data_centers)
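
As a minimal sketch of how these variables might be consumed from Python, the helper below collects them into keyword arguments. The function name pg_connection_kwargs is hypothetical, and the key names (host, port, user, password, dbname) assume a driver such as psycopg2, whose connect() accepts them. The fallbacks for port and database name simply mirror the values listed above.

```python
import os


def pg_connection_kwargs(env=None):
    """Collect the PGWEB_* settings into driver keyword arguments.

    `env` defaults to os.environ; pass a plain dict for testing.
    """
    env = os.environ if env is None else env
    return {
        "host": env["PGWEB_HOST"],
        # Port 5433 is the value documented above, not the PostgreSQL default (5432).
        "port": int(env.get("PGWEB_PORT", "5433")),
        "user": env["PGWEB_USER"],
        "password": env["PGWEB_PASSWORD"],
        "dbname": env.get("PGWEB_DATABASE", "data_centers"),
    }
```

With psycopg2 installed, the result could be passed as psycopg2.connect(**pg_connection_kwargs()) after sourcing ~/.zsh_secrets.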

Table Organization

Tables are organized into six categories (a few remaining tables are listed under Other Tables):

  1. Core Data Center Tables - Master inventories and source data
  2. Enrichment Tables - Data centers joined with contextual data
  3. Environmental and Election Source Tables - Long-form climate, drought, fire/smoke, and precinct-election source layers
  4. Base Layer Tables - Geographic and demographic reference layers
  5. Infrastructure Tables - Energy and connectivity infrastructure
  6. Legislation Tables - LegiScan state and federal bill data (2016-2026)

Core Data Center Tables

master_data_centers

Rows: 1,833
Purpose: Canonical data center inventory - deduplicated merge of curated + OSM sources

Key Columns:

  • id (INTEGER) - Unique identifier
  • name (TEXT) - Facility name
  • address (TEXT) - Street address
  • city (TEXT) - City
  • state (TEXT) - State code
  • latitude (DOUBLE PRECISION) - Latitude
  • longitude (DOUBLE PRECISION) - Longitude
  • geom (GEOMETRY) - PostGIS point geometry (EPSG:4326)
  • operator (TEXT) - Operator/owner
  • power_mw (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
  • source (TEXT) - Data source (curated, osm, or both)
  • osm_id (TEXT) - OpenStreetMap ID if applicable
  • geocode_method (TEXT) - Geocoding provenance

Notes:

  • 108 of 1,833 facilities have power ratings
  • 45 facilities use city-precision fallback coordinates
  • Operator strings have fragmentation issues ("Meta" vs. "Meta, Inc.")
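
One way to reduce the operator fragmentation noted above is to normalize names before grouping. The sketch below is illustrative only: the suffix list and the normalize_operator helper are assumptions, not part of the existing pipeline, and real operator strings will likely need a hand-curated alias table as well.

```python
import re

# Hypothetical list of corporate suffixes to strip; extend as needed.
_SUFFIXES = r"(?:inc|llc|corp|corporation|co|ltd|lp)"
_SUFFIX_RE = re.compile(rf"[\s,]+{_SUFFIXES}\.?$", re.IGNORECASE)


def normalize_operator(name):
    """Return a case-folded operator key with trailing corporate suffixes removed."""
    if name is None:
        return None
    cleaned = " ".join(name.split())  # collapse repeated whitespace
    previous = None
    while cleaned != previous:  # strip stacked suffixes such as "Corp, LLC"
        previous = cleaned
        cleaned = _SUFFIX_RE.sub("", cleaned).rstrip(" ,.")
    return cleaned.casefold()
```

Under these assumptions, "Meta" and "Meta, Inc." map to the same key, so a GROUP BY on the normalized value would count them together.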

us_dc_sample_geocoded

Rows: 1,489
Purpose: Original curated sample with geocoding provenance (superseded by master_data_centers)

Key Columns:

  • name, address, city, state, zip
  • latitude, longitude, geom
  • operator, power_mw
  • census_lat, census_lon - Census TIGER geocode results
  • nominatim_lat, nominatim_lon - Nominatim fallback results
  • geocode_source - Which geocoder was used

osm_data_centers

Rows: 1,549
Purpose: Raw OpenStreetMap-derived facilities

Key Columns:

  • osm_id (TEXT) - OSM element ID
  • osm_type (TEXT) - node, way, or relation
  • name (TEXT) - OSM name tag
  • latitude, longitude, geom
  • tags (JSONB) - All OSM tags as JSON
  • operator (TEXT) - Extracted from OSM tags
  • city, state, country

Notes: Fetched via Overpass API with query for telecom=data_center or building=data_center

master_data_center_spatial_clusters

Rows: 1,831
Purpose: DBSCAN cluster assignments for master data centers

Key Columns:

  • All columns from master_data_centers
  • cluster_id (INTEGER) - Cluster assignment (-1 = noise/singleton)
  • cluster_size (INTEGER) - Number of facilities in cluster
  • cluster_label (TEXT) - Human-readable cluster name

Notes: DBSCAN parameters: eps=15 km, min_samples=2
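
To illustrate what these parameters imply, here is a small pure-Python sketch. It assumes min_samples counts the point itself (the scikit-learn convention) and that distances are great-circle (haversine) distances. Under those assumptions every facility with at least one other facility within eps is a core point, so the clusters are the connected components of the "within 15 km" graph and isolated facilities get -1. The function names are hypothetical, and the project's actual clustering code is not shown in this document.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def cluster_min_samples_2(points, eps_km=15.0):
    """Label points like DBSCAN with min_samples=2; -1 marks noise/singletons."""
    n = len(points)
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    has_neighbor = [False] * n
    for i in range(n):  # O(n^2) pair scan, acceptable for a few thousand points
        for j in range(i + 1, n):
            if haversine_km(points[i], points[j]) <= eps_km:
                has_neighbor[i] = has_neighbor[j] = True
                parent[find(i)] = find(j)

    labels, root_to_id = [], {}
    for i in range(n):
        if not has_neighbor[i]:
            labels.append(-1)  # no facility within eps: noise
            continue
        root = find(i)
        root_to_id.setdefault(root, len(root_to_id))
        labels.append(root_to_id[root])
    return labels
```

For example, two facilities a few kilometers apart in northern Virginia share a cluster id, while a lone facility in Oregon is labeled -1.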


Enrichment Tables

data_center_census_tracts_2024

Rows: 1,815
Purpose: Per-facility demographics from containing Census tract

Key Columns:

  • All columns from master_data_centers
  • geoid (TEXT) - 11-digit Census tract GEOID
  • state_fips, county_fips, tract
  • Population: total_population, population_density_sq_mi
  • Age: median_age, under_18_pct, over_65_pct
  • Race/Ethnicity: white_nh_pct, black_nh_pct, asian_nh_pct, hispanic_pct
  • Economics: median_household_income, per_capita_income, poverty_rate
  • Education: bachelors_or_higher_pct, high_school_or_higher_pct
  • Housing: median_home_value, median_rent, homeownership_rate
  • Broadband: broadband_pct - Households with broadband subscription

Source: ACS 2024 5-year estimates

Notes:

  • 18 of 1,833 facilities failed tract join (geocoding issues)
  • Data from _dc_census_tract_acs_2024 base table

data_center_watershed_huc8

Rows: 1,833
Purpose: Per-facility watershed assignment

Key Columns:

  • All columns from master_data_centers
  • huc8 (TEXT) - 8-digit Hydrologic Unit Code
  • watershed_name (TEXT) - Watershed name
  • watershed_area_sq_km (DOUBLE PRECISION)
  • states (TEXT) - States intersecting watershed

Source: USGS Watershed Boundary Dataset

Notes: 257 unique HUC8 watersheds contain at least one data center

data_center_nri_exposure

Rows: 1,833
Purpose: Per-facility FEMA National Risk Index hazard exposure scores

Key Columns:

  • All columns from master_data_centers
  • nri_id (TEXT) - Census tract GEOID (matches geoid from demographics)
  • risk_score (DOUBLE PRECISION) - Overall NRI risk score
  • social_vulnerability (DOUBLE PRECISION) - Social vulnerability index
  • Hazard-specific risk scores (18 hazards):
    • avalanche_risk, coastal_flooding_risk, cold_wave_risk
    • drought_risk, earthquake_risk, hail_risk
    • heat_wave_risk, hurricane_risk, ice_storm_risk
    • landslide_risk, lightning_risk, riverine_flooding_risk
    • strong_wind_risk, tornado_risk, tsunami_risk
    • volcanic_activity_risk, wildfire_risk, winter_weather_risk

Source: FEMA National Risk Index (December 2025 release)

data_center_historical_climate

Rows: 1,833
Purpose: One-row-per-facility historical climate summary for data center locations

Key Columns:

  • master_id (TEXT) - FK to master_data_centers
  • source, name, operator, city, state, country
  • latitude, longitude, geom
  • daymet_dataset_version, gridmet_dataset_version
  • climate_period_start, climate_period_end - Current period: 1991-01-01 to 2020-12-31
  • Temperature: mean_annual_temperature_c, mean_summer_temperature_c, max_daily_temperature_c, min_daily_temperature_c
  • Humidity / wet bulb: mean_relative_humidity_pct, mean_wet_bulb_temperature_c, max_wet_bulb_temperature_c, extreme_wet_bulb_days
  • Cooling / heat: cooling_degree_days_c, annual_cooling_degree_days_c_mean, extreme_heat_days, annual_extreme_heat_days_mean
  • Precipitation: precipitation_total_mm, annual_precipitation_mm_mean, annual_precipitation_cv, wet_day_precipitation_p95_mm
  • Wind: mean_wind_speed_ms, max_daily_mean_wind_speed_ms, sustained_wind_days, annual_sustained_wind_days_mean

Source: Daymet + gridMET historical climate data

Notes: Built by historical_climate_data_centers.ipynb / open_meteo_historical_data_centers.ipynb

data_center_usdm_drought_exposure

Rows: 1,833
Purpose: Per-facility drought exposure summary from weekly U.S. Drought Monitor polygons

Key Columns:

  • master_id (TEXT) - FK to master_data_centers
  • source, name, operator, city, state, country
  • latitude, longitude, geom
  • usdm_status - covered or no_coverage
  • drought_period_start, drought_period_end - Current period: 2000-01-04 to 2025-12-30
  • weeks_observed
  • weeks_in_d0_or_worse, weeks_in_d1_or_worse, weeks_in_d2_or_worse, weeks_in_d3_or_worse, weeks_in_d4
  • pct_weeks_in_d0_or_worse, pct_weeks_in_d1_or_worse, pct_weeks_in_d2_or_worse, pct_weeks_in_d3_or_worse, pct_weeks_in_d4
  • worst_dm_category, mean_dm_category
  • longest_d0_streak_weeks, longest_d2_streak_weeks, longest_d3_streak_weeks

Source: U.S. Drought Monitor weekly spatial data

Notes:

  • Summary table is rolled up from data_center_usdm_drought_dc_week
  • dm_category scale: D0-D4, stored as 0-4
  • 1,830 facilities have covered status; 3 have no coverage

data_center_hms_smoke_exposure

Rows: 1,833
Purpose: Per-facility wildfire-smoke exposure summary from NOAA HMS smoke polygons

Key Columns:

  • master_id (TEXT) - FK to master_data_centers
  • source, name, operator, city, state, country
  • latitude, longitude, geom
  • hms_status
  • smoke_period_start, smoke_period_end - Current period: 2005-08-05 to 2026-05-22
  • days_observed
  • days_with_any_smoke, days_with_light_or_worse, days_with_medium_or_worse, days_with_heavy_smoke
  • pct_days_with_any_smoke, pct_days_with_light_or_worse, pct_days_with_medium_or_worse, pct_days_with_heavy_smoke
  • worst_density_rank, worst_density, mean_density_rank
  • longest_any_smoke_streak_days, longest_medium_or_heavy_streak_days, longest_heavy_smoke_streak_days

Source: NOAA Hazard Mapping System (HMS) smoke polygons

Notes:

  • Summary table is rolled up from data_center_hms_smoke_dc_day
  • Density rank: 0 = observed no smoke, 1 = Light, 2 = Medium, 3 = Heavy
  • HMS product path uses NOAA's /FIRE/web/HMS/Smoke_Polygons/ archive

data_center_election_context

Rows: 1,833
Purpose: Standardized one-row-per-facility election context derived from RDH precinct matches

Key Columns:

  • master_id (TEXT) - FK to master_data_centers
  • name, city, state
  • rdh_layer_title
  • precinct_identifier_name
  • election_year, office
  • democratic_votes, republican_votes, total_votes
  • turnout_or_vote_share
  • updated_at

Source: Redistricting Data Hub precinct election shapefiles

Notes:

  • Built from data_center_rdh_precinct_vote_matches plus RDH feature properties
  • Current rows cover 2020-2024 election layers; 1,829 facilities have non-null election year context

data_center_rdh_precinct_vote_matches

Rows: 3,330
Purpose: Spatial join bridge between data centers and RDH precinct vote features

Key Columns:

  • master_id (TEXT) - FK to master_data_centers
  • feature_id (TEXT) - FK to rdh_precinct_vote_features
  • layer_id (TEXT) - FK to rdh_precinct_vote_layers
  • state_code
  • join_method
  • match_distance_m
  • matched_at

Source: Redistricting Data Hub precinct shapefiles

Notes: Spatial join to voting precincts (point-in-polygon, with nearest/fallback logic where needed)


Environmental and Election Source Tables

usdm_drought_weekly

Rows: 12,080
Purpose: Raw weekly U.S. Drought Monitor polygons by drought category

Key Columns:

  • id (BIGINT) - Primary key
  • week_date (DATE)
  • dm_category (SMALLINT) - Drought Monitor category D0-D4 stored as 0-4
  • objectid, shape_leng, shape_area
  • geom (GEOMETRY) - Drought polygon geometry

Source: U.S. Drought Monitor spatial archive

Notes: Source table for data_center_usdm_drought_dc_week

data_center_usdm_drought_dc_week

Rows: ~2.48 million
Purpose: Long-form weekly drought exposure for each covered data center

Key Columns:

  • master_id (TEXT) - FK to master_data_centers
  • week_date (DATE)
  • worst_dm (SMALLINT) - Worst drought category covering the facility that week

Source: Spatial join of master_data_centers to usdm_drought_weekly

Notes:

  • Primary key: (master_id, week_date)
  • worst_dm = -1 indicates an observed week with no drought polygon covering the facility
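
The rollup from this long-form table to data_center_usdm_drought_exposure can be sketched as follows for a single facility. This is an illustration, not the project's code: it assumes percentages are stored on a 0-100 scale and that a D0 streak means consecutive observed weeks at D0 or worse. The function name is hypothetical; the output keys mirror the summary columns documented earlier.

```python
def summarize_drought(worst_dm_by_week):
    """Summarize one facility's weekly worst_dm values (in week order).

    Values follow the documented encoding: -1 = no drought polygon,
    0..4 = D0..D4.
    """
    weeks = len(worst_dm_by_week)
    summary = {"weeks_observed": weeks}
    for level in range(5):
        count = sum(1 for dm in worst_dm_by_week if dm >= level)
        key = f"d{level}_or_worse" if level < 4 else "d4"
        summary[f"weeks_in_{key}"] = count
        # Assumption: percentages are on a 0-100 scale.
        summary[f"pct_weeks_in_{key}"] = 100.0 * count / weeks if weeks else None

    longest = current = 0
    for dm in worst_dm_by_week:
        current = current + 1 if dm >= 0 else 0  # reset on a drought-free week
        longest = max(longest, current)
    summary["longest_d0_streak_weeks"] = longest
    return summary
```

Because -1 is less than every drought level, drought-free weeks count toward weeks_observed but never toward any "or worse" total.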

hms_smoke_days

Rows: 7,075
Purpose: One row per observed NOAA HMS smoke product day, including zero-polygon days

Key Columns:

  • smoke_date (DATE) - Primary key
  • source, source_file, source_url
  • feature_count (INTEGER) - Number of smoke polygons for the day
  • fetched_at, updated_at

Source: NOAA HMS smoke polygon archive

Notes: Denominator table for daily smoke-exposure percentages

hms_smoke_daily

Rows: 536,286
Purpose: Raw daily NOAA HMS smoke polygons with density categories

Key Columns:

  • id (BIGINT) - Primary key
  • smoke_date (DATE) - FK to hms_smoke_days
  • satellite
  • start_raw, end_raw, start_utc, end_utc
  • density, density_rank
  • source, source_file, source_url
  • geom (GEOMETRY) - Smoke polygon geometry

Source: NOAA Hazard Mapping System (HMS) smoke polygons

Notes: Density rank 1-3 corresponds to Light, Medium, Heavy

data_center_hms_smoke_dc_day

Rows: ~13.9 million
Purpose: Long-form daily smoke exposure for each data center and observed HMS product day

Key Columns:

  • master_id (TEXT) - FK to master_data_centers
  • smoke_date (DATE) - FK to hms_smoke_days
  • max_density_rank (SMALLINT) - Maximum smoke density covering the facility on that date
  • polygon_hits (INTEGER)

Source: Spatial join of master_data_centers to hms_smoke_daily

Notes:

  • Primary key: (master_id, smoke_date)
  • max_density_rank = 0 indicates an observed HMS day with no smoke polygon covering the facility
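
A rough sketch of how one facility's daily rows could be derived is shown below. It assumes the density column holds the strings Light, Medium and Heavy, that polygon_hits counts the smoke polygons covering the facility that day, and that the spatial join has already produced the (date, density) hits. The function name is hypothetical.

```python
from collections import defaultdict

# Rank encoding as documented: 1 = Light, 2 = Medium, 3 = Heavy (0 = no smoke).
DENSITY_RANK = {"Light": 1, "Medium": 2, "Heavy": 3}


def daily_smoke_rows(observed_days, hits):
    """Build one row per observed HMS day for a single facility.

    observed_days: ISO date strings from hms_smoke_days.
    hits: (smoke_date, density) pairs for polygons covering the facility.
    """
    per_day = defaultdict(list)
    for smoke_date, density in hits:
        per_day[smoke_date].append(DENSITY_RANK[density])

    rows = []
    for smoke_date in sorted(observed_days):  # ISO strings sort chronologically
        ranks = per_day.get(smoke_date, [])
        rows.append({
            "smoke_date": smoke_date,
            "max_density_rank": max(ranks, default=0),  # 0 = observed, no smoke
            "polygon_hits": len(ranks),
        })
    return rows
```

Iterating over the observed days rather than the hits is what keeps zero-smoke days in the output, so they remain in the denominator of the percentage columns.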

rdh_precinct_vote_layers

Rows: 69
Purpose: Metadata for downloaded RDH precinct election layers

Key Columns:

  • layer_id (TEXT) - Primary key
  • state_code
  • title
  • format
  • datasetid
  • source_url
  • filename, local_path, spatial_path
  • metadata (JSONB)
  • loaded_at

Source: Redistricting Data Hub precinct election datasets

Notes: Current loaded layers cover 45 distinct state codes

rdh_precinct_vote_features

Rows: 260,953
Purpose: Staged RDH precinct polygons and source attributes

Key Columns:

  • feature_id (TEXT) - Primary key
  • layer_id (TEXT) - FK to rdh_precinct_vote_layers
  • state_code
  • source_row
  • properties (JSONB) - Raw RDH election attributes
  • geom (GEOMETRY) - Precinct polygon geometry

Source: Redistricting Data Hub precinct election shapefiles

Notes: Source feature table for data_center_rdh_precinct_vote_matches


Base Layer Tables

_dc_census_tract_acs_2024

Rows: 85,382
Purpose: ACS 2024 demographics for all Census tracts in states with data centers

Key Columns:

  • geoid (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
  • name (TEXT) - Tract name
  • state_fips, county_fips, tract
  • Full ACS 5-year estimates (85+ columns):
    • Population by age, sex, race/ethnicity
    • Households, families, housing units
    • Income, poverty, education, employment
    • Housing values, rents, costs
    • Broadband, computer access
    • Commuting, vehicles

Source: Census ACS 2024 5-year estimates API

Notes: Universe limited to the 46 states that contain data centers (states without any data centers are excluded)

_dc_census_tract_boundaries_2024

Rows: 85,058
Purpose: TIGER 2024 tract polygons for data center states

Key Columns:

  • geoid (TEXT) - 11-digit tract GEOID
  • name (TEXT) - Tract name
  • state_fips, county_fips, tract_code
  • geom (GEOMETRY) - Polygon geometry (EPSG:4326)
  • area_land_sq_m (DOUBLE PRECISION) - Land area in square meters
  • area_water_sq_m (DOUBLE PRECISION) - Water area in square meters

Source: Census TIGER/Line 2024

ruca_codes_2020_tract

Rows: 85,528
Purpose: USDA Rural-Urban Commuting Area codes for metro/rural classification

Key Columns:

  • geoid (TEXT) - 11-digit tract GEOID (matches Census tracts)
  • ruca_code (TEXT) - Primary RUCA code (1-10)
  • ruca_category (TEXT) - Simplified category:
    • Metropolitan (codes 1-3)
    • Micropolitan (codes 4-6)
    • Small town (codes 7-9)
    • Rural (code 10)
  • ruca_description (TEXT) - Full RUCA code description
  • population_2020 (INTEGER)
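
The code-to-category mapping above is simple enough to express directly. The sketch below uses a hypothetical helper name and assumes ruca_code holds a whole-number primary code as text. Codes outside 1-10 (for example a "not coded" placeholder) return None rather than a category.

```python
def ruca_category(ruca_code):
    """Map a primary RUCA code (1-10, stored as text) to its simplified category."""
    code = int(ruca_code)
    if 1 <= code <= 3:
        return "Metropolitan"
    if 4 <= code <= 6:
        return "Micropolitan"
    if 7 <= code <= 9:
        return "Small town"
    if code == 10:
        return "Rural"
    return None  # outside the documented 1-10 range
```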

Source: USDA Economic Research Service RUCA 2020

Notes:

  • Based on 2020 Census tracts and 2010-2020 commuting patterns
  • 7 data centers failed RUCA join (Puerto Rico / non-US)

watershed_huc8

Rows: 2,139
Purpose: USGS HUC8 subbasin polygons for water-stress analysis

Key Columns:

  • huc8 (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
  • name (TEXT) - Watershed name
  • geom (GEOMETRY) - Polygon geometry (EPSG:4326)
  • area_sq_km (DOUBLE PRECISION)
  • states (TEXT) - Comma-separated state codes
  • dc_count (INTEGER) - Number of data centers in watershed

Source: USGS Watershed Boundary Dataset

Notes:

  • 257 of 2,139 watersheds contain at least one data center
  • Top 15 watersheds contain 50% of all US data centers
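
A concentration figure like the one above can be checked from the dc_count column. The helper below is a hypothetical sketch: it sorts the per-watershed counts in descending order and reports how many of the largest watersheds are needed to reach a target share of facilities.

```python
def watersheds_needed(dc_counts, target_share=0.5):
    """Return how many of the largest watersheds reach target_share of all facilities."""
    total = sum(dc_counts)
    if total == 0:
        return 0
    running = 0
    for k, count in enumerate(sorted(dc_counts, reverse=True), start=1):
        running += count
        if running / total >= target_share:
            return k
    return len(dc_counts)
```

Fed with the dc_count values for all rows of watershed_huc8, a result of 15 would correspond to the note above.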

nri_census_tracts

Rows: ~84,000
Purpose: Full FEMA National Risk Index by Census tract

Key Columns:

  • nri_id (TEXT) - Census tract GEOID
  • state_name, county_name, tract_name
  • 460+ columns including:
    • Overall risk scores and ratings
    • Expected annual loss (dollars and building value %)
    • Social vulnerability components (15 factors)
    • Community resilience score
    • Individual hazard risk scores (18 hazards)
    • Exposure, annualized frequency, historic loss ratios per hazard

Source: FEMA National Risk Index v2.1 (December 2025)


Infrastructure Tables

Energy Infrastructure

energy_eia_operating_generator_capacity_flat

Rows: 4.7 million
Purpose: EIA generator inventory with lat/lon/MW (monthly 2008-2026)

Key Columns:

  • plant_id (INTEGER) - EIA plant ID
  • generator_id (TEXT) - Generator unit ID
  • plant_name (TEXT)
  • latitude, longitude, geom
  • state, county
  • utility_name, operator_name
  • nameplate_capacity_mw (DOUBLE PRECISION)
  • technology (TEXT) - Generation technology
  • energy_source_1, energy_source_2 - Primary and secondary fuel codes
  • operating_month, operating_year - When unit became operational
  • status (TEXT) - Operating, standby, retired, etc.
  • report_month, report_year - Data snapshot date

Source: EIA Form 860 via API

Notes:

  • "Flat" means denormalized for fast spatial queries
  • Each generator-month is a row (4.7M rows from monthly snapshots)
  • Use for proximity analysis (e.g., "all generators within 50 km of data center")
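
Because each generator appears once per monthly snapshot, summing nameplate_capacity_mw over the whole table counts the same unit many times. A minimal sketch of one way to guard against that in Python is shown below: keep only rows from the most recent snapshot before aggregating. It assumes report_year and report_month are numeric, and the helper name is hypothetical.

```python
def latest_snapshot(rows):
    """Keep only rows belonging to the most recent (report_year, report_month)."""
    rows = list(rows)
    if not rows:
        return []
    newest = max((r["report_year"], r["report_month"]) for r in rows)
    return [r for r in rows if (r["report_year"], r["report_month"]) == newest]
```

Summing capacity over latest_snapshot(rows) then gives one contribution per generator instead of one per generator-month.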

energy_eia_facility_fuel_flat

Rows: Not loaded yet
Purpose: Monthly generation by plant/fuel

Key Columns:

  • plant_id, plant_name
  • report_month, report_year
  • energy_source (TEXT) - Fuel code
  • net_generation_mwh (DOUBLE PRECISION)
  • fuel_consumed_mmbtu (DOUBLE PRECISION)

Source: EIA Form 923 via API

Notes: Target table defined in ingest_eia_energy_layers.py; current database does not yet have public.energy_eia_facility_fuel_flat.

energy_eia_seds_flat

Rows: 2.57 million
Purpose: Annual state energy consumption/production (1960-2024)

Key Columns:

  • state_code (TEXT)
  • year (INTEGER)
  • msn (TEXT) - Mnemonic series names (e.g., TETCB = total energy consumption)
  • value (DOUBLE PRECISION) - Energy in trillion BTU
  • unit (TEXT)
  • description (TEXT) - Human-readable MSN description

Source: EIA State Energy Data System (SEDS)

Notes:

  • Annual aggregates by state
  • Use for state-level energy context analysis

Connectivity Infrastructure

internet_cables

Rows: 693
Purpose: Submarine cable routes

Key Columns:

  • cable_id (TEXT) - Unique cable identifier
  • cable_name (TEXT) - Official cable name
  • geom (GEOMETRY) - LineString geometry (EPSG:4326)
  • rfs_year (INTEGER) - Ready For Service year
  • length_km (DOUBLE PRECISION)
  • owners (TEXT[]) - Array of owner names
  • landing_points (TEXT[]) - Array of landing point names

Source: TeleGeography-style cable database

Notes:

  • 693 unique submarine cables
  • Geometry is approximate route (not exact seabed path)

internet_cable_landing_points

Rows: 3,361
Purpose: Cable landing points (where cables come ashore)

Key Columns:

  • landing_point_id (TEXT) - Unique identifier
  • name (TEXT) - Landing point name
  • city, country
  • latitude, longitude, geom
  • cables (TEXT[]) - Array of cable names landing at this point
  • cable_count (INTEGER)

Source: TeleGeography-style cable database

Notes:

  • Used for proximity analysis (how close are data centers to cable landings?)
  • Key finding: Data centers are NOT systematically closer to cables than ordinary US cities

internet_city_dominance

Rows: 4,552
Purpose: City-level IPs/capacity (internet hub strength proxy)

Key Columns:

  • city (TEXT)
  • country (TEXT)
  • latitude, longitude, geom
  • ip_addresses (INTEGER) - Number of routable IP addresses
  • capacity_rank (INTEGER) - Relative capacity ranking

Source: Internet topology datasets

Notes: Proxy for "internet hub" strength (not directly used in main analyses)


Broadband

fcc_bdc_location_provider_aggregates

Rows: Varies
Purpose: FCC BDC provider availability aggregated by county/tract

Key Columns:

  • geoid (TEXT) - County or tract GEOID
  • geography_level (TEXT) - county or tract
  • provider_count (INTEGER)
  • technology_counts (JSONB) - Count by technology type
  • max_download_mbps, max_upload_mbps

Source: FCC Broadband Data Collection (BDC)

fcc_bdc_broadband_connection_table

Rows: Varies
Purpose: Per-data-center broadband provider availability

Key Columns:

  • Data center identifiers
  • provider_id, provider_name
  • technology (TEXT)
  • max_advertised_download_speed, max_advertised_upload_speed
  • low_latency (BOOLEAN)

Source: FCC BDC, joined to data center locations

Notes: Built by build_fcc_bdc_broadband_connection_table.py


Other Tables

opposition_cases_geocoded

Rows: 18
Purpose: Geocoded community-opposition cases against data center builds

Key Columns:

  • case_id (TEXT) - Unique identifier
  • developer (TEXT) - Proposed developer/operator
  • investment_billions (DOUBLE PRECISION) - Investment amount in billions
  • outcome (TEXT) - Case outcome (approved, rejected, pending)
  • governance_response (TEXT) - Government response
  • latitude, longitude, geom

Source: Compiled from news archives

Notes: Loaded but currently unused - see research-ideas.md for proposed analyses

Rows: 806
Purpose: Tract↔HUC8 spatial overlap table

Key Columns:

  • geoid (TEXT) - Census tract GEOID
  • huc8 (TEXT) - HUC8 watershed code
  • overlap_pct (DOUBLE PRECISION) - Percentage of tract overlapping watershed

Notes: Useful for downstream tract-level water-stress joins; limited to tracts containing data centers

im3_state_projected_moderate_50

Rows: 328
Purpose: PNNL IM3 projected data center siting (moderate growth, gravity weight 0.50)

Key Columns:

  • facility_id (TEXT)
  • state (TEXT)
  • cost_millions (DOUBLE PRECISION)
  • it_mw (DOUBLE PRECISION) - IT load in megawatts
  • cooling_water_demand_gal_per_day (DOUBLE PRECISION)
  • latitude, longitude, geom

Source: PNNL Integrated Multisector Multiscale Modeling (IM3)

Notes: Loaded but unused - potential for forward-projection analysis

im3_projected_state_demand_summary

Rows: 31
Purpose: State-level rollup of IM3 projected facility counts, IT MW, and cooling demand

Key Columns:

  • state (TEXT)
  • facility_count (INTEGER)
  • total_it_mw (DOUBLE PRECISION)
  • total_cooling_demand_mgd (DOUBLE PRECISION) - Million gallons per day

Source: IM3 model outputs

utility_rate_tracker_2025_2028

Rows: 374
Purpose: Utility rate-increase tracker by provider × state × service type

Key Columns:

  • provider (TEXT) - Utility provider name
  • state (TEXT)
  • service_type (TEXT)
  • effective_date (DATE)
  • monthly_increase_dollars (DOUBLE PRECISION)
  • percent_increase (DOUBLE PRECISION)

Source: Utility rate tracker database

Notes: Loaded but unused in demographic/energy analysis

energy_atlas_layers_catalog

Rows: ~5
Purpose: Metadata catalog of EIA layers ingested

Key Columns:

  • table_name (TEXT)
  • source_url (TEXT)
  • import_timestamp (TIMESTAMP)
  • row_count (INTEGER)

Notes: Created by ingest_eia_energy_layers.py



Legislation Tables

Populated by ingest_legiscan.py using the LegiScan API.
Covers all 50 states + DC + US Congress, sessions from 2016 through 2026.
Data licensed CC BY 4.0 — attribute LegiScan LLC.

legiscan_sessions

Rows: 646
Purpose: One row per legislative session dataset downloaded from LegiScan

Key Columns:

  • session_id (INTEGER) - LegiScan session ID (PRIMARY KEY)
  • state_abbr (VARCHAR) - Two-letter state code (CA, TX, US, etc.)
  • state_id (INTEGER) - LegiScan numeric state ID
  • year_start, year_end (INTEGER) - Session year range
  • session_title (TEXT) - Full session name
  • session_tag (TEXT) - Short tag (e.g., "Regular Session", "1st Special Session")
  • is_special (BOOLEAN) - True for special/extraordinary sessions
  • is_prior (BOOLEAN) - True for completed/sine-die sessions
  • dataset_hash (VARCHAR) - MD5 of dataset ZIP; used to detect updates
  • dataset_date (DATE) - Date dataset was last published by LegiScan
  • dataset_size_mb (FLOAT) - Compressed ZIP size
  • bill_count (INTEGER) - Number of bills loaded from this session
  • imported_at (TIMESTAMPTZ) - When this session was last imported

legiscan_bills

Rows: ~1,313,000
Purpose: All bills from all sessions; tagged for relevance to data center research topics

Key Columns:

  • bill_id (INTEGER) - LegiScan bill ID (PRIMARY KEY)
  • session_id (INTEGER) - FK → legiscan_sessions
  • state (VARCHAR) - Two-letter state code
  • bill_number (VARCHAR) - Bill number (e.g., SB 1000, HB 233)
  • bill_type (VARCHAR) - B=Bill, R=Resolution, CR=Concurrent Resolution, etc.
  • title (TEXT) - Short title
  • description (TEXT) - Longer description
  • status (INTEGER) - Current status code (see below)
  • status_date (DATE) - Date of last status change
  • completed (INTEGER) - 1 if bill is in a terminal state
  • body (VARCHAR) - Originating chamber (H=House, S=Senate, C=Council, etc.)
  • url (TEXT) - LegiScan bill page URL
  • state_link (TEXT) - Official state legislature URL
  • change_hash (VARCHAR) - MD5 used to detect bill-level updates
  • subjects (TEXT[]) - LegiScan subject tags (GIN indexed)
  • sponsor_count (INTEGER) - Number of sponsors
  • vote_count (INTEGER) - Number of recorded votes
  • text_count (INTEGER) - Number of bill text versions
  • is_relevant (BOOLEAN) - True if any relevance tag matched (indexed)
  • relevance_tags (TEXT[]) - Matched topic tags (GIN indexed)
  • imported_at (TIMESTAMPTZ) - When this bill was last upserted

Status codes: 1=Introduced, 2=Engrossed, 3=Enrolled, 4=Passed, 5=Vetoed, 6=Failed, 7=Override, 8=Chaptered, 9=Referred, 12=Draft
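
When post-processing query results in Python, the status codes listed above can be decoded with a small lookup. The names BILL_STATUS and status_label are hypothetical. Codes not listed in this document fall through to an "Unknown" label rather than being guessed.

```python
# Status codes exactly as listed in this document.
BILL_STATUS = {
    1: "Introduced",
    2: "Engrossed",
    3: "Enrolled",
    4: "Passed",
    5: "Vetoed",
    6: "Failed",
    7: "Override",
    8: "Chaptered",
    9: "Referred",
    12: "Draft",
}


def status_label(code):
    """Return the human-readable label for a legiscan_bills.status value."""
    return BILL_STATUS.get(code, f"Unknown ({code})")
```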

Relevance tags (keyword-matched against title + description + subjects):

| Tag | What it captures |
|---|---|
| data_center | Data centers, hyperscale, colocation, AI campuses, HPC facilities |
| large_load | Crypto mining, large industrial loads, extraordinary load |
| ratepayer_protection | Cost shifting, cross-subsidy, rate design, affordability, rate class |
| grid_impact | Grid reliability, transmission, interconnection queue, IRP |
| tax_incentive | Tax exemptions, abatements, credits for facilities |
| energy_policy | Renewable PPAs, green tariffs, clean electricity, decarbonization |
| water_use | Cooling water, evaporative cooling, water footprint |
| siting_permitting | Zoning, conditional use permits, local control, preemption |
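
Keyword matching of this kind can be sketched in a few lines. The keyword lists below are hypothetical stand-ins loosely based on the tag descriptions above, since the actual lists used by ingest_legiscan.py are not shown in this document. The matching rule (case-insensitive, anchored at a word start) is likewise an assumption.

```python
import re

# Hypothetical keyword lists; only three tags are shown for brevity.
TAG_KEYWORDS = {
    "data_center": ["data center", "hyperscale", "colocation"],
    "water_use": ["cooling water", "evaporative cooling", "water footprint"],
    "tax_incentive": ["tax exemption", "abatement"],
}


def relevance_tags(title, description="", subjects=()):
    """Return the tags whose keywords appear in title + description + subjects."""
    text = " ".join([title or "", description or "", *subjects]).lower()
    tags = []
    for tag, keywords in TAG_KEYWORDS.items():
        # \b anchors the match at a word start; plurals such as "data centers" still match.
        if any(re.search(rf"\b{re.escape(k)}", text) for k in keywords):
            tags.append(tag)
    return tags
```

Under this sketch, is_relevant would simply be bool(relevance_tags(...)).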

Notes:

  • ~60,000 relevant bills out of 1.3M total (~4.6%)
  • data_center tag: ~2,182 bills; ratepayer_protection: ~49,000
  • GIN indexes on subjects, relevance_tags, and full-text (title || description)
  • Use query_legiscan_bills.sql for pre-built research queries
  • Each week, re-run python ingest_legiscan.py --fetch --load to pick up dataset updates
  • Re-run python ingest_legiscan.py --tag after editing keyword lists

Commonly Used Joins

Data Center to Demographics

SELECT 
    dc.*,
    ct.median_household_income,
    ct.bachelors_or_higher_pct,
    ct.broadband_pct
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct 
    ON dc.id = ct.id;

Data Center to Watershed

SELECT 
    dc.*,
    w.huc8,
    w.name AS watershed_name  -- watershed_huc8 stores the name in its name column
FROM master_data_centers dc
JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
JOIN watershed_huc8 w ON dw.huc8 = w.huc8;

Data Center to Energy Infrastructure (50 km radius)

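SELECT 
    dc.id,
    dc.name,
    SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
FROM master_data_centers dc
JOIN energy_eia_operating_generator_capacity_flat eg
    ON ST_DWithin(
        dc.geom::geography,
        eg.geom::geography,
        50000  -- 50 km in meters
    )
WHERE eg.status = 'OP'  -- Operating only
  -- The table holds one row per generator-month, so restrict to a single
  -- snapshot (here the most recent) to avoid summing capacity across months.
  -- Assumes report_year and report_month sort numerically.
  AND (eg.report_year, eg.report_month) = (
        SELECT report_year, report_month
        FROM energy_eia_operating_generator_capacity_flat
        ORDER BY report_year DESC, report_month DESC
        LIMIT 1
    )
GROUP BY dc.id, dc.name;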

Data Center to FEMA Hazard Risk

SELECT 
    dc.*,
    nri.risk_score,
    nri.wildfire_risk,
    nri.drought_risk,
    nri.heat_wave_risk
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;

Table Naming Conventions

  • master_* - Canonical, deduplicated tables (use these for analysis)
  • data_center_* - Data center-specific enrichment tables
  • _dc_* - Base layers scoped to data center states (underscore prefix = private/internal)
  • energy_eia_* - EIA energy data
  • internet_* - Connectivity infrastructure
  • fcc_bdc_* - FCC Broadband Data Collection

Indexes and Performance

Every geom column has a GiST spatial index for fast spatial joins:

CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);

Key geoid columns are indexed for fast demographic joins:

CREATE INDEX idx_tablename_geoid ON tablename(geoid);

Maintenance Notes

Updating Data Centers

  1. Run load_postgis_osm_data_centers.py to refresh OSM data
  2. Run build_master_data_centers.py to rebuild master table
  3. Run enrichment scripts to update joins

Updating Demographics

  1. Update _dc_census_tract_acs_2024 from Census API
  2. Run create_data_center_census_tract_table.py --replace-final

Updating Energy Data

python3 ingest_eia_energy_layers.py --category power --update

Schema Export

To export the full schema:

pg_dump -h "$PGWEB_HOST" -p "$PGWEB_PORT" -U "$PGWEB_USER" -d data_centers --schema-only > schema.sql

To list all tables:

SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

Contact

For database access or questions, contact the repository owner.