data-centers/database-tables.md
dadams 4525ea3f97 Add LegiScan legislation ingestion and analysis queries
Adds ingest_legiscan.py to pull all US state + federal bills (2016-2026)
from the LegiScan API into legiscan_sessions and legiscan_bills tables.
Bills are keyword-tagged across 8 research categories (data_center,
ratepayer_protection, large_load, grid_impact, tax_incentive, etc.).
Loads ~1.3M bills; ~60K tagged relevant. Adds query_legiscan_bills.sql
with pre-built analysis queries including state/DC joins. Updates
database-tables.md, README.md, and research-ideas.md accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:30:31 -07:00

Database Tables Documentation

Database Configuration

Database Name: data_centers
Type: PostgreSQL with PostGIS extension
Connection: Environment variables from ~/.zsh_secrets

  • PGWEB_HOST: Database host
  • PGWEB_PORT: Database port (typically 5432)
  • PGWEB_USER: Database user
  • PGWEB_PASSWORD: Database password
  • PGWEB_DATABASE: Database name (data_centers)
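
As a minimal sketch of how a script might consume these variables, the helper below collects the PGWEB_* values into a keyword dictionary. The function name pgweb_conn_params and the fallback defaults are assumptions for illustration, not code from this repository. The keys match the keyword arguments accepted by psycopg2.connect, assuming that driver is the one in use.

```python
import os
from typing import Mapping, Optional


def pgweb_conn_params(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect PGWEB_* variables into connection keyword arguments.

    Hypothetical helper: the name and the defaults are illustrative.
    """
    env = os.environ if env is None else env
    required = ("PGWEB_HOST", "PGWEB_USER", "PGWEB_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "host": env["PGWEB_HOST"],
        # Fall back to the typical PostgreSQL port when unset or empty.
        "port": int(env.get("PGWEB_PORT") or "5432"),
        "user": env["PGWEB_USER"],
        "password": env["PGWEB_PASSWORD"],
        "dbname": env.get("PGWEB_DATABASE") or "data_centers",
    }
```

With psycopg2 installed, the result could be passed as psycopg2.connect(**pgweb_conn_params()) after sourcing ~/.zsh_secrets in the calling shell.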

Table Organization

Tables are organized into five main categories (a handful of miscellaneous datasets are documented separately under Other Tables):

  1. Core Data Center Tables - Master inventories and source data
  2. Enrichment Tables - Data centers joined with contextual data
  3. Base Layer Tables - Geographic and demographic reference layers
  4. Infrastructure Tables - Energy and connectivity infrastructure
  5. Legislation Tables - LegiScan state and federal bill data (2016-2026)

Core Data Center Tables

master_data_centers

Rows: 1,833
Purpose: Canonical data center inventory - deduplicated merge of curated + OSM sources

Key Columns:

  • id (INTEGER) - Unique identifier
  • name (TEXT) - Facility name
  • address (TEXT) - Street address
  • city (TEXT) - City
  • state (TEXT) - State code
  • latitude (DOUBLE PRECISION) - Latitude
  • longitude (DOUBLE PRECISION) - Longitude
  • geom (GEOMETRY) - PostGIS point geometry (EPSG:4326)
  • operator (TEXT) - Operator/owner
  • power_mw (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
  • source (TEXT) - Data source (curated, osm, or both)
  • osm_id (TEXT) - OpenStreetMap ID if applicable
  • geocode_method (TEXT) - Geocoding provenance

Notes:

  • 108 of 1,833 facilities have power ratings
  • 45 facilities use city-precision fallback coordinates
  • Operator strings have fragmentation issues ("Meta" vs. "Meta, Inc.")
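
One way to work around the operator fragmentation when grouping is to derive a comparison key. The sketch below is an assumption about how that could be done, not the repository's method: the suffix list is hypothetical and would need tuning against the real operator values.

```python
import re

# Hypothetical list of trailing corporate suffixes; extend after
# reviewing the actual operator strings in master_data_centers.
_SUFFIX = re.compile(
    r"[\s,]+(inc|llc|ltd|corp|corporation|co|company)\.?$",
    re.IGNORECASE,
)


def normalize_operator(raw):
    """Return a case-insensitive grouping key, or None for blank input."""
    if raw is None:
        return None
    name = " ".join(raw.split())  # collapse runs of whitespace
    previous = None
    while name and name != previous:  # strip stacked suffixes ("Co., Inc.")
        previous = name
        name = _SUFFIX.sub("", name)
    name = name.strip(" ,.")
    return name.casefold() or None
```

Under these rules "Meta" and "Meta, Inc." map to the same key. Distinct legal entities that share a brand name would also collapse together, so the key is better suited to exploratory grouping than to overwriting the stored operator value.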

us_dc_sample_geocoded

Rows: 1,489
Purpose: Original curated sample with geocoding provenance (superseded by master_data_centers)

Key Columns:

  • name, address, city, state, zip
  • latitude, longitude, geom
  • operator, power_mw
  • census_lat, census_lon - Census TIGER geocode results
  • nominatim_lat, nominatim_lon - Nominatim fallback results
  • geocode_source - Which geocoder was used

osm_data_centers

Rows: 1,549
Purpose: Raw OpenStreetMap-derived facilities

Key Columns:

  • osm_id (TEXT) - OSM element ID
  • osm_type (TEXT) - node, way, or relation
  • name (TEXT) - OSM name tag
  • latitude, longitude, geom
  • tags (JSONB) - All OSM tags as JSON
  • operator (TEXT) - Extracted from OSM tags
  • city, state, country

Notes: Fetched via Overpass API with query for telecom=data_center or building=data_center

master_data_center_spatial_clusters

Rows: 1,831
Purpose: DBSCAN cluster assignments for master data centers

Key Columns:

  • All columns from master_data_centers
  • cluster_id (INTEGER) - Cluster assignment (-1 = noise/singleton)
  • cluster_size (INTEGER) - Number of facilities in cluster
  • cluster_label (TEXT) - Human-readable cluster name

Notes: DBSCAN parameters: eps=15 km, min_samples=2
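
To make the parameters concrete, here is a dependency-free sketch of what they imply. It assumes min_samples counts the point itself (the scikit-learn convention); under that assumption every facility with at least one neighbor within eps is a core point, so the clusters are exactly the connected components of the "within 15 km" graph and isolated facilities get -1. The function names are illustrative, and the actual clustering script is not shown in this document.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cluster_min_samples_2(points, eps_km=15.0):
    """Label points as DBSCAN would with min_samples=2; -1 marks noise."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) pair scan: acceptable for a couple of thousand facilities.
    for i in range(n):
        for j in range(i + 1, n):
            if haversine_km(points[i], points[j]) <= eps_km:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1

    labels, ids = [], {}
    for i in range(n):
        root = find(i)
        if sizes[root] < 2:
            labels.append(-1)  # no neighbor within eps: noise/singleton
        else:
            labels.append(ids.setdefault(root, len(ids)))
    return labels
```

Cluster IDs here are numbered in order of first appearance, which may differ from the IDs stored in cluster_id even when the groupings agree.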


Enrichment Tables

data_center_census_tracts_2024

Rows: 1,815
Purpose: Per-facility demographics from containing Census tract

Key Columns:

  • All columns from master_data_centers
  • geoid (TEXT) - 11-digit Census tract GEOID
  • state_fips, county_fips, tract
  • Population: total_population, population_density_sq_mi
  • Age: median_age, under_18_pct, over_65_pct
  • Race/Ethnicity: white_nh_pct, black_nh_pct, asian_nh_pct, hispanic_pct
  • Economics: median_household_income, per_capita_income, poverty_rate
  • Education: bachelors_or_higher_pct, high_school_or_higher_pct
  • Housing: median_home_value, median_rent, homeownership_rate
  • Broadband: broadband_pct - Households with broadband subscription

Source: ACS 2024 5-year estimates

Notes:

  • 18 of 1,833 facilities failed tract join (geocoding issues)
  • Data from _dc_census_tract_acs_2024 base table

data_center_watershed_huc8

Rows: 1,833
Purpose: Per-facility watershed assignment

Key Columns:

  • All columns from master_data_centers
  • huc8 (TEXT) - 8-digit Hydrologic Unit Code
  • watershed_name (TEXT) - Watershed name
  • watershed_area_sq_km (DOUBLE PRECISION)
  • states (TEXT) - States intersecting watershed

Source: USGS Watershed Boundary Dataset

Notes: 257 unique HUC8 watersheds contain at least one data center

data_center_nri_exposure

Rows: 1,833
Purpose: Per-facility FEMA National Risk Index hazard exposure scores

Key Columns:

  • All columns from master_data_centers
  • nri_id (TEXT) - Census tract GEOID (matches geoid from demographics)
  • risk_score (DOUBLE PRECISION) - Overall NRI risk score
  • social_vulnerability (DOUBLE PRECISION) - Social vulnerability index
  • Hazard-specific risk scores (18 hazards):
    • avalanche_risk, coastal_flooding_risk, cold_wave_risk
    • drought_risk, earthquake_risk, hail_risk
    • heat_wave_risk, hurricane_risk, ice_storm_risk
    • landslide_risk, lightning_risk, riverine_flooding_risk
    • strong_wind_risk, tornado_risk, tsunami_risk
    • volcanic_activity_risk, wildfire_risk, winter_weather_risk

Source: FEMA National Risk Index (December 2025 release)

data_center_rdh_precinct_vote_matches

Rows: Varies
Purpose: Per-facility precinct-level election results

Key Columns:

  • Data center identifiers
  • precinct_name, precinct_id
  • election_year, office
  • candidate, party, votes
  • vote_share_pct

Source: Redistricting Data Hub precinct shapefiles

Notes: Spatial join to voting precincts (point-in-polygon)


Base Layer Tables

_dc_census_tract_acs_2024

Rows: 85,382
Purpose: ACS 2024 demographics for all Census tracts in states with data centers

Key Columns:

  • geoid (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
  • name (TEXT) - Tract name
  • state_fips, county_fips, tract
  • Full ACS 5-year estimates (85+ columns):
    • Population by age, sex, race/ethnicity
    • Households, families, housing units
    • Income, poverty, education, employment
    • Housing values, rents, costs
    • Broadband, computer access
    • Commuting, vehicles

Source: Census ACS 2024 5-year estimates API

Notes: Universe limited to the 46 states that contain at least one data center (states with no data centers are excluded)

_dc_census_tract_boundaries_2024

Rows: 85,058
Purpose: TIGER 2024 tract polygons for data center states

Key Columns:

  • geoid (TEXT) - 11-digit tract GEOID
  • name (TEXT) - Tract name
  • state_fips, county_fips, tract_code
  • geom (GEOMETRY) - Polygon geometry (EPSG:4326)
  • area_land_sq_m (DOUBLE PRECISION) - Land area in square meters
  • area_water_sq_m (DOUBLE PRECISION) - Water area in square meters

Source: Census TIGER/Line 2024

ruca_codes_2020_tract

Rows: 85,528
Purpose: USDA Rural-Urban Commuting Area codes for metro/rural classification

Key Columns:

  • geoid (TEXT) - 11-digit tract GEOID (matches Census tracts)
  • ruca_code (TEXT) - Primary RUCA code (1-10)
  • ruca_category (TEXT) - Simplified category:
    • Metropolitan (codes 1-3)
    • Micropolitan (codes 4-6)
    • Small town (codes 7-9)
    • Rural (code 10)
  • ruca_description (TEXT) - Full RUCA code description
  • population_2020 (INTEGER)

Source: USDA Economic Research Service RUCA 2020

Notes:

  • Based on 2020 Census tracts and 2010-2020 commuting patterns
  • 7 data centers failed RUCA join (Puerto Rico / non-US)
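
The code-to-category mapping listed under ruca_category can be written as a small function. This is a sketch of the documented ranges only. The function name is illustrative, and treating unparseable or out-of-range values (such as a "not coded" value) as None is an assumption.

```python
def ruca_category(ruca_code):
    """Map a primary RUCA code (stored as TEXT) to the simplified category."""
    try:
        code = int(float(ruca_code))  # tolerate "4" as well as "4.0"
    except (TypeError, ValueError, OverflowError):
        return None
    if 1 <= code <= 3:
        return "Metropolitan"
    if 4 <= code <= 6:
        return "Micropolitan"
    if 7 <= code <= 9:
        return "Small town"
    if code == 10:
        return "Rural"
    return None  # outside the documented 1-10 range
```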

watershed_huc8

Rows: 2,139
Purpose: USGS HUC8 subbasin polygons for water-stress analysis

Key Columns:

  • huc8 (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
  • name (TEXT) - Watershed name
  • geom (GEOMETRY) - Polygon geometry (EPSG:4326)
  • area_sq_km (DOUBLE PRECISION)
  • states (TEXT) - Comma-separated state codes
  • dc_count (INTEGER) - Number of data centers in watershed

Source: USGS Watershed Boundary Dataset

Notes:

  • 257 of 2,139 watersheds contain at least one data center
  • Top 15 watersheds contain 50% of all US data centers
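
The concentration figure can be rechecked from the dc_count column. The helper below is a minimal sketch: it assumes the counts have already been fetched into a Python list, and the function name is illustrative.

```python
def top_n_share(dc_counts, n=15):
    """Fraction of all data centers located in the n busiest watersheds."""
    counts = sorted((c for c in dc_counts if c), reverse=True)
    total = sum(counts)
    return sum(counts[:n]) / total if total else 0.0
```

Feeding it the dc_count values from watershed_huc8 with n=15 should reproduce the share quoted above, as long as the table has not been refreshed since the note was written.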

nri_census_tracts

Rows: ~84,000
Purpose: Full FEMA National Risk Index by Census tract

Key Columns:

  • nri_id (TEXT) - Census tract GEOID
  • state_name, county_name, tract_name
  • 460+ columns including:
    • Overall risk scores and ratings
    • Expected annual loss (dollars and building value %)
    • Social vulnerability components (15 factors)
    • Community resilience score
    • Individual hazard risk scores (18 hazards)
    • Exposure, annualized frequency, historic loss ratios per hazard

Source: FEMA National Risk Index v2.1 (December 2025)


Infrastructure Tables

Energy Infrastructure

energy_eia_operating_generator_capacity_flat

Rows: 4.7 million
Purpose: EIA generator inventory with lat/lon/MW (monthly 2008-2026)

Key Columns:

  • plant_id (INTEGER) - EIA plant ID
  • generator_id (TEXT) - Generator unit ID
  • plant_name (TEXT)
  • latitude, longitude, geom
  • state, county
  • utility_name, operator_name
  • nameplate_capacity_mw (DOUBLE PRECISION)
  • technology (TEXT) - Generation technology
  • energy_source_1, energy_source_2 - Primary fuel codes
  • operating_month, operating_year - When unit became operational
  • status (TEXT) - Operating, standby, retired, etc.
  • report_month, report_year - Data snapshot date

Source: EIA Form 860 via API

Notes:

  • "Flat" means denormalized for fast spatial queries
  • Each generator-month is a row (4.7M rows from monthly snapshots)
  • Use for proximity analysis (e.g., "all generators within 50 km of data center")
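
Because each generator appears once per monthly snapshot, summing nameplate_capacity_mw over every row counts the same unit many times. A minimal sketch of the guard, assuming rows are dictionaries keyed by the column names above and that report_year and report_month are numeric:

```python
def latest_snapshot(rows):
    """Keep only rows from the most recent (report_year, report_month)."""
    rows = list(rows)
    if not rows:
        return []
    newest = max((r["report_year"], r["report_month"]) for r in rows)
    return [r for r in rows if (r["report_year"], r["report_month"]) == newest]


def total_capacity_mw(rows):
    """Sum nameplate capacity once per generator, using the latest snapshot."""
    return sum(r["nameplate_capacity_mw"] or 0.0 for r in latest_snapshot(rows))
```

The same idea applies in SQL: filter to a single report_year and report_month before aggregating.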

energy_eia_facility_fuel_flat

Rows: Varies
Purpose: Monthly generation by plant/fuel

Key Columns:

  • plant_id, plant_name
  • report_month, report_year
  • energy_source (TEXT) - Fuel code
  • net_generation_mwh (DOUBLE PRECISION)
  • fuel_consumed_mmbtu (DOUBLE PRECISION)

Source: EIA Form 923 via API

energy_eia_seds_flat

Rows: 2.57 million
Purpose: Annual state energy consumption/production (1960-2024)

Key Columns:

  • state_code (TEXT)
  • year (INTEGER)
  • msn (TEXT) - Mnemonic Series Name, the SEDS series code (e.g., TETCB = total energy consumption)
  • value (DOUBLE PRECISION) - Series value; units vary by series (see unit)
  • unit (TEXT)
  • description (TEXT) - Human-readable MSN description

Source: EIA State Energy Data System (SEDS)

Notes:

  • Annual aggregates by state
  • Use for state-level energy context analysis

Connectivity Infrastructure

internet_cables

Rows: 693
Purpose: Submarine cable routes

Key Columns:

  • cable_id (TEXT) - Unique cable identifier
  • cable_name (TEXT) - Official cable name
  • geom (GEOMETRY) - LineString geometry (EPSG:4326)
  • rfs_year (INTEGER) - Ready For Service year
  • length_km (DOUBLE PRECISION)
  • owners (TEXT[]) - Array of owner names
  • landing_points (TEXT[]) - Array of landing point names

Source: TeleGeography-style cable database

Notes:

  • 693 unique submarine cables
  • Geometry is approximate route (not exact seabed path)

internet_cable_landing_points

Rows: 3,361
Purpose: Cable landing points (where cables come ashore)

Key Columns:

  • landing_point_id (TEXT) - Unique identifier
  • name (TEXT) - Landing point name
  • city, country
  • latitude, longitude, geom
  • cables (TEXT[]) - Array of cable names landing at this point
  • cable_count (INTEGER)

Source: TeleGeography-style cable database

Notes:

  • Used for proximity analysis (how close are data centers to cable landings?)
  • Key finding: Data centers are NOT systematically closer to cables than ordinary US cities

internet_city_dominance

Rows: 4,552
Purpose: City-level IPs/capacity (internet hub strength proxy)

Key Columns:

  • city (TEXT)
  • country (TEXT)
  • latitude, longitude, geom
  • ip_addresses (INTEGER) - Number of routable IP addresses
  • capacity_rank (INTEGER) - Relative capacity ranking

Source: Internet topology datasets

Notes: Proxy for "internet hub" strength (not directly used in main analyses)


Broadband

fcc_bdc_location_provider_aggregates

Rows: Varies
Purpose: FCC BDC provider availability aggregated by county/tract

Key Columns:

  • geoid (TEXT) - County or tract GEOID
  • geography_level (TEXT) - county or tract
  • provider_count (INTEGER)
  • technology_counts (JSONB) - Count by technology type
  • max_download_mbps, max_upload_mbps

Source: FCC Broadband Data Collection (BDC)

fcc_bdc_broadband_connection_table

Rows: Varies
Purpose: Per-data-center broadband provider availability

Key Columns:

  • Data center identifiers
  • provider_id, provider_name
  • technology (TEXT)
  • max_advertised_download_speed, max_advertised_upload_speed
  • low_latency (BOOLEAN)

Source: FCC BDC, joined to data center locations

Notes: Built by build_fcc_bdc_broadband_connection_table.py


Other Tables

opposition_cases_geocoded

Rows: 18
Purpose: Geocoded community-opposition cases against data center builds

Key Columns:

  • case_id (TEXT) - Unique identifier
  • developer (TEXT) - Proposed developer/operator
  • investment_billions (DOUBLE PRECISION) - Investment amount in billions
  • outcome (TEXT) - Case outcome (approved, rejected, pending)
  • governance_response (TEXT) - Government response
  • latitude, longitude, geom

Source: Compiled from news archives

Notes: Loaded but currently unused - see research-ideas.md for proposed analyses

Rows: 806
Purpose: Tract↔HUC8 spatial overlap table

Key Columns:

  • geoid (TEXT) - Census tract GEOID
  • huc8 (TEXT) - HUC8 watershed code
  • overlap_pct (DOUBLE PRECISION) - Percentage of tract overlapping watershed

Notes: Useful for downstream tract-level water-stress joins; limited to tracts containing data centers

im3_state_projected_moderate_50

Rows: 328
Purpose: PNNL IM3 projected data center siting (moderate growth, gravity weight 0.50)

Key Columns:

  • facility_id (TEXT)
  • state (TEXT)
  • cost_millions (DOUBLE PRECISION)
  • it_mw (DOUBLE PRECISION) - IT load in megawatts
  • cooling_water_demand_gal_per_day (DOUBLE PRECISION)
  • latitude, longitude, geom

Source: PNNL Integrated Multisector Multiscale Modeling (IM3)

Notes: Loaded but unused - potential for forward-projection analysis

im3_projected_state_demand_summary

Rows: 31
Purpose: State-level rollup of IM3 projected facility counts, IT MW, and cooling demand

Key Columns:

  • state (TEXT)
  • facility_count (INTEGER)
  • total_it_mw (DOUBLE PRECISION)
  • total_cooling_demand_mgd (DOUBLE PRECISION) - Million gallons per day

Source: IM3 model outputs
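
The rollup from im3_state_projected_moderate_50 to this summary involves a unit change from gallons per day to million gallons per day. The sketch below shows that arithmetic under the assumption that the summary is a plain per-state sum. The script that builds the table is not shown here, so the function is illustrative only.

```python
from collections import defaultdict

GALLONS_PER_MILLION_GALLONS = 1_000_000


def rollup_by_state(facilities):
    """Aggregate facility rows (dicts) into per-state summary rows."""
    totals = defaultdict(lambda: {
        "facility_count": 0,
        "total_it_mw": 0.0,
        "total_cooling_demand_mgd": 0.0,
    })
    for f in facilities:
        row = totals[f["state"]]
        row["facility_count"] += 1
        row["total_it_mw"] += f["it_mw"] or 0.0
        # gal/day -> million gal/day (MGD)
        gallons = f["cooling_water_demand_gal_per_day"] or 0.0
        row["total_cooling_demand_mgd"] += gallons / GALLONS_PER_MILLION_GALLONS
    return dict(totals)
```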

utility_rate_tracker_2025_2028

Rows: 374
Purpose: Utility rate-increase tracker by provider × state × service type

Key Columns:

  • provider (TEXT) - Utility provider name
  • state (TEXT)
  • service_type (TEXT)
  • effective_date (DATE)
  • monthly_increase_dollars (DOUBLE PRECISION)
  • percent_increase (DOUBLE PRECISION)

Source: Utility rate tracker database

Notes: Loaded but unused in demographic/energy analysis

energy_atlas_layers_catalog

Rows: ~5
Purpose: Metadata catalog of EIA layers ingested

Key Columns:

  • table_name (TEXT)
  • source_url (TEXT)
  • import_timestamp (TIMESTAMP)
  • row_count (INTEGER)

Notes: Created by ingest_eia_energy_layers.py



Legislation Tables

Populated by ingest_legiscan.py using the LegiScan API.
Covers all 50 states + DC + US Congress, sessions from 2016 through 2026.
Data licensed CC BY 4.0 — attribute LegiScan LLC.

legiscan_sessions

Rows: 646
Purpose: One row per legislative session dataset downloaded from LegiScan

Key Columns:

  • session_id (INTEGER) - LegiScan session ID (PRIMARY KEY)
  • state_abbr (VARCHAR) - Two-letter state code (CA, TX, US, etc.)
  • state_id (INTEGER) - LegiScan numeric state ID
  • year_start, year_end (INTEGER) - Session year range
  • session_title (TEXT) - Full session name
  • session_tag (TEXT) - Short tag (e.g., "Regular Session", "1st Special Session")
  • is_special (BOOLEAN) - True for special/extraordinary sessions
  • is_prior (BOOLEAN) - True for completed/sine-die sessions
  • dataset_hash (VARCHAR) - MD5 of dataset ZIP; used to detect updates
  • dataset_date (DATE) - Date dataset was last published by LegiScan
  • dataset_size_mb (FLOAT) - Compressed ZIP size
  • bill_count (INTEGER) - Number of bills loaded from this session
  • imported_at (TIMESTAMPTZ) - When this session was last imported
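
The dataset_hash column supports incremental refreshes: a session only needs to be downloaded again when the hash LegiScan currently publishes differs from the stored one. A minimal sketch of that comparison follows, assuming both sides have been reduced to session_id → hash dictionaries. The function name is hypothetical and this is not taken from ingest_legiscan.py.

```python
def sessions_needing_refresh(stored_hashes, remote_hashes):
    """Return session IDs whose remote dataset hash is new or has changed."""
    return sorted(
        session_id
        for session_id, remote_hash in remote_hashes.items()
        if stored_hashes.get(session_id) != remote_hash
    )
```

Sessions that exist only in the stored set are left alone, which keeps already-loaded data if a dataset disappears from the remote listing.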

legiscan_bills

Rows: ~1,313,000
Purpose: All bills from all sessions; tagged for relevance to data center research topics

Key Columns:

  • bill_id (INTEGER) - LegiScan bill ID (PRIMARY KEY)
  • session_id (INTEGER) - FK → legiscan_sessions
  • state (VARCHAR) - Two-letter state code
  • bill_number (VARCHAR) - Bill number (e.g., SB 1000, HB 233)
  • bill_type (VARCHAR) - B=Bill, R=Resolution, CR=Concurrent Resolution, etc.
  • title (TEXT) - Short title
  • description (TEXT) - Longer description
  • status (INTEGER) - Current status code (see below)
  • status_date (DATE) - Date of last status change
  • completed (INTEGER) - 1 if bill is in a terminal state
  • body (VARCHAR) - Originating chamber (H=House, S=Senate, C=Council, etc.)
  • url (TEXT) - LegiScan bill page URL
  • state_link (TEXT) - Official state legislature URL
  • change_hash (VARCHAR) - MD5 used to detect bill-level updates
  • subjects (TEXT[]) - LegiScan subject tags (GIN indexed)
  • sponsor_count (INTEGER) - Number of sponsors
  • vote_count (INTEGER) - Number of recorded votes
  • text_count (INTEGER) - Number of bill text versions
  • is_relevant (BOOLEAN) - True if any relevance tag matched
  • relevance_tags (TEXT[]) - Matched topic tags (GIN indexed)
  • imported_at (TIMESTAMPTZ) - When this bill was last upserted

Status codes: 1=Introduced, 2=Engrossed, 3=Enrolled, 4=Passed, 5=Vetoed, 6=Failed, 7=Override, 8=Chaptered, 9=Referred, 12=Draft
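
For analysis code that reads the integer status column, the codes above can be captured in an enum. This is a convenience sketch: the class and function names are illustrative, and codes not listed above are reported as unknown rather than guessed.

```python
from enum import IntEnum


class BillStatus(IntEnum):
    """Status codes as documented for legiscan_bills.status."""
    INTRODUCED = 1
    ENGROSSED = 2
    ENROLLED = 3
    PASSED = 4
    VETOED = 5
    FAILED = 6
    OVERRIDE = 7
    CHAPTERED = 8
    REFERRED = 9
    DRAFT = 12


def status_label(code):
    """Return a readable label for a status code."""
    try:
        return BillStatus(code).name.title()
    except ValueError:
        return f"Unknown ({code})"
```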

Relevance tags (keyword-matched against title + description + subjects):

| Tag | What it captures |
| --- | --- |
| data_center | Data centers, hyperscale, colocation, AI campuses, HPC facilities |
| large_load | Crypto mining, large industrial loads, extraordinary load |
| ratepayer_protection | Cost shifting, cross-subsidy, rate design, affordability, rate class |
| grid_impact | Grid reliability, transmission, interconnection queue, IRP |
| tax_incentive | Tax exemptions, abatements, credits for facilities |
| energy_policy | Renewable PPAs, green tariffs, clean electricity, decarbonization |
| water_use | Cooling water, evaporative cooling, water footprint |
| siting_permitting | Zoning, conditional use permits, local control, preemption |
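
To illustrate what keyword matching against title, description, and subjects can look like, here is a small sketch. The keyword lists are illustrative samples drawn from the descriptions above, not the lists used by ingest_legiscan.py, and the matching rules (case-insensitive, whole phrases, optional plural "s", hyphen or space between words) are assumptions.

```python
import re

# Illustrative keywords only; the real lists live in ingest_legiscan.py.
KEYWORDS = {
    "data_center": ["data center", "hyperscale", "colocation"],
    "large_load": ["crypto mining", "large load"],
    "water_use": ["cooling water", "evaporative cooling", "water footprint"],
}


def _phrase_pattern(phrase):
    """Regex for a phrase: words joined by spaces or hyphens, optional plural."""
    words = [re.escape(word) for word in phrase.split()]
    return r"\b" + r"[\s-]+".join(words) + r"s?\b"


PATTERNS = {
    tag: re.compile("|".join(_phrase_pattern(p) for p in phrases), re.IGNORECASE)
    for tag, phrases in KEYWORDS.items()
}


def tag_bill(title, description="", subjects=()):
    """Return the is_relevant flag and sorted relevance_tags for one bill."""
    text = " ".join([title or "", description or "", *(subjects or ())])
    tags = sorted(tag for tag, pattern in PATTERNS.items() if pattern.search(text))
    return {"is_relevant": bool(tags), "relevance_tags": tags}
```

Broad keywords inflate match counts quickly, which is consistent with ratepayer_protection matching far more bills than data_center in the notes below.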

Notes:

  • ~60,000 relevant bills out of 1.3M total (~4.6%)
  • data_center tag: ~2,182 bills; ratepayer_protection: ~49,000
  • GIN indexes on subjects, relevance_tags, and full-text (title || description)
  • Use query_legiscan_bills.sql for pre-built research queries
  • Weekly, re-run python ingest_legiscan.py --fetch --load to pick up dataset updates
  • After editing the keyword lists, re-run python ingest_legiscan.py --tag to refresh the relevance tags

Commonly Used Joins

Data Center to Demographics

SELECT 
    dc.*,
    ct.median_household_income,
    ct.bachelors_or_higher_pct,
    ct.broadband_pct
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct 
    ON dc.id = ct.id;

Data Center to Watershed

SELECT 
    dc.*,
    w.huc8,
    w.name AS watershed_name  -- watershed_huc8 stores the name in its "name" column
FROM master_data_centers dc
JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
JOIN watershed_huc8 w ON dw.huc8 = w.huc8;

Data Center to Energy Infrastructure (50 km radius)

SELECT 
    dc.id,
    dc.name,
    SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
FROM master_data_centers dc
JOIN energy_eia_operating_generator_capacity_flat eg
    ON ST_DWithin(
        dc.geom::geography,
        eg.geom::geography,
        50000  -- 50 km in meters
    )
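WHERE eg.status = 'OP'  -- Operating only
  -- The table holds one row per generator-month, so restrict to the
  -- latest snapshot to count each generator once (assumes numeric
  -- report_year / report_month)
  AND (eg.report_year, eg.report_month) = (
      SELECT report_year, report_month
      FROM energy_eia_operating_generator_capacity_flat
      ORDER BY report_year DESC, report_month DESC
      LIMIT 1
  )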
GROUP BY dc.id, dc.name;

Data Center to FEMA Hazard Risk

SELECT 
    dc.*,
    nri.risk_score,
    nri.wildfire_risk,
    nri.drought_risk,
    nri.heat_wave_risk
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;

Table Naming Conventions

  • master_* - Canonical, deduplicated tables (use these for analysis)
  • data_center_* - Data center-specific enrichment tables
  • _dc_* - Base layers scoped to data center states (underscore prefix = private/internal)
  • energy_eia_* - EIA energy data
  • internet_* - Connectivity infrastructure
  • fcc_bdc_* - FCC Broadband Data Collection
  • legiscan_* - LegiScan legislation data

Indexes and Performance

Every table with a geom column has a GiST spatial index on it for fast spatial joins:

CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);

Key geoid columns are indexed for fast demographic joins:

CREATE INDEX idx_tablename_geoid ON tablename(geoid);

Maintenance Notes

Updating Data Centers

  1. Run load_postgis_osm_data_centers.py to refresh OSM data
  2. Run build_master_data_centers.py to rebuild master table
  3. Run enrichment scripts to update joins

Updating Demographics

  1. Update _dc_census_tract_acs_2024 from Census API
  2. Run create_data_center_census_tract_table.py --replace-final

Updating Energy Data

python3 ingest_eia_energy_layers.py --category power --update

Schema Export

To export the full schema:

pg_dump -h "$PGWEB_HOST" -p "$PGWEB_PORT" -U "$PGWEB_USER" -d "$PGWEB_DATABASE" --schema-only > schema.sql

To list all tables:

SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

Contact

For database access or questions, contact the repository owner.