Database Tables Documentation
Database Configuration
Database Name: data_centers
Type: PostgreSQL with PostGIS extension
Connection: Environment variables from ~/.zsh_secrets
- PGWEB_HOST: Database host
- PGWEB_PORT: Database port (5433)
- PGWEB_USER: Database user
- PGWEB_PASSWORD: Database password
- PGWEB_DATABASE: Database name (data_centers)
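As a minimal sketch of how a script might consume these variables, the helper below assembles a libpq-style keyword/value connection string. The function name build_conninfo is hypothetical and not taken from the repository's scripts; it assumes the values contain no spaces or quotes (those would need libpq quoting).

```python
import os

# The five variables documented above; all are treated as required here.
REQUIRED_VARS = (
    "PGWEB_HOST",
    "PGWEB_PORT",
    "PGWEB_USER",
    "PGWEB_PASSWORD",
    "PGWEB_DATABASE",
)


def build_conninfo(env=None):
    """Build a libpq keyword/value connection string from PGWEB_* variables.

    Hypothetical helper: assumes values contain no spaces or quotes.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return (
        f"host={env['PGWEB_HOST']} port={env['PGWEB_PORT']} "
        f"dbname={env['PGWEB_DATABASE']} user={env['PGWEB_USER']} "
        f"password={env['PGWEB_PASSWORD']}"
    )
```

The resulting string can be passed to any driver that accepts libpq connection strings. Failing fast on a missing variable avoids silently connecting to the default port instead of 5433.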
Table Organization
Tables are organized into five categories:
- Core Data Center Tables - Master inventories and source data
- Enrichment Tables - Data centers joined with contextual data
- Base Layer Tables - Geographic and demographic reference layers
- Infrastructure Tables - Energy and connectivity infrastructure
- Legislation Tables - LegiScan state and federal bill data (2016-2026)
Core Data Center Tables
master_data_centers
Rows: 1,833
Purpose: Canonical data center inventory - deduplicated merge of curated + OSM sources
Key Columns:
- id (INTEGER) - Unique identifier
- name (TEXT) - Facility name
- address (TEXT) - Street address
- city (TEXT) - City
- state (TEXT) - State code
- latitude (DOUBLE PRECISION) - Latitude
- longitude (DOUBLE PRECISION) - Longitude
- geom (GEOMETRY) - PostGIS point geometry (EPSG:4326)
- operator (TEXT) - Operator/owner
- power_mw (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
- source (TEXT) - Data source (curated, osm, or both)
- osm_id (TEXT) - OpenStreetMap ID if applicable
- geocode_method (TEXT) - Geocoding provenance
Notes:
- 108 of 1,833 facilities have power ratings
- 45 facilities use city-precision fallback coordinates
- Operator strings have fragmentation issues ("Meta" vs. "Meta, Inc.")
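The operator fragmentation noted above can be reduced before grouping by collapsing common corporate suffixes. The following is only a sketch: normalize_operator and its suffix list are hypothetical, and the list would need tuning against the operator strings actually present in the table.

```python
import re

# Hypothetical suffix list (an assumption, not derived from the data).
_SUFFIX = re.compile(r",?\s+(inc|llc|corp|corporation|co|ltd)\.?$", re.IGNORECASE)


def normalize_operator(name):
    """Return a case-folded grouping key with trailing corporate suffixes removed."""
    cleaned = " ".join(name.split())  # collapse runs of whitespace
    previous = None
    while previous != cleaned:  # repeat so stacked suffixes such as "Co., Ltd." are removed
        previous = cleaned
        cleaned = _SUFFIX.sub("", cleaned)
    return cleaned.casefold()
```

With this key, "Meta" and "Meta, Inc." fall into the same group, while the original operator string stays available for display.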
us_dc_sample_geocoded
Rows: 1,489
Purpose: Original curated sample with geocoding provenance (superseded by master_data_centers)
Key Columns:
- name, address, city, state, zip
- latitude, longitude, geom
- operator, power_mw
- census_lat, census_lon - Census TIGER geocode results
- nominatim_lat, nominatim_lon - Nominatim fallback results
- geocode_source - Which geocoder was used
osm_data_centers
Rows: 1,549
Purpose: Raw OpenStreetMap-derived facilities
Key Columns:
- osm_id (TEXT) - OSM element ID
- osm_type (TEXT) - node, way, or relation
- name (TEXT) - OSM name tag
- latitude, longitude, geom
- tags (JSONB) - All OSM tags as JSON
- operator (TEXT) - Extracted from OSM tags
- city, state, country
Notes: Fetched via Overpass API with query for telecom=data_center or building=data_center
master_data_center_spatial_clusters
Rows: 1,831
Purpose: DBSCAN cluster assignments for master data centers
Key Columns:
- All columns from master_data_centers
- cluster_id (INTEGER) - Cluster assignment (-1 = noise/singleton)
- cluster_size (INTEGER) - Number of facilities in cluster
- cluster_label (TEXT) - Human-readable cluster name
Notes: DBSCAN parameters: eps=15 km, min_samples=2
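To illustrate what eps=15 km with min_samples=2 implies, here is a dependency-free sketch. It assumes the common convention that min_samples counts the point itself; under that assumption every point with at least one neighbor within eps is a core point, so clusters are the connected components of the 15 km neighbor graph and isolated points get -1. This is not the script that built the table, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def cluster_min_samples_2(points, eps_km=15.0):
    """Label points like DBSCAN with min_samples=2: components of size >= 2, else -1."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union every pair of points that lie within eps of each other (O(n^2) sketch).
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(points[i], points[j]) <= eps_km:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(points)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1

    labels, ids = [], {}
    for i in range(len(points)):
        root = find(i)
        if sizes[root] < 2:
            labels.append(-1)  # noise/singleton
        else:
            labels.append(ids.setdefault(root, len(ids)))
    return labels
```

The pairwise loop is fine for a couple of thousand facilities; a spatial index or a library implementation would be the better choice at larger scale.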
Enrichment Tables
data_center_census_tracts_2024
Rows: 1,815
Purpose: Per-facility demographics from containing Census tract
Key Columns:
- All columns from master_data_centers
- geoid (TEXT) - 11-digit Census tract GEOID
- state_fips, county_fips, tract
- Population: total_population, population_density_sq_mi
- Age: median_age, under_18_pct, over_65_pct
- Race/Ethnicity: white_nh_pct, black_nh_pct, asian_nh_pct, hispanic_pct
- Economics: median_household_income, per_capita_income, poverty_rate
- Education: bachelors_or_higher_pct, high_school_or_higher_pct
- Housing: median_home_value, median_rent, homeownership_rate
- Broadband: broadband_pct - Households with broadband subscription
Source: ACS 2024 5-year estimates
Notes:
- 18 of 1,833 facilities failed tract join (geocoding issues)
- Data from _dc_census_tract_acs_2024 base table
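An 11-digit tract GEOID is the concatenation of a 2-digit state FIPS code, a 3-digit county FIPS code, and a 6-digit tract code. The sketch below splits one apart; split_tract_geoid is a hypothetical helper, and whether the table's county_fips column stores the 3-digit or the 5-digit (state + county) form is an assumption to verify against the data.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into state, county, and tract components."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit GEOID, got {geoid!r}")
    return {
        "state_fips": geoid[:2],    # 2-digit state code
        "county_fips": geoid[2:5],  # 3-digit county code within the state
        "tract": geoid[5:],         # 6-digit tract code
    }
```

Keeping GEOIDs as text preserves leading zeros, which is why the geoid columns are typed TEXT rather than INTEGER.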
data_center_watershed_huc8
Rows: 1,833
Purpose: Per-facility watershed assignment
Key Columns:
- All columns from master_data_centers
- huc8 (TEXT) - 8-digit Hydrologic Unit Code
- watershed_name (TEXT) - Watershed name
- watershed_area_sq_km (DOUBLE PRECISION)
- states (TEXT) - States intersecting watershed
Source: USGS Watershed Boundary Dataset
Notes: 257 unique HUC8 watersheds contain at least one data center
data_center_nri_exposure
Rows: 1,833
Purpose: Per-facility FEMA National Risk Index hazard exposure scores
Key Columns:
- All columns from master_data_centers
- nri_id (TEXT) - Census tract GEOID (matches geoid from demographics)
- risk_score (DOUBLE PRECISION) - Overall NRI risk score
- social_vulnerability (DOUBLE PRECISION) - Social vulnerability index
- Hazard-specific risk scores (18 hazards):
  - avalanche_risk, coastal_flooding_risk, cold_wave_risk
  - drought_risk, earthquake_risk, hail_risk
  - heat_wave_risk, hurricane_risk, ice_storm_risk
  - landslide_risk, lightning_risk, riverine_flooding_risk
  - strong_wind_risk, tornado_risk, tsunami_risk
  - volcanic_activity_risk, wildfire_risk, winter_weather_risk
Source: FEMA National Risk Index (December 2025 release)
data_center_rdh_precinct_vote_matches
Rows: Varies
Purpose: Per-facility precinct-level election results
Key Columns:
- Data center identifiers
- precinct_name, precinct_id
- election_year, office
- candidate, party, votes
- vote_share_pct
Source: Redistricting Data Hub precinct shapefiles
Notes: Spatial join to voting precincts (point-in-polygon)
Base Layer Tables
_dc_census_tract_acs_2024
Rows: 85,382
Purpose: ACS 2024 demographics for all Census tracts in states with data centers
Key Columns:
- geoid (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
- name (TEXT) - Tract name
- state_fips, county_fips, tract
- Full ACS 5-year estimates (85+ columns):
  - Population by age, sex, race/ethnicity
  - Households, families, housing units
  - Income, poverty, education, employment
  - Housing values, rents, costs
  - Broadband, computer access
  - Commuting, vehicles
Source: Census ACS 2024 5-year estimates API
Notes: Universe limited to the 46 states that contain data centers; states with no data centers are excluded
_dc_census_tract_boundaries_2024
Rows: 85,058
Purpose: TIGER 2024 tract polygons for data center states
Key Columns:
- geoid (TEXT) - 11-digit tract GEOID
- name (TEXT) - Tract name
- state_fips, county_fips, tract_code
- geom (GEOMETRY) - Polygon geometry (EPSG:4326)
- area_land_sq_m (DOUBLE PRECISION) - Land area in square meters
- area_water_sq_m (DOUBLE PRECISION) - Water area in square meters
Source: Census TIGER/Line 2024
ruca_codes_2020_tract
Rows: 85,528
Purpose: USDA Rural-Urban Commuting Area codes for metro/rural classification
Key Columns:
- geoid (TEXT) - 11-digit tract GEOID (matches Census tracts)
- ruca_code (TEXT) - Primary RUCA code (1-10)
- ruca_category (TEXT) - Simplified category:
  - Metropolitan (codes 1-3)
  - Micropolitan (codes 4-6)
  - Small town (codes 7-9)
  - Rural (code 10)
- ruca_description (TEXT) - Full RUCA code description
- population_2020 (INTEGER)
Source: USDA Economic Research Service RUCA 2020
Notes:
- Based on 2020 Census tracts and 2010-2020 commuting patterns
- 7 data centers failed RUCA join (Puerto Rico / non-US)
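The ruca_category grouping listed under Key Columns can be expressed as a small lookup. This is a sketch of the documented mapping only: the function name is hypothetical, and returning None for anything outside 1-10 is an assumption about how unexpected values should be handled.

```python
def ruca_category(ruca_code):
    """Map a primary RUCA code (stored as TEXT) to the simplified category."""
    try:
        code = int(str(ruca_code).strip())
    except ValueError:
        return None  # non-numeric or missing code
    if 1 <= code <= 3:
        return "Metropolitan"
    if 4 <= code <= 6:
        return "Micropolitan"
    if 7 <= code <= 9:
        return "Small town"
    if code == 10:
        return "Rural"
    return None  # outside the documented 1-10 range
```

Because ruca_code is TEXT, the cast to an integer matters: comparing the strings directly would sort "10" before "2".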
watershed_huc8
Rows: 2,139
Purpose: USGS HUC8 subbasin polygons for water-stress analysis
Key Columns:
- huc8 (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
- name (TEXT) - Watershed name
- geom (GEOMETRY) - Polygon geometry (EPSG:4326)
- area_sq_km (DOUBLE PRECISION)
- states (TEXT) - Comma-separated state codes
- dc_count (INTEGER) - Number of data centers in watershed
Source: USGS Watershed Boundary Dataset
Notes:
- 257 of 2,139 watersheds contain at least one data center
- Top 15 watersheds contain 50% of all US data centers
nri_census_tracts
Rows: ~84,000
Purpose: Full FEMA National Risk Index by Census tract
Key Columns:
- nri_id (TEXT) - Census tract GEOID
- state_name, county_name, tract_name
- 460+ columns including:
  - Overall risk scores and ratings
  - Expected annual loss (dollars and building value %)
  - Social vulnerability components (15 factors)
  - Community resilience score
  - Individual hazard risk scores (18 hazards)
  - Exposure, annualized frequency, historic loss ratios per hazard
Source: FEMA National Risk Index v2.1 (December 2025)
Notes:
- Massive table with comprehensive natural hazard risk data
- Join to data centers by matching nri_id to the geoid field of the data center tables
- See FEMA NRI Technical Documentation
Infrastructure Tables
Energy Infrastructure
energy_eia_operating_generator_capacity_flat
Rows: 4.7 million
Purpose: EIA generator inventory with lat/lon/MW (monthly 2008-2026)
Key Columns:
- plant_id (INTEGER) - EIA plant ID
- generator_id (TEXT) - Generator unit ID
- plant_name (TEXT)
- latitude, longitude, geom
- state, county
- utility_name, operator_name
- nameplate_capacity_mw (DOUBLE PRECISION)
- technology (TEXT) - Generation technology
- energy_source_1, energy_source_2 - Primary fuel codes
- operating_month, operating_year - When unit became operational
- status (TEXT) - Operating, standby, retired, etc.
- report_month, report_year - Data snapshot date
Source: EIA Form 860 via API
Notes:
- "Flat" means denormalized for fast spatial queries
- Each generator-month is a row (4.7M rows from monthly snapshots)
- Use for proximity analysis (e.g., "all generators within 50 km of data center")
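Because each generator-month is a row, summing nameplate_capacity_mw over the whole table counts the same generator once per monthly snapshot. The sketch below shows one way to avoid that by restricting rows to the most recent snapshot before aggregating. The row dictionaries and function names are hypothetical, report_year and report_month are assumed to be integers, and 'OP' is assumed to be the status code for operating units.

```python
def latest_snapshot(rows):
    """Keep only rows from the most recent (report_year, report_month) snapshot."""
    if not rows:
        return []
    latest = max((r["report_year"], r["report_month"]) for r in rows)
    return [r for r in rows if (r["report_year"], r["report_month"]) == latest]


def total_operating_mw(rows):
    """Sum nameplate capacity of operating generators in the latest snapshot."""
    return sum(
        r["nameplate_capacity_mw"]
        for r in latest_snapshot(rows)
        if r["status"] == "OP"  # assumed code for operating units
    )


# Hypothetical rows: generator G1 appears in two monthly snapshots.
example_rows = [
    {"generator_id": "G1", "report_year": 2024, "report_month": 1,
     "status": "OP", "nameplate_capacity_mw": 100.0},
    {"generator_id": "G1", "report_year": 2024, "report_month": 2,
     "status": "OP", "nameplate_capacity_mw": 100.0},
    {"generator_id": "G2", "report_year": 2024, "report_month": 2,
     "status": "RE", "nameplate_capacity_mw": 40.0},
]
```

The same idea applies in SQL by filtering on a single report_year and report_month before the spatial join.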
energy_eia_facility_fuel_flat
Rows: Varies
Purpose: Monthly generation by plant/fuel
Key Columns:
- plant_id, plant_name
- report_month, report_year
- energy_source (TEXT) - Fuel code
- net_generation_mwh (DOUBLE PRECISION)
- fuel_consumed_mmbtu (DOUBLE PRECISION)
Source: EIA Form 923 via API
energy_eia_seds_flat
Rows: 2.57 million
Purpose: Annual state energy consumption/production (1960-2024)
Key Columns:
- state_code (TEXT)
- year (INTEGER)
- msn (TEXT) - Mnemonic series name (e.g., TETCB = total energy consumption)
- value (DOUBLE PRECISION) - Energy in trillion BTU
- unit (TEXT)
- description (TEXT) - Human-readable MSN description
Source: EIA State Energy Data System (SEDS)
Notes:
- Annual aggregates by state
- Use for state-level energy context analysis
Connectivity Infrastructure
internet_cables
Rows: 693
Purpose: Submarine cable routes
Key Columns:
- cable_id (TEXT) - Unique cable identifier
- cable_name (TEXT) - Official cable name
- geom (GEOMETRY) - LineString geometry (EPSG:4326)
- rfs_year (INTEGER) - Ready For Service year
- length_km (DOUBLE PRECISION)
- owners (TEXT[]) - Array of owner names
- landing_points (TEXT[]) - Array of landing point names
Source: TeleGeography-style cable database
Notes:
- 693 unique submarine cables
- Geometry is approximate route (not exact seabed path)
internet_cable_landing_points
Rows: 3,361
Purpose: Cable landing points (where cables come ashore)
Key Columns:
- landing_point_id (TEXT) - Unique identifier
- name (TEXT) - Landing point name
- city, country
- latitude, longitude, geom
- cables (TEXT[]) - Array of cable names landing at this point
- cable_count (INTEGER)
Source: TeleGeography-style cable database
Notes:
- Used for proximity analysis (how close are data centers to cable landings?)
- Key finding: Data centers are NOT systematically closer to cables than ordinary US cities
internet_city_dominance
Rows: 4,552
Purpose: City-level IPs/capacity (internet hub strength proxy)
Key Columns:
- city (TEXT)
- country (TEXT)
- latitude, longitude, geom
- ip_addresses (INTEGER) - Number of routable IP addresses
- capacity_rank (INTEGER) - Relative capacity ranking
Source: Internet topology datasets
Notes: Proxy for "internet hub" strength (not directly used in main analyses)
Broadband
fcc_bdc_location_provider_aggregates
Rows: Varies
Purpose: FCC BDC provider availability aggregated by county/tract
Key Columns:
- geoid (TEXT) - County or tract GEOID
- geography_level (TEXT) - county or tract
- provider_count (INTEGER)
- technology_counts (JSONB) - Count by technology type
- max_download_mbps, max_upload_mbps
Source: FCC Broadband Data Collection (BDC)
fcc_bdc_broadband_connection_table
Rows: Varies
Purpose: Per-data-center broadband provider availability
Key Columns:
- Data center identifiers
- provider_id, provider_name
- technology (TEXT)
- max_advertised_download_speed, max_advertised_upload_speed
- low_latency (BOOLEAN)
Source: FCC BDC, joined to data center locations
Notes: Built by build_fcc_bdc_broadband_connection_table.py
Other Tables
opposition_cases_geocoded
Rows: 18
Purpose: Geocoded community-opposition cases against data center builds
Key Columns:
- case_id (TEXT) - Unique identifier
- developer (TEXT) - Proposed developer/operator
- investment_billions (DOUBLE PRECISION) - Investment amount in billions
- outcome (TEXT) - Case outcome (approved, rejected, pending)
- governance_response (TEXT) - Government response
- latitude, longitude, geom
Source: Compiled from news archives
Notes: Loaded but currently unused - see research-ideas.md for proposed analyses
census_tract_huc8_link
Rows: 806
Purpose: Tract↔HUC8 spatial overlap table
Key Columns:
- geoid (TEXT) - Census tract GEOID
- huc8 (TEXT) - HUC8 watershed code
- overlap_pct (DOUBLE PRECISION) - Percentage of tract overlapping watershed
Notes: Useful for downstream tract-level water-stress joins; limited to tracts containing data centers
im3_state_projected_moderate_50
Rows: 328
Purpose: PNNL IM3 projected data center siting (moderate growth, gravity weight 0.50)
Key Columns:
- facility_id (TEXT)
- state (TEXT)
- cost_millions (DOUBLE PRECISION)
- it_mw (DOUBLE PRECISION) - IT load in megawatts
- cooling_water_demand_gal_per_day (DOUBLE PRECISION)
- latitude, longitude, geom
Source: PNNL Integrated Multisector Multiscale Modeling (IM3)
Notes: Loaded but unused - potential for forward-projection analysis
im3_projected_state_demand_summary
Rows: 31
Purpose: State-level rollup of IM3 projected facility counts, IT MW, and cooling demand
Key Columns:
- state (TEXT)
- facility_count (INTEGER)
- total_it_mw (DOUBLE PRECISION)
- total_cooling_demand_mgd (DOUBLE PRECISION) - Million gallons per day
Source: IM3 model outputs
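A state-level rollup like this one could plausibly be derived from facility-level rows such as those in im3_state_projected_moderate_50; that derivation is an assumption, not something the source confirms. The sketch below shows the arithmetic, including the conversion from gallons per day to million gallons per day (MGD). The function name and input dictionaries are hypothetical.

```python
from collections import defaultdict

GALLONS_PER_MILLION = 1_000_000  # gal/day -> million gal/day (MGD)


def summarize_by_state(facilities):
    """Aggregate facility rows into per-state counts, IT MW, and cooling MGD."""
    totals = defaultdict(
        lambda: {"facility_count": 0, "total_it_mw": 0.0, "total_cooling_demand_mgd": 0.0}
    )
    for f in facilities:
        row = totals[f["state"]]
        row["facility_count"] += 1
        row["total_it_mw"] += f["it_mw"]
        row["total_cooling_demand_mgd"] += (
            f["cooling_water_demand_gal_per_day"] / GALLONS_PER_MILLION
        )
    return dict(totals)
```

For example, two facilities in one state drawing 500,000 and 1,500,000 gallons per day roll up to 2.0 MGD.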
utility_rate_tracker_2025_2028
Rows: 374
Purpose: Utility rate-increase tracker by provider × state × service type
Key Columns:
- provider (TEXT) - Utility provider name
- state (TEXT)
- service_type (TEXT)
- effective_date (DATE)
- monthly_increase_dollars (DOUBLE PRECISION)
- percent_increase (DOUBLE PRECISION)
Source: Utility rate tracker database
Notes: Loaded but unused in demographic/energy analysis
energy_atlas_layers_catalog
Rows: ~5
Purpose: Metadata catalog of EIA layers ingested
Key Columns:
- table_name (TEXT)
- source_url (TEXT)
- import_timestamp (TIMESTAMP)
- row_count (INTEGER)
Notes: Created by ingest_eia_energy_layers.py
Legislation Tables
Populated by ingest_legiscan.py using the LegiScan API.
Covers all 50 states + DC + US Congress, sessions from 2016 through 2026.
Data licensed CC BY 4.0 — attribute LegiScan LLC.
legiscan_sessions
Rows: 646
Purpose: One row per legislative session dataset downloaded from LegiScan
Key Columns:
- session_id (INTEGER) - LegiScan session ID (PRIMARY KEY)
- state_abbr (VARCHAR) - Two-letter state code (CA, TX, US, etc.)
- state_id (INTEGER) - LegiScan numeric state ID
- year_start, year_end (INTEGER) - Session year range
- session_title (TEXT) - Full session name
- session_tag (TEXT) - Short tag (e.g., "Regular Session", "1st Special Session")
- is_special (BOOLEAN) - True for special/extraordinary sessions
- is_prior (BOOLEAN) - True for completed/sine-die sessions
- dataset_hash (VARCHAR) - MD5 of dataset ZIP; used to detect updates
- dataset_date (DATE) - Date dataset was last published by LegiScan
- dataset_size_mb (FLOAT) - Compressed ZIP size
- bill_count (INTEGER) - Number of bills loaded from this session
- imported_at (TIMESTAMPTZ) - When this session was last imported
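The dataset_hash column makes incremental refreshes cheap: a session only needs re-downloading when its published hash differs from the stored one. The sketch below shows that comparison under stated assumptions; sessions_to_refresh is a hypothetical name, and both arguments are plain session_id to hash mappings rather than anything taken from ingest_legiscan.py.

```python
def sessions_to_refresh(stored_hashes, published_hashes):
    """Return session IDs whose published hash is new or differs from the stored one.

    stored_hashes:    {session_id: dataset_hash} as currently in legiscan_sessions
    published_hashes: {session_id: dataset_hash} as most recently published (hypothetical input)
    """
    return sorted(
        session_id
        for session_id, published in published_hashes.items()
        if stored_hashes.get(session_id) != published
    )
```

Sessions missing from the stored mapping are treated as new, so a first run refreshes everything.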
legiscan_bills
Rows: ~1,313,000
Purpose: All bills from all sessions; tagged for relevance to data center research topics
Key Columns:
- bill_id (INTEGER) - LegiScan bill ID (PRIMARY KEY)
- session_id (INTEGER) - FK → legiscan_sessions
- state (VARCHAR) - Two-letter state code
- bill_number (VARCHAR) - Bill number (e.g., SB 1000, HB 233)
- bill_type (VARCHAR) - B=Bill, R=Resolution, CR=Concurrent Resolution, etc.
- title (TEXT) - Short title
- description (TEXT) - Longer description
- status (INTEGER) - Current status code (see below)
- status_date (DATE) - Date of last status change
- completed (INTEGER) - 1 if bill is in a terminal state
- body (VARCHAR) - Originating chamber (H=House, S=Senate, C=Council, etc.)
- url (TEXT) - LegiScan bill page URL
- state_link (TEXT) - Official state legislature URL
- change_hash (VARCHAR) - MD5 used to detect bill-level updates
- subjects (TEXT[]) - LegiScan subject tags (GIN indexed)
- sponsor_count (INTEGER) - Number of sponsors
- vote_count (INTEGER) - Number of recorded votes
- text_count (INTEGER) - Number of bill text versions
- is_relevant (BOOLEAN) - True if any relevance tag matched (GIN indexed)
- relevance_tags (TEXT[]) - Matched topic tags (GIN indexed)
- imported_at (TIMESTAMPTZ) - When this bill was last upserted
Status codes: 1=Introduced, 2=Engrossed, 3=Enrolled, 4=Passed, 5=Vetoed, 6=Failed, 7=Override, 8=Chaptered, 9=Referred, 12=Draft
Relevance tags (keyword-matched against title + description + subjects):
| Tag | What it captures |
|---|---|
| data_center | Data centers, hyperscale, colocation, AI campuses, HPC facilities |
| large_load | Crypto mining, large industrial loads, extraordinary load |
| ratepayer_protection | Cost shifting, cross-subsidy, rate design, affordability, rate class |
| grid_impact | Grid reliability, transmission, interconnection queue, IRP |
| tax_incentive | Tax exemptions, abatements, credits for facilities |
| energy_policy | Renewable PPAs, green tariffs, clean electricity, decarbonization |
| water_use | Cooling water, evaporative cooling, water footprint |
| siting_permitting | Zoning, conditional use permits, local control, preemption |
Notes:
- ~60,000 relevant bills out of 1.3M total (~4.6%)
- data_center tag: ~2,182 bills; ratepayer_protection: ~49,000
- GIN indexes on subjects, relevance_tags, and full-text (title || description)
- Use query_legiscan_bills.sql for pre-built research queries
- Re-run python ingest_legiscan.py --fetch --load weekly to pick up dataset updates
- Re-run python ingest_legiscan.py --tag after editing keyword lists
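Keyword matching against title, description, and subjects can be sketched as follows. The keyword lists here are illustrative stand-ins drawn from the tag descriptions in the table, not the lists used by ingest_legiscan.py, and tag_bill is a hypothetical function name. Plain substring matching is assumed; the real tagger may use word boundaries or patterns.

```python
# Illustrative keyword lists only (assumption): two of the eight tags shown.
KEYWORDS = {
    "data_center": ("data center", "hyperscale", "colocation"),
    "water_use": ("cooling water", "evaporative cooling", "water footprint"),
}


def tag_bill(title, description, subjects):
    """Return relevance_tags and is_relevant for one bill via substring matching."""
    haystack = " ".join([title or "", description or "", *(subjects or [])]).casefold()
    tags = sorted(
        tag
        for tag, words in KEYWORDS.items()
        if any(word in haystack for word in words)
    )
    return {"relevance_tags": tags, "is_relevant": bool(tags)}
```

Substring matching is simple but prone to false positives on broad terms, which may explain why a wide tag such as ratepayer_protection matches far more bills than data_center.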
Commonly Used Joins
Data Center to Demographics
SELECT
dc.*,
ct.median_household_income,
ct.bachelors_or_higher_pct,
ct.broadband_pct
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct
ON dc.id = ct.id;
Data Center to Watershed
SELECT
dc.*,
w.huc8,
w.name AS watershed_name
FROM master_data_centers dc
JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
JOIN watershed_huc8 w ON dw.huc8 = w.huc8;
Data Center to Energy Infrastructure (50 km radius)
SELECT
dc.id,
dc.name,
SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
FROM master_data_centers dc
JOIN energy_eia_operating_generator_capacity_flat eg
ON ST_DWithin(
dc.geom::geography,
eg.geom::geography,
50000 -- 50 km in meters
)
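WHERE eg.status = 'OP' -- Operating only
-- Each generator-month is a row, so restrict to the latest snapshot to avoid
-- counting the same generator once per month (assumes numeric report_year/report_month)
AND (eg.report_year, eg.report_month) = (
  SELECT s.report_year, s.report_month
  FROM energy_eia_operating_generator_capacity_flat s
  ORDER BY s.report_year DESC, s.report_month DESC
  LIMIT 1
)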
GROUP BY dc.id, dc.name;
Data Center to FEMA Hazard Risk
SELECT
dc.*,
nri.risk_score,
nri.wildfire_risk,
nri.drought_risk,
nri.heat_wave_risk
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;
Table Naming Conventions
- master_* - Canonical, deduplicated tables (use these for analysis)
- data_center_* - Data center-specific enrichment tables
- _dc_* - Base layers scoped to data center states (underscore prefix = private/internal)
- energy_eia_* - EIA energy data
- internet_* - Connectivity infrastructure
- fcc_bdc_* - FCC Broadband Data Collection
Indexes and Performance
All tables with a geom column have a spatial (GiST) index on it for fast spatial joins:
CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);
Key geoid columns are indexed for fast demographic joins:
CREATE INDEX idx_tablename_geoid ON tablename(geoid);
Maintenance Notes
Updating Data Centers
- Run load_postgis_osm_data_centers.py to refresh OSM data
- Run build_master_data_centers.py to rebuild master table
- Run enrichment scripts to update joins
Updating Demographics
- Update _dc_census_tract_acs_2024 from Census API
- Run create_data_center_census_tract_table.py --replace-final
Updating Energy Data
python3 ingest_eia_energy_layers.py --category power --update
Schema Export
To export the full schema:
pg_dump -h "$PGWEB_HOST" -p "$PGWEB_PORT" -U "$PGWEB_USER" -d data_centers --schema-only > schema.sql
To list all tables:
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
Contact
For database access or questions, contact the repository owner.