Database Tables Documentation
Database Configuration
Database Name: data_centers
Type: PostgreSQL with PostGIS extension
Connection: Environment variables from ~/.zsh_secrets
- PGWEB_HOST: Database host
- PGWEB_PORT: Database port (typically 5432)
- PGWEB_USER: Database user
- PGWEB_PASSWORD: Database password
- PGWEB_DATABASE: Database name (data_centers)
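As a minimal sketch of how these variables might be consumed from Python, the snippet below assembles a PostgreSQL connection URL. The helper name build_pg_url is hypothetical, and it assumes the PGWEB_* variables have already been exported into the environment (for example by sourcing ~/.zsh_secrets).

```python
import os
from urllib.parse import quote


def build_pg_url(env=os.environ):
    """Build a postgresql:// URL from the PGWEB_* variables (hypothetical helper)."""
    host = env["PGWEB_HOST"]
    port = env.get("PGWEB_PORT", "5432")  # 5432 is the typical port noted above
    user = quote(env["PGWEB_USER"], safe="")
    # Percent-encode the password so characters like '@' or ':' do not break the URL.
    password = quote(env["PGWEB_PASSWORD"], safe="")
    database = env.get("PGWEB_DATABASE", "data_centers")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

The resulting URL can be passed to most PostgreSQL client libraries; passing a plain dict instead of os.environ makes the helper easy to test.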
Table Organization
Tables are organized into four categories:
- Core Data Center Tables - Master inventories and source data
- Enrichment Tables - Data centers joined with contextual data
- Base Layer Tables - Geographic and demographic reference layers
- Infrastructure Tables - Energy and connectivity infrastructure
Core Data Center Tables
master_data_centers
Rows: 1,833
Purpose: Canonical data center inventory - deduplicated merge of curated + OSM sources
Key Columns:
- id (INTEGER) - Unique identifier
- name (TEXT) - Facility name
- address (TEXT) - Street address
- city (TEXT) - City
- state (TEXT) - State code
- latitude (DOUBLE PRECISION) - Latitude
- longitude (DOUBLE PRECISION) - Longitude
- geom (GEOMETRY) - PostGIS point geometry (EPSG:4326)
- operator (TEXT) - Operator/owner
- power_mw (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
- source (TEXT) - Data source (curated, osm, or both)
- osm_id (TEXT) - OpenStreetMap ID if applicable
- geocode_method (TEXT) - Geocoding provenance
Notes:
- 108 of 1,833 facilities have power ratings
- 45 facilities use city-precision fallback coordinates
- Operator strings have fragmentation issues ("Meta" vs. "Meta, Inc.")
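One way to reduce the operator fragmentation noted above is to strip trailing corporate suffixes before grouping. The sketch below is illustrative only: normalize_operator and its suffix list are hypothetical and are not part of the existing pipeline.

```python
import re

# Hypothetical list of corporate suffixes to drop; extend it to match the data.
_SUFFIX = re.compile(
    r"[,\s]+(inc|llc|corp|corporation|co|ltd|lp)\.?$", re.IGNORECASE
)


def normalize_operator(raw):
    """Collapse whitespace and strip trailing corporate suffixes from a name."""
    if raw is None:
        return None
    name = " ".join(raw.split())
    previous = None
    while name != previous:  # repeat to handle stacked suffixes like "Co., Inc."
        previous = name
        name = _SUFFIX.sub("", name).rstrip(" ,")
    return name or None
```

With this rule, "Meta" and "Meta, Inc." both normalize to "Meta". Distinct legal entities that share a brand name would also be merged, so the output is better suited to exploratory grouping than to authoritative ownership analysis.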
us_dc_sample_geocoded
Rows: 1,489
Purpose: Original curated sample with geocoding provenance (superseded by master_data_centers)
Key Columns:
- name, address, city, state, zip
- latitude, longitude, geom
- operator, power_mw
- census_lat, census_lon - Census TIGER geocode results
- nominatim_lat, nominatim_lon - Nominatim fallback results
- geocode_source - Which geocoder was used
osm_data_centers
Rows: 1,549
Purpose: Raw OpenStreetMap-derived facilities
Key Columns:
- osm_id (TEXT) - OSM element ID
- osm_type (TEXT) - node, way, or relation
- name (TEXT) - OSM name tag
- latitude, longitude, geom
- tags (JSONB) - All OSM tags as JSON
- operator (TEXT) - Extracted from OSM tags
- city, state, country
Notes: Fetched via Overpass API with query for telecom=data_center or building=data_center
master_data_center_spatial_clusters
Rows: 1,831
Purpose: DBSCAN cluster assignments for master data centers
Key Columns:
- All columns from master_data_centers
- cluster_id (INTEGER) - Cluster assignment (-1 = noise/singleton)
- cluster_size (INTEGER) - Number of facilities in cluster
- cluster_label (TEXT) - Human-readable cluster name
Notes: DBSCAN parameters: eps=15 km, min_samples=2
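To make these parameters concrete: under the scikit-learn convention, where min_samples counts the point itself, min_samples=2 makes every facility with at least one neighbor within eps a core point. The clusters are then the connected components of the "within 15 km" graph, and isolated facilities are labeled -1. The source does not say which DBSCAN implementation or distance metric was used, so the pure-Python sketch below (haversine distance, hypothetical function names) illustrates the idea and is not the actual build script.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = p2 - p1, radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def cluster_points(points, eps_km=15.0):
    """Label (lat, lon) points like DBSCAN with min_samples=2.

    Returns one label per point: 0, 1, 2, ... for clusters, -1 for noise.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) pairwise scan: acceptable for ~1,800 facilities.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(*points[i], *points[j]) <= eps_km:
                parent[find(i)] = find(j)  # union the two components

    roots = [find(i) for i in range(len(points))]
    sizes = Counter(roots)
    ids, labels = {}, []
    for r in roots:
        if sizes[r] < 2:
            labels.append(-1)  # no neighbor within eps: noise/singleton
        else:
            labels.append(ids.setdefault(r, len(ids)))
    return labels
```

For example, two facilities a few kilometers apart in northern Virginia would share a label, while a lone facility hundreds of kilometers away would receive -1.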
Enrichment Tables
data_center_census_tracts_2024
Rows: 1,815
Purpose: Per-facility demographics from containing Census tract
Key Columns:
- All columns from master_data_centers
- geoid (TEXT) - 11-digit Census tract GEOID
- state_fips, county_fips, tract
- Population: total_population, population_density_sq_mi
- Age: median_age, under_18_pct, over_65_pct
- Race/Ethnicity: white_nh_pct, black_nh_pct, asian_nh_pct, hispanic_pct
- Economics: median_household_income, per_capita_income, poverty_rate
- Education: bachelors_or_higher_pct, high_school_or_higher_pct
- Housing: median_home_value, median_rent, homeownership_rate
- Broadband: broadband_pct - Households with broadband subscription
Source: ACS 2024 5-year estimates
Notes:
- 18 of 1,833 facilities failed tract join (geocoding issues)
- Data from _dc_census_tract_acs_2024 base table
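The 11-digit tract GEOID is a concatenation of the 2-digit state FIPS code, the 3-digit county FIPS code, and the 6-digit tract code, which is why it is stored as TEXT (leading zeros matter). A minimal sketch of splitting it follows. The function name is hypothetical, and the assumption that the state_fips, county_fips, and tract columns hold exactly these pieces is not confirmed by the source.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into (state_fips, county_fips, tract)."""
    if len(geoid) != 11 or not (geoid.isascii() and geoid.isdigit()):
        raise ValueError(f"expected 11 digits, got {geoid!r}")
    # 2-digit state + 3-digit county + 6-digit tract code
    return geoid[:2], geoid[2:5], geoid[5:]
```

A GEOID that has lost its leading zero (for instance after a round trip through a numeric column) fails the length check instead of being silently misparsed.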
data_center_watershed_huc8
Rows: 1,833
Purpose: Per-facility watershed assignment
Key Columns:
- All columns from master_data_centers
- huc8 (TEXT) - 8-digit Hydrologic Unit Code
- watershed_name (TEXT) - Watershed name
- watershed_area_sq_km (DOUBLE PRECISION)
- states (TEXT) - States intersecting watershed
Source: USGS Watershed Boundary Dataset
Notes: 257 unique HUC8 watersheds contain at least one data center
data_center_nri_exposure
Rows: 1,833
Purpose: Per-facility FEMA National Risk Index hazard exposure scores
Key Columns:
- All columns from master_data_centers
- nri_id (TEXT) - Census tract GEOID (matches geoid from demographics)
- risk_score (DOUBLE PRECISION) - Overall NRI risk score
- social_vulnerability (DOUBLE PRECISION) - Social vulnerability index
- Hazard-specific risk scores (18 hazards):
  - avalanche_risk, coastal_flooding_risk, cold_wave_risk
  - drought_risk, earthquake_risk, hail_risk
  - heat_wave_risk, hurricane_risk, ice_storm_risk
  - landslide_risk, lightning_risk, riverine_flooding_risk
  - strong_wind_risk, tornado_risk, tsunami_risk
  - volcanic_activity_risk, wildfire_risk, winter_weather_risk
Source: FEMA National Risk Index (December 2025 release)
data_center_rdh_precinct_vote_matches
Rows: Varies
Purpose: Per-facility precinct-level election results
Key Columns:
- Data center identifiers
- precinct_name, precinct_id
- election_year, office
- candidate, party, votes
- vote_share_pct
Source: Redistricting Data Hub precinct shapefiles
Notes: Spatial join to voting precincts (point-in-polygon)
Base Layer Tables
_dc_census_tract_acs_2024
Rows: 85,382
Purpose: ACS 2024 demographics for all Census tracts in states with data centers
Key Columns:
- geoid (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
- name (TEXT) - Tract name
- state_fips, county_fips, tract
- Full ACS 5-year estimates (85+ columns):
  - Population by age, sex, race/ethnicity
  - Households, families, housing units
  - Income, poverty, education, employment
  - Housing values, rents, costs
  - Broadband, computer access
  - Commuting, vehicles
Source: Census ACS 2024 5-year estimates API
Notes: Universe limited to the 46 states that contain at least one data center (states with no data centers are excluded)
_dc_census_tract_boundaries_2024
Rows: 85,058
Purpose: TIGER 2024 tract polygons for data center states
Key Columns:
- geoid (TEXT) - 11-digit tract GEOID
- name (TEXT) - Tract name
- state_fips, county_fips, tract_code
- geom (GEOMETRY) - Polygon geometry (EPSG:4326)
- area_land_sq_m (DOUBLE PRECISION) - Land area in square meters
- area_water_sq_m (DOUBLE PRECISION) - Water area in square meters
Source: Census TIGER/Line 2024
ruca_codes_2020_tract
Rows: 85,528
Purpose: USDA Rural-Urban Commuting Area codes for metro/rural classification
Key Columns:
- geoid (TEXT) - 11-digit tract GEOID (matches Census tracts)
- ruca_code (TEXT) - Primary RUCA code (1-10)
- ruca_category (TEXT) - Simplified category:
  - Metropolitan (codes 1-3)
  - Micropolitan (codes 4-6)
  - Small town (codes 7-9)
  - Rural (code 10)
- ruca_description (TEXT) - Full RUCA code description
- population_2020 (INTEGER)
Source: USDA Economic Research Service RUCA 2020
Notes:
- Based on 2020 Census tracts and 2010-2020 commuting patterns
- 7 data centers failed RUCA join (Puerto Rico / non-US)
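The simplified category mapping described above can be sketched as a small lookup. The function name is hypothetical. Tolerating values such as "4.0" and returning None for codes outside 1-10 are assumptions made here, not behavior documented for the actual table.

```python
def ruca_category(ruca_code):
    """Map a primary RUCA code (stored as TEXT) to the simplified category."""
    code = int(float(ruca_code))  # accepts "4" as well as "4.0"
    if 1 <= code <= 3:
        return "Metropolitan"
    if 4 <= code <= 6:
        return "Micropolitan"
    if 7 <= code <= 9:
        return "Small town"
    if code == 10:
        return "Rural"
    return None  # outside the 1-10 range described above
```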
watershed_huc8
Rows: 2,139
Purpose: USGS HUC8 subbasin polygons for water-stress analysis
Key Columns:
- huc8 (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
- name (TEXT) - Watershed name
- geom (GEOMETRY) - Polygon geometry (EPSG:4326)
- area_sq_km (DOUBLE PRECISION)
- states (TEXT) - Comma-separated state codes
- dc_count (INTEGER) - Number of data centers in watershed
Source: USGS Watershed Boundary Dataset
Notes:
- 257 of 2,139 watersheds contain at least one data center
- Top 15 watersheds contain 50% of all US data centers
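A concentration figure like the one above could be recomputed from the dc_count column roughly as follows. This is a sketch under the assumption that dc_count values have been fetched into a list; top_n_share is a hypothetical helper.

```python
def top_n_share(dc_counts, n=15):
    """Fraction of all facilities located in the n most populated watersheds."""
    counts = sorted((c for c in dc_counts if c > 0), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:n]) / total
```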
nri_census_tracts
Rows: ~84,000
Purpose: Full FEMA National Risk Index by Census tract
Key Columns:
- nri_id (TEXT) - Census tract GEOID
- state_name, county_name, tract_name
- 460+ columns including:
  - Overall risk scores and ratings
  - Expected annual loss (dollars and building value %)
  - Social vulnerability components (15 factors)
  - Community resilience score
  - Individual hazard risk scores (18 hazards)
  - Exposure, annualized frequency, historic loss ratios per hazard
Source: FEMA National Risk Index v2.1 (December 2025)
Notes:
- Massive table with comprehensive natural hazard risk data
- Join to data centers on the tract GEOID (nri_id in this table, geoid in data_center_census_tracts_2024)
- See FEMA NRI Technical Documentation
Infrastructure Tables
Energy Infrastructure
energy_eia_operating_generator_capacity_flat
Rows: 4.7 million
Purpose: EIA generator inventory with lat/lon/MW (monthly 2008-2026)
Key Columns:
- plant_id (INTEGER) - EIA plant ID
- generator_id (TEXT) - Generator unit ID
- plant_name (TEXT)
- latitude, longitude, geom
- state, county
- utility_name, operator_name
- nameplate_capacity_mw (DOUBLE PRECISION)
- technology (TEXT) - Generation technology
- energy_source_1, energy_source_2 - Primary fuel codes
- operating_month, operating_year - When unit became operational
- status (TEXT) - Operating, standby, retired, etc.
- report_month, report_year - Data snapshot date
Source: EIA Form 860 via API
Notes:
- "Flat" means denormalized for fast spatial queries
- Each generator-month is a row (4.7M rows from monthly snapshots)
- Use for proximity analysis (e.g., "all generators within 50 km of data center")
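Because each generator appears once per monthly snapshot, summing nameplate_capacity_mw across all rows counts the same unit many times. A minimal sketch of restricting rows to the most recent snapshot before aggregating is shown below. It assumes the rows have been fetched as dictionaries keyed by column name and that report_year and report_month are numeric; latest_snapshot is a hypothetical helper.

```python
def latest_snapshot(rows):
    """Keep only rows from the most recent (report_year, report_month)."""
    rows = list(rows)
    if not rows:
        return []
    latest = max((r["report_year"], r["report_month"]) for r in rows)
    return [r for r in rows if (r["report_year"], r["report_month"]) == latest]
```

Summing capacity over latest_snapshot(rows) then gives one contribution per generator, provided each generator appears at most once per snapshot.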
energy_eia_facility_fuel_flat
Rows: Varies
Purpose: Monthly generation by plant/fuel
Key Columns:
- plant_id, plant_name
- report_month, report_year
- energy_source (TEXT) - Fuel code
- net_generation_mwh (DOUBLE PRECISION)
- fuel_consumed_mmbtu (DOUBLE PRECISION)
Source: EIA Form 923 via API
energy_eia_seds_flat
Rows: 2.57 million
Purpose: Annual state energy consumption/production (1960-2024)
Key Columns:
- state_code (TEXT)
- year (INTEGER)
- msn (TEXT) - Mnemonic series names (e.g., TETCB = total energy consumption)
- value (DOUBLE PRECISION) - Energy in trillion BTU
- unit (TEXT)
- description (TEXT) - Human-readable MSN description
Source: EIA State Energy Data System (SEDS)
Notes:
- Annual aggregates by state
- Use for state-level energy context analysis
Connectivity Infrastructure
internet_cables
Rows: 693
Purpose: Submarine cable routes
Key Columns:
- cable_id (TEXT) - Unique cable identifier
- cable_name (TEXT) - Official cable name
- geom (GEOMETRY) - LineString geometry (EPSG:4326)
- rfs_year (INTEGER) - Ready For Service year
- length_km (DOUBLE PRECISION)
- owners (TEXT[]) - Array of owner names
- landing_points (TEXT[]) - Array of landing point names
Source: TeleGeography-style cable database
Notes:
- 693 unique submarine cables
- Geometry is approximate route (not exact seabed path)
internet_cable_landing_points
Rows: 3,361
Purpose: Cable landing points (where cables come ashore)
Key Columns:
- landing_point_id (TEXT) - Unique identifier
- name (TEXT) - Landing point name
- city, country
- latitude, longitude, geom
- cables (TEXT[]) - Array of cable names landing at this point
- cable_count (INTEGER)
Source: TeleGeography-style cable database
Notes:
- Used for proximity analysis (how close are data centers to cable landings?)
- Key finding: Data centers are NOT systematically closer to cables than ordinary US cities
internet_city_dominance
Rows: 4,552
Purpose: City-level IPs/capacity (internet hub strength proxy)
Key Columns:
- city (TEXT)
- country (TEXT)
- latitude, longitude, geom
- ip_addresses (INTEGER) - Number of routable IP addresses
- capacity_rank (INTEGER) - Relative capacity ranking
Source: Internet topology datasets
Notes: Proxy for "internet hub" strength (not directly used in main analyses)
Broadband
fcc_bdc_location_provider_aggregates
Rows: Varies
Purpose: FCC BDC provider availability aggregated by county/tract
Key Columns:
- geoid (TEXT) - County or tract GEOID
- geography_level (TEXT) - county or tract
- provider_count (INTEGER)
- technology_counts (JSONB) - Count by technology type
- max_download_mbps, max_upload_mbps
Source: FCC Broadband Data Collection (BDC)
fcc_bdc_broadband_connection_table
Rows: Varies
Purpose: Per-data-center broadband provider availability
Key Columns:
- Data center identifiers
- provider_id, provider_name
- technology (TEXT)
- max_advertised_download_speed, max_advertised_upload_speed
- low_latency (BOOLEAN)
Source: FCC BDC, joined to data center locations
Notes: Built by build_fcc_bdc_broadband_connection_table.py
Commonly Used Joins
Data Center to Demographics
SELECT
dc.*,
ct.median_household_income,
ct.bachelors_or_higher_pct,
ct.broadband_pct
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct
ON dc.id = ct.id;
Data Center to Watershed
SELECT
dc.*,
w.huc8,
w.name AS watershed_name -- watershed_huc8 stores the name in its "name" column
FROM master_data_centers dc
JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
JOIN watershed_huc8 w ON dw.huc8 = w.huc8;
Data Center to Energy Infrastructure (50 km radius)
SELECT
dc.id,
dc.name,
SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
FROM master_data_centers dc
JOIN energy_eia_operating_generator_capacity_flat eg
ON ST_DWithin(
dc.geom::geography,
eg.geom::geography,
50000 -- 50 km in meters
)
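WHERE eg.status = 'OP' -- Operating only
-- Each generator-month is a row, so keep only the latest snapshot to avoid
-- counting the same generator many times (assumes numeric report_year/report_month)
AND (eg.report_year, eg.report_month) = (
  SELECT report_year, report_month
  FROM energy_eia_operating_generator_capacity_flat
  ORDER BY report_year DESC, report_month DESC
  LIMIT 1
)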
GROUP BY dc.id, dc.name;
Data Center to FEMA Hazard Risk
SELECT
dc.*,
nri.risk_score,
nri.wildfire_risk,
nri.drought_risk,
nri.heat_wave_risk
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;
Table Naming Conventions
- master_* - Canonical, deduplicated tables (use these for analysis)
- data_center_* - Data center-specific enrichment tables
- _dc_* - Base layers scoped to data center states (underscore prefix = private/internal)
- energy_eia_* - EIA energy data
- internet_* - Connectivity infrastructure
- fcc_bdc_* - FCC Broadband Data Collection
Indexes and Performance
All tables with a geom column have a GiST spatial index on it for fast spatial joins:
CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);
Key geoid columns are indexed for fast demographic joins:
CREATE INDEX idx_tablename_geoid ON tablename(geoid);
Maintenance Notes
Updating Data Centers
- Run load_postgis_osm_data_centers.py to refresh OSM data
- Run build_master_data_centers.py to rebuild master table
- Run enrichment scripts to update joins
Updating Demographics
- Update _dc_census_tract_acs_2024 from Census API
- Run create_data_center_census_tract_table.py --replace-final
Updating Energy Data
python3 ingest_eia_energy_layers.py --category power --update
Schema Export
To export the full schema:
pg_dump -h "$PGWEB_HOST" -p "$PGWEB_PORT" -U "$PGWEB_USER" -d data_centers --schema-only > schema.sql
To list all tables:
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
Contact
For database access or questions, contact the repository owner.