data-centers/database-tables.md
dadams 4525ea3f97 Add LegiScan legislation ingestion and analysis queries
Adds ingest_legiscan.py to pull all US state + federal bills (2016-2026)
from the LegiScan API into legiscan_sessions and legiscan_bills tables.
Bills are keyword-tagged across 8 research categories (data_center,
ratepayer_protection, large_load, grid_impact, tax_incentive, etc.).
Loads ~1.3M bills; ~60K tagged relevant. Adds query_legiscan_bills.sql
with pre-built analysis queries including state/DC joins. Updates
database-tables.md, README.md, and research-ideas.md accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:30:31 -07:00

# Database Tables Documentation
## Database Configuration
**Database Name**: `data_centers`
**Type**: PostgreSQL with PostGIS extension
**Connection**: Environment variables from `~/.zsh_secrets`
- `PGWEB_HOST`: Database host
- `PGWEB_PORT`: Database port (typically 5432)
- `PGWEB_USER`: Database user
- `PGWEB_PASSWORD`: Database password
- `PGWEB_DATABASE`: Database name (`data_centers`)
## Table Organization
Tables are organized into five categories:
1. **Core Data Center Tables** - Master inventories and source data
2. **Enrichment Tables** - Data centers joined with contextual data
3. **Base Layer Tables** - Geographic and demographic reference layers
4. **Infrastructure Tables** - Energy and connectivity infrastructure
5. **Legislation Tables** - LegiScan state and federal bill data (2016-2026)
---
## Core Data Center Tables
### `master_data_centers`
**Rows**: 1,833
**Purpose**: Canonical data center inventory - deduplicated merge of curated + OSM sources
**Key Columns**:
- `id` (INTEGER) - Unique identifier
- `name` (TEXT) - Facility name
- `address` (TEXT) - Street address
- `city` (TEXT) - City
- `state` (TEXT) - State code
- `latitude` (DOUBLE PRECISION) - Latitude
- `longitude` (DOUBLE PRECISION) - Longitude
- `geom` (GEOMETRY) - PostGIS point geometry (EPSG:4326)
- `operator` (TEXT) - Operator/owner
- `power_mw` (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
- `source` (TEXT) - Data source (`curated`, `osm`, or `both`)
- `osm_id` (TEXT) - OpenStreetMap ID if applicable
- `geocode_method` (TEXT) - Geocoding provenance
**Notes**:
- 108 of 1,833 facilities have power ratings
- 45 facilities use city-precision fallback coordinates
- Operator strings have fragmentation issues ("Meta" vs. "Meta, Inc.")
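
One way to reduce the operator fragmentation noted above is to normalize names before grouping. The following is a sketch only: the suffix list and the `normalize_operator` helper are hypothetical and are not part of the existing build scripts.

```python
import re

# Hypothetical list of trailing corporate suffixes to strip; extend as needed.
_SUFFIX = re.compile(
    r"[,\s]+(inc|llc|l\.l\.c|corp|corporation|co|company|ltd)\.?$",
    re.IGNORECASE,
)


def normalize_operator(raw):
    """Collapse variants such as "Meta, Inc." and "Meta" to one grouping key."""
    if raw is None:
        return None
    name = " ".join(raw.split())  # trim and collapse runs of whitespace
    # Strip suffixes repeatedly so "Foo Co., Inc." reduces to "Foo".
    while True:
        stripped = _SUFFIX.sub("", name)
        if stripped == name:
            break
        name = stripped
    return name.casefold() or None
```

The result is a grouping key, not a display name; the original `operator` string should be kept alongside it.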
### `us_dc_sample_geocoded`
**Rows**: 1,489
**Purpose**: Original curated sample with geocoding provenance (superseded by `master_data_centers`)
**Key Columns**:
- `name`, `address`, `city`, `state`, `zip`
- `latitude`, `longitude`, `geom`
- `operator`, `power_mw`
- `census_lat`, `census_lon` - Census TIGER geocode results
- `nominatim_lat`, `nominatim_lon` - Nominatim fallback results
- `geocode_source` - Which geocoder was used
### `osm_data_centers`
**Rows**: 1,549
**Purpose**: Raw OpenStreetMap-derived facilities
**Key Columns**:
- `osm_id` (TEXT) - OSM element ID
- `osm_type` (TEXT) - `node`, `way`, or `relation`
- `name` (TEXT) - OSM name tag
- `latitude`, `longitude`, `geom`
- `tags` (JSONB) - All OSM tags as JSON
- `operator` (TEXT) - Extracted from OSM tags
- `city`, `state`, `country`
**Notes**: Fetched via Overpass API with query for `telecom=data_center` or `building=data_center`
### `master_data_center_spatial_clusters`
**Rows**: 1,831
**Purpose**: DBSCAN cluster assignments for master data centers
**Key Columns**:
- All columns from `master_data_centers`
- `cluster_id` (INTEGER) - Cluster assignment (-1 = noise/singleton)
- `cluster_size` (INTEGER) - Number of facilities in cluster
- `cluster_label` (TEXT) - Human-readable cluster name
**Notes**: DBSCAN parameters: eps=15 km, min_samples=2
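
To illustrate what these parameters imply, here is a dependency-free sketch rather than the project's clustering code. It assumes great-circle (haversine) distance and that `min_samples` counts the point itself, as scikit-learn does. Under that assumption, `min_samples=2` makes every point with at least one neighbor within `eps` a core point, so the clusters are the connected components of the "within 15 km" graph and isolated points are noise (`-1`).

```python
import math

EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cluster_min2(points, eps_km=15.0):
    """Label (lat, lon) points; equivalent to DBSCAN when min_samples=2."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of points that lie within eps of each other.
    for i in range(n):
        for j in range(i + 1, n):
            if haversine_km(*points[i], *points[j]) <= eps_km:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1

    labels, ids = [], {}
    for i in range(n):
        root = find(i)
        if sizes[root] < 2:
            labels.append(-1)  # no neighbor within eps: noise/singleton
        else:
            labels.append(ids.setdefault(root, len(ids)))
    return labels
```

The pairwise loop is quadratic, which is acceptable at roughly 1,800 facilities but would need a spatial index for much larger inputs.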
---
## Enrichment Tables
### `data_center_census_tracts_2024`
**Rows**: 1,815
**Purpose**: Per-facility demographics from containing Census tract
**Key Columns**:
- All columns from `master_data_centers`
- `geoid` (TEXT) - 11-digit Census tract GEOID
- `state_fips`, `county_fips`, `tract`
- **Population**: `total_population`, `population_density_sq_mi`
- **Age**: `median_age`, `under_18_pct`, `over_65_pct`
- **Race/Ethnicity**: `white_nh_pct`, `black_nh_pct`, `asian_nh_pct`, `hispanic_pct`
- **Economics**: `median_household_income`, `per_capita_income`, `poverty_rate`
- **Education**: `bachelors_or_higher_pct`, `high_school_or_higher_pct`
- **Housing**: `median_home_value`, `median_rent`, `homeownership_rate`
- **Broadband**: `broadband_pct` - Households with broadband subscription
**Source**: ACS 2024 5-year estimates
**Notes**:
- 18 of 1,833 facilities failed tract join (geocoding issues)
- Data from `_dc_census_tract_acs_2024` base table
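
An 11-digit tract GEOID concatenates a 2-digit state FIPS code, a 3-digit county FIPS code, and a 6-digit tract code. The sketch below splits one apart. The helper name is hypothetical, and whether `county_fips` in this table stores the 3-digit or the 5-digit form is an assumption to check against the data.

```python
from typing import NamedTuple


class TractGeoid(NamedTuple):
    state_fips: str
    county_fips: str
    tract: str


def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into state (2), county (3), and tract (6)."""
    text = str(geoid).strip()
    if len(text) != 11 or not text.isdigit():
        # A 10-digit value usually means a leading zero was lost in an integer cast.
        raise ValueError(f"expected an 11-digit tract GEOID, got {geoid!r}")
    return TractGeoid(text[:2], text[2:5], text[5:])
```

Storing `geoid` as TEXT, as this schema does, preserves leading zeros for states such as Alabama (`01`).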
### `data_center_watershed_huc8`
**Rows**: 1,833
**Purpose**: Per-facility watershed assignment
**Key Columns**:
- All columns from `master_data_centers`
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code
- `watershed_name` (TEXT) - Watershed name
- `watershed_area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - States intersecting watershed
**Source**: USGS Watershed Boundary Dataset
**Notes**: 257 unique HUC8 watersheds contain at least one data center
### `data_center_nri_exposure`
**Rows**: 1,833
**Purpose**: Per-facility FEMA National Risk Index hazard exposure scores
**Key Columns**:
- All columns from `master_data_centers`
- `nri_id` (TEXT) - Census tract GEOID (matches `geoid` from demographics)
- `risk_score` (DOUBLE PRECISION) - Overall NRI risk score
- `social_vulnerability` (DOUBLE PRECISION) - Social vulnerability index
- **Hazard-specific risk scores** (18 hazards):
  - `avalanche_risk`, `coastal_flooding_risk`, `cold_wave_risk`
  - `drought_risk`, `earthquake_risk`, `hail_risk`
  - `heat_wave_risk`, `hurricane_risk`, `ice_storm_risk`
  - `landslide_risk`, `lightning_risk`, `riverine_flooding_risk`
  - `strong_wind_risk`, `tornado_risk`, `tsunami_risk`
  - `volcanic_activity_risk`, `wildfire_risk`, `winter_weather_risk`
**Source**: FEMA National Risk Index (December 2025 release)
### `data_center_rdh_precinct_vote_matches`
**Rows**: Varies
**Purpose**: Per-facility precinct-level election results
**Key Columns**:
- Data center identifiers
- `precinct_name`, `precinct_id`
- `election_year`, `office`
- `candidate`, `party`, `votes`
- `vote_share_pct`
**Source**: Redistricting Data Hub precinct shapefiles
**Notes**: Spatial join to voting precincts (point-in-polygon)
---
## Base Layer Tables
### `_dc_census_tract_acs_2024`
**Rows**: 85,382
**Purpose**: ACS 2024 demographics for all Census tracts in states with data centers
**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract`
- **Full ACS 5-year estimates** (85+ columns):
  - Population by age, sex, race/ethnicity
  - Households, families, housing units
  - Income, poverty, education, employment
  - Housing values, rents, costs
  - Broadband, computer access
  - Commuting, vehicles
**Source**: Census ACS 2024 5-year estimates API
**Notes**: Universe limited to the 46 states with data centers (states without any data centers are excluded)
### `_dc_census_tract_boundaries_2024`
**Rows**: 85,058
**Purpose**: TIGER 2024 tract polygons for data center states
**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID
- `name` (TEXT) - Tract name
- `state_fips`, `county_fips`, `tract_code`
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_land_sq_m` (DOUBLE PRECISION) - Land area in square meters
- `area_water_sq_m` (DOUBLE PRECISION) - Water area in square meters
**Source**: Census TIGER/Line 2024
### `ruca_codes_2020_tract`
**Rows**: 85,528
**Purpose**: USDA Rural-Urban Commuting Area codes for metro/rural classification
**Key Columns**:
- `geoid` (TEXT) - 11-digit tract GEOID (matches Census tracts)
- `ruca_code` (TEXT) - Primary RUCA code (1-10)
- `ruca_category` (TEXT) - Simplified category:
  - `Metropolitan` (codes 1-3)
  - `Micropolitan` (codes 4-6)
  - `Small town` (codes 7-9)
  - `Rural` (code 10)
- `ruca_description` (TEXT) - Full RUCA code description
- `population_2020` (INTEGER)
**Source**: USDA Economic Research Service RUCA 2020
**Notes**:
- Based on 2020 Census tracts and 2010-2020 commuting patterns
- 7 data centers failed RUCA join (Puerto Rico / non-US)
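
The code-to-category mapping listed above can be written as a small function. This is a sketch for illustration: the function name is hypothetical, and treating any value outside 1-10 as uncategorized is an assumption.

```python
def ruca_category(ruca_code):
    """Map a primary RUCA code (stored as TEXT) to the simplified category."""
    try:
        code = int(str(ruca_code).strip())
    except ValueError:
        return None  # missing or non-numeric code
    if 1 <= code <= 3:
        return "Metropolitan"
    if 4 <= code <= 6:
        return "Micropolitan"
    if 7 <= code <= 9:
        return "Small town"
    if code == 10:
        return "Rural"
    return None  # outside the 1-10 range documented here
```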
### `watershed_huc8`
**Rows**: 2,139
**Purpose**: USGS HUC8 subbasin polygons for water-stress analysis
**Key Columns**:
- `huc8` (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
- `name` (TEXT) - Watershed name
- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
- `area_sq_km` (DOUBLE PRECISION)
- `states` (TEXT) - Comma-separated state codes
- `dc_count` (INTEGER) - Number of data centers in watershed
**Source**: USGS Watershed Boundary Dataset
**Notes**:
- 257 of 2,139 watersheds contain at least one data center
- Top 15 watersheds contain 50% of all US data centers
### `nri_census_tracts`
**Rows**: ~84,000
**Purpose**: Full FEMA National Risk Index by Census tract
**Key Columns**:
- `nri_id` (TEXT) - Census tract GEOID
- `state_name`, `county_name`, `tract_name`
- **460+ columns** including:
  - Overall risk scores and ratings
  - Expected annual loss (dollars and building value %)
  - Social vulnerability components (15 factors)
  - Community resilience score
  - Individual hazard risk scores (18 hazards)
  - Exposure, annualized frequency, historic loss ratios per hazard
**Source**: FEMA National Risk Index v2.1 (December 2025)
**Notes**:
- Massive table with comprehensive natural hazard risk data
- Join to data centers via `nri_id`, which holds the Census tract GEOID (the `geoid` column in the tract tables)
- See [FEMA NRI Technical Documentation](https://hazards.fema.gov/nri/)
---
## Infrastructure Tables
### Energy Infrastructure
#### `energy_eia_operating_generator_capacity_flat`
**Rows**: 4.7 million
**Purpose**: EIA generator inventory with lat/lon/MW (monthly 2008-2026)
**Key Columns**:
- `plant_id` (INTEGER) - EIA plant ID
- `generator_id` (TEXT) - Generator unit ID
- `plant_name` (TEXT)
- `latitude`, `longitude`, `geom`
- `state`, `county`
- `utility_name`, `operator_name`
- `nameplate_capacity_mw` (DOUBLE PRECISION)
- `technology` (TEXT) - Generation technology
- `energy_source_1`, `energy_source_2` - Primary fuel codes
- `operating_month`, `operating_year` - When unit became operational
- `status` (TEXT) - Operating, standby, retired, etc.
- `report_month`, `report_year` - Data snapshot date
**Source**: EIA Form 860 via API
**Notes**:
- "Flat" means denormalized for fast spatial queries
- Each generator-month is a row (4.7M rows from monthly snapshots)
- Use for proximity analysis (e.g., "all generators within 50 km of data center")
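
Because each generator-month is a row, summing `nameplate_capacity_mw` over the whole table counts a generator once per monthly report. The sketch below shows the idea of restricting to a single snapshot before aggregating. It operates on plain dictionaries rather than the database, and it assumes `report_year` and `report_month` are numeric.

```python
def capacity_in_latest_snapshot(rows):
    """Sum nameplate MW using only rows from the most recent report month.

    rows: iterable of dicts with report_year, report_month, and
    nameplate_capacity_mw keys (column names taken from the table above).
    """
    rows = list(rows)
    if not rows:
        return 0.0
    # Tuples compare year first, then month, so max() finds the newest snapshot.
    latest = max((r["report_year"], r["report_month"]) for r in rows)
    return sum(
        r["nameplate_capacity_mw"]
        for r in rows
        if (r["report_year"], r["report_month"]) == latest
    )
```

The same restriction applies in SQL: filter on one `report_year`/`report_month` pair before any `SUM`.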
#### `energy_eia_facility_fuel_flat`
**Rows**: Varies
**Purpose**: Monthly generation by plant/fuel
**Key Columns**:
- `plant_id`, `plant_name`
- `report_month`, `report_year`
- `energy_source` (TEXT) - Fuel code
- `net_generation_mwh` (DOUBLE PRECISION)
- `fuel_consumed_mmbtu` (DOUBLE PRECISION)
**Source**: EIA Form 923 via API
#### `energy_eia_seds_flat`
**Rows**: 2.57 million
**Purpose**: Annual state energy consumption/production (1960-2024)
**Key Columns**:
- `state_code` (TEXT)
- `year` (INTEGER)
- `msn` (TEXT) - Mnemonic series names (e.g., `TETCB` = total energy consumption)
- `value` (DOUBLE PRECISION) - Energy in trillion BTU
- `unit` (TEXT)
- `description` (TEXT) - Human-readable MSN description
**Source**: EIA State Energy Data System (SEDS)
**Notes**:
- Annual aggregates by state
- Use for state-level energy context analysis
---
### Connectivity Infrastructure
#### `internet_cables`
**Rows**: 693
**Purpose**: Submarine cable routes
**Key Columns**:
- `cable_id` (TEXT) - Unique cable identifier
- `cable_name` (TEXT) - Official cable name
- `geom` (GEOMETRY) - LineString geometry (EPSG:4326)
- `rfs_year` (INTEGER) - Ready For Service year
- `length_km` (DOUBLE PRECISION)
- `owners` (TEXT[]) - Array of owner names
- `landing_points` (TEXT[]) - Array of landing point names
**Source**: TeleGeography-style cable database
**Notes**:
- 693 unique submarine cables
- Geometry is approximate route (not exact seabed path)
#### `internet_cable_landing_points`
**Rows**: 3,361
**Purpose**: Cable landing points (where cables come ashore)
**Key Columns**:
- `landing_point_id` (TEXT) - Unique identifier
- `name` (TEXT) - Landing point name
- `city`, `country`
- `latitude`, `longitude`, `geom`
- `cables` (TEXT[]) - Array of cable names landing at this point
- `cable_count` (INTEGER)
**Source**: TeleGeography-style cable database
**Notes**:
- Used for proximity analysis (how close are data centers to cable landings?)
- **Key finding**: Data centers are NOT systematically closer to cables than ordinary US cities
#### `internet_city_dominance`
**Rows**: 4,552
**Purpose**: City-level IPs/capacity (internet hub strength proxy)
**Key Columns**:
- `city` (TEXT)
- `country` (TEXT)
- `latitude`, `longitude`, `geom`
- `ip_addresses` (INTEGER) - Number of routable IP addresses
- `capacity_rank` (INTEGER) - Relative capacity ranking
**Source**: Internet topology datasets
**Notes**: Proxy for "internet hub" strength (not directly used in main analyses)
---
### Broadband
#### `fcc_bdc_location_provider_aggregates`
**Rows**: Varies
**Purpose**: FCC BDC provider availability aggregated by county/tract
**Key Columns**:
- `geoid` (TEXT) - County or tract GEOID
- `geography_level` (TEXT) - `county` or `tract`
- `provider_count` (INTEGER)
- `technology_counts` (JSONB) - Count by technology type
- `max_download_mbps`, `max_upload_mbps`
**Source**: FCC Broadband Data Collection (BDC)
#### `fcc_bdc_broadband_connection_table`
**Rows**: Varies
**Purpose**: Per-data-center broadband provider availability
**Key Columns**:
- Data center identifiers
- `provider_id`, `provider_name`
- `technology` (TEXT)
- `max_advertised_download_speed`, `max_advertised_upload_speed`
- `low_latency` (BOOLEAN)
**Source**: FCC BDC, joined to data center locations
**Notes**: Built by `build_fcc_bdc_broadband_connection_table.py`
---
### Other Tables
#### `opposition_cases_geocoded`
**Rows**: 18
**Purpose**: Geocoded community-opposition cases against data center builds
**Key Columns**:
- `case_id` (TEXT) - Unique identifier
- `developer` (TEXT) - Proposed developer/operator
- `investment_billions` (DOUBLE PRECISION) - Investment amount in billions
- `outcome` (TEXT) - Case outcome (approved, rejected, pending)
- `governance_response` (TEXT) - Government response
- `latitude`, `longitude`, `geom`
**Source**: Compiled from news archives
**Notes**: Loaded but currently unused - see research-ideas.md for proposed analyses
#### `census_tract_huc8_link`
**Rows**: 806
**Purpose**: Tract↔HUC8 spatial overlap table
**Key Columns**:
- `geoid` (TEXT) - Census tract GEOID
- `huc8` (TEXT) - HUC8 watershed code
- `overlap_pct` (DOUBLE PRECISION) - Percentage of tract overlapping watershed
**Notes**: Useful for downstream tract-level water-stress joins; limited to tracts containing data centers
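
A tract can overlap more than one watershed, so a tract-level join often needs a single "dominant" HUC8 per tract. The sketch below picks the watershed with the largest `overlap_pct`. The helper name is hypothetical, and choosing the largest overlap (instead of weighting by overlap) is one possible design, not something this document prescribes.

```python
def dominant_huc8(link_rows):
    """Return {geoid: huc8}, keeping the watershed with the largest overlap_pct."""
    best = {}
    for row in link_rows:
        current = best.get(row["geoid"])
        # Strict ">" keeps the first row seen when two overlaps tie.
        if current is None or row["overlap_pct"] > current[1]:
            best[row["geoid"]] = (row["huc8"], row["overlap_pct"])
    return {geoid: huc8 for geoid, (huc8, _) in best.items()}
```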
#### `im3_state_projected_moderate_50`
**Rows**: 328
**Purpose**: PNNL IM3 projected data center siting (moderate growth, gravity weight 0.50)
**Key Columns**:
- `facility_id` (TEXT)
- `state` (TEXT)
- `cost_millions` (DOUBLE PRECISION)
- `it_mw` (DOUBLE PRECISION) - IT load in megawatts
- `cooling_water_demand_gal_per_day` (DOUBLE PRECISION)
- `latitude`, `longitude`, `geom`
**Source**: PNNL Integrated Multisector Multiscale Modeling (IM3)
**Notes**: Loaded but unused - potential for forward-projection analysis
#### `im3_projected_state_demand_summary`
**Rows**: 31
**Purpose**: State-level rollup of IM3 projected facility counts, IT MW, and cooling demand
**Key Columns**:
- `state` (TEXT)
- `facility_count` (INTEGER)
- `total_it_mw` (DOUBLE PRECISION)
- `total_cooling_demand_mgd` (DOUBLE PRECISION) - Million gallons per day
**Source**: IM3 model outputs
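
For illustration, a state rollup like this one could be derived from facility-level rows as sketched below. That the summary table is actually built from `im3_state_projected_moderate_50` is an assumption; the column names come from the two table descriptions, and the unit conversion is gallons per day divided by 1,000,000 to get million gallons per day (MGD).

```python
from collections import defaultdict

GALLONS_PER_MILLION_GALLONS = 1_000_000


def summarize_by_state(facilities):
    """Aggregate facility rows into per-state counts, IT MW, and cooling MGD."""
    totals = defaultdict(
        lambda: {"facility_count": 0, "total_it_mw": 0.0, "total_cooling_demand_mgd": 0.0}
    )
    for fac in facilities:
        state = totals[fac["state"]]
        state["facility_count"] += 1
        state["total_it_mw"] += fac["it_mw"]
        # Convert gallons/day to million gallons/day.
        state["total_cooling_demand_mgd"] += (
            fac["cooling_water_demand_gal_per_day"] / GALLONS_PER_MILLION_GALLONS
        )
    return dict(totals)
```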
#### `utility_rate_tracker_2025_2028`
**Rows**: 374
**Purpose**: Utility rate-increase tracker by provider × state × service type
**Key Columns**:
- `provider` (TEXT) - Utility provider name
- `state` (TEXT)
- `service_type` (TEXT)
- `effective_date` (DATE)
- `monthly_increase_dollars` (DOUBLE PRECISION)
- `percent_increase` (DOUBLE PRECISION)
**Source**: Utility rate tracker database
**Notes**: Loaded but unused in demographic/energy analysis
#### `energy_atlas_layers_catalog`
**Rows**: ~5
**Purpose**: Metadata catalog of EIA layers ingested
**Key Columns**:
- `table_name` (TEXT)
- `source_url` (TEXT)
- `import_timestamp` (TIMESTAMP)
- `row_count` (INTEGER)
**Notes**: Created by `ingest_eia_energy_layers.py`
---
## Legislation Tables
Populated by `ingest_legiscan.py` using the [LegiScan API](https://legiscan.com/legiscan).
Covers all 50 states + DC + US Congress, sessions from 2016 through 2026.
Data licensed [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) — attribute LegiScan LLC.
### `legiscan_sessions`
**Rows**: 646
**Purpose**: One row per legislative session dataset downloaded from LegiScan
**Key Columns**:
- `session_id` (INTEGER) - LegiScan session ID (PRIMARY KEY)
- `state_abbr` (VARCHAR) - Two-letter state code (`CA`, `TX`, `US`, etc.)
- `state_id` (INTEGER) - LegiScan numeric state ID
- `year_start`, `year_end` (INTEGER) - Session year range
- `session_title` (TEXT) - Full session name
- `session_tag` (TEXT) - Short tag (e.g., "Regular Session", "1st Special Session")
- `is_special` (BOOLEAN) - True for special/extraordinary sessions
- `is_prior` (BOOLEAN) - True for completed/sine-die sessions
- `dataset_hash` (VARCHAR) - MD5 of dataset ZIP; used to detect updates
- `dataset_date` (DATE) - Date dataset was last published by LegiScan
- `dataset_size_mb` (FLOAT) - Compressed ZIP size
- `bill_count` (INTEGER) - Number of bills loaded from this session
- `imported_at` (TIMESTAMPTZ) - When this session was last imported
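
The `dataset_hash` column makes incremental refreshes possible: a session only needs re-downloading when its published hash differs from the stored one. A minimal sketch of that comparison follows. The function name and the shape of `remote_datasets` are assumptions, not the actual interface of `ingest_legiscan.py`.

```python
def sessions_to_refresh(stored_hashes, remote_datasets):
    """Return session IDs that are new or whose dataset hash has changed.

    stored_hashes: {session_id: dataset_hash} as read from legiscan_sessions.
    remote_datasets: iterable of dicts with session_id and dataset_hash keys
    (assumed shape of the upstream dataset listing).
    """
    stale = []
    for dataset in remote_datasets:
        # .get() returns None for unseen sessions, so new sessions count as stale.
        if stored_hashes.get(dataset["session_id"]) != dataset["dataset_hash"]:
            stale.append(dataset["session_id"])
    return stale
```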
### `legiscan_bills`
**Rows**: ~1,313,000
**Purpose**: All bills from all sessions; tagged for relevance to data center research topics
**Key Columns**:
- `bill_id` (INTEGER) - LegiScan bill ID (PRIMARY KEY)
- `session_id` (INTEGER) - FK → `legiscan_sessions`
- `state` (VARCHAR) - Two-letter state code
- `bill_number` (VARCHAR) - Bill number (e.g., `SB 1000`, `HB 233`)
- `bill_type` (VARCHAR) - `B`=Bill, `R`=Resolution, `CR`=Concurrent Resolution, etc.
- `title` (TEXT) - Short title
- `description` (TEXT) - Longer description
- `status` (INTEGER) - Current status code (see below)
- `status_date` (DATE) - Date of last status change
- `completed` (INTEGER) - 1 if bill is in a terminal state
- `body` (VARCHAR) - Originating chamber (`H`=House, `S`=Senate, `C`=Council, etc.)
- `url` (TEXT) - LegiScan bill page URL
- `state_link` (TEXT) - Official state legislature URL
- `change_hash` (VARCHAR) - MD5 used to detect bill-level updates
- `subjects` (TEXT[]) - LegiScan subject tags (GIN indexed)
- `sponsor_count` (INTEGER) - Number of sponsors
- `vote_count` (INTEGER) - Number of recorded votes
- `text_count` (INTEGER) - Number of bill text versions
- `is_relevant` (BOOLEAN) - True if any relevance tag matched (indexed)
- `relevance_tags` (TEXT[]) - Matched topic tags (GIN indexed)
- `imported_at` (TIMESTAMPTZ) - When this bill was last upserted
**Status codes**: 1=Introduced, 2=Engrossed, 3=Enrolled, 4=Passed, 5=Vetoed, 6=Failed, 7=Override, 8=Chaptered, 9=Referred, 12=Draft
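
When working with query results in Python, the status codes listed above can be wrapped in an enum so that reports show labels instead of integers. This is a convenience sketch; the class and function names are hypothetical.

```python
from enum import IntEnum


class BillStatus(IntEnum):
    """Status codes as listed in this document."""
    INTRODUCED = 1
    ENGROSSED = 2
    ENROLLED = 3
    PASSED = 4
    VETOED = 5
    FAILED = 6
    OVERRIDE = 7
    CHAPTERED = 8
    REFERRED = 9
    DRAFT = 12


def status_label(code):
    """Return a readable label, or a placeholder for codes not listed above."""
    try:
        return BillStatus(code).name.title()
    except ValueError:
        return f"Unknown ({code})"
```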
**Relevance tags** (keyword-matched against title + description + subjects):
| Tag | What it captures |
|-----|-----------------|
| `data_center` | Data centers, hyperscale, colocation, AI campuses, HPC facilities |
| `large_load` | Crypto mining, large industrial loads, extraordinary load |
| `ratepayer_protection` | Cost shifting, cross-subsidy, rate design, affordability, rate class |
| `grid_impact` | Grid reliability, transmission, interconnection queue, IRP |
| `tax_incentive` | Tax exemptions, abatements, credits for facilities |
| `energy_policy` | Renewable PPAs, green tariffs, clean electricity, decarbonization |
| `water_use` | Cooling water, evaporative cooling, water footprint |
| `siting_permitting` | Zoning, conditional use permits, local control, preemption |
**Notes**:
- ~60,000 relevant bills out of 1.3M total (~4.6%)
- `data_center` tag: ~2,182 bills; `ratepayer_protection`: ~49,000
- GIN indexes on `subjects`, `relevance_tags`, and full-text (`title || description`)
- Use `query_legiscan_bills.sql` for pre-built research queries
- Re-run `python ingest_legiscan.py --fetch --load` weekly to pick up dataset updates
- Re-run `python ingest_legiscan.py --tag` after editing keyword lists
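
The tagging step can be pictured as follows. This sketch is not the code in `ingest_legiscan.py`: the keyword lists are hypothetical placeholders drawn from the table above, only three tags are shown, and regular-expression matching is an assumption about how keywords are compared.

```python
import re

# Hypothetical keyword lists; the real ones live in ingest_legiscan.py.
KEYWORDS = {
    "data_center": ["data center", "hyperscale", "colocation"],
    "large_load": ["crypto mining", "large load"],
    "water_use": ["cooling water", "evaporative cooling"],
}

# Leading word boundary only, so plurals such as "data centers" still match.
_PATTERNS = {
    tag: re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + ")", re.IGNORECASE)
    for tag, words in KEYWORDS.items()
}


def tag_bill(title, description, subjects):
    """Return (is_relevant, relevance_tags) for one bill."""
    # Match against title + description + subjects, as described above.
    haystack = " ".join([title or "", description or "", *(subjects or [])])
    tags = sorted(tag for tag, pattern in _PATTERNS.items() if pattern.search(haystack))
    return bool(tags), tags
```

Broad keywords inflate a tag quickly, which may be why `ratepayer_protection` matches far more bills than `data_center`; spot-checking a sample after each `--tag` run is a reasonable safeguard.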
---
## Commonly Used Joins
### Data Center to Demographics
```sql
SELECT
dc.*,
ct.median_household_income,
ct.bachelors_or_higher_pct,
ct.broadband_pct
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct
ON dc.id = ct.id;
```
### Data Center to Watershed
```sql
SELECT
dc.*,
w.huc8,
w.name AS watershed_name
FROM master_data_centers dc
JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
JOIN watershed_huc8 w ON dw.huc8 = w.huc8;
```
### Data Center to Energy Infrastructure (50 km radius)
```sql
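-- Each generator-month is a row, so restrict to a single snapshot;
-- otherwise every generator is summed once per monthly report.
-- Assumes report_year and report_month are numeric.
WITH latest AS (
    SELECT report_year, report_month
    FROM energy_eia_operating_generator_capacity_flat
    ORDER BY report_year DESC, report_month DESC
    LIMIT 1
)
SELECT
    dc.id,
    dc.name,
    SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
FROM master_data_centers dc
JOIN energy_eia_operating_generator_capacity_flat eg
    ON ST_DWithin(
        dc.geom::geography,
        eg.geom::geography,
        50000 -- 50 km in meters
    )
JOIN latest
    ON eg.report_year = latest.report_year
    AND eg.report_month = latest.report_month
WHERE eg.status = 'OP' -- Operating only
GROUP BY dc.id, dc.name;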
```
### Data Center to FEMA Hazard Risk
```sql
SELECT
dc.*,
nri.risk_score,
nri.wildfire_risk,
nri.drought_risk,
nri.heat_wave_risk
FROM master_data_centers dc
JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;
```
---
## Table Naming Conventions
- **`master_*`** - Canonical, deduplicated tables (use these for analysis)
- **`data_center_*`** - Data center-specific enrichment tables
- **`_dc_*`** - Base layers scoped to data center states (underscore prefix = private/internal)
- **`energy_eia_*`** - EIA energy data
- **`internet_*`** - Connectivity infrastructure
- **`fcc_bdc_*`** - FCC Broadband Data Collection
---
## Indexes and Performance
Every table with a `geom` column has a spatial (GiST) index on it for fast spatial joins:
```sql
CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);
```
Key `geoid` columns are indexed for fast demographic joins:
```sql
CREATE INDEX idx_tablename_geoid ON tablename(geoid);
```
---
## Maintenance Notes
### Updating Data Centers
1. Run `load_postgis_osm_data_centers.py` to refresh OSM data
2. Run `build_master_data_centers.py` to rebuild master table
3. Run enrichment scripts to update joins
### Updating Demographics
1. Update `_dc_census_tract_acs_2024` from Census API
2. Run `create_data_center_census_tract_table.py --replace-final`
### Updating Energy Data
```bash
python3 ingest_eia_energy_layers.py --category power --update
```
---
## Schema Export
To export the full schema:
```bash
pg_dump -h "$PGWEB_HOST" -p "$PGWEB_PORT" -U "$PGWEB_USER" -d "$PGWEB_DATABASE" --schema-only > schema.sql
```
To list all tables:
```sql
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```
---
## Contact
For database access or questions, contact the repository owner.