Reorganize project into scripts/, docs/, data/, output/ directories

Move all Python scripts to scripts/, documentation to docs/, raw input data to data/, and generated HTML/CSV outputs to output/. Update path references in 8 scripts to use Path(__file__).parent.parent as project root so they work correctly from the new location. Update README links and quick-start commands accordingly. Notebooks remain at root.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
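
A minimal sketch of the path pattern described in the commit message, assuming a script that lives one level below the repository root in `scripts/`. The helper name `project_root` and the example path are illustrative and not taken from the repository.

```python
from pathlib import Path


def project_root(script_path: str) -> Path:
    """Return the directory two levels up from a script, i.e. the repo root
    when the script lives in <root>/scripts/."""
    return Path(script_path).resolve().parent.parent


# Inside a real script this would be called as project_root(__file__).
root = project_root("/repo/scripts/build_master_data_centers.py")
data_dir = root / "data"      # raw inputs
output_dir = root / "output"  # generated HTML/CSV
```

Resolving from the script's own location, rather than the current working directory, keeps the paths stable no matter where the script is launched from.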
2026-05-27 21:57:22 -07:00
parent a2e295d95b
commit ee5856661a
40 changed files with 31 additions and 30 deletions
--- a/docs/database-tables.md
+++ b/docs/database-tables.md
@@ -0,0 +1,701 @@
+# Database Tables Documentation
+
+## Database Configuration
+
+**Database Name**: `data_centers`  
+**Type**: PostgreSQL with PostGIS extension  
+**Connection**: Environment variables from `~/.zsh_secrets`
+- `PGWEB_HOST`: Database host
+- `PGWEB_PORT`: Database port (5433)
+- `PGWEB_USER`: Database user
+- `PGWEB_PASSWORD`: Database password
+- `PGWEB_DATABASE`: Database name (`data_centers`)
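+
+A minimal sketch of turning these variables into a libpq-style connection URI, assuming they have already been exported into the environment (for example by sourcing `~/.zsh_secrets`). The helper name `build_pg_uri` is hypothetical; the project's scripts may assemble the connection differently.
+
+```python
+import os
+from urllib.parse import quote
+
+
+def build_pg_uri(env=os.environ) -> str:
+    """Assemble a postgresql:// URI from the PGWEB_* variables."""
+    user = quote(env["PGWEB_USER"], safe="")
+    password = quote(env["PGWEB_PASSWORD"], safe="")  # escape ':', '@', '/' etc.
+    host = env["PGWEB_HOST"]
+    port = env.get("PGWEB_PORT", "5433")  # documented port
+    database = env.get("PGWEB_DATABASE", "data_centers")
+    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
+```
+
+Percent-encoding the user and password keeps characters such as `@` from being misread as URI delimiters.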
+
+## Table Organization
+
+Tables are organized into five categories:
+1. **Core Data Center Tables** - Master inventories and source data
+2. **Enrichment Tables** - Data centers joined with contextual data
+3. **Base Layer Tables** - Geographic and demographic reference layers
+4. **Infrastructure Tables** - Energy and connectivity infrastructure
+5. **Legislation Tables** - LegiScan state and federal bill data (2016-2026)
+
+---
+
+## Core Data Center Tables
+
+### `master_data_centers`
+**Rows**: 1,833  
+**Purpose**: Canonical data center inventory - deduplicated merge of curated + OSM sources
+
+**Key Columns**:
+- `id` (INTEGER) - Unique identifier
+- `name` (TEXT) - Facility name
+- `address` (TEXT) - Street address
+- `city` (TEXT) - City
+- `state` (TEXT) - State code
+- `latitude` (DOUBLE PRECISION) - Latitude
+- `longitude` (DOUBLE PRECISION) - Longitude
+- `geom` (GEOMETRY) - PostGIS point geometry (EPSG:4326)
+- `operator` (TEXT) - Operator/owner
+- `power_mw` (DOUBLE PRECISION) - Power capacity in megawatts (sparse: 5.9% populated)
+- `source` (TEXT) - Data source (`curated`, `osm`, or `both`)
+- `osm_id` (TEXT) - OpenStreetMap ID if applicable
+- `geocode_method` (TEXT) - Geocoding provenance
+
+**Notes**: 
+- 108 of 1,833 facilities have power ratings
+- 45 facilities use city-precision fallback coordinates
+- Operator names are not normalized, so one company can appear under several spellings (e.g., "Meta" vs. "Meta, Inc.")
+
+### `us_dc_sample_geocoded`
+**Rows**: 1,489  
+**Purpose**: Original curated sample with geocoding provenance (superseded by `master_data_centers`)
+
+**Key Columns**:
+- `name`, `address`, `city`, `state`, `zip`
+- `latitude`, `longitude`, `geom`
+- `operator`, `power_mw`
+- `census_lat`, `census_lon` - Census TIGER geocode results
+- `nominatim_lat`, `nominatim_lon` - Nominatim fallback results
+- `geocode_source` - Which geocoder was used
+
+### `osm_data_centers`
+**Rows**: 1,549  
+**Purpose**: Raw OpenStreetMap-derived facilities
+
+**Key Columns**:
+- `osm_id` (TEXT) - OSM element ID
+- `osm_type` (TEXT) - `node`, `way`, or `relation`
+- `name` (TEXT) - OSM name tag
+- `latitude`, `longitude`, `geom`
+- `tags` (JSONB) - All OSM tags as JSON
+- `operator` (TEXT) - Extracted from OSM tags
+- `city`, `state`, `country`
+
+**Notes**: Fetched via the Overpass API with a query for `telecom=data_center` or `building=data_center`
+
+### `master_data_center_spatial_clusters`
+**Rows**: 1,831  
+**Purpose**: DBSCAN cluster assignments for master data centers
+
+**Key Columns**:
+- All columns from `master_data_centers`
+- `cluster_id` (INTEGER) - Cluster assignment (-1 = noise/singleton)
+- `cluster_size` (INTEGER) - Number of facilities in cluster
+- `cluster_label` (TEXT) - Human-readable cluster name
+
+**Notes**: DBSCAN parameters: eps=15 km, min_samples=2
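+
+To make these parameters concrete: under the scikit-learn convention, where `min_samples` counts the point itself, `min_samples=2` makes every point with at least one neighbor within `eps` a core point. The clusters are then the connected components of the "within 15 km" graph, and isolated points get label -1. The sketch below illustrates that equivalence in plain Python. It assumes great-circle (haversine) distance and is not the project's clustering script.
+
+```python
+from math import asin, cos, radians, sin, sqrt
+
+EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
+
+
+def haversine_km(a, b):
+    """Great-circle distance in km between two (lat, lon) points in degrees."""
+    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
+    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
+    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))
+
+
+def cluster_min_samples_2(points, eps_km=15.0):
+    """Label points like DBSCAN(eps, min_samples=2): connected components of
+    the eps-neighbor graph; points with no neighbor get -1."""
+    parent = list(range(len(points)))
+
+    def find(i):
+        while parent[i] != i:
+            parent[i] = parent[parent[i]]  # path halving
+            i = parent[i]
+        return i
+
+    has_neighbor = [False] * len(points)
+    for i in range(len(points)):
+        for j in range(i + 1, len(points)):  # O(n^2), acceptable for a sketch
+            if haversine_km(points[i], points[j]) <= eps_km:
+                parent[find(i)] = find(j)
+                has_neighbor[i] = has_neighbor[j] = True
+
+    ids, labels = {}, []
+    for i in range(len(points)):
+        labels.append(ids.setdefault(find(i), len(ids)) if has_neighbor[i] else -1)
+    return labels
+```
+
+With two points a few kilometers apart and a third several states away, the first two share a cluster and the third is labeled -1.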
+
+---
+
+## Enrichment Tables
+
+### `data_center_census_tracts_2024`
+**Rows**: 1,815  
+**Purpose**: Per-facility demographics from containing Census tract
+
+**Key Columns**:
+- All columns from `master_data_centers`
+- `geoid` (TEXT) - 11-digit Census tract GEOID
+- `state_fips`, `county_fips`, `tract`
+- **Population**: `total_population`, `population_density_sq_mi`
+- **Age**: `median_age`, `under_18_pct`, `over_65_pct`
+- **Race/Ethnicity**: `white_nh_pct`, `black_nh_pct`, `asian_nh_pct`, `hispanic_pct`
+- **Economics**: `median_household_income`, `per_capita_income`, `poverty_rate`
+- **Education**: `bachelors_or_higher_pct`, `high_school_or_higher_pct`
+- **Housing**: `median_home_value`, `median_rent`, `homeownership_rate`
+- **Broadband**: `broadband_pct` - Households with broadband subscription
+
+**Source**: ACS 2024 5-year estimates
+
+**Notes**: 
+- 18 of 1,833 facilities failed tract join (geocoding issues)
+- Data from `_dc_census_tract_acs_2024` base table
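+
+The 11-digit tract `geoid` concatenates the 2-digit state FIPS code, the 3-digit county FIPS code, and the 6-digit tract code, which is how it relates to the `state_fips`, `county_fips`, and `tract` columns. A small sketch of that decomposition follows. The function name `split_tract_geoid` is illustrative, and whether `county_fips` is stored here as 3 or 5 digits is not stated in this document.
+
+```python
+def split_tract_geoid(geoid: str) -> dict:
+    """Split an 11-digit Census tract GEOID into its components."""
+    if len(geoid) != 11 or not geoid.isdigit():
+        raise ValueError(f"expected 11 digits, got {geoid!r}")
+    return {
+        "state_fips": geoid[:2],    # 2-digit state code
+        "county_fips": geoid[2:5],  # 3-digit county code within the state
+        "tract": geoid[5:],         # 6-digit tract code
+    }
+```
+
+Storing `geoid` as TEXT preserves leading zeros, which an integer column would drop.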
+
+### `data_center_watershed_huc8`
+**Rows**: 1,833  
+**Purpose**: Per-facility watershed assignment
+
+**Key Columns**:
+- All columns from `master_data_centers`
+- `huc8` (TEXT) - 8-digit Hydrologic Unit Code
+- `watershed_name` (TEXT) - Watershed name
+- `watershed_area_sq_km` (DOUBLE PRECISION)
+- `states` (TEXT) - States intersecting watershed
+
+**Source**: USGS Watershed Boundary Dataset
+
+**Notes**: 257 unique HUC8 watersheds contain at least one data center
+
+### `data_center_nri_exposure`
+**Rows**: 1,833  
+**Purpose**: Per-facility FEMA National Risk Index hazard exposure scores
+
+**Key Columns**:
+- All columns from `master_data_centers`
+- `nri_id` (TEXT) - Census tract GEOID (matches `geoid` from demographics)
+- `risk_score` (DOUBLE PRECISION) - Overall NRI risk score
+- `social_vulnerability` (DOUBLE PRECISION) - Social vulnerability index
+- **Hazard-specific risk scores** (18 hazards):
+  - `avalanche_risk`, `coastal_flooding_risk`, `cold_wave_risk`
+  - `drought_risk`, `earthquake_risk`, `hail_risk`
+  - `heat_wave_risk`, `hurricane_risk`, `ice_storm_risk`
+  - `landslide_risk`, `lightning_risk`, `riverine_flooding_risk`
+  - `strong_wind_risk`, `tornado_risk`, `tsunami_risk`
+  - `volcanic_activity_risk`, `wildfire_risk`, `winter_weather_risk`
+
+**Source**: FEMA National Risk Index (December 2025 release)
+
+### `data_center_rdh_precinct_vote_matches`
+**Rows**: Varies  
+**Purpose**: Per-facility precinct-level election results
+
+**Key Columns**:
+- Data center identifiers
+- `precinct_name`, `precinct_id`
+- `election_year`, `office`
+- `candidate`, `party`, `votes`
+- `vote_share_pct`
+
+**Source**: Redistricting Data Hub precinct shapefiles
+
+**Notes**: Spatial join to voting precincts (point-in-polygon)
+
+---
+
+## Base Layer Tables
+
+### `_dc_census_tract_acs_2024`
+**Rows**: 85,382  
+**Purpose**: ACS 2024 demographics for all Census tracts in states with data centers
+
+**Key Columns**:
+- `geoid` (TEXT) - 11-digit tract GEOID (PRIMARY KEY)
+- `name` (TEXT) - Tract name
+- `state_fips`, `county_fips`, `tract`
+- **Full ACS 5-year estimates** (85+ columns):
+  - Population by age, sex, race/ethnicity
+  - Households, families, housing units
+  - Income, poverty, education, employment
+  - Housing values, rents, costs
+  - Broadband, computer access
+  - Commuting, vehicles
+
+**Source**: Census ACS 2024 5-year estimates API
+
+**Notes**: Universe limited to the 46 states that contain data centers (states with none are excluded)
+
+### `_dc_census_tract_boundaries_2024`
+**Rows**: 85,058  
+**Purpose**: TIGER 2024 tract polygons for data center states
+
+**Key Columns**:
+- `geoid` (TEXT) - 11-digit tract GEOID
+- `name` (TEXT) - Tract name
+- `state_fips`, `county_fips`, `tract_code`
+- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
+- `area_land_sq_m` (DOUBLE PRECISION) - Land area in square meters
+- `area_water_sq_m` (DOUBLE PRECISION) - Water area in square meters
+
+**Source**: Census TIGER/Line 2024
+
+### `ruca_codes_2020_tract`
+**Rows**: 85,528  
+**Purpose**: USDA Rural-Urban Commuting Area codes for metro/rural classification
+
+**Key Columns**:
+- `geoid` (TEXT) - 11-digit tract GEOID (matches Census tracts)
+- `ruca_code` (TEXT) - Primary RUCA code (1-10)
+- `ruca_category` (TEXT) - Simplified category:
+  - `Metropolitan` (codes 1-3)
+  - `Micropolitan` (codes 4-6)
+  - `Small town` (codes 7-9)
+  - `Rural` (code 10)
+- `ruca_description` (TEXT) - Full RUCA code description
+- `population_2020` (INTEGER)
+
+**Source**: USDA Economic Research Service RUCA 2020
+
+**Notes**: 
+- Based on 2020 Census tracts and 2010-2020 commuting patterns
+- 7 data centers failed RUCA join (Puerto Rico / non-US)
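+
+The `ruca_category` grouping above amounts to a range lookup on the primary code. The sketch below assumes `ruca_code` holds a whole-number string and treats anything outside 1-10 as unclassified. `ruca_category_for` is a hypothetical helper name.
+
+```python
+def ruca_category_for(ruca_code):
+    """Map a primary RUCA code (stored as text) to the simplified category."""
+    try:
+        code = int(ruca_code)
+    except (TypeError, ValueError):
+        return None
+    if 1 <= code <= 3:
+        return "Metropolitan"
+    if 4 <= code <= 6:
+        return "Micropolitan"
+    if 7 <= code <= 9:
+        return "Small town"
+    if code == 10:
+        return "Rural"
+    return None  # outside the documented 1-10 range
+```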
+
+### `watershed_huc8`
+**Rows**: 2,139  
+**Purpose**: USGS HUC8 subbasin polygons for water-stress analysis
+
+**Key Columns**:
+- `huc8` (TEXT) - 8-digit Hydrologic Unit Code (PRIMARY KEY)
+- `name` (TEXT) - Watershed name
+- `geom` (GEOMETRY) - Polygon geometry (EPSG:4326)
+- `area_sq_km` (DOUBLE PRECISION)
+- `states` (TEXT) - Comma-separated state codes
+- `dc_count` (INTEGER) - Number of data centers in watershed
+
+**Source**: USGS Watershed Boundary Dataset
+
+**Notes**: 
+- 257 of 2,139 watersheds contain at least one data center
+- Top 15 watersheds contain 50% of all US data centers
+
+### `nri_census_tracts`
+**Rows**: ~84,000  
+**Purpose**: Full FEMA National Risk Index by Census tract
+
+**Key Columns**:
+- `nri_id` (TEXT) - Census tract GEOID
+- `state_name`, `county_name`, `tract_name`
+- **460+ columns** including:
+  - Overall risk scores and ratings
+  - Expected annual loss (dollars and building value %)
+  - Social vulnerability components (15 factors)
+  - Community resilience score
+  - Individual hazard risk scores (18 hazards)
+  - Exposure, annualized frequency, historic loss ratios per hazard
+
+**Source**: FEMA National Risk Index v2.1 (December 2025)
+
+**Notes**: 
+- Massive table with comprehensive natural hazard risk data
+- Join to the data center tables by matching `nri_id` to the tract `geoid`
+- See [FEMA NRI Technical Documentation](https://hazards.fema.gov/nri/)
+
+---
+
+## Infrastructure Tables
+
+### Energy Infrastructure
+
+#### `energy_eia_operating_generator_capacity_flat`
+**Rows**: 4.7 million  
+**Purpose**: EIA generator inventory with lat/lon/MW (monthly 2008-2026)
+
+**Key Columns**:
+- `plant_id` (INTEGER) - EIA plant ID
+- `generator_id` (TEXT) - Generator unit ID
+- `plant_name` (TEXT)
+- `latitude`, `longitude`, `geom`
+- `state`, `county`
+- `utility_name`, `operator_name`
+- `nameplate_capacity_mw` (DOUBLE PRECISION)
+- `technology` (TEXT) - Generation technology
+- `energy_source_1`, `energy_source_2` - Primary fuel codes
+- `operating_month`, `operating_year` - When unit became operational
+- `status` (TEXT) - Operating, standby, retired, etc.
+- `report_month`, `report_year` - Data snapshot date
+
+**Source**: EIA Form 860 via API
+
+**Notes**:
+- "Flat" means denormalized for fast spatial queries
+- Each generator-month is a row (4.7M rows from monthly snapshots)
+- Use for proximity analysis (e.g., "all generators within 50 km of data center")
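+
+Because each generator appears once per monthly snapshot, summing `nameplate_capacity_mw` over the whole table counts the same unit many times. The sketch below restricts to the latest snapshot before aggregating, using plain dictionaries as stand-ins for table rows. The sample values are made up, and `report_year` / `report_month` are assumed to be numeric.
+
+```python
+def latest_snapshot_capacity(rows):
+    """Sum nameplate capacity using only the most recent report month."""
+    latest = max((r["report_year"], r["report_month"]) for r in rows)
+    return sum(
+        r["nameplate_capacity_mw"]
+        for r in rows
+        if (r["report_year"], r["report_month"]) == latest
+    )
+
+
+rows = [  # two generators, two monthly snapshots each (illustrative values)
+    {"plant_id": 1, "generator_id": "A", "report_year": 2025, "report_month": 11, "nameplate_capacity_mw": 100.0},
+    {"plant_id": 1, "generator_id": "A", "report_year": 2025, "report_month": 12, "nameplate_capacity_mw": 100.0},
+    {"plant_id": 2, "generator_id": "B", "report_year": 2025, "report_month": 11, "nameplate_capacity_mw": 50.0},
+    {"plant_id": 2, "generator_id": "B", "report_year": 2025, "report_month": 12, "nameplate_capacity_mw": 50.0},
+]
+naive_total = sum(r["nameplate_capacity_mw"] for r in rows)  # counts every snapshot
+snapshot_total = latest_snapshot_capacity(rows)              # one snapshot only
+```
+
+Here the naive sum is twice the single-snapshot total, because each generator is present in two monthly snapshots.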
+
+#### `energy_eia_facility_fuel_flat`
+**Rows**: Varies  
+**Purpose**: Monthly generation by plant/fuel
+
+**Key Columns**:
+- `plant_id`, `plant_name`
+- `report_month`, `report_year`
+- `energy_source` (TEXT) - Fuel code
+- `net_generation_mwh` (DOUBLE PRECISION)
+- `fuel_consumed_mmbtu` (DOUBLE PRECISION)
+
+**Source**: EIA Form 923 via API
+
+#### `energy_eia_seds_flat`
+**Rows**: 2.57 million  
+**Purpose**: Annual state energy consumption/production (1960-2024)
+
+**Key Columns**:
+- `state_code` (TEXT)
+- `year` (INTEGER)
+- `msn` (TEXT) - Mnemonic Series Name (e.g., `TETCB` = total energy consumption)
+- `value` (DOUBLE PRECISION) - Series value, expressed in the unit given in `unit`
+- `unit` (TEXT)
+- `description` (TEXT) - Human-readable MSN description
+
+**Source**: EIA State Energy Data System (SEDS)
+
+**Notes**: 
+- Annual aggregates by state
+- Use for state-level energy context analysis
+
+---
+
+### Connectivity Infrastructure
+
+#### `internet_cables`
+**Rows**: 693  
+**Purpose**: Submarine cable routes
+
+**Key Columns**:
+- `cable_id` (TEXT) - Unique cable identifier
+- `cable_name` (TEXT) - Official cable name
+- `geom` (GEOMETRY) - LineString geometry (EPSG:4326)
+- `rfs_year` (INTEGER) - Ready For Service year
+- `length_km` (DOUBLE PRECISION)
+- `owners` (TEXT[]) - Array of owner names
+- `landing_points` (TEXT[]) - Array of landing point names
+
+**Source**: TeleGeography-style cable database
+
+**Notes**: 
+- 693 unique submarine cables
+- Geometry is approximate route (not exact seabed path)
+
+#### `internet_cable_landing_points`
+**Rows**: 3,361  
+**Purpose**: Cable landing points (where cables come ashore)
+
+**Key Columns**:
+- `landing_point_id` (TEXT) - Unique identifier
+- `name` (TEXT) - Landing point name
+- `city`, `country`
+- `latitude`, `longitude`, `geom`
+- `cables` (TEXT[]) - Array of cable names landing at this point
+- `cable_count` (INTEGER)
+
+**Source**: TeleGeography-style cable database
+
+**Notes**: 
+- Used for proximity analysis (how close are data centers to cable landings?)
+- **Key finding**: Data centers are NOT systematically closer to cables than ordinary US cities
+
+#### `internet_city_dominance`
+**Rows**: 4,552  
+**Purpose**: City-level IPs/capacity (internet hub strength proxy)
+
+**Key Columns**:
+- `city` (TEXT)
+- `country` (TEXT)
+- `latitude`, `longitude`, `geom`
+- `ip_addresses` (INTEGER) - Number of routable IP addresses
+- `capacity_rank` (INTEGER) - Relative capacity ranking
+
+**Source**: Internet topology datasets
+
+**Notes**: Proxy for "internet hub" strength (not directly used in main analyses)
+
+---
+
+### Broadband
+
+#### `fcc_bdc_location_provider_aggregates`
+**Rows**: Varies  
+**Purpose**: FCC BDC provider availability aggregated by county/tract
+
+**Key Columns**:
+- `geoid` (TEXT) - County or tract GEOID
+- `geography_level` (TEXT) - `county` or `tract`
+- `provider_count` (INTEGER)
+- `technology_counts` (JSONB) - Count by technology type
+- `max_download_mbps`, `max_upload_mbps`
+
+**Source**: FCC Broadband Data Collection (BDC)
+
+#### `fcc_bdc_broadband_connection_table`
+**Rows**: Varies  
+**Purpose**: Per-data-center broadband provider availability
+
+**Key Columns**:
+- Data center identifiers
+- `provider_id`, `provider_name`
+- `technology` (TEXT)
+- `max_advertised_download_speed`, `max_advertised_upload_speed`
+- `low_latency` (BOOLEAN)
+
+**Source**: FCC BDC, joined to data center locations
+
+**Notes**: Built by `build_fcc_bdc_broadband_connection_table.py`
+
+---
+
+### Other Tables
+
+#### `opposition_cases_geocoded`
+**Rows**: 18  
+**Purpose**: Geocoded community-opposition cases against data center builds
+
+**Key Columns**:
+- `case_id` (TEXT) - Unique identifier
+- `developer` (TEXT) - Proposed developer/operator
+- `investment_billions` (DOUBLE PRECISION) - Investment amount in billions
+- `outcome` (TEXT) - Case outcome (approved, rejected, pending)
+- `governance_response` (TEXT) - Government response
+- `latitude`, `longitude`, `geom`
+
+**Source**: Compiled from news archives
+
+**Notes**: Loaded but currently unused - see research-ideas.md for proposed analyses
+
+#### `census_tract_huc8_link`
+**Rows**: 806  
+**Purpose**: Tract↔HUC8 spatial overlap table
+
+**Key Columns**:
+- `geoid` (TEXT) - Census tract GEOID
+- `huc8` (TEXT) - HUC8 watershed code
+- `overlap_pct` (DOUBLE PRECISION) - Percentage of tract overlapping watershed
+
+**Notes**: Useful for downstream tract-level water-stress joins; limited to tracts containing data centers
+
+#### `im3_state_projected_moderate_50`
+**Rows**: 328  
+**Purpose**: PNNL IM3 projected data center siting (moderate growth, gravity weight 0.50)
+
+**Key Columns**:
+- `facility_id` (TEXT)
+- `state` (TEXT)
+- `cost_millions` (DOUBLE PRECISION)
+- `it_mw` (DOUBLE PRECISION) - IT load in megawatts
+- `cooling_water_demand_gal_per_day` (DOUBLE PRECISION)
+- `latitude`, `longitude`, `geom`
+
+**Source**: PNNL Integrated Multisector Multiscale Modeling (IM3)
+
+**Notes**: Loaded but unused - potential for forward-projection analysis
+
+#### `im3_projected_state_demand_summary`
+**Rows**: 31  
+**Purpose**: State-level rollup of IM3 projected facility counts, IT MW, and cooling demand
+
+**Key Columns**:
+- `state` (TEXT)
+- `facility_count` (INTEGER)
+- `total_it_mw` (DOUBLE PRECISION)
+- `total_cooling_demand_mgd` (DOUBLE PRECISION) - Million gallons per day
+
+**Source**: IM3 model outputs
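+
+The rollup can be pictured as a group-by over facility-level rows, with gallons per day converted to million gallons per day (MGD) by dividing by 1,000,000. The sketch below uses invented sample rows and assumes the summary is derived from `im3_state_projected_moderate_50`, which this document does not state explicitly.
+
+```python
+from collections import defaultdict
+
+
+def rollup_by_state(facilities):
+    """Aggregate facility rows into per-state count, IT MW, and cooling MGD."""
+    summary = defaultdict(
+        lambda: {"facility_count": 0, "total_it_mw": 0.0, "total_cooling_demand_mgd": 0.0}
+    )
+    for f in facilities:
+        s = summary[f["state"]]
+        s["facility_count"] += 1
+        s["total_it_mw"] += f["it_mw"]
+        s["total_cooling_demand_mgd"] += f["cooling_water_demand_gal_per_day"] / 1_000_000
+    return dict(summary)
+
+
+sample = [  # illustrative values only
+    {"state": "VA", "it_mw": 36.0, "cooling_water_demand_gal_per_day": 500_000.0},
+    {"state": "VA", "it_mw": 64.0, "cooling_water_demand_gal_per_day": 1_500_000.0},
+    {"state": "TX", "it_mw": 48.0, "cooling_water_demand_gal_per_day": 250_000.0},
+]
+result = rollup_by_state(sample)
+```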
+
+#### `utility_rate_tracker_2025_2028`
+**Rows**: 374  
+**Purpose**: Utility rate-increase tracker by provider × state × service type
+
+**Key Columns**:
+- `provider` (TEXT) - Utility provider name
+- `state` (TEXT)
+- `service_type` (TEXT)
+- `effective_date` (DATE)
+- `monthly_increase_dollars` (DOUBLE PRECISION)
+- `percent_increase` (DOUBLE PRECISION)
+
+**Source**: Utility rate tracker database
+
+**Notes**: Loaded but unused in demographic/energy analysis
+
+#### `energy_atlas_layers_catalog`
+**Rows**: ~5  
+**Purpose**: Metadata catalog of EIA layers ingested
+
+**Key Columns**:
+- `table_name` (TEXT)
+- `source_url` (TEXT)
+- `import_timestamp` (TIMESTAMP)
+- `row_count` (INTEGER)
+
+**Notes**: Created by `ingest_eia_energy_layers.py`
+
+---
+
+## Legislation Tables
+
+Populated by `ingest_legiscan.py` using the [LegiScan API](https://legiscan.com/legiscan).  
+Covers all 50 states + DC + US Congress, sessions from 2016 through 2026.  
+Data licensed [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) — attribute LegiScan LLC.
+
+### `legiscan_sessions`
+**Rows**: 646  
+**Purpose**: One row per legislative session dataset downloaded from LegiScan
+
+**Key Columns**:
+- `session_id` (INTEGER) - LegiScan session ID (PRIMARY KEY)
+- `state_abbr` (VARCHAR) - Two-letter state code (`CA`, `TX`, `US`, etc.)
+- `state_id` (INTEGER) - LegiScan numeric state ID
+- `year_start`, `year_end` (INTEGER) - Session year range
+- `session_title` (TEXT) - Full session name
+- `session_tag` (TEXT) - Short tag (e.g., "Regular Session", "1st Special Session")
+- `is_special` (BOOLEAN) - True for special/extraordinary sessions
+- `is_prior` (BOOLEAN) - True for completed/sine-die sessions
+- `dataset_hash` (VARCHAR) - MD5 of dataset ZIP; used to detect updates
+- `dataset_date` (DATE) - Date dataset was last published by LegiScan
+- `dataset_size_mb` (FLOAT) - Compressed ZIP size
+- `bill_count` (INTEGER) - Number of bills loaded from this session
+- `imported_at` (TIMESTAMPTZ) - When this session was last imported
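+
+The `dataset_hash` column supports a simple change check: a session's dataset only needs to be downloaded and re-imported when the hash LegiScan currently reports differs from the stored one. A sketch of that comparison follows, with hypothetical function and argument names. It is not the logic of `ingest_legiscan.py` itself.
+
+```python
+def sessions_needing_import(stored_hashes, remote_hashes):
+    """Return session IDs whose remote dataset hash is new or has changed.
+
+    Both arguments map session_id -> dataset_hash.
+    """
+    return sorted(
+        session_id
+        for session_id, remote_hash in remote_hashes.items()
+        if stored_hashes.get(session_id) != remote_hash
+    )
+```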
+
+### `legiscan_bills`
+**Rows**: ~1,313,000  
+**Purpose**: All bills from all sessions; tagged for relevance to data center research topics
+
+**Key Columns**:
+- `bill_id` (INTEGER) - LegiScan bill ID (PRIMARY KEY)
+- `session_id` (INTEGER) - FK → `legiscan_sessions`
+- `state` (VARCHAR) - Two-letter state code
+- `bill_number` (VARCHAR) - Bill number (e.g., `SB 1000`, `HB 233`)
+- `bill_type` (VARCHAR) - `B`=Bill, `R`=Resolution, `CR`=Concurrent Resolution, etc.
+- `title` (TEXT) - Short title
+- `description` (TEXT) - Longer description
+- `status` (INTEGER) - Current status code (see below)
+- `status_date` (DATE) - Date of last status change
+- `completed` (INTEGER) - 1 if bill is in a terminal state
+- `body` (VARCHAR) - Originating chamber (`H`=House, `S`=Senate, `C`=Council, etc.)
+- `url` (TEXT) - LegiScan bill page URL
+- `state_link` (TEXT) - Official state legislature URL
+- `change_hash` (VARCHAR) - MD5 used to detect bill-level updates
+- `subjects` (TEXT[]) - LegiScan subject tags (GIN indexed)
+- `sponsor_count` (INTEGER) - Number of sponsors
+- `vote_count` (INTEGER) - Number of recorded votes
+- `text_count` (INTEGER) - Number of bill text versions
+- `is_relevant` (BOOLEAN) - True if any relevance tag matched (GIN indexed)
+- `relevance_tags` (TEXT[]) - Matched topic tags (GIN indexed)
+- `imported_at` (TIMESTAMPTZ) - When this bill was last upserted
+
+**Status codes**: 1=Introduced, 2=Engrossed, 3=Enrolled, 4=Passed, 5=Vetoed, 6=Failed, 7=Override, 8=Chaptered, 9=Referred, 12=Draft
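+
+For readable query results, the integer codes can be mapped back to labels. The sketch below uses only the codes listed above and returns a fallback for any other value. The `completed` column remains the terminal-state flag, so the sketch does not try to infer it.
+
+```python
+BILL_STATUS = {
+    1: "Introduced", 2: "Engrossed", 3: "Enrolled", 4: "Passed",
+    5: "Vetoed", 6: "Failed", 7: "Override", 8: "Chaptered",
+    9: "Referred", 12: "Draft",
+}
+
+
+def status_label(code: int) -> str:
+    """Return the label for a status code, or a fallback for unlisted codes."""
+    return BILL_STATUS.get(code, f"Unknown ({code})")
+```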
+
+**Relevance tags** (keyword-matched against title + description + subjects):
+
+| Tag | What it captures |
+|-----|-----------------|
+| `data_center` | Data centers, hyperscale, colocation, AI campuses, HPC facilities |
+| `large_load` | Crypto mining, large industrial loads, extraordinary load |
+| `ratepayer_protection` | Cost shifting, cross-subsidy, rate design, affordability, rate class |
+| `grid_impact` | Grid reliability, transmission, interconnection queue, IRP |
+| `tax_incentive` | Tax exemptions, abatements, credits for facilities |
+| `energy_policy` | Renewable PPAs, green tariffs, clean electricity, decarbonization |
+| `water_use` | Cooling water, evaporative cooling, water footprint |
+| `siting_permitting` | Zoning, conditional use permits, local control, preemption |
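+
+A sketch of how keyword-based tagging of this kind can work. The keyword lists below are invented examples, not the lists used by `ingest_legiscan.py`, and matching is a plain case-insensitive substring test.
+
+```python
+KEYWORDS = {  # hypothetical keyword lists, for illustration only
+    "data_center": ["data center", "hyperscale", "colocation"],
+    "water_use": ["cooling water", "evaporative cooling", "water footprint"],
+}
+
+
+def tag_bill(title, description, subjects):
+    """Return (is_relevant, relevance_tags) for one bill."""
+    text = " ".join([title or "", description or "", *subjects]).lower()
+    tags = [tag for tag, words in KEYWORDS.items() if any(w in text for w in words)]
+    return bool(tags), tags
+```
+
+Substring matching is deliberately broad, which is one plausible reason a general tag such as `ratepayer_protection` matches far more bills than `data_center`.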
+
+**Notes**:
+- ~60,000 relevant bills out of 1.3M total (~4.6%)
+- `data_center` tag: ~2,182 bills; `ratepayer_protection`: ~49,000
+- GIN indexes on `subjects`, `relevance_tags`, and full-text (`title || description`)
+- Use `query_legiscan_bills.sql` for pre-built research queries
+- Re-run `python ingest_legiscan.py --fetch --load` weekly to pick up dataset updates
+- Re-run `python ingest_legiscan.py --tag` after editing keyword lists
+
+---
+
+## Commonly Used Joins
+
+### Data Center to Demographics
+```sql
+SELECT 
+    dc.*,
+    ct.median_household_income,
+    ct.bachelors_or_higher_pct,
+    ct.broadband_pct
+FROM master_data_centers dc
+JOIN data_center_census_tracts_2024 ct 
+    ON dc.id = ct.id;
+```
+
+### Data Center to Watershed
+```sql
+SELECT 
+    dc.*,
+    w.huc8,
+    w.name AS watershed_name  -- base table column is `name`
+FROM master_data_centers dc
+JOIN data_center_watershed_huc8 dw ON dc.id = dw.id
+JOIN watershed_huc8 w ON dw.huc8 = w.huc8;
+```
+
+### Data Center to Energy Infrastructure (50 km radius)
+```sql
+SELECT 
+    dc.id,
+    dc.name,
+    SUM(eg.nameplate_capacity_mw) AS total_capacity_50km
+FROM master_data_centers dc
+JOIN energy_eia_operating_generator_capacity_flat eg
+    ON ST_DWithin(
+        dc.geom::geography,
+        eg.geom::geography,
+        50000  -- 50 km in meters
+    )
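+WHERE eg.status = 'OP'  -- Operating only
+  -- Rows are generator-months, so restrict to one snapshot to avoid
+  -- counting each generator once per month (example values; adjust as needed)
+  AND eg.report_year = 2025 AND eg.report_month = 12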
+GROUP BY dc.id, dc.name;
+```
+
+### Data Center to FEMA Hazard Risk
+```sql
+SELECT 
+    dc.*,
+    nri.risk_score,
+    nri.wildfire_risk,
+    nri.drought_risk,
+    nri.heat_wave_risk
+FROM master_data_centers dc
+JOIN data_center_census_tracts_2024 ct ON dc.id = ct.id
+JOIN nri_census_tracts nri ON ct.geoid = nri.nri_id;
+```
+
+---
+
+## Table Naming Conventions
+
+- **`master_*`** - Canonical, deduplicated tables (use these for analysis)
+- **`data_center_*`** - Data center-specific enrichment tables
+- **`_dc_*`** - Base layers scoped to data center states (underscore prefix = private/internal)
+- **`energy_eia_*`** - EIA energy data
+- **`internet_*`** - Connectivity infrastructure
+- **`fcc_bdc_*`** - FCC Broadband Data Collection
+
+---
+
+## Indexes and Performance
+
+All tables with a `geom` column have a spatial (GiST) index on it for fast spatial joins:
+```sql
+CREATE INDEX idx_tablename_geom ON tablename USING GIST(geom);
+```
+
+Key `geoid` columns are indexed for fast demographic joins:
+```sql
+CREATE INDEX idx_tablename_geoid ON tablename(geoid);
+```
+
+---
+
+## Maintenance Notes
+
+### Updating Data Centers
+1. Run `load_postgis_osm_data_centers.py` to refresh OSM data
+2. Run `build_master_data_centers.py` to rebuild master table
+3. Run enrichment scripts to update joins
+
+### Updating Demographics
+1. Update `_dc_census_tract_acs_2024` from Census API
+2. Run `create_data_center_census_tract_table.py --replace-final`
+
+### Updating Energy Data
+```bash
+python3 ingest_eia_energy_layers.py --category power --update
+```
+
+---
+
+## Schema Export
+
+To export the full schema:
+```bash
+pg_dump -h "$PGWEB_HOST" -p "$PGWEB_PORT" -U "$PGWEB_USER" -d data_centers --schema-only > schema.sql
+```
+
+To list all tables:
+```sql
+SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
+FROM pg_tables
+WHERE schemaname = 'public'
+ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
+```
+
+---
+
+## Contact
+
+For database access or questions, contact the repository owner.