Add LegiScan legislation ingestion and analysis queries

Adds ingest_legiscan.py to pull all US state + federal bills (2016-2026)
from the LegiScan API into legiscan_sessions and legiscan_bills tables.
Bills are keyword-tagged across 8 research categories (data_center,
ratepayer_protection, large_load, grid_impact, tax_incentive, etc.).
Loads ~1.3M bills; ~60K tagged relevant. Adds query_legiscan_bills.sql
with pre-built analysis queries including state/DC joins. Updates
database-tables.md, README.md, and research-ideas.md accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:30:31 -07:00
parent 46c8c58545
commit 4525ea3f97
5 changed files with 1046 additions and 1 deletions


@@ -13,11 +13,12 @@
## Table Organization
Tables are organized into five categories:
1. **Core Data Center Tables** - Master inventories and source data
2. **Enrichment Tables** - Data centers joined with contextual data
3. **Base Layer Tables** - Geographic and demographic reference layers
4. **Infrastructure Tables** - Energy and connectivity infrastructure
5. **Legislation Tables** - LegiScan state and federal bill data (2016-2026)
---
@@ -499,6 +500,85 @@ Tables are organized into four categories:
---
---
## Legislation Tables
Populated by `ingest_legiscan.py` using the [LegiScan API](https://legiscan.com/legiscan).
Covers all 50 states + DC + US Congress, sessions from 2016 through 2026.
Data is licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/); attribute LegiScan LLC when republishing.
### `legiscan_sessions`
**Rows**: 646
**Purpose**: One row per legislative session dataset downloaded from LegiScan
**Key Columns**:
- `session_id` (INTEGER) - LegiScan session ID (PRIMARY KEY)
- `state_abbr` (VARCHAR) - Two-letter state code (`CA`, `TX`, `US`, etc.)
- `state_id` (INTEGER) - LegiScan numeric state ID
- `year_start`, `year_end` (INTEGER) - Session year range
- `session_title` (TEXT) - Full session name
- `session_tag` (TEXT) - Short tag (e.g., "Regular Session", "1st Special Session")
- `is_special` (BOOLEAN) - True for special/extraordinary sessions
- `is_prior` (BOOLEAN) - True for completed/sine-die sessions
- `dataset_hash` (VARCHAR) - MD5 of dataset ZIP; used to detect updates
- `dataset_date` (DATE) - Date dataset was last published by LegiScan
- `dataset_size_mb` (FLOAT) - Compressed ZIP size
- `bill_count` (INTEGER) - Number of bills loaded from this session
- `imported_at` (TIMESTAMPTZ) - When this session was last imported
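The `dataset_hash` column implies a simple refresh rule: re-download a session's ZIP only when the hash currently listed for it differs from the stored one. A minimal sketch of that comparison follows. The function name `sessions_to_refresh`, the dict-based inputs, and the sample hashes are hypothetical and are not taken from `ingest_legiscan.py`.

```python
def sessions_to_refresh(stored_hashes, listed_datasets):
    """Return session_ids whose dataset is new or has a changed hash.

    stored_hashes: {session_id: dataset_hash}, as read from legiscan_sessions.
    listed_datasets: iterable of dicts with 'session_id' and 'dataset_hash'
    keys (an assumed shape for the current dataset listing).
    """
    stale = []
    for ds in listed_datasets:
        sid = ds["session_id"]
        # A missing entry (never imported) and a changed hash both mean re-fetch.
        if stored_hashes.get(sid) != ds["dataset_hash"]:
            stale.append(sid)
    return stale


# Illustrative values only.
stored = {101: "aaa111", 102: "bbb222"}
listed = [
    {"session_id": 101, "dataset_hash": "aaa111"},  # unchanged
    {"session_id": 102, "dataset_hash": "ccc333"},  # updated dataset
    {"session_id": 103, "dataset_hash": "ddd444"},  # session not yet imported
]
print(sessions_to_refresh(stored, listed))
```

Only sessions 102 and 103 would be fetched again here; unchanged sessions are skipped, which keeps a weekly re-run cheap.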
### `legiscan_bills`
**Rows**: ~1,313,000
**Purpose**: All bills from all sessions; tagged for relevance to data center research topics
**Key Columns**:
- `bill_id` (INTEGER) - LegiScan bill ID (PRIMARY KEY)
- `session_id` (INTEGER) - FK → `legiscan_sessions`
- `state` (VARCHAR) - Two-letter state code
- `bill_number` (VARCHAR) - Bill number (e.g., `SB 1000`, `HB 233`)
- `bill_type` (VARCHAR) - `B`=Bill, `R`=Resolution, `CR`=Concurrent Resolution, etc.
- `title` (TEXT) - Short title
- `description` (TEXT) - Longer description
- `status` (INTEGER) - Current status code (see below)
- `status_date` (DATE) - Date of last status change
- `completed` (INTEGER) - 1 if bill is in a terminal state
- `body` (VARCHAR) - Originating chamber (`H`=House, `S`=Senate, `C`=Council, etc.)
- `url` (TEXT) - LegiScan bill page URL
- `state_link` (TEXT) - Official state legislature URL
- `change_hash` (VARCHAR) - MD5 used to detect bill-level updates
- `subjects` (TEXT[]) - LegiScan subject tags (GIN indexed)
- `sponsor_count` (INTEGER) - Number of sponsors
- `vote_count` (INTEGER) - Number of recorded votes
- `text_count` (INTEGER) - Number of bill text versions
- `is_relevant` (BOOLEAN) - True if any relevance tag matched (indexed)
- `relevance_tags` (TEXT[]) - Matched topic tags (GIN indexed)
- `imported_at` (TIMESTAMPTZ) - When this bill was last upserted
**Status codes**: 1=Introduced, 2=Engrossed, 3=Enrolled, 4=Passed, 5=Vetoed, 6=Failed, 7=Override, 8=Chaptered, 9=Referred, 12=Draft
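When reporting on query results, it can help to translate the integer `status` into a label. The sketch below copies only the codes listed above; `STATUS_LABELS` and `status_label` are hypothetical helper names, and any code not in the list is returned as unknown.

```python
# Status codes as documented for legiscan_bills.status.
STATUS_LABELS = {
    1: "Introduced",
    2: "Engrossed",
    3: "Enrolled",
    4: "Passed",
    5: "Vetoed",
    6: "Failed",
    7: "Override",
    8: "Chaptered",
    9: "Referred",
    12: "Draft",
}


def status_label(code):
    """Map a status code to its label, flagging codes not documented here."""
    return STATUS_LABELS.get(code, f"Unknown ({code})")


print(status_label(4))
print(status_label(10))
```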
**Relevance tags** (keyword-matched against title + description + subjects):
| Tag | What it captures |
|-----|-----------------|
| `data_center` | Data centers, hyperscale, colocation, AI campuses, HPC facilities |
| `large_load` | Crypto mining, large industrial loads, extraordinary load |
| `ratepayer_protection` | Cost shifting, cross-subsidy, rate design, affordability, rate class |
| `grid_impact` | Grid reliability, transmission, interconnection queue, IRP |
| `tax_incentive` | Tax exemptions, abatements, credits for facilities |
| `energy_policy` | Renewable PPAs, green tariffs, clean electricity, decarbonization |
| `water_use` | Cooling water, evaporative cooling, water footprint |
| `siting_permitting` | Zoning, conditional use permits, local control, preemption |
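As a rough illustration of keyword tagging against title, description, and subjects, the sketch below lowercases the combined text and checks each tag's phrase list. The phrase lists shown are hypothetical examples, not the lists used by `ingest_legiscan.py`, and `tag_bill` is an assumed helper name.

```python
import re

# Hypothetical phrase lists for illustration; the real lists live in the script.
KEYWORDS = {
    "data_center": ["data center", "hyperscale", "colocation"],
    "large_load": ["crypto mining", "large load"],
    "water_use": ["cooling water", "evaporative cooling"],
}


def tag_bill(title, description, subjects):
    """Return the sorted list of tags whose phrases appear in the bill text."""
    parts = [title or "", description or "", *(subjects or [])]
    text = " ".join(parts).lower()
    tags = []
    for tag, phrases in KEYWORDS.items():
        # Leading word boundary only, so plurals such as "data centers" match.
        if any(re.search(r"\b" + re.escape(p), text) for p in phrases):
            tags.append(tag)
    return sorted(tags)


tags = tag_bill(
    "Data Center Sales Tax Exemption",
    "Exempts hyperscale facilities that use evaporative cooling.",
    ["Taxation"],
)
is_relevant = bool(tags)  # mirrors the is_relevant column: any tag matched
print(tags, is_relevant)
```

Because a bill can match several phrase lists, the result is a list, which is consistent with `relevance_tags` being stored as `TEXT[]`.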
**Notes**:
- ~60,000 bills are tagged relevant out of ~1.3M total (~4.6%)
- Tag counts: `data_center` ~2,182 bills; `ratepayer_protection` ~49,000 bills
- GIN indexes on `subjects`, `relevance_tags`, and full-text (`title || description`)
- Use `query_legiscan_bills.sql` for pre-built research queries
- Re-run `python ingest_legiscan.py --fetch --load` weekly to pick up dataset updates
- Re-run `python ingest_legiscan.py --tag` after editing keyword lists
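
For ad hoc filtering by tag outside of `query_legiscan_bills.sql`, one option is the PostgreSQL array overlap operator (`&&`), which can use a GIN index on an array column. The sketch below only builds the SQL text and parameter list. It assumes a psycopg-style driver with `%s` placeholders, and `build_tag_query` is a hypothetical helper, not part of the repository.

```python
def build_tag_query(tags, state=None):
    """Return (sql, params) selecting relevant bills carrying any of `tags`."""
    sql = (
        "SELECT bill_id, state, bill_number, title, status "
        "FROM legiscan_bills "
        "WHERE is_relevant AND relevance_tags && %s::text[]"
    )
    params = [list(tags)]
    if state is not None:
        # Optional filter on the two-letter state code column.
        sql += " AND state = %s"
        params.append(state)
    sql += " ORDER BY status_date DESC"
    return sql, params


sql, params = build_tag_query(["data_center", "large_load"], state="VA")
print(sql)
print(params)
```

The pair would then be passed to a cursor's `execute(sql, params)` call, so tag values are bound as parameters and never interpolated into the SQL string.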
---
## Commonly Used Joins
### Data Center to Demographics