data-centers/rdh_precinct_vote_data_centers.ipynb
2026-05-22 06:33:45 -07:00


{
"cells": [
{
"cell_type": "markdown",
"id": "0",
"metadata": {},
"source": [
"# RDH Precinct Vote Data for Master Data Centers\n",
"\n",
"This notebook discovers, downloads, stages, and spatially joins Redistricting Data Hub precinct-level election data to `public.master_data_centers`.\n",
"\n",
"It is designed to be rerunnable as new data centers are added:\n",
"- Data-center locations come from `public.master_data_centers`.\n",
"- RDH credentials are read from `RDH_USERNAME` / `RDH_PASSWORD` or prompted securely with `getpass`.\n",
"- RDH files are downloaded into `data/rdh_precinct_vote/`.\n",
"- Original precinct attributes are preserved in `properties jsonb` because RDH vote-column names vary by state/year/election.\n",
"- Matches are written to `public.data_center_rdh_precinct_vote_matches`, joinable by `master_id`.\n",
"\n",
"Primary join method is point-in-precinct using longitude/latitude. Census tract context is included as a fallback/diagnostic path when the existing census tract table is available.\n",
"\n",
"Local RDH API reference used for this notebook:\n",
"- `/home/dadams/Repos/api-redistricting_datahub/RDH_API.ipynb`\n",
"- `/home/dadams/Repos/api-redistricting_datahub/RDH_API_SET_PARAMS.ipynb`\n",
"- `/home/dadams/Repos/api-redistricting_datahub/Download_API.txt`\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1",
"metadata": {},
"outputs": [],
"source": [
"import hashlib\n",
"import io\n",
"import json\n",
"import os\n",
"import re\n",
"import shutil\n",
"import subprocess\n",
"import time\n",
"import zipfile\n",
"from getpass import getpass\n",
"from pathlib import Path\n",
"from urllib.parse import parse_qs, urlparse, unquote\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import psycopg2\n",
"import requests\n",
"from psycopg2 import sql\n",
"from psycopg2.extras import Json, execute_values\n",
"\n",
"try:\n",
" import geopandas as gpd\n",
" HAS_GEOPANDAS = True\n",
"except ImportError:\n",
" gpd = None\n",
" HAS_GEOPANDAS = False\n",
"\n",
"pd.set_option('display.max_columns', 120)\n",
"pd.set_option('display.max_rows', 120)\n",
"\n",
"print('pandas:', pd.__version__)\n",
"print('requests:', requests.__version__)\n",
"print('geopandas:', 'ok' if HAS_GEOPANDAS else 'not installed; spatial file loading cells will be skipped')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2",
"metadata": {},
"outputs": [],
"source": [
"def load_env_file(env_path: str = '.env') -> None:\n",
" p = Path(env_path)\n",
" if not p.exists():\n",
" print(f'No {env_path} file found in {Path.cwd()}')\n",
" return\n",
"\n",
" loaded = 0\n",
" for raw_line in p.read_text(encoding='utf-8').splitlines():\n",
" line = raw_line.strip()\n",
" if not line or line.startswith('#') or '=' not in line:\n",
" continue\n",
" key, value = line.split('=', 1)\n",
" key = key.strip()\n",
" value = value.strip().strip('\"').strip(\"'\")\n",
" if key and key not in os.environ:\n",
" os.environ[key] = value\n",
" loaded += 1\n",
" print(f'Loaded {loaded} env var(s) from {env_path}')\n",
"\n",
"\n",
"def require_env(keys):\n",
" missing = [k for k in keys if not os.getenv(k)]\n",
" if missing:\n",
" raise EnvironmentError(\n",
" 'Missing required env vars in notebook kernel: ' + ', '.join(missing) +\n",
" '.\\nSet them in this notebook, or add them to a .env file in this folder.'\n",
" )\n",
"\n",
"\n",
"load_env_file('.env')\n",
"require_env(['PGWEB_HOST', 'PGWEB_PORT', 'PGWEB_USER', 'PGWEB_PASSWORD'])\n",
"\n",
"DB_NAME = 'data_centers'\n",
"MASTER_TABLE = 'public.master_data_centers'\n",
"CENSUS_TRACT_TABLE = 'public.data_center_census_tracts_2024'\n",
"\n",
"LAYER_TABLE = 'public.rdh_precinct_vote_layers'\n",
"FEATURE_TABLE = 'public.rdh_precinct_vote_features'\n",
"MATCH_TABLE = 'public.data_center_rdh_precinct_vote_matches'\n",
"\n",
"\n",
"def get_conn():\n",
" return psycopg2.connect(\n",
" host=os.environ['PGWEB_HOST'],\n",
" port=os.environ['PGWEB_PORT'],\n",
" user=os.environ['PGWEB_USER'],\n",
" password=os.environ['PGWEB_PASSWORD'],\n",
" dbname=DB_NAME,\n",
" )\n",
"\n",
"\n",
"with get_conn() as conn:\n",
" with conn.cursor() as cur:\n",
" cur.execute('select current_database(), current_user')\n",
" db, usr = cur.fetchone()\n",
" print('Connected to DB:', db)\n",
" print('As user:', usr)\n",
" cur.execute('create extension if not exists postgis')\n",
" cur.execute('select to_regclass(%s)', (MASTER_TABLE,))\n",
" if cur.fetchone()[0] is None:\n",
" raise RuntimeError(f'{MASTER_TABLE} does not exist. Run build_master_data_centers.py first.')\n",
" cur.execute(sql.SQL('select count(*) from {}').format(sql.SQL(MASTER_TABLE)))\n",
" print(f'{MASTER_TABLE} rows:', f'{cur.fetchone()[0]:,}')"
]
},
{
"cell_type": "markdown",
"id": "3",
"metadata": {},
"source": [
"## Parameters\n",
"\n",
"The defaults run a small real pilot for Virginia 2020, because Virginia has many data centers in the master table and a statewide precinct layer should produce visible matches. After the pilot works, broaden `TARGET_STATES` and `FILTER_YEARS_ANY`. Use `TARGET_STATES = None` to infer all states from `public.master_data_centers`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4",
"metadata": {},
"outputs": [],
"source": [
"RDH_LIST_URL = 'https://redistrictingdatahub.org/wp-json/download/list'\n",
"\n",
"DATA_DIR = Path('data/rdh_precinct_vote')\n",
"RAW_DIR = DATA_DIR / 'raw'\n",
"EXTRACT_DIR = DATA_DIR / 'extracted'\n",
"MANIFEST_PATH = DATA_DIR / 'rdh_precinct_vote_download_manifest.csv'\n",
"LISTING_CACHE_PATH = DATA_DIR / 'rdh_precinct_vote_listing_cache.csv'\n",
"\n",
"TARGET_STATES = None # None = infer all states from master_data_centers; or list e.g. ['VA','TX']\n",
"FILTER_TERMS_ALL = ['election results', 'precinct']\n",
"FILTER_TERMS_ANY = [] # e.g. ['general', 'president']\n",
"FILTER_YEARS_ANY = ['2020'] # pilot first; empty keeps all years returned by RDH\n",
"PREFERRED_FORMATS = ['SHP'] # point-in-precinct joins need spatial files\n",
"\n",
"DOWNLOAD_FILES = True\n",
"OVERWRITE_DOWNLOADS = False\n",
"LOAD_TO_POSTGIS = True\n",
"RUN_SPATIAL_MATCH = True\n",
"RUN_NEAREST_PRECINCT_FALLBACK = True\n",
"NEAREST_PRECINCT_MAX_DISTANCE_M = 500\n",
"\n",
"REQUEST_SLEEP_SECONDS = 1.0\n",
"\n",
"for p in [DATA_DIR, RAW_DIR, EXTRACT_DIR]:\n",
" p.mkdir(parents=True, exist_ok=True)\n",
"\n",
"print('Data directory:', DATA_DIR.resolve())\n",
"print('Download files:', DOWNLOAD_FILES)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5",
"metadata": {},
"outputs": [],
"source": [
"STATE_NAME_TO_CODE = {\n",
" 'alabama': 'AL', 'alaska': 'AK', 'arizona': 'AZ', 'arkansas': 'AR',\n",
" 'california': 'CA', 'colorado': 'CO', 'connecticut': 'CT', 'delaware': 'DE',\n",
" 'district of columbia': 'DC', 'florida': 'FL', 'georgia': 'GA', 'hawaii': 'HI',\n",
" 'idaho': 'ID', 'illinois': 'IL', 'indiana': 'IN', 'iowa': 'IA',\n",
" 'kansas': 'KS', 'kentucky': 'KY', 'louisiana': 'LA', 'maine': 'ME',\n",
" 'maryland': 'MD', 'massachusetts': 'MA', 'michigan': 'MI', 'minnesota': 'MN',\n",
" 'mississippi': 'MS', 'missouri': 'MO', 'montana': 'MT', 'nebraska': 'NE',\n",
" 'nevada': 'NV', 'new hampshire': 'NH', 'new jersey': 'NJ', 'new mexico': 'NM',\n",
" 'new york': 'NY', 'north carolina': 'NC', 'north dakota': 'ND', 'ohio': 'OH',\n",
" 'oklahoma': 'OK', 'oregon': 'OR', 'pennsylvania': 'PA', 'rhode island': 'RI',\n",
" 'south carolina': 'SC', 'south dakota': 'SD', 'tennessee': 'TN', 'texas': 'TX',\n",
" 'utah': 'UT', 'vermont': 'VT', 'virginia': 'VA', 'washington': 'WA',\n",
" 'west virginia': 'WV', 'wisconsin': 'WI', 'wyoming': 'WY',\n",
"}\n",
"STATE_CODE_TO_NAME = {v: k for k, v in STATE_NAME_TO_CODE.items()}\n",
"\n",
"\n",
"def normalize_state_code(value):\n",
" if pd.isna(value) or str(value).strip() == '':\n",
" return None\n",
" raw = str(value).strip()\n",
" upper = raw.upper()\n",
" if upper in STATE_CODE_TO_NAME:\n",
" return upper\n",
" return STATE_NAME_TO_CODE.get(raw.lower())\n",
"\n",
"\n",
"def infer_target_states():\n",
" if TARGET_STATES:\n",
" return sorted({normalize_state_code(s) for s in TARGET_STATES if normalize_state_code(s)})\n",
" query = f'''\n",
" select state, count(*) as rows\n",
" from {MASTER_TABLE}\n",
" where geom is not null\n",
" group by state\n",
" order by state\n",
" '''\n",
" with get_conn() as conn:\n",
" states = pd.read_sql_query(query, conn)\n",
" states['state_code'] = states['state'].map(normalize_state_code)\n",
" display(states)\n",
" missing = states.loc[states['state_code'].isna() & states['state'].notna()]\n",
" if not missing.empty:\n",
" print('Warning: could not normalize these state values:')\n",
" display(missing)\n",
" return sorted(states['state_code'].dropna().unique())\n",
"\n",
"\n",
"target_states = infer_target_states()\n",
"print('Target state codes:', ', '.join(target_states))\n"
]
},
{
"cell_type": "markdown",
"id": "6",
"metadata": {},
"source": [
"## RDH Credentials"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7",
"metadata": {},
"outputs": [],
"source": [
"RDH_USERNAME = os.getenv('RDH_USERNAME') or os.getenv('RDH_EMAIL')\n",
"RDH_PASSWORD = os.getenv('RDH_PASSWORD')\n",
"\n",
"if not RDH_USERNAME:\n",
" RDH_USERNAME = input('RDH username or email: ').strip()\n",
"if not RDH_PASSWORD:\n",
" RDH_PASSWORD = getpass('RDH password: ')\n",
"\n",
"if not RDH_USERNAME or not RDH_PASSWORD:\n",
" raise RuntimeError('RDH credentials are required. Use prompts or set RDH_USERNAME/RDH_PASSWORD.')\n",
"\n",
"print('RDH username loaded:', RDH_USERNAME)\n",
"print('RDH password loaded:', bool(RDH_PASSWORD))\n"
]
},
{
"cell_type": "markdown",
"id": "8",
"metadata": {},
"source": [
"## Discover RDH Datasets"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9",
"metadata": {},
"outputs": [],
"source": [
"def get_rdh_list_for_state(state_code):\n",
" params = {\n",
" 'username': RDH_USERNAME,\n",
" 'password': RDH_PASSWORD,\n",
" 'format': 'csv',\n",
" 'states': STATE_CODE_TO_NAME.get(state_code, state_code).lower(),\n",
" }\n",
" response = requests.get(RDH_LIST_URL, params=params, timeout=120)\n",
" response.raise_for_status()\n",
" text = response.content.decode('utf-8', errors='replace')\n",
" df = pd.read_csv(io.StringIO(text))\n",
" df['query_state_code'] = state_code\n",
" df['query_state_name'] = STATE_CODE_TO_NAME.get(state_code, state_code)\n",
" time.sleep(REQUEST_SLEEP_SECONDS)\n",
" return df\n",
"\n",
"\n",
"def load_or_fetch_rdh_listing(refresh=False):\n",
" cached = None\n",
" if LISTING_CACHE_PATH.exists() and not refresh:\n",
" cached = pd.read_csv(LISTING_CACHE_PATH)\n",
" cached_states = set(cached.get('query_state_code', pd.Series(dtype=str)).dropna().unique())\n",
" missing = [s for s in target_states if s not in cached_states]\n",
" print(f'Cache has {len(cached):,} rows across {len(cached_states)} state(s); missing: {missing or \"none\"}')\n",
" else:\n",
" cached_states = set()\n",
" missing = list(target_states)\n",
"\n",
" if missing:\n",
" frames = [] if cached is None else [cached]\n",
" for state_code in missing:\n",
" print('Retrieving RDH listing for', state_code)\n",
" frames.append(get_rdh_list_for_state(state_code))\n",
" cached = pd.concat(frames, ignore_index=True)\n",
" LISTING_CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)\n",
" cached.to_csv(LISTING_CACHE_PATH, index=False)\n",
" print('Updated listing cache:', LISTING_CACHE_PATH)\n",
"\n",
" if cached is not None and 'query_state_code' in cached.columns:\n",
" cached = cached[cached['query_state_code'].isin(target_states)].copy()\n",
" return cached if cached is not None else pd.DataFrame()\n",
"\n",
"\n",
"listing = load_or_fetch_rdh_listing(refresh=False)\n",
"print(f'RDH listing rows for target states: {len(listing):,}')\n",
"display(listing.head())\n",
"display(pd.DataFrame({'column': listing.columns}))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10",
"metadata": {},
"outputs": [],
"source": [
"def text_contains_all(text, terms):\n",
" text = str(text).lower()\n",
" return all(term.lower() in text for term in terms if term)\n",
"\n",
"\n",
"def text_contains_any(text, terms):\n",
" terms = [term for term in terms if term]\n",
" if not terms:\n",
" return True\n",
" text = str(text).lower()\n",
" return any(term.lower() in text for term in terms)\n",
"\n",
"\n",
"def row_year_match(text):\n",
" if not FILTER_YEARS_ANY:\n",
" return True\n",
" text = str(text)\n",
" return any(str(year) in text for year in FILTER_YEARS_ANY)\n",
"\n",
"\n",
"def parse_datasetid(url):\n",
" if pd.isna(url):\n",
" return None\n",
" parsed = urlparse(str(url))\n",
" qs = parse_qs(parsed.query)\n",
" values = qs.get('datasetid')\n",
" return values[0] if values else None\n",
"\n",
"\n",
"def clean_filename_from_url(url, fmt):\n",
" base = unquote(str(url).split('?')[0]).rstrip('/')\n",
" name = base.split('/')[-1]\n",
" if not name:\n",
" name = 'rdh_download.zip'\n",
" suffix = Path(name).suffix\n",
" if not suffix:\n",
" suffix = '.zip' if str(fmt).upper() == 'SHP' else ''\n",
" name = name + suffix\n",
" return re.sub(r'[^A-Za-z0-9._-]+', '_', name)\n",
"\n",
"\n",
"work = listing.copy()\n",
"for required in ['Title', 'Format', 'URL']:\n",
" if required not in work.columns:\n",
" raise RuntimeError(f'RDH listing is missing expected column: {required}')\n",
"\n",
"work['title_format'] = work['Title'].fillna('').astype(str) + ' ' + work['Format'].fillna('').astype(str)\n",
"work['datasetid'] = work['URL'].map(parse_datasetid)\n",
"work['format_norm'] = work['Format'].fillna('').astype(str).str.upper()\n",
"work['filename'] = work.apply(lambda r: clean_filename_from_url(r['URL'], r['Format']), axis=1)\n",
"\n",
"filtered = work[\n",
" work['title_format'].map(lambda x: text_contains_all(x, FILTER_TERMS_ALL))\n",
" & work['title_format'].map(lambda x: text_contains_any(x, FILTER_TERMS_ANY))\n",
" & work['title_format'].map(row_year_match)\n",
" & work['format_norm'].isin(PREFERRED_FORMATS)\n",
"].copy()\n",
"\n",
"filtered = filtered.sort_values(['query_state_code', 'Title', 'Format', 'filename']).reset_index(drop=True)\n",
"print(f'Filtered candidate files: {len(filtered):,}')\n",
"display(filtered[['query_state_code', 'Title', 'Format', 'datasetid', 'filename', 'URL']].head(100))\n"
]
},
{
"cell_type": "markdown",
"id": "11",
"metadata": {},
"source": [
"## Download Candidate Files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "12",
"metadata": {},
"outputs": [],
"source": [
"def layer_id_for_row(row):\n",
" key = '|'.join([\n",
" str(row.get('query_state_code', '')),\n",
" str(row.get('Title', '')),\n",
" str(row.get('Format', '')),\n",
" str(row.get('datasetid', '')),\n",
" str(row.get('URL', '')),\n",
" ])\n",
" return hashlib.sha1(key.encode('utf-8')).hexdigest()[:16]\n",
"\n",
"\n",
"def download_rdh_file(row):\n",
" base_url = str(row['URL']).split('?')[0]\n",
" params = {\n",
" 'username': RDH_USERNAME,\n",
" 'password': RDH_PASSWORD,\n",
" }\n",
" if row.get('datasetid'):\n",
" params['datasetid'] = row['datasetid']\n",
"\n",
" state_dir = RAW_DIR / row['query_state_code']\n",
" state_dir.mkdir(parents=True, exist_ok=True)\n",
" out_path = state_dir / row['filename']\n",
" if out_path.exists() and not OVERWRITE_DOWNLOADS:\n",
" return out_path, 'exists'\n",
"\n",
" response = requests.get(base_url, params=params, timeout=300)\n",
" response.raise_for_status()\n",
" out_path.write_bytes(response.content)\n",
" time.sleep(REQUEST_SLEEP_SECONDS)\n",
" return out_path, 'downloaded'\n",
"\n",
"\n",
"download_rows = []\n",
"for row in filtered.to_dict('records'):\n",
" layer_id = layer_id_for_row(row)\n",
" out_path = RAW_DIR / row['query_state_code'] / row['filename']\n",
" status = 'planned'\n",
" if DOWNLOAD_FILES:\n",
" print('Retrieving', row['query_state_code'], row['Title'], row['filename'])\n",
" out_path, status = download_rdh_file(row)\n",
" elif out_path.exists():\n",
" status = 'exists'\n",
" download_rows.append({\n",
" 'layer_id': layer_id,\n",
" 'state_code': row['query_state_code'],\n",
" 'title': row['Title'],\n",
" 'format': row['Format'],\n",
" 'datasetid': row.get('datasetid'),\n",
" 'source_url': row['URL'],\n",
" 'filename': row['filename'],\n",
" 'local_path': str(out_path),\n",
" 'download_status': status,\n",
" })\n",
"\n",
"manifest = pd.DataFrame(download_rows)\n",
"manifest.to_csv(MANIFEST_PATH, index=False)\n",
"print('Wrote manifest:', MANIFEST_PATH)\n",
"display(manifest.head(100))\n",
"display(manifest['download_status'].value_counts(dropna=False).rename_axis('status').reset_index(name='rows'))\n",
"\n",
"if not DOWNLOAD_FILES and not manifest['local_path'].map(lambda p: Path(str(p)).exists()).any():\n",
" print('No RDH shapefiles have been downloaded yet. Review the candidates above, then set DOWNLOAD_FILES=True and rerun this cell and the cells below.')\n"
]
},
{
"cell_type": "markdown",
"id": "13",
"metadata": {},
"source": [
"## Unpack and Find Spatial Files"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14",
"metadata": {},
"outputs": [],
"source": [
"def extract_archive(path, layer_id):\n",
" path = Path(path)\n",
" extract_to = EXTRACT_DIR / layer_id\n",
" extract_to.mkdir(parents=True, exist_ok=True)\n",
" if path.suffix.lower() == '.zip':\n",
" with zipfile.ZipFile(path) as zf:\n",
" zf.extractall(extract_to)\n",
" else:\n",
" shutil.copy2(path, extract_to / path.name)\n",
" return extract_to\n",
"\n",
"\n",
"def find_spatial_file(folder):\n",
" folder = Path(folder)\n",
" candidates = (\n",
" list(folder.rglob('*.shp')) +\n",
" list(folder.rglob('*.geojson')) +\n",
" list(folder.rglob('*.gpkg'))\n",
" )\n",
" if not candidates:\n",
" return None\n",
" candidates = sorted(candidates, key=lambda p: (p.suffix.lower() != '.shp', len(str(p))))\n",
" return candidates[0]\n",
"\n",
"\n",
"if MANIFEST_PATH.exists():\n",
" manifest = pd.read_csv(MANIFEST_PATH)\n",
"\n",
"spatial_rows = []\n",
"for row in manifest.to_dict('records'):\n",
" local_path = Path(str(row['local_path']))\n",
" if not local_path.exists():\n",
" spatial_rows.append({**row, 'extract_dir': None, 'spatial_path': None, 'spatial_status': 'missing_download'})\n",
" continue\n",
" extract_dir = extract_archive(local_path, row['layer_id'])\n",
" spatial_path = find_spatial_file(extract_dir)\n",
" spatial_rows.append({\n",
" **row,\n",
" 'extract_dir': str(extract_dir),\n",
" 'spatial_path': str(spatial_path) if spatial_path else None,\n",
" 'spatial_status': 'found' if spatial_path else 'no_spatial_file',\n",
" })\n",
"\n",
"spatial_manifest = pd.DataFrame(spatial_rows)\n",
"spatial_manifest.to_csv(DATA_DIR / 'rdh_precinct_vote_spatial_manifest.csv', index=False)\n",
"display(spatial_manifest[['state_code', 'title', 'filename', 'spatial_status', 'spatial_path']].head(100))\n",
"display(spatial_manifest['spatial_status'].value_counts(dropna=False).rename_axis('status').reset_index(name='rows'))\n",
"\n",
"if not spatial_manifest['spatial_status'].eq('found').any():\n",
" print('No spatial files were found. If spatial_status is missing_download, set DOWNLOAD_FILES=True and rerun the download/unpack cells.')\n"
]
},
{
"cell_type": "markdown",
"id": "15",
"metadata": {},
"source": [
"## Create PostGIS Staging Tables"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "16",
"metadata": {},
"outputs": [],
"source": [
"CREATE_TABLES_SQL = f'''\n",
"create table if not exists {LAYER_TABLE} (\n",
" layer_id text primary key,\n",
" state_code text,\n",
" title text,\n",
" format text,\n",
" datasetid text,\n",
" source_url text,\n",
" filename text,\n",
" local_path text,\n",
" spatial_path text,\n",
" metadata jsonb,\n",
" loaded_at timestamptz not null default now()\n",
");\n",
"\n",
"create table if not exists {FEATURE_TABLE} (\n",
" feature_id text primary key,\n",
" layer_id text not null references public.rdh_precinct_vote_layers(layer_id) on delete cascade,\n",
" state_code text,\n",
" source_row integer,\n",
" properties jsonb,\n",
" geom geometry(MultiPolygon, 4326)\n",
");\n",
"\n",
"create index if not exists rdh_precinct_vote_features_geom_gix\n",
" on {FEATURE_TABLE} using gist (geom);\n",
"create index if not exists rdh_precinct_vote_features_layer_idx\n",
" on {FEATURE_TABLE} (layer_id);\n",
"create index if not exists rdh_precinct_vote_features_state_idx\n",
" on {FEATURE_TABLE} (state_code);\n",
"\n",
"create table if not exists {MATCH_TABLE} (\n",
" master_id text not null references public.master_data_centers(master_id) on delete cascade,\n",
" feature_id text not null references public.rdh_precinct_vote_features(feature_id) on delete cascade,\n",
" layer_id text not null references public.rdh_precinct_vote_layers(layer_id) on delete cascade,\n",
" state_code text,\n",
" join_method text not null,\n",
" match_distance_m double precision,\n",
" matched_at timestamptz not null default now(),\n",
" primary key (master_id, feature_id)\n",
");\n",
"create index if not exists data_center_rdh_precinct_vote_matches_master_idx\n",
" on {MATCH_TABLE} (master_id);\n",
"create index if not exists data_center_rdh_precinct_vote_matches_layer_idx\n",
" on {MATCH_TABLE} (layer_id);\n",
"'''\n",
"\n",
"with get_conn() as conn:\n",
" with conn.cursor() as cur:\n",
" cur.execute(CREATE_TABLES_SQL)\n",
" print('PostGIS staging tables are ready.')\n"
]
},
{
"cell_type": "markdown",
"id": "17",
"metadata": {},
"source": [
"## Load Precinct Geometries to PostGIS"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18",
"metadata": {},
"outputs": [],
"source": [
"def json_safe(value):\n",
" if isinstance(value, (pd.Timestamp, np.datetime64)):\n",
" return str(value)\n",
" if isinstance(value, np.generic):\n",
" return value.item()\n",
" if pd.isna(value):\n",
" return None\n",
" return value\n",
"\n",
"\n",
"def stable_feature_id(layer_id, source_row, geom):\n",
" geom_hash = hashlib.sha1(geom.wkb).hexdigest()[:16] if geom is not None else 'nogeom'\n",
" return hashlib.sha1(f'{layer_id}|{source_row}|{geom_hash}'.encode('utf-8')).hexdigest()[:24]\n",
"\n",
"\n",
"def load_layer_to_postgis_with_ogr(row):\n",
" spatial_path = row.get('spatial_path')\n",
" if not spatial_path or not Path(spatial_path).exists():\n",
" return {'layer_id': row['layer_id'], 'status': 'missing_spatial_file', 'features': 0}\n",
"\n",
" ogr2ogr_path = shutil.which('ogr2ogr')\n",
" if not ogr2ogr_path:\n",
" return {'layer_id': row['layer_id'], 'status': 'missing_geopandas_and_ogr2ogr', 'features': 0}\n",
"\n",
" staging_table = '_rdh_import_' + re.sub(r'[^a-zA-Z0-9_]', '_', row['layer_id']).lower()\n",
" dsn = 'PG:host={host} port={port} user={user} dbname={dbname}'.format(\n",
" host=os.environ['PGWEB_HOST'],\n",
" port=os.environ['PGWEB_PORT'],\n",
" user=os.environ['PGWEB_USER'],\n",
" dbname=DB_NAME,\n",
" )\n",
" env = os.environ.copy()\n",
" env['PGPASSWORD'] = os.environ['PGWEB_PASSWORD']\n",
" cmd = [\n",
" ogr2ogr_path,\n",
" '-f', 'PostgreSQL',\n",
" dsn,\n",
" str(spatial_path),\n",
" '-nln', f'public.{staging_table}',\n",
" '-overwrite',\n",
" '-t_srs', 'EPSG:4326',\n",
" '-nlt', 'PROMOTE_TO_MULTI',\n",
" '-dim', 'XY',\n",
" '-lco', 'GEOMETRY_NAME=geom',\n",
" ]\n",
" result = subprocess.run(cmd, env=env, capture_output=True, text=True)\n",
" if result.returncode != 0:\n",
" return {\n",
" 'layer_id': row['layer_id'],\n",
" 'status': 'ogr2ogr_failed',\n",
" 'features': 0,\n",
" 'error': (result.stderr or result.stdout)[-1000:],\n",
" }\n",
"\n",
" with get_conn() as conn:\n",
" with conn.cursor() as cur:\n",
" cur.execute('''\n",
" select column_name\n",
" from information_schema.columns\n",
" where table_schema = 'public' and table_name = %s\n",
" order by ordinal_position\n",
" ''', (staging_table,))\n",
" columns = [r[0] for r in cur.fetchall()]\n",
" geom_col = 'geom' if 'geom' in columns else 'wkb_geometry'\n",
" order_col = 'ogc_fid' if 'ogc_fid' in columns else 'ctid'\n",
"\n",
" metadata = {\n",
" 'columns': columns,\n",
" 'source_spatial_path': str(spatial_path),\n",
" 'loader': 'ogr2ogr',\n",
" }\n",
" cur.execute(sql.SQL('''\n",
" insert into {layers} (\n",
" layer_id, state_code, title, format, datasetid, source_url,\n",
" filename, local_path, spatial_path, metadata, loaded_at\n",
" ) values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, now())\n",
" on conflict (layer_id) do update set\n",
" state_code = excluded.state_code,\n",
" title = excluded.title,\n",
" format = excluded.format,\n",
" datasetid = excluded.datasetid,\n",
" source_url = excluded.source_url,\n",
" filename = excluded.filename,\n",
" local_path = excluded.local_path,\n",
" spatial_path = excluded.spatial_path,\n",
" metadata = excluded.metadata,\n",
" loaded_at = now()\n",
" ''').format(layers=sql.SQL(LAYER_TABLE)), (\n",
" row['layer_id'], row['state_code'], row['title'], row['format'],\n",
" row.get('datasetid'), row['source_url'], row['filename'],\n",
" row['local_path'], spatial_path, Json(metadata)\n",
" ))\n",
" cur.execute(sql.SQL('delete from {} where layer_id = %s').format(sql.SQL(FEATURE_TABLE)), (row['layer_id'],))\n",
" insert_sql = sql.SQL('''\n",
" insert into {features} (feature_id, layer_id, state_code, source_row, properties, geom)\n",
" with src as (\n",
" select\n",
" row_number() over (order by {order_col})::integer as source_row,\n",
" to_jsonb(t) - %s - 'ogc_fid' - 'fid' as properties,\n",
" ST_Force2D({geom_col}) as geom\n",
" from {staging} t\n",
" where {geom_col} is not null\n",
" ), fixed as (\n",
" select\n",
" source_row,\n",
" properties,\n",
" (ST_Dump(ST_Multi(ST_CollectionExtract(ST_MakeValid(geom), 3)))).geom as geom\n",
" from src\n",
" )\n",
" select\n",
" %s || '/' || source_row::text,\n",
" %s,\n",
" %s,\n",
" source_row,\n",
" properties,\n",
" ST_Multi(ST_Force2D(ST_SetSRID(geom, 4326)))::geometry(MultiPolygon, 4326)\n",
" from fixed\n",
" where geom is not null and not ST_IsEmpty(geom)\n",
" ''').format(\n",
" features=sql.SQL(FEATURE_TABLE),\n",
" staging=sql.Identifier('public', staging_table),\n",
" geom_col=sql.Identifier(geom_col),\n",
" order_col=sql.SQL(order_col) if order_col == 'ctid' else sql.Identifier(order_col),\n",
" )\n",
" cur.execute(insert_sql, (geom_col, row['layer_id'], row['layer_id'], row['state_code']))\n",
" inserted = cur.rowcount\n",
" cur.execute(sql.SQL('drop table if exists {}').format(sql.Identifier('public', staging_table)))\n",
" cur.execute(sql.SQL('analyze {}').format(sql.SQL(FEATURE_TABLE)))\n",
" return {'layer_id': row['layer_id'], 'status': 'loaded_ogr2ogr', 'features': inserted}\n",
"\n",
"\n",
"def load_layer_to_postgis(row):\n",
" if not HAS_GEOPANDAS:\n",
" return load_layer_to_postgis_with_ogr(row)\n",
" spatial_path = row.get('spatial_path')\n",
" if not spatial_path or not Path(spatial_path).exists():\n",
" return {'layer_id': row['layer_id'], 'status': 'missing_spatial_file', 'features': 0}\n",
"\n",
" gdf = gpd.read_file(spatial_path)\n",
" if gdf.empty:\n",
" return {'layer_id': row['layer_id'], 'status': 'empty_file', 'features': 0}\n",
" if gdf.crs is None:\n",
" print(f'Warning: {spatial_path} has no CRS; assuming EPSG:4326.')\n",
" gdf = gdf.set_crs(4326)\n",
" else:\n",
" gdf = gdf.to_crs(4326)\n",
"\n",
" gdf = gdf[gdf.geometry.notna()].copy()\n",
" gdf = gdf[~gdf.geometry.is_empty].copy()\n",
" if gdf.empty:\n",
" return {'layer_id': row['layer_id'], 'status': 'no_valid_geometry', 'features': 0}\n",
"\n",
" layer_metadata = {\n",
" 'columns': [str(c) for c in gdf.columns],\n",
" 'row_count': int(len(gdf)),\n",
" 'source_spatial_path': spatial_path,\n",
" }\n",
"\n",
" feature_rows = []\n",
" prop_cols = [c for c in gdf.columns if c != gdf.geometry.name]\n",
" for source_row, (_, record) in enumerate(gdf.iterrows(), start=1):\n",
" geom = record.geometry\n",
" if geom is None or geom.is_empty:\n",
" continue\n",
" properties = {str(col): json_safe(record[col]) for col in prop_cols}\n",
" feature_rows.append((\n",
" stable_feature_id(row['layer_id'], source_row, geom),\n",
" row['layer_id'],\n",
" row['state_code'],\n",
" source_row,\n",
" Json(properties),\n",
" json.dumps(geom.__geo_interface__),\n",
" ))\n",
"\n",
" with get_conn() as conn:\n",
" with conn.cursor() as cur:\n",
" cur.execute(sql.SQL('''\n",
" insert into {layers} (\n",
" layer_id, state_code, title, format, datasetid, source_url,\n",
" filename, local_path, spatial_path, metadata, loaded_at\n",
" ) values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, now())\n",
" on conflict (layer_id) do update set\n",
" state_code = excluded.state_code,\n",
" title = excluded.title,\n",
" format = excluded.format,\n",
" datasetid = excluded.datasetid,\n",
" source_url = excluded.source_url,\n",
" filename = excluded.filename,\n",
" local_path = excluded.local_path,\n",
" spatial_path = excluded.spatial_path,\n",
" metadata = excluded.metadata,\n",
" loaded_at = now()\n",
" ''').format(layers=sql.SQL(LAYER_TABLE)), (\n",
" row['layer_id'], row['state_code'], row['title'], row['format'],\n",
" row.get('datasetid'), row['source_url'], row['filename'],\n",
" row['local_path'], spatial_path, Json(layer_metadata)\n",
" ))\n",
" cur.execute(sql.SQL('delete from {} where layer_id = %s').format(sql.SQL(FEATURE_TABLE)), (row['layer_id'],))\n",
" if feature_rows:\n",
" execute_values(\n",
" cur,\n",
" sql.SQL('''\n",
" insert into {features} (\n",
" feature_id, layer_id, state_code, source_row, properties, geom\n",
" ) values %s\n",
" ''').format(features=sql.SQL(FEATURE_TABLE)).as_string(conn),\n",
" feature_rows,\n",
" template='(%s, %s, %s, %s, %s, ST_Multi(ST_CollectionExtract(ST_MakeValid(ST_Force2D(ST_SetSRID(ST_GeomFromGeoJSON(%s), 4326))), 3)))',\n",
" page_size=1000,\n",
" )\n",
" cur.execute(sql.SQL('analyze {}').format(sql.SQL(FEATURE_TABLE)))\n",
" return {'layer_id': row['layer_id'], 'status': 'loaded', 'features': len(feature_rows)}\n",
"\n",
"\n",
"SKIP_ALREADY_LOADED_LAYERS = True\n",
"\n",
"load_results = []\n",
"if LOAD_TO_POSTGIS:\n",
" found = spatial_manifest[spatial_manifest['spatial_status'].eq('found')].copy()\n",
" already_loaded = set()\n",
" if SKIP_ALREADY_LOADED_LAYERS:\n",
" with get_conn() as conn:\n",
" with conn.cursor() as cur:\n",
" cur.execute(f'select layer_id from {LAYER_TABLE}')\n",
" already_loaded = {r[0] for r in cur.fetchall()}\n",
" print(f'Skipping {len(already_loaded)} layer(s) already in PostGIS')\n",
" for row in found.to_dict('records'):\n",
" if row['layer_id'] in already_loaded:\n",
" load_results.append({'layer_id': row['layer_id'], 'status': 'already_loaded', 'features': None})\n",
" continue\n",
" print('Loading layer:', row['state_code'], row['title'])\n",
" load_results.append(load_layer_to_postgis(row))\n",
"else:\n",
" print('LOAD_TO_POSTGIS=False; skipping spatial file load.')\n",
"\n",
"load_results = pd.DataFrame(load_results)\n",
"display(load_results)\n"
]
},
{
"cell_type": "markdown",
"id": "19",
"metadata": {},
"source": [
"## Match Data Centers to Precinct Layers"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20",
"metadata": {},
"outputs": [],
"source": [
"MATCH_SQL = f'''\n",
"insert into {MATCH_TABLE} (\n",
" master_id, feature_id, layer_id, state_code, join_method, match_distance_m, matched_at\n",
")\n",
"select\n",
" m.master_id,\n",
" f.feature_id,\n",
" f.layer_id,\n",
" f.state_code,\n",
" 'point_in_precinct' as join_method,\n",
" 0.0 as match_distance_m,\n",
" now()\n",
"from {MASTER_TABLE} m\n",
"join {FEATURE_TABLE} f\n",
" on m.geom && f.geom\n",
" and ST_Covers(f.geom, m.geom)\n",
"where m.geom is not null\n",
"on conflict (master_id, feature_id) do update set\n",
" join_method = excluded.join_method,\n",
" match_distance_m = excluded.match_distance_m,\n",
" matched_at = now()\n",
"'''\n",
"\n",
"if RUN_SPATIAL_MATCH:\n",
" with get_conn() as conn:\n",
" with conn.cursor() as cur:\n",
" cur.execute(MATCH_SQL)\n",
" print('Inserted/updated matches:', cur.rowcount)\n",
" cur.execute(sql.SQL('analyze {}').format(sql.SQL(MATCH_TABLE)))\n",
"else:\n",
" print('RUN_SPATIAL_MATCH=False; skipping point-in-precinct matching.')\n"
]
},
{
"cell_type": "markdown",
"id": "21",
"metadata": {},
"source": [
"## Match Diagnostics"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "22",
"metadata": {},
"outputs": [],
"source": [
"diagnostics_sql = f'''\n",
"with dc as (\n",
" select state, count(*) as data_centers, ST_Extent(geom) as dc_extent\n",
" from {MASTER_TABLE}\n",
" where geom is not null\n",
" group by state\n",
"), feats as (\n",
" select state_code as state, count(*) as precinct_features, ST_Extent(geom) as precinct_extent\n",
" from {FEATURE_TABLE}\n",
" where geom is not null\n",
" group by state_code\n",
"), matches as (\n",
" select state_code as state, count(distinct master_id) as matched_data_centers\n",
" from {MATCH_TABLE}\n",
" group by state_code\n",
")\n",
"select\n",
" coalesce(dc.state, feats.state, matches.state) as state,\n",
" coalesce(dc.data_centers, 0) as data_centers,\n",
" coalesce(feats.precinct_features, 0) as precinct_features,\n",
" coalesce(matches.matched_data_centers, 0) as matched_data_centers,\n",
" dc.dc_extent::text as data_center_extent,\n",
" feats.precinct_extent::text as precinct_extent\n",
"from dc\n",
"full join feats on feats.state = dc.state\n",
"full join matches on matches.state = coalesce(dc.state, feats.state)\n",
"order by state\n",
"'''\n",
"\n",
"nearest_sql = f'''\n",
"select\n",
" m.master_id,\n",
" m.name,\n",
" m.city,\n",
" m.state,\n",
" nearest.layer_id,\n",
" nearest.feature_id,\n",
" nearest.distance_m\n",
"from {MASTER_TABLE} m\n",
"left join {MATCH_TABLE} matched on matched.master_id = m.master_id\n",
"left join lateral (\n",
" select\n",
" f.layer_id,\n",
" f.feature_id,\n",
" ST_Distance(m.geom::geography, f.geom::geography) as distance_m\n",
" from {FEATURE_TABLE} f\n",
" where f.geom is not null\n",
" order by m.geom <-> f.geom\n",
" limit 1\n",
") nearest on true\n",
"where m.geom is not null\n",
" and matched.master_id is null\n",
"order by nearest.distance_m nulls last\n",
"limit 50\n",
"'''\n",
"\n",
"with get_conn() as conn:\n",
" diagnostics = pd.read_sql_query(diagnostics_sql, conn)\n",
" nearest_unmatched = pd.read_sql_query(nearest_sql, conn)\n",
"\n",
"display(diagnostics)\n",
"display(nearest_unmatched)\n",
"\n",
"if diagnostics['precinct_features'].sum() == 0:\n",
" print('PostGIS has zero RDH precinct features loaded. Run the download, unpack, and load cells before matching.')\n",
"elif diagnostics['matched_data_centers'].sum() == 0:\n",
" print('Precinct features exist, but no point-in-polygon matches were found. Check the extents and nearest unmatched distances above.')\n"
]
},
{
"cell_type": "markdown",
"id": "23",
"metadata": {},
"source": [
"## Optional Nearest-Precinct Fallback"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "24",
"metadata": {},
"outputs": [],
"source": [
"NEAREST_FALLBACK_SQL = f'''\n",
"insert into {MATCH_TABLE} (\n",
" master_id, feature_id, layer_id, state_code, join_method, match_distance_m, matched_at\n",
")\n",
"with candidates as (\n",
" select m.master_id, m.geom, upper(m.state) as state_code\n",
" from {MASTER_TABLE} m\n",
" where m.geom is not null\n",
" and m.state is not null\n",
" and upper(m.state) in (select distinct state_code from {FEATURE_TABLE} where geom is not null)\n",
" and not exists (\n",
" select 1 from {MATCH_TABLE} existing where existing.master_id = m.master_id\n",
" )\n",
")\n",
"select\n",
" c.master_id,\n",
" nearest.feature_id,\n",
" nearest.layer_id,\n",
" nearest.state_code,\n",
" 'nearest_precinct_within_threshold' as join_method,\n",
" nearest.distance_m,\n",
" now()\n",
"from candidates c\n",
"join lateral (\n",
" select\n",
" f.feature_id,\n",
" f.layer_id,\n",
" f.state_code,\n",
" ST_Distance(c.geom::geography, f.geom::geography) as distance_m\n",
" from {FEATURE_TABLE} f\n",
" where f.geom is not null\n",
" and f.state_code = c.state_code\n",
" order by c.geom <-> f.geom\n",
" limit 1\n",
") nearest on true\n",
"where nearest.distance_m <= %s\n",
"on conflict (master_id, feature_id) do update set\n",
" join_method = excluded.join_method,\n",
" match_distance_m = excluded.match_distance_m,\n",
" matched_at = now()\n",
"'''\n",
"\n",
"if RUN_SPATIAL_MATCH and RUN_NEAREST_PRECINCT_FALLBACK:\n",
" with get_conn() as conn:\n",
" with conn.cursor() as cur:\n",
" cur.execute(NEAREST_FALLBACK_SQL, (NEAREST_PRECINCT_MAX_DISTANCE_M,))\n",
" print('Inserted/updated nearest fallback matches:', cur.rowcount)\n",
" cur.execute(sql.SQL('analyze {}').format(sql.SQL(MATCH_TABLE)))\n",
"else:\n",
" print('Nearest fallback disabled.')\n"
]
},
{
"cell_type": "markdown",
"id": "25",
"metadata": {},
"source": [
"## Inspect Matches and Census-Tract Context"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26",
"metadata": {},
"outputs": [],
"source": [
"match_summary_sql = f'''\n",
"select\n",
" l.state_code,\n",
" l.title,\n",
" count(distinct f.feature_id) as precinct_features,\n",
" count(distinct m.master_id) as matched_data_centers\n",
"from {LAYER_TABLE} l\n",
"left join {FEATURE_TABLE} f on f.layer_id = l.layer_id\n",
"left join {MATCH_TABLE} m on m.feature_id = f.feature_id\n",
"group by l.state_code, l.title\n",
"order by l.state_code, matched_data_centers desc, l.title\n",
"'''\n",
"\n",
"sample_sql = f'''\n",
"select\n",
" m.master_id,\n",
" dc.name,\n",
" dc.city,\n",
" dc.state,\n",
" l.title as rdh_layer_title,\n",
" f.properties as precinct_properties\n",
"from {MATCH_TABLE} m\n",
"join {MASTER_TABLE} dc on dc.master_id = m.master_id\n",
"join {FEATURE_TABLE} f on f.feature_id = m.feature_id\n",
"join {LAYER_TABLE} l on l.layer_id = m.layer_id\n",
"order by dc.state, dc.city, dc.name\n",
"limit 25\n",
"'''\n",
"\n",
"with get_conn() as conn:\n",
" match_summary = pd.read_sql_query(match_summary_sql, conn)\n",
" sample_matches = pd.read_sql_query(sample_sql, conn)\n",
"\n",
"display(match_summary)\n",
"display(sample_matches)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "27",
"metadata": {},
"outputs": [],
"source": [
"with get_conn() as conn:\n",
" with conn.cursor() as cur:\n",
" cur.execute('select to_regclass(%s)', (CENSUS_TRACT_TABLE,))\n",
" has_census_tract_table = cur.fetchone()[0] is not None\n",
"\n",
"if has_census_tract_table:\n",
" tract_context_sql = f'''\n",
" select\n",
" dc.master_id,\n",
" dc.name,\n",
" dc.city,\n",
" dc.state,\n",
" dc.geoid as census_tract_geoid,\n",
" count(pm.feature_id) as precinct_layer_matches\n",
" from {MASTER_TABLE} dc\n",
" left join {CENSUS_TRACT_TABLE} ct on ct.geoid = dc.geoid\n",
" left join {MATCH_TABLE} pm on pm.master_id = dc.master_id\n",
" group by dc.master_id, dc.name, dc.city, dc.state, ct.geoid\n",
" order by precinct_layer_matches asc, dc.state, dc.city\n",
" limit 50\n",
" '''\n",
" with get_conn() as conn:\n",
" tract_context = pd.read_sql_query(tract_context_sql, conn)\n",
" display(tract_context)\n",
"else:\n",
" print(f'{CENSUS_TRACT_TABLE} not found; census tract fallback/context skipped.')\n"
]
},
{
"cell_type": "markdown",
"id": "28",
"metadata": {},
"source": [
"## Next Refinement: Tidy Vote Columns\n",
"\n",
"The RDH staging table intentionally stores each precinct row's original attributes in `properties jsonb`. Once the downloaded layers are visible, inspect `precinct_properties` above to identify vote-column patterns for the states/years you care about.\n",
"\n",
"Useful follow-up views can then extract fields like:\n",
"- precinct identifier/name\n",
"- election year\n",
"- office\n",
"- Democratic votes\n",
"- Republican votes\n",
"- total votes\n",
"- turnout or vote share\n",
"\n",
"That extraction is best added after confirming the specific RDH file families selected by the filters.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}