expanded voter data

2026-05-22 14:18:01 -07:00
parent dc8755cde0
commit c95f22fcdb
4 changed files with 628 additions and 7 deletions

@@ -145,7 +145,7 @@
"source": [
"## Parameters\n",
"\n",
"The defaults run a small real pilot for Virginia 2020, because Virginia has many data centers in the master table and a statewide precinct layer should produce visible matches. After the pilot works, broaden `TARGET_STATES` and `FILTER_YEARS_ANY`. Use `TARGET_STATES = None` to infer all states from `public.master_data_centers`.\n"
"The defaults now target both 2020 and 2024 precinct election layers across all inferred data-center states. Set `TARGET_STATES` to a small list like `['VA']` for a quick pilot run, or keep `TARGET_STATES = None` to infer all states from `public.master_data_centers`. Use `FILTER_YEARS_ANY = []` to keep all years returned by RDH."
]
},
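As a rough illustration of how the filter lists in the parameters cell might combine, the sketch below assumes that every term in `FILTER_TERMS_ALL` must appear in a listing title, that at least one term in `FILTER_TERMS_ANY` must appear when that list is non-empty, and that `FILTER_YEARS_ANY` behaves the same way for years. The helper `title_passes_filters` and the sample titles are hypothetical; the notebook's own filtering code is not shown here and may differ.

```python
def title_passes_filters(title, terms_all, terms_any, years_any):
    """Hypothetical helper: apply ALL/ANY term filters and an ANY year filter to one title."""
    text = str(title).lower()
    # Every "all" term must be present.
    if not all(term.lower() in text for term in terms_all):
        return False
    # An empty "any" list means no restriction.
    if terms_any and not any(term.lower() in text for term in terms_any):
        return False
    if years_any and not any(str(year) in text for year in years_any):
        return False
    return True


# Made-up titles for illustration only.
sample_titles = [
    'VA 2020 General Election Results Precinct Boundaries',
    'VA 2016 General Election Results Precinct Boundaries',
    'VA 2024 Congressional Districts',
]
kept = [
    t for t in sample_titles
    if title_passes_filters(t, ['election results', 'precinct'], [], ['2020', '2024'])
]
print(kept)
```

Under these assumptions only the first title survives: the second fails the year filter and the third lacks the required terms.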
{
@@ -166,7 +166,7 @@
"TARGET_STATES = None # None = infer all states from master_data_centers; or list e.g. ['VA','TX']\n",
"FILTER_TERMS_ALL = ['election results', 'precinct']\n",
"FILTER_TERMS_ANY = [] # e.g. ['general', 'president']\n",
"FILTER_YEARS_ANY = ['2020'] # pilot first; empty keeps all years returned by RDH\n",
"FILTER_YEARS_ANY = ['2020', '2024'] # set [] to keep all years returned by RDH\n",
"PREFERRED_FORMATS = ['SHP'] # point-in-precinct joins need spatial files\n",
"\n",
"DOWNLOAD_FILES = True\n",
@@ -387,6 +387,11 @@
" return re.sub(r'[^A-Za-z0-9._-]+', '_', name)\n",
"\n",
"\n",
"def detect_year(text):\n",
" match = re.search(r'\\b(20\\d{2})\\b', str(text))\n",
" return match.group(1) if match else None\n",
"\n",
"\n",
"work = listing.copy()\n",
"for required in ['Title', 'Format', 'URL']:\n",
" if required not in work.columns:\n",
@@ -405,8 +410,20 @@
"].copy()\n",
"\n",
"filtered = filtered.sort_values(['query_state_code', 'Title', 'Format', 'filename']).reset_index(drop=True)\n",
"filtered['detected_year'] = filtered['Title'].map(detect_year)\n",
"\n",
"print(f'Filtered candidate files: {len(filtered):,}')\n",
"display(filtered[['query_state_code', 'Title', 'Format', 'datasetid', 'filename', 'URL']].head(100))\n"
"year_summary = (\n",
" filtered.assign(detected_year=filtered['detected_year'].fillna('unknown'))\n",
" .groupby('detected_year', dropna=False)\n",
" .size()\n",
" .reset_index(name='rows')\n",
" .sort_values('detected_year')\n",
")\n",
"print('Candidate rows by detected year:')\n",
"display(year_summary)\n",
"\n",
"display(filtered[['query_state_code', 'detected_year', 'Title', 'Format', 'datasetid', 'filename', 'URL']].head(100))"
]
},
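The year detection and per-year summary added in this hunk can be tried on a few made-up titles. This sketch reuses the same regular expression as `detect_year` but swaps the pandas `groupby` for `collections.Counter` so it runs without third-party packages; the titles are invented for illustration.

```python
import re
from collections import Counter


def detect_year(text):
    # Same pattern as the notebook helper: first standalone 20xx token, else None.
    match = re.search(r'\b(20\d{2})\b', str(text))
    return match.group(1) if match else None


# Made-up titles for illustration only.
titles = [
    'VA 2020 Precinct Election Results',
    'TX 2024 Precinct Election Results',
    'GA Precinct Boundaries',
    'NC 2020 Results on 2024 Precincts',
]
years = [detect_year(t) or 'unknown' for t in titles]
summary = Counter(years)
print(sorted(summary.items()))
```

Two behaviors are worth keeping in mind. A title that mentions two years is bucketed under the first one, and because `\b` treats underscores as word characters, a name such as `va_2020_precincts` is reported as `unknown`.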
{
@@ -1169,11 +1186,11 @@
"id": "28",
"metadata": {},
"source": [
"## Next Refinement: Tidy Vote Columns\n",
"## Standardized Vote Fields\n",
"\n",
"The RDH staging table intentionally stores each precinct row's original attributes in `properties jsonb`. Once the downloaded layers are visible, inspect `precinct_properties` above to identify vote-column patterns for the states/years you care about.\n",
"The cell below extracts a standardized set of election attributes from `precinct_properties` using heuristic key matching across RDH file families.\n",
"\n",
"Useful follow-up views can then extract fields like:\n",
"Extracted fields:\n",
"- precinct identifier/name\n",
"- election year\n",
"- office\n",
@@ -1182,7 +1199,521 @@
"- total votes\n",
"- turnout or vote share\n",
"\n",
"That extraction is best added after confirming the specific RDH file families selected by the filters.\n"
"Because RDH schemas vary by state and source, this step is intentionally tolerant and computes fallback vote-share values when direct turnout/share fields are not present."
]
},
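The fallback order used by the extraction cell can be sketched in isolation. The function name `fallback_turnout_or_share` is hypothetical and the numbers are invented; the branch order follows the extraction code in this commit, which prefers turnout from a registered-voter count and otherwise uses the Democratic share of the two-party vote.

```python
def fallback_turnout_or_share(dem_votes, rep_votes, total_votes, reg_voters):
    """Hypothetical helper mirroring the fallback order in the extraction cell."""
    # Prefer turnout when a registered-voter count is available.
    if reg_voters and total_votes:
        return total_votes / reg_voters
    # Otherwise fall back to the Democratic share of the two-party vote.
    if dem_votes is not None and rep_votes is not None and (dem_votes + rep_votes) > 0:
        return dem_votes / (dem_votes + rep_votes)
    return None


turnout = fallback_turnout_or_share(600.0, 400.0, 1000.0, 2000.0)
share = fallback_turnout_or_share(600.0, 400.0, 1000.0, None)
print(turnout, share)
```

Because the two branches return different quantities (turnout versus party share), a consumer of `turnout_or_vote_share` may want a companion column recording which one was computed.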
{
"cell_type": "code",
"execution_count": null,
"id": "29",
"metadata": {},
"outputs": [],
"source": [
"STANDARDIZED_LIMIT = None # set an int (e.g., 2000) for faster sampling\n",
"\n",
"limit_clause = '' if STANDARDIZED_LIMIT is None else 'limit %s'\n",
"standardized_sql = f'''\n",
"select\n",
" m.master_id,\n",
" dc.name,\n",
" dc.city,\n",
" dc.state,\n",
" l.title as rdh_layer_title,\n",
" f.properties as precinct_properties\n",
"from {MATCH_TABLE} m\n",
"join {MASTER_TABLE} dc on dc.master_id = m.master_id\n",
"join {FEATURE_TABLE} f on f.feature_id = m.feature_id\n",
"join {LAYER_TABLE} l on l.layer_id = m.layer_id\n",
"order by dc.state, dc.city, dc.name\n",
"{limit_clause}\n",
"'''\n",
"\n",
"with get_conn() as conn:\n",
" if STANDARDIZED_LIMIT is None:\n",
" raw_standardized = pd.read_sql_query(standardized_sql, conn)\n",
" else:\n",
" raw_standardized = pd.read_sql_query(standardized_sql, conn, params=[STANDARDIZED_LIMIT])\n",
"\n",
"\n",
"def parse_props(value):\n",
" if isinstance(value, dict):\n",
" return value\n",
" if pd.isna(value):\n",
" return {}\n",
" text = str(value).strip()\n",
" if not text:\n",
" return {}\n",
" try:\n",
" obj = json.loads(text)\n",
" return obj if isinstance(obj, dict) else {}\n",
" except Exception:\n",
" return {}\n",
"\n",
"\n",
"def norm_key(k):\n",
" return re.sub(r'[^a-z0-9]+', '_', str(k).strip().lower()).strip('_')\n",
"\n",
"\n",
"def as_number(v):\n",
" if v is None:\n",
" return None\n",
" if isinstance(v, (int, float, np.integer, np.floating)):\n",
" if pd.isna(v):\n",
" return None\n",
" return float(v)\n",
" text = str(v).strip().replace(',', '')\n",
" if text == '':\n",
" return None\n",
" if re.fullmatch(r'-?\\d+(\\.\\d+)?', text):\n",
" return float(text)\n",
" return None\n",
"\n",
"\n",
"def parse_year_from_title(title):\n",
" m = re.search(r'\\b((?:19|20)\\d{2})\\b', str(title))\n",
" return int(m.group(1)) if m else None\n",
"\n",
"\n",
"def infer_year_from_keys(props_norm):\n",
" key_patterns = [\n",
" re.compile(r'^[pg](\\d{2})(pre|uss|con|gov|ag|sos|ltg|tre|aud).*'),\n",
" re.compile(r'^[pg](\\d{2}).*'),\n",
" ]\n",
" for key in props_norm.keys():\n",
" nk = norm_key(key)\n",
" for pat in key_patterns:\n",
" m = pat.match(nk)\n",
" if m:\n",
" yy = int(m.group(1))\n",
" return 2000 + yy if yy < 60 else 1900 + yy\n",
" return None\n",
"\n",
"\n",
"def decode_rdh_vote_key(key):\n",
" k = norm_key(key)\n",
"\n",
" m = re.match(r'^[pg](\\d{2})pre([a-z]).*', k)\n",
" if m:\n",
" party_code = m.group(2)\n",
" return ('President', party_code)\n",
"\n",
" m = re.match(r'^[pg](\\d{2})uss([a-z]).*', k)\n",
" if m:\n",
" party_code = m.group(2)\n",
" return ('U.S. Senate', party_code)\n",
"\n",
" m = re.match(r'^[pg](\\d{2})con(\\d{2})([a-z]).*', k)\n",
" if m:\n",
" district = m.group(2)\n",
" party_code = m.group(3)\n",
" return (f'U.S. House District {district}', party_code)\n",
"\n",
" return (None, None)\n",
"\n",
"\n",
"def party_from_key(key):\n",
" k = norm_key(key)\n",
" office, party_code = decode_rdh_vote_key(k)\n",
" if party_code == 'd':\n",
" return office, 'D'\n",
" if party_code == 'r':\n",
" return office, 'R'\n",
"\n",
" if any(t in k for t in ['biden', 'dem', 'democrat']):\n",
" return office, 'D'\n",
" if any(t in k for t in ['trump', 'gop', 'rep', 'republican']):\n",
" return office, 'R'\n",
"\n",
" return office, None\n",
"\n",
"\n",
"def detect_office(title, props_norm, vote_office_totals):\n",
" title_lower = str(title).lower()\n",
" if 'president' in title_lower or 'presidential' in title_lower:\n",
" return 'President'\n",
" if 'senate' in title_lower:\n",
" return 'U.S. Senate'\n",
" if 'house' in title_lower or 'congress' in title_lower:\n",
" return 'U.S. House'\n",
" if 'governor' in title_lower:\n",
" return 'Governor'\n",
"\n",
" if vote_office_totals:\n",
" return max(vote_office_totals.items(), key=lambda x: x[1])[0]\n",
"\n",
" office_key_hits = [k for k in props_norm if any(x in k for x in ['office', 'contest', 'race'])]\n",
" if office_key_hits:\n",
" best = office_key_hits[0]\n",
" val = props_norm.get(best)\n",
" if isinstance(val, str) and val.strip():\n",
" return val.strip()\n",
" return None\n",
"\n",
"\n",
"def best_precinct_identifier(props_norm):\n",
" preferred_keys = [\n",
" 'precinct', 'precinct_name', 'precinctid', 'precinct_id', 'precinct20',\n",
" 'pctname', 'pct', 'vtd', 'vtdst', 'vtdst20', 'name20',\n",
" 'district', 'district_name', 'ward', 'geoid', 'geoid20', 'unique_id',\n",
" ]\n",
" for key in preferred_keys:\n",
" if key in props_norm and str(props_norm[key]).strip():\n",
" return str(props_norm[key]).strip()\n",
"\n",
" fallback_candidates = [\n",
" (k, v) for k, v in props_norm.items()\n",
" if any(t in k for t in ['precinct', 'vtd', 'ward', 'district', 'geo', 'name']) and str(v).strip()\n",
" ]\n",
" if fallback_candidates:\n",
" return str(fallback_candidates[0][1]).strip()\n",
" return None\n",
"\n",
"\n",
"def extract_vote_fields(row):\n",
" props = parse_props(row['precinct_properties'])\n",
" props_norm = {norm_key(k): v for k, v in props.items()}\n",
"\n",
" precinct_id_or_name = best_precinct_identifier(props_norm)\n",
" election_year = parse_year_from_title(row['rdh_layer_title'])\n",
" if election_year is None:\n",
" election_year = infer_year_from_keys(props_norm)\n",
"\n",
" year_keys = [k for k in props_norm if 'year' in k]\n",
" if election_year is None and year_keys:\n",
" for k in year_keys:\n",
" y = as_number(props_norm[k])\n",
" if y and 1900 <= y <= 2100:\n",
" election_year = int(y)\n",
" break\n",
"\n",
" numeric_items = [(k, as_number(v)) for k, v in props_norm.items()]\n",
" numeric_items = [(k, v) for k, v in numeric_items if v is not None]\n",
"\n",
" dem_votes = None\n",
" rep_votes = None\n",
" vote_office_totals = {}\n",
"\n",
" for key, value in numeric_items:\n",
" office_guess, party_guess = party_from_key(key)\n",
" if party_guess == 'D':\n",
" dem_votes = value if dem_votes is None else max(dem_votes, value)\n",
" elif party_guess == 'R':\n",
" rep_votes = value if rep_votes is None else max(rep_votes, value)\n",
"\n",
" if office_guess is not None and party_guess in {'D', 'R'}:\n",
" vote_office_totals[office_guess] = vote_office_totals.get(office_guess, 0.0) + float(value)\n",
"\n",
" total_candidates = [\n",
" v for k, v in numeric_items\n",
" if (\n",
" ('total' in k and 'vote' in k)\n",
" or ('tot' in k and 'vote' in k)\n",
" or k in {'votes_total', 'total_votes', 'vote_total'}\n",
" )\n",
" ]\n",
" total_votes = max(total_candidates) if total_candidates else None\n",
" if total_votes is None and dem_votes is not None and rep_votes is not None:\n",
" total_votes = dem_votes + rep_votes\n",
"\n",
" turnout_candidates = [\n",
" v for k, v in numeric_items\n",
" if any(x in k for x in ['turnout', 'turnout_pct', 'turnout_rate', 'vote_share', 'share', 'pct'])\n",
" ]\n",
" turnout_or_vote_share = turnout_candidates[0] if turnout_candidates else None\n",
"\n",
" if turnout_or_vote_share is None:\n",
" reg_voters = props_norm.get('reg_voters')\n",
" reg_voters_num = as_number(reg_voters)\n",
" if reg_voters_num and total_votes:\n",
" turnout_or_vote_share = total_votes / reg_voters_num\n",
" elif dem_votes is not None and rep_votes is not None and (dem_votes + rep_votes) > 0:\n",
" turnout_or_vote_share = dem_votes / (dem_votes + rep_votes)\n",
"\n",
" office = detect_office(row['rdh_layer_title'], props_norm, vote_office_totals)\n",
"\n",
" return pd.Series({\n",
" 'precinct_identifier_name': precinct_id_or_name,\n",
" 'election_year': election_year,\n",
" 'office': office,\n",
" 'democratic_votes': dem_votes,\n",
" 'republican_votes': rep_votes,\n",
" 'total_votes': total_votes,\n",
" 'turnout_or_vote_share': turnout_or_vote_share,\n",
" })\n",
"\n",
"\n",
"standardized_fields = raw_standardized.apply(extract_vote_fields, axis=1)\n",
"standardized_preview = pd.concat(\n",
" [\n",
" raw_standardized[['master_id', 'name', 'city', 'state', 'rdh_layer_title']],\n",
" standardized_fields,\n",
" ],\n",
" axis=1,\n",
")\n",
"\n",
"standardized_summary = pd.DataFrame({\n",
" 'field': [\n",
" 'precinct_identifier_name', 'election_year', 'office',\n",
" 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share',\n",
" ]\n",
"})\n",
"standardized_summary['non_null_rows'] = standardized_summary['field'].map(\n",
" lambda c: int(standardized_preview[c].notna().sum())\n",
")\n",
"\n",
"print(f'Standardized preview rows: {len(standardized_preview):,}')\n",
"display(standardized_summary)\n",
"display(standardized_preview.head(50))"
]
},
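The key patterns matched by `decode_rdh_vote_key` assume that a column name packs the election stage, a two-digit year, an office code, and a party initial into one token, as in `G20PREDBID`. The standalone sketch below decodes keys of that shape. `decode_key`, the `OFFICES` table, and the reading of `p` as "primary" are assumptions for illustration, and only the three office codes the notebook decodes are covered.

```python
import re

# Office codes handled by the notebook's decoder; other codes are ignored here.
OFFICES = {'pre': 'President', 'uss': 'U.S. Senate', 'con': 'U.S. House'}
KEY_RE = re.compile(r'^([pg])(\d{2})(pre|uss|con)(\d{2})?([a-z])')


def decode_key(key):
    """Illustrative decoder for keys shaped like G20PREDBID or G20CON07RSMI."""
    m = KEY_RE.match(str(key).strip().lower())
    if not m:
        return None
    stage, yy, office, district, party = m.groups()
    return {
        'stage': 'general' if stage == 'g' else 'primary',  # assumption: p = primary
        'year': 2000 + int(yy),  # assumption: all years fall in the 2000s
        'office': OFFICES[office],
        'district': district,  # None unless the key carries a district number
        'party': party.upper(),
    }


president = decode_key('G20PREDBID')
house = decode_key('G20CON07RSMI')
not_a_vote_key = decode_key('GEOID20')
print(president, house, not_a_vote_key)
```

Keys that do not follow this shape, such as `GEOID20`, return `None`, which is why the notebook also keeps substring-based fallbacks for party detection.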
{
"cell_type": "code",
"execution_count": null,
"id": "30",
"metadata": {},
"outputs": [],
"source": [
"ELECTION_CONTEXT_TABLE = 'public.data_center_election_context'\n",
"\n",
"required_cols = [\n",
" 'master_id', 'rdh_layer_title',\n",
" 'precinct_identifier_name', 'election_year', 'office',\n",
" 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share',\n",
"]\n",
"\n",
"missing_cols = [c for c in required_cols if c not in standardized_preview.columns]\n",
"if missing_cols:\n",
" raise RuntimeError(\n",
" 'standardized_preview is missing required columns: '\n",
" + ', '.join(missing_cols)\n",
" + '. Run the standardized extraction cell first.'\n",
" )\n",
"\n",
"persist_best = standardized_preview[required_cols].copy()\n",
"persist_best['non_null_score'] = persist_best[\n",
" ['precinct_identifier_name', 'election_year', 'office', 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share']\n",
"].notna().sum(axis=1)\n",
"\n",
"persist_best = persist_best.sort_values(\n",
" ['master_id', 'non_null_score', 'total_votes'],\n",
" ascending=[True, False, False],\n",
" na_position='last'\n",
")\n",
"persist_best = persist_best.drop_duplicates(subset=['master_id'], keep='first').copy()\n",
"\n",
"with get_conn() as conn:\n",
" master_base = pd.read_sql_query(\n",
" f'''\n",
" select master_id, name, city, upper(state) as state\n",
" from {MASTER_TABLE}\n",
" ''',\n",
" conn,\n",
" )\n",
"\n",
"persist_df = master_base.merge(\n",
" persist_best.drop(columns=['non_null_score']),\n",
" on='master_id',\n",
" how='left',\n",
")\n",
"\n",
"create_sql = f'''\n",
"create table if not exists {ELECTION_CONTEXT_TABLE} (\n",
" master_id text primary key references public.master_data_centers(master_id) on delete cascade,\n",
" name text,\n",
" city text,\n",
" state text,\n",
" rdh_layer_title text,\n",
" precinct_identifier_name text,\n",
" election_year integer,\n",
" office text,\n",
" democratic_votes double precision,\n",
" republican_votes double precision,\n",
" total_votes double precision,\n",
" turnout_or_vote_share double precision,\n",
" updated_at timestamptz not null default now()\n",
");\n",
"create index if not exists data_center_election_context_state_idx\n",
" on {ELECTION_CONTEXT_TABLE} (state);\n",
"create index if not exists data_center_election_context_year_idx\n",
" on {ELECTION_CONTEXT_TABLE} (election_year);\n",
"'''\n",
"\n",
"upsert_sql = f'''\n",
"insert into {ELECTION_CONTEXT_TABLE} (\n",
" master_id, name, city, state, rdh_layer_title,\n",
" precinct_identifier_name, election_year, office,\n",
" democratic_votes, republican_votes, total_votes, turnout_or_vote_share,\n",
" updated_at\n",
")\n",
"values %s\n",
"on conflict (master_id) do update set\n",
" name = excluded.name,\n",
" city = excluded.city,\n",
" state = excluded.state,\n",
" rdh_layer_title = excluded.rdh_layer_title,\n",
" precinct_identifier_name = excluded.precinct_identifier_name,\n",
" election_year = excluded.election_year,\n",
" office = excluded.office,\n",
" democratic_votes = excluded.democratic_votes,\n",
" republican_votes = excluded.republican_votes,\n",
" total_votes = excluded.total_votes,\n",
" turnout_or_vote_share = excluded.turnout_or_vote_share,\n",
" updated_at = now()\n",
"'''\n",
"\n",
"rows = []\n",
"for rec in persist_df.to_dict('records'):\n",
" rows.append((\n",
" rec['master_id'],\n",
" rec['name'],\n",
" rec['city'],\n",
" rec['state'],\n",
" rec.get('rdh_layer_title'),\n",
" rec.get('precinct_identifier_name'),\n",
" int(rec['election_year']) if pd.notna(rec.get('election_year')) else None,\n",
" rec.get('office'),\n",
" float(rec['democratic_votes']) if pd.notna(rec.get('democratic_votes')) else None,\n",
" float(rec['republican_votes']) if pd.notna(rec.get('republican_votes')) else None,\n",
" float(rec['total_votes']) if pd.notna(rec.get('total_votes')) else None,\n",
" float(rec['turnout_or_vote_share']) if pd.notna(rec.get('turnout_or_vote_share')) else None,\n",
" ))\n",
"\n",
"with get_conn() as conn:\n",
" with conn.cursor() as cur:\n",
" cur.execute(create_sql)\n",
" if rows:\n",
" execute_values(\n",
" cur,\n",
" upsert_sql,\n",
" rows,\n",
" template='(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, now())',\n",
" page_size=1000,\n",
" )\n",
" cur.execute(f'select count(*) from {ELECTION_CONTEXT_TABLE}')\n",
" table_rows = cur.fetchone()[0]\n",
" cur.execute(\n",
" f'''\n",
" select\n",
" state,\n",
" count(*) as rows,\n",
" count(*) filter (\n",
" where election_year is not null\n",
" or office is not null\n",
" or democratic_votes is not null\n",
" or republican_votes is not null\n",
" or total_votes is not null\n",
" or turnout_or_vote_share is not null\n",
" ) as rows_with_election\n",
" from {ELECTION_CONTEXT_TABLE}\n",
" group by state\n",
" order by rows desc, state\n",
" limit 15\n",
" '''\n",
" )\n",
" state_counts = cur.fetchall()\n",
"\n",
"rows_with_election = int(\n",
" persist_df[\n",
" ['election_year', 'office', 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share']\n",
" ].notna().any(axis=1).sum()\n",
")\n",
"print(f'Rows prepared for upsert: {len(rows):,}')\n",
"print(f'Rows with election context: {rows_with_election:,}')\n",
"print(f'Rows currently in {ELECTION_CONTEXT_TABLE}: {table_rows:,}')\n",
"display(pd.DataFrame(state_counts, columns=['state', 'rows', 'rows_with_election']))"
]
},
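The rule this cell uses to keep one row per `master_id` (most non-null standardized fields first, then the larger `total_votes`) can be sketched without pandas. This is a plain-Python stand-in for the `sort_values` plus `drop_duplicates` combination, not the notebook's code; `pick_best_rows` and the sample rows are hypothetical.

```python
FIELDS = [
    'precinct_identifier_name', 'election_year', 'office',
    'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share',
]


def pick_best_rows(rows):
    """Keep, per master_id, the row with the most non-null fields (ties: larger total_votes)."""
    best = {}
    for row in rows:
        score = sum(row.get(f) is not None for f in FIELDS)
        total = row.get('total_votes')
        # A missing total ranks below any real total, similar to na_position='last'.
        rank = (score, total is not None, total or 0.0)
        current = best.get(row['master_id'])
        if current is None or rank > current[0]:
            best[row['master_id']] = (rank, row)
    return {master_id: row for master_id, (rank, row) in best.items()}


# Made-up candidate rows for illustration only.
candidates = [
    {'master_id': 'dc-1', 'election_year': 2020, 'office': 'President', 'total_votes': 900.0},
    {'master_id': 'dc-1', 'election_year': 2024, 'office': 'President', 'total_votes': 1200.0},
    {'master_id': 'dc-2', 'election_year': 2020},
]
chosen = pick_best_rows(candidates)
print(chosen['dc-1']['election_year'])
```

Note that the election year is not part of the ranking. With both 2020 and 2024 layers loaded, a data center can end up with either year depending on field coverage and vote totals, so a workflow that needs a specific year may want to filter before ranking.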
{
"cell_type": "markdown",
"id": "31",
"metadata": {},
"source": [
"## Persist Standardized Election Context\n",
"\n",
"Writes one standardized election-context row per `master_id` into `public.data_center_election_context` for reuse in map and reporting workflows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32",
"metadata": {},
"outputs": [],
"source": [
"# Targeted state coverage check\n",
"states = ['VA', 'WA', 'WI', 'WV', 'WY', 'DC', 'PR']\n",
"\n",
"with get_conn() as conn:\n",
" check_df = pd.read_sql_query(\n",
" f'''\n",
" select\n",
" state,\n",
" count(*) as rows,\n",
" count(*) filter (\n",
" where election_year is not null\n",
" or office is not null\n",
" or democratic_votes is not null\n",
" or republican_votes is not null\n",
" or total_votes is not null\n",
" or turnout_or_vote_share is not null\n",
" ) as rows_with_election\n",
" from {ELECTION_CONTEXT_TABLE}\n",
" where state = any(%s)\n",
" group by state\n",
" order by state\n",
" ''',\n",
" conn,\n",
" params=[states],\n",
" )\n",
"\n",
"display(check_df)"
]
},
{
"cell_type": "markdown",
"id": "33",
"metadata": {},
"source": [
"## Tables Created by This Notebook and Their Relationships\n",
"\n",
"This notebook creates and/or maintains the following PostGIS/PostgreSQL tables:\n",
"\n",
"1. `public.rdh_precinct_vote_layers`\n",
"- One row per RDH precinct-election layer ingested.\n",
"- Key columns: `layer_id` (PK), `state_code`, `title`, `format`, file/source metadata, `loaded_at`.\n",
"\n",
"2. `public.rdh_precinct_vote_features`\n",
"- One row per precinct polygon feature from a loaded layer.\n",
"- Key columns: `feature_id` (PK), `layer_id` (FK), `state_code`, `source_row`, `properties` (JSONB), `geom` (MultiPolygon).\n",
"- Relationship: many features belong to one layer.\n",
"\n",
"3. `public.data_center_rdh_precinct_vote_matches`\n",
"- Spatial match table linking data centers to precinct features.\n",
"- Key columns: `master_id` (FK), `feature_id` (FK), `layer_id` (FK), `state_code`, `join_method`, `match_distance_m`, `matched_at`.\n",
"- Primary key: (`master_id`, `feature_id`).\n",
"- Relationship: many-to-many bridge between data centers and precinct features (with match metadata).\n",
"\n",
"4. `public.data_center_election_context`\n",
"- Final standardized, one-row-per-data-center election context used by downstream mapping/analysis.\n",
"- Key columns: `master_id` (PK, FK), `name`, `city`, `state`, `rdh_layer_title`,\n",
" `precinct_identifier_name`, `election_year`, `office`, `democratic_votes`, `republican_votes`,\n",
" `total_votes`, `turnout_or_vote_share`, `updated_at`.\n",
"- Relationship: one row per `master_id` in `public.master_data_centers` (left-joined so all master rows can be retained, even if election fields are null).\n",
"\n",
"### Relationship Summary\n",
"\n",
"- `public.master_data_centers (master_id)`\n",
" - 1-to-many -> `public.data_center_rdh_precinct_vote_matches (master_id)`\n",
" - 1-to-1 (effective in this notebook) -> `public.data_center_election_context (master_id)`\n",
"\n",
"- `public.rdh_precinct_vote_layers (layer_id)`\n",
" - 1-to-many -> `public.rdh_precinct_vote_features (layer_id)`\n",
" - 1-to-many -> `public.data_center_rdh_precinct_vote_matches (layer_id)`\n",
"\n",
"- `public.rdh_precinct_vote_features (feature_id)`\n",
" - 1-to-many -> `public.data_center_rdh_precinct_vote_matches (feature_id)`\n",
"\n",
"In short: **layers -> features -> matches**, then matches are standardized into **one election-context row per data center**."
]
}
],
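To make the join path concrete, here is a simplified in-memory SQLite sketch of the same relationships. Table and column names are shortened, and geometry, JSONB properties, and match metadata are omitted, so this is an illustration of the cardinalities rather than the notebook's PostGIS schema. All rows are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
create table master_data_centers (master_id text primary key, name text);
create table layers (layer_id integer primary key, title text);
create table features (
    feature_id integer primary key,
    layer_id integer references layers(layer_id)
);
-- Bridge table: one data center can match features from several layers.
create table matches (
    master_id text references master_data_centers(master_id),
    feature_id integer references features(feature_id),
    layer_id integer references layers(layer_id),
    primary key (master_id, feature_id)
);
insert into master_data_centers values ('dc-1', 'Example DC'), ('dc-2', 'Unmatched DC');
insert into layers values (1, 'VA 2020 precincts'), (2, 'VA 2024 precincts');
insert into features values (10, 1), (20, 2);
insert into matches values ('dc-1', 10, 1), ('dc-1', 20, 2);
''')

# Left join from the master table so unmatched data centers are still listed.
rows = conn.execute('''
select dc.master_id, count(m.feature_id) as matched_features
from master_data_centers dc
left join matches m on m.master_id = dc.master_id
group by dc.master_id
order by dc.master_id
''').fetchall()
print(rows)
conn.close()
```

The left join from the master table is what lets an unmatched data center appear with a count of zero, which parallels how the election-context table retains master rows whose election fields are null.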