expanded voter data
@@ -145,7 +145,7 @@
|
||||
"source": [
|
||||
"## Parameters\n",
|
||||
"\n",
|
||||
"The defaults run a small real pilot for Virginia 2020, because Virginia has many data centers in the master table and a statewide precinct layer should produce visible matches. After the pilot works, broaden `TARGET_STATES` and `FILTER_YEARS_ANY`. Use `TARGET_STATES = None` to infer all states from `public.master_data_centers`.\n"
|
||||
"The defaults now target both 2020 and 2024 precinct election layers across all inferred data-center states. Set `TARGET_STATES` to a small list like `['VA']` for a quick pilot run, or keep `TARGET_STATES = None` to infer all states from `public.master_data_centers`. Use `FILTER_YEARS_ANY = []` to keep all years returned by RDH."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -166,7 +166,7 @@
|
||||
"TARGET_STATES = None # None = infer all states from master_data_centers; or list e.g. ['VA','TX']\n",
|
||||
"FILTER_TERMS_ALL = ['election results', 'precinct']\n",
|
||||
"FILTER_TERMS_ANY = [] # e.g. ['general', 'president']\n",
|
||||
"FILTER_YEARS_ANY = ['2020'] # pilot first; empty keeps all years returned by RDH\n",
|
||||
"FILTER_YEARS_ANY = ['2020', '2024'] # set [] to keep all years returned by RDH\n",
|
||||
"PREFERRED_FORMATS = ['SHP'] # point-in-precinct joins need spatial files\n",
|
||||
"\n",
|
||||
"DOWNLOAD_FILES = True\n",
|
||||
@@ -387,6 +387,11 @@
|
||||
" return re.sub(r'[^A-Za-z0-9._-]+', '_', name)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def detect_year(text):\n",
|
||||
" match = re.search(r'\\b(20\\d{2})\\b', str(text))\n",
|
||||
" return match.group(1) if match else None\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"work = listing.copy()\n",
|
||||
"for required in ['Title', 'Format', 'URL']:\n",
|
||||
" if required not in work.columns:\n",
|
||||
@@ -405,8 +410,20 @@
|
||||
"].copy()\n",
|
||||
"\n",
|
||||
"filtered = filtered.sort_values(['query_state_code', 'Title', 'Format', 'filename']).reset_index(drop=True)\n",
|
||||
"filtered['detected_year'] = filtered['Title'].map(detect_year)\n",
|
||||
"\n",
|
||||
"print(f'Filtered candidate files: {len(filtered):,}')\n",
|
||||
"display(filtered[['query_state_code', 'Title', 'Format', 'datasetid', 'filename', 'URL']].head(100))\n"
|
||||
"year_summary = (\n",
|
||||
" filtered.assign(detected_year=filtered['detected_year'].fillna('unknown'))\n",
|
||||
" .groupby('detected_year', dropna=False)\n",
|
||||
" .size()\n",
|
||||
" .reset_index(name='rows')\n",
|
||||
" .sort_values('detected_year')\n",
|
||||
")\n",
|
||||
"print('Candidate rows by detected year:')\n",
|
||||
"display(year_summary)\n",
|
||||
"\n",
|
||||
"display(filtered[['query_state_code', 'detected_year', 'Title', 'Format', 'datasetid', 'filename', 'URL']].head(100))"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1169,11 +1186,11 @@
|
||||
"id": "28",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Next Refinement: Tidy Vote Columns\n",
|
||||
"## Standardized Vote Fields\n",
|
||||
"\n",
|
||||
"The RDH staging table intentionally stores each precinct row's original attributes in `properties jsonb`. Once the downloaded layers are visible, inspect `precinct_properties` above to identify vote-column patterns for the states/years you care about.\n",
|
||||
"The cell below extracts a standardized set of election attributes from `precinct_properties` using heuristic key matching across RDH file families.\n",
|
||||
"\n",
|
||||
"Useful follow-up views can then extract fields like:\n",
|
||||
"Extracted fields:\n",
|
||||
"- precinct identifier/name\n",
|
||||
"- election year\n",
|
||||
"- office\n",
|
||||
@@ -1182,7 +1199,521 @@
|
||||
"- total votes\n",
|
||||
"- turnout or vote share\n",
|
||||
"\n",
|
||||
"That extraction is best added after confirming the specific RDH file families selected by the filters.\n"
|
||||
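"Because RDH schemas vary by state and source, this step is intentionally tolerant: when no direct turnout or share field is present, it falls back to `total_votes / reg_voters` if a `reg_voters` property exists, and otherwise to the Democratic share of the two-party vote, `dem / (dem + rep)`."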
|
||||
]
|
||||
},
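The fallback order described in the markdown cell above can be illustrated with a small stand-alone sketch. The function name `fallback_share` and the sample numbers are hypothetical. The sketch only mirrors the order of preference (turnout against registered voters first, then Democratic two-party share) and is not the notebook's extraction code.

```python
from typing import Optional


def fallback_share(
    total_votes: Optional[float],
    reg_voters: Optional[float],
    dem_votes: Optional[float],
    rep_votes: Optional[float],
) -> Optional[float]:
    """Return turnout if registered voters are known, else Democratic two-party share."""
    # Preferred fallback: turnout = total votes / registered voters.
    if reg_voters and total_votes:
        return total_votes / reg_voters
    # Otherwise: Democratic share of the two-party vote.
    if dem_votes is not None and rep_votes is not None and (dem_votes + rep_votes) > 0:
        return dem_votes / (dem_votes + rep_votes)
    return None


# Hypothetical precinct with a registered-voter count: turnout is used.
with_registration = fallback_share(total_votes=800, reg_voters=1000, dem_votes=500, rep_votes=300)
# Hypothetical precinct without one: two-party share is used instead.
without_registration = fallback_share(total_votes=800, reg_voters=None, dem_votes=500, rep_votes=300)
print(with_registration, without_registration)
```

The two branches return different quantities in the same field, so a consumer cannot tell turnout from vote share without checking whether `reg_voters` was present in the source properties.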
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "29",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"STANDARDIZED_LIMIT = None # set an int (e.g., 2000) for faster sampling\n",
|
||||
"\n",
|
||||
"limit_clause = '' if STANDARDIZED_LIMIT is None else 'limit %s'\n",
|
||||
"standardized_sql = f'''\n",
|
||||
"select\n",
|
||||
" m.master_id,\n",
|
||||
" dc.name,\n",
|
||||
" dc.city,\n",
|
||||
" dc.state,\n",
|
||||
" l.title as rdh_layer_title,\n",
|
||||
" f.properties as precinct_properties\n",
|
||||
"from {MATCH_TABLE} m\n",
|
||||
"join {MASTER_TABLE} dc on dc.master_id = m.master_id\n",
|
||||
"join {FEATURE_TABLE} f on f.feature_id = m.feature_id\n",
|
||||
"join {LAYER_TABLE} l on l.layer_id = m.layer_id\n",
|
||||
"order by dc.state, dc.city, dc.name\n",
|
||||
"{limit_clause}\n",
|
||||
"'''\n",
|
||||
"\n",
|
||||
"with get_conn() as conn:\n",
|
||||
" if STANDARDIZED_LIMIT is None:\n",
|
||||
" raw_standardized = pd.read_sql_query(standardized_sql, conn)\n",
|
||||
" else:\n",
|
||||
" raw_standardized = pd.read_sql_query(standardized_sql, conn, params=[STANDARDIZED_LIMIT])\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def parse_props(value):\n",
|
||||
" if isinstance(value, dict):\n",
|
||||
" return value\n",
|
||||
" if pd.isna(value):\n",
|
||||
" return {}\n",
|
||||
" text = str(value).strip()\n",
|
||||
" if not text:\n",
|
||||
" return {}\n",
|
||||
" try:\n",
|
||||
" obj = json.loads(text)\n",
|
||||
" return obj if isinstance(obj, dict) else {}\n",
|
||||
" except Exception:\n",
|
||||
" return {}\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def norm_key(k):\n",
|
||||
" return re.sub(r'[^a-z0-9]+', '_', str(k).strip().lower()).strip('_')\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def as_number(v):\n",
|
||||
" if v is None:\n",
|
||||
" return None\n",
|
||||
" if isinstance(v, (int, float, np.integer, np.floating)):\n",
|
||||
" if pd.isna(v):\n",
|
||||
" return None\n",
|
||||
" return float(v)\n",
|
||||
" text = str(v).strip().replace(',', '')\n",
|
||||
" if text == '':\n",
|
||||
" return None\n",
|
||||
" if re.fullmatch(r'-?\\d+(\\.\\d+)?', text):\n",
|
||||
" return float(text)\n",
|
||||
" return None\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def parse_year_from_title(title):\n",
|
||||
" m = re.search(r'\\b((?:19|20)\\d{2})\\b', str(title))\n",
|
||||
" return int(m.group(1)) if m else None\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def infer_year_from_keys(props_norm):\n",
|
||||
" key_patterns = [\n",
|
||||
" re.compile(r'^[pg](\\d{2})(pre|uss|con|gov|ag|sos|ltg|tre|aud).*'),\n",
|
||||
" re.compile(r'^[pg](\\d{2}).*'),\n",
|
||||
" ]\n",
|
||||
" for key in props_norm.keys():\n",
|
||||
" nk = norm_key(key)\n",
|
||||
" for pat in key_patterns:\n",
|
||||
" m = pat.match(nk)\n",
|
||||
" if m:\n",
|
||||
" yy = int(m.group(1))\n",
|
||||
" return 2000 + yy if yy < 60 else 1900 + yy\n",
|
||||
" return None\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def decode_rdh_vote_key(key):\n",
|
||||
" k = norm_key(key)\n",
|
||||
"\n",
|
||||
" m = re.match(r'^[pg](\\d{2})pre([a-z]).*', k)\n",
|
||||
" if m:\n",
|
||||
" party_code = m.group(2)\n",
|
||||
" return ('President', party_code)\n",
|
||||
"\n",
|
||||
" m = re.match(r'^[pg](\\d{2})uss([a-z]).*', k)\n",
|
||||
" if m:\n",
|
||||
" party_code = m.group(2)\n",
|
||||
" return ('U.S. Senate', party_code)\n",
|
||||
"\n",
|
||||
" m = re.match(r'^[pg](\\d{2})con(\\d{2})([a-z]).*', k)\n",
|
||||
" if m:\n",
|
||||
" district = m.group(2)\n",
|
||||
" party_code = m.group(3)\n",
|
||||
" return (f'U.S. House District {district}', party_code)\n",
|
||||
"\n",
|
||||
" return (None, None)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def party_from_key(key):\n",
|
||||
" k = norm_key(key)\n",
|
||||
" office, party_code = decode_rdh_vote_key(k)\n",
|
||||
" if party_code == 'd':\n",
|
||||
" return office, 'D'\n",
|
||||
" if party_code == 'r':\n",
|
||||
" return office, 'R'\n",
|
||||
"\n",
|
||||
" if any(t in k for t in ['biden', 'dem', 'democrat']):\n",
|
||||
" return office, 'D'\n",
|
||||
" if any(t in k for t in ['trump', 'gop', 'rep', 'republican']):\n",
|
||||
" return office, 'R'\n",
|
||||
"\n",
|
||||
" return office, None\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def detect_office(title, props_norm, vote_office_totals):\n",
|
||||
" title_lower = str(title).lower()\n",
|
||||
" if 'president' in title_lower or 'presidential' in title_lower:\n",
|
||||
" return 'President'\n",
|
||||
" if 'senate' in title_lower:\n",
|
||||
" return 'U.S. Senate'\n",
|
||||
" if 'house' in title_lower or 'congress' in title_lower:\n",
|
||||
" return 'U.S. House'\n",
|
||||
" if 'governor' in title_lower:\n",
|
||||
" return 'Governor'\n",
|
||||
"\n",
|
||||
" if vote_office_totals:\n",
|
||||
" return max(vote_office_totals.items(), key=lambda x: x[1])[0]\n",
|
||||
"\n",
|
||||
" office_key_hits = [k for k in props_norm if any(x in k for x in ['office', 'contest', 'race'])]\n",
|
||||
" if office_key_hits:\n",
|
||||
" best = office_key_hits[0]\n",
|
||||
" val = props_norm.get(best)\n",
|
||||
" if isinstance(val, str) and val.strip():\n",
|
||||
" return val.strip()\n",
|
||||
" return None\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def best_precinct_identifier(props_norm):\n",
|
||||
" preferred_keys = [\n",
|
||||
" 'precinct', 'precinct_name', 'precinctid', 'precinct_id', 'precinct20',\n",
|
||||
" 'pctname', 'pct', 'vtd', 'vtdst', 'vtdst20', 'name20',\n",
|
||||
" 'district', 'district_name', 'ward', 'geoid', 'geoid20', 'unique_id',\n",
|
||||
" ]\n",
|
||||
" for key in preferred_keys:\n",
|
||||
" if key in props_norm and str(props_norm[key]).strip():\n",
|
||||
" return str(props_norm[key]).strip()\n",
|
||||
"\n",
|
||||
" fallback_candidates = [\n",
|
||||
" (k, v) for k, v in props_norm.items()\n",
|
||||
" if any(t in k for t in ['precinct', 'vtd', 'ward', 'district', 'geo', 'name']) and str(v).strip()\n",
|
||||
" ]\n",
|
||||
" if fallback_candidates:\n",
|
||||
" return str(fallback_candidates[0][1]).strip()\n",
|
||||
" return None\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def extract_vote_fields(row):\n",
|
||||
" props = parse_props(row['precinct_properties'])\n",
|
||||
" props_norm = {norm_key(k): v for k, v in props.items()}\n",
|
||||
"\n",
|
||||
" precinct_id_or_name = best_precinct_identifier(props_norm)\n",
|
||||
" election_year = parse_year_from_title(row['rdh_layer_title'])\n",
|
||||
" if election_year is None:\n",
|
||||
" election_year = infer_year_from_keys(props_norm)\n",
|
||||
"\n",
|
||||
" year_keys = [k for k in props_norm if 'year' in k]\n",
|
||||
" if election_year is None and year_keys:\n",
|
||||
" for k in year_keys:\n",
|
||||
" y = as_number(props_norm[k])\n",
|
||||
" if y and 1900 <= y <= 2100:\n",
|
||||
" election_year = int(y)\n",
|
||||
" break\n",
|
||||
"\n",
|
||||
" numeric_items = [(k, as_number(v)) for k, v in props_norm.items()]\n",
|
||||
" numeric_items = [(k, v) for k, v in numeric_items if v is not None]\n",
|
||||
"\n",
|
||||
" dem_votes = None\n",
|
||||
" rep_votes = None\n",
|
||||
" vote_office_totals = {}\n",
|
||||
"\n",
|
||||
" for key, value in numeric_items:\n",
|
||||
" office_guess, party_guess = party_from_key(key)\n",
|
||||
" if party_guess == 'D':\n",
|
||||
" dem_votes = value if dem_votes is None else max(dem_votes, value)\n",
|
||||
" elif party_guess == 'R':\n",
|
||||
" rep_votes = value if rep_votes is None else max(rep_votes, value)\n",
|
||||
"\n",
|
||||
" if office_guess is not None and party_guess in {'D', 'R'}:\n",
|
||||
" vote_office_totals[office_guess] = vote_office_totals.get(office_guess, 0.0) + float(value)\n",
|
||||
"\n",
|
||||
" total_candidates = [\n",
|
||||
" v for k, v in numeric_items\n",
|
||||
" if (\n",
|
||||
" ('total' in k and 'vote' in k)\n",
|
||||
" or ('tot' in k and 'vote' in k)\n",
|
||||
" or k in {'votes_total', 'total_votes', 'vote_total'}\n",
|
||||
" )\n",
|
||||
" ]\n",
|
||||
" total_votes = max(total_candidates) if total_candidates else None\n",
|
||||
" if total_votes is None and dem_votes is not None and rep_votes is not None:\n",
|
||||
" total_votes = dem_votes + rep_votes\n",
|
||||
"\n",
|
||||
" turnout_candidates = [\n",
|
||||
" v for k, v in numeric_items\n",
|
||||
"        if k not in {'pct', 'pctname'} and any(x in k for x in ['turnout', 'share', 'pct'])  # skip precinct-id keys; 'turnout'/'share' already cover the longer variants\n",
|
||||
" ]\n",
|
||||
" turnout_or_vote_share = turnout_candidates[0] if turnout_candidates else None\n",
|
||||
"\n",
|
||||
" if turnout_or_vote_share is None:\n",
|
||||
" reg_voters = props_norm.get('reg_voters')\n",
|
||||
" reg_voters_num = as_number(reg_voters)\n",
|
||||
" if reg_voters_num and total_votes:\n",
|
||||
" turnout_or_vote_share = total_votes / reg_voters_num\n",
|
||||
" elif dem_votes is not None and rep_votes is not None and (dem_votes + rep_votes) > 0:\n",
|
||||
" turnout_or_vote_share = dem_votes / (dem_votes + rep_votes)\n",
|
||||
"\n",
|
||||
" office = detect_office(row['rdh_layer_title'], props_norm, vote_office_totals)\n",
|
||||
"\n",
|
||||
" return pd.Series({\n",
|
||||
" 'precinct_identifier_name': precinct_id_or_name,\n",
|
||||
" 'election_year': election_year,\n",
|
||||
" 'office': office,\n",
|
||||
" 'democratic_votes': dem_votes,\n",
|
||||
" 'republican_votes': rep_votes,\n",
|
||||
" 'total_votes': total_votes,\n",
|
||||
" 'turnout_or_vote_share': turnout_or_vote_share,\n",
|
||||
" })\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"standardized_fields = raw_standardized.apply(extract_vote_fields, axis=1)\n",
|
||||
"standardized_preview = pd.concat(\n",
|
||||
" [\n",
|
||||
" raw_standardized[['master_id', 'name', 'city', 'state', 'rdh_layer_title']],\n",
|
||||
" standardized_fields,\n",
|
||||
" ],\n",
|
||||
" axis=1,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"standardized_summary = pd.DataFrame({\n",
|
||||
" 'field': [\n",
|
||||
" 'precinct_identifier_name', 'election_year', 'office',\n",
|
||||
" 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share',\n",
|
||||
" ]\n",
|
||||
"})\n",
|
||||
"standardized_summary['non_null_rows'] = standardized_summary['field'].map(\n",
|
||||
" lambda c: int(standardized_preview[c].notna().sum())\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print(f'Standardized preview rows: {len(standardized_preview):,}')\n",
|
||||
"display(standardized_summary)\n",
|
||||
"display(standardized_preview.head(50))"
|
||||
]
|
||||
},
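A worked example of the key decoding used in the cell above. The sample keys are illustrative strings shaped like the pattern the regular expressions expect (election type letter, two-digit year, office code, party letter). Actual RDH field names vary by state and should be checked against each layer's properties. The helper name `decode_key` and the `OFFICE_CODES` table are hypothetical and cover only the three offices the cell decodes.

```python
import re
from typing import Optional, Tuple

# Office codes handled in this sketch (assumption: same codes as the cell above).
OFFICE_CODES = {'pre': 'President', 'uss': 'U.S. Senate'}


def decode_key(key: str) -> Tuple[Optional[int], Optional[str], Optional[str]]:
    """Split a key such as 'G20PREDBID' into (year, office, party letter)."""
    k = re.sub(r'[^a-z0-9]+', '_', key.strip().lower()).strip('_')
    # House keys carry a two-digit district between the office code and the party letter.
    m = re.match(r'^[pg](\d{2})con(\d{2})([a-z])', k)
    if m:
        yy, district, party = m.groups()
        office = f'U.S. House District {district}'
    else:
        m = re.match(r'^[pg](\d{2})(pre|uss)([a-z])', k)
        if not m:
            return (None, None, None)
        yy, code, party = m.groups()
        office = OFFICE_CODES[code]
    # Same two-digit pivot as infer_year_from_keys: 00-59 -> 20xx, 60-99 -> 19xx.
    year = 2000 + int(yy) if int(yy) < 60 else 1900 + int(yy)
    return (year, office, party.upper())


for sample in ['G20PREDBID', 'G20USSRGAD', 'G20CON07DSPA', 'NAME20']:
    print(sample, '->', decode_key(sample))
```

Keys that do not start with the election-type letter and a two-digit year, such as `NAME20`, decode to `(None, None, None)` and are left to the substring heuristics.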
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "30",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"ELECTION_CONTEXT_TABLE = 'public.data_center_election_context'\n",
|
||||
"\n",
|
||||
"required_cols = [\n",
|
||||
" 'master_id', 'rdh_layer_title',\n",
|
||||
" 'precinct_identifier_name', 'election_year', 'office',\n",
|
||||
" 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share',\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"missing_cols = [c for c in required_cols if c not in standardized_preview.columns]\n",
|
||||
"if missing_cols:\n",
|
||||
" raise RuntimeError(\n",
|
||||
" 'standardized_preview is missing required columns: '\n",
|
||||
" + ', '.join(missing_cols)\n",
|
||||
" + '. Run the standardized extraction cell first.'\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
"persist_best = standardized_preview[required_cols].copy()\n",
|
||||
"persist_best['non_null_score'] = persist_best[\n",
|
||||
" ['precinct_identifier_name', 'election_year', 'office', 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share']\n",
|
||||
"].notna().sum(axis=1)\n",
|
||||
"\n",
|
||||
"persist_best = persist_best.sort_values(\n",
|
||||
" ['master_id', 'non_null_score', 'total_votes'],\n",
|
||||
" ascending=[True, False, False],\n",
|
||||
" na_position='last'\n",
|
||||
")\n",
|
||||
"persist_best = persist_best.drop_duplicates(subset=['master_id'], keep='first').copy()\n",
|
||||
"\n",
|
||||
"with get_conn() as conn:\n",
|
||||
" master_base = pd.read_sql_query(\n",
|
||||
" f'''\n",
|
||||
" select master_id, name, city, upper(state) as state\n",
|
||||
" from {MASTER_TABLE}\n",
|
||||
" ''',\n",
|
||||
" conn,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
"persist_df = master_base.merge(\n",
|
||||
" persist_best.drop(columns=['non_null_score']),\n",
|
||||
" on='master_id',\n",
|
||||
" how='left',\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"create_sql = f'''\n",
|
||||
"create table if not exists {ELECTION_CONTEXT_TABLE} (\n",
|
||||
" master_id text primary key references public.master_data_centers(master_id) on delete cascade,\n",
|
||||
" name text,\n",
|
||||
" city text,\n",
|
||||
" state text,\n",
|
||||
" rdh_layer_title text,\n",
|
||||
" precinct_identifier_name text,\n",
|
||||
" election_year integer,\n",
|
||||
" office text,\n",
|
||||
" democratic_votes double precision,\n",
|
||||
" republican_votes double precision,\n",
|
||||
" total_votes double precision,\n",
|
||||
" turnout_or_vote_share double precision,\n",
|
||||
" updated_at timestamptz not null default now()\n",
|
||||
");\n",
|
||||
"create index if not exists data_center_election_context_state_idx\n",
|
||||
" on {ELECTION_CONTEXT_TABLE} (state);\n",
|
||||
"create index if not exists data_center_election_context_year_idx\n",
|
||||
" on {ELECTION_CONTEXT_TABLE} (election_year);\n",
|
||||
"'''\n",
|
||||
"\n",
|
||||
"upsert_sql = f'''\n",
|
||||
"insert into {ELECTION_CONTEXT_TABLE} (\n",
|
||||
" master_id, name, city, state, rdh_layer_title,\n",
|
||||
" precinct_identifier_name, election_year, office,\n",
|
||||
" democratic_votes, republican_votes, total_votes, turnout_or_vote_share,\n",
|
||||
" updated_at\n",
|
||||
")\n",
|
||||
"values %s\n",
|
||||
"on conflict (master_id) do update set\n",
|
||||
" name = excluded.name,\n",
|
||||
" city = excluded.city,\n",
|
||||
" state = excluded.state,\n",
|
||||
" rdh_layer_title = excluded.rdh_layer_title,\n",
|
||||
" precinct_identifier_name = excluded.precinct_identifier_name,\n",
|
||||
" election_year = excluded.election_year,\n",
|
||||
" office = excluded.office,\n",
|
||||
" democratic_votes = excluded.democratic_votes,\n",
|
||||
" republican_votes = excluded.republican_votes,\n",
|
||||
" total_votes = excluded.total_votes,\n",
|
||||
" turnout_or_vote_share = excluded.turnout_or_vote_share,\n",
|
||||
" updated_at = now()\n",
|
||||
"'''\n",
|
||||
"\n",
|
||||
"rows = []\n",
|
||||
"for rec in persist_df.to_dict('records'):\n",
|
||||
" rows.append((\n",
|
||||
" rec['master_id'],\n",
|
||||
" rec['name'],\n",
|
||||
" rec['city'],\n",
|
||||
" rec['state'],\n",
|
||||
"        rec.get('rdh_layer_title') if pd.notna(rec.get('rdh_layer_title')) else None,  # unmatched rows carry NaN after the left merge\n",
|
||||
"        rec.get('precinct_identifier_name') if pd.notna(rec.get('precinct_identifier_name')) else None,\n",
|
||||
" int(rec['election_year']) if pd.notna(rec.get('election_year')) else None,\n",
|
||||
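"        rec.get('office') if pd.notna(rec.get('office')) else None,  # keep NaN out of the text column so `office is not null` stays meaningful\n",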
|
||||
" float(rec['democratic_votes']) if pd.notna(rec.get('democratic_votes')) else None,\n",
|
||||
" float(rec['republican_votes']) if pd.notna(rec.get('republican_votes')) else None,\n",
|
||||
" float(rec['total_votes']) if pd.notna(rec.get('total_votes')) else None,\n",
|
||||
" float(rec['turnout_or_vote_share']) if pd.notna(rec.get('turnout_or_vote_share')) else None,\n",
|
||||
" ))\n",
|
||||
"\n",
|
||||
"with get_conn() as conn:\n",
|
||||
" with conn.cursor() as cur:\n",
|
||||
" cur.execute(create_sql)\n",
|
||||
" if rows:\n",
|
||||
" execute_values(\n",
|
||||
" cur,\n",
|
||||
" upsert_sql,\n",
|
||||
" rows,\n",
|
||||
" template='(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, now())',\n",
|
||||
" page_size=1000,\n",
|
||||
" )\n",
|
||||
" cur.execute(f'select count(*) from {ELECTION_CONTEXT_TABLE}')\n",
|
||||
" table_rows = cur.fetchone()[0]\n",
|
||||
" cur.execute(\n",
|
||||
" f'''\n",
|
||||
" select\n",
|
||||
" state,\n",
|
||||
" count(*) as rows,\n",
|
||||
" count(*) filter (\n",
|
||||
" where election_year is not null\n",
|
||||
" or office is not null\n",
|
||||
" or democratic_votes is not null\n",
|
||||
" or republican_votes is not null\n",
|
||||
" or total_votes is not null\n",
|
||||
" or turnout_or_vote_share is not null\n",
|
||||
" ) as rows_with_election\n",
|
||||
" from {ELECTION_CONTEXT_TABLE}\n",
|
||||
" group by state\n",
|
||||
" order by rows desc, state\n",
|
||||
" limit 15\n",
|
||||
" '''\n",
|
||||
" )\n",
|
||||
" state_counts = cur.fetchall()\n",
|
||||
"\n",
|
||||
"rows_with_election = int(\n",
|
||||
" persist_df[\n",
|
||||
" ['election_year', 'office', 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share']\n",
|
||||
" ].notna().any(axis=1).sum()\n",
|
||||
")\n",
|
||||
"print(f'Rows prepared for upsert: {len(rows):,}')\n",
|
||||
"print(f'Rows with election context: {rows_with_election:,}')\n",
|
||||
"print(f'Rows currently in {ELECTION_CONTEXT_TABLE}: {table_rows:,}')\n",
|
||||
"display(pd.DataFrame(state_counts, columns=['state', 'rows', 'rows_with_election']))"
|
||||
]
|
||||
},
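The de-duplication step in the cell above keeps one row per `master_id`. A dependency-free sketch of the same ranking rule follows. The helper name `pick_best_rows` and the sample rows are hypothetical, and tie handling here (first row seen wins) is an assumption rather than a guarantee about the pandas sort.

```python
from typing import Any, Dict, List

SCORED_FIELDS = [
    'precinct_identifier_name', 'election_year', 'office',
    'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share',
]


def pick_best_rows(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Keep, per master_id, the row with the most non-null fields, then the largest total_votes."""
    best: Dict[str, Dict[str, Any]] = {}

    def rank(row: Dict[str, Any]):
        score = sum(row.get(f) is not None for f in SCORED_FIELDS)
        total = row.get('total_votes')
        # Rows without total_votes rank below rows that have one.
        return (score, total is not None, total or 0.0)

    for row in rows:
        current = best.get(row['master_id'])
        if current is None or rank(row) > rank(current):
            best[row['master_id']] = row
    return best


sample_rows = [
    {'master_id': 'dc-1', 'office': 'President', 'election_year': 2020, 'total_votes': 900.0},
    {'master_id': 'dc-1', 'office': 'President', 'election_year': 2020, 'total_votes': 1200.0},
    {'master_id': 'dc-1', 'office': None, 'election_year': 2020, 'total_votes': 5000.0},
    {'master_id': 'dc-2', 'office': None, 'election_year': None, 'total_votes': None},
]
best_rows = pick_best_rows(sample_rows)
print(best_rows['dc-1']['total_votes'], best_rows['dc-2']['total_votes'])
```

In the sample, the third `dc-1` row has the largest `total_votes` but fewer populated fields, so it loses to the second row. Completeness is compared before vote count.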
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "31",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Persist Standardized Election Context\n",
|
||||
"\n",
|
||||
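"Writes one standardized election-context row per `master_id` into `public.data_center_election_context` for reuse in map and reporting workflows. When a data center matches several precinct features, the row with the most non-null standardized fields is kept, with ties broken by the largest `total_votes`."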
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "32",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Targeted state coverage check\n",
|
||||
"states = ['VA', 'WA', 'WI', 'WV', 'WY', 'DC', 'PR']\n",
|
||||
"\n",
|
||||
"with get_conn() as conn:\n",
|
||||
" check_df = pd.read_sql_query(\n",
|
||||
" f'''\n",
|
||||
" select\n",
|
||||
" state,\n",
|
||||
" count(*) as rows,\n",
|
||||
" count(*) filter (\n",
|
||||
" where election_year is not null\n",
|
||||
" or office is not null\n",
|
||||
" or democratic_votes is not null\n",
|
||||
" or republican_votes is not null\n",
|
||||
" or total_votes is not null\n",
|
||||
" or turnout_or_vote_share is not null\n",
|
||||
" ) as rows_with_election\n",
|
||||
" from {ELECTION_CONTEXT_TABLE}\n",
|
||||
" where state = any(%s)\n",
|
||||
" group by state\n",
|
||||
" order by state\n",
|
||||
" ''',\n",
|
||||
" conn,\n",
|
||||
" params=[states],\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
"display(check_df)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "33",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Tables Created by This Notebook and Their Relationships\n",
|
||||
"\n",
|
||||
"This notebook creates and/or maintains the following PostGIS/PostgreSQL tables:\n",
|
||||
"\n",
|
||||
"1. `public.rdh_precinct_vote_layers`\n",
|
||||
"- One row per RDH precinct-election layer ingested.\n",
|
||||
"- Key columns: `layer_id` (PK), `state_code`, `title`, `format`, file/source metadata, `loaded_at`.\n",
|
||||
"\n",
|
||||
"2. `public.rdh_precinct_vote_features`\n",
|
||||
"- One row per precinct polygon feature from a loaded layer.\n",
|
||||
"- Key columns: `feature_id` (PK), `layer_id` (FK), `state_code`, `source_row`, `properties` (JSONB), `geom` (MultiPolygon).\n",
|
||||
"- Relationship: many features belong to one layer.\n",
|
||||
"\n",
|
||||
"3. `public.data_center_rdh_precinct_vote_matches`\n",
|
||||
"- Spatial match table linking data centers to precinct features.\n",
|
||||
"- Key columns: `master_id` (FK), `feature_id` (FK), `layer_id` (FK), `state_code`, `join_method`, `match_distance_m`, `matched_at`.\n",
|
||||
"- Primary key: (`master_id`, `feature_id`).\n",
|
||||
"- Relationship: many-to-many bridge between data centers and precinct features (with match metadata).\n",
|
||||
"\n",
|
||||
"4. `public.data_center_election_context`\n",
|
||||
"- Final standardized, one-row-per-data-center election context used by downstream mapping/analysis.\n",
|
||||
"- Key columns: `master_id` (PK, FK), `name`, `city`, `state`, `rdh_layer_title`,\n",
|
||||
" `precinct_identifier_name`, `election_year`, `office`, `democratic_votes`, `republican_votes`,\n",
|
||||
" `total_votes`, `turnout_or_vote_share`, `updated_at`.\n",
|
||||
"- Relationship: one row per `master_id` in `public.master_data_centers` (left-joined so all master rows can be retained, even if election fields are null).\n",
|
||||
"\n",
|
||||
"### Relationship Summary\n",
|
||||
"\n",
|
||||
"- `public.master_data_centers (master_id)`\n",
|
||||
" - 1-to-many -> `public.data_center_rdh_precinct_vote_matches (master_id)`\n",
|
||||
" - 1-to-1 (effective in this notebook) -> `public.data_center_election_context (master_id)`\n",
|
||||
"\n",
|
||||
"- `public.rdh_precinct_vote_layers (layer_id)`\n",
|
||||
" - 1-to-many -> `public.rdh_precinct_vote_features (layer_id)`\n",
|
||||
" - 1-to-many -> `public.data_center_rdh_precinct_vote_matches (layer_id)`\n",
|
||||
"\n",
|
||||
"- `public.rdh_precinct_vote_features (feature_id)`\n",
|
||||
" - 1-to-many -> `public.data_center_rdh_precinct_vote_matches (feature_id)`\n",
|
||||
"\n",
|
||||
"In short: **layers -> features -> matches**, then matches are standardized into **one election-context row per data center**."
|
||||
]
|
||||
}
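The relationship summary in the cell above can be exercised with a throwaway in-memory SQLite database. This is only a stand-in: the real tables live in PostgreSQL/PostGIS, and the sketch omits the `public.` schema, geometry, JSONB and most metadata columns. All sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
create table master_data_centers (master_id text primary key, name text, state text);
create table rdh_precinct_vote_layers (layer_id integer primary key, state_code text, title text);
create table rdh_precinct_vote_features (
    feature_id integer primary key,
    layer_id integer references rdh_precinct_vote_layers(layer_id),
    properties text
);
create table data_center_rdh_precinct_vote_matches (
    master_id text references master_data_centers(master_id),
    feature_id integer references rdh_precinct_vote_features(feature_id),
    layer_id integer references rdh_precinct_vote_layers(layer_id),
    primary key (master_id, feature_id)
);
insert into master_data_centers values ('dc-1', 'Example DC', 'VA');
insert into rdh_precinct_vote_layers values (1, 'VA', 'Example 2020 precinct layer');
insert into rdh_precinct_vote_layers values (2, 'VA', 'Example 2024 precinct layer');
insert into rdh_precinct_vote_features values (10, 1, '{}');
insert into rdh_precinct_vote_features values (20, 2, '{}');
insert into data_center_rdh_precinct_vote_matches values ('dc-1', 10, 1);
insert into data_center_rdh_precinct_vote_matches values ('dc-1', 20, 2);
''')

# Walk matches -> features -> layers for each data center.
matches = conn.execute('''
select dc.master_id, l.title, f.feature_id
from data_center_rdh_precinct_vote_matches m
join master_data_centers dc on dc.master_id = m.master_id
join rdh_precinct_vote_features f on f.feature_id = m.feature_id
join rdh_precinct_vote_layers l on l.layer_id = m.layer_id
order by f.feature_id
''').fetchall()
print(matches)
conn.close()
```

One data center can match a feature in each loaded layer, which is why the matches table is keyed on (`master_id`, `feature_id`) and why the election-context table has to collapse several matches into a single row.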
|
||||
],
|
||||
|
||||