diff --git a/build_fcc_bdc_broadband_connection_table.ipynb b/build_fcc_bdc_broadband_connection_table.ipynb index 44964b3..e165bd4 100644 --- a/build_fcc_bdc_broadband_connection_table.ipynb +++ b/build_fcc_bdc_broadband_connection_table.ipynb @@ -1478,6 +1478,48 @@ "- `fcc_max_advertised_download_mbps` / `fcc_max_advertised_upload_mbps` are estimated from the highest speed tier with non-zero availability share.\n", "- Provider-count columns are populated from the separate provider-summary catalog, which is global catalog context rather than geography-specific broadband coverage." ] + }, + { + "cell_type": "markdown", + "id": "28", + "metadata": {}, + "source": [ + "## Tables Created by This Notebook and Their Relationships\n", + "\n", + "This notebook creates and/or maintains five PostgreSQL tables in the `public` schema:\n", + "\n", + "1. `public.fcc_bdc_as_of`\n", + "- One row per FCC BDC release date and data type.\n", + "- Primary metadata table used to track versioning (`as_of_date`) for downstream loads.\n", + "\n", + "2. `public.fcc_bdc_files`\n", + "- One row per file discovered/downloaded for a release.\n", + "- Linked to releases via `as_of_date` and used as file-level lineage/provenance.\n", + "\n", + "3. `public.fcc_bdc_broadband_by_datacenter`\n", + "- Fact table keyed by `(master_id, as_of_date)` for per-data-center broadband availability metrics.\n", + "- Includes scalar broadband fields and summary JSON payloads.\n", + "- `master_id` aligns with `public.master_data_centers.master_id`.\n", + "\n", + "4. `public.fcc_bdc_broadband_summary`\n", + "- Aggregated summary metrics by release (`as_of_date`) used for QA and reporting.\n", + "\n", + "5. 
`public.fcc_bdc_provider_summary`\n", + "- Provider catalog/aggregation table by release (`as_of_date`) with provider class rollups.\n", + "\n", + "### Relationship Summary\n", + "\n", + "- `public.fcc_bdc_as_of (as_of_date)`\n", + " - 1-to-many -> `public.fcc_bdc_files (as_of_date)`\n", + " - 1-to-many -> `public.fcc_bdc_broadband_by_datacenter (as_of_date)`\n", + " - 1-to-many -> `public.fcc_bdc_broadband_summary (as_of_date)`\n", + " - 1-to-many -> `public.fcc_bdc_provider_summary (as_of_date)`\n", + "\n", + "- `public.master_data_centers (master_id)`\n", + " - 1-to-many over time -> `public.fcc_bdc_broadband_by_datacenter (master_id, as_of_date)`\n", + "\n", + "In short: **release metadata (`as_of` + `files`) supports reproducible loads, while per-DC broadband facts and release-level/provider-level summaries support analysis.**" + ] } ], "metadata": { diff --git a/historical_climate_data_centers.ipynb b/historical_climate_data_centers.ipynb index 8cf8852..e5368ed 100644 --- a/historical_climate_data_centers.ipynb +++ b/historical_climate_data_centers.ipynb @@ -887,6 +887,30 @@ "\n", "**Metric change vs. the prior Open-Meteo implementation.** The old `windstorm_days` field counted days where the daily *max wind gust* exceeded 17.2 m/s (~38 mph, ERA5 reanalysis gust). gridMET reports only daily *mean* wind, not gusts, so the new `sustained_wind_days` field counts days with daily-mean wind ≥ 8 m/s (~18 mph). These are fundamentally different signals — do not compare values across the two columns. If true gust/storm event counts are needed, NOAA's Storm Events Database is the appropriate next source.\n" ] + }, + { + "cell_type": "markdown", + "id": "18", + "metadata": {}, + "source": [ + "## Tables Created by This Notebook and Their Relationships\n", + "\n", + "This notebook creates and/or maintains one primary PostGIS table:\n", + "\n", + "1. 
`public.data_center_historical_climate`\n", + "- One row per data center (`master_id`).\n", + "- Stores climate summary metrics (temperature, humidity, wet-bulb, precipitation variability, cooling-degree-days, wind fields/status), geometry, and lineage timestamps.\n", + "- Upserted incrementally so reruns refresh changed rows without duplicating records.\n", + "\n", + "### Relationship Summary\n", + "\n", + "- `public.master_data_centers (master_id)`\n", + " - 1-to-1 (effective) -> `public.data_center_historical_climate (master_id)`\n", + "\n", + "`public.data_center_historical_climate.master_id` is a foreign key to `public.master_data_centers.master_id` (with cascade delete), so climate rows track the master data-center record set.\n", + "\n", + "In short: **`master_data_centers` is the entity table, and `data_center_historical_climate` is its one-row-per-site climate feature extension.**" + ] } ], "metadata": { diff --git a/open_meteo_historical_data_centers.ipynb b/open_meteo_historical_data_centers.ipynb index a60ceb2..c5c89c9 100644 --- a/open_meteo_historical_data_centers.ipynb +++ b/open_meteo_historical_data_centers.ipynb @@ -836,6 +836,30 @@ "\n", "**Metric change vs. the prior Open-Meteo implementation.** The old `windstorm_days` field counted days where the daily *max wind gust* exceeded 17.2 m/s (~38 mph, ERA5 reanalysis gust). gridMET reports only daily *mean* wind, not gusts, so the new `sustained_wind_days` field counts days with daily-mean wind ≥ 8 m/s (~18 mph). These are fundamentally different signals — do not compare values across the two columns. If true gust/storm event counts are needed, NOAA's Storm Events Database is the appropriate next source.\n" ] + }, + { + "cell_type": "markdown", + "id": "18", + "metadata": {}, + "source": [ + "## Tables Created by This Notebook and Their Relationships\n", + "\n", + "This notebook creates and/or maintains one primary PostGIS table:\n", + "\n", + "1. 
`public.data_center_historical_climate`\n", + "- One row per data center (`master_id`).\n", + "- Stores climate summary metrics (temperature, humidity, wet-bulb, precipitation variability, cooling-degree-days, wind fields/status), geometry, and lineage timestamps.\n", + "- Upserted incrementally so reruns refresh changed rows without duplicating records.\n", + "\n", + "### Relationship Summary\n", + "\n", + "- `public.master_data_centers (master_id)`\n", + " - 1-to-1 (effective) -> `public.data_center_historical_climate (master_id)`\n", + "\n", + "`public.data_center_historical_climate.master_id` is a foreign key to `public.master_data_centers.master_id` (with cascade delete), so climate rows track the master data-center record set.\n", + "\n", + "In short: **`master_data_centers` is the entity table, and `data_center_historical_climate` is its one-row-per-site climate feature extension.**" + ] } ], "metadata": { diff --git a/rdh_precinct_vote_data_centers.ipynb b/rdh_precinct_vote_data_centers.ipynb index bd60495..dd789fb 100644 --- a/rdh_precinct_vote_data_centers.ipynb +++ b/rdh_precinct_vote_data_centers.ipynb @@ -145,7 +145,7 @@ "source": [ "## Parameters\n", "\n", - "The defaults run a small real pilot for Virginia 2020, because Virginia has many data centers in the master table and a statewide precinct layer should produce visible matches. After the pilot works, broaden `TARGET_STATES` and `FILTER_YEARS_ANY`. Use `TARGET_STATES = None` to infer all states from `public.master_data_centers`.\n" + "The defaults now target both 2020 and 2024 precinct election layers across all inferred data-center states. Set `TARGET_STATES` to a small list like `['VA']` for a quick pilot run, or keep `TARGET_STATES = None` to infer all states from `public.master_data_centers`. Use `FILTER_YEARS_ANY = []` to keep all years returned by RDH." ] }, { @@ -166,7 +166,7 @@ "TARGET_STATES = None # None = infer all states from master_data_centers; or list e.g. 
['VA','TX']\n", "FILTER_TERMS_ALL = ['election results', 'precinct']\n", "FILTER_TERMS_ANY = [] # e.g. ['general', 'president']\n", - "FILTER_YEARS_ANY = ['2020'] # pilot first; empty keeps all years returned by RDH\n", + "FILTER_YEARS_ANY = ['2020', '2024'] # set [] to keep all years returned by RDH\n", "PREFERRED_FORMATS = ['SHP'] # point-in-precinct joins need spatial files\n", "\n", "DOWNLOAD_FILES = True\n", @@ -387,6 +387,11 @@ " return re.sub(r'[^A-Za-z0-9._-]+', '_', name)\n", "\n", "\n", + "def detect_year(text):\n", + " match = re.search(r'\\b(20\\d{2})\\b', str(text))\n", + " return match.group(1) if match else None\n", + "\n", + "\n", "work = listing.copy()\n", "for required in ['Title', 'Format', 'URL']:\n", " if required not in work.columns:\n", @@ -405,8 +410,20 @@ "].copy()\n", "\n", "filtered = filtered.sort_values(['query_state_code', 'Title', 'Format', 'filename']).reset_index(drop=True)\n", + "filtered['detected_year'] = filtered['Title'].map(detect_year)\n", + "\n", "print(f'Filtered candidate files: {len(filtered):,}')\n", - "display(filtered[['query_state_code', 'Title', 'Format', 'datasetid', 'filename', 'URL']].head(100))\n" + "year_summary = (\n", + " filtered.assign(detected_year=filtered['detected_year'].fillna('unknown'))\n", + " .groupby('detected_year', dropna=False)\n", + " .size()\n", + " .reset_index(name='rows')\n", + " .sort_values('detected_year')\n", + ")\n", + "print('Candidate rows by detected year:')\n", + "display(year_summary)\n", + "\n", + "display(filtered[['query_state_code', 'detected_year', 'Title', 'Format', 'datasetid', 'filename', 'URL']].head(100))" ] }, { @@ -1169,11 +1186,11 @@ "id": "28", "metadata": {}, "source": [ - "## Next Refinement: Tidy Vote Columns\n", + "## Standardized Vote Fields\n", "\n", - "The RDH staging table intentionally stores each precinct row's original attributes in `properties jsonb`. 
Once the downloaded layers are visible, inspect `precinct_properties` above to identify vote-column patterns for the states/years you care about.\n", + "The cell below extracts a standardized set of election attributes from `precinct_properties` using heuristic key matching across RDH file families.\n", "\n", - "Useful follow-up views can then extract fields like:\n", + "Extracted fields:\n", "- precinct identifier/name\n", "- election year\n", "- office\n", @@ -1182,7 +1199,521 @@ "- total votes\n", "- turnout or vote share\n", "\n", - "That extraction is best added after confirming the specific RDH file families selected by the filters.\n" + "Because RDH schemas vary by state and source, this step is intentionally tolerant and computes fallback vote-share values when direct turnout/share fields are not present." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29", + "metadata": {}, + "outputs": [], + "source": [ + "STANDARDIZED_LIMIT = None # set an int (e.g., 2000) for faster sampling\n", + "\n", + "limit_clause = '' if STANDARDIZED_LIMIT is None else 'limit %s'\n", + "standardized_sql = f'''\n", + "select\n", + " m.master_id,\n", + " dc.name,\n", + " dc.city,\n", + " dc.state,\n", + " l.title as rdh_layer_title,\n", + " f.properties as precinct_properties\n", + "from {MATCH_TABLE} m\n", + "join {MASTER_TABLE} dc on dc.master_id = m.master_id\n", + "join {FEATURE_TABLE} f on f.feature_id = m.feature_id\n", + "join {LAYER_TABLE} l on l.layer_id = m.layer_id\n", + "order by dc.state, dc.city, dc.name\n", + "{limit_clause}\n", + "'''\n", + "\n", + "with get_conn() as conn:\n", + " if STANDARDIZED_LIMIT is None:\n", + " raw_standardized = pd.read_sql_query(standardized_sql, conn)\n", + " else:\n", + " raw_standardized = pd.read_sql_query(standardized_sql, conn, params=[STANDARDIZED_LIMIT])\n", + "\n", + "\n", + "def parse_props(value):\n", + " if isinstance(value, dict):\n", + " return value\n", + " if pd.isna(value):\n", + " return {}\n", + 
" text = str(value).strip()\n", + " if not text:\n", + " return {}\n", + " try:\n", + " obj = json.loads(text)\n", + " return obj if isinstance(obj, dict) else {}\n", + " except Exception:\n", + " return {}\n", + "\n", + "\n", + "def norm_key(k):\n", + " return re.sub(r'[^a-z0-9]+', '_', str(k).strip().lower()).strip('_')\n", + "\n", + "\n", + "def as_number(v):\n", + " if v is None:\n", + " return None\n", + " if isinstance(v, (int, float, np.integer, np.floating)):\n", + " if pd.isna(v):\n", + " return None\n", + " return float(v)\n", + " text = str(v).strip().replace(',', '')\n", + " if text == '':\n", + " return None\n", + " if re.fullmatch(r'-?\\d+(\\.\\d+)?', text):\n", + " return float(text)\n", + " return None\n", + "\n", + "\n", + "def parse_year_from_title(title):\n", + " m = re.search(r'\\b((?:19|20)\\d{2})\\b', str(title))\n", + " return int(m.group(1)) if m else None\n", + "\n", + "\n", + "def infer_year_from_keys(props_norm):\n", + " key_patterns = [\n", + " re.compile(r'^[pg](\\d{2})(pre|uss|con|gov|ag|sos|ltg|tre|aud).*'),\n", + " re.compile(r'^[pg](\\d{2}).*'),\n", + " ]\n", + " for key in props_norm.keys():\n", + " nk = norm_key(key)\n", + " for pat in key_patterns:\n", + " m = pat.match(nk)\n", + " if m:\n", + " yy = int(m.group(1))\n", + " return 2000 + yy if yy < 60 else 1900 + yy\n", + " return None\n", + "\n", + "\n", + "def decode_rdh_vote_key(key):\n", + " k = norm_key(key)\n", + "\n", + " m = re.match(r'^[pg](\\d{2})pre([a-z]).*', k)\n", + " if m:\n", + " party_code = m.group(2)\n", + " return ('President', party_code)\n", + "\n", + " m = re.match(r'^[pg](\\d{2})uss([a-z]).*', k)\n", + " if m:\n", + " party_code = m.group(2)\n", + " return ('U.S. Senate', party_code)\n", + "\n", + " m = re.match(r'^[pg](\\d{2})con(\\d{2})([a-z]).*', k)\n", + " if m:\n", + " district = m.group(2)\n", + " party_code = m.group(3)\n", + " return (f'U.S. 
House District {district}', party_code)\n", + "\n", + " return (None, None)\n", + "\n", + "\n", + "def party_from_key(key):\n", + " k = norm_key(key)\n", + " office, party_code = decode_rdh_vote_key(k)\n", + " if party_code == 'd':\n", + " return office, 'D'\n", + " if party_code == 'r':\n", + " return office, 'R'\n", + "\n", + " if any(t in k for t in ['biden', 'dem', 'democrat']):\n", + " return office, 'D'\n", + " if any(t in k for t in ['trump', 'gop', 'rep', 'republican']):\n", + " return office, 'R'\n", + "\n", + " return office, None\n", + "\n", + "\n", + "def detect_office(title, props_norm, vote_office_totals):\n", + " title_lower = str(title).lower()\n", + " if 'president' in title_lower or 'presidential' in title_lower:\n", + " return 'President'\n", + " if 'senate' in title_lower:\n", + " return 'U.S. Senate'\n", + " if 'house' in title_lower or 'congress' in title_lower:\n", + " return 'U.S. House'\n", + " if 'governor' in title_lower:\n", + " return 'Governor'\n", + "\n", + " if vote_office_totals:\n", + " return max(vote_office_totals.items(), key=lambda x: x[1])[0]\n", + "\n", + " office_key_hits = [k for k in props_norm if any(x in k for x in ['office', 'contest', 'race'])]\n", + " if office_key_hits:\n", + " best = office_key_hits[0]\n", + " val = props_norm.get(best)\n", + " if isinstance(val, str) and val.strip():\n", + " return val.strip()\n", + " return None\n", + "\n", + "\n", + "def best_precinct_identifier(props_norm):\n", + " preferred_keys = [\n", + " 'precinct', 'precinct_name', 'precinctid', 'precinct_id', 'precinct20',\n", + " 'pctname', 'pct', 'vtd', 'vtdst', 'vtdst20', 'name20',\n", + " 'district', 'district_name', 'ward', 'geoid', 'geoid20', 'unique_id',\n", + " ]\n", + " for key in preferred_keys:\n", + " if key in props_norm and str(props_norm[key]).strip():\n", + " return str(props_norm[key]).strip()\n", + "\n", + " fallback_candidates = [\n", + " (k, v) for k, v in props_norm.items()\n", + " if any(t in k for t in ['precinct', 
'vtd', 'ward', 'district', 'geo', 'name']) and str(v).strip()\n", + " ]\n", + " if fallback_candidates:\n", + " return str(fallback_candidates[0][1]).strip()\n", + " return None\n", + "\n", + "\n", + "def extract_vote_fields(row):\n", + " props = parse_props(row['precinct_properties'])\n", + " props_norm = {norm_key(k): v for k, v in props.items()}\n", + "\n", + " precinct_id_or_name = best_precinct_identifier(props_norm)\n", + " election_year = parse_year_from_title(row['rdh_layer_title'])\n", + " if election_year is None:\n", + " election_year = infer_year_from_keys(props_norm)\n", + "\n", + " year_keys = [k for k in props_norm if 'year' in k]\n", + " if election_year is None and year_keys:\n", + " for k in year_keys:\n", + " y = as_number(props_norm[k])\n", + " if y and 1900 <= y <= 2100:\n", + " election_year = int(y)\n", + " break\n", + "\n", + " numeric_items = [(k, as_number(v)) for k, v in props_norm.items()]\n", + " numeric_items = [(k, v) for k, v in numeric_items if v is not None]\n", + "\n", + " dem_votes = None\n", + " rep_votes = None\n", + " vote_office_totals = {}\n", + "\n", + " for key, value in numeric_items:\n", + " office_guess, party_guess = party_from_key(key)\n", + " if party_guess == 'D':\n", + " dem_votes = value if dem_votes is None else max(dem_votes, value)\n", + " elif party_guess == 'R':\n", + " rep_votes = value if rep_votes is None else max(rep_votes, value)\n", + "\n", + " if office_guess is not None and party_guess in {'D', 'R'}:\n", + " vote_office_totals[office_guess] = vote_office_totals.get(office_guess, 0.0) + float(value)\n", + "\n", + " total_candidates = [\n", + " v for k, v in numeric_items\n", + " if (\n", + " ('total' in k and 'vote' in k)\n", + " or ('tot' in k and 'vote' in k)\n", + " or k in {'votes_total', 'total_votes', 'vote_total'}\n", + " )\n", + " ]\n", + " total_votes = max(total_candidates) if total_candidates else None\n", + " if total_votes is None and dem_votes is not None and rep_votes is not None:\n", + 
" total_votes = dem_votes + rep_votes\n", + "\n", + " turnout_candidates = [\n", + " v for k, v in numeric_items\n", + " if any(x in k for x in ['turnout', 'turnout_pct', 'turnout_rate', 'vote_share', 'share', 'pct']) and k not in {'pct', 'pctname'}\n", + " ]\n", + " turnout_or_vote_share = turnout_candidates[0] if turnout_candidates else None\n", + "\n", + " if turnout_or_vote_share is None:\n", + " reg_voters = props_norm.get('reg_voters')\n", + " reg_voters_num = as_number(reg_voters)\n", + " if reg_voters_num and total_votes:\n", + " turnout_or_vote_share = total_votes / reg_voters_num\n", + " elif dem_votes is not None and rep_votes is not None and (dem_votes + rep_votes) > 0:\n", + " turnout_or_vote_share = dem_votes / (dem_votes + rep_votes)\n", + "\n", + " office = detect_office(row['rdh_layer_title'], props_norm, vote_office_totals)\n", + "\n", + " return pd.Series({\n", + " 'precinct_identifier_name': precinct_id_or_name,\n", + " 'election_year': election_year,\n", + " 'office': office,\n", + " 'democratic_votes': dem_votes,\n", + " 'republican_votes': rep_votes,\n", + " 'total_votes': total_votes,\n", + " 'turnout_or_vote_share': turnout_or_vote_share,\n", + " })\n", + "\n", + "\n", + "standardized_fields = raw_standardized.apply(extract_vote_fields, axis=1)\n", + "standardized_preview = pd.concat(\n", + " [\n", + " raw_standardized[['master_id', 'name', 'city', 'state', 'rdh_layer_title']],\n", + " standardized_fields,\n", + " ],\n", + " axis=1,\n", + ")\n", + "\n", + "standardized_summary = pd.DataFrame({\n", + " 'field': [\n", + " 'precinct_identifier_name', 'election_year', 'office',\n", + " 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share',\n", + " ]\n", + "})\n", + "standardized_summary['non_null_rows'] = standardized_summary['field'].map(\n", + " lambda c: int(standardized_preview[c].notna().sum())\n", + ")\n", + "\n", + "print(f'Standardized preview rows: {len(standardized_preview):,}')\n", + "display(standardized_summary)\n", +
"display(standardized_preview.head(50))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30", + "metadata": {}, + "outputs": [], + "source": [ + "ELECTION_CONTEXT_TABLE = 'public.data_center_election_context'\n", + "\n", + "required_cols = [\n", + " 'master_id', 'rdh_layer_title',\n", + " 'precinct_identifier_name', 'election_year', 'office',\n", + " 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share',\n", + "]\n", + "\n", + "missing_cols = [c for c in required_cols if c not in standardized_preview.columns]\n", + "if missing_cols:\n", + " raise RuntimeError(\n", + " 'standardized_preview is missing required columns: '\n", + " + ', '.join(missing_cols)\n", + " + '. Run the standardized extraction cell first.'\n", + " )\n", + "\n", + "persist_best = standardized_preview[required_cols].copy()\n", + "persist_best['non_null_score'] = persist_best[\n", + " ['precinct_identifier_name', 'election_year', 'office', 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share']\n", + "].notna().sum(axis=1)\n", + "\n", + "persist_best = persist_best.sort_values(\n", + " ['master_id', 'non_null_score', 'total_votes'],\n", + " ascending=[True, False, False],\n", + " na_position='last'\n", + ")\n", + "persist_best = persist_best.drop_duplicates(subset=['master_id'], keep='first').copy()\n", + "\n", + "with get_conn() as conn:\n", + " master_base = pd.read_sql_query(\n", + " f'''\n", + " select master_id, name, city, upper(state) as state\n", + " from {MASTER_TABLE}\n", + " ''',\n", + " conn,\n", + " )\n", + "\n", + "persist_df = master_base.merge(\n", + " persist_best.drop(columns=['non_null_score']),\n", + " on='master_id',\n", + " how='left',\n", + ")\n", + "\n", + "create_sql = f'''\n", + "create table if not exists {ELECTION_CONTEXT_TABLE} (\n", + " master_id text primary key references public.master_data_centers(master_id) on delete cascade,\n", + " name text,\n", + " city text,\n", + " state text,\n", + " 
rdh_layer_title text,\n", + " precinct_identifier_name text,\n", + " election_year integer,\n", + " office text,\n", + " democratic_votes double precision,\n", + " republican_votes double precision,\n", + " total_votes double precision,\n", + " turnout_or_vote_share double precision,\n", + " updated_at timestamptz not null default now()\n", + ");\n", + "create index if not exists data_center_election_context_state_idx\n", + " on {ELECTION_CONTEXT_TABLE} (state);\n", + "create index if not exists data_center_election_context_year_idx\n", + " on {ELECTION_CONTEXT_TABLE} (election_year);\n", + "'''\n", + "\n", + "upsert_sql = f'''\n", + "insert into {ELECTION_CONTEXT_TABLE} (\n", + " master_id, name, city, state, rdh_layer_title,\n", + " precinct_identifier_name, election_year, office,\n", + " democratic_votes, republican_votes, total_votes, turnout_or_vote_share,\n", + " updated_at\n", + ")\n", + "values %s\n", + "on conflict (master_id) do update set\n", + " name = excluded.name,\n", + " city = excluded.city,\n", + " state = excluded.state,\n", + " rdh_layer_title = excluded.rdh_layer_title,\n", + " precinct_identifier_name = excluded.precinct_identifier_name,\n", + " election_year = excluded.election_year,\n", + " office = excluded.office,\n", + " democratic_votes = excluded.democratic_votes,\n", + " republican_votes = excluded.republican_votes,\n", + " total_votes = excluded.total_votes,\n", + " turnout_or_vote_share = excluded.turnout_or_vote_share,\n", + " updated_at = now()\n", + "'''\n", + "\n", + "rows = []\n", + "for rec in persist_df.to_dict('records'):\n", + " rows.append((\n", + " rec['master_id'],\n", + " rec['name'],\n", + " rec['city'],\n", + " rec['state'],\n", + " rec.get('rdh_layer_title'),\n", + " rec.get('precinct_identifier_name'),\n", + " int(rec['election_year']) if pd.notna(rec.get('election_year')) else None,\n", + " rec.get('office'),\n", + " float(rec['democratic_votes']) if pd.notna(rec.get('democratic_votes')) else None,\n", + " 
float(rec['republican_votes']) if pd.notna(rec.get('republican_votes')) else None,\n", + " float(rec['total_votes']) if pd.notna(rec.get('total_votes')) else None,\n", + " float(rec['turnout_or_vote_share']) if pd.notna(rec.get('turnout_or_vote_share')) else None,\n", + " ))\n", + "\n", + "with get_conn() as conn:\n", + " with conn.cursor() as cur:\n", + " cur.execute(create_sql)\n", + " if rows:\n", + " execute_values(\n", + " cur,\n", + " upsert_sql,\n", + " rows,\n", + " template='(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, now())',\n", + " page_size=1000,\n", + " )\n", + " cur.execute(f'select count(*) from {ELECTION_CONTEXT_TABLE}')\n", + " table_rows = cur.fetchone()[0]\n", + " cur.execute(\n", + " f'''\n", + " select\n", + " state,\n", + " count(*) as rows,\n", + " count(*) filter (\n", + " where election_year is not null\n", + " or office is not null\n", + " or democratic_votes is not null\n", + " or republican_votes is not null\n", + " or total_votes is not null\n", + " or turnout_or_vote_share is not null\n", + " ) as rows_with_election\n", + " from {ELECTION_CONTEXT_TABLE}\n", + " group by state\n", + " order by rows desc, state\n", + " limit 15\n", + " '''\n", + " )\n", + " state_counts = cur.fetchall()\n", + "\n", + "rows_with_election = int(\n", + " persist_df[\n", + " ['election_year', 'office', 'democratic_votes', 'republican_votes', 'total_votes', 'turnout_or_vote_share']\n", + " ].notna().any(axis=1).sum()\n", + ")\n", + "print(f'Rows prepared for upsert: {len(rows):,}')\n", + "print(f'Rows with election context: {rows_with_election:,}')\n", + "print(f'Rows currently in {ELECTION_CONTEXT_TABLE}: {table_rows:,}')\n", + "display(pd.DataFrame(state_counts, columns=['state', 'rows', 'rows_with_election']))" + ] + }, + { + "cell_type": "markdown", + "id": "31", + "metadata": {}, + "source": [ + "## Persist Standardized Election Context\n", + "\n", + "Writes one standardized election-context row per `master_id` into 
`public.data_center_election_context` for reuse in map and reporting workflows." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "32", + "metadata": {}, + "outputs": [], + "source": [ + "# Targeted state coverage check\n", + "states = ['VA', 'WA', 'WI', 'WV', 'WY', 'DC', 'PR']\n", + "\n", + "with get_conn() as conn:\n", + " check_df = pd.read_sql_query(\n", + " f'''\n", + " select\n", + " state,\n", + " count(*) as rows,\n", + " count(*) filter (\n", + " where election_year is not null\n", + " or office is not null\n", + " or democratic_votes is not null\n", + " or republican_votes is not null\n", + " or total_votes is not null\n", + " or turnout_or_vote_share is not null\n", + " ) as rows_with_election\n", + " from {ELECTION_CONTEXT_TABLE}\n", + " where state = any(%s)\n", + " group by state\n", + " order by state\n", + " ''',\n", + " conn,\n", + " params=[states],\n", + " )\n", + "\n", + "display(check_df)" + ] + }, + { + "cell_type": "markdown", + "id": "33", + "metadata": {}, + "source": [ + "## Tables Created by This Notebook and Their Relationships\n", + "\n", + "This notebook creates and/or maintains the following PostGIS/PostgreSQL tables:\n", + "\n", + "1. `public.rdh_precinct_vote_layers`\n", + "- One row per RDH precinct-election layer ingested.\n", + "- Key columns: `layer_id` (PK), `state_code`, `title`, `format`, file/source metadata, `loaded_at`.\n", + "\n", + "2. `public.rdh_precinct_vote_features`\n", + "- One row per precinct polygon feature from a loaded layer.\n", + "- Key columns: `feature_id` (PK), `layer_id` (FK), `state_code`, `source_row`, `properties` (JSONB), `geom` (MultiPolygon).\n", + "- Relationship: many features belong to one layer.\n", + "\n", + "3. 
`public.data_center_rdh_precinct_vote_matches`\n", + "- Spatial match table linking data centers to precinct features.\n", + "- Key columns: `master_id` (FK), `feature_id` (FK), `layer_id` (FK), `state_code`, `join_method`, `match_distance_m`, `matched_at`.\n", + "- Primary key: (`master_id`, `feature_id`).\n", + "- Relationship: many-to-many bridge between data centers and precinct features (with match metadata).\n", + "\n", + "4. `public.data_center_election_context`\n", + "- Final standardized, one-row-per-data-center election context used by downstream mapping/analysis.\n", + "- Key columns: `master_id` (PK, FK), `name`, `city`, `state`, `rdh_layer_title`,\n", + " `precinct_identifier_name`, `election_year`, `office`, `democratic_votes`, `republican_votes`,\n", + " `total_votes`, `turnout_or_vote_share`, `updated_at`.\n", + "- Relationship: one row per `master_id` in `public.master_data_centers` (left-joined so all master rows can be retained, even if election fields are null).\n", + "\n", + "### Relationship Summary\n", + "\n", + "- `public.master_data_centers (master_id)`\n", + " - 1-to-many -> `public.data_center_rdh_precinct_vote_matches (master_id)`\n", + " - 1-to-1 (effective in this notebook) -> `public.data_center_election_context (master_id)`\n", + "\n", + "- `public.rdh_precinct_vote_layers (layer_id)`\n", + " - 1-to-many -> `public.rdh_precinct_vote_features (layer_id)`\n", + " - 1-to-many -> `public.data_center_rdh_precinct_vote_matches (layer_id)`\n", + "\n", + "- `public.rdh_precinct_vote_features (feature_id)`\n", + " - 1-to-many -> `public.data_center_rdh_precinct_vote_matches (feature_id)`\n", + "\n", + "In short: **layers -> features -> matches**, then matches are standardized into **one election-context row per data center**." ] } ],