1241 lines
49 KiB
Plaintext
1241 lines
49 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0",
|
|
"metadata": {},
|
|
"source": [
|
|
"# HMS Wildfire Smoke Exposure for Master Data Centers\n",
|
|
"\n",
|
|
"Builds four PostGIS tables from NOAA HMS smoke polygon products:\n",
|
|
"\n",
|
|
"1. **`public.hms_smoke_days`** — one row per observed HMS smoke product day, including days with zero smoke polygons. This is the denominator for percentages.\n",
|
|
"2. **`public.hms_smoke_daily`** — raw daily smoke polygons with HMS density (`Light`, `Medium`, `Heavy`).\n",
|
|
"3. **`public.data_center_hms_smoke_dc_day`** — per-DC daily exposure keyed by `(master_id, smoke_date)`, storing the maximum smoke density covering the DC point, or `0` for observed no smoke.\n",
|
|
"4. **`public.data_center_hms_smoke_exposure`** — per-DC summary keyed by `master_id`, joinable to `master_data_centers`.\n",
|
|
"\n",
|
|
"**HMS density scale used here:**\n",
|
|
"\n",
|
|
"| Rank | Density |\n",
|
|
"|---:|---|\n",
|
|
"| 0 | No smoke polygon contained the DC on an observed HMS day |\n",
|
|
"| 1 | Light or unspecified smoke density |\n",
|
|
"| 2 | Medium |\n",
|
|
"| 3 | Heavy |\n",
|
|
"\n",
|
|
"Historical backfill uses NOAA HMS annual shapefile bundles for completed years and daily shapefile ZIPs for the current year. The ArcGIS GeoJSON endpoint is optional and is used as a current-day refresh when it has features.\n",
|
|
"\n",
|
|
"Sources:\n",
|
|
"- Annual bundles: `https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/Annual_Bundles/`\n",
|
|
"- Daily ZIPs: `https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/Smoke_Polygons/Shapefile/YYYY/MM/hms_smokeYYYYMMDD.zip`\n",
|
|
"- Current ArcGIS smoke polygons: `NOAA_Satellite_Smoke_Detection_(v1)`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "1",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Imports"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "2",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import io\n",
|
|
"import os\n",
|
|
"import re\n",
|
|
"import time\n",
|
|
"import tempfile\n",
|
|
"import zipfile\n",
|
|
"from datetime import date, datetime, timedelta, timezone\n",
|
|
"from html.parser import HTMLParser\n",
|
|
"from pathlib import Path\n",
|
|
"from urllib.parse import urljoin\n",
|
|
"\n",
|
|
"import geopandas as gpd\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"import psycopg2\n",
|
|
"import requests\n",
|
|
"from psycopg2 import sql\n",
|
|
"from psycopg2.extras import execute_values\n",
|
|
"from shapely import wkb as shapely_wkb\n",
|
|
"from shapely.geometry import MultiPolygon\n",
|
|
"\n",
|
|
"try:\n",
|
|
" from shapely.validation import make_valid\n",
|
|
"except ImportError:\n",
|
|
" make_valid = None\n",
|
|
"\n",
|
|
"pd.set_option('display.max_columns', 120)\n",
|
|
"pd.set_option('display.max_rows', 120)\n",
|
|
"\n",
|
|
"print('geopandas:', gpd.__version__)\n",
|
|
"print('pandas: ', pd.__version__)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Database Connection"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "4",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def load_env_file(env_path: str = '.env') -> None:\n",
|
|
" p = Path(env_path)\n",
|
|
" if not p.exists():\n",
|
|
" print(f'No {env_path} file found in {Path.cwd()}')\n",
|
|
" return\n",
|
|
" loaded = 0\n",
|
|
" for raw_line in p.read_text(encoding='utf-8').splitlines():\n",
|
|
" line = raw_line.strip()\n",
|
|
" if not line or line.startswith('#') or '=' not in line:\n",
|
|
" continue\n",
|
|
" key, value = line.split('=', 1)\n",
|
|
" key = key.strip()\n",
|
|
" value = value.strip().strip('\"').strip(\"'\")\n",
|
|
" if key and key not in os.environ:\n",
|
|
" os.environ[key] = value\n",
|
|
" loaded += 1\n",
|
|
" print(f'Loaded {loaded} env var(s) from {env_path}')\n",
|
|
"\n",
|
|
"\n",
|
|
"def require_env(keys):\n",
|
|
" missing = [k for k in keys if not os.getenv(k)]\n",
|
|
" if missing:\n",
|
|
" raise EnvironmentError('Missing required env vars: ' + ', '.join(missing))\n",
|
|
"\n",
|
|
"\n",
|
|
"load_env_file('.env')\n",
|
|
"require_env(['PGWEB_HOST', 'PGWEB_PORT', 'PGWEB_USER', 'PGWEB_PASSWORD'])\n",
|
|
"\n",
|
|
"MASTER_TABLE = 'public.master_data_centers'\n",
|
|
"DAYS_TABLE = 'public.hms_smoke_days'\n",
|
|
"SMOKE_TABLE = 'public.hms_smoke_daily'\n",
|
|
"DC_DAY_TABLE = 'public.data_center_hms_smoke_dc_day'\n",
|
|
"EXPOSURE_TABLE = 'public.data_center_hms_smoke_exposure'\n",
|
|
"\n",
|
|
"\n",
|
|
"def get_conn():\n",
|
|
" return psycopg2.connect(\n",
|
|
" host=os.environ['PGWEB_HOST'],\n",
|
|
" port=os.environ['PGWEB_PORT'],\n",
|
|
" user=os.environ['PGWEB_USER'],\n",
|
|
" password=os.environ['PGWEB_PASSWORD'],\n",
|
|
" dbname='data_centers',\n",
|
|
" )\n",
|
|
"\n",
|
|
"\n",
|
|
"with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute('select current_database(), current_user')\n",
|
|
" print('Connected:', cur.fetchone())\n",
|
|
" cur.execute('create extension if not exists postgis')\n",
|
|
" cur.execute('select to_regclass(%s)', (MASTER_TABLE,))\n",
|
|
" if cur.fetchone()[0] is None:\n",
|
|
" raise RuntimeError(f'{MASTER_TABLE} missing')\n",
|
|
" cur.execute(sql.SQL('select count(*) from {}').format(sql.SQL(MASTER_TABLE)))\n",
|
|
" print(f'{MASTER_TABLE}: {cur.fetchone()[0]:,} rows')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Parameters"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"HMS_SHAPEFILE_BASE_URL = (\n",
|
|
" 'https://satepsanone.nesdis.noaa.gov/pub/FIRE/web/HMS/'\n",
|
|
" 'Smoke_Polygons/Shapefile/'\n",
|
|
")\n",
|
|
"ANNUAL_BUNDLE_BASE_URL = urljoin(HMS_SHAPEFILE_BASE_URL, 'Annual_Bundles/')\n",
|
|
"\n",
|
|
"ARCGIS_SMOKE_URL = (\n",
|
|
" 'https://services2.arcgis.com/C8EMgrsFcRFL6LrL/arcgis/rest/services/'\n",
|
|
" 'NOAA_Satellite_Smoke_Detection_%28v1%29/FeatureServer/0/query'\n",
|
|
" '?where=1%3D1&outFields=*&f=geojson'\n",
|
|
")\n",
|
|
"\n",
|
|
"CACHE_DIR = Path('data/hms_smoke')\n",
|
|
"ANNUAL_CACHE_DIR = CACHE_DIR / 'annual'\n",
|
|
"DAILY_CACHE_DIR = CACHE_DIR / 'daily'\n",
|
|
"\n",
|
|
"# HMS smoke polygon archive starts at 2005 in the NOAA directory.\n",
|
|
"START_YEAR = 2005\n",
|
|
"END_DATE = date.today()\n",
|
|
"\n",
|
|
"# Completed years are fastest from annual bundles. Current year is loaded\n",
|
|
"# from daily ZIPs so empty/no-smoke days are retained and today's partial\n",
|
|
"# product can be picked up.\n",
|
|
"LOAD_ANNUAL_BUNDLES = True\n",
|
|
"LOAD_CURRENT_YEAR_DAILY_ZIPS = True\n",
|
|
"LOAD_CURRENT_ARCGIS = True\n",
|
|
"\n",
|
|
"# If True, drop and rebuild all HMS smoke tables.\n",
|
|
"# If False, skip days already present in hms_smoke_days.\n",
|
|
"RELOAD_SMOKE = False\n",
|
|
"\n",
|
|
"# If True, rebuild the DC-day and summary exposure tables after loading source data.\n",
|
|
"RECOMPUTE_SUMMARY = True\n",
|
|
"\n",
|
|
"REQUEST_TIMEOUT = 90\n",
|
|
"USER_AGENT = 'data-center-hms-smoke-loader/1.0'\n",
|
|
"\n",
|
|
"DENSITY_RANK = {\n",
|
|
" 'na': 1,\n",
|
|
" 'n/a': 1,\n",
|
|
" 'unknown': 1,\n",
|
|
" 'unspecified': 1,\n",
|
|
" 'light': 1,\n",
|
|
" 'medium': 2,\n",
|
|
" 'heavy': 3,\n",
|
|
"}\n",
|
|
"RANK_DENSITY = {0: 'None', 1: 'Light/Unspecified', 2: 'Medium', 3: 'Heavy'}\n",
|
|
"\n",
|
|
"print(f'HMS load window: {START_YEAR}-01-01 through {END_DATE}')\n",
|
|
"print(f'Annual bundles through: {END_DATE.year - 1}')\n",
|
|
"print(f'Current-year daily ZIPs: {LOAD_CURRENT_YEAR_DAILY_ZIPS}')\n",
|
|
"print(f'ArcGIS current refresh: {LOAD_CURRENT_ARCGIS}')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Create Tables"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "8",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"CREATE_DAYS_SQL = f'''\n",
|
|
"create table if not exists {DAYS_TABLE} (\n",
|
|
" smoke_date date primary key,\n",
|
|
" source text not null,\n",
|
|
" source_file text,\n",
|
|
" source_url text,\n",
|
|
" feature_count integer not null default 0,\n",
|
|
" fetched_at timestamptz not null default now(),\n",
|
|
" updated_at timestamptz not null default now()\n",
|
|
");\n",
|
|
"'''\n",
|
|
"\n",
|
|
"CREATE_SMOKE_SQL = f'''\n",
|
|
"create table if not exists {SMOKE_TABLE} (\n",
|
|
" id bigserial primary key,\n",
|
|
" smoke_date date not null references {DAYS_TABLE}(smoke_date) on delete cascade,\n",
|
|
" satellite text,\n",
|
|
" start_raw text,\n",
|
|
" end_raw text,\n",
|
|
" start_utc timestamptz,\n",
|
|
" end_utc timestamptz,\n",
|
|
" density text,\n",
|
|
" density_rank smallint not null check (density_rank between 1 and 3),\n",
|
|
" source text not null,\n",
|
|
" source_file text,\n",
|
|
" source_url text,\n",
|
|
" geom geometry(MultiPolygon, 4326) not null,\n",
|
|
" fetched_at timestamptz not null default now()\n",
|
|
");\n",
|
|
"\n",
|
|
"create index if not exists hms_smoke_daily_geom_gix\n",
|
|
" on {SMOKE_TABLE} using gist (geom);\n",
|
|
"create index if not exists hms_smoke_daily_date_idx\n",
|
|
" on {SMOKE_TABLE} (smoke_date);\n",
|
|
"create index if not exists hms_smoke_daily_density_idx\n",
|
|
" on {SMOKE_TABLE} (density_rank);\n",
|
|
"'''\n",
|
|
"\n",
|
|
"CREATE_DC_DAY_SQL = f'''\n",
|
|
"create table if not exists {DC_DAY_TABLE} (\n",
|
|
" master_id text not null references public.master_data_centers(master_id) on delete cascade,\n",
|
|
" smoke_date date not null references {DAYS_TABLE}(smoke_date) on delete cascade,\n",
|
|
" max_density_rank smallint not null check (max_density_rank between 0 and 3),\n",
|
|
" polygon_hits integer not null default 0,\n",
|
|
" primary key (master_id, smoke_date)\n",
|
|
");\n",
|
|
"\n",
|
|
"create index if not exists data_center_hms_smoke_dc_day_date_idx\n",
|
|
" on {DC_DAY_TABLE} (smoke_date);\n",
|
|
"create index if not exists data_center_hms_smoke_dc_day_density_idx\n",
|
|
" on {DC_DAY_TABLE} (max_density_rank);\n",
|
|
"'''\n",
|
|
"\n",
|
|
"CREATE_EXPOSURE_SQL = f'''\n",
|
|
"create table if not exists {EXPOSURE_TABLE} (\n",
|
|
" master_id text primary key references public.master_data_centers(master_id) on delete cascade,\n",
|
|
" source text, name text, operator text, city text, state text, country text,\n",
|
|
" longitude double precision, latitude double precision,\n",
|
|
" geom geometry(Point, 4326),\n",
|
|
"\n",
|
|
" hms_status text not null, -- 'observed', 'no_observed_days', or 'no_geom'\n",
|
|
" smoke_period_start date,\n",
|
|
" smoke_period_end date,\n",
|
|
" days_observed integer not null default 0,\n",
|
|
"\n",
|
|
" days_with_any_smoke integer not null default 0,\n",
|
|
" days_with_light_or_worse integer not null default 0,\n",
|
|
" days_with_medium_or_worse integer not null default 0,\n",
|
|
" days_with_heavy_smoke integer not null default 0,\n",
|
|
"\n",
|
|
" pct_days_with_any_smoke double precision,\n",
|
|
" pct_days_with_light_or_worse double precision,\n",
|
|
" pct_days_with_medium_or_worse double precision,\n",
|
|
" pct_days_with_heavy_smoke double precision,\n",
|
|
"\n",
|
|
" worst_density_rank smallint,\n",
|
|
" worst_density text,\n",
|
|
" mean_density_rank double precision,\n",
|
|
"\n",
|
|
" longest_any_smoke_streak_days integer,\n",
|
|
" longest_medium_or_heavy_streak_days integer,\n",
|
|
" longest_heavy_smoke_streak_days integer,\n",
|
|
"\n",
|
|
" fetched_at timestamptz not null default now(),\n",
|
|
" updated_at timestamptz not null default now()\n",
|
|
");\n",
|
|
"\n",
|
|
"create index if not exists data_center_hms_smoke_exposure_state_idx\n",
|
|
" on {EXPOSURE_TABLE} (state);\n",
|
|
"create index if not exists data_center_hms_smoke_exposure_geom_gix\n",
|
|
" on {EXPOSURE_TABLE} using gist (geom);\n",
|
|
"create index if not exists data_center_hms_smoke_exposure_worst_idx\n",
|
|
" on {EXPOSURE_TABLE} (worst_density_rank);\n",
|
|
"'''\n",
|
|
"\n",
|
|
"with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" if RELOAD_SMOKE:\n",
|
|
" cur.execute(f'drop table if exists {EXPOSURE_TABLE} cascade')\n",
|
|
" cur.execute(f'drop table if exists {DC_DAY_TABLE} cascade')\n",
|
|
" cur.execute(f'drop table if exists {SMOKE_TABLE} cascade')\n",
|
|
" cur.execute(f'drop table if exists {DAYS_TABLE} cascade')\n",
|
|
"\n",
|
|
" cur.execute(CREATE_DAYS_SQL)\n",
|
|
" cur.execute(CREATE_SMOKE_SQL)\n",
|
|
" cur.execute(CREATE_DC_DAY_SQL)\n",
|
|
" cur.execute(CREATE_EXPOSURE_SQL)\n",
|
|
"\n",
|
|
" for t in (DAYS_TABLE, SMOKE_TABLE, DC_DAY_TABLE, EXPOSURE_TABLE):\n",
|
|
" cur.execute(sql.SQL('select count(*) from {}').format(sql.SQL(t)))\n",
|
|
" print(f'{t}: {cur.fetchone()[0]:,} rows')\n",
|
|
" conn.commit()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9",
|
|
"metadata": {},
|
|
"source": [
|
|
"## HMS Download and Normalization Helpers"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "10",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"class LinkParser(HTMLParser):\n",
|
|
" def __init__(self):\n",
|
|
" super().__init__()\n",
|
|
" self.links = []\n",
|
|
"\n",
|
|
" def handle_starttag(self, tag, attrs):\n",
|
|
" if tag.lower() != 'a':\n",
|
|
" return\n",
|
|
" attrs = dict(attrs)\n",
|
|
" href = attrs.get('href')\n",
|
|
" if href:\n",
|
|
" self.links.append(href)\n",
|
|
"\n",
|
|
"\n",
|
|
"DATE_RX = re.compile(r'hms_smoke(\\d{8})', re.IGNORECASE)\n",
|
|
"YEAR_RX = re.compile(r'hms_smoke(\\d{4})\\.zip$', re.IGNORECASE)\n",
|
|
"HMS_TIME_RX = re.compile(r'^\\s*(\\d{4})(\\d{3})\\s+(\\d{4})\\s*$')\n",
|
|
"\n",
|
|
"\n",
|
|
"def http_get(url: str, *, stream: bool = False, attempts: int = 4):\n",
|
|
" last_exc = None\n",
|
|
" for attempt in range(1, attempts + 1):\n",
|
|
" try:\n",
|
|
" r = requests.get(\n",
|
|
" url,\n",
|
|
" headers={'User-Agent': USER_AGENT},\n",
|
|
" timeout=REQUEST_TIMEOUT,\n",
|
|
" stream=stream,\n",
|
|
" )\n",
|
|
" r.raise_for_status()\n",
|
|
" return r\n",
|
|
" except (requests.ConnectionError, requests.Timeout, requests.HTTPError) as exc:\n",
|
|
" last_exc = exc\n",
|
|
" if attempt == attempts:\n",
|
|
" raise\n",
|
|
" wait_s = min(30, 2 ** attempt)\n",
|
|
" print(f' HTTP retry {attempt}/{attempts} for {url}: {exc}; sleeping {wait_s}s')\n",
|
|
" time.sleep(wait_s)\n",
|
|
" raise last_exc\n",
|
|
"\n",
|
|
"\n",
|
|
"def list_links(url: str) -> list[str]:\n",
|
|
" parser = LinkParser()\n",
|
|
" parser.feed(http_get(url).text)\n",
|
|
" return [urljoin(url, href) for href in parser.links if href not in ('../',)]\n",
|
|
"\n",
|
|
"\n",
|
|
"def download_if_needed(url: str, dest: Path) -> tuple[Path, bool]:\n",
|
|
" dest.parent.mkdir(parents=True, exist_ok=True)\n",
|
|
" if dest.exists() and dest.stat().st_size > 0:\n",
|
|
" return dest, False\n",
|
|
"\n",
|
|
" tmp = dest.with_suffix(dest.suffix + '.part')\n",
|
|
" if tmp.exists():\n",
|
|
" tmp.unlink()\n",
|
|
"\n",
|
|
" with http_get(url, stream=True) as r:\n",
|
|
" with tmp.open('wb') as f:\n",
|
|
" for chunk in r.iter_content(chunk_size=1024 * 1024):\n",
|
|
" if chunk:\n",
|
|
" f.write(chunk)\n",
|
|
" tmp.replace(dest)\n",
|
|
" return dest, True\n",
|
|
"\n",
|
|
"\n",
|
|
"def parse_smoke_date_from_name(name: str) -> date | None:\n",
|
|
" m = DATE_RX.search(name)\n",
|
|
" if not m:\n",
|
|
" return None\n",
|
|
" s = m.group(1)\n",
|
|
" return date(int(s[:4]), int(s[4:6]), int(s[6:8]))\n",
|
|
"\n",
|
|
"\n",
|
|
"def parse_hms_timestamp(value) -> datetime | None:\n",
|
|
" if value is None or pd.isna(value):\n",
|
|
" return None\n",
|
|
" s = str(value).strip()\n",
|
|
" m = HMS_TIME_RX.match(s)\n",
|
|
" if not m:\n",
|
|
" return None\n",
|
|
" year, jday, hhmm = m.groups()\n",
|
|
" return datetime.strptime(f'{year}{jday}{hhmm}', '%Y%j%H%M').replace(tzinfo=timezone.utc)\n",
|
|
"\n",
|
|
"\n",
|
|
"def daterange(start: date, end: date):\n",
|
|
" d = start\n",
|
|
" while d <= end:\n",
|
|
" yield d\n",
|
|
" d += timedelta(days=1)\n",
|
|
"\n",
|
|
"\n",
|
|
"def normalize_density_label(value) -> str | None:\n",
|
|
" if value is None or pd.isna(value):\n",
|
|
" return None\n",
|
|
" raw = str(value).strip()\n",
|
|
" if not raw:\n",
|
|
" return None\n",
|
|
" if raw.lower() in ('na', 'n/a', 'unknown', 'unspecified'):\n",
|
|
" return 'Unspecified'\n",
|
|
" return raw.title()\n",
|
|
"\n",
|
|
"\n",
|
|
"def density_to_rank(value) -> int | None:\n",
|
|
" label = normalize_density_label(value)\n",
|
|
" if label is None:\n",
|
|
" return None\n",
|
|
" return DENSITY_RANK.get(label.lower())\n",
|
|
"\n",
|
|
"\n",
|
|
"def clean_polygon_geometry(geom):\n",
|
|
" if geom is None or geom.is_empty:\n",
|
|
" return None\n",
|
|
"\n",
|
|
" if not geom.is_valid:\n",
|
|
" geom = make_valid(geom) if make_valid is not None else geom.buffer(0)\n",
|
|
"\n",
|
|
" if geom.geom_type == 'Polygon':\n",
|
|
" return MultiPolygon([geom])\n",
|
|
" if geom.geom_type == 'MultiPolygon':\n",
|
|
" return geom\n",
|
|
" if geom.geom_type == 'GeometryCollection':\n",
|
|
" parts = []\n",
|
|
" for part in geom.geoms:\n",
|
|
" if part.geom_type == 'Polygon':\n",
|
|
" parts.append(part)\n",
|
|
" elif part.geom_type == 'MultiPolygon':\n",
|
|
" parts.extend(part.geoms)\n",
|
|
" return MultiPolygon(parts) if parts else None\n",
|
|
" return None\n",
|
|
"\n",
|
|
"\n",
|
|
"def empty_smoke_gdf() -> gpd.GeoDataFrame:\n",
|
|
" return gpd.GeoDataFrame(\n",
|
|
" columns=['Satellite', 'Start', 'End', 'Density', 'geometry'],\n",
|
|
" geometry='geometry',\n",
|
|
" crs='EPSG:4326',\n",
|
|
" )\n",
|
|
"\n",
|
|
"\n",
|
|
"def read_smoke_zip(zip_path: Path) -> gpd.GeoDataFrame:\n",
|
|
" with tempfile.TemporaryDirectory() as td:\n",
|
|
" try:\n",
|
|
" with zipfile.ZipFile(zip_path) as zf:\n",
|
|
" zf.extractall(td)\n",
|
|
" except zipfile.BadZipFile as exc:\n",
|
|
" raise RuntimeError(f'Bad zip file: {zip_path}') from exc\n",
|
|
"\n",
|
|
" shp_files = list(Path(td).glob('*.shp'))\n",
|
|
" if not shp_files:\n",
|
|
" return empty_smoke_gdf()\n",
|
|
"\n",
|
|
" # Some older HMS shapefiles contain unclosed rings. Pyogrio/Shapely\n",
|
|
" # can repair these while reading instead of aborting the whole year.\n",
|
|
" gdf = gpd.read_file(shp_files[0], on_invalid='fix')\n",
|
|
" if gdf.crs is None:\n",
|
|
" gdf = gdf.set_crs('EPSG:4326')\n",
|
|
" elif str(gdf.crs).lower() not in ('epsg:4326', 'wgs 84'):\n",
|
|
" gdf = gdf.to_crs('EPSG:4326')\n",
|
|
" return gdf\n",
|
|
"\n",
|
|
"\n",
|
|
"def normalize_smoke_gdf(gdf: gpd.GeoDataFrame, source_file: str) -> gpd.GeoDataFrame:\n",
|
|
" fallback_date = parse_smoke_date_from_name(source_file)\n",
|
|
"\n",
|
|
" if 'End_' in gdf.columns and 'End' not in gdf.columns:\n",
|
|
" gdf = gdf.rename(columns={'End_': 'End'})\n",
|
|
"\n",
|
|
" for col in ('Satellite', 'Start', 'End', 'Density'):\n",
|
|
" if col not in gdf.columns:\n",
|
|
" gdf[col] = None\n",
|
|
"\n",
|
|
" if 'geometry' not in gdf.columns:\n",
|
|
" gdf = empty_smoke_gdf()\n",
|
|
"\n",
|
|
" if gdf.crs is None:\n",
|
|
" gdf = gdf.set_crs('EPSG:4326')\n",
|
|
" elif str(gdf.crs).lower() not in ('epsg:4326', 'wgs 84'):\n",
|
|
" gdf = gdf.to_crs('EPSG:4326')\n",
|
|
"\n",
|
|
" gdf = gdf.copy()\n",
|
|
" gdf['start_utc'] = gdf['Start'].map(parse_hms_timestamp)\n",
|
|
" gdf['end_utc'] = gdf['End'].map(parse_hms_timestamp)\n",
|
|
" gdf['smoke_date'] = gdf['start_utc'].map(lambda x: x.date() if x is not None else fallback_date)\n",
|
|
" gdf['density_rank'] = gdf['Density'].map(density_to_rank)\n",
|
|
"\n",
|
|
" gdf = gdf[gdf['smoke_date'].notna() & gdf['density_rank'].notna()].copy()\n",
|
|
" if gdf.empty:\n",
|
|
" return gpd.GeoDataFrame(\n",
|
|
" gdf,\n",
|
|
" geometry='geometry',\n",
|
|
" crs='EPSG:4326',\n",
|
|
" )\n",
|
|
"\n",
|
|
" gdf['geometry'] = gdf.geometry.map(clean_polygon_geometry)\n",
|
|
" gdf = gdf[gdf.geometry.notna() & ~gdf.geometry.is_empty].copy()\n",
|
|
" return gpd.GeoDataFrame(gdf, geometry='geometry', crs='EPSG:4326')\n",
|
|
"\n",
|
|
"\n",
|
|
"def existing_smoke_days() -> set[date]:\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute(f'select smoke_date from {DAYS_TABLE}')\n",
|
|
" return {r[0] for r in cur.fetchall()}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "11",
|
|
"metadata": {},
|
|
"source": [
|
|
"## PostGIS Insert Helpers"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "12",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def upsert_smoke_days(cur, rows):\n",
|
|
" if not rows:\n",
|
|
" return\n",
|
|
" execute_values(\n",
|
|
" cur,\n",
|
|
" f'''\n",
|
|
" insert into {DAYS_TABLE}\n",
|
|
" (smoke_date, source, source_file, source_url, feature_count)\n",
|
|
" values %s\n",
|
|
" on conflict (smoke_date) do update set\n",
|
|
" source = excluded.source,\n",
|
|
" source_file = excluded.source_file,\n",
|
|
" source_url = excluded.source_url,\n",
|
|
" feature_count = excluded.feature_count,\n",
|
|
" updated_at = now()\n",
|
|
" ''',\n",
|
|
" rows,\n",
|
|
" page_size=1000,\n",
|
|
" )\n",
|
|
"\n",
|
|
"\n",
|
|
"def insert_smoke_polygons(cur, gdf: gpd.GeoDataFrame, source: str, source_file: str, source_url: str) -> int:\n",
|
|
" if gdf.empty:\n",
|
|
" return 0\n",
|
|
"\n",
|
|
" rows = []\n",
|
|
" for _, r in gdf.iterrows():\n",
|
|
" geom = r.geometry\n",
|
|
" if geom is None or geom.is_empty:\n",
|
|
" continue\n",
|
|
" rows.append((\n",
|
|
" r['smoke_date'],\n",
|
|
" str(r.get('Satellite')) if pd.notna(r.get('Satellite')) else None,\n",
|
|
" str(r.get('Start')) if pd.notna(r.get('Start')) else None,\n",
|
|
" str(r.get('End')) if pd.notna(r.get('End')) else None,\n",
|
|
" r.get('start_utc'),\n",
|
|
" r.get('end_utc'),\n",
|
|
" normalize_density_label(r.get('Density')),\n",
|
|
" int(r['density_rank']),\n",
|
|
" source,\n",
|
|
" source_file,\n",
|
|
" source_url,\n",
|
|
" psycopg2.Binary(shapely_wkb.dumps(geom, hex=False)),\n",
|
|
" ))\n",
|
|
"\n",
|
|
" if not rows:\n",
|
|
" return 0\n",
|
|
"\n",
|
|
" execute_values(\n",
|
|
" cur,\n",
|
|
" f'''\n",
|
|
" insert into {SMOKE_TABLE} (\n",
|
|
" smoke_date, satellite, start_raw, end_raw, start_utc, end_utc,\n",
|
|
" density, density_rank, source, source_file, source_url, geom\n",
|
|
" )\n",
|
|
" values %s\n",
|
|
" ''',\n",
|
|
" rows,\n",
|
|
" template='(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, ST_Multi(ST_GeomFromWKB(%s, 4326)))',\n",
|
|
" page_size=1000,\n",
|
|
" )\n",
|
|
" return len(rows)\n",
|
|
"\n",
|
|
"\n",
|
|
"def feature_count_rows(days: list[date], gdf: gpd.GeoDataFrame, source: str, source_file: str, source_url: str):\n",
|
|
" if gdf.empty:\n",
|
|
" counts = {}\n",
|
|
" else:\n",
|
|
" counts = gdf.groupby('smoke_date').size().to_dict()\n",
|
|
" return [(d, source, source_file, source_url, int(counts.get(d, 0))) for d in days]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "13",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Load NOAA HMS Smoke Polygons"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "14",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def load_annual_bundle(year: int, have_days: set[date]) -> tuple[int, int]:\n",
|
|
" year_start = date(year, 1, 1)\n",
|
|
" year_end = date(year, 12, 31)\n",
|
|
" daily_urls = discover_daily_zip_urls(year_start, year_end)\n",
|
|
" listed_days = set(daily_urls.keys())\n",
|
|
"\n",
|
|
" url = urljoin(ANNUAL_BUNDLE_BASE_URL, f'hms_smoke{year}.zip')\n",
|
|
" zip_path, downloaded = download_if_needed(url, ANNUAL_CACHE_DIR / f'hms_smoke{year}.zip')\n",
|
|
" gdf = normalize_smoke_gdf(read_smoke_zip(zip_path), zip_path.name)\n",
|
|
" gdf = gdf[(gdf['smoke_date'] >= year_start) & (gdf['smoke_date'] <= year_end)].copy()\n",
|
|
" polygon_days = set(gdf['smoke_date'].dropna().unique()) if not gdf.empty else set()\n",
|
|
" year_days = sorted(listed_days | polygon_days)\n",
|
|
"\n",
|
|
" if not year_days:\n",
|
|
" print(f' {year}: no observed daily ZIPs or polygons found; skipped')\n",
|
|
" return 0, 0\n",
|
|
"\n",
|
|
" if not RELOAD_SMOKE and all(d in have_days for d in year_days):\n",
|
|
" print(f' {year}: skipped; all {len(year_days)} days already loaded')\n",
|
|
" return 0, 0\n",
|
|
"\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute(\n",
|
|
" f'delete from {SMOKE_TABLE} where smoke_date >= %s and smoke_date <= %s',\n",
|
|
" (year_start, year_end),\n",
|
|
" )\n",
|
|
" upsert_smoke_days(\n",
|
|
" cur,\n",
|
|
" feature_count_rows(year_days, gdf, 'annual_bundle', zip_path.name, url),\n",
|
|
" )\n",
|
|
" n_polygons = insert_smoke_polygons(cur, gdf, 'annual_bundle', zip_path.name, url)\n",
|
|
" conn.commit()\n",
|
|
"\n",
|
|
" have_days.update(year_days)\n",
|
|
" print(f' {year}: {n_polygons:,} polygons, {len(year_days)} observed days, downloaded={downloaded}')\n",
|
|
" return n_polygons, len(year_days)\n",
|
|
"\n",
|
|
"\n",
|
|
"def discover_year_month_urls(year: int) -> dict[int, str]:\n",
|
|
" year_url = f'{HMS_SHAPEFILE_BASE_URL}{year}/'\n",
|
|
" try:\n",
|
|
" links = list_links(year_url)\n",
|
|
" except requests.HTTPError as exc:\n",
|
|
" print(f' Missing/unreadable year listing: {year_url} ({exc})')\n",
|
|
" return {}\n",
|
|
"\n",
|
|
" month_urls = {}\n",
|
|
" for link in links:\n",
|
|
" name = link.rstrip('/').rsplit('/', 1)[-1]\n",
|
|
" if re.fullmatch(r'\\d{2}', name):\n",
|
|
" month_urls[int(name)] = link if link.endswith('/') else link + '/'\n",
|
|
" return month_urls\n",
|
|
"\n",
|
|
"\n",
|
|
"def discover_daily_zip_urls(start_day: date, end_day: date) -> dict[date, str]:\n",
|
|
" urls = {}\n",
|
|
" month = date(start_day.year, start_day.month, 1)\n",
|
|
" last_month = date(end_day.year, end_day.month, 1)\n",
|
|
" month_url_cache = {}\n",
|
|
"\n",
|
|
" while month <= last_month:\n",
|
|
" if month.year not in month_url_cache:\n",
|
|
" month_url_cache[month.year] = discover_year_month_urls(month.year)\n",
|
|
" month_url = month_url_cache[month.year].get(month.month)\n",
|
|
" links = list_links(month_url) if month_url is not None else []\n",
|
|
"\n",
|
|
" for link in links:\n",
|
|
" name = link.rstrip('/').rsplit('/', 1)[-1]\n",
|
|
" if not name.lower().endswith('.zip'):\n",
|
|
" continue\n",
|
|
" d = parse_smoke_date_from_name(name)\n",
|
|
" if d is not None and start_day <= d <= end_day:\n",
|
|
" urls[d] = link\n",
|
|
"\n",
|
|
" if month.month == 12:\n",
|
|
" month = date(month.year + 1, 1, 1)\n",
|
|
" else:\n",
|
|
" month = date(month.year, month.month + 1, 1)\n",
|
|
"\n",
|
|
" return dict(sorted(urls.items()))\n",
|
|
"\n",
|
|
"\n",
|
|
"def load_daily_zip(day: date, url: str, have_days: set[date]) -> tuple[int, int]:\n",
|
|
" if not RELOAD_SMOKE and day in have_days:\n",
|
|
" return 0, 0\n",
|
|
"\n",
|
|
" zip_name = f'hms_smoke{day:%Y%m%d}.zip'\n",
|
|
" zip_path, downloaded = download_if_needed(url, DAILY_CACHE_DIR / f'{day:%Y}' / f'{day:%m}' / zip_name)\n",
|
|
" gdf = normalize_smoke_gdf(read_smoke_zip(zip_path), zip_path.name)\n",
|
|
" gdf = gdf[gdf['smoke_date'] == day].copy()\n",
|
|
"\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute(f'delete from {SMOKE_TABLE} where smoke_date = %s', (day,))\n",
|
|
" upsert_smoke_days(cur, feature_count_rows([day], gdf, 'daily_zip', zip_path.name, url))\n",
|
|
" n_polygons = insert_smoke_polygons(cur, gdf, 'daily_zip', zip_path.name, url)\n",
|
|
" conn.commit()\n",
|
|
"\n",
|
|
" have_days.add(day)\n",
|
|
" print(f' {day}: {n_polygons:,} polygons, downloaded={downloaded}')\n",
|
|
" return n_polygons, 1\n",
|
|
"\n",
|
|
"\n",
|
|
"def load_current_arcgis(have_days: set[date]) -> tuple[int, int]:\n",
|
|
" try:\n",
|
|
" gdf = gpd.read_file(ARCGIS_SMOKE_URL, on_invalid='fix')\n",
|
|
" except Exception as exc:\n",
|
|
" print(f'ArcGIS current smoke refresh skipped: {exc}')\n",
|
|
" return 0, 0\n",
|
|
"\n",
|
|
" source_file = f'arcgis_current_{END_DATE:%Y%m%d}.geojson'\n",
|
|
" gdf = normalize_smoke_gdf(gdf, source_file)\n",
|
|
"\n",
|
|
" if gdf.empty:\n",
|
|
" if END_DATE not in have_days:\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" upsert_smoke_days(cur, [(END_DATE, 'arcgis_current', source_file, ARCGIS_SMOKE_URL, 0)])\n",
|
|
" conn.commit()\n",
|
|
" have_days.add(END_DATE)\n",
|
|
" print(f'ArcGIS current: 0 polygons; marked {END_DATE} as observed')\n",
|
|
" return 0, 1\n",
|
|
" print('ArcGIS current: 0 polygons; existing observed day left unchanged')\n",
|
|
" return 0, 0\n",
|
|
"\n",
|
|
" days = sorted(gdf['smoke_date'].dropna().unique())\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" for d in days:\n",
|
|
" cur.execute(f'delete from {SMOKE_TABLE} where smoke_date = %s', (d,))\n",
|
|
" upsert_smoke_days(cur, feature_count_rows(days, gdf, 'arcgis_current', source_file, ARCGIS_SMOKE_URL))\n",
|
|
" n_polygons = insert_smoke_polygons(cur, gdf, 'arcgis_current', source_file, ARCGIS_SMOKE_URL)\n",
|
|
" conn.commit()\n",
|
|
"\n",
|
|
" have_days.update(days)\n",
|
|
" print(f'ArcGIS current: {n_polygons:,} polygons across {len(days)} day(s)')\n",
|
|
" return n_polygons, len(days)\n",
|
|
"\n",
|
|
"\n",
|
|
"t0 = time.time()\n",
|
|
"have_days = set() if RELOAD_SMOKE else existing_smoke_days()\n",
|
|
"print(f'Already loaded observed HMS days: {len(have_days):,}')\n",
|
|
"\n",
|
|
"total_polygons = 0\n",
|
|
"total_days = 0\n",
|
|
"\n",
|
|
"if LOAD_ANNUAL_BUNDLES:\n",
|
|
" annual_end_year = END_DATE.year - 1\n",
|
|
" for year in range(START_YEAR, annual_end_year + 1):\n",
|
|
" n_polys, n_days = load_annual_bundle(year, have_days)\n",
|
|
" total_polygons += n_polys\n",
|
|
" total_days += n_days\n",
|
|
"\n",
|
|
"if LOAD_CURRENT_YEAR_DAILY_ZIPS:\n",
|
|
" daily_start = max(date(END_DATE.year, 1, 1), date(START_YEAR, 1, 1))\n",
|
|
" if daily_start <= END_DATE:\n",
|
|
" daily_urls = discover_daily_zip_urls(daily_start, END_DATE)\n",
|
|
" print(f'Current-year daily ZIPs discovered: {len(daily_urls):,}')\n",
|
|
" for day, url in daily_urls.items():\n",
|
|
" n_polys, n_days = load_daily_zip(day, url, have_days)\n",
|
|
" total_polygons += n_polys\n",
|
|
" total_days += n_days\n",
|
|
"\n",
|
|
"if LOAD_CURRENT_ARCGIS:\n",
|
|
" n_polys, n_days = load_current_arcgis(have_days)\n",
|
|
" total_polygons += n_polys\n",
|
|
" total_days += n_days\n",
|
|
"\n",
|
|
"elapsed = time.time() - t0\n",
|
|
"print(f'\\nSource load done: +{total_polygons:,} polygons / +{total_days:,} observed days in {elapsed/60:.1f} min')\n",
|
|
"\n",
|
|
"with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute(f'select count(*), min(smoke_date), max(smoke_date) from {DAYS_TABLE}')\n",
|
|
" print('Observed days:', cur.fetchone())\n",
|
|
" cur.execute(f'select count(*), min(smoke_date), max(smoke_date) from {SMOKE_TABLE}')\n",
|
|
" print('Smoke polygons:', cur.fetchone())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "15",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Compute Per-DC Daily Worst Smoke Density"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "16",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"POPULATE_INSMOKE_YEAR_SQL = f'''\n",
|
|
"insert into {DC_DAY_TABLE} (master_id, smoke_date, max_density_rank, polygon_hits)\n",
|
|
"select\n",
|
|
" dc.master_id,\n",
|
|
" s.smoke_date,\n",
|
|
" max(s.density_rank)::smallint as max_density_rank,\n",
|
|
" count(*)::integer as polygon_hits\n",
|
|
"from {MASTER_TABLE} dc\n",
|
|
"join {SMOKE_TABLE} s\n",
|
|
" on dc.geom && s.geom\n",
|
|
" and st_intersects(dc.geom, s.geom)\n",
|
|
"where dc.geom is not null\n",
|
|
" and s.smoke_date >= %s::date\n",
|
|
" and s.smoke_date < %s::date\n",
|
|
"group by dc.master_id, s.smoke_date\n",
|
|
"'''\n",
|
|
"\n",
|
|
"POPULATE_NOSMOKE_YEAR_SQL = f'''\n",
|
|
"insert into {DC_DAY_TABLE} (master_id, smoke_date, max_density_rank, polygon_hits)\n",
|
|
"select\n",
|
|
" dc.master_id,\n",
|
|
" d.smoke_date,\n",
|
|
" 0::smallint as max_density_rank,\n",
|
|
" 0::integer as polygon_hits\n",
|
|
"from {MASTER_TABLE} dc\n",
|
|
"cross join {DAYS_TABLE} d\n",
|
|
"where dc.geom is not null\n",
|
|
" and d.smoke_date >= %s::date\n",
|
|
" and d.smoke_date < %s::date\n",
|
|
" and not exists (\n",
|
|
" select 1\n",
|
|
" from {DC_DAY_TABLE} dd\n",
|
|
" where dd.master_id = dc.master_id\n",
|
|
" and dd.smoke_date = d.smoke_date\n",
|
|
" )\n",
|
|
"'''\n",
|
|
"\n",
|
|
"if RECOMPUTE_SUMMARY:\n",
|
|
" t0 = time.time()\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute(f'select min(smoke_date), max(smoke_date) from {DAYS_TABLE}')\n",
|
|
" day_min, day_max = cur.fetchone()\n",
|
|
" if day_min is None:\n",
|
|
" raise RuntimeError(f'{DAYS_TABLE} is empty; load HMS smoke days before computing exposure')\n",
|
|
" print(f'Processing daily DC exposure from {day_min} through {day_max}')\n",
|
|
"\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute('set local statement_timeout = 0')\n",
|
|
" cur.execute(f'truncate {DC_DAY_TABLE}')\n",
|
|
"\n",
|
|
" for yr in range(day_min.year, day_max.year + 1):\n",
|
|
" year_start = date(yr, 1, 1)\n",
|
|
" year_end = date(yr + 1, 1, 1)\n",
|
|
"\n",
|
|
" t = time.time()\n",
|
|
" cur.execute(POPULATE_INSMOKE_YEAR_SQL, (year_start, year_end))\n",
|
|
" cur.execute(f'select count(*) from {DC_DAY_TABLE}')\n",
|
|
" smoke_rows = cur.fetchone()[0]\n",
|
|
" print(f' {yr}: smoke rows cumulative = {smoke_rows:>12,} ({time.time() - t:.1f}s)')\n",
|
|
"\n",
|
|
" t = time.time()\n",
|
|
" cur.execute(POPULATE_NOSMOKE_YEAR_SQL, (year_start, year_end))\n",
|
|
" cur.execute(f'select count(*) from {DC_DAY_TABLE}')\n",
|
|
" total_rows = cur.fetchone()[0]\n",
|
|
" print(f' {yr}: rows after no-smoke fill = {total_rows:>12,} ({time.time() - t:.1f}s)')\n",
|
|
"\n",
|
|
" cur.execute(f'analyze {DC_DAY_TABLE}')\n",
|
|
" conn.commit()\n",
|
|
"\n",
|
|
" print(f'\\nDone in {(time.time() - t0) / 60:.1f} min')\n",
|
|
"\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute(f'select count(distinct master_id), count(distinct smoke_date), count(*) from {DC_DAY_TABLE}')\n",
|
|
" n_dc, n_days, n_rows = cur.fetchone()\n",
|
|
" print(f' unique DCs={n_dc:,} unique days={n_days:,} rows={n_rows:,}')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "17",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Compute Per-DC Exposure Summary"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "18",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"POPULATE_EXPOSURE_SQL = f'''\n",
|
|
"truncate {EXPOSURE_TABLE};\n",
|
|
"\n",
|
|
"with per_dc as (\n",
|
|
" select\n",
|
|
" master_id,\n",
|
|
" count(*) as days_observed,\n",
|
|
" sum((max_density_rank >= 1)::int) as days_any,\n",
|
|
" sum((max_density_rank >= 1)::int) as days_light,\n",
|
|
" sum((max_density_rank >= 2)::int) as days_medium,\n",
|
|
" sum((max_density_rank >= 3)::int) as days_heavy,\n",
|
|
" max(max_density_rank) as worst_density_rank,\n",
|
|
" avg(max_density_rank::float) as mean_density_rank,\n",
|
|
" min(smoke_date) as period_start,\n",
|
|
" max(smoke_date) as period_end\n",
|
|
" from {DC_DAY_TABLE}\n",
|
|
" group by master_id\n",
|
|
"),\n",
|
|
"streaks as (\n",
|
|
" select\n",
|
|
" master_id,\n",
|
|
" max(case when threshold = 1 then run_len end) as longest_any,\n",
|
|
" max(case when threshold = 2 then run_len end) as longest_medium,\n",
|
|
" max(case when threshold = 3 then run_len end) as longest_heavy\n",
|
|
" from (\n",
|
|
" select master_id, threshold, count(*) as run_len\n",
|
|
" from (\n",
|
|
" select\n",
|
|
" master_id,\n",
|
|
" threshold,\n",
|
|
" smoke_date,\n",
|
|
" smoke_date - (row_number() over (partition by master_id, threshold order by smoke_date))::int as grp\n",
|
|
" from (\n",
|
|
" select master_id, smoke_date, 1::int as threshold\n",
|
|
" from {DC_DAY_TABLE}\n",
|
|
" where max_density_rank >= 1\n",
|
|
" union all\n",
|
|
" select master_id, smoke_date, 2::int as threshold\n",
|
|
" from {DC_DAY_TABLE}\n",
|
|
" where max_density_rank >= 2\n",
|
|
" union all\n",
|
|
" select master_id, smoke_date, 3::int as threshold\n",
|
|
" from {DC_DAY_TABLE}\n",
|
|
" where max_density_rank >= 3\n",
|
|
" ) threshold_days\n",
|
|
" ) grouped\n",
|
|
" group by master_id, threshold, grp\n",
|
|
" ) runs\n",
|
|
" group by master_id\n",
|
|
")\n",
|
|
"insert into {EXPOSURE_TABLE} (\n",
|
|
" master_id, source, name, operator, city, state, country, longitude, latitude, geom,\n",
|
|
" hms_status, smoke_period_start, smoke_period_end, days_observed,\n",
|
|
" days_with_any_smoke, days_with_light_or_worse, days_with_medium_or_worse,\n",
|
|
" days_with_heavy_smoke,\n",
|
|
" pct_days_with_any_smoke, pct_days_with_light_or_worse,\n",
|
|
" pct_days_with_medium_or_worse, pct_days_with_heavy_smoke,\n",
|
|
" worst_density_rank, worst_density, mean_density_rank,\n",
|
|
" longest_any_smoke_streak_days,\n",
|
|
" longest_medium_or_heavy_streak_days,\n",
|
|
" longest_heavy_smoke_streak_days\n",
|
|
")\n",
|
|
"select\n",
|
|
" dc.master_id, dc.source, dc.name, dc.operator, dc.city, dc.state, dc.country,\n",
|
|
" dc.longitude, dc.latitude, dc.geom,\n",
|
|
" case\n",
|
|
" when dc.geom is null then 'no_geom'\n",
|
|
" when p.master_id is null then 'no_observed_days'\n",
|
|
" else 'observed'\n",
|
|
" end as hms_status,\n",
|
|
" p.period_start,\n",
|
|
" p.period_end,\n",
|
|
" coalesce(p.days_observed, 0),\n",
|
|
" coalesce(p.days_any, 0),\n",
|
|
" coalesce(p.days_light, 0),\n",
|
|
" coalesce(p.days_medium, 0),\n",
|
|
" coalesce(p.days_heavy, 0),\n",
|
|
" case when p.days_observed > 0 then p.days_any::float / p.days_observed end,\n",
|
|
" case when p.days_observed > 0 then p.days_light::float / p.days_observed end,\n",
|
|
" case when p.days_observed > 0 then p.days_medium::float / p.days_observed end,\n",
|
|
" case when p.days_observed > 0 then p.days_heavy::float / p.days_observed end,\n",
|
|
" p.worst_density_rank,\n",
|
|
" case p.worst_density_rank\n",
|
|
" when 3 then 'Heavy'\n",
|
|
" when 2 then 'Medium'\n",
|
|
" when 1 then 'Light/Unspecified'\n",
|
|
" when 0 then 'None'\n",
|
|
" else null\n",
|
|
" end as worst_density,\n",
|
|
" p.mean_density_rank,\n",
|
|
" coalesce(s.longest_any, 0),\n",
|
|
" coalesce(s.longest_medium, 0),\n",
|
|
" coalesce(s.longest_heavy, 0)\n",
|
|
"from {MASTER_TABLE} dc\n",
|
|
"left join per_dc p on p.master_id = dc.master_id\n",
|
|
"left join streaks s on s.master_id = dc.master_id;\n",
|
|
"\n",
|
|
"analyze {EXPOSURE_TABLE};\n",
|
|
"'''\n",
|
|
"\n",
|
|
"if RECOMPUTE_SUMMARY:\n",
|
|
" t = time.time()\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute('set local statement_timeout = 0')\n",
|
|
" cur.execute(POPULATE_EXPOSURE_SQL)\n",
|
|
" conn.commit()\n",
|
|
" print(f'Populated {EXPOSURE_TABLE} in {time.time() - t:.0f}s')\n",
|
|
"\n",
|
|
" with get_conn() as conn:\n",
|
|
" with conn.cursor() as cur:\n",
|
|
" cur.execute(f'select hms_status, count(*) from {EXPOSURE_TABLE} group by hms_status order by 1')\n",
|
|
" print('Status distribution:')\n",
|
|
" for r in cur.fetchall():\n",
|
|
" print(' ', r)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "19",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Inspect Results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "20",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"summary_query = f'''\n",
|
|
"with observed as (\n",
|
|
" select * from {EXPOSURE_TABLE} where hms_status = 'observed'\n",
|
|
")\n",
|
|
"select\n",
|
|
" count(*) as observed_dcs,\n",
|
|
" round(avg(pct_days_with_any_smoke * 100)::numeric, 1) as avg_pct_any_smoke,\n",
|
|
" round(avg(pct_days_with_medium_or_worse * 100)::numeric, 1) as avg_pct_medium_or_worse,\n",
|
|
" round(avg(pct_days_with_heavy_smoke * 100)::numeric, 2) as avg_pct_heavy,\n",
|
|
" round(avg(mean_density_rank)::numeric, 2) as avg_mean_density_rank,\n",
|
|
" max(longest_medium_or_heavy_streak_days) as max_medium_heavy_streak_days\n",
|
|
"from observed\n",
|
|
"'''\n",
|
|
"\n",
|
|
"by_state_query = f'''\n",
|
|
"select\n",
|
|
" state,\n",
|
|
" count(*) as n_dcs,\n",
|
|
" round(avg(pct_days_with_any_smoke * 100)::numeric, 1) as avg_pct_any_smoke,\n",
|
|
" round(avg(pct_days_with_medium_or_worse * 100)::numeric, 1) as avg_pct_medium_or_worse,\n",
|
|
" round(avg(pct_days_with_heavy_smoke * 100)::numeric, 2) as avg_pct_heavy,\n",
|
|
" round(avg(mean_density_rank)::numeric, 2) as avg_mean_density_rank,\n",
|
|
" max(worst_density_rank) as worst_density_rank,\n",
|
|
" round(avg(longest_medium_or_heavy_streak_days)::numeric, 1) as avg_medium_heavy_streak_days\n",
|
|
"from {EXPOSURE_TABLE}\n",
|
|
"where hms_status = 'observed'\n",
|
|
"group by state\n",
|
|
"order by avg_pct_medium_or_worse desc nulls last\n",
|
|
"limit 20\n",
|
|
"'''\n",
|
|
"\n",
|
|
"worst_query = f'''\n",
|
|
"select\n",
|
|
" master_id, name, state,\n",
|
|
" round((pct_days_with_any_smoke * 100)::numeric, 1) as pct_any_smoke,\n",
|
|
" round((pct_days_with_medium_or_worse * 100)::numeric, 1) as pct_medium_or_worse,\n",
|
|
" round((pct_days_with_heavy_smoke * 100)::numeric, 2) as pct_heavy,\n",
|
|
" days_with_heavy_smoke,\n",
|
|
" worst_density,\n",
|
|
" longest_medium_or_heavy_streak_days\n",
|
|
"from {EXPOSURE_TABLE}\n",
|
|
"where hms_status = 'observed'\n",
|
|
"order by mean_density_rank desc, pct_days_with_medium_or_worse desc\n",
|
|
"limit 20\n",
|
|
"'''\n",
|
|
"\n",
|
|
"coverage_query = f'''\n",
|
|
"select\n",
|
|
" min(smoke_date) as first_day,\n",
|
|
" max(smoke_date) as last_day,\n",
|
|
" count(*) as observed_days,\n",
|
|
" sum(feature_count) as source_polygons\n",
|
|
"from {DAYS_TABLE}\n",
|
|
"'''\n",
|
|
"\n",
|
|
"with get_conn() as conn:\n",
|
|
" coverage = pd.read_sql_query(coverage_query, conn)\n",
|
|
" summary = pd.read_sql_query(summary_query, conn)\n",
|
|
" by_state = pd.read_sql_query(by_state_query, conn)\n",
|
|
" worst = pd.read_sql_query(worst_query, conn)\n",
|
|
"\n",
|
|
"print('=== HMS source coverage ===')\n",
|
|
"display(coverage)\n",
|
|
"print('\\n=== Overall ===')\n",
|
|
"display(summary)\n",
|
|
"print('\\n=== Top states by avg medium-or-worse smoke share ===')\n",
|
|
"display(by_state)\n",
|
|
"print('\\n=== Top most smoke-exposed individual DCs ===')\n",
|
|
"display(worst)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "21",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Notes\n",
|
|
"\n",
|
|
"- `max_density_rank = 0` in `data_center_hms_smoke_dc_day` means the HMS product day was observed but no smoke polygon contained the data center point.\n",
|
|
"- `pct_days_with_medium_or_worse` is the most useful headline metric if you want to avoid counting very diffuse/light or pre-density-classification plumes as strongly as thicker smoke.\n",
|
|
"- Older HMS polygons often use `Density = NA`; this notebook maps those to rank `1` so smoke presence is preserved without counting it as medium/heavy smoke.\n",
|
|
"- HMS smoke polygons are analyst-produced satellite interpretations, not ground-level PM2.5 readings. Treat them as smoke presence/intensity exposure, not measured air quality.\n",
|
|
"- The ArcGIS endpoint is only used as a current-day refresh. Historical backfills should use NOAA HMS ZIP archives because they are stable and include the daily product history.\n",
|
|
"- The fire detections endpoint is a different product. Use it for nearby active fire/ignition analyses, not for smoke plume exposure."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "22",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Tables Created by This Notebook and Their Relationships\n",
|
|
"\n",
|
|
"### Tables Created / Maintained\n",
|
|
"1. `public.hms_smoke_days`\n",
|
|
"- One row per observed HMS product day (daily denominator table).\n",
|
|
"\n",
|
|
"2. `public.hms_smoke_daily`\n",
|
|
"- One row per smoke polygon geometry from HMS source products.\n",
|
|
"\n",
|
|
"3. `public.data_center_hms_smoke_dc_day`\n",
|
|
"- One row per `(master_id, smoke_date)` with daily smoke exposure classification.\n",
|
|
"\n",
|
|
"4. `public.data_center_hms_smoke_exposure`\n",
|
|
"- One row per `master_id` with summary smoke-exposure metrics.\n",
|
|
"\n",
|
|
"### Key Relationships\n",
|
|
"- `public.hms_smoke_days (smoke_date)`\n",
|
|
" - 1-to-many -> `public.hms_smoke_daily (smoke_date)`\n",
|
|
"\n",
|
|
"- `public.master_data_centers (master_id)`\n",
|
|
" - 1-to-many -> `public.data_center_hms_smoke_dc_day (master_id, smoke_date)`\n",
|
|
" - 1-to-1 (effective) -> `public.data_center_hms_smoke_exposure (master_id)`\n",
|
|
"\n",
|
|
"- `public.data_center_hms_smoke_dc_day`\n",
|
|
" - many-to-1 summary rollup -> `public.data_center_hms_smoke_exposure`\n",
|
|
"\n",
|
|
"### Rerun Notes\n",
|
|
"- Designed for repeat refreshes as additional HMS days become available.\n",
|
|
"- Summary exposure table is recomputed from daily source/bridge tables so results stay consistent after reloads."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": ".venv",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.14.5"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|