data-centers/historical_climate_data_centers.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Historical Climate for Master Data Centers (Daymet + gridMET)\n",
    "\n",
    "This notebook builds or refreshes `public.data_center_historical_climate`, a one-row-per-data-center climate summary keyed by `master_id`.\n",
    "\n",
    "**Source switch on 2026-05-20.** Previously this notebook used Open-Meteo's archive API, but a 30-year × 9-variable request consumes ~7,000 of the free 10,000 daily call budget, limiting the run to ~1 site/day. Two free, generous federal sources replace it:\n",
    "\n",
    "- **Daymet v4 R1** (ORNL DAAC): daily 1 km gridded `tmin`, `tmax`, `prcp`, `vp` (vapor pressure) for North America, 1980–present. Pulled via the single-pixel REST API.\n",
    "- **gridMET / METDATA** (U. Idaho): daily 4 km gridded daily-mean wind speed (`vs`) for CONUS, 1979–present. Pulled via THREDDS NetCDF Subset Service as CSV (no `xarray` dependency).\n",
    "\n",
    "Climate normal period: **1991–2020** (standard WMO baseline).\n",
    "\n",
    "| Metric | Field(s) | Source / derivation |\n",
    "| --- | --- | --- |\n",
    "| Mean annual / summer temperature | `mean_annual_temperature_c`, `mean_summer_temperature_c` | Daymet `(tmin + tmax) / 2` |\n",
    "| Max / min daily temperature | `max_daily_temperature_c`, `min_daily_temperature_c` | Daymet `tmax`, `tmin` |\n",
    "| Diurnal temperature range | `mean_diurnal_temperature_range_c` | Daymet `tmax − tmin` |\n",
    "| Relative humidity | `mean_relative_humidity_pct` | Derived `100 × vp / e_sat(T)` via Magnus formula |\n",
    "| Wet bulb temperature | `mean_wet_bulb_temperature_c`, `max_wet_bulb_temperature_c`, `extreme_wet_bulb_days` | Stull (2011) approximation from T + derived RH |\n",
    "| Cooling degree days | `cooling_degree_days_c`, `annual_cooling_degree_days_c_mean` | Σ max(0, T_mean − 18.3 °C) |\n",
    "| Extreme heat days | `extreme_heat_days`, `annual_extreme_heat_days_mean` | Count days with `tmax ≥ 35 °C` |\n",
    "| Precipitation | `precipitation_total_mm`, `annual_precipitation_mm_mean`, `annual_precipitation_cv`, `wet_day_precipitation_p95_mm` | Daymet `prcp` aggregated |\n",
    "| Wind | `mean_wind_speed_ms`, `max_daily_mean_wind_speed_ms`, `sustained_wind_days`, `annual_sustained_wind_days_mean` | gridMET `vs` (daily-mean wind). **Sustained-wind day = daily mean ≥ 8 m/s** — this is *not* the same as the old Open-Meteo gust threshold; see notes at end. |\n",
    "| Drought severity, wildfire smoke | `*_source_status` columns | External sources needed (USDM, gridMET PDSI, NOAA HMS) |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import math\n",
    "import os\n",
    "import time\n",
    "import urllib.error\n",
    "import urllib.parse\n",
    "import urllib.request\n",
    "from io import StringIO\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import psycopg2\n",
    "from psycopg2 import sql\n",
    "from psycopg2.extras import execute_values\n",
    "\n",
    "pd.set_option('display.max_columns', 100)\n",
    "pd.set_option('display.max_rows', 120)\n",
    "\n",
    "print('pandas:', pd.__version__)\n",
    "print('numpy:', np.__version__)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_env_file(env_path: str = '.env') -> None:\n",
    "    p = Path(env_path)\n",
    "    if not p.exists():\n",
    "        print(f'No {env_path} file found in {Path.cwd()}')\n",
    "        return\n",
    "    loaded = 0\n",
    "    for raw_line in p.read_text(encoding='utf-8').splitlines():\n",
    "        line = raw_line.strip()\n",
    "        if not line or line.startswith('#') or '=' not in line:\n",
    "            continue\n",
    "        key, value = line.split('=', 1)\n",
    "        key = key.strip()\n",
    "        value = value.strip().strip('\"').strip(\"'\")\n",
    "        if key and key not in os.environ:\n",
    "            os.environ[key] = value\n",
    "            loaded += 1\n",
    "    print(f'Loaded {loaded} env var(s) from {env_path}')\n",
    "\n",
    "\n",
    "def require_env(keys):\n",
    "    missing = [k for k in keys if not os.getenv(k)]\n",
    "    if missing:\n",
    "        raise EnvironmentError(\n",
    "            'Missing required env vars in notebook kernel: ' + ', '.join(missing) +\n",
    "            '.\\nSet them in this notebook, or add them to a .env file in this folder.'\n",
    "        )\n",
    "\n",
    "\n",
    "load_env_file('.env')\n",
    "required_keys = ['PGWEB_HOST', 'PGWEB_PORT', 'PGWEB_USER', 'PGWEB_PASSWORD']\n",
    "require_env(required_keys)\n",
    "\n",
    "DB_NAME = os.getenv('PGDATABASE', 'data_centers')\n",
    "MASTER_TABLE  = 'public.master_data_centers'\n",
    "CLIMATE_TABLE = 'public.data_center_historical_climate'\n",
    "\n",
    "\n",
    "def get_conn():\n",
    "    return psycopg2.connect(\n",
    "        host=os.environ['PGWEB_HOST'],\n",
    "        port=os.environ['PGWEB_PORT'],\n",
    "        user=os.environ['PGWEB_USER'],\n",
    "        password=os.environ['PGWEB_PASSWORD'],\n",
    "        dbname='data_centers',\n",
    "    )\n",
    "\n",
    "\n",
    "with get_conn() as conn:\n",
    "    with conn.cursor() as cur:\n",
    "        cur.execute('select current_database(), current_user')\n",
    "        db, usr = cur.fetchone()\n",
    "        print('Connected to DB:', db)\n",
    "        print('As user:', usr)\n",
    "        cur.execute('create extension if not exists postgis')\n",
    "        cur.execute('select to_regclass(%s)', (MASTER_TABLE,))\n",
    "        if cur.fetchone()[0] is None:\n",
    "            raise RuntimeError(f'{MASTER_TABLE} does not exist. Run build_master_data_centers.py first.')\n",
    "        cur.execute(sql.SQL('select count(*) from {}').format(sql.SQL(MASTER_TABLE)))\n",
    "        print(f'{MASTER_TABLE} rows:', f'{cur.fetchone()[0]:,}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "## Parameters\n",
    "\n",
    "Climate normal period defaults to **1991–2020**, matching the current WMO baseline. Both Daymet and gridMET cover this range fully.\n",
    "\n",
    "- `REFRESH_EXISTING=True` recomputes existing rows (e.g. after a threshold change).\n",
    "- `INCLUDE_WIND=False` skips the per-site gridMET pass; wind columns stay null and `wind_status='pending_gridmet'`. Useful for a fast first pass since Daymet alone is one HTTP call per site, while gridMET is ~30 (one per year).\n",
    "- `MAX_POINTS=3` for testing, `None` for all eligible sites.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---- Daymet (T, precip, vapor pressure) ----\n",
    "DAYMET_API_URL         = 'https://daymet.ornl.gov/single-pixel/api/data'\n",
    "DAYMET_VARIABLES       = ['tmin', 'tmax', 'prcp', 'vp']\n",
    "DAYMET_DATASET_VERSION = 'daymet_v4r1'\n",
    "\n",
    "# ---- gridMET (daily-mean wind) ----\n",
    "GRIDMET_NCSS_URL_TEMPLATE = 'https://thredds.northwestknowledge.net/thredds/ncss/MET/vs/vs_{year}.nc'\n",
    "GRIDMET_VARIABLE          = 'wind_speed'\n",
    "GRIDMET_SCALE_FACTOR      = 0.1  # NCSS CSV returns packed short ints; multiply to get m/s\n",
    "GRIDMET_DATASET_VERSION   = 'gridmet_vs'\n",
    "\n",
    "# ---- Period ----\n",
    "START_DATE = '1991-01-01'\n",
    "END_DATE   = '2020-12-31'\n",
    "START_YEAR = int(START_DATE[:4])\n",
    "END_YEAR   = int(END_DATE[:4])\n",
    "\n",
    "# ---- Run control ----\n",
    "REFRESH_EXISTING = False\n",
    "INCLUDE_WIND     = True\n",
    "DRY_RUN          = False\n",
    "MAX_POINTS       = None         # small int for testing; None for all eligible\n",
    "REQUEST_SLEEP_SECONDS   = 0.5   # politeness pause between sites\n",
    "REQUEST_TIMEOUT_SECONDS = 90\n",
    "MAX_RETRIES = 5\n",
    "\n",
    "# ---- Metric thresholds ----\n",
    "CDD_BASE_C                   = 18.3\n",
    "EXTREME_HEAT_THRESHOLD_C     = 35.0\n",
    "EXTREME_WET_BULB_THRESHOLD_C = 28.0\n",
    "SUSTAINED_WIND_THRESHOLD_MS  = 8.0   # ~18 mph daily-mean wind. NOT a gust threshold.\n",
    "WET_DAY_THRESHOLD_MM         = 1.0\n",
    "\n",
    "print(f'Period:        {START_DATE} to {END_DATE}')\n",
    "print(f'Target table:  {CLIMATE_TABLE}')\n",
    "print(f'Daymet vars:   {DAYMET_VARIABLES}')\n",
    "print(f'Include wind:  {INCLUDE_WIND}  (gridMET sustained-wind threshold = {SUSTAINED_WIND_THRESHOLD_MS} m/s daily mean)')\n",
    "print(f'Refresh:       {REFRESH_EXISTING}')\n",
    "print(f'Max sites:     {MAX_POINTS}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## Create Target Table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "CREATE_CLIMATE_TABLE_SQL = f'''\n",
    "create table if not exists {CLIMATE_TABLE} (\n",
    "    master_id text primary key references public.master_data_centers(master_id) on delete cascade,\n",
    "    source text,\n",
    "    name text,\n",
    "    operator text,\n",
    "    city text,\n",
    "    state text,\n",
    "    country text,\n",
    "    longitude double precision not null,\n",
    "    latitude double precision not null,\n",
    "    geom geometry(Point, 4326),\n",
    "\n",
    "    daymet_dataset_version  text not null,\n",
    "    gridmet_dataset_version text,\n",
    "    climate_period_start date not null,\n",
    "    climate_period_end   date not null,\n",
    "    daymet_tile          integer,\n",
    "    daymet_elevation_m   double precision,\n",
    "    daymet_grid_x_m      double precision,\n",
    "    daymet_grid_y_m      double precision,\n",
    "    gridmet_grid_latitude  double precision,\n",
    "    gridmet_grid_longitude double precision,\n",
    "    observation_days  integer not null,\n",
    "    observation_years integer not null,\n",
    "\n",
    "    -- temperature\n",
    "    mean_annual_temperature_c        double precision,\n",
    "    mean_summer_temperature_c        double precision,\n",
    "    max_daily_temperature_c          double precision,\n",
    "    min_daily_temperature_c          double precision,\n",
    "    mean_diurnal_temperature_range_c double precision,\n",
    "\n",
    "    -- humidity / wet-bulb (derived from Daymet T + vp)\n",
    "    mean_relative_humidity_pct  double precision,\n",
    "    mean_wet_bulb_temperature_c double precision,\n",
    "    max_wet_bulb_temperature_c  double precision,\n",
    "    extreme_wet_bulb_days       integer,\n",
    "\n",
    "    -- cooling load / heat extremes\n",
    "    cooling_degree_days_c             double precision,\n",
    "    annual_cooling_degree_days_c_mean double precision,\n",
    "    extreme_heat_days                 integer,\n",
    "    annual_extreme_heat_days_mean     double precision,\n",
    "\n",
    "    -- precipitation\n",
    "    precipitation_total_mm        double precision,\n",
    "    annual_precipitation_mm_mean  double precision,\n",
    "    annual_precipitation_cv       double precision,\n",
    "    wet_day_precipitation_p95_mm  double precision,\n",
    "\n",
    "    -- wind (gridMET daily-mean, NOT gust)\n",
    "    mean_wind_speed_ms                double precision,\n",
    "    max_daily_mean_wind_speed_ms      double precision,\n",
    "    sustained_wind_days               integer,\n",
    "    annual_sustained_wind_days_mean   double precision,\n",
    "    wind_status                       text not null default 'pending_gridmet',\n",
    "\n",
    "    -- external-source placeholders\n",
    "    drought_severity_index_source_status  text not null default 'needs_external_source',\n",
    "    wildfire_smoke_exposure_source_status text not null default 'needs_external_source',\n",
    "\n",
    "    -- provenance\n",
    "    daymet_variables text[] not null,\n",
    "    daymet_url       text,\n",
    "    fetched_at timestamptz not null default now(),\n",
    "    updated_at timestamptz not null default now()\n",
    ");\n",
    "\n",
    "create index if not exists data_center_hist_climate_geom_gix\n",
    "    on {CLIMATE_TABLE} using gist (geom);\n",
    "create index if not exists data_center_hist_climate_state_idx\n",
    "    on {CLIMATE_TABLE} (state);\n",
    "create index if not exists data_center_hist_climate_period_idx\n",
    "    on {CLIMATE_TABLE} (climate_period_start, climate_period_end);\n",
    "'''\n",
    "\n",
    "with get_conn() as conn:\n",
    "    with conn.cursor() as cur:\n",
    "        cur.execute(CREATE_CLIMATE_TABLE_SQL)\n",
    "        cur.execute(sql.SQL('select count(*) from {}').format(sql.SQL(CLIMATE_TABLE)))\n",
    "        existing = cur.fetchone()[0]\n",
    "        print(f'{CLIMATE_TABLE} currently has {existing:,} row(s)')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "## Load Data Centers Needing Climate Metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_target_points(refresh_existing=False, max_points=None):\n",
    "    exists_filter = ''\n",
    "    params = []\n",
    "    if not refresh_existing:\n",
    "        exists_filter = f'''\n",
    "        and not exists (\n",
    "            select 1\n",
    "            from {CLIMATE_TABLE} c\n",
    "            where c.master_id = m.master_id\n",
    "              and c.daymet_dataset_version = %s\n",
    "              and c.climate_period_start = %s::date\n",
    "              and c.climate_period_end   = %s::date\n",
    "        )\n",
    "        '''\n",
    "        params = [DAYMET_DATASET_VERSION, START_DATE, END_DATE]\n",
    "\n",
    "    limit_sql = ''\n",
    "    if max_points is not None:\n",
    "        limit_sql = 'limit %s'\n",
    "        params.append(int(max_points))\n",
    "\n",
    "    query = f'''\n",
    "        select\n",
    "            m.master_id, m.source, m.name, m.operator,\n",
    "            m.city, m.state, m.country,\n",
    "            m.longitude, m.latitude\n",
    "        from {MASTER_TABLE} m\n",
    "        where m.longitude is not null\n",
    "          and m.latitude  is not null\n",
    "          and m.geom is not null\n",
    "          -- Daymet covers North America bounding box\n",
    "          and m.longitude between -178 and -52\n",
    "          and m.latitude  between   14 and  83\n",
    "          {exists_filter}\n",
    "        order by m.master_id\n",
    "        {limit_sql}\n",
    "    '''\n",
    "    with get_conn() as conn:\n",
    "        return pd.read_sql_query(query, conn, params=params)\n",
    "\n",
    "\n",
    "targets = load_target_points(refresh_existing=REFRESH_EXISTING, max_points=MAX_POINTS)\n",
    "print(f'Target data centers to fetch: {len(targets):,}')\n",
    "display(targets.head())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "## Fetchers and derived-metric helpers\n",
    "\n",
    "`fetch_daymet` issues one single-pixel REST call for the full 30-year period. `fetch_gridmet_wind` issues one NCSS CSV call per year (Daymet has no equivalent multi-year aggregated endpoint we can rely on; per-year fetches keep each response small).\n",
    "\n",
    "Wet bulb is derived in two steps:\n",
    "1. **RH from Daymet vp** using the Magnus saturation-vapor-pressure formula.\n",
    "2. **Wet bulb from T + RH** using the Stull (2011) approximation (valid 5 % ≤ RH ≤ 99 %, −20 °C ≤ T ≤ +50 °C).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "def http_get(url, max_retries=MAX_RETRIES, timeout=REQUEST_TIMEOUT_SECONDS):\n",
    "    last_error = None\n",
    "    for attempt in range(1, max_retries + 1):\n",
    "        try:\n",
    "            req = urllib.request.Request(url, headers={'User-Agent': 'data-center-climate-notebook/1.1'})\n",
    "            with urllib.request.urlopen(req, timeout=timeout) as resp:\n",
    "                return resp.read().decode('utf-8')\n",
    "        except urllib.error.HTTPError as exc:\n",
    "            last_error = exc\n",
    "            try:\n",
    "                body = exc.read().decode('utf-8', errors='replace')[:200]\n",
    "            except Exception:\n",
    "                body = ''\n",
    "            if exc.code in (400, 404):\n",
    "                raise RuntimeError(f'HTTP {exc.code} on {url}: {body or exc.reason}')\n",
    "            sleep_for = min(2 ** attempt, 60)\n",
    "            print(f'  HTTP {exc.code} attempt {attempt}/{max_retries}: {body}; sleeping {sleep_for}s')\n",
    "            time.sleep(sleep_for)\n",
    "        except (urllib.error.URLError, TimeoutError) as exc:\n",
    "            last_error = exc\n",
    "            sleep_for = min(2 ** attempt, 60)\n",
    "            print(f'  Network error attempt {attempt}/{max_retries}: {exc}; sleeping {sleep_for}s')\n",
    "            time.sleep(sleep_for)\n",
    "    raise RuntimeError(f'Request failed after {max_retries} attempt(s) for {url}: {last_error}')\n",
    "\n",
    "\n",
    "def fetch_daymet(lat, lon):\n",
    "    \"\"\"Return (header_dict, dataframe) for one point covering START_DATE..END_DATE.\"\"\"\n",
    "    params = {\n",
    "        'lat':    f'{float(lat):.6f}',\n",
    "        'lon':    f'{float(lon):.6f}',\n",
    "        'vars':   ','.join(DAYMET_VARIABLES),\n",
    "        'start':  START_DATE,\n",
    "        'end':    END_DATE,\n",
    "        'format': 'csv',\n",
    "    }\n",
    "    url = DAYMET_API_URL + '?' + urllib.parse.urlencode(params)\n",
    "    text = http_get(url)\n",
    "    # Daymet CSV preamble: Latitude / Longitude / X & Y / Tile / Elevation / \"All years; ...\" / cite line\n",
    "    lines = text.splitlines()\n",
    "    header = {'url': url}\n",
    "    data_start = None\n",
    "    for i, line in enumerate(lines):\n",
    "        if line.strip().startswith('year,'):\n",
    "            data_start = i\n",
    "            break\n",
    "        if ':' in line:\n",
    "            k, v = line.split(':', 1)\n",
    "            header[k.strip()] = v.strip()\n",
    "    if data_start is None:\n",
    "        raise RuntimeError(f'Daymet CSV missing column header. First 500 chars: {text[:500]}')\n",
    "    df = pd.read_csv(StringIO('\\n'.join(lines[data_start:])))\n",
    "    # Normalize column names: 'prcp (mm/day)' -> 'prcp', etc.\n",
    "    df.columns = [c.split(' ')[0] for c in df.columns]\n",
    "    # Daymet uses 365-day calendar (Dec 31 dropped on leap years), yday is 1..365\n",
    "    df['date'] = pd.to_datetime(df['year'].astype(str), format='%Y') + pd.to_timedelta(df['yday'] - 1, unit='D')\n",
    "    return header, df\n",
    "\n",
    "\n",
    "# gridMET 4 km grid (from THREDDS dataset.xml): lat origin 49.4°N decreasing,\n",
    "# lon origin -124.7667°E increasing, step 0.04166667°. Two sites in the same cell\n",
    "# yield the same (row, col) and share one set of 30 yearly NCSS fetches.\n",
    "GRIDMET_LAT_ORIGIN = 49.400000000000006\n",
    "GRIDMET_LAT_STEP   = -0.041666666666671404\n",
    "GRIDMET_LON_ORIGIN = -124.76666663333334\n",
    "GRIDMET_LON_STEP   = 0.041666666666671404\n",
    "\n",
    "GRIDMET_WIND_CACHE = {}\n",
    "_GRIDMET_CACHE_STATS = {'hits': 0, 'misses': 0}\n",
    "\n",
    "\n",
    "def gridmet_cell_key(lat, lon):\n",
    "    row = int(round((float(lat) - GRIDMET_LAT_ORIGIN) / GRIDMET_LAT_STEP))\n",
    "    col = int(round((float(lon) - GRIDMET_LON_ORIGIN) / GRIDMET_LON_STEP))\n",
    "    return (row, col)\n",
    "\n",
    "\n",
    "# gridMET CONUS coverage bounds (from THREDDS dataset.xml projectionBox).\n",
    "# Sites outside this box (AK, HI, PR, territories) cannot get gridMET wind.\n",
    "GRIDMET_CONUS_LAT_MIN = 25.04583333333333\n",
    "GRIDMET_CONUS_LAT_MAX = 49.420833333333334\n",
    "GRIDMET_CONUS_LON_MIN = -124.78749996666667\n",
    "GRIDMET_CONUS_LON_MAX = -67.03749996666669\n",
    "\n",
    "\n",
    "def is_in_gridmet_conus(lat, lon):\n",
    "    return (GRIDMET_CONUS_LAT_MIN <= float(lat) <= GRIDMET_CONUS_LAT_MAX\n",
    "            and GRIDMET_CONUS_LON_MIN <= float(lon) <= GRIDMET_CONUS_LON_MAX)\n",
    "\n",
    "\n",
    "def fetch_gridmet_wind(lat, lon):\n",
    "    \"\"\"Return (meta_dict, dataframe[date, wind_speed_ms]) covering START_YEAR..END_YEAR.\n",
    "\n",
    "    Cached by gridMET 4 km cell — sites in the same cell share one set of fetches.\n",
    "    Cache miss issues one NCSS CSV request per year (~30 requests, ~3 KB each).\"\"\"\n",
    "    cache_key = gridmet_cell_key(lat, lon)\n",
    "    cached = GRIDMET_WIND_CACHE.get(cache_key)\n",
    "    if cached is not None:\n",
    "        _GRIDMET_CACHE_STATS['hits'] += 1\n",
    "        return cached\n",
    "    _GRIDMET_CACHE_STATS['misses'] += 1\n",
    "\n",
    "    frames = []\n",
    "    grid_lat = grid_lon = None\n",
    "    for year in range(START_YEAR, END_YEAR + 1):\n",
    "        url = (GRIDMET_NCSS_URL_TEMPLATE.format(year=year) +\n",
    "               f'?var={GRIDMET_VARIABLE}'\n",
    "               f'&latitude={float(lat):.6f}&longitude={float(lon):.6f}'\n",
    "               f'&time_start={year}-01-01T00:00:00Z&time_end={year}-12-31T23:59:59Z'\n",
    "               f'&accept=csv')\n",
    "        text = http_get(url)\n",
    "        df = pd.read_csv(StringIO(text))\n",
    "        # NCSS column names look like:  time, latitude[unit=\"...\"], longitude[..], wind_speed[unit=\"m/s\"]\n",
    "        df = df.rename(columns={c: c.split('[')[0].strip() for c in df.columns})\n",
    "        df['date'] = pd.to_datetime(df['time']).dt.tz_localize(None).dt.normalize()\n",
    "        # NCSS CSV does NOT apply scale_factor for packed shorts; apply here.\n",
    "        df['wind_speed_ms'] = pd.to_numeric(df['wind_speed'], errors='coerce') * GRIDMET_SCALE_FACTOR\n",
    "        if grid_lat is None and 'latitude' in df.columns:\n",
    "            grid_lat = float(df['latitude'].iloc[0])\n",
    "            grid_lon = float(df['longitude'].iloc[0])\n",
    "        frames.append(df[['date', 'wind_speed_ms']])\n",
    "    result = ({'grid_lat': grid_lat, 'grid_lon': grid_lon}, pd.concat(frames, ignore_index=True))\n",
    "    GRIDMET_WIND_CACHE[cache_key] = result\n",
    "    return result\n",
    "\n",
    "\n",
    "def saturation_vapor_pressure_pa(t_c):\n",
    "    \"\"\"Magnus / Tetens. T in °C, returns saturation vapor pressure in Pa.\"\"\"\n",
    "    return 611.2 * np.exp(17.67 * t_c / (t_c + 243.5))\n",
    "\n",
    "\n",
    "def relative_humidity_pct(vp_pa, t_c):\n",
    "    e_sat = saturation_vapor_pressure_pa(t_c)\n",
    "    rh = 100.0 * vp_pa / e_sat\n",
    "    return np.clip(rh, 0.0, 100.0)\n",
    "\n",
    "\n",
    "def wet_bulb_temperature_c(t_c, rh_pct):\n",
    "    \"\"\"Stull (2011) approximation. Valid for 5 ≤ RH ≤ 99 %, −20 ≤ T ≤ +50 °C.\"\"\"\n",
    "    rh = np.clip(rh_pct, 5.0, 99.0)\n",
    "    return (\n",
    "        t_c * np.arctan(0.151977 * np.sqrt(rh + 8.313659))\n",
    "        + np.arctan(t_c + rh)\n",
    "        - np.arctan(rh - 1.676331)\n",
    "        + 0.00391838 * (rh ** 1.5) * np.arctan(0.023101 * rh)\n",
    "        - 4.686035\n",
    "    )\n",
    "\n",
    "\n",
    "def _parse_daymet_header(header):\n",
    "    def f(s):\n",
    "        try: return float(str(s).split()[0])\n",
    "        except Exception: return None\n",
    "    elevation_m = f(header.get('Elevation', '').replace('meters', '').strip())\n",
    "    tile = None\n",
    "    try: tile = int(header.get('Tile', '').strip())\n",
    "    except Exception: pass\n",
    "    xy = header.get('X & Y on Lambert Conformal Conic', '')\n",
    "    daymet_x = daymet_y = None\n",
    "    if xy:\n",
    "        parts = xy.split()\n",
    "        try:\n",
    "            daymet_x = float(parts[0]); daymet_y = float(parts[1])\n",
    "        except Exception:\n",
    "            pass\n",
    "    return tile, elevation_m, daymet_x, daymet_y\n",
    "\n",
    "\n",
    "def summarize_site(dc_row, daymet_header, daymet_df, wind_info):\n",
    "    df = daymet_df.copy()\n",
    "    df['tmean'] = (df['tmin'] + df['tmax']) / 2.0\n",
    "    df['year']  = df['date'].dt.year\n",
    "    df['month'] = df['date'].dt.month\n",
    "\n",
    "    df['rh_mean_pct']     = relative_humidity_pct(df['vp'], df['tmean'])\n",
    "    df['rh_at_tmax_pct']  = relative_humidity_pct(df['vp'], df['tmax'])\n",
    "    df['wet_bulb_mean_c'] = wet_bulb_temperature_c(df['tmean'], df['rh_mean_pct'])\n",
    "    df['wet_bulb_max_c']  = wet_bulb_temperature_c(df['tmax'],  df['rh_at_tmax_pct'])\n",
    "\n",
    "    df['cdd_c']            = (df['tmean'] - CDD_BASE_C).clip(lower=0)\n",
    "    df['extreme_heat']     = df['tmax'] >= EXTREME_HEAT_THRESHOLD_C\n",
    "    df['extreme_wet_bulb'] = df['wet_bulb_max_c'] >= EXTREME_WET_BULB_THRESHOLD_C\n",
    "    df['dtr_c']            = df['tmax'] - df['tmin']\n",
    "\n",
    "    annual = (df.groupby('year', as_index=False)\n",
    "                .agg(annual_mean_t_c=('tmean','mean'),\n",
    "                     annual_cdd=('cdd_c','sum'),\n",
    "                     annual_extreme_heat=('extreme_heat','sum'),\n",
    "                     annual_prcp_mm=('prcp','sum')))\n",
    "    summer = df.loc[df['month'].isin([6, 7, 8])]\n",
    "    wet_days = df.loc[df['prcp'] >= WET_DAY_THRESHOLD_MM, 'prcp']\n",
    "\n",
    "    annual_prcp = annual['annual_prcp_mm'].dropna()\n",
    "    precip_cv = float(annual_prcp.std(ddof=1) / annual_prcp.mean()) if len(annual_prcp) > 1 and annual_prcp.mean() != 0 else np.nan\n",
    "\n",
    "    tile, elevation_m, daymet_x, daymet_y = _parse_daymet_header(daymet_header)\n",
    "\n",
    "    row = {\n",
    "        'master_id': dc_row.master_id,\n",
    "        'source':    dc_row.source,\n",
    "        'name':      dc_row.name,\n",
    "        'operator':  dc_row.operator,\n",
    "        'city':      dc_row.city,\n",
    "        'state':     dc_row.state,\n",
    "        'country':   dc_row.country,\n",
    "        'longitude': float(dc_row.longitude),\n",
    "        'latitude':  float(dc_row.latitude),\n",
    "\n",
    "        'daymet_dataset_version':  DAYMET_DATASET_VERSION,\n",
    "        'gridmet_dataset_version': GRIDMET_DATASET_VERSION if wind_info is not None else None,\n",
    "        'climate_period_start':    START_DATE,\n",
    "        'climate_period_end':      END_DATE,\n",
    "        'daymet_tile':             tile,\n",
    "        'daymet_elevation_m':      elevation_m,\n",
    "        'daymet_grid_x_m':         daymet_x,\n",
    "        'daymet_grid_y_m':         daymet_y,\n",
    "        'gridmet_grid_latitude':   wind_info['grid_lat']  if wind_info is not None else None,\n",
    "        'gridmet_grid_longitude':  wind_info['grid_lon']  if wind_info is not None else None,\n",
    "        'observation_days':        int(len(df)),\n",
    "        'observation_years':       int(df['year'].nunique()),\n",
    "\n",
    "        'mean_annual_temperature_c':        float(annual['annual_mean_t_c'].mean()),\n",
    "        'mean_summer_temperature_c':        float(summer['tmean'].mean()) if len(summer) else np.nan,\n",
    "        'max_daily_temperature_c':          float(df['tmax'].max()),\n",
    "        'min_daily_temperature_c':          float(df['tmin'].min()),\n",
    "        'mean_diurnal_temperature_range_c': float(df['dtr_c'].mean()),\n",
    "\n",
    "        'mean_relative_humidity_pct':   float(df['rh_mean_pct'].mean()),\n",
    "        'mean_wet_bulb_temperature_c':  float(df['wet_bulb_mean_c'].mean()),\n",
    "        'max_wet_bulb_temperature_c':   float(df['wet_bulb_max_c'].max()),\n",
    "        'extreme_wet_bulb_days':        int(df['extreme_wet_bulb'].sum()),\n",
    "\n",
    "        'cooling_degree_days_c':              float(df['cdd_c'].sum()),\n",
    "        'annual_cooling_degree_days_c_mean':  float(annual['annual_cdd'].mean()),\n",
    "        'extreme_heat_days':                  int(df['extreme_heat'].sum()),\n",
    "        'annual_extreme_heat_days_mean':      float(annual['annual_extreme_heat'].mean()),\n",
    "\n",
    "        'precipitation_total_mm':       float(df['prcp'].sum()),\n",
    "        'annual_precipitation_mm_mean': float(annual['annual_prcp_mm'].mean()),\n",
    "        'annual_precipitation_cv':      precip_cv,\n",
    "        'wet_day_precipitation_p95_mm': float(wet_days.quantile(0.95)) if len(wet_days) else 0.0,\n",
    "\n",
    "        'mean_wind_speed_ms':              None,\n",
    "        'max_daily_mean_wind_speed_ms':    None,\n",
    "        'sustained_wind_days':             None,\n",
    "        'annual_sustained_wind_days_mean': None,\n",
    "        'wind_status':                     'pending_gridmet',\n",
    "\n",
    "        'drought_severity_index_source_status':  'needs_external_source',\n",
    "        'wildfire_smoke_exposure_source_status': 'needs_external_source',\n",
    "\n",
    "        'daymet_variables': DAYMET_VARIABLES,\n",
    "        'daymet_url':       daymet_header.get('url'),\n",
    "    }\n",
    "\n",
    "    if wind_info is not None:\n",
    "        wdf = wind_info['data'].copy()\n",
    "        wdf['year']      = wdf['date'].dt.year\n",
    "        wdf['sustained'] = wdf['wind_speed_ms'] >= SUSTAINED_WIND_THRESHOLD_MS\n",
    "        annual_wind = wdf.groupby('year', as_index=False).agg(d=('sustained', 'sum'))\n",
    "        row.update({\n",
    "            'mean_wind_speed_ms':              float(wdf['wind_speed_ms'].mean()),\n",
    "            'max_daily_mean_wind_speed_ms':    float(wdf['wind_speed_ms'].max()),\n",
    "            'sustained_wind_days':             int(wdf['sustained'].sum()),\n",
    "            'annual_sustained_wind_days_mean': float(annual_wind['d'].mean()),\n",
    "            'wind_status':                     'gridmet',\n",
    "        })\n",
    "\n",
    "    return row\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "## Upsert Helper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "UPSERT_COLUMNS = [\n",
    "    'master_id', 'source', 'name', 'operator', 'city', 'state', 'country',\n",
    "    'longitude', 'latitude',\n",
    "    'daymet_dataset_version', 'gridmet_dataset_version',\n",
    "    'climate_period_start', 'climate_period_end',\n",
    "    'daymet_tile', 'daymet_elevation_m', 'daymet_grid_x_m', 'daymet_grid_y_m',\n",
    "    'gridmet_grid_latitude', 'gridmet_grid_longitude',\n",
    "    'observation_days', 'observation_years',\n",
    "    'mean_annual_temperature_c', 'mean_summer_temperature_c',\n",
    "    'max_daily_temperature_c',   'min_daily_temperature_c',\n",
    "    'mean_diurnal_temperature_range_c',\n",
    "    'mean_relative_humidity_pct',\n",
    "    'mean_wet_bulb_temperature_c', 'max_wet_bulb_temperature_c',\n",
    "    'extreme_wet_bulb_days',\n",
    "    'cooling_degree_days_c', 'annual_cooling_degree_days_c_mean',\n",
    "    'extreme_heat_days', 'annual_extreme_heat_days_mean',\n",
    "    'precipitation_total_mm', 'annual_precipitation_mm_mean',\n",
    "    'annual_precipitation_cv', 'wet_day_precipitation_p95_mm',\n",
    "    'mean_wind_speed_ms', 'max_daily_mean_wind_speed_ms',\n",
    "    'sustained_wind_days', 'annual_sustained_wind_days_mean',\n",
    "    'wind_status',\n",
    "    'drought_severity_index_source_status',\n",
    "    'wildfire_smoke_exposure_source_status',\n",
    "    'daymet_variables', 'daymet_url',\n",
    "]\n",
    "\n",
    "\n",
    "def clean_db_value(value):\n",
    "    if isinstance(value, (list, tuple)):\n",
    "        return list(value)\n",
    "    if value is None:\n",
    "        return None\n",
    "    if isinstance(value, float) and np.isnan(value):\n",
    "        return None\n",
    "    try:\n",
    "        if pd.isna(value):\n",
    "            return None\n",
    "    except (TypeError, ValueError):\n",
    "        pass\n",
    "    return value\n",
    "\n",
    "\n",
    "def upsert_climate_rows(rows):\n",
    "    if not rows:\n",
    "        return\n",
    "\n",
    "    values = [\n",
    "        tuple(clean_db_value(row.get(col)) for col in UPSERT_COLUMNS)\n",
    "        for row in rows\n",
    "    ]\n",
    "    col_sql = sql.SQL(', ').join(map(sql.Identifier, UPSERT_COLUMNS))\n",
    "    update_assignments = sql.SQL(', ').join(\n",
    "        sql.SQL('{} = excluded.{}').format(sql.Identifier(col), sql.Identifier(col))\n",
    "        for col in UPSERT_COLUMNS if col != 'master_id'\n",
    "    )\n",
    "\n",
    "    insert_sql = sql.SQL('''\n",
    "        insert into {table} ({cols}, geom, updated_at)\n",
    "        values %s\n",
    "        on conflict (master_id) do update set\n",
    "            {updates},\n",
    "            geom       = excluded.geom,\n",
    "            fetched_at = now(),\n",
    "            updated_at = now()\n",
    "    ''').format(\n",
    "        table=sql.SQL(CLIMATE_TABLE),\n",
    "        cols=col_sql,\n",
    "        updates=update_assignments,\n",
    "    )\n",
    "\n",
    "    template = '(' + ', '.join(['%s'] * len(UPSERT_COLUMNS)) + ', ST_SetSRID(ST_MakePoint(%s, %s), 4326), now())'\n",
    "    lon_i = UPSERT_COLUMNS.index('longitude')\n",
    "    lat_i = UPSERT_COLUMNS.index('latitude')\n",
    "    values_with_geom = [r + (r[lon_i], r[lat_i]) for r in values]\n",
    "\n",
    "    with get_conn() as conn:\n",
    "        with conn.cursor() as cur:\n",
    "            execute_values(cur, insert_sql.as_string(conn), values_with_geom, template=template, page_size=500)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {},
   "source": [
    "## Fetch + Upsert (per site, with checkpointing)\n",
    "\n",
    "Each site is fetched, summarized, **and upserted to Postgres immediately**. That way:\n",
    "\n",
    "- A mid-run crash never loses more than the current site.\n",
    "- Re-running with `REFRESH_EXISTING=False` skips everything already in `data_center_historical_climate` (filtered in `load_target_points`).\n",
    "- The CSV checkpoint at `output/historical_climate_checkpoint.csv` is purely a debug artifact; the DB is the source of truth.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "if targets.empty:\n",
    "    climate_rows = []\n",
    "    print('No new target rows. Set REFRESH_EXISTING=True to recompute existing rows.')\n",
    "else:\n",
    "    climate_rows = []\n",
    "    errors = []\n",
    "    total = len(targets)\n",
    "    checkpoint_path = Path('output/historical_climate_checkpoint.csv')\n",
    "    checkpoint_path.parent.mkdir(exist_ok=True)\n",
    "\n",
    "    for i, dc_row in enumerate(targets.itertuples(index=False), start=1):\n",
    "        try:\n",
    "            daymet_header, daymet_df = fetch_daymet(dc_row.latitude, dc_row.longitude)\n",
    "            wind_info = None\n",
    "            wind_skip_reason = None\n",
    "            if INCLUDE_WIND:\n",
    "                if is_in_gridmet_conus(dc_row.latitude, dc_row.longitude):\n",
    "                    grid_meta, wind_df = fetch_gridmet_wind(dc_row.latitude, dc_row.longitude)\n",
    "                    wind_info = {\n",
    "                        'grid_lat': grid_meta['grid_lat'],\n",
    "                        'grid_lon': grid_meta['grid_lon'],\n",
    "                        'data':     wind_df,\n",
    "                    }\n",
    "                else:\n",
    "                    wind_skip_reason = 'unavailable_outside_conus'\n",
    "            row = summarize_site(dc_row, daymet_header, daymet_df, wind_info)\n",
    "            if wind_skip_reason is not None:\n",
    "                row['wind_status'] = wind_skip_reason\n",
    "            climate_rows.append(row)\n",
    "\n",
    "            if not DRY_RUN:\n",
    "                upsert_climate_rows([row])\n",
    "            pd.DataFrame(climate_rows).to_csv(checkpoint_path, index=False)\n",
    "\n",
    "            print(f'{i:>5,}/{total:,}  {dc_row.master_id}  '\n",
    "                  f'T_mean={row[\"mean_annual_temperature_c\"]:.2f}°C  '\n",
    "                  f'CDD={row[\"cooling_degree_days_c\"]:,.0f}  '\n",
    "                  f'wind={row[\"wind_status\"]}  '\n",
    "                  f'successes={len(climate_rows):,} errors={len(errors):,}')\n",
    "        except Exception as exc:\n",
    "            errors.append({'master_id': dc_row.master_id, 'error': str(exc)})\n",
    "            print(f'ERROR {i:>5,}/{total:,} ({dc_row.master_id}): {exc}')\n",
    "\n",
    "        time.sleep(REQUEST_SLEEP_SECONDS)\n",
    "\n",
    "    climate   = pd.DataFrame(climate_rows)\n",
    "    error_log = pd.DataFrame(errors)\n",
    "    print(f'\\nDone: {len(climate):,} successes / {len(error_log):,} errors')\n",
    "    if INCLUDE_WIND:\n",
    "        h = _GRIDMET_CACHE_STATS['hits']; m = _GRIDMET_CACHE_STATS['misses']\n",
    "        tot = h + m\n",
    "        print(f'gridMET cache: {h:,} hits / {m:,} misses ({100*h/tot:.1f}% hit rate, {len(GRIDMET_WIND_CACHE):,} unique cells)' if tot else 'gridMET cache: no calls')\n",
    "    if not climate.empty:\n",
    "        display(climate.head())\n",
    "    if not error_log.empty:\n",
    "        display(error_log)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "## Inspect Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "summary_query = f'''\n",
    "select\n",
    "    count(*) as rows,\n",
    "    min(climate_period_start) as min_period_start,\n",
    "    max(climate_period_end)   as max_period_end,\n",
    "    count(*) filter (where wind_status = 'gridmet')         as with_wind,\n",
    "    count(*) filter (where wind_status = 'pending_gridmet') as wind_pending,\n",
    "    count(*) filter (where drought_severity_index_source_status  = 'needs_external_source') as drought_external_needed,\n",
    "    count(*) filter (where wildfire_smoke_exposure_source_status = 'needs_external_source') as smoke_external_needed\n",
    "from {CLIMATE_TABLE}\n",
    "where daymet_dataset_version = %s\n",
    "  and climate_period_start = %s::date\n",
    "  and climate_period_end   = %s::date\n",
    "'''\n",
    "\n",
    "sample_query = f'''\n",
    "select\n",
    "    c.master_id, c.name, c.state,\n",
    "    round(c.mean_annual_temperature_c::numeric, 2) as mean_t_c,\n",
    "    round(c.mean_summer_temperature_c::numeric, 2) as summer_t_c,\n",
    "    round(c.mean_relative_humidity_pct::numeric, 1) as rh_pct,\n",
    "    round(c.mean_wet_bulb_temperature_c::numeric, 2) as wb_mean_c,\n",
    "    round(c.cooling_degree_days_c::numeric, 0) as cdd_total,\n",
    "    c.extreme_heat_days,\n",
    "    round(c.annual_precipitation_cv::numeric, 3) as precip_cv,\n",
    "    round(c.mean_wind_speed_ms::numeric, 2) as wind_mean_ms,\n",
    "    c.sustained_wind_days,\n",
    "    c.wind_status\n",
    "from {CLIMATE_TABLE} c\n",
    "order by c.cooling_degree_days_c desc nulls last\n",
    "limit 20\n",
    "'''\n",
    "\n",
    "with get_conn() as conn:\n",
    "    summary = pd.read_sql_query(summary_query, conn, params=(DAYMET_DATASET_VERSION, START_DATE, END_DATE))\n",
    "    sample  = pd.read_sql_query(sample_query, conn)\n",
    "\n",
    "display(summary)\n",
    "display(sample)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "## Notes for the Next Data Sources\n",
    "\n",
    "Daymet (T/precip/VP at 1 km) + gridMET (wind at 4 km) cover the core weather/climate burden metrics for CONUS data centers. Two metrics still need better external sources:\n",
    "\n",
    "- **Drought severity index** — gridMET also publishes `pdsi` (Palmer Drought Severity Index) and `etr` (reference ET) at the same THREDDS server. A follow-up cell can pull PDSI percentiles using the same NCSS pattern as `vs` here. Alternatively, the US Drought Monitor (USDM) shapefiles give weekly drought categories that can be spatially joined to the DC point.\n",
    "- **Wildfire smoke exposure** — NOAA HMS smoke shapefiles (daily plume polygons, 2005–present) or EPA AirNow PM2.5 monitor data. Smoke-day counts would be directly comparable to the heat-day metric here.\n",
    "\n",
    "**Metric change vs. the prior Open-Meteo implementation.** The old `windstorm_days` field counted days where the daily *max wind gust* exceeded 17.2 m/s (~38 mph, ERA5 reanalysis gust). gridMET reports only daily *mean* wind, not gusts, so the new `sustained_wind_days` field counts days with daily-mean wind ≥ 8 m/s (~18 mph). These are fundamentally different signals — do not compare values across the two columns. If true gust/storm event counts are needed, NOAA's Storm Events Database is the appropriate next source.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "## Tables Created by This Notebook and Their Relationships\n",
    "\n",
    "### Tables Created / Maintained\n",
    "1. `public.data_center_historical_climate`\n",
    "- One row per `master_id` with climate summary fields and geometry.\n",
    "- Populated by incremental upsert so reruns refresh existing sites and add new sites.\n",
    "\n",
    "### Key Relationships\n",
    "- `public.master_data_centers (master_id)`\n",
    "  - 1-to-1 (effective) -> `public.data_center_historical_climate (master_id)`\n",
    "\n",
    "### Rerun Notes\n",
    "- Safe to rerun when the master data-center set changes.\n",
    "- Existing rows are updated in place; no duplicate-per-site history table is created by this notebook."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.14.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}