Add comprehensive documentation: README, database tables, and research ideas

2026-05-27 11:28:14 -07:00
parent 98f6e6e237
commit 423e11083d
3 changed files with 1271 additions and 0 deletions
--- a/research-ideas.md
+++ b/research-ideas.md
@@ -0,0 +1,524 @@
+# Research Ideas & Future Work
+
+This document outlines potential research directions, data improvements, and analyses that could extend the current US Data Centers geospatial research infrastructure.
+
+## Priority Next Steps
+
+### 1. Backfill Power Capacity Data
+**Status**: Only 5.9% of facilities have `power_mw` values (108/1,833)
+
+**Approach**:
+- Scrape Baxtel data center database (requires paid subscription)
+- Use Data Center Map API/scraping
+- Cross-reference with utility interconnection queue filings
+- FOIA requests to state utility commissions for large loads
+
+**Research Impact**: 
+- Enable capacity-weighted concentration metrics (current analyses are facility-count only)
+- Correlate power capacity with demographic/environmental variables
+- Identify "hyperscale" facilities (>100 MW) vs. edge/enterprise (<10 MW)
+
+**Implementation**:
+```python
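+# Add capacity-weighted HHI calculation to analyze_dc_tract_concentration.py
+# tract_capacities: summed power_mw per tract (assumed to be computed upstream)
+total_mw = sum(tract_capacities)
+capacity_weighted_hhi = sum((mw_i / total_mw) ** 2 for mw_i in tract_capacities)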
+```
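+
+Because only a small share of facilities currently report `power_mw`, a capacity-weighted index also needs a rule for missing values. The following is a minimal sketch, assuming facilities arrive as `(tract_id, power_mw)` pairs; the function name and the sample values are illustrative and not part of the existing scripts.
+
+```python
+from collections import defaultdict
+
+def capacity_weighted_hhi(facilities):
+    """Return (hhi, coverage) for an iterable of (tract_id, power_mw) pairs.
+
+    Facilities with power_mw of None are left out of the index; coverage is
+    the share of facilities that had a usable value.
+    """
+    facilities = list(facilities)
+    tract_mw = defaultdict(float)
+    known = 0
+    for tract_id, power_mw in facilities:
+        if power_mw is None:
+            continue
+        tract_mw[tract_id] += power_mw
+        known += 1
+    total_mw = sum(tract_mw.values())
+    if total_mw == 0:
+        return None, 0.0
+    hhi = sum((mw / total_mw) ** 2 for mw in tract_mw.values())
+    return hhi, known / len(facilities)
+
+# Illustrative values only, not taken from the database
+sample = [("T1", 60.0), ("T1", 20.0), ("T2", 20.0), ("T3", None)]
+hhi, coverage = capacity_weighted_hhi(sample)
+```
+
+Reporting `coverage` next to the index makes it explicit how much of the inventory the capacity-weighted figure actually describes.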
+
+---
+
+### 2. Operator Name Deduplication
+**Status**: String fragmentation inflates operator counts ("Meta" vs. "Meta, Inc.", AWS variants)
+
+**Approach**:
+- Create `operator_mapping` table with canonical names
+- Use fuzzy matching (e.g., `fuzzywuzzy` library) to standardize
+- Add `operator_canonical` column to `master_data_centers`
+
+**Research Impact**:
+- Accurate hyperscaler market share analysis
+- Study operator-specific siting strategies (AWS hydro, Microsoft nuclear, Meta solar)
+- Enable "operator power" political economy analyses
+
+**Script**:
+```python
+# operators_dedupe.py
+import pandas as pd
+from fuzzywuzzy import process
+
+# Load unique operators (`conn` is an open database connection created elsewhere)
+operators = pd.read_sql("SELECT DISTINCT operator FROM master_data_centers", conn)
+
+# Manual + fuzzy matching to canonical names
+canonical_map = {
+    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
+    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
+    # ... etc.
+}
+```
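+
+The lookup step that follows the mapping could be sketched as below. To stay dependency-free, this sketch swaps in the standard library's `difflib` for `fuzzywuzzy`; the `canonicalize` helper and the 0.85 cutoff are assumptions for illustration and would need tuning against the real operator strings.
+
+```python
+import difflib
+
+canonical_map = {
+    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
+    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
+}
+
+# Reverse lookup: lowercase variant -> canonical name
+variant_to_canonical = {
+    variant.lower(): canonical
+    for canonical, variants in canonical_map.items()
+    for variant in variants
+}
+
+def canonicalize(raw_name, cutoff=0.85):
+    """Map a raw operator string to a canonical name, or None if unmatched."""
+    key = raw_name.strip().lower()
+    if key in variant_to_canonical:
+        return variant_to_canonical[key]
+    # Fall back to the closest known variant above the similarity cutoff
+    close = difflib.get_close_matches(
+        key, list(variant_to_canonical), n=1, cutoff=cutoff
+    )
+    return variant_to_canonical[close[0]] if close else None
+```
+
+Returning `None` for unmatched names keeps unknown operators visible for manual review instead of forcing them into the nearest bucket.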
+
+---
+
+### 3. Water Stress Overlay
+**Status**: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities
+
+**Approach**:
+- Join to USGS WaterWatch streamflow data
+- Add USGS Drought Watch indicators by HUC8
+- Correlate data center density with:
+  - Groundwater depletion rates
+  - Surface water withdrawal permits
+  - Drought frequency/severity (USDM historical data)
+
+**Research Questions**:
+- Are data centers sited in water-stressed watersheds?
+- Do high-density clusters (Loudoun County, Columbus OH) face water constraints?
+- Compare water stress in hyperscaler non-metro sites (Columbia River corridor) vs. metro clusters
+
+**Tables to Create**:
+- `watershed_water_stress` - HUC8-level water stress indicators
+- `data_center_water_risk` - Per-facility water-stress exposure
+
+**Notebook**: `water_stress_analysis.ipynb`
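+
+The concentration figure in the status line (a small number of watersheds holding half of all facilities) can be recomputed as the inventory changes. A minimal sketch, assuming each facility carries one HUC8 code; the function name and the sample assignment are illustrative only.
+
+```python
+from collections import Counter
+
+def watersheds_for_share(huc8_codes, share=0.5):
+    """Smallest number of watersheds whose facilities reach `share` of the total."""
+    huc8_codes = list(huc8_codes)
+    counts = Counter(huc8_codes)
+    target = share * len(huc8_codes)
+    running = 0
+    # Walk watersheds from most to least populated until the target is met
+    for n, (_, count) in enumerate(counts.most_common(), start=1):
+        running += count
+        if running >= target:
+            return n
+    return 0
+
+# Illustrative assignment of ten facilities to three HUC8 codes
+facility_huc8 = ["02070008"] * 5 + ["17070105"] * 3 + ["05060001"] * 2
+```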
+
+---
+
+### 4. Opposition Cases Overlay
+**Status**: Anecdotal evidence of community opposition to new data centers
+
+**Approach**:
+- Compile cases of rejected/delayed data center proposals (news archive scraping)
+- Geocode opposition cases, join to demographics/hazards
+- Test hypotheses:
+  - Do wealthier/more educated communities successfully block projects?
+  - Are opposition cases more common in water-stressed or drought-prone areas?
+  - Do smaller non-metro communities have less bargaining power?
+
+**Research Questions**:
+- What predicts opposition success?
+- Are opposition cases spatially clustered?
+- Do demographics differ between accepted vs. rejected sites?
+
+**Output**: `opposition_cases_analysis.md`
+
+---
+
+### 5. IM3 Forward Projection Integration
+**Status**: IM3 model includes projected data center demand growth
+
+**Approach**:
+- Load IM3 projected demand scenarios (2030, 2040, 2050)
+- Overlay projected growth with:
+  - Current grid saturation (% of generation capacity within 50 km of a data center)
+  - Water stress indicators
+  - Land availability (zoned industrial parcels)
+- Identify regions where projected demand may exceed infrastructure capacity
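+
+The grid saturation metric above could be prototyped outside the database as follows. This is a sketch under stated assumptions: plants are `(lat, lon, capacity_mw)` tuples, data centers are `(lat, lon)` tuples, and distances use the haversine formula on a spherical Earth. In the PostGIS database the same test would more likely be a spatial join; the function names here are hypothetical.
+
+```python
+import math
+
+EARTH_RADIUS_KM = 6371.0
+
+def haversine_km(lat1, lon1, lat2, lon2):
+    """Great-circle distance in km between two (lat, lon) points in degrees."""
+    phi1, phi2 = math.radians(lat1), math.radians(lat2)
+    dphi = phi2 - phi1
+    dlmb = math.radians(lon2 - lon1)
+    a = (math.sin(dphi / 2) ** 2
+         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
+    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
+
+def share_of_generation_near_dcs(plants, data_centers, radius_km=50.0):
+    """Share of total plant capacity within radius_km of any data center."""
+    total = sum(mw for _, _, mw in plants)
+    if total == 0:
+        return 0.0
+    near = sum(
+        mw for lat, lon, mw in plants
+        if any(haversine_km(lat, lon, dlat, dlon) <= radius_km
+               for dlat, dlon in data_centers)
+    )
+    return near / total
+```
+
+The brute-force loop is fine for a few thousand facilities; a spatial index would be the natural next step for larger inputs.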
+
+**Research Questions**:
+- Which states face grid saturation from data center growth?
+- Are projected sites in water-stressed watersheds?
+- Does IM3 assume spatial distribution patterns consistent with current siting?
+
+**Notebook**: `im3_projection_overlay.ipynb`
+
+---
+
+## Methodological Extensions
+
+### 6. Time-Series Analysis of Cluster Growth
+**Approach**:
+- Use `rfs_year` (ready for service) from cable data and EIA generator vintage
+- Reconstruct data center siting over time (requires RFS dates for facilities)
+- Animate cluster formation in interactive map
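+
+Once facility RFS years are available, the reconstruction step reduces to a cumulative count per year. A minimal sketch, assuming a list of integer years with `None` for facilities whose date is unknown (the helper name is illustrative):
+
+```python
+from collections import Counter
+from itertools import accumulate
+
+def cumulative_by_year(rfs_years):
+    """Map each year to the number of facilities in service by that year."""
+    counts = Counter(y for y in rfs_years if y is not None)
+    years = sorted(counts)
+    totals = accumulate(counts[y] for y in years)
+    return dict(zip(years, totals))
+```
+
+The resulting mapping can feed one animation frame per year in the interactive map.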
+
+**Research Questions**:
+- Did Ashburn VA become dominant before or after major cable landings?
+- Do clusters grow via agglomeration (new facilities near existing) or simultaneous build-out?
+- How does energy infrastructure build-out correlate with data center growth?
+
+**Data Needed**:
+- Facility RFS dates (scrape from press releases, Baxtel historical data)
+- Historical tract demographics (decennial Census + ACS back to 2000)
+
+---
+
+### 7. Network Effects: Fiber Route Proximity
+**Status**: Current analysis tests submarine cable proximity (negative result)
+
+**Approach**:
+- Obtain fiber optic backbone route GIS data (from FCC, carriers, or Infrapedia)
+- Test proximity to long-haul fiber routes (not just submarine cables)
+- Hypothesis: Data centers cluster near terrestrial long-haul fiber routes, not submarine cable landings
+
+**Data Sources**:
+- FCC Form 477 fiber deployment data (historical; since superseded by the FCC Broadband Data Collection)
+- Infrapedia fiber route database
+- State-level fiber maps (e.g., Virginia Broadband Map)
+
+**Expected Result**: Positive correlation (unlike submarine cables)
+
+---
+
+### 8. Land Use & Zoning Analysis
+**Approach**:
+- Join data centers to local zoning classifications (industrial, commercial, etc.)
+- Analyze land prices in data center tracts before/after facility construction
+- Correlate with property tax revenues
+
+**Research Questions**:
+- Do data centers drive local property value increases?
+- Are they preferentially sited in already-zoned industrial areas?
+- Do host communities capture tax base growth?
+
+**Data Sources**:
+- Zillow Home Value Index (ZHVI) by ZIP
+- ATTOM property tax assessments
+- Municipal zoning GIS layers (city-specific, requires scraping/FOIA)
+
+---
+
+### 9. Environmental Justice Scoring
+**Approach**:
+- Compare data center host tracts to EPA's EJScreen indices
+- Add CalEnviroScreen-style burden/benefit framework
+- Test if data centers increase cumulative environmental burdens
+
+**Metrics**:
+- Air quality (PM2.5, ozone)
+- Hazardous waste proximity
+- Superfund site proximity
+- Heat island effect (LST from Landsat)
+- Noise pollution (traffic, cooling systems)
+
+**Expected Challenge**: Data centers may improve local metrics (compared to heavy industry) but increase water/energy consumption
+
+---
+
+## Policy & Political Economy Research
+
+### 10. Tax Incentive Analysis
+**Approach**:
+- Compile state/local tax incentives for data center siting (property tax abatements, sales tax exemptions)
+- Create `data_center_incentives` table with per-facility incentive details
+- Correlate incentive generosity with:
+  - State fiscal health
+  - Local government bargaining power
+  - Facility size/operator
+
+**Research Questions**:
+- Do weaker fiscal states offer larger incentives?
+- Are incentives regressive (larger for hyperscalers)?
+- Do incentives predict siting decisions (natural experiment approach)?
+
+**Data Sources**:
+- Good Jobs First Subsidy Tracker
+- State economic development agency press releases
+- Local news archives
+
+---
+
+### 11. Employment & Labor Market Effects
+**Approach**:
+- Join to BLS Quarterly Census of Employment and Wages (QCEW) by county (QCEW is not published at ZIP level)
+- Identify "data center construction boom" periods (before/after major facility openings)
+- Analyze employment effects in:
+  - Construction (NAICS 23)
+  - Transportation/warehousing (NAICS 48-49)
+  - Professional services (NAICS 54)
+
+**Research Questions**:
+- Do data centers create durable local employment?
+- Are jobs filled by local residents or commuters?
+- What are the wage effects in host tracts?
+
+**Data Sources**:
+- BLS QCEW
+- Census LEHD Origin-Destination Employment Statistics (LODES)
+
+---
+
+### 12. Energy Cost Pass-Through
+**Approach**:
+- Join to state-level electricity rate data (EIA, utility rate tracker)
+- Test if data center density correlates with residential rate increases
+- Natural experiment: Compare rate trajectories in high-DC vs. low-DC states
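+
+The document does not name an estimator for the high-DC vs. low-DC comparison. One common option is a simple difference-in-differences on average residential rates, sketched below; the function and the grouping into "treated" (high-DC) and "control" (low-DC) states are assumptions for illustration, and a real analysis would need controls and parallel-trend checks.
+
+```python
+from statistics import mean
+
+def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
+    """Change in treated-group mean minus change in control-group mean."""
+    treated_change = mean(treated_post) - mean(treated_pre)
+    control_change = mean(control_post) - mean(control_pre)
+    return treated_change - control_change
+
+# Placeholder rates in cents/kWh, not real EIA data
+effect = diff_in_diff(
+    treated_pre=[12.0, 13.0], treated_post=[14.0, 15.0],
+    control_pre=[11.0, 12.0], control_post=[12.0, 13.0],
+)
+```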
+
+**Research Questions**:
+- Do data centers drive residential rate increases (capacity cost allocation)?
+- Are rate increases concentrated in utility service territories with large data center loads?
+- Do states with retail choice (deregulated markets) see different effects?
+
+**Data Sources**:
+- EIA Form 861 (retail rates by state/utility)
+- Utility rate case filings (state public utility commissions)
+
+---
+
+## Data Quality & Infrastructure Improvements
+
+### 13. Address Validation & Geocoding Refinement
+**Approach**:
+- Re-geocode the 45 facilities using city-precision fallback
+- Use USPS address validation API
+- Cross-reference with Google Maps satellite imagery (manual review)
+
+**Implementation**:
+```bash
+# Re-run geocoding with stricter thresholds
+python3 load_postgis_data_centers.py --revalidate-addresses
+```
+
+---
+
+### 14. OSM Continuous Monitoring
+**Approach**:
+- Set up automated Overpass API queries (daily/weekly)
+- Detect new OSM data center tags
+- Alert for review + merge into `master_data_centers`
+
+**Implementation**:
+- Cron job running `load_postgis_osm_data_centers.py --update-only`
+- Slack/email notification on new facilities
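+
+The detection step could look roughly like the sketch below. It assumes facilities are tagged `telecom=data_center` in OSM and that known facilities are tracked as `(type, id)` pairs; the query text and the `new_osm_elements` helper are illustrative and are not taken from `load_postgis_osm_data_centers.py`.
+
+```python
+# Overpass QL query for US data centers (tag choice is an assumption)
+OVERPASS_QUERY = """
+[out:json][timeout:180];
+area["ISO3166-1"="US"][admin_level=2]->.us;
+nwr["telecom"="data_center"](area.us);
+out center;
+"""
+
+def new_osm_elements(overpass_elements, known_ids):
+    """Return elements whose (type, id) pair is not already in known_ids."""
+    return [
+        el for el in overpass_elements
+        if (el["type"], el["id"]) not in known_ids
+    ]
+```
+
+Keying on the `(type, id)` pair matters because OSM node, way, and relation IDs are numbered independently.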
+
+---
+
+### 15. Broadband Speed Validation
+**Approach**:
+- Cross-reference FCC BDC provider data with Ookla Speedtest results
+- Test if data center host tracts have faster actual speeds (not just availability)
+
+**Hypothesis**: Data center presence correlates with infrastructure investment → higher speeds
+
+**Data Sources**:
+- Ookla Open Data (aggregated Speedtest results by tile)
+- FCC Measuring Broadband America
+
+---
+
+## Visualization & Communication
+
+### 16. Interactive Story Map
+**Approach**:
+- Build Scrollama.js narrative map
+- Sections:
+  1. National overview (cluster map)
+  2. Ashburn VA zoom (dominance of single region)
+  3. Demographics comparison (host vs. national)
+  4. Water stress hot spots
+  5. Energy infrastructure saturation
+
+**Output**: `story_map.html` (standalone web page)
+
+---
+
+### 17. Policy Brief Generation
+**Approach**:
+- Auto-generate policy briefs from analysis outputs
+- Targeted audiences:
+  - State legislators (energy/water policy)
+  - Local governments (tax incentive negotiation)
+  - Environmental justice advocates
+
+**Template**:
+```markdown
+# Data Center Siting in [STATE]: Key Facts for Policymakers
+
+- **[STATE] hosts X% of US data centers** (rank: #Y)
+- **Host communities are Z% wealthier** than state average
+- **A% of state generation is within 50 km of a data center**
+- **Top watershed holds B facilities** (water stress: [HIGH/MEDIUM/LOW])
+```
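+
+Auto-generation from this template can be as simple as string formatting over a per-state stats dictionary. A minimal sketch covering two of the bullet points; the field names and the values passed in are placeholders, not analysis results.
+
+```python
+BRIEF_TEMPLATE = """# Data Center Siting in {state}: Key Facts for Policymakers
+
+- **{state} hosts {share_pct:.1f}% of US data centers** (rank: #{rank})
+- **Top watershed holds {top_watershed_count} facilities** (water stress: {water_stress})
+"""
+
+def render_brief(stats):
+    """Fill the brief template from a dict of per-state statistics."""
+    return BRIEF_TEMPLATE.format(**stats)
+
+# Placeholder inputs for illustration only
+brief = render_brief({
+    "state": "Example State",
+    "share_pct": 12.345,
+    "rank": 1,
+    "top_watershed_count": 40,
+    "water_stress": "LOW",
+})
+```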
+
+---
+
+### 18. Comparative International Analysis
+**Approach**:
+- Extend methodology to EU, Canada, Australia
+- Compare siting patterns (e.g., Nordic countries = renewable energy, cold climate)
+- Test if "concentrated costs / dispersed benefits" holds internationally
+
+**Data Sources**:
+- OpenStreetMap (global coverage)
+- Eurostat demographics
+- IEA energy data
+- TeleGeography global cable data (already available)
+
+**Research Questions**:
+- Are US patterns unique (tax-driven siting) vs. EU (regulatory constraints)?
+- Do Nordic countries see more equitable distribution?
+
+---
+
+## Speculative / Long-Term Ideas
+
+### 19. AI Demand Forecasting
+**Approach**:
+- Train ML model to predict data center siting
+- Features: demographics, energy capacity, fiber proximity, tax rates, water availability
+- Test on historical data (train on pre-2015, test on 2015-2025)
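+
+The temporal hold-out described above can be sketched as follows, assuming each facility record is a dict with an integer `rfs_year` key (an assumption, since facility RFS dates are not yet in the inventory):
+
+```python
+def temporal_split(records, cutoff_year=2015, end_year=2025):
+    """Train on facilities before cutoff_year; test on cutoff_year..end_year."""
+    train = [r for r in records if r["rfs_year"] < cutoff_year]
+    test = [r for r in records if cutoff_year <= r["rfs_year"] <= end_year]
+    return train, test
+```
+
+Splitting by time instead of at random avoids training on facilities that came online after the ones being predicted.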
+
+**Use Case**: 
+- Identify "likely future sites" for proactive policy intervention
+- Warn communities of potential incoming projects
+
+---
+
+### 20. Cooling Technology Analysis
+**Approach**:
+- Classify facilities by cooling type (air, water, hybrid)
+- Correlate with:
+  - Climate (CDD: cooling degree days)
+  - Water availability
+  - Facility size
+
+**Data Sources**:
+- Manual classification from news/press releases
+- FOIA requests to water utilities (cooling water withdrawal permits)
+
+**Research Questions**:
+- Are water-cooled facilities concentrated in water-stressed regions (paradox)?
+- Do hyperscalers use more efficient cooling (e.g., Meta's Prineville OR evaporative cooling)?
+
+---
+
+### 21. Bitcoin Mining Facilities
+**Approach**:
+- Overlay cryptocurrency mining facilities (subset of "data centers")
+- Compare siting patterns: Bitcoin mines prefer low electricity costs (WA, TX, NY hydro)
+- Test if Bitcoin mines face more opposition (negative perception)
+
+**Data Sources**:
+- Cambridge Bitcoin Electricity Consumption Index (mining map; verify whether facility-level locations are available)
+- News archives of mining farm proposals/rejections
+
+---
+
+### 22. Disaster Resilience & Redundancy
+**Approach**:
+- Model simultaneous hazard exposure across data center clusters
+- E.g., "What % of US data centers are in wildfire risk zones?"
+- Identify single points of failure (e.g., Ashburn VA = 20% of US capacity)
+
+**Research Questions**:
+- Is the current spatial distribution resilient to climate change?
+- Should policy incentivize geographic diversification?
+
+**Output**: `disaster_resilience_report.md`
+
+---
+
+### 23. Edge Data Center Network
+**Approach**:
+- Separately analyze edge facilities (<1 MW) vs. hyperscale (>100 MW)
+- Test if edge DCs follow different siting logic (population density outweighs energy cost)
+
+**Data Challenge**: Current inventory does not distinguish edge vs. hyperscale (need `power_mw` backfill)
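+
+Once `power_mw` is backfilled, the split could be a simple threshold rule. A minimal sketch using the thresholds named above; the `"mid"` and `"unknown"` labels and the function name are assumptions for illustration.
+
+```python
+def size_class(power_mw, edge_max_mw=1.0, hyperscale_min_mw=100.0):
+    """Classify a facility by power capacity in MW."""
+    if power_mw is None:
+        return "unknown"      # no capacity value backfilled yet
+    if power_mw < edge_max_mw:
+        return "edge"
+    if power_mw > hyperscale_min_mw:
+        return "hyperscale"
+    return "mid"              # everything between the two thresholds
+```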
+
+---
+
+### 24. Carbon Intensity of Host Grids
+**Approach**:
+- Join to EPA eGRID subregion carbon intensity (lb CO₂/MWh)
+- Calculate per-facility estimated carbon footprint (if `power_mw` available)
+- Compare to corporate renewable energy procurement (RECs, PPAs)
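+
+The per-facility estimate is a unit conversion from capacity to annual energy to emissions. A sketch under stated assumptions: `power_mw` is treated as nameplate capacity, and `load_factor` (average draw as a fraction of capacity) is a hypothetical parameter whose default of 1.0 gives an upper bound.
+
+```python
+HOURS_PER_YEAR = 8760
+LB_PER_METRIC_TON = 2204.62262
+
+def annual_co2_tonnes(power_mw, grid_lb_per_mwh, load_factor=1.0):
+    """Estimated annual CO2 in metric tons for one facility.
+
+    grid_lb_per_mwh is the eGRID subregion emission rate in lb CO2/MWh.
+    """
+    energy_mwh = power_mw * load_factor * HOURS_PER_YEAR
+    return energy_mwh * grid_lb_per_mwh / LB_PER_METRIC_TON
+```
+
+For example, a 10 MW facility at a 0.5 load factor on a 1,000 lb/MWh grid uses 43,800 MWh per year, which works out to roughly 19,900 metric tons of CO2.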
+
+**Research Questions**:
+- Are data centers disproportionately in high-carbon grids?
+- Do hyperscaler renewable commitments offset grid carbon?
+
+**Data Sources**:
+- EPA eGRID
+- Corporate sustainability reports (Google, Microsoft, Meta, AWS)
+
+---
+
+## Collaboration Opportunities
+
+### Academic Partnerships
+- **Energy researchers**: Joint analysis of grid saturation + IM3 projections
+- **Environmental justice scholars**: EJScreen overlay + opposition case studies
+- **Political scientists**: Tax incentive analysis + local government bargaining power
+
+### Policy Stakeholders
+- **State energy offices**: Share grid saturation maps
+- **Water resource agencies**: Watershed analysis for permitting
+- **Local governments**: Demographic/tax revenue analysis for negotiation leverage
+
+### Industry Engagement
+- **Data center operators**: Validate facility data, discuss siting criteria
+- **Colocation providers**: Access to tenant mix data (multi-tenant vs. single-tenant)
+
+---
+
+## Tools & Infrastructure Improvements
+
+### Database Enhancements
+- Add `version` column to track data vintage
+- Implement `audit_log` table for data lineage
+- Set up automated backups to S3/Azure Blob
+
+### Code Quality
+- Add unit tests for geocoding functions
+- Create `config.yaml` for database credentials (replace hardcoded env vars)
+- Dockerize analysis environment for reproducibility
+
+### Documentation
+- Add docstrings (e.g., Google or NumPy style) to all Python functions
+- Create `CONTRIBUTING.md` for external collaborators
+- Record Jupyter notebook walkthroughs (video tutorials)
+
+---
+
+## Unfunded / Ambitious Ideas
+
+### 25. Real-Time Energy Monitoring
+- Partner with utility to get live load data from data center substations
+- Build dashboard showing real-time energy consumption by facility
+- Correlate with AWS/Azure/GCP service outages (reverse-engineer capacity from brownouts)
+
+### 26. Social Media Sentiment Analysis
+- Scrape Twitter/Reddit for mentions of local data center projects
+- NLP sentiment analysis: support vs. opposition
+- Correlate sentiment with facility approval outcomes
+
+### 27. LIDAR Analysis of Cooling Infrastructure
+- Use aerial LIDAR to measure rooftop cooling equipment volume
+- Proxy for facility size (cooling = f(IT load))
+- Build predictive model: cooling equipment → power capacity
+
+---
+
+## Contact & Contributions
+
+If you're interested in collaborating on any of these research directions, please contact the repository owner.
+
+**Priorities for external collaboration**:
+1. Power capacity data acquisition
+2. Water stress/drought overlay
+3. Opposition cases database compilation
+4. International comparative analysis
+
+---
+
+## References for Future Work
+
+### Data Sources to Explore
+- **Department of Energy**: Grid resilience reports, interconnection queues
+- **NREL**: Renewable energy potential by HUC (solar, wind)
+- **USDA**: Agricultural water use by county (competition for water)
+- **NOAA**: Climate normals + projections by grid cell
+- **BLS**: QCEW employment data, wage data
+- **EPA**: eGRID, EJScreen, Superfund sites
+
+### Academic Literature Gaps
+- Limited peer-reviewed research on data center spatial concentration
+- No published studies on water stress exposure of data centers
+- Opportunity for "first mover" publication in major geography/planning journals
+
+### Policy Levers to Investigate
+- State renewable portfolio standards (RPS) → data center siting
+- Federal infrastructure investment (IRA, CHIPS Act) → energy grid capacity
+- Local zoning reform (industrial land use restrictions)
+
+---
+
+**Last Updated**: May 2026