# Research Ideas & Future Work

This document outlines potential research directions, data improvements, and analyses that could extend the current US Data Centers geospatial research infrastructure.

## Priority Next Steps

### 1. Backfill Power Capacity Data

**Status**: Only 5.9% of facilities have `power_mw` values (108/1,833)

**Approach**:
- Scrape Baxtel data center database (requires paid subscription)
- Use Data Center Map API/scraping
- Cross-reference with utility interconnection queue filings
- FOIA requests to state utility commissions for large loads

**Research Impact**:
- Enable capacity-weighted concentration metrics (current analyses are facility-count only)
- Correlate power capacity with demographic/environmental variables
- Identify "hyperscale" facilities (>100 MW) vs. edge/enterprise (<10 MW)

**Implementation**:
```python
# Add capacity-weighted HHI calculation to analyze_dc_tract_concentration.py
# tract_capacities: total MW per tract, for tracts with known power_mw
total_mw = sum(tract_capacities)
capacity_weighted_hhi = sum((mw_i / total_mw) ** 2 for mw_i in tract_capacities)
```

---

### 2. Operator Name Deduplication

**Status**: String fragmentation inflates operator counts ("Meta" vs. "Meta, Inc.", AWS variants)

**Approach**:
- Create `operator_mapping` table with canonical names
- Use fuzzy matching (e.g., `fuzzywuzzy` library) to standardize
- Add `operator_canonical` column to `master_data_centers`

**Research Impact**:
- Accurate hyperscaler market share analysis
- Study operator-specific siting strategies (AWS hydro, Microsoft nuclear, Meta solar)
- Enable "operator power" political economy analyses

**Script**:
```python
# operators_dedupe.py
import pandas as pd
from fuzzywuzzy import process

# Load unique operators
# conn: an open database connection (e.g., a SQLAlchemy engine), created elsewhere
operators = pd.read_sql("SELECT DISTINCT operator FROM master_data_centers", conn)

# Manual + fuzzy matching to canonical names
canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
    # ... etc.
}
```

---

### 3. Water Stress Overlay

**Status**: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities

**Approach**:
- Join to USGS WaterWatch streamflow data
- Add USGS Drought Watch indicators by HUC8
- Correlate data center density with:
  - Groundwater depletion rates
  - Surface water withdrawal permits
  - Drought frequency/severity (USDM historical data)

**Research Questions**:
- Are data centers sited in water-stressed watersheds?
- Do high-density clusters (Loudoun County, Columbus OH) face water constraints?
- How does water stress in hyperscaler non-metro sites (Columbia River corridor) compare with metro clusters?

**Tables to Create**:
- `watershed_water_stress` - HUC8-level water stress indicators
- `data_center_water_risk` - Per-facility water-stress exposure

**Notebook**: `water_stress_analysis.ipynb`

---

### 4. Opposition Cases Overlay

**Status**: Anecdotal evidence of community opposition to new data centers

**Approach**:
- Compile cases of rejected/delayed data center proposals (news archive scraping)
- Geocode opposition cases, join to demographics/hazards
- Test hypotheses:
  - Do wealthier/more educated communities successfully block projects?
  - Are opposition cases more common in water-stressed or drought-prone areas?
  - Do smaller non-metro communities have less bargaining power?

**Research Questions**:
- What predicts opposition success?
- Are opposition cases spatially clustered?
- Do demographics differ between accepted vs. rejected sites?

**Output**: `opposition_cases_analysis.md`

---

### 5. IM3 Forward Projection Integration

**Status**: IM3 model includes projected data center demand growth

**Approach**:
- Load IM3 projected demand scenarios (2030, 2040, 2050)
- Overlay projected growth with:
  - Current grid saturation (% of generation within 50 km)
  - Water stress indicators
  - Land availability (zoned industrial parcels)
- Identify regions where projected demand may exceed infrastructure capacity

**Research Questions**:
- Which states face grid saturation from data center growth?
- Are projected sites in water-stressed watersheds?
- Does IM3 assume spatial distribution patterns consistent with current siting?

**Notebook**: `im3_projection_overlay.ipynb`

---

## Methodological Extensions

### 6. Time-Series Analysis of Cluster Growth

**Approach**:
- Use `rfs_year` (ready for service) from cable data and EIA generator vintage
- Reconstruct data center siting over time (requires RFS dates for facilities)
- Animate cluster formation in interactive map

**Research Questions**:
- Did Ashburn VA become dominant before or after major cable landings?
- Do clusters grow via agglomeration (new facilities near existing) or simultaneous build-out?
- Does energy infrastructure build-out correlate with data center growth?

**Data Needed**:
- Facility RFS dates (scrape from press releases, Baxtel historical data)
- Historical tract demographics (decennial Census + ACS back to 2000)

---

### 7. Network Effects: Fiber Route Proximity

**Status**: Current analysis tests submarine cable proximity (negative result)

**Approach**:
- Obtain fiber optic backbone route GIS data (from FCC, carriers, or Infrapedia)
- Test proximity to long-haul fiber routes (not just submarine cables)
- Hypothesis: Data centers cluster near terrestrial fiber routes, not submarine cables

**Data Sources**:
- FCC Form 477 fiber deployment data
- Infrapedia fiber route database
- State-level fiber maps (e.g., Virginia Broadband Map)

**Expected Result**: Positive correlation (unlike submarine cables)

---

### 8. Land Use & Zoning Analysis

**Approach**:
- Join data centers to local zoning classifications (industrial, commercial, etc.)
- Analyze land prices in data center tracts before/after facility construction
- Correlate with property tax revenues

**Research Questions**:
- Do data centers drive local property value increases?
- Are they preferentially sited in already-zoned industrial areas?
- Do host communities capture tax base growth?

**Data Sources**:
- Zillow Home Value Index (ZHVI) by ZIP
- ATTOM property tax assessments
- Municipal zoning GIS layers (city-specific, requires scraping/FOIA)

---

### 9. Environmental Justice Scoring

**Approach**:
- Compare data center host tracts to EPA's EJScreen indices
- Add CalEnviroScreen-style burden/benefit framework
- Test if data centers increase cumulative environmental burdens

**Metrics**:
- Air quality (PM2.5, ozone)
- Hazardous waste proximity
- Superfund site proximity
- Heat island effect (LST from Landsat)
- Noise pollution (traffic, cooling systems)

**Expected Challenge**: Data centers may improve local metrics (compared to heavy industry) but increase water/energy consumption

---

## Policy & Political Economy Research

### 10. Tax Incentive Analysis

**Approach**:
- Compile state/local tax incentives for data center siting (property tax abatements, sales tax exemptions)
- Create `data_center_incentives` table with per-facility incentive details
- Correlate incentive generosity with:
  - State fiscal health
  - Local government bargaining power
  - Facility size/operator

**Research Questions**:
- Do weaker fiscal states offer larger incentives?
- Are incentives regressive (larger for hyperscalers)?
- Do incentives predict siting decisions (natural experiment approach)?

**Data Sources**:
- Good Jobs First Subsidy Tracker
- State economic development agency press releases
- Local news archives

---

### 11. Employment & Labor Market Effects

**Approach**:
- Join to BLS Quarterly Census of Employment and Wages (QCEW) by ZIP/county
- Identify "data center construction boom" periods (before/after major facility openings)
- Analyze employment effects in:
  - Construction (NAICS 23)
  - Transportation/warehousing (NAICS 48-49)
  - Professional services (NAICS 54)

**Research Questions**:
- Do data centers create durable local employment?
- Are jobs filled by local residents or commuters?
- What are the wage effects in host tracts?

**Data Sources**:
- BLS QCEW
- Census LEHD Origin-Destination Employment Statistics (LODES)

---

### 12. Energy Cost Pass-Through

**Approach**:
- Join to state-level electricity rate data (EIA, utility rate tracker)
- Test if data center density correlates with residential rate increases
- Natural experiment: Compare rate trajectories in high-DC vs. low-DC states

**Research Questions**:
- Do data centers drive residential rate increases (capacity cost allocation)?
- Are rate increases concentrated in utility service territories with large data center loads?
- Do states with retail choice (deregulated markets) see different effects?

**Data Sources**:
- EIA Form 861 (retail rates by state/utility)
- Utility rate case filings (state public utility commissions)

---

## Data Quality & Infrastructure Improvements

### 13. Address Validation & Geocoding Refinement

**Approach**:
- Re-geocode the 45 facilities using city-precision fallback
- Use USPS address validation API
- Cross-reference with Google Maps satellite imagery (manual review)

**Implementation**:
```bash
# Re-run geocoding with stricter thresholds
python3 load_postgis_data_centers.py --revalidate-addresses
```

---

### 14. OSM Continuous Monitoring

**Approach**:
- Set up automated Overpass API queries (daily/weekly)
- Detect new OSM data center tags
- Alert for review + merge into `master_data_centers`

**Implementation**:
- Cron job running `load_postgis_osm_data_centers.py --update-only`
- Slack/email notification on new facilities

---

### 15. Broadband Speed Validation

**Approach**:
- Cross-reference FCC BDC provider data with Ookla Speedtest results
- Test if data center host tracts have faster actual speeds (not just availability)

**Hypothesis**: Data center presence correlates with infrastructure investment → higher speeds

**Data Sources**:
- Ookla Open Data (aggregated Speedtest results by tile)
- FCC Measuring Broadband America

---

## Visualization & Communication

### 16. Interactive Story Map

**Approach**:
- Build Scrollama.js narrative map
- Sections:
  1. National overview (cluster map)
  2. Ashburn VA zoom (dominance of single region)
  3. Demographics comparison (host vs. national)
  4. Water stress hot spots
  5. Energy infrastructure saturation

**Output**: `story_map.html` (standalone web page)

---

### 17. Policy Brief Generation

**Approach**:
- Auto-generate policy briefs from analysis outputs
- Targeted audiences:
  - State legislators (energy/water policy)
  - Local governments (tax incentive negotiation)
  - Environmental justice advocates

**Template**:
```markdown
# Data Center Siting in [STATE]: Key Facts for Policymakers

- **[STATE] hosts X% of US data centers** (rank: #Y)
- **Host communities are Z% wealthier** than state average
- **A% of state generation is within 50 km of a data center**
- **Top watershed holds B facilities** (water stress: [HIGH/MEDIUM/LOW])
```

---

### 18. Comparative International Analysis

**Approach**:
- Extend methodology to EU, Canada, Australia
- Compare siting patterns (e.g., Nordic countries = renewable energy, cold climate)
- Test if "concentrated costs / dispersed benefits" holds internationally

**Data Sources**:
- OpenStreetMap (global coverage)
- Eurostat demographics
- IEA energy data
- TeleGeography global cable data (already available)

**Research Questions**:
- Are US patterns unique (tax-driven siting) vs. EU (regulatory constraints)?
- Do Nordic countries see more equitable distribution?

---

## Speculative / Long-Term Ideas

### 19. AI Demand Forecasting

**Approach**:
- Train ML model to predict data center siting
- Features: demographics, energy capacity, fiber proximity, tax rates, water availability
- Test on historical data (train on pre-2015, test on 2015-2025)

**Use Case**:
- Identify "likely future sites" for proactive policy intervention
- Warn communities of potential incoming projects

---

### 20. Cooling Technology Analysis

**Approach**:
- Classify facilities by cooling type (air, water, hybrid)
- Correlate with:
  - Climate (CDD: cooling degree days)
  - Water availability
  - Facility size

**Data Sources**:
- Manual classification from news/press releases
- FOIA requests to water utilities (cooling water withdrawal permits)

**Research Questions**:
- Are water-cooled facilities concentrated in water-stressed regions (paradox)?
- Do hyperscalers use more efficient cooling (e.g., Meta's Prineville OR evaporative cooling)?

---

### 21. Bitcoin Mining Facilities

**Approach**:
- Overlay cryptocurrency mining facilities (subset of "data centers")
- Compare siting patterns: Bitcoin mines prefer low electricity costs (WA, TX, NY hydro)
- Test if Bitcoin mines face more opposition (negative perception)

**Data Sources**:
- Cambridge Bitcoin Electricity Consumption Index (has facility locations)
- News archives of mining farm proposals/rejections

---

### 22. Disaster Resilience & Redundancy

**Approach**:
- Model simultaneous hazard exposure across data center clusters
- E.g., "What % of US data centers are in wildfire risk zones?"
- Identify single points of failure (e.g., Ashburn VA = 20% of US capacity)

**Research Questions**:
- Is the current spatial distribution resilient to climate change?
- Should policy incentivize geographic diversification?

**Output**: `disaster_resilience_report.md`

---

### 23. Edge Data Center Network

**Approach**:
- Separately analyze edge facilities (<1 MW) vs. hyperscale (>100 MW)
- Test if edge DCs follow different siting logic (population density > energy cost)

**Data Challenge**: Current inventory does not distinguish edge vs. hyperscale (need `power_mw` backfill)

---

### 24. Carbon Intensity of Host Grids

**Approach**:
- Join to EPA eGRID subregion carbon intensity (lb CO₂/MWh)
- Calculate per-facility estimated carbon footprint (if `power_mw` available)
- Compare to corporate renewable energy procurement (RECs, PPAs)

**Research Questions**:
- Are data centers disproportionately in high-carbon grids?
- Do hyperscaler renewable commitments offset grid carbon?

**Data Sources**:
- EPA eGRID
- Corporate sustainability reports (Google, Microsoft, Meta, AWS)

---

## Collaboration Opportunities

### Academic Partnerships
- **Energy researchers**: Joint analysis of grid saturation + IM3 projections
- **Environmental justice scholars**: EJScreen overlay + opposition case studies
- **Political scientists**: Tax incentive analysis + local government bargaining power

### Policy Stakeholders
- **State energy offices**: Share grid saturation maps
- **Water resource agencies**: Watershed analysis for permitting
- **Local governments**: Demographic/tax revenue analysis for negotiation leverage

### Industry Engagement
- **Data center operators**: Validate facility data, discuss siting criteria
- **Colocation providers**: Access to tenant mix data (multi-tenant vs. single-tenant)

---

## Tools & Infrastructure Improvements

### Database Enhancements
- Add `version` column to track data vintage
- Implement `audit_log` table for data lineage
- Set up automated backups to S3/Azure Blob

### Code Quality
- Add unit tests for geocoding functions
- Create `config.yaml` for database credentials (replace hardcoded env vars)
- Dockerize analysis environment for reproducibility

### Documentation
- Add docstrings to all Python functions
- Create `CONTRIBUTING.md` for external collaborators
- Record Jupyter notebook walkthroughs (video tutorials)

---

## Unfunded / Ambitious Ideas

### 25. Real-Time Energy Monitoring
- Partner with utility to get live load data from data center substations
- Build dashboard showing real-time energy consumption by facility
- Correlate with AWS/Azure/GCP service outages (reverse-engineer capacity from brownouts)

### 26. Social Media Sentiment Analysis
- Scrape Twitter/Reddit for mentions of local data center projects
- NLP sentiment analysis: support vs. opposition
- Correlate sentiment with facility approval outcomes

### 27. LIDAR Analysis of Cooling Infrastructure
- Use aerial LIDAR to measure rooftop cooling equipment volume
- Proxy for facility size (cooling = f(IT load))
- Build predictive model: cooling equipment → power capacity

---

## Contact & Contributions

If you're interested in collaborating on any of these research directions, please contact the repository owner.

**Priorities for external collaboration**:
1. Power capacity data acquisition
2. Water stress/drought overlay
3. Opposition cases database compilation
4. International comparative analysis

---

## References for Future Work

### Data Sources to Explore
- **Department of Energy**: Grid resilience reports, interconnection queues
- **NREL**: Renewable energy potential by HUC (solar, wind)
- **USDA**: Agricultural water use by county (competition for water)
- **NOAA**: Climate normals + projections by grid cell
- **BLS**: QCEW employment data, wage data
- **EPA**: eGRID, EJScreen, Superfund sites

### Academic Literature Gaps
- Limited peer-reviewed research on data center spatial concentration
- No published studies on water stress exposure of data centers
- Opportunity for "first mover" publication in major geography/planning journals

### Policy Levers to Investigate
- State renewable portfolio standards (RPS) → data center siting
- Federal infrastructure investment (IRA, CHIPS Act) → energy grid capacity
- Local zoning reform (industrial land use restrictions)

---

**Last Updated**: May 2026
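
As an illustration of the capacity-weighted concentration metric proposed in item 1 (and needed again for the edge vs. hyperscale split in item 23), the following is a minimal sketch rather than the project's actual implementation. The function name `capacity_weighted_hhi`, the `(tract_id, power_mw)` input shape, and the sample tract IDs are all assumptions made for this example. Facilities without a usable `power_mw` value are skipped, which matters given how sparse that column currently is.

```python
from collections import defaultdict


def capacity_weighted_hhi(facilities):
    """Herfindahl-Hirschman Index of MW capacity shares across tracts.

    facilities: iterable of (tract_id, power_mw) pairs (hypothetical shape).
    Rows whose power_mw is None or non-positive are ignored.
    Returns a float in (0, 1], or None if no usable capacity is present.
    """
    mw_by_tract = defaultdict(float)
    for tract_id, power_mw in facilities:
        if power_mw is not None and power_mw > 0:
            mw_by_tract[tract_id] += power_mw

    total_mw = sum(mw_by_tract.values())
    if total_mw == 0:
        return None

    # Sum of squared capacity shares: 1.0 means all capacity sits in one tract.
    return sum((mw / total_mw) ** 2 for mw in mw_by_tract.values())


# Made-up sample: two tracts with 100 MW each, plus one facility lacking power_mw.
sample = [
    ("tract_a", 60.0),
    ("tract_a", 40.0),
    ("tract_b", 100.0),
    ("tract_c", None),
]
print(capacity_weighted_hhi(sample))
```

Because facilities with missing capacity are dropped, an index computed on today's 108 facilities would describe only that subset; it would be worth reporting the coverage alongside the value until the backfill is done.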