data-centers/research-ideas.md
dadams 4525ea3f97 Add LegiScan legislation ingestion and analysis queries
Adds ingest_legiscan.py to pull all US state + federal bills (2016-2026)
from the LegiScan API into legiscan_sessions and legiscan_bills tables.
Bills are keyword-tagged across 8 research categories (data_center,
ratepayer_protection, large_load, grid_impact, tax_incentive, etc.).
Loads ~1.3M bills; ~60K tagged relevant. Adds query_legiscan_bills.sql
with pre-built analysis queries including state/DC joins. Updates
database-tables.md, README.md, and research-ideas.md accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-27 21:30:31 -07:00

Research Ideas & Future Work

This document outlines potential research directions, data improvements, and analyses that could extend the current US Data Centers geospatial research infrastructure.

Priority Next Steps

1. Backfill Power Capacity Data

Status: Only 5.9% of facilities have power_mw values (108/1,833)

Approach:

  • Scrape Baxtel data center database (requires paid subscription)
  • Use Data Center Map API/scraping
  • Cross-reference with utility interconnection queue filings
  • FOIA requests to state utility commissions for large loads

Research Impact:

  • Enable capacity-weighted concentration metrics (current analyses are facility-count only)
  • Correlate power capacity with demographic/environmental variables
  • Identify "hyperscale" facilities (>100 MW) vs. edge/enterprise (<10 MW)

Implementation:

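# Add capacity-weighted HHI calculation to analyze_dc_tract_concentration.py
# tract_capacities: summed power_mw per tract (only facilities with a known value)
total_mw = sum(tract_capacities)
capacity_weighted_hhi = sum((mw_i / total_mw) ** 2 for mw_i in tract_capacities)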
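
A minimal sketch of how the facility-count and capacity-weighted indices can diverge. The tract IDs and MW values below are made up for illustration, and `hhi` is a hypothetical helper, not an existing function in analyze_dc_tract_concentration.py.

```python
def hhi(values):
    """Herfindahl-Hirschman Index on a 0-1 scale: the sum of squared shares."""
    values = list(values)
    total = sum(values)
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return sum((v / total) ** 2 for v in values)


# Hypothetical per-tract inputs: facility counts and summed power_mw.
facility_counts = {"tract_a": 4, "tract_b": 4, "tract_c": 2}
tract_capacities = {"tract_a": 20.0, "tract_b": 30.0, "tract_c": 150.0}

count_hhi = hhi(facility_counts.values())
capacity_weighted_hhi = hhi(tract_capacities.values())

print(f"count-based HHI:       {count_hhi:.3f}")
print(f"capacity-weighted HHI: {capacity_weighted_hhi:.3f}")
```

In this toy case the tract with the fewest facilities holds most of the capacity, so the capacity-weighted index is much higher than the count-based one. With only 108 of 1,833 facilities carrying power_mw, a real run would cover a small and possibly unrepresentative subset until the backfill is done.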

2. Operator Name Deduplication

Status: String fragmentation inflates operator counts ("Meta" vs. "Meta, Inc.", AWS variants)

Approach:

  • Create operator_mapping table with canonical names
  • Use fuzzy matching (e.g., fuzzywuzzy library) to standardize
  • Add operator_canonical column to master_data_centers

Research Impact:

  • Accurate hyperscaler market share analysis
  • Study operator-specific siting strategies (AWS hydro, Microsoft nuclear, Meta solar)
  • Enable "operator power" political economy analyses

Script:

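# operators_dedupe.py
import os

import pandas as pd
from fuzzywuzzy import process  # fuzzy matcher for names the manual map misses
from sqlalchemy import create_engine

# Assumption: DATABASE_URL holds the PostGIS connection string
conn = create_engine(os.environ["DATABASE_URL"])

# Load unique operators
operators = pd.read_sql("SELECT DISTINCT operator FROM master_data_centers", conn)

# Manual + fuzzy matching to canonical names
canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
    # ... etc.
}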
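
One way the manual map and a fuzzy pass could fit together is sketched below. It uses the standard library's `difflib` as a stand-in for fuzzywuzzy so the example has no third-party dependencies. The `normalize` and `canonicalize` helpers and the 0.85 cutoff are assumptions for illustration, not part of the existing codebase.

```python
import difflib
import re

canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
}


def normalize(name):
    """Lowercase and drop punctuation so 'Meta, Inc.' and 'META INC' compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


# Reverse lookup: normalized alias -> canonical operator name.
alias_to_canonical = {
    normalize(alias): canonical
    for canonical, aliases in canonical_map.items()
    for alias in aliases
}


def canonicalize(raw_name, cutoff=0.85):
    """Return the canonical operator, or None when no alias is close enough."""
    key = normalize(raw_name)
    if key in alias_to_canonical:  # exact hit after normalization
        return alias_to_canonical[key]
    close = difflib.get_close_matches(key, list(alias_to_canonical), n=1, cutoff=cutoff)
    return alias_to_canonical[close[0]] if close else None


for raw in ["META INC", "Amazon Web Servces", "Equinix"]:
    print(raw, "->", canonicalize(raw))
```

Names that return `None` are candidates for manual review rather than automatic assignment, which keeps the fuzzy pass from silently merging distinct operators.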

3. Water Stress Overlay

Status: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities

Priority: HIGH - Critical for environmental impact analysis

Approach:

  • Join to USGS WaterWatch streamflow data
  • Add USGS Drought Watch indicators by HUC8
  • Correlate data center density with:
    • Groundwater depletion rates
    • Surface water withdrawal permits
    • Drought frequency/severity (USDM historical data)

Key Watersheds for Focus:

  • Middle Potomac-Catoctin (HUC8 02070008): 235 DCs (12.8% of US total) - Loudoun/Ashburn
  • Middle Potomac-Anacostia-Occoquan (02070010): 111 DCs - Fairfax/inner Loudoun
  • Coyote (18050003): 88 DCs - Silicon Valley
  • Upper Scioto (05060001): 73 DCs - Columbus OH
  • Umatilla (17070103): 29 DCs - AWS-exclusive watershed
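
The shares for the watersheds listed above can be reproduced with a few lines of arithmetic. This sketch assumes the 1,833-facility total from the power capacity status line is the denominator; the 12.8% figure for Middle Potomac-Catoctin is consistent with that.

```python
TOTAL_FACILITIES = 1833  # assumed denominator: total facilities in the inventory

# (HUC8 name, data center count) as listed above
key_watersheds = [
    ("Middle Potomac-Catoctin", 235),
    ("Middle Potomac-Anacostia-Occoquan", 111),
    ("Coyote", 88),
    ("Upper Scioto", 73),
    ("Umatilla", 29),
]

cumulative = 0
for name, count in key_watersheds:
    cumulative += count
    share = 100 * count / TOTAL_FACILITIES
    cumulative_share = 100 * cumulative / TOTAL_FACILITIES
    print(f"{name:<36} {count:>4}  {share:5.1f}%  cumulative {cumulative_share:5.1f}%")
```

The cumulative column covers only the five focus watersheds, which are not necessarily the five largest, so it should not be read as a top-5 concentration figure.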

Research Questions:

  • Are data centers sited in water-stressed watersheds?
  • Do high-density clusters (Loudoun County, Columbus OH) face water constraints?
  • Compare water stress in hyperscaler non-metro sites (Columbia River corridor) vs. metro clusters
  • Does single-operator watershed capture (Umatilla = AWS only) correlate with water availability?

Tables to Create:

  • watershed_water_stress - HUC8-level water stress indicators
  • data_center_water_risk - Per-facility water-stress exposure

Notebook: water_stress_analysis.ipynb


4. Opposition Cases Overlay

Status: 18 geocoded opposition cases in opposition_cases_geocoded table (loaded but unused)

Approach:

  • Expand dataset: Compile additional cases of rejected/delayed data center proposals from news archives
  • Geocode all opposition cases, join to demographics/hazards
  • Test hypotheses:
    • Do wealthier/more educated communities successfully block projects?
    • Are opposition cases more common in water-stressed or drought-prone areas?
    • Do smaller non-metro communities have less bargaining power?
    • Does clustered vs. isolated location predict opposition likelihood?

Research Questions:

  • What predicts opposition success?
  • Are opposition cases spatially clustered?
  • Do demographics differ between accepted vs. rejected sites?
  • Correlation with FEMA hazard exposure scores?

Analysis Plan:

-- Join opposition cases to demographics
SELECT o.*, ct.median_household_income, ct.bachelors_or_higher_pct
FROM opposition_cases_geocoded o
JOIN _dc_census_tract_acs_2024 ct 
  ON ST_Contains(ct.geom, o.geom);

Output: opposition_cases_analysis.md


5. IM3 Forward Projection Integration

Status: IM3 model data loaded in im3_state_projected_moderate_50 (328 rows) and im3_projected_state_demand_summary (31 rows)

Approach:

  • Load IM3 projected demand scenarios (2030, 2040, 2050)
  • Overlay projected growth with:
    • Current grid saturation (% of generation within 50 km)
    • Water stress indicators
    • Land availability (zoned industrial parcels)
  • Identify regions where projected demand may exceed infrastructure capacity

Grid Saturation Context (from current analysis):

  • New Jersey: 83% of state generation within 50 km of a data center
  • Nevada: 75%
  • Tennessee: 70%
  • Oregon: 68%
  • Arizona: 56%
  • Virginia: 50%

Research Questions:

  • Which states face grid saturation from data center growth?
  • Are projected sites in water-stressed watersheds?
  • Does IM3 assume spatial distribution patterns consistent with current siting?
  • Can states with >50% grid saturation accommodate projected demand?

Implementation:

-- Compare current saturation to IM3 projected demand
SELECT 
  sat.state,
  sat.dc_count,
  sat.pct_grid_saturated,
  proj.facility_count AS projected_new_facilities,
  proj.total_it_mw AS projected_new_mw
FROM state_grid_saturation sat
JOIN im3_projected_state_demand_summary proj ON sat.state = proj.state
WHERE sat.pct_grid_saturated > 50
ORDER BY sat.pct_grid_saturated DESC;
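
The threshold in the query is worth checking against the saturation figures listed earlier in this section. The sketch below applies the same filter in plain Python; `states_over` is a hypothetical helper, and the percentages are the rounded values from the list, so the unrounded database values may behave differently at the boundary.

```python
# Percent of state generation within 50 km of a data center (rounded figures from above).
pct_grid_saturated = {
    "New Jersey": 83,
    "Nevada": 75,
    "Tennessee": 70,
    "Oregon": 68,
    "Arizona": 56,
    "Virginia": 50,
}


def states_over(threshold, inclusive=False):
    """Return states above the threshold, optionally including exact matches."""
    if inclusive:
        return [s for s, pct in pct_grid_saturated.items() if pct >= threshold]
    return [s for s, pct in pct_grid_saturated.items() if pct > threshold]


strict = states_over(50)                  # mirrors WHERE pct_grid_saturated > 50
relaxed = states_over(50, inclusive=True)  # mirrors WHERE pct_grid_saturated >= 50

print("strict :", strict)
print("relaxed:", relaxed)
```

With the rounded figures, a strict `> 50` filter drops Virginia, the state with the largest cluster. Whether that is intended depends on how the ">50% grid saturation" research question is meant to be read.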

Notebook: im3_projection_overlay.ipynb


Methodological Extensions

6. Time-Series Analysis of Cluster Growth

Approach:

  • Use rfs_year (ready for service) from cable data and EIA generator vintage
  • Reconstruct data center siting over time (requires RFS dates for facilities)
  • Animate cluster formation in interactive map

Research Questions:

  • Did Ashburn VA become dominant before or after major cable landings?
  • Do clusters grow via agglomeration (new facilities near existing) or simultaneous build-out?
  • Correlation between energy infrastructure build-out and data center growth

Data Needed:

  • Facility RFS dates (scrape from press releases, Baxtel historical data)
  • Historical tract demographics (decennial Census + ACS back to 2000)

7. Network Effects: Fiber Route Proximity

Status: Current analysis tests submarine cable proximity (negative result)

Approach:

  • Obtain fiber optic backbone route GIS data (from FCC, carriers, or Infrapedia)
  • Test proximity to long-haul fiber routes (not just submarine cables)
  • Hypothesis: Data centers cluster near terrestrial long-haul fiber routes, not submarine cable landings
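
A rough sketch of the proximity measure, assuming fiber routes arrive as lists of (lat, lon) vertices. In the database this would more likely be a PostGIS distance query; the pure-Python version below only illustrates the calculation. The coordinates are hypothetical, and `nearest_vertex_km` is an illustrative helper.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_vertex_km(point, route_vertices):
    """Distance from a facility to the closest vertex of a fiber route."""
    lat, lon = point
    return min(haversine_km(lat, lon, vlat, vlon) for vlat, vlon in route_vertices)


# Hypothetical facility and route coordinates, for illustration only.
facility = (39.04, -77.49)
route = [(38.90, -77.04), (39.00, -77.40), (39.29, -76.61)]

print(f"nearest route vertex: {nearest_vertex_km(facility, route):.1f} km")
```

Measuring to vertices overstates the distance when a route segment passes closer than either of its endpoints, so densified routes or a true point-to-line distance would be needed for the actual test.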

Data Sources:

  • FCC Form 477 fiber deployment data
  • Infrapedia fiber route database
  • State-level fiber maps (e.g., Virginia Broadband Map)

Expected Result: Positive correlation (unlike submarine cables)


8. Land Use & Zoning Analysis

Approach:

  • Join data centers to local zoning classifications (industrial, commercial, etc.)
  • Analyze land prices in data center tracts before/after facility construction
  • Correlate with property tax revenues

Research Questions:

  • Do data centers drive local property value increases?
  • Are they preferentially sited in already-zoned industrial areas?
  • Do host communities capture tax base growth?

Data Sources:

  • Zillow Home Value Index (ZHVI) by ZIP
  • ATTOM property tax assessments
  • Municipal zoning GIS layers (city-specific, requires scraping/FOIA)

9. Environmental Justice Scoring

Approach:

  • Compare data center host tracts to EPA's EJScreen indices
  • Add CalEnviroScreen-style burden/benefit framework
  • Test if data centers increase cumulative environmental burdens

Metrics:

  • Air quality (PM2.5, ozone)
  • Hazardous waste proximity
  • Superfund site proximity
  • Heat island effect (LST from Landsat)
  • Noise pollution (traffic, cooling systems)

Expected Challenge: Data centers may improve local metrics (compared to heavy industry) but increase water/energy consumption


Policy & Political Economy Research

10. Tax Incentive Analysis

Approach:

  • Compile state/local tax incentives for data center siting (property tax abatements, sales tax exemptions)
  • Create data_center_incentives table with per-facility incentive details
  • Correlate incentive generosity with:
    • State fiscal health
    • Local government bargaining power
    • Facility size/operator

Research Questions:

  • Do weaker fiscal states offer larger incentives?
  • Are incentives regressive (larger for hyperscalers)?
  • Do incentives predict siting decisions (natural experiment approach)?

Data Sources:

  • Good Jobs First Subsidy Tracker
  • State economic development agency press releases
  • Local news archives

11. Employment & Labor Market Effects

Approach:

  • Join to BLS Quarterly Census of Employment and Wages (QCEW) by county (QCEW is not published at ZIP level)
  • Identify "data center construction boom" periods (before/after major facility openings)
  • Analyze employment effects in:
    • Construction (NAICS 23)
    • Transportation/warehousing (NAICS 48-49)
    • Professional services (NAICS 54)

Research Questions:

  • Do data centers create durable local employment?
  • Are jobs filled by local residents or commuters?
  • Wage effects in host tracts?

Data Sources:

  • BLS QCEW
  • Census LEHD Origin-Destination Employment Statistics (LODES)

12. Energy Cost Pass-Through

Approach:

  • Join to state-level electricity rate data (EIA, utility rate tracker)
  • Test if data center density correlates with residential rate increases
  • Natural experiment: Compare rate trajectories in high-DC vs. low-DC states

Research Questions:

  • Do data centers drive residential rate increases (capacity cost allocation)?
  • Are rate increases concentrated in utility service territories with large data center loads?
  • Do states with retail choice (deregulated markets) see different effects?

Data Sources:

  • EIA Form 861 (retail rates by state/utility)
  • Utility rate case filings (state public utility commissions)

Data Quality & Infrastructure Improvements

13. Address Validation & Geocoding Refinement

Approach:

  • Re-geocode the 45 facilities currently located only by the city-precision fallback
  • Use USPS address validation API
  • Cross-reference with Google Maps satellite imagery (manual review)

Implementation:

# Re-run geocoding with stricter thresholds
python3 load_postgis_data_centers.py --revalidate-addresses

14. OSM Continuous Monitoring

Approach:

  • Set up automated Overpass API queries (daily/weekly)
  • Detect new OSM data center tags
  • Alert for review + merge into master_data_centers

Implementation:

  • Cron job running load_postgis_osm_data_centers.py --update-only
  • Slack/email notification on new facilities
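
The "detect new facilities" step could look roughly like the sketch below. It assumes Overpass JSON output, where each element carries `type`, `id`, and `tags`, and that known facilities are stored as (type, id) pairs. `find_new_facilities` is a hypothetical helper, and the `telecom=data_center` tag is an assumption about which OSM tag the loader queries.

```python
def find_new_facilities(known_osm_keys, overpass_elements):
    """Return Overpass elements whose (type, id) pair is not already known."""
    known = set(known_osm_keys)
    return [el for el in overpass_elements if (el["type"], el["id"]) not in known]


# Hypothetical state: keys already merged into master_data_centers.
known = {("way", 101), ("node", 202)}

# Hypothetical Overpass response elements.
fetched = [
    {"type": "way", "id": 101, "tags": {"telecom": "data_center"}},
    {"type": "way", "id": 303, "tags": {"telecom": "data_center", "operator": "Example Co"}},
    {"type": "node", "id": 101, "tags": {"telecom": "data_center"}},
]

new_elements = find_new_facilities(known, fetched)
for el in new_elements:
    print("needs review:", el["type"], el["id"], el["tags"].get("operator", "(no operator)"))
```

Keying on the (type, id) pair matters because OSM IDs are unique only within an element type: node 101 and way 101 are different objects.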

15. Broadband Speed Validation

Approach:

  • Cross-reference FCC BDC provider data with Ookla Speedtest results
  • Test if data center host tracts have faster actual speeds (not just availability)

Hypothesis: Data center presence correlates with infrastructure investment → higher speeds

Data Sources:

  • Ookla Open Data (aggregated Speedtest results by tile)
  • FCC Measuring Broadband America

Visualization & Communication

16. Interactive Story Map

Approach:

  • Build Scrollama.js narrative map
  • Sections:
    1. National overview (cluster map)
    2. Ashburn VA zoom (dominance of single region)
    3. Demographics comparison (host vs. national)
    4. Water stress hot spots
    5. Energy infrastructure saturation

Output: story_map.html (standalone web page)


17. Policy Brief Generation

Approach:

  • Auto-generate policy briefs from analysis outputs
  • Targeted audiences:
    • State legislators (energy/water policy)
    • Local governments (tax incentive negotiation)
    • Environmental justice advocates

Template:

# Data Center Siting in [STATE]: Key Facts for Policymakers

- **[STATE] hosts X% of US data centers** (rank: #Y)
- **Host communities are Z% wealthier** than state average
- **A% of state generation is within 50 km of a data center**
- **Top watershed holds B facilities** (water stress: [HIGH/MEDIUM/LOW])
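
A minimal sketch of filling the template above from analysis outputs, using the standard library's `string.Template`. The field names and the example values are placeholders invented for illustration, not results from the analysis.

```python
from string import Template

BRIEF_TEMPLATE = Template(
    "# Data Center Siting in $state: Key Facts for Policymakers\n"
    "\n"
    "- **$state hosts $share_pct% of US data centers** (rank: #$rank)\n"
    "- **Host communities are $wealth_gap_pct% wealthier** than state average\n"
    "- **$gen_within_50km_pct% of state generation is within 50 km of a data center**\n"
    "- **Top watershed holds $top_watershed_count facilities** (water stress: $water_stress)\n"
)


def render_brief(facts):
    """Fill the brief; raises KeyError if any field is missing."""
    return BRIEF_TEMPLATE.substitute(facts)


# Placeholder values only; real numbers would come from the analysis tables.
example_facts = {
    "state": "Example State",
    "share_pct": 3.2,
    "rank": 7,
    "wealth_gap_pct": 12,
    "gen_within_50km_pct": 41,
    "top_watershed_count": 18,
    "water_stress": "MEDIUM",
}

brief = render_brief(example_facts)
print(brief)
```

Using `substitute` rather than `safe_substitute` means a brief with a missing statistic fails loudly instead of shipping with an unfilled `$placeholder`.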

18. Comparative International Analysis

Approach:

  • Extend methodology to EU, Canada, Australia
  • Compare siting patterns (e.g., Nordic countries = renewable energy, cold climate)
  • Test if "concentrated costs / dispersed benefits" holds internationally

Data Sources:

  • OpenStreetMap (global coverage)
  • Eurostat demographics
  • IEA energy data
  • TeleGeography global cable data (already available)

Research Questions:

  • Are US patterns unique (tax-driven siting) vs. EU (regulatory constraints)?
  • Do Nordic countries see more equitable distribution?

Speculative / Long-Term Ideas

19. AI Demand Forecasting

Approach:

  • Train ML model to predict data center siting
  • Features: demographics, energy capacity, fiber proximity, tax rates, water availability
  • Test on historical data (train on pre-2015, test on 2015-2025)
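
The train/test design above amounts to a temporal split rather than a random one. A sketch follows, assuming each facility record has a ready-for-service year under a hypothetical `rfs_year` key (facility RFS dates are not yet in the inventory). The records are made up.

```python
def temporal_split(records, cutoff_year=2015, end_year=2025, year_key="rfs_year"):
    """Split records into (train, test): train before cutoff, test from cutoff through end_year."""
    train = [r for r in records if r[year_key] < cutoff_year]
    test = [r for r in records if cutoff_year <= r[year_key] <= end_year]
    return train, test


# Hypothetical facility records.
facilities = [
    {"id": 1, "rfs_year": 2008},
    {"id": 2, "rfs_year": 2014},
    {"id": 3, "rfs_year": 2015},
    {"id": 4, "rfs_year": 2021},
    {"id": 5, "rfs_year": 2026},  # outside the evaluation window, dropped from both sets
]

train, test = temporal_split(facilities)
print("train ids:", [r["id"] for r in train])
print("test ids: ", [r["id"] for r in test])
```

Splitting by time keeps information from later siting decisions out of the training set. The features need the same care: demographics and energy capacity should be measured as of the training period, not today.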

Use Case:

  • Identify "likely future sites" for proactive policy intervention
  • Warn communities of potential incoming projects

20. Cooling Technology Analysis

Approach:

  • Classify facilities by cooling type (air, water, hybrid)
  • Correlate with:
    • Climate (CDD: cooling degree days)
    • Water availability
    • Facility size

Data Sources:

  • Manual classification from news/press releases
  • FOIA requests to water utilities (cooling water withdrawal permits)

Research Questions:

  • Are water-cooled facilities concentrated in water-stressed regions (paradox)?
  • Do hyperscalers use more efficient cooling (e.g., Meta's Prineville OR evaporative cooling)?

21. Bitcoin Mining Facilities

Approach:

  • Overlay cryptocurrency mining facilities (subset of "data centers")
  • Compare siting patterns: Bitcoin mines prefer low electricity costs (WA, TX, NY hydro)
  • Test if Bitcoin mines face more opposition (negative perception)

Data Sources:

  • Cambridge Bitcoin Electricity Consumption Index (mining map reports hashrate share by region; facility-level locations would need to come from other sources)
  • News archives of mining farm proposals/rejections

22. Disaster Resilience & Redundancy

Approach:

  • Model simultaneous hazard exposure across data center clusters
  • E.g., "What % of US data centers are in wildfire risk zones?"
  • Identify single points of failure (e.g., Ashburn VA = 20% of US capacity)

Research Questions:

  • Is the current spatial distribution resilient to climate change?
  • Should policy incentivize geographic diversification?

Output: disaster_resilience_report.md


23. Edge Data Center Network

Approach:

  • Separately analyze edge facilities (<1 MW) vs. hyperscale (>100 MW)
  • Test if edge DCs follow different siting logic (population density > energy cost)

Data Challenge: Current inventory does not distinguish edge vs. hyperscale (need power_mw backfill)


24. Carbon Intensity of Host Grids

Approach:

  • Join to EPA eGRID subregion carbon intensity (lb CO₂/MWh)
  • Calculate per-facility estimated carbon footprint (if power_mw available)
  • Compare to corporate renewable energy procurement (RECs, PPAs)
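
The per-facility footprint estimate reduces to a short formula: annual energy (MW × hours × utilization) times the subregion emission rate. The sketch below treats power_mw as average-capable load and applies a single assumed utilization factor of 0.6; both are simplifications, and the example inputs are hypothetical rather than eGRID values.

```python
HOURS_PER_YEAR = 8760
KG_PER_LB = 0.45359237  # exact by definition


def annual_co2_tonnes(power_mw, egrid_lb_per_mwh, utilization=0.6):
    """Estimate annual CO2 in metric tons from capacity and grid carbon intensity."""
    if not 0 <= utilization <= 1:
        raise ValueError("utilization must be between 0 and 1")
    annual_mwh = power_mw * HOURS_PER_YEAR * utilization
    return annual_mwh * egrid_lb_per_mwh * KG_PER_LB / 1000


# Hypothetical facility: 100 MW in a subregion emitting 800 lb CO2/MWh.
estimate = annual_co2_tonnes(power_mw=100, egrid_lb_per_mwh=800)
print(f"estimated footprint: {estimate:,.0f} t CO2/year")
```

The result is a location-based estimate. Comparing it with RECs and PPAs would require a separate market-based calculation, since purchased renewable attributes do not change the host grid's emission rate.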

Research Questions:

  • Are data centers disproportionately in high-carbon grids?
  • Do hyperscaler renewable commitments offset grid carbon?

Data Sources:

  • EPA eGRID
  • Corporate sustainability reports (Google, Microsoft, Meta, AWS)

Collaboration Opportunities

Academic Partnerships

  • Energy researchers: Joint analysis of grid saturation + IM3 projections
  • Environmental justice scholars: EJScreen overlay + opposition case studies
  • Political scientists: Tax incentive analysis + local government bargaining power

Policy Stakeholders

  • State energy offices: Share grid saturation maps
  • Water resource agencies: Watershed analysis for permitting
  • Local governments: Demographic/tax revenue analysis for negotiation leverage

Industry Engagement

  • Data center operators: Validate facility data, discuss siting criteria
  • Colocation providers: Access to tenant mix data (multi-tenant vs. single-tenant)

Tools & Infrastructure Improvements

Database Enhancements

  • Add version column to track data vintage
  • Implement audit_log table for data lineage
  • Set up automated backups to S3/Azure Blob

Code Quality

  • Add unit tests for geocoding functions
  • Create config.yaml for database credentials (replace hardcoded env vars)
  • Dockerize analysis environment for reproducibility

Documentation

  • Add structured docstrings (e.g., Google or NumPy style) to all Python functions
  • Create CONTRIBUTING.md for external collaborators
  • Record Jupyter notebook walkthroughs (video tutorials)

Unfunded / Ambitious Ideas

25. Real-Time Energy Monitoring

  • Partner with utility to get live load data from data center substations
  • Build dashboard showing real-time energy consumption by facility
  • Correlate with AWS/Azure/GCP service outages (reverse-engineer capacity from brownouts)

26. Social Media Sentiment Analysis

  • Scrape Twitter/Reddit for mentions of local data center projects
  • NLP sentiment analysis: support vs. opposition
  • Correlate sentiment with facility approval outcomes

27. LIDAR Analysis of Cooling Infrastructure

  • Use aerial LIDAR to measure rooftop cooling equipment volume
  • Proxy for facility size (cooling = f(IT load))
  • Build predictive model: cooling equipment → power capacity

Contact & Contributions

If you're interested in collaborating on any of these research directions, please contact the repository owner.

Priorities for external collaboration:

  1. Power capacity data acquisition
  2. Water stress/drought overlay
  3. Opposition cases database compilation
  4. International comparative analysis

References for Future Work

Data Sources to Explore

  • Department of Energy: Grid resilience reports, interconnection queues
  • NREL: Renewable energy potential by HUC (solar, wind)
  • USDA: Agricultural water use by county (competition for water)
  • NOAA: Climate normals + projections by grid cell
  • BLS: QCEW employment data, wage data
  • EPA: eGRID, EJScreen, Superfund sites

Academic Literature Gaps

  • Limited peer-reviewed research on data center spatial concentration
  • No published studies on water stress exposure of data centers
  • Opportunity for "first mover" publication in major geography/planning journals

Policy Levers to Investigate

  • State renewable portfolio standards (RPS) → data center siting
  • Federal infrastructure investment (IRA, CHIPS Act) → energy grid capacity
  • Local zoning reform (industrial land use restrictions)

Legislative Analysis (LegiScan Data)

Status: Data loaded. ~1.3M bills across all US states + federal, 2016-2026; ~60K tagged relevant.
Tables: legiscan_sessions, legiscan_bills
Query file: query_legiscan_bills.sql

Research Questions

1. Ratepayer Cost Shifting
Do states with high data center density show more legislative activity on ratepayer protection?

  • Join legiscan_bills WHERE 'ratepayer_protection' = ANY(relevance_tags) to master_data_centers counts by state
  • Test correlation between DC concentration and # of ratepayer bills introduced/passed
  • Compare outcomes: do high-DC states pass or fail more ratepayer protections?

2. Data Center Legislative Wave
Is there a measurable increase in DC-specific legislation after 2022 (AI boom)?

  • Trend data_center and large_load tagged bills by year_start
  • Cross-reference with major AI facility announcements (2022-2025)

3. Tax Incentive Geography
Which states enacted tax incentives that may have influenced DC location decisions?

  • tax_incentive bills with status IN (4,8) (passed/chaptered)
  • Overlay with master_data_centers growth by state over the same period
  • Candidate for difference-in-differences analysis
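
For reference, the basic two-group, two-period difference-in-differences estimate is the change in the treated group minus the change in the control group. The sketch below uses made-up data center counts per state. A real analysis would need a regression with fixed effects, staggered enactment dates, and a parallel-trends check.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(treated change) minus (control change), using group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical data center counts per state, before and after incentive enactment.
treated_pre, treated_post = [10, 12], [18, 20]   # states that passed a tax_incentive bill
control_pre, control_post = [9, 11], [12, 14]    # states that did not

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(f"DiD estimate: {effect:+.1f} facilities per state")
```

In this toy case the treated states gain 8 facilities on average and the controls gain 3, so the estimate attributes 5 to the incentive, under the assumption that both groups would otherwise have followed the same trend.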

4. Grid Interconnection Policy
Do states with grid_impact legislation show different EIA capacity expansion patterns?

  • Join relevant bills to energy_eia_operating_generator_capacity_flat by state
  • Look for correlations between grid policy activity and nameplate MW additions

5. Siting Preemption vs. Local Control
Are states passing bills to streamline or restrict local siting authority?

  • Full-text search within siting_permitting bills for "preemption" vs. "local control"
  • Map bill outcomes by state political environment (cross-ref RDH vote data)
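
A first pass at the "preemption" vs. "local control" split could be a keyword classifier over bill text, sketched below. The regular expressions, including the extra variants such as "pre-empt" and "local authority", are assumptions about useful search terms. The sample sentences are invented, and `classify_bill_text` is a hypothetical helper.

```python
import re

PATTERNS = {
    "preemption": re.compile(r"\bpre-?empt(?:s|ed|ion|ing)?\b", re.IGNORECASE),
    "local_control": re.compile(r"\blocal\s+(?:control|authority|zoning)\b", re.IGNORECASE),
}


def classify_bill_text(text):
    """Return the set of labels whose pattern appears in the bill text."""
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}


# Invented sample sentences, not real bill text.
samples = [
    "The state preempts municipal siting ordinances for large-load facilities.",
    "Nothing in this act limits local control over data center zoning.",
    "An act relating to sales tax exemptions.",
]

for text in samples:
    print(sorted(classify_bill_text(text)) or ["unlabeled"], "-", text)
```

Keyword hits only flag candidates: a bill that mentions preemption may be restricting it, so direction (streamline vs. restrict) still needs manual coding or a more careful model.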

Suggested Joins

-- States with DCs and legislative activity by topic
SELECT
    dc.state,
    COUNT(DISTINCT dc.id)                                              AS data_centers,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'data_center'        = ANY(relevance_tags)) AS dc_bills,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'ratepayer_protection'= ANY(relevance_tags)) AS ratepayer_bills,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'tax_incentive'      = ANY(relevance_tags)
                                         AND lb.status IN (4,8))      AS tax_incentives_passed
FROM master_data_centers dc
LEFT JOIN legiscan_bills lb ON dc.state = lb.state AND lb.is_relevant
GROUP BY dc.state
ORDER BY data_centers DESC;

Last Updated: May 2026