dadams ee5856661a Reorganize project into scripts/, docs/, data/, output/ directories

Move all Python scripts to scripts/, documentation to docs/, raw input
data to data/, and generated HTML/CSV outputs to output/. Update path
references in 8 scripts to use Path(__file__).parent.parent as project
root so they work correctly from the new location. Update README links
and quick-start commands accordingly. Notebooks remain at root.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-27 21:57:47 -07:00

Research Ideas & Future Work

This document outlines potential research directions, data improvements, and analyses that could extend the current US Data Centers geospatial research infrastructure.

Priority Next Steps

1. Backfill Power Capacity Data

Status: Only 5.9% of facilities have power_mw values (108/1,833)

Approach:

Scrape Baxtel data center database (requires paid subscription)
Use Data Center Map API/scraping
Cross-reference with utility interconnection queue filings
FOIA requests to state utility commissions for large loads

Research Impact:

Enable capacity-weighted concentration metrics (current analyses are facility-count only)
Correlate power capacity with demographic/environmental variables
Identify "hyperscale" facilities (>100 MW) vs. edge/enterprise (<10 MW)

Implementation:

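# Add capacity-weighted HHI calculation to analyze_dc_tract_concentration.py
# tract_capacities: total power_mw per census tract, for tracts with known capacity
total_mw = sum(tract_capacities)
capacity_weighted_hhi = sum((mw_i / total_mw) ** 2 for mw_i in tract_capacities)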
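
A minimal worked sketch of the capacity-weighted HHI, assuming per-tract capacity totals are already available as a list of MW values. The function name, the handling of missing values, and the sample numbers are illustrative assumptions, not code from analyze_dc_tract_concentration.py.

```python
def capacity_weighted_hhi(tract_capacities):
    """Herfindahl-Hirschman index over capacity shares (0 < HHI <= 1)."""
    # Drop tracts with missing or non-positive capacity before computing shares.
    capacities = [mw for mw in tract_capacities if mw is not None and mw > 0]
    total_mw = sum(capacities)
    if total_mw == 0:
        raise ValueError("no tracts with known power_mw")
    return sum((mw / total_mw) ** 2 for mw in capacities)


# Hypothetical tract totals in MW; None marks a tract with no power_mw data.
example = [300.0, 100.0, 50.0, 50.0, None]
hhi = capacity_weighted_hhi(example)
print(round(hhi, 3))  # shares 0.6, 0.2, 0.1, 0.1 -> 0.42
```

With only 108 of 1,833 facilities reporting power_mw, an index computed this way describes the reporting subset only, so it is best read alongside the facility-count version until the backfill is done.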

2. Operator Name Deduplication

Status: String fragmentation inflates operator counts ("Meta" vs. "Meta, Inc.", AWS variants)

Approach:

Create operator_mapping table with canonical names
Use fuzzy matching (e.g., fuzzywuzzy library, now maintained as thefuzz) to standardize
Add operator_canonical column to master_data_centers

Research Impact:

Accurate hyperscaler market share analysis
Study operator-specific siting strategies (AWS hydro, Microsoft nuclear, Meta solar)
Enable "operator power" political economy analyses

Script:

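# operators_dedupe.py
import os

import pandas as pd
from sqlalchemy import create_engine
from fuzzywuzzy import process  # newer releases are published as "thefuzz"

# Database connection; DATABASE_URL is an assumed environment variable name
conn = create_engine(os.environ["DATABASE_URL"])

# Load unique operators
operators = pd.read_sql("SELECT DISTINCT operator FROM master_data_centers", conn)

# Manual + fuzzy matching to canonical names
canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
    # ... etc.
}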
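
One way the mapping step could work is sketched here: invert the canonical map into a variant lookup, try an exact match on a normalized name, then fall back to fuzzy matching. To keep the example dependency-free it uses the standard library's difflib in place of fuzzywuzzy; the normalize and canonicalize helpers and the 0.85 cutoff are assumptions for illustration.

```python
import difflib

# Same shape as canonical_map in operators_dedupe.py; entries are illustrative.
canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
}


def normalize(name):
    """Lowercase, strip commas/periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


# Invert to normalized variant -> canonical name.
variant_to_canonical = {
    normalize(variant): canonical
    for canonical, variants in canonical_map.items()
    for variant in variants
}


def canonicalize(name, cutoff=0.85):
    """Return the canonical operator name, or None if no confident match."""
    key = normalize(name)
    if key in variant_to_canonical:
        return variant_to_canonical[key]
    # Fuzzy fallback for near misses; the cutoff is an assumed starting point.
    close = difflib.get_close_matches(key, list(variant_to_canonical), n=1, cutoff=cutoff)
    return variant_to_canonical[close[0]] if close else None


print(canonicalize("META INC"))            # exact match after normalization
print(canonicalize("Amazon Web Service"))  # fuzzy match on a near miss
print(canonicalize("Equinix"))             # no entry in the map -> None
```

Returning None for unmatched names, rather than the closest guess, leaves a review queue of operators that still need a manual canonical entry.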

3. Water Stress Overlay

Status: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities

Priority: HIGH - Critical for environmental impact analysis

Approach:

Join to USGS WaterWatch streamflow data
Add USGS Drought Watch indicators by HUC8
Correlate data center density with:
- Groundwater depletion rates
- Surface water withdrawal permits
- Drought frequency/severity (USDM historical data)

Key Watersheds for Focus:

Middle Potomac-Catoctin (HUC8 02070008): 235 DCs (12.8% of US total) - Loudoun/Ashburn
Middle Potomac-Anacostia-Occoquan (02070010): 111 DCs - Fairfax/inner Loudoun
Coyote (18050003): 88 DCs - Silicon Valley
Upper Scioto (05060001): 73 DCs - Columbus OH
Umatilla (17070103): 29 DCs - AWS-exclusive watershed

Research Questions:

Are data centers sited in water-stressed watersheds?
Do high-density clusters (Loudoun County, Columbus OH) face water constraints?
Compare water stress in hyperscaler non-metro sites (Columbia River corridor) vs. metro clusters
Does single-operator watershed capture (Umatilla = AWS only) correlate with water availability?

Tables to Create:

watershed_water_stress - HUC8-level water stress indicators
data_center_water_risk - Per-facility water-stress exposure

Notebook: water_stress_analysis.ipynb
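
As a small worked example of the concentration figures above, the sketch below computes each focus watershed's share of the inventory and the running cumulative share. It assumes the 1,833-facility total from the power capacity status line is also the denominator for the watershed analysis; the function name is hypothetical.

```python
TOTAL_FACILITIES = 1833  # assumed denominator, taken from the power_mw status line

# Facility counts for the focus watersheds listed above, keyed by HUC8 code.
# Codes are kept as strings so leading zeros survive.
focus_watersheds = {
    "02070008": 235,  # Middle Potomac-Catoctin
    "02070010": 111,  # Middle Potomac-Anacostia-Occoquan
    "18050003": 88,   # Coyote
    "05060001": 73,   # Upper Scioto
    "17070103": 29,   # Umatilla
}


def cumulative_shares(counts, total):
    """Yield (huc8, share_pct, cumulative_pct), largest watershed first."""
    running = 0
    for huc8, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += n
        yield huc8, 100 * n / total, 100 * running / total


rows = list(cumulative_shares(focus_watersheds, TOTAL_FACILITIES))
for huc8, share, cumulative in rows:
    print(f"{huc8}  {share:5.1f}%  {cumulative:5.1f}%")
```

The same loop, run over all 257 watersheds and stopped when the cumulative share crosses 50%, would reproduce the "15 watersheds hold 50%" style of statistic.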

4. Opposition Cases Overlay

Status: 18 geocoded opposition cases in opposition_cases_geocoded table (loaded but unused)

Approach:

Expand dataset: Compile additional cases of rejected/delayed data center proposals from news archives
Geocode all opposition cases, join to demographics/hazards
Test hypotheses:
- Do wealthier/more educated communities successfully block projects?
- Are opposition cases more common in water-stressed or drought-prone areas?
- Do smaller non-metro communities have less bargaining power?
- Does clustered vs. isolated location predict opposition likelihood?

Research Questions:

What predicts opposition success?
Are opposition cases spatially clustered?
Do demographics differ between accepted vs. rejected sites?
Do opposition cases correlate with FEMA hazard exposure scores?

Analysis Plan:

-- Join opposition cases to demographics
SELECT o.*, ct.median_household_income, ct.bachelors_or_higher_pct
FROM opposition_cases_geocoded o
JOIN _dc_census_tract_acs_2024 ct 
  ON ST_Contains(ct.geom, o.geom);

Output: opposition_cases_analysis.md
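
With only 18 geocoded cases, a permutation test is one option for comparing demographics between opposition sites and other host tracts without leaning on large-sample assumptions. The sketch below tests a difference in medians; the function name and every income value are hypothetical, and the choice of test is an assumption rather than the project's method.

```python
import random
import statistics


def permutation_test_median_diff(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in medians."""
    rng = random.Random(seed)
    observed = statistics.median(group_a) - statistics.median(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = statistics.median(pooled[:n_a]) - statistics.median(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical tract median household incomes (USD), not project data.
opposition_tracts = [98_000, 112_000, 105_000, 121_000, 93_000, 134_000]
other_host_tracts = [71_000, 64_000, 88_000, 59_000, 76_000, 82_000, 69_000, 91_000]

diff, p_value = permutation_test_median_diff(opposition_tracts, other_host_tracts)
print(f"median difference: {diff:,.0f} USD, p = {p_value:.4f}")
```

The inputs would come from the join query above, one income value per opposition case and one per comparison tract.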

5. IM3 Forward Projection Integration

Status: IM3 model data loaded in im3_state_projected_moderate_50 (328 rows) and im3_projected_state_demand_summary (31 rows)

Approach:

Load IM3 projected demand scenarios (2030, 2040, 2050)
Overlay projected growth with:
- Current grid saturation (% of generation within 50 km)
- Water stress indicators
- Land availability (zoned industrial parcels)
Identify regions where projected demand may exceed infrastructure capacity

Grid Saturation Context (from current analysis):

New Jersey: 83% of grid within 50 km of DC
Nevada: 75%
Tennessee: 70%
Oregon: 68%
Arizona: 56%
Virginia: 50%

Research Questions:

Which states face grid saturation from data center growth?
Are projected sites in water-stressed watersheds?
Does IM3 assume spatial distribution patterns consistent with current siting?
Can states with >50% grid saturation accommodate projected demand?

Implementation:

-- Compare current saturation to IM3 projected demand
-- (alias "sat" rather than "current", which is a reserved word in standard SQL)
SELECT
  sat.state,
  sat.dc_count,
  sat.pct_grid_saturated,
  proj.facility_count AS projected_new_facilities,
  proj.total_it_mw AS projected_new_mw
FROM state_grid_saturation sat
JOIN im3_projected_state_demand_summary proj ON sat.state = proj.state
WHERE sat.pct_grid_saturated > 50
ORDER BY sat.pct_grid_saturated DESC;

Notebook: im3_projection_overlay.ipynb
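
The filter logic in the query can be sanity-checked offline with the saturation figures listed above. This sketch mirrors the inner join and the `pct_grid_saturated > 50` condition; the projected MW values are invented placeholders, not IM3 output, and flag_states is a hypothetical helper.

```python
# Grid saturation (% of generation within 50 km of a data center), from the list above.
grid_saturation_pct = {
    "New Jersey": 83, "Nevada": 75, "Tennessee": 70,
    "Oregon": 68, "Arizona": 56, "Virginia": 50,
}

# Hypothetical projected new IT load in MW; NOT values from the IM3 tables.
projected_new_mw = {"New Jersey": 400, "Nevada": 900, "Virginia": 5000, "Oregon": 1200}


def flag_states(saturation, projected, threshold=50, inclusive=False):
    """Return (state, saturation, projected MW) rows over the threshold.

    With inclusive=True, states exactly at the threshold are kept as well.
    """
    rows = []
    for state, pct in saturation.items():
        over = pct >= threshold if inclusive else pct > threshold
        # Inner-join semantics: skip states with no projection row.
        if over and state in projected:
            rows.append((state, pct, projected[state]))
    return sorted(rows, key=lambda r: r[1], reverse=True)


strict = flag_states(grid_saturation_pct, projected_new_mw)
inclusive = flag_states(grid_saturation_pct, projected_new_mw, inclusive=True)
print([s for s, _, _ in strict])     # Virginia is dropped by the strict comparison
print([s for s, _, _ in inclusive])  # Virginia is kept when the bound is inclusive
```

Two things fall out of the example: a strict `> 50` excludes a state stored at exactly 50, which matters if Virginia's listed 50% is not a rounded value, and the inner join silently drops saturated states that have no row in the projection table.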

Methodological Extensions

6. Time-Series Analysis of Cluster Growth

Approach:

Use rfs_year (ready for service) from cable data and EIA generator vintage
Reconstruct data center siting over time (requires RFS dates for facilities)
Animate cluster formation in interactive map

Research Questions:

Did Ashburn VA become dominant before or after major cable landings?
Do clusters grow via agglomeration (new facilities near existing) or simultaneous build-out?
How does energy infrastructure build-out correlate with data center growth over time?

Data Needed:

Facility RFS dates (scrape from press releases, Baxtel historical data)
Historical tract demographics (decennial Census + ACS back to 2000)
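
Once facility RFS years are available, the agglomeration question could be approached with a simple measure: the share of facilities that opened within some radius of a site already in service. The sketch below is one possible formulation; the 10 km radius, the record layout, and the sample coordinates are all assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a spherical Earth (R = 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def agglomeration_share(facilities, radius_km=10.0):
    """Share of facilities that opened within radius_km of an earlier facility.

    facilities: iterable of (rfs_year, lat, lon). Facilities from the earliest
    year have no predecessors and are excluded from the denominator.
    """
    ordered = sorted(facilities)
    first_year = ordered[0][0]
    near = later = 0
    for year, lat, lon in ordered:
        if year == first_year:
            continue
        later += 1
        # Only sites from strictly earlier years count as "existing".
        if any(y < year and haversine_km(lat, lon, la, lo) <= radius_km
               for y, la, lo in ordered):
            near += 1
    return near / later if later else float("nan")


# Hypothetical records: (rfs_year, lat, lon); coordinates are illustrative.
sample = [
    (2008, 39.04, -77.49),
    (2012, 39.02, -77.46),  # a few km from the 2008 site
    (2015, 39.96, -83.00),  # a different metro, no earlier neighbour
    (2018, 39.95, -82.99),  # close to the 2015 site
]
print(agglomeration_share(sample))
```

A high share suggests growth by agglomeration; facilities sharing an RFS year are not treated as neighbours of each other, which keeps simultaneous build-outs from being counted as agglomeration.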

7. Network Effects: Fiber Route Proximity

Status: Current analysis tests submarine cable proximity (negative result)

Approach:

Obtain fiber optic backbone route GIS data (from FCC, carriers, or Infrapedia)
Test proximity to long-haul fiber routes (not just submarine cables)
Hypothesis: Data centers cluster near terrestrial long-haul fiber routes, not submarine cable landings

Data Sources:

FCC Form 477 fiber deployment data (historical; superseded by the FCC Broadband Data Collection)
Infrapedia fiber route database
State-level fiber maps (e.g., Virginia Broadband Map)

Expected Result: Positive correlation (unlike submarine cables)

8. Land Use & Zoning Analysis

Approach:

Join data centers to local zoning classifications (industrial, commercial, etc.)
Analyze land prices in data center tracts before/after facility construction
Correlate with property tax revenues

Research Questions:

Do data centers drive local property value increases?
Are they preferentially sited in already-zoned industrial areas?
Do host communities capture tax base growth?

Data Sources:

Zillow Home Value Index (ZHVI) by ZIP
ATTOM property tax assessments
Municipal zoning GIS layers (city-specific, requires scraping/FOIA)

9. Environmental Justice Scoring

Approach:

Compare data center host tracts to EPA's EJScreen indices
Add CalEnviroScreen-style burden/benefit framework
Test if data centers increase cumulative environmental burdens

Metrics:

Air quality (PM2.5, ozone)
Hazardous waste proximity
Superfund site proximity
Heat island effect (LST from Landsat)
Noise pollution (traffic, cooling systems)

Expected Challenge: Data centers may improve local metrics (compared to heavy industry) but increase water/energy consumption

Policy & Political Economy Research

10. Tax Incentive Analysis

Approach:

Compile state/local tax incentives for data center siting (property tax abatements, sales tax exemptions)
Create data_center_incentives table with per-facility incentive details
Correlate incentive generosity with:
- State fiscal health
- Local government bargaining power
- Facility size/operator

Research Questions:

Do weaker fiscal states offer larger incentives?
Are incentives regressive (larger for hyperscalers)?
Do incentives predict siting decisions (natural experiment approach)?

Data Sources:

Good Jobs First Subsidy Tracker
State economic development agency press releases
Local news archives

11. Employment & Labor Market Effects

Approach:

Join to BLS Quarterly Census of Employment and Wages (QCEW) by county (QCEW is published at county level and above, not by ZIP)
Identify "data center construction boom" periods (before/after major facility openings)
Analyze employment effects in:
- Construction (NAICS 23)
- Transportation/warehousing (NAICS 48-49)
- Professional services (NAICS 54)

Research Questions:

Do data centers create durable local employment?
Are jobs filled by local residents or commuters?
Wage effects in host tracts?

Data Sources:

BLS QCEW
Census LEHD Origin-Destination Employment Statistics (LODES)

12. Energy Cost Pass-Through

Approach:

Join to state-level electricity rate data (EIA, utility rate tracker)
Test if data center density correlates with residential rate increases
Natural experiment: Compare rate trajectories in high-DC vs. low-DC states

Research Questions:

Do data centers drive residential rate increases (capacity cost allocation)?
Are rate increases concentrated in utility service territories with large data center loads?
Do states with retail choice (deregulated markets) see different effects?

Data Sources:

EIA Form 861 (retail rates by state/utility)
Utility rate case filings (state public utility commissions)
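
A first-pass version of the density-versus-rate comparison might compute annualized residential rate growth per state and correlate it with data center density. The sketch below does that with plain Python; all numbers are hypothetical and the helper names are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def annualized_change_pct(start_rate, end_rate, years):
    """Compound annual growth rate of a retail rate, in percent."""
    return 100 * ((end_rate / start_rate) ** (1 / years) - 1)


# Hypothetical states: data centers per million residents, and residential
# rates (cents/kWh) at the start and end of a 5-year window.
dc_density = [2.0, 5.0, 9.0, 14.0, 30.0]
rates = [(12.0, 13.2), (11.0, 12.5), (13.0, 15.1), (10.5, 12.6), (12.0, 15.0)]

rate_growth = [annualized_change_pct(a, b, 5) for a, b in rates]
r = pearson_r(dc_density, rate_growth)
print(f"r = {r:.2f}")
```

A correlation at state level cannot separate data center load from fuel costs, weather, or rate-case timing, which is why the utility-territory and retail-choice questions above call for a comparison design rather than this coefficient alone.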

Data Quality & Infrastructure Improvements

13. Address Validation & Geocoding Refinement

Approach:

Re-geocode the 45 facilities using city-precision fallback
Use USPS address validation API
Cross-reference with Google Maps satellite imagery (manual review)

Implementation:

# Re-run geocoding with stricter thresholds (run from the project root; scripts now live in scripts/)
python3 scripts/load_postgis_data_centers.py --revalidate-addresses

14. OSM Continuous Monitoring

Approach:

Set up automated Overpass API queries (daily/weekly)
Detect new OSM data center tags
Alert for review + merge into master_data_centers

Implementation:

Cron job running load_postgis_osm_data_centers.py --update-only
Slack/email notification on new facilities
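
The detection step could look roughly like the sketch below: build an Overpass QL query for a bounding box, fetch the elements, and keep those not already known. The public overpass-api.de endpoint, the tag selection (telecom=data_center and building=data_center), and all function names are assumptions; the existing loader script may query differently.

```python
import json
import urllib.parse
import urllib.request

OVERPASS_URL = "https://overpass-api.de/api/interpreter"  # public endpoint


def build_query(bbox):
    """Overpass QL for data-center features inside (south, west, north, east)."""
    s, w, n, e = bbox
    box = f"({s},{w},{n},{e})"
    # Tag choice is an assumption; the loader script may query other tags.
    return (
        "[out:json][timeout:60];"
        f'(nwr["telecom"="data_center"]{box};'
        f'nwr["building"="data_center"]{box};);'
        "out center;"
    )


def fetch_elements(bbox):
    """POST the query and return the raw OSM elements (performs a network call)."""
    data = urllib.parse.urlencode({"data": build_query(bbox)}).encode()
    with urllib.request.urlopen(OVERPASS_URL, data=data, timeout=90) as resp:
        return json.load(resp)["elements"]


def new_elements(elements, known_ids):
    """Keep elements whose (type, id) pair has not been seen before."""
    return [el for el in elements if (el["type"], el["id"]) not in known_ids]


# Offline example with hypothetical elements and a hypothetical known-ID set.
known = {("way", 101), ("node", 202)}
latest = [{"type": "way", "id": 101}, {"type": "way", "id": 303}]
print(new_elements(latest, known))
```

OSM IDs are unique only within an element type, so the known set is keyed on (type, id) pairs rather than bare IDs. A scheduled job would call fetch_elements per region and pass the result through new_elements before alerting.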

15. Broadband Speed Validation

Approach:

Cross-reference FCC BDC provider data with Ookla Speedtest results
Test if data center host tracts have faster actual speeds (not just availability)

Hypothesis: Data center presence correlates with infrastructure investment → higher speeds

Data Sources:

Ookla Open Data (aggregated Speedtest results by tile)
FCC Measuring Broadband America

Visualization & Communication

16. Interactive Story Map

Approach:

Build Scrollama.js narrative map
Sections:
1. National overview (cluster map)
2. Ashburn VA zoom (dominance of single region)
3. Demographics comparison (host vs. national)
4. Water stress hot spots
5. Energy infrastructure saturation

Output: story_map.html (standalone web page)

17. Policy Brief Generation

Approach:

Auto-generate policy briefs from analysis outputs
Targeted audiences:
- State legislators (energy/water policy)
- Local governments (tax incentive negotiation)
- Environmental justice advocates

Template:

# Data Center Siting in [STATE]: Key Facts for Policymakers

- **[STATE] hosts X% of US data centers** (rank: #Y)
- **Host communities are Z% wealthier** than state average
- **A% of state generation is within 50 km of a data center**
- **Top watershed holds B facilities** (water stress: [HIGH/MEDIUM/LOW])
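
Filling the template can be done with the standard library alone. The sketch below turns the bracketed placeholders into string.Template fields; the field names and all sample values are assumptions for illustration, not analysis results.

```python
from string import Template

# Field names (share_pct, wealth_pct, ...) are hypothetical stand-ins for X, Z, A, B.
BRIEF = Template(
    "# Data Center Siting in $state: Key Facts for Policymakers\n\n"
    "- **$state hosts $share_pct% of US data centers** (rank: #$rank)\n"
    "- **Host communities are $wealth_pct% wealthier** than state average\n"
    "- **$gen_pct% of state generation is within 50 km of a data center**\n"
    "- **Top watershed holds $top_count facilities** (water stress: $stress)\n"
)


def render_brief(facts):
    """Fill the brief; substitute() raises KeyError if any field is missing."""
    return BRIEF.substitute(facts)


# Placeholder values for illustration only.
facts = {
    "state": "Virginia", "share_pct": 20.1, "rank": 1, "wealth_pct": 35,
    "gen_pct": 50, "top_count": 235, "stress": "MEDIUM",
}
brief_text = render_brief(facts)
print(brief_text)
```

Using substitute() rather than safe_substitute() means a brief with a missing statistic fails at generation time instead of shipping with a literal placeholder.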

18. Comparative International Analysis

Approach:

Extend methodology to EU, Canada, Australia
Compare siting patterns (e.g., Nordic countries = renewable energy, cold climate)
Test if "concentrated costs / dispersed benefits" holds internationally

Data Sources:

OpenStreetMap (global coverage)
Eurostat demographics
IEA energy data
TeleGeography global cable data (already available)

Research Questions:

Are US patterns unique (tax-driven siting) vs. EU (regulatory constraints)?
Do Nordic countries see more equitable distribution?

Speculative / Long-Term Ideas

19. AI Demand Forecasting

Approach:

Train ML model to predict data center siting
Features: demographics, energy capacity, fiber proximity, tax rates, water availability
Test on historical data (train on pre-2015, test on 2015-2025)

Use Case:

Identify "likely future sites" for proactive policy intervention
Warn communities of potential incoming projects
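
The train/test design above implies a split by time rather than a random split, and a ranking-style metric for "likely future sites". A minimal sketch of both pieces follows; the record layout, the tract IDs, and the choice of precision-at-k are assumptions, and no model is trained here.

```python
def temporal_split(records, cutoff_year=2015):
    """Split site records into train (before cutoff) and test (cutoff onward)."""
    train = [r for r in records if r["rfs_year"] < cutoff_year]
    test = [r for r in records if r["rfs_year"] >= cutoff_year]
    return train, test


def precision_at_k(scored_tracts, actual_sites, k):
    """Fraction of the k highest-scored tracts that actually received a site."""
    top = sorted(scored_tracts, key=scored_tracts.get, reverse=True)[:k]
    return sum(t in actual_sites for t in top) / k


# Hypothetical records and model scores keyed by made-up tract IDs.
records = [
    {"tract": "A", "rfs_year": 2009},
    {"tract": "B", "rfs_year": 2014},
    {"tract": "C", "rfs_year": 2015},
    {"tract": "D", "rfs_year": 2021},
]
train, test = temporal_split(records)
scores = {"C": 0.9, "E": 0.8, "D": 0.3, "F": 0.1}
actual = {r["tract"] for r in test}
print(len(train), len(test), precision_at_k(scores, actual, k=2))
```

Splitting on rfs_year keeps post-2015 information out of training, which a random split would leak through spatially clustered sites.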

20. Cooling Technology Analysis

Approach:

Classify facilities by cooling type (air, water, hybrid)
Correlate with:
- Climate (CDD: cooling degree days)
- Water availability
- Facility size

Data Sources:

Manual classification from news/press releases
FOIA requests to water utilities (cooling water withdrawal permits)

Research Questions:

Are water-cooled facilities concentrated in water-stressed regions (paradox)?
Do hyperscalers use more efficient cooling (e.g., Meta's Prineville OR evaporative cooling)?

21. Bitcoin Mining Facilities

Approach:

Overlay cryptocurrency mining facilities (subset of "data centers")
Compare siting patterns: Bitcoin mines prefer low electricity costs (WA, TX, NY hydro)
Test if Bitcoin mines face more opposition (negative perception)

Data Sources:

Cambridge Bitcoin Electricity Consumption Index (mining map reports hashrate share by country and US state; facility-level locations would need another source)
News archives of mining farm proposals/rejections

22. Disaster Resilience & Redundancy

Approach:

Model simultaneous hazard exposure across data center clusters
E.g., "What % of US data centers are in wildfire risk zones?"
Identify single points of failure (e.g., Ashburn VA = 20% of US capacity)

Research Questions:

Is the current spatial distribution resilient to climate change?
Should policy incentivize geographic diversification?

Output: disaster_resilience_report.md

23. Edge Data Center Network

Approach:

Separately analyze edge facilities (<1 MW) vs. hyperscale (>100 MW)
Test if edge DCs follow different siting logic (population density > energy cost)

Data Challenge: Current inventory does not distinguish edge vs. hyperscale (need power_mw backfill)
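
Once power_mw is backfilled, the split itself is a small function. The sketch below uses the thresholds named above (<1 MW for edge, >100 MW for hyperscale) and keeps missing values in an explicit "unknown" bucket; the "mid-size" label for the band in between is an assumption.

```python
from collections import Counter


def size_class(power_mw):
    """Classify a facility using the thresholds named above (<1 MW, >100 MW)."""
    if power_mw is None:
        return "unknown"
    if power_mw < 1:
        return "edge"
    if power_mw > 100:
        return "hyperscale"
    return "mid-size"  # label for the 1-100 MW band is an assumption


# Hypothetical power_mw values; None stands for a facility without capacity data.
sample_mw = [0.5, 12, 150, None, 100, None]
counts = Counter(size_class(mw) for mw in sample_mw)
print(dict(counts))
```

Reporting the "unknown" count alongside the classes makes the current coverage gap visible instead of silently shrinking the sample.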

24. Carbon Intensity of Host Grids

Approach:

Join to EPA eGRID subregion carbon intensity (lb CO₂/MWh)
Calculate per-facility estimated carbon footprint (if power_mw available)
Compare to corporate renewable energy procurement (RECs, PPAs)

Research Questions:

Are data centers disproportionately in high-carbon grids?
Do hyperscaler renewable commitments offset grid carbon?

Data Sources:

EPA eGRID
Corporate sustainability reports (Google, Microsoft, Meta, AWS)
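
The per-facility estimate reduces to simple arithmetic: capacity times hours times an assumed utilization gives annual MWh, which is multiplied by the subregion's lb CO₂/MWh rate. The sketch below shows that calculation; the 0.6 default utilization and the sample inputs are assumptions, and power_mw is treated as average-able nameplate capacity.

```python
HOURS_PER_YEAR = 8760
LB_PER_METRIC_TON = 2204.62


def annual_co2_tonnes(power_mw, grid_lb_per_mwh, utilization=0.6):
    """Estimated annual CO2 (metric tons) for a facility on a given grid.

    utilization is the assumed average load as a fraction of power_mw.
    """
    if not 0 <= utilization <= 1:
        raise ValueError("utilization must be between 0 and 1")
    energy_mwh = power_mw * utilization * HOURS_PER_YEAR
    return energy_mwh * grid_lb_per_mwh / LB_PER_METRIC_TON


# Hypothetical 100 MW facility at 50% utilization on an 800 lb/MWh grid.
tonnes = annual_co2_tonnes(100, 800, utilization=0.5)
print(f"{tonnes:,.0f} t CO2/yr")
```

This is a location-based estimate; comparing it with RECs and PPAs, as the approach suggests, would require a separate market-based figure per operator.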

Collaboration Opportunities

Academic Partnerships

Energy researchers: Joint analysis of grid saturation + IM3 projections
Environmental justice scholars: EJScreen overlay + opposition case studies
Political scientists: Tax incentive analysis + local government bargaining power

Policy Stakeholders

State energy offices: Share grid saturation maps
Water resource agencies: Watershed analysis for permitting
Local governments: Demographic/tax revenue analysis for negotiation leverage

Industry Engagement

Data center operators: Validate facility data, discuss siting criteria
Colocation providers: Access to tenant mix data (multi-tenant vs. single-tenant)

Tools & Infrastructure Improvements

Database Enhancements

Add version column to track data vintage
Implement audit_log table for data lineage
Set up automated backups to S3/Azure Blob

Code Quality

Add unit tests for geocoding functions
Create config.yaml for database connection settings (replacing values hardcoded in scripts; keep secrets such as passwords out of version control)
Dockerize analysis environment for reproducibility

Documentation

Add docstrings (e.g., Google or NumPy style) to all Python functions
Create CONTRIBUTING.md for external collaborators
Record Jupyter notebook walkthroughs (video tutorials)

Unfunded / Ambitious Ideas

25. Real-Time Energy Monitoring

Partner with utility to get live load data from data center substations
Build dashboard showing real-time energy consumption by facility
Correlate with AWS/Azure/GCP service outages (reverse-engineer capacity from brownouts)

26. Social Media Sentiment Analysis

Scrape Twitter/Reddit for mentions of local data center projects
NLP sentiment analysis: support vs. opposition
Correlate sentiment with facility approval outcomes

27. LIDAR Analysis of Cooling Infrastructure

Use aerial LIDAR to measure rooftop cooling equipment volume
Proxy for facility size (cooling = f(IT load))
Build predictive model: cooling equipment → power capacity

Contact & Contributions

If you're interested in collaborating on any of these research directions, please contact the repository owner.

Priorities for external collaboration:

Power capacity data acquisition
Water stress/drought overlay
Opposition cases database compilation
International comparative analysis

References for Future Work

Data Sources to Explore

Department of Energy: Grid resilience reports, interconnection queues
NREL: Renewable energy potential by HUC (solar, wind)
USDA: Agricultural water use by county (competition for water)
NOAA: Climate normals + projections by grid cell
BLS: QCEW employment data, wage data
EPA: eGRID, EJScreen, Superfund sites

Academic Literature Gaps

Limited peer-reviewed research on data center spatial concentration
No published studies on water stress exposure of data centers
Opportunity for "first mover" publication in major geography/planning journals

Policy Levers to Investigate

State renewable portfolio standards (RPS) → data center siting
Federal infrastructure investment (IRA, CHIPS Act) → energy grid capacity
Local zoning reform (industrial land use restrictions)

Legislative Analysis (LegiScan Data)

Status: Data loaded — 1.3M bills across all US states + federal, 2016–2026; ~60K tagged relevant.
Tables: legiscan_sessions, legiscan_bills
Query file: query_legiscan_bills.sql

Research Questions

1. Ratepayer Cost Shifting
Do states with high data center density show more legislative activity on ratepayer protection?

Join legiscan_bills WHERE 'ratepayer_protection' = ANY(relevance_tags) to master_data_centers counts by state
Test correlation between DC concentration and # of ratepayer bills introduced/passed
Compare outcomes: do high-DC states pass or fail more ratepayer protections?

2. Data Center Legislative Wave
Is there a measurable increase in DC-specific legislation after 2022 (AI boom)?

Trend data_center and large_load tagged bills by year_start
Cross-reference with major AI facility announcements (2022–2025)

3. Tax Incentive Geography
Which states enacted tax incentives that may have influenced DC location decisions?

tax_incentive bills with status IN (4,8) (passed/chaptered)
Overlay with master_data_centers growth by state over the same period
Candidate for difference-in-differences analysis
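
For reference, the two-period difference-in-differences estimate is the change in the treated group minus the change in the control group. The sketch below computes it on group means; the state-level growth counts are hypothetical, and a real analysis would add standard errors and a check of pre-period trends.

```python
import statistics


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-period difference-in-differences on group means."""
    treated_change = statistics.mean(treated_post) - statistics.mean(treated_pre)
    control_change = statistics.mean(control_post) - statistics.mean(control_pre)
    return treated_change - control_change


# Hypothetical new data centers per state per period.
# "Treated" states passed a tax_incentive bill between the two periods.
treated_pre, treated_post = [10, 12, 14], [18, 20, 22]
control_pre, control_post = [9, 11, 13], [12, 14, 16]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(effect)  # (20 - 12) - (14 - 11) = 5
```

The estimate is only interpretable as an incentive effect if treated and control states would have followed parallel growth paths without the legislation, which is an assumption to test before relying on the number.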

4. Grid Interconnection Policy
Do states with grid_impact legislation show different EIA capacity expansion patterns?

Join relevant bills to energy_eia_operating_generator_capacity_flat by state
Look for correlations between grid policy activity and nameplate MW additions

5. Siting Preemption vs. Local Control
Are states passing bills to streamline or restrict local siting authority?

Full-text search within siting_permitting bills for "preemption" vs. "local control"
Map bill outcomes by state political environment (cross-ref RDH vote data)

Suggested Joins

-- States with DCs and legislative activity by topic.
-- The state-level join pairs every data center with every relevant bill in its
-- state, so COUNT(DISTINCT ...) is required to keep the counts from inflating.
SELECT
    dc.state,
    COUNT(DISTINCT dc.id)                                                         AS data_centers,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'data_center'          = ANY(lb.relevance_tags)) AS dc_bills,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'ratepayer_protection' = ANY(lb.relevance_tags)) AS ratepayer_bills,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'tax_incentive'        = ANY(lb.relevance_tags)
                                         AND lb.status IN (4, 8))                 AS tax_incentives_passed
FROM master_data_centers dc
LEFT JOIN legiscan_bills lb ON dc.state = lb.state AND lb.is_relevant
GROUP BY dc.state
ORDER BY data_centers DESC;

Last Updated: May 2026
