data-centers/docs/research-ideas.md
# Research Ideas & Future Work
This document outlines potential research directions, data improvements, and analyses that could extend the current US Data Centers geospatial research infrastructure.
## Priority Next Steps
### 1. Backfill Power Capacity Data
**Status**: Only 5.9% of facilities have `power_mw` values (108/1,833)
**Approach**:
- Scrape Baxtel data center database (requires paid subscription)
- Use Data Center Map API/scraping
- Cross-reference with utility interconnection queue filings
- FOIA requests to state utility commissions for large loads
**Research Impact**:
- Enable capacity-weighted concentration metrics (current analyses are facility-count only)
- Correlate power capacity with demographic/environmental variables
- Identify "hyperscale" facilities (>100 MW) vs. edge/enterprise (<10 MW)
**Implementation**:
```python
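# Add capacity-weighted HHI calculation to analyze_dc_tract_concentration.py
# tract_capacities: summed power_mw per census tract (tracts without power_mw data excluded)
total_mw = sum(tract_capacities)
capacity_weighted_hhi = sum((mw_i / total_mw) ** 2 for mw_i in tract_capacities)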
```
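
To show why capacity weighting matters, here is a minimal sketch that computes the facility-count HHI and the capacity-weighted HHI side by side. The `hhi` helper and the three tracts are hypothetical illustration values, not figures from the database; the point is only that a tract with few but large facilities can dominate the capacity-weighted index.

```python
def hhi(values):
    """Herfindahl-Hirschman Index of non-negative values (sum of squared shares)."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    return sum((v / total) ** 2 for v in values)


# Hypothetical tracts: (facility_count, summed power_mw). Not real data.
tracts = [(5, 20.0), (3, 30.0), (2, 150.0)]

count_hhi = hhi([count for count, _ in tracts])
capacity_hhi = hhi([mw for _, mw in tracts])

print(f"count-based HHI:       {count_hhi:.3f}")
print(f"capacity-weighted HHI: {capacity_hhi:.3f}")
```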
---
### 2. Operator Name Deduplication
**Status**: String fragmentation inflates operator counts ("Meta" vs. "Meta, Inc.", AWS variants)
**Approach**:
- Create `operator_mapping` table with canonical names
- Use fuzzy matching (e.g., the `fuzzywuzzy` library, since renamed `thefuzz`) to standardize operator names
- Add `operator_canonical` column to `master_data_centers`
**Research Impact**:
- Accurate hyperscaler market share analysis
- Study operator-specific siting strategies (AWS hydro, Microsoft nuclear, Meta solar)
- Enable "operator power" political economy analyses
**Script**:
```python
# operators_dedupe.py
import pandas as pd
from fuzzywuzzy import process

# `conn` must be an open database connection (e.g., a SQLAlchemy engine) created elsewhere
# Load unique operators
operators = pd.read_sql("SELECT DISTINCT operator FROM master_data_centers", conn)

# Manual + fuzzy matching to canonical names
canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
    # ... etc.
}
```
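
The lookup step that would populate `operator_canonical` could look roughly like the sketch below. It swaps in the standard library's `difflib` for `fuzzywuzzy` so that it runs without extra dependencies; the `canonicalize` helper and the `cutoff` of 0.85 are assumptions for illustration and would need tuning against the real operator strings. Exact alias matches are tried first, and fuzzy matching is only a fallback.

```python
import difflib

canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
}

# Invert to a lowercase alias -> canonical name lookup
alias_to_canonical = {
    alias.lower(): canonical
    for canonical, aliases in canonical_map.items()
    for alias in aliases
}


def canonicalize(raw_name, cutoff=0.85):
    """Return the canonical operator name, or None if nothing is close enough."""
    key = raw_name.strip().lower()
    if key in alias_to_canonical:
        return alias_to_canonical[key]
    # Fallback: closest known alias by difflib similarity ratio
    close = difflib.get_close_matches(key, list(alias_to_canonical), n=1, cutoff=cutoff)
    return alias_to_canonical[close[0]] if close else None


for name in ["Meta, Inc.", "  AWS ", "Meta Inc", "Equinix"]:
    print(repr(name), "->", canonicalize(name))
```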
---
### 3. Water Stress Overlay
**Status**: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities
**Priority**: HIGH - Critical for environmental impact analysis
**Approach**:
- Join to USGS WaterWatch streamflow data
- Add USGS Drought Watch indicators by HUC8
- Correlate data center density with:
  - Groundwater depletion rates
  - Surface water withdrawal permits
  - Drought frequency/severity (USDM historical data)
**Key Watersheds for Focus**:
- **Middle Potomac-Catoctin** (HUC8 02070008): 235 DCs (12.8% of US total) - Loudoun/Ashburn
- **Middle Potomac-Anacostia-Occoquan** (02070010): 111 DCs - Fairfax/inner Loudoun
- **Coyote** (18050003): 88 DCs - Silicon Valley
- **Upper Scioto** (05060001): 73 DCs - Columbus OH
- **Umatilla** (17070103): 29 DCs - AWS-exclusive watershed
**Research Questions**:
- Are data centers sited in water-stressed watersheds?
- Do high-density clusters (Loudoun County, Columbus OH) face water constraints?
- Compare water stress in hyperscaler non-metro sites (Columbia River corridor) vs. metro clusters
- Does single-operator watershed capture (Umatilla = AWS only) correlate with water availability?
**Tables to Create**:
- `watershed_water_stress` - HUC8-level water stress indicators
- `data_center_water_risk` - Per-facility water-stress exposure
**Notebook**: `water_stress_analysis.ipynb`
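
A figure such as "15 watersheds hold 50% of facilities" can be reproduced from per-HUC8 facility counts by sorting the counts in descending order and accumulating until the target share is reached. The sketch below assumes a plain list of counts; the `min_units_for_share` helper and the sample counts are hypothetical.

```python
def min_units_for_share(counts, share=0.5):
    """Smallest number of units (e.g., HUC8 watersheds) holding at least `share` of the total."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    running = 0
    for n_units, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= share * total:
            return n_units
    return len(counts)


# Hypothetical facility counts per watershed (not real data)
facility_counts = [40, 25, 15, 10, 5, 5]
print(min_units_for_share(facility_counts, share=0.5))
```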
---
### 4. Opposition Cases Overlay
**Status**: 18 geocoded opposition cases in `opposition_cases_geocoded` table (loaded but unused)
**Approach**:
- Expand dataset: Compile additional cases of rejected/delayed data center proposals from news archives
- Geocode all opposition cases, join to demographics/hazards
- Test hypotheses:
  - Do wealthier/more educated communities successfully block projects?
  - Are opposition cases more common in water-stressed or drought-prone areas?
  - Do smaller non-metro communities have less bargaining power?
  - Does clustered vs. isolated location predict opposition likelihood?
**Research Questions**:
- What predicts opposition success?
- Are opposition cases spatially clustered?
- Do demographics differ between accepted vs. rejected sites?
- Correlation with FEMA hazard exposure scores?
**Analysis Plan**:
```sql
-- Join opposition cases to demographics
SELECT o.*, ct.median_household_income, ct.bachelors_or_higher_pct
FROM opposition_cases_geocoded o
JOIN _dc_census_tract_acs_2024 ct
ON ST_Contains(ct.geom, o.geom);
```
**Output**: `opposition_cases_analysis.md`
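
With only 18 geocoded cases, a comparison of demographics between blocked and approved sites is probably better served by an exact permutation test than by a large-sample test. The sketch below is one possible approach, not the project's method: it enumerates every way of splitting the pooled values into two groups and reports the two-sided p-value for the difference in means. The income values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation test on the absolute difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    pooled_sum = sum(pooled)
    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (pooled_sum - sum_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
        n_splits += 1
    return extreme / n_splits


# Hypothetical median household incomes in thousands of USD (not real data)
blocked_sites = [98, 105, 112, 120]
approved_sites = [52, 58, 61, 70]

p_value = exact_permutation_pvalue(blocked_sites, approved_sites)
print(f"two-sided exact p-value: {p_value:.4f}")
```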
---
### 5. IM3 Forward Projection Integration
**Status**: IM3 model data loaded in `im3_state_projected_moderate_50` (328 rows) and `im3_projected_state_demand_summary` (31 rows)
**Approach**:
- Load IM3 projected demand scenarios (2030, 2040, 2050)
- Overlay projected growth with:
  - Current grid saturation (% of generation within 50 km)
  - Water stress indicators
  - Land availability (zoned industrial parcels)
- Identify regions where projected demand may exceed infrastructure capacity
**Grid Saturation Context** (from current analysis):
- **New Jersey**: 83% of grid within 50 km of DC
- **Nevada**: 75%
- **Tennessee**: 70%
- **Oregon**: 68%
- **Arizona**: 56%
- **Virginia**: 50%
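
For readers who want to sanity-check a saturation figure outside the database, here is a pure-Python sketch of the metric as it is described here: the share of generation within 50 km of any data center. It assumes "share" means share of nameplate MW and uses great-circle (haversine) distance; the function names and coordinates are hypothetical, and the actual analysis may define the metric differently.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def pct_capacity_near_dc(generators, data_centers, radius_km=50.0):
    """Percent of nameplate MW located within radius_km of any data center.

    generators: list of (lat, lon, nameplate_mw); data_centers: list of (lat, lon).
    """
    total_mw = sum(mw for _, _, mw in generators)
    if total_mw <= 0:
        return 0.0
    near_mw = sum(
        mw
        for lat, lon, mw in generators
        if any(haversine_km(lat, lon, dlat, dlon) <= radius_km for dlat, dlon in data_centers)
    )
    return 100.0 * near_mw / total_mw


# Hypothetical inputs (not real facilities or generators)
data_centers = [(39.04, -77.49)]
generators = [(39.20, -77.50, 300.0), (38.00, -78.50, 100.0)]
saturation = pct_capacity_near_dc(generators, data_centers)
print(f"{saturation:.1f}% of capacity within 50 km of a data center")
```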
**Research Questions**:
- Which states face grid saturation from data center growth?
- Are projected sites in water-stressed watersheds?
- Does IM3 assume spatial distribution patterns consistent with current siting?
- Can states with >50% grid saturation accommodate projected demand?
**Implementation**:
```sql
-- Compare current saturation to IM3 projected demand
SELECT
    sat.state,
    sat.dc_count,
    sat.pct_grid_saturated,
    proj.facility_count AS projected_new_facilities,
    proj.total_it_mw AS projected_new_mw
FROM state_grid_saturation sat
JOIN im3_projected_state_demand_summary proj ON sat.state = proj.state
WHERE sat.pct_grid_saturated > 50
ORDER BY sat.pct_grid_saturated DESC;
```
**Notebook**: `im3_projection_overlay.ipynb`
---
## Methodological Extensions
### 6. Time-Series Analysis of Cluster Growth
**Approach**:
- Use `rfs_year` (ready for service) from cable data and EIA generator vintage
- Reconstruct data center siting over time (requires RFS dates for facilities)
- Animate cluster formation in interactive map
**Research Questions**:
- Did Ashburn VA become dominant before or after major cable landings?
- Do clusters grow via agglomeration (new facilities near existing) or simultaneous build-out?
- Correlation between energy infrastructure build-out and data center growth
**Data Needed**:
- Facility RFS dates (scrape from press releases, Baxtel historical data)
- Historical tract demographics (decennial Census + ACS back to 2000)
---
### 7. Network Effects: Fiber Route Proximity
**Status**: Current analysis tests submarine cable proximity (negative result)
**Approach**:
- Obtain fiber optic backbone route GIS data (from FCC, carriers, or Infrapedia)
- Test proximity to long-haul fiber routes (not just submarine cables)
- Hypothesis: Data centers cluster near terrestrial long-haul fiber routes, not near submarine cable landings
**Data Sources**:
- FCC Form 477 fiber deployment data
- Infrapedia fiber route database
- State-level fiber maps (e.g., Virginia Broadband Map)
**Expected Result**: Positive correlation (unlike submarine cables)
---
### 8. Land Use & Zoning Analysis
**Approach**:
- Join data centers to local zoning classifications (industrial, commercial, etc.)
- Analyze land prices in data center tracts before/after facility construction
- Correlate with property tax revenues
**Research Questions**:
- Do data centers drive local property value increases?
- Are they preferentially sited in already-zoned industrial areas?
- Do host communities capture tax base growth?
**Data Sources**:
- Zillow Home Value Index (ZHVI) by ZIP
- ATTOM property tax assessments
- Municipal zoning GIS layers (city-specific, requires scraping/FOIA)
---
### 9. Environmental Justice Scoring
**Approach**:
- Compare data center host tracts to EPA's EJScreen indices
- Add CalEnviroScreen-style burden/benefit framework
- Test if data centers increase cumulative environmental burdens
**Metrics**:
- Air quality (PM2.5, ozone)
- Hazardous waste proximity
- Superfund site proximity
- Heat island effect (LST from Landsat)
- Noise pollution (traffic, cooling systems)
**Expected Challenge**: Data centers may improve local metrics (compared to heavy industry) but increase water/energy consumption
---
## Policy & Political Economy Research
### 10. Tax Incentive Analysis
**Approach**:
- Compile state/local tax incentives for data center siting (property tax abatements, sales tax exemptions)
- Create `data_center_incentives` table with per-facility incentive details
- Correlate incentive generosity with:
  - State fiscal health
  - Local government bargaining power
  - Facility size/operator
**Research Questions**:
- Do weaker fiscal states offer larger incentives?
- Are incentives regressive (larger for hyperscalers)?
- Do incentives predict siting decisions (natural experiment approach)?
**Data Sources**:
- Good Jobs First Subsidy Tracker
- State economic development agency press releases
- Local news archives
---
### 11. Employment & Labor Market Effects
**Approach**:
- Join to BLS Quarterly Census of Employment and Wages (QCEW) by ZIP/county
- Identify "data center construction boom" periods (before/after major facility openings)
- Analyze employment effects in:
  - Construction (NAICS 23)
  - Transportation/warehousing (NAICS 48-49)
  - Professional services (NAICS 54)
**Research Questions**:
- Do data centers create durable local employment?
- Are jobs filled by local residents or commuters?
- Wage effects in host tracts?
**Data Sources**:
- BLS QCEW
- Census LEHD Origin-Destination Employment Statistics (LODES)
---
### 12. Energy Cost Pass-Through
**Approach**:
- Join to state-level electricity rate data (EIA, utility rate tracker)
- Test if data center density correlates with residential rate increases
- Natural experiment: Compare rate trajectories in high-DC vs. low-DC states
**Research Questions**:
- Do data centers drive residential rate increases (capacity cost allocation)?
- Are rate increases concentrated in utility service territories with large data center loads?
- Do states with retail choice (deregulated markets) see different effects?
**Data Sources**:
- EIA Form 861 (retail rates by state/utility)
- Utility rate case filings (state public utility commissions)
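
The high-DC vs. low-DC comparison of rate trajectories amounts to a difference-in-differences estimate: the change in the high-DC group minus the change in the low-DC group. A minimal sketch with hypothetical residential rates (cents/kWh) follows; it assumes the two groups would have followed parallel trends absent data center growth, which would need to be checked on real data.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences of group means: (treated change) - (control change)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical average residential rates in cents/kWh (not real data)
high_dc_pre, high_dc_post = [12.0, 13.0], [14.5, 15.5]
low_dc_pre, low_dc_post = [11.0, 12.0], [12.0, 13.0]

did = diff_in_diff(high_dc_pre, high_dc_post, low_dc_pre, low_dc_post)
print(f"DiD estimate: {did:+.2f} cents/kWh")
```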
---
## Data Quality & Infrastructure Improvements
### 13. Address Validation & Geocoding Refinement
**Approach**:
- Re-geocode the 45 facilities using city-precision fallback
- Use USPS address validation API
- Cross-reference with Google Maps satellite imagery (manual review)
**Implementation**:
```bash
# Re-run geocoding with stricter thresholds
python3 load_postgis_data_centers.py --revalidate-addresses
```
---
### 14. OSM Continuous Monitoring
**Approach**:
- Set up automated Overpass API queries (daily/weekly)
- Detect new OSM data center tags
- Alert for review + merge into `master_data_centers`
**Implementation**:
- Cron job running `load_postgis_osm_data_centers.py --update-only`
- Slack/email notification on new facilities
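
The "detect new tags" step can be reduced to a set difference between the elements returned by the latest Overpass query and those already reviewed. The sketch below assumes the response has already been fetched and parsed as Overpass JSON (a top-level `elements` list whose items carry `type`, `id`, and `tags`); the `new_osm_elements` helper and the sample payload are hypothetical. Because OSM IDs are only unique within an element type, the key is the `(type, id)` pair.

```python
def new_osm_elements(overpass_json, known_keys):
    """Return elements whose (type, id) pair is not in known_keys."""
    fresh = []
    for element in overpass_json.get("elements", []):
        key = (element["type"], element["id"])
        if key not in known_keys:
            fresh.append(element)
    return fresh


# Hypothetical parsed Overpass response (not real OSM objects)
response = {
    "elements": [
        {"type": "way", "id": 101, "tags": {"name": "Example DC A"}},
        {"type": "node", "id": 101, "tags": {"name": "Example DC B"}},
        {"type": "way", "id": 202, "tags": {"name": "Example DC C"}},
    ]
}
already_reviewed = {("way", 101)}

candidates = new_osm_elements(response, already_reviewed)
for element in candidates:
    print("needs review:", element["type"], element["id"], element["tags"].get("name"))
```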
---
### 15. Broadband Speed Validation
**Approach**:
- Cross-reference FCC BDC provider data with Ookla Speedtest results
- Test if data center host tracts have faster actual speeds (not just availability)
**Hypothesis**: Data center presence correlates with infrastructure investment → higher speeds
**Data Sources**:
- Ookla Open Data (aggregated Speedtest results by tile)
- FCC Measuring Broadband America
---
## Visualization & Communication
### 16. Interactive Story Map
**Approach**:
- Build Scrollama.js narrative map
- Sections:
  1. National overview (cluster map)
  2. Ashburn VA zoom (dominance of single region)
  3. Demographics comparison (host vs. national)
  4. Water stress hot spots
  5. Energy infrastructure saturation
**Output**: `story_map.html` (standalone web page)
---
### 17. Policy Brief Generation
**Approach**:
- Auto-generate policy briefs from analysis outputs
- Targeted audiences:
  - State legislators (energy/water policy)
  - Local governments (tax incentive negotiation)
  - Environmental justice advocates
**Template**:
```markdown
# Data Center Siting in [STATE]: Key Facts for Policymakers
- **[STATE] hosts X% of US data centers** (rank: #Y)
- **Host communities are Z% wealthier** than state average
- **A% of state generation is within 50 km of a data center**
- **Top watershed holds B facilities** (water stress: [HIGH/MEDIUM/LOW])
```
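
Auto-generation from this template could be as simple as string formatting over a per-state dictionary of statistics. The sketch below is one possible shape: the field names (`share_pct`, `rank`, and so on) and all values are placeholders for illustration, and in practice they would come from the analysis outputs.

```python
BRIEF_TEMPLATE = """\
# Data Center Siting in {state}: Key Facts for Policymakers
- **{state} hosts {share_pct:.1f}% of US data centers** (rank: #{rank})
- **Host communities are {wealth_gap_pct:.0f}% wealthier** than state average
- **{saturation_pct:.0f}% of state generation is within 50 km of a data center**
- **Top watershed holds {top_watershed_count} facilities** (water stress: {water_stress})
"""


def render_brief(stats):
    """Fill the policy brief template from a dict of per-state statistics."""
    return BRIEF_TEMPLATE.format(**stats)


# Placeholder values for a fictional state (not real results)
example_stats = {
    "state": "Examplia",
    "share_pct": 4.2,
    "rank": 7,
    "wealth_gap_pct": 18,
    "saturation_pct": 42,
    "top_watershed_count": 31,
    "water_stress": "MEDIUM",
}

brief = render_brief(example_stats)
print(brief)
```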
---
### 18. Comparative International Analysis
**Approach**:
- Extend methodology to EU, Canada, Australia
- Compare siting patterns (e.g., Nordic countries = renewable energy, cold climate)
- Test if "concentrated costs / dispersed benefits" holds internationally
**Data Sources**:
- OpenStreetMap (global coverage)
- Eurostat demographics
- IEA energy data
- TeleGeography global cable data (already available)
**Research Questions**:
- Are US patterns unique (tax-driven siting) vs. EU (regulatory constraints)?
- Do Nordic countries see more equitable distribution?
---
## Speculative / Long-Term Ideas
### 19. AI Demand Forecasting
**Approach**:
- Train ML model to predict data center siting
- Features: demographics, energy capacity, fiber proximity, tax rates, water availability
- Test on historical data (train on pre-2015, test on 2015-2025)
**Use Case**:
- Identify "likely future sites" for proactive policy intervention
- Warn communities of potential incoming projects
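
The train/test design above (train on pre-2015, test on 2015-2025) is a temporal split rather than a random one, which avoids leaking future siting patterns into training. A minimal sketch follows, assuming each facility record carries a per-facility `rfs_year`; that field is an assumption here, since facility RFS dates still need to be collected. Records without a year are dropped.

```python
def temporal_split(facilities, cutoff_year=2015, end_year=2025):
    """Split records into train (rfs_year < cutoff_year) and test (cutoff_year..end_year)."""
    train, test = [], []
    for facility in facilities:
        year = facility.get("rfs_year")
        if year is None:
            continue  # unusable without a ready-for-service year
        if year < cutoff_year:
            train.append(facility)
        elif year <= end_year:
            test.append(facility)
    return train, test


# Hypothetical facility records (not real data)
facilities = [
    {"id": 1, "rfs_year": 2008},
    {"id": 2, "rfs_year": 2014},
    {"id": 3, "rfs_year": 2015},
    {"id": 4, "rfs_year": 2022},
    {"id": 5, "rfs_year": None},
    {"id": 6, "rfs_year": 2026},
]

train_set, test_set = temporal_split(facilities)
print(len(train_set), "train /", len(test_set), "test")
```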
---
### 20. Cooling Technology Analysis
**Approach**:
- Classify facilities by cooling type (air, water, hybrid)
- Correlate with:
  - Climate (CDD: cooling degree days)
  - Water availability
  - Facility size
**Data Sources**:
- Manual classification from news/press releases
- FOIA requests to water utilities (cooling water withdrawal permits)
**Research Questions**:
- Are water-cooled facilities concentrated in water-stressed regions (paradox)?
- Do hyperscalers use more efficient cooling (e.g., Meta's Prineville OR evaporative cooling)?
---
### 21. Bitcoin Mining Facilities
**Approach**:
- Overlay cryptocurrency mining facilities (subset of "data centers")
- Compare siting patterns: Bitcoin mines prefer low electricity costs (WA, TX, NY hydro)
- Test if Bitcoin mines face more opposition (negative perception)
**Data Sources**:
- Cambridge Bitcoin Electricity Consumption Index (has facility locations)
- News archives of mining farm proposals/rejections
---
### 22. Disaster Resilience & Redundancy
**Approach**:
- Model simultaneous hazard exposure across data center clusters
- E.g., "What % of US data centers are in wildfire risk zones?"
- Identify single points of failure (e.g., Ashburn VA = 20% of US capacity)
**Research Questions**:
- Is the current spatial distribution resilient to climate change?
- Should policy incentivize geographic diversification?
**Output**: `disaster_resilience_report.md`
---
### 23. Edge Data Center Network
**Approach**:
- Separately analyze edge facilities (<1 MW) vs. hyperscale (>100 MW)
- Test if edge DCs follow different siting logic (population density > energy cost)
**Data Challenge**: Current inventory does not distinguish edge vs. hyperscale (need `power_mw` backfill)
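
Once `power_mw` is backfilled, the split could be a simple threshold rule. The sketch below uses the thresholds stated in this section (<1 MW edge, >100 MW hyperscale); the `"mid"` and `"unknown"` labels are assumptions added so that every facility receives a class.

```python
def size_class(power_mw):
    """Classify a facility by power capacity in MW; None means no capacity data."""
    if power_mw is None:
        return "unknown"
    if power_mw < 1:
        return "edge"
    if power_mw > 100:
        return "hyperscale"
    return "mid"  # assumed label for everything between the two thresholds


for mw in [0.5, 30, 250, None]:
    print(mw, "->", size_class(mw))
```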
---
### 24. Carbon Intensity of Host Grids
**Approach**:
- Join to EPA eGRID subregion carbon intensity (lb CO₂/MWh)
- Calculate per-facility estimated carbon footprint (if `power_mw` available)
- Compare to corporate renewable energy procurement (RECs, PPAs)
**Research Questions**:
- Are data centers disproportionately in high-carbon grids?
- Do hyperscaler renewable commitments offset grid carbon?
**Data Sources**:
- EPA eGRID
- Corporate sustainability reports (Google, Microsoft, Meta, AWS)
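
The per-facility footprint estimate is a product of capacity, hours, a utilization factor, and the grid's emission rate. The sketch below is a rough upper-bound style calculation under stated assumptions: `power_mw` is treated as average draw at `utilization=1.0`, cooling overhead (PUE) is ignored, and the utilization parameter itself is a hypothetical input not present in the current schema. The emission rate is in lb CO₂/MWh to match eGRID units.

```python
HOURS_PER_YEAR = 8760
KG_PER_LB = 0.45359237  # exact definition of the avoirdupois pound


def annual_co2_lb(power_mw, lb_co2_per_mwh, utilization=1.0):
    """Estimated annual CO2 in pounds: MW x hours x utilization x lb/MWh."""
    return power_mw * HOURS_PER_YEAR * utilization * lb_co2_per_mwh


def lb_to_metric_tons(pounds):
    return pounds * KG_PER_LB / 1000.0


# Hypothetical facility: 10 MW on a grid emitting 1,000 lb CO2/MWh
emissions_lb = annual_co2_lb(10, 1000)
print(f"{lb_to_metric_tons(emissions_lb):,.0f} metric tons CO2 per year")
```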
---
## Collaboration Opportunities
### Academic Partnerships
- **Energy researchers**: Joint analysis of grid saturation + IM3 projections
- **Environmental justice scholars**: EJScreen overlay + opposition case studies
- **Political scientists**: Tax incentive analysis + local government bargaining power
### Policy Stakeholders
- **State energy offices**: Share grid saturation maps
- **Water resource agencies**: Watershed analysis for permitting
- **Local governments**: Demographic/tax revenue analysis for negotiation leverage
### Industry Engagement
- **Data center operators**: Validate facility data, discuss siting criteria
- **Colocation providers**: Access to tenant mix data (multi-tenant vs. single-tenant)
---
## Tools & Infrastructure Improvements
### Database Enhancements
- Add `version` column to track data vintage
- Implement `audit_log` table for data lineage
- Set up automated backups to S3/Azure Blob
### Code Quality
- Add unit tests for geocoding functions
- Create `config.yaml` for database credentials (replace hardcoded env vars)
- Dockerize analysis environment for reproducibility
### Documentation
- Add docstrings (e.g., Google- or NumPy-style) to all Python functions
- Create `CONTRIBUTING.md` for external collaborators
- Record Jupyter notebook walkthroughs (video tutorials)
---
## Unfunded / Ambitious Ideas
### 25. Real-Time Energy Monitoring
- Partner with utility to get live load data from data center substations
- Build dashboard showing real-time energy consumption by facility
- Correlate with AWS/Azure/GCP service outages (reverse-engineer capacity from brownouts)
### 26. Social Media Sentiment Analysis
- Scrape Twitter/Reddit for mentions of local data center projects
- NLP sentiment analysis: support vs. opposition
- Correlate sentiment with facility approval outcomes
### 27. LIDAR Analysis of Cooling Infrastructure
- Use aerial LIDAR to measure rooftop cooling equipment volume
- Proxy for facility size (cooling = f(IT load))
- Build predictive model: cooling equipment → power capacity
---
## Contact & Contributions
If you're interested in collaborating on any of these research directions, please contact the repository owner.
**Priorities for external collaboration**:
1. Power capacity data acquisition
2. Water stress/drought overlay
3. Opposition cases database compilation
4. International comparative analysis
---
## References for Future Work
### Data Sources to Explore
- **Department of Energy**: Grid resilience reports, interconnection queues
- **NREL**: Renewable energy potential by HUC (solar, wind)
- **USDA**: Agricultural water use by county (competition for water)
- **NOAA**: Climate normals + projections by grid cell
- **BLS**: QCEW employment data, wage data
- **EPA**: eGRID, EJScreen, Superfund sites
### Academic Literature Gaps
- Limited peer-reviewed research on data center spatial concentration
- No published studies on water stress exposure of data centers
- Opportunity for "first mover" publication in major geography/planning journals
### Policy Levers to Investigate
- State renewable portfolio standards (RPS) → data center siting
- Federal infrastructure investment (IRA, CHIPS Act) → energy grid capacity
- Local zoning reform (industrial land use restrictions)
---
## Legislative Analysis (LegiScan Data)
**Status**: Data loaded (1.3M bills across all US states + federal, 2016-2026; ~60K tagged relevant).
**Tables**: `legiscan_sessions`, `legiscan_bills`
**Query file**: `query_legiscan_bills.sql`
### Research Questions
**1. Ratepayer Cost Shifting**
Do states with high data center density show more legislative activity on ratepayer protection?
- Join `legiscan_bills WHERE 'ratepayer_protection' = ANY(relevance_tags)` to `master_data_centers` counts by state
- Test correlation between DC concentration and the number of ratepayer bills introduced/passed
- Compare outcomes: do high-DC states pass or fail more ratepayer protections?
**2. Data Center Legislative Wave**
Is there a measurable increase in DC-specific legislation after 2022 (AI boom)?
- Trend `data_center` and `large_load` tagged bills by `year_start`
- Cross-reference with major AI facility announcements (2022-2025)
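
The yearly trend for this question could be tabulated as below. The sketch assumes bill rows have already been joined to their session so that each row carries both `relevance_tags` and `year_start`; that row shape, the `bills_per_year` helper, and the sample rows are hypothetical.

```python
from collections import Counter


def bills_per_year(bills, tags=("data_center", "large_load")):
    """Count bills per year_start that carry at least one of the given tags."""
    wanted = set(tags)
    counts = Counter()
    for bill in bills:
        if wanted & set(bill["relevance_tags"]):
            counts[bill["year_start"]] += 1
    return dict(sorted(counts.items()))


# Hypothetical joined bill rows (not real LegiScan records)
sample_bills = [
    {"bill_id": 1, "year_start": 2021, "relevance_tags": ["data_center"]},
    {"bill_id": 2, "year_start": 2023, "relevance_tags": ["large_load", "grid_impact"]},
    {"bill_id": 3, "year_start": 2023, "relevance_tags": ["data_center", "tax_incentive"]},
    {"bill_id": 4, "year_start": 2023, "relevance_tags": ["ratepayer_protection"]},
]

trend = bills_per_year(sample_bills)
print(trend)
```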
**3. Tax Incentive Geography**
Which states enacted tax incentives that may have influenced DC location decisions?
- `tax_incentive` bills with `status IN (4,8)` (passed/chaptered)
- Overlay with `master_data_centers` growth by state over the same period
- Candidate for difference-in-differences analysis
**4. Grid Interconnection Policy**
Do states with `grid_impact` legislation show different EIA capacity expansion patterns?
- Join relevant bills to `energy_eia_operating_generator_capacity_flat` by state
- Look for correlations between grid policy activity and nameplate MW additions
**5. Siting Preemption vs. Local Control**
Are states passing bills to streamline or restrict local siting authority?
- Full-text search within `siting_permitting` bills for "preemption" vs. "local control"
- Map bill outcomes by state political environment (cross-ref RDH vote data)
### Suggested Joins
```sql
-- States with DCs and legislative activity by topic
SELECT
    dc.state,
    COUNT(DISTINCT dc.id) AS data_centers,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'data_center' = ANY(lb.relevance_tags)) AS dc_bills,
    COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'ratepayer_protection' = ANY(lb.relevance_tags)) AS ratepayer_bills,
    COUNT(DISTINCT lb.bill_id) FILTER (
        WHERE 'tax_incentive' = ANY(lb.relevance_tags) AND lb.status IN (4,8)
    ) AS tax_incentives_passed
FROM master_data_centers dc
LEFT JOIN legiscan_bills lb ON dc.state = lb.state AND lb.is_relevant
GROUP BY dc.state
ORDER BY data_centers DESC;
```
---
**Last Updated**: May 2026