Research Ideas & Future Work
This document outlines potential research directions, data improvements, and analyses that could extend the current US Data Centers geospatial research infrastructure.
Priority Next Steps
1. Backfill Power Capacity Data
Status: Only 5.9% of facilities have power_mw values (108/1,833)
Approach:
- Scrape Baxtel data center database (requires paid subscription)
- Use Data Center Map API/scraping
- Cross-reference with utility interconnection queue filings
- FOIA requests to state utility commissions for large loads
Research Impact:
- Enable capacity-weighted concentration metrics (current analyses are facility-count only)
- Correlate power capacity with demographic/environmental variables
- Identify "hyperscale" facilities (>100 MW) vs. edge/enterprise (<10 MW)
Implementation:
# Add capacity-weighted HHI calculation to analyze_dc_tract_concentration.py
capacity_weighted_hhi = sum((mw_i / total_mw) ** 2 for mw_i in tract_capacities)
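The one-line formula above can be wrapped in a small helper. This is a minimal sketch, assuming tract capacities arrive as MW values with missing entries as None; the function name and the decision to skip missing values are assumptions, not part of the existing script.

```python
from typing import Iterable, Optional


def capacity_weighted_hhi(tract_capacities: Iterable[Optional[float]]) -> float:
    """Herfindahl-Hirschman Index of capacity shares, on a 0-1 scale.

    Tracts with missing (None) or non-positive power_mw are skipped. That is
    an assumption of this sketch: with sparse power_mw coverage the result
    describes only the facilities that report capacity.
    """
    capacities = [mw for mw in tract_capacities if mw is not None and mw > 0]
    total_mw = sum(capacities)
    if total_mw == 0:
        raise ValueError("no positive capacity values to compute shares from")
    # Sum of squared shares: 1.0 means one tract holds all reported capacity.
    return sum((mw / total_mw) ** 2 for mw in capacities)
```

A value near 1/N indicates capacity spread evenly over N tracts; a value near 1 indicates capacity concentrated in a single tract.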
2. Operator Name Deduplication
Status: String fragmentation inflates operator counts ("Meta" vs. "Meta, Inc.", AWS variants)
Approach:
- Create operator_mapping table with canonical names
- Use fuzzy matching (e.g., fuzzywuzzy library) to standardize
- Add operator_canonical column to master_data_centers
Research Impact:
- Accurate hyperscaler market share analysis
- Study operator-specific siting strategies (AWS hydro, Microsoft nuclear, Meta solar)
- Enable "operator power" political economy analyses
Script:
# operators_dedupe.py
import pandas as pd
from fuzzywuzzy import process  # for fuzzy-matching names not listed below
# conn: an open connection to the project database, created elsewhere
# Load unique operators
operators = pd.read_sql("SELECT DISTINCT operator FROM master_data_centers", conn)
# Manual + fuzzy matching to canonical names
canonical_map = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
    # ... etc.
}
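One way the manual map and the fuzzy step could fit together is sketched below. To stay dependency-free it uses the standard library's difflib as a stand-in for fuzzywuzzy, so the similarity scores will differ from what fuzzywuzzy would return. The helper names and the 0.85 cutoff are hypothetical.

```python
import difflib
from typing import Dict, List, Optional

# Same shape as canonical_map above: canonical name -> known variants.
CANONICAL_MAP: Dict[str, List[str]] = {
    "Meta": ["Meta", "Meta, Inc.", "Meta Platforms", "Facebook"],
    "Amazon Web Services": ["AWS", "Amazon", "Amazon Web Services"],
}


def build_lookup(canonical_map: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the map so each lowercased variant points at its canonical name."""
    return {
        variant.lower(): canonical
        for canonical, variants in canonical_map.items()
        for variant in variants
    }


def canonicalize(raw: str, lookup: Dict[str, str], cutoff: float = 0.85) -> Optional[str]:
    """Return the canonical operator name, or None if nothing matches well.

    Exact (case-insensitive) matches win. Otherwise the closest known variant
    is accepted only if its difflib similarity ratio reaches `cutoff`.
    """
    key = raw.strip().lower()
    if key in lookup:
        return lookup[key]
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None
```

Names that return None are candidates for manual review before they are written to operator_canonical.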
3. Water Stress Overlay
Status: 257 HUC8 watersheds contain data centers; 15 watersheds hold 50% of facilities
Priority: HIGH - Critical for environmental impact analysis
Approach:
- Join to USGS WaterWatch streamflow data
- Add USGS Drought Watch indicators by HUC8
- Correlate data center density with:
- Groundwater depletion rates
- Surface water withdrawal permits
- Drought frequency/severity (USDM historical data)
Key Watersheds for Focus:
- Middle Potomac-Catoctin (HUC8 02070008): 235 DCs (12.8% of US total) - Loudoun/Ashburn
- Middle Potomac-Anacostia-Occoquan (02070010): 111 DCs - Fairfax/inner Loudoun
- Coyote (18050003): 88 DCs - Silicon Valley
- Upper Scioto (05060001): 73 DCs - Columbus OH
- Umatilla (17070103): 29 DCs - AWS-exclusive watershed
Research Questions:
- Are data centers sited in water-stressed watersheds?
- Do high-density clusters (Loudoun County, Columbus OH) face water constraints?
- Compare water stress in hyperscaler non-metro sites (Columbia River corridor) vs. metro clusters
- Does single-operator watershed capture (Umatilla = AWS only) correlate with water availability?
Tables to Create:
- watershed_water_stress - HUC8-level water stress indicators
- data_center_water_risk - Per-facility water-stress exposure
Notebook: water_stress_analysis.ipynb
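The concentration figures in the status line ("15 watersheds hold 50% of facilities") can be recomputed from per-watershed counts. The sketch below uses only the five watersheds listed above and the 1,833-facility inventory total, so it cannot reach the 50% mark on its own; the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Facility counts for the five watersheds listed above, keyed by HUC8 code.
TOP_WATERSHEDS: Dict[str, int] = {
    "02070008": 235,  # Middle Potomac-Catoctin
    "02070010": 111,  # Middle Potomac-Anacostia-Occoquan
    "18050003": 88,   # Coyote
    "05060001": 73,   # Upper Scioto
    "17070103": 29,   # Umatilla
}
TOTAL_FACILITIES = 1833  # inventory size quoted in the power-capacity status


def cumulative_share(counts: Dict[str, int], total: int) -> List[float]:
    """Cumulative facility share after each watershed, largest first."""
    running, shares = 0, []
    for n in sorted(counts.values(), reverse=True):
        running += n
        shares.append(running / total)
    return shares


def watersheds_needed(counts: Dict[str, int], total: int,
                      target_share: float) -> Optional[int]:
    """Smallest number of watersheds whose combined share reaches the target.

    Returns None when the supplied counts never reach the target, as happens
    for 0.5 here because only five of the 257 watersheds are listed.
    """
    for k, share in enumerate(cumulative_share(counts, total), start=1):
        if share >= target_share:
            return k
    return None
```

Run against the full 257-watershed count table, watersheds_needed(counts, total, 0.5) should reproduce the status-line figure.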
4. Opposition Cases Overlay
Status: 18 geocoded opposition cases in opposition_cases_geocoded table (loaded but unused)
Approach:
- Expand dataset: Compile additional cases of rejected/delayed data center proposals from news archives
- Geocode all opposition cases, join to demographics/hazards
- Test hypotheses:
- Do wealthier/more educated communities successfully block projects?
- Are opposition cases more common in water-stressed or drought-prone areas?
- Do smaller non-metro communities have less bargaining power?
- Does clustered vs. isolated location predict opposition likelihood?
Research Questions:
- What predicts opposition success?
- Are opposition cases spatially clustered?
- Do demographics differ between accepted vs. rejected sites?
- Correlation with FEMA hazard exposure scores?
Analysis Plan:
-- Join opposition cases to demographics
SELECT o.*, ct.median_household_income, ct.bachelors_or_higher_pct
FROM opposition_cases_geocoded o
JOIN _dc_census_tract_acs_2024 ct
ON ST_Contains(ct.geom, o.geom);
Output: opposition_cases_analysis.md
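With only 18 cases, a permutation test is one reasonable way to ask whether demographics differ between accepted and rejected sites, because it does not rely on large-sample approximations. The sketch below is a generic two-sided test on a difference in means; the function name, defaults, and the choice of test statistic are assumptions, not an existing part of the analysis.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_test(group_a: Sequence[float], group_b: Sequence[float],
                     n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation p-value for a difference in group means.

    Group labels are shuffled repeatedly; the p-value is the fraction of
    shuffles whose absolute mean difference is at least the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)
```

The inputs would be, for example, median_household_income for tracts where proposals were blocked versus tracts where they proceeded, taken from the join above.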
5. IM3 Forward Projection Integration
Status: IM3 model data loaded in im3_state_projected_moderate_50 (328 rows) and im3_projected_state_demand_summary (31 rows)
Approach:
- Load IM3 projected demand scenarios (2030, 2040, 2050)
- Overlay projected growth with:
- Current grid saturation (% of generation within 50 km)
- Water stress indicators
- Land availability (zoned industrial parcels)
- Identify regions where projected demand may exceed infrastructure capacity
Grid Saturation Context (from current analysis):
- New Jersey: 83% of grid within 50 km of DC
- Nevada: 75%
- Tennessee: 70%
- Oregon: 68%
- Arizona: 56%
- Virginia: 50%
Research Questions:
- Which states face grid saturation from data center growth?
- Are projected sites in water-stressed watersheds?
- Does IM3 assume spatial distribution patterns consistent with current siting?
- Can states with >50% grid saturation accommodate projected demand?
Implementation:
-- Compare current saturation to IM3 projected demand
-- (alias "cur" avoids CURRENT, which is a reserved word in standard SQL)
SELECT
    cur.state,
    cur.dc_count,
    cur.pct_grid_saturated,
    proj.facility_count AS projected_new_facilities,
    proj.total_it_mw AS projected_new_mw
FROM state_grid_saturation cur
JOIN im3_projected_state_demand_summary proj ON cur.state = proj.state
WHERE cur.pct_grid_saturated > 50
ORDER BY cur.pct_grid_saturated DESC;
Notebook: im3_projection_overlay.ipynb
Methodological Extensions
6. Time-Series Analysis of Cluster Growth
Approach:
- Use rfs_year (ready for service) from cable data and EIA generator vintage
- Reconstruct data center siting over time (requires RFS dates for facilities)
- Animate cluster formation in interactive map
Research Questions:
- Did Ashburn VA become dominant before or after major cable landings?
- Do clusters grow via agglomeration (new facilities near existing) or simultaneous build-out?
- Correlation between energy infrastructure build-out and data center growth
Data Needed:
- Facility RFS dates (scrape from press releases, Baxtel historical data)
- Historical tract demographics (decennial Census + ACS back to 2000)
7. Network Effects: Fiber Route Proximity
Status: Current analysis tests submarine cable proximity (negative result)
Approach:
- Obtain fiber optic backbone route GIS data (from FCC, carriers, or Infrapedia)
- Test proximity to long-haul fiber routes (not just submarine cables)
- Hypothesis: Data centers cluster near terrestrial long-haul fiber routes rather than submarine cable landings
Data Sources:
- FCC Form 477 fiber deployment data
- Infrapedia fiber route database
- State-level fiber maps (e.g., Virginia Broadband Map)
Expected Result: Positive correlation (unlike submarine cables)
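Before fiber geometries are loaded into PostGIS, proximity can be prototyped in plain Python. This is a minimal sketch, assuming routes are available as lists of (lat, lon) vertices; the function names are hypothetical, and distance to the nearest vertex only approximates distance to the route when vertices are densely spaced.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

LatLon = Tuple[float, float]


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_route_vertex_km(facility: LatLon,
                            route_vertices: Iterable[LatLon]) -> float:
    """Distance from a facility to the closest vertex of any fiber route."""
    return min(haversine_km(facility, v) for v in route_vertices)
```

For the full analysis, PostGIS ST_Distance on geography columns measures distance to the route line itself rather than to its vertices.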
8. Land Use & Zoning Analysis
Approach:
- Join data centers to local zoning classifications (industrial, commercial, etc.)
- Analyze land prices in data center tracts before/after facility construction
- Correlate with property tax revenues
Research Questions:
- Do data centers drive local property value increases?
- Are they preferentially sited in already-zoned industrial areas?
- Do host communities capture tax base growth?
Data Sources:
- Zillow Home Value Index (ZHVI) by ZIP
- ATTOM property tax assessments
- Municipal zoning GIS layers (city-specific, requires scraping/FOIA)
9. Environmental Justice Scoring
Approach:
- Compare data center host tracts to EPA's EJScreen indices
- Add CalEnviroScreen-style burden/benefit framework
- Test if data centers increase cumulative environmental burdens
Metrics:
- Air quality (PM2.5, ozone)
- Hazardous waste proximity
- Superfund site proximity
- Heat island effect (LST from Landsat)
- Noise pollution (traffic, cooling systems)
Expected Challenge: Data centers may improve local metrics (compared to heavy industry) but increase water/energy consumption
Policy & Political Economy Research
10. Tax Incentive Analysis
Approach:
- Compile state/local tax incentives for data center siting (property tax abatements, sales tax exemptions)
- Create data_center_incentives table with per-facility incentive details
- Correlate incentive generosity with:
- State fiscal health
- Local government bargaining power
- Facility size/operator
Research Questions:
- Do weaker fiscal states offer larger incentives?
- Are incentives regressive (larger for hyperscalers)?
- Do incentives predict siting decisions (natural experiment approach)?
Data Sources:
- Good Jobs First Subsidy Tracker
- State economic development agency press releases
- Local news archives
11. Employment & Labor Market Effects
Approach:
- Join to BLS Quarterly Census of Employment and Wages (QCEW) by county (QCEW is published at county level and above, not by ZIP)
- Identify "data center construction boom" periods (before/after major facility openings)
- Analyze employment effects in:
- Construction (NAICS 23)
- Transportation/warehousing (NAICS 48-49)
- Professional services (NAICS 54)
Research Questions:
- Do data centers create durable local employment?
- Are jobs filled by local residents or commuters?
- Wage effects in host tracts?
Data Sources:
- BLS QCEW
- Census LEHD Origin-Destination Employment Statistics (LODES)
12. Energy Cost Pass-Through
Approach:
- Join to state-level electricity rate data (EIA, utility rate tracker)
- Test if data center density correlates with residential rate increases
- Natural experiment: Compare rate trajectories in high-DC vs. low-DC states
Research Questions:
- Do data centers drive residential rate increases (capacity cost allocation)?
- Are rate increases concentrated in utility service territories with large data center loads?
- Do states with retail choice (deregulated markets) see different effects?
Data Sources:
- EIA Form 861 (retail rates by state/utility)
- Utility rate case filings (state public utility commissions)
Data Quality & Infrastructure Improvements
13. Address Validation & Geocoding Refinement
Approach:
- Re-geocode the 45 facilities using city-precision fallback
- Use USPS address validation API
- Cross-reference with Google Maps satellite imagery (manual review)
Implementation:
# Re-run geocoding with stricter thresholds
python3 load_postgis_data_centers.py --revalidate-addresses
14. OSM Continuous Monitoring
Approach:
- Set up automated Overpass API queries (daily/weekly)
- Detect new OSM data center tags
- Alert for review + merge into master_data_centers
Implementation:
- Cron job running load_postgis_osm_data_centers.py --update-only
- Slack/email notification on new facilities
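The detection step could look roughly like the sketch below. It assumes the inventory stores an OSM type/id pair for each OSM-sourced facility, which is not confirmed here, and it does not describe how load_postgis_osm_data_centers.py actually works. The Overpass query is included for context only; the helper names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Overpass QL for features tagged telecom=data_center inside the US boundary.
# Shown for context only; nothing in this sketch sends the query.
OVERPASS_QUERY = """
[out:json][timeout:120];
area["ISO3166-1"="US"][admin_level=2]->.us;
nwr["telecom"="data_center"](area.us);
out center;
"""


def osm_key(element: Dict) -> str:
    """Stable identifier such as 'way/123'; bare IDs repeat across element types."""
    return f"{element['type']}/{element['id']}"


def find_new_elements(elements: Iterable[Dict], known_keys: Set[str]) -> List[Dict]:
    """Elements from an Overpass JSON response that are not yet in the inventory."""
    return [el for el in elements if osm_key(el) not in known_keys]
```

The list returned by find_new_elements is what would be sent to Slack/email for review before any merge.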
15. Broadband Speed Validation
Approach:
- Cross-reference FCC BDC provider data with Ookla Speedtest results
- Test if data center host tracts have faster actual speeds (not just availability)
Hypothesis: Data center presence correlates with infrastructure investment → higher speeds
Data Sources:
- Ookla Open Data (aggregated Speedtest results by tile)
- FCC Measuring Broadband America
Visualization & Communication
16. Interactive Story Map
Approach:
- Build Scrollama.js narrative map
- Sections:
- National overview (cluster map)
- Ashburn VA zoom (dominance of single region)
- Demographics comparison (host vs. national)
- Water stress hot spots
- Energy infrastructure saturation
Output: story_map.html (standalone web page)
17. Policy Brief Generation
Approach:
- Auto-generate policy briefs from analysis outputs
- Targeted audiences:
- State legislators (energy/water policy)
- Local governments (tax incentive negotiation)
- Environmental justice advocates
Template:
# Data Center Siting in [STATE]: Key Facts for Policymakers
- **[STATE] hosts X% of US data centers** (rank: #Y)
- **Host communities are Z% wealthier** than state average
- **A% of state generation is within 50 km of a data center**
- **Top watershed holds B facilities** (water stress: [HIGH/MEDIUM/LOW])
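Auto-generation could start as plain string formatting over the template above. This is a minimal sketch; the field names are hypothetical stand-ins for the X/Y/Z/A/B placeholders, and the values would come from the existing analysis outputs.

```python
BRIEF_TEMPLATE = """\
# Data Center Siting in {state}: Key Facts for Policymakers
- **{state} hosts {share_pct:.1f}% of US data centers** (rank: #{rank})
- **Host communities are {wealth_gap_pct:.0f}% wealthier** than state average
- **{grid_pct:.0f}% of state generation is within 50 km of a data center**
- **Top watershed holds {top_watershed_count} facilities** (water stress: {water_stress})
"""

VALID_WATER_STRESS = {"HIGH", "MEDIUM", "LOW"}


def render_brief(**facts) -> str:
    """Fill the policy-brief template; raises KeyError if a field is missing."""
    if facts.get("water_stress") not in VALID_WATER_STRESS:
        raise ValueError("water_stress must be HIGH, MEDIUM, or LOW")
    return BRIEF_TEMPLATE.format(**facts)
```

Keeping the template as a single constant means the wording can be reviewed once and reused for every state.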
18. Comparative International Analysis
Approach:
- Extend methodology to EU, Canada, Australia
- Compare siting patterns (e.g., Nordic countries = renewable energy, cold climate)
- Test if "concentrated costs / dispersed benefits" holds internationally
Data Sources:
- OpenStreetMap (global coverage)
- Eurostat demographics
- IEA energy data
- TeleGeography global cable data (already available)
Research Questions:
- Are US patterns unique (tax-driven siting) vs. EU (regulatory constraints)?
- Do Nordic countries see more equitable distribution?
Speculative / Long-Term Ideas
19. AI Demand Forecasting
Approach:
- Train ML model to predict data center siting
- Features: demographics, energy capacity, fiber proximity, tax rates, water availability
- Test on historical data (train on pre-2015, test on 2015-2025)
Use Case:
- Identify "likely future sites" for proactive policy intervention
- Warn communities of potential incoming projects
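The pre-2015 / 2015-2025 evaluation implies a split by time rather than a random split, so the model is never tested on years it was trained on. A minimal sketch, assuming each facility record carries a ready-for-service year under a key such as rfs_year (facility RFS dates are listed above as data still needed, so this field is an assumption):

```python
from typing import Dict, List, Tuple


def temporal_split(facilities: List[Dict], cutoff_year: int = 2015,
                   end_year: int = 2025,
                   year_key: str = "rfs_year") -> Tuple[List[Dict], List[Dict]]:
    """Split records into train (before cutoff) and test (cutoff..end_year).

    Records with no year, or with a year after end_year, are dropped.
    """
    train, test = [], []
    for facility in facilities:
        year = facility.get(year_key)
        if year is None:
            continue
        if year < cutoff_year:
            train.append(facility)
        elif year <= end_year:
            test.append(facility)
    return train, test
```

Feature values (demographics, energy capacity, tax rates) would also need to be measured as of the training period to avoid leaking later information into the model.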
20. Cooling Technology Analysis
Approach:
- Classify facilities by cooling type (air, water, hybrid)
- Correlate with:
- Climate (CDD: cooling degree days)
- Water availability
- Facility size
Data Sources:
- Manual classification from news/press releases
- FOIA requests to water utilities (cooling water withdrawal permits)
Research Questions:
- Are water-cooled facilities concentrated in water-stressed regions (paradox)?
- Do hyperscalers use more efficient cooling (e.g., Meta's Prineville OR evaporative cooling)?
21. Bitcoin Mining Facilities
Approach:
- Overlay cryptocurrency mining facilities (subset of "data centers")
- Compare siting patterns: Bitcoin mines prefer low electricity costs (WA, TX, NY hydro)
- Test if Bitcoin mines face more opposition (negative perception)
Data Sources:
- Cambridge Bitcoin Electricity Consumption Index (has facility locations)
- News archives of mining farm proposals/rejections
22. Disaster Resilience & Redundancy
Approach:
- Model simultaneous hazard exposure across data center clusters
- E.g., "What % of US data centers are in wildfire risk zones?"
- Identify single points of failure (e.g., Ashburn VA = 20% of US capacity)
Research Questions:
- Is the current spatial distribution resilient to climate change?
- Should policy incentivize geographic diversification?
Output: disaster_resilience_report.md
23. Edge Data Center Network
Approach:
- Separately analyze edge facilities (<1 MW) vs. hyperscale (>100 MW)
- Test if edge DCs follow different siting logic (population density > energy cost)
Data Challenge: Current inventory does not distinguish edge vs. hyperscale (need power_mw backfill)
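Once power_mw is backfilled, the split could be a simple threshold rule. The sketch below uses the <1 MW and >100 MW cutoffs from this section; the "mid" and "unknown" labels and the function name are hypothetical.

```python
from typing import Optional


def size_class(power_mw: Optional[float], edge_max_mw: float = 1.0,
               hyperscale_min_mw: float = 100.0) -> str:
    """Label a facility as edge, hyperscale, mid, or unknown by capacity."""
    if power_mw is None:
        return "unknown"  # no power_mw value yet
    if power_mw < edge_max_mw:
        return "edge"
    if power_mw > hyperscale_min_mw:
        return "hyperscale"
    return "mid"
```

The thresholds are parameters so the same helper can be rerun with other cutoffs, such as a 10 MW upper bound for edge/enterprise facilities.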
24. Carbon Intensity of Host Grids
Approach:
- Join to EPA eGRID subregion carbon intensity (lb CO₂/MWh)
- Calculate per-facility estimated carbon footprint (if power_mw available)
- Compare to corporate renewable energy procurement (RECs, PPAs)
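The per-facility estimate reduces to capacity × hours × load factor × grid emission rate. The sketch below is a rough upper-bound calculation under stated assumptions: power_mw is treated as average electrical draw at the given load factor, cooling overhead (PUE) is ignored, and the function name and load_factor parameter are hypothetical.

```python
HOURS_PER_YEAR = 8760
KG_PER_LB = 0.45359237


def annual_co2_tonnes(power_mw: float, egrid_lb_per_mwh: float,
                      load_factor: float = 1.0) -> float:
    """Estimated annual CO2 in metric tonnes for one facility.

    power_mw          facility capacity in MW
    egrid_lb_per_mwh  eGRID subregion emission rate in lb CO2 per MWh
    load_factor       fraction of capacity drawn on average (assumed, 0-1)
    """
    if not 0.0 <= load_factor <= 1.0:
        raise ValueError("load_factor must be between 0 and 1")
    annual_mwh = power_mw * load_factor * HOURS_PER_YEAR
    return annual_mwh * egrid_lb_per_mwh * KG_PER_LB / 1000.0
```

With load_factor left at 1.0 the result is a ceiling rather than an expected value; it also reflects location-based emissions only, before any REC or PPA accounting.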
Research Questions:
- Are data centers disproportionately in high-carbon grids?
- Do hyperscaler renewable commitments offset grid carbon?
Data Sources:
- EPA eGRID
- Corporate sustainability reports (Google, Microsoft, Meta, AWS)
Collaboration Opportunities
Academic Partnerships
- Energy researchers: Joint analysis of grid saturation + IM3 projections
- Environmental justice scholars: EJScreen overlay + opposition case studies
- Political scientists: Tax incentive analysis + local government bargaining power
Policy Stakeholders
- State energy offices: Share grid saturation maps
- Water resource agencies: Watershed analysis for permitting
- Local governments: Demographic/tax revenue analysis for negotiation leverage
Industry Engagement
- Data center operators: Validate facility data, discuss siting criteria
- Colocation providers: Access to tenant mix data (multi-tenant vs. single-tenant)
Tools & Infrastructure Improvements
Database Enhancements
- Add version column to track data vintage
- Implement audit_log table for data lineage
- Set up automated backups to S3/Azure Blob
Code Quality
- Add unit tests for geocoding functions
- Create config.yaml for database credentials (replace hardcoded env vars)
- Dockerize analysis environment for reproducibility
Documentation
- Add docstrings (e.g., Google or NumPy style) to all Python functions
- Create CONTRIBUTING.md for external collaborators
- Record Jupyter notebook walkthroughs (video tutorials)
Unfunded / Ambitious Ideas
25. Real-Time Energy Monitoring
- Partner with utility to get live load data from data center substations
- Build dashboard showing real-time energy consumption by facility
- Correlate with AWS/Azure/GCP service outages (reverse-engineer capacity from brownouts)
26. Social Media Sentiment Analysis
- Scrape Twitter/Reddit for mentions of local data center projects
- NLP sentiment analysis: support vs. opposition
- Correlate sentiment with facility approval outcomes
27. LIDAR Analysis of Cooling Infrastructure
- Use aerial LIDAR to measure rooftop cooling equipment volume
- Proxy for facility size (cooling = f(IT load))
- Build predictive model: cooling equipment → power capacity
Contact & Contributions
If you're interested in collaborating on any of these research directions, please contact the repository owner.
Priorities for external collaboration:
- Power capacity data acquisition
- Water stress/drought overlay
- Opposition cases database compilation
- International comparative analysis
References for Future Work
Data Sources to Explore
- Department of Energy: Grid resilience reports, interconnection queues
- NREL: Renewable energy potential by HUC (solar, wind)
- USDA: Agricultural water use by county (competition for water)
- NOAA: Climate normals + projections by grid cell
- BLS: QCEW employment data, wage data
- EPA: eGRID, EJScreen, Superfund sites
Academic Literature Gaps
- Limited peer-reviewed research on data center spatial concentration
- No published studies on water stress exposure of data centers
- Opportunity for "first mover" publication in major geography/planning journals
Policy Levers to Investigate
- State renewable portfolio standards (RPS) → data center siting
- Federal infrastructure investment (IRA, CHIPS Act) → energy grid capacity
- Local zoning reform (industrial land use restrictions)
Legislative Analysis (LegiScan Data)
Status: Data loaded — 1.3M bills across all US states + federal, 2016–2026; ~60K tagged relevant.
Tables: legiscan_sessions, legiscan_bills
Query file: query_legiscan_bills.sql
Research Questions
1. Ratepayer Cost Shifting
Do states with high data center density show more legislative activity on ratepayer protection?
- Join legiscan_bills WHERE 'ratepayer_protection' = ANY(relevance_tags) to master_data_centers counts by state
- Test correlation between DC concentration and # of ratepayer bills introduced/passed
- Compare outcomes: do high-DC states pass or fail more ratepayer protections?
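Because facility counts are heavily skewed toward a few states, a rank correlation is likely a safer first test than Pearson's r. The sketch below implements Spearman's rho in plain Python with average ranks for ties; the helper names are hypothetical, and the inputs would be per-state data center counts and ratepayer bill counts.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))
```

A correlation here would not separate the effect of data center density from state size or overall legislative volume, so bill counts may need normalizing by total bills introduced.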
2. Data Center Legislative Wave
Is there a measurable increase in DC-specific legislation after 2022 (AI boom)?
- Trend data_center and large_load tagged bills by year_start
- Cross-reference with major AI facility announcements (2022–2025)
3. Tax Incentive Geography
Which states enacted tax incentives that may have influenced DC location decisions?
- tax_incentive bills with status IN (4,8) (passed/chaptered)
- Overlay with master_data_centers growth by state over the same period
- Candidate for difference-in-differences analysis
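In its simplest two-group, two-period form, the difference-in-differences estimate is the change in the treated group minus the change in the control group. A minimal sketch follows; the function name is hypothetical, and the outcome could be, for example, net new data centers per state before and after an incentive was enacted.

```python
from statistics import mean
from typing import Sequence


def diff_in_diff(treated_pre: Sequence[float], treated_post: Sequence[float],
                 control_pre: Sequence[float], control_post: Sequence[float]) -> float:
    """(treated post - treated pre) - (control post - control pre), using means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change
```

The estimate is only interpretable if treated and control states would have followed parallel trends without the incentive, which needs checking against pre-period data; staggered enactment dates across states would call for a regression-based design instead.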
4. Grid Interconnection Policy
Do states with grid_impact legislation show different EIA capacity expansion patterns?
- Join relevant bills to energy_eia_operating_generator_capacity_flat by state
- Look for correlations between grid policy activity and nameplate MW additions
5. Siting Preemption vs. Local Control
Are states passing bills to streamline or restrict local siting authority?
- Full-text search within siting_permitting bills for "preemption" vs. "local control"
- Map bill outcomes by state political environment (cross-ref RDH vote data)
Suggested Joins
-- States with DCs and legislative activity by topic
SELECT
dc.state,
COUNT(DISTINCT dc.id) AS data_centers,
COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'data_center' = ANY(relevance_tags)) AS dc_bills,
COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'ratepayer_protection'= ANY(relevance_tags)) AS ratepayer_bills,
COUNT(DISTINCT lb.bill_id) FILTER (WHERE 'tax_incentive' = ANY(relevance_tags)
AND lb.status IN (4,8)) AS tax_incentives_passed
FROM master_data_centers dc
LEFT JOIN legiscan_bills lb ON dc.state = lb.state AND lb.is_relevant
GROUP BY dc.state
ORDER BY data_centers DESC;
Last Updated: May 2026