Datasets

Summary

PyHazards provides a unified dataset interface for hazard prediction across tabular, temporal, and raster data. Each dataset returns a DataBundle containing splits, feature specs, label specs, and metadata.
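
The sketch below shows the intended loading workflow. It assumes load_dataset("era5") resolves the registered dataset and returns its loaded DataBundle; that return contract is an assumption here, not a documented guarantee.

from pyhazards.datasets import load_dataset

# Resolve a registered dataset by name.
# Assumption: load_dataset returns the loaded DataBundle directly;
# the registry's exact return contract may differ.
bundle = load_dataset("era5")

train = bundle.splits["train"]  # a named DataSplit
spec = bundle.feature_spec      # describes the model inputs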

Available datasets

merra2

Global atmospheric reanalysis from NASA GMAO MERRA-2, widely used as a source of hourly gridded meteorological drivers for hazard modeling; see Gelaro et al. (2017).

era5

ECMWF ERA5 reanalysis served via the Copernicus Climate Data Store (CDS), providing hourly single- and pressure-level variables for benchmarks and hazard covariates; see Hersbach et al. (2020).

noaa_flood

Flood-related event reports from the NOAA Storm Events Database (time, location, impacts), commonly used for event-level labeling and impact analysis.

firms

Near-real-time active fire detections from NASA FIRMS (MODIS/VIIRS), used for operational monitoring and as wildfire occurrence labels; see Schroeder et al. (2014).

mtbs

US wildfire perimeters and burn-severity layers from the Monitoring Trends in Burn Severity (MTBS) program (Landsat-derived), used for post-fire assessment and long-term fire-regime studies; see Eidenshink et al. (2007).

landfire

Nationwide fuels and vegetation layers from the USFS LANDFIRE program, often used as static landscape covariates for wildfire behavior and risk modeling.

wfigs

Authoritative incident-level wildfire records from the U.S. Wildland Fire Interagency Geospatial Services (WFIGS) ecosystem (ignition, location, status, extent), commonly used as ground-truth labels for wildfire occurrence.

goesr

High-frequency geostationary multispectral imagery from the NOAA GOES-R series, supporting continuous monitoring (e.g., smoke/thermal context) and early detection workflows when paired with fire and meteorology datasets.

Dataset inspection

PyHazards includes a built-in inspection utility for quickly exploring a dataset's structure and contents through a unified API.

The example below inspects a daily MERRA-2 file from the command line:

python -m pyhazards.datasets.inspection --date 2024-01-01 --outdir outputs/
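
For comparison, the sketch below shows a manual inspection of the same kind of file using xarray; the file path is hypothetical, and this illustrates what such an inspection surfaces rather than the PyHazards utility itself.

import xarray as xr

# Hypothetical local path to a daily MERRA-2 NetCDF file.
path = "data/merra2_2024-01-01.nc4"

ds = xr.open_dataset(path)

# Dimension sizes, variable names, and global attributes summarize
# the file's structure.
print(ds.sizes)
print(list(ds.data_vars))
print(dict(ds.attrs))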

Core classes

  • Dataset: base class; subclasses implement _load() and return a DataBundle.

  • DataBundle: holds named DataSplit objects, plus feature_spec and label_spec.

  • FeatureSpec / LabelSpec: describe inputs/targets to simplify model construction.

  • register_dataset / load_dataset: lightweight registry for discovering datasets by name.

Example skeleton

import torch
from pyhazards.datasets import (
    DataBundle, DataSplit, Dataset, FeatureSpec, LabelSpec, register_dataset
)

class MyHazardDataset(Dataset):
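    # Registry key used by register_dataset / load_dataset.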
    name = "my_hazard"

    def _load(self):
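        # Synthetic example data: 1000 samples, 16 features, binary labels.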
        x = torch.randn(1000, 16)
        y = torch.randint(0, 2, (1000,))
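        # 80/10/10 train/val/test split.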
        splits = {
            "train": DataSplit(x[:800], y[:800]),
            "val": DataSplit(x[800:900], y[800:900]),
            "test": DataSplit(x[900:], y[900:]),
        }
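        # The specs describe inputs and targets for downstream model construction.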
        return DataBundle(
            splits=splits,
            feature_spec=FeatureSpec(input_dim=16, description="example features"),
            label_spec=LabelSpec(num_targets=2, task_type="classification"),
        )

register_dataset(MyHazardDataset.name, MyHazardDataset)
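
Once registered, the dataset is discoverable by name. Below is a minimal round trip, again assuming load_dataset returns the loaded DataBundle and that FeatureSpec / LabelSpec expose their constructor arguments as attributes.

from pyhazards.datasets import load_dataset

# Assumption: load_dataset resolves "my_hazard" from the registry and
# returns its loaded DataBundle.
bundle = load_dataset("my_hazard")

for name, split in bundle.splits.items():
    print(name, split)

print(bundle.feature_spec.input_dim)  # 16
print(bundle.label_spec.task_type)    # "classification"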