Datasets

Summary

PyHazards provides a unified dataset interface for hazard prediction across tabular, temporal, and raster data. Each dataset returns a DataBundle containing splits, feature specs, label specs, and metadata.
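
The sketch below shows the intended loading workflow. It assumes load_dataset("era5") resolves the registered dataset and returns its loaded DataBundle; that return contract is an assumption here, not a documented guarantee.

from pyhazards.datasets import load_dataset

# Resolve a registered dataset by name.
# Assumption: load_dataset returns the loaded DataBundle directly;
# the registry's exact return contract may differ.
bundle = load_dataset("era5")

train = bundle.splits["train"]  # a named DataSplit
spec = bundle.feature_spec      # describes the model inputs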

Available datasets

merra2

Global atmospheric reanalysis from NASA GMAO MERRA-2, widely used as a source of hourly gridded meteorological drivers for hazard modeling; see Gelaro et al. (2017).

era5

ECMWF ERA5 reanalysis served via the Copernicus Climate Data Store (CDS), providing hourly single- and pressure-level variables for benchmarks and hazard covariates; see Hersbach et al. (2020).

noaa_flood

Flood-related event reports from the NOAA Storm Events Database (time, location, impacts), commonly used for event-level labeling and impact analysis.

firms

Near-real-time active fire detections from NASA FIRMS (MODIS/VIIRS), used for operational monitoring and as wildfire occurrence labels; see Schroeder et al. (2014).

mtbs

US wildfire perimeters and burn-severity layers from the Monitoring Trends in Burn Severity (MTBS) program (Landsat-derived), used for post-fire assessment and long-term fire-regime studies; see Eidenshink et al. (2007).

landfire

Nationwide fuels and vegetation layers from the USFS LANDFIRE program, often used as static landscape covariates for wildfire behavior and risk modeling.

wfigs

Authoritative incident-level wildfire records from the U.S. Wildland Fire Interagency Geospatial Services (WFIGS) ecosystem (ignition, location, status, extent), commonly used as ground-truth labels for wildfire occurrence.

goesr

High-frequency geostationary multispectral imagery from the NOAA GOES-R series, supporting continuous monitoring (e.g., smoke/thermal context) and early detection workflows when paired with fire and meteorology datasets.

Dataset inspection

PyHazards includes a built-in inspection utility for quickly exploring a dataset's structure and contents through a unified API.

The example below inspects a daily MERRA-2 file from the command line:

python -m pyhazards.datasets.inspection --date 2024-01-01 --outdir outputs/
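
For comparison, the sketch below shows a manual inspection of the same kind of file using xarray; the file path is hypothetical, and this illustrates what such an inspection surfaces rather than the PyHazards utility itself.

import xarray as xr

# Hypothetical local path to a daily MERRA-2 NetCDF file.
path = "data/merra2_2024-01-01.nc4"

ds = xr.open_dataset(path)

# Dimension sizes, variable names, and global attributes summarize
# the file's structure.
print(ds.sizes)
print(list(ds.data_vars))
print(dict(ds.attrs))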

Core classes

  • Dataset: base class; subclasses implement _load() and return a DataBundle.

  • DataBundle: holds named DataSplit objects, plus feature_spec and label_spec.

  • FeatureSpec / LabelSpec: describe inputs/targets to simplify model construction.

  • register_dataset / load_dataset: lightweight registry for discovering datasets by name.

Example skeleton

import torch
from pyhazards.datasets import (
    DataBundle, DataSplit, Dataset, FeatureSpec, LabelSpec, register_dataset
)

class MyHazardDataset(Dataset):
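    # Registry key used by register_dataset / load_dataset.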
    name = "my_hazard"

    def _load(self):
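        # Synthetic example data: 1000 samples, 16 features, binary labels.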
        x = torch.randn(1000, 16)
        y = torch.randint(0, 2, (1000,))
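        # 80/10/10 train/val/test split.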
        splits = {
            "train": DataSplit(x[:800], y[:800]),
            "val": DataSplit(x[800:900], y[800:900]),
            "test": DataSplit(x[900:], y[900:]),
        }
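        # The specs describe inputs and targets for downstream model construction.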
        return DataBundle(
            splits=splits,
            feature_spec=FeatureSpec(input_dim=16, description="example features"),
            label_spec=LabelSpec(num_targets=2, task_type="classification"),
        )

register_dataset(MyHazardDataset.name, MyHazardDataset)
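
Once registered, the dataset is discoverable by name. Below is a minimal round trip, again assuming load_dataset returns the loaded DataBundle and that FeatureSpec / LabelSpec expose their constructor arguments as attributes.

from pyhazards.datasets import load_dataset

# Assumption: load_dataset resolves "my_hazard" from the registry and
# returns its loaded DataBundle.
bundle = load_dataset("my_hazard")

for name, split in bundle.splits.items():
    print(name, split)

print(bundle.feature_spec.input_dim)  # 16
print(bundle.label_spec.task_type)    # "classification"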