pyhazards.datasets package

Catalog Summary

This page links the public dataset catalog, the developer dataset workflow, and the package submodules used to register or inspect datasets.

For the curated browsing experience, see the Datasets page.

Shared Forcing

ERA5, GOES-R, MERRA-2.

Wildfire

FIRMS, FPA-FOD Tabular, FPA-FOD Weekly, LANDFIRE, MTBS, WFIGS.

Flood

Caravan, FloodCastBench, HydroBench, NOAA Flood Events, WaterBench.

Earthquake

AEFA Forecast, pick-benchmark, SeisBench.

Tropical Cyclone

IBTrACS, TCBench Alpha, TropiCycloneNet-Dataset.

Developer Dataset Workflow

Use this section when you need the package-level registry and dataset builder interface rather than the public catalog presentation.

Inspect an External Dataset Source

python -m pyhazards.datasets.era5.inspection --path pyhazards/data/era5_subset --max-vars 10

Load a Registered Dataset

from pyhazards.datasets import available_datasets, load_dataset

print(available_datasets())  # names of every registered dataset

# load_dataset returns a Dataset instance; .load() materializes a DataBundle
data = load_dataset(
    "seisbench_waveforms",
    micro=True,
).load()
print(sorted(data.splits.keys()))

Register a Custom Dataset

from pyhazards.datasets import (
    DataBundle,
    DataSplit,
    Dataset,
    FeatureSpec,
    LabelSpec,
    register_dataset,
)

class MyDataset(Dataset):
    name = "my_dataset"

    def _load(self) -> DataBundle:
        raise NotImplementedError("Return a populated DataBundle here.")

register_dataset("my_dataset", MyDataset)
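Under the hood the registry is presumably a simple name-to-builder mapping; the following is a minimal local stand-in for the three documented functions (available_datasets, register_dataset, load_dataset), not the package's actual implementation:

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in registry mirroring the documented surface.
_REGISTRY: Dict[str, Callable[..., Any]] = {}

def register_dataset(name: str, builder: Callable[..., Any]) -> None:
    # Map a dataset name to its builder (usually a Dataset subclass).
    _REGISTRY[name] = builder

def available_datasets():
    # List every registered name.
    return sorted(_REGISTRY)

def load_dataset(name: str, **kwargs) -> Any:
    # Instantiate the builder registered under `name`.
    return _REGISTRY[name](**kwargs)

class MyDataset:
    def __init__(self, cache_dir=None):
        self.cache_dir = cache_dir

register_dataset("my_dataset", MyDataset)
print(available_datasets())                       # ['my_dataset']
print(type(load_dataset("my_dataset")).__name__)  # MyDataset
```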

Notes

  • Public dataset docs are generated from cards in pyhazards/dataset_cards.

  • Run python scripts/render_dataset_docs.py after editing cards or generated dataset docs.

  • Use Implementation Guide for the full contributor workflow.

Submodules

pyhazards.datasets.base module

class pyhazards.datasets.base.DataBundle(splits, feature_spec, label_spec, metadata=<factory>)[source]

Bases: object

Bundle of train/val/test splits plus metadata. Keeps feature/label specs to make model construction easy.

feature_spec: FeatureSpec
get_split(name)[source]
Return type:

DataSplit

label_spec: LabelSpec
metadata: Dict[str, Any]
splits: Dict[str, DataSplit]
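The pieces above fit together as follows. This is a minimal sketch using local stand-in dataclasses that mirror the documented fields; the real classes live in pyhazards.datasets.base:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Stand-ins mirroring the documented field layout.
@dataclass
class DataSplit:
    inputs: Any
    targets: Any
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class DataBundle:
    splits: Dict[str, "DataSplit"]
    feature_spec: Any
    label_spec: Any
    metadata: Dict[str, Any] = field(default_factory=dict)

    def get_split(self, name: str) -> DataSplit:
        # Documented accessor: look up a named split.
        return self.splits[name]

bundle = DataBundle(
    splits={"train": DataSplit(inputs=[[0.1, 0.2]], targets=[1])},
    feature_spec=None,
    label_spec=None,
)
print(bundle.get_split("train").targets)  # [1]
```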
class pyhazards.datasets.base.DataSplit(inputs, targets, metadata=<factory>)[source]

Bases: object

Container for a single split.

inputs: Any
metadata: Dict[str, Any]
targets: Any
class pyhazards.datasets.base.Dataset(cache_dir=None)[source]

Bases: object

Base class for hazard datasets. Subclasses should load data and return a DataBundle with splits ready for training.

_load()[source]
Return type:

DataBundle

load(split=None, transforms=None)[source]

Return a DataBundle, or only the requested split when split is provided.

Return type:

DataBundle

name: str = 'base'
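A subclass implements only _load(); load() is the public entry point and, per the signature above, can narrow the result to one split. A hedged sketch of that pattern, with splits held in a plain dict rather than a real DataBundle, and transforms handling omitted:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class DataSplit:
    inputs: Any
    targets: Any

# Minimal stand-in for the documented Dataset base class.
class Dataset:
    name = "base"

    def __init__(self, cache_dir=None):
        self.cache_dir = cache_dir

    def _load(self):
        raise NotImplementedError

    def load(self, split=None, transforms=None):
        # Illustrative: materialize once, then optionally narrow to one split.
        bundle = self._load()
        if split is not None:
            return bundle[split]
        return bundle

class ToyDataset(Dataset):
    name = "toy"

    def _load(self):
        return {
            "train": DataSplit(inputs=[1, 2], targets=[0, 1]),
            "val": DataSplit(inputs=[3], targets=[1]),
        }

print(ToyDataset().load(split="val").inputs)  # [3]
```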
class pyhazards.datasets.base.FeatureSpec(input_dim=None, channels=None, description=None, extra=<factory>)[source]

Bases: object

Describes input features (shapes, dtypes, normalization).

channels: Optional[int] = None
description: Optional[str] = None
extra: Dict[str, Any]
input_dim: Optional[int] = None
class pyhazards.datasets.base.LabelSpec(num_targets=None, task_type='regression', description=None, extra=<factory>)[source]

Bases: object

Describes labels/targets for downstream tasks.

description: Optional[str] = None
extra: Dict[str, Any]
num_targets: Optional[int] = None
task_type: str = 'regression'
class pyhazards.datasets.base.Transform(*args, **kwargs)[source]

Bases: Protocol

Callable data transform.


pyhazards.datasets.registry module

pyhazards.datasets.registry.available_datasets()[source]
pyhazards.datasets.registry.load_dataset(name, **kwargs)[source]
Return type:

Dataset

pyhazards.datasets.registry.register_dataset(name, builder)[source]
Return type:

None

pyhazards.datasets.transforms package

Reusable transforms for preprocessing hazard datasets. Currently placeholders; implement normalization, index computation, temporal windowing, etc.
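The Transform protocol documented above only requires a callable, so a normalization transform could be sketched as a plain class with __call__. The class name and parameters here are hypothetical, not part of the package:

```python
class ZScoreNormalize:
    """Callable transform: z-score normalization.

    Satisfies the documented Transform protocol, which only
    requires that the object be callable.
    """

    def __init__(self, mean: float, std: float):
        self.mean = mean
        self.std = std

    def __call__(self, values):
        # Shift by the mean, scale by the standard deviation.
        return [(v - self.mean) / self.std for v in values]

norm = ZScoreNormalize(mean=2.0, std=2.0)
print(norm([0.0, 2.0, 4.0]))  # [-1.0, 0.0, 1.0]
```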

pyhazards.datasets.hazards package

Namespace for hazard-specific dataset loaders (earthquake, wildfire, flood, hurricane, landslide, etc.). Populate with concrete Dataset subclasses and register them in pyhazards.datasets.registry.

Module contents

class pyhazards.datasets.AEFADataset(cache_dir=None, samples=40, channels=3, temporal_in=5, temporal_out=4, height=12, width=10, micro=False)[source]

Bases: SyntheticEarthquakeForecastDataset

Synthetic-backed adapter for AEFA-style earthquake forecasting inputs.

_load()[source]
Return type:

DataBundle

name: str = 'aefa_forecast'
class pyhazards.datasets.CaravanStreamflowDataset(cache_dir=None, samples=40, history=4, nodes=6, features=2, micro=False)[source]

Bases: SyntheticFloodStreamflowDataset

Synthetic-backed streamflow adapter for Caravan-style smoke runs.

_load()[source]
Return type:

DataBundle

name: str = 'caravan_streamflow'
class pyhazards.datasets.DataBundle(splits, feature_spec, label_spec, metadata=<factory>)[source]

Bases: object

Bundle of train/val/test splits plus metadata. Keeps feature/label specs to make model construction easy.

feature_spec: FeatureSpec
get_split(name)[source]
Return type:

DataSplit

label_spec: LabelSpec
metadata: Dict[str, Any]
splits: Dict[str, DataSplit]
class pyhazards.datasets.DataSplit(inputs, targets, metadata=<factory>)[source]

Bases: object

Container for a single split.

inputs: Any
metadata: Dict[str, Any]
targets: Any
class pyhazards.datasets.Dataset(cache_dir=None)[source]

Bases: object

Base class for hazard datasets. Subclasses should load data and return a DataBundle with splits ready for training.

_load()[source]
Return type:

DataBundle

load(split=None, transforms=None)[source]

Return a DataBundle, or only the requested split when split is provided.

Return type:

DataBundle

name: str = 'base'
class pyhazards.datasets.FPAFODTabularDataset(task='cause', region='US', cause_mode='paper5', data_path=None, micro=False, normalize=False, train_ratio=0.6, val_ratio=0.2, test_ratio=0.2, seed=1337, cache_dir=None)[source]

Bases: Dataset

Incident-level tabular dataset for wildfire cause or size classification.

_load()[source]
Return type:

DataBundle

name: str = 'fpa_fod_tabular'
class pyhazards.datasets.FPAFODWeeklyDataset(region='US', data_path=None, micro=False, lookback_weeks=50, features='counts', train_ratio=0.6, val_ratio=0.2, test_ratio=0.2, seed=1337, cache_dir=None)[source]

Bases: Dataset

Weekly count forecasting dataset derived from FPA-FOD incident records.

_load()[source]
Return type:

DataBundle

_weekly_table()[source]
name: str = 'fpa_fod_weekly'
class pyhazards.datasets.FeatureSpec(input_dim=None, channels=None, description=None, extra=<factory>)[source]

Bases: object

Describes input features (shapes, dtypes, normalization).

channels: Optional[int] = None
description: Optional[str] = None
extra: Dict[str, Any]
input_dim: Optional[int] = None
class pyhazards.datasets.FloodCastBenchInundationDataset(cache_dir=None, samples=40, history=4, channels=3, height=16, width=16, micro=False)[source]

Bases: SyntheticFloodInundationDataset

Synthetic-backed inundation adapter for FloodCastBench-style smoke runs.

_load()[source]
Return type:

DataBundle

name: str = 'floodcastbench_inundation'
class pyhazards.datasets.GraphTemporalDataset(x, y, adjacency=None)[source]

Bases: Dataset

Simple container for county/day style tensors with an optional adjacency.

Each sample is a window of shape (past_days, num_counties, num_features) and a label of shape (num_counties,).
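The window and label shapes described can be produced from a (num_days, num_counties, num_features) array like this; a numpy sketch of the windowing, not the package's loader:

```python
import numpy as np

num_days, num_counties, num_features = 30, 5, 3
past_days = 7

rng = np.random.default_rng(0)
series = rng.random((num_days, num_counties, num_features))
target = rng.random((num_days, num_counties))  # per-county, per-day labels

# Slide a past_days window over the series; each label is the
# per-county target on the day after the window ends.
windows = np.stack([series[t:t + past_days]
                    for t in range(num_days - past_days)])
labels = np.stack([target[t + past_days]
                   for t in range(num_days - past_days)])

print(windows.shape)  # (23, 7, 5, 3): (samples, past_days, num_counties, num_features)
print(labels.shape)   # (23, 5): (samples, num_counties)
```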

class pyhazards.datasets.HydroBenchStreamflowDataset(cache_dir=None, samples=40, history=4, nodes=6, features=2, micro=False)[source]

Bases: SyntheticFloodStreamflowDataset

Synthetic-backed streamflow adapter for HydroBench diagnostics.

_load()[source]
Return type:

DataBundle

name: str = 'hydrobench_streamflow'
class pyhazards.datasets.IBTrACSTropicalCycloneDataset(cache_dir=None, samples=64, history=6, horizon=5, features=8, micro=False)[source]

Bases: SyntheticTropicalCycloneDataset

Synthetic-backed adapter for IBTrACS-style storm tracks.

_load()[source]
Return type:

DataBundle

name: str = 'ibtracs_tracks'
class pyhazards.datasets.LabelSpec(num_targets=None, task_type='regression', description=None, extra=<factory>)[source]

Bases: object

Describes labels/targets for downstream tasks.

description: Optional[str] = None
extra: Dict[str, Any]
num_targets: Optional[int] = None
task_type: str = 'regression'
class pyhazards.datasets.PickBenchmarkWaveformDataset(cache_dir=None, samples=96, channels=3, length=256, micro=False)[source]

Bases: SyntheticEarthquakeWaveformDataset

Synthetic-backed adapter with the pick-benchmark public dataset surface.

_load()[source]
Return type:

DataBundle

name: str = 'pick_benchmark_waveforms'
class pyhazards.datasets.SeisBenchWaveformDataset(cache_dir=None, samples=96, channels=3, length=256, micro=False)[source]

Bases: SyntheticEarthquakeWaveformDataset

Synthetic-backed adapter with the SeisBench public dataset surface.

_load()[source]
Return type:

DataBundle

name: str = 'seisbench_waveforms'
class pyhazards.datasets.SyntheticEarthquakeForecastDataset(cache_dir=None, samples=40, channels=3, temporal_in=5, temporal_out=4, height=12, width=10, micro=False)[source]

Bases: Dataset

Synthetic wavefield dataset for earthquake forecasting smoke runs.

_load()[source]
Return type:

DataBundle

name: str = 'earthquake_forecast_synthetic'
class pyhazards.datasets.SyntheticEarthquakeWaveformDataset(cache_dir=None, samples=96, channels=3, length=256, micro=False)[source]

Bases: Dataset

Synthetic waveform dataset for earthquake phase-picking smoke runs.

_load()[source]
Return type:

DataBundle

name: str = 'earthquake_waveforms'
class pyhazards.datasets.SyntheticFloodInundationDataset(cache_dir=None, samples=40, history=4, channels=3, height=16, width=16, micro=False)[source]

Bases: Dataset

Synthetic raster dataset for flood inundation smoke runs.

_load()[source]
Return type:

DataBundle

name: str = 'flood_inundation_synthetic'
class pyhazards.datasets.SyntheticFloodStreamflowDataset(cache_dir=None, samples=40, history=4, nodes=6, features=2, micro=False)[source]

Bases: Dataset

Synthetic graph-temporal flood dataset for streamflow smoke runs.

_load()[source]
Return type:

DataBundle

_make_split(x, y, adj)[source]
Return type:

DataSplit

name: str = 'flood_streamflow_synthetic'
class pyhazards.datasets.SyntheticTropicalCycloneDataset(cache_dir=None, samples=64, history=6, horizon=5, features=8, micro=False)[source]

Bases: Dataset

Synthetic storm-history dataset for track/intensity smoke runs.

_load()[source]
Return type:

DataBundle

name: str = 'tc_tracks_synthetic'
class pyhazards.datasets.SyntheticWildfireSpreadDataset(cache_dir=None, samples=64, channels=12, height=32, width=32, micro=False)[source]

Bases: Dataset

Synthetic raster dataset for wildfire spread smoke runs.

_load()[source]
Return type:

DataBundle

name: str = 'wildfire_spread_synthetic'
class pyhazards.datasets.SyntheticWildfireSpreadTemporalDataset(cache_dir=None, samples=48, history=4, channels=6, height=16, width=16, micro=False)[source]

Bases: Dataset

Synthetic temporal wildfire spread dataset for sequence-based spread baselines.

_load()[source]
Return type:

DataBundle

name: str = 'wildfire_spread_temporal_synthetic'
class pyhazards.datasets.TCBenchAlphaDataset(cache_dir=None, samples=64, history=6, horizon=5, features=8, micro=False)[source]

Bases: SyntheticTropicalCycloneDataset

Synthetic-backed adapter for TCBench Alpha evaluation runs.

_load()[source]
Return type:

DataBundle

name: str = 'tcbench_alpha'
class pyhazards.datasets.TropiCycloneNetDataset(cache_dir=None, samples=64, history=6, horizon=5, features=8, micro=False)[source]

Bases: SyntheticTropicalCycloneDataset

Synthetic-backed adapter for TropiCycloneNet-Dataset style smoke runs.

_load()[source]
Return type:

DataBundle

name: str = 'tropicyclonenet_dataset'
class pyhazards.datasets.WaterBenchStreamflowDataset(cache_dir=None, samples=40, history=4, nodes=6, features=2, micro=False)[source]

Bases: SyntheticFloodStreamflowDataset

Synthetic-backed streamflow adapter for WaterBench-style smoke runs.

_load()[source]
Return type:

DataBundle

name: str = 'waterbench_streamflow'
pyhazards.datasets.available_datasets()[source]
pyhazards.datasets.graph_collate(batch)[source]

Collate function that stacks x and adjacency if provided.
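A hedged numpy sketch of the stacking behavior described, assuming each batch item is an (x, y, adjacency) tuple; the real function's batch layout and tensor backend may differ:

```python
import numpy as np

def graph_collate_sketch(batch):
    # Stack per-sample inputs and targets; stack adjacency only
    # when every sample provides one.
    xs = np.stack([item[0] for item in batch])
    ys = np.stack([item[1] for item in batch])
    adjs = [item[2] for item in batch]
    adj = np.stack(adjs) if all(a is not None for a in adjs) else None
    return xs, ys, adj

batch = [(np.zeros((7, 5, 3)), np.zeros(5), np.eye(5)) for _ in range(4)]
xs, ys, adj = graph_collate_sketch(batch)
print(xs.shape, ys.shape, adj.shape)  # (4, 7, 5, 3) (4, 5) (4, 5, 5)
```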

pyhazards.datasets.load_dataset(name, **kwargs)[source]
Return type:

Dataset

pyhazards.datasets.register_dataset(name, builder)[source]
Return type:

None