Rhein-Pegel · Citizen-Science

Data Pipeline & Sources

Last updated 2026-05-07

This page documents the complete data pipeline of the Rhine-Gauge Citizen-Science project. All data sources are public and freely accessible, so the project is reproducible end-to-end without paid services. The Pushover notification service (section 5) is the only paid component and is not a data source.

1. Live water-level and discharge — WSV PEGELONLINE

Source: PEGELONLINE of the Wasserstraßen- und Schifffahrtsverwaltung des Bundes (WSV — German Federal Waterways and Shipping Administration).

Coverage: Real-time water level (W) and discharge (Q) for 11 stations between Iffezheim and Mainz, including Mannheim-Neckar and Raunheim-Main.

API: REST/JSON, free, no API key required.

Polling cadence: Every 15 minutes — driven by poller.py running on the Hetzner server.

Storage: Bucket rhein_raw, measurement pegel, tags station, parameter (W|Q), field value. Roughly 1 million points/year live, infinite retention.

Source attribution: “Datenquelle: Pegelonline der WSV / Federal Waterways and Shipping Administration”
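The storage layout above maps directly onto InfluxDB line protocol. The following is a minimal sketch, not the project's poller.py: the function names, the station shortname WORMS, and the sample payload are illustrative assumptions, and the URL pattern is assumed to follow the public PEGELONLINE REST API v2.

```python
from datetime import datetime
from urllib.parse import quote

# Assumed base URL of the public PEGELONLINE REST API (v2).
BASE_URL = "https://www.pegelonline.wsv.de/webservices/rest-api/v2"


def current_measurement_url(station: str, parameter: str) -> str:
    """Build the URL for the latest value of one parameter (W or Q)."""
    if parameter not in ("W", "Q"):
        raise ValueError("parameter must be 'W' or 'Q'")
    return f"{BASE_URL}/stations/{quote(station)}/{parameter}/currentmeasurement.json"


def to_line_protocol(station: str, parameter: str, payload: dict) -> str:
    """Convert one measurement payload into an InfluxDB line-protocol record.

    Layout from this page: measurement 'pegel', tags 'station' and
    'parameter', field 'value'. The timestamp is written in seconds.
    """
    ts = datetime.fromisoformat(payload["timestamp"])
    # Escape spaces in the tag value, as line protocol requires.
    station_tag = station.replace(" ", "\\ ")
    return (
        f"pegel,station={station_tag},parameter={parameter} "
        f"value={float(payload['value'])} {int(ts.timestamp())}"
    )


# Hypothetical payload in the shape the API is assumed to return.
sample = {"timestamp": "2026-05-07T12:00:00+02:00", "value": 212.0}
print(current_measurement_url("WORMS", "W"))
print(to_line_protocol("WORMS", "W", sample))
```

Because the timestamp is emitted in seconds, a write using such records would need second precision set on the client.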

2. Historical archive 2000–2026 — PEGELONLINE

Source: PEGELONLINE archive endpoint (discovered 2026-05-03):
- POST https://www.pegelonline.wsv.de/gast/historische-zeitreihen/prepare-download
- Body: uuid, parameter, start (ISO-UTC), end (ISO-UTC), format=csv|json
- Response: 303 with a Location: header pointing to a ZIP download URL
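The archive request can be sketched with the standard library alone. This is an illustration under assumptions: the page lists only the body field names, so form encoding is assumed here, and the helper names and the placeholder UUID are hypothetical. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ARCHIVE_URL = (
    "https://www.pegelonline.wsv.de/gast/historische-zeitreihen/prepare-download"
)


def build_archive_request(uuid: str, parameter: str, start: str, end: str,
                          fmt: str = "csv") -> Request:
    """Prepare (but do not send) the POST for one station and parameter.

    Assumption: the body is form-encoded; the encoding is not documented here.
    """
    if fmt not in ("csv", "json"):
        raise ValueError("format must be 'csv' or 'json'")
    body = urlencode({
        "uuid": uuid,
        "parameter": parameter,
        "start": start,  # ISO 8601, UTC
        "end": end,      # ISO 8601, UTC
        "format": fmt,
    }).encode("ascii")
    return Request(ARCHIVE_URL, data=body, method="POST")


def zip_url_from_redirect(status: int, headers: dict) -> str:
    """Return the ZIP download URL from a 303 response's Location header."""
    if status != 303:
        raise RuntimeError(f"expected 303 redirect, got {status}")
    return headers["Location"]


# Placeholder UUID for illustration only, not a real station.
req = build_archive_request(
    "00000000-0000-0000-0000-000000000000", "W",
    "2000-01-01T00:00:00Z", "2000-12-31T23:59:59Z",
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Note that urllib.request.urlopen follows redirects by default, so when the request is actually sent the ZIP body may arrive directly instead of the 303 response.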

Coverage: Daily resolution back to 2000-01-01 for all PEGELONLINE stations.

Format quirks:
- WSA stations (Worms, Nierstein, Mainz, Speyer, Mannheim): WASSERSTAND ROHDATEN (with a space!) and ABFLUSS_ROHDATEN
- HLNUG stations (Raunheim): WASSERSTAND ROHDATEN, but Q is ABFLUSS (without ROHDATEN)
- All values in MEZ (Central European Time, year-round, no DST)
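These quirks can be handled with a small lookup table and a fixed-offset time zone. The sketch below is one possible approach, not the project's code: the dictionary layout and function names are hypothetical, and the timestamps are assumed to arrive as naive ISO 8601 strings.

```python
from datetime import datetime, timedelta, timezone

# MEZ is a fixed UTC+1 offset with no daylight saving time.
MEZ = timezone(timedelta(hours=1), "MEZ")

# Names per operator group, as listed above.
SERIES_NAMES = {
    "WSA": {"W": "WASSERSTAND ROHDATEN", "Q": "ABFLUSS_ROHDATEN"},
    "HLNUG": {"W": "WASSERSTAND ROHDATEN", "Q": "ABFLUSS"},
}

# Hypothetical lookup from station to operator group.
STATION_OPERATOR = {
    "Worms": "WSA", "Nierstein": "WSA", "Mainz": "WSA",
    "Speyer": "WSA", "Mannheim": "WSA", "Raunheim": "HLNUG",
}


def series_name(station: str, parameter: str) -> str:
    """Return the archive name for a station's W or Q series."""
    return SERIES_NAMES[STATION_OPERATOR[station]][parameter]


def mez_to_utc(naive_local: str) -> datetime:
    """Interpret a naive timestamp as MEZ and convert it to UTC."""
    local = datetime.fromisoformat(naive_local).replace(tzinfo=MEZ)
    return local.astimezone(timezone.utc)


print(series_name("Raunheim", "Q"))
print(mez_to_utc("2021-07-15T12:00:00").isoformat())
```

Using a fixed offset rather than a named zone such as Europe/Berlin matters here: a DST-aware zone would shift summer values by one hour.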

Backfill state (2026-05-03): 2000-01-01 to 2026-05-03 for 11 series, with up to ~561 k + ~362 k points each, giving ~8.6 M data points in rhein_raw, tag source=pegelonline_archive.

Note on Mannheim: Has data only from ~2010 (~146 k points in the 2000-2016 range, vs ~561 k for Worms in the same range). Pre-2010 gap analysis required for trend studies.
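A gap analysis of this kind can start from a simple scan for oversized steps between consecutive timestamps. The following sketch is illustrative only: find_gaps is a hypothetical helper, the toy data is invented, and the appropriate expected_step depends on the resolution of the series being checked.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_step):
    """Return (last_before_gap, first_after_gap) pairs.

    A gap is any pair of consecutive timestamps that lie further apart
    than expected_step.
    """
    ordered = sorted(timestamps)
    return [
        (prev, cur)
        for prev, cur in zip(ordered, ordered[1:])
        if cur - prev > expected_step
    ]


# Toy series: daily values with 4 to 6 January missing.
days = [datetime(2009, 1, d) for d in (1, 2, 3, 7, 8)]
gaps = find_gaps(days, timedelta(days=1))
print(gaps)
```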

3. Precipitation observations + ICON forecasts — DWD via Brightsky

Source: Brightsky — community-maintained REST API wrapper around DWD CDC and DWD OpenData.

Coverage: Hourly precipitation and temperature observations for 3 weather stations along the Black Forest crest (the runoff source for the Upper Rhine flood waves):
- Karlsruhe (49.04°N / 8.40°E)
- Freudenstadt (48.46°N / 8.41°E)
- Triberg (48.13°N / 8.23°E)

ICON forecasts: Hourly DWD-ICON-D2 forecasts for 0-72 h lead time, retrieved alongside the observations.

Polling cadence: Hourly (brightsky_poller.py, cron 30 * * * *).
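An hourly poll against Brightsky amounts to one request per station. The sketch below only builds the request URLs and is not taken from brightsky_poller.py: it assumes the public Bright Sky /weather endpoint with its lat, lon, date and last_date query parameters, and the helper name is hypothetical.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed public Bright Sky endpoint for hourly weather records.
BRIGHTSKY_WEATHER = "https://api.brightsky.dev/weather"

# Station coordinates as listed on this page (latitude, longitude).
STATIONS = {
    "Karlsruhe": (49.04, 8.40),
    "Freudenstadt": (48.46, 8.41),
    "Triberg": (48.13, 8.23),
}


def weather_url(lat: float, lon: float, first_day: date, last_day: date) -> str:
    """Build a /weather request URL for one location and date range."""
    query = urlencode({
        "lat": lat,
        "lon": lon,
        "date": first_day.isoformat(),
        "last_date": last_day.isoformat(),
    })
    return f"{BRIGHTSKY_WEATHER}?{query}"


for name, (lat, lon) in STATIONS.items():
    print(name, weather_url(lat, lon, date(2024, 6, 1), date(2024, 6, 2)))
```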

Backfill 2018-2024: 425 k observation points in rhein_raw.weather (parameter precipitation / temperature, source brightsky).

Live forecasts: Stored in rhein_forecast.weather_forecast (DWD-ICON 0-72 h, refreshed hourly).

Use: Accumulated past precipitation (24/48/72 h windows) as ARX features (Phase 7.1). Future precipitation forecasts as ARX features (Phase 7.1b/c).
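The 24/48/72 h precipitation windows can be computed as trailing sums over the hourly series. This is a minimal sketch assuming a gap-free hourly input in mm; the function name and the choice to return None until a full window is available are illustrative, not the project's implementation.

```python
def accumulated_precip(hourly_mm, windows=(24, 48, 72)):
    """Trailing precipitation sums per hour for each window length.

    Returns {window: list}. Entry i is the sum over the `window` hours
    ending at hour i, or None while fewer than `window` hours exist.
    """
    # Prefix sums make each window sum a single subtraction.
    prefix = [0.0]
    for value in hourly_mm:
        prefix.append(prefix[-1] + value)

    result = {}
    for w in windows:
        result[w] = [
            prefix[i + 1] - prefix[i + 1 - w] if i + 1 >= w else None
            for i in range(len(hourly_mm))
        ]
    return result


# Toy input: 72 hours of constant 1 mm/h rainfall.
features = accumulated_precip([1.0] * 72)
print(features[24][23], features[48][71], features[72][71])
```

Real observation series contain missing hours, so a production version would need an explicit rule for windows that include gaps.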

4. Historical forecast archives — ECMWF-IFS via TIGGE (ECDS)

Source: tigge-forecasts on the ECMWF Data Store (ECDS — separate endpoint from CDS).

Why this dataset: An honest forecast-skill evaluation needs real historical forecast archives rather than perfect-foresight substitutes. TIGGE archives the operational ECMWF-IFS forecast as it was run at the time, and the archive can be downloaded after the fact.

Specs:
- Origin: ecmwf (IFS), type=control_forecast (deterministic, single member)
- Variable: total_precipitation in kg m⁻² (= mm)
- Initialisation: 2× per day (00z, 12z)
- Lead steps: 0/6/12/…/72 h
- Spatial resolution: Reduced Gaussian Grid, ~25 km
- 48 h delay between forecast init and availability (irrelevant for backfill, problematic for live)
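The initialisation times, lead steps and availability delay listed above determine which forecast rows exist for a given day and when each could first be used. The sketch below enumerates them; the function and field names are hypothetical and the code does not call the ECMWF Data Store.

```python
from datetime import date, datetime, timedelta, timezone

INIT_HOURS = (0, 12)                     # 00z and 12z runs
LEAD_HOURS = range(0, 73, 6)             # 0, 6, 12, ..., 72 h
AVAILABILITY_DELAY = timedelta(hours=48) # delay after forecast init


def forecast_rows(day: date):
    """List every (init, lead) combination for one calendar day."""
    rows = []
    for hour in INIT_HOURS:
        init = datetime(day.year, day.month, day.day, hour, tzinfo=timezone.utc)
        for lead in LEAD_HOURS:
            rows.append({
                "init_time": init,
                "lead_h": lead,
                "valid_time": init + timedelta(hours=lead),
                "available_from": init + AVAILABILITY_DELAY,
            })
    return rows


rows = forecast_rows(date(2024, 1, 1))
print(len(rows), rows[-1]["valid_time"].isoformat())
```

The available_from field makes the live-use problem concrete: by the time a run can be downloaded, its first 48 lead hours already lie in the past.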

Backfill state (2026-05-06): 84 / 84 months 2018-01..2024-12 successful, ~196 k data points (~30 MB) in rhein.weather_forecast_archive (measurement tigge_forecast, time = init_time, tags station, origin, lead_h).

Use: ARX-v6 training (Phase 7.1c). Honest forecast skill at +72 h lead = 45 % (vs perfect-foresight upper bound 50 %).

5. Push notifications — Pushover

Source: Pushover — paid one-time-purchase mobile push service.

Use: Flood warnings (430 / 500 / 640 cm Worms threshold crossings) with 6 h cooldown per threshold. Not strictly an open-data source but listed for completeness.
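The threshold logic with a per-threshold cooldown can be sketched as a small state machine. This is an illustration under assumptions: only upward crossings trigger an alert, the first reading never alerts, and the class name is hypothetical. Sending the message via Pushover is left out.

```python
from datetime import datetime, timedelta

THRESHOLDS_CM = (430, 500, 640)  # Worms gauge thresholds from this page
COOLDOWN = timedelta(hours=6)    # per-threshold cooldown


class FloodAlerter:
    """Track the last level and the last alert time per threshold."""

    def __init__(self):
        self.last_level = None
        self.last_sent = {}

    def update(self, when: datetime, level_cm: float):
        """Return thresholds crossed upward that are not in cooldown."""
        alerts = []
        for threshold in THRESHOLDS_CM:
            crossed = (
                self.last_level is not None
                and self.last_level < threshold <= level_cm
            )
            cooled = (
                threshold not in self.last_sent
                or when - self.last_sent[threshold] >= COOLDOWN
            )
            if crossed and cooled:
                alerts.append(threshold)
                self.last_sent[threshold] = when
        self.last_level = level_cm
        return alerts


alerter = FloodAlerter()
t0 = datetime(2026, 1, 1, 0, 0)
print(alerter.update(t0, 420))
print(alerter.update(t0 + timedelta(minutes=15), 435))
```

Keeping the cooldown per threshold means a level oscillating around 430 cm does not suppress a later alert at 500 cm.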

Storage architecture (InfluxDB v2.7)

| Bucket | Retention | Content |
|---|---|---|
| rhein_raw | infinite | All raw observations (gauge, weather) |
| rhein_meta | infinite | Station metadata, PNP heights, characteristic values (MNW/MW/MHW) |
| rhein_derived | 10 years | Derived quantities (NHN, slope, CCF, ACF, Muskingum, ARX summaries, segment routing) |
| rhein_forecast | 2 years | Live forecasts (Brightsky weather, ARX v1-v6 forecasts, segment forecasts) |
| weather_forecast_archive | infinite | TIGGE forecast archive 2018-2024 |

Total ~10 million points across all buckets as of 2026-05-06.

Reproducibility

Every figure, backtest result, and forecast value can be reproduced from the publicly accessible sources above. The complete pipeline source code lives in /opt/rhein/poller/ on the Hetzner server. The container image (rhein-poller) is rebuilt from a small Dockerfile that lists only standard scientific Python libraries (numpy, pandas, scipy, statsmodels, scikit-learn, cdsapi, xarray, cfgrib, influxdb-client).

For scientific methodology and formula details, see Scientific methodology.