Data Pipeline & Sources
This page documents the complete data pipeline of the Rhine-Gauge Citizen-Science project. All data sources are public and freely accessible — the project is reproducible end-to-end without paid services.
1. Live water-level and discharge: WSV PEGELONLINE
Source: PEGELONLINE of the Wasserstraßen- und Schifffahrtsverwaltung des Bundes (WSV — German Federal Waterways and Shipping Administration).
Coverage: Real-time water level (W) and discharge (Q) for 11 stations between Iffezheim and Mainz, including Mannheim-Neckar and Raunheim-Main.
API: REST/JSON, free, no API key required.
Polling cadence: Every 15 minutes — driven by poller.py running on the Hetzner server.
Storage: Bucket rhein_raw, measurement pegel, tags station, parameter (W|Q), field value. Roughly 1 million points/year live, infinite retention.
Source attribution: “Datenquelle: Pegelonline der WSV / Federal Waterways and Shipping Administration”
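As an illustration of the polling and storage schema described above, here is a minimal sketch of one polling step. It is not the project's poller.py. It assumes the public PEGELONLINE REST v2 `currentmeasurement.json` endpoint, a response carrying `timestamp` and `value` keys, and a station short name such as `WORMS`; all function names are hypothetical.

```python
import json
from datetime import datetime
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the public PEGELONLINE REST API (v2).
BASE_URL = "https://www.pegelonline.wsv.de/webservices/rest-api/v2"


def fetch_current(station: str, parameter: str = "W") -> dict:
    """Fetch the latest measurement (assumed keys: 'timestamp', 'value')."""
    url = f"{BASE_URL}/stations/{quote(station)}/{parameter}/currentmeasurement.json"
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def escape_tag(value: str) -> str:
    """Escape commas, equals signs and spaces in a line-protocol tag value."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(station: str, parameter: str, measurement: dict) -> str:
    """Render one point for measurement 'pegel' with tags station/parameter."""
    ts = datetime.fromisoformat(measurement["timestamp"])  # offset-aware ISO string
    ns = int(ts.timestamp()) * 1_000_000_000  # InfluxDB default precision: ns
    value = float(measurement["value"])
    return (
        f"pegel,station={escape_tag(station)},parameter={parameter} "
        f"value={value} {ns}"
    )
```

A poller would call `fetch_current` for each station and parameter every 15 minutes and write the resulting lines to the rhein_raw bucket.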
2. Historical archive 2000–2026: PEGELONLINE
Source: PEGELONLINE archive endpoint (discovered 2026-05-03):
- POST https://www.pegelonline.wsv.de/gast/historische-zeitreihen/prepare-download
- Body: uuid, parameter, start (ISO-UTC), end (ISO-UTC), format=csv|json
- Response: 303 with Location: header pointing to a ZIP download URL
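The request described above could be issued roughly as follows. This is a sketch under assumptions: the source does not state how the body is encoded, so form encoding is assumed here, and the function names and the example `uuid` are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

PREPARE_URL = (
    "https://www.pegelonline.wsv.de/gast/historische-zeitreihen/prepare-download"
)


def build_prepare_request(uuid: str, parameter: str, start: str, end: str,
                          fmt: str = "csv") -> Request:
    """Build the POST request; form encoding of the body is an assumption."""
    body = urlencode({
        "uuid": uuid,
        "parameter": parameter,
        "start": start,   # ISO-UTC, e.g. 2000-01-01T00:00:00Z
        "end": end,       # ISO-UTC
        "format": fmt,    # csv or json
    }).encode("ascii")
    return Request(PREPARE_URL, data=body, method="POST")


def download_zip(request: Request, target_path: str) -> None:
    """Send the request and store the ZIP.

    urllib follows the 303 redirect with a GET to the Location URL,
    so the response body read here is already the ZIP archive.
    """
    with urlopen(request, timeout=120) as resp, open(target_path, "wb") as fh:
        fh.write(resp.read())
```

Separating request construction from the download keeps the first part testable without network access.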
Coverage: Daily resolution back to 2000-01-01 for all PEGELONLINE stations.
Format quirks:
- WSA stations (Worms, Nierstein, Mainz, Speyer, Mannheim): WASSERSTAND ROHDATEN (with space!) and ABFLUSS_ROHDATEN
- HLNUG stations (Raunheim): WASSERSTAND ROHDATEN, but Q is ABFLUSS (without ROHDATEN)
- All values in MEZ (Central European Time, year-round, no DST)
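A small sketch of how these quirks might be handled when parsing an archive file. The column-to-parameter mapping follows the names listed above; the timestamp text format and the helper names are assumptions.

```python
from datetime import datetime, timedelta, timezone

# MEZ is a fixed UTC+01:00 offset all year; a DST-aware zone such as
# Europe/Berlin would shift summer values by one hour.
MEZ = timezone(timedelta(hours=1), name="MEZ")

# Column headers as listed above, mapped to the parameter tag.
COLUMN_TO_PARAMETER = {
    "WASSERSTAND ROHDATEN": "W",  # note the space
    "ABFLUSS_ROHDATEN": "Q",      # WSA stations
    "ABFLUSS": "Q",               # HLNUG stations (Raunheim)
}


def parameter_for_column(column: str) -> str:
    """Return 'W' or 'Q' for a known archive column header."""
    return COLUMN_TO_PARAMETER[column.strip()]


def mez_to_utc(local_text: str, fmt: str = "%Y-%m-%d %H:%M") -> datetime:
    """Interpret a naive timestamp as MEZ and convert it to UTC.

    The default text format is an assumption, not taken from the archive.
    """
    naive = datetime.strptime(local_text, fmt)
    return naive.replace(tzinfo=MEZ).astimezone(timezone.utc)
```

Using a fixed offset rather than a named time zone is the important design choice here, since the archive does not observe DST.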
Backfill state (2026-05-03): 2000-01-01 to 2026-05-03 for 11 series with up to ~561 k + ~362 k points each, ~8.6 M data points in total in rhein_raw, tag source=pegelonline_archive.
Note on Mannheim: This station has data only from ~2010 (~146 k points in the 2000-2016 range, vs ~561 k for Worms in the same range). Trend studies require a gap analysis for the pre-2010 period.
3. Precipitation observations + ICON forecasts: DWD via Brightsky
Source: Brightsky — community-maintained REST API wrapper around DWD CDC and DWD OpenData.
Coverage: Hourly precipitation and temperature observations for 3 weather stations along the Black Forest crest (the runoff source for the Upper Rhine flood waves):
- Karlsruhe (49.04°N / 8.40°E)
- Freudenstadt (48.46°N / 8.41°E)
- Triberg (48.13°N / 8.23°E)
ICON forecasts: Hourly DWD-ICON-D2 forecasts for 0-72 h lead time, retrieved alongside the observations.
Polling cadence: Hourly (brightsky_poller.py, cron 30 * * * *).
Backfill 2018-2024: 425 k observation points in rhein_raw.weather (parameter precipitation / temperature, source brightsky).
Live forecasts: Stored in rhein_forecast.weather_forecast (DWD-ICON 0-72 h, refreshed hourly).
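For orientation, a query against Brightsky for one of the three stations might be built like this. The sketch assumes the public `https://api.brightsky.dev/weather` endpoint with `lat`, `lon`, `date` and `last_date` parameters; the helper name is hypothetical and this is not the project's brightsky_poller.py.

```python
from urllib.parse import urlencode

# Assumed public Brightsky endpoint.
BRIGHTSKY_URL = "https://api.brightsky.dev/weather"

# Station coordinates as listed in the coverage section (lat, lon).
STATIONS = {
    "Karlsruhe": (49.04, 8.40),
    "Freudenstadt": (48.46, 8.41),
    "Triberg": (48.13, 8.23),
}


def weather_url(station: str, date: str, last_date: str) -> str:
    """Build the request URL for hourly records between date and last_date."""
    lat, lon = STATIONS[station]
    query = urlencode({
        "lat": lat,
        "lon": lon,
        "date": date,            # first day, e.g. 2024-01-01
        "last_date": last_date,  # end of the window
    })
    return f"{BRIGHTSKY_URL}?{query}"
```

An hourly cron job would request a short window per station and store the precipitation and temperature values from the response.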
Use: Accumulated past precipitation (24/48/72 h windows) as ARX features (Phase 7.1). Future precipitation forecasts as ARX features (Phase 7.1b/c).
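The windowed precipitation features can be illustrated with a trailing sum over an hourly series. This is a sketch only: the function name is hypothetical, the input is assumed to be gap-free hourly values in mm, and the project's actual feature code may differ.

```python
from collections import deque


def trailing_sums(hourly_mm, windows=(24, 48, 72)):
    """Return, per window length, the trailing precipitation sum for each hour.

    Each sum covers the last `w` hours including the current one.
    Hours without a full window yet are reported as None.
    """
    features = {}
    for w in windows:
        window = deque()
        total = 0.0
        sums = []
        for value in hourly_mm:
            window.append(value)
            total += value
            if len(window) > w:
                total -= window.popleft()  # drop the hour that left the window
            sums.append(total if len(window) == w else None)
        features[w] = sums
    return features


if __name__ == "__main__":
    demo = trailing_sums([1.0] * 100)
    print(demo[24][23], demo[48][47], demo[72][71])
```

Returning `None` for incomplete windows avoids feeding partial sums into the regression at the start of a series.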
4. Historical forecast archives: ECMWF-IFS via TIGGE (ECDS)
Source: tigge-forecasts on the ECMWF Data Store (ECDS — separate endpoint from CDS).
Why this dataset: Honest forecast-skill evaluation needs real historical forecast archives rather than perfect-foresight substitutes. The TIGGE archive provides the operational ECMWF-IFS forecasts as run at the time, downloadable post hoc.
Specs:
- Origin: ecmwf (IFS), type=control_forecast (deterministic, single member)
- Variable: total_precipitation in kg m⁻² (= mm)
- Initialisation: 2× per day (00z, 12z)
- Lead steps: 0/6/12/…/72 h
- Spatial resolution: Reduced Gaussian Grid, ~25 km
- 48 h delay between forecast init and availability (irrelevant for backfill, problematic for live)
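The timing implied by these specs can be sketched as follows: which run is the newest one already available at a given moment, and which valid times it covers. The helper names are hypothetical, and the sketch assumes the 48 h delay is counted from the init time.

```python
from datetime import datetime, timedelta, timezone

LEAD_STEPS_H = list(range(0, 73, 6))       # 0, 6, ..., 72 h
AVAILABILITY_DELAY = timedelta(hours=48)   # assumed to count from init time


def latest_available_init(now: datetime) -> datetime:
    """Return the newest 00z/12z init time whose run is already available."""
    cutoff = now.astimezone(timezone.utc) - AVAILABILITY_DELAY
    hour = 12 if cutoff.hour >= 12 else 0
    return cutoff.replace(hour=hour, minute=0, second=0, microsecond=0)


def valid_times(init: datetime) -> list:
    """Return the valid time of every lead step of one run."""
    return [init + timedelta(hours=h) for h in LEAD_STEPS_H]
```

Under these assumptions the newest usable run reaches at most 24 h past the current time, which is why the delay matters for live operation but not for backfill.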
Backfill state (2026-05-06): 84 / 84 months 2018-01..2024-12 successful, ~196 k data points (~30 MB) in rhein.weather_forecast_archive (measurement tigge_forecast, time = init_time, tags station, origin, lead_h).
Use: ARX-v6 training (Phase 7.1c). Honest forecast skill at +72 h lead = 45 % (vs perfect-foresight upper bound 50 %).
5. Push notifications: Pushover
Source: Pushover — paid one-time-purchase mobile push service.
Use: Flood warnings (430 / 500 / 640 cm Worms threshold crossings) with 6 h cooldown per threshold. Not strictly an open-data source but listed for completeness.
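The threshold logic with a per-threshold cooldown could look roughly like this. It is a sketch, not the project's implementation: the class name is hypothetical, and it assumes that only upward crossings trigger a warning.

```python
from datetime import datetime, timedelta

THRESHOLDS_CM = (430, 500, 640)   # Worms gauge thresholds
COOLDOWN = timedelta(hours=6)     # per threshold


class ThresholdAlerter:
    """Track upward threshold crossings with a cooldown per threshold."""

    def __init__(self):
        self._last_sent = {}          # threshold -> time of last warning
        self._previous_level = None   # level of the previous reading

    def update(self, when: datetime, level_cm: float) -> list:
        """Process one reading and return the thresholds that should alert."""
        fired = []
        for threshold in THRESHOLDS_CM:
            crossed = (
                self._previous_level is not None
                and self._previous_level < threshold <= level_cm
            )
            last = self._last_sent.get(threshold)
            cooled_down = last is None or when - last >= COOLDOWN
            if crossed and cooled_down:
                self._last_sent[threshold] = when
                fired.append(threshold)
        self._previous_level = level_cm
        return fired
```

Each returned threshold would then be turned into one push message. Keeping the cooldown per threshold means a level oscillating around 430 cm does not suppress a later 500 cm warning.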
Storage architecture (InfluxDB v2.7)
| Bucket | Retention | Content |
|---|---|---|
| rhein_raw | infinite | All raw observations (gauge, weather) |
| rhein_meta | infinite | Station metadata, PNP heights, characteristic values (MNW/MW/MHW) |
| rhein_derived | 10 years | Derived quantities (NHN, slope, CCF, ACF, Muskingum, ARX summaries, segment routing) |
| rhein_forecast | 2 years | Live forecasts (Brightsky weather, ARX v1-v6 forecasts, segment forecasts) |
| weather_forecast_archive | infinite | TIGGE reforecast archive 2018-2024 |
Total ~10 million points across all buckets as of 2026-05-06.
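To show how the gauge schema (measurement pegel, tags station and parameter, field value) maps onto a query, here is a small Flux query builder. The function name and the station tag value `WORMS` are assumptions; the resulting string could be passed to a Flux client such as the `query_api()` of influxdb-client.

```python
def pegel_query(station: str, parameter: str = "W", start: str = "-24h",
                bucket: str = "rhein_raw") -> str:
    """Build a Flux query for one gauge series in the raw bucket."""
    if parameter not in ("W", "Q"):
        raise ValueError("parameter must be 'W' or 'Q'")
    return (
        f'from(bucket: "{bucket}")\n'
        f'  |> range(start: {start})\n'
        f'  |> filter(fn: (r) => r._measurement == "pegel")\n'
        f'  |> filter(fn: (r) => r.station == "{station}" '
        f'and r.parameter == "{parameter}")\n'
        f'  |> filter(fn: (r) => r._field == "value")'
    )


if __name__ == "__main__":
    # Water level at one station over the last 24 hours.
    print(pegel_query("WORMS"))
```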
Reproducibility
Every figure, backtest result, and forecast value can be reproduced from the publicly accessible sources above. The complete pipeline source code lives in /opt/rhein/poller/ on the Hetzner server. The container image (rhein-poller) is rebuilt from a small Dockerfile listing only standard scientific Python libraries (numpy, pandas, scipy, statsmodels, scikit-learn, cdsapi, xarray, cfgrib, influxdb-client).
For scientific methodology and formula details, see Scientific methodology.