Deep Learning for Rainfall-Runoff Uncertainty
Abstract. Deep learning is becoming an increasingly important way to produce accurate hydrological predictions across a wide range of spatial and temporal scales. Uncertainty estimations are critical for actionable hydrological prediction, and while standardized community benchmarks are becoming an increasingly important part of hydrological model development and research, similar tools for benchmarking uncertainty estimation are lacking. This contribution demonstrates that accurate uncertainty predictions can be obtained with deep learning. We establish an uncertainty estimation benchmarking procedure and present four deep learning baselines. Three baselines are based on mixture density networks, and one is based on Monte Carlo dropout. The results indicate that these approaches constitute strong baselines, especially the former ones. Additionally, we provide a post hoc model analysis to put forward some qualitative understanding of the resulting models. The analysis extends the notion of performance and shows that the model learns nuanced behaviors to account for different situations.

1 Introduction

A growing body of empirical results shows that data-driven models perform well in a variety of environmental modeling tasks (e.g., Hsu et al., 1995; Govindaraju, 2000; Abramowitz, 2005; Best et al., 2015; Nearing et al., 2016, 2018). Specifically for rainfall–runoff modeling, approaches based on long short-term memory (LSTM; Hochreiter, 1991; Hochreiter and Schmidhuber, 1997; Gers et al., 1999) networks have been especially effective (e.g., Kratzert et al., 2019a, b, 2021).

The majority of machine learning (ML) and deep learning (DL) rainfall–runoff studies do not provide uncertainty estimates (e.g., Hsu et al., 1995; Kratzert et al., 2019b, 2021; Liu et al., 2020; Feng et al., 2020). However, uncertainty is inherent in all aspects of hydrological modeling, and it is generally accepted that our predictions should account for this (Beven, 2016). The hydrological sciences community has put substantial effort into developing methods for providing uncertainty estimations around traditional models, and similar effort is necessary for DL models like LSTMs.

Currently there exists no single, prevailing method for obtaining distributional rainfall–runoff predictions. Many, if not most, methods take a basic approach where a deterministic model is augmented with some uncertainty estimation strategy. This includes, for example, ensemble-based methods, where the idea is to define and sample probability distributions around different model inputs and/or structures (e.g., Li et al., 2017; Demargne et al., 2014; Clark et al., 2016), but it also comprises Bayesian (e.g., Kavetski et al., 2006) or pseudo-Bayesian (e.g., Beven and Binley, 2014) methods and post-processing methods (e.g., Shrestha and Solomatine, 2008; Montanari and Koutsoyiannis, 2012). In other words, most classical rainfall–runoff models do not provide direct estimates of their own predictive uncertainty; instead, such models are used as a part of a larger framework. There are some exceptions to this, for example, methods based on stochastic partial differential equations, which actually use stochastic models but generally require one to assign sampling distributions a priori (e.g., a Wiener process). These are common, for example, in hydrologic data assimilation (e.g., Reichle et al., 2002). The problem with these types of approaches is that any distribution that we could possibly assign is necessarily degenerate, resulting in well-known errors and biases in estimating uncertainty (Beven et al., 2008).

It is possible to fit DL models such that their own representations intrinsically support estimating distributions while accounting for strongly nonlinear interactions between model inputs and outputs. In this case, there is no requirement to fall back on deterministic predictions that would need to be sampled, perturbed, or inverted. Several approaches to uncertainty estimation for DL have been suggested (e.g., Bishop, 1994; Blundell et al., 2015; Gal and Ghahramani, 2016). Some of them have been used in the hydrological context. For example, Zhu et al. (2020) tested two strategies for using an LSTM in combination with Gaussian processes for drought forecasting. In one strategy, the LSTM was used to parameterize a Gaussian process, and in the second strategy, the LSTM was used as a forecast model with a Gaussian process post-processor. Gal and Ghahramani (2016) showed that Monte Carlo dropout (MCD) can be used to intrinsically approximate Gaussian processes with LSTMs, so it is an open question as to whether explicitly representing the Gaussian process is strictly necessary. Althoff et al. (2021) examined the use of MCD for an LSTM-based model of a single river basin and compared its performance with that of an ensemble approach. They report that MCD had uncertainty bands that are more reliable and wider than the ensemble counterparts. This finding contradicts the preliminary results of Klotz et al. (2019) and observations from other domains (e.g., Ovadia et al., 2019; Fort et al., 2019). It is therefore not yet clear whether the results of Althoff et al. (2021) are confined to their setup and data. Be that as it may, this still is evidence of the potential capabilities of MCD. A further use case was examined by Fang et al. (2019), who used MCD for soil moisture modeling. They observed a tendency in MCD to underestimate uncertainties. To compensate, they tested an MCD extension, proposed by Kendall and Gal (2017), the core idea of which is to add an estimation for the aleatoric uncertainty by using a Gaussian noise term. They report that this combination was more effective at representing uncertainty.

Our primary goal is to benchmark several methods for uncertainty estimation in rainfall–runoff modeling with DL. We demonstrate that DL models can produce statistically reliable uncertainty estimates using approaches that are straightforward to implement. We adapted the LSTM rainfall–runoff models developed by Kratzert et al. (2019b, 2021) with four different approaches to make distributional predictions. Three of these approaches use neural networks to create and mix probability distributions (Sect. 2.3.1). The fourth is MCD, which is based on direct sampling from the LSTM (Sect. 2.3.2).

Our secondary objective is to help advance the state of community model benchmarking to include uncertainty estimation. We want to do so by outlining a basic skeleton for an uncertainty-centered benchmarking procedure. The reason for this is that it was difficult to find suitable benchmarks for the DL uncertainty estimation approaches we want to explore. Ad hoc benchmarking and model intercomparison studies are common (e.g., Andréassian et al., 2009; Best et al., 2015; Kratzert et al., 2019b; Lane et al., 2019; Berthet et al., 2020; Nearing et al., 2018), and, while the community has large-sample datasets for benchmarking hydrological models (Newman et al., 2017; Kratzert et al., 2019b), we lack standardized, open procedures for conducting comparative uncertainty estimation studies. For example, from the given references, only Berthet et al. (2020) focused on benchmarking uncertainty estimation strategies, and then only for assessing post-processing approaches. Furthermore, it has previously been argued that data-based models provide a meaningful and general benchmark for testing hypotheses and models (Nearing and Gupta, 2015; Nearing et al., 2020b). Thus, here we examine a data-based uncertainty estimation benchmark built on a standard, publicly available, large-sample dataset that could be used as a baseline for future benchmarking studies.

2 Data and methods

To carve out a skeleton for a benchmarking procedure, we followed the philosophy outlined by Nearing et al. (2018). According to the principles therein, the requirements for a suitable, standardized benchmark are (i) that the benchmark uses a community-standard dataset that is publicly available, (ii) that the model or method is applied in a way that conforms to the standards of practice for that dataset (e.g., standard train/test splits), and (iii) that the results of the standardized benchmark runs are publicly available. To these, we added a fourth point, a post hoc model examination step which aims at exposing the intrinsic properties of the model. Although examination is important – especially for ML approaches and imperfect approximations – we do not view it as a requirement for benchmarking in general.

Nonetheless, we believe that good benchmarking is not something that can be done in a responsible way by a single contribution (unless it is the outcome of a larger effort in itself, e.g., Best et al., 2015; Kratzert et al., 2019b). In general, however, it will require a community-based effort. If no benchmarking effort is established yet, one would ideally start with a set of self-contained baselines and openly share settings, data, models, and metrics. Then, over time, a community can establish itself and improve, replace, or add to them. In the best case, everyone runs the model or approach that they know best, and results are compared at a community level.
2.2.1 Reliability
Table 1. Overview of the benchmarking metrics for assessing model resolution. Each metric is applied to the distributional streamflow predictions at each individual time step and then aggregated over all time steps and basins. All metrics are defined in the interval [0, ∞), and lower values are preferable (though not unconditionally, since the reliability must also be taken into account).
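The metric definitions themselves are not reproduced in this excerpt. As an illustrative stand-in for a dispersion-style resolution statistic (an assumption for demonstration, not one of the table's exact definitions), the width of a central interval of the sampled predictive distribution can be computed per time step and then averaged:

```python
import statistics

def central_interval_width(samples, lower=0.05, upper=0.95):
    """Width of the central interval of a sampled predictive distribution."""
    s = sorted(samples)
    lo = s[int(lower * (len(s) - 1))]
    hi = s[int(upper * (len(s) - 1))]
    return hi - lo

def mean_dispersion(samples_per_step):
    """Aggregate the per-time-step dispersion by averaging over time steps."""
    return statistics.mean(central_interval_width(s) for s in samples_per_step)

# Two time steps with sampled streamflow predictions (hypothetical values).
steps = [[1.0, 1.2, 1.4, 1.6, 1.8], [2.0, 2.5, 3.0, 3.5, 4.0]]
print(mean_dispersion(steps))
```

Smaller average widths indicate sharper predictions, but, as the caption notes, sharpness is only meaningful once reliability has been established.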
Figure 4. Illustration of a mixture density network. The core idea is to use the outputs of a neural network to determine the mixture weights
and parameters of a mixture of densities (see Fig. 3). That is, for a given input, the network determines a conditional density function, which
it builds by mixing a set of predefined base densities (the so-called components).
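The construction described in the caption above can be sketched in a few lines of plain Python. The names are illustrative, and Gaussian components are used as in the GMM variant; the paper's actual models are LSTM-based and implemented in PyTorch:

```python
import math

def softmax(xs):
    """Map raw network outputs to mixture weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(y, raw_weights, mus, raw_sigmas):
    """Evaluate the conditional density p(y | x) built from network outputs.

    raw_weights -> softmax -> mixture weights alpha_i
    raw_sigmas  -> exp     -> strictly positive scales
    """
    alphas = softmax(raw_weights)
    sigmas = [math.exp(s) for s in raw_sigmas]
    return sum(a * gaussian_pdf(y, m, s) for a, m, s in zip(alphas, mus, sigmas))

def nll(y_obs, raw_weights, mus, raw_sigmas):
    """Negative log-likelihood of one observation under the predicted mixture."""
    return -math.log(mixture_density(y_obs, raw_weights, mus, raw_sigmas))
```

Training minimizes `nll` over the observations, which corresponds to the log-likelihood maximization described in the text; for each new input, the network emits fresh `raw_weights`, `mus`, and `raw_sigmas`, so the predicted density varies over time.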
work receives new inputs (i.e., in our case for every time step). We thus obtain time-varying predictive distributions that can approximate a large variety of distributions (they can, for example, account for asymmetric and multimodal properties). The resulting model is trained by maximizing the log-likelihood function of the observations according to the predicted mixture distributions. We view MDNs as intrinsically distributional in the sense that they provide probability distributions instead of first making deterministic streamflow estimates and then appending a sampling distribution.

In this study, we tested three different MDN approaches.

1. Gaussian mixture models (GMMs) are MDNs with Gaussian mixture components. Appendix B1 provides a more formal definition as well as details on the loss/objective function.

2. Countable mixtures of asymmetric Laplacians (CMAL) are similar to GMMs, but instead of Gaussians, the mixture components are asymmetric Laplacian distributions (ALDs). This allows for an intrinsic representation of the asymmetric uncertainties that often occur with hydrological variables like streamflow. Appendix B2 provides a more formal description as well as details on the loss/objective function.

3. Uncountable mixtures of asymmetric Laplacians (UMAL) also use asymmetric Laplacians as mixture components, but the mixture is not discretized. Instead, UMAL approximates the conditional density by using Monte Carlo integration over distributions obtained from quantile regression (Brando et al., 2019). Appendix B3 provides a more formal description as well as details on the loss/objective function.

One can read this enumeration as a transition from simple to complex: we start with Gaussian mixture components, then replace them with ALD mixture components, and lastly transition from a fixed number of mixture components to an implicit approximation. There are two reasons why we argue that the more complex MDN methods might be more promising than a simple GMM. First, error distributions in hydrologic simulations often have heavy tails. A Laplacian component lends itself to thicker-tailed uncertainty (Fig. 5). Second, streamflow uncertainty is often asymmetrical, and thus the ALD component could make more sense than a symmetric distribution in this application. For example, even a single ALD component can be used to account for zero flows (compare Fig. 5b). UMAL extends this to avoid having to pre-specify the number of mixture components, which removes one of the more subjective degrees of freedom from the model design.

2.3.2 Monte Carlo dropout

MCD provides an approach to estimate a basic form of epistemic uncertainty. In the following we provide the intuition behind its application.

Dropout is a regularization technique for neural networks but can also be used for uncertainty estimation (Gal and Ghahramani, 2016). Dropout randomly ignores specific network units (see Fig. 6). Hence, each time the model is evaluated during training, the network structure is different. Repeating this procedure many times results in an ensemble of many submodels within the network. Dropout regularization is used during training, while during the model evaluation the whole neural network is used. Gal and Ghahramani (2016)
Figure 7. Schemata of the general setup. Vertically the procedure is illustrated for two arbitrary basins, m and n, and horizontally the
corresponding time steps are depicted. In total we have 531 basins with approximately 3650 data points in time each, and for each time
step we compute 7500 samples. In the case of MCD we achieve this by directly sampling from the model. In the case of the MDNs we first
estimate a conditional distribution and then sample from it. The “clipping” sign emphasizes our choice to set samples that would be below
zero back to the zero-runoff baseline.
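The per-time-step sampling and clipping described in the caption can be sketched as follows; the Gaussian `draw` callable is a hypothetical stand-in for one stochastic forward pass (MCD) or one draw from the estimated conditional density (MDNs):

```python
import random

def sample_streamflow(n_samples, draw, seed=0):
    """Draw n_samples from a predictive distribution, clipping negatives to zero.

    `draw` is any callable returning one sample of the conditional
    distribution; the max(0, .) step implements the "clipping" choice of
    setting below-zero samples back to the zero-runoff baseline.
    """
    rng = random.Random(seed)
    return [max(0.0, draw(rng)) for _ in range(n_samples)]

# Stand-in predictive distribution: a Gaussian that can produce negative values.
samples = sample_streamflow(7500, lambda rng: rng.gauss(0.5, 1.0))
assert min(samples) >= 0.0  # clipping enforces the zero-runoff baseline
```

In the paper's setup this would be repeated for every time step and basin; 7500 samples per step matches the number stated in the caption.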
2.4 Post hoc model examination: checking model behavior

We performed a post hoc model examination as a complement to the benchmarking to avoid potential blind spots. The analysis has three parts, each one associated with a specific property.

1. Accuracy: how accurate are single-point predictions obtained from the distributional predictions?

2. Internal consistency: how are the mixture components used with regard to flow conditions?

3. Estimation quality: how can we examine the properties of the distributional predictions with regard to second-order uncertainties?

2.4.1 Accuracy: single-point predictions

To address accuracy, we used standard performance metrics applied to single-point predictions (such as the Nash–Sutcliffe efficiency, NSE, and the Kling–Gupta efficiency, KGE; Table 2). The term single-point predictions is used here in the statistical sense of a point estimator to distinguish it from distributional predictions. Single-point predictions were derived as the mean of the distributional predictions at each time step and evaluated by aggregating over the different basins, using the mean and median as aggregation operators (as in Kratzert et al., 2019b). Section 3.2.1 discusses the outcomes of this test as part of the post hoc model examination.

2.4.2 Internal consistency: mixture component behavior

To get an impression of the model consistency, we looked at the behavioral properties of the mixture densities themselves. The goal was to get some qualitative understanding about how the mixture components are used in different situations. As a prototypical example of this kind of examination, we refer to the study of Ellefsen et al. (2019). It examined how LSTMs use the mixture weights to predict the future within a simple game setting. Similarly, Nearing et al. (2020a) re-
Sutcliffe efficiency, NSE, and the Kling–Gupta efficiency, a simple game setting. Similarly, Nearing et al. (2020a) re-
Table 2. Overview of the different single-point prediction performance metrics. The table is adapted from Kratzert et al. (2021).
Table 3. Benchmark statistics for model precision. These metrics were applied to the distributional predictions at individual time steps. The
lowest metric per row is marked in bold. Lower values are better for all statistics (conditional on the model having high reliability). This table
also provides statistics of the empirical distribution from the observations (“Obs”) aggregated over the basins as a reference, which are not
directly comparable with the model statistics since “Obs” represents an unconditional density, while the models provide a conditional one.
The “Obs” statistics should be used as a reference to contextualize the statistics from the modeled distributions.
3.2.1 Accuracy
Figure 9. Kernel densities of the basin-wise deviation from the 1 : 1 line in the probability plot for the different inner quantiles. These
distributions result from evaluating the performance at each basin individually (rather than aggregating over basins). Note how the bounded
domain of the probability plot induces a bias for the outer thresholds as the deviations cannot expand beyond the [0, 1] interval.
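The deviation shown in Figure 9 can be illustrated as the gap between an interval's nominal probability and the fraction of observations it actually captures. This is a hedged reconstruction of the idea, not the paper's evaluation code:

```python
def empirical_coverage(obs, lower_bounds, upper_bounds):
    """Fraction of observations falling inside the predicted interval."""
    hits = sum(1 for o, lo, hi in zip(obs, lower_bounds, upper_bounds) if lo <= o <= hi)
    return hits / len(obs)

def coverage_deviation(obs, lower_bounds, upper_bounds, nominal):
    """Deviation from the 1:1 line: empirical minus nominal coverage."""
    return empirical_coverage(obs, lower_bounds, upper_bounds) - nominal

# Four observations against a nominal 50 % inner quantile range (hypothetical).
obs = [1.0, 2.0, 3.0, 4.0]
lo = [0.5, 1.8, 3.5, 3.0]
hi = [1.5, 2.2, 4.5, 5.0]
print(coverage_deviation(obs, lo, hi, nominal=0.5))
```

Computing this per basin, as in Figure 9, yields the basin-wise deviation distributions; note that for outer thresholds the deviation is bounded, which is the bias the caption mentions.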
timations (i.e., simply taking the mean), both CMAL and UMAL manage to outperform the model that is optimized for single-point predictions with regard to some metrics. This suggests that it could make sense to train a model to estimate distributions and then recover the best estimates. One possible reason why this might be the case is that single-point loss functions (e.g., MSE) define an implicit probability distribution (e.g., minimizing an MSE loss is equivalent to maximizing a Gaussian likelihood with fixed variance). Hence, using a more nuanced loss function (i.e., one that is the likelihood of a multimodal, asymmetrical, heterogeneous distribution) can improve performance even for the purpose of making non-distributional estimates. In fact, it is reasonable to expect that the results of the MDN approaches can be improved even further by using a more sophisticated strategy for obtaining single-point predictions (e.g., searching for the maximum of the likelihood). The single-point prediction LSTM (LSTMp) outperforms the ALD-based MDNs for tail metrics of the streamflow – that is, for the low (FLV) and high bias (FHV). These are regimes where we would expect the most asymmetric distributions for hydrological reasons, and hence the means of the asymmetric distributions might be a sub-optimal choice.

Table 4. Evaluation of different single-point prediction metrics. Best performance is marked in bold. Information about the inter-basin variability (dispersion) is provided in the form of the standard deviation whenever the mean is used for aggregation and in the form of the distance to the 25 % and 75 % quantiles when the median is used for aggregation.

3.2.2 Internal consistency

Figure 10 summarizes the behavioral patterns of the CMAL mixture components. It depicts an exemplary hydrograph superimposed on the CMAL uncertainty prediction together with the corresponding mixture component weights. The mixture weights always sum to 1. This figure shows that the model seemingly learns to use different mixture components for different parts of the hydrograph. In particular, distributional predictions in low-flow periods (perhaps dominated by base flow) are largely controlled by the first mixture component (as can be seen by the behavior of mixture weight α1 in Fig. 10). Falling limbs of the hydrograph (corresponding roughly to throughflow) are associated with the second mixture component (α2), which is low for both rising limbs and low-flow periods. The third component (α3) mainly controls the rising limbs and the peak runoff but also has some influence throughout the rest of the hydrograph. In effect, CMAL learns to separate the hydrograph into three parts – rising limb, falling limb, and low flows – which correspond to the standard hydrological conceptualization. No explicit knowledge of this particular hydrological conceptualization is provided to the model – it is solely trained to maximize overall likelihood.

Figure 10. (a) Hydrograph of an exemplary event in the test period with both the 5 % to 95 % and 25 % to 75 % quantile ranges. (b) The weights (αi) of the CMAL mixture components for these predictions.

Figure 11. Illustration of second-order uncertainties estimated by using MCD to sample the parameters of the CMAL approach. The upper subplot shows an observed hydrograph and predictive distributions as estimated by CMAL. The lower subplots show the CMAL distributions and distributions from 25 MCD samples of the CMAL model at three selected time steps (indicated by black ovals shown on the hydrograph). The abbreviation “main pred” marks the unperturbed distributional predictions from the CMAL model.

3.2.3 Estimation quality

In this experiment we want to demonstrate an avenue for studying higher-order uncertainties with CMAL. Intuitively, the distributional predictions are estimations themselves and are thus subject to uncertainty, and, since the distributional predictions do already provide estimates for the prediction uncertainty, we can think about the uncertainty regarding parameters and weights of the components as a second-order uncertainty. In theory even higher-order uncertainties can be thought of. Here, as already described in the Methods section, we use MCD on top of the CMAL approach to “stochasticize” the weights and parameters and expose the uncertainty of the estimations. Figure 11 illustrates the procedure: the upper part shows a hydrograph with the 25 %–75 % quantiles and 5 %–95 % quantiles from CMAL. This is the main prediction. The lower plots show kernel density estimates for particular points of the hydrograph (marked in the upper part with black ovals labeled “a”, “b”, and “c” and shown in red in the lower subplots). These three specific points represent different portions of the hydrograph with different predicted distributional shapes and are thus well suited to showcasing the technique. These kernel densities (in red) are superim-
posed with 25 sampled estimations derived after applying MCD on top of the CMAL model (shown in lighter tones behind the first-order estimate). These densities are the MCD-perturbed estimations and thus a gauge for how second-order uncertainties influence the distributional predictions.

3.3 Computational demand

This section gives an overview of the computational demand required to compute the different uncertainty estimations. All of the reported execution times were obtained using an NVIDIA P100 (16 GB RAM) and the PyTorch library (Paszke et al., 2019). A single execution of the CMAL model with a batch size of 256 takes 0.026 s (−0.002/+0.001; here, and in the following, the first number gives the median over 100 model runs, and the two offsets give the deviations of the 10 % quantile and the 90 % quantile, respectively). An execution of the MCD model takes 0.055 s (−0.001/+0.002). The slower execution time of the MCD approach here is explained by its larger hidden size. It used 500 hidden cells, in comparison with the 250 hidden cells of CMAL (see Appendix A).

Generating all the needed samples for the evaluation with MCD and a batch size of 256 would take approximately 36.1 d (since 7500 samples have to be generated for 531 basins and 10 years at a daily resolution). In practice, we could shorten this time to under a week by using considerably larger batch sizes and distributing the computations for different basins over multiple GPUs. In comparison, computing the same number of samples by re-executing the CMAL model would take around 17.4 d. In practice, however, only a single run of the CMAL model is needed, since MDNs provide us with a density estimate from which we can directly sample in a parallel fashion (and without needing to re-execute the model run). Thus, the CMAL model, with a batch size of 256, takes only ∼ 14 h to generate all needed samples.

4 Conclusions and outlook

Our basic benchmarking scheme allowed us to systematically pursue our primary objective – to examine deep learning baselines for uncertainty predictions. In this regard, we gathered further evidence that deep-learning-based uncertainty estimation for rainfall–runoff modeling is a promising research avenue. The explored approaches are able to provide fully distributional predictions for each basin and time step. All predictions are dynamic: the model adapts them according to the properties of each basin and the current dynamic inputs, e.g., temperature or rainfall. Since the predictions are inherently distributional, they can be further examined and/or reduced to a more basic form, e.g., sample, interval, or point predictions.

The comparative assessment indicated that the MCD approach provided the worst uncertainty estimates. One reason for this is likely the Gaussian assumption of the uncertainty estimates, which seems inadequate for many low- and high-flow situations. There is, however, also a more nuanced aspect to consider: the MDN approaches estimate the aleatoric uncertainty. MCD, on the other hand, estimates epistemic uncertainty, or rather a particular form thereof. The methodological comparison is therefore only partially fair. In general, these two uncertainty types can be seen as perpendicular to each other. They do partially co-appear in our setup, since both the epistemic and aleatoric uncertainties are largest for high flow volumes.

Yet within the chosen setup it was observable that the methods that use inherently asymmetric distributions as components outperformed the other ones. That is, CMAL and UMAL performed better than MCD and GMM in terms of reliability, resolution, and the accuracy of the derived single-point predictions. The CMAL approach in particular gave distributional predictions that were very good in terms of reliability and sharpness (and single-point estimates). There was a direct link between the predicted probabilities and hydrologic behavior in that different distributions were activated (i.e., got larger mixture weights) for rising vs. falling limbs. Nevertheless, likelihood-based approaches (for estimating the aleatoric uncertainty) are prone to giving overconfident predictions. We were not able to diagnose this empirically; however, this might be a result of the limits of the inquiry rather than the non-existence of the phenomenon.

These limits illustrate how challenging benchmarking is. Rainfall–runoff modeling is a complex endeavor, and unifying the diverse approaches into a streamlined framework is difficult. Realistically, a single research group cannot compare the best possible implementations of the many existing uncertainty estimation schemes – which include approaches such as sampling distributions, ensembles, post-processors, and so forth. We therefore did not only want to examine some baseline models, but also to provide the skeleton for a community-minded benchmarking scheme (see Nearing et al., 2018). We hope this will encourage good practice and provide a foundation for others to build on. As detailed in the Methods section, the scheme consists of four parts. Three of them are core benchmarking components and one is an added model checking step. In the following we provide our main observations regarding each point.

i. Data: we used the CAMELS dataset curated by the US National Center for Atmospheric Research and split the data into three consecutive periods (training, validation, and test). All reported metrics are for the test split, which was only used for the final model evaluation. The dataset has already seen use for other comparative studies and purposes (e.g., Kratzert et al., 2019a, b; Newman et al., 2017). It is also part of a recent generation of open datasets, which we believe are the result of a growing enthusiasm for community efforts. As such, we predict that new benchmarking possibilities will become available in the near future. A downside of using existing open datasets is that the test data are accessible to modelers. This means that a potential defense mechanism against over-fitting on the test data is missing (since the test data might be used during the model selection/fitting process; for broader discussions we refer to Donoho, 2017; Makridakis et al., 2020; Dehghani et al., 2021). To enable rigorous benchmarking, it might thus become relevant to withhold parts of the data and only make them publicly available after some given time (as, for example, done in Makridakis et al., 2020).

ii. Metrics: we put forward a minimal set of diagnostic criteria, that is, a discrete probability plot, obtained from a probability integral transform, for checking prediction reliability, and a set of dispersion metrics to check the prediction resolution (see Renard et al., 2010). Using these, we could see the proposed baselines exhaust the evaluation capacity of these diagnostic tools. On the one hand, this is an encouraging sign for our ability to make reliable hydrologic predictions (the downside being that it might be hard for models to improve on this metric going forward). On the other hand, it is important to be aware that the probability plot and the dispersion statistic miss several important aspects of probabilistic prediction (for example, precision, consistency, or event-specific properties). All reported metrics are highly aggregated summaries (over multiple basins and time steps) of highly nonlinear relationships (see also Muller, 2018; Thomas and Uminsky, 2020). This is compounded by the inherent noise of the data. We therefore expect that many nuanced aspects are missed by the comparative assessment. In consequence, we hope that future efforts will derive more powerful metrics, tests, and probing procedures – akin to the continuous development of diagnostics for single-point predictions (see Nearing et al., 2018).

iii. Baselines: we examined four deep-learning-based approaches. One is based on Monte Carlo dropout and three on mixture density networks. We used them to demonstrate the benchmarking scheme and showed its comparative power. This should however only be seen as a starting point. We predict that stronger baselines will emerge in tandem with stronger metrics. From our perspective, there is plenty of room to build better ones. A perhaps self-evident example of the potential for improvements is ensembles: Kratzert et al. (2019b) showed the benefit of LSTM ensembles for single-point predictions, and we assume that similar approaches could be developed for uncertainty estimation. We are therefore sure that future research will yield improved approaches and move us closer to achieving holistic diagnostic tools for the evaluation of uncertainty estimations (sensu Nearing and Gupta, 2015).

iv. Model checking: in the post hoc examination step we tested the model performance with regard to point predictions. Remarkably, the results indicate that the distributional predictions are not only reliable and precise, but also yield strong single-point estimates. Additionally, we checked for internal organization principles of the CMAL model. In doing so we showed (a) how the component weighting of a given basin changes depending on the flow regime and (b) how higher-order uncertainties in the form of perturbations of the component weights and parameters change the distributional prediction. The showcased behavior is in accordance with basic hydrological intuition. Specific components are used for low and high streamflow. The uncertainty is lowest near 0 and increases with a rise in streamflow. This relationship is nonlinear and not a simple 1 : 1 depiction. Similarly, additional uncertainties are expected to change the characteristics of the distributional prediction more for high flows than for low flows. By and large, however, this is just a start. We argue that post hoc examination will play a central role in future benchmarking efforts.

To summarize, the presented results are promising. Viewed through the lens of community-based benchmarking, we expect progress on multiple fronts: better data, better models, better baselines, better metrics, and better analyses. The road to get there still has many challenges awaiting. Let us overcome them together.

Appendix A: Hyperparameter search and training

A1 General setup

Table A1 provides the general setup for the hyperparameter search and model training.

A2 Noise regularization

Adding noise to the data during training can be viewed as a form of data augmentation and regularization that biases towards smooth functions. These are large topics in themselves, and at this stage we refer to Rothfuss et al. (2019) for an investigation of the theoretical properties of noise regularization and some empirical demonstrations. In short, plain maximum likelihood estimation can lead to strong over-fitting (resulting in a spiky distribution that generalizes poorly beyond the training data). Training with noise regularization results in smoother density estimates that are closer to the true conditional density.

Following these findings, we also add noise as a smoothness regularization for our experiments. Concretely, we decided to use a relative additive noise as a first-order approximation to the sort of noise contamination we expect in hydrological time series. The operation for regularization is
z* = z + z · N(0, σ),

where z is a placeholder variable for either the dynamic or static input variables or the observed runoff (and, as before, the time index is omitted for the sake of simplicity), N(0, σ) denotes a Gaussian noise term with mean zero and standard deviation σ, and z* is the obtained noise-contaminated variable.

A4 Results

The results of the hyperparameter search are summarized in Table A3.

Appendix B: Baselines

B1 Gaussian mixture model
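The Gaussian mixture model baseline pairs an LSTM with a mixture density output layer in the sense of Bishop (1994): the network predicts mixture weights, means, and standard deviations, and is trained by minimizing the negative log-likelihood of the observations. The likelihood computation at the core of such a head can be sketched as follows (our own illustrative NumPy code, not the study's implementation; in the baseline the three parameter arrays would come from the network rather than being passed in directly):

```python
import numpy as np

def gaussian_mixture_nll(y, weights, means, sigmas):
    """Mean negative log-likelihood of y under per-sample Gaussian mixtures.

    y:       (n,)   observed values (e.g., streamflow)
    weights: (n, k) mixture weights per sample (rows sum to 1)
    means:   (n, k) component means
    sigmas:  (n, k) component standard deviations (> 0)
    """
    y = y[:, None]  # broadcast each observation against its k components
    # log N(y | mean, sigma^2) for every component
    log_pdf = (-0.5 * np.log(2.0 * np.pi)
               - np.log(sigmas)
               - 0.5 * ((y - means) / sigmas) ** 2)
    # log sum_k w_k N(y | ...) via log-sum-exp for numerical stability
    log_mix = np.log(weights) + log_pdf
    m = log_mix.max(axis=1, keepdims=True)
    log_lik = m[:, 0] + np.log(np.exp(log_mix - m).sum(axis=1))
    return -log_lik.mean()
```

With a single component and unit weight this reduces to the ordinary Gaussian negative log-likelihood, which is a convenient sanity check when wiring up a mixture head.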
Table A2. Search space of the hyperparameter search. The search is conducted in two steps: the variables used in the first step are shown in
the top part of the table, and the variables used in the second step are shown in the bottom part and are written in bold.
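To make the relative additive noise of Appendix A2 concrete, the operation z* = z + z · N(0, σ) can be sketched in a few lines (a minimal NumPy sketch with our own naming; during training the noise would be redrawn for every sample and epoch):

```python
import numpy as np

def add_relative_noise(z, sigma, rng=None):
    """Apply z* = z + z * N(0, sigma): additive noise scaled by the signal.

    z:     array of dynamic/static inputs or observed runoff
    sigma: standard deviation of the Gaussian noise term
    rng:   optional numpy Generator for reproducibility
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=sigma, size=np.shape(z))
    return z + z * noise
```

Because the perturbation is multiplied by z itself, small values are only slightly disturbed while large runoff peaks receive proportionally larger noise, and exact zeros remain zero; this is what makes the scheme a first-order approximation to the heteroscedastic contamination expected in hydrological time series.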
References

Abramowitz, G.: Towards a benchmark for land surface models, Geophys. Res. Lett., 32, L22702, [Link] 2005.
Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, [Link] 2017.
Althoff, D., Rodrigues, L. N., and Bazame, H. C.: Uncertainty quantification for hydrological models based on neural networks: the dropout ensemble, Stoch. Environ. Res. Risk A., 35, 1051–1067, 2021.
Andréassian, V., Perrin, C., Berthet, L., Le Moine, N., Lerat, J., Loumagne, C., Oudin, L., Mathevet, T., Ramos, M.-H., and Valéry, A.: HESS Opinions "Crash tests for a standardized evaluation of hydrological models", Hydrol. Earth Syst. Sci., 13, 1757–1764, [Link] 2009.
Berthet, L., Bourgin, F., Perrin, C., Viatgé, J., Marty, R., and Piotte, O.: A crash-testing framework for predictive uncertainty assessment when forecasting high flows in an extrapolation context, Hydrol. Earth Syst. Sci., 24, 2017–2041, [Link] 2020.
Best, M. J., Abramowitz, G., Johnson, H., Pitman, A., Balsamo, G., Boone, A., Cuntz, M., Decharme, B., Dirmeyer, P., Dong, J., Ek, M., Guo, Z., van den Hurk, B. J. J., Nearing, G. S., Pak, B., Peters-Lidard, C., Santanello Jr., J. A., Stevens, L., and Vuichard, N.: The plumbing of land surface models: benchmarking model performance, J. Hydrometeorol., 16, 1425–1442, 2015.
Beven, K.: Facets of uncertainty: epistemic uncertainty, non-stationarity, likelihood, hypothesis testing, and communication, Hydrolog. Sci. J., 61, 1652–1665, [Link] 2016.
Beven, K. and Binley, A.: GLUE: 20 years on, Hydrol. Process., 28, 5897–5918, 2014.
Beven, K. and Young, P.: A guide to good practice in modeling semantics for authors and referees, Water Resour. Res., 49, 5092–5098, 2013.
Beven, K. J., Smith, P. J., and Freer, J. E.: So just why would a modeller choose to be incoherent?, J. Hydrol., 354, 15–32, 2008.
Bishop, C. M.: Mixture density networks, Tech. rep., Neural Computing Research Group, [Link] eprint/373/ (last access: 28 March 2022), 1994.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D.: Weight uncertainty in neural networks, arXiv: preprint, arXiv:1505.05424, 2015.
Brando, A., Rodriguez, J. A., Vitria, J., and Rubio Muñoz, A.: Modelling heterogeneous distributions with an Uncountable Mixture of Asymmetric Laplacians, Adv. Neural Inform. Proc. Syst., 32, 8838–8848, 2019.
Clark, M. P., Wilby, R. L., Gutmann, E. D., Vano, J. A., Gangopadhyay, S., Wood, A. W., Fowler, H. J., Prudhomme, C., Arnold, J. R., and Brekke, L. D.: Characterizing uncertainty of the hydrologic impacts of climate change, Curr. Clim. Change Rep., 2, 55–64, 2016.
Cole, T.: Too many digits: the presentation of numerical data, Arch. Disease Childhood, 100, 608–609, 2015.
Dehghani, M., Tay, Y., Gritsenko, A. A., Zhao, Z., Houlsby, N., Diaz, F., Metzler, D., and Vinyals, O.: The Benchmark Lottery, [Link] 2021.
Demargne, J., Wu, L., Regonda, S. K., Brown, J. D., Lee, H., He, M., Seo, D.-J., Hartman, R., Herr, H. D., Fresch, M., Schaake, J., and Zhu, Y.: The science of NOAA's operational hydrologic ensemble forecast service, B. Am. Meteorol. Soc., 95, 79–98, 2014.
Donoho, D.: 50 years of data science, J. Comput. Graph. Stat., 26, 745–766, 2017.
Ellefsen, K. O., Martin, C. P., and Torresen, J.: How do mixture density RNNs predict the future?, arXiv: preprint, arXiv:1901.07859, 2019.
Fang, K., Shen, C., and Kifer, D.: Evaluating aleatoric and epistemic uncertainties of time series deep learning models for soil moisture predictions, arXiv: preprint, arXiv:1906.04595, 2019.
Feng, D., Fang, K., and Shen, C.: Enhancing streamflow forecast and extracting insights using long-short term memory networks with data integration at continental scales, Water Resour. Res., 56, e2019WR026793, [Link] 2020.
Fort, S., Hu, H., and Lakshminarayanan, B.: Deep ensembles: A loss landscape perspective, arXiv: preprint, arXiv:1912.02757, 2019.
Gal, Y. and Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, 1050–1059, [Link] (last access: 28 March 2022), 2016.
Gers, F. A., Schmidhuber, J., and Cummins, F.: Learning to forget: continual prediction with LSTM, IET Conference Proceedings, 850–855, [Link] conferences/10.1049/cp_19991218 (last access: 31 March 2021), 1999.
Gneiting, T. and Raftery, A. E.: Strictly proper scoring rules, prediction, and estimation, J. Am. Stat. Assoc., 102, 359–378, 2007.
Govindaraju, R. S.: Artificial neural networks in hydrology. II: hydrologic applications, J. Hydrol. Eng., 5, 124–137, 2000.
Graves, A.: Generating sequences with recurrent neural networks, arXiv: preprint, arXiv:1308.0850, 2013.
Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F.: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling, J. Hydrol., 377, 80–91, 2009.
Ha, D. and Eck, D.: A neural representation of sketch drawings, arXiv: preprint, arXiv:1704.03477, 2017.
Ha, D. and Schmidhuber, J.: Recurrent world models facilitate policy evolution, in: Advances in Neural Information Processing Systems, 2450–2462, [Link] hash/[Link] (last access: 28 March 2022), 2018.
Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen, Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Tech. Univ., München, 1991.
Hochreiter, S. and Schmidhuber, J.: Long Short-Term Memory, Neural Comput., 9, 1735–1780, 1997.
Hsu, K.-L., Gupta, H. V., and Sorooshian, S.: Artificial neural network modeling of the rainfall-runoff process, Water Resour. Res., 31, 2517–2530, 1995.
Kavetski, D., Kuczera, G., and Franks, S. W.: Bayesian analysis of input uncertainty in hydrological modeling: 2. Application, Water Resour. Res., 42, W03408, [Link] 2006.
Kendall, A. and Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision?, in: Advances in Neural Information Processing Systems, 5574–5584, [Link] [Link] (last access: 28 March 2022), 2017.
Klemeš, V.: Operational testing of hydrological simulation models, Hydrolog. Sci. J., 31, 13–24, [Link] 1986.
Klotz, D., Kratzert, F., Herrnegger, M., Hochreiter, S., and Klambauer, G.: Towards the quantification of uncertainty for deep learning based rainfall–runoff models, Geophys. Res. Abstr., 21, EGU2019-10708-2, 2019.
Kratzert, F.: neuralhydrology, GitHub [code], [Link] neuralhydrology/neuralhydrology, last access: 21 March 2022.
Kratzert, F., Klotz, D., Brenner, C., Schulz, K., and Herrnegger, M.: Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks, Hydrol. Earth Syst. Sci., 22, 6005–6022, [Link] 2018.
Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., and Nearing, G. S.: Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning, Water Resour. Res., 55, 11344–11354, [Link] 2019a.
Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, [Link] 2019b.
Kratzert, F., Klotz, D., Hochreiter, S., and Nearing, G. S.: A note on leveraging synergy in multiple meteorological data sets with deep learning for rainfall–runoff modeling, Hydrol. Earth Syst. Sci., 25, 2685–2703, [Link] 2021.
Laio, F. and Tamea, S.: Verification tools for probabilistic forecasts of continuous hydrological variables, Hydrol. Earth Syst. Sci., 11, 1267–1277, [Link] 2007.
Lane, R. A., Coxon, G., Freer, J. E., Wagener, T., Johnes, P. J., Bloomfield, J. P., Greene, S., Macleod, C. J. A., and Reaney, S. M.: Benchmarking the predictive capability of hydrological models for river flow and flood peak predictions across over 1000 catchments in Great Britain, Hydrol. Earth Syst. Sci., 23, 4011–4032, [Link] 2019.
Li, W., Duan, Q., Miao, C., Ye, A., Gong, W., and Di, Z.: A review on statistical postprocessing methods for hydrometeorological ensemble forecasting, Wiley Interdisciplin. Rev.: Water, 4, e1246, [Link] 2017.
Liu, M., Huang, Y., Li, Z., Tong, B., Liu, Z., Sun, M., Jiang, F., and Zhang, H.: The Applicability of LSTM-KNN Model for Real-Time Flood Forecasting in Different Climate Zones in China, Water, 12, 440, [Link] 2020.
Makridakis, S., Spiliotis, E., and Assimakopoulos, V.: The M5 accuracy competition: Results, findings and conclusions, Int. J. Forecast., [Link] 2020.
Montanari, A. and Koutsoyiannis, D.: A blueprint for process-based modeling of uncertain hydrological systems, Water Resour. Res., 48, W09555, [Link] 2012.
Muller, J. Z.: The tyranny of metrics, Princeton University Press, [Link] 2018.
Naeini, M. P., Cooper, G. F., and Hauskrecht, M.: Obtaining well calibrated probabilities using Bayesian binning, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 2015, NIH Public Access, p. 2901, [Link] php/AAAI/article/view/9602 (last access: 28 March 2022), 2015.
Nash, J. E. and Sutcliffe, J. V.: River flow forecasting through conceptual models part I – A discussion of principles, J. Hydrol., 10, 282–290, 1970.
NCAR: CAMELS: Catchment Attributes and Meteorology for Large-sample Studies – Dataset Downloads, [Link] solutions/products/camels, last access: 21 March 2022.
Nearing, G. S. and Gupta, H. V.: The quantity and quality of information in hydrologic models, Water Resour. Res., 51, 524–538, 2015.
Nearing, G. S., Mocko, D. M., Peters-Lidard, C. D., Kumar, S. V., and Xia, Y.: Benchmarking NLDAS-2 soil moisture and evapotranspiration to separate uncertainty contributions, J. Hydrometeorol., 17, 745–759, 2016.
Nearing, G. S., Ruddell, B. L., Clark, M. P., Nijssen, B., and Peters-Lidard, C.: Benchmarking and process diagnostics of land models, J. Hydrometeorol., 19, 1835–1852, 2018.
Nearing, G. S., Kratzert, F., Sampson, A. K., Pelissier, C. S., Klotz, D., Frame, J. M., Prieto, C., and Gupta, H. V.: What role does hydrological science play in the age of machine learning?, Water Resour. Res., 57, e2020WR028091, [Link] 2020a.
Nearing, G. S., Ruddell, B. L., Bennett, A. R., Prieto, C., and Gupta, H. V.: Does Information Theory Provide a New Paradigm for Earth Science? Hypothesis Testing, Water Resour. Res., 56, e2019WR024918, [Link] 2020b.
Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., and Duan, Q.: Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance, Hydrol. Earth Syst. Sci., 19, 209–223, [Link] 2015.
Newman, A. J., Mizukami, N., Clark, M. P., Wood, A. W., Nijssen, B., and Nearing, G.: Benchmarking of a physically based hydrologic model, J. Hydrometeorol., 18, 2215–2225, 2017.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J.: Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift, in: Advances in Neural Information Processing Systems, 13991–14002, [Link] hash/[Link] (last access: 28 March 2022), 2019.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S.: PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, 8026–8037, [Link] [Link] (last access: 28 March 2022), 2019.
Reichle, R. H., McLaughlin, D. B., and Entekhabi, D.: Hydrologic data assimilation with the ensemble Kalman filter, Mon. Weather Rev., 130, 103–114, 2002.
Renard, B., Kavetski, D., Kuczera, G., Thyer, M., and Franks, S. W.: Understanding predictive uncertainty in hydrologic modeling: The challenge of identifying input and structural errors, Water Resour. Res., 46, W05521, [Link] 2010.
Richmond, K., King, S., and Taylor, P.: Modelling the uncertainty in recovering articulation from acoustics, Comput. Speech Language, 17, 153–172, 2003.
Rothfuss, J., Ferreira, F., Walther, S., and Ulrich, M.: Conditional density estimation with neural networks: Best practices and benchmarks, arXiv: preprint, arXiv:1903.00954, 2019.
Shrestha, D. L. and Solomatine, D. P.: Data-driven approaches for estimating uncertainty in rainfall-runoff modelling, Int. J. River Basin Manage., 6, 109–122, [Link] 2008.
Smith, L. and Gal, Y.: Understanding measures of uncertainty for adversarial example detection, arXiv: preprint, arXiv:1803.08533, 2018.
Thomas, R. and Uminsky, D.: The Problem with Metrics is a Fundamental Problem for AI, arXiv: preprint, arXiv:2002.08512, 2020.
Weijs, S. V., Schoups, G., and van de Giesen, N.: Why hydrological predictions should be evaluated using information theory, Hydrol. Earth Syst. Sci., 14, 2545–2558, [Link] 2010.
Yilmaz, K. K., Gupta, H. V., and Wagener, T.: A process-based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model, Water Resour. Res., 44, W09417, [Link] 2008.
Yu, K. and Moyeed, R. A.: Bayesian quantile regression, Stat. Probabil. Lett., 54, 437–447, 2001.
Zhu, L. and Laptev, N.: Deep and confident prediction for time series at Uber, in: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 103–110, [Link] org/abstract/document/8215650 (last access: 28 March 2022), 2017.
Zhu, S., Xu, Z., Luo, X., Liu, X., Wang, R., Zhang, M., and Huo, Z.: Internal and external coupling of Gaussian mixture model and deep recurrent network for probabilistic drought forecasting, Int. J. Environ. Sci. Technol., 18, 1221–1236, [Link] 2020.