Hydrol. Earth Syst. Sci., 26, 1673–1693, 2022
[Link]
© Author(s) 2022. This work is distributed under the Creative Commons Attribution 4.0 License.

Uncertainty estimation with deep learning for rainfall–runoff modeling

Daniel Klotz1, Frederik Kratzert1, Martin Gauch1, Alden Keefe Sampson2, Johannes Brandstetter1, Günter Klambauer1, Sepp Hochreiter1, and Grey Nearing3

1 Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria
2 Upstream Tech, Natel Energy Inc., Alameda, CA, USA
3 Google Research, Mountain View, CA, USA

Correspondence: Daniel Klotz (klotz@[Link])

Received: 15 March 2021 – Discussion started: 14 April 2021
Revised: 14 January 2022 – Accepted: 9 February 2022 – Published: 31 March 2022

Abstract. Deep learning is becoming an increasingly important way to produce accurate hydrological predictions across a wide range of spatial and temporal scales. Uncertainty estimations are critical for actionable hydrological prediction, and while standardized community benchmarks are becoming an increasingly important part of hydrological model development and research, similar tools for benchmarking uncertainty estimation are lacking. This contribution demonstrates that accurate uncertainty predictions can be obtained with deep learning. We establish an uncertainty estimation benchmarking procedure and present four deep learning baselines. Three baselines are based on mixture density networks, and one is based on Monte Carlo dropout. The results indicate that these approaches constitute strong baselines, especially the former ones. Additionally, we provide a post hoc model analysis to put forward some qualitative understanding of the resulting models. The analysis extends the notion of performance and shows that the model learns nuanced behaviors to account for different situations.

Published by Copernicus Publications on behalf of the European Geosciences Union.

1 Introduction

A growing body of empirical results shows that data-driven models perform well in a variety of environmental modeling tasks (e.g., Hsu et al., 1995; Govindaraju, 2000; Abramowitz, 2005; Best et al., 2015; Nearing et al., 2016, 2018). Specifically for rainfall–runoff modeling, approaches based on long short-term memory (LSTM; Hochreiter, 1991; Hochreiter and Schmidhuber, 1997; Gers et al., 1999) networks have been especially effective (e.g., Kratzert et al., 2019a, b, 2021).

The majority of machine learning (ML) and deep learning (DL) rainfall–runoff studies do not provide uncertainty estimates (e.g., Hsu et al., 1995; Kratzert et al., 2019b, 2021; Liu et al., 2020; Feng et al., 2020). However, uncertainty is inherent in all aspects of hydrological modeling, and it is generally accepted that our predictions should account for this (Beven, 2016). The hydrological sciences community has put substantial effort into developing methods for providing uncertainty estimations around traditional models, and similar effort is necessary for DL models like LSTMs.

Currently there exists no single, prevailing method for obtaining distributional rainfall–runoff predictions. Many, if not most, methods take a basic approach where a deterministic model is augmented with some uncertainty estimation strategy. This includes, for example, ensemble-based methods, where the idea is to define and sample probability distributions around different model inputs and/or structures (e.g., Li et al., 2017; Demargne et al., 2014; Clark et al., 2016), but it also comprises Bayesian (e.g., Kavetski et al., 2006) or pseudo-Bayesian (e.g., Beven and Binley, 2014) methods and post-processing methods (e.g., Shrestha and Solomatine, 2008; Montanari and Koutsoyiannis, 2012). In other words, most classical rainfall–runoff models do not provide direct estimates of their own predictive uncertainty; instead, such models are used as part of a larger framework. There are some exceptions to this, for example, methods based on stochastic partial differential equations, which actually use stochastic models but generally require one to assign sampling distributions a priori (e.g., a Wiener process). These are common, for example, in hydrologic data assimilation (e.g., Reichle et al., 2002). The problem with these types of approaches is that any distribution that we could possibly assign is necessarily degenerate, resulting in well-known errors and biases in estimating uncertainty (Beven et al., 2008).

It is possible to fit DL models such that their own representations intrinsically support estimating distributions while accounting for strongly nonlinear interactions between model inputs and outputs. In this case, there is no requirement to fall back on deterministic predictions that would need to be sampled, perturbed, or inverted. Several approaches to uncertainty estimation for DL have been suggested (e.g., Bishop, 1994; Blundell et al., 2015; Gal and Ghahramani, 2016), and some of them have been used in the hydrological context. For example, Zhu et al. (2020) tested two strategies for using an LSTM in combination with Gaussian processes for drought forecasting. In one strategy, the LSTM was used to parameterize a Gaussian process, and in the second strategy, the LSTM was used as a forecast model with a Gaussian process post-processor. Gal and Ghahramani (2016) showed that Monte Carlo dropout (MCD) can be used to intrinsically approximate Gaussian processes with LSTMs, so it is an open question as to whether explicitly representing the Gaussian process is strictly necessary. Althoff et al. (2021) examined the use of MCD for an LSTM-based model of a single river basin and compared its performance with that of an ensemble approach. They report that MCD had uncertainty bands that are more reliable and wider than the ensemble counterparts. This finding contradicts the preliminary results of Klotz et al. (2019) and observations from other domains (e.g., Ovadia et al., 2019; Fort et al., 2019). It is therefore not yet clear whether the results of Althoff et al. (2021) are confined to their setup and data. Be that as it may, this is still evidence of the potential capabilities of MCD. A further use case was examined by Fang et al. (2019), who used MCD for soil moisture modeling. They observed a tendency in MCD to underestimate uncertainties. To compensate, they tested an MCD extension, proposed by Kendall and Gal (2017), the core idea of which is to add an estimation of the aleatoric uncertainty by using a Gaussian noise term. They report that this combination was more effective at representing uncertainty.

Our primary goal is to benchmark several methods for uncertainty estimation in rainfall–runoff modeling with DL. We demonstrate that DL models can produce statistically reliable uncertainty estimates using approaches that are straightforward to implement. We adapted the LSTM rainfall–runoff models developed by Kratzert et al. (2019b, 2021) with four different approaches to make distributional predictions. Three of these approaches use neural networks to create and mix probability distributions (Sect. 2.3.1). The fourth is MCD, which is based on direct sampling from the LSTM (Sect. 2.3.2).

Our secondary objective is to help advance the state of community model benchmarking to include uncertainty estimation. We want to do so by outlining a basic skeleton for an uncertainty-centered benchmarking procedure. The reason for this is that it was difficult to find suitable benchmarks for the DL uncertainty estimation approaches we wanted to explore. Ad hoc benchmarking and model intercomparison studies are common (e.g., Andréassian et al., 2009; Best et al., 2015; Kratzert et al., 2019b; Lane et al., 2019; Berthet et al., 2020; Nearing et al., 2018), and, while the community has large-sample datasets for benchmarking hydrological models (Newman et al., 2017; Kratzert et al., 2019b), we lack standardized, open procedures for conducting comparative uncertainty estimation studies. For example, from the given references, only Berthet et al. (2020) focused on benchmarking uncertainty estimation strategies, and then only for assessing post-processing approaches. Furthermore, it has previously been argued that data-based models provide a meaningful and general benchmark for testing hypotheses and models (Nearing and Gupta, 2015; Nearing et al., 2020b). Thus, here we examine a data-based uncertainty estimation benchmark built on a standard, publicly available, large-sample dataset that could be used as a baseline for future benchmarking studies.

2 Data and methods

To carve out a skeleton for a benchmarking procedure, we followed the philosophy outlined by Nearing et al. (2018). According to the principles therein, the requirements for a suitable, standardized benchmark are (i) that the benchmark uses a community-standard dataset that is publicly available, (ii) that the model or method is applied in a way that conforms to the standards of practice for that dataset (e.g., standard train/test splits), and (iii) that the results of the standardized benchmark runs are publicly available. To these, we added a fourth point: a post hoc model examination step which aims at exposing the intrinsic properties of the model. Although examination is important – especially for ML approaches and imperfect approximations – we do not view it as a requirement for benchmarking in general.

Nonetheless, we believe that good benchmarking is not something that can be done in a responsible way by a single contribution (unless it is the outcome of a larger effort in itself, e.g., Best et al., 2015; Kratzert et al., 2019b). In general, it will require a community-based effort. If no benchmarking effort is established yet, one would ideally start with a set of self-contained baselines and openly share settings, data, models, and metrics. Then, over time, a community can establish itself and improve, replace, or add to them. In the best case, everyone runs the model or approach that they know best, and results are compared at a community level.

The current study can be seen as a starting point for this process: we base the setup for an uncertainty estimation (UE) benchmark on a large, publicly curated, open dataset that is already established for other benchmarking efforts – namely, the Catchment Attributes and MEteorology for Large-sample Studies (CAMELS) dataset. Section 2.1 provides an overview of the CAMELS dataset. The following sections describe the benchmarking setup: Sect. 2.2 discusses a suite of performance metrics that we used to evaluate the uncertainty estimation approaches, and Sect. 2.3 introduces the different uncertainty estimation baselines that we developed. We used exclusively data-driven models because they capture the empirically inferrable relationships between inputs and outputs (and assume minimal a priori process conceptualization; see, e.g., Nearing et al., 2018, 2020b). The setup, the models, and the metrics should be seen as a minimum viable implementation of a comparative examination of uncertainty predictions, a template that can be expanded and adapted to progress benchmarking in a community-minded way. Lastly, Sect. 2.4 discusses the different experiments of the post hoc model examination. Our goal there is to make the behavior and performance of the (best) model more tangible and to compensate for potential blind spots of the metrics.

Figure 1. Overview map of the CAMELS basins. The plot shows the mean precipitation estimates for the 531 basins originally chosen by Newman et al. (2017) and used in this study.

2.1 Data: the CAMELS dataset

CAMELS (Newman et al., 2015; Addor et al., 2017) is an openly available dataset that contains basin-averaged daily meteorological forcings derived from three different gridded data products for 671 basins across the contiguous United States. The 671 CAMELS basins range in size between 4 and 25 000 km2 and span a range of geological, ecological, and climatic conditions. The original CAMELS dataset includes daily meteorological forcings (precipitation, temperature, short-wave radiation, humidity) from three different data sources (NLDAS, Maurer, DayMet) for the period 1980 through 2010, as well as daily streamflow discharge data from the US Geological Survey. CAMELS also includes basin-averaged catchment attributes related to soil, geology, vegetation, and climate.

We used the same 531 basins from the CAMELS dataset (Fig. 1) that were originally chosen for model benchmarking by Newman et al. (2017). This means that all basins from the original 671 with areas greater than 2000 km2, or with discrepancies of more than 10 % between different methods for calculating basin area, were not considered. Since all of the models that we tested here are DL models, we use the terms training, validation, and testing, which are standard in the machine learning community, instead of the terms calibration and validation, which are more common in the hydrology community (sensu Klemeš, 1986).

2.2 Metrics: benchmarking evaluation

Benchmarking requires metrics to evaluate. No global, unique metric exists that is able to fully capture model behavior. As a matter of fact, it is often the case that even multiple metrics will miss important aspects. The choice of metrics will also necessarily depend on the goal of the benchmarking exercise. Post hoc model examination provides a partial remedy to these shortcomings by making the model behavior more tangible. Still, as of now, no canonical set of metrics exists. The ones we employed should be seen as a bare minimum, a starting point so to speak. The metrics will need to be adapted and refined over time and from application to application.

The minimal metrics for benchmarking uncertainty estimations need to test whether the distributional predictions are "reliable" and have "high resolution" (a terminology we adopted from Renard et al., 2010). Reliability measures how consistent the provided uncertainty estimates are with respect to the available observations, and resolution measures the "sharpness" of the distributional predictions (i.e., how thin the body of the distribution is). Generally, models with higher resolution are preferable. However, this preference is conditional on the models being reliable: a model should not be overly precise relative to its accuracy (over-confident) or overly disperse relative to its accuracy (under-confident). Generally speaking, a single metric will not suffice to completely summarize these properties (see, e.g., Thomas and Uminsky, 2020). We note, however, that the best form of metrics for comparing distributional predictions would be proper scoring rules, such as likelihoods (see, e.g., Gneiting and Raftery, 2007). Likelihoods, however, do not exist on an absolute scale (it is generally only possible to compare likelihoods between models), which makes them difficult to interpret (although see Weijs et al., 2010). Additionally, they can be difficult to compute with certain types of uncertainty estimation approaches and so are not completely general for future benchmarking studies. To have a minimally viable set of metrics, we therefore based the assessment of reliability on probability plots and evaluated resolution with a set of summary statistics.

All metrics that we report throughout the paper are evaluated on the test data only. With that we follow the thoughts outlined by Klemeš (1986) and the currently established practice in machine learning.

2.2.1 Reliability

Probability plots (Laio and Tamea, 2007) are based on the following observation: if we insert the observations into the estimated cumulative distribution function, a consistent model will provide a uniform distribution on the interval [0, 1]. The probability plot uses this to provide a diagnostic. The theoretical quantiles of this uniform distribution are plotted on the x axis, and the fraction of observations that fall below the corresponding predictions is plotted on the y axis (Fig. 2). Deficiencies appear as deviations from the 1:1 line: a perfect model should capture 10 % of the observations below the 10 % threshold, 20 % under the 20 % threshold, and so on. If the relative counts of observations in particular modeled quantiles are higher than the theoretical quantiles, this means that the model is under-confident. Similarly, if the relative counts of observations in particular modeled quantiles are lower than the theoretical quantiles, then the model is over-confident.

Laio and Tamea (2007) proposed using the probability plot in a continuous fashion to avoid arbitrary binning. We preferred to use discrete steps in our quantile estimates to avoid falsely reporting overly precise results (e.g., Cole, 2015). As such, we chose a 10 % step size for the binning thresholds in our experiments. We used 10 thresholds in total: one for each of the resulting steps and an additional 1.0 threshold, which used the highest sampled value as an upper bound, so that an intuition regarding the upper limit can be obtained. Subtracting the thresholds from the relative counts yields a deviation from the 1:1 line (the sum of which is sometimes referred to as the expected calibration error; see, e.g., Naeini et al., 2015). For the evaluation we depicted this counting error alongside the probability plots to provide better readability (see Fig. 2b).

A deficit of the probability plot is its coarseness, since it represents an aggregate over time and basins. As such, it provides a general overview but necessarily neglects many aspects of hydrological importance. Many expansions of the analytical range are possible. One that suggested itself was to examine the deviations from the 1:1 line for different basins. Therefore, we evaluated the probability plot for each basin specifically, computed the deviations from the 1:1 line, and examined their distributions. We did not include the 1.0 threshold for this analysis since it consisted of large spikes only.

Figure 2. (a) Illustration of the probability plot for the evaluation of predictive distributions. The x axis shows the cumulative distribution estimated over all time steps by a given model, and the y axis shows the actual observed cumulative probability distribution. A conditional probability distribution was produced by each model for each time step in each basin. A hypothetically perfect model will have a probability plot that falls on the 1:1 line. We used 10 % binning in our actual experiments. (b) Illustration of the corresponding "error plot". This plot complements the probability plot by explicitly depicting the distances of individual models from the 1:1 line.

2.2.2 Resolution

To motivate why further metrics are required on top of the reliability plot, it is useful to consider the following observation: there is an infinity of models that produce perfect probability plots. One edge-case example is a model that simply ignores the inputs and produces the unconditional empirical data distribution at every time step. Another edge-case example is a hypothetical "perfect" model that produces delta distributions at exactly the observations every time. Both of these models have precision that exactly matches accuracy, and these two models could not be distinguished from each other using a probability plot. Similarly, a model which is consistently under-confident for low flows can compensate for this by being over-confident for higher flows. Thus, to better assess the uncertainty estimations, at least one more dimension of the problem has to be checked: the resolution.
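To make the reliability evaluation of Sect. 2.2.1 concrete, the counting scheme might be sketched as follows. This is a minimal pure-Python illustration with hypothetical helper names; the paper's own evaluation code is not reproduced here. For each threshold we count the fraction of observations falling below the corresponding empirical quantile of the predictive samples, and the absolute deviations from the 1:1 line sum to the expected calibration error.

```python
# Hypothetical sketch of the probability-plot counting scheme (Sect. 2.2.1).

def reliability_curve(samples_per_step, observations, thresholds):
    """Fraction of observations below each threshold-quantile of the samples."""
    points = []
    for tau in thresholds:
        hits = 0
        for samples, obs in zip(samples_per_step, observations):
            ordered = sorted(samples)
            # empirical tau-quantile of the predictive samples (no interpolation)
            quantile = ordered[min(len(ordered) - 1, int(tau * len(ordered)))]
            if obs <= quantile:
                hits += 1
        points.append(hits / len(observations))
    return points

def expected_calibration_error(points, thresholds):
    # sum of absolute deviations from the 1:1 line (cf. Naeini et al., 2015)
    return sum(abs(p - tau) for p, tau in zip(points, thresholds))
```

For a well-calibrated model the returned points lie close to the thresholds themselves, so the expected calibration error approaches zero.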


Table 1. Overview of the benchmarking metrics for assessing model resolution. Each metric is applied to the distributional streamflow predictions at each individual time step and then aggregated over all time steps and basins. All metrics are defined on the interval [0, ∞), and lower values are preferable (conditional on the model also being reliable).

Benchmarking metric: Description
- Mean absolute deviation: More robust than standard deviation and variance.
- Standard deviation: We use Bessel's correction to account for one degree of freedom.
- Variance: We use Bessel's correction to account for one degree of freedom.
- Average width of the 0.2 to 0.9 quantiles: We compute the width of each of the inner quantiles and take the mean.
- Distance between the 0.25 and 0.75 quantiles: Average interquartile range.
- Distance between the 0.1 and 0.9 quantiles: Average interdecile range.

To assess the resolution of the provided uncertainty estimates, we used a group of metrics (Table 1). Each metric was computed for all available data points and averaged over all time steps and basins. The results are statistics that characterize the overall sharpness of the provided uncertainty estimates (roughly speaking, they give us a notion of how thin the body of the distributional predictions is). Further, to provide an anchor for interpreting the magnitudes of the statistics, we also computed them for the observed streamflow values (this yields an unconditional empirical distribution for each basin that can be aggregated). These are not strictly the same, but we argue that they still provide some form of guidance.
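As an illustration, the per-time-step statistics of Table 1 might be computed roughly as follows. This is a sketch with hypothetical helper names, not the paper's implementation; the benchmark then averages these values over all time steps and basins.

```python
# Hypothetical sketch of the resolution (sharpness) statistics from Table 1.
import statistics

def empirical_quantile(ordered, tau):
    # deliberately simple empirical quantile without interpolation
    return ordered[min(len(ordered) - 1, int(tau * len(ordered)))]

def sharpness_stats(samples):
    """Summarize the sharpness of one set of predictive samples."""
    ordered = sorted(samples)
    mean = statistics.fmean(ordered)
    return {
        "mean_abs_dev": statistics.fmean([abs(x - mean) for x in ordered]),
        "std": statistics.stdev(ordered),          # Bessel's correction (n - 1)
        "variance": statistics.variance(ordered),  # Bessel's correction (n - 1)
        "interquartile": empirical_quantile(ordered, 0.75) - empirical_quantile(ordered, 0.25),
        "interdecile": empirical_quantile(ordered, 0.9) - empirical_quantile(ordered, 0.1),
    }
```

Any consistent quantile estimator would do for the comparison; the point is only that all models are summarized with the same rule.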

2.3 Baselines: uncertainty estimation with deep learning

We tested four strategies for uncertainty estimation with deep learning. These strategies fall into two broad categories: mixture density networks (MDNs) and MCD. We argue that these approaches represent a useful set of baselines for benchmarking.

2.3.1 Mixture density networks

The first class of approaches uses a neural network to mix different probability densities. This class is commonly referred to as MDNs (Bishop, 1994), and we tested three different forms of MDNs. A mixture density is a probability density function created by combining multiple densities, called components. An MDN is defined by the parameters of each component and the mixture weights. The mixture components are usually simple distributions, like the Gaussians in Fig. 3. Mixing is done using weighted sums. The mixture weights are larger than zero and collectively sum to one to guarantee that the mixture is also a density function. These weights can therefore be seen as the probability of a particular mixture component. Usually, the number of mixture components is discrete; however, this is not a strict requirement.

Figure 3. Illustration of the concept of a mixture density using Gaussian distributions. Plot (a) shows three Gaussian distributions with different parameters (i.e., different means and standard deviations). Plot (b) shows the same distributions superimposed with the mixture that results from the depicted weighting w = (0.1, 0.6, 0.3). Plot (c) shows the same juxtaposition, but the mixture is derived from a different weighting w = (0.80, 0.15, 0.05). The plots demonstrate that even in this simple example with fixed parameters the skewness and form of the mixed distribution can vary strongly.

The output of an MDN is an estimation of a conditional density, since the mixture directly depends on a given input (Fig. 4). The mixture representation changes every time the net-


work receives new inputs (i.e., in our case for every time step). We thus obtain time-varying predictive distributions that can approximate a large variety of distributions (they can, for example, account for asymmetric and multimodal properties). The resulting model is trained by maximizing the log-likelihood function of the observations according to the predicted mixture distributions. We view MDNs as intrinsically distributional in the sense that they provide probability distributions instead of first making deterministic streamflow estimates and then appending a sampling distribution.

Figure 4. Illustration of a mixture density network. The core idea is to use the outputs of a neural network to determine the mixture weights and parameters of a mixture of densities (see Fig. 3). That is, for a given input, the network determines a conditional density function, which it builds by mixing a set of predefined base densities (the so-called components).

In this study, we tested three different MDN approaches.

1. Gaussian mixture models (GMMs) are MDNs with Gaussian mixture components. Appendix B1 provides a more formal definition as well as details on the loss/objective function.

2. Countable mixtures of asymmetric Laplacians (CMAL) are similar to GMMs, but instead of Gaussians, the mixture components are asymmetric Laplacian distributions (ALDs). This allows for an intrinsic representation of the asymmetric uncertainties that often occur with hydrological variables like streamflow. Appendix B2 provides a more formal description as well as details on the loss/objective function.

3. Uncountable mixtures of asymmetric Laplacians (UMAL) also use asymmetric Laplacians as mixture components, but the mixture is not discretized. Instead, UMAL approximates the conditional density by using Monte Carlo integration over distributions obtained from quantile regression (Brando et al., 2019). Appendix B3 provides a more formal description as well as details on the loss/objective function.

One can read this enumeration as a transition from simple to complex: we start with Gaussian mixture components, then replace them with ALD mixture components, and lastly transition from a fixed number of mixture components to an implicit approximation. There are two reasons why we argue that the more complex MDN methods might be more promising than a simple GMM. First, error distributions in hydrologic simulations often have heavy tails, and a Laplacian component lends itself to thicker-tailed uncertainty (Fig. 5). Second, streamflow uncertainty is often asymmetrical, and thus the ALD component could make more sense than a symmetric distribution in this application. For example, even a single ALD component can be used to account for zero flows (compare Fig. 5b). UMAL extends this to avoid having to pre-specify the number of mixture components, which removes one of the more subjective degrees of freedom from the model design.

2.3.2 Monte Carlo dropout

MCD provides an approach to estimate a basic form of epistemic uncertainty. In the following we provide the intuition behind its application.

Dropout is a regularization technique for neural networks but can also be used for uncertainty estimation (Gal and Ghahramani, 2016). Dropout randomly ignores specific network units (see Fig. 6). Hence, each time the model is evaluated during training, the network structure is different. Repeating this procedure many times results in an ensemble of many submodels within the network. Dropout regularization is used during training, while during model evaluation the whole neural network is used. Gal and Ghahramani (2016) showed that dropout can be used as a sampling technique for Bayesian inference – hence the name Monte Carlo dropout.
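To make the two baseline families concrete, the following minimal pure-Python sketch shows a Gaussian mixture density with its negative log-likelihood (the training objective of the GMM baseline, Sect. 2.3.1) and MCD-style sampling with a toy stochastic forward pass (Sect. 2.3.2). All function names and the three-unit "network" are hypothetical illustrations, not the paper's LSTM models.

```python
# Hypothetical sketch: mixture density (GMM) and Monte Carlo dropout sampling.
import math
import random

def gaussian_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(y, weights, mus, sigmas):
    # weights are non-negative and sum to one, so the mixture is itself a density
    return sum(w * gaussian_pdf(y, m, s) for w, m, s in zip(weights, mus, sigmas))

def nll(observations, weights, mus, sigmas):
    # MDN training minimizes the negative log-likelihood of the observations
    return -sum(math.log(mixture_pdf(y, weights, mus, sigmas)) for y in observations)

def toy_dropout_forward(x, rng, p_drop=0.5):
    # hypothetical three-unit "network"; each unit is dropped with prob. p_drop,
    # and kept units are rescaled (inverted dropout)
    units = [x * w for w in (0.5, 1.0, 1.5)]
    kept = [u / (1.0 - p_drop) for u in units if rng.random() >= p_drop]
    return sum(kept) / len(units)

def mcd_samples(stochastic_forward, n):
    # dropout stays active at inference; every call yields one ensemble member
    return [stochastic_forward() for _ in range(n)]
```

The MCD samples scatter around the deterministic output, which is the sense in which repeated stochastic forward passes act as draws from a predictive distribution.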


Figure 5. Characterization of the distributions that are used as mixture components in our networks. Plot (a) superimposes a Gaussian and a Laplacian distribution. The latter is sharper around its center, but this sharpness is traded for thicker tails. We can think about it in the following way: the difference in area is moved to the center and the tails of the distribution. Plot (b) illustrates how the asymmetric Laplacian distribution (ALD) can accommodate differences in skewness (via an additional parameter).

Figure 6. Schematic depiction of the dropout concept.

2.3.3 Model setup

All models are based on the LSTMs from Kratzert et al. (2021). We configured them so that they use meteorological variables as input and predict the corresponding streamflow. This implies that all presented results are from simulation models sensu Beven and Young (2013); i.e., no previous discharge observations were used as inputs. The LSTM in the context of hydrology is described inter alia in Kratzert et al. (2018) and is not repeated here. However, all models can be adapted to a forecasting setting.

In short, our setting was the following. Each model takes a set of meteorological inputs (namely, precipitation, solar radiation, minimum and maximum daily temperature, and vapor pressure) from a set of products (namely, NLDAS, Maurer, and DayMet). As in our previous studies, a set of static attributes is concatenated to the inputs (see Kratzert et al., 2019b). The training period is from 1 October 1980 to 30 September 1990, the validation period is from 1 October 1990 to 30 September 1995, and the test period is from 1 October 1995 to 1 September 2005. This means that we use around 365 × 10 = 3650 training points per catchment from 531 catchments (equating to a total of 531 × 3650 = 1 938 150 observations for training).

For all MDNs we introduced an additional hidden layer to provide more flexibility and adapted the network as required (see Appendix B). We trained all MDNs with the log-likelihood; the MCD model was trained as in Kratzert et al. (2021), except that the loss was the mean squared error (as proposed by Gal and Ghahramani, 2016). All hyperparameters were selected on the basis of the training data, such that they provide the smallest average deviation from the 1:1 line of the probability plot for each model. For the GMM this resulted in 10 components and for CMAL in 3 components (Appendix A).

To make the benchmarking procedure work at the most general level, we employed the setup depicted in Fig. 7. This allows any approach that is able to generate samples to be plugged into the framework (as evidenced by the inclusion of MCD). For each basin and time step the models either predict the streamflow directly (MCD) or provide a distribution over the streamflow (GMM, CMAL, and UMAL). In the latter case, we then sampled from the distribution to get 7500 sample points for each data point. Since the distributions have infinite support, sampled values below zero cubic meters per day are possible. In such cases, we truncated the distribution by setting the sample to 0. All in all, this resulted in 531 × 3650 × 7500 simulation points for each model and metric. Exaggerating a little bit, we could say that we actually deal with "multi-point" predictions here.


1680 D. Klotz et al.: Uncertainty estimation with deep learning for rainfall–runoff simulation

Figure 7. Schema of the general setup. Vertically the procedure is illustrated for two arbitrary basins, m and n, and horizontally the
corresponding time steps are depicted. In total we have 531 basins with approximately 3650 data points in time each, and for each time
step we compute 7500 samples. In the case of MCD we achieve this by directly sampling from the model. In the case of the MDNs we first
estimate a conditional distribution and then sample from it. The “clipping” sign emphasizes our choice to set samples that would be below
zero back to the zero-runoff baseline.
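The per-time-step sampling and clipping step described in this setup can be sketched as follows. This is a minimal numpy illustration with made-up mixture parameters (not fitted CMAL/GMM values from the study), using Gaussian components for simplicity:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_clipped_mixture(weights, means, scales, n_samples, rng):
    """Draw n_samples from a 1-D mixture and clip negatives to zero.

    Sketch of the per-time-step procedure: the MDN head predicts mixture
    weights and component parameters, we sample from the implied density,
    and samples below zero streamflow are set back to zero (the "clipping"
    in Fig. 7).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # Pick a component index for every sample according to the weights.
    comp = rng.choice(len(weights), size=n_samples, p=weights)
    raw = rng.normal(np.asarray(means)[comp], np.asarray(scales)[comp])
    # Negative streamflow is unphysical: truncate at the zero-runoff baseline.
    return np.clip(raw, 0.0, None)

# Made-up mixture parameters (NOT values from the paper):
samples = sample_clipped_mixture([0.6, 0.3, 0.1], [0.2, 1.5, 4.0],
                                 [0.3, 0.8, 2.0], 7500, rng)
print(samples.shape, bool(samples.min() >= 0.0))
```

Repeating this draw for every basin and time step yields the 531 × 3650 × 7500 simulation points mentioned above.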

2.4 Post hoc model examination: checking model behavior

We performed a post hoc model examination as a complement to the benchmarking to avoid potential blind spots. The analysis has three parts, each one associated with a specific property.

1. Accuracy: how accurate are single-point predictions obtained from the distributional predictions?

2. Internal consistency: how are the mixture components used with regard to flow conditions?

3. Estimation quality: how can we examine the properties of the distributional predictions with regard to second-order uncertainties?

2.4.1 Accuracy: single-point predictions

To address accuracy, we used standard performance metrics applied to single-point predictions (such as the Nash–Sutcliffe efficiency, NSE, and the Kling–Gupta efficiency, KGE; Table 2). The term single-point predictions is used here in the statistical sense of a point estimator to distinguish it from distributional predictions. Single-point predictions were derived as the mean of the distributional predictions at each time step and evaluated by aggregating over the different basins, using the mean and median as aggregation operators (as in Kratzert et al., 2019b). Section 3.2.1 discusses the outcomes of this test as part of the post hoc model examination.

2.4.2 Internal consistency: mixture component behavior

To get an impression of the model consistency, we looked at the behavioral properties of the mixture densities themselves. The goal was to get some qualitative understanding about how the mixture components are used in different situations. As a prototypical example of this kind of examination, we refer to the study of Ellefsen et al. (2019). It examined how LSTMs use the mixture weights to predict the future within a simple game setting. Similarly, Nearing et al. (2020a) re-


Table 2. Overview of the different single-point prediction performance metrics. The table is adapted from Kratzert et al. (2021).

Single-point metric  Description  Reference
NSE Nash–Sutcliffe efficiency Eq. (3) in Nash and Sutcliffe (1970)
KGE Kling–Gupta efficiency Eq. (9) in Gupta et al. (2009)
Pearson’s r Pearson correlation between observed and simulated flow
α-NSE Ratio of standard deviations of observed and simulated flow From Eq. (4) in Gupta et al. (2009)
β-NSE Ratio of the means of observed and simulated flow From Eq. (10) in Gupta et al. (2009)
FHV Top 2 % peak flow bias Eq. (A3) in Yilmaz et al. (2008)
FLV Bottom 30 % low-flow bias Eq. (A4) in Yilmaz et al. (2008)
FMS Bias of the slope of the flow duration curve between the 20 % and 80 % percentiles Eq. (A2) in Yilmaz et al. (2008)
Peak timing Mean peak time lag (in days) between observed and simulated peaks Appendix D in Kratzert et al. (2021)
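For orientation, the two efficiency scores at the top of the table can be computed directly from observed and simulated flow following the cited definitions; a minimal sketch (synthetic arrays, not CAMELS data):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970):
    1 minus the error variance over the variance of the observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta efficiency (Gupta et al., 2009, Eq. 9): Euclidean
    distance from the ideal point of correlation r, variability ratio
    alpha, and bias ratio beta."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # synthetic flows, not CAMELS data
print(nse(obs, obs), kge(obs, obs))  # a perfect simulation scores 1.0 on both
```

Both scores are bounded above by 1; values below 0 indicate a simulation worse than the observed mean (NSE) or a correspondingly large distance from the ideal point (KGE).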

ported that a GMM produced probabilities that change in response to different flow regimes. We conducted the same exploratory experiment with the best-performing benchmarked approach.

2.4.3 Estimation quality: second-order uncertainty

MDNs allow a quality check of the given distributional predictions. The basic idea here is that predicted distributions are estimations themselves. MDNs provide an estimation of the aleatoric uncertainty in the data, and the MCD is a basic estimation of the epistemic uncertainty. Thus, the estimations of the uncertainties are not the uncertainties themselves, but – as the name suggests – estimations thereof, and they are thus subject to uncertainties themselves. This does, of course, hold for all forms of uncertainty estimates, not just for MDNs. However, MDNs provide us with single-point predictions of the distribution parameters and mixture weights. We can therefore assess the uncertainty of the estimated mixture components by checking how perturbations (e.g., in the form of input noise) influence the distributional predictions. This can be important in practice. For example, if we mistrust a given input – let us say because the event was rarely observed so far or because we suspect some form of errors – we can use a second-order check to obtain a qualitative understanding of the goodness of the estimate.

Concretely, we examined how a second-order effect on the estimated uncertainty can be checked with the MCD approach (which provides estimations for some form of epistemic uncertainties), as it can be layered on top of the MDN approaches (which provide estimations of the aleatoric uncertainties). This means that the Gaussian process interpretation by Gal and Ghahramani (2016) cannot be strictly applied. We can nonetheless use the MCD as a perturbation method, since it still forces the model to learn an internal ensemble.

3 Results

3.1 Benchmarking results

The probability plots for each model are shown in Fig. 8. The approaches that used mixture densities performed better than MCD, and among all of them, the ones that used asymmetric components (CMAL and UMAL) performed better than GMM. CMAL has the best performance overall. All methods, except UMAL, tend to give estimates above the 1 : 1 line for thresholds lower than 0.5 (the median). This means that the models were generally under-confident in low-flow situations. GMM was the only approach that showed this type of under-confidence throughout all flow regimes – in other words, GMM was above the 1 : 1 line everywhere. The largest under-confidence occurred for MCD in the mid-flow range (between the 0.3 and 0.6 quantile thresholds). For higher flow volumes, both UMAL and MCD underestimated the uncertainty. Overall, CMAL was close to the 1 : 1 line.

Figure 9 shows how the deviations from the 1 : 1 line varied for each basin within each threshold of the probability plot. That is, each subplot shows a specific threshold, and each density resulted from the distributions of deviations from the 1 : 1 line that the different basins exhibit. The distributions for 0.4 to 0.6 flow quantiles were roughly the same across methods; however, the distributions from CMAL and UMAL were better centered than GMM and MCD. At the outer bounds, a bias was induced due to evaluating in probability space: it is more difficult to be over-confident as the thresholds get lower, and vice versa it is more difficult to be under-confident as the thresholds become higher. At higher thresholds, UMAL had a larger tendency to fall below the center line, which is also visible in the probability plot. Again, this is a consequence of the over-confident predictions from the UMAL approach for larger flow volumes (MCD also exhibited the same pattern of over-confidence for the highest threshold).

Lastly, Table 3 shows the results of the resolution benchmark. In general, UMAL and MCD provide the sharpest distributions. This goes along with over-confident narrow distri-


Table 3. Benchmark statistics for model precision. These metrics were applied to the distributional predictions at individual time steps. The
lowest metric per row is marked in bold. Lower values are better for all statistics (conditional on the model having high reliability). This table
also provides statistics of the empirical distribution from the observations (“Obs”) aggregated over the basins as a reference, which are not
directly comparable with the model statistics since “Obs” represents an unconditional density, while the models provide a conditional one.
The “Obs” statistics should be used as a reference to contextualize the statistics from the modeled distributions.

Benchmarking metric GMM CMAL UMAL MCD Obs


Mean absolute deviation 0.52 0.48 0.42 0.39 0.77
Standard deviation 0.69 0.63 0.00∗ 0.38 2.85
Variance 2.73 2.64 0.00∗ 0.48 12.78
Average of the 0.2 to 0.9 quantiles 0.18 0.17 0.14 0.13 0.41
Distance between the 0.25 and 0.75 quantiles 0.71 0.68 0.67 0.51 1.38
Distance between the 0.1 and 0.9 quantiles 2.00 1.90 1.72 1.26 5.32
∗ The displayed 0 is a rounding artifact. The actual variance here is higher than 0. The “collapse” is, by and large, a result of
a very narrow distribution combined with a heavy truncation for values below 0.
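Dispersion statistics of this kind follow directly from the predictive samples at a given time step; a minimal sketch with synthetic samples (the aggregation over basins and time steps used for the table is omitted):

```python
import numpy as np

def dispersion_stats(samples):
    """Per-time-step dispersion (sharpness) statistics of the kind listed
    in Table 3, computed from predictive samples."""
    samples = np.asarray(samples, float)
    q = lambda p: float(np.quantile(samples, p))
    return {
        "mean_abs_dev": float(np.mean(np.abs(samples - samples.mean()))),
        "std": float(samples.std()),
        "variance": float(samples.var()),
        "dist_q25_q75": q(0.75) - q(0.25),
        "dist_q10_q90": q(0.90) - q(0.10),
    }

rng = np.random.default_rng(0)
stats = dispersion_stats(rng.normal(1.0, 0.5, size=7500))  # synthetic samples
print(sorted(stats))
```

Lower values mean sharper (higher-resolution) distributions; as the table caption notes, they are only meaningful conditional on the model also being reliable.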

butions that both approaches exhibit for high flow volumes.


These, having the largest uncertainties, also influence the av-
erage resolution the most. The other two approaches, GMM
and CMAL, provide lower resolution (less sharp distribu-
tions). In the case of GMMs, the low resolution is reflected in
under-confidently wide distributions in the probability plot.
Notably, the predictions of CMAL are in between those of
the over-confident UMAL and the under-confident GMM.
This makes sense from a methodological viewpoint since we
designed CMAL as an “intermediate step” between the two
approaches. Moreover, these results reflect a trade-off in the
log-likelihood (which the models are trained for), where a
balance between reliability and resolution has to be obtained
in order to minimize the loss.
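The reliability side of this trade-off is what the probability plot measures. A minimal sketch of the underlying probability-integral-transform computation, with synthetic predictive samples standing in for the model output:

```python
import numpy as np

def reliability_curve(sample_matrix, obs, thresholds):
    """Discrete probability plot via the probability integral transform:
    for each threshold tau, the fraction of observations falling below
    the tau-quantile of the predictive samples. A perfectly reliable
    forecast lies on the 1:1 line (result == thresholds).
    sample_matrix has shape [n_time_steps, n_samples]."""
    qs = np.quantile(sample_matrix, thresholds, axis=1)  # [n_thresholds, n_steps]
    return (np.asarray(obs)[None, :] <= qs).mean(axis=1)

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=(2000, 500))  # synthetic predictive samples
obs = rng.normal(0.0, 1.0, size=2000)             # observations from the same law
taus = np.array([0.1, 0.5, 0.9])
print(reliability_curve(samples, obs, taus))  # close to [0.1, 0.5, 0.9]
```

Values above the 1 : 1 line indicate under-confidence (too-wide distributions), values below it over-confidence, matching the reading of Figs. 8 and 9.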

3.2 Post hoc model examination

3.2.1 Accuracy

Table 4 shows accuracy metrics of single-point predictions,


i.e., the means of the distributional predictions aggregated
over time steps and basins. It depicts the means and medians
of each metric across all 531 basins. The approach labeled
MCDp reports the statistics from the MCD model but with-
out sampling (i.e., from the full model). The model perfor-
mances are not entirely comparable with each other, since the
architecture and hyperparameters of the MCD model were
chosen with regard to the probability plot. We therefore also
compare against a model with the same hyper-parameters as
Kratzert et al. (2019b) – the latter model is labeled LSTMp in
Table 4.
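The single-point predictions evaluated here are simply the per-time-step means of the predictive samples; as a sketch with synthetic samples:

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic predictive samples with shape [n_time_steps, n_samples];
# in the study these come from the fitted distributional models.
samples = np.abs(rng.normal(1.0, 0.5, size=(10, 7500)))
point_pred = samples.mean(axis=1)  # one single-point prediction per time step
print(point_pred.shape)
```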
Figure 8. Probability plot benchmark results for the 10-year test period over 531 basins in the continental US. Subplot (a) shows the probability plots for the four methods. The 1 : 1 line is shown in grey and indicates perfect probability estimates. Subplot (b) details deviations from the 1 : 1 line to allow for easier interpretation.

Among the uncertainty estimation approaches, the models with asymmetric mixture components (CMAL and UMAL) perform best. UMAL provided the best point estimates. This is in line with the high resolutions of the uncertainty estimation benchmark: the sharpness makes the mean a better predictor of the likelihood's maximum and indicates again that the approach trades reliability for accuracy. That said, even with our naive approach for obtaining single-point es-


Figure 9. Kernel densities of the basin-wise deviation from the 1 : 1 line in the probability plot for the different inner quantiles. These
distributions result from evaluating the performance at each basin individually (rather than aggregating over basins). Note how the bounded
domain of the probability plot induces a bias for the outer thresholds as the deviations cannot expand beyond the [0, 1] interval.

timations (i.e., simply taking the mean), both CMAL and UMAL manage to outperform the model that is optimized for single-point predictions with regard to some metrics. This suggests that it could make sense to train a model to estimate distributions and then recover the best estimates. One possible reason why this might be the case is that single-point loss functions (e.g., MSE) define an implicit probability distribution (e.g., minimizing an MSE loss is equivalent to maximizing a Gaussian likelihood with fixed variance). Hence, using a more nuanced loss function (i.e., one that is the likelihood of a multimodal, asymmetrical, heterogeneous distribution) can improve performance even for the purpose of making non-distributional estimates. In fact, it is reasonable to expect that the results of the MDN approaches can be improved even further by using a more sophisticated strategy for obtaining single-point predictions (e.g., searching for the maximum of the likelihood). The single-point prediction LSTM (LSTMp) outperforms the ALD-based MDNs for tail metrics of the streamflow – that is, for the low (FLV) and high bias (FHV). These are regimes where we would expect the most asymmetric distributions for hydrological reasons, and hence the means of the asymmetric distributions might be a sub-optimal choice.

3.2.2 Internal consistency

Figure 10 summarizes the behavioral patterns of the CMAL mixture components. It depicts an exemplary hydrograph superimposed on the CMAL uncertainty prediction together with the corresponding mixture component weights. The mixture weights always sum to 1. This figure shows that the model seemingly learns to use different mixture components for different parts of the hydrograph. In particular, distributional predictions in low-flow periods (perhaps dominated by base flow) are largely controlled by the first mixture component (as can be seen by the behavior of mixture weight α1 in Fig. 10). Falling limbs of the hydrograph (corresponding roughly to throughflow) are associated with the second mixture component (α2), which is low for both rising limbs and low-flow periods. The third component (α3) mainly controls the rising limbs and the peak runoff but also has some influence throughout the rest of the hydrograph. In effect, CMAL


Table 4. Evaluation of different single-point prediction metrics. Best performance is marked in bold. Information about the inter-basin variability (dispersion) is provided in the form of the standard deviation whenever the mean is used for aggregation and in the form of the distance to the 25 % and 75 % quantiles when the median is used for aggregation; median rows read "value (+distance to the 75 % quantile/−distance to the 25 % quantile)".

Metric | Aggregation | GMM | CMAL | UMAL | MCD | MCDp | LSTMp
NSE^a | Median | 0.744 (+0.062/−0.105) | 0.784 (+0.052/−0.085) | 0.791 (+0.056/−0.086) | 0.762 (+0.062/−0.096) | 0.763 (+0.061/−0.096) | 0.762 (+0.059/−0.088)
NSE^a | Mean | 0.690 ± 0.198 | 0.735 ± 0.170 | 0.749 ± 0.162 | 0.646 ± 0.557 | 0.675 ± 0.384 | 0.683 ± 0.216
KGE^b | Median | 0.728 (+0.077/−0.129) | 0.748 (+0.083/−0.109) | 0.785 (+0.074/−0.105) | 0.730 (+0.092/−0.138) | 0.737 (+0.087/−0.128) | 0.791 (+0.063/−0.109)
KGE^b | Mean | 0.685 ± 0.172 | 0.714 ± 0.169 | 0.745 ± 0.159 | 0.525 ± 0.888 | 0.622 ± 0.426 | 0.710 ± 0.210
COR^c | Median | 0.880 (+0.033/−0.048) | 0.901 (+0.027/−0.042) | 0.903 (+0.026/−0.044) | 0.890 (+0.029/−0.041) | 0.890 (+0.029/−0.042) | 0.891 (+0.029/−0.041)
COR^c | Mean | 0.857 ± 0.086 | 0.876 ± 0.082 | 0.880 ± 0.077 | 0.866 ± 0.098 | 0.865 ± 0.100 | 0.871 ± 0.088
α-NSE^d | Median | 0.816 (+0.121/−0.120) | 0.820 (+0.116/−0.094) | 0.863 (+0.097/−0.098) | 0.877 (+0.109/−0.098) | 0.880 (+0.108/−0.097) | 0.952 (+0.098/−0.117)
α-NSE^d | Mean | 0.822 ± 0.206 | 0.828 ± 0.189 | 0.858 ± 0.169 | 0.893 ± 0.243 | 0.900 ± 0.245 | 0.976 ± 0.200
β-NSE^e | Median | 0.006 (+0.038/−0.043) | −0.013 (+0.036/−0.034) | −0.027 (+0.031/−0.034) | 0.061 (+0.053/−0.048) | 0.054 (+0.039/−0.049) | 0.027 (+0.041/−0.042)
β-NSE^e | Mean | 0.004 ± 0.094 | −0.011 ± 0.077 | −0.030 ± 0.073 | 0.095 ± 0.211 | 0.065 ± 0.122 | 0.011 ± 0.099
FHV^f | Median | −17.322 (+12.502/−11.951) | −17.164 (+11.653/−8.360) | −12.243 (+9.205/−9.554) | −11.346 (+10.940/−9.652) | −11.343 (+10.955/−10.079) | −4.277 (+9.484/−10.528)
FHV^f | Mean | −16.324 ± 21.376 | −15.140 ± 18.737 | −12.705 ± 15.491 | −8.641 ± 24.455 | −8.744 ± 24.456 | −1.084 ± 19.692
FLV^g | Median | 28.561 (+27.672/−28.547) | 28.442 (+28.754/−26.082) | 27.954 (+27.816/−24.743) | 43.830 (+31.087/−32.483) | −65.762 (+76.145/−317.926) | −4.864 (+56.768/−264.315)
FMS^h | Median | −7.346 (+6.813/−6.582) | −5.443 (+5.392/−7.001) | −2.508 (+6.146/−5.741) | −20.768 (+13.080/−20.734) | −17.039 (+11.937/−14.265) | −8.650 (+15.003/−12.988)
P-T^i | Median | 0.308 (+0.310/−0.133) | 0.333 (+0.287/−0.152) | 0.286 (+0.256/−0.119) | 0.286 (+0.289/−0.132) | 0.286 (+0.300/−0.132) | 0.286 (+0.254/−0.106)
P-T^i | Mean | 0.464 ± 0.405 | 0.455 ± 0.392 | 0.412 ± 0.356 | 0.427 ± 0.395 | 0.425 ± 0.388 | 0.405 ± 0.356

^a Nash–Sutcliffe efficiency (−∞, 1]; values closer to 1 are desirable. ^b Kling–Gupta efficiency (−∞, 1]; values closer to 1 are desirable. ^c Pearson correlation [−1, 1]; values closer to 1 are desirable. ^d α-NSE decomposition (0, ∞); values close to 1 are desirable. ^e β-NSE decomposition (−∞, ∞); values close to 0 are desirable. ^f Top 2 % peak flow bias (−∞, ∞); values close to 0 are desirable. ^g Bottom 30 % low-flow bias (−∞, ∞); values close to 0 are desirable. Since a strong bias is induced by a small subset of basins, we provide the median aggregation. ^h Bias of the FDC mid-segment slope (−∞, ∞); values close to 0 are desirable. Since a strong bias is induced by a small subset of basins, we provide the median aggregation. ^i Peak timing, i.e., lag of peak timing (−∞, ∞); values close to 0 are desirable.

Figure 10. (a) Hydrograph of an exemplary event in the test period with both the 5 % to 95 % and 25 % to 75 % quantile ranges. (b) The
weights (αi ) of the CMAL mixture components for these predictions.
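The mixture components weighted by the αi above are asymmetric Laplacians. Their density, with the skewness parameter written here as tau, can be sketched as follows (the standard ALD form; the exact CMAL parameterization may differ in detail):

```python
import numpy as np

def ald_pdf(y, mu, b, tau):
    """Density of the asymmetric Laplacian distribution (ALD) with
    location mu, scale b, and skewness tau in (0, 1); tau = 0.5 gives
    the symmetric Laplacian shown in Fig. 5a."""
    u = (np.asarray(y, float) - mu) / b
    rho = u * (tau - (u < 0.0).astype(float))  # tilted absolute-value function
    return tau * (1.0 - tau) / b * np.exp(-rho)

y = np.linspace(-40.0, 40.0, 8001)
pdf = ald_pdf(y, mu=0.0, b=1.0, tau=0.7)  # a right-skewed example component
print(float(pdf.sum() * (y[1] - y[0])))   # numerically integrates to ~1
```

The additional parameter tau lets one component place more probability mass on one side of its center, which is how CMAL accommodates the skewed conditional densities typical of streamflow.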


Figure 11. Illustration of second-order uncertainties estimated by using MCD to sample the parameters of the CMAL approach. The upper
subplot shows an observed hydrograph and predictive distributions as estimated by CMAL. The lower subplots show the CMAL distributions
and distributions from 25 MCD samples of the CMAL model at three selected time steps (indicated by black ovals shown on the hydrograph).
The abbreviation “main pred” marks the unperturbed distributional predictions from the CMAL model.
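The MCD perturbation that generates the 25 sampled estimations boils down to keeping dropout active at inference time and re-running the forward pass. A toy numpy emulation (a two-layer stand-in network with arbitrary weights, not the actual LSTM/CMAL model):

```python
import numpy as np

rng = np.random.default_rng(7)

def mc_dropout_pass(x, w1, w2, p, rng):
    """One stochastic forward pass of a toy two-layer network with dropout
    kept ACTIVE at inference time, which is the essence of MC dropout."""
    h = np.maximum(x @ w1, 0.0)      # hidden layer with ReLU
    mask = rng.random(h.shape) >= p  # Bernoulli dropout mask
    h = h * mask / (1.0 - p)         # inverted-dropout rescaling
    return h @ w2

x = rng.normal(size=(1, 8))
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(16, 1))
# Repeating the stochastic pass yields a spread of outputs; layered on top
# of CMAL, each pass would instead yield one perturbed set of mixture
# parameters, as in the 25 sampled estimations of Fig. 11.
preds = np.array([mc_dropout_pass(x, w1, w2, 0.25, rng)[0, 0] for _ in range(25)])
print(preds.shape, bool(preds.std() > 0.0))
```

The spread across passes is the second-order gauge: wide disagreement between perturbed CMAL densities signals an uncertain uncertainty estimate.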

learns to separate the hydrograph into three parts – rising limb, falling limb, and low flows – which correspond to the standard hydrological conceptualization. No explicit knowledge of this particular hydrological conceptualization is provided to the model – it is solely trained to maximize overall likelihood.

3.2.3 Estimation quality

In this experiment we want to demonstrate an avenue for studying higher-order uncertainties with CMAL. Intuitively, the distributional predictions are estimations themselves and are thus subject to uncertainty, and, since the distributional predictions do already provide estimates for the prediction uncertainty, we can think about the uncertainty regarding parameters and weights of the components as a second-order uncertainty. In theory even higher-order uncertainties can be thought of. Here, as already described in the Methods section, we use MCD on top of the CMAL approach to "stochasticize" the weights and parameters and expose the uncertainty of the estimations. Figure 11 illustrates the procedure: the upper part shows a hydrograph with the 25 %–75 % quantiles and 5 %–95 % quantiles from CMAL. This is the main prediction. The lower plots show kernel density estimates for particular points of the hydrograph (marked in the upper part with black ovals labeled "a", "b", and "c" and shown in red in the lower subplots). These three specific points represent different portions of the hydrograph with different predicted distributional shapes and are thus well suited to showcasing the technique. These kernel densities (in red) are superim-


posed with 25 sampled estimations derived after applying MCD on top of the CMAL model (shown in lighter tones behind the first-order estimate). These densities are the MCD-perturbed estimations and thus a gauge for how second-order uncertainties influence the distributional predictions.

3.3 Computational demand

This section gives an overview of the computational demand required to compute the different uncertainty estimations. All of the reported execution times were obtained on an NVIDIA P100 (16 GB RAM), using the PyTorch library (Paszke et al., 2019). A single execution of the CMAL model with a batch size of 256 takes 0.026 (+0.001/−0.002) s (here, and in the following, the first value gives the median over 100 model runs; the plus and minus values show the deviations of the 90 % and 10 % quantiles, respectively). An execution of the MCD model takes 0.055 (+0.002/−0.001) s. The slower execution time of the MCD approach here is explained by its larger hidden size. It used 500 hidden cells, in comparison with the 250 hidden cells of CMAL (see Appendix A).

Generating all the needed samples for the evaluation with MCD and a batch size of 256 would take approximately 36.1 d (since 7500 samples have to be generated for 531 basins and 10 years at a daily resolution). In practice, we could shorten this time to under a week by using considerably larger batch sizes and distributing the computations for different basins over multiple GPUs. In comparison, computing the same number of samples by re-executing the CMAL model would take around 17.4 d. In practice, however, only a single run of the CMAL model is needed, since MDNs provide us with a density estimate from which we can directly sample in a parallel fashion (and without needing to re-execute the model run). Thus, the CMAL model, with a batch size of 256, takes only ∼ 14 h to generate all needed samples.

4 Conclusions and outlook

Our basic benchmarking scheme allowed us to systematically pursue our primary objective – to examine deep learning baselines for uncertainty predictions. In this regard, we gathered further evidence that deep-learning-based uncertainty estimation for rainfall–runoff modeling is a promising research avenue. The explored approaches are able to provide fully distributional predictions for each basin and time step. All predictions are dynamic: the model adapts them according to the properties of each basin and the current dynamic inputs, e.g., temperature or rainfall. Since the predictions are inherently distributional, they can be further examined and/or reduced to a more basic form, e.g., sample, interval, or point predictions.

The comparative assessment indicated that the MCD approach provided the worst uncertainty estimates. One reason for this is likely the Gaussian assumption of the uncertainty estimates, which seems inadequate for many low- and high-flow situations. There is, however, also a more nuanced aspect to consider: the MDN approaches estimate the aleatoric uncertainty. MCD, on the other hand, estimates epistemic uncertainty, or rather a particular form thereof. The methodological comparison is therefore only partially fair. In general, these two uncertainty types can be seen as perpendicular to each other. They do partially co-appear in our setup, since both the epistemic and aleatoric uncertainties are largest for high flow volumes.

Yet within the chosen setup it was observable that the methods that use inherently asymmetric distributions as components outperformed the other ones. That is, CMAL and UMAL performed better than MCD and GMM in terms of reliability, resolution, and the accuracy of the derived single-point predictions. The CMAL approach in particular gave distributional predictions that were very good in terms of reliability and sharpness (and single-point estimates). There was a direct link between the predicted probabilities and hydrologic behavior in that different distributions were activated (i.e., got larger mixture weights) for rising vs. falling limbs. Nevertheless, likelihood-based approaches (for estimating the aleatoric uncertainty) are prone to giving over-confident predictions. We were not able to diagnose this empirically. This might rather be a result of the limits of the inquiry than of the non-existence of the phenomenon.

These limits illustrate how challenging benchmarking is. Rainfall–runoff modeling is a complex endeavor. Unifying the diverse approaches into a streamlined framework is difficult. Realistically, a single research group cannot compare the best possible implementations of the many existing uncertainty estimation schemes – which include approaches such as sampling distributions, ensembles, post-processors, and so forth. We therefore did not only want to examine some baseline models, but also to provide the skeleton for a community-minded benchmarking scheme (see Nearing et al., 2018). We hope this will encourage good practice and provide a foundation for others to build on. As detailed in the Methods section, the scheme consists of four parts. Three of them are core benchmarking components and one is an added model checking step. In the following we provide our main observations regarding each point.

i. Data: we used the CAMELS dataset curated by the US National Center for Atmospheric Research and split the data into three consecutive periods (training, validation, and test). All reported metrics are for the test split, which has only been used for the final model evaluation. The dataset has already seen use for other comparative studies and purposes (e.g., Kratzert et al., 2019b; Newman et al., 2017). It is also part of a recent generation of open datasets, which we believe are the result of a growing enthusiasm for community efforts. As such, we predict that new benchmarking possibilities


will become available in the near future. A downside of using existing open datasets is that the test data are accessible to modelers. This means that a potential defense mechanism against over-fitting on the test data is missing (since the test data might be used during the model selection/fitting process; for broader discussions we refer to Donoho, 2017; Makridakis et al., 2020; Dehghani et al., 2021). To enable rigorous benchmarking, it might thus become relevant to withhold parts of the data and only make them publicly available after some given time (as, for example, done in Makridakis et al., 2020).

ii. Metrics: we put forward a minimal set of diagnostic criteria, that is, a discrete probability plot, obtained from a probability integral transform, for checking prediction reliability, and a set of dispersion metrics to check the prediction resolution (see Renard et al., 2010). Using these, we could see the proposed baselines exhaust the evaluation capacity of these diagnostic tools. On the one hand, this is an encouraging sign for our ability to make reliable hydrologic predictions (the downside being that it might be hard for models to improve on this metric going forward). On the other hand, it is important to be aware that the probability plot and the dispersion statistics miss several important aspects of probabilistic prediction (for example, precision, consistency, or event-specific properties). All reported metrics are highly aggregated summaries (over multiple basins and time steps) of highly nonlinear relationships (see also Muller, 2018; Thomas and Uminsky, 2020). This is compounded by the inherent noise of the data. We therefore expect that many nuanced aspects are missed by the comparative assessment. In consequence, we hope that future efforts will derive more powerful metrics, tests, and probing procedures – akin to the continuous development of diagnostics for single-point predictions (see Nearing et al., 2018).

iii. Baselines: we examined four deep-learning-based approaches. One is based on Monte Carlo dropout and three on mixture density networks. We used them to demonstrate the benchmarking scheme and showed its comparative power. This should, however, only be seen as a starting point. We predict that stronger baselines will emerge in tandem with stronger metrics. From our perspective, there is plenty of room to build better ones. A perhaps self-evident example of the potential for improvements is ensembles: Kratzert et al. (2019b) showed the benefit of LSTM ensembles for single-point predictions, and we assume that similar approaches could be developed for uncertainty estimation. We are therefore sure that future research will yield improved approaches and move us closer to achieving holistic diagnostic tools for the evaluation of uncertainty estimations (sensu Nearing and Gupta, 2015).

iv. Model checking: in the post hoc examination step we tested the model performance with regard to point predictions. Remarkably, the results indicate that the distributional predictions are not only reliable and precise, but also yield strong single-point estimates. Additionally, we checked for internal organization principles of the CMAL model. In doing so we showed (a) how the component weighting of a given basin changes in dependence on the flow regime and (b) how higher-order uncertainties in the form of perturbations of the component weights and parameters change the distributional prediction. The showcased behavior is in accordance with basic hydrological intuition. Specific components are used for low and high streamflow. The uncertainty is lowest near 0 and increases with a rise in streamflow. This relationship is nonlinear and not a simple 1 : 1 depiction. Similarly, additional uncertainties are expected to change the characteristics of the distributional prediction more for high flows than for low flows. By and large, however, this is just a start. We argue that post hoc examination will play a central role in future benchmarking efforts.

To summarize, the presented results are promising. Viewed through the lens of community-based benchmarking, we expect progress on multiple fronts: better data, better models, better baselines, better metrics, and better analyses. The road to get there still holds many challenges. Let us overcome them together.

Appendix A: Hyperparameter search and training

A1 General setup

Table A1 provides the general setup for the hyperparameter search and model training.

A2 Noise regularization

Adding noise to the data during training can be viewed as a form of data augmentation and regularization that biases towards smooth functions. These are large topics in themselves, and at this stage we refer to Rothfuss et al. (2019) for an investigation on the theoretical properties of noise regularization and some empirical demonstrations. In short, plain maximum likelihood estimation can lead to strong over-fitting (resulting in a spiky distribution that generalizes poorly beyond the training data). Training with noise regularization results in smoother density estimates that are closer to the true conditional density.

Following these findings, we also add noise as a smoothness regularization for our experiments. Concretely, we decided to use a relative additive noise as a first-order approximation to the sort of noise contamination we expect in hydrological time series. The operation for regularization is

[Link] Hydrol. Earth Syst. Sci., 26, 1673–1693, 2022


1688 D. Klotz et al.: Uncertainty estimation with deep learning for rainfall–runoff simulation

Table A1. Overview of the general benchmarking setup.

GMM CMAL UMAL MCD


Training period 1 Oct 1980–30 Sep 1990 1 Oct 1980–30 Sep 1990 1 Oct 1980–30 Sep 1990 1 Oct 1980–30 Sep 1990
Validation period 1 Oct 1990–30 Sep 1995 1 Oct 1990–30 Sep 1995 1 Oct 1990–30 Sep 1995 1 Oct 1990–30 Sep 1995
Test period 1 Oct 1995–30 Sep 2005 1 Oct 1995–30 Sep 2005 1 Oct 1995–30 Sep 2005 1 Oct 1995–30 Sep 2005
Training loss Negative log-likelihood Negative log-likelihood Negative log-likelihood MSE
CAMELS attributes Yes Yes Yes Yes
Input products DayMet, Maurer, NLDAS DayMet, Maurer, NLDAS DayMet, Maurer, NLDAS DayMet, Maurer, NLDAS
Regularization: noise Yes Yes Yes Yes
Regularization: dropout Yes Yes Yes Yes
Sampling space for τ NA NA NA (0.01, 0.99)
Gradient clipping Yes Yes No Yes
NA stands for not available.
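As a rough sketch, the relative additive noise scheme of Appendix A2 might be implemented as follows (a NumPy illustration; the function name and example values are ours and not part of the released code):

```python
import numpy as np

def add_relative_noise(z, sigma, rng=None):
    """Return z contaminated with relative additive noise: z* = z + z * N(0, sigma).

    Larger values of z receive proportionally larger perturbations, which
    mimics the noise contamination expected in hydrological time series.
    """
    rng = np.random.default_rng() if rng is None else rng
    return z + z * rng.normal(0.0, sigma, size=np.shape(z))

# Example: perturb a (time steps, features) block of forcings with sigma = 0.2
forcings = np.ones((365, 3))
noisy = add_relative_noise(forcings, sigma=0.2, rng=np.random.default_rng(0))
```

During training, the same operation would be applied to the dynamic inputs, the static attributes, and the observed runoff alike.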

z∗ = z + z · N(0, σ),

where z is a placeholder variable for either the dynamic or static input variables or the observed runoff (and, as before, the time index is omitted for the sake of simplicity), N(0, σ) denotes a Gaussian noise term with mean zero and standard deviation σ, and z∗ is the obtained noise-contaminated variable.

A3 Search

To provide a meaningful comparison, we conducted a hyperparameter search for each of the four conditional density estimators. A hyperparameter search is an extended (usually computationally intensive) search for the best pre-configuration of a machine learning model.

In our case we searched over the combination of six different hyperparameters (see Table A2). To balance between computational resources and search depth, we took the following course of action:

– First, we informally searched for sensible general presets.

– Second, we trained the models for each combination of the four hyperparameters "hidden size" (the number of cells in the LSTM; see Kratzert et al., 2019b), "noise" (added relative noise of the output; see Appendix A2), "number of densities" (the number of density heads in the mixture; only needed for GMM and CMAL), and "dropout rate" (the rate of dropout employed during training – and during inference in the case of MCD). These are listed in the top part of Table A2.

– Third, we chose the best resulting model and refined it by searching for the best settings for the hyperparameters "batch size" (the number of samples shown per back-propagation step) and "learning rate" (the parameter that scales the update per batch). These are listed in the bottom part of Table A2.

Table A2. Search space of the hyperparameter search. The search was conducted in two steps: the variables of the first step are listed in the top part of the table, those of the second step in the bottom part. NA stands for not available.

                          GMM                    CMAL                   UMAL                   MCD
Hidden size LSTM          250, 500, 750, 1000    250, 500, 750, 1000    250, 500, 750, 1000    250, 500, 750, 1000
Number of components      1, 3, 5, 10            1, 3, 5, 10            NA                     NA
Regularization: noise     0.05, 0.1, 0.2         0.05, 0.1, 0.2         0.05, 0.1, 0.2         0.05, 0.1, 0.2
Regularization: dropout   0.4, 0.5               0.4, 0.5               0.4, 0.5               0.1, 0.25, 0.4, 0.5, 0.75

Batch size                128, 256               128, 256               128, 256               128, 256
Learning rate             0.0001, 0.0005, 0.001  0.0001, 0.0005, 0.001  0.0001, 0.0005, 0.001  0.0001, 0.0005, 0.001

A4 Results

The results of the hyperparameter search are summarized in Table A3.

Appendix B: Baselines

B1 Gaussian mixture model

Gaussian mixture models (GMMs; Bishop, 1994) are well established for producing distributional predictions from a single input. The principle of GMMs is to have a neural network that predicts the parameters of a mixture of Gaussians (i.e., the means, standard deviations, and weights) and to use these mixtures as distributional output. GMMs are a powerful concept. They have seen usage for diverse applications such as acoustics (Richmond et al., 2003), handwriting generation (Graves, 2013), sketch generation (Ha and Eck, 2017), and predictive control (Ha and Schmidhuber, 2018).

Given the rainfall–runoff modeling context, a GMM models the runoff q ∈ R at a given time step (subscript omitted for the sake of simplicity) as a probability distribution p(·) conditioned on the input x ∈ R^(M×T) (where M indicates the number of defined inputs, such as precipitation and temperature, and T the number of time steps provided to the neural network) as a mixture of K Gaussians:

p(q|x) = Σ_{k=1}^{K} α_k(x) · N(q | µ_k(x), σ_k(x)),   (B1)

where the α_k are mixture weights with the properties α_k(x) ≥ 0 and Σ_{k=1}^{K} α_k(x) = 1 (convex sum), and N(µ_k(x), σ_k(x)) denotes a Gaussian with mean µ_k and standard deviation σ_k. All three defining variables – i.e., the mixture weights, the mixture means, and the mixture standard deviations – are set by a neural network and are thus functions of the input x. The negative logarithm of the likelihood between the training data and the estimated conditional distribution is used as loss (Eq. B2).
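For intuition, the GMM negative log-likelihood described above can be sketched as follows (a NumPy illustration using a log-sum-exp for numerical stability; this is not the neuralhydrology implementation itself):

```python
import numpy as np

def gmm_nll(q, alpha, mu, sigma):
    """Negative log-likelihood of observation(s) q under a Gaussian mixture.

    alpha, mu, and sigma carry one entry per mixture component (last axis);
    in an MDN they would come from the network heads (softmax for alpha,
    an exponential activation for sigma).
    """
    q = np.asarray(q, dtype=float)[..., None]        # broadcast against components
    log_comp = (-0.5 * np.log(2.0 * np.pi) - np.log(sigma)
                - 0.5 * ((q - mu) / sigma) ** 2)     # log N(q | mu_k, sigma_k)
    log_mix = np.log(alpha) + log_comp
    # log-sum-exp over components avoids underflow of the individual densities
    m = log_mix.max(axis=-1, keepdims=True)
    return -(m[..., 0] + np.log(np.exp(log_mix - m).sum(axis=-1)))

# Example: one observation under a two-component mixture
loss = gmm_nll(0.5, alpha=np.array([0.6, 0.4]),
               mu=np.array([0.0, 1.0]), sigma=np.array([0.5, 0.5]))
```

In an actual MDN, alpha, mu, and sigma would be the per-sample outputs of the network heads rather than fixed arrays.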

L(q|x) = −log[ Σ_{k=1}^{K} α_k(x) · N(q | µ_k(x), σ_k(x)) ].   (B2)

For the actual implementation, we used a softmax activation function to obtain the mixture weights (α) and an exponential function as activation for the variance (σ) to guarantee that the estimate is always above 0 (see Bishop, 1994).

Table A3. Resulting parameterization from the hyperparameter search. NA stands for not available.

                          GMM     CMAL    UMAL    MCD
Hidden size LSTM          250     250     250     500
Number of components      10      3       NA      NA
Regularization: noise     0.2     0.2     0.2     0.1
Regularization: dropout   0.4     0.5     0.5     0.75
Batch size                256     256     256     256
Learning rate             0.001   0.0005  0.0005  0.001

B2 Countable mixture of asymmetric Laplacians

Countable mixtures of asymmetric Laplacian distributions, for short CMAL, are another form of MDN where ALDs are used as the kernel function. The abbreviation is a reference to UMAL, since CMAL serves as a natural intermediate stage between GMM and UMAL – as will become clear in the respective section. As far as we are aware, the use of ALDs for quantile regression was proposed by Yu and Moyeed (2001), and their application for MDNs was first proposed by Brando et al. (2019). The components of CMAL already intrinsically provide a measure for asymmetric distributions and are therefore inherently more expressive than GMMs. However, since they also necessitate the estimation of more parameters, one can expect that they are also more difficult to handle than GMMs. The density for the ALD is

ALD(q | µ, s, τ) = (τ · (1 − τ)/s) · exp[−(q − µ) · (τ − 1)/s]   if q < µ,
ALD(q | µ, s, τ) = (τ · (1 − τ)/s) · exp[−(q − µ) · τ/s]         if q ≥ µ,   (B3)

where τ is the asymmetry parameter, µ the location parameter, and s the scale parameter, respectively. Using the ALD as a component, CMAL can be defined in analogy to the GMM:

p(q|x) = Σ_{k=1}^{K} α_k(x) · ALD(q | µ_k(x), s_k(x), τ_k(x)),   (B4)

where the parameters and weights are estimated by a neural network. Training is done by minimizing the negative log-likelihood of the training data under the estimated distribution:

L(q|x) = −log( Σ_{k=1}^{K} α_k(x) · ALD(q | µ_k(x), s_k(x), τ_k(x)) ).   (B5)

For the implementation of the network, we used a softmax activation function to obtain the mixture weights (α), a sigmoid function to bind the asymmetry parameters (τ), and a softplus activation function to guarantee that the scale (s) is always above 0.

B3 Uncountable mixture of asymmetric Laplacians

The uncountable mixture of asymmetric Laplacians (UMAL; Brando et al., 2019) expands upon the CMAL concept by letting the model implicitly approximate the mixture of ALDs. This is achieved (a) by sampling the asymmetry parameter τ and providing it as input to the model and the loss, (b) by fixing the weights with α_k = 1/K, and (c) by stochastically approximating the underlying distributions by summing up different realizations. Since the network only has to account for the scale and the location parameter, considerably fewer parameters have to be estimated than for the GMM or CMAL. In analogy to the CMAL model equations, these extensions lead to the conditional density in Eq. (B6).
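To make the ALD component concrete, its density (Eq. B3) might be sketched as follows (a NumPy illustration; parameter names follow the text):

```python
import numpy as np

def ald_pdf(q, mu, s, tau):
    """Density of the asymmetric Laplacian ALD(q | mu, s, tau).

    tau in (0, 1) controls the asymmetry, mu is the location, and s > 0
    the scale; tau = 0.5 recovers a symmetric Laplace distribution.
    """
    q = np.asarray(q, dtype=float)
    factor = tau * (1.0 - tau) / s
    return np.where(q < mu,
                    factor * np.exp(-(q - mu) * (tau - 1.0) / s),   # left branch
                    factor * np.exp(-(q - mu) * tau / s))           # right branch

# Example: a right-skewed density (tau = 0.7) around mu = 1.0
density = ald_pdf(np.linspace(-2.0, 4.0, 7), mu=1.0, s=0.3, tau=0.7)
```

Averaging such densities over randomly drawn values of τ, as UMAL does, then approximates the implicit mixture.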

p(q|x) = (1/K) · Σ_{k=1}^{K} ALD(q | µ_k(x, τ_k), s_k(x, τ_k), τ_k),   (B6)

where the asymmetry parameter τ_k is randomly sampled K times to provide a Monte Carlo approximation to the implicitly approximated distribution. After training, modelers can choose from how many discrete samples the learned distribution is approximated. As with the other mixture density networks, the training is done by minimizing the negative log-likelihood of the training data under the estimated distribution:

L(q|x) = −log( Σ_{k=1}^{K} ALD(q | µ_k(x, τ_k), s_k(x, τ_k), τ_k) ) + log(K).   (B7)

Implementation-wise, we obtained the best results from UMAL by binding the scale parameter (s_k). We therefore used a weighted sigmoid function as its activation.

B4 Monte Carlo dropout

Monte Carlo dropout (MCD; Gal and Ghahramani, 2016) has found widespread use and has already been applied in a large variety of settings (e.g., Zhu and Laptev, 2017; Kendall and Gal, 2017; Smith and Gal, 2018). The MCD mechanism can be expressed as

p(q|x) = N(q | µ∗(x), σ),   (B8)

where µ∗(x) is the expectation over the sub-networks given the dropout rate r, such that

µ∗(x) = E_{d∼B(r)}[f(x, d, θ)] ≈ (1/M) · Σ_{m=1}^{M} f(x, d_m, θ),   (B9)

where d is a dropout mask sampled from a Bernoulli distribution B with rate r, d_m is a particular realization of a dropout mask, θ are the network weights, and f(·) is the neural network. Note that f(x, d_m, θ) is equivalent to a particular sub-network of f.

MCD is trained by maximizing the expectancy, that is, by minimizing the mean squared error. As such, it is quite different from the MDN approaches. It provides an estimate of the epistemic uncertainty and thus does not supply a heterogeneous, multimodal estimate (it assumes a Gaussian form). For evaluation studies of MCD in hydrological settings, we refer to Fang et al. (2019), who investigated its usage in the context of soil-moisture prediction. We also note that it has been observed that MCD can underestimate the epistemic uncertainty (e.g., Fort et al., 2019).

Code and data availability. We will make the code for the experiments and the data of all produced results available online. We trained all our machine learning models with the neuralhydrology Python library (https://[Link]/neuralhydrology/neuralhydrology; Kratzert, 2022). The CAMELS dataset with static basin attributes is accessible at [Link] (NCAR, 2022).

Author contributions. DK, FK, MG, and GN designed all the experiments. DK conducted all the experiments, and the results were analyzed together with the rest of the authors. FK and MG helped with building the modeling pipeline. FK provided the main setup for the "accuracy" analysis, AKS and GN for the "internal consistency" analysis, and DK for the "estimation quality" analysis. GK and JH checked the technical adequacy of the experiments. GN supervised the manuscript from the hydrologic perspective and SH from the machine learning perspective. All the authors worked on the manuscript.

Competing interests. The contact author has declared that neither they nor their co-authors have any competing interests.

Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Financial support. This research was undertaken thanks in part to funding from the Canada First Research Excellence Fund and the Global Water Futures Program and was enabled by computational resources provided by Compute Ontario and Compute Canada. The ELLIS Unit Linz, the LIT AI Lab, and the Institute for Machine Learning are supported by the Federal State of Upper Austria. We thank the projects AI-MOTION (LIT-2018-6-YOU-212), DeepToxGen (LIT-2017-3-YOU-003), AI-SNN (LIT-2018-6-YOU-214), DeepFlood (LIT-2019-8-YOU-213), the Medical Cognitive Computing Center (MC3), PRIMAL (FFG-873979), S3AI (FFG-872172), DL for granular flow (FFG-871302), ELISE (H2020-ICT-2019-3, ID: 951847), and AIDD (MSCA-ITN-2020, ID: 956832). Further, we thank Janssen Pharmaceutica, UCB Biopharma SRL, Merck Healthcare KGaA, the [Link] Deep Learning Center, the TGW Logistics Group GmbH, Silicon Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google (Faculty Research Award), ZF Friedrichshafen AG, Robert Bosch GmbH, the Software Competence Center Hagenberg GmbH, TÜV Austria, and the NVIDIA corporation.

Review statement. This paper was edited by Jim Freer and reviewed by John Quilty, Anna E. Sikorska-Senoner, and one anonymous referee.

References

Abramowitz, G.: Towards a benchmark for land surface models, Geophys. Res. Lett., 32, L22702, [Link], 2005.
Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, [Link], 2017.
Althoff, D., Rodrigues, L. N., and Bazame, H. C.: Uncertainty quantification for hydrological models based on neural networks: the dropout ensemble, Stoch. Environ. Res. Risk A., 35, 1051–1067, 2021.
Andréassian, V., Perrin, C., Berthet, L., Le Moine, N., Lerat, J., Loumagne, C., Oudin, L., Mathevet, T., Ramos, M.-H., and Valéry, A.: HESS Opinions "Crash tests for a standardized evaluation of hydrological models", Hydrol. Earth Syst. Sci., 13, 1757–1764, [Link], 2009.
Berthet, L., Bourgin, F., Perrin, C., Viatgé, J., Marty, R., and Piotte, O.: A crash-testing framework for predictive uncertainty assessment when forecasting high flows in an extrapolation context, Hydrol. Earth Syst. Sci., 24, 2017–2041, [Link], 2020.
Best, M. J., Abramowitz, G., Johnson, H., Pitman, A., Balsamo, G., Boone, A., Cuntz, M., Decharme, B., Dirmeyer, P., Dong, J., Ek, M., Guo, Z., van den Hurk, B. J. J., Nearing, G. S., Pak, B., Peters-Lidard, C., Santanello Jr., J. A., Stevens, L., and Vuichard, N.: The plumbing of land surface models: benchmarking model performance, J. Hydrometeorol., 16, 1425–1442, 2015.
Beven, K.: Facets of uncertainty: epistemic uncertainty, non-stationarity, likelihood, hypothesis testing, and communication, Hydrolog. Sci. J., 61, 1652–1665, [Link], 2016.
Beven, K. and Binley, A.: GLUE: 20 years on, Hydrol. Process., 28, 5897–5918, 2014.
Beven, K. and Young, P.: A guide to good practice in modeling semantics for authors and referees, Water Resour. Res., 49, 5092–5098, 2013.
Beven, K. J., Smith, P. J., and Freer, J. E.: So just why would a modeller choose to be incoherent?, J. Hydrol., 354, 15–32, 2008.
Bishop, C. M.: Mixture density networks, Tech. rep., Neural Computing Research Group, [Link]eprint/373/ (last access: 28 March 2022), 1994.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D.: Weight uncertainty in neural networks, arXiv preprint arXiv:1505.05424, 2015.
Brando, A., Rodriguez, J. A., Vitria, J., and Rubio Muñoz, A.: Modelling heterogeneous distributions with an Uncountable Mixture of Asymmetric Laplacians, Adv. Neural Inform. Proc. Syst., 32, 8838–8848, 2019.
Clark, M. P., Wilby, R. L., Gutmann, E. D., Vano, J. A., Gangopadhyay, S., Wood, A. W., Fowler, H. J., Prudhomme, C., Arnold, J. R., and Brekke, L. D.: Characterizing uncertainty of the hydrologic impacts of climate change, Curr. Clim. Change Rep., 2, 55–64, 2016.
Cole, T.: Too many digits: the presentation of numerical data, Arch. Disease Childhood, 100, 608–609, 2015.
Dehghani, M., Tay, Y., Gritsenko, A. A., Zhao, Z., Houlsby, N., Diaz, F., Metzler, D., and Vinyals, O.: The Benchmark Lottery, [Link], 2021.
Demargne, J., Wu, L., Regonda, S. K., Brown, J. D., Lee, H., He, M., Seo, D.-J., Hartman, R., Herr, H. D., Fresch, M., Schaake, J., and Zhu, Y.: The science of NOAA's operational hydrologic ensemble forecast service, B. Am. Meteorol. Soc., 95, 79–98, 2014.
Donoho, D.: 50 years of data science, J. Comput. Graph. Stat., 26, 745–766, 2017.
Ellefsen, K. O., Martin, C. P., and Torresen, J.: How do mixture density RNNs predict the future?, arXiv preprint arXiv:1901.07859, 2019.
Fang, K., Shen, C., and Kifer, D.: Evaluating aleatoric and epistemic uncertainties of time series deep learning models for soil moisture predictions, arXiv preprint arXiv:1906.04595, 2019.
Feng, D., Fang, K., and Shen, C.: Enhancing streamflow forecast and extracting insights using long-short term memory networks with data integration at continental scales, Water Resour. Res., 56, e2019WR026793, [Link], 2020.
Fort, S., Hu, H., and Lakshminarayanan, B.: Deep ensembles: A loss landscape perspective, arXiv preprint arXiv:1912.02757, 2019.
Gal, Y. and Ghahramani, Z.: Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, in: International Conference on Machine Learning, 1050–1059, [Link] (last access: 28 March 2022), 2016.
Gers, F. A., Schmidhuber, J., and Cummins, F.: Learning to forget: continual prediction with LSTM, IET Conference Proceedings, 850–855, [Link]conferences/10.1049/cp_19991218 (last access: 31 March 2021), 1999.
Gneiting, T. and Raftery, A. E.: Strictly proper scoring rules, prediction, and estimation, J. Am. Stat. Assoc., 102, 359–378, 2007.
Govindaraju, R. S.: Artificial neural networks in hydrology. II: hydrologic applications, J. Hydrol. Eng., 5, 124–137, 2000.
Graves, A.: Generating sequences with recurrent neural networks, arXiv preprint arXiv:1308.0850, 2013.
Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F.: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling, J. Hydrol., 377, 80–91, 2009.
Ha, D. and Eck, D.: A neural representation of sketch drawings, arXiv preprint arXiv:1704.03477, 2017.
Ha, D. and Schmidhuber, J.: Recurrent world models facilitate policy evolution, in: Advances in Neural Information Processing Systems, 2450–2462, [Link] (last access: 28 March 2022), 2018.
Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen, Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Tech. Univ., München, 1991.
Hochreiter, S. and Schmidhuber, J.: Long Short-Term Memory, Neural Comput., 9, 1735–1780, 1997.
Hsu, K.-L., Gupta, H. V., and Sorooshian, S.: Artificial neural network modeling of the rainfall-runoff process, Water Resour. Res., 31, 2517–2530, 1995.
Kavetski, D., Kuczera, G., and Franks, S. W.: Bayesian analysis of input uncertainty in hydrological modeling: 2. Application, Water Resour. Res., 42, W03408, [Link], 2006.
Kendall, A. and Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision?, in: Advances in Neural Information Processing Systems, 5574–5584, [Link] (last access: 28 March 2022), 2017.
Klemeš, V.: Operational testing of hydrological simulation models, Hydrolog. Sci. J., 31, 13–24, [Link], 1986.
Klotz, D., Kratzert, F., Herrnegger, M., Hochreiter, S., and Klambauer, G.: Towards the quantification of uncertainty for deep learning based rainfall–runoff models, Geophys. Res. Abstr., 21, EGU2019-10708-2, 2019.
Kratzert, F.: neuralhydrology, GitHub [code], [Link]neuralhydrology/neuralhydrology, last access: 21 March 2022.
Kratzert, F., Klotz, D., Brenner, C., Schulz, K., and Herrnegger, M.: Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks, Hydrol. Earth Syst. Sci., 22, 6005–6022, [Link], 2018.
Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., and Nearing, G. S.: Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning, Water Resour. Res., 55, 11344–11354, [Link], 2019a.
Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.: Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets, Hydrol. Earth Syst. Sci., 23, 5089–5110, [Link], 2019b.
Kratzert, F., Klotz, D., Hochreiter, S., and Nearing, G. S.: A note on leveraging synergy in multiple meteorological data sets with deep learning for rainfall–runoff modeling, Hydrol. Earth Syst. Sci., 25, 2685–2703, [Link], 2021.
Laio, F. and Tamea, S.: Verification tools for probabilistic forecasts of continuous hydrological variables, Hydrol. Earth Syst. Sci., 11, 1267–1277, [Link], 2007.
Lane, R. A., Coxon, G., Freer, J. E., Wagener, T., Johnes, P. J., Bloomfield, J. P., Greene, S., Macleod, C. J. A., and Reaney, S. M.: Benchmarking the predictive capability of hydrological models for river flow and flood peak predictions across over 1000 catchments in Great Britain, Hydrol. Earth Syst. Sci., 23, 4011–4032, [Link], 2019.
Li, W., Duan, Q., Miao, C., Ye, A., Gong, W., and Di, Z.: A review on statistical postprocessing methods for hydrometeorological ensemble forecasting, Wiley Interdisciplin. Rev.: Water, 4, e1246, [Link], 2017.
Liu, M., Huang, Y., Li, Z., Tong, B., Liu, Z., Sun, M., Jiang, F., and Zhang, H.: The Applicability of LSTM-KNN Model for Real-Time Flood Forecasting in Different Climate Zones in China, Water, 12, 440, [Link], 2020.
Makridakis, S., Spiliotis, E., and Assimakopoulos, V.: The M5 accuracy competition: Results, findings and conclusions, Int. J. Forecast., [Link], 2020.
Montanari, A. and Koutsoyiannis, D.: A blueprint for process-based modeling of uncertain hydrological systems, Water Resour. Res., 48, W09555, [Link], 2012.
Muller, J. Z.: The tyranny of metrics, Princeton University Press, [Link], 2018.
Naeini, M. P., Cooper, G. F., and Hauskrecht, M.: Obtaining well calibrated probabilities using Bayesian binning, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 2015, NIH Public Access, p. 2901, [Link]php/AAAI/article/view/9602 (last access: 28 March 2022), 2015.
Nash, J. E. and Sutcliffe, J. V.: River flow forecasting through conceptual models part I – A discussion of principles, J. Hydrol., 10, 282–290, 1970.
NCAR: CAMELS: Catchment Attributes and Meteorology for Large-sample Studies – Dataset Downloads, [Link]solutions/products/camels, last access: 21 March 2022.
Nearing, G. S. and Gupta, H. V.: The quantity and quality of information in hydrologic models, Water Resour. Res., 51, 524–538, 2015.
Nearing, G. S., Mocko, D. M., Peters-Lidard, C. D., Kumar, S. V., and Xia, Y.: Benchmarking NLDAS-2 soil moisture and evapotranspiration to separate uncertainty contributions, J. Hydrometeorol., 17, 745–759, 2016.
Nearing, G. S., Ruddell, B. L., Clark, M. P., Nijssen, B., and Peters-Lidard, C.: Benchmarking and process diagnostics of land models, J. Hydrometeorol., 19, 1835–1852, 2018.
Nearing, G. S., Kratzert, F., Sampson, A. K., Pelissier, C. S., Klotz, D., Frame, J. M., Prieto, C., and Gupta, H. V.: What role does hydrological science play in the age of machine learning?, Water Resour. Res., 57, e2020WR028091, [Link], 2020a.
Nearing, G. S., Ruddell, B. L., Bennett, A. R., Prieto, C., and Gupta, H. V.: Does Information Theory Provide a New Paradigm for Earth Science? Hypothesis Testing, Water Resour. Res., 56, e2019WR024918, [Link], 2020b.
Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., and Duan, Q.: Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance, Hydrol. Earth Syst. Sci., 19, 209–223, [Link], 2015.
Newman, A. J., Mizukami, N., Clark, M. P., Wood, A. W., Nijssen, B., and Nearing, G.: Benchmarking of a physically based hydrologic model, J. Hydrometeorol., 18, 2215–2225, 2017.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshminarayanan, B., and Snoek, J.: Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift, in: Advances in Neural Information Processing Systems, 13991–14002, [Link] (last access: 28 March 2022), 2019.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S.: Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems, 8026–8037, [Link] (last access: 28 March 2022), 2019.
Reichle, R. H., McLaughlin, D. B., and Entekhabi, D.: Hydrologic data assimilation with the ensemble Kalman filter, Mon. Weather Rev., 130, 103–114, 2002.
Renard, B., Kavetski, D., Kuczera, G., Thyer, M., and Franks, S. W.: Understanding predictive uncertainty in hydrologic modeling: The challenge of identifying input and structural errors, Water Resour. Res., 46, W05521, [Link], 2010.
Richmond, K., King, S., and Taylor, P.: Modelling the uncertainty in recovering articulation from acoustics, Comput. Speech Language, 17, 153–172, 2003.
Rothfuss, J., Ferreira, F., Walther, S., and Ulrich, M.: Conditional density estimation with neural networks: Best practices and benchmarks, arXiv preprint arXiv:1903.00954, 2019.
Shrestha, D. L. and Solomatine, D. P.: Data-driven approaches for estimating uncertainty in rainfall-runoff modelling, Int. J. River Basin Manage., 6, 109–122, [Link], 2008.
Smith, L. and Gal, Y.: Understanding measures of uncertainty for adversarial example detection, arXiv preprint arXiv:1803.08533, 2018.
Thomas, R. and Uminsky, D.: The Problem with Metrics is a Fundamental Problem for AI, arXiv preprint arXiv:2002.08512, 2020.
Weijs, S. V., Schoups, G., and van de Giesen, N.: Why hydrological predictions should be evaluated using information theory, Hydrol. Earth Syst. Sci., 14, 2545–2558, [Link], 2010.
Yilmaz, K. K., Gupta, H. V., and Wagener, T.: A process-based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model, Water Resour. Res., 44, W09417, [Link], 2008.
Yu, K. and Moyeed, R. A.: Bayesian quantile regression, Stat. Probabil. Lett., 54, 437–447, 2001.
Zhu, L. and Laptev, N.: Deep and confident prediction for time series at uber, in: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 103–110, [Link]org/abstract/document/8215650 (last access: 28 March 2022), 2017.
Zhu, S., Xu, Z., Luo, X., Liu, X., Wang, R., Zhang, M., and Huo, Z.: Internal and external coupling of Gaussian mixture model and deep recurrent network for probabilistic drought forecasting, Int. J. Environ. Sci. Technol., 18, 1221–1236, [Link], 2020.
