VAEneu
Probabilistic Forecasting

Sheraz Ahmed
DFKI GmbH
Kaiserslautern, Germany

Abstract
This paper presents VAEneu, an innovative autoregressive method for multistep ahead univariate
probabilistic time series forecasting. We employ the conditional VAE framework and optimize the
lower bound of the predictive distribution likelihood function by adopting the Continuous Ranked
Probability Score (CRPS), a strictly proper scoring rule, as the loss function. This novel pipeline
results in sharp and well-calibrated predictive distributions. Through a comprehensive
empirical study, VAEneu is rigorously benchmarked against 12 baseline models across 12 datasets.
The results unequivocally demonstrate VAEneu’s remarkable forecasting performance. VAEneu
provides a valuable tool for quantifying future uncertainties, and our extensive empirical study lays
the foundation for future comparative studies for univariate multistep ahead probabilistic forecasting.
Keywords: Probabilistic Forecasting, Generative Models, Time Series Analysis, Deep Learning
1 Introduction
Probabilistic forecasting is essential in decision-making, particularly in fields where accurately assessing risk is critical,
such as healthcare [1, 2, 3], weather forecasting [4, 5], flood risk assessment [6], seismic hazard prediction [7, 8],
the renewable energy sector [9], and economic and financial risk management [10, 11]. These models provide insights into
possible future outcomes and their likelihoods, aiding decision-makers in resource allocation, policy formulation, and
strategic planning. The importance of these forecasts lies in their ability to model uncertainty about the future in the form of a predictive distribution. The ultimate goal of a probabilistic forecaster is to output a predictive distribution with good calibration and sharpness. Calibration is concerned with the statistical consistency between the predictive distribution and the observations, while sharpness governs the confidence in the forecast itself. A sharp probabilistic forecaster outputs a narrow predictive distribution, which represents high confidence in the forecast. However, sharpness is desirable only in conjunction with calibration, so that this high confidence is aligned with the true distribution [12].
The evolution of Deep Neural Networks (DNNs) has revolutionized probabilistic forecasting, introducing methods that
efficiently leverage large datasets. This advancement has shifted the focus towards end-to-end learning, minimizing
the need for manual feature engineering and opening new horizons in forecasting accuracy and application. Recurrent
Neural Networks (RNNs) were the first neural network architectures that were designed especially to process sequential
data. The Long-Short Term Memory (LSTM) [13] and Gated Recurrent Unit (GRU) [14] are two of the most well-
established RNN variants in DNN-based forecasting models. Temporal Convolutional Neural Networks (TCNs) [15] have brought the capabilities of CNNs to time series data. Seq2Seq models [16], later improved with the attention mechanism [17], further enhanced time series modeling. Recently, Transformers [18] have been added to the arsenal of time series modeling tools, further enriching the probabilistic forecasting toolbox.

∗ Corresponding author ([email protected])

While the DNN
models have demonstrated remarkable performance for probabilistic forecasting, they mostly restrict themselves to families of distributions with tractable likelihoods, regardless of the true distribution (e.g., [19]). Another dominant approach is to model key properties of the predictive distribution, such as quantiles [20, 21], instead of modeling the predictive distribution itself. To circumvent the intractable likelihood, one newly emerging approach is to use GANs to model the predictive distribution [22, 23]. In optimal scenarios, these models can generate samples from the true data distribution. However, the instability of the adversarial objective function makes the training of these networks a daunting task.
In this work, we introduce VAEneu, an autoregressive probabilistic forecaster that optimizes a lower bound of the likelihood function using a strictly proper scoring rule, the CRPS, to learn a sharp and well-calibrated predictive distribution without enforcing any restrictive assumptions.
The main contributions of this paper are as follows:
• We propose a probabilistic forecasting model based on the Conditional Variational Autoencoder (CVAE), incorporating CRPS as the loss function.
• We demonstrate the superior performance of the proposed model through extensive empirical experiments
using 12 baseline models and 12 univariate datasets.
2 Related Work
The rise of neural networks in recent years resulted in the emergence of powerful end-to-end probabilistic forecasting
methods, which are capable of effectively learning from large datasets. Khosravi et al. [24] combined neural networks
with the GARCH model [25] to provide prediction interval for time series forecasting. WaveNet [26] employed dilated
convolutional neural networks (CNNs) to process long input time series data effectively and has shown outstanding
performance in audio generation and time series forecasting. Long Short-Term Memory (LSTM) networks have been used for parameterizing a linear State Space Model (SSM) to generate probabilistic forecasts [27]. Rasp et al. [28]
employed a neural network to post-process an ensemble weather prediction model. Prophet [29] is a modular regression
model built upon the Generalised Additive Model [30], which can integrate domain knowledge into the modeling
process. Wen [21] suggested a multi-horizon quantile forecaster (MQ-R(C)NN) based on Sequence-to-Sequence RNN
(Seq2Seq) [31] model. Gesthaus et al. [32] introduced another method for quantile forecasting based on the spline
functions for representing output distributions. Salinas et al. [33] proposed an autoregressive forecaster for high-
dimensional multivariate time series by utilizing a low-rank covariance matrix to model the output distribution. Also,
they took a copula-based approach in conjunction with an RNN model to construct the forecaster. Wang et al. [34] use
a global model based on deep neural networks for capturing global non-linear patterns, while a probabilistic graphical
model extracts the individual random effects locally. DeepTCN [35] is proposed as an encoder-decoder forecasting
model based on dilated CNN models. This model can be utilized either as an explicit model or a quantile estimator.
Rasul et al. [36] employed conditional normalizing flows to compose an autoregressive multivariate probabilistic
forecasting model. Normalizing Kalman Filter [37] has been proposed for probabilistic forecasting, which uses
normalizing flows [38] to augment a linear Gaussian state space model. DeepAR [19] is suggested as a simple yet
effective autoregressive explicit model based on RNNs. Gouttes et al. [39] suggested an autoregressive RNN model based on
Implicit Quantile Networks [40] for quantile forecasting. Hasson et al. [41] introduced Level Set Forecaster (LSF). LSF
organizes training data into groups, considering those that are sufficiently similar. Subsequently, the bins containing
the actual values are employed to generate the predicted distributions. Lim et al. [20] proposed the Temporal Fusion
Transformer (TFT). This attention-based model utilizes a sequence-to-sequence model for local processing of the input window and a temporal self-attention decoder to model long-term dependencies. Denoising diffusion models have been
used to develop TimeGrad [42], an autoregressive multivariate forecaster. Türkmen et al. [43] combined classical
discrete-time conditional renewal processes [44] with a neural network architecture to build a forecasting model suitable for intermittent demand forecasting. Xu et al. [45] expanded quantile RNN forecaster models to handle time
series with mixed sample frequency. Kan et al. [46] proposed multivariate quantile functions, which are parameterized
using gradients of "input convex neural networks" [47]. The model employs an RNN-based feature extractor to build a
probabilistic forecaster. Multiple publications [22, 48, 23, 49, 50] have successfully applied GAN to transform samples
from a prior distribution to samples from a predictive distribution.
The recent surge in interest in long sequence time series forecasting (LSTF) in the scientific community has led to
notable developments in this domain. Pioneering contributions such as Informer [51], Autoformer [52], FEDformer [53],
and TSMixer [54] have been significant. These models primarily focus on the challenge of substantially extending the
forecast horizon, primarily within the context of modeling multivariate time series in a deterministic manner.
[Figure 1: Overview of the VAEneu architecture. The encoder maps the historical window and the target value to a latent mean and log-variance, the reparameterization trick yields the latent variable z, and the decoder combines z with the representation of the historical window to forecast x_{t+1}.]
3 VAEneu
In this section, we delve into the specifics of our proposed methodology, which is pivotal to our study of probabilistic
forecasting. Initially, we establish a formal definition and structure for the task of probabilistic forecasting, setting
the stage for our subsequent discussions. We then explore the concept and mechanics of data generation using VAEs,
providing a foundation for understanding our approach. Lastly, we introduce our unique contributions to this field and
detail the architecture of our proposed model, highlighting how it differentiates and advances the current landscape of
probabilistic forecasting models in machine learning.
A given dataset consists of a set of observations generated by the process of interest. Let D be a dataset containing a univariate time series comprising T timesteps:
$$D = \{x_0, x_1, \dots, x_{T-1}\}, \qquad x_i \in \mathbb{R}. \qquad (1)$$
In forecasting, our ultimate objective is to obtain a predictive probability distribution for future quantities based on
historical information. To be more specific, we aim to model the conditional probability distribution
$$p(x_{t+1}, \dots, x_{t+H} \mid x_0, \dots, x_t), \qquad (2)$$
where H denotes the forecast horizon.
The Variational Autoencoder (VAE) [55] is an explicit generative model with latent variables. The aim of the model is to generate artificial data x from the generative distribution pθ(x|z) conditioned on a latent variable z ∼ pθ(z). Due to the intractable true posterior pθ(z|x), direct estimation of the parameters θ is not feasible. However, the VAE defines a surrogate function qϕ(z|x), also known as the recognition model, to approximate the true posterior. With this approximation, the VAE can be trained efficiently via stochastic gradient descent using the variational lower bound of the log-likelihood:
$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi; x) = -D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]. \qquad (3)$$
Table 1: The table presents the results of the multi-step-ahead forecasting experiment. Each model underwent three
separate training iterations on each dataset. The performance is quantified using the average CRPS (lower is better),
with the Coefficient of Variation (CV) indicated within parentheses. The color-coding of the cells illustrates the model’s
performance relative to the best-performing model on each dataset: green signifies the top model, blue indicates a
deviation of up to 10% from the best model, black for up to 50% deviation, orange for up to 100% deviation, and red for
deviations exceeding 100%. Notably, the VAEneu models demonstrate exemplary performance, securing the position of
the best model in five datasets and, in most cases, deviating less than 50% from the leading model’s results.
Model        | Gold Price     | HEPC           | Internet Traffic A1H | Internet Traffic A5M | Internet Traffic B1H | Internet Traffic B5M
Prophet      | 1.1E+2 (0.3%)  | 3.5E-1 (2.1%)  | 4.9E+11 (1.0%)       | 1.3E+9 (1.5%)        | 1.9E+15 (1.7%)       | 1.4E+14 (1.2%)
MQ-RNN       | 3.2E+2 (9.4%)  | 5.3E-1 (17.0%) | 1.4E+12 (0.0%)       | 5.1E+9 (0.0%)        | 1.6E+16 (0.0%)       | 1.2E+15 (0.0%)
MQ-CNN       | 4.6E+1 (15.5%) | 7.3E-1 (27.7%) | 4.7E+11 (3.8%)       | 9.3E+8 (1.6%)        | 2.4E+15 (8.1%)       | 2.4E+14 (3.1%)
TFT          | 4.7E+1 (18.4%) | 2.9E-1 (24.5%) | 2.5E+11 (42.6%)      | 1.7E+8 (5.3%)        | 8.1E+14 (3.0%)       | 3.0E+13 (10.1%)
Transformer  | 7.8E+1 (55.8%) | 2.8E-1 (7.8%)  | 1.0E+11 (37.4%)      | 1.8E+8 (32.5%)       | 8.6E+14 (19.2%)      | 5.1E+13 (31.7%)
DeepFactor   | 1.1E+2 (77.4%) | 1.1E+0 (9.9%)  | 1.5E+12 (55.5%)      | 3.4E+9 (2.2%)        | 5.8E+15 (24.3%)      | 4.9E+14 (4.1%)
DeepAR       | 6.5E+1 (48.2%) | 2.6E-1 (4.2%)  | 9.9E+10 (11.4%)      | 1.0E+8 (13.4%)       | 6.5E+14 (29.3%)      | 3.5E+13 (5.5%)
DRP          | 1.3E+3 (0.2%)  | 5.6E-1 (2.4%)  | 1.4E+12 (0.0%)       | 5.2E+9 (0.0%)        | 1.6E+16 (0.0%)       | 1.2E+15 (0.0%)
WaveNet      | 3.3E+1 (23.2%) | 2.9E-1 (6.7%)  | 1.8E+11 (23.9%)      | 1.3E+8 (9.3%)        | 1.5E+15 (14.2%)      | 3.3E+13 (22.6%)
GPForecaster | 3.9E+1 (2.2%)  | 6.0E-1 (1.5%)  | 1.4E+12 (0.0%)       | 5.1E+9 (0.0%)        | 1.6E+16 (0.0%)       | 1.2E+15 (0.0%)
DeepState    | 4.5E+1 (10.3%) | 7.2E-1 (12.4%) | 7.9E+11 (5.2%)       | 1.2E+9 (25.1%)       | 2.8E+15 (8.6%)       | 2.7E+14 (19.7%)
ForGAN       | 1.9E+1 (4.2%)  | 3.0E-1 (5.7%)  | 8.9E+10 (4.6%)       | 9.6E+7 (4.6%)        | 1.1E+15 (35.1%)      | 4.0E+13 (6.0%)
VAEneu-RNN   | 1.9E+1 (1.3%)  | 3.1E-1 (8.4%)  | 5.3E+10 (5.8%)       | 1.1E+8 (6.1%)        | 5.8E+14 (20.7%)      | 4.2E+13 (12.6%)
VAEneu-TCN   | 2.1E+1 (2.7%)  | 2.9E-1 (3.7%)  | 5.9E+10 (4.0%)       | 1.1E+8 (12.7%)       | 5.3E+14 (11.4%)      | 3.5E+13 (7.0%)

Model        | Mackey Glass   | Saugeen River  | Solar 4 Seconds | Sunspot        | US Births      | Wind 4 Seconds
Prophet      | 1.3E-1 (0.7%)  | 1.2E+1 (2.3%)  | 7.5E+3 (1.4%)   | 2.2E+1 (1.5%)  | 3.6E+2 (0.9%)  | 1.2E+4 (2.4%)
MQ-RNN       | 1.4E-1 (5.2%)  | 2.1E+1 (3.4%)  | 5.4E+3 (8.1%)   | 3.4E+0 (4.8%)  | 9.5E+2 (2.4%)  | 1.4E+4 (1.3%)
MQ-CNN       | 1.2E-1 (7.4%)  | 1.4E+1 (1.2%)  | 4.8E+3 (11.9%)  | 3.5E+0 (14.6%) | 5.0E+2 (14.2%) | 1.3E+4 (1.5%)
TFT          | 5.9E-3 (17.8%) | 8.9E+0 (7.0%)  | 3.3E+3 (18.2%)  | 3.0E+0 (3.0%)  | 1.3E+2 (1.8%)  | 9.3E+3 (14.0%)
Transformer  | 1.2E-1 (19.1%) | 1.2E+1 (14.9%) | 1.9E+3 (10.4%)  | 2.6E+0 (0.7%)  | 3.9E+2 (58.9%) | 8.9E+3 (23.1%)
DeepFactor   | 2.5E-1 (1.9%)  | 2.2E+1 (1.0%)  | 3.0E+4 (2.4%)   | 2.8E+0 (12.3%) | 3.1E+3 (53.2%) | 1.2E+4 (3.9%)
DeepAR       | 7.6E-2 (8.1%)  | 1.3E+1 (2.4%)  | 1.6E+3 (13.3%)  | 2.6E+0 (2.3%)  | 1.8E+2 (7.8%)  | 9.7E+3 (12.0%)
DRP          | 1.9E-1 (0.0%)  | 1.0E+1 (2.4%)  | 3.0E+4 (0.0%)   | 3.4E+1 (23.4%) | 1.1E+4 (0.1%)  | 1.7E+4 (0.0%)
WaveNet      | 7.1E-3 (1.8%)  | 9.4E+0 (4.7%)  | 4.2E+3 (31.7%)  | 2.6E+0 (12.0%) | 3.2E+2 (10.6%) | 1.3E+4 (22.0%)
GPForecaster | 3.3E-1 (44.3%) | 1.1E+1 (1.8%)  | 2.1E+4 (1.6%)   | 4.0E+0 (16.5%) | 4.2E+2 (0.2%)  | 1.2E+4 (3.1%)
DeepState    | 2.8E-1 (1.1%)  | 1.5E+1 (13.4%) | 7.4E+3 (18.3%)  | 9.9E+0 (13.7%) | 5.4E+2 (4.8%)  | 2.8E+4 (28.3%)
ForGAN       | 4.9E-3 (31.2%) | 8.3E+0 (0.5%)  | 3.1E+3 (6.4%)   | 3.6E+0 (3.4%)  | 2.8E+2 (6.8%)  | 6.9E+3 (3.6%)
VAEneu-RNN   | 8.6E-4 (10.8%) | 8.4E+0 (0.2%)  | 2.8E+3 (3.7%)   | 3.4E+0 (2.4%)  | 2.9E+2 (3.4%)  | 6.9E+3 (0.3%)
VAEneu-TCN   | 4.5E-4 (24.9%) | 8.4E+0 (1.1%)  | 1.8E+3 (7.5%)   | 2.7E+0 (1.5%)  | 2.9E+2 (7.1%)  | 7.4E+3 (0.5%)
Assuming a Gaussian latent variable, the first term of the loss function (the KL divergence) can be integrated analytically. The second term, also known as the reconstruction error, can be approximated by drawing multiple samples z^(l) (l = 1, ..., L) from qϕ(z|x); hence, the empirical loss function of the VAE with a Gaussian latent variable is:
$$\mathcal{L}(\theta, \phi; x) \simeq \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\!\left(x \mid z^{(l)}\right), \qquad (4)$$
where J is the dimensionality of the latent variable.
To decrease the variance of the gradients during training, the VAE uses the reparameterization trick, where the recognition distribution qϕ(z|x) is parameterized via a deterministic, differentiable function gϕ(·, ·); thus, z^(l) = gϕ(x, ϵ^(l)) with ϵ^(l) ∼ N(0, I).
In practice, the prior over the latent variable is taken to be an isotropic multivariate Gaussian, pθ(z) = N(z; 0, I). The qϕ(z|x) and pθ(x|z) are approximated with neural networks, namely the encoder and the decoder, respectively. Furthermore, qϕ(z|x) is parameterized using gϕ(x, ϵ^(l)) = µ + σ ⊙ ϵ^(l), where x is a data point and ⊙ signifies the element-wise product.
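The analytic KL term and the reparameterized sampling above are straightforward to express in PyTorch. The following is a minimal illustrative sketch; the function names are ours and not taken from the paper's code:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def gaussian_kl(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
```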
Figure 2: The critical difference diagram for the univariate time series probabilistic forecasting experiment, in which two separate clusters of models emerge based on their performance across all datasets.
The Conditional VAE (CVAE) [56] is an important extension of the VAE that enables training conditioned on some auxiliary information c and generating artificial data from pθ(x|c). To do so, the condition is presented to both the encoder and the decoder and is integrated into the VAE's objective function:
$$\mathcal{L}(\theta, \phi; x, c) = -D_{\mathrm{KL}}\big(q_\phi(z|x, c)\,\|\,p_\theta(z|c)\big) + \mathbb{E}_{q_\phi(z|x, c)}\big[\log p_\theta(x|z, c)\big]. \qquad (5)$$
As stated in Equation 2, the objective of probabilistic forecasting is to learn the distribution of future values, also
known as predictive distribution, conditioned on historical window input. Since CVAE provides us with a tool to learn
conditional probability distribution, with the right setup, we can train it as a probabilistic forecaster. To do so, we need
to use historical window input as the condition, i.e., c = x0:t , and the target future values as the CVAE input data.
In this work, we model one-step-ahead forecasting to simplify the training pipeline. During inference, we extend the resulting model into a multi-step forecasting model using auto-regression, i.e., feeding the model's forecasts back to the model to forecast further into the future. Hence, the predictive distribution becomes pθ(x_{t+1} | z, x_{0:t}).
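A rough sketch of this autoregressive rollout is given below; `forecast_one_step` is a hypothetical method standing in for one pass through the trained decoder with a freshly sampled latent variable:

```python
import torch

def autoregressive_forecast(model, history: torch.Tensor, horizon: int) -> torch.Tensor:
    """Roll a one-step-ahead probabilistic forecaster forward over `horizon` steps.

    history: (batch, t+1) conditioning window; `model.forecast_one_step` is assumed to
    return one sample of shape (batch, 1) drawn from p_theta(x_{t+1} | z, x_{0:t}).
    """
    window = history
    forecasts = []
    for _ in range(horizon):
        x_next = model.forecast_one_step(window)       # sample from the predictive distribution
        forecasts.append(x_next)
        window = torch.cat([window, x_next], dim=-1)   # feed the forecast back as history
    return torch.cat(forecasts, dim=-1)                # (batch, horizon) sample path
```

Repeating such a rollout many times (e.g., 1,000 times, as in Section 4.3) yields a set of sample paths that represents the multi-step predictive distribution.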
To train the CVAE as a probabilistic forecaster, we still need to define a loss function that lets the model learn the intricacies of the predictive distribution. The conventional approach in VAEs is that, when the data is continuous, the likelihood function is assumed to be a multivariate Gaussian, i.e., pθ(x|z) = N(x; µθ(z), Σθ(z)). To streamline the training process, Σθ(z) is assumed to be a scaled identity matrix, which can be written as Σθ(z) = σI. This assumption effectively treats the covariance of the predictive distribution as constant and simplifies the reconstruction error to the mean squared error (MSE). These assumptions and simplifications enable the efficient training of VAEs on continuous variables. However, probabilistic forecasters trained under these assumptions tend to generate a very sharp predictive distribution around the mean of the predictive distribution and disregard statistical consistency with the true data distribution, which results in low calibration. Therefore, we need a novel reconstruction loss that enforces diversity between samples and lets the model learn the nuances of the predictive distribution.
The Continuous Ranked Probability Score (CRPS) serves as a univariate, strictly proper scoring rule, evaluating the
alignment of a cumulative distribution function (CDF) F with an observed value x ∈ R. This is formally expressed as:
$$\mathrm{CRPS}(F, x) = \int_{\mathbb{R}} \big(F(y) - \mathbb{1}\{x \le y\}\big)^2 \, dy, \qquad (6)$$
where 1{x ≤ y} denotes the indicator function, taking a value of one if x ≤ y and zero otherwise. Since CRPS is a strictly proper scoring rule, its expected value is minimized exactly when the predictive distribution mirrors the actual data distribution. Consequently, minimizing CRPS as a reconstruction loss for training a CVAE-based probabilistic forecaster effectively corresponds to optimizing the likelihood of the predictive distribution, thereby making CRPS an effective proxy for optimizing pθ(x_{t+1} | z, x_{0:t}) in the CVAE pipeline.
On the other hand, during inference from a CVAE, direct access to the CDF of the predictive distribution is not available.
Instead, the model provides samples from the predictive distribution. There is a method to approximate the CRPS in
this scenario. Using lemma 2.2 of [57] or identity 17 of [58], we can define the negative orientation of CRPS by
$$\mathrm{CRPS}(F, x) = \mathbb{E}_F\big|\bar{X} - x\big| - \frac{1}{2}\,\mathbb{E}_F\big|\bar{X} - \bar{X}'\big|, \qquad (7)$$
where X̄ and X̄ ′ are independent copies of a random variable with distribution function F and a finite first moment [12].
In practice, we calculate CRPS using Equation 7 by drawing two sets of samples from the predictive distribution to
represent X̄ and X̄ ′ . The accuracy of CRPS estimation is dependent on the sample size of X̄ and X̄ ′ .
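A minimal PyTorch sketch of this sample-based estimator of Equation 7 could look as follows (the tensor shapes are illustrative assumptions):

```python
import torch

def crps_from_samples(x_samples: torch.Tensor, x_samples_prime: torch.Tensor,
                      target: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of Equation 7.

    x_samples, x_samples_prime: (sample_size, batch) forecast samples representing X and X'.
    target: (batch,) observed values.
    """
    mae_term = (x_samples - target.unsqueeze(0)).abs().mean(dim=0)     # E|X - x|
    spread_term = (x_samples - x_samples_prime).abs().mean(dim=0)      # E|X - X'|
    return (mae_term - 0.5 * spread_term).mean()                       # average CRPS over the batch
```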
Intuitively, optimizing the CRPS, as defined in Equation 7, involves balancing two contrasting objectives. The first term,
representing the mean absolute error between the forecast and the target, aims to sharpen the predictive distribution
around the median. Simultaneously, the second term, assessing the spread of forecast samples, encourages diversity
in forecast samples to enhance calibration. The optimization process thus navigates towards a balanced predictive
distribution, one that reconciles sharpness with a well-calibrated spread of forecast samples.
In this study, we optimized the CRPS approximation (Equation 7) to train a CVAE for probabilistic forecasting. For a given input x_{t+1} and condition x_{0:t}, we sampled multiple instances of z from the recognition distribution qϕ(z | x_{0:t}, x_{t+1}) to obtain forecast samples for X̄. The number of these samples, referred to as the sample size, is a critical hyperparameter. To compute CRPS efficiently, we shuffled the forecast samples in X̄ to create X̄′, thereby circumventing the need for additional sampling from the predictive distribution.
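A sketch of how this shuffling could be used to form the reconstruction loss during a training step, reusing `crps_from_samples` from above; here `forecasts`, `target`, and `kl_term` are placeholders for the decoded samples, the observed values, and the analytic KL term of the CVAE objective:

```python
import torch

# forecasts: (sample_size, batch) samples of x_{t+1}, obtained by decoding sample_size
# latent draws z ~ q_phi(z | x_{0:t}, x_{t+1}) for the same conditioning batch.
perm = torch.randperm(forecasts.size(0))
forecasts_prime = forecasts[perm]                      # X' as a shuffled copy of X
reconstruction = crps_from_samples(forecasts, forecasts_prime, target)
loss = kl_term + reconstruction                        # CVAE objective with CRPS reconstruction
loss.backward()
```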
Figure 1 offers a detailed view of the VAEneu architecture, which was implemented in two distinct structural variants.
VAEneu-TCN incorporates TCN in both its encoder and decoder, while VAEneu-RNN employs LSTM networks
as its core components. The encoder process involves concatenating the historical data with the target forecast to
form the network input x0:t+1 . This input is then transformed into a representation h0:t+1 using either LSTM or
TCN. Subsequently, the representation passes through parallel fully connected layers to generate mean µϕ (h0:t+1 ) and
standard deviation σϕ (h0:t+1 ) outputs. Latent variables z are generated via reparameterization and then fed into the
decoder.
The decoder’s role is to learn the predictive distribution pθ (xt+1 |z, x0:t ). It begins by mapping the condition to its
representation h0:t and then concatenates this with the latent variable from the encoder. The final mapping to the next
forecast is achieved through a fully connected layer.
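To make the data flow concrete, here is a highly simplified sketch of the RNN variant; the hidden sizes, layer choices, and method names are our assumptions for illustration and are not taken from the paper's implementation:

```python
import torch
import torch.nn as nn

class VAEneuRNN(nn.Module):
    """Illustrative sketch of a VAEneu-RNN-style encoder/decoder (hyperparameters are assumptions)."""

    def __init__(self, hidden_size: int = 64, latent_dim: int = 16):
        super().__init__()
        self.enc_rnn = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.fc_mu = nn.Linear(hidden_size, latent_dim)        # mean head
        self.fc_logvar = nn.Linear(hidden_size, latent_dim)    # log-variance head
        self.dec_rnn = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.dec_out = nn.Linear(hidden_size + latent_dim, 1)  # maps [h_{0:t}, z] to x_{t+1}

    def encode(self, x_full):                  # x_full: (batch, t+2, 1), history + target
        h, _ = self.enc_rnn(x_full)
        h_last = h[:, -1]                      # representation of x_{0:t+1}
        return self.fc_mu(h_last), self.fc_logvar(h_last)

    def decode(self, x_hist, z):               # x_hist: (batch, t+1, 1), conditioning window
        h, _ = self.dec_rnn(x_hist)
        h_last = h[:, -1]                      # representation of x_{0:t}
        return self.dec_out(torch.cat([h_last, z], dim=-1))   # one-step forecast

    def forward(self, x_hist, x_target):
        mu, logvar = self.encode(torch.cat([x_hist, x_target], dim=1))
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)   # reparameterization
        return self.decode(x_hist, z), mu, logvar
```

The TCN variant follows the same structure, replacing the LSTM blocks with dilated temporal convolutions while keeping the mean/log-variance heads and decoder mapping.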
4 Experiment
4.1 Datasets
This section introduces an array of datasets utilized to evaluate VAEneu performance nuances under different data
conditions. We employed 12 datasets sourced from public repositories, namely Gold Price, Household Electric Power Consumption (HEPC), the Internet Traffic datasets (A5M, A1H, B5M, B1H), Mackey-Glass, Saugeen River Flow, Solar 4 Seconds, Wind 4 Seconds, Sunspot, and US Births. Appendix A presents salient features of these datasets alongside a detailed review and a visual representation of each.
4.2 Baseline
In this study, we compare the proposed model with 12 established probabilistic forecasting models namely DeepAR [19],
DeepState [27], DeepFactor [34], Deep Renewal Processes (DRP) [43], GPForecaster, MQ-RNN and MQ-CNN [21],
Prophet [29], Wavenet [26], Transformer [18], TFT [20], ForGAN [22]. TFT, MQ-RNN, and MQ-CNN models
provide quantiles of predictive distribution, while the rest of the models generate samples from predictive distribution
as the forecast. We employed the implementation of these methods from GluonTS [59] package with their default
hyperparameters, with the exception of ForGAN, for which we used the implementation provided by its authors. Appendix B provides detailed information on the baseline models.
4.3 Assessment
In this study, we utilize the CRPS to assess the performance of various probabilistic forecasters. For models that yield
samples from the predictive distribution as their forecast output, we used 1,000 samples per time step in the forecast
horizon from predictive distribution to approximate CRPS using Equation 7. For probabilistic forecasting models that
provide quantiles of the predictive distribution for their forecasts, we engage a quantile-based approximation method
for CRPS calculation.
CRPS is conceptually interpreted as the integrated pinball loss across all possible quantile levels α, ranging from 0 to
1 [32]. This relationship is mathematically expressed as:
$$\mathrm{CRPS}\big(F^{-1}, x\big) = \int_0^1 2\,\Lambda_\alpha\big(F^{-1}(\alpha), x\big)\, d\alpha, \qquad (8)$$
where F^{-1} signifies the quantile function and Λ_α(q, x) = (α − 1{x < q})(x − q) is the pinball loss at quantile level α. For practical applications, the quantile-based CRPS is typically approximated from the available set of quantiles. Thus, Equation 8 is effectively approximated as a summation over a specified number of quantiles N. The accuracy of this approximation depends on the number of quantiles utilized, which influences the precision of the CRPS estimate. We employed 99 equally spaced quantile levels to estimate CRPS.
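A minimal sketch of this quantile-based approximation, averaging twice the pinball loss over the available levels (array shapes are assumptions):

```python
import numpy as np

def crps_from_quantiles(quantile_preds: np.ndarray, target: float, levels: np.ndarray) -> float:
    """Approximate Equation 8 by averaging 2 * pinball loss over the given quantile levels.

    quantile_preds[i] is the forecast quantile at level levels[i].
    """
    pinball = np.where(target >= quantile_preds,
                       levels * (target - quantile_preds),
                       (1.0 - levels) * (quantile_preds - target))
    return float(np.mean(2.0 * pinball))

# e.g. 99 equally spaced quantile levels 0.01, 0.02, ..., 0.99
levels = np.arange(1, 100) / 100.0
```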
Our experimental setup involves partitioning each dataset, reserving a portion equivalent to five forecast horizons for
testing, while the remainder is utilized for model training. The proposed models are trained using the RMSProp
optimizer [60] for a maximum of 100,000 steps. However, training is ceased prematurely if no improvement in model
performance is observed over 5,000 consecutive training steps.
The models detailed in this study have been implemented leveraging the Pytorch framework [61] and trained on a
machine equipped with an Nvidia RTX 6000 GPU. We used eight samples to estimate CRPS during training (sample
size = 8). The rest of the hyperparameters for the VAEneu models are presented in Appendix C.
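A sketch of this training budget in PyTorch might look as follows; `model`, the learning rate, and the `training_step` helper (assumed to return the quantity monitored for improvement) are illustrative assumptions:

```python
import torch

optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)   # learning rate is an assumption

best_score, steps_without_improvement = float("inf"), 0
for step in range(100_000):                     # at most 100,000 training steps
    score = training_step(model, optimizer)     # hypothetical helper: one optimization step + monitored loss
    if score < best_score:
        best_score, steps_without_improvement = score, 0
    else:
        steps_without_improvement += 1
    if steps_without_improvement >= 5_000:      # stop early after 5,000 steps without improvement
        break
```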
To circumvent any anomalies and ensure robustness, each model is subjected to three independent training runs on every dataset. The reported performance metric is the arithmetic mean of these three runs, denoted $\overline{\mathrm{CRPS}}$.
The coefficient of variation (CV) is also presented in parentheses to shed light on the stability and reproducibility of
each technique. CV provides a standardized measure of dispersion of a probability distribution or frequency distribution
and is defined as:
$$CV = \frac{\sigma}{\mu} \times 100\%. \qquad (9)$$
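As a small worked example of Equation 9 (the CRPS values of the three runs are hypothetical):

```python
import numpy as np

run_scores = np.array([0.31, 0.29, 0.30])        # hypothetical CRPS of three training runs
mean_crps = run_scores.mean()                    # reported average CRPS
cv = run_scores.std() / mean_crps * 100.0        # coefficient of variation in percent
print(f"CRPS = {mean_crps:.3f} (CV = {cv:.1f}%)")
```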
Figure 4: The distribution of relative CRPS for the top-tier models across all datasets, revealing the remarkable performance of the VAEneu-TCN model, which consistently secured the best result or performed close to the best model on all datasets.
To compare models across datasets, we also report the relative CRPS,
$$\Delta\overline{\mathrm{CRPS}} = \frac{\overline{\mathrm{CRPS}} - \overline{\mathrm{CRPS}}^{*}}{\overline{\mathrm{CRPS}}^{*}} \times 100\%, \qquad (10)$$
wherein $\overline{\mathrm{CRPS}}^{*}$ represents the $\overline{\mathrm{CRPS}}$ of the top-performing model for a given dataset. A glance at the color scheme of Table 1 shows that the VAEneu models are the best-performing models on five datasets, and on the remaining datasets they closely follow the best-performing model, with the exception of the US Births dataset.
To synthesize these extensive results into a more digestible format, we utilize the critical difference diagram (CD diagram) [62]. The CD diagram reflects the rank of the models on each dataset and paints a complete picture of model performance across all datasets. Figure 2 showcases this diagram for the multi-step-ahead forecasting experiment. The model names are listed alongside their average rank across all datasets. The thick horizontal lines are called critical difference bands. The performance of the models grouped by a critical difference band is not significantly different considering their $\overline{\mathrm{CRPS}}$ across all datasets. The critical difference bands are defined using the Wilcoxon signed-rank test at a significance threshold of 0.05, and the model ranking is done via the one-sided paired t-test at a 0.05 significance threshold. Appendix D provides a complete description of how the CD diagram was created for this study. From this figure, we can clearly see that the models fall into two categories. The top critical difference band includes the VAEneu models alongside DeepAR, ForGAN, TFT, WaveNet, and Transformer, suggesting their comparable performance across all datasets. These models, hereafter referred to as "top-tier models," exhibit performance that is statistically superior to the rest of the baseline models.
Next, we delve deeper into the performance analysis of the top-tier models. Figure 4 illustrates the distribution of the relative scores ∆CRPS among the top-tier models across all datasets. VAEneu-TCN separates itself from the rest of the models with a stable performance that is best or near best across all datasets. VAEneu-RNN follows VAEneu-TCN alongside DeepAR and ForGAN. DeepAR's remarkable performance was somewhat unexpected, considering that DeepAR explicitly assumes that the predictive distribution is Gaussian, which renders the model less flexible. The performance of DeepAR suggests that Gaussian distributions can aptly approximate uncertainty in many real-world scenarios.
Figure 5: The coefficient of variation distribution for the top-tier models across all datasets indicates the consistent performance of the VAEneu models across the three training runs over all datasets and highlights the stable performance of the ForGAN and DeepAR models as well.
[Figure 6: CRPS and training-step time (in ms) as a function of the sample size (1 to 128) for the VAEneu models on three datasets.]
Figure 3 shows the VAEneu models' forecasts on the Internet Traffic B1H dataset and provides a qualitative comparison against DeepAR, the best-performing baseline model on this dataset. The DeepAR model generates a sharp but relatively poorly calibrated predictive distribution. In contrast, the VAEneu models demonstrate an effective balance between sharpness and calibration within their predictive distributions. Notably, the TCN-based model has slightly better sharpness in comparison to the RNN variant.
Finally, to analyze the consistency of model performance across the three runs, we investigate the distribution of CV for the top-tier models and present the results in Figure 5. Here, we can observe that VAEneu, alongside the ForGAN and DeepAR models, delivers good performance at a consistent rate. Appendix E illustrates forecasts from the VAEneu models and the baselines to provide qualitative insight into the models' performance as well.
Figure 6 encapsulates our findings. An evident trend emerges: as the sample size increases, the models converge to configurations with a more favorable CRPS. This improvement, however, comes at the cost of an increased batch processing duration. In the case where the sample size is 1, the CRPS loss reduces to the MAE loss. From this figure, we can observe that moving from the MAE loss toward the CRPS loss improves model performance significantly, underlining the effectiveness of the proposed methodology. Moreover, an interesting observation is that by choosing a sample size within the interval [4, 32], one can achieve decent CRPS values without significantly extending the training-step processing time.
7 Conclusion
This study introduces a pioneering probabilistic forecasting approach utilizing CVAEs. The methodology centers on
optimizing the CRPS as the loss function during training, allowing the CVAEs to learn predictive distributions effectively
without imposing restrictive assumptions. Demonstrating both remarkable and consistent performance in our extensive
empirical evaluations, this approach emerges as a significant and innovative contribution to the field of probabilistic
forecasting, offering a robust tool for decision-makers to navigate the realm of uncertainty with confidence.
References
[1] Anand Avati, Stephen Pfohl, Chris Lin, Thao Nguyen, Meng Zhang, Philip Hwang, Jessica Wetstone, Kenneth
Jung, Andrew Y. Ng, and Nigam H. Shah. Predicting Inpatient Discharge Prioritization With Electronic Health
Records. CoRR, abs/1812.00371, 2018.
[2] Thomas Janssoone, Clémence Bic, Dorra Kanoun, Pierre Hornus, and Pierre Rinder. Machine Learning on
Electronic Health Records: Models and Features Usages to predict Medication Non-Adherence. arXiv e-prints,
page arXiv:1811.12234, November 2018.
[3] Anand Avati, Kenneth Jung, Stephanie Harman, Lance Downing, Andrew Ng, and Nigam H Shah. Improving
palliative care with deep learning. BMC medical informatics and decision making, 18(4):55–64, 2018.
[4] Evan Racah, Christopher Beckham, Tegan Maharaj, Samira Ebrahimi Kahou, Mr Prabhat, and Chris Pal. Ex-
tremeWeather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of
extreme weather events. In Advances in Neural Information Processing Systems, pages 3402–3413, 2017.
[5] Eduardo Rocha Rodrigues, Igor Oliveira, Renato Cunha, and Marco Netto. DeepDownscale: a deep learning
strategy for high-resolution weather forecast. In 2018 IEEE 14th International Conference on e-Science (e-Science),
pages 415–422. IEEE, 2018.
[6] Sella Nevo, Vova Anisimov, Gal Elidan, Ran El-Yaniv, Pete Giencke, Yotam Gigi, Avinatan Hassidim, Zach Moshe,
Mor Schlesinger, Guy Shalev, and others. ML for Flood Forecasting at Scale. arXiv preprint arXiv:1901.09583,
2019.
[7] S. Mostafa Mousavi, Weiqiang Zhu, Yixiao Sheng, and Gregory C. Beroza. CRED: A Deep Residual Network
of Convolutional and Recurrent Units for Earthquake Signal Detection. arXiv e-prints, page arXiv:1810.01965,
October 2018.
[8] Zachary E. Ross, Yisong Yue, Men-Andrin Meier, Egill Hauksson, and Thomas H. Heaton. PhaseLink: A Deep
Learning Approach to Seismic Phase Association. Journal of Geophysical Research: Solid Earth, 124(1):856–869,
2019.
[9] André Gensler and Bernhard Sick. A Multi-Scheme Ensemble Using Coopetitive Soft-Gating With Application to
Power Forecasting for Renewable Energy Generation. arXiv preprint arXiv:1803.06344, 2018.
[10] W. Ronny Huang and Miguel A. Perez. Accurate, Data-Efficient Learning from Noisy, Choice-Based Labels for
Inherent Risk Scoring. CoRR, abs/1811.10791, 2018.
[11] Qiang Zhang, Rui Luo, Yaodong Yang, and Yuanyuan Liu. Benchmarking Deep Sequential Models on Volatil-
ity Predictions for Financial Time Series. arXiv e-prints, page arXiv:1811.03711, November 2018.
[12] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the
American statistical Association, 102(477):359–378, 2007.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780,
November 1997.
[14] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk,
and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine
Translation, September 2014. arXiv:1406.1078 [cs, stat].
[15] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent
Networks for Sequence Modeling, April 2018. arXiv:1803.01271 [cs].
[16] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to Sequence Learning with Neural Networks. In Advances
in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
[17] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to
Align and Translate, May 2016. arXiv:1409.0473 [cs, stat].
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and
Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30.
Curran Associates, Inc., 2017.
[19] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with
autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
[20] Bryan Lim, Sercan O. Arik, Nicolas Loeff, and Tomas Pfister. Temporal Fusion Transformers for interpretable
multi-horizon time series forecasting. International Journal of Forecasting, 37(4):1748–1764, October 2021.
[21] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A Multi-Horizon Quantile
Recurrent Forecaster, June 2018.
[22] Alireza Koochali, Peter Schichtel, Andreas Dengel, and Sheraz Ahmed. Probabilistic Forecasting of Sensory Data
With Generative Adversarial Networks – ForGAN. IEEE Access, 7:63868–63880, 2019.
[23] Xingyu Zhou, Zhisong Pan, Guyu Hu, Siqi Tang, and Cheng Zhao. Stock Market Prediction on High-Frequency
Data Using Generative Adversarial Nets. Mathematical Problems in Engineering, 2018:e4907423, April 2018.
[24] Abbas Khosravi, Saeid Nahavandi, and Doug Creighton. A neural network-GARCH-based method for construction
of Prediction Intervals. Electric Power Systems Research, 96:185–193, March 2013.
[25] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of econometrics, 31(3):307–
327, 1986.
[26] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalch-
brenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint
arXiv:1609.03499, 2016.
[27] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim
Januschowski. Deep State Space Models for Time Series Forecasting. In S. Bengio, H. Wallach, H. Larochelle,
K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 31. Curran Associates, Inc., 2018.
[28] Stephan Rasp and Sebastian Lerch. Neural Networks for Postprocessing Ensemble Weather Forecasts. Monthly
Weather Review, 146(11):3885–3900, November 2018.
[29] Sean J. Taylor and Benjamin Letham. Forecasting at Scale. The American Statistician, 72(1):37–45, January
2018.
[30] Trevor J. Hastie. Generalized Additive Models. In Statistical Models in S. Routledge, 1992.
[31] Alex Graves. Generating Sequences With Recurrent Neural Networks, June 2014.
[32] Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama Sundar Rangapuram, David Salinas, Valentin Flunkert,
and Tim Januschowski. Probabilistic Forecasting with Spline Quantile Function RNNs. In Proceedings of the
Twenty-Second International Conference on Artificial Intelligence and Statistics, pages 1901–1910. PMLR, April
2019.
[33] David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. High-dimensional
multivariate forecasting with low-rank Gaussian Copula Processes. In Advances in Neural Information Processing
Systems, volume 32. Curran Associates, Inc., 2019.
[34] Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster, and Tim Januschowski. Deep Factors
for Forecasting. In Proceedings of the 36th International Conference on Machine Learning, pages 6607–6617.
PMLR, May 2019.
[35] Yitian Chen, Yanfei Kang, Yixiong Chen, and Zizhuo Wang. Probabilistic forecasting with temporal convolutional
neural network. Neurocomputing, 399:491–501, July 2020.
[36] Kashif Rasul, Abdul-Saboor Sheikh, Ingmar Schuster, Urs Bergmann, and Roland Vollgraf. Multi-variate
probabilistic time series forecasting via conditioned normalizing flows. arXiv preprint arXiv:2002.06103, 2020.
[37] Emmanuel de Bézenac, Syama Sundar Rangapuram, Konstantinos Benidis, Michael Bohlke-Schneider, Richard
Kurle, Lorenzo Stella, Hilaf Hasson, Patrick Gallinari, and Tim Januschowski. Normalizing Kalman Filters for
Multivariate Time Series Analysis. In Advances in Neural Information Processing Systems, volume 33, pages
2995–3007. Curran Associates, Inc., 2020.
[38] Danilo Rezende and Shakir Mohamed. Variational Inference with Normalizing Flows. In Proceedings of the 32nd
International Conference on Machine Learning, pages 1530–1538. PMLR, June 2015.
[39] Adèle Gouttes, Kashif Rasul, Mateusz Koren, Johannes Stephan, and Tofigh Naghibi. Probabilistic Time Series
Forecasting with Implicit Quantile Networks, July 2021.
[40] Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit Quantile Networks for Distributional
Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, pages
1096–1105. PMLR, July 2018.
[41] Hilaf Hasson, Bernie Wang, Tim Januschowski, and Jan Gasthaus. Probabilistic forecasting: A level-set approach.
Advances in neural information processing systems, 34:6404–6416, 2021.
[42] Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive Denoising Diffusion Models
for Multivariate Probabilistic Time Series Forecasting. In Proceedings of the 38th International Conference on
Machine Learning, pages 8857–8868. PMLR, July 2021.
[43] Ali Caner Türkmen, Tim Januschowski, Yuyang Wang, and Ali Taylan Cemgil. Forecasting intermittent and
sparse time series: A unified probabilistic framework via deep renewal processes. PLOS ONE, 16(11):e0259764,
November 2021.
[44] J. D. Croston. Forecasting and Stock Control for Intermittent Demands. Journal of the Operational Research
Society, 23(3):289–303, September 1972.
[45] Qifa Xu, Shuting Liu, Cuixia Jiang, and Xingxuan Zhuo. QRNN-MIDAS: A novel quantile regression neural
network for mixed sampling frequency data. Neurocomputing, 457:84–105, October 2021.
[46] Kelvin Kan, François-Xavier Aubet, Tim Januschowski, Youngsuk Park, Konstantinos Benidis, Lars Ruthotto,
and Jan Gasthaus. Multivariate Quantile Function Forecaster, December 2022.
[47] Brandon Amos, Lei Xu, and J. Zico Kolter. Input Convex Neural Networks. In Proceedings of the 34th
International Conference on Machine Learning, pages 146–155. PMLR, July 2017.
[48] Alireza Koochali, Andreas Dengel, and Sheraz Ahmed. If You Like It, GAN It—Probabilistic Multivariate Times
Series Forecast with GAN. Engineering Proceedings, 5(1), 2021.
[49] Kang Zhang, Guoqiang Zhong, Junyu Dong, Shengke Wang, and Yong Wang. Stock market prediction based on
generative adversarial network. Procedia computer science, 147:400–406, 2019.
[50] Yilun Lin, Xingyuan Dai, Li Li, and Fei-Yue Wang. Pattern Sensitive Prediction of Traffic Flow Based on
Generative Adversarial Framework. IEEE Transactions on Intelligent Transportation Systems, 20(6):2395–2400,
June 2019.
[51] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer:
Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proceedings of the AAAI Conference
on Artificial Intelligence, 35(12):11106–11115, May 2021. Number: 12.
[52] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition Transformers with
Auto-Correlation for Long-Term Series Forecasting, January 2022. arXiv:2106.13008 [cs].
[53] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency Enhanced
Decomposed Transformer for Long-term Series Forecasting, June 2022. arXiv:2201.12740 [cs, stat].
[54] Si-An Chen, Chun-Liang Li, Nate Yoder, Sercan O. Arik, and Tomas Pfister. TSMixer: An All-MLP Architecture
for Time Series Forecasting, September 2023. arXiv:2303.06053 [cs].
[55] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[56] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning Structured Output Representation using Deep Conditional
Generative Models. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.,
2015.
[57] Ludwig Baringhaus and Carsten Franz. On a new multivariate two-sample test. Journal of multivariate analysis,
88(1):190–206, 2004.
[58] Gábor J Székely and Maria L Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis,
93(1):58–80, 2005.
[59] Alexander Alexandrov, Konstantinos Benidis, Michael Bohlke-Schneider, Valentin Flunkert, Jan Gasthaus, Tim
Januschowski, Danielle C. Maddix, Syama Rangapuram, David Salinas, Jasper Schulz, Lorenzo Stella, Ali Caner
Türkmen, and Yuyang Wang. GluonTS: Probabilistic and Neural Time Series Modeling in Python. Journal of
Machine Learning Research, 21(116):1–6, 2020.
[60] Tijmen Tieleman, Geoffrey Hinton, and others. Lecture 6.5-rmsprop: Divide the gradient by a running average of
its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
[61] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen,
Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito,
Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala.
PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information
Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[62] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. Deep
learning for time series classification: a review. Data Mining and Knowledge Discovery, 33(4):917–963, July
2019.
[63] Georges Hebrail and Alice Berard. Individual household electric power consumption, 2012.
[64] Paulo Cortez, Miguel Rio, Miguel Rocha, and Pedro Sousa. Multi-scale Internet traffic forecasting using neural
networks and time series methods. Expert Systems, 29(2):143–155, 2012. Publisher: Wiley Online Library.
[65] Michael C Mackey and Leon Glass. Oscillation and chaos in physiological control systems. Science,
197(4300):287–289, 1977. Publisher: American Association for the Advancement of Science.
[66] Eduardo Méndez, Omar Lugo, and Patricia Melin. A competitive modular neural network for long-term time
series forecasting. In Nature-Inspired Design of Hybrid Intelligent Systems, pages 243–254. Springer International
Publishing, Cham, 2017.
[67] R. I. Ogie, N. Shukla, F. Sedlar, and T. Holderness. Optimal placement of water-level sensors to facilitate data-
driven management of hydrological infrastructure assets in coastal mega-cities of developing nations. Sustainable
Cities and Society, 35:385–395, 2017.
[68] Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso.
Monash Time Series Forecasting Archive. In Neural Information Processing Systems Track on Datasets and
Benchmarks, 2021.
[69] Randall Pruim, Daniel Kaplan, and Nicholas Horton. mosaicData: Project MOSAIC Data Sets, September 2022.
Appendix A Datasets
This appendix provides further information on the datasets employed in this study. Table 2 lists the main features of
these datasets, and Figure 7 visualizes a segment from each to provide an overview of the diverse dynamics of these
datasets. The rest of this appendix presents further details for each dataset.
Table 2: The properties of univariate datasets utilized for multi-step-ahead forecasting experiments
Dataset Name         | Frequency | Length | History Size | Horizon
Gold Price           | Daily     | 2487   | 60           | 30
HEPC                 | Hourly    | 34569  | 48           | 24
Internet Traffic A1H | Hourly    | 1231   | 48           | 24
Internet Traffic A5M | 5 Minutes | 14772  | 24           | 12
Internet Traffic B1H | Hourly    | 1657   | 48           | 24
Internet Traffic B5M | 5 Minutes | 19888  | 24           | 12
Mackey Glass         | Seconds   | 20000  | 120          | 60
Saugeen River Flow   | Daily     | 23741  | 60           | 30
Sunspot              | Daily     | 73924  | 60           | 30
US Births            | Daily     | 7305   | 60           | 30
Solar                | Hourly    | 8219   | 48           | 24
Wind                 | Hourly    | 8219   | 48           | 24
Gold is a vital commodity with a significant history of global trading. The Gold Price dataset2 provides daily pricing
details, encompassing 2,487 data points from September 1st, 2010, to March 13th, 2020. This dataset facilitates
analytical and predictive modeling of the gold market due to its comprehensive coverage during the specified period.
The Household Electric Power Consumption (HEPC) dataset [63] comprises electric power consumption readings from
a singular household with a sampling rate of one minute. From December 16th, 2006, to November 26th, 2010, it
encapsulates nearly four years of data. Notably, there are timestamps with missing values. An imputation approach,
where missing values are replaced with the average of their adjacent observations, is adopted to ensure data integrity.
Moreover, the dataset is resampled to aggregate the readings into hourly intervals, ensuring manageability while
preserving inherent patterns.
Internet traffic datasets [64] offer data from two distinct ISPs, named A and B. The A dataset pertains to a private ISP
with nodes across 11 European cities, capturing data on a transatlantic link from June 7th to July 29th, 2005. The B dataset is sourced from UKERNA and aggregates traffic across the UK's academic network from November 19th,
2004, to January 27th, 2005. The A dataset logs every 30 seconds, whereas B records every 5 minutes. Aggregated
variants, A5M, A1H, B5M, and B1H, offer data at 5-minute and 1-hour resolutions respectively.
The time-delay differential equation proposed by Mackey and Glass [65]3 stands as a benchmark for generating chaotic
time series. The equation is given by:
$$\dot{x}(t) = \frac{a\, x(t-\tau)}{1 + x(t-\tau)^{10}} - b\, x(t). \qquad (12)$$
Adhering to the parameters in [66], where a = 0.1, b = 0.2, and τ = 17, a dataset of length 20,000 is generated using Eq. (12).
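For illustration, a series following Eq. (12) could be generated with a simple Euler scheme along these lines; the integration step, initial condition, and warm-up handling are our assumptions, not details from the paper:

```python
import numpy as np

def mackey_glass(length: int = 20_000, a: float = 0.1, b: float = 0.2,
                 tau: float = 17.0, dt: float = 1.0, x0: float = 1.2) -> np.ndarray:
    """Euler integration of Eq. (12): dx/dt = a*x(t-tau)/(1 + x(t-tau)^10) - b*x(t)."""
    delay = int(round(tau / dt))
    x = np.full(length + delay, x0)
    for t in range(delay, length + delay - 1):
        x_tau = x[t - delay]
        x[t + 1] = x[t] + dt * (a * x_tau / (1.0 + x_tau ** 10) - b * x[t])
    return x[delay:]
```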
2 Accessible in www.kaggle.com/datasets/arashnic/learn-timeseries-forecasting-from-gold-price
3 The dataset can be accessed at https://git.opendfki.de/koochali/forgan
This dataset [67] includes a univariate time series showcasing the daily mean flow of the Saugeen River at Walkerton, measured in cubic meters per second (m³/s). It captures data from January 1st, 1915, to December 31st, 1979. The dataset is archived in the Monash Time Series Forecasting Archive [68].
These datasets capture solar and wind power production every 4 seconds from August 1st, 2019, for approximately one
year. For the purpose of this research, the datasets are aggregated to represent hourly resolutions. Sourced from the
Australian Energy Market Operator (AEMO) online platform, they are also part of the Monash Time Series Forecasting
Archive [68].
The dataset contains a time series depicting daily sunspot counts from January 8th, 1818, to May 31st, 2020. Any
missing values within the dataset are addressed using the Last Observation Carried Forward (LOCF) method. The Solar
Influences Data Analysis Center is the primary data source, with the dataset also being part of the Monash Time Series
Forecasting Archive [68].
This dataset illustrates the number of births in the US from January 1st, 1969, to December 31st, 1988. Originally
sourced from the R mosaicData package [69], it’s accessed for this research via the Monash Time Series Forecasting
Archive [68].
Appendix B Baselines
This section is dedicated to a detailed examination of 12 advanced probabilistic forecasting models. These models
represent the latest advancements in the field and have demonstrated their effectiveness and robustness across various
research studies. The primary focus will be on elucidating the distinctive features, methodologies, and underlying
principles of each model, highlighting their contributions to advancing probabilistic forecasting. In the main body of this paper, these models are benchmarked as baselines, employing them as comparative standards to evaluate
the performance of newly proposed methodologies in diverse experimental settings. This comparative analysis aims
to provide a comprehensive understanding of where the newly proposed models stand in the context of current
state-of-the-art probabilistic forecasting techniques.
B.1 DeepAR
DeepAR, as presented by Salinas et al. [19], serves as a prominent explicit probabilistic forecasting model in the realm
of time series prediction. At its core, DeepAR constructs a predictive distribution utilizing a Gaussian distribution,
wherein the distribution’s parameters, specifically the mean and standard deviation, are derived from the historical data.
To facilitate this parameter extraction, DeepAR integrates a neural network structure. This architecture predominantly
features a multi-layer Recurrent Neural Network (RNN) to map the time series history onto the Gaussian distribution
parameters. The training regimen for the network adopts a maximum likelihood estimation paradigm. Within this
paradigm, the probability density function of the Gaussian distribution acts as the likelihood function, ensuring optimal
parameter approximation.
B.2 DeepState
Rangapuram et al. [27] proposed the DeepState model as a novel approach that combines state space models (SSMs)
with deep learning for probabilistic time series forecasting. The paper proposes bridging the gap between SSMs and
deep learning by parametrizing a linear SSM with a jointly-learned recurrent neural network (RNN). The deep recurrent
neural network parametrizes the mapping from covariates to state space model parameters. The model is interpretable,
as it allows inspection and modification of SSM parameters for each time series. It can automatically extract features
and learn complex temporal patterns from raw time series and associated covariates. The model promises to combine the interpretability and data efficiency of SSMs with the large-data handling of deep neural networks.
[Figure 7: Segments of the twelve univariate datasets (gold price, household energy consumption, internet traffic, Mackey-Glass, river flow, produced solar and wind energy, sunspot counts, and US births), illustrating their diverse dynamics.]
B.3 DeepFactor
The DeepFactor [34] employs both classical and deep neural time series forecasting models to construct a probabilistic forecaster. DeepFactor is a global-local method where each time series or its latent function is represented as a
combination of a global time series and a corresponding local model. The global factors are modeled by Recurrent
Neural Networks (RNNs). These latent global deep factors can be thought of as dynamic principal components driving
the underlying dynamics of all time series. Local models can include various choices like white noise processes,
linear dynamical systems (LDS), or Gaussian processes (GPs). This stochastic component captures individual random
effects for each time series. DeepFactor systematically combines deep neural networks and probabilistic models in a
global-local framework and develops an efficient and scalable inference algorithm for non-Gaussian likelihoods.
B.4 Deep Renewal Processes
Deep Renewal Processes [43] is a novel framework for probabilistic forecasting of intermittent and sparse time series.
The method focuses on forecasting demand data that appears sporadically, posing unique challenges due to long periods
of zero demand followed by non-zero demand. The framework is based on discrete-time renewal processes, which
are adept at handling patterns like aging, clustering, and quasi-periodicity in demand arrivals. The model innovatively
incorporates neural networks, specifically recurrent neural networks (RNNs), replacing exponential smoothing with
a more flexible neural approach. Moreover, the framework also extends to model continuous-time demand arrivals,
enhancing flexibility and applicability.
B.5 GPForecaster
For the Gaussian Process (GP), we employed a model implemented in GluonTS [59]. This model defines a Gaussian
Process with Radial Basis Function (RBF) as the kernel for each time step. GP with an RBF kernel can capture both the
trends and nuances of time series data, while its probabilistic nature provides a measure of uncertainty or confidence
in its predictions. This combination is especially valuable in scenarios where understanding the uncertainty of future
events is as crucial as the predictions themselves.
B.6 MQ-RNN
The Multi-Horizon Quantile Recurrent Forecaster, or MQ-RNN for short [21], is a direct quantile regression method that aims to forecast the quantiles of the predictive distribution for multiple horizons directly. The model employs a Seq2Seq architecture for quantile regression. Besides MQ-RNN, this paper also utilizes MQ-CNN, which is the same model with CNN components as the main components of the Seq2Seq architecture instead of RNN modules.
MQ-CNN is a similar method to MQ-RNN, with CNN components as the main components of Seq2Seq architecture
instead of RNN modules.
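Both models are trained with the quantile (pinball) loss. The following sketch shows that loss for a multi-horizon, multi-quantile output; the tensor shapes and quantile levels are illustrative choices, not the original implementation.

```python
import torch

def pinball_loss(y_true, y_pred, quantiles):
    """Quantile (pinball) loss as used by direct quantile regressors such as MQ-RNN/MQ-CNN.

    y_true:    (batch, horizon)
    y_pred:    (batch, horizon, num_quantiles)
    quantiles: iterable of quantile levels, e.g. [0.1, 0.5, 0.9]
    """
    losses = []
    for i, q in enumerate(quantiles):
        diff = y_true - y_pred[..., i]
        # Under-prediction is penalised with weight q, over-prediction with weight (1 - q).
        losses.append(torch.maximum(q * diff, (q - 1) * diff))
    return torch.stack(losses, dim=-1).mean()

# Dummy usage: 32 series, a 12-step horizon, and 3 forecast quantiles.
y_true = torch.randn(32, 12)
y_pred = torch.randn(32, 12, 3)
loss = pinball_loss(y_true, y_pred, [0.1, 0.5, 0.9])
```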
B.8 Prophet
Prophet [29] is a robust and scalable solution for forecasting time series data, particularly in business contexts
where handling scenarios such as multiple strong seasonalities, trend changes, outliers, and holiday effects
is paramount. Prophet decomposes the time series into three components. The first component, trend, captures
non-periodic changes in the data. The second component, seasonality, represents periodic changes, such as weekly and
yearly cycles. Finally, the last component, holidays, accounts for holidays and events that occur on irregular
schedules. Each component is modeled separately and fitted using Bayesian inference to estimate its parameters.
Furthermore, the model is designed to be intuitive, allowing analysts to make informed adjustments based on domain
knowledge.
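A minimal usage sketch of the open-source Prophet library is given below on a synthetic daily series (the data, seasonality settings, and 30-day horizon are placeholders; the "ds"/"y" column names are required by the library).

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Synthetic two-year daily series with a linear trend and a weekly cycle.
dates = pd.date_range("2020-01-01", periods=730, freq="D")
y = 10 + 0.01 * np.arange(730) + np.sin(2 * np.pi * np.arange(730) / 7)
df = pd.DataFrame({"ds": dates, "y": y})

m = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m.fit(df)

future = m.make_future_dataframe(periods=30)          # extend 30 days past the training data
forecast = m.predict(future)                          # yhat plus yhat_lower / yhat_upper intervals
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```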
B.9 WaveNet
WaveNet [26] is a powerful generative model originally proposed for raw audio synthesis. WaveNet forecasts sequentially
in an autoregressive fashion, using previously generated samples as input to predict the next one. It employs a deep
neural network consisting of stacks of dilated causal convolutions, enabling it to capture a wide range of
frequencies and temporal dependencies. WaveNet defines the predictive distribution for each time step in the horizon by
quantizing the time series and applying a softmax at the output layer of the model. In other words, WaveNet quantizes
the time series to turn forecasting into a classification problem and provides the output of the softmax layer as
the predictive distribution.
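The sketch below shows the core mechanism in PyTorch: a stack of dilated causal convolutions followed by a softmax over quantized value bins. It is a simplified illustration (no gated activations or skip connections), and the channel count, dilation schedule, and bin count are assumptions rather than the original architecture.

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    """WaveNet-style stack of dilated causal convolutions ending in logits over value bins."""

    def __init__(self, channels=32, num_bins=256, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        self.input_proj = nn.Conv1d(1, channels, kernel_size=1)
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d) for d in dilations
        )
        self.dilations = dilations
        self.output_proj = nn.Conv1d(channels, num_bins, kernel_size=1)

    def forward(self, x):                              # x: (batch, 1, time)
        h = self.input_proj(x)
        for conv, d in zip(self.convs, self.dilations):
            # Left-pad so each output only depends on past and current inputs (causality).
            h = h + torch.relu(conv(nn.functional.pad(h, (d, 0))))
        return self.output_proj(h)                     # logits over bins: (batch, num_bins, time)

logits = DilatedCausalStack()(torch.randn(4, 1, 64))
probs = torch.softmax(logits, dim=1)                   # per-step predictive distribution over bins
```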
B.10 Transformer
The Transformer [18] is a sequence-to-sequence model that departs from previous methods by relying entirely on
attention mechanisms, dispensing with recurrent and convolutional neural networks. The Transformer follows the typical
encoder-decoder structure, but both the encoder and the decoder are composed of a stack of identical layers, each with
a self-attention mechanism. The self-attention mechanism allows the model to weigh the importance of different time
steps in the input sequence, enabling it to capture context more effectively. At the core of the model is the scaled
dot-product attention, which calculates attention scores from the scaled dot products of queries and keys. Unlike
recurrent models, the Transformer allows for far more parallelization, making training faster. Moreover, the
self-attention mechanism enables the model to learn long-range dependencies more effectively.
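The scaled dot-product attention at the core of the model is compact enough to sketch directly: softmax(QK^T / sqrt(d_k)) V. The snippet below shows a single-head, unmasked version; the batch size, sequence length, and dimensionality are placeholder values.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, without masking or multiple heads."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, T_q, T_k) attention scores
    weights = torch.softmax(scores, dim=-1)             # how much each query attends to each key
    return weights @ v                                   # weighted sum of the values

# Dummy usage: a batch of 8 sequences, 24 time steps, 16-dimensional representations.
q = k = v = torch.randn(8, 24, 16)
out = scaled_dot_product_attention(q, k, v)              # (8, 24, 16)
```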
B.11 TFT
The Temporal Fusion Transformer (TFT) [20] is an encoder-decoder-based quantile regression model for multi-horizon
forecasting. Several architectural novelties make its performance excel in many scenarios, including:
• Static Covariate Encoders: These encode static context vectors, which are crucial for conditioning the
temporal dynamics within the network.
• Gating Mechanisms: They provide adaptive depth and complexity to the network, allowing it to handle a
wide range of datasets and scenarios (a minimal gating sketch is given below).
• Variable Selection Networks: These networks select relevant input variables at each time step, enhancing
interpretability and model performance by focusing on the most salient features.
• Temporal Processing: The model employs a sequence-to-sequence layer for local processing and interpretable
multi-head attention mechanisms to capture long-term dependencies.
The TFT demonstrates superior performance across various real-world datasets, outperforming existing benchmarks in
multi-horizon forecasting. The model facilitates interpretability, allowing users to identify globally important variables,
persistent temporal patterns, and significant events.
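As referenced in the gating bullet above, the following sketch illustrates GLU-style gating of the kind used inside TFT's gated residual blocks. The dimensions and placement are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class GatedLinearUnit(nn.Module):
    """GLU-style gate: a sigmoid branch scales a linear branch element-wise."""

    def __init__(self, dim):
        super().__init__()
        self.value = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        # The sigmoid gate can suppress an entire block, giving the network adaptive depth.
        return torch.sigmoid(self.gate(x)) * self.value(x)

x = torch.randn(32, 64)
gated = GatedLinearUnit(64)(x) + x      # gated output fed back through a residual connection
```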
B.12 ForGAN
ForGAN [22] is a probabilistic forecaster that employs a conditional GAN to learn a mapping from a prior distribution
to the predictive distribution conditioned on an input history window. It incorporates Recurrent Neural Network (RNN)
modules in both the Generator and Discriminator components. ForGAN was originally proposed for one-step ahead
univariate time series forecasting. In this paper, we extend the model to multi-step ahead forecasting by applying it
autoregressively.
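The autoregressive extension can be sketched as rolling a one-step conditional generator forward and feeding each generated value back into the condition window. The `generator(noise, window)` interface, noise dimension, and sample count below are assumptions used only to illustrate the rollout.

```python
import torch

def autoregressive_forecast(generator, history, horizon, num_samples=100):
    """Roll a one-step-ahead conditional generator forward to obtain multi-step samples."""
    samples = []
    for _ in range(num_samples):
        window = history.clone()                                 # (1, window_length) condition
        path = []
        for _ in range(horizon):
            noise = torch.randn(1, 32)                           # sample from the prior distribution
            next_step = generator(noise, window)                 # (1, 1) one-step-ahead sample
            path.append(next_step)
            window = torch.cat([window[:, 1:], next_step], dim=1)  # slide the condition window
        samples.append(torch.cat(path, dim=1))
    return torch.stack(samples)                                  # (num_samples, 1, horizon)
```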
Friedman test: The initial analytical step applies the Friedman test to detect significant differences among the
algorithms across multiple datasets. The null hypothesis, H0, posits equivalent performance across all tested models,
while the alternative hypothesis, H1, states that at least one algorithm performs differently from the others.
Rejecting H0 in favor of H1 indicates that the CD diagram is meaningful and justifies proceeding to the subsequent
statistical tests.
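In practice this test can be run with SciPy, as sketched below on made-up per-dataset CRPS scores (rows are datasets, columns are models).

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Illustrative CRPS scores: 6 datasets (rows) x 3 models (columns).
crps = np.array([
    [0.21, 0.25, 0.24],
    [0.48, 0.52, 0.55],
    [0.10, 0.11, 0.13],
    [0.33, 0.31, 0.36],
    [0.27, 0.30, 0.29],
    [0.19, 0.22, 0.21],
])
stat, p_value = friedmanchisquare(crps[:, 0], crps[:, 1], crps[:, 2])
if p_value < 0.05:
    print("Reject H0: at least one model performs differently; proceed to the CD diagram.")
```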
Wilcoxon signed-rank test: For pairwise comparisons of model performances across all datasets, the study employs
the Wilcoxon signed-rank test. This non-parametric test facilitates the comparison of two paired samples without
assuming that the differences between the pairs follow a normal distribution. Given two models, A and B, with their
performances represented as sets of CRPS values spanning the datasets, the null hypothesis, H0, asserts a median
difference of zero between the paired CRPS values, suggesting similar performance of Models A and B across all
datasets. The outcome, acceptance or rejection of H0, defines the critical difference band within the CD diagram.
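A corresponding SciPy call is sketched below on illustrative paired CRPS values for two models across six datasets.

```python
from scipy.stats import wilcoxon

# Paired per-dataset CRPS values for two models (illustrative placeholders).
crps_model_a = [0.21, 0.48, 0.10, 0.33, 0.27, 0.19]
crps_model_b = [0.25, 0.52, 0.11, 0.31, 0.30, 0.22]

stat, p_value = wilcoxon(crps_model_a, crps_model_b)
# A small p-value rejects H0 (zero median difference), placing the pair
# outside the critical difference band of the CD diagram.
```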
Paired t-test: Finally, the paired t-test is applied to rank model performances within specific datasets. A paired
t-test is used to compare the means of two related groups when the samples are dependent; that is, each observation in
one sample can be paired with an observation in the other sample. Within the scope of model performance evaluated
over multiple runs, these dependent observations represent the performances of two distinct models assessed on an
identical dataset across separate runs. Let CRPS_A denote the mean CRPS of Model A, computed over three runs on a
specific dataset. If the inequality CRPS_A < CRPS_B holds for a dataset, then the null hypothesis H0 of the one-sided
paired t-test posits superior or equivalent performance of Model B relative to Model A across all runs. The
alternative hypothesis, H1, counters this by indicating statistically superior performance of Model A. We sequentially
applied this test to every possible model pair within a dataset to determine their ranks. In scenarios where the null
hypothesis faced rejection, Model A was accorded a superior rank. Conversely, an acceptance led to the assignment of
identical ranks to both Model A and Model B. Subsequent to the ranking procedure, the models’ average ranks across
all datasets were computed to establish their respective positions on the CD diagram.
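This per-dataset ranking step can be expressed with SciPy's paired t-test, as sketched below on illustrative per-run CRPS values (the one-sided `alternative` argument requires SciPy 1.6 or newer).

```python
from scipy.stats import ttest_rel

# Per-run CRPS of two models on one dataset (illustrative values for three runs each).
crps_runs_a = [0.210, 0.205, 0.215]
crps_runs_b = [0.230, 0.228, 0.236]

# H1 with alternative="less": model A's CRPS is statistically lower (better) than model B's.
stat, p_value = ttest_rel(crps_runs_a, crps_runs_b, alternative="less")
rank_a_above_b = p_value < 0.05          # rejecting H0 gives Model A the superior rank
```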
[Figure 10: Sample of the forecast for VAEneu models alongside the best baseline model. The panels (x-axis: Horizon) cover the HEPC, Solar-4-Seconds, Sunspot, US Birth, and Wind-4-Seconds datasets, showing VAEneu-TCN and VAEneu-RNN next to the strongest baseline (ForGAN, Wavenet, or TFT) for each dataset.]