Tail-Aware Density Forecasting of Locally Explosive Time Series: A Neural Network Approach

Elena Dumitrescu 
University Paris-Panthéon-Assas, CRED, 75005 Paris, France
and
Julien Peignon
CEREMADE, CNRS, UMR 7534, Université Paris-Dauphine
PSL University, Paris, France
and
Arthur Thomas
Université Paris-Dauphine, Université PSL
LEDa, CNRS, IRD, 75016 Paris, France
We are grateful to Serge Darolles, Christian Francq, Laurent Ferrara, Gaëlle le Fol, Daniel Velasquez Gaviria, Christian Gouriéroux, Alain Hecq, Loïc Henry, Joann Jasiak, Sébastien Laurent, Yannick Le Pen, Gabriele Mingoli, Aryan Manafi Neyazi, Fabrice Rossi and Jean-Michel Zakoian. We also thank the seminar participants at Maastricht University School of Business and Economics, as well as the participants at the 19th CFE conference, the 24th Conference Développements Récents de l’Econométrie Appliquée à la Finance (University of Paris Nanterre), and the First Workshop in Noncausal Econometrics for helpful comments and discussions.
Abstract

This paper proposes a Mixture Density Network specifically designed for forecasting time series that exhibit locally explosive behavior. By incorporating skewed t-distributions as mixture components, our approach offers enhanced flexibility in capturing the skewed, heavy-tailed, and potentially multimodal nature of predictive densities associated with bubble dynamics modeled by mixed causal-noncausal ARMA processes. In addition, we implement an adaptive weighting scheme that emphasizes tail observations during training and hence leads to accurate density estimation in the extreme regions most relevant for financial applications. Equally important, once trained, the MDN produces near-instantaneous density forecasts. Through extensive Monte Carlo simulations and two empirical applications, on the natural gas price and inflation, we show that the proposed MDN-based framework delivers superior forecasting performance relative to existing approaches.

Keywords: Forecasting, Noncausal Models, Mixture Density Networks

1 Introduction

Time-series forecasts using causal ARMA models have been playing a crucial role in economic and financial decision-making processes for a long time. A conclusion often reached in this context where the current value of the variable of interest is forced to depend only on its past is that one- and multi-step ahead forecasting in periods of high instability or in presence of forward looking behavior of economic agents is particularly difficult. Indeed, these causal models are characterized by mean reversion, i.e., their forecasts converge to the unconditional mean even after an extreme event occurs, regardless of whether such behavior reflects the true underlying dynamics. In the specific case of financial prices, which we are interested in, the traditional properties of heavy-tailed marginal distributions and volatility clustering are at the core of forecasting models. But a close look at the dynamics of various types of asset prices, sometimes called speculative assets, reveals the presence of phases of locally explosive behaviour: rising patterns followed by a burst, local trends and spikes. Such non-linear characteristics are, however, very poorly captured by standard financial econometric models.

Recently, (causal-)noncausal autoregressive processes have been found to be suitable for modelling such locally explosive behaviour as they allow for dependence on the future. These simple linear models produce rich non-linear patterns without requiring non-linearities to be imposed ex ante. Importantly, noncausal processes are grounded in macroeconomic theory, as they arise as stationary solutions of rational expectation models under infinite variance (Gourieroux et al. 2020). Relevant applications of these models range from asset prices (see Fries & Zakoian 2019, Gourieroux & Zakoian 2017, Gourieroux & Jasiak 2018, Hecq & Velasquez-Gaviria 2025), to macroeconomic data (see Lanne & Saikkonen 2011, Davis & Song 2020), commodity prices (Blasques et al. 2025), climate risk on El Niño and La Niña (de Truchis et al. 2025), green stock prices (Giancaterini et al. 2025), and to electronic currency exchange rates (Cavaliere et al. 2020).

In this context, our paper proposes a novel forecasting methodology for causal–noncausal autoregressive processes based on a specifically-designed neural network architecture and training procedure. Indeed, building predictions with this class of models has been shown to be particularly difficult due to their dependence on future values. In fact, it has long been thought that the conditional predictive density of mixed causal-noncausal processes does not have closed-form and inference can only be performed by simulation-based or Bayesian methods (see Lanne et al. 2012, Gourieroux & Jasiak 2016, Nyberg & Saikkonen 2014). Although these approaches constitute flexible alternatives for predicting general noncausal processes, Hecq & Voisin (2021) highlight two main drawbacks: they become computationally intensive for long‑horizon forecasts and do not accurately capture the dynamics of extreme events concentrated in the tails of the distribution. This limitation is highly problematic given that modeling explosive tail behavior is the primary motivation for using noncausal processes in the first place.

More importantly, noncausal processes exhibit highly non-linear and process-specific tail behavior that must be properly accounted for in forecasting. For example, when the conditional predictive density becomes multimodal, point forecasts become meaningless, as they fail to represent the fundamental dichotomy between bubble continuation and collapse (see Gourieroux & Zakoian 2017, Fries 2022, de Truchis et al. 2025, Gourieroux et al. 2025). For this reason, the noncausal literature as a whole, and this paper in particular, focuses exclusively on density forecasting rather than point prediction. Additionally, this tail behavior may also explain the difficulties encountered by standard numerical approaches when modelling extreme values (see Hecq & Voisin 2021) as well as the failure of state-of-the-art machine learning methods to accurately forecast locally explosive dynamics (Saïdi 2023).

This paper takes a new approach to forecasting noncausal processes, i.e., locally explosive dynamics. We develop a specifically-designed neural network architecture to accurately estimate the full conditional predictive density of univariate general mixed causal-noncausal autoregressive moving average (MARMA) processes. More precisely, we introduce a newly tailored Mixture Density Network (MDN) based on skewed t-distributions as mixture components, which naturally captures both the multimodality and heavy-tailed asymmetric nature of predictive densities during explosive episodes. This contrasts with traditional MDNs (à la Bishop 1994), whose underlying Gaussian assumption cannot accommodate such features. Furthermore, since tail observations are inherently rare, we develop an adaptive weighting scheme that emphasizes extreme regions during training, enabling accurate density estimation in the parts of the distribution most relevant for financial applications.

Another significant advantage of our approach is that, once trained, the MDN produces near-instantaneous density forecasts for all forecasting horizons under analysis. By leveraging neural networks, this paper contributes to the growing literature on machine learning methods for economic and financial applications (Athey & Imbens 2019), and specifically for uncertainty quantification in forecasting. However, we do not pursue a strictly statistical path aimed at identifying true predictive densities or establishing formal properties of the forecasts, such as identifiability, asymptotic convergence, or Markov structure (see, e.g., Gouriéroux & Monfort 2025). We rather exploit the approximation power of neural networks to directly estimate the predictive densities. Our procedure accommodates short time series by proceeding in two steps: first, estimate the underlying noncausal model, and then learn the associated predictive density from simulated paths of this estimated process. Nevertheless, one could directly apply our MDN approach to the raw data, provided that it is sufficiently long.

We compare our MDN with noncausal-specific density forecasting methods and with more general techniques through both extensive Monte Carlo simulations and two empirical applications, on U.S. natural gas price and inflation. In all settings, our approach outperforms competing methods while requiring substantially less computational time. It also achieves superior point‑forecast accuracy compared to standard benchmarks in both applications.

The paper is structured as follows. Section 2 details our machine learning forecasting method. The Monte Carlo analysis is detailed in Section 3. Section 4 summarizes the two empirical illustrations, while Section 5 concludes.

2 Forecasting with Mixture Density Networks

Traditional Mixture Density Networks (Bishop 1994) provide a flexible framework for modeling conditional probability distributions by outputting the parameters of a Gaussian mixture model through a neural network. However, Gaussian mixtures struggle to adequately capture the heavy-tailed behavior of financial prices and lead to systematic underestimation of extreme event probabilities. To address this deficiency, we propose a new MDN, which relies on skewed t-distribution components instead of Gaussian ones to provide flexible parametric control over both skewness and tail heaviness. More precisely, it models the conditional density as:

ph(Xt+h|𝐗t)=j=1Kπj,h(𝐗t)f(Xt+h;μj,h(𝐗t),σj,h(𝐗t),ξj,h(𝐗t),νj,h(𝐗t)),p_{h}(X_{t+h}|\mathbf{X}_{t})=\sum_{j=1}^{K}\pi_{j,h}(\mathbf{X}_{t})\cdot f(X_{t+h};\mu_{j,h}(\mathbf{X}_{t}),\sigma_{j,h}(\mathbf{X}_{t}),\xi_{j,h}(\mathbf{X}_{t}),\nu_{j,h}(\mathbf{X}_{t})),

with f(y;μ,σ,ξ,ν)=2σt(yμσ;ν)T(ξyμσν+1ν+yμσ;ν+1)f(y;\mu,\sigma,\xi,\nu)=\frac{2}{\sigma}t\left(\frac{y-\mu}{\sigma};\nu\right)T\left(\xi\frac{y-\mu}{\sigma}\sqrt{\frac{\nu+1}{\nu+\frac{y-\mu}{\sigma}}};\nu+1\right) the skewed t-distribution probability density function, where t(;ν)t(\cdot;\nu) denotes the Student-t density and T(;ν)T(\cdot;\nu) its cumulative distribution function, both with ν\nu degrees of freedom that control tail thickness. We denote the forecasting horizon by hh and let 𝐗t=(Xt,Xt1,,XtL+1)\mathbf{X}_{t}=(X_{t},X_{t-1},\ldots,X_{t-L+1}) be a vector of LL consecutive observations up to time tt.111Intuitively, machine learning approaches can naturally leverage vectors of multiple past observations. Kernel estimators could also theoretically be extended to incorporate multiple conditioning lags, but at the expense of an exponential increase in computational complexity due to the curse of dimensionality. Throughout the rest of the paper, we mainly use 𝐗t=Xt\mathbf{X}_{t}=X_{t} to insure a fair comparison across methods. The other parameters are the location μ\mu\in\mathbb{R}, the scale σ>0\sigma>0, and the shape ξ\xi\in\mathbb{R}, which governs asymmetry.222We found this parameterization of the MDN to be numerically more robust than the Tukey g-and-h component-based approach of Guillaumin & Efremova (2024), as it does not need computing numerical inverse transforms via binary search, which can cause training instability when dealing with extreme values in the tails. Additionally, the skewed t-distribution admits well-established multivariate extensions (see Azzalini & Dalla Valle 1996), providing a natural pathway for extending our framework to vector-valued time series in future work. These predictive densities provide a complete characterization of forecast uncertainty: they can be used to construct prediction intervals, assess tail risk probabilities during explosive episodes, and derive point forecasts. We develop our own network architecture and training strategy as follows.

2.1 Network Architecture

Our network architecture is lightweight and parsimonious, consisting of a fully connected multilayer perceptron (MLP) with two hidden layers of dimension 64 and five parallel output heads, one for each parameter of the skewed t-distribution mixture (𝝅,𝝁,𝝈,𝝃,𝝂\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\sigma},\boldsymbol{\xi},\boldsymbol{\nu}). This makes it efficient to train even on standard CPU hardware.333While recurrent architectures such as RNNs or LSTMs could capture additional temporal dependencies, we opt for this simpler feedforward structure to maintain computational efficiency and ease of implementation in applied settings. The hidden layers use rectified linear unit (ReLU) activations, a standard choice in deep learning due to its computational efficiency and ability to mitigate vanishing gradient issues (Goodfellow et al. 2016):

𝐳(0)(𝐗t)\displaystyle\mathbf{z}^{(0)}(\mathbf{X}_{t}) =𝐗tL,\displaystyle=\mathbf{X}_{t}\in\mathbb{R}^{L},
𝐳(1)(𝐗t)\displaystyle\mathbf{z}^{(1)}(\mathbf{X}_{t}) =ReLU(𝐖(0)𝐳(0)(𝐗t)+𝐛(0))64,\displaystyle=\text{ReLU}\left(\mathbf{W}^{(0)}\mathbf{z}^{(0)}(\mathbf{X}_{t})+\mathbf{b}^{(0)}\right)\in\mathbb{R}^{64},
𝐳(2)(𝐗t)\displaystyle\mathbf{z}^{(2)}(\mathbf{X}_{t}) =ReLU(𝐖(1)𝐳(1)(𝐗t)+𝐛(1))64,\displaystyle=\text{ReLU}\left(\mathbf{W}^{(1)}\mathbf{z}^{(1)}(\mathbf{X}_{t})+\mathbf{b}^{(1)}\right)\in\mathbb{R}^{64},

where 𝐖(0)64×L\mathbf{W}^{(0)}\in\mathbb{R}^{64\times L}, 𝐖(1)64×64\mathbf{W}^{(1)}\in\mathbb{R}^{64\times 64}, 𝐛(0),𝐛(1)64\mathbf{b}^{(0)},\mathbf{b}^{(1)}\in\mathbb{R}^{64}, and ReLU(x)=max(0,x)\text{ReLU}(x)=\max(0,x) is applied element-wise. The five output heads map the final hidden representation 𝐳(2)(𝐗t)\mathbf{z}^{(2)}(\mathbf{X}_{t}) to the mixture parameters:

𝝅(𝐗t)\displaystyle\boldsymbol{\pi}(\mathbf{X}_{t}) =Softmax(𝐖π𝐳(2)(𝐗t)+𝐛π)[0,1]K,j=1Kπj=1\displaystyle=\text{Softmax}(\mathbf{W}_{\pi}\mathbf{z}^{(2)}(\mathbf{X}_{t})+\mathbf{b}_{\pi})\in[0,1]^{K},\;\sum_{j=1}^{K}\pi_{j}=1
𝝁(𝐗t)\displaystyle\boldsymbol{\mu}(\mathbf{X}_{t}) =𝐖μ𝐳(2)(𝐗t)+𝐛μK\displaystyle=\mathbf{W}_{\mu}\mathbf{z}^{(2)}(\mathbf{X}_{t})+\mathbf{b}_{\mu}\in\mathbb{R}^{K}
𝝈(𝐗t)\displaystyle\boldsymbol{\sigma}(\mathbf{X}_{t}) =Softplus(𝐖σ𝐳(2)(𝐗t)+𝐛σ)+K\displaystyle=\text{Softplus}(\mathbf{W}_{\sigma}\mathbf{z}^{(2)}(\mathbf{X}_{t})+\mathbf{b}_{\sigma})\in\mathbb{R}_{+}^{K}
𝝃(𝐗t)\displaystyle\boldsymbol{\xi}(\mathbf{X}_{t}) =𝐖ξ𝐳(2)(𝐗t)+𝐛ξK\displaystyle=\mathbf{W}_{\xi}\mathbf{z}^{(2)}(\mathbf{X}_{t})+\mathbf{b}_{\xi}\in\mathbb{R}^{K}
𝝂(𝐗t)\displaystyle\boldsymbol{\nu}(\mathbf{X}_{t}) =Softplus(𝐖ν𝐳(2)(𝐗t)+𝐛ν)+K,\displaystyle=\text{Softplus}(\mathbf{W}_{\nu}\mathbf{z}^{(2)}(\mathbf{X}_{t})+\mathbf{b}_{\nu})\in\mathbb{R}_{+}^{K},

where K=10K=10 is the number of mixture components, and each output head has weight matrix 𝐖K×64\mathbf{W}_{\bullet}\in\mathbb{R}^{K\times 64} and bias 𝐛K\mathbf{b}_{\bullet}\in\mathbb{R}^{K}. The softmax function, Softmax(𝐱)j=exj/i=1Kexi\text{Softmax}(\mathbf{x})_{j}=e^{x_{j}}/\sum_{i=1}^{K}e^{x_{i}}, maps an unconstrained KK-vector to valid probability weights summing to one. The softplus function, Softplus(x)=log(1+ex)\text{Softplus}(x)=\log(1+e^{x}), is applied element-wise and provides a smooth transformation ensuring strictly positive outputs for the scale (σ\sigma) and degree of freedom (ν\nu) parameters. The location (μ\mu) and skewness (ξ\xi) parameters remain unconstrained to allow for arbitrary centering and both left and right asymmetry.

The model is implemented in PyTorch with weights initialized via the Kaiming uniform scheme (He et al. 2015), and trained by stochastic gradient descent (SGD) with the Adam optimizer (Kingma & Ba 2015), minimizing the negative log-likelihood of the mixture density. We add noise regularization during training by injecting small Gaussian perturbations to the input data. This smooths the estimated density by implicitly penalizing the Hessian of the log-likelihood, an approach promoted in the conditional density estimation literature (Rothfuss et al. 2019, 2020). This architecture has approximately 8,000 parameters, exceeding the number of observations available in both our simulations and empirical applications, and is therefore not identifiable. However, for a purely forecasting-oriented approach, this overparameterization is not a limitation but rather an advantage, as it can improve generalization without overfitting (Belkin et al. 2019).

2.2 Adaptive Weighting Function

We focus on financial prices rather than returns, as differencing eliminates the explosive dynamics that we aim to forecast. Prices naturally exhibit imbalance between normal market conditions and extreme events, yet learning algorithms prioritize frequent observations over rare tail events that are crucial for risk assessment. To address this, we combine resampling strategies (Avelino et al. 2024) with cost-sensitive learning (Steininger et al. 2021), which encourage the MDN to focus on extreme events during training.

Let 𝒟\mathcal{D} denote the training sample of size T=|𝒟|T=|\mathcal{D}|, and let the subset of tail events in the training sample be given by ={t𝒟:Xt< or Xt>u}.\mathcal{E}=\{t\in\mathcal{D}:X_{t}<\ell\text{ or }X_{t}>u\}. The set \mathcal{E} is defined using only the scalar XtX_{t}, not the entire lagged vector 𝐗t\mathbf{X}_{t} used as input. The boundaries \ell and uu are determined using the generalized boxplot methodology of Bruffaerts et al. (2014).444=ξδ/2TGH,u=ξ1δ/2TGH,\ell=\xi^{\text{TGH}}_{\delta/2},\quad u=\xi^{\text{TGH}}_{1-\delta/2}, for a fixed detection rate δ\delta, where ξTGH\xi^{\text{TGH}} is the empirical Tukey g-h CDF fitted to a rank-preserving transformation of the observed data to accurately capture skewness and tail heaviness of the data. Then, the adaptive weight function is given by:

wt={T||,if t1,otherwise.w_{t}=\begin{cases}\sqrt{\frac{T}{|\mathcal{E}}|},&\text{if }t\in\mathcal{E}\\ 1,&\text{otherwise}.\end{cases} (1)

The weighting mechanism operates at two levels during training, which allows the model to focus on rare tail events without requiring modifications to the underlying MDN architecture. First, during mini-batch construction, we give more weight to rare cases by employing a weighted random sampling with replacement approach.555A mini-batch 𝒟\mathcal{B}\subset\mathcal{D} is a small random subset of training observations used to compute gradient updates at each iteration of the SGD. Under this scheme, the probability that observation tt is sampled in a mini-batch \mathcal{B} is given by P(t)=wt/i=1Twi.P(t)={w_{t}}/{\sum_{i=1}^{T}w_{i}}. Second, within each mini-batch \mathcal{B}, we apply cost-sensitive learning through a weighted loss function. At each gradient step, the model parameters 𝜽\boldsymbol{\theta} are updated by minimizing the weighted negative log-likelihood,

(𝜽)=1jwjjwtlogph(Xj+h|𝐗j;𝜽),\mathcal{L}(\boldsymbol{\theta})=-\frac{1}{\sum_{j\in\mathcal{B}}w_{j}}\sum_{j\in\mathcal{B}}w_{t}\log p_{h}(X_{j+h}|\mathbf{X}_{j};\boldsymbol{\theta}),

where jj indexes observations in mini-batch \mathcal{B}. This dual application of weights, both in sampling and loss computation, results in an effective weight of T/||T/|\mathcal{E}| for tail events. This justifies the square-root scaling in (1) to achieve the desired inverse proportion weighting.

2.3 Post-hoc Calibration

By reweighting the training distribution to emphasize extreme events, we train the model on a distribution that differs from the data-generating process. Specifically, when we assign higher weights to tail observations, the model learns to predict conditional distributions p^h(Xt+h|𝐗t)\hat{p}_{h}(X_{t+h}|\mathbf{X}_{t}) that reflect this reweighted empirical distribution rather than the original empirical distribution. This shift causes systematic biases: the model overestimates the probability of extreme events across the entire input space, leading to miscalibrated predictions when evaluated with respect to the original, unweighted distribution.

To correct for miscalibration, we use the method of Dey et al. (2022) based on the uniformity property of the Probability Integral Transform (PIT), PITh(Xt+h,𝐗t)=Xt+hp^h(z|𝐗t)𝑑z=F^h(Xt+h|𝐗t),\text{PIT}_{h}(X_{t+h},\mathbf{X}_{t})=\int_{-\infty}^{X_{t+h}}\hat{p}_{h}(z|\mathbf{X}_{t})\,dz=\hat{F}_{h}(X_{t+h}|\mathbf{X}_{t}), where F^h\hat{F}_{h} is the predicted cumulative distribution function. This recalibration procedure is applied after the initial model training on 𝒟\mathcal{D} and requires a distinct calibration dataset, 𝒟cal\mathcal{D}_{\text{cal}}. First, we use the trained model to compute the PIT values for all couples of observations (Xt+hcal,𝐗tcal)𝒟cal(X^{\text{cal}}_{t+h},\mathbf{X}^{\text{cal}}_{t})\in\mathcal{D}_{\text{cal}}. Then, the conditional distribution of PIT values is estimated as follows: for each threshold τ\tau on a grid 𝒢[0,1]\mathcal{G}\subset[0,1], a separate XGBoost classifier (Chen & Guestrin 2016) is trained to predict the binary outcome 𝟏{PITh(Xt+h,𝐗t)τ}\mathbf{1}\{\text{PIT}_{h}(X_{t+h},\mathbf{X}_{t})\leq\tau\} from the calibration features 𝐗tcal\mathbf{X}^{\text{cal}}_{t}. The predicted probability from each classifier provides an estimate of

β^h(τ|𝐗tcal)=(PITh(Xt+h,𝐗t)τ|𝐗tcal)=(F^h(Xt+hcal|𝐗tcal)τ|𝐗tcal),τ[0,1].\hat{\beta}_{h}(\tau|\mathbf{X}^{\text{cal}}_{t})=\mathbb{P}(\text{PIT}_{h}(X_{t+h},\mathbf{X}_{t})\leq\tau|\mathbf{X}^{\text{cal}}_{t})=\mathbb{P}(\hat{F}_{h}(X^{\text{cal}}_{t+h}|\mathbf{X}^{\text{cal}}_{t})\leq\tau|\mathbf{X}^{\text{cal}}_{t}),\quad\tau\in[0,1].

Intuitively, β^h(τ|𝐗tcal)\hat{\beta}_{h}(\tau|\mathbf{X}^{\text{cal}}_{t}) measures the empirical frequency with which the model’s predicted CDF, evaluated at the true outcome, falls below τ\tau for observations with similar features. Perfect calibration corresponds to β^h(τ|𝐗tcal)=τ\hat{\beta}_{h}(\tau|\mathbf{X}^{\text{cal}}_{t})=\tau for all τ\tau, which would indicate that the PIT values are uniformly distributed conditionally on 𝐗tcal\mathbf{X}^{\text{cal}}_{t}.

Once the correction function β^h\hat{\beta}_{h} is learned on 𝒟cal\mathcal{D}_{\text{cal}}, it can be applied to any new couple of observations (Xt+htest,𝐗ttest)𝒟test(X^{\text{test}}_{t+h},\mathbf{X}^{\text{test}}_{t})\in\mathcal{D}_{\text{test}} from the test set or future data. The final recalibrated PDF, p^hrecal\hat{p}^{\text{recal}}_{h}, is obtained by applying the correction and then renormalizing,

p^hrecal(Xt+htest|𝐗ttest)=ch(Xt+htest|𝐗ttest)p^h(Xt+htest|𝐗ttest)ch(z|𝐗ttest)p^h(z|𝐗ttest)𝑑z,\hat{p}^{\text{recal}}_{h}(X^{\text{test}}_{t+h}|\mathbf{X}^{\text{test}}_{t})=\frac{c_{h}(X^{\text{test}}_{t+h}|\mathbf{X}^{\text{test}}_{t})\cdot\hat{p}_{h}(X^{\text{test}}_{t+h}|\mathbf{X}^{\text{test}}_{t})}{\int c_{h}(z|\mathbf{X}^{\text{test}}_{t})\cdot\hat{p}_{h}(z|\mathbf{X}^{\text{test}}_{t})\,dz},

with correction factor ch(Xt+htest|𝐗ttest)=dβ^hdτ|τ=F^h(Xt+htest|𝐗ttest).c_{h}(X^{\text{test}}_{t+h}|\mathbf{X}^{\text{test}}_{t})=\frac{d\hat{\beta}_{h}}{d\tau}\bigg|_{\tau=\hat{F}_{h}(X^{\text{test}}_{t+h}|\mathbf{X}^{\text{test}}_{t})}. The correction factor is derived from a change-of-variables formula which ensures that the recalibrated density produces uniformly distributed PIT values.

3 Monte Carlo Simulations

In this section, we evaluate the forecasting performance of our MDN approach through Monte Carlo simulations on various MARMA specifications. These controlled settings allow us to benchmark against theoretical predictive densities when available, or against theoretical predictive moments. We assess the relative performance of our method by conducting a horse race against established forecasting approaches.

3.1 MARMA Processes

Mixed Causal-Noncausal Autoregressive Moving Average processes naturally capture key distributional features of financial price series: multimodality, skewness, and heavy tails. Let XtX_{t} (t=0,±1,±2,)(t=0,\pm 1,\pm 2,\ldots) be a stochastic process generated by

ψ(F)ϕ(B)Xt=θ(F)H(B)εt,\psi(F)\phi(B)X_{t}=\theta(F)H(B)\varepsilon_{t}, (2)

where FF (resp. B=F1B=F^{-1}) denotes the forward (resp. backward) operator, ψ(z)=1ψ1zψpzp\psi(z)=1-\psi_{1}z-\ldots-\psi_{p}z^{p}, ϕ(z)=1ϕ1zϕqzq\phi(z)=1-\phi_{1}z-\ldots-\phi_{q}z^{q}, θ(z)=1θ1zθrzr\theta(z)=1-\theta_{1}z-\ldots-\theta_{r}z^{r}, and H(z)=1H1zHszsH(z)=1-H_{1}z-\ldots-H_{s}z^{s} are polynomials with roots outside the unit circle, and (εt)t(\varepsilon_{t})_{t\in\mathbb{Z}} is a sequence of i.i.d. variables. Equation (2) admits a unique stationary solution, called a MARMA(p,q,r,s)MARMA(p,q,r,s), if ϕ(z)0\phi(z)\neq 0 and ψ(z)0\psi(z)\neq 0 for all z1\mid z\mid\leq 1, and if ψ\psi (resp. ϕ\phi) has no common root with θ\theta (resp. HH). A sufficient condition for the identification of the model in (2) is that εt\varepsilon_{t} is i.i.d. non-gaussian (see Rosenblatt 2000).666To the best of our knowledge, our MDN is the first noncausal-specific method capable of forecasting the conditional predictive density of general MARMA processes, as existing approaches are restricted to MAR specifications.

The literature has been using either α\alpha-stable or tt-distributed innovations, the choice of one or another driving the forecasting algorithms proposed so far in the literature. We focus on α\alpha-stable MARMA models for two reasons. First, the α\alpha-stable family offers considerable flexibility, encompassing distributions ranging from Gaussian (α=2\alpha=2, β=0\beta=0) to Cauchy (α=1\alpha=1, β=0\beta=0) as special cases. Second, and most importantly, unlike the tt-distribution case, theoretical results on predictive densities and moments are available in this setting. Specifically, Gourieroux & Zakoian (2017) derived a closed-form expression for the predictive conditional density of a noncausal MAR(0,1)(0,1) when εti.i.d.\varepsilon_{t}\overset{\text{i.i.d.}}{\sim} Cauchy. More generally, when εt\varepsilon_{t} follows an α\alpha-stable law, εti.i.d.𝒮(α,β,σ,μ)\varepsilon_{t}\overset{\text{i.i.d.}}{\sim}\mathcal{S}(\alpha,\beta,\sigma,\mu), with α>1\alpha>1, β[1,1]\beta\in[-1,1], and σ>0\sigma>0, the theoretical results of Fries (2022) yield closed-form expressions for higher-order conditional moments: 𝔼[Xt+hpXt=x]\mathbb{E}\!\left[X_{t+h}^{p}\mid X_{t}=x\right] for any integer pp satisfying 1p<2α+11\leq p<2\alpha+1. This provides a rigorous framework for evaluating density forecasts in the following subsections.

3.2 Simulation Design

We generate time series of length 5,000 from the following MARMA data generating processes (de Truchis & Thomas 2025):

MAR(0,1): (10.9F)Xt=εt,\displaystyle(1-0.9F)X_{t}=\varepsilon_{t}, (3)
MAR(0,2): (10.7F0.1F2)Xt=εt,\displaystyle(1-0.7F-0.1F^{2})X_{t}=\varepsilon_{t}, (4)
MAR(1,1): (10.9F)(10.1B)Xt=εt,\displaystyle(1-0.9F)(1-0.1B)X_{t}=\varepsilon_{t}, (5)
MARMA(1,1,1,1): (10.9F)(1+0.3B)Xt=(10.4F)(1+0.3B)εt,\displaystyle(1-0.9F)(1+0.3B)X_{t}=(1-0.4F)(1+0.3B)\varepsilon_{t}, (6)

where εti.i.d.𝒮(α,0,0.5,0)\varepsilon_{t}\overset{\text{i.i.d.}}{\sim}\mathcal{S}(\alpha,0,0.5,0).777The parameter values are chosen by following de Truchis et al. (2025) for the three MAR models and based on Fries (2022) for the MARMA specification. All specifications satisfy the stationarity conditions. In the simulations, we set α{1.0,1.2,1.4,1.8}\alpha\in\{1.0,1.2,1.4,1.8\}, where smaller values correspond to fatter tails, which allows us to examine how model performance changes with the degree of tail heaviness.

The forecasting abilities of our MDN approach are compared with those of a set of established conditional density forecasting methods spanning different methodological paradigms: the nonparametric Nadaraya-Watson kernel density estimator (Rosenblatt 1969), the simulation-based approach of Lanne et al. (2012) and the closed-form predictive densities of Gourieroux & Jasiak (2025), both designed for MAR processes, as well as the learning-based FlexZBoost method of Izbicki (2017) and Dalmasso et al. (2020).888For a thorough presentation of these methods, see the Online Appendix. An additional calibration set 𝒟cal\mathcal{D}_{\text{cal}} of 5,000 observations, generated by the same data-generating process, is used to perform the local recalibration procedure described in Section 2.3 for the MDN.

To ensure a fair comparison across all methods, the kernel densities required by Nadaraya-Watson and the noncausal state density needed for the closed-form approach of Gourieroux & Jasiak (2025) are both estimated using the same 5,000-observation training set. The simulation-based method of Lanne et al. (2012) does not require additional historical observations beyond the parameter estimation step and can be applied directly to the evaluation grid once the MAR and α\alpha-stable parameters are obtained. Note that for Lanne et al. (2012) and Gourieroux & Jasiak (2025), the MAR specification is assumed known, thereby circumventing model identification issues. We estimate the MAR process parameters using the Generalized Covariance (GCov) estimator (Gourieroux & Jasiak 2023), a semi-parametric approach that minimizes a portmanteau statistic based on the autocovariances of transformed residuals. The parameters of the α\alpha-stable distribution are then obtained by fitting the characteristic function-based estimator of Nolan (2020) to the filtered residuals.999This is an optimal framework to account for parameter estimation uncertainty, whereas in real-life applications model risk also matters and can further hamper the forecasting abilities of these approaches. In contrast, the nonparametric and learning-based methods, Nadaraya-Watson, FlexZBoost, and our MDN, do not require explicit parametric model estimation and directly learn the predictive density from the data.101010To assess the impact of parameter estimation uncertainty, we conducted additional simulations where the methods of Lanne et al. (2012) and Gourieroux & Jasiak (2025) were evaluated using the true data-generating parameters rather than estimated ones. The relative performance rankings remained unchanged, indicating that the observed differences in forecasting accuracy are not primarily attributable to parameter estimation error.

3.3 Bimodality Analysis and Sampled Trajectories

As shown by de Truchis et al. (2025), for any MAR process with a single anticipative root, when conditioning on a single observation, the conditional predictive density theoretically exhibits bimodality in the tail regions. This reflects the fundamental dichotomy of locally explosive dynamics: the bubble either continues or crashes, with probability |ψ1|αh|\psi_{1}|^{\alpha h} and 1|ψ1|αh1-|\psi_{1}|^{\alpha h} respectively (see Section 2.2 in the Online Appendix for a discussion). Accordingly, the predictive density features two modes: one near the unconditional mean (zero in our simulations), corresponding to the crash scenario, and one further from the conditioning value, corresponding to bubble continuation.

Before implementing the forecast evaluation horse-race, we investigate whether the competing approaches capture well this theoretical bimodality of the conditional predictive density, particularly in the tail region. The procedure relies on a grid of 5,000 equispaced conditioning values {x1,,x5000}\{x_{1},\ldots,x_{5000}\} spanning the quantile range [q0.01,q0.99][q_{0.01},q_{0.99}], where qτq_{\tau} denotes the τ\tau-quantile of the theoretical marginal distribution. For each point xix_{i} on the grid, we estimate the conditional predictive density p^h(Xt+hXt=xi)\hat{p}_{h}(X_{t+h}\mid X_{t}=x_{i}) using each of the competing methods and visually assess their ability to recover the bimodal structure.

Figure 1 displays the one-step-ahead conditional predictive density for a MAR(0,1) process under three conditioning scenarios: Xt=q0.01X_{t}=q_{0.01} (left tail), Xt=0X_{t}=0 (center), and Xt=q0.99X_{t}=q_{0.99} (right tail).

Refer to caption
(a) Mixture Density Network
Refer to caption
(b) Lanne et al. (2012)
Refer to caption
(c) Gourieroux & Jasiak (2025)
Refer to caption
(d) Nadaraya-Watson
Refer to caption
(e) FlexZBoost
Figure 1: 1-Step-Ahead Conditional Predictive Density of a MAR(0,1) Process

Notes: This figure displays the estimated conditional predictive density p^h(Xt+h𝐗t)\hat{p}_{h}(X_{t+h}\mid\mathbf{X}_{t}) for a purely noncausal MAR(0,1) process with α\alpha-stable innovations at three conditioning values: Xt=q1X_{t}=q_{1} (left tail), Xt=0X_{t}=0 (center), and Xt=q99X_{t}=q_{99} (right tail), where q.q_{.} denotes the percentiles. The red dashed line indicates the conditioning value XtX_{t}. The tail-index is fixed at α=1.4\alpha=1.4.

The MDN (panel a) successfully captures the expected bimodality, exactly as predicted by the geometric decay of |ψ1|αh|\psi_{1}|^{\alpha h}.111111The bimodality becomes increasingly pronounced as the horizon extends from h=1h=1 to h=5h=5, and it is best captured by the MDN approach (see Section 3 in the Online Appendix). Moreover, employing skewed t-distribution mixture components proves beneficial: the MDN naturally outputs asymmetric and heavy-tailed predictive densities. In contrast, Nadaraya-Watson (panel d) produces a highly erratic predictive density, with substantial gaps and spikes, especially in data-sparse tail regions. The approaches of Lanne et al. (2012) and Gourieroux & Jasiak (2025) exhibit similar irregular behavior. FlexZBoost (panel e) yields smoother estimates than kernel methods but displays small spurious bumps due to its inability to set certain cosine basis coefficients to zero. This prevents it from achieving the clean bimodal structure predicted by theory and confirms that general-purpose density estimators require substantial adaptation to capture locally explosive patterns. The Online Appendix, includes GIFs (Graphics Interchange Format) depicting the evolution of our MDN predictive densities across the entire grid of 5,000 conditioning points, for h{1,2,5}h\in\{1,2,5\}.

Beyond density estimation, the MDN’s ability to capture bimodality translates directly into realistic trajectory generation. Figure 2 displays a trajectory obtained by iteratively sampling from the one-step-ahead predictive density, starting from an initial X0=0X_{0}=0 and using each sampled value as the next conditioning point. The resulting path exhibits both locally explosive dynamics and abrupt reversals that define noncausal processes, mirroring the behavior of a true MAR(0,1) simulation. Remarkably, when estimating the parameters of the MDN-sampled trajectory, one recovers a MAR(0,1) structure with coefficients close to the true data-generating process (see the Online Appendix for estimation details). We further demonstrate in the same appendix that this sampling procedure extends to higher-order anticipative models by providing an illustration for a MAR(0,2) process conditional on 𝐗t=[Xt,Xt1]\mathbf{X}_{t}=[X_{t},X_{t-1}]. This underscores our model’s capacity to effectively utilize multiple conditioning variables.

Refer to caption
Figure 2: MDN-Sampled vs. True MAR(0,1) Trajectories

Notes: Left panel: trajectory generated by iteratively sampling from the MDN’s one-step-ahead predictive density. Right panel: true MAR(0,1) simulation with α=1.4\alpha=1.4. The underlying MDN has been trained on a simulated sample of 5,000 realizations from the true MAR(0,1) specification in Equation (3), with α=1.4\alpha=1.4.

3.4 Simulation Results

Ideally, comparing the alternative forecasting approaches introduced in Section 3.2 would involve evaluating the competing predictive densities against the true theoretical one. However, since the latter is unavailable for most noncausal processes, our simulation analysis proceeds along three complementary directions.

First, we exploit a unique feature of the Cauchy MAR(0,1) process (α=1.0\alpha=1.0, β=0\beta=0): the availability of a closed-form expression for the conditional predictive density (Gourieroux & Zakoian 2017). This special case provides a direct benchmark for assessing how well each method approximates the true density. Second, for the general case of MARMA processes where closed-form densities are unavailable, we rely on theoretical conditional moments to evaluate predictive accuracy. Specifically, we compute the estimated conditional moments by numerically integrating the estimated predictive densities, and compare them to their theoretical counterparts. Note that our grid-based approach (see Section 3.3) is naturally suited to univariate conditioning, i.e., predicting Xt+hX_{t+h} based solely on the current value XtX_{t}, which is consistent with the theoretical conditional predictive densities of Gourieroux & Zakoian (2017), and the conditional moments of Fries (2022). Accordingly, in all simulations, the different methods condition only on the last observed value, 𝐗t=Xt\mathbf{X}_{t}=X_{t}. Third, we complement the grid-based evaluation with an out-of-sample forecasting exercise that assesses the predictive densities against realized outcomes, thereby better reflecting real-world forecasting applications.

3.4.1 Case 1: Cauchy MAR(0,1) model

Table 1 compares the estimated predictive densities with the true density of a MAR(0,1) process with α=1.0\alpha=1.0 and β=0\beta=0, using Kullback-Leibler (KL) divergence and Integrated Squared Error (ISE) metrics.121212The KL divergence is defined as KL(php^h)=ph(Xt+h𝐗t)log(ph(Xt+h𝐗t)p^h(Xt+h𝐗t))𝑑Xt+h\text{KL}(p_{h}\|\hat{p}_{h})=\int p_{h}(X_{t+h}\mid\mathbf{X}_{t})\log\left(\frac{p_{h}(X_{t+h}\mid\mathbf{X}_{t})}{\hat{p}_{h}(X_{t+h}\mid\mathbf{X}_{t})}\right)dX_{t+h}, while the ISE is given by ISE=(ph(Xt+h𝐗t)p^h(Xt+h𝐗t))2𝑑Xt+h\text{ISE}=\int\left(p_{h}(X_{t+h}\mid\mathbf{X}_{t})-\hat{p}_{h}(X_{t+h}\mid\mathbf{X}_{t})\right)^{2}dX_{t+h}.

Table 1: KL Divergence and ISE Between the True Predictive Density and the Predicted Ones: the Case of MAR(0, 1)
Horizon Model KL Divergence ISE
Center Tails Total Center Tails Total
h=1h=1 Nadaraya-Watson 0.578 11.45 10.39 0.020 0.751 0.680
Lanne et al. (2012) 2.339 13.10 12.06 0.418 0.416 0.416
Gourieroux and Jasiak (2025) 2.512 8.392 7.823 0.441 0.600 0.585
FlexZBoost 2.877 3.410 3.359 0.268 0.225 0.229
Mixture Density Network 0.927 1.220 1.192 0.197 0.182 0.184
h=2h=2 Nadaraya-Watson 0.751 14.47 13.14 0.017 0.701 0.635
Lanne et al. (2012) 2.457 11.81 10.90 0.191 0.230 0.226
Gourieroux and Jasiak (2025) 2.924 8.685 8.127 0.292 0.474 0.456
FlexZBoost 2.236 3.294 3.192 0.121 0.084 0.087
Mixture Density Network 0.482 0.731 0.707 0.063 0.057 0.058
h=5h=5 Nadaraya-Watson 1.109 18.23 16.58 0.018 0.760 0.688
Lanne et al. (2012) 2.698 10.50 9.749 0.071 0.150 0.143
Gourieroux and Jasiak (2025) 2.445 8.702 8.097 0.211 0.393 0.375
FlexZBoost 1.485 1.368 1.380 0.041 0.017 0.019
Mixture Density Network 0.129 0.546 0.506 0.008 0.012 0.011
  • Note: This Table reports the Kullback-Leibler (KL) divergence and Integrated Squared Error (ISE) between the true predictive density and estimated densities. Metrics are evaluated over three spatial regions: Center [q0.1,q0.9][q_{0.1},q_{0.9}], Tails [q0.01,q0.1][q0.9,q0.99][q_{0.01},q_{0.1}]\cup[q_{0.9},q_{0.99}], and Total [q0.01,q0.99][q_{0.01},q_{0.99}], where qpq_{p} represents the pp-th quantile. Best method in red, second best in bold black.

Using the grid-based methodology described above, we estimate the conditional predictive density p^h(Xt+hXt=xi)\hat{p}_{h}(X_{t+h}\mid X_{t}=x_{i}) for each competing method and directly compare it with the true density ph(Xt+hXt=xi)p_{h}(X_{t+h}\mid X_{t}=x_{i}) across three distinct regions: the center [q0.10,q0.90][q_{0.10},q_{0.90}], the tails [q0.01,q0.10][q0.90,q0.99][q_{0.01},q_{0.10}]\cup[q_{0.90},q_{0.99}], and the total range [q0.01,q0.99][q_{0.01},q_{0.99}].

The MDN exhibits the lowest KL divergence and ISE in the tail region and over the full distribution across all forecast horizons. In the center of the distribution, Nadaraya-Watson remains competitive, achieving the lowest KL divergence and ISE at h=1h=1, and the lowest ISE at h=2h=2. However, it performs substantially worse for density forecasting in the tails, especially at longer horizons. At horizon h=5h=5, the MDN achieves a tail KL divergence of 0.546 compared to 18.23 for Nadaraya-Watson and 8.702 for Gourieroux & Jasiak (2025). Across all horizons and regions examined, the MDN consistently delivers either the best or second-best performance.

3.4.2 Case 2: Predictive Moments Approach

The root mean square error (RMSE) criterion is now employed to compare the estimated conditional moments, 𝔼^[Xt+hkXt=xi]\hat{\mathbb{E}}[X^{k}_{t+h}\mid X_{t}=x_{i}] to the theoretical ones, 𝔼[Xt+hkXt=xi]\mathbb{E}[X^{k}_{t+h}\mid X_{t}=x_{i}], of Fries (2022) and measure the forecasting performance for each tail index α{1.0,1.2,1.4,1.8}\alpha\in\{1.0,1.2,1.4,1.8\} and each moment order k{1,2,3,4}k\in\{1,2,3,4\}, for all four DGPs in the same three regions of the distribution as in the first case-scenario. To formally assess the statistical significance of relative performance differences, we employ the Model Confidence Set (MCS) procedure of Hansen et al. (2011) with a 90% confidence level.

Note that, the forecast methods of Gourieroux & Jasiak (2025) and Lanne et al. (2012) require conditioning on multiple lags when the autoregressive order exceeds one, making direct comparison with Fries (2022)’s theoretical moments inappropriate beyond MAR(0,1) and for MARMA models. For this reason, this second-case comparative analysis is structured as follows. For the MAR(0,1) process, the MDN is compared with the full set of alternatives at all forecast horizons. For higher-order processes, MAR(0,2), MAR(1,1) and MARMA(1,1,1,1), the comparison is restricted to Nadaraya-Watson and FlexZBoost due to the conditioning set discrepancy.

Tables 2, 3, 4, and 5 present the one-period-ahead RMSE results for all four data-generating processes. The MDN approach almost always exhibits the lowest RMSE in the tail region and over the full distribution, regardless of the tail index α\alpha and the conditional moment order.131313Notable exceptions occur at α=1.8\alpha=1.8: for MAR(0,2) and MAR(1,1), Nadaraya-Watson performs better for higher-order moments (k2k\geq 2), while for MARMA(1,1,1,1), Nadaraya-Watson outperforms only for the fourth moment. In the center of the distribution, Nadaraya-Watson consistently achieves the lowest RMSE across almost all configurations, benefiting from regions where the conditional distribution exhibits more regular behavior. Importantly, when the MDN does not achieve the lowest RMSE in the center, it consistently ranks as the second-best method. The more traditional forecasting methods of Lanne et al. (2012) and Gourieroux & Jasiak (2025), as well as FlexZBoost are almost always dominated in all regions.

Table 2: Root Mean Squared Error of Predictive Moments: MAR(0, 1) Process, 1-Step-Ahead Forecasts
Model α\alpha 𝔼[yt+1yt]\mathbb{E}[y_{t+1}\mid y_{t}] 𝔼[yt+12yt]\mathbb{E}[y^{2}_{t+1}\mid y_{t}] 𝔼[yt+13yt]\mathbb{E}[y^{3}_{t+1}\mid y_{t}] 𝔼[yt+14yt]\mathbb{E}[y^{4}_{t+1}\mid y_{t}]
Center Tails Total Center Tails Total Center Tails Total Center Tails Total
Nadaraya-Watson 1.0 0.434 43.72 41.55 10.80 6619 6291
1.2 0.097 7.182 6.607 1.360 341.8 314.4 12.75 1.671e+04 1.538e+04
1.4 0.075 2.313 2.038 0.431 49.86 43.92 2.798 1125 991.2
1.6 0.047 0.879 0.727 0.167 12.20 10.09 0.523 149.7 123.8 4.584 1786 1477
1.8 0.037 0.164 0.125 0.095 1.149 0.863 0.334 6.941 5.201 1.642 41.28 30.92
Lanne et al. (2012) 1.0 1.115 39.51 37.55 133.1 7320 6956
1.2 0.391 12.90 11.87 5.406 649.5 597.6 46.55 3.127e+04 2.877e+04
1.4 0.218 4.369 3.850 1.602 96.61 85.10 9.985 2027 1785
1.6 0.121 1.010 0.838 0.679 14.71 12.17 3.171 184.1 152.3 17.51 2234 1848
1.8 0.076 0.220 0.172 0.380 2.482 1.875 1.468 20.93 15.69 5.417 159.3 119.3
Gourieroux and Jasiak (2025) 1.0 4.040 19.37 18.45 81.40 2666 2534
1.2 0.698 4.649 4.286 6.326 186.1 171.2 46.11 8597 7910
1.4 0.285 2.150 1.898 1.590 37.05 32.65 6.841 683.2 601.9
1.6 0.170 1.197 0.995 0.787 11.76 9.736 2.597 123.7 102.3 11.09 1299 1075
1.8 0.092 0.449 0.342 0.385 2.526 1.908 1.060 15.13 11.35 3.584 87.15 65.28
FlexZBoost 1.0 2.107 16.39 15.59 2481 6082 5831
1.2 0.740 4.858 4.479 98.26 482.4 445.5 332.8 4.168e+04 3.835e+04
1.4 0.292 2.381 2.102 10.03 75.54 66.71 87.52 2221 1957
1.6 0.182 1.038 0.864 1.629 18.51 15.34 9.190 318.7 263.7 89.79 5745 4751
1.8 0.179 0.256 0.225 0.963 1.965 1.604 7.424 17.58 14.05 93.36 200.1 162.0
Mixture Density Network 1.0 0.902 7.461 7.096 37.73 1091 1037
1.2 0.305 1.705 1.574 2.980 67.51 62.12 38.00 2614 2405
1.4 0.124 0.640 0.567 0.747 8.409 7.416 4.118 133.6 117.7
1.6 0.082 0.306 0.258 0.427 2.745 2.283 1.932 31.10 25.75 24.66 390.1 323.0
1.8 0.056 0.113 0.093 0.421 0.628 0.547 1.624 4.679 3.664 15.36 26.48 22.29

Notes: This Table reports the root mean squared error (RMSE) of estimated predictive moments relative to theoretical values. α\alpha denotes the tail index of the stable distribution. Predictive moments are evaluated over three spatial regions: Center [q0.1,q0.9][q_{0.1},q_{0.9}], Tails [q0.01,q0.1][q0.9,q0.99][q_{0.01},q_{0.1}]\cup[q_{0.9},q_{0.99}], and Total [q0.01,q0.99][q_{0.01},q_{0.99}], where qpq_{p} represents the pp-th quantile. Best method in red, second best in bold black. An asterisk () designates model(s) which belong to the Model Confidence Set at the 90% confidence level.

Table 3: Root Mean Squared Error of Predictive Moments: MAR(0, 2) Process, 1-Step-Ahead Forecasts
Model α\alpha 𝔼[yt+1yt]\mathbb{E}[y_{t+1}\mid y_{t}] 𝔼[yt+12yt]\mathbb{E}[y^{2}_{t+1}\mid y_{t}] 𝔼[yt+13yt]\mathbb{E}[y^{3}_{t+1}\mid y_{t}] 𝔼[yt+14yt]\mathbb{E}[y^{4}_{t+1}\mid y_{t}]
Center Tails Total Center Tails Total Center Tails Total Center Tails Total
Nadaraya-Watson 1.0 0.244 20.23 19.23 2.252 1697 1613
1.2 0.114 5.137 4.726 0.610 162.9 149.9 5.464 6199 5703
1.4 0.086 1.568 1.382 0.274 21.01 18.51 0.787 339.5 299.1
1.6 0.044 0.583 0.483 0.118 4.763 3.940 0.347 38.99 32.25 1.540 368.0 304.4
1.8 0.029 0.188 0.142 0.085 0.962 0.722 0.159 4.597 3.443 0.578 23.65 17.71
FlexZBoost 1.0 1.164 9.557 9.090 617.9 1886 1803
1.2 0.455 4.648 4.280 29.97 226.7 208.9 59.39 1.322e+04 1.216e+04
1.4 0.265 1.789 1.581 3.475 39.33 34.68 19.30 1070 942.4
1.6 0.200 0.695 0.586 1.345 4.626 3.900 12.98 47.36 39.85 226.6 502.1 434.4
1.8 0.187 0.267 0.235 0.703 1.092 0.941 3.668 9.215 7.315 32.50 72.83 58.63
Mixture Density Network 1.0 0.733 5.713 5.435 21.11 291.6 277.2
1.2 0.129 2.285 2.103 1.647 41.24 37.94 41.53 2067 1902
1.4 0.084 0.607 0.536 0.773 8.158 7.196 2.672 161.3 142.1
1.6 0.084 0.276 0.233 0.575 2.696 2.253 2.873 28.51 23.63 30.92 309.6 256.6
1.8 0.049 0.168 0.130 0.345 1.032 0.806 1.053 6.553 4.955 8.499 59.19 44.66

Notes: For details on variable definitions and methodology, refer to Table 2.

Table 4: Root Mean Squared Error of Predictive Moments: MAR(1, 1) Process, 1-Step-Ahead Forecasts
Model α\alpha 𝔼[yt+1yt]\mathbb{E}[y_{t+1}\mid y_{t}] 𝔼[yt+12yt]\mathbb{E}[y^{2}_{t+1}\mid y_{t}] 𝔼[yt+13yt]\mathbb{E}[y^{3}_{t+1}\mid y_{t}] 𝔼[yt+14yt]\mathbb{E}[y^{4}_{t+1}\mid y_{t}]
Center Tails Total Center Tails Total Center Tails Total Center Tails Total
Nadaraya-Watson 1.0 0.506 40.06 38.07 13.94 7557 7182
1.2 0.104 6.755 6.215 1.469 378.8 348.5 16.37 2.040e+04 1.877e+04
1.4 0.083 2.312 2.037 0.566 59.61 52.51 4.169 1513 1333
1.6 0.047 0.798 0.660 0.199 12.24 10.12 0.688 170.4 140.9 5.156 2312 1912
1.8 0.040 0.159 0.122 0.113 1.335 1.002 0.428 9.769 7.319 2.135 70.68 52.93
FlexZBoost 1.0 2.320 17.85 16.98 3065 7661 7343
1.2 0.791 5.266 4.855 120.7 616.9 569.5 462.4 5.894e+04 5.423e+04
1.4 0.349 2.174 1.922 12.40 84.16 74.37 120.5 2932 2583
1.6 0.191 1.306 1.086 1.996 27.51 22.78 12.41 529.1 437.6 119.7 1.049e+04 8678
1.8 0.194 0.353 0.294 1.015 3.798 2.922 8.195 38.76 29.52 108.6 424.0 325.5
Mixture Density Network 1.0 0.967 3.751 3.578 47.44 863.9 821.2
1.2 0.328 1.789 1.651 4.245 86.04 79.17 62.46 3998 3678
1.4 0.094 1.021 0.901 0.844 19.21 16.92 4.898 367.2 323.5
1.6 0.061 0.302 0.252 0.479 2.472 2.062 2.441 23.10 19.16 31.12 300.8 249.4
1.8 0.067 0.154 0.123 0.467 1.367 1.069 1.927 12.52 9.462 18.88 118.4 89.51

Notes: For details on variable definitions and methodology, refer to Table 2.

Table 5: Root Mean Squared Error of Predictive Moments: MARMA(1, 1, 1, 1) Process, 1-Step-Ahead Forecasts
Model α\alpha 𝔼[yt+1yt]\mathbb{E}[y_{t+1}\mid y_{t}] 𝔼[yt+12yt]\mathbb{E}[y^{2}_{t+1}\mid y_{t}] 𝔼[yt+13yt]\mathbb{E}[y^{3}_{t+1}\mid y_{t}] 𝔼[yt+14yt]\mathbb{E}[y^{4}_{t+1}\mid y_{t}]
Center Tails Total Center Tails Total Center Tails Total Center Tails Total
Nadaraya-Watson 1.0 0.711 41.50 39.44 27.39 3236 3076
1.2 0.134 6.808 6.264 1.149 194.5 178.9 26.24 8876 8166
1.4 0.082 2.129 1.876 0.383 42.68 37.59 3.275 1044 919.7
1.6 0.056 0.478 0.396 0.127 4.996 4.132 0.984 46.74 38.66 10.97 631.4 522.2
1.8 0.042 0.290 0.219 0.116 1.566 1.175 0.207 9.357 7.006 1.216 48.67 36.44
FlexZBoost 1.0 1.679 20.73 19.71 1770 6375 6083
1.2 0.592 4.634 4.269 66.67 344.6 318.1 185.0 4.014e+04 3.693e+04
1.4 0.296 1.736 1.536 6.192 68.48 60.39 40.52 2015 1775
1.6 0.201 0.828 0.694 1.488 19.57 16.21 11.53 277.7 229.8 185.3 8553 7074
1.8 0.196 0.402 0.328 0.693 4.325 3.270 4.234 38.92 29.27 43.93 426.0 320.2
Mixture Density Network 1.0 0.689 5.908 5.619 68.94 659.3 626.9
1.2 0.261 2.209 2.035 1.493 23.46 21.59 42.15 1805 1661
1.4 0.173 0.925 0.819 1.186 4.344 3.868 13.60 180.8 159.4
1.6 0.086 0.327 0.274 0.633 0.972 0.879 3.186 26.41 21.91 41.99 180.3 151.0
1.8 0.089 0.132 0.115 0.576 0.902 0.776 2.915 6.460 5.208 22.14 52.50 41.95

Notes: For details on variable definitions and methodology, refer to Table 2.

The statistical significance of these differences in performance is confirmed by the MCS tests. In most cases, only one forecasting approach belongs to the MCS90%MCS_{90\%}: Nadaraya-Watson for the central region, and the MDN for the tail region and overall distribution.141414We also test the sensitivity of our findings to the choice of the forecast horizon (h=2h=2 and h=5h=5) both in terms of RMSE and MCS test. The results, available in Section 3 of the Online Appendix, are qualitatively similar.

Figure 3 provides a visual comparison of the one-step-ahead predictive conditional moments for a MAR(0,1) process with tail-index α=1.4\alpha=1.4 across the five competing approaches. The MDN produces smooth predictions that closely track the theoretical moments throughout the distribution. Gourieroux & Jasiak (2025)’s forecasts are also smooth but increasingly drift from the truth in the tails. FlexZBoost exhibits a staircase-like pattern with systematic deviations. The moment forecasts from Nadaraya-Watson and Lanne et al. (2012) become increasingly noisy as we move away from the center. These patterns are qualitatively similar, though noisier, at longer horizons h=2h=2 and h=5h=5 (see the Online Appendix).

Refer to caption
(a) Mixture Density Network
Refer to caption
(b) Lanne et al. (2012)
Refer to caption
(c) Gourieroux & Jasiak (2025)
Refer to caption
(d) Nadaraya-Watson
Refer to caption
(e) FlexZBoost
Figure 3: Conditional Predictive Moment Accuracy: MAR(0,1) Process, 1-Step-Ahead Forecasts

Notes: This figure displays the estimated predictive moments 𝔼[Xt+hkXt]\mathbb{E}[X_{t+h}^{k}\mid X_{t}] for k{1,2,3}k\in\{1,2,3\} as a function of the conditioning variable XtX_{t} for a purely noncausal MAR(0,1) process with α\alpha-stable innovations. Each panel from (a) to (e) shows the results for a specific density forecasting method. Blue curves represent estimated moments, while orange curves show the corresponding theoretical ones. The tail-index is fixed at α=1.4\alpha=1.4.

3.4.3 Case 3: Comparison with the Realized Outcomes

Rather than evaluating predictive densities at pre-specified conditioning values, we now train each method on 5,000 observations from a MAR(0,1) process and use the trained models to forecast the subsequent 500 realizations. This approach allows us to assess density forecast quality using proper scoring rules that directly compare the predictive density, p^h(Xt+hXt=xi)\hat{p}_{h}(X_{t+h}\mid X_{t}=x_{i}), against the realized outcome Xt+hX_{t+h}.

Table 6 presents detailed results for forecast horizons h{1,2,5}h\in\{1,2,5\} using four complementary evaluation metrics: the Conditional Density Estimation (CDE) loss, the Continuous Ranked Probability Score (CRPS), the logarithmic probability score, and the quantile score at the 10% level (see the Online Appendix for formal definitions).151515For CDE loss, CRPS, and quantile scores, lower values indicate superior performance, while for the log probability score, higher values are preferred. As in our previous evaluation strategy, we assess the performance across three distributional regions (Center, Tails, and Total) to capture method-specific strengths. The results corroborate our grid-based findings. At the one-step-ahead horizon, the MDN achieves the best or second-best performance across nearly all metrics and tail index values, and it is particularly excelling in the tail region. As the forecast horizon extends to h={2,5}h=\{2,5\}, the relative performance rankings remain remarkably stable. The MDN continues to dominate in the tail region and overall distribution for most α\alpha values, though Nadaraya-Watson exhibits competitive performance for α=1.8\alpha=1.8, in particular at longer horizons.

Table 6: Density Forecast Performance Metrics: MAR(0, 1) Process, 1-Step-Ahead Forecasts
Model α\alpha CDE Loss CRPS Log Prob QS 10%
Center Tails Total Center Tails Total Center Tails Total Center Tails Total
Nadaraya-Watson 1.0 -0.235 -0.076 -0.221 1.217 3.716 2.041 -2.354 -4.824 -2.822 0.509 1.964 0.889
Lanne et al. (2012) 0.161 0.080 0.135 3.499 3.811 3.794 -3.926 -3.882 -4.207 0.702 0.756 0.841
Gourieroux and Jasiak (2025) 0.144 0.192 0.172 3.709 4.312 4.402 -4.046 -4.976 -4.450 1.522 1.916 1.779
FlexZBoost -0.034 -0.014 -0.011 7.363 12.13 10.71 -3.452 -4.511 -4.785 1.167 5.739 5.476
Mixture Density Network -0.098 -0.087 -0.088 1.584 2.817 2.281 -2.568 -2.887 -2.773 0.693 0.965 0.942
Nadaraya-Watson 1.2 -0.308 -0.342 -0.299 0.690 1.126 0.993 -1.750 -2.158 -1.927 0.264 0.296 0.299
Lanne et al. (2012) -0.129 -0.130 -0.114 0.834 1.435 1.128 -2.029 -3.273 -2.403 0.272 0.432 0.342
Gourieroux and Jasiak (2025) -0.029 -0.085 -0.015 0.988 1.325 1.296 -2.236 -2.307 -2.429 0.297 0.309 0.335
FlexZBoost -0.081 -0.076 -0.051 1.685 2.896 2.440 -2.585 -3.014 -3.228 0.552 1.132 1.141
Mixture Density Network -0.302 -0.348 -0.290 0.682 1.108 0.996 -1.580 -1.600 -1.710 0.258 0.288 0.334
Nadaraya-Watson 1.4 -0.350 -0.486 -0.327 0.511 0.585 0.703 -1.397 -1.151 -1.582 0.161 0.167 0.201
Lanne et al. (2012) -0.288 -0.459 -0.258 0.529 0.752 0.721 -1.418 -1.663 -1.765 0.166 0.266 0.223
Gourieroux and Jasiak (2025) -0.237 -0.408 -0.207 0.583 0.587 0.770 -1.503 -1.329 -1.750 0.175 0.173 0.210
FlexZBoost -0.184 -0.237 -0.149 0.795 0.927 0.965 -1.982 -1.846 -2.266 0.279 0.325 0.390
Mixture Density Network -0.355 -0.542 -0.332 0.511 0.585 0.684 -1.266 -1.037 -1.450 0.160 0.166 0.202
Nadaraya-Watson 1.6 -0.398 -0.608 -0.360 0.432 0.350 0.514 -1.149 -0.796 -1.317 0.127 0.104 0.152
Lanne et al. (2012) -0.388 -0.444 -0.301 0.427 0.420 0.540 -1.156 -1.506 -1.522 0.128 0.121 0.162
Gourieroux and Jasiak (2025) -0.306 -0.513 -0.280 0.467 0.386 0.569 -1.244 -0.940 -1.470 0.138 0.105 0.163
FlexZBoost -0.354 -0.492 -0.298 0.452 0.393 0.558 -1.542 -0.944 -1.673 0.145 0.125 0.182
Mixture Density Network -0.410 -0.578 -0.356 0.420 0.359 0.529 -1.083 -0.834 -1.288 0.125 0.105 0.159
Nadaraya-Watson 1.8 -0.455 -0.716 -0.386 0.371 0.262 0.437 -0.971 -0.535 -1.164 0.109 0.078 0.130
Lanne et al. (2012) -0.416 -0.411 -0.327 0.370 0.295 0.440 -1.005 -1.044 -1.357 0.110 0.080 0.131
Gourieroux and Jasiak (2025) -0.415 -0.706 -0.357 0.384 0.264 0.450 -1.037 -0.539 -1.240 0.111 0.077 0.136
FlexZBoost -0.434 -0.657 -0.378 0.381 0.281 0.447 -1.588 -1.336 -1.787 0.109 0.089 0.131
Mixture Density Network -0.460 -0.687 -0.382 0.368 0.263 0.434 -0.943 -0.571 -1.152 0.107 0.075 0.128

Notes: This table reports density forecast performance metrics across different tail index values (α\alpha). CDE Loss (Conditional Density Estimation loss), CRPS (Continuous Ranked Probability Score), and QS 10% (Quantile Score at 10% level) are loss functions where lower values indicate better performance. Log Prob (Log Probability Score) is a scoring rule where higher values indicate better performance. Metrics are evaluated over three spatial regions: Center [q0.1,q0.9][q_{0.1},q_{0.9}], Tails [q0.01,q0.1][q0.9,q0.99][q_{0.01},q_{0.1}]\cup[q_{0.9},q_{0.99}], and Total [q0.01,q0.99][q_{0.01},q_{0.99}], where qpq_{p} represents the pp-th quantile. Best method in red, second best in bold black.

Note also that contrary to the previous two setups, this evaluation framework has the advantage of allowing one to compare all the alternative forecasting approaches for all MAR models and at all forecast horizons.

As a final remark, comparing Tables 1 and 6 for the α=1.0\alpha=1.0 case reveals a notable discrepancy. When evaluated against the true predictive density, the MDN exhibits clear dominance, achieving KL divergence and ISE values substantially lower than all competitors both in the tail and over the total regions. However, when assessed using proxy metrics that do not require the true density, although the MDN remains among the best-performing methods, its superiority is less apparent. This suggests that standard scoring rules may lack sufficient discriminatory power to identify the best-performing forecasting method when predictive distributions exhibit heavy tails and bimodality.

3.5 Runtime Analysis

Finally, Table 7 reports the average computational time (over all tail indices α\alpha) required by each method to generate the forecasts for a MAR(0,1) process. For each forecast horizon, the reported runtime corresponds to the average time needed to compute the 5,000 conditional predictive densities over the grid of conditioning values {x1,,x5000}\{x_{1},\ldots,x_{5000}\}. FlexZBoost achieves the shortest runtime (under 5 seconds), though at the cost of inferior forecast accuracy, as demonstrated above. The MDN requires only 2-3 minutes. In stark contrast, the simulation-based methods of Lanne et al. (2012) and Gourieroux & Jasiak (2025) require substantially longer computation times, with the latter exhibiting a dramatic increase from about 1 hour at h=1h=1 to over 5 hours at h=5h=5.

Table 7: Average running time (in minutes)
Model Horizon 1 Horizon 2 Horizon 5
Nadaraya-Watson 9.38 9.34 10.72
Gourieroux & Jasiak (2025) 59.03 117.85 300.54
Lanne et al. (2012) 106.33 107.37 107.69
FlexZBoost 0.04 0.05 0.05
Mixture Density Network 2.64 2.20 1.92

4 Empirical Applications

4.1 Forecasting Natural Gas Prices in Real Time

Having established the MDN’s superior performance in controlled settings where the true DGP is known, we now evaluate its forecasting ability on real-world data where model specification uncertainty and data complexities are unavoidable. Natural gas prices are notoriously difficult to forecast due to periodic episodes of locally explosive behavior. These characteristics make the natural gas market a compelling testbed for our MDN approach.

We adopt the real-time forecasting framework of Baumeister et al. (2025), who provide an extensive evaluation of point forecasting methods for the real Henry Hub spot price, the benchmark for North American natural gas markets. Their setup accounts for publication lags and data revisions through a database of monthly vintages, ensuring that forecasts rely only on information available at each forecast origin. The estimation period runs from January 1976 to January 1997, and the out-of-sample evaluation period spans February 1997 to February 2024. We use forecast horizons hh ranging from 1 to 24 months. We defer the reader to Baumeister et al. (2025) for a detailed description of the data construction and real-time constraints.

Refer to caption
Figure 4: Real Henry Hub spot price of natural gas.

Figure 4 displays the evolution of the real Henry Hub spot price over the sample period. The series exhibits clear episodes of locally explosive behavior, most notably during the 2005-2008 period, characterized by sharp run-ups followed by abrupt collapses, precisely the type of dynamics that noncausal processes are designed to capture. Using the GCov estimation method (Gourieroux & Jasiak 2023) on the in-sample period (1976M1-1997M1), we estimate all possible MAR(p,q)(p,q) specifications such that p+qkp+q\leq k, where kk is chosen based on the Partial Autocorrelation Function. Then we select the model yielding i.i.d. residuals based on the portmanteau test of Jasiak & Neyazi (2025). The best specification is a purely noncausal MAR(0,1) process. The α\alpha-stable distribution parameters and their respective standard deviations are then obtained by fitting the characteristic function-based estimator of Nolan (2020) to the filtered residuals.161616The standard deviations of the MAR parameters are obtained by following Gourieroux & Jasiak (2023). Besides, the MAR(0,1) dynamics of the noncausal process is found to exhibit temporal robustness, with comparable estimates for the full post-revised series (1976M1–2024M2). As reported in Table 8, both samples reveal strong anticipative persistence (ψ^0.95\hat{\psi}\approx 0.95) and heavy tails (α^1.8\hat{\alpha}\approx 1.8), substantially thicker than in the Gaussian case (α=2\alpha=2), and consistent with the extreme price movements observed in the data.

Table 8: Estimated MAR(0,1) parameters for the real Henry Hub spot price
Parameter Real-Time In-Sample Full Period Post-Revised
ψ\psi 0.957∗∗∗ (0.018) 0.945∗∗∗ (0.015)
α\alpha 1.779∗∗∗ (0.186) 1.830∗∗∗ (0.125)
β\beta 0.415∗∗ (0.200) 0.628∗∗∗ (0.149)
σ\sigma 0.070∗∗∗ (0.006) 0.158∗∗∗ (0.009)
  • Note: Standard errors in parentheses. ∗∗∗ p<0.01p<0.01, ∗∗ p<0.05p<0.05, p<0.10p<0.10.

To evaluate the density forecasting performance of our MDN approach in this real-time setting, we simulate a trajectory of 5,000 observations from the estimated MAR(0,1) process on which each method is fitted, and generate out-of-sample forecasts over the period 1997M2–2024M2 for horizons h{1,3,6,9,12,15,18,21,24}h\in\{1,3,6,9,12,15,18,21,24\} months. We assess the quality of the predictive densities with the four complementary metrics used in simulations (CDE loss, CRPS, log-probability score, and quantile score).

Table 9 summarizes the results. The MDN approach delivers the best or near-best performance across the majority of horizons and metrics. At the shortest horizon (h=1h=1), the MDN achieves the best CDE loss and CRPS, while FlexZBoost achieves a slightly lower quantile score. The MDN dominates in terms of log-probability score across all horizons. From h=3h=3 onward, the MDN consistently achieves the best performance across all four metrics, with only occasional exceptions at intermediate horizons (h=12,15h=12,15) where Nadaraya-Watson performs marginally better in terms of CDE loss and CRPS. Importantly, when the MDN is not the best performer for a given metric, it consistently ranks as the second-best method.

Table 9: Density forecast comparison
Horizon Metric Nadaraya-Watson Lanne et al. (2012) Gourieroux and Jasiak (2025) FlexZBoost Mixture Density Network
h=1h=1 CDE loss -0.990 8.105 -0.631 -1.130 -1.167
CRPS 0.189 0.212 0.195 0.177 0.175
Log probability score -0.453 -8.463 -0.710 -1.165 -0.214
Quantile score (10%) 0.062 0.078 0.093 0.059 0.063
h=3h=3 CDE loss -0.660 7.813 0.881 -0.637 -0.701
CRPS 0.311 0.395 0.392 0.297 0.291
Log probability score -1.024 -8.839 -2.375 -3.573 -0.713
Quantile score (10%) 0.099 0.174 0.235 0.090 0.081
h=6h=6 CDE loss -0.506 9.300 1.530 -0.469 -0.542
CRPS 0.432 0.613 0.628 0.410 0.393
Log probability score -1.416 -9.429 -3.645 -3.904 -1.029
Quantile score (10%) 0.144 0.217 0.426 0.120 0.118
h=9h=9 CDE loss -0.446 9.128 1.615 -0.383 -0.459
CRPS 0.468 0.790 0.781 0.453 0.435
Log probability score -1.440 -10.90 -4.083 -3.906 -1.102
Quantile score (10%) 0.137 0.254 0.564 0.145 0.116
h=12h=12 CDE loss -0.404 9.744 2.062 -0.350 -0.402
CRPS 0.474 0.950 0.960 0.474 0.478
Log probability score -1.519 -11.17 -4.767 -3.625 -1.232
Quantile score (10%) 0.119 0.234 0.724 0.128 0.117
h=15h=15 CDE loss -0.354 10.14 2.195 -0.283 -0.348
CRPS 0.489 1.067 1.181 0.511 0.510
Log probability score -1.669 -11.32 -5.064 -4.285 -1.327
Quantile score (10%) 0.120 0.223 0.937 0.134 0.117
h=18h=18 CDE loss -0.332 11.06 2.242 -0.244 -0.349
CRPS 0.514 1.165 1.318 0.533 0.515
Log probability score -1.766 -12.14 -5.149 -4.773 -1.350
Quantile score (10%) 0.122 0.227 1.043 0.136 0.114
h=21h=21 CDE loss -0.341 9.754 2.085 -0.273 -0.356
CRPS 0.524 1.255 1.359 0.552 0.516
Log probability score -1.748 -11.67 -5.026 -4.682 -1.339
Quantile score (10%) 0.120 0.230 1.005 0.134 0.115
h=24h=24 CDE loss -0.336 9.564 2.143 -0.245 -0.336
CRPS 0.532 1.342 1.408 0.564 0.524
Log probability score -1.780 -12.65 -5.280 -4.307 -1.359
Quantile score (10%) 0.117 0.231 0.986 0.133 0.109
  • Note: Best method in red, second best in bold black.

A natural question is whether the MDN’s well-calibrated predictive distributions also translate into accurate point predictions. Table 10 addresses this by comparing the RMSE of the MDN’s point forecasts, computed as the median of the predictive density, against the forecasts of a comprehensive set of models evaluated by Baumeister et al. (2025), expressed as ratios relative to the no-change forecast (X^t+h=Xt\hat{X}_{t+h}=X_{t}). The comparison encompasses both univariate specifications (AR(1), AR(AIC), exponential smoothing) and multivariate Bayesian VAR models that incorporate additional predictors. Focusing first on the univariate benchmarks, the MDN delivers the lowest RMSE at all horizons from h=1h=1 to h=9h=9, outperforming AR and exponential smoothing specifications. At the 9-month-ahead horizon, the MDN achieves an RMSE ratio of 0.864, representing a 14% improvement over the random walk and substantially better than all univariate alternatives. Statistical significance against the no-change forecast is confirmed by Diebold-Mariano tests (Diebold & Mariano 1995), see Online Appendix. At longer horizons (h15h\geq 15), exponential smoothing takes the lead, although the MDN remains competitive, consistently ranking second-best among univariate methods.

Remarkably, the MDN also outperforms the multivariate BVAR models at short- to medium-term horizons. The BVAR specifications in Baumeister et al. (2025) exploit up to six predictor variables, yet the MDN achieves lower RMSE ratios up to horizon h=9h=9. Moreover, unlike Baumeister et al. (2025), our approach does not require recursive re-estimation of the model at each forecast origin, making it computationally more efficient.

Table 10: Average RMSE Ratios Relative to the No-Change Forecast of the Real Natural Gas Spot Price
Univariate Models Multivariate Models
Horizon AR(1) AR(AIC) Exp. Smoothing MDN BVAR(AIC) BVAR(1)
1 0.993 1.021 1.456 0.963 1.003 0.984
3 0.978 0.989 1.121 0.937 0.986 0.962
6 0.949 0.944 0.970 0.894 0.948 0.923
9 0.925 0.921 0.905 0.864 0.909 0.897
12 0.915 0.909 0.893 0.882 0.900 0.875
15 0.914 0.908 0.896 0.899 0.909 0.876
18 0.918 0.911 0.899 0.897 0.923 0.886
21 0.924 0.914 0.890 0.901 0.925 0.905
24 0.933 0.919 0.884 0.908 0.940 0.922
  • Note: Values below 1 indicate improvements relative to the no-change forecast. Best univariate method in red, second best in bold black. Benchmark results from Baumeister et al. (2025), Table 3. We report RMSE ratios by taking the square root of the Mean Squared Prediction Error (MSPE) measures in Baumeister et al. (2025), to ensure consistency throughout the paper.

The excellent point forecast performance of the MDN can be partly attributed to the noncausal specification itself. By explicitly modeling the anticipative dynamics, the MAR(0,1) process captures the locally explosive behavior of natural gas prices, a feature that the purely causal models employed by Baumeister et al. (2025) cannot accommodate. The gains are concentrated at short- to medium-term horizons (1–9 months), where the nonlinear dynamics inherent to noncausal processes exert their strongest influence and prove particularly effective at capturing the explosive episodes observed in natural gas prices. At longer horizons, mean-reverting forces dominate the predictive signal, diminishing the comparative advantage of our approach, although the MDN remains highly competitive. Furthermore, the MDN outperforms the alternative density forecasting methods in terms of RMSE. Additional results are reported in the Online Appendix.

4.2 Forecasting Inflation in Real Time

To further assess the point forecasting performance of our MDN approach, we apply it to U.S. inflation, a series known to exhibit noncausal dynamics (Lanne & Saikkonen 2011, Lanne et al. 2012), following the real-time framework of Medeiros et al. (2021). Importantly, unlike natural gas prices, inflation exhibits near-Gaussian tail index (α^1.9\hat{\alpha}\approx 1.9), providing a valuable test of whether the MDN, designed to capture heavy-tailed distributions, retains its forecasting advantages when tail behavior is less pronounced. The model is estimated on the January 2001 vintage of FRED-MD dataset, which covers January 1960 to December 2000. Forecasts for the period January 2001 to December 2015 are then evaluated against the ex-post revised series from the January 2016 vintage (see Figure 5). This real-time setup mirrors our natural gas application and reflects the information actually available to forecasters at each point in time.

Refer to caption
Figure 5: U.S. Inflation.

Using the same procedure as in Section 4, we estimate a purely noncausal MAR(0,2) on the in-sample period and train our MDN on a simulated trajectory of 5,000 observations (see the Online Appendix for estimation results). Table 11 reports the RMSE, mean absolute error (MAE), and median absolute deviation from the median (MAD) for the MDN and the two univariate benchmarks (AR and unobserved components stochastic volatility – UCSV – models) from Medeiros et al. (2021), relative to the no-change forecast.

Table 11: Real-Time Forecast Accuracy Ratios Relative to the No-Change Forecast for US Inflation
Univariate Models Multivariate Models
AR(BIC) UCSV MDN RF RR
Horizon RMSE MAE MAD RMSE MAE MAD RMSE MAE MAD RMSE MAE MAD RMSE MAE MAD
1 0.91 0.87 0.80 0.97 0.91 0.86 0.86 0.85 0.77 0.88 0.83 0.74 0.84 0.81 0.77
2 0.81 0.80 0.78 0.82 0.82 0.82 0.79 0.77 0.67 0.74 0.72 0.72 0.74 0.72 0.70
3 0.77 0.73 0.66 0.79 0.77 0.82 0.76 0.75 0.72 0.70 0.66 0.62 0.72 0.68 0.60
4 0.78 0.75 0.80 0.80 0.77 0.81 0.77 0.74 0.63 0.72 0.72 0.73 0.75 0.72 0.76
5 0.79 0.80 0.76 0.78 0.79 0.80 0.75 0.75 0.68 0.72 0.74 0.75 0.76 0.77 0.74
6 0.80 0.81 0.76 0.79 0.80 0.76 0.76 0.76 0.66 0.75 0.76 0.77 0.78 0.80 0.75
7 0.80 0.79 0.73 0.81 0.81 0.90 0.76 0.75 0.55 0.76 0.75 0.75 0.79 0.77 0.75
8 0.77 0.75 0.74 0.80 0.80 0.86 0.76 0.73 0.55 0.74 0.72 0.73 0.76 0.73 0.72
9 0.79 0.77 0.81 0.79 0.79 0.84 0.76 0.75 0.62 0.74 0.72 0.76 0.77 0.76 0.77
10 0.82 0.80 0.77 0.81 0.80 0.75 0.80 0.78 0.63 0.79 0.75 0.65 0.79 0.76 0.69
11 0.83 0.84 0.83 0.81 0.83 0.92 0.83 0.84 0.62 0.80 0.81 0.81 0.80 0.82 0.75
12 0.77 0.78 0.73 0.77 0.78 0.70 0.76 0.75 0.58 0.72 0.73 0.68 0.74 0.74 0.67
  • Note: Values below 1 indicate improvements relative to the no-change forecast. Best univariate method in red, second best in bold black. AR, UCSV, RF, and RR results from Medeiros et al. (2021), Table 6.

Focusing on univariate models, our MDN outperforms the benchmarks for all three criteria at most horizons considered (11 out of 12 horizons, with stable gains around 24% relative to the random walk). Again, statistical significance against the no-change forecast is confirmed by Diebold-Mariano tests in Online Appendix. Table 11 also reports results for the Random Forest (RF) and Ridge Regression (RR) models, identified as the best-performing multivariate methods in Medeiros et al. (2021). These machine learning approaches exploit the full FRED-MD database comprising over 120 macroeconomic variables with four lags each, resulting in approximately 500 potential predictors. Despite this vastly richer information set, the MDN remains competitive, with RMSE ratios within 0.01–0.06 of RF and RR at most horizons.

Overall, despite not being tailored to point forecasting, our approach ranks among the leading univariate methods for forecasting U.S. inflation and natural gas prices, and remains competitive with state-of-the-art multivariate methods that rely on considerably richer information sets.

5 Conclusion

Time series forecasting in the presence of locally explosive dynamics, marked by rapid expansions followed by sudden reversals, remains a core challenge in (macro‑)financial econometrics. We address this issue by introducing a Mixture Density Network designed to capture the distinctive distributional features of noncausal processes. The resulting framework enables near‑instantaneous density forecasting once trained, effectively overcoming the computational constraints associated with existing methods.

We evaluate our approach through extensive Monte Carlo simulations that cover a range of MARMA specifications. The results show that the MDN method consistently achieves the best performance in the tail region and over the full distribution at all forecast horizons. These findings are further validated through two empirical applications to real‑time forecasting of natural gas prices and inflation.

References

  • (1)
  • Athey & Imbens (2019) Athey, S. & Imbens, G. W. (2019), ‘Machine learning methods that economists should know about’, Annual Review of Economics 11(Volume 11, 2019), 685–725.
  • Avelino et al. (2024) Avelino, J. G., Cavalcanti, G. D. C. & Cruz, R. M. O. (2024), ‘Resampling strategies for imbalanced regression: A survey and empirical analysis’, Artificial Intelligence Review 57(4), 82.
  • Azzalini & Dalla Valle (1996) Azzalini, A. & Dalla Valle, A. (1996), ‘The multivariate skew-normal distribution’, Biometrika 83(4), 715–726.
  • Baumeister et al. (2025) Baumeister, C., Huber, F., Lee, T. K. & Ravazzolo, F. (2025), ‘Forecasting natural gas prices in real time’, Journal of Applied Econometrics .
  • Belkin et al. (2019) Belkin, M., Hsu, D., Ma, S. & Mandal, S. (2019), ‘Reconciling modern machine-learning practice and the classical bias–variance trade-off’, Proceedings of the National Academy of Sciences 116(32), 15849–15854.
  • Bishop (1994) Bishop, C. M. (1994), ‘Mixture density networks’.
  • Blasques et al. (2025) Blasques, F., Koopman, S. J., Mingoli, G. & Telg, S. (2025), ‘A novel test for the presence of local explosive dynamics’, Journal of Time Series Analysis .
  • Bruffaerts et al. (2014) Bruffaerts, C., Verardi, V. & Vermandele, C. (2014), ‘A generalized boxplot for skewed and heavy-tailed distributions’, Statistics & Probability Letters 95, 110–117.
  • Cavaliere et al. (2020) Cavaliere, G., Nielsen, H. B. & Rahbek, A. (2020), ‘Bootstrapping noncausal autoregressions: With applications to explosive bubble modeling’, Journal of Business & Economic Statistics 38(1), 55–67.
  • Chen & Guestrin (2016) Chen, T. & Guestrin, C. (2016), XGBoost: A scalable tree boosting system, in ‘Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, pp. 785–794.
  • Dalmasso et al. (2020) Dalmasso, N., Pospisil, T., Lee, A. B., Izbicki, R., Freeman, P. E. & Malz, A. I. (2020), ‘Conditional density estimation tools in python and R with applications to photometric redshifts and likelihood-free cosmological inference’, Astronomy and Computing 30, 100362.
  • Davis & Song (2020) Davis, R. A. & Song, L. (2020), ‘Noncausal vector AR processes with application to economic time series’, Journal of Econometrics 216(1), 246–267.
  • de Truchis et al. (2025) de Truchis, G., Fries, S. & Thomas, A. (2025), ‘Forecasting extreme trajectories using seminorm representations’. Working paper.
  • de Truchis & Thomas (2025) de Truchis, G. & Thomas, A. (2025), Laurent series expansion for MA(\infty) representation of mixed causal-noncausal autoregressive processes, LEDa Working Paper WP 2025-06, LEDa, Université Paris Dauphine, PSL.
  • Dey et al. (2022) Dey, B., Newman, J. A., Andrews, B. H., Izbicki, R., Lee, A. B., Zhao, D., Rau, M. M. & Malz, A. I. (2022), ‘Re-calibrating photometric redshift probability distributions using feature-space regression’, Fourth Workshop on Machine Learning and Physical Sciences (NeurIPS 2021) .
  • Diebold & Mariano (1995) Diebold, F. X. & Mariano, R. S. (1995), ‘Comparing predictive accuracy’, Journal of Business & Economic Statistics 13(3), 253–263.
  • Fries (2022) Fries, S. (2022), ‘Conditional moments of noncausal alpha-stable processes and the prediction of bubble crash odds’, Journal of Business & Economic Statistics 40(4), 1596–1616.
  • Fries & Zakoian (2019) Fries, S. & Zakoian, J.-M. (2019), ‘Mixed causal-noncausal AR processes and the modelling of explosive bubbles’, Econometric Theory 35(6), 1234–1270.
  • Giancaterini et al. (2025) Giancaterini, F., Hecq, A., Jasiak, J. & Manafi Neyazi, A. (2025), ‘Bubble detection with application to green bubbles: A noncausal approach’.
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y. & Courville, A. (2016), Deep Learning, MIT Press.
  • Gourieroux & Jasiak (2016) Gourieroux, C. & Jasiak, J. (2016), ‘Filtering, prediction and simulation methods for noncausal processes’, Journal of Time Series Analysis 37(3), 405–430.
  • Gourieroux & Jasiak (2018) Gourieroux, C. & Jasiak, J. (2018), ‘Misspecification of noncausal order in autoregressive processes’, Journal of Econometrics 205(1), 226–248.
  • Gourieroux & Jasiak (2023) Gourieroux, C. & Jasiak, J. (2023), ‘Generalized covariance estimator’, Journal of Business & Economic Statistics 41(4), 1315–1327.
  • Gourieroux & Jasiak (2025) Gourieroux, C. & Jasiak, J. (2025), ‘Nonlinear fore(back)casting and innovation filtering for causal-noncausal VAR models’.
  • Gourieroux et al. (2020) Gourieroux, C., Jasiak, J. & Monfort, A. (2020), ‘Stationary bubble equilibria in rational expectation models’, Journal of Econometrics 218(2), 714–735.
  • Gourieroux et al. (2025) Gourieroux, C., Lu, Y. & Robert, C.-Y. (2025), ‘The causal-noncausal tail processes’.
  • Gouriéroux & Monfort (2025) Gouriéroux, C. & Monfort, A. (2025), Affine feedforward stochastic (afs) neural network, TSE Working Paper 25-1666, Toulouse School of Economics (TSE).
  • Gourieroux & Zakoian (2017) Gourieroux, C. & Zakoian, J.-M. (2017), ‘Local explosion modelling by non-causal process’, Journal of the Royal Statistical Society Series B 79(3), 737–756.
  • Guillaumin & Efremova (2024) Guillaumin, A. P. & Efremova, N. (2024), ‘Tukey g-and-h neural network regression for non-gaussian data’.
  • Hansen et al. (2011) Hansen, P. R., Lunde, A. & Nason, J. M. (2011), ‘The model confidence set’, Econometrica 79(2), 453–497.
  • He et al. (2015) He, K., Zhang, X., Ren, S. & Sun, J. (2015), Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in ‘2015 IEEE International Conference on Computer Vision (ICCV)’, pp. 1026–1034.
  • Hecq & Velasquez-Gaviria (2025) Hecq, A. & Velasquez-Gaviria, D. (2025), ‘Non-causal and non-invertible ARMA models: Identification, estimation and application in equity portfolios’, Journal of Time Series Analysis 46(2), 325–352.
  • Hecq & Voisin (2021) Hecq, A. & Voisin, E. (2021), ‘Forecasting bubbles with mixed causal-noncausal autoregressive models’, Econometrics and Statistics 20, 29–45.
  • Izbicki (2017) Izbicki, R. (2017), ‘Converting high-dimensional regression to high-dimensional conditional density estimation’, Electronic Journal of Statistics 11(2), 2800–2831.
  • Jasiak & Neyazi (2025) Jasiak, J. & Neyazi, A. M. (2025), ‘Gcov-based portmanteau test’.
  • Kingma & Ba (2015) Kingma, D. P. & Ba, J. (2015), Adam: A method for stochastic optimization, in ‘International Conference on Learning Representations (ICLR)’.
  • Lanne et al. (2012) Lanne, M., Luoto, J. & Saikkonen, P. (2012), ‘Optimal forecasting of noncausal autoregressive time series’, International Journal of Forecasting 28(3), 623–631.
  • Lanne & Saikkonen (2011) Lanne, M. & Saikkonen, P. (2011), ‘Noncausal autoregressions for economic time series’, Journal of Time Series Econometrics 3(3).
  • Medeiros et al. (2021) Medeiros, M. C., Vasconcelos, G. F. R., Veiga, Á. & Zilberman, E. (2021), ‘Forecasting inflation in a data-rich environment: The benefits of machine learning methods’, Journal of Business & Economic Statistics 39(1), 98–119.
  • Nolan (2020) Nolan, J. P. (2020), Univariate Stable Distributions: Models for Heavy Tailed Data, Springer Series in Operations Research and Financial Engineering, Springer, Cham.
  • Nyberg & Saikkonen (2014) Nyberg, H. & Saikkonen, P. (2014), ‘Forecasting with a noncausal VAR model’, Computational Statistics & Data Analysis 76, 536–555.
  • Rosenblatt (1969) Rosenblatt, M. (1969), Conditional probability density and regression estimators, in P. R. Krishnaiah, ed., ‘Multivariate Analysis II’, Academic Press, pp. 25–31.
  • Rosenblatt (2000) Rosenblatt, M. (2000), Gaussian and Non-Gaussian Linear Time Series and Random Fields, Springer.
  • Rothfuss et al. (2020) Rothfuss, J., Ferreira, F., Boehm, S., Walther, S., Ulrich, M., Asfour, T. & Krause, A. (2020), ‘Noise regularization for conditional density estimation’.
  • Rothfuss et al. (2019) Rothfuss, J., Ferreira, F., Walther, S. & Ulrich, M. (2019), ‘Conditional density estimation with neural networks: Best practices and benchmarks’.
  • Saïdi (2023) Saïdi, S. (2023), Noncausal Models: Unit Root Tests and Forecasting, PhD thesis, CY Cergy Paris University.
  • Steininger et al. (2021) Steininger, M., Kobs, K., Davidson, P., Krause, A. & Hotho, A. (2021), ‘Density-based weighting for imbalanced regression’, Machine Learning 110(9-10), 2187–2211.