Tail-Aware Density Forecasting of Locally Explosive Time Series: A Neural Network Approach
Abstract
This paper proposes a Mixture Density Network specifically designed for forecasting time series that exhibit locally explosive behavior. By incorporating skewed t-distributions as mixture components, our approach offers enhanced flexibility in capturing the skewed, heavy-tailed, and potentially multimodal nature of predictive densities associated with bubble dynamics modeled by mixed causal-noncausal ARMA processes. In addition, we implement an adaptive weighting scheme that emphasizes tail observations during training and hence leads to accurate density estimation in the extreme regions most relevant for financial applications. Equally important, once trained, the MDN produces near-instantaneous density forecasts. Through extensive Monte Carlo simulations and two empirical applications, on the natural gas price and inflation, we show that the proposed MDN-based framework delivers superior forecasting performance relative to existing approaches.
Keywords: Forecasting, Noncausal Models, Mixture Density Networks
1 Introduction
Time-series forecasts using causal ARMA models have been playing a crucial role in economic and financial decision-making processes for a long time. A conclusion often reached in this context where the current value of the variable of interest is forced to depend only on its past is that one- and multi-step ahead forecasting in periods of high instability or in presence of forward looking behavior of economic agents is particularly difficult. Indeed, these causal models are characterized by mean reversion, i.e., their forecasts converge to the unconditional mean even after an extreme event occurs, regardless of whether such behavior reflects the true underlying dynamics. In the specific case of financial prices, which we are interested in, the traditional properties of heavy-tailed marginal distributions and volatility clustering are at the core of forecasting models. But a close look at the dynamics of various types of asset prices, sometimes called speculative assets, reveals the presence of phases of locally explosive behaviour: rising patterns followed by a burst, local trends and spikes. Such non-linear characteristics are, however, very poorly captured by standard financial econometric models.
Recently, (causal-)noncausal autoregressive processes have been found to be suitable for modelling such locally explosive behaviour as they allow for dependence on the future. These simple linear models produce rich non-linear patterns without requiring non-linearities to be imposed ex ante. Importantly, noncausal processes are grounded in macroeconomic theory, as they arise as stationary solutions of rational expectation models under infinite variance (Gourieroux et al. 2020). Relevant applications of these models range from asset prices (see Fries & Zakoian 2019, Gourieroux & Zakoian 2017, Gourieroux & Jasiak 2018, Hecq & Velasquez-Gaviria 2025), to macroeconomic data (see Lanne & Saikkonen 2011, Davis & Song 2020), commodity prices (Blasques et al. 2025), climate risk on El Niño and La Niña (de Truchis et al. 2025), green stock prices (Giancaterini et al. 2025), and to electronic currency exchange rates (Cavaliere et al. 2020).
In this context, our paper proposes a novel forecasting methodology for causal–noncausal autoregressive processes based on a specifically-designed neural network architecture and training procedure. Indeed, building predictions with this class of models has been shown to be particularly difficult due to their dependence on future values. In fact, it has long been thought that the conditional predictive density of mixed causal-noncausal processes does not have closed-form and inference can only be performed by simulation-based or Bayesian methods (see Lanne et al. 2012, Gourieroux & Jasiak 2016, Nyberg & Saikkonen 2014). Although these approaches constitute flexible alternatives for predicting general noncausal processes, Hecq & Voisin (2021) highlight two main drawbacks: they become computationally intensive for long‑horizon forecasts and do not accurately capture the dynamics of extreme events concentrated in the tails of the distribution. This limitation is highly problematic given that modeling explosive tail behavior is the primary motivation for using noncausal processes in the first place.
More importantly, noncausal processes exhibit highly non-linear and process-specific tail behavior that must be properly accounted for in forecasting. For example, when the conditional predictive density becomes multimodal, point forecasts become meaningless, as they fail to represent the fundamental dichotomy between bubble continuation and collapse (see Gourieroux & Zakoian 2017, Fries 2022, de Truchis et al. 2025, Gourieroux et al. 2025). For this reason, the noncausal literature as a whole, and this paper in particular, focuses exclusively on density forecasting rather than point prediction. Additionally, this tail behavior may also explain the difficulties encountered by standard numerical approaches when modelling extreme values (see Hecq & Voisin 2021) as well as the failure of state-of-the-art machine learning methods to accurately forecast locally explosive dynamics (Saïdi 2023).
This paper takes a new approach to forecasting noncausal processes, i.e., locally explosive dynamics. We develop a specifically-designed neural network architecture to accurately estimate the full conditional predictive density of univariate general mixed causal-noncausal autoregressive moving average (MARMA) processes. More precisely, we introduce a newly tailored Mixture Density Network (MDN) based on skewed t-distributions as mixture components, which naturally captures both the multimodality and heavy-tailed asymmetric nature of predictive densities during explosive episodes. This contrasts with traditional MDNs (à la Bishop 1994), whose underlying Gaussian assumption cannot accommodate such features. Furthermore, since tail observations are inherently rare, we develop an adaptive weighting scheme that emphasizes extreme regions during training, enabling accurate density estimation in the parts of the distribution most relevant for financial applications.
Another significant advantage of our approach is that, once trained, the MDN produces near-instantaneous density forecasts for all forecasting horizons under analysis. By leveraging neural networks, this paper contributes to the growing literature on machine learning methods for economic and financial applications (Athey & Imbens 2019), and specifically for uncertainty quantification in forecasting. However, we do not pursue a strictly statistical path aimed at identifying true predictive densities or establishing formal properties of the forecasts, such as identifiability, asymptotic convergence, or Markov structure (see, e.g., Gouriéroux & Monfort 2025). We rather exploit the approximation power of neural networks to directly estimate the predictive densities. Our procedure accommodates short time series by proceeding in two steps: first, estimate the underlying noncausal model, and then learn the associated predictive density from simulated paths of this estimated process. Nevertheless, one could directly apply our MDN approach to the raw data, provided that it is sufficiently long.
We compare our MDN with noncausal-specific density forecasting methods and with more general techniques through both extensive Monte Carlo simulations and two empirical applications, on U.S. natural gas price and inflation. In all settings, our approach outperforms competing methods while requiring substantially less computational time. It also achieves superior point‑forecast accuracy compared to standard benchmarks in both applications.
2 Forecasting with Mixture Density Networks
Traditional Mixture Density Networks (Bishop 1994) provide a flexible framework for modeling conditional probability distributions by outputting the parameters of a Gaussian mixture model through a neural network. However, Gaussian mixtures struggle to adequately capture the heavy-tailed behavior of financial prices and lead to systematic underestimation of extreme event probabilities. To address this deficiency, we propose a new MDN, which relies on skewed t-distribution components instead of Gaussian ones to provide flexible parametric control over both skewness and tail heaviness. More precisely, it models the conditional density as:
with the skewed t-distribution probability density function, where denotes the Student-t density and its cumulative distribution function, both with degrees of freedom that control tail thickness. We denote the forecasting horizon by and let be a vector of consecutive observations up to time .111Intuitively, machine learning approaches can naturally leverage vectors of multiple past observations. Kernel estimators could also theoretically be extended to incorporate multiple conditioning lags, but at the expense of an exponential increase in computational complexity due to the curse of dimensionality. Throughout the rest of the paper, we mainly use to insure a fair comparison across methods. The other parameters are the location , the scale , and the shape , which governs asymmetry.222We found this parameterization of the MDN to be numerically more robust than the Tukey g-and-h component-based approach of Guillaumin & Efremova (2024), as it does not need computing numerical inverse transforms via binary search, which can cause training instability when dealing with extreme values in the tails. Additionally, the skewed t-distribution admits well-established multivariate extensions (see Azzalini & Dalla Valle 1996), providing a natural pathway for extending our framework to vector-valued time series in future work. These predictive densities provide a complete characterization of forecast uncertainty: they can be used to construct prediction intervals, assess tail risk probabilities during explosive episodes, and derive point forecasts. We develop our own network architecture and training strategy as follows.
2.1 Network Architecture
Our network architecture is lightweight and parsimonious, consisting of a fully connected multilayer perceptron (MLP) with two hidden layers of dimension 64 and five parallel output heads, one for each parameter of the skewed t-distribution mixture (). This makes it efficient to train even on standard CPU hardware.333While recurrent architectures such as RNNs or LSTMs could capture additional temporal dependencies, we opt for this simpler feedforward structure to maintain computational efficiency and ease of implementation in applied settings. The hidden layers use rectified linear unit (ReLU) activations, a standard choice in deep learning due to its computational efficiency and ability to mitigate vanishing gradient issues (Goodfellow et al. 2016):
where , , , and is applied element-wise. The five output heads map the final hidden representation to the mixture parameters:
where is the number of mixture components, and each output head has weight matrix and bias . The softmax function, , maps an unconstrained -vector to valid probability weights summing to one. The softplus function, , is applied element-wise and provides a smooth transformation ensuring strictly positive outputs for the scale () and degree of freedom () parameters. The location () and skewness () parameters remain unconstrained to allow for arbitrary centering and both left and right asymmetry.
The model is implemented in PyTorch with weights initialized via the Kaiming uniform scheme (He et al. 2015), and trained by stochastic gradient descent (SGD) with the Adam optimizer (Kingma & Ba 2015), minimizing the negative log-likelihood of the mixture density. We add noise regularization during training by injecting small Gaussian perturbations to the input data. This smooths the estimated density by implicitly penalizing the Hessian of the log-likelihood, an approach promoted in the conditional density estimation literature (Rothfuss et al. 2019, 2020). This architecture has approximately 8,000 parameters, exceeding the number of observations available in both our simulations and empirical applications, and is therefore not identifiable. However, for a purely forecasting-oriented approach, this overparameterization is not a limitation but rather an advantage, as it can improve generalization without overfitting (Belkin et al. 2019).
2.2 Adaptive Weighting Function
We focus on financial prices rather than returns, as differencing eliminates the explosive dynamics that we aim to forecast. Prices naturally exhibit imbalance between normal market conditions and extreme events, yet learning algorithms prioritize frequent observations over rare tail events that are crucial for risk assessment. To address this, we combine resampling strategies (Avelino et al. 2024) with cost-sensitive learning (Steininger et al. 2021), which encourage the MDN to focus on extreme events during training.
Let denote the training sample of size , and let the subset of tail events in the training sample be given by The set is defined using only the scalar , not the entire lagged vector used as input. The boundaries and are determined using the generalized boxplot methodology of Bruffaerts et al. (2014).444 for a fixed detection rate , where is the empirical Tukey g-h CDF fitted to a rank-preserving transformation of the observed data to accurately capture skewness and tail heaviness of the data. Then, the adaptive weight function is given by:
| (1) |
The weighting mechanism operates at two levels during training, which allows the model to focus on rare tail events without requiring modifications to the underlying MDN architecture. First, during mini-batch construction, we give more weight to rare cases by employing a weighted random sampling with replacement approach.555A mini-batch is a small random subset of training observations used to compute gradient updates at each iteration of the SGD. Under this scheme, the probability that observation is sampled in a mini-batch is given by Second, within each mini-batch , we apply cost-sensitive learning through a weighted loss function. At each gradient step, the model parameters are updated by minimizing the weighted negative log-likelihood,
where indexes observations in mini-batch . This dual application of weights, both in sampling and loss computation, results in an effective weight of for tail events. This justifies the square-root scaling in (1) to achieve the desired inverse proportion weighting.
2.3 Post-hoc Calibration
By reweighting the training distribution to emphasize extreme events, we train the model on a distribution that differs from the data-generating process. Specifically, when we assign higher weights to tail observations, the model learns to predict conditional distributions that reflect this reweighted empirical distribution rather than the original empirical distribution. This shift causes systematic biases: the model overestimates the probability of extreme events across the entire input space, leading to miscalibrated predictions when evaluated with respect to the original, unweighted distribution.
To correct for miscalibration, we use the method of Dey et al. (2022) based on the uniformity property of the Probability Integral Transform (PIT), where is the predicted cumulative distribution function. This recalibration procedure is applied after the initial model training on and requires a distinct calibration dataset, . First, we use the trained model to compute the PIT values for all couples of observations . Then, the conditional distribution of PIT values is estimated as follows: for each threshold on a grid , a separate XGBoost classifier (Chen & Guestrin 2016) is trained to predict the binary outcome from the calibration features . The predicted probability from each classifier provides an estimate of
Intuitively, measures the empirical frequency with which the model’s predicted CDF, evaluated at the true outcome, falls below for observations with similar features. Perfect calibration corresponds to for all , which would indicate that the PIT values are uniformly distributed conditionally on .
Once the correction function is learned on , it can be applied to any new couple of observations from the test set or future data. The final recalibrated PDF, , is obtained by applying the correction and then renormalizing,
with correction factor The correction factor is derived from a change-of-variables formula which ensures that the recalibrated density produces uniformly distributed PIT values.
3 Monte Carlo Simulations
In this section, we evaluate the forecasting performance of our MDN approach through Monte Carlo simulations on various MARMA specifications. These controlled settings allow us to benchmark against theoretical predictive densities when available, or against theoretical predictive moments. We assess the relative performance of our method by conducting a horse race against established forecasting approaches.
3.1 MARMA Processes
Mixed Causal-Noncausal Autoregressive Moving Average processes naturally capture key distributional features of financial price series: multimodality, skewness, and heavy tails. Let be a stochastic process generated by
| (2) |
where (resp. ) denotes the forward (resp. backward) operator, , , , and are polynomials with roots outside the unit circle, and is a sequence of i.i.d. variables. Equation (2) admits a unique stationary solution, called a , if and for all , and if (resp. ) has no common root with (resp. ). A sufficient condition for the identification of the model in (2) is that is i.i.d. non-gaussian (see Rosenblatt 2000).666To the best of our knowledge, our MDN is the first noncausal-specific method capable of forecasting the conditional predictive density of general MARMA processes, as existing approaches are restricted to MAR specifications.
The literature has been using either -stable or -distributed innovations, the choice of one or another driving the forecasting algorithms proposed so far in the literature. We focus on -stable MARMA models for two reasons. First, the -stable family offers considerable flexibility, encompassing distributions ranging from Gaussian (, ) to Cauchy (, ) as special cases. Second, and most importantly, unlike the -distribution case, theoretical results on predictive densities and moments are available in this setting. Specifically, Gourieroux & Zakoian (2017) derived a closed-form expression for the predictive conditional density of a noncausal MAR when Cauchy. More generally, when follows an -stable law, , with , , and , the theoretical results of Fries (2022) yield closed-form expressions for higher-order conditional moments: for any integer satisfying . This provides a rigorous framework for evaluating density forecasts in the following subsections.
3.2 Simulation Design
We generate time series of length 5,000 from the following MARMA data generating processes (de Truchis & Thomas 2025):
| MAR(0,1): | (3) | |||
| MAR(0,2): | (4) | |||
| MAR(1,1): | (5) | |||
| MARMA(1,1,1,1): | (6) |
where .777The parameter values are chosen by following de Truchis et al. (2025) for the three MAR models and based on Fries (2022) for the MARMA specification. All specifications satisfy the stationarity conditions. In the simulations, we set , where smaller values correspond to fatter tails, which allows us to examine how model performance changes with the degree of tail heaviness.
The forecasting abilities of our MDN approach are compared with those of a set of established conditional density forecasting methods spanning different methodological paradigms: the nonparametric Nadaraya-Watson kernel density estimator (Rosenblatt 1969), the simulation-based approach of Lanne et al. (2012) and the closed-form predictive densities of Gourieroux & Jasiak (2025), both designed for MAR processes, as well as the learning-based FlexZBoost method of Izbicki (2017) and Dalmasso et al. (2020).888For a thorough presentation of these methods, see the Online Appendix. An additional calibration set of 5,000 observations, generated by the same data-generating process, is used to perform the local recalibration procedure described in Section 2.3 for the MDN.
To ensure a fair comparison across all methods, the kernel densities required by Nadaraya-Watson and the noncausal state density needed for the closed-form approach of Gourieroux & Jasiak (2025) are both estimated using the same 5,000-observation training set. The simulation-based method of Lanne et al. (2012) does not require additional historical observations beyond the parameter estimation step and can be applied directly to the evaluation grid once the MAR and -stable parameters are obtained. Note that for Lanne et al. (2012) and Gourieroux & Jasiak (2025), the MAR specification is assumed known, thereby circumventing model identification issues. We estimate the MAR process parameters using the Generalized Covariance (GCov) estimator (Gourieroux & Jasiak 2023), a semi-parametric approach that minimizes a portmanteau statistic based on the autocovariances of transformed residuals. The parameters of the -stable distribution are then obtained by fitting the characteristic function-based estimator of Nolan (2020) to the filtered residuals.999This is an optimal framework to account for parameter estimation uncertainty, whereas in real-life applications model risk also matters and can further hamper the forecasting abilities of these approaches. In contrast, the nonparametric and learning-based methods, Nadaraya-Watson, FlexZBoost, and our MDN, do not require explicit parametric model estimation and directly learn the predictive density from the data.101010To assess the impact of parameter estimation uncertainty, we conducted additional simulations where the methods of Lanne et al. (2012) and Gourieroux & Jasiak (2025) were evaluated using the true data-generating parameters rather than estimated ones. The relative performance rankings remained unchanged, indicating that the observed differences in forecasting accuracy are not primarily attributable to parameter estimation error.
3.3 Bimodality Analysis and Sampled Trajectories
As shown by de Truchis et al. (2025), for any MAR process with a single anticipative root, when conditioning on a single observation, the conditional predictive density theoretically exhibits bimodality in the tail regions. This reflects the fundamental dichotomy of locally explosive dynamics: the bubble either continues or crashes, with probability and respectively (see Section 2.2 in the Online Appendix for a discussion). Accordingly, the predictive density features two modes: one near the unconditional mean (zero in our simulations), corresponding to the crash scenario, and one further from the conditioning value, corresponding to bubble continuation.
Before implementing the forecast evaluation horse-race, we investigate whether the competing approaches capture well this theoretical bimodality of the conditional predictive density, particularly in the tail region. The procedure relies on a grid of 5,000 equispaced conditioning values spanning the quantile range , where denotes the -quantile of the theoretical marginal distribution. For each point on the grid, we estimate the conditional predictive density using each of the competing methods and visually assess their ability to recover the bimodal structure.
Figure 1 displays the one-step-ahead conditional predictive density for a MAR(0,1) process under three conditioning scenarios: (left tail), (center), and (right tail).
Notes: This figure displays the estimated conditional predictive density for a purely noncausal MAR(0,1) process with -stable innovations at three conditioning values: (left tail), (center), and (right tail), where denotes the percentiles. The red dashed line indicates the conditioning value . The tail-index is fixed at .
The MDN (panel a) successfully captures the expected bimodality, exactly as predicted by the geometric decay of .111111The bimodality becomes increasingly pronounced as the horizon extends from to , and it is best captured by the MDN approach (see Section 3 in the Online Appendix). Moreover, employing skewed t-distribution mixture components proves beneficial: the MDN naturally outputs asymmetric and heavy-tailed predictive densities. In contrast, Nadaraya-Watson (panel d) produces a highly erratic predictive density, with substantial gaps and spikes, especially in data-sparse tail regions. The approaches of Lanne et al. (2012) and Gourieroux & Jasiak (2025) exhibit similar irregular behavior. FlexZBoost (panel e) yields smoother estimates than kernel methods but displays small spurious bumps due to its inability to set certain cosine basis coefficients to zero. This prevents it from achieving the clean bimodal structure predicted by theory and confirms that general-purpose density estimators require substantial adaptation to capture locally explosive patterns. The Online Appendix, includes GIFs (Graphics Interchange Format) depicting the evolution of our MDN predictive densities across the entire grid of 5,000 conditioning points, for .
Beyond density estimation, the MDN’s ability to capture bimodality translates directly into realistic trajectory generation. Figure 2 displays a trajectory obtained by iteratively sampling from the one-step-ahead predictive density, starting from an initial and using each sampled value as the next conditioning point. The resulting path exhibits both locally explosive dynamics and abrupt reversals that define noncausal processes, mirroring the behavior of a true MAR(0,1) simulation. Remarkably, when estimating the parameters of the MDN-sampled trajectory, one recovers a MAR(0,1) structure with coefficients close to the true data-generating process (see the Online Appendix for estimation details). We further demonstrate in the same appendix that this sampling procedure extends to higher-order anticipative models by providing an illustration for a MAR(0,2) process conditional on . This underscores our model’s capacity to effectively utilize multiple conditioning variables.

Notes: Left panel: trajectory generated by iteratively sampling from the MDN’s one-step-ahead predictive density. Right panel: true MAR(0,1) simulation with . The underlying MDN has been trained on a simulated sample of 5,000 realizations from the true MAR(0,1) specification in Equation (3), with .
3.4 Simulation Results
Ideally, comparing the alternative forecasting approaches introduced in Section 3.2 would involve evaluating the competing predictive densities against the true theoretical one. However, since the latter is unavailable for most noncausal processes, our simulation analysis proceeds along three complementary directions.
First, we exploit a unique feature of the Cauchy MAR(0,1) process (, ): the availability of a closed-form expression for the conditional predictive density (Gourieroux & Zakoian 2017). This special case provides a direct benchmark for assessing how well each method approximates the true density. Second, for the general case of MARMA processes where closed-form densities are unavailable, we rely on theoretical conditional moments to evaluate predictive accuracy. Specifically, we compute the estimated conditional moments by numerically integrating the estimated predictive densities, and compare them to their theoretical counterparts. Note that our grid-based approach (see Section 3.3) is naturally suited to univariate conditioning, i.e., predicting based solely on the current value , which is consistent with the theoretical conditional predictive densities of Gourieroux & Zakoian (2017), and the conditional moments of Fries (2022). Accordingly, in all simulations, the different methods condition only on the last observed value, . Third, we complement the grid-based evaluation with an out-of-sample forecasting exercise that assesses the predictive densities against realized outcomes, thereby better reflecting real-world forecasting applications.
3.4.1 Case 1: Cauchy MAR(0,1) model
Table 1 compares the estimated predictive densities with the true density of a MAR(0,1) process with and , using Kullback-Leibler (KL) divergence and Integrated Squared Error (ISE) metrics.121212The KL divergence is defined as , while the ISE is given by .
| Horizon | Model | KL Divergence | ISE | ||||
| Center | Tails | Total | Center | Tails | Total | ||
| Nadaraya-Watson | 0.578 | 11.45 | 10.39 | 0.020 | 0.751 | 0.680 | |
| Lanne et al. (2012) | 2.339 | 13.10 | 12.06 | 0.418 | 0.416 | 0.416 | |
| Gourieroux and Jasiak (2025) | 2.512 | 8.392 | 7.823 | 0.441 | 0.600 | 0.585 | |
| FlexZBoost | 2.877 | 3.410 | 3.359 | 0.268 | 0.225 | 0.229 | |
| Mixture Density Network | 0.927 | 1.220 | 1.192 | 0.197 | 0.182 | 0.184 | |
| Nadaraya-Watson | 0.751 | 14.47 | 13.14 | 0.017 | 0.701 | 0.635 | |
| Lanne et al. (2012) | 2.457 | 11.81 | 10.90 | 0.191 | 0.230 | 0.226 | |
| Gourieroux and Jasiak (2025) | 2.924 | 8.685 | 8.127 | 0.292 | 0.474 | 0.456 | |
| FlexZBoost | 2.236 | 3.294 | 3.192 | 0.121 | 0.084 | 0.087 | |
| Mixture Density Network | 0.482 | 0.731 | 0.707 | 0.063 | 0.057 | 0.058 | |
| Nadaraya-Watson | 1.109 | 18.23 | 16.58 | 0.018 | 0.760 | 0.688 | |
| Lanne et al. (2012) | 2.698 | 10.50 | 9.749 | 0.071 | 0.150 | 0.143 | |
| Gourieroux and Jasiak (2025) | 2.445 | 8.702 | 8.097 | 0.211 | 0.393 | 0.375 | |
| FlexZBoost | 1.485 | 1.368 | 1.380 | 0.041 | 0.017 | 0.019 | |
| Mixture Density Network | 0.129 | 0.546 | 0.506 | 0.008 | 0.012 | 0.011 | |
-
Note: This Table reports the Kullback-Leibler (KL) divergence and Integrated Squared Error (ISE) between the true predictive density and estimated densities. Metrics are evaluated over three spatial regions: Center , Tails , and Total , where represents the -th quantile. Best method in red, second best in bold black.
Using the grid-based methodology described above, we estimate the conditional predictive density for each competing method and directly compare it with the true density across three distinct regions: the center , the tails , and the total range .
The MDN exhibits the lowest KL divergence and ISE in the tail region and over the full distribution across all forecast horizons. In the center of the distribution, Nadaraya-Watson remains competitive, achieving the lowest KL divergence and ISE at , and the lowest ISE at . However, it performs substantially worse for density forecasting in the tails, especially at longer horizons. At horizon , the MDN achieves a tail KL divergence of 0.546 compared to 18.23 for Nadaraya-Watson and 8.702 for Gourieroux & Jasiak (2025). Across all horizons and regions examined, the MDN consistently delivers either the best or second-best performance.
3.4.2 Case 2: Predictive Moments Approach
The root mean square error (RMSE) criterion is now employed to compare the estimated conditional moments, to the theoretical ones, , of Fries (2022) and measure the forecasting performance for each tail index and each moment order , for all four DGPs in the same three regions of the distribution as in the first case-scenario. To formally assess the statistical significance of relative performance differences, we employ the Model Confidence Set (MCS) procedure of Hansen et al. (2011) with a 90% confidence level.
Note that, the forecast methods of Gourieroux & Jasiak (2025) and Lanne et al. (2012) require conditioning on multiple lags when the autoregressive order exceeds one, making direct comparison with Fries (2022)’s theoretical moments inappropriate beyond MAR(0,1) and for MARMA models. For this reason, this second-case comparative analysis is structured as follows. For the MAR(0,1) process, the MDN is compared with the full set of alternatives at all forecast horizons. For higher-order processes, MAR(0,2), MAR(1,1) and MARMA(1,1,1,1), the comparison is restricted to Nadaraya-Watson and FlexZBoost due to the conditioning set discrepancy.
Tables 2, 3, 4, and 5 present the one-period-ahead RMSE results for all four data-generating processes. The MDN approach almost always exhibits the lowest RMSE in the tail region and over the full distribution, regardless of the tail index and the conditional moment order.131313Notable exceptions occur at : for MAR(0,2) and MAR(1,1), Nadaraya-Watson performs better for higher-order moments (), while for MARMA(1,1,1,1), Nadaraya-Watson outperforms only for the fourth moment. In the center of the distribution, Nadaraya-Watson consistently achieves the lowest RMSE across almost all configurations, benefiting from regions where the conditional distribution exhibits more regular behavior. Importantly, when the MDN does not achieve the lowest RMSE in the center, it consistently ranks as the second-best method. The more traditional forecasting methods of Lanne et al. (2012) and Gourieroux & Jasiak (2025), as well as FlexZBoost are almost always dominated in all regions.
| Model | |||||||||||||
| Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | ||
| Nadaraya-Watson | 1.0 | 0.434∗ | 43.72 | 41.55 | 10.80∗ | 6619 | 6291 | – | – | – | – | – | – |
| 1.2 | 0.097∗ | 7.182 | 6.607 | 1.360∗ | 341.8 | 314.4 | 12.75∗ | 1.671e+04 | 1.538e+04 | – | – | – | |
| 1.4 | 0.075∗ | 2.313∗ | 2.038∗ | 0.431∗ | 49.86∗ | 43.92∗ | 2.798∗ | 1125 | 991.2 | – | – | – | |
| 1.6 | 0.047∗ | 0.879 | 0.727 | 0.167∗ | 12.20 | 10.09 | 0.523∗ | 149.7 | 123.8 | 4.584∗ | 1786 | 1477 | |
| 1.8 | 0.037∗ | 0.164 | 0.125∗ | 0.095∗ | 1.149 | 0.863∗ | 0.334∗ | 6.941∗ | 5.201∗ | 1.642∗ | 41.28∗ | 30.92∗ | |
| Lanne et al. (2012) | 1.0 | 1.115 | 39.51 | 37.55 | 133.1 | 7320 | 6956 | – | – | – | – | – | – |
| 1.2 | 0.391 | 12.90 | 11.87 | 5.406 | 649.5 | 597.6 | 46.55 | 3.127e+04 | 2.877e+04 | – | – | – | |
| 1.4 | 0.218 | 4.369 | 3.850∗ | 1.602 | 96.61 | 85.10 | 9.985 | 2027 | 1785 | – | – | – | |
| 1.6 | 0.121 | 1.010 | 0.838 | 0.679 | 14.71 | 12.17 | 3.171 | 184.1 | 152.3 | 17.51 | 2234 | 1848 | |
| 1.8 | 0.076 | 0.220 | 0.172 | 0.380 | 2.482 | 1.875 | 1.468 | 20.93 | 15.69 | 5.417 | 159.3 | 119.3 | |
| Gourieroux and Jasiak (2025) | 1.0 | 4.040 | 19.37 | 18.45 | 81.40 | 2666 | 2534 | – | – | – | – | – | – |
| 1.2 | 0.698 | 4.649 | 4.286 | 6.326 | 186.1 | 171.2 | 46.11 | 8597 | 7910 | – | – | – | |
| 1.4 | 0.285 | 2.150∗ | 1.898∗ | 1.590 | 37.05∗ | 32.65∗ | 6.841 | 683.2∗ | 601.9∗ | – | – | – | |
| 1.6 | 0.170 | 1.197 | 0.995 | 0.787 | 11.76 | 9.736 | 2.597 | 123.7 | 102.3 | 11.09 | 1299 | 1075 | |
| 1.8 | 0.092 | 0.449 | 0.342 | 0.385 | 2.526 | 1.908 | 1.060 | 15.13 | 11.35 | 3.584 | 87.15 | 65.28 | |
| FlexZBoost | 1.0 | 2.107 | 16.39 | 15.59 | 2481 | 6082 | 5831 | – | – | – | – | – | – |
| 1.2 | 0.740 | 4.858 | 4.479 | 98.26 | 482.4 | 445.5 | 332.8 | 4.168e+04 | 3.835e+04 | – | – | – | |
| 1.4 | 0.292 | 2.381∗ | 2.102∗ | 10.03 | 75.54 | 66.71 | 87.52 | 2221 | 1957 | – | – | – | |
| 1.6 | 0.182 | 1.038 | 0.864 | 1.629 | 18.51 | 15.34 | 9.190 | 318.7 | 263.7 | 89.79 | 5745 | 4751 | |
| 1.8 | 0.179 | 0.256 | 0.225 | 0.963 | 1.965 | 1.604 | 7.424 | 17.58 | 14.05 | 93.36 | 200.1 | 162.0 | |
| Mixture Density Network | 1.0 | 0.902 | 7.461∗ | 7.096∗ | 37.73 | 1091∗ | 1037∗ | – | – | – | – | – | – |
| 1.2 | 0.305 | 1.705∗ | 1.574∗ | 2.980 | 67.51∗ | 62.12∗ | 38.00 | 2614∗ | 2405∗ | – | – | – | |
| 1.4 | 0.124 | 0.640∗ | 0.567∗ | 0.747 | 8.409 | 7.416 | 4.118 | 133.6 | 117.7 | – | – | – | |
| 1.6 | 0.082 | 0.306∗ | 0.258∗ | 0.427 | 2.745∗ | 2.283∗ | 1.932 | 31.10∗ | 25.75∗ | 24.66 | 390.1∗ | 323.0∗ | |
| 1.8 | 0.056 | 0.113∗ | 0.093∗ | 0.421 | 0.628∗ | 0.547∗ | 1.624 | 4.679∗ | 3.664∗ | 15.36 | 26.48∗ | 22.29∗ | |
Notes: This Table reports the root mean squared error (RMSE) of estimated predictive moments relative to theoretical values. denotes the tail index of the stable distribution. Predictive moments are evaluated over three spatial regions: Center , Tails , and Total , where represents the -th quantile. Best method in red, second best in bold black. An asterisk (∗) designates model(s) which belong to the Model Confidence Set at the 90% confidence level.
| Model | |||||||||||||
| Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | ||
| Nadaraya-Watson | 1.0 | 0.244∗ | 20.23 | 19.23 | 2.252∗ | 1697 | 1613 | – | – | – | – | – | – |
| 1.2 | 0.114∗ | 5.137 | 4.726 | 0.610∗ | 162.9 | 149.9 | 5.464∗ | 6199 | 5703 | – | – | – | |
| 1.4 | 0.086∗ | 1.568 | 1.382 | 0.274∗ | 21.01 | 18.51 | 0.787∗ | 339.5 | 299.1 | – | – | – | |
| 1.6 | 0.044∗ | 0.583 | 0.483 | 0.118∗ | 4.763 | 3.940 | 0.347∗ | 38.99∗ | 32.25∗ | 1.540∗ | 368.0∗ | 304.4∗ | |
| 1.8 | 0.029∗ | 0.188∗ | 0.142∗ | 0.085∗ | 0.962∗ | 0.722∗ | 0.159∗ | 4.597∗ | 3.443∗ | 0.578∗ | 23.65∗ | 17.71∗ | |
| FlexZBoost | 1.0 | 1.164 | 9.557 | 9.090 | 617.9 | 1886 | 1803 | – | – | – | – | – | – |
| 1.2 | 0.455 | 4.648 | 4.280 | 29.97 | 226.7 | 208.9 | 59.39 | 1.322e+04 | 1.216e+04 | – | – | – | |
| 1.4 | 0.265 | 1.789 | 1.581 | 3.475 | 39.33 | 34.68 | 19.30 | 1070 | 942.4 | – | – | – | |
| 1.6 | 0.200 | 0.695 | 0.586 | 1.345 | 4.626 | 3.900 | 12.98 | 47.36 | 39.85 | 226.6 | 502.1 | 434.4 | |
| 1.8 | 0.187 | 0.267 | 0.235 | 0.703 | 1.092 | 0.941 | 3.668 | 9.215 | 7.315 | 32.50 | 72.83 | 58.63 | |
| Mixture Density Network | 1.0 | 0.733 | 5.713∗ | 5.435∗ | 21.11 | 291.6∗ | 277.2∗ | – | – | – | – | – | – |
| 1.2 | 0.129∗ | 2.285∗ | 2.103∗ | 1.647 | 41.24∗ | 37.94∗ | 41.53 | 2067∗ | 1902∗ | – | – | – | |
| 1.4 | 0.084∗ | 0.607∗ | 0.536∗ | 0.773 | 8.158∗ | 7.196∗ | 2.672 | 161.3∗ | 142.1∗ | – | – | – | |
| 1.6 | 0.084 | 0.276∗ | 0.233∗ | 0.575 | 2.696∗ | 2.253∗ | 2.873 | 28.51∗ | 23.63∗ | 30.92 | 309.6∗ | 256.6∗ | |
| 1.8 | 0.049 | 0.168∗ | 0.130∗ | 0.345 | 1.032∗ | 0.806 | 1.053 | 6.553 | 4.955 | 8.499 | 59.19 | 44.66 | |
Notes: For details on variable definitions and methodology, refer to Table 2.
| Model | |||||||||||||
| Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | ||
| Nadaraya-Watson | 1.0 | 0.506∗ | 40.06 | 38.07 | 13.94∗ | 7557 | 7182 | – | – | – | – | – | – |
| 1.2 | 0.104∗ | 6.755 | 6.215 | 1.469∗ | 378.8 | 348.5 | 16.37∗ | 2.040e+04 | 1.877e+04 | – | – | – | |
| 1.4 | 0.083∗ | 2.312 | 2.037 | 0.566∗ | 59.61 | 52.51 | 4.169∗ | 1513 | 1333 | – | – | – | |
| 1.6 | 0.047∗ | 0.798 | 0.660 | 0.199∗ | 12.24 | 10.12 | 0.688∗ | 170.4 | 140.9 | 5.156∗ | 2312 | 1912 | |
| 1.8 | 0.040∗ | 0.159∗ | 0.122∗ | 0.113∗ | 1.335∗ | 1.002∗ | 0.428∗ | 9.769∗ | 7.319∗ | 2.135∗ | 70.68∗ | 52.93∗ | |
| FlexZBoost | 1.0 | 2.320 | 17.85 | 16.98 | 3065 | 7661 | 7343 | – | – | – | – | – | – |
| 1.2 | 0.791 | 5.266 | 4.855 | 120.7 | 616.9 | 569.5 | 462.4 | 5.894e+04 | 5.423e+04 | – | – | – | |
| 1.4 | 0.349 | 2.174 | 1.922 | 12.40 | 84.16 | 74.37 | 120.5 | 2932 | 2583 | – | – | – | |
| 1.6 | 0.191 | 1.306 | 1.086 | 1.996 | 27.51 | 22.78 | 12.41 | 529.1 | 437.6 | 119.7 | 1.049e+04 | 8678 | |
| 1.8 | 0.194 | 0.353 | 0.294 | 1.015 | 3.798 | 2.922 | 8.195 | 38.76 | 29.52 | 108.6 | 424.0 | 325.5 | |
| Mixture Density Network | 1.0 | 0.967 | 3.751∗ | 3.578∗ | 47.44 | 863.9∗ | 821.2∗ | – | – | – | – | – | – |
| 1.2 | 0.328 | 1.789∗ | 1.651∗ | 4.245 | 86.04∗ | 79.17∗ | 62.46 | 3998∗ | 3678∗ | – | – | – | |
| 1.4 | 0.094∗ | 1.021∗ | 0.901∗ | 0.844 | 19.21∗ | 16.92∗ | 4.898 | 367.2∗ | 323.5∗ | – | – | – | |
| 1.6 | 0.061∗ | 0.302∗ | 0.252∗ | 0.479 | 2.472∗ | 2.062∗ | 2.441 | 23.10∗ | 19.16∗ | 31.12 | 300.8∗ | 249.4∗ | |
| 1.8 | 0.067 | 0.154∗ | 0.123∗ | 0.467 | 1.367∗ | 1.069∗ | 1.927 | 12.52∗ | 9.462∗ | 18.88 | 118.4 | 89.51 | |
Notes: For details on variable definitions and methodology, refer to Table 2.
| Model | |||||||||||||
| Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | ||
| Nadaraya-Watson | 1.0 | 0.711∗ | 41.50 | 39.44 | 27.39∗ | 3236 | 3076 | – | – | – | – | – | – |
| 1.2 | 0.134∗ | 6.808 | 6.264 | 1.149∗ | 194.5 | 178.9 | 26.24∗ | 8876 | 8166 | – | – | – | |
| 1.4 | 0.082∗ | 2.129 | 1.876 | 0.383∗ | 42.68 | 37.59 | 3.275∗ | 1044 | 919.7 | – | – | – | |
| 1.6 | 0.056∗ | 0.478∗ | 0.396∗ | 0.127∗ | 4.996 | 4.132 | 0.984∗ | 46.74 | 38.66∗ | 10.97∗ | 631.4 | 522.2 | |
| 1.8 | 0.042∗ | 0.290 | 0.219∗ | 0.116∗ | 1.566 | 1.175∗ | 0.207∗ | 9.357∗ | 7.006∗ | 1.216∗ | 48.67∗ | 36.44∗ | |
| FlexZBoost | 1.0 | 1.679 | 20.73 | 19.71 | 1770 | 6375 | 6083 | – | – | – | – | – | – |
| 1.2 | 0.592 | 4.634 | 4.269 | 66.67 | 344.6 | 318.1 | 185.0 | 4.014e+04 | 3.693e+04 | – | – | – | |
| 1.4 | 0.296 | 1.736 | 1.536 | 6.192 | 68.48 | 60.39 | 40.52 | 2015 | 1775 | – | – | – | |
| 1.6 | 0.201 | 0.828 | 0.694 | 1.488 | 19.57 | 16.21 | 11.53 | 277.7 | 229.8 | 185.3 | 8553 | 7074 | |
| 1.8 | 0.196 | 0.402 | 0.328 | 0.693 | 4.325 | 3.270 | 4.234 | 38.92 | 29.27 | 43.93 | 426.0 | 320.2 | |
| Mixture Density Network | 1.0 | 0.689∗ | 5.908∗ | 5.619∗ | 68.94 | 659.3∗ | 626.9∗ | – | – | – | – | – | – |
| 1.2 | 0.261 | 2.209∗ | 2.035∗ | 1.493 | 23.46∗ | 21.59∗ | 42.15 | 1805∗ | 1661∗ | – | – | – | |
| 1.4 | 0.173 | 0.925∗ | 0.819∗ | 1.186 | 4.344∗ | 3.868∗ | 13.60 | 180.8∗ | 159.4∗ | – | – | – | |
| 1.6 | 0.086 | 0.327∗ | 0.274∗ | 0.633 | 0.972∗ | 0.879∗ | 3.186 | 26.41∗ | 21.91∗ | 41.99 | 180.3∗ | 151.0∗ | |
| 1.8 | 0.089 | 0.132∗ | 0.115∗ | 0.576 | 0.902∗ | 0.776∗ | 2.915 | 6.460∗ | 5.208∗ | 22.14 | 52.50∗ | 41.95 | |
Notes: For details on variable definitions and methodology, refer to Table 2.
The statistical significance of these differences in performance is confirmed by the MCS tests. In most cases, only one forecasting approach belongs to the : Nadaraya-Watson for the central region, and the MDN for the tail region and overall distribution.141414We also test the sensitivity of our findings to the choice of the forecast horizon ( and ) both in terms of RMSE and MCS test. The results, available in Section 3 of the Online Appendix, are qualitatively similar.
Figure 3 provides a visual comparison of the one-step-ahead predictive conditional moments for a MAR(0,1) process with tail-index across the five competing approaches. The MDN produces smooth predictions that closely track the theoretical moments throughout the distribution. Gourieroux & Jasiak (2025)’s forecasts are also smooth but increasingly drift from the truth in the tails. FlexZBoost exhibits a staircase-like pattern with systematic deviations. The moment forecasts from Nadaraya-Watson and Lanne et al. (2012) become increasingly noisy as we move away from the center. These patterns are qualitatively similar, though noisier, at longer horizons and (see the Online Appendix).
Notes: This figure displays the estimated predictive moments for as a function of the conditioning variable for a purely noncausal MAR(0,1) process with -stable innovations. Each panel from (a) to (e) shows the results for a specific density forecasting method. Blue curves represent estimated moments, while orange curves show the corresponding theoretical ones. The tail-index is fixed at .
3.4.3 Case 3: Comparison with the Realized Outcomes
Rather than evaluating predictive densities at pre-specified conditioning values, we now train each method on 5,000 observations from a MAR(0,1) process and use the trained models to forecast the subsequent 500 realizations. This approach allows us to assess density forecast quality using proper scoring rules that directly compare the predictive density, , against the realized outcome .
Table 6 presents detailed results for forecast horizons using four complementary evaluation metrics: the Conditional Density Estimation (CDE) loss, the Continuous Ranked Probability Score (CRPS), the logarithmic probability score, and the quantile score at the 10% level (see the Online Appendix for formal definitions).151515For CDE loss, CRPS, and quantile scores, lower values indicate superior performance, while for the log probability score, higher values are preferred. As in our previous evaluation strategy, we assess the performance across three distributional regions (Center, Tails, and Total) to capture method-specific strengths. The results corroborate our grid-based findings. At the one-step-ahead horizon, the MDN achieves the best or second-best performance across nearly all metrics and tail index values, and it is particularly excelling in the tail region. As the forecast horizon extends to , the relative performance rankings remain remarkably stable. The MDN continues to dominate in the tail region and overall distribution for most values, though Nadaraya-Watson exhibits competitive performance for , in particular at longer horizons.
| Model | CDE Loss | CRPS | Log Prob | QS 10% | |||||||||
| Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | Center | Tails | Total | ||
| Nadaraya-Watson | 1.0 | -0.235 | -0.076 | -0.221 | 1.217 | 3.716 | 2.041 | -2.354 | -4.824 | -2.822 | 0.509 | 1.964 | 0.889 |
| Lanne et al. (2012) | 0.161 | 0.080 | 0.135 | 3.499 | 3.811 | 3.794 | -3.926 | -3.882 | -4.207 | 0.702 | 0.756 | 0.841 | |
| Gourieroux and Jasiak (2025) | 0.144 | 0.192 | 0.172 | 3.709 | 4.312 | 4.402 | -4.046 | -4.976 | -4.450 | 1.522 | 1.916 | 1.779 | |
| FlexZBoost | -0.034 | -0.014 | -0.011 | 7.363 | 12.13 | 10.71 | -3.452 | -4.511 | -4.785 | 1.167 | 5.739 | 5.476 | |
| Mixture Density Network | -0.098 | -0.087 | -0.088 | 1.584 | 2.817 | 2.281 | -2.568 | -2.887 | -2.773 | 0.693 | 0.965 | 0.942 | |
| Nadaraya-Watson | 1.2 | -0.308 | -0.342 | -0.299 | 0.690 | 1.126 | 0.993 | -1.750 | -2.158 | -1.927 | 0.264 | 0.296 | 0.299 |
| Lanne et al. (2012) | -0.129 | -0.130 | -0.114 | 0.834 | 1.435 | 1.128 | -2.029 | -3.273 | -2.403 | 0.272 | 0.432 | 0.342 | |
| Gourieroux and Jasiak (2025) | -0.029 | -0.085 | -0.015 | 0.988 | 1.325 | 1.296 | -2.236 | -2.307 | -2.429 | 0.297 | 0.309 | 0.335 | |
| FlexZBoost | -0.081 | -0.076 | -0.051 | 1.685 | 2.896 | 2.440 | -2.585 | -3.014 | -3.228 | 0.552 | 1.132 | 1.141 | |
| Mixture Density Network | -0.302 | -0.348 | -0.290 | 0.682 | 1.108 | 0.996 | -1.580 | -1.600 | -1.710 | 0.258 | 0.288 | 0.334 | |
| Nadaraya-Watson | 1.4 | -0.350 | -0.486 | -0.327 | 0.511 | 0.585 | 0.703 | -1.397 | -1.151 | -1.582 | 0.161 | 0.167 | 0.201 |
| Lanne et al. (2012) | -0.288 | -0.459 | -0.258 | 0.529 | 0.752 | 0.721 | -1.418 | -1.663 | -1.765 | 0.166 | 0.266 | 0.223 | |
| Gourieroux and Jasiak (2025) | -0.237 | -0.408 | -0.207 | 0.583 | 0.587 | 0.770 | -1.503 | -1.329 | -1.750 | 0.175 | 0.173 | 0.210 | |
| FlexZBoost | -0.184 | -0.237 | -0.149 | 0.795 | 0.927 | 0.965 | -1.982 | -1.846 | -2.266 | 0.279 | 0.325 | 0.390 | |
| Mixture Density Network | -0.355 | -0.542 | -0.332 | 0.511 | 0.585 | 0.684 | -1.266 | -1.037 | -1.450 | 0.160 | 0.166 | 0.202 | |
| Nadaraya-Watson | 1.6 | -0.398 | -0.608 | -0.360 | 0.432 | 0.350 | 0.514 | -1.149 | -0.796 | -1.317 | 0.127 | 0.104 | 0.152 |
| Lanne et al. (2012) | -0.388 | -0.444 | -0.301 | 0.427 | 0.420 | 0.540 | -1.156 | -1.506 | -1.522 | 0.128 | 0.121 | 0.162 | |
| Gourieroux and Jasiak (2025) | -0.306 | -0.513 | -0.280 | 0.467 | 0.386 | 0.569 | -1.244 | -0.940 | -1.470 | 0.138 | 0.105 | 0.163 | |
| FlexZBoost | -0.354 | -0.492 | -0.298 | 0.452 | 0.393 | 0.558 | -1.542 | -0.944 | -1.673 | 0.145 | 0.125 | 0.182 | |
| Mixture Density Network | -0.410 | -0.578 | -0.356 | 0.420 | 0.359 | 0.529 | -1.083 | -0.834 | -1.288 | 0.125 | 0.105 | 0.159 | |
| Nadaraya-Watson | 1.8 | -0.455 | -0.716 | -0.386 | 0.371 | 0.262 | 0.437 | -0.971 | -0.535 | -1.164 | 0.109 | 0.078 | 0.130 |
| Lanne et al. (2012) | -0.416 | -0.411 | -0.327 | 0.370 | 0.295 | 0.440 | -1.005 | -1.044 | -1.357 | 0.110 | 0.080 | 0.131 | |
| Gourieroux and Jasiak (2025) | -0.415 | -0.706 | -0.357 | 0.384 | 0.264 | 0.450 | -1.037 | -0.539 | -1.240 | 0.111 | 0.077 | 0.136 | |
| FlexZBoost | -0.434 | -0.657 | -0.378 | 0.381 | 0.281 | 0.447 | -1.588 | -1.336 | -1.787 | 0.109 | 0.089 | 0.131 | |
| Mixture Density Network | -0.460 | -0.687 | -0.382 | 0.368 | 0.263 | 0.434 | -0.943 | -0.571 | -1.152 | 0.107 | 0.075 | 0.128 | |
Notes: This table reports density forecast performance metrics across different tail index values (). CDE Loss (Conditional Density Estimation loss), CRPS (Continuous Ranked Probability Score), and QS 10% (Quantile Score at 10% level) are loss functions where lower values indicate better performance. Log Prob (Log Probability Score) is a scoring rule where higher values indicate better performance. Metrics are evaluated over three spatial regions: Center , Tails , and Total , where represents the -th quantile. Best method in red, second best in bold black.
Note also that contrary to the previous two setups, this evaluation framework has the advantage of allowing one to compare all the alternative forecasting approaches for all MAR models and at all forecast horizons.
As a final remark, comparing Tables 1 and 6 for the case reveals a notable discrepancy. When evaluated against the true predictive density, the MDN exhibits clear dominance, achieving KL divergence and ISE values substantially lower than all competitors both in the tail and over the total regions. However, when assessed using proxy metrics that do not require the true density, although the MDN remains among the best-performing methods, its superiority is less apparent. This suggests that standard scoring rules may lack sufficient discriminatory power to identify the best-performing forecasting method when predictive distributions exhibit heavy tails and bimodality.
3.5 Runtime Analysis
Finally, Table 7 reports the average computational time (over all tail indices ) required by each method to generate the forecasts for a MAR(0,1) process. For each forecast horizon, the reported runtime corresponds to the average time needed to compute the 5,000 conditional predictive densities over the grid of conditioning values . FlexZBoost achieves the shortest runtime (under 5 seconds), though at the cost of inferior forecast accuracy, as demonstrated above. The MDN requires only 2-3 minutes. In stark contrast, the simulation-based methods of Lanne et al. (2012) and Gourieroux & Jasiak (2025) require substantially longer computation times, with the latter exhibiting a dramatic increase from about 1 hour at to over 5 hours at .
4 Empirical Applications
4.1 Forecasting Natural Gas Prices in Real Time
Having established the MDN’s superior performance in controlled settings where the true DGP is known, we now evaluate its forecasting ability on real-world data where model specification uncertainty and data complexities are unavoidable. Natural gas prices are notoriously difficult to forecast due to periodic episodes of locally explosive behavior. These characteristics make the natural gas market a compelling testbed for our MDN approach.
We adopt the real-time forecasting framework of Baumeister et al. (2025), who provide an extensive evaluation of point forecasting methods for the real Henry Hub spot price, the benchmark for North American natural gas markets. Their setup accounts for publication lags and data revisions through a database of monthly vintages, ensuring that forecasts rely only on information available at each forecast origin. The estimation period runs from January 1976 to January 1997, and the out-of-sample evaluation period spans February 1997 to February 2024. We use forecast horizons ranging from 1 to 24 months. We defer the reader to Baumeister et al. (2025) for a detailed description of the data construction and real-time constraints.
Figure 4 displays the evolution of the real Henry Hub spot price over the sample period. The series exhibits clear episodes of locally explosive behavior, most notably during the 2005-2008 period, characterized by sharp run-ups followed by abrupt collapses, precisely the type of dynamics that noncausal processes are designed to capture. Using the GCov estimation method (Gourieroux & Jasiak 2023) on the in-sample period (1976M1-1997M1), we estimate all possible MAR specifications such that , where is chosen based on the Partial Autocorrelation Function. Then we select the model yielding i.i.d. residuals based on the portmanteau test of Jasiak & Neyazi (2025). The best specification is a purely noncausal MAR(0,1) process. The -stable distribution parameters and their respective standard deviations are then obtained by fitting the characteristic function-based estimator of Nolan (2020) to the filtered residuals.161616The standard deviations of the MAR parameters are obtained by following Gourieroux & Jasiak (2023). Besides, the MAR(0,1) dynamics of the noncausal process is found to exhibit temporal robustness, with comparable estimates for the full post-revised series (1976M1–2024M2). As reported in Table 8, both samples reveal strong anticipative persistence () and heavy tails (), substantially thicker than in the Gaussian case (), and consistent with the extreme price movements observed in the data.
| Parameter | Real-Time In-Sample | Full Period Post-Revised |
| 0.957∗∗∗ (0.018) | 0.945∗∗∗ (0.015) | |
| 1.779∗∗∗ (0.186) | 1.830∗∗∗ (0.125) | |
| 0.415∗∗ (0.200) | 0.628∗∗∗ (0.149) | |
| 0.070∗∗∗ (0.006) | 0.158∗∗∗ (0.009) |
-
•
Note: Standard errors in parentheses. ∗∗∗ , ∗∗ , ∗ .
To evaluate the density forecasting performance of our MDN approach in this real-time setting, we simulate a trajectory of 5,000 observations from the estimated MAR(0,1) process on which each method is fitted, and generate out-of-sample forecasts over the period 1997M2–2024M2 for horizons months. We assess the quality of the predictive densities with the four complementary metrics used in simulations (CDE loss, CRPS, log-probability score, and quantile score).
Table 9 summarizes the results. The MDN approach delivers the best or near-best performance across the majority of horizons and metrics. At the shortest horizon (), the MDN achieves the best CDE loss and CRPS, while FlexZBoost achieves a slightly lower quantile score. The MDN dominates in terms of log-probability score across all horizons. From onward, the MDN consistently achieves the best performance across all four metrics, with only occasional exceptions at intermediate horizons () where Nadaraya-Watson performs marginally better in terms of CDE loss and CRPS. Importantly, when the MDN is not the best performer for a given metric, it consistently ranks as the second-best method.
| Horizon | Metric | Nadaraya-Watson | Lanne et al. (2012) | Gourieroux and Jasiak (2025) | FlexZBoost | Mixture Density Network |
| CDE loss | -0.990 | 8.105 | -0.631 | -1.130 | -1.167 | |
| CRPS | 0.189 | 0.212 | 0.195 | 0.177 | 0.175 | |
| Log probability score | -0.453 | -8.463 | -0.710 | -1.165 | -0.214 | |
| Quantile score (10%) | 0.062 | 0.078 | 0.093 | 0.059 | 0.063 | |
| CDE loss | -0.660 | 7.813 | 0.881 | -0.637 | -0.701 | |
| CRPS | 0.311 | 0.395 | 0.392 | 0.297 | 0.291 | |
| Log probability score | -1.024 | -8.839 | -2.375 | -3.573 | -0.713 | |
| Quantile score (10%) | 0.099 | 0.174 | 0.235 | 0.090 | 0.081 | |
| CDE loss | -0.506 | 9.300 | 1.530 | -0.469 | -0.542 | |
| CRPS | 0.432 | 0.613 | 0.628 | 0.410 | 0.393 | |
| Log probability score | -1.416 | -9.429 | -3.645 | -3.904 | -1.029 | |
| Quantile score (10%) | 0.144 | 0.217 | 0.426 | 0.120 | 0.118 | |
| CDE loss | -0.446 | 9.128 | 1.615 | -0.383 | -0.459 | |
| CRPS | 0.468 | 0.790 | 0.781 | 0.453 | 0.435 | |
| Log probability score | -1.440 | -10.90 | -4.083 | -3.906 | -1.102 | |
| Quantile score (10%) | 0.137 | 0.254 | 0.564 | 0.145 | 0.116 | |
| CDE loss | -0.404 | 9.744 | 2.062 | -0.350 | -0.402 | |
| CRPS | 0.474 | 0.950 | 0.960 | 0.474 | 0.478 | |
| Log probability score | -1.519 | -11.17 | -4.767 | -3.625 | -1.232 | |
| Quantile score (10%) | 0.119 | 0.234 | 0.724 | 0.128 | 0.117 | |
| CDE loss | -0.354 | 10.14 | 2.195 | -0.283 | -0.348 | |
| CRPS | 0.489 | 1.067 | 1.181 | 0.511 | 0.510 | |
| Log probability score | -1.669 | -11.32 | -5.064 | -4.285 | -1.327 | |
| Quantile score (10%) | 0.120 | 0.223 | 0.937 | 0.134 | 0.117 | |
| CDE loss | -0.332 | 11.06 | 2.242 | -0.244 | -0.349 | |
| CRPS | 0.514 | 1.165 | 1.318 | 0.533 | 0.515 | |
| Log probability score | -1.766 | -12.14 | -5.149 | -4.773 | -1.350 | |
| Quantile score (10%) | 0.122 | 0.227 | 1.043 | 0.136 | 0.114 | |
| CDE loss | -0.341 | 9.754 | 2.085 | -0.273 | -0.356 | |
| CRPS | 0.524 | 1.255 | 1.359 | 0.552 | 0.516 | |
| Log probability score | -1.748 | -11.67 | -5.026 | -4.682 | -1.339 | |
| Quantile score (10%) | 0.120 | 0.230 | 1.005 | 0.134 | 0.115 | |
| CDE loss | -0.336 | 9.564 | 2.143 | -0.245 | -0.336 | |
| CRPS | 0.532 | 1.342 | 1.408 | 0.564 | 0.524 | |
| Log probability score | -1.780 | -12.65 | -5.280 | -4.307 | -1.359 | |
| Quantile score (10%) | 0.117 | 0.231 | 0.986 | 0.133 | 0.109 |
-
•
Note: Best method in red, second best in bold black.
A natural question is whether the MDN’s well-calibrated predictive distributions also translate into accurate point predictions. Table 10 addresses this by comparing the RMSE of the MDN’s point forecasts, computed as the median of the predictive density, against the forecasts of a comprehensive set of models evaluated by Baumeister et al. (2025), expressed as ratios relative to the no-change forecast (). The comparison encompasses both univariate specifications (AR(1), AR(AIC), exponential smoothing) and multivariate Bayesian VAR models that incorporate additional predictors. Focusing first on the univariate benchmarks, the MDN delivers the lowest RMSE at all horizons from to , outperforming AR and exponential smoothing specifications. At the 9-month-ahead horizon, the MDN achieves an RMSE ratio of 0.864, representing a 14% improvement over the random walk and substantially better than all univariate alternatives. Statistical significance against the no-change forecast is confirmed by Diebold-Mariano tests (Diebold & Mariano 1995), see Online Appendix. At longer horizons (), exponential smoothing takes the lead, although the MDN remains competitive, consistently ranking second-best among univariate methods.
Remarkably, the MDN also outperforms the multivariate BVAR models at short- to medium-term horizons. The BVAR specifications in Baumeister et al. (2025) exploit up to six predictor variables, yet the MDN achieves lower RMSE ratios up to horizon . Moreover, unlike Baumeister et al. (2025), our approach does not require recursive re-estimation of the model at each forecast origin, making it computationally more efficient.
| Univariate Models | Multivariate Models | ||||||
| Horizon | AR(1) | AR(AIC) | Exp. Smoothing | MDN | BVAR(AIC) | BVAR(1) | |
| 1 | 0.993 | 1.021 | 1.456 | 0.963 | 1.003 | 0.984 | |
| 3 | 0.978 | 0.989 | 1.121 | 0.937 | 0.986 | 0.962 | |
| 6 | 0.949 | 0.944 | 0.970 | 0.894 | 0.948 | 0.923 | |
| 9 | 0.925 | 0.921 | 0.905 | 0.864 | 0.909 | 0.897 | |
| 12 | 0.915 | 0.909 | 0.893 | 0.882 | 0.900 | 0.875 | |
| 15 | 0.914 | 0.908 | 0.896 | 0.899 | 0.909 | 0.876 | |
| 18 | 0.918 | 0.911 | 0.899 | 0.897 | 0.923 | 0.886 | |
| 21 | 0.924 | 0.914 | 0.890 | 0.901 | 0.925 | 0.905 | |
| 24 | 0.933 | 0.919 | 0.884 | 0.908 | 0.940 | 0.922 | |
-
•
Note: Values below 1 indicate improvements relative to the no-change forecast. Best univariate method in red, second best in bold black. Benchmark results from Baumeister et al. (2025), Table 3. We report RMSE ratios by taking the square root of the Mean Squared Prediction Error (MSPE) measures in Baumeister et al. (2025), to ensure consistency throughout the paper.
The excellent point forecast performance of the MDN can be partly attributed to the noncausal specification itself. By explicitly modeling the anticipative dynamics, the MAR(0,1) process captures the locally explosive behavior of natural gas prices, a feature that the purely causal models employed by Baumeister et al. (2025) cannot accommodate. The gains are concentrated at short- to medium-term horizons (1–9 months), where the nonlinear dynamics inherent to noncausal processes exert their strongest influence and prove particularly effective at capturing the explosive episodes observed in natural gas prices. At longer horizons, mean-reverting forces dominate the predictive signal, diminishing the comparative advantage of our approach, although the MDN remains highly competitive. Furthermore, the MDN outperforms the alternative density forecasting methods in terms of RMSE. Additional results are reported in the Online Appendix.
4.2 Forecasting Inflation in Real Time
To further assess the point forecasting performance of our MDN approach, we apply it to U.S. inflation, a series known to exhibit noncausal dynamics (Lanne & Saikkonen 2011, Lanne et al. 2012), following the real-time framework of Medeiros et al. (2021). Importantly, unlike natural gas prices, inflation exhibits near-Gaussian tail index (), providing a valuable test of whether the MDN, designed to capture heavy-tailed distributions, retains its forecasting advantages when tail behavior is less pronounced. The model is estimated on the January 2001 vintage of FRED-MD dataset, which covers January 1960 to December 2000. Forecasts for the period January 2001 to December 2015 are then evaluated against the ex-post revised series from the January 2016 vintage (see Figure 5). This real-time setup mirrors our natural gas application and reflects the information actually available to forecasters at each point in time.
Using the same procedure as in Section 4, we estimate a purely noncausal MAR(0,2) on the in-sample period and train our MDN on a simulated trajectory of 5,000 observations (see the Online Appendix for estimation results). Table 11 reports the RMSE, mean absolute error (MAE), and median absolute deviation from the median (MAD) for the MDN and the two univariate benchmarks (AR and unobserved components stochastic volatility – UCSV – models) from Medeiros et al. (2021), relative to the no-change forecast.
| Univariate Models | Multivariate Models | |||||||||||||||
| AR(BIC) | UCSV | MDN | RF | RR | ||||||||||||
| Horizon | RMSE | MAE | MAD | RMSE | MAE | MAD | RMSE | MAE | MAD | RMSE | MAE | MAD | RMSE | MAE | MAD | |
| 1 | 0.91 | 0.87 | 0.80 | 0.97 | 0.91 | 0.86 | 0.86 | 0.85 | 0.77 | 0.88 | 0.83 | 0.74 | 0.84 | 0.81 | 0.77 | |
| 2 | 0.81 | 0.80 | 0.78 | 0.82 | 0.82 | 0.82 | 0.79 | 0.77 | 0.67 | 0.74 | 0.72 | 0.72 | 0.74 | 0.72 | 0.70 | |
| 3 | 0.77 | 0.73 | 0.66 | 0.79 | 0.77 | 0.82 | 0.76 | 0.75 | 0.72 | 0.70 | 0.66 | 0.62 | 0.72 | 0.68 | 0.60 | |
| 4 | 0.78 | 0.75 | 0.80 | 0.80 | 0.77 | 0.81 | 0.77 | 0.74 | 0.63 | 0.72 | 0.72 | 0.73 | 0.75 | 0.72 | 0.76 | |
| 5 | 0.79 | 0.80 | 0.76 | 0.78 | 0.79 | 0.80 | 0.75 | 0.75 | 0.68 | 0.72 | 0.74 | 0.75 | 0.76 | 0.77 | 0.74 | |
| 6 | 0.80 | 0.81 | 0.76 | 0.79 | 0.80 | 0.76 | 0.76 | 0.76 | 0.66 | 0.75 | 0.76 | 0.77 | 0.78 | 0.80 | 0.75 | |
| 7 | 0.80 | 0.79 | 0.73 | 0.81 | 0.81 | 0.90 | 0.76 | 0.75 | 0.55 | 0.76 | 0.75 | 0.75 | 0.79 | 0.77 | 0.75 | |
| 8 | 0.77 | 0.75 | 0.74 | 0.80 | 0.80 | 0.86 | 0.76 | 0.73 | 0.55 | 0.74 | 0.72 | 0.73 | 0.76 | 0.73 | 0.72 | |
| 9 | 0.79 | 0.77 | 0.81 | 0.79 | 0.79 | 0.84 | 0.76 | 0.75 | 0.62 | 0.74 | 0.72 | 0.76 | 0.77 | 0.76 | 0.77 | |
| 10 | 0.82 | 0.80 | 0.77 | 0.81 | 0.80 | 0.75 | 0.80 | 0.78 | 0.63 | 0.79 | 0.75 | 0.65 | 0.79 | 0.76 | 0.69 | |
| 11 | 0.83 | 0.84 | 0.83 | 0.81 | 0.83 | 0.92 | 0.83 | 0.84 | 0.62 | 0.80 | 0.81 | 0.81 | 0.80 | 0.82 | 0.75 | |
| 12 | 0.77 | 0.78 | 0.73 | 0.77 | 0.78 | 0.70 | 0.76 | 0.75 | 0.58 | 0.72 | 0.73 | 0.68 | 0.74 | 0.74 | 0.67 | |
-
•
Note: Values below 1 indicate improvements relative to the no-change forecast. Best univariate method in red, second best in bold black. AR, UCSV, RF, and RR results from Medeiros et al. (2021), Table 6.
Focusing on univariate models, our MDN outperforms the benchmarks for all three criteria at most horizons considered (11 out of 12 horizons, with stable gains around 24% relative to the random walk). Again, statistical significance against the no-change forecast is confirmed by Diebold-Mariano tests in Online Appendix. Table 11 also reports results for the Random Forest (RF) and Ridge Regression (RR) models, identified as the best-performing multivariate methods in Medeiros et al. (2021). These machine learning approaches exploit the full FRED-MD database comprising over 120 macroeconomic variables with four lags each, resulting in approximately 500 potential predictors. Despite this vastly richer information set, the MDN remains competitive, with RMSE ratios within 0.01–0.06 of RF and RR at most horizons.
Overall, despite not being tailored to point forecasting, our approach ranks among the leading univariate methods for forecasting U.S. inflation and natural gas prices, and remains competitive with state-of-the-art multivariate methods that rely on considerably richer information sets.
5 Conclusion
Time series forecasting in the presence of locally explosive dynamics, marked by rapid expansions followed by sudden reversals, remains a core challenge in (macro‑)financial econometrics. We address this issue by introducing a Mixture Density Network designed to capture the distinctive distributional features of noncausal processes. The resulting framework enables near‑instantaneous density forecasting once trained, effectively overcoming the computational constraints associated with existing methods.
We evaluate our approach through extensive Monte Carlo simulations that cover a range of MARMA specifications. The results show that the MDN method consistently achieves the best performance in the tail region and over the full distribution at all forecast horizons. These findings are further validated through two empirical applications to real‑time forecasting of natural gas prices and inflation.
References
- (1)
- Athey & Imbens (2019) Athey, S. & Imbens, G. W. (2019), ‘Machine learning methods that economists should know about’, Annual Review of Economics 11(Volume 11, 2019), 685–725.
- Avelino et al. (2024) Avelino, J. G., Cavalcanti, G. D. C. & Cruz, R. M. O. (2024), ‘Resampling strategies for imbalanced regression: A survey and empirical analysis’, Artificial Intelligence Review 57(4), 82.
- Azzalini & Dalla Valle (1996) Azzalini, A. & Dalla Valle, A. (1996), ‘The multivariate skew-normal distribution’, Biometrika 83(4), 715–726.
- Baumeister et al. (2025) Baumeister, C., Huber, F., Lee, T. K. & Ravazzolo, F. (2025), ‘Forecasting natural gas prices in real time’, Journal of Applied Econometrics .
- Belkin et al. (2019) Belkin, M., Hsu, D., Ma, S. & Mandal, S. (2019), ‘Reconciling modern machine-learning practice and the classical bias–variance trade-off’, Proceedings of the National Academy of Sciences 116(32), 15849–15854.
- Bishop (1994) Bishop, C. M. (1994), ‘Mixture density networks’.
- Blasques et al. (2025) Blasques, F., Koopman, S. J., Mingoli, G. & Telg, S. (2025), ‘A novel test for the presence of local explosive dynamics’, Journal of Time Series Analysis .
- Bruffaerts et al. (2014) Bruffaerts, C., Verardi, V. & Vermandele, C. (2014), ‘A generalized boxplot for skewed and heavy-tailed distributions’, Statistics & Probability Letters 95, 110–117.
- Cavaliere et al. (2020) Cavaliere, G., Nielsen, H. B. & Rahbek, A. (2020), ‘Bootstrapping noncausal autoregressions: With applications to explosive bubble modeling’, Journal of Business & Economic Statistics 38(1), 55–67.
- Chen & Guestrin (2016) Chen, T. & Guestrin, C. (2016), XGBoost: A scalable tree boosting system, in ‘Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, pp. 785–794.
- Dalmasso et al. (2020) Dalmasso, N., Pospisil, T., Lee, A. B., Izbicki, R., Freeman, P. E. & Malz, A. I. (2020), ‘Conditional density estimation tools in python and R with applications to photometric redshifts and likelihood-free cosmological inference’, Astronomy and Computing 30, 100362.
- Davis & Song (2020) Davis, R. A. & Song, L. (2020), ‘Noncausal vector AR processes with application to economic time series’, Journal of Econometrics 216(1), 246–267.
- de Truchis et al. (2025) de Truchis, G., Fries, S. & Thomas, A. (2025), ‘Forecasting extreme trajectories using seminorm representations’. Working paper.
- de Truchis & Thomas (2025) de Truchis, G. & Thomas, A. (2025), Laurent series expansion for MA() representation of mixed causal-noncausal autoregressive processes, LEDa Working Paper WP 2025-06, LEDa, Université Paris Dauphine, PSL.
- Dey et al. (2022) Dey, B., Newman, J. A., Andrews, B. H., Izbicki, R., Lee, A. B., Zhao, D., Rau, M. M. & Malz, A. I. (2022), ‘Re-calibrating photometric redshift probability distributions using feature-space regression’, Fourth Workshop on Machine Learning and Physical Sciences (NeurIPS 2021) .
- Diebold & Mariano (1995) Diebold, F. X. & Mariano, R. S. (1995), ‘Comparing predictive accuracy’, Journal of Business & Economic Statistics 13(3), 253–263.
- Fries (2022) Fries, S. (2022), ‘Conditional moments of noncausal alpha-stable processes and the prediction of bubble crash odds’, Journal of Business & Economic Statistics 40(4), 1596–1616.
- Fries & Zakoian (2019) Fries, S. & Zakoian, J.-M. (2019), ‘Mixed causal-noncausal AR processes and the modelling of explosive bubbles’, Econometric Theory 35(6), 1234–1270.
- Giancaterini et al. (2025) Giancaterini, F., Hecq, A., Jasiak, J. & Manafi Neyazi, A. (2025), ‘Bubble detection with application to green bubbles: A noncausal approach’.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y. & Courville, A. (2016), Deep Learning, MIT Press.
- Gourieroux & Jasiak (2016) Gourieroux, C. & Jasiak, J. (2016), ‘Filtering, prediction and simulation methods for noncausal processes’, Journal of Time Series Analysis 37(3), 405–430.
- Gourieroux & Jasiak (2018) Gourieroux, C. & Jasiak, J. (2018), ‘Misspecification of noncausal order in autoregressive processes’, Journal of Econometrics 205(1), 226–248.
- Gourieroux & Jasiak (2023) Gourieroux, C. & Jasiak, J. (2023), ‘Generalized covariance estimator’, Journal of Business & Economic Statistics 41(4), 1315–1327.
- Gourieroux & Jasiak (2025) Gourieroux, C. & Jasiak, J. (2025), ‘Nonlinear fore(back)casting and innovation filtering for causal-noncausal VAR models’.
- Gourieroux et al. (2020) Gourieroux, C., Jasiak, J. & Monfort, A. (2020), ‘Stationary bubble equilibria in rational expectation models’, Journal of Econometrics 218(2), 714–735.
- Gourieroux et al. (2025) Gourieroux, C., Lu, Y. & Robert, C.-Y. (2025), ‘The causal-noncausal tail processes’.
- Gouriéroux & Monfort (2025) Gouriéroux, C. & Monfort, A. (2025), Affine feedforward stochastic (afs) neural network, TSE Working Paper 25-1666, Toulouse School of Economics (TSE).
- Gourieroux & Zakoian (2017) Gourieroux, C. & Zakoian, J.-M. (2017), ‘Local explosion modelling by non-causal process’, Journal of the Royal Statistical Society Series B 79(3), 737–756.
- Guillaumin & Efremova (2024) Guillaumin, A. P. & Efremova, N. (2024), ‘Tukey g-and-h neural network regression for non-gaussian data’.
- Hansen et al. (2011) Hansen, P. R., Lunde, A. & Nason, J. M. (2011), ‘The model confidence set’, Econometrica 79(2), 453–497.
- He et al. (2015) He, K., Zhang, X., Ren, S. & Sun, J. (2015), Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in ‘2015 IEEE International Conference on Computer Vision (ICCV)’, pp. 1026–1034.
- Hecq & Velasquez-Gaviria (2025) Hecq, A. & Velasquez-Gaviria, D. (2025), ‘Non-causal and non-invertible ARMA models: Identification, estimation and application in equity portfolios’, Journal of Time Series Analysis 46(2), 325–352.
- Hecq & Voisin (2021) Hecq, A. & Voisin, E. (2021), ‘Forecasting bubbles with mixed causal-noncausal autoregressive models’, Econometrics and Statistics 20, 29–45.
- Izbicki (2017) Izbicki, R. (2017), ‘Converting high-dimensional regression to high-dimensional conditional density estimation’, Electronic Journal of Statistics 11(2), 2800–2831.
- Jasiak & Neyazi (2025) Jasiak, J. & Neyazi, A. M. (2025), ‘Gcov-based portmanteau test’.
- Kingma & Ba (2015) Kingma, D. P. & Ba, J. (2015), Adam: A method for stochastic optimization, in ‘International Conference on Learning Representations (ICLR)’.
- Lanne et al. (2012) Lanne, M., Luoto, J. & Saikkonen, P. (2012), ‘Optimal forecasting of noncausal autoregressive time series’, International Journal of Forecasting 28(3), 623–631.
- Lanne & Saikkonen (2011) Lanne, M. & Saikkonen, P. (2011), ‘Noncausal autoregressions for economic time series’, Journal of Time Series Econometrics 3(3).
- Medeiros et al. (2021) Medeiros, M. C., Vasconcelos, G. F. R., Veiga, Á. & Zilberman, E. (2021), ‘Forecasting inflation in a data-rich environment: The benefits of machine learning methods’, Journal of Business & Economic Statistics 39(1), 98–119.
- Nolan (2020) Nolan, J. P. (2020), Univariate Stable Distributions: Models for Heavy Tailed Data, Springer Series in Operations Research and Financial Engineering, Springer, Cham.
- Nyberg & Saikkonen (2014) Nyberg, H. & Saikkonen, P. (2014), ‘Forecasting with a noncausal VAR model’, Computational Statistics & Data Analysis 76, 536–555.
- Rosenblatt (1969) Rosenblatt, M. (1969), Conditional probability density and regression estimators, in P. R. Krishnaiah, ed., ‘Multivariate Analysis II’, Academic Press, pp. 25–31.
- Rosenblatt (2000) Rosenblatt, M. (2000), Gaussian and Non-Gaussian Linear Time Series and Random Fields, Springer.
- Rothfuss et al. (2020) Rothfuss, J., Ferreira, F., Boehm, S., Walther, S., Ulrich, M., Asfour, T. & Krause, A. (2020), ‘Noise regularization for conditional density estimation’.
- Rothfuss et al. (2019) Rothfuss, J., Ferreira, F., Walther, S. & Ulrich, M. (2019), ‘Conditional density estimation with neural networks: Best practices and benchmarks’.
- Saïdi (2023) Saïdi, S. (2023), Noncausal Models: Unit Root Tests and Forecasting, PhD thesis, CY Cergy Paris University.
- Steininger et al. (2021) Steininger, M., Kobs, K., Davidson, P., Krause, A. & Hotho, A. (2021), ‘Density-based weighting for imbalanced regression’, Machine Learning 110(9-10), 2187–2211.