JANET: Joint Adaptive predictioN-region Estimation for Time-series

English, Eshant; Wong-Toi, Eliot; Fontana, Matteo; Mandt, Stephan; Smyth, Padhraic; Lippert, Christoph

doi:10.1007/s10994-025-06812-2

JANET: Joint Adaptive predictioN-region Estimation for Time-series

Open access
Published: 23 June 2025

Volume 114, article number 177, (2025)
Cite this article

You have full access to this open access article

Download PDF

Save article

View saved research

Machine Learning Aims and scope Submit manuscript

JANET: Joint Adaptive predictioN-region Estimation for Time-series

Download PDF

Eshant English^1,2,
Eliot Wong-Toi⁵,
Matteo Fontana³,
Stephan Mandt^2,5,
Padhraic Smyth^2,5 &
…
Christoph Lippert^1,4

2526 Accesses
1 Citation
1 Altmetric
Explore all metrics

Abstract

Conformal prediction provides machine learning models with prediction sets that offer theoretical guarantees, but the underlying assumption of exchangeability limits its applicability to time series data. Furthermore, existing approaches struggle to handle multi-step ahead prediction tasks, where uncertainty estimates across multiple future time points are crucial. We propose JANET (Joint Adaptive predictioN-region Estimation for Time-series), a novel framework for constructing conformal prediction regions that are valid for both univariate and multivariate time series. JANET generalises the inductive conformal framework and efficiently produces joint prediction regions with controlled K-familywise error rates, enabling flexible adaptation to specific application needs. Our empirical evaluation demonstrates JANET’s superior performance in multi-step prediction tasks across diverse time series datasets, highlighting its potential for reliable and interpretable uncertainty quantification in sequential data.

Distribution-free conformal joint prediction regions for neural marked temporal point processes

Article 23 July 2024

Conformal multistep-ahead multivariate time-series forecasting

Article Open access 09 June 2025

Towards Synthetic Multivariate Time Series Generation for Flare Forecasting

1 Introduction

In this work, we tackle the challenging problem of multi-step uncertainty quantification for time series prediction. Our goal is to construct joint prediction regions (JPRs), a generalisation of prediction intervals to sequences of future values. The naive approach of taking the Cartesian product of marginal prediction intervals, each with the desired coverage level $( 1 - \epsilon )$, does not guarantee the desired global coverage. If future time steps were independent, this approach would result in coverage $(1-\epsilon )^H$, where H is the length of the horizon. Although techniques such as Bonferroni correction (Dunn, 1961) can adjust for this, it becomes overly conservative as H increases, especially in the presence of temporal dependencies, which, of course, is to be expected in the case of time series data.

Existing approaches to uncertainty quantification in time series, such as bootstrap-based prediction intervals or Bayesian forecasting techniques (e.g., dropout Gal and Ghahramani (2016) or variational inference), each come with key limitations. Bootstrap methods offer only asymptotic coverage guarantees and often break down under small-sample or dependent data settings. Bayesian methods, on the other hand, require the specification of priors and frequently produce miscalibrated intervals when the model is misspecified.

Unlike these approaches, Conformal Prediction (CP) (Angelopoulos and Bates, 2022; Vovk et al., 2005; Shafer and Vovk, 2007) offers a distribution-free frequentist methodology for uncertainty quantification with finite-sample coverage guarantees. Specifically, CP ensures that $\text {Pr}(Y \in \mathcal {C}(X^*)) \ge 1 - \epsilon$, where $\mathcal {C}(X^*)$ is the prediction set (or interval in the univariate case) for a new input $X^*$. The theoretical guarantees and model-agnostic nature of CP have spurred its application in diverse areas, including large language models (Quach et al., 2023) and online model aggregation (Gasparin and Ramdas, 2024). CP is readily applicable to any algorithm that provides a (non-)conformity score, a measure of how well a data point aligns with the rest of the dataset. The key assumption is exchangeability, which is satisfied for independent and identically distributed (IID) data. Extensions of CP exist for dependent data settings (Chernozhukov et al., 2018; Barber et al., 2023; Xu and Xie, 2023), but primarily focus on single-step predictions, and the extension to multistep prediction is not trivial.

CP offers finite-sample guarantees, making it an attractive approach to building joint prediction regions (JPRs). However, the exchangeability assumption of CP is always violated in time series data. Moreover, the multivariate residuals (H in total — one residual per time step) in a multi-step ahead prediction pose a challenge, as for most regression cases, CP relies on the exchangeability of the scalar non-conformity scores and the quantile inversion for calibration. This poses the question of mapping multivariate residuals to a scalar value. Whilst single-step prediction allows for simple residual sorting and the necessary quantile inversion, this approach fails for multi-step scenarios due to the multidimensionality of the residuals. We address the exchangeability issue in the inductive conformal setting by showing an extension of the work of Chernozhukov et al. (2018) from the transductive (full) to the inductive (split) CP setting. Full CP requires fitting many models, whilst inductive CP only requires fitting a single model. The approach of Chernozhukov et al. (2018) leverages specific index permutations of a single time series, providing a distribution of residuals for each permuted time series. We further introduce a novel non-conformity score designed to map multi-dimensional residuals to univariate quantities, enabling the construction of valid JPRs. The resulting framework, JANET (Joint Adaptive predictioN-region Estimation for Time-series), allows for inductive conformal prediction of JPRs in both multiple and single time series settings. When applied to multiple independent time series, JANET guarantees validity as the permutation across independent time series is exchangeable, whilst, for single time series, it provides approximate validity under weak assumptions on the non-conformity scores as long as transformations (permutations within a single series) of the data are a meaningful approximation for a stationary series.

Our key contributions are: (i) formally generalising the framework in Chernozhukov et al. (2018) to the inductive setting for computational efficiency whilst maintaining approximate coverage guarantees; (ii) design of non-conformity measures that effectively control the K-familywise error (a generalisation of the familywise error) (Lehmann and Romano, 2005; Wolf and Wunderli, 2013), whilst accounting for time horizon and historical context; and (iii) empirical demonstration of JANET’s (finite-sample) coverage guarantees, computational efficiency, and adaptability to diverse time series scenarios.

2 Background on conformal prediction

In this section, we provide an overview of conformal prediction for IID data in the context of regression tasks. In the next section, we specify CP for the time series setting.

Full conformal prediction constructs prediction sets by evaluating how well a hypothesised label $y$ for a new input $X^*$ conforms with the training data. For each candidate value $y$ from a discretised grid over the target space $\mathcal {Y}$, the method temporarily augments the training set with the test pair $(X^*, y)$, forming a dataset of size $n + 1$. A model is then retrained on this augmented dataset to compute a non-conformity score for each of the $n + 1$ points, including the test point. This process is repeated across all $c$ grid points, resulting in $(n + 1) \times c$ model fits. For each $y$, a p-value is computed by ranking the test point’s score among those of the training samples under the exchangeability assumption. The prediction set includes all $y \in \mathcal {Y}$ such that the corresponding p-value exceeds the significance level $\epsilon$; this inversion of the conformity scores yields a valid $(1 - \epsilon )$ prediction set. Full CP is prohibitive for compute-intensive underlying training algorithms as it requires $(n + 1) \times c$ model fits. Inductive conformal predictors (ICPs), explained below, offer an elegant solution to this problem by training the model only once. We focus on regression tasks with ICPs.

Let $\mathbb {P}$ be the data generating process for the sequence of n training examples $Z_1, \dots , Z_n$, where for each $Z_i=(X_i, Y_i)$ pair $X \in \mathcal {X}$ is the sample and $Y \in \mathcal {Y}$ is the target value. We partition the sequence into a proper training set, $Z_{tr} = \{Z_1, \dots , Z_{n_{tr}}\}$ (the first $n_{tr}$ elements), and the remaining $n_{cal}$ elements form a calibration set, $Z_{cal} = \{Z_{n_{tr}+1}, \dots , Z_{n_{tr} + n_{cal}}\}$, such that $n = n_{tr} + n_{cal}$. A point prediction model, $\hat{f}$, is trained on the training set proper $Z_{tr}$, and non-conformity scores, $a_i$, are computed for each element of the calibration set, $Z_{cal}$.

In standard regression, a natural non-conformity score is the absolute residual $a_i = |y_i-\hat{f}(x_i)|$. By construction, for any i, $y_i \in [\hat{f}(x_i)\pm a_i]$; $a_i$ is half the interval width that ensures coverage for any new sample (assuming symmetric intervals). In ICP we compute a non-conformity score for each of the $n_{cal}$ samples in the calibration set and then sort them from largest to smallest. Let $a_{(1)}$ denote the largest and $a_{(n_{cal})}$ denote the smallest. Then intervals of the form $(\hat{f}(x_i) \pm a_{(1)})$ will cover all but one sample from the calibration set and intervals of the form $(\hat{f}(x_i) \pm a_{(n_{cal})})$ will cover none of the samples from the calibration set (assuming no ties).

Extending this line of thinking, a prediction interval with $(1-\epsilon )$ coverage can be obtained by inverting the quantile of the distribution of non-conformity scores $a_i$. To do so we find the $\lfloor \epsilon (n_{cal} + 1)\rfloor ^{th}$ largest non-conformity score, $q^a_{1-\epsilon }:=a_{(\lfloor \epsilon (n_{cal} + 1\rfloor )}$ and set this to half of our interval width. For an unseen $Z^* = (X^*, Y^*)$, we will provide a prediction interval of the form

$$\begin{aligned} \mathcal {C}^a_{1-\epsilon }(X^*) = \left( \hat{f}(X^*) - q^a_{1-\epsilon },\hat{f}(X^*) + q^a_{1-\epsilon }\right) . \end{aligned}$$

(1)

A drawback of this method is that the intervals are of constant width, $2 q^a_{1-\epsilon }$. Lei et al. (2018) suggest an alternative non-conformity score that takes into account local information, $r_i = \frac{|y_i - \hat{f}(x_i)|}{\hat{s}(x_i)}$. Here, $\hat{s}(\cdot )$ is also fit on the train set and predicts the conditional mean absolute deviation. Now we can provide locally adaptive prediction intervals for a test sample $Z^* = (X^*, Y^*)$ with $(1-\epsilon )$ coverage:

$$\begin{aligned} \mathcal {C}^r_{1-\epsilon }(X^*) = \left( \hat{f}(X^*)-q^r_{1-\epsilon }\cdot \hat{s}(X^*), \hat{f}(X^*)+q^r_{1-\epsilon }\cdot \hat{s}(X^*)\right) \end{aligned}$$

(2)

where $q^r_{1-\epsilon }$ is the $\lfloor \epsilon (n_{cal} + 1)\rfloor ^{th}$ largest non-conformity score. In Eq. (1), the $a_i$ (and $q_{1-\epsilon }^a$) directly define the width of the interval whilst in Eq. (2), the $r_i$ (and $q_{1-\epsilon }^r$) inform how much to rescale the conditional mean absolute deviation to achieve coverage of a particular level.

3 Generalised inductive conformal predictors

In this section, we formally describe a generalised framework for inductive conformal predictors based on Generalised CP (Chernozhukov et al., 2018) and randomisation inference (Romano, 1990) applied to time series forecasting. With minor notational adjustments, our generalised ICP extends to multi-step and multivariate time series prediction. In the following sections, we demonstrate how to obtain exact validity guarantees for independent time series (or IID data) and approximate validity guarantees for a single time series. Our task is to forecast H steps into the future conditioned on T steps of history.

3.1 Multiple time series

Assume we have n independent time series $\textbf{Z} = \{Z_{k}\}_{k=1}^n$ where each individual time series, $Z_{k}$, is an independent realisation from an underlying distribution $\mathbb {P}$ (and within each $Z_{k}$ there is temporal dependence). As usual in ICP, we split $\textbf{Z}$ into a proper training sequence $\textbf{Z}_{tr} = \{Z_k\}_{k=1}^{n_{tr}}$ and a calibration sequence $\textbf{Z}_{cal} = \{Z_{n_{tr}+i}\}_{i=1}^{n_{cal}}$. For notational convenience, we assume the time series are of length $T+H$. For each time series, we can define $Z_k = \{z_{k,1}, z_{k,2}, \dots , z_{k,T+H}\}$ as $(X_k, Y_k)$, each with $X_k = \{z_{k,1}, \dots z_{k, T}\} = \{x_{k,1}, \dots x_{k, T}\}$ denoting the relevant series’ history, $Y_k=\{z_{k, T+1}, \dots , z_{k, T+H}\} = \{y_{k,1}, \dots , y_{k, H}\}$ being the values at the next H time steps and each $z_{k, j}\in \mathbb {N}^p$ (where $p = 1$ corresponds to univariate time series and $p > 1$ corresponds to multivariate time series). In other words, each time series can be split into the history we use to predict ($X_k$) and the target ($Y_k$). As in any other ICP setting, we can train a model $\hat{f}: \mathbb {N}^{p\times T}\rightarrow \mathbb {N}^{p\times H}$ that predicts H steps into the future based on T steps of history. Then, we can compute non-conformity scores for a non-conformity scoring function $\mathcal {A}$. Note that we can form a distribution over non-conformity scores with one non-conformity score per time series. From there we can invert the quantiles and produce a prediction interval.

3.2 Single time series

Unlike the case of multiple time series, we only have a single time series and we do not have access to an entire distribution of non-conformity scores—we only have a single score. To address this problem we apply a permutation scheme from Chernozhukov et al. (2018) on the calibration series that provides a distribution over the conformity scores, this is equivalent to the view of randomisation inference employed in (Chernozhukov et al., 2018).

Permutation scheme

Chernozhukov et al. (2018) introduced a permutation scheme to address data dependence. We adapt this scheme and their notation to our setting on $Z_{cal}$. Given a time calibration sequence of length $L_{cal}$, divisible (for simplicity) by a block size b, we divide the calibration series into $d=L_{cal}/b$ non-overlapping blocks of b observations each. The $j^{th}$ non-overlapping block (NOB) permutation is defined by the permutation $\Pi _{j, NOB}: \{1, \dots , L_{cal}\} \rightarrow \{1, \dots , L_{cal}\}$ where

$$\begin{aligned} i\rightarrow \pi _{j, NOB}(i)&= {\left\{ \begin{array}{ll} i + (j-1)b & \text { if } 1 \le i \le L_{cal}-(j-1)b\\ i+(j-1)b-l & \text { if } L_{cal}-(j-1)b + 1 \le i \le L_{cal} \end{array}\right. } \quad \text { for } i = 1, \dots , L_{cal}. \end{aligned}$$

Figure 1 provides a visualisation of the NOB permutation scheme to a hypothetical time series Z with $L = L_{tr}+L_{cal} = 10 + 6 = 16$ and $b=1$.

Now we can treat this collection of permuted time series in the same way we did for multiple time series. Each permutation will provide a non-conformity score and this is how we approximate a distribution of non-conformity scores.

Formally, we now assume we have a single time series $Z = \{z_1, \dots z_{L}\}$ where $L = L_{tr} + L_{cal}$ is the length of the entire single series, $L_{tr}, L_{cal} \ge T+H$ and all $z_{i} \in \mathbb {N}^p$. We split Z into a train subseries, $Z_{tr} = \{z_1, \dots , z_{L_{tr}}\}$, and a calibration subseries, $Z_{cal} = \{z_{L_{tr} + 1}, \dots , z_{L_{tr} + L_{cal}}\}$. $Z_{tr}$ is used to fit our point prediction model $\hat{f}$ and $Z_{cal}$ will be used to calibrate our prediction intervals.

Let $\Pi$ be a set of permutations of the indices $\{1, \dots , L_{cal}\}$. For a permutation $\pi \in \Pi$, let $Z^\pi _{cal} = \{z_{L_{tr} \pi (i)}\}_{i=1}^{L_{cal}}$ denote a permuted version of the calibration sequence, $Z_{cal}$. We call $\textbf{Z}^\Pi _{cal} = \{Z^\pi _{cal}\}_{\pi \in \Pi }$ the set of permuted version of $Z_{cal}$ under the permutations in $\Pi$. We can partition each permutation of $Z_{cal}$ into X and Y components as in the multiple time series setup where X is the lagged version of Y. We compute the p-values as in Definition 1.

Definition 1

(Randomised p-value)

We define the randomised p-value for the postulated label y as:

$$\begin{aligned} {\hat{p}} = {\hat{p}}(y) := \frac{1}{d} \sum _{j=1}^d \textbf{1}\left( \mathcal {A}\left( Z_{tr}, Z_{cal}^ {\pi _j},y \right) \ge \mathcal {A}\left( Z_{tr}, Z_{cal},y\right) \right) \end{aligned}$$

where $\Pi$ is a group of permutations of size d and $\mathcal {A}$ is a non-conformity measure.

3.3 Validity of ICP with permutations

Theorem 1 states that under mild assumptions on ergodicity and small prediction errors, as defined in A.2.1, we can achieve approximate validity in the case of a single time series whilst using the permutation scheme from Chernozhukov et al. (2018). In the case of IID data, we have exact validity guarantees shown in Theorem 2 and this exact validity can be applicable in the case of multiple independent time series.

Effectively we are computing an empirical quantile of our test sample’s non-conformity relative to the calibration set. Let

$$\begin{aligned} \alpha _j := \mathcal {A}(Z_{tr}, Z_{cal}^{\pi _j}) \text { for } j=1, \dots , d \end{aligned}$$

where $\pi _j$ is the $j^{th}$ permutation of $\Pi$. We can invert the quantile to gain a prediction interval as in Eq. (1). Theorem 1 is adapted from Chernozhukov et al. (2018) to the inductive conformal setting.

Theorem 1

(Approximate General Validity of Inductive Conformal Inference) Under mild assumptions on ergodicity and small prediction errors (see Appendix A.2.1), for any $\epsilon \in (0, 1)$, the approximate conformal p-value is approximately distributed as follows:

$$\begin{aligned} \left| \text {Pr}({\hat{p}} \le \epsilon ) - \epsilon \right| \le 6\delta _{1d} + 4 \delta _{2m} + 2 D\;\left( \delta _{2m} + 2\sqrt{\delta _{2m}}\right) + \gamma _{1d} + \gamma _{2m} \end{aligned}$$

(3)

for any $\epsilon \in (0, 1)$ and the corresponding conformal set has an approximate coverage $1-\epsilon$, i.e,

$$\begin{aligned} \left| \text {Pr}(y^* \in \mathcal {C}_{1-\epsilon }) - (1-\epsilon ))\right| \le 6\delta _{1d} + 4\delta _{2m} + 2 D\left( \delta _{2m} + 2\sqrt{\delta _{2m}}\right) + \gamma _{1d} + \gamma _{2m}. \end{aligned}$$

(4)

Here, the terms $\delta _{1d}$ and $\gamma _{1d}$ correspond to the approximation error and failure probability in the empirical distribution of permutation-based conformity scores under weak ergodicity assumptions. The terms $\delta _{2\,m}$ and $\gamma _{2\,m}$ reflect the estimation error and failure probability when approximating the oracle non-conformity score using finite training data. Together, these bounds quantify the deviation from exact validity, and all vanish as $d, m \rightarrow \infty$. Full assumptions, definitions and the proof are provided in Appendix A.2.

Remark:Chernozhukov et al. (2018) demonstrated that conformity scores obtained via the permutations (Fig. 1) offer a valid approximation to the true conformity score distribution for strongly mixing time series. Notably, stationary processes like Harris-recurrent Markov chains and autoregressive moving average (ARMA) models exhibit strong mixing properties Athreya and Pantula (1986b, 1986a). Statistical tests designed for ARMA models, such as the Ljung-Box or KPSS tests, can be employed to assess the presence of strongly mixing. Our empirical findings suggest that even when stationarity is violated, approximate coverage can still be achieved (see Table 1).

4 JANET: Joint Adaptive predictioN-region Estimation for Time-series

A joint prediction region (JPR) aims to capture the entire multi-step prediction sequence $Y = (y_{T+1}, \dots , y_{T+H})$ with high probability. Traditionally, this is formalised by controlling the familywise error rate (FWER), which bounds the probability that any (h) of the $H$ components fall outside their respective prediction intervals ($C_{T+h}$):

$$\begin{aligned} \text {FWER} := \text {Pr}\left( \exists \, h \in \{1, \dots , H\} \text { such that } y_{T+h} \notin C_{T+h}\right) . \end{aligned}$$

Equivalently, FWER corresponds to a joint coverage probability of at least $1 - \epsilon$, where $\epsilon$ is the user-specified significance level. However, as the prediction horizon $H$ increases, FWER control becomes increasingly conservative, often producing excessively large prediction regions that reduce practical utility.

$$\begin{aligned} K\text {-FWER} := \text {Pr}\left( \left| \{ h : y_{y+h} \notin C_{T+h} \} \right| \ge K\right) . \end{aligned}$$

When $K = 1$, the definition recovers standard FWER. By allowing controlled violations, K-FWER enables tighter prediction regions that are more informative and practically actionable. For example, when $H = 24$, allowing up to $K - 1 = 2$ errors may be acceptable in high-uncertainty applications like macroeconomic forecasting. As such, K-FWER offers a more flexible and tunable approach to uncertainty quantification, especially in domains where small deviations can be tolerated in exchange for sharper predictions.

Remark. Joint prediction regions (JPRs) can be constructed in various geometric forms. A common approach is to use a hyperspherical region in $\mathbb {R}^H$, which may offer statistical or computational advantages. However, such regions are not decomposable across time steps, making it difficult to assess whether specific components of a forecast sequence lie within the region.

In contrast, we adopt a rectangular construction, where the joint region at $\epsilon$, the significance level (or $K-$FWER), $C_{1-\epsilon } = \prod _{h=1}^H C_{T+h, \epsilon }$ consists of marginal intervals along each forecast horizon. This structure offers a key interpretability advantage: one can track the realised sequence in real time and immediately identify whether a prediction lies outside the region at any step. For instance, if the forecast for day $T+h$ falls outside $C_{T+h}$, a decision-maker can already reason that the region may be invalidated, especially if $K$ or more such violations occur. This enables sequential, component-wise decision-making—a property not afforded by hyperspherical regions.

While it is possible to project a hypersphere onto a hyperrectangle to enable component-wise analysis, this leads to a looser region and reduced predictive efficiency (Diquigiovanni et al., 2021; Wolf and Wunderli, 2013). As such, rectangular regions are better suited to applications requiring transparent, step-level risk assessment.

Now, we introduce JANET (Joint Adaptive predictioN-region Estimation for Time-series) to control K-FWER in multi-step time series prediction.

We adopt the notation for the multiple time series setting and further assume that the time series are univariate (i.e. $p=1$). In the case of a single time series, we use the same non-conformity scores but treat the permutations of the single time series as distinct, exchangeable time series. We propose two non-conformity measures that are extensions of a locally adaptive non-conformity score from Lei et al. (2018)

$$\begin{aligned} \alpha _i^{K}&:= K\text {-}\max \left\{ \frac{|y_{i,1} - \hat{y}_{i,1}|}{\hat{\sigma }_{1}}, \dots , \frac{|y_{i,H} - \hat{y}_{i,H}|}{\hat{\sigma }_{H}} \right\} \text { and } \end{aligned}$$

(5)

$$\begin{aligned} R_i^{K}&:= K\text {-}\max \left\{ \frac{|y_{i,1} - \hat{y}_{i,1}|}{\hat{\sigma }_{1}(X_{i})}, \dots , \frac{|y_{i,H} - \hat{y}_{i,H}|}{\hat{\sigma }_{H}(X_i)} \right\} \end{aligned}$$

(6)

for $i \in \{1, \dots , n_{cal}\}$ where, $K\text {-}\max (\textbf{x})$ is the $K^{th}$ largest element of a sequence $\textbf{x}$, $y_{i,j}$ is the $j^{th}$ entry of the target $Y_i$ and $\hat{y}_{i,j}$ is the $j^{th}$ entry of $\hat{f}(X_i)$. Note that the prediction steps can be generated by any model ${\hat{f}}$ (AR models with $H=1$ or multi-output models with $H>1$). The difference between Eqs. (5) and (6) is only in the scaling factors (denominators). In Eq. (5) the scaling factors $\hat{\sigma }_{1}, \dots , \hat{\sigma }_{H}$ are standard deviations of the error, that are computed on the proper training sequence. These scaling factors account for differing levels of variability and magnitude of errors across the prediction horizon but do not depend on the history. Meanwhile, for Eq. (6) the scaling factors are conditional on the relevant history X.

We call these functions $\hat{\sigma }_{1}(\cdot ), \dots , \hat{\sigma }_{H}(\cdot )$ and they are also fit on the training sequence. Even with IID errors, the predictor may have higher errors for certain history patterns. The history-adaptive conformity score penalises these residuals and aims to deliver uniform miscoverage over the prediction horizon.

Note: In the multivariate case (that is, $z\in \mathbb {N}^p, \; p > 1$), the entries of Eqs. (5) and (6), whose entries are $\left[ \frac{|y_{i,j} - \hat{y}_{i,j}|}{\hat{\sigma }_{j}(X_{i})}\right] _{j}$ for $j \in \{1, \dots , p\}$. We then take the $K\text {-}\max$, across all p dimensions, and all H steps in the time horizon.

The quantile $q^\alpha _{1-\epsilon }$ can be found by inverting the conformity scores. Then a JPR of the desired significance $\epsilon$ and tolerance K for a test sample $Z^*=(X^*, Y^*)$ (where $Y^*$ is unknown) can be constructed as follows:

$$\begin{aligned} \mathcal {C}_{1-\epsilon }^{\alpha }(X^*) = \left( \hat{y}_{1} \pm q^{\alpha }_{1-\epsilon }\cdot \hat{\sigma }_{1}\right) \times \dots \times \left( \hat{y}_{H} \pm q^{\alpha }_{1-\epsilon }\cdot \hat{\sigma }_{H}\right) \end{aligned}$$

(7)

whereas, the JPR for the second score that incorporates historical context are:

$$\begin{aligned} \mathcal {C}_{1-\epsilon }^{R}(X^*) = \left( \hat{y}_{1} \pm q^{R}_{1-\epsilon }\cdot \hat{\sigma }_{1}(X^*)\right) \times \dots \times \left( \hat{y}_{H} \pm q^R_{1-\epsilon }\cdot \hat{\sigma }_{H}(X^*)\right) \end{aligned}$$

(8)

where $X^*$ is the sequence of lagged values (i.e. the history) for the unknown $Y^*$ that is to be predicted. The operation $K\text {-}\max$ maps the multidimensional residuals to a singular score that allows quantile inversion as done in Eq. (1). Further, by taking the $K\text {-}\max$, each $\alpha ^K_i$ (or $R^K_i$) is the value to rescale each $\hat{\sigma }_j$ to provide intervals that cover all but K of the predicted time points for a specific trajectory $Z_i \in \textbf{Z}_{cal}$.

We provide two methods for producing JPRs, JANET^* and JANET, and describe how to construct a JANET JPR in Algorithm 1:

JANET^*: Adapts prediction intervals over time horizon as defined in Eq. (7) and only requires fitting a single model.
JANET: Adapts prediction intervals conditional on the relevant history as defined in Eq. (8).

Remark: The constructed regions are two-sided and symmetric, we discuss the construction of one-sided intervals and asymmetric intervals in Appendix A.1.

5 Experiments

We demonstrate the utility of our method, JANET, on single time series and multiple time series. For the single time series, we compare against Bonferroni and Scheffé JPRs. Further, we compare against bootstrapping methods. Note that for Bonferroni and Scheffé we can only train the models for complete coverage (1-FWE), whilst for Bootstrap-JPR we can vary K (as with our method). Despite matching this level of flexibility in the choice of K, Bootstrap-JPR is much more computationally intensive than JANET. For the multiple time series, we compare against baselines (CopulaCPTS (Sun and Yu, 2024), CF-RNN (Stankeviciute et al., 2021), CAFHT (Zhou et al., 2024), CCN-JPR (English and Lippert, 2024), and MC-Dropout (Gal and Ghahramani, 2016)) that make stronger assumptions (different series being independent) than us on synthetic datasets as well as real-world data. Our generalised ICP method can lead to greater data efficiency whilst creating approximately valid prediction sets should independence be violated.

5.1 Single time series experiments

We now focus on the scenario where only a single time series is available, and the goal is to construct a JPR for horizon H. Common approaches, such as bootstrapping, are used to estimate prediction errors and subsequently compute JPRs. However, these methods suffer from two limitations: (i) bootstrap guarantees often hold only asymptotically, not for finite samples, and (ii) bootstrapping can be computationally expensive, particularly when using neural networks as function approximations (or predictors). We compare JANET to the following baselines:

Bonferroni Correction (Dunn, 1961; Haynes, 2013): This approach controls the FWER, but it is conservative. We use bootstrapping-based methods to find the standard deviation of the prediction errors before applying the correction.
Scheffé-JPR (Staszewska-Bystrova, 2011): This statistical method assumes normality of errors but may not hold for prediction intervals.
Bootstrap-JPR (Wolf and Wunderli, 2013; English, 2024): This method is based on bootstrapping and lacks finite-sample guarantees and can be computationally demanding.

We use an ARIMA model as the learner for all methods in this section. Due to computational constraints over numerous simulations, we do not use neural networks for the main coverage experiments. Both Bootstrap-JPR and our proposed method, JANET, control the K-FWER (Lehmann and Romano, 2005), so we compare for different tolerance levels, K. We want to point out that Bootstrap-JPR can be conformalised in a naive way. Whilst conformalised bootstrapping can address finite-sample issues, it does not reduce computational cost. In contrast, JANET requires training the model only once per simulation. See Appendix B for a detailed analysis of the computation costs. We compare both of our variants: JANET and JANET^*.

5.1.1 Monte Carlo simulations

We generate data from an AR(2) process with $\rho \in \{1.25, -0.75\}$ and evaluate empirical coverage across 1000 simulations. We compute the JPRs for $K \in \{1,2,3\}$, $H \in \{6,12,18,24\}$ and significance level, $\epsilon \in \{0.7,0.8, 0.9\}$. For methods that cannot control the K-FWER, the corresponding results refer to $K = 1$. The interval width of one JPR is calculated as the geometric mean of the widths over the horizon H. The average over all simulations per setting is reported in Tables 11, 12 and 13.

Tables 8, 9 and 10 in the Appendix present coverage results for $\epsilon$ and varying tolerances K, whilst Figs. 5, 6 and 7 in the Appendix display coverage errors as bar plots. As expected, Bonferroni is conservative, particularly at larger $\epsilon$. Scheffé-JPR undercovers substantially, likely due to the normality assumption. Bootstrap-JPR and JANET$^{*}$ perform comparably. JANET often has smaller coverage errors but slightly larger average widths, as it treats errors uniformly across the history covariates. Additionally, Tables 11, 12 and 13 in the Appendix present empirical width results for various significance levels $\epsilon$ and varying tolerances K and Figs. 8, 9 and 10in the Appendix plot empirical widths against forecast horizon (H) for various significance levels $\epsilon$ and $K=1$. ure 2 compares coverage errors against widths, with the ideal method being close to zero error with minimal width within each colour group (representing the same tolerance K). Note that within a level of K, the points tend to cluster together. (refer to Fig. 4 in the Appendix for all different K and all significance levels $\alpha$). The left plot ($K=1$) includes more points per cluster as Bonferroni and Scheffé-JPR control for this tolerance level only. The analysis also shows that JANET* generally achieves good coverage with minimal width.

5.1.2 U.S. real gross domestic product (GDP) dataset

We evaluate JANET on the U.S. real gross domestic product (GDP) dataset (U.S. Bureau of Economic Analysis, 1947). To address non-stationarity, we follow the preprocessing procedure used in the Bootstrap-JPR (Wolf and Wunderli, 2013): a log transformation followed by first-order differencing to remove trends. The resulting log-growth series is shown in Fig. 3.

Our task is to forecast the next $H=4$ quarters (equivalent to one year) and construct a JPR for the true sequence. We set the significance level to $\epsilon = 0.2$. Due to the limited availability of real-world data, we create windowed datasets to increase the number of datasets for evaluating coverage. Each window consists of a sequence of 48 quarters (12 years) for JPR computation, followed by a true sequence of 4 quarters (1 year) for coverage assessment. It should be noted that this method for computing empirical coverage has two deficiencies, (i) there are only 100 series (created through windowing); (ii) the series are not independent of each other. Nevertheless, it still provides an assessment of the out-of-sample performance of the method. Table 1 shows the coverage results of the data. Similar to the experiment in Monte Carlo simulations in the previous section, the out-of-sample coverages for JANET and Bootstrap-JPR are close to the desired level at all tolerances K, whereas Bonferroni overcovers and Scheffé-JPR undercovers.

Table 1 Empirical out-of-sample coverages on the US GDP data ($\epsilon = 0.2$, target coverage 80%) and training times (minutes), numbers in the parenthesis refer to values of K

Full size table

5.2 Multiple independent time series experiments

In this section, we focus on the setting with multiple independent time series as in (Stankeviciute et al., 2021; Sun and Yu, 2024). As previously noted, this scenario allows us to achieve exact validity guarantees, rather than approximate validity, due to the independence assumption. We compare JANET against the following methods:

CF-RNN (Stankeviciute et al., 2021) is designed for multi-output neural networks–the entire predictive sequence is outputted at once. It assumes conditional IID prediction errors, which may not always be accurate, especially if the trained model has not captured the underlying trend. Further, it relies on the Bonferroni correction (Haynes, 2013), which tends to be conservative.
MC-Dropout (Gal and Ghahramani, 2016) is a Bayesian method for neural networks that can provide prediction intervals that are often overly narrow intervals with poor coverage, indicating overconfidence.
CopulaCPTS (Sun and Yu, 2024) uses copulas to adjust for dependencies. However, it is data-inefficient since it requires two calibration sets and unlike JANET, it cannot adapt its regions based on history or handle different values of K in the K-FWER control problem.
CAFHT (Zhou et al., 2024) uses ACI (Candès et al., 2023) for primarily heterogenous trajectories.. Just like CopulaCPTS, it is also data-inefficient due to additional tuning parameters and cannot adapt its regions based on history or handle different values of K in the K-FWER control problem.
CCN-JPR (English and Lippert, 2024) employ Normalising Flows to prediction regions that do not have any geometrical assumptions. Their appraoch restricts to using generative models with density estimation capabilities as predictors and can not be used with any other model as in our case.

Following the evaluation approach used in the baseline methods, we report the coverage on a hold-out test set.

5.2.1 Particle simulation datasets

We evaluate our model on two synthetic datasets from Kipf et al. (2018) using the same experimental settings as in Sun and Yu (2024). In both cases, we predict $H=24$ steps into the future based on a history of length $T=35$. Each time step is multivariate, i.e., $\mathbb {N}^2$. In the two setups, we add in mean-zero Gaussian noise with $\sigma =0.05$ and $\sigma =0.01$, respectively. We used two different predictors for the experiments. See Appendix D.2 for predictor model and training details. Table 2 shows results for the particle5 experiment $\sigma =0.05$. Under the EncDec predictor, our coverage is closer to the desired level. For RNN architectures, we observe slight under-coverage for JANET and over-coverage for CF-RNN, whilst other methods, especially MC-dropout, exhibit severe under-coverage. In the particle1 experiment (Table 2, $\sigma =0.01$), JANET achieves better coverage under both predictors. Other methods again show significant under-coverage, particularly MC dropout.

5.2.2 UK COVID-19 dataset

We evaluate JANET on the UK COVID-19 dataset with daily case counts from 380 UK regions (Stankeviciute et al., 2021; Sun and Yu, 2024). Table 2 presents the results. The task is to predict daily cases for the next 10 days based on the previous 100 days. Whilst the COVID case sequences from different regions are not independent, we anticipate at least approximate validity using our generalised ICP framework. JANET’s coverage is close to the desired significance level $\epsilon$. CF-RNN shows overcoverage with the basic RNN architecture but performs closer to the desired level with the EncDec architecture.

Table 2 Comparison of coverage (%) and interval widths/areas on the test set for $\epsilon =0.1$

Full size table

6 Related work

Relaxing exchangeability Recent research has shown a growing interest in adapting conformal prediction (CP) to non-exchangeable data. Early work by Vovk et al. (2005) explored relaxing the exchangeability assumption using Mondrian CP, which divides observations into exchangeable groups. Dunn et al. (2022) built upon this idea to share strengths between groups in hierarchical modelling. Tibshirani et al. (2020) and Podkopaev and Ramdas (2021) addressed covariate and label shifts, respectively, by reweighting data based on likelihood ratios. Similarly, Lei and Candès (2021) and Candès et al. (2023) applied reweighting for predictive inference in causal inference and survival analysis, while (Fannjiang et al., 2022) focused on controlling covariate shift by statisticians. However, these methods address heterogeneity rather than the serial dependence found in time series and can not be applied to time series data.

One-step or multivariate predictionGibbs and Candès (2021) tackled distribution shifts in an online manner by adapting coverage levels based on comparisons of current coverage with desired coverage. Zaffran et al. (2022) extended this work with online expert aggregation. Gibbs and Candès (2023) later introduced an expert selection scheme to guide update step sizes. These works typically require a gradient-based approach to learn a model that adapts to the coverage. Barber et al. (2023) generalised pure conformal inference for dependent data, using fixed weights for recent training examples to account for distributional drift. Xu and Xie (2023) employed predictor ensembling, assuming exchangeability but with asymptotic guarantees. Chernozhukov et al. (2018) leveraged randomisation inference (Romano, 1990) to generalise full conformal inference to serial dependence, achieving valid coverage under exchangeability and approximate validity with serial dependence. Notably, these methods primarily focus on single-step predictions for univariate series and can not be adapted. A similar extension to the inductive setting is provided in Diquigiovanni et al. (2024) for the functional time series setting. They randomly selected the time indices to form a calibration series, however, such an approach does not preserve the statistical properties of the time series. They also postulate different values of the predictor variable, which is not scalable to large time horizons. Additionally, the work is primarily based on applying to the time series and lacks formalisation as the generalised inductive conformal predictors. In contrast, we split the sequence into two such that we preserve the statistical properties of the time series; our method formally extends Chernozhukov et al. (2018) to inductive conformal prediction and multi-step scenarios, including multivariate time series.

Other notable works, Diquigiovanni et al. (2021); Ajroldi et al. (2023); Xu et al. (2024); Angelopoulos et al. (2023) constructed prediction bands for multivariate functional data, functional surfaces, ellipsoidal regions for multivariate time series, prediction after skipping next time steps, respectively, but only for single-step predictions without an extension for multi-step predictions.

Multi-step prediction Alaa and van der Schaar (2020) applied the jackknife method to RNN-based networks for multi-step prediction, with theoretical coverage of $<span class='convertEndash'>1-2</span>\epsilon$ at significance level $\epsilon$, instead of $1-\epsilon$ that we obtain and is desirable. Stankeviciute et al. (2021) assumed conditional IID prediction errors in multi-output models, using Bonferroni correction (Haynes, 2013) to achieve desired coverage. However, their approach can be overly conservative with increasing prediction horizons and may not hold when model assumptions are violated, and they require multi-output prediction models, in contrast to our approach that is model-agnostic. Cleaveland et al. (2024) utilised linear complementary programming and an additional dataset to optimise the parameters of a multi-step prediction error model, whereas JANET does not require any additional optimisation step or an additional dataset. In recent work, Sun and Yu (2024), building on Messoudi et al. (2021), employed copulas to adjust for temporal dependencies, enabling multi-step and autoregressive predictions for multivariate time series filling in the deficit from other works. However, their method requires two calibration sets and gradient-based optimisation, making it data-inefficient, unlike JANET. Furthermore, unlike our proposed framework JANET, their approach requires multiple independent time series (just like the preceding works) and cannot adapt prediction regions based on historical context.

A concurrent work Zhou et al. (2024) primarily focussed on single-step prediction adapted Gibbs and Candès (2021) to account for heterogeneous trajectories in multiple time series settings. However, they require multiple calibration sets and do not show how to control $K-$familywise error when dealing with multi-step prediction. JANET, on the other hand, can control $K-$familywise error and does not require mutiple calibraiton steps. Another concurrent work English and Lippert (2024) form multiple disjoint prediction regions, but their approach requires using generative models with access to density estimation, whereas JANET is model agnostic.

Non-conformal prediction regions Beyond conformal methods, bootstrapping provides an alternative method for constructing joint prediction regions especially when one only has a single time series. Staszewska-Bystrova (2013) generates B bootstraps and finds a heuristic-based prediction region from the bootstrapped predictive sequences, however, the method only provides have asymptotic guarantees. Wolf and Wunderli (2013) creates JPRs that are asymptotically valid and can control the K-familywise error. English (2024) showed that model refitting in Staszewska-Bystrova (2013) can be avoided under certain assumptions but at the loss of asymptotic guarantees. However, their double bootstrapping method gets computationally expensive die to exponential inference from the trained models. However, given that these method are based on bootstrapping, they suffer from large computational cost that is infeasible when working with neural networks. JANET is based on inductive conformal prediction and has negligible computational cost in comparison to the bootstrapped methods. Our work can be seen as conformalisation of Wolf and Wunderli (2013); English (2024) without relying on bootstrapping and with adaptive prediction regions that is not present in the prior works. Additionally, as a conformal method, JANET can be readily applied to time series classification tasks which is not apparent for bootstrap-based methods.

To the best of our knowledge, JANET is the only framework that can handle both single and multiple time series (univariate or multivariate), whilst providing adaptive prediction regions based on historical context (lagged values), and controlling K-familywise error with finite-sample guarantees.

7 Discussion and future work

In this paper, we have formally extended the Generalised Conformal Prediction framework proposed by Chernozhukov et al. (2018) to the inductive conformalisation setting (Vovk et al., 2005). Building upon this foundation, we have introduced JANET (Joint Adaptive predictioN-region Estimation for Time-series), a comprehensive framework for constructing prediction regions in time series data. JANET is capable of producing prediction regions with marginal intervals that adapt to both the time horizon and the relevant historical context of the data. Notably, JANET effectively controls the K-FWER, a valuable feature particularly when dealing with long prediction horizons.

JANET includes several desirable properties: computational efficiency, as it requires only a single model training process; applicability to scenarios involving multiple independent time series, with exact validity guarantees; and approximate validity guarantees when a single time series is available. Furthermore, JANET is flexible enough to provide asymmetric or one-sided prediction regions, a capability not readily available in many existing methods.

Looking towards future work, we envision several promising directions for extending JANET. One avenue involves exploring the extension of JANET to a cross-conformal setting (Vovk et al., 2005), which could offer gains in predictive efficiency at the expense of potentially weaker coverage guarantees. Additionally, we acknowledge a current limitation of JANET, which is its inability to create multiple disjoint prediction regions (Izbicki et al., 2021). Such disjoint regions could be denser than a single joint region, thereby providing more informative uncertainty estimates, particularly in cases with multimodal predictive distributions. We intend to address this limitation in future research.

Data availability

No datasets were generated or analysed during the current study.

Notes

Sourced from MachineLearningMastery.com (Jason Brownlee), available at https://github.com/jbrownlee/Datasets

References

Ajroldi, N., Diquigiovanni, J., Fontana, M., et al. (2023). Conformal prediction bands for two-dimensional functional time series. Computational Statistics & Data Analysis, 187, Article 107821.
Article MathSciNet Google Scholar
Alaa AM, van der Schaar M (2020) Frequentist uncertainty in recurrent neural networks via blockwise influence functions. arXiv:2006.13707
Angelopoulos, A., Candes, E., & Tibshirani, R. J., et al. (2023). Conformal pid control for time series prediction. In A. Oh, T. Naumann, & A. Globerson (Eds.), Advances in Neural Information Processing Systems (pp. 23047–23074). Curran Associates Inc.
Google Scholar
Angelopoulos AN, Bates S (2022) A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv:2107.07511
Athreya, K. B., & Pantula, S. G. (1986). Mixing properties of harris chains and autoregressive processes. Journal of Applied Probability, 23(4), 880–892.
Article MathSciNet Google Scholar
Athreya, K. B., & Pantula, S. G. (1986). A note on strong mixing of arma processes. Statistics & Probability Letters, 4(4), 187–190. https://doi.org/10.1016/0167-7152(86)90064-7 https://www.sciencedirect.com/science/article/pii/0167715286900647.
Article MathSciNet Google Scholar
Barber RF, Candes EJ, Ramdas A, et al (2023) Conformal prediction beyond exchangeability. arXiv:2202.13415
Bradley R (2007) Introduction to Strong Mixing Conditions. No. v. 1 in Introduction to Strong Mixing Conditions, Kendrick Press
Candès EJ, Lei L, Ren Z (2023) Conformalized survival analysis. arXiv:2103.09763
Chernozhukov V, Wüthrich K, Yinchu Z (2018) Exact and robust conformal inference methods for predictive machine learning with dependent data. In: Bubeck S, Perchet V, Rigollet P (Eds) Proceedings of the 31st Conference On Learning Theory, Proceedings of Machine Learning Research, PMLR, 75 pp 732–749,
Cleaveland M, Lee I, Pappas GJ, et al (2024) Conformal prediction regions for time series using linear complementarity programming. arXiv:2304.01075
Diquigiovanni J, Fontana M, Vantini S (2021) The importance of being a band: Finite-sample exact distribution-free prediction sets for functional data. arXiv:2102.06746
Diquigiovanni J, Fontana M, Vantini S (2024) Distribution-free prediction bands for multivariate functional time series: an application to the italian gas market. arXiv:2107.00527
Dunn, O. J. (1961). Multiple Comparisons among Means. Journal of the American Statistical Association, 56(293), 52–64. https://doi.org/10.1080/01621459.1961.10482090 http://www.tandfonline.com/doi/abs/10.1080/01621459.1961.10482090.
Article MathSciNet Google Scholar
Dunn R, Wasserman L, Ramdas A (2022) Distribution-free prediction sets for two-layer hierarchical models. arXiv:1809.07441
English E (2024) Joint prediction regions for time-series models. arXiv:2405.12234
English E, Lippert C (2024) Conformalised conditional normalising flows for joint prediction regions in time series. In: NeurIPS 2024 Workshop on Bayesian Decision-making and Uncertainty, https://openreview.net/forum?id=5Vkwer7im5
Fannjiang, C., Bates, S., Angelopoulos, A. N., et al. (2022). Conformal prediction under feedback covariate shift for biomolecular design. Proceedings of the National Academy of Sciences, 119(43), Article e2204569119. https://doi.org/10.1073/pnas.2204569119
Article MathSciNet Google Scholar
Gal Y, Ghahramani Z (2016) Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: Balcan MF, Weinberger KQ (eds) Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol 48. PMLR, New York, New York, USA, pp 1050–1059, https://proceedings.mlr.press/v48/gal16.html
Gasparin M, Ramdas A (2024) Conformal online model aggregation. arXiv:2403.15527
Gibbs I, Candès E (2021) Adaptive conformal inference under distribution shift. arXiv:2106.00170
Gibbs I, Candès E (2023) Conformal inference for online prediction with arbitrary distribution shifts. arXiv:2208.08401
Haynes, W. (2013). Bonferroni Correction. Springer, New York, New York, NY,. https://doi.org/10.1007/978-1-4419-9863-7_1213 https://doi.org/10.1007/978-1-4419-9863-7_1213.
Article Google Scholar
Izbicki R, Shimizu G, Stern RB (2021) Cd-split and hpd-split: efficient conformal regions in high dimensions. arXiv:2007.12778
Kipf T, Fetaya E, Wang KC, et al (2018) Neural relational inference for interacting systems. arXiv:1802.04687
Lehmann, E. L., & Romano, J. P. (2005). Generalizations of the familywise error rate. The Annals of Statistics. https://doi.org/10.1214/009053605000000084
Article MathSciNet Google Scholar
Lei, J., G’Sell, M., Rinaldo, A., et al. (2018). Distribution-Free Predictive Inference for Regression. Journal of the American Statistical Association, 113(523), 1094–1111. https://doi.org/10.1080/01621459.2017.1307116
Article MathSciNet Google Scholar
Lei L, Candès EJ (2021) Conformal inference of counterfactuals and individual treatment effects. arXiv:2006.06138
Messoudi S, Destercke S, Rousseau S (2021) Copula-based conformal prediction for multi-target regression. arXiv:2101.12002
Podkopaev A, Ramdas A (2021) Distribution-free uncertainty quantification for classification under label shift. arXiv:2103.03323
Quach V, Fisch A, Schuster T, et al (2023) Conformal language modeling. arXiv:2306.10193
Rio E (2017) Asymptotic Theory of Weakly Dependent Random Processes. Probability Theory and Stochastic Modelling, Springer Berlin Heidelberg, https://books.google.com/books?id=b_iqDgAAQBAJ
Romano, J. P. (1990). On the behavior of randomization tests without a group invariance assumption. Journal of the American Statistical Association, 85(411), 686–692.
Article MathSciNet Google Scholar
Shafer G, Vovk V (2007) A tutorial on conformal prediction. arXiv:0706.3188
Stankeviciute K, M. Alaa A, van der Schaar M (2021) Conformal Time-series Forecasting. In: Ranzato M, Beygelzimer A, Dauphin Y, et al (Eds) Advances in Neural Information Processing Systems, Curran Associates, Inc, Newry pp 6216–6228
Staszewska-Bystrova, A. (2013). Modified scheffé’s prediction bands. Journal of Economics and Statistics, 233(5–6), 680–690.
Google Scholar
Staszewska-Bystrova, A. (2011). Bootstrap prediction bands for forecast paths from vector autoregressive models. Journal of Forecasting, 30(8), 721–735.
Article MathSciNet Google Scholar
Sun SH, Yu R (2024) Copula conformal prediction for multi-step time series prediction. In: The Twelfth International Conference on Learning Representations, https://openreview.net/forum?id=ojIJZDNIBj
Tibshirani RJ, Barber RF, Candes EJ, et al (2020) Conformal prediction under covariate shift. arXiv1904.06019
U.S. Bureau of Economic Analysis (1947) Real Gross Domestic Product. https://fred.stlouisfed.org/series/GDPC1, publisher: FRED, Federal Reserve Bank of St. Louis
Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world, (Vol. 29). Springer.
Google Scholar
Wolf M, Wunderli D (2013) Bootstrap joint prediction regions. ECON - Working Papers 064, Department of Economics - University of Zurich, https://EconPapers.repec.org/RePEc:zur:econwp:064
Xu C, Xie Y (2023) Conformal prediction for time series. arXiv:2010.09107
Xu C, Jiang H, Xie Y (2024) Conformal prediction for multi-dimensional time series by ellipsoidal sets. arXiv:2403.03850
Zaffran M, Dieuleveut A, Féron O, et al (2022) Adaptive conformal predictions for time series. arXiv:2202.07282
Zhou Y, Lindemann L, Sesia M (2024) Conformalized adaptive forecasting of heterogeneous trajectories. arXiv:2402.09623

Download references

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Faculty of Digital Engineering, Hasso Plattner Institute, Prof.-Dr.-Helmert-Str. 2-3, 14482, Potsdam, Brandenburg, Germany
Eshant English & Christoph Lippert
Department of Computer Science, University of California, Irvine, Irvine, CA, 92697-3425, USA
Eshant English, Stephan Mandt & Padhraic Smyth
Department of Computer Science, Royal Holloway, University of London, Egham Hill, Egham, SRY, TW20 0EX, UK
Matteo Fontana
Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, 1468 Madison Avenue, New York, NY, 10029, USA
Christoph Lippert
Department of Statistics, University of California Irvine, Irvine, CA, 92697-3425, USA
Eliot Wong-Toi, Stephan Mandt & Padhraic Smyth

Authors

Eshant English
View author publications
Search author on:PubMed Google Scholar
Eliot Wong-Toi
View author publications
Search author on:PubMed Google Scholar
Matteo Fontana
View author publications
Search author on:PubMed Google Scholar
Stephan Mandt
View author publications
Search author on:PubMed Google Scholar
Padhraic Smyth
View author publications
Search author on:PubMed Google Scholar
Christoph Lippert
View author publications
Search author on:PubMed Google Scholar

Contributions

E.E. led the project with writing, experiments, and theoretical analysis. E.W. helped in writing and performing the earlier neural network experiments and during idea development. M.F., S.M., P.S., and C.L., acted as advisors at different stages of the project, helping with idea development, writing and presentation suggestions. All authors reviewed the manuscript.

Corresponding author

Correspondence to Eshant English.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Editors: Maria Concepcion Bielza Lozoya, Ana Carolina Lorena, Tiago A. Almeida.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Appendix

1.1 A.1 One-sided intervals and asymmetric intervals

One-sided conformalised JPRs can be formed with a slight change to the conformity score. To do so we consider signed residuals instead of absolute ones at each time step, redefining the conformity score and present them in the multiple time series setting

$$\begin{aligned} \alpha ^{k}_{i,+}&= K\text {-}\max \left\{ \frac{y_{i,1} - \hat{y}_{i,1}}{\hat{\sigma }_{1}(X_i)}, \dots , \frac{y_{i,H} - \hat{y}_{i,H}}{\hat{\sigma }_{H}(X_i)} \right\} \end{aligned}$$

for $i \in \{1, \dots , n_{cal}\}$ and where $K\text {-}\max (\textbf{x})$ as the $K^{th}$ largest element of a sequence $\textbf{x}$. One-sided lower JPRs for test sample $Z^* = (X^*, Y^*)$ can be given by

$$\begin{aligned} \mathcal {C}_{1-\epsilon }^{\alpha +} = \left( \hat{y}_{1} - q^{K\text {-}\max }_{1-\epsilon } \cdot {\hat{\sigma }_{1}(X^*)}, \infty \right) \times \dots \times \left( \hat{y}_{H} - q^{K\text {-}\max }_{1-\epsilon } \cdot {\hat{\sigma }_{H}(X^*)}, \infty \right) \end{aligned}$$

and analogously a one-side upper JPR is given by

$$\begin{aligned} \mathcal {C}_{1-\epsilon }^{\alpha -} = \left( -\infty , \hat{y}_{1} + q^{K\text {-}\max }_{1-\epsilon } \cdot {\hat{\sigma }_{1}(X^*)}\right) \times \dots \times \left( -\infty , \hat{y}_{H} + q^{K\text {-}\max }_{1-\epsilon } \cdot {\hat{\sigma }_{H}(X^*)}\right) . \end{aligned}$$

where, $X^*$ are the lagged values(i.e the history) for the unknown $Y^*$ that is to be predicted.

We define $q_{\eta }^{K\text {-}\max }$ as the $\eta ^{th}$ empirical quantile of the distribution of $\alpha ^k_{i,+}$.

Asymmetric intervals: If the intervals are unlikely to be symmetric, one can adapt the inversion of quantiles from the conformity scores such as $\epsilon _+, \epsilon _->0$ with $\epsilon _+ + \epsilon _- = \epsilon$ thus the JPR is

$$\begin{aligned} \mathcal {C}_{1-\epsilon }^{\alpha \pm } = \bigg (\left( \hat{y}_{1} - q^{K\text {-}\max }_{1-\epsilon _-}\cdot {\hat{\sigma }_{1}(X^*)}\right) ,&\left( \hat{y}_{1} + q^{K\text {-}\max }_{1-\epsilon _+}\cdot \hat{\sigma }_{1}(X^*)\right) \bigg ) \times \dots \\ \dots \times \bigg (&\left( \hat{y}_{H} - q^{K\text {-}\max }_{1-\epsilon _-}\cdot \hat{\sigma }_{H}(X^*)\right) ,\, \left( \hat{y}_{H} + q^{K\text {-}\max }_{1-\epsilon _+}\cdot \hat{\sigma }_{H}(X^*)\right) \bigg ). \end{aligned}$$

1.2 A.2 Dtails on Theorem 1

1.2.1 A.2.1 Assumptions for Theorem 1

Let $\alpha ^o$ be an oracle non-conformity measure, and let $\alpha$ be the corresponding non-conformity score for approximate results. Assume the number of randomisations, d, and the size of the training sequence, m, grow arbitrarily large, i.e. $d, m \rightarrow \infty$. Further, let $\{\delta _{1d}, \delta _{2\,m}, \gamma _{1d}, \gamma _{2\,m}\}$ be sequences of non-negative numbers converging to zero. We impose the following conditions:

(1)
Approximate Ergodicity: With probability $1-\gamma _{1d}$, the randomisation distribution
$$\begin{aligned} \hat{F}(x):= \frac{1}{d} \sum _{\pi \in \Pi } \textbf{1}\{\alpha ^o(Z_{cal}^{\pi }) < x\} \end{aligned}$$
(9)
is approximately ergodic for
$$\begin{aligned} F(x) = \text {Pr}( \alpha ^o(Z_{cal}) < x) \end{aligned}$$
(10)
that is $\sup _{x\in \mathbb {N}} |\hat{F}(x)-{F}(x)| \le \delta _{1d}$;
(2)
Small estimation errors: With probability $1-\gamma _{2m}$ the following hold:
1. (a)
  the mean squared error is small, $d^{-1}\sum _{\pi \in \Pi } [\left( \alpha (Z_{cal}^{\pi }) - \alpha ^o(Z_{cal}^{\pi })\right) ]^2 \le \delta ^2_{2m}$;
2. (b)
  the pointwise error when $\pi$ is the identity permutation is small, $|\alpha (Z_{cal})-\alpha ^o(Z_{cal})| \le \delta _{2m}$;
3. (c)
  the pdf of $\alpha ^o(Z_{cal})$ is bounded above by a constant D.

Condition (1) states an ergodicity condition, that permuting the oracle conformity scores provides a meaningful approximation to the unconditional distribution of the oracle conformity score. This holds when a time series is strongly mixing ($\alpha$-mixing) Bradley (2007); Rio (2017); Bradley (2007); Rio (2017) using the permutation scheme discussed 3.2. Notably ARMA (Autoregressive and Moving Average) series with IID innovations are known to be $\alpha$-mixing (Bradley, 2007; Rio, 2017). Condition (2) bounds the discrepancy between the non-conformity scores and their oracle counterparts.

Theorem 1(Approximate General Validity of Inductive Conformal Inference)

Under mild assumptions on ergodicity and small errors (see Appendix A.2.1), for any $\epsilon \in (0, 1)$, the approximation of conformal p-value is approximately distributed as follows:

$$\begin{aligned} |\text {Pr}({\hat{p}} \le \epsilon ) - \epsilon | \le 6\delta _{1d} + 2 \delta _{2m} + 2 D(\delta _{2m} + 2\sqrt{\delta _{2m}}) + \gamma _{1d} + \gamma _{2m} \end{aligned}$$

(11)

for any $\epsilon \in (0, 1)$ and the corresponding conformal set has an approximate coverage $1-\epsilon$, i.e,

$$\begin{aligned} | \text {Pr}(y^* \in \mathcal {C}_{1-\epsilon }) - (1-\epsilon ))| \le 6\delta _{1d} + 2\delta _{2m} + 2 D(\delta _{2m} + 2\sqrt{\delta _{2m}}) + \gamma _{1d} + \gamma _{2m}. \end{aligned}$$

(12)

Proof

The proof largely follows the proof of Generalised Conformal Prediction in Chernozhukov et al. (2018), adapted for Inductive Conformal Prediction. Since the second condition (bounds on the coverage probability) is implied by the first condition, it suffices to prove the first claim. Define the empirical distribution function of the non-conformity scores under randomization as:

$$\begin{aligned} \hat{F}(x):= \frac{1}{d} \sum _{\pi \in \Pi } \textbf{1}\{\alpha ^o(Z_{cal}^{\pi }) < x\} \end{aligned}$$

(13)

The rest of the proof proceeds in two steps. We first bound $\hat{F}(x) - F(x)$ and then derive the desired result.

Step 1: We bound the difference between the p-value and the oracle p-value, $\hat{F}(\alpha (Z_{cal})) - F(\alpha ^o{(Z_{cal})}).$ Let $\mathcal {M}$ be the event that the conditions (1) and (2) hold. By assumption,

$$\begin{aligned} \text {Pr}(\mathcal {M}) \ge 1- \gamma _{1d} - \gamma _{2m}. \end{aligned}$$

(14)

Notice that on the event $\mathcal {M}$,

$$\begin{aligned} \left| \hat{F}(\alpha (Z_{cal})) - F(\alpha ^o{(Z_{cal})})\right|&\le \left| \hat{F}(\alpha (Z_{cal})) - F(\alpha (Z_{cal}))\right| + \left| {F}(\alpha (Z_{cal}) - F(\alpha ^o{(Z_{cal})})\right| \nonumber \\&{\mathop {\le }\limits ^{\text {(i)}}} \sup _{x\in \mathbb {N}}\left| \hat{F}(x) - F(x)\right| + D \left| \alpha (Z_{cal}) - \alpha ^o{(Z_{cal})}\right| \nonumber \\&\le \sup _{x\in \mathbb {N}} \left| {\hat{F}}(x) - \tilde{F}(x)\right| + \sup _{x\in \mathbb {N}} \left| \tilde{F}(x) - F(x)\right| + D|\alpha (Z_{cal}) - \alpha ^o{(Z_{cal})}|\nonumber \\&\le \sup _{x\in \mathbb {N}} \left| {\hat{F}}(x) -\tilde{F}(x)\right| + \delta _{1d} + D|\alpha (Z_{cal}) - \alpha ^o{(Z_{cal})}|\nonumber \\&\le \sup _{x\in \mathbb {N}} \left| {\hat{F}}(x) -\tilde{F}(x)\right| +\delta _{1d} + D\delta _{2m}, \end{aligned}$$

(15)

where (i) holds by the fact that the bounded pdf of $\alpha ^o{(Z_{cal})}$ implies the Lipschitz property for F.

Let $A = \{\pi \in \Pi :|\alpha (Z_{cal}^{\pi }) - \alpha ^o(Z_{cal}^{\pi })| \ge \sqrt{\delta _{2\,m}}$. Notice that on the event $\mathcal {M}$, by Chebyshev inequality

$$\begin{aligned} |A|\delta _{2d} \le \sum _{\pi \in \Pi }\left( \alpha (Z_{cal}^{\pi }) - \alpha ^o(Z_{cal}^{\pi })\right) ^2 \le d \delta _{2m}^2 \end{aligned}$$

and thus $|A|/m \le \delta _{2m}$. Also notice that on the event $\mathcal {M}$, for any $x \in \mathbb {N}$,

$$\begin{aligned}&\left| \hat{F}(x) - \tilde{F}(x)\right| \\&\le \frac{1}{d} \sum _{\pi \in A} \left| \textbf{1}\{\alpha (Z_{cal}^{\pi })< x\}-\textbf{1}\{\alpha ^o(Z_{cal}^{\pi })< x\}\right| + \frac{1}{d}\sum _{\pi \in (\Pi \setminus A)} \left| \textbf{1}\{\alpha (Z_{cal}^{\pi })< x\}-\textbf{1}\{\alpha ^o(Z_{cal}^{\pi }) < x\}\right| \\&{\mathop {\le }\limits ^{\text {(i)}}} \frac{|A|}{d} + \frac{1}{d}\sum _{\pi \in (\Pi \setminus A)} \textbf{1}\left\{ |\alpha ^o(Z_{cal}^{\pi }) - x| \le \sqrt{\delta _{2}} \right\} \nonumber \\&\le \frac{|A|}{d} + \frac{1}{d} \sum _{\pi \in \Pi } \textbf{1}\left\{ |\alpha ^o(Z_{cal}^{\pi }) - x| \le \sqrt{\delta _{2m}} \right\} \nonumber \\&\le \frac{|A|}{d} + \text {Pr}\left( |\alpha ^o{(Z_{cal})} - x| \le \sqrt{\delta _{2m}} \right) \nonumber \\&\qquad \qquad + \sup _{z\in \mathbb {N}} \left| \frac{1}{d} \sum _{\pi \in \Pi } \textbf{1}\left\{ |\alpha ^o(Z_{cal}^{\pi }) - z| \le \sqrt{\delta _{2m}} \right\} - \text {Pr}\left( |\alpha ^o{(Z_{cal})} - z| \le \sqrt{\delta _{2m}} \right) \right| \nonumber \\&= \frac{|A|}{d} + \text {Pr}\left( |\alpha ^o{(Z_{cal})} -x | \le \sqrt{\delta _{2m}} \right) \nonumber \\&\qquad + \sup _{x\in \mathbb {N}}\left| \left[ \tilde{F}\left( z + \sqrt{\delta _{2m}}\right) - \tilde{F}\left( z - \sqrt{\delta _{2m}}\right) \right] - \left[ F\left( z + \sqrt{\delta _{2m}}\right) - F\left( z - \sqrt{\delta _{2m}}\right) \right] \right| \nonumber \\&\le \frac{|A|}{d} + \text {Pr}\left( |\alpha ^o{(Z_{cal})} - x| \le \sqrt{\delta _{2m}}\right) + 2 \sup _{x\in \mathbb {N}}\left| \tilde{F}(z) - F(z)\right| \nonumber \\&{\mathop {\le }\limits ^{\text {(ii)}}} \frac{|A|}{d} + 2D \sqrt{\delta _{2m}} + 2 \delta _{1d}\nonumber \\&{\mathop {\le }\limits ^{\text {(iii)}}} 2 \delta _{1d} + \delta _{2m} + 2D\sqrt{\delta _{2m}}, \end{aligned}$$

where

i.
follows by the boundedness of indicator functions and the elementary inequality of $|\textbf{1}\left\{ \alpha (Z_{cal}^{\pi })< x\} - \textbf{1}\{\alpha ^o(Z_{cal}^{\pi })<x\right\} |\le \textbf{1}\left\{ |\alpha ^o(Z_{cal}^{\pi }) - x| \le |\alpha (Z_{cal}^{\pi }) - \alpha ^o(Z_{cal}^{\pi })|\right\}$;
ii.
follows by the bounded pdf of $\alpha ^o{(Z_{cal})}$;
iii.
follows by $|A|/d\le \delta _{2m}$.

Since the above result holds for each $x \in \mathbb {N}$, it follows that on the event $\mathcal {M}$,

$$\begin{aligned} \sup _{x \in \mathbb {N}} \left| \hat{F}(x) - \tilde{F}(x)\right| \le 2 \delta _{1d} + \delta _{2m} + 2D\sqrt{\delta _{2m}}. \end{aligned}$$

(16)

We combine (15) and (16) and obtain that on the event $\mathcal {M}$,

$$\begin{aligned} \left| \hat{F}(\alpha (Z_{cal})) - F(\alpha ^o{(Z_{cal})}) \right| \le 3 \delta _{1d} + \delta _{2m} + D(\delta _{2m} + 2\sqrt{\delta _{2m}}). \end{aligned}$$

(17)

Step 2: The derivation of the main result follows: Notice that

$$\begin{aligned} \big | \text {Pr}\big ( 1- \hat{F}(\alpha (Z_{cal}))&\le \epsilon \big ) - \epsilon \big |\\&= \left| E\left( \textbf{1}\left\{ 1-\hat{F}(\alpha (Z_{cal})) \le \epsilon \right\} - \textbf{1}\{1-F(\alpha ^o{(Z_{cal})}) \le \epsilon \} \right) \right| \\&\le E\left| \textbf{1}\left\{ 1-\hat{F}(\alpha (Z_{cal})) \le \epsilon \right\} - \textbf{1}\left\{ 1-F(\alpha ^o{(Z_{cal})}) \le \epsilon \right\} \right| \\&{\mathop {\le }\limits ^{\text {(i)}}} \text {Pr}\left( |F(\alpha ^o{(Z_{cal})}) - 1 + \epsilon | \le \left| \hat{F}(\alpha (Z_{cal})) - F(\alpha ^o{(Z_{cal})}) \right| \right) \\&\le \text {Pr}\left( |F(\alpha ^o{(Z_{cal})})-1 + \epsilon | \le \left| \hat{F}(\alpha (Z_{cal})) - F(\alpha ^o{(Z_{cal})}) \right| \text { and } \mathcal {M}\right) + \text {Pr}\left( \mathcal {M}^{\mathcal {C}}\right) \\&{\mathop {\le }\limits ^{\text {(ii)}}} \left( \text {Pr}(|F(\alpha ^o{(Z_{cal})})-1 + \epsilon | \le 3 \delta _{1d} + \delta _{2m} + D(\delta _{2m} - 2 \sqrt{\delta _{2m}})\right) +\text {Pr}(\mathcal {M}^{\mathcal {C}})\\&{\mathop {\le }\limits ^{\text {(iii)}}} 6 \delta _{1d} + 2 \delta _{2m} + 2D(\delta _{2m} - 2 \sqrt{\delta _{2m}}) + \gamma _{1d} + \gamma _{2m}, \end{aligned}$$

where

i.
follows by the elementary inequality $|\textbf{1}\{1-\hat{F}(\alpha (Z_{cal})) \le \epsilon \} - \textbf{1}\{1-F(\alpha ^o{(Z_{cal})}) \le \epsilon \}| \le |\textbf{1}\{F(\alpha ^o{(Z_{cal})}) - 1 + \epsilon | \le |\hat{F}(\alpha (Z_{cal})) - F(\alpha ^o{(Z_{cal})})|\}$,
ii.
follows by (17),
iii.
follows by the fact that $F(\alpha ^o{(Z_{cal})})$ has the uniform distribution on (0, 1) and therefore, has pdf equal to 1, and by (14).

$\square$

Theorem 2

(General Exact Validity) Consider a sequence of observations $\textbf{Z}^\pi _{cal}$ that has an exchangeable distribution under the permutation group $\Pi$. For any fixed permutation group the randomisation quantiles, $\epsilon$, are invariant, namely

$$\begin{aligned} \mathcal {A}^{(r(\epsilon ))}\left( {Z}_{tr}, Z^\pi _{cal}\right) = \mathcal {A}^{(r(\epsilon ))}\left( {Z}_{tr}, Z_{cal}\right) \; \forall \pi \in \Pi \end{aligned}$$

where $r(\epsilon ) = \lceil (d+1)\epsilon ) \rceil$-th non-conformity score (when ranked in the descending order). Then, the following probabilistic guarantees hold:

$\text {Pr}({\hat{p}} \le \epsilon ) = \text {Pr}\left( \mathcal {A}\left( {Z}_{tr}, Z_{cal}\right) > \mathcal {A}\left( {Z}_{tr}, Z^{(r(\epsilon ))}_{cal}\right) \right) \le \epsilon$
$\text {Pr}(Y^* \in \mathcal {C}_{1-\epsilon }) \ge 1 - \epsilon$

where $Z^{(r(\epsilon ))}_{cal}$ is the permuted sequence that corresponds to the $\lceil (d+1) \epsilon \rceil ^{th}$ non-conformity score when ranked in descending order, $\mathcal {C}_{1-\epsilon }$ is the prediction interval for $(1-\epsilon )$ coverage and test sample $Z^* = (X^*, Y^*)$.

The proof follows from arguments for randomisation inference that can be found in Chernozhukov et al. (2018); Romano (1990).

B Computational cost analysis

In this section, we analsyse the computational costs for constructing prediction regions using different methods. Bonferroni and Scheffé-JPR require only a single run of the regression algorithm $\mathcal {A}$ (to fit $\hat{f}$ on the training set $Z_{\text {tr}}$), while bootstrap methods require $B$ separate runs–one for each bootstrap sample. Full conformal prediction (FCP)-based methods, like Chernozhukov et al. (2018), in contrast, must train $\mathcal {A}$ many more times–once for each test point $X^*$ and each candidate label $y \in \mathcal {Y}$, where $\mathcal {Y}$ is a discretised grid of size $c$. The total number of model fits is therefore $(n_{\text {train}}+1)\cdot n_{\text {test}} \cdot c$.

In the case of JANET, only a single model fit is required as it is rooted in the Inductive Conformal Prediciton framework (ICP). The calibration cost depends on whether the task involves multiple time series or a single time series. For multiple time series, JANET evaluates one conformity score per calibration series–like standard ICP. For a single time series, however, JANET uses a permutation-based calibration strategy, computing one conformity score for each of the $d$ permutations of a single calibration sequence. Evaluation involves computing a $K-$max over H-dimensional residuals for each calibration permutation or series. For simplicity, we assume the model requires only one evaluation for H forecasts into the future.

The Table 3 below summarises the model training and evaluation cost (ignoring constants), where all methods are trained on a dataset of size $n_{\text {tr}}$, and calibrated on $n_{\text {cal}}$ sequences (or permutations), for $n_{\text {test}}$ test examples. Table 3 compares computational cost (ignoring constants) for constructing prediction regions with different methods. The second column (Model training cost) counts the number of times the model $\hat{f}$ is trained. The third column (Model evaluation cost) accounts for computing conformity scores or predictions needed for calibration and test set evaluation. In most practical cases, the training cost dominates–for example, when $\hat{f}$ is a neural network, which is substantially more expensive to fit than to evaluate. Thus, methods like JANET, which require only a single model fit and relatively low calibration overhead, provide a practical alternative to bootstrap or full conformal methods, particularly in high-horizon or high-dimensional settings.

Table 3 Computational cost comparison of prediction region methods

Full size table

C Additional experiments

1.1 C.1 Multiple time series

To further demonstrate JANET’s adaptability across architectures, we conducted additional experiments using two expressive forecasting models: (i) a lightweight Transformer, and (ii) a RealNVP-based normalising flow. The results are summarised in Table 4.

Transformer model. We implemented a minimal Transformer encoder with one self-attention layer (1 layer, 2 attention heads) and a feedforward decoder. The conformal baselines (CF-Transformer, CopulaCPTS-Transformer) produce significantly wider regions than necessary, while still overshooting the target coverage of 90%. In contrast, JANET achieves near-nominal coverage with far more compact prediction regions, outperforming other methods in efficiency whilst also accounting uncertainty due to poor modelling.

Normalising flow model. For probabilistic forecasting, we used a RealNVP-based model with three affine coupling layers. While these models produce full predictive distributions, their native prediction intervals are drastically miscalibrated. As shown in Table 4, the raw Flow model severely undercovers—sometimes catastrophically (e.g., 0% coverage on UK COVID-19). To test JANET’s robustness, we applied our method on the predicted means from the flow-model to compute the prediciton region.

The results confirm that JANET is able to recover valid joint prediction regions even from poorly calibrated probabilistic forecasts. Across all datasets and values of K, JANET-Flow achieves coverage close to or above 90% while maintaining narrow and interpretable prediction regions. This highlights JANET’s value as a post-hoc calibration framework that can correct for miscalibrated uncertainty estimates— whether they originate from deterministic architectures or full probabilistic models.

Table 4 Comparison of coverage (%) and interval widths/areas on the test set for $\epsilon =0.1$ using Transformer and Normalising Flow models. Coverage values closest to 90% are highlighted in bold. Narrower intervals are preferred

Full size table

Results. Across both architectures, JANET achieves coverage rates close to the target $1 - \epsilon = 0.9$, while producing tighter regions than other baselines. Notably, JANET consistently improves the calibration of the probabilistic flow-based models, which often exhibit over- or under-coverage natively. These findings reinforce JANET’s robustness and generality across a wide range of forecasters, both deterministic and probabilistic.

1.2 C.2 Single time series evaluation: melbourne minimum temperature

To empirically evaluate JANET’s permutation-based calibration strategy in the single time series regime, we use the Melbourne Minimum Temperature dataset.^{Footnote 1} This dataset contains daily minimum temperatures in Melbourne, Australia, recorded from 1981 to 1990, for a total of 3652 observations. Two missing entries were imputed by averaging adjacent values to ensure compatibility with STL-based preprocessing.

We applied STL decomposition to extract and remove trend and seasonality components. The STL function identified yearly and weekly seasonalities. JANET’s permutation approach was applied to the residual (detrended and deseasonalised) series with the other components added later to the permuted residuals to obtain the set of calibration series.

Table 5 Empirical Out-of-Sample Coverages on the Melbourne Minimum Temperature Dataset ($\epsilon = 0.2$, target coverage 80%), empirical width, and training times (minutes), numbers in the parenthesis refer to values of K. Bootstrap-JPR (Boot) and JANET perform similarly whereas Bonferroni (Bonf.) shows over coverage as usual and Scheffé undercovers. Bonf. and Scheffé can only be performed for $K=1$. The bootstrap methods take approximately 10-12 times as long as our method in wall-clock time in our implementations

Full size table

Results. Table 5 reports out-of-sample coverage, prediction width, and training time for JANET and several baselines. JANET achieves coverage close to the target level of $1 - \epsilon = 0.8$ across all values of K, with prediction regions that are consistently narrower than those produced by bootstrap-based joint prediction regions (Boot-JPR). Moreover, JANET is roughly 10-12$\times$ faster in wall-clock time due to its single-model training requirement.

Bonferroni-based intervals overcover, as expected, while Scheffé’s method undercovers significantly in this setting. In contrast, JANET offers a strong balance between coverage accuracy, interpretability, and computational efficiency–even when applied to a single, non-exchangeable time series where classical exchangeability assumptions may not hold.

1.2.1 C.2.1 Effect of preprocessing on coverage

JANET’s validity relies on approximate exchangeability of calibration residuals–an assumption that can be violated in the presence of strong non-stationarity due to trends or seasonal patterns. To understand how such violations affect coverage, we conducted an ablation study on the Melbourne Minimum Temperature dataset by systematically varying the level of preprocessing. Specifically, we compared JANET’s performance under three settings: (i) using the raw, non-stationary series, (ii) removing only the seasonal component using STL decomposition, and (iii) applying full STL decomposition to remove both trend and seasonality.

Table 6 summarises the coverage and width for $K = 1, 2, 3$ at a fixed significance level of $\epsilon = 0.2$. As expected, JANET undercovers when applied directly to the raw data, achieving only 71-73% coverage with relatively wide intervals. Removing seasonality improves both coverage and sharpness, while full STL decomposition results in near-nominal coverage (78-79%) and the narrowest regions.

Table 6 Coverage (%) and interval widths under different preprocessing strategies on the Melbourne Minimum Temperature dataset, evaluated at $K = 1, 2, 3$ for $\epsilon = 0.2$. We compare three cases: (i) using the raw non-stationary series, (ii) removing seasonality only, and (iii) full STL decomposition to remove both trend and seasonality. Coverage improves and width decreases as more structure is accounted for in the time series

Full size table

These results confirm that violations of exchangeability due to non-stationarity can lead to predictable degradations in conformal validity. In the raw series setting, the poor coverage is expected–since the permutations used during calibration no longer correspond to a meaningful approximation of the original time series, thereby invalidating the underlying assumptions of JANET. However, we also observe that prediction intervals become substantially wider in this case, due to bigger residuals on the permuted series’, thus preventing a catastrophic collapse of coverage. In effect, although approximate validity is lost, JANET still produces non-trivial, cautious prediction regions. These findings reinforce the value of applying standard preprocessing pipelines (e.g., STL decomposition or differencing) to restore approximate exchangeability in real-world applications, and they motivate future work on the theoretical guarantees of conformal prediction under non-stationarity.

1.3 C.3 Time-aware weighting of residuals

In some real-world applications, not all time steps in a forecast horizon are equally important. For example, earlier predictions in a sequence may be more actionable or reliable than those further ahead. In this section, we explore whether JANET can accommodate such asymmetries through time-aware calibration.

Although the default K-FWER construction treats all future time steps symmetrically, JANET’s non-conformity score can be flexibly adjusted to apply time-weighted scaling. Specifically, we reweight residuals from different segments of the forecast horizon when computing conformity scores, thereby biasing the prediction region construction toward earlier or later time steps.

We illustrate this idea using the RNN-based model on the Particle5 dataset. Table 7 compares three weighting schemes: (1) uniform weighting (default), (2) upweighting early time steps by downscaling their residuals by a factor of 1.5, and (3) upweighting late time steps by downscaling residuals in the last 12 steps by a factor of 1.5.

Table 7 Coverage (%) and interval widths under different residual weighting strategies on the Particle5 dataset, evaluated at $K = 1, 2, 3$ for $\epsilon = 0.1$. Errors are reported separately for the first and last 12 time steps of the prediction horizon. The final two rows in each block show how the fraction of errors shifts between early and late segments as a function of the weighting applied to residuals during calibration

Full size table

As reported in Table 7, under uniform weighting, errors are distributed roughly evenly across the horizon. When earlier time steps are upweighted, JANET prioritises their coverage, significantly reducing the fraction of early-step errors—at the cost of more errors later in the horizon. Conversely, when later time steps are upweighted, the model shifts errors toward the beginning of the sequence. In all cases, JANET maintains approximate K-FWER coverage, demonstrating its flexibility in allocating uncertainty in a time-sensitive manner.

D Training details

We perform all our experiments on Intel(R) Xeon(R) W-2265 CPU @ 3.50GHz with 20 CPUs, 12 cores per socket, and 2 threads per core. In totality, we used on the order of 100 compute hours.

1.1 D.1 Single time series experiments

For Monte Carlo simulations and the US GDP dataset, we train AR(2) models as the main predictor. For learning scaling factors, we use a linear regression model and use the last 6 steps as the features to output the scaling factors.

1.2 D.2 Particle experiments

We take the same experimental setup as Sun and Yu (2024) and use a 1-layer sequence-to-sequence LSTM network (EncDec) where the encoder has a single LSTM layer with embedding size 24 and the decoder has a single LSTM layer with embedding size 24 and a linear layer. We also fit a 1-layer RNN with a single LSTM layer (RNN) with embedding size 24, followed by a linear layer. We train the model for 150 epochs and set batch size to 150.

For each dataset, 5000 samples were generated and split into 45/45/10 proportions for training, calibration, and testing, respectively. Baselines not requiring calibration used the calibration split for training.

1.3 D.3 UK COVID experiments

We take the same architectures from the particle experiments and apply them to the same COVID data as Stankeviciute et al. (2021). We train these models for 200 epochs with embedding sizes of 128 and set batch size of 64.

Of the 380 time series, we utilise 200 sequences for training, 100 for calibration, and 80 for testing.

1.4 D.4 Complete Monte Carlo results

We visualise the results of these experiments in Fig. 4. For complete numerical results see Tables 8, 9, 10, 11, 12 and 13. We find that our JANET methods are comparable or better with respect to coverage and interval widths against the baselines.

Table 8 Empirical Coverages of Monte Carlo Simulations with Significance Level $\epsilon =0.1$. Coverages are presented as percentages

Full size table

Table 9 Empirical coverages of Monte Carlo simulations with $\epsilon = 0.2$. Coverages are presented as percentages. The desired coverage is 80%

Full size table

Table 10 Empirical coverages of Monte Carlo simulations with $\epsilon = 0.3$. Coverages are presented as percentages. Desired coverage is 70%

Full size table

Table 11 Empirical widths of Monte Carlo simulations with $\epsilon = 0.1$. The desired coverage is 90% and narrower intervals are preferred. These reported widths are geometric means over the time steps averaged over the 1000 simulations

Full size table

Table 12 Empirical widths of Monte Carlo simulations with $\epsilon = 0.2$. The desired coverage is 80%. Narrower intervals are preferred. These reported widths are geometric means over the time steps averaged over the 1000 simulations

Full size table

Table 13 Empirical widths of Monte Carlo simulations with $\epsilon = 0.3$. The desired coverage is 70%. Narrower intervals are preferred. These reported widths are geometric means over the time steps averaged over the 1000 simulations

Full size table

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

English, E., Wong-Toi, E., Fontana, M. et al. JANET: Joint Adaptive predictioN-region Estimation for Time-series. Mach Learn 114, 177 (2025). https://doi.org/10.1007/s10994-025-06812-2

Download citation

Received: 13 April 2025
Revised: 14 April 2025
Accepted: 27 May 2025
Published: 23 June 2025
Version of record: 23 June 2025
DOI: https://doi.org/10.1007/s10994-025-06812-2

Keywords

Part of a collection:

JANET: Joint Adaptive predictioN-region Estimation for Time-series

Abstract

Similar content being viewed by others

Distribution-free conformal joint prediction regions for neural marked temporal point processes

Conformal multistep-ahead multivariate time-series forecasting

Towards Synthetic Multivariate Time Series Generation for Flare Forecasting

Explore related subjects

1 Introduction

2 Background on conformal prediction

3 Generalised inductive conformal predictors

3.1 Multiple time series

3.2 Single time series

Definition 1

3.3 Validity of ICP with permutations

Theorem 1

4 JANET: Joint Adaptive predictioN-region Estimation for Time-series

5 Experiments

5.1 Single time series experiments

5.1.1 Monte Carlo simulations

5.1.2 U.S. real gross domestic product (GDP) dataset

5.2 Multiple independent time series experiments

5.2.1 Particle simulation datasets

5.2.2 UK COVID-19 dataset

6 Related work

7 Discussion and future work

Data availability

Notes

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendices

A Appendix

1.1 A.1 One-sided intervals and asymmetric intervals

1.2 A.2 Dtails on Theorem 1

1.2.1 A.2.1 Assumptions for Theorem 1

Proof

Theorem 2

B Computational cost analysis

C Additional experiments

1.1 C.1 Multiple time series

1.2 C.2 Single time series evaluation: melbourne minimum temperature

1.2.1 C.2.1 Effect of preprocessing on coverage

1.3 C.3 Time-aware weighting of residuals

D Training details

1.1 D.1 Single time series experiments

1.2 D.2 Particle experiments

1.3 D.3 UK COVID experiments

1.4 D.4 Complete Monte Carlo results

Rights and permissions

About this article

Cite this article

Share this article

Keywords