Modular and Adaptive Conformal Prediction for Sequential Models via Residual Decomposition

William Zhang
wzhang64@mit.edu
Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA 02139

Saurabh Amin
amins@mit.edu

Georgia Perakis
georgiap@mit.edu

Abstract

Conformal prediction offers finite-sample coverage guarantees under minimal assumptions. However, existing methods treat the entire modeling process as a black box, overlooking opportunities to exploit modular structure. We introduce a conformal prediction framework for two-stage sequential models, where an upstream predictor generates intermediate representations for a downstream model. By decomposing the overall prediction residual into stage-specific components, our method enables practitioners to attribute uncertainty to specific pipeline stages. We develop a risk-controlled parameter selection procedure using family-wise error rate (FWER) control to calibrate stage-wise scaling parameters, and propose an adaptive extension for non-stationary settings that preserves long-run coverage guarantees. Experiments on synthetic distribution shifts, as well as real-world supply chain and stock market data, demonstrate that our approach maintains coverage under conditions that degrade standard conformal methods, while providing interpretable stage-wise uncertainty attribution. This framework offers diagnostic advantages and robust coverage that standard conformal methods lack.

1 Introduction

Modern machine learning systems increasingly rely on modular pipelines, where upstream predictors generate intermediate representations for downstream models. Such pipelines appear across diverse applications: macroeconomic forecasting, where supply chain indicators inform market predictions, and medical diagnosis, where imaging features guide treatment decisions [11, 25]. In these settings, uncertainty quantification is essential and should reflect how error propagates across stages. Conformal prediction [28] provides principled prediction intervals with finite-sample coverage guarantees under exchangeability. However, existing conformal prediction methods treat these multi-stage systems as monolithic black boxes, overlooking their modular structure and missing opportunities for targeted error attribution.

Recent work expands the flexibility of conformal prediction through reweighting and adaptive calibration [7, 1, 20, 12, 4, 14, 15], structured model-aware adaptations [31, 32, 29], and risk-control [8, 2], but largely focuses on single-stage or black-box models, rather than taking a modular perspective.

This perspective is especially pertinent under distribution shift, which often affects stages asymmetrically—e.g., upstream sensors may drift while downstream mappings remain stable. Standard conformal methods cannot disentangle such effects, which may force practitioners to retrain the entire pipeline when targeted interventions would suffice. To address this gap, we propose a stage-wise abstraction that isolates these effects, embedding stage-wise uncertainty into conformal prediction, yielding robust, interpretable intervals under distribution shift.

By decomposing residuals into stage-wise components, our method constructs prediction intervals through a linear combination of stage-specific quantiles, weighted by scaling parameters selected via family-wise error rate (FWER) control over a calibration set. This yields valid, interpretable intervals without requiring internal model access—only structural knowledge of the pipeline. Our approach offers two key advantages over existing conformal methods: (i) it provides diagnostic transparency, identifying which stages contribute most to uncertainty, and (ii) it maintains coverage under distribution shifts affecting individual components where standard approaches degrade.

For intuition, we briefly introduce the two-stage setting with triplets $(w,x,y)$ and latent structure $x=\mu_{1}(w)+\varepsilon_{1}$ , $y=\mu_{2}(x)+\varepsilon_{2}$ , where $\varepsilon_{1},\varepsilon_{2}$ denote additive noise, and the model learns estimators $\hat{\mu}_{1}$ , $\hat{\mu}_{2}$ . For example, in automobile supply chains, $w$ could denote semiconductor prices, with $\hat{\mu}_{1}(w)$ estimating new vehicle demand ( $x$ ), and $\hat{\mu}_{2}(x)$ predicting used vehicle prices ( $y$ ). Our method decomposes the residual $R=|y-\hat{\mu}_{2}(\hat{\mu}_{1}(w))|$ into upstream ( $\hat{\mu}_{1}$ ) error component $\Delta R_{1}$ and downstream ( $\hat{\mu}_{2}$ ) residual $R_{2}$ , capturing the uncertainty of each stage.

We also extend our framework to adaptive settings, in which we update scaling parameters based on component-wise empirical coverage, improving responsiveness to (i) upstream, (ii) downstream, and (iii) end-to-end distribution shifts. We illustrate our method on synthetic shifts and real-world supply chain and financial data, showing improved robustness and interpretability over adaptive conformal baselines [14, 15, 1, 4]. For simplicity, we focus on two-stage models, but our methodology can be extended to multi-stage models (Appendix C).

Contributions:

• We propose a residual decomposition framework for sequential multi-stage models that partitions prediction error into interpretable upstream and downstream components, enabling stage-wise uncertainty attribution (Section 3).

• We develop a risk-controlled parameter selection procedure using FWER-based hypothesis testing to construct valid prediction intervals from decomposed residuals, with formal coverage guarantees (Section 5).

• We introduce an adaptive algorithm that dynamically adjusts scaling parameters based on component-wise coverage feedback, preserving long-run coverage guarantees while providing interpretable diagnostics for distribution shifts (Section 6).

• We demonstrate the framework’s effectiveness on synthetic shifts and real-world economic forecasting, showing maintained coverage under conditions that degrade existing conformal methods, with diagnostic capabilities that enable targeted model interventions (Section 7).

Paper Outline. In Sections 2 and 3, we formalize the two-stage prediction setting and introduce our residual decomposition. Section 4 describes our method for constructing stage-aware prediction intervals, followed by the FWER calibration procedure in Section 5. We extend our method to an adaptive version in Section 6 and evaluate performance under distribution shift in Section 7.

2 Problem setting and assumptions

We consider a sequential two-stage prediction problem. Each data point is characterized by a triplet $z=(w,x,y)$ where $w\in\mathcal{W}$ is the input to the first-stage model (upstream features), $x\in\mathcal{X}$ is an intermediate representation, and $y\in\mathcal{Y}$ is the final prediction target. We learn predictors $\hat{\mu}_{1}:\mathcal{W}\to\mathcal{X}$ and $\hat{\mu}_{2}:\mathcal{X}\to\mathcal{Y}$ which are composed to form end-to-end predictor $\hat{\mu}_{2}(\hat{\mu}_{1}(w))=\hat{\mu}_{2}(\hat{x})$ , where $\hat{x}=\hat{\mu}_{1}(w)$ .

To learn these models and perform conformal prediction, we assume access to three disjoint subsets: (i) a training set $S^{\text{train}}=\{(w_{i},x_{i},y_{i})\}_{i=1}^{n}$ used to fit both stages of the model via $(w,x)$ and $(x,y)$ pairs for each stage respectively; (ii) a conformal set $S^{\text{conf}}=\{(w_{i},x_{i},y_{i})\}_{i=n+1}^{n+m}$ for computing nonconformity scores; and (iii) a calibration set $S^{\text{cal}}=\{(w_{i},x_{i},y_{i})\}_{i=n+m+1}^{n+m+l}$ for parameter selection. At prediction time, only the upstream input $w_{\text{test}}$ is observed, while the intermediate value $x_{\text{test}}$ and target $y_{\text{test}}$ are unobserved and must be predicted.

We make the following assumptions.

Exchangeability. Unless stated otherwise, we assume that the data $(w,x,y)$ in $S^{\text{train}},S^{\text{conf}},S^{\text{cal}}$, together with the test point $z_{\text{test}}=(w_{\text{test}},x_{\text{test}},y_{\text{test}})$, are exchangeable.

Learning algorithms. We assume that the algorithms that learn $\hat{\mu}_{1},\hat{\mu}_{2}$ are deterministic, i.e., given the same training data, they produce identical predictors, and symmetric, i.e., invariant to permutations of the data. We also assume that the intermediate variable $x$ is observable for the given datasets, enabling residual decomposition.

Distribution shifts. Distribution shifts violate exchangeability and can impact each prediction stage differently. Let $P$ denote the distribution of a data point in the training set and $P^{\prime}$ denote the distribution of the test point. We consider three types of shift: upstream covariate shift, $P(w)\neq P^{\prime}(w)$; upstream concept shift, $P(x|w)\neq P^{\prime}(x|w)$; and downstream concept shift, $P(y|x)\neq P^{\prime}(y|x)$.

Objective.

Given test upstream input $w_{\text{test}}$ , the goal is to construct prediction interval $\hat{C}_{\alpha}(w_{\text{test}})$ such that $\mathbb{P}(y_{\text{test}}\in\hat{C}_{\alpha}(w_{\text{test}}))\geq 1-\alpha$ , where $\alpha\in(0,1)$ is the target miscoverage level. Under exchangeability, this is the standard conformal prediction objective. However, under distribution shifts, we seek to maintain robust coverage while providing interpretable attribution of uncertainty to specific pipeline stages.

The key challenges in this setting are: (i) Attribution: Understanding which stage contributes more to prediction uncertainty; (ii) Adaptivity: Maintaining coverage under shifts affecting different stages; (iii) Transparency: Providing actionable insights for model improvement or retraining decisions. Our approach addresses these challenges by decomposing the end-to-end prediction error into stage-specific components for targeted uncertainty quantification and robust interval construction. While we define these concepts for two-stage models, we discuss auxiliary inputs, multiple upstream models, and deeper sequential pipelines in Appendix C. We also provide a notation table in Appendix B.2.

3 Two-stage residual decomposition for conformal prediction

To address the aforementioned challenge of attribution, we partition the total prediction residual $R(w,y)=|y-\hat{\mu}_{2}(\hat{\mu}_{1}(w))|$ into upstream and downstream components. This decomposition enables stage-wise attribution of error, in contrast to standard black-box conformal methods. We provide a visualization in Figure 1.

Figure 1: Visualization of a sequential two-stage model with residual components $R,\Delta R_{1},R_{2}$.

Definition 1 (Second-stage residual).

Given a point $(w,x,y)$ and downstream predictor $\hat{\mu}_{2}:\mathcal{X}\to\mathcal{Y}$ , the second-stage residual is

$$R_{2}(x,y)=|y-\hat{\mu}_{2}(x)|.$$

This captures downstream prediction error when given the true intermediate input $x$ .

Definition 2 (First-stage delta).

Let $\hat{x}=\hat{\mu}_{1}(w)$ be the output of the first-stage predictor. The first-stage delta is defined as

$$\Delta R_{1}(w,x,y)=\left|\,|y-\hat{\mu}_{2}(x)|-|y-\hat{\mu}_{2}(\hat{x})|\,\right|.$$

This quantifies the change in downstream prediction error induced by replacing the true intermediate $x$ with its prediction $\hat{x}$ .

For notational purposes, we drop the dependence on $(w,x,y)$ , and refer to these terms as $R,\Delta R_{1}$ , and $R_{2}$ . Intuitively, $\Delta R_{1}$ reflects the error of the upstream predictor, while $R_{2}$ isolates downstream error without upstream influence. This decomposition satisfies a fundamental upper-bound property.

Proposition 1.

For any point $(w,x,y)$ , the total residual satisfies:

$$R=|y-\hat{\mu}_{2}(\hat{\mu}_{1}(w))|\leq\Delta R_{1}+R_{2}.$$

Proof.

By definition, $\Delta R_{1}=||y-\hat{\mu}_{2}(x)|-|y-\hat{\mu}_{2}(\hat{x})||$. Let $A=|y-\hat{\mu}_{2}(x)|=R_{2}$ and $B=|y-\hat{\mu}_{2}(\hat{x})|=R$. Since $B-A\leq|A-B|$, we have $B\leq A+|A-B|=R_{2}+\Delta R_{1}$. ∎

This upper bound ensures that the sum of components provides a conservative estimate of the total error, which is crucial for the coverage guarantees developed subsequently.

This decomposition provides a clean interpretation: $R_{2}$ measures the inherent uncertainty of the downstream model, while $\Delta R_{1}$ measures how upstream prediction errors affect the final prediction error magnitude. When $\Delta R_{1}$ is small relative to $R_{2}$, the upstream predictor performs well and downstream uncertainty dominates. Conversely, when $\Delta R_{1}$ is large relative to $R_{2}$, upstream errors drive prediction uncertainty. This attribution enables practitioners to identify which stage requires improvement and guide retraining decisions. Furthermore, these components provide insights for handling distribution shifts: under upstream shifts, $\Delta R_{1}$ typically increases as the first-stage predictor encounters out-of-distribution inputs, while under downstream shifts, $R_{2}$ increases. These changes are visualized experimentally in Appendix Figure 12(b). Thus, the varied responses enable targeted adaptive strategies for different distribution shifts.
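As a minimal sketch of Definitions 1 and 2, the components can be computed for a single triplet as follows. The function name `residual_components` and the linear predictors are illustrative assumptions, not part of the paper's implementation.

```python
def residual_components(w, x, y, mu1_hat, mu2_hat):
    """Return (R, delta_R1, R2) for one triplet (w, x, y)."""
    x_hat = mu1_hat(w)                  # first-stage prediction of x
    r2 = abs(y - mu2_hat(x))            # Definition 1: downstream error given the true x
    r = abs(y - mu2_hat(x_hat))         # total end-to-end residual R
    delta_r1 = abs(r2 - r)              # Definition 2: change in error from using x_hat
    return r, delta_r1, r2


# Hypothetical fitted predictors, chosen only for illustration.
mu1_hat = lambda w: 3.0 * w
mu2_hat = lambda x: 4.0 * x

r, delta_r1, r2 = residual_components(1.0, 3.5, 15.0, mu1_hat, mu2_hat)
# Proposition 1: the total residual never exceeds the sum of the components.
assert r <= delta_r1 + r2
```

In this toy example the upstream component is twice the downstream one, so the upstream stage would be flagged as the main source of error.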

These components exhibit intuitive relationships with the total residual: when $R_{2}$ is small, $\Delta R_{1}$ closely approximates $R$, while when the upstream model is accurate and $\hat{\mu}_{2}$ is smooth, $R_{2}$ closely approximates $R$ (see Appendix A.2). Importantly, since $R\leq\Delta R_{1}+R_{2}$, at least one component is at least $R/2$, ensuring meaningful stage-wise attribution. Next, we describe two complementary approaches to combining these residual components into prediction intervals.

4 Constructing prediction intervals

We describe two approaches that incorporate $\Delta R_{1},R_{2}$, utilizing the conformal set $S^{\text{conf}}$ to compute component-wise quantiles, with data-driven parameter selection using $S^{\text{cal}}$ addressed in Section 5. Note that there exists a rich space of heuristics for combining these residual components beyond those listed below; see Appendix C.1. For proofs of the following results, see Appendix A.1.

4.1 Separate component quantiles

We compute sets of residual components on $S^{\text{conf}}$ : $\{\Delta R_{1}^{i}\}_{i=1}^{m}$ and $\{R_{2}^{i}\}_{i=1}^{m}$ , which we abbreviate as $\{\Delta R_{1}\}$ and $\{R_{2}\}$ . Prediction intervals are constructed by summing quantiles computed separately from each set, with the quantile levels controlled by $c$ and $d$ for the respective stages.

Definition 3 (Separate component quantiles).

Let $c,d\in(0,1)$ be quantile levels. For test input $w_{\text{test}}$ , the prediction interval is

$$\hat{C}_{c,d}(w_{\text{test}})=\hat{\mu}_{2}(\hat{\mu}_{1}(w_{\text{test}}))\pm(Q_{1-c}(\{\Delta R_{1}\})+Q_{1-d}(\{R_{2}\})).$$

This construction inherits coverage guarantees from standard conformal prediction:

Theorem 1 (Coverage of separate component quantiles).

Under the assumption of exchangeability, for $c,d\in(0,1)$ , the prediction interval $\hat{C}_{c,d}(w_{\text{test}})$ satisfies

$$\mathbb{P}(y_{\text{test}}\in\hat{C}_{c,d}(w_{\text{test}}))\geq 1-c-d.$$
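The full proof is deferred to Appendix A.1. The following is only a sketch of one standard route to this bound, and the appendix argument may differ. It assumes that each $Q$ is a split-conformal quantile, so that under exchangeability a test score exceeds it with probability at most the stated level. The shorthand $q_{1}=Q_{1-c}(\{\Delta R_{1}\})$ and $q_{2}=Q_{1-d}(\{R_{2}\})$ is introduced here for brevity, and $R,\Delta R_{1},R_{2}$ denote the test-point residual components.

```latex
\begin{aligned}
\mathbb{P}\left(y_{\text{test}}\notin\hat{C}_{c,d}(w_{\text{test}})\right)
  &= \mathbb{P}(R > q_{1}+q_{2}) \\
  &\leq \mathbb{P}(\Delta R_{1}+R_{2} > q_{1}+q_{2})
      && \text{(Proposition 1)} \\
  &\leq \mathbb{P}(\Delta R_{1} > q_{1}) + \mathbb{P}(R_{2} > q_{2})
      && \text{(union bound)} \\
  &\leq c+d
      && \text{(conformal quantile property)}.
\end{aligned}
```

The second inequality uses the fact that a sum can exceed $q_{1}+q_{2}$ only if at least one summand exceeds its own threshold.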

4.2 Scaled component quantiles

Our second approach fixes a quantile level $\alpha\in(0,1)$ for both components and selects scaling parameters $a,b\in[0,1]$ to weight their respective contributions. This provides interpretable control over stage-wise uncertainty attribution.

Definition 4 (Scaled component quantiles).

For a fixed quantile $\alpha\in(0,1)$ and scaling coefficients $a,b\in[0,1]$ , the prediction interval is

$$\hat{C}_{\alpha,a,b}(w_{\text{test}})=\hat{\mu}_{2}(\hat{\mu}_{1}(w_{\text{test}}))\pm(a\cdot Q_{1-\alpha}(\{\Delta R_{1}\})+b\cdot Q_{1-\alpha}(\{R_{2}\}))$$

where the quantiles are computed over $S^{\text{conf}}$ .

The choice of scaling parameters $a$ and $b$ provides interpretable control: setting $a=0$ ignores upstream uncertainty while $b=0$ focuses only on upstream effects. However, coverage guarantees for arbitrary choices of $(a,b)$ require careful analysis. For fixed weights, we have the following:

Corollary 1 (Coverage with $a=b=1$ ).

Under exchangeability, for $a=b=1$ and $\alpha\in(0,1)$ , the interval $\hat{C}_{\alpha,a,b}(w_{\text{test}})$ satisfies

$$\mathbb{P}(y_{\text{test}}\in\hat{C}_{\alpha,a,b}(w_{\text{test}}))\geq 1-2\alpha.$$

Proof.

This follows directly from Theorem 1 with $c=d=\alpha$. ∎

For appropriately chosen quantiles and scaling weights, we observe that both methods can yield similar intervals. To provide maximum flexibility for both theoretical analysis and implementation, we combine both approaches into a general framework:

$$\hat{C}_{a,b,c,d}(w_{\text{test}})=\hat{\mu}_{2}(\hat{\mu}_{1}(w_{\text{test}}))\pm(a\cdot Q_{1-c}(\{\Delta R_{1}\})+b\cdot Q_{1-d}(\{R_{2}\})).$$

This unified form allows independent control of both quantile levels and scaling parameters, enabling fine-tuned balance between coverage guarantees and stage-wise identifiability.
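The unified interval can be sketched in a few lines of standard-library Python. The helper names `conformal_quantile` and `unified_interval` are hypothetical, and the $(m+1)$ finite-sample correction in the quantile is the usual split-conformal convention, assumed here rather than taken from the paper.

```python
import math


def conformal_quantile(scores, miscoverage):
    """(1 - miscoverage) quantile of the scores with the (m + 1) correction (assumed)."""
    m = len(scores)
    k = math.ceil((m + 1) * (1 - miscoverage))   # rank of the order statistic to use
    if k > m:                                    # too few scores for this level
        return math.inf
    return sorted(scores)[k - 1]


def unified_interval(point_pred, delta_r1_scores, r2_scores, a=1.0, b=1.0, c=0.05, d=0.05):
    """Interval point_pred +/- (a * Q_{1-c}(delta R1) + b * Q_{1-d}(R2))."""
    def scaled(scale, q):
        return 0.0 if scale == 0 else scale * q  # avoid 0 * inf = nan
    half_width = (scaled(a, conformal_quantile(delta_r1_scores, c))
                  + scaled(b, conformal_quantile(r2_scores, d)))
    return point_pred - half_width, point_pred + half_width


# Toy conformal-set scores, for illustration only.
delta_r1_scores = [float(i) for i in range(1, 20)]   # 1.0, ..., 19.0
r2_scores = [0.5 * i for i in range(1, 20)]          # 0.5, ..., 9.5
lo, hi = unified_interval(100.0, delta_r1_scores, r2_scores, a=0.5, b=1.0, c=0.1, d=0.1)
```

Setting `a = b = 1` recovers the form of Definition 3, while setting `c = d` to a common level recovers the form of Definition 4.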

4.3 Coverage for scaled parameters

While coverage for intervals of the form $\hat{C}_{c,d}$ (Definition 3) follows from standard conformal analysis, establishing similar guarantees for intervals with scaled residuals $\hat{C}_{\alpha,a,b}$ (Definition 4) presents significant challenges. The scaling parameters $(a,b)$ have no direct mapping to quantile levels, making it difficult to derive explicit guarantees. Despite this, we can establish that valid scaling parameters exist in principle. Under mild regularity conditions, there always exist optimal scaling parameters $(a^{*},b^{*})$ that yield exact marginal coverage:

Proposition 2 (Existence of ideal scaling parameters).

For desired coverage level $1-\alpha$ , $\exists a^{*},b^{*}\in[0,1]$ such that the interval with those scaling parameters satisfies the marginal coverage guarantee

$$\mathbb{P}\left(y_{\text{test}}\in\hat{C}_{\alpha,a^{*},b^{*}}(w_{\text{test}})\right)=1-\alpha,$$

provided the distribution of the residual $R$ has no point masses.

However, finding optimal scaling parameters $(a^{*},b^{*})$ requires knowledge of the residual distribution, which is unavailable in practice. This creates a fundamental trade-off: scaled intervals offer interpretable stage-wise control but lack accessible guarantees, while separate quantile intervals provide coverage under minimal assumptions but offer less direct interpretability. To resolve this trade-off, we adopt a data-driven approach [8, 3, 2] to selecting scaling parameters $(a,b)$ from a candidate set using the calibration set $S^{\text{cal}}$ for the unified interval.

5 Risk-controlling approach with residual components

Since finding $(a^{*},b^{*})$ is impossible in practice, we reframe the problem as selecting $(a,b)$ coefficient pairs that satisfy the nominal coverage level $1-\alpha$ . We approach this from the perspective of multiple hypothesis testing; for each $(a,b)$ we test the null hypothesis that the miscoverage rate exceeds $\alpha$ .

5.1 Testing miscoverage via empirical risk

We aim to identify scaling parameters $\lambda=(a,b)\in[0,1]^{2}$ for which the resulting prediction interval $\hat{C}_{\lambda}(w)$ achieves coverage at least $1-\alpha$ . Since theoretical guarantees for arbitrary $\lambda$ are unavailable, we perform a hypothesis test for whether its miscoverage rate exceeds $\alpha$ . We fix a finite candidate set $\Lambda\subseteq[0,1]^{2}$ of scaling pairs and define the corresponding prediction interval for each $\lambda=(a,b)\in\Lambda$ :

$$\hat{C}_{\lambda}(w)=\left\{y:\left|y-\hat{\mu}_{2}(\hat{\mu}_{1}(w))\right|\leq a\cdot Q_{1-c}(\{\Delta R_{1}\})+b\cdot Q_{1-d}(\{R_{2}\})\right\},$$

using the general framework introduced earlier that combines Definition 4 and Definition 3. Here, the quantile parameters $c$ and $d$ are fixed in advance with quantiles taken over $S^{\text{conf}}$, while the miscoverage testing is performed using $S^{\text{cal}}$ to identify suitable $\lambda$.

5.2 Computing $p$ -values from calibration data

Let $l=|S^{\text{cal}}|$ denote the size of the calibration set. For each choice of scaling parameters $\lambda\in\Lambda$ , we define the empirical miscoverage rate $\hat{\mathcal{R}}(\lambda)=\frac{1}{l}\sum_{i=1}^{l}\mathbbm{1}_{\{y_{i}\notin\hat{C}_{\lambda}(w_{i})\}}$ . Under the stronger assumption that the calibration points are IID, and that the true miscoverage rate for a given $\lambda$ is constant, the number of missed points follows a Binomial distribution with parameters $l$ and the underlying miscoverage probability. For each $\lambda$ , define the null hypothesis

$$\mathcal{H}_{0}(\lambda):\mathbb{P}(y\notin\hat{C}_{\lambda}(w))>\alpha.$$

For each $\lambda$, we then define a $p$-value $p_{\lambda}=\mathbb{P}\left(\text{Bin}(l,\alpha)\leq l\hat{\mathcal{R}}(\lambda)\right)$. The $p$-values $p_{\lambda}$ are super-uniform under $\mathcal{H}_{0}$ (Appendix A.2); that is, for any $u\in[0,1]$, we have $\mathbb{P}(p_{\lambda}\leq u)\leq u$, which is crucial for FWER-controlling guarantees [8]. We apply FWER-controlling multiple testing algorithms (Appendix B.4) at a chosen level $\delta\in(0,1)$ to the collection of $p_{\lambda}$ to obtain the set of valid scaling parameters $\Lambda_{\text{val}}\subseteq\Lambda$. This ensures that, with probability at least $1-\delta$, no $\lambda$ with a miscoverage rate exceeding $\alpha$ is accepted. Note that the $\lambda\in\Lambda_{\text{val}}$ are conservative, as they require statistical evidence of coverage of at least $1-\alpha$. In contrast, other prediction interval methods yield coverage that is merely close to $1-\alpha$. This conservatism may result in unnecessarily wide intervals, particularly when efficiency matters more than strict coverage, such as in IID settings. To remedy this, we can include a tolerance parameter $\tau$ and calculate $p_{\lambda}$ with $\text{Bin}(l,\alpha+\tau)$, which relaxes the guaranteed coverage level from $1-\alpha$ to $1-\alpha-\tau$ (Theorem 2) in exchange for narrower intervals. As demonstrated in our experiments (Appendix D.2.2), even small values of $\tau$ significantly improve efficiency while maintaining empirical coverage under IID settings.
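To make the testing step concrete, the sketch below computes the binomial $p$-values and filters candidates. The paper's FWER algorithms are specified in Appendix B.4; the Bonferroni correction used here is a simple stand-in chosen for illustration, and the function names and miss counts are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), by direct summation of the probability mass function."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def miscoverage_p_value(num_missed, l, alpha, tau=0.0):
    """p-value for H0: miscoverage > alpha (+ tau), given misses on l calibration points."""
    return binom_cdf(num_missed, l, alpha + tau)


def bonferroni_valid_set(missed_counts, l, alpha, delta, tau=0.0):
    """Keep each lambda whose p-value is at most delta / |Lambda| (Bonferroni, assumed)."""
    threshold = delta / len(missed_counts)
    return {lam for lam, k in missed_counts.items()
            if miscoverage_p_value(k, l, alpha, tau) <= threshold}


# Hypothetical miss counts on l = 200 calibration points for three (a, b) pairs.
missed_counts = {(1.0, 1.0): 5, (0.5, 1.0): 18, (0.5, 0.5): 40}
valid = bonferroni_valid_set(missed_counts, l=200, alpha=0.1, delta=0.1)
```

With these counts only the widest candidate, `(1.0, 1.0)`, is validated: 18 misses out of 200 is below the 10% target empirically, but not by enough to reject the null at the corrected level. An empty returned set corresponds to abstention.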

5.3 Coverage guarantees via risk-controlled scaling

Given the $p$-values $p_{\lambda}$ and a FWER-controlling multiple testing algorithm (Appendix B.4), we identify the set of validated scaling parameters $\Lambda_{\text{val}}\subseteq\Lambda$. The following result states that any interval associated with $\Lambda_{\text{val}}$ achieves coverage with high probability:

Theorem 2 (Risk control via FWER calibration).

Let $\Lambda_{\text{val}}\subseteq\Lambda$ be the set selected by a FWER-controlling algorithm at level $\delta$ , based on $p$ -values computed over the calibration set $S^{\text{cal}}$ with tolerance $\tau>0$ . Then, for any $\hat{\lambda}\in\Lambda_{\text{val}}$ , we have

$$\mathbb{P}\left(\mathbb{P}\left(y_{\text{test}}\in\hat{C}_{\hat{\lambda}}(w_{\text{test}})\,\big|\,S^{\text{cal}}\right)\geq 1-\alpha-\tau\right)\geq 1-\delta,$$

where the outer probability is over the randomness of $S^{\text{cal}}$ , and the inner probability is over the test point $(w_{\text{test}},x_{\text{test}},y_{\text{test}})$ .

This guarantee follows from the FWER-controlling property: because the probability of any false positive, i.e., accepting scaling parameters with insufficient coverage, is at most $\delta$, every selected interval $\hat{C}_{\hat{\lambda}}$ has coverage rate at least $1-\alpha-\tau$ with probability at least $1-\delta$ over $S^{\text{cal}}$. Note that while $l$ does not explicitly appear in the guarantee, larger calibration sets yield more powerful tests, typically resulting in a larger $\Lambda_{\text{val}}$ and less conservative interval selection. In practice, any pair $(a,b)\in\Lambda_{\text{val}}$ can be used to construct prediction intervals. Although the quantile levels $c,d$ from Definition 3 could also be tuned jointly with $(a,b)$, we fix $c,d$ for simplicity, which is typically sufficient in our experiments.

We note that $\Lambda_{\text{val}}$ can be empty, producing no interval. This is intentional, as it can indicate that the shift is too large and retraining is needed, rather than producing an unreliable interval. In this setting, $c,d$ offer a clear interpretation as stage-wise sensitivity to shifts (Section 7.1). Appendix C.2 also shows how the IID assumption on $S^{\text{cal}}$ can be relaxed to allow stationary $\phi$-mixing sequences.

6 Adaptive risk control with residual decomposition

We consider an adaptive variant of our method for nonstationary data by updating the prediction intervals over time. The sets $S^{\text{conf}}$ and $S^{\text{cal}}$ are defined using a sliding window over the most recent observations, with a user-specified window length $k$. We construct intervals combining Definition 4 and Definition 3: $\hat{C}_{\alpha,a,b,c,d}(w)=\hat{\mu}_{2}(\hat{\mu}_{1}(w))\pm(a\,Q_{1-c}(\{\Delta R_{1}\})+b\,Q_{1-d}(\{R_{2}\})),$ parameterized by scaling coefficients $a,b$, quantile levels $c,d$, and target coverage level $\alpha$, where quantiles are taken over $S^{\text{conf}}$, and $\Lambda_{\text{val}}^{t}$ is recalculated at each time step. With each new point, we dynamically update the $(a_{t},b_{t},c_{t},d_{t},\alpha_{t})$ using adaptive rules based on recent performance: when coverage drops below target, we decrease $\alpha_{t}$ and vice versa; when specific components show persistent errors, we adjust their corresponding scaling parameters; and when $\Lambda_{\text{val}}^{t}$ is too restrictive, we adjust the quantile levels to expand future options. The algorithmic details are provided in Appendix B.5.

Crucially, the SelectLambda algorithm implements the adaptive adjustments to $(a_{t},b_{t},c_{t},d_{t})$: it first identifies which component constitutes more of the error by comparing recent averages $\overline{\Delta R_{1}}=\frac{1}{k}\sum_{i=t-k}^{t-1}\Delta R_{1}^{(i)}$ and $\overline{R_{2}}=\frac{1}{k}\sum_{i=t-k}^{t-1}R_{2}^{(i)}$ over the sliding window. If coverage fails and upstream errors dominate ($\overline{\Delta R_{1}}>\overline{R_{2}}$), it seeks to increase $a_{t}$ within the validated set $\Lambda_{\text{val}}^{t}$. Conversely, when downstream errors dominate ($\overline{R_{2}}>\overline{\Delta R_{1}}$), it prioritizes $b_{t}$. If the desired scaling adjustment is unavailable in $\Lambda_{\text{val}}^{t}$, the algorithm returns signals $\Delta c_{t},\Delta d_{t}\in\{-1,0,1\}$ to adjust $c_{t+1},d_{t+1}$, effectively “adjusting” the constraints to allow more suitable scaling options.
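The selection logic described above can be sketched roughly as follows. This is not the SelectLambda algorithm of Appendix B.5, whose details are not reproduced here. The function name `select_lambda_sketch`, the tie-breaking rule, the choice to leave parameters unchanged when the last point was covered, and the convention that a signal of `-1` means "lower the quantile level" are all assumptions made for illustration.

```python
def select_lambda_sketch(current, valid_set, delta_r1_window, r2_window, covered):
    """Return (next_lambda, delta_c, delta_d) for one time step (illustrative only)."""
    if covered:
        return current, 0, 0                      # assumption: no change while covered
    a, b = current
    mean_delta_r1 = sum(delta_r1_window) / len(delta_r1_window)
    mean_r2 = sum(r2_window) / len(r2_window)
    upstream_dominates = mean_delta_r1 > mean_r2
    idx = 0 if upstream_dominates else 1          # coefficient to increase: a or b
    larger = [lam for lam in valid_set if lam[idx] > current[idx]]
    if larger:
        # Assumed tie-break: smallest increase of the dominant coefficient.
        return min(larger, key=lambda lam: (lam[idx], lam[1 - idx])), 0, 0
    # No suitable pair is validated: signal a quantile-level change for that stage.
    return current, (-1 if upstream_dominates else 0), (0 if upstream_dominates else -1)


valid_set = {(0.5, 0.5), (0.75, 0.5), (1.0, 0.5), (0.5, 1.0)}
step = select_lambda_sketch((0.5, 0.5), valid_set, [2.0, 3.0], [1.0, 1.0], covered=False)
```

Here the upstream average (2.5) exceeds the downstream average (1.0), so the sketch moves to the validated pair with the next larger `a`.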

Even with additional parameters, this approach preserves the long-run coverage guarantee from prior work: $\frac{1}{T}\sum_{t=1}^{T}\text{cov}_{t}\xrightarrow{\text{a.s.}}1-\alpha$ as $T\to\infty$, where $\text{cov}_{t}$ denotes coverage at time $t$. This is because the core $\alpha_{t}$ update rule remains identical to that of the adaptive conformal method of Gibbs and Candès [14], resulting in the convergence guarantee formally stated in Appendix A.1.1.
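For reference, the adaptive conformal inference update of [14] takes the form $\alpha_{t+1}=\alpha_{t}+\gamma(\alpha-\text{err}_{t})$, where $\text{err}_{t}=1$ if the interval at time $t$ missed the target and $0$ otherwise. A minimal sketch follows; the step size $\gamma=0.01$ and the function name `aci_update` are illustrative choices, not values taken from the paper.

```python
def aci_update(alpha_t, target_alpha, covered, gamma=0.01):
    """One step of the adaptive conformal update for the working level alpha_t."""
    err_t = 0.0 if covered else 1.0               # 1 when the interval missed y_t
    return alpha_t + gamma * (target_alpha - err_t)


alpha_after_miss = aci_update(0.1, target_alpha=0.1, covered=False)  # decreases alpha_t
alpha_after_hit = aci_update(0.1, target_alpha=0.1, covered=True)    # increases alpha_t
```

A miss lowers $\alpha_{t}$ and therefore widens the next interval, while a covered point raises it slightly, which matches the direction of the updates described in this section.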

7 Experiments

We evaluate our method on synthetic and real-world forecasting tasks to assess its ability to (i) maintain coverage, (ii) attribute predictive error to specific model stages, and (iii) adapt to distribution shifts in modular pipelines. Our experiments are designed to isolate upstream, downstream, and full-pipeline shifts, highlighting how stage-aware intervals improve transparency and robustness over existing conformal baselines. Additional experiments and details are in Appendix D.

7.1 Non-adaptive methods

We evaluate our non-adaptive method (Section 4), applying the FWER-based procedure to our unified interval, which we denote $\text{SR}_{a,b}$, to generate a validated set $\Lambda_{\text{val}}$ of $(a,b)$ pairs under fixed quantile levels $c,d$. We then select from this set the pair $(a,b)$ whose coverage on $S^{\text{conf}}$ is closest to the nominal level $1-\alpha$, with $\alpha=0.1$. We compare against two nonadaptive baselines: standard split conformal prediction (SC, Vovk et al. [28]) and weighted split conformal prediction (WSC, Barber et al. [7]), which accounts for non-exchangeability.

We begin with a synthetic dataset generated by the structural equations $x=3w+\nu_{1}$, $y=4x+\nu_{2}$, where $w,\nu_{1},\nu_{2}\sim\mathcal{N}(0,0.1)$. After an initial stationary period, the data undergoes a concept shift in either the upstream or downstream with increasing noise. We vary the rate of these shifts to include both gradual and rapid changes. Results for both upstream and downstream shifts are shown in Table 1.

Table 1: Average coverage and interval width for upstream and downstream shifts under gradual and rapid shifts at $\alpha=0.1$, over 50 samples, with standard deviations for the SR, SC, and WSC methods. NA indicates that our method abstains from providing an interval.

(a) Upstream shift coverage and width

Shift	Metric	$\text{SR}_{a,b}$ ($c=d=0.01$)	$\text{SR}_{a,b}$ ($c=0.05,d=0.01$)	SC	WSC
Gradual	Coverage	$0.8419\pm 0.01$	$0.8233\pm 0.01$	$0.7280\pm 0.02$	$0.7488\pm 0.02$
Gradual	Width	$2.9884\pm 0.10$	$2.4272\pm 0.10$	$2.3726\pm 0.10$	$2.4849\pm 0.09$
Rapid	Coverage	$0.8041\pm 0.01$	NA	$0.6910\pm 0.01$	$0.7218\pm 0.01$
Rapid	Width	$3.9703\pm 0.17$	NA	$2.9544\pm 0.11$	$3.1969\pm 0.15$

(b) Downstream shift coverage and width

Shift	Metric	$\text{SR}_{a,b}$ ($c=d=0.01$)	$\text{SR}_{a,b}$ ($c=0.01,d=0.05$)	SC	WSC
Gradual	Coverage	$0.9309\pm 0.02$	$0.9133\pm 0.02$	$0.8695\pm 0.03$	$0.8748\pm 0.02$
Gradual	Width	$2.4274\pm 0.12$	$2.3738\pm 0.12$	$2.0143\pm 0.06$	$2.0472\pm 0.06$
Rapid	Coverage	$0.8981\pm 0.01$	NA	$0.8419\pm 0.02$	$0.8506\pm 0.02$
Rapid	Width	$2.4878\pm 0.11$	NA	$2.1035\pm 0.08$	$2.1498\pm 0.08$

Our method achieves higher coverage than the baselines, which degrade under distribution shift, particularly under upstream shift. For example, moving from gradual to rapid upstream shifts, our method drops from 0.84 to 0.80 coverage, whereas SC drops from 0.73 to 0.69. These coverage improvements come at the cost of wider intervals and occasional abstention; this trade-off is worthwhile when attribution and robustness are prioritized. We note that selecting larger $c$ (resp. $d$) increases the sensitivity of the method to upstream (resp. downstream) shift, resulting in abstention under rapid shifts. This is intentional, with abstention signaling that retraining may be necessary for the given stage; black-box methods lack this ability, as they cannot isolate which part of the model fails, resulting in unnecessary retraining.

Thus, $c$ and $d$ control sensitivity: by selecting larger or smaller $c$ and $d$ , practitioners can balance robustness and abstention at each stage, offering flexibility beyond traditional conformal approaches. Next, we evaluate the adaptive version of our method under more extreme distributional shifts.

7.2 Adaptive methods under structured distribution shifts

We evaluate our adaptive method (Section 6) in settings with distributional shocks, comparing against recent adaptive baselines: Adaptive Conformal Inference (ACI, Gibbs and Candès [14]), Proportional-Integral-Derivative control (PID, Angelopoulos et al. [1]), and Online Conformal Inference with Decaying step sizes (OCID, Angelopoulos et al. [4]). We also consider DtACI (Gibbs and Candès [15]), an extension of ACI, and report its results in Appendix D.3.2 due to weaker performance. While DtACI is designed for online adaptation, it may be less effective in fixed-horizon forecasting tasks where the lag between prediction and feedback complicates adaptation. We also consider covariate shifts, hyperparameter sweeps, and a real-world stock price dataset, with full experimental details in Appendix D.

Controlled two-stage regression with targeted shifts.

We simulate a two-stage regression pipeline, with controlled stochastic relationships between upstream input $w$, intermediate representation $x$, and downstream output $y$. The initial data follows an i.i.d. structure with $x=\mu_{1}(w)=3w+\nu_{1},\quad y=\mu_{2}(x)=4x+\nu_{2}$ where $\nu_{1},\nu_{2}\sim\mathcal{N}(0,1)$. Both stages are modeled using ordinary least squares. At test time, we simulate a temporal sequence of three phases: (i) Upstream concept shift: $\mu_{1}$ becomes $\mu_{1}^{\text{shift}}(w)=8w+1+\nu_{1}$; (ii) Reversion: The upstream returns to $\mu_{1}(w)$; (iii) Downstream concept shift: $\mu_{2}$ becomes $\mu_{2}^{\text{shift}}(x)=7x+5+\nu_{2}$. We report coverage and widths in Figure 2, capturing the robustness of our methods to stage-specific shocks without overcompensating on width, particularly for upstream shocks.

Automobile indicators dataset.

We evaluate our method on a real-world dataset consisting of monthly economic indicators for the U.S. automobile supply chain and used vehicle valuations. Upstream features $(w)$ are sourced from FRED [13], including metrics such as price indices, inventory levels, and manufacturing prices, which are then forecasted $(x)$ at a 6-month horizon. Downstream outputs $(y)$ , aggregated used car prices, are obtained from Manheim [17]. This setting fits the two-stage modeling framework: upstream economic conditions influence downstream pricing, and training a separate upstream predictor leverages more widely available historical supply chain data.

A notable feature of this dataset is the sharp shock in 2020 due to the COVID-19 pandemic, which disrupted global supply chains and caused prolonged production halts, resulting in a steep and persistent rise in used vehicle prices [19]. The shock may also affect the pipeline asymmetrically, exacerbating upstream inaccuracy in particular, which we observe via the scaling parameter changes in Appendix D.3.5. For the upstream forecasting task, we use an ensemble of VAR and ARIMA models to predict supply-chain indicators. The model is trained on a rolling-window basis, with residual scores computed at each time step. A 40-month sliding window forms the held-out set from which we derive $S^{\text{conf}}$ and $S^{\text{cal}}$.

We present the resulting prediction intervals for our method, ACI, PID, and OCID in Figure 3. Our method responds effectively to the 2020 shock, maintaining coverage relative to other methods. OCID also captures the shock, but is inefficient during 2014-2019. Notably, our approach maintains a comparable average width (21.085 vs 15.696 (ACI), 18.469 (PID), and 31.058 (OCID)) and can quickly adjust to shocks by consistently considering conservative values of $\lambda$ in $\Lambda_{\text{val}}$. This suggests that our method is well-suited for real-world forecasting and performs best under asymmetric structural shocks. In contrast, for covariate shifts (Appendix D) that affect the entire pipeline, the benefits are less significant.

8 Conclusion

We proposed a conformal prediction framework for sequential models that decomposes prediction residuals into interpretable stage-specific components. Our method combines this decomposition with risk-controlled parameter selection to construct prediction intervals that provide both coverage guarantees and uncertainty attribution. The approach allows practitioners to identify which pipeline stage contributes to prediction uncertainty and provides diagnostic tools for targeted model improvement. Empirical evaluation on synthetic distribution shifts and real-world economic forecasting shows that our method maintains coverage under shifts that degrade standard conformal approaches. While this robustness comes at the cost of wider intervals and occasional abstention when shifts are severe, these trade-offs reflect the method’s conservative design that prioritizes reliable coverage over optimistic predictions. The stage-wise decomposition proves particularly valuable for asymmetric shifts affecting different pipeline components, enabling targeted adaptive responses that standard methods cannot provide.

References

Angelopoulos et al. [2023] Anastasios Angelopoulos, Emmanuel Candes, and Ryan J Tibshirani. Conformal pid control for time series prediction. Advances in neural information processing systems, 36:23047–23074, 2023.
Angelopoulos et al. [2021] Anastasios N Angelopoulos, Stephen Bates, Emmanuel J Candès, Michael I Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control. arXiv preprint arXiv:2110.01052, 2021.
Angelopoulos et al. [2022] Anastasios N Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. arXiv preprint arXiv:2208.02814, 2022.
Angelopoulos et al. [2024] Anastasios N Angelopoulos, Rina Foygel Barber, and Stephen Bates. Online conformal prediction with decaying step sizes. arXiv preprint arXiv:2402.01139, 2024.
Athreya and Pantula [1986] Krishna B Athreya and Sastry G Pantula. A note on strong mixing of arma processes. Statistics & probability letters, 4(4):187–190, 1986.
Barber [2024] Rina Foygel Barber. Hoeffding and bernstein inequalities for weighted sums of exchangeable random variables. arXiv preprint arXiv:2404.06457, 2024.
Barber et al. [2023] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.
Bates et al. [2021] Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael Jordan. Distribution-free, risk-controlling prediction sets. Journal of the ACM (JACM), 68(6):1–34, 2021.
Bauer [1991] Peter Bauer. Multiple testing in clinical trials. Statistics in medicine, 10(6):871–890, 1991.
Bonferroni [1937] Carlo Emilio Bonferroni. On the multiple hypotheses test, 1937. origin of the Bonferroni correction.
Crane and Crotty [1967] Dwight B Crane and James R Crotty. A two-stage forecasting model: Exponential smoothing and multiple regression. Management Science, 13(8):B–501, 1967.
Farinhas et al. [2023] António Farinhas, Chrysoula Zerva, Dennis Ulmer, and André FT Martins. Non-exchangeable conformal risk control. arXiv preprint arXiv:2310.01262, 2023.
FRED [2025] FRED. Automobile indicators, 2025. Data retrieved from FRED, https://fred.stlouisfed.org/series/CUSR0000SETA02.
Gibbs and Candes [2021] Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021.
Gibbs and Candès [2024] Isaac Gibbs and Emmanuel J Candès. Conformal inference for online prediction with arbitrary distribution shifts. Journal of Machine Learning Research, 25(162):1–36, 2024.
Kontorovich and Ramanan [2008] Leonid Kontorovich and Kavita Ramanan. Concentration inequalities for dependent random variables via the martingale method. Annals of Probability, 2008.
Manheim [2025] Manheim. Used vehicle value index, 2025. Data retrieved from Cox Automotive, https://site.manheim.com/en/services/consulting/used-vehicle-value-index.html.
Mohri and Rostamizadeh [2010] Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary $\varphi$ -mixing and $\beta$ -mixing processes. Journal of Machine Learning Research, 11(2), 2010.
Mullin [2022] John Mullin. Supply chain disruptions, inflation, and the fed. Econ Focus, 22(3Q):14–17, 2022.
Oliveira et al. [2024] Roberto I Oliveira, Paulo Orenstein, Thiago Ramos, and João Vitor Romano. Split conformal prediction and non-exchangeable data. Journal of Machine Learning Research, 25(225):1–38, 2024.
Quach et al. [2023] Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S Jaakkola, and Regina Barzilay. Conformal language modeling. arXiv preprint arXiv:2306.10193, 2023.
Romano et al. [2019] Yaniv Romano, Evan Patterson, and Emmanuel Candes. Conformalized quantile regression. Advances in neural information processing systems, 32, 2019.
Serfling [1974] Robert J Serfling. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, pages 39–48, 1974.
Shafer and Vovk [2008] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
Soybilgen and Yazgan [2021] Barış Soybilgen and Ege Yazgan. Nowcasting us gdp using tree-based ensemble models and dynamic factors. Computational Economics, 57(1):387–417, 2021.
Tibshirani [2023] Ryan J. Tibshirani. Conformal prediction, 2023. URL https://www.stat.berkeley.edu/~ryantibs/statlearn-s23/lectures/conformal.pdf. Lecture notes for Advanced Topics in Statistical Learning, Spring 2023, University of California, Berkeley.
Vovk [2012] Vladimir Vovk. Conditional validity of inductive conformal predictors. In Asian conference on machine learning, pages 475–490. PMLR, 2012.
Vovk et al. [2005] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world, volume 29. Springer, 2005.
Wu et al. [2025] Junxi Wu, Dongjian Hu, Yajie Bao, Shu-Tao Xia, and Changliang Zou. Error-quantified conformal inference for time series. arXiv preprint arXiv:2502.00818, 2025.
Zaffran et al. [2022] Margaux Zaffran, Olivier Féron, Yannig Goude, Julie Josse, and Aymeric Dieuleveut. Adaptive conformal predictions for time series. In International Conference on Machine Learning, pages 25834–25866. PMLR, 2022.
Zargarbashi et al. [2023] Soroush H Zargarbashi, Simone Antonelli, and Aleksandar Bojchevski. Conformal prediction sets for graph neural networks. In International Conference on Machine Learning, pages 12292–12318. PMLR, 2023.
Zhang et al. [2025] Botong Zhang, Shuo Li, and Osbert Bastani. Conformal structured prediction. In ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025. URL https://openreview.net/forum?id=mKfmLQXP6J.

Appendix A Appendix

A.1 Main Results/Proofs

Proof of Theorem 1

Proof.

We consider the event

\displaystyle y_{test}\notin\hat{C}_{c,d}(w_{test})

where $\hat{C}_{c,d}$ is the prediction interval defined by $\hat{\mu}_{2}(\hat{\mu}_{1}(w_{test}))\pm\left(Q_{1-c}(\{\Delta R_{1}\})+Q_{1-d}(\{R_{2}\})\right)$, with the quantiles taken over the residual components computed on the held-out conformal set $S^{\text{conf}}$. This is the event that the interval fails to cover. Observe that

\displaystyle y_{test}\notin\hat{C}_{c,d}\implies|y_{test}-\hat{\mu}_{2}(\hat{\mu}_{1}(w_{test}))|\geq Q_{1-c}(\{\Delta R_{1}\})+Q_{1-d}(\{R_{2}\})

But then, since $R\leq\Delta R_{1}+R_{2}$ pointwise by the definition of the components and the triangle inequality,

	$\displaystyle|y_{test}-\hat{\mu}_{2}(\hat{\mu}_{1}(w_{test}))|\geq Q_{1-c}(\{\Delta R_{1}\})+Q_{1-d}(\{R_{2}\})\implies$
	$\displaystyle\left||y_{test}-\hat{\mu}_{2}(x_{test})|-|y_{test}-\hat{\mu}_{2}(\hat{\mu}_{1}(w_{test}))|\right|+|y_{test}-\hat{\mu}_{2}(x_{test})|\geq Q_{1-c}(\{\Delta R_{1}\})+Q_{1-d}(\{R_{2}\})$

Thus, the sum of the residual components at the test point also exceeds the threshold. This implies that at least one of the following events occurs:

A_{1}:|y_{test}-\hat{\mu}_{2}(x_{test})|\geq Q_{1-d}(\{R_{2}\})

A_{2}:||y_{test}-\hat{\mu}_{2}(x_{test})|-|y_{test}-\hat{\mu}_{2}(\hat{\mu}_{1}(w_{test}))||\geq Q_{1-c}(\{\Delta R_{1}\})

Each of these events has bounded probability by exchangeability: the probability that the new residual component ranks above the $1-c$ quantile (resp. $1-d$) is at most $c$ (resp. $d$). For $R_{2}$, exchangeability is clear, as it is exactly the residual of the second-stage model on the data $(x,y)$. For $\Delta R_{1}$, exchangeability also holds: the functions $\hat{\mu}_{1},\hat{\mu}_{2}$ are symmetric and deterministic and are pretrained, and are thus fixed on the held-out set. Hence $\Delta R_{1}$ can be interpreted as the result of a deterministic function $\psi$ applied to each point $(w_{i},x_{i},y_{i})$, which implies that, for any permutation $\pi$,

	$\displaystyle P(\Delta R_{1}^{1}$	$\displaystyle=r_{1},\dots,\Delta R_{1}^{m}=r_{m})$
		$\displaystyle=P((w_{1},x_{1},y_{1})\in\psi^{-1}(r_{1}),\dots,(w_{m},x_{m},y_{m})\in\psi^{-1}(r_{m}))$
		$\displaystyle=P((w_{\pi(1)},x_{\pi(1)},y_{\pi(1)})\in\psi^{-1}(r_{1}),\dots,(w_{\pi(m)},x_{\pi(m)},y_{\pi(m)})\in\psi^{-1}(r_{m}))$
		$\displaystyle=P(\Delta R_{1}^{\pi(1)}=r_{1},\dots,\Delta R_{1}^{\pi(m)}=r_{m})$

where $\Delta R_{1}^{i}$ denotes the first residual component of the $i$-th point of $S^{\text{conf}}$. Hence $\Delta R_{1}$ is exchangeable, and the probabilities of the noncontainment events for the two residual components are bounded by $c$ and $d$, respectively.

Therefore

P(y_{test}\notin\hat{C}_{c,d})\leq c+d-P(A_{1}\cap A_{2})\leq c+d.

In fact, as can be seen in the above equation, the stated result is looser than necessary. ∎
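As a sanity check on the bound, the sketch below builds the interval on a simulated held-out set and measures empirical coverage on fresh points. Everything specific here is an assumption made for illustration: the data-generating model, the stand-in stage models `mu1` and `mu2`, the sample sizes, and the finite-sample quantile convention in `conformal_quantile` (the $\lceil (n+1)\cdot\text{level}\rceil$-th order statistic).

```python
import math
import random

def conformal_quantile(scores, level):
    """ceil((n + 1) * level)-th smallest score; infinite if that rank exceeds n."""
    n = len(scores)
    rank = math.ceil((n + 1) * level - 1e-12)
    return math.inf if rank > n else sorted(scores)[rank - 1]

def components(w, x, y, mu1, mu2):
    """Return (R, Delta R_1, R_2) for one point under fixed stage models."""
    r2 = abs(y - mu2(x))
    r = abs(y - mu2(mu1(w)))
    return r, abs(r2 - r), r2

def draw(n, rng):
    """I.i.d. points with x = 3w + noise and y = 4x + noise (assumed setup)."""
    out = []
    for _ in range(n):
        w = rng.gauss(0.0, 1.0)
        x = 3.0 * w + rng.gauss(0.0, 1.0)
        out.append((w, x, 4.0 * x + rng.gauss(0.0, 1.0)))
    return out

rng = random.Random(1)
mu1 = lambda w: 3.0 * w   # stand-ins for pretrained stage models
mu2 = lambda x: 4.0 * x
c, d = 0.1, 0.1

# Quantiles of each component on the held-out conformal set.
conf = [components(w, x, y, mu1, mu2) for w, x, y in draw(500, rng)]
half_width = (conformal_quantile([t[1] for t in conf], 1 - c)
              + conformal_quantile([t[2] for t in conf], 1 - d))

# y_test is covered exactly when the full residual R is within the half-width.
test = draw(5000, rng)
covered = sum(components(w, x, y, mu1, mu2)[0] <= half_width for w, x, y in test)
coverage = covered / len(test)
print(f"half-width {half_width:.2f}, empirical coverage {coverage:.3f}, bound {1 - c - d:.2f}")
```

In this toy setting the empirical coverage sits well above $1-c-d$, consistent with the remark that the stated bound is looser than necessary.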

Proof of Proposition 1

Proof.

Since the traditional residual $R$ is assumed to be continuous, the function that maps the scaling parameters $(a,b)$ to the marginal coverage probability of the interval $\hat{C}_{\alpha,a,b}$ is itself continuous. When $a=b=0$ , the interval degenerates to a single point and thus has zero coverage (assuming the problem is nontrivial). When $a=b=1$ , the interval is wide enough to ensure coverage of at least $1-2\alpha$ by Corollary 1. By the intermediate value theorem, there must exist a pair $(a^{*},b^{*})\in[0,1]^{2}$ such that the interval achieves exactly $1-\alpha$ coverage. As an aside, one can further view the problem as an optimization problem

	$\displaystyle\min_{a,b}aQ_{1-\alpha}(\{\Delta R_{1}\})+bQ_{1-\alpha}(\{R_{2}\})$
	$\displaystyle s.t.\quad P(R\geq aQ_{1-\alpha}(\{\Delta R_{1}\})+bQ_{1-\alpha}(\{R_{2}\}))\leq\alpha$

which by KKT conditions implies that any $a^{*},b^{*}$ must satisfy

	$\displaystyle\frac{1}{Q_{1-\alpha}(\{\Delta R_{1}\})}\frac{\partial P(R\geq aQ_{1-\alpha}(\{\Delta R_{1}\})+bQ_{1-\alpha}(\{R_{2}\}))}{\partial a}$
	$\displaystyle=\frac{1}{Q_{1-\alpha}(\{R_{2}\})}\frac{\partial P(R\geq aQ_{1-\alpha}(\{\Delta R_{1}\})+bQ_{1-\alpha}(\{R_{2}\}))}{\partial b}$

which provides that at any optimal solution $a^{*},b^{*}$ there must be the same balance between the magnitude of components and the impact (derivative) of the scaling parameter on true coverage probability for each parameter. Thus large magnitude implies large impact, confirming intuitive understanding. ∎

Proof of Theorem 2

Proof.

By Lemma A.2, the $p$-values are super-uniform under the null hypothesis. We may then apply Theorem 1 of [2] (Theorem 3) with a FWER-controlling algorithm $\mathcal{A}$ at level $\delta>0$ to obtain the result. The size of $S^{\text{cal}}$, denoted $l$ below, does not appear explicitly in the bound; however, it directly affects the $p$-values, as can be seen via Hoeffding’s inequality:

P(l\alpha-\text{Bin}(l,\alpha)\geq l\alpha-l\hat{\mathcal{R}}(\lambda))\leq\exp\left(\frac{-2(l\alpha-l\hat{\mathcal{R}}(\lambda))^{2}}{l}\right)=\exp\left(-2l(\alpha-\hat{\mathcal{R}}(\lambda))^{2}\right),

where $\hat{\mathcal{R}}$ is the empirical error rate on $S^{\text{cal}}$ and $\hat{\mathcal{R}}(\lambda)\leq\alpha$. ∎
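The dependence on the calibration size can be illustrated numerically. The sketch below compares the exact binomial $p$-value $P(\text{Bin}(l,\alpha)\leq l\hat{\mathcal{R}}(\lambda))$ with the Hoeffding upper bound $\exp(-2l(\alpha-\hat{\mathcal{R}}(\lambda))^{2})$ for a fixed empirical error rate. The values of $\alpha$, $\hat{\mathcal{R}}$, and $l$ are illustrative assumptions, and the function names are hypothetical.

```python
import math

def binomial_p_value(n_errors, l, alpha):
    """P(Bin(l, alpha) <= n_errors): p-value for the null 'risk exceeds alpha'."""
    return sum(math.comb(l, k) * alpha**k * (1 - alpha) ** (l - k)
               for k in range(n_errors + 1))

def hoeffding_bound(n_errors, l, alpha):
    """exp(-2 l (alpha - R_hat)^2), valid when R_hat = n_errors / l <= alpha."""
    r_hat = n_errors / l
    return math.exp(-2.0 * l * (alpha - r_hat) ** 2)

alpha, r_hat = 0.1, 0.05   # illustrative values, not taken from the paper
results = []
for l in (40, 200, 1000):
    errors = int(round(r_hat * l))
    results.append((l, binomial_p_value(errors, l, alpha), hoeffding_bound(errors, l, alpha)))
    print(results[-1])
```

For the same empirical error rate, both quantities shrink as $l$ grows, so a larger $S^{\text{cal}}$ makes it easier for a candidate $\lambda$ to be certified.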

A.1.1 Long-Run Coverage

Proposition 3 (Long-Run Coverage of Adaptive Method).

Let $\gamma>0$ be the step size, $\alpha\in(0,1)$ the target miscoverage level, and let $\text{cov}_{t}\in\{0,1\}$ denote the coverage indicator at time $t$. Then the adaptive algorithm satisfies:

\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\text{cov}_{t}=1-\alpha\quad\text{a.s.}

This result ensures that in the long run, the algorithm achieves the desired coverage level $1-\alpha$ , without requiring distributional assumptions on the data.

The additional parameters $(a_{t},b_{t},c_{t},d_{t})$ are updated as deterministic functions of the algorithm’s history, ensuring they do not interfere with the update steps that underlie the coverage guarantee.

A finite-sample convergence bound (Appendix Proposition 6) extends existing Markov chain analysis to include our additional parameters, with the key insight that the monotonicity properties required still hold under our parameter update rules.

Proof of Proposition 3

Please refer to Algorithm 1 for notation. We first show a preliminary lemma.

Lemma A.1 (Boundedness of $c_{t}$ and $d_{t}$ ).

Let $\eta>0$ be a constant. Then, for all $t\in\mathbb{N}$ , the sequences $c_{t}$ and $d_{t}$ are bounded within the interval $[-\eta,1+\eta]$ . That is,

\forall t\in\mathbb{N},\quad-\eta\leq c_{t}\leq 1+\eta\quad\text{and}\quad-\eta\leq d_{t}\leq 1+\eta.

Proof.

We establish the lower bound for $c_{t}$; the upper bound and the argument for $d_{t}$ follow analogously.

Suppose that $c_{t}<0$ for some time step $t$ . In this case, by the definition of the residual component $\Delta R_{1}$ , we have:

\Delta R_{1}=\infty\implies\text{cov}_{t}=1,

where $\text{cov}_{t}$ denotes the coverage of the model at time $t$ . Additionally, since $\text{cov}_{t}^{\Delta R_{1}}=0$ by the construction of the residual, this implies that in the output of the algorithm SelectLambda, the update step $\Delta c_{t}\geq 0$ , meaning that:

c_{t+1}\geq c_{t}.

Thus, whenever $c_{t}$ is negative, the next value does not decrease. Since $c_{t}$ changes by at most the step size $\eta$ per update, the sequence $c_{t}$ cannot fall below $-\eta$, and the minimum value attainable by $c_{t}$ is $-\eta$.

A similar argument holds for the upper-bound.

Hence, we conclude that:

-\eta\leq c_{t}\leq 1+\eta\quad\text{and}\quad-\eta\leq d_{t}\leq 1+\eta\quad\forall t\in\mathbb{N}.

This holds similarly for $d_{t}$ . ∎

Next, we show the asymptotic result of Section A.1.1 (Proposition 3) via the same arguments as Proposition 4.1 of [14].

Proof.

Although our algorithm includes additional parameters $(a_{t},b_{t},c_{t},d_{t})$ , the update rule for $\alpha_{t}$ remains unchanged from [14]:

\alpha_{t}=\alpha_{t-1}-\gamma((1-\mathrm{cov}_{t-1})-\alpha),

where $\mathrm{cov}_{t}$ is the coverage indicator at time $t$ and $\alpha$ is the target miscoverage level.

As in [14], we show that $\alpha_{t}\in[-\gamma,1+\gamma]$ with probability 1. For the lower bound (the upper-bound follows symmetrically), suppose $\alpha_{t}<0$ at some time $t$ . Then the candidate set $\Lambda_{\mathrm{val}}=\emptyset$ , so $\mathrm{cov}_{t}=1$ , and the update becomes:

\alpha_{t+1}=\alpha_{t}+\gamma\alpha>\alpha_{t}.

Hence, $\alpha_{t}$ is pushed upward, preventing it from decreasing without bound. Therefore, $\alpha_{t}\geq-\gamma$ for all $t$ .

Next, unfolding the recursion gives:

\alpha_{T+1}=\alpha_{1}-\sum_{t=1}^{T}\gamma((1-\mathrm{cov}_{t})-\alpha).

Rearranging terms and using the bound, we obtain:

\left|\frac{1}{T}\sum_{t=1}^{T}(1-\mathrm{cov}_{t})-\alpha\right|\leq\frac{\max(\alpha_{1},1-\alpha_{1})+\gamma}{\gamma T}.

Equivalently:

\left|\frac{1}{T}\sum_{t=1}^{T}\mathrm{cov}_{t}-(1-\alpha)\right|\leq\frac{\max(\alpha_{1},1-\alpha_{1})+\gamma}{\gamma T},

which converges to zero as $T\to\infty$ , yielding the desired result. ∎
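The deterministic nature of this bound can be checked with a short simulation. The sketch below iterates the update rule for $\alpha_{t}$ on a toy coverage process and compares the realized coverage gap with $(\max(\alpha_{1},1-\alpha_{1})+\gamma)/(\gamma T)$. The coverage model (including the mid-run shock), the treatment of $\alpha_{t}>1$ as a degenerate interval, and the function name `run_updates` are assumptions made for illustration; only the update rule and the convention $\text{cov}_{t}=1$ when $\alpha_{t}<0$ are taken from the proof.

```python
import random

def run_updates(T, alpha=0.1, gamma=0.01, alpha_1=0.1, seed=0):
    """Iterate alpha_{t+1} = alpha_t - gamma * ((1 - cov_t) - alpha) and record cov_t."""
    rng = random.Random(seed)
    alpha_t, covs, trajectory = alpha_1, [], []
    for t in range(T):
        trajectory.append(alpha_t)
        if alpha_t < 0:        # empty candidate set: counted as covered, as in the proof
            cov = 1
        elif alpha_t > 1:      # assumed degenerate interval: never covers
            cov = 0
        else:
            # Toy data model (assumption): miscoverage probability is alpha_t,
            # doubled during a shock in the middle third of the run.
            shock = 2.0 if T // 3 <= t < 2 * T // 3 else 1.0
            cov = 0 if rng.random() < min(1.0, shock * alpha_t) else 1
        covs.append(cov)
        alpha_t -= gamma * ((1 - cov) - alpha)
    return covs, trajectory

T, alpha, gamma, alpha_1 = 5000, 0.1, 0.01, 0.1
covs, trajectory = run_updates(T, alpha, gamma, alpha_1)
gap = abs(sum(covs) / T - (1 - alpha))
bound = (max(alpha_1, 1 - alpha_1) + gamma) / (gamma * T)
print(f"|coverage - (1 - alpha)| = {gap:.5f} <= {bound:.5f}")
```

Because the argument only uses the boundedness of $\alpha_{t}$, the inequality holds for any coverage sequence generated this way, not just on average over seeds.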

A.2 Auxiliary Results

Proposition 4.

For a given point $(w,x,y)$, if $R_{2}\leq\epsilon<R$ for some small $\epsilon>0$, i.e., $\hat{\mu}_{2}(x)$ is close to the true value $y$, then $|\Delta R_{1}-R|\leq\epsilon$.

Proof.

This follows immediately from the definition of $\Delta R_{1}$: since $R_{2}<R$, we have $|R_{2}-R|=R-R_{2}$, and therefore

	$\displaystyle|\Delta R_{1}-R|$	$\displaystyle=\left||R_{2}-R|-R\right|=R_{2}$
		$\displaystyle\leq\epsilon$

∎

Next, we show the analogous result for $R_{2}$ under additional assumptions.

Proposition 5.

For a given point $(w,x,y)$, suppose that $\hat{\mu}_{2}$ is Lipschitz continuous with constant $L$ and that the first stage has small prediction error, $|x-\hat{\mu}_{1}(w)|<\delta$. Then $|R_{2}-R|\leq L\delta$; in particular, $|R_{2}-R|<\epsilon$ whenever $L\delta<\epsilon$.

Proof.

By the reverse triangle inequality and the Lipschitz property,

	$\displaystyle|R_{2}-R|$	$\displaystyle\leq|\hat{\mu}_{2}(x)-\hat{\mu}_{2}(\hat{\mu}_{1}(w))|$
		$\displaystyle\leq L|x-\hat{\mu}_{1}(w)|$
		$\displaystyle\leq L\delta$

∎

The main takeaway of these two results is that they confirm the intuition for when $\Delta R_{1}$ and $R_{2}$ are close to $R$, namely when one of the two models is accurate (and, in the case of $R_{2}$, when the learned function is smooth). Since $R\leq\Delta R_{1}+R_{2}$, they also imply that $\Delta R_{1}$ and $R_{2}$ cannot both be small when $R$ is large, so one of them must account for the majority of the error. Note that $\Delta R_{1}$ and $R_{2}$ can also serve as rough lower bounds on $R$, though each can individually exceed $R$ under adversarial noise or model miscalibration.
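These pointwise relations are easy to verify numerically. The following sketch checks, on simulated points, the triangle inequality $R\leq\Delta R_{1}+R_{2}$, Proposition 4 with $\epsilon=R_{2}$, and Proposition 5 with $\delta=|x-\hat{\mu}_{1}(w)|$. The linear stage models, the data-generating process, and the tolerance are assumptions chosen for illustration.

```python
import random

L = 4.0                           # Lipschitz constant of the assumed second stage
mu1_hat = lambda w: 3.0 * w       # hypothetical fitted first stage
mu2_hat = lambda x: L * x         # hypothetical fitted second stage

rng = random.Random(0)
violations = 0
for _ in range(10_000):
    w = rng.gauss(0.0, 1.0)
    x = 3.0 * w + rng.gauss(0.0, 1.0)
    y = 4.0 * x + rng.gauss(0.0, 1.0)

    r = abs(y - mu2_hat(mu1_hat(w)))      # full residual R
    r2 = abs(y - mu2_hat(x))              # second-stage component R_2
    dr1 = abs(r2 - r)                     # first-stage component Delta R_1
    tol = 1e-9                            # slack for floating-point rounding

    ok_triangle = r <= dr1 + r2 + tol                         # R <= Delta R_1 + R_2
    ok_prop4 = r2 >= r or abs(dr1 - r) <= r2 + tol            # Prop. 4 with eps = R_2
    ok_prop5 = abs(r2 - r) <= L * abs(x - mu1_hat(w)) + tol   # Prop. 5 with delta = |x - x_hat|
    violations += not (ok_triangle and ok_prop4 and ok_prop5)

print("violations:", violations)
```

No violations are expected, since all three relations hold for every point and not only in distribution.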

Lemma A.2.

The $p_{\lambda}$ defined as $p_{\lambda}=\mathbb{P}\left(\text{Bin}(n,\alpha)\leq n\hat{\mathcal{R}}(\lambda)\right)$ are super-uniform.

Proof.

See Appendix D.1 of [21]. ∎

Theorem 3 (Theorem 1 of [2]).

Suppose $p_{\lambda}$ has a distribution stochastically dominating the uniform distribution. Let $\mathcal{A}$ be a FWER controlling algorithm at level $\delta>0$ . Let $\mathcal{R}(\lambda)$ be the expected risk for the choice of $\lambda$ and $\alpha>0$ be the maximal error rate. Then $\Lambda_{\text{val}}=\mathcal{A}(\{p_{\lambda}\}_{\lambda\in\Lambda})$ satisfies the following

\displaystyle\mathbbm{P}\left(\sup_{\lambda\in\Lambda_{\text{val}}}\{\mathcal{R}(\lambda)\}\geq\alpha\right)\leq\delta

A.3 Extensions of Theorem 1

We can also state Theorem 1 for the setting with auxiliary data (Section C.3).

Corollary 2.

For $c,d\in(0,1)$ and a new point $(w_{\text{test}},x_{\text{test}},x_{\text{test}}^{\prime},y_{\text{test}})$, with the interval $\hat{C}_{c,d}$ defined as in Section C.3 to include auxiliary data, we have

\displaystyle P(y_{\text{test}}\in\hat{C}_{c,d}(w_{\text{test}}))\geq 1-c-d.

which follows by the same proof as above; auxiliary data does not change the proof (assuming that the data remains exchangeable with the auxiliary features). Lastly, we note that the prior results can easily be adapted to nonexchangeable data settings using weighted quantiles and a weighting scheme as in [7]. Thus we have

Corollary 3.

Let $|S^{\text{conf}}|=m$ and $p_{1},\dots,p_{m}$ be weights for each point. For $c,d\in(0,1)$ and a new point $(w_{\text{test}},x_{\text{test}},y_{\text{test}})$, with the interval $\hat{C}_{c,d}$ defined as in Definition 3, we have

\displaystyle P(y_{\text{test}}\in\hat{C}_{c,d})\geq 1-c-d-\psi_{1}-\psi_{2},

where $\psi_{1}$ and $\psi_{2}$ are coverage penalties from nonexchangeability.

Proof.

The result follows analogously to the proof of Theorem 1 in Section A.1, with modifications to account for the use of weighted quantiles and associated penalty terms. Specifically, we let $\psi_{1}$ and $\psi_{2}$ denote penalties applied to the empirical weighted quantiles, as introduced in Theorem 2 of [7]. These penalties are of the form $\sum_{i=1}^{m}\tilde{p}_{i}d_{TV}(\vec{R},\vec{R}^{i})$, where $\tilde{p}_{i}=\frac{p_{i}}{1+\sum_{j=1}^{m}p_{j}}$, $\vec{R}$ represents the vector of full residuals, and $\vec{R}^{i}$ is the same vector with the $i$-th point swapped with the test point $(w_{\text{test}},x_{\text{test}},y_{\text{test}})$.

When applying weighted conformal prediction, the standard quantile threshold used in the unweighted case is replaced with a weighted quantile. By Theorem 2 of [7], the resulting prediction set retains valid marginal coverage up to the penalties $\psi_{1},\psi_{2}$.

Therefore, by following the same steps as in Section A.1, but replacing the empirical quantiles with weighted quantiles and applying Theorem 2 of [7], we obtain the stated result. ∎

We also introduce a result for coverage for fixed scaling values of $a,b$ under some assumptions.

Corollary 4.

Assume that the CDF of the residual for an out-of-sample point is Lipschitz continuous with constant $L$ . Furthermore, if the maximal value of $\{\Delta R_{1}\}$ (resp. $\{R_{2}\}$ ) is bounded by $\epsilon>0$ , then if we set $a=0,b=1$ (resp. $a=1,b=0$ ) with $c=d=\alpha$ , we have coverage with

P(y_{\text{test}}\notin\hat{C}_{0,1,\alpha}(w_{\text{test}}))\leq 2\alpha+L\epsilon.

Proof.

Consider the interval $\hat{C}_{1,1,\alpha}(w_{\text{test}})$ with half-width $Q_{1-\alpha}(\{\Delta R_{1}\})+Q_{1-\alpha}(\{R_{2}\})$, which has coverage guarantee

P(y_{\text{test}}\notin\hat{C}_{1,1,\alpha}(w_{\text{test}}))\leq 2\alpha.

By assumption, $Q_{1-\alpha}(\{\Delta R_{1}\})<\epsilon$, and thus using $\hat{C}_{0,1,\alpha}(w_{\text{test}})$ instead changes the half-width by at most $\epsilon$, which by Lipschitz continuity of the CDF changes the probability guarantee by at most $L\epsilon$. ∎

The Lipschitz condition is quite reasonable. For example, if $\mu_{2}$ is a linear function and the noise term is normally distributed, the residuals are half-normally distributed and thus satisfy the above requirement. We pay a small price in the theoretical guarantee when shrinking the interval, under the assumption that $L\epsilon$ is small, which is not necessarily true (and $L$ is generally unknown). However, this result gives the intuition that if one of the residual components contributes little to the overall error, we can remove it at little coverage cost; this serves as part of the motivation for the adaptive algorithm. We find that for certain settings with unbalanced residual components, even simple heuristics such as $\max(Q_{1-c}(\{\Delta R_{1}\}),Q_{1-d}(\{R_{2}\}))$ with $c=d=\alpha$ achieve coverage similar to that of the full residuals.

We can also obtain non-asymptotic bounds for the specific setting of a hidden Markov model, as in Theorem 4.1 of [14], with few changes due to the shared update step.

Proposition 6.

Suppose that the data is coming from a hidden Markov model, with underlying states $A_{t}\in\mathcal{A}$ such that the data $(w_{t},x_{t},y_{t})\sim P_{A_{t}}$ . Observe that $a_{t},b_{t},c_{t},d_{t},\alpha_{t},A_{t}$ form a Markov chain on $[0,1]^{2}\times[-\eta,1+\eta]^{2}\times[-\gamma,1+\gamma]\times\mathcal{A}$ with stationary distribution $\pi$ . Assume that the algorithm has passed its burn-in period and is now in a stationary setting. Suppose that the spectral gap of $\{A_{t}\}=1-\eta>0$ . Let $\mathrm{err}_{t}=1-\text{cov}_{t}$ . Then let $B=\sup_{a\in\mathcal{A}}|E[\mathrm{err}_{t}|A_{t}=a]-\alpha|$ and $\sigma_{B}^{2}=E[(E[\mathrm{err}_{t}|A_{t}]-\alpha)^{2}]$ . Then

P(|\frac{1}{T}\sum_{t=1}^{T}\mathrm{err}_{t}-\alpha|\geq\epsilon)\leq 2\exp(-T\epsilon^{2}/8)+2\exp\left(\frac{-T(1-\eta)\epsilon^{2}}{8(1+\eta)\sigma_{B}^{2}+20B\epsilon}\right)

showing a finite-sample deviation bound for the empirical coverage.

Proof.

The structure of this proof follows that of Theorem 4.1 in [14], which depends on their Lemma A.2 and Theorem A.1. We outline the necessary modifications to their lemma (written below) to hold in our setting.

Lemma A.3.

For any $\tau\in\mathbb{R}$ and $t\in\mathbb{N}$

\displaystyle\mathbbm{E}\left[\prod_{s=1}^{t}\exp(\tau(\mathrm{err}_{s}-\mathbbm{E}[\mathrm{err}_{s}|A_{s}]))\right]\leq\exp(\tau^{2}/2)\mathbbm{E}\left[\prod_{s=1}^{t-1}\exp(\tau(\mathrm{err}_{s}-\mathbbm{E}[\mathrm{err}_{s}|A_{s}]))\right]

To show this, we condition on $(a_{1},b_{1},c_{1},d_{1})$ alongside $(\alpha_{1},A_{1},\dots,A_{t})$ . As in [14], we require that

f(\sum_{s=1}^{t-1}\mathrm{err}_{s})=\prod_{s=1}^{t-1}\exp(\tau(\mathrm{err}_{s}-\mathbbm{E}[\mathrm{err}_{s}|A_{s}]))

is non-decreasing and

	$\displaystyle g(\sum_{s=1}^{t-1}\mathrm{err}_{s})=\mathbbm{E}\left[\exp(\tau(\mathrm{err}_{t}-\mathbbm{E}[\mathrm{err}_{t}|A_{t}]))\,\middle|\,\mathrm{err}_{1},\dots,\mathrm{err}_{t-1}\right]=\mathbb{P}_{A_{t}}(y_{t}\in\hat{C}_{a_{t},b_{t},c_{t},d_{t},\alpha_{t}}(w_{t}))\exp(-\tau\mathbbm{E}[\mathrm{err}_{t}|A_{t}])$
	$\displaystyle\quad+(1-\mathbb{P}_{A_{t}}(y_{t}\in\hat{C}_{a_{t},b_{t},c_{t},d_{t},\alpha_{t}}(w_{t})))\exp(\tau(1-\mathbbm{E}[\mathrm{err}_{t}|A_{t}]))$

is non-increasing in $\sum_{s=1}^{t-1}\mathrm{err}_{s}$ for $\tau\geq 0$ . This still holds because:

• $\alpha_{t}$ remains a non-increasing function of $\sum_{s=1}^{t-1}\mathrm{err}_{s}$ under the same update rule.

• $a_{t},b_{t},c_{t},d_{t}$ are deterministic functions of the conditioning variables (including $\mathrm{err}_{1},\dots,\mathrm{err}_{t-1}$), as the candidate set selection and parameter updates are fixed given the history.

• As $\alpha_{t}$ decreases, the validation set $\Lambda_{\mathrm{val}}$ shrinks (selecting only higher-width pairs), so the interval width is non-decreasing, thus $\mathbb{P}_{A_{t}}(y_{t}\in\hat{C}_{a_{t},b_{t},c_{t},d_{t},\alpha_{t}}(w_{t}))$ is non-decreasing, meaning that $g$ is overall non-increasing.

These observations ensure that the monotonicity conditions in Lemma A.2 of [14] still hold under our conditioning. The remainder of the proof, including the application of the Markov chain concentration inequality (Theorem A.1 of [14]), proceeds unchanged. ∎

Appendix B Conceptual information

B.1 Discussion on quantiles of residual components.

We observe that when quantiles of the residual components are taken and then summed, as in our implementation, the result is not necessarily an upper bound on the corresponding quantile of the full residual. Specifically, it is possible that

Q_{1-\alpha}(\{\Delta R_{1}\})+Q_{1-\alpha}(\{R_{2}\})<Q_{1-\alpha}(\{R\}).

This observation is significant because while the triangle inequality holds for each individual point, i.e., $R\leq\Delta R_{1}+R_{2}$, such a relationship does not extend to the quantiles. This discrepancy is intentional, as it allows for a tighter combination of residuals compared to the traditional residual. Moreover, in practice, this situation occurs rarely.
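A small constructed example shows how the sum of component quantiles can fall below the quantile of the full residual even though the pointwise triangle inequality holds. The residual values below are hypothetical, chosen so that the two stages fail on disjoint subsets of points, and `empirical_quantile` uses one common convention (the $\lceil\text{level}\cdot n\rceil$-th order statistic), which is an assumption rather than the paper's definition.

```python
import math

def empirical_quantile(values, level):
    """ceil(level * n)-th smallest value (one common empirical-quantile convention)."""
    rank = max(1, math.ceil(level * len(values) - 1e-12))
    return sorted(values)[rank - 1]

# Hypothetical residual pairs (R, R_2): the two stages fail on disjoint points.
points = (
    [(1.0, 0.0)] * 3      # stage-one failures: R_2 = 0, so Delta R_1 = R = 1
    + [(1.0, 1.0)] * 3    # stage-two failures: R_2 = R = 1, so Delta R_1 = 0
    + [(0.0, 0.0)] * 14   # accurate points
)
R = [r for r, _ in points]
R2 = [r2 for _, r2 in points]
dR1 = [abs(r2 - r) for r, r2 in points]

alpha = 0.2
q_sum = empirical_quantile(dR1, 1 - alpha) + empirical_quantile(R2, 1 - alpha)
q_full = empirical_quantile(R, 1 - alpha)
print(q_sum, q_full)
```

Each component is large on only 15% of the points, so its 80% quantile is small, while the full residual is large on 30% of the points. This anti-aligned failure pattern is what drives the gap.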

B.2 Notation Table

We provide a table of notation in Table 2.

Table 2: Summary of Notation

Symbol	Meaning
$\hat{\mu}_{1}$	First stage hypothesis
$\hat{\mu}_{2}$	Second stage hypothesis
$R$	Residual $|y-\hat{\mu}_{2}(\hat{\mu}_{1}(w))|$ for two-stage model
$\Delta R_{1}$	First stage residual component $\left|\,|y-\hat{\mu}_{2}(x)|-|y-\hat{\mu}_{2}(\hat{x})|\,\right|$, where $\hat{x}=\hat{\mu}_{1}(w)$
$R_{2}$	Second stage residual component $|y-\hat{\mu}_{2}(x)|$
$\alpha$	Desired Miscoverage Rate
$a$	Scaled weight for $\Delta R_{1}$
$b$	Scaled weight for $R_{2}$
$c$	Quantile level to take of $\Delta R_{1}$
$d$	Quantile level to take of $R_{2}$
$\delta$	Rejection stringency to control FWER
$\gamma$	Step-size for updating $\alpha_{t}$ with $\gamma=0.01$ empirically performing well
$\eta$	Step-size for updating $c_{t},d_{t}$ with $\eta=0.01$ performing well
$\tau$	Tolerance for calculating $p$ -values for hypothesis testing
$k$	Window length for adaptive method, with stabilized performance above 100 points

B.3 Choice of Components

One could instead take quantiles or rescalings of the total sum $\Delta R_{1}+R_{2}$ rather than separating it into components; however, we choose to separate it for multiple reasons. First, splitting the quantiles/weights makes clear in the decomposition which stage contributes more to the error. Second, under co-monotonicity of $\Delta R_{1},R_{2}$, the quantile of the sum equals the sum of the quantiles, and thus both approaches result in the same interval $Q_{1-\alpha}(\{\Delta R_{1}+R_{2}\})$, which satisfies coverage guarantees as it is an upper bound on the black-box width. Furthermore, for sufficiently small miscoverage levels $\alpha_{1},\alpha_{2}$, $Q_{1-\alpha_{1}}(\{\Delta R_{1}\})+Q_{1-\alpha_{2}}(\{R_{2}\})$ is an upper bound on $Q_{1-\alpha}(R)$, so our use of $Q_{1-\alpha}(\{\Delta R_{1}\})+Q_{1-\alpha}(\{R_{2}\})$ can be an upper bound. Lastly, we also desire the ability to be tighter than $Q_{1-\alpha}(R)$ at times, unlike $Q_{1-\alpha}(\{\Delta R_{1}+R_{2}\})$, so as to adapt quickly to easy settings.

B.4 FWER-controlling algorithms for Scaling Selection

To construct the set of valid scaling parameter pairs $\Lambda_{\text{val}}$ , we apply multiple hypothesis testing procedures that control the family-wise error rate (FWER). This ensures that, with high probability, none of the accepted scaling pairs exhibit a true miscoverage rate above the nominal level $\alpha$ .

Definition 5 (FWER-Controlling Algorithm).

Let $\Lambda=\{\lambda_{1},\dots,\lambda_{u}\}$ be a candidate set of scaling pairs, and let $p_{1},\dots,p_{u}\in[0,1]$ denote the associated $p$-values, where $p_{i}$ tests the null hypothesis that the miscoverage of $\lambda_{i}$ exceeds $\alpha$. A selection procedure $\mathcal{A}:[0,1]^{u}\to 2^{\{1,\dots,u\}}$, which returns the indices whose null hypotheses are rejected, is said to control the family-wise error rate (FWER) at level $\delta$ if:

\mathbb{P}\left(\mathcal{A}(p_{1},\dots,p_{u})\subseteq J^{c}\right)\geq 1-\delta,

where $J\subseteq\{1,\dots,u\}$ is the set of true null hypotheses and $J^{c}$ is its complement.

Bonferroni Correction.

A classical method for controlling FWER is the Bonferroni correction [10]. It accepts any $\lambda_{i}$ for which $p_{i}\leq\delta/u$ , where $u=|\Lambda|$ . By the union bound, this ensures that the probability of falsely including any $\lambda$ with miscoverage above $\alpha$ is at most $\delta$ . However, Bonferroni can be overly conservative when the number of tests $u$ is large.

Fixed Sequence Testing.

To improve power in large-scale settings, we adopt fixed sequence testing [9]. This procedure assumes a pre-specified ordering of the hypotheses, with $p$-values $p_{(1)},p_{(2)},\dots,p_{(u)}$, based on prior knowledge or heuristics (e.g., larger values of $a+b$ are expected to have lower risk). Hypotheses are tested sequentially: each $p_{(i)}$ is compared to $\delta$, and the first failure (i.e., $p_{(i)}>\delta$) terminates the process. All scaling pairs before the first failure are accepted into $\Lambda_{\text{val}}$, and the remaining pairs are discarded. Because each test is run at the full level $\delta$, this method is more powerful than Bonferroni in the presence of a meaningful ordering.

Practical Note.

In our algorithm, we use fixed sequence testing to select valid scaling parameters. If no candidate passes the test, i.e., $\Lambda_{\text{val}}=\emptyset$ , we abstain from prediction. This outcome indicates a lack of statistical evidence and may signal that additional calibration data or model refinement is necessary.
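The two selection rules described in this subsection can be sketched as follows. This is a minimal illustration, assuming the $p$-values are supplied as a plain list (already in the pre-specified order for fixed sequence testing); the function names are hypothetical.

```python
def bonferroni(p_values, delta):
    """Accept index i whenever p_i <= delta / u, with u the number of tests."""
    u = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= delta / u]


def fixed_sequence(ordered_p_values, delta):
    """Accept indices in order until the first p-value exceeds delta."""
    accepted = []
    for i, p in enumerate(ordered_p_values):
        if p > delta:
            break  # first failure terminates the procedure
        accepted.append(i)
    return accepted


p_vals = [0.01, 0.04, 0.03, 0.2, 0.001]
bonf_set = bonferroni(p_vals, delta=0.1)      # threshold is 0.1 / 5 = 0.02
fst_set = fixed_sequence(p_vals, delta=0.1)   # stops at the fourth p-value
```

In this toy input, fixed sequence testing accepts the second and third candidates that Bonferroni discards, but it never reaches the fifth candidate because the fourth one fails; an empty returned list corresponds to abstention.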

B.5 Adaptive method pseudocode

Algorithm 1 Adaptive Risk Controlling with $\Delta R_{1}$, $R_{2}$

1: Require: window size $k$; step sizes $\gamma$, $\eta$; target coverage level $\alpha$
2: for time step $t$ do
3:   Obtain window of $k$ recent observations
4:   Obtain parameters $\alpha_{t}$, $c_{t}$, $d_{t}$, previous coverage $\text{cov}_{t-1},\text{cov}^{\Delta R_{1}}_{t-1},\text{cov}^{R_{2}}_{t-1},a_{t-1},b_{t-1}$
5:   Split window into $S^{\text{conf}}$ and $S^{\text{cal}}$ and compute $\{\Delta R_{1}\},\{R_{2}\}$
6:   Compute quantiles: $\Delta R_{1}\leftarrow Q_{c_{t}}(\{\Delta R_{1}\})$, $R_{2}\leftarrow Q_{d_{t}}(\{R_{2}\})$
8:   Determine valid $\lambda$ set: $\Lambda^{t}_{\text{val}}\leftarrow\textsc{FixedSequenceTesting}(S^{\text{cal}},\alpha_{t},\Delta R_{1},R_{2})$
10:  Obtain $a_{t},b_{t},\Delta c_{t},\Delta d_{t}\leftarrow\textsc{SelectLambda}(\Lambda^{t}_{\text{val}},\text{cov}_{t-1},\text{cov}^{\Delta R_{1}}_{t-1},\text{cov}^{R_{2}}_{t-1},a_{t-1},b_{t-1})$
11:  Calculate interval $\hat{C}_{\alpha,a_{t},b_{t},c_{t},d_{t}}(w_{t})$
12:  Check coverage at $t$: $\text{cov}_{t},\text{cov}^{\Delta R_{1}}_{t},\text{cov}^{R_{2}}_{t}$
13:  Update: $\alpha_{t+1}\leftarrow\alpha_{t}+\gamma(\alpha-\text{cov}_{t})$
14:  $c_{t+1}\leftarrow c_{t}+\eta\Delta c_{t}$, $d_{t+1}\leftarrow d_{t}+\eta\Delta d_{t}$
15: end for

We define the coverage indicator at time $t$ as $\text{cov}_{t}=\mathbbm{1}_{y_{t}\in\hat{C}_{\alpha,a,b,c,d}(w_{t})}.$ To capture deviations in the score components, we define $\text{cov}^{\Delta R_{1}}_{t}$ to be 0 if the true value $(\Delta R_{1})_{t}$ lies within the estimated quantile interval for $\Delta R_{1}$ , and equal to the excess $(\Delta R_{1})_{t}-\Delta R_{1}$ otherwise. This can be succinctly written as $\text{cov}^{\Delta R_{1}}_{t}=\text{ReLU}((\Delta R_{1})_{t}-\Delta R_{1}),$ and similarly for $\text{cov}^{R_{2}}_{t}$ .
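The three feedback signals defined above are simple to compute. The following is a small sketch with hypothetical function names; the interval endpoints and the estimated quantile are assumed to be given.

```python
def coverage_indicator(y, lower, upper):
    """Return 1 if y falls inside the interval [lower, upper], else 0."""
    return 1 if lower <= y <= upper else 0


def component_excess(realized_score, estimated_quantile):
    """ReLU of the gap between a realized component and its estimated quantile."""
    return max(0.0, realized_score - estimated_quantile)


cov_t = coverage_indicator(y=3.2, lower=2.0, upper=4.0)
excess_inside = component_excess(realized_score=0.75, estimated_quantile=1.0)
excess_outside = component_excess(realized_score=1.5, estimated_quantile=1.0)
```

The excess is zero whenever the realized component stays within its estimated quantile, so only violations contribute a signal, and larger violations contribute proportionally more.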

Appendix C Extensions

C.1 Additional Heuristic

One could alternatively define the residual components and combination as signed residual components to obtain tighter, asymmetric intervals, at the cost of coverage. We define new residual components

	$\displaystyle\tilde{R}_{2}$	$\displaystyle=y-\hat{\mu}_{2}(x)$
	$\displaystyle\Delta\tilde{R}_{1}$	$\displaystyle=\hat{\mu}_{2}(\hat{\mu}_{1}(w))-\hat{\mu}_{2}(x)$

such that the sum of the components is exactly the full error rather than an absolute value upper-bound. Because the components are now signed, one should construct intervals differently, which also results in asymmetric intervals. Similarly to before, we can scale each component to produce intervals of the form

\displaystyle\left[\hat{\mu}_{2}(\hat{\mu}_{1}(w_{\text{test}}))+aQ_{c/2}(\{\Delta\tilde{R}_{1}\})+bQ_{d/2}(\{\tilde{R}_{2}\}),\;\hat{\mu}_{2}(\hat{\mu}_{1}(w_{\text{test}}))+aQ_{1-c/2}(\{\Delta\tilde{R}_{1}\})+bQ_{1-d/2}(\{\tilde{R}_{2}\})\right]
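As an illustration of the signed-residual interval, the sketch below builds the two endpoints from lists of signed calibration residuals. The empirical quantile convention (the $\lceil \text{level}\cdot n\rceil$-th smallest value) and all names are assumptions made for this example.

```python
import math


def empirical_quantile(values, level):
    """Return the ceil(level * n)-th smallest value."""
    ordered = sorted(values)
    rank = min(len(ordered), max(1, math.ceil(level * len(ordered))))
    return ordered[rank - 1]


def signed_interval(center, delta_r1_signed, r2_signed, a, b, c, d):
    """Asymmetric interval around `center`, the two-stage point prediction."""
    lower = (center
             + a * empirical_quantile(delta_r1_signed, c / 2)
             + b * empirical_quantile(r2_signed, d / 2))
    upper = (center
             + a * empirical_quantile(delta_r1_signed, 1 - c / 2)
             + b * empirical_quantile(r2_signed, 1 - d / 2))
    return lower, upper


# Toy signed residuals; negative values mean the pipeline over-predicted.
lo, hi = signed_interval(center=10.0,
                         delta_r1_signed=[-2.0, -1.0, 1.0, 3.0],
                         r2_signed=[-1.0, 0.0, 2.0, 4.0],
                         a=1.0, b=2.0, c=0.5, d=0.5)
```

With these toy inputs the interval extends 4 units below the point prediction and 5 units above it, showing how signed residuals produce asymmetric intervals.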

Experimentally, in IID settings it produces slightly tighter intervals without additional tolerance; however, this heuristic is more prone to producing empty $\Lambda_{\text{val}}$ sets. Table 3 illustrates this behaviour by reporting the performance of the method under IID settings for varying levels of the nominal error rate $\alpha$.

Table 3: Signed residual heuristic performance over varying $\alpha$

Metric	$\alpha=0.13$	$\alpha=0.01$	$\alpha=0.07$	$\alpha=0.05$
Coverage	0.9013 $\pm$ 0.02	0.9196 $\pm$ 0.01	NA	NA
Width	2.0265 $\pm$ 0.13	2.1523 $\pm$ 0.12	NA	NA

Thus we have chosen to use the heuristic given in the main body of the text, and we provide this alternative as one of many possible ways of constructing intervals from the residual decomposition we introduce. Furthermore, because the decomposition itself is flexible, one could incorporate recent conformal prediction advances into how the quantiles are taken (e.g., weighting the data) or, in the adaptive case, into how the step size and $\alpha_{t}$ are updated.

C.2 Stationary $\phi$-mixing

We describe how the FWER control method outlined in Section 5 can be extended beyond the IID setting to accommodate stationary $\phi$-mixing processes. For dependent data, the concentration properties of empirical quantities degrade, resulting in looser tail bounds. This, in turn, inflates the $p$-values $p_{\lambda}$, thereby requiring larger thresholds $\delta$ to ensure that the selected set $\Lambda_{\text{val}}$ remains nonempty.

Our goal is to identify values of $\lambda$ for which the expected coverage error is below a target level $\alpha$ . For each $\lambda$ , we consider the null hypothesis

H_{0}^{\lambda}:\quad\mathbb{P}\left(y_{\text{test}}\notin\hat{C}_{\lambda}(w_{\text{test}})\right)>\alpha,

and compute an empirical estimate of the coverage error using a held-out calibration dataset. To test this hypothesis, we apply a concentration inequality that bounds the deviation of the empirical error from its expectation under dependence. Specifically, we employ a result for bounded functions of $\phi$ -mixing sequences from [16], with refinements from [18], which provides suitable control over the error rates in the non-IID case.

Theorem 4 (Mixing Concentration Inequality).

Let $Z_{1},\dots,Z_{n}$ be random variables distributed according to a $\phi$-mixing distribution. Let $f:\mathcal{Z}^{n}\to\mathbb{R}$ be a measurable function that is $c$-Lipschitz with respect to the Hamming metric for some $c>0$. Then, for any $\epsilon>0$, the following inequality holds:

\displaystyle P(|f(Z_{1},\dots,Z_{n})-\mathbbm{E}[f(Z_{1},\dots,Z_{n})]|\geq\epsilon)\leq 2\exp{\left(\frac{-2\epsilon^{2}}{nc^{2}\left\lVert\Delta_{n}\right\rVert_{\infty}^{2}}\right)}

where $\left\lVert\Delta_{n}\right\rVert_{\infty}=1+\sum_{i=1}^{n}\phi(i)$.

We apply this theorem with $\hat{\mathcal{R}}(\lambda)=f((w_{1},x_{1},y_{1}),\dots,(w_{l},x_{l},y_{l}))=\frac{1}{l}\sum_{i=1}^{l}1_{\{y_{i}\notin\hat{C}_{\lambda}(w_{i})\}}$ . Clearly this is $\frac{1}{l}$ -Lipschitz with respect to Hamming metric. Furthermore,

\mathbbm{E}[f(S^{\text{cal}})]=\frac{1}{l}\mathbbm{E}[\sum_{i=1}^{l}1_{\{y_{i}\notin\hat{C}_{\lambda}(w_{i})\}}]=\frac{1}{l}\sum_{i=1}^{l}\mathbbm{E}[1_{\{y_{i}\notin\hat{C}_{\lambda}(w_{i})\}}]=\alpha

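by stationarity, where the last equality is evaluated at the boundary of the null hypothesis (miscoverage exactly $\alpha$). Substituting $n=l$ and $c=\frac{1}{l}$ into Theorem 4 gives $nc^{2}=\frac{1}{l}$, so

\displaystyle P(\mathbbm{E}[\hat{\mathcal{R}}(\lambda)]-\hat{\mathcal{R}}(\lambda)\geq\epsilon)\leq 2\exp{\left(\frac{-2l\epsilon^{2}}{\left\lVert\Delta_{l}\right\rVert_{\infty}^{2}}\right)}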

where we substitute our realized $\alpha-\hat{\mathcal{R}}(\lambda)$ for $\epsilon$ and plug it into the RHS to obtain a $p$ -value for $\lambda$ .
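A direct way to compute such a $p$-value is to evaluate the right-hand side of Theorem 4 with $n=l$ and $c=1/l$, the Lipschitz constant of the empirical risk. The sketch below does this under stated assumptions: the mixing coefficients are passed in as a function `phi` (the geometric decay used here is purely hypothetical), the bound is capped at 1, and a $p$-value of 1 is returned when the empirical risk is not below $\alpha$.

```python
import math


def mixing_tail_bound(eps, n, c, phi):
    """Right-hand side of the Theorem 4 bound, capped at 1."""
    delta_norm = 1.0 + sum(phi(i) for i in range(1, n + 1))
    exponent = -2.0 * eps ** 2 / (n * c ** 2 * delta_norm ** 2)
    return min(1.0, 2.0 * math.exp(exponent))


def mixing_p_value(emp_risk, alpha, l, phi):
    """p-value for H0: miscoverage > alpha, from l calibration points."""
    eps = alpha - emp_risk
    if eps <= 0.0:
        return 1.0  # assumption: no evidence against the null hypothesis
    return mixing_tail_bound(eps, n=l, c=1.0 / l, phi=phi)


p_iid = mixing_p_value(0.05, 0.1, 1000, phi=lambda i: 0.0)       # no dependence
p_mix = mixing_p_value(0.05, 0.1, 1000, phi=lambda i: 0.5 ** i)  # hypothetical decay
```

With `phi` identically zero the bound reduces to a Hoeffding-type bound, and any positive mixing coefficients can only increase the $p$-value, which matches the observation that dependence makes acceptance harder.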

Then we have the following:

Corollary 5.

Let $S^{\text{cal}}$ be stationary and $\phi$-mixing, and let $\Lambda_{\text{val}}\subseteq\Lambda$ be the set selected by a FWER-controlling algorithm at level $\delta$, based on $p$-values computed over $S^{\text{cal}}$ using the above method. Then, for any $\hat{\lambda}\in\Lambda_{\text{val}}$, we have

\mathbb{P}\left(\mathbb{P}\left(y_{\text{test}}\notin\hat{C}_{\hat{\lambda}}(w_{\text{test}})\,\big|\,S^{\text{cal}}\right)\leq\alpha\right)\geq 1-\delta,

where the outer probability is over the randomness of $S^{\text{cal}}$ , and the inner probability is over the test point $(w_{\text{test}},x_{\text{test}},y_{\text{test}})$ which is assumed to be drawn from the same stationary distribution.

While the statement of the result is similar to before, note that $p_{\lambda}$ is a conservative upper bound on the true $p$-value, so the threshold to accept a $\lambda$ is more stringent. We show that the $p_{\lambda}$ are still super-uniform. Let $P$ be the random variable representing the empirical risk when the error rate is $\alpha$, and let $Q$ be the empirical risk when the error rate is $\alpha^{\prime}$. Under the null hypothesis, $\alpha^{\prime}>\alpha$, which implies that $Q$ stochastically dominates $P$, i.e., $P(Q\leq t)\leq P(P\leq t)$ for all $t$. Let $f(q)=\exp{\left(\frac{-2l(\alpha-q)^{2}}{\left\lVert\Delta_{l}\right\rVert_{\infty}^{2}}\right)}$, and let $F_{P}(t)$ and $F_{Q}(t)$ denote the CDFs of $P$ and $Q$ respectively. Then

	$\displaystyle p_{\lambda}$	$\displaystyle=f(Q)$
		$\displaystyle\geq P(\alpha-P\geq\alpha-Q)$
		$\displaystyle=F_{P}(Q)$

Furthermore, letting $O=p_{\lambda}$ , by the stochastic dominance assumption,

	$\displaystyle P(O\leq o)$	$\displaystyle=P(p_{\lambda}\leq o)$
		$\displaystyle\leq P(F_{P}(Q)\leq o)$
		$\displaystyle\leq P(F_{Q}(Q)\leq o)$
		$\displaystyle=P(Q\leq F_{Q}^{-1}(o))$
		$\displaystyle=F_{Q}(F_{Q}^{-1}(o))$
		$\displaystyle=o$

We then apply a FWER-controlling algorithm, such as one from Section B.4, to set the acceptance threshold for rejecting the null hypotheses so that the total type-I error rate is at most $\delta$, and conclude using Theorem 1 of [2] (Theorem 3).

We note that it is possible to adapt this result to exchangeable sequences using specific concentration inequalities such as ones found in [23] or more recently [6].

C.3 Decomposition for auxiliary data

We can also define $\Delta R_{1}$ and $R_{2}$ when considering auxiliary data.

Definition 6.

Suppose we have a sample $S$ whose data is of the form $(w,x,x^{\prime},y)$, where $x^{\prime}$ represents auxiliary data for the second stage, and that we have learned upstream and downstream models $\hat{\mu}_{1}:\mathcal{W}\to\mathcal{X}$ and $\hat{\mu}_{2}:\mathcal{X}\times\mathcal{X}^{\prime}\to\mathcal{Y}$. Then, for a new point $(w,x,x^{\prime},y)$, we can decompose the residual into the following components:

	$\displaystyle\Delta R_{1}(w,x,x^{\prime},y)$	$\displaystyle=\|\|y-\hat{\mu}_{2}(x,x^{\prime})\|-\|y-\hat{\mu}_{2}(\hat{\mu}_{1}(w),x^{\prime})\|\|$
	$\displaystyle R_{2}(w,x,x^{\prime},y)$	$\displaystyle=\|y-\hat{\mu}_{2}(x,x^{\prime})\|.$

We can use these residual components to create prediction intervals analogously to the original setting without auxiliary data.

C.4 Decomposition for multi-stage model

Now we consider extending the definition of $\Delta R_{1},R_{2}$ to more than two stages. We provide a straightforward extension, but a cleverer one may exist.

Definition 7.

Suppose we have sample $S$ where data is of the form $w_{1},w_{2},\dots,w_{N+1}$ with $N$ stages and corresponding learned hypotheses $\hat{\mu}_{i}:\mathcal{W}_{i}\to\mathcal{W}_{i+1}$ . Then for a point $(w_{1},\dots,w_{N+1})$ , define

	$\displaystyle\Delta R_{1}(w_{1},\dots,w_{N+1})$	$\displaystyle=\big|\,\|w_{N+1}-\hat{\mu}_{N}(\hat{\mu}_{N-1}(\dots\hat{\mu}_{2}(w_{2})))\|-\|w_{N+1}-\hat{\mu}_{N}(\hat{\mu}_{N-1}(\dots\hat{\mu}_{1}(w_{1})))\|\,\big|$
	$\displaystyle\dots$
	$\displaystyle\Delta R_{i}(w_{1},\dots,w_{N+1})$	$\displaystyle=\big|\,\|w_{N+1}-\hat{\mu}_{N}(\hat{\mu}_{N-1}(\dots\hat{\mu}_{i+1}(w_{i+1})))\|-\|w_{N+1}-\hat{\mu}_{N}(\hat{\mu}_{N-1}(\dots\hat{\mu}_{i}(w_{i})))\|\,\big|$
	$\displaystyle\dots$
	$\displaystyle R_{N}(w_{1},\dots,w_{N+1})$	$\displaystyle=\|w_{N+1}-\hat{\mu}_{N}(w_{N})\|$

Observe that the sum of these terms is still an upper bound on the "full" residual $R=|w_{N+1}-\hat{\mu}_{N}(\hat{\mu}_{N-1}(\dots\hat{\mu}_{1}(w_{1})))|$ by the triangle inequality (the successive differences telescope), and thus we have a residual decomposition.
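The multi-stage components can be computed by evaluating the pipeline from each stage onward, starting from the true intermediate value at that stage. The sketch below assumes scalar-valued stages and uses toy models; the function names are hypothetical.

```python
def compose_from(models, start, value):
    """Apply models[start], models[start + 1], ... in sequence to `value`."""
    for model in models[start:]:
        value = model(value)
    return value


def multistage_components(models, point):
    """Return ([Delta R_1, ..., Delta R_{N-1}, R_N], full residual R).

    `point` is (w_1, ..., w_{N+1}); models[i] maps w_{i+1} to w_{i+2}.
    """
    target = point[-1]
    # errors[i]: error when the pipeline starts from the true value of stage i+1
    errors = [abs(target - compose_from(models, i, point[i]))
              for i in range(len(models))]
    deltas = [abs(errors[i + 1] - errors[i]) for i in range(len(models) - 1)]
    return deltas + [errors[-1]], errors[0]


toy_models = [lambda w: 2.0 * w, lambda x: x + 1.0, lambda z: 0.5 * z]
components, full_residual = multistage_components(toy_models, (1.0, 2.5, 3.0, 2.0))
```

For this toy three-stage pipeline the components sum to 1.0 while the full residual is 0.5, consistent with the upper-bound property.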

An issue with this definition is that if we use a hypothesis testing method and FWER algorithm to weight each of these components, the set $\Lambda$ becomes prohibitively large, as it grows exponentially with the number of stages. A FWER algorithm that can search the candidates intelligently becomes necessary, whereas for two stages it was feasible to search the entire grid. One way to mitigate this is to restrict the proposed set $\Lambda$ at each step to the most important residual components while keeping the weights of the remaining ones fixed; i.e., select the top $k$ residual components with the highest average width in the window of recent observations and define $\Lambda$ only for those components, freezing the other component weights. In fact, one can sort the residual components by magnitude and then search through $\Lambda$ sequentially, freezing the components outside the top $k$, which results in only $kd$ tests if $d$ is the number of subdivisions for each component.
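One possible reading of this restricted search is a coordinate-wise scheme: for each of the top $k$ components, vary its weight over a grid of $d$ values while every other weight stays at its current value. The sketch below follows that reading; it is an assumption about how the candidate set could be built, and the names are hypothetical.

```python
def restricted_candidates(avg_widths, current_weights, grid, k):
    """Build k * d candidate weight vectors, varying one top-k component at a time."""
    # Indices of the k components with the largest average width.
    top = sorted(range(len(avg_widths)),
                 key=lambda i: avg_widths[i], reverse=True)[:k]
    candidates = []
    for i in top:
        for value in grid:
            candidate = list(current_weights)
            candidate[i] = value  # all other weights stay frozen
            candidates.append(tuple(candidate))
    return candidates


cands = restricted_candidates(avg_widths=[0.2, 1.5, 0.7, 0.1],
                              current_weights=[1.0, 1.0, 1.0, 1.0],
                              grid=[0.5, 1.0, 1.5],
                              k=2)
```

Here four components with a three-point grid would require $3^{4}=81$ tests on the full grid, whereas the restricted scheme produces only $kd=6$ candidates.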

C.5 Decomposition for multiple upstream models

We consider an augmented two-stage model, with multiple upstream models each producing a component of $x$ , the intermediate value, which is now vector valued. Suppose there are $N$ upstream models, $\hat{\mu}^{1}_{1},\dots,\hat{\mu}^{1}_{N}$ that each map upstream features $\mathbf{w}=w_{1},\dots,w_{M}$ to their corresponding intermediate feature in $\mathbf{x}=x_{1},\dots,x_{N}$ . Denote the output of the $i$ -th upstream model to be $\hat{x}_{i}$ . Then we can define $\Delta R_{1},R_{2}$ components for each downstream feature and take a weighted sum. Thus, for the $i$ -th downstream feature, we have

	$\displaystyle(R_{2})_{i}$	$\displaystyle=\|y-\hat{\mu}_{2}(\hat{x}_{1},\dots,x_{i}\dots,\hat{x}_{N})\|$
	$\displaystyle(\Delta R_{1})_{i}$	$\displaystyle=\|\|y-\hat{\mu}_{2}(\hat{x}_{1},\dots,\hat{x}_{N})\|-\|y-\hat{\mu}_{2}(\hat{x}_{1},\dots,x_{i}\dots,\hat{x}_{N})\|\|.$

Observe that a convex combination $\sum_{i=1}^{N}\theta_{i}((R_{2})_{i}+(\Delta R_{1})_{i})$, with $\sum_{i=1}^{N}\theta_{i}=1$ and $\theta_{i}\in[0,1]$, forms an upper bound on $R=|y-\hat{\mu}_{2}(\hat{x}_{1},\dots,\hat{x}_{N})|$, since each term $(R_{2})_{i}+(\Delta R_{1})_{i}$ upper-bounds $R$ by the triangle inequality. This approach has a similar issue to the multi-stage model, as the search space $\Lambda$ scales with the number of upstream models.
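The per-upstream components can be computed by swapping one predicted intermediate feature at a time for its true value. This is a small sketch with a toy downstream model (a plain sum) and hypothetical names.

```python
def multi_upstream_components(mu2, x_hat, x_true, y):
    """Return ((R_2)_i list, (Delta R_1)_i list, full residual R)."""
    full = abs(y - mu2(x_hat))
    r2_list, delta_list = [], []
    for i in range(len(x_hat)):
        mixed = list(x_hat)
        mixed[i] = x_true[i]  # replace only the i-th prediction by its true value
        r2_i = abs(y - mu2(mixed))
        r2_list.append(r2_i)
        delta_list.append(abs(full - r2_i))
    return r2_list, delta_list, full


r2_parts, delta_parts, full = multi_upstream_components(
    mu2=sum, x_hat=[1.0, 2.0], x_true=[1.5, 1.0], y=3.5)

theta = [0.5, 0.5]  # any convex weights
bound = sum(t * (r + d) for t, r, d in zip(theta, r2_parts, delta_parts))
```

In this toy case the full residual is 0.5 and the convex combination evaluates to 1.5, so the bound holds with room to spare.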

Appendix D Additional experiments

We provide additional experiments on the non-adaptive and adaptive methods, providing further insight into the interpretability of the split residual method and the coverage guarantees. We include hyperparameter ablation studies as well as an additional real-world dataset. We include experiments on covariate shift to demonstrate that the split residual method is robust to multiple types of shift and is not limited to concept shift, which was extensively explored in the main text. All experiments were conducted on a laptop with an Intel i7 processor and 8GB of RAM. No specialized hardware (e.g., GPUs or TPUs) was used for training or evaluation.

D.1 Experimental hyperparameter details

We describe the hyperparameters we use for the other methods in our experiments: SC, WSC, ACI, DtACI, PID, OCID.

•

SC: There is no parameter for split conformal prediction intervals other than the nominal desired miscoverage level $\alpha=0.1$, which we keep the same across all methods.
•

WSC uses the same $\alpha$ and also includes a weighting parameter 0.99 which exponentially weights points as they go further back in time, resulting in recent points having more weight.
•

ACI uses the same initial $\alpha=0.1$ and updates $\alpha_{t}$ using the same step size $\gamma=0.01$ as well as the same window size $k=100$ as our method.
•

DtACI uses the same $\alpha=0.1$ and the default set of candidate step sizes $\gamma\in\{0.001,0.002,0.004,0.008,0.016,0.032,0.064,0.128\}$.
•

PID uses the same initial $\alpha=0.1$ , step size $\gamma=0.01$ , and window size $k=100$ . Two additional parameters KI and $C_{\text{sat}}$ are estimated with appropriate hypothesized bound $B$ depending on the data, using the default heuristic given in Appendix B of [1].
•

OCID uses the same $\alpha=0.1$ and a decay parameter of $0.1$, with the decay weight function given by [4].

Furthermore, note that $S^{\text{cal}}$ for our method is a subset of the held-out data which the other methods have full access to, so we do not use additional data relative to the other methods but simply partition it into smaller sets.
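To make the WSC weighting concrete, the sketch below shows exponentially decaying weights with parameter 0.99 and a simple weighted empirical quantile. It is a simplified illustration, not the baseline's exact implementation: in particular, it omits the extra weight that weighted conformal methods typically place on the test point, and the function names are hypothetical.

```python
def exponential_weights(n, decay=0.99):
    """Weight decay**age for each point; the most recent point has age 0."""
    return [decay ** (n - 1 - i) for i in range(n)]


def weighted_quantile(scores, weights, level):
    """Smallest score whose cumulative normalized weight reaches `level`."""
    total = sum(weights)
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative / total >= level:
            return score
    return max(scores)


w = exponential_weights(3, decay=0.5)            # oldest to newest
q = weighted_quantile([3.0, 1.0, 2.0], w, 0.5)   # the newest score is 2.0
```

Because the most recent score carries the largest weight, it dominates the weighted median in this toy example.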

D.2 Non-adaptive experiments

We discuss results for the non-adaptive intervals using our residual decomposition as well as hyperparameter ablation studies.

D.2.1 Coverage under easy settings

We implement both Definition 4 and Definition 3 separately and test them on simple IID data with linear relationships between $w,x,y$. We also include results for IID synthetic data featuring a nonlinear relationship, in which the upstream task is binary classification (diagnosing a patient for a disease given their health, using logistic regression) and the downstream model predicts the cost of treatment given the upstream probabilistic prediction (using XGBoost). Additionally, we implement these prediction interval methods in a $\phi$-mixing setting, for which we provide guarantees in Section C.2; here the data is generated by a stationary AR(1) process with bounded uniform noise (known to be a $\phi$-mixing process [5]) with linear relationships between $w,x,y$. We report the width and coverage of the methods in Table 4.

Table 4: Average coverage and width with standard deviation across 50 runs for the scaling components method with fixed $c=d=\alpha=0.1$ and FWER as in Definition 4, which we denote as $\text{SR}_{a,b}$, the separate quantiles method with $c=d=0.05$ as in Definition 3, denoted as $\text{SR}_{c,d}$, SC, and WSC. We examine these under linear and non-linear relationships and non-IID data.

Setting	Metric	$\textbf{SR}_{a,b}$	$\textbf{SR}_{c,d}$	SC	WSC
Linear	Coverage	0.9439 $\pm 0.02$	0.9492 $\pm 0.01$	0.8971 $\pm 0.01$	0.9062 $\pm 0.02$
Linear	Width	2.3555 $\pm 0.13$	2.3930 $\pm 0.07$	1.9924 $\pm 0.06$	2.0584 $\pm 0.13$
Non-linear	Coverage	0.9416 $\pm 0.01$	0.9794 $\pm 0.01$	0.8969 $\pm 0.01$	0.9300 $\pm 0.02$
Non-linear	Width	4.6343 $\pm 0.14$	5.2007 $\pm 0.06$	4.2179 $\pm 0.08$	4.5089 $\pm 0.16$
Non-IID (linear)	Coverage	0.9561 $\pm 0.02$	0.9886 $\pm 0.01$	0.8997 $\pm 0.02$	0.9067 $\pm 0.03$
Non-IID (linear)	Width	3.1299 $\pm 0.20$	3.6106 $\pm 0.12$	2.6749 $\pm 0.08$	2.7304 $\pm 0.15$

We observe that both $\text{SR}_{a,b}$ and $\text{SR}_{c,d}$ are conservative in these settings, producing wider intervals while attaining the desired $1-\alpha$ coverage guarantee. Our methods tend to over-cover in easier settings such as IID data and stationary $\phi$-mixing conditions because FWER hypothesis testing requires evidence that the error rate is below $\alpha$, whereas for other methods simply being close is sufficient. However, as we see in the ablation studies below, this can be easily remedied if desired by including a tolerance for the hypothesis test itself.

D.2.2 Hyperparameter Study

We vary the tolerance parameter $\tau$, which alters the hypothesis test so that the null distribution becomes $\text{Bin}(n,\alpha+\tau)$, to demonstrate that with some tolerance the width and coverage of our method are comparable to the baselines even under IID settings, and that it does not over-cover.
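The effect of the tolerance can be seen in a binomial tail $p$-value. The sketch below assumes the $p$-value is the lower tail of $\text{Bin}(n,\alpha+\tau)$ evaluated at the observed number of miscoverages; this form and the function names are assumptions made for illustration.

```python
import math


def binomial_cdf(k, n, p):
    """P(Bin(n, p) <= k), computed in log space for numerical stability."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for j in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(j + 1)
                   - math.lgamma(n - j + 1)
                   + j * math.log(p) + (n - j) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return min(1.0, total)


def miscoverage_p_value(num_errors, n, alpha, tau=0.0):
    """p-value for H0: miscoverage > alpha + tau, given num_errors out of n."""
    return binomial_cdf(num_errors, n, min(1.0, alpha + tau))


p_strict = miscoverage_p_value(5, 100, alpha=0.1)
p_tolerant = miscoverage_p_value(5, 100, alpha=0.1, tau=0.05)
```

A positive $\tau$ lowers the $p$-value for the same data, so more scaling pairs pass the test and tighter intervals become admissible.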

Table 5: Coverage and average width ($\pm$ std) across 50 runs for varying $\tau\in\{0.01,0.03,0.05\}$. Only $\text{SR}_{a,b}$ depends on $\tau$, thus the other methods do not report values for each $\tau$.

Method	Metric	$\tau=0.01$	$\tau=0.03$	$\tau=0.05$
$\text{SR}_{a,b}$	Coverage	0.9240 $\pm$ 0.01	0.9150 $\pm$ 0.02	0.9019 $\pm$ 0.02
$\text{SR}_{a,b}$	Width	2.1750 $\pm$ 0.09	2.0922 $\pm$ 0.08	2.0192 $\pm$ 0.11
$\text{SR}_{c,d}$	Coverage	0.9516 $\pm$ 0.01	–	–
$\text{SR}_{c,d}$	Width	2.4032 $\pm$ 0.05	–	–
SC	Coverage	0.8981 $\pm$ 0.01	–	–
SC	Width	2.0051 $\pm$ 0.04	–	–
WSC	Coverage	0.9015 $\pm$ 0.02	–	–
WSC	Width	2.0320 $\pm$ 0.13	–	–

We observe that even low values of the tolerance parameter, such as $\tau=0.01$, are sufficient to make $\text{SR}_{a,b}$ competitive with the black-box methods in the IID setting. This suggests that $\text{SR}_{a,b}$ is not simply over-covering, but rather enforcing stricter criteria for valid prediction intervals than the black-box approaches. These stricter requirements likely contribute to its improved performance under distributional shifts. Therefore, if one seeks tighter but potentially less robust intervals, adjusting the tolerance parameter $\tau$ provides a principled way to trade off coverage robustness against interval sharpness. In our experiments, unless stated otherwise, we use $\tau=0$.

We also vary the $\delta$ parameter, which controls the FWER rejection threshold. Smaller values make the method as a whole more prone to abstaining from producing an interval; in a sense, $\delta$ has a similar effect to the $c,d$ parameters, but acts on both stages collectively rather than providing stage-wise sensitivity.

Table 6: Coverage and average width ($\pm$ std) across 50 runs for varying $\delta\in\{0.1,0.3,0.5,0.7,0.9\}$. Only $\text{SR}_{a,b}$ depends on $\delta$.

Method	Metric	$\delta=0.1$	$\delta=0.3$	$\delta=0.5$	$\delta=0.7$	$\delta=0.9$
$\text{SR}_{a,b}$	Coverage	NA	NA	0.8969 $\pm$ 0.08	0.9294 $\pm$ 0.02	0.9268 $\pm$ 0.02
$\text{SR}_{a,b}$	Width	NA	NA	2.2231 $\pm$ 0.09	2.2195 $\pm$ 0.10	2.1867 $\pm$ 0.11
$\text{SR}_{c,d}$	Coverage	0.9504 $\pm$ 0.01	–	–	–	–
$\text{SR}_{c,d}$	Width	2.3919 $\pm$ 0.05	–	–	–	–
SC	Coverage	0.8992 $\pm$ 0.01	–	–	–	–
SC	Width	1.9987 $\pm$ 0.04	–	–	–	–
WSC	Coverage	0.9082 $\pm$ 0.02	–	–	–	–
WSC	Width	2.0603 $\pm$ 0.14	–	–	–	–

D.3 Adaptive

We include additional synthetic experiments with covariate shift and real-world stock data to demonstrate the flexibility of our method. Furthermore, we provide visualizations of the scaling parameters to demonstrate the interpretability of the decomposition. In addition, we perform hyperparameter sweeps over the synthetic dataset described in Section 7. We also include comparisons to DtACI, although DtACI performs poorly because it is designed for one-step-ahead updates rather than for a forecasting horizon.

D.3.1 Stocks Dataset

We consider a dataset consisting of daily closing values of three S&P 500 stocks (AAPL, AMZN, MSFT), with the goal of forecasting the AAPL closing value. We do so via a two-stage approach in which the first stage is an N-BEATS forecasting model that forecasts the three stocks at a 6-day horizon, and the second stage is a ridge regression that predicts AAPL given the forecasted values to correct for error from the first stage. We rescale the data and compare against the ACI, Conformal PID, and OCID methods to showcase the robustness of our method. The coverage of these methods is presented in Figure 4.

We observe that our adaptive method provides better coverage than the other three methods, particularly when the forecasts perform worse in July and October, showing a much less dramatic drop in coverage while not being overly tight. Our method uses almost the same average width (0.1347) as ACI (0.1344), while the other two methods have tighter intervals but significantly worse coverage. We list these results in Table 7 and show the sliding-window coverage plots in Figure 5(b). There are two main shocks; in comparison to ACI, our method is able to capture the second shock, and while OCID is also able to, it has volatile intervals during the stable period prior to the shocks.

Table 7: Average and standard deviation of width over time, and minimum average coverage over a sliding window of the last 100 observations for each method on the stocks dataset

Method	Avg. Width	Std. Width	Min Coverage
SR(adaptive)	0.1348	0.0216	0.89
ACI	0.1345	0.0193	0.86
PID	0.1183	0.0226	0.85
OCID	0.1240	0.0348	0.83
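The "Min Coverage" column is a minimum over sliding-window averages of the coverage indicator. A minimal sketch of that computation follows, assuming the per-step coverage indicators are available as a list of 0/1 values; the function name is hypothetical.

```python
def min_window_coverage(covered, window=100):
    """Minimum, over all full windows, of the average coverage in the window."""
    if len(covered) < window:
        raise ValueError("need at least `window` observations")
    running = sum(covered[:window])
    best = running
    for t in range(window, len(covered)):
        running += covered[t] - covered[t - window]  # slide the window by one step
        best = min(best, running)
    return best / window


# Toy sequence: two consecutive misses in the middle of an otherwise covered run.
toy = [1] * 5 + [0] * 2 + [1] * 5
worst = min_window_coverage(toy, window=4)
```

Any window of length 4 that contains both misses has coverage 0.5, which is the value returned for the toy sequence.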

D.3.2 DtACI

We also include results for the DtACI method in comparison to the others on the same synthetic dataset with two shocks used in Section 7.2, with the plots shown in Figure 6(a).

We observe that DtACI outputs values of $\alpha_{t}>1$, which we clip to 1, resulting in very conservative intervals. This is most likely because it is not designed for forecasting with a horizon and its performance suffers as a result. We see similar behavior for the automobile indicators dataset (Figure 7), in which DtACI achieves good coverage but is wider than our method because it often outputs $\alpha_{t}=1$. Interestingly, this does not hold for the stocks dataset, where DtACI produces the tightest intervals but has the worst coverage.

D.3.3 Automobile Visualizations

We also include additional visualizations of width and coverage for the automobile dataset in Figure 8. Our average width (21.085) is efficient, behind only PID (18.469) and ACI (15.696), which have poor coverage, whereas OCID and DtACI are both wider (31.058 and 31.356, respectively).

D.3.4 Hyperparameter Study

First, we visualize the effect of various $\gamma$ values, which control the rate at which $\alpha_{t}$ changes. We find that our method excels at lower values such as $\gamma=0.01$ and that performance decays at higher values such as $\gamma=0.1$, particularly because this causes $\alpha_{t}$ to drop to extremely low values, making the algorithm abstain from producing intervals under large shifts. We display the coverage effects for synthetic data in Figure 9 and for the Manheim data in Figure 10, where we see a similar decay as $\gamma$ increases.

ACI and PID improve slightly at higher $\gamma$ on the synthetic dataset. While one might argue that this implies one can simply use higher values of $\gamma$ without the split residuals method, using a lower $\gamma$ is "safer" as it does not react as strongly to noisy signals. Accordingly, the widths of the other methods become extremely volatile with increased $\gamma$. Our method, by contrast, is still able to catch significant jumps while remaining at a safer, consistent step size, because it always considers conservative scaling parameters in $\Lambda_{\text{val}}$. Furthermore, on the Manheim dataset, our method with $\gamma=0.01$ still outperforms the other methods in coverage at all values of $\gamma$, and the coverage of the other methods actually decays as $\gamma$ increases. OCID does not use the same step-size parameter, so it remains unaffected.

Table 8: Coverage, width, and maximum improvements in coverage vs. baselines (ACI, PID, OCID) across 50 runs for varying $\gamma\in\{0.01,0.03,0.05,0.1\}$

Metric	$\gamma=0.01$	$\gamma=0.03$	$\gamma=0.05$	$\gamma=0.1$
Width	22.92 $\pm$ 0.13	24.75 $\pm$ 0.10	21.26 $\pm$ 0.12	21.93 $\pm$ 0.12
Coverage	0.835 $\pm$ 0.02	0.815 $\pm$ 0.02	0.595 $\pm$ 0.02	0.715 $\pm$ 0.01
Max $\Delta$ Coverage vs ACI	0.075 $\pm$ 0.02	0.030 $\pm$ 0.01	0.025 $\pm$ 0.01	0.020 $\pm$ 0.01
Max $\Delta$ Coverage vs PID	0.075 $\pm$ 0.02	0.045 $\pm$ 0.01	0.045 $\pm$ 0.01	0.025 $\pm$ 0.01
Max $\Delta$ Coverage vs OCID	0.130 $\pm$ 0.03	0.090 $\pm$ 0.02	0.110 $\pm$ 0.02	0.100 $\pm$ 0.03

Table 9: Coverage, width, and maximum improvements in coverage vs. baselines (ACI, PID, OCID) across 50 runs for varying $\eta\in\{0.01,0.03,0.05,0.1\}$

Metric	$\eta=0.01$	$\eta=0.03$	$\eta=0.05$	$\eta=0.1$
Width	23.50 $\pm$ 0.12	24.22 $\pm$ 0.13	24.16 $\pm$ 0.14	23.96 $\pm$ 0.10
Coverage	0.835 $\pm$ 0.03	0.835 $\pm$ 0.02	0.855 $\pm$ 0.01	0.860 $\pm$ 0.02
Max $\Delta$ Coverage vs ACI	0.080 $\pm$ 0.01	0.065 $\pm$ 0.02	0.075 $\pm$ 0.01	0.075 $\pm$ 0.01
Max $\Delta$ Coverage vs PID	0.105 $\pm$ 0.03	0.090 $\pm$ 0.01	0.090 $\pm$ 0.02	0.075 $\pm$ 0.01
Max $\Delta$ Coverage vs OCID	0.125 $\pm$ 0.02	0.135 $\pm$ 0.01	0.115 $\pm$ 0.03	0.115 $\pm$ 0.02

Table 10: Coverage, width, and maximum improvements in coverage vs. baselines (ACI, PID, OCID) across 50 runs for varying $\delta\in\{0.05,0.1,0.3,0.5,0.7\}$

Metric	$\delta=0.05$	$\delta=0.1$	$\delta=0.3$	$\delta=0.5$	$\delta=0.7$
Width	NA	23.30 $\pm$ 0.12	22.70 $\pm$ 0.13	24.11 $\pm$ 0.09	23.20 $\pm$ 0.11
Coverage	NA	0.865 $\pm$ 0.03	0.840 $\pm$ 0.02	0.835 $\pm$ 0.01	0.845 $\pm$ 0.02
Max $\Delta$ Coverage vs ACI	NA	0.075 $\pm$ 0.01	0.065 $\pm$ 0.02	0.055 $\pm$ 0.01	0.065 $\pm$ 0.02
Max $\Delta$ Coverage vs PID	NA	0.080 $\pm$ 0.01	0.100 $\pm$ 0.02	0.080 $\pm$ 0.01	0.070 $\pm$ 0.01
Max $\Delta$ Coverage vs OCID	NA	0.115 $\pm$ 0.02	0.130 $\pm$ 0.02	0.110 $\pm$ 0.03	0.105 $\pm$ 0.03

Table 11: Coverage, width, and maximum improvements in coverage vs. baselines (ACI, PID, OCID) across 50 runs for varying window length $k\in\{10,50,100,150\}$

Metric	$k=10$	$k=50$	$k=100$	$k=150$
Width	NA	22.21 $\pm$ 0.08	23.24 $\pm$ 0.07	24.32 $\pm$ 0.13
Coverage	NA	0.855 $\pm$ 0.02	0.856 $\pm$ 0.03	0.835 $\pm$ 0.01
Max $\Delta$ Coverage vs ACI	NA	0.050 $\pm$ 0.01	0.070 $\pm$ 0.02	0.070 $\pm$ 0.01
Max $\Delta$ Coverage vs PID	NA	0.080 $\pm$ 0.01	0.090 $\pm$ 0.02	0.080 $\pm$ 0.01
Max $\Delta$ Coverage vs OCID	NA	0.115 $\pm$ 0.01	0.100 $\pm$ 0.01	0.120 $\pm$ 0.01

We also provide tables for the hyperparameters $\gamma,\eta,\delta,k$ in Tables 8, 9, 10 and 11, fixing them at default values of $0.01,0.01,0.3,100$ respectively when not varying them. Increasing $\gamma$ roughly decreases width, while increasing $\eta$ has a weaker widening effect. $\delta$ has no particular effect on width, but low values encourage abstention. Lastly, small window lengths also encourage abstention, and longer windows generally result in slightly larger intervals.

D.3.5 Visualization of residual components

Here, we provide extra visualizations of the $\Delta R_{1}$ and $R_{2}$ components, along with how the weights $a,b$ change over time, for the same synthetic setup as Section 7.2, in Figures 11 and 12(a).

We observe that when the first shock occurs in the upstream, $\Delta R_{1}$ captures this change, then after it returns, $\Delta R_{1}$ correspondingly shrinks back to the baseline level. $R_{2}$ correspondingly captures the second downstream shock.

We also see this relationship reflected in the $a,b$ scaling parameters selected by our method, which adapt to the upstream shock, the return, and then the downstream shock in a similar fashion (Figure 12(a)). Lastly, we plot the ratios $R/(\Delta R_{1})$ and $R/R_{2}$ over time in Figure 12(b).

We see that the shifts at $t=100,500,900$ are visibly identifiable: initially $R/R_{2}$ is high during the upstream shift, as the majority of the error comes from $\Delta R_{1}$. When the shock returns, we observe that the upstream remains the main source of error, and finally, in the downstream shift, $R_{2}$ becomes the dominant source of error. Concretely, $\Delta R_{1}$ dominates from $t=100$ to $t=500$, when the upstream undergoes a shock, and $R_{2}$ dominates from $t=900$ onward, when the downstream undergoes a shock. Thus we can clearly observe which stage is responsible for the error, allowing diagnostic retraining. In particular, if we retrain at $t=200$ after the first upstream shift, we can choose to retrain either just $\hat{\mu}_{1}$ or both $\hat{\mu}_{1}$ and $\hat{\mu}_{2}$. Retraining just $\hat{\mu}_{1}$ results in a width decrease of 24.6439 for the SR method, while retraining both models results in a decrease of 25.3451, demonstrating that it is sufficient for our framework to retrain only the affected upstream model.

We similarly plot the scaling parameters $(a,b)$ used over time for the Manheim used vehicles dataset, and observe an asymmetry in the scaling factors in relation to the shifts occurring during 2018-2020. Likewise, we observe asymmetry in the stocks dataset, with more emphasis on the downstream. This indicates which part of the model fails and how the shift affects it (Figure 13).

D.3.6 Without quantile level adjustments

In this experiment, we verify that the heuristic update step for $c_{t},d_{t}$ is necessary for improved performance. Using the same experimental setup with upstream and downstream shocks as in Section 7.2, we fix $c_{t},d_{t}$ to their initial value of $0.01$ for all $t$; the results are shown in Figure 14.

We observe that fixing $c,d$ leaves the algorithm unable to adapt properly to shifts: after the initial upstream shock it frequently produces empty $\Lambda_{\text{val}}$ sets, which lowers average coverage (we count each such abstention as a miscoverage). Adjusting $c_{t},d_{t}$ therefore mitigates abstention. However, if more sensitivity to shifts is desired, keeping them fixed at larger values could serve as a diagnostic for retraining, similar to the non-adaptive case (Table 1).
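The evaluation convention used here, in which an abstention counts as a miscoverage, can be made concrete with a short sketch. The function name `coverage_with_abstention` and the boolean `abstained` flag are hypothetical; they only illustrate how abstentions pull down the reported average coverage.

```python
import numpy as np

def coverage_with_abstention(y, lower, upper, abstained):
    """Average coverage where every abstained step is scored as a miss.

    `abstained[t]` is True when no interval could be issued at step t
    (hypothetical flag; the interval bounds at such steps are ignored).
    Returns (average coverage, abstention rate).
    """
    y, lower, upper = (np.asarray(v, dtype=float) for v in (y, lower, upper))
    abstained = np.asarray(abstained, dtype=bool)
    covered = (lower <= y) & (y <= upper) & ~abstained
    return covered.mean(), abstained.mean()

# Small worked example: one ordinary miss (step 1) and one abstention (step 2).
cov, abst = coverage_with_abstention(
    y=[1.0, 2.0, 3.0, 4.0],
    lower=[0.0, 0.0, 0.0, 0.0],
    upper=[5.0, 1.0, 5.0, 5.0],
    abstained=[False, False, True, False],
)
print(cov, abst)
```

Reporting the abstention rate alongside coverage separates the two failure modes, which matters when frequent abstention is itself being used as a retraining signal.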

D.3.7 Covariate shift

We consider covariate shift, in which the distribution of $w$ changes suddenly at $t=100,500,900$, affecting the entire pipeline. This shift has no specific upstream or downstream effect, but our method still performs comparably to the ACI, PID, and OCID methods (Figure 15). All methods perform similarly, however, and are difficult to distinguish.

Thus, while our method provides clear interpretability for upstream/downstream shifts, it can still adapt to other types of shifts without inflating width relative to other methods. It gains no explicit advantage over them in this setting, however, and is still wider on average. Our method therefore excels under asymmetric shifts at each stage, and performs best under upstream shifts.

$$
\begin{aligned}
\|R_{2}-R\| &\leq \|\hat{\mu}_{2}(x)-\hat{\mu}_{2}(\hat{\mu}_{1}(w))\| \\
&\leq L\|x-\hat{\mu}_{1}(w)\| \\
&\leq L\delta
\end{aligned}
$$
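The bound above can be checked numerically on a toy example. This is a sketch under stated assumptions: the choices $\hat{\mu}_{1}(w)=w$ and $\hat{\mu}_{2}(x)=L\sin(x)$ (which is $L$-Lipschitz), the values of $L$ and $\delta$, and the noise model are all hypothetical, and $R$, $R_{2}$ are taken to be $y-\hat{\mu}_{2}(\hat{\mu}_{1}(w))$ and $y-\hat{\mu}_{2}(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
L, delta = 2.0, 0.3  # hypothetical Lipschitz constant and upstream error bound

def mu1(w):
    return w  # toy upstream predictor

def mu2(x):
    return L * np.sin(x)  # L*sin is Lipschitz with constant L

w = rng.normal(size=1000)
# True intermediate variable, constructed so that |x - mu1(w)| <= delta.
x = mu1(w) + rng.uniform(-delta, delta, size=w.shape)
y = mu2(x) + rng.normal(scale=0.5, size=w.shape)

R = y - mu2(mu1(w))   # total residual (assumed definition)
R2 = y - mu2(x)       # second-stage residual (assumed definition)
gap = np.abs(R2 - R)

tol = 1e-12  # slack for floating-point rounding
assert np.all(gap <= L * np.abs(x - mu1(w)) + tol)  # Lipschitz step
assert np.all(gap <= L * delta + tol)               # uniform bound
print(gap.max(), L * delta)
```

The noise in $y$ cancels in $R_{2}-R$, so the gap depends only on how far $\hat{\mu}_{1}(w)$ is from $x$ and on the smoothness of $\hat{\mu}_{2}$.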
