
Conformal Survival Bands for Risk Screening under Right-Censoring

arXiv:2505.04568v1 [stat.ME] 7 May 2025

Matteo Sesia [email protected]
University of Southern California, Los Angeles, California, USA

Vladimir Svetnik vladimir [email protected]
Merck & Co., Inc., Rahway, New Jersey, USA

Abstract
We propose a method to quantify uncertainty around individual survival distribution es-
timates using right-censored data, compatible with any survival model. Unlike classical
confidence intervals, the survival bands produced by this method offer predictive rather
than population-level inference, making them useful for personalized risk screening. For
example, in a low-risk screening scenario, they can be applied to flag patients whose sur-
vival band at 12 months lies entirely above 50%, while ensuring that at least half of flagged
individuals will survive past that time on average. Our approach builds on recent advances
in conformal inference and integrates ideas from inverse probability of censoring weighting
and multiple testing with false discovery rate control. We provide asymptotic guarantees
and show promising performance in finite samples with both simulated and real data.
Keywords: Censored Data, Conformal Inference, False Discovery Rate, Predictive Cali-
bration, Survival Analysis, Uncertainty Estimation.

1. Introduction
1.1. Background and Motivation
Survival analysis focuses on data involving time-to-event outcomes, such as the time until
death or disease relapse in medicine, or mechanical failure in engineering. Its defining chal-
lenge is censoring, which occurs when the event of interest is not observed for all individuals.
In the common case of right-censoring, we only know that the event has not occurred up
to a certain time, beyond which the individual is no longer followed. For example, a cancer
patient may still be alive at their last clinical follow-up, five years after diagnosis, but their
true time of death is unknown because no data are available after that point. In this case,
the patient’s censoring time is five years and their survival time is unobserved.
In many applications, the goal is to use fitted survival models to generate personalized
inferences in the form of individual survival curves. These curves estimate, for an individual
with specific features, the probability of remaining event-free beyond any future time point.
For example, based on a patient’s medical history, a model might predict a 90% chance of
surviving past one year, 75% past three years, and 40% past five years. These predictions
can be intuitively visualized as a decreasing curve over time. Such personalized survival
curves are widely used to guide decisions, such as identifying high-risk patients for early
intervention or recognizing low-risk individuals who may safely avoid aggressive treatment.
Traditional approaches to survival analysis rely on statistical models that support un-
certainty quantification through confidence intervals and hypothesis testing. These include
parametric models, semi-parametric models, and nonparametric estimators such as the
Kaplan–Meier (KM) curve (Kaplan and Meier, 1958). Parametric models, such as the ex-
ponential or Weibull distributions, assume a specific form for how the event probability
evolves over time. Semi-parametric models, most notably the Cox proportional hazards
model (Cox, 1972), relax this assumption by leaving the baseline distribution unspecified,
while still imposing a fixed relationship between covariates and risk. The Kaplan–Meier
estimator, by contrast, avoids strong modeling assumptions altogether, but estimates only
population-level survival curves and does not incorporate covariate information.
Although classical survival models have seen many successful applications, they can
be limiting in modern settings that involve large datasets with rich covariate information.
When the relationship between patient features and survival outcomes is complex or poorly
understood, strong modeling assumptions—such as proportional hazards or specific para-
metric distributions—may be difficult to justify. In these situations, it is often more practi-
cal to treat classical models as black-box predictors: tools that generate individual survival
curves, but whose internal assumptions are not relied on for inference. This perspective
also motivates the growing use of more flexible, data-driven approaches—particularly ma-
chine learning methods such as random survival forests (Ishwaran et al., 2008), deep neural
networks (Katzman et al., 2018), and gradient boosting (Barnwal et al., 2022)—which can
model complex relationships with high-dimensional covariates (Spooner et al., 2020).
While black-box models can produce accurate and personalized survival predictions,
they typically lack principled methods for uncertainty quantification. Classical statistical
models offer confidence intervals and hypothesis tests, but—as discussed above—their guar-
antees rely on strong distributional assumptions that may not hold in practice. Machine
learning models, by contrast, often omit uncertainty estimates altogether; and when such
measures are available, they are typically heuristic and lack formal statistical justification.
This limitation is particularly concerning in high-stakes applications such as clinical triage
or treatment planning, where decisions must rely on calibrated, trustworthy risk estimates.
There is therefore a need for widely applicable statistical tools that can provide rigorous,
distribution-free uncertainty quantification for black-box survival models.

1.2. Preview of Our Contributions

We develop a method to construct principled and interpretable “uncertainty bands” around


individual survival curves estimated by black-box models, using right-censored data. These
bands, which we name conformal survival bands, are designed to support clinical screening
tasks and are rigorously calibrated in a useful predictive sense, as explained below.
A preview of our method is provided in Figure 1, which displays conformal survival
bands for four test patients in a simulated dataset. At first glance, these bands may resemble
personalized versions of classical confidence intervals—such as those drawn around Kaplan–
Meier survival curves—but their interpretation is different. Classical confidence intervals
aim to estimate population-level quantities like the true survival probability at a given time,
which is typically only possible under strong parametric assumptions or after binning the
covariates into a small number of population subgroups. Our bands, by contrast, are fully
nonparametric and support individualized inferences without requiring covariate binning.


[Figure 1: eight panels plotting survival probability against time (years) for Patients 1–4, under a low-quality survival model (top row) and a high-quality model (bottom row). Each panel overlays the fitted survival model curve, the oracle curve, the Kaplan–Meier (KM) estimate, and the conformal survival band (CSB); patients are flagged as 'low risk' when P[T > 3.00] > 0.80.]

Figure 1: Illustration of the use of conformal survival bands (shaded regions) for screening
test patients in simulated censored data. Solid black curves show survival estimates from
either an inaccurate (top) or accurate (bottom) survival forest model. Dashed green curves
represent the true survival probabilities. The goal is to identify low-risk patients—those
with more than 80% probability of surviving beyond 3 years (marked by the vertical line). A
patient is flagged by our method if their entire conformal band lies above the 80% threshold
(horizontal line) at this time, ensuring at least 80% of flagged individuals survive longer.
Flagging decisions are indicated by colored markers: red triangles (our method), green aster-
isks (oracle), black circles (estimated model), and blue diamonds (Kaplan-Meier). Patients
2–4 are mistakenly flagged as low risk by the inaccurate model. With the accurate model,
our method can identify truly low-risk patients (e.g., Patient 1), unlike the KM curve.

Rather than attempting to directly estimate population-level parameters, our bands are
calibrated for distribution-free predictive screening. They allow practitioners to identify
individuals whose estimated survival probability lies below (or above) a clinically meaningful
threshold—while guaranteeing that, among all such flagged individuals, the false discovery
rate is approximately controlled in a precise predictive sense. For example, in Figure 1, we
illustrate a scenario in which the practitioner wishes to identify “low-risk” patients—defined
as those with more than a p = 80% chance of surviving beyond time t = 3 years. A
patient is flagged if their conformal band lies entirely above the 80% line at t = 3. Our
method asymptotically guarantees that, among all flagged individuals, on average at least
80% will survive beyond this time point. This guarantee holds pointwise over all fixed
choices of the survival probability threshold p and time horizon t, and for “low-risk” as well
as “high-risk” screening, making the method flexible and broadly applicable.
This predictive notion of uncertainty is well-aligned with how survival curves are com-
monly used in real-world settings: to guide actionable decisions such as prioritizing patients
for further testing or early intervention. However, black-box survival models typically pro-
vide only point estimates, without any reliable measure of uncertainty or formal calibration
guarantees. As in the example shown, screening based on point predictions alone, or even


KM estimates, may lead to too many patients incorrectly labeled as low risk—highlighting
the practical benefits of our calibrated bands, which offer rigorous control over such errors.
Our approach integrates three key concepts: conformal inference (Vovk et al., 2005;
Fontana et al., 2023), inverse probability of censoring weighting (IPCW) (Robins and
Rotnitzky, 1992; Robins, 1993), and false discovery rate (FDR) control (Benjamini and
Hochberg, 1995). Concretely, we build on tests for random null hypotheses of the type:

Hlt(t) : T ≥ t,    or    Hrt(t) : T ≤ t,

where T is the (unobserved) survival time of the test individual and t > 0 is a user-
specified time threshold. In the following, we will compute conformal p-values to test these
hypotheses and show how to use them to construct the conformal survival bands previewed
in Figure 1. Although each of the statistical concepts upon which we build already exists
on its own, the way we integrate them is original and supported by novel theoretical results.

1.3. Related Work


Conformal inference for survival analysis is a very active area of research, with several recent
works addressing different inferential goals. Some methods focus on prediction of survival
times (Boström et al., 2019, 2023; Sun and Wang, 2024) or classification into survival/event
categories (Boström et al., 2017), while others aim to refine the point estimates of survival
probabilities produced by black-box models (Qi et al., 2024). The works most closely related
to our own concern the construction of predictive lower bounds for survival times at a fixed
confidence level α ∈ (0, 1), using censored data. These methods seek to deliver distribution-
free prediction intervals while correcting for the non-exchangeability (Barber et al., 2023)
introduced by censoring.
This direction was initiated by Candès et al. (2023), whose method provides valid lower
bounds but is often overly conservative in practice. Gui et al. (2024) introduced refinements
that improve coverage tightness, yet even their method tends to produce wide intervals at
moderate coverage levels (e.g., α = 0.5), limiting its utility in settings that require a finer
characterization of predictive uncertainty. In addition, both of these methods assume type-I
censoring—where censoring times are fully observed for all individuals—an assumption that
is quite limiting in practice. More recent contributions by Davidov et al. (2024), Sesia and
Svetnik (2024), and Farina et al. (2025) relax this constraint to accommodate general right-
censoring. However, these approaches remain restricted to one-sided predictive inference
and do not support threshold-based screening or two-sided uncertainty quantification.
Among the few existing methods that attempt to construct two-sided predictive inter-
vals under censoring are those of Qi et al. (2024), Qin et al. (2024), and Yi et al. (2025).
Qi et al. (2024) approach the problem by “de-censoring” the data: their method estimates
latent survival times in the calibration data set using the fitted survival model and then
applies standard conformal techniques as if those predictions were observed. While
convenient, this method assumes that the survival model used for de-censoring is reliable.
As argued by Sesia and Svetnik (2024), this assumption undermines the primary appeal of
conformal inference—namely, its robustness to model misspecification. Indeed, empirical
evidence suggests that the method of Qi et al. (2024) may fail to provide valid coverage when
the survival model is inaccurate. By contrast, Qin et al. (2024) take a more model-agnostic


approach, based on resampling under censoring, but their work focuses on predictive inter-
vals for survival times rather than on calibrating survival screening decisions. Our work is
also related to recent, independent research by Yi et al. (2025), who construct two-sided
predictive intervals for survival times using a method that, like ours, utilizes conformal p-
values with IPC weighting. However, their focus is on survival time prediction rather than
decision-based screening or the calibration of individual survival curves.
In this paper, we address the problem of estimating uncertainty around individual sur-
vival curves to support threshold-based screening tasks. Our approach is based on comput-
ing weighted conformal p-values for testing hypotheses of the form Hlt (t; j) : Tn+j ≥ t and
Hrt (t; j) : Tn+j ≤ t, where Tn+j denotes the (unobserved) survival time of a new individual
and t > 0 is a fixed time threshold. While it may be possible in principle to obtain such
p-values by inverting predictive intervals for survival times obtained using existing meth-
ods, this is practically infeasible for several reasons: (i) most methods only provide lower
bounds, (ii) even the best available bounds remain conservative, especially at moderate
levels of confidence, and (iii) the inversion procedure would be computationally expensive,
as it requires solving nested optimization problems for each time threshold t.
Our method integrates and extends two previously distinct lines of work. First, we build
on the screening-based conformal inference framework of Jin and Candès (2023b), which
develops p-values for testing whether unobserved outcomes exceed user-specified thresholds.
Jin and Candès (2023a) later generalized this approach to allow for importance weighting
under covariate shift (Tibshirani et al., 2019), but their theory and assumptions are not
suited to censoring. Second, we draw on Farina et al. (2025), who introduced IPCW (Robins
and Rotnitzky, 1992; Robins, 1993) into conformal inference to produce predictive intervals
under right-censoring. While their method is tailored to one-sided prediction bounds, we
adapt IPCW techniques to develop conformal p-values for our survival threshold hypotheses.

2. Methods
2.1. Problem Setup and Assumptions
We consider a right-censored survival setting based on a sample of n individuals indexed
by [n] := {1, . . . , n}, drawn i.i.d. from an unknown population. For each individual i ∈ [n],
we observe a vector of covariates Xi ∈ X ⊆ Rd , along with right-censored survival data:
the event indicator Ei := I(Ti < Ci ) and the observed time T̃i := min(Ti , Ci ), where Ti > 0
is the true survival time and Ci > 0 is the censoring time. These n observations form the
calibration dataset, denoted by Dcal := {(Xi , T̃i , Ei )}ni=1 , which will be used to quantify
uncertainty—i.e., to calibrate the predictions of a black-box survival model.
We assume access to two black-box models trained on an independent dataset, which
may itself be censored and need not follow the same distribution as the calibration or
test sets. The only assumption is that the training data are independent of all other
samples, allowing us to treat both models as fixed throughout the analysis. The first
model is the survival model, M̂T . This model produces an estimated individual survival
function ŜT (t | x), which is intended to approximate the true conditional survival probability
ST (t | x) := P(T > t | X = x). The second is an auxiliary censoring model, M̂C , which
estimates the conditional survival function of the censoring distribution, ŜC (t | x), an
approximation to SC (t | x) := P(C > t | X = x). The role of the censoring model is


to reweight the calibration data to correct for the missing information due to censoring,
enabling valid uncertainty quantification for the survival model.
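To make this notation concrete, the sketch below simulates a toy right-censored calibration set. The exponential survival and censoring distributions, and all variable names, are illustrative assumptions of ours, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3

# Hypothetical data-generating process (illustrative assumptions only).
X = rng.normal(size=(n, d))                       # covariates X_i
T = rng.exponential(scale=np.exp(0.5 * X[:, 0]))  # true survival times T_i
C = rng.exponential(scale=2.0, size=n)            # censoring times C_i

# Right-censored observables, matching the notation above.
E = (T < C).astype(int)     # event indicator E_i := I(T_i < C_i)
T_obs = np.minimum(T, C)    # observed time: min(T_i, C_i)

# The calibration dataset D_cal consists of the triples (X_i, T_obs_i, E_i);
# the true T_i is visible only when E_i = 1.
```

The key point is that only the pair (T_obs, E) is observed, never (T, C) jointly.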
In addition to the calibration set, we consider a disjoint test set consisting of m individ-
uals, indexed by {n + 1, . . . , n + m}, also drawn independently from the same population.
For each test individual j, we observe only the covariates Xj ∈ X . Our goal is to use the
survival model M̂T to estimate the survival curve ŜT (t | Xj ) for each Xj , and to construct
a well-calibrated conformal survival band around this curve that reflects uncertainty.
Rather than aiming for our conformal survival bands to provide valid confidence intervals
for the true survival function ST (t | x), which would not be a feasible goal without additional
assumptions (Barber, 2020), we focus on a practically useful objective that is more naturally
attainable within a conformal inference framework: producing principled and interpretable
uncertainty estimates around ŜT (t | Xj ) that can support confident predictive screening
decisions. For example, given a survival probability threshold q ∈ (0, 1) and a clinically
meaningful time point t > 0, our goal may be to identify high-risk test individuals whose
predicted survival probability ŜT (t | Xj ) falls significantly below q, while guaranteeing
that, among all such flagged individuals, the expected proportion who survive beyond t
remains controlled below level q. Although inherently predictive in nature, this type of
calibration guarantee is intuitive and aligns closely with how physicians and practitioners
often interpret the output of survival models in practice: as actionable, patient-specific risk
estimates that can support confident decision-making.
Because our method is built on the idea of conformal p-values, we first review the key
concepts underlying this approach.

2.2. Preliminaries: Review of Conformal Inference without Censoring


To build intuition for our method, we begin by reviewing how existing conformal inference
techniques for regression can be applied to survival analysis in a simplified setting without
censoring. In this case, imagine observing complete data {(Xi , Ti )}ni=1 , where each (Xi , Ti )
is drawn i.i.d. from an unknown distribution over X × R+ . Given a new test point with
covariates Xn+j , consider the task of testing the null hypothesis

Hlt (t; j) : Tn+j ≥ t, (1)

for a fixed time t > 0. This hypothesis asserts that the individual will experience the event
after time t; rejecting it provides evidence that the individual is unlikely to survive beyond t.
If complete (uncensored) data are available, this hypothesis can easily be tested as
follows. Let ŜT (t | x) denote the survival function estimated by the fitted model M̂T . We
define the left-tail nonconformity scoring function as

ŝlt (t; x) := 1 − ŜT (t | x), (2)

which represents the predicted probability that an individual with covariates x experiences
an event before time t. This is a natural and interpretable choice of nonconformity score in
this context, and—crucially—it is monotonically increasing in t. While our later method-
ology will not depend specifically on this choice of score, it does require that the scoring
function ŝlt (t; x) be monotone increasing in t.


Using this function, we compute a non-conformity score ŝlt (t; Xi ) for each calibration
point i ∈ [n] and for the test point Xn+j . Then, following the approach of Jin and Candès
(2023b), we define the left-tail conformal p-value as:
ϕ̃lt(t; Xn+j) := [ 1 + Σ_{i=1}^{n} I{ŝlt(Ti; Xi) ≥ ŝlt(t; Xn+j)} ] / (n + 1),    (3)
where we assume for simplicity that all scores are almost surely distinct; this is not a
limitation in practice since ties can be broken by adding some independent random noise.
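The p-value in (3) is simple to compute in a few lines. The sketch below is a minimal illustration for a hypothetical black-box survival function `surv_fn`; the function name, signature, and the placeholder exponential model are our own assumptions.

```python
import numpy as np

def left_tail_p_value(surv_fn, X_cal, T_cal, x_test, t):
    """Uncensored left-tail conformal p-value, as in Eq. (3).

    surv_fn(t, x) stands in for the black-box estimate of S_T(t | x).
    """
    # Left-tail nonconformity score, Eq. (2): s_lt(t; x) = 1 - S_hat(t | x).
    s_cal = np.array([1.0 - surv_fn(Ti, Xi) for Ti, Xi in zip(T_cal, X_cal)])
    s_test = 1.0 - surv_fn(t, x_test)
    n = len(T_cal)
    # Count calibration scores at least as large as the test score.
    return (1.0 + np.sum(s_cal >= s_test)) / (n + 1.0)

# Toy check with a covariate-free exponential "model" (illustrative only).
surv_fn = lambda t, x: np.exp(-t)
rng = np.random.default_rng(1)
T_cal = rng.exponential(size=99)
X_cal = np.zeros((99, 1))
p = left_tail_p_value(surv_fn, X_cal, T_cal, np.zeros(1), t=1.0)
# p is small when few calibration subjects outlived the test horizon t.
```

Because the score is a monotone transform of time here, the p-value reduces to counting how many calibration survival times exceed t.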
The following result, due to Jin and Candès (2023b), highlights the statistical validity
of the left-tail p-value defined in (3) for testing the null hypothesis (1). A formal proof is
included in Appendix A for completeness. Appendix A also contains all other proofs.

Proposition 1 (Jin and Candès (2023b)) Assume the calibration data and test point
(Xn+j , Tn+j ) are exchangeable, and the scoring function ŝlt (t; x) is monotone increasing in
t for all x. Then for any fixed t > 0 and α ∈ (0, 1), the conformal p-value ϕ̃lt (t; Xn+j )
defined in (3), computed using the full (uncensored) data, satisfies:
P[ ϕ̃lt(t; Xn+j) ≤ α, Tn+j ≥ t ] ≤ α.

This implies ϕ̃lt (t; Xn+j ) can be interpreted as a p-value for testing the random hy-
pothesis (1), in the sense that small values provide evidence against the null. Notably, the
guarantee in Proposition 1 does not condition on the null event {Tn+j ≥ t}; instead, it con-
trols the joint probability that both the null is true and the p-value is small. This marginal
formulation differs from the classical frequentist setup, where hypotheses are non-random,
but it is sufficient for our purposes. As we will see later, this form of validity is precisely
what enables us to obtain valid calibration guarantees for personalized risk screening.
Before turning to censored data, it is helpful to note that the above ideas also apply for
testing the complementary right-tail null hypothesis

Hrt (t; j) : Tn+j ≤ t, (4)

which is useful for identifying low-risk individuals. In this case, we consider a right-tail
scoring function assumed to be monotone decreasing in t; e.g.,

ŝrt (t; x) := ŜT (t | x), (5)

interpreted as the predicted probability of experiencing the event after time t. The corre-
sponding right-tail conformal p-value is
ϕ̃rt(t; Xn+j) := [ 1 + Σ_{i=1}^{n} I{ŝrt(Ti; Xi) ≥ ŝrt(t; Xn+j)} ] / (n + 1),
which is super-uniform under the same assumptions as in Proposition 1:
P[ ϕ̃rt(t; Xn+j) ≤ α, Tn+j ≤ t ] ≤ α.

This completes the review of standard conformal p-values in the uncensored setting, for
both left- and right-tail survival hypotheses. In the next section, we extend this framework
to account for censoring, enabling application to real-world survival data.


2.3. Conformal Inference for Survival Analysis under Right-Censoring


We now return to the right-censored survival setting described in Section 2.1, where the
calibration data consist of i.i.d. samples {(Xi , T̃i , Ei )}ni=1 . As in Section 2.2, our goal is to
test the left-tail and right-tail hypotheses defined in (1) and (4) for new test points Xn+j .
The main challenge compared to the idealized setting of Section 2.2 is that we do not
observe the true event times Ti for all calibration individuals. Consequently, we cannot
compute the full-data conformal p-values ϕ̃lt (t; Xn+j ) or ϕ̃rt (t; Xn+j ) as defined earlier.
To address this, we adapt the conformal p-value construction using IPCW (Robins
and Rotnitzky, 1992; Robins, 1993). This approach uses censored calibration samples but
carefully reweights them to correct for censoring-induced selection bias. Specifically, we rely
on a fitted censoring model M̂C that estimates the conditional censoring survival function
ŜC (t | x) ≈ P(C > t | X = x). Based on this model, we define the weight function

ŵ(t, x) := 1 / ŜC(t | x),    (6)

approximating the inverse probability of being uncensored at time t given covariates x.
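One simple way to obtain ŜC in practice is a covariate-free "reverse" Kaplan–Meier estimate that treats censoring as the event of interest. This is purely an illustrative stand-in for the censoring model M̂C (which may condition on covariates); the helper names, tie handling, and clipping constant below are our own assumptions.

```python
import numpy as np

def censoring_km(T_obs, E):
    """Reverse Kaplan-Meier estimate of the censoring survival function S_C.

    Censoring "events" are the complement of the survival events (E = 0
    marks an observed censoring time). Ties are handled crudely; this is
    an illustrative sketch, not the paper's censoring model.
    """
    order = np.argsort(T_obs)
    times, events = T_obs[order], 1 - E[order]   # censoring indicators
    n = len(times)
    at_risk = n - np.arange(n)                   # subjects still at risk
    surv = np.cumprod(1.0 - events / at_risk)    # KM product-limit formula
    return times, surv

def ipcw_weight(t, times, surv, eps=1e-6):
    """w_hat(t) = 1 / S_C_hat(t), as in Eq. (6), clipped away from zero."""
    idx = np.searchsorted(times, t, side="right") - 1
    s = surv[idx] if idx >= 0 else 1.0           # S_C = 1 before first time
    return 1.0 / max(s, eps)
```

The clipping by `eps` reflects Assumption (A3), which requires the estimated weights to be bounded.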


Bringing this into the conformal framework, we define the IPCW version of the left-tail
conformal p-value as:

ϕ̂lt(t; Xn+j) := [ 1 + Σ_{i=1}^{n} Ei · ŵ(Ti, Xi) · I{ŝlt(Ti; Xi) ≥ ŝlt(t; Xn+j)} ] / [ 1 + Σ_{i=1}^{n} Ei · ŵ(Ti, Xi) ],    (7)

where the numerator includes only data points whose true event times are observed.
Analogously, we define the IPCW right-tail p-value as:

ϕ̂rt(t; Xn+j) := [ 1 + Σ_{i=1}^{n} Ei · ŵ(Ti, Xi) · I{ŝrt(Ti; Xi) ≥ ŝrt(t; Xn+j)} ] / [ 1 + Σ_{i=1}^{n} Ei · ŵ(Ti, Xi) ].    (8)

These IPCW p-values are fully computable from the observed censored data. As we now
show, they yield asymptotically valid inference under the assumption that the censoring
model M̂C is consistent, along with mild regularity conditions on the data distribution.
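Given precomputed scores, weights, and event indicators, the IPCW p-value in (7) reduces to a weighted counting formula. A minimal sketch, with function and variable names of our own choosing:

```python
import numpy as np

def ipcw_left_p_value(s_cal, w_cal, E_cal, s_test):
    """IPCW left-tail conformal p-value, as in Eq. (7).

    s_cal[i] = s_lt(T_i; X_i), w_cal[i] = w_hat(T_i, X_i), E_cal[i] = event
    indicator, s_test = s_lt(t; X_test). Only uncensored calibration points
    (E_i = 1) contribute to either sum.
    """
    mask = E_cal == 1
    num = 1.0 + np.sum(w_cal[mask] * (s_cal[mask] >= s_test))
    den = 1.0 + np.sum(w_cal[mask])
    return num / den
```

The right-tail p-value in (8) is identical in form, with the right-tail scores substituted.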
We define the censoring weight estimation error as:
" 2 #!1/2
1 1
∆N := E − ,
ŵ(T ; X) w∗ (T ; X)

where w∗ (t, x) := 1/SC (t | x) denotes the true (unknown) censoring weight function, and
N represents the number of training data points.
We now list the assumptions under which asymptotic validity holds:

(A1) The data are split into three independent parts:

• a training set of cardinality N used to estimate ŵ(t; x), ŝlt (t; x), and ŝrt (t; x);
• an i.i.d. calibration set (Xi , Ti , Ci ) for i = 1, . . . , n, which is censored;
• and an i.i.d. test point (Xn+j , Tn+j , Cn+j ), of which we only see the covariates.


(A2) The probability of observing an event is bounded away from zero: π := P(T ≤ C) > 0.

(A3) The estimated weights are bounded below: ŵ(T ; X) ≥ ωmin > 0 almost surely.

(A4) The weight estimation error vanishes asymptotically: ∆N → 0 as N → ∞.

Theorem 2 Under Assumptions (A1)–(A4), for any fixed t > 0 and α ∈ (0, 1), the IPCW
conformal p-values defined in (7) and (8) satisfy:

lim sup_{N,n→∞} P[ ϕ̂lt(t; Xn+1) ≤ α, Tn+1 ≥ t ] ≤ α,
lim sup_{N,n→∞} P[ ϕ̂rt(t; Xn+1) ≤ α, Tn+1 ≤ t ] ≤ α.

In the next section, we explain how to use IPCW conformal p-values to construct con-
formal survival bands.

2.4. Conformal Survival Bands for Screening High- or Low-Risk Individuals


Consider m test patients with feature vectors Xn+1 , Xn+2 , . . . , Xn+m . For each j ∈ [m] :=
{1, . . . , m} and a fixed time t > 0, define

Û(t; Xn+j) := ϕ̂^BH_lt(t; Xn+j),    (9)

where ϕ̂^BH_lt(t; Xn+j) denotes the Benjamini-Hochberg (BH) adjusted left-tail p-value.
More precisely, let ϕ̂(1) ≤ ϕ̂(2) ≤ · · · ≤ ϕ̂(m) denote the ordered values of the unadjusted
left-tail p-values {ϕ̂lt(t; Xn+j)}_{j=1}^{m}, and let π(j) be the index such that ϕ̂(j) = ϕ̂lt(t; Xn+π(j)).
Then the BH-adjusted p-values are defined as
ϕ̂^BH_lt(t; Xn+π(j)) := min_{k≥j} { (m/k) · ϕ̂(k) } ∧ 1,    for j = 1, . . . , m.    (10)

Each value ϕ̂^BH_lt(t; Xn+j) represents the smallest FDR level at which the null hypothesis
Hlt (t; j) can be rejected by the BH procedure (Benjamini and Hochberg, 1995). In practice,
these adjusted p-values can be very easily computed using the p.adjust function in R with
the argument method = "BH".
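A minimal self-contained sketch of the adjustment in (10), which numerically matches R's p.adjust(p, method = "BH"); the function name is our own.

```python
import numpy as np

def bh_adjust(p):
    """BH-adjusted p-values, as in Eq. (10)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)                        # sorts p_(1) <= ... <= p_(m)
    ranked = p[order] * m / np.arange(1, m + 1)  # (m/k) * p_(k)
    # Running minimum over k >= j (scan the reversed array), then cap at 1.
    adj = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    out = np.empty(m)
    out[order] = adj                             # restore original order
    return out
```

For example, `bh_adjust([0.01, 0.04, 0.03, 0.2])` returns the same values as the corresponding call to p.adjust in R.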
The quantity Û (t; Xn+j ) can then be interpreted as a calibrated upper bound on the
survival probability at time t for individual Xn+j . If Û (t; Xn+j ) ≤ α for some threshold
α ∈ (0, 1), then the individual may be confidently flagged as high-risk, with the guarantee
that the expected proportion of false discoveries—individuals who survive past t despite
being flagged as “high-risk”—remains below α in the large-sample limit.
Symmetrically, we define

L̂(t; Xn+j) := 1 − ϕ̂^BH_rt(t; Xn+j),    (11)

where ϕ̂^BH_rt(t; Xn+j) is the BH-adjusted p-value computed from the collection of right-tail
IPCW conformal p-values {ϕ̂rt(t; Xn+ℓ)}_{ℓ=1}^{m}. If L̂(t; Xn+j) ≥ 1 − α, the patient can be con-
fidently flagged as low-risk, with the guarantee that the expected proportion of individuals
who fail before t—despite being classified as low-risk—approximately remains below α.
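Given BH-adjusted p-values at a fixed time t, the band endpoints in (9) and (11) and the two flagging rules reduce to elementwise arithmetic. The sketch below uses made-up adjusted p-values for m = 4 hypothetical patients.

```python
import numpy as np

# Hypothetical BH-adjusted p-values at a fixed time t (illustrative values).
phi_bh_lt = np.array([0.03, 0.40, 0.75, 0.10])  # left-tail: high-risk evidence
phi_bh_rt = np.array([0.90, 0.04, 0.50, 0.60])  # right-tail: low-risk evidence

U = phi_bh_lt          # Eq. (9): upper band endpoint
L = 1.0 - phi_bh_rt    # Eq. (11): lower band endpoint

alpha = 0.05
high_risk = U <= alpha        # flag: band lies entirely below alpha
low_risk = L >= 1.0 - alpha   # flag: band lies entirely above 1 - alpha
```

Here only the first patient would be flagged as high-risk and only the second as low-risk at level α = 0.05.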


This construction defines our calibrated individual survival band

[L̂(t; Xn+j ), Û (t; Xn+j )] (12)

which could be interpreted as a calibrated range for likely survival probabilities at time
t—offering statistically principled decision support for confident high- and low-risk screen-
ing. The full procedure described above is summarized in Algorithm 1.

Algorithm 1 Construction of Conformal Survival Bands

input: Censored training dataset Dtrain; censored calibration dataset Dcal = {(Xi, T̃i, Ei)}_{i=1}^{n}; test covariates Xn+1, . . . , Xn+m; time grid T = {t1, . . . , tK}; trainable survival model M̂T and censoring model M̂C.
1: Define the active calibration subset: D′cal := {(Xi, Ti) : Ei = 1, i ∈ [n]}.
2: Train M̂T and M̂C on Dtrain.
3: Estimate censoring weights ŵ(t, x) using M̂C as in Eq. (6).
4: for each calibration point (Xi, Ti) ∈ D′cal do
5:   Compute ŝlt(Ti; Xi) and ŝrt(Ti; Xi) using Eqs. (2) and (5).
6: end for
7: for each time t ∈ T do
8:   for each test point Xn+j, for j = 1, . . . , m do
9:     Compute scores ŝlt(t; Xn+j) and ŝrt(t; Xn+j).
10:    Compute IPCW p-values ϕ̂lt(t; Xn+j) and ϕ̂rt(t; Xn+j) using Eqs. (7) and (8).
11:   end for
12:   Apply the BH procedure to the p-values across the test set to obtain adjusted values ϕ̂^BH_lt(t; Xn+j) and ϕ̂^BH_rt(t; Xn+j) for all j ∈ [m], as in Eq. (10).
13:   for each test point j ∈ [m] do
14:     Compute endpoints Û(t; Xn+j) and L̂(t; Xn+j) using Eqs. (9) and (11).
15:   end for
16: end for
output: Personalized survival bands [L̂(t; Xn+j), Û(t; Xn+j)] for each j ∈ [m] and all t ∈ T.

This algorithm is straightforward to implement and computationally efficient. Model
training is performed once on the training dataset, and nonconformity scores for the un-
censored calibration points are computed a single time, independently of the test set and
the evaluation time grid. The only components that depend on the chosen time points
are the nonconformity scores for the test samples, the IPCW conformal p-values, and their
BH-adjusted counterparts. While the BH adjustment requires sorting across test samples,
it is very fast in practice. All other steps can be parallelized over both time points and test
samples. As a result, the overall computational cost of conformal calibration is modest, and
typically negligible compared to the cost of training the survival or censoring models.
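To make the calibration steps concrete, the sketch below implements generic Python versions of the main building blocks: a weighted (IPCW-style) conformal p-value, the Benjamini-Hochberg adjustment, and band endpoints derived from the adjusted p-values. This is a simplified stand-in rather than a faithful reproduction of Eqs. (2) and (5)–(11); the function names and the exact weighting scheme are our own illustrative choices.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def ipcw_p_value(test_score, cal_scores, cal_weights, test_weight=1.0):
    """Generic weighted conformal p-value: the weighted fraction of
    calibration scores at least as large as the test score."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_weights = np.asarray(cal_weights, dtype=float)
    numerator = np.sum(cal_weights * (cal_scores >= test_score)) + test_weight
    denominator = np.sum(cal_weights) + test_weight
    return min(1.0, numerator / denominator)

def band_endpoints(p_lt, p_rt):
    """Band endpoints from BH-adjusted left- and right-tail p-values:
    upper endpoint from the left-tail adjustment, lower endpoint as one
    minus the right-tail adjustment."""
    upper = bh_adjust(p_lt)
    lower = 1.0 - bh_adjust(p_rt)
    return lower, upper
```

Only `bh_adjust` requires sorting across test samples; the other steps are embarrassingly parallel over time points and test patients, consistent with the computational discussion above.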
To implement Algorithm 1 in an even more data-efficient way, one may choose to reallocate all censored observations from the calibration set Dcal into the training set. This yields a larger training dataset D′train := Dtrain ∪ {(Xi, T̃i, Ei) : Ei = 0, i ∈ [n]}. This does not affect the statistical validity of our method, since the conformal p-values ϕ̂lt(t; Xn+j) and ϕ̂rt(t; Xn+j) are computed exclusively using uncensored calibration points. While this approach deviates from the standard sample-splitting paradigm commonly used in conformal inference, it remains valid within our framework because inverse probability of censoring weighting (IPCW) automatically corrects for the potential selection bias introduced by reallocation during the calibration phase. However, reallocating the censored observations to the training set is not guaranteed to improve model estimation, as doing so introduces its own form of sampling bias in the training data. Exploring the trade-offs involved in this reallocation strategy, particularly in combination with the possible use of importance weighting during training, remains an open direction for future work.

2.5. Theoretical Guarantees for Calibrated Survival Bands


We will now formally state the asymptotic statistical guarantee provided by our method, which requires the following additional assumption:

(A5) For all t > 0, the score functions ŝlt(t; X) admit a Lebesgue density ft satisfying 0 < fmin ≤ ft(u) ≤ fmax < ∞ for all u ∈ R, for some constants fmin, fmax ∈ (0, ∞).
Theorem 3 Under Assumptions (A1)–(A5), fix a target level α ∈ (0, 1) and survival cutoff t > 0. Let pj := ϕ̂lt(t; Xn+j) denote the IPCW p-value used to test the null hypothesis Hlt(t; j) : Tn+j ≥ t, for j = 1, . . . , m. Define the false discovery rate (FDR) as

FDRm,n := E[ |R ∩ H0| / (|R| ∨ 1) ],

where R := {j : pj ≤ τ̂} is the rejection set obtained by applying the BH procedure at level α to (p1, . . . , pm) and H0 := {j : Tn+j ≥ t} is the (random) set of true null hypotheses. Then, for any fixed m,

lim sup_{N,n→∞} FDRm,n ≤ α.

If in addition m = mn → ∞ and mn² ϵn → 0, then

lim sup_{N,m,n→∞} FDRm,n ≤ α.

An analogous result holds for the right-tail p-values ϕ̂rt(t; Xn+j), used to test Hrt(t; j).
A corollary of Theorem 3 is that high- and low-risk screening rules based on the survival bands produced by Algorithm 1 are asymptotically well-calibrated. For any t > 0 and α ∈ (0, 1), define the set of test individuals flagged as high-risk at time t as

Fhi(t; α) := { j ∈ [m] : Û(t; Xn+j) ≤ α } = { j ∈ [m] : ϕ̂BHlt(t; Xn+j) ≤ α },

and similarly, define the set of individuals flagged as low-risk as

Flo(t; α) := { j ∈ [m] : L̂(t; Xn+j) ≥ 1 − α } = { j ∈ [m] : ϕ̂BHrt(t; Xn+j) ≤ α }.

Then, Theorem 3 implies that, in the large-sample limit, the expected proportion of individuals in Fhi(t; α) who actually survive past t, and the expected proportion of individuals in Flo(t; α) who fail before t, are both asymptotically controlled below level α.
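For illustration, the flag sets Fhi(t; α) and Flo(t; α) can be computed directly from calibrated band endpoints at a fixed time t; the numeric inputs in this sketch are hypothetical.

```python
import numpy as np

def screen(lower, upper, alpha):
    """Return indices flagged as high-risk (upper band <= alpha) and
    low-risk (lower band >= 1 - alpha) at a fixed time t."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    high_risk = np.flatnonzero(upper <= alpha)        # F_hi(t; alpha)
    low_risk = np.flatnonzero(lower >= 1.0 - alpha)   # F_lo(t; alpha)
    return high_risk, low_risk

# Three hypothetical patients with bands [0.02, 0.08], [0.40, 0.70], [0.92, 0.99]:
high, low = screen(lower=[0.02, 0.40, 0.92], upper=[0.08, 0.70, 0.99], alpha=0.1)
```

Here, only the first patient is confidently flagged as high-risk (upper endpoint at most 0.1) and only the third as low-risk (lower endpoint at least 0.9).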


2.6. Extension: Doubly Robust Survival Bands


The validity guarantee provided by Theorem 3 relies on the assumption that the censoring model used to estimate the inverse probability of censoring (IPC) weights is asymptotically consistent. Under this condition, our method enables approximately well-calibrated screening rules without requiring consistency of the black-box survival model.
We now present a simple but practically valuable extension that can improve robustness to misspecification of the censoring model. Specifically, this variant aims for double robustness: it yields approximately calibrated screening rules as long as either the censoring model or the survival model is consistent, though not necessarily both. While we do not pursue a formal proof of double robustness here, the underlying idea is intuitive.
For each test point j ∈ [m], we define the adjusted conformal survival band

[L̂DR(t; Xn+j), ÛDR(t; Xn+j)], (13)

where

ÛDR(t; Xn+j) := max{ ϕ̂BHlt(t; Xn+j), ŜT(t | Xn+j) }, (14)

and

L̂DR(t; Xn+j) := min{ 1 − ϕ̂BHrt(t; Xn+j), ŜT(t | Xn+j) }. (15)

Above, ŜT(t | Xn+j) is the point estimate of the survival probability produced by M̂T, while ϕ̂BHlt(t; Xn+j) and 1 − ϕ̂BHrt(t; Xn+j) are the upper and lower endpoints produced by Algorithm 1. This adjusted construction was used in the illustrative example shown in Figure 1.
A useful side benefit of this adjusted band is enhanced interpretability: Equation (13)
guarantees that the predicted survival probability ŜT (t | Xn+j ) always lies within the band
itself. This aligns with common practitioner expectations, simplifying communication.
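In code, this adjustment is a simple clipping step that widens the conformal band just enough to contain the model's point estimate; the numeric inputs in this sketch are hypothetical.

```python
import numpy as np

def doubly_robust_band(lower, upper, s_hat):
    """Widen a conformal band so that the survival model's point estimate
    s_hat = S_T(t | x) always lies inside [lower_dr, upper_dr]."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    s_hat = np.asarray(s_hat, dtype=float)
    upper_dr = np.maximum(upper, s_hat)  # upper endpoint never below s_hat
    lower_dr = np.minimum(lower, s_hat)  # lower endpoint never above s_hat
    return lower_dr, upper_dr

lo_dr, up_dr = doubly_robust_band(lower=[0.3, 0.5], upper=[0.6, 0.7],
                                  s_hat=[0.2, 0.65])
```

By construction, each adjusted band contains the corresponding point estimate, matching the interpretability property described above.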

3. Numerical Experiments with Synthetic Data


3.1. Setup
Synthetic data. We consider four data-generating distributions, summarized in Table 3
(Appendix B), which span a range of interesting settings, inspired by previous works. In each
setting, p = 100 covariates X = (X1 , . . . , Xp ) are generated independently, while T and C
are sampled independently conditional on X, from either a log-normal distribution—log T |
X ∼ N (µ(X), σ(X))—or an exponential distribution—C | X ∼ Exp(λ(X)).
The first three settings are borrowed from Sesia and Svetnik (2024) and are ordered by
decreasing modeling difficulty. The first two simulate challenging scenarios where the true
survival distribution is highly nonlinear or complex, making accurate model fitting difficult
and highlighting the need for robust uncertainty estimation via conformal inference. The
third setting, originally appearing in Candès et al. (2023), corresponds to a simpler survival
distribution that is easier to learn with standard survival models, reducing the gap between
model-based and conformal prediction. Finally, the fourth setting, which is new, introduces
a covariate shift between the training and calibration/test distributions, leading to potential
model misspecification even if the survival model fits the training distribution well, and
further emphasizing the value of conformal inference methods such as CSB.


Design, Methods, and Performance Metrics. We generate independent training and
calibration datasets with varying sample sizes, along with a test dataset containing 1000
samples. Right-censoring is introduced by replacing the true event time T and censoring
time C with the observed time T̃ = min(T, C) and event indicator E = I(T ≤ C). The
censored training data are used to fit survival and censoring models, as described below.
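A minimal sketch of this generative scheme follows; the specific choices of µ(X), σ(X), and λ(X) below are illustrative placeholders rather than the exact specifications of Table 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_censored_data(n, p=100):
    """Draw covariates, a log-normal event time T, an exponential censoring
    time C, and return the censored observables (X, T_tilde, E)."""
    X = rng.uniform(size=(n, p))
    mu = 2.0 + X[:, 0] - X[:, 1]        # placeholder mu(X)
    sigma = 0.5 + 0.5 * X[:, 2]         # placeholder sigma(X)
    lam = 0.05 + 0.05 * X[:, 3]         # placeholder lambda(X)
    T = np.exp(rng.normal(mu, sigma))   # log T | X ~ N(mu(X), sigma(X))
    C = rng.exponential(1.0 / lam)      # C | X ~ Exp(lambda(X))
    T_tilde = np.minimum(T, C)          # observed time
    E = (T <= C).astype(int)            # event indicator
    return X, T_tilde, E

X, T_tilde, E = generate_censored_data(500)
```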
Our goal is to evaluate the proposed conformal survival band (CSB) method and compare it to natural baseline approaches for screening test patients according to rules of the form P[T > t] > p (for low-risk selection) or P[T > t] < p (for high-risk selection), evaluated at various time thresholds t and probability levels p. We consider three methods for screening patients based on estimated conditional survival probabilities: (i) model, using the point estimates of conditional survival probabilities computed by a black-box survival model M̂surv fitted on all available data (training and calibration); (ii) KM, using a Kaplan-Meier estimator fitted on the calibration data; and (iii) CSB, using the conformal survival band produced by Algorithm 1, implemented using the doubly robust extension described in Section 2.6. In addition, we compare against an ideal oracle that uses the true conditional survival probabilities from the data-generating distribution; although not practical, the oracle provides valuable insight into the achievable performance limits.
Performance is evaluated on the independent test set using several metrics. The screened
proportion measures the fraction of test patients selected by the screening rule; this quantity
should ideally be large, reflecting high power, provided that selections are accurate. The
survival rate measures the proportion of selected patients who survive beyond time t, and
should be larger than p for low-risk screening and smaller than p for high-risk screening.
Finally, precision and recall are computed relative to the oracle screening decisions, treating
screening as a binary classification task, and both should ideally be high.
All experiments are repeated 100 times, and results are averaged across repetitions.
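For a single screening rule, these metrics can be computed as in the sketch below; the toy inputs are hypothetical.

```python
import numpy as np

def screening_metrics(selected, oracle_selected, survived):
    """Screened proportion, survival rate among selected patients, and
    precision/recall of the selections relative to the oracle."""
    selected = np.asarray(selected, dtype=bool)
    oracle_selected = np.asarray(oracle_selected, dtype=bool)
    survived = np.asarray(survived, dtype=bool)
    n_selected = selected.sum()
    screened_proportion = selected.mean()
    survival_rate = survived[selected].mean() if n_selected > 0 else float("nan")
    true_positives = (selected & oracle_selected).sum()
    precision = true_positives / n_selected if n_selected > 0 else float("nan")
    recall = true_positives / max(oracle_selected.sum(), 1)
    return screened_proportion, survival_rate, precision, recall

metrics = screening_metrics(selected=[1, 1, 0, 0],
                            oracle_selected=[1, 0, 1, 0],
                            survived=[1, 0, 1, 0])
```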

Models. We consider four model families for fitting the censoring and survival models,
denoted respectively by M̂cens and M̂surv , to ensure consistent comparisons across different
calibration methods. The model families are: (1) grf, a generalized random forest (R package
grf); (2) survreg, an accelerated failure time (AFT) model with a log-normal distribution
(R package survival); (3) rf, a random survival forest (R package randomForestSRC); and
(4) cox, the Cox proportional hazards model (R package survival). The survival model
M̂surv is used both by the model screening method and as input to the proposed conformal
survival band (CSB) method. In contrast, the censoring model M̂cens is only used within
the CSB method, to compute the weights for constructing conformal survival bands.

Leveraging Prior Knowledge on PC|X. To examine the effect of incorporating prior knowledge about the censoring distribution, we fit M̂cens using only the first p1 ≤ p covariates, knowing that C is independent of (Xp1+1, . . . , Xp) given (X1, . . . , Xp1) within the data-generating distributions used in these experiments (Table 3). Therefore, for 10 ≤ p1 ≤ p = 100, this prior knowledge helps improve the censoring model by excluding irrelevant predictors and mitigating overfitting. We start with p1 = 10 and later evaluate the impact of larger p1, representing weaker prior knowledge. The case p1 = p corresponds to no prior knowledge, where all covariates are used to fit the censoring model.


3.2. Results

Effect of the Training Sample Size. Figure 2 compares the performance of the four
methods in Setting 1 (the harder case), using the grf models. The training sample size varies
between 100 and 10,000, while the calibration sample size is fixed at 500. We evaluate two
screening rules: selecting low-risk patients with P (T > 6.00) > 0.80 and selecting high-risk
patients with P (T > 12.00) < 0.80.
Low-risk screening results (top panel). The model selects too many patients, resulting
in an excess of false positives; the average survival rate among its selected patients falls
below 60%, despite the target threshold of 80%. The KM method performs even worse,
selecting nearly all patients, with a survival rate close to 50%. In contrast, CSB achieves the
desired survival rate among selected patients, provided that the training sample size is not
too small, and its performance improves steadily as the sample size increases, approaching
that of the oracle. Even when the training size is small (e.g., 100), and the fitted survival
and censoring models are relatively inaccurate, CSB is more robust than the model. As
expected, the oracle achieves a survival rate above 80%—in fact closer to 100%—while
selecting approximately 50% of the test patients. It is important to note that the survival
rate of selected patients does not need to match the threshold p exactly, even for the oracle,
because many patients have true survival probabilities much higher (or lower) than p.
High-risk screening results (bottom panel). In this setting, all methods lead to subsets
of selected patients whose survival rates are below 80%, as desired. However, the KM
method again selects too many patients, including many whose true survival probabilities
are actually higher than 80%, resulting in lower precision. This behavior stems from the
non-personalized nature of KM estimates: KM cannot flexibly select subsets of patients
based on individual characteristics and must either select nearly all or none. In contrast,
the model and CSB methods achieve higher precision and recall, and both approach the
oracle performance as the training sample size increases.
Figures 5 and 6 in Appendix B present similar results for Settings 2 and 3, respectively. Across these experiments, the CSB method consistently achieves survival rates on the correct side of the target threshold and maintains relatively high precision and recall compared to the other methods, more closely approximating the performance of the ideal oracle as the training sample size increases.

Effect of the Calibration Sample Size. Figure 3 examines the impact of the calibration sample size on the performance of CSB in the same challenging synthetic data setting as before, with the training sample size fixed at 5000. In this regime, where the survival and censoring models are already accurate thanks to a relatively large training set, increasing the calibration sample size improves screening performance: the survival rate among selected patients consistently falls on the correct side of the target threshold, and precision and recall tend to approach the behavior of the ideal oracle. In general, however, if the training sample size is small and the fitted models are inaccurate, it may be more beneficial to prioritize training over calibration. Overall, a moderate calibration sample size (on the order of a few hundred) is typically sufficient to obtain reliable conformal inferences. Figures 7 and 8 in Appendix B present qualitatively similar results from analogous experiments conducted with synthetic data under Settings 2 and 3.


[Figure 2 appears here: four panels (Screened Proportion, Survival Rate, Precision, Recall; mean ± 2 SE) versus training sample size, comparing Model, KM, CSB, and Oracle for (a) low-risk screening with P(T > 6.00) > 0.80 and (b) high-risk screening with P(T > 12.00) < 0.80.]

Figure 2: Effect of training sample size on patient screening performance of different methods in a challenging synthetic data scenario with complex survival and censoring distributions, using grf models. Top: results for low-risk screening at time t = 6; bottom: high-risk screening at time t = 12. The calibration sample size is fixed at 500. The conformal survival band (CSB) method successfully achieves survival rates above the target p = 0.8 for low-risk screening and below p = 0.8 for high-risk screening, while making personalized selections that approach the oracle performance as the training sample size increases.

Effect of the Training Sample Size for the Censoring Model. Figure 4 presents
results from experiments similar to those in Figure 2, but here the censoring model is fit
using only a subset from a total of 5000 training samples. The goal is to study how the
quality of the censoring model affects the performance of the CSB method. In the top
panel (low-risk screening), we observe that when the censoring model is trained on too few
samples, CSB may fail to provide valid screening selections. As the training size increases
and the censoring model improves, the survival rate among selected patients eventually
falls on the correct side of the target threshold, consistent with our asymptotic theoretical
results. In the bottom panel (high-risk screening), CSB maintains validity even with a
small censoring training set, but its power (i.e., ability to identify appropriate patients)
improves as the censoring model quality increases. Figures 9 and 10 in Appendix B present
qualitatively similar results for the relatively easier Settings 2 and 3.

Additional Experiments. Additional experiments studying the effect of the number of
features used to fit the censoring model are presented in Figures 11 and 12 in Appendix B.
These results show that when too many features are included—especially with limited sam-
ple sizes—performance may degrade due to challenges in accurately estimating the censoring
model, consistent with the theory. Tables 4–9 in Appendix B report on additional experi-
ments using different survival and censoring models, leading to qualitatively similar results.


[Figure 3 appears here: four panels (Screened Proportion, Survival Rate, Precision, Recall; mean ± 2 SE) versus calibration sample size, comparing Model, KM, CSB, and Oracle for (a) low-risk screening with P(T > 6.00) > 0.80 and (b) high-risk screening with P(T > 12.00) < 0.80.]

Figure 3: Effect of calibration sample size on patient screening performance with conformal
survival bands (CSB) in a challenging synthetic data scenario with complex survival and
censoring distributions, as in Figure 2. The training sample size is fixed at 5000. The CSB
method successfully achieves survival rates above the target p = 0.8 for low-risk screening
and below p = 0.8 for high-risk screening, while making selections that more closely resemble
those of an ideal oracle as the calibration sample size increases.

Effect of Distribution Shift. Finally, we evaluate robustness under a distribution shift between the training and calibration/test distributions, corresponding to Setting 4 described
in Table 3. Specifically, we now allow the survival distribution to change between training
and calibration/test phases, which may lead to model misspecification even if the survival
model fits the training data well. In these experiments, the training set consists of 5000
samples and the calibration set consists of 500 samples.
Table 1 summarizes the performance of different methods for low-risk screening under this setting. The "low-quality model" refers to a survival model trained on the shifted training distribution, resulting in inaccurate predictions on the calibration and test distributions. In this case, the model baseline selects all patients as "low risk" at time t = 3, but the observed survival rate is only about 54%, far below the target threshold of p = 0.80. By contrast, the KM and CSB methods conservatively select no patients, thereby maintaining validity at the cost of reduced power. With a "high-quality model" trained directly on the calibration/test distribution, the CSB method successfully screens approximately half of the patients while maintaining valid survival rates, closely approaching oracle performance. Meanwhile, the KM method remains overly conservative due to its lack of adaptivity, continuing to select no patients even when safe screening is feasible. Figure 1, introduced in Section 1, illustrates survival curves, conformal survival bands, and screening decisions corresponding to these experiments for four example patients in the low-risk setting.
Additional results for high-risk screening under distribution shift are presented in Table 10 in Appendix B. In this case, model misspecification primarily reduces the power of all


[Figure 4 appears here: four panels (Screened Proportion, Survival Rate, Precision, Recall; mean ± 2 SE) versus the training sample size used for the censoring model, comparing Model, KM, CSB, and Oracle for low-risk screening with P(T > 6.00) > 0.80 (top) and high-risk screening with P(T > 12.00) < 0.80 (bottom).]

Figure 4: Effect of the training sample size used for fitting the censoring model on patient
screening performance with conformal survival bands (CSB) in a challenging synthetic data
scenario with complex survival and censoring distributions, as in Figure 2. The overall
training sample size is fixed at 5000, but only a subset is used to fit the censoring model.
CSB tends to produce higher-quality selections as the censoring model improves with more
training data, achieving survival rates on the correct side of the target threshold (top), and
improving power toward oracle performance (bottom).

Table 1: Performance of different methods for screening low-risk patients with P (T > 3) >
0.80 under distribution shift (Setting 4 from Table 3). Results are shown separately for a
low-quality survival model trained on the shifted training distribution and a high-quality
model trained directly on the calibration/test distribution. The survival rate highlighted in
red illustrates how a misspecified grf model can lead to invalid screening rules, resulting in
substantially lower survival rates than expected among patients labeled as “low-risk.”

Method Screened Survival Precision Recall


Low-Quality Model
Model 1.000 ± 0.000 0.538 ± 0.003 0.499 ± 0.003 1.000 ± 0.000
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.499 ± 0.003 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000
High-Quality Model
Model 0.496 ± 0.004 1.000 ± 0.000 1.000 ± 0.000 0.994 ± 0.001
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.488 ± 0.004 1.000 ± 0.000 1.000 ± 0.000 0.978 ± 0.004
Oracle 0.499 ± 0.004 1.000 ± 0.000 1.000 ± 0.000 1.000 ± 0.000


methods, leading to fewer selections. A similar figure illustrating survival curves, conformal
bands, and screening decisions for the high-risk case is provided in Appendix B as Figure 13.

4. Application to Real Data


4.1. Setup
We apply our method to the same seven publicly available datasets analyzed by Sesia and Svetnik (2024): COLON, GBSG, HEART, METABRIC, PBC, RETINOPATHY, and VALCT. Details regarding the number of observations, covariates, and data sources are provided in Table 11 in Appendix C. Additional information about the standard preprocessing procedures applied to these datasets, designed to ensure compatibility with various survival and censoring models, is also available in Appendix C.
We follow the procedure from Section 3, comparing CSB to the model and KM benchmarks. Since the true data-generating distribution is unknown, the oracle method is not applicable. All methods are applied separately using each of the same four model types from Section 3 to estimate the survival distribution—generalized random forests (grf), parametric survival regression (survreg), random survival forests (rf), and Cox proportional hazards models (cox)—with the censoring distribution always estimated using grf. Each dataset is split into 60% training, 20% calibration, and 20% testing sets, and all experiments are repeated 100 times with independent random splits.
As in the synthetic experiments, we screen test patients using rules of the form P [T >
t] > p (low-risk) or P [T > t] < p (high-risk), evaluated at fixed time thresholds t and
probability levels p. We consider four tasks: low- and high-risk selection with p = 0.8 at
time t1 , and with p = 0.25 at t2 . To facilitate comparisons while accounting for heterogeneity
in the distribution of survival times across data sets, the time thresholds t1 and t2 are set
to the 0.1 and 0.9 quantiles of all observed times, respectively.
Performance is evaluated on the test set using two metrics. The screened proportion
measures the fraction of test patients selected for each screening task, and should ideally
be large under accurate selection. The survival rate is the proportion of selected patients
who survive beyond time t. Since this cannot be computed exactly due to censoring, we
report deterministic bounds: the lower bound treats censored patients as failures at their
censoring times, and the upper bound assumes they survive indefinitely. Together, these
provide a conservative interval for the true survival rate among selected patients.
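These deterministic bounds can be computed as in the sketch below (hypothetical inputs): the lower bound counts only patients observed to survive past t, while the upper bound excludes only observed failures before t.

```python
import numpy as np

def survival_rate_bounds(observed_time, event, t):
    """Bounds on the survival rate P(T > t) among selected patients when
    test outcomes are right-censored."""
    observed_time = np.asarray(observed_time, dtype=float)
    event = np.asarray(event, dtype=bool)
    known_survivor = observed_time > t            # survived past t for sure
    known_failure = event & (observed_time <= t)  # failed before t for sure
    lower = known_survivor.mean()       # treat early-censored patients as failures
    upper = 1.0 - known_failure.mean()  # assume they survive indefinitely
    return lower, upper

bounds = survival_rate_bounds(observed_time=[5, 2, 8, 3], event=[1, 0, 0, 1], t=4)
```

In this toy example, the patient censored at time 2 is ambiguous, so the true survival rate lies somewhere in the reported interval.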

4.2. Results
Table 2 summarizes the results of this data analysis, aggregating the performance of each screening method across tasks, datasets, and repetitions, separately for each survival model. For each combination of survival model and screening method, we report the average proportion of test patients selected (screened proportion) and the empirical distribution of verification outcomes: valid, dubious, and invalid. Verification outcomes are determined for each task by assessing whether the survival rate bounds, averaged over 100 random repetitions, fall entirely on the correct side of the threshold p (valid), straddle it (dubious), or lie entirely on the wrong side (invalid), accounting for uncertainty via two standard errors.


Table 2: Summary of screening performance for high- and low-risk patient selection methods based on different survival models, aggregated across different datasets, screening tasks, and repetitions. The screened proportion denotes the average fraction of test patients selected. Although false positives cannot be directly verified due to censoring, approximate verification is possible and classified as valid, dubious, or invalid.

Verification Outcome
Method Survival Model Screened Proportion Valid Dubious Invalid
Model grf 0.500 0.643 0.357 0.000
survreg 0.500 0.714 0.250 0.036
rf 0.500 0.536 0.464 0.000
Cox 0.500 0.571 0.393 0.036
KM grf 0.500 0.786 0.214 0.000
survreg 0.500 0.786 0.214 0.000
rf 0.500 0.786 0.214 0.000
Cox 0.500 0.786 0.214 0.000
CSB grf 0.204 0.929 0.071 0.000
survreg 0.002 1.000 0.000 0.000
rf 0.201 0.893 0.107 0.000
Cox 0.188 0.929 0.071 0.000

CSB yields the highest proportion of valid selections across all survival models, with few or no dubious or invalid outcomes, despite screening fewer patients. In contrast, the Model and KM benchmarks screen more patients but exhibit lower verification quality. These results illustrate a fundamental, if unsurprising, trade-off between power and reliability.
Compared to the synthetic data experiments from Section 3, the advantage of CSB is somewhat less obvious here. This may be due in part to imperfect verification, stemming from censoring in the test data and the absence of oracle knowledge. Moreover, survival models may estimate the conditional survival distributions more accurately here than in the synthetic settings, reducing the relative gains from conformal inference. As noted by Sesia and Svetnik (2024), these datasets do not appear to be especially difficult to model. Nevertheless, in general, it remains difficult to assess the reliability of screening results produced by black-box methods. Our approach can provide more confidence in the calibration of such screening results, which may be particularly valuable in high-stakes applications where the cost of false positives is substantial, even if it entails some reduction in power.
Additional breakdowns of these results for specific screening tasks are provided in Ap-
pendix C, with low-risk and high-risk summaries shown in Tables 12 and 13, respectively.

5. Discussion
This paper introduced a conformal inference method for constructing uncertainty bands
around individual survival curves under right-censoring, enabling statistically principled
personalized risk screening under minimal assumptions on the data-generating process.


Experiments with synthetic data showed that screening based on uncalibrated black-box models can be unreliable, particularly in hard-to-model settings, whereas our method provides greater robustness, albeit with some conservativeness. In our real data applications, standard survival models seemed to produce reasonable screening performance, likely because the fitted models were able to approximate the true survival distribution quite well. Nonetheless, our method remains appealing in more complex or high-stakes scenarios, where model misspecification is a concern and formal uncertainty quantification is essential.
Two limitations of the current approach are its reliance on asymptotic FDR control
and its focus on pointwise inference, which assumes that the screening thresholds t and p
are fixed in advance. While it may be possible to obtain finite-sample and simultaneous
inference guarantees, doing so would likely require a significantly more conservative method.
Understanding this trade-off between statistical rigor and screening power in greater depth
presents an intriguing direction for future research.
Another promising direction for improvement concerns the treatment of censored calibration points. In Section 2.4, we proposed reallocating these samples to the training set but did not do this in practice due to concerns about sampling bias potentially degrading model quality. Recent work by Farina et al. (2025) introduces a more complex strategy for incorporating censored observations directly into the calibration phase, which may be possible to adapt to our setting. A potentially simpler alternative is to reallocate censored calibration samples to the training set and apply importance weighting during training to mitigate bias, while retaining the calibration procedure described in this paper, which uses only uncensored observations. Both strategies deserve further investigation.
Additional extensions include de-randomizing our inferences via model aggregation across
different data splits (Carlsson et al., 2014; Linusson et al., 2017), potentially using e-values
(Vovk and Wang, 2023; Wang and Ramdas, 2022; Bashari et al., 2023), and adapting the
method to handle noisy or corrupted data (Sesia et al., 2024; Clarkson et al., 2024).
Software Availability Open-source software implementing our methods and experi-
ments is available at https://github.com/msesia/conformal_survival_screening.


References
Rina Foygel Barber. Is distribution-free inference possible for binary regression? Electron.
J. Statist., 14(2):3487–3524, 2020.

Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. Con-
formal prediction beyond exchangeability. Ann. Stat., 51(2):816–845, 2023.

Avinash Barnwal, Hyunsu Cho, and Toby Hocking. Survival regression with accelerated
failure time model in xgboost. J. Comput. Graph. Stat., 31(4):1292–1302, 2022.

Meshi Bashari, Amir Epstein, Yaniv Romano, and Matteo Sesia. Derandomized novelty detection with FDR control via conformal e-values. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 65585–65596. Curran Associates, Inc., 2023.

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and
powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol., 57(1):289–300,
1995.

AL Blair, DR Hadden, JA Weaver, DB Archer, PB Johnston, and CJ Maguire. The 5-year
prognosis for vision in diabetes. The Ulster Medical Journal, 49(2):139, 1980.

Henrik Boström, Lars Asker, Ram Gurung, Isak Karlsson, Tony Lindgren, and Panagiotis
Papapetrou. Conformal prediction using random survival forests. In 2017 16th IEEE
International Conference on Machine Learning and Applications (ICMLA), pages 812–
817. IEEE, 2017.

Henrik Boström, Ulf Johansson, and Anders Vesterberg. Predicting with confidence from
survival data. In Conformal and Probabilistic Prediction and Applications, pages 123–141.
PMLR, 2019.

Henrik Boström, Henrik Linusson, and Anders Vesterberg. Mondrian predictive systems for
censored data. In Harris Papadopoulos, Khuong An Nguyen, Henrik Boström, and Lars
Carlsson, editors, Proceedings of the Twelfth Symposium on Conformal and Probabilistic
Prediction with Applications, volume 204 of Proceedings of Machine Learning Research,
pages 399–412. PMLR, 13–15 Sep 2023.

Emmanuel Candès, Lihua Lei, and Zhimei Ren. Conformalized survival analysis. J. R. Stat.
Soc. Ser. B Methodol., 85(1):24–45, 2023.

Lars Carlsson, Martin Eklund, and Ulf Norinder. Aggregated conformal prediction. In Arti-
ficial Intelligence Applications and Innovations: AIAI 2014 Workshops: CoPA, MHDW,
IIVC, and MT4BD, Rhodes, Greece, September 19-21, 2014. Proceedings 10, pages 231–
240. Springer, 2014.

Jason Clarkson, Wenkai Xu, Mihai Cucuringu, and Gesine Reinert. Split conformal
prediction under data contamination. In Simone Vantini, Matteo Fontana, Aldo Solari,
Henrik Boström, and Lars Carlsson, editors, Proceedings of the Thirteenth Symposium
on Conformal and Probabilistic Prediction with Applications, volume 230 of Proceedings
of Machine Learning Research, pages 5–27. PMLR, 09–11 Sep 2024.

David R Cox. Regression models and life-tables. J. R. Stat. Soc. Ser. B Methodol., 34(2):
187–202, 1972.

John Crowley and Marie Hu. Covariance analysis of heart transplant survival data. J. Am.
Stat. Assoc., 72(357):27–36, 1977.

Christina Curtis, Sohrab P Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M Rueda,
Mark J Dunning, Doug Speed, Andy G Lynch, Shamith Samarajiwa, Yinyin Yuan, et al.
The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel sub-
groups. Nature, 486(7403):346–352, 2012.

Hen Davidov, Shai Feldman, Gil Shamai, Ron Kimmel, and Yaniv Romano. Conformal-
ized survival analysis for general right-censored data. In The Thirteenth International
Conference on Learning Representations, 2024.

Rebecca Farina, Eric J Tchetgen Tchetgen, and Arun Kumar Kuchibhotla. Doubly robust
and efficient calibration of prediction sets for censored time-to-event outcomes. arXiv
preprint arXiv:2501.04615, 2025.

Matteo Fontana, Gianluca Zeni, and Simone Vantini. Conformal prediction: a unified review
of theory and new challenges. Bernoulli, 29(1):1–23, 2023.

Yu Gui, Rohan Hore, Zhimei Ren, and Rina Foygel Barber. Conformalized survival analysis
with adaptive cut-offs. Biometrika, 111(2):459–477, 2024.

Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. Random
survival forests. Ann. Appl. Stat., 2(3):841 – 860, 2008.

Ying Jin and Emmanuel J Candès. Model-free selective inference under covariate shift via
weighted conformal p-values. arXiv preprint arXiv:2307.09291, 2023a.

Ying Jin and Emmanuel J Candès. Selection by prediction with conformal p-values. J.
Mach. Learn. Res., 24(244):1–41, 2023b.

John D Kalbfleisch and Ross L Prentice. The statistical analysis of failure time data. John
Wiley & Sons, 2002.

Edward L Kaplan and Paul Meier. Nonparametric estimation from incomplete observations.
J. Am. Stat. Assoc., 53(282):457–481, 1958.

Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang,
and Yuval Kluger. DeepSurv: personalized treatment recommender system using a Cox
proportional hazards deep neural network. BMC Med. Res. Methodol., 18:1–12, 2018.

Henrik Linusson, Ulf Norinder, Henrik Boström, Ulf Johansson, and Tuve Löfström. On the
calibration of aggregated conformal predictors. In Conformal and probabilistic prediction
and applications, pages 154–173. PMLR, 2017.


Charles G Moertel, Thomas R Fleming, John S Macdonald, Daniel G Haller, John A Laurie,
Phyllis J Goodman, James S Ungerleider, William A Emerson, Douglas C Tormey, John H
Glick, et al. Levamisole and fluorouracil for adjuvant therapy of resected colon carcinoma.
New England Journal of Medicine, 322(6):352–358, 1990.

Shi-ang Qi, Yakun Yu, and Russell Greiner. Conformalized survival distributions: A generic
post-process to increase calibration. arXiv preprint arXiv:2405.07374, 2024.

Jing Qin, Jin Piao, Jing Ning, and Yu Shen. Conformal predictive intervals in survival
analysis: a re-sampling approach. arXiv preprint arXiv:2408.06539, 2024.

James M Robins. Information recovery and bias adjustment in proportional hazards
regression analysis of randomized trials using surrogate markers. In Proc. Biopharm. Sect.
Am. Stat. Assoc., volume 24, pages 24–33. San Francisco CA, 1993.

James M Robins and Andrea Rotnitzky. Recovery of information and adjustment for
dependent censoring using surrogate markers. In AIDS epidemiology: methodological issues,
pages 297–331. Springer, 1992.

Matteo Sesia and Vladimir Svetnik. Doubly robust conformalized survival analysis with
right-censored data. arXiv preprint arXiv:2412.09729, 2024.

Matteo Sesia, YX Rachel Wang, and Xin Tong. Adaptive conformal classification with noisy
labels. J. R. Stat. Soc. Ser. B Methodol., page qkae114, 2024.

Annette Spooner, Emily Chen, Arcot Sowmya, Perminder Sachdev, Nicole A Kochan, Julian
Trollor, and Henry Brodaty. A comparison of machine learning methods for survival
analysis of high-dimensional clinical data for dementia prediction. Scientific Reports, 10
(1):20410, 2020.

Xiaolin Sun and Yanhua Wang. Conformal prediction with censored data using Kaplan-Meier
method. In Journal of Physics: Conference Series, volume 2898, page 012030. IOP
Publishing, 2024.

Terry M Therneau and Patricia M Grambsch. Modeling Survival Data: Extending the Cox
Model. Springer, 2000.

Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candès, and Aaditya Ramdas. Conformal
prediction under covariate shift. Advances in Neural Information Processing Systems,
32, 2019.

Vladimir Vovk and Ruodu Wang. Confidence and discoveries with e-values. Statistical
Science, 38(2):329–354, 2023.

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a
Random World, volume 29. Springer, 2005.

Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge
University Press, 2019.

Ruodu Wang and Aaditya Ramdas. False discovery rate control with e-values. J. R. Stat.
Soc. Ser. B Methodol., 84(3):822–852, 2022.

Menghan Yi, Ze Xiao, Huixia Judy Wang, and Yanlin Tang. Survival conformal prediction
under random censoring. Stat, 14(2):e70052, 2025.

Sesia Svetnik

Appendix A. Mathematical Proofs


A.1. Proof of Proposition 1
Proof [of Proposition 1]
Under the null hypothesis that Tn+j ≥ t, because ŝlt(t; x) is monotone increasing in t for
all x, we have, almost surely,
\[
\hat{s}_{lt}(t; X_{n+j}) \leq \hat{s}_{lt}(T_{n+j}; X_{n+j}),
\]
and thus
\[
\tilde{\phi}_{lt}(t; X_{n+j}) := \frac{1 + \sum_{i=1}^{n} \mathbb{I}\left\{\hat{s}_{lt}(T_i; X_i) \geq \hat{s}_{lt}(t; X_{n+j})\right\}}{n+1}
\geq \frac{1 + \sum_{i=1}^{n} \mathbb{I}\left\{\hat{s}_{lt}(T_i; X_i) \geq \hat{s}_{lt}(T_{n+j}; X_{n+j})\right\}}{n+1}.
\]
Hence,
\[
\mathbb{P}\left[\tilde{\phi}_{lt}(t; X_{n+j}) \leq \alpha, \, T_{n+j} \geq t\right]
\leq \mathbb{P}\left[\frac{1 + \sum_{i=1}^{n} \mathbb{I}\left\{\hat{s}_{lt}(T_i; X_i) \geq \hat{s}_{lt}(T_{n+j}; X_{n+j})\right\}}{n+1} \leq \alpha\right]
\leq \alpha,
\]
where the last inequality follows through standard conformal inference arguments from the
fact that the scores (ŝlt(T1; X1), . . . , ŝlt(Tn; Xn), ŝlt(Tn+j; Xn+j)) are exchangeable.
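The conformal p-value analyzed in this proof is simple to compute from the calibration scores. Below is a minimal sketch (the function name is ours, for illustration; this is not the interface of the accompanying software package):

```python
import numpy as np

def conformal_p_value(cal_scores, test_score):
    """Rank-based conformal p-value: (1 + #{i : score_i >= test_score}) / (n + 1)."""
    cal_scores = np.asarray(cal_scores)
    n = len(cal_scores)
    return (1 + np.sum(cal_scores >= test_score)) / (n + 1)

# Toy example with n = 9 calibration scores, playing the role of s_lt(T_i; X_i).
cal = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
p = conformal_p_value(cal, 0.65)  # three calibration scores are >= 0.65
print(p)  # 0.4
```

By exchangeability of the n + 1 scores, this quantity is super-uniform under the null, which is exactly the property invoked in the last display of the proof.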
A.2. Proofs of Theorems 2 and 3


Proof [of Theorem 2: Asymptotic validity of IPCW p-values] We apply the finite-sample
bound from Theorem 5, which states that for any t > 0 and any n ≥ 1,
\[
\mathbb{P}\left[\hat{\phi}_{lt}(t; X_{n+1}) \leq \alpha, \, T_{n+1} \geq t\right]
\leq \alpha + \frac{1}{\omega_{\min}\pi}\left[\frac{2(2\omega_{\min}+1)}{n} + 8\sqrt{\frac{2\log(2n\sqrt{n})}{n}}\right] + 2\Delta_N.
\]
By assumption:

• ŵ(T; X) ≥ ωmin > 0 almost surely;

• $\Delta_N := \big(\mathbb{E}\big[\big(\tfrac{1}{\hat w(T;X)} - \tfrac{1}{w^*(T;X)}\big)^2\big]\big)^{1/2} \to 0$ as N → ∞.

Since the remaining terms in the bound all vanish as N, n → ∞, it follows that
\[
\limsup_{N,n\to\infty} \mathbb{P}\left[\hat{\phi}_{lt}(t; X_{n+1}) \leq \alpha, \, T_{n+1} \geq t\right] \leq \alpha.
\]
An identical argument applies to the limit involving ϕ̂rt(t; Xn+1), completing the proof.
Proof [of Theorem 3: Asymptotic FDR control] We prove the result for the left-tail p-values
ϕ̂lt (t; Xn+j ); the argument for right-tail p-values is analogous.
The proof proceeds in three steps:


(1) We compare each empirical p-value to an oracle counterpart and establish uniform
closeness.

(2) We show that the BH rejection sets based on these two sets of p-values are close with
high probability.

(3) We conclude FDR control for the empirical procedure via stability and the oracle
FDR guarantee.

Step 1: Empirical p-values are close to oracle p-values. For each test point
j ∈ {1, . . . , m}, define the empirical and oracle p-values as
\[
p_j := \hat\phi_{lt}(t; X_{n+j}), \qquad
p_j^* := \phi^*_{lt}(t; X_{n+j}) := \mathbb{P}\left( \hat s_{lt}(T; X) \geq \hat s_{lt}(t; X_{n+j}) \right),
\]
where the probability in p∗j is taken over an independent copy (T, X) ∼ P.
By Corollary 6, for any δ > 0, with probability at least 1 − δ over the calibration data,
\[
\max_{1 \leq j \leq m} |p_j - p_j^*| \leq \epsilon_n, \qquad \text{where} \quad
\epsilon_n := \frac{1}{\omega_{\min}\pi}\left[ \frac{2(\omega_{\min}+1)}{n} + 8\sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]

Step 2: Stability of BH rejection sets. Let R := {j : pj ≤ τ̂} and R∗ := {j : p∗j ≤ τ̂∗}
be the rejection sets obtained by applying the BH procedure at level α to the empirical and
oracle p-values, respectively.
Before we can apply Lemma 8 (a general stability bound for the BH procedure), we
need to verify that the oracle p-values p∗j have a density bounded above. Note that
\[
p_j^* = \phi^*_{lt}(t; X_{n+j}) = 1 - F\big(\hat s_{lt}(t; X_{n+j})\big),
\]
where F is the CDF of the score variable ŝlt(T; X), with density f = F′, and where ft
denotes the density of ŝlt(t; Xn+j); both densities are assumed to be uniformly bounded
between fmin and fmax, across all t > 0. Then, the change-of-variables formula gives that
the density f∗ of p∗j is
\[
f^*(u) = \frac{f_t\big(F^{-1}(1-u)\big)}{f\big(F^{-1}(1-u)\big)},
\]
which is bounded from both above and below:
\[
\frac{f_{\min}}{f_{\max}} \leq f^*(u) \leq \frac{f_{\max}}{f_{\min}} \qquad \text{for all } u \in [0, 1].
\]
Now we can apply Lemma 8 with p = (p1, . . . , pm) and q = (p∗1, . . . , p∗m), obtaining
\[
\mathbb{P}(R \neq R^*) \leq \delta + 2\epsilon_n m^2 \cdot \frac{f_{\max}}{f_{\min}}.
\]


Step 3: Transferring FDR control from the oracle to the empirical procedure.
By Lemma 4, the oracle p-values p∗j satisfy the super-uniformity property
\[
\mathbb{P}(p_j^* \leq \alpha, \, T_{n+j} \geq t) \leq \alpha, \qquad \text{for all } \alpha \in (0, 1),
\]
and the p-values p∗j are mutually independent (conditional on the calibration data). Therefore,
by Theorem 9, the BH procedure applied to (p∗1, . . . , p∗m) controls the FDR at level α:
\[
\mathrm{FDR}^* := \mathbb{E}\left[ \frac{|R^* \cap \mathcal{H}_0|}{|R^*| \vee 1} \right] \leq \alpha,
\]
where H0 := {j : Tn+j ≥ t} is the (random) set of true nulls.
Conclusion. By the definition of the FDR and the triangle inequality,
\[
|\mathrm{FDR} - \mathrm{FDR}^*| \leq \mathbb{P}(R \neq R^*).
\]
Combining this with the oracle guarantee FDR∗ ≤ α and the stability bound from Lemma 8,
we obtain the finite-sample bound
\[
\mathrm{FDR} \leq \alpha + \delta + 2\epsilon_n m^2 \cdot \frac{f_{\max}}{f_{\min}}.
\]
Letting δ → 0 and noting that ϵn → 0 as n → ∞, we conclude, for fixed m,
\[
\limsup_{N, n \to \infty} \mathrm{FDR}_n \leq \alpha.
\]
Moreover, if m = mn → ∞ and m²n ϵn → 0, then the finite-sample error term vanishes,
\[
\delta + 2\epsilon_n m_n^2 \cdot \frac{f_{\max}}{f_{\min}} \to 0,
\]
and we conclude the joint limit
\[
\limsup_{N, m, n \to \infty} \mathrm{FDR}_{m,n} \leq \alpha.
\]
A.3. Auxiliary Results


A.3.1. Validity of Oracle p-Values
Lemma 4 Let (X, T ) and (X ′ , T ′ ) be independent copies of a pair of random variables
taking values in X × R, and let s : R × X → R be a fixed function that is monotone
increasing in its first argument. Define the oracle p-value function

ϕ∗ (t, x) := P (s(T, X) ≥ s(t, x)) ,

where the probability is taken over (X, T ). Then, for any α ∈ [0, 1],

\[
\mathbb{P}\left( \phi^*(t, X') \leq \alpha, \, T' \geq t \right) \leq \alpha.
\]


26
Conformal Survival Bands for Risk Screening under Right-Censoring

Proof [of Lemma 4] Let S := −s(T, X) and S ′ := −s(T ′ , X ′ ), and define the cumulative
distribution function FS (u) := P(S ≤ u). By definition of ϕ∗ , we can write:

ϕ∗ (t, X ′ ) = P(S ≤ −s(t, X ′ )) = FS (−s(t, X ′ )).

Now observe that, under the condition T ′ ≥ t and monotonicity of s in t, we have:

s(t, X ′ ) ≤ s(T ′ , X ′ ) ⇒ −s(t, X ′ ) ≥ −s(T ′ , X ′ ),

and since FS is non-decreasing, we have the almost-sure inequality:

ϕ∗ (t, X ′ ) = FS (−s(t, X ′ )) ≥ FS (−s(T ′ , X ′ )) = FS (S ′ ).

Hence,
\[
\mathbb{P}\left( \phi^*(t, X') \leq \alpha, \, T' \geq t \right)
\leq \mathbb{P}\left( F_S(S') \leq \alpha, \, T' \geq t \right)
\leq \mathbb{P}\left( F_S(S') \leq \alpha \right) \leq \alpha,
\]
where the final inequality uses the probability integral transform: since S′ has the same
distribution as S and FS is its CDF, the random variable FS(S′) is super-uniform, that is,
P(FS(S′) ≤ α) ≤ α for all α, with equality when FS is continuous.
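Lemma 4 can be sanity-checked by simulation: for any joint model of (X, T) and any score monotone in its first argument, the joint probability P(ϕ*(t, X′) ≤ α, T′ ≥ t) should not exceed α. A small Monte Carlo sketch, under an arbitrary illustrative model of our own choosing (not one from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Hypothetical model, chosen only for illustration:
    # X ~ N(0, 1) and T | X ~ LogNormal(mu = X / 2, sigma = 1).
    x = rng.normal(size=n)
    t = rng.lognormal(mean=x / 2, sigma=1.0)
    return x, t

# Score s(t, x) = t - x, which is monotone increasing in its first argument.
N = 200_000
X, T = sample(N)
S_ref = np.sort(T - X)  # reference sample of scores s(T, X)

def phi_star(t, x):
    # Monte Carlo approximation of the oracle p-value P(s(T, X) >= s(t, x)).
    return 1.0 - np.searchsorted(S_ref, t - x) / N

# Empirical frequency of the joint event {phi*(t0, X') <= alpha, T' >= t0}.
t0, alpha = 1.5, 0.2
Xp, Tp = sample(N)
joint_freq = np.mean((phi_star(t0, Xp) <= alpha) & (Tp >= t0))
# joint_freq should be at most alpha, up to Monte Carlo error.
```

Any other monotone score and any other joint distribution would work equally well; the lemma makes no further assumptions.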
A.3.2. Finite-sample version of Theorem 2


Theorem 5 Under the assumptions of Theorem 2, for any t > 0,
\[
\mathbb{P}\left[ \hat\phi_{lt}(t; X_{n+1}) \leq \alpha, \, T_{n+1} \geq t \right]
\leq \alpha + \frac{1}{\omega_{\min}\pi} \left[ \frac{2(2\omega_{\min}+1)}{n} + 8\sqrt{\frac{2\log(2n\sqrt{n})}{n}} \right] + 2\Delta_N,
\]
and
\[
\mathbb{P}\left[ \hat\phi_{rt}(t; X_{n+1}) \leq \alpha, \, T_{n+1} \leq t \right]
\leq \alpha + \frac{1}{\omega_{\min}\pi} \left[ \frac{2(2\omega_{\min}+1)}{n} + 8\sqrt{\frac{2\log(2n\sqrt{n})}{n}} \right] + 2\Delta_N.
\]

Proof [of Theorem 5] We prove the result for ϕ̂lt; the argument for ϕ̂rt is identical and
omitted. For δ ∈ (0, 1), define the event
\[
E_{t,\delta} := \left\{ \sup_{x \in \mathbb{R}^d} \left| \hat\phi_{lt}(t; x) - \phi^*_{lt}(t; x) \right| \leq \epsilon_n(\delta) \right\},
\qquad
\epsilon_n(\delta) := \frac{1}{\omega_{\min}\pi} \left[ \frac{2(\omega_{\min}+1)}{n} + 8\sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]
Corollary 6 implies
\[
\begin{aligned}
\mathbb{P}\big[ \hat\phi_{lt}(t; x) \leq \alpha, \, T_{n+1} \geq t \mid X_{n+1} = x \big]
&= \mathbb{P}\big[ \hat\phi_{lt}(t; x) \leq \alpha, \, E_{t,\delta}, \, T_{n+1} \geq t \mid X_{n+1} = x \big]
+ \mathbb{P}\big[ \hat\phi_{lt}(t; x) \leq \alpha, \, E_{t,\delta}^c, \, T_{n+1} \geq t \mid X_{n+1} = x \big] \\
&\leq \mathbb{P}\big[ \phi^*_{lt}(t; x) \leq \alpha + \epsilon_n(\delta), \, T_{n+1} \geq t \mid X_{n+1} = x \big] + \delta.
\end{aligned}
\]
Therefore,
\[
\mathbb{P}\big[ \hat\phi_{lt}(t; X_{n+1}) \leq \alpha, \, T_{n+1} \geq t \big]
\leq \mathbb{P}\big[ \phi^*_{lt}(t; X_{n+1}) \leq \alpha + \epsilon_n(\delta), \, T_{n+1} \geq t \big] + \delta
\leq \alpha + \epsilon_n(\delta) + \delta,
\]
where the second inequality follows directly from Lemma 4. Finally, setting δ = 1/n and
simplifying,
\[
\mathbb{P}\big[ \hat\phi_{lt}(t; X_{n+1}) \leq \alpha, \, T_{n+1} \geq t \big]
\leq \alpha + \frac{1}{\omega_{\min}\pi} \left[ \frac{2(2\omega_{\min}+1)}{n} + 8\sqrt{\frac{2\log(2n\sqrt{n})}{n}} \right] + 2\Delta_N.
\]
Corollary 6 (Specialized Instance of Theorem 7) Under the assumptions of Theorem 2,
for any x ∈ R^d and t > 0, define
\[
\phi^*_{lt}(t; x) := \mathbb{P}\left( \hat s_{lt}(T; X) \geq \hat s_{lt}(t; x) \right),
\qquad
\phi^*_{rt}(t; x) := \mathbb{P}\left( \hat s_{rt}(T; X) \geq \hat s_{rt}(t; x) \right),
\]
where the probabilities are taken with respect to the randomness in (X, T).
Then, for any t > 0 and δ ∈ (0, 1), with probability at least 1 − δ over the randomness
in the calibration data D = {(Xi, Ti, Ci)}_{i=1}^n,
\[
\sup_{x \in \mathbb{R}^d} \left| \hat\phi_{lt}(t; x) - \phi^*_{lt}(t; x) \right|
\leq \frac{1}{\omega_{\min}\pi} \left[ \frac{2(\omega_{\min}+1)}{n} + 8\sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]
Similarly, for any t > 0 and δ ∈ (0, 1), with probability at least 1 − δ,
\[
\sup_{x \in \mathbb{R}^d} \left| \hat\phi_{rt}(t; x) - \phi^*_{rt}(t; x) \right|
\leq \frac{1}{\omega_{\min}\pi} \left[ \frac{2(\omega_{\min}+1)}{n} + 8\sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]

Proof [of Corollary 6] We prove the bound for ϕ̂lt; the argument for ϕ̂rt is identical and
omitted. Define the following variables:
\[
Z_i = (X_i, T_i), \qquad
D_i = \mathbb{1}(T_i \leq C_i), \qquad
E_i = \hat w(T_i; X_i), \qquad
e(Z_i) = w^*(T_i; X_i), \qquad
\psi(Z_i) = \hat s_{lt}(T_i; X_i).
\]
Recall the definitions of Rn(u) and R(u) from Theorem 7:
\[
R_n(u) := \frac{1 + \sum_{i=1}^n (D_i/E_i) \cdot \mathbb{1}(\psi(Z_i) \geq u)}{1 + \sum_{i=1}^n (D_i/E_i)},
\qquad
R(u) := \mathbb{E}\left[ \frac{D}{e(Z)} \cdot \mathbb{1}(\psi(Z) \geq u) \right].
\]
For any x ∈ R^d and t > 0, these definitions give
\[
\hat\phi_{lt}(t; x) = R_n\big(\hat s_{lt}(t; x)\big), \qquad \phi^*_{lt}(t; x) = R\big(\hat s_{lt}(t; x)\big).
\]
Therefore,
\[
\sup_{x \in \mathbb{R}^d} \left| \hat\phi_{lt}(t; x) - \phi^*_{lt}(t; x) \right|
= \sup_{x \in \mathbb{R}^d} \left| R_n\big(\hat s_{lt}(t; x)\big) - R\big(\hat s_{lt}(t; x)\big) \right|
\leq \sup_{u \in \mathbb{R}} |R_n(u) - R(u)|,
\]
and the result follows directly from Theorem 7, noting that Ei ≥ ωmin and E[D] = π.
A.3.3. Uniform Concentration for a General Reweighted Survival Estimator

Theorem 7 Let (Zi, Di, Ei)_{i=1}^n be i.i.d. samples from some unknown distribution P, where
Zi ∈ R^d, Di ∈ {0, 1}, and Ei ∈ [e, 1] almost surely, for some constant e > 0. For any z ∈ R^d,
define e(z) := P(Di = 1 | Zi = z), which may be interpreted as a propensity score. For any
fixed function ψ : R^d → R and any u ∈ R, define
\[
R_n(u) := \frac{1 + \sum_{i=1}^n (D_i/E_i) \cdot \mathbb{1}(\psi(Z_i) \geq u)}{1 + \sum_{i=1}^n (D_i/E_i)},
\qquad
R(u) := \mathbb{E}\left[ \frac{D}{e(Z)} \cdot \mathbb{1}(\psi(Z) \geq u) \right],
\]
where (Z, D, E) is a generic random sample from P. Define also
\[
\Delta_N := \left( \mathbb{E}\left[ \left( \frac{1}{E} - \frac{1}{e(Z)} \right)^2 \right] \right)^{1/2},
\qquad \pi := \mathbb{E}[D].
\]
Then, for any δ ∈ (0, 1), with probability at least 1 − δ,
\[
\sup_{u \in \mathbb{R}} |R_n(u) - R(u)|
\leq \frac{1}{e\pi} \left[ \frac{2(e+1)}{n} + 8\sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N. \tag{16}
\]
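The estimator Rn(u) is just a weighted empirical tail frequency with a +1 correction in the numerator and denominator. A minimal computational sketch (function and variable names are ours, for illustration):

```python
import numpy as np

def reweighted_tail_estimator(scores, events, weights, u):
    """R_n(u) = (1 + sum_i (D_i/E_i) 1{psi(Z_i) >= u}) / (1 + sum_i D_i/E_i).

    scores:  psi(Z_i), e.g., the calibration scores
    events:  D_i in {0, 1}
    weights: E_i in [e, 1], bounded away from zero
    """
    w = events / weights  # W_i = D_i / E_i
    num = 1.0 + np.sum(w * (np.asarray(scores) >= u))
    den = 1.0 + np.sum(w)
    return num / den

# Toy example.
scores = np.array([0.2, 0.5, 0.7, 0.9])
events = np.array([1, 0, 1, 1])
weights = np.array([0.8, 0.5, 0.4, 1.0])
r = reweighted_tail_estimator(scores, events, weights, u=0.6)  # = 4.5 / 5.75
```

In the setting of Corollary 6, plugging u = ŝlt(t; x) into this function yields the IPCW p-value ϕ̂lt(t; x), one call per test point.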

Proof [of Theorem 7] The proof splits the error into two parts: a stochastic deviation term
(comparing the estimator Rn(u) to a population analogue R̃(u) based on the approximate
weights Ei), and a bias term due to the discrepancy between Ei and the propensity score
e(Zi). Define
\[
\tilde R(u) := \frac{\mathbb{E}\left[ (D/E) \cdot \mathbb{1}(\psi(Z) \geq u) \right]}{\mathbb{E}[D/E]},
\]
where the expectations are over the joint distribution of (Z, D, E). Then, decompose the
error:
\[
|R_n(u) - R(u)| \leq |R_n(u) - \tilde R(u)| + |\tilde R(u) - R(u)|.
\]


Step 1 (Bounding the stochastic deviation). We begin by computing a uniform
bound for |Rn(u) − R̃(u)|. Let Wi := Di/Ei ∈ [0, 1/e] and W := D/E ∈ [0, 1/e]. Define
\[
N_n(u) := \frac{1 + \sum_{i=1}^n W_i \cdot \mathbb{1}(\psi(Z_i) \geq u)}{1+n},
\qquad
A_n := \frac{1 + \sum_{i=1}^n W_i}{1+n},
\]
\[
\tilde N(u) := \mathbb{E}[W \cdot \mathbb{1}(\psi(Z) \geq u)],
\qquad
\tilde A := \mathbb{E}[W],
\]
so that
\[
R_n(u) = \frac{N_n(u)}{A_n}, \qquad \tilde R(u) = \frac{\tilde N(u)}{\tilde A}.
\]
Using the inequality
\[
\left| \frac{a}{b} - \frac{a'}{b'} \right| \leq \frac{|a - a'|}{b'} + |a| \cdot \left| \frac{1}{b} - \frac{1}{b'} \right|,
\]
with a = Nn(u), a′ = Ñ(u), b = An, b′ = Ã, and the fact that Nn(u) ≤ An, we obtain
\[
|R_n(u) - \tilde R(u)|
\leq \frac{|N_n(u) - \tilde N(u)|}{\tilde A} + A_n \left| \frac{1}{A_n} - \frac{1}{\tilde A} \right|
= \frac{|N_n(u) - \tilde N(u)| + |A_n - \tilde A|}{\tilde A}
\leq \frac{|N_n(u) - \tilde N(u)| + |A_n - \tilde A|}{\pi}, \tag{17}
\]
because E ≤ 1 and thus Ã = E[W] = E[D/E] ≥ E[D] = π.

Uniform concentration bound for |Nn(u) − Ñ(u)|. To bound the deviation
\[
\sup_{u \in \mathbb{R}} \left| \frac{1}{n} \sum_{i=1}^n W_i \mathbb{1}(\psi(Z_i) \geq u) - \mathbb{E}[W \cdot \mathbb{1}(\psi(Z) \geq u)] \right|,
\]
note that the function class {u ↦ W · 1(ψ(Z) ≥ u)} has envelope bounded by 1/e and VC
dimension 1. Applying Theorem 4.10 and Lemma 4.14 in Wainwright (2019), we obtain
that, with probability at least 1 − δ,
\[
\sup_{u \in \mathbb{R}} \left| \frac{1}{n} \sum_{i=1}^n W_i \mathbb{1}(\psi(Z_i) \geq u) - \tilde N(u) \right|
\leq \frac{4}{e}\sqrt{\frac{\log(n+1)}{n}} + \sqrt{\frac{2\log(1/\delta)}{n e^2}}.
\]
Now recall that
\[
N_n(u) = \frac{1}{1+n} + \frac{1}{1+n} \sum_{i=1}^n W_i \mathbb{1}(\psi(Z_i) \geq u),
\]
so we may write
\[
|N_n(u) - \tilde N(u)| \leq \frac{1 + 1/e}{1+n}
+ \left| \frac{1}{n} \sum_{i=1}^n W_i \mathbb{1}(\psi(Z_i) \geq u) - \tilde N(u) \right|.
\]
Therefore, with probability at least 1 − δ,
\[
\sup_{u \in \mathbb{R}} |N_n(u) - \tilde N(u)| \leq \frac{1 + 1/e}{1+n}
+ \frac{1}{e}\left[ 4\sqrt{\frac{\log(n+1)}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}} \right]. \tag{18}
\]

Hoeffding bound for |An − Ã|. Recall that
\[
A_n := \frac{1 + \sum_{i=1}^n W_i}{1+n}, \qquad \tilde A := \mathbb{E}[W].
\]
We decompose the deviation:
\[
|A_n - \tilde A| \leq \frac{1 + 1/e}{1+n} + \left| \frac{1}{n} \sum_{i=1}^n W_i - \mathbb{E}[W] \right|.
\]
Applying Hoeffding's inequality for Wi ∈ [0, 1/e], we obtain that, with probability at least
1 − δ,
\[
|A_n - \tilde A| \leq \frac{1 + 1/e}{1+n} + \frac{1}{e}\sqrt{\frac{\log(2/\delta)}{2n}}. \tag{19}
\]

Combining (17) with (18) and (19), we obtain that, with probability at least 1 − δ,
\[
\begin{aligned}
\sup_{u \in \mathbb{R}} |R_n(u) - \tilde R(u)|
&\leq \frac{1}{e\pi}\left[ \frac{2(e+1)}{1+n} + 4\sqrt{\frac{\log(n+1)}{n}}
+ \sqrt{\frac{2\log(1/\delta)}{n}} + \sqrt{\frac{\log(2/\delta)}{2n}} \right] \\
&\leq \frac{1}{e\pi}\left[ \frac{2(e+1)}{n} + 8\sqrt{\frac{\log(2n/\delta)}{n}} \right]. \tag{20}
\end{aligned}
\]

Step 2 (Bounding the bias due to approximation). We now bound the difference
between R̃(u) and the ideal target R(u), which uses the true propensity score e(Z). Using
the same ratio inequality as before, we obtain
\[
|\tilde R(u) - R(u)| \leq |\tilde N(u) - R(u)| + |1 - \tilde A|. \tag{21}
\]
To bound the first term, note that, by the Cauchy-Schwarz inequality,
\[
|\tilde N(u) - R(u)| = \left| \mathbb{E}\left[ \left( \frac{1}{E} - \frac{1}{e(Z)} \right) D \cdot \mathbb{1}(\psi(Z) \geq u) \right] \right|
\leq \mathbb{E}\left| \frac{1}{E} - \frac{1}{e(Z)} \right|
\leq \sqrt{\mathbb{E}\left[ \left( \frac{1}{E} - \frac{1}{e(Z)} \right)^2 \right]} = \Delta_N.
\]
For the second term, the same argument, together with the identity
E[D/e(Z)] = E[E[D | Z]/e(Z)] = 1, gives
\[
|1 - \tilde A| = \left| \mathbb{E}\left[ \frac{D}{e(Z)} \right] - \mathbb{E}\left[ \frac{D}{E} \right] \right|
= \left| \mathbb{E}\left[ \left( \frac{1}{e(Z)} - \frac{1}{E} \right) D \right] \right| \leq \Delta_N.
\]
Therefore, we conclude
\[
\sup_{u \in \mathbb{R}} |\tilde R(u) - R(u)| \leq 2\Delta_N. \tag{22}
\]

Step 3 (Conclusion). Combining the stochastic deviation bound (20) with the
approximation bias bound (22), we conclude that, with probability at least 1 − δ,
\[
\sup_{u \in \mathbb{R}} |R_n(u) - R(u)|
\leq \sup_{u \in \mathbb{R}} |R_n(u) - \tilde R(u)| + \sup_{u \in \mathbb{R}} |\tilde R(u) - R(u)|
\leq \frac{1}{e\pi}\left[ \frac{2(e+1)}{n} + 8\sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]
A.3.4. Stability and validity of the BH procedure


Lemma 8 (Stability of BH under small perturbations) Let p = (p1, . . . , pm) and
q = (q1, . . . , qm) be two vectors of p-values in [0, 1]^m. Suppose:

• (Closeness) For some ϵ > 0 and δ ∈ (0, 1), we have
\[
\mathbb{P}\left( \max_{1 \leq j \leq m} |p_j - q_j| \leq \epsilon \right) \geq 1 - \delta.
\]

• (Smoothness) Each qj has a density fj supported on [0, 1] satisfying fj(t) ≤ fmax for
all t ∈ [0, 1].

Let Rp and Rq be the rejection sets from applying the BH procedure at level α to p and q,
respectively. Then,
\[
\mathbb{P}(R_p \neq R_q) \leq \delta + 2\epsilon m^2 f_{\max}.
\]

Proof [of Lemma 8] Define the event
\[
E := \left\{ \max_{1 \leq j \leq m} |p_j - q_j| \leq \epsilon \right\},
\]
which holds with probability at least 1 − δ by assumption.
For each k ∈ {1, . . . , m}, the BH procedure compares p-values to candidate thresholds
τk := αk/m. Define the "instability region"
\[
B_\epsilon := \bigcup_{k=1}^m (\tau_k - \epsilon, \tau_k + \epsilon),
\]
which is the union of m intervals of width 2ϵ, and hence has total Lebesgue measure at
most 2ϵm.
For any fixed index j, the probability that qj ∈ Bϵ is easily bounded using the smoothness
assumption: P(qj ∈ Bϵ) ≤ 2ϵmfmax. Applying a union bound over all j = 1, . . . , m, we get
\[
\mathbb{P}(\exists j : q_j \in B_\epsilon) \leq 2\epsilon m^2 f_{\max}.
\]
Let A be the "bad event" where either closeness fails or some qj lies near a threshold:
\[
A := E^c \cup \{\exists j : q_j \in B_\epsilon\}.
\]
Then,
\[
\mathbb{P}(A) \leq \delta + 2\epsilon m^2 f_{\max}.
\]
On the complement A^c (i.e., when p and q are close and all qj are away from the thresholds),
the BH procedure makes the same rejection decisions on p and q because: each candidate
threshold τk is fixed; each pj and qj differ by at most ϵ; and no qj lies within ϵ of any
threshold. Hence, for every j and k, the relation pj ≤ τk holds if and only if qj ≤ τk.
Therefore, the sorted comparisons and BH cutoffs lead to identical rejection sets: Rp = Rq
on the event A^c. Thus,
\[
\mathbb{P}(R_p \neq R_q) \leq \mathbb{P}(A) \leq \delta + 2\epsilon m^2 f_{\max}.
\]
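The stability phenomenon in Lemma 8 is easy to observe numerically: when no p-value lies within ϵ of a BH threshold kα/m, perturbing each p-value by at most ϵ leaves the rejection set unchanged. A small self-contained sketch (our own illustration, not from the paper's software):

```python
import numpy as np

def bh_reject(p, alpha):
    # Standard BH: reject the k smallest p-values, where k is the largest
    # index with p_(k) <= k * alpha / m.
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return frozenset()
    k = int(np.max(np.nonzero(below)[0])) + 1
    return frozenset(order[:k].tolist())

rng = np.random.default_rng(1)
m, alpha, eps = 20, 0.2, 1e-4
q = rng.uniform(size=m)
# Nudge any q_j that falls within eps of a BH threshold (illustration only),
# so that all q_j lie strictly outside the "instability region" B_eps.
thresholds = alpha * np.arange(1, m + 1) / m
safe = np.all(np.abs(q[:, None] - thresholds[None, :]) > eps, axis=1)
q = np.where(safe, q, q + 2 * eps)
p = q + rng.uniform(-eps, eps, size=m)  # epsilon-perturbed copy of q
same = bh_reject(p, alpha) == bh_reject(q, alpha)
# same is True: all comparisons p_j <= tau_k agree with q_j <= tau_k.
```

This is exactly the deterministic argument on the event A^c in the proof: once every comparison against every candidate threshold agrees, the two rejection sets coincide.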
Theorem 9 (Jin and Candès, 2023b, Thm. 2.3) Let (H1, . . . , Hm) ∈ {0, 1}^m be random
variables indicating the status of m hypotheses, where Hj = 1 means that the jth hypothesis
is a true null. Let (p1, . . . , pm) ∈ [0, 1]^m be random variables, interpreted as p-values for the
corresponding hypotheses. Assume:

• The p-values p1, . . . , pm are mutually independent.

• The hypothesis indicators H1, . . . , Hm are mutually independent.

• Each p-value pj is independent of Hi for all i ≠ j.

• For each j ∈ {1, . . . , m}, the p-value pj satisfies the joint super-uniformity condition:
\[
\mathbb{P}(p_j \leq \alpha, \, H_j = 1) \leq \alpha \qquad \text{for all } \alpha \in [0, 1].
\]

Then the Benjamini-Hochberg (BH) procedure at level α ∈ (0, 1), applied to (p1, . . . , pm),
controls the false discovery rate:
\[
\mathrm{FDR} := \mathbb{E}\left[ \frac{|R \cap \mathcal{H}_0|}{|R| \vee 1} \right] \leq \alpha,
\]
where:

• H0 := {j : Hj = 1} is the (random) set of true null hypotheses,

• R := {j : pj ≤ τ̂} is the BH rejection set,

• τ̂ := max{kα/m : p(k) ≤ kα/m} is the BH threshold, with p(1) ≤ · · · ≤ p(m) the order
statistics of (p1, . . . , pm).

Note: This result is a simplified version of the argument in Jin and Candès (2023b), specialized
to the case of independent p-values (rather than PRDS).
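For concreteness, here is a standard implementation sketch of the BH procedure with the threshold τ̂ defined above (a textbook version, not code from the paper's repository):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha):
    """Return the (sorted) indices of hypotheses rejected by BH at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)  # no k satisfies p_(k) <= k * alpha / m
    k = int(np.max(np.nonzero(below)[0])) + 1  # largest such k
    return np.sort(order[:k])  # reject the k smallest p-values

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.20, 0.6, 0.9]
print(benjamini_hochberg(pvals, alpha=0.1))  # [0 1 2 3]
```

Note that p(3) = 0.039 exceeds its threshold 3α/m = 0.0375, but it is still rejected because p(4) = 0.041 ≤ 4α/m = 0.05; BH rejects everything below the largest qualifying cutoff.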

Proof [of Theorem 9] Let R := {j : pj ≤ τ̂} be the BH rejection set, and let Rj := 1{j ∈ R}.
Then the false discovery rate can be written as
\[
\mathrm{FDR} = \mathbb{E}\left[ \frac{|R \cap \mathcal{H}_0|}{|R| \vee 1} \right]
= \mathbb{E}\left[ \sum_{j=1}^m \frac{H_j R_j}{|R| \vee 1} \right].
\]
Using the identity Rj = 1{pj ≤ |R|α/m} and decomposing over the possible values of
|R| = k, we write
\[
\mathrm{FDR} = \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \,
\mathbb{E}\left[ \mathbb{1}\{|R| = k\} \cdot H_j \cdot \mathbb{1}\left\{ p_j \leq \tfrac{k\alpha}{m} \right\} \right].
\]
Now fix any index j ∈ {1, . . . , m}, and define the modified rejection set R(pj → 0) to be
the BH rejection set obtained by replacing pj with 0 while keeping all other p-values fixed.
Because the BH procedure is monotone in each coordinate, and because reducing pj does
not reduce the size of the rejection set, we have
\[
\mathbb{1}\{|R| = k\} \cdot \mathbb{1}\left\{ p_j \leq \tfrac{k\alpha}{m} \right\}
= \mathbb{1}\{|R(p_j \to 0)| = k\} \cdot \mathbb{1}\left\{ p_j \leq \tfrac{k\alpha}{m} \right\}.
\]
Let Fj := σ(p1, . . . , pj−1, pj+1, . . . , pm) denote the sigma-algebra generated by the p-values
other than pj; note that |R(pj → 0)| is Fj-measurable. Then, using the independence of
(Hj, pj) from Fj and applying the tower property,
\[
\begin{aligned}
\mathrm{FDR}
&= \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \,
\mathbb{E}\left[ \mathbb{1}\{|R(p_j \to 0)| = k\} \cdot H_j \cdot \mathbb{1}\left\{ p_j \leq \tfrac{k\alpha}{m} \right\} \right] \\
&= \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \,
\mathbb{E}\left[ \mathbb{1}\{|R(p_j \to 0)| = k\} \cdot \mathbb{E}\left[ H_j \cdot \mathbb{1}\left\{ p_j \leq \tfrac{k\alpha}{m} \right\} \mid \mathcal{F}_j \right] \right] \\
&= \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \,
\mathbb{E}\left[ \mathbb{1}\{|R(p_j \to 0)| = k\} \right] \cdot \mathbb{P}\left( p_j \leq \tfrac{k\alpha}{m}, \, H_j = 1 \right) \\
&\leq \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \cdot
\mathbb{E}\left[ \mathbb{1}\{|R(p_j \to 0)| = k\} \right] \cdot \tfrac{k\alpha}{m} \\
&= \frac{\alpha}{m} \sum_{j=1}^m \sum_{k=1}^m \mathbb{P}\left( |R(p_j \to 0)| = k \right) \leq \alpha.
\end{aligned}
\]

Appendix B. Additional Results from Experiments with Synthetic Data

B.1. Additional Details on Data Generating Distributions

Table 3: Summary of four synthetic data generation settings considered for numerical
experiments. Settings 1, 2, and 3 are adapted from Sesia and Svetnik (2024), with Setting 3
originally appearing in Candès et al. (2023). Setting 4 is new.

Setting 1 (p = 100):
  X ~ Unif([0, 1]^p)
  T | X ~ LogNormal, µs(X) = 1(X2 > 1/2) + 1(X3 < 1/2) + (1 − X1)^0.25, σs(X) = (1 − X1)/10
  C | X ~ LogNormal, µc(X) = 1(X2 > 1/2) + 1(X3 < 1/2) + (1 − X1)^4 + 4/10, σc(X) = X2/10

Setting 2 (p = 100):
  X ~ Unif([0, 1]^p)
  T | X ~ LogNormal, µs(X) = X1^0.25, σs(X) = 0.1
  C | X ~ LogNormal, µc(X) = X1^4 + 4/10, σc(X) = 0.1

Setting 3 (p = 100):
  X ~ Unif([−1, 1]^p)
  T | X ~ LogNormal, µs(X) = log(2) + 1 + 0.55(X1^2 − X3 X5), σs(X) = |X10| + 1
  C | X ~ Exponential, λc(X) = 0.4

Setting 4 (p = 100):
  X ~ Unif([−1, 1]^p)
  T | X ~ LogNormal, µs(X) = 0.2(1 + X1)X2 + log(2)·1(X3 > 0) + log(10)·1(X3 ≤ 0), σs(X) = 0.25
  C | X ~ LogNormal, µc(X) = 2 + 0.5·X1, σc(X) = 0.1
  Shifted: T | X ~ LogNormal, µs(X) = 0.2(1 + X1)X2 + log(10), σs(X) = 0.25
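As a concrete reference, the data-generating process of Setting 2 can be simulated in a few lines. The sketch below follows the specification of Setting 2 in Table 3 as we read it (lognormal survival and censoring times driven by X1); variable names are ours, and this is illustrative code rather than the paper's experiment script:

```python
import numpy as np

def generate_setting_2(n, p=100, seed=0):
    """Setting 2: X ~ Unif([0,1]^p); T and C are lognormal given X (see Table 3)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, p))
    T = rng.lognormal(mean=X[:, 0] ** 0.25, sigma=0.1)     # survival times
    C = rng.lognormal(mean=X[:, 0] ** 4 + 0.4, sigma=0.1)  # censoring times
    time = np.minimum(T, C)       # observed (right-censored) times
    event = (T <= C).astype(int)  # 1 if the survival time is observed
    return X, time, event

X, time, event = generate_setting_2(n=500)
```

Only the first few covariates affect the outcome; the remaining coordinates of X are pure noise, which is what makes the censoring model estimation nontrivial in these experiments.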


B.2. Effect of the Training Sample Size

[Figure 5 appears here. Panels: (a) screening for low-risk patients with P(T > 2.00) > 0.80; (b) screening for high-risk patients with P(T > 3.00) < 0.25. Each panel plots the Screened Proportion, Survival Rate, Precision, and Recall (mean ± 2 SE) against the training sample size, for the Model, KM, CSB, and Oracle methods.]

Figure 5: Patient screening performance as a function of training sample size in a moderately
challenging synthetic data scenario (Setting 2 from Table 3). Other details are as in Figure 2.

[Figure 6 appears here. Panels: (a) screening for low-risk patients with P(T > 3.00) > 0.90; (b) screening for high-risk patients with P(T > 10.00) < 0.50. Each panel plots the Screened Proportion, Survival Rate, Precision, and Recall (mean ± 2 SE) against the training sample size, for the Model, KM, CSB, and Oracle methods.]

Figure 6: Patient screening performance as a function of training sample size in a relatively
easy synthetic data scenario (Setting 3 from Table 3). Other details are as in Figure 2.


B.3. Effect of the Calibration Sample Size

[Figure 7 appears here. Panels: (a) screening for low-risk patients with P(T > 2.00) > 0.80; (b) screening for high-risk patients with P(T > 3.00) < 0.25. Each panel plots the Screened Proportion, Survival Rate, Precision, and Recall (mean ± 2 SE) against the calibration sample size, for the Model, KM, CSB, and Oracle methods.]

Figure 7: Effect of calibration sample size on patient screening performance with conformal
survival bands (CSB) in a moderately challenging synthetic data scenario (Setting 2 from
Table 3). Other details are as in Figure 3.

[Figure 8 appears here. Panels: (a) screening for low-risk patients with P(T > 3.00) > 0.90; (b) screening for high-risk patients with P(T > 10.00) < 0.50. Each panel plots the Screened Proportion, Survival Rate, Precision, and Recall (mean ± 2 SE) against the calibration sample size, for the Model, KM, CSB, and Oracle methods.]

Figure 8: Effect of calibration sample size on patient screening performance with conformal
survival bands (CSB) in a relatively easy synthetic data scenario (Setting 3 from Table 3).
Other details are as in Figure 3.


B.4. Effect of the Training Sample Size for the Censoring Model

[Figure 9 appears here. Panels: screening for low-risk patients with P(T > 2.00) > 0.80 (top) and for high-risk patients with P(T > 3.00) < 0.25 (bottom). Each panel plots the Screened Proportion, Survival Rate, Precision, and Recall (mean ± 2 SE) against the training sample size for the censoring model, for the Model, KM, CSB, and Oracle methods.]

Figure 9: Effect of the training sample size used for fitting the censoring model on patient
screening performance with conformal survival bands (CSB) in a moderately challenging
synthetic data scenario (Setting 2 from Table 3). Other details are as in Figure 4.

[Figure 10 appears here. Panels: screening for low-risk patients with P(T > 3.00) > 0.90 (top) and for high-risk patients with P(T > 10.00) < 0.50 (bottom). Each panel plots the Screened Proportion, Survival Rate, Precision, and Recall (mean ± 2 SE) against the training sample size for the censoring model, for the Model, KM, CSB, and Oracle methods.]

Figure 10: Effect of the training sample size used for fitting the censoring model on patient
screening performance with conformal survival bands (CSB) in a relatively easy synthetic
data scenario (Setting 3 from Table 3). Other details are as in Figure 4.


B.5. Effect of the Number of Features for the Censoring Model

[Figure 11 appears here. Two stacked figures: screening for low-risk patients with P(T > 2.00) > 0.80 (top) and for high-risk patients with P(T > 3.00) < 0.25 (bottom). Each plots the Screened Proportion, Survival Rate, Precision, and Recall (mean ± 2 SE) against the number of features used for the censoring model, with separate rows for training sample sizes 200 and 500, for the Model, KM, CSB, and Oracle methods.]

Figure 11: Impact of the number of features used in fitting the censoring model on patient
screening performance with conformal survival bands (CSB) in a challenging synthetic
data scenario (Setting 1 from Table 3). The total number of features is 100, but only a
subset (including the relevant ones) is used to fit the censoring model. Top: Low-risk
screening. Screening performance is more sensitive to the number of features when the
training sample size is small (200); using too many features degrades performance due to
the difficulty of accurately estimating the censoring model. When the sample size is larger
(500), performance is less sensitive to the number of features. Bottom: High-risk screening
(training sample sizes = 200 and 500). In this case, screening performance shows even less
sensitivity to the number of features used to fit the censoring model.

Sesia Svetnik

[Figure 12: panels plotting screened proportion, survival rate, precision, and recall against the number of features used for the censoring model (10, 30, 100), at training sample sizes 200 and 500, comparing the Model, KM, CSB, and Oracle methods (mean ± 2 SE). Top: low-risk screening with P(T > 3.00) > 0.90. Bottom: high-risk screening with P(T > 10.00) < 0.50.]

Figure 12: Impact of the number of features used in fitting the censoring model on patient
screening performance with conformal survival bands (CSB) in an easier synthetic data
scenario (Setting 3 from Table 3). Other details are as in Figure 11.


B.6. Effect of Different Censoring and Survival Models

Table 4: Performance of different methods for screening low-risk patients with P(T > 6) > 0.80 in a challenging synthetic data scenario (Setting 1 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.

Survival Model Method Screened Survival Precision Recall


Censoring Model: grf
grf Model 0.941 ± 0.009 0.573 ± 0.007 0.544 ± 0.007 0.999 ± 0.001
KM 1.000 ± 0.000 0.539 ± 0.003 0.511 ± 0.003 1.000 ± 0.000
CSB 0.120 ± 0.047 0.953 ± 0.022 0.796 ± 0.031 0.169 ± 0.061
Oracle 0.511 ± 0.003 0.974 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.995 ± 0.001 0.541 ± 0.003 0.514 ± 0.003 1.000 ± 0.000
KM 1.000 ± 0.000 0.539 ± 0.003 0.511 ± 0.003 1.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.511 ± 0.003 0.974 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.814 ± 0.007 0.659 ± 0.006 0.626 ± 0.006 0.994 ± 0.001
KM 1.000 ± 0.000 0.539 ± 0.003 0.511 ± 0.003 1.000 ± 0.000
CSB 0.261 ± 0.055 0.875 ± 0.026 0.742 ± 0.018 0.361 ± 0.072
Oracle 0.511 ± 0.003 0.974 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.764 ± 0.005 0.652 ± 0.004 0.619 ± 0.004 0.925 ± 0.004
KM 1.000 ± 0.000 0.539 ± 0.003 0.511 ± 0.003 1.000 ± 0.000
CSB 0.192 ± 0.032 0.884 ± 0.017 0.845 ± 0.016 0.301 ± 0.044
Oracle 0.511 ± 0.003 0.974 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
Censoring Model: Cox
grf Model 0.943 ± 0.008 0.572 ± 0.007 0.543 ± 0.006 0.999 ± 0.001
KM 1.000 ± 0.000 0.539 ± 0.003 0.511 ± 0.003 1.000 ± 0.000
CSB 0.584 ± 0.069 0.749 ± 0.033 0.679 ± 0.030 0.710 ± 0.070
Oracle 0.511 ± 0.003 0.974 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.995 ± 0.001 0.541 ± 0.003 0.514 ± 0.003 1.000 ± 0.000
KM 1.000 ± 0.000 0.539 ± 0.003 0.511 ± 0.003 1.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.511 ± 0.003 0.974 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.814 ± 0.007 0.659 ± 0.006 0.626 ± 0.006 0.994 ± 0.001
KM 1.000 ± 0.000 0.539 ± 0.003 0.511 ± 0.003 1.000 ± 0.000
CSB 0.665 ± 0.047 0.709 ± 0.019 0.658 ± 0.013 0.835 ± 0.052
Oracle 0.511 ± 0.003 0.974 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.764 ± 0.005 0.652 ± 0.004 0.619 ± 0.004 0.925 ± 0.004
KM 1.000 ± 0.000 0.539 ± 0.003 0.511 ± 0.003 1.000 ± 0.000
CSB 0.488 ± 0.041 0.747 ± 0.016 0.716 ± 0.016 0.655 ± 0.045
Oracle 0.511 ± 0.003 0.974 ± 0.001 1.000 ± 0.000 1.000 ± 0.000


Table 5: Performance of different methods for screening high-risk patients with P(T > 12) < 0.80 in a challenging synthetic data scenario (Setting 1 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.

Survival Model Method Screened Survival Precision Recall


Censoring Model: grf
grf Model 0.756 ± 0.005 0.035 ± 0.002 0.965 ± 0.002 0.956 ± 0.005
KM 1.000 ± 0.000 0.238 ± 0.003 0.763 ± 0.003 1.000 ± 0.000
CSB 0.733 ± 0.027 0.040 ± 0.003 0.959 ± 0.002 0.922 ± 0.034
Oracle 0.763 ± 0.003 0.001 ± 0.000 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.109 ± 0.004 0.000 ± 0.000 1.000 ± 0.000 0.142 ± 0.006
KM 1.000 ± 0.000 0.238 ± 0.003 0.763 ± 0.003 1.000 ± 0.000
CSB 0.137 ± 0.006 0.000 ± 0.000 1.000 ± 0.000 0.179 ± 0.008
Oracle 0.763 ± 0.003 0.001 ± 0.000 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.819 ± 0.004 0.080 ± 0.003 0.921 ± 0.003 0.988 ± 0.001
KM 1.000 ± 0.000 0.238 ± 0.003 0.763 ± 0.003 1.000 ± 0.000
CSB 0.772 ± 0.043 0.086 ± 0.007 0.908 ± 0.006 0.917 ± 0.051
Oracle 0.763 ± 0.003 0.001 ± 0.000 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.713 ± 0.005 0.071 ± 0.003 0.929 ± 0.003 0.868 ± 0.005
KM 1.000 ± 0.000 0.238 ± 0.003 0.763 ± 0.003 1.000 ± 0.000
CSB 0.670 ± 0.017 0.070 ± 0.004 0.930 ± 0.004 0.816 ± 0.020
Oracle 0.763 ± 0.003 0.001 ± 0.000 1.000 ± 0.000 1.000 ± 0.000
Censoring Model: Cox
grf Model 0.757 ± 0.004 0.035 ± 0.002 0.965 ± 0.002 0.957 ± 0.005
KM 1.000 ± 0.000 0.238 ± 0.003 0.763 ± 0.003 1.000 ± 0.000
CSB 0.591 ± 0.062 0.031 ± 0.004 0.961 ± 0.002 0.742 ± 0.078
Oracle 0.763 ± 0.003 0.001 ± 0.000 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.109 ± 0.004 0.000 ± 0.000 1.000 ± 0.000 0.142 ± 0.006
KM 1.000 ± 0.000 0.238 ± 0.003 0.763 ± 0.003 1.000 ± 0.000
CSB 0.128 ± 0.009 0.000 ± 0.000 1.000 ± 0.000 0.168 ± 0.011
Oracle 0.763 ± 0.003 0.001 ± 0.000 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.819 ± 0.004 0.080 ± 0.003 0.921 ± 0.003 0.988 ± 0.001
KM 1.000 ± 0.000 0.238 ± 0.003 0.763 ± 0.003 1.000 ± 0.000
CSB 0.630 ± 0.064 0.054 ± 0.009 0.936 ± 0.008 0.767 ± 0.076
Oracle 0.763 ± 0.003 0.001 ± 0.000 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.713 ± 0.005 0.071 ± 0.003 0.929 ± 0.003 0.868 ± 0.005
KM 1.000 ± 0.000 0.238 ± 0.003 0.763 ± 0.003 1.000 ± 0.000
CSB 0.530 ± 0.049 0.050 ± 0.007 0.945 ± 0.006 0.650 ± 0.058
Oracle 0.763 ± 0.003 0.001 ± 0.000 1.000 ± 0.000 1.000 ± 0.000


Table 6: Performance of different methods for screening low-risk patients with P(T > 2) > 0.80 in a relatively challenging data scenario (Setting 2 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.

Survival Model Method Screened Survival Precision Recall


Censoring Model: grf
grf Model 0.969 ± 0.006 0.765 ± 0.005 0.657 ± 0.005 1.000 ± 0.000
KM 1.000 ± 0.000 0.741 ± 0.003 0.636 ± 0.002 1.000 ± 0.000
CSB 0.286 ± 0.075 0.952 ± 0.014 0.797 ± 0.019 0.346 ± 0.089
Oracle 0.636 ± 0.002 0.960 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.951 ± 0.003 0.776 ± 0.004 0.669 ± 0.003 1.000 ± 0.000
KM 1.000 ± 0.000 0.741 ± 0.003 0.636 ± 0.002 1.000 ± 0.000
CSB 0.018 ± 0.026 0.996 ± 0.005 0.692 ± 0.005 0.020 ± 0.028
Oracle 0.636 ± 0.002 0.960 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.880 ± 0.005 0.834 ± 0.004 0.723 ± 0.005 1.000 ± 0.000
KM 1.000 ± 0.000 0.741 ± 0.003 0.636 ± 0.002 1.000 ± 0.000
CSB 0.423 ± 0.064 0.935 ± 0.010 0.857 ± 0.014 0.552 ± 0.080
Oracle 0.636 ± 0.002 0.960 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.596 ± 0.011 0.961 ± 0.003 0.967 ± 0.006 0.904 ± 0.011
KM 1.000 ± 0.000 0.741 ± 0.003 0.636 ± 0.002 1.000 ± 0.000
CSB 0.400 ± 0.030 0.982 ± 0.004 0.988 ± 0.007 0.616 ± 0.041
Oracle 0.636 ± 0.002 0.960 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
Censoring Model: Cox
grf Model 0.969 ± 0.006 0.765 ± 0.005 0.657 ± 0.005 1.000 ± 0.000
KM 1.000 ± 0.000 0.741 ± 0.003 0.636 ± 0.002 1.000 ± 0.000
CSB 0.578 ± 0.073 0.892 ± 0.015 0.777 ± 0.018 0.681 ± 0.082
Oracle 0.636 ± 0.002 0.960 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.951 ± 0.003 0.776 ± 0.004 0.669 ± 0.003 1.000 ± 0.000
KM 1.000 ± 0.000 0.741 ± 0.003 0.636 ± 0.002 1.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.636 ± 0.002 0.960 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.880 ± 0.005 0.834 ± 0.004 0.723 ± 0.005 1.000 ± 0.000
KM 1.000 ± 0.000 0.741 ± 0.003 0.636 ± 0.002 1.000 ± 0.000
CSB 0.735 ± 0.042 0.882 ± 0.009 0.790 ± 0.013 0.897 ± 0.046
Oracle 0.636 ± 0.002 0.960 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.596 ± 0.011 0.961 ± 0.003 0.967 ± 0.006 0.904 ± 0.011
KM 1.000 ± 0.000 0.741 ± 0.003 0.636 ± 0.002 1.000 ± 0.000
CSB 0.513 ± 0.025 0.968 ± 0.005 0.969 ± 0.010 0.773 ± 0.030
Oracle 0.636 ± 0.002 0.960 ± 0.001 1.000 ± 0.000 1.000 ± 0.000


Table 7: Performance of different methods for screening high-risk patients with P(T > 3) < 0.25 in a relatively challenging data scenario (Setting 2 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.

Survival Model Method Screened Survival Precision Recall


Censoring Model: grf
grf Model 0.999 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 0.999 ± 0.000
KM 0.990 ± 0.020 0.030 ± 0.001 1.000 ± 0.000 0.990 ± 0.020
CSB 0.979 ± 0.028 0.029 ± 0.001 1.000 ± 0.000 0.979 ± 0.028
Oracle 1.000 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.726 ± 0.006 0.008 ± 0.001 1.000 ± 0.000 0.726 ± 0.006
KM 0.990 ± 0.020 0.030 ± 0.001 1.000 ± 0.000 0.990 ± 0.020
CSB 0.723 ± 0.030 0.010 ± 0.001 1.000 ± 0.000 0.723 ± 0.030
Oracle 1.000 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.907 ± 0.014 0.033 ± 0.001 1.000 ± 0.000 0.907 ± 0.014
KM 0.990 ± 0.020 0.030 ± 0.001 1.000 ± 0.000 0.990 ± 0.020
CSB 0.898 ± 0.030 0.031 ± 0.002 1.000 ± 0.000 0.898 ± 0.030
Oracle 1.000 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.964 ± 0.002 0.026 ± 0.001 1.000 ± 0.000 0.964 ± 0.002
KM 0.990 ± 0.020 0.030 ± 0.001 1.000 ± 0.000 0.990 ± 0.020
CSB 0.928 ± 0.028 0.024 ± 0.001 1.000 ± 0.000 0.928 ± 0.028
Oracle 1.000 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
Censoring Model: Cox
grf Model 0.999 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 0.999 ± 0.000
KM 0.990 ± 0.020 0.030 ± 0.001 1.000 ± 0.000 0.990 ± 0.020
CSB 0.999 ± 0.001 0.030 ± 0.001 1.000 ± 0.000 0.999 ± 0.001
Oracle 1.000 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.726 ± 0.006 0.008 ± 0.001 1.000 ± 0.000 0.726 ± 0.006
KM 0.990 ± 0.020 0.030 ± 0.001 1.000 ± 0.000 0.990 ± 0.020
CSB 0.746 ± 0.016 0.010 ± 0.001 1.000 ± 0.000 0.746 ± 0.016
Oracle 1.000 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.907 ± 0.014 0.033 ± 0.001 1.000 ± 0.000 0.907 ± 0.014
KM 0.990 ± 0.020 0.030 ± 0.001 1.000 ± 0.000 0.990 ± 0.020
CSB 0.916 ± 0.015 0.032 ± 0.001 1.000 ± 0.000 0.916 ± 0.015
Oracle 1.000 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.964 ± 0.002 0.026 ± 0.001 1.000 ± 0.000 0.964 ± 0.002
KM 0.990 ± 0.020 0.030 ± 0.001 1.000 ± 0.000 0.990 ± 0.020
CSB 0.938 ± 0.010 0.024 ± 0.001 1.000 ± 0.000 0.938 ± 0.010
Oracle 1.000 ± 0.000 0.030 ± 0.001 1.000 ± 0.000 1.000 ± 0.000


Table 8: Performance of different methods for screening low-risk patients with P(T > 3) > 0.90 in an easier synthetic data scenario (Setting 3 from Table 3), using various censoring
and survival models. The training sample size is fixed at 1000. Other details are as in
Figure 2.

Survival Model Method Screened Survival Precision Recall


Censoring Model: grf
grf Model 0.000 ± 0.000 NA NA 0.000 ± 0.000
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.002 ± 0.000 0.920 ± 0.039 1.000 ± 0.000 1.000 ± 0.000
survreg Model 1.000 ± 0.000 0.701 ± 0.002 0.002 ± 0.000 1.000 ± 0.000
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.002 ± 0.000 0.920 ± 0.039 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.000 ± 0.000 NA NA 0.000 ± 0.000
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.002 ± 0.000 0.920 ± 0.039 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.046 ± 0.004 0.702 ± 0.016 0.003 ± 0.003 0.060 ± 0.038
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.002 ± 0.000 0.920 ± 0.039 1.000 ± 0.000 1.000 ± 0.000
Censoring Model: Cox
grf Model 0.000 ± 0.000 NA NA 0.000 ± 0.000
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.002 ± 0.000 0.920 ± 0.039 1.000 ± 0.000 1.000 ± 0.000
survreg Model 1.000 ± 0.000 0.701 ± 0.002 0.002 ± 0.000 1.000 ± 0.000
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.002 ± 0.000 0.920 ± 0.039 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.000 ± 0.000 NA NA 0.000 ± 0.000
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.002 ± 0.000 0.920 ± 0.039 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.046 ± 0.004 0.702 ± 0.016 0.003 ± 0.003 0.060 ± 0.038
KM 0.000 ± 0.000 NA NA 0.000 ± 0.000
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.002 ± 0.000 0.920 ± 0.039 1.000 ± 0.000 1.000 ± 0.000


Table 9: Performance of different methods for screening high-risk patients with P(T > 10) < 0.50 in an easier synthetic data scenario (Setting 3 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.

Survival Model Method Screened Survival Precision Recall


Censoring Model: grf
grf Model 0.987 ± 0.010 0.386 ± 0.003 0.953 ± 0.001 0.987 ± 0.010
KM 0.910 ± 0.058 0.352 ± 0.022 0.953 ± 0.001 0.910 ± 0.058
CSB 0.886 ± 0.055 0.369 ± 0.016 0.953 ± 0.002 0.886 ± 0.055
Oracle 0.953 ± 0.001 0.378 ± 0.003 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.000 ± 0.000 0.000 ± 0.000 0.500 ± 0.141 0.000 ± 0.000
KM 0.910 ± 0.058 0.352 ± 0.022 0.953 ± 0.001 0.910 ± 0.058
CSB 0.000 ± 0.000 0.003 ± 0.005 1.000 ± NA 0.000 ± 0.000
Oracle 0.953 ± 0.001 0.378 ± 0.003 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.791 ± 0.025 0.386 ± 0.003 0.953 ± 0.001 0.791 ± 0.025
KM 0.910 ± 0.058 0.352 ± 0.022 0.953 ± 0.001 0.910 ± 0.058
CSB 0.738 ± 0.045 0.385 ± 0.009 0.951 ± 0.003 0.737 ± 0.045
Oracle 0.953 ± 0.001 0.378 ± 0.003 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.688 ± 0.017 0.386 ± 0.003 0.953 ± 0.002 0.688 ± 0.017
KM 0.910 ± 0.058 0.352 ± 0.022 0.953 ± 0.001 0.910 ± 0.058
CSB 0.540 ± 0.044 0.387 ± 0.009 0.952 ± 0.005 0.540 ± 0.044
Oracle 0.953 ± 0.001 0.378 ± 0.003 1.000 ± 0.000 1.000 ± 0.000
Censoring Model: Cox
grf Model 0.987 ± 0.011 0.386 ± 0.003 0.953 ± 0.001 0.987 ± 0.011
KM 0.910 ± 0.058 0.352 ± 0.022 0.953 ± 0.001 0.910 ± 0.058
CSB 0.898 ± 0.053 0.373 ± 0.014 0.953 ± 0.002 0.898 ± 0.053
Oracle 0.953 ± 0.001 0.378 ± 0.003 1.000 ± 0.000 1.000 ± 0.000
survreg Model 0.000 ± 0.000 0.000 ± 0.000 0.500 ± 0.141 0.000 ± 0.000
KM 0.910 ± 0.058 0.352 ± 0.022 0.953 ± 0.001 0.910 ± 0.058
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.953 ± 0.001 0.378 ± 0.003 1.000 ± 0.000 1.000 ± 0.000
rf Model 0.791 ± 0.025 0.386 ± 0.003 0.953 ± 0.001 0.791 ± 0.025
KM 0.910 ± 0.058 0.352 ± 0.022 0.953 ± 0.001 0.910 ± 0.058
CSB 0.731 ± 0.049 0.380 ± 0.012 0.953 ± 0.002 0.730 ± 0.049
Oracle 0.953 ± 0.001 0.378 ± 0.003 1.000 ± 0.000 1.000 ± 0.000
Cox Model 0.688 ± 0.017 0.386 ± 0.003 0.953 ± 0.002 0.688 ± 0.017
KM 0.910 ± 0.058 0.352 ± 0.022 0.953 ± 0.001 0.910 ± 0.058
CSB 0.538 ± 0.046 0.389 ± 0.009 0.952 ± 0.005 0.538 ± 0.046
Oracle 0.953 ± 0.001 0.378 ± 0.003 1.000 ± 0.000 1.000 ± 0.000


B.7. Effect of Distribution Shift

Table 10: Performance of different methods for screening high-risk patients with P(T > 3) < 0.50 under distribution shift (Setting 4 from Table 3). These results correspond to the
same experiments reported in Table 1, but evaluate high-risk screening instead of low-risk.
In this case, a misspecified grf model primarily reduces the power of the screening rule by
selecting fewer patients than desired.

Method Screened Survival Precision Recall


Low-Quality Model
Model 0.000 ± 0.000 NA NA 0.000 ± 0.000
KM 0.030 ± 0.034 0.016 ± 0.019 0.495 ± 0.002 0.030 ± 0.034
CSB 0.000 ± 0.000 NA NA 0.000 ± 0.000
Oracle 0.501 ± 0.003 0.077 ± 0.002 1.000 ± 0.000 1.000 ± 0.000
High-Quality Model
Model 0.502 ± 0.004 0.079 ± 0.002 0.998 ± 0.000 1.000 ± 0.000
KM 0.031 ± 0.035 0.017 ± 0.019 0.495 ± 0.002 0.031 ± 0.035
CSB 0.496 ± 0.011 0.077 ± 0.003 0.999 ± 0.000 0.989 ± 0.021
Oracle 0.501 ± 0.004 0.077 ± 0.002 1.000 ± 0.000 1.000 ± 0.000

[Figure 13: survival probability curves over time (0 to 15 years) for four test patients (Patients 1 to 4), under a low-quality model (top row) and a high-quality model (bottom row). Each panel shows the Oracle, Model, and KM survival estimates together with the shaded CSB uncertainty band, and indicates which patients are flagged as 'high risk' (i.e., P[T > 3.00] < 0.50).]

Figure 13: Illustration of the use of conformal survival bands (shaded regions) for screening
test patients in a simulated censored dataset under distribution shift, corresponding to the
high-risk screening experiments discussed in Table 10. Solid black curves show survival
estimates from either an inaccurate (top) or accurate (bottom) survival forest model, while
dashed green curves represent the true survival probabilities. The goal is to identify high-risk patients, i.e., those with less than 50% probability (horizontal dotted line) of surviving
beyond 3 years (vertical dotted line). Other details are as in Figure 1.
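One plausible way a survival band can drive this kind of screening decision is sketched below, under the assumption that the band supplies lower and upper bounds on P(T > t1) at the screening horizon; the function name and interface are illustrative, not the paper's actual API:

```python
import numpy as np

def flag_with_band(lower, upper, threshold, mode):
    """Screen patients using a survival band evaluated at the horizon t1.

    lower, upper -- arrays of lower/upper bounds on P(T > t1), per patient
    mode         -- "low":  flag as low-risk  if lower > threshold
                    "high": flag as high-risk if upper < threshold
    """
    lower, upper = np.asarray(lower), np.asarray(upper)
    if mode == "low":
        return lower > threshold  # whole band above the threshold
    return upper < threshold      # whole band below the threshold
```

Flagging only when the entire band clears the threshold, rather than thresholding a point estimate, is consistent with the conservative behavior of CSB seen in these experiments: it tends to screen fewer patients but with higher precision.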


Appendix C. Additional Details on Real Data Applications

We apply our method to seven datasets previously utilized by Sesia and Svetnik (2024):
the Colon Cancer Chemotherapy (COLON) dataset; the German Breast Cancer Study
Group (GBSG) dataset; the Stanford Heart Transplant Study (HEART); the Molecular
Taxonomy of Breast Cancer International Consortium (METABRIC) dataset; the Primary
Biliary Cirrhosis (PBC) dataset; the Diabetic Retinopathy Study (RETINOPATHY); and
the Veterans’ Administration Lung Cancer Trial (VALCT). Table 11 provides details on the
number of observations, covariates, and data sources.

The datasets were obtained from various publicly available sources. COLON, HEART,
PBC, RETINOPATHY, and VALCT are included in the survival R package. GBSG was
sourced from GitHub: https://github.com/jaredleekatzman/DeepSurv/. METABRIC
was accessed via https://www.cbioportal.org/study/summary?id=brca_metabric.

Each dataset underwent a pre-processing pipeline to ensure consistency and prepare the data for analysis, as in Sesia and Svetnik (2024). Survival times equal to zero were
replaced with half the smallest observed non-zero time. Missing values were imputed using
the median for numeric variables and the mode for categorical variables. Factor variables
were processed to merge rare levels (frequency below 2%) into an “Other” category, while
binary factors with one rare level were removed entirely. Dummy variables were created
for all factors, and redundant features were identified and removed using an alias check.
Additionally, highly correlated features (correlation above 0.75) were iteratively filtered.
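The pipeline above can be sketched roughly in Python with pandas. This is a hypothetical re-implementation for illustration only (the alias check for redundant features is omitted, and details such as column handling may differ from the original analysis):

```python
import pandas as pd

def preprocess(df, time_col="time", rare_freq=0.02, corr_max=0.75):
    """Illustrative sketch of the pre-processing steps described above."""
    df = df.copy()
    # 1. Replace zero survival times with half the smallest non-zero time.
    nonzero_min = df.loc[df[time_col] > 0, time_col].min()
    df.loc[df[time_col] == 0, time_col] = nonzero_min / 2.0

    # 2. Impute missing values (median/mode) and handle rare factor levels.
    for col in list(df.columns):
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])
            freq = df[col].value_counts(normalize=True)
            rare = freq.index[freq < rare_freq]
            if freq.size > 2:
                # Merge rare levels into an "Other" category.
                df.loc[df[col].isin(rare), col] = "Other"
            elif rare.size > 0:
                # Binary factor with one rare level: remove it entirely.
                df = df.drop(columns=col)

    # 3. Dummy-code the remaining factors.
    df = pd.get_dummies(df, drop_first=True, dtype=float)

    # 4. Iteratively drop one feature from each highly correlated pair.
    features = [c for c in df.columns if c != time_col]
    changed = True
    while changed:
        changed = False
        corr = df[features].corr().abs()
        for i, a in enumerate(features):
            for b in features[i + 1:]:
                if corr.loc[a, b] > corr_max:
                    features.remove(b)
                    changed = True
                    break
            if changed:
                break
    return df[[time_col] + features]
```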

Table 11: Summary of the publicly available survival analysis datasets used in Section 4.

Dataset Obs. Vars. Source Citation


COLON 1858 11 survival Moertel et al. (1990)
GBSG 2232 6 github.com Katzman et al. (2018)
HEART 172 4 survival Crowley and Hu (1977)
METABRIC 1981 41 cbioportal.org Curtis et al. (2012)
PBC 418 17 survival Therneau et al. (2000)
RETINOPATHY 394 5 survival Blair et al. (1980)
VALCT 137 6 survival Kalbfleisch and Prentice (2002)


Table 12: Detailed screening results for low-risk selection using the grf model, with threshold rule P(T > t1) > 0.80 and t1 set to the 0.1 quantile of observed times in each dataset.
Shown are the screened proportion and bounds on the survival rate among selected patients,
aggregated over 100 repetitions.

Method Screened Survival (lower bound) Survival (upper bound)


COLON
Model 0.922 ± 0.006 0.917 ± 0.004 0.920 ± 0.004
KM 1.000 ± 0.000 0.902 ± 0.004 0.905 ± 0.004
CSB 0.046 ± 0.045 0.996 ± 0.004 0.996 ± 0.004
GBSG
Model 0.896 ± 0.006 0.922 ± 0.003 0.932 ± 0.003
KM 1.000 ± 0.000 0.901 ± 0.003 0.912 ± 0.003
CSB 0.550 ± 0.047 0.958 ± 0.005 0.968 ± 0.005
HEART
Model 1.000 ± 0.000 0.930 ± 0.011 0.965 ± 0.009
KM 1.000 ± 0.000 0.930 ± 0.011 0.965 ± 0.009
CSB 0.822 ± 0.104 0.940 ± 0.013 0.967 ± 0.009
METABRIC
Model 0.964 ± 0.003 0.912 ± 0.004 0.927 ± 0.004
KM 1.000 ± 0.000 0.905 ± 0.003 0.921 ± 0.003
CSB 0.936 ± 0.013 0.915 ± 0.004 0.930 ± 0.004
PBC
Model 0.857 ± 0.011 0.961 ± 0.006 0.961 ± 0.006
KM 1.000 ± 0.000 0.905 ± 0.009 0.905 ± 0.009
CSB 0.694 ± 0.051 0.975 ± 0.006 0.975 ± 0.006
RETINOPATHY
Model 0.981 ± 0.010 0.884 ± 0.008 0.903 ± 0.008
KM 1.000 ± 0.000 0.884 ± 0.008 0.904 ± 0.008
CSB 0.149 ± 0.089 0.979 ± 0.013 0.984 ± 0.010
VALCT
Model 0.849 ± 0.029 0.915 ± 0.018 0.915 ± 0.018
KM 0.980 ± 0.040 0.905 ± 0.015 0.905 ± 0.015
CSB 0.691 ± 0.094 0.941 ± 0.017 0.941 ± 0.017


Table 13: Detailed screening results for high-risk selection using the Cox model, with
threshold rule P(T > t1) < 0.80 and t1 set to the 0.1 quantile of observed times in each
dataset. Shown are the screened proportion and bounds on the survival rate among selected
patients, aggregated over 100 repetitions.

Method Screened Survival (lower bound) Survival (upper bound)


COLON
Model 0.069 ± 0.005 0.747 ± 0.030 0.747 ± 0.030
KM 0.000 ± 0.000 NA NA
CSB 0.000 ± 0.000 NA NA
GBSG
Model 0.028 ± 0.002 0.849 ± 0.027 0.861 ± 0.024
KM 0.000 ± 0.000 NA NA
CSB 0.000 ± 0.000 NA NA
HEART
Model 0.000 ± 0.000 NA NA
KM 0.000 ± 0.000 NA NA
CSB 0.000 ± 0.000 NA NA
METABRIC
Model 0.036 ± 0.003 0.749 ± 0.031 0.752 ± 0.031
KM 0.000 ± 0.000 NA NA
CSB 0.000 ± 0.000 NA NA
PBC
Model 0.125 ± 0.011 0.595 ± 0.047 0.595 ± 0.047
KM 0.000 ± 0.000 NA NA
CSB 0.002 ± 0.002 0.020 ± 0.028 0.020 ± 0.028
RETINOPATHY
Model 0.012 ± 0.006 0.256 ± 0.115 0.266 ± 0.119
KM 0.000 ± 0.000 NA NA
CSB 0.000 ± 0.000 NA NA
VALCT
Model 0.139 ± 0.018 0.698 ± 0.077 0.698 ± 0.077
KM 0.020 ± 0.040 0.018 ± 0.036 0.018 ± 0.036
CSB 0.000 ± 0.000 NA NA

