Conformal Survival Bands For Risk Screening Under Right-Censoring
Abstract
We propose a method to quantify uncertainty around individual survival distribution estimates using right-censored data, compatible with any survival model. Unlike classical confidence intervals, the survival bands produced by this method offer predictive rather than population-level inference, making them useful for personalized risk screening. For example, in a low-risk screening scenario, they can be applied to flag patients whose survival band at 12 months lies entirely above 50%, while ensuring that at least half of flagged individuals will survive past that time on average. Our approach builds on recent advances in conformal inference and integrates ideas from inverse probability of censoring weighting and multiple testing with false discovery rate control. We provide asymptotic guarantees and show promising performance in finite samples with both simulated and real data.
Keywords: Censored Data, Conformal Inference, False Discovery Rate, Predictive Calibration, Survival Analysis, Uncertainty Estimation.
1. Introduction
1.1. Background and Motivation
Survival analysis focuses on data involving time-to-event outcomes, such as the time until
death or disease relapse in medicine, or mechanical failure in engineering. Its defining challenge is censoring, which occurs when the event of interest is not observed for all individuals.
In the common case of right-censoring, we only know that the event has not occurred up
to a certain time, beyond which the individual is no longer followed. For example, a cancer
patient may still be alive at their last clinical follow-up, five years after diagnosis, but their
true time of death is unknown because no data are available after that point. In this case,
the patient’s censoring time is five years and their survival time is unobserved.
In many applications, the goal is to use fitted survival models to generate personalized
inferences in the form of individual survival curves. These curves estimate, for an individual
with specific features, the probability of remaining event-free beyond any future time point.
For example, based on a patient’s medical history, a model might predict a 90% chance of
surviving past one year, 75% past three years, and 40% past five years. These predictions
can be intuitively visualized as a decreasing curve over time. Such personalized survival
curves are widely used to guide decisions, such as identifying high-risk patients for early
intervention or recognizing low-risk individuals who may safely avoid aggressive treatment.
Traditional approaches to survival analysis rely on statistical models that support uncertainty quantification through confidence intervals and hypothesis testing. These include
Sesia Svetnik
[Figure 1 image: estimated survival curves and shaded conformal survival bands (CSB) for Patients 1–4, under a low-quality (top) and a high-quality (bottom) survival model, shown alongside Oracle, Model, and KM curves; axes show Time (years) vs. Survival Probability, with low-risk flagging at P[T > 3.00] > 0.80.]
Figure 1: Illustration of the use of conformal survival bands (shaded regions) for screening
test patients in simulated censored data. Solid black curves show survival estimates from
either an inaccurate (top) or accurate (bottom) survival forest model. Dashed green curves
represent the true survival probabilities. The goal is to identify low-risk patients—those
with more than 80% probability of surviving beyond 3 years (marked by the vertical line). A
patient is flagged by our method if their entire conformal band lies above the 80% threshold
(horizontal line) at this time, ensuring at least 80% of flagged individuals survive longer.
Flagging decisions are indicated by colored markers: red triangles (our method), green asterisks (oracle), black circles (estimated model), and blue diamonds (Kaplan-Meier). Patients
2–4 are mistakenly flagged as low risk by the inaccurate model. With the accurate model,
our method can identify truly low-risk patients (e.g., Patient 1), unlike the KM curve.
Rather than attempting to directly estimate population-level parameters, our bands are
calibrated for distribution-free predictive screening. They allow practitioners to identify
individuals whose estimated survival probability lies below (or above) a clinically meaningful
threshold—while guaranteeing that, among all such flagged individuals, the false discovery
rate is approximately controlled in a precise predictive sense. For example, in Figure 1, we illustrate a scenario in which the practitioner wishes to identify “low-risk” patients—defined as those with more than a p = 80% chance of surviving beyond time t = 3 years. A patient is flagged if their conformal band lies entirely above the 80% line at t = 3. Our method asymptotically guarantees that, among all flagged individuals, on average at least 80% will survive beyond this time point. This guarantee holds pointwise over all fixed choices of the survival probability threshold p and time horizon t, and for “low-risk” as well as “high-risk” screening, making the method flexible and broadly applicable.
This predictive notion of uncertainty is well-aligned with how survival curves are commonly used in real-world settings: to guide actionable decisions such as prioritizing patients for further testing or early intervention. However, black-box survival models typically provide only point estimates, without any reliable measure of uncertainty or formal calibration guarantees. As in the example shown, screening based on point predictions alone, or even KM estimates, may lead to too many patients incorrectly labeled as low risk—highlighting the practical benefits of our calibrated bands, which offer rigorous control over such errors.
Our approach integrates three key concepts: conformal inference (Vovk et al., 2005;
Fontana et al., 2023), inverse probability of censoring weighting (IPCW) (Robins and
Rotnitzky, 1992; Robins, 1993), and false discovery rate (FDR) control (Benjamini and
Hochberg, 1995). Concretely, we build on tests for random null hypotheses of the type

H0 (t) : T ≥ t,

where T is the (unobserved) survival time of the test individual and t > 0 is a user-specified time threshold. In the following, we will compute conformal p-values to test these
hypotheses and show how to use them to construct the conformal survival bands previewed
in Figure 1. Although each of the statistical concepts upon which we build already exists
on its own, the way we integrate them is original and supported by novel theoretical results.
approach, based on resampling under censoring, but their work focuses on predictive intervals for survival times rather than on calibrating survival screening decisions. Our work is also related to recent, independent research by Yi et al. (2025), who construct two-sided predictive intervals for survival times using a method that, like ours, utilizes conformal p-values with IPC weighting. However, their focus is on survival time prediction rather than decision-based screening or the calibration of individual survival curves.
In this paper, we address the problem of estimating uncertainty around individual survival curves to support threshold-based screening tasks. Our approach is based on computing weighted conformal p-values for testing hypotheses of the form Hlt (t; j) : Tn+j ≥ t and Hrt (t; j) : Tn+j ≤ t, where Tn+j denotes the (unobserved) survival time of a new individual and t > 0 is a fixed time threshold. While it may be possible in principle to obtain such p-values by inverting predictive intervals for survival times obtained using existing methods, this is practically infeasible for several reasons: (i) most methods only provide lower bounds, (ii) even the best available bounds remain conservative, especially at moderate levels of confidence, and (iii) the inversion procedure would be computationally expensive, as it requires solving nested optimization problems for each time threshold t.
Our method integrates and extends two previously distinct lines of work. First, we build
on the screening-based conformal inference framework of Jin and Candès (2023b), which
develops p-values for testing whether unobserved outcomes exceed user-specified thresholds.
Jin and Candès (2023a) later generalized this approach to allow for importance weighting
under covariate shift (Tibshirani et al., 2019), but their theory and assumptions are not
suited to censoring. Second, we draw on Farina et al. (2025), who introduced IPCW (Robins
and Rotnitzky, 1992; Robins, 1993) into conformal inference to produce predictive intervals
under right-censoring. While their method is tailored to one-sided prediction bounds, we
adapt IPCW techniques to develop conformal p-values for our survival threshold hypotheses.
2. Methods
2.1. Problem Setup and Assumptions
We consider a right-censored survival setting based on a sample of n individuals indexed
by [n] := {1, . . . , n}, drawn i.i.d. from an unknown population. For each individual i ∈ [n],
we observe a vector of covariates Xi ∈ X ⊆ Rd , along with right-censored survival data:
the event indicator Ei := I(Ti < Ci ) and the observed time T̃i := min(Ti , Ci ), where Ti > 0
is the true survival time and Ci > 0 is the censoring time. These n observations form the
calibration dataset, denoted by Dcal := {(Xi , T̃i , Ei )}ni=1 , which will be used to quantify
uncertainty—i.e., to calibrate the predictions of a black-box survival model.
We assume access to two black-box models trained on an independent dataset, which
may itself be censored and need not follow the same distribution as the calibration or
test sets. The only assumption is that the training data are independent of all other
samples, allowing us to treat both models as fixed throughout the analysis. The first
model is the survival model, M̂T . This model produces an estimated individual survival
function ŜT (t | x), which is intended to approximate the true conditional survival probability
ST (t | x) := P(T > t | X = x). The second is an auxiliary censoring model, M̂C , which
estimates the conditional survival function of the censoring distribution, ŜC (t | x), an
approximation to SC (t | x) := P(C > t | X = x). The role of the censoring model is
to reweight the calibration data to correct for the missing information due to censoring,
enabling valid uncertainty quantification for the survival model.
In addition to the calibration set, we consider a disjoint test set consisting of m individuals, indexed by {n + 1, . . . , n + m}, also drawn independently from the same population.
For each test individual j, we observe only the covariates Xj ∈ X . Our goal is to use the
survival model M̂T to estimate the survival curve ŜT (t | Xj ) for each Xj , and to construct
a well-calibrated conformal survival band around this curve that reflects uncertainty.
Rather than aiming for our conformal survival bands to provide valid confidence intervals
for the true survival function ST (t | x), which would not be a feasible goal without additional
assumptions (Barber, 2020), we focus on a practically useful objective that is more naturally
attainable within a conformal inference framework: producing principled and interpretable
uncertainty estimates around ŜT (t | Xj ) that can support confident predictive screening
decisions. For example, given a survival probability threshold q ∈ (0, 1) and a clinically
meaningful time point t > 0, our goal may be to identify high-risk test individuals whose
predicted survival probability ŜT (t | Xj ) falls significantly below q, while guaranteeing
that, among all such flagged individuals, the expected proportion who survive beyond t
remains controlled below level q. Although inherently predictive in nature, this type of
calibration guarantee is intuitive and aligns closely with how physicians and practitioners
often interpret the output of survival models in practice: as actionable, patient-specific risk
estimates that can support confident decision-making.
Because our method is built on the idea of conformal p-values, we first review the key
concepts underlying this approach.
Consider, for a given test individual j, the left-tail null hypothesis

Hlt (t; j) : Tn+j ≥ t, (1)

for a fixed time t > 0. This hypothesis asserts the individual will experience the event after time t; rejecting it provides evidence that the individual is unlikely to survive beyond t.
If complete (uncensored) data are available, this hypothesis can be easily tested as follows. Let ŜT (t | x) denote the survival function estimated by the fitted model M̂T . We define the left-tail nonconformity scoring function as

ŝlt (t; x) := 1 − ŜT (t | x), (2)

which represents the predicted probability that an individual with covariates x experiences an event before time t. This is a natural and interpretable choice of nonconformity score in this context, and—crucially—it is monotonically increasing in t. While our later methodology will not depend specifically on this choice of score, it does require that the scoring function ŝlt (t; x) be monotone increasing in t.
Using this function, we compute a non-conformity score ŝlt (t; Xi ) for each calibration
point i ∈ [n] and for the test point Xn+j . Then, following the approach of Jin and Candès
(2023b), we define the left-tail conformal p-value as:
ϕ̃lt (t; Xn+j ) := [1 + ∑_{i=1}^{n} I{ŝlt (Ti ; Xi ) ≥ ŝlt (t; Xn+j )}] / (n + 1), (3)
where we assume for simplicity that all scores are almost surely distinct; this is not a
limitation in practice since ties can be broken by adding some independent random noise.
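As a concrete illustration, the p-value in (3) amounts to a weighted rank computation; the following is a minimal Python sketch (the function name and toy data are ours, not from the paper):

```python
import numpy as np

def conformal_p_left(scores_cal, score_test):
    """Left-tail conformal p-value of Eq. (3):
    (1 + #{i : s_lt(T_i; X_i) >= s_lt(t; X_test)}) / (n + 1)."""
    scores_cal = np.asarray(scores_cal, dtype=float)
    n = scores_cal.size
    return (1.0 + np.sum(scores_cal >= score_test)) / (n + 1.0)

# Toy example: calibration scores s_lt(T_i; X_i) and one test score s_lt(t; X_{n+j}).
rng = np.random.default_rng(0)
cal_scores = rng.uniform(size=99)          # 99 hypothetical calibration scores
print(conformal_p_left(cal_scores, 0.5))   # a larger test score yields a smaller p-value
```

A high test score (predicted event probability before t close to 1) is exceeded by few calibration scores, so the p-value is small, matching the intuition that the null Tn+j ≥ t is then implausible.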
The following result, due to Jin and Candès (2023b), highlights the statistical validity
of the left-tail p-value defined in (3) for testing the null hypothesis (1). A formal proof is
included in Appendix A for completeness. Appendix A also contains all other proofs.
Proposition 1 (Jin and Candès (2023b)) Assume the calibration data and test point
(Xn+j , Tn+j ) are exchangeable, and the scoring function ŝlt (t; x) is monotone increasing in
t for all x. Then for any fixed t > 0 and α ∈ (0, 1), the conformal p-value ϕ̃lt (t; Xn+j )
defined in (3), computed using the full (uncensored) data, satisfies:
P[ ϕ̃lt (t; Xn+j ) ≤ α, Tn+j ≥ t ] ≤ α.
This implies ϕ̃lt (t; Xn+j ) can be interpreted as a p-value for testing the random hypothesis (1), in the sense that small values provide evidence against the null. Notably, the guarantee in Proposition 1 does not condition on the null event {Tn+j ≥ t}; instead, it controls the joint probability that both the null is true and the p-value is small. This marginal formulation differs from the classical frequentist setup, where hypotheses are non-random, but it is sufficient for our purposes. As we will see later, this form of validity is precisely what enables us to obtain valid calibration guarantees for personalized risk screening.
Before turning to censored data, it is helpful to note that the above ideas also apply for testing the complementary right-tail null hypothesis

Hrt (t; j) : Tn+j ≤ t, (4)

which is useful for identifying low-risk individuals. In this case, we consider a right-tail scoring function assumed to be monotone decreasing in t; e.g.,

ŝrt (t; x) := ŜT (t | x), (5)

interpreted as the predicted probability of experiencing the event after time t. The corresponding right-tail conformal p-value is

ϕ̃rt (t; Xn+j ) := [1 + ∑_{i=1}^{n} I{ŝrt (Ti ; Xi ) ≥ ŝrt (t; Xn+j )}] / (n + 1),
which is super-uniform under the same assumptions as in Proposition 1:
P[ ϕ̃rt (t; Xn+j ) ≤ α, Tn+j ≤ t ] ≤ α.
This completes the review of standard conformal p-values in the uncensored setting, for
both left- and right-tail survival hypotheses. In the next section, we extend this framework
to account for censoring, enabling application to real-world survival data.
ŵ(t; x) := 1 / ŜC (t | x), (6)
where the numerator includes only data points whose true event times are observed.
Analogously, we define the IPCW right-tail p-value as:
These IPCW p-values are fully computable from the observed censored data. As we now
show, they yield asymptotically valid inference under the assumption that the censoring
model M̂C is consistent, along with mild regularity conditions on the data distribution.
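As a hedged illustration of how IPCW weighting enters, the sketch below shows one plausible weighted analogue of the unweighted conformal p-value: only uncensored calibration points contribute, each weighted by ŵ(T̃i ; Xi ) = 1/ŜC (T̃i | Xi ) as in Eq. (6). All names are ours, and the exact form of the p-values in Eqs. (7)–(8) may differ from this sketch.

```python
import numpy as np

def ipcw_p_left(scores_cal, events, weights, score_test):
    """Hypothetical IPCW analogue of the left-tail conformal p-value:
    only uncensored calibration points (E_i = 1) contribute, each weighted
    by w_hat(T_i; X_i) = 1 / S_C_hat(T_i | X_i). Illustrative sketch only."""
    scores_cal = np.asarray(scores_cal, dtype=float)
    events = np.asarray(events, dtype=bool)
    weights = np.asarray(weights, dtype=float)
    w = weights[events]          # weights of uncensored points
    s = scores_cal[events]       # their nonconformity scores
    num = 1.0 + np.sum(w * (s >= score_test))
    den = 1.0 + np.sum(w)
    return min(num / den, 1.0)
```

Intuitively, uncensored points with a small probability of remaining uncensored (small ŜC) stand in for similar individuals lost to censoring, which is how the weighting corrects the calibration distribution.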
We define the censoring weight estimation error as

∆N := ( E[ (1/ŵ(T ; X) − 1/w∗ (T ; X))² ] )^{1/2},
where w∗ (t, x) := 1/SC (t | x) denotes the true (unknown) censoring weight function, and
N represents the number of training data points.
We now list the assumptions under which asymptotic validity holds. (A1) The data consist of:
• a training set of cardinality N used to estimate ŵ(t; x), ŝlt (t; x), and ŝrt (t; x);
• an i.i.d. calibration set (Xi , Ti , Ci ) for i = 1, . . . , n, which is censored;
• and an i.i.d. test point (Xn+j , Tn+j , Cn+j ), of which we only see the covariates.
(A2) The probability of observing an event is bounded away from zero: π := P(T ≤ C) > 0.
(A3) The estimated weights are bounded below: ŵ(T ; X) ≥ ωmin > 0 almost surely.
Theorem 2 Under Assumptions (A1)–(A4), for any fixed t > 0 and α ∈ (0, 1), the IPCW
conformal p-values defined in (7) satisfy:
lim sup_{N,n→∞} P[ ϕ̂lt (t; Xn+1 ) ≤ α, Tn+1 ≥ t ] ≤ α,

lim sup_{N,n→∞} P[ ϕ̂rt (t; Xn+1 ) ≤ α, Tn+1 ≤ t ] ≤ α.
In the next section, we explain how to use IPCW conformal p-values to construct con-
formal survival bands.
The upper endpoint of the conformal survival band is

Û (t; Xn+j ) := ϕ̂BH_lt (t; Xn+j ), (9)

where ϕ̂BH_lt (t; Xn+j ) denotes the Benjamini-Hochberg (BH) adjusted left-tail p-value.
More precisely, let ϕ̂(1) ≤ ϕ̂(2) ≤ · · · ≤ ϕ̂(m) denote the ordered values of the unadjusted left-tail p-values {ϕ̂lt (t; Xn+j )}_{j=1}^{m}, and let π(j) be the index such that ϕ̂(j) = ϕ̂lt (t; Xn+π(j) ). Then the BH-adjusted p-values are defined as

ϕ̂BH_lt (t; Xn+π(j) ) := min_{k≥j} { (m/k) · ϕ̂(k) } ∧ 1, for j = 1, . . . , m. (10)
Each value ϕ̂BH_lt (t; Xn+j ) represents the smallest FDR level at which the null hypothesis Hlt (t; j) can be rejected by the BH procedure (Benjamini and Hochberg, 1995). In practice, these adjusted p-values can be very easily computed using the p.adjust function in R with the argument method = "BH".
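The same adjustment is easy to replicate outside R; the following Python sketch (function name ours) mirrors the behavior of p.adjust(p, method = "BH"):

```python
import numpy as np

def bh_adjust(p):
    """BH-adjusted p-values as in Eq. (10):
    adj[pi(j)] = min over k >= j of (m/k) * p_(k), capped at 1."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)                                # indices sorting p ascending
    ranked = p[order] * m / np.arange(1, m + 1)          # (m/k) * p_(k)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # running min over k >= j
    adj = np.empty(m)
    adj[order] = np.minimum(ranked, 1.0)                 # map back to original order
    return adj

print(bh_adjust([0.01, 0.04, 0.03, 0.20]))
```

The reversed cumulative minimum implements the min over k ≥ j in Eq. (10) in a single vectorized pass.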
The quantity Û (t; Xn+j ) can then be interpreted as a calibrated upper bound on the
survival probability at time t for individual Xn+j . If Û (t; Xn+j ) ≤ α for some threshold
α ∈ (0, 1), then the individual may be confidently flagged as high-risk, with the guarantee
that the expected proportion of false discoveries—individuals who survive past t despite
being flagged as “high-risk”—remains below α in the large-sample limit.
Symmetrically, we define

L̂(t; Xn+j ) := 1 − ϕ̂BH_rt (t; Xn+j ), (11)

where ϕ̂BH_rt (t; Xn+j ) is the BH-adjusted p-value computed from the collection of right-tail IPCW conformal p-values {ϕ̂rt (t; Xn+ℓ )}_{ℓ=1}^{m}. If L̂(t; Xn+j ) ≥ 1 − α, the patient can be confidently flagged as low-risk, with the guarantee that the expected proportion of individuals who fail before t—despite being classified as low-risk—remains approximately below α.
Combining the two endpoints yields the band [L̂(t; Xn+j ), Û (t; Xn+j )], which could be interpreted as a calibrated range for likely survival probabilities at time t—offering statistically principled decision support for confident high- and low-risk screening. The full procedure described above is summarized in Algorithm 1.
5: Compute ŝlt (Ti ; Xi ) and ŝrt (Ti ; Xi ) using Eqs. (2) and (5).
6: end for
7: for each time t ∈ T do
8: for each test point Xn+j , for j = 1, . . . , m do
9: Compute scores ŝlt (t; Xn+j ) and ŝrt (t; Xn+j ).
10: Compute IPCW p-values ϕ̂lt (t; Xn+j ) and ϕ̂rt (t; Xn+j ) using Eqs. (7) and (8).
11: end for
12: Apply BH procedure to p-values across test set to obtain adjusted values ϕ̂BH_lt (t; Xn+j ) and ϕ̂BH_rt (t; Xn+j ) for all j ∈ [m] as in Eq. (10).
13: for each test point j ∈ [m] do
14: Compute endpoints Û (t; Xn+j ) and L̂(t; Xn+j ) using Eqs. (9) and (11).
15: end for
16: end for
output Personalized survival bands [L̂(t; Xn+j ), Û (t; Xn+j )] for each j ∈ [m] and all t ∈ T .
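The band-assembly and flagging steps can be sketched as follows. This is a minimal sketch, not the paper's implementation: array and function names are ours, and we take Û = ϕ̂BH_lt and L̂ = 1 − ϕ̂BH_rt, consistent with the screening rules Û (t; Xn+j ) ≤ α and L̂(t; Xn+j ) ≥ 1 − α described in the text.

```python
import numpy as np

def survival_bands(phi_bh_lt, phi_bh_rt):
    """Band endpoints at a fixed time t for the m test points:
    upper endpoint U = adjusted left-tail p-value,
    lower endpoint L = 1 - adjusted right-tail p-value."""
    U = np.asarray(phi_bh_lt, dtype=float)
    L = 1.0 - np.asarray(phi_bh_rt, dtype=float)
    return L, U

def flag_high_risk(U, alpha):
    """F_hi(t; alpha): indices j whose upper endpoint is at most alpha."""
    return np.flatnonzero(np.asarray(U) <= alpha)

def flag_low_risk(L, alpha):
    """F_lo(t; alpha): indices j whose lower endpoint is at least 1 - alpha."""
    return np.flatnonzero(np.asarray(L) >= 1.0 - alpha)
```

In a full implementation these steps would run inside the loop over the time grid T, producing one pair of endpoints per test point and per time.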
approach deviates from the standard sample-splitting paradigm commonly used in confor-
mal inference, it remains valid within our framework due to the use of inverse probability of
censoring weighting (IPCW), which automatically corrects for the potential selection bias
introduced by reallocation during the calibration phase. However, it is not guaranteed that
re-allocating the censored observations to the training set will improve model estimation, as
doing so introduces its own form of sampling bias in the training data. Exploring the trade-
offs involved in this reallocation strategy—particularly in combination with the possible use
of importance weighting during training—remains an open direction for future work.
An analogous result holds for the right-tail p-values ϕ̂rt (t; Xn+j ), used to test Hrt (t; j).
A corollary of Theorem 3 is that high- and low-risk screening rules based on the survival
bands produced by Algorithm 1 described above are asymptotically well-calibrated. For
any t > 0 and α ∈ (0, 1), define the set of test individuals flagged as high-risk at time t as
Fhi (t; α) := { j ∈ [m] : Û (t; Xn+j ) ≤ α } = { j ∈ [m] : ϕ̂BH_lt (t; Xn+j ) ≤ α },

and, symmetrically, the set of test individuals flagged as low-risk at time t as

Flo (t; α) := { j ∈ [m] : L̂(t; Xn+j ) ≥ 1 − α } = { j ∈ [m] : ϕ̂BH_rt (t; Xn+j ) ≤ α }.
Then, Theorem 3 implies that, in the large-sample limit, the expected proportion of individuals in Fhi (t; α) who actually survive past t, and the expected proportion of individuals in Flo (t; α) who fail before t, are both asymptotically controlled below level α.
and

L̂DR (t; Xn+j ) := max{ 1 − ϕ̂BH_rt (t; Xn+j ), ŜT (t | Xn+j ) }. (15)

Above, ŜT (t | Xn+j ) is the point estimate of the survival probability produced by M̂T , and ϕ̂BH_lt (t; Xn+j ) and ϕ̂BH_rt (t; Xn+j ) are the upper and lower bounds produced by Algorithm 1.
This adjusted construction was used in the illustrative example shown in Figure 1.
A useful side benefit of this adjusted band is enhanced interpretability: Equation (13)
guarantees that the predicted survival probability ŜT (t | Xn+j ) always lies within the band
itself. This aligns with common practitioner expectations, simplifying communication.
Models. We consider four model families for fitting the censoring and survival models,
denoted respectively by M̂cens and M̂surv , to ensure consistent comparisons across different
calibration methods. The model families are: (1) grf, a generalized random forest (R package
grf); (2) survreg, an accelerated failure time (AFT) model with a log-normal distribution
(R package survival); (3) rf, a random survival forest (R package randomForestSRC); and
(4) cox, the Cox proportional hazards model (R package survival). The survival model
M̂surv is used both by the model screening method and as input to the proposed conformal
survival band (CSB) method. In contrast, the censoring model M̂cens is only used within
the CSB method, to compute the weights for constructing conformal survival bands.
3.2. Results
Effect of the Training Sample Size. Figure 2 compares the performance of the four
methods in Setting 1 (the harder case), using the grf models. The training sample size varies
between 100 and 10,000, while the calibration sample size is fixed at 500. We evaluate two
screening rules: selecting low-risk patients with P (T > 6.00) > 0.80 and selecting high-risk
patients with P (T > 12.00) < 0.80.
Low-risk screening results (top panel). The model selects too many patients, resulting
in an excess of false positives; the average survival rate among its selected patients falls
below 60%, despite the target threshold of 80%. The KM method performs even worse,
selecting nearly all patients, with a survival rate close to 50%. In contrast, CSB achieves the
desired survival rate among selected patients, provided that the training sample size is not
too small, and its performance improves steadily as the sample size increases, approaching
that of the oracle. Even when the training size is small (e.g., 100), and the fitted survival
and censoring models are relatively inaccurate, CSB is more robust than the model. As
expected, the oracle achieves a survival rate above 80%—in fact closer to 100%—while
selecting approximately 50% of the test patients. It is important to note that the survival
rate of selected patients does not need to match the threshold p exactly, even for the oracle,
because many patients have true survival probabilities much higher (or lower) than p.
High-risk screening results (bottom panel). In this setting, all methods lead to subsets
of selected patients whose survival rates are below 80%, as desired. However, the KM
method again selects too many patients, including many whose true survival probabilities
are actually higher than 80%, resulting in lower precision. This behavior stems from the
non-personalized nature of KM estimates: KM cannot flexibly select subsets of patients
based on individual characteristics and must either select nearly all or none. In contrast,
the model and CSB methods achieve higher precision and recall, and both approach the
oracle performance as the training sample size increases.
Figures 5 and 6 in Appendix B present similar results for Settings 2 and 3, respectively.
Across these experiments, the CSB method consistently achieves survival rates on the correct side of the target threshold and maintains relatively high precision and recall compared
to the other methods, more closely approximating the performance of the ideal oracle as
the training sample size increases.
Effect of the Calibration Sample Size. Figure 3 examines the impact of the calibration
sample size on the performance of CSB in the same challenging synthetic data setting as
before, with the training sample size fixed at 5000. In this regime, where the survival and
censoring models are already accurate thanks to a relatively large training set, increasing the
calibration sample size improves screening performance: the survival rates among selected patients consistently fall on the correct side of the target threshold, and precision and recall
tend to approach the behavior of the ideal oracle. In general, however, if the training sample
size is small and the fitted models are inaccurate, it may be more beneficial to prioritize
training over calibration. Overall, a moderate calibration sample size (on the order of a
few hundred) is typically sufficient to obtain reliable conformal inferences. Figures 7 and 8
in Appendix B present qualitatively similar results from analogous experiments conducted with synthetic data under Settings 2 and 3.
[Figure 2 panels: (a) screening for low-risk patients with P(T > 6.00) > 0.80; (b) screening for high-risk patients with P(T > 12.00) < 0.80. Each panel plots Screened Proportion, Survival Rate, Precision, and Recall by method (mean ± 2 SE).]
Figure 2: Effect of training sample size on patient screening performance of different methods in a challenging synthetic data scenario with complex survival and censoring distributions, using grf models. Top: results for low-risk screening at time t = 6; bottom: high-risk
screening at time t = 12. The calibration sample size is fixed at 500. The conformal survival
band (CSB) method successfully achieves survival rates above the target p = 0.8 for low-risk
screening and below p = 0.8 for high-risk screening, while making personalized selections
that approach the oracle performance as the training sample size increases.
Effect of the Training Sample Size for the Censoring Model. Figure 4 presents
results from experiments similar to those in Figure 2, but here the censoring model is fit
using only a subset from a total of 5000 training samples. The goal is to study how the
quality of the censoring model affects the performance of the CSB method. In the top
panel (low-risk screening), we observe that when the censoring model is trained on too few
samples, CSB may fail to provide valid screening selections. As the training size increases
and the censoring model improves, the survival rate among selected patients eventually
falls on the correct side of the target threshold, consistent with our asymptotic theoretical
results. In the bottom panel (high-risk screening), CSB maintains validity even with a
small censoring training set, but its power (i.e., ability to identify appropriate patients)
improves as the censoring model quality increases. Figures 9 and 10 in Appendix B present
qualitatively similar results for the relatively easier Settings 2 and 3.
[Figure 3 panels: (a) screening for low-risk patients with P(T > 6.00) > 0.80; (b) screening for high-risk patients with P(T > 12.00) < 0.80. Each panel plots Screened Proportion, Survival Rate, Precision, and Recall by method (mean ± 2 SE).]
Figure 3: Effect of calibration sample size on patient screening performance with conformal
survival bands (CSB) in a challenging synthetic data scenario with complex survival and
censoring distributions, as in Figure 2. The training sample size is fixed at 5000. The CSB
method successfully achieves survival rates above the target p = 0.8 for low-risk screening
and below p = 0.8 for high-risk screening, while making selections that more closely resemble
those of an ideal oracle as the calibration sample size increases.
Figure 4: Effect of the training sample size used for fitting the censoring model on patient
screening performance with conformal survival bands (CSB) in a challenging synthetic data
scenario with complex survival and censoring distributions, as in Figure 2. The overall
training sample size is fixed at 5000, but only a subset is used to fit the censoring model.
CSB tends to produce higher-quality selections as the censoring model improves with more
training data, achieving survival rates on the correct side of the target threshold (top), and
improving power toward oracle performance (bottom).
Table 1: Performance of different methods for screening low-risk patients with P (T > 3) >
0.80 under distribution shift (Setting 4 from Table 3). Results are shown separately for a
low-quality survival model trained on the shifted training distribution and a high-quality
model trained directly on the calibration/test distribution. The survival rate highlighted in
red illustrates how a misspecified grf model can lead to invalid screening rules, resulting in
substantially lower survival rates than expected among patients labeled as “low-risk.”
methods, leading to fewer selections. A similar figure illustrating survival curves, conformal
bands, and screening decisions for the high-risk case is provided in Appendix B as Figure 13.
4.2. Results
Table 2 summarizes the results of this data analysis, aggregating the performance of each
screening method across tasks, datasets, and repetitions, separately for each survival model.
For each combination of survival model and screening method, we report the average proportion of test patients selected (screened proportion) and the empirical distribution of verification outcomes: valid, dubious, and invalid. Verification outcomes are determined for each task by assessing whether the survival rate bounds, averaged over 100 random repetitions, fall entirely on the correct side of the threshold p (valid), straddle it (dubious), or lie
entirely on the wrong side (invalid), accounting for uncertainty via two standard errors.
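Under the description above, the verification rule can be sketched as follows (a minimal sketch; the function and argument names are ours):

```python
def verify(mean_rate, se, p, low_risk=True):
    """Classify a screening task as valid / dubious / invalid by comparing the
    interval mean_rate +/- 2*se with the threshold p. For low-risk screening
    the survival rate should exceed p; for high-risk it should fall below p."""
    lo, hi = mean_rate - 2 * se, mean_rate + 2 * se
    if low_risk:                 # want survival rate among selected patients above p
        if lo > p:
            return "valid"
        if hi < p:
            return "invalid"
    else:                        # high-risk screening: want survival rate below p
        if hi < p:
            return "valid"
        if lo > p:
            return "invalid"
    return "dubious"             # interval straddles the threshold

print(verify(0.9, 0.02, 0.8))    # prints "valid"
```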
Table 2: Summary of screening performance for high- and low-risk patient selection methods based on different survival models, aggregated across different datasets, screening tasks, and repetitions. The screened proportion denotes the average fraction of test patients selected. Although false positives cannot be directly verified due to censoring, approximate verification is possible and classified as valid, dubious, or invalid.
Verification Outcome
Method Survival Model Screened Proportion Valid Dubious Invalid
Model grf 0.500 0.643 0.357 0.000
survreg 0.500 0.714 0.250 0.036
rf 0.500 0.536 0.464 0.000
Cox 0.500 0.571 0.393 0.036
KM grf 0.500 0.786 0.214 0.000
survreg 0.500 0.786 0.214 0.000
rf 0.500 0.786 0.214 0.000
Cox 0.500 0.786 0.214 0.000
CSB grf 0.204 0.929 0.071 0.000
survreg 0.002 1.000 0.000 0.000
rf 0.201 0.893 0.107 0.000
Cox 0.188 0.929 0.071 0.000
CSB yields the highest proportion of valid selections across all survival models, with few or no dubious or invalid outcomes, despite screening fewer patients. In contrast, the Model and KM benchmarks screen more patients but exhibit lower verification quality. Unsurprisingly, these results illustrate a fundamental trade-off between power and reliability.
Compared to the synthetic data experiments from Section 3, the advantage of CSB is
somewhat less obvious here. This may be due in part to imperfect verification, stemming
from censoring in the test data and the absence of oracle knowledge. Moreover, survival
models may estimate the conditional survival distributions more accurately here than in the
synthetic settings, reducing the relative gains from conformal inference. As noted by Sesia
and Svetnik (2024), these datasets do not appear to be especially difficult to model. Nevertheless, it remains difficult in general to assess the reliability of screening results produced by black-box methods. Our approach can provide greater confidence in the calibration of such screening results, which may be particularly valuable in high-stakes applications where the cost of false positives is substantial, even if it entails some reduction in power.
Additional breakdowns of these results for specific screening tasks are provided in Ap-
pendix C, with low-risk and high-risk summaries shown in Tables 12 and 13, respectively.
5. Discussion
This paper introduced a conformal inference method for constructing uncertainty bands
around individual survival curves under right-censoring, enabling statistically principled
personalized risk screening under minimal assumptions on the data-generating process.
Experiments with synthetic data showed that screening based on uncalibrated black-
box models can be unreliable, particularly in hard-to-model settings, whereas our method
provides greater robustness, albeit with some conservativeness. In our real data applica-
tions, standard survival models seemed to produce reasonable screening performance, likely
because the fitted models were able to approximate the true survival distribution quite
well. Nonetheless, our method remains appealing in more complex or high-stakes scenarios,
where model misspecification is a concern and formal uncertainty quantification is essential.
Two limitations of the current approach are its reliance on asymptotic FDR control
and its focus on pointwise inference, which assumes that the screening thresholds t and p
are fixed in advance. While it may be possible to obtain finite-sample and simultaneous
inference guarantees, doing so would likely require a significantly more conservative method.
Understanding this trade-off between statistical rigor and screening power in greater depth
presents an intriguing direction for future research.
Another promising direction for improvement concerns the treatment of censored cali-
bration points. In Section 2.4, we proposed reallocating these samples to the training set
but did not do this in practice due to concerns about sampling bias potentially degrading
model quality. Recent work by Farina et al. (2025) introduces a more complex strategy for incorporating censored observations directly into the calibration phase, which it may be possible to adapt to our setting. A potentially simpler alternative is to reallocate censored
calibration samples to the training set and apply importance weighting during training to
mitigate bias, while retaining the calibration procedure described in this paper, which uses
only uncensored observations. Both strategies deserve further investigation.
Additional extensions include de-randomizing our inferences via model aggregation across
different data splits (Carlsson et al., 2014; Linusson et al., 2017), potentially using e-values
(Vovk and Wang, 2023; Wang and Ramdas, 2022; Bashari et al., 2023), and adapting the
method to handle noisy or corrupted data (Sesia et al., 2024; Clarkson et al., 2024).
Software Availability Open-source software implementing our methods and experi-
ments is available at https://github.com/msesia/conformal_survival_screening.
References
Rina Foygel Barber. Is distribution-free inference possible for binary regression? Electron.
J. Statist., 14(2):3487–3524, 2020.
Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. Con-
formal prediction beyond exchangeability. Ann. Stat., 51(2):816–845, 2023.
Avinash Barnwal, Hyunsu Cho, and Toby Hocking. Survival regression with accelerated failure time model in XGBoost. J. Comput. Graph. Stat., 31(4):1292–1302, 2022.
Meshi Bashari, Amir Epstein, Yaniv Romano, and Matteo Sesia. Derandomized novelty detection with FDR control via conformal e-values. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 65585–65596. Curran Associates, Inc., 2023.
Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and
powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol., 57(1):289–300,
1995.
Henrik Boström, Lars Asker, Ram Gurung, Isak Karlsson, Tony Lindgren, and Panagiotis
Papapetrou. Conformal prediction using random survival forests. In 2017 16th IEEE
International Conference on Machine Learning and Applications (ICMLA), pages 812–
817. IEEE, 2017.
Henrik Boström, Ulf Johansson, and Anders Vesterberg. Predicting with confidence from
survival data. In Conformal and Probabilistic Prediction and Applications, pages 123–141.
PMLR, 2019.
Henrik Boström, Henrik Linusson, and Anders Vesterberg. Mondrian predictive systems for
censored data. In Harris Papadopoulos, Khuong An Nguyen, Henrik Boström, and Lars
Carlsson, editors, Proceedings of the Twelfth Symposium on Conformal and Probabilistic
Prediction with Applications, volume 204 of Proceedings of Machine Learning Research,
pages 399–412. PMLR, 13–15 Sep 2023.
Emmanuel Candès, Lihua Lei, and Zhimei Ren. Conformalized survival analysis. J. R. Stat.
Soc. Ser. B Methodol., 85(1):24–45, 2023.
Lars Carlsson, Martin Eklund, and Ulf Norinder. Aggregated conformal prediction. In Arti-
ficial Intelligence Applications and Innovations: AIAI 2014 Workshops: CoPA, MHDW,
IIVC, and MT4BD, Rhodes, Greece, September 19-21, 2014. Proceedings 10, pages 231–
240. Springer, 2014.
Jason Clarkson, Wenkai Xu, Mihai Cucuringu, and Gesine Reinert. Split conformal prediction under data contamination. In Simone Vantini, Matteo Fontana, Aldo Solari, Henrik Boström, and Lars Carlsson, editors, Proceedings of the Thirteenth Symposium on Conformal and Probabilistic Prediction with Applications. PMLR, 2024.
David R Cox. Regression models and life-tables. J. R. Stat. Soc. Ser. B Methodol., 34(2):
187–202, 1972.
John Crowley and Marie Hu. Covariance analysis of heart transplant survival data. J. Am.
Stat. Assoc., 72(357):27–36, 1977.
Christina Curtis, Sohrab P Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M Rueda,
Mark J Dunning, Doug Speed, Andy G Lynch, Shamith Samarajiwa, Yinyin Yuan, et al.
The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel sub-
groups. Nature, 486(7403):346–352, 2012.
Hen Davidov, Shai Feldman, Gil Shamai, Ron Kimmel, and Yaniv Romano. Conformal-
ized survival analysis for general right-censored data. In The Thirteenth International
Conference on Learning Representations, 2024.
Rebecca Farina, Eric J Tchetgen Tchetgen, and Arun Kumar Kuchibhotla. Doubly robust
and efficient calibration of prediction sets for censored time-to-event outcomes. arXiv
preprint arXiv:2501.04615, 2025.
Matteo Fontana, Gianluca Zeni, and Simone Vantini. Conformal prediction: a unified review
of theory and new challenges. Bernoulli, 29(1):1–23, 2023.
Yu Gui, Rohan Hore, Zhimei Ren, and Rina Foygel Barber. Conformalized survival analysis
with adaptive cut-offs. Biometrika, 111(2):459–477, 2024.
Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. Random
survival forests. Ann. Appl. Stat., 2(3):841 – 860, 2008.
Ying Jin and Emmanuel J Candès. Model-free selective inference under covariate shift via
weighted conformal p-values. arXiv preprint arXiv:2307.09291, 2023a.
Ying Jin and Emmanuel J Candès. Selection by prediction with conformal p-values. J.
Mach. Learn. Res., 24(244):1–41, 2023b.
John D Kalbfleisch and Ross L Prentice. The statistical analysis of failure time data. John
Wiley & Sons, 2002.
Edward L Kaplan and Paul Meier. Nonparametric estimation from incomplete observations.
J. Am. Stat. Assoc., 53(282):457–481, 1958.
Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med. Res. Methodol., 18:1–12, 2018.
Henrik Linusson, Ulf Norinder, Henrik Boström, Ulf Johansson, and Tuve Löfström. On the
calibration of aggregated conformal predictors. In Conformal and probabilistic prediction
and applications, pages 154–173. PMLR, 2017.
Charles G Moertel, Thomas R Fleming, John S Macdonald, Daniel G Haller, John A Laurie,
Phyllis J Goodman, James S Ungerleider, William A Emerson, Douglas C Tormey, John H
Glick, et al. Levamisole and fluorouracil for adjuvant therapy of resected colon carcinoma.
New England Journal of Medicine, 322(6):352–358, 1990.
Shi-ang Qi, Yakun Yu, and Russell Greiner. Conformalized survival distributions: A generic
post-process to increase calibration. arXiv preprint arXiv:2405.07374, 2024.
Jing Qin, Jin Piao, Jing Ning, and Yu Shen. Conformal predictive intervals in survival
analysis: a re-sampling approach. arXiv preprint arXiv:2408.06539, 2024.
James M Robins. Information recovery and bias adjustment in proportional hazards re-
gression analysis of randomized trials using surrogate markers. In Proc. Biopharm. Sect.
Am. Stat. Assoc., volume 24, pages 24–33. San Francisco CA, 1993.
James M Robins and Andrea Rotnitzky. Recovery of information and adjustment for de-
pendent censoring using surrogate markers. In AIDS epidemiology: methodological issues,
pages 297–331. Springer, 1992.
Matteo Sesia and Vladimir Svetnik. Doubly robust conformalized survival analysis with
right-censored data. arXiv preprint arXiv:2412.09729, 2024.
Matteo Sesia, YX Rachel Wang, and Xin Tong. Adaptive conformal classification with noisy
labels. J. R. Stat. Soc. Ser. B Methodol., page qkae114, 2024.
Annette Spooner, Emily Chen, Arcot Sowmya, Perminder Sachdev, Nicole A Kochan, Julian
Trollor, and Henry Brodaty. A comparison of machine learning methods for survival
analysis of high-dimensional clinical data for dementia prediction. Scientific reports, 10
(1):20410, 2020.
Xiaolin Sun and Yanhua Wang. Conformal prediction with censored data using Kaplan-Meier method. In Journal of Physics: Conference Series, volume 2898, page 012030. IOP Publishing, 2024.
Terry M Therneau and Patricia M Grambsch. The Cox model. Springer, 2000.
Ryan J Tibshirani, Rina Foygel Barber, Emmanuel Candès, and Aaditya Ramdas. Confor-
mal prediction under covariate shift. Advances in neural information processing systems,
32, 2019.
Vladimir Vovk and Ruodu Wang. Confidence and discoveries with e-values. Statistical
Science, 38(2):329–354, 2023.
Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a ran-
dom world, volume 29. Springer, 2005.
Martin J Wainwright. High-dimensional statistics: a non-asymptotic viewpoint. Cambridge University Press, 2019.
Ruodu Wang and Aaditya Ramdas. False discovery rate control with e-values. J. R. Stat.
Soc. Ser. B Methodol., 84(3):822–852, 2022.
Menghan Yi, Ze Xiao, Huixia Judy Wang, and Yanlin Tang. Survival conformal prediction
under random censoring. Stat, 14(2):e70052, 2025.
By assumption:
• $\hat{w}(T; X) \geq \omega_{\min} > 0$ almost surely,
• $\Delta_N := \mathbb{E}\left[ \left( \frac{1}{\hat{w}(T;X)} - \frac{1}{w^*(T;X)} \right)^2 \right]^{1/2} \to 0$ as $N \to \infty$.
Since the remaining terms in the bound all vanish as $N, n \to \infty$, it follows that
\[
\limsup_{N, n \to \infty} \mathbb{P}\left[ \hat{\phi}_{\mathrm{lt}}(t; X_{n+1}) \leq \alpha,\; T_{n+1} \geq t \right] \leq \alpha.
\]
An identical argument applies to the limit involving ϕ̂rt (t; Xn+1 ), completing the proof.
Proof [of Theorem 3: Asymptotic FDR control] We prove the result for the left-tail p-values
ϕ̂lt (t; Xn+j ); the argument for right-tail p-values is analogous.
The proof proceeds in three steps:
(1) We compare each empirical p-value to an oracle counterpart and establish uniform
closeness.
(2) We show that the BH rejection sets based on these two sets of p-values are close with
high probability.
(3) We conclude FDR control for the empirical procedure via stability and the oracle
FDR guarantee.
Step 1: Empirical p-values are close to oracle p-values. For each test point $j \in \{1, \ldots, m\}$, define the empirical and oracle p-values as:
\[
p_j := \hat{\phi}_{\mathrm{lt}}(t; X_{n+j}), \qquad p^*_j := \phi^*_{\mathrm{lt}}(t; X_{n+j}) := \mathbb{P}\left( \hat{s}_{\mathrm{lt}}(T; X) \geq \hat{s}_{\mathrm{lt}}(t; X_{n+j}) \right).
\]
By Corollary 6, with probability at least $1 - \delta$ over the calibration data, $\max_{j} |p_j - p^*_j| \leq \epsilon_n$, where
\[
\epsilon_n = \frac{1}{\omega_{\min}} \left[ \sqrt{\frac{2(\omega_{\min} + 1)}{\pi n}} + 8 \sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]
Note that $p^*_j = 1 - F(\hat{s}_{\mathrm{lt}}(t; X_{n+j}))$, where $F$ is the CDF of the score variable $\hat{s}_{\mathrm{lt}}(T; X)$, with corresponding density $f = F'$, and the density $f_t$ of $\hat{s}_{\mathrm{lt}}(t; X)$ is uniformly bounded above and below across all $t > 0$. Then, the change-of-variables formula gives that the density $f^*$ of $p^*_j$ is:
\[
f^*(u) = \frac{f_t(F^{-1}(1 - u))}{f(F^{-1}(1 - u))},
\qquad \text{so that} \qquad
\frac{f_{\min}}{f_{\max}} \leq f^*(u) \leq \frac{f_{\max}}{f_{\min}} \quad \text{for all } u \in [0, 1].
\]
Now we can apply Lemma 8 with $p = (p_1, \ldots, p_m)$ and $q = (p^*_1, \ldots, p^*_m)$, obtaining:
\[
\mathbb{P}(R \neq R^*) \leq \delta + 2\epsilon_n m^2 \cdot \frac{f_{\max}}{f_{\min}}.
\]
Moreover, the p-values $p^*_j$ are mutually independent (conditional on the calibration data). Therefore, by Theorem 9, the BH procedure applied to $(p^*_1, \ldots, p^*_m)$ controls the FDR at level $\alpha$:
\[
\mathrm{FDR}^* := \mathbb{E}\left[ \frac{|R^* \cap \mathcal{H}_0|}{|R^*| \vee 1} \right] \leq \alpha,
\]
where $\mathcal{H}_0 := \{ j : T_{n+j} \geq t \}$ is the (random) set of true nulls.
Conclusion. Combining the oracle guarantee $\mathrm{FDR}^* \leq \alpha$ with the stability bound from Lemma 8, via the definition of the FDR and the triangle inequality, we obtain the finite-sample bound:
\[
\mathrm{FDR} \leq \alpha + \delta + 2\epsilon_n m^2 \cdot \frac{f_{\max}}{f_{\min}}.
\]
Letting $\delta \to 0$ and noting that $\epsilon_n \to 0$ as $n \to \infty$, we conclude:
\[
\limsup_{n \to \infty} \mathrm{FDR} \leq \alpha.
\]
where the probability is taken over $(X, T)$. Then, for any $\alpha \in [0, 1]$,
\[
\mathbb{P}\left( \phi^*(t, X') \leq \alpha,\; T' \geq t \right) \leq \alpha.
\]
Proof [of Lemma 4] Let $S := -s(T, X)$ and $S' := -s(T', X')$, and define the cumulative distribution function $F_S(u) := \mathbb{P}(S \leq u)$. By definition of $\phi^*$, we have $F_S(S') \leq \phi^*(t, X')$ on the event $\{T' \geq t\}$. Hence:
\[
\mathbb{P}\left( \phi^*(t, X') \leq \alpha,\; T' \geq t \right) \leq \mathbb{P}\left( F_S(S') \leq \alpha,\; T' \geq t \right) \leq \mathbb{P}(F_S(S') \leq \alpha) = \alpha,
\]
where the final equality uses the probability integral transform: since $S'$ has the same distribution as $S$ and $F_S$ is its CDF, the random variable $F_S(S')$ is uniform on $[0, 1]$.
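The probability integral transform step in this proof is easy to check by simulation. The sketch below is purely illustrative (the distribution and sample sizes are arbitrary choices of ours); it verifies that $F_S(S')$ is approximately uniform when $S$ and $S'$ are i.i.d.:

```python
import numpy as np

rng = np.random.default_rng(0)

# S and S' are i.i.d. draws from the same continuous distribution
# (an exponential is used here purely for illustration).
S = rng.exponential(scale=2.0, size=200_000)
S_prime = rng.exponential(scale=2.0, size=200_000)

# Empirical CDF of S evaluated at S': by the probability integral
# transform, F_S(S') should be approximately uniform on [0, 1].
S_sorted = np.sort(S)
u_vals = np.searchsorted(S_sorted, S_prime) / len(S)

for alpha in (0.05, 0.25, 0.50):
    print(alpha, round((u_vals <= alpha).mean(), 3))  # each close to alpha
```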
Proof [of Theorem 5] We prove the result for $\hat{\phi}_{\mathrm{lt}}$; the argument for $\hat{\phi}_{\mathrm{rt}}$ is identical and omitted.
For $\delta \in (0, 1)$, define the event
\[
E_{t,\delta} := \left\{ \sup_{x \in \mathbb{R}^d} \left| \hat{\phi}_{\mathrm{lt}}(t; x) - \phi^*_{\mathrm{lt}}(t; x) \right| \leq \epsilon_n(\delta) \right\},
\qquad
\epsilon_n(\delta) := \frac{1}{\omega_{\min}} \left[ \sqrt{\frac{2(\omega_{\min} + 1)}{\pi n}} + 8 \sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]
Corollary 6 implies:
\begin{align*}
\mathbb{P}\left[ \hat{\phi}_{\mathrm{lt}}(t; x) \leq \alpha,\; T_{n+1} \geq t \mid X_{n+1} = x \right]
&= \mathbb{P}\left[ \hat{\phi}_{\mathrm{lt}}(t; x) \leq \alpha,\; E_{t,\delta},\; T_{n+1} \geq t \mid X_{n+1} = x \right]
+ \mathbb{P}\left[ \hat{\phi}_{\mathrm{lt}}(t; x) \leq \alpha,\; E^c_{t,\delta},\; T_{n+1} \geq t \mid X_{n+1} = x \right] \\
&\leq \mathbb{P}\left[ \phi^*_{\mathrm{lt}}(t; x) \leq \alpha + \epsilon_n(\delta),\; T_{n+1} \geq t \mid X_{n+1} = x \right] + \delta.
\end{align*}
Therefore,
\begin{align*}
\mathbb{P}\left[ \hat{\phi}_{\mathrm{lt}}(t; x) \leq \alpha,\; T_{n+1} \geq t \right]
&\leq \mathbb{P}\left[ \phi^*_{\mathrm{lt}}(t; x) \leq \alpha + \epsilon_n(\delta),\; T_{n+1} \geq t \right] + \delta \\
&\leq \alpha + \epsilon_n(\delta) + \delta,
\end{align*}
where the second inequality above follows directly from Lemma 4. Finally, setting $\delta = 1/n$:
\[
\mathbb{P}\left[ \hat{\phi}_{\mathrm{lt}}(t; X_{n+1}) \leq \alpha,\; T_{n+1} \geq t \right]
\leq \alpha + \frac{1}{\omega_{\min}} \left[ \sqrt{\frac{2(2\omega_{\min} + 1)}{\pi n}} + 8 \sqrt{\frac{2 \log(2n)}{n}} \right] + 2\Delta_N.
\]
where the probabilities are taken with respect to the randomness in $(X, T)$. Then, for any $t > 0$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the randomness in the calibration data $\mathcal{D} = \{(X_i, T_i, C_i)\}_{i=1}^n$:
\[
\sup_{x \in \mathbb{R}^d} \left| \hat{\phi}_{\mathrm{lt}}(t; x) - \phi^*_{\mathrm{lt}}(t; x) \right| \leq \frac{1}{\omega_{\min}} \left[ \sqrt{\frac{2(\omega_{\min} + 1)}{\pi n}} + 8 \sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]
Similarly, for any $t > 0$ and $\delta \in (0, 1)$, with probability at least $1 - \delta$:
\[
\sup_{x \in \mathbb{R}^d} \left| \hat{\phi}_{\mathrm{rt}}(t; x) - \phi^*_{\mathrm{rt}}(t; x) \right| \leq \frac{1}{\omega_{\min}} \left[ \sqrt{\frac{2(\omega_{\min} + 1)}{\pi n}} + 8 \sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]
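To get a rough sense of the magnitude of this deviation bound, it can be evaluated numerically. The helper below is purely illustrative (the parameter values are arbitrary choices of ours):

```python
import numpy as np

def deviation_bound(n, delta, omega_min, pi, Delta_N=0.0):
    """Right-hand side of the sup-norm deviation bound above:
    omega_min lower-bounds the estimated weights, pi = E[D] is the
    probability that a calibration point is uncensored, and Delta_N
    measures the censoring-model estimation error."""
    return (1.0 / omega_min) * (
        np.sqrt(2.0 * (omega_min + 1.0) / (pi * n))
        + 8.0 * np.sqrt(np.log(2.0 * n / delta) / n)
    ) + 2.0 * Delta_N

# The bound shrinks at the usual sqrt(log(n)/n) rate as n grows:
for n in (1_000, 10_000, 100_000):
    print(n, round(deviation_bound(n, delta=0.05, omega_min=0.2, pi=0.7), 3))
```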
Proof [of Corollary 6] We prove the bound for $\hat{\phi}_{\mathrm{lt}}$; the argument for $\hat{\phi}_{\mathrm{rt}}$ is identical and omitted. Define the following variables:
\[
Z_i = (X_i, T_i), \qquad D_i = 1(T_i \leq C_i), \qquad E_i = \hat{w}(T_i; X_i), \qquad e(Z_i) = w^*(T_i; X_i), \qquad \psi(Z_i) = \hat{s}_{\mathrm{lt}}(T_i; X_i).
\]
With this notation,
\[
\hat{\phi}_{\mathrm{lt}}(t; x) := R_n\big( \hat{s}_{\mathrm{lt}}(t; x) \big), \qquad \phi^*_{\mathrm{lt}}(t; x) := R\big( \hat{s}_{\mathrm{lt}}(t; x) \big).
\]
Therefore,
\[
\sup_{x \in \mathbb{R}^d} \left| \hat{\phi}_{\mathrm{lt}}(t; x) - \phi^*_{\mathrm{lt}}(t; x) \right|
= \sup_{x \in \mathbb{R}^d} \left| R_n(\hat{s}_{\mathrm{lt}}(t; x)) - R(\hat{s}_{\mathrm{lt}}(t; x)) \right|
\leq \sup_{u \in \mathbb{R}} \left| R_n(u) - R(u) \right|,
\]
and the result follows directly from Theorem 7, noting that $E_i \geq \omega_{\min}$ and $\mathbb{E}[D] = \pi$.
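The weighted empirical quantity $R_n$ appearing in this proof can be sketched directly. The implementation below is a minimal illustration with hypothetical inputs (all names and parameter choices are ours):

```python
import numpy as np

def weighted_tail_probability(u, scores, w_hat, events):
    """Sketch of R_n(u) = (1 + sum_i W_i 1{psi(Z_i) >= u}) / (1 + sum_i W_i),
    with IPCW-style weights W_i = D_i / w_hat_i, so only uncensored
    calibration points (D_i = 1) contribute. Inputs are illustrative:
    scores = psi(Z_i), w_hat = estimated censoring weights, events = D_i."""
    W = events / w_hat
    return (1.0 + np.sum(W * (scores >= u))) / (1.0 + np.sum(W))

rng = np.random.default_rng(0)
scores = rng.normal(size=500)
w_hat = rng.uniform(0.3, 1.0, size=500)   # bounded below, as in omega_min > 0
events = rng.binomial(1, 0.7, size=500)   # uncensored indicators D_i

# The resulting quantity behaves like a p-value: it lies in (0, 1]
# and decreases as the threshold u increases.
print(round(weighted_tail_probability(0.0, scores, w_hat, events), 3))
```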
Proof [of Theorem 7] The proof splits the error into two parts: a stochastic deviation term (comparing the estimator $R_n(u)$ to a population analogue $\tilde{R}(u)$ based on the approximate weights $E_i$) and a bias term due to the discrepancy between $E_i$ and the propensity score $e(X_i)$. Throughout, expectations are taken over the joint distribution of $(X, D, Z, E)$; the relevant quantities are defined below, and the error is decomposed accordingly.
Step 1 (Bounding the stochastic deviation). Define
\[
N_n(u) := \frac{1 + \sum_{i=1}^n W_i \cdot 1(\psi(Z_i) \geq u)}{1 + n}, \qquad
A_n := \frac{1 + \sum_{i=1}^n W_i}{1 + n}, \qquad
\tilde{N}(u) := \mathbb{E}[W \cdot 1(\psi(Z) \geq u)], \qquad
\tilde{A} := \mathbb{E}[W],
\]
so that
\[
R_n(u) = \frac{N_n(u)}{A_n}, \qquad \tilde{R}(u) = \frac{\tilde{N}(u)}{\tilde{A}}.
\]
Using the inequality
\[
\left| \frac{a}{b} - \frac{a'}{b'} \right| \leq \frac{|a - a'|}{b'} + |a| \cdot \left| \frac{1}{b} - \frac{1}{b'} \right|,
\]
with $a = N_n(u)$, $a' = \tilde{N}(u)$, $b = A_n$, $b' = \tilde{A}$, and the fact that $N_n(u) \leq A_n$, we obtain
\[
\left| R_n(u) - \tilde{R}(u) \right| \leq \frac{1}{\tilde{A}} \Big( |N_n(u) - \tilde{N}(u)| + |A_n - \tilde{A}| \Big). \tag{17}
\]
Uniform concentration bound for $|N_n(u) - \tilde{N}(u)|$. To bound the deviation
\[
\sup_{u \in \mathbb{R}} \left| \frac{1}{n} \sum_{i=1}^n W_i 1(\psi(Z_i) \geq u) - \mathbb{E}[W \cdot 1(\psi(Z) \geq u)] \right|,
\]
note that the class $\{ u \mapsto W_i 1(\psi(Z_i) \geq u) \}$ has envelope bounded by $1/e$ and VC dimension 1. Applying Theorem 4.10 and Lemma 4.14 in Wainwright (2019), we obtain: with probability at least $1 - \delta$,
\[
\sup_{u \in \mathbb{R}} \left| \frac{1}{n} \sum_{i=1}^n W_i 1(\psi(Z_i) \geq u) - \tilde{N}(u) \right|
\leq \frac{4}{e} \sqrt{\frac{\log(n + 1)}{n}} + \sqrt{\frac{2 \log(1/\delta)}{n e^2}}. \tag{18}
\]
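The uniform concentration phenomenon used here can also be illustrated empirically. In the simulation below (distributions chosen arbitrarily by us, with $W$ independent of $\psi(Z)$ so that the population term factorizes in closed form), the sup-norm deviation shrinks as $n$ grows:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def sup_deviation(n):
    """Sup-norm gap between the weighted empirical tail frequency and its
    population analogue, for illustrative choices: psi(Z) ~ N(0, 1),
    W = D / E with E ~ Uniform(0.5, 1) and D ~ Bernoulli(0.7)."""
    psi = rng.normal(size=n)
    E = rng.uniform(0.5, 1.0, size=n)
    D = rng.binomial(1, 0.7, size=n)
    W = D / E
    grid = np.linspace(-3.0, 3.0, 201)
    emp = np.array([(W * (psi >= u)).mean() for u in grid])
    # Since W and psi are independent here, E[W 1(psi >= u)] factorizes:
    # E[W] = E[D] * E[1/E] = 0.7 * 2 log 2, and P(psi >= u) is a normal tail.
    EW = 0.7 * 2.0 * math.log(2.0)
    tail = np.array([0.5 * math.erfc(u / math.sqrt(2.0)) for u in grid])
    return float(np.abs(emp - EW * tail).max())

for n in (100, 1_000, 10_000):
    print(n, round(sup_deviation(n), 4))  # shrinks roughly like sqrt(log(n)/n)
```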
The correction terms from the $1/(1+n)$ normalization are of order $1/n$, so we may write
\[
|N_n(u) - \tilde{N}(u)| \leq \frac{1 + 1/e}{1 + n} + \left| \frac{1}{n} \sum_{i=1}^n W_i 1(\psi(Z_i) \geq u) - \tilde{N}(u) \right|.
\]
Concentration bound for $|A_n - \tilde{A}|$. Recall that
\[
A_n := \frac{1 + \sum_{i=1}^n W_i}{1 + n}, \qquad \tilde{A} := \mathbb{E}[W].
\]
We decompose the deviation:
\[
|A_n - \tilde{A}| \leq \frac{1 + 1/e}{1 + n} + \left| \frac{1}{n} \sum_{i=1}^n W_i - \mathbb{E}[W] \right|.
\]
Applying Hoeffding's inequality for $W_i \in [0, 1/e]$, we obtain that, with probability at least $1 - \delta$,
\[
|A_n - \tilde{A}| \leq \frac{1 + 1/e}{1 + n} + \frac{1}{e} \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{19}
\]
Combining (17) with (18) and (19), we obtain that, with probability at least $1 - \delta$,
\[
\sup_{u \in \mathbb{R}} |R_n(u) - \tilde{R}(u)|
\leq \frac{1}{e\pi} \left[ \sqrt{\frac{2(e + 1)}{1 + n}} + 4 \sqrt{\frac{\log(n + 1)}{n}} + \sqrt{\frac{2 \log(1/\delta)}{n}} + \sqrt{\frac{\log(2/\delta)}{2n}} \right]
\leq \frac{1}{e\pi} \left[ \sqrt{\frac{2(e + 1)}{n}} + 8 \sqrt{\frac{\log(2n/\delta)}{n}} \right]. \tag{20}
\]
Step 2 (Bounding the bias due to approximation). We now bound the difference between $\tilde{R}(u)$ and the ideal target $R(u)$, which uses the true propensity $e(X)$. Using the same ratio inequality as before, it suffices to control the numerator and denominator separately. To bound the first term, note that, using the Cauchy-Schwarz inequality,
\[
\left| \tilde{N}(u) - R(u) \right|
= \left| \mathbb{E}\left[ \left( \frac{1}{E} - \frac{1}{e(X)} \right) D \cdot 1(\psi(Z) \geq u) \right] \right|
\leq \mathbb{E}\left| \frac{1}{E} - \frac{1}{e(X)} \right|
\leq \sqrt{ \mathbb{E}\left[ \left( \frac{1}{E} - \frac{1}{e(X)} \right)^2 \right] } =: \Delta_N.
\]
Therefore, we conclude:
\[
\sup_{u \in \mathbb{R}} |\tilde{R}(u) - R(u)| \leq 2\Delta_N. \tag{22}
\]
Step 3 (Conclusion). Combining the stochastic deviation bound (20) with the approximation bias bound (22), we conclude that with probability at least $1 - \delta$,
\[
\sup_{u \in \mathbb{R}} |R_n(u) - R(u)|
\leq \sup_{u} |R_n(u) - \tilde{R}(u)| + \sup_{u} |\tilde{R}(u) - R(u)|
\leq \frac{1}{e\pi} \left[ \sqrt{\frac{2(e + 1)}{n}} + 8 \sqrt{\frac{\log(2n/\delta)}{n}} \right] + 2\Delta_N.
\]
• (Smoothness) Each $q_j$ has a density $f_j$ supported on $[0, 1]$ satisfying $f_j(t) \leq f_{\max}$ for all $t \in [0, 1]$.

Let $R_p$ and $R_q$ be the rejection sets from applying the BH procedure at level $\alpha$ to $p$ and $q$, respectively. Then:
\[
\mathbb{P}(R_p \neq R_q) \leq \delta + 2\epsilon m^2 f_{\max}.
\]
Proof [of Lemma 8] Let $\tau_k := k\alpha/m$, for $k = 1, \ldots, m$, denote the possible BH thresholds, and define
\[
B_\epsilon := \bigcup_{k=1}^m \left( \tau_k - \epsilon,\, \tau_k + \epsilon \right),
\]
which is the union of m intervals of width 2ϵ, and hence has total Lebesgue measure at
most 2ϵm.
For any fixed index $j$, the probability that $q_j \in B_\epsilon$ is easily bounded using the smoothness assumption: $\mathbb{P}(q_j \in B_\epsilon) \leq 2\epsilon m f_{\max}$. Applying a union bound over all $j = 1, \ldots, m$, we get:
\[
\mathbb{P}(\exists j : q_j \in B_\epsilon) \leq 2\epsilon m^2 f_{\max}.
\]
Let A be the “bad event” where either closeness fails or some qj lies near a threshold:
A := E c ∪ {∃j : qj ∈ Bϵ }.
Then:
P(A) ≤ δ + 2ϵm2 fmax .
On the complement Ac (i.e., when p and q are close and all qj are away from thresholds),
the BH procedure makes the same rejection decisions on p and q because: each possible
threshold τk is fixed; each pj and qj differ by at most ϵ; no qj lies within ϵ of any threshold.
Hence, for every j, the relation pj ≤ τk matches qj ≤ τk for all k. Therefore, the sorted
comparisons and BH cutoffs lead to identical rejection sets: Rp = Rq on the event Ac .
Thus,
P(Rp ̸= Rq ) ≤ P(A) ≤ δ + 2ϵm2 fmax .
Theorem 9 (Jin and Candès, 2023b, Thm. 2.3) Let (H1 , . . . , Hm ) ∈ {0, 1}m be random
variables indicating the status of m hypotheses, where Hj = 1 means that the jth hypothesis
is a true null. Let (p1 , . . . , pm ) ∈ [0, 1]m be random variables, interpreted as p-values for the
corresponding hypotheses. Assume:
• The p-values p1 , . . . , pm are mutually independent.
• For each $j \in \{1, \ldots, m\}$, the p-value $p_j$ satisfies the joint super-uniformity condition: $\mathbb{P}(p_j \leq u,\; H_j = 1) \leq u$ for all $u \in [0, 1]$.

Then the Benjamini-Hochberg (BH) procedure at level $\alpha \in (0, 1)$, applied to $(p_1, \ldots, p_m)$, controls the false discovery rate:
\[
\mathrm{FDR} := \mathbb{E}\left[ \frac{|R \cap \mathcal{H}_0|}{|R| \vee 1} \right] \leq \alpha,
\]
where:
• H0 := {j : Hj = 1} is the (random) set of true null hypotheses,
• $\hat{\tau} := \max\left\{ \frac{k\alpha}{m} : p_{(k)} \leq \frac{k\alpha}{m} \right\}$ is the BH threshold, with $p_{(1)} \leq \cdots \leq p_{(m)}$ the order statistics of $(p_1, \ldots, p_m)$.
Note: This result is a simplified version of the argument in Jin and Candès (2023b), spe-
cialized to the case of independent p-values (rather than PRDS).
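For concreteness, the BH threshold $\hat{\tau}$ defined above takes only a few lines to implement. The sketch below is ours, not the paper's released code:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha):
    """Return the indices rejected by the BH procedure at level alpha:
    reject all p-values <= tau_hat, where
    tau_hat = max{ k*alpha/m : p_(k) <= k*alpha/m },
    and reject nothing if no such k exists."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(p[order] <= thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)
    k = passed[-1] + 1          # largest k with p_(k) <= k*alpha/m
    return np.sort(order[:k])   # reject the k smallest p-values

print(benjamini_hochberg([0.001, 0.02, 0.04, 0.90], alpha=0.05))  # [0 1]
```

Here the third-smallest p-value (0.04) exceeds its threshold $3\alpha/4 = 0.0375$, so only the two smallest are rejected, illustrating the step-up nature of the procedure.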
Proof [of Theorem 9] Let $R := \{ j : p_j \leq \hat{\tau} \}$ be the BH rejection set, and let $R_j := 1\{ j \in R \}$. Then the false discovery rate can be written as:
\[
\mathrm{FDR} = \mathbb{E}\left[ \frac{|R \cap \mathcal{H}_0|}{|R| \vee 1} \right] = \sum_{j=1}^m \mathbb{E}\left[ \frac{H_j R_j}{|R| \vee 1} \right].
\]
Using the identity $R_j = 1\{ p_j \leq |R| \alpha / m \}$ and decomposing over possible values of $|R| = k$, we write:
\[
\mathrm{FDR} = \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \, \mathbb{E}\left[ 1\{ |R| = k \} \cdot H_j \cdot 1\left\{ p_j \leq \frac{k\alpha}{m} \right\} \right].
\]
Now fix any index $j \in \{1, \ldots, m\}$, and define the modified rejection set $R(p_j \to 0)$ to be the BH rejection set obtained by replacing $p_j$ with $0$ while keeping all other p-values fixed. Because the BH procedure is monotone in each coordinate, and because reducing $p_j$ does not reduce the size of the rejection set, we have:
\[
1\{ |R| = k \} \cdot 1\left\{ p_j \leq \frac{k\alpha}{m} \right\} = 1\{ |R(p_j \to 0)| = k \} \cdot 1\left\{ p_j \leq \frac{k\alpha}{m} \right\}.
\]
Let $\mathcal{F}_j := \sigma(p_1, \ldots, p_{j-1}, 0, p_{j+1}, \ldots, p_m)$ denote the sigma-algebra generated by the other p-values, excluding $p_j$. Then, using independence of $H_j$ and $p_j$ from $\mathcal{F}_j$, and applying the tower property:
\begin{align*}
\mathrm{FDR} &= \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \, \mathbb{E}\left[ 1\{ |R(p_j \to 0)| = k \} \cdot H_j \cdot 1\left\{ p_j \leq \tfrac{k\alpha}{m} \right\} \right] \\
&= \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \, \mathbb{E}\left[ 1\{ |R(p_j \to 0)| = k \} \cdot \mathbb{E}\left[ H_j \cdot 1\left\{ p_j \leq \tfrac{k\alpha}{m} \right\} \,\middle|\, \mathcal{F}_j \right] \right] \\
&= \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \, \mathbb{E}\left[ 1\{ |R(p_j \to 0)| = k \} \right] \cdot \mathbb{P}\left( p_j \leq \tfrac{k\alpha}{m},\; H_j = 1 \right) \\
&\leq \sum_{j=1}^m \sum_{k=1}^m \frac{1}{k} \cdot \mathbb{E}\left[ 1\{ |R(p_j \to 0)| = k \} \right] \cdot \frac{k\alpha}{m} \\
&= \frac{\alpha}{m} \sum_{j=1}^m \sum_{k=1}^m \mathbb{P}\left( |R(p_j \to 0)| = k \right) \leq \alpha.
\end{align*}
Table 3: Summary of four synthetic data generation settings considered for numerical ex-
periments. Settings 1, 2, and 3 are adapted from Sesia and Svetnik (2024), with Setting 3
originally appearing in Candès et al. (2023). Setting 4 is new.
[Figure: screening performance in a synthetic data scenario. Panel (a): screening for low-risk patients with P(T > 2.00) > 0.80; panel (b): screening for high-risk patients with P(T > 3.00) < 0.25. Each panel reports the screened proportion, survival rate, precision, and recall (mean ± 2 SE) for each method.]
[Figure: screening performance in a synthetic data scenario. Panel (a): screening for low-risk patients with P(T > 3.00) > 0.90; panel (b): screening for high-risk patients with P(T > 10.00) < 0.50. Each panel reports the screened proportion, survival rate, precision, and recall (mean ± 2 SE) for each method.]
[Figure panels: (a) low-risk screening with P(T > 2.00) > 0.80; (b) high-risk screening with P(T > 3.00) < 0.25; each reports the screened proportion, survival rate, precision, and recall (mean ± 2 SE).]
Figure 7: Effect of calibration sample size on patient screening performance with conformal
survival bands (CSB) in a moderately challenging synthetic data scenario (Setting 2 from
Table 3). Other details are as in Figure 3.
[Figure panels: (a) low-risk screening with P(T > 3.00) > 0.90; (b) high-risk screening with P(T > 10.00) < 0.50; each reports the screened proportion, survival rate, precision, and recall (mean ± 2 SE).]
Figure 8: Effect of calibration sample size on patient screening performance with conformal
survival bands (CSB) in a relatively easy synthetic data scenario (Setting 3 from Table 3).
Other details are as in Figure 3.
B.4. Effect of the Training Sample Size for the Censoring Model
Figure 9: Effect of the training sample size used for fitting the censoring model on patient
screening performance with conformal survival bands (CSB) in a moderately challenging
synthetic data scenario (Setting 2 from Table 3). Other details are as in Figure 4.
Figure 10: Effect of the training sample size used for fitting the censoring model on patient
screening performance with conformal survival bands (CSB) in a relatively easy synthetic
data scenario (Setting 3 from Table 3). Other details are as in Figure 4.
[Figure panels: low-risk (top) and high-risk (bottom) screening performance versus the number of features (10, 30, 100) used for the censoring model, at training sample sizes 200 and 500; methods: Model, KM, CSB, Oracle (mean ± 2 SE).]
Figure 11: Impact of the number of features used in fitting the censoring model on pa-
tient screening performance with conformal survival bands (CSB) in a challenging synthetic
data scenario (Setting 1 from Table 3). The total number of features is 100, but only a
subset—including the relevant ones—is used to fit the censoring model. Top: Low-risk
screening. Screening performance is more sensitive to the number of features when the
training sample size is small (200); using too many features degrades performance due to
the difficulty of accurately estimating the censoring model. When the sample size is larger
(500), performance is less sensitive to the number of features. Bottom: High-risk screening
(training sample sizes = 200 and 500). In this case, screening performance shows even less
sensitivity to the number of features used to fit the censoring model.
[Figure panels: screening performance versus the number of features (10, 30, 100) used for the censoring model (mean ± 2 SE).]
Figure 12: Impact of the number of features used in fitting the censoring model on patient
screening performance with conformal survival bands (CSB) in an easier synthetic data
scenario (Setting 3 from Table 3). Other details are as in Figure 11.
Table 4: Performance of different methods for screening low-risk patients with P (T >
6) > 0.80 in a challenging synthetic data scenario (Setting 1 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.
Table 5: Performance of different methods for screening high-risk patients with P (T >
12) < 0.80 in a challenging synthetic data scenario (Setting 1 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.
Table 6: Performance of different methods for screening low-risk patients with P (T >
2) > 0.80 in a relatively challenging data scenario (Setting 2 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.
Table 7: Performance of different methods for screening high-risk patients with P (T >
3) < 0.25 in a relatively challenging data scenario (Setting 2 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.
Table 8: Performance of different methods for screening low-risk patients with P (T > 3) >
0.90 in an easier synthetic data scenario (Setting 3 from Table 3), using various censoring
and survival models. The training sample size is fixed at 1000. Other details are as in
Figure 2.
Table 9: Performance of different methods for screening high-risk patients with P (T >
10) < 0.50 in an easier synthetic data scenario (Setting 3 from Table 3), using various
censoring and survival models. The training sample size is fixed at 1000. Other details are
as in Figure 2.
Table 10: Performance of different methods for screening high-risk patients with P (T >
3) < 0.50 under distribution shift (Setting 4 from Table 3). These results correspond to the
same experiments reported in Table 1, but evaluate high-risk screening instead of low-risk.
In this case, a misspecified grf model primarily reduces the power of the screening rule by
selecting fewer patients than desired.
[Figure panels: survival probability versus time (years) for four test patients (Patient 1-4), under a low-quality (top) and a high-quality (bottom) survival model; curves: Oracle, Model, KM, with CSB uncertainty bands.]
Figure 13: Illustration of the use of conformal survival bands (shaded regions) for screening
test patients in a simulated censored dataset under distribution shift, corresponding to the
high-risk screening experiments discussed in Table 10. Solid black curves show survival
estimates from either an inaccurate (top) or accurate (bottom) survival forest model, while
dashed green curves represent the true survival probabilities. The goal is to identify high-
risk patients—those with less than 50% probability (horizontal dotted line) of surviving
beyond 3 years (vertical dotted line). Other details are as in Figure 1.
We apply our method to seven datasets previously utilized by Sesia and Svetnik (2024):
the Colon Cancer Chemotherapy (COLON) dataset; the German Breast Cancer Study
Group (GBSG) dataset; the Stanford Heart Transplant Study (HEART); the Molecular
Taxonomy of Breast Cancer International Consortium (METABRIC) dataset; the Primary
Biliary Cirrhosis (PBC) dataset; the Diabetic Retinopathy Study (RETINOPATHY); and
the Veterans’ Administration Lung Cancer Trial (VALCT). Table 11 provides details on the
number of observations, covariates, and data sources.
The datasets were obtained from various publicly available sources. COLON, HEART,
PBC, RETINOPATHY, and VALCT are included in the survival R package. GBSG was
sourced from GitHub: https://github.com/jaredleekatzman/DeepSurv/. METABRIC
was accessed via https://www.cbioportal.org/study/summary?id=brca_metabric.
Table 11: Summary of the publicly available survival analysis datasets used in Section 4.
Table 12: Detailed screening results for low-risk selection using the grf model, with thresh-
old rule P (T > t1 ) > 0.80 and t1 set to the 0.1 quantile of observed times in each dataset.
Shown are the screened proportion and bounds on the survival rate among selected patients,
aggregated over 100 repetitions.
Table 13: Detailed screening results for high-risk selection using the Cox model, with
threshold rule P (T > t1 ) < 0.80 and t1 set to the 0.1 quantile of observed times in each
dataset. Shown are the screened proportion and bounds on the survival rate among selected
patients, aggregated over 100 repetitions.