
CoxKAN: Kolmogorov-Arnold Networks for Interpretable, High-Performance Survival Analysis


William Knottenbelt¹, Zeyu Gao²,³*, Rebecca Wray²,³,
Woody Zhidong Zhang², Jiashuai Liu⁴, Mireia Crispin-Ortuzar²,³*

¹ Department of Physics, University of Cambridge, United Kingdom.
² Department of Oncology, University of Cambridge, United Kingdom.
³ CRUK Cambridge Centre, University of Cambridge, United Kingdom.
⁴ School of Computer Science, Xi’an Jiaotong University, China.

*Corresponding author(s). E-mail(s): [email protected];
Contributing authors: [email protected];

arXiv:2409.04290v1 [cs.LG] 6 Sep 2024

Abstract
Survival analysis is a branch of statistics used for modeling the time until a
specific event occurs and is widely used in medicine, engineering, finance, and
many other fields. When choosing survival models, there is typically a trade-
off between performance and interpretability, where the highest performance is
achieved by black-box models based on deep learning. This is a major problem in
fields such as medicine where practitioners are reluctant to blindly trust black-
box models to make important patient decisions. Kolmogorov-Arnold Networks
(KANs) were recently proposed as an interpretable and accurate alternative to
multi-layer perceptrons (MLPs). We introduce CoxKAN, a Cox proportional
hazards Kolmogorov-Arnold Network for interpretable, high-performance sur-
vival analysis. We evaluate the proposed CoxKAN on 4 synthetic datasets and
9 real medical datasets. The synthetic experiments demonstrate that CoxKAN
accurately recovers interpretable symbolic formulae for the hazard function, and
effectively performs automatic feature selection. Evaluation on the 9 real datasets
shows that CoxKAN consistently outperforms the Cox proportional hazards model
and achieves performance that is superior or comparable to that of tuned MLPs.
Furthermore, we find that CoxKAN identifies complex interactions between pre-
dictor variables that would be extremely difficult to recognise using existing
survival methods, and automatically finds symbolic formulae which uncover the
precise effect of important biomarkers on patient risk.

1 Introduction
Survival analysis - also called time-to-event analysis - is a set of statistical methods
used for modelling the time until a specific event occurs, such as death, failure, or
relapse. It is crucial to various fields, including medicine, engineering, economics, and
insurance, where understanding the timing and probability of events can significantly
impact decision-making. For example, survival models are used extensively in oncology
(the study of cancer) to identify biomarkers/prognostic factors [1–3], assess treatment
efficacy [4–7], and develop personalized treatment plans [8].
Arguably, the most common survival model is the Cox proportional hazards model
(CoxPH) [9], which assumes a linear relationship between the patient’s¹ covariates
(e.g., age, blood pressure etc.) and the log-partial hazard, which is a measure of the
patient’s risk of event-occurrence (see Section 2.1.1). This model has the benefit of
interpretability (we can see exactly how each covariate impacts risk), but the linear
assumption is often overly simplistic and can cause significant bias error. Methods
based on machine learning generally have less bias and therefore potentially better
performance. These include models such as random survival forests [10, 11], Bayesian
models based on Gaussian processes [12, 13] and dependent logistic regression [14].
The most powerful survival models are those based on deep learning, which was first
shown with “DeepSurv” [8], a deep neural network based on CoxPH. Deep learning
models can also handle diverse input modalities, from unstructured data such as
images to structured datasets like tabular health records, which makes them
adaptable to many healthcare applications. Deep learning has
been used extensively for survival analysis, achieving state-of-the-art performance on
numerous datasets across many domains [15–21]. However, the increased complexity
associated with deep learning comes at the expense of interpretability, with multi-layer
perceptrons (MLPs) sometimes referred to as “black boxes”. As a result,
these methods have had limited clinical adoption and the search for more interpretable
techniques is an active area of research [22–24].
Kolmogorov-Arnold Networks (KANs) [25] were recently introduced as an alterna-
tive to MLPs, demonstrating enhanced interpretability and accuracy. This approach
differs from MLPs by using learnable activation functions on edges of the network
instead of linear weights, and summing those activation functions on nodes (“neu-
rons”). These learnable activation functions are parameterised as a B-spline curve
with learnable coefficients (see Section 2.2.1) to allow them to approximate any
univariate function. The interpretability of KANs stems from the ability to fit symbolic
operators to the learned activation functions, leaving a symbolic formula in place of
the network. In the original paper, KANs were shown to be useful in physics for
solving partial differential equations and extracting mobility edges in the context of
Anderson localization. Since then, extensive applications of KANs have been found,
including time series analysis [26, 27], medical image segmentation [28] and satellite
image classification [29].

¹ We adopt medical terminology when discussing survival data (e.g. “patient”), but we emphasise that the
methods introduced in this paper are general and can be applied to survival analysis in any domain.
In this work we introduce CoxKAN², the first KAN-based framework
for interpretable survival analysis. CoxKAN uses a fast approximation to the Cox
loss to address KANs' slow training time; pruning of activation functions to enable
automatic feature selection; and symbolic regression with PySR [30] to better control
an unconventional type of bias-variance tradeoff when finding symbolic representations
of KANs. The key contributions of this paper are in demonstrating that (a) CoxKAN
finds interpretable symbolic formulas for the hazard function, (b) CoxKAN identifies
biomarkers and complex variable interactions, and (c) CoxKAN achieves performance
that is superior to CoxPH and consistent with or better than DeepSurv (the equivalent
MLP-based model).
The paper is organised as follows: In Section 2.1 we describe the theory of survival
analysis. In Section 2.2 we explain the theory and implementation of Kolmogorov-
Arnold Networks. In Section 3 we describe the CoxKAN framework and training
pipeline. In Section 4, we present the experimental results from 3 categories of exper-
iments (synthetic data, clinical data, high dimensional genomics data). Finally, we
conclude in Section 5 by discussing the key takeaways and potential impact.

2 Preliminaries
2.1 Survival Analysis
Survival time is typically described using the survival function and the hazard function.
Let T be the time until the event of interest occurs, with probability density function
f (t). The survival function S(t) = P (T ≥ t) is the probability that a patient survives
longer than time t. The hazard function h(t) is the instantaneous event probability
density at time t, given the patient has survived up to at least that time. Formally, it
is written

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}. \tag{1}$$

This gives us the probability density function as f(t) = h(t)S(t). It can be shown that
the survival function is related to the hazard function by:

$$S(t) = \exp\left(-\int_0^t h(s)\,ds\right). \tag{2}$$
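
As a quick numerical illustration of (2), the sketch below integrates a hazard function with the trapezoidal rule and exponentiates the result. The helper names and the constant hazard of 0.01 are choices made here for illustration only; for a constant hazard λ the closed form is S(t) = exp(−λt).

```python
import math

def survival_from_hazard(hazard, t, steps=10_000):
    """Approximate S(t) = exp(-integral_0^t h(s) ds) with the trapezoidal rule."""
    if t == 0:
        return 1.0
    ds = t / steps
    integral = 0.0
    for i in range(steps):
        s0, s1 = i * ds, (i + 1) * ds
        integral += 0.5 * (hazard(s0) + hazard(s1)) * ds
    return math.exp(-integral)

def constant_hazard(s, rate=0.01):
    """Illustrative hazard: the same event rate at every time s."""
    return rate

# Survival probability at t = 100 for a constant hazard of 0.01.
s_100 = survival_from_hazard(constant_hazard, 100.0)
# Density f(t) = h(t) S(t) at the same time point.
f_100 = constant_hazard(100.0) * s_100
```

Here `s_100` should agree with exp(−1) up to floating-point error, since the integrated hazard is 0.01 × 100 = 1.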

Survival data for a given patient is comprised of three parts: i) covariates x (predic-
tor variables), ii) time duration t, and iii) event indicator δ. If the event was observed
then δ = 1 and t is the time between the covariates being collected and the event
occurring. If the event was not observed then the patient is said to be right-censored,
δ = 0, and t is the time between the covariates being collected and the last contact
with the patient. For example, this could happen if we are conducting a study on the
survival of cancer patients, and some of the patients drop out of the study at random
times. In standard regression methodology, the censored data would be discarded,
which can cause bias in the model. Hence, we have special methodology that makes
use of the censored data.

² Code is available at https://github.com/knottwill/CoxKAN and can be installed using the following
command: pip install coxkan.

2.1.1 Cox proportional hazards model (CoxPH)


A proportional hazards model is one which assumes the hazard function takes the
form

$$h(t, x) = h_0(t) \cdot \underbrace{\exp(\theta(x))}_{\text{partial hazard}}, \tag{3}$$

where t is time, h0 (t) is the baseline hazard function (same for all patients) and
θ(x) is the log-partial hazard. The log-partial hazard can be thought of as an overall
measure of patient risk that is independent of time.
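
A short worked step, using only the form in (3), shows why θ(x) acts as a time-independent risk score. For two patients with covariates x_a and x_b (labels introduced here for illustration), the baseline hazard cancels in the hazard ratio:

```latex
\frac{h(t, x_a)}{h(t, x_b)}
  = \frac{h_0(t)\,\exp(\theta(x_a))}{h_0(t)\,\exp(\theta(x_b))}
  = \exp\bigl(\theta(x_a) - \theta(x_b)\bigr)
```

So a difference of 1 in log-partial hazard corresponds to a hazard ratio of e ≈ 2.72 at every time t.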
The original proportional hazards model is called the Cox proportional hazards
model (CoxPH) [9] and is still perhaps the most common survival regression model
used today. It models the log-partial hazard in (3) as a linear combination of the
patient’s covariates:

$$\hat{\theta}_{CPH}(x) = \beta^T x = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n. \tag{4}$$

Suppose we have a dataset of N patients, $\{(x_i, t_i, \delta_i)\}_{i=1}^{N}$. The weights, β, are
tuned to maximize the Cox partial likelihood, given by

$$L(\beta) = \prod_{i:\delta_i=1} \frac{\exp(\theta(x_i))}{\sum_{j \in R(t_i)} \exp(\theta(x_j))}, \tag{5}$$

where the risk-set $R(t_i)$ is the set of patients with observed time $t_j \ge t_i$ (i.e. those
who are alive as of time $t_i$).

2.1.2 DeepSurv
We can construct a proportional hazards model based on deep learning by using a
neural network to predict the log-partial hazard [8, 31]. It is trained using the “Cox
loss” function, which is the negative log of (5):

$$\ell_{Cox} = -\sum_{i:\delta_i=1} \left( \hat{\theta}(x_i) - \log \sum_{j \in R(t_i)} \exp(\hat{\theta}(x_j)) \right). \tag{6}$$

This model is known as DeepSurv.
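
To make (6) concrete, here is a minimal pure-Python sketch that evaluates the Cox loss directly from its definition, building each risk set explicitly. The function name and the toy data are illustrative assumptions, not part of any published implementation; a real training loop would use a differentiable tensor library instead.

```python
import math

def cox_loss(theta, times, events):
    """Negative log Cox partial likelihood, following Eq. (6) term by term."""
    loss = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored patients contribute only through risk sets
        # Risk set R(t_i): everyone whose observed time is at least t_i.
        risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
        log_denom = math.log(sum(math.exp(theta[j]) for j in risk_set))
        loss -= theta[i] - log_denom
    return loss

# Toy data: four patients, the second one is right-censored.
times = [5.0, 3.0, 8.0, 1.0]
events = [1, 0, 1, 1]

# Uninformative predictions: every patient has the same log-partial hazard.
loss_uniform = cox_loss([0.0, 0.0, 0.0, 0.0], times, events)
# Predictions that give higher risk to patients with earlier events.
loss_ranked = cox_loss([0.0, 0.0, -1.0, 1.0], times, events)
```

With identical predictions the loss reduces to the sum of the log risk-set sizes, log 2 + log 1 + log 4 = log 8. The second call yields a lower loss because predicted risk agrees with the order of events.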

2.2 Kolmogorov-Arnold Networks


Kolmogorov-Arnold Networks (KANs) [25] are similar to Multi-Layer Perceptrons
(MLPs) in that they consist of consecutive layers of neurons (nodes), where each layer
is computed from the previous one. The shape of a KAN is defined by $[n_0, n_1, \dots, n_L]$,
where $n_l$ is the number of neurons in the $l$th layer:

$$x_l = (x_{l,1}, x_{l,2}, \dots, x_{l,n_l})^T, \quad l = 0, \dots, L, \tag{7}$$

where $x_0$ is the input to the network and $x_L$ is the output. The departure from MLPs
comes in that between the $l$th and $(l+1)$th layer of the network, there are $n_l n_{l+1}$
learnable activation functions parameterised using B-splines, allowing them to capture
arbitrary functions (detailed below). The activation function that connects the $i$th
neuron in the $l$th layer to the $j$th neuron in the $(l+1)$th layer is denoted $\phi_{l,j,i}$. The
$(l+1)$th layer is then computed as the sum of all incoming post-activations:

$$x_{l+1,j} = \sum_{i=1}^{n_l} \phi_{l,j,i}(x_{l,i}), \quad j = 1, \dots, n_{l+1}. \tag{8}$$

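This can equivalently be considered in matrix form:

$$x_{l+1} = \underbrace{\begin{pmatrix}
\phi_{l,1,1}(\cdot) & \phi_{l,1,2}(\cdot) & \cdots & \phi_{l,1,n_l}(\cdot) \\
\phi_{l,2,1}(\cdot) & \phi_{l,2,2}(\cdot) & \cdots & \phi_{l,2,n_l}(\cdot) \\
\vdots & \vdots & & \vdots \\
\phi_{l,n_{l+1},1}(\cdot) & \phi_{l,n_{l+1},2}(\cdot) & \cdots & \phi_{l,n_{l+1},n_l}(\cdot)
\end{pmatrix}}_{\Phi_l} x_l, \tag{9}$$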

where $\Phi_l$ is a matrix of the univariate functions. The output of the network given an
input vector $x \in \mathbb{R}^{n_0}$ can be written as

$$\mathrm{KAN}(x) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0)\, x. \tag{10}$$

All operations are differentiable, allowing KANs to be trained via backpropagation.
Similarly to MLPs, KANs possess the property of universality such that a sufficiently
large KAN with at least one hidden layer can approximate any smooth function on a
compact domain with arbitrary accuracy. An intuitive visualization of a KAN can be
found in Fig. 1.
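
The layer rule in (8) and the composition in (10) can be sketched in a few lines of Python. In this illustration the edge functions are fixed closed-form callables standing in for learned splines, and the names `kan_layer` and `kan_forward` are hypothetical. The example wires up a [2, 1, 1] network that computes 5 exp(−2(x1² + x2²)).

```python
import math

def kan_layer(phi, x):
    """Apply Eq. (8): phi[j][i] is the univariate function on the edge
    from input node i to output node j; each output sums its incoming edges."""
    return [sum(phi_ji(x_i) for phi_ji, x_i in zip(row, x)) for row in phi]

def kan_forward(layers, x):
    """Compose the layers in order, as in Eq. (10)."""
    for phi in layers:
        x = kan_layer(phi, x)
    return x

def square(v):
    return v * v

def scaled_exp(z):
    return 5.0 * math.exp(-2.0 * z)

# Shape [2, 1, 1]: two inputs, one hidden node, one output node.
layers = [
    [[square, square]],   # hidden node = x1^2 + x2^2
    [[scaled_exp]],       # output = 5 * exp(-2 * hidden)
]

output = kan_forward(layers, [0.5, -0.5])
```

For the input (0.5, −0.5) the hidden node equals 0.5, so the output should be 5 exp(−1).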

2.2.1 Activation Functions


Each activation function ϕ(x) is given by

$$\phi(x) = w_b\, b(x) + w_s\, \mathrm{spline}(x), \tag{11}$$

where $w_b$, $w_s$ are trainable weights that control the magnitude of the activation,
b(x) is a (non-trainable) basis function used for training stability (analogous to a
residual connection), and

$$\mathrm{spline}(x) = \sum_{i=0}^{G+k-1} c_i B_{i,k}(x), \tag{12}$$

where the $c_i$'s are trainable parameters and the $B_{i,k}$'s are B-spline basis functions of
degree k on G grid intervals. For sufficiently high k and G, spline(x) can approximate
any smooth 1D function defined on a bounded domain with arbitrary accuracy. The
$B_{i,k}$'s are only non-zero on finite overlapping intervals, hence B-splines provide local
control over the shape of the function (we can modify part of the function without
affecting the rest). In this work, we only consider k = 3, G ∈ {3, 4, 5} and b(x) = x or
$b(x) = \mathrm{silu}(x) = x/(1 + e^{-x})$.

Fig. 1 Visualization of Kolmogorov-Arnold Network with shape [2,2,1] - the nodes are connected
by learnable activation functions.
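
The activation in (11) and (12) can be sketched with the standard Cox-de Boor recursion for B-spline basis functions. This is a minimal illustration that assumes a uniform grid on [−1, 1] extended by k knots on each side, which gives G + k basis functions. The helper names are hypothetical, and the grid handling in pykan may differ.

```python
import math

def bspline_basis(i, k, knots, x):
    """Cox-de Boor recursion for the i-th B-spline basis function of degree k."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + k] != knots[i]:
        left = ((x - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(i, k - 1, knots, x))
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - x) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, knots, x))
    return left + right

def make_knots(lo, hi, G, k):
    """Uniform grid with G intervals on [lo, hi], extended by k knots per side."""
    h = (hi - lo) / G
    return [lo + (j - k) * h for j in range(G + 2 * k + 1)]

def silu(x):
    return x / (1.0 + math.exp(-x))

def activation(x, coeffs, knots, k, w_b=1.0, w_s=1.0):
    """Eq. (11): weighted sum of the fixed basis function and the spline."""
    spline = sum(c * bspline_basis(i, k, knots, x) for i, c in enumerate(coeffs))
    return w_b * silu(x) + w_s * spline

G, k = 5, 3
knots = make_knots(-1.0, 1.0, G, k)
flat = [1.0] * (G + k)                  # all coefficients equal
bumped = [5.0] + [1.0] * (G + k - 1)    # only the first coefficient changed

# Local control: changing c_0 alters the function near x = -0.9 but not near x = 0.9.
near_left = (activation(-0.9, flat, knots, k), activation(-0.9, bumped, knots, k))
near_right = (activation(0.9, flat, knots, k), activation(0.9, bumped, knots, k))
```

With all coefficients equal to 1 the spline term is constant inside the grid, because the basis functions sum to one there. That makes the effect of changing a single coefficient easy to see.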

2.2.2 Regularization
For efficiency and interpretability, we would ideally like our KAN to be as small and
simple as possible. However, we may not know in advance the appropriate shape for
the problem. Hence, [25] proposed a regularization and pruning scheme to simplify a
KAN from an initially large network. First, regularization terms are added to the loss
function to encourage sparsity of the KAN neurons and spline coefficients.
The L1 norm of an activation function ϕ is defined to be its average magnitude
over the training batch of $N_B$ inputs,

$$|\phi|_1 \equiv \frac{1}{N_B} \sum_{s=1}^{N_B} \left|\phi(x^{(s)})\right|, \tag{13}$$

and that of its spline coefficients c is $|c|_1 = \frac{1}{G+k} \sum_{i=0}^{G+k-1} |c_i|$.

Then, the L1 norm of a full KAN layer Φ with $n_{in}$ inputs and $n_{out}$ outputs is
given by the sum of the L1 norms of the individual activations:

$$|\Phi|_1 \equiv \sum_{i=1}^{n_{in}} \sum_{j=1}^{n_{out}} |\phi_{i,j}|_1. \tag{14}$$

Similarly, for the layer’s collective set of spline coefficients C, we have
$|C|_1 \equiv \sum_{i=1}^{n_{in}} \sum_{j=1}^{n_{out}} |c_{i,j}|_1$.
The entropy of Φ is defined to be

$$S(\Phi) \equiv -\sum_{i=1}^{n_{in}} \sum_{j=1}^{n_{out}} \frac{|\phi_{i,j}|_1}{|\Phi|_1} \log\left(\frac{|\phi_{i,j}|_1}{|\Phi|_1}\right). \tag{15}$$

The total regularization included in the loss function is then

$$R = \sum_{l=0}^{L-1} |\Phi_l|_1 + \lambda_{ent} \sum_{l=0}^{L-1} S(\Phi_l) + \lambda_{coef} \sum_{l=0}^{L-1} |C_l|_1, \tag{16}$$

where $\lambda_{ent}$, $\lambda_{coef}$ are the relative strengths of the entropy and coefficient regularization.
The L1 regularization on spline coefficients encourages simpler spline functions,
preventing overfitting. The L1 and entropy regularization on the activation magnitudes
encourage sparsity of activations (edges) and neurons (nodes) in the network. We can
then prune the edges (or nodes) from the network by retaining only those with an L1
norm above some threshold.
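
The quantities in (13) to (15) and the threshold-based pruning rule can be sketched as follows. The function names, the flat-list representation of a layer's edges and the threshold value are assumptions made for illustration; they are not the pykan API.

```python
import math

def l1_norm(post_activations):
    """Eq. (13): mean absolute post-activation over a batch."""
    return sum(abs(v) for v in post_activations) / len(post_activations)

def layer_l1_and_entropy(edge_norms):
    """Eqs. (14) and (15) for one layer, given |phi_{i,j}|_1 for every edge."""
    total = sum(edge_norms)
    entropy = -sum((n / total) * math.log(n / total) for n in edge_norms if n > 0)
    return total, entropy

def prune_mask(edge_norms, threshold):
    """Keep only the edges whose L1 norm exceeds the threshold."""
    return [n > threshold for n in edge_norms]

# Four edges with equal magnitude: entropy is maximal (log 4).
dense_total, dense_entropy = layer_l1_and_entropy([1.0, 1.0, 1.0, 1.0])
# One dominant edge: same total magnitude, much lower entropy.
sparse_total, sparse_entropy = layer_l1_and_entropy([3.97, 0.01, 0.01, 0.01])

kept = prune_mask([0.9, 0.005, 0.4], threshold=0.01)
```

Both layers in the example have the same total L1 norm, so only the entropy term distinguishes them. This is why penalising entropy pushes the magnitude onto a few edges, which thresholding can then isolate.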

3 CoxKAN
CoxKAN is a novel proportional hazards model where the log-partial hazard is
estimated by a KAN with a single output node:

$$\hat{\theta}_{KAN}(x) = \mathrm{KAN}(x). \tag{17}$$

The CoxKAN training pipeline can be summarised by the following: Hyperparameter
search → train with sparsity regularization → auto-prune network → fit symbolic
representation to result. The latter two steps are visualized in Fig. 2.

Loss and Optimization


CoxKAN is trained using the following objective:

$$\ell_{total} = \ell_{Cox} + \lambda R, \tag{18}$$

where R is the regularization in (16), λ controls overall regularization strength and
$\ell_{Cox}$ is a fast approximation to the Cox loss in (6), in which the risk-set $R(t_i)$ is slightly
inaccurate when many patients have the same observed duration (we refer readers to
our well-documented software for further details). This is useful since KANs are slow
to train compared to MLPs. CoxKAN is optimized using Adam [32], taking steps on
the whole training set (as opposed to batches) for training stability.
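
One common way to obtain a fast approximation of this kind is to sort patients by descending duration and accumulate the risk-set sum as a running total, which replaces the quadratic risk-set construction with a single pass. The sketch below illustrates that general idea and the tie-related inaccuracy it introduces. It is an assumption about the style of approximation, not a transcription of the CoxKAN code, and `fast_cox_loss` is a hypothetical name.

```python
import math

def fast_cox_loss(theta, times, events):
    """Approximate Cox loss: sort by descending time, use a running sum
    of exp(theta) as the risk-set denominator."""
    order = sorted(range(len(times)), key=lambda j: -times[j])
    loss, running_sum = 0.0, 0.0
    for j in order:
        running_sum += math.exp(theta[j])
        if events[j] == 1:
            loss -= theta[j] - math.log(running_sum)
    return loss

# No tied durations: the running sum reproduces every risk set exactly.
no_ties = fast_cox_loss([0.0] * 4, [5.0, 3.0, 8.0, 1.0], [1, 0, 1, 1])

# Two patients with the same duration: the first one processed does not
# yet "see" the other in its risk set, so the loss is underestimated.
tied = fast_cox_loss([0.0, 0.0], [2.0, 2.0], [1, 1])
exact_tied = 2 * math.log(2)  # both risk sets contain both patients
```

Without ties the result matches the exact loss (log 8 for this toy data). With the tied pair the approximation gives log 2 instead of 2 log 2, which is the kind of inaccuracy the text refers to.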

Hyperparameter Tuning
We implement random hyperparameter optimization [33] with the Python package
Optuna [34] using the Tree-structured Parzen Estimator [35] algorithm to efficiently
search the hyperparameter space. The objective function we optimize on is the average
C-Index of the pruned CoxKAN over a 4-fold cross-validation of the experiment’s
training set.

Early Stopping
We conduct early stopping based on validation set C-Index. For smaller datasets we
may want to train on the full training set instead of reserving an extra validation set,
hence we include early stopping as an optimizable hyperparameter. For datasets where
early stopping is found to be optimal, we reserve 20% of the designated
training set as the validation set.

Pruning
After training CoxKAN, we prune activation functions in the network by removing
those that have L1 norms below a certain threshold. This allows for automatic feature
selection and control of the network shape. The L1 threshold is a tunable hyperparam-
eter, but when using a validation set for early stopping, we instead select the optimal
threshold based on validation performance.

Symbolic Fitting
For interpretability, we would like the activation functions of CoxKAN to be clean
symbolic formulas rather than parameterised B-spline curves.
Reference [25] proposed the following procedure to convert a KAN to a symbolic
representation: If we suspect a given activation function ϕ(x) is approximating a known
symbolic operator f (e.g., sin or exp), then we can set the activation function to
ϕ(x) = cf(ax + b) + d. The affine parameters (a, b, c, d) are found by fitting them
to a set of pre- and post-activations $\{x^{(s)}, y^{(s)}\}_{s=1}^{M}$, such that y ≈ cf(ax + b) + d.
This is done by iterative grid search for (a, b) and linear regression for (c, d). The
quality of the fit is measured by the coefficient of determination, R² (AKA “fraction
of variance explained”). We can either inspect the activations by eye and choose a
suitable function to fit, or we can use pykan’s auto_symbolic method, which simply
fits all symbolic operators from a large library and selects the operator that achieves
the highest R².
In this work, we used auto_symbolic with a library of 22 functions (see Appendix
A), with a few additional improvements. Firstly, several of these functions can become
linear with the right choice of affine parameters, but if a learnt activation is linear
then we want this to be reflected in the symbolic formula. Hence, after training and
pruning CoxKAN, we first fit the linear function f(x) = ax + b (special case with
two affine parameters instead of four) to all activation functions and accept the fit if
R² > 0.99, otherwise we proceed normally (auto_symbolic or recognition by eye).
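
The affine fitting step can be sketched as a grid search over (a, b) with a closed-form least-squares solution for (c, d) at each grid point, scored by R². This is a simplified single-pass version written for illustration; the procedure described above refines the grid iteratively, and the name `fit_affine` and the grids used here are assumptions.

```python
import math

def fit_affine(f, xs, ys, a_grid, b_grid):
    """Fit y ≈ c*f(a*x + b) + d. Returns (r2, a, b, c, d) for the best grid point."""
    n = len(xs)
    y_mean = sum(ys) / n
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    best = None
    for a in a_grid:
        for b in b_grid:
            z = [f(a * x + b) for x in xs]
            z_mean = sum(z) / n
            var_z = sum((v - z_mean) ** 2 for v in z)
            if var_z == 0:
                continue  # f is constant on this grid point, nothing to regress on
            # Simple linear regression of y on z gives c (slope) and d (intercept).
            c = sum((v - z_mean) * (y - y_mean) for v, y in zip(z, ys)) / var_z
            d = y_mean - c * z_mean
            ss_res = sum((y - (c * v + d)) ** 2 for v, y in zip(z, ys))
            r2 = 1.0 - ss_res / ss_tot
            if best is None or r2 > best[0]:
                best = (r2, a, b, c, d)
    return best

# Synthetic pre-/post-activations generated from 2*sin(3x + 0.5) - 1.
xs = [-1.0 + 0.05 * i for i in range(41)]
ys = [2.0 * math.sin(3.0 * x + 0.5) - 1.0 for x in xs]

r2, a, b, c, d = fit_affine(math.sin, xs, ys,
                            a_grid=[1.0, 2.0, 3.0, 4.0],
                            b_grid=[0.0, 0.5, 1.0])
```

Running the same routine with each candidate operator in a library and keeping the one with the highest R² mirrors the selection rule described for auto_symbolic. The linear special case corresponds to accepting f(x) = x whenever its R² exceeds 0.99.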

Secondly, certain activation functions may be so complex that (a) we cannot recognise
their symbolic form by eye and (b) no operators in our library fit sufficiently well.
In this case, the procedure described above fails. Instead, CoxKAN can
find a symbolic form for the activation function by using a genetic algorithm to
perform symbolic regression on the pre- and post-activations $\{x^{(s)}, y^{(s)}\}_{s=1}^{M}$, searching
a much wider space of symbolic functions. This step
is based on the Python package PySR [30]. To favour simple symbolic operators,
CoxKAN does not use PySR by default.

Handling categorical covariates


We use label-encoding to deal with categorical covariates. The local control of B-splines
means that distinct parameters control the function at different regions of the input
domain, hence the network can deal with each category (almost) independently even
though they are encoded in the same dimension. To then get a symbolic representation
of the B-spline, we simply replace it with a discrete map.

Remark: Why not CoxPH with non-linear terms?


It is possible to incorporate non-linearities and interaction terms into the CoxPH with
manual feature engineering, but this requires extensive domain expertise, whereas
CoxKAN learns them automatically. Another option is to use spline terms in the
log-partial hazard to learn non-linear features, which is essentially CoxKAN with
no hidden layers, no pruning, and no symbolic fitting. It therefore cannot capture
interactions, does not have the property of universality (thus higher bias), and is less
interpretable than CoxKAN since the terms do not become clean symbolic operators.

4 Results
To evaluate CoxKAN as comprehensively as possible we conducted experiments on
both synthetic and real datasets (13 in total). For each experiment, we train CoxKAN
using the procedure described in Section 3 (hyperparameter search → train with spar-
sity regularization → auto-prune → fit symbolic). The hyperparameters found in each
case are detailed in Appendix A.
On the real datasets we compare CoxKAN to CoxPH and DeepSurv. To ensure
a fair comparison, we use the same hyperparameter searching strategy for DeepSurv
as used for CoxKAN, and we boost performance of DeepSurv as much as possible
by enabling modern deep learning techniques such as early stopping, dropout, batch
normalization and weight decay (L2 regularization).
We evaluate all models using the concordance index c (C-Index) [36], which is the
most common metric to judge predictive accuracy of a survival model. It measures
how well the model predictions agree with the ranking of the patients’ survival times,
where c = 0.5 corresponds to random ranking and c = 1 is a perfect ranking. We
obtain 95% confidence intervals by bootstrapping [37] the test set (sampling with
replacement), and characterise the difference in performance between two models as
statistically significant if the confidence intervals do not overlap.
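
A minimal sketch of the C-Index and a percentile bootstrap is given below. It assumes the usual pairwise definition: a pair is comparable when the patient with the shorter time had an observed event, and ties in predicted risk count as half. The number of bootstrap resamples and the helper names are illustrative choices, not details taken from the paper.

```python
import random

def c_index(risk, times, events):
    """Fraction of comparable pairs in which the higher predicted risk
    belongs to the patient with the earlier observed event."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else None

def bootstrap_ci(risk, times, events, n_boot=200, alpha=0.05, seed=0):
    """Percentile confidence interval from resampling the test set with replacement."""
    rng = random.Random(seed)
    n = len(times)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        score = c_index([risk[i] for i in idx],
                        [times[i] for i in idx],
                        [events[i] for i in idx])
        if score is not None:  # skip resamples with no comparable pairs
            scores.append(score)
    scores.sort()
    lo = scores[int((alpha / 2) * (len(scores) - 1))]
    hi = scores[int((1 - alpha / 2) * (len(scores) - 1))]
    return lo, hi

times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]

perfect = c_index([4.0, 3.0, 2.0, 1.0], times, events)   # risk falls as time rises
reversed_ = c_index([1.0, 2.0, 3.0, 4.0], times, events)
constant = c_index([0.0, 0.0, 0.0, 0.0], times, events)
ci = bootstrap_ci([4.0, 3.0, 2.0, 1.0], times, events)
```

The three calls illustrate the reference points mentioned above: a perfect ranking, a fully inverted ranking, and a constant prediction that scores 0.5.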

Fig. 2 Visualization of the CoxKAN pruning and symbolic fitting pipeline for the synthetic dataset
generated using a Gaussian log-partial hazard.

Often the symbolic formula predicted by CoxKAN contains both terms where
covariates contribute to the log-partial hazard in isolation (without interacting) and
terms that involve the interaction between covariates. We refer to the former as
“isolation terms” and the latter as “interaction terms”.

4.1 Evaluation with Synthetic Data


In the original paper [25], KANs were shown to recover exact symbolic formulas that
were used to generate toy regression datasets. However, the authors did not simu-
late any noise in these datasets. Survival data typically provides a very noisy signal,
not only because time-to-event is a random variable, but also because the censoring
distribution adds an additional layer of uncertainty.
To ascertain whether KANs can successfully recover symbolic formulas from sur-
vival data, we generated four datasets based on a proportional-hazards model (3),
using custom symbolic formulas for the log-partial hazard, a constant baseline hazard
of 0.01 and a uniform censoring distribution. The covariates were sampled uniformly
in [−1, 1] unless stated otherwise. We also added two irrelevant noisy covariates to
each dataset. Further details about the data generation are found in Appendix B.
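
The data-generating process described above can be sketched as follows. With a constant baseline hazard h0, a patient's hazard h0·exp(θ(x)) is also constant, so the event time is exponentially distributed with that rate. The upper bound of the uniform censoring distribution (`max_censor`), the sample size and the function names are assumptions made here; the paper's exact settings are in its Appendix B.

```python
import math
import random

def gaussian_theta(x):
    """Log-partial hazard of the Gaussian dataset; only x[0] and x[1] matter."""
    return 5.0 * math.exp(-2.0 * (x[0] ** 2 + x[1] ** 2))

def simulate(theta_fn, n, n_covariates, baseline=0.01, max_censor=100.0, seed=0):
    """Draw (covariates, observed time, event indicator) triples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n_covariates)]
        rate = baseline * math.exp(theta_fn(x))      # constant hazard for this patient
        event_time = rng.expovariate(rate)           # exponential time-to-event
        censor_time = rng.uniform(0.0, max_censor)   # assumed censoring window
        observed = min(event_time, censor_time)
        delta = 1 if event_time <= censor_time else 0
        data.append((x, observed, delta))
    return data

# Two informative covariates plus two irrelevant ones, as in the text.
dataset = simulate(gaussian_theta, n=1000, n_covariates=4)
n_events = sum(delta for _, _, delta in dataset)
```

Covariates 3 and 4 never enter `gaussian_theta`, so they play the role of the irrelevant features that pruning is expected to remove.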
In each case, we observe that the pruning of CoxKAN successfully removes the
irrelevant features (demonstrating automatic feature selection), and leaves CoxKAN
with a shape that is most appropriate for the problem (unless the hyperparameter
search already yielded the “correct” shape). The results³ are given in Table 1 and the
pruned CoxKANs (before symbolic fitting) are visualized in Fig. 3. In all four cases,
CoxPH fails to accurately predict the hazard function.

³ Note that since survival time is a random variable, the true formula does not achieve c = 1. In fact,
CoxKAN Symbolic actually achieves a slightly higher C-Index than the true formula on the Gaussian
dataset. This is a result of variance and does not suggest that CoxKAN can be “better” than the
ground-truth. In the limit of an infinite dataset, achieving a higher C-Index than the true formula is impossible.

Table 1 Synthetic Datasets: C-Index (95% Confidence Interval)

| Dataset | Log-partial hazard | True | CoxPH | CoxKAN Symbolic | Recovered |
|---|---|---|---|---|---|
| Gaussian | $5\exp(-2(x_1^2 + x_2^2))$ | 0.759744 (0.759, 0.760) | 0.499213 (0.497, 0.500) | 0.759747 (0.758, 0.760) | ✓ |
| Shallow | $\tanh(5x_1) + \sin(2\pi x_2) + x_3^2$ | 0.759795 (0.759, 0.761) | 0.688116 (0.688, 0.690) | 0.759562 (0.759, 0.761) | ✓ |
| Deep | $2\sqrt{(x_1 - x_2)^2 + (x_3 - x_4)^2}$ | 0.725470 (0.724, 0.726) | 0.511198 (0.510, 0.513) | 0.722706 (0.721, 0.723) | ✓ |
| Difficult | $\tanh(5(\log(x_1) + \lvert x_2 \rvert))$ | 0.690174 (0.689, 0.691) | 0.663698 (0.663, 0.665) | 0.690127 (0.689, 0.691) | × |

Fig. 3 Visualizations of CoxKAN trained on synthetic datasets, after pruning but before symbolic
fitting. The ϵ’s represent the irrelevant features added to each dataset (successfully pruned in all cases).

4.1.1 Gaussian Formula

We first set the log-partial hazard to be a Gaussian function:

$$\theta(x) = 5\exp(-2(x_1^2 + x_2^2)).$$


By visualizing the learned activation functions of the pruned CoxKAN in Fig. 3(a),
we can recognise them as two quadratic functions and the exponential function. These
symbolic functions fit to the learned activation functions with a coefficient of determination
R² > 0.99 in each case, verifying that they were indeed learned. After training
the affine parameters for an additional 50 steps we are left with the following formula
for the log-partial hazard:

$$\hat{\theta}_{KAN} = 4.98\, e^{-1.99 (x_1^2 + x_2^2)},$$

which is approximately the same as the true formula. Unsurprisingly, Table 1
shows this achieves near-identical performance to the true log-partial hazard. By con-
trast, CoxPH achieves a C-Index of approximately 0.5, which is the same as randomly
ranking survival times and demonstrates that CoxPH has no predictive value for this
dataset.

4.1.2 Shallow Formula
It is common in survival data to encounter covariates which satisfy the linear CoxPH
assumptions after some non-linear transformation. That is, they have non-linear
relationships to the patient’s risk but they do not interact with each other.
To determine whether CoxKAN can automatically detect and solve this situation,
we set the log-partial hazard to

$$\theta(x) = \tanh(5x_1) + \sin(2\pi x_2) + x_3^2.$$

Following the same procedure as above, CoxKAN predicts the following formula
(affine parameters rounded to 1 d.p.):

$$\hat{\theta}_{KAN} = \tanh(5.1 x_1) - \sin(6.3 x_2 - 9.4) + x_3^2.$$

Upon first glance the sin term appears incorrect, but we note that
$-\sin(6.3 x_2 - 9.4) \approx -\sin(2\pi x_2 - 3\pi) = \sin(2\pi x_2)$.
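
The final equality follows from the angle-subtraction identity, using cos(3π) = −1 and sin(3π) = 0. The rounded parameters are close to these constants, since 2π ≈ 6.283 and 3π ≈ 9.425:

```latex
-\sin(2\pi x_2 - 3\pi)
  = -\bigl[\sin(2\pi x_2)\cos(3\pi) - \cos(2\pi x_2)\sin(3\pi)\bigr]
  = -\bigl[-\sin(2\pi x_2) - 0\bigr]
  = \sin(2\pi x_2)
```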

4.1.3 Deep Formula


To contrast with the previous example, we next set the log-partial hazard to an
expression that requires a deep KAN (2 hidden layers) to capture:

$$\theta(x) = 2\sqrt{(x_1 - x_2)^2 + (x_3 - x_4)^2}.$$

CoxKAN predicts the formula:

$$\hat{\theta}_{KAN} = 4\sqrt{x_3^2 + 0.8 x_4^2 + 0.9(0.1 - x_1)^2 + (0.1 - x_2)^2 - 0.5(x_1 + x_2 - 0.1)^2 - 0.7(x_3 + 0.7 x_4 + 0.1)^2 + 0.6}.$$

By multiplying this out and making some liberal approximations to the affine
parameters, we recover the original formula:

$$\hat{\theta}_{KAN} \approx 4\sqrt{\tfrac{1}{2}\left(x_1^2 - 2x_1x_2 + x_2^2 + x_3^2 - 2x_3x_4 + x_4^2\right)} \approx 2\sqrt{(x_1 - x_2)^2 + (x_3 - x_4)^2}.$$

4.1.4 Difficult Formula


Finally, we selected a formula for the log-partial hazard that we hypothesized would
be difficult to recover exactly:

$$\theta(x) = \tanh(5(\log(x_1) + |x_2|)),$$

where x1 ∈ [0.1, 1], x2 ∈ [−1, 1]. The intuition here is that tanh(5z) has a shallow
gradient in most of its input domain, hence large portions of the input space have
similar survival functions, which should cause the data to have a weak training signal.
Furthermore, the activations of KANs are necessarily smooth (property of B-spline
curves), hence it is likely difficult to learn |x2|.
By visualizing the CoxKAN activations in Fig. 3, we would not naturally recognise
them as the ‘true’ operators by eye, thus we perform the default auto-symbolic fitting
procedure described in Section 3 and obtain the (wrong) formula:

$$\hat{\theta}_{KAN}(x) = -1.0 \tanh\left(\frac{8.3}{(0.4 x_1 + 1)^3} + 3.0\, e^{-2.6 x_2^2} - 9.8\right).$$

Fig. 4 Log-partial hazard surfaces of (a) the true formula θ(x) = tanh(5(log(x1) + |x2|)) and (b) the
‘incorrect’ CoxKAN-predicted formula. They are very similar, demonstrating a strong approximation
in the relevant domain.

Table 1 tells us that, despite being the ‘wrong’ formula, there is no statistically
significant difference between the C-Index of θ̂KAN and the true log-partial haz-
ard. In Fig. 4 we visualise the true and predicted log-partial hazards, which look
extremely similar. This tells us that in the domain of interest, the predicted for-
mula is a very strong approximation to the true formula. We argue that CoxKAN
still exhibits high-performance and interpretability in this case, since the effect of the
covariates on the hazard is still accurately captured in a symbolic form.

4.2 Evaluation on Real Clinical Data


Real survival data is complex and may not always be appropriately modeled by sim-
ple symbolic formulas like those seen in the previous section. To assess the real-world
application of CoxKAN, we first compare its performance on 5 clinical datasets to
CoxPH and DeepSurv [8]. The results are presented in Table 2. Three of the datasets
(SUPPORT, GBSG, METABRIC) were obtained from the DeepSurv GitHub repos-
itory [8], hence for these experiments, we quote the published DeepSurv results. We
provide results for CoxKAN straight after training (“CoxKAN Trained”), after prun-
ing (“CoxKAN Pruned”), and after symbolic fitting (“CoxKAN Symbolic”). We find
that CoxKAN Symbolic outperforms CoxPH and DeepSurv on all datasets
except FLCHAIN.
We also observe that CoxKAN Symbolic generally achieves a higher C-Index than
CoxKAN Trained. This is not statistically significant since the confidence intervals

13
Table 2 Clinical datasets: C-Index (95% Confidence Interval). Highest C-Index in bold.

Dataset    CoxPH            DeepSurv         CoxKAN Trained   CoxKAN Pruned    CoxKAN Symbolic

SUPPORT    0.583074         0.618308∗        0.624482         0.624485         0.623755
           (0.581, 0.585)   (0.616, 0.620)   (0.622, 0.625)   (0.622, 0.625)   (0.623, 0.626)

GBSG       0.656291         0.668402∗        0.678294         0.679219         0.682796
           (0.655, 0.662)   (0.665, 0.671)   (0.676, 0.682)   (0.675, 0.681)   (0.678, 0.684)

METABRIC   0.632363         0.643375∗        0.647177         0.648004         0.649618
           (0.628, 0.637)   (0.639, 0.647)   (0.644, 0.652)   (0.646, 0.654)   (0.644, 0.651)

FLCHAIN    0.797854         0.794520         0.797064         0.795911         0.796281
           (0.797, 0.802)   (0.793, 0.798)   (0.796, 0.801)   (0.792, 0.797)   (0.795, 0.800)

NWTCO      0.698347         0.698300         0.719843         0.720721         0.722225
           (0.693, 0.703)   (0.692, 0.703)   (0.714, 0.725)   (0.708, 0.718)   (0.715, 0.725)

∗ DeepSurv results on SUPPORT, GBSG, METABRIC are quoted from the official DeepSurv publication.

overlap, but it occurred on 4 of 5 datasets, and it is intuitive that the pruning and
symbolic fitting pipeline would reduce variance error. Pruning removes irrelevant noisy
features and makes the network smaller, and symbolic fitting smooths out the activa-
tions; thus, this pipeline provides an inductive bias towards simpler functions
that are more likely to generalize well. The rest of this section analyses the results of
each experiment in more depth.
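For readers who want to reproduce the metric reported in Table 2, the following is a minimal sketch of Harrell's concordance index in plain Python. The function name and the toy data are illustrative assumptions, not part of the CoxKAN code base, and the quadratic loop is written for clarity rather than speed.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs that the risk score orders correctly.

    A pair (i, j) is comparable when subject i has an observed event and
    times[i] < times[j]. It is concordant when subject i also has the higher
    predicted risk. Tied risks count as half a concordant pair.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): four subjects, the third one is censored.
times = [2.0, 4.0, 6.0, 8.0]
events = [1, 1, 0, 1]
risks = [0.9, 0.1, 0.4, 0.2]
c_index = concordance_index(times, events, risks)
perfect = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
```

In this toy case three of the five comparable pairs are ordered correctly, so `c_index` is 0.6, while a risk score that decreases with survival time gives 1.0.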

4.2.1 Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatment (SUPPORT)

The Study to Understand Prognoses and Preferences for Outcomes and Risks of Treat-
ment (SUPPORT) investigates the survival of hospitalized, seriously-ill adults [38].
The dataset consists of 7,098 patients for training and 1,775 for testing. Each patient
is described by the following covariates: age, race, number of comorbidities, diabetes
indicator, dementia indicator, cancer status, mean blood pressure (meanbp), heart
rate (hr), respiration rate (rr), temperature (temp), serum sodium, white blood cell
count (wbc), and serum creatinine.
Following the usual procedure, we train and auto-prune CoxKAN. The result-
ing network has three hidden neurons and can be considered to consist of two main
sub-networks. The first sub-network is that which involves the first and third hidden
neurons, and has linear activations in the 2nd layer that are equivalent to ‘skipping
a layer’, hence it represents non-interacting terms that contribute to the log-partial
hazard in isolation. We perform the default auto-symbolic fitting on this sub-network.
The second sub-network is that which involves the second hidden neuron, and encodes
a complex interaction between the patient age and cancer status. The single activation
function in the 2nd layer of this sub-network, which we denote ϕ1,1,2 ≡ ϕinteract , has
no obvious symbolic form so for now we leave it as non-symbolic (we will return to this
soon). In Fig. 5(a) we visualize the full partially-symbolic network and in Fig. 5(b)
we depict the second sub-network (encoding the interaction) more clearly, where each
data point in each activation function represents the value of that activation for a

Table 3 Summary of CoxKAN-extracted interaction between age and cancer status in the SUPPORT
dataset. We verify the interaction by splitting the patients into the relevant subgroups and fitting
CoxPH to the age column.

Patient Sub-Group                        CoxKAN Observation                  CoxPH Verification

No Cancer                                Risk increases sharply with age.    0.02 · age
Metastatic Cancer                        Risk increases with age. (a)        0.008 · age
Non-Metastatic Cancer & Age ≤ 60 y/o     Risk decreases with age.            −0.004 · age
Non-Metastatic Cancer & Age > 60 y/o     Risk increases with age.            0.003 · age

(a) Increases less sharply and in a non-linear fashion.

given patient. In Fig. 5(c), we re-plot ϕinteract with colour indicating the patient’s
cancer status (top) and age (bottom). We observe that:
• Patients with non-metastatic cancer are at high risk, and their risk initially decreases
with age (until approximately 60 years old) and then increases.
• Patients without cancer are at lower risk, but their risk increases sharply with age.
• Patients with metastatic cancer are at the highest risk, and their risk increases
non-linearly with age.
Obtaining insights like this using existing survival methods would require cum-
bersome work involving the stratification of patients into subgroups and searching for
trends. This result demonstrates the power of CoxKAN to automatically extract
complex insights from survival data.
To verify this interaction as a true property of the dataset, we split the patients
into 4 relevant subgroups and fitted CoxPH on the age column. The interaction and
corresponding verification are summarised in Table 3.
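The subgroup check described above amounts to fitting a Cox model with age as the only covariate inside each subgroup and reading off the sign and size of the coefficient. The excerpt does not say which CoxPH implementation was used, so the sketch below substitutes a small hand-written Newton-Raphson solver for the partial likelihood (Breslow handling of ties); the function name and the toy data are hypothetical.

```python
import math


def fit_univariate_cox(x, time, event, n_iter=50, tol=1e-9):
    """Estimate beta in h(t | x) = h0(t) * exp(beta * x) by Newton-Raphson."""
    beta = 0.0
    n = len(x)
    for _ in range(n_iter):
        grad, info = 0.0, 0.0
        for i in range(n):
            if not event[i]:
                continue  # only observed events contribute terms
            # Risk set: everyone still under observation at time[i].
            risk = [j for j in range(n) if time[j] >= time[i]]
            w = [math.exp(beta * x[j]) for j in risk]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, risk))
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
            grad += x[i] - s1 / s0               # score contribution
            info += s2 / s0 - (s1 / s0) ** 2     # observed information
        if info == 0.0:
            break
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Toy subgroup (hypothetical): larger covariate values tend to fail earlier.
age = [1, 2, 3, 4, 5, 6]
time = [6, 4, 5, 2, 3, 1]
event = [1, 1, 1, 1, 1, 1]
beta = fit_univariate_cox(age, time, event)
beta_mirrored = fit_univariate_cox([-a for a in age], time, event)
```

A positive `beta` corresponds to entries such as "0.02 · age" in Table 3, and a negative one to "−0.004 · age". In practice one would run the fit once per subgroup after filtering the patients by cancer status and age.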

Fig. 5 Visualization of how CoxKAN extracts a meaningful interaction between two covariates in
the SUPPORT dataset. (a) Full network where activation functions involved in the interaction are
non-symbolic (shown in black), and all other activation functions are symbolic (shown in red). (b)
Sub-network that encodes the interaction between patient age and cancer status. Each data point in
each activation function represents the value of that activation function for a given patient. (c) Top:
ϕinteract where colour indicates cancer status, Bottom: ϕinteract where colour indicates age.

Fig. 6 Some of the non-linear symbolic terms in the CoxKAN-predicted hazard for the SUPPORT
dataset. Each data-point represents a patient.

The full (partially symbolic) CoxKAN formula is given by:

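\[
\begin{aligned}
\hat{\theta}_{\mathrm{KAN}} ={}& \phi_{\mathrm{interact}} - 0.0002 \cdot \text{age} + 0.003 \cdot \text{creatinine} + 0.04 \cdot \text{comorbidities} \\
&+ 0.9\, e^{-0.06(1 - 0.1 \cdot \text{meanbp})^{2}} + 0.1 \tanh(0.02 \cdot \text{hr} - 3) - 0.06 \sin(0.08 \cdot \text{rr} + 0.2) \\
&+ 0.6\, e^{-572(1 - 0.02 \cdot \text{temp})^{2}} + 0.0008 \cdot \text{sodium} + 0.03 \tan(0.02 \cdot \text{wbc} - 4) \\
&+ \begin{cases} 0.007 & \text{if male} \\ -0.01 & \text{if female} \end{cases}
 + \begin{cases} -0.03 & \text{if diabetes} \\ 0.0006 & \text{otherwise} \end{cases}
 + \begin{cases} 0.03 & \text{if dementia} \\ -0.0008 & \text{otherwise} \end{cases} \\
&+ \begin{cases} 0.003 & \text{if metastasis} \\ -0.01 & \text{if no cancer} \\ -0.0098 & \text{if cancer} \end{cases},
\end{aligned}
\]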

where we plot some of the non-linear isolation terms in Fig. 6 for clarity. The isolated
age and cancer status terms can be neglected since most of the effect from these
covariates comes from ϕinteract .
We have just seen that we can leave ϕinteract as the original B-spline curve and
achieve strong interpretability purely by visualization. However, if we still desire a
symbolic form then we can use PySR [30] to find an accurate representation. In Fig. 7
we plot the symbolic fits by using the default auto symbolic method vs using PySR.
We see that auto symbolic causes the loss of important information, whereas PySR
retains essentially all information with the following expression:

ϕinteract (x) = x − sin(x + tanh(sin(x + 0.2)) − 0.8).

Whether this expression actually adds to the interpretability of CoxKAN is debatable,
and it is up to the practitioner to decide which method is preferable.
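To make the contrast between the two symbolic-fitting routes more concrete, the sketch below illustrates the general idea behind fitting a single library function to a learned activation: for each candidate f, search over inner parameters (b, c), solve for the outer parameters (a, d) by least squares so that y ≈ a·f(b·x + c) + d, and keep the candidate with the highest R². This is a simplified stand-in written for illustration, not pykan's actual routine; the candidate library, the parameter grid and all names are assumptions.

```python
import math

# Hypothetical candidate library; the real library is larger.
CANDIDATES = {
    "linear": lambda z: z,
    "sin": math.sin,
    "tanh": math.tanh,
    "square": lambda z: z * z,
}


def fit_affine(u, y):
    """Least-squares a, d for y ~ a*u + d, plus the resulting R^2."""
    n = len(u)
    mu, my = sum(u) / n, sum(y) / n
    suu = sum((ui - mu) ** 2 for ui in u)
    syy = sum((yi - my) ** 2 for yi in y)
    suy = sum((ui - mu) * (yi - my) for ui, yi in zip(u, y))
    if suu == 0.0 or syy == 0.0:
        return 0.0, my, 0.0
    a = suy / suu
    return a, my - a * mu, suy ** 2 / (suu * syy)


def auto_symbolic(x, y, grid):
    """Return (r2, name, a, b, c, d) for the best a*f(b*x + c) + d on the grid."""
    best = None
    for name, f in CANDIDATES.items():
        for b in grid:
            for c in grid:
                u = [f(b * xi + c) for xi in x]
                a, d, r2 = fit_affine(u, y)
                if best is None or r2 > best[0]:
                    best = (r2, name, a, b, c, d)
    return best


# Samples of a "learned" activation that truly is 2*tanh(0.5*x + 1) - 1.
x = [i / 10 for i in range(-30, 31)]
y = [2 * math.tanh(0.5 * xi + 1.0) - 1.0 for xi in x]
r2, name, a, b, c, d = auto_symbolic(x, y, grid=[0.5, 1.0, 1.5, 2.0])
```

When the activation matches one library function, as in this toy case, the search recovers it. An activation like ϕinteract, which PySR describes with nested sines and a tanh, has no single-function match, which is consistent with the loss of information seen in Fig. 7.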

4.2.2 Rotterdam & German Breast Cancer Study Group (GBSG)


Next, we use breast cancer data from the Rotterdam tumor bank [39] for training
(1,546 patients), and data from a study by the German Breast Cancer Study Group
(GBSG) [40] for testing (686 patients). The covariates include hormonal therapy indi-
cator, tumor size, menopausal status, age, number of positive lymph nodes, and the

Fig. 7 Symbolic fitting to the complex activation function ϕinteract using pykan’s auto symbolic
method (left) vs PySR (right). This is an example where auto symbolic fails to capture all important
information.

concentration of progesterone receptors (PGR) and estrogen receptors (ER). The
dataset was preprocessed by [8] according to the method outlined in [41].
The CoxKAN-predicted log-partial hazard is

 
 −0.07 if tumor size ≤ 20 mm
( )  
−0.21 if hormonal therapy 
θ̂KAN = + + 0.21 if 20 < tumor size < 50 mm
0.28 otherwise  
0.48 if tumor size ≥ 50 mm
 
( )
−0.12 if pre-menopausal 2 2
+ + 1.8 (1 − 0.02 · age) − 1.2e−0.02(nodes+0.4)
0.23 if post-menopausal
+ 0.1 cosh (0.002 · PGR − 1.6) − 0.0007 · ER.

This formula is visualized within the structure of CoxKAN in Fig. 8(a). Interest-
ingly, we observe a single trough in the activation functions of age and concentration
of progesterone receptor (PGR), indicating a “sweet spot” for these covariates.
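The location of each trough can be read directly from the symbolic terms. As a small worked check, the sketch below evaluates the age and PGR terms of the CoxKAN formula on a grid and finds their minima. It assumes the formula is expressed in the raw covariate units; the grid ranges are arbitrary choices made for illustration.

```python
import math


def age_term(age):
    # 1.8 * (1 - 0.02 * age)^2 from the CoxKAN formula for GBSG
    return 1.8 * (1 - 0.02 * age) ** 2


def pgr_term(pgr):
    # 0.1 * cosh(0.002 * PGR - 1.6) from the same formula
    return 0.1 * math.cosh(0.002 * pgr - 1.6)


# Grid search over illustrative ranges (assumed, not taken from the data).
best_age = min(range(20, 91), key=age_term)
best_pgr = min(range(0, 2001, 10), key=pgr_term)
```

The grid minima land at age 50 and PGR 800, matching the analytic zeros of 1 − 0.02·age and 0.002·PGR − 1.6. Whether these values fall inside the well-populated part of the data is a separate question that the activation plots in Fig. 8(a) answer.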
The CoxPH-predicted log-partial hazard is

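\[
\hat{\theta}_{\mathrm{CPH}} = 0.003 \cdot \text{age} + 0.3 \cdot \text{size} - 0.3 \cdot \text{hormon} + 0.26 \cdot \text{meno} + 0.06 \cdot \text{nodes} - 0.0003 \cdot \text{PGR} - 0.0003 \cdot \text{ER},
\]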

which has similar trends to θ̂KAN (i.e., patient risk increases with tumor size, number
of lymph nodes and menopause, and decreases with hormonal therapy and ER con-
centration) but has worse performance, which can be attributed to bias error due to
the linear assumption.

4.2.3 Molecular Taxonomy of Breast Cancer International Consortium (METABRIC)

The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC)
is a research project investigating gene expression in breast cancer [42]. The dataset
was preprocessed by [8], and consists of the expression of 4 genes (EGFR, PGR,

Fig. 8 Visualizations of CoxKAN Symbolic for the following datasets: (a) GBSG, (b) METABRIC,
(c) FLCHAIN. Activations containing a “c” are functions of categorical covariates that were converted
to a discrete map.

ERBB2, MKI67) and 5 clinical features (age and indicators for hormone treatment,
radiotherapy, chemotherapy, estrogen receptor). There are 1,523 patients for training
and 381 for testing. CoxKAN predicts the log-partial hazard as

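\[
\begin{aligned}
\hat{\theta}_{\mathrm{KAN}} ={}& -0.24 \cdot \text{PGR} + 0.2 \tanh(1.9 \cdot \text{MKI67} - 10) \\
&+ 0.7\, e^{-26(1 - 0.06 \cdot \text{ERBB2})^{2}} - 1.7 \sin(0.04 \cdot \text{age} - 9.5) \\
&+ \begin{cases} 0.1 & \text{if hormonal therapy} \\ 0.03 & \text{otherwise} \end{cases}
 + \begin{cases} 0.01 & \text{if radiotherapy} \\ 0.18 & \text{otherwise} \end{cases} \\
&+ \begin{cases} 0.6 & \text{if chemotherapy} \\ -0.05 & \text{otherwise} \end{cases}
 + \begin{cases} 0.07 & \text{if ER positive} \\ -0.04 & \text{otherwise} \end{cases}.
\end{aligned}
\]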

We visualize this formula in Fig. 8(b). These genes are among the most extensively
studied in breast cancer; increased PGR expression is associated with better prog-
nosis [43] and increased expression of ERBB2 and MKI67 is associated with poorer
prognosis and highly aggressive tumors [44]. These effects are re-discovered here using
CoxKAN, with precise symbolic formulas.

4.2.4 Assay of serum free light chain (FLCHAIN)


The FLCHAIN dataset (obtained from [45]) contains 7,874 subjects from a study on
the relationship between the concentration of immunoglobulin light chains (serum free
light chain, FLC) and mortality [46, 47]. We reserved 20% of the patients at random
for testing. The covariates include age, sex, year of blood sample, the kappa and
lambda portions of serum free light chain, FLC group, serum creatinine, and an indicator
of monoclonal gammopathy (MGUS).
As can be seen in Table 2, all models achieve very similar performance with heavily
overlapping confidence intervals, which suggests that the linear CoxPH assumption

holds on this dataset. CoxKAN predicts:
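\[
\begin{aligned}
\hat{\theta}_{\mathrm{KAN}} ={}& 0.09 \cdot \text{age} + \begin{cases} -0.047 & \text{if female} \\ 0.118 & \text{if male} \end{cases}
 + 0.4 \arctan(0.4 \cdot \text{year} - 737) + 0.04 \cdot \text{FLC}_{\text{kappa}} \\
&+ 0.3 \cdot \text{FLC}_{\text{lambda}} + 0.009 \cdot \text{FLC}_{\text{group}} + 2 \arctan(0.5 \cdot \text{creatinine} - 0.9),
\end{aligned}
\]

while CoxPH predicts:

\[
\begin{aligned}
\hat{\theta}_{\mathrm{CPH}} ={}& 0.1 \cdot \text{age} + 0.3 \cdot \text{sex} + 0.06 \cdot \text{year} + 0.01 \cdot \text{FLC}_{\text{kappa}} + 0.2 \cdot \text{FLC}_{\text{lambda}} \\
&+ 0.06 \cdot \text{FLC}_{\text{group}} + 0.03 \cdot \text{creatinine} + 0.3 \cdot \text{mgus}.
\end{aligned}
\]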

All trends are essentially the same, which validates that CoxKAN can handle
situations where the linear assumption is appropriate.
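One way to compare the two formulas term by term is to convert differences in log-partial hazard into hazard ratios, since under the Cox model two patients who differ only in one term have a hazard ratio equal to the exponential of that difference. The sketch below does this for the sex and age terms. The helper name is hypothetical, and the CoxPH comparison assumes that sex is coded 0/1 and that age enters in the same units in both formulas, which the excerpt does not state.

```python
import math


def hazard_ratio(theta_a, theta_b):
    """Hazard ratio between two patients with log-partial hazards theta_a and theta_b."""
    return math.exp(theta_a - theta_b)


# CoxKAN sex term: 0.118 if male, -0.047 if female.
hr_sex_kan = hazard_ratio(0.118, -0.047)
# CoxPH sex term: 0.3 * sex, assuming a 0/1 coding (an assumption).
hr_sex_cph = hazard_ratio(0.3, 0.0)
# One-unit increase in age under each model.
hr_age_kan = hazard_ratio(0.09, 0.0)
hr_age_cph = hazard_ratio(0.1, 0.0)
```

Under these assumptions the CoxKAN sex term corresponds to a hazard ratio of about 1.18, and both models give hazard ratios above 1 for sex and for age, in line with the statement that the trends agree.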

4.2.5 National Wilms’ Tumor Study (NWTCO)

The National Wilms’ Tumor Study [48] investigated the treatment and survival
outcomes of children with a type of kidney cancer known as Wilms’ tumor. Our dataset
(obtained from [45]) consists of 4,028 subjects from the 3rd and 4th clinical trials of the
study. The event of interest is cancer relapse (not death) and there are 6 covariates:
histology readings (“Favourable Histology (FH)” or “Unfavourable Histology (UH)”)
from local institutions and again from a central lab, cancer stage, clinical trial, age in
months and a binary indication of whether the patient was included in the subcohort
from [49] (subsample stratified jointly on outcome and covariates). We reserved 20%
at random for testing.
CoxKAN has 5 hidden neurons in this case and its symbolic representation is
depicted in Fig. 9(a). The activations ϕ1,1,1 , ϕ1,1,2 , ϕ1,1,5 are linear and ϕ1,1,3 , ϕ1,1,4 are
non-linear, thus the resulting formula is a mixture of isolation and interaction terms:
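\[
\begin{aligned}
\hat{\theta}_{\mathrm{KAN}} ={}& \phi_{1,1,3} + \phi_{1,1,4} + 0.02 \cdot \text{age}
 + \begin{cases} -0.047 & \text{if FH (local)} \\ -0.014 & \text{if UH (local)} \end{cases} \\
&+ \begin{cases} -0.22 & \text{if FH (central)} \\ 0.62 & \text{if UH (central)} \end{cases}
 + \begin{cases} -0.47 & \text{if stage} = 1 \\ 0.04 & \text{if stage} = 2 \\ 0.35 & \text{if stage} = 3 \\ 0.78 & \text{if stage} = 4 \end{cases} \\
&+ \begin{cases} 0.02 & \text{if 3rd study} \\ 0.01 & \text{if 4th study} \end{cases}
 + \begin{cases} 0.2 & \text{if in subcohort} \\ -0.07 & \text{otherwise} \end{cases}
\end{aligned}
\]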

Fig. 9 Visualization of CoxKAN on the NWTCO dataset: (a) CoxKAN after symbolic fitting
(ϕ1,1,1 , ϕ1,1,2 , ϕ1,1,5 are all linear), (b) Interpretable visualizations of the interaction term ϕ1,1,3 -
Top: All patients where colour indicates central histology reading, Middle: Patients with unfavourable
histology where colour indicates age, Bottom: Patients with favourable histology where colour indi-
cates age.

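where ϕ1,1,3 is given by:

\[
\begin{aligned}
\phi_{1,1,3} = -2.5 \arctan\Bigg( & 2 + 0.03 \cdot \text{age}
 + \begin{cases} -0.1 & \text{if FH (local)} \\ -0.4 & \text{if UH (local)} \end{cases}
 + \begin{cases} 0.3 & \text{if FH (central)} \\ -0.4 & \text{if UH (central)} \end{cases} \\
&+ \begin{cases} 0.1 & \text{if stage} = 1 \\ -0.07 & \text{if stage} = 2 \\ -0.03 & \text{if stage} = 3 \\ -0.17 & \text{if stage} = 4 \end{cases}
 + \begin{cases} -0.2 & \text{if 3rd study} \\ 0.09 & \text{if 4th study} \end{cases}
 + \begin{cases} 1 & \text{if in subcohort} \\ -0.3 & \text{otherwise} \end{cases} \Bigg)
\end{aligned}
\]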

and ϕ1,1,4 does not encode any particularly strong interactions between covariates and
has a smaller effect on the hazard than other terms, thus it is not worth discussing.
As we might expect, the isolation terms tell us that unfavourable histology and later
stage cancer are associated with poorer prognosis.
We plot the interaction term ϕ1,1,3 in Fig. 9(b), where colour indicates central
histology readings (top), age for patients with unfavourable histology (middle), and age

Table 4 Summary of CoxKAN-extracted interaction between age and histology in
the NWTCO dataset. We verify the interaction by splitting the patients into the
relevant subgroups and fitting CoxPH to the age column.

Patient Sub-Group    CoxKAN Observation                  CoxPH Verification

FH (central)         Risk increases with age.            0.01 · age
UH (central)         Risk decreases with age.            −0.006 · age
UH & Age < 20        Risk decreases sharply with age.    −0.14 · age

for patients with favourable histology (bottom). It turns out that the full interaction
comes from considering the composite term ϕ1,1,3 + 0.02 · age, and can be summarised
as follows:
• For patients with favourable histology, ϕ1,1,3 is not significant and +0.02 · age
dominates, such that risk increases with age (increasing age worsens prognosis).
• For patients with unfavourable histology, ϕ1,1,3 is a sharply decreasing function,
such that the overall effect of increasing age on risk is negative (particularly for
younger ages).
We validate this interaction by splitting the cohort into subsets and fitting CoxPH
on the age column. The results are summarised in Table 4.
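The two bullet points above can be checked numerically from the symbolic form of ϕ1,1,3. The sketch below evaluates the composite term ϕ1,1,3 + 0.02·age for one favourable-histology and one unfavourable-histology profile and compares the local slope with respect to age. It rests on several assumptions: the argument of arctan is read as a plain sum of the listed terms, age enters in the units used by the formula, and the remaining covariates (stage 1, 3rd study, not in the subcohort) are fixed at hypothetical values chosen only for illustration.

```python
import math

# Coefficients transcribed from the symbolic form of phi_{1,1,3}.
LOCAL = {"FH": -0.1, "UH": -0.4}
CENTRAL = {"FH": 0.3, "UH": -0.4}
STAGE = {1: 0.1, 2: -0.07, 3: -0.03, 4: -0.17}
STUDY = {3: -0.2, 4: 0.09}
SUBCOHORT = {True: 1.0, False: -0.3}


def phi_113(age, local, central, stage, study, in_subcohort):
    z = (2 + 0.03 * age + LOCAL[local] + CENTRAL[central]
         + STAGE[stage] + STUDY[study] + SUBCOHORT[in_subcohort])
    return -2.5 * math.atan(z)


def composite(age, **profile):
    # Interaction term plus the isolated linear age term.
    return phi_113(age, **profile) + 0.02 * age


def slope(age, **profile):
    # Finite-difference change in log-partial hazard per unit of age.
    return composite(age + 1, **profile) - composite(age, **profile)


fh = dict(local="FH", central="FH", stage=1, study=3, in_subcohort=False)
uh = dict(local="UH", central="UH", stage=1, study=3, in_subcohort=False)
slope_fh = slope(0, **fh)
slope_uh = slope(0, **uh)
```

For these profiles the slope at the youngest ages is positive with favourable histology and negative with unfavourable histology, the same direction as the CoxPH verification in Table 4.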

4.3 Evaluation on Real Genomics Data


To assess whether CoxKAN can handle complex data distributions and aid oncol-
ogy researchers in understanding intricate cancer biology, we evaluate it on high-
dimensional genomics datasets derived from The Cancer Genome Atlas Program
(TCGA). These datasets include Copy Number Variations (CNVs) which reflect the
mean amplification or deletion of genes or chromosomal regions relative to a reference
genome; mRNA expression, which reflects gene expression levels and is derived from
RNA sequencing; and mutation status of various genes represented as binary indica-
tors of mutation presence. These datasets are characterized by high dimensionality
and low sample size (commonly known as the curse of dimensionality), with each type
of feature presenting its unique challenges:
• CNVs: These features tend to show multicollinearity. This can complicate the
interpretation of model coefficients and affect model stability.
• mRNA expression: These features often exhibit heavily skewed distributions,
which can complicate statistical analysis and modeling.
• Mutation status: This type of feature is characterized by sparsity, with most
entries being 0. This can make it challenging to detect meaningful patterns.
These characteristics significantly increase the risk of overfitting, thus providing a
‘stress test’ for CoxKAN.

Datasets
In total, we curated four genomics datasets with diverse cancer types: Breast Invasive
Carcinoma (BRCA), Stomach Adenocarcinoma (STAD), Glioma (GBM/LGG), and
Kidney Renal Clear Cell Carcinoma (KIRC). To ensure a representative test set, we

divided each dataset into training (80%) and test (20%) sets by stratifying according
to the distribution of observed durations and event indicators. All datasets include
sparse mutation features and heavily skewed mRNA expression data. The GBM/LGG
and KIRC datasets, as preprocessed in [50], also exhibit significant multicollinearity
in the CNV features. For STAD and BRCA (preprocessed by us), we solved the mul-
ticollinearity issue in the CNV features, allowing us to evaluate the high-dimensional
datasets with and without multicollinearity. Specifically, the preprocessing pipeline of
STAD and BRCA is as follows: (1) Features were selected based on p-values derived
from univariate CoxPH analysis. (2) Groups of highly correlated CNV features were
consolidated by replacing them with a single feature representing the median value.
(3) Missing values were imputed using the random forest imputation method.
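Step (2) of the pipeline can be sketched as follows. The text does not specify how groups of correlated CNV features were formed or which correlation threshold was used, so the greedy grouping rule, the Pearson threshold of 0.9 and all names below are assumptions made for illustration.

```python
from statistics import mean, median


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def consolidate(features, threshold=0.9):
    """Greedily group features whose correlation with a seed feature is at least
    `threshold`, and replace each group by its per-sample median."""
    names = list(features)
    assigned = set()
    out = {}
    for i, name in enumerate(names):
        if name in assigned:
            continue
        group = [name]
        for other in names[i + 1:]:
            if other not in assigned and pearson(features[name], features[other]) >= threshold:
                group.append(other)
        assigned.update(group)
        if len(group) == 1:
            out[name] = list(features[name])
        else:
            columns = [features[g] for g in group]
            out["group(" + "+".join(group) + ")"] = [median(v) for v in zip(*columns)]
    return out


# Toy CNV matrix (hypothetical): a and b move together, c does not.
cnv = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.1, 2.1, 2.9, 4.2],
    "c": [4.0, 1.0, 3.0, 2.0],
}
merged = consolidate(cnv)
```

Here `a` and `b` are merged into a single median feature while `c` is kept unchanged, which is the same kind of consolidated feature as "group 46" in the BRCA formula.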
As a result, the BRCA dataset contains 811 training patients, 205 testing patients,
and has 168 features in total (73 CNVs, 91 RNAs, 4 Mutations). The STAD dataset
contains 284 training patients, 71 testing patients, and has 148 features (67 CNVs,
61 RNAs, 20 Mutations). The GBM/LGG dataset contains 400 training patients and
100 testing patients. There are 320 features in total, consisting of the mutation status
of the IDH1 gene, 240 RNAs and 79 CNVs (including the binary status of 1p19q
arm codeletion). Finally, the KIRC dataset contains 388 training patients, 97 testing
patients, and consists of 362 features (116 CNVs, 240 RNAs, and 6 Mutations).

Analysis
For STAD, BRCA, and GBM/LGG, the hyperparameter search of CoxKAN determined
that using no hidden layers is optimal. This is likely because a shallow
KAN has less capacity for overfitting, but also suggests that there may not be
significant interactions between the genomic features.
Similarly to the previous section, we compare the performance of CoxKAN to
CoxPH and DeepSurv. However, this data is so prone to overfitting that even CoxPH
can overfit (where usually it is assumed to suffer primarily from bias error alone). For
a fairer comparison (and to solve numerical issues due to multicollinearity), we also
evaluated CoxPH with heavy Lasso (L1) regularization (“CoxPH Lasso”).
The results are shown in Table 5. It is clear that CoxPH without regularization
either encounters numerical problems or is only slightly better than random guess-
ing. Introducing heavy Lasso regularization significantly improves the performance
of CoxPH, even outperforming DeepSurv to a statistically significant degree on the
STAD dataset. CoxKAN Symbolic demonstrates consistent and robust performance;
it is either competitive with or surpasses CoxPH Lasso and DeepSurv on all datasets.
We analyse the interpretable log-partial hazard formulas generated by CoxKAN on
the GBM/LGG and BRCA datasets, where CoxKAN Symbolic outperforms CoxPH
with Lasso regularization. For STAD and KIRC, CoxKAN Symbolic achieves performance
comparable to CoxPH Lasso; the formulas for these two datasets are given in
Appendix C.
Given the high dimensionality of features in these datasets, the log-partial hazard
formulas derived using CoxKAN become quite large. To simplify these formulas, we
estimate the relative importance of each term using its standard deviation σ over the

Table 5 Genomics datasets: C-Index (95% Confidence Interval). Highest C-Index in bold.

Dataset    CoxPH            CoxPH Lasso      DeepSurv         CoxKAN Trained   CoxKAN Pruned    CoxKAN Symbolic

GBM/LGG    N/A (a)          0.787844         0.819094         0.813647         0.811353         0.818234
                            (0.777, 0.799)   (0.820, 0.836)   (0.808, 0.824)   (0.804, 0.820)   (0.817, 0.828)

BRCA       0.539545         0.613182         0.632500         0.619091         0.634545         0.630455
           (0.529, 0.560)   (0.607, 0.635)   (0.623, 0.648)   (0.593, 0.621)   (0.617, 0.638)   (0.622, 0.642)

STAD       0.543441         0.677172         0.628620         0.700170         0.670358         0.671210
           (0.521, 0.543)   (0.673, 0.694)   (0.616, 0.638)   (0.697, 0.715)   (0.665, 0.690)   (0.659, 0.682)

KIRC       N/A (a)          0.686285         0.650378         0.672246         0.671706         0.668467
                            (0.663, 0.686)   (0.641, 0.664)   (0.662, 0.684)   (0.668, 0.688)   (0.669, 0.689)

(a) In the presence of multicollinearity, the design matrix is non-invertible and CoxPH fails without regularization.

full dataset. Terms with higher standard deviations have a greater impact on the
log-partial hazard. For the derived CoxKAN formulas of each dataset, we only present
the terms with the highest standard deviations, σ. One caveat is that certain terms have
extreme values for specific samples, inflating the standard deviation without signif-
icantly affecting corresponding rankings. To address this, we exclude outlier values
when calculating the standard deviation for each term.
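The ranking procedure can be sketched as below. The text does not say how outliers were identified, so the 1.5×IQR rule used here is an assumption, as are the function names and the toy values.

```python
from statistics import pstdev, quantiles


def robust_std(values, k=1.5):
    """Standard deviation after dropping values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    kept = [v for v in values if low <= v <= high]
    return pstdev(kept)


def rank_terms(term_values):
    """Sort formula terms by outlier-trimmed standard deviation, largest first."""
    scored = [(name, robust_std(vals)) for name, vals in term_values.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy per-patient values of two formula terms (hypothetical).
term_values = {
    "A": [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 50.0],  # one extreme sample
    "B": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0],
}
ranking = rank_terms(term_values)
naive_std_a = pstdev(term_values["A"])
robust_std_a = robust_std(term_values["A"])
```

Without trimming, term A would dominate the ranking because of its single extreme sample. After trimming, term B ranks first, which reflects its larger effect on typical patients.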

4.3.1 Glioblastoma Multiforme and Lower Grade Glioma (GBM/LGG)

CoxKAN predicted the log-partial hazard as:

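\[
\begin{aligned}
\hat{\theta}_{\mathrm{KAN}} ={}& -0.2 \cdot (\text{1p19q arm codeletion}) && (\sigma = 0.19) \\
&+ e^{-0.2(-0.6 \cdot \text{10q}_{\text{CNV}} - 1)^{2}} && (\sigma = 0.19) \\
&- 0.2 \cdot \text{IDH1}_{\text{mut}} && (\sigma = 0.17) \\
&- 0.06 \tan(0.4 \cdot \text{CARD11}_{\text{CNV}} + 8) && (\sigma = 0.16) \\
&- 0.08\,(0.6 \cdot \text{PTEN}_{\text{CNV}} + 1)^{4} && (\sigma = 0.14) \\
&- 0.3 \sin(3 \cdot \text{JAK2}_{\text{CNV}} - 5) && (\sigma = 0.12) \\
&- 0.1 \cdot \text{CDKN2A}_{\text{CNV}} && (\sigma = 0.12) \\
&- 0.1 \sin(9 \cdot \text{CDKN2B}_{\text{CNV}} - 4) && (\sigma = 0.10) \\
&- 0.3 \sin(9 \cdot \text{EGFR}_{\text{CNV}} + 0.8) && (\sigma = 0.10) \\
&+ \text{less significant terms},
\end{aligned}
\]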
where we plot the non-linear terms in Fig. 10. From this formula, we observe that
1p/19q arm co-deletion and IDH1 mutation both have a negative contribution to
the hazard. The complete loss of both the short arm of chromosome 1 (1p) and the
long arm of chromosome 19 (19q), known as 1p19q arm codeletion, is a key molecu-
lar genetic marker of oligodendrogliomas, which are primary brain tumors accounting
for 10-15% of diffuse gliomas in adults [51]. This co-deletion is strongly linked to bet-
ter survival rates in diffuse glioma patients [52, 53]. Nearly all oligodendrogliomas
with a 1p19q arm codeletion also have a mutation in isocitrate dehydrogenase 1
(IDH1 ) at arginine 132, which has been demonstrated as an early driving factor in

Fig. 10 Visualization of the most significant non-linear terms in the CoxKAN-predicted hazard
for the GBM/LGG dataset. Data points represent test-set patients. For points that correspond to
multiple patients, the number of patients is indicated. Note that the x-axis shows the true measured
value of each feature, whereas the quoted equations are in terms of standardised features.

the development of oligodendrogliomas [53]. A comprehensive multi-omics and clinical
retrospective study by TCGA found that patients with grades II/III gliomas who had
both an IDH1/2 mutation and the 1p19q arm codeletion had a median overall survival
of 8.0 years, compared to 6.3 years for those with an IDH1/2 mutation without the
codeletion and 1.7 years for those with wild-type IDH1/2 [53]. These biological and
clinical results are therefore consistent with the terms in the CoxKAN-extracted hazard
equation.
Deletions of CDKN2A/B show a positive contribution to the hazard. CDKN2A/B
are tumor suppressor genes located at 9p21. In gliomas, a homozygous deletion of
CDKN2A/B is linked to lower global DNA methylation levels and is associated with
more aggressive tumor behaviour [54]. The loss of CDKN2A is linked to poor outcomes
in both pediatric and adult low-grade and malignant gliomas [55]. Also, a recent
study found that CDKN2A/B deletion is common in IDH1/2-mutant glioblastomas
and associated with shorter survival in these tumors [56].
10q CNV exhibits a non-linear impact on the hazard. Chromosome 10 loss and
chromosome 7 gain are common molecular alterations in adult IDH -WT glioblastomas
[57]. These changes often lead to PTEN loss on chromosome 10 or EGFR amplification
on chromosome 7, which both show strong associations with survival in the generated
hazard equation. Loss of chromosome 10q and loss of PTEN have both been
associated with unfavourable prognosis [58, 59]. EGFR, found at 7p12, is crucial in

cell functions like division and apoptosis and its amplification is a strong indicator of
poor outcomes [60]. As shown in Fig. 10, the term-based contributions in the generated
hazard equation align with these findings, except for a few cases.
As for the remaining terms (CARD11 and JAK2 ), CARD11 CNVs have been
implicated in tumor progression in some cancer types, such as colorectal cancer [61],
and the JAK2 gene is crucial for hematopoietic and immune signaling. Frequent loss
of CDKN2A in tumors, including melanoma, often coincides with JAK2 deletion,
leading to IFNγ resistance, which is associated with resistance to immunotherapy [62].
The term-based contributions of CARD11 and JAK2 derived from CoxKAN show a
similar pattern to these studies. However, there are currently no studies indicating a
role for CARD11 and JAK2 in glioma progression. Our findings suggest that further
research is needed to understand their biological function in this context.

4.3.2 Breast Invasive Carcinoma (BRCA)


CoxKAN predicts the log-partial hazard as:

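\[
\begin{aligned}
\hat{\theta}_{\mathrm{KAN}} ={}& + 0.2 \cdot \text{KMT2C}_{\text{mut}} && (\sigma = 0.24) \\
&+ 0.6 \sin(0.5 \cdot \text{HSPA8}_{\text{RNA}} - 7) && (\sigma = 0.18) \\
&- 2\, e^{-0.04(0.9 \cdot \text{PLXNB2}_{\text{RNA}} + 1)^{2}} && (\sigma = 0.17) \\
&- 2\, e^{-0.05(0.9 \cdot \text{PGK1}_{\text{RNA}} + 1)^{2}} && (\sigma = 0.15) \\
&- 0.14 \cdot \text{RYR2}_{\text{mut}} && (\sigma = 0.14) \\
&+ 0.1 \cdot \text{DMD}_{\text{mut}} && (\sigma = 0.10) \\
&+ 0.01 \cdot \text{TTN}_{\text{mut}} && (\sigma = 0.07) \\
&+ \frac{0.4}{(1 - 0.1 \cdot \text{group 46}_{\text{CNV}})^{2}} && (\sigma = 0.06) \\
&+ 0.9\, e^{-0.06(\text{H2BC5}_{\text{RNA}} - 0.5)^{2}} && (\sigma = 0.05) \\
&- 0.3 \sin(0.5 \cdot \text{RPL14}_{\text{RNA}} + 5) && (\sigma = 0.05) \\
&+ \text{less significant terms},
\end{aligned}
\]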
where group 46 is the median CNV of five highly correlated genes (MRPS21P8,
MRPS21P7, ZNF423, AC027348.2, AC027348.1 ).
Firstly, this equation indicates that mutations in the KMT2C, DMD, and TTN
genes are associated with an increased hazard for breast cancer patients. KMT2C
is a histone lysine methyltransferase that catalyzes the monomethylation of histone
3 lysine 4 (H3K4) at gene enhancer regions, and it has been indicated that KMT2C
plays a tumor-suppressive role in breast cancer development [63]. While germline DMD
mutations have traditionally been linked to Muscular Dystrophies, their involvement
in cancer is emerging and has been associated with poorer survival in breast invasive
carcinoma [64]. TTN, which encodes Titin, is significantly downregulated in early-
stage triple-negative breast cancer, but its role in cancer progression is still uncertain
[65]. By contrast, RYR2 mutations are associated with a decreased hazard in this
equation. RYR2 mutations are linked to higher tumor mutational burden (TMB),
better clinical outcomes, and enhanced antitumor immunity [66]. Additionally, these

Fig. 11 Visualization of the most significant non-linear terms in the CoxKAN-predicted hazard for
the BRCA dataset. Data points represent test-set patients. For points that correspond to multiple
patients, the number of patients is indicated. Note that the x-axis shows the true measured value
of each feature, whereas the quoted equations are in terms of standardised features.

mutations upregulate immune response signaling pathways and suggest a potential
benefit from immunotherapy [67].
Similar to the equation of GBM/LGG, several terms show non-linear associa-
tions with the hazard, as shown in Fig. 11. Among them, PLXNB2, PGK1, H2BC5
mRNA expression, and group 46 CNV, exhibit monotonic relationships with the haz-
ard. PLXNB2, a member of the plexin family, influences cell migration and invasion,
potentially affecting tumor growth and metastasis. PGK1, an enzyme in the glycolytic
pathway, is often overexpressed in tumors, supporting increased glycolysis and can-
cer cell proliferation. Higher expression of H2BC5 is associated with decreased hazard
by promoting chromatin stability and proper gene regulation, which can lead to less
aggressive cancer behaviour.
Furthermore, HSPA8 and RPL14 expression exhibit mostly monotonic behaviour
across their ranges. HSPA8, a chaperone protein, supports cancer cell survival under
stress and may be a target for therapy, suggesting it may enhance tumor survival
and resistance to apoptosis. Conversely, RPL14 encodes a ribosomal protein
involved in protein synthesis, and its reduced expression or dysregulation can
contribute to cancer progression [68]. Notably, both HSPA8 and RPL14 expression
show reverse effects in certain small ranges, indicating that both over-expression and

under-expression can influence patient risk. These CoxKAN results highlight the com-
plexity of these roles in breast cancer and emphasize the need for further research to
understand their biological implications better.

5 Discussion and Conclusion


This paper presented the novel CoxKAN framework, which is the first application of
Kolmogorov-Arnold Networks to interpretable survival regression. We demonstrated
that CoxKAN achieves (i) sophisticated interpretability by obtaining symbolic for-
mulas for the hazard function and visualizing KAN activation functions and (ii) high
performance due to the ability to flexibly capture any function (low bias error). We
were also able to mitigate CoxKAN overfitting, which can be attributed to the explicit
regularization in the loss function, early stopping, and the inductive bias of the prun-
ing and symbolic fitting pipeline that encourages simpler functions, which generalize
better than the original network.

Key Findings
In the first series of experiments, we generated synthetic datasets using custom sym-
bolic formulas for the hazard function and found that in 3/4 examples CoxKAN was
able to recover the correct symbolic form. In the last example (which was made to be
intentionally difficult to recover), CoxKAN found a formula that was shown to be a
highly accurate approximation to the ground truth; we claim that CoxKAN still pos-
sesses the properties of interpretability and high performance in this case. Additionally,
CoxKAN automatically pruned the irrelevant, noisy features added to all synthetic
datasets, demonstrating successful feature selection. We then evaluated CoxKAN on
5 clinical datasets and 4 high-dimensional genomics datasets. On the clinical data,
CoxKAN Symbolic achieved a statistically significant improvement in performance
over CoxPH in 4/5 cases and over DeepSurv in 3/5 cases. On the genomics data,
CoxKAN Symbolic achieved a statistically significant performance improvement over
DeepSurv in 2/4 cases and outperformed CoxPH with heavy Lasso regularization
twice (though only once was this statistically significant). On datasets where CoxKAN
did not outperform CoxPH or DeepSurv, the performance difference was generally not
statistically significant as characterised by overlapping confidence intervals. CoxKAN
also uncovered useful insights from the survival data. For example, on the SUPPORT
dataset, CoxKAN identified that the risk of patients with non-metastatic cancer decreases
with age until about 60 years old, then starts to increase, whereas for patients with
metastatic cancer or no cancer at all, risk only increases with age. This kind
of variable interaction would be extremely difficult to identify using existing survival
models. On the genomics datasets, CoxKAN uncovered a number of important biolog-
ical associations between cancer risk and genomic features such as specific CNVs and
mRNA transcripts, offering valuable insights that can guide further biological studies
and the development of targeted therapeutic strategies.
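The significance counts quoted above for the clinical datasets follow from the criterion used throughout the paper, namely non-overlapping 95% confidence intervals. The sketch below recomputes them from the intervals in Table 2; the data structure and function names are illustrative choices.

```python
# 95% confidence intervals transcribed from Table 2.
CI = {
    "SUPPORT":  {"CoxPH": (0.581, 0.585), "DeepSurv": (0.616, 0.620), "CoxKAN Symbolic": (0.623, 0.626)},
    "GBSG":     {"CoxPH": (0.655, 0.662), "DeepSurv": (0.665, 0.671), "CoxKAN Symbolic": (0.678, 0.684)},
    "METABRIC": {"CoxPH": (0.628, 0.637), "DeepSurv": (0.639, 0.647), "CoxKAN Symbolic": (0.644, 0.651)},
    "FLCHAIN":  {"CoxPH": (0.797, 0.802), "DeepSurv": (0.793, 0.798), "CoxKAN Symbolic": (0.795, 0.800)},
    "NWTCO":    {"CoxPH": (0.693, 0.703), "DeepSurv": (0.692, 0.703), "CoxKAN Symbolic": (0.715, 0.725)},
}


def significantly_better(candidate, baseline):
    """True when the candidate's interval lies entirely above the baseline's."""
    return candidate[0] > baseline[1]


def count_significant(baseline_name):
    return sum(
        significantly_better(row["CoxKAN Symbolic"], row[baseline_name])
        for row in CI.values()
    )


wins_over_coxph = count_significant("CoxPH")
wins_over_deepsurv = count_significant("DeepSurv")
```

This gives 4 of 5 datasets against CoxPH and 3 of 5 against DeepSurv, matching the counts stated in the text. Interval overlap is a conservative proxy for a formal significance test.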

Potential Applications of CoxKAN
Given that CoxKAN is essentially the first survival model with sophisticated
interpretability and low bias, we believe it has far-ranging applications, both within the
medical field and in other disciplines. In medical research, CoxKAN could be used
to discover complex biomarkers involving multi-variable interactions and assess
treatment efficacy by providing insights into how treatment conditions impact
survival and interact with patient features. In a clinical setting, CoxKAN could be used for
personalized medicine by using its predictions/insights to inform treatment plans.
Outside of the medical field, CoxKAN could be used to understand and address under-
lying factors that impact the time to mechanical failure in engineering (helping to
inform construction of equipment), customer churn in business (guiding the devel-
opment of retention strategies), loan default in finance (improving risk assessment
models) and insurance claims (allowing actuaries to justify premiums).

Weaknesses and Future Work


CoxKAN does not work straight out of the box and has several weaknesses that we
believe are solvable. Firstly, CoxKAN inherits the bias of certain CoxPH assumptions,
such as “the baseline hazard is the same for all patients” and “the relationship
between covariates and risk does not change over time”. An exciting future direction
would be to construct a KAN-based framework that bypasses these assumptions
while retaining precise interpretability. Secondly, CoxKAN is vulnerable to overfitting,
and thus for the high-dimensional genomics datasets the hyperparameter search
typically yielded low-capacity KANs with no hidden layers. This meant that interactions
between genomic features were not learned, even though it is well known that
genomic features do interact. Additional effort to mitigate overfitting
while retaining the ability to capture interactions is a promising future direction.
Furthermore, the performance of DeepSurv and CoxKAN was fairly unstable with respect
to initialization on the high-dimensional genomics datasets, hence CoxPH with Lasso
regularization could be considered a more reliable choice in this case. CoxKAN is
also sensitive to hyperparameters and can be unstable to train. These flaws could be
addressed by experimenting with more techniques related to hyperparameter tuning,
regularization, and optimization.
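
To make the two quoted CoxPH assumptions concrete, the proportional hazards form can be written out as below. This is the standard textbook formulation; the symbol $h_0(t)$ for the baseline hazard is notation assumed here and may differ from the notation used elsewhere in the paper.

$$
h(t \mid x) = h_0(t)\, e^{\theta(x)}, \qquad \frac{h(t \mid x_a)}{h(t \mid x_b)} = e^{\theta(x_a) - \theta(x_b)}
$$

The baseline hazard $h_0(t)$ is shared by all patients, and the hazard ratio between any two patients $x_a$ and $x_b$ does not depend on $t$. These are exactly the two restrictions that a KAN-based framework would need to relax.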

6 Data availability
The clinical datasets METABRIC, SUPPORT and GBSG are available at
https://github.com/jaredleekatzman/DeepSurv/tree/master/experiments/data
and NWTCO, FLCHAIN are available at https://vincentarelbundock.github.io/Rdatasets/.
TCGA genomic data (BRCA, STAD, GBM/LGG, and KIRC) are
available at https://portal.gdc.cancer.gov.

7 Code availability
The code for training and evaluating CoxKAN is available at
https://github.com/knottwill/CoxKAN, and can be installed using the following
command: “pip install coxkan”.

8 Acknowledgements
We acknowledge funding and support from Cancer Research UK and the Cancer
Research UK Cambridge Centre [CTRQQR-2021-100012], The Mark Foundation for
Cancer Research [RG95043], GE HealthCare, and the CRUK National Cancer Imaging
Translational Accelerator (NCITA) [A27066]. Additional support was also provided
by the National Institute of Health Research (NIHR) Cambridge Biomedical Research
Centre [NIHR203312] and EPSRC Tier-2 capital grant [EP/P020259/1]. The funders
had no role in study design, data collection and analysis, decision to publish, or prepa-
ration of the manuscript. Calculations were performed in part using the Sulis Tier 2
HPC platform hosted by the Scientific Computing Research Technology Platform at
the University of Warwick. Sulis is funded by EPSRC Grant EP/T022108/1 and the
HPC Midlands+ consortium.

References
[1] Koene, R.J., Prizment, A.E., Blaes, A., Konety, S.H.: Shared risk factors in
cardiovascular disease and cancer. Circulation 133, 1104–1114 (2016) https:
//doi.org/10.1161/CIRCULATIONAHA.115.020406

[2] Saegusa, T., Zhao, Z., Ke, H., et al.: Detecting survival-associated biomarkers
from heterogeneous populations. Scientific Reports 11(1), 3203 (2021) https://
doi.org/10.1038/s41598-021-82332-y

[3] Ou, F.S., Michiels, S., Shyr, Y., Adjei, A.A., Oberg, A.L.: Biomarker discovery
and validation: Statistical considerations. Journal of Thoracic Oncology 16(4),
537–545 (2021) https://doi.org/10.1016/j.jtho.2021.01.1616

[4] Mok, T.S., Wu, Y.-L., Thongprasert, S., Yang, C.-H., Chu, D.-T., Saijo,
N., Sunpaweravong, P., Han, B., Margono, B., Ichinose, Y., Nishi-
waki, Y., Ohe, Y., Yang, J.-J., Chewaskulyong, B., Jiang, H., Duffield,
E.L., Watkins, C.L., Armour, A.A., Fukuoka, M.: Gefitinib or carbo-
platin–paclitaxel in pulmonary adenocarcinoma. New England Journal of
Medicine 361(10), 947–957 (2009) https://doi.org/10.1056/NEJMoa0810699
https://www.nejm.org/doi/pdf/10.1056/NEJMoa0810699

[5] Le-Rademacher, J., Wang, X.: Time-to-event data: An overview and analysis
considerations. Journal of Thoracic Oncology 16(7), 1067–1074 (2021) https://
doi.org/10.1016/j.jtho.2021.04.004

29
[6] Monnickendam, G., Zhu, M., McKendrick, J., Su, Y.: Measuring survival benefit
in health technology assessment in the presence of nonproportional hazards. Value
in Health 22(4), 431–438 (2019) https://doi.org/10.1016/j.jval.2019.01.005

[7] Hurwitz, H., Fehrenbacher, L., Novotny, W., Cartwright, T., Hainsworth, J.,
Heim, W., Berlin, J., Baron, A., Griffing, S., Holmgren, E., Ferrara, N., Fyfe,
G., Rogers, B., Ross, R., Kabbinavar, F.: Bevacizumab plus irinotecan, fluo-
rouracil, and leucovorin for metastatic colorectal cancer. New England Journal
of Medicine 350(23), 2335–2342 (2004) https://doi.org/10.1056/NEJMoa032691
https://www.nejm.org/doi/pdf/10.1056/NEJMoa032691

[8] Katzman, J.L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., Kluger, Y.:
Deepsurv: personalized treatment recommender system using a cox proportional
hazards deep neural network. BMC Medical Research Methodology 18(1) (2018)
https://doi.org/10.1186/s12874-018-0482-1

[9] Cox, D.R.: Regression models and life-tables. Journal of the Royal Statistical
Society: Series B (Methodological) 34(2), 187–202 (1972)

[10] Ishwaran, H., Kogalur, U.B.: Random survival forests for r. R News 7(2), 25–31
(2007)

[11] Ishwaran, H., Kogalur, U.B., Blackstone, E.H., Lauer, M.S.: Random survival
forests. Ann. Appl. Statist. 2(3), 841–860 (2008)

[12] Fernandez, T., Rivera, N., Teh, Y.W.: Gaussian processes for survival analysis.
In: Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R. (eds.) Advances
in Neural Information Processing Systems, vol. 29, pp. 5021–5029 (2016)

[13] Alaa, A.M., Schaar, M.: Deep multi-task gaussian processes for survival analy-
sis with competing risks. In: Proceedings of the 31st International Conference
on Neural Information Processing Systems. NIPS’17, pp. 2326–2334. Curran
Associates Inc., Red Hook, NY, USA (2017)

[14] Yu, C.-N., Greiner, R., Lin, H.-C., Baracos, V.: Learning patient-specific cancer
survival distributions as a sequence of dependent regressors. In: Shawe-Taylor, J.,
Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural
Information Processing Systems, vol. 24 (2011)

[15] Lee, C., Zame, W., Yoon, J., Schaar, M.: Deephit: A deep learning approach to
survival analysis with competing risks. Proceedings of the AAAI Conference on
Artificial Intelligence 32(1) (2018) https://doi.org/10.1609/aaai.v32i1.11842

[16] Ren, K., Qin, J., Zheng, L., Yang, Z., Zhang, W., Qiu, L., Yu, Y.: Deep Recurrent
Survival Analysis (2018). https://arxiv.org/abs/1809.02403

[17] Ching, T., Zhu, X., Garmire, L.X.: Cox-nnet: An artificial neural network method

30
for prognosis prediction of high-throughput omics data. PLOS Computational
Biology 14(4), 1006076 (2018) https://doi.org/10.1371/journal.pcbi.1006076

[18] Kvamme, H., Borgan, O.: Continuous and discrete-time survival prediction with
neural networks. Lifetime Data Analysis 27(4), 710–736 (2021) https://doi.org/
10.1007/s10985-021-09532-6

[19] Kvamme, H., Borgan, Scheel, I.: Time-to-event prediction with neural networks
and cox regression. Journal of Machine Learning Research 20(129), 1–30 (2019)

[20] Nagpal, C., Li, X., Dubrawski, A.: Deep survival machines : Fully parametric
survival regression and representation learning for censored data with competing
risks. IEEE Journal of Biomedical and Health Informatics PP, 1–1 (2021) https:
//doi.org/10.1109/JBHI.2021.3052441

[21] Nagpal, C., Yadlowsky, S., Rostamzadeh, N., Heller, K.: Deep cox mixtures for
survival regression. Machine Learning for Healthcare Conference (2021). PMLR

[22] Lu, S.C., Swisher, C.L., Chung, C., Jaffray, D., Sidey-Gibbons, C.: On the impor-
tance of interpretable machine learning predictions to inform clinical decision
making in oncology. Frontiers in Oncology 13, 1129380 (2023) https://doi.org/
10.3389/fonc.2023.1129380

[23] Langbein, S.H., Krzyziński, M., Spytek, M., Baniecki, H., Biecek, P., Wright,
M.N.: Interpretable Machine Learning for Survival Analysis (2024). https://arxiv.
org/abs/2403.10250

[24] Wiegrebe, S., Kopper, P., Sonabend, R., Bischl, B., Bender, A.: Deep learning
for survival analysis: a review. Artificial Intelligence Review 57(3) (2024) https:
//doi.org/10.1007/s10462-023-10681-3

[25] Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T.Y.,
Tegmark, M.: KAN: Kolmogorov-Arnold Networks (2024)

[26] Vaca-Rubio, C.J., Blanco, L., Pereira, R., Caus, M.: Kolmogorov-Arnold Net-
works (KANs) for Time Series Analysis (2024). https://arxiv.org/abs/2405.
08790

[27] Genet, R., Inzirillo, H.: A Temporal Kolmogorov-Arnold Transformer for Time
Series Forecasting (2024). https://arxiv.org/abs/2406.02486

[28] Li, C., Liu, X., Li, W., Wang, C., Liu, H., Yuan, Y.: U-KAN Makes Strong
Backbone for Medical Image Segmentation and Generation (2024). https://arxiv.
org/abs/2406.02918

[29] Cheon, M.: Kolmogorov-Arnold Network for Satellite Image Classification in


Remote Sensing (2024). https://arxiv.org/abs/2406.00600

31
[30] Cranmer, M.: Interpretable Machine Learning for Science with PySR and
SymbolicRegression.jl (2023). https://arxiv.org/abs/2305.01582

[31] Faraggi, D., Simon, R.: A neural network model for survival data. Statistics in
medicine 14(1), 73–82 (1995)

[32] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International
Conference on Learning Representations (ICLR), San Diega, CA, USA (2015)

[33] Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. The
Journal of Machine Learning Research 13(1), 281–305 (2012)

[34] Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: A next-
generation hyperparameter optimization framework. In: Proceedings of the 25th
ACM SIGKDD International Conference on Knowledge Discovery & Data Min-
ing. KDD ’19, pp. 2623–2631. Association for Computing Machinery, New York,
NY, USA (2019). https://doi.org/10.1145/3292500.3330701

[35] Watanabe, S.: Tree-Structured Parzen Estimator: Understanding Its Algorithm


Components and Their Roles for Better Empirical Performance (2023). https:
//arxiv.org/abs/2304.11127

[36] Altman, D.G., Royston, P.: What do we mean by validating a prognostic model?
Statistics in medicine 19(4), 453–473 (2000)

[37] Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap, (1994)

[38] Knaus, W.A., Harrell, F.E., Lynn, J., Goldman, L., Phillips, R.S., Connors, A.F.,
Dawson, N.V., Fulkerson, W.J., Califf, R.M., Desbiens, N., et al.: The support
prognostic model: objective estimates of survival for seriously ill hospitalized
adults. Annals of internal medicine 122(3), 191–203 (1995)

[39] Foekens, J.A., Peters, H.A., Look, M.P., Portengen, H., Schmitt, M., Kramer,
M.D., Brünner, N., Jänicke, F., Meijer-van Gelder, M.E., Henzen-Logmans, S.C.,
et al.: The urokinase system of plasminogen activation and prognosis in 2780
breast cancer patients. Cancer research 60(3), 636–643 (2000)

[40] Schumacher, M., Bastert, G., Bojar, H., Huebner, K., Olschewski, M., Sauerbrei,
W., Schmoor, C., Beyerle, C., Neumann, R., Rauschecker, H.: Randomized 2 x 2
trial evaluating hormonal treatment and the duration of chemotherapy in node-
positive breast cancer patients. german breast cancer study group. Journal of
Clinical Oncology 12(10), 2086–2093 (1994)

[41] Royston, P., Altman, D.G.: External validation of a cox prognostic model: prin-
ciples and methods. BMC Medical Research Methodology 13(1), 33 (2013)
https://doi.org/10.1186/1471-2288-13-33

32
[42] Curtis, C., Shah, S.P., Chin, S.-F., Turashvili, G., Rueda, O.M., Dunning, M.J.,
Speed, D., Lynch, A.G., Samarajiwa, S., Yuan, Y., et al.: The genomic and tran-
scriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature
486(7403), 346–352 (2012)

[43] Kurozumi, S., Matsumoto, H., Hayashi, Y., et al.: Power of pgr expression as
a prognostic factor for er-positive/her2-negative breast cancer patients at inter-
mediate risk classified by the ki67 labeling index. BMC Cancer 17, 354 (2017)
https://doi.org/10.1186/s12885-017-3331-4

[44] Cheang, M.C.U., Chia, S.K., Voduc, D., Gao, D., Leung, S., Snider, J.,
Watson, M., Davies, S., Bernard, P.S., Parker, J.S., Perou, C.M., Ellis,
M.J., Nielsen, T.O.: Ki67 Index, HER2 Status, and Prognosis of Patients
With Luminal B Breast Cancer. JNCI: Journal of the National Can-
cer Institute 101(10), 736–750 (2009) https://doi.org/10.1093/jnci/djp082
https://academic.oup.com/jnci/article-pdf/101/10/736/18074850/djp082.pdf

[45] Arel-Bundock, V.: Rdatasets: A Collection of Datasets Originally Distributed in


Various R Packages. (2024). R package version 1.0.0. https://vincentarelbundock.
github.io/Rdatasets

[46] Kyle, R.A., Therneau, T.M., Rajkumar, S.V., Larson, D.R., Plevak, M.F.,
Offord, J.R., Dispenzieri, A., Katzmann, J.A., Melton, L.J.: Prevalence of mon-
oclonal gammopathy of undetermined significance. New England Journal of
Medicine 354(13), 1362–1369 (2006) https://doi.org/10.1056/NEJMoa054494
https://www.nejm.org/doi/pdf/10.1056/NEJMoa054494

[47] Dispenzieri, A., Katzmann, J.A., Kyle, R.A., Larson, D.R., Therneau, T.M.,
Colby, C.L., Clark, R.J., Mead, G.P., Kumar, S., Melton, L.J. 3rd, Rajkumar,
S.V.: Use of nonclonal serum immunoglobulin free light chains to predict overall
survival in the general population. Mayo Clin Proc 87(6), 517–523 (2012)

[48] Green, D.M., Thomas, P.R.M., Shochat, S.: The treatment of wilms tumor:
Results of the national wilms tumor studies. Hematology/Oncology Clinics of
North America 9(6), 1267–1274 (1995) https://doi.org/10.1016/S0889-8588(18)
30044-3 . Wilms Tumor

[49] Breslow, N.E., Chatterjee, N.: Design and analysis of two-phase studies with
binary outcome applied to wilms tumour prognosis. Journal of the Royal Sta-
tistical Society. Series C (Applied Statistics) 48(4), 457–468 (1999). Accessed
2024-06-03

[50] Chen, R.J., Lu, M.Y., Wang, J., Williamson, D.F.K., Rodig, S.J., Lindeman, N.I.,
Mahmood, F.: Pathomic fusion: An integrated framework for fusing histopathol-
ogy and genomic features for cancer diagnosis and prognosis. IEEE Transactions
on Medical Imaging 41(4), 757–770 (2022) https://doi.org/10.1109/TMI.2020.
3021387

33
[51] Ostrom, Q.T., Bauchet, L., Davis, F.G., Deltour, I., Fisher, J.L., Langer, C.E.,
Pekmezci, M., Schwartzbaum, J.A., Turner, M.C., Walsh, K.M., et al.: The epi-
demiology of glioma in adults: a “state of the science” review. Neuro-oncology
16(7), 896–913 (2014)

[52] Jenkins, R.B., Blair, H., Ballman, K.V., Giannini, C., Arusell, R.M., Law, M.,
Flynn, H., Passe, S., Felten, S., Brown, P.D., et al.: A t (1; 19)(q10; p10) mediates
the combined deletions of 1p and 19q and predicts a better prognosis of patients
with oligodendroglioma. Cancer research 66(20), 9852–9861 (2006)

[53] Network, C.G.A.R.: Comprehensive, integrative genomic analysis of diffuse lower-


grade gliomas. New England Journal of Medicine 372(26), 2481–2498 (2015)

[54] Komori, T.: Grading of adult diffuse gliomas according to the 2021 who classifi-
cation of tumors of the central nervous system. Laboratory Investigation 102(2),
126–133 (2022)

[55] Reis, G.F., Pekmezci, M., Hansen, H.M., Rice, T., Marshall, R.E., Molinaro,
A.M., Phillips, J.J., Vogel, H., Wiencke, J.K., Wrensch, M.R., et al.: Cdkn2a loss
is associated with shortened overall survival in lower-grade (world health orga-
nization grades ii–iii) astrocytomas. Journal of Neuropathology & Experimental
Neurology 74(5), 442–452 (2015)

[56] Li, K.K.-W., Shi, Z.-f., Malta, T.M., Chan, A.K.-Y., Cheng, S., Kwan, J.S.H.,
Yang, R.R., Poon, W.S., Mao, Y., Noushmehr, H., et al.: Identification of subsets
of idh-mutant glioblastomas with distinct epigenetic and copy number alterations
and stratified clinical risks. Neuro-Oncology Advances 1(1), 015 (2019)

[57] Stichel, D., Ebrahimi, A., Reuss, D., Schrimpf, D., Ono, T., Shirahata, M.,
Reifenberger, G., Weller, M., Hänggi, D., Wick, W., et al.: Distribution of egfr
amplification, combined chromosome 7 gain and chromosome 10 loss, and tert
promoter mutation in brain tumors and their potential for the reclassification of
idh wt astrocytoma to glioblastoma. Acta neuropathologica 136, 793–803 (2018)

[58] Wemmert, S., Ketter, R., Rahnenfuhrer, J., Beerenwinkel, N., Strowitzki, M.,
Feiden, W., Hartmann, C., Lengauer, T., Stockhammer, F., Zang, K.D., et al.:
Patients with high-grade gliomas harboring deletions of chromosomes 9p and 10q
benefit from temozolomide treatment. Neoplasia 7(10), 883–893 (2005)

[59] Ni, X., Wu, W., Sun, X., Ma, J., Yu, Z., He, X., Cheng, J., Xu, P., Liu, H., Shang,
T., et al.: Interrogating glioma-m2 macrophage interactions identifies gal-9/tim-3
as a viable target against pten-null glioblastoma. Science Advances 8(27), 5165
(2022)

[60] Zhao, H.-f., Zhou, X.-m., Wang, J., Chen, F.-f., Wu, C.-p., Diao, P.-y., Cai, L.-
r., Chen, L., Xu, Y.-w., Liu, J., et al.: Identification of prognostic values defined
by copy number variation, mrna and protein expression of lancl2 and egfr in

34
glioblastoma patients. Journal of Translational Medicine 19, 1–15 (2021)

[61] Mamlouk, S., Childs, L.H., Aust, D., Heim, D., Melching, F., Oliveira, C., Wolf,
T., Durek, P., Schumacher, D., Bläker, H., et al.: Dna copy number changes define
spatial patterns of heterogeneity in colorectal cancer. Nature communications
8(1), 14093 (2017)

[62] Horn, S., Leonardelli, S., Sucker, A., Schadendorf, D., Griewank, K.G., Paschen,
A.: Tumor cdkn2a-associated jak2 loss and susceptibility to immunotherapy
resistance. JNCI: Journal of the National Cancer Institute 110(6), 677–681
(2018)

[63] Tinsley, E., Bredin, P., Toomey, S., Hennessy, B.T., Furney, S.J.: Kmt2c and
kmt2d aberrations in breast cancer. Trends in Cancer (2024)

[64] Luce, L.N., Abbate, M., Cotignola, J., Giliberto, F.: Non-myogenic tumors display
altered expression of dystrophin (dmd) and a high frequency of genetic alterations.
Oncotarget 8(1), 145 (2017)

[65] Agarwal, S., Parija, M., Naik, S., Kumari, P., Mishra, S.K., Adhya, A.K., Kashaw,
S.K., Dixit, A.: Dysregulated gene subnetworks in breast invasive carcinoma
reveal novel tumor suppressor genes. Scientific Reports 14(1), 15691 (2024)

[66] Xu, Z., Xiang, L., Wang, R., Xiong, Y., Zhou, H., Gu, H., Wang, J., Peng, L.:
Bioinformatic analysis of immune significance of ryr2 mutation in breast cancer.
BioMed Research International 2021(1), 8072796 (2021)

[67] Liu, Z., Liu, L., Jiao, D., Guo, C., Wang, L., Li, Z., Sun, Z., Zhao, Y., Han,
X.: Association of ryr2 mutation with tumor mutation burden, prognosis, and
antitumor immunity in patients with esophageal adenocarcinoma. Frontiers in
genetics 12, 669694 (2021)

[68] Clark, K., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Sayers, E.W.: Genbank.
Nucleic acids research 44(D1), 67–72 (2016)

35
Appendix
A Hyperparameters
Table 7 shows the CoxKAN hyperparameters found for each experiment. The mean-
ing of each hyperparameter is described in Section 2.2, except for the initialization
hyperparameters:
• Scale weights (equation 11) are initialized as $w_s = 1$ and $w_b = \frac{1}{n_{in}} + \mathrm{Uniform}([-\xi_b, \xi_b])$, where $\xi_b$ is the "base noise".
• Spline coefficients (equation 12) are initialized as $c_i \sim \mathcal{N}\left(0, (\xi_s / G)^2\right)$, where $\xi_s$ is the "spline noise".
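
The initialization above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the CoxKAN implementation: the function name `init_kan_layer`, the flat per-edge layout, and the number of spline coefficients per edge (taken here as grid intervals plus a spline order of 3) are assumptions, and the base-weight offset is read as $1/n_{in}$ from the formula above. The example values are the SUPPORT settings from Table 7.

```python
import numpy as np

def init_kan_layer(n_in, n_out, grid_intervals, base_noise, spline_noise,
                   spline_order=3, seed=0):
    """Sample initial scale weights and spline coefficients for one KAN layer."""
    rng = np.random.default_rng(seed)
    n_edges = n_in * n_out  # one learnable activation per (input, output) pair

    # Spline scale weights start at exactly 1.
    w_s = np.ones(n_edges)
    # Base scale weights: 1/n_in plus uniform noise in [-base_noise, base_noise].
    w_b = 1.0 / n_in + rng.uniform(-base_noise, base_noise, size=n_edges)

    # Assumption: each edge has (grid_intervals + spline_order) B-spline coefficients.
    n_coef = grid_intervals + spline_order
    # Coefficients ~ N(0, (spline_noise / G)^2), with G the number of grid intervals.
    c = rng.normal(0.0, spline_noise / grid_intervals, size=(n_edges, n_coef))
    return w_s, w_b, c

# First layer of the SUPPORT model in Table 7: shape [14, 3, 1], G = 3.
w_s, w_b, c = init_kan_layer(n_in=14, n_out=3, grid_intervals=3,
                             base_noise=0.05, spline_noise=0.11)
```

Smaller noise values keep the initial activations close to the scaled base function, which is one reason these two settings are tuned per dataset.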
As mentioned, the default auto-symbolic fitting to CoxKAN activation functions uses a
library of 22 symbolic operators. These are
$$
\left\{ \sin(x), \tan(x), \arctan(x), \cosh(x), e^{x}, e^{-x^2}, \log(x), \tanh(x), \operatorname{arctanh}(x), \operatorname{sigmoid}(x), \operatorname{sgn}(x), |x|, x, \tfrac{1}{\sqrt{x}}, \sqrt{x}, x^2, x^3, x^4, \tfrac{1}{x}, \tfrac{1}{x^2}, \tfrac{1}{x^4} \right\}
$$
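
One plausible way to match a learned activation against such a library is sketched below: for each operator $f$, search a grid of affine input parameters $(a, b)$, fit the output parameters $(c, d)$ of $y \approx c\, f(a x + b) + d$ by linear regression, and keep the operator with the highest $R^2$. The grid range, the reduced operator subset, and the names `LIBRARY`, `fit_operator` and `best_symbolic_fit` are assumptions for illustration; the actual CoxKAN procedure may differ in its details.

```python
import numpy as np

# A small, numerically safe subset of the operator library (illustrative only).
LIBRARY = {
    "sin": np.sin,
    "tanh": np.tanh,
    "arctan": np.arctan,
    "exp": np.exp,
    "gaussian": lambda z: np.exp(-z**2),
    "abs": np.abs,
    "x": lambda z: z,
    "x^2": lambda z: z**2,
    "x^3": lambda z: z**3,
}

def fit_operator(f, x, y, grid=np.linspace(-3.0, 3.0, 61)):
    """Best R^2 of y ~ c*f(a*x + b) + d over a grid of (a, b) values."""
    a = grid[:, None, None]
    b = grid[None, :, None]
    fz = f(a * x[None, None, :] + b)              # shape (len(grid), len(grid), n)
    fz_c = fz - fz.mean(axis=-1, keepdims=True)
    y_c = y - y.mean()
    cov = (fz_c * y_c).sum(axis=-1)
    var_f = (fz_c**2).sum(axis=-1)
    var_y = (y_c**2).sum()
    # For fixed (a, b), the optimal (c, d) come from simple linear regression,
    # whose R^2 equals the squared Pearson correlation between f(a*x + b) and y.
    with np.errstate(divide="ignore", invalid="ignore"):
        r2 = np.where(var_f > 1e-12, cov**2 / (var_f * var_y), 0.0)
    i, j = np.unravel_index(np.argmax(r2), r2.shape)
    c = cov[i, j] / var_f[i, j]
    d = y.mean() - c * fz[i, j].mean()
    return r2[i, j], (grid[i], grid[j], c, d)

def best_symbolic_fit(x, y):
    """Return the operator name with the highest R^2, plus its score and parameters."""
    scores = {name: fit_operator(f, x, y) for name, f in LIBRARY.items()}
    name = max(scores, key=lambda k: scores[k][0])
    return name, scores[name]

# Samples of a hypothetical learned activation that is secretly a sine.
x = np.linspace(-2.0, 2.0, 200)
y = 2.0 * np.sin(1.5 * x + 0.5) + 1.0
name, (r2, params) = best_symbolic_fit(x, y)
```

Because $\sin(-z) = -\sin(z)$, the recovered $(a, b, c)$ may come back with flipped signs. The selected operator and its $R^2$ are the meaningful outputs.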
Table 6 shows the DeepSurv hyperparameters found for each experiment. For the
synthetic datasets we did not evaluate DeepSurv and for SUPPORT, GBSG, and
METABRIC we quoted the results from the official DeepSurv publication.

B Simulated data generation


The simulated datasets were generated with 8000 training observations and 2000
testing observations. The death times were generated according to the exponential
distribution:
$$
T \sim \mathrm{Exponential}(h(t, x)),
$$
where $h(t, x) = 0.01\, e^{\theta(x)}$ is the hazard and $\theta(x)$ is a custom log-partial hazard
expression. We then generated censoring times $T_c$ uniformly in the range from 0 to the largest
observed death time. The final observed times were then given by $Z = \min(T, T_c)$.
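
The generation procedure can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' script: the covariate distribution (uniform on [-1, 1]), the function names, and the example expression `example_theta` are hypothetical, since this appendix does not specify them.

```python
import numpy as np

def simulate_survival(theta, n_features, n_obs=10_000, baseline=0.01, seed=0):
    """Draw (covariates, observed time, event indicator) under an exponential model."""
    rng = np.random.default_rng(seed)
    # Assumption: covariates are uniform on [-1, 1]; the text does not state this.
    x = rng.uniform(-1.0, 1.0, size=(n_obs, n_features))
    hazard = baseline * np.exp(theta(x))            # h(t, x) = 0.01 * exp(theta(x))
    t_death = rng.exponential(scale=1.0 / hazard)   # exponential with rate = hazard
    # Censoring times are uniform between 0 and the largest death time.
    t_censor = rng.uniform(0.0, t_death.max(), size=n_obs)
    z = np.minimum(t_death, t_censor)               # observed time Z = min(T, T_c)
    event = (t_death <= t_censor).astype(int)       # 1 = death observed, 0 = censored
    return x, z, event

def example_theta(x):
    # Hypothetical log-partial hazard with one nonlinear term, for illustration only.
    return x[:, 0] + 0.5 * x[:, 1] ** 2

x, z, event = simulate_survival(example_theta, n_features=2)
# 8000 training and 2000 testing observations, as described above.
x_train, z_train, e_train = x[:8000], z[:8000], event[:8000]
x_test, z_test, e_test = x[8000:], z[8000:], event[8000:]
```

Note that NumPy parameterizes the exponential distribution by its scale, so the rate $h$ enters as `1.0 / hazard`.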

Table 6 Hyperparameters of DeepSurv.

| Hyperparameter | FLCHAIN | NWTCO | TCGA-STAD | TCGA-BRCA | TCGA-GBM/LGG | TCGA-KIRC |
|---|---|---|---|---|---|---|
| Shape | [8,5,1] | [6,9,1] | [148,19,19,1] | [168,15,1] | [320,19,1] | [362,18,18,1] |
| Early Stopping | True | False | False | True | False | True |
| Epochs | (300) | 135 | 131 | (300) | 114 | (300) |
| Learning Rate | 0.0067 | 0.008 | 0.002 | 0.001 | 0.001 | 0.006 |
| Batch Norm | True | True | True | True | True | False |
| Dropout | 0.12 | 0.15 | 0.27 | 0.15 | 0.11 | 0.14 |
| Weight Decay (L2) | 6.6e-8 | 4.7e-8 | 2.7e-7 | 4e-7 | 9e-5 | 3.2e-6 |

Table 7 Hyperparameters of CoxKAN.

| Hyperparameter | Gaussian | Shallow | Deep | Difficult | SUPPORT | GBSG | METABRIC | FLCHAIN | NWTCO | TCGA-STAD | TCGA-BRCA | TCGA-GBM/LGG | TCGA-KIRC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KAN Shape | [4,2,1] | [5,1] | [6,5,5,1] | [5,1] | [14,3,1] | [7,2,1] | [9,1] | [8,3,1] | [6,5,1] | [148,1] | [168,1] | [320,1] | [362,4,4,1] |
| Learning Rate | 0.035 | 0.01 | 0.01 | 0.1 | 0.015 | 0.0076 | 0.09 | 0.08 | 0.002 | 0.005 | 0.03 | 0.014 | 0.014 |
| Early Stopping | False | False | True | False | True | True | True | True | False | True | True | True | True |
| Steps | 133 | 107 | (300) | 107 | (300) | (300) | (300) | (300) | 147 | (300) | (300) | (300) | (300) |
| Prune threshold | 0.03 | 0.03 | 0.045 | 0.03 | 0.00007 | 0.045 | 0.035 | 0.001 | 0.02 | 0.008 | 0.007 | 0.034 | 0.012 |
| Grid Intervals | 4 | 5 | 4 | 5 | 3 | 3 | 3 | 3 | 5 | 3 | 3 | 5 | 3 |
| Base fn | linear | silu | linear | silu | linear | silu | silu | linear | linear | linear | silu | silu | linear |
| Spline noise ξ_s | 0.03 | 0.06 | 0.003 | 0.06 | 0.11 | 0.09 | 0.1 | 0.12 | 0.15 | 0.1 | 0.02 | 0.05 | 0.14 |
| Base noise ξ_b | 0.13 | 0.14 | 0.16 | 0.14 | 0.05 | 0.18 | 0.03 | 0.04 | 0.16 | 0.01 | 0.009 | 0.04 | 0.11 |
| Reg λ | 0.014 | 0.0001 | 0.01 | 0.0001 | 0.005 | 0.0007 | 0.003 | 0.006 | 0.002 | 0.0004 | 0.013 | 0.01 | 0.01 |
| Entropy Reg λ_ent | 2 | 7 | 3 | 7 | 2 | 3 | 0 | 2 | 2 | 10 | 14 | 0 | 3 |
| Coefficient Reg λ_coef | 0 | 0 | 2 | 0 | 4 | 2 | 4 | 1 | 2 | 0 | 3 | 4 | 5 |


C STAD and KIRC hazards
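On the STAD dataset, CoxKAN predicted the following log-partial hazard:

$$
\begin{aligned}
\hat{\theta}_{KAN} = {} &+ 0.2\,\tanh(\mathrm{CALM2}_{RNA} - 0.4) && (\sigma = 0.15) \\
&- 0.1 \cdot \mathrm{PRR15L}_{RNA} && (\sigma = 0.10) \\
&+ 0.2 \cdot \mathrm{TOMM20}_{RNA} && (\sigma = 0.09) \\
&- 0.09 \cdot \mathrm{MUC16}_{mut} && (\sigma = 0.09) \\
&+ 0.8\,\arctan(0.4 \cdot \mathrm{C3}_{RNA} + 0.2) && (\sigma = 0.08) \\
&- 0.1 \cdot \mathrm{HNRNPK}_{RNA} && (\sigma = 0.08) \\
&- 0.2 \cdot \mathrm{MISP}_{RNA} && (\sigma = 0.08) \\
&+ \text{less significant terms}
\end{aligned}
$$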
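
Because the STAD hazard is an explicit formula, it can be evaluated by hand or in a few lines of code. The sketch below keeps only the seven listed terms and drops the "less significant terms", so it is an approximation. The feature keys, the function names, and the assumption that inputs are already on the scale the model was trained on are all illustrative.

```python
import math

FEATURES = ["CALM2_RNA", "PRR15L_RNA", "TOMM20_RNA", "MUC16_mut",
            "C3_RNA", "HNRNPK_RNA", "MISP_RNA"]

def stad_log_partial_hazard(f):
    """Leading terms of the STAD formula; f maps feature name -> value."""
    return (
        0.2 * math.tanh(f["CALM2_RNA"] - 0.4)
        - 0.1 * f["PRR15L_RNA"]
        + 0.2 * f["TOMM20_RNA"]
        - 0.09 * f["MUC16_mut"]
        + 0.8 * math.atan(0.4 * f["C3_RNA"] + 0.2)
        - 0.1 * f["HNRNPK_RNA"]
        - 0.2 * f["MISP_RNA"]
    )

def hazard_ratio(patient_a, patient_b):
    """Hazard of patient_a relative to patient_b under proportional hazards."""
    return math.exp(stad_log_partial_hazard(patient_a)
                    - stad_log_partial_hazard(patient_b))

# Two hypothetical patients that differ only in TOMM20 expression by one unit.
reference = {name: 0.0 for name in FEATURES}
elevated = dict(reference, TOMM20_RNA=1.0)
ratio = hazard_ratio(elevated, reference)  # equals exp(0.2) for this pair
```

Linear terms give a constant hazard ratio per unit change, while the tanh and arctan terms saturate, so their effect on risk levels off at extreme expression values.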
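On the KIRC dataset, CoxKAN predicted:

$$
\begin{aligned}
\hat{\theta}_{KAN} = {} &+ 0.43 \cdot \mathrm{MT1X}_{RNA} && (\sigma = 0.42) \\
&+ 0.34 \cdot \mathrm{DDX43}_{RNA} && (\sigma = 0.34) \\
&+ 0.23 \cdot \mathrm{CWH43}_{RNA} && (\sigma = 0.31) \\
&+ 0.22 \cdot \mathrm{CILP}_{RNA} && (\sigma = 0.31) \\
&- 0.24 \cdot \mathrm{LOC153328}_{RNA} && (\sigma = 0.29) \\
&- 0.21 \cdot \mathrm{CYP3A7}_{RNA} && (\sigma = 0.28) \\
&+ \text{less significant terms},
\end{aligned}
$$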
