BSTAT 531 SURVIVAL ANALYSIS 2+1
Objective
The course deals with the study of survival times and their statistical properties, along with the
factors affecting them.
Theory
UNIT I
Concept of survival data, definition and associated probability density function, survival
function, hazard function, Censoring in survival time.
UNIT II
Estimation of survival function by life table analysis, Kaplan and Meier Method.
UNIT III
Survival and failure time distributions: family of exponential and Weibull models.
UNIT IV
Analytical and graphical method for choosing best fitted distribution, Parametric and non-
parametric tests for comparison of survival functions.
UNIT V
Concomitant variables in lifetime distribution models, Cox-proportional hazard models, Cox-
proportional hazard models with time dependent covariates.
Practical
Estimation of survival functions - life table analysis; Kaplan and Meier Method. Estimation of
survival functions in case of censored observations - life table method, Kaplan and Meier
method; Fitting of survival and failure time distributions: family of exponential and Weibull
models (For uncensored and censored observations); Regression and Maximum Likelihood
Method of fitting and choosing appropriate distribution to the survival times; Graphical method
for choosing best fitted distribution, Parametric and Non-Parametric tests for comparison of
survival functions; Parametric tests for comparison of survival functions in the presence of
censored survival times; Non parametric tests for comparing survival functions in the presence of
uncensored survival times; Concomitant variables in lifetime distribution models. Fitting of Cox-
proportional hazard models.
UNIT I
1.1 Introduction
Survival data are data that measure the time to the occurrence of some event. The event
may be death, the appearance of a disease, relapse from remission, response to a treatment,
equipment breakdown, and so on. Survival time can therefore be tumor-free time, the time from
the start of treatment to response, the length of remission, or the time to death. The study of
survival data has focused on predicting the probability of response, survival, or mean lifetime;
comparing the survival distributions of experimental animals or of human patients; and
identifying risk and/or prognostic factors related to response, survival, and the development of
a disease. The development of models and methods to deal with survival times or lifetimes took
place in the second half of the twentieth century. The development proceeded in two
intermingling streams, namely reliability theory and survival analysis. Reliability theory is
concerned with models for the lifetimes of components and systems in the engineering and
industrial fields, whereas survival analysis is concerned with medical and similar biological
phenomena.
Survival time or time to event is usually considered as a positive real-valued random
variable having a continuous distribution function. The definition of lifetime includes a time
scale and time origin, as well as specification of the event that determines lifetime. In some
instances, time may represent age, with the time origin as the birth of the individual. In other
instances, the natural time origin may be occurrence of some event such as entry into a study or
diagnosis of a particular disease. In some situations, it is difficult to say precisely when the event
occurs, for example, in the case of the appearance of a tumour. The time scale is not always real or
chronological time, especially where machines or equipment are considered. It could be the
number of operations a component performs before it breaks down. The following examples
illustrate various types of survival data that arise in practical situations.
1. Manufactured items with mechanical or electric components are often subjected to life tests
in order to obtain information on their durability. This involves putting items in operation,
often in a laboratory setting, and observing them until they fail. It is common here to refer to
the lifetimes as ‘failure times’ since when an item ceases operating satisfactorily, it is said to
have ‘failed’.
2. In medical studies dealing with potentially fatal diseases, one is interested in the survival time
of individuals with disease, measured from the date of diagnosis or some other starting point.
For example, it is common to compare treatments for a disease at least partly in terms of
survival time distributions of patients receiving the different treatments.
3. A standard experiment in the investigation of a carcinogenic substance is one in which
laboratory animals are subjected to doses of the substance and then observed to see if they
develop tumours. The main variable of interest is the time to appearance of a tumour,
measured from when the dose is administered.
4. During the remission period, a leukemia patient, though not free of disease, is free of
symptoms. The length of the remission period is the variable of interest. Patients in
remission are followed over time to see how long they stay in remission.
1.2 Basic Concepts
Let $T$ be a nonnegative random variable representing the lifetimes of individuals, having an
absolutely continuous distribution function $F(\cdot)$ with respect to the Lebesgue measure. Let
$f(\cdot)$ denote the probability density function (p.d.f.) of $T$. All functions, unless stated
otherwise, are defined over the interval $[0, \infty)$.
1.2.1 Survivor Function
A basic function that describes lifetime data is the survivor function, which is defined as
$$S(t) = P(T > t) = \int_t^{\infty} f(x)\,dx .$$
The survivor function $S(t)$ is the probability of an individual surviving beyond time $t$.
Because $T$ is continuous, $P(T > t) = P(T \ge t)$. In contexts involving the lifetimes of systems
or manufactured items, $S(t)$ is referred to as the reliability function. $S(t)$ is a
non-increasing continuous function with $S(0) = 1$ and $\lim_{t \to \infty} S(t) = 0$.
The function $S(t)$ is also known as the cumulative survival rate. The graph of $S(t)$ is called
the survival curve. A steep survival curve, such as the one shown in Figure 1a, represents a low
survival rate or short survival time. A gradual or flat survival curve, such as the one in Figure 1b,
represents a high survival rate or longer survival.
The survivorship function or the survival curve is used to find the 50th percentile (the median)
and other percentiles (e.g., 25th and 75th) of survival time and to compare survival distributions
of two or more groups. The median survival times in Figure 1a and b are approximately 5 and 36
units of time, respectively. The mean is generally used to describe the central tendency of a
distribution, but in survival distributions the median is often better because a small number of
individuals with exceptionally long or short lifetimes will cause the mean survival time to be
disproportionately large or small.
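The effect of one exceptionally long lifetime on the mean can be checked with a small sketch. The survival times below are hypothetical and uncensored, and the helper name `empirical_survivor` is an illustrative choice. With censored data the survivor function would instead be estimated by the life table or Kaplan and Meier methods of Unit II.

```python
from statistics import mean, median

def empirical_survivor(times, t):
    """Proportion of lifetimes exceeding t; valid only for uncensored data."""
    return sum(1 for x in times if x > t) / len(times)

# Hypothetical uncensored survival times (months); the last subject is an
# exceptionally long survivor.
times = [2, 3, 4, 5, 6, 7, 60]

s_at_0 = empirical_survivor(times, 0)   # everyone survives beyond t = 0
s_at_5 = empirical_survivor(times, 5)   # 3 of the 7 subjects survive beyond t = 5
median_time = median(times)             # middle ordered value
mean_time = mean(times)                 # pulled upward by the 60-month survivor

print(f"S(0) = {s_at_0:.2f}, S(5) = {s_at_5:.2f}")
print(f"median = {median_time}, mean = {mean_time:.1f}")
```

In this invented sample the mean is more than twice the median, although six of the seven subjects fail within 7 months.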
1.2.2 Probability Density Function (or Density Function)
Like any other continuous random variable, the survival time $T$ has a probability density
function, defined as the limit of the probability that an individual fails in the short interval
$[t, t + \Delta t)$ per unit width $\Delta t$, or simply the probability of failure in a small
interval per unit time. It can be expressed as
$$f(t) = \lim_{\Delta t \to 0} \frac{P[\text{an individual dying in the interval } [t, t + \Delta t)]}{\Delta t} .$$
The graph of f (t) is called the density curve. Figure 2a and b give two examples of the density
curve. The density function has the following two properties:
1. $f(t)$ is a nonnegative function: $f(t) \ge 0$ for all $t \ge 0$, and $f(t) = 0$ for $t < 0$.
2. The area between the density curve and the t axis is equal to 1.
The proportion of individuals that fail in any time interval and the peaks of high
frequency of failure can be found from the density function. The density curve in Figure 2a gives
a pattern of high failure rate at the beginning of the study and decreasing failure rate as time
increases. In Figure 2b, the peak of high failure frequency occurs at approximately 1.7 units of
time. The proportion of individuals that fail between 1 and 2 units of time is equal to the shaded
area between the density curve and the axis. The density function is also known as the
unconditional failure rate.
1.2.3 Hazard Rate
An important function that characterizes lifetime distributions is the hazard rate $h(t)$,
defined as
$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} .$$
The hazard rate specifies the instantaneous rate of death or failure at time $t$, given that the
individual survives up to time $t$. Thus $h(t)\,\Delta t$ is the approximate probability of death in
the interval $[t, t + \Delta t)$, given survival up to time $t$. The hazard rate is also known as the
conditional failure rate in reliability, the force of mortality in demography, the intensity
function in stochastic processes, the age-specific failure rate in epidemiology, the inverse of the
Mills ratio in economics, or simply the hazard function.
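The difference between the unconditional failure rate (the density) and the conditional failure rate (the hazard) can be illustrated with grouped counts. The cohort size and death counts below are invented for illustration, and the estimates are crude ratios rather than the life table estimators of Unit II.

```python
# Hypothetical cohort of 100 subjects followed over four 1-year intervals.
n_total = 100
deaths = [20, 18, 17, 16]   # deaths observed in each interval (invented counts)
width = 1.0                 # interval width in years

density_est = []   # unconditional: deaths relative to the whole cohort
hazard_est = []    # conditional: deaths relative to those still at risk
at_risk = n_total
for d in deaths:
    density_est.append(d / (n_total * width))
    hazard_est.append(d / (at_risk * width))
    at_risk -= d   # subjects who died leave the risk set

for i, (f_hat, h_hat) in enumerate(zip(density_est, hazard_est), start=1):
    print(f"interval {i}: density ~ {f_hat:.3f}, hazard ~ {h_hat:.3f}")
```

In this invented cohort the estimated density falls from one interval to the next while the estimated hazard rises, because the same or fewer deaths are spread over a shrinking group of survivors.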
The hazard function may increase, decrease, remain constant, or follow a more
complicated pattern. Figure 3 is a plot of several kinds of hazard function. For example, patients
with acute leukemia who do not respond to treatment have an increasing hazard rate, $h_1(t)$. A
decreasing hazard function, $h_2(t)$, describes, for example, the risk of soldiers wounded by
bullets who undergo surgery: the main danger is the operation itself, and this danger decreases if
the surgery is successful. An example of a constant hazard function, $h_3(t)$, is the risk of healthy
persons between 18 and 40 years of age, whose main risks of death are accidents. The bathtub
curve, $h_4(t)$, describes the process of human life. During an initial period, the risk is high (high
infant mortality). Subsequently, $h(t)$ stays approximately constant until a certain time, after which
it increases because of wear-out failures. Finally, patients with tuberculosis have risks that
increase initially and then decrease after treatment. Such an increasing, then decreasing hazard
function is described by $h_5(t)$.
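The three monotone shapes can be reproduced with the Weibull model listed in Unit III. The sketch below assumes the common parameterisation with a shape and a scale parameter. The function name and the parameter values are illustrative only. A single Weibull hazard is monotone, so it cannot produce the bathtub or the increasing-then-decreasing shape.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Weibull hazard (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

time_points = [0.5, 1.0, 2.0, 4.0]

increasing = [weibull_hazard(t, shape=2.0) for t in time_points]  # shape > 1
decreasing = [weibull_hazard(t, shape=0.5) for t in time_points]  # shape < 1
constant = [weibull_hazard(t, shape=1.0) for t in time_points]    # shape = 1, exponential case

print("increasing:", increasing)
print("decreasing:", [round(h, 3) for h in decreasing])
print("constant:  ", constant)
```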
1.2.4 Cumulative Hazard function
A related function is the cumulative hazard rate $H(t)$, defined as
$$H(t) = \int_0^t h(x)\,dx .$$
It is related to the survivor function by $H(t) = -\log S(t)$. Thus, at $t = 0$, $S(t) = 1$ and
$H(t) = 0$, and as $t \to \infty$, $S(t) \to 0$ and $H(t) \to \infty$. The cumulative hazard function
can take any value between zero and infinity.
1.2.5 Relationships
The relationships among $f(t)$, $h(t)$ and $S(t)$ are given below.
1. The p.d.f. of $T$ may be represented as
$$f(t) = -\frac{dS(t)}{dt} .$$
2. When the p.d.f. $f(t)$ of $T$ exists, the hazard rate is expressed as
$$h(t) = \frac{f(t)}{S(t)} = -\frac{d \log S(t)}{dt} . \tag{1.1}$$
The hazard rate fully specifies the distribution of $T$ and determines the survivor function.
3. Integrating (1.1) with respect to $t$ and using $S(0) = 1$, we obtain
$$S(t) = \exp\left( -\int_0^t h(x)\,dx \right) . \tag{1.2}$$
4. The p.d.f. of $T$ can be obtained from (1.1) and (1.2) as
$$f(t) = h(t) \exp\left( -\int_0^t h(x)\,dx \right) .$$
5. $S(t)$ can be represented in terms of $H(t)$ as
$$S(t) = \exp\left( -H(t) \right) .$$
Hence, if any one of the functions $f(t)$, $h(t)$ or $S(t)$ is given, then the other two can be easily
derived.
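As a numerical check of these relationships, the sketch below starts from a hazard function alone and recovers the cumulative hazard, the survivor function and the density. The linear hazard and all function names are hypothetical choices made for illustration. The integral is approximated by the trapezoidal rule and the derivative by a central difference.

```python
import math

def hazard(x):
    """Hypothetical linearly increasing hazard h(x) = 0.1 + 0.02 x."""
    return 0.1 + 0.02 * x

def cumulative_hazard(t, steps=1000):
    """H(t): trapezoidal integration of h over [0, t]."""
    dx = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * dx) for i in range(1, steps))
    return total * dx

def survivor(t):
    """S(t) = exp(-H(t)), relationships 3 and 5."""
    return math.exp(-cumulative_hazard(t))

def density(t):
    """f(t) = h(t) * exp(-H(t)), relationship 4."""
    return hazard(t) * survivor(t)

t = 5.0
eps = 1e-5
# Relationship 1: f(t) = -dS(t)/dt, approximated by a central difference.
minus_dS_dt = -(survivor(t + eps) - survivor(t - eps)) / (2 * eps)

print(f"H(5) = {cumulative_hazard(t):.4f}")
print(f"f(5) = {density(t):.6f}, -dS/dt at 5 = {minus_dS_dt:.6f}")
```

For this hazard the integral has the closed form $H(t) = 0.1t + 0.01t^2$, so $H(5) = 0.75$, which the numerical value can be compared against.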
In survival studies, many subjects fail to continue to be in the study till the event of
interest occurs. This leads to incomplete data due to censored observations. The analysis of
lifetime data under censoring is a major issue in survival studies.
1.3 Censoring
Censoring is inevitable in survival and reliability studies because the experimenter is
unable to obtain complete information on lifetime of individuals. For example, patients in a
clinical trial may withdraw from the study, or the study may have to be terminated at a prefixed
time point. There are various categories of censoring such as right censoring, left censoring and
interval censoring.
1.3.1 Right Censoring
In both engineering and medical applications, right censoring is the most common form
of censoring with lifetime data. In right censoring only lower bounds on lifetime are available for
some individuals. Right censoring arises in certain situations because some individuals are still
surviving at the time that the study is terminated. In other instances, individuals may move away from
the study area for reasons unconnected with the study, so contact is lost. In some other situations,
individuals may be withdrawn or decide to withdraw from the study because of worsening or
improving prognosis.
Two types of right censoring are built into the design of experiments to reduce the time
taken for completing the study.
Type I censoring: Sometimes experiments run over a fixed time period in such a way that an
individual's lifetime will be known exactly only if it is less than a predetermined value. In such
situations the data are said to be Type I or time censored. In general, suppose that in a life test
experiment $n$ items are simultaneously put into operation. The study is terminated at a
predetermined time $t_0$. Suppose that $r$ items have failed by this time and the remaining $n - r$
items are still operative. Then there are $n - r$ censored items, and the data consist of the
lifetimes of the $r$ failed items and the censoring time $t_0$ for the remaining $n - r$ items. Type I
censoring occurs frequently in medical research when a decision is made to terminate a study at a
date on which not all individuals' lifetimes will be known. For example, suppose that six rats have been
exposed to carcinogens by injecting tumor cells into their foot pads. The times to develop a
tumor of a given size are observed. The investigator decides to terminate the experiment after 30
weeks. Figure 4 is a plot of the development times of the tumors. Rats A, B, and D developed
tumors after 10, 15, and 25 weeks, respectively. Rats C and E did not develop tumors by the end
of the study; their tumor-free times are thus 30-plus weeks. Rat F died accidentally without
tumors after 19 weeks of observation. The survival data (tumor-free times) are 10, 15, 30+, 25,
30+, and 19+ weeks. (The plus indicates a censored observation.)
Type II censoring: The term Type II censoring refers to the situation where $n$ individuals
enter the study at the same time and the study terminates once $k$ lifetimes have been observed.
Thus only the $k$ smallest lifetimes in a random sample of size $n$ are observed, where $k$ is a
specified integer between 1 and $n$. This type of censoring is also known as order censoring or
failure censoring. For example, in an experiment with six rats (Figure 5), the investigator may
decide to terminate the study after four of the six rats have developed tumors. The survival or
tumor-free times are then 10, 15, 35+, 25, 35, and 19+ weeks.
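The Type I and Type II designs can be sketched on the rat example. The complete tumor times used below are partly hypothetical: 10, 15, 25 and 35 weeks for rats A, B, D and E come from the text, while 50 weeks for rat C is invented, since the text only tells us that its time exceeds 35 weeks. Rat F (19+) is left out because its accidental death is a random loss rather than a consequence of either design. The function names are illustrative.

```python
def type1_censor(lifetimes, t0):
    """Type I: stop at fixed time t0; return (observed time, failed?) pairs."""
    return [(min(t, t0), t <= t0) for t in lifetimes]

def type2_censor(lifetimes, k):
    """Type II: stop at the k-th smallest lifetime; survivors are censored there."""
    stop_time = sorted(lifetimes)[k - 1]
    return [(min(t, stop_time), t <= stop_time) for t in lifetimes]

def plus_notation(observations):
    """Format pairs in the text's notation, e.g. 30+ for a censored time."""
    return [f"{time}" if failed else f"{time}+" for time, failed in observations]

# Complete tumor times in weeks for rats A, B, C, D, E (C's value is invented).
complete_times = [10, 15, 50, 25, 35]

type1 = plus_notation(type1_censor(complete_times, t0=30))
type2 = plus_notation(type2_censor(complete_times, k=4))
print("Type I, t0 = 30:", type1)
print("Type II, k = 4: ", type2)
```

Under Type I censoring the stopping time is fixed and the number of failures is random. Under Type II censoring the number of failures is fixed and the stopping time is random.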
Type I and type II censored observations are also called singly censored data, whereas
type III censored observations, described next, are called progressively censored data (Cohen,
1965). Another commonly used name for type III censoring is random censoring.
Type III censoring: In most clinical and epidemiologic studies the period of study is fixed and patients enter
the study at different times during that period. Some may die before the end of the study; their
exact survival times are known. Others may withdraw before the end of the study and are lost to
follow-up. Still others may be alive at the end of the study. For "lost" patients, survival times
are at least as long as the time from entry to the last contact. For patients still alive, survival
times are at least as long as the time from entry to the end of the study. The latter two kinds of
observations are censored
observations. Since the entry times are not simultaneous, the censored times are also different.
This is type III censoring. For example, suppose that six patients with acute leukemia enter a
clinical study during a total study period of one year. Suppose also that all six respond to
treatment and achieve remission. The remission times are plotted in Figure 6. Patients A, C, and
E achieve remission at the beginning of the second, fourth, and ninth months, and relapse after
four, six, and three months, respectively. Patient B achieves remission at the beginning of the
third month but is lost to follow-up four months later; the remission duration is thus at least four
months. Patients D and F achieve remission at the beginning of the fifth and tenth months,
respectively, and are still in remission at the end of the study; their remission times are thus at
least eight and three months. The respective remission times of the six patients are 4, 4+, 6, 8+,
3, and 3+ months.
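The remission times in this example can be recomputed from the entry times and the one-year study period. In the sketch below the tuple layout and the function name are illustrative choices. Each patient is recorded as the number of months elapsed before remission starts, together with the months to relapse or to loss to follow-up where the text gives them.

```python
STUDY_END = 12  # total study period in months

# (patient, months elapsed before remission starts, months to relapse,
#  months to loss to follow-up); None means the text records no such event.
patients = [
    ("A", 1, 4, None),     # remission at start of month 2, relapse after 4 months
    ("B", 2, None, 4),     # remission at start of month 3, lost after 4 months
    ("C", 3, 6, None),
    ("D", 4, None, None),  # still in remission at the end of the study
    ("E", 8, 3, None),
    ("F", 9, None, None),
]

def remission_time(start, relapse_after, lost_after, study_end=STUDY_END):
    """Return (duration, relapsed?) under type III (random) censoring."""
    follow_up = study_end - start          # longest observable duration
    if relapse_after is not None and relapse_after <= follow_up:
        return relapse_after, True         # relapse observed: exact time
    if lost_after is not None and lost_after < follow_up:
        return lost_after, False           # lost to follow-up: censored
    return follow_up, False                # in remission at study end: censored

results = [remission_time(start, relapse, lost) for _, start, relapse, lost in patients]
labels = [f"{d}" if relapsed else f"{d}+" for d, relapsed in results]
print(labels)
```

Because entry times differ, the two patients censored at the end of the study (D and F) have different censoring times, which is the defining feature of this design.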
1.3.2 Left Censoring
Left censoring occurs in life test applications when a unit has failed at the time of its first
inspection; we know only that the unit failed before the inspection time. In other situations, left
censored observations arise when the exact value of a response has not been observed and we
have, instead, an upper bound on that response. Consider, for example, a measuring instrument
that lacks the sensitivity needed to measure the observations below a known threshold. When the
measurement is taken, if the signal is below the instrument threshold, we know only that the
measurement is less than the threshold. As another example, an epidemiologist wishes to know
the age at diagnosis in a follow-up study of diabetic retinopathy. At the time of the examination,
a 50-year-old participant was found to have already developed retinopathy, but there is no record
of the exact time at which initial evidence was found. Thus the age at examination (i.e., 50) is a
left-censored observation. It means that the age of diagnosis for this patient is at most 50 years.
A data set may contain both left and right censored observations, in which case the
lifetimes are said to be doubly censored. For example, a psychiatrist collected data to determine
the age at which children learn to perform a particular task. The lifetime was the time from birth
until the child learned to perform the task. Children who already knew how to perform the task
when the study started were left censored, and those who had not learned the task by the end of
the study were right censored.
1.3.3 Interval Censoring
Interval censoring is another type of censoring, which occurs when the lifetime is
only known to lie within an interval. Such a pattern occurs when patients in a clinical trial have
periodic follow-up and a patient's event time is only known to fall in an interval. For example,
if medical records indicate that at age 45 the patient in the example above did not have
retinopathy, his age at diagnosis is between 45 and 50 years.
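One convenient way to hold all of these cases in a single data set is to record each observation as a pair of bounds on the lifetime. The sketch below is one such convention, not a standard imposed by the text: the function name is hypothetical, the bounds are treated loosely (no distinction between open and closed endpoints), and a lower bound of 0 is taken to mean that nothing is known before the upper bound.

```python
import math

def classify(lower, upper):
    """Name the censoring type implied by the bounds lower <= T <= upper."""
    if lower == upper:
        return "exact"
    if math.isinf(upper):
        return "right-censored"
    if lower == 0:
        return "left-censored"
    return "interval-censored"

# Examples taken from the text, written as (lower, upper) bounds.
observations = {
    "rat A, tumor at 10 weeks": (10, 10),
    "rat C, tumor-free at 30 weeks": (30, math.inf),
    "retinopathy found at age 50": (0, 50),
    "retinopathy between ages 45 and 50": (45, 50),
}

kinds = {name: classify(lo, hi) for name, (lo, hi) in observations.items()}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```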