
Microarray Data Analysis

Class discovery and Class prediction:


Clustering and Discrimination

1
Gene expression profiles
• Many genes show definite changes of expression between conditions.
• These patterns are called gene profiles.
2
Motivation (1): The problem of finding patterns
• It is common to have hybridizations where conditions reflect temporal or spatial aspects.
– Yeast cell-cycle data
– Tumor data evolution after chemotherapy
– CNS data in different parts of the brain
• Interesting genes may be those showing patterns associated with changes.
• Our problem is to distinguish interesting or real patterns from meaningless variation, at the level of the gene.
3
Finding patterns: Two approaches
• If patterns already exist → Profile comparison (distance analysis)
– Find the genes whose expression fits specific, predefined patterns.
– Find the genes whose expression follows the pattern of a predefined gene or set of genes.
• If we wish to discover new patterns → Cluster analysis (class discovery)
– Carry out some kind of exploratory analysis to see what expression patterns emerge.
4
Motivation (2): Tumor classification
• A reliable and precise classification of tumours is essential for successful diagnosis and treatment of cancer.
• Current methods for classifying human malignancies rely on a variety of morphological, clinical, and molecular variables.
• In spite of recent progress, there are still uncertainties in diagnosis. Also, it is likely that the existing classes are heterogeneous.
• DNA microarrays may be used to characterize the molecular variations among tumours by monitoring gene expression on a genomic scale. This may lead to a more reliable classification of tumours.

5
Tumor classification, cont.
There are three main types of statistical problems associated with tumor classification:
1. The identification of new/unknown tumor classes using gene expression profiles (cluster analysis);
2. The classification of malignancies into known classes (discriminant analysis);
3. The identification of “marker” genes that characterize the different tumor classes (variable selection).

6
Cluster and Discriminant analysis
• These techniques group, or equivalently classify, observational units on the basis of measurements.
• They differ according to their aims, which in turn depend on the availability of a pre-existing basis for the grouping.
– In cluster analysis (unsupervised learning, class discovery), there are no predefined groups or labels for the observations.
– Discriminant analysis (supervised learning, class prediction) is based on the existence of groups (labels).
7
Clustering microarray data
• Clustering can be applied to genes (rows), mRNA samples (columns), or both at once.
– Cluster samples to identify new cell or tumour subtypes.
– Cluster rows (genes) to identify groups of co-regulated genes.
– We can also cluster genes to reduce redundancy, e.g. for variable selection in predictive models.
8
Advantages of clustering
• Clustering leads to readily interpretable figures.
• Clustering strengthens the signal when averages are taken within clusters of genes (Eisen).
• Clustering can be helpful for identifying patterns in time or space.
• Clustering is useful, perhaps essential, when seeking new subclasses of cell samples (tumors, etc.).

9
Applications of clustering (1)
• Alizadeh et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
– Three subtypes of lymphoma (FL, CLL and DLBCL) have different genetic signatures (81 cases total).
– The DLBCL group can be partitioned into two subgroups with significantly different survival (39 DLBCL cases).

10
Clusters on both genes and arrays
[Figure: two-way clustering of both genes and arrays. Taken from Alizadeh A et al, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, February 2000.]

11
Discovering tumor subclasses
• DLBCL is clinically heterogeneous.
• Specimens were clustered based on their expression profiles of GC B-cell associated genes.
• Two subgroups were discovered:
– GC B-like DLBCL
– Activated B-like DLBCL
12
Applications of clustering (2)
• A naïve but nevertheless important application is assessment of experimental design.
• Suppose one has an experiment with different experimental conditions, and in each of them there are biological and technical replicates…
• We would expect the more homogeneous groups to cluster together, with dissimilarity ordered as:
– technical replicates < biological replicates < different groups.
• Failure to cluster this way suggests bias due to experimental conditions rather than to real differences between groups.
13
Basic principles of clustering
Aim: to group observations that are “similar” based on predefined criteria.
Issues:
• Which genes / arrays to use?
• Which similarity or dissimilarity measure?
• Which clustering algorithm?
It is advisable to reduce the number of genes from the full set to some more manageable number before clustering. The basis for this reduction is usually quite context specific; see the later example.

14
Two main classes of measures of dissimilarity
• Correlation
• Distance
– Manhattan
– Euclidean
– Mahalanobis distance
– Many more…

15
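A small Python sketch (not part of the original slides; synthetic data) contrasting the two classes of measures: a rescaled, shifted copy of a gene profile is far from the original in Euclidean distance but has near-zero correlation dissimilarity, because the profile shape is unchanged.

```python
# Toy comparison of distance- vs correlation-based dissimilarity.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))          # 5 hypothetical genes x 12 arrays

d_euclid = squareform(pdist(X, metric="euclidean"))
d_manhat = squareform(pdist(X, metric="cityblock"))     # Manhattan
d_corr   = squareform(pdist(X, metric="correlation"))   # 1 - Pearson r

# A scaled/shifted copy of gene 0: same profile shape, different level.
X2 = np.vstack([X[0], 3 * X[0] + 10])
print(pdist(X2, metric="euclidean"))    # large
print(pdist(X2, metric="correlation"))  # ~0
```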
Two basic types of methods
• Partitioning
• Hierarchical

16
Partitioning methods
• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize the within-cluster sums of squares.
• Examples:
– k-means, self-organizing maps (SOM), PAM, etc.;
– Fuzzy: needs a stochastic model, e.g. Gaussian mixtures.
17
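A minimal sketch of a partitioning method (not from the slides; synthetic data): scikit-learn's k-means with a pre-specified k, iteratively reallocating genes to minimize the within-cluster sum of squares.

```python
# k-means partitioning of genes into a pre-specified number of clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic expression matrix: 60 genes x 10 arrays, three profile groups.
X = np.vstack([rng.normal(loc=m, size=(20, 10)) for m in (-2, 0, 2)])

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
print(km.labels_[:10])   # cluster assignment of the first genes
print(km.inertia_)       # the within-cluster sum-of-squares criterion
```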
Hierarchical methods
• Hierarchical clustering methods produce a tree or dendrogram.
• They avoid specifying how many clusters are appropriate by providing a partition for each k, obtained by cutting the tree at some level.
• The tree can be built in two distinct ways:
– bottom-up: agglomerative clustering;
– top-down: divisive clustering.

18
Agglomerative methods
• Start with n clusters.
• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity, which reflects the shape of the clusters.
• Between-cluster dissimilarity measures:
– Mean-link: average of pairwise dissimilarities.
– Single-link: minimum of pairwise dissimilarities.
– Complete-link: maximum of pairwise dissimilarities.
– Distance between centroids.
19
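A hedged sketch (synthetic data, not the slides' example) of agglomerative clustering with the linkage choices listed above, using scipy; the tree is cut to give k = 2 clusters.

```python
# Agglomerative clustering under different between-cluster dissimilarities.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, size=(15, 8)) for m in (0, 4)])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # n-1 merge steps
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut tree at k = 2
    print(method, np.bincount(labels)[1:])           # cluster sizes
```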
[Figure: illustrations of the four between-cluster dissimilarities: distance between centroids, single-link, complete-link, mean-link.]

20
Divisive methods
• Start with only one cluster.
• At each step, split one cluster into two parts.
• Split to give the greatest distance between the two new clusters.
• Advantages:
– Obtains the main structure of the data, i.e. focuses on the upper levels of the dendrogram.
• Disadvantages:
– Computational difficulties when considering all possible divisions into two groups.
21
Illustration of agglomerative clustering
[Figure: five points in two-dimensional space and the corresponding dendrogram with leaves 1 5 2 3 4. The merge order is {1,5}, then {3,4}, then {1,2,5}, and finally {1,2,3,4,5}.]

22
Tree re-ordering?
[Figure: the same agglomerative dendrogram drawn with two different orderings of its leaves (2 1 5 3 4 vs. 1 5 2 3 4); the merge structure is identical in both, so the leaf order is not unique.]

23
Partitioning or Hierarchical?
• Partitioning:
– Advantages:
• Optimal for certain criteria.
• Genes automatically assigned to clusters.
– Disadvantages:
• Need initial k.
• Often require long computation times.
• All genes are forced into a cluster.
• Hard to define clusters.
• Hierarchical:
– Advantages:
• Faster computation.
• Visual.
– Disadvantages:
• Unrelated genes are eventually joined.
• Rigid: cannot correct later for erroneous decisions made earlier.

24
Hybrid methods
• Mix elements of partitioning and hierarchical methods:
– Bagging: Dudoit & Fridlyand (2002)
– HOPACH: van der Laan & Pollard (2001)

25
Three generic clustering problems
Three important tasks (which are generic) are:
1. Estimating the number of clusters;
2. Assigning each observation to a cluster;
3. Assessing the strength/confidence of cluster assignments for individual observations.
These are not equally important in every problem.

26
Estimating the number of clusters using the silhouette
• Define the silhouette width of an observation as:
S = (b − a) / max(a, b)
where a is its average dissimilarity to all other points in its own cluster and b is the smallest average dissimilarity to the points of any other cluster.
• Intuitively, objects with large S are well clustered, while those with small S tend to lie between clusters.
• How many clusters: perform the clustering for a sequence of values of the number of clusters k and choose the k with the largest average silhouette width.
• The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.
27
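A sketch of this recipe on toy data (not from the slides): cluster for a range of k and keep the k with the largest average silhouette width, via scikit-learn's silhouette_score.

```python
# Choose k by maximizing the average silhouette width.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(25, 6)) for m in (0, 3, 6)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(scores, "-> choose k =", best_k)   # expect k = 3 here
```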
Estimating the number of clusters using the bootstrap
There are other resampling-based (e.g. Dudoit and Fridlyand, 2002) and non-resampling-based rules for estimating the number of clusters (for a review see Milligan and Cooper (1985) and Dudoit and Fridlyand (2002)).
The bottom line is that none work very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework.
It is always reassuring when you are able to characterize newly discovered clusters using information that was not used for the clustering.
28
Limitations
Cluster analyses:
• Usually fall outside the normal framework of statistical inference;
• Are less appropriate when only a few genes are likely to change;
• Need lots of experiments;
• Will always produce clusters, even if there is nothing going on;
• Are useful for learning about the data, but do not provide biological truth.

29
Discrimination
or Class prediction
or Supervised Learning

30
Motivation: A study of gene expression on breast tumours
(cDNA microarrays, 6526 genes per tumor; NHGRI, J. Trent)
• How similar are the gene expression profiles of BRCA1 (+), BRCA2 (+), and sporadic breast cancer patient biopsies?
• Can we identify a set of genes that distinguishes the different tumor types?
• Tumors studied:
– 7 BRCA1 +
– 8 BRCA2 +
– 7 Sporadic

31
Discrimination
• A predictor or classifier for K tumor classes partitions the space X of gene expression profiles into K disjoint subsets, A1, ..., AK, such that for a sample with expression profile x = (x1, ..., xp) ∈ Ak the predicted class is k.
• Predictors are built from past experience, i.e., from observations which are known to belong to certain classes. Such observations comprise the learning set
L = (x1, y1), ..., (xn, yn).
• A classifier built from a learning set L is denoted by
C(·, L): X → {1, 2, ..., K},
with the predicted class for observation x being C(x, L).

32
Discrimination and Allocation
[Diagram: a learning set (data with known classes) is fed to a classification technique to produce a classification rule; the rule is then applied to data with unknown classes to make a class assignment (prediction).]

33
Learning set
[Diagram: arrays (objects) with predefined classes given by clinical outcome: bad prognosis (recurrence < 5 yrs) vs. good prognosis (recurrence > 5 yrs); gene expression values are the feature vectors; a classification rule is built and then applied to a new array with unknown outcome ("?").]
Reference: L van 't Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002.

34
Learning set
[Diagram: arrays (objects) with predefined classes given by tumor type: B-ALL, T-ALL, AML; gene expression values are the feature vectors; a classification rule is built and then applied to a new array of unknown type ("?").]
Reference: Golub et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.

35
Components of class prediction
• Choose a method of class prediction
– LDA, KNN, CART, ....
• Select the genes on which the prediction will be based: feature selection
– Which genes will be included in the model?
• Validate the model
– Use data that have not been used to fit the predictor.

36
Prediction methods

37
Choose prediction model
• Prediction methods:
– Fisher linear discriminant analysis (FLDA) and its variants (DLDA, Golub’s gene voting, compound covariate predictor…)
– Nearest neighbor
– Classification trees
– Support vector machines (SVMs)
– Neural networks
– And many more…
38
Fisher linear discriminant analysis
First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA) consists of
i. finding linear combinations x′a of the gene expression profiles x = (x1, ..., xp) with large ratios of between-groups to within-groups sums of squares (the discriminant variables);
ii. predicting the class of an observation x by the class whose mean vector is closest to x in terms of the discriminant variables.

39
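A minimal sketch, assuming synthetic data: scikit-learn's LinearDiscriminantAnalysis stands in for FLDA here (for two classes its discriminant direction coincides with Fisher's), and prediction is by the class whose mean is closest in the discriminant space.

```python
# LDA as a stand-in for FLDA on two synthetic classes.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
# Two classes of "samples" with p = 5 hypothetical gene features.
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(1.5, 1, (20, 5))])
y = np.repeat([0, 1], 20)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_)           # the linear combination a (up to scale)
print(lda.predict(X[:3]))  # class of the nearest mean in discriminant space
```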
FLDA

40
Classification with SVMs
Generalization of the idea of separating hyperplanes in the original space: linear boundaries between classes in a higher-dimensional feature space lead to non-linear boundaries in the original space.

44
Adapted from internet
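A minimal sketch (synthetic two-dimensional data, not from the slide): an RBF-kernel SVM learns a boundary that is linear in the implicit feature space but non-linear, here circular, in the original space.

```python
# Non-linear (circular) class boundary captured by a kernel SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular boundary

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))   # near 1.0: the kernel captures the circle
```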
Nearest neighbor classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation x as follows:
– find the k observations in the learning set closest to x;
– predict the class of x by majority vote, i.e., choose the class that is most common among those k observations.
• The number of neighbors k can be chosen by cross-validation (more on this later).

45
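A sketch of the k-NN rule with k chosen by cross-validation, on synthetic data standing in for expression profiles.

```python
# k-nearest-neighbor classification; pick k by 5-fold cross-validation.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(1, 1, (30, 10))])
y = np.repeat([0, 1], 30)

for k in (1, 3, 5, 7):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, acc.mean())   # choose the k with the best CV accuracy
```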
Nearest neighbor rule

46
Classification tree
• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself.
• Each terminal subset is assigned a class label, and the resulting partition of X corresponds to the classifier.

47
Classification trees
[Example tree:
Node 1 (Class 1: 10, Class 2: 10), split on Gene 1 (Mi1 < 1.4):
- yes: Node 2 (Class 1: 6, Class 2: 9), split on Gene 2 (Mi2 > -0.5):
  - Node 4 (Class 1: 0, Class 2: 4), prediction: 2
  - Node 5 (Class 1: 6, Class 2: 5), split on Gene 3 (Mi2 > 2.1):
    - Node 6 (Class 1: 1, Class 2: 5), prediction: 2
    - Node 7 (Class 1: 5, Class 2: 0), prediction: 1
- no: Node 3 (Class 1: 4, Class 2: 1), prediction: 1]
48
Three aspects of tree construction
• Split selection rule:
– Example: at each node, choose the split maximizing the decrease in impurity (e.g. Gini index, entropy, misclassification error).
• Split-stopping: the decision to declare a node terminal or to continue splitting.
– Example: grow a large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with the lowest misclassification rate.
• The assignment of each terminal node to a class:
– Example: for each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node.
Supplementary slide
49
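A sketch of the grow-then-prune recipe above on synthetic data, using scikit-learn's cost-complexity pruning: fit a large tree, compute the pruning path, and pick the pruning strength (and hence the subtree) by cross-validation.

```python
# Grow a large tree, then select a pruned subtree by cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

path = DecisionTreeClassifier(random_state=7).cost_complexity_pruning_path(X, y)
best = max(
    path.ccp_alphas,  # candidate pruning strengths, weakest to strongest
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=7), X, y, cv=5
    ).mean(),
)
tree = DecisionTreeClassifier(ccp_alpha=best, random_state=7).fit(X, y)
print(best, tree.get_n_leaves())
```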
Other classifiers include…
• Support vector machines
• Neural networks
• Bayesian regression methods
• Projection pursuit
• ....

50
Aggregating predictors
• Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set.
• In classification, the multiple versions of the predictor are aggregated by voting.

51
Another component in classification rules: aggregating classifiers
[Diagram: from the training set X1, X2, … X100, draw resamples 1 through 500; build one classifier per resample; combine them into an aggregate classifier. Examples: bagging, boosting, random forests.]

54
Aggregating classifiers: bagging
[Diagram: from the training set (arrays) X1, X2, … X100, draw bootstrap resamples X*1, X*2, … X*100; grow trees 1 through 500, one per resample; each tree votes on the test sample (e.g. 90% class 1, 10% class 2), and the majority vote gives the predicted class.]

55
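A sketch of bagging as drawn above, on synthetic data: 500 bootstrap resamples, one tree fitted to each, aggregated by majority vote (scikit-learn's BaggingClassifier).

```python
# Bagging: bootstrap resamples + one tree each + majority vote.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 20))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

single = DecisionTreeClassifier(random_state=8)
bagged = BaggingClassifier(single, n_estimators=500, random_state=8)
print(cross_val_score(single, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())  # usually higher
```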
Feature selection

56
Feature selection
• A classification rule must be based on a set of variables which contribute useful information for distinguishing the classes.
• This set will usually be small, because most variables are likely to be uninformative.
• Some classifiers (like CART) perform automatic feature selection, whereas others, like LDA or KNN, do not.
57
Approaches to feature selection
• Filter methods perform explicit feature selection prior to building the classifier.
– One gene at a time: select features based on the value of a univariate test.
– The number of genes or the test p-value are the parameters of the FS method.
• Wrapper methods perform FS implicitly, as part of building the classifier.
– In classification trees, features are selected at each step based on the reduction in impurity.
– The number of features is determined by pruning the tree using cross-validation.

58
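A sketch of the filter approach on synthetic data: one-gene-at-a-time F-tests (scikit-learn's f_classif), keeping the top k = 10 genes. In this toy setup only the first 10 genes carry signal, and the filter mostly recovers them.

```python
# Filter feature selection: univariate F-test per gene, keep the top k.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(9)
X = rng.normal(size=(40, 1000))   # 40 arrays x 1000 genes
y = np.repeat([0, 1], 20)
X[y == 1, :10] += 1.5             # only 10 genes are truly informative

selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print(np.sort(selector.get_support(indices=True)))  # mostly genes 0..9
```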
Why select features?
• Leads to better classification performance by removing variables that are noise with respect to the outcome.
• May provide useful insights into the etiology of a disease.
• Can eventually lead to diagnostic tests (e.g., a “breast cancer chip”).

59
59
Why select features?

No feature Top 100


selection feature selection
Selection based on variance
Correlation plot
-1 +1 Data: Leukemia, 3 class
60
Performance assessment

61
Performance assessment
• Before using a classifier for prediction or prognosis, one needs a measure of its accuracy.
• The accuracy of a predictor is usually measured by its misclassification rate: the percentage of individuals belonging to a class that are erroneously assigned to another class by the predictor.
• An important problem arises here:
– We are not interested in the ability of the predictor to classify current samples;
– One needs to estimate future performance based on what is available.

62
Estimating the error rate
• Using the same dataset on which we have built the predictor to estimate the misclassification rate may lead to erroneously low values due to overfitting.
– This is known as the resubstitution estimator.
• We should use a completely independent dataset to evaluate the classifier, but one is rarely available.
• We use alternative approaches such as:
– Test set estimation
– Cross-validation

63
Performance assessment (I)
• Resubstitution estimation: compute the error rate on the learning set.
– Problem: downward bias.
• Test set estimation: proceeds in two steps
1. Divide the learning set into two sub-sets, L and T;
2. Build the classifier on L and compute the error rate on T.
– This approach is not free from problems:
• L and T must be independent and identically distributed.
• Problem: reduced effective sample size.
64
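A sketch of resubstitution vs. test-set estimation on synthetic data: a 1-nearest-neighbor classifier has resubstitution error exactly zero, while the held-out set T gives an honest estimate.

```python
# Resubstitution error (downward biased) vs. test-set error.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
X = rng.normal(size=(100, 50))
y = (X[:, 0] > 0).astype(int)

L_X, T_X, L_y, T_y = train_test_split(X, y, test_size=0.3, random_state=12)
clf = KNeighborsClassifier(n_neighbors=1).fit(L_X, L_y)
print("resubstitution error:", 1 - clf.score(L_X, L_y))  # 0 for 1-NN
print("test set error:", 1 - clf.score(T_X, T_y))        # honest estimate
```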
Diagram of performance assessment (I)
[Diagram: the training set yields a classifier, which is evaluated either on the training set itself (resubstitution estimation) or on an independent test set (test set estimation); both estimates feed into performance assessment.]

65
Performance assessment (II)
• V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build each classifier leaving one subset out, compute the error rate on the left-out subset, and average the V error rates.
– Bias-variance tradeoff: smaller V can give larger bias but smaller variance.
– Computationally intensive.
• Leave-one-out cross-validation (LOOCV):
– The special case V = n.
– Works well for stable classifiers (k-NN, LDA, SVM).

66
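A sketch of V-fold CV and LOOCV error estimation for a fixed classifier, on synthetic data (LOOCV is the V = n special case).

```python
# V-fold cross-validation and leave-one-out error estimation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(0, 1, (25, 8)), rng.normal(1, 1, (25, 8))])
y = np.repeat([0, 1], 25)

lda = LinearDiscriminantAnalysis()
cv10 = cross_val_score(lda, X, y, cv=10)             # V = 10
loo = cross_val_score(lda, X, y, cv=LeaveOneOut())   # V = n
print(1 - cv10.mean(), 1 - loo.mean())               # estimated error rates
```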
Diagram of performance assessment (II)
[Diagram: as in (I), with a cross-validation loop added: the learning set is repeatedly split into a (CV) learning set, used to build the classifier, and a (CV) test set, used to estimate the error. Resubstitution, cross-validation, and independent test set estimates all feed into performance assessment.]

67
Examples

69
Case studies
[Diagram: a learning set (bad vs. good prognosis) feeding a classification rule via feature selection.]
• Reference 1 (retrospective study): L van 't Veer et al, Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002.
– Feature selection: correlation with class labels, very similar to a t-test.
– Cross-validation used to select 70 genes.
• Reference 2 (cohort study): M van de Vijver et al, A gene expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, Dec 2002.
– 295 samples selected from the Netherlands Cancer Institute tissue bank (1984 – 1995).
– Result: the gene expression profile is a more powerful predictor than standard systems based on clinical and histologic criteria.
• Reference 3 (prospective/clinical trials): Agendia (formed by researchers from the Netherlands Cancer Institute), started in Oct 2003.
– 1) 5000 subjects [Health Council of the Netherlands]; 2) 5000 subjects, New York based Avon Foundation, Aug 2003.
– Custom arrays made by Agilent, including the 70 genes + 1000 controls. http://www.agendia.com/

70
Van 't Veer breast cancer study
Investigate whether a tumor's ability to metastasize is acquired later in development or is inherent in the initial gene expression signature.
• Retrospective sampling of node-negative women: 44 non-recurrences within 5 years of surgery and 34 recurrences. Additionally, 19 test samples (12 recurrences and 7 non-recurrences).
• Want to demonstrate that the gene expression profile is significantly associated with recurrence, independent of the other clinical variables.
Nature, 2002
71
Predictor development
• Identify the set of genes with correlation > 0.3 with the binary outcome. Show that there is significant enrichment for such genes in the dataset.
• Rank-order the genes on the basis of their correlation.
• Optimize the number of genes in the classifier by using leave-one-out CV.
Classification is made on the basis of the correlations of the expression profile of the left-out sample with the mean expression of the remaining samples from the good and bad prognosis patients, respectively.
N.B.: The correct way to select genes is within rather than outside cross-validation, resulting in a different set of markers for each CV iteration.
N.B.: Optimizing the number of variables and other parameters should be done via two-level (nested) cross-validation if results are to be assessed on the training set.
The classification indicator is included in a logistic model along with the other clinical variables. It is shown that the gene expression profile has the strongest effect. Note that some of this may be due to overfitting for the threshold parameter.

72
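A sketch of the first N.B. on synthetic, pure-noise data: selecting genes on the full dataset and then cross-validating is optimistically biased, while putting the selection inside each CV fold (here via a scikit-learn Pipeline) returns accuracy near chance, as it should.

```python
# Gene selection must happen inside the cross-validation loop.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
X = rng.normal(size=(60, 2000))   # pure noise: no real class signal
y = np.repeat([0, 1], 30)

# Wrong: select 70 genes on all data, then cross-validate -> optimistic.
X_sel = SelectKBest(f_classif, k=70).fit_transform(X, y)
print(cross_val_score(LinearDiscriminantAnalysis(), X_sel, y, cv=5).mean())

# Right: selection refit within each fold -> accuracy near chance (0.5).
pipe = Pipeline([("select", SelectKBest(f_classif, k=70)),
                 ("lda", LinearDiscriminantAnalysis())])
print(cross_val_score(pipe, X, y, cv=5).mean())
```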
Van ‘t Veer, et al., 2002
73
van de Vijver's breast data (NEJM, 2002)
• 295 additional breast cancer patients, a mix of node-negative and node-positive samples.
• Want to use the previously developed predictor to identify patients at risk of metastasis.
• The predicted class was significantly associated with time to recurrence in the multivariate Cox proportional hazards model.
74
Acknowledgments
• Many of the slides in these course notes are based on web materials made available by their authors.
• I wish to specially thank:
– Yee Hwa Yang (UCSF),
– Ben Bolstad, Sandrine Dudoit & Terry Speed, U.C. Berkeley,
– The Bioconductor Project,
– The "Estadística i Bioinformàtica" research group at the University of Barcelona.
76
