
Microarray Data Analysis

Class discovery and Class prediction:


Clustering and Discrimination

1
Gene expression profiles
• Many genes show definite changes of expression between conditions.
• These patterns are called gene profiles.
2
Motivation (1): The problem of finding patterns
• It is common to have hybridizations where conditions reflect temporal or spatial aspects.
– Yeast cell-cycle data
– Tumor data evolution after chemotherapy
– CNS data in different parts of the brain
• Interesting genes may be those showing patterns associated with changes.
• Our problem is to distinguish interesting or real patterns from meaningless variation, at the level of the gene.
3
Finding patterns: Two approaches
• If patterns already exist → Profile comparison (distance analysis)
– Find the genes whose expression fits specific, predefined patterns.
– Find the genes whose expression follows the pattern of a predefined gene or set of genes.
• If we wish to discover new patterns → Cluster analysis (class discovery)
– Carry out some kind of exploratory analysis to see what expression patterns emerge.
4
Motivation (2): Tumor classification
• A reliable and precise classification of tumours is essential for successful diagnosis and treatment of cancer.
• Current methods for classifying human malignancies rely on a variety of morphological, clinical, and molecular variables.
• In spite of recent progress, there are still uncertainties in diagnosis. Also, it is likely that the existing classes are heterogeneous.
• DNA microarrays may be used to characterize the molecular variations among tumours by monitoring gene expression on a genomic scale. This may lead to a more reliable classification of tumours.

5
Tumor classification, cont.
There are three main types of statistical problems associated with tumor classification:
1. The identification of new/unknown tumor classes using gene expression profiles (cluster analysis);
2. The classification of malignancies into known classes (discriminant analysis);
3. The identification of “marker” genes that characterize the different tumor classes (variable selection).

6
Cluster and Discriminant analysis
• These techniques group, or equivalently classify, observational units on the basis of measurements.
• They differ according to their aims, which in turn depend on the availability of a pre-existing basis for the grouping.
– In cluster analysis (unsupervised learning, class discovery), there are no predefined groups or labels for the observations.
– Discriminant analysis (supervised learning, class prediction) is based on the existence of groups (labels).
7
Clustering microarray data
• Clustering can be applied to genes (rows), mRNA samples (columns), or both at once.
– Cluster samples to identify new cell or tumour subtypes.
– Cluster rows (genes) to identify groups of co-regulated genes.
– We can also cluster genes to reduce redundancy, e.g. for variable selection in predictive models.
8
Advantages of clustering
• Clustering leads to readily interpretable figures.
• Clustering strengthens the signal when averages are taken within clusters of genes (Eisen).
• Clustering can be helpful for identifying patterns in time or space.
• Clustering is useful, perhaps essential, when seeking new subclasses of cell samples (tumors, etc.).

9
Applications of clustering (1)
• Alizadeh et al (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
– Three subtypes of lymphoma (FL, CLL and DLBCL) have different genetic signatures (81 cases total).
– The DLBCL group can be partitioned into two subgroups with significantly different survival (39 DLBCL cases).

10
Clusters on both genes and arrays
[Figure: two-way clustering of both genes and arrays. Taken from Alizadeh A et al, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, February 2000.]

11
Discovering tumor subclasses
• DLBCL is clinically heterogeneous.
• Specimens were clustered based on their expression profiles of GC B-cell associated genes.
• Two subgroups were discovered:
– GC B-like DLBCL
– Activated B-like DLBCL
12
Applications of clustering (2)
• A naïve but nevertheless important application is assessment of experimental design.
• Suppose one has an experiment with different experimental conditions, and in each of them there are biological and technical replicates…
• We would expect the more homogeneous groups to cluster together, with dissimilarity ordered as:
– technical replicates < biological replicates < different groups.
• Failure to cluster this way suggests bias due to experimental conditions rather than to real differences between groups.
13
Basic principles of clustering
Aim: to group observations that are “similar” based on predefined criteria.
Issues:
• Which genes / arrays to use?
• Which similarity or dissimilarity measure?
• Which clustering algorithm?
It is advisable to reduce the number of genes from the full set to some more manageable number before clustering. The basis for this reduction is usually quite context specific; see the later example.

14
Two main classes of measures of dissimilarity
• Correlation
• Distance
– Manhattan
– Euclidean
– Mahalanobis distance
– Many more…

15
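A small Python sketch (not part of the original slides; synthetic data) contrasting the two classes of measures: a rescaled, shifted copy of a gene profile is far from the original in Euclidean distance but has near-zero correlation dissimilarity, because the profile shape is unchanged.

```python
# Toy comparison of distance- vs correlation-based dissimilarity.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))          # 5 hypothetical genes x 12 arrays

d_euclid = squareform(pdist(X, metric="euclidean"))
d_manhat = squareform(pdist(X, metric="cityblock"))     # Manhattan
d_corr   = squareform(pdist(X, metric="correlation"))   # 1 - Pearson r

# A scaled/shifted copy of gene 0: same profile shape, different level.
X2 = np.vstack([X[0], 3 * X[0] + 10])
print(pdist(X2, metric="euclidean"))    # large
print(pdist(X2, metric="correlation"))  # ~0
```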
Two basic types of methods
• Partitioning
• Hierarchical

16
Partitioning methods
• Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups.
• Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize the within-cluster sums of squares.
• Examples:
– k-means, self-organizing maps (SOM), PAM, etc.;
– Fuzzy: needs a stochastic model, e.g. Gaussian mixtures.
17
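A minimal sketch of a partitioning method (not from the slides; synthetic data): scikit-learn's k-means with a pre-specified k, iteratively reallocating genes to minimize the within-cluster sum of squares.

```python
# k-means partitioning of genes into a pre-specified number of clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic expression matrix: 60 genes x 10 arrays, three profile groups.
X = np.vstack([rng.normal(loc=m, size=(20, 10)) for m in (-2, 0, 2)])

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
print(km.labels_[:10])   # cluster assignment of the first genes
print(km.inertia_)       # the within-cluster sum-of-squares criterion
```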
Hierarchical methods
• Hierarchical clustering methods produce a tree or dendrogram.
• They avoid specifying how many clusters are appropriate by providing a partition for each k, obtained by cutting the tree at some level.
• The tree can be built in two distinct ways:
– bottom-up: agglomerative clustering;
– top-down: divisive clustering.

18
Agglomerative methods
• Start with n clusters.
• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity, which reflects the shape of the clusters.
• Between-cluster dissimilarity measures:
– Mean-link: average of pairwise dissimilarities.
– Single-link: minimum of pairwise dissimilarities.
– Complete-link: maximum of pairwise dissimilarities.
– Distance between centroids.
19
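A hedged sketch (synthetic data, not the slides' example) of agglomerative clustering with the linkage choices listed above, using scipy; the tree is cut to give k = 2 clusters.

```python
# Agglomerative clustering under different between-cluster dissimilarities.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, size=(15, 8)) for m in (0, 4)])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # n-1 merge steps
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut tree at k = 2
    print(method, np.bincount(labels)[1:])           # cluster sizes
```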
[Figure: illustrations of the four between-cluster dissimilarities: distance between centroids, single-link, complete-link, mean-link.]

20
Divisive methods
• Start with only one cluster.
• At each step, split one cluster into two parts.
• Split to give the greatest distance between the two new clusters.
• Advantages:
– Obtains the main structure of the data, i.e. focuses on the upper levels of the dendrogram.
• Disadvantages:
– Computational difficulties when considering all possible divisions into two groups.
21
Illustration of agglomerative clustering
[Figure: five points in two-dimensional space and the corresponding dendrogram with leaves 1 5 2 3 4. The merge order is {1,5}, then {3,4}, then {1,2,5}, and finally {1,2,3,4,5}.]

22
Tree re-ordering?
[Figure: the same agglomerative dendrogram drawn with two different orderings of its leaves (2 1 5 3 4 vs. 1 5 2 3 4); the merge structure is identical in both, so the leaf order is not unique.]

23
Partitioning or Hierarchical?
• Partitioning:
– Advantages:
• Optimal for certain criteria.
• Genes automatically assigned to clusters.
– Disadvantages:
• Need initial k.
• Often require long computation times.
• All genes are forced into a cluster.
• Hard to define clusters.
• Hierarchical:
– Advantages:
• Faster computation.
• Visual.
– Disadvantages:
• Unrelated genes are eventually joined.
• Rigid: cannot correct later for erroneous decisions made earlier.

24
Hybrid methods
• Mix elements of partitioning and hierarchical methods:
– Bagging: Dudoit & Fridlyand (2002)
– HOPACH: van der Laan & Pollard (2001)

25
Three generic clustering problems
Three important tasks (which are generic) are:
1. Estimating the number of clusters;
2. Assigning each observation to a cluster;
3. Assessing the strength/confidence of cluster assignments for individual observations.
These are not equally important in every problem.

26
Estimating the number of clusters using the silhouette
• Define the silhouette width of an observation as:
S = (b − a) / max(a, b)
where a is its average dissimilarity to all other points in its own cluster and b is the smallest average dissimilarity to the points of any other cluster.
• Intuitively, objects with large S are well clustered, while those with small S tend to lie between clusters.
• How many clusters: perform the clustering for a sequence of values of the number of clusters k and choose the k with the largest average silhouette width.
• The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.
27
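A sketch of this recipe on toy data (not from the slides): cluster for a range of k and keep the k with the largest average silhouette width, via scikit-learn's silhouette_score.

```python
# Choose k by maximizing the average silhouette width.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(25, 6)) for m in (0, 3, 6)])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(scores, "-> choose k =", best_k)   # expect k = 3 here
```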
Estimating the number of clusters using the bootstrap
There are other resampling-based (e.g. Dudoit and Fridlyand, 2002) and non-resampling-based rules for estimating the number of clusters (for a review see Milligan and Cooper (1985) and Dudoit and Fridlyand (2002)).
The bottom line is that none work very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework.
It is always reassuring when you are able to characterize newly discovered clusters using information that was not used for the clustering.
28
Limitations
Cluster analyses:
• Usually fall outside the normal framework of statistical inference;
• Are less appropriate when only a few genes are likely to change;
• Need lots of experiments;
• Will always produce clusters, even if there is nothing going on;
• Are useful for learning about the data, but do not provide biological truth.

29
Discrimination
or Class prediction
or Supervised Learning

30
Motivation: A study of gene expression on breast tumours
(cDNA microarrays, 6526 genes per tumor; NHGRI, J. Trent)
• How similar are the gene expression profiles of BRCA1 (+), BRCA2 (+), and sporadic breast cancer patient biopsies?
• Can we identify a set of genes that distinguishes the different tumor types?
• Tumors studied:
– 7 BRCA1 +
– 8 BRCA2 +
– 7 Sporadic

31
Discrimination
• A predictor or classifier for K tumor classes partitions the space X of gene expression profiles into K disjoint subsets, A1, ..., AK, such that for a sample with expression profile x = (x1, ..., xp) ∈ Ak the predicted class is k.
• Predictors are built from past experience, i.e., from observations which are known to belong to certain classes. Such observations comprise the learning set
L = (x1, y1), ..., (xn, yn).
• A classifier built from a learning set L is denoted by
C(·, L): X → {1, 2, ..., K},
with the predicted class for observation x being C(x, L).

32
Discrimination and Allocation
[Diagram: a learning set (data with known classes) is fed to a classification technique to produce a classification rule; the rule is then applied to data with unknown classes to make a class assignment (prediction).]

33
Learning set
[Diagram: arrays (objects) with predefined classes given by clinical outcome: bad prognosis (recurrence < 5 yrs) vs. good prognosis (recurrence > 5 yrs); gene expression values are the feature vectors; a classification rule is built and then applied to a new array with unknown outcome ("?").]
Reference: L van 't Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002.

34
Learning set
[Diagram: arrays (objects) with predefined classes given by tumor type: B-ALL, T-ALL, AML; gene expression values are the feature vectors; a classification rule is built and then applied to a new array of unknown type ("?").]
Reference: Golub et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.

35
Components of class prediction
• Choose a method of class prediction
– LDA, KNN, CART, ....
• Select the genes on which the prediction will be based: feature selection
– Which genes will be included in the model?
• Validate the model
– Use data that have not been used to fit the predictor.

36
Prediction methods

37
Choose prediction model
• Prediction methods:
– Fisher linear discriminant analysis (FLDA) and its variants (DLDA, Golub’s gene voting, compound covariate predictor…)
– Nearest neighbor
– Classification trees
– Support vector machines (SVMs)
– Neural networks
– And many more…
38
Fisher linear discriminant analysis
First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA) consists of
i. finding linear combinations x′a of the gene expression profiles x = (x1, ..., xp) with large ratios of between-groups to within-groups sums of squares (the discriminant variables);
ii. predicting the class of an observation x by the class whose mean vector is closest to x in terms of the discriminant variables.

39
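A minimal sketch, assuming synthetic data: scikit-learn's LinearDiscriminantAnalysis stands in for FLDA here (for two classes its discriminant direction coincides with Fisher's), and prediction is by the class whose mean is closest in the discriminant space.

```python
# LDA as a stand-in for FLDA on two synthetic classes.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
# Two classes of "samples" with p = 5 hypothetical gene features.
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(1.5, 1, (20, 5))])
y = np.repeat([0, 1], 20)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_)           # the linear combination a (up to scale)
print(lda.predict(X[:3]))  # class of the nearest mean in discriminant space
```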
FLDA

40
Classification with SVMs
Generalization of the idea of separating hyperplanes in the original space: linear boundaries between classes in a higher-dimensional feature space lead to non-linear boundaries in the original space.

44
Adapted from internet
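A minimal sketch (synthetic two-dimensional data, not from the slide): an RBF-kernel SVM learns a boundary that is linear in the implicit feature space but non-linear, here circular, in the original space.

```python
# Non-linear (circular) class boundary captured by a kernel SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular boundary

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.score(X, y))   # near 1.0: the kernel captures the circle
```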
Nearest neighbor classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation x as follows:
– find the k observations in the learning set closest to x;
– predict the class of x by majority vote, i.e., choose the class that is most common among those k observations.
• The number of neighbors k can be chosen by cross-validation (more on this later).

45
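A sketch of the k-NN rule with k chosen by cross-validation, on synthetic data standing in for expression profiles.

```python
# k-nearest-neighbor classification; pick k by 5-fold cross-validation.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(1, 1, (30, 10))])
y = np.repeat([0, 1], 30)

for k in (1, 3, 5, 7):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, acc.mean())   # choose the k with the best CV accuracy
```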
Nearest neighbor rule

46
Classification tree
• Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself.
• Each terminal subset is assigned a class label, and the resulting partition of X corresponds to the classifier.

47
Classification trees
[Example tree:
Node 1 (Class 1: 10, Class 2: 10), split on Gene 1 (Mi1 < 1.4):
- yes: Node 2 (Class 1: 6, Class 2: 9), split on Gene 2 (Mi2 > -0.5):
  - Node 4 (Class 1: 0, Class 2: 4), prediction: 2
  - Node 5 (Class 1: 6, Class 2: 5), split on Gene 3 (Mi2 > 2.1):
    - Node 6 (Class 1: 1, Class 2: 5), prediction: 2
    - Node 7 (Class 1: 5, Class 2: 0), prediction: 1
- no: Node 3 (Class 1: 4, Class 2: 1), prediction: 1]
48
Three aspects of tree construction
• Split selection rule:
– Example: at each node, choose the split maximizing the decrease in impurity (e.g. Gini index, entropy, misclassification error).
• Split-stopping: the decision to declare a node terminal or to continue splitting.
– Example: grow a large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with the lowest misclassification rate.
• The assignment of each terminal node to a class:
– Example: for each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node.
Supplementary slide
49
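A sketch of the grow-then-prune recipe above on synthetic data, using scikit-learn's cost-complexity pruning: fit a large tree, compute the pruning path, and pick the pruning strength (and hence the subtree) by cross-validation.

```python
# Grow a large tree, then select a pruned subtree by cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

path = DecisionTreeClassifier(random_state=7).cost_complexity_pruning_path(X, y)
best = max(
    path.ccp_alphas,  # candidate pruning strengths, weakest to strongest
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=7), X, y, cv=5
    ).mean(),
)
tree = DecisionTreeClassifier(ccp_alpha=best, random_state=7).fit(X, y)
print(best, tree.get_n_leaves())
```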
Other classifiers include…
• Support vector machines
• Neural networks
• Bayesian regression methods
• Projection pursuit
• ....

50
Aggregating predictors
• Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set.
• In classification, the multiple versions of the predictor are aggregated by voting.

51
Another component in classification rules: aggregating classifiers
[Diagram: from the training set X1, X2, … X100, draw resamples 1 through 500; build one classifier per resample; combine them into an aggregate classifier. Examples: bagging, boosting, random forests.]

54
Aggregating classifiers: bagging
[Diagram: from the training set (arrays) X1, X2, … X100, draw bootstrap resamples X*1, X*2, … X*100; grow trees 1 through 500, one per resample; each tree votes on the test sample (e.g. 90% class 1, 10% class 2), and the majority vote gives the predicted class.]

55
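A sketch of bagging as drawn above, on synthetic data: 500 bootstrap resamples, one tree fitted to each, aggregated by majority vote (scikit-learn's BaggingClassifier).

```python
# Bagging: bootstrap resamples + one tree each + majority vote.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 20))
y = (X[:, :3].sum(axis=1) > 0).astype(int)

single = DecisionTreeClassifier(random_state=8)
bagged = BaggingClassifier(single, n_estimators=500, random_state=8)
print(cross_val_score(single, X, y, cv=5).mean())
print(cross_val_score(bagged, X, y, cv=5).mean())  # usually higher
```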
Feature selection

56
Feature selection
• A classification rule must be based on a set of variables which contribute useful information for distinguishing the classes.
• This set will usually be small, because most variables are likely to be uninformative.
• Some classifiers (like CART) perform automatic feature selection, whereas others, like LDA or KNN, do not.
57
Approaches to feature selection
• Filter methods perform explicit feature selection prior to building the classifier.
– One gene at a time: select features based on the value of a univariate test.
– The number of genes or the test p-value are the parameters of the FS method.
• Wrapper methods perform FS implicitly, as part of building the classifier.
– In classification trees, features are selected at each step based on the reduction in impurity.
– The number of features is determined by pruning the tree using cross-validation.

58
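A sketch of the filter approach on synthetic data: one-gene-at-a-time F-tests (scikit-learn's f_classif), keeping the top k = 10 genes. In this toy setup only the first 10 genes carry signal, and the filter mostly recovers them.

```python
# Filter feature selection: univariate F-test per gene, keep the top k.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(9)
X = rng.normal(size=(40, 1000))   # 40 arrays x 1000 genes
y = np.repeat([0, 1], 20)
X[y == 1, :10] += 1.5             # only 10 genes are truly informative

selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print(np.sort(selector.get_support(indices=True)))  # mostly genes 0..9
```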
Why select features?
• Leads to better classification performance by removing variables that are noise with respect to the outcome.
• May provide useful insights into the etiology of a disease.
• Can eventually lead to diagnostic tests (e.g., a “breast cancer chip”).

59
59
Why select features?

No feature Top 100


selection feature selection
Selection based on variance
Correlation plot
-1 +1 Data: Leukemia, 3 class
60
Performance assessment

61
Performance assessment
• Before using a classifier for prediction or prognosis, one needs a measure of its accuracy.
• The accuracy of a predictor is usually measured by its misclassification rate: the percentage of individuals belonging to a class that are erroneously assigned to another class by the predictor.
• An important problem arises here:
– We are not interested in the ability of the predictor to classify current samples;
– One needs to estimate future performance based on what is available.

62
Estimating the error rate
• Using the same dataset on which we have built the predictor to estimate the misclassification rate may lead to erroneously low values due to overfitting.
– This is known as the resubstitution estimator.
• We should use a completely independent dataset to evaluate the classifier, but one is rarely available.
• We use alternative approaches such as:
– Test set estimation
– Cross-validation

63
Performance assessment (I)
• Resubstitution estimation: compute the error rate on the learning set.
– Problem: downward bias.
• Test set estimation: proceeds in two steps
1. Divide the learning set into two sub-sets, L and T;
2. Build the classifier on L and compute the error rate on T.
– This approach is not free from problems:
• L and T must be independent and identically distributed.
• Problem: reduced effective sample size.
64
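A sketch of resubstitution vs. test-set estimation on synthetic data: a 1-nearest-neighbor classifier has resubstitution error exactly zero, while the held-out set T gives an honest estimate.

```python
# Resubstitution error (downward biased) vs. test-set error.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)
X = rng.normal(size=(100, 50))
y = (X[:, 0] > 0).astype(int)

L_X, T_X, L_y, T_y = train_test_split(X, y, test_size=0.3, random_state=12)
clf = KNeighborsClassifier(n_neighbors=1).fit(L_X, L_y)
print("resubstitution error:", 1 - clf.score(L_X, L_y))  # 0 for 1-NN
print("test set error:", 1 - clf.score(T_X, T_y))        # honest estimate
```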
Diagram of performance assessment (I)
[Diagram: the training set yields a classifier, which is evaluated either on the training set itself (resubstitution estimation) or on an independent test set (test set estimation); both estimates feed into performance assessment.]

65
Performance assessment (II)
• V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Build each classifier leaving one subset out, compute the error rate on the left-out subset, and average the V error rates.
– Bias-variance tradeoff: smaller V can give larger bias but smaller variance.
– Computationally intensive.
• Leave-one-out cross-validation (LOOCV):
– The special case V = n.
– Works well for stable classifiers (k-NN, LDA, SVM).

66
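A sketch of V-fold CV and LOOCV error estimation for a fixed classifier, on synthetic data (LOOCV is the V = n special case).

```python
# V-fold cross-validation and leave-one-out error estimation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score, LeaveOneOut

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(0, 1, (25, 8)), rng.normal(1, 1, (25, 8))])
y = np.repeat([0, 1], 25)

lda = LinearDiscriminantAnalysis()
cv10 = cross_val_score(lda, X, y, cv=10)             # V = 10
loo = cross_val_score(lda, X, y, cv=LeaveOneOut())   # V = n
print(1 - cv10.mean(), 1 - loo.mean())               # estimated error rates
```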
Diagram of performance assessment (II)
[Diagram: as in (I), with a cross-validation loop added: the learning set is repeatedly split into a (CV) learning set, used to build the classifier, and a (CV) test set, used to estimate the error. Resubstitution, cross-validation, and independent test set estimates all feed into performance assessment.]

67
Examples

69
Case studies
[Diagram: a learning set (bad vs. good prognosis) feeding a classification rule via feature selection.]
• Reference 1 (retrospective study): L van 't Veer et al, Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002.
– Feature selection: correlation with class labels, very similar to a t-test.
– Cross-validation used to select 70 genes.
• Reference 2 (cohort study): M van de Vijver et al, A gene expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, Dec 2002.
– 295 samples selected from the Netherlands Cancer Institute tissue bank (1984 – 1995).
– Result: the gene expression profile is a more powerful predictor than standard systems based on clinical and histologic criteria.
• Reference 3 (prospective/clinical trials): Agendia (formed by researchers from the Netherlands Cancer Institute), started in Oct 2003.
– 1) 5000 subjects [Health Council of the Netherlands]; 2) 5000 subjects, New York based Avon Foundation, Aug 2003.
– Custom arrays made by Agilent, including the 70 genes + 1000 controls. http://www.agendia.com/

70
Van 't Veer breast cancer study
Investigate whether a tumor's ability to metastasize is acquired later in development or is inherent in the initial gene expression signature.
• Retrospective sampling of node-negative women: 44 non-recurrences within 5 years of surgery and 34 recurrences. Additionally, 19 test samples (12 recurrences and 7 non-recurrences).
• Want to demonstrate that the gene expression profile is significantly associated with recurrence, independent of the other clinical variables.
Nature, 2002
71
Predictor development
• Identify the set of genes with correlation > 0.3 with the binary outcome. Show that there is significant enrichment for such genes in the dataset.
• Rank-order the genes on the basis of their correlation.
• Optimize the number of genes in the classifier by using leave-one-out CV.
Classification is made on the basis of the correlations of the expression profile of the left-out sample with the mean expression of the remaining samples from the good and bad prognosis patients, respectively.
N.B.: The correct way to select genes is within rather than outside cross-validation, resulting in a different set of markers for each CV iteration.
N.B.: Optimizing the number of variables and other parameters should be done via two-level (nested) cross-validation if results are to be assessed on the training set.
The classification indicator is included in a logistic model along with the other clinical variables. It is shown that the gene expression profile has the strongest effect. Note that some of this may be due to overfitting for the threshold parameter.

72
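A sketch of the first N.B. on synthetic, pure-noise data: selecting genes on the full dataset and then cross-validating is optimistically biased, while putting the selection inside each CV fold (here via a scikit-learn Pipeline) returns accuracy near chance, as it should.

```python
# Gene selection must happen inside the cross-validation loop.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(11)
X = rng.normal(size=(60, 2000))   # pure noise: no real class signal
y = np.repeat([0, 1], 30)

# Wrong: select 70 genes on all data, then cross-validate -> optimistic.
X_sel = SelectKBest(f_classif, k=70).fit_transform(X, y)
print(cross_val_score(LinearDiscriminantAnalysis(), X_sel, y, cv=5).mean())

# Right: selection refit within each fold -> accuracy near chance (0.5).
pipe = Pipeline([("select", SelectKBest(f_classif, k=70)),
                 ("lda", LinearDiscriminantAnalysis())])
print(cross_val_score(pipe, X, y, cv=5).mean())
```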
Van ‘t Veer, et al., 2002
73
van de Vijver's breast data (NEJM, 2002)
• 295 additional breast cancer patients, a mix of node-negative and node-positive samples.
• Want to use the previously developed predictor to identify patients at risk of metastasis.
• The predicted class was significantly associated with time to recurrence in the multivariate Cox proportional hazards model.
74
Acknowledgments
• Many of the slides in these course notes are based on web materials made available by their authors.
• I wish to specially thank:
– Yee Hwa Yang (UCSF),
– Ben Bolstad, Sandrine Dudoit & Terry Speed, U.C. Berkeley,
– The Bioconductor Project,
– The "Estadística i Bioinformàtica" research group at the University of Barcelona.
76
