Comparative Analysis of Clustering Methods

Master's Dissertation (Dissertação de Mestrado)
Contents

1 Introduction
2.1.1 DNA
2.2.3 SAGE
3 Cluster Analysis
3.1.1 CLICK
3.1.3 k-means
4.1.1 Cross-validation
4.3 Experiments
5 Results
5.1.1 Experiments
5.1.2 Discussions
5.2.1 Experiments
5.2.2 Discussions
6 Conclusions
A Parametrisation of SOM
List of Figures

3.2 Example of two cuts in a dendrogram with nine objects. The two dashed lines represent respectively cuts with three and four clusters.
5.1 Mean of corrected Rand values from the FC Yeast All experiments
5.2 Mean of corrected Rand values from the Reduced FC Yeast All experiments
5.4 Mean of corrected Rand values from the Series CDC 25 experiments
5.5 Mean of corrected Rand values from the FC Yeast All experiments
5.6 Mean of corrected Rand values from the Reduced FC Yeast All experiments
5.8 Mean of corrected Rand values from the Series CDC 25 experiments
List of Tables
4.2 MYGD classes from the FC scheme with their respective number of genes.
4.3 MYGD classes from the REDUCED FC scheme with their respective number of genes.
5.1 Proximity indices with best accuracy in the experiments of Section 5.1 for a given clustering method and data set.
A.3 Type of parametrisation and topologies with best accuracy in the experiments with SOM.
B.1 Detailed results of the SOM method in the experiments with the FC Yeast All data set.
B.4 Detailed results of the dynamical clustering method with the hierarchical initialisation in the experiments with the FC Yeast All data set.
B.5 Detailed results of the k-means method in the experiments with the FC Yeast All data set.
B.6 Detailed results of the k-means method with the hierarchical initialisation in the experiments with the FC Yeast All data set.
B.7 Detailed results of the SOM method in the experiments with the Reduced FC Yeast All data set.
B.10 Detailed results of the dynamical clustering method with the hierarchical initialisation in the experiments with the Reduced FC Yeast All data set.
B.11 Detailed results of the k-means method in the experiments with the Reduced FC Yeast All data set.
B.12 Detailed results of the k-means method with the hierarchical initialisation in the experiments with the Reduced FC Yeast All data set.
B.13 Detailed results of the SOM method in the experiments with the FC CDC 25 data set.
B.17 Detailed results of the k-means method in the experiments with the FC CDC 25 data set.
B.18 Detailed results of the k-means method with the hierarchical initialisation in the experiments with the FC CDC 25 data set.
B.19 Detailed results of the CLICK method in the experiments with the FC CDC 25 data set.
B.20 Detailed results of the SOM method in the experiments with the Series CDC 25 data set.
B.23 Detailed results of the dynamical clustering method with the hierarchical initialisation in the experiments with the Series CDC 25 data set.
B.24 Detailed results of the k-means method in the experiments with the Series CDC 25 data set.
B.25 Detailed results of the k-means method with the hierarchical initialisation in the experiments with the Series CDC 25 data set.
B.26 Detailed results of the CLICK method in the experiments with the Series CDC 25 data set.
Acknowledgments
Abstract
Large scale approaches, namely proteomics and transcriptomics, will play the most important role in the so-called post-genomic era. These approaches allow experiments to measure the expression of thousands of genes from a cell at distinct time points. The analysis of such data can lead to the understanding of gene function and gene regulatory networks (Eisen et al., 1998).

There has been a great deal of work on the computational analysis of gene expression time series, in which distinct data sets of gene expression, clustering techniques and proximity indices are used. However, the focus of most of these works is on biological results. Cluster validation has been applied in only a few works, and there the emphasis was on the evaluation of the proposed validation methodologies (Azuaje, 2002; Lubovac et al., 2001; Yeung et al., 2001; Zhu & Zhang, 2000). As a result, there are few guidelines obtained by validity studies on which clustering methods or proximity indices are more suitable for the analysis of data from gene expression time series.

Thus, this work performs a data-driven comparative study of clustering methods and proximity indices used in the analysis of gene expression time series (or time courses). Five clustering methods encountered in the literature of gene expression analysis are compared: agglomerative hierarchical clustering, CLICK, dynamical clustering, k-means and self-organizing maps. In terms of proximity indices, versions of three indices are analysed: Euclidean distance, angular separation and Pearson correlation. In order to evaluate the methods, a k-fold cross-validation procedure adapted to unsupervised methods is applied. The accuracy of the results is assessed by comparing the partitions obtained in these experiments with gene annotation, such as protein function and series classification.
Resumo (Abstract in Portuguese)
Chapter 1
Introduction
Now that the genomes of several species have been or are about to be completely sequenced, researchers are looking towards the next step: the understanding of gene function and gene regulatory networks. Of the roughly 30,000-40,000 genes in the human genome sequence, the function of an estimated two thirds is likely to be unknown (Abbot, 1999). In terms of regulatory mechanisms, the knowledge is even scarcer. Large scale approaches, namely proteomics and transcriptomics, will play the most important role in the so-called post-genomic era, as they provide biologists with information about which genes are turned on, and under which conditions (Abbot, 1999).
While proteins may yield the most important clues to cellular function, proteins are also the most difficult of the cell's components to detect on a large scale. This is not the case of ribonucleic acid (RNA), which is measured by transcriptomics approaches (D'Haeseleer et al., 1999). When a gene is expressed in a cell, its code is first transcribed to an intermediary messenger RNA (mRNA), which is then translated into a protein. The mRNA levels give a snapshot of the genome's plans for protein synthesis under the cellular conditions at that moment. Transcriptomics has the advantage over proteomics that the technology is simple and lends itself readily to automation and high throughput (D'Haeseleer et al., 1999). But transcriptomics has the disadvantage that, although the expression levels it provides reflect the genome's plans for protein synthesis, they do not directly represent the final protein levels (D'Haeseleer et al., 1999).
The analysis of the amount of data generated by large scale approaches makes the use of advanced statistical and computational methods, such as Machine Learning, necessary. Such methods can be used to discover trends and patterns in the underlying gene expression data (Bertone & Gerstein, 2001). The computational challenges in the analysis of gene expression are vast and remain open to further development (Quackenbush, 2001).
Among these challenges, this work addresses the problem of identifying meaningful subsets of genes by the use of clustering methods, with the objective of finding co-expressed genes (Eisen et al., 1998). This is accomplished by the analysis of data from gene expression time series (or time courses). In time series experiments, the expression of a certain cell is measured at some time points during a particular biological process. By knowing groups of genes that are expressed in a similar fashion through a biological process, it is possible to infer the function of these genes. Since these data sets consist of expression profiles of thousands of genes, this analysis cannot be carried out manually, making the application of clustering methods necessary.
One main aspect in finding co-expressed genes is the proximity (similarity or dissimilarity) index used in the clustering method. In this context, the index should give emphasis to capturing the relative magnitude proximity between two gene series: the absolute expression values of two genes can differ, but provided that the genes have a similar pattern of change through time (or similar series shape), they are considered co-expressed genes (Heyer et al., 1999).
In fact, there has been a great deal of work on gene expression analysis, each using distinct data sets of gene expression, clustering techniques and proximity indices. However, the majority of these works has emphasised the biological results, with no critical evaluation of the suitability of the proximity indices or clustering methods used. In the few works in which cluster validation was applied with gene expression data, the focus was on the evaluation of the proposed validation methodology (Azuaje, 2002; Lubovac et al., 2001; Yeung et al., 2001; Zhu & Zhang, 2000). As a consequence, so far, with the exception of (Costa et al., 2002b; Datta & Datta, 2003), there is no validity study on which proximity indices or clustering methods are more suitable for the analysis of data from gene expression time series.

Based on this, a data-driven comparative study of proximity indices and clustering methods is performed. Five clustering methods are compared: agglomerative hierarchical clustering (Eisen et al., 1998), CLICK (Sharan & Shamir, 2002), dynamical clustering (Costa et al., 2002a), k-means (Tavazoie et al., 1999) and self-organizing maps (Tamayo et al., 1999). With the exception of dynamical clustering, all other methods are popular in the literature of gene expression analysis.
All the experiments are performed with data sets of gene expression time series of the yeast Saccharomyces cerevisiae. This organism was chosen because of the wide availability of public data, as well as of an extensive functional classification of its genes. The functional classification serves as an external data source in the validation of the results.

In order to evaluate the clustering methods and proximity indices, this dissertation applies a k-fold cross-validation procedure adapted to unsupervised methods. The accuracy of the results is assessed with the corrected Rand index, which measures the agreement between the clustering results and a priori classification data, such as gene functional classification or series classification (Jain & Dubes, 1988). In order to compare the methods statistically, a bootstrap hypothesis test for equal means is applied (Efron & Tibshirani, 1993).
The remainder of this dissertation is divided into five chapters. In Chapter 2, issues related to gene expression analysis are described. The aim of this chapter is to introduce the problem approached in this dissertation, as well as to put this work in context. In Chapter 3, the clustering methods, the proximity indices and concepts of cluster validation methodology are discussed. The validation methodology and the experimental design used in this dissertation are presented in Chapter 4. Then, Chapter 5 describes and analyses the results of the experiments. Finally, Chapter 6 brings the conclusions drawn from the results, some final remarks and future work.
Chapter 2

Gene Expression Analysis
This chapter gives a description of the problem approached in this dissertation, the computational analysis of gene expression data. Section 2.1 covers the basic concepts of molecular biology necessary for understanding gene expression. Then, Section 2.2 describes experiments regarding gene expression and the technologies used to measure the data. Section 2.3 overviews the computational challenges of gene expression analysis, focusing on the analysis of time series data and cluster validation. Then, a discussion of related work in both the analysis of time series data and cluster validation closes the chapter.
Gene expression is the process, explained by the central dogma of molecular biology, by which the hereditary information contained in deoxyribonucleic acid (DNA) flows inside a cell, resulting in protein synthesis (Silva, 2001). This process can be divided into two steps: transcription, in which DNA molecules are used to build ribonucleic acid (RNA) molecules, and translation, in which the RNA molecules form proteins (see Figure 2.1). The final products of this process, the proteins, are responsible for providing the structural components of the cell and for the catalysis of biochemical reactions. Thus, it can be stated that the expression of the proteins determines the functional state of the cell (Primer on Molecular Genetics, 1992).
The central dogma of biology has been revised in recent years, since some organisms, such as viruses, do not fit the original dogma scheme (Gentrop, 1999). For the sake of simplicity, however, this section will confine itself to the original dogma, as it contains the basic concepts necessary for the understanding of gene expression.
2.1.1 DNA
The DNA molecules are responsible for storing the genetic information of the organisms. These molecules have a double helix structure, formed by a sequence of base pairs. The particular order of the bases in a given sequence represents the genetic information contained in the DNA. These bases can be one of the following: adenine (A), cytosine (C), guanine (G), and thymine (T) (Gentrop, 1999). Binding rules define the possible base pairs, which can be either A=T, T=A, C≡G or G≡C (each stroke symbolizes a hydrogen bond). As a consequence of these binding rules, the two sequences of bases that form a DNA molecule are complementary to one another (Gentrop, 1999) (see Figure 2.2). Single stranded DNA molecules, which are only encountered in special conditions, have the capacity to bind with complementary sequences, in a process called hybridization. Such a process is used as a tool in most of the large scale measurement technologies described in this chapter.
The way DNA molecules are arranged in the cells depends on the type of organism, which can be either a prokaryote or a eukaryote. The prokaryotes are the organisms without a cell nucleus, while the eukaryotes are the organisms with a nucleus. In the prokaryotes, the DNA is arranged in a single circular DNA molecule. In the eukaryotes, by contrast, several DNA molecules, called chromosomes, are present in the cell nucleus. Each of these chromosomes is formed by millions of base pairs.
Genes are the basic units responsible for carrying and passing on single characteristics. In other words, genes are DNA regions of the chromosomes that code for one or more proteins. In fact, only particular regions of the DNA sequences encountered in organisms represent the genes. In the region before the gene sequence (also called the upstream region), regulatory regions are encountered. These regions influence the rate of transcription (or the quantity of RNA produced) of the gene (Shamir et al., 2002).
2.1.2 Transcription

In the process called transcription, the region of the DNA representing a gene is copied into an RNA molecule with the help of an enzyme called RNA polymerase (Gentrop, 1999). This enzyme binds to an upstream region called the promoter sequence, which indicates where the transcription should start. Then, the enzyme slides through the DNA sequence, building the RNA molecule base by base (see Figure 2.3). Although very similar to DNA, the RNA molecules have some distinctions: (1) RNAs are only single stranded; (2) instead of thymine, RNAs have a uracil (U) base; and (3) RNAs degrade after a short time. The main function of the RNA is the synthesis of proteins in the cell. These molecules are divided into three groups according to their task in protein synthesis: ribosomal RNAs (rRNA), which form the ribosomes; transfer RNAs (tRNA), which carry amino acids to the ribosomes; and messenger RNAs (mRNA), which encode the genetic information contained in the genes (Gentrop, 1999).
The transcription process in eukaryotes is more complex. First, the RNA molecule is copied from a gene in the DNA, producing the primary transcript (RNA). Before leaving the nucleus, certain sequences of this primary RNA, called introns (non-coding regions), are removed by special enzymes, forming the mature RNA. In this process, certain exon sequences (coding regions) can also be removed, changing the final protein to be synthesised. This mechanism, called alternative splicing, plays a major role in cell differentiation, as it makes it possible for a single gene to codify more than one protein, in accordance with the cell context (Gentrop, 1999).
2.1.3 Translation

Translation is the process of forming proteins from the information contained in the RNA. Triplets of RNA bases are translated into one of the twenty amino acids, which are the building blocks of the proteins. The set of rules that maps base triplets to amino acids is called the genetic code. The translation process is coordinated by the ribosome, which "reads" the mRNA molecule sequence three bases at a time, adding the respective amino acid to the end of the synthesised protein with the help of tRNA molecules (see Figure 2.3).
Proteins, the final product of gene expression, are vital to the cell functioning, since they are responsible for providing the structural components of the cell and for the catalysis of biochemical reactions. The functional state of the cell is thus determined by the type and number of proteins available inside the cell at a certain instant (D'Haeseleer et al., 1999).
2.2 Gene Expression Experiments

Until recently, gene expression experiments measured the expression of only one or just a few genes. These experiments were based on a reductionist view, where by explaining the parts, one could get a view of the whole. With the advent of genomics, and consequently of large scale gene expression methods, it would be a huge effort to analyse such a number of genes using the traditional approach. This amount of data requires a holistic analysis, where the data is handled in a global fashion, without the need to go down to low level details such as the biochemical reactions. Nowadays, this global analysis has been carried out with the aid of statistical and computational methods (D'Haeseleer et al., 1999).

There is a number of purposes for the analysis of large scale gene expression experiments, such as the identification of co-expressed genes, the comparison of distinct tissues or individuals, the inference of gene regulatory networks and the analysis of drug response (Lubovac, 2001).
For each specific purpose, a distinct type of experiment design is necessary. There are two basic design types of experiments for gene expression. In one type, the behavior of the expression levels is observed through time; in other words, gene expression time series (or time courses) are obtained. This design is used for finding co-expressed genes and also for the inference of regulatory networks. In the other type of design, samples of gene expression of distinct tissues or individuals are obtained (condition experiments) (Eisen et al., 1998). For example, there may be interest in comparing diseased and healthy tissues. This type of design is also used for the discovery of regulatory networks, where one can be interested in comparing a normal cell to a mutated one (D'Haeseleer et al., 1999). In the analysis of drug response, a mix of both arrangements can be performed, as there is interest in comparing the time series expression of a treated individual with the time series of a non-treated individual (Dopazo et al., 2001).
These gene expression experiments were only made possible by the development of a number of techniques capable of measuring large scale gene expression. These techniques differ in some aspects, such as: the substance being measured (RNA or proteins); the process of reading the results; the way of manufacturing the artifacts; and the domain of the technology (public or private). Each of these techniques is based on the same molecular biology principle called hybridization, in which nucleic acids have the capacity of recognizing and combining with complementary sequences.

In Sections 2.2.1 to 2.2.4, measurement technologies with widespread use and relevance to this work are described. This work concentrates on RNA expression based technologies, as the use of protein based techniques is not widespread, given the lack of accuracy and reproducibility of these techniques (D'Haeseleer et al., 1999).
2.2.1 cDNA Microarrays

The cDNA microarray technology consists of small glass slides, where cDNA is deposited with the aid of robotics (Schena et al., 1995). The idea behind the functioning of cDNA microarrays is very simple (Kain, 2001). For each gene to be measured, a sequence complementary to the gene sequence is defined (these small sequences are called probes). The probes have sizes ranging from 20 to 30 bases, such that there is a low probability of the probes hybridizing with sequences other than the target sequence. The probes are replicated a high number of times (thousands of copies). Then, a robot fixes the probes on a certain spot of a glass slide. At the end, the small slide will have thousands of DNA spots, placed side by side, each spot containing thousands of copies of a cDNA probe designed to hybridize with the RNA of a certain gene.
In the next step, the RNA of the cell is separated and transcribed to cDNA, given that RNA molecules are unstable and would degrade before the experiment is over. Afterwards, the cDNA molecules are marked with red fluorescent labels (the Cy5 dye). Additionally, the RNA of a control cell is also separated and transcribed, but these molecules are marked with green fluorescent labels (the Cy3 dye). The cDNA of both cells is poured onto the slide. After some time, the slide is washed, removing the cDNA which has not hybridized with the probes. Next, the slide is scanned, resulting in an image with all the spot intensities (the whole process is illustrated in Figure 2.4). The digital image of the slide is then processed using computational methods, for the purpose of calculating the intensity obtained by each RNA. Figure 2.5 shows one segment of such an image.
The advent of microarray technologies was only possible due to two main factors (Kain, 2001). First, robotics permitted the manufacturing of the slides (also called chips) of only a few centimeters, containing around 10,000 probe spots organised side by side as a matrix. The other factor was the sequencing of organisms and the discovery of their genes, as only with this data was it possible to construct the microarray probes (Brown & Botstein, 1999).
One problem of the cDNA microarrays is that distinct measurements with material from the same cell can yield distinct results. Certain steps in the process are influenced by the environment and the way of execution, causing variability in the final results. In addition, in the image processing procedure, the lack of precision of the robots in the placement of the spots and limitations of the scanner represent additional noise in the data. This can be observed in Figure 2.5, where not all spots are uniformly placed and the signals of some neighboring spots are merged.
These variability problems are addressed in cDNA microarrays by the use of the RNA from a control cell, as described before. The idea is to use the RNA from a single control cell for all slides being measured in a certain experiment. The final expression level of a gene is calculated as the log ratio of the measured (Cy5) and control cell (Cy3) intensities:

e = log(Cy5/Cy3)    (2.1)

Furthermore, the two dyes have distinct properties, and the final spot intensities are influenced by the order of scanning. As a result, special procedures should be used in order to normalise the intensities of the red and green signals (for more details see Schuchhardt et al. (2000) and Yang et al. (2001)).
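To make the computation of Eq. 2.1 concrete, the following minimal Python sketch derives expression levels from hypothetical Cy5/Cy3 spot intensities and applies one simple normalisation option (median centring of the log ratios). The intensity values are invented, and the normalisation shown is only an illustration, not the procedures of Schuchhardt et al. (2000) or Yang et al. (2001).

import numpy as np

# Hypothetical raw spot intensities for five genes; in practice these
# come from the image processing of the scanned slide.
cy5 = np.array([1200.0, 300.0, 950.0, 80.0, 2100.0])   # measured cell
cy3 = np.array([600.0, 310.0, 400.0, 85.0, 700.0])     # control cell

# Eq. 2.1: expression level as the log ratio of the two channels.
e = np.log(cy5 / cy3)

# Illustrative normalisation (an assumption, not the cited procedures):
# centre the log ratios on zero to compensate for a global dye imbalance.
e_normalised = e - np.median(e)
print(e_normalised)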
One advantage of the cDNA technique, among others, is the high number of genes whose expression is simultaneously measured, which can reach 10,000 in a single slide (D'Haeseleer et al., 1999). In fact, there is no limitation on the number of genes, as more than one slide can be used to measure the RNA of a certain cell. The technique also offers flexibility in the experiment design, as the probes used, and consequently the genes measured, can be chosen among any genes with known sequences. The main problem is the financial cost of the microarrays, which is still very high, limiting the number of conditions measured in an experiment.
2.2.2 Oligonucleotide Arrays

The oligonucleotide array (or Gene Chip) is a proprietary technology developed by Affymetrix (Lipshutz et al., 1999). Its functioning is very similar to that of the cDNA microarray, although the technologies differ in two aspects: the manufacturing of the slides and how the variability problem is treated. The Gene Chips are constructed via an optical process, where the probes are synthesised base by base on the chip surface. The design of a chip containing a new set of probes is very expensive, but once it is done, the chip arrangement can be produced in large scale at a lower cost. In order to reduce the effect of variability, probes with the same sequence are placed in 20 to 40 spots on the chips. This improves the signal-to-noise ratio and the accuracy of the measurements.
The DNA chip also allows the measurement of a high number of genes (up to 50,000 genes per chip). On the other hand, oligonucleotide chips do not offer the same flexibility as cDNA arrays, since there is a limited number of chip designs available, each with a fixed set of probes. In the long run, this problem should be minimised, as the number of probes packed in a chip tends to get higher, and the genomes of several organisms will be fully revealed. Another advantage of this technology is that Affymetrix supplies all the equipment, in contrast to cDNA microarrays, where there are a number of choices, from where to buy the probes to the software used in the image processing (Bowtell, 1999). As a consequence, experiments with DNA chips are more standardised, making it easier to compare (and analyse) experiments carried out by distinct laboratories, which is not the case with cDNA microarrays.
2.2.3 SAGE
The serial analysis of gene expression (SAGE) technology is very distinct from the other methods described in this section, as SAGE uses sequencing technology to measure the expression (D'Haeseleer et al., 1999). Initially, the RNA is reverse transcribed to cDNA. Then, sequences of ten bases, capable of uniquely identifying the source RNA, are extracted from the cDNA molecules. Next, these small sequences are joined together in a single sequence, which is then sequenced. The expression of the genes corresponds to the quantity of repeated ten-base sequences encountered in the sequencing results.
Some advantages of this method, among others, are its higher accuracy in relation to the array technologies, and the fact that the sequence of the measured RNA does not need to be known a priori. Additionally, the process uses sequencing technology that is already available in most molecular biology laboratories. However, the whole process consumes a lot of time. When a high number of genes is measured, the process can become quite complex, as a great deal of sequencing is needed. As a result, experiments with this technology only measure the expression of hundreds of genes.
2.2.4 Real Time PCR

The real time PCR, also known as kinetic PCR, is an automation of the reverse transcription polymerase chain reaction (RT-PCR) technique. In the RT-PCR, the RNA of the desired genes is reverse transcribed (RT) to cDNA molecules (note that the RT stands for reverse transcription and not for real time). Then, the cDNA is replicated using the polymerase chain reaction (PCR) (D'Haeseleer et al., 1999). This process has to be repeated for each target gene. Finally, with the use of high resolution gels, the number of cDNA molecules is quantified. This process is not of a parallel nature, which can make it very time consuming. Furthermore, if the whole process is not very well controlled, there will be a high variability in the results (Bustin, 2002).
In the real time PCR, the amplification, detection and quantification steps are carried out automatically by special machinery. All this reduces the time and complexity of the experiments, and the technique has a high precision in measuring the gene expression. However, the experiments are still time consuming. As a consequence, only a small set of genes (hundreds of genes) can be measured.
2.3 Computational Analysis of Gene Expression

As stated before, the analysis of the amount of data generated by the approaches with large scale gene expression can only be developed with the aid of statistical and computational methods (D'Haeseleer et al., 1999). There is a number of computational challenges in the analysis of gene expression data. Among them, the following should be pointed out (Sharan & Shamir, 2002):

• Feature Selection: find a set of genes that are differentially expressed through the distinct conditions (Bo & Jonassen, 2002; Golub et al., 1999; Heyer et al., 1999).
The focus of this work is on the use of clustering methods for the analysis of time series data. In the next section, basic aspects of this type of analysis are discussed. The other concern of this work is the validation of the clustering methods; therefore, validation issues for gene expression analysis are covered in the subsequent section.
2.3.1 Analysis of Time Series Data

In a time series experiment, a cell is submitted to a particular biological process. Then, the gene expression of the cell is measured at some particular time points. The analysis of this data is focused on finding co-expressed genes; more specifically, genes that have similar patterns of expression change through time. By knowing groups of genes that are expressed in a similar fashion through a biological process, biologists are able to infer gene function and gene regulation mechanisms.

Clustering is the main technique for the analysis of gene expression time series. In such an approach, the major aspect in finding co-expressed genes is the proximity (similarity or dissimilarity) index used.
Proximity Indices
Proximity indices measure the degree of alikeness between two objects. In the context of gene expression analysis, the proximity index should give emphasis to capturing the relative magnitude proximity between two gene series. There is a biological reason for this: the absolute expression values of two genes can differ, but provided that the genes have a similar pattern of change through time (the series have similar shapes), they are considered co-expressed genes (Eisen et al., 1998).
Figure 2.6 shows the time series of three genes during six time points (0, 30, 50, 70, 100 and 120 minutes). At first sight, all the time series look very distinct, but the genes represented by the blue and green lines can be said to be co-expressed. Both genes behave in a similar fashion: their expression levels go up until the 70-minute time point, and then go down until the end of the process. Indeed, the intensity value of the gene in green is double that of the gene in blue at most of the time points. On the other hand, these two series have a distinct behavior in relation to the gene in red, whose expression decreases through the whole process.
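The following minimal sketch, with invented series values shaped like those of Figure 2.6, illustrates why the choice of index matters: the Euclidean distance judges the blue and green genes to be far apart, while the Pearson correlation identifies them as perfectly co-expressed.

import numpy as np

# Invented series at the six time points of Figure 2.6: 'green' is exactly
# double 'blue' (same shape), while 'red' decreases through the process.
blue = np.array([200.0, 400.0, 600.0, 800.0, 500.0, 300.0])
green = 2.0 * blue
red = np.array([1800.0, 1500.0, 1100.0, 800.0, 500.0, 200.0])

def euclidean(x, y):
    return np.sqrt(((x - y) ** 2).sum())

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

# Euclidean distance: blue is far from both green and red, because it
# compares absolute magnitudes.
print(euclidean(blue, green), euclidean(blue, red))
# Pearson correlation: blue and green are perfectly co-expressed (1.0),
# while blue and red are negatively correlated.
print(pearson(blue, green), pearson(blue, red))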
Pre-processing
Another important issue in clustering gene expression time series is the removal of uninformative time series. During a particular biological process, only a few genes will be active and changing their expression levels through time. The other genes can either be housekeeping genes or genes not expressed during that particular process. The former are genes that are always active, independently of the particular biological process going on, while the latter have low expression levels at all time points. These two types of genes do not need to be analysed, given that they are uninformative in relation to that particular process.
Figure 2.6: Expression time series of three genes measured at six time points (0, 30, 50, 70, 100 and 120 minutes); the vertical axis shows the gene expression level.
In fact, the removal of these genes reduces the computing time of the clustering methods. Furthermore, the removal can also enhance the accuracy of the results, as the presence of these genes adds noise to the analysis.

There are two widespread methods for dealing with such uninformative genes, both sketched in the code below. The first is the fold approach, used in Eisen et al. (1998) and Tamayo et al. (1999): only time series whose absolute expression levels change by at least n folds are considered. In the other approach, proposed by Heyer et al. (1999), genes are ranked according to their mean and variance; then, a percentage of the genes with the lowest mean and variance values is removed.
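A minimal sketch of the two filtering procedures, assuming positive expression values; the fold threshold, the fraction removed and the way the mean and variance rankings are combined are illustrative assumptions, not the exact settings of the cited works.

import numpy as np

def fold_filter(data, n_fold=2.0):
    # Keep series whose absolute expression changes by at least n folds;
    # assumes strictly positive intensities.
    ratio = data.max(axis=1) / data.min(axis=1)
    return data[ratio >= n_fold]

def mean_variance_filter(data, fraction_removed=0.25):
    # Rank genes by mean and by variance, combine the two ranks, and
    # drop the lowest-ranked fraction of the genes.
    mean_rank = np.argsort(np.argsort(data.mean(axis=1)))
    var_rank = np.argsort(np.argsort(data.var(axis=1)))
    combined = mean_rank + var_rank
    return data[combined >= np.quantile(combined, fraction_removed)]

# Hypothetical data set: 100 genes measured at 6 time points.
rng = np.random.default_rng(0)
data = rng.uniform(50.0, 2000.0, size=(100, 6))
print(fold_filter(data).shape, mean_variance_filter(data).shape)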
2.3.2 Cluster Validation

Most of the work on gene expression analysis relies only on ad-hoc observations to evaluate the results. As exceptions, there are a few studies where validation issues are approached. However, these works are focused on the evaluation of the proposed validation methodologies; they do not address the results obtained by the application of distinct clustering methods or proximity indices. As a consequence, so far, there are few guidelines obtained by validity studies on which proximity indices or clustering methods are more suitable for the analysis of data from gene expression time series.
One relevant issue in cluster validation, in the context of gene expression analysis, is the use of external biological data to validate the results. Some validity methodologies require a labelling (or classification) of the elements. One common practice is to use external sources of data related to the object of study. In validity studies of gene expression data, functional classifications of the genes are largely applied as external data (Gertein & Janssen, 2000; Lubovac, 2001; Yeung et al., 2001; Zhu & Zhang, 2000).
One advantage of functional classifications is their public availability, as in the Munich Information Center for Protein Sequences Yeast Genome Database (MYGD) (Mewes et al., 2002) and the Gene Ontology Project (The Gene Ontology Consortium, 2000), among others. There are also other types of external data used in the literature, such as regulatory regions (van Helden et al., 2001; Zhu & Zhang, 2000) and enzymatic classification (Lubovac et al., 2001).
2.4 Related Work

Eisen et al. (1998) presented one of the first applications of clustering methods for the analysis of gene expression time series. In their study, a hierarchical unweighted pair group method with average linkage (UPGMA) was used with the Pearson correlation to cluster data from seven distinct time series experiments from yeast. The results confirmed that genes with similar functions tend to cluster together. Additionally, the study proposed a graphical representation of the results, which is now widely used in the field. In this representation, the resulting dendrogram has its leaves reordered by the mean expression levels of the series, in a way that genes with similar profiles are close in the tree. Beside the ordered tree, the expression levels of the genes are represented in a colored table, where over expressed genes are shown in red and under expressed genes in green (see Figure 2.7).
A number of other clustering methods have also been applied to the analysis of gene expression time series, among them k-means (Tavazoie et al., 1999), self-organizing maps (Jonsson, 2001; Tamayo et al., 1999), dynamical clustering (Costa et al., 2002a), graph theoretical approaches (Sharan & Shamir, 2002), principal component analysis (PCA) (Raychaudhuri, 2001) and the largest first clustering algorithm (Zhu & Zhang, 2000). Most of these works are applications of distinct computational methods to similar sets of gene expression data, not proposing any new aspects in the analysis of gene expression. Thus, just some of them will be described in detail in this dissertation.
In terms of proximity indices, novel proposals have been presented for the analysis of data from gene expression time series. The jackknife Pearson correlation, proposed in Heyer et al. (1999), has the objective of handling time series with outlier time points. The idea of this proximity index is to calculate the Pearson correlation between two series without taking into consideration the values of one time point. This is repeated for all time points in the data set, excluding one distinct time point at a time. In the end, the lowest value obtained is taken as the result, so that a single outlier time point cannot inflate the similarity. The work developed an analysis of the proposed proximity index. The results showed that the number of false positives decreased with the use of the jackknife Pearson correlation in relation to the original Pearson correlation.
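A minimal sketch of the jackknife Pearson correlation, under the assumption (consistent with the robustness rationale above) that the most conservative leave-one-out value is retained:

import numpy as np

def jackknife_pearson(x, y):
    # Pearson correlation computed p times, leaving one time point out
    # at a time; the lowest value is returned, so that a single outlier
    # time point cannot inflate the similarity between two series.
    p = len(x)
    values = []
    for k in range(p):
        mask = np.arange(p) != k               # exclude time point k
        values.append(np.corrcoef(x[mask], y[mask])[0, 1])
    return min(values)

# Two series that only look co-expressed because of one shared spike:
x = np.array([1.0, 2.0, 1.0, 2.0, 50.0, 1.0])
y = np.array([2.0, 1.0, 2.0, 1.0, 50.0, 2.0])
print(np.corrcoef(x, y)[0, 1])    # near 1: a false positive
print(jackknife_pearson(x, y))    # -1: the spike no longer dominates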
Another proposal argues that traditional indices cannot suitably measure shape proximity with data containing gene expression time series from multiple experiments, unless special data handling is made. In the symbolic approach, the shape similarity of each time series is calculated independently, and aggregated at the end. The symbolic description was evaluated with the yeast data set (Eisen et al., 1998), obtaining significantly better results in comparison with the traditional approaches.
A different approach has been explored in Brown et al. (2000), which applied supervised methods for classifying gene function, given data of gene expression time series. Only a subset of the expression series data used in Eisen et al. (1998) was employed in that work. This subset consisted of the genes belonging to one of five functional classes which clustered well using hierarchical clustering. The supervised methods applied obtained high precision levels, particularly in the experiments using support vector machines (SVM). Despite this, the number of false positives was high for some classes.
The works performed in Lubovac (2001) and Lubovac et al. (2001) evaluated the use of internal criteria, such as compactness and isolation of the clusters, as well as the use of external criteria that compare the clustering results with gene annotation. More specifically, the gene annotations used were the enzymatic and functional classifications of the proteins. The results indicated that internal criteria can be misleading, as they did not show correspondence to the gene annotation. The study also proposed a relative entropy criterion, which compares the distribution of the functional classes inside each cluster with their distribution in the whole data set.
A framework to find the ideal number of clusters was presented in Azuaje (2002). In order to do so, the study applied Dunn's validity index to the results of the clustering methods. The work analysed the methodology with expression data from leukemia (Golub et al., 1999), but the framework can also be applied to time series data.

Yeung et al. (2001) proposed a jackknife-based validation approach, in which the data is clustered using all but one condition, and the held-out condition is used to assess the accuracy of the results. This procedure is repeated for the total number of conditions. The work compared some clustering methods using distinct data sets, but the authors refrained from drawing conclusions from the results, given that only a small number of data sets was available.
Costa et al. (2002b) applied replication analysis for the purpose of evaluating cluster stability in the analysis of gene expression data. More specifically, the work evaluated Self-Organizing Maps (SOM), dynamical clustering and UPGMA hierarchical clustering with data from yeast time series. The preliminary results showed that both SOM and dynamical clustering obtained stable results.
So far, the most complete comparative analysis of clustering methods for gene expression data was performed in Datta & Datta (2003). They proposed a validation methodology based on the jackknife procedure, similar to the one in Yeung et al. (2001), in conjunction with three novel relative validation indices. The work evaluated several clustering methods, among them hierarchical methods, k-means, Diana and a model-based method, using for the evaluation the sporulation data set (Chu et al., 1998) and a simulated data set. In the results, Diana achieved the best performance, followed closely by the model-based and k-means methods. Both hierarchical methods obtained the poorest results.
Zhu & Zhang (2000) investigated the relation of gene expression clustering with gene function and promoter regions. The study used the yeast time series from Eisen et al. (1998) as the gene expression data set, and thirteen major classes from the Munich Information Center for Protein Sequences Yeast Genome Database (MYGD) (Mewes et al., 2002) as the functional classification. The results showed that genes with similar expression levels do not necessarily share the same promoter regions and functions, even though both gene function and promoter regions do help in gaining insight into the clustering results.
A similar and broader study was performed in Gertein & Janssen (2000). In that work, a set of yeast data sets was compared with the MYGD functional classification. Only some functional classes had a strong relation to the gene expression profiles. The reasons for this could be, among others, the vague definitions of some functions and the great overlap of the classification. The study suggested that other types of data should be used, such as protein structure and regulatory sequences.
A feasibility study was also performed in the context of supervised methods for the classification of gene function. Kuramochi & Karypis (2001) evaluated the classification precision of SVM for the fifty biggest MYGD classes, given the gene expression data from yeast (Eisen et al., 1998). The results showed that the classifiers obtained reasonable accuracy in only eight classes. The study concluded that the number of gene expression data sets available is not enough to build classifiers for all functional classes.
Chapter 3
Cluster Analysis
Relevant issues in cluster analysis are covered in this chapter. Initially, Section 3.1 describes the characteristics and the basic functioning of all clustering methods analysed in this dissertation, while Section 3.2 presents the proximity indices. Section 3.3 covers issues on cluster validity relevant to this work. More specifically, validity indices and related validation methodologies are described in detail.
3.1 Clustering Methods

Five distinct clustering methods are analysed in this dissertation. These methods are: agglomerative hierarchical clustering (Eisen et al., 1998), k-means (Tavazoie et al., 1999), self-organizing maps (Tamayo et al., 1999), dynamical clustering (Costa et al., 2002a) and CLICK (Sharan & Shamir, 2002). All of them, with the exception of dynamical clustering, have widespread use in gene expression analysis. The dynamical clustering was included because it was utilized in previous work by the author (Costa et al., 2002a). With the exception of the hierarchical clustering, all the other methods yield partitions as results. In the following subsections, the basic functioning of each method is described.
3.1.1 CLICK
CLICK (Cluster Identification via Connective Kernels) (Sharan & Shamir, 2002) is a clustering method based on graph theory. Although CLICK does not take the number of classes as an input, one can force the generation of a larger number of clusters through the homogeneity parameter.

The method initially generates a fully connected weighted graph, with the objects as vertices and the similarities between the objects as the weights of the edges. Then, CLICK recursively divides the graph in two, using minimum weight cut computations, until a certain kernel condition is met. The minimum weight cut divides the graph in two in a way that the sum of the weights of the removed edges is minimum. If a partition with only one object is found, the object is put apart in a singleton set.
The kernel condition tests whether the cluster formed by a given graph is highly coupled, and consequently whether it should not be further divided. In order to do so, CLICK uses a statistical model, assuming that the similarities between objects (the weights of the edges) follow one of two distinct distributions: one for mate edges (objects that should be clustered together) and another for non-mate edges (objects that should not be clustered together). The kernel test consists of verifying whether the probability of a given graph containing only mate edges exceeds the probability of it containing non-mate edges. If the test is true, then the tested graph is taken as a final cluster; otherwise, it is divided in two (using minimum weight cut computations).
More formally, let E be the object data set; G be a fully connected graph, where each vertex represents an object in E, and the weight of each edge is the similarity between the two connected vertices; minWeightCut(G) be the function that finds the minimum weight cut, returning two fully connected graphs; and S be the singleton set. The method is defined by the following recursive function (Sharan & Shamir, 2002):
function formKernel(G, S)
begin
    if G = {v} then
        S = S ∪ {v};
    else
        if G is a kernel then
            output G;
        else
            (H, V) = minWeightCut(G);
            formKernel(H, S);
            formKernel(V, S);
        end;
    end;
end;
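The following Python sketch mirrors the structure of this recursion. It is not the actual CLICK implementation: networkx's Stoer-Wagner algorithm stands in for the minimum weight cut routine, and the statistical kernel test is replaced by a hypothetical mean-edge-weight threshold.

import networkx as nx

def form_kernels(G, kernel_test, singletons, kernels):
    # Skeleton of the formKernel recursion: singletons are put apart,
    # highly coupled subgraphs are output as kernels, and all other
    # subgraphs are split in two by a minimum weight cut.
    if G.number_of_nodes() == 1:
        singletons.update(G.nodes)
    elif kernel_test(G):
        kernels.append(set(G.nodes))
    else:
        _, (part_a, part_b) = nx.stoer_wagner(G)   # minimum weight cut
        form_kernels(G.subgraph(part_a).copy(), kernel_test, singletons, kernels)
        form_kernels(G.subgraph(part_b).copy(), kernel_test, singletons, kernels)

def mean_weight_test(G, threshold=0.5):
    # Hypothetical stand-in for CLICK's statistical kernel test: accept
    # a subgraph whose mean edge weight (similarity) exceeds a threshold.
    weights = [w for _, _, w in G.edges.data("weight")]
    return bool(weights) and sum(weights) / len(weights) > threshold

# Hypothetical similarity graph over four objects.
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 0.9), ("a", "c", 0.2), ("a", "d", 0.1),
                           ("b", "c", 0.3), ("b", "d", 0.2), ("c", "d", 0.8)])
singletons, kernels = set(), []
form_kernels(G, mean_weight_test, singletons, kernels)
print(singletons, kernels)        # two kernels: {a, b} and {c, d}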
3.1.2 Dynamical Clustering

Dynamical clustering is a partitional iterative algorithm that optimises the best fitting between the classes and their representations, using a predefined number of classes (Diday & Simon, 1980). Starting with prototype values taken from randomly selected individuals, the method alternates between two steps: an allocation step, where each individual is allocated to the class whose prototype has the lowest dissimilarity to it, and a representation step, where a prototype is constructed for each class. A major problem of this algorithm is its sensitivity to the selection of the initial partition. As a consequence, the algorithm may converge to a local minimum (Jain & Dubes, 1988). In order to prevent the local minimum problem, a number of runs with different initialisations is executed. Then, the best run, based on some cohesion measure, is taken as the result (Jain & Dubes, 1988). Another characteristic of this method is its robustness to noisy data. In addition, when particular proximity indices and prototype representations are used, the method guarantees the optimisation of a local criterion (Diday & Simon, 1980). With respect to the proximity indices investigated in this work, only the use of the Euclidean distance version with data containing no missing values guarantees such optimisation.

More formally, this method looks for a partition P of k classes from an object set E = {e1, ..., en} and a vector L of k prototypes G1, ..., Gk, where each prototype represents one class of P. Let D be a dissimilarity function; the algorithm works as follows:

1. Initialisation
    select k objects from E at random as the initial prototypes G1, ..., Gk, and allocate each object to the class of its closest prototype;
2. Representation Step
    for i = 1 to k do
        prototype Gi is set to the centroid of the objects of Ci;
    end;

3. Allocation Step
    test = 0;
    for j = 1 to n do
        find the class Cm of ej;
        find the class Cl such that l = argmin_{i=1,...,k} D(ej, Gi);
        if m ≠ l then
            test = 1;
            Cl = Cl ∪ {ej} and Cm = Cm − {ej};
        end;
    end;
4. Termination Test
    if test = 0 then stop; otherwise, go back to the Representation Step (step 2).

The method minimises the following criterion:

\Delta(P, L) = \sum_{i=1}^{k} \sum_{x \in C_i} D(x, G_i)    (3.2)
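A minimal sketch of the allocation/representation iteration, assuming centroid prototypes, the squared Euclidean distance, a complete data matrix and a single run (the multiple random initialisations discussed above are omitted):

import numpy as np

def dynamical_clustering(E, k, max_iter=100, seed=0):
    # E: (n, p) float data matrix. Initialisation: prototypes taken from
    # k randomly selected objects.
    rng = np.random.default_rng(seed)
    G = E[rng.choice(len(E), size=k, replace=False)].copy()
    labels = np.full(len(E), -1)
    for _ in range(max_iter):
        # Allocation step: assign each object to its closest prototype.
        d = ((E[:, None, :] - G[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                          # termination: no reallocation
        labels = new_labels
        # Representation step: each prototype becomes its class centroid.
        for i in range(k):
            if np.any(labels == i):
                G[i] = E[labels == i].mean(axis=0)
    return labels, G

rng = np.random.default_rng(1)
data = rng.normal(size=(60, 4))
labels, prototypes = dynamical_clustering(data, k=3)
print(np.bincount(labels))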
3.1.3 k-means

The k-means method is one of the most widely used algorithms in cluster analysis studies (Jain et al., 1999). This method is a special case of dynamical clustering (Jain et al., 1999). Thus, they share some characteristics, such as robustness to outliers, the use of a predefined number of classes and the sensitivity to the initial partition. Furthermore, like the dynamical clustering method, k-means also optimises the squared-error criterion when the Euclidean distance is used and there is no missing data. The main distinctions from the dynamical clustering method are that k-means only works with centroid representations of the classes (Jain et al., 1999), and that only one object is reallocated in each allocation step (dynamical clustering reallocates all objects in each allocation step). As a result, a strategy defining the order in which the objects are visited is needed.
More formally, this method looks for a partition P of k classes from an object set E and a vector L of k prototypes, where each prototype represents one class of P. Let D be a dissimilarity function; and O be a random ordering of the objects, where oj is the j-th object of the ordering. The algorithm works as follows:

1. Initialisation
    select k objects from E at random as the initial prototypes G1, ..., Gk, and allocate each object to the class of its closest prototype;

2. Allocation Step
    test = 0;
    for j = 1 to n do
        find the class Cm of oj;
        find the class Cl such that l = argmin_{i=1,...,k} D(oj, Gi);
        if m ≠ l then
            test = 1;
            Cl = Cl ∪ {oj} and Cm = Cm − {oj};
            recalculate the prototypes Gm and Gl;
        end;
    end;

3. Termination Test
    if test = 0 then stop; otherwise, go back to the Allocation Step (step 2).
3.1.4 Self-Organizing Maps

The Self-Organizing Map (SOM) is a type of neural network suitable for unsupervised learning, being robust and accurate with noisy data (Mangiameli et al., 1996). On the other hand, SOM suffers from the same problems as dynamical clustering: sensitivity to the initial parameter settings and the possibility of getting trapped in a local minimum.
The SOM method works as follows. Initially, one has to choose the topology of
the map, for example a 3 x 3 grid as in Figure 3.1. All the nodes are linked to
the input nodes by weighted edges. The weights are first set at random, and then
iteratively adjusted. Each iteration involves randomly selecting an object x and
moving the closest node (and its neighbourhood) in the direction of x. The closest
node is obtained by measuring the Euclidean distance or the dot product between the
object x and the weights of all nodes in the map. The neighbourhood to be adjusted
is defined by a neighbourhood function, which decreases through time.
Figure 3.1: Example of a SOM with topology 3 x 3 and two input variables
More formally, the training of the map proceeds as follows (a code sketch of this procedure is given after the steps):

1. Initialise randomly the weights of the edges between the input nodes and the map;

2. Select an object x at random from the data set;

3. Find the node Nlm such that (l, m) = argmin_{i=1,...,k; j=1,...,o} D(x, Nij);

4. Update the weights of the node Nlm and its neighbourhood towards the object x by the neighbourhood function F, in accordance with a learning rate l;

5. Repeat steps 2 to 4 until a maximum number of iterations is reached.
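A compact sketch of this training loop; the parameter schedules (linearly decaying learning rate and radius) and the Gaussian neighbourhood function are illustrative assumptions rather than the parametrisation studied in Appendix A.

import numpy as np

def train_som(data, rows=3, cols=3, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.uniform(data.min(), data.max(), size=(rows, cols, dim))
    # Grid coordinates of every node, used for neighbourhood distances.
    grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols),
                                 indexing="ij")).astype(float)
    for t in range(n_iter):
        lr = 0.5 * (1.0 - t / n_iter)              # decaying learning rate
        radius = max(1.0, (max(rows, cols) / 2.0) * (1.0 - t / n_iter))
        x = data[rng.integers(len(data))]          # step 2: random object
        # Step 3: winner node, by Euclidean distance in weight space.
        win = np.unravel_index(((weights - x) ** 2).sum(axis=2).argmin(),
                               (rows, cols))
        # Step 4: Gaussian neighbourhood pulls nearby nodes towards x.
        g = ((grid - np.array(win)) ** 2).sum(axis=2)
        h = np.exp(-g / (2.0 * radius ** 2))
        weights += lr * h[:, :, None] * (x - weights)
    return weights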
One problem with SOM is the high number of parameters to be selected, which include the topology, the learning rate, the neighbourhood function and the neighbourhood radius, among others. The success of the map is dependent on the selection of these parameters. In general, the learning rate and the neighbourhood radius should start with large values and decrease to small values in the final iterations. The form of variation is not critical, but one popular practice is to divide the training in two phases. In the first phase, the ordering phase, a large initial radius and learning rate are used. Then, in the convergence phase, a smaller initial radius and learning rate are selected (Haykin, 1994).
One common way of using SOM in cluster analysis is to train a map that gives a topological representation of the clusters. Such maps should have a number of nodes well above the number of real clusters in the data (Vesanto & Alhoniemi, 2000). By visual inspection of the map, one can select the neighbouring nodes that represent each cluster. However, this process is time consuming and open to subjectivity. In this study, an objective way to assign the nodes to the final clusters is needed, as a high number of experiments is necessary, and it is not a good practice to include subjective procedures in the validation process.
One way to overcome the problem just described is to cluster the nodes after training the map, by the use of another clustering method (the weights of each node represent the node input pattern). In this latter clustering, the number of clusters should be equal to the number of clusters in the data. The resulting partition states which nodes are related to each cluster. In Vesanto & Alhoniemi (2000), k-means and hierarchical clustering are employed for this task, both obtaining good recovery accuracies. For the sake of simplicity, this study will only employ the k-means method for this task.
Another alternative is to use maps with a unidimensional layer, where the number of nodes is equal to the number of clusters (Mangiameli et al., 1996). With this type of topology, the SOM method becomes very similar to k-means. But as k-means is already analysed in this study, there would be no point in analysing SOM with this type of topology.
3.1.5 Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering methods organise the objects into a dendrogram (Jain & Dubes, 1988). These algorithms start with each object representing a cluster; then the methods gradually merge these clusters into larger ones. Among the different agglomerative methods, there are three widely used variations: complete linkage, average linkage, and single linkage. These variations differ in the way cluster representations are calculated (see Jain & Dubes (1988) for more details). Depending on the variation used, the hierarchical algorithm is capable of finding non-isotropic clusters, including well-separated, chain-like, and concentric clusters (Jain et al., 1999). However, since such methods are deterministic, individuals can be grouped based only on local decisions, which are not re-evaluated once they are made. As a consequence, these methods are not robust to noisy data (Mangiameli et al., 1996).
Due to the fact that the methodology applied in this work is only adequate for the evaluation of partitions, the hierarchies are transformed into partitions before being evaluated. One way to perform this is to cut the dendrogram at a certain level, as shown in Figure 3.2. Additionally, the hierarchical methods are also used as initialisation for other partitional methods. This practice improves the initial conditions of the partitional method that receives the hierarchical results as input (Jain & Dubes, 1988).
Figure 3.2: Example of two cuts in a dendrogram with nine objects. The two dashed
lines represent respectively cuts with three and four clusters.
This dissertation focuses on the average linkage hierarchical clustering method, or UPGMA (unweighted pair group method with arithmetic averages), as it has been extensively used in the literature of gene expression analysis (Eisen et al., 1998). In this method, the proximity between two clusters is calculated as the average proximity between the objects in one group and the objects in the other group. Given the object set E, the method works as follows (an illustrative code sketch follows the steps):

1. Start with each object of E as a singleton cluster;

2. Find the most similar pair of clusters and merge these two clusters into a single one;

3. Update the proximities between the new cluster and the remaining clusters;

4. Repeat steps 2 and 3 until all objects belong to a single cluster.
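For illustration, UPGMA and the dendrogram cuts of Figure 3.2 can be reproduced with SciPy's hierarchical clustering routines (the data below is invented):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Invented data: nine objects with two attributes, as in Figure 3.2.
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 2))

# UPGMA: average linkage over the pairwise (here Euclidean) distances.
Z = linkage(pdist(X), method="average")

# Cutting the dendrogram, as the two dashed lines of Figure 3.2 do:
three_clusters = fcluster(Z, t=3, criterion="maxclust")
four_clusters = fcluster(Z, t=4, criterion="maxclust")
print(three_clusters, four_clusters)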
3.2 Proximity Indices

Clustering methods depend on a measure of association between the data objects. This can be achieved by the use of proximity (similarity or dissimilarity) indices that calculate the alikeness of two objects. For the choice of a suitable index, the type of the variables and the characteristics of the index should be taken into consideration. For example, in the case of quantitative variables, the Euclidean distance captures the proximity between objects considering the absolute magnitude of the values, while correlation-type indices measure the proximity in relation to the relative magnitudes of the values (Gordon, 1999). As the data sets used in this work contain missing data, the proximity indices studied also need to support missing data (Gower, 1971). Based on this, versions of the following proximity indices are studied: Euclidean distance, Pearson correlation and angular separation (Gordon, 1999). The Euclidean distance does not capture the relative magnitude proximity, unless the objects have their values normalised or standardised. Because of this, for the Euclidean distance version, the effects of normalisation and standardisation are also investigated.
The indices studied can be formally defined as follows. Let x_{ik} denote the k-th quantitative value (expression value at time point k) of the i-th object (gene), where i = 1, ..., n and k = 1, ..., p. The modified version of the Euclidean distance between the i-th and j-th objects is defined as:

d_{ij} = \frac{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \, \delta_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}    (3.3)

where \delta_{ijk} = 0 if x_{ik} or x_{jk} is missing, and 1 otherwise.
Such a version of the Euclidean distance (Eq. 3.3) - ED, for short - is a dissimilarity
index, with values near zero representing similar objects. As this version is based on
the Euclidean distance, it shares the desirable characteristics of the original distance
such as the ability to detect compact and isolated clusters. However, attributes with
high scale values can dominate the others. This can be solved by the normalisation
of the data attributes (Jain & Dubes, 1988).
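A minimal implementation of this version of the Euclidean distance, following the reconstruction of Eq. 3.3 given above (whether the original definition also takes a square root cannot be recovered from the text), with missing values encoded as NaN:

import numpy as np

def euclidean_missing(x, y):
    # delta_ijk of Eq. 3.3: 1 where both genes were measured, 0 otherwise
    # (missing values are encoded as NaN).
    delta = ~(np.isnan(x) | np.isnan(y))
    if not delta.any():
        return np.nan                  # no commonly observed time point
    diff = x[delta] - y[delta]
    return (diff ** 2).sum() / delta.sum()

x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([1.5, np.nan, 3.0, 4.5])
print(euclidean_missing(x, y))         # uses only the 1st and 4th points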
The equation for the version of the Pearson correlation - PC, for short - is as follows:

s_{ij} = \frac{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j) \, \delta_{ijk}}{\left(\sum_{k=1}^{p} \delta_{ijk}\right) \sqrt{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)^2 \, \vartheta_{ik} \; \sum_{k=1}^{p} (x_{jk} - \bar{x}_j)^2 \, \vartheta_{jk}}}    (3.4)

where

\bar{x}_i = \frac{\sum_{k=1}^{p} x_{ik} \, \vartheta_{ik}}{\sum_{k=1}^{p} \vartheta_{ik}}

and \vartheta_{ik} = 0 if x_{ik} is missing, and 1 otherwise.
P C is a correlation type index that measures the angle similarity of two data
vectors, yielding values between -1 and 1, where 1 represents similar objects and -1
dissimilar objects.
The equation for the version of the angular separation - AS, for short - is as follows:
\[ s_{ij} = \frac{\sum_{k=1}^{p} x_{ik}\,x_{jk}\,\delta_{ijk} \Big/ \sum_{k=1}^{p} \delta_{ijk}}{\sqrt{\Big(\sum_{k=1}^{p} x_{ik}^{2}\,\vartheta_{ik} \Big/ \sum_{k=1}^{p} \vartheta_{ik}\Big)\Big(\sum_{k=1}^{p} x_{jk}^{2}\,\vartheta_{jk} \Big/ \sum_{k=1}^{p} \vartheta_{jk}\Big)}} \tag{3.5} \]
where δijk is as defined in Eq. 3.3; and ϑik is as defined in Eq. 3.4.
AS (Eq. 3.5) is also a correlation type index, with the same characteristics as
those of P C (Eq. 3.4). The difference between them is that AS measures the angle
similarity from the origin, while P C measures the angle similarity from the mean of
the data. Both correlations differ from ED (with no prior normalisation) in that they
do not consider the vector size when measuring the proximity.
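A minimal sketch of the two correlation-type indices, under the same assumptions as the previous example (Python/NumPy, NaN encoding missing values) and following the reconstructions of Eqs. 3.4 and 3.5 given above, makes the roles of the δ and ϑ indicators concrete:

    import numpy as np

    def pearson_missing(x, y):
        # vartheta indicators: observed entries of each profile
        tx, ty = ~np.isnan(x), ~np.isnan(y)
        delta = tx & ty                        # delta: shared time points
        xm, ym = x[tx].mean(), y[ty].mean()    # means over observed points
        cov = np.sum((x[delta] - xm) * (y[delta] - ym)) / delta.sum()
        sx = np.sqrt(np.sum((x[tx] - xm) ** 2) / tx.sum())
        sy = np.sqrt(np.sum((y[ty] - ym) ** 2) / ty.sum())
        return cov / (sx * sy)                 # Eq. 3.4

    def angular_missing(x, y):
        tx, ty = ~np.isnan(x), ~np.isnan(y)
        delta = tx & ty
        num = np.sum(x[delta] * y[delta]) / delta.sum()
        nx = np.sqrt(np.sum(x[tx] ** 2) / tx.sum())
        ny = np.sqrt(np.sum(y[ty] ** 2) / ty.sum())
        return num / (nx * ny)                 # Eq. 3.5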
As some of the methods used in this work require the proximity values to be bound to [0,1], the two following equations can be used to transform, respectively, similarities into dissimilarities (Eq. 3.6) and dissimilarities into similarities (Eq. 3.7):

\[ d_{ij} = \frac{1 - s_{ij}}{2} \tag{3.6} \]
Additionally, two pre-processing procedures widely used in the literature of gene expression are analysed (Tamayo et al., 1999).
procedure is the normalisation of the data vectors (genes) so that they have a norm
equal to one. This procedure requires the values of the data vectors to be positive.
The other is a standardisation procedure that makes the data vectors have zero mean and a standard deviation equal to one. The application of either procedure
makes ED capture relative magnitude dissimilarity. As the data sets used in this
work contain missing data, both procedures were adapted to support missing values.
Formally, let x_{ik} denote the kth quantitative value (expression value of time point k) of the ith object (gene), where i = 1, …, n and k = 1, …, p; the standardised values are:

\[ z_{ik} = \begin{cases} \text{missing}, & \text{if } x_{ik} \text{ is missing} \\[4pt] \dfrac{x_{ik} - \bar{x}_i}{s_i}, & \text{otherwise} \end{cases} \tag{3.8} \]

where

\[ s_i^{2} = \frac{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)^{2}\,\vartheta_{ik}}{\big(\sum_{k=1}^{p} \vartheta_{ik}\big) - 1}; \]

and the normalised values are:

\[ y_{ik} = \begin{cases} \text{missing}, & \text{if } x_{ik} \text{ is missing} \\[4pt] \dfrac{x_{ik}}{\sum_{k=1}^{p} x_{ik}\,\vartheta_{ik}}, & \text{otherwise} \end{cases} \tag{3.9} \]
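The two procedures can be sketched as follows (again Python/NumPy, with NaN encoding missing values; the function names are ours):

    import numpy as np

    def standardise(x):
        # zero mean and unit standard deviation over observed points (Eq. 3.8)
        obs = ~np.isnan(x)
        xm = x[obs].mean()
        s = np.sqrt(np.sum((x[obs] - xm) ** 2) / (obs.sum() - 1))
        return (x - xm) / s                    # NaN entries stay NaN

    def normalise(x):
        # scale a positive profile so its observed values sum to one (Eq. 3.9)
        obs = ~np.isnan(x)
        return x / x[obs].sum()                # NaN entries stay NaN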
Despite its importance, cluster validation is rarely employed in applications of cluster analysis. The reasons for this are, among others, the lack of general guidelines on how cluster validity should be carried out, and the great need of computer resources (Jain & Dubes, 1988).
In this section, procedures and tools for cluster validity relevant to this work are
described. More specifically, Section 3.3.1 describes aspects of indices for cluster
validity. Next, in Section 3.3.2, methodologies used for the evaluation of clustering
methods are explained.
A cluster structure is considered valid when it gives true information about the data, or when it captures intrinsic characteristics of the data (Jain & Dubes, 1988).
The validity indices vary in two main aspects: the type of structure evaluated, and the type of criterion employed. There are three types of criteria: external, internal and relative. The external criteria assess the accuracy by comparing the structure with a priori information. Internal criteria measure the accuracy by comparing the structure with the input data (and only the input data).
The last type, relative criteria, is used to compare two cluster structures, in order to point out which structure is better in some sense. This work is restricted to validity indices appropriate for evaluating partitions and to external criteria. The reasons for this choice are, among others, that most of the methods evaluated give partitions as results and that external labels are available for some data sets. Alternatively, internal criteria could be used, allowing the addition of unlabelled data sets to the experiments. However, there are a number of difficulties in applying internal indices, especially in a comparative analysis, where the choice of the index could favour some specific clustering methods (Dubes, 1998).
External Indices
External indices are used to assess the degree of agreement between two partitions
(U and V ), where partition U is the result of a clustering method and partition V
is formed by an a priori information independent of partition U , such as a category
label (Jain & Dubes, 1988). There are a number of external indices defined in the
literature, such as Jaccard, Rand and corrected Rand (or adjusted Rand) (Jain &
Dubes, 1988). One characteristic of most of these indices is that they can be sensitive
to the number of classes in the partitions or to the distributions of elements in the
clusters. For example, some indices have a tendency to present higher values for
partitions with more classes (Rand), others for partitions with a smaller number of
classes (Jaccard) (Dubes, 1987). The corrected Rand, which has its values corrected
for chance agreement, does not have any of these undesirable characteristics (Milligan
& Cooper, 1986). Thus, the corrected Rand index is the only external index used in
the validation methodology proposed by this work. However, in order to explain the
general idea of external indices, the Rand and Jaccard indices are first described.
Given a partition U = {u_1, …, u_R}, with R clusters, and a partition V = {v_1, …, v_C}, with C clusters, the external indices can be expressed in terms of the following indicator functions (Jain & Dubes, 1988):

\[ I_U(i, j) = \begin{cases} 1, & \text{if } x_i \in u_r \text{ and } x_j \in u_r \text{ for } r \leq R \\ 0, & \text{otherwise} \end{cases} \tag{3.10} \]

\[ I_V(i, j) = \begin{cases} 1, & \text{if } x_i \in v_c \text{ and } x_j \in v_c \text{ for } c \leq C \\ 0, & \text{otherwise} \end{cases} \tag{3.11} \]

Counting the pairs of objects according to the values of the two indicator functions yields the following contingency table, where M is the total number of pairs of objects (3.12):

                 I_U = 1    I_U = 0
    I_V = 1         a          b        m_1
    I_V = 0         c          d        M - m_1
                   m_2      M - m_2     M
In this table, the agreements of the partitions are represented by a and d, where a indicates the number of object pairs in the same class in both partitions, and d denotes the number of object pairs in separate classes in both partitions. The disagreements are indicated by b and c, where b represents pairs in the same class in V but in different classes in U, and c represents pairs in the same class in U but in different classes in V. The Rand and Jaccard indices are then defined as:

\[ \text{Rand} = \frac{a + d}{a + b + c + d} \tag{3.13} \]

\[ \text{Jaccard} = \frac{a}{a + b + c} \tag{3.14} \]
Both the Jaccard (Eq. 3.14) and the Rand (Eq. 3.13) indices yield values in the interval [0,1], where the closer the value is to 1, the higher the agreement. The difference between them is that Jaccard does not take into consideration the agreement represented by the term d. These two indices suffer from the same problem: there is no indication of how good a partition is given the value obtained. For instance, Milligan & Cooper (1986) showed that partitions with a high number of clusters can obtain Rand index values near 1 independently of their quality. One way to overcome this problem is to correct the indices for random agreement.
The corrected Rand index (Hubert & Arabie, 1985), for example, can be described by the following equation:

\[ \text{corrected Rand} = \frac{a + d - n_c}{a + b + c + d - n_c} \tag{3.15} \]

Such an index is obtained by incorporating into the Rand index a correcting term (n_c), which adjusts the statistic by estimating random agreement (Hubert & Arabie, 1985). This correction considers that the baseline distributions of the partitions are fixed. The corrected Rand index can take values from -1 to 1, with 1 indicating a perfect agreement between the partitions, and values near 0 or negative corresponding to cluster agreement found by chance. In fact, an analysis by Milligan & Cooper (1986) confirmed that the corrected Rand scores near 0 when presented with clusters generated from random data, and showed that values greater than 0.05 indicate clusters not found by chance.
As the n_c term cannot be defined from a, b, c and d, the exact corrected Rand equation can only be expressed in terms of the contingency table of partitions U and V (3.16), whose rows correspond to the clusters u_1, …, u_R, whose columns correspond to the clusters v_1, …, v_C, and whose entries are the counts n_ij, where n_ij represents the number of objects that are in both clusters u_i and v_j; n is the number of all objects in the partitions; n_i. indicates the number of objects in cluster u_i; and n_.j indicates the number of objects in cluster v_j.
Thus, the exact corrected Rand equation is as follows:

\[ \text{corrected Rand} = \frac{\sum_{i}^{R} \sum_{j}^{C} \binom{n_{ij}}{2} - \binom{n}{2}^{-1} \sum_{i}^{R} \binom{n_{i.}}{2} \sum_{j}^{C} \binom{n_{.j}}{2}}{\frac{1}{2}\Big[\sum_{i}^{R} \binom{n_{i.}}{2} + \sum_{j}^{C} \binom{n_{.j}}{2}\Big] - \binom{n}{2}^{-1} \sum_{i}^{R} \binom{n_{i.}}{2} \sum_{j}^{C} \binom{n_{.j}}{2}} \tag{3.17} \]
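For illustration, a small sketch (Python, standard library plus NumPy; the function name is ours) evaluates Eq. 3.17 directly from two label vectors:

    import numpy as np
    from math import comb

    def corrected_rand(u, v):
        # contingency table n_ij of the two partitions (3.16)
        ulab, vlab = np.unique(u), np.unique(v)
        n = np.array([[np.sum((u == r) & (v == c)) for c in vlab]
                      for r in ulab])
        sum_ij = sum(comb(int(x), 2) for x in n.ravel())
        sum_i = sum(comb(int(x), 2) for x in n.sum(axis=1))   # n_i.
        sum_j = sum(comb(int(x), 2) for x in n.sum(axis=0))   # n_.j
        expected = sum_i * sum_j / comb(int(n.sum()), 2)      # chance term
        return (sum_ij - expected) / ((sum_i + sum_j) / 2 - expected)

    u = np.array([0, 0, 1, 1, 2, 2])
    v = np.array([0, 0, 1, 1, 1, 2])
    print(corrected_rand(u, v))   # equals 1 only for a perfect agreement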
Methodologies for cluster validity are inherently statistical. The task of such procedures is to find how unusual or valid a certain cluster structure is. One very popular procedure in cluster validity is the Monte Carlo test (Jain & Dubes, 1988). In this test, a number of data sets are built given a null model (usually, this null model represents no structure, or randomness). These data sets are clustered and evaluated, forming a baseline distribution of the validity index. Then, the observed value (the cluster structure to be evaluated) is compared with the baseline distribution (obtained under the null model) using statistical tests.
Monte Carlo tests have been widely employed in cluster validity studies (Gordon,
1999; Jain & Dubes, 1988; Milligan, 1996). However, this test presents some problems. First, Monte Carlo consumes a lot of computer resources, as a high number of replications is needed for building the baseline distribution (from 500 to 1000 replications) (Jain & Dubes, 1988). Nowadays, this may not be a big problem, as processing time is becoming cheaper. However, for complex experiments, such a number of replications can still be a problem. Second, the definition of the null model is not a trivial task. In fact, there is a wide range of null model types, each with advantages and disadvantages (Gordon, 1996). Indeed, Gordon (1996) suggested that more than one null model should be employed in validation analysis, which makes the validation process even more costly.
Another statistical methodology with increasing use in cluster validity is the bootstrap. Bootstrap has been used to build consensus trees (Felsenstein, 1985) and to measure cluster stability (Jain & Moreau, 1987). In fact, bootstrap samples of an original data set can be used to build a null model (the hypothesis of no structure, or randomness) (Jain & Dubes, 1988). These bootstrap samples can be obtained either by resampling the objects or by resampling the attributes of the data set. In this way, the problems present in Monte Carlo tests related to the choice of the null model would be avoided. Still, the number of resamples necessary for building an accurate test is also high.
Replication analysis is another well-known procedure for cluster validation (McIntyre & Blashfield, 1980). This procedure, based on cross-validation, measures the stability of a method in clustering a certain data set. This method is also based on drawing a number of samples from the original data set. However, it requires a small number of replications to perform the test (at least 30). Since in this work the number of experiments necessary for comparing the proximity indices and clustering methods is high (around 100 distinct experiments), it would be too costly to use either the Monte Carlo or the bootstrap test. Because of this, the validation methodology proposed in this dissertation is based on replication analysis. As a consequence, a much smaller number of replications is required.
Replication analysis
Replication analysis was proposed as a means of measuring the stability (or replicability) of clustering methods. This is done by comparing the results obtained by clustering subsets of data randomly drawn from a single population (McIntyre & Blashfield, 1980; Morey et al., 1983). The higher the similarity of the partitions obtained by clustering the distinct subsets, the higher the stability of the given method. It is important to point out that stability and accuracy are not necessarily correlated: even though the solution given by a method can be stated as stable, it does not mean that the solution has a good accuracy.
In supervised learning, the evaluation of a method is accomplished in two steps. First, a subset of the data (training set) is used to train the method, obtaining a classifier (or function) as result. Then, another subset of the data (test set) is presented to this classifier. In contrast, clustering results are not classifiers (or functions) as in supervised learning. The solution employed in the replication analysis is the nearest centroid procedure, which assigns each new element, given a proximity index, to the class with the nearest centroid. This procedure resembles steps of some clustering algorithms (SOM, k-means, average linkage hierarchical clustering) when an element is assigned to a cluster. As a consequence, the use of this procedure should not include an additional bias in the validation process (McIntyre & Blashfield, 1980).
Basically, the replication procedure works as follows. The data set is randomly divided into two disjoint data sets A and B. Then, the objects in A are clustered, and the centroids of the resulting clusters are used to assign each object in B to the class with the nearest centroid (Nearest Centroid Step). The objects in B are also clustered directly (Direct Clustering Step). Finally, the partitions obtained in the Nearest Centroid Step and in the Direct Clustering Step are compared (both partitions are obtained from the set B). The higher the agreement between these partitions, the higher the stability. This procedure is then repeated a number of times with distinct partitions A and B.
Formally, let D be the data set; n the number of clusters; A_i and B_i two random subsets of the data set D; R_i the resulting partition of the set A_i; C_i the set of centroids of partition R_i; and P_i and NC_i the resulting partitions of the set B_i, for i = 1, …, k; then, the replication analysis can be summarised as follows:

1. for i = 1 to k do
   (a) randomly divide D into the disjoint subsets A_i and B_i;
   (b) cluster A_i, obtaining the partition R_i and its centroids C_i;
   (c) assign each object of B_i to the nearest centroid in C_i, obtaining the partition NC_i (Nearest Centroid Step);
   (d) cluster B_i directly, obtaining the partition P_i (Direct Clustering Step);
   (e) compare P_i and NC_i with an external index;
2. Take the mean of the k external index values as the stability measure.
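As an illustration of the procedure, the sketch below (Python) uses scikit-learn's k-means as the clustering method and the adjusted (corrected) Rand index as the external index; both choices are ours, for the sake of the example, and data is assumed to be a NumPy array of complete profiles:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    def replication_analysis(data, n_clusters, k=30, seed=0):
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(k):
            idx = rng.permutation(len(data))
            a = data[idx[: len(data) // 2]]
            b = data[idx[len(data) // 2 :]]
            km = KMeans(n_clusters=n_clusters, n_init=10).fit(a)
            nc = km.predict(b)                  # Nearest Centroid Step on B
            direct = KMeans(n_clusters=n_clusters,
                            n_init=10).fit_predict(b)   # Direct Clustering Step
            scores.append(adjusted_rand_score(direct, nc))
        return float(np.mean(scores))           # mean agreement = stability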
The idea behind the replication analysis is simple: the stability is measured by comparing the partition obtained by a clustering method with the partition obtained on an independent subset of the data (via the nearest centroid procedure). Monte Carlo experiments with this procedure (McIntyre & Blashfield, 1980) have shown that the replication analysis is useful for the evaluation of clustering methods. Furthermore, it was also demonstrated that there was a high correlation between the stability and the accuracy of the results.
Breckenridge (1989) compared the nearest centroid procedure with other assignment rules, such as the nearest neighbour procedure and the quadratic discriminant analysis classification rule. Monte Carlo experiments demonstrated that the nearest neighbour procedure obtained better results for detecting instability than the other procedures. However, it was stated that the nearest centroid procedure should be used in situations where the data is clustered by relative magnitude or shape, which is the context of this work (Breckenridge, 1989).
Chapter 4

Methods and Experiments

This chapter presents in detail the validation methodology and the experimental design used in the comparative analysis. Section 4.1 introduces the validation methodology proposed in this dissertation. Then, Section 4.2 describes the data sets used in the experiments. The last section describes the experimental design utilised in the experiments, as well as some implementation issues specific to each clustering method.
In this section, a methodology for cluster validity with the objective of comparing the accuracy of clustering methods and proximity indices is described. In this methodology, the accuracy is measured with the use of an external index. The mean values of the external index obtained by each clustering method (or proximity index) are compared two by two with a bootstrap hypothesis test, in order to assess the statistical significance of any difference in the results.
4.1.1 Cross-validation
The comparison of two supervised learning methods is, often, accomplished by ana-
lysing the statistical significance of the difference between the mean of the classifi-
cation error rate, on independent test sets, of the methods evaluated. In order to
evaluate the mean of the error rate, several (distinct) data sets are needed. However,
the number of data sets available is often limited. One way to overcome this problem
is to divide the data sets into training and test sets by the use of a k-fold cross valida-
tion procedure (Mitchell, 1997). This procedure can be used to compare supervised
methods, even if only one data set is available. The procedure works as follows. The
data set is divided into k disjoint equal size sets. Then, training is performed in k
steps, each time using a different fold as the test set and the union of the remaining
folds as the training set. Applying the distinct algorithms to the same folds with a
k at least equal to thirty, the statistical significance of the differences between the
methods can be measured, based on the mean of the error rate from the test sets.
In the case of clustering methods, if an a priori classification is available, the comparison between two methods can also be done by detecting the statistical significance of the difference between the mean values of a certain external index (it is important to point out that the a priori classification is not used in the training, but only to evaluate the results). But again, the number of training sets available is also limited. Monte Carlo and bootstrap tests could be used to generate additional training sets, but they have a high computational cost. This work proposes a methodology to overcome these problems. Such a methodology is an adaptation of the k-fold cross-validation procedure: the data set is divided into k disjoint folds, and each fold is used once as the test set, and the remaining folds as the training set. The training set is presented
to a clustering method, giving a partition as result (training partition). Then, the
nearest centroid technique is used to build a classifier from the training partition.
The centroid technique calculates the proximity between the elements in the test set
and the centroids of each cluster in the training partition (the proximity must be
measured with the same proximity index used by the clustering method evaluated).
A new partition (test partition) is then obtained by assigning each object in the test set to the cluster with the nearest centroid. Next, the test partition is compared with
the a priori partition (or a priori classification) by using an external index (this a
priori partition contains only the objects of the test partition). At the end of the
procedure, a sample with size k of the values for the external index is available.
Formally, let D be the data set; n the number of clusters; F_i the ith test fold (or set); R_i the resulting partition of the training set D − F_i; C_i the set of centroids of partition R_i; T_i the resulting partition of test fold F_i; and P_i the a priori partition with the objects from F_i, for i = 1, …, k; then, the unsupervised k-fold cross-validation can be summarised as follows:

1. Divide the data set D into k disjoint folds F_1, …, F_k;
2. for i = 1 to k do
   (a) cluster the training set D − F_i, obtaining the partition R_i and its centroids C_i;
   (b) assign each object of F_i to the nearest centroid in C_i, obtaining the test partition T_i;
   (c) compare T_i with the a priori partition P_i using an external index.
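A sketch of the unsupervised k-fold procedure, again with scikit-learn's k-means standing in for the clustering method under evaluation (an assumption of this example, as is the use of the adjusted Rand score as external index):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    from sklearn.model_selection import KFold

    def unsupervised_kfold(data, labels, n_clusters, k=30, seed=0):
        scores = []
        folds = KFold(n_splits=k, shuffle=True, random_state=seed)
        for train, test in folds.split(data):
            km = KMeans(n_clusters=n_clusters, n_init=10).fit(data[train])
            t_i = km.predict(data[test])           # nearest centroid step
            # compare the test partition with the a priori classification
            scores.append(adjusted_rand_score(labels[test], t_i))
        return np.asarray(scores)                  # sample of size k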
The general idea of the k-fold cross-validation procedure is to observe how well data from an independent set F_i is clustered, given the training results. If the results of a training set have a low agreement with the a priori classification, so should the results of the respective test set. In conclusion, the objective of the procedure is to assess the accuracy of the clustering results. This procedure differs from the replication analysis in two main aspects. First, the replication analysis measures the stability of the results: in order to do so, the test set is also clustered with the same clustering method, and the stability is measured by comparing this partition with the partition obtained via the nearest centroid assignment (see the algorithm in Section 3.3.2). The unsupervised k-fold cross-validation, on the other hand, is used to analyse the accuracy of the results, which is done by comparing the test partition T_i with an a priori classification. Second, the "test folds" (the sets B_i) of the replication analysis are not drawn independently from the others, in contrast to the independent test folds of the unsupervised k-fold procedure.
Two-sample hypothesis tests are applied to measure the significance of the difference between the sample means of two random variables. In this work, these two samples are formed by the values of the external index provided by the unsupervised k-fold cross-validation procedure for the two clustering methods (or proximity indices) to be compared. The test indicates whether the sample mean of one clustering method can be stated as significantly higher than that of the other.
The hypothesis test used in this work is based on bootstrap resampling. Bootstrap is a data-based method used to measure the accuracy of statistical estimates (Efron & Tibshirani, 1993). The idea behind the bootstrap is simple: given a sample, elements are randomly drawn with replacement, forming a bootstrap sample. The estimate is built by calculating the desired statistic over a large number of bootstrap samples. The bootstrap method was chosen due to its capacity to build accurate estimates when a limited number of elements is available in the samples. Furthermore, the bootstrap method has the advantage of not making parametric assumptions about the sample distributions. However, such a method is not as accurate as parametric tests such as the t-test (Efron & Tibshirani, 1993).
More formally, let r be the number of bootstrap replicates; y be the sample y = (y_1, …, y_i, …, y_n); z be the sample z = (z_1, …, z_j, …, z_m); and ȳ and z̄ be the two sample means. The hypotheses of the test are:

H_0: ȳ = z̄
H_1: ȳ < z̄

Then, the bootstrap procedure to compare the samples y and z is defined as (Efron & Tibshirani, 1993):

1. Translate both samples so that they share the mean of the combined sample, making the null hypothesis hold: ỹ_i = y_i − ȳ + x̄ and z̃_j = z_j − z̄ + x̄, where ȳ and z̄ are the sample means, and x̄ is the mean of the combined sample;
2. for k = 1 to r do
   (a) draw, with replacement, a bootstrap sample y*k of size n from ỹ and a bootstrap sample z*k of size m from z̃;
   (b) calculate the statistic t(y*k, z*k) (Eq. 4.1), where

\[ t(A, B) = \frac{\bar{a} - \bar{b}}{\sqrt{\dfrac{s_a^{2}}{n} + \dfrac{s_b^{2}}{m}}} \tag{4.1} \]

with A = (a_1, …, a_i, …, a_n); B = (b_1, …, b_j, …, b_m); s_a^2 = \sum_{i=1}^{n} (a_i - \bar{a})^2 / (n - 1); and s_b^2 = \sum_{j=1}^{m} (b_j - \bar{b})^2 / (m - 1);

3. Calculate the statistic t(y, z) with the original samples y and z (Eq. 4.1), and find the achieved significance level (ASL) (Eq. 4.2), given W = {(y*1, z*1), …, (y*k, z*k), …, (y*r, z*r)}:

\[ ASL = \frac{\#\{\,k : t(y^{*k}, z^{*k}) \leq t(y, z)\,\}}{r} \tag{4.2} \]
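A sketch of the test (Python/NumPy), following the reconstruction of the procedure above; the translation to the common mean is what enforces the null hypothesis in the bootstrap world:

    import numpy as np

    def t_stat(a, b):
        # two-sample statistic of Eq. 4.1
        return (a.mean() - b.mean()) / np.sqrt(
            a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

    def bootstrap_asl(y, z, r=1000, seed=0):
        rng = np.random.default_rng(seed)
        xbar = np.concatenate([y, z]).mean()
        yt = y - y.mean() + xbar          # translated samples: H0 holds
        zt = z - z.mean() + xbar
        t_obs = t_stat(y, z)
        reps = np.array([t_stat(rng.choice(yt, len(y)),
                                rng.choice(zt, len(z)))
                         for _ in range(r)])
        # ASL: fraction of replicates at least as extreme as the observed
        # statistic, for the one-sided alternative H1: mean(y) < mean(z)
        return float(np.mean(reps <= t_obs))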
The yeast Saccharomyces cerevisiae is one of the most well-studied biological organisms; in fact, it was one of the first organisms to have its whole genome sequenced (Heyer et al., 1999). Since there is a wide availability of public data from the yeast, as well as extensive functional annotation, data from this organism is used in the experiments. More specifically, one classification scheme and two data sets from the yeast are used. The Yeast Functional Classification consists of a classification scheme of half of the known yeast genes. The two data sets contain data of gene expression time series: the Yeast All and the Mitotic Cell Cycle data sets. From these expression data sets, only genes belonging to a certain classification scheme are used to form the final data sets. More specifically, from the Yeast All expression data, two data sets are formed by the use of two distinct functional classification schemes devised from the Yeast Functional Classification. In terms of the Mitotic Cell Cycle data set, two data sets are also formed: one is formed with the Yeast Functional Classification scheme and the other with a series shape classification performed in Cho et al. (1998).
The Munich Information Center for Protein Sequences Yeast Genome Database (MYGD) is the main scheme for classifying protein function of the yeast organism (Mewes et al., 2002). This classification scheme is currently composed of a tree with 249 classes spread over five levels. The genes are catalogued in accordance with information from biochemical and genetic studies, where genes with a large amount of information tend to be classified in higher levels of the tree (the number of classes in each level is shown in Table 4.1). Genes can be assigned to more than one class; consequently, the overlap of classes is large, with genes being assigned to an average of 2.9 classes. Out of the 6200 known yeast ORFs (Open Reading Frames), around 3900 belong to at least one of the MYGD classes. (Original data available at: http://mips.sf.de/proj/yeast/catalogues).
    Level    Number of classes
      1             16
      2            107
      3             85
      4             39
      5              2
Table 4.1: Number of classes in the five levels of the MYGD classification.
This data is used as the external category label in order to evaluate the accuracy of the clustering results. In other words, this classification data does not contain any gene expression data, but it is used in conjunction with the expression data sets, supplying a label for the genes contained in them. In fact, two classification schemes were obtained from this data: the FC and the REDUCED FC. The FC classification scheme is formed by thirteen first-level classes of the MYGD, as in (Zhu & Zhang, 2000). These classes are expected to show similar expression profiles. Table 4.2 shows these classes and the number of genes in each class.
The REDUCED FC (Table 4.3) is composed of five MYGD classes that have shown a high tendency to cluster together (Eisen et al., 1998). Furthermore, genes belonging to these classes have been successfully used for building function prediction classifiers with supervised methods (Brown et al., 2000).
    Class                                             Number of genes
    Metabolism                                             1215
    Energy                                                  258
    Cell Cycle and DNA Processing                           815
    Transcription                                           847
    Protein Synthesis                                       363
    Protein Fate                                            655
    Cellular Transport                                      537
    Cellular Communication                                   60
    Cell Rescue, Defense and Virulence                      287
    Regulation of Cellular Environment                      216
    Transposable Elements, Viral, Plasmid Proteins          116
    Control of Cellular Organisation                        217
    Transport Facilitation                                  363
Table 4.2: MYGD classes from the FC scheme with their respective numbers of genes.
    Class                       Number of genes
    Tricarboxylic acid cycle          17
    Respiration                       22
    Cytoplasmic ribosome             121
    Proteasome                        35
    Histones                          11
Table 4.3: MYGD classes from the REDUCED FC scheme with their respective numbers of genes.
This data set contains data from five yeast experiments, in which 6200 ORFs had their expression profiles measured using cDNA microarrays. The ORF profiles contain 71 time points, observed during the following five biological processes: the mitotic cell division cycle (alpha factor, cdc15 and elutriation experiments) (Spellman et al., 1998), sporulation (Chu et al., 1998) and the diauxic shift (DeRisi et al., 1997). These processes contained, respectively, 18, 25, 14, 7 and 7 time points. The expression value of each ORF in a time point is the log transformation (base 2) of the ratio between the measured expression level and the control expression level (Eisen et al., 1998). Some of the genes contain missing values, either because insignificant hybridisation levels were detected, or because the genes were not measured in certain processes. (Data available at: http://genome-www.stanford.edu/clustering).
As stated in Section 3.2, the normalisation procedure requires the data vectors to contain only positive values, which is not the case of the log-ratio values obtained in cDNA microarrays. In order to overcome this problem, in the experiments that use normalisation, each value is replaced by 2 raised to the power of that value, returning to the original (positive) measure/control ratio.
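In code, the inversion amounts to a simple exponentiation, for example:

    import numpy as np

    log_ratios = np.array([[-1.0, 0.0, np.nan],
                           [ 2.0, -0.5, 1.0]])
    ratios = 2.0 ** log_ratios   # back to the (positive) measure/control ratios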
Two data sets were devised from the original Yeast All data set: the FC Yeast All and the Reduced FC Yeast All. The FC Yeast All data set contains only genes in the FC classification. A missing-data filter was applied to this data set, excluding profiles with more than 20% of missing values. As in Heyer et al. (1999), a final filtering was employed in order to remove uninformative genes with low expression levels or with low variance between the time points. In these removed ORFs, the expression level did not vary across time; thus, these profiles were considered uninformative in relation to gene function. In order to apply this filtering, genes were ranked according to their variance and mean, and the ones within the 25% lowest values (Heyer et al., 1999) in each rank were removed. In the end, the FC Yeast All data set contained 1765 genes. The Reduced FC Yeast All data set contains only genes from the Reduced FC classification. Since there is a reduced number of genes in this data set, only the missing-data filter was applied, resulting in 205 genes.
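A sketch of the two filters (Python/NumPy; the thresholds are the ones quoted in the text, while the function name and the NaN encoding of missing values are assumptions of the example):

    import numpy as np

    def filter_profiles(data, max_missing=0.20, low_fraction=0.25):
        # missing-data filter: drop profiles with more than 20% missing values
        data = data[np.mean(np.isnan(data), axis=1) <= max_missing]
        # rank filter of Heyer et al. (1999): drop the 25% lowest-variance
        # and the 25% lowest-mean profiles
        var = np.nanvar(data, axis=1)
        mean = np.nanmean(data, axis=1)
        cut = int(low_fraction * len(data))
        drop = np.union1d(np.argsort(var)[:cut], np.argsort(mean)[:cut])
        return np.delete(data, drop, axis=0)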
This data set was obtained in an experiment with the yeast organism during the mitotic cell division cycle (Cho et al., 1998). The set contains the expression profiles measured with oligonucleotide arrays during 17 time points, with a similar set of ORFs as the one used in the Yeast All data set. In oligonucleotide arrays, there are 20 pairs of probes for each ORF. These pairs are composed of perfect match (PM) and mismatch (MM) probes, where the latter works as a specificity control. The expression of a gene is measured by the average of the differences between the PM and MM probes.
Two data sets were also devised from the Mitotic Cell Cycle: the FC CDC 25 and the Series CDC 25. In the FC CDC 25 data set, only genes in the FC classification were considered. A variance filtering was employed in order to remove the 25% of the genes with lowest variance and mean. This data set did not contain any missing data. The final number of genes in this data set was 1869. The Series CDC 25 data set contains genes belonging to a visual classification of the series shapes performed by Cho et al. (1998). In this classification, 420 genes were assigned to one of five known phases of the cell cycle (some of the genes were assigned to a multiple-phase class). There was no need to pre-process this data set, as only informative gene profiles were originally selected.
4.3 Experiments
The experiments are divided into two parts. In the first part, only the proximity indices are compared, while in the second the comparison of the clustering methods is accomplished. The results obtained in the former are used to choose the proximity indices (with the best accuracy given a clustering method) to be used in the latter part. In the following two sections, both experiments are described. In the last section, implementation issues specific to each clustering method are described.
The first part of the experiments compares versions of three proximity indices: angular separation (AS), Pearson correlation (P C) and Euclidean distance (ED). With respect to the Euclidean distance version, experiments are performed with the data vectors in three forms, namely, original (ED1), normalised (ED2) and standardised (ED3) values. This yields five distinct settings of proximity indices and pre-processing. Each of these settings was implemented in the following clustering methods: CLICK, SOM, hierarchical clustering, dynamical clustering, k-means, and dynamical clustering and k-means with initialisation from the hierarchical method. The experiments were accomplished by presenting the four data sets (FC Yeast All, Reduced FC Yeast All, FC CDC 25 and Series CDC 25) to all these methods and index settings. More specifically, for each method, proximity index and data set, a thirty-fold unsupervised cross-validation was applied. Afterwards, the mean values of the corrected Rand index (CR) for the test folds were measured. Finally, the means of CR obtained by the five settings of proximity indices and pre-processing were compared two by two, using the bootstrap hypothesis test with 1000 bootstrap samples. As the interest of this experiment is in comparing the proximity indices, the hypothesis tests only compared the results of experiments performed with the same clustering methods and data sets.
The second part of the experiments compares the following clustering methods: CLICK, SOM, hierarchical clustering, dynamical clustering, k-means, and dynamical clustering and k-means with initialisation from the hierarchical clustering. Each clustering method was evaluated with the proximity index that obtained the highest accuracy in the first part of the experiments. The experiments were accomplished by presenting the same four data sets (FC Yeast All, Reduced FC Yeast All, FC CDC 25 and Series CDC 25) to all methods. More specifically, for each method and data set, a thirty-fold unsupervised cross-validation was applied. Afterwards, the mean values of CR obtained by the seven clustering methods were compared two by two, using the bootstrap hypothesis test with 1000 bootstrap samples. As the interest of this experiment is in comparing clustering methods, the hypothesis tests only compared the results of experiments performed with the same data set. In addition, a random assignment method was also included in this evaluation. This method simply assigns the objects in the input data set randomly to a cluster. It is important to notice that this method is evaluated in the same manner as the other methods. In brief, the method is used to cluster the training sets in the k-fold cross-validation procedure. The nearest centroid procedure used to cluster the test set is then performed normally, given the random partition. The only distinction in the evaluation of the random assignment method is that the final results are taken from the mean corrected Rand values obtained in 100 different runs. The mean results obtained by the random assignment method are taken as the worst case; all other clustering methods should obtain values significantly higher than it.
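The baseline itself is trivial to implement; a sketch (the function name is ours):

    import numpy as np

    def random_assignment(n_objects, n_clusters, seed=0):
        # each object receives a uniformly random cluster label
        rng = np.random.default_rng(seed)
        return rng.integers(n_clusters, size=n_objects)

The random partition produced this way feeds the same nearest centroid step as the real methods, so the baseline is evaluated under identical conditions.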
In order to perform the experiments with the dynamical clustering and k-means methods, an implementation from (Costa et al., 2002a) was used. In terms of the parameters of these two methods, the number of clusters was set to the number of a priori classes (the number of clusters was also set to the number of a priori classes in the other methods), and the number of distinct initialisations used was 100.
In the CLICK experiments, the original implementation was used; as it does not support all the proximity indices studied, only a subset of them could be compared with this method. Missing data was also not supported, so only the CDC 25 data sets were used in the CLICK experiments. The homogeneity, the other algorithm parameter, was set to its default value.
The SOM Toolbox for Matlab was used to run the SOM experiments (SOM Toolbox available at: http://www.cis.hut.fi/projects/somtoolbox). The original implementation only supported the Euclidean distance; thus, in order to include the Pearson correlation and the angular separation, modifications were made in the code. SOM requires parametrisation experiments in order to tune its performance. Such experiments were guided by previous studies with gene expression data, where it was found that the topology was the parameter with the highest impact on the results (Jonsson, 2001).
In order to set the other parameters of SOM, a method of the toolbox that uses a number of heuristics to set the parameters was employed (for more details, see the description of the method som_make in Vesanto et al. (2000)). As not all the results obtained by this parametrisation were satisfactory, another parametrisation based on the one used in Vesanto & Alhoniemi (2000) was also employed (this parametrisation is detailed in Appendix A). In the ordering phase, 10 epochs and a learning rate of 0.5 were used; the initial radius was set to the topology's highest dimension and the final radius to half the highest dimension. In the convergence phase, 10 epochs and a learning rate of 0.05 were used; the initial radius was set to half the highest topology dimension minus 1 and the final radius to 1. In both phases, the neighbourhood function was the Gaussian. In relation to the topology, the following procedure was applied. An initial topology is chosen, and experiments with a larger and a smaller topology are also performed. If the initial topology obtains the best results, then no more experiments are done. Otherwise, the same process is repeated for the topology with the best result.
The R software was used in the hierarchical clustering experiments (software available at: http://www.r-project.org). The average linkage method was employed, as it is the most extensively used in the gene expression literature (Eisen et al., 1998). As the external index used in this work is suitable only for partition comparison, the resulting hierarchies were cut at a given level in order to provide partitions (see Section 3.1.5). In the experiments with gene expression data, sub-trees with fewer than 5 objects were ignored.
Chapter 5
Results
The results of the comparative analysis are presented and analysed in this chapter.
Section 5.1 describes the results achieved in the comparative analysis of the proximity
indices, while Section 5.2 describes the results achieved in the comparative analysis
of the clustering methods. The results of the experiments for the selection of parameters for SOM are reported in Appendix A, while Appendix B presents detailed statistics of the results.
5.1.1 Experiments
Figure 5.1 shows the mean values of corrected Rand for the experiments performed
with the FC Yeast All data set (the higher the corrected Rand, the higher the accu-
racy). Regarding the experiments with SOM, ED3 and P C obtained higher values
than the other proximity indices. In these cases, the hypotheses of no difference (null
hypothesis) were rejected in favour of ED3 and P C (at significance level α of 0.01).
In respect to the hierarchical clustering, ED1 and ED2 achieved lower values than
the other proximity indices. In these cases, the hypotheses of no difference were re-
jected in favour of ED3 , AS and P C at α of 0.01. For all other methods, except for
the SOM and hierarchical clustering, AS obtained a higher accuracy than the other
proximity indices. In these cases, the hypotheses of no difference between AS and
the other proximity indices were rejected in favour of AS at α = 0.05. In fact, with
all clustering methods, except for SOM and hierarchical clustering, the hypotheses of no difference between ED1 and the remaining proximity indices were also rejected, against ED1.
Figure 5.1: Mean of corrected Rand values from the FC Yeast All experiments
In Figure 5.2, the mean values of corrected Rand with the Reduced FC Yeast All data set are illustrated. In the experiments with SOM, ED3 obtained a higher accuracy than the other proximity indices; in these cases, the null hypotheses were rejected in favour of ED3 at α = 0.01. In the hierarchical clustering, ED1 and ED2 achieved lower values than the other proximity indices; in these cases, the hypotheses of no difference were rejected in favour of ED3, AS and P C at α = 0.01. Furthermore, still in the hierarchical clustering, P C obtained a higher accuracy than the other proximity indices; the null hypotheses were rejected in favour of P C at α = 0.05. For the dynamical clustering and k-means, both with and without hierarchical initialisation, AS achieved a lower accuracy in comparison to the other proximity indices. For these four experiments, the hypotheses of no difference between AS and all other proximity indices were rejected in favour of ED1, ED2, ED3 and P C at α = 0.02. Furthermore, still with these four methods, ED1 obtained an accuracy as high as ED2, ED3 and P C.
Figure 5.2: Mean of corrected Rand values from the Reduced FC Yeast All experi-
ments
The results for the FC CDC 25 data set are summarised in Figure 5.3. In the dy-
namical clustering (with or without hierarchical initialisation), ED2 and AS achieved
a higher accuracy than ED1 . In both cases, the null hypotheses were rejected in favour
of ED2 and AS at α = 0.02. In the results with k-means (with or without hierarchical
initialization), ED2 had a higher accuracy in comparison to ED1 . In this case, the
null hypothesis was rejected in favour of ED2 at α = 0.02. In terms of CLICK, SOM
and hierarchical clustering, no significant difference was detected among the results.
Figure 5.3: Mean of corrected Rand values from the FC CDC 25 experiments
Figure 5.4 shows the mean values of corrected Rand with the Series CDC 25 data set. In all methods, the hypotheses of no difference between ED1 and the other proximity indices were rejected in favour of ED2, ED3, AS and P C at α = 0.01. The accuracies of ED3, AS and P C were also higher than that of ED2 in several of the methods.
Figure 5.4: Mean of corrected Rand values from the Series CDC 25 experiments
5.1.2 Discussions
ED1 led to the lowest accuracies in all but the Reduced FC Yeast All data set. These results were already expected, due to the fact that this proximity index is not suitable for capturing relative magnitude (or shape) dissimilarity. In the Series CDC 25 data set, which is the only data set with the classification directly related to the series shape, the difference between ED1 and the other proximity indices was the most pronounced.
With respect to the Reduced FC Yeast All data set, ED1 had values as high as the other proximity indices, while not showing a significant advantage over them. This is actually a rather interesting result, which shows that this data set, by having a reduced set of genes, has characteristics distinct from data sets with a higher number of genes, such as the FC Yeast All data set. Recalling Section 4.2.1, the Reduced FC classification was devised from the results achieved in Eisen et al. (1998). In other words, this reduced set of classes comprises the ones most easily classified in previous studies, so one could argue that these gene profiles are so well separated that even ED1 is capable of discriminating them. Furthermore, it can also be said that this data set is biased: in the experiments carried out in Eisen et al. (1998), the hierarchical clustering was used with the Pearson correlation to cluster the results. Not surprisingly, in the experiments with this data set, P C was among the indices with the highest accuracy.
In the FC Yeast All data set, AS obtained significantly higher values than the others. On the other hand, in the FC CDC 25 data set, ED2 obtained the highest values. ED2, ED3 and P C achieved the highest values in the Reduced FC Yeast All data set, while ED3, AS and P C had the highest values in the Series CDC 25 data set. One possible reason for these contrasting results is that the data sets were captured with distinct microarray technologies. The elements in the Yeast All data sets were captured with cDNA microarrays, where the expression values represent the log ratio between the measured and control expression levels. In contrast, the elements in the CDC 25 data sets were captured with oligonucleotide arrays, where the expression values represent the mean difference between the PM and MM probes.
5.2.1 Experiments
The results of Section 5.1 were used to select the proximity indices for the comparative
analysis of the clustering methods. Indeed, only the proximity indices with best
accuracy for a given clustering method and data set were selected. Table 5.1 shows
these proximity indices.
Table 5.1: Proximity indices with best accuracy in the experiments of Section 5.1 for
a given clustering method and data set.
Figure 5.5: Mean of corrected Rand values from the FC Yeast All experiments
In Figure 5.5, the mean values of corrected Rand for the experiments with the FC Yeast All data set are shown. The dynamical clustering obtained a higher accuracy than the other clustering methods; the null hypotheses were rejected in favour of the dynamical clustering in comparison to random assignment and hierarchical clustering at α = 0.01. SOM and k-means also achieved a significantly higher accuracy than random assignment and hierarchical clustering; in these cases, the null hypotheses were rejected in favour of k-means and SOM in comparison to random assignment and hierarchical clustering. The dynamical clustering and k-means with hierarchical initialisation likewise outperformed random assignment and hierarchical clustering; in these cases, the null hypotheses were rejected in favour of dynamical clustering and k-means in comparison to random assignment (α = 0.05) and hierarchical clustering (α = 0.05).
Figure 5.6: Mean of corrected Rand values from the Reduced FC Yeast All experi-
ments
The mean values of corrected Rand for the experiments with the Reduced FC Yeast All data set are presented in Figure 5.6. The random assignment method obtained the lowest accuracy in comparison to all other methods, and the null hypotheses were rejected in favour of all other methods.
Figure 5.7: Mean of corrected Rand values from the FC CDC 25 experiments
Figure 5.7 illustrates the mean values of corrected Rand of the experiments with the FC CDC 25 data set. The CLICK method obtained a lower result than all other methods, including the random assignment; in these cases, the null hypotheses were rejected in favour of all other methods at α = 0.01. k-means (with or without hierarchical initialisation) and SOM obtained a significantly higher accuracy than random assignment and hierarchical clustering; the null hypotheses were rejected in favour of SOM and k-means at α = 0.01. Dynamical clustering (with or without hierarchical initialisation) also obtained a significantly higher accuracy than random assignment and hierarchical clustering; the null hypotheses were rejected in favour of dynamical clustering at α = 0.05.
Figure 5.8 shows the mean values of corrected Rand for the experiments performed with the Series CDC 25 data set. The random assignment method obtained the lowest results in comparison to all other methods. In these experiments, the null hypotheses were rejected in favour of SOM, hierarchical clustering, CLICK, dynamical clustering and k-means.

Figure 5.8: Mean of corrected Rand values from the Series CDC 25 experiments
5.2.2 Discussions
The hierarchical clustering achieved competitive results only in the two smaller data sets (Reduced FC Yeast All and Series CDC 25), where it obtained accuracies as high as the other methods. It can be concluded that the hierarchical clustering has some problems in clustering the larger data sets formed with the complete Functional Classification (FC) scheme. The clusters in the data sets based on the FC scheme are not as compact and isolated as the ones based on the Reduced FC and the series shape classifications. The FC data sets have a higher number of genes, and their classification was not devised from gene expression analysis. Given the lack of robustness of the hierarchical clustering methods to outliers and noisy data (see Section 3.1.5), the low accuracies in the FC data sets are expected. These results are also compatible with other comparative analyses of clustering methods for gene expression: in Datta & Datta (2003), the average linkage hierarchical clustering also obtained worse results than other methods, and a similar behaviour was observed in Costa et al. (2002b).
Some comments about the results of the CLICK method should also be made. In the Series CDC 25 experiments, CLICK achieved the highest mean corrected Rand in relation to all other methods. On the other hand, CLICK obtained negative values in the FC CDC 25 data set. As mentioned before, the CLICK method estimates the number of clusters automatically. This task was perfectly performed in the Series CDC 25, where 6 clusters were found in most of the experiments. This was not the case in the FC CDC 25 experiments, where the number of clusters varied between 20 and 26 with P C, and between 5 and 7 with AS. These results suggest that CLICK showed instability in clustering the FC CDC 25 gene expression data set. It could be the case that CLICK presents problems similar to those of the hierarchical clustering; however, only one data set with the complete Functional Classification was used in the CLICK experiments. Further experiments are necessary to investigate this issue properly.
k-means and dynamical clustering (with or without hierarchical initialisation) and SOM obtained high accuracies in all experiments. The use of the hierarchical initialisation does not affect the accuracy of k-means and dynamical clustering, even if the hierarchical method alone does not achieve a good accuracy. In fact, the hierarchical initialisation reduces the run time of both dynamical clustering and k-means experiments, as there is no need for several random initialisations (see Section 3.1.2). SOM has one main disadvantage in relation to k-means and dynamical clustering: it required more complex experiments for selecting the parameters. On the other hand, SOM returns a topological map, where the clusters have neighbourhood relations. Such a structure is much more informative than the simple partitions returned by k-means and dynamical clustering. Furthermore, in the experiments performed in this dissertation, the number of clusters was already known. However, in a problem where this number is unknown, the use of k-means and dynamical clustering would also require procedures for estimating the number of clusters.
The results of the experiments with the Reduced FC Yeast All data set reinforce the suggestion made in Section 5.1.2 that there is some bias in this data set. Even though k-means and dynamical clustering (with or without hierarchical initialisation) achieved good results, the hierarchical clustering had the highest accuracy. Again, this is not a coincidence, as the functional classes present in this classification scheme were the ones most easily clustered in the experiments of Eisen et al. (1998), which employed the hierarchical clustering. In the data sets where the complete Functional Classification was used, a low agreement with the clustering results was encountered. In these experiments, the mean values of corrected Rand were below 0.05, which indicates clustering solutions found by chance (Milligan & Cooper, 1986). A previous study (Gerstein & Jansen, 2000), using similar data sets, had already indicated that the functional classification has only a weak relation to the clustering of gene expression profiles. The reasons for this are, among others, the vague definitions of some functions and the great overlap of the classes (Gerstein & Jansen, 2000). These weak relations were also found in Kuramochi & Karypis (2001). In the context of this work, the previous issues do not represent a problem, since this work is concerned only with the comparison of the clustering methods (or proximity indices), and not with the evaluation of the quality of the clusters generated.
The results also give support to the validation methodology employed in this work and, as a consequence, to the validity of the results encountered. As expected, the random assignment method showed the lowest accuracy (or accuracies as low as those of the other methods) in all experiments.
Chapter 6

Conclusions

This dissertation presented a comparative analysis of clustering methods and proximity indices in the context of gene expression time series. In order to do so, a validation methodology based on the k-fold cross-validation procedure and the use of gene annotation was proposed. The study carried out in this dissertation is more complete than previous ones, as it used more data sets and included methods not evaluated before, such as SOM and dynamical clustering. Furthermore, no comparative analysis of proximity indices had been performed before.
In the comparative analysis of the proximity indices, the results did not indicate the superiority of one particular index over the others. In three out of four data sets, the Euclidean distance version with original data obtained the worst results. This was already expected, since that proximity index does not capture relative magnitude proximity. With respect to the proximity indices that capture relative magnitude: in the
FC Yeast All data set, the angular separation version achieved the highest values;
while in the FC CDC 25 data set, the Euclidean distance version with normalisation
achieved the highest values. In the Series CDC 25 data set, where series shape was
directly taken into consideration in the classification labels, all relative magnitude
proximity indices achieved high values. From these results, no relative magnitude
proximity index can be stated to be superior to the others.
SOM, dynamical clustering and k-means had the best accuracies in all experiments. Furthermore, the use of the hierarchical method as initialisation for dynamical clustering and k-means resulted in a substantial reduction of run time, with no loss in accuracy.
The comparative analysis carried out in this dissertation only compared the ac-
curacy of the clustering methods. However, it is important to point out that other
characteristics should be taken into consideration in the choice of a clustering method.
For example, one should also consider the type of output of the clustering method.
SOM, for instance, gives a topological map as result, a structure more informative
than the partitions provided by CLICK, dynamical clustering and k-means. As another example, some methods, such as CLICK, do not require the number of clusters to be set, which is not the case for k-means and dynamical clustering. This characteristic is very important when the number of clusters in the data set is unknown.
Another contribution of this work is the proposed validation methodology, which has a reduced computational cost. The methodology showed consistent results, especially with the random assignment method: such a method obtained the lowest results (or results as low as those of other methods) in all data sets. Furthermore, the use of the functional classification as external labels proved to be feasible.
The analysis of the absolute accuracy obtained in the data sets with the complete
Functional Classification is another contribution of this work. The results reinforce
the findings of previous works (Gertein & Janssen, 2000; Kuramochi & Karypis,
2001), where it was found that the functional classification of the genes has only a
weak relation to gene expression data.
The number of public data sets of gene expression time series with an external classification is undesirably low. The use of new data sets in the future is vital for answering some of the questions raised in this work. One of the questions is whether a particular proximity index is more suitable for data captured with a particular type of microarray technology. Another issue to be further evaluated is the poor results obtained with the CLICK method in the experiments with the FC CDC 25 data set. It should be investigated whether the poor results are related to the use of the complete Functional Classification.
Other types of biological information have already been used as external categories. Such data can be used in the proposed validation methodology as a complement to the use of the functional classification. Among these sources there are: regulatory regions, protein structure and metabolic pathways (Gerstein & Jansen, 2000; Zhu & Zhang, 2000).
This analysis can also be enhanced with the inclusion of new clustering methods, in particular, methods with good results in other comparative analyses (Datta & Datta, 2003) and model-based clustering methods, which are now extensively applied to the analysis of gene expression time series (Schliep et al., 2003).
Finally, future work should carry out Monte Carlo experiments with the generation of artificial data sets, so as to evaluate the characteristics of the proposed validation methodology.
Bibliography
Azuaje, F. (2002), A cluster validity framework for genome expression data, Bioin-
formatics, 18(2):319-320.
Bertone, P., Gerstein, M. (2001), Integrative data mining: the new direction in bioin-
formatics, IEEE Engineering in Medicine and Biology, 20:33-40.
Bo, T., Jonassen, I. (2002), New feature subset selection procedures for classification
of expression profiles, Genome Biology, 3(4):research0017.1-0017.11.
Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M., Haussler, D. (2000), Knowledge-based analysis of microarray gene expression data by using support vector machines, Proc. of National Academy of Sciences USA, 97(1):262-267.
Brown, P. O., Botstein, D. (1999), Exploring the new world of the genome with DNA microarrays, Nature Genetics, 21:33-37.
Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfs-
berg, T., Gabrielian, A., Landsman, D., Lockhart, J., Davis, W. (1998), A genome-
wide transcriptional analysis of the mitotic cell cycle, Molecular Cell, 2:65-73.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O., Herskowitz, I. (1998), The transcriptional program of sporulation in budding yeast, Science, 282:699-705.
Costa, I. G., de Carvalho, F. A. T., de Souto, M. C. P. (2002a), A Symbolic Approach to Gene Expression Time Series Analysis, Proc. of the VII Brazilian Symposium on Neural Networks, IEEE Computer Society, 1:24-30.
Datta S., Datta, S. (2003), Comparisons and validation of statistical clustering tech-
niques for microarray gene expression data, Bioinformatics, 19:459-466.
DeRisi, J. L., Iyer V. R., Brown P. O. (1997), Exploring the metabolic and genetic
control of gene expression on a genomic scale, Science, 278:680-686.
D'Haeseleer, P., Liang, S., Somogyi, R. (1999), Gene Expression Data Analysis and Modeling, tutorial, Pacific Symposium on Biocomputing.
Diday, E., Simon, J. C. (1980), Clustering Analysis, in Digital Pattern Recognition, Springer-Verlag.
Dopazo, J., Zanders, E., Dragoni, I., Amplett, G., Falciani, F. (2001), Methods and approaches in the analysis of gene expression data, Journal of Immunological Methods, 250:93-112.
Dougherty, E. R., Chen, Y., Batman, S., Bittner, M. L.(1997) Digital measurement
Dubes, R. (1987), How many clusters are best? An experiment, Pattern Recognition,
20(6):645-663.
Dubes, R. (1998), Cluster Analysis and Related Issues, in Handbook of Pattern Recognition & Computer Vision, World Scientific Publishing, Second Edition, 3-32.
Duggan D. J., Bittner M., Chen Y., Meltzer P., Trent J. (1999), Expression profiling
using cDNA microarrays, Nature Genetics, 21:10-14.
Efron, B., Tibshirani, R. (1993), An Introduction to the Bootstrap, Chapman & Hall, New York.
Eisen, M. B., Spellman, P. T., Brown, P. O., Botstein, D. (1998), Cluster analysis and
display of genome-wide expression patterns, Proc. of National Academy of Sciences
USA, 95:14863-14868.
Fuhrman, S., Cunningham, M. J., Wen, X., Zweiger, G., Seilhamer, J. J., Somogyi, R. (2000), The application of Shannon entropy in the identification of putative drug targets, Biosystems, 55:5-14.
Gerstein, M., Jansen, R. (2000), The current excitement in bioinformatics - analysis of whole genome expression data: how does it relate to protein structure and function?, Current Opinion in Structural Biology, 10(5):574-84.
Golub, T. R., Slonim, D. K, Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P.,
Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A. (1999), Class Prediction
and Discovery Using Gene Expression Data, Science, 286:531-537.
Heyer, L. J., Kruglyak, S., Yooseph, S. (1999), Exploring expression data: identifica-
tion and analysis of coexpressed genes, Genome Research, 9(11):1106-1115.
Jain A. K., Dubes, R. C. (1988), Algorithms for clustering data, Prentice Hall, New
Jersey.
Jain, A. K., Murty, M. N., Flynn, P. J. (1999), Data Clustering: a review, ACM Computing Surveys, 31(3):264-323.
Jain, A. K., Moreau, J. V. (1987), Bootstrap technique in cluster analysis, Pattern Recognition, 20(5):547-568.
Lipshutz, R., Fodor, S., Gingeras, T., Lockhart, D. (1999), High density oligonu-
cleotide arrays. Nature Genetics, 21:20-24.
Lubovac, Z., Olsson, B., Jonsson, P., Laurio, K., Andersson, M. L. (2001), Biological
Mewes, H. W., Frishman, D., Güldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Münsterkoetter, M., Rudd, S., Weil, B. (2002), MIPS: a database for genomes and protein sequences, Nucleic Acids Research, 30(1):31-34.
U.S. Department of Energy, DOE human genome program (1992), Primer on molecular genetics, Washington, D.C.
Raychaudhuri, S., Sutphin, P. D., Chang, J. T., Altman, R. B. (2001), Basic microar-
ray analysis: grouping and feature reduction, Trends in Biotechnology, 19(5):189-
193.
Riley M. (1998), Genes and proteins of Escherichia coli K-12, Nucleic Acids Research,
26:54.
Schena, M., Shalon, D., Davis, R. W., Brown, P. O. (1995), Quantitative monitor-
ing of gene expression patterns with a complementary DNA microarray, Science,
270:467-470.
Schliep, A., Schoenhuth, A., Steinhoff, C. (2003), Using Hidden Markov Models to Analyze Gene Expression Time Course Data, Proc. of the International Conference on Intelligent Systems for Molecular Biology, Bioinformatics, 19(Suppl. 1):i255-i263.
Schuchhardt, J., Beule, D., Malik, A., Wolski, E., Eickhoff, H., Lehrach, H., Herzel, H. (2000), Normalization strategies for cDNA microarrays, Nucleic Acids Research, 28(10):e47.
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., Futcher, B. (1998), Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Molecular Biology of the Cell, 9:3273-3297.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., Golub, T. R. (1999), Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proc. of National Academy of Sciences USA, 96:2907-2912.
Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., Church, G. M. (1999), Systematic determination of genetic network architecture, Nature Genetics, 22:281-285.
The Gene Ontology Consortium (2000), Gene Ontology: tool for the unification of biology, Nature Genetics, 25:25-29.
van Helden, J., Gilbert, D., Wernisch, L., Schroeder, M., Wodak, S. (2001), Appli-
cations of regulatory sequence analysis & metabolic network analysis to the in-
terpretation of gene expression data, Lecture Notes in Computer Science, 2066:
155-172.
Vesanto J., Alhoniemi, E. (2000), Clustering of the Self-Organizing Map, IEEE Trans-
actions on Neural Networks, 11(3):586-600.
Vesanto, J., Himberg, J., Alhoniemi, E., Parhankangas, J. (2000), SOM Toolbox for
Matlab 5, Technical Report, Helsinki University of Technology, Neural Networks
Research Centre.
Yang, Y., Dudoit, S., Luu, P., Lin, D., Peng, V., Ngai, J., Speed, T. (2001), Normal-
ization for cDNA Microarray Data: a robust composite method addressing single
& multiple slide systematic variation.
Zhu J., Zhang M.Q. (2000), Cluster, function & promoter: analysis of yeast expression
array, Proc. of Pacific Symposium on Biocomputing, 479-490.
Appendix A
Parametrisation of SOM
This appendix illustrates the results of the parametrisation experiments with SOM.
As stated in Section 4.3.3, SOM requires parametrisation experiments in order to tune
its performance. Given the number of available parameters and the complexity of
choosing them, only a reduced set of parameters was varied. Previous studies with
gene expression data have found that the topology is the parameter with the highest
impact on the results (Jonsson, 2001). As a result, the topology was the only parameter
varied.
The following procedure was applied to vary the topology. First, an initial topology
is chosen. Then, experiments with a larger and a smaller topology are performed. If
the initial topology obtains the best result, no further experiments are done. Other-
wise, the same process is repeated around the topology with the best result. In the
FC Yeast All and FC CDC 25 data sets, the initial topology was 10x10, while in the
Reduced FC Yeast All and Series CDC 25 data sets it was 5x5; the search is sketched
below.
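The Python sketch below makes the procedure concrete. It rests on two assumptions
the text leaves open: topologies are square (s x s grids) and grid sizes change by a
constant step between experiments. The function evaluate_topology is hypothetical,
standing in for a full run that trains a SOM with the given grid and returns its mean
corrected Rand value.

    def search_topology(evaluate_topology, initial=10, step=5):
        # Score the initial topology, then repeatedly compare the current
        # best against one larger and one smaller grid, moving only when a
        # neighbour improves on it.
        current, best = initial, evaluate_topology(initial)
        while True:
            neighbours = [current + step, max(current - step, 2)]
            scores = {s: evaluate_topology(s) for s in neighbours}
            champion = max(scores, key=scores.get)
            if scores[champion] <= best:
                return current, best  # neither neighbour improves: stop
            current, best = champion, scores[champion]

    # Toy run with an evaluation function that peaks at a 5x5 grid:
    print(search_topology(lambda s: -abs(s - 5)))  # -> (5, 0)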
In order to set the other SOM parameters, a method of the toolbox that uses a
number of heuristics to choose them was used (this parametrisation is referred to as
DEFAULT). To check whether the results of this heuristic parametrisation were sat-
isfactory, another parametrisation, based on the one used in Vesanto & Alhoniemi
(2000), was also employed (this parametrisation is referred to as VESANTO).
The VESANTO parametrisation used 10 epochs and a learning rate of 0.5 during the
ordering phase, with the initial radius set to the highest dimension of the topology
and the final radius to half the highest dimension. In the convergence phase, 10
epochs and a learning rate of 0.05 were used, with the initial radius set to half the
highest topology dimension minus one and the final radius to one. The exact initial
and final radii are given in Tables A.1 and A.2, and the schedule is sketched after
this paragraph.
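Written out as plain data, the schedule takes a form like the following. This is only
an illustrative sketch mirroring the parameter values stated above for an arbitrary
grid; the actual experiments used the Matlab SOM Toolbox, so the dictionary below
is not the toolbox interface.

    def vesanto_schedule(rows, cols):
        # Two-phase VESANTO schedule; the radii derive from the highest
        # dimension of the topology, as described in the text.
        max_dim = max(rows, cols)
        return {
            "ordering": {"epochs": 10, "learning_rate": 0.5,
                         "radius_initial": max_dim,
                         "radius_final": max_dim / 2},
            "convergence": {"epochs": 10, "learning_rate": 0.05,
                            "radius_initial": max_dim / 2 - 1,
                            "radius_final": 1},
        }

    print(vesanto_schedule(10, 10))  # schedule for the 10x10 initial topology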
Table A.1: Topologies and parameters used in the VESANTO parametrisation with
the FC Yeast All and FC CDC 25 data sets.
Table A.3 shows the type of parametrisation and the topology that obtained the
best accuracy for each proximity index and data set. On the whole, topologies smaller
than the initial one obtained the best results. Regarding the type of parametrisation,
neither VESANTO nor DEFAULT showed an advantage over the other.
Table A.2: Topologies and parameters used in the VESANTO parametrisation with
the Reduced FC Yeast All and Series CDC 25 data sets.
Table A.3: Type of parametrisation and topologies with best accuracy in the experi-
ments with SOM.
Appendix B
Results of the Experiments
Table B.1: Detailed results of the SOM method in the experiments with the FC Yeast
All data set
Table B.2: Detailed results of the hierarchical clustering method in the experiments
with the FC Yeast All data set
Table B.3: Detailed results of the dynamical clustering method in the experiments
with the FC Yeast All data set
Table B.4: Detailed results of the dynamical clustering method with the hierarchical
initialisation in the experiments with the FC Yeast All data set
Table B.5: Detailed results of the k-means method in the experiments with the FC
Yeast All data set
Table B.6: Detailed results of the k-means method with the hierarchical initialisation
in the experiments with the FC Yeast All data set
Table B.7: Detailed results of the SOM method in the experiments with the Reduced
FC Yeast All data set
Table B.8: Detailed results of the hierarchical clustering method in the experiments
with the Reduced FC Yeast All data set
Table B.9: Detailed results of the dynamical clustering method in the experiments
with the Reduced FC Yeast All data set
Table B.10: Detailed results of the dynamical clustering with the hierarchical initial-
isation method in the experiments with the Reduced FC Yeast All data set
Table B.11: Detailed results of the k-means method in the experiments with the
Reduced FC Yeast All data set
Table B.12: Detailed results of the k-means with the hierarchical initialisation method
in the experiments with the Reduced FC Yeast All data set
Table B.13: Detailed results of the SOM method in the experiments with the FC
CDC 25 data set
Table B.14: Detailed results of the hierarchical clustering method in the experiments
with the FC CDC 25 data set
Table B.15: Detailed results of the dynamical clustering method in the experiments
with the FC CDC 25 data set
Table B.16: Detailed results of the dynamical clustering method with the hierarchical
initialisation in the experiments with the FC CDC 25 data set
Table B.17: Detailed results of the k-means method in the experiments with the FC
CDC 25 data set
Table B.18: Detailed results of the k-means method with the hierarchical initialisation
in the experiments with the FC CDC 25 data set
              AS              PC
Minimum       -0.0078200000   -0.0356400
1st Quartile   0.0000000000   -0.0200780
Mean          -0.0005700667   -0.0021954
Median         0.0000000000   -0.0012555
3rd Quartile   0.0000000000    0.0093820
Maximum        0.0000000000    0.0434950
Std Dev.       0.0017106661    0.0205161
Table B.19: Detailed results of the CLICK method in the experiments with the FC
CDC 25 data set
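Each detailed-results table reports the same seven summary statistics over the cor-
rected Rand values of the individual runs. A minimal sketch of how such a summary
can be computed, using placeholder values rather than results from the experiments:

    import numpy as np

    def summarise(rand_values):
        # Five-number summary plus mean and sample standard deviation of
        # the corrected Rand values from a set of runs.
        v = np.asarray(rand_values, dtype=float)
        q1, med, q3 = np.percentile(v, [25, 50, 75])
        return {"Minimum": v.min(), "1st Quartile": q1, "Mean": v.mean(),
                "Median": med, "3rd Quartile": q3, "Maximum": v.max(),
                "Std Dev.": v.std(ddof=1)}

    print(summarise([0.02, 0.34, 0.41, 0.58, 0.73]))  # placeholder values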
Table B.20: Detailed results of the SOM method in the experiments with the Series
CDC 25 data set
Table B.21: Detailed results of the hierarchical clustering method in the experiments
with the Series CDC 25 data set
Table B.22: Detailed results of the dynamical clustering method in the experiments
with the Series CDC 25 data set
Table B.23: Detailed results of the dynamical clustering method with the hierarchical
initialisation in the experiments with the Series CDC 25 data set
Table B.24: Detailed results of the k-means method in the experiments with the
Series CDC 25 data set
Table B.25: Detailed results of the k-means method with the hierarchical initialisation
in the experiments with the Series CDC 25 data set
              AS             PC
Minimum       -0.15189800    0.0259480
1st Quartile  -0.03358100    0.3414435
Mean           0.01529393    0.4206158
Median         0.00000000    0.4075320
3rd Quartile   0.03937275    0.5755730
Maximum        0.38362000    0.7283580
Std Dev.       0.09963067    0.1914951
Table B.26: Detailed results of the CLICK method in the experiments with the Series
CDC 25 data set