Gap Statistic for Cluster Estimation

The document presents the Gap statistic method for estimating the number of clusters in a dataset. The Gap statistic compares the within-cluster dispersion for different values of k to its expected value under a reference null distribution, such as a uniform distribution. It chooses the number of clusters with the largest gap, as that number is most unlikely to have arisen by chance. The method was shown to outperform other existing indices in simulations with varying cluster configurations and amounts of overlap between clusters.

Estimating the Number of Data Clusters via the Gap Statistic


Paper by:
Robert Tibshirani, Guenther Walther
and Trevor Hastie
J.R. Statist. Soc. B (2001), 63, pp. 411--423
BIOSTAT M278, Winter 2004
Presented by Andy M. Yip
February 19, 2004
Part I:
General Discussion on Number of Clusters
Cluster Analysis

Goal: partition the observations {x_i} so that

C(i) = C(j) if x_i and x_j are similar
C(i) ≠ C(j) if x_i and x_j are dissimilar

A natural question: how many clusters?


Input parameter to some clustering algorithms
Validate the number of clusters suggested by a
clustering algorithm
Conform with domain knowledge?
What's a Cluster?

No rigorous definition
Subjective
Scale/Resolution dependent (e.g. hierarchy)

A reasonable answer seems to be:


application dependent
(domain knowledge required)
What do we want?

An index that tells us: Consistency/Uniformity

more likely to be 2 than 3


more likely to be 36 than 11
more likely to be 2 than 36?
(depends, what if each circle represents 1000 objects?)
What do we want?

An index that tells us: Separability

increasing confidence to be 2
Do we want?

An index that is
independent of cluster volume?
independent of cluster size?
independent of cluster shape?
sensitive to outliers?
etc

Domain Knowledge!
Part II:
The Gap Statistic
Within-Cluster Sum of Squares

$$D_r = \sum_{i \in C_r} \sum_{j \in C_r} \|x_i - x_j\|^2 = 2 n_r \sum_{i \in C_r} \|x_i - \bar{x}_r\|^2$$

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$$

Measure of compactness of clusters
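The identity above makes W_k cheap to compute from data and labels, with no pairwise loop. A minimal NumPy sketch (the function name is mine, not from the paper):

```python
import numpy as np

def within_cluster_ss(X, labels):
    """W_k = sum_r D_r / (2 n_r).  By the identity above, this equals
    sum_r sum_{i in C_r} ||x_i - xbar_r||^2."""
    W = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        W += ((pts - pts.mean(axis=0)) ** 2).sum()
    return W

# Two tight, well-separated pairs of points:
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
W2 = within_cluster_ss(X, np.array([0, 0, 1, 1]))  # the "correct" 2-clustering
W1 = within_cluster_ss(X, np.array([0, 0, 0, 0]))  # everything in one cluster
# W2 is much smaller than W1, reflecting the tighter clusters.
```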


Using Wk to determine # clusters

Idea of L-Curve Method: use the k corresponding to the elbow


(the most significant increase in goodness-of-fit)
Gap Statistic

Problems with using the L-curve method:

no reference clustering to compare against
the differences W_k − W_{k+1} are not normalized for comparison

Gap Statistic:
normalize the curve log W_k vs. k
null hypothesis: reference distribution
Gap(k) := E*(log W_k) − log W_k
Find the k that maximizes Gap(k) (within some tolerance)
Choosing the Reference Distribution

A single component is modelled by a log-concave distribution (strong unimodality, Ibragimov's theorem):

f(x) = e^{φ(x)} where φ(x) is concave

Counting # modes in a unimodal distribution doesn't work: it is impossible to set a C.I. for # modes; strong unimodality is needed.
Choosing the Reference Distribution

Insights from the k-means algorithm:

$$\mathrm{Gap}(k) = \log\frac{\mathrm{MSE}_{X^*}(k)}{\mathrm{MSE}_{X^*}(1)} - \log\frac{\mathrm{MSE}_{X}(k)}{\mathrm{MSE}_{X}(1)}$$

Note that Gap(1) = 0

Find X* (log-concave) that corresponds to no cluster structure (k = 1)

Solution in 1-D: the uniform distribution U[0,1] attains

$$\inf_{X^*} \log\frac{\mathrm{MSE}_{X^*}(k)}{\mathrm{MSE}_{X^*}(1)} = \log\frac{\mathrm{MSE}_{U[0,1]}(k)}{\mathrm{MSE}_{U[0,1]}(1)}$$

However, in higher-dimensional cases, no log-concave distribution attains

$$\inf_{X^*} \log\frac{\mathrm{MSE}_{X^*}(k)}{\mathrm{MSE}_{X^*}(1)}$$

The authors suggest mimicking the 1-D case and using a uniform distribution as the reference in higher dimensions.
Two Types of Uniform Distributions

1. Align with feature axes (data-geometry independent)

[Figure: observations → bounding box aligned with feature axes → Monte Carlo simulations]

2. Align with principal axes (data-geometry dependent)

[Figure: observations → bounding box aligned with principal axes → Monte Carlo simulations]
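Both reference types can be generated with a few lines of NumPy. A sketch (names are my own) that rotates into the principal axes via SVD for the second type:

```python
import numpy as np

def uniform_reference(X, rng, align_principal=False):
    """Draw one uniform reference sample with the same shape as X.
    align_principal=False: uniform over the feature-axis bounding box.
    align_principal=True:  rotate to principal axes, sample uniformly over
    that bounding box, then rotate back (data-geometry dependent)."""
    if not align_principal:
        return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal axes
    Z = Xc @ Vt.T                                      # coordinates in principal axes
    Zb = rng.uniform(Z.min(axis=0), Z.max(axis=0), size=Z.shape)
    return Zb @ Vt + mean                              # back to original coordinates

rng = np.random.default_rng(0)
# Elongated, rotated cloud: the principal-axis box hugs it much more tightly.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
Xa = uniform_reference(X, rng)                        # type 1: feature axes
Xb = uniform_reference(X, rng, align_principal=True)  # type 2: principal axes
```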
Computation of the Gap Statistic

for b = 1 to B
    Draw Monte Carlo sample X_{1b}, X_{2b}, ..., X_{nb} from the reference (n is # obs.)
for k = 1 to K
    Cluster the observations into k groups and compute log W_k
    for b = 1 to B
        Cluster the b-th M.C. sample into k groups and compute log W_{kb}
    Compute $\mathrm{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B} \log W_{kb} - \log W_k$
    Compute sd(k), the s.d. of $\{\log W_{kb}\}_{b=1,\dots,B}$
    Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$
Find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$

An error-tolerant, normalized elbow!
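A runnable sketch of the whole procedure, simplified for brevity: single-start Lloyd's k-means stands in for the clustering step, the feature-axis uniform box is the reference, and the standard-error stopping rule is left to the caller (all function names are mine):

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50, seed=0):
    """Single-start Lloyd's algorithm; returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center, then recompute centers
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for r in range(k):
            if (labels == r).any():
                centers[r] = X[labels == r].mean(axis=0)
    return labels

def log_wk(X, labels):
    """log W_k, with W_k = sum_r sum_{i in C_r} ||x_i - xbar_r||^2."""
    W = sum(((X[labels == r] - X[labels == r].mean(axis=0)) ** 2).sum()
            for r in np.unique(labels))
    return np.log(W)

def gap_statistic(X, K=5, B=10, seed=0):
    """Gap(k) = (1/B) sum_b log W_kb - log W_k, with a uniform reference
    over the feature-axis bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, K + 1):
        lw = log_wk(X, kmeans_labels(X, k))
        ref = []
        for b in range(B):
            Xb = rng.uniform(lo, hi, size=X.shape)
            ref.append(log_wk(Xb, kmeans_labels(Xb, k, seed=b)))
        gaps.append(np.mean(ref) - lw)
    return np.array(gaps)  # gaps[k-1] = Gap(k)

# Two well-separated Gaussian blobs: Gap(2) should stand out.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
gaps = gap_statistic(X)
```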


2-Cluster Example
No-Cluster Example (tech. report version)
No-Cluster Example (journal version)
Example on DNA Microarray Data

6834 genes
64 human tumours
The Gap curve rises at k = 2 and 6
Other Approaches

Calinski and Harabasz '74:
$$CH(k) = \frac{B_k/(k-1)}{W_k/(n-k)}$$

Krzanowski and Lai '85:
$$KL(k) = \left|\frac{(k-1)^{2/p} W_{k-1} - k^{2/p} W_k}{k^{2/p} W_k - (k+1)^{2/p} W_{k+1}}\right|$$

Hartigan '75:
$$H(k) = \left(\frac{W_k}{W_{k+1}} - 1\right)(n - k - 1)$$

Kaufman and Rousseeuw '90 (silhouette):
$$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s(i), \qquad s(i) = \frac{b(i) - a(i)}{\max\{b(i), a(i)\}}$$
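For comparison, the Calinski-Harabasz index is straightforward to compute under the same notation (B_k is the between-cluster sum of squares); a sketch with a function name of my choosing:

```python
import numpy as np

def ch_index(X, labels):
    """CH(k) = (B_k / (k-1)) / (W_k / (n-k)); larger is better.  Undefined
    for k = 1, since the between-cluster term has a k-1 denominator."""
    n = len(X)
    ids = np.unique(labels)
    k = len(ids)
    overall = X.mean(axis=0)
    Wk = Bk = 0.0
    for r in ids:
        pts = X[labels == r]
        c = pts.mean(axis=0)
        Wk += ((pts - c) ** 2).sum()              # within-cluster dispersion
        Bk += len(pts) * ((c - overall) ** 2).sum()  # between-cluster dispersion
    return (Bk / (k - 1)) / (Wk / (n - k))

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
good = ch_index(X, np.array([0, 0, 1, 1]))  # matches the true grouping
bad = ch_index(X, np.array([0, 1, 0, 1]))   # mixes the two groups
# good is orders of magnitude larger than bad
```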
Simulations (50x)

a. 1 cluster: 200 points in 10-D, uniformly distributed
b. 3 clusters: each with 25 or 50 points in 2-D, normally distributed, with centers (0,0), (0,5) and (5,-3)
c. 4 clusters: each with 25 or 50 points in 3-D, normally distributed, with centers randomly chosen from N(0, 5I) (simulations with clusters having min distance less than 1.0 were discarded)
d. 4 clusters: each with 25 or 50 points in 10-D, normally distributed, with centers randomly chosen from N(0, 1.9I) (simulations with clusters having min distance less than 1.0 were discarded)
e. 2 clusters: each with 100 points in 3-D, elongated shape, well-separated
Overlapping Classes

50 observations from each of two bivariate normal populations with means (0,0) and (δ,0), and covariance I
δ = 10 values in [0, 5]
10 simulations for each δ
Conclusions

Gap outperforms existing indices by normalizing against the 1-cluster null hypothesis
Gap is simple to use
No study on data sets having hierarchical structures is given
Choice of reference distribution in high-D cases?
Clustering-algorithm dependent?
