Gap Statistic for Cluster Estimation

The document presents the Gap statistic method for estimating the number of clusters in a dataset. The Gap statistic compares the within-cluster dispersion for different values of k to its expected value under a reference null distribution, such as a uniform distribution. It chooses the number of clusters with the largest gap, as that number is most unlikely to have arisen by chance. The method was shown to outperform other existing indices in simulations with varying cluster configurations and amounts of overlap between clusters.

Estimating the Number of Data Clusters via the Gap Statistic


Paper by:
Robert Tibshirani, Guenther Walther
and Trevor Hastie
J.R. Statist. Soc. B (2001), 63, pp. 411--423
BIOSTAT M278, Winter 2004
Presented by Andy M. Yip
February 19, 2004
Part I:
General Discussion on Number of Clusters
Cluster Analysis

Goal: partition the observations {x_i} so that

C(i) = C(j) if x_i and x_j are similar
C(i) ≠ C(j) if x_i and x_j are dissimilar

A natural question: how many clusters?


Input parameter to some clustering algorithms
Validate the number of clusters suggested by a
clustering algorithm
Conform with domain knowledge?
What's a Cluster?

No rigorous definition
Subjective
Scale/Resolution dependent (e.g. hierarchy)

A reasonable answer seems to be:


application dependent
(domain knowledge required)
What do we want?

An index that tells us: Consistency/Uniformity

more likely to be 2 than 3


more likely to be 36 than 11
more likely to be 2 than 36?
(depends, what if each circle represents 1000 objects?)
What do we want?

An index that tells us: Separability

increasing confidence to be 2
Do we want?

An index that is
independent of cluster volume?
independent of cluster size?
independent of cluster shape?
sensitive to outliers?
etc

Domain Knowledge!
Part II:
The Gap Statistic
Within-Cluster Sum of Squares

$$D_r = \sum_{i \in C_r} \sum_{j \in C_r} \|x_i - x_j\|^2 = 2 n_r \sum_{i \in C_r} \|x_i - \bar{x}_r\|^2$$

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r$$

Measure of compactness of clusters
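The identity above makes W_k cheap to compute from data and labels, with no pairwise loop. A minimal NumPy sketch (the function name is mine, not from the paper):

```python
import numpy as np

def within_cluster_ss(X, labels):
    """W_k = sum_r D_r / (2 n_r).  By the identity above, this equals
    sum_r sum_{i in C_r} ||x_i - xbar_r||^2."""
    W = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        W += ((pts - pts.mean(axis=0)) ** 2).sum()
    return W

# Two tight, well-separated pairs of points:
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
W2 = within_cluster_ss(X, np.array([0, 0, 1, 1]))  # the "correct" 2-clustering
W1 = within_cluster_ss(X, np.array([0, 0, 0, 0]))  # everything in one cluster
# W2 is much smaller than W1, reflecting the tighter clusters.
```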


Using Wk to determine # clusters

Idea of L-Curve Method: use the k corresponding to the elbow


(the most significant increase in goodness-of-fit)
Gap Statistic

Problems with using the L-curve method:

no reference clustering to compare against
the differences W_k − W_{k+1} are not normalized for comparison

Gap Statistic:
normalize the curve log W_k vs. k
null hypothesis: reference distribution
Gap(k) := E*(log W_k) − log W_k
Find the k that maximizes Gap(k) (within some tolerance)
Choosing the Reference Distribution

A single component is modelled by a log-concave distribution (strong unimodality, Ibragimov's theorem):

f(x) = e^{φ(x)} where φ(x) is concave

Counting # modes in a unimodal distribution doesn't work: it is impossible to set a C.I. for # modes; strong unimodality is needed.
Choosing the Reference Distribution

Insights from the k-means algorithm:

$$\mathrm{Gap}(k) = \log\frac{\mathrm{MSE}_{X^*}(k)}{\mathrm{MSE}_{X^*}(1)} - \log\frac{\mathrm{MSE}_{X}(k)}{\mathrm{MSE}_{X}(1)}$$

Note that Gap(1) = 0

Find X* (log-concave) that corresponds to no cluster structure (k = 1)

Solution in 1-D: the uniform distribution U[0,1] attains

$$\inf_{X^*} \log\frac{\mathrm{MSE}_{X^*}(k)}{\mathrm{MSE}_{X^*}(1)} = \log\frac{\mathrm{MSE}_{U[0,1]}(k)}{\mathrm{MSE}_{U[0,1]}(1)}$$

However, in higher-dimensional cases, no log-concave distribution attains

$$\inf_{X^*} \log\frac{\mathrm{MSE}_{X^*}(k)}{\mathrm{MSE}_{X^*}(1)}$$

The authors suggest mimicking the 1-D case and using a uniform distribution as the reference in higher dimensions.
Two Types of Uniform Distributions

1. Align with feature axes (data-geometry independent)

[Figure: observations → bounding box aligned with feature axes → Monte Carlo simulations]

2. Align with principal axes (data-geometry dependent)

[Figure: observations → bounding box aligned with principal axes → Monte Carlo simulations]
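Both reference types can be generated with a few lines of NumPy. A sketch (names are my own) that rotates into the principal axes via SVD for the second type:

```python
import numpy as np

def uniform_reference(X, rng, align_principal=False):
    """Draw one uniform reference sample with the same shape as X.
    align_principal=False: uniform over the feature-axis bounding box.
    align_principal=True:  rotate to principal axes, sample uniformly over
    that bounding box, then rotate back (data-geometry dependent)."""
    if not align_principal:
        return rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal axes
    Z = Xc @ Vt.T                                      # coordinates in principal axes
    Zb = rng.uniform(Z.min(axis=0), Z.max(axis=0), size=Z.shape)
    return Zb @ Vt + mean                              # back to original coordinates

rng = np.random.default_rng(0)
# Elongated, rotated cloud: the principal-axis box hugs it much more tightly.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
Xa = uniform_reference(X, rng)                        # type 1: feature axes
Xb = uniform_reference(X, rng, align_principal=True)  # type 2: principal axes
```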
Computation of the Gap Statistic

for b = 1 to B
    Draw Monte Carlo sample X_{1b}, X_{2b}, ..., X_{nb} from the reference (n is # obs.)
for k = 1 to K
    Cluster the observations into k groups and compute log W_k
    for b = 1 to B
        Cluster the b-th M.C. sample into k groups and compute log W_{kb}
    Compute $\mathrm{Gap}(k) = \frac{1}{B}\sum_{b=1}^{B} \log W_{kb} - \log W_k$
    Compute sd(k), the s.d. of $\{\log W_{kb}\}_{b=1,\dots,B}$
    Set the total s.e. $s_k = \sqrt{1 + 1/B}\,\mathrm{sd}(k)$
Find the smallest k such that $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$

An error-tolerant, normalized elbow!
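A runnable sketch of the whole procedure, simplified for brevity: single-start Lloyd's k-means stands in for the clustering step, the feature-axis uniform box is the reference, and the standard-error stopping rule is left to the caller (all function names are mine):

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50, seed=0):
    """Single-start Lloyd's algorithm; returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center, then recompute centers
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for r in range(k):
            if (labels == r).any():
                centers[r] = X[labels == r].mean(axis=0)
    return labels

def log_wk(X, labels):
    """log W_k, with W_k = sum_r sum_{i in C_r} ||x_i - xbar_r||^2."""
    W = sum(((X[labels == r] - X[labels == r].mean(axis=0)) ** 2).sum()
            for r in np.unique(labels))
    return np.log(W)

def gap_statistic(X, K=5, B=10, seed=0):
    """Gap(k) = (1/B) sum_b log W_kb - log W_k, with a uniform reference
    over the feature-axis bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, K + 1):
        lw = log_wk(X, kmeans_labels(X, k))
        ref = []
        for b in range(B):
            Xb = rng.uniform(lo, hi, size=X.shape)
            ref.append(log_wk(Xb, kmeans_labels(Xb, k, seed=b)))
        gaps.append(np.mean(ref) - lw)
    return np.array(gaps)  # gaps[k-1] = Gap(k)

# Two well-separated Gaussian blobs: Gap(2) should stand out.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
gaps = gap_statistic(X)
```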


2-Cluster Example
No-Cluster Example (tech. report version)
No-Cluster Example (journal version)
Example on DNA Microarray Data

6834 genes
64 human tumours
The Gap curve rises at k = 2 and 6
Other Approaches

Calinski and Harabasz '74:
$$CH(k) = \frac{B_k/(k-1)}{W_k/(n-k)}$$

Krzanowski and Lai '85:
$$KL(k) = \left|\frac{(k-1)^{2/p} W_{k-1} - k^{2/p} W_k}{k^{2/p} W_k - (k+1)^{2/p} W_{k+1}}\right|$$

Hartigan '75:
$$H(k) = \left(\frac{W_k}{W_{k+1}} - 1\right)(n - k - 1)$$

Kaufman and Rousseeuw '90 (silhouette):
$$\bar{s} = \frac{1}{n}\sum_{i=1}^{n} s(i), \qquad s(i) = \frac{b(i) - a(i)}{\max\{b(i), a(i)\}}$$
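For comparison, the Calinski-Harabasz index is straightforward to compute under the same notation (B_k is the between-cluster sum of squares); a sketch with a function name of my choosing:

```python
import numpy as np

def ch_index(X, labels):
    """CH(k) = (B_k / (k-1)) / (W_k / (n-k)); larger is better.  Undefined
    for k = 1, since the between-cluster term has a k-1 denominator."""
    n = len(X)
    ids = np.unique(labels)
    k = len(ids)
    overall = X.mean(axis=0)
    Wk = Bk = 0.0
    for r in ids:
        pts = X[labels == r]
        c = pts.mean(axis=0)
        Wk += ((pts - c) ** 2).sum()              # within-cluster dispersion
        Bk += len(pts) * ((c - overall) ** 2).sum()  # between-cluster dispersion
    return (Bk / (k - 1)) / (Wk / (n - k))

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
good = ch_index(X, np.array([0, 0, 1, 1]))  # matches the true grouping
bad = ch_index(X, np.array([0, 1, 0, 1]))   # mixes the two groups
# good is orders of magnitude larger than bad
```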
Simulations (50x)

a. 1 cluster: 200 points in 10-D, uniformly distributed
b. 3 clusters: each with 25 or 50 points in 2-D, normally distributed, with centers (0,0), (0,5) and (5,-3)
c. 4 clusters: each with 25 or 50 points in 3-D, normally distributed, with centers randomly chosen from N(0, 5I) (simulations with clusters having min distance less than 1.0 were discarded)
d. 4 clusters: each with 25 or 50 points in 10-D, normally distributed, with centers randomly chosen from N(0, 1.9I) (simulations with clusters having min distance less than 1.0 were discarded)
e. 2 clusters: each with 100 points in 3-D, elongated shape, well-separated
Overlapping Classes

50 observations from each of two bivariate normal populations with means (0,0) and (δ,0), and covariance I
δ = 10 values in [0, 5]
10 simulations for each δ
Conclusions

Gap outperforms existing indices by normalizing against the 1-cluster null hypothesis
Gap is simple to use
No study on data sets having hierarchical structures is given
Choice of reference distribution in high-D cases?
Clustering-algorithm dependent?
