
Clustering 3: Hierarchical clustering (continued); choosing the number of clusters

Ryan Tibshirani
Data Mining: 36-462/36-662

January 31 2013

Optional reading: ISL 10.3, ESL 14.3

1
Even more linkages
Last time we learned about hierarchical agglomerative clustering; the basic idea is to repeatedly merge the two most similar groups, as measured by the linkage

Three linkages so far: single, complete, and average linkage. Properties:

- Single and complete linkage can have problems with chaining and crowding, respectively, but average linkage doesn't
- Cutting an average linkage tree provides no interpretation, but there is a nice interpretation for single and complete linkage trees
- Average linkage is sensitive to a monotone transformation of the dissimilarities d_ij, but single and complete linkage are not
- All three linkages produce dendrograms with no inversions

Actually, there are many more linkages out there, each having
different properties. Today: we’ll look at two more

2
Reminder: linkages
Our setup: given X_1, ..., X_n and pairwise dissimilarities d_ij. (E.g., think of X_i ∈ R^p and d_ij = ‖X_i − X_j‖_2)

Single linkage: measures the closest pair of points

    d_single(G, H) = min_{i∈G, j∈H} d_ij

Complete linkage: measures the farthest pair of points

    d_complete(G, H) = max_{i∈G, j∈H} d_ij

Average linkage: measures the average dissimilarity over all pairs

    d_average(G, H) = (1 / (n_G · n_H)) Σ_{i∈G, j∈H} d_ij
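As a quick illustration (not from the slides), here is a minimal R sketch that computes these three linkage scores directly for two made-up groups of points:

set.seed(1)
xG = matrix(rnorm(10 * 2), ncol = 2)           # group G: 10 hypothetical points in R^2
xH = matrix(rnorm(8 * 2, mean = 3), ncol = 2)  # group H: 8 hypothetical points in R^2
cross = as.matrix(dist(rbind(xG, xH)))[1:10, 11:18]  # d_ij for i in G, j in H

d.single = min(cross)      # closest cross-group pair
d.complete = max(cross)    # farthest cross-group pair
d.average = mean(cross)    # average over all cross-group pairs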

3
Centroid linkage
Centroid linkage¹ is commonly used. Assume that X_i ∈ R^p, and d_ij = ‖X_i − X_j‖_2. Let X̄_G, X̄_H denote the group averages for G, H. Then:

    d_centroid(G, H) = ‖X̄_G − X̄_H‖_2

[Scatter plot.] Example (dissimilarities d_ij are distances, groups are marked by colors): the centroid linkage score d_centroid(G, H) is the distance between the group centroids (i.e., the group averages)

¹ Eisen et al. (1998), "Cluster Analysis and Display of Genome-Wide Expression Patterns"
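Continuing the hypothetical two-group sketch from above, the centroid linkage score is just the distance between the two group means:

xbar.G = colMeans(xG)                        # group average of G
xbar.H = colMeans(xH)                        # group average of H
d.centroid = sqrt(sum((xbar.G - xbar.H)^2))  # ‖X̄_G − X̄_H‖_2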
4
Centroid linkage is the standard in biology
Centroid linkage is simple: easy to understand, and easy to
implement. Maybe for these reasons, it has become the standard
for hierarchical clustering in biology

5
Centroid linkage example
Here n = 60, X_i ∈ R^2, d_ij = ‖X_i − X_j‖_2. Cutting the tree at some heights wouldn't make sense ... because the dendrogram has inversions! But we can, e.g., still look at the output with 3 clusters

[Figure: scatter plot of the n = 60 points colored by the 3-cluster assignment, alongside the centroid linkage dendrogram (height axis), which has inversions.]

Cut interpretation: there isn’t one, even with no inversions

6
Shortcomings of centroid linkage
- Can produce dendrograms with inversions, which really messes up the visualization
- Even if we were lucky enough to have no inversions, there is still no interpretation for the clusters resulting from cutting the tree
- Answers change with a monotone transformation of the dissimilarity measure d_ij = ‖X_i − X_j‖_2. E.g., changing to d_ij = ‖X_i − X_j‖_2^2 would give a different clustering
[Two scatter plots of the same data clustered with centroid linkage, with panel titles "distance" and "distance^2": the left panel uses d_ij = ‖X_i − X_j‖_2, the right uses d_ij = ‖X_i − X_j‖_2^2, and the resulting clusterings differ.]

7
Minimax linkage
Minimax linkage² is a newcomer. First define the radius of a group of points G around X_i as r(X_i, G) = max_{j∈G} d_ij. Then:

    d_minimax(G, H) = min_{i∈G∪H} r(X_i, G ∪ H)

[Scatter plot.] Example (dissimilarities d_ij are distances, groups marked by colors): the minimax linkage score d_minimax(G, H) is the smallest radius encompassing all points in G and H. The center X_c is the black point
² Bien et al. (2011), "Hierarchical Clustering with Prototypes via Minimax Linkage"
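Continuing the same hypothetical sketch, the minimax linkage score and the corresponding center can be computed directly:

xGH = rbind(xG, xH)            # all points in G ∪ H
dGH = as.matrix(dist(xGH))     # pairwise distances within G ∪ H
radii = apply(dGH, 1, max)     # r(X_i, G ∪ H) for each candidate center X_i
d.minimax = min(radii)         # smallest covering radius
Xc = xGH[which.min(radii), ]   # the corresponding center, itself a data point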
8
Minimax linkage example
Same data as before. Cutting the tree at h = 2.5 gives the clustering assignments marked by the colors

[Figure: scatter plot of the data colored by cluster assignment, alongside the minimax linkage dendrogram (height axis) cut at h = 2.5.]

Cut interpretation: each point X_i belongs to a cluster whose center X_c satisfies d_ic ≤ 2.5

9
Properties of minimax linkage

- Cutting a minimax tree at a height h gives a nice interpretation: each point is ≤ h in dissimilarity to the center of its cluster. (This is related to a famous set cover problem)
- Produces dendrograms with no inversions
- Unchanged by a monotone transformation of the dissimilarities d_ij
- Produces clusters whose centers are chosen among the data points themselves. Remember that, depending on the application, this can be a very important property. (Hence minimax clustering is the analogue of K-medoids in the world of hierarchical clustering)

10
Example: Olivetti faces dataset

(From Bien et al. (2011))


11
(From Bien et al. (2011))
12
Centroid and minimax linkage in R

The function hclust in the base distribution performs hierarchical agglomerative clustering with centroid linkage (as well as many other linkages)

E.g.,

d = dist(x)                                # pairwise Euclidean distances
tree.cent = hclust(d, method="centroid")   # hierarchical clustering with centroid linkage
plot(tree.cent)                            # draw the dendrogram

The function protoclust in the package protoclust implements hierarchical agglomerative clustering with minimax linkage
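E.g., a usage sketch, assuming the protoclust package is installed (protocut is its function for cutting the tree; check the package documentation for the exact arguments and return values):

library(protoclust)
tree.mm = protoclust(d)        # minimax linkage on the same dissimilarities d
plot(tree.mm)
cut3 = protocut(tree.mm, k=3)  # cluster assignments plus the prototype (center) of each cluster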

13
Linkages summary
Linkage     No inversions?   Cut interpretation?   Unchanged by monotone transformation?   Notes
Single      yes              yes                   yes                                     chaining
Complete    yes              yes                   yes                                     crowding
Average     yes              no                    no
Centroid    no               no                    no                                      simple
Minimax     yes              yes                   yes                                     centers are data points

Note: this doesn't tell us what the "best" linkage is

What's missing here: a detailed empirical comparison of how they perform. On top of this, remember that choosing a linkage can be very situation dependent

14
Designing a clever radio system (e.g., Pandora)
Suppose we have a bunch of songs, and dissimilarity scores between each pair. We're building a clever radio system: a user is going to give us an initial song, and a measure of how "risky" he is willing to be, i.e., the maximal tolerable dissimilarity between suggested songs

How could we use hierarchical clustering, and with what linkage?


15
Placing cell phone towers

Suppose we are helping to place cell phone towers on top of some buildings throughout the city. The cell phone company is looking to build a small number of towers, such that no building is further than half a mile from a tower

How could we use hierarchical clustering, and with what linkage?

16
How many clusters?

Sometimes, using K-means, K-medoids, or hierarchical clustering, we might have no problem specifying the number of clusters K ahead of time, e.g.,
- Segmenting a client database into K clusters for K salesmen
- Compressing an image using vector quantization, where K controls the compression rate

Other times, K is implicitly defined by cutting a hierarchical clustering tree at a given height, e.g., designing a clever radio system or placing cell phone towers

But in most exploratory applications, the number of clusters K is unknown. So we are left asking the question: what is the "right" value of K?

17
This is a hard problem
Determining the number of clusters is a hard problem!

Why is it hard?
- Determining the number of clusters is a hard task for humans to perform (unless the data are low-dimensional). Not only that, it's just as hard to explain what it is we're looking for. Usually, statistical learning is successful when at least one of these is possible

Why is it important?
- E.g., it might make a big difference scientifically if we were convinced that there were K = 2 subtypes of breast cancer vs. K = 3 subtypes
- One of the (larger) goals of data mining/statistical learning is automatic inference; choosing K is certainly part of this

18
Reminder: within-cluster variation
We’re going to focus on K-means, but most ideas will carry over
to other settings

Recall: given the number of clusters K, the K-means algorithm approximately minimizes the within-cluster variation

    W = Σ_{k=1}^K Σ_{C(i)=k} ‖X_i − X̄_k‖_2^2

over clustering assignments C, where X̄_k is the average of the points in group k: X̄_k = (1/n_k) Σ_{C(i)=k} X_i

Clearly a lower value of W is better. So why not just run K-means for a bunch of different values of K, and choose the value of K that gives the smallest W(K)?
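A minimal R sketch of trying exactly that (with made-up data, not the slides' data): track W(K) over a range of K using kmeans, which returns W as tot.withinss.

set.seed(1)
x = matrix(rnorm(250 * 2), ncol = 2)   # hypothetical data
W = sapply(1:10, function(k)
  kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:10, W, type = "b", xlab = "K", ylab = "Within-cluster variation W(K)")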

19
That’s not going to work
Problem: within-cluster variation just keeps decreasing

Example: n = 250, p = 2, K = 1, ..., 10

[Figure: scatter plot of the data, alongside the within-cluster variation W(K) plotted against K; W(K) decreases steadily as K grows.]

20
Between-cluster variation
Within-cluster variation measures how tightly grouped the clusters
are. As we increase the number of clusters K, this just keeps going
down. What are we missing?

Between-cluster variation measures how spread apart the groups are from each other:

    B = Σ_{k=1}^K n_k ‖X̄_k − X̄‖_2^2

where as before X̄_k is the average of the points in group k, and X̄ is the overall average, i.e.

    X̄_k = (1/n_k) Σ_{C(i)=k} X_i   and   X̄ = (1/n) Σ_{i=1}^n X_i
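A minimal sketch (continuing the made-up data x from the earlier kmeans sketch): kmeans reports B as betweenss, and it can also be computed by hand from the definition.

km = kmeans(x, centers = 2, nstart = 20)
xbar = colMeans(x)                     # overall average X̄
diffs = sweep(km$centers, 2, xbar)     # rows are the X̄_k − X̄
B = sum(km$size * rowSums(diffs^2))    # Σ_k n_k ‖X̄_k − X̄‖_2^2
all.equal(B, km$betweenss)             # matches kmeans' own bookkeeping (up to rounding)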

21
Example: between-cluster variation
Example: n = 100, p = 2, K = 2


[Figure: scatter plot of the n = 100 points in two clusters, with the cluster averages X̄_1, X̄_2 and the overall average X̄ marked.]

    B = n_1 ‖X̄_1 − X̄‖_2^2 + n_2 ‖X̄_2 − X̄‖_2^2

    W = Σ_{C(i)=1} ‖X_i − X̄_1‖_2^2 + Σ_{C(i)=2} ‖X_i − X̄_2‖_2^2

22
Still not going to work
Bigger B is better, so can we use it to choose K? Problem: between-cluster variation just keeps increasing

Running example: n = 250, p = 2, K = 1, ..., 10

[Figure: scatter plot of the data, alongside the between-cluster variation B(K) plotted against K; B(K) increases steadily as K grows.]

23
CH index
Ideally we’d like our clustering assignments C to simultaneously
have a small W and a large B

This is the idea behind the CH index.³ For clustering assignments coming from K clusters, we record the CH score:

    CH(K) = [B(K)/(K − 1)] / [W(K)/(n − K)]

To choose K, just pick some maximum number of clusters to be considered, K_max (e.g., K_max = 20), and choose the value of K with the largest score CH(K), i.e.,

    K̂ = argmax_{K ∈ {2,...,K_max}} CH(K)

³ Calinski and Harabasz (1974), "A dendrite method for cluster analysis"
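A rough R sketch of this recipe (using the made-up data x, and the betweenss and tot.withinss fields returned by kmeans, as in the earlier sketches):

n = nrow(x)
Kmax = 10
CH = sapply(2:Kmax, function(k) {
  km = kmeans(x, centers = k, nstart = 20)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
})
Khat = (2:Kmax)[which.max(CH)]   # K with the largest CH score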
24
Example: CH index
Running example: n = 250, p = 2, K = 2, ..., 10.

[Figure: scatter plot of the data, alongside the CH index plotted against K; the maximum is at K = 4.]

We would choose K = 4 clusters, which seems reasonable


General problem: the CH index is not defined for K = 1. We could
never choose just one cluster (the null model)!
25
Gap statistic
It’s true that W (K) keeps dropping, but how much it drops at any
one K should be informative

The gap statistic⁴ is based on this idea. We compare the observed within-cluster variation W(K) to W_unif(K), the within-cluster variation we'd see if we instead had points distributed uniformly (over an encapsulating box). The gap for K clusters is defined as

    Gap(K) = log W_unif(K) − log W(K)

The quantity log W_unif(K) is computed by simulation: we average the log within-cluster variation over, say, 20 simulated uniform data sets. We also compute the standard error s(K) of log W_unif(K) over the simulations. Then we choose K by

    K̂ = min { K ∈ {1, ..., K_max} : Gap(K) ≥ Gap(K+1) − s(K+1) }
⁴ Tibshirani et al. (2001), "Estimating the number of clusters in a data set via the gap statistic"
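A rough R sketch of this procedure (made-up data x from the earlier sketches; published implementations differ in details such as the exact standard-error factor):

gap.stat = function(x, Kmax = 10, nref = 20) {
  logW = function(data, k)
    log(kmeans(data, centers = k, nstart = 20)$tot.withinss)
  obs = sapply(1:Kmax, function(k) logW(x, k))
  rng = apply(x, 2, range)                     # bounding box of the data
  ref = replicate(nref, {
    xu = apply(rng, 2, function(r) runif(nrow(x), r[1], r[2]))  # uniform reference data
    sapply(1:Kmax, function(k) logW(xu, k))
  })                                           # Kmax x nref matrix of log W_unif(K)
  list(gap = rowMeans(ref) - obs,              # Gap(K)
       se  = apply(ref, 1, sd) / sqrt(nref))   # a simple standard error s(K)
}
gs = gap.stat(x)
# smallest K with Gap(K) >= Gap(K+1) - s(K+1)
Khat = which(head(gs$gap, -1) >= tail(gs$gap, -1) - tail(gs$se, -1))[1]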
26
Example: gap statistic
Running example: n = 250, p = 2, K = 1, ..., 10

[Figure: scatter plot of the data, alongside Gap(K) plotted against K; the rule selects K = 3.]

We would choose K = 3 clusters, which is also reasonable


The gap statistic does especially well when the data fall into one
cluster. (Why? Hint: think about the null distribution that it uses)
27
CH index and gap statistic in R
The CH index can be computed using the kmeans function in the base distribution, which returns both the within-cluster variation and the between-cluster variation (Homework 2)

E.g.,

k = 5
km = kmeans(x, k, alg="Lloyd")   # "alg" partially matches kmeans' algorithm argument
names(km)                        # the return items include tot.withinss and betweenss
# Now use some of these return items to compute ch
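One way to finish this off (a sketch, essentially the same computation shown on the CH index slide, using the betweenss and tot.withinss return items):

n = nrow(x)
ch = (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))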

The gap statistic is implemented by the function gap in the package lga, and by the function gap in the package SAGx. (Beware: these functions are poorly documented ... it's unclear what clustering method they're using)

28
Once again, it really is a hard problem

(Taken from George Cassella’s CMU talk on January 16 2011)


29
(From George Cassella’s CMU talk on January 16 2011)

30
Recap: more linkages, and determining K
Centroid linkage is commonly used in biology. It measures the
distance between group averages, and is simple to understand and
to implement. But it also has some drawbacks (inversions!)

Minimax linkage is a little more complex. It asks the question: "which point's furthest point is closest?", and defines the answer as the cluster center. This could be useful for some applications

Determining the number of clusters is both a hard and important problem. We can't simply try to find the K that gives the smallest achieved within-cluster variation. We defined between-cluster variation, and saw we also can't choose K to just maximize this

Two methods for choosing K: the CH index, which looks at a ratio of between to within, and the gap statistic, which is based on the difference between the within-cluster variation for our data and what we'd see from uniform data
31
Next time: principal components analysis

Finding interesting directions in our data set

(From ESL page 67)

32
