06 Clus3
Ryan Tibshirani
Data Mining: 36-462/36-662
January 31, 2013
Even more linkages
Last time we learned about hierarchical agglomerative clustering. The basic idea is to repeatedly merge the two most similar groups, as measured by the linkage.

Actually, there are many more linkages out there, each with different properties. Today we'll look at two more.
Reminder: linkages
Our setup: given X_1, ..., X_n and pairwise dissimilarities d_ij. (E.g., think of X_i ∈ R^p and d_ij = ‖X_i − X_j‖_2)
Centroid linkage
Centroid linkage¹ is commonly used. Assume that X_i ∈ R^p, and d_ij = ‖X_i − X_j‖_2. Let X̄_G, X̄_H denote the group averages for groups G, H. Then:

d_centroid(G, H) = ‖X̄_G − X̄_H‖_2
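To make the definition concrete, here is a minimal R sketch of the centroid linkage score between two groups, assuming the rows of a matrix x hold the observations and g, h are index vectors for the two groups (these names are hypothetical, not from the slides):

# centroid linkage: Euclidean distance between the two group averages
centroid.linkage = function(x, g, h) {
  xbar.g = colMeans(x[g, , drop = FALSE])   # group average of G
  xbar.h = colMeans(x[h, , drop = FALSE])   # group average of H
  sqrt(sum((xbar.g - xbar.h)^2))            # ||xbar_G - xbar_H||_2
}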
[Figure: example in which the dissimilarities d_ij are distances and the two groups are marked; the score d_centroid(G, H) is the distance between the group centroids (i.e., group averages).]
¹ Eisen et al. (1998), "Cluster Analysis and Display of Genome-Wide Expression Patterns"
Centroid linkage is the standard in biology
Centroid linkage is simple: easy to understand, and easy to
implement. Maybe for these reasons, it has become the standard
for hierarchical clustering in biology
Centroid linkage example
Here n = 60, X_i ∈ R², d_ij = ‖X_i − X_j‖_2. Cutting the tree at some heights wouldn't make sense, because the dendrogram has inversions! But we can, e.g., still look at the output with 3 clusters
[Figure: the data and the centroid-linkage dendrogram (Height on the vertical axis), with the 3-cluster result marked.]
Shortcomings of centroid linkage
I Can produce dendrograms with inversions, which really messes
up the visualization
I Even if were we lucky enough to have no inversions, still no
interpretation for the clusters resulting from cutting the tree
I Answers change with a monotone transformation of the
dissimilarity measure dij = kXi − Xj k2 . E.g., changing to
dij = kXi − Xj k22 would give a different clustering
[Figure: clustering results using the distances (left panel, "distance") vs. the squared distances (right panel, "distance^2").]
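To see this sensitivity in code, here is a minimal R sketch on a toy data set (nothing here is taken from the slides): cluster with centroid linkage on the distances and on the squared distances, then compare the 3-cluster cuts.

set.seed(0)
x = matrix(rnorm(60*2), ncol=2)                     # toy data: n = 60 points in R^2
d = dist(x)                                         # Euclidean distances
cl1 = cutree(hclust(d,   method="centroid"), k=3)   # clusters from d
cl2 = cutree(hclust(d^2, method="centroid"), k=3)   # clusters from d^2
table(cl1, cl2)                                     # the two assignments typically disagree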
Minimax linkage
Minimax linkage2 is a newcomer. First define radius of a group of
points G around Xi as r(Xi , G) = maxj∈G dij . Then:
dminimax (G, H) = min r(Xi , G ∪ H)
i∈G∪H
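A minimal R sketch of this score, assuming a full dissimilarity matrix dmat and index vectors g, h for the two groups (hypothetical names):

# minimax linkage: smallest radius r(X_i, G ∪ H) over candidate centers i in G ∪ H
minimax.linkage = function(dmat, g, h) {
  u = union(g, h)                                   # indices of the merged group
  radii = apply(dmat[u, u, drop = FALSE], 1, max)   # r(X_i, G ∪ H) for each i in G ∪ H
  min(radii)                                        # the smallest such radius
}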
[Figure: example in which the dissimilarities d_ij are distances and the two groups are marked; d_minimax(G, H) is the smallest radius encompassing all of the points in both groups.]
² Bien et al. (2011), "Hierarchical Clustering with Prototypes via Minimax Linkage"
Minimax linkage example
Same data as before. Cutting the tree at h = 2.5 gives the clustering assignments marked by the colors
[Figure: the data and the minimax-linkage dendrogram (Height on the vertical axis), cut at h = 2.5, with the resulting cluster assignments marked by colors.]
Properties of minimax linkage
Example: Olivetti faces dataset
E.g.,
d = dist(x)                                 # pairwise distances between the faces
tree.cent = hclust(d, method="centroid")    # hierarchical clustering, centroid linkage
plot(tree.cent)                             # plot the dendrogram
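From here we could, for example, cut the tree to extract cluster assignments (the choice of 10 clusters below is just for illustration, not from the slides):

labels = cutree(tree.cent, k=10)   # cluster label for each face
table(labels)                      # cluster sizes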
Linkages summary
Linkage    No inversions?   Cut interpretation?   Unchanged with monotone transformation?   Notes
Single          ✓                  ✓                            ✓                           chaining
Complete        ✓                  ✓                            ✓                           crowding
Average         ✓                  ✗                            ✗
Centroid        ✗                  ✗                            ✗                           simple
Minimax         ✓                  ✓                            ✓                           centers are data points
Designing a clever radio system (e.g., Pandora)
Suppose we have a bunch of songs, and dissimilarity scores between each pair. We're building a clever radio system: a user is going to give us an initial song, and a measure of how "risky" he is willing to be, i.e., the maximal tolerable dissimilarity between suggested songs (one way to act on this with a clustering tree is sketched below).
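One natural way to do this with a clustering tree (a sketch, not spelled out on this slide): cut the tree at the user's tolerance and suggest songs from the cluster containing the initial song. Minimax linkage is especially attractive here, since each cluster comes with a prototype song within the cut height of everything in its cluster; base R's hclust does not implement minimax linkage (the protoclust package does), so the sketch below uses complete linkage as a stand-in. The objects d, songs, start, and tol are hypothetical.

tree = hclust(d, method="complete")        # d: dist object of song dissimilarities
labels = cutree(tree, h=tol)               # cut at the user's tolerable dissimilarity
suggest = songs[labels == labels[start]]   # songs in the same cluster as the initial song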
How many clusters?
This is a hard problem
Determining the number of clusters is a hard problem!
Why is it hard?
- Determining the number of clusters is a hard task for humans to perform (unless the data are low-dimensional). Not only that, it's just as hard to explain what it is we're looking for. Usually, statistical learning is successful when at least one of these is possible
Why is it important?
- E.g., it might make a big difference scientifically if we were convinced that there were K = 2 subtypes of breast cancer vs. K = 3 subtypes
- One of the (larger) goals of data mining/statistical learning is automatic inference; choosing K is certainly part of this
Reminder: within-cluster variation
We’re going to focus on K-means, but most ideas will carry over
to other settings
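As a reminder, the standard within-cluster variation for K-means, with cluster assignments C and cluster averages X̄_1, ..., X̄_K, is

W(K) = Σ_{k=1}^{K} Σ_{i: C(i)=k} ‖X_i − X̄_k‖_2²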
That’s not going to work
Problem: within-cluster variation just keeps decreasing
Example: n = 250, p = 2, K = 1, . . . 10
[Figure: the data (n = 250, p = 2) and the within-cluster variation, which keeps decreasing as K grows.]
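A minimal R sketch of this curve, assuming a data matrix x (hypothetical), using the total within-cluster sum of squares returned by kmeans:

Ks = 1:10
W = sapply(Ks, function(k) kmeans(x, centers=k, nstart=20)$tot.withinss)
plot(Ks, W, type="b", xlab="K", ylab="Within-cluster variation")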
Between-cluster variation
Within-cluster variation measures how tightly grouped the clusters
are. As we increase the number of clusters K, this just keeps going
down. What are we missing?
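What we are missing is the between-cluster variation; the standard definition, with n_k points in cluster k, cluster averages X̄_k, and overall average X̄, is

B(K) = Σ_{k=1}^{K} n_k ‖X̄_k − X̄‖_2²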
Example: between-cluster variation
Example: n = 100, p = 2, K = 2
[Figure: the two-cluster example, with the cluster averages X̄_1, X̄_2 and the overall average X̄ marked.]
Still not going to work
Bigger B is better, so can we use it to choose K? Problem: the between-cluster variation just keeps increasing
[Figure: the data and the between-cluster variation, which keeps increasing as K grows.]
CH index
Ideally we'd like our clustering assignments C to simultaneously have a small W and a large B. The CH index³ balances the two:

CH(K) = [B(K)/(K − 1)] / [W(K)/(n − K)]

and we choose the number of clusters by maximizing it over a range of candidate values:

K̂ = argmax_{K ∈ {2,...,K_max}} CH(K)

³ Calinski and Harabasz (1974), "A dendrite method for cluster analysis"
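A minimal R sketch of computing the CH index over a grid of K, assuming a data matrix x with n rows (hypothetical names); kmeans returns both the between- and within-cluster sums of squares:

n = nrow(x)
Ks = 2:10
ch = sapply(Ks, function(k) {
  km = kmeans(x, centers=k, nstart=20)
  (km$betweenss/(k-1)) / (km$tot.withinss/(n-k))
})
Ks[which.max(ch)]   # the estimated number of clusters, K-hat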
Example: CH index
Running example: n = 250, p = 2, K = 2, . . . 10.
[Figures: the running example data shown with the CH index as a function of K, and with the Gap statistic as a function of K.]
E.g., for K = 5:

k = 5
km = kmeans(x, centers=k, algorithm="Lloyd")
names(km)                            # see what kmeans returns
# use the returned sums of squares to compute the CH index
W = km$tot.withinss
B = km$betweenss
ch = (B/(k-1)) / (W/(nrow(x)-k))
Once again, it really is a hard problem
Recap: more linkages, and determining K
Centroid linkage is commonly used in biology. It measures the
distance between group averages, and is simple to understand and
to implement. But it also has some drawbacks (inversions!)