Hierarchical Clustering Algorithm
Do not have to assume any particular number of clusters
Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level (see the sketch after this list)
They may correspond to meaningful taxonomies
Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
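As a concrete illustration of ‘cutting’ the dendrogram (not part of the original slides; the toy data and parameter values below are assumptions), a minimal SciPy sketch: the hierarchy is built once and then cut at different levels to obtain 2, 3, or 5 clusters.

```python
# Minimal sketch (not from the slides): build one dendrogram with SciPy,
# then "cut" it at different levels to obtain any desired number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))        # 20 toy points in 2-D (illustrative data)

Z = linkage(X, method="average")    # the full hierarchy (dendrogram)

for k in (2, 3, 5):                 # cut the same tree at different levels
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, labels)
```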
Hierarchical Clustering – Types
[Figure: dendrogram over points a–e, built bottom-up (agglomerative, Step 0 → Step 4) or top-down (divisive, Step 4 → Step 0)]
Agglomerative (AGNES):
Start with the points as individual clusters
At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
Divisive (DIANA):
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a point (or there are k clusters)
Agglomerative Clustering Algorithm
Basic algorithm:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
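The six steps above map almost line-for-line onto code. The following is a minimal, unoptimized sketch of my own (not from the slides), using the single-link (MIN) rule to decide which pair of clusters is ‘closest’:

```python
# Naive agglomerative clustering following steps 1-6 above (single-link merges).
import numpy as np

def agglomerative(points, k=1):
    # 1. Compute the proximity (distance) matrix between all points.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # 2. Let each data point be a cluster.
    clusters = [[i] for i in range(len(points))]
    # 3. Repeat until only one cluster (or k clusters) remains.
    while len(clusters) > k:
        # 4. Merge the two closest clusters (single link: closest pair of members).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dab = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dab < best[0]:
                    best = (dab, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
        # 5. The proximity-matrix update is implicit here: cluster-to-cluster
        #    distances are recomputed from the point-level matrix each pass.
    # 6. Stop when only a single cluster (or k clusters) remains.
    return clusters
```

Recomputing the cluster distances on every pass keeps this sketch at roughly the O(N³) time / O(N²) space noted later in this section; real implementations update the proximity matrix in place instead.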
Agglomerative Clustering Algorithm
Intermediate Situation
After some merging steps, we have some clusters
[Figure: proximity matrix over the current clusters C1–C5]
Agglomerative Clustering Algorithm
After Merging
After merging C2 and C5, the question is: “How do we update the proximity matrix?”
[Figure: proximity matrix with the merged cluster C2 ∪ C5 replacing C2 and C5; its proximities to C1, C3, and C4 are marked ‘?’]
How to Define Inter-Cluster Similarity
[Figure: pairwise proximity matrix over points p1–p5, asking how to measure similarity between two clusters]
MIN (Single Link)
MAX (Complete Link)
Average
Distance Between Centroids
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
Medoid: a chosen, centrally located object in the cluster
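A small code sketch of my own (not from the slides) makes the distance-based definitions above concrete; Ki and Kj are plain lists of NumPy points and dist is ordinary Euclidean distance:

```python
# Inter-cluster distance definitions from the list above.
import numpy as np

def dist(p, q):
    return np.linalg.norm(p - q)

def single_link(Ki, Kj):    # MIN: smallest pairwise point distance
    return min(dist(p, q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):  # MAX: largest pairwise point distance
    return max(dist(p, q) for p in Ki for q in Kj)

def average_link(Ki, Kj):   # average of all pairwise point distances
    return sum(dist(p, q) for p in Ki for q in Kj) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj):  # distance between the two cluster centroids
    return dist(np.mean(Ki, axis=0), np.mean(Kj, axis=0))
```

The medoid variant is the same as the centroid one, with the mean replaced by a chosen, centrally located member point.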
Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN / Single Link, MAX / Complete Link, and Average linkage, producing different dendrograms]
Complexity (for most of the cases):
Space: O(N²)
Time: O(N³)
Hierarchical Clustering
AGNES (Agglomerative Nesting) (1990)
Agglomerative clustering with single-link.
BIRCH (Balanced Iterative Reducing and Clustering Using
Hierarchies)
BIRCH
Balanced Iterative Reducing and Clustering Using Hierarchies
Zhang, Ramakrishnan & Livny, SIGMOD’96
Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to the order of the data records
BIRCH
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies
Agglomerative clustering designed for clustering a large amount
of numerical data
What does the BIRCH algorithm try to solve?
Most of the existing algorithms DO NOT consider the case that
datasets can be too large to fit in main memory.
They DO NOT concentrate on minimizing the number of scans of
the dataset.
I/O costs are very high.
The complexity of BIRCH is O(n) where n is the number of
objects to be clustered.
BIRCH: The idea by example
[Figure: CF tree with non-leaf nodes holding CF entries and leaf nodes chained in a doubly linked list (prev/next), each leaf holding entries CF1 … CFL]
BIRCH: CF Tree insertion
Start with the root
Find the CF entry in the root closest to the data point, move to that
child and repeat the process until a closest leaf entry is found.
At the leaf
If the point can be accommodated in the cluster, update the entry
If this addition violates the threshold T, split the entry; if this violates the limit L on entries per leaf, split the leaf. If its parent node is also full, split it too, and so on
Update the CF entries from the root to the leaf to accommodate this
point
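A CF entry summarizes a subcluster as CF = (N, LS, SS): the number of points, their linear sum, and the sum of their squared norms (the definitions from the SIGMOD’96 paper). The sketch below is my own illustration of why insertion and node splits stay cheap: adding a point or merging two entries only touches these three fields, and the radius used for the threshold test falls out of them.

```python
# Sketch of a BIRCH Clustering Feature: CF = (N, LS, SS).
import numpy as np

class CF:
    def __init__(self, dim):
        self.n = 0                 # number of points summarized
        self.ls = np.zeros(dim)    # linear sum of the points
        self.ss = 0.0              # sum of squared norms of the points

    def add(self, x):
        # Inserting a point only updates the three summary fields.
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        # CF vectors are additive, so parent entries along the root-to-leaf
        # path can absorb a new point (or a split child) cheaply.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of the summarized points from their centroid.
        return np.sqrt(max(self.ss / self.n - (self.ls @ self.ls) / self.n ** 2, 0.0))
```

Whether a point “can be accommodated” in a leaf entry is then decided by tentatively adding it and checking that the resulting radius (or diameter) stays within the threshold T.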
Concerns
Sensitive to insertion order of data points
Since the size of leaf nodes is fixed, the resulting clusters may not be natural
Clusters tend to be spherical given the radius and diameter measures
BIRCH Algorithm
BIRCH – Phase 1
Start with initial threshold and insert points into the tree
If memory runs out, increase the threshold value (T) and rebuild a smaller tree by reinserting values from the old tree and then the remaining values from the data set
A good initial threshold is important but hard to figure out
Outlier removal – outliers can be removed while rebuilding the tree
BIRCH – Phase 2
Optional
Phase 3 sometimes has a minimum input size at which it performs well, so Phase 2 prepares the tree for Phase 3.
Removes outliers and groups clusters.
Scans the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing more outliers and grouping crowded subclusters into larger ones.
BIRCH – Phase 3
Problems after phase 1:
Input order affects results
Splitting triggered by node size
Phase 3:
Consider CF entries in leaf nodes only
Cluster all leaf entries on their CF values using an existing algorithm
Algorithm used here: agglomerative hierarchical clustering
Use centroid as the representative of a cluster
BIRCH – Phase 4
Optional – Cluster refining
Additional passes over the data to correct inaccuracies and refine the
clusters further.
Additional scan(s) of the dataset, attaching each item to the closest of the centroids found.
Use clusters found in phase 3 as seeds
Redistribute data points to their closest seeds and form new clusters
Recalculate the centroids and redistribute the items.
Always converges
Removal of outliers
Acquisition of membership information
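A minimal sketch of my own (not from the slides) of one such refining pass: every item is attached to its closest seed centroid, and the centroids are then recomputed. Repeating the pass gives the convergence noted above.

```python
# One Phase-4-style refinement pass over the data.
import numpy as np

def refine(points, seeds):
    # Distance from every point to every seed centroid.
    d = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=-1)
    labels = d.argmin(axis=1)                  # membership information
    new_seeds = np.array([
        points[labels == k].mean(axis=0) if np.any(labels == k) else seeds[k]
        for k in range(len(seeds))             # keep a seed unchanged if it gets no points
    ])
    return labels, new_seeds
```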
BIRCH
Pros:
BIRCH outperforms existing algorithms (CLARANS and k-means) on large datasets in quality, speed, stability, and scalability
Scans the whole dataset only once
Handles outliers better
Cons:
Since each node in a CF tree can hold only a limited number of entries due to the size limit, a CF tree node doesn’t always correspond to what a user may consider a natural cluster.
Moreover, if the clusters are not spherical in shape, it doesn’t perform
well because it uses the notion of radius or diameter to control the
boundary of a cluster.
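For completeness, a short usage sketch with scikit-learn’s Birch estimator (assuming scikit-learn is installed; the data and parameter values are illustrative): threshold plays the role of T above, and n_clusters drives the global, Phase-3-style clustering of the leaf entries.

```python
# Illustrative use of scikit-learn's BIRCH implementation on toy data.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3, 6)])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # points per cluster
```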