
Hierarchical Clustering

 Group data objects into a tree of clusters.
 Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
 Does not have to assume any particular number of clusters
 Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
 The clusters may correspond to meaningful taxonomies
 Example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
[Figure: example dendrogram over six points showing the sequence of merges]
Hierarchical Clustering – Types
 Agglomerative (AGNES)
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
 Divisive (DIANA)
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there are k clusters)
[Figure: clusters a–e merged bottom-up by AGNES (steps 0→4) and split top-down by DIANA (steps 4→0)]
Agglomerative Clustering Algorithm
 Basic algorithm:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
 The key operation is the computation of the proximity of two clusters
 Different approaches to defining the distance between clusters distinguish the different algorithms (see the sketch below).
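For concreteness, here is a minimal Python sketch of this basic procedure, using single-link (MIN) proximity between clusters. The function and variable names are illustrative, not from any particular library, and the implementation is deliberately naive (no proximity-matrix updates, so it runs in more than O(N^3) time).

import numpy as np

def agglomerative(points, k=1):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters (single-link / MIN proximity) until k clusters remain."""
    points = np.asarray(points, dtype=float)
    # Steps 1-2: pairwise distances, and each point starts as its own cluster
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = [[i] for i in range(len(points))]

    def proximity(a, b):
        # Single link: smallest pairwise distance between the two clusters
        return min(dist[i, j] for i in a for j in b)

    # Steps 3-6: repeat merging until only k clusters remain
    while len(clusters) > k:
        pairs = [(proximity(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)                     # the two closest clusters
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Example: three tight groups of 2-D points
pts = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0), (9, 1)]
print(agglomerative(pts, k=3))   # -> [[0, 1], [2, 3], [4, 5]]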
Agglomerative Clustering Algorithm
 Starting Point
 Start with clusters of individual points and a proximity matrix
[Figure: initial proximity matrix over the individual points p1, p2, p3, p4, p5, …]
Agglomerative Clustering Algorithm
 Intermediate Situation
 After some merging steps, we have some clusters
[Figure: proximity matrix over clusters C1–C5 after some merging steps]
Agglomerative Clustering Algorithm
 After Merging
 The question is: "How do we update the proximity matrix?" after merging, e.g., C2 and C5 into a single cluster C2 U C5.
[Figure: proximity matrix with rows/columns C1, C2 U C5, C3, C4; the entries involving C2 U C5 are marked "?"]
How to Define Inter-Cluster Similarity
[Figure: proximity matrix over p1–p5 with the inter-cluster entries marked "Similarity?"]
 MIN (Single Link)
 MAX (Complete Link)
 Average
 Distance Between Centroids
 Distance Between Medoids
 Other methods driven by an objective function
• Ward's Method uses squared error
Distance Between Clusters
 Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
 Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
 Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
 Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located object in the cluster
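These definitions translate directly into code. Below is a small Python sketch (assuming NumPy; the function names are illustrative) of each inter-cluster distance for two clusters given as arrays of points.

import numpy as np

def pairwise(Ki, Kj):
    """All pairwise Euclidean distances between points of clusters Ki and Kj."""
    Ki, Kj = np.asarray(Ki, float), np.asarray(Kj, float)
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=-1)

def single_link(Ki, Kj):    # MIN: smallest pairwise distance
    return pairwise(Ki, Kj).min()

def complete_link(Ki, Kj):  # MAX: largest pairwise distance
    return pairwise(Ki, Kj).max()

def average_link(Ki, Kj):   # average of all pairwise distances
    return pairwise(Ki, Kj).mean()

def centroid_dist(Ki, Kj):  # distance between the cluster centroids
    return np.linalg.norm(np.mean(Ki, axis=0) - np.mean(Kj, axis=0))

def medoid_dist(Ki, Kj):
    """Distance between medoids: the member point of each cluster with the
    smallest total distance to the other members of the same cluster."""
    def medoid(K):
        K = np.asarray(K, float)
        return K[pairwise(K, K).sum(axis=1).argmin()]
    return np.linalg.norm(medoid(Ki) - medoid(Kj))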
Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN / Single Link, MAX / Complete Link, and Average linkage, showing that the choice of linkage changes the resulting clusters]
 Complexity (for most of the cases):
• Space: O(N^2)
• Time: O(N^3)
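In practice one would typically rely on a library implementation. As an illustration (assuming NumPy and SciPy are available), SciPy's hierarchical clustering builds the merge sequence for a chosen linkage and then 'cuts' it into a desired number of clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three synthetic 2-D groups of 20 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in [(0, 0), (4, 4), (8, 0)]])

for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                     # merge sequence (dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes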
Hierarchical Clustering
 AGNES (Agglomerative Nesting) (Kaufmann and Rousseeuw, 1990)
 Agglomerative clustering with single-link.
 BIRCH (Balanced Iterative Reducing and Clustering Using
Hierarchies)
BIRCH
 Balanced Iterative Reducing and Clustering Using Hierarchies
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
 Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
 Weakness: handles only numeric data and is sensitive to the order of the data records
BIRCH
 BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies
 Agglomerative clustering designed for clustering a large amount
of numerical data
 What does the BIRCH algorithm try to solve?
 Most of the existing algorithms DO NOT consider the case that
datasets can be too large to fit in main memory.
 They DO NOT concentrate on minimizing the number of scans of
the dataset.
 I/O costs are very high.
 The complexity of BIRCH is O(n) where n is the number of
objects to be clustered.
BIRCH: The idea by example
 Acknowledgement: Some of the materials used here are from the slides of Johann Gamper and Mouna Kacimi.
[Figure sequence: worked example illustrating the BIRCH idea, developed over several slides]
BIRCH: Key components
 Clustering feature (CF)
 Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd
moments of the subcluster from the statistical point of view
 Registers the crucial measurements for computing clusters and utilizes storage efficiently
 Used to compute centroids, and measures the compactness and distance
of clusters
 CF tree
 A height-balanced tree that stores the clustering features for a
hierarchical clustering
 The non-leaf nodes store sums of the CFs of their children
 Has two parameters
 Branching factor: max # of children
 Threshold: max diameter of sub-clusters stored at the leaf nodes
 Leaf nodes are connected via prev and next pointers.
BIRCH: Clustering feature (CF)
 CF = (N, LS, SS)
 N: Number of data points
 LS: Linear sum of the N points: LS = Σ_{i=1}^{N} X_i
 SS: Squared sum of the N points: SS = Σ_{i=1}^{N} X_i²
BIRCH: Distance measures
 A CF entry has sufficient information to calculate the centroid,
radius, diameter and many other distance measures
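To make this concrete, here is a minimal Python sketch (assuming NumPy; the class and method names are illustrative, not BIRCH's actual implementation) of how the centroid, radius, and diameter can be derived from (N, LS, SS) alone, and how two CF entries merge by simple addition:

import numpy as np

class CF:
    """Clustering Feature: CF = (N, LS, SS) for a subcluster."""
    def __init__(self, point=None):
        d = len(point) if point is not None else 0
        self.N = 0 if point is None else 1
        self.LS = np.zeros(d) if point is None else np.asarray(point, float)
        self.SS = 0.0 if point is None else float(np.dot(point, point))

    def merge(self, other):
        """CF additivity: the CF of a union is the component-wise sum."""
        out = CF()
        out.N, out.LS, out.SS = self.N + other.N, self.LS + other.LS, self.SS + other.SS
        return out

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # average distance of the member points from the centroid
        c = self.centroid()
        return np.sqrt(max(self.SS / self.N - np.dot(c, c), 0.0))

    def diameter(self):
        # average pairwise distance between member points (requires N >= 2)
        num = 2 * self.N * self.SS - 2 * np.dot(self.LS, self.LS)
        return np.sqrt(max(num / (self.N * (self.N - 1)), 0.0))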
BIRCH: CF Tree
[Figure: CF tree structure – a root with entries CF1 … CFB pointing to children, non-leaf nodes with entries CF1 … CFB pointing to children, and leaf nodes holding entries CF1 … CFL linked by prev/next pointers]
BIRCH: CF Tree insertion
 Start with the root
 Find the CF entry in the root closest to the data point, move to that
child and repeat the process until a closest leaf entry is found.
 At the leaf
 If the point can be accommodated in the cluster, update the entry
 If this addition violates the threshold T, split the entry, if this violates the
limit imposed by L, split the leaf. If its parent node too is full, split that
and so on
 Update the CF entries from the root to the leaf to accommodate this
point
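The following much-simplified sketch covers only the leaf-level part of this insertion, reusing the hypothetical CF class from the sketch above and ignoring non-leaf nodes and node splits, just to show the roles of the threshold T and the leaf capacity L:

import numpy as np

def insert_point(leaf_entries, point, T, L):
    """Insert a point into a flat list of leaf CF entries (simplified BIRCH leaf).
    leaf_entries: list of CF objects; T: diameter threshold; L: max entries per leaf."""
    new_cf = CF(point)
    if leaf_entries:
        # find the closest existing entry by centroid distance
        closest = min(leaf_entries,
                      key=lambda cf: np.linalg.norm(cf.centroid() - new_cf.centroid()))
        merged = closest.merge(new_cf)
        if merged.diameter() <= T:   # absorbing the point keeps the entry compact enough
            leaf_entries[leaf_entries.index(closest)] = merged
            return leaf_entries
    leaf_entries.append(new_cf)      # otherwise start a new leaf entry
    if len(leaf_entries) > L:
        # a real CF tree would split the leaf (and possibly parents); omitted here
        raise NotImplementedError("leaf split omitted in this sketch")
    return leaf_entries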
 Concerns
 Sensitive to the insertion order of data points
 Since the size of leaf nodes is fixed, the resulting clusters may not be natural
 Clusters tend to be spherical, given the radius and diameter measures
BIRCH Algorithm
BIRCH – Phase 1
 Start with initial threshold and insert points into the tree
 If it runs out of memory, increase the threshold value (T) and rebuild a smaller tree by reinserting values from the older tree and then the remaining values from the data set
 A good initial threshold is important but hard to figure out
 Outlier removal – when rebuilding the tree, remove outliers
BIRCH – Phase 2
 Optional
 Phase 3 sometimes has a minimum input size at which it performs well, so Phase 2 prepares the tree for Phase 3.
 Removes outliers and groups clusters.
 Scans the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing more outliers and grouping crowded subclusters into larger ones.
BIRCH – Phase 3
 Problems after phase 1:
 Input order affects results
 Splitting triggered by node size

 Phase 3:
 Consider CF entries in leaf nodes only
 Cluster all leaf nodes on the CF values according to an existing
algorithm
 Algorithm used here: agglomerative hierarchical clustering
 Use centroid as the representative of a cluster
BIRCH – Phase 4
 Optional – Cluster refining
 Additional passes over the data to correct inaccuracies and refine the
clusters further.
 Additional scan(s) of the dataset, attaching each item to the centroids found.
 Use clusters found in phase 3 as seeds
 Redistribute data points to their closest seeds and form new clusters
 Recalculate the centroids and redistribute the items.
 Always converges

 Removal of outliers
 Acquisition of membership information
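A minimal sketch of this refinement (assuming NumPy; the names are illustrative): take the Phase 3 centroids as seeds, then alternate assigning points to their nearest centroid and recomputing the centroids, essentially a k-means-style pass over the data.

import numpy as np

def refine(points, seeds, iters=10):
    """Phase-4-style refinement: redistribute points to their closest seed
    centroids and recompute the centroids until they stop moving."""
    X, centroids = np.asarray(points, float), np.asarray(seeds, float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1).argmin(axis=1)
        # recompute each centroid from the points assigned to it
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
                        for k in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids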
BIRCH
 Pros:
 BIRCH performs faster than existing algorithms (CLARANS and k-means) on large datasets in terms of quality, speed, stability, and scalability
 Scans the whole dataset only once
 Handles outliers better

 Cons:
 Since each node in a CF tree can hold only a limited number of entries due to the size limit, a CF tree node doesn't always correspond to what a user may consider a natural cluster.
 Moreover, if the clusters are not spherical in shape, it doesn't perform well because it uses the notion of radius or diameter to control the boundary of a cluster.
