Hierarchical Clustering Algorithm
Do not have to assume any particular number of clusters
Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level (see the sketch after this list)
They may correspond to meaningful taxonomies
Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
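As a concrete illustration of ‘cutting’ the dendrogram (not part of the original slides; the toy data and parameter values below are assumptions), a minimal SciPy sketch: the hierarchy is built once and then cut at different levels to obtain 2, 3, or 5 clusters.

```python
# Minimal sketch (not from the slides): build one dendrogram with SciPy,
# then "cut" it at different levels to obtain any desired number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))        # 20 toy points in 2-D (illustrative data)

Z = linkage(X, method="average")    # the full hierarchy (dendrogram)

for k in (2, 3, 5):                 # cut the same tree at different levels
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, labels)
```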
Hierarchical Clustering – Types
[Figure: dendrogram over points a–e, built bottom-up (agglomerative, Step 0 → Step 4) or top-down (divisive, Step 4 → Step 0)]
Agglomerative (AGNES):
Start with the points as individual clusters
At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
Divisive (DIANA):
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a point (or there are k clusters)
Agglomerative Clustering Algorithm
Basic algorithm:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
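The six steps above map almost line-for-line onto code. The following is a minimal, unoptimized sketch of my own (not from the slides), using the single-link (MIN) rule to decide which pair of clusters is ‘closest’:

```python
# Naive agglomerative clustering following steps 1-6 above (single-link merges).
import numpy as np

def agglomerative(points, k=1):
    # 1. Compute the proximity (distance) matrix between all points.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # 2. Let each data point be a cluster.
    clusters = [[i] for i in range(len(points))]
    # 3. Repeat until only one cluster (or k clusters) remains.
    while len(clusters) > k:
        # 4. Merge the two closest clusters (single link: closest pair of members).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dab = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dab < best[0]:
                    best = (dab, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
        # 5. The proximity-matrix update is implicit here: cluster-to-cluster
        #    distances are recomputed from the point-level matrix each pass.
    # 6. Stop when only a single cluster (or k clusters) remains.
    return clusters
```

Recomputing the cluster distances on every pass keeps this sketch at roughly the O(N³) time / O(N²) space noted later in this section; real implementations update the proximity matrix in place instead.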
Agglomerative Clustering Algorithm
Intermediate Situation
After some merging steps, we have some clusters
[Figure: proximity matrix over the current clusters C1–C5]
Agglomerative Clustering Algorithm
After Merging
After merging C2 and C5, the question is: “How do we update the proximity matrix?”
[Figure: proximity matrix with the merged cluster C2 ∪ C5 replacing C2 and C5; its proximities to C1, C3, and C4 are marked ‘?’]
How to Define Inter-Cluster Similarity
[Figure: pairwise proximity matrix over points p1–p5, asking how to measure similarity between two clusters]
MIN (Single Link)
MAX (Complete Link)
Average
Distance Between Centroids
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg{ dist(tip, tjq) : tip ∈ Ki, tjq ∈ Kj }
Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
Medoid: a chosen, centrally located object in the cluster
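A small code sketch of my own (not from the slides) makes the distance-based definitions above concrete; Ki and Kj are plain lists of NumPy points and dist is ordinary Euclidean distance:

```python
# Inter-cluster distance definitions from the list above.
import numpy as np

def dist(p, q):
    return np.linalg.norm(p - q)

def single_link(Ki, Kj):    # MIN: smallest pairwise point distance
    return min(dist(p, q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):  # MAX: largest pairwise point distance
    return max(dist(p, q) for p in Ki for q in Kj)

def average_link(Ki, Kj):   # average of all pairwise point distances
    return sum(dist(p, q) for p in Ki for q in Kj) / (len(Ki) * len(Kj))

def centroid_link(Ki, Kj):  # distance between the two cluster centroids
    return dist(np.mean(Ki, axis=0), np.mean(Kj, axis=0))
```

The medoid variant is the same as the centroid one, with the mean replaced by a chosen, centrally located member point.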
Hierarchical Clustering: Comparison
[Figure: the same six points clustered with MIN / Single Link, MAX / Complete Link, and Average linkage, producing different dendrograms]
Complexity (for most of the cases):
Space: O(N²)
Time: O(N³)
Hierarchical Clustering
AGNES (Agglomerative Nesting) (1990)
Agglomerative clustering with single-link.
BIRCH (Balanced Iterative Reducing and Clustering Using
Hierarchies)
BIRCH
Balanced Iterative Reducing and Clustering Using Hierarchies
Zhang, Ramakrishnan & Livny, SIGMOD’96
Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to the order of the data records
BIRCH
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies
Agglomerative clustering designed for clustering a large amount
of numerical data
What does the BIRCH algorithm try to solve?
Most of the existing algorithms DO NOT consider the case that
datasets can be too large to fit in main memory.
They DO NOT concentrate on minimizing the number of scans of
the dataset.
I/O costs are very high.
The complexity of BIRCH is O(n) where n is the number of
objects to be clustered.
BIRCH: The idea by example
[Figure: CF tree with non-leaf nodes holding CF entries and leaf nodes chained in a doubly linked list (prev/next), each leaf holding entries CF1 … CFL]
BIRCH: CF Tree insertion
Start with the root
Find the CF entry in the root closest to the data point, move to that
child and repeat the process until a closest leaf entry is found.
At the leaf
If the point can be accommodated in the cluster, update the entry
If this addition violates the threshold T, split the entry; if this violates the limit L on entries per leaf, split the leaf. If its parent node is also full, split it too, and so on
Update the CF entries from the root to the leaf to accommodate this
point
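A CF entry summarizes a subcluster as CF = (N, LS, SS): the number of points, their linear sum, and the sum of their squared norms (the definitions from the SIGMOD’96 paper). The sketch below is my own illustration of why insertion and node splits stay cheap: adding a point or merging two entries only touches these three fields, and the radius used for the threshold test falls out of them.

```python
# Sketch of a BIRCH Clustering Feature: CF = (N, LS, SS).
import numpy as np

class CF:
    def __init__(self, dim):
        self.n = 0                 # number of points summarized
        self.ls = np.zeros(dim)    # linear sum of the points
        self.ss = 0.0              # sum of squared norms of the points

    def add(self, x):
        # Inserting a point only updates the three summary fields.
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        # CF vectors are additive, so parent entries along the root-to-leaf
        # path can absorb a new point (or a split child) cheaply.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of the summarized points from their centroid.
        return np.sqrt(max(self.ss / self.n - (self.ls @ self.ls) / self.n ** 2, 0.0))
```

Whether a point “can be accommodated” in a leaf entry is then decided by tentatively adding it and checking that the resulting radius (or diameter) stays within the threshold T.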
Concerns
Sensitive to insertion order of data points
Since the size of leaf nodes is fixed, the resulting clusters may not be natural
Clusters tend to be spherical given the radius and diameter measures
BIRCH Algorithm
BIRCH – Phase 1
Start with initial threshold and insert points into the tree
If memory runs out, increase the threshold value (T) and rebuild a smaller tree by reinserting values from the old tree and then the remaining values from the data set
A good initial threshold is important but hard to figure out
Outlier removal – outliers can be removed while rebuilding the tree
BIRCH – Phase 2
Optional
Phase 3 sometimes has a minimum input size at which it performs well, so Phase 2 prepares the tree for Phase 3.
Removes outliers and groups clusters.
Scans the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing more outliers and grouping crowded subclusters into larger ones.
BIRCH – Phase 3
Problems after phase 1:
Input order affects results
Splitting triggered by node size
Phase 3:
Consider CF entries in leaf nodes only
Cluster all leaf entries on their CF values using an existing algorithm
Algorithm used here: agglomerative hierarchical clustering
Use centroid as the representative of a cluster
BIRCH – Phase 4
Optional – Cluster refining
Additional passes over the data to correct inaccuracies and refine the
clusters further.
Additional scan(s) of the dataset, attaching each item to the closest of the centroids found.
Use clusters found in phase 3 as seeds
Redistribute data points to their closest seeds and form new clusters
Recalculate the centroids and redistribute the items.
Always converges
Removal of outliers
Acquisition of membership information
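A minimal sketch of my own (not from the slides) of one such refining pass: every item is attached to its closest seed centroid, and the centroids are then recomputed. Repeating the pass gives the convergence noted above.

```python
# One Phase-4-style refinement pass over the data.
import numpy as np

def refine(points, seeds):
    # Distance from every point to every seed centroid.
    d = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=-1)
    labels = d.argmin(axis=1)                  # membership information
    new_seeds = np.array([
        points[labels == k].mean(axis=0) if np.any(labels == k) else seeds[k]
        for k in range(len(seeds))             # keep a seed unchanged if it gets no points
    ])
    return labels, new_seeds
```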
BIRCH
Pros:
BIRCH outperforms existing algorithms (CLARANS and k-means) on large datasets in quality, speed, stability, and scalability
Scans the whole dataset only once
Handles outliers better
Cons:
Since each node in a CF tree can hold only a limited number of entries due to the size limit, a CF tree node doesn’t always correspond to what a user may consider a natural cluster.
Moreover, if the clusters are not spherical in shape, it doesn’t perform
well because it uses the notion of radius or diameter to control the
boundary of a cluster.
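For completeness, a short usage sketch with scikit-learn’s Birch estimator (assuming scikit-learn is installed; the data and parameter values are illustrative): threshold plays the role of T above, and n_clusters drives the global, Phase-3-style clustering of the leaf entries.

```python
# Illustrative use of scikit-learn's BIRCH implementation on toy data.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3, 6)])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # points per cluster
```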