
Unit-4

Contents

• Clustering
• Choosing distance metrics
• Different clustering approaches
• Hierarchical agglomerative clustering
• k-means (Lloyd’s algorithm)
• DBSCAN
• Relative merits of each method
• Clustering tendency and quality.

Clustering

• Cluster: A collection of data objects


o similar (or related) to one another within the same group
o dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …): finding similarities among data objects
  according to the characteristics found in the data and grouping similar objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by
examples: supervised)
• Typical applications
o As a stand-alone tool to get insight into data distribution
o As a preprocessing step for other algorithms

Clustering: Applications

• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation database
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults
• Climate: understanding Earth's climate by finding patterns in atmospheric and ocean data
• Economic Science: market research

Measure the Quality of Clustering

• Dissimilarity/Similarity metric
o Similarity is expressed in terms of a distance function, typically metric: d(i, j)
o The definitions of distance functions are usually rather different for interval-scaled, boolean,
categorical, ordinal, ratio, and vector variables
o Weights should be associated with different variables based on applications and data
semantics

• Quality of clustering:
o There is usually a separate “quality” function that measures the “goodness” of a cluster.
o It is hard to define “similar enough” or “good enough”
o The answer is typically highly subjective
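
As an illustration of such a "quality" function (not taken from the slides), the silhouette coefficient is one widely used choice: it compares each point's average distance to its own cluster with its distance to the nearest other cluster. A minimal sketch in Python, assuming scikit-learn is available; the synthetic dataset and the candidate values of k are illustrative:

```python
# Hedged sketch: silhouette score as one possible clustering "quality" function.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Silhouette ranges from -1 (poor) to +1 (dense, well-separated clusters).
    print(k, round(silhouette_score(X, labels), 3))
```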

Considerations for Cluster Analysis

• Partitioning criteria
o Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is
desirable)
• Separation of clusters
o Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
• Similarity measure
o Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or
contiguity)
• Clustering space
o Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

Major Clustering Approaches
• Partitioning approach:
o Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square
errors
o Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
o Create a hierarchical decomposition of the set of data (or objects) using some criterion
o Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
o Based on connectivity and density functions
o Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
o Based on a multiple-level granularity structure
o Typical methods: STING, WaveCluster, CLIQUE

Partitioning Clustering
• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters.
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
o Global optimal: exhaustively enumerate all possible partitions
o Heuristic methods: k-means and k-medoids algorithms
• In this approach, the dataset is divided into k groups, where k is the pre-defined number of groups.
  Cluster centers are chosen so that each data point is closer to its own cluster centroid than to the
  centroids of the other clusters.

• Many algorithms fall under the partitioning method; some of the popular ones are k-means, PAM
  (k-medoids), and CLARA (Clustering Large Applications).

K-means clustering

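The original slides appear to illustrate k-means with figures only. As a supplement, here is a minimal sketch of Lloyd's algorithm in plain NumPy; the function name, the random initialization, and the convergence tolerance are illustrative assumptions rather than part of the slides:

```python
# Hedged sketch of Lloyd's algorithm (k-means); names and defaults are illustrative.
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on a tiny 2-D dataset.
X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])
labels, centers = lloyd_kmeans(X, k=2)
print(labels, centers, sep="\n")
```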
Hierarchical Clustering

• Hierarchical Clustering algorithm develops the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.

• There is no requirement to predetermine the number of clusters as we did in the K-Means algorithm.

• The hierarchical clustering technique has two approaches:


o Agglomerative: a bottom-up approach in which the algorithm starts by treating each data point as a
  single cluster and keeps merging the closest clusters until only one cluster is left.
o Divisive: the reverse of the agglomerative algorithm; it follows a top-down approach.

Agglomerative Hierarchical clustering
• It follows the bottom-up approach.
• This algorithm treats each data point as a single cluster at the beginning and then starts combining the closest
  pair of clusters. It does this until all the clusters are merged into a single cluster that contains all the
  data points.

Step-1: Treat each data point as a single cluster. If there are N data points, the number of clusters will also
be N.

Step-2: Take two closest data points or clusters and merge them to form one cluster. So, there will now be N-1
clusters

Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.

Step-4: Repeat Step 3 until only one cluster is left.

Step-5: Once all the clusters are combined into one big cluster, build the dendrogram and cut it to obtain the
number of clusters required by the problem.
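
For reference (not part of the original slides), these steps are implemented in common libraries. A minimal sketch using scikit-learn's AgglomerativeClustering; the dataset and parameter values are chosen purely for illustration:

```python
# Hedged sketch: agglomerative clustering via scikit-learn; data and parameters are illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

# n_clusters sets where the merging stops; linkage chooses the inter-cluster distance rule.
model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(X)
print(labels)  # cluster index for each of the six points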

There are various ways to calculate the distance between two clusters, and these ways decide the rule for clustering.
These measures are called Linkage methods.

• Some of the popular linkage methods are given below (a short code sketch follows the list):

1. Single Linkage: the shortest distance between the closest points of the two clusters.
2. Complete Linkage: the farthest distance between two points of two different clusters. It is one of the
   popular linkage methods, as it forms tighter clusters than single linkage.
3. Average Linkage: the distance between every pair of points, one from each cluster, is added up and then
   divided by the number of such pairs to give the average distance between the two clusters.
4. Centroid Linkage: the distance between the centroids of the two clusters.
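
A minimal sketch of how these four linkage distances can be computed between two small clusters; the example arrays and the use of SciPy's cdist are illustrative assumptions:

```python
# Hedged sketch: the four linkage distances between two small example clusters.
import numpy as np
from scipy.spatial.distance import cdist

cluster_1 = np.array([[1.0, 1.0], [1.5, 1.5]])
cluster_2 = np.array([[3.0, 4.0], [3.0, 3.5], [4.0, 4.0]])

pairwise = cdist(cluster_1, cluster_2)           # all point-to-point distances
single   = pairwise.min()                        # single linkage: closest pair
complete = pairwise.max()                        # complete linkage: farthest pair
average  = pairwise.mean()                       # average linkage: mean over all pairs
centroid = np.linalg.norm(cluster_1.mean(axis=0) - cluster_2.mean(axis=0))  # centroid linkage
print(single, complete, average, centroid)
```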

Working of Dendrogram in Hierarchical clustering
The dendrogram is a tree-like structure that records each merge step performed by the hierarchical clustering
algorithm. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points (or clusters),
and the X-axis shows all the data points of the given dataset.

• First, the data points P2 and P3 combine to form a cluster, and a corresponding dendrogram link is created,
  connecting P2 and P3 with a rectangular shape. Its height is decided by the Euclidean distance between the
  data points.

• In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the
  previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.

• Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6,
in another dendrogram.

• At last, the final dendrogram is created that combines all the data points together

Divisive Clustering
• Divisive clustering works just the opposite of agglomerative clustering. It starts by considering all the data
  points as one big cluster and then splits them into smaller clusters repeatedly until every data point is in its
  own cluster (or the desired number of clusters is reached). It is good at identifying large clusters, follows a
  top-down approach, and can be more efficient than agglomerative clustering when a flat-clustering method is used
  for each split.

STEPS IN DIVISIVE CLUSTERING (see the sketch after this list)

1. Consider all the data points as a single cluster.
2. Split the cluster using any flat-clustering method, say k-means.
3. Choose the best cluster to split further, e.g., the one with the largest Sum of Squared Errors (SSE).
4. Repeat steps 2 and 3 until the desired number of clusters is reached (or every point is its own cluster).
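
A minimal sketch of these steps in the style of bisecting k-means; the helper names, the SSE computation, and the stopping rule are illustrative assumptions rather than the slide's own algorithm:

```python
# Hedged sketch: divisive (bisecting) clustering using k-means for each split.
import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    # Sum of squared distances of points to their own centroid.
    return float(((points - points.mean(axis=0)) ** 2).sum())

def divisive_clustering(X, n_clusters, seed=0):
    clusters = [X]                                  # step 1: everything in one cluster
    while len(clusters) < n_clusters:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))   # step 3
        to_split = clusters.pop(worst)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(to_split)  # step 2
        clusters.append(to_split[labels == 0])
        clusters.append(to_split[labels == 1])
    return clusters                                 # step 4: stop at the desired number of clusters

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])
for c in divisive_clustering(X, n_clusters=2):
    print(c)
```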

Divisive Clustering
• The data points 1, 2, …, 6 are assigned to one large cluster.
• After calculating the proximity matrix, the points are split into separate clusters based on their dissimilarity.
• The proximity matrix is recomputed until each point is assigned to an individual cluster.
• The proximity matrix and linkage function follow the same procedure as in agglomerative clustering.

Agglomerative Hierarchical clustering: Example
Suppose we have 6 objects (named A, B, C, D, E and F), and each object has two measured features, X1 and X2. We
can plot the features in a scatter plot to visualize the proximity between objects.

Object   X1    X2
A        1     1
B        1.5   1.5
C        5     5
D        3     4
E        4     4
F        3     3.5

Calculate the proximity matrix using single-linkage (Euclidean) distance between the objects:

Dist.   A      B      C      D      E      F
A       0      0.71   5.66   3.61   4.24   3.20
B       0.71   0      4.95   2.92   3.54   2.50
C       5.66   4.95   0      2.24   1.41   2.50
D       3.61   2.92   2.24   0      1.00   0.50
E       4.24   3.54   1.41   1.00   0      1.12
F       3.20   2.50   2.50   0.50   1.12   0

The closest pair is cluster D and cluster F, with the shortest distance of 0.50. Thus, we group D and F into
cluster (D, F) and then update the distance matrix.

We have 6 objects and we put each object into its own cluster; thus, in the beginning we have 6 clusters.

Our goal is to merge these 6 clusters so that, at the end of the iterations, only a single cluster remains,
consisting of all six original objects.
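
For reference, the proximity matrix above can be reproduced with a few lines of SciPy (a minimal sketch, not part of the original slides):

```python
# Hedged sketch: reproduce the Euclidean proximity matrix for objects A..F.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])  # A..F
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))  # matches the 6 x 6 distance matrix in the slide
```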

Distances between ungrouped clusters do not change from the original distance matrix. The question is how to
calculate the distance between the newly formed cluster (D, F) and the other clusters.

Dist.   A      B      C      D, F   E
A       0      0.71   5.66   ?      4.24
B       0.71   0      4.95   ?      3.54
C       5.66   4.95   0      ?      1.41
D, F    ?      ?      ?      0      ?
E       4.24   3.54   1.41   ?      0

Using single linkage, we take the minimum distance between the original objects of the two clusters.
Using the input distance matrix, the distance between cluster (D, F) and cluster A is computed as:

dist((D, F) → A) = min(dist(D, A), dist(F, A)) = min(3.61, 3.20) = 3.20

Dist.   A      B      C      D, F   E
A       0      0.71   5.66   ?      4.24
B       0.71   0      4.95   ?      3.54
C       5.66   4.95   0      ?      1.41
D, F    3.20   ?      ?      0      ?
E       4.24   3.54   1.41   ?      0

Similarly:
dist((D, F) → B) = min(dist(D, B), dist(F, B)) = min(2.92, 2.50) = 2.50
dist((D, F) → C) = min(dist(D, C), dist(F, C)) = min(2.24, 2.50) = 2.24
dist((D, F) → E) = min(dist(D, E), dist(F, E)) = min(1.00, 1.12) = 1.00

The updated distance matrix is:

Dist.   A      B      C      D, F   E
A       0      0.71   5.66   3.20   4.24
B       0.71   0      4.95   2.50   3.54
C       5.66   4.95   0      2.24   1.41
D, F    3.20   2.50   2.24   0      1.00
E       4.24   3.54   1.41   1.00   0

Looking at the updated distance matrix, we find that the closest distance is now between cluster A and cluster B,
at 0.71. Thus, we group cluster A and cluster B into a single cluster named (A, B).

Dist.   A, B   C      D, F   E
A, B    0      ?      ?      ?
C       ?      0      2.24   1.41
D, F    ?      2.24   0      1.00
E       ?      1.41   1.00   0

Using the input distance matrix (size 6 by 6), the distances from the new cluster (A, B) to the other clusters are
computed as:

dist(C → (A, B)) = min(dist(C, A), dist(C, B)) = min(5.66, 4.95) = 4.95

dist((D, F) → (A, B)) = min(dist(D, A), dist(D, B), dist(F, A), dist(F, B)) = min(3.61, 2.92, 3.20, 2.50) = 2.50

dist(E → (A, B)) = min(dist(E, A), dist(E, B)) = min(4.24, 3.54) = 3.54

The updated distance matrix is:

Dist.   A, B   C      D, F   E
A, B    0      4.95   2.50   3.54
C       4.95   0      2.24   1.41
D, F    2.50   2.24   0      1.00
E       3.54   1.41   1.00   0

Observing the updated distance matrix, we can see that the closest distance is now between cluster E and cluster
(D, F), at distance 1.00. Thus, we merge them into cluster ((D, F), E).

Dist.       A, B   C      (D, F), E
A, B        0      4.95   ?
C           4.95   0      ?
(D, F), E   ?      ?      0

Distance between cluster ((D, F), E) and cluster (A, B) is calculated as:

dist(((D, F), E) → (A, B)) = min(dist(D, A), dist(D, B), dist(F, A), dist(F, B), dist(E, A), dist(E, B))
                           = min(3.61, 2.92, 3.20, 2.50, 4.24, 3.54) = 2.50

Distance between cluster ((D, F), E) and cluster C is computed as:

dist(((D, F), E) → C) = min(dist(D, C), dist(F, C), dist(E, C)) = min(2.24, 2.50, 1.41) = 1.41

Dist.       A, B   C      (D, F), E
A, B        0      4.95   2.50
C           4.95   0      1.41
(D, F), E   2.50   1.41   0

Since the smallest distance is now 1.41, we merge cluster ((D, F), E) and cluster C into a new cluster named
(((D, F), E), C).

The updated distance matrix is shown as:

Dist.              A, B    (((D, F), E), C)
A, B               0       2.50
(((D, F), E), C)   2.50    0

Distance between cluster (((D, F), E), C) and cluster (A, B) can be computed as:

dist((((D, F), E), C) → (A, B)) = min(dist(D, A), dist(D, B), dist(F, A), dist(F, B), dist(E, A), dist(E, B),
                                      dist(C, A), dist(C, B))
                                = min(3.61, 2.92, 3.20, 2.50, 4.24, 3.54, 5.66, 4.95) = 2.50

Dist.                        ((((D, F), E), C), (A, B))
((((D, F), E), C), (A, B))   0

Now if we merge the remaining two clusters, we get a single cluster containing all 6 objects. Thus, the
computation is finished.
The results of the computation are summarized as follows:

1. In the beginning we have 6 clusters: A, B, C, D, E and F
2. We merge cluster D and cluster F into (D, F) at distance 0.50
3. We merge cluster A and cluster B into (A, B) at distance 0.71
4. We merge cluster E and (D, F) into ((D, F), E) at distance 1.00
5. We merge cluster ((D, F), E) and C into (((D, F), E), C) at distance 1.41
6. We merge cluster (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50
7. The last cluster contains all the objects, which concludes the computation.

The hierarchy is given as ((((D, F), E), C), (A, B)). We can also plot the clustering hierarchy in the XY space.

Using this information, we can now draw the final dendrogram. The dendrogram is drawn based on the merge
distances listed above.
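
For reference (not from the original slides), SciPy reproduces this single-linkage hierarchy and its dendrogram in a few lines; the merge distances in Z should match those in the summary above (0.50, 0.71, 1.00, 1.41, 2.50), and the plotting call is optional:

```python
# Hedged sketch: reproduce the single-linkage hierarchy and dendrogram for A..F with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])  # A..F
Z = linkage(X, method="single", metric="euclidean")
print(np.round(Z, 2))  # each row: the two clusters merged, the merge distance, and the new cluster size

dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.ylabel("Merge distance (Euclidean, single linkage)")
plt.show()
```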

DBSCAN

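The original slide appears to present DBSCAN graphically. As a supplement: DBSCAN groups points that lie in dense regions (each core point has at least min_samples neighbours within radius eps) and labels sparse points as noise. A minimal sketch using scikit-learn; the dataset and the eps/min_samples values are illustrative choices:

```python
# Hedged sketch: DBSCAN via scikit-learn; eps and min_samples are illustrative choices.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps: neighbourhood radius; min_samples: points required for a dense (core) region.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; -1 marks noise points
```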
Clustering Tendency
• A big issue in cluster analysis is that clustering methods will return clusters even if the data does not contain
  any clusters. In other words, if you blindly apply a clustering method to a data set, it will divide the data into
  clusters because that is what it is supposed to do.

• Before applying any clustering method to your data, it is important to evaluate whether the data set contains
  meaningful clusters (i.e., non-random structure) and, if so, how many. This process is called assessing the
  clustering tendency, or the feasibility of the clustering analysis.

• The two major methods for evaluating the clustering tendency are:

i. a statistical method (the Hopkins statistic), and
ii. a visual method (the Visual Assessment of cluster Tendency (VAT) algorithm).

Hopkins statistic
• The Hopkins statistic (Lawson and Jurs 1990) is used to assess the clustering tendency of a data set by measuring
the probability that a given data set is generated by a uniform data distribution. In other words, it tests the spatial
randomness of the data.
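
The slide does not give the formula, so as a supplement: in one common formulation, m points are sampled uniformly from the data's bounding box with nearest-data-point distances u_i, m real data points are sampled with nearest-neighbour distances w_i, and H = Σu_i / (Σu_i + Σw_i); values near 0.5 suggest spatial randomness, while values near 1 suggest a clustering tendency (some references use the complementary convention). A minimal sketch, with the sample size m and the KD-tree queries as illustrative choices:

```python
# Hedged sketch of the Hopkins statistic; this is one common convention, not the slide's own definition.
import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(X, m=None, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)                     # sample size, a common rule of thumb
    tree = cKDTree(X)

    # u_i: distance from m uniform random points (in the bounding box) to their nearest real point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u, _ = tree.query(uniform, k=1)

    # w_i: distance from m sampled real points to their nearest *other* real point.
    sample = X[rng.choice(n, size=m, replace=False)]
    w, _ = tree.query(sample, k=2)
    w = w[:, 1]                                  # k=1 would return the point itself (distance 0)

    return u.sum() / (u.sum() + w.sum())         # ~0.5: uniform; ~1.0: clustered (this convention)

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])
print(round(hopkins_statistic(X, m=3), 3))
```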

Visual method
The visual assessment of cluster tendency (VAT) approach (Bezdek and Hathaway, 2002) works as follows (see the
sketch below):
• Compute the dissimilarity matrix (DM) between the objects in the data set using the Euclidean distance measure
• Reorder the DM so that similar objects are close to one another. This process creates an ordered dissimilarity
  matrix (ODM)
• The ODM is displayed as an ordered dissimilarity image (ODI), which is the visual output of VAT
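
A minimal sketch of these three steps. The reordering below is my reading of VAT's Prim-like ordering (start from an endpoint of the largest dissimilarity, then repeatedly append the closest remaining object); treat it as illustrative rather than a faithful reimplementation of Bezdek and Hathaway's algorithm:

```python
# Hedged sketch of VAT: dissimilarity matrix -> reordering -> ordered dissimilarity image.
import numpy as np
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt

def vat_order(D):
    n = D.shape[0]
    start = np.unravel_index(np.argmax(D), D.shape)[0]   # endpoint of the largest dissimilarity
    order, remaining = [start], set(range(n)) - {start}
    while remaining:
        rem = list(remaining)
        sub = D[np.ix_(order, rem)]
        nxt = rem[int(np.argmin(sub.min(axis=0)))]        # closest remaining object to the ordered set
        order.append(nxt)
        remaining.remove(nxt)
    return order

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])
D = squareform(pdist(X))                                  # step 1: dissimilarity matrix (DM)
order = vat_order(D)                                      # step 2: reorder -> ODM
plt.imshow(D[np.ix_(order, order)], cmap="gray")          # step 3: ordered dissimilarity image (ODI)
plt.title("Ordered dissimilarity image (VAT-style)")
plt.show()
```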

Happy Learning
