Unit-4
Contents
• Clustering
• Choosing distance metrics
• Different clustering approaches
• Hierarchical agglomerative clustering
• k-means (Lloyd’s algorithm)
• DBSCAN
• Relative merits of each method
• Clustering tendency and quality.
Clustering
Clustering: Applications
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: identification of areas of similar land use in an Earth observation database
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continental faults
• Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
• Economic science: market research
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
o Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
o The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
o Weights should be associated with different variables based on applications and data semantics
• Quality of clustering:
o There is usually a separate “quality” function that measures the “goodness” of a cluster.
o It is hard to define “similar enough” or “good enough”
o The answer is typically highly subjective
Considerations for Cluster Analysis
• Partitioning criteria
o Single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
o Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
• Similarity measure
o Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
• Clustering space
o Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
Major Clustering Approaches
• Partitioning approach:
o Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
o Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
o Create a hierarchical decomposition of the set of data (or objects) using some criterion
o Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach:
o Based on connectivity and density functions
o Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
o Based on a multiple-level granularity structure
o Typical methods: STING, WaveCluster, CLIQUE
Partitioning Clustering
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
o Global optimum: exhaustively enumerate all possible partitions
o Heuristic methods: the k-means and k-medoids algorithms
• In this approach, the dataset is divided into a set of k groups, where k is the pre-defined number of groups. Cluster centers are chosen so that each data point is closer to its own cluster center than to the centers of the other clusters.
K-means clustering
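The k-means slides here rely on figures, so below is a minimal NumPy sketch of Lloyd's algorithm (the alternation of assignment and centroid-update steps). The function name lloyd_kmeans, the random initialization, and the iteration cap are illustrative assumptions rather than material from the slides.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random (an assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the assignments will no longer change
        centroids = new_centroids
    return centroids, labels

# Example usage on the six 2-D objects used later in this unit:
# X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]], dtype=float)
# centers, labels = lloyd_kmeans(X, k=2)
```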
Hierarchical Clustering
• The hierarchical clustering algorithm develops the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
• There is no requirement to predetermine the number of clusters, as there is in the k-means algorithm.
Agglomerative Hierarchical clustering
• It follows the bottom-up approach.
• This algorithm treats each data point as a single cluster at the beginning, and then starts combining the closest pair of clusters. It does this until all the clusters are merged into a single cluster that contains all the data points.
Step-1: Treat each data point as a single cluster. If there are N data points, the number of clusters will also be N.
Step-2: Take the two closest data points or clusters and merge them to form one cluster; there will now be N-1 clusters.
Step-3: Again, take the two closest clusters and merge them together to form one cluster; there will be N-2 clusters.
Step-4: Repeat Step-3 until only one cluster is left.
Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram to divide the clusters as per the problem.
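As a rough illustration of Steps 1-5, here is a small NumPy sketch of the agglomerative merging loop. It assumes the single-linkage rule (minimum pairwise distance, introduced on the following slides) and simply records each merge; it is a didactic sketch, not an efficient implementation.

```python
import numpy as np

def agglomerative_single_linkage(X):
    """Repeatedly merge the two closest clusters until one cluster remains."""
    clusters = [[i] for i in range(len(X))]      # Step 1: every point is its own cluster
    merges = []                                  # record of (cluster_a, cluster_b, distance)
    while len(clusters) > 1:                     # Steps 2-4: repeat until one cluster is left
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest distance between members of the two clusters.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return merges
```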
There are various ways to calculate the distance between two clusters, and the chosen measure decides the rule for merging. These measures are called linkage methods.
1. Single Linkage: the shortest distance between the closest points of the two clusters.
2. Complete Linkage: the farthest distance between points of two different clusters. It is one of the popular linkage methods as it forms tighter clusters than single linkage.
3. Average Linkage: the distance between every pair of points (one from each cluster) is added up and then divided by the total number of pairs, giving the average distance between the two clusters.
4. Centroid Linkage: the distance between the centroids of the two clusters.
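To make the four linkage rules concrete, here is a small NumPy sketch that computes each of them for two clusters of points; the function name and the example values are illustrative assumptions.

```python
import numpy as np

def linkage_distances(cluster_a, cluster_b):
    """Distance between two clusters under the four linkage rules above."""
    A, B = np.asarray(cluster_a, float), np.asarray(cluster_b, float)
    # All pairwise Euclidean distances between members of the two clusters.
    pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return {
        "single":   pair.min(),    # closest pair of points
        "complete": pair.max(),    # farthest pair of points
        "average":  pair.mean(),   # mean over all pairs of points
        "centroid": np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)),  # distance between centroids
    }

# Example with two small, illustrative clusters:
# print(linkage_distances([[1, 1], [1.5, 1.5]], [[4, 4], [5, 5]]))
```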
Working of Dendrogram in Hierarchical clustering
The dendrogram is a tree-like structure that records each merge step performed by the hierarchical clustering algorithm. In the dendrogram plot, the y-axis shows the Euclidean distances between the data points (or clusters), and the x-axis shows all the data points of the given dataset.
Working of Dendrogram in Hierarchical clustering
• Firstly, the data points P2 and P3 combine together to form a cluster, and correspondingly a dendrogram is created which connects P2 and P3 with a rectangular link. The height of the link is decided by the Euclidean distance between the data points.
• In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is taller than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
• Again, two new dendrograms are created: one combining P1, P2, and P3, and another combining P4, P5, and P6.
• At last, the final dendrogram is created, which combines all the data points together.
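A hedged SciPy sketch of how such a dendrogram is typically produced and then "cut" into flat clusters (Step-5). The points P1..P6 below, the choice of single linkage, and the cut height are assumptions for illustration, not the exact data behind the slide figures.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
import matplotlib.pyplot as plt

# Hypothetical 2-D data points P1..P6 (illustrative only).
P = np.array([[1.0, 1.0], [1.2, 1.1], [1.3, 0.9],
              [4.0, 4.0], [4.2, 4.1], [4.1, 3.9]])

Z = linkage(P, method="single")                 # merge history: (idx1, idx2, distance, size) per row
dendrogram(Z, labels=[f"P{i}" for i in range(1, 7)])  # y-axis: merge distances, x-axis: data points
plt.ylabel("merge distance")
plt.show()

# Cutting the dendrogram at a chosen height yields flat clusters.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```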
Divisive Clustering
• Divisive clustering works in just the opposite way to agglomerative clustering. It starts by considering all the data points as one big cluster and then keeps splitting them into smaller clusters until every data point is in its own cluster. It is therefore good at identifying large clusters. It follows a top-down approach and, with an efficient splitting procedure, can be more efficient than agglomerative clustering.
Divisive Clustering
• The data points 1, 2, ..., 6 are first assigned to one large cluster.
• After calculating the proximity matrix, the points are split into separate clusters based on their dissimilarity.
• The proximity matrix is recomputed and the splitting repeated until each point is assigned to an individual cluster.
• The proximity matrix and linkage function follow the same procedure as in agglomerative clustering.
Agglomerative Hierarchical clustering: Example
Suppose we have 6 objects (named A, B, C, D, E and F) and each object has two measured features, X1 and X2. We can plot the features in a scatter plot to visualize the proximity between objects.

Object  X1   X2
A       1    1
B       1.5  1.5
C       5    5
D       3    4
E       4    4
F       3    3.5
Calculate the proximity matrix using single-linkage (Euclidean) distance:

Dist.  A     B     C     D     E     F
A      0     0.71  5.66  3.61  4.24  3.20
B      0.71  0     4.95  2.92  3.54  2.50
C      5.66  4.95  0     2.24  1.41  2.50
D      3.61  2.92  2.24  0     1.00  0.50
E      4.24  3.54  1.41  1.00  0     1.12
F      3.20  2.50  2.50  0.50  1.12  0

We have 6 objects and we put each object into its own cluster; thus, at the beginning we have 6 clusters. Our goal is to group these 6 clusters so that, at the end of the iterations, we have a single cluster consisting of all six original objects.
In this case, the closest pair of clusters is D and F, with the shortest distance of 0.50. Thus, we group D and F into cluster (D, F) and then update the distance matrix.
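The proximity matrix above can be reproduced with SciPy; this is a small sketch, assuming Euclidean distance and rounding to two decimals as in the slides.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# The six objects A..F with their features (X1, X2) from the example.
points = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]], dtype=float)
names = ["A", "B", "C", "D", "E", "F"]

D = squareform(pdist(points, metric="euclidean"))   # symmetric 6x6 distance matrix
print(np.round(D, 2))                               # e.g. D[0, 1] = 0.71 (A-B), D[3, 5] = 0.50 (D-F)
```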
The distances between ungrouped clusters do not change from the original distance matrix. The problem is how to calculate the distance between the newly grouped cluster (D, F) and the other clusters.
Using single linkage, we take the minimum distance between the original objects of the two clusters.
Using the input distance matrix, the distance between cluster (D, F) and cluster A is computed as:
dist((D, F), A) = min(dist(D, A), dist(F, A)) = min(3.61, 3.20) = 3.20
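The same min-based update can be written as a tiny helper; a self-contained sketch assuming the six objects A..F defined above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]], dtype=float)
names = ["A", "B", "C", "D", "E", "F"]
D = squareform(pdist(points))                 # Euclidean distance matrix for A..F

merged = [3, 5]                               # indices of D and F, the newly merged cluster
for other in [0, 1, 2, 4]:                    # the remaining clusters A, B, C, E
    d = min(D[i, other] for i in merged)      # single linkage: min over the original objects
    print(names[other], round(d, 2))          # prints A 3.2, B 2.5, C 2.24, E 1.0
```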
The entries still to be computed are marked with "?":

Dist.  A     B     C     D,F   E
A      0     0.71  5.66  ?     4.24
B      0.71  0     4.95  ?     3.54
C      5.66  4.95  0     ?     1.41
D,F    3.20  ?     ?     0     ?
E      4.24  3.54  1.41  ?     0

Similarly:
dist((D, F), B) = min(dist(D, B), dist(F, B)) = min(2.92, 2.50) = 2.50
dist((D, F), C) = min(dist(D, C), dist(F, C)) = min(2.24, 2.50) = 2.24
dist((D, F), E) = min(dist(D, E), dist(F, E)) = min(1.00, 1.12) = 1.00

The updated distance matrix is:

Dist.  A     B     C     D,F   E
A      0     0.71  5.66  3.20  4.24
B      0.71  0     4.95  2.50  3.54
C      5.66  4.95  0     2.24  1.41
D,F    3.20  2.50  2.24  0     1.00
E      4.24  3.54  1.41  1.00  0
Looking at the lower triangle of the updated distance matrix, we find that the closest distance is now 0.71, between cluster A and cluster B. Thus, we group cluster A and cluster B into a single cluster named (A, B).

Dist.  A,B   C     D,F   E
A,B    0     ?     ?     ?
C      ?     0     2.24  1.41
D,F    ?     2.24  0     1.00
E      ?     1.41  1.00  0

Using the input distance matrix (size 6 by 6), the distances involving cluster (A, B) are computed as:
dist(C, (A, B)) = min(dist(C, A), dist(C, B)) = min(5.66, 4.95) = 4.95
dist((D, F), (A, B)) = min(dist(D, A), dist(D, B), dist(F, A), dist(F, B)) = min(3.61, 2.92, 3.20, 2.50) = 2.50
dist(E, (A, B)) = min(dist(E, A), dist(E, B)) = min(4.24, 3.54) = 3.54
The updated distance matrix is:

Dist.  A,B   C     D,F   E
A,B    0     4.95  2.50  3.54
C      4.95  0     2.24  1.41
D,F    2.50  2.24  0     1.00
E      3.54  1.41  1.00  0

Observing the updated distance matrix, we can see that the closest distance between clusters is between E and (D, F), at distance 1.00. Thus, we cluster them together into cluster ((D, F), E).
The distance between cluster ((D, F), E) and cluster (A, B) is calculated as:
dist(((D, F), E), (A, B)) = min(dist(D, A), dist(D, B), dist(F, A), dist(F, B), dist(E, A), dist(E, B)) = min(3.61, 2.92, 3.20, 2.50, 4.24, 3.54) = 2.50
The distance between cluster ((D, F), E) and cluster C is computed as:
dist(((D, F), E), C) = min(dist(D, C), dist(F, C), dist(E, C)) = min(2.24, 2.50, 1.41) = 1.41
Since 1.41 < 2.50, the closest pair is ((D, F), E) and C, so we merge them into a new cluster named (((D, F), E), C).
Only two clusters now remain. The distance between cluster (((D, F), E), C) and cluster (A, B) is computed as:
dist((((D, F), E), C), (A, B)) = min(dist(D, A), dist(D, B), dist(F, A), dist(F, B), dist(E, A), dist(E, B), dist(C, A), dist(C, B)) = min(3.61, 2.92, 3.20, 2.50, 4.24, 3.54, 5.66, 4.95) = 2.50

The updated distance matrix is:

Dist.           A,B    (((D,F),E),C)
A,B             0      2.50
(((D,F),E),C)   2.50   0

Now, if we merge the remaining two clusters, we get a single cluster containing all 6 objects. Thus, our computation is finished.
We summarize the results of the computation as follows:
1. Merge D and F at distance 0.50 → (D, F)
2. Merge A and B at distance 0.71 → (A, B)
3. Merge (D, F) and E at distance 1.00 → ((D, F), E)
4. Merge ((D, F), E) and C at distance 1.41 → (((D, F), E), C)
5. Merge (((D, F), E), C) and (A, B) at distance 2.50 → the single final cluster

The hierarchy is therefore ((((D, F), E), C), (A, B)). We can also plot the clustering hierarchy in the XY space. Using this information, we can now draw the final dendrogram; it is drawn based on the merge distances listed above.
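A hedged SciPy sketch that should reproduce the merge sequence and merge distances computed above (0.50, 0.71, 1.00, 1.41, 2.50) and draw the corresponding dendrogram; the variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Objects A..F with their features (X1, X2) from the worked example.
X = np.array([[1, 1], [1.5, 1.5], [5, 5], [3, 4], [4, 4], [3, 3.5]], dtype=float)

Z = linkage(X, method="single")   # single-linkage agglomerative clustering
print(np.round(Z, 2))             # each row: (cluster i, cluster j, merge distance, new size);
                                  # indices >= 6 refer to previously formed clusters (SciPy convention)

dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.ylabel("merge distance")
plt.show()
```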
DBSCAN
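The DBSCAN slides rely on figures, so as a minimal, hedged sketch of the idea (density-based clusters, with noise points labelled -1), here is a scikit-learn example. The eps and min_samples values and the make_moons data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape k-means handles poorly but DBSCAN handles well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighbourhood radius; min_samples: points required for a dense (core) region.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_   # cluster index per point; -1 marks noise points
print("clusters:", len(set(labels) - {-1}), "noise points:", int(np.sum(labels == -1)))
```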
Clustering Tendency
• A big issue in cluster analysis is that clustering methods will return clusters even if the data does not contain any clusters. In other words, if you blindly apply a clustering method to a data set, it will divide the data into clusters, because that is what it is supposed to do.
• Before applying any clustering method to your data, it is important to evaluate whether the data set contains meaningful clusters (i.e., non-random structure) and, if so, how many clusters there are. This process is called assessing clustering tendency, or the feasibility of the clustering analysis.
Hopkins statistic
• The Hopkins statistic (Lawson and Jurs, 1990) is used to assess the clustering tendency of a data set by measuring the probability that the data set was generated by a uniform data distribution. In other words, it tests the spatial randomness of the data.
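A hedged NumPy/scikit-learn sketch of one common formulation of the Hopkins statistic; under this convention, values near 0.5 suggest uniform (random) data and values near 1 suggest clusterable structure. The sample size m, the bounding-box sampling, and the use of NearestNeighbors are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, m=None, seed=0):
    """One common form: H = sum(u) / (sum(u) + sum(w)), where u are nearest-neighbour
    distances from m uniformly sampled points to the data, and w are nearest-neighbour
    distances from m sampled real points to the rest of the data."""
    X = np.asarray(X, float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)                      # small sample size (a common rule of thumb)

    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distances from uniform points (inside the data's bounding box) to their nearest data point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

    # w: distances from sampled real points to their nearest *other* data point
    # (take the 2nd neighbour, since the 1st is the point itself).
    sample = X[rng.choice(n, size=m, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())
```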
Visual method
The algorithm of the visual assessment of cluster tendency (VAT) approach (Bezdek and Hathaway, 2002) is as follows:
• Compute the dissimilarity matrix (DM) between the objects in the data set using the Euclidean distance measure
• Reorder the DM so that similar objects are close to one another. This process creates an ordered dissimilarity matrix (ODM)
• The ODM is displayed as an ordered dissimilarity image (ODI), which is the visual output of VAT
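A minimal sketch of the VAT reordering step (a Prim's-algorithm-like ordering of the dissimilarity matrix), with the reordered matrix displayed as the ODI. The function name vat_order and the synthetic two-group data are illustrative assumptions, and the refinements of the full VAT algorithm are omitted.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt

def vat_order(D):
    """Reorder objects so that similar ones end up adjacent (Prim's-like selection)."""
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity in the matrix.
    first = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order, remaining = [first], set(range(n)) - {first}
    while remaining:
        rem = list(remaining)
        sub = D[np.ix_(order, rem)]                          # distances selected -> unselected
        nxt = rem[int(np.unravel_index(np.argmin(sub), sub.shape)[1])]
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Example: dissimilarity matrix of two well-separated groups, reordered and shown as an ODI.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(15, 2)),
               rng.normal(3.0, 0.3, size=(15, 2))])
D = squareform(pdist(X))                                     # dissimilarity matrix (Euclidean)
idx = vat_order(D)
plt.imshow(D[np.ix_(idx, idx)], cmap="gray")                 # ordered dissimilarity image (ODI)
plt.title("Ordered dissimilarity image (ODI)")
plt.show()
```

Dark blocks along the diagonal of the ODI suggest the presence of clusters; a structureless image suggests no clustering tendency.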
Happy Learning