Advanced Analytical Theory and
Methods: Cluster Analysis
Prof. Dr. Shamim Akhter
Professor, Dept. of CSE
Ahsanullah University of Science and Technology
What is it?
• A cluster is a collection of objects that are “similar”
to each other and “dissimilar” to the objects
belonging to other clusters.
• Given a finite set of data X, the problem of
clustering in X is to find several cluster centers that
form a partition of X such that the degree of
association is strong for data within blocks of the
partition and weak for data in different blocks.
• Machine learning defines data clustering as
unsupervised learning.
Classical Clustering Approaches
Classical clustering algorithms find a “hard partition” of a
given dataset based on certain criteria that evaluate the
goodness of a partition.
o “hard partition” means that each datum belongs to exactly
one partition cluster.
The concept of “hard partition” is defined as follows:
Hard Partition Clustering
• A cluster is a collection of objects “similar” between them and
“dissimilar” to the objects belonging to other clusters.
The similarity criterion is distance:
• Two or more objects belong to the same cluster if they are
“close” according to a given (geometric) distance.
• A simple Euclidean distance metric is often sufficient to group
similar data instances successfully.
Similarity Measures
• Nearest (Similar) Neighbor Technique
– Euclidean distance: “nearest” can be taken to mean the
smallest Euclidean distance.
Features with large dissimilarity are emphasized
more because the differences are squared.
– City Block Distance/Manhattan metric/taxicab
distance: absolute differences rather than squares.
De-emphasizes large feature differences; the distance is
influenced more by many small ones.
Similarity Measures
– Maximum distance metric (Chebyshev)
Considers only the most dissimilar
pair of features.
– Minkowski Distance
where r is an adjustable
parameter (r = 1 gives Manhattan, r = 2 gives Euclidean)
– Cosine similarity
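The metrics above can be sketched in plain Python. This is a minimal sketch; the function names are ours, not from the slides:

```python
import math

def euclidean(x, y):
    # Squaring emphasizes features with large dissimilarity.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Absolute differences: large differences are de-emphasized
    # relative to the squared (Euclidean) case.
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    # Only the most dissimilar feature pair counts.
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, r):
    # r is the adjustable parameter: r=1 -> Manhattan, r=2 -> Euclidean.
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def cosine_similarity(x, y):
    # Compares orientation of the vectors, not magnitude.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```

For example, between (0,0) and (3,4) the Euclidean distance is 5, the Manhattan distance is 7, and the Chebyshev distance is 4.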
K-Nearest Neighbor (K-NN)
• KNN is a supervised, non-parametric, and
lazy learning algorithm.
– no assumption on underlying data distribution, does not assume any specific
form for the relationship between independent and dependent variables.
– does not use the training data points to do any generalization.
kNN tends to work best on smaller datasets that do
not have many features.
https://www.youtube.com/watch?v=4HKqjENq9OU
1. Compute a distance value between the item to be classified and every item in the
training data-set
2. Pick the k closest data points (the items with the k lowest distances)
3. Conduct a “majority vote” among those data points — the dominating
classification in that pool is decided as the final classification
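The three steps can be sketched directly in Python. A minimal sketch; `knn_classify` and the data layout are our own illustration, assuming Python 3.8+ for `math.dist`:

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Classify `query` by majority vote among its k nearest
    training points. `training` is a list of (point, label) pairs."""
    # 1. Compute a distance from the query to every training item.
    dists = [(math.dist(query, p), label) for p, label in training]
    # 2. Pick the k closest data points.
    dists.sort(key=lambda t: t[0])
    nearest = [label for _, label in dists[:k]]
    # 3. Majority vote: the dominating label in the pool wins.
    return Counter(nearest).most_common(1)[0][0]
```

With three 'a' points near the origin and two 'b' points near (5,5), a query at (0.5, 0.5) is voted into class 'a'.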
Forgy’s Algorithm
1. Initialize the cluster centroids to the seed points (k
random samples, where k = number of clusters).
2. For each sample, find the cluster centroid nearest to it.
Put the sample in the cluster identified with this
nearest centroid.
3. If no samples changed clusters in step 2, stop.
4. Compute the centroids of the resulting clusters and go
to step 2.
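The four steps above can be sketched as follows (a minimal sketch; the function name and return convention are ours, assuming Python 3.8+ for `math.dist`):

```python
import math

def forgy(samples, seeds):
    """Forgy's algorithm: assign every sample to its nearest centroid,
    then recompute centroids; stop when no assignment changes."""
    centroids = list(seeds)
    assignment = None
    while True:
        # Step 2: for each sample, find the nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda i: math.dist(s, centroids[i]))
            for s in samples
        ]
        if new_assignment == assignment:   # Step 3: nothing moved, stop.
            return centroids, assignment
        assignment = new_assignment
        # Step 4: recompute the centroid of each resulting cluster.
        for i in range(len(centroids)):
            members = [s for s, a in zip(samples, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
```

Run on the example below with seeds (4,4) and (8,4), this converges to centroids (6,4) and (21,8).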
Example: Forgy’s Algorithm
Samples: (4,4), (8,4), (15,8), (24,4), (24,12)
• 1st Iteration: Centroids (4,4) and (8,4)
(Complete the clustering for all samples, then recompute the
centroid of each cluster.)
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     (8,4)
(24,4)     (8,4)
(24,12)    (8,4)
• 2nd Iteration: Centroids (4,4) and (17.75,7)
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (4,4)
(15,8)     (17.75,7)
(24,4)     (17.75,7)
(24,12)    (17.75,7)
Example: Forgy’s Algorithm
Samples: (4,4), (8,4), (15,8), (24,4), (24,12)
• 3rd Iteration: Centroids (6,4) and (21,8)
Sample     Nearest Cluster Centroid
(4,4)      (6,4)
(8,4)      (6,4)
(15,8)     (21,8)
(24,4)     (21,8)
(24,12)    (21,8)
No sample changes clusters, so the algorithm stops.
The K-mean algorithm
1. Begin with k clusters, each consisting of one of the first
k samples.
o For each of the remaining n-k samples, find the centroid nearest it.
o Put the sample in the cluster identified with this nearest centroid.
o After each sample is assigned, recompute the centroid of the
altered cluster.
2. Go through the data a second time.
o For each sample, find the centroid nearest it.
o Put the sample in the cluster identified with this nearest centroid.
o Do not recompute any centroid here.
First pass (samples processed in order; the altered cluster’s
centroid is recomputed after each assignment):
Sample     Centroid after assignment
(8,4)      (8,4)
(24,4)     (24,4)
(15,8)     (8,4) -> (11.5,6)
(4,4)      (11.5,6) -> (9,5.3)
(24,12)    (24,4) -> (24,8)
Second pass (centroids fixed at (9,5.3) and (24,8)):
Sample     Distance to (9,5.3)    Distance to (24,8)
(8,4)      1.6                    16.5
(24,4)     15.1                   4.0
(15,8)     6.6                    9.0
(4,4)      5.2                    20.4
(24,12)    16.4                   4.0
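The two passes of the algorithm can be sketched as follows (a minimal sketch; the function name is ours, assuming Python 3.8+ for `math.dist`):

```python
import math

def kmeans_two_pass(samples, k):
    """Single-pass seeding (step 1) followed by one reassignment
    pass with fixed centroids (step 2)."""
    # Step 1: the first k samples start the k clusters.
    clusters = [[s] for s in samples[:k]]
    centroids = [s for s in samples[:k]]
    for s in samples[k:]:
        i = min(range(k), key=lambda j: math.dist(s, centroids[j]))
        clusters[i].append(s)
        # Recompute the altered cluster's centroid immediately.
        centroids[i] = tuple(sum(c) / len(clusters[i])
                             for c in zip(*clusters[i]))
    # Step 2: reassign every sample; centroids stay fixed.
    final = [[] for _ in range(k)]
    for s in samples:
        i = min(range(k), key=lambda j: math.dist(s, centroids[j]))
        final[i].append(s)
    return centroids, final
```

Feeding the samples in the order (8,4), (24,4), (15,8), (4,4), (24,12) with k = 2 reproduces the centroids (9, 5.3) and (24, 8) of the worked example.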
Drawback of K-means
• K-means is sensitive to outliers
– Such samples are far away from the majority of the data
– Thus when assigned to a cluster, they can dramatically distort
the mean value of the cluster.
• How can we modify the K-means to diminish
such sensitivity to outliers?
– Instead of mean values of objects, we could take
actual objects to represent a cluster
– K-medoid methods use the sum of absolute error
K-medoid Methods
• Each cluster is represented by its medoid, an actual object in the
cluster, and the quality of a configuration is its total cost (the
sum of absolute errors).
• A medoid is swapped with a non-medoid object only if the swap
reduces the total cost: S = Cost(new) − Cost(past) < 0.
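A PAM-style swap test can be sketched as below. This is a minimal sketch, not the full PAM algorithm; the function names are ours, and Manhattan distance stands in for the absolute-error cost:

```python
def total_cost(points, medoids):
    # Sum of absolute (Manhattan) errors to the nearest medoid.
    return sum(min(sum(abs(a - b) for a, b in zip(p, m))
                   for m in medoids)
               for p in points)

def try_swap(points, medoids, out_medoid, candidate):
    """Swap `out_medoid` for `candidate` only if the total cost
    drops, i.e. S = new_cost - past_cost < 0."""
    past = total_cost(points, medoids)
    trial = [candidate if m == out_medoid else m for m in medoids]
    new = total_cost(points, trial)
    # Accept the swap only when it strictly lowers the cost.
    return (trial, new) if new - past < 0 else (medoids, past)
```

Because the representatives are actual data objects, a far-away outlier cannot drag a representative toward itself the way it distorts a mean.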
Soft Partition Clustering: Fuzzy Clustering
• In many real-world clustering problems, however, some
data points partially belong to multiple clusters, rather
than to a single cluster exclusively.
– A Magnetic Resonance Image (MRI) pixel may correspond to a mixture of
two different types of tissues.
– A particular customer may be a “borderline case” between two groups.
– “fuzzy clustering” algorithm.
Example: Fuzzy Membership
Fuzzy membership values are collected in a matrix U = [Ai(xk)],
where Ai(xk) is the degree to which data point xk belongs to
cluster i.
Fuzzy C-Means Clustering
• The most frequently used fuzzy clustering algorithm is the Fuzzy
C-Means (FCM) which is a fuzzification of the k-means algorithm.
• FCM is a method of clustering that allows one piece of data to
belong to two or more clusters.
Fuzzy C-Means Clustering
Data points xk: (1,3), (2,5), (4,8), (7,9)
Initial membership matrix (m = fuzziness parameter, usually 2):
Cluster    (1,3)   (2,5)   (4,8)   (7,9)
1          0.8     0.7     0.2     0.1
2          0.2     0.3     0.8     0.9
Centroid update (weighted means with weights Ai(xk)^m):
V11 = (0.8²·1 + 0.7²·2 + 0.2²·4 + 0.1²·7) / (0.8² + 0.7² + 0.2² + 0.1²) = 1.568
V12 = (0.8²·3 + 0.7²·5 + 0.2²·8 + 0.1²·9) / (0.8² + 0.7² + 0.2² + 0.1²) = 4.051
V21 = (0.2²·1 + 0.3²·2 + 0.8²·4 + 0.9²·7) / (0.2² + 0.3² + 0.8² + 0.9²) = 5.35
V22 = (0.2²·3 + 0.3²·5 + 0.8²·8 + 0.9²·9) / (0.2² + 0.3² + 0.8² + 0.9²) = 8.215
Centroids are (1.568, 4.051) and (5.35, 8.215).
Distances from each point to each centroid:
D11 = sqrt((1 − 1.568)² + (3 − 4.051)²) = 1.2,  D12 = 6.79
D21 = 1.04,  D22 = 4.64
D31 = 4.63,  D32 = 1.36
D41 = 7.34,  D42 = 1.82
Hard assignment by largest membership: 1, 1, 2, 2.
Fuzzy C-Means Clustering
Membership update: Aik = [ Σj (dik²/djk²)^(1/(m−1)) ]⁻¹, with m = 2:
A11 = [(d11²/d11²)^(1/(2−1)) + (d11²/d12²)^(1/(2−1))]⁻¹ = 0.97
A12 = [(d12²/d11²)^(1/(2−1)) + (d12²/d12²)^(1/(2−1))]⁻¹ = 0.03
A21 = [(d21²/d21²)^(1/(2−1)) + (d21²/d22²)^(1/(2−1))]⁻¹ = 0.95
A22 = [(d22²/d21²)^(1/(2−1)) + (d22²/d22²)^(1/(2−1))]⁻¹ = 0.05
A31 = [(d31²/d31²)^(1/(2−1)) + (d31²/d32²)^(1/(2−1))]⁻¹ = 0.08
A32 = [(d32²/d31²)^(1/(2−1)) + (d32²/d32²)^(1/(2−1))]⁻¹ = 0.92
A41 = [(d41²/d41²)^(1/(2−1)) + (d41²/d42²)^(1/(2−1))]⁻¹ = 0.06
A42 = [(d42²/d41²)^(1/(2−1)) + (d42²/d42²)^(1/(2−1))]⁻¹ = 0.94
Updated membership matrix:
Cluster    (1,3)   (2,5)   (4,8)   (7,9)
1          0.97    0.95    0.08    0.06
2          0.03    0.05    0.92    0.94
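One full FCM iteration (centroid update, then membership update) can be sketched as below. A minimal sketch with our own function name; it assumes no point coincides exactly with a centroid (which would make a distance zero):

```python
def fcm_step(points, U, m=2):
    """One FCM iteration: centroids from memberships, then
    memberships from the new distances."""
    c = len(U)                      # number of clusters
    # Centroid update: weighted mean with weights u^m.
    centroids = []
    for i in range(c):
        w = [u ** m for u in U[i]]
        centroids.append(tuple(
            sum(wk * p[d] for wk, p in zip(w, points)) / sum(w)
            for d in range(len(points[0]))))
    # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
    new_U = [[0.0] * len(points) for _ in range(c)]
    for k, p in enumerate(points):
        d = [sum((a - b) ** 2 for a, b in zip(p, centroids[i])) ** 0.5
             for i in range(c)]
        for i in range(c):
            new_U[i][k] = 1.0 / sum((d[i] / d[j]) ** (2 / (m - 1))
                                    for j in range(c))
    return centroids, new_U
```

Applied to the worked example above, it reproduces the centroids (1.568, 4.051) and (5.35, 8.215) and memberships such as A11 ≈ 0.97 and A31 ≈ 0.08.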
Fuzzy C-Means Clustering
Problem with Euclidean distance
• Euclidean distance can sometimes be misleading: it is not
scale invariant.
• If you work with feet and inches, or pounds and kilograms, the
features must first be brought onto the same scale.
• There is a straightforward fix: standardize/normalize your data.
But doing so requires domain knowledge. And how about text data?
Conceptual Based Clustering
Conceptual clustering is a form of clustering in machine learning
that, given a set of unlabeled objects, produces a classification
scheme over those objects.
Conceptual clustering goes one step further by finding characteristic
descriptions for each group, representing a concept or class.
Clustering quality is not solely a function of individual objects. Rather
it incorporates factors such as the generality and simplicity of the
derived concept descriptions.
Conceptual Based Clustering
• Most methods of conceptual clustering adopt a
statistical approach that uses probability
measurements in determining the concepts or
clusters.
– COBWEB
– CLASSIT
Conceptual Based Clustering: COBWEB
• COBWEB is a method of incremental conceptual clustering.
• Its input objects are described by categorical attribute-value pairs.
• It is an overlapping technique. Clusters are not necessarily
disjoint and may share components.
• It creates a hierarchical clustering in the form of a classification
tree. The hierarchy is incrementally built and regularly
rearranged to correct the insertion-order bias.
– Each node of the tree refers to a
concept and contains the
probabilistic description.
– Probability of the concept and
conditional probability
“Knowledge Acquisition Via Incremental Conceptual Clustering,” Machine Learning 2: 139–172, 1987. © 1987 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Category Utility: Heuristic Measure
• The goodness of a clustering generally reflects:
– Similarity of objects within the same class => intra-class similarity
– Dissimilarity of objects in different classes => inter-class dissimilarity
• Intra-class similarity is reflected by the predictability
P(Ai = Vij | Ck), where Ai = Vij is an attribute-value pair [Table 1]
and Ck is a class.
The larger this probability, the greater the proportion of class
members sharing the value and the more predictable the value is for
class members.
• Inter-class dissimilarity is reflected by the predictiveness
P(Ck | Ai = Vij).
The larger this probability, the fewer the objects in contrasting
classes that share this value Vij and the more predictive the value
is of the class.
Category Utility: Heuristic Measure
These terms combine as Weight × Inter-Cluster × Intra-Cluster:
P(Ai = Vij) · P(Ck | Ai = Vij) · P(Ai = Vij | Ck)
The weight P(Ai = Vij) captures the importance of the individual
value: it increases the class-conditioned predictability and
predictiveness of frequently occurring values relative to
infrequently occurring values.
Category Utility: Heuristic Measure
CU = ( Σk P(Ck) [ Σi Σj P(Ai = Vij | Ck)² − Σi Σj P(Ai = Vij)² ] ) / n
The denominator, n, is the number of categories in a partition.
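Category utility, as defined above, can be sketched for objects described by categorical attribute-value tuples. A minimal sketch; the function name and data layout are ours:

```python
from collections import Counter

def category_utility(partition):
    """Category utility of a partition; each object is a tuple of
    categorical attribute values, each cluster a list of objects."""
    objects = [o for cluster in partition for o in cluster]
    n_objects = len(objects)
    n_attrs = len(objects[0])

    def sum_sq(objs):
        # Sum over attributes i of sum over values j of P(Ai = Vij)^2.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(o[i] for o in objs)
            total += sum((c / len(objs)) ** 2 for c in counts.values())
        return total

    baseline = sum_sq(objects)          # expected guesses without classes
    score = sum(len(ck) / n_objects * (sum_sq(ck) - baseline)
                for ck in partition)    # weighted by P(Ck)
    return score / len(partition)       # n = number of categories
```

A perfectly separating partition of two homogeneous clusters scores higher than any partition that mixes their attribute values.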
There are four operations that the Cobweb uses
while making the tree:
• classifying the object for an existing class,
• creating a new class,
• combining two classes into a single class, and
• dividing a class into several classes.
Operator 1: Placing an object in an
existing class
• To determine which category “best” hosts a new
object, COBWEB tentatively places the object in
each category.
• The partition that results from adding the object
to a given node is evaluated using category utility.
Operator 2: Creating a new class
• In addition to placing objects in existing classes,
there is a way to create new classes.
• Specifically, the quality of the partition resulting
from placing the object in the best existing host is
compared to the partition resulting from creating
a new singleton class containing the object.
• Depending on which partition is better with respect
to category utility, the object is placed in the best
existing class, or a new class is created.
Operator 3 : Merging
• To guard against the effects of initially
skewed data, COBWEB includes operators
for node merging and splitting.
• Merging takes two nodes of a level (of n
nodes) and 'combines' them in hopes that
the resultant partition (of n - 1 nodes) is of
better quality.
• Merging two nodes involves creating a new
node and summing the attribute-value
counts of the nodes being merged. The two
original nodes are made children of the
newly created node, as shown in the Figure.
Operator 4 : Splitting
• Splitting may increase partition
quality.
• A node of a partition (of n nodes)
may be deleted and its children
promoted, resulting in a partition
of n + m - 1 nodes, where the
deleted node had m children as
shown in the Figure.
The Algorithm Steps
Criterion Functions For Clustering
• How should one evaluate a partitioning of a set of
samples into clusters, i.e., find an optimal partition?
• Sum-of-Squared-Error Criterion
– Let ni be the number of samples in Di and let mi be the mean
of those samples: mi = (1/ni) Σ_{x ∈ Di} x
– Je = Σi Σ_{x ∈ Di} ||x − mi||²
An optimum partition is defined as one that minimizes Je.
Clusterings of this type are often called minimum-variance partitions.
Criterion Functions For Clustering
• Related Minimum Variance Criteria
We can eliminate the mean vectors from the SSE and obtain
Je = (1/2) Σi ni s̄i,  with  s̄i = (1/ni²) Σ_{x ∈ Di} Σ_{x′ ∈ Di} ||x − x′||²
Here s̄i is the average squared distance between points in the ith
cluster, which emphasizes that the sum-of-squared-error criterion
uses Euclidean distance as the measure of similarity. Replacing
||x − x′||² with a generic similarity function s(x, x′) gives a more
general representation.
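The SSE criterion Je can be sketched directly from its definition (a minimal sketch; the function name is ours):

```python
def sum_of_squared_error(clusters):
    """Je = sum over clusters Di of the squared Euclidean distances
    from each sample x in Di to the cluster mean mi."""
    je = 0.0
    for di in clusters:
        ni = len(di)
        # mi: the mean vector of the ni samples in this cluster.
        mi = tuple(sum(c) / ni for c in zip(*di))
        je += sum(sum((a - b) ** 2 for a, b in zip(x, mi)) for x in di)
    return je
```

An optimum partition is one that minimizes this value over all candidate partitions.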
Optimal Clustering Methods
• Clustering validity indexes are usually defined by combining
the compactness and separability of the clusters.
– Compactness measures the closeness of cluster elements;
a common measure of compactness is variance.
– Separability indicates how distinct two clusters are.
• There are two types of validity techniques used for clustering evaluation- external
criteria and internal criteria.
• When a clustering result is evaluated based on the data that was clustered itself,
this is called internal evaluation
• In external evaluation, clustering results are evaluated based on data not used for
clustering, such as known class labels and external benchmarks. Such
benchmarks consist of a set of pre-classified items, and (expert) humans often
create these sets.
Internal Validation Indexes
• Davies-Bouldin Index: Davies Bouldin (DB) index measures the
average similarity between each cluster and its most similar one.
• A lower value of the DB index indicates that clusters are
compact and well-separated, reflecting better clustering.
• The goal of this index is to achieve minimum within-cluster
variance and maximum between-cluster separation. It measures
the similarity of two clusters (Rij) by their scatters (Si, Sj)
and their separation (dij) by the distance between the two
cluster centroids vi and vj:
Rij = (Si + Sj) / dij,   DB = (1/k) Σi max_{j ≠ i} Rij
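A minimal pure-Python sketch of the DB computation, using one common convention (Si as the mean distance of members to their centroid, dij as the centroid distance); the function name is ours, and Python 3.8+ is assumed for `math.dist`:

```python
import math

def davies_bouldin(clusters):
    """DB index: average over clusters of the worst-case
    (Si + Sj) / dij ratio; lower is better."""
    centroids = [tuple(sum(c) / len(ck) for c in zip(*ck))
                 for ck in clusters]
    # Si: average distance of cluster members to their centroid.
    s = [sum(math.dist(x, v) for x in ck) / len(ck)
         for ck, v in zip(clusters, centroids)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # For each cluster, take its most similar (worst) partner.
        total += max((s[i] + s[j]) / math.dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k
```

Two tight clusters far apart score low; moving them closer together raises the index.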
Internal Validation Indexes
• Dunn Index: The value of the Dunn index (DI) is expected to be
large if clusters of the data set are well separated. If the dataset
has compact and well-separated clusters, the distance between
the clusters is expected to be large and the diameter of the
clusters is expected to be smaller.
• The clusters are compact and well separated by maximizing the
inter-cluster distance while minimizing the intra-cluster distance.
A large value of the Dunn index indicates compact and well-
separated clusters:
DI = min_{i ≠ j} d(Ci, Cj) / max_k diam(Ck)
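The Dunn index can be sketched as below, using one common convention (single-linkage inter-cluster distance and complete-diameter within-cluster distance); the function name is ours, and Python 3.8+ is assumed for `math.dist`:

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn index: smallest inter-cluster distance divided by the
    largest cluster diameter; larger is better."""
    # Minimum distance between points of different clusters.
    min_between = min(
        math.dist(x, y)
        for ca, cb in combinations(clusters, 2)
        for x in ca for y in cb)
    # Maximum diameter over clusters (0 for singleton clusters).
    max_diameter = max(
        max((math.dist(x, y) for x, y in combinations(ck, 2)),
            default=0.0)
        for ck in clusters)
    return min_between / max_diameter
```

With compact clusters placed far apart the numerator is large and the denominator small, so the index is large.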
Internal Validation Indexes
• Silhouette Coefficient: the Silhouette Coefficient (SC) shows how
well an object fits within its cluster.
• It measures the quality of the clustering on a scale from −1 to 1.
A value near one (1) indicates that the point x is assigned to the
right cluster.
• There are two terms: cohesion and separation. Cohesion is the
intra-cluster distance, and separation is the distance between
clusters. A(x) is the average dissimilarity between x and all
other points of its cluster; B(x) is the minimum average
dissimilarity between x and the points of any other cluster.
A value near −1 indicates that the point should be assigned to
another cluster:
SC(x) = (B(x) − A(x)) / max(A(x), B(x))
Internal validation measures for K-means clustering are explored in:
“Exploring K-Means with Internal Validity Indexes for Data Clustering in Traffic Management System,” (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 3, 2017.
External Validation Indexes
Rand Index: The Rand index computes how similar the clusters
(returned by the clustering algorithm) are to the benchmark
classifications. It can be computed using the following formula:
RI = (TP + TN) / (TP + TN + FP + FN)
where TP = # of true positives, TN = # of true negatives,
FP = # of false positives, and FN = # of false negatives, all
counted over pairs of items. If the dataset size is N, then
TP + TN + FP + FN = N(N − 1)/2.
• One issue with the Rand index is that false positives and false
negatives are equally weighted. This may be an undesirable
characteristic for some clustering applications.
• The F-measure addresses this concern as does the chance-
corrected adjusted Rand index.
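Counting the four pair categories can be sketched as follows (a minimal sketch; the function name and label encoding are ours):

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Rand index over all N(N-1)/2 sample pairs: a pair counts as
    TP if it shares a cluster in both labelings, TN if in neither."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            tp += 1
        elif not same_pred and not same_true:
            tn += 1
        elif same_pred and not same_true:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)
```

Note that the cluster ids themselves do not matter, only which pairs end up together: [0,0,1,1] and [1,1,0,0] describe the same clustering and score 1.0 against each other.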
External Validation Indexes
• F-measure: The F-measure can be used to balance the
contribution of false negatives by weighting recall through a
parameter β ≥ 0. Let precision and recall (both external
evaluation measures in themselves) be defined as follows:
P = TP / (TP + FP),   R = TP / (TP + FN)
We can calculate the F-measure by using the following formula:
Fβ = ((β² + 1) · P · R) / (β² · P + R)
When β = 0, F0 = P. In other words, recall has no impact on the
F-measure when β = 0, and increasing β allocates an increasing
amount of weight to recall in the final F-measure. β can vary
from 0 upward without bound; TN is not taken into account.
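The Fβ formula can be sketched from the pair counts (a minimal sketch; the function name is ours):

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F_beta = ((beta^2 + 1) * P * R) / (beta^2 * P + R).
    beta = 0 reduces to precision; larger beta weights recall more."""
    p = tp / (tp + fp)   # precision
    r = tp / (tp + fn)   # recall
    return ((beta ** 2 + 1) * p * r) / (beta ** 2 * p + r)
```

With TP = 6, FP = 2, FN = 6 (so P = 0.75, R = 0.5), β = 0 gives exactly the precision 0.75, while β = 1 gives the harmonic mean 0.6.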
External Validation Indexes
Jaccard index: The Jaccard index is used to quantify the similarity
between two datasets. The Jaccard index takes on a value between 0
and 1.
An index of 1 means that the two datasets are identical, and an
index of 0 indicates that the datasets have no common elements.
The following formula defines the Jaccard index:
J = TP / (TP + FP + FN)
This is simply the number of unique elements common to both sets
divided by the total number of unique elements in both sets. Note
that TN is not taken into account.
External Validation Indexes
Dice index: The Dice symmetric measure doubles the weight on TP
while still ignoring TN:
DSC = 2·TP / (2·TP + FP + FN)
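Both indexes can be sketched side by side from the same pair counts (a minimal sketch; the function names are ours):

```python
def jaccard(tp, fp, fn):
    # Common elements over all unique elements; TN is ignored.
    return tp / (tp + fp + fn)

def dice(tp, fp, fn):
    # Doubles the weight on TP; TN is still ignored.
    return 2 * tp / (2 * tp + fp + fn)
```

For any counts with FP + FN > 0, Dice exceeds Jaccard because agreement (TP) is counted twice in both numerator and denominator.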