Advanced Analytical Theory and
Methods: Cluster Analysis
Prof. Dr. Shamim Akhter
Professor, Dept. of CSE
Ahsanullah University of Science and Technology
What is it?
• A cluster is a collection of objects that are “similar”
to each other and “dissimilar” to the objects
belonging to other clusters.
• Given a finite set of data X, the problem of
clustering in X is to find several cluster centers that
form a partition of X such that the degree of
association is strong for data within blocks of the
partition and weak for data in different blocks.
• Machine learning defines data clustering as
unsupervised learning.
Classical Clustering Approaches
Classical clustering algorithms find a “hard partition” of a
given dataset based on certain criteria that evaluate the
goodness of a partition.
o “hard partition” means that each datum belongs to exactly
one partition cluster.
The concept of “hard partition” is defined as follows:
Hard Partition Clustering
• A cluster is a collection of objects “similar” between them and
“dissimilar” to the objects belonging to other clusters.
The similarity criterion is distance:
• Two or more objects belong to the same cluster if they are
“close” according to a given (geometric) distance.
• A simple Euclidean distance metric is often sufficient to group
similar data instances successfully.
Similarity Measures
• Nearest (Similar) Neighbor Technique
– Euclidean distance: “nearest” can be taken to mean the
smallest Euclidean distance.
Features with large dissimilarity are emphasized
more because the differences are squared.
– City Block Distance/Manhattan metric/taxicab
distance: absolute differences rather than squares.
De-emphasizes large feature differences; the distance is
influenced more by many small ones.
Similarity Measures
– Maximum distance metric (Chebyshev)
Considers only the most dissimilar
pair of features.
– Minkowski Distance
where r is an adjustable
parameter (r = 1 gives Manhattan, r = 2 gives Euclidean)
– Cosine similarity
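The metrics above can be sketched in plain Python. This is a minimal sketch; the function names are ours, not from the slides:

```python
import math

def euclidean(x, y):
    # Squaring emphasizes features with large dissimilarity.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Absolute differences: large differences are de-emphasized
    # relative to the squared (Euclidean) case.
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    # Only the most dissimilar feature pair counts.
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, r):
    # r is the adjustable parameter: r=1 -> Manhattan, r=2 -> Euclidean.
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def cosine_similarity(x, y):
    # Compares orientation of the vectors, not magnitude.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)
```

For example, between (0,0) and (3,4) the Euclidean distance is 5, the Manhattan distance is 7, and the Chebyshev distance is 4.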
K-Nearest Neighbor (K-NN)
• KNN is a supervised, non-parametric, and
lazy learning algorithm.
– no assumption on underlying data distribution, does not assume any specific
form for the relationship between independent and dependent variables.
– does not use the training data points to do any generalization.
kNN tends to work best on smaller datasets that do
not have many features.
https://www.youtube.com/watch?v=4HKqjENq9OU
1. Compute a distance value between the item to be classified and every item in the
training data-set
2. Pick the k closest data points (the items with the k lowest distances)
3. Conduct a “majority vote” among those data points — the dominating
classification in that pool is decided as the final classification
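The three steps can be sketched directly in Python. A minimal sketch; `knn_classify` and the data layout are our own illustration, assuming Python 3.8+ for `math.dist`:

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Classify `query` by majority vote among its k nearest
    training points. `training` is a list of (point, label) pairs."""
    # 1. Compute a distance from the query to every training item.
    dists = [(math.dist(query, p), label) for p, label in training]
    # 2. Pick the k closest data points.
    dists.sort(key=lambda t: t[0])
    nearest = [label for _, label in dists[:k]]
    # 3. Majority vote: the dominating label in the pool wins.
    return Counter(nearest).most_common(1)[0][0]
```

With three 'a' points near the origin and two 'b' points near (5,5), a query at (0.5, 0.5) is voted into class 'a'.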
Forgy’s Algorithm
1. Initialize the cluster centroids to the seed points (k
random samples, where k = number of clusters).
2. For each sample, find the cluster centroid nearest to it.
Put the sample in the cluster identified with this
nearest centroid.
3. If no samples changed clusters in step 2, stop.
4. Compute the centroids of the resulting clusters and go
to step 2.
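The four steps above can be sketched as follows (a minimal sketch; the function name and return convention are ours, assuming Python 3.8+ for `math.dist`):

```python
import math

def forgy(samples, seeds):
    """Forgy's algorithm: assign every sample to its nearest centroid,
    then recompute centroids; stop when no assignment changes."""
    centroids = list(seeds)
    assignment = None
    while True:
        # Step 2: for each sample, find the nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda i: math.dist(s, centroids[i]))
            for s in samples
        ]
        if new_assignment == assignment:   # Step 3: nothing moved, stop.
            return centroids, assignment
        assignment = new_assignment
        # Step 4: recompute the centroid of each resulting cluster.
        for i in range(len(centroids)):
            members = [s for s, a in zip(samples, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
```

Run on the example below with seeds (4,4) and (8,4), this converges to centroids (6,4) and (21,8).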
Example: Forgy’s Algorithm
Samples: (4,4), (8,4), (15,8), (24,4), (24,12)
• 1st Iteration: Centroids (4,4) and (8,4)
(Complete the clustering for all samples, then recompute the
centroid of each cluster.)
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     (8,4)
(24,4)     (8,4)
(24,12)    (8,4)
• 2nd Iteration: Centroids (4,4) and (17.75,7)
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (4,4)
(15,8)     (17.75,7)
(24,4)     (17.75,7)
(24,12)    (17.75,7)
Example: Forgy’s Algorithm
Samples: (4,4), (8,4), (15,8), (24,4), (24,12)
• 3rd Iteration: Centroids (6,4) and (21,8)
Sample     Nearest Cluster Centroid
(4,4)      (6,4)
(8,4)      (6,4)
(15,8)     (21,8)
(24,4)     (21,8)
(24,12)    (21,8)
No sample changes clusters, so the algorithm stops.
The K-mean algorithm
1. Begin with k clusters, each consisting of one of the first
k samples.
o For each of the remaining n-k samples, find the centroid nearest it.
o Put the sample in the cluster identified with this nearest centroid.
o After each sample is assigned, recompute the centroid of the
altered cluster.
2. Go through the data a second time.
o For each sample, find the centroid nearest it.
o Put the sample in the cluster identified with this nearest centroid.
o Do not recompute any centroid here.
First pass (samples processed in order; the altered cluster’s
centroid is recomputed after each assignment):
Sample     Centroid after assignment
(8,4)      (8,4)
(24,4)     (24,4)
(15,8)     (8,4) -> (11.5,6)
(4,4)      (11.5,6) -> (9,5.3)
(24,12)    (24,4) -> (24,8)
Second pass (centroids fixed at (9,5.3) and (24,8)):
Sample     Distance to (9,5.3)    Distance to (24,8)
(8,4)      1.6                    16.5
(24,4)     15.1                   4.0
(15,8)     6.6                    9.0
(4,4)      5.2                    20.4
(24,12)    16.4                   4.0
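The two passes of the algorithm can be sketched as follows (a minimal sketch; the function name is ours, assuming Python 3.8+ for `math.dist`):

```python
import math

def kmeans_two_pass(samples, k):
    """Single-pass seeding (step 1) followed by one reassignment
    pass with fixed centroids (step 2)."""
    # Step 1: the first k samples start the k clusters.
    clusters = [[s] for s in samples[:k]]
    centroids = [s for s in samples[:k]]
    for s in samples[k:]:
        i = min(range(k), key=lambda j: math.dist(s, centroids[j]))
        clusters[i].append(s)
        # Recompute the altered cluster's centroid immediately.
        centroids[i] = tuple(sum(c) / len(clusters[i])
                             for c in zip(*clusters[i]))
    # Step 2: reassign every sample; centroids stay fixed.
    final = [[] for _ in range(k)]
    for s in samples:
        i = min(range(k), key=lambda j: math.dist(s, centroids[j]))
        final[i].append(s)
    return centroids, final
```

Feeding the samples in the order (8,4), (24,4), (15,8), (4,4), (24,12) with k = 2 reproduces the centroids (9, 5.3) and (24, 8) of the worked example.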
Drawback of K-means
• K-means is sensitive to outliers
– Such samples are far away from the majority of the data
– Thus when assigned to a cluster, they can dramatically distort
the mean value of the cluster.
• How can we modify the K-means to diminish
such sensitivity to outliers?
– Instead of mean values of objects, we could take
actual objects to represent a cluster
– K-medoid methods use the sum of absolute error
K-medoid Methods
• Each cluster is represented by its medoid, an actual object in the
cluster, and the quality of a configuration is its total cost (the
sum of absolute errors).
• A medoid is swapped with a non-medoid object only if the swap
reduces the total cost: S = Cost(new) − Cost(past) < 0.
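A PAM-style swap test can be sketched as below. This is a minimal sketch, not the full PAM algorithm; the function names are ours, and Manhattan distance stands in for the absolute-error cost:

```python
def total_cost(points, medoids):
    # Sum of absolute (Manhattan) errors to the nearest medoid.
    return sum(min(sum(abs(a - b) for a, b in zip(p, m))
                   for m in medoids)
               for p in points)

def try_swap(points, medoids, out_medoid, candidate):
    """Swap `out_medoid` for `candidate` only if the total cost
    drops, i.e. S = new_cost - past_cost < 0."""
    past = total_cost(points, medoids)
    trial = [candidate if m == out_medoid else m for m in medoids]
    new = total_cost(points, trial)
    # Accept the swap only when it strictly lowers the cost.
    return (trial, new) if new - past < 0 else (medoids, past)
```

Because the representatives are actual data objects, a far-away outlier cannot drag a representative toward itself the way it distorts a mean.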
Soft Partition Clustering: Fuzzy Clustering
• In many real-world clustering problems, however, some
data points partially belong to multiple clusters, rather
than to a single cluster exclusively.
– A Magnetic Resonance Image (MRI) pixel may correspond to a mixture of
two different types of tissues.
– A particular customer may be a “borderline case” between two groups.
– “fuzzy clustering” algorithm.
Example: Fuzzy Membership
Fuzzy membership values are collected in a matrix U = [Ai(xk)],
where Ai(xk) is the degree to which data point xk belongs to
cluster i.
Fuzzy C-Means Clustering
• The most frequently used fuzzy clustering algorithm is the Fuzzy
C-Means (FCM) which is a fuzzification of the k-means algorithm.
• FCM is a method of clustering that allows one piece of data to
belong to two or more clusters.
Fuzzy C-Means Clustering
Data points xk: (1,3), (2,5), (4,8), (7,9)
Initial membership matrix (m = fuzziness parameter, usually 2):
Cluster    (1,3)   (2,5)   (4,8)   (7,9)
1          0.8     0.7     0.2     0.1
2          0.2     0.3     0.8     0.9
Centroid update (weighted means with weights Ai(xk)^m):
V11 = (0.8²·1 + 0.7²·2 + 0.2²·4 + 0.1²·7) / (0.8² + 0.7² + 0.2² + 0.1²) = 1.568
V12 = (0.8²·3 + 0.7²·5 + 0.2²·8 + 0.1²·9) / (0.8² + 0.7² + 0.2² + 0.1²) = 4.051
V21 = (0.2²·1 + 0.3²·2 + 0.8²·4 + 0.9²·7) / (0.2² + 0.3² + 0.8² + 0.9²) = 5.35
V22 = (0.2²·3 + 0.3²·5 + 0.8²·8 + 0.9²·9) / (0.2² + 0.3² + 0.8² + 0.9²) = 8.215
Centroids are (1.568, 4.051) and (5.35, 8.215).
Distances from each point to each centroid:
D11 = sqrt((1 − 1.568)² + (3 − 4.051)²) = 1.2,  D12 = 6.79
D21 = 1.04,  D22 = 4.64
D31 = 4.63,  D32 = 1.36
D41 = 7.34,  D42 = 1.82
Hard assignment by largest membership: 1, 1, 2, 2.
Fuzzy C-Means Clustering
Membership update: Aik = [ Σj (dik²/djk²)^(1/(m−1)) ]⁻¹, with m = 2:
A11 = [(d11²/d11²)^(1/(2−1)) + (d11²/d12²)^(1/(2−1))]⁻¹ = 0.97
A12 = [(d12²/d11²)^(1/(2−1)) + (d12²/d12²)^(1/(2−1))]⁻¹ = 0.03
A21 = [(d21²/d21²)^(1/(2−1)) + (d21²/d22²)^(1/(2−1))]⁻¹ = 0.95
A22 = [(d22²/d21²)^(1/(2−1)) + (d22²/d22²)^(1/(2−1))]⁻¹ = 0.05
A31 = [(d31²/d31²)^(1/(2−1)) + (d31²/d32²)^(1/(2−1))]⁻¹ = 0.08
A32 = [(d32²/d31²)^(1/(2−1)) + (d32²/d32²)^(1/(2−1))]⁻¹ = 0.92
A41 = [(d41²/d41²)^(1/(2−1)) + (d41²/d42²)^(1/(2−1))]⁻¹ = 0.06
A42 = [(d42²/d41²)^(1/(2−1)) + (d42²/d42²)^(1/(2−1))]⁻¹ = 0.94
Updated membership matrix:
Cluster    (1,3)   (2,5)   (4,8)   (7,9)
1          0.97    0.95    0.08    0.06
2          0.03    0.05    0.92    0.94
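One full FCM iteration (centroid update, then membership update) can be sketched as below. A minimal sketch with our own function name; it assumes no point coincides exactly with a centroid (which would make a distance zero):

```python
def fcm_step(points, U, m=2):
    """One FCM iteration: centroids from memberships, then
    memberships from the new distances."""
    c = len(U)                      # number of clusters
    # Centroid update: weighted mean with weights u^m.
    centroids = []
    for i in range(c):
        w = [u ** m for u in U[i]]
        centroids.append(tuple(
            sum(wk * p[d] for wk, p in zip(w, points)) / sum(w)
            for d in range(len(points[0]))))
    # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
    new_U = [[0.0] * len(points) for _ in range(c)]
    for k, p in enumerate(points):
        d = [sum((a - b) ** 2 for a, b in zip(p, centroids[i])) ** 0.5
             for i in range(c)]
        for i in range(c):
            new_U[i][k] = 1.0 / sum((d[i] / d[j]) ** (2 / (m - 1))
                                    for j in range(c))
    return centroids, new_U
```

Applied to the worked example above, it reproduces the centroids (1.568, 4.051) and (5.35, 8.215) and memberships such as A11 ≈ 0.97 and A31 ≈ 0.08.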
Fuzzy C-Means Clustering
Problem with Euclidean distance
• Euclidean distance can sometimes be misleading: it is not
scale invariant.
• If you work with feet and inches, or pounds and kilograms, the
features must first be brought onto the same scale.
• There is a straightforward fix: standardize/normalize your data.
But doing so requires domain knowledge. And how about text data?
Conceptual Based Clustering
Conceptual clustering is a form of clustering in machine learning
that, given a set of unlabeled objects, produces a classification
scheme over those objects.
Conceptual clustering goes one step further by finding characteristic
descriptions for each group, representing a concept or class.
Clustering quality is not solely a function of individual objects. Rather
it incorporates factors such as the generality and simplicity of the
derived concept descriptions.
Conceptual Based Clustering
• Most methods of conceptual clustering adopt a
statistical approach that uses probability
measurements in determining the concepts or
clusters.
– COBWEB
– CLASSIT
Conceptual Based Clustering: COBWEB
• COBWEB is a method of incremental conceptual clustering.
• Its input objects are described by categorical attribute-value pairs.
• It is an overlapping technique. Clusters are not necessarily
disjoint and may share components.
• It creates a hierarchical clustering in the form of a classification
tree. The hierarchy is incrementally built and regularly
rearranged to correct the insertion-order bias.
– Each node of the tree refers to a
concept and contains the
probabilistic description.
– Probability of the concept and
conditional probability
“Knowledge Acquisition Via Incremental Conceptual Clustering,” Machine Learning 2: 139–172, 1987. © 1987 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Category Utility: Heuristic Measure
• The goodness of a clustering generally reflects:
– Similarity of objects within the same class => intra-class similarity
– Dissimilarity of objects in different classes => inter-class dissimilarity
• Intra-class similarity is reflected by the predictability
P(Ai = Vij | Ck), where Ai = Vij is an attribute-value pair [Table 1]
and Ck is a class.
The larger this probability, the greater the proportion of class
members sharing the value and the more predictable the value is for
class members.
• Inter-class dissimilarity is reflected by the predictiveness
P(Ck | Ai = Vij).
The larger this probability, the fewer the objects in contrasting
classes that share this value Vij and the more predictive the value
is of the class.
Category Utility: Heuristic Measure
These terms combine as Weight × Inter-Cluster × Intra-Cluster:
P(Ai = Vij) · P(Ck | Ai = Vij) · P(Ai = Vij | Ck)
The weight P(Ai = Vij) captures the importance of the individual
value: it increases the class-conditioned predictability and
predictiveness of frequently occurring values relative to
infrequently occurring values.
Category Utility: Heuristic Measure
CU = ( Σk P(Ck) [ Σi Σj P(Ai = Vij | Ck)² − Σi Σj P(Ai = Vij)² ] ) / n
The denominator, n, is the number of categories in a partition.
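Category utility, as defined above, can be sketched for objects described by categorical attribute-value tuples. A minimal sketch; the function name and data layout are ours:

```python
from collections import Counter

def category_utility(partition):
    """Category utility of a partition; each object is a tuple of
    categorical attribute values, each cluster a list of objects."""
    objects = [o for cluster in partition for o in cluster]
    n_objects = len(objects)
    n_attrs = len(objects[0])

    def sum_sq(objs):
        # Sum over attributes i of sum over values j of P(Ai = Vij)^2.
        total = 0.0
        for i in range(n_attrs):
            counts = Counter(o[i] for o in objs)
            total += sum((c / len(objs)) ** 2 for c in counts.values())
        return total

    baseline = sum_sq(objects)          # expected guesses without classes
    score = sum(len(ck) / n_objects * (sum_sq(ck) - baseline)
                for ck in partition)    # weighted by P(Ck)
    return score / len(partition)       # n = number of categories
```

A perfectly separating partition of two homogeneous clusters scores higher than any partition that mixes their attribute values.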
There are four operations that the Cobweb uses
while making the tree:
• classifying the object for an existing class,
• creating a new class,
• combining two classes into a single class, and
• dividing a class into several classes.
Operator 1: Placing an object in an
existing class
• To determine which category “best” hosts a new
object, COBWEB tentatively places the object in
each category.
• The partition that results from adding the object
to a given node is evaluated using category utility.
Operator 2: Creating a new class
• In addition to placing objects in existing classes,
there is a way to create new classes.
• Specifically, the quality of the partition resulting
from placing the object in the best existing host is
compared to the partition resulting from creating
a new singleton class containing the object.
• Depending on which partition is better with respect
to category utility, the object is placed in the best
existing class, or a new class is created.
Operator 3 : Merging
• To guard against the effects of initially
skewed data, COBWEB includes operators
for node merging and splitting.
• Merging takes two nodes of a level (of n
nodes) and 'combines' them in hopes that
the resultant partition (of n - 1 nodes) is of
better quality.
• Merging two nodes involves creating a new
node and summing the attribute-value
counts of the nodes being merged. The two
original nodes are made children of the
newly created node, as shown in the Figure.
Operator 4 : Splitting
• Splitting may increase partition
quality.
• A node of a partition (of n nodes)
may be deleted and its children
promoted, resulting in a partition
of n + m - 1 nodes, where the
deleted node had m children as
shown in the Figure.
The Algorithm Steps
Criterion Functions For Clustering
• How should one evaluate a partitioning of a set of
samples into clusters, i.e., find an optimal partition?
• Sum-of-Squared-Error Criterion
– Let ni be the number of samples in Di and let mi be the mean
of those samples: mi = (1/ni) Σ_{x ∈ Di} x
– Je = Σi Σ_{x ∈ Di} ||x − mi||²
An optimum partition is defined as one that minimizes Je.
Clusterings of this type are often called minimum-variance partitions.
Criterion Functions For Clustering
• Related Minimum Variance Criteria
We can eliminate the mean vectors from the SSE and obtain
Je = (1/2) Σi ni s̄i,  with  s̄i = (1/ni²) Σ_{x ∈ Di} Σ_{x′ ∈ Di} ||x − x′||²
Here s̄i is the average squared distance between points in the ith
cluster, which emphasizes that the sum-of-squared-error criterion
uses Euclidean distance as the measure of similarity. Replacing
||x − x′||² with a generic similarity function s(x, x′) gives a more
general representation.
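The SSE criterion Je can be sketched directly from its definition (a minimal sketch; the function name is ours):

```python
def sum_of_squared_error(clusters):
    """Je = sum over clusters Di of the squared Euclidean distances
    from each sample x in Di to the cluster mean mi."""
    je = 0.0
    for di in clusters:
        ni = len(di)
        # mi: the mean vector of the ni samples in this cluster.
        mi = tuple(sum(c) / ni for c in zip(*di))
        je += sum(sum((a - b) ** 2 for a, b in zip(x, mi)) for x in di)
    return je
```

An optimum partition is one that minimizes this value over all candidate partitions.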
Optimal Clustering Methods
• Clustering validity indexes are usually defined by combining
the compactness and separability of the clusters.
– Compactness measures the closeness of cluster elements;
a common measure of compactness is variance.
– Separability indicates how distinct two clusters are.
• There are two types of validity techniques used for clustering evaluation- external
criteria and internal criteria.
• When a clustering result is evaluated based on the data that was clustered itself,
this is called internal evaluation
• In external evaluation, clustering results are evaluated based on data not used for
clustering, such as known class labels and external benchmarks. Such
benchmarks consist of a set of pre-classified items, and (expert) humans often
create these sets.
Internal Validation Indexes
• Davies-Bouldin Index: Davies Bouldin (DB) index measures the
average similarity between each cluster and its most similar one.
• A lower value of the DB index indicates that clusters are
compact and well-separated, reflecting better clustering.
• The goal of this index is to achieve minimum within-cluster
variance and maximum between-cluster separation. It measures
the similarity of two clusters (Rij) by their scatters (Si, Sj)
and their separation (dij) by the distance between the two
cluster centroids vi and vj:
Rij = (Si + Sj) / dij,   DB = (1/k) Σi max_{j ≠ i} Rij
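A minimal pure-Python sketch of the DB computation, using one common convention (Si as the mean distance of members to their centroid, dij as the centroid distance); the function name is ours, and Python 3.8+ is assumed for `math.dist`:

```python
import math

def davies_bouldin(clusters):
    """DB index: average over clusters of the worst-case
    (Si + Sj) / dij ratio; lower is better."""
    centroids = [tuple(sum(c) / len(ck) for c in zip(*ck))
                 for ck in clusters]
    # Si: average distance of cluster members to their centroid.
    s = [sum(math.dist(x, v) for x in ck) / len(ck)
         for ck, v in zip(clusters, centroids)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # For each cluster, take its most similar (worst) partner.
        total += max((s[i] + s[j]) / math.dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k
```

Two tight clusters far apart score low; moving them closer together raises the index.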
Internal Validation Indexes
• Dunn Index: The value of the Dunn index (DI) is expected to be
large if clusters of the data set are well separated. If the dataset
has compact and well-separated clusters, the distance between
the clusters is expected to be large and the diameter of the
clusters is expected to be smaller.
• The clusters are compact and well separated by maximizing the
inter-cluster distance while minimizing the intra-cluster distance.
A large value of the Dunn index indicates compact and well-
separated clusters:
DI = min_{i ≠ j} d(Ci, Cj) / max_k diam(Ck)
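The Dunn index can be sketched as below, using one common convention (single-linkage inter-cluster distance and complete-diameter within-cluster distance); the function name is ours, and Python 3.8+ is assumed for `math.dist`:

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn index: smallest inter-cluster distance divided by the
    largest cluster diameter; larger is better."""
    # Minimum distance between points of different clusters.
    min_between = min(
        math.dist(x, y)
        for ca, cb in combinations(clusters, 2)
        for x in ca for y in cb)
    # Maximum diameter over clusters (0 for singleton clusters).
    max_diameter = max(
        max((math.dist(x, y) for x, y in combinations(ck, 2)),
            default=0.0)
        for ck in clusters)
    return min_between / max_diameter
```

With compact clusters placed far apart the numerator is large and the denominator small, so the index is large.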
Internal Validation Indexes
• Silhouette Coefficient: the Silhouette Coefficient (SC) shows how
well an object fits within its cluster.
• It measures the quality of the clustering on a scale from −1 to 1.
A value near one (1) indicates that the point x is assigned to the
right cluster.
• There are two terms: cohesion and separation. Cohesion is the
intra-cluster distance, and separation is the distance between
clusters. A(x) is the average dissimilarity between x and all
other points of its cluster; B(x) is the minimum average
dissimilarity between x and the points of any other cluster.
A value near −1 indicates that the point should be assigned to
another cluster:
SC(x) = (B(x) − A(x)) / max(A(x), B(x))
Internal validation measures for K-means clustering are explored in:
“Exploring K-Means with Internal Validity Indexes for Data Clustering in Traffic Management System,” (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 3, 2017.
External Validation Indexes
Rand Index: The Rand index computes how similar the clusters
(returned by the clustering algorithm) are to the benchmark
classifications. It can be computed using the following formula:
RI = (TP + TN) / (TP + TN + FP + FN)
where TP = # of true positives, TN = # of true negatives,
FP = # of false positives, and FN = # of false negatives, all
counted over pairs of items. If the dataset size is N, then
TP + TN + FP + FN = N(N − 1)/2.
• One issue with the Rand index is that false positives and false
negatives are equally weighted. This may be an undesirable
characteristic for some clustering applications.
• The F-measure addresses this concern as does the chance-
corrected adjusted Rand index.
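Counting the four pair categories can be sketched as follows (a minimal sketch; the function name and label encoding are ours):

```python
from itertools import combinations

def rand_index(labels_pred, labels_true):
    """Rand index over all N(N-1)/2 sample pairs: a pair counts as
    TP if it shares a cluster in both labelings, TN if in neither."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            tp += 1
        elif not same_pred and not same_true:
            tn += 1
        elif same_pred and not same_true:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)
```

Note that the cluster ids themselves do not matter, only which pairs end up together: [0,0,1,1] and [1,1,0,0] describe the same clustering and score 1.0 against each other.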
External Validation Indexes
• F-measure: The F-measure can be used to balance the
contribution of false negatives by weighting recall through a
parameter β ≥ 0. Let precision and recall (both external
evaluation measures in themselves) be defined as follows:
P = TP / (TP + FP),   R = TP / (TP + FN)
We can calculate the F-measure by using the following formula:
Fβ = ((β² + 1) · P · R) / (β² · P + R)
When β = 0, F0 = P. In other words, recall has no impact on the
F-measure when β = 0, and increasing β allocates an increasing
amount of weight to recall in the final F-measure. β can vary
from 0 upward without bound; TN is not taken into account.
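The Fβ formula can be sketched from the pair counts (a minimal sketch; the function name is ours):

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F_beta = ((beta^2 + 1) * P * R) / (beta^2 * P + R).
    beta = 0 reduces to precision; larger beta weights recall more."""
    p = tp / (tp + fp)   # precision
    r = tp / (tp + fn)   # recall
    return ((beta ** 2 + 1) * p * r) / (beta ** 2 * p + r)
```

With TP = 6, FP = 2, FN = 6 (so P = 0.75, R = 0.5), β = 0 gives exactly the precision 0.75, while β = 1 gives the harmonic mean 0.6.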
External Validation Indexes
Jaccard index: The Jaccard index is used to quantify the similarity
between two datasets. The Jaccard index takes on a value between 0
and 1.
An index of 1 means that the two datasets are identical, and an
index of 0 indicates that the datasets have no common elements.
The following formula defines the Jaccard index:
J = TP / (TP + FP + FN)
This is simply the number of unique elements common to both sets
divided by the total number of unique elements in both sets. Note
that TN is not taken into account.
External Validation Indexes
Dice index: The Dice symmetric measure doubles the weight on TP
while still ignoring TN:
DSC = 2·TP / (2·TP + FP + FN)
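Both indexes can be sketched side by side from the same pair counts (a minimal sketch; the function names are ours):

```python
def jaccard(tp, fp, fn):
    # Common elements over all unique elements; TN is ignored.
    return tp / (tp + fp + fn)

def dice(tp, fp, fn):
    # Doubles the weight on TP; TN is still ignored.
    return 2 * tp / (2 * tp + fp + fn)
```

For any counts with FP + FN > 0, Dice exceeds Jaccard because agreement (TP) is counted twice in both numerator and denominator.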