Unsupervised Learning – Part 1

Clustering in Machine Learning

Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It
can be defined as "a way of grouping the data points into different clusters consisting of
similar data points; objects with possible similarities remain in a group that has few or
no similarities with any other group."

It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color,
behavior, etc., and dividing the data according to the presence or absence of those patterns.

It is an unsupervised learning method, so no supervision is provided to the algorithm, and it deals
with an unlabelled dataset.

After the clustering technique is applied, each cluster or group is given a cluster-ID. An ML
system can use this ID to simplify the processing of large and complex datasets.

The clustering technique is commonly used for statistical data analysis.

Note: Clustering is similar in spirit to a classification algorithm, but the difference is the type of
dataset we use. In classification we work with a labelled dataset, whereas in clustering we
work with an unlabelled dataset.

Example: Let's understand the clustering technique with a real-world example of a mall. When we
visit a shopping mall, we can observe that things with similar uses are grouped together: t-shirts
are grouped in one section and trousers in another, and in the produce section apples, bananas,
mangoes, etc., are kept in separate areas so that we can easily find things. The clustering
technique works the same way. Another example of clustering is grouping documents by topic.

The clustering technique can be used in a wide variety of tasks. Some of the most common uses of this
technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

The diagram below illustrates the working of a clustering algorithm: the different fruits are
divided into several groups with similar properties.

Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (each data point belongs to exactly one
group) and Soft clustering (a data point can belong to more than one group, with some degree of
membership); a short sketch contrasting the two appears after the list below. Various other
approaches to clustering also exist. Below are the main clustering methods used in Machine
learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
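
As a quick illustration of the hard/soft distinction, here is a minimal sketch (the toy data is made up for illustration) contrasting scikit-learn's KMeans, which produces hard labels, with GaussianMixture, a distribution model-based method whose predict_proba gives soft memberships:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# toy 2-D data: two loose groups plus one in-between point
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [3.0, 3.0]])

# hard clustering: each point gets exactly one label
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

# soft clustering: each point gets a membership probability per cluster
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict_proba(X).round(2))  # each row sums to 1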

Among these, hierarchical clustering deserves a closer look. Basically, there are two types of hierarchical cluster analysis strategies –


1. Agglomerative Clustering: Also known as the bottom-up approach or hierarchical
agglomerative clustering (HAC). It produces a structure that is more informative than the
unstructured set of clusters returned by flat clustering, and it does not require us to
prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton
cluster at the outset and then successively merge pairs of clusters until all of them have been
merged into a single cluster that contains all the data.
Algorithm:
given a dataset (d1, d2, d3, ....dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about
    # the primary diagonal, we compute only the lower
    # part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains

Python implementation of the above algorithm using the scikit-learn library:

from sklearn.cluster import AgglomerativeClustering
import numpy as np

# randomly chosen dataset
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# here we need to specify the number of clusters;
# otherwise the result will be a single cluster
# containing all the data
clustering = AgglomerativeClustering(n_clusters=2).fit(X)

# print the class labels
print(clustering.labels_)

Output:
[1 1 1 0 0 0]
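
To inspect the full merge hierarchy rather than a single flat cut, SciPy's linkage function can be used; a small sketch on the same data (assuming SciPy is installed):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# one row per merge: the two clusters merged, the merge distance,
# and the size of the newly formed cluster
Z = linkage(X, method='ward')
print(Z)

# cut the hierarchy into 2 flat clusters; this recovers the same
# grouping as the scikit-learn result above (up to label names)
print(fcluster(Z, t=2, criterion='maxclust'))
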
2. Divisive clustering: Also known as the top-down approach. This algorithm also does not
require us to prespecify the number of clusters. Top-down clustering requires a method for
splitting a cluster: it starts with the whole data in a single cluster and proceeds by splitting
clusters recursively until each individual data point sits in its own singleton cluster.
Algorithm:
given a dataset (d1, d2, d3, ....dN) of size N
at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster
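
The source gives no implementation for the divisive case, so here is a minimal sketch of the idea: K-Means with K=2 serves as the flat "subroutine", and the largest cluster is split at each step (one simple stand-in for "choose the best cluster"). The function divisive_clustering is illustrative, not a library API:

import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, k):
    # start with all points in a single cluster (stored as index arrays)
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        # pick the largest cluster to split (a simple heuristic)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # split it with a flat method (2-means) as the "subroutine"
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[sub == 0])
        clusters.append(members[sub == 1])
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
print(divisive_clustering(X, 2))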

Hierarchical Agglomerative vs Divisive clustering –


a. Divisive clustering is more complex than agglomerative clustering: in the divisive case we
need a flat clustering method as a "subroutine" to split each cluster until every data point
has its own singleton cluster.
b. Divisive clustering is more efficient if we do not generate a complete hierarchy all the way
down to individual data leaves. The time complexity of naive agglomerative clustering
is O(n³), because we exhaustively scan the N x N matrix dist_mat for the lowest distance in
each of the N-1 iterations. Using a priority queue we can reduce this complexity
to O(n² log n), and with some further optimizations it can be brought down to O(n²).
For divisive clustering, by contrast, given a fixed number of top levels and an efficient
flat algorithm like K-Means, the running time is linear in the number of patterns and
clusters.
c. A divisive algorithm is also more accurate. Agglomerative clustering makes its decisions by
considering local patterns or neighboring points, without initially taking the global
distribution of the data into account, and these early decisions cannot be undone. Divisive
clustering, on the other hand, takes the global distribution of the data into consideration
when making top-level partitioning decisions.

K-Means Clustering Algorithm
K-Means clustering is an unsupervised learning algorithm used to solve clustering problems in
machine learning and data science. In this topic we will learn what the K-means clustering
algorithm is and how it works, along with a Python implementation of K-means clustering.

What is the K-Means Algorithm?

K-Means clustering is an unsupervised learning algorithm that groups an unlabelled dataset into
different clusters. Here K defines the number of predefined clusters to be created in the
process: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabelled dataset into k different clusters in such
a way that each data point belongs to only one group, whose members share similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabelled dataset on its own, without the need for any training.

It is a centroid-based algorithm, in which each cluster is associated with a centroid. The main aim
of the algorithm is to minimize the sum of distances between each data point and the centroid of
its corresponding cluster.
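
Written out (the standard formulation, stated here for reference), the quantity K-means minimizes is the within-cluster sum of squared distances, often called the inertia:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k is the set of points assigned to cluster k and \mu_k is that cluster's centroid (the mean of its points).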

The algorithm takes the unlabelled dataset as input, divides it into k clusters, and repeats the
process until the assignments stop changing, i.e., until it has found the best clusters. The value
of k must be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best values for the K center points or centroids through an iterative process.
o Assigns each data point to its closest k-center. The data points nearest to a particular
k-center together form a cluster.

Hence each cluster contains data points with some commonalities and lies away from the other clusters.

The diagram below illustrates the working of the K-means clustering algorithm.

How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the following steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as initial centroids (these need not be points from the input dataset).

Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.

Step-4: Compute the center of gravity (mean) of each cluster and place its new centroid there.

Step-5: Repeat step 3, i.e., reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurred, go back to step 4; otherwise go to FINISH.

Step-7: The model is ready.
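
The steps above map directly onto code. Below is a minimal from-scratch sketch in NumPy, written only to make the loop concrete (empty clusters are not handled; in practice you would use a library implementation such as scikit-learn's KMeans):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points from the dataset as initial centroids
    # (the text notes they need not come from the dataset; this is simplest)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Steps 3/5: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when no centroid moves, i.e. no reassignment occurs
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)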

Let's understand the above steps by considering the visual plots:

Suppose we have two variables, M1 and M2. The x-y scatter plot of these two variables is given
below:

o Let's take the number of clusters k, i.e., K=2, to identify the dataset and put the points into
different clusters. It means we will try to group this dataset into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be
either points from the dataset or any other points. So here we select the two points below as
k points, which are not part of our dataset. Consider the image below:

o Now we assign each data point of the scatter plot to its closest k-point or centroid. We
compute this by applying the familiar mathematics for the distance between two points, and
then draw a median line between the two centroids (their perpendicular bisector). Consider
the image below:

From the image it is clear that points on the left side of the line are nearer to the K1 or blue
centroid, and points to the right of the line are closer to the yellow centroid. Let's color them
blue and yellow for clear visualization.

o As we need to find the closest cluster, we repeat the process by choosing new centroids. To
choose the new centroids, we compute the center of gravity of each cluster's points and
place the new centroids there, as shown below:
o Next, we reassign each data point to its new closest centroid. For this we repeat the same
process of finding a median line, which will now look like the image below:

From this image we can see that one yellow point is on the left side of the line and two blue
points are to the right of it. So these three points will be assigned to new centroids.

As reassignment has taken place, we again go to step 4, which is finding new centroids or
k-points.

We repeat the process by finding the center of gravity of each cluster, so the new centroids will
be as shown in the image below:
o As we have the new centroids, we again draw the median line and reassign the data points,
giving the image below:
o We can see in this image that there are no dissimilar data points on either side of the line,
which means our model has converged. Consider the image below:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will
be as shown in the image below:
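
The entire walkthrough above collapses into a few lines with scikit-learn. A minimal usage sketch (the two-variable data is invented to stand in for the plotted M1/M2 points):

import numpy as np
from sklearn.cluster import KMeans

# made-up stand-in for the M1/M2 scatter data
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 0.9],
              [5.0, 8.0], [5.5, 7.5], [6.0, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # final cluster assignment per point
print(km.cluster_centers_)  # final centroids (the "center of gravity" points)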
