K-means Clustering
Algorithm
Dr. Masroor Ahmed
(Professor)
Email: [email protected]
Capital University of Science and Technology (CUST),
Islamabad
Table of Contents
● What is Clustering?
● Types of Clustering
● What is K-Means Clustering?
● Objective of K-Means Clustering
● Properties of K-Means Clustering
● Applications of K-Means Clustering
● Advantages / Disadvantages of K-means
● Different Evaluation Metrics for Clustering
● How Does K-Means Clustering Work?
● K-Means Clustering Algorithm
● How to Choose the Value of K (Number of Clusters) in K-Means Clustering?
● Python Implementation of the K-Means Clustering Algorithm
● Challenges With K-Means Clustering Algorithm
INTRODUCTION
Every machine learning engineer wants to achieve accurate
predictions with their algorithms. Such learning algorithms are
generally divided into two types:
1. Supervised
2. Unsupervised
Comparison between Supervised and
Unsupervised Learning
What is Clustering?
● Clustering is like sorting a bunch of items into different
groups based on their characteristics, so that similar items end up together.
● In data mining and machine learning, it’s a powerful
technique used to group similar data points together,
making it easier to find patterns or understand large
datasets.
● Essentially, clustering helps identify natural groupings in
your data.
Inter-Class vs Intra-Class Similarity Clustering
Requirements of Clustering
The following are key requirements for clustering algorithms in data
mining:
● Scalability
● Ability to deal with different kinds of attributes
● Discovery of clusters with arbitrary shape
● Interpretability
● High dimensionality
TYPES OF CLUSTERING
Hierarchical Clustering
● Hierarchical clustering is an unsupervised machine
learning algorithm that organizes data into a tree-like
structure of nested clusters.
● Unlike flat clustering methods like k-means, hierarchical
clustering does not require specifying the number of
clusters in advance.
● It results in an attractive tree-based representation of the
observations, called a Dendrogram.
● It is widely used in data mining, pattern recognition, and
exploratory data analysis.
Types of Hierarchical
Clustering
Agglomerative Hierarchical
Clustering (AHC)
● It is a Bottom-Up Approach.
● Each data point starts as its own individual cluster.
● Pairs of the most similar clusters are merged iteratively.
● The process continues until all data points belong to
a single cluster (or until K clusters remain).
Bottom-Up Approach: Agglomerative Hierarchical Clustering (AHC)
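As a small illustration of the bottom-up merging process, the sketch below runs agglomerative clustering on a toy dataset with SciPy and draws the resulting dendrogram. The toy array X and the Ward linkage criterion are assumptions made for this example only.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.default_rng(0).random((20, 2))  # toy 2-D data (assumed for illustration)
Z = linkage(X, method='ward')   # bottom-up: repeatedly merge the two closest clusters
dendrogram(Z)                   # tree-like view of the nested merges
plt.title('Dendrogram (AHC)')
plt.show()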
Divisive Hierarchical
Clustering (DHC)
● It is a Top-Down Approach.
● All data points start in one large cluster.
● The cluster is recursively split into smaller
clusters based on differences.
● The process continues until each data point is its
own cluster or meets a stopping criterion.
Top-Down Approach: Divisive Hierarchical Clustering (DHC)
Applications of Hierarchical
Clustering
● Bioinformatics: Used to classify genes and proteins based on
sequence similarity.
● Customer Segmentation: Identifies groups of customers with
similar purchasing behaviors.
● Document Clustering: Groups similar texts or web pages for
information retrieval.
● Medical Imaging: Helps in classifying different disease patterns.
Partitioning Clustering
● Partitioning clustering is an unsupervised machine
learning technique that divides a dataset into a
predefined number of k clusters, where each data
point belongs to exactly one cluster.
● The goal is to minimize intra-cluster distances (so points
within a cluster are similar) and to maximize inter-cluster
distances (so clusters are clearly different from each other).
● Two common partitioning methods are K-Means clustering
and Fuzzy C-Means.
K-Means Clustering
In k-means clustering, the objects are divided into a number of
clusters specified by the number 'K'. So if we say K = 3, the
objects are divided into three clusters: c1, c2, and c3.
Fuzzy C-Means Clustering
● An unsupervised machine learning algorithm for clustering.
● Uses soft clustering, meaning each data point can belong
to multiple clusters, each with a degree of membership.
● More flexible than K-Means, which uses hard clustering.
● FCM is useful in complex datasets where clear boundaries
between clusters do not exist, such as in image
segmentation, medical diagnosis, and pattern recognition.
Hard Clustering vs Soft Clustering
K-Means Clustering
Introduction of K-Means
Clustering
● K-means clustering is a way of grouping data based on
how similar or close the data points are to each other.
● It is widely used in customer segmentation, image
processing, and anomaly detection.
● The algorithm aims to minimize the variance within
clusters by iteratively refining the cluster centroids.
● In k-means clustering, the clusters are distinct and
well-separated.
● Works well for large datasets due to its efficiency.
Objective of K-Means
Clustering
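Formally, given K clusters C_1, ..., C_K with centroids \mu_1, ..., \mu_K, K-means seeks the assignment of data points x_i to clusters that minimizes the within-cluster sum of squares (the quantity later referred to as WCSS, or inertia):

J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2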
Properties of K-Means Clustering
Similarity Within a Cluster
• One of the main things K Means aims for is that all the
data points in a cluster should be pretty similar to each
other.
Differences Between Clusters
• Another important aspect is that the clusters themselves
should be as distinct from each other as possible.
Similarity Within a Cluster
(Example)
● Imagine a bank that wants to group its customers based on
income and debt. If customers within the same cluster have
vastly different financial situations, then a one-size-fits-all
approach to offers might not work.
● For example, a customer with high income and high debt
might have different needs compared to someone with low
income and low debt.
● By making sure the customers in each cluster are similar, the
bank can create more tailored and effective strategies.
Differences Between
Clusters (Example)
● If one cluster consists of high-income, high-debt customers
and another cluster has high-income, low-debt customers,
the differences between the clusters are clear.
● This separation helps the bank create different strategies for
each group.
● If the clusters are too similar, it can be challenging to treat
them as separate segments, which can make targeted
marketing less effective.
Applications of K-Means
Clustering
Distance Measures
Image Segmentation
K-Means for Geyser Eruptions
Customer Segmentation
Document Clustering
Recommendation Engines
K-Means for Image Compression
Advantages of K-Means
Clustering
1. Simple and easy to implement: The k-means algorithm is
easy to understand and implement, making it a popular
choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and
can handle large datasets with high dimensionality.
3. Scalability: K-means can handle large datasets with many
data points and can be easily scaled to handle even larger
datasets.
4. Flexibility: K-means can be easily adapted to different
applications and can be used with different distance metrics.
Disadvantages of K-Means
Clustering
1. Sensitivity to initial centroids: K-means is sensitive to
the initial selection of centroids and can converge to a
suboptimal solution.
2. Requires specifying the number of clusters: The
number of clusters k needs to be specified before running
the algorithm, which can be challenging in some
applications.
3. Sensitive to outliers: K-means is sensitive to outliers,
which can have a significant impact on the resulting
clusters.
Different Evaluation Metrics
for Clustering
When it comes to evaluating how well your clustering
algorithm is working, there are a few key metrics
that can help you get a clearer picture of your
results.
1. Silhouette Analysis
2. Inertia
3. Dunn Index
Silhouette Analysis
● Silhouette analysis is like a report card for your clusters.
● It measures how well each data point fits into its own cluster
compared to other clusters.
● A high silhouette score means that your points are snugly
fitting into their clusters and are quite distinct from points in
other clusters.
● Imagine a score close to 1 as a sign that your clusters are
well-defined and separated.
● Conversely, a score close to 0 indicates some overlap, and a
negative score suggests that the clustering might need to be revised.
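A minimal sketch of computing the silhouette score with scikit-learn. The feature matrix X and the choice of K = 3 are assumptions made for illustration only.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X: your numeric feature matrix (assumed to exist)
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(silhouette_score(X, labels))  # close to 1: well-defined, well-separated clusters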
Inertia
● Inertia is a bit like a gauge of how tightly packed your data
points are within each cluster.
● It calculates the sum of squared distances from each point to
the cluster's center (or centroid).
● Lower inertia means that points are closer to the centroid
and to each other, which generally indicates that your
clusters are well-formed.
● For most numeric data, you'll use Euclidean distance, but if
your data includes categorical features, Manhattan distance
might be better.
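A minimal sketch showing that scikit-learn's inertia_ attribute is exactly this sum of squared distances to the assigned centroids; X and the choice of K = 3 are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit(X)
print(kmeans.inertia_)  # sum of squared distances of samples to their nearest centroid

# Equivalent manual computation
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
print(np.sum((X - assigned_centroids) ** 2))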
Dunn Index
● The Dunn Index takes a broader view by considering both
the distance within and between clusters.
● It’s calculated as the ratio of the smallest distance between
any two clusters (inter-cluster distance) to the largest
distance within a cluster (intra-cluster distance).
● A higher Dunn Index means that clusters are not only tight
and cohesive internally but also well-separated from each
other.
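scikit-learn does not provide a Dunn Index function, so the sketch below computes it directly from the definition above. The helper dunn_index is hypothetical, and X (numeric samples) and labels (integer cluster assignments) are assumed inputs.

import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest intra-cluster distance (cluster diameter)
    max_intra = max(cdist(c, c).max() for c in clusters)
    # Smallest distance between points belonging to two different clusters
    min_inter = min(cdist(clusters[i], clusters[j]).min()
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters)))
    return min_inter / max_intra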
Methods to Determine the
Number of Clusters (K)
● Trial and Error Method
Start with an assumed K value (e.g., 3, 4, 5).
Adjust K iteratively until the best clusters are
formed.
● Elbow Method
Plots WCSS (Within-Cluster Sum of Squares) vs. K.
The "elbow point" in the graph helps determine
the optimal K.
How Does K-Means
Clustering Work?
Clustering Process in K-Means
● Select K and Initialize Centroids
1. Assign K centroids randomly in the dataset.
● Assign Data Points to Nearest Centroid
1. Calculate the distance of each point from all centroids.
2. Assign each point to the closest centroid.
● Compute New Centroids
1. Calculate the mean position of all points in each cluster.
2. Update centroid locations accordingly.
● Reassign Points Based on New Centroids
1. Recalculate distances of all points from the updated centroids.
2. If needed, reassign points to the nearest centroid.
● Check for Convergence
1. If centroids continue to move, repeat the process.
2. Once centroids stop moving, the algorithm converges and the final clusters are
formed.
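The following is a minimal from-scratch sketch of these steps in NumPy. The function name kmeans, the random-point initialization, and max_iters are illustrative assumptions; the scikit-learn KMeans class is used in the implementation section later.

import numpy as np

def kmeans(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: converged once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids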
Scenario: Finding the Optimum Number of Clusters for a Grocery Shop Dataset
Step 1: Using the Elbow Method
to Determine K
● Elbow Method is used to find the optimal number
of clusters (K).
● K-Means clustering is applied to the dataset
multiple times with different K values.
● Within-Cluster Sum of Squares (WCSS) is calculated
for each K.
Methods to Determine the
Number of Clusters (K)
● Choosing the Optimal K:
WCSS is plotted against different values of K.
The point where the WCSS stops decreasing dramatically is
the elbow point.
In this case, the optimal K could be 2, 3, or 4, as the WCSS
stabilizes beyond this range.
Step 2: Initializing Cluster
Centroids
● Randomly select initial
centroids (C1 and C2).
● These centroids act as
the starting points for
cluster formation.
Step 3: Assigning Data Points to
the Closest Centroid
● Calculate the distance between each delivery
location and the centroids.
● Assign each location
to the nearest centroid.
● This forms the initial
grouping of data points.
Step 4: Compute New Centroid
for the First Group
● Compute the mean position of all points in the first
cluster.
Step 5: Move the Random Centroid to the
New Centroid Position
● Adjust the centroid position based on the mean
of the assigned points.
Step 6: Compute New Centroid for
the Second Group
● Compute the mean position of all points in the
second cluster.
Step 7: Move the Random Centroid to the
New Centroid Position
● Adjust the centroid position for the second
cluster.
Step 8: Checking for
Convergence
● If centroids continue to
change position, repeat
Steps 3–7.
● Once centroids stop
moving, the K-Means
algorithm converges.
● The final clusters with
centroids C1 and C2 are
established.
K-Means Clustering Algorithm
K-Means clustering is an iterative algorithm that partitions a
dataset into K clusters by minimizing the variance within
each cluster.
Python Implementation of the K-Means
Clustering Algorithm
These are the steps you need to take:
● Data pre-processing
● Finding the optimal number of clusters using the elbow
method
● Training the K-Means algorithm on the training data set
● Visualizing the clusters
Data Pre-Processing
!pip install kagglehub
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
import kagglehub
import os

# Download the latest version of the dataset
dataset_path = kagglehub.dataset_download("aajay20/mall-customers-datacsv")

# Assuming the CSV file is named 'Mall_Customers.csv' - adjust if different
file_path = os.path.join(dataset_path, "Mall_Customers.csv")  # construct the full file path
dataset = pd.read_csv(file_path)  # use the file path to read the CSV

# Use the Annual Income and Spending Score columns as features
x = dataset.iloc[:, [3, 4]].values
Find the optimal number of
clusters using Elbow Method
from sklearn.cluster import KMeans

wcss_list = []  # initializing the list for the values of WCSS

# Using a for loop to try K values from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()
Train the K-means algorithm on the
training dataset
# Training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
Visualize the Clusters
# Visualize the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')     # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')    # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')      # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')     # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')  # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Challenges With K-Means
Clustering
1. Choice of K (Number of Clusters): The algorithm requires a predefined
number of clusters, often determined using the Elbow Method.
2. Initialization Sensitivity: Poor initialization may lead to suboptimal
results.
3. Assumption of Spherical Clusters: Struggles with irregular or
overlapping clusters.
4. Outliers Impact: Susceptible to extreme values shifting centroids.
5. Scalability Issues: Computationally expensive for large datasets.
6. Difficulty Handling Categorical Data: Works poorly with non-numeric
features.
7. Unequal Cluster Sizes: Favors balanced, equally dense clusters.
8. Hard Assignments (Non-Probabilistic): Lacks probabilistic clustering
like GMM.