K-means Clustering
Algorithm
Dr. Masroor Ahmed
(Professor)
Email: [email protected]
Capital University of Science and Technology (CUST),
Islamabad
Table of Contents
● What is Clustering?
● Types of Clustering
● What is K-Means Clustering?
● Objective of K-Means Clustering
● Properties of K-Means Clustering
● Applications of K-Means Clustering
● Advantages / Disadvantages of K-means
● Different Evaluation Metrics for Clustering
● How Does K-Means Clustering Work?
● K-Means Clustering Algorithm
● How to Choose the Value of K (Number of Clusters) in K-Means Clustering?
● Python Implementation of the K-Means Clustering Algorithm
● Challenges With K-Means Clustering Algorithm
INTRODUCTION
Every machine learning engineer wants to achieve accurate
predictions with their algorithms. Such learning algorithms are
generally divided into two types:
1. Supervised
2. Unsupervised
Comparison between Supervised and
Unsupervised Learning
What is Clustering?
● Clustering is like sorting a bunch of items into different
groups based on their characteristics, so that similar items end up together.
● In data mining and machine learning, it’s a powerful
technique used to group similar data points together,
making it easier to find patterns or understand large
datasets.
● Essentially, clustering helps identify natural groupings in
your data.
Inter-Class vs Intra-Class Similarity Clustering
Requirements of Clustering
The following are key requirements for clustering algorithms in data
mining:
● Scalability
● Ability to deal with different kinds of attributes
● Discovery of clusters with arbitrary shape
● Interpretability
● High dimensionality
TYPES OF CLUSTERING
Hierarchical Clustering
● Hierarchical clustering is an unsupervised machine
learning algorithm that organizes data into a tree-like
structure of nested clusters.
● Unlike flat clustering methods like k-means, hierarchical
clustering does not require specifying the number of
clusters in advance.
● It results in an attractive tree-based representation of the
observations, called a Dendrogram.
● It is widely used in data mining, pattern recognition, and
exploratory data analysis.
Types of Hierarchical
Clustering
Agglomerative Hierarchical
Clustering (AHC)
● It is a Bottom-Up Approach.
● Each data point starts as its own individual cluster.
● Pairs of the most similar clusters are merged iteratively.
● The process continues until all data points belong to
a single cluster (or until K clusters remain).
Bottom-Up Approach: Agglomerative Hierarchical Clustering (AHC)
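As a small illustration of the bottom-up merging process, the sketch below runs agglomerative clustering on a toy dataset with SciPy and draws the resulting dendrogram. The toy array X and the Ward linkage criterion are assumptions made for this example only.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.default_rng(0).random((20, 2))  # toy 2-D data (assumed for illustration)
Z = linkage(X, method='ward')   # bottom-up: repeatedly merge the two closest clusters
dendrogram(Z)                   # tree-like view of the nested merges
plt.title('Dendrogram (AHC)')
plt.show()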
Divisive Hierarchical
Clustering (DHC)
● It is a Top-Down Approach.
● All data points start in one large cluster.
● The cluster is recursively split into smaller
clusters based on differences.
● The process continues until each data point is its
own cluster or meets a stopping criterion.
Top-Down Approach: Divisive Hierarchical Clustering (DHC)
Applications of Hierarchical
Clustering
● Bioinformatics: Used to classify genes and proteins based on
sequence similarity.
● Customer Segmentation: Identifies groups of customers with
similar purchasing behaviors.
● Document Clustering: Groups similar texts or web pages for
information retrieval.
● Medical Imaging: Helps in classifying different disease patterns.
Partitioning Clustering
● Partitioning clustering is an unsupervised machine
learning technique that divides a dataset into a
predefined number of k clusters, where each data
point belongs to exactly one cluster.
● The goal is to minimize intra-cluster distances (so points
within a cluster are similar) and to maximize inter-cluster
distances (so clusters are clearly different from each other).
● Two common partitioning methods are K-Means clustering
and Fuzzy C-Means.
K-Means Clustering
In k-means clustering, the objects are divided into a number of
clusters specified by the number 'K'. So if we say K = 3, the
objects are divided into three clusters: c1, c2, and c3.
Fuzzy C-Means Clustering
● An unsupervised machine learning algorithm for clustering.
● Uses soft clustering, meaning each data point can belong
to multiple clusters, each with a degree of membership.
● More flexible than K-Means, which uses hard clustering.
● FCM is useful in complex datasets where clear boundaries
between clusters do not exist, such as in image
segmentation, medical diagnosis, and pattern recognition.
Hard Clustering vs Soft Clustering
K-Means Clustering
Introduction of K-Means
Clustering
● K-means clustering is a way of grouping data based on
how similar or close the data points are to each other.
● It is widely used in customer segmentation, image
processing, and anomaly detection.
● The algorithm aims to minimize the variance within
clusters by iteratively refining the cluster centroids.
● In k-means clustering, the clusters are distinct and
well-separated.
● Works well for large datasets due to its efficiency.
Objective of K-Means
Clustering
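Formally, given K clusters C_1, ..., C_K with centroids \mu_1, ..., \mu_K, K-means seeks the assignment of data points x_i to clusters that minimizes the within-cluster sum of squares (the quantity later referred to as WCSS, or inertia):

J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2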
Properties of K-Means Clustering
Similarity Within a Cluster
• One of the main things K Means aims for is that all the
data points in a cluster should be pretty similar to each
other.
Differences Between Clusters
• Another important aspect is that the clusters themselves
should be as distinct from each other as possible.
Similarity Within a Cluster
(Example)
● Imagine a bank that wants to group its customers based on
income and debt. If customers within the same cluster have
vastly different financial situations, then a one-size-fits-all
approach to offers might not work.
● For example, a customer with high income and high debt
might have different needs compared to someone with low
income and low debt.
● By making sure the customers in each cluster are similar, the
bank can create more tailored and effective strategies.
Differences Between
Clusters (Example)
● If one cluster consists of high-income, high-debt customers
and another cluster has high-income, low-debt customers,
the differences between the clusters are clear.
● This separation helps the bank create different strategies for
each group.
● If the clusters are too similar, it can be challenging to treat
them as separate segments, which can make targeted
marketing less effective.
Applications of K-Means
Clustering
Distance Measures
Image Segmentation
K-Means for Geyser Eruptions
Customer Segmentation
Document Clustering
Recommendation Engines
K-Means for Image Compression
Advantages of K-Means
Clustering
1. Simple and easy to implement: The k-means algorithm is
easy to understand and implement, making it a popular
choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and
can handle large datasets with high dimensionality.
3. Scalability: K-means can handle large datasets with many
data points and can be easily scaled to handle even larger
datasets.
4. Flexibility: K-means can be easily adapted to different
applications and can be used with different distance metrics.
Disadvantages of K-Means
Clustering
1. Sensitivity to initial centroids: K-means is sensitive to
the initial selection of centroids and can converge to a
suboptimal solution.
2. Requires specifying the number of clusters: The
number of clusters k needs to be specified before running
the algorithm, which can be challenging in some
applications.
3. Sensitive to outliers: K-means is sensitive to outliers,
which can have a significant impact on the resulting
clusters.
Different Evaluation Metrics
for Clustering
When it comes to evaluating how well your clustering
algorithm is working, there are a few key metrics
that can help you get a clearer picture of your
results.
1. Silhouette Analysis
2. Inertia
3. Dunn Index
Silhouette Analysis
● Silhouette analysis is like a report card for your clusters.
● It measures how well each data point fits into its own cluster
compared to other clusters.
● A high silhouette score means that your points are snugly
fitting into their clusters and are quite distinct from points in
other clusters.
● Imagine a score close to 1 as a sign that your clusters are
well-defined and separated.
● Conversely, a score close to 0 indicates some overlap, and a
negative score suggests that the clustering might need to be revised.
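A minimal sketch of computing the silhouette score with scikit-learn. The feature matrix X and the choice of K = 3 are assumptions made for illustration only.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X: your numeric feature matrix (assumed to exist)
kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(silhouette_score(X, labels))  # close to 1: well-defined, well-separated clusters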
Inertia
● Inertia is a bit like a gauge of how tightly packed your data
points are within each cluster.
● It calculates the sum of squared distances from each point to
the cluster's center (or centroid).
● Lower inertia means that points are closer to the centroid
and to each other, which generally indicates that your
clusters are well-formed.
● For most numeric data, you'll use Euclidean distance, but if
your data includes categorical features, Manhattan distance
might be better.
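A minimal sketch showing that scikit-learn's inertia_ attribute is exactly this sum of squared distances to the assigned centroids; X and the choice of K = 3 are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit(X)
print(kmeans.inertia_)  # sum of squared distances of samples to their nearest centroid

# Equivalent manual computation
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
print(np.sum((X - assigned_centroids) ** 2))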
Dunn Index
● The Dunn Index takes a broader view by considering both
the distance within and between clusters.
● It’s calculated as the ratio of the smallest distance between
any two clusters (inter-cluster distance) to the largest
distance within a cluster (intra-cluster distance).
● A higher Dunn Index means that clusters are not only tight
and cohesive internally but also well-separated from each
other.
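scikit-learn does not provide a Dunn Index function, so the sketch below computes it directly from the definition above. The helper dunn_index is hypothetical, and X (numeric samples) and labels (integer cluster assignments) are assumed inputs.

import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Largest intra-cluster distance (cluster diameter)
    max_intra = max(cdist(c, c).max() for c in clusters)
    # Smallest distance between points belonging to two different clusters
    min_inter = min(cdist(clusters[i], clusters[j]).min()
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters)))
    return min_inter / max_intra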
Methods to Determine the
Number of Clusters (K)
● Trial and Error Method
Start with an assumed K value (e.g., 3, 4, 5).
Adjust K iteratively until the best clusters are
formed.
● Elbow Method
Plots WCSS (Within-Cluster Sum of Squares) vs. K.
The "elbow point" in the graph helps determine
the optimal K.
How Does K-Means
Clustering Work?
Clustering Process in K-Means
● Select K and Initialize Centroids
1. Assign K centroids randomly in the dataset.
● Assign Data Points to Nearest Centroid
1. Calculate the distance of each point from all centroids.
2. Assign each point to the closest centroid.
● Compute New Centroids
1. Calculate the mean position of all points in each cluster.
2. Update centroid locations accordingly.
● Reassign Points Based on New Centroids
1. Recalculate distances of all points from the updated centroids.
2. If needed, reassign points to the nearest centroid.
● Check for Convergence
1. If centroids continue to move, repeat the process.
2. Once centroids stop moving, the algorithm converges and the final clusters are
formed.
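The following is a minimal from-scratch sketch of these steps in NumPy. The function name kmeans, the random-point initialization, and max_iters are illustrative assumptions; the scikit-learn KMeans class is used in the implementation section later.

import numpy as np

def kmeans(X, k, max_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: converged once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids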
Scenario: Finding the Optimum Number of Clusters for a Grocery Shop Dataset
Step 1: Using the Elbow Method
to Determine K
● Elbow Method is used to find the optimal number
of clusters (K).
● K-Means clustering is applied to the dataset
multiple times with different K values.
● Within-Cluster Sum of Squares (WCSS) is calculated
for each K.
Methods to Determine the
Number of Clusters (K)
● Choosing the Optimal K:
WCSS is plotted against different values of K.
The point where the WCSS stops decreasing dramatically is
the elbow point.
In this case, the optimal K could be 2, 3, or 4, as the WCSS
stabilizes beyond this range.
Step 2: Initializing Cluster
Centroids
● Randomly select initial
centroids (C1 and C2).
● These centroids act as
the starting points for
cluster formation.
Step 3: Assigning Data Points to
the Closest Centroid
● Calculate the distance between each delivery
location and the centroids.
● Assign each location
to the nearest centroid.
● This forms the initial
grouping of data points.
Step 4: Compute New Centroid
for the First Group
● Compute the mean position of all points in the first
cluster.
Step 5: Move the Random Centroid to the
New Centroid Position
● Adjust the centroid position based on the mean
of the assigned points.
Step 6: Compute New Centroid for
the Second Group
● Compute the mean position of all points in the
second cluster.
Step 7: Move the Random Centroid to the
New Centroid Position
● Adjust the centroid position for the second
cluster.
Step 8: Checking for
Convergence
● If centroids continue to
change position, repeat
Steps 3–7.
● Once centroids stop
moving, the K-Means
algorithm converges.
● The final clusters with
centroids C1 and C2 are
established.
K-Means Clustering Algorithm
K-Means clustering is an iterative algorithm that partitions a
dataset into K clusters by minimizing the variance within
each cluster.
Python Implementation of the K-Means
Clustering Algorithm
These are the steps you need to take:
● Data pre-processing
● Finding the optimal number of clusters using the elbow
method
● Training the K-Means algorithm on the training data set
● Visualizing the clusters
Data Pre-Processing
!pip install kagglehub
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
import kagglehub
import os

# Download the latest version of the dataset
dataset_path = kagglehub.dataset_download("aajay20/mall-customers-datacsv")

# Assuming the CSV file is named 'Mall_Customers.csv' - adjust if different
file_path = os.path.join(dataset_path, "Mall_Customers.csv")  # construct the full file path
dataset = pd.read_csv(file_path)  # use the file path to read the CSV

# Use the Annual Income and Spending Score columns as features
x = dataset.iloc[:, [3, 4]].values
Find the optimal number of
clusters using Elbow Method
from sklearn.cluster import KMeans

wcss_list = []  # initializing the list for the values of WCSS

# Using a for loop to try K values from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()
Train the K-means algorithm on the
training dataset
# Training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)
Visualize the Clusters
# Visualize the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')     # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')    # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')      # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')     # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')  # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Challenges With K-Means
Clustering
1. Choice of K (Number of Clusters): The algorithm requires a predefined
number of clusters, often determined using the Elbow Method.
2. Initialization Sensitivity: Poor initialization may lead to suboptimal
results.
3. Assumption of Spherical Clusters: Struggles with irregular or
overlapping clusters.
4. Outliers Impact: Susceptible to extreme values shifting centroids.
5. Scalability Issues: Computationally expensive for large datasets.
6. Difficulty Handling Categorical Data: Works poorly with non-numeric
features.
7. Unequal Cluster Sizes: Favors balanced, equally dense clusters.
8. Hard Assignments (Non-Probabilistic): Lacks probabilistic clustering
like GMM.