Introduction
If a dataset does not have any labels associated with it, unsupervised algorithms are used to find some structure in it. These
structures can be different types of data patterns or groups of data. K-Means clustering is the most commonly used unsupervised
learning algorithm for finding groups in unlabeled data. Here K represents the number of groups or clusters, and the process of
creating these groups is known as 'clustering', hence the name K-Means clustering.
Uses
Search engines: a search engine groups similar results together using a clustering algorithm.
Customer segmentation: K-Means clustering can be used to create customer clusters based on demographic information,
geographical information and behavioral data.
Social network analysis: to find groups of people with specific interests and target personalized ads at them.
Data centers: to organize the computer clusters in a data center.
Inventory management: to create inventory clusters based on sales numbers and manufacturing capacity.
K-Means is often referred to as Lloyd's algorithm. It is one of the most popular clustering algorithms. Refer to the plot below, where there
are two clusters (K = 2): one of red data points and another of green data points.
So how does the K-Means algorithm find clusters in the data points without any labels? The steps below explain the inner working
of the K-Means algorithm; a minimal NumPy sketch of these steps follows the list.
The first step is to decide the number of clusters you want to identify in your data. This is the "K" in K-Means clustering.
Now randomly initialize a number of points equal to the number of clusters K. 'Cluster centroid' is the terminology used to refer
to these points.
Note that a centroid is the center point of a given set of data. Initially these points are at random locations, but by the time
the K-Means algorithm converges they will be at the center of their respective clusters.
Once the cluster centroids are defined, the K-Means algorithm goes through each data point in the given data and, depending on
that point's closeness to the cluster centroids, assigns it to the nearest centroid. This is called the 'Assignment
Step'.
In order to move the cluster centroids from their random locations towards their respective groups, the K-Means algorithm computes the mean
of the data points assigned to each cluster centroid and moves that centroid to the mean location. This is
called the 'Move Centroid Step'.
Note that after the 'Move Centroid Step', data points can get reassigned from one cluster to another because the centroid positions
have changed.
Now repeat the assignment and move centroid steps until the cluster centroid positions no longer change. The K-Means algorithm has
converged when the cluster centroid positions stop changing.
Once the K-Means algorithm has converged, the data points assigned to each centroid represent its cluster.
If during the cluster assignment step we find a centroid with no data points associated with it, it is better to remove
it.
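As a rough sketch of how these two steps can be coded, the function below implements them with plain NumPy. The function name simple_k_means, the choice of initial centroids from the data points, and the convergence check are illustrative assumptions, not the exact implementation used by libraries such as scikit-learn.

import numpy as np

def simple_k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k data points as the initial cluster centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment Step: assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move Centroid Step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Converged: centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

For example, labels, centers = simple_k_means(data, k=2) would reproduce the two-cluster picture described above for any 2D NumPy array of points called data.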
Since we pick the cluster centroids randomly, their initialization may affect the final outcome of the clustering. If the
initialization is poor, the K-Means algorithm may form a cluster with only a few points. Such a situation is referred to as the
'centroid random initialization trap', and it may cause the algorithm to get stuck at a local optimum.
Check the plots below, where for the same dataset we end up with different clusters depending on the initial positions of the cluster
centroids. Gray squares represent the initial positions of the centroids, and the red, green and blue squares represent their final
positions.
Random Initialization Guidelines
To avoid the random initialization trap, follow the guidelines below for random initialization; a code sketch of the idea follows these guidelines.
The multiple random initializations technique (run K-Means several times with different initial centroids and keep the best clustering) is more effective when we have a small number of clusters.
Similarly, for a large number of clusters, a few random initializations are usually sufficient.
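A minimal sketch of the multiple random initializations idea with scikit-learn is shown below: we run K-Means several times with different seeds and keep the run with the lowest clustering score. The demo data, the number of runs and the cluster count are illustrative choices.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))   # illustrative 2D data; use your own feature matrix in practice

best_km, best_score = None, np.inf
for seed in range(10):               # 10 random initializations
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X_demo)
    if km.inertia_ < best_score:     # keep the run with the lowest clustering score
        best_km, best_score = km, km.inertia_

scikit-learn can also do this internally: passing n_init=10 to KMeans runs 10 initializations and keeps the best one.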
So using multiple random initializations we can avoid the local optima issue, but to choose how many clusters to look for in a given dataset
we can use the methods below.
Visualization
Finding the number of clusters manually through data visualization is one of the most common methods. Domain knowledge and a
proper understanding of the given data also help in making more informed decisions. Since this is a manual exercise there is always
scope for ambiguous observations; in such cases we can also use the 'Elbow Method'.
Elbow Method
In the Elbow Method we run the K-Means algorithm multiple times in a loop with an increasing number of clusters (say from
1 to 10) and then plot the clustering score as a function of the number of clusters. The clustering score is simply the sum of
squared distances of the samples to their closest cluster center. The elbow is the point on the plot where the decrease in clustering score (distortion)
slows down, and the number of clusters at that point gives us the optimum number of clusters. Sometimes we do not get a
clear elbow point on the plot; in such cases it is very hard to finalize the number of clusters.
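In symbols, if $\mu_{c(i)}$ denotes the cluster center closest to sample $x_i$, the clustering score (distortion, exposed by scikit-learn as inertia_) over $m$ samples is

$$J = \sum_{i=1}^{m} \lVert x_i - \mu_{c(i)} \rVert^2$$

and the elbow is the value of K beyond which $J$ stops dropping sharply.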
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
df = pd.read_csv('/content/Mall_Customers.csv')
print("Shape of the data= ", df.shape)
df.head()
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
There are a total of 200 training examples without any labels to indicate which customer belongs to which group.
We are going to use annual income and spending score to find the clusters in the data.
Note that the spending score runs from 1 to 100 and is assigned by the mall based on customer behavior and spending nature.
plt.figure(figsize=(10,6))
plt.scatter(df['Annual Income (k$)'],df['Spending Score (1-100)'])
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Unlabelled Mall Customer Data')
# Since we are going to use the Annual Income and Spending Score columns only, let's create a 2D array of these columns for further processing
X = df.iloc[:, [3,4]].values
X[:5] # Show first 5 records only
array([[15, 39],
[15, 81],
[16, 6],
[16, 77],
[17, 40]])
By visual inspection of the above scatter plot we can identify 5 possible clusters. But since there is no other information available,
it is very difficult to say this with 100% confidence, so let's try to verify it with the Elbow Method.
Elbow Method
Using the Elbow Method to find the optimal number of clusters, let's try cluster counts from 1 to 10. We will use the 'random'
initialization method for this study. Note that scikit-learn's K-Means implementation also has a 'k-means++' initialization method, which selects
initial cluster centers for K-Means clustering in a smart way to speed up convergence.
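For reference, switching to that smarter initialization is just a parameter change; this line is illustrative, and the cells below keep init = 'random':

kmeans_pp = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)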
clustering_score = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'random', random_state = 42)
    kmeans.fit(X)
    clustering_score.append(kmeans.inertia_)  # inertia_ = sum of squared distances of samples to their closest cluster center
plt.figure(figsize=(10,6))
plt.plot(range(1, 11), clustering_score)
plt.scatter(5,clustering_score[4], s = 200, c = 'red', marker='*')
plt.title('The Elbow Method')
plt.xlabel('No. of Clusters')
plt.ylabel('Clustering Score')
plt.show()
From the above elbow plot it is clear that the decrease in clustering score slows down after 5 clusters, so we can use K = 5 for further
analysis.
Next we fit K-Means with K = 5, compute the cluster centers and predict a cluster index for each sample. Since K = 5, every
data point in our dataset gets a cluster index from 0 to 4.
# Fit K-Means with the chosen K = 5, compute the cluster centers and predict the cluster index for each sample
kmeans = KMeans(n_clusters = 5, init = 'random', random_state = 42)
pred = kmeans.fit_predict(X)
pred
array([4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 0,
4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1, 0, 1, 3, 1, 3, 1,
0, 1, 3, 1, 3, 1, 3, 1, 3, 1, 0, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1], dtype=int32)
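The predicted label can be attached to the original dataframe as a new column to produce the table below (the column name 'cluster' is an illustrative choice):

df['cluster'] = pred
df.head()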
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  cluster
0           1    Male   19                  15                      39        4
1           2    Male   21                  15                      81        2
2           3  Female   20                  16                       6        4
3           4  Female   23                  16                      77        2
4           5  Female   31                  17                      40        4
Visualization
plt.figure(figsize=(10,6))
plt.scatter(X[pred == 0, 0], X[pred == 0, 1], c = 'brown', label = 'Cluster 0')
plt.scatter(X[pred == 1, 0], X[pred == 1, 1], c = 'green', label = 'Cluster 1')
plt.scatter(X[pred == 2, 0], X[pred == 2, 1], c = 'blue', label = 'Cluster 2')
plt.scatter(X[pred == 3, 0], X[pred == 3, 1], c = 'purple', label = 'Cluster 3')
plt.scatter(X[pred == 4, 0], X[pred == 4, 1], c = 'orange', label = 'Cluster 4')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.title('Customer Clusters')
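To also mark the final centroid positions on this plot (like the colored squares described earlier), the fitted model exposes them as kmeans.cluster_centers_; adding the line below to the cell above, before plt.legend(), overlays them as black squares:

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 200, c = 'black', marker = 's', label = 'Centroids')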
Inner Working
Using the code below we can visualize the inner workings of the K-Means algorithm.
To start with, we define random centroids. The first plot shows these initial centroids together with the original data, before any
clusters are formed.
In step 1 we run the K-Means algorithm for only one iteration and plot the new centroid positions. Notice how the centroid
positions change and clusters start to form around them.
In step 2 we run the K-Means algorithm for two iterations. Notice how data points are reassigned from one cluster to
another as the centroid positions change.
Similarly, at the end we run the K-Means algorithm for six iterations, where we get the final locations of the centroids and their associated
clusters.
I/P:
* centroid_history: 2D array of centroids. Each row represents one centroid's coordinates.
If there are 5 clusters, the first set contains the initial centroid coordinates
(i.e. the first 5 rows), and then the K-Means loop keeps appending a new set of centroid coordinates for each iteration.
* n_clusters: Total number of clusters to find.
* centroid_sets: At the start we set random values as our first centroid set, and the K-Means loop keeps adding
new centroid sets to centroid_history. Since we are plotting the path of the centroid locations, the centroid_sets value
is the K-Means loop iteration number plus 1 for the initial centroid set.
So its value runs from 2 up to the K-Means loop's max iterations plus 1.
* cluster_color: Just to keep the line and cluster colors the same.
# Collect this centroid's y-coordinates across all centroid sets
c_y = np.append(c_y, centroid_history[:,1][i + cluster_index])
cluster_index = cluster_index + n_clusters
# If there are 5 clusters then the first set contains the initial centroid coordinates
# and then the K-Means loop will keep appending new centroid coordinates for each iteration
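Based on the input description above, a minimal sketch of such a centroid-path plotting helper could look like the function below; the name plot_centroid_paths and its body are illustrative assumptions built around the two lines shown above.

def plot_centroid_paths(centroid_history, n_clusters, centroid_sets, cluster_color):
    for i in range(n_clusters):
        # Collect this centroid's coordinates across all centroid sets
        c_x, c_y = np.array([]), np.array([])
        cluster_index = 0
        for _ in range(centroid_sets):
            c_x = np.append(c_x, centroid_history[:, 0][i + cluster_index])
            c_y = np.append(c_y, centroid_history[:, 1][i + cluster_index])
            cluster_index = cluster_index + n_clusters
        # Draw the path the centroid followed from its random starting point
        plt.plot(c_x, c_y, c = cluster_color[i], linestyle = '--')
        plt.scatter(c_x, c_y, c = cluster_color[i], marker = 's', s = 100)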
# Run K-Means for an increasing number of iterations (1 to 6) and plot the cluster
# assignments after each run to visualize the assignment and move-centroid steps.
# (n_init = 1 and max_iter = i are assumptions used to re-run K-Means for a fixed number of iterations.)
for i in range(1, 7):
    kmeans = KMeans(n_clusters = 5, init = 'random', n_init = 1, max_iter = i, random_state = 42)
    # Compute cluster centers and predict cluster index for each sample
    pred = kmeans.fit_predict(X)
    plt.figure(figsize=(10,6))
    plt.scatter(X[pred == 0, 0], X[pred == 0, 1], c = 'brown', label = 'Cluster 0')
    plt.scatter(X[pred == 1, 0], X[pred == 1, 1], c = 'green', label = 'Cluster 1')
    plt.scatter(X[pred == 2, 0], X[pred == 2, 1], c = 'blue', label = 'Cluster 2')
    plt.scatter(X[pred == 3, 0], X[pred == 3, 1], c = 'purple', label = 'Cluster 3')
    plt.scatter(X[pred == 4, 0], X[pred == 4, 1], c = 'orange', label = 'Cluster 4')
    plt.xlabel('Annual Income')
    plt.ylabel('Spending Score')
    plt.legend()
    plt.title('Iteration: ' + str(i) + ' Assignment and Move Centroid Step')
    plt.show()