K_means.ipynb - Colab

Introduction

If a dataset does not have any labels associated with it, unsupervised algorithms are used to find structure in it. These structures can be different types of data patterns or groups of data. K-Means clustering is the most commonly used unsupervised learning algorithm for finding groups in unlabeled data. Here K represents the number of groups, or clusters, and the process of creating these groups is known as 'clustering'; hence the name K-Means clustering.

Uses

Search engine: Search engines group results together using clustering algorithms.

Customer segmentation: K-Means clustering can be used to create customer clusters based on demographic information, geographical information and behavioral data.

Social network analysis: To find groups of people with specific interests in order to target personalized ads.

Data center: To organize the computer clusters in a data center.

Inventory management: Create inventory clusters based on sales numbers and manufacturing capacity.

Working of K-Means Clustering

K-Means is often referred to as Lloyd's algorithm and is one of the most popular clustering algorithms. Refer to the plot below, where there are two clusters (K=2): one of red data points and another of green data points.

So how does the K-Means algorithm find clusters in the data points without any labels? The steps below explain its inner working (a minimal sketch in code follows the list).

The first step is to finalize the number of clusters you want to identify in your data. This is the "K" in K-Means clustering.

Now randomly initialize as many points as the number of clusters K. 'Cluster centroid' is the terminology used to refer to these points.

Note that a centroid is the center point of a given set of points. Initially these points are at random locations, but by the time the K-Means algorithm converges they will be at the centers of their respective clusters.

Once the cluster centroids are defined, the K-Means algorithm goes through each data point in the given data and, depending on the point's closeness to the cluster centroids, assigns it to the nearest centroid. This is called the 'Assignment Step'.

To move the cluster centroids from their random locations to their respective groups, the K-Means algorithm finds the mean of the data points assigned to each centroid and moves that centroid to the mean location. This is called the 'Move Centroid Step'.


Note that during the 'Move Centroid Step' data points can get reassigned from one cluster to another as the centroid positions change.

Now repeat the assignment and move centroid steps until the cluster centroid positions no longer change. The K-Means algorithm has converged when the positions of the cluster centroids stop changing.

Once the K-Means algorithm has converged, the data points assigned to each centroid represent that centroid's cluster. If during the cluster assignment step we find a centroid that has no data points associated with it, it is better to remove it.
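
As a rough sketch of the loop described above (assuming a NumPy array X of shape (n_samples, n_features); the function name and parameters here are illustrative, not part of this notebook):

import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal K-Means: alternate the assignment and move centroid steps."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment Step: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move Centroid Step: move each centroid to the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged: centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids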

Since we have to randomly pick the cluster centroids, the initialization may affect the final outcome of the clustering. If our initialization is poor, the K-Means algorithm may form a cluster with only a few points. This situation is referred to as the 'centroid random initialization trap', and it may cause the algorithm to get stuck at a local optimum.

Check the plots below, where for the same dataset we end up with different clusters depending on the initial positions of the cluster centroids. Gray squares represent the initial positions of the centroids, and red, green and blue squares represent their final positions.


Random Initialization Guidelines

To avoid the random initialization trap, follow the guidelines below (a scikit-learn sketch follows the list).

The number of cluster centroids should be less than the number of training examples.

To avoid the local optima issue, try multiple random initializations of the centroids.

The multiple random initialization technique is more effective when we have a small number of clusters.

Similarly, for a large number of clusters, a few random initializations are sufficient.
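
In scikit-learn, multiple random initializations are handled by the n_init parameter: the algorithm is run n_init times with different random centroids, and the run with the lowest clustering score is kept. A minimal sketch (X_demo is made-up data for illustration):

import numpy as np
from sklearn.cluster import KMeans

X_demo = np.random.rand(100, 2)  # illustrative 2D data

# Run K-Means with 10 different random centroid initializations;
# the fitted model keeps the run with the lowest clustering score (inertia)
kmeans_demo = KMeans(n_clusters=3, init='random', n_init=10, random_state=42).fit(X_demo)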

Choosing The Number of Clusters

So using multiple random initializations we can avoid the local optima issue, but to choose how many clusters to look for in given data we can use the methods below.

Visualization

Finding the number of clusters manually through data visualization is one of the most common methods. Domain knowledge and a proper understanding of the given data also help us make more informed decisions. Since it is a manual exercise there is always scope for ambiguous observations; in such cases we can also use the 'Elbow Method'.

Elbow Method

In the Elbow Method we run the K-Means algorithm multiple times in a loop with an increasing number of clusters (say from 1 to 10), and then plot a clustering score as a function of the number of clusters. The clustering score is simply the sum of squared distances of samples to their closest cluster center. The elbow is the point on the plot where the decline in the clustering score (distortion) slows down, and the cluster count at that point gives us the optimum number of clusters. Sometimes we don't get a clear elbow point on the plot, and in such cases it is very hard to finalize the number of clusters.
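
In symbols, for cluster centers \(\mu_1, \ldots, \mu_K\), the clustering score (scikit-learn's inertia_) is the within-cluster sum of squares:

$$\text{score} = \sum_{i=1}^{n} \min_{j \in \{1, \ldots, K\}} \lVert x_i - \mu_j \rVert^2$$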


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv('/content/Mall_Customers.csv')
print("Shape of the data= ", df.shape)
df.head()

Shape of the data= (200, 5)


   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40


Understanding The Data

There are a total of 200 training examples, without any label to indicate which customer belongs to which group.

We are going to use annual income and spending score to find the clusters in the data.

Note that the spending score ranges from 1 to 100 and is assigned by the mall based on customer behavior and spending nature.

plt.figure(figsize=(10,6))
plt.scatter(df['Annual Income (k$)'],df['Spending Score (1-100)'])
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Unlabelled Mall Customer Data')


[Scatter plot: 'Unlabelled Mall Customer Data', Annual Income vs Spending Score]

# Since we are going to use the Annual Income and Spending Score columns only, let's create a 2D array of these columns for clustering
X = df.iloc[:, [3, 4]].values
X[:5]  # Show first 5 records only

array([[15, 39],
       [15, 81],
       [16,  6],
       [16, 77],
       [17, 40]])

Choosing The Number of Clusters

By visual inspection of the scatter plot above we can identify 5 possible clusters. But since there is no other information available, it is very difficult to say this with 100% confidence. So let's try to verify it with the Elbow Method.

Elbow Method

We use the elbow method to find the optimal number of clusters, trying cluster counts from 1 to 10. We will use the 'random' initialization method for this study. Note that scikit-learn's KMeans also has a 'k-means++' initialization method, which selects initial cluster centers for K-Means clustering in a smart way to speed up convergence.
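
For comparison, the k-means++ variant would look like this (a sketch only; this study sticks with init='random'):

from sklearn.cluster import KMeans

# k-means++ spreads the initial centroids far apart before the first
# assignment step, which usually speeds up convergence; it is also
# scikit-learn's default init method
kmeans_pp = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)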

clustering_score = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'random', random_state = 42)
    kmeans.fit(X)
    clustering_score.append(kmeans.inertia_)  # inertia_ = sum of squared distances of samples to their closest cluster center

plt.figure(figsize=(10,6))
plt.plot(range(1, 11), clustering_score)
plt.scatter(5,clustering_score[4], s = 200, c = 'red', marker='*')
plt.title('The Elbow Method')
plt.xlabel('No. of Clusters')
plt.ylabel('Clustering Score')
plt.show()


From the elbow plot above it is clear that the decline in the clustering score slows down after 5 clusters. So we can use K = 5 for further analysis.

Compute K-Means Clustering

Compute the cluster centers and predict the cluster index for each sample. Since K = 5, we will get a cluster index from 0 to 4 for every data point in our dataset.

kmeans = KMeans(n_clusters = 5, random_state = 42)

# Compute k-means clustering
kmeans.fit(X)

# Compute cluster centers and predict cluster index for each sample
pred = kmeans.predict(X)

pred

array([4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 0,
4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1, 0, 1, 3, 1, 3, 1,
0, 1, 3, 1, 3, 1, 3, 1, 3, 1, 0, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1], dtype=int32)

df['Cluster'] = pred  # attach the predicted cluster index to each customer

print('Number of data points in each cluster= \n', df['Cluster'].value_counts())
df


Number of data points in each cluster=
Cluster
0    81
1    39
3    35
4    23
2    22
Name: count, dtype: int64

     CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  Cluster
0             1    Male   19                  15                      39        4
1             2    Male   21                  15                      81        2
2             3  Female   20                  16                       6        4
3             4  Female   23                  16                      77        2
4             5  Female   31                  17                      40        4
..          ...     ...  ...                 ...                     ...      ...
195         196  Female   35                 120                      79        1
196         197  Female   45                 126                      28        3
197         198    Male   32                 126                      74        1
198         199    Male   32                 137                      18        3
199         200    Male   30                 137                      83        1

200 rows × 6 columns


Visualization

plt.figure(figsize=(10,6))
plt.scatter(X[pred == 0, 0], X[pred == 0, 1], c = 'brown', label = 'Cluster 0')
plt.scatter(X[pred == 1, 0], X[pred == 1, 1], c = 'green', label = 'Cluster 1')
plt.scatter(X[pred == 2, 0], X[pred == 2, 1], c = 'blue', label = 'Cluster 2')
plt.scatter(X[pred == 3, 0], X[pred == 3, 1], c = 'purple', label = 'Cluster 3')
plt.scatter(X[pred == 4, 0], X[pred == 4, 1], c = 'orange', label = 'Cluster 4')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'red', label = 'Centroid', marker='*')

plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.title('Customer Clusters')


[Scatter plot: 'Customer Clusters', five colored clusters with centroids marked as red stars]

Inner Working

Using the code below we can visualize the inner working of the K-Means algorithm.

To start with, we define random centroids. In the first plot below you can see the initial centroids together with the original data, before any clusters have formed.

In step 1 we run the K-Means algorithm for only one iteration and plot the new positions of the centroids. Notice how the centroid positions change and clusters start to form around them.

In step 2 we run the K-Means algorithm for two iterations. Notice how data points are reassigned from one cluster to another as the centroid positions change.

Similarly, at the end we run the K-Means algorithm for five iterations, where we get the final locations of the centroids and the associated clusters.

def plot_k_means_progress(centroid_history, n_clusters, centroid_sets, cluster_color):
    """
    Plot the path taken by the centroids.

    I/P:
    * centroid_history: 2D array of centroid coordinates. If there are 5 clusters,
      the first set (i.e. the first 5 rows) contains the initial centroid
      coordinates, and the K-Means loop keeps appending a new set of centroid
      coordinates for each iteration.
    * n_clusters: total number of clusters to find.
    * centroid_sets: number of centroid sets stored in centroid_history. Since the
      initial random set counts as one, its value is the K-Means loop iteration
      number plus 1, so it runs from 2 up to the loop's max iter plus 1.
    * cluster_color: mapping used so each path line matches its cluster's color.

    O/P: plots the centroid paths.
    """
    c_x = []  # To store centroid X coordinates
    c_y = []  # To store centroid Y coordinates
    for i in range(0, n_clusters):
        cluster_index = 0
        for j in range(0, centroid_sets):
            # Row i + cluster_index holds cluster i's centroid in set j
            c_x = np.append(c_x, centroid_history[:, 0][i + cluster_index])
            c_y = np.append(c_y, centroid_history[:, 1][i + cluster_index])
            cluster_index = cluster_index + n_clusters

        plt.plot(c_x, c_y, c = cluster_color['c_' + str(i)], linestyle='--')

        # Reset coordinate arrays to avoid continuous lines between clusters
        c_x = []
        c_y = []

plt.figure(figsize=(10,6))

# Random Initialization of Centroids
plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'])
initial_centroid = np.array([[10, 2], [50, 100], [130, 20], [50, 15], [140, 100]])

plt.scatter(initial_centroid[:, 0], initial_centroid[:, 1], s = 200, c = 'red', label = 'Random Centroid', marker='*')

plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.title('Random Initialization of Centroids')

# K-Means loop of assignment and move centroid steps
centroid_history = initial_centroid
cluster_color = {'c_0': 'brown', 'c_1': 'green', 'c_2': 'blue', 'c_3': 'purple', 'c_4': 'orange'}
n_clusters = 5
for i in range(1, 6):
    # n_init = 1 since we supply our own initial centroids
    kmeans = KMeans(n_clusters, init = initial_centroid, n_init = 1, max_iter = i, random_state = 42)

    # Compute cluster centers and predict cluster index for each sample
    pred = kmeans.fit_predict(X)

    plt.figure(figsize=(10,6))
    plt.scatter(X[pred == 0, 0], X[pred == 0, 1], c = 'brown', label = 'Cluster 0')
    plt.scatter(X[pred == 1, 0], X[pred == 1, 1], c = 'green', label = 'Cluster 1')
    plt.scatter(X[pred == 2, 0], X[pred == 2, 1], c = 'blue', label = 'Cluster 2')
    plt.scatter(X[pred == 3, 0], X[pred == 3, 1], c = 'purple', label = 'Cluster 3')
    plt.scatter(X[pred == 4, 0], X[pred == 4, 1], c = 'orange', label = 'Cluster 4')

    # Previous centroid positions, to show how far the centroids have moved
    plt.scatter(centroid_history[:, 0], centroid_history[:, 1], s = 50, c = 'gray', label = 'Last Centroid', marker='x')

    # Current centroid positions after i iterations
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 200, c = 'red', label = 'Centroid', marker='*')

    centroid_history = np.append(centroid_history, kmeans.cluster_centers_, axis=0)

    plt.xlabel('Annual Income')
    plt.ylabel('Spending Score')
    plt.legend()
    plt.title('Iteration: ' + str(i) + ' Assignment and Move Centroid Step')

    centroid_sets = i + 1  # Add one for the initial set of centroids

    plot_k_means_progress(centroid_history, n_clusters, centroid_sets, cluster_color)
