Introduction
If a dataset does not have any labels associated with it, unsupervised algorithms are used to find some structure in it. These
structures can be different types of data patterns or groups of data. K-Means clustering is the most commonly used unsupervised
learning algorithm for finding groups in unlabeled data. Here K represents the number of groups or clusters, and the process of
creating these groups is known as 'clustering', hence the name K-Means clustering.
Uses
Search engines: a search engine groups similar results together using a clustering algorithm.
Customer segmentation: K-Means clustering can be used to create customer clusters based on demographic information,
geographical information and behavioral data.
Social network analysis: to find groups of people with specific interests and target personalized ads at them.
Data centers: to organize the computer clusters in a data center.
Inventory management: to create inventory clusters based on sales numbers and manufacturing capacity.
K-Means is often referred to as Lloyd's algorithm. It is one of the most popular clustering algorithms. Refer to the plot below, where there
are two clusters (K = 2): one of red data points and another of green data points.
So how does the K-Means algorithm find clusters in the data points without any labels? The steps below explain the inner working
of the K-Means algorithm; a minimal NumPy sketch of these steps follows the list.
The first step is to decide the number of clusters you want to identify in your data. This is the "K" in K-Means clustering.
Now randomly initialize a number of points equal to the number of clusters K. 'Cluster centroid' is the terminology used to refer
to these points.
Note that a centroid is the center point of a given set of data. Initially these points are at random locations, but by the time
the K-Means algorithm converges they will be at the center of their respective clusters.
Once the cluster centroids are defined, the K-Means algorithm goes through each data point in the given data and, depending on
that point's closeness to the cluster centroids, assigns it to the nearest centroid. This is called the 'Assignment
Step'.
In order to move the cluster centroids from their random locations towards their respective groups, the K-Means algorithm computes the mean
of the data points assigned to each cluster centroid and moves that centroid to the mean location. This is
called the 'Move Centroid Step'.
Note that after the 'Move Centroid Step', data points can get reassigned from one cluster to another because the centroid positions
have changed.
Now repeat the assignment and move centroid steps until the cluster centroid positions no longer change. The K-Means algorithm has
converged when the cluster centroid positions stop changing.
Once the K-Means algorithm has converged, the data points assigned to each centroid represent its cluster.
If during the cluster assignment step we find a centroid with no data points associated with it, it is better to remove
it.
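As a rough sketch of how these two steps can be coded, the function below implements them with plain NumPy. The function name simple_k_means, the choice of initial centroids from the data points, and the convergence check are illustrative assumptions, not the exact implementation used by libraries such as scikit-learn.

import numpy as np

def simple_k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k data points as the initial cluster centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment Step: assign each point to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move Centroid Step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Converged: centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

For example, labels, centers = simple_k_means(data, k=2) would reproduce the two-cluster picture described above for any 2D NumPy array of points called data.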
Since we pick the cluster centroids randomly, their initialization may affect the final outcome of the clustering. If the
initialization is poor, the K-Means algorithm may form a cluster with only a few points. Such a situation is referred to as the
'centroid random initialization trap', and it may cause the algorithm to get stuck at a local optimum.
Check the plots below, where for the same dataset we end up with different clusters depending on the initial positions of the cluster
centroids. Gray squares represent the initial positions of the centroids, and the red, green and blue squares represent their final
positions.
Random Initialization Guidelines
To avoid the random initialization trap, follow the guidelines below for random initialization; a code sketch of the idea follows these guidelines.
The multiple random initializations technique (run K-Means several times with different initial centroids and keep the best clustering) is more effective when we have a small number of clusters.
Similarly, for a large number of clusters, a few random initializations are usually sufficient.
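A minimal sketch of the multiple random initializations idea with scikit-learn is shown below: we run K-Means several times with different seeds and keep the run with the lowest clustering score. The demo data, the number of runs and the cluster count are illustrative choices.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))   # illustrative 2D data; use your own feature matrix in practice

best_km, best_score = None, np.inf
for seed in range(10):               # 10 random initializations
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X_demo)
    if km.inertia_ < best_score:     # keep the run with the lowest clustering score
        best_km, best_score = km, km.inertia_

scikit-learn can also do this internally: passing n_init=10 to KMeans runs 10 initializations and keeps the best one.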
So using multiple random initializations we can avoid the local optima issue, but to choose how many clusters to look for in a given dataset
we can use the methods below.
Visualization
Finding the number of clusters manually through data visualization is one of the most common methods. Domain knowledge and a
proper understanding of the given data also help in making more informed decisions. Since this is a manual exercise there is always
scope for ambiguous observations; in such cases we can also use the 'Elbow Method'.
Elbow Method
In the Elbow Method we run the K-Means algorithm multiple times in a loop with an increasing number of clusters (say from
1 to 10) and then plot the clustering score as a function of the number of clusters. The clustering score is simply the sum of
squared distances of the samples to their closest cluster center. The elbow is the point on the plot where the decrease in clustering score (distortion)
slows down, and the number of clusters at that point gives us the optimum number of clusters. Sometimes we do not get a
clear elbow point on the plot; in such cases it is very hard to finalize the number of clusters.
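In symbols, if $\mu_{c(i)}$ denotes the cluster center closest to sample $x_i$, the clustering score (distortion, exposed by scikit-learn as inertia_) over $m$ samples is

$$J = \sum_{i=1}^{m} \lVert x_i - \mu_{c(i)} \rVert^2$$

and the elbow is the value of K beyond which $J$ stops dropping sharply.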
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
df = pd.read_csv('/content/Mall_Customers.csv')
print("Shape of the data= ", df.shape)
df.head()
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
There are a total of 200 training examples without any labels to indicate which customer belongs to which group.
We are going to use annual income and spending score to find the clusters in the data.
Note that the spending score runs from 1 to 100 and is assigned by the mall based on customer behavior and spending nature.
plt.figure(figsize=(10,6))
plt.scatter(df['Annual Income (k$)'],df['Spending Score (1-100)'])
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Unlabelled Mall Customer Data')
# Since we are going to use the Annual Income and Spending Score columns only, let's create a 2D array of these columns for further processing
X = df.iloc[:, [3,4]].values
X[:5] # Show first 5 records only
array([[15, 39],
[15, 81],
[16, 6],
[16, 77],
[17, 40]])
By visual inspection of the above scatter plot we can identify 5 possible clusters. But since there is no other information available,
it is very difficult to say this with 100% confidence, so let's try to verify it with the Elbow Method.
Elbow Method
Using the Elbow Method to find the optimal number of clusters, let's try cluster counts from 1 to 10. We will use the 'random'
initialization method for this study. Note that scikit-learn's K-Means implementation also has a 'k-means++' initialization method, which selects
initial cluster centers for K-Means clustering in a smart way to speed up convergence.
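For reference, switching to that smarter initialization is just a parameter change; this line is illustrative, and the cells below keep init = 'random':

kmeans_pp = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)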
clustering_score = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'random', random_state = 42)
    kmeans.fit(X)
    clustering_score.append(kmeans.inertia_)  # inertia_ = sum of squared distances of samples to their closest cluster center
plt.figure(figsize=(10,6))
plt.plot(range(1, 11), clustering_score)
plt.scatter(5,clustering_score[4], s = 200, c = 'red', marker='*')
plt.title('The Elbow Method')
plt.xlabel('No. of Clusters')
plt.ylabel('Clustering Score')
plt.show()
From the above elbow plot it is clear that the decrease in clustering score slows down after 5 clusters, so we can use K = 5 for further
analysis.
Next we fit K-Means with K = 5, compute the cluster centers and predict a cluster index for each sample. Since K = 5, every
data point in our dataset gets a cluster index from 0 to 4.
# Fit K-Means with the chosen K = 5, compute the cluster centers and predict the cluster index for each sample
kmeans = KMeans(n_clusters = 5, init = 'random', random_state = 42)
pred = kmeans.fit_predict(X)
pred
array([4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 0,
4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 1, 0, 1, 3, 1, 3, 1,
0, 1, 3, 1, 3, 1, 3, 1, 3, 1, 0, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1, 3, 1,
3, 1], dtype=int32)
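The predicted label can be attached to the original dataframe as a new column to produce the table below (the column name 'cluster' is an illustrative choice):

df['cluster'] = pred
df.head()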
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  cluster
0           1    Male   19                  15                      39        4
1           2    Male   21                  15                      81        2
2           3  Female   20                  16                       6        4
3           4  Female   23                  16                      77        2
4           5  Female   31                  17                      40        4
Visualization
plt.figure(figsize=(10,6))
plt.scatter(X[pred == 0, 0], X[pred == 0, 1], c = 'brown', label = 'Cluster 0')
plt.scatter(X[pred == 1, 0], X[pred == 1, 1], c = 'green', label = 'Cluster 1')
plt.scatter(X[pred == 2, 0], X[pred == 2, 1], c = 'blue', label = 'Cluster 2')
plt.scatter(X[pred == 3, 0], X[pred == 3, 1], c = 'purple', label = 'Cluster 3')
plt.scatter(X[pred == 4, 0], X[pred == 4, 1], c = 'orange', label = 'Cluster 4')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.title('Customer Clusters')
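To also mark the final centroid positions on this plot (like the colored squares described earlier), the fitted model exposes them as kmeans.cluster_centers_; adding the line below to the cell above, before plt.legend(), overlays them as black squares:

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 200, c = 'black', marker = 's', label = 'Centroids')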
Inner Working
Using the code below we can visualize the inner workings of the K-Means algorithm.
To start with, we define random centroids. The first plot shows these initial centroids together with the original data, before any
clusters are formed.
In step 1 we run the K-Means algorithm for only one iteration and plot the new centroid positions. Notice how the centroid
positions change and clusters start to form around them.
In step 2 we run the K-Means algorithm for two iterations. Notice how data points are reassigned from one cluster to
another as the centroid positions change.
Similarly, at the end we run the K-Means algorithm for six iterations, where we get the final locations of the centroids and their associated
clusters.
I/P:
* centroid_history: 2D array of centroids. Each row represents one centroid's coordinates.
If there are 5 clusters, the first set contains the initial centroid coordinates
(i.e. the first 5 rows), and then the K-Means loop keeps appending a new set of centroid coordinates for each iteration.
* n_clusters: Total number of clusters to find.
* centroid_sets: At the start we set random values as our first centroid set, and the K-Means loop keeps adding
new centroid sets to centroid_history. Since we are plotting the path of the centroid locations, the centroid_sets value
is the K-Means loop iteration number plus 1 for the initial centroid set.
So its value runs from 2 up to the K-Means loop's max iterations plus 1.
* cluster_color: Just to keep the line and cluster colors the same.
# Collect this centroid's y-coordinates across all centroid sets
c_y = np.append(c_y, centroid_history[:,1][i + cluster_index])
cluster_index = cluster_index + n_clusters
# If there are 5 clusters then the first set contains the initial centroid coordinates
# and then the K-Means loop will keep appending new centroid coordinates for each iteration
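Based on the input description above, a minimal sketch of such a centroid-path plotting helper could look like the function below; the name plot_centroid_paths and its body are illustrative assumptions built around the two lines shown above.

def plot_centroid_paths(centroid_history, n_clusters, centroid_sets, cluster_color):
    for i in range(n_clusters):
        # Collect this centroid's coordinates across all centroid sets
        c_x, c_y = np.array([]), np.array([])
        cluster_index = 0
        for _ in range(centroid_sets):
            c_x = np.append(c_x, centroid_history[:, 0][i + cluster_index])
            c_y = np.append(c_y, centroid_history[:, 1][i + cluster_index])
            cluster_index = cluster_index + n_clusters
        # Draw the path the centroid followed from its random starting point
        plt.plot(c_x, c_y, c = cluster_color[i], linestyle = '--')
        plt.scatter(c_x, c_y, c = cluster_color[i], marker = 's', s = 100)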
# Run K-Means for an increasing number of iterations (1 to 6) and plot the cluster
# assignments after each run to visualize the assignment and move-centroid steps.
# (n_init = 1 and max_iter = i are assumptions used to re-run K-Means for a fixed number of iterations.)
for i in range(1, 7):
    kmeans = KMeans(n_clusters = 5, init = 'random', n_init = 1, max_iter = i, random_state = 42)
    # Compute cluster centers and predict cluster index for each sample
    pred = kmeans.fit_predict(X)
    plt.figure(figsize=(10,6))
    plt.scatter(X[pred == 0, 0], X[pred == 0, 1], c = 'brown', label = 'Cluster 0')
    plt.scatter(X[pred == 1, 0], X[pred == 1, 1], c = 'green', label = 'Cluster 1')
    plt.scatter(X[pred == 2, 0], X[pred == 2, 1], c = 'blue', label = 'Cluster 2')
    plt.scatter(X[pred == 3, 0], X[pred == 3, 1], c = 'purple', label = 'Cluster 3')
    plt.scatter(X[pred == 4, 0], X[pred == 4, 1], c = 'orange', label = 'Cluster 4')
    plt.xlabel('Annual Income')
    plt.ylabel('Spending Score')
    plt.legend()
    plt.title('Iteration: ' + str(i) + ' Assignment and Move Centroid Step')
    plt.show()