
K Means Clustering - Experiment 12

Unsupervised

Uploaded by

Prateek Verma

EXPERIMENT NO. 12
1. Title:
Perform unsupervised classification using the k-means algorithm in a Python environment

2. Aim:
To study the given dataset and apply the k-means algorithm to group the dataset into
clusters using Python.

3. Theory:

K-means clustering is an unsupervised machine learning algorithm that groups an
unlabelled dataset into different clusters. Unsupervised learning lets an algorithm operate on
unlabelled, unclassified data without supervision. With no prior training on labelled
examples, the machine's job is to organize unsorted data according to similarities, patterns,
and differences. The goal of clustering is to divide the population or set of data points into a
number of groups so that the data points within each group are more similar to one another
than to the data points in the other groups. It is essentially a grouping of items based on how
similar and different they are to one another.
The algorithm works as follows:
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean’s coordinates, which
are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our clusters.
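
The three steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation used later in the program: the function name kmeans, the iterations and seed parameters, and the six sample points are all assumptions made for the example. It also uses the common batch variant, in which every point is assigned first and the means are recomputed afterwards, rather than updating a mean after each single item.

```python
import random


def kmeans(points, k, iterations=10, seed=0):
    """Plain k-means on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial means.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 2a: assign each point to its closest mean
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2,
            )
            for p in points
        ]
        # Step 2b: move each mean to the average of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster ends up empty
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    # Step 3: after the given number of iterations, return the clusters.
    return centroids, labels


# Hypothetical sample data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this sample the first three points end up sharing one label and the last three the other. Which group receives label 0 depends on the random initialization.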

Here k is the number of clusters to be created, which is chosen in advance: for k=2 there
will be two clusters, for k=3 there will be three clusters, and so on. K-means is an
iterative algorithm that divides the unlabelled dataset into k different clusters in such a way
that each data point belongs to only one group, whose members have similar properties. It
offers a convenient way to discover groups in an unlabelled dataset without the need for any
labelled training data. It is a centroid-based algorithm, where each cluster is associated with a
centroid. The main aim of this algorithm is to minimize the sum of squared distances between
the data points and the centroids of their clusters. The algorithm takes the unlabelled dataset
as input, divides it into k clusters, and repeats the assign-and-update process until the cluster
assignments stop changing or a maximum number of iterations is reached. The value of k must
be chosen before the algorithm is run.
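
The quantity being minimized can be made concrete with a small worked example. The sketch below assumes the usual choice of squared Euclidean distance to each cluster's centroid. The helper names centroid and wcss and the four sample points are hypothetical, introduced only for illustration.

```python
def centroid(points):
    """Mean of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def wcss(points, labels):
    """Within-cluster sum of squared distances for a given labelling."""
    total = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        cx, cy = centroid(members)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in members)
    return total


points = [(1, 1), (1, 3), (5, 5), (7, 5)]
good = wcss(points, [0, 0, 1, 1])  # groups nearby points together
bad = wcss(points, [0, 1, 0, 1])   # mixes far-apart points
print(good, bad)  # 4.0 36.0
```

The labelling that keeps nearby points together gives the smaller value, which is the kind of grouping the algorithm moves towards at each iteration.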

4. Pre-Questions:
1. What is clustering?
2. How is unsupervised classification different from supervised classification?
3. Which library is used to import k-means clustering?
5. Post-Questions:
1. What are the applications of clustering?
2. What are the different methods of clustering?
3. What are the parameters on the basis of which k-means clustering is done?
6. Program Code:

##to create dataframe and import libraries


from sklearn.cluster import KMeans
import pandas as pd
from matplotlib import pyplot as plt

## to read the csv file


df = pd.read_csv('/kaggle/input/income-dataset-for-k-means/income.csv')
df
## to check the first 5 rows (without overwriting df)
df.head()
## to check the basic statistics of the data
df.describe()
df.shape
df.info()
## to plot scatter plot between age and income
plt.scatter(df.Age, df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')

## to fit dataset into clusters


km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
y_predicted

## print the predicted cluster number for each datapoint


df['cluster']=y_predicted
df.head()

##to check the cluster centers


km.cluster_centers_

##to plot the different datapoints as per their assigned clusters


df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
7. Outcome: The k-means algorithm has been used successfully to group an income
dataset into 3 clusters using Python.

Fig. 1. Input Dataset

Fig. 2. 3 Clusters using k-means clustering


8. Conclusion: Thus we have successfully completed k-means clustering of the income
dataset.
