
K-Means Clustering on Handwritten Digits Data Using Scikit-Learn in Python
Introduction
Clustering, which groups similar data points based on shared characteristics, is a prominent technique in unsupervised machine learning. K-Means is one of the most popular clustering algorithms: it iteratively divides the data into K clusters, where K is a predetermined number, by minimizing the sum of squared distances between the data points and their cluster centroids. In this post, we will look at how to use the Python Scikit-Learn package to perform K-Means clustering on handwritten digits data.
Definition
K-Means clustering is a straightforward and efficient unsupervised learning approach that seeks to partition a dataset into K distinct, non-overlapping clusters. It works by assigning each data point to its closest centroid, where a centroid is the arithmetic mean of all the points assigned to that cluster. The algorithm then iteratively updates the centroids so as to reduce the sum of squared distances between the data points and their corresponding centroids. This process is repeated until convergence or until a predetermined number of iterations is reached.
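To make this objective concrete, here is a minimal NumPy sketch (the array values are purely illustrative, not taken from the digits dataset) that computes the sum of squared distances between points and their assigned centroids, which is the quantity K-Means tries to minimize:

import numpy as np

# Toy data: four 2-D points, two centroids, and an assignment of points to centroids
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])
assignments = np.array([0, 0, 1, 1])  # index of the centroid each point belongs to

# Sum of squared distances between each point and its assigned centroid
# (after fitting, scikit-learn exposes the same quantity as kmeans.inertia_)
sse = np.sum((points - centroids[assignments]) ** 2)
print("Sum of squared distances:", sse)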
Syntax
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Load the digits dataset
digits = load_digits()

# Create a K-Means clustering model
kmeans = KMeans(n_clusters=K)

# Fit the model to the data
kmeans.fit(digits.data)

# Predict the cluster labels for the data
labels = kmeans.predict(digits.data)
Import the required libraries. From the sklearn.cluster package we import the KMeans class, and from the sklearn.datasets package we import the load_digits function, which provides the handwritten digits dataset.
Use the load_digits function to load the handwritten digits dataset. This collection contains images of handwritten digits, each stored as an 8x8 pixel grayscale image (flattened into 64 features per sample in digits.data).
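If you want to see what the loaded data looks like before clustering, the following small illustrative sketch prints the dataset shapes and displays one sample digit:

from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()

# digits.data holds the flattened feature matrix; digits.images keeps the 8x8 form
print(digits.data.shape)    # (1797, 64): 1797 samples with 64 pixel features each
print(digits.images.shape)  # (1797, 8, 8)

# Display the first digit image together with its true label
plt.imshow(digits.images[0], cmap=plt.cm.binary)
plt.title("Label: " + str(digits.target[0]))
plt.show()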
Create a K-Means clustering model by initialising an instance of the KMeans class. The n_clusters parameter specifies the number of clusters (K) we wish to generate; the value of K is chosen based on the dataset and the problem.
Fit the K-Means model to the data by calling the fit method and passing the dataset. This step determines the cluster centroids and assigns each data point to its closest centroid.
Use the predict method to obtain the cluster labels for the data points. Each data point receives a label corresponding to the cluster it belongs to.
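As a quick sanity check (an illustrative sketch that assumes the labels array produced by kmeans.predict above), you can count how many samples fall into each cluster:

import numpy as np

# Count how many samples K-Means placed in each cluster
counts = np.bincount(labels)
for cluster_id, count in enumerate(counts):
   print("Cluster", cluster_id, "contains", count, "samples")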
Algorithm
Step 1: Initialise K cluster centroids, either randomly or using a specified initialization method.
Step 2: Assign each data point to its nearest centroid based on Euclidean distance.
Step 3: Update each centroid by computing the mean of all the data points assigned to its cluster.
Step 4: Repeat steps 2 and 3 until convergence or until the allotted number of iterations is reached.
Step 5: Return the final cluster assignments.
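The following is a compact NumPy sketch of these steps. It is illustrative only; scikit-learn's KMeans adds refinements such as k-means++ initialization and multiple restarts.

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
   rng = np.random.default_rng(seed)
   # Step 1: initialise centroids by picking k random data points
   centroids = X[rng.choice(len(X), size=k, replace=False)]
   for _ in range(n_iters):
      # Step 2: assign each point to its nearest centroid (Euclidean distance)
      distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
      labels = np.argmin(distances, axis=1)
      # Step 3: update each centroid as the mean of the points assigned to it
      new_centroids = np.array([
         X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
         for j in range(k)
      ])
      # Step 4: stop early if the centroids no longer move (convergence)
      if np.allclose(new_centroids, centroids):
         break
      centroids = new_centroids
   # Step 5: return the final cluster assignments and centroids
   return labels, centroids

For example, calling simple_kmeans(digits.data, 10) would produce cluster assignments comparable to (though generally not identical to) those from scikit-learn's KMeans.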
Approach
Approach 1 Clustering Handwritten Digits Data
Approach 2 Evaluating Clustering Performance
Approach 1: Clustering Handwritten Digits Data
Example
# Import the necessary libraries
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()

# Create a K-Means clustering model
kmeans = KMeans(n_clusters=10)

# Fit the model to the data
kmeans.fit(digits.data)

# Predict the cluster labels for the data
labels = kmeans.predict(digits.data)

# Visualize the cluster centroids
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
   axi.set(xticks=[], yticks=[])
   axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

# Show the plot
plt.show()
Output
(The output is a plot containing ten grayscale 8x8 images arranged in a 2x5 grid, one per cluster centroid.)
Since the dataset contains 10 different digits (0-9), we create a K-Means clustering model by initialising an instance of the KMeans class with n_clusters=10.
We then fit the model to the data using the fit method, which computes the cluster centroids and assigns each data point to its closest centroid.
To visualize the cluster centroids, we reshape the cluster centres into 8x8 images and plot them using matplotlib. The resulting figure shows a representative image for each cluster.
Finally, we call plt.show() to display the plot at the end of the program.
The output of the code is a plot with 10 subplots arranged in a 2x5 grid, where each subplot shows the centroid of one cluster as a grayscale 8x8 image, i.e. the average digit image for that cluster. Because the initial positions of the cluster centroids are chosen at random, the clustering outcome, and therefore the centroids and their arrangement in the plot, may differ slightly between runs.
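If you want the clustering (and therefore the plot) to be reproducible across runs, KMeans accepts a random_state parameter that fixes the random initialization. A short sketch:

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

# Fixing random_state makes the centroid initialization, and hence the
# resulting clusters, reproducible across runs
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(digits.data)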
Approach 2: Evaluating Clustering Performance
Example
# Import the necessary libraries
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

# Load the digits dataset
digits = load_digits()

# Create a K-Means clustering model
kmeans = KMeans(n_clusters=10)

# Fit the model to the data
kmeans.fit(digits.data)

# Predict the cluster labels for the data
labels = kmeans.predict(digits.data)

# Evaluate the clustering performance
score = silhouette_score(digits.data, labels)
print("Silhouette Score:", score)
Output
Silhouette Score: 0.18185624794421412
As before, we fit a K-Means clustering model to the data with n_clusters set to 10.
The predicted cluster labels for the data points are then stored in the labels variable.
To assess the clustering performance we use the silhouette_score function from sklearn.metrics. This metric measures how well each data point fits its assigned cluster relative to the other clusters; its values range from -1 to 1, with higher values indicating better clustering.
Finally, we print the silhouette score to judge the quality of the clustering results.
When you run the code, the silhouette score of the K-Means clustering of the digits dataset is computed and printed to the console; the actual value appears after the colon in the output above. The closer the score is to 1, the more distinct and well-separated the clusters are. As with the previous approach, the score may vary slightly between runs because of the random initialization.
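Silhouette scores can also be used to compare different choices of K. The sketch below is illustrative (the candidate K values are arbitrary); it fits K-Means for a few values of K on the same dataset and prints the score for each:

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

digits = load_digits()

# Compare silhouette scores for a few candidate numbers of clusters
for k in [8, 10, 12, 14]:
   kmeans = KMeans(n_clusters=k, random_state=42)
   labels = kmeans.fit_predict(digits.data)
   score = silhouette_score(digits.data, labels)
   print("K =", k, "silhouette score =", round(score, 4))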
Conclusion
K-Means clustering is a flexible approach for discovering hidden patterns and grouping related data points across many kinds of data. By understanding and applying K-Means clustering, you can extract meaningful insights from your data and make better-informed decisions.