K-means clustering with SciPy
Prerequisite: K-means clustering
K-means clustering is one of the most widely used unsupervised machine-learning techniques for data segmentation and pattern discovery. This article explores K-means clustering in Python using the SciPy library, covering the fundamentals, the implementation, and the interpretation of the results step by step.
K-Means Clustering
K-Means clustering groups similar data points into clusters. The algorithm does this by repeatedly assigning each data point to the nearest cluster centroid and recomputing the centroids, until the assignments converge to a stable solution. The letter "K" refers to the number of clusters to form. The aim of K-Means is to minimize the sum of squared distances between data points and their respective cluster centroids.
K-Means is a partitioning approach: each cluster is represented by a computed centroid, and every data point belongs to the cluster whose centroid is closest to it.
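Formally, for k clusters C_1, ..., C_k with centroids mu_1, ..., mu_k, this objective (the within-cluster sum of squares) can be written as:
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2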
SciPy
SciPy is a free and open-source library that is built using NumPy as a foundation. The library offers a wide range of functions that are useful for scientific computing and data analysis. It has various modules that can be used for optimization, linear algebra, statistics, image processing, signal processing, and much more. With its modular design, you can easily integrate the functions of SciPy into your data analysis workflows.
It can be installed by running the command given below:
pip install scipy
SciPy has a dedicated package for clustering, with two modules that offer clustering methods.
- cluster.vq
- cluster.hierarchy
cluster.vq
This module provides the vector quantization routines on which SciPy's K-Means implementation is built. In vector quantization, each observation vector is mapped to the nearest centroid (its "code"), and the distortion, computed here as the Euclidean distance between each vector and its centroid, is the quantity the algorithm tries to reduce. Based on this distance, each data point is assigned to a cluster.
cluster.hierarchy
This module provides methods for general hierarchical clustering, such as agglomerative clustering. It has routines for applying statistical methods to hierarchies, visualizing and plotting the clusters, computing and inspecting linkages, and checking whether two different hierarchies are equivalent.
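Although this article does not use cluster.hierarchy, a minimal agglomerative-clustering sketch with made-up sample data might look like this:
Python
# a minimal hierarchical-clustering sketch (illustrative only)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 2.0],
                   [1.1, 2.1],
                   [8.0, 9.0],
                   [8.2, 9.1]])
# build the linkage matrix with Ward's method
Z = linkage(points, method='ward')
# cut the hierarchy into two flat clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 2 2]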
In this article, the cluster.vq module is used to carry out K-Means clustering.
K-Means clustering with Scipy library
K-Means clustering can be performed on given data by executing the following steps.
- Normalize the data points.
- Compute the centroids (each centroid is referred to as a code, and the 2D array of centroids is referred to as the code book).
- Form clusters and assign the data points to them (referred to as mapping from the code book).
cluster.vq.whiten()
This method normalizes the data points by scaling each feature (column) to unit variance; note that it divides each column by its standard deviation but does not subtract the mean. Normalization is important when the attributes are measured in different units. For example, if length is given in meters and breadth in inches, the features will have very unequal variances. Unit variance is preferred when performing K-Means clustering so that no single feature dominates the distance computations. Thus, the data array has to be passed to the whiten() method before any other step.
cluster.vq.whiten(input_array, check_finite)
Parameters:
- input_array : The array of data points to be normalized.
- check_finite : If set to true (the default), checks whether the input matrix contains only finite numbers. If set to false, skips the check; this can improve performance but may cause problems if the input contains NaNs or infinities.
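A quick sketch of whiten() on a small made-up array; note that only the variance changes, not the mean:
Python
# each column of the result has unit variance
import numpy as np
from scipy.cluster.vq import whiten

features = np.array([[1.9, 2.3],
                     [1.5, 2.5],
                     [0.8, 0.6]])
whitened = whiten(features)
print(whitened.std(axis=0))   # approximately [1. 1.]
print(whitened.mean(axis=0))  # not zero: whiten() does not center the data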
cluster.vq.kmeans()
The vq module has two k-means functions, kmeans() and kmeans2().
The kmeans() method iterates until the change in distortion between consecutive iterations is less than or equal to a threshold, at which point the algorithm terminates. It returns the calculated centroids (the code book) and the mean Euclidean distance between the observations and their nearest centroids.
cluster.vq.kmeans(input_array, k, iterations, threshold, check_finite)
Parameters:
- input_array : The array of (already normalized) observations to cluster.
- k : The number of clusters (centroids) to form, or an initial guess for the centroids.
- iterations : The number of times to run k-means; the code book with the lowest distortion is returned. This parameter is ignored if initial centroids are supplied as an array instead of a number k.
- threshold : A float; the algorithm terminates when the change in distortion since the last iteration is less than or equal to this value.
- check_finite : If set to true (the default), checks whether the input matrix contains only finite numbers. If set to false, skips the check.
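For reference, a minimal kmeans() call on made-up data might look like this (SciPy's actual parameter names are iter and thresh):
Python
# run k-means 20 times on random data and keep the best code book
import numpy as np
from scipy.cluster.vq import whiten, kmeans

rng = np.random.default_rng(0)
obs = whiten(rng.random((50, 2)))
centroids, distortion = kmeans(obs, 3, iter=20, thresh=1e-05)
print(centroids.shape)  # up to 3 rows, one per centroid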
The kmeans2() method does not use a threshold value to check for convergence. It has additional parameters that control how the centroids are initialized, how empty clusters are handled, and whether the input matrix is validated to contain only finite numbers. This method returns the centroids and, for each observation, the index of the cluster it belongs to.
cluster.vq.kmeans2(input_array, k, iterations, threshold, minit, missing, check_finite)
Parameters:
- input_array : The array of (already normalized) observations to cluster.
- k : The number of clusters (centroids) to form.
- iterations : The number of iterations of the k-means algorithm to run. Note that this has a different meaning than the corresponding parameter of kmeans().
- threshold : Not currently used by kmeans2().
- minit : A string that denotes the initialization method for the centroids. Possible values are 'random', 'points', '++', 'matrix'.
- missing : A string that denotes the action taken upon encountering an empty cluster. Possible values are 'warn' and 'raise'.
- check_finite : If set to true (the default), checks whether the input matrix contains only finite numbers. If set to false, skips the check.
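A minimal kmeans2() sketch on made-up data, using k-means++ initialization:
Python
# kmeans2() returns both the centroids and the cluster labels
import numpy as np
from scipy.cluster.vq import whiten, kmeans2

rng = np.random.default_rng(0)
obs = whiten(rng.random((50, 2)))
centroids, labels = kmeans2(obs, 3, minit='++', missing='warn')
print(labels[:10])  # cluster index of the first 10 observations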
cluster.vq.vq()
This method maps observations to the centroids calculated by the kmeans() method. It expects the observations to be normalized the same way as the data used to generate the code book. It takes the normalized observations and the generated code book as input, and returns, for each observation, the index of the code (centroid) in the code book that it corresponds to, along with the distance between the observation and that code.
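A minimal vq() sketch with a hand-made code book:
Python
# each observation is mapped to the nearest code in the code book
import numpy as np
from scipy.cluster.vq import vq

code_book = np.array([[0.0, 0.0],
                      [1.0, 1.0]])
obs = np.array([[0.1, 0.2],
                [0.9, 1.1]])
labels, distances = vq(obs, code_book)
print(labels)     # [0 1]: index of the nearest code
print(distances)  # Euclidean distance to that code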
K-Means clustering with a 2D array data
Step 1: Import the required modules
Import the required functions from the numpy and scipy.cluster.vq libraries, which are used for performing K-Means clustering and related operations.
Python
# import modules
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq, kmeans2
Step 2: Import/generate data. Normalize the data
This code normalizes the dataset using the whiten function from the SciPy library. Scaling every feature to unit variance ensures that all features contribute equally to subsequent analysis; note that whiten() only rescales the columns and does not shift them to zero mean. This is a common preprocessing step for distance-based methods such as K-Means.
Python
# observations
data = np.array([[1, 3, 4, 5, 2],
                 [2, 3, 1, 6, 3],
                 [1, 5, 2, 3, 1],
                 [3, 4, 9, 2, 1]])
# normalize
data = whiten(data)
print(data)
Output: the whitened 4x5 array, in which each column has been divided by its standard deviation.
Step 3: Calculate the centroids and generate the code book for mapping using kmeans() method
This step runs the K-Means algorithm using the kmeans function from the SciPy library. It calculates the cluster centroids (the code book) and the mean Euclidean distance between the data points and their nearest centroids. The function randomly chooses k observations as the initial centroids, which serve as the starting points for the clustering process.
Python
# code book generation
centroids, mean_value = kmeans(data, 3)
print("Code book :\n", centroids, "\n")
print("Mean of Euclidean distances :",
mean_value.round(4))
Output: the 3x5 code book of centroids and the mean Euclidean distance; the exact values vary between runs because the initial centroids are chosen randomly.
Step 4: Map the centroids calculated in the previous step to the clusters
Here, the vq function from the SciPy library assigns the data points to clusters based on the pre-calculated centroids and computes the distance between each data point and its cluster centroid. The output displays the cluster assignments and the distance of each data point to its assigned centroid.
Python
# mapping the centroids
clusters, distances = vq(data, centroids)
print("Cluster index :", clusters, "\n")
print("Distance from the centroids :", distances)
Output: the cluster index of each observation and its distance from the assigned centroid.
Consider the same example with kmeans2(). It does not require the additional step of calling the vq() method. Repeat steps 1 and 2, then use the following snippet.
Python
# assign centroids and clusters
centroids, clusters = kmeans2(data, 3, minit='random')
print("Centroids :\n", centroids, "\n")
print("Clusters :", clusters)
Output: the centroids and the cluster index of each observation; the values vary between runs due to the random initialization.
Example 2: K-Means clustering of Diabetes dataset
The dataset contains the following attributes, based on which each patient is placed in either the diabetic cluster or the non-diabetic cluster.
- Pregnancies
- Glucose
- Blood Pressure
- Skin Thickness
- Insulin
- BMI
- Diabetes Pedigree Function
- Age
This code demonstrates a basic example of using clustering techniques to analyze diabetes patient data and visualize the distribution of diabetic and non-diabetic patients using a pie chart. The code uses Python libraries such as NumPy, SciPy, and Matplotlib.
Python
# import modules
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq
# load the dataset
dataset = np.loadtxt(r"{your-path}\diabetes-train.csv",
delimiter=",")
# excluding the outcome column
dataset = dataset[:, 0:8]
print("Data :\n", dataset, "\n")
# normalize
dataset = whiten(dataset)
# generate code book
centroids, mean_dist = kmeans(dataset, 2)
print("Code-book :\n", centroids, "\n")
clusters, dist = vq(dataset, centroids)
print("Clusters :\n", clusters, "\n")
# k-means labels are arbitrary: this example treats cluster 0 as
# non-diabetic and cluster 1 as diabetic
non_diab = list(clusters).count(0)
diab = list(clusters).count(1)
# depict illustration
x_axis = []
x_axis.append(diab)
x_axis.append(non_diab)
colors = ['green', 'orange']
print("No.of.diabetic patients : " + str(x_axis[0]) +
"\nNo.of.non-diabetic patients : " + str(x_axis[1]))
y = ['diabetic', 'non-diabetic']
plt.pie(x_axis, labels=y, colors=colors, shadow=True)
plt.show()
Output: the raw data matrix, the 2x8 code book, and the cluster assignments, followed by a pie chart showing the proportions of diabetic and non-diabetic patients.
In conclusion, this example shows a complete approach to analyzing a diabetes dataset. K-Means clustering segments the patients into two groups, and the pie chart of diabetic versus non-diabetic patients gives a clear overview of the clustering result.