Unsupervised Learning – Part 1

Clustering in Machine Learning

Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It
can be defined as "a way of grouping the data points into different clusters consisting of
similar data points; objects with possible similarities remain in a group that has few or
no similarities with any other group."

It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color,
behavior, etc., and dividing the data according to the presence or absence of those patterns.

It is an unsupervised learning method, so no supervision is provided to the algorithm, and it deals
with an unlabelled dataset.

After the clustering technique is applied, each cluster or group is given a cluster-ID. An ML
system can use this ID to simplify the processing of large and complex datasets.

The clustering technique is commonly used for statistical data analysis.

Note: Clustering is similar in spirit to a classification algorithm, but the difference is the type of
dataset we use. In classification we work with a labelled dataset, whereas in clustering we
work with an unlabelled dataset.

Example: Let's understand the clustering technique with a real-world example of a mall. When we
visit a shopping mall, we can observe that things with similar uses are grouped together: t-shirts
are grouped in one section and trousers in another, and in the produce section apples, bananas,
mangoes, etc., are kept in separate areas so that we can easily find things. The clustering
technique works the same way. Another example of clustering is grouping documents by topic.

The clustering technique can be used in a wide variety of tasks. Some of the most common uses of this
technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

The diagram below illustrates the working of a clustering algorithm: the different fruits are
divided into several groups with similar properties.

Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (each data point belongs to exactly one
group) and Soft clustering (a data point can belong to more than one group, with some degree of
membership); a short sketch contrasting the two appears after the list below. Various other
approaches to clustering also exist. Below are the main clustering methods used in Machine
learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
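
As a quick illustration of the hard/soft distinction, here is a minimal sketch (the toy data is made up for illustration) contrasting scikit-learn's KMeans, which produces hard labels, with GaussianMixture, a distribution model-based method whose predict_proba gives soft memberships:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# toy 2-D data: two loose groups plus one in-between point
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [3.0, 3.0]])

# hard clustering: each point gets exactly one label
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

# soft clustering: each point gets a membership probability per cluster
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict_proba(X).round(2))  # each row sums to 1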

Among these, hierarchical clustering deserves a closer look. Basically, there are two types of hierarchical cluster analysis strategies –


1. Agglomerative Clustering: Also known as the bottom-up approach or hierarchical
agglomerative clustering (HAC). It produces a structure that is more informative than the
unstructured set of clusters returned by flat clustering, and it does not require us to
prespecify the number of clusters. Bottom-up algorithms treat each data point as a singleton
cluster at the outset and then successively merge pairs of clusters until all of them have been
merged into a single cluster that contains all the data.
Algorithm:
given a dataset (d1, d2, d3, ....dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about
    # the primary diagonal, we compute only the lower
    # part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains

Python implementation of the above algorithm using the scikit-learn library:

from sklearn.cluster import AgglomerativeClustering
import numpy as np

# randomly chosen dataset
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# here we need to specify the number of clusters;
# otherwise the result will be a single cluster
# containing all the data
clustering = AgglomerativeClustering(n_clusters=2).fit(X)

# print the class labels
print(clustering.labels_)

Output:
[1 1 1 0 0 0]
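
To inspect the full merge hierarchy rather than a single flat cut, SciPy's linkage function can be used; a small sketch on the same data (assuming SciPy is installed):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# one row per merge: the two clusters merged, the merge distance,
# and the size of the newly formed cluster
Z = linkage(X, method='ward')
print(Z)

# cut the hierarchy into 2 flat clusters; this recovers the same
# grouping as the scikit-learn result above (up to label names)
print(fcluster(Z, t=2, criterion='maxclust'))
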
2. Divisive clustering: Also known as the top-down approach. This algorithm also does not
require us to prespecify the number of clusters. Top-down clustering requires a method for
splitting a cluster: it starts with the whole data in a single cluster and proceeds by splitting
clusters recursively until each individual data point sits in its own singleton cluster.
Algorithm:
given a dataset (d1, d2, d3, ....dN) of size N
at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster
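
The source gives no implementation for the divisive case, so here is a minimal sketch of the idea: K-Means with K=2 serves as the flat "subroutine", and the largest cluster is split at each step (one simple stand-in for "choose the best cluster"). The function divisive_clustering is illustrative, not a library API:

import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, k):
    # start with all points in a single cluster (stored as index arrays)
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        # pick the largest cluster to split (a simple heuristic)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # split it with a flat method (2-means) as the "subroutine"
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[sub == 0])
        clusters.append(members[sub == 1])
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
print(divisive_clustering(X, 2))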

Hierarchical Agglomerative vs Divisive clustering –


a. Divisive clustering is more complex than agglomerative clustering: in the divisive case we
need a flat clustering method as a "subroutine" to split each cluster until every data point
has its own singleton cluster.
b. Divisive clustering is more efficient if we do not generate a complete hierarchy all the way
down to individual data leaves. The time complexity of naive agglomerative clustering
is O(n³), because we exhaustively scan the N x N matrix dist_mat for the lowest distance in
each of the N-1 iterations. Using a priority queue we can reduce this complexity
to O(n² log n), and with some further optimizations it can be brought down to O(n²).
For divisive clustering, by contrast, given a fixed number of top levels and an efficient
flat algorithm like K-Means, the running time is linear in the number of patterns and
clusters.
c. A divisive algorithm is also more accurate. Agglomerative clustering makes its decisions by
considering local patterns or neighboring points, without initially taking the global
distribution of the data into account, and these early decisions cannot be undone. Divisive
clustering, on the other hand, takes the global distribution of the data into consideration
when making top-level partitioning decisions.

K-Means Clustering Algorithm
K-Means clustering is an unsupervised learning algorithm used to solve clustering problems in
machine learning and data science. In this topic we will learn what the K-means clustering
algorithm is and how it works, along with a Python implementation of K-means clustering.

What is the K-Means Algorithm?

K-Means clustering is an unsupervised learning algorithm that groups an unlabelled dataset into
different clusters. Here K defines the number of predefined clusters to be created in the
process: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabelled dataset into k different clusters in such
a way that each data point belongs to only one group, whose members share similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in an unlabelled dataset on its own, without the need for any training.

It is a centroid-based algorithm, in which each cluster is associated with a centroid. The main aim
of the algorithm is to minimize the sum of distances between each data point and the centroid of
its corresponding cluster.
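
Written out (the standard formulation, stated here for reference), the quantity K-means minimizes is the within-cluster sum of squared distances, often called the inertia:

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k is the set of points assigned to cluster k and \mu_k is that cluster's centroid (the mean of its points).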

The algorithm takes the unlabelled dataset as input, divides it into k clusters, and repeats the
process until the assignments stop changing, i.e., until it has found the best clusters. The value
of k must be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best values for the K center points or centroids through an iterative process.
o Assigns each data point to its closest k-center. The data points nearest to a particular
k-center together form a cluster.

Hence each cluster contains data points with some commonalities and lies away from the other clusters.

The diagram below illustrates the working of the K-means clustering algorithm.

How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the following steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as initial centroids (these need not be points from the input dataset).

Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.

Step-4: Compute the center of gravity (mean) of each cluster and place its new centroid there.

Step-5: Repeat step 3, i.e., reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurred, go back to step 4; otherwise go to FINISH.

Step-7: The model is ready.
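
The steps above map directly onto code. Below is a minimal from-scratch sketch in NumPy, written only to make the loop concrete (empty clusters are not handled; in practice you would use a library implementation such as scikit-learn's KMeans):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick K random points from the dataset as initial centroids
    # (the text notes they need not come from the dataset; this is simplest)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Steps 3/5: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when no centroid moves, i.e. no reassignment occurs
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)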

Let's understand the above steps by considering the visual plots:

Suppose we have two variables, M1 and M2. The x-y scatter plot of these two variables is given
below:

o Let's take the number of clusters k, i.e., K=2, to identify the dataset and put the points into
different clusters. It means we will try to group this dataset into two different clusters.
o We need to choose some random k points or centroids to form the clusters. These points can be
either points from the dataset or any other points. So here we select the two points below as
k points, which are not part of our dataset. Consider the image below:

o Now we assign each data point of the scatter plot to its closest k-point or centroid. We
compute this by applying the familiar mathematics for the distance between two points, and
then draw a median line between the two centroids (their perpendicular bisector). Consider
the image below:

From the image it is clear that points on the left side of the line are nearer to the K1 or blue
centroid, and points to the right of the line are closer to the yellow centroid. Let's color them
blue and yellow for clear visualization.

o As we need to find the closest cluster, we repeat the process by choosing new centroids. To
choose the new centroids, we compute the center of gravity of each cluster's points and
place the new centroids there, as shown below:
o Next, we reassign each data point to its new closest centroid. For this we repeat the same
process of finding a median line, which will now look like the image below:

From this image we can see that one yellow point is on the left side of the line and two blue
points are to the right of it. So these three points will be assigned to new centroids.

As reassignment has taken place, we again go to step 4, which is finding new centroids or
k-points.

We repeat the process by finding the center of gravity of each cluster, so the new centroids will
be as shown in the image below:
o As we have the new centroids, we again draw the median line and reassign the data points,
giving the image below:
o We can see in this image that there are no dissimilar data points on either side of the line,
which means our model has converged. Consider the image below:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will
be as shown in the image below:
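
The entire walkthrough above collapses into a few lines with scikit-learn. A minimal usage sketch (the two-variable data is invented to stand in for the plotted M1/M2 points):

import numpy as np
from sklearn.cluster import KMeans

# made-up stand-in for the M1/M2 scatter data
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 0.9],
              [5.0, 8.0], [5.5, 7.5], [6.0, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # final cluster assignment per point
print(km.cluster_centers_)  # final centroids (the "center of gravity" points)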
