
Clustering in Machine Learning

Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has little or no similarity with any other group."

The algorithm does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and divides the data points according to the presence or absence of those patterns.

It is an unsupervised learning method; no supervision is provided to the algorithm, and it works with an unlabelled dataset.

After applying the clustering technique, each cluster or group is assigned a cluster ID. An ML system can use this ID to simplify the processing of large and complex datasets.

The clustering technique is commonly used for statistical data analysis.

Note: Clustering is somewhat similar to classification, but the difference is the type of dataset being used. In classification, we work with a labeled dataset, whereas in clustering, we work with an unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit a mall, we can observe that items with similar usage are grouped together: t-shirts are grouped in one section and trousers in another, and similarly, in the fruit and vegetable section, apples, bananas, mangoes, etc., are placed in separate sections so that we can easily find what we need. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.

The clustering technique is widely used in various tasks. The most common uses of this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general uses, Amazon uses clustering in its recommendation system to provide recommendations based on a user's past product searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.

The diagram below illustrates the working of the clustering algorithm: different fruits are divided into several groups with similar properties.

Types of Clustering Methods


Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (a data point can also belong to other groups). Beyond this distinction, various other approaches to clustering exist. The main clustering methods used in machine learning are:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering
This type of clustering divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.

In this type, the dataset is divided into a set of k groups, where k is the pre-defined number of groups. The cluster centers are chosen so that the distance between the data points within one cluster is minimal compared to the distance to any other cluster's centroid.
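As a concrete illustration, the short sketch below applies K-Means from scikit-learn to a small synthetic dataset. The dataset, the choice of k = 3, and all variable names are assumptions made only for this example.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 300 unlabelled points scattered around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# The number of clusters (k) must be specified up front for partitioning clustering.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)      # a cluster ID for every data point

print(labels[:10])                  # cluster IDs of the first 10 points
print(kmeans.cluster_centers_)      # the 3 learned centroids

Each data point receives the ID of the centroid it is closest to, which matches the cluster-ID idea described earlier.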

Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped clusters can be formed as long as the dense regions can be connected. The algorithm identifies dense regions in the dataset and connects the areas of high density into clusters. The dense areas in data space are separated from each other by sparser areas.

These algorithms can have difficulty clustering the data points if the dataset has varying densities or high dimensionality.
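A minimal sketch of density-based clustering using DBSCAN from scikit-learn is shown below; the two-moons dataset and the eps and min_samples values are assumptions chosen for illustration.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrarily shaped clusters that centroid-based
# methods such as K-Means typically split incorrectly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# The label -1 marks points that fall in sparse regions and are treated as noise.
print(set(labels))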
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distribution, most commonly the Gaussian distribution.

An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
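The sketch below is a minimal example of distribution model-based clustering with a Gaussian Mixture Model from scikit-learn; the synthetic dataset and n_components=3 are assumptions for illustration.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# The EM algorithm fits 3 Gaussian components to the unlabelled data.
gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)        # most likely component for each point
soft_probs = gmm.predict_proba(X)   # probability of belonging to each component

print(hard_labels[:5])
print(soft_probs[:5].round(3))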
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. The desired number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical algorithm.
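A minimal sketch of bottom-up (agglomerative) hierarchical clustering using SciPy is shown below; the synthetic dataset, the Ward linkage, and the cut at 3 clusters are assumptions for illustration.

from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Build the full merge tree (dendrogram) bottom-up using Ward linkage.
Z = linkage(X, method="ward")

# "Cut the tree" into 3 clusters; no cluster count was needed to build
# the tree itself, only to choose where to cut it.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)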

Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients, which express its degree of membership in each cluster. The Fuzzy C-Means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy K-Means algorithm.
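The sketch below is a minimal NumPy illustration of the Fuzzy C-Means idea, in which every point receives a membership coefficient for every cluster. The dataset, the parameter values, and the variable names are assumptions; this is a rough sketch of the update rules, not a production implementation.

import numpy as np

rng = np.random.default_rng(0)
# 100 unlabelled points drawn around two centers.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

c, m, n_iter = 2, 2.0, 100                       # clusters, fuzzifier, iterations
U = rng.random((len(X), c))
U /= U.sum(axis=1, keepdims=True)                # memberships of each point sum to 1

for _ in range(n_iter):
    W = U ** m
    centers = (W.T @ X) / W.sum(axis=0)[:, None]       # membership-weighted centroids
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    U = 1.0 / (d ** (2 / (m - 1)))                      # closer centers get larger membership
    U /= U.sum(axis=1, keepdims=True)

print(U[:5].round(3))    # each row is a point's soft membership across the 2 clusters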

Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many different clustering algorithms have been published, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data we are using. For example, some algorithms need the number of clusters in the given dataset to be specified, whereas others work by finding the minimum distance between the observations of the dataset.

Here we discuss the most popular clustering algorithms that are widely used in machine learning (a short sketch after the list applies several of them to the same dataset):

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It groups the dataset by dividing the samples into clusters of roughly equal variance. The number of clusters must be specified for this algorithm. It is fast, requiring relatively few computations, with linear complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas within a smooth density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the center of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model, similar to mean-shift but with some notable advantages. In this algorithm, the areas of high density are separated by areas of low density, so clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where k-means fails. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is initially treated as a single cluster, and clusters are then successively merged. The cluster hierarchy can be represented as a tree structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not require the number of clusters to be specified. Here, each pair of data points exchanges messages until convergence. Its O(N²T) time complexity is the main drawback of this algorithm.
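As a rough comparison, the sketch below runs several of the algorithms above on one synthetic dataset; the blobs data and all parameter values are assumptions chosen for illustration. It also highlights that Mean-Shift, DBSCAN, and Affinity Propagation infer the number of clusters themselves, whereas K-Means, GMM, and the agglomerative algorithm need it specified.

from sklearn.cluster import (KMeans, MeanShift, DBSCAN,
                             AffinityPropagation, AgglomerativeClustering)
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=7)

models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=7),
    "Mean-Shift": MeanShift(),                      # no cluster count required
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "EM (GMM)": GaussianMixture(n_components=3, random_state=7),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
    "Affinity Propagation": AffinityPropagation(random_state=7),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    n_found = len(set(labels)) - (1 if -1 in set(labels) else 0)   # ignore DBSCAN's noise label
    print(f"{name:22s} -> {n_found} clusters found")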

Applications of Clustering
Below are some commonly known applications of the clustering technique in Machine Learning:

o In Identification of Cancer Cells: Clustering algorithms are widely used for the identification of cancerous cells. They divide the cancerous and non-cancerous data points into different groups.
o In Search Engines: Search engines also work on the clustering technique. Search results appear based on the objects closest to the search query, which is done by grouping similar data objects into one group that is far from the dissimilar objects. The accuracy of a query's results depends on the quality of the clustering algorithm used.
o Customer Segmentation: Clustering is used in market research to segment customers based on their choices and preferences.
o In Biology: It is used in the field of biology to classify different species of plants and animals using image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is best suited.
