Clustering in Machine Learning - Javatpoint
Clustering in Machine Learning - Javatpoint
It does it by finding some similar patterns in the unlabelled dataset such as shape, size, color,
behavior, etc., and divides them as per the presence and absence of those similar patterns.
After applying this clustering technique, each cluster or group is provided with a cluster-ID. ML
system can use this id to simplify the processing of large and complex datasets.
Note: Clustering is somewhere similar to the classification algorithm, but the difference is the
type of dataset that we are using. In classification, we work with the labeled data set, whereas
in clustering, we work with the unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of Mall: When we
visit any shopping mall, we can observe that the things with similar usage are grouped together.
Such as the t-shirts are grouped in one section, and trousers are at other sections, similarly, at
vegetable sections, apples, bananas, Mangoes, etc., are grouped in separate sections, so that we
can easily find out the things. The clustering technique also works in the same way. Other
examples of clustering are grouping documents according to the topic.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
Market Segmentation
Image segmentation
Anomaly detection, etc.
Apart from these general usages, it is used by the Amazon in its recommendation system to
provide the recommendations as per the past search of products. Netflix also uses this technique
to recommend the movies and web-series to its users as per the watch history.
The below diagram explains the working of the clustering algorithm. We can see the different
fruits are divided into several groups with similar properties.
The clustering methods are broadly divided into Hard clustering (datapoint belongs to only one
group) and Soft Clustering (data points can belong to another group also). But there are also
other various approaches of Clustering exist. Below are the main clustering methods used in
Machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the
centroid-based method. The most common example of partitioning clustering is the K-Means
Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used to define the number of
pre-defined groups. The cluster center is created in such a way that the distance between the data
points of one cluster is minimum as compared to another cluster centroid.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected. This
algorithm does it by identifying different clusters in the dataset and connects the areas of high
densities into clusters. The dense areas in data space are divided from each other by sparser
areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
In the distribution model-based clustering method, the data is divided based on the probability of
how a dataset belongs to a particular distribution. The grouping is done by assuming some
distributions commonly Gaussian Distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses
Gaussian Mixture Models (GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is no
requirement of pre-specifying the number of clusters to be created. In this technique, the dataset
is divided into clusters to create a tree-like structure, which is also called a dendrogram. The
observations or any number of clusters can be selected by cutting the tree at the correct level.
The most common example of this method is the Agglomerative Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one
group or cluster. Each dataset has a set of membership coefficients, which depend on the degree
of membership to be in a cluster. Fuzzy C-means algorithm is the example of this type of
clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
The Clustering algorithms can be divided based on their models that are explained above. There
are different types of clustering algorithms published, but only a few are commonly used. The
clustering algorithm is based on the kind of data that we are using. Such as, some algorithms
need to guess the number of clusters in the given dataset, whereas some are required to find the
minimum distance between the observation of the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in machine
learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of equal
variances. The number of clusters must be specified in this algorithm. It is fast with fewer
computations required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth
density of data points. It is an example of a centroid-based model, that works on updating
the candidates for centroid to be the center of the points within a given region.
6. Affinity Propagation: It is different from other clustering algorithms as it does not require
to specify the number of clusters. In this, each data point sends a message between the
pair of data points until convergence. It has O(N2T) time complexity, which is the main
drawback of this algorithm.
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets into
different groups.
In Search Engines: Search engines also work on the clustering technique. The search result
appears based on the closest object to the search query. It does it by grouping similar data
objects in one group that is far from the other dissimilar objects. The accurate result of a
query depends on the quality of the clustering algorithm used.
In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
In Land Use: The clustering technique is used in identifying the area of similar lands use in
the GIS database. This can be very useful to find that for what purpose the particular land
should be used, that means for which purpose it is more suitable.
← Prev
Next →
Youtube
For Videos Join Our Youtube Channel: Join Now
Feedback
Preparation
Company
Interview
Questions
Company
Questions
Trending Technologies
B.Tech / MCA