
MACHINE LEARNING

UNIT-4
Cluster analysis, also known as clustering, is a method of data mining that groups similar data points
together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the
data points within each group are more similar to each other than to data points in other groups. This
process is often used for exploratory data analysis and can help identify patterns or relationships
within the data that may not be immediately obvious. There are many different algorithms used for
cluster analysis, such as k-means, hierarchical clustering, and density-based clustering. The choice
of algorithm will depend on the specific requirements of the analysis and the nature of the data being
analyzed.

Cluster analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine learning technique that acts on unlabelled data. Data points that are similar to one another are grouped together so that all the objects in a group belong to the same cluster.

The given data is divided into different groups by combining similar objects; each such group is a cluster, i.e., a collection of similar data points grouped together.

For example, consider a dataset of vehicles that contains information about different vehicles such as cars, buses, and bicycles. Since this is unsupervised learning, there are no class labels like Car or Bike attached to the records; all the data is mixed together without any structure.

The task is to convert this unlabelled data into labelled data, and this can be done using clusters.

The main idea of cluster analysis is to arrange the data points into clusters, such as a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on.

In short, clustering is the partitioning of unlabelled data into groups of similar objects.

Properties of Clustering:

1. Clustering Scalability: Nowadays there is a vast amount of data, and clustering algorithms must be able to deal with huge databases. To handle such extensive databases, the clustering algorithm should be scalable; if it is not, the results on large datasets may be misleading or simply wrong.

2. High Dimensionality: The algorithm should be able to handle data in high-dimensional spaces, even when the number of data points is small.

3. Usability with multiple data types: Clustering algorithms should be capable of dealing with different kinds of data, such as discrete, categorical, interval-based (numeric), and binary data.

4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters. A clustering algorithm should therefore be able to handle unstructured data and give it structure by organising it into groups of similar data objects. This makes it easier for the data expert to process the data and discover new patterns.

5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. Interpretability reflects how easily the clustering results can be understood.

Clustering Methods:
The clustering methods can be classified into the following categories:

• Partitioning Method

• Hierarchical Method

• Density-based Method

• Grid-Based Method

• Model-Based Method

• Constraint-based Method
Partitioning Method: It is used to make partitions of the data in order to form clusters. If "n" partitions are made over the "p" objects of the database, then each partition is represented by a cluster and n ≤ p. The two conditions which need to be satisfied by this partitioning clustering method are:

• Each object must belong to exactly one group.

• Each group must contain at least one object.


In the partitioning method there is a technique called iterative relocation, in which objects are moved from one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is created. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed. There are two approaches for creating the hierarchical decomposition:

• Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, each object forms its own separate group. The method then keeps merging the objects or groups that are close to one another, i.e., that exhibit similar properties. This merging process continues until the termination condition holds.

• Divisive Approach: The divisive approach is also known as the top-down approach. In this approach, we start with all the data objects in a single cluster. This cluster is divided into smaller clusters by continuous iteration. The iteration continues until the termination condition is met or until each cluster contains only one object.

• Once a group is split or merged, the decision can never be undone; hierarchical clustering is therefore a rigid, not very flexible method. The two approaches which can be used to improve the quality of hierarchical clustering in data mining are:

• Carefully analyze the linkages between objects at every partitioning of the hierarchical clustering.

• Combine hierarchical agglomeration with other clustering techniques: first group the objects into micro-clusters, and then perform macro-clustering on the micro-clusters.

Density-Based Method: The density-based method mainly focuses on density. In this method, a given cluster keeps growing as long as the density in its neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.

Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells that form a grid structure. One of the major advantages of the grid-based method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space rather than on the number of data objects.

Model-Based Method: In the model-based method, a model is hypothesized for each cluster in order to find the data that best fits that model. The density function of the data is used to locate the clusters for a given model. This reflects the spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers and noise into account. It therefore yields robust clustering methods.

Constraint-Based Method: The constraint-based clustering method is performed by incorporating application- or user-oriented constraints. A constraint refers to the user's expectations or the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process, and they can be specified by the user or by the application requirements.

Overview of Some basic clustering methods:


Partitioning methods: Partitioning methods involve partitioning the data and clustering the
group of similar items. Common Algorithms used in this method are,

• K-Means

• K-Medoids

• K-Modes

1. K-Means Clustering

K-Means Clustering is a classical approach to Clustering. K-Means iteratively relocates the cluster
centers by computing the mean of a cluster.

• Initially, K-Means chooses k cluster centers randomly.

• The distance between each data point and the cluster centers is calculated (Euclidean distance is commonly used). Each data point is assigned to the cluster whose center is nearest to it.

• After all the data points have been assigned to a cluster, the algorithm computes the mean of each cluster's data points and relocates each cluster center to the corresponding mean.

• This process is repeated until the cluster centers no longer change.

The main advantage of K-Means is scalability: it performs well on huge datasets. Its main disadvantage is sensitivity to outliers, which have a severe impact when computing the means of the clusters. The clustering results also vary with the value of k and with the initial choice of cluster centers. K-Means works well only for roughly spherical clusters and fails to perform well on arbitrarily shaped data.
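
A minimal sketch of K-Means in practice, using scikit-learn on synthetic data (the dataset, the value k = 3, and the random seed are illustrative assumptions, not part of these notes):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 roughly spherical groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-Means with k = 3; several restarts (n_init) reduce sensitivity to the random initial centers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centers:\n", kmeans.cluster_centers_)
print("First 10 labels:", labels[:10])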

2. K-Modes Clustering

K-Means works well for continuous data, but what about categorical data? K-Modes clustering comes to the rescue. The algorithm is very similar to K-Means, but instead of calculating the mean of a cluster, K-Modes calculates its mode (the value that occurs most frequently).
• Initially, K-Modes chooses k cluster centers randomly.
• The similarity between each data point and the cluster centers is calculated. Each data point is assigned to the cluster with which it has the highest similarity.
• After all the data points have been assigned to a cluster, the algorithm computes the mode of each cluster's data points and relocates each cluster center to the corresponding mode.
• This process is repeated until the cluster centers no longer change.
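
A brief sketch of K-Modes on categorical data, assuming the third-party kmodes package is installed (it is not part of scikit-learn); the toy dataset and parameter values are illustrative:

import numpy as np
from kmodes.kmodes import KModes  # third-party package: pip install kmodes

# Toy categorical dataset: (colour, size, shape) for a handful of items (illustrative only)
data = np.array([
    ["red",   "small", "round"],
    ["red",   "small", "round"],
    ["blue",  "large", "square"],
    ["blue",  "large", "square"],
    ["green", "small", "round"],
    ["blue",  "large", "round"],
])

# Fit K-Modes with k = 2; dissimilarity is based on simple matching of categories
km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=42)
labels = km.fit_predict(data)

print("Cluster modes:\n", km.cluster_centroids_)
print("Labels:", labels)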
Density-based methods
1. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that performs well on data with arbitrarily shaped clusters. DBSCAN finds the data points that lie in dense regions and assigns them to clusters, leaving the less dense data points outside of any cluster as noise.

Terminologies of DBSCAN

• If a data point q is within the radius ϵ of another data point p, then q is considered to be in the ϵ-neighborhood (epsilon neighborhood) of p.

• A data point p is said to be a core object if its ϵ-neighborhood contains at least MinPts (Minimum Points) data points. For example, if MinPts = 5, then p is a core object if the ϵ-neighborhood of p contains at least 5 data points.

• A data point p is said to be directly density-reachable from a point q if p is within the ϵ-neighborhood of q and q is a core object.

How does it work? DBSCAN checks the ϵ-neighborhood of each data point. If a point p is a core object, i.e., the number of points in its ϵ-neighborhood is at least MinPts, a new cluster is formed around p. DBSCAN then iteratively collects the directly density-reachable data points, possibly merging a few neighbouring clusters along the way. The process continues until there are no more points left to analyze.

The disadvantage of using DBSCAN is the presence of the hyperparameters ϵ and MinPts: the results produced vary according to the values chosen for them.
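
A minimal sketch with scikit-learn's DBSCAN on synthetic non-spherical data (the moon-shaped dataset and the eps/min_samples values are illustrative assumptions):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrary-shaped clusters that K-Means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighborhood radius, min_samples plays the role of MinPts
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Label -1 marks points that DBSCAN treats as noise
print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points:", np.sum(labels == -1))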

Gaussian Mixture Model


Have you ever wondered how machine learning algorithms can effortlessly categorize complex data
into distinct groups?

Gaussian Mixture Models (GMMs) play a pivotal role in achieving this task.

Recognized as a robust statistical tool in machine learning and data science, GMMs excel in
estimating density and clustering data.

Gaussian Mixture Models Overview

Imagine blending multiple Gaussian distributions to form a single model. This is precisely what a
Gaussian Mixture Model does.

At its heart, GMM operates on the principle that a complex, multi-modal distribution can be
approximated by a combination of simpler Gaussian distributions, each representing a different
cluster within the data.

The essence of GMM lies in its ability to determine cluster characteristics such as mean, variance,
and weight.

The mean of each Gaussian component gives us a central point, around which the data points are
most densely clustered.

The variance, on the other hand, provides insight into the spread or dispersion of the data points
around this mean. A smaller variance indicates that the data points are closely clustered around the
mean, while a larger variance suggests a more spread-out cluster.

The weights in a GMM are particularly significant. They represent the proportion of the dataset that
belongs to each Gaussian component.

In a sense, these weights embody the strength or dominance of each cluster within the overall
mixture. Higher weights imply that a greater portion of the data aligns with that particular Gaussian
distribution, signifying its greater prominence in the model.

This triad of parameters — mean, variance, and weight — enables GMMs to model the data with
remarkable flexibility. By adjusting these parameters, a GMM can shape itself to fit a wide variety
of data distributions, whether they are tightly clustered, widely dispersed, or overlapping with one
another.

One of the most powerful aspects of GMMs is their capacity to compute the probability of each data
point belonging to a particular cluster.

This is achieved through a process known as ‘soft clustering’, as opposed to ‘hard clustering’
methods like K-Means.

In soft clustering, instead of forcefully assigning a data point to a single cluster, GMM assigns
probabilities that indicate the likelihood of that data point belonging to each of the Gaussian
components.

Algorithms

Model Representation

At its core, a GMM is a combination of several Gaussian components.

These components are defined by their mean vectors, covariance matrices, and weights, providing
a comprehensive representation of data distributions.

The probability density function of a GMM is a sum of its components, each weighted accordingly.

Notation:
• K: Number of Gaussian components
• N: Number of data points
• D: Dimensionality of the data

GMM Parameters:
• Means (μ): Center locations of Gaussian components.
• Covariance Matrices (Σ): Define the shape and spread of each component.
• Weights (π): Probability of selecting each component.
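
Putting these parameters together, the probability density function of a GMM (in standard notation) is the weighted sum of its Gaussian components:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,

where \mathcal{N}(x \mid \mu_k, \Sigma_k) is the D-dimensional Gaussian density with mean \mu_k and covariance matrix \Sigma_k.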
Model Training

Training a GMM involves estimating its parameters from the available data. The Expectation-Maximization (EM) algorithm is commonly employed, alternating between the Expectation (E) and Maximization (M) steps until convergence.

Expectation-Maximization

During the E step, the model calculates the probability of each data point belonging to each Gaussian
component. The M step then adjusts the model’s parameters based on these probabilities.

Clustering and Density Estimation

Post-training, GMMs cluster data points based on the highest posterior probability. They are also
used for density estimation, assessing the probability density at any point in the feature space.

Implementation of Gaussian Mixture Models

The code below generates some sample data from two different normal distributions and uses a Gaussian Mixture Model from Scikit-learn to fit this data.

It then predicts which cluster each data point belongs to and visualizes the data points with their
respective clusters.

The centers of the Gaussian components are marked with red ‘X’ symbols.

The resulting plot provides a visual representation of how the GMM has clustered the data.

After fitting the Gaussian Mixture Model to the data, a new data point at coordinates [2,2] is defined.

The predict_proba method of the GMM object is then used to calculate the probability of this new
data point belonging to each of the two clusters.

The resulting probabilities are printed, and the data points, Gaussian centers, and the new data point
are plotted for visualization.
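
The notes describe this implementation without the listing itself; the following is a minimal reconstruction with scikit-learn and matplotlib matching that description (the exact sample sizes, seed, and plotting details are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Sample data from two different normal distributions (illustrative parameters)
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.8, size=(200, 2)),
    rng.normal(loc=[4, 4], scale=1.0, size=(200, 2)),
])

# Fit a two-component Gaussian Mixture Model
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(data)

# Hard cluster assignments (highest posterior probability)
labels = gmm.predict(data)

# Soft clustering: probability of a new point [2, 2] belonging to each component
new_point = np.array([[2, 2]])
probs = gmm.predict_proba(new_point)
print("Probabilities for [2, 2]:", probs)

# Visualize the clustered data, the Gaussian centers (red 'X'), and the new point
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap="viridis", s=15)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c="red", marker="x", s=200)
plt.scatter(new_point[:, 0], new_point[:, 1], c="black", marker="o", s=80)
plt.title("Gaussian Mixture Model clustering")
plt.show()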
Use Cases of Gaussian Mixture Models
GMMs find application in a diverse range of fields:
• Anomaly Detection: Identifying unusual data patterns.
• Image Segmentation: Grouping pixels in images based on color or texture.
• Speech Recognition: Assisting in the recognition of phonemes in audio data.
• Handwriting Recognition: Simulating different handwriting styles.
• Customer Segmentation: Grouping customers with similar behaviors or preferences.
• Data Clustering: Finding natural groups in data.
• Computer Vision: Object detection and background removal.
• Bioinformatics: Analyzing gene expression data.
• Recommendation Systems: Personalizing user experiences.
• Medical Imaging: Tissue classification and abnormality detection.
• Finance: Asset price modeling and risk management.
Advantages and Disadvantages of Gaussian Mixture Models
Advantages
• Flexibility in Data Representation: GMMs adeptly represent complex data structures.
• Probabilistic Approach: They provide probabilities for cluster assignments, aiding in
uncertainty estimation.
• Soft Clustering: GMMs offer probabilistic cluster assignments, allowing for more nuanced
data analysis.
• Effective in Overlapping Clusters: They accurately model data with overlapping clusters.
• Density Estimation Capabilities: Useful in understanding the underlying distribution of
data.
• Handling Missing Data: GMMs can estimate parameters even with incomplete data sets.
• Outlier Detection: Identifying data points that do not conform to the general pattern.
• Scalability and Simplicity: Effective in handling large datasets and relatively easy to
implement.
• Interpretable Parameters: Provides meaningful insights into cluster characteristics.
Disadvantages
• Challenges in Determining Component Number: Misjudgment in component number can
lead to overfitting or underfitting.
• Initialization Sensitivity: The outcome is influenced by the initial parameter settings.
• Assumption of Gaussian Distribution: Not always applicable if data do not adhere to
Gaussian distributions.
• Curse of Dimensionality: High-dimensional data can complicate the model.
• Convergence Issues: Problems arise when dealing with singular covariance matrices.
• Resource Intensive for Large Datasets: Computing and memory requirements can be
substantial.

Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH):

Clustering algorithms like K-Means do not perform clustering very efficiently on large datasets, and it is difficult to process such datasets with a limited amount of resources (like memory or a slower CPU). Regular clustering algorithms therefore do not scale well in terms of running time and quality as the size of the dataset increases. This is where BIRCH clustering comes in. Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of the larger dataset. BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other clustering algorithm can then use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is any attribute whose values can be represented in Euclidean space, i.e., no categorical attributes should be present.

Before we implement BIRCH, we must understand two important terms: Clustering Feature (CF) and CF Tree.

Clustering Feature (CF): BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS), where 'N' is the number of data points in the cluster, 'LS' is the linear sum of the data points, and 'SS' is the squared sum of the data points in the cluster. It is possible for a CF entry to be composed of other CF entries.

CF Tree: The CF tree is the actual compact representation that we have been speaking of so far. A CF tree is a tree in which each leaf node contains a sub-cluster. Every entry in a CF tree contains a pointer to a child node and a CF entry made up of the sum of the CF entries in the child nodes. The size of the sub-clusters stored in the leaf nodes is limited by a threshold parameter, described below.

Parameters of the BIRCH algorithm:

• threshold: Limits how large a sub-cluster in a leaf node of the CF tree can grow; if merging a new point into the closest sub-cluster would exceed this limit, a new sub-cluster is started.

• branching_factor: The maximum number of CF sub-clusters in each (internal) node.

• n_clusters: The number of clusters to be returned after the entire BIRCH algorithm is complete, i.e., the number of clusters after the final clustering step. If set to None, the final clustering step is not performed and the intermediate clusters are returned.
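
A minimal sketch with scikit-learn's Birch (the synthetic data and parameter values are illustrative assumptions):

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic numeric (metric) data: BIRCH cannot handle categorical attributes
X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

# Build the CF tree summary and run the final clustering step into 4 clusters
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = birch.fit_predict(X)

print("Number of CF sub-clusters:", len(birch.subcluster_centers_))
print("First 10 labels:", labels[:10])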

Affinity Propagation
Affinity Propagation is based on the concept of “message-passing” between data points to identify
cluster centres and assign data points to these centres automatically. It utilizes “exemplars,” which
are typical data points representing other data points within the same cluster. The objective of the
algorithm is to identify the most representative exemplars of the overall data and employ them to
cluster the data into compatible groups. This algorithm is especially suitable for data with numerous
clusters or data exhibiting intricate, non-linear distribution patterns.

The core of the algorithm is represented by three matrices which are Similarity Matrix(S),
Responsibility Matrix (R) and Availability Matrix (A).

1. Similarity Matrix (S): To determine the similarity between data points, a 'similarity score' is calculated from their features. The similarity score is computed as the negative squared distance between the data points: take the distance between the points, square it, and make the result negative. The resulting matrix of similarity scores for every pair of data points is called the similarity matrix (S).

2. Responsibility Matrix (R): The matrix used to represent the suitability of one data point to be the exemplar (cluster centre) for another. R(i, k) is a value that indicates how well data point 'k' is suited to serve as the exemplar for data point 'i'.

3. Availability Matrix (A): The availability matrix is used to represent the “availability” of
each data point to serve as an exemplar for other data points. The calculation reflects the
competition between different data points to serve as exemplars and helps in determining
the most suitable exemplars for the clusters.
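
A minimal sketch with scikit-learn's AffinityPropagation (the dataset and the damping value are illustrative assumptions):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic data; note that the number of clusters is NOT specified in advance
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Affinity propagation exchanges responsibility/availability messages until exemplars emerge
ap = AffinityPropagation(damping=0.9, random_state=42)
labels = ap.fit_predict(X)

print("Exemplars (cluster centres):\n", ap.cluster_centers_)
print("Number of clusters found:", len(ap.cluster_centers_indices_))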

Mean-Shift Clustering
Mean Shift is a clustering algorithm, falling under unsupervised learning, that assigns data points to clusters iteratively by shifting points towards the mode (in the context of Mean Shift, the mode is the region of highest density of data points). For this reason it is also known as the mode-seeking algorithm. The mean-shift algorithm has applications in image processing and computer vision.
Given a set of data points, the algorithm iteratively shifts each data point towards the nearby region where most of the surrounding points lie. In each iteration every data point therefore moves a little closer to the densest nearby region, which is, or will lead to, the cluster center. When the algorithm stops, each point is assigned to a cluster.
Unlike the popular K-Means algorithm, mean shift does not require the number of clusters to be specified in advance; the number of clusters is determined by the algorithm from the data.
Note: The downside to Mean Shift is that it is computationally expensive O(n²).
Mean-shift clustering is a non-parametric, density-based clustering algorithm that can be used to
identify clusters in a dataset. It is particularly useful for datasets where the clusters have arbitrary
shapes and are not well-separated by linear boundaries.
The basic idea behind mean-shift clustering is to shift each data point towards the mode (i.e., the
highest density) of the distribution of points within a certain radius. The algorithm iteratively
performs these shifts until the points converge to a local maximum of the density function. These
local maxima represent the clusters in the data.
The mean-shift clustering algorithm can be summarized as follows:
1. Initialize every data point as a candidate cluster centroid.
2. Repeat the following steps until convergence or until a maximum number of iterations is reached:
• For each data point, calculate the mean of all points within a certain radius (the "kernel") centered at the data point.
• Shift the data point to that mean.
3. Identify the cluster centroids as the points that no longer move after convergence.
4. Return the final cluster centroids and the assignments of data points to clusters.
One of the main advantages of mean-shift clustering is that it does not require the number of
clusters to be specified beforehand. It also does not make any assumptions about the distribution
of the data, and can handle arbitrary shapes and sizes of clusters. However, it can be sensitive to
the choice of kernel and the radius of the kernel.
Mean-Shift clustering can be applied to various types of data, including image and video
processing, object tracking and bioinformatics.
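
A minimal sketch with scikit-learn's MeanShift; the bandwidth (kernel radius) is estimated from the data here, and the dataset is an illustrative assumption:

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic data; the number of clusters is not specified anywhere below
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=42)

# Estimate the kernel radius (bandwidth) from the data, then run mean shift
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)

print("Estimated bandwidth:", round(bandwidth, 3))
print("Cluster centers found:\n", ms.cluster_centers_)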

OPTICS (Ordering Points To Identify the Clustering Structure)

OPTICS is a density-based clustering algorithm, similar to DBSCAN (Density-Based Spatial Clustering of Applications with Noise), but it can extract clusters of varying densities and shapes. It is useful for identifying clusters of different densities in large, high-dimensional datasets.
The main idea behind OPTICS is to extract the clustering structure of a dataset by identifying density-connected points. The algorithm builds a density-based representation of the data by creating an ordered list of points called the reachability plot. Each point in the list is associated with a reachability distance, which is a measure of how easy it is to reach that point from other points in the dataset. Points with similar reachability distances are likely to be in the same cluster.
The OPTICS algorithm follows these main steps:
1. Define a density threshold parameter, Eps, which controls the minimum density of clusters.
2. For each point in the dataset, calculate the distance to its k-nearest neighbours.
3. Starting with an arbitrary point, calculate the reachability distance of each point in the dataset, based on the density of its neighbours.
4. Order the points based on their reachability distance and create the reachability plot.
5. Extract clusters from the reachability plot by grouping points that are close to each other and have similar reachability distances.
One of the main advantages of OPTICS over DBSCAN is that it does not require a single global density threshold (or a number of clusters) to be fixed in advance; instead, it extracts the clustering structure of the data and produces the reachability plot. This gives the user more flexibility in selecting the number of clusters, by cutting the reachability plot at a chosen level.
Also, unlike other density-based clustering algorithms such as DBSCAN, OPTICS can handle clusters of different densities and shapes and can identify hierarchical structure.
OPTICS Clustering v/s DBSCAN Clustering:
1. Memory Cost : The OPTICS clustering technique requires more memory as it maintains
a priority queue (Min Heap) to determine the next data point which is closest to the point
currently being processed in terms of Reachability Distance. It also requires more
computational power because the nearest neighbour queries are more complicated than
radius queries in DBSCAN.
2. Fewer Parameters : The OPTICS clustering technique does not strictly require the epsilon parameter; it is optional and is mainly used to reduce the running time. This reduces the analytical effort spent on parameter tuning.
3. OPTICS does not itself segregate the given data into clusters. It merely produces a reachability-distance plot, and it is up to the programmer's interpretation of that plot to cluster the points accordingly.
4. Handling varying densities: DBSCAN clustering can struggle to handle datasets with
varying densities, as it requires a single value of epsilon to define the neighborhood size
for all points. In contrast, OPTICS can handle varying densities by using the concept of
reachability distance, which adapts to the local density of the data. This means that OPTICS
can identify clusters of different sizes and shapes more effectively than DBSCAN in
datasets with varying densities.
5. Cluster extraction: While both OPTICS and DBSCAN can identify clusters, OPTICS
produces a reachability distance plot that can be used to extract clusters at different levels
of granularity. This allows for more flexible clustering and can reveal clusters that may not
be apparent with a fixed epsilon value in DBSCAN. However, this also requires more
manual interpretation and decision-making on the part of the programmer.
6. Noise handling: DBSCAN explicitly distinguishes between core points, boundary points, and
noise points, while OPTICS does not explicitly identify noise points. Instead, points with
high reachability distances can be considered as potential noise points. However, this also
means that OPTICS may be less effective at identifying small clusters that are surrounded by
noise points, as these clusters may be merged with the noise points in the reachability distance
plot.
7. Runtime complexity: The runtime complexity of OPTICS is generally higher than that of
DBSCAN, due to the use of a priority queue to maintain the reachability distances. However,
recent research has proposed optimizations to reduce the computational complexity of
OPTICS, making it more scalable for large datasets.
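
A minimal sketch with scikit-learn's OPTICS, including the reachability plot the comparison above refers to (the dataset and min_samples value are illustrative assumptions):

import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Synthetic data with clusters of different densities (illustrative only)
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.4, 1.0, 1.8], random_state=42)

# No epsilon is required; min_samples plays a role similar to MinPts in DBSCAN
optics = OPTICS(min_samples=10)
labels = optics.fit_predict(X)

# The reachability plot: points in cluster order vs. their reachability distance
plt.plot(optics.reachability_[optics.ordering_])
plt.xlabel("Points (cluster order)")
plt.ylabel("Reachability distance")
plt.title("OPTICS reachability plot")
plt.show()

print("Labels of first 10 points:", labels[:10])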

Connectivity-based clustering algorithms (also called hierarchical clustering)

Connectivity-based clustering is based on the core idea that similar objects lie near each other in the data space while dissimilar objects lie far away. It uses distance functions to find nearby data points and groups those data points together as clusters.
There are two major types of approaches:
• Agglomerative clustering: Start with each data point in its own cluster and then aggregate (merge) the clusters as the distance between them decreases.
• Divisive clustering: Start with all the data points combined in a single cluster and divide them as the distance between them increases.

Agglomerative Clustering
Agglomerative clustering is a bottom-up approach. It starts by treating each individual data point as a single cluster; clusters are then merged continuously based on similarity until one big cluster containing all objects is formed. It is good at identifying small clusters.
The steps for agglomerative clustering are as follows:
1. Compute the proximity matrix using a distance metric.
2. Use a linkage function to group objects into a hierarchical cluster tree based on the
computed distance matrix from the above step.
3. Data points with close proximity are merged together to form a cluster.
4. Repeat steps 2 and 3 until a single cluster remains.
Pictorially, for six data points 1, 2, ..., 6, the steps proceed as follows:

• Initially, each of the data points 1, 2, ..., 6 is assigned to its own individual cluster.
• After calculating the proximity matrix, based on the similarity the points 2,3 and 4,5 are
merged together to form clusters.
• Again, the proximity matrix is computed and clusters with points 4,5 and 6 are merged
together.
• And again, the proximity matrix is computed, then the clusters with points 4,5,6 and 2,3
are merged together to form a cluster.
• As a final step, the remaining clusters are merged together to form a single cluster.
Proximity Matrix and Linkage
The proximity matrix is a matrix consisting of the distance between each pair of data points.
The distance is computed by a distance function. Euclidean distance is one of the most
commonly used distance functions.

For n points x1, x2, ..., xn, the proximity matrix has entries d(xi, xj), representing the distance between points xi and xj.
In order to group the data points in a cluster, a linkage function is used where the values in the
proximity matrix are taken and the data points are grouped based on similarity. The newly
formed clusters are linked to each other until they form a single cluster containing all the data
points.
The most common linkage methods are as follows:
• Complete linkage: The maximum of all pairwise distance between elements in each pair of
clusters is used to measure the distance between two clusters.
• Single linkage: The minimum of all pairwise distance between elements in each pair of
clusters is used to measure the distance between two clusters.
• Average linkage: The average of all pairwise distances between elements in each pair of
clusters is used to measure the distance between two clusters.
• Centroid linkage: The distance between the two clusters' centroids is used to measure the distance between the clusters before merging.
• Ward’s Method: It uses squared error to compute the similarity of the two clusters for
merging.
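
A minimal sketch of agglomerative clustering with SciPy, showing the proximity computation, a choice of linkage, and the resulting dendrogram (the toy data and the choice of Ward's method are illustrative assumptions; 'single', 'complete', 'average', or 'centroid' could be used instead):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six 2-D points, loosely mirroring the 1..6 example above (values are illustrative)
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.4, 1.1],
              [5.0, 5.0], [5.2, 4.8], [6.5, 5.5]])

# Compute the hierarchical merge tree from pairwise Euclidean distances
Z = linkage(X, method="ward", metric="euclidean")

# Cut the tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)

# Visualize the merge order as a dendrogram
dendrogram(Z, labels=[1, 2, 3, 4, 5, 6])
plt.title("Agglomerative clustering dendrogram")
plt.show()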

Divisive Clustering
Divisive clustering works in just the opposite way to agglomerative clustering. It starts by considering all the data points as one big cluster and then keeps splitting them into smaller clusters until every data point is in its own cluster. It is therefore good at identifying large clusters. It follows a top-down approach and can be more efficient than agglomerative clustering, but, due to its implementation complexity, it has no predefined implementation in most major machine learning frameworks.
Steps in Divisive Clustering
1. Consider all the data points as a single cluster.
2. Split it into clusters using any flat-clustering method, say K-Means.
3. Choose the best cluster among the current clusters to split further, e.g., the one that has the largest Sum of Squared Error (SSE).
4. Repeat steps 2 and 3 until each data point is in its own cluster (or the desired number of clusters is reached).

Pictorially, the splitting proceeds as follows:

• The data points 1, 2, ..., 6 all start in one large cluster.
• After calculating the proximity matrix, the points are split into separate clusters based on their dissimilarity.
• The proximity matrix is computed again and the splitting continues until each point is assigned to an individual cluster.
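
Since no ready-made divisive implementation is assumed here, the following is a small illustrative sketch of the steps above, repeatedly bisecting the cluster with the largest SSE using K-Means (the function name, data, and parameter choices are illustrative assumptions, not a standard library API):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def bisecting_kmeans(X, n_clusters, random_state=42):
    """Divisive (top-down) sketch: repeatedly split the cluster with the largest SSE."""
    clusters = [np.arange(len(X))]          # start with one cluster holding every point
    while len(clusters) < n_clusters:
        # Pick the cluster with the largest sum of squared error around its centroid
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        worst = clusters.pop(int(np.argmax(sse)))
        # Split it in two with flat K-Means
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X[worst])
        clusters.append(worst[km.labels_ == 0])
        clusters.append(worst[km.labels_ == 1])
    # Convert the list of index arrays into per-point labels
    labels = np.empty(len(X), dtype=int)
    for cluster_id, idx in enumerate(clusters):
        labels[idx] = cluster_id
    return labels

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("First 10 labels:", bisecting_kmeans(X, n_clusters=4)[:10])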

Hierarchical clustering can be used for several applications, ranging from customer segmentation to object
recognition.
