ML Unit 4 Notes - NJ
UNIT-4
Cluster analysis, also known as clustering, is a method of data mining that groups similar data points
together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the
data points within each group are more similar to each other than to data points in other groups. This
process is often used for exploratory data analysis and can help identify patterns or relationships
within the data that may not be immediately obvious. There are many different algorithms used for
cluster analysis, such as k-means, hierarchical clustering, and density-based clustering. The choice
of algorithm will depend on the specific requirements of the analysis and the nature of the data being
analyzed.
Cluster Analysis is the process of finding similar groups of objects in order to form clusters. It is an unsupervised machine learning technique that acts on unlabelled data. A group of data points comes together to form a cluster in which all the objects belong to the same group. The given data is divided into different groups by combining similar objects into a group. This group is nothing but a cluster: a collection of similar data objects grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as cars, buses, bicycles, etc. Because this is unsupervised learning, there are no class labels such as Car or Bike; all the data is mixed together and unstructured. The task is to convert this unlabelled data into labelled data, and this can be done using clusters. The main idea of cluster analysis is to arrange the data points into clusters, e.g., a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on. Simply put, it is the partitioning of similar objects applied to unlabelled data.
Properties of Clustering :
1. Clustering Scalability: Nowadays there is a vast amount of data, and applications have to deal with huge databases. In order to handle extensive databases, the clustering algorithm should be scalable; an algorithm that cannot scale to large data may produce inappropriate or misleading results.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data as well as low-dimensional, small-sized data.
3. Algorithm Usability with multiple data kinds: A clustering algorithm should be capable of dealing with different types of data, such as discrete, categorical, interval-based and binary data.
4. Dealing with unstructured data: There would be some databases that contain missing values,
and noisy or erroneous data. If the algorithms are sensitive to such data then it may lead to poor
quality clusters. So it should be able to handle unstructured data and give some structure to the data
by organising it into groups of similar data objects. This makes the job of the data expert easier in
order to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable.
The interpretability reflects how easily the data is understood.
Clustering Methods:
The clustering methods can be classified into the following categories:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method: It is used to partition the data in order to form clusters. If “n” partitions are made on “p” objects of the database, then each partition is represented by a cluster and n ≤ p. The two conditions which need to be satisfied by this partitioning method are: (i) each partition (cluster) must contain at least one object, and (ii) each object must belong to exactly one partition.
Hierarchical Method: This method builds a hierarchy of clusters, either bottom-up by merging clusters (agglomerative approach) or top-down by splitting them (divisive approach).
• Divisive Approach: The divisive approach is also known as the top-down approach. In this approach, we start with all the data objects in a single cluster, which is divided into smaller clusters by continuous iteration. The iteration continues until a termination condition is met or until each cluster contains only one object.
• Once a group is split or merged, the operation can never be undone; hierarchical clustering is therefore a rigid and not very flexible method. The two approaches which can be used to improve the quality of hierarchical clustering in data mining are:
• One should carefully analyse the object linkages at every partitioning of the hierarchical clustering.
• One can integrate hierarchical agglomeration with another clustering approach: first the objects are grouped into micro-clusters, and then macro-clustering is performed on the micro-clusters.
Density-Based Method: The density-based method mainly focuses on density. In this method, a given cluster keeps growing as long as the density in its neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells that form a grid structure. A major advantage of the grid-based method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space rather than on the number of data objects.
Model-Based Method: In the model-based method, a model is hypothesized for each cluster in order to find the data that best fits that model. The clustering of the density function is used to locate the clusters for a given model. It reflects the spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers and noise into account. Therefore, it yields robust clustering methods.
Common partitioning-based clustering algorithms include:
• K-Means
• K-Medoids
• K-Modes
1. K-Means Clustering
K-Means Clustering is a classical approach to Clustering. K-Means iteratively relocates the cluster
centers by computing the mean of a cluster.
• The distance between each data point and the cluster centers is calculated (Euclidean distance is commonly used), and each data point is assigned to the cluster whose center is closest to it.
• After all the data points are assigned to a cluster, the algorithm computes the mean of each cluster’s data points and relocates the cluster centers to the corresponding means. This process is repeated until the cluster centers no longer change.
The advantage of using K-Means is scalability: K-Means performs well on huge data. Its main disadvantage is sensitivity to outliers, which have a severe impact when computing the mean of a cluster. The clustering results also vary with the value of k and the initial choice of cluster centers. K-Means works well only for roughly spherical clusters and fails to perform well on arbitrary shapes of data.
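As a quick illustration of the steps above, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the blob dataset, k = 3 and the random seeds are illustrative assumptions, not part of these notes.

# Minimal K-Means sketch (illustrative data and parameters)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Toy data with 3 roughly spherical groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_clusters = k; several restarts (n_init) reduce sensitivity to the initial centers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)      # assign each point to its closest center
print(kmeans.cluster_centers_)      # final (relocated) cluster means
print(kmeans.inertia_)              # within-cluster sum of squared distances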
2. K-Modes Clustering
K-Means works well for continuous data. What about categorical data? K-Modes Clustering comes
to the rescue. The algorithm is very similar to K-Means, but instead of calculating the mean of a cluster, K-Modes calculates its mode (the most frequently occurring value), as sketched after the steps below.
• Initially, K-Modes chooses k cluster centers randomly.
• The similarity between each data point and the cluster centers is calculated, and each data point is assigned to the cluster with which it has the highest similarity.
• After all the data points are assigned to a cluster, the algorithm computes the mode of each cluster’s data points and relocates the cluster centers to the corresponding modes.
• This process is continued until the cluster centers no longer change.
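The sketch below runs K-Modes on a toy categorical dataset. It assumes the third-party kmodes package (not part of scikit-learn) is installed; the data and parameter values are illustrative.

# K-Modes sketch, assuming `pip install kmodes` (third-party package)
import numpy as np
from kmodes.kmodes import KModes

# Toy categorical dataset: (colour, size, shape)
X = np.array([
    ["red",   "small", "round"],
    ["red",   "small", "square"],
    ["blue",  "large", "round"],
    ["blue",  "large", "square"],
    ["green", "small", "round"],
])

km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=42)
labels = km.fit_predict(X)      # assign each row to the cluster with the most similar mode
print(km.cluster_centroids_)    # the modes (most frequent category per attribute) of each cluster
print(labels)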
Density-based methods
1. DBSCAN
Terminologies of DBSCAN
• If a data point q is within the radius ϵ of another data point p then the data point q is
considered to be ϵ-neighborhood (epsilon neighborhood) of the data point p.
• A data point p is said to be a core object if its ϵ-neighborhood contains at least MinPts (Minimum Points) data points. For example, if MinPts = 5, p is said to be a core object if the ϵ-neighborhood of p contains at least 5 data points.
• A data point p is said to be directly density-reachable from a point q if p is within the ϵ-neighborhood of q and q is a core object.
How does it work? DBSCAN checks the ϵ-neighborhood of each data point. If p is a core object, i.e., the number of data points in its ϵ-neighborhood is at least MinPts, a new cluster is formed around the core object p. DBSCAN then collects directly density-reachable data points iteratively and may merge a few neighbouring clusters in the process. The process continues until there are no more points left to analyse.
The disadvantage of using DBSCAN is the presence of hyperparameters ϵ and MinPts. The results
produced vary according to the values opted for hyperparameters.
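The following minimal sketch runs DBSCAN with scikit-learn; the eps (ϵ) and min_samples (MinPts) values and the two-moons toy data are illustrative assumptions.

# DBSCAN sketch (illustrative hyperparameters)
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Non-spherical toy data where K-Means typically struggles
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5)   # eps = epsilon radius, min_samples = MinPts
labels = db.fit_predict(X)            # label -1 marks noise points
print(set(labels))                    # discovered clusters plus -1 for noise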
Gaussian Mixture Models (GMM)
Gaussian Mixture Models (GMMs) play a pivotal role in density estimation and clustering. Recognized as a robust statistical tool in machine learning and data science, GMMs excel in estimating density and clustering data.
Imagine blending multiple Gaussian distributions to form a single model. This is precisely what a
Gaussian Mixture Model does.
At its heart, GMM operates on the principle that a complex, multi-modal distribution can be
approximated by a combination of simpler Gaussian distributions, each representing a different
cluster within the data.
The essence of GMM lies in its ability to determine cluster characteristics such as mean, variance,
and weight.
The mean of each Gaussian component gives us a central point, around which the data points are
most densely clustered.
The variance, on the other hand, provides insight into the spread or dispersion of the data points
around this mean. A smaller variance indicates that the data points are closely clustered around the
mean, while a larger variance suggests a more spread-out cluster.
The weights in a GMM are particularly significant. They represent the proportion of the dataset that
belongs to each Gaussian component.
In a sense, these weights embody the strength or dominance of each cluster within the overall
mixture. Higher weights imply that a greater portion of the data aligns with that particular Gaussian
distribution, signifying its greater prominence in the model.
This triad of parameters — mean, variance, and weight — enables GMMs to model the data with
remarkable flexibility. By adjusting these parameters, a GMM can shape itself to fit a wide variety
of data distributions, whether they are tightly clustered, widely dispersed, or overlapping with one
another.
One of the most powerful aspects of GMMs is their capacity to compute the probability of each data
point belonging to a particular cluster.
This is achieved through a process known as ‘soft clustering’, as opposed to ‘hard clustering’
methods like K-Means.
In soft clustering, instead of forcefully assigning a data point to a single cluster, GMM assigns
probabilities that indicate the likelihood of that data point belonging to each of the Gaussian
components.
Algorithms
Model Representation
A GMM consists of several Gaussian components, each defined by its mean vector, covariance matrix, and weight, providing a comprehensive representation of the data distribution. The probability density function of a GMM is the sum of its components, each weighted accordingly.
Notation:
• K: Number of Gaussian components
• N: Number of data points
• D: Dimensionality of the data
GMM Parameters:
• Means (μ): Center locations of Gaussian components.
• Covariance Matrices (Σ): Define the shape and spread of each component.
• Weights (π): Probability of selecting each component.
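Using the notation above, the mixture density can be written in the standard form (a standard formulation, not copied from these notes):

p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0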
Model Training
Training a GMM involves estimating its parameters from the available data. The Expectation-Maximization (EM) technique is often employed, alternating between the Expectation (E) and Maximization (M) steps until convergence.
Expectation-Maximization
During the E step, the model calculates the probability of each data point belonging to each Gaussian
component. The M step then adjusts the model’s parameters based on these probabilities.
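In standard notation (assumed here, not taken verbatim from these notes), with \gamma_{nk} denoting the responsibility of component k for data point x_n and N_k = \sum_{n=1}^{N} \gamma_{nk}:

E-step: \gamma_{nk} = \dfrac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}

M-step: \mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^{\top}, \qquad \pi_k = \frac{N_k}{N}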
Post-training, GMMs cluster data points based on the highest posterior probability. They are also
used for density estimation, assessing the probability density at any point in the feature space.
The code sketch below generates some sample data from two different normal distributions and uses a Gaussian Mixture Model from Scikit-learn to fit this data. It then predicts which cluster each data point belongs to and visualizes the data points with their respective clusters. The centers of the Gaussian components are marked with red ‘X’ symbols. The resulting plot provides a visual representation of how the GMM has clustered the data.
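A sketch consistent with this description is given below; the sample sizes, means and random seeds are illustrative assumptions.

# GMM sketch: two normal distributions, 2-component GaussianMixture, red 'X' at the component means
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Sample data from two different 2-D normal distributions
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.8, size=(200, 2)),
    rng.normal(loc=[4, 4], scale=0.8, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
labels = gmm.predict(X)   # hard assignment: component with the highest posterior probability

plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c="red", marker="X", s=200)
plt.title("GMM clustering of the sample data")
plt.show()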
After fitting the Gaussian Mixture Model to the data, a new data point at coordinates [2,2] is defined.
The predict_proba method of the GMM object is then used to calculate the probability of this new
data point belonging to each of the two clusters.
The resulting probabilities are printed, and the data points, Gaussian centers, and the new data point
are plotted for visualization.
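Continuing the sketch above, the soft-clustering query for the new point could look as follows (the exact probabilities depend on the fitted model):

# Probability of the new point [2, 2] belonging to each Gaussian component
new_point = np.array([[2, 2]])
probs = gmm.predict_proba(new_point)   # soft clustering: one probability per component
print(probs)

plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(gmm.means_[:, 0], gmm.means_[:, 1], c="red", marker="X", s=200)
plt.scatter(new_point[:, 0], new_point[:, 1], c="black", marker="*", s=200)
plt.show()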
Use Cases of Gaussian Mixture Models
GMMs find application in a diverse range of fields:
• Anomaly Detection: Identifying unusual data patterns.
• Image Segmentation: Grouping pixels in images based on color or texture.
• Speech Recognition: Assisting in the recognition of phonemes in audio data.
• Handwriting Recognition: Simulating different handwriting styles.
• Customer Segmentation: Grouping customers with similar behaviors or preferences.
• Data Clustering: Finding natural groups in data.
• Computer Vision: Object detection and background removal.
• Bioinformatics: Analyzing gene expression data.
• Recommendation Systems: Personalizing user experiences.
• Medical Imaging: Tissue classification and abnormality detection.
• Finance: Asset price modeling and risk management.
Advantages and Disadvantages of Gaussian Mixture Models
Advantages
• Flexibility in Data Representation: GMMs adeptly represent complex data structures.
• Probabilistic Approach: They provide probabilities for cluster assignments, aiding in
uncertainty estimation.
• Soft Clustering: GMMs offer probabilistic cluster assignments, allowing for more nuanced
data analysis.
• Effective in Overlapping Clusters: They accurately model data with overlapping clusters.
• Density Estimation Capabilities: Useful in understanding the underlying distribution of
data.
• Handling Missing Data: GMMs can estimate parameters even with incomplete data sets.
• Outlier Detection: Identifying data points that do not conform to the general pattern.
• Scalability and Simplicity: Effective in handling large datasets and relatively easy to
implement.
• Interpretable Parameters: Provides meaningful insights into cluster characteristics.
Disadvantages
• Challenges in Determining Component Number: Misjudgment in component number can
lead to overfitting or underfitting.
• Initialization Sensitivity: The outcome is influenced by the initial parameter settings.
• Assumption of Gaussian Distribution: Not always applicable if data do not adhere to
Gaussian distributions.
• Curse of Dimensionality: High-dimensional data can complicate the model.
• Convergence Issues: Problems arise when dealing with singular covariance matrices.
• Resource Intensive for Large Datasets: Computing and memory requirements can be
substantial.
BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) incrementally builds a Clustering Feature (CF) tree, a compact summary of the data, and then clusters the leaf entries of that tree. Its main parameters, used in the sketch that follows, are:
• threshold : the maximum radius that a sub-cluster in a leaf node of the CF tree may have; a new point starts a new sub-cluster if merging it would exceed this radius.
• n_clusters : the number of clusters to be returned after the entire BIRCH algorithm is complete, i.e., the number of clusters after the final clustering step. If set to None, the final clustering step is not performed and the intermediate clusters are returned.
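A minimal BIRCH sketch with scikit-learn; the threshold, branching_factor and n_clusters values below are illustrative assumptions.

# BIRCH sketch (illustrative parameter values)
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)          # build the CF tree, then run the final clustering step
print(len(birch.subcluster_centers_))  # sub-clusters stored in the leaves of the CF tree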
Affinity Propagation
Affinity Propagation is based on the concept of “message-passing” between data points to identify
cluster centres and assign data points to these centres automatically. It utilizes “exemplars,” which
are typical data points representing other data points within the same cluster. The objective of the
algorithm is to identify the most representative exemplars of the overall data and employ them to
cluster the data into compatible groups. This algorithm is especially suitable for data with numerous
clusters or data exhibiting intricate, non-linear distribution patterns.
The core of the algorithm is represented by three matrices: the Similarity Matrix (S), the Responsibility Matrix (R) and the Availability Matrix (A).
1. Similarity Matrix (S): The matrix that stores how similar every pair of data points is. S(i, k) indicates how well-suited data point ‘k’ is to be the exemplar for data point ‘i’, and is often taken as the negative squared Euclidean distance between them.
2. Responsibility Matrix (R): The matrix used to represent the suitability of one data point to be the exemplar (cluster centre) for another data point. R(i, k) is a value that indicates how well-suited data point ‘k’ is to be the exemplar for data point ‘i’.
3. Availability Matrix (A): The availability matrix is used to represent the “availability” of
each data point to serve as an exemplar for other data points. The calculation reflects the
competition between different data points to serve as exemplars and helps in determining
the most suitable exemplars for the clusters.
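The sketch below runs scikit-learn's AffinityPropagation on toy data; the damping value and the blob data are illustrative assumptions.

# Affinity Propagation sketch (illustrative data and damping)
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation

X, _ = make_blobs(n_samples=150, centers=4, random_state=42)

ap = AffinityPropagation(damping=0.9, random_state=42)
labels = ap.fit_predict(X)              # the number of clusters is found automatically
print(ap.cluster_centers_indices_)      # indices of the exemplar points
print(len(set(labels)), "clusters found")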
Mean-Shift Clustering
Mean shift is a clustering algorithm used in unsupervised learning that assigns data points to clusters iteratively by shifting points towards the mode (in the context of mean shift, the mode is the region of highest density of data points). For this reason, it is also known as the mode-seeking algorithm. The mean-shift algorithm has applications in the fields of image processing and computer vision.
Given a set of data points, the algorithm iteratively shifts each data point towards the nearest region of high density; the direction of each shift is determined by where most of the nearby points lie. In each iteration, every data point therefore moves closer to where most of the points are, which is, or will become, the cluster center. When the algorithm stops, each point is assigned to a cluster.
Unlike the popular K-Means cluster algorithm, mean-shift does not require specifying the number
of clusters in advance. The number of clusters is determined by the algorithm with respect to the
data.
Note: The downside to Mean Shift is that it is computationally expensive O(n²).
Mean-shift clustering is a non-parametric, density-based clustering algorithm that can be used to
identify clusters in a dataset. It is particularly useful for datasets where the clusters have arbitrary
shapes and are not well-separated by linear boundaries.
The basic idea behind mean-shift clustering is to shift each data point towards the mode (i.e., the
highest density) of the distribution of points within a certain radius. The algorithm iteratively
performs these shifts until the points converge to a local maximum of the density function. These
local maxima represent the clusters in the data.
The mean-shift clustering algorithm can be summarized as follows:
1. Initialize the data points as cluster centroids.
2. Repeat the following steps until convergence or a maximum number of iterations is reached:
• For each data point, calculate the mean of all points within a certain radius (i.e., the “kernel”) centered at the data point.
• Shift the data point to that mean.
3. Identify the cluster centroids as the points that no longer move after convergence.
4. Return the final cluster centroids and the assignments of data points to clusters.
One of the main advantages of mean-shift clustering is that it does not require the number of
clusters to be specified beforehand. It also does not make any assumptions about the distribution
of the data, and can handle arbitrary shapes and sizes of clusters. However, it can be sensitive to
the choice of kernel and the radius of the kernel.
Mean-Shift clustering can be applied to various types of data, including image and video
processing, object tracking and bioinformatics.
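A minimal mean-shift sketch with scikit-learn follows; estimating the bandwidth (kernel radius) from the data is an illustrative choice.

# Mean-shift sketch (bandwidth estimated from the data)
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=42)

bandwidth = estimate_bandwidth(X, quantile=0.2)   # radius of the kernel
ms = MeanShift(bandwidth=bandwidth)
labels = ms.fit_predict(X)                        # the number of clusters is not specified in advance
print(ms.cluster_centers_)                        # modes found by the algorithm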
Agglomerative Clustering
Agglomerative clustering is a bottom-up approach. It starts by treating each individual data point as a single cluster; clusters are then merged continuously based on similarity until one big cluster containing all objects is formed. It is good at identifying small clusters.
The steps for agglomerative clustering are as follows:
1. Compute the proximity matrix using a distance metric.
2. Use a linkage function to group objects into a hierarchical cluster tree based on the
computed distance matrix from the above step.
3. Data points with close proximity are merged together to form a cluster.
4. Repeat steps 2 and 3 until a single cluster remains.
The proximity matrix for n points x1, …, xn stores the pairwise distance d(xi, xj) between every pair of points.
In order to group the data points in a cluster, a linkage function is used where the values in the
proximity matrix are taken and the data points are grouped based on similarity. The newly
formed clusters are linked to each other until they form a single cluster containing all the data
points.
The most common linkage methods are as follows:
• Complete linkage: The maximum of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters.
• Single linkage: The minimum of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters.
• Average linkage: The average of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters.
• Centroid linkage: The distance between the centroids of the two clusters is considered before merging.
• Ward’s Method: It uses squared error to compute the similarity of the two clusters for
merging.
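As an illustration of the linkage methods above, the sketch below builds a hierarchy with SciPy; the toy data and the choice of Ward linkage are assumptions for the example.

# Agglomerative (hierarchical) clustering sketch with SciPy
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# linkage computes the proximity (distance) matrix internally and merges
# the closest clusters step by step until a single cluster remains
Z = linkage(X, method="ward")   # also try "single", "complete", "average", "centroid"
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the hierarchy into 2 flat clusters
print(labels)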
Divisive Clustering
Divisive clustering works just the opposite of agglomerative clustering. It starts by considering all the data points as one big cluster and then keeps splitting it into smaller clusters until every data point is in its own cluster. Thus, it is good at identifying large clusters. It follows a top-down approach and can be more efficient than agglomerative clustering when a complete hierarchy is not needed, but because of its implementation complexity it does not have a predefined implementation in any of the major machine learning frameworks.
Steps in Divisive Clustering
1. Consider all the data points as a single cluster.
2. Split it into clusters using any flat-clustering method, say K-Means.
3. Choose the best cluster among the resulting clusters to split further, e.g., the one that has the largest Sum of Squared Errors (SSE).
4. Repeat steps 2 and 3 until each data point is in its own cluster or the desired number of clusters is reached, as sketched in the code below.
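Since the major frameworks do not ship a divisive implementation, the following is a minimal illustrative sketch of these steps, using K-Means for each split and the SSE to pick the cluster to split; the function names here are hypothetical.

# Minimal divisive (bisecting) clustering sketch, not a library routine
import numpy as np
from sklearn.cluster import KMeans

def sse(points):
    # Sum of squared errors of a cluster around its mean
    return float(((points - points.mean(axis=0)) ** 2).sum())

def divisive_clustering(X, n_clusters):
    clusters = [X]                                    # start with all points in one cluster
    while len(clusters) < n_clusters:
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        target = clusters.pop(worst)                  # split the cluster with the largest SSE
        labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(target)
        clusters.extend([target[labels == 0], target[labels == 1]])
    return clusters

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),
               rng.normal(4, 0.5, (30, 2)),
               rng.normal(8, 0.5, (30, 2))])
for i, c in enumerate(divisive_clustering(X, 3)):
    print(f"cluster {i}: {len(c)} points")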
Hierarchical clustering can be used for several applications, ranging from customer segmentation to object
recognition.