Types of Linkages in Hierarchical Clustering

Hierarchical clustering groups similar data points and organises them in a tree-like structure. A key part of this process is the linkage criterion, which defines how the distance between two clusters is calculated before they are merged or divided. Different linkage methods measure this distance differently. In this article, we'll look at the common linkage methods and see how they affect cluster formation.

1. Single Linkage

For two clusters R and S, single linkage returns the minimum distance between a point in R and a point in S. Because a single pair of close points is enough to join two clusters, this method tends to create long, chain-like clusters and is sensitive to noise points or outliers that bridge otherwise separate groups.

L(R, S) = \min(D(i, j)), \, i \in R, \, j \in S

where

D(i, j): Distance function between points i and j.
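
The formula can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: it assumes D(i, j) is the Euclidean distance (via `math.dist`), and the function name `single_linkage` and the sample clusters are made up for the example.

```python
from math import dist  # Euclidean distance, assumed here as D(i, j)


def single_linkage(R, S):
    """Smallest D(i, j) over all pairs with i in R and j in S."""
    return min(dist(i, j) for i in R for j in S)


# Hypothetical clusters lying on the x-axis.
R = [(0.0, 0.0), (1.0, 0.0)]
S = [(3.0, 0.0), (5.0, 0.0)]

# The closest pair is (1, 0) and (3, 0), so only that pair decides the result.
print(single_linkage(R, S))
```

Moving any other point in R or S farther away would not change the result, which is why a few close points can chain clusters together.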

2. Complete Linkage

For two clusters R and S, complete linkage returns the maximum distance between a point in R and a point in S. It tends to create compact clusters of roughly similar diameter, because two clusters are merged only when even their farthest points are close to each other. Since the result depends on the single farthest pair, it is sensitive to outliers.

L(R, S) = \max(D(i, j)), \, i \in R, \, j \in S

where

D(i, j): Distance function between points i and j.
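
As a rough sketch of this definition, the snippet below takes the largest pairwise distance instead of the smallest. It assumes Euclidean distance for D(i, j); the name `complete_linkage` and the sample points are hypothetical.

```python
from math import dist  # Euclidean distance, assumed here as D(i, j)


def complete_linkage(R, S):
    """Largest D(i, j) over all pairs with i in R and j in S."""
    return max(dist(i, j) for i in R for j in S)


# Hypothetical clusters lying on the x-axis.
R = [(0.0, 0.0), (1.0, 0.0)]
S = [(3.0, 0.0), (5.0, 0.0)]

# The farthest pair is (0, 0) and (5, 0).
print(complete_linkage(R, S))

# A single distant point in S is enough to inflate the linkage value.
S_with_outlier = S + [(50.0, 0.0)]
print(complete_linkage(R, S_with_outlier))
```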

3. Average Linkage

Average linkage returns the average distance over all pairs of points, taking one point from each cluster. This method strikes a balance between single and complete linkage by considering all pairs of points, not just the closest or farthest pair. It usually results in clusters that are moderately compact.

L(R, S) = \frac{1}{n_{R} \times n_{S}} \sum_{i \in R} \sum_{j \in S} D(i, j)

where

n_{R} : Number of data-points in R
n_{S} : Number of data-points in S
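
The double sum can be written directly as a nested loop over both clusters. The sketch below assumes Euclidean distance for D(i, j); `average_linkage` and the sample clusters are illustrative names, not part of any library.

```python
from math import dist  # Euclidean distance, assumed here as D(i, j)


def average_linkage(R, S):
    """Mean of D(i, j) over all n_R * n_S pairs with i in R and j in S."""
    total = sum(dist(i, j) for i in R for j in S)
    return total / (len(R) * len(S))


# Hypothetical clusters lying on the x-axis.
R = [(0.0, 0.0), (1.0, 0.0)]
S = [(3.0, 0.0), (5.0, 0.0)]

# The four pairwise distances are 3, 5, 2 and 4, so the mean lies between
# the single-linkage value (2) and the complete-linkage value (5).
print(average_linkage(R, S))
```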

4. Ward's Linkage

It measures the distance between two clusters as the increase in total within-cluster variance (the sum of squared distances from each point to its cluster centroid) that results from merging them. At each step, the pair of clusters with the smallest increase is merged, which tends to produce compact clusters of similar size.

L(R, S) = \frac{n_R \times n_S}{n_R + n_S} \, \lVert \bar{R} - \bar{S} \rVert^2

where

n_R and n_S are the sizes of clusters R and S
\bar{R} and \bar{S} are the centroids (mean points) of clusters R and S
\lVert \bar{R} - \bar{S} \rVert is the Euclidean distance between the two centroids.
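
Ward's criterion is commonly stated as the increase in the within-cluster sum of squared distances caused by a merge, which for Euclidean distance equals n_R n_S / (n_R + n_S) times the squared distance between the two centroids. The sketch below computes the value both ways to show that they agree. The helper names (`centroid`, `sse`, `ward_linkage`) and the sample data are assumptions made for this example.

```python
from math import dist  # Euclidean distance


def centroid(C):
    """Mean point of cluster C, computed coordinate by coordinate."""
    n = len(C)
    return tuple(sum(coords) / n for coords in zip(*C))


def sse(C):
    """Sum of squared distances from each point in C to the centroid of C."""
    c = centroid(C)
    return sum(dist(p, c) ** 2 for p in C)


def ward_linkage(R, S):
    """Closed form: n_R * n_S / (n_R + n_S) * squared centroid distance."""
    n_r, n_s = len(R), len(S)
    return n_r * n_s / (n_r + n_s) * dist(centroid(R), centroid(S)) ** 2


# Hypothetical clusters lying on the x-axis.
R = [(0.0, 0.0), (1.0, 0.0)]
S = [(3.0, 0.0), (5.0, 0.0)]

# Increase in within-cluster sum of squares when R and S are merged.
increase = sse(R + S) - sse(R) - sse(S)

print(ward_linkage(R, S), increase)  # the two values should match
```

Some libraries report a rescaled or square-rooted version of this quantity, so the absolute values may differ from the sketch even though the merge order is the same.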

5. Centroid Linkage

It calculates the distance between two clusters as the distance between their centroids, i.e., the mean of all points in each cluster. This method works well when clusters are round or evenly shaped, but it may not be the best choice for irregularly shaped clusters.

L(R, S) = D(\bar{R}, \bar{S})

where

\bar{R} and \bar{S} are the centroids (mean points) of clusters R and S
D(\bar{R}, \bar{S}) is the distance between the centroids of clusters R and S.
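
A short sketch of this calculation follows, assuming Euclidean distance for D. The names `centroid` and `centroid_linkage` and the sample clusters are hypothetical.

```python
from math import dist  # Euclidean distance, assumed here as D


def centroid(C):
    """Mean point of cluster C, computed coordinate by coordinate."""
    n = len(C)
    return tuple(sum(coords) / n for coords in zip(*C))


def centroid_linkage(R, S):
    """Distance between the mean points of R and S."""
    return dist(centroid(R), centroid(S))


# Hypothetical clusters: two vertical pairs of points, four units apart.
R = [(0.0, 0.0), (0.0, 2.0)]
S = [(4.0, 0.0), (4.0, 2.0)]

print(centroid(R), centroid(S))  # mean points of the two clusters
print(centroid_linkage(R, S))    # distance between those mean points
```

Only the two mean points enter the calculation, so the spread of each cluster around its centroid has no effect on the result.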

Each linkage method has its own strengths, and the right choice depends on the cluster shapes we expect and the type of data we have.

