What is scipy cluster hierarchy? How to cut hierarchical clustering into flat clustering?

The scipy.cluster.hierarchy module provides functions for hierarchical clustering and its types such as agglomerative clustering. It has various routines which we can use to −

Cut hierarchical clustering into the flat clustering.
Implement agglomerative clustering.
Compute statistics on hierarchies
Visualize flat clustering.
To check isomorphism of two flat cluster assignments.
Plot the clusters.

The routine scipy.cluster.hierarchy.fcluster is used to cut hierarchical clustering into flat clustering, which they obtain as a result an assignment of the original data point to single clusters. Let’s understand the concept with the help of below given example −

Example

#Importing the packages
from scipy.cluster.hierarchy import ward, fcluster
from scipy.spatial.distance import pdist

#The cluster linkage method i.e., scipy.cluster.hierarchy.ward will generate a linkage matrix as their output:
A = [
   [0, 0], [0, 1], [1, 0],
   [0, 3], [0, 2], [1, 4],
   [3, 0], [2, 0], [4, 1],
   [3, 3], [2, 3], [4, 3]
]
X = ward(pdist(A))
print(X)

Output

[[ 0.   1.   1.           2. ]
 [ 2.   7.   1.           2. ]
 [ 3.   4.   1.           2. ]
 [ 9.  10.   1.           2. ]
 [ 6.   8.   1.41421356   2. ]
 [11.  15.   1.73205081   3. ]
 [ 5.  14.   2.081666     3. ] 
 [12.  13.   2.23606798   4. ]
 [16.  17.   3.94968353   5. ]
 [18.  19.   5.15012714   7. ]
 [20.  21.   6.4968857   12. ]]

The matrix X as received in the above output represents a dendrogram. In this dendrogram the first and second elements are the two clusters which merged at each step. The distance between these clusters is given by the third element of above dendrogram. The size of the new cluster is provided by the fourth element.

#Flatting the dendrogram by using fcluster() where the assignation of the original
data points to single clusters mostly depend on the distance threshold t.
fcluster(X, t=1.5, criterion='distance') #when t= 1.5

Output

array([6, 6, 7, 4, 4, 5, 1, 7, 1, 2, 2, 3], dtype=int32)

Example

fcluster(X, t=0.9, criterion='distance') #when t= 0.9

Output

array([ 9, 10, 11, 6, 7, 8, 1, 12, 2, 3, 4, 5], dtype=int32)

Example

fcluster(X, t=9, criterion='distance') #when t= 9

Output

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Gaurav Kumar

Updated on: 2021-11-25T07:05:23+05:30

591 Views

Kickstart Your Career

Get certified by completing the course

Get Started