Hierarchical Clustering with Scikit-Learn
Last Updated: 12 Jun, 2024
Hierarchical clustering is a popular method in data science for grouping similar data points into clusters. Unlike other clustering techniques like K-means, hierarchical clustering does not require the number of clusters to be specified in advance. Instead, it builds a hierarchy of clusters that can be visualized as a dendrogram. In this article, we will explore hierarchical clustering using Scikit-Learn, a powerful Python library for machine learning.
Introduction to Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. It is particularly useful when the number of clusters is not known beforehand. The main idea is to create a tree-like structure (dendrogram) that represents the nested grouping of data points.
Types of Hierarchical Clustering
There are two main types of hierarchical clustering:
- Agglomerative Clustering: Agglomerative clustering is a "bottom-up" approach. It starts with each data point as a single cluster and merges the closest pairs of clusters iteratively until all points are in a single cluster or a stopping criterion is met.
- Divisive Clustering: Divisive clustering is a "top-down" approach. It starts with all data points in a single cluster and recursively splits the clusters into smaller ones until each data point is in its own cluster or a stopping criterion is met.
In this article, we will focus on agglomerative clustering, as it is more commonly used and is well-supported by Scikit-Learn.
How Hierarchical Clustering Works
Hierarchical clustering involves the following steps:
- Calculate the Distance Matrix: Compute the distance between every pair of data points using a distance metric (e.g., Euclidean distance).
- Merge Closest Clusters: Identify the two closest clusters and merge them into a single cluster.
- Update the Distance Matrix: Recalculate the distances between the new cluster and all other clusters.
- Repeat: Repeat steps 2 and 3 until all data points are merged into a single cluster or a stopping criterion is met. (A minimal code sketch of this loop is shown below.)
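These four steps translate almost directly into code. The following is a minimal sketch, assuming single linkage (the distance between two clusters is the distance between their closest pair of points) and plain NumPy; the function name naive_agglomerative is illustrative, and in practice you would use scikit-learn's optimized implementation shown later in this article.
Python
# A minimal, illustrative merge loop, written for clarity rather than speed.
import numpy as np

def naive_agglomerative(X, n_clusters):
    # Step 1: pairwise Euclidean distance matrix between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Start with each point in its own cluster.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        # Step 2: find the two closest clusters under single linkage.
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Step 3: merge them. With single linkage the point-level distance
        # matrix never changes, so the "update" is implicit in the min above.
        clusters[a] += clusters[b]
        del clusters[b]
    # Step 4 ends here; report one label per data point.
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels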
Dendrograms: Visualizing Hierarchical Clustering
A dendrogram is a tree-like diagram that shows the arrangement of clusters produced by hierarchical clustering. It provides a visual representation of the merging process and helps in determining the optimal number of clusters.
How to Read a Dendrogram
- Leaves: Represent individual data points.
- Branches: Represent clusters formed by merging data points or other clusters.
- Height: Represents the distance or dissimilarity between clusters; the higher the branch, the more dissimilar the clusters. (A short sketch of cutting the tree into flat clusters is shown below.)
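In practice, "reading off" a clustering from a dendrogram means cutting the tree, either at a chosen height or into a fixed number of clusters. A short, self-contained sketch using SciPy, where the random data and the height t=1.5 are illustrative values:
Python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 2))   # 20 random 2-D points
Z = linkage(pts, method='ward')  # merge history: one row per merge

labels_by_height = fcluster(Z, t=1.5, criterion='distance')  # cut at height 1.5
labels_by_count = fcluster(Z, t=3, criterion='maxclust')     # ask for 3 clusters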
Implementing Hierarchical Clustering with Scikit-Learn
Scikit-Learn provides a straightforward implementation of hierarchical clustering through the AgglomerativeClustering class. Let's walk through the steps to implement hierarchical clustering using Scikit-Learn.
Step 1: Import Libraries
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
Step 2: Generate Sample Data
For demonstration purposes, we will generate synthetic data using the make_blobs function.
Python
# Generate sample data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
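The synthetic blobs already share a common scale, but hierarchical clustering is distance-based and therefore sensitive to feature scale. On real data, a standardization step like the sketch below usually comes first (skipped here, since it is not needed for this example):
Python
from sklearn.preprocessing import StandardScaler

# Rescale each feature to zero mean and unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X)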
Step 3: Perform Agglomerative Clustering
Python
# Perform agglomerative clustering
agg_clustering = AgglomerativeClustering(n_clusters=4)
y_pred = agg_clustering.fit_predict(X)
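AgglomerativeClustering can also cut the tree at a height instead of at a fixed cluster count, and supports other linkage strategies ('ward', the default, requires Euclidean distances; 'complete', 'average', and 'single' do not). A brief sketch, where the threshold 25.0 is an illustrative value:
Python
# Let the cut height choose the number of clusters.
agg_auto = AgglomerativeClustering(n_clusters=None, distance_threshold=25.0)
labels_auto = agg_auto.fit_predict(X)
print(agg_auto.n_clusters_)  # number of clusters found at that height

# A different linkage strategy for comparison.
agg_complete = AgglomerativeClustering(n_clusters=4, linkage='complete')
labels_complete = agg_complete.fit_predict(X)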
Step 4: Plot the Clusters
Python
# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='rainbow')
plt.title('Agglomerative Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
Output:
[Scatter plot of the four clusters, colored by predicted label]
Step 5: Plot the Dendrogram
To plot the dendrogram, we need to use the linkage function from the scipy.cluster.hierarchy module.
Python
# Generate the linkage matrix
Z = linkage(X, method='ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
Output:
[Dendrogram of the sample data, with merge distances on the y-axis]
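For larger datasets the full dendrogram quickly becomes unreadable. SciPy's dendrogram function can truncate the tree to show only the last merges; a short sketch reusing the linkage matrix Z from above (p=12 is an arbitrary choice):
Python
# Show only the last 12 merged clusters; counts in parentheses on the
# x-axis labels give how many original points each collapsed leaf contains.
plt.figure(figsize=(10, 7))
dendrogram(Z, truncate_mode='lastp', p=12, show_contracted=True)
plt.title('Truncated Dendrogram')
plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.show()
Advantages and Disadvantages of Hierarchical Clustering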
Advantages:
- No Need to Specify Number of Clusters: Unlike K-means, hierarchical clustering does not require the number of clusters to be specified in advance.
- Dendrogram: Provides a visual representation of the clustering process and helps in determining the optimal number of clusters.
- Versatility: Can be used for various types of data and distance metrics.
Disadvantages:
- Computational Complexity: Hierarchical clustering can be computationally expensive, especially for large datasets.
- Sensitivity to Noise: Can be sensitive to noise and outliers, which may affect the clustering results.
- Lack of Scalability: Not suitable for very large datasets due to its high time and memory complexity; one common mitigation is sketched below.
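One way to ease the cost on larger datasets, sketched below with X as the data being clustered: restrict merges to a k-nearest-neighbor connectivity graph, which AgglomerativeClustering accepts through its connectivity parameter (n_neighbors=10 is an illustrative choice).
Python
from sklearn.neighbors import kneighbors_graph

# Only allow merges between neighboring points; this keeps the hierarchy
# local and substantially reduces cost on larger datasets.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
agg_conn = AgglomerativeClustering(n_clusters=4, connectivity=connectivity)
labels_conn = agg_conn.fit_predict(X)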
Conclusion
Hierarchical clustering is a powerful and versatile clustering technique that builds a hierarchy of clusters without requiring the number of clusters to be specified in advance. Scikit-Learn provides an easy-to-use implementation of hierarchical clustering through the AgglomerativeClustering class. By following the steps outlined in this article, you can perform hierarchical clustering on your own datasets and visualize the results using dendrograms.