What is Clustering?
The task of grouping data points based on their similarity to each other
is called Clustering or Cluster Analysis. It belongs to the branch of
unsupervised learning, which aims to gain insights from unlabelled
data points.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group
similar data points:
Hard Clustering: In this type of clustering, each data point belongs
entirely to one cluster or not at all. For example, if there are 4 data points
and we have to group them into 2 clusters, each data point will belong
either to cluster 1 or to cluster 2.
Soft Clustering: In this type of clustering, instead of assigning each data
point to exactly one cluster, a probability or likelihood of that point
belonging to each cluster is evaluated. For example, if there are 4 data
points and we have to group them into 2 clusters, we evaluate for every
data point the probability of it belonging to each of the two clusters.
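As a minimal sketch of the difference (assuming scikit-learn is available and using a tiny made-up 2-D dataset), K-Means produces hard assignments while a Gaussian mixture produces soft membership probabilities:

# Hard vs. soft clustering on a tiny illustrative dataset (values are hypothetical).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])  # 4 data points

# Hard clustering: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Hard assignments:", hard_labels)             # e.g. [0 0 1 1]

# Soft clustering: each point gets a probability of belonging to every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("Soft memberships:\n", gmm.predict_proba(X))  # each row sums to 1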
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through
the use cases of Clustering algorithms. Clustering algorithms are majorly used
for:
Market Segmentation: Businesses use clustering to group their customers
and run targeted advertisements to reach the right audience.
Market Basket Analysis: Shop owners analyze their sales to figure out
which items are frequently bought together. For example, according to a
well-known study in the USA, diapers and beer were often bought together
by fathers.
Social Network Analysis: Social media sites use your data to understand
your browsing behavior and provide you with targeted friend
recommendations or content recommendations.
Medical Imaging: Doctors use clustering to locate diseased areas in
diagnostic images such as X-rays.
Anomaly Detection: Clustering can be used to find outliers in real-time
data streams or to flag fraudulent transactions.
Types of Clustering Methods
Various types of clustering algorithms are:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering
5. Fuzzy Clustering
Centroid-based Clustering (Partitioning methods)
Centroid-based clustering organizes data points around central vectors
(centroids) that represent clusters. Each data point belongs to the cluster with
the nearest centroid. Generally, the similarity measure chosen for these
algorithms is Euclidean distance, Manhattan distance or Minkowski distance.
The dataset is separated into a predetermined number of clusters, each
referenced by a centroid vector; every input data point is compared against
these vectors and joins the cluster whose centroid it is closest to.
Popular algorithms of Centroid-based clustering are:
K-means and
K-medoids clustering
Density-based Clustering (Model-based methods)
Density-based clustering identifies clusters as areas of high density separated
by regions of low density in the data space. Unlike centroid-based methods,
density-based clustering automatically determines the number of clusters
and is less susceptible to initialization positions.
Connectivity-based Clustering (Hierarchical clustering)
Connectivity-based clustering builds a hierarchy of clusters using a measure
of connectivity based on distance when organizing a collection of items
based on their similarities. This method builds a dendrogram, a tree-like
structure that visually represents the relationships between objects.
There are 2 approaches for Hierarchical clustering:
Divisive Clustering: It follows a top-down approach; here we consider all
data points to be part of one big cluster, and this cluster is then divided
into smaller groups.
Agglomerative Clustering: It follows a bottom-up approach; here we
consider every data point to be its own cluster, and these clusters are then
merged step by step until one big cluster containing all data points remains.
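A minimal sketch of agglomerative (bottom-up) clustering and the dendrogram it produces, assuming SciPy and matplotlib are available and using a small hypothetical dataset:

# Agglomerative hierarchical clustering with a dendrogram (hypothetical 2-D points).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [1, 1], [8, 8], [9, 8], [8, 9]])

Z = linkage(X, method="ward")                     # bottom-up merging of closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print("Cluster labels:", labels)

dendrogram(Z)                                     # tree-like view of the merge order
plt.show()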
Distribution-based Clustering
Distribution-based clustering is a technique that assumes data points are
generated from a mixture of probability distributions (e.g., Gaussian,
Poisson, etc.). The goal is to identify clusters by estimating the parameters of
these distributions. In distribution-based clustering:
Each cluster is represented by a probability distribution.
Data points are assigned to clusters based on how likely they are to belong
to each distribution.
Unlike distance-based methods (e.g., K-Means), this approach can capture
clusters of varying shapes, sizes, and densities.
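A minimal sketch of distribution-based clustering with a Gaussian mixture model, assuming scikit-learn and using synthetic data generated only for illustration:

# Distribution-based clustering: fit a mixture of Gaussians and read off
# both hard assignments and per-cluster membership probabilities.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42).fit(X)
labels = gmm.predict(X)        # most likely distribution for each point
probs = gmm.predict_proba(X)   # likelihood of belonging to each distribution
print(labels[:5])
print(probs[:5].round(3))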
Fuzzy Clustering
Fuzzy clustering allows data points to belong to multiple clusters with varying
degrees of membership.
Each data point is assigned a membership value between 0 and 1 for every
cluster.
These membership values indicate the degree to which a data point belongs
to a particular cluster.
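Fuzzy c-means is not part of scikit-learn, so below is a minimal NumPy sketch of the core membership and centroid updates (fuzzifier m, hypothetical data), intended only to show how fractional memberships between 0 and 1 arise:

# Minimal fuzzy c-means: every point gets a membership in [0, 1] for each cluster.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                       # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]      # membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / (d ** (2 / (m - 1)))                      # closer cluster -> higher membership
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 8.4], [4.5, 4.5]])
centers, U = fuzzy_c_means(X, c=2)
print(np.round(U, 2))   # the middle point should show split membership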
K-Means Clustering
K-Means clustering is an unsupervised machine learning algorithm used to
group unlabeled data into distinct clusters based on similarity. The goal is to
minimize the distance between data points and their corresponding cluster
centers (centroids). It is commonly used in fields like customer segmentation,
market analysis, image compression, and pattern recognition.
The term "unsupervised" refers to the fact that the algorithm does not rely on
predefined labels but instead finds patterns and structures directly from the
input data.
Working of K-Means
The K-Means algorithm follows an iterative process based on the following
steps:
1. Choosing the Number of Clusters (K):
o The first step is to decide the value of K, i.e., the number of
clusters required.
2. Initializing Centroids:
o Randomly select K data points from the dataset as initial
centroids.
3. Assigning Points to Clusters:
o Each data point is assigned to the nearest centroid using a distance
metric, commonly the Euclidean distance.
o Formula for Euclidean distance between two points p and q:
o d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
4. Updating Centroids:
o After assigning all points, update the centroids by calculating the
mean of all points in each cluster.
5. Repeating the Process:
o Steps 3 and 4 are repeated until the centroids no longer change
significantly, indicating convergence.
6. Stopping Criteria:
o No change in cluster assignments.
o A maximum number of iterations is reached.
o Minimal movement of centroids.
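The steps above map directly to code. Below is a minimal NumPy sketch on hypothetical data, meant only to mirror steps 1-5; in practice scikit-learn's KMeans would normally be used instead.

# K-Means from scratch, following the steps above (illustrative only).
import numpy as np

def kmeans(X, k=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # step 2: random init
    for _ in range(max_iter):
        # step 3: assign each point to the nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its assigned points
        # (empty-cluster handling omitted for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # step 5/6: convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5], [1.2, 0.9], [8.5, 7.8]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)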
Example
A shopping mall wants to group customers based on their annual income and
spending score to design targeted marketing strategies.
Steps:
Assume K = 2 (Two clusters: High spenders & Low spenders).
Randomly select two customers as initial centroids.
Assign other customers based on proximity to centroids.
Update centroids by calculating the mean of the assigned customers'
values.
Repeat until the centroids stabilize.
Solution:
Cluster 1: Customers A, C, E (Low spenders)
Cluster 2: Customers B, D (High spenders)
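A minimal scikit-learn sketch of this example, using hypothetical (annual income, spending score) values for customers A-E chosen only to illustrate the two groups:

# K = 2 clustering of five hypothetical customers (annual income in k$, spending score).
import numpy as np
from sklearn.cluster import KMeans

customers = ["A", "B", "C", "D", "E"]
X = np.array([[25, 20],   # A  (values are made up for illustration)
              [90, 85],   # B
              [30, 25],   # C
              [85, 90],   # D
              [28, 22]])  # E

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for name, label in zip(customers, km.labels_):
    print(name, "-> cluster", label)
print("Centroids:", km.cluster_centers_)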
Applications of K-Means
1. Customer Segmentation: Grouping customers based on purchasing
behavior.
2. Image Compression: Reducing image size by grouping similar pixel
colors.
3. Anomaly Detection: Identifying outliers in datasets.
4. Document Clustering: Organizing documents into topics.
5. Market Basket Analysis: Finding patterns in purchasing behavior.
Advantages
✅ Simple and easy to implement.
✅ Fast and efficient for large datasets.
✅ Works well when clusters are clearly separated.
Disadvantages
❌ Requires pre-defining the value of K.
❌ Sensitive to outliers and noise.
❌ May converge to a local minimum, depending on initial centroid placement.
❌ Not suitable for clusters of varying shapes and sizes.
DBSCAN Clustering (Density-Based Clustering)
DBSCAN is a density-based clustering algorithm that groups data points
that are closely packed together and marks outliers as noise based on their
density in the feature space. It identifies clusters as dense regions in the data
space, separated by areas of lower density.
Unlike K-Means or hierarchical clustering, which assume clusters are compact
and spherical, DBSCAN excels in handling real-world data irregularities such
as:
Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular
or convex.
Noise and Outliers: It effectively identifies and handles noise points without
assigning them to any cluster.
Key Parameters in DBSCAN
1. Epsilon (ε):
o The maximum distance between two points for one to be
considered part of the neighborhood of the other.
2. Minimum Points (MinPts):
o The minimum number of points required to form a dense region (a
cluster).
3. Core Point:
o A point that has at least MinPts within its ε-neighborhood.
4. Border Point:
o A point that is not a core point but lies within the ε-neighborhood
of a core point.
5. Noise (Outlier):
o A point that does not belong to any cluster (neither a core nor a
border point).
How DBSCAN Works
1. Select an unvisited data point and mark it as visited.
2. Find its neighbors within distance ε.
o If there are at least MinPts neighbors, create a new cluster.
o If not, mark it as noise (an outlier).
3. Expand the cluster:
o All points within ε of the core point are added to the cluster.
o If a neighbor also has at least MinPts points in its neighborhood, it
becomes a core point, and its neighbors are also added.
4. Repeat:
o Continue until all points are either classified into a cluster or
labeled as noise.
Example
Imagine you are tracking animals' locations in a wildlife sanctuary to identify
herds.
Parameters:
Epsilon (ε) = 2
MinPts = 2
Steps:
Animals A, B, and C form a dense cluster since they are close and meet
MinPts.
Animals D and E also form another dense cluster.
Animal F is isolated and marked as noise.
Result:
Cluster 1: A, B, C (First herd)
Cluster 2: D, E (Second herd)
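A minimal scikit-learn sketch of this example with hypothetical animal coordinates; note that scikit-learn labels noise points as -1 and that min_samples counts the point itself:

# DBSCAN on hypothetical animal locations (eps = 2, MinPts = 2).
import numpy as np
from sklearn.cluster import DBSCAN

animals = ["A", "B", "C", "D", "E", "F"]
X = np.array([[1, 1], [2, 1], [1, 2],    # A, B, C close together -> first herd
              [8, 8], [9, 8],            # D, E close together    -> second herd
              [5, 15]])                  # F isolated              -> noise

db = DBSCAN(eps=2, min_samples=2).fit(X)
for name, label in zip(animals, db.labels_):
    print(name, "-> cluster", label)     # expected: A,B,C -> 0, D,E -> 1, F -> -1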
Applications of Density-Based Clustering
1. Anomaly Detection: Identifying fraud or unusual behavior in banking
systems.
2. Geographic Data Analysis: Clustering regions based on population
density.
3. Image Processing: Detecting patterns in images.
4. Market Segmentation: Grouping customers with unique purchasing
habits.
5. Astronomy: Finding celestial clusters in large datasets.
Advantages
✅ Can detect clusters of arbitrary shapes.
✅ Automatically identifies outliers.
✅ No need to specify the number of clusters beforehand.
Disadvantages
❌ Choosing the right values for ε and MinPts can be challenging.
❌ Does not perform well with clusters of varying densities.
❌ Struggles with high-dimensional data due to the curse of dimensionality.
Dimensionality Reduction
Dimensionality Reduction is a technique in machine learning and data analysis
that reduces the number of input variables or features in a dataset while
retaining as much relevant information as possible.
In real-world scenarios, datasets often contain hundreds or thousands of
features, leading to the curse of dimensionality, where models become
complex, slow, and prone to overfitting. Dimensionality reduction simplifies
these datasets, improving performance and visualization.
Why Is Dimensionality Reduction Important?
1. Reduces computational cost.
2. Improves model performance by eliminating irrelevant features.
3. Simplifies data visualization (e.g., reducing dimensions from 1000 to 2).
4. Helps prevent overfitting by reducing noise.
Types of Dimensionality Reduction Techniques
Dimensionality reduction methods are broadly classified into two categories:
1. Feature Selection: Choosing a subset of relevant features from the
original dataset.
o Techniques: Filter methods, Wrapper methods, Embedded
methods.
2. Feature Extraction: Creating new features by transforming the original
dataset.
o Techniques:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Autoencoders (Deep Learning)
Principal Component Analysis (PCA) - Detailed Explanation
PCA is one of the most popular dimensionality reduction techniques. It
transforms the data into a new coordinate system, where the greatest variance
by any projection of the data comes to lie on the first axis (called the principal
component), the second greatest variance on the second axis, and so on.
How PCA Works:
1. Standardize the Data:
o Convert features to have a mean of 0 and standard deviation of 1.
2. Compute the Covariance Matrix:
o Shows how variables relate to each other.
3. Calculate Eigenvalues and Eigenvectors:
o These determine the principal components.
4. Sort Eigenvalues:
o Select the top K eigenvectors corresponding to the largest
eigenvalues.
5. Transform the Data:
o Project the original data onto the selected principal components.
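The five steps map directly to NumPy. A minimal sketch on synthetic data follows (illustrative only, not a substitute for sklearn.decomposition.PCA):

# PCA from scratch, following steps 1-5 above (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # synthetic data with 5 features
X[:, 3] = X[:, 0] * 2 + X[:, 1]             # add correlated columns
X[:, 4] = X[:, 0] - X[:, 2]

# 1. standardize to zero mean and unit variance
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. covariance matrix of the standardized features
cov = np.cov(Xs, rowvar=False)
# 3. eigenvalues/eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. sort by eigenvalue (descending) and keep the top K components
order = np.argsort(eigvals)[::-1]
K = 2
components = eigvecs[:, order[:K]]
# 5. project the data onto the selected principal components
X_reduced = Xs @ components
print(X_reduced.shape)                      # (100, 2)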
Worked example: reducing students' Math and Science scores to a single component.
Steps:
1. The scores are standardized.
2. A covariance matrix is created to analyze the relationships between Math
and Science scores.
3. Eigenvalues and eigenvectors are computed to find the direction of
maximum variance.
4. The data is projected onto this new axis, reducing the two-dimensional
dataset to one dimension while preserving most of the variation.
Result:
The dataset is reduced to a single score that reflects performance across
both subjects.
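A minimal scikit-learn version of this example, with hypothetical Math and Science scores:

# Reduce hypothetical (Math, Science) scores to a single principal component.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scores = np.array([[85, 80], [60, 65], [90, 92], [70, 68], [50, 55]])  # made-up marks

scores_std = StandardScaler().fit_transform(scores)    # standardize first
pca = PCA(n_components=1)
combined = pca.fit_transform(scores_std)               # one score per student

print(combined.ravel().round(2))
print("Variance explained:", pca.explained_variance_ratio_.round(3))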
Applications of Dimensionality Reduction
1. 🎨 Image Compression: Reduces the size of image data while preserving
important features.
2. 🏦 Financial Modeling: Simplifies complex financial datasets for
analysis.
3. 🧬 Genomics: Helps in identifying significant genes from large datasets.
4. 📈 Data Visualization: Reduces high-dimensional data into 2D or 3D for
plotting.
5. 🔒 Anomaly Detection: Helps in identifying fraud or unusual patterns.
Advantages
✅ Reduces overfitting by removing redundant features.
✅ Faster computation and training time.
✅ Easier visualization of complex data.
✅ Improves model accuracy by eliminating irrelevant variables.
Disadvantages
❌ Loss of interpretability since transformed features lose their original
meaning.
❌ Some information might be lost during the reduction process.
❌ Requires careful preprocessing, such as scaling and normalization.
How Dimensionality Reduction Works
Consider data points in a 3D space (X, Y, Z) where the Z-dimension appears
unnecessary because the data primarily varies along the X and Y axes. The goal
of dimensionality reduction is to remove such less important dimensions
without losing valuable information.
After reducing the dimensionality, the data can be represented in
lower-dimensional spaces: an X-Y plot maintains the meaningful structure,
while a Z-Y plot shows that the Z-dimension contributed little useful
information.
This process makes data analysis more efficient, improving computation speed
and visualization while minimizing redundancy.
Feature Selection and Feature Extraction
Feature Selection
Feature selection chooses the most relevant features from the dataset without
altering them. It helps remove redundant or irrelevant features, improving
model efficiency. There are several methods for feature selection
including filter methods, wrapper methods, and embedded methods.
Filter methods rank the features based on their relevance to the target
variable.
Wrapper methods use the model performance as the criteria for selecting
features.
Embedded methods combine feature selection with the model training
process.
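As a quick illustration of a filter method, here is a minimal sketch using scikit-learn's SelectKBest with the ANOVA F-score on a built-in dataset:

# Filter-method feature selection: keep the k features most related to the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)   # rank features by ANOVA F-score
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape, "-> selected shape:", X_selected.shape)
print("Chosen feature indices:", selector.get_support(indices=True))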
Feature Extraction
Feature extraction involves creating new features by combining or
transforming the original features. The feature extraction techniques listed
earlier (PCA, LDA, t-SNE, autoencoders) all create such transformed features.
PCA is a popular technique that projects the original features onto a
lower-dimensional space while preserving as much of the variance as possible.
Advantages of Dimensionality Reduction
As seen earlier, high dimensionality makes models inefficient. Let’s now
summarize the key advantages of reducing dimensionality.
Faster Computation: With fewer features, machine learning algorithms
can process data more quickly. This results in faster model training and
testing, which is particularly useful when working with large datasets.
Better Visualization: As we saw in the earlier figure, reducing dimensions
makes it easier to visualize data, revealing hidden patterns.
Prevent Overfitting: With fewer features, models are less likely to
memorize the training data and overfit. This helps the model generalize
better to new, unseen data, improving its ability to make accurate
predictions.
Disadvantages of Dimensionality Reduction
Data Loss & Reduced Accuracy – Some important information may be
lost during dimensionality reduction, potentially affecting model
performance.
Interpretability Challenges – The transformed features (e.g., principal
components) may not have clear meanings, making it harder to understand
relationships in the original data.
Choosing the Right Components – Deciding how many dimensions to
keep is difficult, as keeping too few may lose valuable information, while
keeping too many can lead to overfitting.
Collaborative Filtering in Machine Learning
Collaborative Filtering (CF) is a widely used machine learning technique for
building recommendation systems. It predicts user preferences by analyzing
past behaviors, such as ratings, views, or purchases. Unlike content-based
filtering, CF relies solely on user-item interactions rather than item
characteristics. This approach assumes that users who liked similar items in the
past will continue to share preferences in the future.
Types of Collaborative Filtering
Collaborative Filtering can be classified into two main types:
1. User-Based Collaborative Filtering (UBCF):
o This approach finds users with similar tastes and recommends
items that those similar users have liked.
o Working:
Identify users similar to the target user based on past ratings.
Recommend items that these similar users have rated highly
but the target user has not yet interacted with.
o Example: If User A and User B have similar ratings for books, and
User A enjoys a new book, User B might get that book as a
recommendation.
2. Item-Based Collaborative Filtering (IBCF):
o This method focuses on similarities between items instead of users.
o Working:
Identify items that are frequently rated together by users.
Recommend similar items based on the user’s past
preferences.
o Example: If many users who liked "The Lord of the Rings" also
liked "The Hobbit," someone who liked "The Lord of the Rings"
might be recommended "The Hobbit."
Mathematical Foundation
1. User-Item Interaction Matrix:
o A sparse matrix where rows represent users, and columns represent
items. The values reflect ratings, views, or likes.
2. Similarity Measures:
Collaborative Filtering relies on similarity calculations, such as:
o Cosine Similarity: Measures the cosine of the angle between two rating
vectors.
o Pearson Correlation Coefficient: Measures the linear correlation between
two variables.
o Euclidean Distance: Measures the straight-line distance between two points
in space.
Prediction Formula (for User-Based CF):
The rating of user u for item i is commonly predicted as
r_hat(u, i) = r_bar(u) + [ sum over v of sim(u, v) * (r(v, i) - r_bar(v)) ] / [ sum over v of |sim(u, v)| ]
where the sums run over the users v most similar to u who have rated item i,
r_bar(u) is user u's average rating, and sim(u, v) is the chosen similarity
measure.
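A minimal NumPy sketch of user-based CF on a tiny hypothetical ratings matrix, using cosine similarity and the prediction formula above (0 marks an item not yet rated):

# User-based collaborative filtering on a tiny hypothetical ratings matrix.
import numpy as np

# rows = users, columns = items; 0 means "not yet rated"
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine_sim(a, b):
    mask = (a > 0) & (b > 0)                 # compare only co-rated items
    if not mask.any():
        return 0.0
    return a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]))

def predict(u, i):
    mean_u = R[u][R[u] > 0].mean()
    num, den = 0.0, 0.0
    for v in range(len(R)):
        if v == u or R[v, i] == 0:
            continue
        s = cosine_sim(R[u], R[v])
        num += s * (R[v, i] - R[v][R[v] > 0].mean())
        den += abs(s)
    return mean_u if den == 0 else mean_u + num / den

print("Predicted rating of user 0 for item 2:", round(predict(0, 2), 2))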
There are broadly four types of techniques used to build collaborative
filtering recommender systems:
Memory-Based
Model-Based
Hybrid
Deep Learning