What is Clustering?
The task of grouping data points based on their similarity to each other
is called Clustering or Cluster Analysis. It belongs to the branch of
unsupervised learning, which aims to gain insights from unlabelled
data points.
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed to group
similar data points:
Hard Clustering: In this type of clustering, each data point belongs
entirely to one cluster or not at all. For example, if there are 4 data points
and we have to group them into 2 clusters, each data point will belong
either to cluster 1 or to cluster 2.
Soft Clustering: In this type of clustering, instead of assigning each data
point to exactly one cluster, a probability or likelihood of that point
belonging to each cluster is evaluated. For example, if there are 4 data
points and we have to group them into 2 clusters, we evaluate for every
data point the probability of it belonging to each of the two clusters.
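As a minimal sketch of the difference (assuming scikit-learn is available and using a tiny made-up 2-D dataset), K-Means produces hard assignments while a Gaussian mixture produces soft membership probabilities:

# Hard vs. soft clustering on a tiny illustrative dataset (values are hypothetical).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])  # 4 data points

# Hard clustering: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Hard assignments:", hard_labels)             # e.g. [0 0 1 1]

# Soft clustering: each point gets a probability of belonging to every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("Soft memberships:\n", gmm.predict_proba(X))  # each row sums to 1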
Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through
the use cases of Clustering algorithms. Clustering algorithms are majorly used
for:
Market Segmentation: Businesses use clustering to group their customers
and run targeted advertisements to reach the right audience.
Market Basket Analysis: Shop owners analyze their sales to figure out
which items are frequently bought together. For example, according to a
well-known study in the USA, diapers and beer were often bought together
by fathers.
Social Network Analysis: Social media sites use your data to understand
your browsing behavior and provide you with targeted friend
recommendations or content recommendations.
Medical Imaging: Doctors use clustering to locate diseased areas in
diagnostic images such as X-rays.
Anomaly Detection: Clustering can be used to find outliers in real-time
data streams or to flag fraudulent transactions.
Types of Clustering Methods
Various types of clustering algorithms are:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering
5. Fuzzy Clustering
Centroid-based Clustering (Partitioning methods)
Centroid-based clustering organizes data points around central vectors
(centroids) that represent clusters. Each data point belongs to the cluster with
the nearest centroid. Generally, the similarity measure chosen for these
algorithms is Euclidean distance, Manhattan distance or Minkowski distance.
The dataset is separated into a predetermined number of clusters, each
referenced by a centroid vector; every input data point is compared against
these vectors and joins the cluster whose centroid it is closest to.
Popular algorithms of Centroid-based clustering are:
K-means and
K-medoids clustering
Density-based Clustering (Model-based methods)
Density-based clustering identifies clusters as areas of high density separated
by regions of low density in the data space. Unlike centroid-based methods,
density-based clustering automatically determines the number of clusters
and is less susceptible to initialization positions.
Connectivity-based Clustering (Hierarchical clustering)
Connectivity-based clustering builds a hierarchy of clusters using a measure
of connectivity based on distance when organizing a collection of items
based on their similarities. This method builds a dendrogram, a tree-like
structure that visually represents the relationships between objects.
There are 2 approaches for Hierarchical clustering:
Divisive Clustering: It follows a top-down approach; here we consider all
data points to be part of one big cluster, and this cluster is then divided
into smaller groups.
Agglomerative Clustering: It follows a bottom-up approach; here we
consider every data point to be its own cluster, and these clusters are then
merged step by step until one big cluster containing all data points remains.
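A minimal sketch of agglomerative (bottom-up) clustering and the dendrogram it produces, assuming SciPy and matplotlib are available and using a small hypothetical dataset:

# Agglomerative hierarchical clustering with a dendrogram (hypothetical 2-D points).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [1, 1], [8, 8], [9, 8], [8, 9]])

Z = linkage(X, method="ward")                     # bottom-up merging of closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print("Cluster labels:", labels)

dendrogram(Z)                                     # tree-like view of the merge order
plt.show()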
Distribution-based Clustering
Distribution-based clustering is a technique that assumes data points are
generated from a mixture of probability distributions (e.g., Gaussian,
Poisson, etc.). The goal is to identify clusters by estimating the parameters of
these distributions. In distribution-based clustering:
Each cluster is represented by a probability distribution.
Data points are assigned to clusters based on how likely they are to belong
to each distribution.
Unlike distance-based methods (e.g., K-Means), this approach can capture
clusters of varying shapes, sizes, and densities.
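A minimal sketch of distribution-based clustering with a Gaussian mixture model, assuming scikit-learn and using synthetic data generated only for illustration:

# Distribution-based clustering: fit a mixture of Gaussians and read off
# both hard assignments and per-cluster membership probabilities.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42).fit(X)
labels = gmm.predict(X)        # most likely distribution for each point
probs = gmm.predict_proba(X)   # likelihood of belonging to each distribution
print(labels[:5])
print(probs[:5].round(3))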
Fuzzy Clustering
Fuzzy clustering allows data points to belong to multiple clusters with varying
degrees of membership.
Each data point is assigned a membership value between 0 and 1 for every
cluster.
These membership values indicate the degree to which a data point belongs
to a particular cluster.
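Fuzzy c-means is not part of scikit-learn, so below is a minimal NumPy sketch of the core membership and centroid updates (fuzzifier m, hypothetical data), intended only to show how fractional memberships between 0 and 1 arise:

# Minimal fuzzy c-means: every point gets a membership in [0, 1] for each cluster.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                       # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]      # membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / (d ** (2 / (m - 1)))                      # closer cluster -> higher membership
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 8.4], [4.5, 4.5]])
centers, U = fuzzy_c_means(X, c=2)
print(np.round(U, 2))   # the middle point should show split membership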
K-Means Clustering
K-Means clustering is an unsupervised machine learning algorithm used to
group unlabeled data into distinct clusters based on similarity. The goal is to
minimize the distance between data points and their corresponding cluster
centers (centroids). It is commonly used in fields like customer segmentation,
market analysis, image compression, and pattern recognition.
The term "unsupervised" refers to the fact that the algorithm does not rely on
predefined labels but instead finds patterns and structures directly from the
input data.
Working of K-Means
The K-Means algorithm follows an iterative process based on the following
steps:
1. Choosing the Number of Clusters (K):
o The first step is to decide the value of K, i.e., the number of
clusters required.
2. Initializing Centroids:
o Randomly select K data points from the dataset as initial
centroids.
3. Assigning Points to Clusters:
o Each data point is assigned to the nearest centroid using a distance
metric, commonly the Euclidean distance.
o Formula for Euclidean distance between two points p and q:
o d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
4. Updating Centroids:
o After assigning all points, update the centroids by calculating the
mean of all points in each cluster.
5. Repeating the Process:
o Steps 3 and 4 are repeated until the centroids no longer change
significantly, indicating convergence.
6. Stopping Criteria:
o No change in cluster assignments.
o A maximum number of iterations is reached.
o Minimal movement of centroids.
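The steps above map directly to code. Below is a minimal NumPy sketch on hypothetical data, meant only to mirror steps 1-5; in practice scikit-learn's KMeans would normally be used instead.

# K-Means from scratch, following the steps above (illustrative only).
import numpy as np

def kmeans(X, k=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # step 2: random init
    for _ in range(max_iter):
        # step 3: assign each point to the nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its assigned points
        # (empty-cluster handling omitted for brevity)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # step 5/6: convergence
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5], [1.2, 0.9], [8.5, 7.8]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)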
Example
A shopping mall wants to group customers based on their annual income and
spending score to design targeted marketing strategies.
Steps:
Assume K = 2 (Two clusters: High spenders & Low spenders).
Randomly select two customers as initial centroids.
Assign other customers based on proximity to centroids.
Update centroids by calculating the mean of the assigned customers'
values.
Repeat until the centroids stabilize.
Solution:
Cluster 1: Customers A, C, E (Low spenders)
Cluster 2: Customers B, D (High spenders)
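A minimal scikit-learn sketch of this example, using hypothetical (annual income, spending score) values for customers A-E chosen only to illustrate the two groups:

# K = 2 clustering of five hypothetical customers (annual income in k$, spending score).
import numpy as np
from sklearn.cluster import KMeans

customers = ["A", "B", "C", "D", "E"]
X = np.array([[25, 20],   # A  (values are made up for illustration)
              [90, 85],   # B
              [30, 25],   # C
              [85, 90],   # D
              [28, 22]])  # E

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for name, label in zip(customers, km.labels_):
    print(name, "-> cluster", label)
print("Centroids:", km.cluster_centers_)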
Applications of K-Means
1. Customer Segmentation: Grouping customers based on purchasing
behavior.
2. Image Compression: Reducing image size by grouping similar pixel
colors.
3. Anomaly Detection: Identifying outliers in datasets.
4. Document Clustering: Organizing documents into topics.
5. Market Basket Analysis: Finding patterns in purchasing behavior.
Advantages
✅ Simple and easy to implement.
✅ Fast and efficient for large datasets.
✅ Works well when clusters are clearly separated.
Disadvantages
❌ Requires pre-defining the value of K.
❌ Sensitive to outliers and noise.
❌ May converge to a local minimum, depending on initial centroid placement.
❌ Not suitable for clusters of varying shapes and sizes.
DBSCAN Clustering (Density-Based Clustering)
DBSCAN is a density-based clustering algorithm that groups data points
that are closely packed together and marks outliers as noise based on their
density in the feature space. It identifies clusters as dense regions in the data
space, separated by areas of lower density.
Unlike K-Means or hierarchical clustering, which assume clusters are compact
and spherical, DBSCAN excels in handling real-world data irregularities such
as:
Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular
or convex.
Noise and Outliers: It effectively identifies and handles noise points without
assigning them to any cluster.
Key Parameters in DBSCAN
1. Epsilon (ε):
o The maximum distance between two points for one to be
considered part of the neighborhood of the other.
2. Minimum Points (MinPts):
o The minimum number of points required to form a dense region (a
cluster).
3. Core Point:
o A point that has at least MinPts within its ε-neighborhood.
4. Border Point:
o A point that is not a core point but lies within the ε-neighborhood
of a core point.
5. Noise (Outlier):
o A point that does not belong to any cluster (neither a core nor a
border point).
How DBSCAN Works
1. Select an unvisited data point and mark it as visited.
2. Find its neighbors within distance ε.
o If there are at least MinPts neighbors, create a new cluster.
o If not, mark it as noise (an outlier).
3. Expand the cluster:
o All points within ε of the core point are added to the cluster.
o If a neighbor also has at least MinPts points in its neighborhood, it
becomes a core point, and its neighbors are also added.
4. Repeat:
o Continue until all points are either classified into a cluster or
labeled as noise.
Example
Imagine you are tracking animals' locations in a wildlife sanctuary to identify
herds.
Parameters:
Epsilon (ε) = 2
MinPts = 2
Steps:
Animals A, B, and C form a dense cluster since they are close and meet
MinPts.
Animals D and E also form another dense cluster.
Animal F is isolated and marked as noise.
Result:
Cluster 1: A, B, C (First herd)
Cluster 2: D, E (Second herd)
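A minimal scikit-learn sketch of this example with hypothetical animal coordinates; note that scikit-learn labels noise points as -1 and that min_samples counts the point itself:

# DBSCAN on hypothetical animal locations (eps = 2, MinPts = 2).
import numpy as np
from sklearn.cluster import DBSCAN

animals = ["A", "B", "C", "D", "E", "F"]
X = np.array([[1, 1], [2, 1], [1, 2],    # A, B, C close together -> first herd
              [8, 8], [9, 8],            # D, E close together    -> second herd
              [5, 15]])                  # F isolated              -> noise

db = DBSCAN(eps=2, min_samples=2).fit(X)
for name, label in zip(animals, db.labels_):
    print(name, "-> cluster", label)     # expected: A,B,C -> 0, D,E -> 1, F -> -1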
Applications of Density-Based Clustering
1. Anomaly Detection: Identifying fraud or unusual behavior in banking
systems.
2. Geographic Data Analysis: Clustering regions based on population
density.
3. Image Processing: Detecting patterns in images.
4. Market Segmentation: Grouping customers with unique purchasing
habits.
5. Astronomy: Finding celestial clusters in large datasets.
Advantages
✅ Can detect clusters of arbitrary shapes.
✅ Automatically identifies outliers.
✅ No need to specify the number of clusters beforehand.
Disadvantages
❌ Choosing the right values for ε and MinPts can be challenging.
❌ Does not perform well with clusters of varying densities.
❌ Struggles with high-dimensional data due to the curse of dimensionality.
Dimensionality Reduction
Dimensionality Reduction is a technique in machine learning and data analysis
that reduces the number of input variables or features in a dataset while
retaining as much relevant information as possible.
In real-world scenarios, datasets often contain hundreds or thousands of
features, leading to the curse of dimensionality, where models become
complex, slow, and prone to overfitting. Dimensionality reduction simplifies
these datasets, improving performance and visualization.
Why Is Dimensionality Reduction Important?
1. Reduces computational cost.
2. Improves model performance by eliminating irrelevant features.
3. Simplifies data visualization (e.g., reducing dimensions from 1000 to 2).
4. Helps prevent overfitting by reducing noise.
Types of Dimensionality Reduction Techniques
Dimensionality reduction methods are broadly classified into two categories:
1. Feature Selection: Choosing a subset of relevant features from the
original dataset.
o Techniques: Filter methods, Wrapper methods, Embedded
methods.
2. Feature Extraction: Creating new features by transforming the original
dataset.
o Techniques:
Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
t-Distributed Stochastic Neighbor Embedding (t-SNE)
Autoencoders (Deep Learning)
Principal Component Analysis (PCA) - Detailed Explanation
PCA is one of the most popular dimensionality reduction techniques. It
transforms the data into a new coordinate system, where the greatest variance
by any projection of the data comes to lie on the first axis (called the principal
component), the second greatest variance on the second axis, and so on.
How PCA Works:
1. Standardize the Data:
o Convert features to have a mean of 0 and standard deviation of 1.
2. Compute the Covariance Matrix:
o Shows how variables relate to each other.
3. Calculate Eigenvalues and Eigenvectors:
o These determine the principal components.
4. Sort Eigenvalues:
o Select the top K eigenvectors corresponding to the largest
eigenvalues.
5. Transform the Data:
o Project the original data onto the selected principal components.
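The five steps map directly to NumPy. A minimal sketch on synthetic data follows (illustrative only, not a substitute for sklearn.decomposition.PCA):

# PCA from scratch, following steps 1-5 above (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # synthetic data with 5 features
X[:, 3] = X[:, 0] * 2 + X[:, 1]             # add correlated columns
X[:, 4] = X[:, 0] - X[:, 2]

# 1. standardize to zero mean and unit variance
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. covariance matrix of the standardized features
cov = np.cov(Xs, rowvar=False)
# 3. eigenvalues/eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. sort by eigenvalue (descending) and keep the top K components
order = np.argsort(eigvals)[::-1]
K = 2
components = eigvecs[:, order[:K]]
# 5. project the data onto the selected principal components
X_reduced = Xs @ components
print(X_reduced.shape)                      # (100, 2)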
Worked example: reducing students' Math and Science scores to a single component.
Steps:
1. The scores are standardized.
2. A covariance matrix is created to analyze the relationships between Math
and Science scores.
3. Eigenvalues and eigenvectors are computed to find the direction of
maximum variance.
4. The data is projected onto this new axis, reducing the two-dimensional
dataset to one dimension while preserving most of the variation.
Result:
The dataset is reduced to a single score that reflects performance across
both subjects.
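A minimal scikit-learn version of this example, with hypothetical Math and Science scores:

# Reduce hypothetical (Math, Science) scores to a single principal component.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scores = np.array([[85, 80], [60, 65], [90, 92], [70, 68], [50, 55]])  # made-up marks

scores_std = StandardScaler().fit_transform(scores)    # standardize first
pca = PCA(n_components=1)
combined = pca.fit_transform(scores_std)               # one score per student

print(combined.ravel().round(2))
print("Variance explained:", pca.explained_variance_ratio_.round(3))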
Applications of Dimensionality Reduction
1. 🎨 Image Compression: Reduces the size of image data while preserving
important features.
2. 🏦 Financial Modeling: Simplifies complex financial datasets for
analysis.
3. 🧬 Genomics: Helps in identifying significant genes from large datasets.
4. 📈 Data Visualization: Reduces high-dimensional data into 2D or 3D for
plotting.
5. 🔒 Anomaly Detection: Helps in identifying fraud or unusual patterns.
Advantages
✅ Reduces overfitting by removing redundant features.
✅ Faster computation and training time.
✅ Easier visualization of complex data.
✅ Improves model accuracy by eliminating irrelevant variables.
Disadvantages
❌ Loss of interpretability since transformed features lose their original
meaning.
❌ Some information might be lost during the reduction process.
❌ Requires careful preprocessing, such as scaling and normalization.
How Dimensionality Reduction Works
Consider data points in a 3D space (X, Y, Z) where the Z-dimension appears
unnecessary because the data primarily varies along the X and Y axes. The goal
of dimensionality reduction is to remove such less important dimensions
without losing valuable information.
After reducing the dimensionality, the data can be represented in
lower-dimensional spaces: an X-Y plot maintains the meaningful structure,
while a Z-Y plot shows that the Z-dimension contributed little useful
information.
This process makes data analysis more efficient, improving computation speed
and visualization while minimizing redundancy.
Feature Selection and Feature Extraction
Feature Selection
Feature selection chooses the most relevant features from the dataset without
altering them. It helps remove redundant or irrelevant features, improving
model efficiency. There are several methods for feature selection
including filter methods, wrapper methods, and embedded methods.
Filter methods rank the features based on their relevance to the target
variable.
Wrapper methods use the model performance as the criteria for selecting
features.
Embedded methods combine feature selection with the model training
process.
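As a quick illustration of a filter method, here is a minimal sketch using scikit-learn's SelectKBest with the ANOVA F-score on a built-in dataset:

# Filter-method feature selection: keep the k features most related to the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)   # rank features by ANOVA F-score
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape, "-> selected shape:", X_selected.shape)
print("Chosen feature indices:", selector.get_support(indices=True))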
Feature Extraction
Feature extraction involves creating new features by combining or
transforming the original features. The feature extraction techniques listed
earlier (PCA, LDA, t-SNE, autoencoders) all create such transformed features.
PCA is a popular technique that projects the original features onto a
lower-dimensional space while preserving as much of the variance as possible.
Advantages of Dimensionality Reduction
As seen earlier, high dimensionality makes models inefficient. Let’s now
summarize the key advantages of reducing dimensionality.
Faster Computation: With fewer features, machine learning algorithms
can process data more quickly. This results in faster model training and
testing, which is particularly useful when working with large datasets.
Better Visualization: As we saw in the earlier figure, reducing dimensions
makes it easier to visualize data, revealing hidden patterns.
Prevent Overfitting: With fewer features, models are less likely to
memorize the training data and overfit. This helps the model generalize
better to new, unseen data, improving its ability to make accurate
predictions.
Disadvantages of Dimensionality Reduction
Data Loss & Reduced Accuracy – Some important information may be
lost during dimensionality reduction, potentially affecting model
performance.
Interpretability Challenges – The transformed features (e.g., principal
components) may not have clear meanings, making it harder to understand
relationships in the original data.
Choosing the Right Components – Deciding how many dimensions to
keep is difficult, as keeping too few may lose valuable information, while
keeping too many can lead to overfitting.
Collaborative Filtering in Machine Learning
Collaborative Filtering (CF) is a widely used machine learning technique for
building recommendation systems. It predicts user preferences by analyzing
past behaviors, such as ratings, views, or purchases. Unlike content-based
filtering, CF relies solely on user-item interactions rather than item
characteristics. This approach assumes that users who liked similar items in the
past will continue to share preferences in the future.
Types of Collaborative Filtering
Collaborative Filtering can be classified into two main types:
1. User-Based Collaborative Filtering (UBCF):
o This approach finds users with similar tastes and recommends
items that those similar users have liked.
o Working:
Identify users similar to the target user based on past ratings.
Recommend items that these similar users have rated highly
but the target user has not yet interacted with.
o Example: If User A and User B have similar ratings for books, and
User A enjoys a new book, User B might get that book as a
recommendation.
2. Item-Based Collaborative Filtering (IBCF):
o This method focuses on similarities between items instead of users.
o Working:
Identify items that are frequently rated together by users.
Recommend similar items based on the user’s past
preferences.
o Example: If many users who liked "The Lord of the Rings" also
liked "The Hobbit," someone who liked "The Lord of the Rings"
might be recommended "The Hobbit."
Mathematical Foundation
1. User-Item Interaction Matrix:
o A sparse matrix where rows represent users, and columns represent
items. The values reflect ratings, views, or likes.
2. Similarity Measures:
Collaborative Filtering relies on similarity calculations, such as:
o Cosine Similarity: Measures the cosine of the angle between two rating
vectors.
o Pearson Correlation Coefficient: Measures the linear correlation between
two variables.
o Euclidean Distance: Measures the straight-line distance between two points
in space.
Prediction Formula (for User-Based CF):
The rating of user u for item i is commonly predicted as
r_hat(u, i) = r_bar(u) + [ sum over v of sim(u, v) * (r(v, i) - r_bar(v)) ] / [ sum over v of |sim(u, v)| ]
where the sums run over the users v most similar to u who have rated item i,
r_bar(u) is user u's average rating, and sim(u, v) is the chosen similarity
measure.
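A minimal NumPy sketch of user-based CF on a tiny hypothetical ratings matrix, using cosine similarity and the prediction formula above (0 marks an item not yet rated):

# User-based collaborative filtering on a tiny hypothetical ratings matrix.
import numpy as np

# rows = users, columns = items; 0 means "not yet rated"
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine_sim(a, b):
    mask = (a > 0) & (b > 0)                 # compare only co-rated items
    if not mask.any():
        return 0.0
    return a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]))

def predict(u, i):
    mean_u = R[u][R[u] > 0].mean()
    num, den = 0.0, 0.0
    for v in range(len(R)):
        if v == u or R[v, i] == 0:
            continue
        s = cosine_sim(R[u], R[v])
        num += s * (R[v, i] - R[v][R[v] > 0].mean())
        den += abs(s)
    return mean_u if den == 0 else mean_u + num / den

print("Predicted rating of user 0 for item 2:", round(predict(0, 2), 2))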
There are broadly four types of techniques used to build collaborative
filtering recommender systems:
Memory-Based
Model-Based
Hybrid
Deep Learning