
ML UCS-401

Topics: Distance Based Models, Neighbours and Examples; Nearest Neighbours Classification; Distance-Based Clustering: K-Means Algorithm; Hierarchical Clustering
Distance Based Models Neighbours and Examples

Distance-based models in machine learning rely on measuring the distance or similarity between data points to make predictions, cluster data, or retrieve relevant examples. These models assume that similar data points lie close to each other in the feature space.
Distance-Based Models:
1. k-Nearest Neighbors (k-NN): A non-parametric
classification and regression algorithm.
•Predicts based on the labels or values of the
nearest k points in the feature space.
•Common distance metrics (see the sketch after this list):
•Euclidean distance
•Manhattan distance
•Minkowski distance
•Cosine similarity
2. Radius Neighbors: Similar to k-NN but uses all
neighbors within a predefined radius for predictions
instead of a fixed k.
3. Support Vector Machines (SVM) with RBF
Kernel: Uses distance-based measures in its kernel
functions (e.g., Radial Basis Function) to find
optimal decision boundaries.
4. Clustering Algorithms:
•K-Means: Clusters points by minimizing within-
cluster distances to centroids.
•DBSCAN: Groups points based on density and
distance.
•Agglomerative Clustering: Forms clusters by
merging points or clusters based on linkage
distance.
5. Self-Organizing Maps (SOM): An unsupervised
learning method using distance-based mapping to
reduce dimensions and visualize data.
6. Instance-Based Learning Algorithms: Learning
by memorizing instances and using distance
metrics for prediction (e.g., Locally Weighted
Regression).
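A minimal sketch of the four distance/similarity measures listed above, using NumPy (the vectors x and y are made-up examples):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean distance: straight-line distance in feature space
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(x - y))

# Minkowski distance of order p (p=1 gives Manhattan, p=2 gives Euclidean)
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Cosine similarity: cosine of the angle between the two vectors
cosine_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(euclidean, manhattan, minkowski, cosine_sim)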
Examples of Applications:
1. k-NN in Image Recognition: Classifying images based on the majority label of the k nearest images in feature space.
2. Recommendation Systems: Using cosine similarity or Euclidean distance to recommend similar items.
3. Customer Segmentation with K-Means: Grouping customers based on purchasing behavior or demographics.
4. Anomaly Detection with DBSCAN: Identifying outliers as points with low-density neighborhoods.
5. RBF Kernel in SVM for Classification: Mapping non-linear data into a higher dimension using distance metrics for better separation.
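As a sketch of application 2 above, cosine similarity can rank stored items by closeness to a query item; the item names and feature vectors below are purely illustrative:

import numpy as np

# Hypothetical item feature vectors (e.g., ratings or content features)
items = {
    "item_a": np.array([5.0, 1.0, 0.0]),
    "item_b": np.array([4.0, 2.0, 0.0]),
    "item_c": np.array([0.0, 1.0, 5.0]),
}

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Recommend the stored item most similar to the query item
query = "item_a"
scores = {name: cosine_similarity(items[query], vec)
          for name, vec in items.items() if name != query}
print("Most similar to", query, "->", max(scores, key=scores.get))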
Nearest Neighbours Classification
Nearest Neighbors Classification is a simple,
instance-based learning algorithm that classifies a
data point based on the majority label of its closest
neighbors in the feature space. It assumes that
similar data points are near each other.
How it Works:
1. Training Phase: There is no explicit training; the algorithm simply stores the dataset.
2. Prediction Phase:
For a new data point:
•Compute the distance between the point and all
points in the training data.
•Identify the k closest neighbors (k = a predefined
number).
•Determine the majority class label among these
neighbors.
•Assign the majority class label to the new point.
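A minimal sketch of these prediction steps in plain NumPy; the toy dataset and k=3 are illustrative (libraries such as scikit-learn provide an optimized KNeighborsClassifier):

import numpy as np
from collections import Counter

# Toy training data: 2-D points with class labels 0 and 1
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 0, 1, 1])

def knn_predict(x_new, X, y, k=3):
    # Compute the Euclidean distance from the new point to every training point
    distances = np.sqrt(np.sum((X - x_new) ** 2, axis=1))
    # Identify the k closest neighbors
    nearest = np.argsort(distances)[:k]
    # Determine the majority class label among these neighbors
    majority_label, _ = Counter(y[nearest]).most_common(1)[0]
    # Assign the majority class label to the new point
    return majority_label

print(knn_predict(np.array([1.1, 0.9]), X_train, y_train, k=3))  # expected: 0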
Advantages
•Simple to understand and implement.
•No explicit training phase; efficient for small datasets.
•Non-parametric: No assumptions about data
distribution.
Disadvantages:
•Computationally expensive for large datasets
(requires computation of distances for all points).
•Sensitive to irrelevant or redundant features.
•Choice of k and the distance metric can greatly affect performance.
•Struggles with imbalanced datasets, as minority
classes may get outvoted.
Applications:
1. Image Classification: Classifying images based
on pixel-level features.
2. Text Categorization: Labeling text documents
using distance metrics in vectorized text space.
3. Anomaly Detection: Identifying outliers by
looking for points without enough neighbors in a
radius.
4. Medical Diagnosis: Predicting diseases based on symptoms and historical cases.
Distance-Based Clustering: K-Means Algorithm

K-Means is a popular distance-based clustering algorithm used for partitioning a dataset into k distinct clusters. It minimizes the variance within each cluster by iteratively updating cluster centroids and assigning data points to the nearest centroid.
How K-Means Works
1. Initialization:
•Select k, the number of clusters.
•Randomly initialize k centroids (cluster centers).
2. Iteration:
•Assign each data point to the nearest centroid
using a distance metric, typically Euclidean
distance.
•Compute new centroids by taking the mean of all
points assigned to each cluster.
3. Convergence: Repeat the assignment and update
steps until the centroids do not change significantly
or a maximum number of iterations is reached.
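A minimal sketch of the initialization, assignment, and update steps in plain NumPy (the two-blob toy data, k=2, and the iteration cap are illustrative; scikit-learn's KMeans adds smarter initialization such as k-means++):

import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs of 2-D points
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
k = 2

# 1. Initialization: pick k data points at random as the initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):  # cap on the number of iterations
    # 2a. Assignment: each point goes to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # 2b. Update: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # 3. Convergence: stop when the centroids no longer change
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)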
Advantages
•Simple to implement and computationally efficient.
•Scales well to large datasets.
•Works well for spherical clusters with similar sizes.
Disadvantages
•Requires predefining k, the number of clusters.
•Sensitive to initialization (different initial centroids can
lead to different results).
•Assumes clusters are spherical and evenly sized.
•Struggles with non-convex clusters or datasets with
varying densities.
Applications
1. Customer Segmentation: Grouping customers
based on purchasing behavior.
2. Image Compression: Quantizing colors in an
image.
3. Document Clustering: Grouping documents
with similar topics.
4. Anomaly Detection: Identifying data points
that do not fit any cluster.
Hierarchical clustering
Hierarchical clustering is an unsupervised machine
learning algorithm that groups data points into
clusters based on their similarity. It creates a
hierarchy or tree-like structure (dendrogram) to
represent the nested grouping of data at different
levels of granularity.
Types of Hierarchical Clustering
1. Agglomerative (Bottom-Up):
•Starts with each data point as its own cluster.
•Iteratively merges the closest clusters until a single cluster remains.
2. Divisive (Top-Down):
•Starts with all data points in one cluster.
•Recursively splits clusters into smaller clusters until each data point is its own cluster.
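A minimal sketch of agglomerative (bottom-up) clustering with SciPy; the toy data, the 'ward' linkage, and the choice of two flat clusters are illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data: two groups of 2-D points
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

# Build the merge tree bottom-up; 'ward' merges the pair of clusters
# whose merge increases within-cluster variance the least
Z = linkage(X, method="ward")

# Cut the tree into a chosen number of flat clusters (here 2);
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree for inspection
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)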
Advantages
•Does not require predefining the number of clusters
(k).
•Can capture hierarchical relationships among data
points.
•Works well for data with arbitrary shapes.
Disadvantages
•Computationally expensive for large datasets (at least O(n^2) in time and memory).
•Sensitive to noise and outliers.
•Requires interpretation of the dendrogram to decide the
number of clusters.
Applications
1. Document Clustering: Grouping text documents with similar
content.
2. Gene Expression Analysis: Clustering genes with similar
expression profiles.
3. Market Segmentation: Identifying customer segments based
on purchasing behavior.
4. Social Network Analysis: Detecting communities or groups.
