
CLASSIFICATION AND REGRESSION MODELS
Linear Segmentation and Decision region
Linear Segmentation

What it is: Imagine you have a bunch of data points on a graph, and you want to divide
them into groups (or "segments") based on their features. Linear segmentation uses
straight lines (in 2D) or hyperplanes (in higher dimensions) to create these divisions.

Why it's useful: It's a simple and efficient way to separate data when the groups have a
clear, linear boundary. Think of it like drawing a line to separate apples from oranges on a
table.

Examples:
• Image processing: Identifying edges in an image.
• Classification: Categorizing customers as likely to buy or not buy a product.
Linear Segmentation and Decision region (Contd.)
Decision Regions

What they are: The areas created by your segmentation lines. Each region represents a
specific category or class. If a new data point falls within a region, it's assigned to that
category.

How they work: The decision boundary (your line or hyperplane) is what separates the
regions. The goal is to have the decision boundary placed so that it correctly classifies as
many data points as possible.

Example: In a medical diagnosis scenario, one region might represent "healthy" and
another "needs further testing."
Linear Segmentation and Decision region (Contd.)
Key Concepts

•Linear Classifiers: Algorithms that use linear segmentation to create decision regions (e.g.,
Logistic Regression, Linear SVM).
•Feature Space: The space where your data points are plotted, with each axis representing
a feature.
•Training: The process of finding the best position for the decision boundary using labeled
data.
Important Notes
•Linear segmentation works best when the data is linearly separable (i.e., you can draw a
straight line to perfectly divide the groups).
•Real-world data is often more complex, requiring non-linear methods for accurate
segmentation.
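To make these concepts concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available, of a linear classifier learning a straight-line decision boundary; the toy data points are invented purely for illustration.

```python
# A minimal sketch of a linear classifier learning a decision boundary that
# splits the feature space into two decision regions (toy data, invented).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two features per point (a 2D feature space), with labels 0 and 1.
X = np.array([[1.0, 1.2], [1.5, 0.8], [2.0, 1.0],   # class 0
              [4.0, 3.5], [4.5, 4.0], [5.0, 3.8]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# "Training" = finding the best position for the decision boundary.
clf = LogisticRegression().fit(X, y)

# The boundary is the line w1*x1 + w2*x2 + b = 0.
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
print(f"Decision boundary: {w1:.2f}*x1 + {w2:.2f}*x2 + {b:.2f} = 0")

# A new point is assigned to whichever decision region it falls in.
print(clf.predict([[1.8, 1.1]]))  # expected [0]
print(clf.predict([[4.2, 3.9]]))  # expected [1]
```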
Linear Segmentation and Decision region (Contd.)
Explanation:
•Data points: Each flower is represented by a point
on the graph. Blue circles are "Iris" flowers, and
red squares are "Not Iris" flowers.
•Features: The x-axis represents sepal length, and
the y-axis represents petal width.
•Decision boundary: The straight line is the
decision boundary. It's what our linear
segmentation creates.
•Decision regions: The area above the line is the
"Iris" decision region, and the area below is the
"Not Iris" decision region.
•Classification: If a new flower has a sepal length
and petal width that fall in the "Iris" region, we
classify it as "Iris." Otherwise, it's "Not Iris."
Linear Discriminants
Def: Linear discriminants are a fundamental concept in machine learning, particularly in
classification tasks. They provide a way to separate data points into different categories
using linear decision boundaries. Here's a breakdown of what they are and how they work:

What are Linear Discriminants?


•Decision Boundaries: Imagine you have data points belonging to different classes (e.g., cats
vs. dogs). A linear discriminant aims to find a straight line (in 2D) or a hyperplane (in higher
dimensions) that best separates these classes. This line or hyperplane is called the decision
boundary.
•Linear Combination of Features: The decision boundary is defined by a linear combination
of the features of your data points. For example, if you have two features (x1 and x2), the
decision boundary might be represented by an equation like: w1x1 + w2x2 + b = 0, where
w1 and w2 are weights, and b is a bias term.
•Classification: To classify a new data point, you simply plug its features into the equation. If
the result is positive, it belongs to one class; otherwise, it belongs to the other.
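As a small illustration of that classification rule, the sketch below plugs a point's features into w1x1 + w2x2 + b and checks the sign; the weights, bias, and points are assumed values, not taken from any particular dataset.

```python
# Classify a point by the sign of w1*x1 + w2*x2 + b (assumed weights and bias).
import numpy as np

w = np.array([0.8, -1.2])  # weights w1, w2 (assumed values)
b = 0.5                    # bias term (assumed value)

def classify(x):
    score = np.dot(w, x) + b
    return "class A" if score > 0 else "class B"

print(classify(np.array([2.0, 1.0])))   # score = 0.9 > 0  -> class A
print(classify(np.array([0.5, 2.0])))   # score = -1.5 < 0 -> class B
```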
Linear Discriminants (Contd.)
How do Linear Discriminants Work?
1.Training Data: You start with labeled data, where you know the class of each data point.
2.Finding the Best Boundary: The goal is to find the weights (w1, w2, etc.) and bias (b) that
define the decision boundary that best separates the classes. This is typically done using
optimization algorithms.
3.Maximizing Separation: The algorithm tries to maximize the distance between the classes
and the decision boundary. This helps to improve the classifier's ability to generalize to new,
unseen data.

Types of Linear Discriminants


•Fisher's Linear Discriminant (FLD): A classic method that finds the linear combination of
features that maximizes the separation between classes.
•Perceptron: A simple algorithm that learns a linear decision boundary by iteratively
adjusting the weights based on misclassified data points.
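The perceptron's weight-update idea can be sketched in a few lines of NumPy; the data, labels (+1/-1), and learning rate below are invented for illustration.

```python
# A minimal perceptron sketch: weights and bias are adjusted only when a
# point is misclassified. Data and learning rate are invented.
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1, -1, 1, 1])

w = np.zeros(2)
b = 0.0
lr = 0.1  # learning rate

for epoch in range(20):
    for xi, yi in zip(X, y):
        # Misclassified if the predicted sign disagrees with the label.
        if yi * (np.dot(w, xi) + b) <= 0:
            w += lr * yi * xi
            b += lr * yi

print("weights:", w, "bias:", b)
print("prediction for [4.5, 4.5]:", np.sign(np.dot(w, [4.5, 4.5]) + b))
```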
Linear Discriminants (Contd.)
Advantages of Linear Discriminants
•Simple and Efficient: They are computationally inexpensive and easy to implement.
•Interpretability: The weights assigned to each feature provide insights into which features
are most important for classification.

Limitations of Linear Discriminants


•Linear Separability: They work best when the classes are linearly separable, meaning you
can draw a straight line or hyperplane to perfectly divide them.
•Complex Data: They may not perform well on complex datasets with non-linear
relationships between features and classes.

Applications of Linear Discriminants


•Pattern Recognition: Identifying objects in images or sounds.
•Medical Diagnosis: Classifying patients into different disease categories.
•Natural Language Processing: Categorizing text documents.
Linear Discriminants (Contd.)
Linear Discriminant Analysis (LDA) is a dimensionality reduction and
classification technique commonly used in machine learning and pattern
recognition. In the context of classification it aims to find a linear combination
of features that best separates different classes or categories of data. It seeks
to reduce the dimensionality of the feature space while preserving as much of
the class-separability information as possible.
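A brief sketch of LDA in practice, assuming scikit-learn is available, using its bundled Iris dataset purely as an example of both dimensionality reduction and classification:

```python
# LDA used as a dimensionality-reduction step and as a classifier.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)       # 4 features projected down to 2

print("original shape:", X.shape)          # (150, 4)
print("reduced shape:", X_reduced.shape)   # (150, 2)
print("training accuracy:", lda.score(X, y))  # typically around 0.98
```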
Linear Regression
Def: Linear regression is a fundamental and widely used algorithm in machine learning and
statistics. It's used for predicting a continuous outcome variable based on one or more
predictor variables.

What is Linear Regression?


•Predicting a Continuous Value: Linear regression aims to find the best-fitting linear
relationship between the predictor variables (also called independent variables or features)
and the outcome variable (also called the dependent variable or target). The outcome
variable is continuous, meaning it can take on any value within a range (e.g., house prices,
temperature, sales figures).
•Linear Relationship: The core assumption is that the relationship between the predictors
and the outcome can be modeled by a straight line (in simple linear regression with one
predictor) or a hyperplane (in multiple linear regression with more than one predictor).
•Finding the Best Fit: The algorithm learns the coefficients (weights) for each predictor
variable that minimize the difference between the predicted values and the actual values in
the training data. This difference is often measured using the mean squared error.
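As a quick illustration, the sketch below fits a simple linear regression with scikit-learn; the house-size and price numbers are invented so that the relationship is exactly price = 2 × size + 50.

```python
# Fitting a simple linear regression: the learned coefficient and intercept
# define the best-fit line. The numbers are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([[50], [80], [110], [140], [170]])   # predictor (m^2)
price = np.array([150, 210, 270, 330, 390])          # outcome (in thousands)

model = LinearRegression().fit(size, price)
print("coefficient (slope):", model.coef_[0])    # ~2.0 per m^2
print("intercept:", model.intercept_)            # ~50
print("predicted price for 100 m^2:", model.predict([[100]])[0])  # ~250
```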
Linear Regression (Contd.)
Types of Linear Regression:
•Simple Linear Regression: One predictor variable. Example: Predicting house prices based
on the size of the house.
•Multiple Linear Regression: Two or more predictor variables. Example: Predicting house
prices based on size, number of bedrooms, and location.

Cost Function (Mean Squared Error):


The mean squared error (MSE) measures the average squared difference between
the predicted values and the actual values. The goal is to minimize the MSE.
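A tiny sketch of the MSE computation, with invented actual and predicted values:

```python
# Mean squared error: the average of the squared differences between
# actual and predicted values. Numbers are invented for illustration.
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.5, 5.5, 6.0, 9.5])

# MSE = (1/n) * sum((y_actual - y_predicted)^2)
mse = np.mean((y_actual - y_predicted) ** 2)
print(mse)  # (0.25 + 0.25 + 1.0 + 0.25) / 4 = 0.4375
```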
Linear Regression (Contd.)
Advantages of Linear Regression:
•Simple and Easy to Understand: Linear regression is relatively easy to understand and
interpret.
•Computationally Efficient: Training and prediction are fast, even with large datasets.
•Widely Available: Linear regression is implemented in almost all statistical software and
machine learning libraries.

Limitations of Linear Regression:


•Linearity Assumption: It assumes a linear relationship between the predictors and the
outcome. If the relationship is non-linear, linear regression may not perform well.
•Sensitivity to Outliers: Outliers can significantly affect the regression line.
•Overfitting: With too many predictor variables, the model can overfit the training data and
not generalize well to new data.
Linear Regression (Contd.)
In the given figure,

X-axis = Independent variable

Y-axis = Output / dependent variable

Line of regression = Best fit line for a


model

Here, a line is plotted for the given data


points that suitably fit all the issues.
Hence, it is called the ‘best fit line.’ The
goal of the linear regression algorithm is to
find this best fit line seen in the above
figure.
Logistic Regression
Def: Logistic regression is a powerful and widely used algorithm in machine learning for
classification tasks. Unlike linear regression, which predicts continuous values, logistic
regression predicts the probability of an instance belonging to a certain class.

What is Logistic Regression?


•Classification: Logistic regression is used when the outcome variable is categorical,
meaning it belongs to a set of distinct categories (e.g., spam or not spam, cat or dog, disease
or no disease).
•Probability: The output of logistic regression is a probability between 0 and 1, representing
the likelihood of an instance belonging to a particular class.
•Sigmoid Function: The core of logistic regression is the sigmoid function, which takes any
real-valued number as input and outputs a value between 0 and 1. This function is what
allows us to interpret the output as a probability.
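A short sketch of the sigmoid function and how it maps any real-valued score to a probability (NumPy assumed):

```python
# The sigmoid function turns any real-valued score into a value in (0, 1).
import numpy as np

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5    (no evidence either way)
print(sigmoid(4))    # ~0.982 (strong evidence for the positive class)
print(sigmoid(-4))   # ~0.018 (strong evidence for the negative class)

# In logistic regression, z is a linear combination of the features:
# z = w1*x1 + w2*x2 + ... + b, and sigmoid(z) is the predicted probability.
```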
Logistic Regression (Contd.)
Types of Logistic Regression:
•Binary Logistic Regression: The outcome variable has only two possible classes (e.g., spam
or not spam).
•Multinomial Logistic Regression: The outcome variable has more than two possible classes
(e.g., classifying images as cat, dog, or bird).

Advantages of Logistic Regression:


•Simple and Easy to Understand: Logistic regression is relatively easy to understand and
interpret.
•Efficient: Training and prediction are fast, even with large datasets.
•Probabilistic Output: The output is a probability, which provides more information than
just a class label.
Logistic Regression (Contd.)

Limitations of Logistic Regression:


•Linearity Assumption: Logistic regression assumes a linear relationship between the
features and the log-odds of the outcome.
•Sensitivity to Outliers: Outliers can significantly affect the model.
•Overfitting: With too many features, the model can overfit the training data.

Applications of Logistic Regression:


•Medical Diagnosis: Predicting the likelihood of a patient having a certain disease.
•Marketing: Predicting customer churn or the likelihood of a customer clicking on an ad.
•Finance: Predicting loan defaults or credit card fraud.
•Natural Language Processing: Classifying text documents or identifying spam emails.
Decision Trees
Def: Decision trees are a widely used machine learning algorithm that can be used for both classification and regression tasks. These models work by repeatedly splitting the data into subsets based on feature values; each split represents a decision, and each leaf node gives a prediction. This splitting creates a tree-like structure. Decision trees are easy to interpret and visualize, which makes the decision-making process easy to understand.

Types of Decision Tree Algorithms


The different decision tree algorithms are listed below:
•ID3 (Iterative Dichotomiser 3)
•C4.5
•CART (Classification and Regression Trees)
Decision Trees (Contd.)
A decision tree is a simple diagram that shows different choices and their possible results, helping you make decisions easily. This section covers what decision trees are, how they work, their advantages and disadvantages, and their applications.

Understanding Decision Tree


A decision tree is a graphical representation of different options for solving a problem and shows how different factors are related. It has a hierarchical tree structure that starts with one main question at the top, called a node, which further branches out into different possible outcomes, where:
•Root Node is the starting point that represents the entire dataset.
•Branches: These are the lines that connect nodes. They show the flow from one decision to another.
•Internal Nodes are points where decisions are made based on the input features.
•Leaf Nodes: These are the terminal nodes at the end of branches that represent final outcomes or predictions.
Decision Trees (Contd.)
They also support decision-making by
visualizing outcomes. You can quickly
evaluate and compare the “branches” to
determine which course of action is best for
you.
Now, let’s take an example to understand the decision tree. Imagine you want to decide whether to drink coffee based on the time of day and how tired you feel. First the tree checks the time of day: if it’s morning, it asks whether you are tired. If you’re tired, the tree suggests drinking coffee; if not, it says there’s no need. Similarly, in the afternoon the tree again asks if you are tired. If you are, it recommends drinking coffee; if not, it concludes no coffee is needed.
Decision Trees (Contd.)
Classification of Decision Tree
We have mainly two types of decision tree based on the nature of the target
variable: classification trees and regression trees.
•Classification trees: They are designed to predict categorical outcomes, meaning they classify data into different classes. For example, they can determine whether an email is “spam” or “not spam” based on various features of the email.
•Regression trees: These are used when the target variable is continuous. They predict numerical values rather than categories. For example, a regression tree can estimate the price of a house based on its size, location, and other features.
Decision Trees (Contd.)
How Decision Trees Work?
 A decision tree starts with a main question known as the root node. This question is derived from the features of the dataset and serves as the starting point for decision-making.
 From the root node, the tree asks a series of yes/no questions. Each question is designed to split the data into subsets based on specific attributes.
 This branching continues through a sequence of decisions. As you follow each branch, you get more questions that break the data into smaller groups. This step-by-step process continues until there are no more helpful questions to ask.
 You reach the end of a branch, where you find the final outcome or decision. It could be a classification (like “spam” or “not spam”) or a prediction (such as an estimated price).
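As an illustration of this root-to-leaf questioning process, here is a minimal sketch assuming scikit-learn, trained on its bundled Iris dataset; the depth limit is an arbitrary choice for readability.

```python
# Train a decision tree classifier and print its question/answer structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each internal line below is a question on a feature; leaves carry the class.
print(export_text(tree, feature_names=load_iris().feature_names))

# Classifying a new flower follows one root-to-leaf path in that printout.
print(tree.predict([[5.1, 3.5, 1.4, 0.2]]))  # expected [0] (setosa)
```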
ID3 Algorithm
 The ID3 algorithm is a popular decision tree algorithm used in machine learning. It aims
to build a decision tree by iteratively selecting the best attribute to split the data based
on information gain. Each node represents a test on an attribute, and each branch
represents a possible outcome of the test. The leaf nodes of the tree represent the final
classifications.
 It is a greedy algorithm that builds a decision tree by recursively partitioning the data set
into smaller and smaller subsets until all data points in each subset belong to the same
class.
 The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm used for classification tasks. ID3 deals primarily with categorical attributes, which means that it can efficiently handle data with a discrete set of values.
 One of the strengths of ID3 is its ability to generate interpretable decision trees. The
resulting tree structure is easily understood and visualized, providing insight into the
decision-making process.
ID3 Algorithm (Contd.)

 The ID3 algorithm works by building a decision tree, which is a hierarchical structure that
classifies data points into different categories and splits the dataset into smaller subsets
based on the values of the features in the dataset.
 The ID3 algorithm then selects the feature that provides the most information about the
target variable.
 The decision tree is built top-down, starting with the root node, which represents the
entire dataset.
 At each node, the ID3 algorithm selects the attribute that provides the most information
gain about the target variable.
 The attribute with the highest information gain is the one that best separates the data
points into different categories.
ID3 Algorithm (Contd.)
ID3 metrics

The ID3 algorithm utilizes metrics related to information theory, particularly entropy and information
gain, to make decisions during the tree-building process.
Information Gain and Attribute Selection
The ID3 algorithm uses a measure of impurity, entropy, to calculate the information gain of each attribute. Entropy is a measure of disorder in a dataset. A dataset with high entropy is one where the data points are evenly distributed across the different categories. A dataset with low entropy is one where the data points are concentrated in one or a few categories.

If entropy is low, data is well understood; if high, more information is needed. Preprocessing data
before using ID3 can enhance accuracy. In sum, ID3 seeks to reduce uncertainty and make informed
decisions by picking attributes that offer the most insight in a dataset.
ID3 Algorithm (Contd.)
Information gain assesses how much valuable information an attribute can provide. We select the
attribute with the highest information gain, which signifies its potential to contribute the most to
understanding the data. If information gain is high, it implies that the attribute offers a significant
insight. ID3 acts like an investigator, making choices that maximize the information gain in each step.
This approach aims to minimize uncertainty and make well-informed decisions, which can be further
enhanced by preprocessing the data.
ID3 Algorithm (Contd.)
What are the steps in the ID3 algorithm?

1. Determine the entropy of the overall dataset using the class distribution.

2. For each feature:

   a. Calculate the entropy for each of its categorical values.

   b. Assess the information gain obtained by splitting on that feature.

3. Choose the feature that generates the highest information gain.

4. Apply the above steps recursively to build the decision tree structure.
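The following rough sketch, assuming NumPy and pandas are available, implements the two quantities these steps rely on, entropy and information gain; the tiny weather-style table is invented for illustration.

```python
# Entropy of a label column and information gain of splitting on a feature,
# the two quantities ID3 uses to choose the next split. Data is invented.
import numpy as np
import pandas as pd

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the class proportions p_i.
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(df, feature, target):
    # Gain = entropy(parent) - weighted average entropy of the child subsets.
    parent = entropy(df[target])
    weighted_children = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return parent - weighted_children

data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
    "windy":   ["no", "yes", "no", "no", "yes", "yes"],
    "play":    ["no", "no", "yes", "yes", "no", "yes"],
})

# ID3 would pick the feature with the highest gain as the next split.
for feature in ["outlook", "windy"]:
    print(feature, round(information_gain(data, feature, "play"), 3))
```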
ID3 Algorithm (Contd.)
How ID3 Works:
The ID3 algorithm is specifically designed for building decision trees from a given dataset. Its primary
objective is to construct a tree that best explains the relationship between attributes in the data and
their corresponding class labels.
1. Selecting the Best Attribute
 ID3 employs the concept of entropy and information gain to determine the attribute that best
separates the data. Entropy measures the impurity or randomness in the dataset.
 The algorithm calculates the entropy of each attribute and selects the one that results in the most
significant information gain when used for splitting the data.
2. Creating Tree Nodes
 The chosen attribute is used to split the dataset into subsets based on its distinct values.
 For each subset, ID3 recurses to find the next best attribute to further partition the data, forming
branches and new nodes accordingly.
3. Stopping Criteria
The recursion continues until one of the stopping criteria is met, such as when all instances in a
branch belong to the same class or when all attributes have been used for splitting.
ID3 Algorithm (Contd.)
4. Handling Missing Values
ID3 can handle missing attribute values by employing various strategies like attribute mean/mode
substitution or using majority class values.

5. Tree Pruning
Pruning is a technique to prevent overfitting. While not directly included in ID3, post-processing
techniques or variations like C4.5 incorporate pruning to improve the tree's generalization.
ID3 Algorithm (Contd.)
Mathematical Concepts of ID3 Algorithm
ID3 Algorithm (Contd.)
Advantages of ID3
•Simple and easy to understand.
•Requires little training data.
•Works well with data with discrete (categorical) attributes.

Disadvantages of ID3
•Can lead to overfitting.
•May not be effective with data with many attributes.

Applications of ID3
1.Fraud detection: ID3 can be used to develop models that can detect fraudulent transactions or
activities.
2.Medical diagnosis: ID3 can be used to develop models that can diagnose diseases or medical
conditions.
3.Customer segmentation: ID3 can be used to segment customers into different groups based on
their demographics, purchase history, or other factors.
4.Risk assessment: ID3 can be used to assess risk in a variety of different areas, such as insurance,
finance, and healthcare.
C4.5 Algorithm
Introduction:
 The C4.5 algorithm's fundamental building block, decision trees provide the framework
for its categorization procedure. These trees depict a structure that is hierarchical and
akin to a flowchart, with each internal node signifying an attribute test, each branch
designating the test's result, and every leaf node designating a class name.
 The decision tree in C4.5 is built iteratively, splitting the dataset at each stage based on the best attribute. The optimal attribute is chosen based on metrics such as information gain or gain ratio, which gauge how well an attribute reduces confusion about the class labels.
 Decision trees, however, are susceptible to overfitting, a phenomenon in which the model mistakes noise in its training data for real patterns. C4.5 uses pruning approaches to increase the tree's generalisation efficiency on unseen data and to lessen the impact of this problem.
C4.5 Algorithm (Contd.)
 C4.5 uses a modified version of information gain called the gain ratio to reduce the bias towards features with many values. The gain ratio is computed by dividing the information gain by the intrinsic information, which measures the amount of data required to describe an attribute's values:
   Gain Ratio = Information Gain / Intrinsic (Split) Information

 It addresses several limitations of ID3 including its inability to handle continuous


attributes and its tendency to overfit the training set. It handles continuous attributes by
first sorting the attribute values and then selecting the midpoint between adjacent values
as a potential split point. The split that maximizes information gain or gain ratio is chosen.
 It can also generate rules from the decision tree by converting each path from the root to
a leaf into a rule, which can be used to make predictions on new data.
 This algorithm improves accuracy and reduces overfitting by using gain ratio and post-
pruning. While effective for both discrete and continuous attributes, C4.5 may still
struggle with noisy data and large feature sets.
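Two of the ideas above, the gain ratio and the midpoint split points for continuous attributes, can be sketched as follows; this is an illustrative implementation with invented data, not the exact C4.5 code.

```python
# Gain ratio (information gain / split information) and candidate split
# points for a continuous attribute (midpoints between sorted adjacent values).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def gain_ratio(values, labels):
    # Information gain of splitting on a categorical attribute ...
    parent = entropy(labels)
    total = len(labels)
    children = 0.0
    split_info = 0.0
    for v in np.unique(values):
        mask = values == v
        weight = mask.sum() / total
        children += weight * entropy(labels[mask])
        split_info += -weight * np.log2(weight)
    info_gain = parent - children
    # ... divided by the split information (the attribute's intrinsic information).
    return info_gain / split_info if split_info > 0 else 0.0

def candidate_split_points(continuous_values):
    # C4.5 sorts the values and considers midpoints between adjacent ones.
    v = np.sort(np.unique(continuous_values))
    return (v[:-1] + v[1:]) / 2.0

labels = np.array(["yes", "yes", "no", "no", "yes"])
outlook = np.array(["sunny", "overcast", "sunny", "rain", "overcast"])
temperature = np.array([21.0, 25.0, 30.0, 18.0, 23.0])

print("gain ratio (outlook):", round(gain_ratio(outlook, labels), 3))
print("candidate thresholds:", candidate_split_points(temperature))
```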
C4.5 Algorithm (Contd.)
C4.5 Pruning Techniques:
1. Reduced Error Pruning:
With this method, the decision tree is traversed recursively from bottom to top, and the effect of removing each subtree is assessed on a validation dataset. A subtree is pruned (replaced with a single leaf node) if its removal results in better performance or no appreciable decline in performance.
2. Rule Post-Pruning:
Rather than directly pruning subtrees, C4.5 builds the decision tree and converts it into a set of rules. These rules are then simplified, using their estimated accuracy on a validation dataset as a basis. Pruning here means removing rules that cause overfitting or do not improve classification performance.
C4.5 Algorithm (Contd.)
3. The Minimum Description Length (MDL) principle:
When determining when to stop growing the decision tree, C4.5 can apply the MDL principle as a guideline. This principle strikes a compromise between the model's complexity and how well it fits the data: the tree is expanded only while additional partitioning meaningfully reduces the overall description length (a gauge of model complexity plus fit).
4. Subtree Replacement:
If a subtree's error rate is not appreciably higher than that of a single leaf node, C4.5 replaces the whole subtree with that leaf node. In doing so, the decision tree's simplicity and predictive accuracy are maintained.
C4.5 Algorithm (Contd.)
How the C4.5 Algorithm Operates:
1. Starting Point:
The method starts with the complete dataset, which it treats as the decision tree's root node. Every instance in the dataset is a data point with associated attributes (features) and a class label.
2. Selection of Attributes:
At each decision tree node, C4.5 determines which attribute is optimal for partitioning the dataset. For every attribute, it computes a metric, usually the information gain or gain ratio. These measures indicate how well an attribute reduces ambiguity regarding the class labels. For the present node, the splitting criterion is the attribute with the greatest information gain or gain ratio.
C4.5 Algorithm (Contd.)
3. Dividing the Dataset:
Following attribute selection, the dataset is partitioned into subsets according to the attribute's possible values. For a categorical attribute, each subset corresponds to a unique attribute value; for continuous attributes, C4.5 establishes an appropriate threshold to separate the data into subsets.
4. Recursive Tree Building:
The method applies attribute selection and dataset splitting recursively to every subset produced in the preceding stage. This procedure keeps going until any of the following requirements is satisfied:
 All instances inside a subset are members of the same class, in which case a leaf node is produced.
 There are no more attributes that can be used for splitting.
 The tree has reached a specified maximum depth.
 The number of instances in a subset falls below a specified threshold.
C4.5 Algorithm (Contd.)
5. Pruning:
Pruning is done once the tree is fully grown in order to lessen overfitting. It simplifies the tree by eliminating nodes or branches that do not considerably increase prediction accuracy. Reduced error pruning, rule post-pruning, and subtree replacement are common pruning methods.
6. Results:
A collection of categorization rules is represented by the decision tree that is produced.
Every leaf node in the structure of the tree matches a class label, and every internal node in
this tree represents a choice made in response to an attribute. The requirements needed to
categorise an instance are represented by the path that leads from a root node to the leaf
node.
C4.5 Algorithm (Contd.)
7. Classification:
To classify a new instance, C4.5 traverses the decision tree from the root node to a leaf node based on the instance's attribute values. The algorithm assesses the attribute condition at each internal node and proceeds down the relevant branch until it arrives at a leaf node. The instance is assigned the class label corresponding to the leaf node reached during traversal.
C4.5 Algorithm (Contd.)
Splitting Criteria:
1. Information Gain:
 Information gain quantifies how well an attribute reduces ambiguity regarding the class labels.
 It is computed by contrasting the dataset's entropy (impurity) before and after the split on the attribute.
 Entropy measures the dataset's degree of disorder and uncertainty. Lower entropy indicates greater homogeneity of class labels within the subsets.
 The following formula is used to compute information gain:
   Information Gain = Entropy (before split) − Weighted average of entropies (after split)
 For the present node, the splitting criterion is the attribute that yields the greatest information gain.
C4.5 Algorithm (Contd.)
2. Gain Ratio:
 Although information gain is useful, it favours attributes with a high number of values.
 The gain ratio, a modification of information gain that penalises attributes with many distinct values, addresses this bias.
 It is computed by dividing the information gain by the split information, a measure of the attribute's intrinsic information.
 To get the gain ratio, use this formula:
   Gain Ratio = Information Gain / Split Information
 The split information is computed from the entropy (or another impurity measure) of the attribute's own value distribution.
C4.5 Algorithm (Contd.)
Summary:
 To sum up, the C4.5 algorithm is an effective method for building decision trees in classification applications.
 Using splitting criteria such as information gain or gain ratio, it chooses attributes that maximise the decrease in ambiguity around class labels.
 With the use of pruning strategies to avoid overfitting and repeated dataset division, C4.5 produces interpretable decision trees that can effectively categorise cases.
 Notwithstanding its efficacy, C4.5 can be constrained by issues such as biased tree building and susceptibility to noisy data.
 Even so, it continues to be a fundamental algorithm in the field of machine learning,
helping to comprehend and create more sophisticated categorization methods.
K-Nearest Neighbor(KNN) Algorithm
Getting Started with K-Nearest Neighbors
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs the computation at the time of classification.
As an example, consider the following table of data points containing two features:
K-Nearest Neighbor(KNN) Algorithm (Contd.)
The new point is classified as Category 2 because most of its closest neighbors are blue
squares. KNN assigns the category based on the majority of nearby points.

The image shows how KNN predicts the category of a new data point based on its closest
neighbours.
•The red diamonds represent Category 1 and the blue squares represent Category 2.
•The new data point checks its closest neighbours (circled points).
•Since the majority of its closest neighbours are blue squares (Category 2) KNN predicts the
new data point belongs to Category 2.

KNN works by using proximity and majority voting to make predictions.


K-Nearest Neighbor(KNN) Algorithm (Contd.)
What is ‘k’ in K-Nearest Neighbours?
In the k-Nearest Neighbours (k-NN) algorithm k is just a number that tells the algorithm
how many nearby points (neighbours) to look at when it makes a decision.

Example:
Imagine you’re deciding which fruit it is based on its shape and size. You compare it to fruits
you already know.
•If k = 3, the algorithm looks at the 3 closest fruits to the new one.
•If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an
apple because most of its neighbours are apples.
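The fruit example can be sketched directly in NumPy; the (size, roundness) feature values below are invented for illustration.

```python
# The k = 3 fruit example: find the 3 nearest known fruits by Euclidean
# distance and take a majority vote. Feature values are invented.
import numpy as np
from collections import Counter

known_features = np.array([[7.0, 0.9], [7.5, 0.95], [6.8, 0.85],  # apples
                           [18.0, 0.3]])                          # banana
known_labels = ["apple", "apple", "apple", "banana"]

new_fruit = np.array([7.2, 0.9])
k = 3

# Distance from the new fruit to every known fruit.
distances = np.linalg.norm(known_features - new_fruit, axis=1)

# Labels of the k closest fruits, then a majority vote.
nearest = np.argsort(distances)[:k]
votes = Counter(known_labels[i] for i in nearest)
print(votes.most_common(1)[0][0])  # expected "apple"
```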
K-Nearest Neighbor(KNN) Algorithm (Contd.)
How to choose the value of k for KNN Algorithm?
The value of k is critical in KNN as it determines the number of neighbors to consider when
making predictions. Selecting the optimal value of k depends on the characteristics of the
input data.

If the dataset has significant outliers or noise, a higher k can help smooth out the predictions and reduce the influence of noisy data. However, choosing a very high value can lead to underfitting, where the model becomes too simplistic.
K-Nearest Neighbor(KNN) Algorithm (Contd.)
Statistical Methods for Selecting k:

 Cross-Validation: A robust method for selecting the best k is to perform cross-validation. This involves splitting the data into several folds, training the model on some folds and testing it on the remaining one, repeating this for each fold. The value of k that results in the highest average validation accuracy is usually the best choice (see the sketch after this list).
 Elbow Method: In the elbow method we plot the model's error rate or accuracy for different values of k. As we increase k, the error usually decreases initially; however, after a certain point the error rate starts to decrease more slowly. The point where the curve forms an “elbow” is considered the best k.
 Odd Values for k: It’s also recommended to choose an odd value for k especially in
classification tasks to avoid ties when deciding the majority class.
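Here is a short sketch of the cross-validation approach from the first bullet, assuming scikit-learn and its bundled Iris dataset; the candidate values of k are arbitrary odd numbers.

```python
# Pick k by cross-validation: try several odd values of k and keep the one
# with the highest average validation accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: average accuracy over the held-out folds.
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()})
print("best k:", best_k)
```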
K-Nearest Neighbor(KNN) Algorithm (Contd.)
Distance Metrics Used in KNN Algorithm
KNN uses distance metrics to identify the nearest neighbours; these neighbours are then used for the classification or regression task. To identify the nearest neighbours we use the distance metrics below:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between two points in a plane or
space. You can think of it like the shortest path you would walk if you were to go directly
from one point to another.
Euclidean distance (p=2): This is the most commonly used distance measure, and it is limited
to real-valued vectors. Using the below formula, it measures a straight line between the
query point and the other point being measured.
K-Nearest Neighbor(KNN) Algorithm (Contd.)
2. Manhattan Distance
This is the total distance you would travel if you could only move along horizontal and
vertical lines (like a grid or city streets). It’s also called “taxicab distance” because a taxi can
only drive along the grid-like streets of a city.

3. Minkowski distance:
This distance measure is the generalized form of Euclidean and Manhattan distance metrics.
The parameter, p, in the formula below, allows for the creation of other distance metrics.
Euclidean distance is represented by this formula when p is equal to two, and Manhattan
distance is denoted with p equal to one.
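All three metrics can be expressed through the single Minkowski formula; the sketch below assumes the standard definitions and uses two invented points.

```python
# Minkowski distance with parameter p: p = 1 gives Manhattan, p = 2 gives Euclidean.
import numpy as np

def minkowski(a, b, p):
    # (sum |a_i - b_i|^p)^(1/p)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print("Euclidean:", minkowski(a, b, p=2))          # sqrt(3^2 + 4^2) = 5.0
print("Manhattan:", minkowski(a, b, p=1))          # |3| + |4| = 7.0
print("Minkowski (p=3):", round(minkowski(a, b, p=3), 3))
```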
