
K-Nearest Neighbours (KNN): An Introduction
In this presentation, we'll explore the K-Nearest Neighbours
algorithm, a simple yet powerful tool for both classification and
regression tasks. KNN is non-parametric, meaning it makes no
assumptions about the underlying data distribution. It's also a lazy
learner, as it doesn't explicitly learn a model during the training
phase. Instead, it stores the entire dataset and performs
calculations at the time of prediction. Join us as we unravel the
intricacies of KNN and discover its potential!
How KNN Works
1 Store the Dataset
The algorithm begins by storing the entire dataset, which will serve as its
reference point for future predictions.

2 Calculate Distances
When a new data point arrives, KNN calculates the distance between this point
and every other point in the existing dataset.

3 Select K-Nearest Neighbours
Based on the calculated distances, the algorithm selects the 'K' nearest
neighbours to the new data point.

4 Assign Class or Predict Value
For classification, KNN assigns the class label based on the majority class among
the K neighbours. For regression, it predicts the value based on the average (or
weighted average) of the K neighbours.

For example, imagine you have a new customer. KNN helps decide which customer
segment they belong to based on the characteristics of their nearest neighbours in the
dataset.
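To make these four steps concrete, here is a minimal from-scratch sketch of KNN classification in Python. The customer features, labels, and function name are illustrative assumptions, not part of the original slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbours."""
    # Step 2: distance from the new point to every stored point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 3: indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority class among those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical customer segments: features are [age, monthly spend]
X_train = np.array([[25, 200], [30, 250], [45, 800], [50, 900]])
y_train = np.array(["budget", "budget", "premium", "premium"])
print(knn_predict(X_train, y_train, np.array([28, 220])))  # -> "budget"
```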
Distance Metrics: Measuring Distances

• Euclidean Distance: The straight-line distance between two points in Euclidean space.
• Manhattan Distance: The sum of the absolute differences between the coordinates of two points.
• Minkowski Distance: A generalization of both Euclidean and Manhattan distances.
• Hamming Distance: The number of positions at which the corresponding symbols are different (for categorical data).

Choosing the right distance metric is crucial for KNN's performance,
and it depends on the nature of your data. Each metric captures
distance in a unique way, impacting the algorithm's decision-making
process.
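As a quick illustration of how the metrics differ, the sketch below computes each one for a pair of sample points using SciPy; the vectors and categorical values are made up for demonstration:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))       # straight-line distance
print(distance.cityblock(a, b))       # Manhattan: sum of absolute differences
print(distance.minkowski(a, b, p=3))  # p=1 gives Manhattan, p=2 gives Euclidean

# Hamming distance: count of mismatched positions (categorical data)
s1 = ["red", "small", "cotton"]
s2 = ["red", "large", "wool"]
print(sum(x != y for x, y in zip(s1, s2)))  # -> 2
```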
Choosing the Right 'K': How Many Neighbours?
Small K
Sensitive to noise, leading to a complex and potentially overfitting decision boundary. With too few neighbors, even a
single noisy data point can significantly influence the classification.

Large K
Smoother, more generalized decision boundary, but may misclassify data points, especially in regions with local
variations. It can mask minority classes.

Optimal K
Balances noise sensitivity and misclassification potential, creating a robust and accurate model. This is the sweet spot.

Finding the right 'K' is essential for KNN's success. As a rule of thumb, K can be set to the square root of the number of data points.
Techniques like cross-validation (such as k-fold cross-validation) and the Elbow method can help you find the optimal K for your
specific dataset. For example, if you're dealing with sparse customer data with many outliers, a slightly larger 'K' might be better to
reduce the impact of those outliers. Conversely, for dense datasets where local patterns are important, a smaller 'K' could be more
appropriate.
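One common way to search for the optimal K is k-fold cross-validation. The sketch below scores a range of K values with scikit-learn; the Iris dataset and the range 1 to 25 are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each candidate K
scores = {}
for k in range(1, 26):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best K = {best_k}, accuracy = {scores[best_k]:.3f}")
```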
KNN Implementation: Practical Considerations
Data Pre-processing
Scaling and normalisation are critical to prevent features with larger ranges from dominating the distance calculation.

Categorical Features
Use one-hot encoding to convert categorical features into numerical data.

Missing Values
Employ imputation techniques to handle missing values in your dataset. This maintains the quality of the data and ensures the algorithm gives correct results.

Distance Metrics
Experiment with different metrics like Euclidean, Manhattan, or Minkowski distance to find the one that best fits your data.
The choice of distance metric is very important for the performance of KNN.

Keep in mind the computational complexity of KNN, which can be slow for large datasets. Consider KD-trees and ball trees for
optimisation.
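The considerations above can be wired into a single scikit-learn pipeline. This is one possible arrangement on a hypothetical customer table; the column names and values are invented for the example:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one missing numeric value, one categorical column
df = pd.DataFrame({
    "income": [30000, 45000, None, 80000],
    "age": [25, 32, 40, 51],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
})
y = ["budget", "budget", "premium", "premium"]

preprocess = ColumnTransformer(
    transformers=[
        # Impute missing numbers, then scale so large-range features
        # (income) do not dominate the distance calculation
        ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                          ("scale", StandardScaler())]), ["income", "age"]),
        # One-hot encode categorical features into numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0.0,  # force dense output for the KD-tree below
)

knn = Pipeline([
    ("prep", preprocess),
    # A KD-tree index speeds up neighbour search on larger datasets
    ("model", KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")),
])
knn.fit(df, y)
```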
Comparison with Other Algorithms

KNN
Pros: Simple to implement, versatile, no training phase, adapts to new data.
Cons: Computationally expensive, sensitive to irrelevant features, requires large memory for storing training data.

Decision Tree
Pros: Easy to interpret, handles non-linear data, can capture complex relationships, requires minimal data preparation.
Cons: Prone to overfitting, high variance, can be unstable, biased towards dominant classes.

SVM
Pros: Effective in high-dimensional spaces, memory efficient, robust to outliers, good generalization capabilities.
Cons: Difficult to interpret, parameter tuning required, can be computationally intensive, sensitive to kernel selection.

Logistic Regression
Pros: Simple, efficient, easy to interpret, provides probability estimates, computationally inexpensive.
Cons: Assumes linearity, sensitive to outliers, can underperform with complex data, requires careful feature engineering.
Evaluation Metrics for Classification

• Accuracy: Overall correctness of the model.
• Precision: Ability to predict positive outcomes correctly.
• Recall: Ability to identify all positive outcomes.
• F1-Score: Harmonic mean of precision and recall.

Use a Confusion Matrix to visualize correct and incorrect predictions.
In real-world scenarios, such as predicting loan defaults, focus on
recall to minimize false negatives. These metrics provide valuable
insights into your model's performance.
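The sketch below computes these four metrics plus the confusion matrix with scikit-learn, using made-up labels where 1 stands for a loan default:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Illustrative labels: 1 = loan default, 0 = repaid
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))  # minimise missed defaults
print("f1-score :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```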
Evaluation Metrics for Regression

• MSE: Mean Squared Error.
• RMSE: Root Mean Squared Error.
• MAE: Mean Absolute Error.

R-squared, also known as the coefficient of determination, measures the
explained variance in your regression model. If you're predicting house prices,
RMSE tells you the average prediction error in rupees, helping you understand
the model's accuracy in a practical context. These metrics provide a
comprehensive evaluation of regression model performance.
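Here is a short worked example of these metrics, with invented house prices (in lakh rupees) standing in for real predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs predicted house prices (lakh rupees)
y_true = np.array([50.0, 72.0, 65.0, 90.0])
y_pred = np.array([48.0, 75.0, 60.0, 93.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is in the same units as the target (rupees)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)  # coefficient of determination
print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R^2={r2:.3f}")
```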
KNN in Action: Use Cases in India

1. Healthcare: Disease prediction based on patient data.
2. Finance: Credit risk assessment and fraud detection.
3. E-commerce: Recommendation systems and customer segmentation.
4. Agriculture: Crop yield prediction and soil classification.

Specific to India, KNN can be used to predict monsoon patterns based on historical weather data, assisting farmers
in making informed decisions about planting and harvesting. Its adaptability makes it a valuable tool across diverse
sectors.
Strengths and Weaknesses of KNN
Advantages
• Simple to understand and implement.
• Versatile: can be used for both classification and regression.
• No training period required.

Disadvantages
• Computationally expensive for large datasets.
• Sensitive to irrelevant features.
• Optimal value of K is data-dependent.
• Performance suffers with imbalanced data.

Mitigate weaknesses with data preprocessing and optimizations like KD-trees and feature selection. Acknowledging
these limitations is crucial for effective application.
Limitations and Key Takeaways
K-Nearest Neighbours (KNN) faces several limitations that must be addressed for effective real-world application, especially within the Indian context.

• High Computational Cost: KNN's prediction time grows linearly with dataset size because it calculates distances to every point.
For large Indian datasets like census data or nationwide transaction records, this becomes impractical without optimizations.
• Memory Intensive: Storing the entire training dataset is memory-intensive. High-dimensional datasets such as images require significant memory.
• Sensitivity to Feature Scaling: Features with larger scales dominate distance calculations. Inconsistent units can skew results;
standardize features such as income (in INR) and expenditure before calculating distances.

Key Takeaways:

• KNN is intuitive for classification and regression but requires careful tuning.
• Choose K based on your dataset size and validation performance. Different distance metrics suit different data types (Euclidean
for continuous, Hamming for categorical).
• Preprocess data: scale features, handle missing values, and reduce dimensionality.
• KNN is powerful when used correctly, but awareness of limitations and mitigation strategies are vital for successful deployment.
Team Members

• HRISHIKA BHATNAGAR – BTF/22/141

• VANSH SHARMA – BTF/22/161

• ISHAAN GARG – BTF/22/151

• ARMAAN SINGH – BTF/22/156

• AMRIT KUMAR RAO – BTF/22/160

• KESHAV MITTAL – BTF/22/157
