003-KNN Complete Updated
KNN Decision Boundary
[Figure: Voronoi tessellation and the KNN decision boundary for K = 1]
Properties of KNN – Non-parametric
The KNN algorithm is a supervised, non-parametric algorithm
• It does not make any assumptions about the underlying distribution, nor does it try to estimate it
https://sebastianraschka.com/faq/docs/parametric_vs_nonparametric.html
Properties of KNN – Non-parametric
• Parametric models summarize data with a fixed set of
parameters (independent of the number of training examples).
• No matter how much data you throw at a parametric model, it won’t
change its mind about how many parameters it needs.
• Non-parametric models make no assumptions about the
probability distribution or number of parameters when modeling
the data.
• Nonparametric methods are good when you have a lot of data and no
prior knowledge, and when you don’t want to worry too much about
choosing just the right features.
• Non-parametric does not mean that they have no parameters! On the
contrary, non-parametric models (can) become more and more complex
with an increasing amount of data.
https://sebastianraschka.com/faq/docs/parametric_vs_nonparametric.html
Properties of KNN – Non-parametric
• Differences:
• In a parametric model, we have a finite number of
parameters, and in nonparametric models, the number of
parameters is (potentially) infinite.
• In statistics, the term parametric is also associated with a
specified probability distribution that you “assume” your data
follows, and this distribution comes with a finite number of
parameters (for example, the mean and standard deviation
of a normal distribution)
o We don’t make/have these assumptions in non-parametric models.
So, in intuitive terms, we can think of a non-parametric model as a
“distribution” or (quasi) assumption-free model.
• Still, the distinction is a bit ambiguous at best
https://sebastianraschka.com/faq/docs/parametric_vs_nonparametric.html
Notes: Detailed Differences between Parametric
and Non-Parametric Models
Parametric models assume some finite set of parameters for the model,
while non-parametric models make no such assumption about the data
distribution and have a potentially unbounded number of parameters.
Here are some key differences:
Number of Parameters:
1. Parametric Models: These models have a fixed number of parameters. For
example, in a linear regression model, the parameters are the slope and
intercept.
2. Non-parametric Models: These models do not have a fixed number of
parameters. The number of parameters grows with the amount of training
data. For example, in a k-nearest neighbors (KNN) model, the "parameters" are
essentially the entire training dataset.
Notes: Detailed Differences between Parametric
and Non-Parametric Models
Flexibility:
1. Parametric Models: Less flexible as they make strong assumptions
about the data distribution.
2. Non-parametric Models: More flexible as they make fewer
assumptions about the data distribution.
Computational Complexity:
1. Parametric Models: Generally, less computationally intensive as
they require estimating only a fixed number of parameters.
2. Non-parametric Models: Usually, more computationally intensive
as they involve a larger number of parameters and often require
computation over the entire dataset.
Detailed Differences between Parametric and Non-
Parametric Models
Risk of Underfitting vs. Overfitting:
1. Parametric Models: Higher risk of underfitting as they might not capture the
underlying complexity of the data.
2. Non-parametric Models: Higher risk of overfitting as they might capture too
much noise in the data.
Examples:
• Parametric Models: Linear Regression, Logistic Regression, Naive Bayes, etc.
• Non-parametric Models: Decision Trees with unbounded height, k-Nearest
Neighbors, Support Vector Machines, etc.
Properties of KNN – Hyperparameter K
Parameters and hyperparameters are two types of values that a
model uses to make predictions, but they serve different
purposes and are learned in different ways:
Parameters:
• These are the parts of the model that are learned from the training data.
For example, the weights in a linear regression model are parameters.
• Parameters are learned directly from the training data during the training
process. The model uses the training data to adjust the parameters to
minimize the prediction error.
Hyperparameters:
• These are the settings or configurations that need to be specified prior to
training the model. They are not learned from the data but are essential
for the learning process.
• For example, the learning rate in gradient descent, the depth of a
decision tree, or the number of clusters in k-means clustering are all
hyperparameters.
• The values of hyperparameters are usually set before training the model
and remain constant during the training process. They may be adjusted
between runs of training to optimize model performance.
Properties of KNN
• Used for classification and regression
• Classification: Choose the most frequent class label amongst the k-nearest neighbors
• Regression: Take an average over the output values of the k-nearest neighbors and assign it to the test point – may be weighted, e.g. $w = 1/d$ ($d$: distance from $x$)
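To make these two rules concrete, here is a minimal sketch using scikit-learn (the slides do not prescribe a library, so this is an assumption); `weights="distance"` applies the $w = 1/d$ weighting described above, and the tiny toy arrays are hypothetical, loosely based on the health-data example later in the slides.

```python
# Minimal sketch of KNN classification and distance-weighted regression.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X_train = np.array([[62, 70], [72, 90], [65, 120], [67, 100], [64, 110]])
y_class = np.array(["No", "No", "Yes", "Yes", "No"])      # class labels
y_reg   = np.array([150.0, 160.0, 200.0, 190.0, 130.0])   # continuous targets

x_test = np.array([[66, 115]])

# Classification: majority vote among the k nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_class)
print(clf.predict(x_test))

# Regression: average of neighbor outputs, weighted by w = 1/d.
reg = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X_train, y_reg)
print(reg.predict(x_test))
```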
Properties of KNN
An Instance-based learning algorithm
• Instead of performing explicit generalization, form hypotheses by
comparing new problem instances with training instances
• (+) Can easily adapt to unseen data
• (-) Complexity of prediction is a function of 𝑛 (size of training data)
A lazy learning algorithm
• Delay computations on training data until a query is made, as opposed to
eager learning
• (+) Good for continuously updated training data like recommender
systems
• (-) Slower to evaluate and need to store the whole training data
Nearest Neighbors Methods
Distance/Similarity
Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell University, Lecture 2,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
• Nearest Neighbor Methods, Victor Lavrenko, Assistant Professor at the University of
Edinburgh, https://www.youtube.com/playlist?list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ
• Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
• Effects of Distance Measure Choice on KNN Classifier Performance - A Review, V. B.
Surya Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
• A Comparative Analysis of Similarity Measures to find Coherent Documents, Mausumi
Goswami et al. http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-%20as.pdf
• A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous
Data, Ali Seyed Shirkhorshidi et al.,
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=printable
Similarity/Distance Measures
• If scaled between 0 and 1, then 𝑠𝑖𝑚 = 1 − 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒
• Lots of choices, depends on the problem
• The Minkowski distance is a generalized metric form of
Euclidean, Manhattan and Chebyshev distances
• The Minkowski distance between two n-dimensional vectors $P = \langle p_1, p_2, \dots, p_n \rangle$ and $Q = \langle q_1, q_2, \dots, q_n \rangle$ is defined as:
$$d_{\mathrm{Minkowski}}(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^a \right)^{1/a}, \quad a \ge 1$$
• $a = 1$ is the Manhattan distance
• $a = 2$ is the Euclidean distance
• $a \to \infty$ is the Chebyshev distance
Constraints on Distance Metrics
The distance function between vectors $p$ and $q$ is a function $d(p, q)$ that defines
the distance between the two vectors. It is considered a metric if it satisfies the
following properties:
1. Non-negativity: The distance between 𝑝 and 𝑞 is always a value greater
than or equal to zero
𝒅(𝒑, 𝒒) ≥ 𝟎
2. Identity of indiscernible vectors: The distance between 𝑝 and 𝑞 is equal to
zero if and only if 𝑝 is equal to 𝑞
𝒅(𝒑, 𝒒) = 𝟎 𝒊𝒇𝒇 𝒑 = 𝒒
3. Symmetry: The distance between 𝑝 and 𝑞 is equal to the distance between
𝑞 and 𝑝.
𝒅(𝒑, 𝒒) = 𝒅(𝒒, 𝒑)
4. Triangle inequality: Given a third point 𝑟, the distance between 𝑝 and 𝑞 is
always less than or equal to the sum of the distance between 𝑝 and 𝑟 and the
distance between 𝑟 and 𝑞
𝒅(𝒑, 𝒒) ≤ 𝒅(𝒑, 𝒓) + 𝒅(𝒓, 𝒒)
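As an illustration (not part of the original slides), the sketch below numerically spot-checks these four properties for the Manhattan (L1) distance on random vectors; the helper name `check_metric_on_sample` is hypothetical, and a sampled check is of course not a proof.

```python
# Spot-check the four metric axioms for the Manhattan (L1) distance
# on randomly drawn vectors.
import numpy as np

def manhattan(p, q):
    return np.abs(p - q).sum()

def check_metric_on_sample(dist, n_trials=1000, dim=5, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        p, q, r = rng.normal(size=(3, dim))
        assert dist(p, q) >= 0                                 # non-negativity
        assert np.isclose(dist(p, p), 0.0)                     # identity of indiscernibles
        assert np.isclose(dist(p, q), dist(q, p))              # symmetry
        assert dist(p, q) <= dist(p, r) + dist(r, q) + 1e-12   # triangle inequality
    return True

print(check_metric_on_sample(manhattan))  # True
```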
Manhattan Distance
$$d_{\mathrm{Man}}(p, q) = d(q, p) = |p_1 - q_1| + |p_2 - q_2| + \dots + |p_n - q_n| = \sum_{i=1}^{n} |p_i - q_i|$$
• The distance between two points is the sum of the absolute differences of their Cartesian coordinates.
• In 2D, it is the total of the differences between the x-coordinates and the y-coordinates.
• Also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, snake distance, or taxi-cab metric
• Works well for high-dimensional data.
• It does not amplify differences among features of the two vectors and, as a result, does not ignore the effects of any feature dimension
• Higher values of $a$ amplify differences and effectively ignore features with smaller differences
Euclidean vs Manhattan Distance
https://en.wikipedia.org/wiki/Taxicab_geometry
Euclidean Distance
$$d(p, q) = d(q, p) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2} = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
Chebyshev Distance
$$d_{\mathrm{Cheb}}(p, q) = \lim_{a \to \infty} \left( \sum_{i=1}^{n} |p_i - q_i|^a \right)^{1/a} = \max_i |p_i - q_i|$$
How?
Assume $p = \langle 2, 3, \dots, 9 \rangle$ and $q = \langle 4, 6, \dots, 10 \rangle$:
$$d_{\mathrm{Cheb}}(p, q) = \lim_{a \to \infty} \left( |2-4|^a + |3-6|^a + \dots + |9-10|^a \right)^{1/a} = \lim_{a \to \infty} \left( 2^a + 3^a + \dots + 1^a \right)^{1/a}$$
Suppose $a = 2$: $d(p, q) = (4 + 9 + \dots + 1)^{1/2}$
Suppose $a = 3$: $d(p, q) = (8 + 27 + \dots + 1)^{1/3}$
Suppose $a = 10$: $d(p, q) = (1{,}024 + 59{,}049 + \dots + 1)^{1/10}$
Now, as $a \to \infty$, the largest coordinate difference dominates the sum, so
$$d_{\mathrm{Cheb}}(p, q) = \lim_{a \to \infty} \left( \sum_{i=1}^{n} |p_i - q_i|^a \right)^{1/a} \to \lim_{a \to \infty} \left( \max_i |p_i - q_i|^a \right)^{1/a} = \max_i |p_i - q_i|$$
• For the Chebyshev distance, the distance between two vectors is the greatest of their differences along any coordinate dimension
• Appropriate when two objects are to be considered “different” as soon as they differ in any one dimension
• Also called chessboard distance, maximum metric, or $L_\infty$ metric
(Figure source: Effect of Different Distance Measures in Result of Cluster Analysis, Sujan Dahal)
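A minimal sketch (my own illustration, not from the slides) of the Minkowski family discussed above: $a = 1$ gives the Manhattan distance, $a = 2$ the Euclidean distance, and large $a$ approaches the Chebyshev maximum.

```python
# Minkowski distance family: a=1 (Manhattan), a=2 (Euclidean),
# and a -> infinity approaching the Chebyshev (max) distance.
import numpy as np

def minkowski(p, q, a):
    return (np.abs(p - q) ** a).sum() ** (1.0 / a)

def chebyshev(p, q):
    return np.abs(p - q).max()

p = np.array([2.0, 3.0, 9.0])
q = np.array([4.0, 6.0, 10.0])

print(minkowski(p, q, 1))    # Manhattan: 2 + 3 + 1 = 6
print(minkowski(p, q, 2))    # Euclidean: sqrt(4 + 9 + 1) ~ 3.742
print(minkowski(p, q, 50))   # already very close to the largest difference
print(chebyshev(p, q))       # Chebyshev: max(2, 3, 1) = 3
```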
Nearest Neighbors Methods
The KNN Algorithm
Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell University, Lecture 2,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
• Nearest Neighbor Methods, Victor Lavrenko, Assistant Professor at the University of
Edinburgh, https://www.youtube.com/playlist?list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ
• Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
• Effects of Distance Measure Choice on KNN Classifier Performance - A Review, V. B.
Surya Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
• A Comparative Analysis of Similarity Measures to find Coherent Documents, Mausumi
Goswami et al. http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-%20as.pdf
• A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous
Data, Ali Seyed Shirkhorshidi et al.,
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=printable
The KNN Algorithm
Input: Training samples $D = \{(\vec{x}_1, y_1), (\vec{x}_2, y_2), \dots, (\vec{x}_n, y_n)\}$, test sample $d = (\vec{x}, y)$, and $k$. Assume $\vec{x}$ to be an m-dimensional vector.
Note:
All the action takes place in the test phase; the training phase essentially just cleans, normalizes and stores the data
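To make this lazy training/test split concrete, here is a brute-force sketch of the classifier (my own illustration; the class and method names are hypothetical): training only stores the data, and all distance computation happens at query time.

```python
# Brute-force KNN classifier: "training" just stores the data;
# all the work happens at prediction time.
import numpy as np
from collections import Counter

class BruteForceKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy learning: simply store the (cleaned/normalized) training data.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict_one(self, x):
        # Distances to all n training points of dimension m: O(n * m) per query.
        dists = np.sqrt(((self.X - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[: self.k]
        # Majority vote among the k nearest neighbors.
        return Counter(self.y[nearest]).most_common(1)[0][0]

    def predict(self, X):
        return np.array([self.predict_one(np.asarray(x, dtype=float)) for x in X])
```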
KNN Classification and Regression
#   Height    Weight  B.P.  B.P.  Heart    Cholesterol  Euclidean
    (inches)  (kgs)   Sys   Dia   disease  Level        distance to #8
1   62        70      120   80    No       150          52.59
2   72        90      110   70    No       160          47.81
3   74        80      130   70    No       130          43.75
4   65        120     150   90    Yes      200           7.14
5   67        100     140   85    Yes      190          16.61
6   64        110     130   90    No       130          15.94
7   69        150     170   100   Yes      250          44.26
8   66        115     145   90    ?        ?            (test sample)

With K = 3, the nearest neighbors of test sample #8 (distances computed from Height, Weight, B.P. Sys and B.P. Dia) are #4, #6 and #5, so the classification is “Yes” for heart disease and the regression estimate for cholesterol is (200 + 130 + 190)/3 ≈ 173.3.
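A short sketch (my own, using the table’s numbers) that reproduces the listed distances and the K = 3 prediction for sample #8; the feature choice (Height, Weight, B.P. Sys, B.P. Dia) is inferred from the distances shown in the table.

```python
# Reproduce the worked example: 3-NN prediction for sample #8.
import numpy as np
from collections import Counter

X = np.array([
    [62,  70, 120,  80],
    [72,  90, 110,  70],
    [74,  80, 130,  70],
    [65, 120, 150,  90],
    [67, 100, 140,  85],
    [64, 110, 130,  90],
    [69, 150, 170, 100],
], dtype=float)
heart = np.array(["No", "No", "No", "Yes", "Yes", "No", "Yes"])
chol  = np.array([150, 160, 130, 200, 190, 130, 250], dtype=float)

x8 = np.array([66, 115, 145, 90], dtype=float)

d = np.sqrt(((X - x8) ** 2).sum(axis=1))
print(np.round(d, 2))             # [52.59 47.81 43.75  7.14 16.61 15.94 44.26]

k = 3
nn = np.argsort(d)[:k]            # samples #4, #6, #5
print(Counter(heart[nn]).most_common(1)[0][0])  # classification: "Yes"
print(chol[nn].mean())            # regression: (200 + 130 + 190) / 3 ~ 173.3
```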
Example: Handwritten digit recognition
• 16x16 bitmaps
• 8-bit grayscale
• Euclidean distances over raw pixels
[Figure: two 16×16 digit bitmaps, x and y]
• $D(x, y) = \sqrt{\sum_{i=0}^{255} (x_i - y_i)^2}$
Accuracy:
• 7-NN ~ 95.2%
• SVM ~ 95.8%
• Humans ~ 97.5%
http://rstudio-pubs-static.s3.amazonaws.com/6287_c079c40df6864b34808fa7ecb71d0f36.html,
Victor Lavrenko https://www.youtube.com/watch?v=ZD_tfNpKzHY&list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ&index=6
Complexity of KNN
Input: Training samples $D = \{(\vec{x}_1, y_1), (\vec{x}_2, y_2), \dots, (\vec{x}_n, y_n)\}$, test sample $d = (\vec{x}, y)$, and $k$. Assume $\vec{x}$ to be an m-dimensional vector.
Each prediction requires computing the distance from $\vec{x}$ to all $n$ training samples of dimension $m$, plus the memory to store the entire training set.
Choosing the value of K – the theory
k=1:
• High variance
• Small changes in the dataset will lead to big changes in classification
• Overfitting
o Is too specific and not well-generalized
o It tends to be sensitive to noise
o The model achieves high accuracy on the training set but will be a poor
predictor on new, previously unseen data points
k= very large (e.g., 100):
• The model is too generalized and performs poorly on both the training and test
sets.
• High bias
• Underfitting
k=n:
• The majority class in the dataset wins for every prediction
• High bias
Tuning the hyperparameter K – the Method
1. Divide your training data into training and validation sets.
2. Do multiple iterations of m-fold cross-validation, each time with
a different value of k, starting from k = 1
3. Keep iterating until the k with the best cross-validated classification accuracy
(minimal loss) is found (see the sketch below)
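A minimal sketch of this tuning loop with scikit-learn’s cross_val_score (assumptions: 5-fold cross-validation, a stand-in dataset, and the k range 1–30; the slides do not prescribe any of these choices).

```python
# Tune the hyperparameter k by m-fold cross-validation on the training data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration

best_k, best_acc = None, -np.inf
for k in range(1, 31):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, round(best_acc, 3))
```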
Challenges:
1. How to find the optimum value of K?
2. How to find the right distance function?
Problems:
1. High computational time cost for each prediction.
2. High memory requirement as we need to keep all training samples.
3. The curse of dimensionality.
K Nearest Neighbors
Error as 𝑛→∞
Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Lecture 2,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
• Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-
nearest_neighbors_algorithm
• Effects of Distance Measure Choice on KNN Classifier Performance - A Review,
V. B. Surya Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
• A Comparative Analysis of Similarity Measures to find Coherent Documents,
Mausumi Goswami et al. http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-
%20as.pdf
• A Comparison Study on Similarity and Dissimilarity Measures in Clustering
Continuous Data, Ali Seyed Shirkhorshidi et al.,
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=
printable
• Cover, Thomas and Hart, Peter. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 1967, 13(1): 21–27
Bayes Error (https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)
Therefore:
1. The probability that $y^*$ was the correct label of $x_t$ but the nearest neighbor $x_{NN}$ did not have that label:
$P(y^* \mid x_t)\,\big(1 - P(y^* \mid x_{NN})\big)$
2. The probability that $y^*$ was not the correct label of $x_t$ but the nearest neighbor $x_{NN}$ did have that label:
$P(y^* \mid x_{NN})\,\big(1 - P(y^* \mid x_t)\big)$
1-NN Error as $n \to \infty$ (Cover and Hart 1967, Weinberger Lec 2)
So, the total probability of misclassification is:
$$\epsilon_{NN} = P(y^* \mid x_t)\,\big(1 - P(y^* \mid x_{NN})\big) + P(y^* \mid x_{NN})\,\big(1 - P(y^* \mid x_t)\big)$$
Since $P(y^* \mid x_t) \le 1$ and $P(y^* \mid x_{NN}) \le 1$,
$$\epsilon_{NN} \le \big(1 - P(y^* \mid x_{NN})\big) + \big(1 - P(y^* \mid x_t)\big)$$
As $n \to \infty$, the nearest neighbor $x_{NN}$ converges to $x_t$, so $P(y^* \mid x_{NN}) \to P(y^* \mid x_t)$ and
$$\epsilon_{NN} \le 2\,\big(1 - P(y^* \mid x_t)\big) = 2\,\epsilon_{\mathrm{Bayes}}$$
As $n \to \infty$, the 1-NN classifier is only a factor 2 worse than the best possible classifier.
K Nearest Neighbors
The Curse of Dimensionality
Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Lecture 2,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
• The Curse of Dimensionality, Aaron Lipeles, https://towardsdatascience.com/the-
curse-of-dimensionality-f07c66128fe1
• Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-
nearest_neighbors_algorithm
• Effects of Distance Measure Choice on KNN Classifier Performance - A Review, V.
B. Surya Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
• A Comparative Analysis of Similarity Measures to find Coherent Documents,
Mausumi Goswami et al., http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-
%20as.pdf
• A Comparison Study on Similarity and Dissimilarity Measures in Clustering
Continuous Data, Ali Seyed Shirkhorshidi et al.,
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=
printable
• Cover, Thomas and Hart, Peter. Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 1967, 13(1): 21–27
The Curse of Dimensionality
(https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)
KNN Assumption
• Points near one another tend to have the same label
• So, KNN tries to find the nearest points
Problem
• In high dimensions, points drawn from a uniform
distribution, are never near one another
o High dimensional spaces are sparsely populated
• So, there are no neighbors near you!
o You have “nearest” neighbors but no “near” neighbors!
o This contradicts the basic assumption of KNN (above)
Sparsely Populated Spaces
• As dimensionality grows, all/most regions of space get sparsely populated
– Fewer observations per region
– E.g., 10 observations spread across:
• 1d: 3 regions, 2d: $3^2 = 9$ regions, 1000d: $3^{1000}$ regions
https://people.engr.tamu.edu/rgutier/lectures/iss/iss_l10.pdf
Why is High Dimensionality Bad, in general?
• A small family of instances living in a huge house: What’s the problem?
• Machine Learning methods are statistical:
– Count observations in various regions of some space
– Use counts to predict outcomes
– E.g. Naïve Bayes, Decision Trees
• As dimensionality grows, all/most regions of space get sparsely populated
– Fewer observations per region
– E.g., 10 observations spread across:
• 1d: 3 regions, 2d: $3^2 = 9$ regions, 1000d: $3^{1000}$ regions
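To illustrate the “nearest but not near” effect numerically, here is a small sketch (my own, not from the slides) that draws points uniformly from the unit hypercube and compares nearest- and farthest-neighbor distances as the dimension grows.

```python
# In high dimensions, uniformly drawn points are all far apart, and the
# nearest and farthest neighbors become almost equally distant.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))
    q = rng.uniform(size=d)
    dists = np.sqrt(((X - q) ** 2).sum(axis=1))
    print(f"d={d:4d}  nearest={dists.min():.3f}  "
          f"farthest={dists.max():.3f}  ratio={dists.min() / dists.max():.3f}")
# As d grows, the nearest distance grows and the ratio approaches 1:
# you have "nearest" neighbors but no "near" neighbors.
```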
Demonstration 1
(https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)
https://pinetools.com/random-bitmap-generator, https://onlinerandomtools.com/shuffle-words
The Curse of Dimensionality
Another example: Words
• Example paragraph, 62 tokens, 46 word types, Through the Looking-glass by Lewis Carroll
One thing was certain, that the white kitten had had nothing to do with it. It was the black kitten’s fault entirely. For the
white kitten had been having its face washed by the old cat for the last quarter of an hour and bearing it pretty well,
considering; so you see that it couldn’t have had any hand in the mischief.
Parzen Windows
• Idea: Given a set of data points, Parzen Windows estimate
the pdf at any point in the space by counting the number
of data points in a window centered around that point.
Parzen Windows and Kernels
[Figure: 3-NN vs. a Parzen window of radius R around the query point x, with neighbors labeled $y_i = +1$ or $y_i = -1$]
Parzen window classifier:
$$f(x) = \mathrm{sgn}\Big(\sum_{i:\, x_i \in R(x)} y_i\Big)$$
Equivalently, written with an indicator of the distance from $x$ (a kernel that converts distances to numbers):
$$f(x) = \mathrm{sgn}\Big(\sum_{i} y_i \cdot \mathbf{1}\big[\, \lVert x_i - x \rVert \le R \,\big]\Big)$$
Ref: Victor Lavrenko, University of Edinburgh
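A minimal sketch (my own illustration) of the Parzen window classifier above, which sums the labels of all training points within radius R of the query; the function name and toy data are hypothetical.

```python
# Parzen window classifier: f(x) = sgn( sum_i y_i * 1[ ||x_i - x|| <= R ] )
import numpy as np

def parzen_window_classify(X_train, y_train, x, R):
    """X_train: (n, d) points, y_train: labels in {-1, +1}, x: query, R: radius."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    votes = (y_train * (dists <= R)).sum()   # indicator kernel turns each distance into 0/1
    return 1 if votes >= 0 else -1           # ties broken toward +1 in this sketch

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
y_train = np.array([-1, -1, +1, +1])
print(parzen_window_classify(X_train, y_train, np.array([0.1, 0.0]), R=0.5))  # -1
print(parzen_window_classify(X_train, y_train, np.array([1.1, 1.0]), R=0.5))  # +1
```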
Performance of KNN Algorithm
Time complexity: 𝑂 𝑛𝑑
Using More Efficient Distance Metrics:
• Choose distance metrics that are less computationally
intensive. For instance, Manhattan distance may be computed
faster than Euclidean distance in high-dimensional spaces.
Using Approximate Nearest Neighbor Algorithms:
• Algorithms like Locality-Sensitive Hashing (LSH) or k-d trees
can be used to approximate the nearest neighbors,
significantly reducing computation time.
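As an illustration of the LSH idea mentioned above, here is a toy random-hyperplane sketch of my own (not a production implementation and not tied to any particular library): points whose sign patterns under random projections collide land in the same bucket, so only that bucket is searched at query time.

```python
# Toy random-hyperplane LSH: hash points by the signs of random projections,
# then search only the query's bucket instead of all n training points.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_planes = 32, 8
planes = rng.normal(size=(n_planes, d))          # random hyperplanes

def lsh_hash(x):
    return tuple((planes @ x > 0).astype(int))   # sign pattern as the bucket key

X = rng.normal(size=(10_000, d))
buckets = defaultdict(list)
for i, x in enumerate(X):
    buckets[lsh_hash(x)].append(i)

q = X[123] + 0.01 * rng.normal(size=d)           # query near a known point
candidates = buckets[lsh_hash(q)]                # usually far fewer than n points
best = min(candidates, key=lambda i: np.linalg.norm(X[i] - q), default=None)
print(len(candidates), best)                     # approximate NN; can miss the true one
```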
Performance of KNN Algorithm
• Time complexity: 𝑂(𝑛𝑑)
• Reduce 𝑑: Dimensionality reduction
• Reduce 𝑛: Compare to a subset of examples
• Identify 𝑚 ≪ 𝑛 potential near neighbors to compare against
• 𝑂 𝑚𝑑
• K-D trees: Low-dimensional, real-valued data
o $O(d \log_2 n)$; only works well when $d \ll n$; inexact: can miss neighbors
• Locality-sensitive hashing: high-dimensional, real or discrete
o $O(n'd)$, with $n' \ll n$; inexact: can miss neighbors
• Inverted lists: high-dimensional, discrete (sparse) data
o $O(n'd')$, where $d' \ll d$ and $n' \ll n$; only for sparse data (e.g. text); exact
K-D Trees
• Pick a random dimension, find median, split data, repeat
$(1,9), (2,3), (4,1), (3,7), (5,4), (6,8), (7,2), (8,8), (7,9), (9,6)$
• $O(d \log_2 n)$
• E.g. test point: (7,4)
• Compare with all the points in the region
• Can easily miss nearest neighbors
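A short sketch using SciPy’s k-d tree on the slide’s point set and test point (7, 4). Note that scipy.spatial.KDTree performs an exact backtracking search, whereas the “can easily miss nearest neighbors” caveat above applies to the simplified search that only descends one branch.

```python
# Exact nearest-neighbor query with a k-d tree (SciPy) on the slide's points.
import numpy as np
from scipy.spatial import KDTree

points = np.array([(1, 9), (2, 3), (4, 1), (3, 7), (5, 4),
                   (6, 8), (7, 2), (8, 8), (7, 9), (9, 6)], dtype=float)
tree = KDTree(points)

dist, idx = tree.query([7, 4], k=1)   # test point from the slide
print(dist, points[idx])              # distance 2.0; (5, 4) and (7, 2) are tied
```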
Conclusion
Weaknesses:
1. Computational Complexity: Computationally expensive, especially
with large datasets as it requires computing the distance to all
training samples for each query.
2. Memory Intensive: Requires storing the entire dataset, which can be
memory-intensive for large datasets.
3. Sensitive to Irrelevant Features: The performance of k-NN can
degrade with irrelevant or redundant features because it uses all
features equally in the distance calculation.
4. Sensitive to the Scale of the Data: Different scales of features can
impact the performance of k-NN. Feature scaling is often required.
5. Optimal k Value: Choosing the optimal number of neighbors (𝑘) can
be challenging.
For more details please visit
http://aghaaliraza.com
Thank you!