Nearest Neighbors Methods

Intuition and Background

Agha Ali Raza


Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell University, Lecture 2,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
• Nearest Neighbor Methods, Victor Lavrenko, Assistant Professor at the University of
Edinburgh, https://www.youtube.com/playlist?list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ
• Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
• Effects of Distance Measure Choice on KNN Classifier Performance - A Review, V. B.
Surya Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
• A Comparative Analysis of Similarity Measures to find Coherent Documents, Mausumi
Goswami et al. http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-%20as.pdf
• A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous
Data, Ali Seyed Shirkhorshidi et al.,
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=printable
The K Nearest Neighbors Algorithm
“A data point is known by the company it keeps”
(Aesop – the data scientist)

“A data point is known by the most common company it keeps”


(Aesop – the improving data scientist)
The K Nearest Neighbors Algorithm
Basic idea: Similar Inputs have similar outputs
Classification rule:
For a test input 𝑥, assign the most common label
amongst its 𝑘 most similar (nearest) training inputs
Formal Definition
Assuming 𝑥 to be our test point, let us denote the set of the 𝑘 nearest neighbors
of 𝑥 as 𝑆𝑥 .
Formally, 𝑆𝑥 is defined as
𝑺𝒙 ⊆ 𝑫 𝒔. 𝒕. |𝑺𝒙 | = 𝒌
𝑎𝑛𝑑
∀ (𝒙′ , 𝒚′ ) ∈ 𝑫 ∖ 𝑺𝒙 :  𝒅𝒊𝒔𝒕(𝒙, 𝒙′) ≥ max_{(𝒙′′, 𝒚′′) ∈ 𝑺𝒙} 𝒅𝒊𝒔𝒕(𝒙, 𝒙′′)
That is, every point that is in 𝐷 but not in 𝑆𝑥 is at least as far away from 𝑥 as the
furthest point in 𝑆𝑥 .
We define the classifier ℎ(⋅) as a function returning the most common label in
𝑆𝑥 :
𝒉(𝒙) = 𝒎𝒐𝒅𝒆({𝒚′′: (𝒙′′, 𝒚′′) ∈ 𝑺𝒙 }),
where mode(⋅) selects the label with the highest number of occurrences.
So, what do we do if there is a draw (a tie)?
• Keep 𝑘 odd (for binary classification), or fall back to the result of (𝑘−1)-NN,
reducing 𝑘 until the tie is broken (see the sketch below)
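A minimal sketch of this tie-breaking rule, assuming the neighbor labels are already ordered from nearest to farthest (the helper name is illustrative):

```python
from collections import Counter

def majority_label_with_tiebreak(neighbor_labels):
    """Return the most common label; on a tie, retry with one fewer neighbor.

    `neighbor_labels` is assumed to be ordered nearest-to-farthest, so dropping
    the last element mimics falling back to (k-1)-NN.
    """
    k = len(neighbor_labels)
    while k > 0:
        counts = Counter(neighbor_labels[:k]).most_common()
        # A unique winner (or only one candidate label left): no tie to break.
        if len(counts) == 1 or counts[0][1] > counts[1][1]:
            return counts[0][0]
        k -= 1  # tie: shrink the neighborhood and vote again
    return None
```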
KNN Decision Boundary
[Figure: Voronoi tessellation of the training points and the resulting KNN decision boundary, K = 1]
Properties of KNN – Non-parametric
The KNN algorithm is a supervised, non-parametric algorithm
• It does not make any assumptions about the
underlying distribution, nor does it try to estimate it

https://sebastianraschka.com/faq/docs/parametric_vs_nonparametric.html
Properties of KNN – Non-parametric
• Parametric models summarize data with a fixed set of
parameters (independent of the number of training examples).
• No matter how much data you throw at a parametric model, it won’t
change its mind about how many parameters it needs.
• Non-parametric models make no assumptions about the
probability distribution or number of parameters when modeling
the data.
• Nonparametric methods are good when you have a lot of data and no
prior knowledge, and when you don’t want to worry too much about
choosing just the right features.
• Non-parametric does not mean that they have no parameters! On the
contrary, non-parametric models (can) become more and more complex
with an increasing amount of data.

https://sebastianraschka.com/faq/docs/parametric_vs_nonparametric.html
Properties of KNN – Non-parametric
• Differences:
• In a parametric model, we have a finite number of
parameters, and in nonparametric models, the number of
parameters is (potentially) infinite.
• In statistics, the term parametric is also associated with a
specified probability distribution that you “assume” your data
follows, and this distribution comes with a finite number of
parameters (for example, the mean and standard deviation
of a normal distribution)
o We don’t make these assumptions in non-parametric models.
So, in intuitive terms, we can think of a non-parametric model as a
“distribution-free” or (quasi) assumption-free model.
• Still, the distinction is a bit ambiguous at best

https://sebastianraschka.com/faq/docs/parametric_vs_nonparametric.html
Notes: Detailed Differences between Parametric
and Non-Parametric Models
Parametric models assume some finite set of parameters for the model,
while non-parametric models do not make any assumptions about
the data distribution and can have a (potentially) infinite number of parameters.
Here are some key differences:
Number of Parameters:
1. Parametric Models: These models have a fixed number of parameters. For
example, in a linear regression model, the parameters are the slope and
intercept.
2. Non-parametric Models: These models do not have a fixed number of
parameters. The number of parameters grows with the amount of training
data. For example, in a k-nearest neighbors (KNN) model, the "parameters" are
essentially the entire training dataset.

Notes: Detailed Differences between Parametric
and Non-Parametric Models
Flexibility:
1. Parametric Models: Less flexible as they make strong assumptions
about the data distribution.
2. Non-parametric Models: More flexible as they make fewer
assumptions about the data distribution.
Computational Complexity:
1. Parametric Models: Generally, less computationally intensive as
they require estimating only a fixed number of parameters.
2. Non-parametric Models: Usually, more computationally intensive
as they involve a larger number of parameters and often require
computation over the entire dataset.

Detailed Differences between Parametric and Non-
Parametric Models
Risk of Under-/Overfitting:
1. Parametric Models: Higher risk of underfitting as they might not capture the
underlying complexity of the data.
2. Non-parametric Models: Higher risk of overfitting as they might capture too
much noise in the data.
Examples:
• Parametric Models: Linear Regression, Logistic Regression, Naive Bayes, etc.
• Non-parametric Models: Decision Trees with unbounded height, k-Nearest
Neighbors, Support Vector Machines, etc.

Properties of KNN – Hyperparameter K
Parameters and hyperparameters are two types of values that a
model uses to make predictions, but they serve different
purposes and are learned in different ways:
Parameters:
• These are the parts of the model that are learned from the training data.
For example, the weights in a linear regression model are parameters.
• Parameters are learned directly from the training data during the training
process. The model uses the training data to adjust the parameters to
minimize the prediction error.
Hyperparameters:
• These are the settings or configurations that need to be specified prior to
training the model. They are not learned from the data but are essential
for the learning process.
• For example, the learning rate in gradient descent, the depth of a
decision tree, or the number of clusters in k-means clustering are all
hyperparameters.
• The values of hyperparameters are usually set before training the model
and remain constant during the training process. They may be adjusted
between runs of training to optimize model performance.
Properties of KNN
• Used for classification and regression
• Classification: Choose the most frequent class label amongst k-nearest
neighbors
• Regression: Take an average over the output values of the k-nearest
1
neighbors and assign to the test point – may be weighted e.g. w = (𝑑:
𝑑
distance from 𝑥)
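A small sketch of the regression variant described above, with optional inverse-distance weighting (function and variable names are illustrative):

```python
import numpy as np

def knn_regress(X_train, y_train, x_test, k=3, weighted=True, eps=1e-12):
    """Predict a continuous output as the (optionally 1/d-weighted) mean
    of the outputs of the k nearest training points."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest
    if not weighted:
        return y_train[nearest].mean()
    w = 1.0 / (dists[nearest] + eps)                   # w = 1/d (eps guards against d = 0)
    return np.sum(w * y_train[nearest]) / np.sum(w)
```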
Properties of KNN
An Instance-based learning algorithm
• Instead of performing explicit generalization, form hypotheses by
comparing new problem instances with training instances
• (+) Can easily adapt to unseen data
• (-) Complexity of prediction is a function of 𝑛 (size of training data)
A lazy learning algorithm
• Delay computations on training data until a query is made, as opposed to
eager learning
• (+) Good for continuously updated training data like recommender
systems
• (-) Slower to evaluate and need to store the whole training data
Nearest Neighbors Methods
Distance/Similarity
Similarity/Distance Measures
• If scaled between 0 and 1, then 𝑠𝑖𝑚 = 1 − 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒
• Lots of choices, depends on the problem
• The Minkowski distance is a generalized metric form of
Euclidean, Manhattan and Chebyshev distances
• The Minkowski distance between two n-dimensional
vectors 𝑃 = <𝑝1 , 𝑝2 , … , 𝑝𝑛 > and 𝑄 = <𝑞1 , 𝑞2 , … , 𝑞𝑛 > is
defined as:
𝑑_𝑚𝑖𝑛𝑘𝑜𝑤𝑠𝑘𝑖 (𝑝, 𝑞) = (Σ_{𝑖=1}^{𝑛} |𝑝𝑖 − 𝑞𝑖 |^𝑎 )^{1/𝑎} ,  𝑎 ≥ 1
• 𝑎 = 1, is the Manhattan distance
• 𝑎 = 2, is the Euclidean distance
• 𝑎 → ∞, is the Chebyshev distance
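A short sketch of this family; note how the value approaches the Chebyshev (max) distance as 𝑎 grows:

```python
import numpy as np

def minkowski(p, q, a=2):
    """Minkowski distance (a=1: Manhattan, a=2: Euclidean, a -> inf: Chebyshev)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(np.abs(p - q) ** a) ** (1.0 / a)

def chebyshev(p, q):
    """Limit a -> infinity of the Minkowski distance: the largest coordinate difference."""
    return np.max(np.abs(np.asarray(p, float) - np.asarray(q, float)))

p, q = [2, 3, 9], [4, 6, 10]
print(minkowski(p, q, a=1))   # Manhattan: 6.0
print(minkowski(p, q, a=2))   # Euclidean: ~3.742
print(minkowski(p, q, a=50))  # already very close to 3.0
print(chebyshev(p, q))        # Chebyshev: 3.0
```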
Constraints on Distance Metrics
The distance function 𝑑(𝑝, 𝑞) between vectors 𝑝 and 𝑞 is considered a metric if it
satisfies the following properties:
1. Non-negativity: The distance between 𝑝 and 𝑞 is always a value greater
than or equal to zero
𝒅(𝒑, 𝒒) ≥ 𝟎
2. Identity of indiscernible vectors: The distance between 𝑝 and 𝑞 is equal to
zero if and only if 𝑝 is equal to 𝑞
𝒅(𝒑, 𝒒) = 𝟎 𝒊𝒇𝒇 𝒑 = 𝒒
3. Symmetry: The distance between 𝑝 and 𝑞 is equal to the distance between
𝑞 and 𝑝.
𝒅(𝒑, 𝒒) = 𝒅(𝒒, 𝒑)
4. Triangle inequality: Given a third point 𝑟, the distance between 𝑝 and 𝑞 is
always less than or equal to the sum of the distance between 𝑝 and 𝑟 and the
distance between 𝑟 and 𝑞
𝒅(𝒑, 𝒒) ≤ 𝒅(𝒑, 𝒓) + 𝒅(𝒓, 𝒒)
Manhattan Distance
𝒅_𝑴𝒂𝒏 (𝒑, 𝒒) = 𝒅(𝒒, 𝒑) = |𝒑𝟏 − 𝒒𝟏 | + |𝒑𝟐 − 𝒒𝟐 | + … + |𝒑𝒏 − 𝒒𝒏 |

𝒅(𝒑, 𝒒) = 𝒅(𝒒, 𝒑) = Σ_{𝒊=𝟏}^{𝒏} |𝒑𝒊 − 𝒒𝒊 |
• The distance between two points is the sum of the
absolute differences of their Cartesian coordinates.
• In two dimensions, it is the sum of the absolute differences of the x-
coordinates and the y-coordinates.
• Also known as Manhattan length, rectilinear distance,
L1 distance or L1 norm, city block distance, snake
distance, or taxi-cab metric
• Works well for high dimensional data.
• It does not amplify differences among features of the two
vectors and as a result does not ignore the effects of any
feature dimensions
• Higher values of 𝑎 amplify differences and ignore
features with smaller differences
Euclidean vs Manhattan Distance
https://en.wikipedia.org/wiki/Taxicab_geometry
Euclidean Distance
𝑑(𝑝, 𝑞) = 𝑑(𝑞, 𝑝) = √((𝑝1 − 𝑞1 )² + (𝑝2 − 𝑞2 )² + … + (𝑝𝑛 − 𝑞𝑛 )²)

𝑑(𝑝, 𝑞) = 𝑑(𝑞, 𝑝) = √(Σ_{𝑖=1}^{𝑛} (𝑝𝑖 − 𝑞𝑖 )²)

• Good choice for numeric attributes


• When data is dense or continuous, this is a good proximity measure
• Downside: Sensitive to extreme deviations in attributes (as it squares differences)
• The variables which have the largest value greatly influence the result
• Does not work well for situations where features on different scales are mixed
(e.g., #bedrooms (1-5) and area (200 – 5,000 sq feet) of a house)
o Solution: feature normalization (min-max scaling)
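A minimal sketch of the min-max scaling fix mentioned above (per-feature rescaling to [0, 1] using training-set statistics only; names and data are illustrative):

```python
import numpy as np

def min_max_scale(X_train, X_test):
    """Rescale each feature to [0, 1] using the training set's min and max."""
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0                     # guard against constant features
    return (X_train - lo) / span, (X_test - lo) / span

# e.g. #bedrooms (1-5) and area (200-5,000 sq ft) end up on comparable scales
X_train = np.array([[1, 200.0], [3, 1500.0], [5, 5000.0]])
X_test = np.array([[2, 900.0]])
print(min_max_scale(X_train, X_test))
```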
Chebyshev Distance
Effect of Different Distance Measures in Result of Cluster Analysis, Sujan Dahal

𝒅_𝑪𝒉𝒆𝒃 (𝒑, 𝒒) = lim_{𝒂→∞} (Σ_{𝒊=𝟏}^{𝒏} |𝒑𝒊 − 𝒒𝒊 |^𝒂 )^{𝟏/𝒂} = max_𝒊 |𝒑𝒊 − 𝒒𝒊 |
How?
Assume 𝑝 = <2, 3, … , 9>, 𝑞 = <4, 6, … , 10>
𝑑_𝐶ℎ𝑒𝑏 (𝑝, 𝑞) = lim_{𝑎→∞} (|2 − 4|^𝑎 + |3 − 6|^𝑎 + … + |9 − 10|^𝑎 )^{1/𝑎}
𝑑_𝐶ℎ𝑒𝑏 (𝑝, 𝑞) = lim_{𝑎→∞} (2^𝑎 + 3^𝑎 + … + 1^𝑎 )^{1/𝑎}
Suppose 𝑎 = 2:
𝑑(𝑝, 𝑞) = (4 + 9 + … + 1)^{1/2}
Suppose 𝑎 = 3:
𝑑(𝑝, 𝑞) = (8 + 27 + … + 1)^{1/3}
Suppose 𝑎 = 10:
𝑑(𝑝, 𝑞) = (1,024 + 59,049 + … + 1)^{1/10}
Now, as 𝑎 → ∞, the largest coordinate difference dominates the sum, so
𝑑_𝐶ℎ𝑒𝑏 (𝑝, 𝑞) = lim_{𝑎→∞} (Σ_{𝑖=1}^{𝑛} |𝑝𝑖 − 𝑞𝑖 |^𝑎 )^{1/𝑎} → (max_𝑖 |𝑝𝑖 − 𝑞𝑖 |^𝑎 )^{1/𝑎}

𝒅_𝑪𝒉𝒆𝒃 (𝒑, 𝒒) = max_𝒊 |𝒑𝒊 − 𝒒𝒊 |
Chebyshev Distance
Effect of Different Distance Measures in Result of Cluster Analysis, Sujan Dahal

𝑑_𝐶ℎ𝑒𝑏 (𝑝, 𝑞) = max_𝑖 |𝑝𝑖 − 𝑞𝑖 |
• For Chebyshev distance, the distance between two vectors is the greatest of
their differences along any coordinate dimension
• Useful when two objects are to be considered “different” if they differ in any
single dimension
• Also called chessboard distance, maximum metric, or 𝐿∞ metric
Nearest Neighbors Methods
The KNN Algorithm
The KNN Algorithm
Input: Training samples 𝐷 = {(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 )}, test
sample 𝑑 = (𝑥, 𝑦), and 𝑘. Assume 𝑥 to be an m-dimensional vector.

Output: Class label of test sample 𝑑

1. Compute the distance between 𝑑 and every sample in 𝐷
2. Choose the 𝑘 samples in 𝐷 that are nearest to 𝑑; denote the
set by 𝑆𝑑 ⊆ 𝐷
3. Assign 𝑑 the label 𝑦𝑖 of the majority class in 𝑆𝑑

Note:
All the action takes place in the test phase; the training phase
essentially just cleans, normalizes and stores the data
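A compact, brute-force sketch of steps 1–3 (assuming NumPy arrays for the training data; names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_test, k=3):
    """Brute-force k-NN: compute all distances, pick the k nearest, vote."""
    y_train = np.asarray(y_train)
    # Step 1: distance between the test sample and every training sample
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Step 2: indices of the k nearest training samples (the set S_d)
    nearest = np.argsort(dists)[:k]
    # Step 3: majority label among the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
```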
KNN Classification and Regression
#   Height (inches)   Weight (kgs)   B.P. Sys   B.P. Dia   Heart disease   Cholesterol Level   Euclidean Distance (to #8)
1   62                70             120        80         No              150                 52.59
2   72                90             110        70         No              160                 47.81
3   74                80             130        70         No              130                 43.75
4   65                120            150        90         Yes             200                 7.14
5   67                100            140        85         Yes             190                 16.61
6   64                110            130        90         No              130                 15.94
7   69                150            170        100        Yes             250                 44.26
8   66                115            145        90         ?               ?                   (test sample)

With K = 3, the nearest neighbors of test sample #8 are samples 4, 6 and 5.
Classification: the majority label among {Yes, No, Yes} is Yes.
Regression (Cholesterol Level): (200 + 190 + 130) / 3 = 173.33
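A short check of this worked example, assuming the Euclidean distance is computed over the four numeric features (height, weight, systolic and diastolic B.P.):

```python
import numpy as np

# Samples 1-7: height, weight, B.P. sys, B.P. dia; labels and cholesterol from the table
X = np.array([[62, 70, 120, 80], [72, 90, 110, 70], [74, 80, 130, 70],
              [65, 120, 150, 90], [67, 100, 140, 85], [64, 110, 130, 90],
              [69, 150, 170, 100]], float)
heart = np.array(["No", "No", "No", "Yes", "Yes", "No", "Yes"])
chol = np.array([150, 160, 130, 200, 190, 130, 250], float)

x_test = np.array([66, 115, 145, 90], float)           # sample #8
d = np.linalg.norm(X - x_test, axis=1)
nearest = np.argsort(d)[:3]                            # samples 4, 6, 5
print(np.round(d, 2))                                  # matches the distance column
print(round(chol[nearest].mean(), 2))                  # 173.33 (regression)
print(np.unique(heart[nearest], return_counts=True))   # 2 x Yes, 1 x No (classification)
```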
Example: Handwritten digit recognition
• 16x16 bitmaps
• 8-bit grayscale
• Euclidean distances over raw pixels
• Distance: 𝑫(𝒙, 𝒚) = √(Σ_{𝒊=𝟎}^{𝟐𝟓𝟓} (𝒙𝒊 − 𝒚𝒊 )²), i.e. Euclidean distance over the 256 raw pixel values
Accuracy:
• 7-NN ~ 95.2%
• SVM ~ 95.8%
• Humans ~ 97.5%
http://rstudio-pubs-static.s3.amazonaws.com/6287_c079c40df6864b34808fa7ecb71d0f36.html,
Victor Lavrenko https://www.youtube.com/watch?v=ZD_tfNpKzHY&list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ&index=6

Complexity of KNN
Input: Training samples 𝐷 = {(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 )}, test sample 𝑑 =
(𝑥, 𝑦), and 𝑘. Assume 𝑥 to be an m-dimensional vector.

Output: Class label of test sample 𝑑


1. Compute the distance between 𝑑 and every sample in 𝐷
𝑛 samples, each is 𝑚-dimensional ⇒ 𝑂(𝑚𝑛)
2. Choose the 𝑘 samples in 𝐷 that are nearest to 𝑑; denote the set by 𝑆𝑑 ⊆ 𝐷
- Either naively do 𝐾 passes of all samples costing 𝑂(𝑛) each time for 𝑂(𝑛𝑘)
- Or use the quickselect algorithm (median of medians) to find the k-th
smallest distance in 𝑂(𝑛) and then return all distances no larger than that
k-th smallest distance. This accumulates to 𝑂(𝑛)
3. Assign 𝑑 the label 𝑦𝑖 of the majority class in 𝑆𝑑
This is 𝑂(𝑘).
Time complexity: 𝑂(𝑚𝑛 + 𝑛 + 𝑘) = 𝑂(𝑚𝑛), assuming 𝑘 to be a constant.

Space complexity: 𝑂(𝑚𝑛), to store the 𝑛 m-dimensional training data samples.
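In practice, the selection step can be implemented with `numpy.argpartition`, which finds the 𝑘 smallest distances in expected 𝑂(𝑛) without a full sort (a sketch; names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify_fast_select(X_train, y_train, x_test, k=3):
    """Brute-force k-NN with O(n) selection instead of an O(n log n) sort."""
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - x_test, axis=1)         # O(mn)
    nearest = np.argpartition(dists, k)[:k]                   # expected O(n), quickselect-style
    return Counter(y_train[nearest]).most_common(1)[0][0]     # O(k)
```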
Tuning the Hyperparameter K
What happens if we use the training set itself as the test dataset,
instead of a validation set? Which k wins?
• K=1, as there is always a nearest instance with the correct label:
The instance itself!
What happens if we use K=n?
• KNN will always return the majority class in the dataset

Choosing the value of K – the theory
k=1:
• High variance
• Small changes in the dataset will lead to big changes in classification
• Overfitting
o Is too specific and not well-generalized
o It tends to be sensitive to noise
o The model achieves high accuracy on the training set but will be a poor
predictor on new, previously unseen data points
k = very large (e.g., 100):
• The model is too generalized and is not a good predictor on either the training
or the test set.
• High bias
• Underfitting
k=n:
• The majority class in the dataset wins for every prediction
• High bias
Tuning the hyperparameter K – the Method
1. Divide your training data into training and validation sets.
2. Do multiple iterations of m-fold cross-validation, each time with
a different value of k, starting from k=1
3. Keep iterating until the k with the best classification accuracy
(minimal loss) is found

- What happens if we use the training set itself, instead of a


validation set? Which k wins?
- K=1, as there is always a nearest instance with the correct label, the
instance itself
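A sketch of this tuning loop using scikit-learn's cross-validation utilities (assuming scikit-learn is available; its bundled iris data is used only as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 31):                                       # candidate values of k
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()    # m-fold CV (here m = 5)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```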
KNN – The good, the bad and the ugly
KNN is a simple algorithm but is highly effective for solving various real-life
classification problems, especially when the datasets are large and
continuously growing.
• We can show that as 𝑛 → ∞, the 1-NN classifier is only a factor 2 worse than
the best possible classifier (the Bayes Optimal Classifier).

Challenges:
1. How to find the optimum value of K?
2. How to find the right distance function?

Problems:
1. High computational time cost for each prediction.
2. High memory requirement as we need to keep all training samples.
3. The curse of dimensionality.
K Nearest Neighbors
Error as 𝑛→∞
Sources
• Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Lecture 2,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
• Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-
nearest_neighbors_algorithm
• Effects of Distance Measure Choice on KNN Classifier Performance - A Review,
V. B. Surya Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
• A Comparative Analysis of Similarity Measures to find Coherent Documents,
Mausumi Goswami et al. http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-
%20as.pdf
• A Comparison Study on Similarity and Dissimilarity Measures in Clustering
Continuous Data, Ali Seyed Shirkhorshidi et al.,
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=
printable
• Cover, Thomas, and, Hart, Peter. Nearest neighbor pattern classification[J].
Information Theory, IEEE Transactions on, 1967, 13(1): 21-27
Bayes Error (https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)

Assume that we know 𝑃(𝑦|𝑥), so we can simply predict the correct
label 𝑦∗ as:
𝒚∗ = 𝒉_𝒐𝒑𝒕 (𝒙) = 𝒂𝒓𝒈𝒎𝒂𝒙_𝒚 𝑷(𝒚|𝒙)
Although the Bayes classifier is optimal, it is not perfect and
can still make mistakes. For example, it will predict incorrectly when a
test point does not have the most likely label.
So, if 𝑃(𝑦 ∗ |𝑥) is the probability of correct classification then the
probability of incorrect classification, the 𝐵𝑎𝑦𝑒𝑠 𝐸𝑟𝑟𝑜𝑟, is given
as:
𝝐𝑩𝒂𝒚𝒆𝒔 = 𝟏 − 𝑷(𝒚∗ |𝒙)
1-NN Error as 𝒏 → ∞ (Cover and Hart 1967, Weinberger Lec 2)
Let 𝑥𝑁𝑁 be the nearest neighbor of our test point 𝑥𝑡
As 𝑛 → ∞, 𝑑𝑖𝑠𝑡 𝑥𝑁𝑁 , 𝑥𝑡 → 0
• i.e. 𝑥𝑁𝑁 → 𝑥𝑡
• 1-NN returns the label of 𝑥𝑁𝑁

[Figure: the nearest-neighbor distance shrinks as the training set grows — small 𝒏, large 𝒏, 𝒏 → ∞]

What is the probability that this is not the correct label of 𝑥𝑡 ?


• As 𝑥𝑁𝑁 → 𝑥𝑡 , the probability of misclassification is the same as the
probability of 𝑥𝑁𝑁 and 𝑥𝑡 having different labels
1-NN Error as 𝒏 → ∞ (Cover and Hart 1967, Weinberger Lec 2)
There are two ways this could happen.
1. 𝑦 ∗ was the correct label of 𝑥𝑡 but the nearest neighbor 𝑥𝑁𝑁 did not have that label
2. 𝑦 ∗ was not the correct label of 𝑥𝑡 but the nearest neighbor 𝑥𝑁𝑁 had that label

Let's define some probabilities:


• The probability that 𝑥𝑁𝑁 had the correct label 𝑦 ∗
o 𝑃(𝑦 ∗ |𝑥𝑁𝑁 )
• The probability that 𝑥𝑁𝑁 did not have the correct label 𝑦 ∗
o 1 − 𝑃(𝑦 ∗ |𝑥𝑁𝑁 )
• The probability that 𝑦 ∗ was the correct label of 𝑥𝑡
o 𝑃(𝑦 ∗ |𝑥𝑡 )
• The probability that 𝑦 ∗ was not the correct label of 𝑥𝑡
o 1-𝑃(𝑦 ∗ |𝑥𝑡 )

Therefore:
1. The probability that 𝑦∗ was the correct label of 𝑥𝑡 but the nearest neighbor 𝑥𝑁𝑁 did not have that
label:
𝑃(𝑦∗ |𝑥𝑡 )(1 − 𝑃(𝑦∗ |𝑥𝑁𝑁 ))
2. The probability that 𝑦∗ was not the correct label of 𝑥𝑡 but the nearest neighbor 𝑥𝑁𝑁 had that
label:
𝑃(𝑦∗ |𝑥𝑁𝑁 )(1 − 𝑃(𝑦∗ |𝑥𝑡 ))
1-NN Error as 𝒏 → ∞ (Cover and Hart 1967, Weinberger Lec 2)
So, the total probability of misclassification is:
𝜖𝑁𝑁 = 𝑃(𝑦∗ |𝑥𝑡 )(1 − 𝑃(𝑦∗ |𝑥𝑁𝑁 )) + 𝑃(𝑦∗ |𝑥𝑁𝑁 )(1 − 𝑃(𝑦∗ |𝑥𝑡 ))

As 𝑃(𝑦∗ |𝑥𝑡 ) ≤ 1 and 𝑃(𝑦∗ |𝑥𝑁𝑁 ) ≤ 1,
𝜖𝑁𝑁 ≤ 1 ⋅ (1 − 𝑃(𝑦∗ |𝑥𝑁𝑁 )) + 1 ⋅ (1 − 𝑃(𝑦∗ |𝑥𝑡 ))

As 𝑥𝑁𝑁 → 𝑥𝑡 , 𝑃(𝑦∗ |𝑥𝑁𝑁 ) → 𝑃(𝑦∗ |𝑥𝑡 ), so

𝜖𝑁𝑁 ≤ (1 − 𝑃(𝑦∗ |𝑥𝑡 )) + (1 − 𝑃(𝑦∗ |𝑥𝑡 ))

𝜖𝑁𝑁 ≤ 2(1 − 𝑃(𝑦∗ |𝑥𝑡 ))

𝝐𝑵𝑵 ≤ 𝟐𝝐𝑩𝒂𝒚𝒆𝒔

As 𝒏 → ∞, the 𝟏-NN classifier is only a factor 𝟐 worse than the best possible classifier.
K Nearest Neighbors
The Curse of Dimensionality
Sources
Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Lecture 2,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
The Curse of Dimensionality, Aaron Lipeles, https://towardsdatascience.com/the-
curse-of-dimensionality-f07c66128fe1
Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-
nearest_neighbors_algorithm
Effects of Distance Measure Choice on KNN Classifier Performance - A Review, V.
B. Surya Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
A Comparative Analysis of Similarity Measures to find Coherent Documents,
Mausumi Goswami et al. http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-
%20as.pdf
A Comparison Study on Similarity and Dissimilarity Measures in Clustering
Continuous Data, Ali Seyed Shirkhorshidi et al.,
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=
printable
Cover, Thomas, and, Hart, Peter. Nearest neighbor pattern classification[J]. Information
Theory, IEEE Transactions on, 1967, 13(1): 21-27
The Curse of Dimensionality
(https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)

KNN Assumption
• Points near one another tend to have the same label
• So, KNN tries to find the nearest points
Problem
• In high dimensions, points drawn from a uniform
distribution are never near one another
o High dimensional spaces are sparsely populated
• So, there are no neighbors near you!
o You have “nearest” neighbors but no “near” neighbors!
o This contradicts the basic assumption of KNN (above)
Sparsely Populated Spaces
• As dimensionality grows, all/most regions of space get sparsely populated
– Fewer observations per region
– E.g., 10 observations spread across:
• 1-d: 3 regions, 2-d: 3² = 9 regions, 1000-d: 3¹⁰⁰⁰ regions

https://people.engr.tamu.edu/rgutier/lectures/iss/iss_l10.pdf
Why is High Dimensionality Bad, in general?
• A small family of instances living in a huge house: What’s the problem?
• Machine Learning methods are statistical:
– Count observations in various regions of some space
– Use counts to predict outcomes
– E.g. Naïve Bayes, Decision Trees
• As dimensionality grows, all/most regions of space get sparsely populated
– Fewer observations per region
– E.g., 10 observations spread across:
• 1-d: 3 regions, 2-d: 3² = 9 regions, 1000-d: 3¹⁰⁰⁰ regions

– Statistics needs repetitions:


• Flip a coin once → head
• P(head) = 100%?
https://people.engr.tamu.edu/rgutier/lectures/iss/iss_l10.pdf
Demonstration1
(https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)

• Let’s draw points randomly
(sampled from a uniform
distribution) from the unit cube
• How much space will the 𝑘 nearest
neighbors of the red test point
occupy?
Demonstration 1
(https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)

• Imagine the unit cube [0, 1]^𝑑
• All training data is sampled uniformly within
this cube:
∀𝑖, 𝑥𝑖 ∈ [0, 1]^𝑑
• Now, let 𝑙 be the edge length of the smallest
cube that contains the 𝑘 nearest neighbors of
the test point.
• The ratio between the volumes of the small cube
𝑙^𝑑 and the unit cube 1^𝑑 is:
𝑙^𝑑 ≈ 𝑘/𝑛
as the points are drawn uniformly. Therefore:
𝑙 ≈ (𝑘/𝑛)^{1/𝑑}
• If 𝑘 = 10 and 𝑛 = 1000, how big is 𝑙?
𝑑          𝑙
2          0.1
10         0.63
100        0.955
1000       0.9954
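A quick sketch that reproduces the 𝑙 ≈ (𝑘/𝑛)^{1/𝑑} values in the table above:

```python
k, n = 10, 1000
for d in (2, 10, 100, 1000):
    l = (k / n) ** (1.0 / d)   # edge length of the cube holding the k nearest neighbors
    print(d, round(l, 4))      # 2 -> 0.1, 10 -> 0.631, 100 -> 0.955, 1000 -> 0.9954
```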
Demonstration 1
(https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)

• If 𝒌 = 𝟏𝟎 and 𝒏 = 𝟏𝟎𝟎𝟎, how big is 𝒍? For any moderately large 𝑑, 𝑙 is
almost 1 (see the table above).
• So, the test point has no neighbors
near it!
• Most “nearest” neighbors are
close to the edges of that cube, otherwise we
would have drawn a smaller cube
around them
• So why should the test point share its
label with these faraway points? The
whole premise of KNN fails!
Demonstration 1
(https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)

• How about we increase the number of
training samples, 𝑛, until the nearest
neighbors are truly close to the test point?
• How many data points do we need so that
𝑙 becomes truly small?
• Suppose we want 𝑙 = 1/10 = 0.1,
i.e., we want the 𝑘 nearest neighbors to fit in a cube
whose edge is one tenth of the unit cube’s edge:
𝑙^𝑑 ≈ 𝑘/𝑛 ⇒ 𝒏 ≈ 𝑘/𝑙^𝑑 = 𝑘 ⋅ 10^𝑑
• This grows exponentially!
• For 𝑑 > 100 we would need far more data
points than there are electrons in the
universe! The current estimate is 10⁸¹ particles in the
observable universe.
Another example (https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)

• Assume a unit line segment
• Let 𝜖 be the length of a very small segment at each end (the “edge” regions)
• What is the probability of drawing a point uniformly at
random from the segment such that it does not lie in
the edge regions 𝜖?
𝑃(𝑁. 𝐸. 𝑝𝑜𝑖𝑛𝑡) = (1 − 2𝜖) / (𝑙𝑒𝑛𝑔𝑡ℎ 𝑜𝑓 𝑡ℎ𝑒 𝑠𝑒𝑔𝑚𝑒𝑛𝑡)
= 1 − 2𝜖
• What if we convert the line segment into a unit square?
What is the probability of an N.E. (non-edge) point now?
𝑃(𝑁. 𝐸. 𝑃𝑜𝑖𝑛𝑡) = (1 − 2𝜖)²
• How about a 𝑑-dimensional cube?
𝑃(𝑁. 𝐸. 𝑃𝑜𝑖𝑛𝑡) = (1 − 2𝜖)^𝑑
• As 1 − 2𝜖 < 1, this will reach 0 very rapidly!
• A point becomes an edge point if it lies in the edge region of even
one dimension. So, the probability of getting a
non-edge point in every single dimension quickly approaches 0.
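A small sketch of how quickly (1 − 2𝜖)^𝑑 collapses to zero (the value of 𝜖 is chosen only for illustration):

```python
eps = 0.01                          # a 1% edge region on each side of every dimension
for d in (1, 2, 10, 100, 1000):
    p_interior = (1 - 2 * eps) ** d
    print(d, round(p_interior, 4))  # 1 -> 0.98, 10 -> 0.8171, 100 -> 0.1326, 1000 -> ~0.0
```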
So how does KNN work at all?
(https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)

• The basic assumption of the Curse of Dimensionality is that the


instances are spread uniformly across the n-dimensional
space
• Many practical problems (SPAM vs. Not SPAM with bag-of-
words, face recognition) etc. have high ambient dimensionality
(SPAM vs. Not SPAM has |V|, face recognition may have
|number of pixels| e.g., 18M) but low intrinsic dimensionality
• Maybe a small subset of |V| helps with the SPAM vs. Not
SPAM classification
• Maybe the actual face-recognition features are far fewer,
e.g., humans recognize a face based on m/f, color, hair color,
spectacles, shape of the nose, etc.
• So, the true or effective dimensionality might be much lower
making KNN still feasible.
Dimensionality: Real vs. Apparent
• ML datasets are typically high dimensional
– Bag-of-word features
• Images: As each pixel is an attribute, we have 2 × 10⁶ features for a 2-megapixel
image
• Text: As each word type is a feature, we can easily get a million features
– The high dimensionality is apparent – not real!
• It is an artifact of the way we store attributes
– The real dimensionality is often much lower
• A surface or manifold (sheet) embedded in a high ambient dimensional space

Lecture 14.1 — Dimensionality Reduction Motivation I | Data Compression — [ Andrew Ng ]


The Curse of Dimensionality
• High dimensional spaces are sparsely populated
• Hand-written digits
– 20 × 20 bitmaps: 400 features and {0, 1}⁴⁰⁰ possible events (feature values)
– However, most of these feature value combinations will never be observed
• Actual digits only occupy a tiny fraction of this 400-dimensional space. Digits are very rare in
that space!
• True dimensionality: Possible variations of pen strokes constrained by the laws of physics
• Try drawing random events from this space!

https://pinetools.com/random-bitmap-generator, https://onlinerandomtools.com/shuffle-words
The Curse of Dimensionality
Another example: Words
• Example paragraph, 62 tokens, 46 word types, Through the Looking-glass by Lewis Carroll
One thing was certain, that the white kitten had had nothing to do with it. It was the black kitten’s fault entirely. For the
white kitten had been having its face washed by the old cat for the last quarter of an hour and bearing it pretty well,
considering; so you see that it couldn’t have had any hand in the mischief.

• Using this vocabulary, we generate random sequences of length 10
(46¹⁰ potential sequences)
• Most of the random combinations of these features make no sense
– for fault kitten’s one entirely to last considering and that
– having having having for its any so old one mischief
– of couldn’t do to so it considering well bearing mischief
– last with entirely been of you entirely face with white
– entirely it entirely so white well an one white face
– you having old in face entirely by thing hour hour
– white one see considering quarter do so bearing been do
• Most of the potential strings would be meaningless!
– So, the true dimensionality must be lower.
So how does KNN work at all?
(https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html)

• The data may lie on a lower-dimensional subspace or manifold


where Euclidean distances may hold locally
K Nearest Neighbors
Enhancements
Sources
Nearest Neighbor Methods, Victor Lavrenko, Assistant Professor at the University of Edinburgh,
https://www.youtube.com/playlist?list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ
Machine Learning for Intelligent Systems, Kilian Weinberger, Cornell, Lecture 2,
https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html
Wiki K-Nearest Neighbors: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
Effects of Distance Measure Choice on KNN Classifier Performance - A Review, V. B. Surya
Prasath et al., https://arxiv.org/pdf/1708.04321.pdf
A Comparative Analysis of Similarity Measures to find Coherent Documents, Mausumi
Goswami et al. http://www.ijamtes.org/gallery/101.%20nov%20ijmte%20-%20as.pdf
A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous
Data, Ali Seyed Shirkhorshidi et al.,
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0144059&type=printable
Cover, Thomas, and, Hart, Peter. Nearest neighbor pattern classification. Information Theory,
IEEE Transactions on, 1967, 13(1): 21-27
Parzen Windows
• Parzen Windows or kernel density estimation
(KDE), is a non-parametric approach to estimate the
probability density function (pdf) of a random variable.

• Particularly used for pattern classification

• Useful for tackling problems where parametric


estimations are not suitable because the underlying
distribution is unknown or difficult to identify.

Parzen Windows
• Idea: Given a set of data points, Parzen Windows estimate
the pdf at any point in the space by counting the number
of data points in a window centered around that point.

• Window Function: A window function (or kernel function)


𝐾 and a window width ℎ are used to define the window.
• Common kernel functions include Gaussian, rectangular, and
Epanechnikov kernels.

Parzen Windows and Kernels
[Figure: a 3-NN neighborhood vs. a fixed-radius Parzen window of radius R around each test point, with training labels 𝒚𝒊 = −𝟏 and 𝒚𝒊 = +𝟏]

Parzen window classifier (fixed radius R around the test point 𝑥):
𝑓(𝑥) = sgn( Σ_{𝑖: 𝑥𝑖 ∈ 𝑅(𝑥)} 𝑦𝑖 )
𝑓(𝑥) = sgn( Σ_𝑖 𝑦𝑖 ⋅ 1[ ‖𝑥𝑖 − 𝑥‖ ≤ 𝑅 ] )
The indicator 1[⋅] is a kernel that converts distances from 𝑥 to numbers (1 inside the window, 0 outside).

Kernelized Parzen window (replace the hard window with a smooth kernel 𝐾):
𝑓(𝑥) = sgn( Σ_𝑖 𝑦𝑖 𝐾(𝑥𝑖 , 𝑥) )
vs.
𝑓(𝑥) = sgn( Σ_𝑖 𝛼𝑖 𝑦𝑖 𝐾(𝑥𝑖 , 𝑥) )

Ref: Victor Lavrenko, University of Edinburgh
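A sketch of the two classifiers above: a hard window of radius R and a Gaussian-kernel version (binary labels 𝑦 ∈ {−1, +1} are assumed; the Gaussian kernel and bandwidth h are illustrative choices):

```python
import numpy as np

def parzen_classify(X_train, y_train, x_test, R=1.0):
    """Hard window: vote with every training point within distance R of x_test."""
    d = np.linalg.norm(X_train - x_test, axis=1)
    votes = np.sum(y_train * (d <= R))            # sum of y_i * 1[||x_i - x|| <= R]
    return 1 if votes >= 0 else -1                # empty window / tie defaults to +1

def kernel_parzen_classify(X_train, y_train, x_test, h=1.0):
    """Soft window: weight each label's vote by a Gaussian kernel K(x_i, x)."""
    d = np.linalg.norm(X_train - x_test, axis=1)
    K = np.exp(-(d ** 2) / (2 * h ** 2))
    return 1 if np.sum(y_train * K) >= 0 else -1
```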
Performance of KNN Algorithm
Time complexity: 𝑂(𝑛𝑑)
Dimensionality Reduction:
• Use dimensionality reduction techniques like Principal Component
Analysis (PCA) or Linear Discriminant Analysis (LDA) to reduce the
number of features, which can significantly speed up distance
computation.
Instance Reduction:
• Reduce the number of training instances by removing redundant or noisy
samples. Techniques such as condensed nearest neighbor (CNN) can
be used for instance reduction.
o An under-sampling technique that seeks a subset of a collection of samples
that results in no loss in model performance, referred to as a minimal
consistent set.
o Enumerate the examples in the dataset and add them to the “store” only if
they cannot be classified correctly by the current contents of the store.
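A sketch of this condensation rule (a simplified single-pass variant; the classic CNN procedure repeats passes until no more examples are added to the store):

```python
import numpy as np
from collections import Counter

def condensed_nn(X, y, k=1):
    """Keep a training example only if the current 'store' misclassifies it."""
    store_X, store_y = [X[0]], [y[0]]                 # seed the store with one example
    for xi, yi in zip(X[1:], y[1:]):
        d = np.linalg.norm(np.array(store_X) - xi, axis=1)
        nearest = np.argsort(d)[:k]
        pred = Counter(np.array(store_y)[nearest]).most_common(1)[0][0]
        if pred != yi:                                # misclassified -> add to the store
            store_X.append(xi)
            store_y.append(yi)
    return np.array(store_X), np.array(store_y)
```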

Performance of KNN Algorithm
Time complexity: 𝑂(𝑛𝑑)
Using More Efficient Distance Metrics:
• Choose distance metrics that are less computationally
intensive. For instance, Manhattan distance may be computed
faster than Euclidean distance in high-dimensional spaces.
Using Approximate Nearest Neighbor Algorithms:
• Algorithms like Locality-Sensitive Hashing (LSH) or k-d trees
can be used to approximate the nearest neighbors,
significantly reducing computation time.

Performance of KNN Algorithm
• Time complexity: 𝑂(𝑛𝑑)
• Reduce 𝑑: Dimensionality reduction
• Reduce 𝑛: Compare to a subset of examples
• Identify 𝑚 ≪ 𝑛 potential near neighbors to compare against
• 𝑂(𝑚𝑑)
• K-D trees: Low-dimensional, real-valued data
o 𝑂(𝑑 log₂ 𝑛), only works when 𝑑 ≪ 𝑛, inexact: can miss neighbors
• Locality-sensitive hashing: high-dimensional, real or discrete
o 𝑂(𝑛′𝑑), 𝑛′ ≪ 𝑛, inexact: can miss neighbors
• Inverted lists: High-dimensional, discrete (sparse) data
o 𝑂(𝑛′𝑑′), where 𝑑′ ≪ 𝑑, 𝑛′ ≪ 𝑛, only for sparse data (e.g. text), exact

K-D Trees
• Pick a random dimension, find median, split data, repeat
1,9 , 2,3 , 4,1 , 3,7 , 5,4 , 6,8 , 7,2 , 8,8 , 7,9 , 9,6

• 𝑂(𝑑 log₂ 𝑛)
• E.g. test point: (7,4)
• Compare with all the points in the region
• Can easily miss nearest neighbors

Example ref: Victor Lavrenko, University of Edinburgh, https://www.youtube.com/playlist?list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ


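In practice a library K-D tree handles the median splits and queries; a sketch with SciPy on the slide's points. Note that SciPy's query backtracks and is exact — the "compare only with the points in the test point's region" shortcut described above is what makes the fast variant inexact:

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([(1, 9), (2, 3), (4, 1), (3, 7), (5, 4),
                   (6, 8), (7, 2), (8, 8), (7, 9), (9, 6)], float)
tree = cKDTree(points)                 # recursively splits the data on medians

dist, idx = tree.query([7, 4], k=3)    # 3 nearest neighbors of the test point (7, 4)
print(points[idx], dist)
```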
Locality-sensitive Hashing
• Draw random hyper-planes ℎ1 , … , ℎ𝑘
• The space is sliced into 2𝑘 regions
• Polytopes: Flat sides, bounded by
hyperplanes
• Mutually exclusive
• Compare x only to training points in that region
• Complexity: 𝑂(𝑑 𝑙𝑜𝑔𝑛) if 𝑘 ≈ 𝑙𝑜𝑔 𝑛
• Inexact: Can miss neighbors
• Repeat with different hyperplanes
• Why do we need these?
• In case of K-D trees, in high dimensions,
someone could be your neighbor in 𝑑 − 1
dimensions, but still very far away in the 𝑑-th
dimension
• LSH cuts across dimensions rather than
doing it one by one
Example ref: Victor Lavrenko, University of Edinburgh, https://www.youtube.com/playlist?list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ
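A minimal sketch of the random-hyperplane idea: points are bucketed by the sign pattern of 𝑘 random projections, and a query is compared only against its own bucket (a single hash table for illustration; as noted above, real LSH repeats with several sets of hyperplanes):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, k = 20, 8                                   # dimensionality, number of hyperplanes
X = rng.normal(size=(1000, d))                 # toy training data
H = rng.normal(size=(k, d))                    # k random hyperplanes through the origin

def bucket(x):
    """Sign pattern of the k projections -> one of 2^k regions (polytopes)."""
    return tuple((H @ x > 0).astype(int))

table = defaultdict(list)
for i, x in enumerate(X):
    table[bucket(x)].append(i)

q = rng.normal(size=d)
candidates = table[bucket(q)]                  # compare q only to points in its region
print(len(candidates), "candidate neighbors out of", len(X))
```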
Inverted Lists
• High dimensional, sparse data
• New email: “account review”
• 𝑂(𝑑√𝑛), where 𝑑: number of non-zero attributes, √𝑛: average length of an inverted list
• Exact: does not miss neighbors

Example ref: Victor Lavrenko, University of Edinburgh, https://www.youtube.com/playlist?list=PLBv09BD7ez_48heon5Az-TsyoXVYOJtDZ


Conclusion
The k-Nearest Neighbors (k-NN) algorithm is a simple, instance-based
learning algorithm.
Strengths:
1. Simple and Intuitive: k-NN is easy to understand and implement.
2. No Training Phase: k-NN doesn’t require a training phase. New data
can be added seamlessly.
3. Non-Parametric: It makes no assumptions about the underlying data
distribution.
4. Multiclass Classification: Can be used for multiclass classification
problems.
5. Versatile: Can be used for classification, regression, and search tasks.

Conclusion
Weaknesses:
1. Computational Complexity: Computationally expensive, especially
with large datasets as it requires computing the distance to all
training samples for each query.
2. Memory Intensive: Requires storing the entire dataset, which can be
memory-intensive for large datasets.
3. Sensitive to Irrelevant Features: The performance of k-NN can
degrade with irrelevant or redundant features because it uses all
features equally in the distance calculation.
4. Sensitive to the Scale of the Data: Different scales of features can
impact the performance of k-NN. Feature scaling is often required.
5. Optimal k Value: Choosing the optimal number of neighbors (𝑘) can
be challenging.

For more details please visit

http://aghaaliraza.com

Thank you!