Mathematical explanation of K-Nearest Neighbour

K-nearest neighbour (KNN) is a popular supervised learning algorithm commonly used for classification tasks. It classifies a data point based on its similarity to neighboring data points. The core idea is straightforward: when a new data point is introduced, the algorithm finds its K nearest neighbors and assigns the most frequent class among those neighbors to the new point.

Working of K-Nearest Neighbour

The KNN algorithm stores all available cases and classifies new data based on the majority class of its nearest neighbors. The value of K is the number of nearest neighbors considered when performing classification.

The K parameter is critical because:

If K is too small, the model may be sensitive to noise in the dataset.
If K is too large, the classification might be too generalized, and nuances in the data may be overlooked.

Distance between data points is measured using a distance metric, such as Euclidean distance, to find the nearest neighbors.

How do we choose K?

Choosing the right value for K is crucial:

A commonly used rule of thumb is to select K ≈ sqrt(n), where n is the number of data points in the dataset.
If the resulting K is even, adjust it to be odd by adding or subtracting 1 to reduce the chance of ties in majority voting.
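The rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function name choose_k is hypothetical, and moving an even K up by 1 (rather than down) is an arbitrary choice made here.

```python
import math

def choose_k(n):
    """Rule-of-thumb K: roughly sqrt(n), nudged to an odd number."""
    k = max(1, round(math.sqrt(n)))
    # Assumption: when K comes out even, add 1 (subtracting 1 works too).
    if k % 2 == 0:
        k += 1
    return k

print(choose_k(10))  # 3
```

For a dataset of 10 points, sqrt(10) is about 3.16, so this heuristic gives K = 3. In practice the value is often checked against held-out data rather than taken from the rule alone.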

Let’s work through an example of KNN to make the concept clearer. Below is a dataset that lists the age, gender and class of sport for ten people.

NAME	AGE	GENDER	CLASS OF SPORTS
Ajay	32	0	Football
Mark	40	0	Neither
Sara	16	1	Cricket
Zaira	34	1	Cricket
Sachin	55	0	Neither
Rahul	40	0	Cricket
Pooja	20	1	Neither
Smith	15	0	Cricket
Laxmi	55	1	Football
Michael	15	0	Football

Here male is denoted by the numeric value 0 and female by 1. Let’s find which class Angelina, whose age is 5, falls into when K = 3. To do that, we first compute distances using the Euclidean distance formula:

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}

This gives the distance between any two points (x_1, y_1) and (x_2, y_2).

To calculate the distance between Angelina and other individuals in the dataset:

d = \sqrt{(age_2 - age_1)^2 + (gender_2 - gender_1)^2}

Here, Angelina has:

Age = 5
Gender = 1 (female)
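The formula above translates directly into code. The sketch below is only illustrative; the function name euclidean_distance and its argument order are assumptions, and it checks a single pair (Angelina and Ajay) from the table.

```python
import math

def euclidean_distance(age_1, gender_1, age_2, gender_2):
    """Distance between two people described by (age, gender)."""
    return math.sqrt((age_2 - age_1) ** 2 + (gender_2 - gender_1) ** 2)

angelina = (5, 1)   # age 5, female
ajay = (32, 0)      # age 32, male

d = euclidean_distance(*angelina, *ajay)
print(round(d, 2))  # 27.02
```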

1. Distance between Angelina and Ajay (age = 32, gender = 0):

d = \sqrt{(5 - 32)^2 + (1 - 0)^2} = \sqrt{729 + 1} = \sqrt{730} = 27.02

2. Distance between Angelina and Mark (age = 40, gender = 0):

d = \sqrt{(5 - 40)^2 + (1 - 0)^2} = \sqrt{1225 + 1} = \sqrt{1226} = 35.01

3. Distance between Angelina and Sara (age = 16, gender = 1):

d = \sqrt{(5 - 16)^2 + (1 - 1)^2} = \sqrt{121 + 0} = \sqrt{121} = 11.00

4. Distance between Angelina and Zaira (age = 34, gender = 1):

d = \sqrt{(5 - 34)^2 + (1 - 1)^2} = \sqrt{841 + 0} = \sqrt{841} = 29.00

5. Distance between Angelina and Sachin (age = 55, gender = 0):

d = \sqrt{(5 - 55)^2 + (1 - 0)^2} = \sqrt{2500 + 1} = \sqrt{2501} = 50.01

6. Distance between Angelina and Rahul (age = 40, gender = 0):

d = \sqrt{(5 - 40)^2 + (1 - 0)^2} = \sqrt{1225 + 1} = \sqrt{1226} = 35.01

7. Distance between Angelina and Pooja (age = 20, gender = 1):

d = \sqrt{(5 - 20)^2 + (1 - 1)^2} = \sqrt{225 + 0} = \sqrt{225} = 15.00

8. Distance between Angelina and Smith (age = 15, gender = 0):

d = \sqrt{(5 - 15)^2 + (1 - 0)^2} = \sqrt{100 + 1} = \sqrt{101} = 10.05

9. Distance between Angelina and Laxmi (age = 55, gender = 1):

d = \sqrt{(5 - 55)^2 + (1 - 1)^2} = \sqrt{2500 + 0} = \sqrt{2500} = 50.00

10. Distance between Angelina and Michael (age = 15, gender = 0):

d = \sqrt{(5 - 15)^2 + (1 - 0)^2} = \sqrt{100 + 1} = \sqrt{101} = 10.05

NAME	DISTANCE FROM ANGELINA
Ajay	27.02
Mark	35.01
Sara	11.00
Zaira	29.00
Sachin	50.01
Rahul	35.01
Pooja	15.00
Smith	10.05
Laxmi	50.00
Michael	10.05
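The distance table can be reproduced programmatically as a sanity check. This sketch assumes Python 3.8 or later for math.dist; the variable names people, query and ranked are illustrative, and the rows are copied from the dataset above.

```python
import math

# (name, age, gender, sport) rows copied from the dataset above
people = [
    ("Ajay", 32, 0, "Football"),
    ("Mark", 40, 0, "Neither"),
    ("Sara", 16, 1, "Cricket"),
    ("Zaira", 34, 1, "Cricket"),
    ("Sachin", 55, 0, "Neither"),
    ("Rahul", 40, 0, "Cricket"),
    ("Pooja", 20, 1, "Neither"),
    ("Smith", 15, 0, "Cricket"),
    ("Laxmi", 55, 1, "Football"),
    ("Michael", 15, 0, "Football"),
]
query = (5, 1)  # Angelina: age 5, gender 1

# Euclidean distance from Angelina to each person, rounded to 2 decimals
distances = {
    name: round(math.dist(query, (age, gender)), 2)
    for name, age, gender, _ in people
}

# Sort from nearest to farthest
ranked = sorted(distances.items(), key=lambda item: item[1])
for name, d in ranked:
    print(f"{name:8s}{d:6.2f}")
```

Sorting the rows by distance makes the nearest neighbors easy to read off the top of the list.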

With K = 3, the 3 nearest neighbors to Angelina are:

Smith (Cricket): 10.05
Michael (Football): 10.05
Sara (Cricket): 11.00

So, according to the KNN algorithm, which classifies by majority vote (two votes for Cricket against one for Football), Angelina falls into the class of people who play Cricket.
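The whole procedure (compute distances, keep the K nearest, take the majority vote) can be put together as a short sketch. This is a plain-Python illustration under the same assumptions as the worked example, not a production implementation; the function name knn_classify is hypothetical, and vote ties are resolved by whatever order Counter.most_common returns.

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(training, key=lambda row: math.dist(row[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# ((age, gender), sport) pairs copied from the dataset above
training = [
    ((32, 0), "Football"), ((40, 0), "Neither"), ((16, 1), "Cricket"),
    ((34, 1), "Cricket"), ((55, 0), "Neither"), ((40, 0), "Cricket"),
    ((20, 1), "Neither"), ((15, 0), "Cricket"), ((55, 1), "Football"),
    ((15, 0), "Football"),
]

prediction = knn_classify(training, (5, 1), k=3)
print(prediction)  # Cricket
```

Note that age spans a much wider range than gender in this dataset, so age dominates the distance in the unscaled form used here, just as it does in the hand calculation.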

This example illustrates the working of KNN and how it classifies data based on the majority class of its nearest neighbors. By calculating the distance between data points, KNN helps in making predictions about new data. The importance of selecting the right value of K and handling categorical data appropriately (such as converting gender to numeric values) cannot be overstated in ensuring accurate classification results.
