
K Nearest Neighbors

Introduction To Machine Learning


Dr. Hikmat Ullah Khan
KNN - Definition

Idea:
 k-NN stands for k-Nearest Neighbours
 k-NN is a simple algorithm
 It classifies new cases based on a similarity measure between the new case and stored instances that already have class labels
KNN – Different Names
• K-Nearest Neighbors
  • Considers the k nearest neighbors only
• Memory-Based Reasoning
  • Keeps the training data in memory for computation
• Instance-Based Learning
• Example-Based Reasoning
• Case-Based Reasoning
  • Reasons from stored examples: compares the test instance against them and computes the result
• Lazy Learning
  • All computation is deferred until a classification decision is needed
WHY NEAREST NEIGHBOR?

 Used to classify objects based on the closest training examples in the feature space
 One of the top 10 data mining algorithms (ICDM paper, December 2007)
 A very good starting point for students
 Among the simplest of all data mining algorithms
 A classification method
Instance-Based Classifiers

• Store the training records
• Use the training records to predict the class label of unseen cases

[Figure: a set of stored cases with attributes Atr1 … AtrN and a class label (A, B, C, …), alongside an unseen case with the same attributes but no label]
Nearest Neighbor Classifiers
 Basic idea:
 If it swims like a duck and quacks like a duck, then it’s probably a duck

[Figure: compute the distance from the test record to every training record, then choose the k “nearest” records]
KNN – Number of Neighbors
 If k = 1,
 assign the class of the single nearest neighbor
 If k > 1,
 take a majority vote among the classes of the k nearest neighbors
Nearest-Neighbor Classifiers

[Figure: an unknown record among labeled training points]

 Requires three things:
 – The set of stored records
 – A distance metric to compute the distance between records
 – The value of k, the number of nearest neighbors to retrieve
Nearest-Neighbor Classifiers
 To classify an unknown record:
 – Compute its distance to all training records
 – Identify the k nearest neighbors
 – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote), as sketched below
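
These three steps translate almost directly into code. Below is a minimal Python sketch (the names euclidean and classify and the toy data are illustrative, not from the slides):

```python
import math
from collections import Counter

def euclidean(p, q):
    # Distance metric: straight-line distance between two records
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def classify(unknown, records, labels, k):
    # 1. Compute the distance from the unknown record to every training record
    distances = [(euclidean(unknown, r), c) for r, c in zip(records, labels)]
    # 2. Identify the k nearest neighbors
    nearest = sorted(distances)[:k]
    # 3. Majority vote over the neighbors' class labels
    return Counter(c for _, c in nearest).most_common(1)[0][0]

# Toy usage: two nearby "A" points outvote the distant "B" point
print(classify((1, 1), [(0, 0), (1, 2), (9, 9)], ["A", "A", "B"], k=3))  # -> "A"
```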
Definition of Nearest Neighbor

[Figure: three panels around a test point x — (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
k NEAREST NEIGHBOR

[Figure: an unlabeled test point “?” among square- and triangle-class training points]

 k = 1: belongs to the square class
 k = 3: belongs to the triangle class
 k = 7: belongs to the square class

 Choosing the value of k:
 If k is too small, the classifier is sensitive to noise points
 If k is too large, the neighborhood may include points from other classes
 Choose an odd value for k to avoid ties (in two-class problems)
KNN Classification

[Figure: scatter plot of Loan amount ($0–$250,000) versus Age (0–70), with points labeled Default and Non-Default]
KNN Classification – Distance

Distance from each training record to the test record (Age = 48, Loan = $142,000):

Age   Loan       Default   Distance
25    $40,000    N         102,000
35    $60,000    N         82,000
45    $80,000    N         62,000
20    $20,000    N         122,000
35    $120,000   N         22,000
52    $18,000    N         124,000
23    $95,000    Y         47,000
40    $62,000    Y         80,000
60    $100,000   Y         42,000
48    $220,000   Y         78,000
33    $150,000   Y         8,000

48    $142,000   ?

D = √((x1 − x2)² + (y1 − y2)²)

With k = 1, the nearest neighbor is the record (Age 33, Loan $150,000, Default = Y) at distance 8,000, so the test case is classified as Default = Y.
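
As a sanity check, a short Python sketch (my own, using the table's values) reproduces the distance column and the 1-NN prediction:

```python
import math

# (Age, Loan, Default) rows from the table above
train = [(25, 40_000, "N"), (35, 60_000, "N"), (45, 80_000, "N"),
         (20, 20_000, "N"), (35, 120_000, "N"), (52, 18_000, "N"),
         (23, 95_000, "Y"), (40, 62_000, "Y"), (60, 100_000, "Y"),
         (48, 220_000, "Y"), (33, 150_000, "Y")]
test = (48, 142_000)

# Euclidean distance on the raw, unscaled attributes;
# note that the Loan attribute completely dominates the distance
dists = [(math.hypot(age - test[0], loan - test[1]), cls) for age, loan, cls in train]
print(min(dists))  # ≈ (8000.0, 'Y'): the raw 1-NN prediction is Default = Y
```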
KNN Classification – Standardized Distance

Age     Loan   Default   Distance
0.125   0.11   N         0.7652
0.375   0.21   N         0.5200
0.625   0.31   N         0.3160
0       0.01   N         0.9245
0.375   0.50   N         0.3428
0.8     0.00   N         0.6220
0.075   0.38   Y         0.6669
0.5     0.22   Y         0.4437
1       0.41   Y         0.3650
0.7     1.00   Y         0.3861
0.325   0.65   Y         0.3771

0.7     0.61   ?

Xs = (X − Min) / (Max − Min)

After standardization the nearest neighbor is the third record (distance 0.3160, Default = N), so the 1-NN prediction flips from Y to N: scaling the attributes changes the result.
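
The same computation after min-max scaling, again as an illustrative sketch with the table's values:

```python
import math

# Same data as the raw-distance snippet above
train = [(25, 40_000, "N"), (35, 60_000, "N"), (45, 80_000, "N"),
         (20, 20_000, "N"), (35, 120_000, "N"), (52, 18_000, "N"),
         (23, 95_000, "Y"), (40, 62_000, "Y"), (60, 100_000, "Y"),
         (48, 220_000, "Y"), (33, 150_000, "Y")]
test = (48, 142_000)

# Min-max scale each attribute to [0, 1], Xs = (X - min) / (max - min),
# with the test point included in the attribute ranges
ages  = [r[0] for r in train] + [test[0]]
loans = [r[1] for r in train] + [test[1]]
scale = lambda x, v: (x - min(v)) / (max(v) - min(v))

s_train = [(scale(a, ages), scale(l, loans), c) for a, l, c in train]
s_test  = (scale(test[0], ages), scale(test[1], loans))

dists = [(math.hypot(a - s_test[0], l - s_test[1]), c) for a, l, c in s_train]
print(min(dists))  # ≈ (0.316, 'N'): after scaling, the 1-NN prediction is N, not Y
```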
Nearest Neighbor Classification
 Compute the distance between two points:
 Euclidean distance:

  d(p, q) = √∑i (pi − qi)²

 Determine the class from the nearest-neighbor list
 Take the majority vote of class labels among the k nearest neighbors
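
In practice one would usually call a library rather than hand-roll the loop. For instance, assuming scikit-learn is available, the majority-vote classifier described on this slide looks like this (toy data invented for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 3], [8, 8], [9, 9]]  # training records
y = ["A", "A", "B", "B"]              # class labels

# k = 3 neighbors, Euclidean distance (scikit-learn's default metric)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 2]]))  # majority vote of the 3 nearest -> ['A']
```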
k NEAREST NEIGHBOR
 Common distance metrics:
 Euclidean distance (continuous attributes):
  d(p, q) = √∑(pi − qi)²

 Hamming distance (overlap metric for strings):
  bat → cat (distance = 1), toned → roses (distance = 3)

 Discrete metric (boolean metric):
  if x = y then d(x, y) = 0; otherwise d(x, y) = 1
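
A quick sketch of the last two metrics (assuming equal-length strings for Hamming distance):

```python
def hamming(p, q):
    # Number of positions at which two equal-length strings differ
    return sum(a != b for a, b in zip(p, q))

def discrete(x, y):
    # Boolean metric: 0 if identical, 1 otherwise
    return 0 if x == y else 1

print(hamming("bat", "cat"))      # 1
print(hamming("toned", "roses"))  # 3
print(discrete("duck", "duck"))   # 0
```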
Example

Test instance: Durability = 3, Strength = 7, Class = ?

Type No   Item Durability   Item Strength   Class
Type-1    7                 7               Bad
Type-2    7                 4               Bad
Type-3    3                 4               Good
Type-4    1                 4               Good
Example

Type No   Item Durability   Item Strength   Class   Distance                  Rank
Type-1    7                 7               Bad     √((7−3)² + (7−7)²) = 4    3
Type-2    7                 4               Bad     5                         4
Type-3    3                 4               Good    3                         1
Type-4    1                 4               Good    3.6                       2

With k = 3, the three nearest neighbors are Type-3 (Good), Type-4 (Good) and Type-1 (Bad), so the majority vote classifies the test instance as Good.
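
Feeding this table through the scikit-learn call shown earlier reproduces the result (again assuming the library is available):

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[7, 7], [7, 4], [3, 4], [1, 4]]  # (Durability, Strength)
y = ["Bad", "Bad", "Good", "Good"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[3, 7]]))  # -> ['Good'] (neighbors: Type-3, Type-4, Type-1)
```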


k NEAREST NEIGHBOR
 The accuracy of all NN-based algorithms depends on how the data is represented and scaled.
 Scaling issues
 Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes.
 Examples
 Height of a person may vary from 4′ to 6′
 Weight of a person may vary from 100 lbs to 300 lbs
 Income of a person may vary from $10K to $500K
Distance – Categorical Variables

X      Y        Distance
Male   Male     0
Male   Female   1

If x = y, then D = 0; if x ≠ y, then D = 1.
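
One hypothetical way to combine this overlap metric with numeric attributes in a single distance function (a sketch, not from the slides):

```python
def mixed_distance(p, q, categorical):
    # Overlap metric (0/1) for categorical attributes,
    # squared difference for numeric ones
    total = 0.0
    for i, (a, b) in enumerate(zip(p, q)):
        if i in categorical:
            total += 0 if a == b else 1
        else:
            total += (a - b) ** 2
    return total ** 0.5

# Attributes: (age, gender); gender (index 1) is categorical
print(mixed_distance((25, "Male"), (30, "Female"), categorical={1}))  # ≈ 5.10
```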
KNN - Applications
• Classification
– legal,
– medical,
– news,
– banking
• Problem-solving
– planning,
– pronunciation
• Teaching and aiding
– help desk,
– user training
Merits and Demerits
 Advantages
 Simple technique that is easily implemented
 Can work with relatively little information
 Well suited for multi-class problems
 Learning is simple (no explicit training or preprocessing step)
 Performs best in some cases (gene and protein identification)

 Disadvantages
 Memory-intensive and computationally expensive for large datasets
 Low accuracy in the presence of noisy or irrelevant features
 Sensitive to feature selection
Exercise: KNN using R
 KNN tutorial using the R language
 https://www.youtube.com/watch?v=lDCWX6vCLFA

 Work through all the steps
 UCI dataset: WDBC data
 You may use any data you like
 Use Python or any other tool
