12 ML KNN
12 ML KNN
Idea:
k-NN stands for k-Nearest neighbour
k-NN is a simple algorithm
Classifies new cases based on a
similarity measure of new case to the
already instances having class labels
KNN – different names
• K-Nearest Neighbors
• Considers k neighbors only
• Memory-Based Reasoning
• Required data for computation
• Instance-Based Learning
• Example-Based Reasoning
• Case-Based Reasoning
• Takes existing examples into account, considers test
instance and computes result
• Lazy Learning
• All computation deferred until classification decision
WHY NEAREST NEIGHBOR?
? 4
Instance-Based Classifiers
Set of Stored Cases • Store the training records
Compute
Distance Test
Record
X X X
k = 1:
Belongs to square class
k = 3:
? Belongs to triangle class
k = 7:
Belongs to square class
12
8
k NEAREST NEIGHBOR
k = 1:
Belongs to square class
k = 3:
?
Belongs to triangle class
k = 7:
Belongs to square class
$250,000
$200,000
$150,000
Loan$ Non-Default
$100,000 Default
$50,000
$0
0 10 20 30 40 50 60 70
Age
KNN Classification – Distance
Age Loan Default Distance
25 $40,000 N 102000
35 $60,000 N 82000
45 $80,000 N 62000
20 $20,000 N 122000
35 $120,000 N 22000
52 $18,000 N 124000
23 $95,000 Y 47000
40 $62,000 Y 80000
60 $100,000 Y 42000
48 $220,000 Y 78000
33 $150,000 Y 8000
48 $142,000 ?
D ( x1 x2 ) ( y1 y2 )
2 2
KNN Classification – Standardized
Distance
Age Loan Default Distance
0.125 0.11 N 0.7652
0.375 0.21 N 0.5200
0.625 0.31 N 0.3160
0 0.01 N 0.9245
0.375 0.50 N 0.3428
0.8 0.00 N 0.6220
0.075 0.38 Y 0.6669
0.5 0.22 Y 0.4437
1 0.41 Y 0.3650
0.7 1.00 Y 0.3861
0.325 0.65 Y 0.3771
0.7 0.61 ?
X Min
Xs
Max Min
Nearest Neighbor Classification
Compute distance between two points:
Euclidean distance
d ( p, q ) ( pi
i
q ) i
2
18
7
Example
(Test: Durability:3, Strength:7, Class;?)
Type-2 7 4 Bad
Type-3 3 4 Good
Type-4 1 4 Good
Example
Type No Item Item Class Distance
Durabilit Streng
y th
Type-1 7 7 Bad Sqrt((7-3)2 + (7-
7)2) = 4
Type-2 7 4 Bad
Type-3 3 4 Good
Type-4 1 4 Good
Type No Item Item Class Distance
Durabilit Strengt
y h
Type-2 7 4 Bad 5
Type-3 3 4 Good 3
Type-3 3 4 Good 3 1
23
9
Distance – Categorical Variables
X Y Distance
Male Male 0
Male Female 1
x yD0
x y D 1
KNN - Applications
• Classification
– legal,
– medical,
– news,
– banking
• Problem-solving
– planning,
– pronunciation
• Teaching and aiding
– help desk,
– user training
Merits and Demerits
Advantages
Simple technique that is easily implemented
Can work with relatively little information
Well suited for multi classes problems
Learning is simple (does not involve preprocessing )
Performs best in some cases (gene and protein identification)
Dis-advantages
Memory issues and expensive computation for large datasets
Low Accuracy if presence of noisy or irrelevant features
Feature selection problem
Exercise: KNN using R
KNN tutorial using R language
https://www.youtube.com/watch?v=lDCWX6vCLFA
All steps
UCI dataset
WDBC data
U use any data you like
Use python or any other tool
27
9