Confusion Matrix
for evaluating the KNN model
KNN Algorithm
• Generates a predictive model based on supervised learning.
• Trained on labelled data.
• It is a nearest-neighbour algorithm.
• Predictions for new data are made based on feature similarity with the nearest neighbours.
• Also called a feature similarity algorithm.
• We should choose the right 'K', usually around sqrt(n) (see the sketch below).
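As a rough illustration of the rule of thumb above, the choice of K can be sketched in R as follows (the training data frame name fibre_train is an assumption, not something given in the slides):

# Rule-of-thumb choice of K: roughly sqrt(n), nudged to an odd number
# 'fibre_train' is a hypothetical training data frame
n <- nrow(fibre_train)
k <- round(sqrt(n))
if (k %% 2 == 0) k <- k + 1   # an odd K avoids ties in two-class problems
k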
Examples where KNN is used for prediction
• Amazon uses it to recommend books to new customers
• Banks use it to approve loans
• Doctors use it to predict diabetes
• Predicting the risk of prostate cancer
Steps for Developing a KNN Algorithm
Step -1: Data Collection followed by importing data in R
Step -2 Prepare and explore data
Step 3: Normalizing numeric data
Step 4: Creating training and test data set
Step 5: Training a model on data
Step 6: Evaluate the model performance….Using Confusion Matrix
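The six steps above can be sketched end to end in R. The sketch below is a minimal illustration, assuming a data frame fibre_data whose column type holds the class label (cotton / silk / wool) and whose remaining columns are numeric features; it uses knn() from the class package, which the slides do not explicitly name.

# Steps 1-2: data assumed already imported and explored as 'fibre_data'
library(class)

# Step 3: normalize numeric features to the 0-1 range
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features <- as.data.frame(lapply(fibre_data[, names(fibre_data) != "type"], normalize))
labels   <- factor(fibre_data$type)

# Step 4: create training and test data sets (a simple 80/20 split)
set.seed(123)
idx <- sample(seq_len(nrow(features)), size = floor(0.8 * nrow(features)))
fibre_train <- features[idx, ];  fibre_train_target <- labels[idx]
fibre_test  <- features[-idx, ]; fibre_test_target  <- labels[-idx]

# Step 5: train the model / predict on the test set, with K near sqrt(n)
k  <- round(sqrt(nrow(fibre_train)))
m1 <- knn(train = fibre_train, test = fibre_test, cl = fibre_train_target, k = k)

# Step 6: evaluate the model's performance with a confusion matrix
table(Predicted = m1, Actual = fibre_test_target)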
What is a Confusion Matrix (CM)
• A useful tool for calibrating the output of a model, i.e. a tool for evaluating the model's performance
• Examines all possible outcomes: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN)
• Categorises the predictions against the actual values; it is a two-dimensional matrix of predicted values x actual values
• Gives a lot of additional information beyond the accuracy of the KNN model
• Gets its name because it shows how "confused" the model is between predicted outcome values and actual outcome values
• The columns in a CM represent the actual classes and the rows the predicted classes, or vice versa
What does a Confusion Matrix (CM) look like?
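As a quick plain-R illustration of the layout, a 2 x 2 confusion matrix can be produced with table(); the predicted and actual vectors below are made up purely to show the shape of the output.

# Hypothetical predicted and actual outcomes, only to show the 2 x 2 layout
predicted <- factor(c("Y", "Y", "N", "N", "Y", "N"), levels = c("Y", "N"))
actual    <- factor(c("Y", "N", "N", "Y", "Y", "N"), levels = c("Y", "N"))
table(Predicted = predicted, Actual = actual)
#          Actual
# Predicted Y N
#         Y 2 1
#         N 1 2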
What do TP, TN, FP and FN stand for in the Confusion Matrix (CM)?
In all, there are 4 possible outcomes:
TP: cases where the model correctly predicts the positive outcome (Y), and the actual outcome is Y
FP: cases where the model incorrectly predicts Y, while the actual outcome is N
TN: cases where the model correctly predicts the negative outcome (N), and the actual outcome is N
FN: cases where the model incorrectly predicts N, while the actual outcome is Y
The matrix can be binary, with two levels, or have more levels.
In the case of the fibre identification problem, there are three levels, Cotton, Silk and Wool, for both the actual and the predicted values; the CM in this case will be a 3 x 3 matrix (an illustrative sketch follows).
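For a three-level problem like the fibre case, the same table() call gives a 3 x 3 matrix; the vectors below are illustrative only, not the actual fibre data.

# Illustrative 3-level example: the diagonal holds the correct predictions
predicted <- factor(c("Cotton", "Silk", "Wool", "Cotton", "Silk"),
                    levels = c("Cotton", "Silk", "Wool"))
actual    <- factor(c("Cotton", "Wool", "Wool", "Cotton", "Silk"),
                    levels = c("Cotton", "Silk", "Wool"))
table(Predicted = predicted, Actual = actual)   # a 3 x 3 confusion matrix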
Examples of Confusion Matrices (CM)
• For predicting mails as spam or non-spam
• For predicting cancer or no cancer in diagnoses
• For predicting whether a cancer is benign or malignant
For the prediction of prostate cancer among men, based on their medical reports for test 1 and test 2:
• Random sample tests (say, test 1 and test 2) were performed on 500 men.
• Of these, 50 actually have prostate cancer.
• The model predicted 100 total cases of prostate cancer,
• 45 of which actually have prostate cancer.
Confusion Matrix

                      Actual Positive           Actual Negative
Predicted Positive    TP = 45                   FP = 55 (Type I error)
Predicted Negative    FN = 5 (Type II error)    TN = 395
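The counts above can be entered directly as a labelled matrix in R, purely to illustrate the layout (the numbers are the hypothetical ones from this example):

# The prostate-cancer counts as a labelled 2 x 2 matrix (rows = predicted)
cm <- matrix(c(45, 5, 55, 395), nrow = 2,
             dimnames = list(Predicted = c("Positive", "Negative"),
                             Actual    = c("Positive", "Negative")))
cm
#            Actual
# Predicted   Positive Negative
#   Positive        45       55
#   Negative         5      395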
We can measure additional information from the Confusion Matrix.

Using the labelled confusion matrix above for the prostate-cancer prediction (a random sample of 500 men, of whom 50 actually have prostate cancer; 100 predicted positive, 45 of them correctly), calculate the statistics:

1. Accuracy (all correct / all) = (TP + TN) / (TP + TN + FP + FN)
   = (45 + 395) / 500 = 440 / 500 = 0.88, i.e. 88% accuracy

2. Misclassification (all incorrect / all) = (FP + FN) / (TP + TN + FP + FN)
   = (55 + 5) / 500 = 60 / 500 = 0.12, i.e. 12% misclassification
   (you can also just take 1 - Accuracy: 1 - 0.88 = 0.12, i.e. 12% misclassification)

3. Precision (true positives / predicted positives) = TP / (TP + FP)
   = 45 / (45 + 55) = 45 / 100 = 0.45, i.e. 45% precision

4. Sensitivity, aka Recall (true positives / all actual positives) = TP / (TP + FN)
   = 45 / (45 + 5) = 45 / 50 = 0.90, i.e. 90% sensitivity

5. Specificity (true negatives / all actual negatives) = TN / (TN + FP)
   = 395 / (395 + 55) = 395 / 450 = 0.88, i.e. 88% specificity
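The five statistics can be reproduced in R directly from the four counts; this is just the arithmetic above, not the output of a fitted model.

# Counts from the prostate-cancer example
TP <- 45; FP <- 55; FN <- 5; TN <- 395
total <- TP + TN + FP + FN                   # 500

accuracy          <- (TP + TN) / total       # 0.88
misclassification <- (FP + FN) / total       # 0.12 (= 1 - accuracy)
precision         <- TP / (TP + FP)          # 0.45
sensitivity       <- TP / (TP + FN)          # 0.90 (recall)
specificity       <- TN / (TN + FP)          # 0.877..., roughly 0.88

c(accuracy = accuracy, misclassification = misclassification,
  precision = precision, sensitivity = sensitivity, specificity = specificity)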
Therefore, the KNN model built for the prediction of prostate cancer among men based on their medical reports for test 1 and test 2 (hypothetical; the data are not provided), with the outcome "has prostate cancer" or "does not have prostate cancer", is 88% accurate.
Interpretation of the Results
• A confusion matrix can help us evaluate the performance of our models; it provides statistics such as Accuracy, Precision, Sensitivity, etc., and one or more of these can be used to evaluate the model.
• In the given example (prostate cancer), it can be seen that there is a high incidence of false positives (55 out of the 45 + 55 = 100 predicted positives). Therefore, precision, given by TP / (TP + FP), is just 45%. This means that we are falsely predicting prostate cancer 55% of the time. Our model is thus NOT precise.
Command in R to generate the Confusion Matrix

confusionMatrix(data = m1, factor(fibre_test_target))

1. We specify the predicted data and the actual data as the arguments (confusionMatrix() is provided by the caret package).
2. Both datasets should be of factor type; convert them with factor() if they are not.

conf_matrix <- confusionMatrix(data = m1, factor(fibre_test_target))
Assign the output of the CM to a variable.

print(conf_matrix)
Print the confusion matrix to evaluate the model based on its statistics.
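Beyond print(), individual statistics can be pulled out of the object returned by caret's confusionMatrix(); a brief sketch, assuming conf_matrix was created as above (caret stores the table in $table, overall statistics such as Accuracy in $overall, and per-class statistics in $byClass):

conf_matrix$table                  # the confusion matrix itself
conf_matrix$overall["Accuracy"]    # overall accuracy of the KNN model
conf_matrix$byClass                # per-class sensitivity, specificity, precision, ...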
R programme for predicting the fibre type (cotton, silk, wool) using the KNN model

Result of the confusion matrix for the KNN model: the model's accuracy is 1 (i.e. 100%).