Cross Validation
In machine learning, we cannot simply fit the model on the training data and claim that it
will work accurately on real data. We must make sure that our model has learned the
correct patterns from the data and is not picking up too much noise. For this purpose, we
use the cross-validation technique.
Cross-Validation
Cross-validation is a technique in which we train our model using a subset of the dataset
and then evaluate it using the complementary subset of the dataset.
The three steps involved in cross-validation are as follows:
1. Reserve some portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.
For example, in K-Fold cross-validation:
Total instances: 25
Value of k: 5
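A minimal sketch of this setup with scikit-learn is given below (25 instances, k = 5); the toy data and the choice of LogisticRegression are assumptions made purely for illustration.

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Toy data: 25 instances with 2 features and a binary label (illustrative only).
rng = np.random.RandomState(0)
X = rng.rand(25, 2)
y = rng.randint(0, 2, size=25)

# k = 5 folds: each fold holds out 5 of the 25 instances for evaluation.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Train on 20 instances and evaluate on the complementary 5, for each of the 5 splits.
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=kf)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())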
This is a list of rates that are often computed from a confusion matrix for
a binary classifier:
Accuracy: Overall, how often is the classifier correct?
o (TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate: Overall, how often is it wrong?
o (FP+FN)/total = (10+5)/165 = 0.09
o equivalent to 1 minus Accuracy
o also known as "Error Rate"
True Positive Rate: When it's actually yes, how often does it
predict yes?
o TP/actual yes = 100/105 = 0.95
o also known as "Sensitivity" or "Recall"
False Positive Rate: When it's actually no, how often does it
predict yes?
o FP/actual no = 10/60 = 0.17
True Negative Rate: When it's actually no, how often does it
predict no?
o TN/actual no = 50/60 = 0.83
o equivalent to 1 minus False Positive Rate
o also known as "Specificity"
Precision: When it predicts yes, how often is it correct?
o TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur in
our sample?
o actual yes/total = 105/165 = 0.64
Here,
• Class 1 : Positive
• Class 2 : Negative
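As a quick check, the rates above can be recomputed from the raw counts used in the example (TP = 100, TN = 50, FP = 10, FN = 5); the short Python sketch below simply repeats that arithmetic.

# Counts from the example confusion matrix in the text.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN           # 165

accuracy = (TP + TN) / total        # 0.91
error_rate = (FP + FN) / total      # 0.09
recall_tpr = TP / (TP + FN)         # 0.95 (Sensitivity / Recall)
fpr = FP / (FP + TN)                # 0.17
specificity_tnr = TN / (TN + FP)    # 0.83
precision = TP / (TP + FP)          # 0.91
prevalence = (TP + FN) / total      # 0.64

print(accuracy, error_rate, recall_tpr, fpr, specificity_tnr, precision, prevalence)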
However, there are problems with accuracy. It assumes equal costs for both kinds of
errors. A 99% accuracy can be excellent, good, mediocre, poor or terrible depending
upon the problem.
Recall:
Recall can be defined as the ratio of the total number of correctly classified positive
examples divided by the total number of positive examples. High Recall indicates that the
class is correctly recognized (a small number of FN).
Recall is given by the relation:
Recall = TP / (TP + FN)
Precision:
To get the value of precision, we divide the total number of correctly classified positive
examples by the total number of predicted positive examples. High Precision indicates
that an example labeled as positive is indeed positive (a small number of FP).
Precision is given by the relation:
Precision = TP / (TP + FP)
High recall, low precision: This means that most of the positive examples are correctly
recognized (low FN), but there are a lot of false positives (high FP).
Low recall, high precision: This means that we miss a lot of positive examples (high
FN), but those we predict as positive are indeed positive (low FP).
F-measure:
Since we have two measures (Precision and Recall), it helps to have a single measurement
that represents both of them. We calculate the F-measure, which uses the Harmonic Mean in
place of the Arithmetic Mean because it punishes extreme values more.
F-measure = (2 * Precision * Recall) / (Precision + Recall)
The F-measure will always be nearer to the smaller of Precision and Recall.
Let's consider an example in which we have infinitely many data elements of class B and
a single element of class A, and the model predicts class A for every instance in the
test data.
Here,
Precision : 0.0
Recall : 1.0
Now:
Arithmetic mean: 0.5
Harmonic mean: 0.0
Taking the arithmetic mean, the model would appear to be 50% correct, despite this being
the worst possible outcome! Taking the harmonic mean, the F-measure is 0.
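The same comparison takes only a couple of lines of Python; the precision of 0.0 and recall of 1.0 are taken from the example above.

# Values from the degenerate example above.
precision, recall = 0.0, 1.0

arithmetic_mean = (precision + recall) / 2                        # 0.5
harmonic_mean = (2 * precision * recall) / (precision + recall)   # 0.0 (the F-measure)

print("Arithmetic mean:", arithmetic_mean)
print("Harmonic mean (F-measure):", harmonic_mean)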
Example to interpret confusion matrix:
For the simplification of the above confusion matrix, I have added all the terms (TP, FP,
etc.) and the row and column totals in the following image:
Now,
Classification Rate/Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 50) / (100 + 5 + 10 + 50) = 0.91
Recall: Recall gives us an idea of how often the classifier predicts yes when it is
actually yes.
Recall = TP / (TP + FN) = 100 / (100 + 5) = 0.95
Precision: Precision tells us, when the classifier predicts yes, how often it is correct.
Precision = TP / (TP + FP) = 100 / (100 + 10) = 0.91
F-measure:
F-measure = (2 * Recall * Precision) / (Recall + Precision) = (2 * 0.95 * 0.91) / (0.95 + 0.91) ≈ 0.93
Here is a Python script which demonstrates how to create a confusion matrix for a
predicted model. For this, we have to import the confusion_matrix function from the
sklearn.metrics module, which helps us to generate the confusion matrix.
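The script itself is not reproduced above, so the following is a minimal sketch of what it could look like; the expected and predicted label lists are assumptions chosen only to produce a small 2x2 matrix.

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Actual (expected) and predicted labels; 1 = positive, 0 = negative.
expected  = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes.
results = confusion_matrix(expected, predicted)
print(results)

# The same matrix summarised by the metrics discussed above.
print("Accuracy :", accuracy_score(expected, predicted))
print("Precision:", precision_score(expected, predicted))
print("Recall   :", recall_score(expected, predicted))
print("F-measure:", f1_score(expected, predicted))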