
Cross Validation in Machine Learning

In machine learning, we cannot simply fit a model to the training data and claim that it will work accurately on real data. We must make sure that the model has learned the correct patterns from the data and has not picked up too much noise. For this purpose, we use the cross-validation technique.
Cross-Validation

Cross-validation is a technique in which we train our model using a subset of the dataset and then evaluate it using the complementary subset of the dataset.
The three steps involved in cross-validation are as follows:

1. Reserve some portion of the sample dataset.
2. Train the model using the rest of the dataset.
3. Test the model using the reserved portion of the dataset.
Methods of Cross Validation
Validation
In this method, we perform training on 50% of the given dataset, and the remaining 50% is used for testing. The major drawback of this method is that because we train on only 50% of the dataset, the remaining 50% may contain important information that the model never sees during training, i.e. higher bias.
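
A minimal sketch of this 50/50 hold-out validation, assuming scikit-learn with the iris dataset and a logistic-regression model purely for illustration:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Illustrative dataset and model
    X, y = load_iris(return_X_y=True)

    # Hold out 50% of the data for testing, train on the other 50%
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.5, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Validation accuracy:", model.score(X_val, y_val))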
LOOCV (Leave One Out Cross Validation)
In this method, we train on the whole dataset except for a single data point, test on that left-out point, and repeat this for each data point in turn. It has advantages as well as disadvantages.
An advantage of this method is that we make use of all data points, hence it has low bias.
The major drawback of this method is that it leads to higher variation in the testing estimate, since we test against a single data point each time. If that data point is an outlier, it can lead to higher variation. Another drawback is that it takes a lot of execution time, since it iterates as many times as there are data points.
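
A minimal LOOCV sketch using scikit-learn's LeaveOneOut (again with an illustrative dataset and model); note that it fits the model once per data point, which is what makes the method slow:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    # Illustrative dataset and model
    X, y = load_iris(return_X_y=True)

    # One train/test split per data point: n model fits for n samples
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=LeaveOneOut())
    print("Iterations:", len(scores))
    print("Mean accuracy:", scores.mean())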
K-Fold Cross Validation
In this method, we split the dataset into k subsets (known as folds), then we train on k-1 of the subsets and hold out the remaining one for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.
Note:
A value of k = 10 is commonly suggested, as a lower value of k takes us towards the simple validation approach, while a higher value of k leads towards the LOOCV method.
Example
The listing below shows the training and evaluation subsets generated in k-fold cross-validation. Here, we have 25 instances in total. In the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [0-4] for testing and [5-24] for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining four subsets for training (instances [5-9] for testing and [0-4] plus [10-24] for training), and so on.

Total instances: 25
Value of k : 5

Iteration   Training set observations                                         Testing set observations
1           [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]    [ 0  1  2  3  4]
2           [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]    [ 5  6  7  8  9]
3           [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 23 24]    [10 11 12 13 14]
4           [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23 24]    [15 16 17 18 19]
5           [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]    [20 21 22 23 24]
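
The splits above can be reproduced with scikit-learn's KFold; without shuffling, the folds are consecutive blocks of indices:

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(25)       # 25 instances, indexed 0 to 24
    kf = KFold(n_splits=5)  # k = 5; shuffle=False keeps folds consecutive

    for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
        print(i, "Training:", train_idx, "Testing:", test_idx)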
Comparison of train/test split to cross-validation

Advantages of train/test split:

1. It runs K times faster than K-fold cross-validation, because K-fold cross-validation repeats the train/test split K times.
2. It is simpler to examine the detailed results of the testing process.
Advantages of cross-validation:
1. More accurate estimate of out-of-sample accuracy.
2. More “efficient” use of data as every observation is used for both training and testing.
Confusion matrix terminology
A confusion matrix is a table that is often used to describe the
performance of a classification model (or "classifier") on a set of test
data for which the true values are known. The confusion matrix itself is
relatively simple to understand, but the related terminology can be
confusing.
I wanted to create a "quick reference guide" for confusion matrix
terminology because I couldn't find an existing resource that suited my
requirements: compact in presentation, using numbers instead of
arbitrary variables, and explained both in terms of formulas and
sentences.
Let's start with an example confusion matrix for a binary
classifier (though it can easily be extended to the case of more than two
classes):
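
(The original image is not reproduced in this copy; the matrix below is reconstructed from the counts described next.)

n = 165          Predicted: NO    Predicted: YES
Actual: NO            50                10
Actual: YES            5               100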

What can we learn from this matrix?


• There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.
• The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
• Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60 patients do not.
Let's now define the most basic terms, which are whole numbers (not
rates):
• true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
• true negatives (TN): We predicted no, and they don't have the disease.
• false positives (FP): We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
• false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a "Type II error.")
I've added these terms to the confusion matrix, and also added the row
and column totals:
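
(Reconstructed from the counts above:)

n = 165          Predicted: NO    Predicted: YES    Total
Actual: NO          TN = 50          FP = 10          60
Actual: YES         FN = 5           TP = 100        105
Total                 55               110           165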

This is a list of rates that are often computed from a confusion matrix for
a binary classifier:
• Accuracy: Overall, how often is the classifier correct?
  (TP + TN) / total = (100 + 50) / 165 = 0.91
• Misclassification Rate: Overall, how often is it wrong?
  (FP + FN) / total = (10 + 5) / 165 = 0.09
  Equivalent to 1 minus Accuracy; also known as "Error Rate".
• True Positive Rate: When it's actually yes, how often does it predict yes?
  TP / actual yes = 100 / 105 = 0.95
  Also known as "Sensitivity" or "Recall".
• False Positive Rate: When it's actually no, how often does it predict yes?
  FP / actual no = 10 / 60 = 0.17
• True Negative Rate: When it's actually no, how often does it predict no?
  TN / actual no = 50 / 60 = 0.83
  Equivalent to 1 minus False Positive Rate; also known as "Specificity".
• Precision: When it predicts yes, how often is it correct?
  TP / predicted yes = 100 / 110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample?
  actual yes / total = 105 / 165 = 0.64
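
As a quick check, the rates above can be computed directly from the four counts in Python:

    # Counts from the example confusion matrix
    TP, TN, FP, FN = 100, 50, 10, 5
    total = TP + TN + FP + FN          # 165

    accuracy = (TP + TN) / total       # 0.91
    error_rate = (FP + FN) / total     # 0.09
    tpr = TP / (TP + FN)               # Sensitivity / Recall: 0.95
    fpr = FP / (FP + TN)               # False Positive Rate: 0.17
    tnr = TN / (FP + TN)               # Specificity: 0.83
    precision = TP / (TP + FP)         # 0.91
    prevalence = (TP + FN) / total     # 0.64

    print(accuracy, error_rate, tpr, fpr, tnr, precision, prevalence)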

Confusion Matrix in Machine Learning


In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix is also known as an error matrix.
A confusion matrix is a table that is often used to describe the performance of a
classification model (or “classifier”) on a set of test data for which the true values are
known. It allows the visualization of the performance of an algorithm.
It allows easy identification of confusion between classes e.g. one class is commonly
mislabeled as the other. Most performance measures are computed from the confusion
matrix.
This article aims at:
1. What the confusion matrix is and why you need to use it.
2. How to calculate a confusion matrix for a 2-class classification problem from
scratch.
3. How to create a confusion matrix in Python.
Confusion Matrix:
A confusion matrix is a summary of prediction results on a classification problem.
The number of correct and incorrect predictions are summarized with count values and
broken down by each class. This is the key to the confusion matrix.
The confusion matrix shows the ways in which your classification model is confused
when it makes predictions.
It gives us insight not only into the errors being made by a classifier but more
importantly the types of errors that are being made.

Here,
• Class 1 : Positive
• Class 2 : Negative
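
(The original figure is not reproduced in this copy; its layout, reconstructed from the definitions below, is:)

                    Predicted Class 1    Predicted Class 2
Actual Class 1             TP                   FN
Actual Class 2             FP                   TN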

Definition of the Terms:


• Positive (P) : Observation is positive (for example: is an apple).
• Negative (N) : Observation is not positive (for example: is not an apple).
• True Positive (TP) : Observation is positive, and is predicted to be positive.
• False Negative (FN) : Observation is positive, but is predicted negative.
• True Negative (TN) : Observation is negative, and is predicted to be negative.
• False Positive (FP) : Observation is negative, but is predicted positive.
Classification Rate/Accuracy:
Classification Rate or Accuracy is given by the relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

However, there are problems with accuracy. It assumes equal costs for both kinds of
errors. A 99% accuracy can be excellent, good, mediocre, poor or terrible depending
upon the problem.
Recall:
Recall can be defined as the ratio of the total number of correctly classified positive examples divided by the total number of actual positive examples. High recall indicates that the class is correctly recognized (a small number of FN).
Recall is given by the relation:

Recall = TP / (TP + FN)

Precision:
To get the value of precision, we divide the total number of correctly classified positive examples by the total number of predicted positive examples. High precision indicates that an example labeled as positive is indeed positive (a small number of FP).
Precision is given by the relation:

Precision = TP / (TP + FP)

High recall, low precision: This means that most of the positive examples are correctly recognized (low FN), but there are a lot of false positives (high FP).
Low recall, high precision: This means that we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).
F-measure:
Since we have two measures (Precision and Recall), it helps to have a single measurement that represents both of them. We calculate the F-measure, which uses the harmonic mean in place of the arithmetic mean, as the harmonic mean punishes extreme values more.
The F-measure is given by the relation:

F-measure = (2 * Recall * Precision) / (Recall + Precision)

The F-measure will always be nearer to the smaller value of Precision or Recall.

Let's consider an example in which we have an effectively unlimited number of data elements of class B and a single element of class A, and the model predicts class A for every instance in the test data.
Here,
Precision : 0.0
Recall : 1.0
Now:
Arithmetic mean: 0.5
Harmonic mean: 0.0
When taking the arithmetic mean, the model would appear 50% correct, despite this being the worst possible outcome! When taking the harmonic mean instead, the F-measure is 0.
Example to interpret a confusion matrix:

For simplification, consider the same example matrix as in the terminology section above, with all the terms (TP, FP, FN, TN) and the row and column totals added: TP = 100, FN = 5, FP = 10 and TN = 50.

Now,
Classification Rate/Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 50) / (100 + 5 + 10 + 50) = 0.91

Recall: Recall gives us an idea of how often the model predicts yes when the answer is actually yes.
Recall = TP / (TP + FN) = 100 / (100 + 5) = 0.95
Precision: Precision tells us how often the model is correct when it predicts yes.
Precision = TP / (TP + FP) = 100 / (100 + 10) = 0.91
F-measure:
F-measure = (2 * Recall * Precision) / (Recall + Precision) = (2 * 0.95 * 0.91) / (0.95 + 0.91) = 0.93
Below is a Python script which demonstrates how to create a confusion matrix for a set of predictions. For this, we import the confusion_matrix module from the sklearn library, which helps us generate the confusion matrix.
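
The original script is not shown in this copy; below is a minimal sketch, with hypothetical true and predicted labels chosen purely for illustration:

    from sklearn.metrics import confusion_matrix

    # Hypothetical true labels and model predictions for a binary problem
    actual    = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
    predicted = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0]

    # Rows correspond to actual classes, columns to predicted classes
    results = confusion_matrix(actual, predicted)
    print(results)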
