
Cross-Validation in Machine Learning

Cross-validation is a technique for validating a model's effectiveness by training it on a subset of the input data and testing it on a previously unseen subset of that data. In other words, it is a technique for checking how well a statistical model generalizes to an independent dataset.

In machine learning, we always need to test the stability of a model, and we cannot judge that from the training dataset alone. For this purpose, we reserve a particular sample of the dataset that was not part of the training data, and we test the model on that sample before deployment. This complete process comes under cross-validation, and it goes beyond a single, general train-test split.

Hence, the basic steps of cross-validation are:

o Reserve a subset of the dataset as a validation set.

o Train the model using the training dataset.

o Evaluate model performance using the validation set. If the model performs
well on the validation set, proceed to the next step; otherwise, investigate the issues (see the sketch after this list).
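
Below is a minimal Python sketch of these hold-out steps using scikit-learn; the dataset, model, and the 80:20 split ratio are illustrative assumptions rather than part of the original text.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: reserve a subset of the dataset as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 2: train the model using only the training subset.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 3: evaluate performance on the reserved validation set.
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")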

Methods used for Cross-Validation

1. Leave One Out Cross-Validation

In this method, instead of dividing the data into two fixed subsets, we select a single observation as the test data, label everything else as training data, and train the model. The second observation is then selected as the test data and the model is retrained on the remaining data.

This process continues n times, and the average over all these iterations is taken as the estimate of the test set error.
When it comes to test-error estimates, LOOCV gives unbiased estimates (low bias). But
bias is not the only matter of concern in estimation problems. We should also consider
variance.

However, LOOCV has extremely high variance, because we are averaging the outputs of n models fitted on almost identical sets of observations, so their outputs are highly positively correlated with each other.
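
As a rough illustration, the following sketch uses scikit-learn's LeaveOneOut splitter; the regression dataset and the plain linear model are assumptions chosen only to keep the example small.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)

# One observation is held out per iteration, so the model is fitted n times.
loo = LeaveOneOut()
scores = cross_val_score(LinearRegression(), X, y,
                         cv=loo, scoring="neg_mean_squared_error")

# Averaging over all n iterations gives the LOOCV estimate of the test error.
print("LOOCV MSE estimate:", -np.mean(scores))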

2. K-Fold Cross-Validation

In this resampling technique, the whole dataset is divided into k sets of almost equal size. The first set is selected as the test set and the model is trained on the remaining k-1 sets; the test error rate is then calculated by evaluating the fitted model on that test set.

In the second iteration, the second set is selected as the test set, the remaining k-1 sets are used to train the model, and the error is calculated again. This process continues for all k sets.

The mean of errors from all the iterations is calculated as the CV test error estimate.

In K-Fold CV, the number of folds k is less than the number of observations in the data (k < n), and we average the outputs of k fitted models that are less correlated with each other, since the overlap between the training sets used for each model is smaller. This leads to lower variance than LOOCV.
The best part about this method is that each data point appears in the test set exactly once and is part of the training set k-1 times. This method gives intermediate bias: each training set contains fewer observations, (k-1)n/k, than in the Leave One Out method but more than in the Hold Out method, and this bias shrinks as the number of folds k increases.

Typically, K-fold Cross Validation is performed using k=5 or k=10 as these values have been
empirically shown to yield test error estimates that neither have high bias nor high
variance.

The major disadvantage of this method is that the model has to be trained from scratch k times, which is more computationally expensive than the Hold Out method, though still cheaper than the Leave One Out method.
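
A minimal 5-fold sketch in the same style (again, the dataset and model are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# k = 5 folds of roughly equal size; each fold serves as the test set once.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kfold, scoring="neg_mean_squared_error")

# The CV test error estimate is the mean of the k per-fold errors.
print("Per-fold MSE:", -scores)
print("5-fold CV MSE estimate:", -np.mean(scores))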

3. Stratified K-Fold Cross-Validation

This is a slight variation of K-Fold Cross-Validation that uses 'stratified sampling' instead of 'random sampling.'

Let's quickly understand what stratified sampling is and how it differs from random sampling.

Suppose your data contains reviews for a cosmetic product used by both male and female customers. When we use random sampling to split the data into train and test sets, there is a possibility that most of the data representing males does not appear in the training data and instead ends up in the test data. If we train the model on a training sample that does not correctly represent the actual population, the model will not predict the test data with good accuracy.

This is where Stratified Sampling comes to the rescue. Here the data is split in such a way
that it represents all the classes from the population.

Let's consider the above example, with cosmetic product reviews from 1000 customers, of which 60% are female and 40% are male. Say we want to split the data into train and test sets in an 80:20 proportion. The 800 training customers (80% of 1000) are chosen so that 480 reviews come from the female population and 320 from the male population. In the same fashion, the remaining 200 customers (20% of 1000) form the test data, with the same female-to-male representation.
This is exactly what stratified K-Fold CV does: it creates the k folds while preserving the percentage of samples for each class, as the sketch below illustrates. This solves the random-sampling problem associated with the Hold Out and K-Fold methods.
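
The sketch below mimics the review example with a synthetic 60:40 label split and scikit-learn's StratifiedKFold; the data itself is made up purely for illustration.

import numpy as np
from sklearn.model_selection import StratifiedKFold

# 1000 synthetic "customers": 600 female (label 0) and 400 male (label 1).
y = np.array([0] * 600 + [1] * 400)
X = np.random.rand(1000, 3)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each fold preserves the 60:40 class proportion in both splits.
    train_ratio = np.bincount(y[train_idx]) / len(train_idx)
    test_ratio = np.bincount(y[test_idx]) / len(test_idx)
    print(f"Fold {fold}: train {train_ratio}, test {test_ratio}")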

4. Leave-P-Out Cross-Validation

In this approach, p data points are left out of the training data. This means that if there are n data points in total in the original input dataset, then n-p data points are used as the training set and the remaining p data points as the validation set. This process is repeated for every possible combination of p points, and the average error is calculated to judge the effectiveness of the model.

The disadvantage of this technique is that it can become computationally infeasible for large p, since the number of possible combinations grows very quickly.
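
To make that combinatorial growth concrete, here is a tiny sketch with scikit-learn's LeavePOut and p=2 on just five made-up points (any realistic dataset would produce far too many splits to print):

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(10).reshape(5, 2)  # only n = 5 points, so C(5, 2) = 10 splits
y = np.array([0, 1, 0, 1, 0])

lpo = LeavePOut(p=2)
print("Number of splits:", lpo.get_n_splits(X))  # 10
for train_idx, test_idx in lpo.split(X):
    # In every split, n - p = 3 points train the model and p = 2 validate it.
    print("train:", train_idx, "test:", test_idx)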

Limitations of Cross-Validation

There are some limitations of the cross-validation technique, which are given below:

o Under ideal conditions, it provides an optimal estimate. But with inconsistent
data it can produce drastically misleading results, which is one of the big disadvantages of cross-validation, as there is no certainty about the kind of data a machine learning model will face.

o In predictive modeling, the data evolves over time, which can create
differences between the training and validation sets. For example, if we create a model to predict stock market values and train it on the previous 5 years of stock prices, the actual values over the next 5 years may be drastically different, so it is difficult to expect correct output in such situations.

Applications of Cross-Validation

o This technique can be used to compare the performance of different predictive modeling methods.

o It has great scope in the medical research field.

o It can also be used for meta-analysis, and it is already being used by data
scientists in the field of medical statistics.
