Cross Validation - What, Why and How | Machine Learning

What is Cross Validation, why is it used, and what are the different types?

Ashwin Prasad
Analytics Vidhya


Cross Validation Cycle

What is Cross Validation and Why do we Need it ?

In a supervised machine learning problem, we usually train the model on a dataset and use the trained model to predict the target, given new predictor values.
But how do we know whether the model we have trained will produce effective and accurate results on new input data?
We cannot conclude that the model has performed well based on the error rates or statistical measures (such as the R-squared statistic) computed on the very dataset the model was trained on.
The main problem is that the training error alone cannot tell us whether the model has high bias (underfitting) or high variance (overfitting), or how well the model will perform on new data.
When it comes to predictive modeling, it is the duty of the data scientist to ensure that the model is going to perform well on new data.
Cross Validation is a process that helps us do exactly this.

It is the process by which a machine learning model is evaluated on a separate set, known as the validation set or hold-out set, in order to find the best hyper-parameters. This gives us the optimal model: one that can be used on future data and is capable of yielding the best possible predictions.

One way of doing this is to split our dataset into 3 parts: Training Set, Validation or Hold-Out set and the Test Set.

Before going further, familiarity with concepts such as bias and variance is required.

Training, Validation and Testing Split

Training Set: The part of the dataset on which the model is trained.

Validation Set: The trained model is then used on this set to predict the targets, and the loss is noted. The result is compared with the training set results to check for overfitting or underfitting, and this is repeated until an optimal result is produced.
Essentially, we train the model on the training set, but the hyper-parameters are repeatedly updated and the model retrained until it performs best on the validation set. (By "performs best" we mean that the validation set loss is minimised while overfitting is also prevented.)

Test Set: The fully trained model, after being tuned using the validation or hold-out set, is used on the test set to get the true test error. This error can be treated as a very good estimate of how the model will perform on any new data.
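For concreteness, here is a minimal sketch of such a three-way split in Python with scikit-learn. The 60/20/20 proportions and the synthetic regression data are illustrative assumptions, not part of the method itself.

```python
# Minimal sketch: splitting a dataset into training, validation and test sets.
# The proportions (60/20/20) and the synthetic data are assumptions for illustration.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# First carve out the test set (20%), then split the rest into train and validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200
```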

Why Shouldn’t the Test Set Be Used More Than Once?

Why shouldn’t the test set be used more than once? Why do we have to use separate sets for cross-validation and testing? Why can’t we use the same set of data for both?

All these questions are essentially the same and have a common answer:
Using the test set more than once will eventually lead to bias (not the bias we usually refer to in machine learning), because the hyper-parameters end up being adjusted for optimal performance on the test set. In that case, we can no longer use the test set to estimate how well the model will perform on new, real-life data.
The validation set, on the other hand, is meant to be used multiple times: the model is evaluated on it repeatedly to find the best hyper-parameters, the best configuration is chosen, and only that final model is used on the test set.

Types of Cross-Validation

There are 3 main types of cross validation techniques:

  • The Standard Validation Set Approach
  • The Leave One Out Cross Validation (LOOCV)
  • K-fold Cross Validation

In all the above methods, the dataset is split into a training set, a validation set and a testing set. We will mostly be discussing the training and validation sets, as the usage of the testing set is common to all 3 methods and has already been described above.

Note: In all the approaches below, the test set has already been split off and kept separate. It will not be mentioned much, as we concentrate on the training set and the validation set.

Standard Validation Set Approach

Train — Validation Split

This is a very simple, standard and commonly used approach. We randomly split the original dataset into a training set and a validation set in some chosen proportion.

The model is trained on the training set, evaluated on the validation set, and retrained on the training set with different hyper-parameters. This is repeated until we find the hyper-parameters that reduce the validation set loss without leading to overfitting.

The model hyper-parameters that produce the best results on the validation set are chosen, and the model’s performance is then measured on the separate test set.
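Continuing from the split sketched earlier, a rough illustration of this loop might look as follows. Ridge regression and the small grid of alpha values are assumptions chosen purely for the example; any model and hyper-parameter would work the same way.

```python
# Sketch of the validation set approach: tune a hyper-parameter on the
# validation set, then evaluate the chosen model once on the test set.
# (Ridge and the alpha grid are illustrative assumptions.)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

best_alpha, best_val_error = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)              # train on the training set
    val_error = mean_squared_error(y_val, model.predict(X_val))   # evaluate on the validation set
    if val_error < best_val_error:
        best_alpha, best_val_error = alpha, val_error

# Only the chosen model is evaluated, once, on the held-out test set.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_error = mean_squared_error(y_test, final_model.predict(X_test))
print(best_alpha, test_error)
```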

Disadvantages of this approach

  • Because the data is split randomly into training and validation sets, the validation error estimate can be highly variable: which observations end up in which set can noticeably affect the fitted model and the estimate.
  • We know that the more data the model is given for training, the better. But because of this split, the amount of data available for training is reduced.

Leave One Out Cross Validation

Leave one out Cross Validation

This method tries to overcome the disadvantages of the previous method and it takes an iterative approach.

First Iteration
In the first iteration, we use only the observation (x1, y1) as the entire validation set and train the model on the remaining n-1 observations.
Here, the validation set error E1 is calculated as E1 = (h(x1) - y1)², where h(x1) is the model’s prediction for x1.

Second Iteration
We leave out (x2, y2) as the validation set and train the model on the remaining n-1 observations. E2 is calculated in the same way as in the first iteration, but using the validation data of the second iteration.

The process is repeated for n iterations, until each and every observation in the training set has been the validation set exactly once.
This gives us a set of n error estimates {E1, E2, E3, ..., En}.

The total validation error is then calculated by taking the mean of the n error estimates above,
i.e.,
(E1 + E2 + E3 + ... + En) / n

This is how the validation set error is calculated in this method. Because all observations except one are used for training, more data is given to the model, and the error estimate is also not highly variable, since we take the mean of n error estimates.
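As a rough sketch, the whole procedure can be expressed with scikit-learn’s LeaveOneOut splitter, again assuming the Ridge model and the training data from the earlier examples:

```python
# Sketch of LOOCV: each observation serves as the validation set exactly once,
# and the final estimate is the mean of the n squared errors.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import mean_squared_error

errors = []
for train_idx, val_idx in LeaveOneOut().split(X_train):
    model = Ridge(alpha=1.0).fit(X_train[train_idx], y_train[train_idx])
    # Each E_i is the squared error on the single held-out observation.
    errors.append(mean_squared_error(y_train[val_idx], model.predict(X_train[val_idx])))

loocv_error = np.mean(errors)  # (E1 + E2 + ... + En) / n
print(loocv_error)
```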

Disadvantages of this approach
This method is considered computationally expensive, as we have to fit the model n times just to get the validation error estimate once.

K-fold Cross Validation

K fold Cross Validation

K-fold cross validation aims to solve this computational problem by reducing the number of times the model needs to be trained in order to calculate the validation error.

It is very similar to the LOOCV approach, except that the dataset is divided into K folds of roughly n/K observations each. In each iteration, one fold is held out as the validation set, the model is trained on the remaining K-1 folds, and the validation error is calculated on the held-out fold. Finally, the mean of these K error estimates is taken as the validation set error. By doing this, the number of times the model has to be trained is reduced from n to K.

Usually, this number K is chosen to be something like 5 or 10, but it depends on the dataset size. If we have a lot of data, a bigger value of K can be used.
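A minimal sketch with scikit-learn’s KFold and cross_val_score, where K = 5 and the Ridge model are assumptions for illustration:

```python
# Sketch of 5-fold cross validation: the model is trained K = 5 times,
# once per fold, and the fold errors are averaged.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(Ridge(alpha=1.0), X_train, y_train,
                         cv=kfold, scoring="neg_mean_squared_error")
cv_error = -np.mean(scores)  # mean validation error across the 5 folds
print(cv_error)
```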

Conclusion

Often, in machine learning, we don’t want the model or algorithm that performs best on the training data. Rather, we need a model that performs best on the test set, one that we can expect to perform well on new input data.
Cross validation is a very important process that helps make sure we are able to find such an algorithm or model.

Thank You
