What is Cross-Validation in Machine Learning?

In Machine Learning, Cross-validation is a statistical method of evaluating generalization performance that is more stable and thorough than using a division of dataset into a training and test set. In this article, I’ll walk you through what cross-validation is and how to use it for machine learning using the Python programming language.

What is Cross-Validation?

In cross-validation, the data is instead split multiple times and multiple models are trained. The most commonly used version of cross-validation is k-times cross-validation, where k is a user-specified number, usually 5 or 10.

Also, Read – Machine Learning Full Course for free.

In five-way cross-validation, the data is first partitioned into five parts of (approximately) equal size, called folds. Then, a sequence of models is formed. The first model is trained using the first fold as a test set, and the remaining folds (2–5) are used as a training set.

The model is built using data from folds 2 to 5, then the precision is evaluated on fold 1. Then another model is built, this time using fold 2 as the test set and the data from folds 1, 3, 4 and 5 as a training set.

This process is repeated using folds 3, 4 and 5 as test sets. For each of these five divisions of the data into training and testing sets, we calculate the precision. In the end, we collected five precision values.

Implementation Of Cross-Validation with Python

We can easily implement the process of Cross-validation with Python programming language by using the Scikit-learn library in Python.

Cross-validation is implemented in scikit-learn using the cross_val_score function of the model_selection module. The parameters of the cross_val_score function are the model we want to evaluate, the training data, and the ground truth labels. Let’s evaluate LogisticRegression on the iris dataset:

from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression()

scores = cross_val_score(logreg, iris.data, iris.target)
print("Cross-validation scores: {}".format(scores))
Code language: Python (python)
Output: Cross-validation scores: [ 0.961 0.922 0.958]

By default, cross_val_score performs triple cross-validation, returning three precision values. We can modify the number of folds used by modifying the cv parameter:

scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
print("Cross-validation scores: {}".format(scores))
Code language: Python (python)
Output: Cross-validation scores: [ 1. 0.967 0.933 0.9 1. ]

A common way to summarize the precision of cross-validation is to calculate the mean:

print("Average cross-validation score: {:.2f}".format(scores.mean()))Code language: Python (python)
Output: Average cross-validation score: 0.96

Benefits & Drawbacks of Using Cross-Validation

There are several advantages to using cross-validation instead of a single division into one training and one set of tests. First of all, remember that train_test_split performs a random division of data.

Imagine that we are “lucky” at randomly splitting the data, and all the hard-to-categorize examples end up in the training set. In this case, the test set will only contain “simple” examples, and the accuracy of our test set will be unrealistic.

Conversely, if we are “unlucky” we may have randomly placed all of the hard-to-rank examples in the test set and therefore have an unrealistic score.

However, when using cross-validation, each example will be in the test set exactly once: each example is in one of the folds, and each fold is the test set once. Therefore, the model must generalize well to all samples in the dataset for all cross-validation scores (and their mean) to be high.

Having multiple splits of the data also provides information about the sensitivity of our model to the selection of the training data set. For the iris dataset, we saw accuracies between 90% and 100%. That’s quite a range, and it gives us an idea of ​​how the model might work in the worst-case scenario and the best-case scenario when applied to new data.

Another advantage of cross-validation over using a single data division is that we use our data more efficiently. When using train_test_split, we typically use 75% of the data for training and 25% of the data for evaluation.

When using five-fold cross-validation, on each iteration we can use four-fifths of the data (80%) to fit the model. When using 10 cross-validations, we can use the nine-tenths of the data (90%) to fit the model. More data will generally result in more accurate models.

The main disadvantage is the increase in computational costs. Since we are currently training k models instead of a single model, the cross-validation will be about k times slower than doing a single division of the data.

Conclusion

Using the mean cross-validation, we can conclude that we expect the model to be around 96% accurate on average. Looking at the five scores produced by the five-fold cross-validation, we can also conclude that there is a relatively high variance in precision between folds, ranging from 100% precision to 90% precision.

This could imply that the model is very dependent on the particular folds used for training, but it could also simply be a consequence of the small size of the data set.

I hope you liked this article on what is Cross-validation, its implementation using Python and its benefits & drawbacks. Feel free to ask your valuable questions in the comments section below.

Follow Us:

Aman Kharwal
Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.

Articles: 1498

Leave a Reply