Testing and Validation
The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to put your model in production and monitor how well it performs. This works, but if your model is horribly bad, your users will complain, so it is not the best idea.
A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error, and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.
If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data.
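The gap between training error and test error can be checked directly. The sketch below is an illustration only: it assumes a synthetic dataset with deliberately noisy labels (`make_classification` with `flip_y`) and an unconstrained decision tree, neither of which is used elsewhere in this article. They are chosen here because such a tree tends to memorize its training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels assigned at random (assumption for illustration)
X_noisy, y_noisy = make_classification(
    n_samples=500, n_features=20, flip_y=0.2, random_state=0)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noisy, y_noisy, test_size=0.2, random_state=0)

# A tree with no depth limit can fit the training set almost perfectly
tree = DecisionTreeClassifier(random_state=0).fit(Xn_train, yn_train)

train_error = 1 - tree.score(Xn_train, yn_train)
test_error = 1 - tree.score(Xn_test, yn_test)  # estimate of the generalization error

print(f"training error: {train_error:.2f}")
print(f"test error:     {test_error:.2f}")
```

A training error near zero combined with a clearly higher test error is the overfitting pattern described above.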
The Ratio for Splitting the Data into Training, Testing and Validation
It is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the dataset: if it contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error.
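The arithmetic behind this can be sketched in a few lines. The helper names `holdout_size` and `error_rate_std` are hypothetical, and the 5% error rate is an assumed value used only for comparison. The standard-error formula also assumes that test errors are independent, so treat the numbers as a rough guide rather than a guarantee.

```python
import math

def holdout_size(n_instances, test_fraction):
    """Number of instances held out for testing."""
    return round(n_instances * test_fraction)

def error_rate_std(error_rate, n_test):
    """Binomial standard error of an error rate measured on n_test instances."""
    return math.sqrt(error_rate * (1 - error_rate) / n_test)

for n_instances, fraction in [(1_000, 0.20), (10_000_000, 0.01)]:
    n_test = holdout_size(n_instances, fraction)
    # 0.05 is a hypothetical true error rate, used only to compare the two cases
    spread = error_rate_std(0.05, n_test)
    print(f"{n_instances:>10,} instances, {fraction:.0%} held out -> "
          f"{n_test:,} test instances, std error ~ {spread:.4f}")
```

The larger the test set, the smaller the spread of the estimate, which is why a 1% holdout can be enough on a very large dataset.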
Splitting the Data into Training, Testing and Validation
Now let’s start by splitting the data into a training set and a test set. We can easily do this with Scikit-Learn’s train_test_split function:
```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Load the iris features (X) and labels (y)
X, y = datasets.load_iris(return_X_y=True)

# Hold out 20% of the instances as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train on the training set only, then score on the held-out test set
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
```
Here X is the full set of features in the dataset, and y holds the corresponding true labels. The code above splits the data into 80 percent for training and 20 percent for testing, trains a linear SVM on the training portion, and reports its accuracy on the test portion.
The validation set is used to evaluate candidate hyperparameters. The model is never trained on this data; it is only scored on it. A data scientist uses the validation results to adjust higher-level hyperparameters, which keeps the test set untouched for the final estimate of the generalization error.
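One way to obtain an explicit validation set is to call train_test_split twice. The following is a minimal sketch, not a prescribed recipe: the 60/20/20 proportions, the candidate values for C, and the variable names are all assumptions made for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# First hold out 20% as the test set, then 25% of the remainder as the
# validation set, giving a 60/20/20 split overall (assumed proportions)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Score each candidate value of C on the validation set only
candidate_C = [0.01, 0.1, 1, 10]
val_scores = {}
for C in candidate_C:
    model = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)

best_C = max(val_scores, key=val_scores.get)

# Retrain with the chosen C on train + validation data, then use the test set once
final_model = svm.SVC(kernel='linear', C=best_C).fit(X_rest, y_rest)
test_score = final_model.score(X_test, y_test)
print(f"best C: {best_C}, test accuracy: {test_score:.2f}")
```

Because the test set plays no part in choosing C, its score remains an honest estimate of the generalization error.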
Rather than setting aside a single fixed validation set, we can use cross-validation: the data is split into several folds, and each fold in turn serves as the validation set while the model is trained on the remaining folds. Let’s see how to do this:
```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: one accuracy score per fold
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
scores
```
array([0.96666667, 1. , 0.96666667, 0.96666667, 1. ])
```python
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```
Accuracy: 0.98 (+/- 0.03)
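To make the mechanics of cross-validation more concrete, the loop below reproduces the five fold scores by hand. It is a sketch that assumes stratified, unshuffled folds (`StratifiedKFold`), which is what Scikit-Learn uses for a classifier when `cv` is an integer; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Train a fresh copy of the model on four folds...
    fold_model = clone(clf).fit(X[train_idx], y[train_idx])
    # ...and score it on the fold that was left out
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

print(manual_scores)
# Check that the manual loop agrees with cross_val_score
print(np.allclose(manual_scores, cross_val_score(clf, X, y, cv=5)))
```

Every instance is used for validation exactly once, so no separate validation set has to be carved out of the data.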
As you can see, the model achieves high accuracy both on the held-out test set and across the cross-validation folds.