Evaluating a model is simple enough: just use a test set. But suppose you are hesitating between two types of models (say, a linear model and a polynomial model). How can you decide between them? One option is to train both and compare how well they generalize using the test set.
How to Choose Hyperparameters
Now suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting. The question is, how do you choose the value of the regularization hyperparameter? One option is to train 100 different models using 100 distinct values for this hyperparameter.
Suppose you find the hyperparameter value that produces the model with the lowest generalization error (say, just 5% error). You launch this model into production, but unfortunately it does not perform as well as expected and produces 15% error. What just happened?
Model Selection and Hyperparameter Tuning
The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that particular set. This means that the model is unlikely to perform as well on new data.
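The effect can be seen in a deliberately contrived sketch. It assumes 100 hypothetical "models" that have no real skill (each one just makes a fixed pseudo-random guess per example) and a small synthetic test set of 50 coin-flip labels; the names `predict` and `accuracy` are made up for this illustration. Picking the candidate that scores best on the test set makes it look better than chance, yet on fresh data it falls back to roughly 50% accuracy.

```python
import random

def predict(candidate_id, example_id):
    # A "model" with no real skill: a fixed pseudo-random guess per example.
    # (Hypothetical stand-in for a model trained with one hyperparameter value.)
    return random.Random(f"{candidate_id}:{example_id}").random() < 0.5

def accuracy(candidate_id, examples):
    # Fraction of examples whose label matches the candidate's guess.
    hits = sum(predict(candidate_id, ex_id) == label for ex_id, label in examples)
    return hits / len(examples)

# Synthetic data: labels are coin flips, so no model can truly beat 50%.
rng = random.Random(0)
test_set = [(i, rng.random() < 0.5) for i in range(50)]
fresh_data = [(i, rng.random() < 0.5) for i in range(50, 5050)]

# Measure all 100 candidates on the SAME test set and keep the best one.
test_scores = {c: accuracy(c, test_set) for c in range(100)}
best = max(test_scores, key=test_scores.get)
best_test_acc = test_scores[best]

# The selected candidate's score on data it was never selected on.
fresh_acc = accuracy(best, fresh_data)

print(f"best accuracy on the test set: {best_test_acc:.2f}")
print(f"same candidate on fresh data:  {fresh_acc:.2f}")
```

The gap between the two printed numbers comes purely from selecting on the test set, since none of the candidates has any skill.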
A standard solution to this problem is called holdout validation: you hold out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the validation set (or sometimes the development set, or dev set). More specifically, you train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set), and you select the model that performs best on the validation set. After this holdout validation process, you train the best model on the full training set (including the validation set), and this gives you the final model. Lastly, you evaluate this final model on the test set to get an estimate of the generalization error.
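The holdout procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes a one-feature ridge regression (an L2 penalty `alpha` on the slope), synthetic data, and hypothetical helper names (`fit_ridge_1d`, `mse`, `holdout_select`). In practice you would probably rely on a machine learning library for the model itself.

```python
import random

def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ w*x + b with an L2 penalty alpha on w (intercept unpenalized)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / (sxx + alpha)
    return w, mean_y - w * mean_x

def mse(model, xs, ys):
    """Mean squared error of a (w, b) model on the given points."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def holdout_select(train, alphas, val_fraction=0.2, seed=0):
    """Pick alpha on a held-out validation set, then retrain on all of train."""
    rows = list(train)
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_fraction))
    val, reduced = rows[:n_val], rows[n_val:]
    rx, ry = zip(*reduced)
    vx, vy = zip(*val)
    # Train one candidate per alpha on the reduced training set,
    # and score each one on the validation set only.
    scores = {a: mse(fit_ridge_1d(rx, ry, a), vx, vy) for a in alphas}
    best_alpha = min(scores, key=scores.get)
    # Retrain the winner on the full training set (validation set included).
    fx, fy = zip(*rows)
    return best_alpha, fit_ridge_1d(fx, fy, best_alpha), scores

# Synthetic data (an assumption for this sketch): y = 3x + 1 plus noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(-2.0, 2.0)
    data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 1.0)))
train, test = data[:160], data[160:]

alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
best_alpha, final_model, val_scores = holdout_select(train, alphas)

# The test set is touched exactly once, at the very end.
tx, ty = zip(*test)
test_mse = mse(final_model, tx, ty)
print(f"selected alpha: {best_alpha}, test MSE: {test_mse:.3f}")
```

Note that the test set plays no part in choosing `alpha`, which is what keeps the final error estimate honest.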
Problems You Will Face in Model Selection
This solution usually works quite well. However, if the validation set is too small, then model evaluations will be imprecise: you may end up selecting a suboptimal model by mistake. Conversely, if the validation set is too large, then the remaining training set will be much smaller than the full training set. Why is this bad? Well, since the final model will be trained on the full training set, it is not ideal to compare candidate models trained on a much smaller training set. It would be like selecting the fastest sprinter to participate in a marathon.
One way to solve this problem is to perform repeated cross-validation, using many small validation sets. Each model is evaluated once per validation set after it is trained on the rest of the data. By averaging out all the evaluations of a model, you get a much more accurate measure of its performance. There is a drawback, however: the training time is multiplied by the number of validation sets.
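Repeated cross-validation can be sketched as follows. As before, this is only an illustration under stated assumptions: a one-feature ridge regression, synthetic data, and hypothetical helper names (`fit_ridge_1d`, `mse`, `repeated_cv_score`). Each repeat reshuffles the data and splits it into `n_folds` small validation sets, and the function also returns the number of training runs so the cost is visible.

```python
import random

def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ w*x + b with an L2 penalty alpha on w (intercept unpenalized)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / (sxx + alpha)
    return w, mean_y - w * mean_x

def mse(model, xs, ys):
    """Mean squared error of a (w, b) model on the given points."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def repeated_cv_score(rows, alpha, n_folds=5, n_repeats=3, seed=0):
    """Average validation MSE over n_repeats reshuffled n_folds-way splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = list(rows)
        rng.shuffle(shuffled)
        for k in range(n_folds):
            # Fold k is the validation set; everything else is training data.
            val = shuffled[k::n_folds]
            rest = [r for i, r in enumerate(shuffled) if i % n_folds != k]
            rx, ry = zip(*rest)
            vx, vy = zip(*val)
            scores.append(mse(fit_ridge_1d(rx, ry, alpha), vx, vy))
    # Also return how many models were trained for this single alpha.
    return sum(scores) / len(scores), len(scores)

# Synthetic data (an assumption for this sketch): y = 3x + 1 plus noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(-2.0, 2.0)
    data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 1.0)))

alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_results = {a: repeated_cv_score(data, a) for a in alphas}
best_alpha = min(cv_results, key=lambda a: cv_results[a][0])
n_fits = cv_results[best_alpha][1]
print(f"selected alpha: {best_alpha} ({n_fits} training runs per candidate)")
```

With 5 folds and 3 repeats, every candidate value is trained 15 times instead of once, which is the training-time cost mentioned above.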