In machine learning, model validation is simple in principle: after choosing a model and its hyperparameters, we can estimate how well it performs by applying it to some of the training data and comparing its predictions to the known values.
In this article, I’ll introduce a naive approach to model validation and explain why it fails, before exploring the use of holdout sets (called exclusion sets here) and cross-validation for more robust model evaluation.
Model validation the wrong way
I will start by demonstrating the naive approach to validation using the Iris data. Let’s begin by loading the data:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
Next, we need to choose a model and hyperparameters. Here, I’ll use a k-neighbors classifier with n_neighbors = 1. It’s a very simple and intuitive model:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)
Next, we train the model and use it to predict the labels of the data we already know:
model.fit(X, y)
y_model = model.predict(X)
Then as the final step, we calculate the fraction of correctly labelled points:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_model)
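As a sanity check on what accuracy_score reports, the same fraction can be computed by hand with NumPy. This is a minimal sketch that repeats the steps above so that it runs on its own; the name manual_accuracy is introduced here for illustration and is not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)
y_model = model.predict(X)

# Fraction of points whose predicted label equals the known label
# (manual_accuracy is an illustrative name, not from the article)
manual_accuracy = np.mean(y_model == y)

# Both numbers should agree, since accuracy is defined as this fraction
print(manual_accuracy, accuracy_score(y, y_model))
```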
We see an accuracy of 1.0, which means that 100% of the points were correctly labelled by the model. But does this measure the expected accuracy? Have we really found a model that we expect to be correct 100% of the time?
As you may have guessed, the answer is no. This approach has a fundamental flaw: it trains and evaluates the model on the same data. Moreover, the nearest neighbor model is an instance-based estimator that simply stores the training data and predicts labels by comparing new data to those stored points. When it is evaluated on its own training data, each point’s nearest stored point is itself, so except in contrived cases it will get an accuracy of 100% every time.
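To see why a 1-nearest-neighbor model scores perfectly on its own training data, one can ask the fitted model how far each training point is from its nearest stored point. The sketch below assumes the same Iris setup as above and uses the kneighbors method of KNeighborsClassifier; the names distances, indices and nearest_labels are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)

# For every training point, look up its single nearest stored point
distances, indices = model.kneighbors(X, n_neighbors=1)

# Each query point is itself in the stored data, so the nearest
# stored point sits at (numerically) zero distance
print("largest nearest-neighbor distance:", distances.max())

# With one neighbor, the prediction is just the label of that stored point
nearest_labels = y[indices[:, 0]]
print("matches model.predict:", np.array_equal(nearest_labels, model.predict(X)))
```

Because the closest match is always the point itself (or an exact duplicate of it), the training-set score tells us nothing about how the model behaves on data it has not stored.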
Model validation the right way
So what can be done? A better sense of a model’s performance can be found by using what is called a holdout set (referred to here as an exclusion set): that is, we hold back a subset of the data from the training of the model, and then use this exclusion set to check the model’s performance. This splitting can be done using the train_test_split utility in Scikit-Learn:
# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split

# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0, train_size=0.5)

# fit the model on one set of data
model.fit(X1, y1)

# evaluate the model on the second set of data
y2_model = model.predict(X2)
accuracy_score(y2, y2_model)
Here we see a more reasonable result: the nearest neighbor classifier is about 90% accurate on this exclusion (holdout) set. The exclusion set is similar to unknown data because the model has not “seen” it before.
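The introduction also mentions cross-validation. A single 50/50 split trains on only half of the data, and one natural extension is to repeat the evaluation so that every point is held out exactly once. The following is a hedged sketch using Scikit-Learn’s cross_val_score; the choice of cv=5 is an assumption made here for illustration, not something specified in this article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

model = KNeighborsClassifier(n_neighbors=1)

# Each of the 5 folds serves once as the held-out set while the
# model is trained on the remaining 4 folds (cv=5 is an assumption)
scores = cross_val_score(model, X, y, cv=5)

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Averaging the per-fold scores gives an estimate that depends less on how any one split happened to fall.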
I hope you liked this article on model validation in Machine Learning. Feel free to ask questions in the comments section below.