By the end of this article, you will understand the concepts of overfitting and underfitting in machine learning, and you will be able to apply these concepts to train your machine learning models more accurately.
What are Overfitting and Underfitting in Machine Learning?
Let’s say you are visiting a foreign country and the taxi driver scams you. You might be tempted to say that all the taxi drivers in this country are thieves. Overgeneralization is something we humans do too often, and unfortunately, machines can fall into the same trap if we’re not careful. In machine learning, this is called overfitting: it means that the model works well on the training data, but it does not generalize well.
Overfitting occurs when the model is too complex for the amount and noise of the training data. Here are the possible solutions:
- Simplify the model by selecting a model with fewer parameters (for example, a linear model rather than a high degree polynomial model), reducing the number of attributes in the training data, or constraining the model.
- Collect more training data.
- Reduce noise in training data (for example, correct data errors and remove outliers).
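To make the "constrain the model" option concrete, here is a minimal sketch on synthetic data (the dataset, the polynomial degree, and the `alpha` value are illustrative choices, not from this article). A high-degree polynomial fitted by plain least squares hugs the noisy training points, while adding Ridge regularization constrains the same model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 30)).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=0.5, size=30)  # noisy linear data

# A degree-15 polynomial is too complex for 30 noisy points: it chases the noise.
complex_model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
complex_model.fit(X, y)

# Constraining the same model with Ridge regularization shrinks the coefficients
# and keeps it from memorizing the noise.
constrained_model = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))
constrained_model.fit(X, y)
```

The unconstrained model will always score at least as well on the training set, which is exactly the point: a perfect training fit is a warning sign, not a goal.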
As you can guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data.
Here are the main options for solving the problem of underfitting:
- Select a more powerful model, with more parameters.
- Bring better features to the learning algorithm (feature engineering).
- Reduce the constraints on the model (for example, reduce the regularization hyperparameter).
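The last option can be sketched in a few lines. Assuming Ridge regression purely as an illustrative model (the article does not prescribe one), an over-large regularization hyperparameter shrinks the coefficients toward zero and underfits, while reducing it lets the model learn the underlying trend:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, (100, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.3, size=100)  # true slope is 3.0

# A heavily constrained model underfits: its coefficient is shrunk toward 0.
underfit = Ridge(alpha=1e4).fit(X, y)

# Relaxing the constraint lets the model recover the true slope.
better = Ridge(alpha=1.0).fit(X, y)
```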
While training a machine learning model, what we care most about is the model's performance on new data, which we can estimate from a validation set. The idea is to strike a balance between overfitting and underfitting.
Handling Overfitting and Underfitting
Let's walk through a case study to see how to handle overfitting and underfitting in practice, using a decision tree model.
There are a few alternatives for controlling the depth of a tree, and many of them allow some routes through the tree to go deeper than others. But the max_leaf_nodes argument provides a very convenient way to trade off overfitting against underfitting: the more leaves we allow the model, the further we move from the underfitting region toward the overfitting region.
Now let's see how to tackle overfitting and underfitting in code. I'll use a utility function to compare the MAE scores of models trained with different values of max_leaf_nodes:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
The dataset I am using here can be easily downloaded from here. Now I will load the data into train_X, val_X, train_y and val_y:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
melbourne_data = pd.read_csv("melb_data.csv")
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]
# Split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
We can now use a for loop to compare the validation MAE of models built with different values for max_leaf_nodes:
# Compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))
Max leaf nodes: 5        Mean Absolute Error: 347380
Max leaf nodes: 50       Mean Absolute Error: 258171
Max leaf nodes: 500      Mean Absolute Error: 243495
Max leaf nodes: 5000     Mean Absolute Error: 254983
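Of the four values tried, 500 leaves gives the lowest validation MAE, so it strikes the best balance here: 5 and 50 underfit, while 5000 starts to overfit. A natural next step is to pick the winning value programmatically and retrain a final model on all the data. Here is a sketch of that step, using synthetic data from make_regression as a self-contained stand-in for the Melbourne file:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Melbourne data, so this sketch runs on its own.
X, y = make_regression(n_samples=2000, n_features=7, noise=20.0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Score each candidate and keep the value with the lowest validation MAE.
candidates = [5, 50, 500, 5000]
scores = {n: get_mae(n, train_X, val_X, train_y, val_y) for n in candidates}
best_tree_size = min(scores, key=scores.get)

# Retrain a final model on all the data (training + validation combined),
# now that max_leaf_nodes has been chosen.
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(np.concatenate([train_X, val_X]), np.concatenate([train_y, val_y]))
```

Once the hyperparameter is fixed, folding the validation data back into training is safe, since we no longer need it to make a choice.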
I hope you liked this article on the concepts of Overfitting and Underfitting in Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.