Overfitting and Underfitting in Machine Learning

By the end of this article, you will understand what overfitting and underfitting mean in machine learning, and you will be able to apply these concepts to train models that generalize better to new data.

What are Overfitting and Underfitting in Machine Learning?


Let’s say you are visiting a foreign country and the taxi driver scams you. You might be tempted to say that all the taxi drivers in this country are thieves. Overgeneralization is something we humans do too often, and unfortunately, machines can fall into the same trap if we’re not careful. In machine learning, this is called overfitting: the model performs well on the training data, but it does not generalize well to new, unseen data.


Overfitting occurs when the model is too complex relative to the amount and noisiness of the training data. Here are the possible solutions:

  • Simplify the model by selecting one with fewer parameters (for example, a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.
  • Collect more training data.
  • Reduce noise in training data (for example, correct data errors and remove outliers).
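
To make the first remedy concrete, here is a minimal sketch on synthetic data. The data, the noise level, and the helper name mae_for_degree are assumptions for illustration and are not part of the case study later in this article. It fits a straight line and a degree-9 polynomial to the same 12 noisy points. The high-degree fit will usually show a lower training error but a worse validation error, which is the typical signature of overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): a linear trend plus Gaussian noise.
x_train = np.linspace(-1, 1, 12)
y_train = 2.0 * x_train + rng.normal(scale=0.3, size=x_train.shape)
x_val = np.linspace(-0.95, 0.95, 50)
y_val = 2.0 * x_val + rng.normal(scale=0.3, size=x_val.shape)

def mae_for_degree(degree):
    """Fit a polynomial of the given degree; return (train MAE, validation MAE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mae = np.mean(np.abs(np.polyval(coeffs, x_train) - y_train))
    val_mae = np.mean(np.abs(np.polyval(coeffs, x_val) - y_val))
    return train_mae, val_mae

# Compare a simple model (degree 1) with a very flexible one (degree 9).
for degree in (1, 9):
    train_mae, val_mae = mae_for_degree(degree)
    print(f"degree={degree}  train MAE={train_mae:.3f}  validation MAE={val_mae:.3f}")
```

The exact numbers depend on the random seed, so treat the comparison between the two rows as the point, not the values themselves.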


As you can guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data.

Here are the main options for solving the problem of underfitting:

  • Select a more powerful model, with more parameters.
  • Bring better features to the learning algorithm (feature engineering).
  • Reduce the constraints on the model (for example, reduce the regularization hyperparameter).
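
The second option, feature engineering, can be sketched with a small example. The data below is synthetic and the helper fit_and_score is a hypothetical name introduced only for this illustration. The target has a quadratic structure, so a model that only sees x is too simple and underfits, while adding the engineered feature x² lets the same least-squares fit capture the pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y depends on x squared, plus noise.
x = np.linspace(-3, 3, 200)
y = x ** 2 + rng.normal(scale=0.5, size=x.shape)

def fit_and_score(features):
    """Least-squares fit on an intercept plus the given features; return training MAE."""
    design = np.column_stack([np.ones_like(x)] + features)
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.mean(np.abs(design @ weights - y))

mae_linear = fit_and_score([x])              # straight line only: too simple
mae_quadratic = fit_and_score([x, x ** 2])   # with the engineered feature x^2

print(f"linear features:    MAE = {mae_linear:.3f}")
print(f"quadratic features: MAE = {mae_quadratic:.3f}")
```

Note that the underfit model has a high error even on its own training data, which is how underfitting usually differs from overfitting in practice.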

When training a machine learning model, what we care about most is how accurately the trained model performs on new data, which we can estimate using a validation set. The idea is to strike a balance between overfitting and underfitting.
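
As a rough sketch of why a validation set matters, the example below trains an unconstrained decision tree on synthetic data (the data is an assumption made purely for illustration). The tree memorizes the training rows, so its training error says almost nothing about new data. Only the error on the held-out validation split gives a usable estimate.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (assumed for illustration): a sine wave plus noise.
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

# Hold out part of the data; the model never sees it during training.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# No limit on depth or leaves, so the tree can fit every training row exactly.
model = DecisionTreeRegressor(random_state=0).fit(train_X, train_y)

train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(f"train MAE = {train_mae:.3f}   validation MAE = {val_mae:.3f}")
```

A large gap between the two numbers is a common warning sign of overfitting.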

Handling Overfitting and Underfitting 

I will consider a case study to take you through how we can practically handle Overfitting and Underfitting with Machine Learning.

Case Study:

[Figure: diagram of the underfitting and overfitting regions]

There are a few alternatives for controlling the depth of a decision tree, and many of them allow some routes through the tree to be deeper than others. The max_leaf_nodes argument, however, provides a very sensible way to control overfitting versus underfitting. The more leaves you allow the model to make, the further you move from the underfitting region of the diagram above toward the overfitting region.
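
To see what max_leaf_nodes actually limits, here is a small sketch on synthetic data (the data and the chosen caps are assumptions for illustration). It uses scikit-learn's get_n_leaves() and get_depth() methods to inspect each fitted tree: the number of leaves never exceeds the cap, while the depth is free to vary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (assumed for illustration).
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

leaf_counts = {}
for cap in (2, 8, 64):
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0).fit(X, y)
    leaf_counts[cap] = tree.get_n_leaves()
    print(f"max_leaf_nodes={cap}  leaves={leaf_counts[cap]}  depth={tree.get_depth()}")
```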

Now let’s see how we can handle overfitting and underfitting in code. I’ll be using a utility function to help compare the MAE (mean absolute error) scores for different values of max_leaf_nodes:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree limited to max_leaf_nodes leaves on the training split
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    # Score the tree on the held-out validation split
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

The dataset I am using here is the Melbourne housing data. Now I will load it and split it into train_X, val_X, train_y, and val_y:

import pandas as pd
# Load data (adjust the path to wherever you saved the dataset file)
melbourne_data = pd.read_csv("melb_data")
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

We can now use a for loop to compare the validation MAE of models built with different values of max_leaf_nodes (a lower MAE is better):

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))
Max leaf nodes: 5  		 Mean Absolute Error:  347380
Max leaf nodes: 50  		 Mean Absolute Error:  258171
Max leaf nodes: 500  		 Mean Absolute Error:  243495
Max leaf nodes: 5000  		 Mean Absolute Error:  254983
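
Reading the output above: with only 5 leaf nodes the error is highest, which suggests underfitting. It drops to its minimum at 500 and rises again at 5000, which suggests the tree has started to overfit. Of the values tried, 500 is therefore the best choice. As a small sketch, the selection can also be done in code. The dictionary below simply copies the MAE values printed above, and the names scores and best_tree_size are introduced here for illustration.

```python
# Validation MAE for each value of max_leaf_nodes, copied from the output above.
scores = {5: 347380, 50: 258171, 500: 243495, 5000: 254983}

# Pick the tree size with the lowest validation error.
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)  # prints 500
```

Once the tree size is chosen, a reasonable next step is to refit a DecisionTreeRegressor with max_leaf_nodes=best_tree_size on all of the data (X and y), since the validation split is no longer needed for model selection.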


I hope you liked this article on the concepts of Overfitting and Underfitting in Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.


Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.
