XGBoost, short for Extreme Gradient Boosting, is an implementation of gradient boosting: a machine learning technique that goes through cycles to iteratively add models to an ensemble. In this article, I will take you through the XGBoost algorithm in Machine Learning.
The cycle of the XGBoost algorithm begins by initializing the ensemble with a single model, whose predictions can be quite naive.
The Process of the XGBoost Algorithm:
- First, we use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add up the predictions of all the models in the ensemble.
- These predictions are used to calculate a loss function (for example, mean squared error).
- Then we use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine the parameters of the new model so that adding it to the ensemble reduces the loss.
- Finally, we add the new model to the ensemble, and then repeat (a minimal sketch of this cycle follows below).
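To make the cycle concrete, here is a minimal sketch of the boosting loop written with plain scikit-learn decision trees and a squared-error loss. It is only an illustration of the idea under simplified assumptions; XGBoost itself uses regularized trees, second-order gradient information and many other refinements.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_sketch(X, y, n_rounds=100, learning_rate=0.1):
    # Start the ensemble with a naive model: predict the mean of y everywhere
    y = np.asarray(y, dtype=float)
    prediction = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is just the residual
        residuals = y - prediction
        # Fit a new model to the residuals so that adding it reduces the loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        # Add the new model's (scaled) predictions to the ensemble and repeat
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees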
XGBoost Algorithm in Action
I’ll start by loading the training and validation data into X_train, X_valid, y_train and y_valid. The dataset I am using here (melb_data.csv) can be easily downloaded from here.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('melb_data.csv')
# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
# Select target
y = data.Price
# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
Now you will learn how to use the XGBoost algorithm. We need to import the scikit-learn API for XGBoost (xgboost.XGBRegressor), which allows us to build and fit a model just as we would in scikit-learn. As you will see in the output, the XGBRegressor class has many tunable parameters:
from xgboost import XGBRegressor
my_model = XGBRegressor()
my_model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
Now, we need to make predictions and evaluate our model:
from sklearn.metrics import mean_absolute_error
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
Mean Absolute Error: 280355.04334039026
Parameter Tuning
XGBoost has a few parameters that can drastically affect the accuracy of the model and the speed of training. The first parameter you need to understand is:
n_estimators
n_estimators specifies how many times to go through the modelling cycle described above. It is equal to the number of models we include in the ensemble.
- Too low a value results in underfitting, leading to inaccurate predictions on both training data and test data.
- Too high a value results in overfitting, which gives accurate predictions on training data but inaccurate predictions on test data (which is what we care about).
Typical values lie between 100 and 1000, although this depends a lot on the learning_rate parameter described below. Here is the code to set the number of models in the ensemble:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=500, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
early_stopping_rounds
early_stopping_rounds provides a way to automatically find the ideal value for n_estimators. Early stopping causes model training to stop when the validation score stops improving, even if we have not yet reached the hard stop set by n_estimators. It’s a good idea to set n_estimators high and then use early_stopping_rounds to find the optimal time to stop iterating.
Since random chance sometimes causes a single round where validation scores do not improve, you need to specify how many consecutive rounds of deterioration to allow before stopping. Setting early_stopping_rounds=5 is a reasonable choice: in this case, we stop after 5 consecutive rounds of deteriorating validation scores. When using early stopping, we also need to supply some data for calculating the validation scores, which is done with the eval_set parameter. Now let’s see how we can use early_stopping_rounds:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=500, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
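If you want to see which boosting round early stopping settled on, the fitted model typically exposes it as an attribute; the exact attribute names can vary between XGBoost versions, so treat this as a version-dependent sketch:
# Attribute names may vary by XGBoost version; they are set after fitting with early stopping
print(my_model.best_iteration)  # index of the best boosting round on the validation data
print(my_model.best_score)      # best validation score reached during training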
learning_rate
Instead of getting predictions by simply adding up the predictions of each component model, we can multiply the predictions of each model by a small number (the learning rate) before adding them in.
This means that each tree we add to the ensemble helps us a little less, so we can set a higher value for n_estimators without overfitting. If we use early stopping, the appropriate number of trees will be determined automatically. Now, let’s see how we can use learning_rate in the XGBoost algorithm:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.05, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
n_jobs
On larger datasets where runtime is a consideration, you can use parallelism to build your models faster. It is common to set the n_jobs parameter equal to the number of cores on your machine. On smaller datasets, this won’t help.
The resulting model will not be any better, so micro-optimizing the fitting time is usually just a distraction. But it is very useful on large datasets where you would otherwise spend a long time waiting for the fit command to finish. Now, let’s see how to use this parameter in the XGBoost algorithm:
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.05, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=1000, n_jobs=4, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
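To check whether the tuning actually helped, you can re-evaluate this final model on the validation set in the same way as before:
# Re-evaluate the tuned model on the validation data and compare its MAE with the earlier result
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid, predictions)))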
XGBoost is a leading software library for working with standard tabular data, and with careful fine-tuning of its parameters, you can train very accurate models. I hope you liked this article on the XGBoost algorithm in Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.