XGBoost or Gradient Boosting is a machine learning algorithm that goes through cycles to iteratively add models to a set. In this article, I will take you through the XGBoost algorithm in Machine Learning.

The cycle of the XGBoost algorithm begins by initializing the whole with a unique model, the predictions of which can be quite naive.

**Also, Read – Overfitting and Underfitting in Machine Learning.**

The Process of XGBoost Algorithm:

- First, we use the current set to generate predictions for each observation in the dataset. To make a prediction, we add the predictions of all the models in the set.
- These predictions are used to calculate a loss function.
- Then we use the loss function to fit a new model which will be added to the set. Specifically, we determine the parameters of the model so that adding this new model to the set reduces the loss.
- Finally, we add the new model to the set, and …
- then repeat!

## XGBoost Algorithm in Action

I’ll start by loading the training and validation data into X_train, X_valid, y_train and y_valid. The dataset, I am using here can be easily downloaded from **here**.

```
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
data = pd.read_csv('melb_data.csv')
# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
# Select target
y = data.Price
# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
```

Code language: PHP (php)

Now, here you will learn how to use the XGBoost algorithm. Here we need to import the scikit-learn API for XGBoost (xgboost.XGBRegressor). This allows us to create and adjust a model like we would in scikit-learn. As you will see in the output, the XGBRegressor class has many adjustable parameters:

```
from xgboost import XGBRegressor
my_model = XGBRegressor()
my_model.fit(X_train, y_train)
```

Code language: JavaScript (javascript)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)

Now, we need to make predictions and evaluate our model:

```
from sklearn.metrics import mean_absolute_error
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
```

Code language: JavaScript (javascript)

Mean Absolute Error: 280355.04334039026

## Parameter Tuning

XGBoost has a few features that can drastically affect the accuracy and speed of training. The first feature you need to understand are:

#### n_estimators

n_estimators specifies the number of times to skip the modelling cycle described above. It is equal to the number of models we include in the set.

- Too low a value results in an underfitting, leading to inaccurate predictions on training data and test data.
- Too high a value results in overfitting, resulting in accurate predictions on training data, but inaccurate predictions on test data (which is important to us).

Typical the values â€‹â€‹lie between 100 to 1000, although it all depends a lot on the learning_rate parameter described below. Here is the code to set the number of models in the set:

```
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)
```

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=500, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)

#### early_stopping_rounds

early_stopping_rounds provides a way to automatically find the ideal value for n_estimators. Stopping early causes the iteration of the model to stop when the validation score stops improving, even though we are not stopping hard for n_estimators. It’s a good idea to set n_estimators high and then use early_stopping_rounds to find the optimal time to stop the iteration.

Since random chance sometimes causes a single round where validation scores do not improve, you must specify a number for the number of direct deterioration turns to allow before stopping. Setting early_stopping_rounds = 5 is a reasonable choice. In this case, we stop after 5 consecutive rounds of deterioration of validation scores. Now let’s see how we can use early_stopping:

```
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
```

Code language: PHP (php)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=500, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)

#### learning_rate

Instead of getting predictions by simply adding up the predictions of each component model, we can multiply the predictions of each model by a small number before adding them.

This means that every tree we add to the set helps us less. So we can set a high value for the n_estimators without overfitting. If we use early shutdown, the appropriate number of trees will be determined automatically. Now, let’s see how we can use learning_rate in XGBoost algorithm:

```
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
```

Code language: PHP (php)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.05, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)

#### n_jobs

On larger datasets where execution is a consideration, you can use parallelism to build your models faster. It is common to set the n_jobs parameter equal to the number of cores on your machine. On smaller data sets, this won’t help.

The resulting model will not be better, so micro-optimizing the timing of the fit is usually just a distraction. But it’s very useful in large datasets where you would spend a lot of time waiting for the fit command. Now, let’s see how to use this parameter in the XGBoost algorithm:

```
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,
early_stopping_rounds=5,
eval_set=[(X_valid, y_valid)],
verbose=False)
```

Code language: PHP (php)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.05, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=1000, n_jobs=4, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)

**Also, Read – The Best Laptop for Machine Learning.**

XGBoost is a leading software library for working with standard tabular data with fine-tuning of parameters, you can train very precise models. I hope you liked this article on the XGBoost algorithm in Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on **Medium** to learn every topic of Machine Learning.