By Aman Kharwal

Bagging and Pasting in Machine Learning

In Machine Learning, one way to build an ensemble is to use the same training algorithm for every prediction model but to train each model on a different random subset of the training set. When the sampling is performed with replacement, the method is called Bagging (short for bootstrap aggregating). When the sampling is performed without replacement, it is called Pasting.

Why Bagging and Pasting?


Generally speaking, both bagging and pasting allow a training instance to be sampled several times across multiple prediction models, but only bagging allows a training instance to be sampled several times for the same prediction model.
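To make the difference concrete, here is a minimal sketch of the two sampling schemes using NumPy only. The tiny 10-instance "training set", the sample sizes and the seed are assumptions chosen for illustration; they are not part of the example used in the rest of this article.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for this illustration
n_train = 10                     # pretend the training set has 10 instances
indices = np.arange(n_train)

# Bagging: sample WITH replacement, so an index may appear more than once
bagging_sample = rng.choice(indices, size=n_train, replace=True)

# Pasting: sample WITHOUT replacement, so every index appears at most once
pasting_sample = rng.choice(indices, size=6, replace=False)

print("Bagging sample:", np.sort(bagging_sample))
print("Pasting sample:", np.sort(pasting_sample))
```

In the bagging sample some indices will usually be repeated while others are missing; in the pasting sample every index is unique.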

I will first set up the environment and prepare a dataset, then we will see how to train an ensemble of models using Bagging and Pasting. So let’s start by importing the necessary libraries:

# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

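# Scikit-Learn ≥0.20 is required
import sklearn
# Compare version numbers as integers; comparing the raw strings is unreliable
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)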

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Now let’s create a dataset and split it into a training set and a test set:

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Note that the predictors can all be trained in parallel, on different CPU cores or even on different servers. In the same way, predictions can also be made in parallel. This is one of the main reasons why bagging and pasting are such popular methods: they scale very well.
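As a rough sketch of what this looks like in practice, scikit-learn's BaggingClassifier (covered in more detail below) accepts an n_jobs parameter, and n_jobs=-1 asks it to use all available CPU cores for both training and prediction. The block repeats the dataset setup so that it runs on its own; the name parallel_bag_clf and the choice of 100 estimators are just assumptions for this illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_jobs=-1 tells scikit-learn to use all available CPU cores
parallel_bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=100,
    max_samples=100, bootstrap=True, n_jobs=-1, random_state=42)
parallel_bag_clf.fit(X_train, y_train)               # trees are trained in parallel
y_pred_parallel = parallel_bag_clf.predict(X_test)   # predictions are made in parallel too
print(y_pred_parallel[:10])
```

On a dataset this small the speed-up is negligible and may even be cancelled out by the overhead of starting the workers; the benefit shows up with larger datasets or more expensive base models.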

Bagging and Pasting in Scikit-Learn

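Scikit-learn provides a simple API for both bagging and pasting through the BaggingClassifier class (or BaggingRegressor for regression). The code below trains an ensemble of 500 Decision Tree classifiers, each on 100 training instances randomly sampled from the training set with replacement (bootstrap=True). To use pasting instead, set bootstrap=False. Now let’s go through the process: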

from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# 500 Decision Trees, each trained on 100 instances sampled with replacement
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.904

# For comparison, train a single Decision Tree on the full training set
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))

0.856
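The example above uses bagging. For completeness, here is a minimal sketch of the pasting variant: the only change is bootstrap=False, so each tree is trained on 100 instances drawn without replacement. The block repeats the dataset setup so that it runs on its own, and the names paste_clf and y_pred_paste are just assumptions for this illustration. I do not show an expected score here because it can vary with the scikit-learn version.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pasting: same ensemble as before, but sampling WITHOUT replacement
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=False, random_state=42)
paste_clf.fit(X_train, y_train)
y_pred_paste = paste_clf.predict(X_test)
paste_accuracy = accuracy_score(y_test, y_pred_paste)
print(paste_accuracy)
```

Because max_samples=100 is much smaller than the training set, every tree still sees a different subset of instances, which is what gives the ensemble its diversity.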

from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.45, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

fig, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)
plt.sca(axes[0])
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)
plt.sca(axes[1])
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)
plt.ylabel("")
plt.show()

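The BaggingClassifier automatically performs soft voting instead of hard voting when the base classifier can estimate class probabilities (that is, when it has a predict_proba() method), which is the case for Decision Tree classifiers. The figure above compares the decision boundary of a single Decision Tree with the decision boundary of a bagging ensemble of 500 trees, both trained on the moons dataset. The ensemble is likely to generalize much better than the single Decision Tree: its decision boundary is smoother and less irregular, and it scored 0.904 on the test set against 0.856 for the single tree.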
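To make the voting step concrete, here is a minimal sketch that recomputes the ensemble's class probabilities by hand. It assumes the default max_features=1.0 (every tree sees all features) and that every tree saw both classes during training, which is practically certain with 100 instances per tree on this balanced dataset. Under those assumptions, averaging predict_proba() over the individual trees should reproduce what the ensemble itself returns. The block repeats the setup so that it runs on its own, and manual_proba is a name I made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)

# Soft voting by hand: average the class probabilities of all the trees
manual_proba = np.mean(
    [tree.predict_proba(X_test) for tree in bag_clf.estimators_], axis=0)

# The hand-made average should match the ensemble's own probabilities
print(np.allclose(manual_proba, bag_clf.predict_proba(X_test)))
```

The predicted class for each instance is then the one with the highest averaged probability.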

Overall, bagging often results in better models than a single predictor. I hope you liked this article on Bagging and Pasting in Machine Learning. Feel free to ask your questions in the comments section below. You can also follow me on Medium to read more articles.
