In Machine Learning, one way to get a diverse set of prediction models is to use the same training algorithm for every model but train each one on a different random subset of the training set. When the sampling is performed with replacement, the method is called Bagging (short for bootstrap aggregating). When the sampling is performed without replacement, it is called Pasting.
Why Bagging and Pasting?
Generally speaking, both bagging and pasting allow a training instance to be sampled several times across multiple prediction models, but only bagging allows a training instance to be sampled several times for the same prediction model.
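The difference between the two sampling schemes is easy to see on a toy example. The sketch below is only an illustration: the training set size, the subset size, and the variable names are made up, and it draws row indices rather than real data.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the draw is repeatable
n_instances = 10                 # hypothetical tiny training set, indices 0..9
subset_size = 6                  # how many instances one predictor receives

# Bagging: sample WITH replacement, so an index may appear more than once.
bagging_idx = rng.choice(n_instances, size=subset_size, replace=True)

# Pasting: sample WITHOUT replacement, so every index appears at most once.
pasting_idx = rng.choice(n_instances, size=subset_size, replace=False)

print("bagging subset:", np.sort(bagging_idx))
print("pasting subset:", np.sort(pasting_idx))
```

Repeating either draw once per predictor gives every predictor its own subset, which is where the diversity of the ensemble comes from.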
Let's see this in practice. I will first prepare a dataset, and then we will see how to apply bagging and pasting to a machine learning algorithm. So let's start by importing the necessary libraries:
```python
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures (Jupyter magic; remove this line in a plain script)
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
```
Now let's generate a dataset and split it into a training set and a test set:
```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

# Two interleaving half circles with some noise: a small non-linear toy problem
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```
Note that the predictors can all be trained in parallel, on different CPU cores or even on different servers. Predictions can be made in parallel in the same way. This is one of the main reasons why bagging and pasting are such important methods in machine learning: they scale very well.
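In scikit-learn, this parallelism is exposed through the `n_jobs` parameter of `BaggingClassifier`, the class used in the next section. The sketch below is a minimal, self-contained illustration that rebuilds the same moons dataset; the name `parallel_bag_clf` and the choice of 100 estimators are only for this example. Setting `n_jobs=-1` asks scikit-learn to use all available CPU cores for both training and prediction.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

parallel_bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,   # illustrative value, smaller than in the main example
    max_samples=100,
    bootstrap=True,
    n_jobs=-1,          # use all available CPU cores
    random_state=42,
)
parallel_bag_clf.fit(X_train, y_train)             # trees are fitted in parallel
y_pred_parallel = parallel_bag_clf.predict(X_test)  # predictions are parallel too
print(y_pred_parallel[:10])
```

On a dataset this small the speed-up is unlikely to be noticeable. The parameter matters more as the number of estimators and the size of the training set grow.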
Bagging and Pasting in Machine Learning
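Scikit-learn provides a simple API for both bagging and pasting through the BaggingClassifier class. The following code trains an ensemble of 500 decision trees, each on 100 training instances randomly sampled from the training set with replacement (bootstrap=True). To use pasting instead, set bootstrap=False. Now let's go through the process: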
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
```
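For comparison, here is the accuracy of a single decision tree trained on the whole training set:

```python
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))
```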
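The article trains the ensemble with bagging only, so as a complement here is a minimal sketch of the pasting variant. It is self-contained and rebuilds the same dataset; the name `paste_clf` and the choice of 100 estimators are only for this example. The only change that matters is `bootstrap=False`, which makes each tree receive 100 distinct training instances.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,   # illustrative value
    max_samples=100,    # each tree sees 100 training instances
    bootstrap=False,    # sample without replacement: this is pasting
    random_state=42,
)
paste_clf.fit(X_train, y_train)

paste_acc = accuracy_score(y_test, paste_clf.predict(X_test))
print(paste_acc)
```

Which of the two variants scores higher depends on the dataset, so it is worth evaluating both on your own data.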
To compare the two models visually, let's plot their decision boundaries side by side:

```python
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.45, -1, 1.5], alpha=0.5, contour=True):
    # Build a grid covering the plotting area: axes = [x1_min, x1_max, x2_min, x2_max]
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    # Predict the class of every grid point and shade the regions accordingly
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    # Overlay the data points, one marker style per class
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

fig, axes = plt.subplots(ncols=2, figsize=(10, 4), sharey=True)
plt.sca(axes[0])  # left panel: the single tree
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree", fontsize=14)
plt.sca(axes[1])  # right panel: the bagging ensemble
plot_decision_boundary(bag_clf, X, y)
plt.title("Decision Trees with Bagging", fontsize=14)
plt.ylabel("")
plt.show()
```
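The BaggingClassifier aggregates the predictions of its individual predictors by voting. When the base classifier can estimate class probabilities, as decision trees can, it automatically performs soft voting, that is, it averages the predicted probabilities instead of counting class votes. The output above compares the decision boundary of a single Decision Tree with the decision boundary of a bagging ensemble of 500 trees, both trained on the moons dataset. The ensemble's boundary is smoother and less irregular, so its predictions will likely generalize better than those of the single Decision Tree.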
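To make the voting step concrete, here is a minimal sketch that contrasts hard voting (the majority of class votes) with soft voting (the average of class probabilities) for a single instance. The three probabilities are made-up numbers for three hypothetical predictors, not the output of the models above.

```python
from statistics import mean

# Made-up probabilities of class 1 from three hypothetical predictors
# for one instance (illustrative numbers, not model output).
proba_class1 = [0.90, 0.45, 0.40]

# Hard voting: each predictor casts one vote for its most likely class.
hard_votes = [1 if p >= 0.5 else 0 for p in proba_class1]
hard_prediction = 1 if sum(hard_votes) > len(hard_votes) / 2 else 0

# Soft voting: average the probabilities, then pick the most likely class.
avg_proba = mean(proba_class1)
soft_prediction = 1 if avg_proba >= 0.5 else 0

print("hard voting ->", hard_prediction)
print("soft voting ->", soft_prediction)
```

With these numbers the two schemes disagree: two of the three predictors lean towards class 0, but the one confident predictor pulls the average probability above 0.5. Soft voting gives more weight to confident predictors.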
Overall, bagging often results in better models than a single predictor. I hope you liked this article on Bagging and Pasting in Machine Learning. Feel free to ask your questions in the comments section below.