Random Forest Algorithm

The Random Forest algorithm is an ensemble of Decision Trees, generally trained via the bagging method. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can simply use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees. In this article, I will take you through the Random Forest algorithm in Machine Learning.

I will train a RandomForestClassifier with 500 trees, using all available CPU cores. But first, let's start by importing the necessary libraries and preparing the data to fit a RandomForestClassifier:

import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Data Preparation

Now, I will load the data, and split it into training and test sets:

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Now I will train a bagging ensemble of decision tree classifiers, which we can later compare with a Random Forest:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16, random_state=42),
    n_estimators=500, max_samples=1.0, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

Random Forest Algorithm in Machine Learning

With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how the trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself. Now let's see how to train it:

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)  # n_jobs=-1 uses all CPU cores
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
np.sum(y_pred == y_pred_rf) / len(y_pred)  # fraction of identical predictions

0.976
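The 0.976 above is the fraction of test set predictions on which the two ensembles agree, not their accuracy. As a quick sketch using scikit-learn's accuracy_score (not part of the original run, so treat the exact numbers as environment-dependent), you can also check each model's accuracy on the test set:

from sklearn.metrics import accuracy_score

# Test accuracy of each ensemble
print(accuracy_score(y_test, y_pred))     # bagging ensemble
print(accuracy_score(y_test, y_pred_rf))  # random forest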

The RandomForestClassifier algorithm introduces extra randomness while growing the trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features.
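The size of that random subset is controlled by the max_features hyperparameter. A minimal sketch (max_features="sqrt" is the default behavior for classification, so this matches the model trained above):

# max_features sets how many features are considered at each split;
# "sqrt" means a random subset of size sqrt(n_features)
rnd_clf_sqrt = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                      max_features="sqrt", random_state=42)
rnd_clf_sqrt.fit(X_train, y_train)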

Another useful quality of Random Forests is that they make it easy to measure the relative importance of each feature; Scikit-Learn computes this automatically and exposes it through the feature_importances_ attribute. Let's look at the feature importances on the iris dataset:

from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
sepal length (cm) 0.11249225099876375
sepal width (cm) 0.02311928828251033
petal length (cm) 0.4410304643639577
petal width (cm) 0.4233579963547682
rnd_clf.feature_importances_

array([0.11249225, 0.02311929, 0.44103046, 0.423358 ])
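Petal length and petal width clearly dominate here, while the sepal features contribute much less. As a small convenience sketch reusing the values computed above, you can also print the features sorted from most to least important:

# Sort features by importance, highest first
for score, name in sorted(zip(rnd_clf.feature_importances_,
                              iris["feature_names"]), reverse=True):
    print(round(score, 3), name)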

Decision Boundary for Random Forest

Now I will plot the decision boundary of our model. A decision boundary is the surface separating the regions that a classifier assigns to each class. For this purpose, I will first create a function to plot it:

from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.45, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.ylabel(r"$x_2$", fontsize=18, rotation=0)

Now let's train 15 decision trees on different bootstrap samples of the training set and overlay their decision boundaries:

plt.figure(figsize=(6, 4))

for i in range(15):
    tree_clf = DecisionTreeClassifier(max_leaf_nodes=16, random_state=42 + i)
    # Draw a bootstrap sample (sampling with replacement) from the training set
    indices_with_replacement = np.random.randint(0, len(X_train), len(X_train))
    tree_clf.fit(X_train[indices_with_replacement], y_train[indices_with_replacement])
    plot_decision_boundary(tree_clf, X, y, axes=[-1.5, 2.45, -1, 1.5], alpha=0.02, contour=False)

plt.show()
[Figure: decision boundaries of 15 decision trees, each trained on a different bootstrap sample]
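Each individual tree's boundary above is jagged and varies from sample to sample. For comparison, here is a sketch that refits a Random Forest on the moons data (note that rnd_clf was last fit on iris above) and plots its averaged, much smoother decision boundary with the same helper function:

rnd_clf_moons = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                       random_state=42)
rnd_clf_moons.fit(X_train, y_train)

plt.figure(figsize=(6, 4))
plot_decision_boundary(rnd_clf_moons, X, y)
plt.show()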

The Random Forest algorithm results in greater tree diversity, which trades a slightly higher bias for a lower variance, generally yielding an overall better model. I hope you liked this article on the Random Forest algorithm in Machine Learning. Feel free to ask your valuable questions in the comments section below.
