Decision Boundary in Machine Learning

The general goal of a classification model is to find a decision boundary. The purpose of a decision boundary is to identify the regions of the input space that correspond to each class. In this article, I will take you through the concept of decision boundary in machine learning.

To explain the concept of decision boundaries in machine learning, I will first create a Logistic Regression model. So now let’s import some libraries and get started with the task:

# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Decision Boundaries with Logistic Regression

I will use the iris dataset to fit a Logistic Regression model. Iris is a very famous dataset among machine learning practitioners for classification tasks. It contains the sepal and petal length and width of 150 iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.
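To make the description of the dataset concrete, here is a small sketch that loads Iris with scikit-learn and prints its shape, feature names, and class counts. The variable names (such as `class_counts`) are only illustrative and are not used elsewhere in this article.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# 150 flowers, 4 measurements each
n_samples, n_features = iris["data"].shape
feature_names = list(iris["feature_names"])
target_names = list(iris["target_names"])

# Number of flowers per species (the targets are encoded as 0, 1, 2)
class_counts = np.bincount(iris["target"])

print(n_samples, n_features)
print(feature_names)
print(target_names)
print(class_counts)
```

The petal length and petal width used later in this article are the third and fourth columns of `iris["data"]`.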

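Now I will build a classification model to detect the Iris virginica type based only on the width of a petal. Logistic Regression passes a linear score through the sigmoid (logistic) function to turn it into a probability between 0 and 1, so let's first plot that function: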

t = np.linspace(-10, 10, 100)
sig = 1 / (1 + np.exp(-t))
plt.figure(figsize=(9, 3))
plt.plot([-10, 10], [0, 0], "k-")
plt.plot([-10, 10], [0.5, 0.5], "k:")
plt.plot([-10, 10], [1, 1], "k:")
plt.plot([0, 0], [-1.1, 1.1], "k-")
plt.plot(t, sig, "b-", linewidth=2, label=r"$\sigma(t) = \frac{1}{1 + e^{-t}}$")
plt.xlabel("t")
plt.legend(loc="upper left", fontsize=20)
plt.axis([-10, 10, -0.1, 1.1])
plt.show()

Now let’s train our Logistic Regression model to frame a decision boundary:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]  # petal width
# 1 if Iris virginica, else 0 (plain int, because np.int was removed in NumPy 1.24)
y = (iris["target"] == 2).astype(int)

log_reg = LogisticRegression(solver="lbfgs", random_state=42)
log_reg.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=42, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

Decision Boundary using One Feature (Petal Width)

Now let’s have a quick look at our trained model’s estimated probabilities for flowers with the petal widths that vary from 0 cm to 3 cm:

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
# First petal width whose estimated probability reaches 50% (taken as a scalar)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]

plt.figure(figsize=(8, 3))
plt.plot(X[y==0], y[y==0], "bs")
plt.plot(X[y==1], y[y==1], "g^")
plt.plot([decision_boundary, decision_boundary], [-1, 2], "k:", linewidth=2)
plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica")
plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris virginica")
plt.text(decision_boundary+0.02, 0.15, "Decision boundary", fontsize=14, color="k", ha="center")
plt.arrow(decision_boundary, 0.08, -0.3, 0, head_width=0.05, head_length=0.1, fc='b', ec='b')
plt.arrow(decision_boundary, 0.92, 0.3, 0, head_width=0.05, head_length=0.1, fc='g', ec='g')
plt.xlabel("Petal width (cm)", fontsize=14)
plt.ylabel("Probability", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 3, -0.02, 1.02])
plt.show()
[Figure: estimated probabilities and decision boundary for petal width]

The output above shows that the petal width of Iris virginica flowers (the triangles) ranges from 1.4 cm to 2.5 cm, while the other iris flowers (the squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8 cm, so the two groups overlap a little. If we use the predict() method rather than predict_proba() to predict the class of a flower, it returns whichever class is the most likely.
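As a rough illustration of the difference between the two methods, the sketch below refits the same one-feature model and compares predict_proba() with predict() on a few petal widths. The sample widths and the variable names `samples`, `probabilities`, and `labels` are my own choices for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]                # petal width (cm)
y = (iris["target"] == 2).astype(int)  # 1 if Iris virginica, else 0

log_reg = LogisticRegression(solver="lbfgs", random_state=42)
log_reg.fit(X, y)

# Illustrative petal widths: two clear cases and two inside the overlapping range
samples = np.array([[1.0], [1.5], [1.7], [2.0]])

# One row per sample; columns are P(not Iris virginica), P(Iris virginica)
probabilities = log_reg.predict_proba(samples)

# predict() returns the class with the larger estimated probability
labels = log_reg.predict(samples)

print(probabilities)
print(labels)
```

Each label is simply the index of the larger probability in the corresponding row, which is why predict() can look confident even when the two probabilities are close to each other.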

There is a decision boundary at around 1.6 cm, where both probabilities are equal to 50 percent. If the petal width is higher than 1.6 cm, our classification model will predict that the flower is an Iris virginica; otherwise, it will predict that it is not an Iris virginica.
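With a single feature, the 50 percent point can also be derived directly from the fitted parameters instead of being read off a grid. The model's probability is σ(w·x + b), which equals 0.5 exactly where w·x + b = 0, that is, at x = -b / w. Here is a minimal sketch that refits the same model; the names `analytic_boundary` and `p_at_boundary` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]                # petal width (cm)
y = (iris["target"] == 2).astype(int)  # 1 if Iris virginica, else 0

log_reg = LogisticRegression(solver="lbfgs", random_state=42)
log_reg.fit(X, y)

w = log_reg.coef_[0][0]    # weight of the petal width feature
b = log_reg.intercept_[0]  # bias term

# sigmoid(w * x + b) = 0.5  <=>  w * x + b = 0  <=>  x = -b / w
analytic_boundary = -b / w

# Sanity check: the model's own probability at that width
p_at_boundary = log_reg.predict_proba([[analytic_boundary]])[0, 1]

print(analytic_boundary)
print(p_at_boundary)
```

The value should agree with the boundary found on the grid of 1,000 points up to the grid spacing, since the grid search only picks the first sampled width whose probability reaches 50 percent.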

Decision Boundary using Two Features (Petal Width & Petal Length)

Now let’s plot a slightly more complex decision boundary based on two features: petal width and petal length. We will train the model on these two features to predict whether the flower is an Iris virginica:

from sklearn.linear_model import LogisticRegression

X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(int)  # plain int, because np.int was removed in NumPy 1.24

log_reg = LogisticRegression(solver="lbfgs", C=10**10, random_state=42)
log_reg.fit(X, y)

# Grid of points covering the plotted area
x0, x1 = np.meshgrid(
    np.linspace(2.9, 7, 500).reshape(-1, 1),
    np.linspace(0.8, 2.7, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_proba = log_reg.predict_proba(X_new)

plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs")
plt.plot(X[y==1, 0], X[y==1, 1], "g^")

# Contour lines of the estimated probability of Iris virginica
zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)

# Decision boundary: the line where the linear score is zero
left_right = np.array([2.9, 7])
boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]

plt.clabel(contour, inline=1, fontsize=12)
plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.5, "Not Iris virginica", fontsize=14, color="b", ha="center")
plt.text(6.5, 2.3, "Iris virginica", fontsize=14, color="g", ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
plt.show()
[Figure: linear decision boundary for Iris virginica using petal length and petal width]

In the output above, the dashed line represents the points where our Logistic Regression model predicts a probability of 50 percent. This line is the decision boundary for our classification model. One thing to note here is that it is a linear decision boundary.
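The boundary is linear because the model's probability is σ(w1·x1 + w2·x2 + b). Setting it equal to a probability p gives the straight line w1·x1 + w2·x2 + b = log(p / (1 - p)), and for p = 0.5 the right-hand side is 0, which is the dashed line. The sketch below checks this numerically under the same settings as the plot; `width_on_contour` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 2:4]               # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris virginica, else 0

log_reg = LogisticRegression(solver="lbfgs", C=10**10, random_state=42)
log_reg.fit(X, y)

w1, w2 = log_reg.coef_[0]
b = log_reg.intercept_[0]


def width_on_contour(length, p=0.5):
    """Petal width at which the model outputs probability p for a given petal length."""
    logit = np.log(p / (1 - p))  # 0 when p = 0.5
    return (logit - w1 * length - b) / w2


lengths = np.array([2.9, 7.0])  # left and right edges of the plot
for p in (0.5, 0.9):
    points = np.c_[lengths, width_on_contour(lengths, p)]
    # The model's probability on these points should match p
    print(p, log_reg.predict_proba(points)[:, 1])
```

Every such line has the same slope, -w1 / w2, and only the offset changes with p, so the lines of equal probability are all parallel to the decision boundary.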

Here each parallel line represents the points where our model outputs a specific probability, from 15 percent to 90 percent. All the flowers lying beyond the top-right line have over a 90 percent probability of being Iris virginica.

I hope you liked this article on Decision Boundary in Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to read more amazing articles.
