By Aman Kharwal

Logistic Regression in Machine Learning with Python

Logistic Regression

One of the best things about the scikit-learn library in Python is that it offers a four-step modeling pattern that makes it easy to train a machine learning classifier. In this article, I will use Logistic Regression with Python to classify images of handwritten digits. Once the model is trained, we can use it to predict which digit an image shows.


Logistic Regression on Digits with Python

The scikit-learn library ships with a built-in digits dataset, so we only need to load it; there is nothing to download for this classification task. Now let’s load our dataset.

from sklearn.datasets import load_digits
digits = load_digits()

Now let’s look at some insights from the dataset.

# Print to show there are 1797 images (8 by 8 images for a dimensionality of 64)
print("Image Data Shape" , digits.data.shape)
# Print to show there are 1797 labels (integers from 0–9)
print("Label Data Shape", digits.target.shape)
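As a quick check on the shapes above, the following sketch shows how the flat 64-value rows in digits.data relate to the 8 by 8 arrays in digits.images. The names first_image, first_flat, same and classes are my own, chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# digits.images holds each sample as an 8x8 array,
# digits.data holds the same pixels flattened to 64 values
first_image = digits.images[0]
first_flat = digits.data[0]
print(first_image.shape)
print(first_flat.shape)

# Reshaping the flat row recovers the original image
same = np.array_equal(first_flat.reshape(8, 8), first_image)
print(same)

# The labels are the ten digit classes
classes = np.unique(digits.target)
print(classes)
```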

To Show the Images and Labels in Digits Dataset

Now let’s see what our data contains. I will visualize some of the images and their labels to get a sense of what we are working with.

import numpy as np 
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(digits.data[0:5], digits.target[0:5])):
  plt.subplot(1, 5, index + 1)
  plt.imshow(np.reshape(image, (8,8)), cmap=plt.cm.gray)
  plt.title('Training: %i\n' % label, fontsize = 20)

Split the Data into Training and Test Set

Now I will split the data into 75 percent for training and 25 percent for testing. We hold out a test set so that we can check how well the classification model generalizes to new data that it has not seen during training.
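A minimal sketch of that split, printing how many samples land on each side. The random_state value only fixes the shuffle so that the split is repeatable; n_total is a helper name of my own.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Hold out 25 percent of the samples for testing;
# random_state fixes the shuffle so the split is repeatable
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

n_total = len(digits.data)
print("Training samples:", len(x_train))
print("Test samples:", len(x_test))
print("Test fraction: %.3f" % (len(x_test) / n_total))
```

Because 1797 is not divisible by four, the test share is rounded up to a whole number of samples, so the fraction printed is only approximately 0.25.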

Scikit-learn 4-Step Modelling Pattern (Logistic Regression)

Step one is to import the model that we want to use. As this article is based on logistic regression, I will import the logistic regression model from the scikit-learn library. The code below also performs the train/test split described above.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, 
                                                    digits.target, test_size=0.25, 
                                                    random_state=0)
from sklearn.linear_model import LogisticRegression

Step two is to create an instance of the model, which means that we store the Logistic Regression model in a variable.

logisticRegr = LogisticRegression()

Step three is to train the model. For this, we fit our Logistic Regression model to the training data.

logisticRegr.fit(x_train, y_train)
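To make the training step more concrete, here is a small sketch that inspects what the fitted model has learned. The variable name model and the max_iter value are my own assumptions; max_iter is raised only to give the solver room to converge on the unscaled pixel values.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# max_iter is an assumption of this sketch, not a setting from the article
model = LogisticRegression(max_iter=10000)
model.fit(x_train, y_train)

# One class per digit
print(model.classes_)
# One row of 64 pixel weights per class
print(model.coef_.shape)
# One bias term per class
print(model.intercept_.shape)
```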

Step four is to predict the labels for new data. In this step, the model uses the information that it learned during training.

# predict() returns a NumPy array
# Predict for one observation (image)
logisticRegr.predict(x_test[0].reshape(1,-1))
# Predict for the first ten observations
logisticRegr.predict(x_test[0:10])
# Predict for the entire test set
predictions = logisticRegr.predict(x_test)
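Behind each predicted label, logistic regression produces a probability for every class. The sketch below, which assumes the same digits split and uses names of my own (model, proba, most_likely, labels), shows that picking the most probable class gives the same answer as predict().

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# max_iter is an assumption of this sketch, raised to help convergence
model = LogisticRegression(max_iter=10000)
model.fit(x_train, y_train)

# One row per image, one probability per digit class
proba = model.predict_proba(x_test[0:10])
print(proba.shape)

# Choosing the most probable class reproduces predict()
most_likely = model.classes_[np.argmax(proba, axis=1)]
labels = model.predict(x_test[0:10])
print(np.array_equal(most_likely, labels))
```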

Measure the Accuracy of our Logistic Regression Model

I will measure the accuracy of our trained Logistic Regression model, where accuracy is defined as the fraction of correct predictions: correct predictions / total number of data points.

# Use score method to get accuracy of model
score = logisticRegr.score(x_test, y_test)
print(score)
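The definition of accuracy can be checked by hand. This sketch, with variable names of my own and a raised max_iter as an assumption, computes the fraction of correct predictions directly and compares it with the score method.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# max_iter is an assumption of this sketch, raised to help convergence
model = LogisticRegression(max_iter=10000)
model.fit(x_train, y_train)
predictions = model.predict(x_test)

# Accuracy by hand: correct predictions / total number of data points
correct = np.sum(predictions == y_test)
manual_accuracy = correct / len(y_test)

# Accuracy as reported by scikit-learn
score = model.score(x_test, y_test)
print(manual_accuracy, score)
```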

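In my run, the accuracy came out at about 95.3 percent, which is a good result for such a simple model. The exact figure can vary slightly with the scikit-learn version and solver settings.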

Confusion Matrix

A confusion matrix is a table that describes the performance of a trained classifier: each row corresponds to an actual label and each column to a predicted label. Here I will use Matplotlib and Seaborn in Python to visualize the performance of our trained model.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)
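To see how the matrix is read, here is a tiny hand-made example with three classes. The labels in y_true and y_pred are made up for illustration and are not taken from the digits data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: six samples, three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are actual labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]

# Correct predictions sit on the diagonal, so accuracy is trace / total
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

The off-diagonal entries show the mistakes: one actual 0 was predicted as 1, and one actual 2 was predicted as 0.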

Now let’s visualize our performance using the confusion matrix. First, I will visualize the confusion matrix using the Seaborn library in python.

plt.figure(figsize=(9,9))
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size = 15)
plt.show()

Now let’s visualize our Logistic Regression model’s performance with the confusion matrix using the matplotlib library in python.

plt.figure(figsize=(9,9))
plt.imshow(cm, interpolation='nearest', cmap='Pastel1')
plt.title('Confusion matrix', size = 15)
plt.colorbar()
tick_marks = np.arange(10)
plt.xticks(tick_marks, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], rotation=45, size = 10)
plt.yticks(tick_marks, ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"], size = 10)
plt.tight_layout()
plt.ylabel('Actual label', size = 15)
plt.xlabel('Predicted label', size = 15)
width, height = cm.shape
for x in range(width):
  for y in range(height):
    plt.annotate(str(cm[x][y]), xy=(y, x),
                 horizontalalignment='center',
                 verticalalignment='center')
plt.show()

Logistic Regression (MNIST)

The Logistic Regression model that you saw above was meant to give you an idea of how this classifier works with Python to train a machine learning model. Now let’s prepare a Logistic Regression model for a more realistic example, using a larger dataset to fit our model.

Load the MNIST Dataset

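# fetch_mldata was removed from scikit-learn; fetch_openml replaces it
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(int)  # OpenML stores the labels as strings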

Now after loading the MNIST dataset, let’s see some insights into the data.

# These are the images
# There are 70,000 images (28 by 28 images for a dimensionality of 784)
print(mnist.data.shape)
# These are the labels
print(mnist.target.shape)

In the output, you will see that this dataset contains 70000 images and 70000 labels, which makes it a much larger problem than the digits dataset above.

Split the Data into Training and Testing

Now let’s split the data into training and testing sets. Here I will break the dataset into 60000 images as a training set and 10000 images as a testing set.
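The 60000/10000 split corresponds to a test fraction of one seventh. The sketch below checks the arithmetic on a stand-in array of 70000 row indices, so nothing has to be downloaded; the names indices, train_idx and test_idx are my own. It also shows that an integer test_size requests an exact number of test samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 70000 MNIST rows: only the row count matters here
indices = np.arange(70000)

# A float test_size is a fraction of the data
train_idx, test_idx = train_test_split(
    indices, test_size=1/7.0, random_state=0)
print(len(train_idx), len(test_idx))

# An integer test_size is an absolute number of test samples
train_idx2, test_idx2 = train_test_split(
    indices, test_size=10000, random_state=0)
print(len(train_idx2), len(test_idx2))
```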

Visualize the Data

As I mentioned earlier, we need to look at the data before moving forward to see what we are working with. Here I will visualize the data using the matplotlib library in Python. The code below first performs the split described above and then plots the first five training images.

from sklearn.model_selection import train_test_split
train_img, test_img, train_lbl, test_lbl = train_test_split(
 mnist.data, mnist.target, test_size=1/7.0, random_state=0)
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(train_img[0:5], train_lbl[0:5])):
  plt.subplot(1, 5, index + 1)
  plt.imshow(np.reshape(image, (28,28)), cmap=plt.cm.gray)
  plt.title('Training: %i\n' % label, fontsize = 20)
plt.show()

Scikit-Learn Modelling Pattern

Now let’s follow scikit-learn’s modeling pattern, as I did in the example above.

from sklearn.linear_model import LogisticRegression

# all parameters not specified are set to their defaults
# older scikit-learn versions defaulted to the slower liblinear solver, so we set lbfgs explicitly
logisticRegr = LogisticRegression(solver = 'lbfgs')

logisticRegr.fit(train_img, train_lbl)
# Returns a NumPy Array
# Predict for One Observation (image)
logisticRegr.predict(test_img[0].reshape(1,-1))
logisticRegr.predict(test_img[0:10])
predictions = logisticRegr.predict(test_img)
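The walkthrough above stops at prediction. A minimal sketch of an evaluation step follows, mirroring the accuracy check used for the digits model earlier and also listing which test images were misclassified. To keep it quick and download-free, it uses the small digits dataset as a stand-in, which is my assumption; with the MNIST model you would pass test_img and test_lbl instead. The max_iter value and the variable names are also my own.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: the small digits dataset instead of MNIST
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(solver='lbfgs', max_iter=10000)
model.fit(x_train, y_train)
predictions = model.predict(x_test)

# Fraction of correct predictions on the held-out test set
score = model.score(x_test, y_test)
print("Accuracy:", score)

# Positions in the test set where the prediction was wrong
misclassified = np.where(predictions != y_test)[0]
print("Misclassified images:", len(misclassified))
for i in misclassified[:5]:
    print("index %i: predicted %i, actual %i" % (i, predictions[i], y_test[i]))
```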


So, this is how you can train a machine learning model with the four-step scikit-learn pattern. Once you are familiar with this pattern, preparing a model in Python with scikit-learn becomes straightforward. I hope this article helps you. Feel free to ask questions on Logistic Regression in Machine Learning with Python, or any other topic, in the comments section.
