Where binary classification distinguishes between two classes, multiclass classification (also called multinomial classification) can distinguish between more than two classes.
Some algorithms, such as SGD classifiers, Random Forest classifiers, and Naive Bayes classifiers, are capable of handling multiple classes natively. Others, such as Logistic Regression or Support Vector Machine classifiers, are strictly binary classifiers. However, there are various strategies that you can use to perform multiclass classification with multiple binary classifiers.
Techniques of Multiclass Classification
There are two main techniques for turning binary classifiers into a multiclass classifier, OvR and OvO. Let’s go through both of them one by one:
OvR Strategy
One way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, and so on). Then, when you want to classify an image, you get the decision score from each classifier for that image and select the class whose classifier outputs the highest score. This is called the one-versus-the-rest (OvR) strategy, also known as one-versus-all.
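To make the idea concrete, here is a rough sketch of the OvR decision rule with made-up scores (the numbers are purely illustrative, not the output of any trained model):
import numpy as np
# hypothetical decision scores from 10 binary detectors, one per digit class
scores = np.array([-2.1, -1.4, 0.3, -0.8, -3.0, 4.2, -1.9, 0.1, -0.5, -2.7])
np.argmax(scores)  # 5 -> the class whose detector produced the highest score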
OvO Strategy
Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N × (N – 1)/2 classifiers.
For the MNIST problem, this means training 45 binary classifiers. When you want to classify an image, you have to run the image through all 45 classifiers and see which class wins the most duels. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.
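Here is a tiny sketch of the OvO voting idea. The duel_winner() function below is only a stand-in for the 45 trained pairwise classifiers, rigged so that class 5 wins its duels, just to show how the vote counting works:
from itertools import combinations
from collections import Counter
classes = list(range(10))               # the digits 0 to 9
pairs = list(combinations(classes, 2))  # one binary classifier per pair of classes
len(pairs)                              # 45 = 10 * 9 / 2
def duel_winner(a, b):
    # stand-in for a trained pairwise classifier (rigged so class 5 always wins its duels)
    return 5 if 5 in (a, b) else min(a, b)
votes = Counter(duel_winner(a, b) for a, b in pairs)
votes.most_common(1)[0][0]              # the class that wins the most duels: 5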
Training a Multiclass Classification Model
Some algorithms such as Support Vector Machine classifiers scale poorly with the size of the training set. For these algorithms OvO is preferred because it is faster to train many classifiers on small training sets than to train few classifiers on large training sets. For most binary classification algorithms, however, OvR is preferred.
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvR or OvO, depending on the algorithm. Let’s try this with a Support Vector Machine classifier. Before that, I suggest you go through my article on Binary Classification, because I will use the same classification problem so that you can understand the difference between training a binary classifier and a multiclass classifier. I will not start the code from the beginning here; you can continue from the end of your binary classification model:
from sklearn.svm import SVC
# X_train, y_train and some_digit come from the binary classification code you are continuing
svm_clf = SVC(gamma="auto", random_state=42)
svm_clf.fit(X_train[:1000], y_train[:1000])  # y_train, not y_train_5
svm_clf.predict([some_digit])
array([5], dtype=uint8)
That was easy. This code trains the SVC on the training set using the original target classes from 0 to 9 (y_train), instead of the 5-versus-the-rest target classes (y_train_5). Then it makes a prediction (a correct one in this case). Under the hood, Scikit-Learn actually used the OvO strategy: it trained 45 binary classifiers, got their decision scores for the image, and selected the class that won the most duels.
If you call the decision_function() method, you will see that it returns 10 scores per instance (instead of just 1). That’s one score per class:
some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores
array([[ 2.92492871, 7.02307409, 3.93648529, 0.90117363, 5.96945908, 9.5 , 1.90718593, 8.02755089, -0.13202708, 4.94216947]])
The highest score is indeed the one corresponding to class 5:
np.argmax(some_digit_scores)
5
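Note that np.argmax() returns the index of the highest score, which only matches the class label here because the classes happen to be the digits 0 to 9 in order. In general, a trained Scikit-Learn classifier stores its class labels in the classes_ attribute, which you can use to map the winning index back to a class:
svm_clf.classes_
svm_clf.classes_[np.argmax(some_digit_scores)]  # maps the winning index back to the class label: 5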
If you want to force Scikit-Learn to use one-versus-one or one-versus-the-rest, you can use the OneVsOneClassifier or OneVsRestClassifier classes. Simply create an instance and pass a classifier to its constructor. For example, this code creates a multiclass classifier using the OvR strategy, based on an SVC:
from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC(gamma="auto", random_state=42))
ovr_clf.fit(X_train[:1000], y_train[:1000])
ovr_clf.predict([some_digit])
array([5], dtype=uint8)
len(ovr_clf.estimators_)
10
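If you prefer to force the OvO strategy instead, OneVsOneClassifier works exactly the same way; with 10 classes it fits 45 pairwise estimators. Here is a sketch under the same setup as above (it should again predict class 5):
from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SVC(gamma="auto", random_state=42))
ovo_clf.fit(X_train[:1000], y_train[:1000])
ovo_clf.predict([some_digit])
len(ovo_clf.estimators_)  # 45, one classifier per pair of classes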
Training an SGDClassifier is just as easy:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)  # re-create the SGD classifier if you are not continuing from the binary classification code
sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])
array([5], dtype=uint8)
This time Scikit-Learn did not have to run OvR or OvO because SGD classifiers can directly classify instances into multiple classes. The decision_function() method now returns one value per class. Let’s look at the scores that the SGD classifier assigned to each class:
sgd_clf.decision_function([some_digit])
array([[-15955.22627845, -38080.96296175, -13326.66694897, 573.52692379, -17680.6846644 , 2412.53175101, -25526.86498156, -12290.15704709, -7946.05205023, -10631.35888549]])
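As with the SVC, the largest score is at index 5, and the classes_ attribute maps that index back to the class label:
np.argmax(sgd_clf.decision_function([some_digit]))  # 5
sgd_clf.classes_[5]                                  # 5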
Now, of course, you want to evaluate this multiclass classifier. I will use cross-validation to evaluate the SGDClassifier’s accuracy:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
array([0.8489802 , 0.87129356, 0.86988048])
It gets over 84 percent on all test folds. If you used a random classifier, you would get 10 percent accuracy, so this is not such a bad score, but you can still do much better. Simply scaling the inputs increases accuracy above 89 percent:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
array([0.89707059, 0.8960948 , 0.90693604])
Error Analysis of Multiclass Classification
Now, let’s look at the confusion matrix first. You need to make predictions using the cross_val_predict() function, then call the confusion_matrix() function:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
array([[5578,    0,   22,    7,    8,   45,   35,    5,  222,    1],
       [   0, 6410,   35,   26,    4,   44,    4,    8,  198,   13],
       [  28,   27, 5232,  100,   74,   27,   68,   37,  354,   11],
       [  23,   18,  115, 5254,    2,  209,   26,   38,  373,   73],
       [  11,   14,   45,   12, 5219,   11,   33,   26,  299,  172],
       [  26,   16,   31,  173,   54, 4484,   76,   14,  482,   65],
       [  31,   17,   45,    2,   42,   98, 5556,    3,  123,    1],
       [  20,   10,   53,   27,   50,   13,    3, 5696,  173,  220],
       [  17,   64,   47,   91,    3,  125,   24,   11, 5421,   48],
       [  24,   18,   29,   67,  116,   39,    1,  174,  329, 5152]])
That’s a lot of numbers. It’s often more convenient to look at an image representation of the confusion matrix, using Matplotlib’s matshow() function:
# since sklearn 0.22, you can use sklearn.metrics.plot_confusion_matrix()
import matplotlib.pyplot as plt

def plot_confusion_matrix(matrix):
    """If you prefer color and a colorbar"""
    fig = plt.figure(figsize=(8, 8))
    ax = fig.add_subplot(111)
    cax = ax.matshow(matrix)
    fig.colorbar(cax)

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

Let’s focus the plot on the errors. First we need to divide each value in the confusion matrix by the number of images in the corresponding class, so that you can compare error rates instead of absolute numbers of errors:
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums    # error rates instead of absolute counts
np.fill_diagonal(norm_conf_mx, 0)    # zero out the diagonal to keep only the errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
save_fig("confusion_matrix_errors_plot", tight_layout=False)  # save_fig() is a small file-saving helper; plt.savefig() works just as well
plt.show()

Analyzing individual errors can also be a good way to gain insight into what your classifier is doing and why it is failing, but it is more difficult and time-consuming. For example, let’s plot examples of 3s and 5s:
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]  # 3s classified as 3s
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]  # 3s classified as 5s
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]  # 5s classified as 3s
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]  # 5s classified as 5s
plt.figure(figsize=(8, 8))
# plot_digits() is a helper (not shown here) that draws a grid of digit images
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
save_fig("error_analysis_digits_plot")
plt.show()

The main difference between the 3s and 5s is the position of the small line that joins the top line to the bottom arc. If you draw a 3 with the junction slightly shifted to the left, the classifier might classify it as a 5, and vice versa. I hope you liked this article on Multiclass Classification. Feel free to ask your valuable questions in the comments section below.