Precision and Recall

In Machine Learning, precision and recall are two of the most important metrics for model evaluation. Precision is the percentage of your model's positive predictions that are actually correct (relevant), while recall is the percentage of all relevant instances that your model correctly identifies.
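In terms of confusion-matrix counts (true positives TP, false positives FP, false negatives FN), precision = TP / (TP + FP) and recall = TP / (TP + FN). Here is a tiny illustration; the counts below are made up purely for the example:

tp, fp, fn = 80, 20, 40          # hypothetical counts, for illustration only
precision = tp / (tp + fp)       # 0.80 -> 80% of the positive predictions are correct
recall = tp / (tp + fn)          # ~0.67 -> about 67% of the actual positives were found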

In this article, I will show you how you can apply Precision and Recall to evaluate the performance of your Machine Learning model.

Applying Precision and Recall in Machine Learning

I will apply Precision and Recall to the model I trained in my earlier post on Binary Classification, and continue from where that post ended.
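If you do not have that code handy, the variables used below (sgd_clf, X_train, y_train_5, y_train_pred, some_digit) come from the MNIST "5-detector" built there. The following is only a minimal sketch of roughly what that setup looks like, not a copy of the earlier post:

from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

mnist = fetch_openml('mnist_784', version=1, as_frame=False)       # 70,000 handwritten digits
X_train, y_train = mnist["data"][:60000], mnist["target"][:60000]
y_train_5 = (y_train == '5')                                       # binary target: is the digit a 5?
some_digit = X_train[0]

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)                                    # fitted model, needed later for decision_function()
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)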

Scikit-Learn provides several functions to compute classifier metrics:

from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred)

0.8370879772350012

# Precision by hand: TP / (TP + FP), using confusion-matrix counts (TP=4096, FP=1522)
4096 / (4096 + 1522)

0.7290850836596654

recall_score(y_train_5, y_train_pred)

0.6511713705958311
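If you want to see where counts such as 4096 true positives, 1522 false positives and 1325 false negatives come from (they appear in the manual calculations in this article), you can print the confusion matrix yourself; note that the exact counts depend on the cross-validation run:

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)    # rows: actual (non-5, 5); columns: predicted (non-5, 5)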

F1 Score in Precision and Recall

It is often convenient to combine these two metrics into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall.

Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both the metrics are high.

F1 = 2 × (precision × recall) / (precision + recall)

To compute the F1 score, simply call the f1_score() function:

from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)

0.7325171197343846

# Equivalent formula in terms of confusion-matrix counts: TP / (TP + (FP + FN) / 2)
4096 / (4096 + (1522 + 1325) / 2)

0.7420962043663375

Precision/Recall Trade-off

To understand this trade-off, let’s look at how the SGDClassifier makes its classification decisions. For each instance, it computes a score based on a decision function. If that score is higher than a threshold, it assigns the instance to the positive class; otherwise, it assigns it to the negative class.

[Figure: instances ranked by their decision score, with several possible decision thresholds]

In the image above, instances are ranked by their classifier score, and those above the chosen decision threshold are classified as positive; the higher the threshold, the lower the recall, but (in general) the higher the precision.

Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores that it uses to make predictions. Instead of calling the classifier’s predict() method, you can call its decision_function() method, which returns a score for each instance, and then use any threshold you want to make predictions based on those scores:

y_scores = sgd_clf.decision_function([some_digit])
y_scores

array([2164.22030239])

threshold = 0
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

array([ True])

The SGDClassifier uses a threshold equal to 0, so the previous code returns the same result as the predict() method (i.e., True). Let’s raise the threshold:

threshold = 8000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

array([False])

How do you decide which threshold to use? First, use the cross_val_predict() function to get the scores of all instances in the training set, but this time specify that you want to return decision scores instead of predictions:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
import numpy as np

# Get a decision score (instead of a class prediction) for every training instance
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

# Compute precision and recall for every possible threshold
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.legend(loc="center right", fontsize=16)
    plt.xlabel("Threshold", fontsize=16)
    plt.grid(True)
    plt.axis([-50000, 50000, 0, 1])



# Lowest threshold that gives at least 90% precision, and the recall at that threshold
recall_90_precision = recalls[np.argmax(precisions >= 0.90)]
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]


plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
# Highlight the threshold that gives 90% precision and the corresponding recall
plt.plot([threshold_90_precision, threshold_90_precision], [0., 0.9], "r:")
plt.plot([-50000, threshold_90_precision], [0.9, 0.9], "r:")
plt.plot([-50000, threshold_90_precision], [recall_90_precision, recall_90_precision], "r:")
plt.plot([threshold_90_precision], [0.9], "ro")
plt.plot([threshold_90_precision], [recall_90_precision], "ro")
save_fig("precision_recall_vs_threshold_plot")   # save_fig() is a helper (not part of Matplotlib); plt.savefig() also works
plt.show()
[Figure: precision and recall as functions of the decision threshold]

You may wonder why the blue precision curve is bumpier than the green recall curve in the output above. The reason is that precision can sometimes go down when you raise the threshold (although in general it goes up): for example, if raising the threshold drops a true positive while keeping the false positives, precision falls slightly (say from 4/5 = 0.80 to 3/4 = 0.75).
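Suppose you decide to aim for 90% precision. You can reuse the threshold_90_precision value computed above to make predictions at that threshold; here is a minimal sketch (it assumes y_scores and threshold_90_precision from the code above):

y_train_pred_90 = (y_scores >= threshold_90_precision)   # positive only above the chosen threshold
precision_score(y_train_5, y_train_pred_90)              # should be close to 0.90 by construction
recall_score(y_train_5, y_train_pred_90)                 # the recall you trade away for that precision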

Another way to select a good trade-off is to plot precision directly against recall:

# Sanity check: a threshold of 0 gives back the original cross-validated predictions
(y_train_pred == (y_scores > 0)).all()

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
# Highlight the point on the curve where precision reaches 0.90
plt.plot([0.4368, 0.4368], [0., 0.9], "r:")
plt.plot([0.0, 0.4368], [0.9, 0.9], "r:")
plt.plot([0.4368], [0.9], "ro")
save_fig("precision_vs_recall_plot")   # save_fig() is a helper (not part of Matplotlib); plt.savefig() also works
plt.show()

You can see that precision starts to fall sharply around 80% recall. You will probably want to select a precision/recall trade-off just before that drop. I hope you liked this article. Feel free to ask your valuable questions in the comments section below. Also, follow me on Medium to read some more amazing articles like this.
