In the previous article of our blog post series on machine learning metrics, we saw the difference between a metric and an objective function, why metrics are important, and how to choose a good one. In this article, we focus on binary classification metrics. Links to the other articles in this series are provided below.

An introduction to Machine Learning metrics [LINK TO INTRODUCTION]

Binary Classification metrics

Regression metrics [LINK TO REGRESSION METRICS]

Multi Classification metrics [LINK TO MULTI CLASSIFICATION METRICS]
Introduction
Classification refers to predictive modeling problems that involve predicting a class label. To evaluate the models, different types of metrics can be used. In binary classification, these metrics can be decomposed into three categories:
 metrics that aim to rank the instances correctly, like AUC,
 metrics that aim to produce the most accurate probabilities possible, like log loss,
 metrics that aim to optimize the final decision, like accuracy or F1 score.
In this article, you will discover how to calculate metrics for binary classification predictive modeling projects.
I would like to thank Abhishek Thakur for allowing us to reuse the implementation code for the metrics discussed in this article. We invite you to read his excellent book Approaching (Almost) Any Machine Learning Problem.
To explain the metrics discussed in this article, we first need to introduce the confusion matrix, precision, and recall.
Confusion Matrix
The confusion matrix is a performance measurement for machine learning classification, in both binary and multiclass settings. It compares the classes predicted by the model to the ground truth classes. However, most machine learning tools return the probability of belonging to each class (example: 0.33) and not the class label (example: predicted class = 0). Therefore, these probabilities need to be converted into class labels.
How to convert Probabilities to Class Labels?
To convert probabilities to class labels, a threshold is defined: each observation whose predicted probability of belonging to the positive class is above the threshold is assigned to the positive class, and every other observation is assigned to the negative class. In Prevision.io, several candidate thresholds are tested in order to find an optimal one. The optimal threshold is the one that maximizes the F1 score.
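The thresholding step can be sketched as follows. This is a minimal illustration, not Prevision.io's actual procedure: the function names, the grid of candidate thresholds and the toy data are all assumptions made for the example.

```python
def to_labels(probabilities, threshold=0.5):
    """Assign class 1 when the positive-class probability is above the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def f1_at_threshold(y_true, probabilities, threshold):
    """F1 score of the labels obtained with the given threshold."""
    y_pred = to_labels(probabilities, threshold)
    tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
    fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
    fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
    denominator = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when there is nothing to count.
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(y_true, probabilities, candidates):
    """Return the candidate with the highest F1 score (the first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probabilities, t))


# Toy data (hypothetical): true labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 1, 0, 1]
probabilities = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
candidates = [i / 10 for i in range(1, 10)]  # illustrative grid: 0.1, 0.2, ..., 0.9

threshold = best_threshold(y_true, probabilities, candidates)
labels = to_labels(probabilities, threshold)
print(threshold, labels)
```

On this toy data the lowest thresholds let false positives through and the highest ones miss true positives, so the F1-maximizing threshold sits in between.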
True Positive / True Negative / False Positive / False Negative
Once the conversion is done, the confusion matrix is obtained from the following table.
Columns in the confusion matrix table represent the actual values of the target and rows the predicted value of the target by the classification model. TP (True Positive) corresponds to the number of predicted positive classes which are actually positive. Similarly to TP, TN (True Negative) corresponds to the number of predicted negative classes which are actually negative. FP (False Positive or Type 1 error) corresponds to the number of predicted positive classes which are actually negative. FN (False Negative or Type 2 error) corresponds to the number of predicted negative classes which are actually positive. This is usually the error we need to decrease the most.
The confusion matrix can be seen as an overview of the model’s predictions. We can easily read the percentage of correct predictions (diagonal) and of incorrect ones (antidiagonal).
Below is the implementation in Python of the calculation of the confusion matrix.
def true_positive(y_true, y_pred): 
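A minimal sketch of the four counts, assuming that `y_true` and `y_pred` are lists of labels encoded as 0 (negative) and 1 (positive). The function bodies and the toy data are illustrative and are not taken from the original listing.

```python
def true_positive(y_true, y_pred):
    """Count observations that are positive and predicted positive."""
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)


def true_negative(y_true, y_pred):
    """Count observations that are negative and predicted negative."""
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 0)


def false_positive(y_true, y_pred):
    """Count observations that are negative but predicted positive (Type 1 error)."""
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)


def false_negative(y_true, y_pred):
    """Count observations that are positive but predicted negative (Type 2 error)."""
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)


# Toy data (hypothetical).
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

counts = {
    "TP": true_positive(y_true, y_pred),
    "TN": true_negative(y_true, y_pred),
    "FP": false_positive(y_true, y_pred),
    "FN": false_negative(y_true, y_pred),
}
print(counts)
```

The four counts always add up to the number of observations, which is a quick sanity check on any confusion matrix.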
Precision / Recall
Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances, while recall (also known as sensitivity) is the fraction of relevant instances that were retrieved. Both precision and recall are therefore based on relevance.
Below is the implementation in Python of the calculation of the precision and the recall.
def precision(y_true, y_pred): 
def recall(y_true, y_pred): 
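In terms of the confusion matrix, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below computes both from 0/1 label lists. Returning 0.0 when the denominator is zero is a convention assumed for this example, and the toy data is hypothetical.

```python
def precision(y_true, y_pred):
    """TP / (TP + FP): share of predicted positives that are truly positive."""
    tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
    fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    """TP / (TP + FN): share of actual positives that were found."""
    tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
    fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (hypothetical): 2 true positives, 1 false positive, 2 false negatives.
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

p = precision(y_true, y_pred)  # 2 / 3
r = recall(y_true, y_pred)     # 2 / 4
print(p, r)
```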
All the terms explained above will be used in the calculation of the following metrics.
AUC
AUC means Area Under the Receiver Operating Characteristic Curve.
This metric evaluates how well a binary classification model distinguishes between the positive and negative classes. The AUC score represents the degree of separability: it tells how capable the model is of distinguishing between classes. An AUC of 1 indicates a perfect classifier, while an AUC of 0.5 indicates a poor classifier whose performance is no better than random guessing. The AUC takes its name from the area under the ROC curve, which is represented graphically as below.
Example of ROC CURVE [source]
from sklearn import metrics 
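The AUC can also be read as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The dependency-free sketch below uses that pairwise reading, with ties counted as one half. The name `auc_score` is an assumption for this example; in practice `sklearn.metrics.roc_auc_score(y_true, y_score)` is the usual way to compute the metric and avoids the quadratic loop.

```python
def auc_score(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative observation")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): one positive is scored below one negative.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = auc_score(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, the AUC does not depend on any threshold.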
GINI
The GINI score is an adjustment to the AUC so that a perfectly random model scores 0 and a reversing model has a negative sign.
from sklearn import metrics 
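The adjustment commonly used for this purpose is Gini = 2 × AUC − 1, which maps an AUC of 0.5 to 0 and an AUC below 0.5 to a negative value. The sketch below assumes the AUC has already been computed; the function name is hypothetical.

```python
def gini_from_auc(auc):
    """Rescale an AUC in [0, 1] to a Gini score in [-1, 1]."""
    return 2 * auc - 1


for auc in (1.0, 0.75, 0.5, 0.25):
    print(auc, gini_from_auc(auc))
```

A model with an AUC of 0.25 gets a Gini of -0.5: it ranks observations worse than chance, and reversing its scores would give a Gini of 0.5.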
Log loss
The logarithmic loss metric can be used to evaluate the performance of a binary classifier. Unlike the AUC, which looks at how well a model can classify a binary target, the log loss evaluates how close a model’s predicted values (uncalibrated probability estimates) are to the actual target value. For example, does a model tend to assign a high predicted value like 0.80 for the positive class, or does it show a poor ability to recognize the positive class and assign a lower predicted value like 0.50?
The log loss can be any value greater than or equal to 0, with 0 meaning that the model correctly assigns a probability of 0% or 100%.
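The log loss equation is defined as follows:
$$\text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} w_i \left( y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right)$$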
Where:
N is the total number of rows (observations) of your corresponding dataframe.
w is the per-row user-defined weight (default is 1).
p is the predicted value (uncalibrated probability) assigned to a given row (observation).
y is the actual target value.
Below is the implementation in Python of the calculation of the log loss.
import numpy as np 
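A minimal sketch of a weighted log loss, assuming the definition given above (sum of weighted per-row losses divided by N). The clipping constant `eps` is an assumption added to avoid taking the logarithm of 0; it is not part of the definition. Note that `sklearn.metrics.log_loss` handles sample weights with its own normalization, so weighted results may differ.

```python
import math


def log_loss(y_true, y_proba, weights=None, eps=1e-15):
    """Weighted binary log loss, averaged over the N rows."""
    if weights is None:
        weights = [1.0] * len(y_true)  # default weight of 1 per row
    total = 0.0
    for y, p, w in zip(y_true, y_proba, weights):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return -total / len(y_true)


# A positive row predicted at 0.80 versus the same row predicted at 0.50.
confident = log_loss([1], [0.80])
hesitant = log_loss([1], [0.50])
print(round(confident, 3), round(hesitant, 3))
```

The confident prediction of 0.80 gives a loss of about 0.223, while the uninformative prediction of 0.50 gives about 0.693 (the natural logarithm of 2), which is the loss of a model that always answers "50/50".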
Accuracy
In binary classification, the accuracy score is the number of correct predictions over the number of observations:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
This metric is not recommended when your target is imbalanced. Take the example of a churn problem with a positive target rate of 1%. A model that predicts “no churn” for all observations will have an accuracy of 99%, which looks great but is useless for identifying churners. In the case of imbalanced classes, we recommend the use of the AUC metric.
Below is the implementation in Python of the calculation of the accuracy metric.
def accuracy(y_true, y_pred): 
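The churn example above can be reproduced in a few lines. This is an illustrative sketch: the data is synthetic (100 customers, 1 churner) and the function body is an assumption, not the original listing.

```python
def accuracy(y_true, y_pred):
    """Share of observations whose predicted label equals the true label."""
    correct = sum(1 for yt, yp in zip(y_true, y_pred) if yt == yp)
    return correct / len(y_true)


# Synthetic churn data: 1% positive rate, and a model that always predicts "no churn".
y_true = [1] + [0] * 99
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
churners_found = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
print(acc, churners_found)
```

The accuracy is 0.99 even though not a single churner is identified, which is why accuracy alone is misleading on imbalanced targets.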
Error Rate
Error rate is deduced from the previous Accuracy metric as follows:
Error rate = 1 – Accuracy.
Some users of the Prevision.io platform prefer to display the accuracy rate over the error one. It is very dependent on the use case. For example, for a problem of predictive maintenance, we will try to minimize the error rate.
As with accuracy, this metric should not be used in the case of imbalanced classes.
def error_rate(y_true, y_pred): 
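A minimal sketch of the error rate, computed directly as the share of wrong predictions, which is equivalent to 1 – Accuracy. The function body and toy data are assumptions for illustration.

```python
def error_rate(y_true, y_pred):
    """Share of observations whose predicted label differs from the true label."""
    errors = sum(1 for yt, yp in zip(y_true, y_pred) if yt != yp)
    return errors / len(y_true)


# Toy data (hypothetical): 3 wrong predictions out of 8.
y_true = [0, 1, 1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 0]

err = error_rate(y_true, y_pred)
print(err)
```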
F1
The F1 score provides a measure of how well a binary classifier can classify positive cases (given a threshold value). The F1 score is the harmonic mean of the precision and recall. An F1 score of 1 means both precision and recall are perfect: the model correctly identified all the positive cases and didn’t mark a negative case as a positive case. If either precision or recall is very low, it will be reflected in an F1 score closer to 0.
The F1 equation is given as follows:
$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
Where:
precision is the fraction of the observations the model labeled as positive (the true positives + the false positives) that are actually positive (the true positives).
recall is the fraction of all the actual positive cases (the true positives + the false negatives) that the model correctly identified (the true positives).
There are several versions of the F1 score depending on the expected granularity.
micro: Calculate metrics globally by counting the total true positives, false negatives and false positives.
macro: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
weighted: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
Below is the implementation in Python of the calculation of true positive, true negative, false positive and false negative.
def true_positive(y_true, y_pred): 
import numpy as np 
import numpy as np 
from collections import Counter 
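The averaging variants listed above can be sketched as follows. This is an illustrative implementation under stated assumptions: labels are hashable values, the per-label F1 uses the equivalent form 2·TP / (2·TP + FP + FN), and a zero denominator yields 0.0. The function names and toy data are hypothetical.

```python
from collections import Counter


def f1_for_label(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == label and yp == label)
    fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt != label and yp == label)
    fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == label and yp != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


def weighted_f1(y_true, y_pred):
    """Mean of the per-label F1 scores, weighted by the support of each label."""
    support = Counter(y_true)
    total = sum(n * f1_for_label(y_true, y_pred, label) for label, n in support.items())
    return total / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 computed from TP, FP and FN pooled over all labels."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        tp += sum(1 for yt, yp in zip(y_true, y_pred) if yt == label and yp == label)
        fp += sum(1 for yt, yp in zip(y_true, y_pred) if yt != label and yp == label)
        fn += sum(1 for yt, yp in zip(y_true, y_pred) if yt == label and yp != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy imbalanced data (hypothetical): 2 positives, 6 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

binary = f1_for_label(y_true, y_pred, label=1)
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
micro = micro_f1(y_true, y_pred)
print(binary, macro, weighted, micro)
```

On this data the binary F1 of the positive class is 0.5, while the weighted and micro averages are pulled up by the well-predicted majority class. This is why the choice of averaging matters on imbalanced targets.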
F0.5
The F0.5 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike the F1 score, which gives equal weight to precision and recall, the F0.5 score gives more weight to precision than to recall. More weight should be given to precision for cases where False Positives are considered worse than False Negatives. For example, if your use case is to predict which products you will run out of, you may consider False Positives worse than False Negatives. In this case, you want your predictions to be very precise and only capture the products that will definitely run out. If you predict a product will need to be restocked when it actually doesn’t, you incur cost by having purchased more inventory than you actually need.
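The F0.5 equation is given as follows:
$$F_{0.5} = 1.25 \times \frac{\text{precision} \times \text{recall}}{0.25 \times \text{precision} + \text{recall}}$$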
Where:
precision is the positive observations (true positives) the model correctly identifies from all the observations it labeled as positive (the true positives + the false positives).
recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (the true positives + the false negatives).
from sklearn.metrics import fbeta_score 
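F0.5 is a special case of the general F-beta score, (1 + β²) × precision × recall / (β² × precision + recall), with β = 0.5. The sketch below is a dependency-free illustration that takes precision and recall as inputs; the function name and the example values are assumptions. With scikit-learn, the same score is obtained from label vectors with `fbeta_score(y_true, y_pred, beta=0.5)`.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # convention assumed here when both inputs are 0
    return (1 + beta_sq) * precision * recall / denominator


# Hypothetical model: fairly precise (2/3) but with a modest recall (1/2).
precision, recall = 2 / 3, 0.5

f05 = f_beta(precision, recall, beta=0.5)
f1 = f_beta(precision, recall, beta=1.0)
print(f05, f1)
```

Because precision is higher than recall in this example, F0.5 (0.625) is higher than F1 (about 0.571): the score rewards the model for the dimension that F0.5 emphasizes.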
F2
The F2 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike the F1 score, which gives equal weight to precision and recall, the F2 score gives more weight to recall than to precision. More weight should be given to recall for cases where False Negatives are considered worse than False Positives. For example, if your use case is to predict which customers will churn, you may consider False Negatives worse than False Positives. In this case, you want your predictions to capture all of the customers that will churn. Some of these customers may not be at risk for churning, but the extra attention they receive is not harmful. More importantly, no customers actually at risk of churning have been missed.
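The F2 score is defined by the following equation:
$$F_2 = 5 \times \frac{\text{precision} \times \text{recall}}{4 \times \text{precision} + \text{recall}}$$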
Where:
precision is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives).
recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (the true positives + the false negatives).
In the literature, there are also F3 and F4 metrics which give even more weight to recall. For all practical purposes, these metrics are also implemented in Prevision.io.
from sklearn.metrics import fbeta_score 
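To see how the weight on recall grows from F2 to F4, the F-beta score can be written directly in terms of confusion matrix counts: (1 + β²) × TP / ((1 + β²) × TP + β² × FN + FP). The sketch below uses that form; the function name and the counts are hypothetical.

```python
def f_beta_from_counts(tp, fp, fn, beta):
    """F-beta from confusion matrix counts; FN is weighted by beta squared."""
    beta_sq = beta ** 2
    denominator = (1 + beta_sq) * tp + beta_sq * fn + fp
    return (1 + beta_sq) * tp / denominator if denominator else 0.0


# Hypothetical counts: precision = 2/3, recall = 2/4.
tp, fp, fn = 2, 1, 2

scores = {beta: f_beta_from_counts(tp, fp, fn, beta) for beta in (1, 2, 3, 4)}
for beta, score in scores.items():
    print(f"F{beta} = {score:.4f}")
```

As β increases, each false negative costs β² times as much as a false positive, so the score moves from the F1 value toward the recall (0.5 here).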
MCC
MCC stands for Matthews Correlation Coefficient. It represents the confusion matrix of a model as a single number by combining the true positives, false positives, true negatives, and false negatives using the equation described below.
The same process used to determine an optimal threshold for converting probabilities to class labels is used to find the maximum MCC value. When this metric is chosen, the goal of Prevision.io is to keep increasing this maximum MCC.
Unlike metrics like Accuracy, MCC is a good scorer to use when the target variable is imbalanced. With imbalanced data, a high Accuracy can be obtained simply by predicting the majority class. Metrics like Accuracy and F1 can therefore be misleading, because they do not consider the relative size of the four confusion matrix categories. MCC, on the other hand, takes the proportion of each class into account. The MCC value ranges from -1 to 1, where -1 indicates a classifier that predicts the opposite class from the actual value, 0 means the classifier does no better than random guessing, and 1 indicates a perfect classifier.
MCC is defined as follows:
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Where:
TP = True Positive,
TN = True Negative,
FP = False Positive,
FN = False Negative. (if you have the slightest doubt about the understanding of these terms, I invite you to review the confusion matrix paragraph)
from sklearn.metrics import matthews_corrcoef 
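A dependency-free sketch of the MCC computed from 0/1 label lists. Returning 0.0 when the denominator is zero (for instance when the model predicts a single class) is a convention assumed here; the function name and toy data are hypothetical. With scikit-learn, `matthews_corrcoef(y_true, y_pred)` computes the same coefficient.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient from 0/1 labels."""
    tp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
    tn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 0)
    fp = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
    fn = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Imbalanced toy data: 1% positives and a model that always predicts the majority class.
y_true = [1] + [0] * 99
always_negative = [0] * 100

print(mcc(y_true, always_negative))            # no better than chance
print(mcc([0, 1, 1, 0], [0, 1, 1, 0]))         # perfect classifier
print(mcc([0, 1, 1, 0], [1, 0, 0, 1]))         # systematically inverted predictions
```

The always-negative model reaches 99% accuracy on this data but an MCC of 0, which reflects that it carries no information about the positive class.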
Lift@K
The lift @ k metric measures the over-concentration of positive cases obtained by the model in the top k% of scores, compared with the population as a whole.
Indeed, for certain problems, in particular marketing actions such as cross-sell or churn campaigns, the goal is not to optimize the ordering of all instances, as with the AUC metric, but to optimize it for a fraction (k%) of the total population.
For example: telemarketing call for the 5% top scores, sending a sponsorship code for the 20% top scores, …
At Prevision.io, the lift can be represented by decile, i.e. by dividing the population into increments of 10%, as shown in the following graph.
Lift per decile in Prevision.io
from scikitplot.metrics import plot_lift_curve 
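One common way to compute the lift at k is to divide the positive rate among the top k% of scores by the positive rate of the whole population. The sketch below follows that definition, which is an assumption and may differ in detail from the Prevision.io implementation; the function name, the rounding of the top-k size and the toy data are illustrative.

```python
def lift_at_k(y_true, y_score, k=0.1):
    """Positive rate in the top k fraction of scores divided by the overall positive rate."""
    n_top = max(1, int(round(k * len(y_true))))  # at least one observation in the top
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(y_true) / len(y_true)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positive observations")
    return top_rate / base_rate


# Toy data (hypothetical): 10 customers, 3 of them positive.
y_true = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
y_score = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

lift_20 = lift_at_k(y_true, y_score, k=0.2)
lift_10 = lift_at_k(y_true, y_score, k=0.1)
print(lift_20, lift_10)
```

A lift of about 1.67 at 20% means that contacting the top 20% of scores reaches positives about 1.67 times more often than contacting customers at random. A lift of 1 corresponds to no gain over random targeting.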
The following table summarizes the binary classification metrics discussed in this article.
Prevision.io Notation  
METRIC  Range  Lower is better  Weights accepted  3 STARS  2 STARS  1 STAR  0 STAR  Tips 
AUC  0 – 1  False  True  ]0.85 ; 1]  ]0.65 ; 0.85]  ]0.5 ; 0.65]  [0 ; 0.5]  Optimizes sort order of predictions 
LOGLOSS  0 – ∞  True  True  [0 ; 0.223[  [0.223 ; 0.693[  [0.693 ; +inf[  Optimizes probabilities  
ERROR RATE  0 – 1  True  True  [0 ; 0.125[  [0.125 ; 0.25[  [0.25 ; +inf[  Highly interpretable  
Accuracy  0 – 1  False  True  ]0.875 ; 1]  ]0.75 ; 0.875]  ]0.5 ; 0.75]  [0 ; 0.5]  Highly interpretable 
F1  0 – 1  False  True  ]0.85 ; 1]  ]0.65 ; 0.85]  ]0.5 ; 0.65]  [0 ; 0.5]  Equal weight on precision and recall 
MCC  -1 – 1  False  True  ]0.9 ; 1]  ]0.7 ; 0.9]  ]0.5 ; 0.7]  [-1 ; 0.5]  All classes are equally weighted 
Gini  0 – 1  False  True  ]0.85 ; 1]  ]0.65 ; 0.85]  ]0.5 ; 0.65]  [0 ; 0.5]  
F05  0 – 1  False  True  ]0.85 ; 1]  ]0.65 ; 0.85]  ]0.5 ; 0.65]  [0 ; 0.5]  
F2  0 – 1  False  True  ]0.85 ; 1]  ]0.65 ; 0.85]  ]0.5 ; 0.65]  [0 ; 0.5]  More weight on recall, less weight on precision 
F3  0 – 1  False  True  ]0.85 ; 1]  ]0.65 ; 0.85]  ]0.5 ; 0.65]  [0 ; 0.5]  Even more weight on recall, less weight on precision 
F4  0 – 1  False  True  ]0.85 ; 1]  ]0.65 ; 0.85]  ]0.5 ; 0.65]  [0 ; 0.5]  Still more weight on recall, less weight on precision 
Lift@K  0 – ∞  False  True  ]1 + 7 * (1-k) ; 1]  ]1 + 3 * (1-k) ; 1 + 7 * (1-k)]  ]1 + 1 * (1-k) ; 1 + 3 * (1-k)]  [0 ; 1 + 1 * (1-k)]  Particularly useful when you want to maximize the lift on the top X% of scores 
Binary Classification metrics summary table
Conclusion
In this article, we introduced binary classification metrics that can be used to evaluate the performance of machine learning models. We discussed:
the definitions and formulas of the main binary classification metrics,
their code implementation in Python,
the situations in which they are used,
a summary table of these metrics.
It should be noted that the metrics presented in this article are provided in our end-to-end machine learning platform Prevision.io. A free trial can be found here.