In this blog post series, we explore Machine Learning metrics, their impact on a model, and why they can be critically important from a business user's perspective.
To access the other articles, click below on the subject that interests you:

An introduction to Machine Learning metrics [LINK TO INTRODUCTION]

Binary classification metrics [LINK TO CLASSIFICATION METRICS]

Regression metrics [LINK TO REGRESSION METRICS]

Multi Classification metrics
Introduction
Multiclass classification refers to classification problems in machine learning that involve more than two classes. When evaluating and comparing machine learning algorithms on multiclass targets, performance metrics are extremely valuable. Many measures can be used to evaluate a multiclass classifier’s performance. These metrics prove beneficial at many stages of the development process, such as comparing the performance of two different models or analyzing the behavior of the same model when its parameters change. In this post, we go through a list of the most commonly used multiclass metrics, their benefits and drawbacks, and how they can be employed when building a classification model.
I would like to thank Abhishek Thakur for allowing us to reuse the implementation code for the following metrics. We invite you to read his excellent book, Approaching (Almost) Any Machine Learning Problem.
Accuracy
Accuracy is one of the most popular metrics in multiclass classification, and it is computed directly from the confusion matrix. Its numerator is the sum of the True Positive and True Negative elements, which in the multiclass case are the correctly classified instances on the diagonal of the confusion matrix. Its denominator is the sum of all the entries of the confusion matrix. [source]
Example of Confusion Matrix for MultiClass Classification in Prevision.io
def accuracy(y_true, y_pred): 
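A minimal sketch of how such an accuracy function could be written, assuming `y_true` and `y_pred` are equal-length sequences of class labels. The body is an illustration under that assumption, not the original implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty inputs")
    # Count the positions where the predicted label equals the true label.
    correct = sum(1 for yt, yp in zip(y_true, y_pred) if yt == yp)
    return correct / len(y_true)


# Illustrative three-class example: 4 of the 6 predictions are correct.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
print(accuracy(y_true, y_pred))
```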
Error rate
Error rate is deduced from the previous Accuracy metric. In fact, Error rate = 1 – Accuracy.
def error_rate(y_true, y_pred): 
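As a hedged sketch, the error rate can be computed by counting mismatches directly, which is equivalent to 1 – Accuracy. The input format (equal-length sequences of labels) is an assumption made for this example.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels (1 - accuracy)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Count the positions where the predicted label differs from the true label.
    wrong = sum(1 for yt, yp in zip(y_true, y_pred) if yt != yp)
    return wrong / len(y_true)


# Illustrative example: 2 of the 6 predictions are wrong.
print(error_rate([0, 1, 2, 2, 1, 0], [0, 2, 2, 2, 1, 1]))
```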
Multi Log Loss
Multi Log Loss, also known as logistic loss or cross-entropy loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label. So predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high loss value. A perfect model would have a log loss of 0.
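In a multiclass classification problem, we define the logarithmic loss function F in terms of the logarithmic loss function per label F_j as:
$$ F = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\ln(p_{ij}) = \sum_{j=1}^{M}\left(-\frac{1}{N}\sum_{i=1}^{N} y_{ij}\,\ln(p_{ij})\right) = \sum_{j=1}^{M} F_j $$
where:
 N is the number of instances,
 M is the number of different labels,
 y_ij is the binary variable encoding the expected labels (1 if instance i has label j, 0 otherwise),
 p_ij is the classification probability output by the classifier for the i-th instance and the j-th label.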
The cost function F measures the distance between two probability distributions, i.e. how similar the distribution of actual labels is to the classifier probabilities. Hence, values close to zero are preferred.
import numpy as np 
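A minimal NumPy sketch of the multiclass log loss, using N, M, y_ij and p_ij as defined above. It assumes labels are integers from 0 to M-1 and that `y_proba` has shape (N, M). The function name and the `eps` clipping parameter are choices made for this illustration, not the original implementation.

```python
import numpy as np


def multi_log_loss(y_true, y_proba, eps=1e-15):
    """Multiclass log loss for integer labels and an (N, M) probability matrix."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba, dtype=float)
    # Clip probabilities away from 0 and 1 so that log() stays finite.
    y_proba = np.clip(y_proba, eps, 1 - eps)
    n = y_true.shape[0]
    # y_ij is one-hot, so only the probability given to the true label contributes.
    true_class_proba = y_proba[np.arange(n), y_true]
    return float(-np.mean(np.log(true_class_proba)))


# Illustrative example with three classes.
y_true = [0, 1, 2]
y_proba = [[0.7, 0.2, 0.1],
           [0.1, 0.8, 0.1],
           [0.2, 0.2, 0.6]]
print(multi_log_loss(y_true, y_proba))
```

Unlike some library implementations, this sketch does not renormalize each row of probabilities to sum to 1.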
F1 Score
In statistical analysis of binary classification, the F-score or F-measure is a measure of a test’s accuracy. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all positive results, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification.
The F1 score is the harmonic mean of precision and recall. The more generic F_β score applies additional weights, valuing one of precision or recall more than the other.
The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero. [source]
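Written out, with P for precision and R for recall (symbols introduced here for brevity), the standard definitions are:

```latex
F_1 = 2 \cdot \frac{P \cdot R}{P + R}
\qquad
F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}
```

Setting β = 1 recovers the F1 score. A value of β greater than 1 gives more weight to recall, and a value smaller than 1 gives more weight to precision.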
Precision / Recall definitions [source]
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.
There are several versions of the F1 score depending on the expected granularity.
 micro: Calculate metrics globally by counting the total true positives, false negatives and false positives.
 macro: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
 weighted: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
def true_positive(y_true, y_pred): 
import numpy as np 
import numpy as np 
from collections import Counter 
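The three averaging modes described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the original implementation: the helper names `_f1` and `per_class_counts` are hypothetical, each label is treated one-vs-rest, and a label with no true positives, false positives or false negatives gets an F1 of 0.

```python
from collections import Counter


def _f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)}, treating each label one-vs-rest."""
    counts = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def f1_score(y_true, y_pred, average="macro"):
    counts = per_class_counts(y_true, y_pred)
    if average == "micro":
        # Pool the counts over all labels, then compute a single F1.
        tp = sum(c[0] for c in counts.values())
        fp = sum(c[1] for c in counts.values())
        fn = sum(c[2] for c in counts.values())
        return _f1(tp, fp, fn)
    per_class = {label: _f1(*c) for label, c in counts.items()}
    if average == "macro":
        # Unweighted mean of the per-label F1 scores.
        return sum(per_class.values()) / len(per_class)
    if average == "weighted":
        # Mean of the per-label F1 scores weighted by support.
        support = Counter(y_true)
        return sum(per_class[l] * support[l] for l in per_class) / len(y_true)
    raise ValueError("average must be 'micro', 'macro' or 'weighted'")


y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
for mode in ("micro", "macro", "weighted"):
    print(mode, f1_score(y_true, y_pred, average=mode))
```

In this example every label has the same support, so the macro and weighted scores coincide. With single-label predictions, the micro score equals the accuracy.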
AUC
As we saw in the article Classification Metrics: [ADD LINK TO BINARY CLASSIFICATION POST], AUC (Area Under the ROC Curve), which measures the probability that a positive instance has a higher score than a negative instance, is a well-known performance metric for a scoring function’s ranking quality. AUC often comes up as a more appropriate performance metric than accuracy in various applications due to its appealing properties, e.g., insensitivity toward label distributions and costs. [source]
In 2001, David J. Hand & Robert J. Till proposed a simple generalization of the Area Under the ROC Curve for Multiple Class Classification Problems. [source]
AUC values range from 0 to 1:
 AUC = 1 implies you have a perfect model. Most of the time, it means that you made a mistake with validation and should revisit your data processing and holdout pipeline. If you didn’t make any mistakes, then congratulations, you have the best model one can have for the dataset you built it on.
 AUC = 0 implies that your model is very bad (or very good!). Try inverting the probabilities for the predictions: for example, if your probability for the positive class is p, try substituting it with 1 - p. This kind of AUC may also mean that there is some problem with your validation or data processing.
 AUC = 0.5 implies that your predictions are random. So, for any binary classification problem, if I predict all targets as 0.5, I will get an AUC of 0.5.
 AUC values between 0 and 0.5 imply that your model is worse than random. Most of the time, it’s because you inverted the classes. If you invert your predictions, your AUC might become more than 0.5. AUC values closer to 1 are considered good.
from sklearn import metrics 
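To make the Hand & Till generalization concrete, here is a plain-Python sketch that averages pairwise AUCs over all pairs of classes. It assumes labels are integers from 0 to M-1 and that `y_proba[i][c]` is the predicted probability of class c for instance i. The function names are hypothetical, and the quadratic pairwise comparison is chosen for readability, not speed.

```python
from itertools import combinations


def binary_auc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def hand_till_auc(y_true, y_proba):
    """Average of the pairwise AUCs A(a, b) over all unordered class pairs."""
    classes = sorted(set(y_true))
    if len(classes) < 2:
        raise ValueError("at least two classes must be present in y_true")
    pair_aucs = []
    for a, b in combinations(classes, 2):
        # A(a|b): members of class a vs members of class b, ranked by P(class a).
        a_given_b = binary_auc(
            [p[a] for t, p in zip(y_true, y_proba) if t == a],
            [p[a] for t, p in zip(y_true, y_proba) if t == b],
        )
        # A(b|a): the same comparison, ranked by P(class b).
        b_given_a = binary_auc(
            [p[b] for t, p in zip(y_true, y_proba) if t == b],
            [p[b] for t, p in zip(y_true, y_proba) if t == a],
        )
        pair_aucs.append((a_given_b + b_given_a) / 2)
    return sum(pair_aucs) / len(pair_aucs)


y_true = [0, 0, 1, 1]
y_proba = [[0.9, 0.1], [0.4, 0.6], [0.35, 0.65], [0.8, 0.2]]
print(hand_till_auc(y_true, y_proba))
```

In practice, scikit-learn's `roc_auc_score` with `multi_class="ovo"` computes a one-vs-one average of this kind and is likely the better choice outside of a teaching example.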
Quadratic Weight Kappa (QWKP)
Quadratic Weighted Kappa is also called weighted Cohen’s kappa.
Quadratic Weighted Kappa measures the agreement between two ratings. This metric typically varies from 0 (random agreement between raters) to 1 (complete agreement between raters). In the event that there is less agreement between the raters than expected by chance, the metric may go below 0. The quadratic weighted kappa is calculated between the scores which are expected/known and the predicted scores. [source]
Take the example of a multiclass problem with N classes. The quadratic weighted kappa is calculated as follows. First, an N x N histogram matrix O is constructed, such that O_{i,j} corresponds to the number of records that have a rating of i (actual) and received a predicted rating j. An N-by-N matrix of weights, w, is calculated based on the difference between actual and predicted rating scores.
An N-by-N histogram matrix of expected ratings, E, is calculated, assuming that there is no correlation between rating scores. This is calculated as the outer product between the actual rating’s histogram vector of ratings and the predicted rating’s histogram vector of ratings, normalized such that E and O have the same sum.
From these three matrices, the quadratic weighted kappa is calculated.
 First, create a multiclass confusion matrix O between predicted and actual ratings.
 Second, construct a weight matrix w which calculates the weight between the actual and predicted ratings.
 Third, calculate value_counts() for each rating in preds and actuals.
 Fourth, calculate E, which is the outer product of the two value_counts vectors.
 Fifth, normalize the E and O matrices.
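The five steps above can be sketched with NumPy as follows. This is a sketch under assumptions rather than a reference implementation: ratings are taken to be integers from 0 to N-1, the weights use the common quadratic form w_ij = (i - j)^2 / (N - 1)^2, and the final score is 1 - sum(w * O) / sum(w * E). The function name and the `n_classes` parameter are hypothetical.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa for integer ratings in 0..n_classes-1."""
    # Step 1: observed matrix O, where O[i, j] counts actual i predicted as j.
    O = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # Step 2: quadratic weights, 0 on the diagonal and 1 at the largest distance.
    idx = np.arange(n_classes)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Step 3: histograms of actual and predicted ratings (the value counts).
    hist_true = O.sum(axis=1)
    hist_pred = O.sum(axis=0)
    # Step 4: expected matrix E under independence, as an outer product.
    E = np.outer(hist_true, hist_pred)
    # Step 5: rescale E so that E and O have the same sum.
    E = E * O.sum() / E.sum()
    return float(1.0 - (w * O).sum() / (w * E).sum())


# Identical ratings give 1.0; fully reversed ratings give -1.0 in this example.
print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], n_classes=3))
print(quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], n_classes=3))
```

The sketch does not handle the degenerate case where every actual and predicted rating is the same single class, since the denominator is then zero. scikit-learn's `cohen_kappa_score` with `weights="quadratic"` is a library alternative.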
Interpreting the Quadratic Weighted Kappa Metric
 A weighted Kappa is a metric used to measure the similarity between predictions and actuals. A perfect score of 1.0 is obtained when the predictions and actuals are identical.
 The lowest possible score is -1, which is obtained when the predictions are furthest away from the actuals.
 The aim is to get as close to 1 as possible. Generally, a score of 0.6+ is considered to be a really good score.
from sklearn import metrics 
MAP@K
The MAP@K metric measures the AP@K (average precision at K) for the recommendations shown to different users and averages it over all queries in the dataset. MAP@K is one of the most commonly used metrics for evaluating recommender systems.
MAP@K values range from 0 to 1, with 1 being the best.
The mean average precision (mAP) of a set of queries is defined by Wikipedia as follows:
$$ \mathrm{MAP} = \frac{\sum_{q=1}^{Q} \mathrm{AveP}(q)}{Q} $$
Mean average precision formula, as given by Wikipedia
where:
 Q is the number of queries in the set
 AveP(q) is the average precision (AP) for a given query q.
What the formula is essentially telling us is that, for a given query, q, we calculate its corresponding AP, and then the mean of all these AP scores would give us a single number, called the mAP, which quantifies how good our model is at performing the query.
def pk(y_true, y_pred, k): 
def mapk(y_true, y_pred, k): 
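A minimal sketch of these functions, assuming that for each query `y_true` is a list of relevant items (a single label in the multiclass case) and `y_pred` is a ranked list of predictions without duplicates. The intermediate `apk` helper and its normalization by min(number of relevant items, k) follow one common convention and are assumptions of this example; other definitions of AP@K exist.

```python
def pk(y_true, y_pred, k):
    """Precision at k for one query: share of the top-k predictions that are relevant."""
    top_k = list(y_pred)[:k] if k > 0 else []
    if not top_k:
        return 0.0
    relevant = set(y_true)
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)


def apk(y_true, y_pred, k):
    """Average precision at k: P@i summed over the ranks i that hold a relevant item."""
    relevant = set(y_true)
    if not relevant or k <= 0:
        return 0.0
    score = 0.0
    for i, item in enumerate(list(y_pred)[:k], start=1):
        if item in relevant:
            score += pk(y_true, y_pred, i)
    # Normalize by the best achievable number of hits within the top k.
    return score / min(len(relevant), k)


def mapk(y_true, y_pred, k):
    """Mean of AP@k over all queries."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must contain the same number of queries")
    if len(y_true) == 0:
        return 0.0
    return sum(apk(t, p, k) for t, p in zip(y_true, y_pred)) / len(y_true)


# Three queries whose true class is ranked first, second and third respectively.
y_true = [[1], [2], [0]]
y_pred = [[1, 0, 2], [0, 2, 1], [1, 2, 0]]
print(mapk(y_true, y_pred, k=3))
```

With a single true label per query, AP@K reduces to 1/rank of the true label when it appears in the top K, and 0 otherwise.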
Multi Classifications metrics summary table
The star columns give the Prevision.io notation.

| Metric | Range | Lower is better | Weights accepted | 3 Stars | 2 Stars | 1 Star | 0 Star | Tips |
|---|---|---|---|---|---|---|---|---|
| Multi LogLoss | 0 – ∞ | True | True | [0 ; 0.223[ | [0.223 ; 0.693[ | [0.693 ; +inf[ | – | Optimizes probabilities |
| Macro F1 | 0 – 1 | False | True | ]0.85 ; 1] | ]0.65 ; 0.85] | ]0.5 ; 0.65] | [0 ; 0.5] | Equal weight on precision and recall |
| Macro AUC | 0 – 1 | False | True | ]0.85 ; 1] | ]0.65 ; 0.85] | ]0.5 ; 0.65] | [0 ; 0.5] | Optimizes sort order of predictions |
| Macro Accuracy | 0 – 1 | False | True | ]0.857 ; 1] | ]0.75 ; 0.857] | ]0.5 ; 0.75] | [0 ; 0.5] | Highly interpretable |
| Quadratic Kappa | -1 – 1 | False | True | ]0.8 ; 1] | ]0.6 ; 0.8] | ]0.2 ; 0.6] | [-1 ; 0.2] | |
| MAP@K | 0 – 1 | False | False | ]0.875 ; 1] | ]0.75 ; 0.875] | ]0.5 ; 0.75] | [0 ; 0.5] | |
Conclusion
We have introduced the multiclass classification metrics implemented in Prevision.io.
In this article we have seen:
 the main multiclass classification metrics,
 their code implementation in Python,
 the situations in which they are used,
 a summary table of these metrics.