So far in this blog post series, we have discussed the importance of choosing the right metrics for machine learning models and presented common metrics used in binary classification. In this article, we present common metrics for evaluating regression tasks.
To access the other articles, click below on the subject that interests you:
 An introduction to Machine Learning metrics [LINK TO INTRODUCTION]
 Binary Classification metrics [LINK TO CLASSIFICATION METRICS]
 Regression metrics
 Multi Classification metrics [LINK TO MULTI CLASSIFICATION METRICS]
Introduction
Regression refers to predictive modeling problems that involve predicting a numeric value, as opposed to classification, which involves predicting a class label. Because of this, you cannot use classification accuracy to evaluate the predictions made by a regression model.
Instead, you must use error metrics specifically designed for evaluating predictions made on regression problems.
In this article, you will discover how to calculate error metrics for regression predictive modeling projects.
I would like to thank Abhishek Thakur for allowing us to reuse the implementation code for the metrics discussed in this article. We invite you to read his excellent book Approaching (Almost) Any Machine Learning Problem.
Mean Square Error (MSE)
Mean Square Error (MSE) is defined as the mean (average) of the squares of the differences between actual and predicted values. MSE takes the distances from the points to the regression line (these distances are the “errors”) and squares them, which removes any negative signs. The MSE score incorporates both the variance and the bias of the predictor. Squaring also gives more weight to larger differences: the bigger the error, the more it is penalized.
MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

where yᵢ are the observations, ŷᵢ the predicted values and n the number of observations.
import numpy as np

def mean_squared_error(y_true, y_pred):
    # mean of the squared differences between actual and predicted values
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)
Root Mean Square Error (RMSE)
The Root Mean Squared Error (RMSE) is the square root of the mean of the squares of all the errors. The use of RMSE is very common, and it is considered an excellent general-purpose error metric for numerical predictions.

RMSE = √((1/n) Σᵢ (Oᵢ − Sᵢ)²)

where:
 Oi are the observations,
 Si predicted values of a variable,
 n the number of observations available for analysis.
The RMSE score is a good measure of accuracy, but only for comparing the prediction errors of different models or model configurations for a particular variable, not between variables, as it is scale-dependent.
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    # square root of the mean squared error
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
Root Mean Square Logarithmic Error (RMSLE)
The RMSLE measures the ratio between actual and predicted values. It is the Root Mean Squared Error of the log-transformed predicted and log-transformed actual values. RMSLE adds 1 to both actual and predicted values before taking the natural logarithm, to avoid taking the log of possible 0 (zero) values.
Because you take the log of the predictions and actual values, what changes is essentially the variance that you are measuring. I believe RMSLE is usually used when you don’t want to penalize huge differences between predicted and actual values when both are huge numbers.

RMSLE = √((1/n) Σᵢ (log(yᵢ + 1) − log(ŷᵢ + 1))²)

where:
 yi are the real values,
 ŷi predicted values of a variable,
 n the number of observations available for analysis.
import numpy as np

def root_mean_squared_log_error(y_true, y_pred):
    # RMSE computed on log(1 + x) transformed values
    log_true = np.log1p(np.asarray(y_true))
    log_pred = np.log1p(np.asarray(y_pred))
    return np.sqrt(np.mean((log_true - log_pred) ** 2))
Root Mean Square Percentage Error (RMSPE)
The RMSPE, mostly used as an advanced forecasting metric, is defined by the following equation:

RMSPE = √((1/n) Σᵢ ((yᵢ − ŷᵢ) / yᵢ)²)

where:
 yi are the observations,
 ŷi predicted values of a variable,
 n the number of observations available for analysis.
import numpy as np

def root_mean_squared_percentage_error(y_true, y_pred):
    # RMSE of the relative errors; y_true must not contain zeros
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))
Mean Absolute Error (MAE)
The MAE score measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s the average over the test sample of the absolute differences between prediction and actual observation where all individual differences have equal weight.
The mean absolute error uses the same scale as the data. It is therefore a scale-dependent accuracy measure and cannot be used to make comparisons between series that use different scales.

MAE = (1/n) Σⱼ |yⱼ − ŷⱼ|

where:
 yj are the observations,
 ŷj predicted values of a variable,
 n the number of observations available for analysis.
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # mean of the absolute differences between actual and predicted values
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
Mean Absolute Percentage Error (MAPE)
The MAPE score measures the accuracy of a forecasting method in statistics, for example in trend estimation. It is also used as a loss function for regression problems in machine learning. MAPE usually expresses accuracy as a percentage:

MAPE = (1/n) Σᵢ |(yᵢ − ŷᵢ) / yᵢ|
As noted on Wikipedia, although the concept of MAPE sounds very simple and convincing, it has major drawbacks in practical application, and there are many studies on its shortcomings and the misleading results it can produce:
 It cannot be used if there are zero values (which happens frequently in demand data for example) because there would be a division by zero.
 For forecasts which are too low the percentage error cannot exceed 100%, but for forecasts which are too high there is no upper limit to the percentage error.
 MAPE puts a heavier penalty on negative errors than on positive errors. As a consequence, when MAPE is used to compare the accuracy of prediction methods, it is biased: it will systematically select a method whose forecasts are too low. This little-known but serious issue can be overcome by using an accuracy measure based on the logarithm of the accuracy ratio (the ratio of the predicted value to the actual value). This approach has superior statistical properties and leads to predictions that can be interpreted in terms of the geometric mean.
To overcome these issues with MAPE, other measures have been proposed in the literature, such as the Symmetric Mean Absolute Percentage Error (SMAPE) presented later.
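The bounded-below/unbounded-above drawback is easy to see numerically. A small sketch (the values here are made up):

```python
actual = 100.0

# a forecast can never be more than 100% too low...
worst_low = abs(actual - 0.0) / actual      # forecast of 0 gives 100% error

# ...but there is no upper limit when it is too high
very_high = abs(actual - 1000.0) / actual   # forecast of 1000 gives 900% error

print(worst_low, very_high)
```

This asymmetry is what biases MAPE toward methods that systematically forecast too low.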
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    # mean of the absolute relative errors; y_true must not contain zeros
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true))
Median Absolute Error (MedAE)
Median absolute error output is a non-negative floating point value, with an optimal value of 0. The median absolute error is particularly interesting because it is robust to outliers. The loss is calculated by taking the median of all absolute differences between the target and the prediction. If ŷᵢ is the predicted value of the i-th sample and yᵢ is the corresponding true value, then the median absolute error estimated over n samples is defined as follows:

MedAE(y, ŷ) = median(|y₁ − ŷ₁|, …, |yₙ − ŷₙ|)

where:
 yᵢ are the observations,
 ŷᵢ predicted values of a variable,
 n the number of observations available for analysis.
import numpy as np

def median_absolute_error(y_true, y_pred):
    # median of the absolute differences; robust to outliers
    return np.median(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
Symmetric Mean Absolute Percentage Error (SMAPE)
SMAPE stands for Symmetric Mean Absolute Percentage Error. This metric is an accuracy measure based on percentage (or relative) errors:

SMAPE = (1/n) Σᵢ 2 |Fᵢ − Aᵢ| / (|Aᵢ| + |Fᵢ|)

where:
 Aᵢ is the actual value,
 Fᵢ is the forecast value,
 n the number of observations available for analysis.
In contrast to the Mean Absolute Percentage Error, SMAPE has both a lower bound and an upper bound: the formula above yields a result between 0% and 200%. However, a percentage error between 0% and 100% would be much easier to interpret.
A limitation of SMAPE is that if the actual value or the forecast value is 0, the error jumps to the upper limit (200% for the formula above; 100% for the variant that halves the denominator).
import numpy as np

def symmetric_mean_absolute_percentage_error(y_true, y_pred):
    # absolute error scaled by the combined magnitude of actual and predicted values
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))
R2 (R squared)
R² tells us how good our regression model is compared to a very simple baseline model that always predicts the mean of the target values from the train set.
R² = 1 − SS_RES / SS_TOT

where:
 SS_RES = Σᵢ (yᵢ − ŷᵢ)² is the sum of the squared distances between the actual points and the predicted points on the best-fit line,
 SS_TOT = Σᵢ (yᵢ − ȳ)² is the sum of the squared distances between the actual points and the mean of all the points,
 yᵢ are the observations,
 ŷᵢ predicted values,
 ȳ the mean of the observations.
import numpy as np

def r2(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot
Regression metrics summary table
The star columns below correspond to the Prevision.io notation (VAR and STD are the variance and standard deviation of the target).

| Metric | Range | Lower is better | Weights accepted | 3 stars | 2 stars | 1 star | 0 star | Sensitive to outliers | Tips |
|---|---|---|---|---|---|---|---|---|---|
| MSE | 0 – +∞ | True | True | [0 ; 0.01·VAR[ | [0.01·VAR ; 0.1·VAR[ | [0.1·VAR ; VAR[ | [VAR ; +∞[ | Yes | |
| RMSE | 0 – +∞ | True | True | [0 ; 0.1·STD[ | [0.1·STD ; 0.3·STD[ | [0.3·STD ; STD[ | [STD ; +∞[ | Yes | |
| MAE | 0 – +∞ | True | True | [0 ; 0.1·STD[ | [0.1·STD ; 0.3·STD[ | [0.3·STD ; STD[ | [STD ; +∞[ | No | |
| MAPE | 0 – +∞ | True | True | [0 ; 0.1[ | [0.1 ; 0.3[ | [0.3 ; 1[ | [1 ; +∞[ | No | Use when target values are across different scales |
| RMSLE | 0 – +∞ | True | True | [0 ; 0.095[ | [0.095 ; 0.262[ | [0.262 ; 0.693[ | [0.693 ; +∞[ | Yes | |
| RMSPE | 0 – +∞ | True | True | [0 ; 0.1[ | [0.1 ; 0.3[ | [0.3 ; 1[ | [1 ; +∞[ | Yes | Use when target values are across different scales |
| SMAPE | 0 – +∞ | True | True | [0 ; 0.1[ | [0.1 ; 0.3[ | [0.3 ; 1[ | [1 ; +∞[ | No | Use when target values are close to 0 |
| R2 | −∞ – 1 | False | True | ]0.9 ; 1] | ]0.7 ; 0.9] | ]0.5 ; 0.7] | ]−∞ ; 0.5] | No | Use when you want performance scaled between 0 and 1 |
Conclusion
We have introduced the main regression metrics, in particular those implemented in Prevision.io.
In this article we have seen:
 the main regression metrics,
 their code implementation in Python
 in which situations they are used,
 a summary table of these metrics
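To see how several of these metrics behave on the same predictions, here is a small sketch (the data is made up):

```python
import numpy as np

# made-up data: actual values and a model's predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.0, 6.5, 5.0])

errors = y_true - y_pred

mse  = np.mean(errors ** 2)              # Mean Square Error
rmse = np.sqrt(mse)                      # Root Mean Square Error
mae  = np.mean(np.abs(errors))           # Mean Absolute Error
mape = np.mean(np.abs(errors / y_true))  # Mean Absolute Percentage Error
r2   = 1 - np.sum(errors ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} MAPE={mape:.3f} R2={r2:.3f}")
```

Note how MSE and RMSE are on the target's (squared) scale while MAPE and R² are scale-free, which is exactly the distinction the summary table captures.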
Example of gradient descent. Courtesy of towardsdatascience.com
This example showed why it is important for our objective function to be differentiable.
Now, let’s imagine we add a regularization term to our linear regression. We can use Ridge Regression: the predictions are still Ŷᵢ = a·x1ᵢ + b·x2ᵢ + c, but the objective function gains a penalty term λ(a² + b²). We want to choose the best λ through hyperparameter optimization. For simplicity we will try λ = 1, 2, 3:
 fix λ = 1,
 repeat the above process to determine a, b and c (with the objective function),
 evaluate metric(Y, Ŷ, λ=1) with the metric we want; it can be anything.
We do the same thing with λ = 2 and λ = 3 and choose the value that gives the best result from the metric’s perspective.
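A minimal sketch of that selection loop, using made-up data and a closed-form ridge solution (no intercept, for brevity; the metric is evaluated on the training set here, whereas in practice you would use a held-out set):

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: two features, linear target plus a little noise
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

def ridge_fit(X, y, lam):
    # closed-form ridge solution: (X^T X + lam * I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# try each candidate lambda and keep the one with the best (lowest) metric
scores = {lam: mse(y, X @ ridge_fit(X, y, lam)) for lam in (1, 2, 3)}
best_lam = min(scores, key=scores.get)
print(best_lam, scores)
```

Any of the metrics from this article could be swapped in for `mse` here; the hyperparameter search logic stays the same.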
Why do we need metrics?
As we saw above, different models can be evaluated thanks to metrics. Metrics are more flexible than objective functions and should be business oriented. The advantage is that you will be able to communicate with less technical audiences. Never forget that the end user of your model is generally not you, and they must be able to understand why your model performs well. Once your models have been trained, you can use different metrics to present them. For example, for a classification problem, you could show the true positive rate, the false positive rate, and so on. If you are building an algorithm to determine whether someone has COVID, the true positive and false negative rates can be very important.
Metrics can also be used to calculate expected gain or ROI. If you can evaluate the gain / loss attached to the outcomes of your model, it can help you to show the monetary value of your work, as you can see here with a cost matrix alongside a confusion matrix:
cost and confusion matrix
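As a sketch of that calculation (the matrices and monetary amounts below are entirely hypothetical), the expected gain is just the element-wise product of the two matrices, summed:

```python
import numpy as np

# hypothetical confusion matrix over 1000 customers: rows = actual, cols = predicted
#                     pred_neg  pred_pos
confusion = np.array([[850,      50],     # actual negative
                      [ 30,      70]])    # actual positive

# hypothetical gain/cost of each outcome, in euros
#                  pred_neg  pred_pos
gains = np.array([[  0.0,     -5.0],     # false positive: wasted special offer
                  [-40.0,     30.0]])    # false negative loses a customer, true positive retains one

# expected monetary value of deploying the model on these customers
expected_gain = np.sum(confusion * gains)
print(expected_gain)
```

This turns an abstract confusion matrix into a single monetary figure that a business audience can act on.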
How to choose a good metric ?
The first question you have to ask yourself is: why am I building this machine learning model, and in which context will it be used? The algorithms we build are usually used in a very well defined work environment. Let’s take a (very basic) example.
You work in a telecommunications firm. The marketing team wants to tackle the issue of churn in your company. They have noticed that customers seem to leave the company and want to prevent it. One idea is to start a customized email campaign with a special offer for each customer. If we had an unlimited budget, we could send a personal email to every customer in our database, and it would be just great, wouldn’t it? Well, we don’t have unlimited funds and resources, so we want to target the customers who have the highest probability of churning. This is when you get to show your magic. Now back to metrics. We are dealing with a classification problem, so you will probably use log loss as an objective function. But log loss is not very business oriented, and I can see the long faces from here when you say “hey, I have a log loss of 0.12, let’s put my model in production”.
One question you should ask the marketing team is: “is it worse to miss a churner or to consider a churner as a non-churner?” In other words, is it better to optimize recall or precision, which will lead you to choose a threshold, as we will see in another blog post on classification metrics.
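To make the recall/precision trade-off concrete, here is a small sketch (the churn probabilities and labels are made up) computing both at different decision thresholds:

```python
# made-up churn probabilities from a model, with true labels (1 = churner)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   1,   0,   0]

def precision_recall(scores, labels, threshold):
    # predict "churner" whenever the score reaches the threshold
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.25, 0.5, 0.75):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold catches more churners (higher recall) at the cost of contacting more non-churners (lower precision), which is exactly the question to settle with the marketing team.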
This reasoning can be applied to any machine learning use case. You have to discuss with the people who will actually use your model, and ask questions in a business oriented manner.
Now that we have covered an overview of what a machine learning metric is and how it differs from an objective function, we will go deeper into typical metrics for regression, classification and multi classification use cases and see which ones are best for you.