Machine Learning & Signals Learning
7 Regression Losses and Metrics
For regression, loss and metrics are often the same quantity.
7.1 Preface
The range of the predictions \(\hat {y}_i\) is determined by the particular model applied.
Metric A performance metric is a function \(J:\hat {\by },\by \rightarrow \mathscr {R}\) that is used to evaluate and quantify the effectiveness of a model. These metrics provide insight into how well a model is performing according to various aspects of the data it predicts.
Loss function The loss function is a metric of the form \(L:\hat {\by },\by \rightarrow \mathscr {R}\) that is calculated over the training set, such that the optimal parameters corresponding to the minimum loss
\(\seteqnumber{0}{}{0}\)\begin{equation} \bth = \arg \min _{\bth } \loss (\hat {\by },\by ) \end{equation}
can be evaluated (Sec. 4.4.3).
A metric is not necessarily a loss function, and a loss function is not necessarily a metric.
For example, cross-entropy loss in classification is not a metric, and \(R^2\) is not a loss.
Summary Metrics are for communication; losses are for training. A loss is the objective minimized during training, while a metric is reported to summarize performance. They may coincide (e.g., MSE as both loss and metric), but need not.
7.2 Loss Function Properties
A loss function has a few desired properties that are presented below. While not obligatory, these properties may significantly ease the evaluation of the parameters \(\bm {\theta }\). In the following, only the basic description of these properties is provided, sacrificing mathematical rigor for brevity.
Continuity: Single unbroken (without jumps) curve for all possible input values, i.e., without discontinuities.
Lipschitz continuity: A formal, stricter continuity requirement that limits the rate of change of the function. A real-valued function \(f(\cdot ):\mathcal {R}\rightarrow \mathcal {R}\) is called Lipschitz continuous if there exists a positive real constant \(K\) such that, for all real \(x_1\) and \(x_2\),
\(\seteqnumber{0}{}{1}\)\begin{equation} \abs {f(x_1) - f(x_2)} \le K \abs {x_1 - x_2}. \end{equation}
This feature formally limits the maximum gradient values of a loss function.
Differentiability: A differentiable function of one real variable is a function whose derivative exists at each point in its domain.
• If the function is differentiable, it is also continuous.
• If the derivative is bounded, the function is also Lipschitz continuous.
This property is particularly important in NNs, which are based on the back-propagation principle.
Convexity and strict convexity: Each chord lies on or above the graph between its two endpoints (Fig. 7.1). Formally, it means that the line between \(\left (x_1,f(x_1)\right )\) and \(\left (x_2,f(x_2)\right )\) lies on or above the graph of the function \(f(x),x_1\le x \le x_2\). Mathematically, for all \(0\le t \le 1\) and \(\forall x_1,x_2\in \real \),
\(\seteqnumber{0}{}{2}\)\begin{equation} f\left (tx_1+(1-t)x_2\right )\le tf(x_1) + (1-t)f(x_2) \end{equation}
A twice-differentiable function of a single variable is convex if and only if its second derivative is nonnegative on its entire domain. For strict convexity, the second derivative must be positive everywhere. In the context of loss function properties, convexity guarantees that any local minimum is also the global minimum.
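The chord inequality above is easy to verify numerically. The following sketch (NumPy; purely illustrative) checks it for the strictly convex function \(f(x)=x^2\):

```python
import numpy as np

# Check the chord (convexity) inequality
#   f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)
# for the strictly convex function f(x) = x**2.
f = lambda x: x**2
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(-10, 10, size=2)
t = np.linspace(0.0, 1.0, 101)
lhs = f(t * x1 + (1 - t) * x2)      # function evaluated along the segment
rhs = t * f(x1) + (1 - t) * f(x2)   # the chord between the two endpoints
assert np.all(lhs <= rhs + 1e-12)   # the chord lies on or above the graph
```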
7.3 Losses
Per-sample error notation is \(e_i = y_i - \hat {y}_i, i=1,\ldots ,M\). The corresponding vector notation is \(\be = \by - \hat {\by }\). Note that the order of subtraction does not matter in the context of the following material.
Some of the following losses are also used as metrics.
7.3.1 Mean-squared error (MSE)
• Continuous, differentiable, convex
MSE loss (also termed L2-loss) and metric is
\(\seteqnumber{0}{}{3}\)\begin{equation} \begin{aligned} J(\by ,\hat {\by }) &= \frac {1}{M}\norm {\by - \hat {\by }}^2=\frac {1}{M}\be ^T\be \\ &= \frac {1}{M}\sum _{i=1}^M (y_i -\hat {y}_i)^2 = \frac {1}{M}\sum _{i=1}^M e_i^2 \end {aligned} \end{equation}
When used as a loss, a factor of \(\frac {1}{2}\) is sometimes applied to “compensate” for the factor of 2 arising in the derivative,
\(\seteqnumber{0}{}{4}\)\begin{equation} \loss (\by ,\hat {\by }) = \frac {1}{2M}\sum _{i=1}^M e_i^2 \end{equation}
Sum-of-squared error (SSE) (2.4) is also used.
Important properties:
• Popular regression loss.
• Popular metric.
• Analytical gradient that is error-dependent.
• Sometimes an analytical solution is available, e.g., the normal equation.
• The main drawback is inherent outlier sensitivity.
7.3.2 RMSE
• Continuous, differentiable, convex
A convenient complementary metric to MSE is the root-MSE (RMSE),
\(\seteqnumber{0}{}{5}\)\begin{equation} \begin{aligned} J(\by ,\hat {\by }) &= \frac {1}{\sqrt {M}}\norm {\by - \hat {\by }}\\ &= \sqrt {\frac {1}{M}\sum _{i=1}^M (y_i -\hat {y}_i)^2} = \sqrt {\frac {1}{M}\sum _{i=1}^M e_i^2} \end {aligned} \end{equation}
• Theoretically, RMSE can also be used as a loss, but it is very similar to MSE, differing only in its gradients.
• Easier human interpretation, since it has the same units as \(y\).
7.3.3 Mean absolute error (MAE)
MAE is used both as loss and metric.
\(\seteqnumber{0}{}{6}\)\begin{equation} \begin{aligned} \loss (\by ,\hat {\by }) &= \frac {1}{M}\sum _{i=1}^M \abs {y_i -\hat {y}_i} = \frac {1}{M}\sum _{i=1}^M \abs {e_i} \end {aligned} \end{equation}
Important properties:
• Popular loss and metric.
• All errors are weighted equally; MAE is therefore less sensitive to outliers than MSE.
• Error-independent gradient magnitude that may result in slower convergence under certain conditions. In particular, the gradient is high even for very small errors. To fix this, a dynamic learning rate that decreases as we move closer to the minimum is required. MSE behaves nicely in this case and converges even with a fixed learning rate: the gradient of the MSE loss is high for larger loss values and decreases as the loss approaches 0, making it more precise at the end of training.
• Non-differentiable at \(e_i = 0\), though without dramatic influence on most learning algorithms.
A brief numerical example of MSE, RMSE and MAE is presented in Table 7.1 and in Fig. 7.2.
| True Values | Predicted Values | MSE | RMSE | MAE |
| (30, 25) | (40, 30) | \(\frac {(40-30)^2}{2} + \frac {(30-25)^2}{2}=62.5\) | \(\sqrt {62.5}=7.91\) | \(\frac {\abs {40-30}}{2} + \frac {\abs {30-25}}{2}=7.5\) |
| (30, 25) | (50, 30) | \(\frac {(50-30)^2}{2} + \frac {(30-25)^2}{2}=212.5\) | \(\sqrt {212.5}=14.6\) | \(\frac {\abs {50-30}}{2} + \frac {\abs {30-25}}{2}=12.5\) |
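The values in Table 7.1 can be reproduced with a few lines of NumPy. This is an illustrative sketch; the function names are ours, not from the text:

```python
import numpy as np

def mse(y, y_hat):
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.mean(e**2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(e))

y_true = [30, 25]
print(mse(y_true, [40, 30]))    # 62.5
print(rmse(y_true, [40, 30]))   # ~7.91
print(mae(y_true, [40, 30]))    # 7.5
print(mse(y_true, [50, 30]))    # 212.5
print(rmse(y_true, [50, 30]))   # ~14.6
print(mae(y_true, [50, 30]))    # 12.5
```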
7.3.4 Huber loss
• Lipschitz continuous, differentiable, convex (strictly convex only in the quadratic region \(\abs {e_i}\le \delta \))
\begin{equation} \loss (e_i) = \begin{cases} \dfrac {1}{2}e_i^2 & \abs {e_i}\le \delta \\[7pt] \delta \left (\abs {e_i} - \dfrac {1}{2}\delta \right ) & \text {otherwise} \end {cases} \end{equation}
For small \(e_i\) it behaves like MSE and for larger \(e_i\) like MAE.
The problem with Huber loss is the need to tune the hyperparameter \(\delta \), which is a non-trivial process.
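A minimal NumPy sketch of the piecewise definition above (the function name and the default \(\delta =1\) are our choices) illustrates the quadratic/linear transition:

```python
import numpy as np

def huber(e, delta=1.0):
    """Huber loss per sample: quadratic near zero, linear in the tails."""
    e = np.asarray(e, dtype=float)
    quadratic = 0.5 * e**2
    linear = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quadratic, linear)

print(huber(0.5))    # 0.125 = 0.5 * 0.5**2     (MSE-like region)
print(huber(10.0))   # 9.5   = 1.0 * (10 - 0.5) (MAE-like region)
```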
7.3.5 Log-cosh loss
\(\seteqnumber{0}{}{8}\)\begin{equation} \loss (e_i) = \log {\cosh {e_i}} \end{equation}
Properties:
• Twice differentiable everywhere, \(\dfrac {\partial }{\partial e_i}L(e_i)=\tanh {e_i}\).
• Approximation:
\(\seteqnumber{0}{}{9}\)\begin{equation} L(e_i)\approx \begin{cases} \dfrac {e_i^2}{2} & e_i \text { small}\\[5pt] \abs {e_i}-\log (2) & e_i \text { large} \end {cases} \end{equation}
• No hyper-parameters.
• Similar to Huber loss with \(\delta =1\).
• Requires non-trivial error handling; otherwise, the optimization may get stuck in either of the two regions.
• Explicit hyper-parameter optimization of the Huber loss is often recommended instead.
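One numerical pitfall worth noting: \(\cosh (e_i)\) overflows in double precision for \(\abs {e_i}\) around 710, so practical implementations typically use the identity \(\log \cosh x = \abs {x} + \log (1+e^{-2\abs {x}}) - \log 2\). A sketch (function name ours):

```python
import numpy as np

def logcosh(e):
    """Numerically stable log-cosh via
    log(cosh(e)) = |e| + log1p(exp(-2|e|)) - log(2);
    the naive np.log(np.cosh(e)) overflows for |e| around 710."""
    a = np.abs(np.asarray(e, dtype=float))
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

print(np.isclose(logcosh(2.0), np.log(np.cosh(2.0))))  # True: matches naive form
print(np.isfinite(logcosh(1e6)))                       # True: no overflow
```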
7.3.6 Cauchy
\(\seteqnumber{0}{}{10}\)\begin{equation} \loss (e_i) = \log \left (1+\left (\frac {e_i}{d}\right )^2\right ) \end{equation}
• \(d\) is the “sharpness” parameter.
• More robust against outliers than MAE, but less so than the atan loss.
7.3.7 Atan
\(\seteqnumber{0}{}{11}\)\begin{equation} \loss (e_i) = \arctan (e_i^2) \end{equation}
• Atan tends to \(\pi /2\) as its input tends to infinity, and its derivative tends to 0.
• This means extreme outliers will have a negligible effect on the search direction compared to non-outliers.
• It is important to ensure the data is well scaled and that the starting point is a reasonable guess at the true solution, if possible (similar to the log-cosh loss).
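A short sketch (NumPy; function names ours) illustrating how both robust losses damp extreme outliers, with atan saturating at \(\pi /2\):

```python
import numpy as np

def cauchy(e, d=1.0):
    # d is the "sharpness" parameter
    return np.log(1.0 + (np.asarray(e, dtype=float) / d) ** 2)

def atan_loss(e):
    return np.arctan(np.asarray(e, dtype=float) ** 2)

# An extreme outlier barely moves the atan objective (it saturates
# at pi/2), while the Cauchy loss keeps growing, but only logarithmically.
print(atan_loss(1e6))   # ~ pi/2
print(cauchy(1e6))      # ~ log(1e12) ~ 27.6
```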
7.3.8 Mean Squared Logarithmic Error (MSLE)
MSLE is the relative difference between the log-transformed actual and predicted values.
\(\seteqnumber{0}{}{12}\)\begin{equation} \begin{aligned} \loss (y_i,\hat {y}_i) &= \frac {1}{M}\sum _{i=1}^M \left (\log (y_i+1)-\log (\hat {y}_i+1)\right )^2\\ & = \frac {1}{M}\sum _{i=1}^M\left (\log \frac {y_i+1}{\hat {y}_i+1}\right )^2 \end {aligned} \end{equation}
‘1’ is added to both \(y\) and \(\hat {y}\) for mathematical convenience, since \(\log (0)\) is undefined while both \(y\) and \(\hat {y}\) can be 0.
• Addresses \(y\) with a high dynamic range, i.e., relative error; less useful for a low dynamic range.
• MSLE tries to treat small and large differences between the actual and predicted values similarly, e.g., in Table 7.2.
• Penalizes underestimated values more than overestimated values.
• Root MSLE (RMSLE) is also used, e.g., in scikit-learn.
| True Values | Predicted Values | MSE Loss | MSLE Loss |
| 40 | 30 | 100 | 0.0782 |
| 4000 | 3000 | 1,000,000 | 0.0827 |
| 20 | 10 | 100 | 0.4181 |
| 20 | 30 | 100 | 0.1517 |
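The MSLE values in Table 7.2 can be reproduced directly from the definition; `np.log1p(x)` computes \(\log (x+1)\). An illustrative sketch (function name ours):

```python
import numpy as np

def msle(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    # log1p(x) = log(x + 1), matching the '+1' in the definition
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)

print(round(float(msle(40, 30)), 4))      # 0.0782
print(round(float(msle(4000, 3000)), 4))  # 0.0827
print(round(float(msle(20, 10)), 4))      # 0.4181 (underestimate)
print(round(float(msle(20, 30)), 4))      # 0.1517 (overestimate)
```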
7.4 Relative Metrics
Relative squared error (RSE)
Normalized MSE loss.
\(\seteqnumber{0}{}{13}\)\begin{equation} J(y_i,\hat {y}_i) = \dfrac {\displaystyle \sum _i \left ( y_i - \hat {y}_i\right )^2}{\displaystyle \sum _i\left (y_i-\overline {y}\right )^2}=\dfrac {MSE}{\Var [\by ]} =\frac {\norm {\by - \hat {\by }}^2}{\norm {\by - \bar {\by }}^2} \end{equation}
Shows the fraction of the unexplained variance, since \(\hat {y}_i = \bar {y}\) is the minimum-MSE predictor when the input data and \(\by \) are statistically independent. Closer to 0 is better.
R2
The common metric in social sciences,
\(\seteqnumber{0}{}{14}\)\begin{equation} R^2 = 1 - RSE \end{equation}
Opposite to RSE, i.e., \(R^2\) close to 1 is better. When the model performs worse than the mean baseline, it is negative.
For example, an \(R^2\) of 40% indicates that your model has reduced the mean squared error by 40% compared to the baseline, which is the mean model. This is the same as an RSE of 60%.
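A small sketch (NumPy; function names ours) of RSE and \(R^2\), showing the three regimes: perfect prediction, the mean baseline, and a model worse than the mean:

```python
import numpy as np

def rse(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def r2(y, y_hat):
    return 1.0 - rse(y, y_hat)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r2(y, y))                               # 1.0: perfect prediction
print(r2(y, np.full(4, y.mean())))            # 0.0: the mean baseline
print(r2(y, np.array([4.0, 3.0, 2.0, 1.0])))  # -3.0: worse than the mean
```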
Normalized Root Mean Squared Error Expressed as a percentage, defined as:
\(\seteqnumber{0}{}{15}\)\begin{equation} J(y_i,\hat {y}_i) =100\left (1-\frac {\norm {\by - \hat {\by }}}{\norm {\by - \bar {\by }}}\right ) =100\left (1-\sqrt {RSE}\right ) \end{equation}
Used in Matlab.
Relative absolute error (RAE)
Normalized MAE loss.
\(\seteqnumber{0}{}{16}\)\begin{equation} J(y_i,\hat {y}_i) = \dfrac {\displaystyle \sum _i \abs {y_i - \hat {y}_i}}{\displaystyle \sum _i\abs {y_i-\overline {y}}} = \frac {MAE}{\frac {1}{M}\displaystyle \sum _i\abs {y_i-\overline {y}}} \end{equation}
Closer to 0 is better.
Mean Absolute Percentage Error (MAPE)
Scaled error metric.
\(\seteqnumber{0}{}{17}\)\begin{equation} J = \frac {1}{M}\sum _{i=1}^M\frac {\abs {y_i-\hat {y}_i}}{\abs {y_i}}\times 100\% \end{equation}
• Beware of small denominators!
• Can exceed 100%.
• Asymmetric, as described in Table 7.3.
| True Values | Predicted Values | Absolute Error | MAPE |
| 100 | 60 | 40 | 40% |
| 20 | 60 | 40 | 200% |
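The asymmetry is easy to reproduce from the definition of MAPE above. A sketch (function name ours):

```python
import numpy as np

def mape(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y - y_hat) / np.abs(y)) * 100.0

# The same absolute error of 40 gives very different percentages,
# depending on the magnitude of the true value in the denominator:
print(mape(100, 60))  # 40.0
print(mape(20, 60))   # 200.0
```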
Additional Reading Additional reading on regression metrics: [?]
7.5 Number of Parameters Penalty
Adding parameters improves the fit to the training data but may result in overfitting. These metrics attempt to resolve this problem by introducing a penalty term for the number of parameters in the model with the MSE loss.
All these have lengthy theoretical justification that is not provided here.
• \(N\) is the number of parameters,
• \(M\) is the number of data-points.
Main assumption: the residuals have a Gaussian distribution.
Akaike’s Final Prediction Error (FPE)
Akaike’s Final Prediction Error (FPE) criterion provides a measure of model quality.
\(\seteqnumber{0}{}{18}\)\begin{equation} FPE = MSE\frac {1+N/M}{1-N/M} = MSE\frac {M+N}{M-N} \end{equation}
Akaike’s Information Criterion (AIC) Penalizes the number of parameters,
\(\seteqnumber{0}{}{19}\)\begin{equation} AIC = 2N + M\ln (MSE) \end{equation}
AICc is AIC with a correction for small sample sizes
\(\seteqnumber{0}{}{20}\)\begin{equation} AICc = AIC + 2N\frac {N+1}{M-N-1} \end{equation}
Bayesian Information Criterion (BIC)
\(\seteqnumber{0}{}{21}\)\begin{equation} BIC = M\ln (MSE) + N\ln (M) \end{equation}
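All four criteria can be computed side by side. A sketch (function name ours), comparing a small model against a larger one that must lower the MSE enough to justify its extra parameters:

```python
import numpy as np

def information_criteria(mse_val, N, M):
    """FPE, AIC, AICc and BIC for a model with N parameters
    fitted on M data points (Gaussian residuals assumed)."""
    fpe = mse_val * (M + N) / (M - N)
    aic = 2 * N + M * np.log(mse_val)
    aicc = aic + 2 * N * (N + 1) / (M - N - 1)
    bic = M * np.log(mse_val) + N * np.log(M)
    return fpe, aic, aicc, bic

# The larger model barely reduces the MSE, so its AIC is worse:
print(information_criteria(1.00, N=3, M=100))  # AIC = 6.0
print(information_criteria(0.98, N=6, M=100))  # AIC ~ 9.98: not justified
```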
7.6 Summary
7.6.1 Loss Selection Guidelines
• MSE: Default choice for clean data; analytical gradients enable fast convergence.
• MAE: Prefer when outliers are present; requires an adaptive learning rate.
• Huber: Best of both worlds when \(\delta \) can be tuned via cross-validation.
• MSLE: Use for high-dynamic-range targets where relative error matters.
• Cauchy/Atan: Heavy outlier contamination; require careful initialization.
7.6.2 Metric Selection Guidelines
• RMSE: Same units as the target; intuitive for stakeholders.
• \(R^2\): Explains variance reduction vs. the mean baseline; use for model comparison.
• MAPE: Percentage interpretation; avoid when \(y\) approaches zero.
7.6.3 Common Pitfalls
• Scale confusion: MSE/RMSE/MAE are scale-dependent; compare across datasets only after normalization or via dimensionless metrics.
• MAPE near zero: Avoid MAPE when \(y\) can be small or cross zero.
• \(R^2\) misuse: High \(R^2\) does not guarantee good predictions; inspect residuals and error magnitudes.
• Non-convex robust losses: Without careful initialization/scaling, they may converge to poor local minima.
• MAE with fixed learning rate: May oscillate near the optimum; use learning-rate decay or adaptive optimizers.