Machine Learning & Signals Learning


7 Regression Losses and Metrics

  • Goal: Quantify the difference between predicted and target values.

For regression, loss and metrics are often the same quantity.

7.1 Preface

The range of the predictions \(\hat {y}_i\) depends on the particular model applied.

Metric A performance metric is a function \(J:\hat {\by },\by \rightarrow \mathbb {R}\) that is used to evaluate and quantify the effectiveness of a model. These metrics provide insight into how well a model performs on various aspects of the data it predicts.

Loss function The loss function is a metric of the form \(L:\hat {\by },\by \rightarrow \mathbb {R}\) that is calculated over the training set, so that the optimal parameters corresponding to the minimum loss

\begin{equation} \bth ^{*} = \arg \min _{\bth } \loss (\hat {\by },\by ) \end{equation}

can be evaluated (Sec. 4.4.3).

A metric is not necessarily a loss function, and a loss function is not necessarily a metric.

For example, cross-entropy loss in classification is not used as a metric, and \(R^2\) is not used as a loss.

Summary Metrics are for communication; losses are for training. A loss is the objective minimized during training, while a metric is reported to summarize performance. They may coincide (e.g., MSE as both loss and metric), but need not.

7.2 Loss Function Properties

A loss function has a few desired properties that are presented below. While not obligatory, these properties may significantly ease the evaluation of the parameters \(\bm {\theta }\). In the following, only the basic description of these properties is provided, sacrificing mathematical rigor for brevity.

Continuity: A single unbroken curve (without jumps) over all possible input values, i.e., no discontinuities.

Lipschitz continuity: A stricter, formal continuity requirement that limits how fast the function can change. A real-valued function \(f(\cdot ):\mathbb {R}\rightarrow \mathbb {R}\) is called Lipschitz continuous if there exists a positive real constant \(K\) such that, for all real \(x_1\) and \(x_2\),

\begin{equation} \abs {f(x_1) - f(x_2)} \le K \abs {x_1 - x_2}. \end{equation}

This property formally bounds the maximum gradient magnitude of a loss function.

Differentiability: A differentiable function of one real variable is a function whose derivative exists at each point in its domain.

  • If the function is differentiable, it is also continuous.

  • If the derivative exists and is bounded, the function is also Lipschitz continuous.

This property is particularly important for neural networks (NNs), which rely on gradient back-propagation.

Convexity and strict convexity: Every chord lies on or above the graph (Fig. 7.1). Formally, the line segment between \(\left (x_1,f(x_1)\right )\) and \(\left (x_2,f(x_2)\right )\) lies on or above the graph of \(f(x)\) for \(x_1\le x \le x_2\). Mathematically, for all \(0\le t \le 1\) and \(\forall x_1,x_2\in \mathbb {R}\),

\begin{equation} f\left (tx_1+(1-t)x_2\right )\le tf(x_1) + (1-t)f(x_2) \end{equation}

A twice-differentiable function of a single variable is convex if and only if its second derivative is nonnegative on its entire domain. For strict convexity, the second derivative must be strictly positive. In the context of loss functions, convexity guarantees that any local minimum is also the global minimum.

(image)

Figure 7.1: Convexity property visualization.
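The chord inequality above is easy to check numerically. Below is a minimal Python sketch (the helper name `chord_holds` is illustrative, not from the text) that samples the inequality for the convex function \(f(x)=x^2\) and for a concave counter-example:

```python
def chord_holds(f, x1, x2, steps=101):
    # Sample the chord inequality f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)
    # on a grid of t values in [0, 1]; a small tolerance absorbs rounding.
    for k in range(steps):
        t = k / (steps - 1)
        xm = t * x1 + (1 - t) * x2
        if f(xm) > t * f(x1) + (1 - t) * f(x2) + 1e-12:
            return False
    return True

assert chord_holds(lambda x: x * x, -3.0, 5.0)       # x^2 is convex: holds
assert not chord_holds(lambda x: -x * x, -3.0, 5.0)  # -x^2 is concave: fails
```

This is only a sampled check, not a proof, but it mirrors the definition directly.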

7.3 Losses

The per-sample error notation is \(e_i = y_i - \hat {y}_i,\ i=1,\ldots ,M\), with the corresponding vector notation \(\be = \by - \hat {\by }\). Note that the order of subtraction does not matter in the context of the following material.

Some of the following losses are also used as metrics.

7.3.1 Mean-squared error (MSE)
  • Continuous, differentiable, convex

MSE loss (also termed L2-loss) and metric is

\begin{equation} \begin{aligned} J(\by ,\hat {\by }) &= \frac {1}{M}\norm {\by - \hat {\by }}^2=\frac {1}{M}\be ^T\be \\ &= \frac {1}{M}\sum _{i=1}^M (y_i -\hat {y}_i)^2 = \frac {1}{M}\sum _{i=1}^M e_i^2 \end {aligned} \end{equation}

When used as a loss, a factor of \(1/2\) is sometimes applied to “compensate” for the factor of 2 produced by the derivative,

\begin{equation} \loss (\by ,\hat {\by }) = \frac {1}{2M}\sum _{i=1}^M e_i^2 \end{equation}

The sum of squared errors (SSE, Eq. (2.4)) is also used.

Important properties:

  • Popular regression loss.

  • Popular metric.

  • Analytical, error-dependent gradient (proportional to the error).

  • An analytical solution is sometimes available, e.g., the normal equation for linear models.

  • The main drawback is its inherent sensitivity to outliers.
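For a linear model \(\hat {\by } = \bX \bth \), the error-dependent gradient and the normal-equation solution can be sketched in a few lines of NumPy (a minimal illustration; the function names are for this example only):

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error over M samples.
    e = y - y_hat
    return float(e @ e) / len(y)

def mse_grad(X, y, theta):
    # Gradient of MSE w.r.t. theta for y_hat = X @ theta:
    # -(2/M) X^T (y - X theta), i.e., proportional to the error.
    return -(2.0 / len(y)) * X.T @ (y - X @ theta)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
theta = rng.normal(size=3)

# Finite-difference check of the analytical gradient.
g = mse_grad(X, y, theta)
eps = 1e-6
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    num = (mse(y, X @ (theta + d)) - mse(y, X @ (theta - d))) / (2 * eps)
    assert abs(num - g[j]) < 1e-4

# Normal equation: the analytical minimizer has zero gradient.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(mse_grad(X, y, theta_star), 0.0)
```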

7.3.2 RMSE
  • Continuous, differentiable, convex

A convenient companion metric to MSE is the root-MSE (RMSE).

\begin{equation} \begin{aligned} J(\by ,\hat {\by }) &= \frac {1}{\sqrt {M}}\norm {\by - \hat {\by }}\\ &= \sqrt {\frac {1}{M}\sum _{i=1}^M (y_i -\hat {y}_i)^2} = \sqrt {\frac {1}{M}\sum _{i=1}^M e_i^2} \end {aligned} \end{equation}

  • In principle, RMSE can also be used as a loss; it has the same minimizer as MSE and differs only in its gradients.

  • Easier human interpretation.

7.3.3 Mean absolute error (MAE)

MAE is used both as loss and metric.

\begin{equation} \begin{aligned} \loss (\by ,\hat {\by }) &= \frac {1}{M}\sum _{i=1}^M \abs {y_i -\hat {y}_i} = \frac {1}{M}\sum _{i=1}^M \abs {e_i} \end {aligned} \end{equation}

Important properties:

  • Popular loss and metric.

  • All errors are weighted equally, so MAE is less sensitive to outliers than MSE.

  • The gradient magnitude is error-independent, which may slow convergence under certain conditions. In particular, the gradient stays large even for very small errors; fixing this requires a learning rate that decays as we approach the minimum. MSE behaves nicely here and converges even with a fixed learning rate: its gradient is large for large errors and shrinks as the loss approaches 0, making it more precise at the end of training.

  • Non-differentiable at \(e_i = 0\), though without dramatic influence on most learning algorithms.

A brief numerical example of MSE, RMSE and MAE is presented in Table 7.1 and in Fig. 7.2.

Table 7.1: Example of MSE, RMSE and MAE. Small change in error results in significant change in MSE and less dramatic change in MAE.

True Values Predicted Values MSE RMSE MAE
(30,25) (40,30) \(\frac {(40-30)^2}{2} + \frac {(30-25)^2}{2}=62.5\) \(\sqrt {62.5}=7.91\) \(\frac {\abs {40-30}}{2} + \frac {\abs {30-25}}{2}=7.5\)
(30,25) (50,30) \(\frac {(50-30)^2}{2} + \frac {(30-25)^2}{2}=212.5\) \(\sqrt {212.5}=14.6\) \(\frac {\abs {50-30}}{2} + \frac {\abs {30-25}}{2}=12.5\)

(image)

Figure 7.2: The model is an \(N=4\) polynomial with 40 inliers and three outliers at significantly higher values. Both MSE and MAE were used with \(\lambda \)-optimized L2-regularization. The MSE regression is pulled toward the outliers because large errors are squared, while the MAE regression provides a more robust fit.
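The numbers in Table 7.1 can be reproduced with a few lines of plain Python (a stdlib sketch; the helper names are illustrative):

```python
import math

def mse(y, y_hat):
    # Mean squared error.
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    # Mean absolute error.
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

y_true = [30, 25]
# First row of Table 7.1: predictions (40, 30).
assert mse(y_true, [40, 30]) == 62.5
assert round(math.sqrt(mse(y_true, [40, 30])), 2) == 7.91
assert mae(y_true, [40, 30]) == 7.5
# Second row: predictions (50, 30) -- MSE jumps far more than MAE.
assert mse(y_true, [50, 30]) == 212.5
assert round(math.sqrt(mse(y_true, [50, 30])), 1) == 14.6
assert mae(y_true, [50, 30]) == 12.5
```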
7.3.4 Huber loss
  • Lipschitz continuous, differentiable, convex

  • Goal: Hybrid between MAE and MSE.

\begin{equation} \loss (e_i) = \begin{cases} \dfrac {1}{2}e_i^2 & \abs {e_i}\le \delta \\[7pt] \delta \left (\abs {e_i} - \dfrac {1}{2}\delta \right ) & \text {otherwise} \end {cases} \end{equation}

For small \(e_i\) it behaves like MSE and for larger \(e_i\) like MAE.

The drawback of the Huber loss is the need to tune the hyperparameter \(\delta \), which is a non-trivial process.
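A minimal sketch of the piecewise definition above (the function name and default \(\delta =1\) are illustrative):

```python
def huber(e, delta=1.0):
    # Quadratic (MSE-like) inside |e| <= delta, linear (MAE-like) outside.
    a = abs(e)
    if a <= delta:
        return 0.5 * e * e
    return delta * (a - 0.5 * delta)

assert huber(0.5) == 0.125   # quadratic regime: e^2 / 2
assert huber(3.0) == 2.5     # linear regime: delta * (|e| - delta/2)
# The two branches meet smoothly at |e| = delta (values agree).
assert abs(huber(1.0 + 1e-9) - huber(1.0 - 1e-9)) < 1e-8
```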

7.3.5 Log-cosh loss
  • Goal: Hybrid between MAE and MSE without hyper-parameters.

\begin{equation} \loss (e_i) = \log {\cosh {e_i}} \end{equation}

Properties:

  • Twice differentiable everywhere, \(\dfrac {\partial }{\partial e_i}L(e_i)=\tanh {e_i}\).

  • Approximation:

    \begin{equation} L(e_i)\approx \begin{cases} \dfrac {e_i^2}{2} & e_i \text { small}\\[5pt] \abs {e_i}-\log (2) & e_i \text { large} \end {cases} \end{equation}

  • No hyper-parameters.

  • Similar to Huber loss with \(\delta =1\).

  • Requires careful numerical handling (\(\cosh \) overflows for large arguments); otherwise optimization may get stuck in either of the two regimes.

  • When the tuning budget allows, explicit hyper-parameter optimization of the Huber loss is recommended instead.
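One common way to implement log-cosh stably (an implementation choice for this sketch, not prescribed by the text) uses the identity \(\log \cosh e = |e| + \log (1+e^{-2|e|}) - \log 2\), which avoids the overflow of \(\cosh \) for large errors:

```python
import math

def log_cosh(e):
    # Stable identity: log(cosh(e)) = |e| + log(1 + exp(-2|e|)) - log(2).
    a = abs(e)
    return a + math.log1p(math.exp(-2 * a)) - math.log(2)

# Small errors behave like MSE: ~ e^2 / 2.
assert abs(log_cosh(0.01) - 0.01 ** 2 / 2) < 1e-8
# Large errors behave like MAE: ~ |e| - log(2).
assert abs(log_cosh(50.0) - (50.0 - math.log(2))) < 1e-12
# math.log(math.cosh(800)) would overflow; the stable form does not.
assert abs(log_cosh(800.0) - (800.0 - math.log(2))) < 1e-9
```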

(image)

Figure 7.3: Visualization of different loss functions. MSE is the steepest.
7.3.6 Cauchy
  • Goal: More robust to outliers than MAE.

\begin{equation} \loss (e_i) = \log \left (1+\left (\frac {e_i}{d}\right )^2\right ) \end{equation}

  • \(d\) is a “sharpness” hyper-parameter

  • More robust against outliers than MAE, but less robust than the atan loss.

7.3.7 Atan
  • Goal: More robust to outliers than Cauchy.

\begin{equation} \loss (e_i) = \arctan (e_i^2) \end{equation}

  • \(\arctan (e_i^2)\) tends to \(\pi /2\) as \(\abs {e_i}\) tends to infinity, and its derivative tends to 0.

  • This means extreme outliers will have a negligible effect on the search direction compared to non-outliers.

  • It is important to ensure the data is well scaled and that the starting point is a reasonable guess at the true solution, if possible (similar to Log-cosh loss).
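The vanishing influence of extreme outliers can be illustrated by comparing gradient magnitudes; the closed-form derivatives below follow directly from the definitions above (helper names are illustrative):

```python
def cauchy_grad(e, d=1.0):
    # d/de log(1 + (e/d)^2) = 2e / (d^2 + e^2)
    return 2 * e / (d * d + e * e)

def atan_grad(e):
    # d/de arctan(e^2) = 2e / (1 + e^4)
    return 2 * e / (1 + e ** 4)

outlier = 1000.0
mae_grad = 1.0  # the gradient magnitude of |e| is constant
# An extreme outlier's pull on the search direction: atan << Cauchy << MAE.
assert atan_grad(outlier) < cauchy_grad(outlier) < mae_grad
# For a typical (small) error, all three gradients are comparable in scale.
assert 0.1 < cauchy_grad(0.5) < 1.0 and 0.1 < atan_grad(0.5) < 1.0
```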

7.3.8 Mean Squared Logarithmic Error (MSLE)
  • Goal: Relative loss for high-dynamic range data.

MSLE is the mean squared difference between the log-transformed actual and predicted values, i.e., a relative error measure.

\begin{equation} \begin{aligned} \loss (y_i,\hat {y}_i) &= \frac {1}{M}\sum _{i=1}^M \left (\log (y_i+1)-\log (\hat {y}_i+1)\right )^2\\ & = \frac {1}{M}\sum _{i=1}^M\left (\log \frac {y_i+1}{\hat {y}_i+1}\right )^2 \end {aligned} \end{equation}

A ‘1’ is added to both \(y\) and \(\hat {y}\) for mathematical convenience: \(\log (0)\) is undefined, but both \(y\) and \(\hat {y}\) can be 0.

  • Addresses \(y\) with high dynamic range, i.e. addresses relative error. Less useful for low dynamic range.

  • MSLE tries to treat small and large differences between the actual and predicted values similarly, e.g. in Table 7.2.

  • Penalizes underestimated values more than overestimated values.

  • Root MSLE (RMSLE) is also used, e.g., in scikit-learn.

Table 7.2: Example of MSLE.
True Values Predicted Values MSE Loss MSLE Loss
40 30 100 0.0782
4000 3000 1,000,000 0.0827
20 10 100 0.4181
20 30 100 0.1517
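The MSLE values in Table 7.2 can be reproduced as follows (a stdlib sketch; `msle_term` is an illustrative name for the per-sample term):

```python
import math

def msle_term(y, y_hat):
    # Per-sample MSLE term; the '+1' guards against log(0).
    return (math.log(y + 1) - math.log(y_hat + 1)) ** 2

# Rows of Table 7.2:
assert round(msle_term(40, 30), 4) == 0.0782
assert round(msle_term(4000, 3000), 4) == 0.0827
assert round(msle_term(20, 10), 4) == 0.4181
assert round(msle_term(20, 30), 4) == 0.1517
# Same absolute error (10), but underestimation is penalized more:
assert msle_term(20, 10) > msle_term(20, 30)
```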

7.4 Relative Metrics

  • Goal:

    • Add human interpretability to the standard metrics.

    • Used to compare different models.

    • Most of these metrics are normalized, typically to the range \([0,1]\).

Relative squared error (RSE)

Normalized MSE loss.

\begin{equation} J(y_i,\hat {y}_i) = \dfrac {\displaystyle \sum _i \left ( y_i - \hat {y}_i\right )^2}{\displaystyle \sum _i\left (y_i-\overline {y}\right )^2}=\dfrac {MSE}{\Var [\by ]} =\frac {\norm {\by - \hat {\by }}^2}{\norm {\by - \bar {\by }}^2} \end{equation}

Shows the fraction of unexplained variance: \(\hat {y}_i = \bar {y}\) is the minimum-MSE predictor when the inputs and \(\by \) are statistically independent, so RSE compares the model against this baseline. Closer to 0 is better.

R2

The common metric in social sciences,

\begin{equation} R^2 = 1 - RSE \end{equation}

Opposite to RSE, i.e., \(R^2\) closer to 1 is better. If the model performs worse than the mean baseline, \(R^2\) is negative.

For example, an \(R^2\) of 40% indicates that the model has reduced the mean squared error by 40% compared to the baseline mean model. This is the same as an RSE of 60%.
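A minimal from-scratch sketch of \(R^2 = 1 - RSE\) (function name illustrative), covering the perfect-fit, mean-baseline, and worse-than-mean cases:

```python
def r2_score(y, y_hat):
    # R^2 = 1 - RSE = 1 - SSE / sum((y - mean(y))^2)
    y_bar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - y_bar) ** 2 for a in y)
    return 1 - sse / sst

y = [1.0, 2.0, 3.0, 4.0]
assert r2_score(y, y) == 1.0                    # perfect fit
assert r2_score(y, [2.5] * 4) == 0.0            # mean-model baseline
assert r2_score(y, [4.0, 3.0, 2.0, 1.0]) < 0.0  # worse than the mean model
```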

Normalized Root Mean Squared Error Expressed as a percentage, defined as:

\begin{equation} J(y_i,\hat {y}_i) =100\left (1-\frac {\norm {\by - \hat {\by }}}{\norm {\by - \bar {\by }}}\right ) =100\left (1-\sqrt {RSE}\right ) \end{equation}

Used in MATLAB.

Relative absolute error (RAE)

Normalized MAE loss.

\begin{equation} J(y_i,\hat {y}_i) = \dfrac {\displaystyle \sum _i \abs {y_i - \hat {y}_i}}{\displaystyle \sum _i\abs {y_i-\overline {y}}} = \dfrac {\mathrm {MAE}}{\dfrac {1}{M}\displaystyle \sum _i\abs {y_i-\overline {y}}} \end{equation}

Closer to 0 is better.

Mean Absolute Percentage Error (MAPE)

Scaled error metric.

\begin{equation} J = \frac {1}{M}\sum _{i=1}^M\frac {\abs {y_i-\hat {y}_i}}{y_i}\times 100\% \end{equation}

  • Beware of small denominator!

  • Can exceed 100%.

  • Asymmetric, as described in Table 7.3.

Table 7.3: Example of MAPE.
True Values Predicted Values Absolute Error MAPE
100 60 40 40%
20 60 40 200%
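A quick computation of MAPE's asymmetry for the value pairs in Table 7.3: the same absolute error of 40 gives \(40/100=40\%\) against a true value of 100, but \(40/20=200\%\) against a true value of 20 (stdlib sketch; the function name is illustrative):

```python
def mape(y, y_hat):
    # Mean absolute percentage error; beware of small or zero y_i.
    return 100.0 * sum(abs(a - b) / a for a, b in zip(y, y_hat)) / len(y)

# Same absolute error (40) -- very different percentage errors:
assert abs(mape([100.0], [60.0]) - 40.0) < 1e-9
assert abs(mape([20.0], [60.0]) - 200.0) < 1e-9
```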

Additional Reading Further reading on regression metrics: [?]

7.5 Number of Parameters Penalty

Adding parameters improves the fit on the training data but may result in overfitting. The following metrics attempt to address this problem by adding a penalty term for the number of model parameters to the MSE loss.

All these have lengthy theoretical justification that is not provided here.

  • \(N\) is the number of parameters,

  • \(M\) is the number of data-points.

Main assumption: the residuals have a Gaussian distribution.

Akaike’s Final Prediction Error (FPE)

Akaike’s Final Prediction Error (FPE) criterion provides a measure of model quality.

\begin{equation} FPE = MSE\frac {1+N/M}{1-N/M} = MSE\frac {M+N}{M-N} \end{equation}

Akaike’s Information Criterion Penalizes the number of parameters,

\begin{equation} AIC = 2N + M\ln (MSE) \end{equation}

AICc is AIC with a correction for small sample sizes

\begin{equation} AICc = AIC + 2N\frac {N+1}{M-N-1} \end{equation}

Bayesian Information Criterion (BIC)

\begin{equation} BIC = M\ln (MSE) + N\ln (M) \end{equation}
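The four criteria can be computed directly from the formulas above (illustrative helper names; setting \(MSE=1\) makes the \(\ln (MSE)\) terms vanish, isolating the parameter penalties):

```python
import math

def fpe(mse, N, M):
    # Akaike's Final Prediction Error
    return mse * (M + N) / (M - N)

def aic(mse, N, M):
    return 2 * N + M * math.log(mse)

def aicc(mse, N, M):
    # AIC with a small-sample correction
    return aic(mse, N, M) + 2 * N * (N + 1) / (M - N - 1)

def bic(mse, N, M):
    return M * math.log(mse) + N * math.log(M)

# With MSE = 1, only the parameter penalties remain.
assert fpe(1.0, 2, 10) == 1.5            # (M+N)/(M-N) = 12/8
assert aic(1.0, 2, 10) == 4.0            # 2N
assert abs(aicc(1.0, 2, 10) - (4.0 + 12 / 7)) < 1e-12
assert abs(bic(1.0, 2, 10) - 2 * math.log(10)) < 1e-12
```

Note that for \(M > e^2 \approx 7.4\) data points, BIC's per-parameter penalty \(\ln (M)\) exceeds AIC's penalty of 2, so BIC favors smaller models.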

7.6 Summary

7.6.1 Loss Selection Guidelines
  • MSE: Default choice for clean data; analytical gradients enable fast convergence.

  • MAE: Prefer when outliers are present; requires adaptive learning rate.

  • Huber: Best of both worlds when \(\delta \) can be tuned via cross-validation.

  • MSLE: Use for high dynamic range targets where relative error matters.

  • Cauchy/Atan: Heavy outlier contamination; require careful initialization.

7.6.2 Metric Selection Guidelines
  • RMSE: Same units as target; intuitive for stakeholders.

  • \(R^2\): Explains variance reduction vs. mean baseline; use for model comparison.

  • MAPE: Percentage interpretation; avoid when \(y\) approaches zero.

7.6.3 Common Pitfalls
  • Scale confusion: MSE/RMSE/MAE are scale-dependent; compare across datasets only after normalization or via dimensionless metrics.

  • MAPE near zero: Avoid MAPE when \(y\) can be small or cross zero.

  • \(R^2\) misuse: High \(R^2\) does not guarantee good predictions; inspect residuals and error magnitudes.

  • Non-convex robust losses: Without careful initialization/scaling, they may converge to poor local minima.

  • MAE with fixed learning rate: May oscillate near the optimum; use learning-rate decay or adaptive optimizers.