Machine Learning & Signals Learning


6 Regression Losses and Metrics

  • Goal: Quantify the difference between predicted and target values.

For regression, loss and metrics are often the same quantity.

6.1 Preface

The range of the predictions \(\hat {y}_i=f(\bx _i)\) is determined by the particular model applied.

Metric A performance metric is a real-valued function \(J:(\by ,\hat {\by })\rightarrow \real \) that is used to evaluate and quantify the effectiveness of a model. Metrics provide insight into how well the model performs on various aspects of the data it predicts.

Loss function The loss function is a metric of the form \(L:(\by ,\hat {\by })\rightarrow \real \) that is calculated over the training set, so that the optimal parameters corresponding to the minimum loss

\begin{equation} \bw = \arg \min _{\bw } \loss (\hat {\by },\by ) \end{equation}

can be evaluated, e.g., by GD (Sec. 3.4).

A metric is not necessarily a loss function, and a loss function is not necessarily a metric.

For example, the cross-entropy loss in classification is not used as a metric, and \(R^2\) is not used as a loss.

Summary Metrics are for communication; losses are for training. A loss is the objective minimized during training, while a metric is reported to summarize performance. They may coincide (e.g., MSE as both loss and metric), but need not.

6.2 Loss Function Properties

A loss function has a few desired properties that are presented below. While not obligatory, these properties may significantly ease the evaluation of the parameters \(\bw \). In the following, only the basic description of these properties is provided, sacrificing mathematical rigor for brevity.

Continuity: The function forms a single unbroken curve over all possible input values, i.e., it has no jumps or other discontinuities.

Lipschitz continuity: A formal, stricter continuity requirement that limits the rate at which the function can change. A real-valued function \(f(\cdot ):\real \rightarrow \real \) is called Lipschitz continuous if there exists a positive real constant \(K\) such that, for all real \(x_1\) and \(x_2\),

\begin{equation} \abs {f(x_1) - f(x_2)} \le K \abs {x_1 - x_2}. \end{equation}

This property formally bounds the maximum gradient magnitude of a loss function.

Differentiability: A differentiable function of one real variable is a function whose derivative exists at each point in its domain.

  • If the function is differentiable, it is also continuous.

  • If the derivative is a continuous function, the function is also locally Lipschitz continuous.

Convexity and strict convexity: Every line segment connecting two points on the graph lies on or above the graph (Fig. 6.1). Formally, the line segment between \(\left (x_1,f(x_1)\right )\) and \(\left (x_2,f(x_2)\right )\) lies on or above the graph of \(f(x)\) for \(x_1\le x \le x_2\). Mathematically, for all \(0\le t \le 1\) and \(\forall x_1,x_2\in \real \),

\begin{equation} f\left (tx_1+(1-t)x_2\right )\le tf(x_1) + (1-t)f(x_2) \end{equation}

A twice-differentiable function of a single variable is convex if and only if its second derivative is nonnegative on its entire domain; for strict convexity, the second derivative must be strictly positive. In the context of loss function properties, convexity guarantees that any local minimum is also a global minimum. A strictly convex function has at most one global minimum.

(image)

Figure 6.1: Convexity property visualization.

All these properties are particularly important in GD-related methods.
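These definitions can be probed numerically. A minimal sketch (illustrative; uses the per-sample squared and absolute losses) that verifies the convexity inequality at random points:

```python
import numpy as np

rng = np.random.default_rng(0)

def convex_ok(f, x1, x2, t, tol=1e-9):
    """Check f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2), up to rounding tolerance."""
    return f(t * x1 + (1 - t) * x2) <= t * f(x1) + (1 - t) * f(x2) + tol

sq = lambda e: e ** 2   # per-sample squared (MSE-style) loss
ab = abs                # per-sample absolute (MAE-style) loss

pairs = rng.uniform(-10, 10, size=(1000, 2))
ts = rng.uniform(0, 1, size=1000)

sq_convex = all(convex_ok(sq, x1, x2, t) for (x1, x2), t in zip(pairs, ts))
ab_convex = all(convex_ok(ab, x1, x2, t) for (x1, x2), t in zip(pairs, ts))
print(sq_convex, ab_convex)  # True True
```

Such a random check can only refute convexity, never prove it, but it is a quick sanity test for a candidate loss.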

6.3 Losses

The per-sample and vector error notations are

\begin{align*} e_i &= y_i - \hat {y}_i\quad i=1,\ldots ,M\\ \be &= \by - \hat {\by } \end{align*} Note that the order of subtraction is sometimes important in the context of this section.

Some of the following losses are also used as metrics.

6.3.1 Mean-squared error (MSE) or L2-loss
  • Continuous, differentiable, convex

The MSE loss (also termed L2-loss) and metric is (Eq. (2.5))

\begin{equation} \begin{aligned} J(\by ,\hat {\by }) &= \frac {1}{M}\norm {\by - \hat {\by }}^2=\frac {1}{M}\be ^T\be \\ &= \frac {1}{M}\sum _{i=1}^M e_i^2 \end {aligned} \end{equation}

When used as a loss, the sum-of-squared error (SSE) (Eq. (2.4)), or the variant with the factor \(\frac {1}{2M}\) that “compensates” for the factor of 2 in the derivative, is sometimes applied instead (see also Sec. 2.1.1).

Important properties:

  • Popular regression loss.

  • Popular metric.

  • Analytical gradient whose magnitude is proportional to the error.

  • Sometimes an analytical solution is available, e.g., the normal equation.

  • The main drawback is inherent outlier sensitivity.
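To make the error-dependent gradient concrete, here is a minimal sketch for a linear model (the data, seed, and function name are illustrative):

```python
import numpy as np

def mse_and_grad(w, X, y):
    """MSE loss and its gradient w.r.t. w for a linear model y_hat = X @ w."""
    M = len(y)
    e = y - X @ w                  # per-sample errors
    loss = (e @ e) / M             # (1/M) * ||e||^2
    grad = -(2.0 / M) * (X.T @ e)  # gradient magnitude scales with the error
    return loss, grad

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                     # noiseless data for illustration

loss0, grad0 = mse_and_grad(w_true, X, y)        # zero error -> zero gradient
loss1, grad1 = mse_and_grad(w_true + 0.5, X, y)  # larger error -> larger gradient
print(loss0, np.linalg.norm(grad0))  # 0.0 0.0
```

The gradient vanishing together with the error is what lets GD with a fixed learning rate settle precisely at the minimum.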

6.3.2 RMSE
  • Continuous, differentiable, convex

A convenient complementary metric to MSE is the root-MSE (RMSE).

\begin{equation} \begin{aligned} J(\by ,\hat {\by }) &= \frac {1}{\sqrt {M}}\norm {\by - \hat {\by }}\\ &= \sqrt {\frac {1}{M}\sum _{i=1}^M e_i^2} \end {aligned} \end{equation}

  • Theoretically, RMSE can also be used as a loss, but it is very similar to MSE, differing only in its gradients.

  • Easier human interpretation, since it is expressed in the same units as \(y\).

6.3.3 Mean absolute error (MAE)

MAE is used both as loss and metric.

\begin{equation} \begin{aligned} \loss (\by ,\hat {\by }) &= \frac {1}{M}\sum _{i=1}^M \abs {e_i} \end {aligned} \end{equation}

Important properties:

  • Popular loss and metric.

  • All errors are similarly important and, therefore, less sensitive to outliers than MSE.

  • Error-independent gradient magnitude, which may result in slower convergence under certain conditions: the gradient stays equally large even for very small errors. To fix this, a dynamic learning rate that decreases as we approach the minimum is required. MSE behaves nicely in this case and converges even with a fixed learning rate: its gradient is large for large errors and decreases as the loss approaches 0, making it more precise at the end of training.

  • Non-differentiable at \(e_i=0\), though without dramatic influence on most learning algorithms.

A brief numerical example of MSE, RMSE and MAE is presented in Table 6.1 and in Fig. 6.2.

Table 6.1: Example of MSE, RMSE and MAE. A small change in error results in a significant change in MSE and a less dramatic change in MAE.

\((y_1,y_2)\) \((\hat {y}_1,\hat {y}_2)\) MSE RMSE MAE
(30,25) (40,30) \(\scriptstyle \frac {1}{2}\left [(40-30)^2 + (30-25)^2\right ]=62.5\) \(\sqrt {62.5}=7.91\) \(\scriptstyle \frac {1}{2}\left [\abs {40-30} + \abs {30-25}\right ]=7.5\)
(30,25) (50,30) \(\scriptstyle \frac {1}{2}\left [(50-30)^2 + (30-25)^2\right ]=212.5\) \(\sqrt {212.5}=14.6\) \(\scriptstyle \frac {1}{2}\left [\abs {50-30} + \abs {30-25}\right ]=12.5\)
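The values in Table 6.1 can be reproduced directly; a minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def mse(y, y_hat):  return np.mean((y - y_hat) ** 2)
def rmse(y, y_hat): return np.sqrt(mse(y, y_hat))
def mae(y, y_hat):  return np.mean(np.abs(y - y_hat))

y = np.array([30.0, 25.0])
y_hat_a = np.array([40.0, 30.0])  # first row of Table 6.1
y_hat_b = np.array([50.0, 30.0])  # second row: one error grew from 10 to 20

print(mse(y, y_hat_a), rmse(y, y_hat_a), mae(y, y_hat_a))  # 62.5  7.90...  7.5
print(mse(y, y_hat_b), rmse(y, y_hat_b), mae(y, y_hat_b))  # 212.5 14.57... 12.5
```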

(image)

Figure 6.2: The model is \(N=4\) polynomial with 40 inliers and three outliers at significantly higher values. Both MSE and MAE were used with \(\lambda \)-optimized L2-regularization (Sec. 4.7). MSE regression is pulled toward outliers due to squaring large errors, while MAE regression provides more robust fitting.

The visual comparison of a few regression losses is presented in Fig. 6.3 and detailed in Sec. 6.3.4.

(image)

Figure 6.3: Visualization of different loss functions. MSE is the steepest.
6.3.4 Special Losses (*)
Huber loss
  • Lipschitz continuous, differentiable, convex

  • Goal: Hybrid between MAE and MSE.

\begin{equation} \loss (e_i) = \begin{cases} \dfrac {1}{2}e_i^2 & \abs {e_i}\le \delta \\[7pt] \delta \left (\abs {e_i} - \dfrac {1}{2}\delta \right ) & \text {otherwise} \end {cases} \end{equation}

For small \(e_i\) it behaves like MSE and for larger \(e_i\) like MAE.

The problem with Huber loss is the need to tune the hyperparameter \(\delta \), which is a non-trivial process.
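The piecewise definition vectorizes naturally; a minimal sketch (the function name and default \(\delta =1\) are illustrative):

```python
import numpy as np

def huber(e, delta=1.0):
    """Huber loss: quadratic (MSE-like) for |e| <= delta, linear (MAE-like) beyond."""
    e = np.asarray(e, dtype=float)
    quad = 0.5 * e ** 2
    lin = delta * (np.abs(e) - 0.5 * delta)
    return np.where(np.abs(e) <= delta, quad, lin)

print(huber(0.5))   # 0.125 (quadratic regime)
print(huber(10.0))  # 9.5   (linear regime)
print(huber(1.0))   # 0.5   (both branches agree at |e| = delta)
```

The last line illustrates why the two branches join smoothly: at \(\abs {e_i}=\delta \) both value and derivative match.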

Log-cosh loss
  • Goal: Hybrid between MAE and MSE without hyper-parameters.

\begin{equation} \loss (e_i) = \log {\cosh {e_i}} \end{equation}

Properties:

  • Twice differentiable everywhere, \(\dfrac {\partial }{\partial e_i}L(e_i)=\tanh {e_i}\).

  • Approximation:

    \begin{equation} L(e_i)\approx \begin{cases} \dfrac {e_i^2}{2} & e_i \text { small}\\[5pt] \abs {e_i}-\log (2) & e_i \text { large} \end {cases} \end{equation}

  • No hyper-parameters.

  • Similar to Huber loss with \(\delta =1\).

  • Requires non-trivial numerical handling; otherwise, optimization may get stuck in either of the two regimes.

  • Explicit hyper-parameter optimization of Huber loss is recommended.
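One concrete numerical pitfall (an assumption about the intended issue): a naive `np.log(np.cosh(e))` overflows in double precision once \(\abs {e}\gtrsim 710\). A stable equivalent uses the identity \(\log \cosh e = \abs {e} + \log \left (1+e^{-2\abs {e}}\right ) - \log 2\):

```python
import numpy as np

def logcosh(e):
    """Numerically stable log(cosh(e)); naive np.log(np.cosh(e)) overflows for |e| > ~710."""
    a = np.abs(np.asarray(e, dtype=float))
    return a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)

print(np.isclose(logcosh(0.0), 0.0))                   # True
print(np.isclose(logcosh(2.0), np.log(np.cosh(2.0))))  # True (matches naive form)
print(np.isfinite(logcosh(1000.0)))                    # True; the naive form gives inf
```

For large \(\abs {e}\) the stable form reduces to \(\abs {e}-\log 2\), matching the approximation above.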

Cauchy
  • Goal: More robust to outliers than MAE.

\begin{equation} \loss (e_i) = \log \left (1+\left (\frac {e_i}{d}\right )^2\right ) \end{equation}

  • \(d\) is a “sharpness” parameter.

  • More robust against outliers than MAE, but less than the atan loss.

Atan
  • Goal: More robust to outliers than Cauchy.

\begin{equation} \loss (e_i) = \arctan (e_i^2) \end{equation}

  • Atan tends to \(\pi /2\) as input tends to infinity, and its derivatives tend to 0.

  • This means extreme outliers will have a negligible effect on the search direction compared to non-outliers.

  • It is important to ensure the data is well scaled and that the starting point is a reasonable guess at the true solution, if possible (similar to Log-cosh loss).
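The different saturation behaviors can be compared numerically (a minimal sketch; the function names are illustrative):

```python
import numpy as np

def cauchy(e, d=1.0):
    """Cauchy loss with sharpness parameter d; grows only logarithmically."""
    return np.log1p((np.asarray(e, float) / d) ** 2)

def atan_loss(e):
    """Atan loss; saturates at pi/2 for extreme errors."""
    return np.arctan(np.asarray(e, float) ** 2)

for e in (1.0, 10.0, 1000.0):
    print(e, cauchy(e), atan_loss(e))
# cauchy keeps growing (slowly); atan_loss flattens out near pi/2,
# so extreme outliers contribute almost nothing to the gradient.
```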

6.3.5 Mean Squared Logarithmic Error (MSLE)
  • Goal: Relative loss for high-dynamic range data.

MSLE is the relative difference between the log-transformed actual and predicted values.

\begin{equation} \begin{aligned} \loss (y_i,\hat {y}_i) &= \frac {1}{M}\sum _{i=1}^M \left (\log (y_i+1)-\log (\hat {y}_i+1)\right )^2\\ & = \frac {1}{M}\sum _{i=1}^M\left (\log \frac {y_i+1}{\hat {y}_i+1}\right )^2 \end {aligned} \end{equation}

‘1’ is added to both \(y\) and \(\hat {y}\) for mathematical convenience: \(\log (0)\) is undefined, but both \(y\) and \(\hat {y}\) can be 0.

  • Addresses \(y\) with high dynamic range, i.e., it measures relative error. Less useful for low dynamic range.

  • MSLE treats small and large differences between the actual and predicted values similarly when their relative size is similar, as shown in Table 6.2.

  • Penalizes underestimated values more than overestimated values.

  • Root MSLE (RMSLE) is also used, e.g., in scikit-learn.

Table 6.2: Example of MSLE.

\(y\) \(\hat {y}\) MSE Loss MSLE Loss
40 30 100 0.0782
4000 3000 \(100\times \mathbf {10,\!000}\) 0.0827
20 10 100 0.4181
20 30 100 0.1517
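The MSLE column of Table 6.2 can be reproduced with a few lines (a minimal sketch using natural logarithms, as in the table):

```python
import numpy as np

def msle(y, y_hat):
    """Mean squared logarithmic error; log1p(x) = log(x + 1) guards against log(0)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)

# Similar relative errors give similar MSLE, regardless of absolute scale
print(round(msle([40], [30]), 4))      # 0.0782
print(round(msle([4000], [3000]), 4))  # 0.0827
# Underestimation is penalized more than overestimation of the same magnitude
print(round(msle([20], [10]), 4))      # 0.4181
print(round(msle([20], [30]), 4))      # 0.1517
```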

6.4 Relative Metrics

  • Goal:

    • Add human interpretability to the standard metrics.

    • Used to compare different models.

    • Most of the metrics are normalized to the range of \([0,1]\).

Relative squared error (RSE)

Normalized MSE loss.

\begin{equation} \begin{aligned} J(y_i,\hat {y}_i) &=\dfrac {\MSE }{s_\by ^2}\\ &= \dfrac {\sum _i \left ( y_i - \hat {y}_i\right )^2}{\sum _i\left (y_i-\overline {y}\right )^2} =\frac {\norm {\by - \hat {\by }}^2}{\norm {\by - \bar {\by }}^2} \end {aligned} \end{equation}

Recall that \(\hat {y}_i = \bar {y}\) results in \(\MSE =s_\by ^2\). The metric shows the fraction of unexplained variance: closer to 0 is better, and \(>1\) is abnormal.

R2

The common metric in social sciences,

\begin{equation} R^2 = 1 - \mathrm {RSE} \end{equation}

Opposite to RSE, i.e., \(R^2\) close to 1 is better. In highly undesirable cases, it may be negative.

For example, an \(R^2\) of 40% indicates that the model has reduced the mean squared error by 40% compared to the baseline, which is the mean model. This is the same as an RSE of 60%.

For additional aspects, see also Sec. 2.3.1.
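The two reference points of \(R^2\) (perfect fit and the mean model) follow directly from the definition; a minimal numpy sketch:

```python
import numpy as np

def rse(y, y_hat):
    """Relative squared error: squared error normalized by the total variance of y."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def r2(y, y_hat):
    return 1.0 - rse(y, y_hat)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r2(y, y))                          # 1.0 (perfect fit)
print(r2(y, np.full_like(y, y.mean())))  # 0.0 (mean-model baseline)
```

Any prediction worse than the mean model pushes \(\mathrm {RSE}>1\), and hence \(R^2<0\).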

Normalized Root Mean Squared Error

Expressed as a percentage, defined as:

\begin{equation} J(y_i,\hat {y}_i) =100\left (1-\frac {\RMSE }{s_\by }\right ) =100\left (1-\sqrt {\mathrm {RSE}}\right ) =100\left (1-\frac {\norm {\by - \hat {\by }}}{\norm {\by - \bar {\by }}}\right ) \end{equation}

Relative absolute error (RAE)

Normalized MAE loss.

\begin{equation} J(y_i,\hat {y}_i) = \dfrac {\displaystyle \sum _i \abs {y_i - \hat {y}_i}}{\displaystyle \sum _i\abs {y_i-\overline {y}}} = \frac {MAE}{\displaystyle \frac {1}{M}\sum _i\abs {y_i-\overline {y}}} \end{equation}

Recall that \(\hat {y}_i = \bar {y}\) results in \(\mathrm {MAE}=\frac {1}{M}\sum _i\abs {y_i-\overline {y}}\). Closer to 0 is better, and \(>1\) is abnormal.

Mean Absolute Percentage Error (MAPE)

Scaled error metric.

\begin{equation} J = \frac {1}{M}\sum _{i=1}^M\frac {\abs {y_i-\hat {y}_i}}{\abs {y_i}}\times 100\% \end{equation}

  • Beware of small denominator!

  • Can exceed 100%.

  • Asymmetric, as described in Table 6.3.

Table 6.3: Example of MAPE.

\(y\) \(\hat {y}\) Absolute Error MAPE
100 60 40 40%
20 60 40 200%
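The asymmetry can be verified numerically (a minimal sketch; the percentage error is taken relative to the actual value):

```python
import numpy as np

def mape(y, y_hat):
    """Mean absolute percentage error; unstable when y is near 0."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat) / np.abs(y)) * 100.0

# Same absolute error (40), very different MAPE, depending on the actual value
print(mape([100], [60]))  # 40.0
print(mape([20], [60]))   # 200.0
```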

Additional Reading Further reading on regression metrics: [3]

6.5 Information Criteria

  • Goal: Compare models of different complexity by penalizing the number of parameters, using MSE as the goodness-of-fit term.

Adding parameters always improves the training MSE, which encourages overfitting. Information criteria replace the raw MSE with a penalized score that balances goodness-of-fit against the number of parameters \(N\), given \(M\) data points. All criteria below assume that the residuals are approximately Gaussian.

6.5.1 Akaike’s Final Prediction Error (FPE)

FPE estimates the expected prediction error on new data by scaling the training MSE with a bias-correction factor:

\begin{equation} FPE = \MSE \frac {1+N/M}{1-N/M} = \MSE \frac {M+N}{M-N} \end{equation}

6.5.2 Akaike’s Information Criterion (AIC)

AIC trades off goodness-of-fit against complexity. Lower AIC indicates a better model:

\begin{equation} AIC = 2N + M\ln (\MSE ) \end{equation}

AICc adds a small-sample correction, recommended when \(M/N < 40\):

\begin{equation} AICc = AIC + 2N\frac {N+1}{M-N-1} \end{equation}

6.5.3 Bayesian Information Criterion (BIC)

BIC applies a stronger penalty on the number of parameters, especially as the sample size \(M\) grows, so it tends to favor simpler models than AIC:

\begin{equation} BIC = M\ln (\MSE ) + N\ln (M) \end{equation}

These criteria are computed directly from residuals and parameter count without any additional model fitting, and they enable principled comparison between models of different complexity. Their main limitation is the Gaussian-residual assumption, which can be misleading when violated; AIC and BIC can also disagree about which model is best, so no single criterion is universally preferred. A likelihood-based version of these criteria, used for time-series models, appears in Sec. 20.3.
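A minimal sketch (with illustrative numbers) computing all four criteria from the training MSE, sample count \(M\), and parameter count \(N\):

```python
import numpy as np

def info_criteria(mse, M, N):
    """FPE, AIC, AICc and BIC from the training MSE, M samples and N parameters."""
    fpe = mse * (M + N) / (M - N)
    aic = 2 * N + M * np.log(mse)
    aicc = aic + 2 * N * (N + 1) / (M - N - 1)
    bic = M * np.log(mse) + N * np.log(M)
    return {"FPE": fpe, "AIC": aic, "AICc": aicc, "BIC": bic}

# A more complex model must lower the MSE enough to justify its extra parameters
simple = info_criteria(mse=1.00, M=100, N=3)    # illustrative numbers
complex_ = info_criteria(mse=0.98, M=100, N=10)
print(simple["AIC"] < complex_["AIC"])  # True: the small MSE gain does not pay off
```

Here the 2% MSE reduction does not compensate for seven extra parameters, so every criterion prefers the simpler model.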

6.6 Summary

Common Pitfalls:

  • Scale confusion: MSE/RMSE/MAE are scale-dependent; compare across datasets only after normalization or via dimensionless metrics.

  • MAPE near zero: Avoid MAPE when \(y\) can be small or cross zero.