Machine Learning & Signals Learning

\(\newcommand{\footnotename}{footnote}\) \(\def \LWRfootnote {1}\) \(\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\let \LWRorighspace \hspace \) \(\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }\) \(\newcommand {\TextOrMath }[2]{#2}\) \(\newcommand {\mathnormal }[1]{{#1}}\) \(\newcommand \ensuremath [1]{#1}\) \(\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } \) \(\newcommand {\setlength }[2]{}\) \(\newcommand {\addtolength }[2]{}\) \(\newcommand {\setcounter }[2]{}\) \(\newcommand {\addtocounter }[2]{}\) \(\newcommand {\arabic }[1]{}\) \(\newcommand {\number }[1]{}\) \(\newcommand {\noalign }[1]{\text {#1}\notag \\}\) \(\newcommand {\cline }[1]{}\) \(\newcommand {\directlua }[1]{\text {(directlua)}}\) \(\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}\) \(\newcommand {\protect }{}\) \(\def \LWRabsorbnumber #1 {}\) \(\def \LWRabsorbquotenumber "#1 {}\) \(\newcommand {\LWRabsorboption }[1][]{}\) \(\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }\) \(\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }\) \(\def \mathcode #1={\mathchar }\) \(\let \delcode \mathcode \) \(\let \delimiter \mathchar \) \(\def \oe {\unicode {x0153}}\) \(\def \OE {\unicode {x0152}}\) \(\def \ae {\unicode {x00E6}}\) \(\def \AE {\unicode {x00C6}}\) \(\def \aa {\unicode {x00E5}}\) \(\def \AA {\unicode {x00C5}}\) \(\def \o {\unicode {x00F8}}\) \(\def \O {\unicode {x00D8}}\) \(\def \l {\unicode {x0142}}\) \(\def \L {\unicode {x0141}}\) \(\def \ss {\unicode {x00DF}}\) \(\def \SS {\unicode {x1E9E}}\) \(\def \dag {\unicode {x2020}}\) \(\def \ddag {\unicode {x2021}}\) \(\def \P {\unicode {x00B6}}\) \(\def \copyright {\unicode {x00A9}}\) \(\def \pounds {\unicode {x00A3}}\) \(\let \LWRref \ref \) \(\renewcommand {\ref }{\ifstar \LWRref \LWRref }\) \( \newcommand {\multicolumn }[3]{#3}\) \(\require {textcomp}\) \( 
\newcommand {\abs }[1]{\lvert #1\rvert } \) \( \DeclareMathOperator {\sign }{sign} \) \(\newcommand {\intertext }[1]{\text {#1}\notag \\}\) \(\let \Hat \hat \) \(\let \Check \check \) \(\let \Tilde \tilde \) \(\let \Acute \acute \) \(\let \Grave \grave \) \(\let \Dot \dot \) \(\let \Ddot \ddot \) \(\let \Breve \breve \) \(\let \Bar \bar \) \(\let \Vec \vec \) \(\newcommand {\bm }[1]{\boldsymbol {#1}}\) \(\require {physics}\) \(\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}\) \(\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}\) \(\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}\) \(\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}\) \(\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}\) \(\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}\) \(\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}\) \(\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}\) \(\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}\) \(\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}\) \(\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}\) \(\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}\) \(\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}\) \(\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}\) \(\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}\) \(\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}\) \(\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}\) \(\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}\) \(\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}\) \(\renewcommand {\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}\) \(\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}\) \(\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}\) \(\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}\) \(\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}\) \(\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}\) \(\require {cancel}\) \(\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}\) 
\(\DeclareMathOperator *{\argmax }{argmax}\) \(\DeclareMathOperator *{\argmin }{arg\,min}\) \(\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}\) \(\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}\) \(\def \Cov [#1]{\operatorname {Cov}\!\left [ #1 \right ]}\) \(\newcommand {\floor }[1]{\lfloor #1 \rfloor }\) \(\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}\) \(\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}\) \(\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}\) \(\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}\) \(\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}\) \(\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}\) \(\renewcommand {\real }{\mathbb {R}}\) \(\newcommand {\ba }{\mathbf {a}}\) \(\newcommand {\bb }{\mathbf {b}}\) \(\newcommand {\bd }{\mathbf {d}}\) \(\newcommand {\be }{\mathbf {e}}\) \(\newcommand {\bh }{\mathbf {h}}\) \(\newcommand {\bn }{\mathbf {n}}\) \(\newcommand {\bq }{\mathbf {q}}\) \(\newcommand {\br }{\mathbf {r}}\) \(\newcommand {\bt }{\mathbf {t}}\) \(\newcommand {\bv }{\mathbf {v}}\) \(\newcommand {\bw }{\mathbf {w}}\) \(\newcommand {\bx }{\mathbf {x}}\) \(\newcommand {\bxx }{\mathbf {xx}}\) \(\newcommand {\bxy }{\mathbf {xy}}\) \(\newcommand {\by }{\mathbf {y}}\) \(\newcommand {\byy }{\mathbf {yy}}\) \(\newcommand {\bz }{\mathbf {z}}\) \(\newcommand {\bA }{\mathbf {A}}\) \(\newcommand {\bB }{\mathbf {B}}\) \(\newcommand {\bI }{\mathbf {I}}\) \(\newcommand {\bK }{\mathbf {K}}\) \(\newcommand {\bP }{\mathbf {P}}\) \(\newcommand {\bQ }{\mathbf {Q}}\) \(\newcommand {\bR }{\mathbf {R}}\) \(\newcommand {\bU }{\mathbf {U}}\) \(\newcommand {\bW }{\mathbf {W}}\) \(\newcommand {\bX }{\mathbf {X}}\) \(\newcommand {\bY }{\mathbf {Y}}\) \(\newcommand {\bZ }{\mathbf {Z}}\) \(\newcommand {\balpha }{\bm {\alpha }}\) \(\newcommand {\bth }{{\bm {\theta }}}\) 
\(\newcommand {\bepsilon }{{\bm {\epsilon }}}\) \(\newcommand {\bmu }{{\bm {\mu }}}\) \(\newcommand {\bphi }{\bm {\phi }}\) \(\newcommand {\bOne }{\mathbf {1}}\) \(\newcommand {\bZero }{\mathbf {0}}\) \(\newcommand {\btx }{\tilde {\bx }}\) \(\newcommand {\loss }{\mathcal {L}}\) \(\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}\) \(\newcommand {\SSE }{\mathrm {SSE}}\) \(\newcommand {\MSE }{\mathrm {MSE}}\) \(\newcommand {\RMSE }{\mathrm {RMSE}}\) \(\newcommand {\toprule }[1][]{\hline }\) \(\let \midrule \toprule \) \(\let \bottomrule \toprule \) \(\def \LWRbooktabscmidruleparen (#1)#2{}\) \(\newcommand {\LWRbooktabscmidrulenoparen }[1]{}\) \(\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }\) \(\newcommand {\morecmidrules }{}\) \(\newcommand {\specialrule }[3]{\hline }\) \(\newcommand {\addlinespace }[1][]{}\) \(\newcommand {\LWRsubmultirow }[2][]{#2}\) \(\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }\) \(\newcommand {\multirow }[2][]{\LWRmultirow }\) \(\newcommand {\mrowcell }{}\) \(\newcommand {\mcolrowcell }{}\) \(\newcommand {\STneed }[1]{}\) \(\newcommand {\tcbset }[1]{}\) \(\newcommand {\tcbsetforeverylayer }[1]{}\) \(\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}\) \(\newcommand {\tcboxfit }[2][]{\boxed {#2}}\) \(\newcommand {\tcblower }{}\) \(\newcommand {\tcbline }{}\) \(\newcommand {\tcbtitle }{}\) \(\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}\) \(\newcommand {\tcboxmath }[2][]{\boxed {#2}}\) \(\newcommand {\tcbhighmath }[2][]{\boxed {#2}}\)

20 Regression Metrics

  • Goal: Quantify prediction quality and compare models fairly across scales and datasets.

Given targets \(y[n]\) and predictions \(\hat y[n]\), \(n=0,\ldots ,L-1\), the prediction error (residual) is

\begin{equation} e[n]=y[n]-\hat y[n]. \end{equation}

20.1 Scale-Dependent Metrics

Scale-dependent metrics retain the original units of the data, making them appropriate when the absolute error magnitude is meaningful.

MSE and RMSE

\begin{equation} \mathrm {MSE}=\frac {1}{L}\sum _{n} e^2[n],\qquad \mathrm {RMSE}=\sqrt {\mathrm {MSE}}. \end{equation}

Heavily penalizes large errors; sensitive to outliers. MSE is also commonly used as a training loss (differentiable, convex).
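As a minimal NumPy sketch (function names are illustrative, not from a library), MSE and RMSE can be computed directly from the residuals:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: (1/L) * sum of e^2[n]."""
    e = np.asarray(y, float) - np.asarray(y_hat, float)
    return float(np.mean(e ** 2))

def rmse(y, y_hat):
    """Root mean squared error, in the same units as y."""
    return float(np.sqrt(mse(y, y_hat)))

y     = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 2.0]   # a single error of magnitude 2
print(mse(y, y_hat))    # (0 + 0 + 0 + 4) / 4 = 1.0
print(rmse(y, y_hat))   # 1.0
```

Note how a single error of 2 contributes 4 to the sum of squares, illustrating the quadratic penalty on large errors.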

MAE and MedianAE

\begin{equation} \mathrm {MAE}=\frac {1}{L}\sum _{n}|e[n]|,\qquad \mathrm {MedAE}=\operatorname {median}_n\,|e[n]|. \end{equation}

Robust to outliers (especially MedAE). MAE is also used as a training loss (convex, but nondifferentiable at \(0\)).
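A short sketch of both metrics (illustrative helper names), showing how a single outlier shifts MAE but barely moves MedAE:

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error: (1/L) * sum of |e[n]|."""
    e = np.abs(np.asarray(y, float) - np.asarray(y_hat, float))
    return float(np.mean(e))

def medae(y, y_hat):
    """Median absolute error: median over n of |e[n]|."""
    e = np.abs(np.asarray(y, float) - np.asarray(y_hat, float))
    return float(np.median(e))

y     = [0.0, 0.0, 0.0, 0.0, 0.0]
y_hat = [0.1, -0.1, 0.1, -0.1, 5.0]   # four small errors, one outlier
print(mae(y, y_hat))     # (4 * 0.1 + 5.0) / 5 = 1.08
print(medae(y, y_hat))   # 0.1, unaffected by the outlier
```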

20.2 Scale-Free Metrics

When comparing forecasts across series with different units or scales, scale-free metrics normalize the error so that results are comparable.

MAPE and sMAPE

\begin{equation} \mathrm {MAPE}=\frac {100}{L}\sum _n \frac {|e[n]|}{|y[n]|+\varepsilon },\qquad \mathrm {sMAPE}=\frac {100}{L}\sum _n \frac {2|e[n]|}{|y[n]|+|\hat y[n]|+\varepsilon }. \end{equation}

A small \(\varepsilon \) is added to avoid division by zero. sMAPE is bounded in \([0,200]\%\) but remains biased when target values are close to zero.
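The two percentage metrics can be sketched as follows (assumed helper names; `eps` plays the role of \(\varepsilon\) above):

```python
import numpy as np

def mape(y, y_hat, eps=1e-12):
    """Mean absolute percentage error, in percent."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(100.0 * np.mean(np.abs(y - y_hat) / (np.abs(y) + eps)))

def smape(y, y_hat, eps=1e-12):
    """Symmetric MAPE, bounded in [0, 200] percent."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(100.0 * np.mean(2.0 * np.abs(y - y_hat)
                                 / (np.abs(y) + np.abs(y_hat) + eps)))

y, y_hat = [100.0, 200.0], [110.0, 180.0]
print(mape(y, y_hat))    # close to 10.0 (10% error on each sample)
print(smape(y, y_hat))   # close to 10.0, slightly different weighting
```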

MASE (Mean Absolute Scaled Error)

MASE normalizes the MAE by the error of an in-sample naive forecast (\(\hat {y}[n]=y[n-1]\)):

\begin{equation} \mathrm {MASE}= \frac {\frac {1}{L}\sum _{n}|e[n]|} {\frac {1}{L-1}\sum _{n=1}^{L-1}|y[n]-y[n-1]|}. \end{equation}

MASE \(<1\) means the model outperforms the naive baseline. The metric is robust across scales and can be extended to a seasonal naive denominator when seasonality is present.
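A direct transcription of the formula (helper name is illustrative); `np.diff(y)` gives exactly the one-step differences \(y[n]-y[n-1]\) in the denominator:

```python
import numpy as np

def mase(y, y_hat):
    """MAE of the model divided by the in-sample MAE of the
    naive one-step forecast y_hat[n] = y[n-1]."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    mae_model = np.mean(np.abs(y - y_hat))
    mae_naive = np.mean(np.abs(np.diff(y)))   # (1/(L-1)) * sum |y[n]-y[n-1]|
    return float(mae_model / mae_naive)

y     = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.5, 3.5, 4.5]   # constant offset of 0.5
print(mase(y, y_hat))          # 0.5: beats the naive baseline
```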

NRMSE

NRMSE normalizes the RMSE by a measure of the target’s spread:

\begin{equation} \mathrm {NRMSE}_{\sigma }=\frac {\mathrm {RMSE}}{\sigma _y},\qquad \mathrm {NRMSE}_{\mathrm {range}}=\frac {\mathrm {RMSE}}{\max y-\min y}. \end{equation}
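Both normalizations can be sketched as below (assumed helper names). Note that `np.std` defaults to the population standard deviation (`ddof=0`), consistent with \(\sigma_y\) as defined here:

```python
import numpy as np

def nrmse_sigma(y, y_hat):
    """RMSE divided by the standard deviation of the target."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    return float(rmse / np.std(y))

def nrmse_range(y, y_hat):
    """RMSE divided by the range (max - min) of the target."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    return float(rmse / (np.max(y) - np.min(y)))

y     = [0.0, 2.0, 4.0, 6.0]
y_hat = [1.0, 3.0, 5.0, 7.0]     # constant offset, so RMSE = 1
print(nrmse_sigma(y, y_hat))     # 1 / sqrt(5)
print(nrmse_range(y, y_hat))     # 1 / 6
```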

  • Example 20.1: Two predictors are compared on the same target signal (Fig. 20.1). Predictor A has small, uniformly distributed errors. Predictor B is more accurate on most samples but contains three large outliers. The bar chart shows how different metrics rank the two: RMSE penalizes the outliers heavily (B is much worse), MAE is similar for both, and MedAE favors B (ignoring the outliers entirely). This illustrates why reporting multiple metrics provides a more complete picture of prediction quality.

(image)

Figure 20.1: Effect of outliers on metric values. Predictor A has uniform noise; Predictor B has sparse large errors. RMSE is sensitive to outliers, MAE is moderate, and MedAE is robust.
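The comparison in Example 20.1 can be reproduced qualitatively with synthetic errors (the numbers below are illustrative, not the data behind Fig. 20.1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Predictor A: small, uniformly distributed errors on every sample.
e_a = rng.uniform(-0.5, 0.5, size=100)

# Predictor B: tiny errors on most samples, plus three large outliers.
e_b = rng.uniform(-0.1, 0.1, size=100)
e_b[[10, 50, 90]] = [4.0, -5.0, 4.5]

for name, e in [("A", e_a), ("B", e_b)]:
    print(name,
          "RMSE",  round(float(np.sqrt(np.mean(e ** 2))), 3),
          "MAE",   round(float(np.mean(np.abs(e))), 3),
          "MedAE", round(float(np.median(np.abs(e))), 3))
```

With these synthetic errors, RMSE ranks B far worse (the outliers dominate the squared sum), while MedAE strongly favors B, matching the ranking disagreement described in the example.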

20.3 Information Criteria

  • Goal: Compare models of different complexity by penalizing the number of parameters, using the log-likelihood as the goodness-of-fit term.

For a probabilistic model with parameters \(\boldsymbol {\theta }\), the likelihood \(\mathcal {L}(\boldsymbol {\theta })\) is the probability that the model assigns to the observed data, and parameters are commonly estimated by maximizing it. Its logarithm, the log-likelihood \(\ln \mathcal {L}\), is preferred because it turns products into sums and is numerically stable; a higher \(\ln \mathcal {L}\) means a better fit to the data. For regression with Gaussian residuals, maximizing the log-likelihood is equivalent to minimizing the MSE, and up to an additive constant

\begin{equation} -2\ln \mathcal {L} \;=\; L\,\ln (\mathrm {MSE}) + \text {const}, \end{equation}

which links the formulas below to the MSE-based variants.
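As a brief sketch of where this relation comes from, assume i.i.d. Gaussian residuals \(e[n]\sim \mathcal N(0,\sigma ^2)\). The log-likelihood of the data is

\begin{equation} \ln \mathcal {L} = -\frac {L}{2}\ln \!\left (2\pi \sigma ^2\right ) - \frac {1}{2\sigma ^2}\sum _{n} e^2[n]. \end{equation}

Substituting the maximum-likelihood estimate \(\hat \sigma ^2=\mathrm {MSE}\) gives \(-2\ln \mathcal {L} = L\ln (\mathrm {MSE}) + L\left (1+\ln 2\pi \right )\), where the second term is the additive constant.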

Adding parameters always improves the training log-likelihood, which encourages overfitting. Information criteria replace the raw log-likelihood with a penalized score that balances goodness-of-fit against the number of parameters \(k\), given \(L\) samples. Lower values indicate a better model.

Akaike’s Information Criterion (AIC)

AIC trades off goodness-of-fit against complexity:

\begin{equation} \mathrm {AIC}=2k-2\ln \mathcal {L}. \end{equation}

AICc adds a small-sample correction and should be preferred when \(L/k\) is small:

\begin{equation} \mathrm {AICc}=\mathrm {AIC}+\frac {2k(k+1)}{L-k-1}. \end{equation}

Bayesian Information Criterion (BIC)

BIC applies a stronger penalty on the number of parameters, especially as the sample size \(L\) grows, so it tends to favor simpler models than AIC:

\begin{equation} \mathrm {BIC}=k\ln L-2\ln \mathcal {L}. \end{equation}

These criteria are computed from the fitted log-likelihood and the parameter count, requiring no additional model fitting, and they enable principled comparison between models of different complexity. Their main limitation is sensitivity to the assumed likelihood (commonly Gaussian residuals); AIC and BIC can also disagree about which model is best, so no single criterion is universally preferred. The MSE-based forms of these criteria, together with Akaike’s Final Prediction Error (FPE), are given in Sec. 6.5.
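Using the MSE-based form \(-2\ln \mathcal {L}=L\ln (\mathrm {MSE})+\text {const}\), the criteria reduce to one-liners (helper names are illustrative; the shared constant is dropped, so values are only comparable between models fitted on the same data):

```python
import numpy as np

def aic_mse(L, k, mse):
    """AIC up to an additive constant: L*ln(MSE) + 2k."""
    return float(L * np.log(mse) + 2 * k)

def aicc_mse(L, k, mse):
    """AIC with the small-sample correction term."""
    return aic_mse(L, k, mse) + 2 * k * (k + 1) / (L - k - 1)

def bic_mse(L, k, mse):
    """BIC up to an additive constant: L*ln(MSE) + k*ln(L)."""
    return float(L * np.log(mse) + k * np.log(L))

# Model 2 fits slightly better but uses many more parameters.
L = 50
print(aic_mse(L, k=3,  mse=0.80), bic_mse(L, k=3,  mse=0.80))
print(aic_mse(L, k=10, mse=0.75), bic_mse(L, k=10, mse=0.75))
```

Here both criteria prefer the simpler model: the modest MSE improvement does not justify seven extra parameters, and the BIC gap is larger than the AIC gap because \(\ln 50 > 2\).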

20.4 Summary

Metric         Scale-free   Outlier-robust   Also used as loss
MSE / RMSE     No           No               Yes
MAE / MedAE    No           Yes              Yes (MAE)
MAPE / sMAPE   Yes          No               No
MASE           Yes          Yes              No
NRMSE          Yes          No               No