Machine Learning & Signals Learning

\(\newcommand{\footnotename}{footnote}\) \(\def \LWRfootnote {1}\) \(\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\let \LWRorighspace \hspace \) \(\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }\) \(\newcommand {\TextOrMath }[2]{#2}\) \(\newcommand {\mathnormal }[1]{{#1}}\) \(\newcommand \ensuremath [1]{#1}\) \(\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } \) \(\newcommand {\setlength }[2]{}\) \(\newcommand {\addtolength }[2]{}\) \(\newcommand {\setcounter }[2]{}\) \(\newcommand {\addtocounter }[2]{}\) \(\newcommand {\arabic }[1]{}\) \(\newcommand {\number }[1]{}\) \(\newcommand {\noalign }[1]{\text {#1}\notag \\}\) \(\newcommand {\cline }[1]{}\) \(\newcommand {\directlua }[1]{\text {(directlua)}}\) \(\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}\) \(\newcommand {\protect }{}\) \(\def \LWRabsorbnumber #1 {}\) \(\def \LWRabsorbquotenumber "#1 {}\) \(\newcommand {\LWRabsorboption }[1][]{}\) \(\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }\) \(\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }\) \(\def \mathcode #1={\mathchar }\) \(\let \delcode \mathcode \) \(\let \delimiter \mathchar \) \(\def \oe {\unicode {x0153}}\) \(\def \OE {\unicode {x0152}}\) \(\def \ae {\unicode {x00E6}}\) \(\def \AE {\unicode {x00C6}}\) \(\def \aa {\unicode {x00E5}}\) \(\def \AA {\unicode {x00C5}}\) \(\def \o {\unicode {x00F8}}\) \(\def \O {\unicode {x00D8}}\) \(\def \l {\unicode {x0142}}\) \(\def \L {\unicode {x0141}}\) \(\def \ss {\unicode {x00DF}}\) \(\def \SS {\unicode {x1E9E}}\) \(\def \dag {\unicode {x2020}}\) \(\def \ddag {\unicode {x2021}}\) \(\def \P {\unicode {x00B6}}\) \(\def \copyright {\unicode {x00A9}}\) \(\def \pounds {\unicode {x00A3}}\) \(\let \LWRref \ref \) \(\renewcommand {\ref }{\ifstar \LWRref \LWRref }\) \( \newcommand {\multicolumn }[3]{#3}\) \(\require {textcomp}\) \( 
\newcommand {\abs }[1]{\lvert #1\rvert } \) \( \DeclareMathOperator {\sign }{sign} \) \(\newcommand {\intertext }[1]{\text {#1}\notag \\}\) \(\let \Hat \hat \) \(\let \Check \check \) \(\let \Tilde \tilde \) \(\let \Acute \acute \) \(\let \Grave \grave \) \(\let \Dot \dot \) \(\let \Ddot \ddot \) \(\let \Breve \breve \) \(\let \Bar \bar \) \(\let \Vec \vec \) \(\newcommand {\bm }[1]{\boldsymbol {#1}}\) \(\require {physics}\) \(\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}\) \(\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}\) \(\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}\) \(\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}\) \(\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}\) \(\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}\) \(\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}\) \(\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}\) \(\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}\) \(\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}\) \(\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}\) \(\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}\) \(\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}\) \(\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}\) \(\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}\) \(\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}\) \(\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}\) \(\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}\) \(\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}\) \(\renewcommand {\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}\) \(\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}\) \(\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}\) \(\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}\) \(\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}\) \(\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}\) \(\require {cancel}\) \(\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}\) 
\(\DeclareMathOperator *{\argmax }{argmax}\) \(\DeclareMathOperator *{\argmin }{arg\,min}\) \(\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}\) \(\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}\) \(\def \Cov [#1]{\operatorname {Cov}\!\left [ #1 \right ]}\) \(\newcommand {\floor }[1]{\lfloor #1 \rfloor }\) \(\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}\) \(\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}\) \(\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}\) \(\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}\) \(\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}\) \(\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}\) \(\renewcommand {\real }{\mathbb {R}}\) \(\newcommand {\ba }{\mathbf {a}}\) \(\newcommand {\bb }{\mathbf {b}}\) \(\newcommand {\bd }{\mathbf {d}}\) \(\newcommand {\be }{\mathbf {e}}\) \(\newcommand {\bh }{\mathbf {h}}\) \(\newcommand {\bn }{\mathbf {n}}\) \(\newcommand {\bq }{\mathbf {q}}\) \(\newcommand {\br }{\mathbf {r}}\) \(\newcommand {\bt }{\mathbf {t}}\) \(\newcommand {\bv }{\mathbf {v}}\) \(\newcommand {\bw }{\mathbf {w}}\) \(\newcommand {\bx }{\mathbf {x}}\) \(\newcommand {\bxx }{\mathbf {xx}}\) \(\newcommand {\bxy }{\mathbf {xy}}\) \(\newcommand {\by }{\mathbf {y}}\) \(\newcommand {\byy }{\mathbf {yy}}\) \(\newcommand {\bz }{\mathbf {z}}\) \(\newcommand {\bA }{\mathbf {A}}\) \(\newcommand {\bB }{\mathbf {B}}\) \(\newcommand {\bI }{\mathbf {I}}\) \(\newcommand {\bK }{\mathbf {K}}\) \(\newcommand {\bP }{\mathbf {P}}\) \(\newcommand {\bQ }{\mathbf {Q}}\) \(\newcommand {\bR }{\mathbf {R}}\) \(\newcommand {\bU }{\mathbf {U}}\) \(\newcommand {\bW }{\mathbf {W}}\) \(\newcommand {\bX }{\mathbf {X}}\) \(\newcommand {\bY }{\mathbf {Y}}\) \(\newcommand {\bZ }{\mathbf {Z}}\) \(\newcommand {\balpha }{\bm {\alpha }}\) \(\newcommand {\bth }{{\bm {\theta }}}\) 
\(\newcommand {\bepsilon }{{\bm {\epsilon }}}\) \(\newcommand {\bmu }{{\bm {\mu }}}\) \(\newcommand {\bOne }{\mathbf {1}}\) \(\newcommand {\bZero }{\mathbf {0}}\) \(\newcommand {\loss }{\mathcal {L}}\) \(\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}\) \(\newcommand {\SSE }{\mathrm {SSE}}\) \(\newcommand {\MSE }{\mathrm {MSE}}\) \(\newcommand {\RMSE }{\mathrm {RMSE}}\) \(\newcommand {\toprule }[1][]{\hline }\) \(\let \midrule \toprule \) \(\let \bottomrule \toprule \) \(\def \LWRbooktabscmidruleparen (#1)#2{}\) \(\newcommand {\LWRbooktabscmidrulenoparen }[1]{}\) \(\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }\) \(\newcommand {\morecmidrules }{}\) \(\newcommand {\specialrule }[3]{\hline }\) \(\newcommand {\addlinespace }[1][]{}\) \(\newcommand {\LWRsubmultirow }[2][]{#2}\) \(\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }\) \(\newcommand {\multirow }[2][]{\LWRmultirow }\) \(\newcommand {\mrowcell }{}\) \(\newcommand {\mcolrowcell }{}\) \(\newcommand {\STneed }[1]{}\) \(\newcommand {\tcbset }[1]{}\) \(\newcommand {\tcbsetforeverylayer }[1]{}\) \(\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}\) \(\newcommand {\tcboxfit }[2][]{\boxed {#2}}\) \(\newcommand {\tcblower }{}\) \(\newcommand {\tcbline }{}\) \(\newcommand {\tcbtitle }{}\) \(\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}\) \(\newcommand {\tcboxmath }[2][]{\boxed {#2}}\) \(\newcommand {\tcbhighmath }[2][]{\boxed {#2}}\)

5 Overfitting and underfitting

  • Goal:

    • Interpret cross-validation results.

    • Identify fundamental trade-offs and/or relations between:

      • Model complexity and generalization performance.

      • Model complexity and the size of dataset.

5.1 Definition

Overfitting happens when a model is too complex, i.e. has too many parameters.

  • Too many parameters relative to the number of observations. As a rule of thumb, we expect at least an order of magnitude more observations than parameters.

  • Follows the training data very closely.

  • Fails to generalize well to unseen data.

Significant performance difference between train and test datasets.

Underfitting happens when a model is too simple.

  • Unable to capture the underlying pattern of the data and hence misses the trends in the data.

  • Performs poorly on the training data and fails to generalize.

Poor performance on both train and test datasets.

Overfitting and underfitting are complementary, and balancing between them is key to building robust machine learning models that perform well on new, unseen data, i.e. generalize well. This principle is illustrated in Fig. 5.1.

(image)

Figure 5.1: Increasing model complexity reduces the prediction error on the training data, but beyond the point of transition from underfitting to overfitting it increases the error on the test data.

A numerical example of the underfitting-overfitting trade-off for the polynomial model (Sec. 4.1) is presented in Fig. 5.2.

(image)

Figure 5.2: Overfitting and underfitting polynomial example.
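The train/test behavior in Fig. 5.2 can be sketched numerically. The snippet below is a toy setup, not the exact data of Sec. 4.1: the sine ground truth, noise level, degrees, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)            # noisy training data
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.normal(size=x_test.size)

results = {}
for deg in (1, 3, 12):                                               # underfit, balanced, overfit
    w = np.polyfit(x, y, deg)                                        # least-squares polynomial fit
    results[deg] = (np.mean((np.polyval(w, x) - y) ** 2),            # train MSE
                    np.mean((np.polyval(w, x_test) - y_test) ** 2))  # test MSE
```

The training error decreases monotonically with the degree (the models are nested), while the test error typically stops improving and may grow once the fit starts chasing the noise.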

5.2 Bias-Variance Trade-off

  • Goal: Understand the inherent bias-variance trade-off behind underfitting and overfitting.

5.2.1 Noisy Deterministic Function Interpretation

The data are assumed to be generated by the model

\begin{equation} y_i = h(\bx _i) + \epsilon _i, \end{equation}

where \(h(\bx )\) is some unknown function and \(\epsilon \) is random noise with unknown distribution, zero mean and variance \(\sigma ^2\) \((\E [\epsilon ]=0,\Var [\epsilon ]=\sigma ^2)\).

A model predicts outputs via \(\hat {y}_i = f(\bx _i;\bth )\).

Bias The bias of the model is the expected difference between predictions and the true function,

\begin{equation} \mathrm {Bias}[\hat {y}] = \E [\hat {y}] - \E [y] = \E [f(\bx ;\bth )] - h(\bx ) \end{equation}

where the expectation is taken over all possible training samples.

For a specific dataset, the empirical bias can be estimated as:

\begin{equation} \mathrm {Bias}_{\text {empirical}} = \frac {1}{M}\sum _{i=1}^M \hat {y}_i - \frac {1}{M}\sum _{i=1}^M h(\bx _i) \end{equation}

The bias and variance are illustrated in Fig. 5.3.

(image)

Figure 5.3: Illustration of bias and variance.
5.2.2 Bias and Variance Trade-off

The expected prediction error (MSE) for a new point, assuming the prediction \(\hat {y}\) is independent of the noise \(\epsilon \) at that point, can be decomposed as:

\begin{equation} \begin{aligned} \loss (\by ,\hat {\by }) &= \E [(\hat {y}-y)^2] =\E [\hat {y}^2] -2 \E [y]\E [\hat {y}] + \E [y^2]\\ \E [\hat {y}^2] &= \Var [\hat {y}] + \mathbb {E}^2\left [\hat {y}\right ]\\ \E [y] &=\E [h(\bx )+\epsilon ]=h(\bx )\\ \E [y^2] &= \E [(h(\bx ) + \epsilon )^2] = \E [h^2(\bx )] + 2\E [h(\bx )]\cancelto {0}{\E [\epsilon ]} + \E [\epsilon ^2]\\ &= h^2(\bx ) + \sigma ^2\\ \loss &=\Var [\hat {y}] + \mathbb {E}^2\left [\hat {y}\right ] - 2h(\bx )\E [\hat {y}] + h^2(\bx ) + \sigma ^2 \\ &=\underbrace {\left (\E [\hat {y}] - h(\bx )\right )^2}_{\mathrm {bias}^2} + \underbrace {\Var [\hat {y}]}_{\mathrm {variance}} + \underbrace {\sigma ^2}_{\mathrm {noise}} \end {aligned} \end{equation}

This decomposition shows that the total prediction error consists of three components: squared bias, variance, and irreducible noise. The inherent bias-variance trade-off is presented in Fig. 5.4. Underfitting corresponds to low variance and high bias, and overfitting to high variance and low bias. The best model balances the two. However, some poorly chosen models can have high bias and high variance simultaneously.
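The decomposition can be verified with a Monte Carlo sketch. The estimator below (the sample mean predicting a constant \(h\)) and all numbers are illustrative assumptions; any estimator independent of the test-point noise would do.

```python
import numpy as np

# Monte Carlo check of MSE = bias^2 + variance + noise.
# Toy setting: y = h + eps, and the model predicts the training-sample mean.
rng = np.random.default_rng(0)
h = 2.0            # true value h(x) at the test point
sigma = 0.5        # noise standard deviation
M = 10             # training-set size
R = 200_000        # number of independent training-set draws

y_hat = rng.normal(h, sigma, size=(R, M)).mean(axis=1)   # one prediction per dataset
y_new = rng.normal(h, sigma, size=R)                     # fresh test targets

mse = np.mean((y_hat - y_new) ** 2)
bias2 = (y_hat.mean() - h) ** 2                          # ~0: the sample mean is unbiased
var = y_hat.var()                                        # ~sigma^2 / M
```

Up to Monte Carlo error, `mse` matches `bias2 + var + sigma**2`.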

(image)

Figure 5.4: Illustration of bias and variance.

5.3 Normalization and Standardization

  • Goal: Data pre-processing to reduce possible overfitting.

Values in different columns of \(\bX \) (vectors \(\bx _j\)) may differ by orders of magnitude, i.e. \(\norm {\bx _i}\gg \norm {\bx _j}\). This results in:

  • Some columns have significantly higher influence on \(\hat {\by }\).

  • Numerical instabilities.

Standardization (z-score normalization)

Mapping all the input values such that they follow a distribution with zero mean and unit variance.

\begin{equation} z_{std} = \frac {\bx -\bar {\bx }}{\sigma _\bx }, \end{equation}

where

\begin{align*} \bar {\bx } &= \frac {1}{M}\sum _{j=1}^Mx_j\\ \sigma _\bx ^2 &= \frac {1}{M}\sum _{j=1}^M\left (x_j - \bar {\bx }\right )^2 \end{align*}

Implementation steps:

  • 1. On train dataset, evaluate \(\bar {\bx }\) and \(\sigma _\bx \).

  • 2. Apply normalization on train dataset, using \(\bar {\bx }\) and \(\sigma _\bx \).

  • 3. Apply normalization on test dataset, using same \(\bar {\bx }\) and \(\sigma _\bx \) (no recalculation).

When standardization is applied to \(\by \), the model output is transformed back via \(\hat {\by } = \hat {\by }_{std}\sigma _\by + \bar {\by }\).

Example: zscore command in Matlab and sklearn.preprocessing.StandardScaler in Python.
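The three implementation steps can be sketched directly in NumPy (array shapes and values are illustrative); `sklearn.preprocessing.StandardScaler` wraps the same fit-on-train, transform-both pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Step 1: estimate mean and standard deviation on the training set only
x_bar = X_train.mean(axis=0)
sigma_x = X_train.std(axis=0)

# Step 2: standardize the training set
X_train_std = (X_train - x_bar) / sigma_x

# Step 3: standardize the test set with the SAME statistics (no recalculation)
X_test_std = (X_test - x_bar) / sigma_x
```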

Normalization

Mapping all values of a feature to be in the range \([0,1]\) by the transformation

\begin{equation} x_{norm} = \frac {x-x_{min}}{x_{max}-x_{min}} \end{equation}

Implementation steps for normalization are similar to standardization.
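A minimal min-max sketch with toy values, computing \(x_{min}\) and \(x_{max}\) on the training set only:

```python
import numpy as np

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[2.0, 40.0]])

x_min = X_train.min(axis=0)          # per-feature minimum from the training set
x_max = X_train.max(axis=0)

X_train_norm = (X_train - x_min) / (x_max - x_min)   # each column mapped to [0, 1]
X_test_norm = (X_test - x_min) / (x_max - x_min)     # may fall outside [0, 1]
```

Note that test-set values outside the training range map outside \([0,1]\), which is expected and not an error.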

Beware: the terms normalization and standardization are often used interchangeably.

Pros:

  • Improves convergence speed of gradient-based optimization.

  • Prevents features with large magnitudes from dominating distance-based models.

  • Standardization of both \(\bX \) and \(\by \) results in \(w_0=0\) and removes the requirement for \(\bOne \) column in \(\bX \).

Cons:

  • Min-max normalization is sensitive to outliers.

  • Scaling parameters must be computed on the training set and applied consistently to the test set.

5.4 Dataset size

  • Goal: Influence of dataset size on the underfitting-overfitting trade-off.

Adding data usually improves the performance of an overly complex model, as presented in Fig. 5.5.

(image)

Figure 5.5: Increase in dataset size improves the over-complex model performance.

5.5 Regularization

  • Goal: A loss-function tweak that influences the underfitting-overfitting trade-off.

Regularization: Penalty to the loss function

\begin{equation} \loss _{\mathrm {new}} = \loss + \lambda g(\bw ) \end{equation}

where \(\lambda \) is termed the regularization parameter.

\(L_2\)-regularization: Special case with

\begin{equation} g(\bw ) = \dfrac {1}{2M}\left \| \bw \right \|^2 =\frac {1}{2M}\sum _{i=1}^N w_i^2 \end{equation}

where the vector of weights \(\bw \) does not include the \(w_0\) weight. Moreover, when normalization is used, no \(w_0\) weight is needed. Two special cases are:

  • \(\lambda \rightarrow 0\) recovers the original loss function (overfitting).

  • \(\lambda \rightarrow \infty \) makes the loss function independent of \(\bw \) (underfitting).

Polynomial regression example The resulting coefficients \(w_i\) are significantly larger for higher \(i\) (Fig. 5.2). Shrinking them results in a smooth prediction \(\hat {y}\). The influence of \(\lambda \) is illustrated in Fig. 5.6.

(image)

Figure 5.6: Illustration of regularization influence on the polynomial model.

(image)

Figure 5.7: Dependence of MSE on \(\lambda \) for regularization in Fig. 5.6 (polynomial regression). Overfitting is on the left \((\lambda \rightarrow 0)\) and underfitting on the right \((\lambda \rightarrow \infty )\).

The corresponding underfitting-overfitting \(\lambda \)-related trade-off is presented in Fig. 5.7.

5.5.1 Ridge Regression

Ridge regression: \(L_2\)-regularized linear regression.

\begin{equation} \begin{aligned} \mathcal {L}(\bw ) &= \frac {1}{2M}\left \| \by - \bX \bw \right \|^2 + \frac {\lambda }{2M}\underbrace {\left \| \bw \right \|^2}_{\sum _{i=1}^{N}w_i^2}\\ &= \frac {1}{2M}\left (\by - \bX \bw \right )^T \left (\by - \bX \bw \right ) + \frac {\lambda }{2M}\bw ^T\bw \end {aligned} \end{equation}

Derivative

\begin{equation} \nabla \mathcal {L}(\bw ) =\frac {1}{M}\left (-\bX ^T\left (\by - \bX \bw \right ) + \lambda \bw \right )= 0 \end{equation}

Solution

\begin{align*} &-\bX ^T\left (\by - \bX \bw \right ) + \lambda \bw = -\bX ^T \by + \bX ^T\bX \bw + \lambda \bw = 0\\ &\bX ^T \by = \bX ^T\bX \bw + \lambda \bw \end{align*} Finally, the regularized weights are given by

\begin{equation} \label {eq-rigde-regression-w} \bw = \left (\lambda \mathbf {I} + \bX ^T\bX \right )^{-1}\bX ^T \by \end{equation}
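The closed-form solution above translates directly into NumPy; the toy data below (weights, noise level, sample sizes) are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (lam*I + X^T X)^{-1} X^T y."""
    N = X.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(lam * np.eye(N) + X.T @ X, X.T @ y)

# Standardized toy data (no intercept column needed)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_hat = ridge_fit(X, y, lam=0.1)
```

Using `np.linalg.solve` rather than `np.linalg.inv` is the standard numerically preferable choice for this kind of normal-equation system.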

Ridge regression can be viewed as ordinary linear regression on an augmented dataset, of the form

\begin{align} \tilde {y} &= \tilde {\bX }\bw \nonumber \\ \begin{bmatrix} \by \\ 0 \\ \vdots \\ 0 \end {bmatrix} &= \begin{bmatrix} - & \bX & - \\ \sqrt {\lambda } & & 0 \\ & \ddots & \\ 0 & & \sqrt {\lambda } \end {bmatrix} \begin{bmatrix} w_1\\ \vdots \\ w_N \end {bmatrix} +w_0 \end{align}

Slope interpretation A higher value of \(\lambda \) shrinks (mostly) the largest slopes (weights \(w_i\)) \(\Rightarrow \) the dependence on the features with these weights is reduced.

Eigenvalues interpretation Adding \(\lambda \mathbf {I}\) shifts all eigenvalues of \(\bX ^T\bX \) up by \(\lambda \), so the smallest eigenvalue of the matrix being inverted is at least \(\lambda \), which improves the numerical stability.

Bias-Variance Trade-off The bias-variance trade-off is illustrated in Fig. 5.8.

(image)

Figure 5.8: Regularization trade-off for ridge regression: Test error \(\bar {J}_{\text {test}}\) is minimized at optimal \(\lambda \). Note, the trends are opposite to Fig. 5.4.
GD Solution

\begin{equation} \label {eq-gd-lr-ls} \begin{aligned} \bw _{n+1} &= \bw _{n} - \frac {\alpha }{M}\left [\bX ^T\left (\bX \bw _n - \by \right ) +\lambda \bw _n\right ]\\ &=\bw _{n}\left (1-\frac {\alpha }{M}\lambda \right ) - \frac {\alpha }{M}\bX ^T\left (\bX \bw _n - \by \right ) \end {aligned} \end{equation}

The \(\alpha \) value is typically chosen such that

\begin{equation} 0.99\gtrsim \left (1-\dfrac {\alpha }{M}\lambda \right )\gtrsim 0.95 \end{equation}
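The gradient-descent update in the weight-decay form above can be sketched and checked against the closed-form solution; the toy data and the choices \(\lambda =0.5\), \(\alpha =0.1\) are illustrative assumptions.

```python
import numpy as np

def ridge_gd(X, y, lam, alpha, n_iter=5000):
    """GD for L2-regularized least squares, written in the weight-decay form:
    w <- w*(1 - alpha*lam/M) - (alpha/M) * X^T (X w - y)."""
    M, N = X.shape
    w = np.zeros(N)
    for _ in range(n_iter):
        w = w * (1 - alpha * lam / M) - (alpha / M) * (X.T @ (X @ w - y))
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))
y = X @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=80)

lam = 0.5
w_gd = ridge_gd(X, y, lam, alpha=0.1)
# Closed-form reference: w = (lam*I + X^T X)^{-1} X^T y
w_closed = np.linalg.solve(lam * np.eye(2) + X.T @ X, X.T @ y)
```

With a small enough step size the iteration converges to the closed-form ridge weights, which makes the equivalence of the two solutions easy to verify numerically.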