4 Model Characterization
Goal: Understand what makes a model perform well on new, unseen data:
• Why some models are too simple and others too complex.
• How to measure and compare model performance reliably.
• How to control model complexity and improve generalization.
• How the amount of data and its preprocessing affect results.
4.1 Uni-variate Polynomial Model
The \(L\)-degree uni-variate polynomial regression model is
\(\seteqnumber{0}{}{0}\)\begin{equation} \begin{aligned} \hat {y} = f_\bw (x) &= w_0 + w_1x + w_2x^2 + \cdots + w_Lx^L\\ &=\sum _{j=0}^{L} w_jx^j \end {aligned} \end{equation}
This problem becomes linear under the change of variables \(z_{j} = x^j,\; j=0,\ldots ,L\),
\(\seteqnumber{0}{}{1}\)\begin{equation} \hat {y} =\sum _{j=0}^{L} w_jz_{j} \end{equation}
The corresponding prediction for the dataset \(\left \{x_k,y_k\right \}_{k=1}^M\) can be written compactly in matrix notation, where \(\bX \) is the so-called Vandermonde matrix,
\(\seteqnumber{0}{}{2}\)\begin{equation} \renewcommand {\arraystretch }{1.3} \bX = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^L\\ 1 & x_2 & x_2^2 & \cdots & x_2^L\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_M & x_M^2 & \cdots & x_M^L\\ \end {bmatrix} \end{equation}
The optimal weight vector then follows directly from (3.5),
\(\seteqnumber{0}{}{3}\)\begin{equation*} \bw _{opt} = \left (\bX ^T\bX \right )^{-1}\bX ^T\by \end{equation*}
Hyper-parameter: A hyper-parameter is a parameter that defines the model structure or the learning process, and is set before training rather than learned from the data. In contrast to the model parameters \(\bw \), which are optimized during training, hyper-parameters are chosen by the designer (further discussed in Sec. 7.1.2).
The value of \(L\) in the polynomial model is an example of a hyper-parameter, as it determines the model complexity.
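As a concrete illustration, the fit can be computed in a few lines of NumPy. The following is a minimal sketch (the function names are chosen here for illustration; the noisy cubic from Example 4.2 is reused as data):

```python
import numpy as np

def fit_polynomial(x, y, L):
    # Vandermonde matrix (4.3): columns 1, x, x^2, ..., x^L
    X = np.vander(x, N=L + 1, increasing=True)
    # Least-squares solution of the normal equations; numerically
    # safer than forming (X^T X)^{-1} explicitly
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict(w, x):
    return np.vander(x, N=len(w), increasing=True) @ w

# Usage: fit a cubic to noisy samples
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 25)
y = 0.2 + x + 3 * x**2 + 0.1 * x**3 + 0.1 * rng.normal(size=x.size)
w_opt = fit_polynomial(x, y, L=3)
```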
4.2 Generalization
Examples of questions of significant importance:
• Is a model \(\hat {y} = f_\bw (x)\) appropriate?
• How do we evaluate a model's performance?
• How do we select optimal hyper-parameter values, e.g. the degree \(L\) of the polynomial model?
Let’s assume that the dataset \((\bX ,\by )\) consists of \(M\) samples drawn from some (unknown) joint probability distribution, \(\mathcal {D}\). The theoretical model performance is given by some average model metric (not loss) over all (theoretically) possible points from \(\mathcal {D}\),1
\(\seteqnumber{0}{}{3}\)\begin{equation} \label {eq-generalization-metric} \bar {J}_{theory} = \lim \limits _{M\rightarrow \infty }\frac {1}{M}\sum _{k=1}^MJ(y_k,\hat {y}_k) \end{equation}
Generalization: The difference between:
• The performance metric over the dataset used to train the model, \(\bar {J}_M\).
• The theoretical performance metric \(\bar {J}_{theory}\) from (4.4).
Better generalization means a smaller difference between the model's performance and the theoretical performance.
The problem is that the distribution \(\mathcal {D}\) is unknown in most practical applications. Moreover, in practice the value of \(M\) is (very) limited, so \(\bar {J}_M\) can differ significantly from \(\bar {J}_{theory}\).
The generalization gap has two main sources:
• Data-related: limited dataset size \(M\), sampling bias, noise in the data, and non-representative samples from \(\mathcal {D}\).
• Model-related: model complexity mismatch (underfitting or overfitting), inappropriate model assumptions, and insufficient regularization.
In the following, methods to reduce both data-related and model-related gaps, such that \(\bar {J}_M\approx \bar {J}_{theory}\), are discussed.
The generalization concept is illustrated in Fig. 4.1.
Table 4.1 summarizes the methods discussed in this chapter and their influence on the generalization gap. Each method primarily addresses either data-related or model-related sources of the gap.
| Method | Data-related | Model-related |
| Cross-validation (Sec. 4.4) | ✓ | ✓ |
| Hyper-parameter tuning (Sec. 4.4) | | ✓ |
| Normalization / Standardization (Sec. 4.5) | ✓ | |
| Dataset size increase (Sec. 4.6) | ✓ | |
| Regularization (Sec. 4.7) | | ✓ |
4.3 Bias-Variance Trade-Off
4.3.1 Bias and Variance
Noisy deterministic function model The general model underlying the dataset is
\(\seteqnumber{0}{}{4}\)\begin{equation} y_i = h(\bx _i) + \epsilon _i, \end{equation}
where \(h(\bx )\) is some unknown function and \(\epsilon \) is random noise with unknown distribution, zero mean (\(\bar {\epsilon }=0\)), and variance \(\sigma ^2\) \((\Var [\epsilon ]=\sigma ^2)\).
A model predicts outputs via \(\hat {y}_i = f_\bw (\bx _i)\) with error \(e_i = \hat {y}_i-y_i\).
Bias The bias of the model is the expected difference between predictions and the true function,
\(\seteqnumber{0}{}{5}\)\begin{equation} \mathrm {Bias}[\hat {y}] = \E [\hat {y}] - \E [y] = \E [f_\bw (\bx )] - h(\bx ) \end{equation}
where the expectation is taken over all possible training samples.
For a specific dataset, the empirical bias can be estimated as (cf. the empirical decomposition in Sec. 4.8):
\(\seteqnumber{0}{}{6}\)\begin{equation} \mathrm {Bias}_{\text {emp}} = \bar {\hat {y}} - \bar {h}, \qquad \bar {\hat {y}} = \frac {1}{M}\sum _{i=1}^M \hat {y}_i, \quad \bar {h} = \frac {1}{M}\sum _{i=1}^M h(\bx _i) \end{equation}
Variance The variance of the model measures the variability of predictions across different training sets,
\(\seteqnumber{0}{}{7}\)\begin{equation} \Var [\hat {y}] = \mathbb {E}\!\left [\left (\hat {y} - \E [\hat {y}]\right )^2\right ] \end{equation}
The bias and variance are illustrated conceptually in Fig. 4.2.
The Explicit Bias-Variance Trade-off Formula The core of model evaluation is understanding the total expected prediction error (MSE). For a new, unseen point \(\bx \), the expected error is given explicitly by the sum of three fundamental components:
\(\seteqnumber{0}{}{8}\)\begin{equation} \label {eq-bias-variance-basics} \E [(y - \hat {y})^2] = \left (\mathrm {Bias}[\hat {y}]\right )^2 + \Var [\hat {y}] + \sigma ^2 \end{equation}
where:
• Squared Bias \(\left (\mathrm {Bias}[\hat {y}]\right )^2\): Error introduced by approximating a real-world problem with a simplified model.
• Variance \(\Var [\hat {y}]\): Error introduced by the model’s sensitivity to small fluctuations or noise in the training set.
• Irreducible Noise \(\sigma ^2\): The inherent variance of the data itself that no model can capture.
Do not confuse model bias with zero-mean error.
• In least-squares regression with an intercept, the training residuals always satisfy \(\frac {1}{n}\sum _{i=1}^n e_i = 0\) (Sec. 3.3.3). This is an algebraic property of the normal equations, not a statement about prediction quality.
• The bias term \(\mathrm {Bias}[\hat {y}] = \E [\hat {y}(\bx )] - h(\bx )\) measures how far the expected prediction, averaged over all possible training sets, deviates from the true function at unseen points.
• A misspecified model, for example a straight line fitted to cubic data, has zero-mean training error in every single run, yet its expected prediction systematically misses the true curve, producing nonzero bias.
The total prediction error consists of these three components (formally derived in Sec. 4.8). The inherent bias-variance trade-off is presented in Fig. 4.3. Underfitting is characterized by low variance and high bias, whereas overfitting is characterized by high variance and low bias. The best-generalizing model balances this inherent bias-variance trade-off to minimize the overall expected error. However, note that some poorly chosen models can simultaneously exhibit both high bias and high variance.
Example 4.1: Recall the two sample variance estimators from Sec. 1.1: the biased estimator \(s_{\text {biased}}^2 = \frac {1}{n}\sum _{i=1}^n(x_i - \bar {x})^2\) and the unbiased estimator \(s_{\text {unbiased}}^2 = \frac {1}{n-1}\sum _{i=1}^n(x_i - \bar {x})^2\), where the true variance is \(\sigma ^2\). Their MSE decomposition is:
| Estimator | \(\mathrm {Bias}^2\) | \(\mathrm {Var}\) | \(\MSE = \mathrm {Bias}^2 + \mathrm {Var}\) |
| \(s_{\text {biased}}^2\;\left (\tfrac {1}{n}\right )\) | \(\dfrac {\sigma ^4}{n^2}\) | \(\dfrac {2(n-1)}{n^2}\,\sigma ^4\) | \(\dfrac {2n-1}{n^2}\,\sigma ^4\) |
| \(s_{\text {unbiased}}^2\;\left (\tfrac {1}{n-1}\right )\) | \(0\) | \(\dfrac {2}{n-1}\,\sigma ^4\) | \(\dfrac {2}{n-1}\,\sigma ^4\) |
Comparing the total MSE:
\(\seteqnumber{0}{}{9}\)\begin{equation*} \MSE _{\text {biased}}= \frac {2n-1}{n^2}\,\sigma ^4 < \frac {2}{n-1}\,\sigma ^4 ={\MSE _{\text {unbiased}}} \quad \text {for all } n \geq 2 \end{equation*}
The biased estimator achieves a lower total error despite its nonzero bias, because eliminating the bias (by using \(\frac {1}{n-1}\)) increases the variance by more than the bias reduction saves. This is the bias-variance trade-off in action: a small amount of bias can be beneficial if it substantially reduces variance.
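A short Monte Carlo check of the table is sketched below, assuming Gaussian samples (for which the variance expressions in the table hold) and \(\sigma ^2 = 1\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 10, 200_000                       # sample size and number of trials
x = rng.normal(0.0, 1.0, size=(K, n))    # Gaussian samples, true variance 1
ss = np.sum((x - x.mean(axis=1, keepdims=True))**2, axis=1)

mse_biased = np.mean((ss / n - 1.0)**2)          # theory: (2n-1)/n^2  = 0.190
mse_unbiased = np.mean((ss / (n - 1) - 1.0)**2)  # theory: 2/(n-1)    ~= 0.222
print(mse_biased, mse_unbiased)
```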
Example 4.2: The bias-variance decomposition (4.9) is verified numerically using a Monte Carlo simulation. The true data-generating function is a cubic polynomial \(h(x) = 0.2 + x + 3x^2 + 0.1x^3\), corrupted by additive Gaussian noise \(\epsilon \sim \mathcal {N}(0, \sigma ^2)\) with \(\sigma = 0.1\). For each polynomial degree \(N = 1, \ldots , 8\), a total of \(K = 200\) independent training sets of size \(M = 25\) are sampled, and a polynomial model \(\hat {y} = \bw ^\top \bx \) is fitted to each.
Figure 4.4 illustrates the effect visually. For \(N=1\) (underfitting), the fitted curves cluster tightly but miss the true function shape — low variance, high bias. For \(N=3\) (matching the true model order), the fits are both accurate and consistent. For \(N=7\) (overfitting), individual fits vary wildly across training sets — low bias in the mean, but high variance.
Figure 4.5 shows the squared bias, variance, and their sum (plus irreducible noise \(\sigma ^2\)) as a function of polynomial degree. The empirical MSE (green crosses) closely matches the theoretical decomposition (black diamonds), confirming (4.9). As the model complexity increases, bias decreases while variance increases, producing the characteristic U-shaped total error curve.
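The experiment can be reproduced with a short script. The following is a sketch under stated assumptions: inputs are drawn uniformly on \([-1,1]\) and the decomposition is evaluated on a fixed test grid (both are assumptions, since the exact sampling scheme of the figures is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
h = lambda x: 0.2 + x + 3 * x**2 + 0.1 * x**3   # true function
sigma, M, K = 0.1, 25, 200
x_test = np.linspace(-1, 1, 50)                  # assumed evaluation grid

for N in range(1, 9):                            # polynomial degree
    preds = np.empty((K, x_test.size))
    for k in range(K):                           # K independent training sets
        x_tr = rng.uniform(-1, 1, M)
        y_tr = h(x_tr) + sigma * rng.normal(size=M)
        preds[k] = np.polyval(np.polyfit(x_tr, y_tr, deg=N), x_test)
    bias2 = np.mean((preds.mean(axis=0) - h(x_test))**2)  # squared bias
    var = np.mean(preds.var(axis=0))                       # variance
    print(f"N={N}: bias^2={bias2:.4f} var={var:.4f} "
          f"total={bias2 + var + sigma**2:.4f}")           # compare to (4.9)
```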
4.3.2 Overfitting and underfitting
The effect of model complexity is illustrated with polynomial regression (Sec. 4.1) in Fig. 4.6. A degree-1 polynomial (top left) underfits: it cannot capture the cubic shape of the data (see Fig. 4.4). A degree-3 model (top right) matches the true data-generating order and generalizes well to test points. Degree-6 and degree-10 models (bottom row) overfit, failing on unseen test data (blue). The annotated weight \(w_i\) in each panel shows that the largest coefficient magnitude grows with the polynomial degree, a hallmark of overfitting.
To evaluate a model, the available dataset is split into a training subset (used to fit the model) and a test subset (used to evaluate it on unseen data). The formal splitting procedures are discussed in Sec. 4.4. The relationship between model complexity and performance on these two subsets reveals two failure modes.
Overfitting occurs when the model is too complex, i.e. has too many parameters.
• Too many parameters relative to the number of observations; as a rule of thumb, we expect at least an order of magnitude more observations than parameters.
• The model follows the training data very closely.
• It fails to generalize to unseen data.
The symptom is a significant performance difference between train and test data.
Underfitting occurs when a model is too simple.
• It is unable to capture the underlying pattern and hence misses the trends in the data.
• It performs poorly on the training data and fails to generalize.
The symptom is poor performance on both train and test datasets.
Overfitting and underfitting are complementary, and balancing them is key to building robust machine learning models that perform well on new, unseen data, i.e. generalize well. This principle is illustrated in Fig. 4.7.
4.4 Cross-validation
Goal: A trial-and-error approach to quantifying generalization performance. Cross-validation is also termed performance assessment. Outcomes:
• Approximated generalization performance, \(\bar {J}_M\approx \bar {J}_{theory}\).
• A hyper-parameter selection guideline.
• Clear separation between train and test datasets.
• A special (validation) dataset for hyper-parameter selection.
The cross-validation methods are illustrated in Fig. 4.8.
Big dataset \(\left (10^4\lesssim M\right )\): train/validation/test
The first step is to reshuffle the dataset into a random order. Then, split it into three distinct datasets:
• Training dataset (50-80%): used for learning the model parameters, e.g. the weights \(\bw \).
• Validation dataset (10-25%): used for assessing the influence of the model hyper-parameters.
• Test dataset (10-25%): used for the performance assessment, which is supposed to be sufficiently close to \(\bar {J}_{theory}\).
Hyper-parameter selection proceeds as follows (a code sketch is given after the list):
1. For each candidate hyper-parameter value (e.g., polynomial degree \(L\), regularization \(\lambda \)), train the model on the training set and evaluate the metric on the validation set.
2. Select the hyper-parameter value that yields the best validation metric.
3. Retrain the model on the combined training and validation sets using the selected hyper-parameter, and report the final performance on the held-out test set.
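A minimal sketch of this three-step procedure with scikit-learn on synthetic data (the 70/15/15 split ratios and the candidate degrees are illustrative assumptions within the quoted ranges):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (1000, 1))
y = (0.2 + X + 3 * X**2 + 0.1 * X**3).ravel() + 0.1 * rng.normal(size=1000)

# 70/15/15 train/validation/test split
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.15,
                                                random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp,
                                                  test_size=0.15 / 0.85,
                                                  random_state=0)

# Steps 1-2: pick the degree L with the best validation metric
val_mse = {}
for L in range(1, 9):
    model = make_pipeline(PolynomialFeatures(L), LinearRegression())
    val_mse[L] = mean_squared_error(y_val,
                                    model.fit(X_train, y_train).predict(X_val))
best_L = min(val_mse, key=val_mse.get)

# Step 3: retrain on train+validation, report on the held-out test set
final = make_pipeline(PolynomialFeatures(best_L), LinearRegression())
final.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
print(best_L, mean_squared_error(y_test, final.predict(X_test)))
```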
Medium datasets \(\left (10^2\lesssim M\lesssim 10^4\right )\): Nested k-fold Cross-Validation
The first step is to reshuffle the dataset into a random order. After randomizing, a nested loop procedure is applied to separate two main objectives: hyper-parameter selection and performance assessment. Conflating these objectives into a single loop leads to optimistic performance estimates because the test data would indirectly influence model selection.
Nested cross-validation separates these objectives into two loops (Fig. 4.8):
• Outer loop (\(k_{\text {out}}\) folds for Performance Assessment): The available dataset is divided into \(k_{\text {out}}\) subsets (folds) of approximately equal size. In each of the \(k_{\text {out}}\) iterations, one fold (red in Fig. 4.8) is held out as the test set for unbiased performance evaluation. The remaining \(k_{\text {out}}-1\) folds form the outer training set.
• Inner loop (\(k_{\text {in}}\) folds for Hyper-parameter Tuning): Within each outer training set, a further \(k_{\text {in}}\)-fold split creates validation folds (yellow in Fig. 4.8). For each candidate hyper-parameter value (e.g., polynomial degree \(L\), regularization parameter \(\lambda \)), the model is trained on the inner training folds (blue) and evaluated on the inner validation fold. The hyper-parameter value that maximizes the average inner validation metric is selected.
After the inner loop selects the best hyper-parameter, the model is retrained on the entire outer training set using that hyper-parameter, and its performance is evaluated on the outer test fold. The final performance estimate is the average over the \(k_{\text {out}}\) outer test metrics:
\(\seteqnumber{0}{}{9}\)\begin{equation} \bar {J} = \frac {1}{k_{\text {out}}} \sum _{i=1}^{k_{\text {out}}} J_i \end{equation}
Usually, \(k_{\text {out}}\) defaults to 5 or 10.
The hyper-parameter selected may differ across outer folds. This is expected — each outer fold sees a slightly different training distribution. The purpose of nested CV is to produce an unbiased performance estimate, not a single set of hyper-parameters.
When hyper-parameter optimization is not required, the inner loop is omitted entirely and the procedure collapses into a standard, single-loop \(k\)-fold cross-validation: the model is trained on the \(k_{\text {out}}-1\) training folds and evaluated directly on the held-out test fold. This process is repeated \(k_{\text {out}}\) times, and the performance is averaged.
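Nested CV is conveniently expressed by composing scikit-learn's GridSearchCV (inner loop) with cross_val_score (outer loop). A minimal sketch on synthetic ridge-regression data (the data and the \(\lambda \) grid are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=300)

inner = KFold(n_splits=5, shuffle=True, random_state=0)  # hyper-parameter tuning
outer = KFold(n_splits=5, shuffle=True, random_state=1)  # performance assessment

# Inner loop: GridSearchCV selects lambda (called alpha in sklearn) per outer fold
search = GridSearchCV(Ridge(), {"alpha": np.logspace(-4, 2, 13)},
                      cv=inner, scoring="neg_mean_squared_error")
# Outer loop: unbiased estimate, averaged over the k_out test folds, cf. (4.10)
scores = cross_val_score(search, X, y, cv=outer, scoring="neg_mean_squared_error")
print("estimated generalization MSE:", -scores.mean())
```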
Grouped Cross-Validation When data is collected from distinct domains or groups (e.g., multiple patients, different sensors, separate locations), standard random splitting can place samples from the same group in both training and test folds. This allows the model to learn domain-specific signatures rather than generalizable patterns. Grouped cross-validation ensures that all samples from a given group are assigned to the same fold, providing a realistic estimate of performance on unseen domains. Further CV splitting strategies are discussed in Sec. 11.1.1.
Example 4.3: Consider \(M=12\) EEG recordings drawn from \(G=3\) patients, with 4 recordings each:
| Recording | Patient | Standard 3-fold | Grouped 3-fold |
| \(x_1\)–\(x_4\) | Patient A | Folds 1, 2, 3 (split) | Fold 1 (all) |
| \(x_5\)–\(x_8\) | Patient B | Folds 1, 2, 3 (split) | Fold 2 (all) |
| \(x_9\)–\(x_{12}\) | Patient C | Folds 1, 2, 3 (split) | Fold 3 (all) |
With standard 3-fold CV, each fold contains recordings from all three patients. A model tested on fold 1 has already seen other recordings from Patients A, B, and C during training, so it can exploit patient-specific signatures (e.g., baseline amplitude, electrode impedance) to improve its predictions. The resulting accuracy is optimistic.
With grouped 3-fold CV, when the model is tested on fold 1 (Patient A), it has never seen any recording from Patient A during training. This forces the model to rely on patterns that generalize across patients and provides an unbiased estimate of performance on unseen patients.
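A minimal sketch of grouped splitting with scikit-learn's GroupKFold, mirroring Example 4.3 (the toy recordings are placeholders; the patient labels act as groups):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)            # 12 recordings x_1 ... x_12
groups = np.repeat(["A", "B", "C"], 4)      # patient of each recording

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    # each test fold contains exactly one patient, never seen in training
    print("test patient:", set(groups[test_idx]))
```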
Very small datasets \(\left (M\lesssim 10^2\right )\): Leave-One-Out Cross-Validation (LOOCV)
Uses \(k\)-fold with \(k=M\), which means that each fold will contain only one data point.
LOOCV is generally recommended only for very small datasets where \(k\)-fold would leave too few samples per fold. For medium and large datasets, \(5\)- or \(10\)-fold cross-validation typically provides a better bias-variance trade-off in the performance estimate itself.
4.5 Normalization and Standardization
Values in different columns of \(\bX \) (feature columns \(\btx _j\)) may differ by orders of magnitude, i.e. \(\norm {\btx _i}\gg \norm {\btx _j}\). This results in:
• Some columns having a significantly higher influence on \(\hat {\by }\).
• Numerical instabilities.
Standardization (z-score normalization)
\begin{equation} \label {eq-zscore} \bz = \frac {\bx -\bar {\bx }}{s_\bx }, \end{equation}
such that \(\bar \bz = 0, s_\bz = 1\).
Implementation steps:2
1. On the train dataset, evaluate \(\bar {\bx }\) and \(s_\bx \).
2. Apply the normalization to the train dataset, using \(\bar {\bx }\) and \(s_\bx \).
3. Apply the normalization to the test dataset, using the same \(\bar {\bx }\) and \(s_\bx \) (no recalculation).
When normalization is applied to \(\by \), the model output is transformed back, \(\hat {\by } = \hat {\by }_{std}\,s_\by + \bar {\by }\).
2 In Python: sklearn.preprocessing.StandardScaler, in Matlab zscore.
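The three implementation steps map directly onto the fit/transform interface of StandardScaler. A minimal sketch on toy data (the data and its scale are arbitrary assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, (80, 3))   # toy features with arbitrary scale
X_test = rng.normal(5.0, 2.0, (20, 3))

scaler = StandardScaler().fit(X_train)    # step 1: mean and std from train only
Z_train = scaler.transform(X_train)       # step 2: standardize train
Z_test = scaler.transform(X_test)         # step 3: reuse train statistics, no refit

# If y is standardized too, predictions are mapped back with the stored
# statistics, e.g. y_hat = y_scaler.inverse_transform(y_hat_std)
```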
Normalization
Mapping all values of a feature to be in the range \([0,1]\) by the transformation3
\(\seteqnumber{0}{}{11}\)\begin{equation} x_{norm} = \frac {x-x_{min}}{x_{max}-x_{min}} \end{equation}
Implementation steps for normalization are similar to standardization.
3 In Python: sklearn.preprocessing.MinMaxScaler
Beware: the terms normalization and standardization are often used interchangeably.
Pros:
• Improves convergence speed of gradient-based optimization.
• Prevents features with large magnitudes from dominating distance-based models.
• Standardization of both \(\bX \) and \(\by \) results in \(w_0=0\) and removes the requirement for the \(\bOne \) column in \(\bX \).
Cons:
• Min-max normalization is sensitive to outliers.
• Scaling parameters must be computed on the training set and applied consistently to the test set.
4.6 Dataset Size
By the law of large numbers, the sample average \(\bar {J}_M = \frac {1}{M}\sum _{k=1}^M J(y_k,\hat {y}_k)\) converges to the theoretical performance (4.4), so the generalization gap \(\bar {J}_M - \bar {J}_{theory}\) shrinks as the number of samples \(M\) grows. However, the benefit of additional data depends on whether the model is overfitting or underfitting, as summarized in Table 4.2 and illustrated in Fig. 4.9.
| | Overfitting (complex) | Underfitting (simple) |
| Training error | Rises | Stays high |
| Test error | Drops | Stays high |
| Gap (test \(-\) train) | Shrinks | Already small |
| More data helps? | Yes (reduces variance) | No (model lacks capacity) |
Beyond a certain dataset size, the performance improvement may be almost negligible or non-existent for a fixed model complexity.
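The behaviour of Table 4.2 can be examined with scikit-learn's learning_curve utility. A sketch for an overfitting-prone model (the degree-7 polynomial on the cubic data is an illustrative choice):

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (500, 1))
y = (0.2 + X + 3 * X**2 + 0.1 * X**3).ravel() + 0.1 * rng.normal(size=500)

model = make_pipeline(PolynomialFeatures(7), LinearRegression())  # complex model
sizes, train_scores, test_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
    scoring="neg_mean_squared_error")

# As M grows, the train-test gap of the overfitting model shrinks (Table 4.2)
for m, tr, te in zip(sizes, -train_scores.mean(axis=1), -test_scores.mean(axis=1)):
    print(f"M={m:3d}: train MSE={tr:.4f}, test MSE={te:.4f}")
```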
4.7 Regularization
Regularization: A penalty term added to the loss function
\(\seteqnumber{0}{}{12}\)\begin{equation} \loss _{\mathrm {new}} = \loss + \lambda g(\bw ) \end{equation}
where \(\lambda \) is termed the regularization parameter.
\(L_2\)-regularization: The special case
\(\seteqnumber{0}{}{13}\)\begin{equation} g(\bw ) = \dfrac {1}{2M}\left \| \bw \right \|^2 =\frac {1}{2M}\sum _{i=1}^N w_i^2 \end{equation}
where the weight vector \(\bw \) does not include the \(w_0\) weight. Moreover, when normalization is used, no \(w_0\) weight is needed. Two special cases are:
• \(\lambda \rightarrow 0\) recovers the original loss function (overfitting).
• \(\lambda \rightarrow \infty \) makes the loss function independent of \(\bw \) (underfitting).
Polynomial regression example The resulting coefficients \(w_i\) grow significantly for higher \(i\) (Fig. 4.6). Reducing them results in a smooth prediction \(\hat {y}\). The influence of \(\lambda \) is illustrated in Fig. 4.10.
The corresponding underfitting-overfitting \(\lambda \)-related trade-off is presented in Fig. 4.11.
4.7.1 Ridge Regression
Ridge regression: \(L_2\)-regularized linear regression.
\(\seteqnumber{0}{}{14}\)\begin{equation} \begin{aligned} \mathcal {L}(\bw ) &= \frac {1}{2M}\left \| \by - \bX \bw \right \|^2 + \frac {\lambda }{2M}\underbrace {\left \| \bw \right \|^2}_{\sum _{i=1}^{N}w_i^2}\\ &= \frac {1}{2M}\left (\by - \bX \bw \right )^T \left (\by - \bX \bw \right ) + \frac {\lambda }{2M}\bw ^T\bw \end {aligned} \end{equation}
Derivative
\(\seteqnumber{0}{}{15}\)\begin{equation} \nabla \mathcal {L}(\bw ) =\frac {1}{M}\left (-\bX ^T\left (\by - \bX \bw \right ) + \lambda \bw \right )= 0 \end{equation}
Solution
\(\seteqnumber{0}{}{16}\)\begin{align*} &-\bX ^T\left (\by - \bX \bw \right ) + \lambda \bw = -\bX ^T \by + \bX ^T\bX \bw + \lambda \bw = 0\\ &\bX ^T \by = \bX ^T\bX \bw + \lambda \bw \end{align*} Finally, the regularized weights are given by
\(\seteqnumber{0}{}{16}\)\begin{equation} \label {eq-rigde-regression-w} \bw = \left (\lambda \mathbf {I} + \bX ^T\bX \right )^{-1}\bX ^T \by \end{equation}
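A minimal sketch of the closed-form solution (4.17), assuming \(\bX \) is standardized (Sec. 4.5) so that no intercept column or \(w_0\) weight is needed:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (4.17): w = (lambda*I + X^T X)^{-1} X^T y.

    X is assumed standardized (no all-ones column), so no weight is
    excluded from the penalty.
    """
    N = X.shape[1]
    # solve() is numerically preferable to an explicit matrix inverse
    return np.linalg.solve(lam * np.eye(N) + X.T @ X, X.T @ y)
```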
Ridge regression can be viewed as a special case of linear regression, of the form
\(\seteqnumber{0}{}{17}\)\begin{align} \tilde {y} &= \tilde {\bX }\bw \nonumber \\ \begin{bmatrix} \by \\ 0 \\ \vdots \\ 0 \end {bmatrix} &= \begin{bmatrix} - & \bX & - \\ \sqrt {\lambda } & & 0 \\ & \ddots & \\ 0 & & \sqrt {\lambda } \end {bmatrix} \begin{bmatrix} w_1\\ \vdots \\ w_N \end {bmatrix} +w_0 \end{align}
Slope interpretation A higher value of \(\lambda \) mostly reduces the largest slopes (weights \(w_i\)) \(\Rightarrow \) the dependency on the features associated with these weights is reduced.
Eigenvalues interpretation Adding \(\lambda \mathbf {I}\) shifts all eigenvalues of \(\bX ^T\bX \) upward by \(\lambda \), so the smallest eigenvalue of the matrix being inverted is at least \(\lambda \), which improves the numerical stability of (4.17).
Bias-Variance Trade-off The bias-variance trade-off for ridge regression is illustrated in Fig. 4.12.
GD Solution
Tailored GD (Sec. 3.4) solution is
\(\seteqnumber{0}{}{18}\)\begin{equation} \label {eq-gd-lr-ls} \begin{aligned} \bw _{n+1} &= \bw _{n} - \alpha \nabla \mathcal {L}(\bw )\\ &=\bw _{n} - \frac {\alpha }{M}\left [\bX ^T\left (\bX \bw _n - \by \right ) +\lambda \bw _n\right ]\\ &=\bw _{n}\left (1-\frac {\alpha }{M}\lambda \right ) - \frac {\alpha }{M}\bX ^T\left (\bX \bw _n - \by \right ) \end {aligned} \end{equation}
The \(\alpha \) value is typically chosen such that
\(\seteqnumber{0}{}{19}\)\begin{equation} 0.99\gtrsim \left (1-\dfrac {\alpha }{M}\lambda \right )\gtrsim 0.95 \end{equation}
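A minimal sketch of the GD updates (4.19), with \(\alpha \) chosen via (4.20) (note that \(\alpha \) must also satisfy the usual GD convergence conditions of Sec. 3.4):

```python
import numpy as np

def ridge_gd(X, y, lam, n_iters=500):
    """Gradient-descent ridge updates of (4.19)."""
    M, N = X.shape
    # pick alpha so that (1 - alpha*lam/M) = 0.98, inside the range of (4.20)
    alpha = 0.02 * M / lam
    w = np.zeros(N)
    for _ in range(n_iters):
        w = w * (1 - alpha * lam / M) - (alpha / M) * (X.T @ (X @ w - y))
    return w
```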
4.8 Bias-Variance Decomposition (*)
Goal: Discussion of the inherent underfitting-overfitting trade-off in (4.9).
The expected prediction error (MSE), taken over all possible training sets and noise realizations, decomposes into three irreducible components:
\(\seteqnumber{0}{}{20}\)\begin{equation} \E [(\hat {y}-y)^2] =\left (\E [\hat {y}] - h(\bx )\right )^2 + \Var [\hat {y}] + \sigma ^2 \end{equation}
where \(\left (\mathrm {Bias}[\hat {y}]\right )^2=\left (\E [\hat {y}] - h(\bx )\right )^2\).
Proof. Starting from:
\(\seteqnumber{0}{}{21}\)\begin{equation} \MSE = \E [(\hat {y}-y)^2] =\E [\hat {y}^2] -2 \E [y\hat {y}] + \E [y^2] \end{equation}
We evaluate each term separately. For the first term, by the variance identity:
\(\seteqnumber{0}{}{22}\)\begin{equation} \E [\hat {y}^2] = \Var [\hat {y}] + \mathbb {E}^2[\hat {y}] \end{equation}
For the cross-term, since \(y = h(\bx ) + \epsilon \) and \(\epsilon \) is independent of \(\hat {y}\):
\(\seteqnumber{0}{}{23}\)\begin{equation} \E [y\hat {y}] = \E [(h(\bx ) + \epsilon )\hat {y}] = h(\bx )\E [\hat {y}] + \cancelto {0}{\E [\epsilon ]}\E [\hat {y}] = h(\bx )\E [\hat {y}] \end{equation}
For the last term:
\(\seteqnumber{0}{}{24}\)\begin{equation} \E [y^2] = \E [(h(\bx ) + \epsilon )^2] = h^2(\bx ) + 2h(\bx )\cancelto {0}{\E [\epsilon ]} + \E [\epsilon ^2] = h^2(\bx ) + \sigma ^2 \end{equation}
Combining all three terms:
\(\seteqnumber{0}{}{25}\)\begin{equation} \begin{aligned} \E [(\hat {y}-y)^2] &=\Var [\hat {y}] + \mathbb {E}^2[\hat {y}] - 2h(\bx )\E [\hat {y}] + h^2(\bx ) + \sigma ^2 \\ &=\left (\E [\hat {y}] - h(\bx )\right )^2 + \Var [\hat {y}] + \sigma ^2 \end {aligned} \end{equation}
Empirical estimate For a specific dataset of \(M\) samples, the MSE can be similarly decomposed:
\(\seteqnumber{0}{}{26}\)\begin{equation} \MSE = \frac {1}{M}\sum _{i=1}^M (\hat {y}_i - y_i)^2 = \text {bias}^2_{\text {emp}} + \text {variance}_{\text {emp}} + \text {noise}_{\text {emp}} \end{equation}
where
\(\seteqnumber{0}{}{27}\)\begin{align} {\text {bias}^2_{\text {emp}}} &=\left (\bar {\hat {y}} - \bar {h}\right )^2,\qquad \bar {\hat {y}} = \frac {1}{M}\sum _{i=1}^M \hat {y}_i, \quad \bar {h} = \frac {1}{M}\sum _{i=1}^M h(\bx _i)\\ \text {variance}_{\text {emp}} &= \frac {1}{M}\sum _{i=1}^M (\hat {y}_i - \bar {\hat {y}})^2\\ \text {noise}_{\text {emp}} &=\frac {1}{M}\sum _{i=1}^M \epsilon _i^2 \end{align}
4.9 Summary
• Overfitting and underfitting are two extremes of model complexity: overfitting follows training data too closely, underfitting fails to capture the underlying pattern.
• The bias-variance trade-off provides the theoretical framework explaining why overfitting (high variance, low bias) and underfitting (high bias, low variance) occur.
• Cross-validation mitigates data-related issues by evaluating model performance across different data splits, and guides model-related decisions such as hyper-parameter selection.
• Hyper-parameter tuning (e.g., polynomial degree \(L\), regularization \(\lambda \)) adjusts model complexity to balance underfitting and overfitting.
• Normalization and standardization are data pre-processing steps that reduce the influence of feature scale imbalance and improve numerical stability.
• Increasing the dataset size directly reduces the data-related gap by making \(\bar {J}_M\) closer to \(\bar {J}_{theory}\), and reduces overfitting in complex models.
• Regularization constrains model complexity via the penalty parameter \(\lambda \), controlling the bias-variance trade-off from the model side.