Machine Learning & Signals Learning
5 Overfitting and underfitting
5.1 Definition
Overfitting happens when a model is too complex, i.e. has too many parameters. Such a model:
• Has too many parameters relative to the number of observations. As a rule of thumb, we expect at least an order of magnitude more observations than parameters.
• Follows the training data very closely.
• Fails to generalize well to unseen data.
The result is a significant performance difference between the train and test datasets.
Underfitting happens when a model is too simple. Such a model:
• Is unable to capture the underlying pattern of the data and hence misses the trends in the data.
• Performs poorly on the training data and fails to generalize.
The result is poor performance on both the train and test datasets.
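The two failure modes above can be observed numerically. The sketch below uses assumed synthetic data (a noisy sine) and a helper name `poly_mse` introduced here for illustration: a low-degree polynomial underfits (high train and test error), while a high-degree one overfits (tiny train error, larger test error).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: noisy sine, split into train and test
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
x_tr, y_tr, x_te, y_te = x[:30], y[:30], x[30:], y[30:]

def poly_mse(deg):
    """Fit a degree-`deg` polynomial on the train split; return train/test MSE."""
    w = np.polyfit(x_tr, y_tr, deg)
    mse = lambda xs, ys: np.mean((np.polyval(w, xs) - ys) ** 2)
    return mse(x_tr, y_tr), mse(x_te, y_te)

for deg in (1, 3, 15):
    tr, te = poly_mse(deg)
    print(f"degree {deg:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

Train error decreases monotonically with degree (nested models), but the train-test gap grows for the high-degree fit, which is the overfitting signature described above.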
Overfitting and underfitting are complementary, and balancing between them is key to building robust machine learning models that perform well on new, unseen data, i.e. generalize well. This principle is illustrated in Fig. 5.1.
A numerical example of the underfitting-overfitting trade-off for the polynomial model (Sec. 4.1) is presented in Fig. 5.2.
5.2 Bias-Variance Trade-off
5.2.1 Noisy Deterministic Function Interpretation
The general model underlying the dataset is
\(\seteqnumber{0}{}{0}\)\begin{equation} y_i = h(\bx _i) + \epsilon _i, \end{equation}
where \(h(\bx )\) is some unknown function and \(\epsilon \) is random noise with an unknown distribution, zero mean and variance \(\sigma ^2\) \((\E [\epsilon ]=0,\Var [\epsilon ]=\sigma ^2)\).
A model predicts outputs via \(\hat {y}_i = f(\bx _i;\bw )\).
Bias The bias of the model is the expected difference between predictions and the true function,
\(\seteqnumber{0}{}{1}\)\begin{equation} \mathrm {Bias}[\hat {y}] = \E [\hat {y}] - \E [y] = \E [f(\bx ;\bw )] - h(\bx ) \end{equation}
where the expectation is taken over all possible training samples.
For a specific dataset, the empirical bias can be estimated as:
\(\seteqnumber{0}{}{2}\)\begin{equation} \mathrm {Bias}_{\text {empirical}} = \frac {1}{M}\sum _{i=1}^M \hat {y}_i - \frac {1}{M}\sum _{i=1}^M h(\bx _i) \end{equation}
The bias and variance are illustrated in Fig. 5.3.
5.2.2 Bias and Variance Trade-off
The expected prediction error (MSE) for a new point can be decomposed as:
\(\seteqnumber{0}{}{3}\)\begin{equation} \begin{aligned} \loss (\by ,\hat {\by }) &= \E [(\hat {y}-y)^2] = \frac {1}{M}\sum _{i=1}^M \left (\hat {y}_i-y_i\right )^2\\ &=\E [\hat {y}^2] -2 \E [y]\E [\hat {y}] + \E [y^2]\\ \E [\hat {y}^2] &= \Var [\hat {y}] + \mathbb {E}^2\left [\hat {y}\right ]\\ \E [y] &=\E [h(\bx )]=h(\bx )\\ \E [y^2] &= \E [(h(\bx ) + \epsilon )^2]= \frac {1}{M}\sum _{i=1}^M \left (h(\bx _i) + \epsilon _i\right )^2\\ &= \E [h^2(\bx )] + 2\E [h(\bx )]\cancelto {0}{\E [\epsilon ]} + \E [\epsilon ^2]\\ &= \E [h^2(\bx )] + \sigma ^2\\ &= h^2(\bx ) + \sigma ^2\\ \loss &=\Var [\hat {y}] + \mathbb {E}^2\left [\hat {y}\right ] - 2h(\bx )\E [\hat {y}] + h^2(\bx ) + \sigma ^2 \\ &=\underbrace {\left (\E [\hat {y}] - h(\bx )\right )^2}_{bias^2} + \underbrace {\Var [\hat {y}]}_{variance} + \underbrace {\sigma ^2}_{noise} \end {aligned} \end{equation}
This decomposition shows that the total prediction error consists of three components: squared bias, variance, and irreducible noise. The inherent bias-variance trade-off is presented in Fig. 5.4. Underfitting corresponds to low variance and high bias; overfitting corresponds to high variance and low bias. The best-performing model balances the two. However, a poorly chosen model can have high bias and high variance simultaneously.
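The decomposition can be checked by Monte Carlo simulation. This sketch assumes a known true function \(h(x)=\sin (2\pi x)\) and a known noise level, refits a cubic model on many independently drawn training sets, and compares the empirical MSE at one point against bias\(^2\) + variance + \(\sigma ^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
h = lambda x: np.sin(2 * np.pi * x)  # assumed true function h(x)
sigma = 0.1                          # noise standard deviation
x0 = 0.3                             # evaluation point
deg, M, runs = 3, 30, 4000

# Refit the model on many independent training sets
preds = np.empty(runs)
for r in range(runs):
    x = rng.uniform(0, 1, M)
    y = h(x) + sigma * rng.normal(size=M)
    preds[r] = np.polyval(np.polyfit(x, y, deg), x0)

bias2 = (preds.mean() - h(x0)) ** 2
variance = preds.var()
# Empirical MSE against fresh noisy observations y0 = h(x0) + eps
y0 = h(x0) + sigma * rng.normal(size=runs)
mse = np.mean((preds - y0) ** 2)
# mse should approximately equal bias2 + variance + sigma**2
```

The expectation over "all possible training samples" in the text is approximated here by averaging over `runs` resampled datasets.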
5.3 Normalization and Standardization
Values in different columns of \(\bX \) (vectors \(\bx _j\)) may differ by orders of magnitude, i.e. \(\norm {\bx _i}\gg \norm {\bx _j}\). This results in:
• Some columns having a significantly higher influence on \(\hat {\by }\).
• Numerical instabilities.
Standardization (z-score normalization)
Mapping all the input values such that they follow a distribution with zero mean and unit variance.
\(\seteqnumber{0}{}{4}\)\begin{equation} z_{std} = \frac {\bx -\bar {\bx }}{\sigma _\bx }, \end{equation}
where
\(\seteqnumber{0}{}{5}\)\begin{align*} \bar {\bx } &= \frac {1}{M}\sum _{j=1}^Mx_j\\ \sigma _\bx ^2 &= \frac {1}{M}\sum _{j=1}^M\left (x_j - \bar {\bx }\right )^2 \end{align*}
Implementation steps:
1. On the train dataset, evaluate \(\bar {\bx }\) and \(\sigma _\bx \).
2. Apply the normalization to the train dataset, using \(\bar {\bx }\) and \(\sigma _\bx \).
3. Apply the normalization to the test dataset, using the same \(\bar {\bx }\) and \(\sigma _\bx \) (no recalculation).
When normalization is applied to \(\by \), the output of the model is transformed back, \(\hat {\by } = \hat {\by }_{std}\sigma _\by + \bar {\by }\).
Example: zscore command in Matlab and sklearn.preprocessing.StandardScaler in Python.
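The steps above can be sketched directly in NumPy (the function names `fit_scaler` and `transform` are introduced here for illustration; synthetic data is assumed):

```python
import numpy as np

def fit_scaler(X_train):
    """Step 1: estimate per-feature mean and std on the train split only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return mu, sigma

def transform(X, mu, sigma):
    """Steps 2-3: apply the same train-set parameters to any split."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, (100, 3))
X_test = rng.normal(5.0, 2.0, (20, 3))

mu, sigma = fit_scaler(X_train)
Z_train = transform(X_train, mu, sigma)
Z_test = transform(X_test, mu, sigma)  # same mu, sigma: no recalculation
```

By construction the standardized train split has zero mean and unit variance per feature; the test split is only approximately so, which is expected.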
Normalization
Mapping all values of a feature to be in the range \([0,1]\) by the transformation1
\(\seteqnumber{0}{}{5}\)\begin{equation} x_{norm} = \frac {x-x_{min}}{x_{max}-x_{min}} \end{equation}
Implementation steps for normalization are similar to standardization.
1 In Python: sklearn.preprocessing.MinMaxScaler
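A minimal min-max sketch, analogous to the standardization steps (`minmax_fit` and `minmax_transform` are names introduced here; synthetic data is assumed). Note that test values falling outside the train range map outside \([0,1]\):

```python
import numpy as np

def minmax_fit(X_train):
    """Estimate per-feature min and max on the train split."""
    return X_train.min(axis=0), X_train.max(axis=0)

def minmax_transform(X, lo, hi):
    """Map values to [0, 1] relative to the train-set range."""
    return (X - lo) / (hi - lo)

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 7, (50, 2))
lo, hi = minmax_fit(X_train)
Z = minmax_transform(X_train, lo, hi)  # train values lie in [0, 1]
```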
Beware: the terms normalization and standardization are often used interchangeably.
Pros:
• Improves convergence speed of gradient-based optimization.
• Prevents features with large magnitudes from dominating distance-based models.
• Standardization of both \(\bX \) and \(\by \) results in \(w_0=0\) and removes the requirement for \(\bOne \) column in \(\bX \).
Cons:
• Min-max normalization is sensitive to outliers.
• Scaling parameters must be computed on the training set and applied consistently to the test set.
5.4 Dataset size
Adding data usually improves the performance of an overly complex model, as presented in Fig. 5.5.
5.5 Regularization
Regularization: a penalty added to the loss function
\(\seteqnumber{0}{}{6}\)\begin{equation} \loss _{\mathrm {new}} = \loss + \lambda g(\bw ) \end{equation}
where \(\lambda \) is termed the regularization parameter.
\(L_2\)-regularization: Special case with
\(\seteqnumber{0}{}{7}\)\begin{equation} g(\bw ) = \dfrac {1}{2M}\left \| \bw \right \|^2 =\frac {1}{2M}\sum _{i=1}^N w_i^2 \end{equation}
where the vector of weights \(\bw \) does not include the \(w_0\) weight. Moreover, when standardization is used, no \(w_0\) weight is needed. Two special cases are:
• \(\lambda \rightarrow 0\) recovers the original loss function (overfitting).
• \(\lambda \rightarrow \infty \) makes the loss function independent of \(\bw \) (underfitting).
Polynomial regression example The resulting coefficients \(w_i\) are significantly larger for higher \(i\) (Fig. 5.2). Reducing them results in a smooth prediction \(\hat {y}\). The influence of \(\lambda \) is illustrated in Fig. 5.6.
The corresponding underfitting-overfitting \(\lambda \)-related trade-off is presented in Fig. 5.7.
5.5.1 Ridge Regression
Ridge regression: \(L_2\)-regularized linear regression.
\(\seteqnumber{0}{}{8}\)\begin{equation} \begin{aligned} \mathcal {L}(\bw ) &= \frac {1}{2M}\left \| \by - \bX \bw \right \|^2 + \frac {\lambda }{2M}\underbrace {\left \| \bw \right \|^2}_{\sum _{i=1}^{N}w_i^2}\\ &= \frac {1}{2M}\left (\by - \bX \bw \right )^T \left (\by - \bX \bw \right ) + \frac {\lambda }{2M}\bw ^T\bw \end {aligned} \end{equation}
Derivative
\(\seteqnumber{0}{}{9}\)\begin{equation} \nabla \mathcal {L}(\bw ) =\frac {1}{M}\left (-\bX ^T\left (\by - \bX \bw \right ) + \lambda \bw \right )= 0 \end{equation}
Solution
\(\seteqnumber{0}{}{10}\)\begin{align*} &-\bX ^T\left (\by - \bX \bw \right ) + \lambda \bw = -\bX ^T \by + \bX ^T\bX \bw + \lambda \bw = 0\\ &\bX ^T \by = \bX ^T\bX \bw + \lambda \bw \end{align*} Finally, the regularized weights are given by
\(\seteqnumber{0}{}{10}\)\begin{equation} \label {eq-rigde-regression-w} \bw = \left (\lambda \mathbf {I} + \bX ^T\bX \right )^{-1}\bX ^T \by \end{equation}
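A direct sketch of this closed-form solution on synthetic data (the true weights below are assumed for illustration). Increasing \(\lambda \) shrinks the weight norm, consistent with the slope interpretation below:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 50, 5
X = rng.normal(size=(M, N))
w_true = np.array([3.0, -2.0, 1.0, 0.5, -1.5])  # assumed ground truth
y = X @ w_true + 0.1 * rng.normal(size=M)

def ridge(X, y, lam):
    """Closed-form ridge weights: (lam*I + X^T X)^{-1} X^T y."""
    N = X.shape[1]
    return np.linalg.solve(lam * np.eye(N) + X.T @ X, X.T @ y)

w0 = ridge(X, y, 0.0)   # lam = 0: ordinary least squares
w1 = ridge(X, y, 10.0)  # lam > 0: weights shrunk toward zero
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the standard numerically stable choice here.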
Ridge regression can be viewed as a special case of linear regression, of the form
\(\seteqnumber{0}{}{11}\)\begin{align} \tilde {y} &= \tilde {\bX }\bw \nonumber \\ \begin{bmatrix} \by \\ 0 \\ \vdots \\ 0 \end {bmatrix} &= \begin{bmatrix} - & \bX & - \\ \sqrt {\lambda } & & 0 \\ & \ddots & \\ 0 & & \sqrt {\lambda } \end {bmatrix} \begin{bmatrix} w_1\\ \vdots \\ w_N \end {bmatrix} +w_0 \end{align}
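This equivalence can be checked numerically (assuming standardized data, so the \(w_0\) term is dropped): ordinary least squares on the augmented system reproduces the closed-form ridge weights, since \(\tilde {\bX }^T\tilde {\bX } = \bX ^T\bX + \lambda \mathbf {I}\) and \(\tilde {\bX }^T\tilde {\by } = \bX ^T\by \).

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, lam = 40, 4, 2.0
X = rng.normal(size=(M, N))
y = rng.normal(size=M)

# Closed-form ridge solution
w_ridge = np.linalg.solve(lam * np.eye(N) + X.T @ X, X.T @ y)

# Augmented formulation: stack sqrt(lam)*I under X and zeros under y,
# then solve the ordinary least-squares problem
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(N)])
y_aug = np.concatenate([y, np.zeros(N)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
```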
Slope interpretation A higher value of \(\lambda \) reduces (mostly) the largest slopes (weights \(w_i\)) \(\Rightarrow \) the dependency on the features with these weights is reduced.
Eigenvalues interpretation Adding \(\lambda \mathbf {I}\) shifts every eigenvalue of \(\bX ^T\bX \) up by \(\lambda \), so the smallest eigenvalue of the inverted matrix is at least \(\lambda \); this improves the numerical stability of the inversion.
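A quick numerical check of the eigenvalue shift on an assumed random design matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
lam = 0.5

eig_raw = np.linalg.eigvalsh(X.T @ X)                    # eigenvalues of X^T X
eig_reg = np.linalg.eigvalsh(lam * np.eye(4) + X.T @ X)  # regularized matrix
# Each eigenvalue is shifted up by exactly lam, so the smallest is >= lam
print(eig_reg - eig_raw)
```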
Bias-Variance Trade-off The bias-variance trade-off is illustrated in Fig. 5.8.
GD Solution
\(\seteqnumber{0}{}{12}\)\begin{equation} \label {eq-gd-lr-ls} \begin{aligned} \bw _{n+1} &= \bw _{n} - \frac {\alpha }{M}\left [\bX ^T\left (\bX \bw _n - \by \right ) +\lambda \bw _n\right ]\\ &=\bw _{n}\left (1-\frac {\alpha }{M}\lambda \right ) - \frac {\alpha }{M}\bX ^T\left (\bX \bw _n - \by \right ) \end {aligned} \end{equation}
The \(\alpha \) value is typically chosen such that
\(\seteqnumber{0}{}{13}\)\begin{equation} 0.99\gtrsim \left (1-\dfrac {\alpha }{M}\lambda \right )\gtrsim 0.95 \end{equation}
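A sketch of this gradient-descent update on synthetic data, verifying convergence to the closed-form ridge solution. The step size \(\alpha \) below is an assumed value for which \(1-\frac {\alpha }{M}\lambda \approx 0.99\), consistent with the guideline above:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, lam = 60, 3, 1.0
X = rng.normal(size=(M, N))
y = rng.normal(size=M)

# Closed-form reference solution
w_ref = np.linalg.solve(lam * np.eye(N) + X.T @ X, X.T @ y)

# Gradient descent in the weight-decay form of the update
alpha = 0.5  # assumed step size; here 1 - alpha*lam/M is about 0.99
w = np.zeros(N)
for _ in range(5000):
    w = w * (1 - alpha / M * lam) - alpha / M * (X.T @ (X @ w - y))
```

The factor \((1-\frac {\alpha }{M}\lambda )\) multiplying the previous iterate is why \(L_2\)-regularized gradient descent is often called weight decay.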