Machine Learning & Signals Learning

\(\newcommand{\footnotename}{footnote}\) \(\def \LWRfootnote {1}\) \(\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\let \LWRorighspace \hspace \) \(\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }\) \(\newcommand {\TextOrMath }[2]{#2}\) \(\newcommand {\mathnormal }[1]{{#1}}\) \(\newcommand \ensuremath [1]{#1}\) \(\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } \) \(\newcommand {\setlength }[2]{}\) \(\newcommand {\addtolength }[2]{}\) \(\newcommand {\setcounter }[2]{}\) \(\newcommand {\addtocounter }[2]{}\) \(\newcommand {\arabic }[1]{}\) \(\newcommand {\number }[1]{}\) \(\newcommand {\noalign }[1]{\text {#1}\notag \\}\) \(\newcommand {\cline }[1]{}\) \(\newcommand {\directlua }[1]{\text {(directlua)}}\) \(\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}\) \(\newcommand {\protect }{}\) \(\def \LWRabsorbnumber #1 {}\) \(\def \LWRabsorbquotenumber "#1 {}\) \(\newcommand {\LWRabsorboption }[1][]{}\) \(\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }\) \(\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }\) \(\def \mathcode #1={\mathchar }\) \(\let \delcode \mathcode \) \(\let \delimiter \mathchar \) \(\def \oe {\unicode {x0153}}\) \(\def \OE {\unicode {x0152}}\) \(\def \ae {\unicode {x00E6}}\) \(\def \AE {\unicode {x00C6}}\) \(\def \aa {\unicode {x00E5}}\) \(\def \AA {\unicode {x00C5}}\) \(\def \o {\unicode {x00F8}}\) \(\def \O {\unicode {x00D8}}\) \(\def \l {\unicode {x0142}}\) \(\def \L {\unicode {x0141}}\) \(\def \ss {\unicode {x00DF}}\) \(\def \SS {\unicode {x1E9E}}\) \(\def \dag {\unicode {x2020}}\) \(\def \ddag {\unicode {x2021}}\) \(\def \P {\unicode {x00B6}}\) \(\def \copyright {\unicode {x00A9}}\) \(\def \pounds {\unicode {x00A3}}\) \(\let \LWRref \ref \) \(\renewcommand {\ref }{\ifstar \LWRref \LWRref }\) \( \newcommand {\multicolumn }[3]{#3}\) \(\require {textcomp}\) \( \newcommand {\abs }[1]{\lvert #1\rvert } \) \( \DeclareMathOperator {\sign }{sign} \) \(\newcommand {\intertext }[1]{\text {#1}\notag \\}\) \(\let \Hat \hat \) \(\let \Check \check \) \(\let \Tilde \tilde \) \(\let \Acute \acute \) \(\let \Grave \grave \) \(\let \Dot \dot \) \(\let \Ddot \ddot \) \(\let \Breve \breve \) \(\let \Bar \bar \) \(\let \Vec \vec \) \(\newcommand {\bm }[1]{\boldsymbol {#1}}\) \(\require {physics}\) \(\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}\) \(\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}\) \(\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}\) \(\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}\) \(\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}\) \(\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}\) \(\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}\) \(\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}\) \(\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}\) \(\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}\) \(\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}\) \(\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}\) \(\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}\) \(\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}\) \(\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}\) \(\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}\) \(\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}\) \(\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}\) \(\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}\) \(\renewcommand 
{\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}\) \(\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}\) \(\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}\) \(\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}\) \(\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}\) \(\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}\) \(\require {cancel}\) \(\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}\) \(\DeclareMathOperator *{\argmax }{argmax}\) \(\DeclareMathOperator *{\argmin }{arg\,min}\) \(\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}\) \(\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}\) \(\def \Cov [#1]{\operatorname {Cov}\!\left [ #1 \right ]}\) \(\newcommand {\floor }[1]{\lfloor #1 \rfloor }\) \(\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}\) \(\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}\) \(\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}\) \(\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}\) \(\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}\) \(\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}\) \(\renewcommand {\real }{\mathbb {R}}\) \(\newcommand {\ba }{\mathbf {a}}\) \(\newcommand {\bb }{\mathbf {b}}\) \(\newcommand {\bd }{\mathbf {d}}\) \(\newcommand {\be }{\mathbf {e}}\) \(\newcommand {\bh }{\mathbf {h}}\) \(\newcommand {\bn }{\mathbf {n}}\) \(\newcommand {\bq }{\mathbf {q}}\) \(\newcommand {\br }{\mathbf {r}}\) \(\newcommand {\bt }{\mathbf {t}}\) \(\newcommand {\bv }{\mathbf {v}}\) \(\newcommand {\bw }{\mathbf {w}}\) \(\newcommand {\bx }{\mathbf {x}}\) \(\newcommand {\bxx }{\mathbf {xx}}\) \(\newcommand {\bxy }{\mathbf {xy}}\) \(\newcommand {\by }{\mathbf {y}}\) \(\newcommand {\byy }{\mathbf {yy}}\) \(\newcommand {\bz }{\mathbf {z}}\) \(\newcommand {\bA }{\mathbf {A}}\) \(\newcommand {\bB }{\mathbf {B}}\) \(\newcommand {\bI }{\mathbf {I}}\) \(\newcommand {\bK }{\mathbf {K}}\) \(\newcommand {\bP }{\mathbf {P}}\) \(\newcommand {\bQ }{\mathbf {Q}}\) \(\newcommand {\bR }{\mathbf {R}}\) \(\newcommand {\bU }{\mathbf {U}}\) \(\newcommand {\bW }{\mathbf {W}}\) \(\newcommand {\bX }{\mathbf {X}}\) \(\newcommand {\bY }{\mathbf {Y}}\) \(\newcommand {\bZ }{\mathbf {Z}}\) \(\newcommand {\balpha }{\bm {\alpha }}\) \(\newcommand {\bth }{{\bm {\theta }}}\) \(\newcommand {\bepsilon }{{\bm {\epsilon }}}\) \(\newcommand {\bmu }{{\bm {\mu }}}\) \(\newcommand {\bOne }{\mathbf {1}}\) \(\newcommand {\bZero }{\mathbf {0}}\) \(\newcommand {\loss }{\mathcal {L}}\) \(\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}\) \(\newcommand {\SSE }{\mathrm {SSE}}\) \(\newcommand {\MSE }{\mathrm {MSE}}\) \(\newcommand {\RMSE }{\mathrm {RMSE}}\) \(\newcommand {\toprule }[1][]{\hline }\) \(\let \midrule \toprule \) \(\let \bottomrule \toprule \) \(\def \LWRbooktabscmidruleparen (#1)#2{}\) \(\newcommand {\LWRbooktabscmidrulenoparen }[1]{}\) \(\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }\) \(\newcommand {\morecmidrules }{}\) \(\newcommand {\specialrule }[3]{\hline }\) \(\newcommand {\addlinespace }[1][]{}\) \(\newcommand {\LWRsubmultirow }[2][]{#2}\) \(\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }\) \(\newcommand {\multirow }[2][]{\LWRmultirow }\) \(\newcommand {\mrowcell }{}\) \(\newcommand {\mcolrowcell }{}\) \(\newcommand {\STneed }[1]{}\) 
\(\newcommand {\tcbset }[1]{}\) \(\newcommand {\tcbsetforeverylayer }[1]{}\) \(\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}\) \(\newcommand {\tcboxfit }[2][]{\boxed {#2}}\) \(\newcommand {\tcblower }{}\) \(\newcommand {\tcbline }{}\) \(\newcommand {\tcbtitle }{}\) \(\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}\) \(\newcommand {\tcboxmath }[2][]{\boxed {#2}}\) \(\newcommand {\tcbhighmath }[2][]{\boxed {#2}}\)

3 Linear Least-Squares

Goals: rewrite LS in matrix notation and extend it to the multivariate case.

3.1 Univariate LS

To streamline the mathematical representation, vector notation is used. The points \(\left \{x_k,y_k\right \}_{k=1}^M\) are organized into vectors, together with a few auxiliary ones, as follows,

\begin{equation} \bx = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end {bmatrix}, \quad \by = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end {bmatrix}, \quad \mathbf {1}_M = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end {bmatrix}\in \mathbb {R}^M, \, \bw = \begin{bmatrix} w_0 \\ w_1 \end {bmatrix} \end{equation}

The resulting model notation is

\begin{equation} \hat {\by } = f(\bX ;\bw ) = \bOne _M w_0 + \bx w_1 = \bX \bw , \end{equation}

where \(\bX = \begin {bmatrix} \bOne _M & \bx \end {bmatrix}\in \mathbb {R}^{M\times 2}\) and \(\bw = \begin {bmatrix} w_0 & w_1 \end {bmatrix}^T\).

The corresponding SSE loss function (2.4) is

\begin{equation} \label {eq-sse-loss-vector} \begin{aligned} \loss (\bw ;\bX ,\by ) &= (\by -\hat {\by })^T(\by -\hat {\by })=\norm {\by -\hat {\by }}^2\\ &=(\by -\bX \bw )^T(\by -\bX \bw ) = \norm {\by - \bX \bw }^2 \end {aligned} \end{equation}

and the corresponding minimum (Eq. (2.7)) results from the solution of the normal equations in matrix form (the constant factor of 2 in the gradient is dropped, as it does not affect the solution),

\begin{equation} \label {eq-ls-gradient} \begin{aligned} \nabla _\bw \mathcal {L}(\cdot ) &=-\bX ^T\left (\by - \bX \bw \right ) = 0 \end {aligned} \end{equation}

and is given by

\begin{equation*} \begin{aligned} \bX ^T\left (\by - \bX \bw \right ) &= \bX ^T \by - \bX ^T\bX \bw = 0\\ \bX ^T \by &= \left (\bX ^T\bX \right )\bw \end {aligned} \end{equation*}

Finally,

\begin{equation} \label {eq-LS-solution-4w} \bw _{opt} = \left (\bX ^T\bX \right )^{-1}\bX ^T\by \end{equation}

  • Example 3.1: For centered (\(\bar x=\bar y=0\)) variables with \(w_0=0\),

    \[ w_1 = \frac {\sum _{i=1}^M x_iy_i}{\sum _{i=1}^M x_i^2} \]

The model error is

\begin{equation} \mathbf {e} = \by - \hat {\by } \end{equation}

such that SSE is \(\loss (\bw ) = \be ^T\be \).
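
To make the matrix formulation concrete, the following is a minimal MATLAB sketch of the univariate fit via the normal-equation solution (3.5). The data are synthetic and all variable names are illustrative.

    M = 50;                          % number of samples
    x = linspace(0, 1, M)';          % feature vector
    y = 2 + 3*x + 0.1*randn(M, 1);   % noisy targets around a line
    X = [ones(M, 1), x];             % design matrix X = [1_M, x]
    w = (X' * X) \ (X' * y);         % normal-equation solution for w = [w0; w1]
    y_hat = X * w;                   % model predictions
    e = y - y_hat;                   % residual (error) vector
    sse = e' * e;                    % SSE loss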

3.2 Multivariate LS

3.2.1 Notation

For the multivariate \(N\)-dimensional formulation,

\begin{align} \bX &=\begin{bmatrix} \bOne & \bx _1 & \cdots & \bx _N \end {bmatrix}\in \mathbb {R}^{M\times (N+1)} \label {eq-mv-ls-Xmatrix} \\ \bw &=\begin{bmatrix} w_0 & w_1 & \cdots & w_N \end {bmatrix}^T\in \mathbb {R}^{N+1} \end{align} Each vector \(\bx _i\) is also termed feature vector.

All of the preceding LS discussion of \(\hat {\by } = \bX \bw \) carries over unchanged, independently of the number of feature vectors.

Dataset

The collection of data rows in \((\bX ,\by )\) is called the dataset, with \(M\) entries (samples) and \(N\) features.

3.2.2 Adjusted Coefficient of Determination
  • Goal: Provide a comparison metric for multivariate linear models that accounts for different numbers or subsets of features.

While Section 2.3.2 discusses the standard \(R^2\) metric, the regular coefficient of determination has a critical flaw in multivariate regression: it mathematically cannot decrease when new features are added to the model, even if those features are entirely uninformative.

To address this, the adjusted \(R^2\) (\(\bar {R}^2\)) is introduced. It uses a penalty term based on the number of features \(N\) relative to the number of samples \(M\). It is given by:

\begin{equation} \bar {R}^2 = 1 - \left (1-R^2\right )\frac {M-1}{M-N-1} \end{equation}

The primary advantage of \(\bar {R}^2\) is that it penalizes the model for adding irrelevant variables. It will only increase if a new feature improves the model’s predictive power more than would be expected by chance.
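
As a short illustration, the adjusted \(R^2\) can be computed as follows in MATLAB, assuming a fitted model with predictions y_hat, \(M\) samples and \(N\) features (names are illustrative):

    e      = y - y_hat;                              % residuals of the fitted model
    SS_res = e' * e;                                 % residual sum of squares
    SS_tot = sum((y - mean(y)).^2);                  % total sum of squares
    R2     = 1 - SS_res / SS_tot;                    % standard coefficient of determination
    R2_adj = 1 - (1 - R2) * (M - 1) / (M - N - 1);   % adjusted R^2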

3.3 Properties

3.3.1 Moore–Penrose inverse (pseudo-inverse)

The Moore–Penrose inverse is the generalization of the ordinary matrix inverse to non-square (rectangular) matrices. Assuming \(\bX \) has full column rank, its left pseudo-inverse is given by:

\begin{equation} \bX ^+ = \left (\bX ^T\bX \right )^{-1}\bX ^T\in \real ^{(N+1)\times M}, \end{equation}

such that multiplying it by \(\bX \) yields the identity matrix

\[ \bX ^+\bX = \left (\bX ^T\bX \right )^{-1}\bX ^T\bX = \mathbf {I}_{N+1}\]

Note that a direct implementation of \(\bX ^+\) may have numerical stability issues due to the explicit inversion of \(\bX ^T\bX \). All modern programming languages have numerically stable and efficient implementations of the pseudo-inverse.

Using this notation, the optimal weight vector for LS can be compactly written as:

\begin{equation} \bw _{opt} = \bX ^+\by \end{equation}
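
A minimal MATLAB sketch contrasting the direct normal-equation form of \(\bX ^+\) with the built-in, numerically stable (SVD-based) pseudo-inverse (X and y as in the earlier sketch):

    Xp_naive = (X' * X) \ X';   % left pseudo-inverse via the normal equations
    Xp       = pinv(X);         % numerically stable pseudo-inverse (SVD-based)
    w_opt    = Xp * y;          % optimal LS weights, w_opt = X^+ * y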

3.3.2 Projection matrix

Using the pseudo-inverse notation, the model’s output (the predicted values) can be written directly in terms of the target vector \(\by \):

\begin{equation} \begin{aligned} \hat {\by } &= \bX \bw = \bX \left (\bX ^T\bX \right )^{-1}\bX ^T \by \\ &=\bX \bX ^+\by = \mathbf {P}\by \end {aligned} \end{equation}

where

\begin{equation} \bP = \bX \left (\bX ^T\bX \right )^{-1}\bX ^T\in \real ^{M\times M} \end{equation}

is termed the projection matrix. Geometrically, \(\bP \) projects the target vector \(\by \) onto the subspace spanned by the columns of \(\bX \) (the column space).

Important properties of the projection matrix \(\bP \) include:

  • Symmetry, \(\bP =\bP ^T\),

  • Idempotence, \(\bP =\bP ^2\),

    • Proof.

      \begin{equation} \begin{aligned} \bP ^2&=\bX \left (\bX ^T\bX \right )^{-1}\bX ^T\bX \left (\bX ^T\bX \right )^{-1}\bX ^T\\ &=\bX \left (\bX ^T\bX \right )^{-1}\bX ^T=\bP \end {aligned} \end{equation}

  • Orthogonality: The projection \(\mathbf {P}\) is orthogonal to its complement \(\left (\mathbf {I}-\mathbf {P}\right )\), i.e. \(\mathbf {P}\perp \left (\mathbf {I}-\mathbf {P}\right )\).

    • Proof. \(\bP (\bI -\bP ) = \bP - \bP ^2 = \bZero \).

  • Complementary projection: \(\bI -\bP \) is also a projection matrix.

  • The residual error vector \(\be \) is simply the projection of \(\by \) onto the orthogonal complement of the column space of \(\bX \)

    \begin{equation} \mathbf {e} = \by - \hat {\by } = \by - \mathbf {P} \by = \left (\mathbf {I} -\mathbf {P}\right )\by . \end{equation}
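
The properties above are easy to verify numerically. A minimal MATLAB check, assuming the full-rank design matrix X and target y from the earlier sketch:

    P = X * ((X' * X) \ X');          % projection onto the column space of X
    norm(P - P', 'fro')               % symmetry:     ~ 0
    norm(P - P*P, 'fro')              % idempotence:  ~ 0
    norm(P * (eye(M) - P), 'fro')     % P orthogonal to (I - P): ~ 0
    y_hat = P * y;                    % predictions as a projection of y
    e     = (eye(M) - P) * y;         % residuals in the orthogonal complement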

3.3.3 Error properties
Error and feature orthogonality

\begin{equation} \label {eq-data-error-ort} \mathbf {e}\perp \bX \Rightarrow \bX ^T\mathbf {e} = \bZero _{N+1} \end{equation}

  • Proof. Substituting the projection matrix notation:

    \begin{equation*} \begin{aligned} \bX ^T\mathbf {e} &= \bX ^T\left (\by - \hat {\by }\right ) \\ &= \bX ^T\by - \bX ^T\bX \left (\bX ^T\bX \right )^{-1}\bX ^T\by \\ &= \bX ^T\by - \mathbf {I}_{N+1}\bX ^T\by = \bZero _{N+1} \end {aligned} \end{equation*}

Error and prediction orthogonality

\begin{equation} \label {eq:error_pred_orth} \mathbf {e}\perp \hat {\by } \Rightarrow \hat {\by }^T\mathbf {e}=\be ^T\hat {\by } = 0 \end{equation}

  • Proof. Using the properties of the projection matrix \(\bP \) (symmetry and idempotence):

    \begin{equation*} \begin{aligned} \hat {\by }^T\mathbf {e} &= \left (\mathbf {P}\by \right )^T \left (\mathbf {I} -\mathbf {P}\right )\by \\ &= \by ^T \mathbf {P}^T \left (\mathbf {I} -\mathbf {P}\right )\by \\ &= \by ^T \mathbf {P}\by - \by ^T \mathbf {P}^2\by \\ &= \by ^T \mathbf {P}\by - \by ^T \mathbf {P}\by = 0 \end {aligned} \end{equation*}

An interesting outcome of this property is a Pythagorean relation between the target, the prediction, and the error,

\begin{equation} \norm {\by }^2 = \norm {\hat {\by }}^2 + \norm {\be }^2 \end{equation}

  • Proof.

    \begin{equation*} \begin{aligned} \norm {\by }^2 &= \norm {\hat {\by }+\be }^2\\ &= (\hat {\by }+\be )^T(\hat {\by }+\be )\\ &= \hat {\by }^T\hat {\by } + 2\hat {\by }^T\be + \be ^T\be \\ &= \norm {\hat {\by }}^2 + 0 + \norm {\be }^2 \qquad \text {(since } \hat {\by }^T\be = 0 \text {)} \end {aligned} \end{equation*}

Average error

If the model includes a bias/intercept term (\(w_0\)), the mean of the residuals is exactly zero,

\begin{equation} \begin{aligned} \bar {\mathbf {e}} &= \frac {1}{M}\sum _{k=1}^M e_k = \frac {1}{M}\bOne ^T\be = 0 \end {aligned} \end{equation}

  • Proof. From Eq. (3.16), we know \(\bX ^T\be = \bZero _{N+1}\). Because the first column of \(\bX \) is \(\bOne _M\) (the bias feature),

    \begin{equation*} \bX ^T\be = \begin{bmatrix} \bOne _M^T \\ \bx _1^T \\ \vdots \\ \bx _N^T \end {bmatrix} \be = \begin{bmatrix} \bOne _M^T \be \\ \vdots \end {bmatrix} = \bZero _{N+1} \implies \bOne _M^T \be = 0 \end{equation*}

An interesting consequence of this is that the mean of the true values equals the mean of the predicted values

\begin{equation} \begin{aligned} \bar {\by } &= \bar {\hat {\by }}\\ &= w_0 + w_1\bar {\bx }_1 + \cdots + w_N\bar {\bx }_N \end {aligned} \end{equation}

  • Proof.

    \begin{equation} \begin{aligned} \bar {\by } &= \overline {\hat {\by } +\be }\\ &=\bar {\hat {\by }} + \bar {\be } = \bar {\hat {\by }} \qquad \text {(since } \bar {\be }=0 \text {)} \end {aligned} \end{equation}

In particular, for centered features (\(\bar {\bx }_i=0\)), \(\bar {\by }=0\) implies \(w_0=0\).
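
These mean properties can be checked directly in MATLAB, using the variables from the earlier sketches (a model fitted with an intercept column):

    mean(e)                  % ~ 0: residuals of a model with an intercept average to zero
    mean(y) - mean(y_hat)    % ~ 0: targets and predictions share the same mean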

Error distribution

The values of the error vector \(\be \) are often assumed to be normally distributed, motivated by the Central Limit Theorem (CLT). Typically, this assumption is not needed in ML, but it is important for statistical analysis when \(M\) is small.

Minimum SSE

The reduced expression for the resulting minimal loss is

\begin{equation} \label {eq-mse-min} \begin{aligned} \loss _{min} &= \sum _{k=1}^My_k^2 - \sum _{j=0}^Nw_j\by ^T\bx _j\\ &=\by ^T\by - \by ^T\bX \bw \end {aligned} \end{equation}

  • Proof.

    \begin{equation} \begin{aligned} \loss _{min} &= \be ^T\be \\ &= (\by -\hat {\by })^T\be \\ &= \by ^T\be - \hat {\by }^T\be \qquad \leftarrow \text {\eqref {eq:error_pred_orth}}\\ &= \by ^T\be \\ &= \by ^T(\by - \bX \bw )\\ \end {aligned} \end{equation}

The MSE or RMSE evaluation from the loss is straightforward by dividing by \(M\) and taking the square root, respectively.
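
A short MATLAB check that the reduced expression matches the direct SSE, together with the MSE/RMSE evaluation (names follow the earlier sketches):

    sse_direct  = e' * e;                 % direct SSE at the optimum
    sse_reduced = y' * y - y' * (X * w);  % reduced expression (equal up to round-off)
    mse  = sse_direct / M;                % mean squared error
    rmse = sqrt(mse);                     % root mean squared error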

3.3.4 Implementation note

The matrix \(\bX \) is assumed full-rank, i.e. its columns are linearly independent and the eigenvalues of \(\bX ^T\bX \) are sufficiently far from zero. The recommended approach is to use numerically optimized routines to solve for \(\bw _{opt}\) given \(\bX \) and \(\by \), such as:

Code example (MATLAB)
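
A minimal sketch of numerically stable LS solvers in MATLAB (X and y as in the earlier sketches); both calls avoid forming \(\left (\bX ^T\bX \right )^{-1}\) explicitly:

    w1 = X \ y;         % backslash operator: QR-based LS solve (preferred)
    w2 = pinv(X) * y;   % SVD-based pseudo-inverse solution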

3.4 Gradient Descent

  • Goal: Find the minimum of a function.

    • A first-order, derivative-based algorithm.

    • Converges to a local minimum only.

Preface

Let’s assume some function \(y=f(x)\), with \(x,y\in \Re \), differentiable with \(\dfrac {dy}{dx} = f'(x)\).

  • The interpretation of \(f'(x)\) is the slope of \(f(x)\) at a point \(x\).

  • By the definition of the derivative, for some small \(\epsilon \),

    \begin{align} f(x+\epsilon ) &\approx f(x) + \epsilon f'(x)\\ f(x-\epsilon ) &\approx f(x) - \epsilon f'(x) \end{align}

  • Given the sign of the derivative,

    \begin{equation} \begin{aligned} f(x-\epsilon )&< f(x), & f'(x) > 0\\ f(x+\epsilon )&< f(x), & f'(x) < 0\\ f(x+\epsilon )&> f(x), & f'(x) > 0\\ f(x-\epsilon )&> f(x), & f'(x) < 0 \end {aligned} \end{equation}

  • For sufficiently small \(\epsilon \),

    \begin{equation} \label {eq-gd-idea} f\left (x-\epsilon \sign \left (f'(x)\right )\right )\le f(x) \end{equation}

The idea of the algorithm is to reduce \(f(x)\) toward the minimum by moving in the direction opposite to the sign of the derivative \(f'(x)\) according to (3.26).


3.4.1 Definition

Gradient descent (GD) - scalar function: For differentiable function \(f(x)\), the iterative algorithm

\begin{equation} x_{n+1} = x_n - \alpha f'(x_n) \end{equation}

converges to some local minimum of \(f(x)\).

Required parameters are:

  • Step-size \(\alpha >0\) is some positive constant or some function of \(n\), i.e. \(\alpha _n\).

  • \(x_0\) is an initial guess.

Some of the most common stopping conditions are:

  • Reaching the point of slow convergence, \(\abs {x_{n+1}-x_n}<\epsilon \).

  • Limiting the number of iterations, \(n\le n_0\).
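
A minimal MATLAB sketch of scalar GD; the function, step size, initial guess and iteration limit are illustrative choices:

    f     = @(x) (x - 2).^2 + 1;    % example function with minimum at x = 2
    df    = @(x) 2*(x - 2);         % its derivative
    alpha = 0.1;                    % step size
    x     = -5;                     % initial guess x_0
    for n = 1:1000                  % limit on the number of iterations
        x_new = x - alpha * df(x);  % GD update
        if abs(x_new - x) < 1e-8    % stopping condition: slow convergence
            break
        end
        x = x_new;
    end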

Gradient descent (GD) - vector function: For a differentiable multivariate function \(f(\bx ):\Re ^N\rightarrow \Re \), the iterative algorithm is

\begin{equation} \label {eq-gd-vector} \bx _{n+1} = \bx _n - \alpha \nabla _\bx f(\bx _n). \end{equation}

Each dimension is iteratively updated according to its partial derivative. Notes:

  • Easy to implement.

  • Requires analytical or numerical derivative.

  • Non-trivial selection of the optimal value of \(\alpha \). An inappropriate choice of \(\alpha \) results in slower convergence (or divergence in extreme cases). A per-dimension or iteration-dependent step-size vector \(\balpha _n\) may be desirable.

  • Guaranteed to reach the global minimum only for functions with a single (global) minimum, such as the MSE loss.

Minimization of loss function

Optimal values of \(\bw \) may be found by substituting the gradient of the loss function into the (vector) GD update,

\begin{equation} \label {eq-gd-vector-loss} \bw _{n+1} = \bw _{n} - \alpha \nabla _\bw \loss (\bw ) \end{equation}

GD for LS

Optimal values of \(\bw \) for LS may be found by substituting the gradient (Eq. (3.4)), averaged over the \(M\) samples, into (3.29),

\begin{equation} \label {eq-gd-ls} \begin{aligned} \bw _{n+1} &= \bw _{n} - \alpha \nabla _\bw \loss (\bw )\\ &= \bw _{n} -\frac {\alpha }{M}\bX ^T\left (\bX \bw _n - \by \right )\\ &= \left (\bI - \frac {\alpha }{M}\bX ^T\bX \right )\bw _{n} +\frac {\alpha }{M}\bX ^T\by \end {aligned} \end{equation}

The main differences between the normal-equation solution (3.5) and the GD solution are:

  • Lower computational complexity, since GD does not require inversion of \(\bX ^T\bX \).

  • Option for reduced-memory computation with mini-batch gradient descent.
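
A minimal MATLAB sketch of full-batch GD for LS implementing the update (3.30); the step size, iteration limit and tolerance are illustrative:

    alpha = 0.1;                      % step size
    w     = zeros(size(X, 2), 1);     % initial guess w_0
    for n = 1:5000
        grad = X' * (X*w - y) / M;    % sample-averaged gradient
        w    = w - alpha * grad;      % full-batch GD update
        if norm(grad) < 1e-10         % stopping condition
            break
        end
    end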

3.4.2 Full-batch, mini-batch and stochastic GD

Let us denote the \(k\)-th row of the dataset by \(\bX _k\), such that the entire dataset is \(\{(\bX _k,y_k)\}_{k=1}^M\). Recall that for LS we can write the loss as a sum over samples,

\begin{equation} \loss (\bw ;\bX ,\by ) = \sum _{k=1}^M \ell _k(\bw ;\bX _k,y_k), \end{equation}

where \(\ell _k(\cdot )\) is the per-sample (single-row) loss between \(y_k\) and \(\hat {y}_k\).

For the LS,

\begin{equation} \ell _k(\bw ) = \left (y_k-\bX _k^T\bw \right )^2. \end{equation}

Full-batch

Full-batch GD uses the gradient of the full loss, i.e. all \(M\) samples. In this case, for LS, the loss gradient takes the form of (3.30).

Mini-batch GD

Instead of using all \(M\) samples, we select (usually at random) at iteration \(n\) a subset (mini-batch) of indices

\[ \mathcal {B}_n \subset \{1,2,\dots ,M\}, \qquad \abs {\mathcal {B}_n}=B\ll M. \]

We define the corresponding sub-matrices containing rows indexed by \(\mathcal {B}_n\),

\begin{equation} \bX _{\mathcal {B}_n}\in \Re ^{B\times (N+1)},\qquad \by _{\mathcal {B}_n}\in \Re ^{B}. \end{equation}

The mini-batch loss is

\begin{equation} \loss _{\mathcal {B}_n}(\bw ) = \frac {1}{B}\sum _{k\in \mathcal {B}_n}\ell _k(\bw ) = \frac {1}{B}\norm {\by _{\mathcal {B}_n}-\bX _{\mathcal {B}_n}\bw }^2 \end{equation}

and its gradient is

\begin{equation} \nabla _\bw \loss _{\mathcal {B}_n}(\bw ) = \frac {1}{B}\bX _{\mathcal {B}_n}^T\big (\bX _{\mathcal {B}_n}\bw -\by _{\mathcal {B}_n}\big ). \end{equation}

The mini-batch GD update becomes

\begin{equation} \bw _{n+1} = \bw _n - \frac {\alpha }{B}\bX _{\mathcal {B}_n}^T \big (\bX _{\mathcal {B}_n}\bw _n-\by _{\mathcal {B}_n}\big ), \end{equation}

where \(B\) is called the batch size.

Stochastic gradient descent (SGD)

The extreme case of \(B=1\) is termed stochastic gradient descent.

Epoch

The term epoch denotes one full pass over the entire dataset. In full-batch GD, one iteration already uses all \(M\) samples, so one iteration is equivalent to one epoch. In mini-batch or stochastic GD, an epoch consists of several iterations, each using only a (small) part of the data.

Assume that we sweep once over all \(M\) samples without repetition:

  • Full-batch GD: \(B=M\Rightarrow L=1\) update per epoch.

  • Mini-batch GD: \(B\) samples per update results in

    \begin{equation} L = \left \lceil \dfrac {M}{B}\right \rceil \end{equation}

    batches (updates) per epoch.

  • SGD: \(B=1\Rightarrow L=M\) updates per epoch.
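
A minimal MATLAB sketch of mini-batch GD for LS organized by epochs; the batch size, step size and epoch count are illustrative (setting B = 1 gives SGD):

    B     = 16;                        % batch size
    alpha = 0.05;                      % step size
    w     = zeros(size(X, 2), 1);      % initial guess
    L     = ceil(M / B);               % updates (batches) per epoch
    for epoch = 1:50
        idx = randperm(M);             % sweep all samples once, without repetition
        for n = 1:L
            b  = idx((n-1)*B + 1 : min(n*B, M));               % indices of mini-batch B_n
            Xb = X(b, :);  yb = y(b);                          % mini-batch sub-matrices
            w  = w - (alpha/numel(b)) * (Xb' * (Xb*w - yb));   % mini-batch GD update
        end
    end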

3.4.3 Practical aspects
Theoretical relation to the full gradient

Theoretically, the mini-batch gradients \(\nabla _\bw \loss _{\mathcal {B}_n}(\bw )\) are estimators of the (sample-averaged) full gradient,

\begin{equation} \mathbb {E}\left [\nabla _\bw \loss _{\mathcal {B}_n}(\bw )\right ] = \frac {1}{L}\sum _{n=1}^{L} \nabla _\bw \loss _{\mathcal {B}_n}(\bw ) = \frac {1}{M}\nabla _\bw \loss (\bw ). \end{equation}

and follow, on average, the same direction as full-batch GD. In practice, since \(\bw \) is updated after every mini-batch \(n\), the trajectory exhibits additional noise, as illustrated in Fig. 3.1.


Figure 3.1: Mini-batch gradient noise: smaller batches produce noisier loss trajectories but can escape local minima.
Summary
  • Mini-batch / SGD can handle huge datasets, since each update uses only \(B\) samples in memory.

  • Updates are faster (less data per step), and typically allow efficient parallel computation (e.g., on GPUs).

  • The noisy gradient may slow convergence and requires careful choice of step-size \(\alpha \) (often smaller than in full-batch GD).

  • In practice, mini-batch GD (with \(B\) between a few tens and a few hundreds) is the most commonly used compromise between speed and stability.