Machine Learning & Signals Learning


3 Linear Least-Squares

  • Goal: Rewrite LS in matrix notation and extend it to the multivariate case.

3.1 Univariate LS

  • Goal: Rewrite univariate LS from Ch. 2 with linear algebra.

To simplify the mathematical representation, vector notation is used. The data points \(\left \{x_k,y_k\right \}_{k=1}^M\) are organized into vectors, together with a few auxiliary ones, as follows:

\begin{equation} \bx = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end {bmatrix}, \quad \by = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end {bmatrix}, \quad \mathbf {1}_M = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end {bmatrix}\in \mathbb {R}^M, \, \bw = \begin{bmatrix} w_0 \\ w_1 \end {bmatrix} \end{equation}

The resulting model notation is

\begin{equation} \hat {\by } = f(\bX ;\bw ) = \bOne _M w_0 + \bx w_1 = \bX \bw , \end{equation}

where \(\bX = \begin {bmatrix} \bOne _M & \bx \end {bmatrix}\in \mathbb {R}^{M\times 2}\) and \(\bw = \begin {bmatrix} w_0 & w_1 \end {bmatrix}^T\).

The model error is

\begin{equation} \mathbf {e} = \by - \hat {\by } \end{equation}

The corresponding SSE loss function (2.4) is

\begin{equation} \label {eq-sse-loss-vector} \begin{aligned} \loss (\bw ;\bX ,\by ) &= \be ^T\be &&= \norm {\be }^2\\ &=(\by -\hat {\by })^T(\by -\hat {\by })&&=\norm {\by -\hat {\by }}^2\\ &=(\by -\bX \bw )^T(\by -\bX \bw ) &&= \norm {\by - \bX \bw }^2 \end {aligned} \end{equation}

The resulting normal equation solution in vector notation is given by

\begin{equation} \label {eq-LS-solution-4w} \bw _{opt} = \left (\bX ^T\bX \right )^{-1}\bX ^T\by \end{equation}

  • Proof. The corresponding optimal minimum (Eq. (2.7)) follows from setting the gradient of the loss to zero,

    \begin{equation} \label {eq-ls-gradient} \nabla _\bw \mathcal {L}(\cdot ) = -2\bX ^T\left (\by - \bX \bw \right ) = \bZero \end{equation}

    which, after dividing by \(-2\), yields the normal equation

    \begin{equation*} \begin{aligned} \bX ^T\left (\by - \bX \bw \right ) &= \bX ^T \by - \bX ^T\bX \bw = 0\\ \bX ^T \by &= \left (\bX ^T\bX \right )\bw \end {aligned} \end{equation*}

  • Example 3.1: For centered (\(\bar x=\bar y=0\)) variables with \(w_0=0\),

    \[ w_1 = \frac {\sum _{i=1}^M x_iy_i}{\sum _{i=1}^M x_i^2} \]
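
As a quick numerical check, the normal-equation solution (3.5) and the centered-data slope of Example 3.1 can be computed with NumPy. This is a minimal sketch; the data and all parameter values below are made up for illustration.

import numpy as np

# Synthetic univariate data (illustrative values only)
rng = np.random.default_rng(0)
M = 50
x = rng.uniform(-3, 3, size=M)
y = 1.5 + 0.8 * x + 0.1 * rng.standard_normal(M)

# Design matrix X = [1_M, x] and normal-equation solution w = (X^T X)^{-1} X^T y
X = np.column_stack([np.ones(M), x])
w = np.linalg.solve(X.T @ X, X.T @ y)
print("w0, w1 =", w)

# Example 3.1: after centering x and y, the slope is sum(x*y) / sum(x^2)
xc, yc = x - x.mean(), y - y.mean()
w1_centered = np.sum(xc * yc) / np.sum(xc**2)
print(np.isclose(w[1], w1_centered))  # the centered-data slope equals w1 of the full fit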

3.2 Multivariate LS

  • Goal: Extend univariate results to multivariate ones.

3.2.1 Notation

For the multivariate \(N\)-dimensional formulation,

\begin{align} \bX & = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{M,1} & x_{M,2} & \cdots & x_{M,N} \end {bmatrix}=\begin{bmatrix} \bx _1^T \\ \bx _2^T \\ \vdots \\ \bx _M^T \end {bmatrix} = \begin{bmatrix} \bOne _M & \btx _1 & \btx _2 & \cdots & \btx _N \end {bmatrix} \in \mathbb {R}^{M\times (N+1)} \label {eq-mv-ls-Xmatrix} \\ \bw &=\begin{bmatrix} w_0 & w_1 & \cdots & w_N \end {bmatrix}^T \in \mathbb {R}^{N+1} \end{align} where:

  • each row \(\bx _i^T\) is termed a sample, \(\bx _i = [1, x_{i,1}, \ldots , x_{i,N}]^T \in \mathbb {R}^{N+1},\,i=1,\ldots ,M\).

  • each column \(\btx _j\in \mathbb {R}^M,\,j=1,\ldots ,N\) is termed a feature vector.

The collection of all rows of \((\bX ,\by )\) is called the dataset, with \(M\) samples (or entries) and \(N\) features.

The normal-equation solution (3.5) for \(\hat {\by } = \bX \bw \) remains unchanged, regardless of the number of feature vectors.
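
As a sketch (with made-up data and sizes), the design matrix \(\bX \) above can be assembled by prepending a column of ones to the raw feature matrix, after which the same normal-equation solution (3.5) applies unchanged:

import numpy as np

rng = np.random.default_rng(1)
M, N = 100, 3                                  # M samples, N features (illustrative sizes)
X_raw = rng.standard_normal((M, N))            # raw features, one row per sample
w_true = np.array([0.5, 2.0, -1.0, 0.3])       # [w0, w1, ..., wN], used only to simulate y
y = w_true[0] + X_raw @ w_true[1:] + 0.05 * rng.standard_normal(M)

# Design matrix X = [1_M, x~_1, ..., x~_N], of shape M x (N+1)
X = np.hstack([np.ones((M, 1)), X_raw])

# Normal-equation solution (3.5): w = (X^T X)^{-1} X^T y
w_opt = np.linalg.solve(X.T @ X, X.T @ y)
print(w_opt)                                   # close to w_true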

3.3 Properties

  • Goal: Linear algebra properties of multivariate LS.

3.3.1 Moore–Penrose inverse (pseudo-inverse)
  • Goal: Inversion of rectangular (non-square) matrices.

The Moore–Penrose inverse generalizes the ordinary matrix inverse to rectangular (non-square) matrices. Assuming \(\bX \) has full column rank, its left pseudo-inverse is given by:

\begin{equation} \bX ^+ = \left (\bX ^T\bX \right )^{-1}\bX ^T\in \real ^{(N+1)\times M}, \end{equation}

such that multiplying it by \(\bX \) yields the identity matrix

\[ \bX ^+\bX = \left (\bX ^T\bX \right )^{-1}\bX ^T\bX = \mathbf {I}_{N+1}\]

Note that a direct implementation of \(\bX ^+\) may suffer from numerical stability issues due to the explicit inversion of \(\bX ^T\bX \). Modern numerical libraries provide numerically stable and efficient implementations of the pseudo-inverse (typically SVD-based).

Using this notation, the optimal weight vector in (3.5) can be compactly written as

\begin{equation} \bw _{opt} = \bX ^+\by \end{equation}
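
A minimal sketch (with arbitrary synthetic data): in NumPy the SVD-based np.linalg.pinv computes \(\bX ^+\) in a numerically stable way, and \(\bw _{opt} = \bX ^+\by \) agrees with the explicit normal-equation formula when \(\bX \) has full column rank:

import numpy as np

rng = np.random.default_rng(2)
M, N = 100, 3
X = np.hstack([np.ones((M, 1)), rng.standard_normal((M, N))])
y = rng.standard_normal(M)

X_pinv = np.linalg.pinv(X)                     # Moore-Penrose pseudo-inverse (SVD-based)
w_pinv = X_pinv @ y                            # w_opt = X^+ y
w_ne = np.linalg.solve(X.T @ X, X.T @ y)       # explicit normal-equation solution

print(np.allclose(X_pinv @ X, np.eye(N + 1)))  # left inverse: X^+ X = I_{N+1}
print(np.allclose(w_pinv, w_ne))               # both solutions coincide (full column rank)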

3.3.2 Projection matrix
  • Goal: Derive the projection matrix \(\bP \) that maps \(\by \) to \(\hat {\by }\), and state its key properties.

Using the pseudo-inverse notation, the model’s output (the predicted values) can be written directly in terms of the target vector \(\by \):

\begin{equation} \begin{aligned} \hat {\by } &= \bX \bw = \bX \left (\bX ^T\bX \right )^{-1}\bX ^T \by \\ &=\bX \bX ^+\by = \mathbf {P}\by \end {aligned} \end{equation}

where

\begin{equation} \bP = \bX \left (\bX ^T\bX \right )^{-1}\bX ^T\in \real ^{M\times M} \end{equation}

is termed the projection matrix. Geometrically, \(\bP \) projects the target vector \(\by \) onto the subspace spanned by the columns of \(\bX \) (the column space).

Important properties of the projection matrix \(\bP \) include:

  • Symmetry, \(\bP =\bP ^T\),

  • Idempotence, \(\bP =\bP ^2\),

    • Proof.

      \begin{equation} \begin{aligned} \bP ^2&=\bX \left (\bX ^T\bX \right )^{-1}\bX ^T\bX \left (\bX ^T\bX \right )^{-1}\bX ^T\\ &=\bX \left (\bX ^T\bX \right )^{-1}\bX ^T=\bP \end {aligned} \end{equation}

  • Orthogonality: The projection \(\mathbf {P}\) is orthogonal to its complement \(\left (\mathbf {I}-\mathbf {P}\right )\), i.e. \(\mathbf {P}\perp \left (\mathbf {I}-\mathbf {P}\right )\).

    • Proof. \(\bP (\bI -\bP ) = \bP - \bP ^2 = \bZero \).

  • Complementary projection: \(\bI -\bP \) is also a projection matrix.

  • The residual error vector \(\be \) is simply the projection of \(\by \) onto the orthogonal complement of the column space of \(\bX \)

    \begin{equation} \mathbf {e} = \by - \hat {\by } = \by - \mathbf {P} \by = \left (\mathbf {I} -\mathbf {P}\right )\by . \end{equation}
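
These properties are straightforward to verify numerically. The following minimal sketch (arbitrary synthetic data; sizes and seed are illustrative) checks symmetry, idempotence, orthogonality of the complement, and the residual expression:

import numpy as np

rng = np.random.default_rng(3)
M, N = 30, 2
X = np.hstack([np.ones((M, 1)), rng.standard_normal((M, N))])
y = rng.standard_normal(M)

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix P = X (X^T X)^{-1} X^T
I = np.eye(M)

print(np.allclose(P, P.T))              # symmetry
print(np.allclose(P, P @ P))            # idempotence
print(np.allclose(P @ (I - P), 0))      # P is orthogonal to its complement
e = (I - P) @ y                         # residual e = (I - P) y
print(np.allclose(e, y - P @ y))        # same as e = y - y_hat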

3.3.3 Error properties
Error and prediction orthogonality

\begin{equation} \label {eq:error_pred_orth} \mathbf {e}\perp \hat {\by } \Rightarrow \hat {\by }^T\mathbf {e}=\be ^T\hat {\by } = 0 \end{equation}

  • Proof. Using the properties of the projection matrix \(\bP \) (symmetry and idempotence):

    \begin{equation*} \begin{aligned} \hat {\by }^T\mathbf {e} &= \left (\mathbf {P}\by \right )^T \left (\mathbf {I} -\mathbf {P}\right )\by \\ &= \by ^T \mathbf {P}^T \left (\mathbf {I} -\mathbf {P}\right )\by \\ &= \by ^T \mathbf {P}\by - \by ^T \mathbf {P}^2\by \\ &= \by ^T \mathbf {P}\by - \by ^T \mathbf {P}\by = 0 \end {aligned} \end{equation*}

An interesting consequence of this property is a Pythagorean relation between the target, the prediction, and the error,

\begin{equation} \norm {\by }^2 = \norm {\hat {\by }}^2 + \norm {\be }^2 \end{equation}

  • Proof.

    \begin{equation*} \begin{aligned} \norm {\by }^2 &= \norm {\hat {\by }+\be }^2\\ &= (\hat {\by }+\be )^T(\hat {\by }+\be )\\ &= \hat {\by }^T\hat {\by } + 2\hat {\by }^T\be + \be ^T\be \\ &= \norm {\hat {\by }}^2 + 0 + \norm {\be }^2 \qquad \text {(since } \hat {\by }^T\be = 0 \text {)} \end {aligned} \end{equation*}

Error and feature orthogonality

\begin{equation} \label {eq-data-error-ort} \mathbf {e}\perp \bX \Rightarrow \bX ^T\mathbf {e} = \bZero _{N+1}\\ \end{equation}

  • Proof. Substituting the projection matrix notation:

    \begin{equation*} \begin{aligned} \bX ^T\mathbf {e} &= \bX ^T\left (\by - \hat {\by }\right ) \\ &= \bX ^T\by - \bX ^T\bX \left (\bX ^T\bX \right )^{-1}\bX ^T\by \\ &= \bX ^T\by - \mathbf {I}_{N+1}\bX ^T\by = \bZero _{N+1} \end {aligned} \end{equation*}

Since \(\bX ^T\be = \bZero _{N+1}\) expands row-by-row as

\[ \bOne _M^T\be = 0,\quad \btx _1^T\be = 0,\quad \ldots ,\quad \btx _N^T\be = 0, \]

the error vector is orthogonal to every feature vector (and to the bias column).

Average error

If the model includes a bias/intercept term (\(w_0\)), the mean of the residuals is exactly zero,

\begin{equation} \label {eq-ls-zero-mean-error} \begin{aligned} \bar {\mathbf {e}} &= \frac {1}{M}\sum _{k=1}^M e_k = \frac {1}{M}\bOne ^T\be = 0 \end {aligned} \end{equation}

  • Proof. From Eq. (3.17), we know \(\bX ^T\be = \bZero _{N+1}\). Because the first column of \(\bX \) is \(\bOne _M\) (the bias feature),

    \begin{equation*} \bX ^T\be = \begin{bmatrix} \bOne _M^T \\ \btx _1^T \\ \vdots \\ \btx _N^T \end {bmatrix} \be = \begin{bmatrix} \bOne _M^T \be \\ \vdots \end {bmatrix} = \bZero _{N+1} \implies \bOne _M^T \be = 0 \end{equation*}

An interesting consequence of this is that the mean of the true values equals the mean of the predicted values

\begin{equation} \begin{aligned} \bar {\by } &= \bar {\hat {\by }}\\ &= w_0 + w_1\bar {\btx }_1 + \cdots + w_N\bar {\btx }_N \end {aligned} \end{equation}

  • Proof.

    \begin{equation} \begin{aligned} \bar {\by } &= \overline {\hat {\by } +\be }\\ &=\bar {\hat {\by }} + \bar {\be }\\ &=\bar {\hat {\by }} \qquad \text {(by \eqref {eq-ls-zero-mean-error})} \end {aligned} \end{equation}

In particular, if the target and all feature vectors are centered (\(\bar {\by }=0\) and \(\bar {\btx }_j=0\) for all \(j\)), then \(w_0=0\); the zero-mean residual \(\bar {\be }=0\) holds whenever an intercept is included.
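
The orthogonality and zero-mean properties above are easy to confirm numerically. A minimal sketch (synthetic data, illustrative only):

import numpy as np

rng = np.random.default_rng(4)
M, N = 40, 2
X = np.hstack([np.ones((M, 1)), rng.standard_normal((M, N))])
y = rng.standard_normal(M)

w = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w
e = y - y_hat

print(np.allclose(X.T @ e, 0))                   # e is orthogonal to every column of X
print(np.isclose(y_hat @ e, 0))                  # e is orthogonal to the prediction
print(np.isclose(y @ y, y_hat @ y_hat + e @ e))  # ||y||^2 = ||y_hat||^2 + ||e||^2
print(np.isclose(e.mean(), 0))                   # zero-mean residuals (intercept included)
print(np.isclose(y.mean(), y_hat.mean()))        # mean of targets equals mean of predictions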

Error distribution

The entries of the error vector \(\be \) are often assumed to be normally distributed, motivated by the Central Limit Theorem (CLT). This assumption is typically not needed in ML, but it is important for statistical analysis when \(M\) is small.

Minimum SSE

The reduced expression for the resulting minimal SSE loss is

\begin{equation} \label {eq-mse-min} \begin{aligned} \loss _{min} &=\by ^T\by - \by ^T\bX \bw \\ &= \by ^T\by - \by ^T\begin{bmatrix}\bOne _M & \btx _1 & \cdots & \btx _N\end {bmatrix}\bw \\ &= \norm {\by }^2 - w_0\bOne _M^T\by - \sum _{j=1}^{N} w_j\,\btx _j^T\by \end {aligned} \end{equation}

  • Proof.

    \begin{equation} \begin{aligned} \loss _{min} &= \be ^T\be \\ &= (\by -\hat {\by })^T\be \\ &= \by ^T\be - \hat {\by }^T\be \qquad \leftarrow \text {\eqref {eq:error_pred_orth}}\\ &= \by ^T\be \\ &= \by ^T(\by - \bX \bw ) \end {aligned} \end{equation}

The MSE and RMSE follow directly from this loss: divide by \(M\) to obtain the MSE, and additionally take the square root for the RMSE.
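
As a sketch (synthetic data, illustrative only), the reduced expression above can be compared against the direct computation of the loss, with the MSE and RMSE derived from it:

import numpy as np

rng = np.random.default_rng(5)
M, N = 60, 2
X = np.hstack([np.ones((M, 1)), rng.standard_normal((M, N))])
y = rng.standard_normal(M)

w = np.linalg.solve(X.T @ X, X.T @ y)

sse_direct = np.sum((y - X @ w) ** 2)       # L_min = ||y - X w||^2
sse_reduced = y @ y - y @ (X @ w)           # L_min = y^T y - y^T X w
print(np.isclose(sse_direct, sse_reduced))

mse = sse_direct / M                        # MSE = SSE / M
rmse = np.sqrt(mse)                         # RMSE = sqrt(MSE)
print(mse, rmse)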

3.3.4 Implementation note

The matrix \(\bX \) is assumed to have full column rank, i.e. its columns are linearly independent and the eigenvalues of \(\bX ^T\bX \) are sufficiently far from zero (the problem is well-conditioned). The recommended approach is to use numerically robust library routines (typically QR- or SVD-based least-squares solvers) to compute \(\bw _{opt}\) from \(\bX \) and \(\by \), rather than forming \(\left (\bX ^T\bX \right )^{-1}\) explicitly.
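
For example (a minimal sketch with synthetic data), NumPy's np.linalg.lstsq solves the LS problem through an SVD-based LAPACK routine without ever forming \(\left (\bX ^T\bX \right )^{-1}\):

import numpy as np

rng = np.random.default_rng(6)
M, N = 200, 4
X = np.hstack([np.ones((M, 1)), rng.standard_normal((M, N))])
y = rng.standard_normal(M)

# SVD-based least-squares solver; rcond=None uses the default cutoff for small singular values
w_opt, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
print(w_opt)
print(rank)   # equals N+1 for a full-column-rank design matrix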

3.4 Gradient Descent

  • Goal: An iterative algorithm for finding the minimum of a function:

    • First-order derivative based algorithm.

    • Local minimum only.

3.4.1 Preface
  • Goal: Basic mechanism behind the algorithm.

Consider a function \(y=f(x)\), with \(x,y\in \real \), that is differentiable with a continuous derivative \(\dfrac {dy}{dx} = f'(x)\).

  • The interpretation of \(f'(x)\) is the slope of \(f(x)\) at a point \(x\).

  • By the definition of the derivative, for some small \(\epsilon \),

    \begin{align} f(x+\epsilon ) &\approx f(x) + \epsilon f'(x)\\ f(x-\epsilon ) &\approx f(x) - \epsilon f'(x) \end{align}

  • Given the sign of the derivative,

    \begin{equation} \begin{aligned} f(x-\epsilon )&< f(x), & f'(x) > 0\\ f(x+\epsilon )&< f(x), & f'(x) < 0\\ f(x+\epsilon )&> f(x), & f'(x) > 0\\ f(x-\epsilon )&> f(x), & f'(x) < 0 \end {aligned} \end{equation}

  • For sufficiently small \(\epsilon \),

    \begin{equation} \label {eq-gd-idea} f\left (x-\epsilon \sign \left (f'(x)\right )\right )\le f(x) \end{equation}

The idea of the algorithm is to reduce \(f(x)\) toward the minimum by moving in the direction opposite to the sign of the derivative \(f'(x)\) according to (3.25).

(image)

Figure 3.1: Gradient descent intuition: moving in the direction \(-\epsilon f'(x)\) reduces the function value toward the minimum.
3.4.2 Definition

Gradient descent (GD) - scalar function: For differentiable function \(f(x)\), the iterative algorithm

\begin{equation} x_{n+1} = x_n - \alpha f'(x_n) \end{equation}

converges to some local minimum of \(f(x)\).

Required parameters are:

  • The step size \(\alpha >0\) is either a constant or a function of the iteration index \(n\), denoted \(\alpha _n\).

  • \(x_0\) is an initial guess.

Some of the most common stopping conditions are:

  • Reaching the point of slow convergence, \(\abs {x_{n+1}-x_n}<\epsilon \).

  • Limiting the number of iterations, \(n\le n_0\).
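
A minimal sketch of scalar GD with the two stopping conditions above (the test function, step size, and tolerances are illustrative choices):

def gradient_descent_1d(f_prime, x0, alpha=0.1, eps=1e-8, n0=10_000):
    """Scalar gradient descent: x_{n+1} = x_n - alpha * f'(x_n)."""
    x = x0
    for _ in range(n0):                 # stopping condition: iteration limit n <= n0
        x_new = x - alpha * f_prime(x)
        if abs(x_new - x) < eps:        # stopping condition: slow convergence
            return x_new
        x = x_new
    return x

# Example: f(x) = (x - 3)^2 with f'(x) = 2(x - 3); the minimum is at x = 3
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)   # approximately 3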

Gradient descent (GD) - vector function: For a differentiable multivariate function \(f(\bx ):\real ^N\rightarrow \real \), the iterative algorithm is

\begin{equation} \label {eq-gd-vector} \bx _{n+1} = \bx _n - \alpha \nabla _\bx f(\bx _n). \end{equation}

Each coordinate is updated iteratively according to its partial derivative.

Notes:

  • Easy to implement.

  • Requires analytical or numerical derivative.

  • Non-trivial selection of the step size \(\alpha \): an inappropriate value leads to slow convergence (or divergence in extreme cases). In some cases a per-coordinate step-size vector \(\balpha _n\) with unequal entries may be desirable.

  • Guaranteed to reach the global minimum only for functions with a single (global) minimum (e.g., convex losses such as the MSE); otherwise it converges to a local minimum.

Minimization of the loss function: Optimal values of \(\bw \) may be found by substituting the gradient of the loss function into the (vector) GD update,

\begin{equation} \label {eq-gd-vector-loss} \bw _{n+1} = \bw _{n} - \alpha \nabla _\bw \loss (\bw ) \end{equation}

GD for LS

Optimal values of \(\bw \) may be found by substituting the LS gradient (Eq. (3.6)) into (3.28); the constant factor and the normalization by \(M\) are absorbed into the step size (equivalently, the MSE gradient is used),

\begin{equation} \label {eq-gd-ls} \begin{aligned} \bw _{n+1} &= \bw _{n} - \alpha \nabla _\bw \loss (\bw _n)\\ &= \bw _{n} -\frac {\alpha }{M}\bX ^T\left (\bX \bw _n - \by \right )\\ &= \left (\bI - \frac {\alpha }{M}\bX ^T\bX \right )\bw _{n} +\frac {\alpha }{M}\bX ^T\by \end {aligned} \end{equation}
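
A sketch of the update (3.29) on synthetic data (sizes, step size, and iteration count are illustrative), compared with the closed-form normal-equation solution:

import numpy as np

rng = np.random.default_rng(7)
M, N = 200, 3
X = np.hstack([np.ones((M, 1)), rng.standard_normal((M, N))])
y = X @ np.array([1.0, 2.0, -0.5, 0.7]) + 0.1 * rng.standard_normal(M)

alpha, n_iters = 0.5, 500
w = np.zeros(N + 1)
for _ in range(n_iters):
    grad = X.T @ (X @ w - y) / M                # (1/M) X^T (X w_n - y)
    w = w - alpha * grad                        # full-batch GD update (3.29)

w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_closed, atol=1e-4))      # GD converges to the closed-form solution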

The main advantages of the GD solution over the normal equation (3.5) are:

  • Lower computational complexity, since it does not require inversion of \(\bX ^T\bX \).

  • Option for reduced-memory computation with mini-batch gradient descent (Section 3.4.3).

3.4.3 Full-batch, mini-batch and stochastic GD
  • Goal: Adapt GD to large datasets that do not fit into memory all at once.

The entire dataset is \(\{(\bx _k,y_k)\}_{k=1}^M\). Recall that for LS we can write the loss as a sum over samples,

\begin{equation} \loss (\bw ;\bX ,\by ) = \sum _{k=1}^M \ell _k(\bw ;\bx _k,y_k), \end{equation}

where \(\ell _k(\cdot )\) is the per-sample loss between \(y_k\) and \(\hat {y}_k\).

For the SSE loss,

\begin{equation} \ell _k(\bw ;\bx _k,y_k) = e_k^2 = \left (y_k-\bx _k^T\bw \right )^2 \end{equation}

Full-batch

Full-batch GD uses the gradient of the loss over all \(M\) samples. For LS, the resulting update takes the form of (3.29).

Mini-batch GD
  • Goal: Split the dataset into parts, typically due to memory constraints.

Instead of using all \(M\) samples, we select (usually at random) at iteration \(n\) a subset (mini-batch) of indices

\[ \mathcal {B}_n \subset \{1,2,\dots ,M\}, \qquad \abs {\mathcal {B}_n}=B\ll M. \]

We define the corresponding sub-matrices containing rows indexed by \(\mathcal {B}_n\),

\begin{equation} \bX _{\mathcal {B}_n}\in \Re ^{B\times (N+1)},\qquad \by _{\mathcal {B}_n}\in \Re ^{B}. \end{equation}

The mini-batch loss is

\begin{equation} \loss _{\mathcal {B}_n}(\bw ) = \frac {1}{B}\sum _{k\in \mathcal {B}_n}\ell _k(\bw ;\bx _k,y_k), \end{equation}

where \(B\) is called the batch size.

Mini-batch GD for LS For the SSE loss, the mini-batch loss becomes

\begin{equation} \loss _{\mathcal {B}_n}(\bw ) = \frac {1}{B}\norm {\by _{\mathcal {B}_n}-\bX _{\mathcal {B}_n}\bw }^2 \end{equation}

and its gradient is

\begin{equation} \nabla _\bw \loss _{\mathcal {B}_n}(\bw ) = \frac {1}{B}\bX _{\mathcal {B}_n}^T\big (\bX _{\mathcal {B}_n}\bw -\by _{\mathcal {B}_n}\big ). \end{equation}

The mini-batch GD update becomes

\begin{equation} \bw _{n+1} = \bw _n - \frac {\alpha }{B}\bX _{\mathcal {B}_n}^T \big (\bX _{\mathcal {B}_n}\bw _n-\by _{\mathcal {B}_n}\big ). \end{equation}

Stochastic gradient descent (SGD) The special extreme case of \(B=1\) is termed stochastic gradient descent.

Epoch

The term epoch denotes one full pass over the entire dataset. In full-batch GD, one iteration already uses all \(M\) samples, so one iteration is equivalent to one epoch. In mini-batch or stochastic GD, an epoch consists of several iterations, each using only a (small) part of the data.

Assume that we sweep once over all \(M\) samples without repetition:

  • Full-batch GD: \(B=M\Rightarrow L=1\) update per epoch.

  • Mini-batch GD: \(B\) samples per update results in

    \begin{equation} L = \left \lceil \dfrac {M}{B}\right \rceil \end{equation}

    batches (updates) per epoch.

  • SGD: \(B=1\Rightarrow L=M\) updates per epoch.
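
A sketch of mini-batch GD for LS following the update above, sweeping a shuffled dataset in \(\lceil M/B\rceil \) mini-batches per epoch (the data, batch size, step size, and epoch count are illustrative):

import numpy as np

rng = np.random.default_rng(8)
M, N = 1_000, 3
X = np.hstack([np.ones((M, 1)), rng.standard_normal((M, N))])
y = X @ np.array([1.0, 2.0, -0.5, 0.7]) + 0.1 * rng.standard_normal(M)

B, alpha, n_epochs = 32, 0.1, 50
w = np.zeros(N + 1)
for _ in range(n_epochs):
    perm = rng.permutation(M)                   # shuffle once per epoch
    for start in range(0, M, B):                # ceil(M/B) mini-batches per epoch
        idx = perm[start:start + B]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)  # mini-batch gradient
        w = w - alpha * grad                    # mini-batch GD update

print(w)   # close to the closed-form LS solution; B = 1 corresponds to SGD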

Theoretical relation between mini-batch and full-batch

Theoretically, each mini-batch gradient is an unbiased estimator of the (per-sample averaged) full gradient: averaging over the \(L\) disjoint mini-batches of one epoch,

\begin{equation} \mathbb {E}\left [\nabla _\bw \loss _{\mathcal {B}_n}(\bw )\right ] = \frac {1}{L}\sum _{n=1}^{L} \nabla _\bw \loss _{\mathcal {B}_n}(\bw ) = \frac {1}{M}\nabla _\bw \loss (\bw ). \end{equation}

Thus, on average, the mini-batch updates follow the same direction as full-batch GD. In practice, since \(\bw \) is updated after every mini-batch, the trajectory contains additional noise, as illustrated in Fig. 3.2.

(image)

Figure 3.2: Mini-batch gradient noise on a toy LS problem (\(y=2x+1+\epsilon \), \(M=100\), \(\alpha =0.04\)): full-batch GD (\(B{=}M\)) yields a smooth loss trajectory, while mini-batch (\(B{=}16\)) and SGD (\(B{=}1\)) produce increasingly noisy updates.
Summary
  • Mini-batch / SGD can handle huge datasets, since each update uses only \(B\) samples in memory.

  • Updates are faster (less data per step), and typically allow efficient parallel computation (e.g., on GPUs).

  • The noisy gradient may slow convergence and requires careful choice of step-size \(\alpha \) (often smaller than in full-batch GD).

  • In practice, mini-batch GD is the most commonly used compromise between speed, memory requirements and convergence stability.