Machine Learning & Signals Learning


2 Uni-Variate Linear Least-Squares

The goal of this chapter is to define and discuss uni-variate linear least-squares (LLS or LS) fitting of a linear model.

Note that LS is also termed linear regression (LR). Uni-variate LS is also referred to as a linear trend-line.

2.1 Uni-variate Linear LS

2.1.1 Definition

A random experiment produces a dataset of \(M\) paired observations \(\left \{x_k,y_k\right \}_{k=1}^M\).

The assumed model underlying the dataset is

\begin{equation} y = f(x) + \epsilon , \end{equation}

where \(f(x)\) is a deterministic function and \(\epsilon \) is zero-mean noise.

The corresponding linear model is

\begin{equation} \hat {y}_k = f(x_k;w_0,w_1)=w_0+w_1 x_k, \end{equation}

where \(w_0\) and \(w_1\) are the model parameters. The corresponding model error is

\begin{equation} e_k = y_k - \hat {y}_k = y_k-w_0-w_1x_k. \end{equation}

The optimal model parameters are found by minimizing a loss function, \(\loss (\cdot )\). Common examples of loss functions include the sum of squared errors (SSE), mean-square error (MSE), and root-mean-square error (RMSE):

\begin{align} \SSE &= \sum _{k=1}^{M} e_k^2,\label {eq-sse-loss-sum}\\ \MSE &= \frac {1}{M}\sum _{k=1}^{M} e_k^2 = \frac {1}{M}\SSE ,\\ \RMSE & = \sqrt {\MSE }. \end{align} For interpretation, \(\MSE \) is an estimator of the error variance. Because positive constant factors and monotonically increasing transforms do not affect the location of the minimum, all three loss functions share the same optimal parameters:

\begin{equation} \label {eq-optimization-definition} \begin{aligned} w_0,w_1 &= \arg \min \limits _{w_0,w_1} \SSE (w_0,w_1) \\ &= \arg \min \limits _{w_0,w_1}\MSE (w_0,w_1) \\ &= \arg \min \limits _{w_0,w_1} \RMSE (w_0,w_1). \end {aligned} \end{equation}

For simplicity, \(\SSE \) is used in the following discussion.

(image)

Figure 2.1: Linear regression visualization. The goal is to minimize \(\SSE \), which represents the total area \(\sum _k e_k^2\) of the rectangles.

The minimization of the loss function (Fig. 2.1),

\begin{equation} \loss (w_0,w_1)=\sum _{k=1}^{M}\bigl (y_k-w_0-w_1x_k\bigr )^2, \end{equation}

is performed by setting the partial derivatives of \(\loss \) to zero:

\begin{equation} \begin{cases} \dfrac {\partial }{\partial w_0} \loss (w_0,w_1) = 0\\[10pt] \dfrac {\partial }{\partial w_1} \loss (w_0,w_1) = 0 \end {cases} \end{equation}

which yields:

\begin{equation} \begin{cases} \displaystyle 2\sum _{k=1}^M \left (y_k - w_0 - w_1x_k\right )\cdot (-1) = 0\\[10pt] \displaystyle 2\sum _{k=1}^M \left (y_k - w_0 - w_1x_k\right )\cdot (-x_k) = 0. \end {cases} \end{equation}

Finally, with some basic algebra:

\begin{equation} \label {eq-ls-normal-eq} \begin{cases} w_0 M + w_1\displaystyle \sum _{k=1}^{M}x_k = \displaystyle \sum _{k=1}^{M}y_k, \\[6pt] w_0 \displaystyle \sum _{k=1}^{M}x_k + w_1 \displaystyle \sum _{k=1}^{M}x_k^{2} = \displaystyle \sum _{k=1}^{M}x_k y_k. \end {cases} \end{equation}

This system of equations is termed the normal equations.
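As a concrete numerical illustration, here is a minimal NumPy sketch that builds and solves the \(2\times 2\) system (2.11) directly; the synthetic data, the true parameters, and the variable names are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the assumed model y = f(x) + eps with f(x) = 1 + 2x
# (the true parameters 1 and 2 are made up for this illustration).
M = 200
x = rng.uniform(-1.0, 1.0, size=M)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=M)

# The normal equations (2.11) as a 2x2 linear system A @ [w0, w1] = b.
A = np.array([[M, x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (x * y).sum()])

w0, w1 = np.linalg.solve(A, b)
print(w0, w1)  # close to the true (1.0, 2.0), up to noise
```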

2.1.2 Mean and Variance

The goal of this section is to provide an interpretation of the mean and variance within the context of LS.

A special case of the linear model is one of the form:

\begin{equation} \hat {y} = w_0 \end{equation}

The corresponding MSE loss function is

\begin{equation} \label {eq-ls-one-coeff} \loss (w_0) = \frac {1}{M}\sum _{k=1}^M (y_k-w_0)^2, \end{equation}

with the minimum at the mean of \(y_k\) values,

\begin{equation} \label {eq-ls-one-coeff-mse} w_0 = \frac {1}{M}\sum _{k=1}^M y_k = \bar {y} \end{equation}

The corresponding MSE of the model (substituting (2.14) into (2.13)) is

\begin{equation} \MSE = \frac {1}{M} \sum _{k=1}^M\left (y_k - \frac {1}{M}\sum _{k=1}^M y_k\right )^2 = s_y^2, \end{equation}

which is the sample variance of the \(y_k\) values\(^{1}\).

To summarize, \(\hat {y} = \bar {y}\) with \(\MSE =s_y^2\).

\(^{1}\) This variance estimator is the biased one and is used throughout this chapter (see Sec. 1.1).
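This fact is easy to check numerically. A minimal sketch with made-up data (not from the text): the constant model's MSE (2.13) is minimized at the sample mean, and the minimum equals the biased variance.

```python
import numpy as np

y = np.array([1.0, 3.0, 2.5, 4.0, 0.5])

# Evaluate the MSE loss (2.13) on a grid of candidate constants w0.
w0_grid = np.linspace(y.min(), y.max(), 1001)
mse = ((y[None, :] - w0_grid[:, None]) ** 2).mean(axis=1)

print(w0_grid[mse.argmin()])           # ~ y.mean() = 2.2
print(np.isclose(mse.min(), y.var()))  # minimum MSE is the biased variance
```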

2.2 Normal Equations with Statistical Terms

The goal of this section is to rewrite the normal equations (2.11) using statistical terms.

2.2.1 Sample Covariance

The (biased) sample covariance between \(x_1,\dots ,x_M\) and \(y_1,\dots ,y_M\) is given by

\begin{equation} s_{xy} = \frac {1}{M}\sum _{i=1}^M (x_i - \bar x)\,(y_i - \bar y) \end{equation}

The sign and magnitude of the sample covariance \(s_{xy}\) indicate the direction and strength of the linear relationship between \(x\) and \(y\):

  • \(s_{xy}>0\): As \(x\) increases, \(y\) tends to increase (most products \((x_i - \bar x)(y_i - \bar y)\) are positive).

  • \(s_{xy}<0\): As \(x\) increases, \(y\) tends to decrease.

  • \(s_{xy}\approx 0\): No linear association.

2.2.2 Normal Equations

It is numerically convenient to rewrite the normal equations (2.11) in terms of sample moments: solving the first equation for \(w_0 = \bar y - w_1 \bar x\) and substituting into the second yields

\begin{equation} \label {eq-lr-univariate-coeff-stat} \begin{aligned} w_1&=\frac {s_{xy}}{s_x^2}=\frac {\sum _k(x_k-\bar x)(y_k-\bar y)}{\sum _k (x_k-\bar x)^2},\\ w_0&=\bar y-w_1\bar x = \bar y-\frac {s_{xy}}{s_x^2}\bar x, \end {aligned} \end{equation}

and the predictive form

\begin{equation} \hat {y}= \bar y + w_1\bigl (x-\bar x\bigr ). \end{equation}

Note that these expressions are independent of whether biased or unbiased definitions are used for \(s_{xy}\) and \(s_x\), since the corresponding normalization factors cancel out.
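As a quick cross-check, here is a small NumPy sketch (synthetic data; variable names are illustrative) computing (2.17) from sample moments and comparing against numpy.polyfit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = -0.5 + 1.5 * x + rng.normal(scale=0.4, size=300)

# Biased sample moments; the 1/M factors cancel in the ratio anyway.
s_xy = ((x - x.mean()) * (y - y.mean())).mean()
s_x2 = ((x - x.mean()) ** 2).mean()

w1 = s_xy / s_x2
w0 = y.mean() - w1 * x.mean()

# Cross-check against NumPy's degree-1 polynomial LS fit.
w1_ref, w0_ref = np.polyfit(x, y, 1)
print(np.allclose([w0, w1], [w0_ref, w1_ref]))  # True
```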

Remarks.

  • A valid solution requires \(s_x\neq 0\), i.e. variability in \(x_i\).

  • If both variables are centered (\(\bar x=\bar y=0\)), then (2.17) reduces to

    \begin{equation} \label {eq-lr-univariate-coeff-stat-normalized} \begin{aligned} w_0&=0,\\ w_1 &= \frac {\sum _k x_k y_k}{\sum _k x_k^2} \end {aligned} \end{equation}

2.3 Correlation Coefficient

Definition The sample Pearson correlation coefficient between \(\bx \) and \(\by \) is the normalized (dimensionless) covariance:

\begin{align} r_{xy} &=\frac {s_{xy}}{s_x\,s_y}\\ &= \frac {\displaystyle \sum _{i=1}^M (x_i - \bar x)\,(y_i - \bar y)} {\displaystyle \sqrt {\sum _{i=1}^M (x_i - \bar x)^2} \;\sqrt {\sum _{i=1}^M (y_i - \bar y)^2}} \end{align} and the slope in (2.17) can be rewritten as

\begin{equation} w_1 = r_{xy}\,\frac {s_y}{s_x}. \end{equation}

Note that there is no difference between biased and unbiased definitions for \(r_{xy}\), since the corresponding normalization factors cancel out.
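The relation \(w_1 = r_{xy}\,s_y/s_x\) can also be verified numerically. A minimal sketch with synthetic data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

s_x, s_y = x.std(), y.std()  # biased (ddof=0) by default
s_xy = ((x - x.mean()) * (y - y.mean())).mean()

r_xy = s_xy / (s_x * s_y)
print(np.isclose(r_xy, np.corrcoef(x, y)[0, 1]))  # True

# The slope relation above: r_xy * s_y / s_x equals s_xy / s_x^2.
print(np.isclose(r_xy * s_y / s_x, s_xy / s_x ** 2))  # True
```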

2.3.1 Normalized (z-score) Linear Regression

The goal of this section is to provide an intuitive interpretation of the correlation coefficient.

Rescaling To make the slope directly interpretable, we rescale both variables to zero mean and unit variance:

\begin{equation} z_i = \frac {x_i-\bar x}{s_x}, \qquad t_i = \frac {y_i-\bar y}{s_y}, \qquad i=1,\ldots ,M \end{equation}

The properties of \(z_i\) and \(t_i\) are

\begin{align*} \bar z &= \frac {1}{M}\sum z_i=0,\\ \bar t &=\frac {1}{M}\sum t_i=0,\\ s_z &= s_t = 1. \end{align*}

  • Proof. Using the biased variance definitions\(^{2}\):

    \begin{equation} s_x^2 = \frac {1}{M}\sum _{k=1}^M (x_k - \bar {x})^2, \qquad s_y^2 = \frac {1}{M}\sum _{k=1}^M (y_k - \bar {y})^2. \end{equation}

    For the means

    \begin{equation} \begin{aligned} \sum _{k=1}^M z_k &= \frac {1}{s_x}\sum _{k=1}^M (x_k - \bar {x}) \\ &= \frac {1}{s_x}\left (\sum _{k=1}^M x_k - M\bar {x}\right ) \\ &= \frac {1}{s_x}(M\bar {x} - M\bar {x}) = 0 \end {aligned} \end{equation}

    hence \(\bar {z} = 0\) and, similarly, \(\bar {t} = 0\). For the variances:

    \begin{equation} s_z^2 = \frac {1}{M}\sum _{k=1}^M z_k^2 = \frac {1}{M}\sum _{k=1}^M \frac {(x_k - \bar {x})^2}{s_x^2} = \frac {1}{s_x^2} \left [ \frac {1}{M} \sum _{k=1}^M (x_k - \bar {x})^2 \right ] = \frac {s_x^2}{s_x^2} = 1. \end{equation}

    The proof for \(s_t^2=1\) is analogous.

\(^{2}\) For the unbiased version, the factor \(1/(M-1)\) is used instead of \(1/M\).

Normal Equations The linear model in normalized space,

\begin{equation} \hat {t}_i = w_0^* + w_1^*z_i \end{equation}

yields (applying (2.19))

\begin{align} w_0^* &=0,\label {eq-w_0-normalized}\\ w_1^* &= \frac {s_{zt}}{s_z^2} =\frac {\frac {1}{M}\sum _{i=1}^{M} z_i t_i} {\frac {1}{M}\sum _{i=1}^{M} z_i^2} =\frac {1}{M}\sum _{i=1}^{M} z_i t_i =\frac {1}{s_x s_y}\frac {1}{M}\sum _{i=1}^M (x_i-\bar x)(y_i-\bar y) =r_{xy} \label {eq-w_1-normalized} \end{align} Thus, the normalized prediction is

\begin{equation} \hat t_i = r_{xy}\,z_i, \end{equation}

where

\begin{equation} r_{xy}=\frac {s_{xy}}{s_x\,s_y} \end{equation}

is the correlation coefficient. Returning to the original scale gives

\begin{equation*} \frac {\hat y-\bar y}{s_y}=r_{xy}\frac {x-\bar x}{s_x} \end{equation*}

or the familiar form:

\[ \hat y = \bar y + r_{xy}\,\frac {s_y}{s_x}\bigl (x-\bar x\bigr ). \]
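The whole normalized-regression argument is easy to verify numerically. A minimal sketch, assuming synthetic data and illustrative names:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=400)
y = 1.0 - 0.8 * x + rng.normal(scale=0.5, size=400)

# z-score both variables (biased standard deviations, ddof=0).
z = (x - x.mean()) / x.std()
t = (y - y.mean()) / y.std()

# The LS slope in normalized space reduces to the correlation coefficient.
w1_star = (z * t).mean() / (z * z).mean()
r_xy = np.corrcoef(x, y)[0, 1]
print(np.isclose(w1_star, r_xy))  # True

# Undoing the normalization recovers the familiar original-scale form.
y_hat = y.mean() + r_xy * (y.std() / x.std()) * (x - x.mean())
w1_ref, w0_ref = np.polyfit(x, y, 1)
print(np.allclose(y_hat, w0_ref + w1_ref * x))  # True
```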

Interpretation of \(r_{xy}\)

  • The units of both axes are standard deviations from the origin, since \(z_i\) and \(t_i\) are centered and standardized.

  • In the normalized space, \(r_{xy}\) is the regression slope. A one-standard-deviation increase in \(x\) (\(z\) changes by 1) changes \(y\) by \(r_{xy}\) standard deviations on average (\(\hat {t}\) changes by \(r_{xy}\)).

  • Let’s define two vectors, \(\bz \) and \(\bt \), both in \(\mathbb R^{M}\). The corresponding dot product is

    \begin{equation} \bz \cdot \bt = \norm {\bz }\norm {\bt }\cos (\theta ) \end{equation}

    Using (2.29),

    \begin{equation} \bz \cdot \bt = \sum _{i=1}^M z_it_i = M r_{xy}. \end{equation}

    Since \(\norm {\bz } = \sqrt {\sum z_i^2} = \sqrt {M s_z^2} = \sqrt {M}\), we have

    \begin{equation} \norm {\bz }\norm {\bt } = \sqrt {M}\cdot \sqrt {M} = M \end{equation}

    Finally, \(r_{xy}=\cos \theta \), where \(\theta \) is the angle between the centered and standardized data vectors \(\bz \) and \(\bt \) (Fig. 2.2); a numerical check of this identity is sketched after the list below.

    Consequences:

    • Range: \(-1 \le r_{xy} \le +1\).

    • Special cases:

      \begin{align*} r_{xy} = \pm 1 &\;\Longrightarrow \; \theta = 0^\circ \text { or }180^\circ \\ r_{xy} = 0 &\;\Longrightarrow \; \theta = 90^\circ \end{align*}

    • \(r_{xy}=+1\) indicates perfect positive linear association.

    • \(r_{xy}=-1\) indicates perfect negative linear association.

    • \(r_{xy}=0\) indicates no linear association.
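A numerical check of the identity \(r_{xy}=\cos \theta \), using synthetic data (a sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 0.7 * x + rng.normal(scale=0.7, size=100)

# Centered and standardized data vectors z and t.
z = (x - x.mean()) / x.std()
t = (y - y.mean()) / y.std()

# Cosine of the angle between z and t ...
cos_theta = (z @ t) / (np.linalg.norm(z) * np.linalg.norm(t))

# ... equals the sample correlation coefficient.
print(np.isclose(cos_theta, np.corrcoef(x, y)[0, 1]))  # True
```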

(image)

Figure 2.2: Geometric interpretation of \(r_{xy}\) in a normalized LR.

Regression error as a function of \(r_{xy}\)

With the optimal normalized model \(\hat t = r_{xy} z\), the residuals are

\begin{equation} e_i = t_i - \hat {t}_i = t_i - r_{xy} z_i \end{equation}

with

\begin{equation} \text {MSE} = 1 - r_{xy}^{2} = s_e^2. \end{equation}

  • Proof.

    \begin{equation} \begin{aligned} s_e^2 &= \frac {1}{M}\sum _{i=1}^M e_i^2 = \frac {1}{M}\sum _{i=1}^M \left (t_i - r_{xy} z_i\right )^2\\ &= \frac {1}{M}\sum _{i=1}^M t_i^2 -2r_{xy}\frac {1}{M}\sum _{i=1}^M t_iz_i + r_{xy}^2\frac {1}{M}\sum _{i=1}^Mz_i^2\\ &=s_t^2 -2r_{xy}\left (r_{xy}\right ) + r_{xy}^2s_z^2\\ &=1 - r_{xy}^2 \end {aligned} \end{equation}

Consequences:

  • \(r_{xy}=0\) \(\Rightarrow \) the best linear predictor is \(\hat y=\bar y\) and the MSE equals the sample variance of \(y\). There is no linear association and the MSE is maximal.

  • \(|r_{xy}|=1\) \(\Rightarrow \) perfect fit, zero residual error. It means that the data points lie perfectly on a straight line.

  • Typically, the residual errors are (approximately) Gaussian distributed.

Note that the unbiased estimator of \(s_e^{2}\) requires a \(1/(M-2)\) factor instead of \(1/M\), since two parameters (\(w_0, w_1\)) are estimated from the data.
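A quick numerical check of the relation \(\text {MSE} = 1 - r_{xy}^{2}\) in normalized space (synthetic data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=250)
y = -1.2 * x + rng.normal(scale=0.9, size=250)

z = (x - x.mean()) / x.std()
t = (y - y.mean()) / y.std()

r_xy = np.corrcoef(x, y)[0, 1]
e = t - r_xy * z  # residuals of the optimal normalized model

# Biased residual MSE matches 1 - r_xy^2.
print(np.isclose((e ** 2).mean(), 1.0 - r_xy ** 2))  # True
```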

2.3.2 Coefficient of Determination Metric

The performance of a model is quantified by a performance metric, which is not necessarily the same as the loss function.

To emphasize the difference between loss and metrics, the following example of an LR metric is provided. The coefficient of determination, denoted \(R^2\) or \(r^2\) (R-squared), is based on the relation:

\begin{equation} \underbrace {\sum _{k=1}^{M}(y_k-\bar {y})^2}_{\text {SST}} = \underbrace {\sum _{k=1}^{M}(\hat {y}_k-\bar {y})^2}_{\text {SSR}} + \underbrace {\sum _{k=1}^{M}e_k^2}_{\text {SSE}}, \end{equation}

where:

  • SST: Total sum of squares.

  • SSR: Sum of squares due to regression.

  • SSE: Sum of squared errors (or residual sum of squares).

  • Proof. Observe that

    \[ y_k - \bar y \;=\; (\hat y_k - \bar y) + (y_k - \hat y_k) \;=\; (\hat y_k - \bar y) + e_k. \]

    Hence,

    \[ \begin {aligned} \sum _{k=1}^M (y_k - \bar y)^2 &= \sum _{k=1}^M \bigl [(\hat y_k - \bar y) + e_k\bigr ]^2\\ &= \sum _{k=1}^M (\hat y_k - \bar y)^2 \;+\;\sum _{k=1}^M e_k^2 \;+\;2\sum _{k=1}^M (\hat y_k - \bar y)\,e_k. \end {aligned} \]

    It remains to show the cross-term vanishes:

    \[ \sum _{k=1}^M (\hat y_k - \bar y)\,e_k =\sum _{k=1}^M \hat y_k\,e_k \;-\;\bar y\sum _{k=1}^M e_k =0 - 0 = 0, \]

    since in LS, \(\sum _{k=1}^M e_k = 0\) and \(\sum _{k=1}^M \hat y_k\,e_k = 0\); both follow from the normal equations (2.11): the first gives \(\sum _k e_k=0\), the second gives \(\sum _k x_k e_k=0\), and hence \(\sum _k \hat y_k e_k = w_0\sum _k e_k + w_1\sum _k x_k e_k = 0\).

The \(R^2\) metric is defined as:

\begin{equation} R^2 =\frac {\text {SSR}}{\text {SST}} =1 - \frac {\text {SSE}}{\text {SST}}, \end{equation}

providing a unitless goodness-of-fit measure with an intuitive \(\,[0,1]\) range for LS fits:

  • \(R^2=1\): Perfect fit.

  • \(R^2=0\): The model is no better than predicting the mean, \(\hat {y}_i = \bar y\).

  • \(R^2<0\): The model performs worse than the mean (possible for machine learning models on test data).

Uni-variate case For the uni-variate LR case, \(R^2\) is the fraction of the sample variance of \(y\) explained by the linear fit,

\begin{equation} R^{2}=r_{xy}^{2}. \end{equation}

  • Proof. Starting with \(\hat {y}_i =w_0 + w_1x_i\):

    \begin{equation} \begin{aligned} SSR &=\sum _{i=1}^{M}(\hat {y}_i-\bar {y})^2\\ &=\sum _{i=1}^{M}(w_0 + w_1x_i-\bar {y})^2\\ &=\sum _{i=1}^{M}(\bar y - w_1\bar x + w_1x_i-\bar {y})^2\\ &=w_1^2\sum _{i=1}^{M}(x_i - \bar x)^2\\ &=w_1^2 M s_x^2 = M \left (\frac {s_{xy}}{s_x^2}\right )^2 s_x^2 = M \frac {s_{xy}^2}{s_x^2},\\ SST &= \sum _{k=1}^{M}(y_k-\bar {y})^2 = M s_y^2 \end {aligned} \end{equation}

    Thus, \(SSR/SST = s_{xy}^2 / (s_x^2 s_y^2) = r_{xy}^2\).

The general (biased) MSE for the original (unnormalized) data is

\begin{equation} \begin{aligned} \MSE &= \frac {1}{M} \sum _{k=1}^{M} \bigl (y_k-\hat y_k\bigr )^{2}\\ &=s_y^{2}\bigl (1-r_{xy}^{2}\bigr ) \\ &=\frac {1}{M}\SSE = \frac {1}{M}\left (\text {SST}-\text {SSR}\right ). \end {aligned} \end{equation}
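As a closing sanity check, a short NumPy sketch (synthetic data; names are illustrative) verifying the SST = SSR + SSE decomposition, \(R^2 = r_{xy}^2\), and the final MSE identity:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=300)
y = 0.5 + 2.0 * x + rng.normal(size=300)

w1, w0 = np.polyfit(x, y, 1)  # degree-1 LS fit: slope, intercept
y_hat = w0 + w1 * x
e = y - y_hat

SST = ((y - y.mean()) ** 2).sum()
SSR = ((y_hat - y.mean()) ** 2).sum()
SSE = (e ** 2).sum()

print(np.isclose(SST, SSR + SSE))                    # decomposition holds
R2 = 1.0 - SSE / SST
print(np.isclose(R2, np.corrcoef(x, y)[0, 1] ** 2))  # R^2 = r_xy^2
print(np.isclose(SSE / len(y), y.var() * (1 - R2)))  # MSE = s_y^2 (1 - r^2)
```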