Machine Learning & Signals Learning


2 Uni-Variate Linear Least-Squares

The goal of this chapter is to define and discuss uni-variate linear least-squares (LLS or LS) for a linear model.

Note that LS is also termed linear regression. Uni-variate LS is also referred to as a linear trend-line.

2.1 Uni-variate Linear LS

2.1.1 Definitions

Dataset: A random experiment produces a dataset1 of \(M\) paired observations \(\left \{x_k,y_k\right \}_{k=1}^M\).

ML Model: The assumed model underlying the dataset is

\begin{equation} y = h(x) + \epsilon , \end{equation}

where \(h(x)\) is a deterministic function and \(\epsilon \) is zero-mean noise.

The goal is to propose and apply the model,

\begin{equation} \hat {y}_k = f(x_k) \end{equation}

that provides the most appropriate predictions with respect to the observed \(y_k\).

Challenge: there are many possible \(f(\cdot )\) options.

Error: The model error is

\begin{equation} e_k = y_k - \hat {y}_k \end{equation}

1 Not to be confused with a database, which is a different concept.

Loss
  • Goal: Find optimal model parameters.

Loss function: Loss (or cost) function \(\loss (y_k,\hat {y}_k)\) is some distance metric between \(y_k\) and \(\hat {y}_k\). The optimal model parameters of the applied model are found by minimizing a loss function.

Common examples of loss functions include:

  • the sum of squared errors (SSE),

    \begin{equation} \label {eq-sse-loss-sum} \SSE = \sum _{k=1}^{M} e_k^2, \end{equation}

  • mean-square error (MSE) or (biased) error variance

    \begin{equation} \label {eq-mse-loss-sum} \MSE = \frac {1}{M}\sum _{k=1}^{M} e_k^2 = \frac {1}{M}\SSE = s_e^2 \end{equation}

  • and root-mean-square error (RMSE),

    \begin{equation} \RMSE = \sqrt {\MSE }. \end{equation}
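As a concrete illustration, the sketch below (Python/numpy, with a hypothetical toy dataset and candidate parameters chosen only for illustration) evaluates all three loss functions for a given set of predictions.

```python
import numpy as np

# Hypothetical toy dataset {x_k, y_k} and a candidate model y_hat = w0 + w1*x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
w0, w1 = 1.0, 2.0                # candidate parameters (not necessarily optimal)

y_hat = w0 + w1 * x              # model predictions
e = y - y_hat                    # model error e_k = y_k - y_hat_k
M = len(y)

sse = np.sum(e**2)               # sum of squared errors
mse = sse / M                    # mean-square error (biased error variance)
rmse = np.sqrt(mse)              # root-mean-square error
print(sse, mse, rmse)
```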

Because positive constant factors and monotonically increasing transforms (such as the square root) do not affect the location of the minimum, all these loss functions share the same optimal parameters:

\begin{equation} \label {eq-optimization-definition} \begin{aligned} w_0,w_1 &= \arg \min \limits _{w_0,w_1} \SSE \\ &= \arg \min \limits _{w_0,w_1}\MSE \\ &= \arg \min \limits _{w_0,w_1} \RMSE . \end {aligned} \end{equation}

Metric: A quantifiable measure used to evaluate how well a model is performing.

For example, RMSE is significantly easier to interpret than MSE or SSE because it has the same value range and measurement units as the original dataset.

It is easy to confuse metrics with loss functions, but they serve different purposes:

  • Loss function is used by the algorithm during training to learn optimal model parameters.

  • Metric is used by humans to evaluate the model’s performance.

Summary
  • 1. Select model \(f(\cdot )\)

  • 2. Find model parameters by the minimization of the loss function, \(\loss \)

  • 3. Use metrics to evaluate model performance.

2.1.2 Mean and Variance
  • Goal: To provide an interpretation of the mean and variance within the context of LS.

A special case of the linear model is one of the form:

\begin{equation} \hat {y} = w_0 \end{equation}

The corresponding MSE loss function (2.5) is

\begin{equation} \label {eq-ls-one-coeff} \loss (w_0) = \frac {1}{M}\sum _{k=1}^M (y_k-w_0)^2, \end{equation}

with the related minimum at the mean of \(y_k\) values,

\begin{equation} \label {eq-ls-one-coeff-mse} w_0 = \frac {1}{M}\sum _{k=1}^M y_k = \bar {y} \end{equation}

  • Proof. To find the minimum,

    \begin{equation} \frac {d}{dw_0}\loss (w_0) = 2\frac {1}{M}\sum _{k=1}^M (y_k-w_0)(-1) = 0, \end{equation}

    which gives \(\sum _{k=1}^M y_k = M w_0\), i.e. \(w_0 = \bar {y}\).

The corresponding MSE of the model (substituting (2.10) into (2.9)) is

\begin{equation} \MSE = \frac {1}{M} \sum _{k=1}^M\left (y_k - \bar {y}\right )^2 =s_y^2 \end{equation}

which is the sample variance of the \(y_k\) values 2.

To summarize, the best predictor is the sample mean, \(\hat {y} = \bar {y}\), with \(\MSE =s_y^2\).

2 This is the biased variance estimate, which is used throughout this chapter (see Sec. 1.1).
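As a quick numerical check (a hypothetical toy sample, evaluated with numpy), scanning the constant predictor over a grid of values shows that the MSE is minimized at the sample mean and that the minimal MSE equals the biased sample variance.

```python
import numpy as np

# Hypothetical sample y_k; the constant model y_hat = w0 is scanned over a grid
y = np.array([2.0, 3.5, 1.0, 4.2, 2.8])
w0_grid = np.linspace(y.min(), y.max(), 1001)

# MSE(w0) = (1/M) * sum_k (y_k - w0)^2 for every candidate w0
mse = ((y[:, None] - w0_grid[None, :]) ** 2).mean(axis=0)

best_w0 = w0_grid[np.argmin(mse)]
print(best_w0, y.mean())   # the grid minimizer is (numerically) the sample mean
print(mse.min(), y.var())  # the minimal MSE is the biased sample variance s_y^2
```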

2.1.3 Linear Model

The uni-variate linear model is

\begin{equation} \hat {y}_k = f(x_k;w_0,w_1)=w_0+w_1 x_k, \end{equation}

where \(w_0\) and \(w_1\) are the model parameters.

The corresponding linear model error is

\begin{equation} e_k = y_k-w_0-w_1x_k. \end{equation}

(image)

Figure 2.1: Linear regression visualization. The goal is to minimize \(\SSE \), which represents the total area \(\sum _k e_k^2\) of the rectangles.

The minimization of the SSE loss function ((2.4) and Fig. 2.1),

\begin{equation} \loss (w_0,w_1)=\sum _{k=1}^{M}\bigl (y_k-w_0-w_1x_k\bigr )^2, \end{equation}

is performed by setting the partial derivatives of \(\loss \) to zero:

\begin{equation} \begin{cases} \dfrac {\partial }{\partial w_0} \loss (w_0,w_1) = 0\\[10pt] \dfrac {\partial }{\partial w_1} \loss (w_0,w_1) = 0 \end {cases} \end{equation}

which yields:

\begin{equation} \begin{cases} \displaystyle 2\sum _{k=1}^M \left (y_k - w_0 - w_1x_k\right )\cdot (-1) = 0\\[10pt] \displaystyle 2\sum _{k=1}^M \left (y_k - w_0 - w_1x_k\right )\cdot (-x_k) = 0. \end {cases} \end{equation}

Finally, with some basic algebra:

\begin{equation} \label {eq-ls-normal-eq} \begin{cases} w_0 M + w_1\displaystyle \sum _{k=1}^{M}x_k = \displaystyle \sum _{k=1}^{M}y_k, \\[6pt] w_0 \displaystyle \sum _{k=1}^{M}x_k + w_1 \displaystyle \sum _{k=1}^{M}x_k^{2} = \displaystyle \sum _{k=1}^{M}x_k y_k. \end {cases} \end{equation}

This system of equations is termed the normal equations.
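Since the normal equations (2.18) form a \(2\times 2\) linear system, they can be solved directly. The sketch below (hypothetical toy data) builds and solves the system with numpy and compares the result against numpy's own LS fit (np.polyfit).

```python
import numpy as np

# Hypothetical toy dataset
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
M = len(x)

# Normal equations (2.18) written as A @ [w0, w1] = b
A = np.array([[M,       x.sum()],
              [x.sum(), (x**2).sum()]])
b = np.array([y.sum(), (x * y).sum()])
w0, w1 = np.linalg.solve(A, b)

# np.polyfit solves the same LS problem (highest-degree coefficient first)
w1_ref, w0_ref = np.polyfit(x, y, 1)
print(w0, w1)
print(w0_ref, w1_ref)
```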

2.2 Normal Equations with Statistical Terms

  • Goal: To rewrite the normal equations (2.18) using statistical terms.

2.2.1 Sample Covariance
  • Goal: Quantify linear relation between \(x_k\) and \(y_k\).

The (biased) sample covariance between \(x_1,\dots ,x_M\) and \(y_1,\dots ,y_M\) is given by

\begin{equation} s_{xy} = \frac {1}{M}\sum _{k=1}^M (x_k - \bar x)\,(y_k - \bar y) \end{equation}

The sign and magnitude of the sample covariance \(s_{xy}\) indicate the direction and strength of the linear relationship between \(x\) and \(y\):

  • \(s_{xy}>0\): As \(x\) increases, \(y\) tends to increase (most products \((x_i - \bar x)(y_i - \bar y)\) are positive).

  • \(s_{xy}<0\): As \(x\) increases, \(y\) tends to decrease.

  • \(s_{xy}\approx 0\): No linear association.

2.2.2 Correlation Coefficient
  • Goal: Normalized sample covariance.

The sample Pearson correlation coefficient between \(\bx \) and \(\by \) is the normalized (dimensionless) covariance:

\begin{equation} \begin{aligned} r_{xy} &=\frac {s_{xy}}{s_x\,s_y}\\ &= \frac {\displaystyle \sum _{k=1}^M (x_k - \bar x)\,(y_k - \bar y)} {\displaystyle \sqrt {\sum _{k=1}^M (x_k - \bar x)^2} \;\sqrt {\sum _{k=1}^M (y_k - \bar y)^2}} \end {aligned} \end{equation}

Interpretation

  • Range: \(-1 \le r_{xy} \le +1\).

  • \(r_{xy} > 0\): positive linear association (as \(x\) increases, \(y\) tends to increase).

  • \(r_{xy} < 0\): negative linear association (as \(x\) increases, \(y\) tends to decrease).

  • \(r_{xy} = 0\): no linear association (a nonlinear relationship may still exist).

  • \(|r_{xy}| = 1\): perfect linear relationship (\(y_k = a + bx_k\) for all \(k\)).

Note that there is no difference between the biased and unbiased definitions for \(r_{xy}\), since the corresponding normalization factors cancel out.
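A minimal numpy sketch (hypothetical toy data) that computes \(r_{xy}\) from the biased sample moments and compares it with numpy's built-in Pearson coefficient:

```python
import numpy as np

# Hypothetical paired samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Biased sample covariance and standard deviations (1/M normalization)
s_xy = np.mean((x - x.mean()) * (y - y.mean()))
s_x, s_y = x.std(), y.std()      # np.std uses the biased (1/M) estimate by default

r_xy = s_xy / (s_x * s_y)
print(r_xy, np.corrcoef(x, y)[0, 1])  # matches numpy's Pearson coefficient
```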

2.2.3 Normal Equations

It is numerically convenient to rewrite the normal equations (2.18) in terms of sample moments

\begin{equation} \label {eq-lr-univariate-coeff-stat} \begin{aligned} w_1&=\frac {s_{xy}}{s_x^2}= r_{xy}\,\frac {s_y}{s_x}\nonumber \\ &=\frac {\sum _k(x_k-\bar x)(y_k-\bar y)}{\sum _k (x_k-\bar x)^2},\\ w_0&=\bar y-w_1\bar x \nonumber \\ &= \bar y-\frac {s_{xy}}{s_x^2}\bar x, \end {aligned} \end{equation}

and the predictive form

\begin{equation} \hat {y}= \bar y + w_1\bigl (x-\bar x\bigr ). \end{equation}

Note that these expressions are independent of whether biased or unbiased definitions are used for \(s_{xy}\) and \(s_x\), since the corresponding normalization factors cancel out.

Remarks.

  • A valid solution requires \(s_x\neq 0\), i.e. the \(x_k\) values must not all be equal.

  • If both variables are centered (\(\bar x=\bar y=0\)), then (2.21) reduces to

    \begin{equation} \label {eq-lr-univariate-coeff-stat-normalized} \begin{aligned} w_0&=0,\\ w_1 &= \frac {\sum _k x_k y_k}{\sum _k x_k^2} \end {aligned} \end{equation}
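The moment-based coefficients (2.21) and the centered special case (2.22) are easy to verify numerically; the sketch below (hypothetical toy data) computes both.

```python
import numpy as np

# Hypothetical toy dataset
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Coefficients from sample moments (2.21); biased or unbiased moments give the same result
s_xy = np.mean((x - x.mean()) * (y - y.mean()))
w1 = s_xy / x.var()              # w1 = s_xy / s_x^2
w0 = y.mean() - w1 * x.mean()    # w0 = ybar - w1 * xbar

# Centered variables: the intercept vanishes and the slope reduces to (2.22)
xc, yc = x - x.mean(), y - y.mean()
w1_centered = np.sum(xc * yc) / np.sum(xc**2)

print(w0, w1, w1_centered)       # w1 and w1_centered coincide
```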

2.3 Metrics

The performance of a model is quantified by a performance metric, which is not necessarily the same as the loss function.

2.3.1 Coefficient of Determination (R2)

To emphasize the difference between loss and metrics, the following example of an LR metric is provided. The coefficient of determination, denoted \(R^2\) (R-squared), is based on the relation:

\begin{equation} \label {eq-sst-ssr-sse} \underbrace {\sum _{k=1}^{M}(y_k-\bar {y})^2}_{\text {SST}} = \underbrace {\sum _{k=1}^{M}(\hat {y}_k-\bar {y})^2}_{\text {SSR}} + \underbrace {\sum _{k=1}^{M}e_k^2}_{\text {SSE}}, \end{equation}

where:

  • SST: Total sum of squares.

  • SSR: Sum of squares due to regression.

  • SSE: Sum of squared errors (or residual sum of squares).

The \(R^2\) metric is defined as:

\begin{equation} R^2 =\frac {\text {SSR}}{\text {SST}} =1 - \frac {\text {SSE}}{\text {SST}}, \end{equation}

providing a unitless goodness-of-fit measure whose value typically lies in the intuitive \([0,1]\) range:

  • \(R^2=1\): Perfect fit.

  • \(R^2=0\): The model is no better than predicting the mean, \(\hat {y}_i = \bar y\).

  • \(R^2<0\): The model performs worse than the mean (possible for machine learning models on test data).

  • Proof. To prove (2.23), observe that

    \[ y_k - \bar y \;=\; (\hat y_k - \bar y) + (y_k - \hat y_k) \;=\; (\hat y_k - \bar y) + e_k. \]

    Hence,

    \[ \begin {aligned} \sum _{k=1}^M (y_k - \bar y)^2 &= \sum _{k=1}^M \bigl [(\hat y_k - \bar y) + e_k\bigr ]^2\\ &= \sum _{k=1}^M (\hat y_k - \bar y)^2 \;+\;\sum _{k=1}^M e_k^2 \;+\;2\sum _{k=1}^M (\hat y_k - \bar y)\,e_k. \end {aligned} \]

    It remains to show the cross-term vanishes:

    \[ \sum _{k=1}^M (\hat y_k - \bar y)\,e_k =\sum _{k=1}^M \hat y_k\,e_k \;-\;\bar y\sum _{k=1}^M e_k =0 - 0 = 0, \]

    since for the LS solution the normal equations (2.17) give \(\sum _{k=1}^M e_k = 0\) and \(\sum _{k=1}^M x_k\,e_k = 0\); hence \(\sum _{k=1}^M \hat y_k\,e_k = w_0\sum _{k} e_k + w_1\sum _{k} x_k\,e_k = 0\).
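The decomposition (2.23) and the two equivalent forms of \(R^2\) can be verified numerically; the sketch below (hypothetical toy data, LS fit via np.polyfit) does so.

```python
import numpy as np

# Hypothetical toy dataset and its LS fit
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
w1, w0 = np.polyfit(x, y, 1)
y_hat = w0 + w1 * x
e = y - y_hat

sst = np.sum((y - y.mean())**2)      # total sum of squares
ssr = np.sum((y_hat - y.mean())**2)  # sum of squares due to regression
sse = np.sum(e**2)                   # residual sum of squares

print(np.isclose(sst, ssr + sse))    # the decomposition (2.23) holds for the LS fit
print(ssr / sst, 1 - sse / sst)      # both forms of R^2 agree
```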

R2 of Uni-variate Linear LS For the uni-variate linear LS case, \(R^2\) equals the squared correlation coefficient, i.e. the fraction of the sample variance of \(y\) explained by the linear fit,

\begin{equation} R^{2}=r_{xy}^{2}. \end{equation}

  • Proof. Starting with \(\hat {y}_k =w_0 + w_1x_k\):

    \begin{equation} \begin{aligned} SSR &=\sum _{k=1}^{M}(\hat {y}_k-\bar {y})^2\\ &=\sum _{k=1}^{M}(w_0 + w_1x_k-\bar {y})^2\\ &=\sum _{k=1}^{M}(\bar y - w_1\bar x + w_1x_k-\bar {y})^2\\ &=w_1^2\sum _{k=1}^{M}(x_k - \bar x)^2\\ &=w_1^2 M s_x^2 = M \left (\frac {s_{xy}}{s_x^2}\right )^2 s_x^2 = M \frac {s_{xy}^2}{s_x^2},\\ SST &= \sum _{k=1}^{M}(y_k-\bar {y})^2 = M s_y^2 \end {aligned} \end{equation}

    Thus, \(SSR/SST = s_{xy}^2 / (s_x^2 s_y^2) = r_{xy}^2\).

2.3.2 MSE of Linear LS

The general (biased3) MSE is given by

\begin{equation} \MSE = s_y^{2}\bigl (1-r_{xy}^{2}\bigr ) = s_e^{2} \end{equation}

3 The unbiased estimator of \(s_e^{2}\) requires a \(1/(M-2)\) factor instead of \(1/M\).

  • Proof. Using (2.23):

    \begin{equation} \begin{aligned} \MSE &= \frac {1}{M}\SSE = \frac {1}{M}\left (\text {SST}-\text {SSR}\right )\\ &= \frac {\text {SST}}{M}\left (1-\frac {\text {SSR}}{\text {SST}}\right )\\ &= s_y^{2}\bigl (1-R^{2}\bigr )\\ &= s_y^{2}\bigl (1-r_{xy}^{2}\bigr ) \end {aligned} \end{equation}
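A short numerical check (same hypothetical toy data as above) of the identity MSE \(= s_y^2(1-r_{xy}^2)\):

```python
import numpy as np

# Hypothetical toy dataset and its LS fit
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
w1, w0 = np.polyfit(x, y, 1)
e = y - (w0 + w1 * x)

r_xy = np.corrcoef(x, y)[0, 1]
mse = np.mean(e**2)                  # biased MSE of the LS fit

# MSE = s_y^2 * (1 - r_xy^2), with the biased sample variance y.var()
print(mse, y.var() * (1 - r_xy**2))  # the two values coincide
```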

2.4 Normalized (z-score) Linear Regression (*)

The goal of this section is to provide an intuitive interpretation of the correlation coefficient.

Rescaling To make the slope directly interpretable, we rescale both variables to zero mean and unit variance:

\begin{equation} z_i = \frac {x_i-\bar x}{s_x}, \qquad t_i = \frac {y_i-\bar y}{s_y}, \qquad i=1,\ldots ,M \end{equation}

The properties of \(z_i\) and \(t_i\) are

\begin{align*} \bar z &= \frac {1}{M}\sum z_i=0,\\ \bar t &=\frac {1}{M}\sum t_i=0,\\ s_z &= s_t = 1. \end{align*}

  • Proof. Using biased variance definitions 4:

    \begin{equation} s_x^2 = \frac {1}{M}\sum _{k=1}^M (x_k - \bar {x})^2, \qquad s_y^2 = \frac {1}{M}\sum _{k=1}^M (y_k - \bar {y})^2. \end{equation}

    For the means

    \begin{equation} \begin{aligned} \sum _{k=1}^M z_k &= \frac {1}{s_x}\sum _{k=1}^M (x_k - \bar {x}) \\ &= \frac {1}{s_x}\left (\sum _{k=1}^M x_k - M\bar {x}\right ) \\ &= \frac {1}{s_x}(M\bar {x} - M\bar {x}) = 0 \end {aligned} \end{equation}

    hence \(\bar {z} = 0\) and similarly \(\bar {t} = 0\). For the variances:

    \begin{equation} s_z^2 = \frac {1}{M}\sum _{k=1}^M z_k^2 = \frac {1}{M}\sum _{k=1}^M \frac {(x_k - \bar {x})^2}{s_x^2} = \frac {1}{s_x^2} \left [ \frac {1}{M} \sum _{k=1}^M (x_k - \bar {x})^2 \right ] = \frac {s_x^2}{s_x^2} = 1. \end{equation}

    The proof for \(s_t^2=1\) is analogous.

4 For the unbiased estimate, a \(1/(M-1)\) factor is used instead.

Normal Equations The linear model in normalized space is

\begin{equation} \hat t_i = r_{xy}\,z_i. \end{equation}

where

\begin{equation} r_{xy}=\frac {s_{xy}}{s_x\,s_y} \end{equation}

is the correlation coefficient.

  • Proof. Applying (2.22) to

    \begin{equation} \hat {t}_i = w_0^* + w_1^*z_i \end{equation}

    yields

    \begin{align} w_0^* &=0,\label {eq-w_0-normalized}\\ w_1^* &= \frac {s_{zt}}{s_z^2} =\frac {1}{M}\sum _{i=1}^{M} z_i t_i =\frac {1}{M}\sum _{i=1}^M \left (\frac {x_i-\bar x}{s_x}\right )\left (\frac {y_i-\bar y}{s_y}\right ) =r_{xy} \label {eq-w_1-normalized} \end{align}

Regression error as a function of \(r_{xy}\)

With the optimal normalized model \(\hat t = r_{xy} z\) the residuals are

\begin{equation} e_i = t_i - \hat {t}_i = t_i - r_{xy} z_i \end{equation}

with

\begin{equation} \MSE = 1 - r_{xy}^{2} = s_e^2. \end{equation}

  • Proof.

    \begin{equation} \begin{aligned} s_e^2 &= \frac {1}{M}\sum _{i=1}^M e_i^2 = \frac {1}{M}\sum _{i=1}^M \left (t_i - r_{xy} z_i\right )^2\\ &= \frac {1}{M}\sum _{i=1}^M t_i^2 -2r_{xy}\frac {1}{M}\sum _{i=1}^M t_iz_i + r_{xy}^2\frac {1}{M}\sum _{i=1}^Mz_i^2\\ &=s_t^2 -2r_{xy}\left (r_{xy}\right ) + r_{xy}^2s_z^2\\ &=1 - r_{xy}^2 \end {aligned} \end{equation}
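The normalized-space results can also be checked numerically: after z-scoring both variables, the LS slope equals \(r_{xy}\) and the residual MSE equals \(1-r_{xy}^2\). A minimal sketch (hypothetical toy data):

```python
import numpy as np

# Hypothetical toy dataset, z-scored with the biased (1/M) standard deviations
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
z = (x - x.mean()) / x.std()
t = (y - y.mean()) / y.std()

r_xy = np.corrcoef(x, y)[0, 1]

# LS slope in the normalized space (intercept is zero) and its residuals
w1_star = np.sum(z * t) / np.sum(z**2)
e = t - r_xy * z

print(w1_star, r_xy)               # the normalized slope equals r_xy
print(np.mean(e**2), 1 - r_xy**2)  # the residual MSE equals 1 - r_xy^2
```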

Geometrical interpretation of \(r_{xy}\)

(image)

Figure 2.2: Geometric interpretation of \(r_{xy}\) in a normalized LR.
  • The x- and y-axis units are standard deviations from the origin, since \(z_i\) and \(t_i\) are centered and standardized.

  • In the normalized space, \(r_{xy}\) is the regression slope. A one-standard-deviation increase in \(x\) (\(z\) changes by 1) changes \(y\) by \(r_{xy}\) standard deviations on average (\(\hat {t}\) changes by \(r_{xy}\)).

  • Let’s define two vectors, \(\bz \) and \(\bt \), both in \(\mathbb R^{M}\). The angle between them is \(\theta \) and \(r_{xy}=\cos \theta \) (Fig. 2.2).

    • Proof. The corresponding dot product is

      \begin{equation} \bz \cdot \bt = \norm {\bz }\norm {\bt }\cos (\theta ) \end{equation}

      Using (2.37),

      \begin{equation} \bz \cdot \bt = \sum _{i=1}^M z_it_i = M r_{xy}. \end{equation}

      Since \(\norm {\bz } = \sqrt {\sum z_i^2} = \sqrt {M s_z^2} = \sqrt {M}\) and, by the same argument, \(\norm {\bt } = \sqrt {M}\), we have

      \begin{equation} \norm {\bz }\norm {\bt } = \sqrt {M}\cdot \sqrt {M} = M, \end{equation}

      and therefore \(\cos \theta = \dfrac {\bz \cdot \bt }{\norm {\bz }\,\norm {\bt }} = \dfrac {M r_{xy}}{M} = r_{xy}\). A numerical check of this identity follows the consequences list below.

Consequences:

  • Range: \(-1 \le r_{xy} \le +1\).

  • Special cases:

    • \(r_{xy}=+1\) indicates perfect positive linear association, \(\theta = 0^\circ \).

    • \(r_{xy}=-1\) indicates perfect negative linear association, \(\theta = 180^\circ \).

    • \(r_{xy}=0\) indicates no linear association, \(\theta = 90^\circ \).
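A numerical check of the geometric interpretation (hypothetical toy data): the cosine of the angle between the standardized vectors \(z\) and \(t\) equals \(r_{xy}\).

```python
import numpy as np

# Hypothetical toy dataset; z and t are the standardized vectors in R^M
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
z = (x - x.mean()) / x.std()
t = (y - y.mean()) / y.std()

cos_theta = np.dot(z, t) / (np.linalg.norm(z) * np.linalg.norm(t))
r_xy = np.corrcoef(x, y)[0, 1]

print(cos_theta, r_xy)  # cos(theta) equals the correlation coefficient
print(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))  # angle in degrees
```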

2.5 Spearman Rank Correlation

  • Goal: Non-parametric alternative to Pearson correlation for monotonic (not necessarily linear) relationships.

The Pearson correlation coefficient \(r_{xy}\) measures linear association. When the relationship between \(x\) and \(y\) is monotonic but non-linear, or when the data contain outliers, the Spearman rank correlation provides a more robust measure.

Let \(R(x_i)\) denote the rank of \(x_i\) among \(\{x_1,\dots ,x_M\}\) (with average ranks for ties), and similarly \(R(y_i)\). The Spearman correlation is the Pearson correlation computed on the ranks:

\begin{equation} r_s = r_{R(x)\,R(y)} \end{equation}

  • Range: \(-1 \le r_s \le +1\).

  • \(|r_s| = 1\): perfect monotonic (not necessarily linear) relationship.

  • Robust to outliers and non-linear monotonic relationships.

  • For linear data, \(r_s \approx r_{xy}\).

  • Example 2.1: Consider \(M=5\) observations with an exponential relationship \(y \approx e^x\). The table compares a clean dataset (perfectly monotonic) with a noisy one (rank swap at \(i=2,3\)):

    Clean (\(y_i = e^{x_i}\)):
    \(i\):      1, 2, 3, 4, 5
    \(x_i\):    1, 2, 3, 4, 5
    \(y_i\):    2.7, 7.4, 20.1, 54.6, 148.4
    \(R(x_i)\): 1, 2, 3, 4, 5
    \(R(y_i)\): 1, 2, 3, 4, 5

    Noisy:
    \(i\):      1, 2, 3, 4, 5
    \(x_i\):    1, 2, 3, 4, 5
    \(y_i\):    3.5, 8.1, 5.3, 52.0, 135.7
    \(R(x_i)\): 1, 2, 3, 4, 5
    \(R(y_i)\): 1, 3, 2, 4, 5
    • Clean: \(r_{xy} \approx 0.89\), \(r_s = 1\). Pearson underestimates the association (the exponential curve is not a straight line), while Spearman correctly identifies a perfect monotonic relationship.

    • Noisy: \(r_{xy} \approx 0.86\), \(r_s = 0.90\). Noise causes \(y_2 > y_3\), swapping ranks 2 and 3. Both correlations decrease, but Spearman specifically reflects the broken monotonicity.
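A minimal numpy sketch that reproduces Example 2.1: the Spearman correlation is computed as the Pearson correlation of the ranks (the data are assumed tie-free, so simple integer ranks suffice).

```python
import numpy as np

def ranks(v):
    """Ranks 1..M for tie-free data (the smallest value gets rank 1)."""
    return np.argsort(np.argsort(v)) + 1

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = np.exp(x)                               # y = e^x, perfectly monotonic
y_noisy = np.array([3.5, 8.1, 5.3, 52.0, 135.7])  # noisy values from Example 2.1

for y in (y_clean, y_noisy):
    r_pearson = np.corrcoef(x, y)[0, 1]
    r_spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]  # Pearson on the ranks
    print(round(r_pearson, 2), round(r_spearman, 2))
# Expected (per Example 2.1): 0.89 and 1.0 for the clean data, 0.86 and 0.9 for the noisy data
```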