Data-Driven Time-Series Prediction


Chapter 2 Least-squares and Linear Regression

  • Goal:

    • The goal of the least squares (LS) method is to minimize the MSE (or RMSE) between the given data and a parametric model.

    • Define and analyze a model that is based on a linear relation between data and the outcome.

    • Find the linear model parameters by LS.

2.1 Uni-variate Linear LS

2.1.1 Definition

The (random) experiment produces a set of \(M\) paired observations \(\left \{x_k,y_k\right \}_{k=1}^M\). The linear model is

\begin{equation} y = f(x;w_0,w_1) = w_0 + w_1 x + \epsilon , \end{equation}

where \(w_0\) and \(w_1\) are the model parameters and \(\epsilon \) is zero-mean noise. Predictions are

\begin{equation} \hat {y}_k = f(x_k;w_0,w_1)=w_0+w_1 x_k . \end{equation}

The performance metric is the mean-square error (MSE)

\begin{equation} J_{\text {mse}}(w_0,w_1)=\frac {1}{M}\sum _{k=1}^{M}(y_k-\hat y_k)^2 =\frac {1}{M}\sum _{k=1}^{M}e_k^2 , \end{equation}

or its square-root form, the RMSE,

\begin{equation} J_{\text {rmse}}(w_0,w_1)=\sqrt {J_{\text {mse}}(w_0,w_1)} . \end{equation}

(When multiplied by \(M\), the MSE becomes the sum of squared errors, SSE.)

Because constant factors and monotone transformations do not affect the minimizer, the loss function is chosen as

\begin{equation} \loss (w_0,w_1)=\sum _{k=1}^{M}\bigl (y_k-w_0-w_1x_k\bigr )^2 , \end{equation}

and the optimisation problem is

\begin{equation} \label {eq-optimization-definition} \begin{aligned} (w_0,w_1) &= \arg \min \limits _{w_0,w_1} J_{\text {mse}}(w_0,w_1) \\ &= \arg \min \limits _{w_0,w_1}J_{\text {rmse}}(w_0,w_1) \\ &= \arg \min \limits _{w_0,w_1} \loss (w_0,w_1) \end {aligned} \end{equation}

2.1.2 Minimization

Setting the partial derivatives of \(\loss \) to zero,

\begin{equation} \begin{cases} \dfrac {\partial }{\partial w_0} \loss (w_0,w_1) = 0\\[10pt] \dfrac {\partial }{\partial w_1} \loss (w_0,w_1) = 0\\ \end {cases} \end{equation}

gives

\begin{equation} \begin{cases} \displaystyle 2\sum _{k=1}^M \left (y_k - w_0 - w_1x_k\right )\cdot (-1) = 0\\[10pt] \displaystyle 2\sum _{k=1}^M \left (y_k - w_0 - w_1x_k\right )\cdot (-x_k) = 0\\ \end {cases} \end{equation}

The resulting equations are the normal equations,

\begin{equation} \begin{cases} w_0 M + w_1\displaystyle \sum _{k=1}^{M}x_k = \displaystyle \sum _{k=1}^{M}y_k ,\\[6pt] w_0 \displaystyle \sum _{k=1}^{M}x_k + w_1 \displaystyle \sum _{k=1}^{M}x_k^{2} = \displaystyle \sum _{k=1}^{M}x_k y_k . \end {cases} \end{equation}

Solving them in terms of the (biased) sample moments gives

\begin{align} w_1&=\frac {s_{xy}}{s_x^2}=\frac {\sum _k(x_k-\bar x)(y_k-\bar y)}{\sum _k (x_k-\bar x)^2},\\ w_0&=\bar y-w_1\bar x = \bar y-\frac {s_{xy}}{s_x^2}\bar x, \end{align} and the predictive form

\begin{equation} \hat {\by }= \bar y + w_1\bigl (\bx -\bar x\bigr ). \end{equation}

Note that the biased and unbiased definitions of \(s_{xy}\) and \(s_x^2\) lead to the same \(w_0\) and \(w_1\), since the normalization factors cancel.
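A minimal numerical sketch of the moment-based solution, assuming Python with NumPy (the array names x, y and the helper fit_univariate_ls are illustrative, not part of the text):

import numpy as np

def fit_univariate_ls(x, y):
    # Univariate LS fit y ~ w0 + w1*x via sample moments; the biased/unbiased
    # normalization cancels, so unnormalized sums are used directly.
    x_bar, y_bar = x.mean(), y.mean()
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Synthetic example: y = 1 + 2x + noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(x.size)
print(fit_univariate_ls(x, y))   # approximately (1, 2)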

Remarks.

  • A valid solution requires \(s_x\neq 0\).

  • If both variables are centered (\(\bar x=\bar y=0\)), then \(w_0=0\).

Sample Covariance. The unbiased sample covariance is given by

\begin{equation} s_{xy} = \frac {1}{n-1}\sum _{i=1}^n (x_i - \bar x)\,(y_i - \bar y) \end{equation}

The sign and magnitude of the sample covariance \(s_{xy}\) indicate the direction and strength of the linear relationship between \(x\) and \(y\):

  • \(s_{xy}>0\): as \(x\) increases, \(y\) tends to increase.

  • \(s_{xy}<0\): as \(x\) increases, \(y\) tends to decrease.

  • \(s_{xy}\approx 0\): no linear association.

Correlation coefficient. The sample Pearson correlation coefficient between \(\bx \) and \(\by \) is the normalized (dimensionless) covariance,

\begin{align} r_{xy} &=\frac {s_{xy}}{s_x\,s_y}\\ &= \frac {\displaystyle \sum _{i=1}^n (x_i - \bar x)\,(y_i - \bar y)} {\displaystyle \sqrt {\sum _{i=1}^n (x_i - \bar x)^2} \;\sqrt {\sum _{i=1}^n (y_i - \bar y)^2}} \end{align} and the slope can be written as

\begin{equation} w_1 = r_{xy}\,\frac {s_y}{s_x}, \end{equation}

that is, the slope is the correlation coefficient scaled by the ratio of the standard deviations.

  • Range: \(-1 \le r_{xy} \le +1\).

  • \(r_{xy}=+1\) indicates perfect positive linear association.

  • \(r_{xy}=-1\) indicates perfect negative linear association.

  • \(r_{xy}=0\) indicates no linear association.

Note that the biased and unbiased definitions also lead to the same \(r_{xy}\), since the normalization factors cancel.

(image)

Figure 2.1: Linear regression visualization. The goal is to minimize the total area \(\sum _ke_k^2\) of the rectangles.
2.1.3 Normalized (z-score) Linear Regression

To make the slope directly interpretable, we often rescale both variables to zero mean and unit variance:

\[ z_i = \frac {x_i-\bar x}{s_x}, \qquad t_i = \frac {y_i-\bar y}{s_y}, \qquad i=1,\ldots ,M, \]

By construction, \(\{z_k\}\) and \(\{t_k\}\) satisfy

\begin{align*} \bar z &= \bar t = 0,\\ \sum _k z_k^2 &= \sum _k t_k^2 = M, \end{align*} so the normal equations immediately yield

\begin{equation} \label {eq-w_1-normalized} w_1^* =\frac {\sum _{i=1}^{M} z_i t_i} {\sum _{i=1}^{M} z_i^2} =r_{xy}. \end{equation}

Thus

\[ \hat t = r_{xy}\,z , \qquad (w_0^*=0,\ w_1^*=r_{xy}). \]

Returning to the original scale gives

\begin{equation*} \frac {\hat y-\bar y}{s_y}=r_{xy}\frac {x-\bar x}{s_x} \end{equation*}

or, in the familiar form

\[ \hat y = \bar y + r_{xy}\,\frac {s_y}{s_x}\bigl (x-\bar x\bigr ). \]

Interpretation of \(r_{xy}\).

  • In the normalized space, \(r_{xy}\) is the regression slope: a one-standard-deviation increase in \(x\) changes \(y\) by \(r_{xy}\) standard deviations on average.

  • Geometrically, \(r_{xy}=\cos \theta \) where \(\theta \) is the angle between the centered data vectors \(\bigl [x_1-\bar x,\dots ,x_M-\bar x\bigr ]\) and \(\bigl [y_1-\bar y,\dots ,y_M-\bar y\bigr ]\) in \(\mathbb R^{M}\).

  • Special case of perfect linear fit (zero error)

    \[ r_{xy}= \pm 1 \;\Longrightarrow \; \theta = 0^\circ \text { or }180^\circ . \]

Regression error as a function of \(r_{xy}\). With the optimal normalized model \(\hat t = r_{xy} z\) the residuals are

\[e_i = t_i - r_{xy} z_i\]

with

\begin{equation} \text {MSE} = 1 - r_{xy}^{2}. \end{equation}

The corresponding (biased) MSE in the original scale is

\begin{equation} \begin{aligned} \text {MSE} &= \frac {1}{M} \sum _{k=1}^{M} \bigl (y_k-\hat y_k\bigr )^{2}\\ &=s_y^{2}\bigl (1-r_{xy}^{2}\bigr ) \end {aligned} \end{equation}

  • \(r_{xy}=0\) \(\Rightarrow \) the best linear predictor is \(\hat y=\bar y\) and the MSE equals the sample variance of \(y\).

  • \(|r_{xy}|=1\) \(\Rightarrow \) perfect fit, zero residual error.

  • The unbiased estimator of the residual variance \(s_e^{2}\) uses a \(1/(M-2)\) factor instead of the \(1/M\) factor in the MSE.

  • Typically, the residual errors are (approximately) Gaussian distributed.

2.2 Coefficient of Determination

To emphasize the difference between the loss and evaluation metrics, the following example of a linear-regression metric is provided. The coefficient of determination, denoted \(R^2\) or \(r^2\) (R-squared), is based on the relation

\begin{equation} \underbrace {\sum _{k=1}^{M}(y_k-\bar {y})^2}_{\text {SST}} = \underbrace {\sum _{k=1}^{M}(\hat {y}_k-\bar {y})^2}_{\text {SSR}} + \underbrace {\sum _{k=1}^{M}e_k^2}_{\text {SSE}} \end{equation}

  • SST - total sum of squares

  • SSR - sum of squares due to regression

  • SSE - sum of squared errors (or residual sum of squares)

  • Proof.

    \[ y_k - \bar y \;=\; (\hat y_k - \bar y) + (y_k - \hat y_k) \;=\; (\hat y_k - \bar y) + e_k. \]

    Hence

    \[ \begin {aligned} \sum _{k=1}^M (y_k - \bar y)^2 &= \sum _{k=1}^M \bigl [(\hat y_k - \bar y) + e_k\bigr ]^2\\ &= \sum _{k=1}^M (\hat y_k - \bar y)^2 \;+\;\sum _{k=1}^M e_k^2 \;+\;2\sum _{k=1}^M (\hat y_k - \bar y)\,e_k. \end {aligned} \]

    It remains to show the cross-term vanishes:

    \[ \sum _{k=1}^M (\hat y_k - \bar y)\,e_k =\sum _{k=1}^M \hat y_k\,e_k \;-\;\bar y\sum _{k=1}^M e_k =0 - 0 = 0, \]

    since in LS \(\sum _{k=1}^M e_k = 0\) and \(\sum _{k=1}^M \hat y_k\,e_k = 0\).

\begin{equation} R^2 =\frac {\text {SSR}}{\text {SST}} =1 - \frac {\text {SSE}}{\text {SST}} \end{equation}

providing a unit-less goodness-of-fit measure that, for an LS fit with an intercept, lies in the intuitive \([0,1]\) range,

  • \(R^2=1\): perfect fit.

  • \(R^2=0\): model is no better than predicting the mean, \(\hat {y}_i = \bar y\).

  • \(R^2<0\): model is worse than predicting the mean (possible for models not fitted by LS, e.g. machine-learning models evaluated on new data).

Uni-variate case. For the uni-variate case, \(R^2\) is the fraction of the sample variance of \(y\) explained by the linear fit,

\begin{equation} R^{2}=r_{xy}^{2}. \end{equation}

  • Proof.

    \begin{equation} \begin{aligned} \hat {y}_i &=w_0 + w_1x_i\\ SSR &=\sum _{i=1}^{M}(\hat {y}_i-\bar {y})^2\\ &=\sum _{i=1}^{M}(w_0 + w_1x_i-\bar {y})^2\\ &=\sum _{i=1}^{M}(\bar y - w_1\bar x + w_1x_i-\bar {y})^2\\ &=w_1^2\sum _{i=1}^{M}(x_i-\bar x)^2\\ &=w_1^2(M-1)s_x^2 = (M-1)\left (\frac {s_{xy}}{s_x^2}\right )^2\!s_x^2 = (M-1)\frac {s_{xy}^2}{s_x^2}\\ SST &= \sum _{k=1}^{M}(y_k-\bar {y})^2 = (M-1)s_y^2\\ \frac {SSR}{SST} &= \frac {s_{xy}^2}{s_x^2 s_y^2} = r_{xy}^2 \end {aligned} \end{equation}
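As a quick numerical check, assuming NumPy (the synthetic data and names below are illustrative), one can verify both the decomposition SST = SSR + SSE and the identity \(R^2=r_{xy}^2\):

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = 0.5 + 1.5 * x + rng.standard_normal(200)

# Univariate LS fit
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
y_hat = w0 + w1 * x

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
r_xy = np.corrcoef(x, y)[0, 1]

print(np.isclose(sst, ssr + sse))        # True: SST = SSR + SSE
print(np.isclose(ssr / sst, r_xy ** 2))  # True: R^2 = r_xy^2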

2.3 Vector/Matrix Notation

2.3.1 Uni-variate model

To streamline the mathematical representation, vector notation can be used. The points \(\left \{x_k,y_k\right \}_{k=1}^M\) are organized into vectors, together with a few auxiliary ones, as follows,

\begin{equation} \bx = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end {bmatrix}, \quad \by = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end {bmatrix}, \quad \mathbf {1}_M = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end {bmatrix}\in \mathbb {R}^M, \, \bw = \begin{bmatrix} w_0 \\ w_1 \end {bmatrix} \end{equation}

The resulting model notation is

\begin{equation} \hat {\by } = f(\bX ;\bw ) = \bOne _M w_0 + \bx w_1 = \bX \bw , \end{equation}

where \(\bX = \begin {bmatrix} \bOne _M & \bx \end {bmatrix}\in \mathbb {R}^{M\times 2}\) and \(\bw = \begin {bmatrix} w_0 & w_1 \end {bmatrix}^T\).

The corresponding loss function is

\begin{equation} \begin{aligned} \loss (\bw ) &= (\by -\hat {\by })^T(\by -\hat {\by })=\norm {\by -\hat {\by }}^2\\ &=(\by -\bX \bw )^T(\by -\bX \bw ) = \norm {\by - \bX \bw }^2 \end {aligned} \end{equation}

and the corresponding minimum (Eq. (2.6)) results from solving the normal equations in matrix form,

\begin{equation} \label {eq-ls-gradient} \begin{aligned} \nabla _\bw \mathcal {L}(\cdot ) &=-\bX ^T\left (\by - \bX \bw \right ) = 0 \end {aligned} \end{equation}

which is rearranged as

\begin{equation*} \begin{aligned} \bX ^T\left (\by - \bX \bw \right ) &= \bX ^T \by - \bX ^T\bX \bw = 0\\ \bX ^T \by &= \left (\bX ^T\bX \right )\bw \end {aligned} \end{equation*}

Finally,

\begin{equation} \label {eq-LS-solution-4w} \bw _{opt} = \left (\bX ^T\bX \right )^{-1}\bX ^T\by \end{equation}

  • Example 2.1: For centered (\(\bar x=\bar y=0\)) variables with \(w_0=0\),

    \[ w_1 = \frac {\sum _{i=1}^M x_iy_i}{\sum _{i=1}^M x_i^2} \]

    which is similar to (2.17).

2.3.2 Multivariate LS

For the multivariate \(N\)-dimensional formulation,

\begin{align} \bX &=\begin{bmatrix} \bOne & \bx _1 & \cdots & \bx _N \end {bmatrix}\in \mathbb {R}^{M\times (N+1)} \label {eq-mv-ls-Xmatrix} \\ \bw &=\begin{bmatrix} w_0 & w_1 & \cdots & w_N \end {bmatrix}^T\in \mathbb {R}^{N+1} \end{align}

All of the preceding LS discussion of \(\hat {\by } = \bX \bw \) applies unchanged, independent of the number of variables.

Dataset

The collection of data rows in \((\bX ,\by )\) is called the dataset. The matrix \(\bX \) is assumed to be full rank, i.e. its columns are linearly independent.

Moore–Penrose inverse (pseudo-inverse)

The Moore–Penrose inverse extends the ordinary matrix inverse to non-square (rectangular) matrices,

\begin{equation} \bX ^+ = \left (\bX ^T\bX \right )^{-1}\bX ^T\in \Re ^{(N+1)\times M}, \end{equation}

such that

\[ \bX ^+\bX = \left (\bX ^T\bX \right )^{-1}\bX ^T\bX = \mathbf {I}_{N+1}\]

Note that the by-definition implementation of \(\bX ^+\) may have numerical stability problems due to the explicit inversion of \(\bX ^T\bX \). Modern numerical libraries provide numerically stable and efficient implementations of the pseudo-inverse.

The common numerical notation is

\begin{equation} \bw _{opt} = \bX ^+\by \end{equation}

Implementation note: there are numerically optimized algorithms for computing \(\bw _{opt}\), such as QR-decomposition and SVD-based LS solvers; a short comparison sketch follows.
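As a sketch (assuming NumPy; the synthetic design matrix and names are illustrative), the by-definition normal-equation solution can be compared with library routines that avoid forming \(\left (\bX ^T\bX \right )^{-1}\) explicitly:

import numpy as np

rng = np.random.default_rng(2)
M, N = 100, 3
X = np.column_stack([np.ones(M), rng.standard_normal((M, N))])  # [1, x1, ..., xN]
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.05 * rng.standard_normal(M)

# By-definition solution (may be ill-conditioned for nearly collinear columns)
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferred alternatives
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # SVD-based LS solver
w_pinv = np.linalg.pinv(X) @ y                  # Moore-Penrose pseudo-inverse

print(np.allclose(w_normal, w_lstsq), np.allclose(w_lstsq, w_pinv))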

Projection matrix

The model output is given by

\begin{equation} \begin{aligned} \hat {\by } &= \bX \bw = \bX \left (\bX ^T\bX \right )^{-1}\bX ^T \by \\ &=\bX \bX ^+\by = \mathbf {P}\by \end {aligned} \end{equation}

where

\begin{equation} \bP = \bX \left (\bX ^T\bX \right )^{-1}\bX ^T\in \Re ^{M\times M} \end{equation}

is a projection matrix, i.e. projection of \(\by \) into a base derived from \(\bX \).

Important properties of the matrix \(\bP \):

  • Symmetric \(\bP =\bP ^T\),

  • Idempotent \(\bP =\bP ^2\),

    \begin{equation} \begin{aligned} \bP ^2&=\bX \left (\bX ^T\bX \right )^{-1}\bX ^T\bX \left (\bX ^T\bX \right )^{-1}\bX ^T\\ &=\bX \left (\bX ^T\bX \right )^{-1}\bX ^T=\bP \end {aligned} \end{equation}

  • Orthogonality, \(\mathbf {P}\perp \left (\mathbf {I}-\mathbf {P}\right )\)
    Proof. \(\bP (\bI -\bP ) = \bP - \bP ^2 = \bZero \).

  • \(\bI -\bP \) is also a projection matrix.
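These properties are easy to verify numerically; the following sketch (assuming NumPy, with an arbitrary full-rank matrix X) checks symmetry, idempotence and the orthogonality \(\bP \perp (\bI -\bP )\):

import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(20), rng.standard_normal((20, 2))])
P = X @ np.linalg.inv(X.T @ X) @ X.T           # projection onto the column space of X

print(np.allclose(P, P.T))                     # symmetric
print(np.allclose(P, P @ P))                   # idempotent
print(np.allclose(P @ (np.eye(20) - P), 0.0))  # P is orthogonal to (I - P)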

Model error

The model error is

\begin{equation} \mathbf {e} = \by - \hat {\by } = \by - \mathbf {P} \by = \left (\mathbf {I} -\mathbf {P}\right )\by , \end{equation}

such that \(\loss (\bw ) = \be ^T\be \).

Note that the unbiased error-variance estimate uses an \(M-(N+1)\) denominator (samples minus estimated parameters),

\begin{equation} s_\be ^2 = \frac {1}{M-(N+1)}\be ^T\be \end{equation}

The difference between the MSE and \(s_\be ^2\) is that the MSE is minimized during coefficient learning, while \(s_\be ^2\) is the post-fit unbiased estimate of the error variance.

Error and data orthogonality

\begin{equation} \label {eq-data-error-ort} \mathbf {e}\perp \bX \Rightarrow \bX ^T\mathbf {e} = \bZero _{N+1} \end{equation}

Proof:

\begin{equation} \begin{aligned} \bX ^T\mathbf {e} &= \bX ^T\by - \bX ^T\bX \left (\bX ^T\bX \right )^{-1}\bX ^T\by \\ &= \bX ^T\by - \bX ^T\by = \bZero \end {aligned} \end{equation}

Error and prediction orthogonality

\begin{equation} \mathbf {e}\perp \hat {\by } \Rightarrow \hat {\by }^T\mathbf {e}=\be ^T\hat {\by } = 0 \end{equation}

Proof:

\begin{equation} \begin{aligned} \hat {\by }^T\mathbf {e} &= \by ^T \mathbf {P}\left (\mathbf {I} -\mathbf {P}\right )\by \\ &= \by ^T \mathbf {P}\by - \by ^T \mathbf {P}\mathbf {P}\by \\ &= \by ^T \mathbf {P}\by - \by ^T \mathbf {P}\by = 0 \end {aligned} \end{equation}

The interesting outcome of this property is a relation between error and prediction,

\begin{equation} \norm {\by }^2 = \norm {\hat {\by }}^2 + \norm {\be }^2 \end{equation}

  • Proof.

    \begin{equation} \begin{aligned} \norm {\by }^2 &= \norm {\hat {\by }+\be }^2\\ &= (\hat {\by }+\be )^T(\hat {\by }+\be )\\ &= \hat {\by }^T\hat {\by } + 2\hat {\by }^T\be + \be ^T\be \\ &= \hat {\by }^T\hat {\by } + \be ^T\be \end {aligned} \end{equation}

Average error

The residual error has zero sample mean,

\begin{equation} \begin{aligned} \bar {\be } &= \frac {1}{M}\sum _{k=1}^M e_k = \frac {1}{M}\bOne ^T\be = 0 \end {aligned} \end{equation}

  • Proof.

    \begin{equation} \begin{aligned} \bX ^T\be = 0\leftarrow \text {\eqref {eq-data-error-ort}}\\ \Rightarrow \begin{bmatrix} \bOne ^T \\ \bx _1^T \\ \vdots \\ \bx _N^T \end {bmatrix} \be = \begin{bmatrix} \bOne ^T \be \\ \vdots \end {bmatrix} = 0 \end {aligned} \end{equation}

The interesting consequence is

\begin{align} \bar {\by } &= \bar {\hat {\by }}\\ &= w_0 + w_1\bar {\bx }_1 + \cdots + w_N\bar {\bx }_N \end{align}

  • Proof.

    \begin{equation} \begin{aligned} \bar {\by } &= \overline {\hat {\by } +\be }\\ &=\bar {\hat {\by }} + \bar {\be } \end {aligned} \end{equation}

Error distribution

The values of the error vector \(\be \) are assumed to be normally distributed, due to the Central Limit Theorem (CLT). Typically, this assumption is not needed in ML, but it is important for statistical analysis with small values of \(M\).

MSE

The reduced expression for the resulting minimal loss is

\begin{equation} \label {eq-mse-min} \begin{aligned} \loss _{min} &= \sum _{k=1}^My_k^2 - \sum _{j=0}^Nw_j\by ^T\bx _j\\ &=\by ^T\by - \by ^T\bX \bw \end {aligned} \end{equation}

  • Proof.

    \begin{equation*} \begin{aligned} \loss _{min} &= (\by -\bX \bw )^T(\by -\bX \bw )\\ &= \by ^T\by - 2\bw ^T\bX ^T\by + \bw ^T\bX ^T\bX \bw \\ &= \by ^T\by - 2\bw ^T\bX ^T\by + \bw ^T\bX ^T\by \\ &= \by ^T\by - \by ^T\bX \bw , \end {aligned} \end{equation*}

    where the third line uses the normal equations \(\bX ^T\bX \bw = \bX ^T\by \), and the last line uses the scalar identity \(\bw ^T\bX ^T\by = \by ^T\bX \bw \).

    Chapter 3 Basic Signal Analysis

    • Goal: This chapter introduces the fundamental concepts and methods for analyzing and estimating (learning) parameters of a discrete-time sinusoidal signal observed in additive noise.

    3.1 Signal Preliminaries

    A general continuous-time cosine signal can be written as

    \begin{equation} \begin{aligned} y(t) &= A\cos (2\pi F_0t + \theta )+ \epsilon (t),\\ &= A\cos (\Omega _0 t + \theta )+ \epsilon (t), \end {aligned} \end{equation}

    where

    • \(A>0\) is the amplitude,

    • \(-\pi <\theta \leq \pi \) or \(0\le \theta < 2\pi \) is the phase,

    • \(F_0\) is the frequency in Hz,

    • \(\Omega _0 = 2\pi F_0\) is the angular frequency in rad/s,

    • \(\epsilon (t)\) is additive noise.

    For the following analysis, we use the sampled version \(y[n]\) of the continuous-time signal \(y(t)\), sampled with frequency \(F_s=1/T\),

    \begin{equation} \label {eq:base_signal} \begin{aligned} y[n] &= y(nT)\\ &= A\cos \left (\omega _0 n+\theta \right ) + \epsilon [n]\quad n=0,\ldots ,L-1, \end {aligned} \end{equation}

    where

    \begin{equation} \label {eq-signal-omega0} \begin{aligned} \omega _0 &= 2\pi F_0 T=2\pi \frac {F_0}{F_s}\\ \end {aligned} \end{equation}

    is the angular frequency (measured in radians per sample) derived from the analog frequency \(F_0\) and \(L\) is the resulting number of samples.

    The only assumption for the additive noise is that it is zero-mean,

    \begin{equation} \bar \epsilon = \frac {1}{L}\sum _{n=0}^{L-1}\epsilon [n] = 0. \end{equation}

    No additional assumptions, such as Gaussianity, are made; however, the special case of additive white Gaussian noise (AWGN) is addressed in tips on selected topics.

    In order to accurately reproduce a cosine signal, the Nyquist criterion demands \(F_0<F_s/2\), which implies \(0\le \omega _0<\pi \). This requirement can be easily illustrated by the following example. Consider two signals:

    \[ x_1 (t) = \cos (0.6\pi t ), \ \ \ x_2(t) = \cos (2.6\pi t) \]

    Sampling with \(F_s=1\) Hz results in two identical signals,

    \begin{align*} x_1[n] &= \cos (0.6\pi n ), \\ x_2[n] &= \cos (2.6\pi n) = \cos (0.6\pi n + 2 \pi n) = x_1[n]. \end{align*} This phenomenon is called aliasing. Note that when \(\omega _0=0\) the signal is the DC level, \(y(t)=y[n]=A\cos (\theta )\), and for \(\omega _0=\pi \) the signal is the “alternating sign” signal \(y[n] = A(-1)^n\). Therefore, the sampling frequency requirement is \(0\leqslant \omega _0<\pi \). This relation holds for all the following discussions and derivations.

    The energy of the signal \(x[n]\) is defined as

    \begin{equation} E_\bx = \norm {\bx }^2 = \sum _{n=0}^{L-1} x^2[n], \end{equation}

    where \(\bx \) is the vector of samples of the signal \(x[n]\). The corresponding power is time-averaged energy,

    \begin{equation} P_\bx = \frac {1}{L}E_\bx = \frac {1}{L}\norm {\bx }^2. \end{equation}

    3.2 Amplitude estimation

    • Goal: Find the amplitude of a sinusoidal signal in noise that best fits the model in a least squares (LS) sense.

    Given a signal model with a known frequency \(\omega _0\),

    \begin{equation} y[n] = A\cos \left (\omega _0n\right ) + \epsilon [n]\quad n=0,\ldots ,L-1 \end{equation}

    the goal is to estimate the amplitude \(A\) that best fits a provided model,

    \begin{equation} \hat {y}[n] = A\cos \left (\omega _0n\right ) = Ax[n] \end{equation}

    where \(x[n]=\cos \left (\omega _0n\right )\). Technically, we are looking for the value of \(A\) that minimizes the squared error,

    \begin{equation} \begin{aligned} \loss =\sum _n (y[n]-\hat {y}[n])^2. \end {aligned} \end{equation}

    In the linear LS regression formulation, we define the corresponding parameters \(\by ,\bX \) and \(\bw \). First, the required weight is \(\bw =A\). The matrix \(\bX \) is formed by samples of the signal \(\left \lbrace x[n]\right \rbrace _{n=0}^{L-1}\),

    \begin{equation} \begin{aligned} \bX &= \begin{bmatrix} x[0] & x[1] & \cdots & x[L-1]\\ \end {bmatrix}^T\\ &= \begin{bmatrix} 1 & \cos (\omega _0) & \cos (2\omega _0) & \cdots & \cos ((L-1)\omega _0) \end {bmatrix}^T \end {aligned} \end{equation}

    resulting in

    \begin{equation} \hat \by = \bX \bw \end{equation}

    Finally, \(\by \) is the vector of samples of \(\left \lbrace y[n]\right \rbrace _{n=0}^{L-1}\). The resulting minimization is straightforward,

    \begin{equation} \label {eq:signal:amplitude} \begin{aligned} \hat {A} &= (\bX ^T\bX )^{-1}\bX ^T\by \\[2pt] &=\frac {\sum _n x[n]y[n]}{\sum _n x^2[n]}\\[2pt] &=\frac {\sum _n y[n]\cos (\omega _0n)}{\sum _n \cos [2](\omega _0n)} \end {aligned} \end{equation}

    This expression can be further approximated by

    \begin{equation} \hat {A} \approx \frac {2}{L}\sum _n y[n]\cos (\omega _0n) \end{equation}

    by noting that

    \begin{equation} \sum _n \cos [2](\omega _0n)\approx \frac {L}{2}\qquad \omega _0\ne 0,\pi \end{equation}

    If we substitute the model into the resulting solution,

    \begin{equation} \begin{aligned} \hat {A} &=\frac {\sum _n x[n]\left (Ax[n]+\epsilon [n]\right )}{\sum _n x^2[n]}\\[4pt] &=A + \frac {\sum _n x[n]\epsilon [n]}{\sum _n x^2[n]}, \end {aligned} \end{equation}

    i.e., the estimate equals the true value of \(A\) plus a noise-driven term.
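    A minimal sketch of the amplitude estimator (assuming NumPy, with parameter values similar to the example of Fig. 3.1; the variable names are illustrative):

    import numpy as np

    L, omega0 = 51, 0.1 * np.pi
    A_true, sigma = 1.5, 5.0

    rng = np.random.default_rng(4)
    n = np.arange(L)
    x = np.cos(omega0 * n)                   # known regressor
    y = A_true * x + sigma * rng.standard_normal(L)

    A_hat = (x @ y) / (x @ x)                # LS amplitude estimate
    e = y - A_hat * x
    snr_hat = np.sum((A_hat * x) ** 2) / np.sum(e ** 2)   # estimated SNR
    sigma_hat = np.sqrt(np.mean(e ** 2))                  # biased noise std estimate
    print(A_hat, snr_hat, sigma_hat)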

    (image)

    (a) Reconstructed signal, \(\hat {\by }\).

    (image)

    (b) Residual error. Ideally, if the model was perfect, the residual would be equal to the added noise.
    Figure 3.1: Example of the cosine signal amplitude estimation. Note the negative sign of SNR in dB units.

    The resulting residual error is given by

    \begin{equation} \mathbf {e} = \by - \hat {\by }. \end{equation}

    Since \(\mathbf {e}\perp \hat {\by }\) the power/energy terms can be decomposed as follows,

    \begin{equation} \norm {\by }^2 = \norm {\hat {\by }}^2 + \norm {\mathbf {e}}^2,\quad E_\by = E_{\hat {\by }} + E_\be . \end{equation}

    An interesting interpretation of this result is the estimated signal-to-noise ratio (SNR), defined as

    \begin{equation} \label {eq:signal:Asnr} \widehat {SNR} = \frac {\norm {\hat {\by }}^2}{\norm {\mathbf {e}}^2} \end{equation}

    Moreover, due to zero-mean property of the noise, the estimated (biased) variance of the noise is

    \begin{equation} \label {eq:signal:Anoise} \hat {\sigma }_\epsilon ^2 = \frac {1}{L}\norm {\mathbf {e}}^2. \end{equation}

    The following example (Fig. 3.1) uses a synthetic cosine signal of length \(L = 51\) samples, angular frequency \(\omega _0 = 0.1\pi \) and amplitude \(A = 1.5\). Gaussian noise with standard deviation \(\sigma = 5\) is then added to create a noisy observation. A least-squares regression is applied to estimate the amplitude, yielding \(\hat {\sigma }_\epsilon =4.43\).

    3.3 Amplitude and phase estimation

    • Goal: Find amplitude and phase of a sinusoidal signal in noise.

    The following analysis is provided for the more general model,

    \begin{equation} y[n] = A\cos \left (\omega _0n + \theta \right ) + \epsilon [n]\quad n=0,\ldots ,L-1, \end{equation}

    with two unknown parameters, the amplitude \(A\) and the phase \(\theta \).

    The linear LS reformulation of the signal model

    \begin{equation} \hat {y}[n] = A\cos \left (\omega _0n + \theta \right ) \end{equation}

    involves the use of trigonometric identities to express the cosine with a phase shift as a linear combination of sine and cosine signals,

    \begin{align} A\cos \left (\omega _0n + \theta \right ) &=w_c\cos (\omega _0n) + w_s\sin (\omega _0n), \end{align} where

    \begin{equation} \begin{aligned} w_c &= A\cos (\theta )\\ w_s &= -A\sin (\theta ) \end {aligned} \end{equation}

    and

    \begin{equation} \label {eq-cos-linear-coeff} \begin{aligned} A &= \sqrt {w_c^2 + w_s^2}\\ \theta &= -\arctan \left (\frac {w_s}{w_c}\right ) \end {aligned} \end{equation}

    This transforms the problem into a two-parameter linear LS problem in terms of \(w_c\) and \(w_s\) [1]. The resulting LS formulation involves a two-valued vector of linear coefficients, \(\bw = \begin {bmatrix} w_c & w_s\end {bmatrix}^T\), the vector \(\by \) of samples of \(y[n]\), and the matrix \(\bX \) of dimensions \(L\times 2\) that is given by

    \begin{equation} \label {eq-X-single-freq} \bX = \begin{bmatrix} 1 & 0 \\ \cos (\omega _0) & \sin (\omega _0) \\ \cos (2\omega _0) & \sin (2\omega _0) \\ \vdots & \vdots \\ \cos ((L-1)\omega _0) & \sin ((L-1)\omega _0) \end {bmatrix}. \end{equation}

    Once \(\hat {\bw }\) has been found, the amplitude and phase can be recovered from (3.24).
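    A minimal sketch of the two-parameter estimation (assuming NumPy; the phase value is an illustrative choice, not specified in the text for this example):

    import numpy as np

    L, omega0 = 51, 0.1 * np.pi
    A_true, theta_true, sigma = 1.5, -np.pi / 4, 1.5

    rng = np.random.default_rng(5)
    n = np.arange(L)
    y = A_true * np.cos(omega0 * n + theta_true) + sigma * rng.standard_normal(L)

    X = np.column_stack([np.cos(omega0 * n), np.sin(omega0 * n)])  # L x 2 design matrix
    w_c, w_s = np.linalg.lstsq(X, y, rcond=None)[0]

    A_hat = np.hypot(w_c, w_s)            # A = sqrt(w_c^2 + w_s^2)
    theta_hat = np.arctan2(-w_s, w_c)     # from w_c = A cos(theta), w_s = -A sin(theta)
    print(A_hat, theta_hat)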

    SNR and noise variance interpretations are similar to those of the previous model, Eqs. (3.18) and (3.19).

    The numerical example is presented in Fig. 3.2. The configuration is similar to the previous figure, except for the lower noise standard deviation, \(\sigma =1.5\). Nevertheless, there is a decrease in performance, since two parameters are estimated simultaneously.

    (image)

    Figure 3.2: Example of the cosine signal amplitude and phase estimation. Note the lower estimation accuracy compared to Fig. 3.1, since two parameters are estimated simultaneously.

    Implementation Tip

    • This estimation procedure is optimal in the maximum likelihood (ML) sense under additive white Gaussian noise (AWGN) and achieves the Cramér-Rao lower bound (CRLB) [1, 6].

    • The theoretical lower bound (the Cramér-Rao lower bound, CRLB) on the estimation variance of \(w_c,w_s\) is given by [1, Eqs. (5.47-48)]

      \begin{equation} \label {eq-signals-Ap_cr} \Var [\hat {w}_{c,s}]\gtrsim \frac {2\sigma ^2}{L} \end{equation}

      This bound is the tightest for the AWGN case and is less accurate for other noise distributions.

    • The approximated estimation variance can be easily evaluated by Monte-Carlo simulations for any set of parameters and any distribution of interest.

    3.4 Frequency estimation

    • Goal: If the frequency \(\omega _0\) is also unknown, it can be estimated by searching for the \(\hat {\omega } _0\) that best fits a sinusoidal model for the observed data, i.e., that minimizes the residual error norm or maximizes the reconstructed signal energy.

    Since \(\omega _0\) is unknown, the matrix \(\bX \) may be parameterized as a frequency-dependent one, \(\bX (\omega )\). Here, the estimated signal is frequency-dependent

    \begin{equation} \label {eq:signal:P_frequency_dependent} \hat {\by }(\omega ) =\bX (\omega )\bw (\omega ), \end{equation}

    where \(\bw (\omega )\) are the estimated parameters \(w_c\) and \(w_s\) at that frequency. The corresponding frequency-dependent residual error is given by

    \begin{equation} \mathbf {e}(\omega ) = \by - \hat {\by }(\omega ). \end{equation}

    Since the error \(\mathbf {e}(\omega )\) is orthogonal to \(\hat {\by }\),

    \begin{equation} \norm {\by }^2 = \norm {\hat {\by }(\omega )}^2 + \norm {\be (\omega )}^2. \end{equation}

    To find the frequency that best represents the data, we seek the one that maximizes the energy of the reconstructed signal (or equivalently minimizes the residual error), as mentioned above

    \begin{equation} \label {eq:signal:omega0} \hat {\omega _0} = \arg \min \limits _{\omega } \norm {\mathbf {e}(\omega )}^2 = \arg \max \limits _{\omega } \norm {\hat {\mathbf {y}}(\omega )}^2. \end{equation}

    Note, this optimization problem can be challenging because the objective function may exhibit multiple local maxima/minima. Therefore, an appropriate numerical global optimization method is required.

    Once \(\hat {\omega }_0\) has been found, the amplitude and phase are estimated using the corresponding linear LS solution \(\bw (\omega _0)\). This solution also results in SNR and noise variance estimations, as in Eqs. (3.18) and (3.19).

    Tip: Interpretation in Terms of the Periodogram. The function

    \begin{equation} \label {eq:signal:periodogram} P(\omega ) = \frac {1}{L}\norm {\hat {\by }(\omega )}^2 \end{equation}

    as a function of \(\omega \) is termed the periodogram: a frequency-dependent measure of signal power that approximates the power spectral density (PSD) of the signal. By scanning over frequencies, the \(\omega \) that yields the maximum periodogram value is taken as the frequency estimate, \(\hat {\omega }_0\).

    The complexity of frequency estimation (without amplitude/phase) can be significantly reduced by noting that

    \begin{equation} \arg \max \limits _{\omega } \norm {\hat {\by }(\omega )}^2 \propto \arg \max \limits _{\omega } \abs {\sum _{n=0}^{L-1}{y[n]\exp (-j\omega n)}}^2 \end{equation}
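    A minimal sketch of the grid search via a zero-padded FFT (assuming NumPy and the parameter values of the numerical example in Fig. 3.3; the grid size N_fft is an illustrative choice):

    import numpy as np

    L = 51
    A_true, omega0_true, theta_true, sigma = 1.5, 0.1 * np.pi, -np.pi / 4, 1.0

    rng = np.random.default_rng(6)
    n = np.arange(L)
    y = A_true * np.cos(omega0_true * n + theta_true) + sigma * rng.standard_normal(L)

    # Periodogram-like objective |sum_n y[n] exp(-j w n)|^2 on a dense grid of w in [0, pi]
    N_fft = 4096
    Y = np.fft.rfft(y, n=N_fft)
    omega_grid = 2 * np.pi * np.arange(Y.size) / N_fft
    omega0_hat = omega_grid[np.argmax(np.abs(Y) ** 2)]

    # Amplitude/phase at the estimated frequency (two-column LS, as in the previous section)
    X = np.column_stack([np.cos(omega0_hat * n), np.sin(omega0_hat * n)])
    w_c, w_s = np.linalg.lstsq(X, y, rcond=None)[0]
    print(omega0_hat / np.pi, np.hypot(w_c, w_s), np.arctan2(-w_s, w_c))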

    A numerical example of the signal with additive white Gaussian noise (AWGN), and with the parameters \(A=1.5,\omega _0=0.1\pi ,\theta =-\pi /4\) and \(\sigma _\epsilon ^2=1\), is presented in Fig. 3.3. First, the periodogram peak is found (Fig. 3.3a). Then, the subsequent amplitude/phase estimation result is presented (Fig. 3.3b).

    (image)

    (a) The periodogram \(P(\omega )\) with a prominent peak at \(\omega _0 \approx 0.1\pi \).

    (image)

    (b) Reconstructed signal.
    Figure 3.3: The reconstruction in (b) uses the estimated amplitude, phase, and angular frequency \((\hat {A}, \hat {\theta }, \hat {\omega }_0)\) found by maximizing the periodogram in (a). Note the lower estimation accuracy than in Fig. 3.2, since three parameters are estimated.

    Tip: Theoretical performance bounds. Under the AWGN assumption, the theoretical SNR is given by

    \begin{equation} SNR = \frac {A^2}{2\sigma ^2} \end{equation}

    and the corresponding CRLB on the estimation variances are [6]

    \begin{align} \Var [\hat {A}] &\geqslant \frac {2\sigma ^2}{L}\quad [V^2] \\ \Var [\hat {\omega }_0] &\geqslant \frac {12}{SNR\times L(L^2-1)}\approx \frac {12}{SNR\times L^3}\quad \left [\left (\frac {rad}{sample}\right )^2\right ] \\ \Var [\hat {\theta }] &\geqslant \frac {2(2L-1)}{SNR\times L(L+1)}\approx \frac {4}{SNR\times L}\quad [rad^2] \end{align} For analog frequency \(F_0 = \dfrac {\omega _0}{2\pi }F_s\),

    \begin{equation} \Var [F_0] = \Var [\omega _0]\left (\frac {F_s}{2\pi }\right )^2 \quad [Hz^2] \end{equation}

    In practice, for short data lengths or non-Gaussian noise, these bounds provide only approximate guides to achievable performance.

    3.5 Harmonic Signal Analysis

    A particularly important class of signals encountered in many practical applications is the harmonic or periodic signal. Such a signal can be expressed as a sum of cosine terms whose frequencies are integer multiples (harmonics) of a fundamental frequency \(\omega _0\).

    \begin{equation} y[n] = A_0 + \sum _{m=1}^M A_m\cos (m\omega _0 n + \theta _m), \end{equation}

    where:

    • \(A_0\) is the constant (DC) component,

    • \(A_m\) and \(\theta _m\) represent the amplitude and phase of the \(m\)-th harmonic,

    • \(\omega _0\) is the fundamental angular frequency,

    • \(m\omega _0\) corresponds to the frequency of the \(m\)-th harmonic,

    • and \(M\) is the number of harmonics in the model.

    Given \(\omega _0\), the model is linear in the cosine and sine coefficients that encode the unknown parameters \(\{A_m, \theta _m\}\) of each harmonic \(m=1,\ldots ,M\). Similar to the single-frequency case, the LS matrix \(\bX \) is constructed with columns corresponding to \(\cos (m\omega _0 n)\) and \(\sin (m\omega _0 n)\) for \(m=1,\ldots ,M\), plus a column of ones for the DC component. Each pair \((A_m,\theta _m)\) can then be recovered from the LS-estimated cosine and sine coefficients in the manner described for single-frequency amplitude-phase estimation; a construction sketch is given below. The resulting SNR and noise variance estimates are similar to those described in the previous sections.
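    A sketch of the harmonic design matrix and parameter recovery (assuming NumPy; the synthetic signal, seed and parameter values are illustrative):

    import numpy as np

    def harmonic_design_matrix(L, omega0, M):
        # Columns: [1, cos(m*w0*n), sin(m*w0*n)] for m = 1..M
        n = np.arange(L)
        cols = [np.ones(L)]
        for m in range(1, M + 1):
            cols.append(np.cos(m * omega0 * n))
            cols.append(np.sin(m * omega0 * n))
        return np.column_stack(cols)

    rng = np.random.default_rng(7)
    L, omega0, M = 200, 0.05 * np.pi, 3
    n = np.arange(L)
    y = (0.3 + 1.0 * np.cos(omega0 * n + 0.5)
         + 0.4 * np.cos(3 * omega0 * n - 1.0) + 0.1 * rng.standard_normal(L))

    X = harmonic_design_matrix(L, omega0, M)
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    A0 = w[0]                                 # DC component
    A = np.hypot(w[1::2], w[2::2])            # harmonic amplitudes A_1..A_M
    theta = np.arctan2(-w[2::2], w[1::2])     # harmonic phases
    thd = np.sqrt(np.sum(A[1:] ** 2)) / A[0]  # total harmonic distortion, defined later in this section
    print(A0, A, theta, thd)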

    The model order \(M\) (number of harmonics) is a hyper-parameter that should be chosen carefully. Too few harmonics can fail to capture essential signal structure, while too many may overfit noise. The maximum value of \(M\) is bounded by the Nyquist criterion, \(M\omega _0<\pi \).

    If \(\omega _0\) is not known, the approach that is described in the frequency estimation section can also be applied here. Once \(\hat {\omega }_0\) has been determined from a maximum of the harmonic periodogram,

    \begin{equation} P_h(\omega ) = \frac {1}{L}\sum _{m=1}^M\norm {\hat {\by }(m\omega )}^2, \end{equation}

    the harmonic amplitudes and phases can be estimated via LS at this frequency [3].

    Total harmonic distortion (THD) is a measure commonly used in electrical engineering, audio processing, and other fields to quantify how much the harmonic components of a signal differ from a pure sinusoid at the fundamental frequency. It is defined as the ratio of the root-sum-square of the harmonic amplitudes to the amplitude of the fundamental frequency,

    \begin{equation} THD = \frac {\sqrt {\sum _{m=2}^M A_m^2}}{A_1}. \end{equation}

    A lower THD value indicates that the signal is closer to a pure sinusoidal shape, whereas a higher THD signifies a stronger presence of higher-order harmonics.

    The example is the current of a switch-mode power supply in a 50 Hz network, sampled at 50 kHz [2]. Figure 3.4a shows a reconstruction of the signal with \(M=250\) harmonics. The estimated amplitudes \(\hat {A}_m\) are shown (Fig. 3.4b) as a function of the harmonic index \(m\), including the DC term at \(m = 0\). A larger magnitude indicates a more prominent harmonic component. The first non-DC amplitude, \(m=1\), corresponds to the fundamental frequency \(\omega _0\), while higher indices capture additional harmonics in the signal. The estimated fundamental frequency is 50.104 Hz, with a corresponding THD of about 1.6 (\(\approx 160\%\)). Figure 3.4c shows how the estimated SNR (top) and the noise standard deviation (bottom) vary as the number of harmonics \(M\) in the model increases.

    (image)

    (a) The upper plot shows the signal overlaid with the least-squares harmonic reconstruction using the estimated frequency, amplitude, and phase parameters. The lower plots zoom in on a smaller portion of the time axis for different values of \(M\), and demonstrate the challenging shape of the signal.

    (image)

    (b) Estimated harmonic amplitudes, \(A_m\).

    (image)

    (c) Estimated SNR and noise std, \(\hat {\sigma }_\epsilon \).
    Figure 3.4: Example of a harmonic signal analysis.

    Tip: The frequency estimator is an effective ML estimator with known analytical CRLB [3].

    3.6 Discrete Fourier Transform (DFT)

    The discrete Fourier transform (DFT) can be viewed as a systematic way of decomposing a finite-length signal into a sum of harmonically related sinusoids. In fact, it is a special case of the harmonic signal representation discussed earlier. Specifically, setting the fundamental angular frequency to \(\omega _0 = \frac {2\pi }{N}\) and using \(N\ge L\) harmonics, the harmonic model reduces exactly to the DFT: a decomposition of the signal into \(N\) harmonics that are evenly spaced in frequency.

    For simplicity, the \(L=N\) relation is used in the discussion below by zero padding the signal \(y[n]\) up to length \(N\).

    The DFT representation expresses an arbitrary finite-length signal \(y[n]\) as a sum of sinusoidal signals,

    \begin{equation} y[n] = \sum _{k=0}^{N-1}A_k\cos (k\frac {2\pi }{N}n+\theta _k),\quad n=0,\ldots ,L-1 \end{equation}

    When \(N \geq L\), the DFT allows for perfect reconstruction of the signal using its harmonic representation:

    \[ \by = \bX \hat {\bw }. \]

    It is also worth noting the symmetry in the DFT (with \(\Delta \omega = 2\pi /N\)),

    \begin{equation} \begin{aligned} \cos ((N-k)\Delta \omega ) &= \cos (k\Delta \omega )\\ \sin ((N-k)\Delta \omega ) &= -\sin (k\Delta \omega ), \end {aligned} \end{equation}

    resulting in redundant information for frequencies \(k\omega _0\) above and below \(\pi \),

    \begin{equation} \begin{aligned} A_k &= A_{N-k}\\ \theta _k &= -\theta _{N-k} \end {aligned}. \end{equation}

    As a result, only frequencies \(k\omega _0 \leq \pi \) need to be considered uniquely.

    3.6.1 Single frequency analysis

    Consider a signal \(y[n]\) and assume a discrete frequency \(\omega _k = \frac {2\pi }{N}k\) is given. To estimate the amplitude and phase at this predefined frequency, we can form a matrix \(\bX \),

    \[ \bX _1 = \begin {bmatrix} \bx _c & \bx _s \end {bmatrix}, \]

    where

    \[ \bx _c = \begin {bmatrix} \cos (2\pi \frac {k}{N}\cdot 0) \\ \cos (2\pi \frac {k}{N}\cdot 1) \\ \vdots \\ \cos (2\pi \frac {k}{N}(L-1)) \end {bmatrix} \ \text {and} \ \bx _s = \begin {bmatrix} \sin (2\pi \frac {k}{N}\cdot 0) \\ \sin (2\pi \frac {k}{N}\cdot 1) \\ \vdots \\ \sin (2\pi \frac {k}{N}(L-1)) \end {bmatrix}. \]

    By evaluating \(\bX _1^T \bX _1\), we find that the sine and cosine columns form an orthogonal basis for this single frequency (for \(k\ne 0, N/2\)), with

    \begin{align} \bx _c^T \bx _c &= \frac {N}{2}, \\ \bx _s^T \bx _s &= \frac {N}{2}, \\ \bx _c^T \bx _s &= 0. \end{align} Stacking these results for all \(k=0,\ldots ,N-1\) yields the complete DFT matrix, which forms an orthogonal basis for the \(L\)-sample signal space. Further discussion of the \(\left (\bX ^T\bX \right )^{-1}\) matrix properties may be found in Examples 4.2 and 8.5 in [6].

    Moreover, \(\left (\bX ^T\bX \right )^{-1}\) takes a particularly simple diagonal form, and the least squares solution \(\hat {\bw }\) for the parameters \(w_{c,k}\) and \(w_{s,k}\) (corresponding to amplitude and phase components at \(\omega _k\)) is

    \begin{align} w_{c,k} &= \frac {2}{N}\sum _{n=0}^{L-1} y[n]\cos \left (2\pi \frac {k}{N}n\right ), \\ w_{s,k} &= \frac {2}{N}\sum _{n=0}^{L-1} y[n]\sin \left (2\pi \frac {k}{N}n\right ). \end{align}
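    A minimal check (assuming NumPy) that these coefficients coincide with the real and imaginary parts of the FFT output, \(Y[k]=\sum _n y[n]e^{-j2\pi kn/N}\):

    import numpy as np

    rng = np.random.default_rng(8)
    N = 64
    n = np.arange(N)
    y = rng.standard_normal(N)

    k = 5
    w_c = (2.0 / N) * np.sum(y * np.cos(2 * np.pi * k * n / N))
    w_s = (2.0 / N) * np.sum(y * np.sin(2 * np.pi * k * n / N))

    Y = np.fft.fft(y)
    print(np.isclose(w_c, 2.0 / N * Y[k].real))    # w_c,k =  (2/N) Re(Y[k])
    print(np.isclose(w_s, -2.0 / N * Y[k].imag))   # w_s,k = -(2/N) Im(Y[k])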

    Tip: The fast Fourier transform (FFT) algorithm efficiently computes \(Y[k]=\DFTtr {y[n]}\), providing \(A_k = |Y[k]|/N\) and \(\theta _k = \angle (Y[k])\) with significantly lower memory requirements and complexity than the direct calculation in Eq. (3.45). When only a single frequency value is of interest, the Goertzel algorithm is a more efficient method for the task. Moreover, it can be used for computationally efficient search of the maximum in Eq. (3.30).

    3.6.2 Power Spectral Density

    The power of a signal of the form

    \begin{equation} x_k[n] = A_k\cos (k\frac {2\pi }{N}n + \theta _k) \end{equation}

    is

    \begin{equation} S_\by [k]=P_{\by _k} = \frac {1}{L}\norm {\by _k}^2 = \frac {A_k^2}{2}. \end{equation}

    This value is known as the power spectral density (PSD) at the frequency \(\omega = k\frac {2\pi }{N}\). The corresponding squared magnitude values \(A_k^2/2\) are known as the discrete-frequency periodogram (Eq. (3.31)), and this is the basic method for the PSD estimation of a signal. Plotting such a periodogram gives a frequency-domain representation of the signal’s power distribution, highlighting which frequencies carry the most power.

    The DFT is an energy-conserving transform; Parseval's theorem states the relation

    \begin{equation} \sum _{k=0}^{N-1} A_k^2 = \frac {1}{L}\norm {\by }^2. \end{equation}
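    A short numerical check of Parseval's relation (assuming NumPy and the amplitude convention \(A_k=|Y[k]|/N\) used above; the random signal is illustrative):

    import numpy as np

    rng = np.random.default_rng(9)
    N = 128
    y = rng.standard_normal(N)

    A = np.abs(np.fft.fft(y)) / N            # amplitudes A_k in this section's convention
    print(np.isclose(np.sum(A ** 2), np.sum(y ** 2) / N))   # sum_k A_k^2 = ||y||^2 / L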

    3.6.3 Spectral Spreading and Leakage

    In an idealized setting, a pure cosine signal has a perfectly defined frequency representation. For instance, consider the discrete-time signal,

    \begin{equation} \label {eq:signal:single_freq_cos} x[n] = A\cos (k_0\frac {2\pi }{L}n),\ k_0\in \left \lbrace 1,\cdots ,L-1 \right \rbrace \end{equation}

    where \(k_0\) is the frequency index. The Fourier transform of this signal yields a single spectral component at frequency \(\omega _0=k_0\frac {2\pi }{L}\), such that the spectral amplitude \(A_k\) at each value of \(k\) is given by

    \begin{equation} A_k = \begin{cases} \frac {A}{2} & k=k_0,\,L-k_0\\ 0 & \text {otherwise} \end {cases}. \end{equation}

    Under these conditions, the signal’s spectral representation is strictly localized at the specific frequency \(\omega _0\), with no energy distributed elsewhere in the spectrum.

    (image)

    Figure 3.5: Illustration of a single-frequency cosine signal’s spectrum under ideal assumptions (discrete, integer-multiple frequencies) versus practical conditions (denser frequency grid or non-integer frequencies). Note how the ideal single peak broadens and additional low-level components appear, highlighting the effects of spectral spreading and leakage.

    However, practical scenarios deviate from this ideal case. In particular, if a denser frequency grid is employed (i.e. \(N>L\)) or the frequency varies continuously (as in Eq. (3.27)), the resulting spectral distribution can differ substantially from the discrete, single-peak ideal (Fig. 3.5). This difference arises because, in general, the columns of \(\bX (\omega )\) are not orthogonal as in Eq. (3.44). As a result, two effects are introduced:

    • The main frequency peak broadens, resulting in “spectral spreading”.

    • Additional frequency components emerge beyond the broadened main peak, termed “spectral leakage.”

    3.7 Summary

    The summary of the presented approach is shown in Table 3.1. The approach involves designing the matrix \(\bX \) and using LS to estimate the unknown parameters. The key tasks addressed are as follows.

    Table 3.1: Comparison and summary of different signal estimation methods.

    Task | Parameters | Matrix \(\bX \) | SNR
    Amplitude only | \(A\), given \(\omega _0\) | A single column of \(\cos (\omega _0n)\) | \(\dfrac {\norm {\hat {\by }}^2}{\norm {\mathbf {e}}^2}\)
    Amplitude & phase | \(A,\theta \), given \(\omega _0\) | Two columns, \(\cos (\omega _0n)\) and \(\sin (\omega _0n)\) | \(\dfrac {\norm {\hat {\by }}^2}{\norm {\mathbf {e}}^2}\)
    Frequency estimation | \(\omega _0, A,\theta \) | Frequency-dependent \(\cos (\omega n)\) and \(\sin (\omega n)\) columns | Maximum of \(\dfrac {\norm {\hat {\by }(\omega )}^2}{\norm {\be (\omega )}^2}\)
    Fourier series (harmonic decomposition) | \(A_0,\left \lbrace A_m,\theta _m\right \rbrace _{m=1}^{M}\), possibly \(\omega _0\) | Harmonic cos/sin columns at multiples of \(\omega _0\): \(\cos (m\omega _0 n)\), \(\sin (m\omega _0 n)\) | \(\dfrac {\norm {\hat {\by }}^2}{\norm {\mathbf {e}}^2}\); can include frequency dependence if \(\omega _0\) unknown
    DFT | \(\left \lbrace A_k,\theta _k\right \rbrace _{k=0}^{N-1}\) | Pairs of columns \(\cos (\tfrac {2\pi k}{N} n)\), \(\sin (\tfrac {2\pi k}{N} n)\) for \(k=0,\dots ,N-1\) | Not used directly; perfect reconstruction for \(N\geq L\)

    Amplitude Estimation: With a known frequency \(\omega _0\), the amplitude \(A\) is found via LS. The resulting residuals provide noise variance and SNR estimates.

    Amplitude and Phase Estimation: For known \(\omega _0\), rewriting

    \[ A\cos (\omega _0 n + \theta ) = w_c\cos (\omega _0 n) + w_s\sin (\omega _0 n) \]

    transforms the problem into a two-parameter LS regression.

    Frequency Estimation: If \(\omega _0\) is unknown, it is found by searching for the frequency that maximizes the fitted signal energy.

    Harmonic Signal Analysis: Signals can be expressed as sums of multiple harmonics. Extending the LS approach to multiple harmonics allows estimation of each amplitude and phase. THD quantifies deviations from a pure tone.

    Discrete Fourier Transform (DFT): The DFT is a special case of harmonic modeling, decomposing a signal into equally spaced frequency components. Efficiently computed by the FFT, the DFT is central to signal spectral analysis.

    Although the estimators presented above have been extensively analyzed for the specific case of additive white Gaussian noise (AWGN) in the statistical signal processing literature [6, 5], conducting such an analysis requires a significantly more extensive mathematical framework. Furthermore, it is worth noting that any bias and variance in these estimators can be readily approximated via Monte Carlo simulations under various parameter settings and noise distributions.

    (image)

    Figure 3.6: Summary