Data-Driven Time-Series Prediction


Chapter 4 ARMA Model

  • Goal: Linear prediction coefficients for systems-inspired models.

This chapter extends least-squares (LS)-based modeling to systems and time-series contexts, focusing on linear prediction coefficients derived from signals. Instead of restricting to a predefined parametric form like a sinusoid, here we consider arbitrary zero-mean finite-time signals \(x[n]\) and \(y[n], n=0,\ldots , L-1\), in the presence of zero-mean noise, \(\epsilon [n]\).

4.1 Auto-Correlation Function

  • Goal: Evaluate correlation between a signal and its time-shifted replicas to derive meaningful insights.

4.1.1 Linear Prediction & AR(1)

In system modeling and time-series analysis, a common approach is to express the current sample of a signal in terms of its past values. A simple first-order autoregressive (AR(1)) model predicts the current sample \(x[n]\) from a single past sample \(x[n-1]\), using the system model

\begin{equation} \label {eq:ar1} x[n]=h_1x[n-1] + \epsilon [n]. \end{equation}

and

\begin{equation} \hat {x}[n] = h_1x[n-1] \end{equation}

Here, \(h_1\) is the linear prediction coefficient that indicates how strongly the previous sample \(x[n-1]\) influences the current sample \(x[n]\).

To determine \(h_1\), we minimize the squared-error (SE) loss function

\begin{equation} \loss (h_1) = \sum _n \left (x[n] - h_1x[n-1]\right )^2 \end{equation}

Setting the derivative of \(\loss (h_1)\) with respect to \(h_1\) to zero leads to

\begin{equation} \label {eq-a1} \frac {d\loss (h_1)}{dh_1} =2\sum _n (x[n]-h_1x[n-1])(-x[n-1])=0 \end{equation}

Finally, the LS solution for \(h_1\) is

\begin{equation} \label {eq-acf-a1} h_1 = \frac {\sum _n x[n]x[n-1]}{\sum _n x^2[n-1]}. \end{equation}

The value \(h_1\) is termed the (single) prediction coefficient of the AR(1) model.

Matrix formulation

The coefficient may be formulated as a relation between two vectors

\begin{equation} \label {eq:ar1:matrix} \by = \begin{bmatrix} x[1] \\ x[2] \\ \vdots \\ x[L-2]\\ x[L-1] \end {bmatrix}, \quad \bx = \begin{bmatrix} x[0]\\ x[1] \\ \vdots \\ x[L-3] \\ x[L-2] \end {bmatrix}, \end{equation}

such that

\begin{equation} \hat {\by } = h_1 \bx . \end{equation}

The corresponding squared-error loss is

\begin{equation} \loss (h_1) = \norm {\by - h_1 \bx }^2 \end{equation}

and its minimizer is given by the normal equations (Eq. (2.28)),

\begin{equation} \label {eq:normal_eq_vector} h_1 = {\left (\bx ^T\bx \right )^{-1}}\bx ^T\by . \end{equation}

Substitution yields a solution identical to Eq. (4.5).
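
As a quick illustration (not part of the original text), a minimal NumPy sketch that generates a synthetic AR(1) signal and estimates \(h_1\) via Eq. (4.5); the variable names and the coefficient value are ours.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic AR(1) signal: x[n] = 0.8 x[n-1] + eps[n]  (illustrative coefficient)
L, h1_true = 1000, 0.8
eps = rng.normal(size=L)
x = np.zeros(L)
for n in range(1, L):
    x[n] = h1_true * x[n - 1] + eps[n]

# LS estimate of h1, Eq. (4.5): lag-1 correlation over the energy of the lagged vector
target = x[1:]        # x[1], ..., x[L-1]
lagged = x[:-1]       # x[0], ..., x[L-2]
h1_hat = (lagged @ target) / (lagged @ lagged)
print(h1_hat)         # close to 0.8 for large L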

4.1.2 Auto-correlation function (ACF)

To generalize beyond a single lag, we predict the current sample \(x[n]\) from its time-shifted version \(x[n-k]\) with a single coefficient, using the \(k\)-step predictor,

\begin{equation} \hat {x}[n]=h_kx[n-k]. \end{equation}

The corresponding solution is very similar to the solution for \(k=1\),

\begin{equation} \label {eq-acf-ak} h_k = \frac {\sum _n x[n]x[n-k]}{\sum _n x^2[n-k]}. \end{equation}

The numerator of Eq. (4.11) is termed the (raw, or unscaled) auto-correlation function (ACF) at lag \(k\),

\begin{equation} R_{\bx \bx }[k] = \sum _n x[n]x[n-k],\quad k=0,\ldots ,L-1 \end{equation}

The ACF provides a measure of how linearly dependent the signal is on its shifted versions. Variants like biased, unbiased, and normalized ACF offer different normalization schemes to handle finite data length and scaling issues.

The signal energy is given by

\begin{equation} \label {eq:arma:energy} E_\bx = \sum _{n=0}^{L-1} x^2[n] = R_\bxx [0] \end{equation}

Biased auto-correlation Another useful definition is the averaged sum of \(x[n]x[n-k]\),

\begin{equation} \label {eq-acf-biased} \begin{aligned} R_{\bx \bx ,biased}[k] &= \frac {1}{L}\sum _n x[n]x[n-k],\quad k=0,\ldots ,L-1\\ &= \frac {1}{L}R_{\bx \bx }[k] \end {aligned} \end{equation}

which is termed the biased ACF. The corresponding average power is given by

\begin{equation} P_\bx = \frac {1}{L}\sum _n x^2[n] = R_{\bxx ,biased}[0] \end{equation}

Normalized auto-correlation Another version is the normalized ACF of the form

\begin{equation} R_{\bx \bx ,norm}[k] = \frac {R_{\bx \bx }[k]}{R_{\bx \bx }[0]}\lesssim h_k. \end{equation}

Note that the difference between the denominator above and the one in Eq. (4.11) is

\begin{equation} \sum _n x^2[n] - \sum _n x^2[n-k]= \sum _{n<k}x^2[n] =x^2[0] + \cdots + x^2[k-1] \end{equation}

Nevertheless, the \(h_k\) expression in Eq. (4.11) and the normalized ACF are approximately equal for sufficiently large \(L\). Moreover, the denominator \(R_{\bx \bx }[0]\) is \(k\)-independent.

Unbiased auto-correlation Note that the raw ACF at lag \(k\) sums only \(L-k\) (nonzero) terms. Another useful normalization is

\begin{equation} \begin{aligned} R_{\bx \bx ,unbiased}[k] &= \frac {1}{L-k}\sum _n x[n]x[n-k]\\ &= \frac {1}{L-k}R_{\bx \bx }[k] \end {aligned} \end{equation}

This time it is assumed that

\begin{equation} \begin{aligned} \frac {1}{L}\sum _n x^2[n] &\approx \frac {1}{L-k}\sum _n x^2[n-k]\\ \frac {x^2[0] +x^2[1] + \cdots + x^2[L-1]}{L} &\approx \frac {x^2[k] + \cdots + x^2[L-1]}{L-k} \end {aligned} \end{equation}

and the resulting expression

\begin{equation} \begin{aligned} \frac {L}{L-k}\frac {R_{\bx \bx }[k]}{R_{\bx \bx }[0]}=\frac {L}{L-k}R_{\bx \bx ,norm}[k]\\ =\frac {R_{\bx \bx ,unbiased}[k]}{R_{\bx \bx ,unbiased}[0]} \approx h_k \end {aligned} \end{equation}

is assumed to be a closer approximation to \(h_k\) than the normalized auto-correlation for relatively small values of \(k\).

4.1.3 ACF Properties

The signal energy (4.13) is the highest value of ACF,

\begin{equation} R_{\bxx }[0] > R_{\bxx }[k]. \end{equation}

Theoretically, \(R_{\bxx }[0] = R_{\bxx }[k]\) may occur under certain conditions but is unachievable for practical time-limited signals.

The ACF has inherent time symmetry (backward prediction),

\begin{equation} \label {eq:acf_symmetry} R_{\bx \bx }[k] =R_{\bx \bx }[-k] \end{equation}

Correlation Coefficient Interpretation

In matrix form (by Eq. (2.48)), the loss is given by

\begin{equation} \loss _{min}(h_k) =\be ^T\be = \by ^T\by - h_k\by ^T\bx \end{equation}

In terms of the ACF (following Eq. (2.48) and the discussion above for \(L\rightarrow \infty \)), this becomes

\begin{equation} \label {eq-signals-min-J} \begin{aligned} \loss _{min}(h_k) &= \sum _{n=0}^{L-1}x^2[n] - h_k\sum _{n=0}^{L-1}x[n]x[n-k]\\[3pt] &= R_{\bx \bx }[0] - h_kR_{\bx \bx }[k]\\[3pt] &\lesssim R_{\bx \bx }[0] \left (1-\left (\frac {R_{\bx \bx }[k]}{R_{\bx \bx }[0]}\right )^2\right )\\[3pt] &= R_{\bx \bx }[0]\left (1-R_{\bx \bx ,norm}^2[k]\right )\\[3pt] &\approx R_{\bx \bx }[0]\left (1-\rho _{\bxx }^2[k]\right ) \end {aligned} \end{equation}

The value \(\rho _{\bx \bx }[k]\) is termed the correlation coefficient between \(x[n]\) and \(x[n-k]\),

\begin{equation} \rho _{\bx \bx }[k] \approx \frac {L}{L-k}R_{\bx \bx ,norm}[k] \approx R_{\bx \bx ,norm}[k]. \end{equation}

It is a measure of linear dependence, bounded by

\begin{equation} \abs {\rho _{\bx \bx }[k]}\le 1 \end{equation}

  • For noiseless data and a linear relation between the samples, the prediction is perfect, \(\abs {\rho _{\bx \bx }[k]}=1\) and \(\loss _{min}=0\).

  • Without any linear dependence, \(\rho _{\bx \bx }[k]=h_k=0\) and the resulting prediction is \(\hat {x}[n] = 0\).

This can be summarized as (see also (4.13) for interpretation)

\begin{equation} \label {eq-ar-loss-bounds} 0\leq \loss _{min}(h_k) \leq R_\bxx [0] \end{equation}

The illustration of the correlation coefficient principles is presented in Fig. 4.1.

(image)

Figure 4.1: Illustration of the linear dependence between \(x[n]\) and \(x[n-k]\): (a) ACF; (b) \(x[n]\) vs. \(x[n-k]\); (c) prediction for \(k=1\); (d) prediction for \(k=5\).

MSE and RMSE The corresponding MSE and RMSE metrics are given by

\begin{align} MSE(h_k) &= \frac {1}{L}\loss _{min}(h_k)\\ RMSE(h_k) &= \sqrt {\frac {1}{L}\loss _{min}(h_k)} \end{align}

Correlation time The correlation time, \(k_c\), is the lag at which \(\rho _{\bx \bx }[k]\) falls below a threshold,

\begin{equation} \label {eq-correlation-time} \rho _{\bx \bx }[k_c] = 0.5 \text { or } 0.1 \text { or }\exp (-1) \end{equation}

The predictability is assumed negligible for \(k>k_c\),

\begin{equation} \label {eq-correlation-time-limit} \rho _{\bx \bx }[k>k_c]\approx 0 \end{equation}

The decision threshold depends on the field of application. For example, one of the most widely used ACF models for natural processes is the exponential decay

\begin{equation} R_{\bxx ,norm}[k] = \exp (-\frac {k}{k_c}) \end{equation}

with \(R_{\bxx ,norm}[k_c] = \exp (-1)\).

Note, correlation time is mostly used in physics and engineering models and is less applicable in social sciences.

4.1.4 Confidence Interval

Another way to quantify the “importance” of the coefficients is to use a confidence bound.

General idea The ACF estimate at a given lag is essentially a sample statistic derived from sums of products of random variables (the data values at different time points). Under typical assumptions of stationarity and weak dependence, these sample averages of random variables invoke the Central Limit Theorem (CLT). The CLT states that when you sum (or average) a sufficiently large number of independent or weakly dependent random variables with finite variance, the resulting distribution approaches a normal distribution, regardless of the variables’ original distribution.

In other words, the estimation of the ACF involves averaging products like \(x[n]x[n-k]\) across many time indices \(n\). Provided the underlying process is stationary and not too strongly dependent, these averaged terms behave like sums of numerous random contributions. As the sample size \(L\) grows large, the distribution of the ACF estimate at each lag approaches normality by virtue of the CLT. This is why the ACF at a given lag can be approximated as a normally distributed random variable in large-sample scenarios.

Derivation Since the estimated ACF at a given lag \(k\) can be approximated as a normally distributed random variable with zero mean and variance approximately \(\frac {1}{L}\) for large \(L\), the 95% confidence bound is given by

\begin{equation} R_{\bx \bx ,norm}[k]\pm \dfrac {1.96}{\sqrt {L}}. \end{equation}

The value 1.96 comes from the properties of the standard normal distribution. In a standard normal distribution (mean 0, variance 1), approximately 95% of the area under the curve lies within \(\pm 1.96\) standard deviations from the mean.

Interpretation If the estimated ACF value at a particular lag falls outside this range, it is statistically significant at the 95% confidence level (i.e., unlikely to be a result simply due to random chance). If it remains within the interval, it suggests that the observed correlation could be attributed to randomness in the data rather than a meaningful linear relationship.

4.1.5 Auto-covariance

For simplicity, a zero average, \(\bar {x}=0\), was assumed so far. When the signal has a non-zero mean, subtracting the signal average before the auto-correlation calculation, \(x[n] \rightarrow x[n]-\bar {x}\), yields what is termed the auto-covariance.

Note that both the auto-correlation and the auto-covariance share a similar abbreviation, ACF. Typically, ACF refers to the auto-correlation, since the majority of signals are zero-mean.

Tip:

  • ACF calculation may be significantly sped up with appropriate algorithms and by bounding the maximum lag, \(k\le k_{max}\). The value of \(k_{max}\) may be chosen using Eq. (4.30).
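
The ACF variants above can be computed directly from their definitions. The following minimal NumPy sketch (ours; the helper name acf_variants is hypothetical) returns the raw, biased, unbiased, and normalized estimates up to \(k_{max}\), and applies the 95% confidence bound of Sec. 4.1.4.

import numpy as np

def acf_variants(x, k_max):
    """Raw, biased, unbiased and normalized ACF estimates for lags 0..k_max."""
    x = np.asarray(x, dtype=float)
    L = len(x)
    lags = np.arange(k_max + 1)
    raw = np.array([np.dot(x[k:], x[:L - k]) for k in lags])   # R_xx[k] = sum_n x[n] x[n-k]
    return {
        "raw": raw,
        "biased": raw / L,
        "unbiased": raw / (L - lags),
        "normalized": raw / raw[0],
    }

# White noise should stay (mostly) inside the 95% confidence bound +-1.96/sqrt(L)
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
acf = acf_variants(x, k_max=20)
bound = 1.96 / np.sqrt(len(x))
outside = np.nonzero(np.abs(acf["normalized"][1:]) > bound)[0] + 1
print("lags outside the 95% bound:", outside)                  # typically none or very few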

4.2 Power Spectral Density (PSD)

Relation between ACF and PSD

From the signal processing point of view, the interpretation of the DFT of a general random signal is non-trivial. For example, phases \(\theta _k\) of random origin will be quite different for each time segment (or realization) of the signal. The Wiener–Khinchin theorem states that the PSD (Secs. 3.4 and 3.6.2) of a stationary random signal is the Fourier transform of the ACF,

\begin{equation} S_\bxx [k] = \DFTtr {R_\bxx [n]} = \abs {\DFTtr {x[n]}}^2,\quad k=0,\ldots ,N-1 \end{equation}

Important properties are

  • \(S_\bxx [k]\in \Re \): due to the symmetry (Eq. (4.22)) of \(R_\bxx [k]\), the phases vanish, \(\theta _k=0\ \forall k\).

  • \(S_\bxx [k]\ge 0\quad \forall k\).

This relation shows that a random signal also admits a spectral interpretation.

Interpretation The PSD shows where (in frequency) the signal has most of its energy. Peaks in the PSD correspond to frequencies at which the signal exhibits strong periodic or quasi-periodic components.

Slowly decaying (long-memory) correlations in the time domain often translate into a PSD that has more energy at low frequencies (indicating slow variations in time). Conversely, if the ACF shows periodicity, the PSD will have distinct peaks at the corresponding harmonic frequencies.

The ACF reveals how similar a signal is to itself at different time shifts. A slowly decaying ACF indicates strong long-term correlations, while a rapidly decaying ACF suggests only short-range predictability.
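
A small numerical check of the Wiener–Khinchin relation (ours, not from the text): for a finite signal, the two-sided raw ACF is arranged circularly at length \(2L-1\) so that its DFT matches the squared-magnitude DFT of the zero-padded signal; NumPy is assumed.

import numpy as np

rng = np.random.default_rng(0)
L = 256
x = rng.normal(size=L)

# Two-sided raw ACF: r[m] = sum_n x[n] x[n-m],  m = -(L-1), ..., L-1
r = np.correlate(x, x, mode="full")          # index L-1 holds r[0]

# Place r[0] at index 0 (circular ordering), then take the DFT of length 2L-1
S_from_acf = np.fft.fft(np.roll(r, -(L - 1)))

# Direct PSD: squared magnitude of the zero-padded DFT of x
S_direct = np.abs(np.fft.fft(x, 2 * L - 1)) ** 2

print(np.allclose(S_from_acf.real, S_direct))          # True
print(np.max(np.abs(S_from_acf.imag)) < 1e-8)          # True: the spectrum is real, as expected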

4.3 AR(p) Model

  • Goal: Extend \(AR(1)\) to \(AR(p)\) model.

The auto-regressive (AR) signal model describes a signal \(y[n]\) as a linear combination of its \(p\) previous samples plus noise. Formally, an \(AR(p)\) model is

\begin{equation} \begin{aligned} y[n] &=h_1y[n-1] + h_2y[n-2] + \cdots + h_{p}y[n-p] + \epsilon [n]\\ &=\sum _{m=1}^p h_my[n-m]+ \epsilon [n] \end {aligned} \end{equation}

and the prediction is of the form

\begin{equation} \hat {y}[n] =\sum _{m=1}^p h_my[n-m] \end{equation}

where \(h_1, \ldots ,h_p\) are the AR model coefficients and \(p\) is the model order, chosen as a hyper-parameter.

This model can be easily formulated in matrix form using \(L\) samples of \(y[n]\),

\begin{equation} \label {eq:ar_prediction} \hat {\by }= \begin{bmatrix} \hat {y}[1] \\ \hat {y}[2]\\ \vdots \\ \hat {y}[L-2] \\ \hat {y}[L-1] \end {bmatrix},\; \bX =\begin{bmatrix} y[0] & 0 & \cdots & 0\\ y[1] & y[0] & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ y[L-3] & y[L-4] & \cdots & y[L-p-2]\\ y[L-2] & y[L-3] & \cdots & y[L-p-1]\\ \end {bmatrix},\; \bh =\begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_{p} \end {bmatrix} \end{equation}

where the corresponding dimensions are \(\hat {\by }\in \Re ^{L-1},\bX \in \Re ^{L-1\times p},\bh \in \Re ^p\).

The AR coefficients \(\bh \) can be found by an LS regression that minimizes the squared-error loss

\begin{equation} \mathcal {L}(h_i) = \sum _n \left (y[n]-\hat {y}[n]\right )^2 = \norm {\by -\bX \bh }^2. \end{equation}

The LS solution is straightforward,

\begin{equation} \label {eq:normal_eq_matrix} \bh = {\left (\bX ^T\bX \right )^{-1}}\bX ^T\by . \end{equation}

Note that a practical approach for evaluating \(\bh \) is presented in the following section; a minimal code sketch follows the example below.

  • Example 4.1: Learn linear prediction of \(y[8]\) for \(p=3\) and signal \(y[0],y[1],\ldots ,y[7]\).

    • Solution:

      \begin{equation} \bX =\begin{bmatrix} y[0] & 0 & 0 \\ y[1] & y[0] & 0 \\ y[2] & y[1] & y[0] \\ y[3] & y[2] & y[1] \\ y[4] & y[3] & y[2] \\ y[5] & y[4] & y[3] \\ y[6] & y[5] & y[4] \end {bmatrix},\; \bh =\begin{bmatrix} h_1 \\ h_2 \\ h_3 \end {bmatrix}\; \by = \begin{bmatrix} y[1] \\ y[2] \\ y[3]\\ y[4]\\ y[5] \\y[6] \\ y[7] \end {bmatrix} \end{equation}

      Finding the vector \(\bh \) is (almost) trivial by minimizing a loss function of the form

      \begin{equation} \loss = \norm {\by -\bX \bh }^2. \end{equation}

      Once found, \(\hat {y}[8] = h_1y[7] + h_2y[6] + h_3y[5]\).
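
A minimal LS sketch in the spirit of Example 4.1 (ours; NumPy assumed, the helper name fit_ar_ls is hypothetical): it builds the zero-padded data matrix \(\bX \), solves for \(\bh \) with np.linalg.lstsq, and predicts the next sample from the last \(p\) values.

import numpy as np

def fit_ar_ls(y, p):
    """LS fit of AR(p) coefficients h using the zero-padded data matrix from the text."""
    y = np.asarray(y, dtype=float)
    L = len(y)
    X = np.zeros((L - 1, p))
    for m in range(1, p + 1):            # column m-1 holds y[n-m] for n = 1..L-1 (zero-padded)
        X[m - 1:, m - 1] = y[:L - m]
    h, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return h

# Synthetic AR(3) signal with known coefficients (illustrative values)
rng = np.random.default_rng(0)
L = 5000
eps = rng.normal(size=L)
y = np.zeros(L)
for n in range(3, L):
    y[n] = 0.5 * y[n - 1] + 0.2 * y[n - 2] + 0.1 * y[n - 3] + eps[n]

h = fit_ar_ls(y, p=3)
y_next = h[0] * y[-1] + h[1] * y[-2] + h[2] * y[-3]   # as in Example 4.1: prediction from the last p samples
print(np.round(h, 2), y_next)                          # h is approximately [0.5, 0.2, 0.1]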

4.3.1 Yule-Walker Form

The resulting \(\bX ^T\bX \) matrix is termed the (Toeplitz) auto-correlation matrix and can be interpreted in terms of auto-correlation values,

\begin{equation} \begin{aligned} \bR &= \bX ^T\bX \\ &= \begin{bmatrix} R_{\by \by }[0] & R_{\by \by }[1] & \cdots & R_{\by \by }[p-1]\\ R_{\by \by }[1] & R_{\by \by }[0] & \cdots & R_{\by \by }[p-2]\\ \vdots & \vdots & \ddots & \vdots \\ R_{\by \by }[p-1] & R_{\by \by }[p-2] & \cdots & R_{\by \by }[0] \end {bmatrix}, \end {aligned} \end{equation}

where the unscaled (or biased) \(R_{\by \by }[k]\) is defined above. For \(p\ll L\), it is common practice to ignore the small changes in vector length when calculating the (biased) ACF for a given time shift; e.g., the diagonal matrix values are

\begin{equation} R_\byy [0] = \sum _{n=0}^{L-1}y^2[n] \approx \sum _{n=0}^{L-\textbf {2}}y^2[n] \end{equation}

The vector \(\bX ^T\by \) is likewise composed of the corresponding \(R_{\byy }[k]\) values,

\begin{equation} \br = \bX ^T\by = \begin{bmatrix} R_\byy [1]\\ R_\byy [2]\\ \vdots \\ R_\byy [p]\\ \end {bmatrix} \end{equation}

Using these formulations, the resulting coefficients are given by a solution of a set of linear equations,

\begin{equation} \bR \bh = \br \end{equation}

and the result is similar to Eq. (4.38),

\begin{equation} \bh = \bR ^{-1}\br \end{equation}

The use of the biased ACF \(R_{\bx \bx ,biased}[k]\) (Eq. (4.14)) guarantees a numerically stable solution [7, Theorem 6.1].
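
A minimal Yule-Walker sketch (ours) using the biased ACF and scipy.linalg.solve_toeplitz (the routine mentioned in the footnote below), which exploits the Toeplitz structure of \(\bR \) via a Levinson-Durbin-type recursion.

import numpy as np
from scipy.linalg import solve_toeplitz

def yule_walker(y, p):
    """AR(p) coefficients from the biased ACF via the Yule-Walker equations R h = r."""
    y = np.asarray(y, dtype=float)
    L = len(y)
    R = np.array([np.dot(y[k:], y[:L - k]) for k in range(p + 1)]) / L   # biased ACF, lags 0..p
    # R[:p] is the first column (and row) of the Toeplitz matrix, R[1:p+1] is the right-hand side
    return solve_toeplitz(R[:p], R[1:p + 1])

# Synthetic AR(2): h1 = 0.5, h2 = 0.3 (stable, illustrative values)
rng = np.random.default_rng(0)
eps = rng.normal(size=5000)
y = np.zeros(5000)
for n in range(2, 5000):
    y[n] = 0.5 * y[n - 1] + 0.3 * y[n - 2] + eps[n]

print(yule_walker(y, p=2))                    # approximately [0.5, 0.3]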

Squared error The resulting loss (squared error) is given by (following Eqs. (2.48) and (4.24))

\begin{equation} \begin{aligned} \loss _{min} &= R_\bxx [0] - \sum _{k=1}^p h_kR_\bxx [k] \end {aligned} \end{equation}

Theoretically, a higher value of \(p\) results in a lower loss, provided the correlation time is sufficiently long (Eq. (4.29)). Note that the accuracy drops for high values of \(p\) due to the reduced accuracy of \(R_\bxx [p]\).

The notation of the signal model commonly adopted in most books and software packages is

\begin{equation} \sum _{m=0}^p a_m y[n-m] = \epsilon [n] \end{equation}

with \(a_0 = 1\) and \(a_m = - h_m\).

This definition follows the discrete-time convention and the corresponding Z-transform of an all-pole system. It can also be directly applied as the denominator coefficients of a filter command.

Biased signal The presence of a non-zero average (bias) in the signal modifies the AR model formulation. Instead of assuming a zero-mean process, a constant bias term \(\mu \) is introduced,

\begin{equation} \label {eq:ar_biased} \begin{aligned} y[n] &=\mu + h_1y[n-1] + h_2y[n-2] + \cdots + h_{p}y[n-p] + \epsilon [n] \end {aligned} \end{equation}

In matrix form, this requires adding a column of ones \(\bOne _M\) to the data matrix \(\bX \), similarly to how it is done in multivariate LS (Eq.(2.29)).

Tips:

  • For computational and memory efficiency, the Levinson-Durbin algorithm (\(O\left (p^2\right )\)) is often used to solve for the AR coefficients1, instead of a direct solution (\(O\left (p^3\right )\)).

  • The AR parameters \(\bh \) are also referred to as linear prediction coefficients (LPC), emphasizing their role in predicting the current value from past samples.

  • The auto-correlation matrix \(\bR \) is guaranteed to be positive-definite (and, therefore, invertible) for the ordinary or biased definitions (the normalization constants cancel in (4.38)).

  • Theoretically, the AR model coefficients can be regularized to prevent overfitting.

  • Although this section focuses on linear AR models, nonlinear variants also exist, allowing modeling of more complex dynamics.

  • Note that there are other methods for LPC calculation, such as Burg’s method, which is used in librosa.lpc.

1 For example, solve_toeplitz

4.3.2 Moving Average Filter

A simple special case of an AR-like model is the moving average filter, where all coefficients are equal and sum to one, \(h_i=\dfrac {1}{p}\). In this case,

\begin{equation} \label {eq-moving-average} \begin{aligned} \hat {y}[n] &= \frac {1}{p}\left (y[n-1] + \cdots + y[n-p]\right )\\ &=\frac {1}{p}\sum _{m=1}^p y[n-m] \end {aligned} \end{equation}

This is not strictly an AR model but shares the idea of using past samples. It smooths the signal by averaging the most recent \(p\) values.
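
A minimal sketch (ours; the helper name is hypothetical) of the causal moving-average predictor above, implemented with np.convolve; note that it averages the previous \(p\) samples and excludes the current one.

import numpy as np

def moving_average_predict(y, p):
    """hat{y}[n] = mean of the previous p samples; predictions exist for n = p, ..., L-1."""
    y = np.asarray(y, dtype=float)
    kernel = np.ones(p) / p
    # 'valid' convolution of y[0..L-2] yields averages of p consecutive past samples
    return np.convolve(y[:-1], kernel, mode="valid")

y = np.arange(10, dtype=float)
print(moving_average_predict(y, p=3))   # [1., 2., ..., 7.]: each value is the mean of the 3 previous samples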

4.3.3 Nearest Neighbor (Naïve)

A simple baseline is to use the immediate past sample as the prediction with \(h_1=1\),

\begin{equation} \label {eq-ts-1nn} \hat {y}[n] = y[n-1] \end{equation}

This "1-nearest neighbor" approach ignores signal dynamics and history. While often not very accurate, it serves as a useful baseline to compare against more sophisticated models.

4.4 Partial auto-correlation function

  • Goal: The partial autocorrelation function (PACF) measures the correlation between a time series and its lagged values after removing the influence of all shorter lags. In other words, the PACF at lag \(k\) shows the direct effect of \(y[n-k]\) on \(y[n]\), excluding any intermediary correlations through lags less than \(k\).

The partial autocorrelation at \(k\) is the correlation that results after removing the effect of any correlations due to the terms at shorter lags,

\[ \underline {x[0]},\underbrace {x[1],x[2],\ldots ,x[k-1]}_{\text {partial out}},\underline {x[k]},x[k+1],\ldots \]

The PACF at lag \(k\) can be extracted by fitting an AR(k) model and observing the coefficient associated with \(y[n-k]\) in this model2. For each \(k\),

\begin{equation} y[n] = \phi _{k,1} y[n-1] + \phi _{k,2} y[n-2] + \cdots + \phi _{k,k} y[n-k] + \epsilon [n] \end{equation}

The \(k\)-th partial autocorrelation, \(\beta [k]\) is \(\phi _{k,k}\), the coefficient of \(y[n-k]\) in the AR(k) model. Each of these AR(k) models can be solved using standard LS methods.

The algorithm is as follows (a minimal code sketch is given after the list):

  • For the first value of PACF, \(\beta [1]\), fit AR(1) model

    \begin{equation} y[n] = \phi _{1,1} y[n-1] + \epsilon [n] \end{equation}

    and the coefficient \(\beta [1]=\hat {\phi }_{1,1}\) is given by the model solution, \(h_1\) (Eq. (4.5)).

  • For the second-order PACF, \(\beta [2]=\phi _{2,2}\) is the corresponding coefficient of the AR(2) model,

    \begin{equation} y[n] = \phi _{2,1} y[n-1] + \phi _{2,2} y[n-2] + \epsilon [n] \end{equation}

  • Continue this process up to the desired lag, each time extracting the coefficient of the highest-order lag as the PACF value.
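
A minimal sketch (ours; NumPy assumed, helper name hypothetical) of the procedure above: for each \(k\) an AR(k) model is fitted by LS and its last coefficient is kept as the PACF value. For an AR(2) signal the PACF should cut off after lag 2.

import numpy as np

def pacf_via_ar(y, k_max):
    """PACF beta[k] = last coefficient of an LS-fitted AR(k) model, k = 1..k_max."""
    y = np.asarray(y, dtype=float)
    L = len(y)
    beta = np.zeros(k_max + 1)
    beta[0] = 1.0                              # by convention
    for k in range(1, k_max + 1):
        # rows n = k..L-1, columns y[n-1], ..., y[n-k] (full-overlap version, no zero-padding)
        X = np.column_stack([y[k - m:L - m] for m in range(1, k + 1)])
        phi, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        beta[k] = phi[-1]                      # coefficient of y[n-k] in the AR(k) fit
    return beta

# Synthetic AR(2) signal: the PACF should be negligible beyond lag 2
rng = np.random.default_rng(0)
eps = rng.normal(size=4000)
y = np.zeros(4000)
for n in range(2, 4000):
    y[n] = 0.5 * y[n - 1] + 0.3 * y[n - 2] + eps[n]

print(np.round(pacf_via_ar(y, 5), 2))          # approximately [1., 0.7, 0.3, 0, 0, 0]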

Tip:

  • While the ACF reflects both direct and indirect correlations (where an earlier lag may influence a later lag through intermediate values), the PACF isolates the direct contribution of each individual lag once all shorter-term effects are factored out.

  • Though conceptually the PACF is obtained by fitting multiple AR models, practically more efficient algorithms like the Levinson-Durbin recursion can compute these values quickly and without having to solve a new LS problem at every step.

4.4.1 Relation between PACF and AR(p)

The order \(p\) of an AR(p) model is related to the statistically significant (beyond the confidence bound) coefficients of the PACF. Observing where the PACF cuts off helps identify the appropriate order \(p\) for AR(p) model fitting. An example of a synthetic signal analysis with \(p=2\) is presented in Fig. 4.2.

(image)

Figure 4.2: ACF and PACF of AR(2) signal. Note short PACF and long ACF plots.

4.5 MA model

  • Goal: The moving average (MA) model expresses the current output \(y[n]\) as a linear combination of current and past noise terms. Unlike the AR model, which relies on past values of the output, the MA model directly models the output as filtered white noise.

Another model is the moving average model (MA), where the output is a linear combination of the noise values at the different times [6, Example 4.3, pp. 90]

\begin{equation} y[n] = \epsilon [n] + c_1\epsilon [n-1] + \cdots + c_{q}\epsilon [n-q] \end{equation}

Because the noise terms \(\epsilon [n]\) are not directly observable (unlike the past outputs in AR modeling), deriving a closed-form solution for MA parameters through simple linear regression is not as direct and is not presented here. 3 While the underlying theory is well-established, practical estimation of MA parameters often involves advanced numerical methods rather than a simple closed-form solution.
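
For illustration only (not the text's method): one practical option is a likelihood-based estimator such as statsmodels' ARIMA with order \((0,0,q)\); the library and the attribute name used below (maparams) are assumptions about that package, not part of this chapter.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA   # assumed available; any numerical MA estimator would do

# Synthetic MA(2): y[n] = eps[n] + 0.6 eps[n-1] - 0.3 eps[n-2]  (illustrative values)
rng = np.random.default_rng(0)
L = 4000
eps = rng.normal(size=L)
y = eps.copy()
y[1:] += 0.6 * eps[:-1]
y[2:] += -0.3 * eps[:-2]

# Numerical (likelihood-based) estimation of the MA coefficients -- no closed form is used
res = ARIMA(y, order=(0, 0, 2), trend="n").fit()
print(res.maparams)                              # approximately [0.6, -0.3]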

4.5.1 MA and AR relations

There exists a duality between AR and MA processes in terms of infinite expansions:

Presenting AR(p) as MA(\(\infty \)) An AR model can be represented as an infinite MA model when the autoregressive parameters are stable. Consider the recursive expansion of the simple AR(1),

\begin{equation} \begin{aligned} y[n] &= h_1y[n-1] + \epsilon [n]\\ &= h_1(h_1y[n-2] + \epsilon [n-1]) + \epsilon [n]\\ &= h_1^2y[n-2] + h_1\epsilon [n-1] + \epsilon [n]\\ &= h_1^3y[n-3] + h_1^2\epsilon [n-2] + h_1\epsilon [n-1] + \epsilon [n] \end {aligned} \end{equation}

Continuing this process indefinitely and assuming stability, \(\abs {h_1}<1\), so that \(\lim \limits _{k\rightarrow \infty }h_1^k= 0\), we have

\begin{equation} \begin{aligned} y[n] &= \epsilon [n] + h_1\epsilon [n-1] + h_1^2\epsilon [n-2] + h_1^3\epsilon [n-3] + \cdots \\ &=\sum _{i=0}^\infty h_1^i\epsilon [n-i] \end {aligned} \end{equation}

which is an MA(\(\infty \)) model, whose coefficients are the powers of \(h_1\), an infinite geometric sequence.
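
A small numerical check (ours) that the AR(1) recursion and its truncated MA(\(\infty \)) expansion produce the same signal for a common noise sequence; scipy.signal.lfilter is assumed, used here as the filter command mentioned earlier.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
h1, L = 0.8, 500
eps = rng.normal(size=L)

# AR(1) recursion y[n] = h1 y[n-1] + eps[n], realized as an all-pole filter with a = [1, -h1]
y_ar = lfilter([1.0], [1.0, -h1], eps)

# Truncated MA(infinity) expansion: y[n] ~= sum_i h1**i eps[n-i], i = 0..K
K = 60                                           # h1**61 ~ 1e-6, so the truncation error is tiny
y_ma = lfilter(h1 ** np.arange(K + 1), [1.0], eps)

print(np.max(np.abs(y_ar - y_ma)))               # on the order of the truncation error (~1e-5)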

Presenting MA(1) as AR(\(\infty \)) An MA(1) process is given by

\begin{equation} x[n] = \epsilon [n] + c_1\,\epsilon [n-1]. \end{equation}

We can express the process as an infinite-order AR\((\infty )\) model.

First, note that we can write

\begin{align*} \epsilon [n] &= x[n] - c_1\,\epsilon [n-1].\\ \epsilon [n-1] &= x[n-1] - c_1\,\epsilon [n-2]\\ \epsilon [n-2] &= x[n-2] - c_1\,\epsilon [n-3] \end{align*} and so on. Substituting for \(\epsilon [n-i]\) recursively, we have

\begin{equation} \begin{aligned} \epsilon [n] &= x[n] - c_1\left (x[n-1] - c_1\,\epsilon [n-2]\right )\\[1mm] &= x[n] - c_1\,x[n-1] + c_1^2\,\epsilon [n-2]\\[1mm] &= x[n] - c_1\,x[n-1] + c_1^2\left (x[n-2] - c_1\,\epsilon [n-3]\right )\\[1mm] &= x[n] - c_1\,x[n-1] + c_1^2\,x[n-2] - c_1^3\,\epsilon [n-3]\\[1mm] &\,\,\,\vdots \\[1mm] &= \sum _{i=0}^\infty (-c_1)^i\,x[n-i]. \end {aligned} \end{equation}

Rearranging, the AR\((\infty )\) representation becomes

\begin{equation} x[n] = \epsilon [n] + \sum _{i=1}^\infty (-c_1)^i\,x[n-i]. \end{equation}

Note the invertibility condition, \(|c_1| < 1\).

Interpretation AR and MA are not mutually exclusive categories. A stable AR process can be seen as a special case of an infinite MA, and a stable MA can be thought of as an infinite AR.

4.5.2 The relation between MA(q) and ACF

The MA(q) model has a distinctive fingerprint in terms of its ACF; it is nonzero for up to lag \(q\) and essentially zero afterward (except for sampling and noise effects). Just as the partial autocorrelation function (PACF) helps determine the order \(p\) of an AR(p) process by pinpointing where its PACF cuts off (Sec. 4.4.1), the ACF helps identify the order \(q\) of an MA(q) process.

An example of a synthetic MA(4) signal analysis is presented in Fig. 4.3. After lag 4, the ACF values remain within the confidence bounds, essentially zero, suggesting the data arise from an MA(4) process.

(image)

Figure 4.3: ACF and PACF of a synthetic MA(4) signal. Note short ACF and long PACF plots.

4.6 ARMA

  • Goal: The ARMA model extends the concepts of AR and MA models by combining their elements to capture more complex dynamics.

The ARMA(p,q) model is given by

\begin{equation} \begin{aligned} y[n] &= h_1y[n-1] + \cdots + h_py[n-p] \\ &+ c_1\epsilon [n-1] + \cdots + c_q\epsilon [n-q] + \epsilon [n]\\ &=\sum _{i=1}^{p}h_iy[n-i] + \sum _{k=0}^qc_k\epsilon [n-k] \end {aligned}, \end{equation}

where \(c_0=1\) by definition.

One way to describe the MA part of ARMA is that the MA terms model the residual left unexplained by the AR part,

\begin{equation} y[n] - \sum _{i=1}^{p}h_iy[n-i] = \sum _{k=0}^qc_k\epsilon [n-k] \end{equation}

As with AR models (Eq. (4.48)), a constant bias \(\mu \) can be included,

\begin{equation} y[n] =\sum _{i=1}^{p}h_iy[n-i] + \sum _{k=0}^qc_k\epsilon [n-k] + \mu \end{equation}

One commonly adopted notation of the ARMA model is

\begin{equation} \sum _{m=0}^p a_m y[n-m] = \sum _{k=0}^qc_k\epsilon [n-k] \end{equation}

with \(a_0 = 1\) and \(a_m = - h_m\).
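
As a closing illustration (ours, not the text's method), an ARMA(p,q) model with a constant can be estimated numerically, e.g. with statsmodels' ARIMA using order \((p,0,q)\) and trend "c"; the library, the trend option, and the attribute names arparams/maparams are assumptions about that package.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA   # assumed available

# Synthetic ARMA(1,1) with bias: y[n] = 1.0 + 0.7 y[n-1] + eps[n] + 0.4 eps[n-1]
rng = np.random.default_rng(0)
L = 3000
eps = rng.normal(size=L)
y = np.zeros(L)
for n in range(1, L):
    y[n] = 1.0 + 0.7 * y[n - 1] + eps[n] + 0.4 * eps[n - 1]

# trend="c" adds a constant term (note: it is typically reported as the process mean level, not mu itself)
res = ARIMA(y, order=(1, 0, 1), trend="c").fit()
print(res.arparams, res.maparams)               # approximately [0.7] and [0.4]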