Data-Driven Time-Series Prediction


Chapter 1 Descriptive Statistics Basics

  • Goal: Summarize the essential characteristics of a data set concisely.

Preliminaries

We assume a univariate random experiment described by a real-valued random variable \(X\) with

\begin{align*} \E [X] &=\mu \\ \Var [X] &= \sigma ^2. \end{align*}

Let

\[x=\{x_1,\dots ,x_n\}\]

be \(n\) observations of \(X\), where \(n\) may be fixed in advance or chosen arbitrarily.

1.1 Central Tendency

1.1.1 Mean

Given observations \(x_1,\dots ,x_n\), the sample mean \(\bar {x}\) is given by

\begin{equation} \bar x = \frac 1n\sum _{i=1}^n x_i \end{equation}

Properties:

  • As you gather more observations (higher \(n\)), \(\bar x\) tends to stabilize: it fluctuates less, and its empirical distribution concentrates around the true center \(\mu \).

  • Moreover, for large \(n\), the variability of \(\bar x\) scales like \(\sigma /\sqrt {n}\) (equivalently, \(\Var [\bar x]\approx \frac {\sigma ^2}{n}\)), meaning the more data we collect, the tighter our estimate of the process’s center becomes (see the sketch after this list).

  • In repeated experiments of size \(n\), the average of those \(\bar x\) values is itself approximately \(\mu \); that is, \(\E [\bar x]=\mu \), so on average the sample mean recovers the true mean.
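
Both properties are easy to check empirically. The following minimal numpy sketch (the Gaussian distribution, sample sizes, and replicate count are illustrative assumptions, not taken from the text) compares the empirical spread of \(\bar x\) with the theoretical \(\sigma /\sqrt {n}\):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0                 # assumed true mean and std of X
    R = 2000                             # replicates per sample size

    for n in (5, 20, 100):
        # R independent samples of size n; one sample mean per row
        xbar = rng.normal(mu, sigma, size=(R, n)).mean(axis=1)
        # empirical spread of the sample mean vs. the theoretical sigma/sqrt(n)
        print(n, xbar.std(ddof=1), sigma / np.sqrt(n))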

1.1.2 Median

The median is the value lying at the midpoint of a frequency distribution of observed values, such that an observation is equally likely to fall above or below it.

Equivalently, for a sample \(x_1,\dots ,x_n\) with order statistics \(x_{(1)}\le \cdots \le x_{(n)}\), the sample median is

\[ \mathrm {median}( x) = \begin {cases} x_{(\frac {n+1}2)}, & n\text { odd},\\[6pt] \dfrac {x_{(\frac n2)} + x_{(\frac n2 + 1)}}2, & n\text { even}. \end {cases} \]
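
A direct implementation of this case split (a sketch; numpy’s built-in np.median computes the same quantity):

    import numpy as np

    def sample_median(x):
        """Median via order statistics: the middle value for odd n,
        the average of the two middle values for even n."""
        xs = np.sort(np.asarray(x, dtype=float))   # x_(1) <= ... <= x_(n)
        n = len(xs)
        if n % 2 == 1:
            return xs[(n + 1) // 2 - 1]            # 0-based indexing
        return (xs[n // 2 - 1] + xs[n // 2]) / 2

    print(sample_median([3, 1, 4, 1, 5]), np.median([3, 1, 4, 1, 5]))  # 3.0 3.0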

1.1.3 Mode

For a finite sample \(x_1,\dots ,x_n\), the sample mode is any value(s) that occur(s) most often among the observations.
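
Because several values can share the maximal count, a mode routine should return all of them. A minimal standard-library sketch:

    from collections import Counter

    def sample_mode(x):
        """All values attaining the maximal count (a sample may be multimodal)."""
        counts = Counter(x)
        top = max(counts.values())
        return [value for value, c in counts.items() if c == top]

    print(sample_mode([1, 2, 2, 3, 3, 4]))  # [2, 3]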

1.2 Dispersion

1.2.1 Variance

Sample variance is given by

\begin{align} s_{unbiased}^2 =\frac 1{n-1}\sum _{i=1}^n\bigl (x_i-\bar {x}\bigr )^2 \end{align} This formula is unbiased.

The more intuitive formula is

\begin{equation} s_{biased}^2 =\frac 1n\sum _{i=1}^n\bigl (x_i-\bar {x}\bigr )^2 \end{equation}

However, it is biased with \(\E [s_{biased}^2]=\dfrac {n-1}{n}\sigma ^2\). It systematically underestimates the true population variance \(\sigma ^2\).

Note that the mean-squared error (MSE) of the biased estimator is slightly lower than that of the unbiased one (see Example 1.1).
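
The two formulas differ only in the normalizer, as the sketch below makes explicit (the sample values are an arbitrary illustration):

    import numpy as np

    def sample_variance(x, unbiased=True):
        """Sum of squared deviations divided by n-1 (unbiased) or n (biased)."""
        x = np.asarray(x, dtype=float)
        ss = np.sum((x - x.mean()) ** 2)
        return ss / (len(x) - 1) if unbiased else ss / len(x)

    x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(sample_variance(x, unbiased=False))  # 4.0 (divides by n)
    print(sample_variance(x, unbiased=True))   # ~4.571 (divides by n-1)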

1.2.2 Standard Deviation

The sample standard deviation (std), \(s\), is simply the square root of the (unbiased or biased) sample variance.

Table 1.1: The difference between variance and std.

Aspect | Variance (\(s^2\)) | Std. Dev. (\(s\))
Units | (original unit)\(^2\) | original unit
Interpretation | “Mean squared deviation” | “Average deviation from the mean”
Ease of communication | Abstract (squared units) | Concrete (e.g., \(\pm 5\) kg)

1.2.3 Bias

Let \(\widehat \theta \) be an estimator of a parameter \(\theta \). Its bias is

\[ \mathrm {Bias}(\widehat \theta )\;=\;\E [\widehat \theta ]\;-\;\theta . \]

If \(\mathrm {Bias}(\widehat \theta )=0\), then \(\E [\widehat \theta ]=\theta \) and \(\widehat \theta \) is called unbiased.

  • Example 1.1: The sample mean is unbiased and its variance decays as \(\sigma ^2/n\); the usual sample-variance estimators can be biased or unbiased. We illustrate all three properties by simulation (a code sketch reproducing it follows the example):

    • 1. Parameters.

      • True distribution: \(X\sim \mathcal {N}(0,1)\), so \(\mu =0\), \(\sigma ^2=1\).

      • Sample sizes: \(n\in \{5,\,20,\,100\}\).

      • Number of replicates: \(R=5000\).

    • 2. Data generation. For each \(n\) and each replicate \(r=1,\dots ,R\):

      \[ x_{r,1},\dots ,x_{r,n}\;\overset {\mathrm {iid}}{\sim }\;\mathcal {N}(0,1), \quad \bar x_r = \frac {1}{n}\sum _{i=1}^n x_{r,i}. \]

    • 3. Empirical estimates.

      \begin{align*} \widehat \mu _n &= \frac {1}{R}\sum _{r=1}^R \bar x_r,\\ \widehat {\Var [\bar x]} &= \frac {1}{R-1}\sum _{r=1}^R\bigl (\bar x_r - \widehat \mu _n\bigr )^2. \end{align*} For the variance estimators,

      \[ \begin {aligned} s_{r,\mathrm {biased}}^2 &= \frac {1}{n}\sum _{i=1}^n (x_{r,i}-\bar x_r)^2, \\ s_{r,\mathrm {unbiased}}^2 &= \frac {1}{n-1}\sum _{i=1}^n (x_{r,i}-\bar x_r)^2, \end {aligned} \]

      and we compute

      \[ \widehat \sigma ^2_{\mathrm {biased}} = \frac {1}{R}\sum _{r=1}^R s_{r,\mathrm {biased}}^2, \quad \widehat \sigma ^2_{\mathrm {unbiased}} = \frac {1}{R}\sum _{r=1}^R s_{r,\mathrm {unbiased}}^2. \]

      We also compute the empirical mean-squared error (MSE) of each estimator relative to the true variance \(\sigma ^2=1\):

      \begin{align*} \widehat {\mathrm {MSE}}_{\mathrm {biased}} &= \frac {1}{R}\sum _{r=1}^R \bigl (s_{r,\mathrm {biased}}^2 - \sigma ^2\bigr )^2,\\ \widehat {\mathrm {MSE}}_{\mathrm {unbiased}} &= \frac {1}{R}\sum _{r=1}^R \bigl (s_{r,\mathrm {unbiased}}^2 - \sigma ^2\bigr )^2. \end{align*}

    • 4. Results.

      Table 1.2: Simulation results: sample mean, variance estimates, and their MSEs across different sample sizes.

      \(n\) | \(\widehat \mu _n\) | Empirical \(\Var [\bar x]\) | Theoretical \(\Var [\bar x]=1/n\) | \(\widehat \sigma ^2_{\mathrm {biased}}\) | \(\widehat \sigma ^2_{\mathrm {unbiased}}\) | \(\widehat {\mathrm {MSE}}_{\mathrm {biased}}\) | \(\widehat {\mathrm {MSE}}_{\mathrm {unbiased}}\)
      5 | \(0.001\) | \(0.198\) | \(0.200\) | \(0.79\) | \(0.99\) | \(0.36\) | \(0.50\)
      20 | \(-0.000\) | \(0.049\) | \(0.050\) | \(0.95\) | \(1.00\) | \(0.10\) | \(0.11\)
      100 | \(0.000\) | \(0.010\) | \(0.010\) | \(0.99\) | \(1.01\) | \(0.02\) | \(0.02\)

      Results in Table 1.2 agree closely with the theoretical values:

      \begin{align*} \E [\bar x]&=0,\\ \Var [\bar x]&=\sigma ^2/n,\\ \E [s_{\mathrm {biased}}^2]&=\frac {n-1}{n}\sigma ^2, \\ \E [s_{\mathrm {unbiased}}^2]&=\sigma ^2,\\ \mathrm {MSE}(s_{\mathrm {biased}}^2) &=\frac {2n-1}{n^2}\,\sigma ^4, \\ \mathrm {MSE}(s_{\mathrm {unbiased}}^2) &=\Var [s_{\mathrm {unbiased}}^2] =\frac {2}{\,n-1\,}\,\sigma ^4 \end{align*}
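
A compact sketch that reproduces the simulation above (exact numbers vary with the random seed, but should match Table 1.2 to the displayed precision):

    import numpy as np

    rng = np.random.default_rng(0)
    R = 5000                                   # replicates
    for n in (5, 20, 100):
        x = rng.standard_normal((R, n))        # R samples of size n from N(0, 1)
        xbar = x.mean(axis=1)
        ss = ((x - xbar[:, None]) ** 2).sum(axis=1)
        s2_b, s2_u = ss / n, ss / (n - 1)      # biased / unbiased variance estimates
        print(f"n={n:3d}  mu_hat={xbar.mean():+.3f}  var_xbar={xbar.var(ddof=1):.3f}  "
              f"s2_b={s2_b.mean():.2f}  s2_u={s2_u.mean():.2f}  "
              f"mse_b={((s2_b - 1)**2).mean():.2f}  mse_u={((s2_u - 1)**2).mean():.2f}")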

When to use the biased vs. the unbiased estimator:

  • Biased: common in machine-learning tasks (e.g., inside loss functions); it is the maximum-likelihood estimate of the variance for a Gaussian distribution.

  • Unbiased: when the particular value of \(\sigma \) is of higher importance.

Code implementation defaults (note the difference; a quick check follows this list):

  • Python (numpy.var, numpy.std): biased by default (ddof=0).

  • MATLAB (var, std): unbiased by default.
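
A quick check of these defaults in Python (the MATLAB equivalents, var(x) and std(x), normalize by \(n-1\)):

    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
    print(np.var(x))                      # 4.0    -> divides by n (ddof=0, biased)
    print(np.var(x, ddof=1))              # ~4.571 -> divides by n-1 (unbiased)
    print(np.std(x), np.std(x, ddof=1))   # the same convention applies to std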

1.3 Histogram

  • Goal: Visualization of the experimental data.

1.3.1 Count Type

  • Goal: Show how many times each discrete outcome occurs.

Consider an experiment with:

  • \(k\) possible distinct outcomes \(x_{1},x_{2},\ldots ,x_{k}\), where \(k\) is relatively small.

  • A total of \(N\) trials.

  • Recorded results:

    • \(n_{1}\) occurrences of \(x_{1}\),

    • \(n_{2}\) occurrences of \(x_{2}\),

    • \(\ldots \) and so on,

    with \(\sum _{i} n_{i}=N\).

A graphical representation of the outcomes is shown in Fig. 1.1(a).

(image)

Figure 1.1: Example of histograms: (a) count, (b) probability.
1.3.2 Probability Type

  • Goal: Show the proportion of each discrete outcome, providing an empirical estimate of the probability mass function (PMF).

Approximation to the PMF: The probability of a particular outcome is approximated by the ratio of its count to the total number of trials,

\begin{equation} \label {eq:rand1:numeric_PDF_discr} p_X[x_i]\approx \frac {n_i}{N}, \qquad i = 1,\ldots ,k. \end{equation}

Naturally, the approximation improves as \(N \to \infty \).

A graphical example of this histogram type is shown in Fig. 1.1(b).
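
A sketch computing both histogram types for a discrete experiment (a simulated fair die, purely an illustrative assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.integers(1, 7, size=1000)                  # N = 1000 die rolls

    values, counts = np.unique(x, return_counts=True)  # outcomes x_i, counts n_i
    probs = counts / counts.sum()                      # p_X[x_i] ~= n_i / N
    for v, n_i, p in zip(values, counts, probs):
        print(v, n_i, round(p, 3))                     # each p approaches 1/6 as N grows

Plotting the pairs \((x_i, n_i)\) gives the count histogram; dividing by \(N\) gives the probability version.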

1.3.3 Large Number of Outcomes

When the number of possible outcomes, \(k\), is large (on the order of hundreds or more, or when the observations take continuous values), two main difficulties arise:

  • Presenting the results in a compact, readable form.

  • Some outcome categories contain very few observations because their probabilities are small.

A practical way to display the data is:

  • 1. Record the extreme values, \(x_{\max }\) and \(x_{\min }\).

  • 2. Partition the interval \(\bigl [x_{\min },x_{\max }\bigr ]\) into \(k\) equal-width bins of size \(\Delta x\).

  • 3. Mark each bin by its midpoint

    \begin{equation} \label {eq:rand1:mid_point} \tilde {x}_{i}=x_{\min }+\Bigl (i-\frac 12\Bigr )\,\Delta x, \qquad i=1,\dots ,k, \end{equation}

  • 4. Let \(n_{1}\) be the count in the first bin, \(n_{2}\) the count in the second, and so on.

  • 5. Use the pairs \(\left (\tilde {x}_{i},n_{i}\right )\) for count- or probability-type histograms (a binning sketch follows Fig. 1.2).

An example of a count histogram for a large data set is shown in Fig. 1.2.

(image)

Figure 1.2: A count histogram for a large number of experimental outcomes.
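
The binning procedure maps directly onto numpy, as in the sketch below (the data and bin count are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10_000)        # continuous-valued observations
    k = 30                                 # number of equal-width bins

    # Steps 1-4: extremes, equal-width partition, counts per bin
    counts, edges = np.histogram(x, bins=k, range=(x.min(), x.max()))
    dx = edges[1] - edges[0]                                  # bin width
    midpoints = x.min() + (np.arange(1, k + 1) - 0.5) * dx    # bin midpoints
    # Step 5: (midpoints, counts) is the count histogram;
    # (midpoints, counts / counts.sum()) the probability version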