Machine Learning & Signals Learning
Part I Machine Learning
1 Descriptive Statistics Basics
1.1 Basic Characterization
Preliminaries
We assume a uni-variate random experiment that is described by a real-valued random variable \(X\) with
\(\seteqnumber{0}{}{0}\)\begin{align*} \E [X] &=\mu \\ \Var [X] &= \sigma ^2. \end{align*}
Let
\[\bx =\{x_1,\dots ,x_n\}\]
be \(n\) observations of \(X\), where \(n\) may be fixed in advance or chosen arbitrarily.
1.1.1 Mean
Given observations \(x_1,\dots ,x_n\), the sample mean \(\bar {\bx }\) is given by
\(\seteqnumber{0}{}{0}\)\begin{equation} \bar \bx = \frac 1n\sum _{i=1}^n x_i \end{equation}
Properties:
-
• As you gather more observations (higher \(n\)), \(\bar \bx \) tends to stabilize: it fluctuates less, and its empirical distribution concentrates around the true center \(\mu \). Moreover, for large \(n\), the variability of \(\bar \bx \) scales like \(\sigma /\sqrt {n}\) (or \(\Var [\bar x]\approx \dfrac {\sigma ^2}{n}\)), meaning the more data we collect, the tighter our estimate of the process’s center becomes.
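This \(\sigma/\sqrt{n}\) scaling can be checked by simulation. The sketch below is a minimal pure-Python illustration; the standard normal distribution for \(X\) and the replicate count \(k=2000\) are illustrative assumptions, not part of the text:

```python
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0   # assumed true parameters (illustrative)
k = 2000               # replicates per sample size (illustrative)

def var_of_sample_mean(n):
    """Empirical variance of the sample mean over k replicates of size n."""
    means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(k)]
    return statistics.pvariance(means)

results = {n: var_of_sample_mean(n) for n in (5, 20, 100)}
for n, v in results.items():
    print(f"n={n:4d}  empirical Var[mean]={v:.4f}  theory={sigma**2 / n:.4f}")
```

The empirical variance of \(\bar{\bx}\) tracks \(\sigma^2/n\) closely and shrinks as \(n\) grows.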
1.1.2 Median
Median is a value or quantity lying at the midpoint of a frequency distribution of observed values or quantities, such that there is an equal probability of falling above or below it.
Equivalently, for a sample \(x_1,\dots ,x_n\) with order statistics \(x_{(1)}\le \cdots \le x_{(n)}\), the sample median is
\[ \mathrm {median}( x) = \begin {cases} x_{(\frac {n+1}2)}, & n\text { odd},\\[6pt] \dfrac {x_{(\frac n2)} + x_{(\frac n2 + 1)}}2, & n\text { even}. \end {cases} \]
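The odd/even case split translates directly to code; a minimal sketch:

```python
def sample_median(xs):
    """Median via order statistics: the middle value for odd n,
    the average of the two middle values for even n."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(sample_median([3, 1, 2]))     # -> 2
print(sample_median([4, 1, 3, 2]))  # -> 2.5
```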
1.1.3 Mode
For a finite sample \(x_1,\dots ,x_n\), the sample mode is any value(s) that occur(s) most often among the observations.
Multimodality: A distribution is called multimodal if it has more than one local maximum (mode) in its probability density function or histogram. Common cases include:
-
• Unimodal: a single peak (e.g., normal distribution).
-
• Bimodal: two distinct peaks, often indicating that the data is a mixture of two subpopulations.
-
• Multimodal: three or more peaks.
Multimodality suggests that the data may come from a mixture of different underlying processes or groups.
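For a finite sample, the mode(s) can be found by counting occurrences; the sketch below returns all values attaining the maximal count, so a bimodal sample yields two values:

```python
from collections import Counter

def sample_modes(xs):
    """All values attaining the maximal count (handles multimodal samples)."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(sample_modes([1, 2, 2, 3, 3, 4]))  # bimodal -> [2, 3]
print(sample_modes([5, 5, 1]))           # unimodal -> [5]
```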
1.1.4 Variance
Sample variance is given by
\(\seteqnumber{0}{}{1}\)\begin{align} s_{n-1}^2 =\frac 1{n-1}\sum _{i=1}^n\bigl (x_i-\bar {x}\bigr )^2 \end{align} This formula is unbiased.
The more intuitive (biased) formula is
\(\seteqnumber{0}{}{2}\)\begin{equation} s_{n}^2 =\frac 1n\sum _{i=1}^n\bigl (x_i-\bar {x}\bigr )^2 \end{equation}
Standard Deviation The sample standard deviation (std) is the square root of the (biased or unbiased) sample variance, \(s = \sqrt {s^2}\).
| Aspect | Variance (\(s^2\)) | Std. Dev. (\(s\)) |
| Units | (original unit)\(^2\) | original unit |
| Interpretation | “Mean squared deviation” | “Average deviation from the mean” |
| Ease of communication | Abstract (squared units) | Concrete (\(\pm 5\) kg) |
1.1.5 Repeated Experiments Characterization
In \(k\) repeated experiments, \(n\) samples are drawn from the random variable \(X\) for each experiment, \(\bx _1,\ldots ,\bx _k\).
Mean
Across repeated experiments, the average of the sample means is itself approximately \(\mu \); on average, the sample mean recovers the true mean,
\(\seteqnumber{0}{}{3}\)\begin{equation} \lim _{k\rightarrow \infty }\frac {1}{k}\sum _{j=1}^{k}\bar {\bx }_j = \mu \end{equation}
This principle is illustrated in Fig. 1.1.
Variance
For the biased variance, the mean of the values \(s_{n_1}^2,\ldots ,s_{n_k}^2\) converges to \(\dfrac {n-1}{n}\sigma ^2\) and thus systematically underestimates the true population variance \(\sigma ^2\),
\(\seteqnumber{0}{}{4}\)\begin{equation} \lim _{k\rightarrow \infty }\frac {1}{k}\sum _{j=1}^{k}s_{n_j}^2 = \dfrac {n-1}{n}\sigma ^2 \end{equation}
This inherent difference is termed bias. For the unbiased variance, however,
\(\seteqnumber{0}{}{5}\)\begin{equation} \lim _{k\rightarrow \infty }\frac {1}{k}\sum _{j=1}^{k}s_{{n-1}_j}^2 = \sigma ^2 \end{equation}
This principle is illustrated in Fig. 1.2.
Note that the difference between \(s_n^2\) and \(s_{n-1}^2\) becomes smaller as \(n\) grows.
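Both convergences can be verified by simulation; the sketch below assumes \(X\sim \mathcal{N}(0,1)\) and illustrative values of \(n\) and \(k\). Python's `statistics` module provides both estimators directly:

```python
import random
import statistics

random.seed(1)
n, k = 5, 20000  # illustrative sample size and replicate count
sigma2 = 1.0     # assumed true variance

biased, unbiased = [], []
for _ in range(k):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    biased.append(statistics.pvariance(xs))   # divides by n   (s_n^2)
    unbiased.append(statistics.variance(xs))  # divides by n-1 (s_{n-1}^2)

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)
print(mean_biased, (n - 1) / n * sigma2)   # ~0.8 vs 0.8
print(mean_unbiased, sigma2)               # ~1.0 vs 1.0
```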
MSE
The biased estimator remains useful when the empirical mean-squared error (MSE) is of interest. The MSE is defined by
\(\seteqnumber{0}{}{6}\)\begin{align} \mathrm {MSE}_{n-1} &= \frac {1}{k}\sum _{j=1}^k \left (s_{{n-1}_j}^2 - \sigma ^2\right )^2 \rightarrow \frac {2}{n-1}\,\sigma ^4\\ \mathrm {MSE}_{n} &= \frac {1}{k}\sum _{j=1}^k \left (s_{n_j}^2 - \sigma ^2\right )^2 \rightarrow \frac {2n-1}{n^2}\,\sigma ^4 \end{align} and
\(\seteqnumber{0}{}{8}\)\begin{equation} \mathrm {MSE}_{n-1} > \mathrm {MSE}_{n} \end{equation}
Note that the MSE of the biased formula is slightly lower than that of the unbiased one.
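The two limiting expressions can be tabulated to confirm the inequality for a range of \(n\); a minimal sketch:

```python
def mse_unbiased(n, sigma4=1.0):
    """Limiting MSE of the unbiased estimator: 2/(n-1) * sigma^4."""
    return 2.0 / (n - 1) * sigma4

def mse_biased(n, sigma4=1.0):
    """Limiting MSE of the biased estimator: (2n-1)/n^2 * sigma^4."""
    return (2.0 * n - 1.0) / n**2 * sigma4

for n in (5, 20, 100):
    print(n, mse_biased(n), mse_unbiased(n))  # biased is always smaller
```

For \(n=5\) this gives \(9/25 = 0.36\) versus \(0.5\), matching the simulated values in Table 1.2.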
-
Example 1.1: The sample mean is unbiased and its variance decays as \(\sigma ^2/n\). The usual sample-variance estimators can be biased or unbiased. We illustrate all three properties by simulation:
-
1. Parameters.
-
• True distribution: \(X\sim \mathcal {N}(0,1)\), so \(\mu =0\), \(\sigma ^2=1\).
-
• Sample sizes: \(n\in \{5,\,20,\,100\}\).
-
• Number of replicates: \(k=5000\).
-
-
2. Results. The values in Table 1.2 agree closely with the theoretical predictions.
Table 1.2: Simulation results: sample mean, variance estimates, and their MSEs across different sample sizes.
| \(n\) | \(\widehat \mu _n\) | Empirical \(\Var [\bar x]\) | Theoretical \(\Var [\bar x]=1/n\) | \(\widehat \sigma ^2_{\mathrm {biased}}\) | \(\widehat \sigma ^2_{\mathrm {unbiased}}\) | \(\widehat {\mathrm {MSE}}_{\mathrm {biased}}\) | \(\widehat {\mathrm {MSE}}_{\mathrm {unbiased}}\) |
| 5 | \(0.001\) | \(0.198\) | \(0.200\) | \(0.79\) | \(0.99\) | \(0.36\) | \(0.50\) |
| 20 | \(-0.000\) | \(0.049\) | \(0.050\) | \(0.95\) | \(1.00\) | \(0.10\) | \(0.11\) |
| 100 | \(0.000\) | \(0.010\) | \(0.010\) | \(0.99\) | \(1.01\) | \(0.02\) | \(0.02\) |
Typical applications of the biased and unbiased estimators:
-
• Biased: common in ML tasks (e.g., inside loss functions); it is the maximum-likelihood estimate of \(\sigma ^2\) for a Gaussian distribution.
-
• Unbiased: classical statistical inference.
Note that the default convention (biased vs. unbiased) varies across code libraries.
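For instance, NumPy's `np.var` defaults to the biased form (`ddof=0`), while pandas' `Series.var` defaults to the unbiased form (`ddof=1`). Python's standard `statistics` module avoids the ambiguity by naming both explicitly:

```python
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative sample, mean = 5

biased = statistics.pvariance(xs)   # divides by n   ("population" variance, s_n^2)
unbiased = statistics.variance(xs)  # divides by n-1 ("sample" variance, s_{n-1}^2)
print(biased, unbiased)  # 4.0 and 32/7 ~ 4.571
```

When comparing results across libraries, always check which `ddof` convention is in effect.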
1.1.6 Skewness
Skewness: A measure of the asymmetry of a distribution about its mean. The sample skewness is defined as
\[ g_1 = \frac {1}{n}\sum _{i=1}^{n}\!\left (\frac {x_i - \bar {x}}{s_n}\right )^{\!3}. \]
-
• \(g_1 = 0\): the distribution is symmetric (e.g., normal distribution).
-
• \(g_1 > 0\): the distribution is right-skewed (positively skewed) — the right tail is longer, and the mean is typically greater than the median.
-
• \(g_1 < 0\): the distribution is left-skewed (negatively skewed) — the left tail is longer, and the mean is typically less than the median.
An illustration of the relationship between mean, median, and mode under different skewness is shown in Fig. 1.3.
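The formula for \(g_1\) translates directly to code; a minimal sketch using the biased standard deviation \(s_n\), with illustrative data:

```python
import statistics

def sample_skewness(xs):
    """g1 = (1/n) * sum(((x - mean)/s_n)**3), using the biased std s_n."""
    n = len(xs)
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)  # biased std (divides by n)
    return sum(((x - m) / s) ** 3 for x in xs) / n

print(sample_skewness([1, 2, 3, 4, 5]))      # symmetric -> ~0
print(sample_skewness([1, 1, 2, 2, 3, 10]))  # long right tail -> positive
```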
1.2 Histogram
1.2.1 Count Type
Consider an experiment with:
-
• \(k\) possible distinct outcomes \(x_{1},x_{2},\ldots ,x_{k}\), where \(k\) is relatively small.
-
• A total of \(N\) trials.
-
• Recorded results:
-
– \(n_{1}\) occurrences of \(x_{1}\),
-
– \(n_{2}\) occurrences of \(x_{2}\),
-
– \(\ldots \) and so on,
with \(\sum _{i} n_{i}=N\).
-
The highest bar in a count histogram identifies the mode of the sample. A graphical representation of the outcomes is shown in Fig. 1.4(a).
1.2.2 Probability Type
Approximation to the PDF: The probability of a particular outcome is approximated by the ratio of its count to the total number of trials,
\(\seteqnumber{0}{}{9}\)\begin{equation} \label {eq:rand1:numeric_PDF_discr} p_X[x_i]\approx \frac {n_i}{N}, \qquad i = 1,\ldots ,k. \end{equation}
Naturally, the approximation improves as \(N \to \infty \).
A graphical example of this histogram type is shown in Fig. 1.4(b).
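The relative-frequency approximation \(n_i/N\) is a one-liner over the counts; the sketch below uses hypothetical die-roll data:

```python
from collections import Counter

outcomes = [1, 2, 2, 3, 3, 3, 6, 6, 6, 6]  # N = 10 trials (hypothetical)
N = len(outcomes)
counts = Counter(outcomes)

# p_X[x_i] ~ n_i / N for each distinct outcome x_i
p_hat = {x: c / N for x, c in sorted(counts.items())}
print(p_hat)  # -> {1: 0.1, 2: 0.2, 3: 0.3, 6: 0.4}
```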
1.2.3 Large Number of Outcomes
When the number of possible outcomes, \(k\), is large (hundreds or more), or when the outcomes are continuous, two main difficulties arise:
-
• Presenting the results in a compact, readable form.
-
• Some outcome categories contain very few observations because their probabilities are small.
A practical way to display the data is:
-
1. Record the extreme values, \(x_{\max }\) and \(x_{\min }\).
-
2. Partition the interval \(\bigl [x_{\min },x_{\max }\bigr ]\) into \(k\) equal-width bins of size \(\Delta x\).
-
3. Mark each bin by its midpoint
\(\seteqnumber{0}{}{10}\)\begin{equation} \label {eq:rand1:mid_point} \tilde {x}_{i}=x_{\min }+\Bigl (i-\frac 12\Bigr )\,\Delta x, \qquad i=1,\dots ,k, \end{equation}
-
4. Let \(n_{1}\) be the count in the first bin, \(n_{2}\) the count in the second, and so on.
-
5. Use the pairs \(\left (\tilde {x}_{i},n_{i}\right )\) for count- or probability-type histograms.
An example of a count histogram for a large data set is shown in Fig. 1.5.
In this representation, the median is the value on the x-axis that divides the total area of the histogram into two equal parts. The variance \(s^2\) and standard deviation \(s\) are reflected in the width or spread of the histogram: a broad histogram indicates high variability, while a narrow, sharp peak indicates that the data is tightly clustered around the mean.
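The five-step procedure above can be sketched as a small function. The bins here are half-open on the right, with \(x_{\max}\) clamped into the last bin; the data are illustrative:

```python
def bin_histogram(xs, k):
    """Equal-width binning: returns (bin midpoints, bin counts)."""
    x_min, x_max = min(xs), max(xs)
    dx = (x_max - x_min) / k
    counts = [0] * k
    for x in xs:
        i = min(int((x - x_min) / dx), k - 1)  # clamp x_max into the last bin
        counts[i] += 1
    mids = [x_min + (i + 0.5) * dx for i in range(k)]  # midpoint formula (1.11)
    return mids, counts

mids, counts = bin_histogram([2, 3, 5, 7, 11, 13, 17, 19, 23, 29], k=3)
print(mids, counts)  # -> [6.5, 15.5, 24.5] [4, 4, 2]
```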
1.2.4 PDF Approximation of Continuous Random Variable (*)
In addition to the binning method described above, there is a third histogram type that directly approximates the PDF of a continuous random variable.
Approximation to the PDF of a continuous random variable via histogram:
\(\seteqnumber{0}{}{11}\)\begin{equation} \label {eq:stat:hist_pdf} f_X(x_i) \approx \frac {n_i}{N}\cdot \frac {1}{\Delta x}, \quad i = 1,\ldots ,k \end{equation}
Note the normalization factor \(1/\Delta x\), which distinguishes this from the discrete case in (1.10).
A graphical example of this histogram type is shown in Fig. 1.6.
Derivation
Based on the principle
\(\seteqnumber{0}{}{12}\)\begin{equation} \Pr (a<X\le b) = \int _{a}^{b}f_X(x)\,dx, \end{equation}
the probability of falling within a bin of width \(\Delta x\) centered at \(x_0\) can be approximated as
\(\seteqnumber{0}{}{13}\)\begin{align} \Pr \!\Bigl (x_0-\tfrac {\Delta x}{2} < X \le x_0 +\tfrac {\Delta x}{2}\Bigr ) &= \int _{x_0-\Delta x/2}^{x_0+\Delta x/2}f_X(x)\,dx \approx f_X(x_0)\,\Delta x. \end{align} An illustration of this principle is shown in Fig. 1.7.
Rearranging,
\(\seteqnumber{0}{}{14}\)\begin{equation} \label {eq:stat:histogram_deriv} f_X(x_0) \approx \Pr \!\Bigl (x_0-\tfrac {\Delta x}{2} < X \le x_0 +\tfrac {\Delta x}{2}\Bigr )\cdot \frac {1}{\Delta x}. \end{equation}
The probability of falling in bin \(i\) (centered at \(\tilde {x}_i\)) is approximated by the relative frequency,
\(\seteqnumber{0}{}{15}\)\begin{equation} \label {eq:stat:hist_prob} \Pr \!\Bigl (\tilde {x}_i-\tfrac {\Delta x}{2} < X \le \tilde {x}_i +\tfrac {\Delta x}{2}\Bigr ) \approx \frac {n_i}{N}, \quad i = 1,\ldots ,k. \end{equation}
Substituting (1.16) into (1.15) yields the PDF approximation formula in (1.12).
-
Example 1.2: Given the experimental outcomes
\[ \left [16,\,98,\,96,\,49,\,81,\,14,\,43,\,92,\,80,\,96\right ], \]
display the PDF using a histogram with \(k=3\) bins.
-
\(\seteqnumber{0}{}{16}\)
\begin{align*} N &= 10\\ x_{\min } &= 14,\quad x_{\max } = 98\\ x_{\max } &= x_{\min } + k\,\Delta x = 98\quad \Rightarrow \Delta x = \frac {x_{\max }-x_{\min }}{k} = \frac {84}{3} = 28 \end{align*} Counts per bin:
\(\seteqnumber{0}{}{16}\)\begin{align*} n_1 &= 2, &\leftarrow \lbrace 14,16\rbrace &\in [x_{\min },\;x_{\min } + \Delta x] &&= [14,42]\\ n_2 &= 2, &\leftarrow \lbrace 43, 49\rbrace &\in (x_{\min } + \Delta x,\;x_{\min } + 2\Delta x] &&= (42,70]\\ n_3 &= 6, &\leftarrow \lbrace 80,81,92,96,96,98\rbrace &\in (x_{\max } - \Delta x,\;x_{\max }] &&= (70,98] \end{align*} where \(n_3 = 10 - n_1 - n_2\). Bin midpoints:
\(\seteqnumber{0}{}{16}\)\begin{align*} \tilde {x}_1 &= x_{\min } + \tfrac {\Delta x}{2} = 28\\ \tilde {x}_2 &= x_{\min } + \Delta x + \tfrac {\Delta x}{2} = 56\\ \tilde {x}_3 &= x_{\max } - \tfrac {\Delta x}{2} = 84 \end{align*} The PDF approximation is therefore
\[ f_X(\tilde {x}_i) \approx \frac {n_i}{N\cdot \Delta x} = n_i \cdot \frac {1}{10\cdot 28}. \]
-
Summary
Three ways to display experimental results as a histogram:
-
• count: plot \(n_i\) — the raw bin counts.
-
• probability: plot \(n_i/N\) — the relative frequency per bin.
-
• pdf (or density): plot \(\dfrac {n_i}{N\,\Delta x}\) — the estimated probability density.
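All three normalizations can be computed from the same bin counts. The sketch below reuses the data of Example 1.2 (half-open bins; no data point falls on an interior edge here):

```python
data = [16, 98, 96, 49, 81, 14, 43, 92, 80, 96]
N, k = len(data), 3
x_min, x_max = min(data), max(data)
dx = (x_max - x_min) / k               # 28.0
counts = [0] * k
for x in data:
    counts[min(int((x - x_min) / dx), k - 1)] += 1

probability = [n_i / N for n_i in counts]         # relative frequencies
density = [n_i / (N * dx) for n_i in counts]      # f_X estimate, per (1.12)
print(counts)       # count histogram: [2, 2, 6]
print(probability)  # sums to 1
print(density)      # integrates to 1 over the bins
```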
1.3 Boxplot
1.3.1 Quartiles
Quartiles: Given the order statistics \(x_{(1)}\le \cdots \le x_{(n)}\), the three quartiles divide the sorted data into four equal parts:
-
• \(Q_1\) (first quartile, 25th percentile) — the median of the lower half of the data.
-
• \(Q_2\) (second quartile, 50th percentile) — the median of the entire data set.
-
• \(Q_3\) (third quartile, 75th percentile) — the median of the upper half of the data.
Equivalently, approximately 25% of the observations fall below \(Q_1\), 50% below \(Q_2\), and 75% below \(Q_3\).
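Quartile conventions differ slightly across libraries (interpolation between order statistics vs. median-of-halves), so results on small samples may not match exactly. Python's `statistics.quantiles` with `method="inclusive"` interpolates; the data below are illustrative:

```python
import statistics

xs = [1, 3, 4, 7, 8, 10, 12, 15]
q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q2, q3, iqr)  # Q2 always equals the median
```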
1.3.2 Five-Number Summary
A boxplot (box-and-whisker plot) summarizes a data set using five statistics:
-
1. Minimum non-outlier value.
-
2. First quartile (\(Q_1\)).
-
3. Median (\(Q_2\)).
-
4. Third quartile (\(Q_3\)).
-
5. Maximum non-outlier value.
The interquartile range (IQR) measures the spread of the central 50% of the data,
\(\seteqnumber{0}{}{16}\)\begin{equation} \mathrm {IQR} = Q_3 - Q_1. \end{equation}
Outlier: An outlier is any observation that falls outside the interval
\[ \bigl [Q_1 - 1.5\,\mathrm {IQR},\;\; Q_3 + 1.5\,\mathrm {IQR}\bigr ]. \]
Such points are considered unusually far from the bulk of the data and may indicate measurement errors, rare events, or heavy-tailed behavior.
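The \(1.5\,\mathrm{IQR}\) rule is straightforward to implement; a minimal sketch with illustrative data (quartiles via interpolation, which may differ slightly from other conventions):

```python
import statistics

def iqr_outliers(xs):
    """Flag observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

print(iqr_outliers([10, 12, 12, 13, 12, 11, 14, 13, 15, 102]))  # -> [102]
print(iqr_outliers([1, 2, 3, 4, 5]))                            # -> []
```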
1.3.3 Construction
The boxplot is drawn as follows:
-
• A box spans from \(Q_1\) to \(Q_3\), with a line at the median \(Q_2\).
-
• Whiskers extend from the box to the most extreme data points that lie within
\[ \bigl [Q_1 - 1.5\,\mathrm {IQR},\;\; Q_3 + 1.5\,\mathrm {IQR}\bigr ]. \]
-
• Any observation outside the whisker range is marked individually as a potential outlier.
An illustration of a boxplot with its components is shown in Fig. 1.8.
1.3.4 Interpretation
-
• The box width (IQR) reflects variability — a wide box indicates high spread.
-
• The median line position within the box reveals skewness: if the median is closer to \(Q_1\), the distribution is right-skewed; if closer to \(Q_3\), left-skewed.
-
• Whisker lengths indicate the range of typical observations.
-
• Individual points beyond the whiskers flag potential outliers that may warrant further investigation.
Unlike histograms, boxplots do not show the shape of the distribution (e.g., multimodality). They are most useful for comparing distributions across groups side by side.
1.4 Violin Plot
A violin plot extends the boxplot by adding a (smoothed1) histogram of the data on each side. A median line is typically drawn inside.
Fig. 1.9 compares the two representations for the same bimodal data set: the boxplot gives no indication of two groups, whereas the violin plot clearly shows two bulges.
1 This smoothing is termed kernel density estimation (KDE). It is a special case of kernel smoothing in Sec. 6.5. The discussion of this technique is beyond the scope of this chapter.