Machine Learning & Signals Learning

\(\newcommand{\footnotename}{footnote}\) \(\def \LWRfootnote {1}\) \(\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\let \LWRorighspace \hspace \) \(\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }\) \(\newcommand {\TextOrMath }[2]{#2}\) \(\newcommand {\mathnormal }[1]{{#1}}\) \(\newcommand \ensuremath [1]{#1}\) \(\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } \) \(\newcommand {\setlength }[2]{}\) \(\newcommand {\addtolength }[2]{}\) \(\newcommand {\setcounter }[2]{}\) \(\newcommand {\addtocounter }[2]{}\) \(\newcommand {\arabic }[1]{}\) \(\newcommand {\number }[1]{}\) \(\newcommand {\noalign }[1]{\text {#1}\notag \\}\) \(\newcommand {\cline }[1]{}\) \(\newcommand {\directlua }[1]{\text {(directlua)}}\) \(\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}\) \(\newcommand {\protect }{}\) \(\def \LWRabsorbnumber #1 {}\) \(\def \LWRabsorbquotenumber "#1 {}\) \(\newcommand {\LWRabsorboption }[1][]{}\) \(\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }\) \(\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }\) \(\def \mathcode #1={\mathchar }\) \(\let \delcode \mathcode \) \(\let \delimiter \mathchar \) \(\def \oe {\unicode {x0153}}\) \(\def \OE {\unicode {x0152}}\) \(\def \ae {\unicode {x00E6}}\) \(\def \AE {\unicode {x00C6}}\) \(\def \aa {\unicode {x00E5}}\) \(\def \AA {\unicode {x00C5}}\) \(\def \o {\unicode {x00F8}}\) \(\def \O {\unicode {x00D8}}\) \(\def \l {\unicode {x0142}}\) \(\def \L {\unicode {x0141}}\) \(\def \ss {\unicode {x00DF}}\) \(\def \SS {\unicode {x1E9E}}\) \(\def \dag {\unicode {x2020}}\) \(\def \ddag {\unicode {x2021}}\) \(\def \P {\unicode {x00B6}}\) \(\def \copyright {\unicode {x00A9}}\) \(\def \pounds {\unicode {x00A3}}\) \(\let \LWRref \ref \) \(\renewcommand {\ref }{\ifstar \LWRref \LWRref }\) \( \newcommand {\multicolumn }[3]{#3}\) \(\require {textcomp}\) \( 
\newcommand {\abs }[1]{\lvert #1\rvert } \) \( \DeclareMathOperator {\sign }{sign} \) \(\newcommand {\intertext }[1]{\text {#1}\notag \\}\) \(\let \Hat \hat \) \(\let \Check \check \) \(\let \Tilde \tilde \) \(\let \Acute \acute \) \(\let \Grave \grave \) \(\let \Dot \dot \) \(\let \Ddot \ddot \) \(\let \Breve \breve \) \(\let \Bar \bar \) \(\let \Vec \vec \) \(\newcommand {\bm }[1]{\boldsymbol {#1}}\) \(\require {physics}\) \(\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}\) \(\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}\) \(\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}\) \(\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}\) \(\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}\) \(\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}\) \(\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}\) \(\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}\) \(\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}\) \(\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}\) \(\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}\) \(\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}\) \(\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}\) \(\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}\) \(\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}\) \(\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}\) \(\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}\) \(\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}\) \(\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}\) \(\renewcommand {\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}\) \(\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}\) \(\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}\) \(\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}\) \(\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}\) \(\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}\) \(\require {cancel}\) \(\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}\) 
\(\DeclareMathOperator *{\argmax }{argmax}\) \(\DeclareMathOperator *{\argmin }{arg\,min}\) \(\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}\) \(\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}\) \(\def \Cov [#1]{\operatorname {Cov}\!\left [ #1 \right ]}\) \(\newcommand {\floor }[1]{\lfloor #1 \rfloor }\) \(\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}\) \(\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}\) \(\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}\) \(\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}\) \(\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}\) \(\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}\) \(\renewcommand {\real }{\mathbb {R}}\) \(\newcommand {\ba }{\mathbf {a}}\) \(\newcommand {\bb }{\mathbf {b}}\) \(\newcommand {\bd }{\mathbf {d}}\) \(\newcommand {\be }{\mathbf {e}}\) \(\newcommand {\bh }{\mathbf {h}}\) \(\newcommand {\bn }{\mathbf {n}}\) \(\newcommand {\bq }{\mathbf {q}}\) \(\newcommand {\br }{\mathbf {r}}\) \(\newcommand {\bt }{\mathbf {t}}\) \(\newcommand {\bv }{\mathbf {v}}\) \(\newcommand {\bw }{\mathbf {w}}\) \(\newcommand {\bx }{\mathbf {x}}\) \(\newcommand {\bxx }{\mathbf {xx}}\) \(\newcommand {\bxy }{\mathbf {xy}}\) \(\newcommand {\by }{\mathbf {y}}\) \(\newcommand {\byy }{\mathbf {yy}}\) \(\newcommand {\bz }{\mathbf {z}}\) \(\newcommand {\bA }{\mathbf {A}}\) \(\newcommand {\bB }{\mathbf {B}}\) \(\newcommand {\bI }{\mathbf {I}}\) \(\newcommand {\bK }{\mathbf {K}}\) \(\newcommand {\bP }{\mathbf {P}}\) \(\newcommand {\bQ }{\mathbf {Q}}\) \(\newcommand {\bR }{\mathbf {R}}\) \(\newcommand {\bU }{\mathbf {U}}\) \(\newcommand {\bW }{\mathbf {W}}\) \(\newcommand {\bX }{\mathbf {X}}\) \(\newcommand {\bY }{\mathbf {Y}}\) \(\newcommand {\bZ }{\mathbf {Z}}\) \(\newcommand {\balpha }{\bm {\alpha }}\) \(\newcommand {\bth }{{\bm {\theta }}}\) 
\(\newcommand {\bepsilon }{{\bm {\epsilon }}}\) \(\newcommand {\bmu }{{\bm {\mu }}}\) \(\newcommand {\bOne }{\mathbf {1}}\) \(\newcommand {\bZero }{\mathbf {0}}\) \(\newcommand {\loss }{\mathcal {L}}\) \(\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}\) \(\newcommand {\SSE }{\mathrm {SSE}}\) \(\newcommand {\MSE }{\mathrm {MSE}}\) \(\newcommand {\RMSE }{\mathrm {RMSE}}\) \(\newcommand {\toprule }[1][]{\hline }\) \(\let \midrule \toprule \) \(\let \bottomrule \toprule \) \(\def \LWRbooktabscmidruleparen (#1)#2{}\) \(\newcommand {\LWRbooktabscmidrulenoparen }[1]{}\) \(\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }\) \(\newcommand {\morecmidrules }{}\) \(\newcommand {\specialrule }[3]{\hline }\) \(\newcommand {\addlinespace }[1][]{}\) \(\newcommand {\LWRsubmultirow }[2][]{#2}\) \(\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }\) \(\newcommand {\multirow }[2][]{\LWRmultirow }\) \(\newcommand {\mrowcell }{}\) \(\newcommand {\mcolrowcell }{}\) \(\newcommand {\STneed }[1]{}\) \(\newcommand {\tcbset }[1]{}\) \(\newcommand {\tcbsetforeverylayer }[1]{}\) \(\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}\) \(\newcommand {\tcboxfit }[2][]{\boxed {#2}}\) \(\newcommand {\tcblower }{}\) \(\newcommand {\tcbline }{}\) \(\newcommand {\tcbtitle }{}\) \(\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}\) \(\newcommand {\tcboxmath }[2][]{\boxed {#2}}\) \(\newcommand {\tcbhighmath }[2][]{\boxed {#2}}\)

Part I Machine Learning

1 Descriptive Statistics Basics

  • Goal: Describe the concise characteristics of a data set.

1.1 Basic Characterization

Preliminaries

We assume a uni-variate random experiment that is described by a real-valued random variable \(X\) with

\begin{align*} \E [X] &=\mu \\ \Var [X] &= \sigma ^2. \end{align*}

Let

\[\bx =\{x_1,\dots ,x_n\}\]

be \(n\) observations of \(X\), where \(n\) may be fixed in advance or chosen arbitrarily.

1.1.1 Mean

Given observations \(x_1,\dots ,x_n\), the sample mean \(\bar {\bx }\) is given by

\begin{equation} \bar \bx = \frac 1n\sum _{i=1}^n x_i \end{equation}

Properties:

  • As you gather more observations (higher \(n\)), \(\bar \bx \) tends to stabilize: it fluctuates less, and its empirical distribution concentrates around the true center \(\mu \). Moreover, for large \(n\), the variability of \(\bar \bx \) scales like \(\sigma /\sqrt {n}\) (or \(\Var [\bar x]\approx \dfrac {\sigma ^2}{n}\)), meaning the more data we collect, the tighter our estimate of the process’s center becomes.

1.1.2 Median

The median is the value lying at the midpoint of a frequency distribution of observed values, such that there is an equal probability of an observation falling above or below it.

Equivalently, for a sample \(x_1,\dots ,x_n\) with order statistics \(x_{(1)}\le \cdots \le x_{(n)}\), the sample median is

\[ \mathrm {median}( x) = \begin {cases} x_{(\frac {n+1}2)}, & n\text { odd},\\[6pt] \dfrac {x_{(\frac n2)} + x_{(\frac n2 + 1)}}2, & n\text { even}. \end {cases} \]
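The case split above maps directly to code. A minimal sketch (the helper name `sample_median` is ours), checked against NumPy's built-in:

```python
import numpy as np

def sample_median(x):
    """Sample median via order statistics: the middle value for odd n,
    the average of the two middle values for even n."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]               # x_((n+1)/2), 1-based index
    return (xs[n // 2 - 1] + xs[n // 2]) / 2      # mean of x_(n/2), x_(n/2+1)

x = [7.0, 1.0, 5.0, 3.0]
print(sample_median(x))   # 4.0
print(np.median(x))       # 4.0
```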

1.1.3 Mode

For a finite sample \(x_1,\dots ,x_n\), the sample mode is any value(s) that occur(s) most often among the observations.
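A minimal sketch of this definition using `collections.Counter` (the helper name `sample_modes` is ours); it returns all tied maxima, since the mode need not be unique:

```python
from collections import Counter

def sample_modes(x):
    """All value(s) occurring most often among the observations."""
    counts = Counter(x)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(sample_modes([1, 2, 2, 3, 3, 4]))  # [2, 3] -- a tie: two modes
print(sample_modes([5, 5, 7]))           # [5]
```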

Multimodality: A distribution is called multimodal if it has more than one local maximum (mode) in its probability density function or histogram. Common cases include:

  • Unimodal: a single peak (e.g., normal distribution).

  • Bimodal: two distinct peaks, often indicating that the data is a mixture of two subpopulations.

  • Multimodal: three or more peaks.

Multimodality suggests that the data may come from a mixture of different underlying processes or groups.

1.1.4 Variance

Sample variance is given by

\begin{align} s_{n-1}^2 =\frac 1{n-1}\sum _{i=1}^n\bigl (x_i-\bar {x}\bigr )^2 \end{align} This formula is unbiased.

The more intuitive (biased) formula is

\begin{equation} s_{n}^2 =\frac 1n\sum _{i=1}^n\bigl (x_i-\bar {x}\bigr )^2 \end{equation}

Standard Deviation The sample standard deviation (std) is simply the square root of the (unbiased or biased) sample variance, \(s = \sqrt {s^2}\).

Table 1.1: The difference between variance and std.

| Aspect | Variance (\(s^2\)) | Std. Dev. (\(s\)) |
| Units | (original unit)\(^2\) | original unit |
| Interpretation | “Mean squared deviation” | “Average deviation from the mean” |
| Ease of communication | Abstract (squared units) | Concrete (\(\pm 5\) kg) |
1.1.5 Repeated Experiments Characterization

In \(k\) repeated experiments, \(n\) samples are drawn from the random variable \(X\) for each experiment, \(\bx _1,\ldots ,\bx _k\).

Mean

Across repeated experiments, the average of the sample means is itself approximately \(\mu \); on average, the sample mean recovers the true mean:

\begin{equation} \lim _{k\rightarrow \infty }\frac {1}{k}\sum _{j=1}^{k}\bar {\bx }_j \rightarrow \mu \end{equation}

This principle is illustrated in Fig. 1.1.

(image)

Figure 1.1: Visualization of convergence of \(k\) repeated experiments to the true mean.
Variance

In the case of the biased variance, the mean of the values \(s_{n,1}^2,\ldots ,s_{n,k}^2\) converges to \(\dfrac {n-1}{n}\sigma ^2\) and systematically underestimates the true population variance \(\sigma ^2\),

\begin{equation} \lim _{k\rightarrow \infty }\frac {1}{k}\sum _{j=1}^{k}s_{n,j}^2 \rightarrow \dfrac {n-1}{n}\sigma ^2 \end{equation}

This inherent difference is termed bias. However, for the unbiased variance,

\begin{equation} \lim _{k\rightarrow \infty }\frac {1}{k}\sum _{j=1}^{k}s_{n-1,j}^2 \rightarrow \sigma ^2 \end{equation}

This principle is illustrated in Fig. 1.2.

Note that the difference between \(s_n^2\) and \(s_{n-1}^2\) becomes smaller as \(n\) grows.

(image)

Figure 1.2: Visualization of convergence of \(k\) repeated experiments to the biased and unbiased variances.
MSE

The biased estimator remains useful when the empirical mean-squared error (MSE) is of interest. The MSE is defined by

\begin{align} \mathrm {MSE}_{n-1} &= \frac {1}{k}\sum _{j=1}^k \left (s_{n-1,j}^2 - \sigma ^2\right )^2 \rightarrow \frac {2}{n-1}\,\sigma ^4\\ \mathrm {MSE}_{n} &= \frac {1}{k}\sum _{j=1}^k \left (s_{n,j}^2 - \sigma ^2\right )^2 \rightarrow \frac {2n-1}{n^2}\,\sigma ^4 \end{align} and

\begin{equation} \mathrm {MSE}_{n-1} > \mathrm {MSE}_{n} \end{equation}

Note that the MSE of the biased formula is slightly lower than that of the unbiased one.

  • Example 1.1: Sample mean is unbiased and its variance decays as \(\sigma ^2/n\). The usual sample-variance estimators can be biased or unbiased. We illustrate all three properties by simulation:

    • 1. Parameters.

      • True distribution: \(X\sim \mathcal {N}(0,1)\), so \(\mu =0\), \(\sigma ^2=1\).

      • Sample sizes: \(n\in \{5,\,20,\,100\}\).

      • Number of replicates: \(k=5000\).

    • 2. Results. Results in Table 1.2 agree closely with the theoretical values.

    Table 1.2: Simulation results: sample mean, variance estimates, and their MSEs across different sample sizes.
    | \(n\) | \(\widehat \mu _n\) | Empirical \(\Var [\bar x]\) | Theoretical \(\Var [\bar x]=1/n\) | \(\widehat \sigma ^2_{\mathrm {biased}}\) | \(\widehat \sigma ^2_{\mathrm {unbiased}}\) | \(\widehat {\mathrm {MSE}}_{\mathrm {biased}}\) | \(\widehat {\mathrm {MSE}}_{\mathrm {unbiased}}\) |
    | 5 | \(0.001\) | \(0.198\) | \(0.200\) | \(0.79\) | \(0.99\) | \(0.36\) | \(0.50\) |
    | 20 | \(-0.000\) | \(0.049\) | \(0.050\) | \(0.95\) | \(1.00\) | \(0.10\) | \(0.11\) |
    | 100 | \(0.000\) | \(0.010\) | \(0.010\) | \(0.99\) | \(1.01\) | \(0.02\) | \(0.02\) |
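Up to Monte-Carlo noise, numbers of this kind can be reproduced with a short simulation; the seed and variable names below are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, k = 0.0, 1.0, 5000            # true parameters and replicate count

for n in (5, 20, 100):
    X = rng.normal(mu, np.sqrt(sigma2), size=(k, n))  # k experiments, n samples each
    means = X.mean(axis=1)
    s2_b = X.var(axis=1, ddof=0)          # biased: divide by n
    s2_u = X.var(axis=1, ddof=1)          # unbiased: divide by n-1
    print(n,
          round(means.mean(), 3),         # ~ mu
          round(means.var(), 3),          # ~ sigma^2 / n
          round(s2_b.mean(), 2),          # ~ (n-1)/n * sigma^2
          round(s2_u.mean(), 2),          # ~ sigma^2
          round(((s2_b - sigma2) ** 2).mean(), 2),   # ~ (2n-1)/n^2 * sigma^4
          round(((s2_u - sigma2) ** 2).mean(), 2))   # ~ 2/(n-1) * sigma^4
```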

Typical applications of the biased and unbiased estimators:

  • Biased: common in ML tasks (e.g., loss functions); it is also the maximum-likelihood estimate for a Gaussian distribution.

  • Unbiased: standard in classical statistics.

The default in code implementations varies:

  • Python uses the biased expression by default, e.g. numpy.std (equivalent to ddof=0).

  • MATLAB uses the unbiased expression by default, e.g. std.
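The difference is just the `ddof` (delta degrees of freedom) argument; a quick check with hand-picked data:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# NumPy defaults to the biased estimator (ddof=0, divide by n);
# pass ddof=1 for the unbiased one (divide by n-1, MATLAB's default).
print(np.var(x))                      # 4.0  (biased)
print(np.var(x, ddof=1))              # 32/7 ~ 4.571  (unbiased)
print(np.var(x, ddof=1) / np.var(x))  # ratio n/(n-1) = 8/7
```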

1.1.6 Skewness

Skewness: Skewness measures the asymmetry of a distribution about its mean. The sample skewness is defined as

\[ g_1 = \frac {1}{n}\sum _{i=1}^{n}\!\left (\frac {x_i - \bar {x}}{s_n}\right )^{\!3}. \]

  • \(g_1 = 0\): the distribution is symmetric (e.g., normal distribution).

  • \(g_1 > 0\): the distribution is right-skewed (positively skewed) — the right tail is longer, and the mean is typically greater than the median.

  • \(g_1 < 0\): the distribution is left-skewed (negatively skewed) — the left tail is longer, and the mean is typically less than the median.
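A numerical sketch of \(g_1\) using the biased \(s_n\) (NumPy's default `std`); the distributions and seed are our choices:

```python
import numpy as np

def sample_skewness(x):
    """g1: mean cubed standardized deviation, with the biased std s_n."""
    x = np.asarray(x, dtype=float)
    return np.mean(((x - x.mean()) / x.std()) ** 3)   # x.std() divides by n

rng = np.random.default_rng(0)
print(sample_skewness(rng.exponential(size=10000)))  # positive: right-skewed
print(sample_skewness(rng.normal(size=10000)))       # near 0: symmetric
```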

An illustration of the relationship between mean, median, and mode under different skewness is shown in Fig. 1.3.

(image)

(a) Positive skew
   

(image)

(b) Symmetrical
   

(image)

(c) Negative skew
Figure 1.3: Relationship between mean, median, and mode for (a) positively skewed, (b) symmetrical, and (c) negatively skewed distributions.

1.2 Histogram

  • Goal: Visualization of the experimental data.

1.2.1 count type
  • Goal: Show how many times each discrete outcome occurs.

Consider an experiment with:

  • \(k\) possible distinct outcomes \(x_{1},x_{2},\ldots ,x_{k}\), where \(k\) is relatively small.

  • A total of \(N\) trials.

  • Recorded results:

    • \(n_{1}\) occurrences of \(x_{1}\),

    • \(n_{2}\) occurrences of \(x_{2}\),

    • \(\ldots \) and so on,

    with \(\sum _{i} n_{i}=N\).

The highest bar in a count histogram identifies the mode of the sample. A graphical representation of the outcomes is shown in Fig. 1.4(a).

(image)

(a) count
   

(image)

(b) probability
Figure 1.4: Example of histograms: (a) count, (b) probability.
1.2.2 probability type
  • Goal: Show the proportion of each discrete outcome, providing an empirical estimate of the PDF.

Approximation to the PDF: The probability of a particular outcome is approximated by the ratio of its count to the total number of trials,

\begin{equation} \label {eq:rand1:numeric_PDF_discr} p_X[x_i]\approx \frac {n_i}{N}, \qquad i = 1,\ldots ,k. \end{equation}

Naturally, the approximation improves as \(N \to \infty \).

A graphical example of this histogram type is shown in Fig. 1.4(b).
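The approximation above can be sketched for a simulated fair die (seed and trial count are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10000)        # N trials of a fair six-sided die

values, counts = np.unique(rolls, return_counts=True)
p_hat = counts / counts.sum()                 # n_i / N, the probability histogram
print(dict(zip(values.tolist(), p_hat.round(3).tolist())))  # each entry ~ 1/6
```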

1.2.3 Large number of outcomes

When the number of possible outcomes, \(k\), is large (on the order of hundreds or more, or when the values are continuous), two main difficulties arise:

  • Presenting the results in a compact, readable form.

  • Some outcome categories contain very few observations because their probabilities are small.

A practical way to display the data is:

  • 1. Record the extreme values, \(x_{\max }\) and \(x_{\min }\).

  • 2. Partition the interval \(\bigl [x_{\min },x_{\max }\bigr ]\) into \(k\) equal-width bins of size \(\Delta x\).

  • 3. Mark each bin by its midpoint

    \begin{equation} \label {eq:rand1:mid_point} \tilde {x}_{i}=x_{\min }+\Bigl (i-\frac 12\Bigr )\,\Delta x, \qquad i=1,\dots ,k, \end{equation}

  • 4. Let \(n_{1}\) be the count in the first bin, \(n_{2}\) the count in the second, and so on.

  • 5. Use the pairs \(\left (\tilde {x}_{i},n_{i}\right )\) for count- or probability-type histograms.
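The five steps above can be sketched as follows (the helper name `binned_counts` is ours):

```python
import numpy as np

def binned_counts(x, k):
    """Steps 1-5: equal-width bins, midpoints, and per-bin counts."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()                    # step 1: extremes
    dx = (x_max - x_min) / k                           # step 2: bin width
    mids = x_min + (np.arange(1, k + 1) - 0.5) * dx    # step 3: midpoints
    edges = x_min + np.arange(k + 1) * dx
    counts, _ = np.histogram(x, bins=edges)            # step 4: counts n_i
    return mids, counts                                # step 5: pairs (mid_i, n_i)

mids, counts = binned_counts([1, 2, 2, 3, 7, 8, 8, 9], k=2)
print(mids)    # [3. 7.]
print(counts)  # [4 4]
```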

An example of a count histogram for a large data set is shown in Fig. 1.5.

(image)

Figure 1.5: A count histogram for a large number of experimental outcomes.

In this representation, the median is the value on the x-axis that divides the total area of the histogram into two equal parts. The variance \(s^2\) and standard deviation \(s\) are reflected in the width or spread of the histogram: a broad histogram indicates high variability, while a narrow, sharp peak indicates that the data is tightly clustered around the mean.


1.2.4 PDF Approximation of Continuous Random Variable (*)
  • Goal: Approximate the PDF of a continuous random variable from experimental data.

In addition to the binning method described above, there is a third histogram type that directly approximates the PDF of a continuous random variable.

Approximation to the PDF of a continuous random variable via histogram:

\begin{equation} \label {eq:stat:hist_pdf} f_X(x_i) \approx \frac {n_i}{N}\cdot \frac {1}{\Delta x}, \quad i = 1,\ldots ,k \end{equation}

Note the normalization factor \(1/\Delta x\), which distinguishes this from the discrete case in (1.10).

A graphical example of this histogram type is shown in Fig. 1.6.

(image)

Figure 1.6: A pdf histogram for a large number of experimental outcomes.
Derivation

Based on the principle

\begin{equation} \Pr (a<X\le b) = \int _{a}^{b}f_X(x)\,dx, \end{equation}

the probability of falling within a bin of width \(\Delta x\) centered at \(x_0\) can be approximated as

\begin{align} \Pr \!\Bigl (x_0-\tfrac {\Delta x}{2} < X \le x_0 +\tfrac {\Delta x}{2}\Bigr ) &= \int _{x_0-\Delta x/2}^{\,x_0+\Delta x/2}f_X(x)\,dx \approx f_X(x_0)\,\Delta x. \end{align} An illustration of this principle is shown in Fig. 1.7.

(image)

Figure 1.7: Illustration of the histogram principle: the shaded area \(f_X(x_0)\,\Delta x\) approximates the probability of falling within the bin.

Rearranging,

\begin{equation} \label {eq:stat:histogram_deriv} f_X(x_0) \approx \Pr \!\Bigl (x_0-\tfrac {\Delta x}{2} < X \le x_0 +\tfrac {\Delta x}{2}\Bigr )\cdot \frac {1}{\Delta x}. \end{equation}

The probability of falling in bin \(i\) (centered at \(\tilde {x}_i\)) is approximated by the relative frequency,

\begin{equation} \label {eq:stat:hist_prob} \Pr \!\Bigl (\tilde {x}_i-\tfrac {\Delta x}{2} < X \le \tilde {x}_i +\tfrac {\Delta x}{2}\Bigr ) \approx \frac {n_i}{N}, \quad i = 1,\ldots ,k. \end{equation}

Substituting (1.16) into (1.15) yields the PDF approximation formula in (1.12).

  • Example 1.2: Given the experimental outcomes

    \[ \left [16,\,98,\,96,\,49,\,81,\,14,\,43,\,92,\,80,\,96\right ], \]

    display the PDF using a histogram with \(k=3\) bins.

    • Solution:

      \begin{align*} N &= 10\\ x_{\min } &= 14,\quad x_{\max } = 98\\ x_{\max } &= x_{\min } + k\,\Delta x = 98\quad \Rightarrow \Delta x = \frac {x_{\max }-x_{\min }}{k} = \frac {84}{3} = 28 \end{align*} Counts per bin:

      \begin{align*} n_1 &= 2, &\leftarrow \lbrace 14,16\rbrace &\in [x_{\min },\;x_{\min } + \Delta x] &&= [14,42]\\ n_2 &= 2, &\leftarrow \lbrace 43, 49\rbrace &\in (x_{\min } + \Delta x,\;x_{\min } + 2\Delta x] &&= (42,70]\\ n_3 &= 6, &\leftarrow \lbrace 80,81,92,96,96,98\rbrace &\in (x_{\max } - \Delta x,\;x_{\max }] &&= (70,98] \end{align*} where \(n_3 = 10 - n_1 - n_2\). Bin midpoints:

      \begin{align*} \tilde {x}_1 &= x_{\min } + \tfrac {\Delta x}{2} = 28\\ \tilde {x}_2 &= x_{\min } + \Delta x + \tfrac {\Delta x}{2} = 56\\ \tilde {x}_3 &= x_{\max } - \tfrac {\Delta x}{2} = 84 \end{align*} The PDF approximation is therefore

      \[ f_X(\tilde {x}_i) \approx \frac {n_i}{N\cdot \Delta x} = n_i \cdot \frac {1}{10\cdot 28}. \]
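This example can be cross-checked with `numpy.histogram`, whose `density=True` option computes exactly \(n_i/(N\,\Delta x)\) for equal-width bins:

```python
import numpy as np

x = np.array([16, 98, 96, 49, 81, 14, 43, 92, 80, 96], dtype=float)
N, k = len(x), 3

counts, edges = np.histogram(x, bins=k)               # raw counts n_i
f_hat, _ = np.histogram(x, bins=k, density=True)      # n_i / (N * dx)
dx = edges[1] - edges[0]

print(dx)        # 28.0
print(counts)    # [2 2 6]
print(np.allclose(f_hat, counts / (N * dx)))  # True: matches the formula
```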

Summary

Three ways to display experimental results as a histogram:

  • count: plot \(n_i\) — the raw bin counts.

  • probability: plot \(n_i/N\) — the relative frequency per bin.

  • pdf (or density): plot \(\dfrac {n_i}{N\,\Delta x}\) — the estimated probability density.

1.3 Boxplot

  • Goal: Compact visualization of the distribution’s location, spread, and potential outliers.

1.3.1 Quartiles

Quartiles: Given the order statistics \(x_{(1)}\le \cdots \le x_{(n)}\), the three quartiles divide the sorted data into four equal parts:

  • \(Q_1\) (first quartile, 25th percentile) — the median of the lower half of the data.

  • \(Q_2\) (second quartile, 50th percentile) — the median of the entire data set.

  • \(Q_3\) (third quartile, 75th percentile) — the median of the upper half of the data.

Equivalently, approximately 25% of the observations fall below \(Q_1\), 50% below \(Q_2\), and 75% below \(Q_3\).

1.3.2 Five-Number Summary

A boxplot (box-and-whisker plot) summarizes a data set using five statistics:

  • 1. Minimum non-outlier value.

  • 2. First quartile (\(Q_1\)).

  • 3. Median (\(Q_2\)).

  • 4. Third quartile (\(Q_3\)).

  • 5. Maximum non-outlier value.

The interquartile range (IQR) measures the spread of the central 50% of the data,

\begin{equation} \mathrm {IQR} = Q_3 - Q_1. \end{equation}

Outlier: An outlier is any observation that falls outside the interval

\[ \bigl [Q_1 - 1.5\,\mathrm {IQR},\;\; Q_3 + 1.5\,\mathrm {IQR}\bigr ]. \]

Such points are considered unusually far from the bulk of the data and may indicate measurement errors, rare events, or heavy-tailed behavior.
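A sketch of the quartiles, IQR, and outlier fences (note that NumPy's default percentile interpolation may differ slightly from the median-of-halves rule in Sec. 1.3.1; the data are our own):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # outlier fences

outliers = x[(x < lo) | (x > hi)]
whisk_lo = x[x >= lo].min()                # whisker ends: extreme non-outliers
whisk_hi = x[x <= hi].max()

print(q1, q2, q3, iqr)       # 3.25 5.5 7.75 4.5
print(whisk_lo, whisk_hi)    # 1.0 9.0
print(outliers)              # [50.]
```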

1.3.3 Construction

The boxplot is drawn as follows:

  • A box spans from \(Q_1\) to \(Q_3\), with a line at the median \(Q_2\).

  • Whiskers extend from the box to the most extreme data points that lie within

    \[ \bigl [Q_1 - 1.5\,\mathrm {IQR},\;\; Q_3 + 1.5\,\mathrm {IQR}\bigr ]. \]

  • Any observation outside the whisker range is marked individually as a potential outlier.

An illustration of a boxplot with its components is shown in Fig. 1.8.

(image)

Figure 1.8: Anatomy of a boxplot: the box spans the IQR from \(Q_1\) to \(Q_3\), the median \(Q_2\) is marked inside, whiskers extend to the most extreme non-outlier observations, and individual outliers are shown beyond.
1.3.4 Interpretation
  • The box width (IQR) reflects variability — a wide box indicates high spread.

  • The median line position within the box reveals skewness: if the median is closer to \(Q_1\), the distribution is right-skewed; if closer to \(Q_3\), left-skewed.

  • Whisker lengths indicate the range of typical observations.

  • Individual points beyond the whiskers flag potential outliers that may warrant further investigation.

Unlike histograms, boxplots do not show the shape of the distribution (e.g., multimodality). They are most useful for comparing distributions across groups side by side.

1.4 Violin Plot

  • Goal: Visualize both the summary statistics (boxplot) and the shape of the data distribution (histogram).

A violin plot extends the boxplot by adding a (smoothed\(^{1}\)) histogram of the data on each side. A median line is typically drawn inside.

(image)

Figure 1.9: Comparison of a boxplot and a violin plot for the same bimodal data: the boxplot hides the two-group structure, while the violin plot reveals it through its shape.

Fig. 1.9 compares the two representations for the same bimodal data set: the boxplot gives no indication of two groups, whereas the violin plot clearly shows two bulges.

1 This smoothing is termed kernel density estimation (KDE). It is a special case of kernel smoothing in Sec. 6.5. The discussion of this technique is beyond the scope of this chapter.