Machine Learning & Signals Learning
Part I Machine Learning
1 Descriptive Statistics Basics
1.1 Basic Characterization
Preliminaries
We assume a uni-variate random experiment that is described by a real-valued random variable \(X\) with
\(\seteqnumber{0}{}{0}\)\begin{align*} \E [X] &=\mu \\ \Var [X] &= \sigma ^2. \end{align*}
Let
\[\bx =\{x_1,\dots ,x_n\}\]
be \(n\) observations of \(X\), where \(n\) may be fixed in advance or chosen arbitrarily.
1.1.1 Mean
Given observations \(x_1,\dots ,x_n\), the sample mean \(\bar {\bx }\) is given by
\(\seteqnumber{0}{}{0}\)\begin{equation} \bar \bx = \frac 1n\sum _{i=1}^n x_i \end{equation}
Properties:
-
• As you gather more observations (higher \(n\)), \(\bar \bx \) tends to stabilize: it fluctuates less, and its empirical distribution concentrates around the true center \(\mu \). Moreover, for large \(n\), the variability of \(\bar \bx \) scales like \(\sigma /\sqrt {n}\) (or \(\Var [\bar x]\approx \dfrac {\sigma ^2}{n}\)), meaning the more data we collect, the tighter our estimate of the process’s center becomes.
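This \(\sigma/\sqrt{n}\) scaling can be checked by simulation. The sketch below is a minimal pure-Python illustration; the standard normal distribution for \(X\) and the replicate count \(k=2000\) are illustrative assumptions, not part of the text:

```python
import random
import statistics

random.seed(0)
mu, sigma = 0.0, 1.0   # assumed true parameters (illustrative)
k = 2000               # replicates per sample size (illustrative)

def var_of_sample_mean(n):
    """Empirical variance of the sample mean over k replicates of size n."""
    means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(k)]
    return statistics.pvariance(means)

results = {n: var_of_sample_mean(n) for n in (5, 20, 100)}
for n, v in results.items():
    print(f"n={n:4d}  empirical Var[mean]={v:.4f}  theory={sigma**2 / n:.4f}")
```

The empirical variance of \(\bar{\bx}\) tracks \(\sigma^2/n\) closely and shrinks as \(n\) grows.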
1.1.2 Median
Median is a value or quantity lying at the midpoint of a frequency distribution of observed values or quantities, such that there is an equal probability of falling above or below it.
Equivalently, for a sample \(x_1,\dots ,x_n\) with order statistics \(x_{(1)}\le \cdots \le x_{(n)}\), the sample median is
\[ \mathrm {median}( x) = \begin {cases} x_{(\frac {n+1}2)}, & n\text { odd},\\[6pt] \dfrac {x_{(\frac n2)} + x_{(\frac n2 + 1)}}2, & n\text { even}. \end {cases} \]
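The odd/even case split translates directly to code; a minimal sketch:

```python
def sample_median(xs):
    """Median via order statistics: the middle value for odd n,
    the average of the two middle values for even n."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(sample_median([3, 1, 2]))     # -> 2
print(sample_median([4, 1, 3, 2]))  # -> 2.5
```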
1.1.3 Mode
For a finite sample \(x_1,\dots ,x_n\), the sample mode is any value(s) that occur(s) most often among the observations.
Multimodality: A distribution is called multimodal if it has more than one local maximum (mode) in its probability density function or histogram. Common cases include:
-
• Unimodal: a single peak (e.g., normal distribution).
-
• Bimodal: two distinct peaks, often indicating that the data is a mixture of two subpopulations.
-
• Multimodal: three or more peaks.
Multimodality suggests that the data may come from a mixture of different underlying processes or groups.
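For a finite sample, the mode(s) can be found by counting occurrences; the sketch below returns all values attaining the maximal count, so a bimodal sample yields two values:

```python
from collections import Counter

def sample_modes(xs):
    """All values attaining the maximal count (handles multimodal samples)."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(sample_modes([1, 2, 2, 3, 3, 4]))  # bimodal -> [2, 3]
print(sample_modes([5, 5, 1]))           # unimodal -> [5]
```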
1.1.4 Variance
Sample variance is given by
\(\seteqnumber{0}{}{1}\)\begin{align} s_{n-1}^2 =\frac 1{n-1}\sum _{i=1}^n\bigl (x_i-\bar {x}\bigr )^2 \end{align} This formula is unbiased.
The more intuitive (biased) formula is
\(\seteqnumber{0}{}{2}\)\begin{equation} s_{n}^2 =\frac 1n\sum _{i=1}^n\bigl (x_i-\bar {x}\bigr )^2 \end{equation}
Standard Deviation The sample standard deviation (std) is the square root of the (biased or unbiased) sample variance, \(s = \sqrt {s^2}\).
| Aspect | Variance (\(s^2\)) | Std. Dev. (\(s\)) |
| Units | (original unit)\(^2\) | original unit |
| Interpretation | “Mean squared deviation” | “Average deviation from the mean” |
| Ease of communication | Abstract (squared units) | Concrete (\(\pm 5\) kg) |
1.1.5 Repeated Experiments Characterization
In \(k\) repeated experiments, \(n\) samples are drawn from the random variable \(X\) for each experiment, \(\bx _1,\ldots ,\bx _k\).
Mean
Across repeated experiments, the average of the sample means is itself approximately \(\mu \); on average, the sample mean recovers the true mean,
\(\seteqnumber{0}{}{3}\)\begin{equation} \lim _{k\rightarrow \infty }\frac {1}{k}\sum _{j=1}^{k}\bar {\bx }_j = \mu \end{equation}
This principle is illustrated in Fig. 1.1.
Variance
For the biased variance, the mean of the values \(s_{n_1}^2,\ldots ,s_{n_k}^2\) converges to \(\dfrac {n-1}{n}\sigma ^2\) and thus systematically underestimates the true population variance \(\sigma ^2\),
\(\seteqnumber{0}{}{4}\)\begin{equation} \lim _{k\rightarrow \infty }\frac {1}{k}\sum _{j=1}^{k}s_{n_j}^2 = \dfrac {n-1}{n}\sigma ^2 \end{equation}
This inherent difference is termed bias. For the unbiased variance, however,
\(\seteqnumber{0}{}{5}\)\begin{equation} \lim _{k\rightarrow \infty }\frac {1}{k}\sum _{j=1}^{k}s_{{n-1}_j}^2 = \sigma ^2 \end{equation}
This principle is illustrated in Fig. 1.2.
Note that the difference between \(s_n^2\) and \(s_{n-1}^2\) becomes smaller as \(n\) grows.
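Both convergences can be verified by simulation; the sketch below assumes \(X\sim \mathcal{N}(0,1)\) and illustrative values of \(n\) and \(k\). Python's `statistics` module provides both estimators directly:

```python
import random
import statistics

random.seed(1)
n, k = 5, 20000  # illustrative sample size and replicate count
sigma2 = 1.0     # assumed true variance

biased, unbiased = [], []
for _ in range(k):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    biased.append(statistics.pvariance(xs))   # divides by n   (s_n^2)
    unbiased.append(statistics.variance(xs))  # divides by n-1 (s_{n-1}^2)

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)
print(mean_biased, (n - 1) / n * sigma2)   # ~0.8 vs 0.8
print(mean_unbiased, sigma2)               # ~1.0 vs 1.0
```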
MSE
The biased estimator remains useful when the empirical mean-squared error (MSE) is of interest. The MSE is defined by
\(\seteqnumber{0}{}{6}\)\begin{align} \mathrm {MSE}_{n-1} &= \frac {1}{k}\sum _{j=1}^k \left (s_{{n-1}_j}^2 - \sigma ^2\right )^2 \rightarrow \frac {2}{n-1}\,\sigma ^4\\ \mathrm {MSE}_{n} &= \frac {1}{k}\sum _{j=1}^k \left (s_{n_j}^2 - \sigma ^2\right )^2 \rightarrow \frac {2n-1}{n^2}\,\sigma ^4 \end{align} and
\(\seteqnumber{0}{}{8}\)\begin{equation} \mathrm {MSE}_{n-1} > \mathrm {MSE}_{n} \end{equation}
Note that the MSE of the biased formula is slightly lower than that of the unbiased one.
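The two limiting expressions can be tabulated to confirm the inequality for a range of \(n\); a minimal sketch:

```python
def mse_unbiased(n, sigma4=1.0):
    """Limiting MSE of the unbiased estimator: 2/(n-1) * sigma^4."""
    return 2.0 / (n - 1) * sigma4

def mse_biased(n, sigma4=1.0):
    """Limiting MSE of the biased estimator: (2n-1)/n^2 * sigma^4."""
    return (2.0 * n - 1.0) / n**2 * sigma4

for n in (5, 20, 100):
    print(n, mse_biased(n), mse_unbiased(n))  # biased is always smaller
```

For \(n=5\) this gives \(9/25 = 0.36\) versus \(0.5\), matching the simulated values in Table 1.2.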
-
Example 1.1: The sample mean is unbiased and its variance decays as \(\sigma ^2/n\). The usual sample-variance estimators can be biased or unbiased. We illustrate all three properties by simulation:
-
1. Parameters.
-
• True distribution: \(X\sim \mathcal {N}(0,1)\), so \(\mu =0\), \(\sigma ^2=1\).
-
• Sample sizes: \(n\in \{5,\,20,\,100\}\).
-
• Number of replicates: \(k=5000\).
-
-
2. Results. The values in Table 1.2 agree closely with the theoretical predictions.
Table 1.2: Simulation results: sample mean, variance estimates, and their MSEs across different sample sizes.
| \(n\) | \(\widehat \mu _n\) | Empirical \(\Var [\bar x]\) | Theoretical \(\Var [\bar x]=1/n\) | \(\widehat \sigma ^2_{\mathrm {biased}}\) | \(\widehat \sigma ^2_{\mathrm {unbiased}}\) | \(\widehat {\mathrm {MSE}}_{\mathrm {biased}}\) | \(\widehat {\mathrm {MSE}}_{\mathrm {unbiased}}\) |
| 5 | \(0.001\) | \(0.198\) | \(0.200\) | \(0.79\) | \(0.99\) | \(0.36\) | \(0.50\) |
| 20 | \(-0.000\) | \(0.049\) | \(0.050\) | \(0.95\) | \(1.00\) | \(0.10\) | \(0.11\) |
| 100 | \(0.000\) | \(0.010\) | \(0.010\) | \(0.99\) | \(1.01\) | \(0.02\) | \(0.02\) |
Typical applications of the biased and unbiased estimators:
-
• Biased: common in ML tasks (e.g., inside loss functions); it is the maximum-likelihood estimate of \(\sigma ^2\) for a Gaussian distribution.
-
• Unbiased: classical statistical inference.
Note that the default convention (biased vs. unbiased) varies across code libraries.
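For instance, NumPy's `np.var` defaults to the biased form (`ddof=0`), while pandas' `Series.var` defaults to the unbiased form (`ddof=1`). Python's standard `statistics` module avoids the ambiguity by naming both explicitly:

```python
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative sample, mean = 5

biased = statistics.pvariance(xs)   # divides by n   ("population" variance, s_n^2)
unbiased = statistics.variance(xs)  # divides by n-1 ("sample" variance, s_{n-1}^2)
print(biased, unbiased)  # 4.0 and 32/7 ~ 4.571
```

When comparing results across libraries, always check which `ddof` convention is in effect.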
1.1.6 Skewness
Skewness: A measure of the asymmetry of a distribution about its mean. The sample skewness is defined as
\[ g_1 = \frac {1}{n}\sum _{i=1}^{n}\!\left (\frac {x_i - \bar {x}}{s_n}\right )^{\!3}. \]
-
• \(g_1 = 0\): the distribution is symmetric (e.g., normal distribution).
-
• \(g_1 > 0\): the distribution is right-skewed (positively skewed) — the right tail is longer, and the mean is typically greater than the median.
-
• \(g_1 < 0\): the distribution is left-skewed (negatively skewed) — the left tail is longer, and the mean is typically less than the median.
An illustration of the relationship between mean, median, and mode under different skewness is shown in Fig. 1.3.
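The formula for \(g_1\) translates directly to code; a minimal sketch using the biased standard deviation \(s_n\), with illustrative data:

```python
import statistics

def sample_skewness(xs):
    """g1 = (1/n) * sum(((x - mean)/s_n)**3), using the biased std s_n."""
    n = len(xs)
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)  # biased std (divides by n)
    return sum(((x - m) / s) ** 3 for x in xs) / n

print(sample_skewness([1, 2, 3, 4, 5]))      # symmetric -> ~0
print(sample_skewness([1, 1, 2, 2, 3, 10]))  # long right tail -> positive
```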
1.2 Histogram
1.2.1 Count Type
Consider an experiment with:
-
• \(k\) possible distinct outcomes \(x_{1},x_{2},\ldots ,x_{k}\), where \(k\) is relatively small.
-
• A total of \(N\) trials.
-
• Recorded results:
-
– \(n_{1}\) occurrences of \(x_{1}\),
-
– \(n_{2}\) occurrences of \(x_{2}\),
-
– \(\ldots \) and so on,
with \(\sum _{i} n_{i}=N\).
-
The highest bar in a count histogram identifies the mode of the sample. A graphical representation of the outcomes is shown in Fig. 1.4(a).
1.2.2 Probability Type
Approximation to the PDF: The probability of a particular outcome is approximated by the ratio of its count to the total number of trials,
\(\seteqnumber{0}{}{9}\)\begin{equation} \label {eq:rand1:numeric_PDF_discr} p_X[x_i]\approx \frac {n_i}{N}, \qquad i = 1,\ldots ,k. \end{equation}
Naturally, the approximation improves as \(N \to \infty \).
A graphical example of this histogram type is shown in Fig. 1.4(b).
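The relative-frequency approximation \(n_i/N\) is a one-liner over the counts; the sketch below uses hypothetical die-roll data:

```python
from collections import Counter

outcomes = [1, 2, 2, 3, 3, 3, 6, 6, 6, 6]  # N = 10 trials (hypothetical)
N = len(outcomes)
counts = Counter(outcomes)

# p_X[x_i] ~ n_i / N for each distinct outcome x_i
p_hat = {x: c / N for x, c in sorted(counts.items())}
print(p_hat)  # -> {1: 0.1, 2: 0.2, 3: 0.3, 6: 0.4}
```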
1.2.3 Large Number of Outcomes
When the number of possible outcomes, \(k\), is large (hundreds or more), or when the outcomes are continuous, two main difficulties arise:
-
• Presenting the results in a compact, readable form.
-
• Some outcome categories contain very few observations because their probabilities are small.
A practical way to display the data is:
-
1. Record the extreme values, \(x_{\max }\) and \(x_{\min }\).
-
2. Partition the interval \(\bigl [x_{\min },x_{\max }\bigr ]\) into \(k\) equal-width bins of size \(\Delta x\).
-
3. Mark each bin by its midpoint
\(\seteqnumber{0}{}{10}\)\begin{equation} \label {eq:rand1:mid_point} \tilde {x}_{i}=x_{\min }+\Bigl (i-\frac 12\Bigr )\,\Delta x, \qquad i=1,\dots ,k, \end{equation}
-
4. Let \(n_{1}\) be the count in the first bin, \(n_{2}\) the count in the second, and so on.
-
5. Use the pairs \(\left (\tilde {x}_{i},n_{i}\right )\) for count- or probability-type histograms.
An example of a count histogram for a large data set is shown in Fig. 1.5.
In this representation, the median is the value on the x-axis that divides the total area of the histogram into two equal parts. The variance \(s^2\) and standard deviation \(s\) are reflected in the width or spread of the histogram: a broad histogram indicates high variability, while a narrow, sharp peak indicates that the data is tightly clustered around the mean.
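The five-step procedure above can be sketched as a small function. The bins here are half-open on the right, with \(x_{\max}\) clamped into the last bin; the data are illustrative:

```python
def bin_histogram(xs, k):
    """Equal-width binning: returns (bin midpoints, bin counts)."""
    x_min, x_max = min(xs), max(xs)
    dx = (x_max - x_min) / k
    counts = [0] * k
    for x in xs:
        i = min(int((x - x_min) / dx), k - 1)  # clamp x_max into the last bin
        counts[i] += 1
    mids = [x_min + (i + 0.5) * dx for i in range(k)]  # midpoint formula (1.11)
    return mids, counts

mids, counts = bin_histogram([2, 3, 5, 7, 11, 13, 17, 19, 23, 29], k=3)
print(mids, counts)  # -> [6.5, 15.5, 24.5] [4, 4, 2]
```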
1.2.4 PDF Approximation of Continuous Random Variable (*)
In addition to the binning method described above, there is a third histogram type that directly approximates the PDF of a continuous random variable.
Approximation to the PDF of a continuous random variable via histogram:
\(\seteqnumber{0}{}{11}\)\begin{equation} \label {eq:stat:hist_pdf} f_X(x_i) \approx \frac {n_i}{N}\cdot \frac {1}{\Delta x}, \quad i = 1,\ldots ,k \end{equation}
Note the normalization factor \(1/\Delta x\), which distinguishes this from the discrete case in (1.10).
A graphical example of this histogram type is shown in Fig. 1.6.
Derivation
Based on the principle
\(\seteqnumber{0}{}{12}\)\begin{equation} \Pr (a<X\le b) = \int _{a}^{b}f_X(x)\,dx, \end{equation}
the probability of falling within a bin of width \(\Delta x\) centered at \(x_0\) can be approximated as
\(\seteqnumber{0}{}{13}\)\begin{align} \Pr \!\Bigl (x_0-\tfrac {\Delta x}{2} < X \le x_0 +\tfrac {\Delta x}{2}\Bigr ) &= \int _{x_0-\Delta x/2}^{x_0+\Delta x/2}f_X(x)\,dx \approx f_X(x_0)\,\Delta x. \end{align} An illustration of this principle is shown in Fig. 1.7.
Rearranging,
\(\seteqnumber{0}{}{14}\)\begin{equation} \label {eq:stat:histogram_deriv} f_X(x_0) \approx \Pr \!\Bigl (x_0-\tfrac {\Delta x}{2} < X \le x_0 +\tfrac {\Delta x}{2}\Bigr )\cdot \frac {1}{\Delta x}. \end{equation}
The probability of falling in bin \(i\) (centered at \(\tilde {x}_i\)) is approximated by the relative frequency,
\(\seteqnumber{0}{}{15}\)\begin{equation} \label {eq:stat:hist_prob} \Pr \!\Bigl (\tilde {x}_i-\tfrac {\Delta x}{2} < X \le \tilde {x}_i +\tfrac {\Delta x}{2}\Bigr ) \approx \frac {n_i}{N}, \quad i = 1,\ldots ,k. \end{equation}
Substituting (1.16) into (1.15) yields the PDF approximation formula in (1.12).
-
Example 1.2: Given the experimental outcomes
\[ \left [16,\,98,\,96,\,49,\,81,\,14,\,43,\,92,\,80,\,96\right ], \]
display the PDF using a histogram with \(k=3\) bins.
-
\(\seteqnumber{0}{}{16}\)
\begin{align*} N &= 10\\ x_{\min } &= 14,\quad x_{\max } = 98\\ x_{\max } &= x_{\min } + k\,\Delta x = 98\quad \Rightarrow \Delta x = \frac {x_{\max }-x_{\min }}{k} = \frac {84}{3} = 28 \end{align*} Counts per bin:
\(\seteqnumber{0}{}{16}\)\begin{align*} n_1 &= 2, &\leftarrow \lbrace 14,16\rbrace &\in [x_{\min },\;x_{\min } + \Delta x] &&= [14,42]\\ n_2 &= 2, &\leftarrow \lbrace 43, 49\rbrace &\in (x_{\min } + \Delta x,\;x_{\min } + 2\Delta x] &&= (42,70]\\ n_3 &= 6, &\leftarrow \lbrace 80,81,92,96,96,98\rbrace &\in (x_{\max } - \Delta x,\;x_{\max }] &&= (70,98] \end{align*} where \(n_3 = 10 - n_1 - n_2\). Bin midpoints:
\(\seteqnumber{0}{}{16}\)\begin{align*} \tilde {x}_1 &= x_{\min } + \tfrac {\Delta x}{2} = 28\\ \tilde {x}_2 &= x_{\min } + \Delta x + \tfrac {\Delta x}{2} = 56\\ \tilde {x}_3 &= x_{\max } - \tfrac {\Delta x}{2} = 84 \end{align*} The PDF approximation is therefore
\[ f_X(\tilde {x}_i) \approx \frac {n_i}{N\cdot \Delta x} = n_i \cdot \frac {1}{10\cdot 28}. \]
-
Summary
Three ways to display experimental results as a histogram:
-
• count: plot \(n_i\) — the raw bin counts.
-
• probability: plot \(n_i/N\) — the relative frequency per bin.
-
• pdf (or density): plot \(\dfrac {n_i}{N\,\Delta x}\) — the estimated probability density.
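All three normalizations can be computed from the same bin counts. The sketch below reuses the data of Example 1.2 (half-open bins; no data point falls on an interior edge here):

```python
data = [16, 98, 96, 49, 81, 14, 43, 92, 80, 96]
N, k = len(data), 3
x_min, x_max = min(data), max(data)
dx = (x_max - x_min) / k               # 28.0
counts = [0] * k
for x in data:
    counts[min(int((x - x_min) / dx), k - 1)] += 1

probability = [n_i / N for n_i in counts]         # relative frequencies
density = [n_i / (N * dx) for n_i in counts]      # f_X estimate, per (1.12)
print(counts)       # count histogram: [2, 2, 6]
print(probability)  # sums to 1
print(density)      # integrates to 1 over the bins
```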
1.3 Boxplot
1.3.1 Quartiles
Quartiles: Given the order statistics \(x_{(1)}\le \cdots \le x_{(n)}\), the three quartiles divide the sorted data into four equal parts:
-
• \(Q_1\) (first quartile, 25th percentile) — the median of the lower half of the data.
-
• \(Q_2\) (second quartile, 50th percentile) — the median of the entire data set.
-
• \(Q_3\) (third quartile, 75th percentile) — the median of the upper half of the data.
Equivalently, approximately 25% of the observations fall below \(Q_1\), 50% below \(Q_2\), and 75% below \(Q_3\).
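Quartile conventions differ slightly across libraries (interpolation between order statistics vs. median-of-halves), so results on small samples may not match exactly. Python's `statistics.quantiles` with `method="inclusive"` interpolates; the data below are illustrative:

```python
import statistics

xs = [1, 3, 4, 7, 8, 10, 12, 15]
q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q2, q3, iqr)  # Q2 always equals the median
```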
1.3.2 Five-Number Summary
A boxplot (box-and-whisker plot) summarizes a data set using five statistics:
-
1. Minimum non-outlier value.
-
2. First quartile (\(Q_1\)).
-
3. Median (\(Q_2\)).
-
4. Third quartile (\(Q_3\)).
-
5. Maximum non-outlier value.
The interquartile range (IQR) measures the spread of the central 50% of the data,
\(\seteqnumber{0}{}{16}\)\begin{equation} \mathrm {IQR} = Q_3 - Q_1. \end{equation}
Outlier: An outlier is any observation that falls outside the interval
\[ \bigl [Q_1 - 1.5\,\mathrm {IQR},\;\; Q_3 + 1.5\,\mathrm {IQR}\bigr ]. \]
Such points are considered unusually far from the bulk of the data and may indicate measurement errors, rare events, or heavy-tailed behavior.
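The \(1.5\,\mathrm{IQR}\) rule is straightforward to implement; a minimal sketch with illustrative data (quartiles via interpolation, which may differ slightly from other conventions):

```python
import statistics

def iqr_outliers(xs):
    """Flag observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lo or x > hi]

print(iqr_outliers([10, 12, 12, 13, 12, 11, 14, 13, 15, 102]))  # -> [102]
print(iqr_outliers([1, 2, 3, 4, 5]))                            # -> []
```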
1.3.3 Construction
The boxplot is drawn as follows:
-
• A box spans from \(Q_1\) to \(Q_3\), with a line at the median \(Q_2\).
-
• Whiskers extend from the box to the most extreme data points that lie within
\[ \bigl [Q_1 - 1.5\,\mathrm {IQR},\;\; Q_3 + 1.5\,\mathrm {IQR}\bigr ]. \]
-
• Any observation outside the whisker range is marked individually as a potential outlier.
An illustration of a boxplot with its components is shown in Fig. 1.8.
1.3.4 Interpretation
-
• The box width (IQR) reflects variability — a wide box indicates high spread.
-
• The median line position within the box reveals skewness: if the median is closer to \(Q_1\), the distribution is right-skewed; if closer to \(Q_3\), left-skewed.
-
• Whisker lengths indicate the range of typical observations.
-
• Individual points beyond the whiskers flag potential outliers that may warrant further investigation.
Unlike histograms, boxplots do not show the shape of the distribution (e.g., multimodality). They are most useful for comparing distributions across groups side by side.
1.4 Violin Plot
A violin plot extends the boxplot by adding a (smoothed1) histogram of the data on each side. A median line is typically drawn inside.
Fig. 1.9 compares the two representations for the same bimodal data set: the boxplot gives no indication of two groups, whereas the violin plot clearly shows two bulges.
1 This smoothing is termed kernel density estimation (KDE). It is a special case of kernel smoothing in Sec. 6.5. The discussion of this technique is beyond the scope of this chapter.