Machine Learning & Signals Learning

\(\newcommand{\footnotename}{footnote}\) \(\def \LWRfootnote {1}\) \(\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\let \LWRorighspace \hspace \) \(\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }\) \(\newcommand {\TextOrMath }[2]{#2}\) \(\newcommand {\mathnormal }[1]{{#1}}\) \(\newcommand \ensuremath [1]{#1}\) \(\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } \) \(\newcommand {\setlength }[2]{}\) \(\newcommand {\addtolength }[2]{}\) \(\newcommand {\setcounter }[2]{}\) \(\newcommand {\addtocounter }[2]{}\) \(\newcommand {\arabic }[1]{}\) \(\newcommand {\number }[1]{}\) \(\newcommand {\noalign }[1]{\text {#1}\notag \\}\) \(\newcommand {\cline }[1]{}\) \(\newcommand {\directlua }[1]{\text {(directlua)}}\) \(\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}\) \(\newcommand {\protect }{}\) \(\def \LWRabsorbnumber #1 {}\) \(\def \LWRabsorbquotenumber "#1 {}\) \(\newcommand {\LWRabsorboption }[1][]{}\) \(\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }\) \(\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }\) \(\def \mathcode #1={\mathchar }\) \(\let \delcode \mathcode \) \(\let \delimiter \mathchar \) \(\def \oe {\unicode {x0153}}\) \(\def \OE {\unicode {x0152}}\) \(\def \ae {\unicode {x00E6}}\) \(\def \AE {\unicode {x00C6}}\) \(\def \aa {\unicode {x00E5}}\) \(\def \AA {\unicode {x00C5}}\) \(\def \o {\unicode {x00F8}}\) \(\def \O {\unicode {x00D8}}\) \(\def \l {\unicode {x0142}}\) \(\def \L {\unicode {x0141}}\) \(\def \ss {\unicode {x00DF}}\) \(\def \SS {\unicode {x1E9E}}\) \(\def \dag {\unicode {x2020}}\) \(\def \ddag {\unicode {x2021}}\) \(\def \P {\unicode {x00B6}}\) \(\def \copyright {\unicode {x00A9}}\) \(\def \pounds {\unicode {x00A3}}\) \(\let \LWRref \ref \) \(\renewcommand {\ref }{\ifstar \LWRref \LWRref }\) \( \newcommand {\multicolumn }[3]{#3}\) \(\require {textcomp}\) \( 
\newcommand {\abs }[1]{\lvert #1\rvert } \) \( \DeclareMathOperator {\sign }{sign} \) \(\newcommand {\intertext }[1]{\text {#1}\notag \\}\) \(\let \Hat \hat \) \(\let \Check \check \) \(\let \Tilde \tilde \) \(\let \Acute \acute \) \(\let \Grave \grave \) \(\let \Dot \dot \) \(\let \Ddot \ddot \) \(\let \Breve \breve \) \(\let \Bar \bar \) \(\let \Vec \vec \) \(\newcommand {\bm }[1]{\boldsymbol {#1}}\) \(\require {physics}\) \(\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}\) \(\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}\) \(\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}\) \(\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}\) \(\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}\) \(\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}\) \(\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}\) \(\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}\) \(\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}\) \(\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}\) \(\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}\) \(\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}\) \(\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}\) \(\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}\) \(\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}\) \(\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}\) \(\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}\) \(\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}\) \(\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}\) \(\renewcommand {\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}\) \(\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}\) \(\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}\) \(\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}\) \(\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}\) \(\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}\) \(\require {cancel}\) \(\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}\) 
\(\DeclareMathOperator *{\argmax }{argmax}\) \(\DeclareMathOperator *{\argmin }{arg\,min}\) \(\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}\) \(\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}\) \(\def \Cov [#1]{\operatorname {Cov}\!\left [ #1 \right ]}\) \(\newcommand {\floor }[1]{\lfloor #1 \rfloor }\) \(\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}\) \(\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}\) \(\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}\) \(\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}\) \(\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}\) \(\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}\) \(\renewcommand {\real }{\mathbb {R}}\) \(\newcommand {\ba }{\mathbf {a}}\) \(\newcommand {\bb }{\mathbf {b}}\) \(\newcommand {\bd }{\mathbf {d}}\) \(\newcommand {\be }{\mathbf {e}}\) \(\newcommand {\bh }{\mathbf {h}}\) \(\newcommand {\bn }{\mathbf {n}}\) \(\newcommand {\bq }{\mathbf {q}}\) \(\newcommand {\br }{\mathbf {r}}\) \(\newcommand {\bt }{\mathbf {t}}\) \(\newcommand {\bv }{\mathbf {v}}\) \(\newcommand {\bw }{\mathbf {w}}\) \(\newcommand {\bx }{\mathbf {x}}\) \(\newcommand {\bxx }{\mathbf {xx}}\) \(\newcommand {\bxy }{\mathbf {xy}}\) \(\newcommand {\by }{\mathbf {y}}\) \(\newcommand {\byy }{\mathbf {yy}}\) \(\newcommand {\bz }{\mathbf {z}}\) \(\newcommand {\bA }{\mathbf {A}}\) \(\newcommand {\bB }{\mathbf {B}}\) \(\newcommand {\bI }{\mathbf {I}}\) \(\newcommand {\bK }{\mathbf {K}}\) \(\newcommand {\bP }{\mathbf {P}}\) \(\newcommand {\bQ }{\mathbf {Q}}\) \(\newcommand {\bR }{\mathbf {R}}\) \(\newcommand {\bU }{\mathbf {U}}\) \(\newcommand {\bW }{\mathbf {W}}\) \(\newcommand {\bX }{\mathbf {X}}\) \(\newcommand {\bY }{\mathbf {Y}}\) \(\newcommand {\bZ }{\mathbf {Z}}\) \(\newcommand {\balpha }{\bm {\alpha }}\) \(\newcommand {\bth }{{\bm {\theta }}}\) 
\(\newcommand {\bepsilon }{{\bm {\epsilon }}}\) \(\newcommand {\bmu }{{\bm {\mu }}}\) \(\newcommand {\bOne }{\mathbf {1}}\) \(\newcommand {\bZero }{\mathbf {0}}\) \(\newcommand {\loss }{\mathcal {L}}\) \(\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}\) \(\newcommand {\SSE }{\mathrm {SSE}}\) \(\newcommand {\MSE }{\mathrm {MSE}}\) \(\newcommand {\RMSE }{\mathrm {RMSE}}\) \(\newcommand {\toprule }[1][]{\hline }\) \(\let \midrule \toprule \) \(\let \bottomrule \toprule \) \(\def \LWRbooktabscmidruleparen (#1)#2{}\) \(\newcommand {\LWRbooktabscmidrulenoparen }[1]{}\) \(\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }\) \(\newcommand {\morecmidrules }{}\) \(\newcommand {\specialrule }[3]{\hline }\) \(\newcommand {\addlinespace }[1][]{}\) \(\newcommand {\LWRsubmultirow }[2][]{#2}\) \(\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }\) \(\newcommand {\multirow }[2][]{\LWRmultirow }\) \(\newcommand {\mrowcell }{}\) \(\newcommand {\mcolrowcell }{}\) \(\newcommand {\STneed }[1]{}\) \(\newcommand {\tcbset }[1]{}\) \(\newcommand {\tcbsetforeverylayer }[1]{}\) \(\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}\) \(\newcommand {\tcboxfit }[2][]{\boxed {#2}}\) \(\newcommand {\tcblower }{}\) \(\newcommand {\tcbline }{}\) \(\newcommand {\tcbtitle }{}\) \(\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}\) \(\newcommand {\tcboxmath }[2][]{\boxed {#2}}\) \(\newcommand {\tcbhighmath }[2][]{\boxed {#2}}\)

4 Model Characterization

4.1 Uni-variate Polynomial Model

The goal is to extend the linear model “engine” to polynomial models. The polynomial model is very flexible: by Taylor's theorem, any sufficiently smooth function can be approximated locally by a polynomial.

The \(L\)-degree uni-variate polynomial regression model is

\begin{equation} \begin{aligned} \hat {y} = f(x;\bw ) &= w_0 + w_1x + w_2x^2 + \cdots + w_Lx^L\\ &=\sum _{j=0}^{L} w_jx^j \end {aligned} \end{equation}

The problem becomes linear under the change of variables \(z_{j} = x^j,\; j=0,\ldots ,L\),

\begin{equation} \hat {y} =\sum _{j=0}^{L} w_jz_{j} \end{equation}

The corresponding prediction for the dataset \(\left \{x_k,y_k\right \}_{k=1}^M\) can be written compactly in matrix notation using the so-called Vandermonde matrix,

\begin{equation} \renewcommand *{\arraystretch }{1.3} \bX = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^L\\ 1 & x_2 & x_2^2 & \cdots & x_2^L\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_M & x_M^2 & \cdots & x_M^L\\ \end {bmatrix} \end{equation}

The weight vector is then obtained exactly as in the linear case, e.g. via the normal equation or least squares.
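As a minimal sketch (the data and variable names here are illustrative, not from the notes), the Vandermonde construction and least-squares fit can be written as:

```python
import numpy as np

# Hypothetical toy data: noisy samples of an underlying cubic.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**3 + 0.05 * rng.standard_normal(x.size)

L = 3  # polynomial degree (a hyper-parameter)

# Vandermonde matrix: columns are x^0, x^1, ..., x^L.
X = np.vander(x, N=L + 1, increasing=True)

# Least-squares solution of X w = y (equivalent to the normal equation,
# but numerically more stable than forming (X^T X)^{-1} X^T y directly).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w  # predictions on the training points
```

Using `lstsq` rather than an explicit matrix inverse is the standard idiom, since the Vandermonde matrix can be poorly conditioned for large \(L\).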

Hyper-parameter The degree \(L\) is termed a hyper-parameter of the model.

Examples of questions of critical importance:

  • Is the model \(\hat {y} = f(x;\bw )\) appropriate?

  • How do we evaluate the model's performance?

  • What are the optimal hyper-parameter values, e.g. the degree \(L\) of the polynomial model?

4.2 Generalization

  • Goal:

    • Estimate model performance metrics and understand the limitations of these estimates.

    • Use a trial-and-error approach.

    • Understand the limits of the model's ability to make inferences on unfamiliar data.

Let’s assume that the dataset \((\bX ,\by )\) consists of \(M\) samples drawn from some (unknown) joint probability distribution \(\mathcal {D}\). The theoretical model performance is given by some average model metric (not loss) over all (theoretically) possible points from \(\mathcal {D}\),1

\begin{equation} \label {eq-generalization-metric} \bar {J}_{theory} = \lim \limits _{M\rightarrow \infty }\frac {1}{M}\sum _{k=1}^MJ(y_k,\hat {y}_k) \end{equation}

Generalization gap: The difference between:

  • The performance metric over the dataset used to train the model, \(\bar {J}_M\).

  • \(\bar {J}_{theory}\) from (4.4).

The problem is that the distribution \(\mathcal {D}\) is unknown in most practical applications. Moreover, in practice the value of \(M\) is (very) limited, so \(\bar {J}_M\) can differ significantly from \(\bar {J}_{theory}\).

Notes:

  • Better generalization means a smaller difference between \(\bar {J}_M\) and \(\bar {J}_{theory}\).

  • The gap can be evaluated theoretically only for very simple models and datasets.

The generalization gap has two main sources:

  • Data-related: limited dataset size \(M\), sampling bias, noise in the data, and non-representative samples from \(\mathcal {D}\).

  • Model-related: model complexity mismatch (underfitting or overfitting), inappropriate model assumptions, and insufficient regularization.

The following sections discuss methods to reduce both the data-related and model-related gaps so that \(\bar {J}_M\approx \bar {J}_{theory}\).
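The gap can be illustrated numerically under an assumed known distribution. Everything below is a synthetic sketch: the sinusoidal ground truth, noise level, and polynomial degree are illustrative choices, and \(\bar {J}_{theory}\) is approximated by a very large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_and_gap(M, L=9):
    # Draw M training samples from an assumed known distribution D:
    # x ~ U(-1, 1), y = sin(pi x) + Gaussian noise.
    x = rng.uniform(-1, 1, M)
    y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(M)
    X = np.vander(x, L + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Training metric J_M: MSE on the observed samples.
    J_M = np.mean((X @ w - y) ** 2)

    # Approximate J_theory with a very large fresh sample from D.
    xt = rng.uniform(-1, 1, 100_000)
    yt = np.sin(np.pi * xt) + 0.1 * rng.standard_normal(xt.size)
    Xt = np.vander(xt, L + 1, increasing=True)
    J_theory = np.mean((Xt @ w - yt) ** 2)
    return J_M, J_theory

small = fit_and_gap(M=15)    # small M: a large gap is expected
large = fit_and_gap(M=5000)  # large M: J_M approaches J_theory
```

The flexible degree-9 model nearly interpolates the 15 noisy points, so its training metric is optimistic; with 5000 points the two metrics nearly coincide.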

The generalization concept is illustrated in Fig. 4.1.

(image)

Figure 4.1: Generalization: the gap between the training metric \(\bar {J}_M\) (computed on \(M\) observed samples) and the theoretical metric \(\bar {J}_{theory}\) (over the entire unknown distribution \(\mathcal {D}\) of possible inputs). The gap has data-related and model-related sources.

1 Probabilistic formulation, \(E_\mathcal {D}[J(y,\hat {y})]\)

4.3 Cross-validation

  • Goal: A trial-and-error approach to quantify generalization performance. Cross-validation is also termed performance assessment. Outcomes:

    • An approximation of the generalization performance, \(\bar {J}_M\approx \bar {J}_{theory}\).

    • Hyper-parameter selection.

Large datasets \(\left (10^4\lesssim M\right )\): train/validation/test

The first step is to shuffle the dataset into random order. Then, split it into three disjoint subsets:

  • Training (50-80%): used to learn the model parameters, e.g. the weights \(\bw \).

  • Validation (10-25%): used to assess the influence of the model hyper-parameters.

  • Test (10-25%): used for the final performance assessment, which should be sufficiently close to \(\bar {J}_{theory}\).
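A minimal sketch of the shuffle-and-split procedure (the 70/15/15 ratios lie within the suggested ranges; the random data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
M = 10_000
X = rng.standard_normal((M, 3))  # hypothetical feature matrix
y = rng.standard_normal(M)       # hypothetical targets

# Step 1: shuffle into random order.
perm = rng.permutation(M)
X, y = X[perm], y[perm]

# Step 2: split 70% / 15% / 15%.
i_train = int(0.70 * M)
i_val = int(0.85 * M)
X_train, y_train = X[:i_train], y[:i_train]
X_val, y_val = X[i_train:i_val], y[i_train:i_val]
X_test, y_test = X[i_val:], y[i_val:]
```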

Medium datasets \(\left (10^2\lesssim M\lesssim 10^4\right )\): k-fold

The first step is to shuffle the dataset into random order. Then, apply the following steps:

  • Data Splitting: First, the available dataset is divided into \(k\) subsets of approximately equal size. These subsets are often referred to as “folds”.

  • Model Training and Evaluation: The model is trained \(k\) times. In each iteration, one of the subsets is used as the test set, and the remaining \(k-1\) subsets are used as the training/validation set. Thus, in each iteration the model is trained and validated on a different combination of training and test data.

  • Performance Evaluation: After training the model \(k\) times, the performance of the model is evaluated by averaging the performance metrics obtained in each iteration.

Usually, \(k\) is set to 5 or 10.

Very small datasets \(\left (M\lesssim 10^2\right )\): leave-one-out

Uses \(k\)-fold with \(k=M\), so each fold contains exactly one data point.
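The \(k\)-fold scheme above, with leave-one-out as the special case \(k=M\), can be sketched as follows (the helper name `kfold_indices` is hypothetical):

```python
import numpy as np

def kfold_indices(M, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.
    Leave-one-out is the special case k == M."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(M)          # shuffle first
    folds = np.array_split(idx, k)    # k nearly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example: average a (placeholder) metric over 5 folds.
M, k = 103, 5
scores = []
for tr, te in kfold_indices(M, k):
    scores.append(len(te))  # stand-in for J evaluated on the test fold
avg = np.mean(scores)
```

Each sample appears in exactly one test fold, so the averaged metric uses every data point exactly once for testing.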

The cross-validation methods are illustrated in Fig. 4.2.

(image)

Figure 4.2: Cross-validation illustration: train/validation/test split for big datasets (top) and \(k\)-fold cross-validation for medium datasets (bottom).

4.4 Basic Workflow

The basic ML/DL workflow is presented in Fig. 4.3. The workflow parts are:

  • Goal definition

  • Data: available data

  • Pre-processing: preliminary dataset exploration and validation of dataset integrity (e.g., same physical units for all values of the same feature).

  • Model: basic assumptions about the hidden pattern within the data

  • Model training: minimization of the loss function to derive the most appropriate parameters.

  • Hyper-parameter optimization: tuning the model sub-type.

  • Performance assessment according to predefined metrics.

(image)

Figure 4.3: Basic workflow of ML/DL solution.

Baseline The basic end-to-end workflow implementation is called a baseline.

4.4.1 Goal definition

Typical related goals:

  • Prediction or regression, \(\by \) is quantitative (Fig. 4.4a).

  • Classification, \(\by \) is categorical (Fig. 4.4b).

  • Clustering, no \(\by \) is provided; it is learned from the dataset (Fig. 4.4c).

  • Semi-supervised learning, combination of labeled and unlabeled data (Fig. 4.4d).

  • Anomaly detection, somewhere between classification and clustering (Fig. 4.4e).

  • Segmentation

  • Simulation

  • Signal processing tasks: noise removal, smoothing (filling missing values), event/condition detection.

(image)

(a) Regression: what is the value of \(y\) for given \(x\)?
   

(image)

(b) Classification: what is the class of a new sample?

(image)

(c) Clustering: which samples belong together?
   

(image)

(d) Semi-supervised: few labeled samples guide learning from many unlabeled ones.

(image)

(e) Anomaly detection: which samples deviate from the normal pattern?
Figure 4.4: Examples of common ML goal types.

Note that it is possible to have two or more goals for the same dataset.

4.4.2 Model

We assume that the underlying problem (e.g., regression or classification) has a formulation of the form

\begin{equation} y = h(\bx ) + \epsilon \end{equation}

where \(h(\bx )\) is the true unknown function and \(\epsilon \) is some irreducible noise. Sometimes, zero-mean noise is assumed. The values of \(\bx \) (scalar or vector) and \(y\) are known; together they constitute the dataset.

The goal is to find a function \(f(\cdot ;\bw )\) that approximates \(h(\bx )\). The way \(f(\cdot ;\bw )\) is defined is termed the model; it depends on a model parameter vector \(\bw \). The process of finding the parameters \(\bw \) is called learning, such that the resulting model can provide predictions

\begin{equation} \hat {y}_0 = f(\bx _0;\bw ) \end{equation}

for some new data \(\bx _0\).
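A small sketch of prediction for new data (the weight values are hypothetical, ordered \(w_0,\ldots ,w_L\) as in the Vandermonde matrix of Section 4.1):

```python
import numpy as np

# Hypothetical learned weights of a cubic model, ordered w_0 ... w_L.
w = np.array([1.0, 2.0, 0.0, -3.0])

def f(x0, w):
    """Model prediction y_hat_0 = f(x0; w) for a new scalar input x0."""
    powers = x0 ** np.arange(w.size)  # [1, x0, x0^2, ...]
    return powers @ w

y0_hat = f(0.5, w)  # 1 + 2*0.5 + 0 - 3*0.125 = 1.625
```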

Parameters vs hyper-parameters

There is a conceptual difference between parameters and hyper-parameters.

Model parameters: Model parameters are learned directly from a dataset.

Hyper-parameters: Model parameters that are not learned directly from a dataset are called hyper-parameters. They are selected indirectly, e.g. during the cross-validation process described above.

Hyper-parameter optimization Selecting the most appropriate hyper-parameter values is called hyper-parameter optimization.
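Hyper-parameter optimization can be sketched as a simple search over the polynomial degree \(L\), scoring each candidate on a validation split (the data, split sizes, and candidate range below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: underlying quadratic with noise.
x = rng.uniform(-1, 1, 200)
y = 1.0 - 2.0 * x + 3.0 * x**2 + 0.1 * rng.standard_normal(x.size)

# Shuffle and split into training / validation sets.
perm = rng.permutation(x.size)
x, y = x[perm], y[perm]
x_tr, y_tr = x[:150], y[:150]
x_va, y_va = x[150:], y[150:]

def val_mse(L):
    # Parameters w are learned on the training set ...
    Xtr = np.vander(x_tr, L + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Xtr, y_tr, rcond=None)
    # ... while the hyper-parameter L is scored on the validation set.
    Xva = np.vander(x_va, L + 1, increasing=True)
    return np.mean((Xva @ w - y_va) ** 2)

best_L = min(range(1, 9), key=val_mse)
```

Degree \(L=1\) underfits the quadratic ground truth badly, so the validation MSE rules it out; the search settles on a degree of at least two.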

Parametric vs non-parametric models

There are two main classes of models: parametric and non-parametric, summarized in Table 4.1.

Table 4.1: Comparison of parametric and non-parametric models.

Aspect | Parametric | Non-parametric
Dependence of number of parameters on dataset size | Fixed | Flexible
Interpretability | Yes | No
Underlying data assumptions | Yes | No
Risk | Underfitting due to rigid structure | Overfitting due to high flexibility
Dataset size | Works with smaller | Best for larger
Complexity | Often fast | Often complex
Examples | Linear regression | k-NN, trees

The modern trend is to bridge the gap between interpretable parametric and flexible non-parametric modeling.

4.4.3 Loss Function

A loss (or cost) function relates the dataset outputs \(\by \) to the model outputs \(\hat {\by }\). The parameters \(\bw \) are found as the minimizer of that function,

\begin{equation} \hat \bw = \argmin _{\bw }\loss (\by ,\hat {\by }) \end{equation}

The minimization of the loss function is also termed training.

Loss Function Minimization
  • Goal: Minimum of the loss function for a given model.

Closed-form solution A closed-form solution for \(\bw \) is a solution expressed in terms of basic mathematical functions. For example, the "normal equation" is a closed-form solution for linear regression/classification.

Local-minimum gradient-based iterative algorithms These algorithms are guaranteed to reach the global minimum only for convex (preferably strictly convex) loss functions; otherwise they converge to a local minimum. For example, gradient descent (GD) and its modifications (e.g., stochastic GD) are used to evaluate NN parameters. Another example is the Newton-Raphson algorithm.

  • Some advanced algorithms under this category also employ (require) the second-order derivative (the Hessian) \(\frac {\partial ^2 \loss }{\partial \bw \,\partial \bw ^T}\) for faster convergence.

  • If either derivative is not available as a closed-form expression, it is evaluated numerically.

Global optimizers The goal of global optimizers is to find the global minimum of a non-convex function. These algorithms may be gradient-free, or use first or second derivatives. Their complexity is significantly higher than that of local optimizers and can be prohibitive for more than a few hundred parameters in \(\bw \).
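A minimal gradient-descent sketch for the MSE loss of a linear model (the learning rate and iteration count are illustrative, not prescriptive):

```python
import numpy as np

# Synthetic data: y = X w_true + noise, with a bias column in X.
rng = np.random.default_rng(7)
M = 200
X = np.column_stack([np.ones(M), rng.uniform(-1, 1, M)])  # bias + feature
w_true = np.array([0.5, -1.5])
y = X @ w_true + 0.05 * rng.standard_normal(M)

# Gradient descent on the MSE loss L(w) = (1/M) ||X w - y||^2.
w = np.zeros(2)
lr = 0.1  # learning rate (illustrative)
for _ in range(2000):
    grad = 2.0 / M * X.T @ (X @ w - y)  # gradient of the MSE w.r.t. w
    w -= lr * grad
```

Because the MSE loss of a linear model is strictly convex (for full-rank \(\bX \)), the iterations converge to the same solution the normal equation would give.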

4.4.4 Metrics

Metrics are quantitative performance indicators \(J(\by ,\hat {\by })\) of the model that relate \(\by \) and \(\hat {\by }\). Sometimes the loss function also serves as a metric, e.g. mean squared error (MSE).
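For example, MSE and its square root RMSE as metrics (the values are illustrative):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])      # dataset outputs
y_hat = np.array([1.1, 1.9, 3.2])  # model outputs

mse = np.mean((y - y_hat) ** 2)  # mean squared error
rmse = np.sqrt(mse)              # same units as y
```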

4.5 Project Steps

A widely adopted framework for ML/DL projects is the Cross-Industry Standard Process for Data Mining (CRISP-DM), which defines six iterative phases:

  1. Business understanding – define the problem, success criteria, and project plan.

  2. Data understanding – collect initial data and perform exploratory analysis.

  3. Data preparation – clean, transform, and construct the final dataset.

  4. Modeling – select and train candidate models.

  5. Evaluation – assess model performance against business objectives.

  6. Deployment – integrate the model into production.

The process is iterative: findings in any phase may require revisiting earlier phases.

Practical guidelines

  • Understanding the problem: domain knowledge is essential.

    • Define performance metric and performance goal.

    • Evaluate human-level performance, if relevant.

  • Data collection and preparation: typically the most time-consuming step.

    • Ensure representative distribution.

    • Identify outliers.

    • Compute basic statistics and create visualizations.

    • Verify sufficient dataset size.

  • Model engineering:

    • Start with a standard, pre-implemented model.

    • Avoid overly complex models for the baseline.

  • Baseline implementation:

    • Build an end-to-end pipeline: data \(\rightarrow \) model \(\rightarrow \) loss \(\rightarrow \) metrics.

    • Debug and verify a sufficient level of performance.

  • Performance evaluation:

    • Identify overfitting and underfitting (bias/variance analysis).

    • Investigate errors and validate dataset integrity.

  • Performance improvement:

    • Hyper-parameter optimization.

    • Iteratively refine: revisit the problem, add data, or improve the model until sufficient performance is achieved.

  • Model deployment:

    • Deploy the model for the target application.

4.5.1 MLOps

Machine Learning Operations (MLOps) extends DevOps principles to ML systems, bridging the gap between model development and production deployment.

  • Develop: algorithm training and testing, ETL / data pipelines, continuous integration and deployment (CI/CD).

  • Operate: continuous delivery, model inference, monitoring and management.