Machine Learning & Signals Learning
4 Model Characterization
4.1 Uni-variate Polynomial Model
The goal is to extend a linear model “engine” to polynomial models. The polynomial model is very flexible; by Taylor's theorem, it can approximate any sufficiently smooth function.
The \(L\)-degree uni-variate polynomial regression model is
\(\seteqnumber{0}{}{0}\)\begin{equation} \begin{aligned} \hat {y} = f(x;\bw ) &= w_0 + w_1x + w_2x^2 + \cdots + w_Lx^L\\ &=\sum _{j=0}^{L} w_jx^j \end {aligned} \end{equation}
This problem becomes linear under the change of variables \(z_{j} = x^j,\; j=0,\ldots ,L\),
\(\seteqnumber{0}{}{1}\)\begin{equation} \hat {y} =\sum _{j=0}^{L} w_jz_{j} \end{equation}
The corresponding prediction for the dataset \(\left \{x_k,y_k\right \}_{k=1}^M\) can be easily written by using the matrix notation (also termed Vandermonde matrix),
\(\seteqnumber{0}{}{2}\)\begin{equation} \renewcommand *{\arraystretch }{1.3} \bX = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^L\\ 1 & x_2 & x_2^2 & \cdots & x_2^L\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_M & x_M^2 & \cdots & x_M^L\\ \end {bmatrix} \end{equation}
The weight vector \(\bw \) is then obtained exactly as in linear regression, e.g., via the normal equation.
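The construction above can be sketched in a few lines of NumPy; the data-generating coefficients, noise level, and sample size here are illustrative assumptions.

```python
import numpy as np

# Hypothetical data from a quadratic with small noise (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(x.size)

L = 2
# Vandermonde matrix with columns 1, x, x^2, ..., x^L (one row per sample).
X = np.vander(x, N=L + 1, increasing=True)   # shape (M, L+1)

# Least-squares fit of the now-linear problem gives the weights w_0, ..., w_L.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w                                # predictions for the dataset
```

Since the model is linear in the transformed variables \(z_j = x^j\), any linear-regression solver applies unchanged.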
Hyper-parameter The value of \(L\) is termed hyper-parameter of the model.
Examples of questions of dramatic importance:
• Is the model \(\hat {y} = f(x;\bw )\) appropriate?
• How do we evaluate a model's performance?
• What are the optimal hyper-parameter values, e.g. \(L\) of the polynomial model?
4.2 Generalization
Let’s assume that the dataset \((\bX ,\by )\) consists of \(M\) samples drawn from some (unknown) joint probability distribution, \(\mathcal {D}\). The theoretical model performance is given by some average model metric (not loss) over all (theoretically) possible points from \(\mathcal {D}\),
\(\seteqnumber{0}{}{3}\)\begin{equation} \label {eq-generalization-metric} \bar {J}_{theory} = \lim \limits _{M\rightarrow \infty }\frac {1}{M}\sum _{k=1}^MJ(y_k,\hat {y}_k) \end{equation}
Generalization performance: The difference between:
• The performance metric over the dataset used to train the model, \(\bar {J}_M\).
• \(\bar {J}_{theory}\) from (4.4).
The problem is that the distribution \(\mathcal {D}\) is unknown in most practical applications. Moreover, in practice, the value of \(M\) is (very) limited and \(\bar {J}_M\) can significantly differ from \(\bar {J}_{theory}\).
Notes:
• Better generalization means a smaller difference between the model performance and the generalization performance.
• The generalization gap can be evaluated theoretically only for very simple models and datasets.
The generalization gap has two main sources:
• Data-related: limited dataset size \(M\), sampling bias, noise in the data, and non-representative samples from \(\mathcal {D}\).
• Model-related: model complexity mismatch (underfitting or overfitting), inappropriate model assumptions, and insufficient regularization.
In the following, methods to reduce both data-related and model-related gaps such that \(\bar {J}_M\approx \bar {J}_{theory}\) will be discussed.
The generalization concept is illustrated in Fig. 4.1.
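The gap can be observed numerically by comparing the training metric \(\bar {J}_M\) with the metric on a large held-out sample that stands in for \(\bar {J}_{theory}\). A minimal sketch, in which the data-generating function, noise level, and candidate degrees are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(m):
    # Assumed ground truth h(x) = sin(3x) plus irreducible noise.
    x = rng.uniform(-1.0, 1.0, m)
    return x, np.sin(3.0 * x) + 0.1 * rng.standard_normal(m)

x_tr, y_tr = make_data(30)      # small training set (limited M)
x_te, y_te = make_data(1000)    # large sample approximating J_theory

def mse(x, y, w):
    X = np.vander(x, N=w.size, increasing=True)
    return np.mean((X @ w - y) ** 2)

gaps = {}
for L in (1, 3, 15):
    X_tr = np.vander(x_tr, N=L + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    # Generalization gap: held-out metric minus training metric.
    gaps[L] = mse(x_te, y_te, w) - mse(x_tr, y_tr, w)
```

The overly flexible degree-15 model drives its training error toward zero while its held-out error grows, so its gap dwarfs that of the moderate model.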
4.3 Cross-validation
Big datasets \(\left (10^4\lesssim M\right )\): train/validation/test
The first step is to shuffle the dataset into a random order. Then, split it into three disjoint subsets:
• Training (50-80%): used for learning the model parameters, e.g. the weights \(\bw \).
• Validation (10-25%): used for assessing the influence of the model hyper-parameters.
• Test (10-25%): performance assessment that is supposed to be sufficiently close to \(\bar {J}_{theory}\).
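The split above can be sketched with index arrays; the 70/15/15 fractions and the synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
M = 10_000
X = rng.standard_normal((M, 3))   # placeholder features
y = rng.standard_normal(M)        # placeholder targets

idx = rng.permutation(M)          # shuffle into a random order first
n_tr, n_val = int(0.70 * M), int(0.15 * M)
tr, val, te = np.split(idx, [n_tr, n_tr + n_val])

X_train, y_train = X[tr], y[tr]   # parameter learning
X_val,   y_val   = X[val], y[val] # hyper-parameter assessment
X_test,  y_test  = X[te], y[te]   # final performance estimate
```

Splitting by shuffled indices keeps each sample in exactly one subset, which is the property the three-way split relies on.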
Medium datasets \(\left (10^2\lesssim M\lesssim 10^4\right )\): k-fold
The first step is to shuffle the dataset into a random order. Then, apply the following steps:
• Data splitting: The available dataset is divided into \(k\) subsets of approximately equal size, often referred to as “folds”.
• Model training and evaluation: The model is trained \(k\) times. In each iteration, one of the folds is used as the test set, and the remaining \(k-1\) folds are used as the training/validation sets, so the model is trained and validated on a different combination of data in each iteration.
• Performance evaluation: After training the model \(k\) times, the performance metrics obtained in each iteration are averaged.
Usually, \(k\) is defaulted to 5 or 10.
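The \(k\)-fold loop can be sketched directly with NumPy; the one-parameter linear model and the synthetic data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = 2.0 * x + 0.1 * rng.standard_normal(200)   # assumed linear ground truth

k = 5
idx = rng.permutation(x.size)    # random order first
folds = np.array_split(idx, k)   # k roughly equal folds

scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Closed-form fit of y ~ w*x on the k-1 training folds.
    w = x[train_idx] @ y[train_idx] / (x[train_idx] @ x[train_idx])
    # Evaluate on the held-out fold.
    scores.append(np.mean((w * x[test_idx] - y[test_idx]) ** 2))

cv_mse = float(np.mean(scores))  # performance averaged over the k folds
```

Setting `k = x.size` in the same loop yields the leave-one-out scheme discussed next.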
Very small datasets \(\left (M\lesssim 10^2\right )\): leave-one-out
Uses \(k\)-fold with \(k=M\), which means that each fold contains only one data point.
The cross-validation methods are illustrated in Fig. 4.2.
4.4 Basic Workflow
The basic ML/DL workflow is presented in Fig. 4.3. The workflow parts are:
• Goal definition
• Data: available data
• Pre-processing: preliminary dataset exploration and validation of dataset integrity (e.g., same physical units for all values of the same feature).
• Model: basic assumptions about the hidden pattern within the data
• Model training: minimization of the loss function to derive the most appropriate parameters.
• Hyper-parameter optimization: tuning the model sub-type.
• Performance assessment according to predefined metrics.
Baseline The basic end-to-end workflow implementation is called baseline.
4.4.1 Goal definition
Typical related goals:
• Prediction or regression: \(\by \) is quantitative (Fig. 4.4a).
• Classification: \(\by \) is categorical (Fig. 4.4b).
• Clustering: no \(\by \) is provided; it is learned from the dataset (Fig. 4.4c).
• Semi-supervised learning: a combination of labeled and unlabeled data (Fig. 4.4d).
• Anomaly detection: somewhere between classification and clustering (Fig. 4.4e).
• Segmentation
• Simulation
• Signal processing tasks: noise removal, smoothing (filling missing values), event/condition detection.
Note that it is possible to have two or more goals for the same dataset.
4.4.2 Model
We assume that the underlying problem (e.g., regression or classification) has a formulation of the form
\(\seteqnumber{0}{}{4}\)\begin{equation} y = h(\bx ) + \epsilon \end{equation}
where \(h(\bx )\) is the true unknown function and \(\epsilon \) is some irreducible noise. Sometimes, zero-mean noise is assumed. The values of \(\bx \) (scalar or vector) and \(y\) are known (it is the dataset).
The goal is to find a function \(f(\cdot ;\bw )\) that approximates \(h(\bx )\). The way \(f(\cdot ;\bw )\) is defined is termed the model, and it depends on a model parameter vector \(\bw \). The process of finding the parameters \(\bw \) is called learning, such that the resulting model can provide predictions
\(\seteqnumber{0}{}{5}\)\begin{equation} \hat {y}_0 = f(\bx _0;\bw ) \end{equation}
for some new data \(\bx _0\).
Parameters vs hyper-parameters
There is a conceptual difference between parameters and hyper-parameters.
Model parameters: Model parameters are learned directly from a dataset.
Hyper-parameters: Model parameters that are not learned directly from a dataset are called hyper-parameters. They are learned in an indirect way, e.g., during the cross-validation process described above.
Hyper-parameter optimization Selecting the most appropriate hyper-parameter values is called hyper-parameter optimization.
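As a minimal sketch of hyper-parameter optimization, the polynomial degree \(L\) can be chosen by validation error; the candidate degrees, split sizes, and data-generating quadratic are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, 200)
y = 1.0 - 2.0 * x**2 + 0.05 * rng.standard_normal(x.size)  # assumed truth

# Shuffled 150/50 train/validation split.
tr, val = np.split(rng.permutation(x.size), [150])

def val_mse(L):
    # Learn parameters w on the training set for a given hyper-parameter L...
    X_tr = np.vander(x[tr], N=L + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_tr, y[tr], rcond=None)
    # ...and score that choice of L on the validation set.
    X_val = np.vander(x[val], N=L + 1, increasing=True)
    return np.mean((X_val @ w - y[val]) ** 2)

candidates = [1, 2, 3, 5, 9]
best_L = min(candidates, key=val_mse)   # hyper-parameter optimization
```

Note the division of labor: the weights \(\bw \) are learned from the training set, while \(L\) is selected indirectly through the validation score.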
Parametric vs non-parametric models
There are two main classes of models: parametric and non-parametric, summarized in Table 4.1.
Aspect | Parametric | Non-parametric
Dependence of the number of parameters on dataset size | Fixed | Flexible
Interpretability | Yes | No
Underlying data assumptions | Yes | No
Risk | Underfitting due to rigid structure | Overfitting due to high flexibility
Dataset size | Smaller | Best for larger
Complexity | Often fast | Often complex
Examples | Linear regression | k-NN, trees
The modern trend is to bridge the gap between interpretable and non-parametric modeling.
4.4.3 Loss Function
A loss (or cost) function relates the dataset outputs \(\by \) to the model outputs \(\hat {\by }\). The parameters \(\bw \) are the minimizer of that function,
\(\seteqnumber{0}{}{6}\)\begin{equation} \hat \bw = \argmin _{\bw }\loss (\by ,\hat {\by }) \end{equation}
The minimization of the loss function is also termed training.
Loss Function Minimization
Closed-form solution A closed-form solution for \(\bw \) is a solution expressed in terms of basic mathematical functions. For example, the “normal equation” provides such a solution for linear regression/classification.
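The normal equation \(\bw = (\bX ^T\bX )^{-1}\bX ^T\by \) can be sketched as follows; the data and true coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Design matrix with an intercept column and one random feature.
X = np.column_stack([np.ones(100), rng.standard_normal(100)])
y = X @ np.array([0.5, -1.5]) + 0.01 * rng.standard_normal(100)

# Normal equation: solve (X^T X) w = X^T y.
# Solving the linear system is preferred over forming the explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

No iteration is involved: the minimizer of the MSE loss is obtained in one linear-algebra step.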
Local-minimum gradient-based iterative algorithms This family of algorithms converges to a local minimum, which is guaranteed to be the global minimum only for convex (preferably strictly convex) loss functions. For example, gradient descent (GD) and its modifications (e.g., stochastic GD) are used to evaluate NN parameters. Another example is the Newton-Raphson algorithm.
• Some advanced algorithms in this category also employ (require) the second-order derivative \(\frac {\partial ^2 \loss }{\partial \bw ^2}\) for faster convergence.
• If either derivative is not available as a closed-form expression, it is evaluated numerically.
Global optimizers The goal of global optimizers is to find a global minimum of a non-convex function. These algorithms may be gradient-free, first-order, or second-order. Their complexity is significantly higher than that of the local optimizers and can be prohibitive for more than a few hundred variables in \(\bX \).
4.4.4 Metrics
Metrics are quantitative performance indicators \(\loss (\by ,\hat {\by })\) of the model that relate \(\by \) and \(\hat {\by }\). Sometimes, the minimum of the loss function is also a metric, e.g. the mean squared error (MSE).
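A few common regression metrics relating \(\by \) and \(\hat {\by }\) can be sketched directly; the example vectors are illustrative assumptions.

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0, 4.0])   # dataset outputs (assumed)
y_hat = np.array([1.1, 1.9, 3.2, 3.8])   # model outputs (assumed)

# Mean squared error: doubles as a loss function for regression.
mse = np.mean((y - y_hat) ** 2)
# Mean absolute error: less sensitive to outliers than MSE.
mae = np.mean(np.abs(y - y_hat))
# Coefficient of determination R^2: fraction of variance explained.
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

MSE here is both the training loss and a reporting metric, whereas MAE and \(R^2\) are typically used only as metrics.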
4.5 Project Steps
A widely adopted framework for ML/DL projects is the Cross-Industry Standard Process for Data Mining (CRISP-DM), which defines six iterative phases:
1. Business understanding – define the problem, success criteria, and project plan.
2. Data understanding – collect initial data and perform exploratory analysis.
3. Data preparation – clean, transform, and construct the final dataset.
4. Modeling – select and train candidate models.
5. Evaluation – assess model performance against business objectives.
6. Deployment – integrate the model into production.
The process is iterative: findings in any phase may require revisiting earlier phases.
• Understanding the problem: domain knowledge is essential.
  – Define performance metric and performance goal.
  – Evaluate human-level performance, if relevant.
• Data collection and preparation: typically the most time-consuming step.
  – Ensure representative distribution.
  – Identify outliers.
  – Compute basic statistics and create visualizations.
  – Verify sufficient dataset size.
• Model engineering:
  – Start with a standard, pre-implemented model.
  – Avoid overly complex models for the baseline.
• Baseline implementation:
  – Build an end-to-end pipeline: data \(\rightarrow \) model \(\rightarrow \) loss \(\rightarrow \) metrics.
  – Debug and verify a sufficient level of performance.
• Performance evaluation:
  – Identify overfitting and underfitting (bias/variance analysis).
  – Investigate errors and validate dataset integrity.
• Performance improvement:
  – Hyper-parameter optimization.
  – Iteratively refine: revisit the problem, add data, or improve the model until sufficient performance is achieved.
• Model deployment:
  – Deploy the model for the target application.
4.5.1 MLOps
Machine Learning Operations (MLOps) extends DevOps principles to ML systems, bridging the gap between model development and production deployment.
• Develop: algorithm training and testing, ETL / data pipelines, continuous integration and deployment (CI/CD).
• Operate: continuous delivery, model inference, monitoring and management.