Machine Learning & Signals Learning
7 Learning Systems (Mid-Summary)
7.1 Basic Workflow
The basic ML/DL workflow is presented in Fig. 7.1. Following the data-flow in the figure, its building blocks are:
• Goal definition: what the system is intended to predict or decide.
• Raw data: the available observations before any treatment.
• Pre-processing: preliminary exploration and validation of dataset integrity (e.g., consistent physical units across all values of the same feature), producing the dataset \((\bX ,\by )\).
• Model: assumptions about the hidden pattern within the data, \(\hat {\by }=f(\bX ;\bw )\), parameterized by \(\bw \).
• Loss: the quantity \(\loss (\by ,\hat {\by })\) to be minimized.
• Solver/optimizer: the training procedure that adjusts \(\bw \) so as to minimize the loss.
• Hyper-parameter optimization: outer loop that tunes the model sub-type, typically via cross-validation.
• Performance metrics: human-facing indicators used for assessment.
• Generalization performance: the final estimate of model quality on unseen data.
Baseline: The basic end-to-end implementation of the workflow is called the baseline.
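As a rough illustration of such a baseline, the sketch below walks through the building blocks on a toy problem. The data (noisy samples of a line), the learning rate, and all function names are assumptions made for this example only.

```python
import math
import random

random.seed(0)

# Raw data -> dataset (X, y): hypothetical noisy samples of y = 2x + 1.
X = [i / 10 for i in range(50)]
y = [2.0 * x + 1.0 + random.gauss(0.0, 0.1) for x in X]

# Hold out 20% of the samples to estimate generalization performance.
indices = list(range(len(X)))
random.shuffle(indices)
train_idx, test_idx = indices[:40], indices[40:]
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_test = [y[i] for i in test_idx]


def model(x, w):
    """Model: y_hat = f(x; w) = w[0] + w[1] * x."""
    return w[0] + w[1] * x


def mse(y_true, y_pred):
    """Loss: mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def train(X, y, lr=0.05, epochs=2000):
    """Solver: plain gradient descent on the MSE loss."""
    w = [0.0, 0.0]
    n = len(X)
    for _ in range(epochs):
        residuals = [model(x, w) - t for x, t in zip(X, y)]
        grad0 = 2.0 / n * sum(residuals)
        grad1 = 2.0 / n * sum(r * x for r, x in zip(residuals, X))
        w = [w[0] - lr * grad0, w[1] - lr * grad1]
    return w


w = train(X_train, y_train)
train_loss = mse(y_train, [model(x, w) for x in X_train])
# Metric: RMSE on held-out data, reported in the same units as y.
test_rmse = math.sqrt(mse(y_test, [model(x, w) for x in X_test]))
print(f"w = {w}, train MSE = {train_loss:.4f}, test RMSE = {test_rmse:.4f}")
```

Hyper-parameter optimization is deliberately left out here; a baseline of this kind mainly serves to verify that every block of the pipeline is connected and produces sane numbers.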
7.2 Goal Types
The goal determines both the form of the target \(\by \) and the family of admissible models, losses, and metrics. The most common ML goal types are listed below and illustrated in Fig. 7.2.
Supervised The dataset provides paired inputs and targets \((\bX ,\by )\).
• Regression: \(\by \) is quantitative; the model predicts a continuous value (Fig. 7.2a).
• Classification: \(\by \) is categorical; the model assigns each sample to one of finitely many classes (Fig. 7.2b).
• Segmentation: classification or regression performed per pixel, per sample, or per time index, producing a structured output of the same shape as the input.
Unsupervised No target \(\by \) is provided; structure is inferred from \(\bX \) alone.
• Clustering: group samples by similarity (Fig. 7.2c).
• Anomaly detection: identify samples that deviate from the typical pattern; situated between classification and clustering (Fig. 7.2e).
• Simulation: learn the underlying distribution to generate new, statistically similar samples.
• Semi-supervised learning: a small set of labeled samples guides learning from a much larger pool of unlabeled ones (Fig. 7.2d).
• Signal processing tasks: noise removal, smoothing or filling missing values, and event/condition detection.
A single dataset may serve two or more goals simultaneously.
7.3 Model and Parameters
Recall (see Secs. 2 and 4.2) that the assumed data-generating model is
\begin{equation} y = h(\bx ) + \epsilon , \end{equation}
where \(h(\bx )\) is the unknown true function and \(\epsilon \) is irreducible (often zero-mean) noise. The model \(f(\cdot ;\bw )\), parameterized by \(\bw \), approximates \(h(\bx )\) and produces predictions \(\hat {y}_0 = f(\bx _0;\bw )\) for new inputs \(\bx _0\). The process of finding \(\bw \) from the dataset is called learning or training.
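The role of the irreducible noise can be checked numerically. The sketch below assumes a hypothetical true function \(h(x)=\sin (x)\) and Gaussian noise with standard deviation 0.2 (both chosen only for illustration) and shows that even the ideal model \(f=h\) has an MSE close to the noise variance.

```python
import math
import random

random.seed(1)

SIGMA = 0.2  # assumed noise standard deviation (illustrative value)


def h(x):
    """Hypothetical 'true' function; unknown in a real problem."""
    return math.sin(x)


def sample_y(x):
    """Data-generating model: y = h(x) + epsilon, epsilon ~ N(0, SIGMA^2)."""
    return h(x) + random.gauss(0.0, SIGMA)


xs = [random.uniform(0.0, math.pi) for _ in range(20000)]
ys = [sample_y(x) for x in xs]

# MSE of the ideal model f = h: only the noise contributes to it.
mse_ideal = sum((y - h(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"MSE of ideal model = {mse_ideal:.4f}, noise variance = {SIGMA ** 2:.4f}")
```

No choice of \(\bw \) can push the expected squared error below this floor; training can only reduce the gap between \(f(\cdot ;\bw )\) and \(h\).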
7.3.1 Parameters vs Hyper-parameters
Model parameters: Model parameters \(\bw \) are learned directly from the dataset by minimizing the loss function.
Hyper-parameters: Hyper-parameters are model settings that are not learned directly from the dataset (e.g., the number of neighbors in k-NN); they are set by the designer or chosen indirectly via cross-validation (see Sec. 4.2).
Hyper-parameter optimization: Selecting the most appropriate hyper-parameter values is called hyper-parameter optimization.
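A minimal sketch of hyper-parameter optimization by cross-validation is given below. It assumes a one-dimensional k-NN regressor whose only hyper-parameter is the number of neighbors; the data, the candidate values, and the helper names `knn_predict` and `cv_mse` are hypothetical.

```python
import math
import random

random.seed(2)


def knn_predict(x0, X_train, y_train, k):
    """Average the targets of the k training samples closest to x0."""
    nearest = sorted(zip(X_train, y_train), key=lambda p: abs(p[0] - x0))[:k]
    return sum(t for _, t in nearest) / len(nearest)


def cv_mse(X, y, k, n_folds=5):
    """Mean validation MSE of k-NN over n_folds interleaved folds."""
    indices = list(range(len(X)))
    fold_errors = []
    for fold in range(n_folds):
        val_idx = indices[fold::n_folds]
        val_set = set(val_idx)
        X_tr = [X[i] for i in indices if i not in val_set]
        y_tr = [y[i] for i in indices if i not in val_set]
        errors = [(knn_predict(X[i], X_tr, y_tr, k) - y[i]) ** 2 for i in val_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / n_folds


# Hypothetical dataset: noisy samples of sin(x).
X = [random.uniform(0.0, 6.0) for _ in range(120)]
y = [math.sin(x) + random.gauss(0.0, 0.3) for x in X]

candidates = [1, 3, 5, 9, 15, 31]
scores = {k: cv_mse(X, y, k) for k in candidates}
best_k = min(scores, key=scores.get)
print(f"cross-validated MSE per k: {scores}")
print(f"selected k = {best_k}")
```

The value of k is never fitted by the training step itself: it is selected in an outer loop by comparing validation errors, which is what distinguishes it from the parameters \(\bw \).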
7.3.2 Parametric vs Non-parametric Models
There are two main classes of models, parametric and non-parametric, summarized in Table 7.1.
| Aspect | Parametric | Non-parametric |
| Dependence of the number of parameters on dataset size | Fixed | Flexible (grows with the dataset) |
| Interpretability | Typically yes | Typically no |
| Underlying data assumptions | Yes | No |
| Risk | Underfitting due to rigid structure | Overfitting due to high flexibility |
| Suitable dataset size | Smaller | Larger |
| Computational complexity | Often low (fast) | Often high |
| Examples | Linear regression | k-NN, trees |
The modern trend is to bridge the gap between interpretable and non-parametric modeling.
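The first row of the table can be made concrete with a small sketch. The two classes below are simplified, hypothetical implementations (one-dimensional inputs, noise-free data) meant only to contrast how much each model has to store after fitting.

```python
class LinearRegressor:
    """Parametric: the fitted model is fully described by two numbers."""

    def fit(self, X, y):
        n = len(X)
        mean_x, mean_y = sum(X) / n, sum(y) / n
        sxx = sum((x - mean_x) ** 2 for x in X)
        sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(X, y))
        self.w1 = sxy / sxx
        self.w0 = mean_y - self.w1 * mean_x
        return self

    def predict(self, x0):
        return self.w0 + self.w1 * x0

    def n_stored_values(self):
        return 2  # w0 and w1, independent of the dataset size


class KNNRegressor:
    """Non-parametric: the training set itself acts as the model."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.samples = list(zip(X, y))
        return self

    def predict(self, x0):
        nearest = sorted(self.samples, key=lambda s: abs(s[0] - x0))[: self.k]
        return sum(t for _, t in nearest) / len(nearest)

    def n_stored_values(self):
        return 2 * len(self.samples)  # grows linearly with the dataset


def make_data(n):
    """Hypothetical noise-free dataset on the line y = 3x - 1."""
    X = [i / n for i in range(n)]
    y = [3.0 * x - 1.0 for x in X]
    return X, y


for n in (10, 1000):
    X, y = make_data(n)
    lin = LinearRegressor().fit(X, y)
    knn = KNNRegressor(k=3).fit(X, y)
    print(n, lin.n_stored_values(), knn.n_stored_values())
```

The k-NN prediction also requires a search over all stored samples, which is one reason why non-parametric models are often more expensive at inference time.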
7.4 Loss and Metrics: Optimization Approaches
The loss (or cost) function \(\loss (\by ,\hat {\by })\) quantifies the mismatch between the dataset outputs \(\by \) and the model outputs \(\hat {\by }=f(\bX ;\bw )\), and the parameters \(\bw \) are obtained as
\begin{equation} \hat \bw = \argmin _{\bw }\loss (\by ,\hat {\by }), \end{equation}
a process termed training (see Sec. 2.1.1 for definitions and examples such as SSE, MSE, RMSE). A metric is the corresponding human-facing performance indicator; sometimes the loss and the reported metric coincide (e.g., MSE).
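For reference, the three quantities named above can be sketched as follows. The function names and the toy values are illustrative; the sketch also checks that SSE, MSE, and RMSE select the same best candidate, since each is a monotonic function of the others.

```python
import math


def sse(y, y_hat):
    """Sum of squared errors."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat))


def mse(y, y_hat):
    """Mean squared error: SSE divided by the number of samples."""
    return sse(y, y_hat) / len(y)


def rmse(y, y_hat):
    """Root mean squared error: same units as y, hence easier to report."""
    return math.sqrt(mse(y, y_hat))


y = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]
print(sse(y, y_hat), mse(y, y_hat), rmse(y, y_hat))

# Constant predictors y_hat = c: all three criteria pick the same c,
# here the sample mean of y (2.875).
candidates = [0.0, 1.0, 2.0, 2.875, 4.0]
best = {
    name: min(candidates, key=lambda c: fn(y, [c] * len(y)))
    for name, fn in (("SSE", sse), ("MSE", mse), ("RMSE", rmse))
}
print(best)
```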
Closed-form solution A closed-form solution for \(\bw \) is expressible through a finite number of standard mathematical operations. For example, the “normal equation” provides such a solution for linear regression with a squared-error loss.
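As a brief sketch, assume a linear model \(\hat {\by }=\bX \bw \) and the SSE loss (the transpose notation \(^{\top }\) is an assumption of this sketch). Setting the gradient of the loss to zero gives
\begin{equation} \loss (\by ,\hat {\by }) = \|\by - \bX \bw \|^2, \qquad \frac {\partial \loss }{\partial \bw } = -2\bX ^{\top }(\by - \bX \bw ) = 0, \end{equation}
\begin{equation} \bX ^{\top }\bX \hat \bw = \bX ^{\top }\by \quad \Rightarrow \quad \hat \bw = (\bX ^{\top }\bX )^{-1}\bX ^{\top }\by , \end{equation}
where the last step requires \(\bX ^{\top }\bX \) to be invertible.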
Local-minimum gradient-based iterative algorithms This family of algorithms converges to a local minimum, which is guaranteed to be the global one only for convex (preferably strictly convex) loss functions. Examples include gradient descent (GD) and its modifications (e.g., stochastic GD), used to train NN parameters, and the Newton-Raphson algorithm.
• Some advanced algorithms in this category employ (or require) the second-order derivative \(\frac {\partial ^2 \loss }{\partial \bw ^2}\) for faster convergence.
• If a derivative is not available in closed form, it is evaluated numerically (e.g., by finite differences).
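The sketch below compares plain gradient descent with a Newton-Raphson update on a one-dimensional, strictly convex toy loss, with both derivatives evaluated numerically by central differences. The loss \(\cosh (w-1)\), the step sizes, and the tolerances are assumptions chosen only for illustration.

```python
import math


def loss(w):
    """Hypothetical strictly convex loss with its minimum at w = 1."""
    return math.cosh(w - 1.0)


def d1(f, w, h=1e-4):
    """First derivative by central differences."""
    return (f(w + h) - f(w - h)) / (2.0 * h)


def d2(f, w, h=1e-4):
    """Second derivative by central differences."""
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / (h * h)


def gradient_descent(f, w, lr=0.1, tol=1e-6, max_iter=10_000):
    """First-order update: w <- w - lr * f'(w)."""
    for step in range(max_iter):
        g = d1(f, w)
        if abs(g) < tol:
            return w, step
        w -= lr * g
    return w, max_iter


def newton_raphson(f, w, tol=1e-6, max_iter=100):
    """Second-order update: w <- w - f'(w) / f''(w)."""
    for step in range(max_iter):
        g = d1(f, w)
        if abs(g) < tol:
            return w, step
        w -= g / d2(f, w)
    return w, max_iter


w_gd, steps_gd = gradient_descent(loss, 3.0)
w_nr, steps_nr = newton_raphson(loss, 3.0)
print(f"GD:     w = {w_gd:.6f} after {steps_gd} steps")
print(f"Newton: w = {w_nr:.6f} after {steps_nr} steps")
```

On this loss the second-order method needs far fewer iterations, at the price of evaluating (or approximating) the second derivative in every step.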
Global optimizers Global optimizers aim to find a global minimum of a non-convex function. They may be gradient-free, first-derivative, or second-derivative based. Their complexity is significantly higher than that of local optimizers and can be prohibitive for more than a few hundred optimization variables in \(\bw \).
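The difference between local and global search can be illustrated on a one-dimensional non-convex toy function. In the sketch below, uniform random search stands in for a gradient-free global optimizer; this is a deliberately naive choice, and the function, bounds, and sample count are assumptions of the example.

```python
import random


def f(w):
    """Hypothetical non-convex loss with a local and a global minimum."""
    return w ** 4 - 3.0 * w ** 2 + w


def df(w):
    """Analytic derivative of f."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def local_gd(w, lr=0.01, steps=500):
    """Gradient descent: converges to the minimum nearest to the start."""
    for _ in range(steps):
        w -= lr * df(w)
    return w


def random_search(bounds, n_samples=2000, seed=0):
    """Gradient-free global search: keep the best of many random samples."""
    rng = random.Random(seed)
    low, high = bounds
    return min((rng.uniform(low, high) for _ in range(n_samples)), key=f)


w_local = local_gd(2.0)
w_global = random_search((-3.0, 3.0))
print(f"local GD:      w = {w_local:.3f}, f = {f(w_local):.3f}")
print(f"random search: w = {w_global:.3f}, f = {f(w_global):.3f}")
```

Started at \(w=2\), gradient descent stops in the shallower minimum, while the random search locates the deeper one at the cost of many more function evaluations. That cost grows rapidly with the number of variables, which is the complexity issue noted above.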
7.5 Project Steps
The workflow of Fig. 7.1 captures the technical core of an ML/DL solution but says little about the surrounding project. A widely adopted process framework that does is the Cross-Industry Standard Process for Data Mining (CRISP-DM), which organizes a project into six iterative phases: findings in any phase may require revisiting earlier ones. Each phase is described below together with practical guidelines for executing it well.
1. Business understanding: define the problem, success criteria, and project plan. Domain knowledge matters most in this phase.
   • Translate the business need into a concrete ML goal type and a single, measurable success criterion.
   • Choose the performance metric and the numerical performance goal in advance, before looking at any model output.
   • Estimate human-level performance, if relevant, as a reference point for what is achievable.
   • Identify constraints early: data access, latency, interpretability, and regulatory or safety requirements.
2. Data understanding: collect initial data and perform exploratory analysis to learn what the dataset actually contains.
   • Compute basic statistics (mean, variance, ranges, missingness) and create visualizations for every feature and the target.
   • Verify that the sample is representative of the deployment distribution and that the dataset size is sufficient for the chosen model family.
   • Flag outliers, label noise, and class imbalance early; these shape later decisions on loss and metric.
3. Data preparation: clean, transform, and construct the final dataset. This is typically the most time-consuming phase.
   • Validate dataset integrity (consistent units, deduplication, handling of missing values).
   • Apply normalization or standardization where appropriate (Sec. 4.2).
   • Construct features and split the data into training, validation, and test sets in a way that reflects the deployment scenario (e.g., temporal split for time-series).
4. Modeling: select and train candidate models.
   • Start with a standard, pre-implemented model; avoid overly complex baselines.
   • Build an end-to-end pipeline data \(\rightarrow \) model \(\rightarrow \) loss \(\rightarrow \) metrics, and debug it until the metric value is sane on a known-easy slice of data.
   • Only after the baseline is healthy, introduce more expressive models or richer features.
5. Evaluation: assess model performance against the business objectives, not only against the loss.
   • Compare the metric to the goal set in phase 1 and to the human-level reference.
   • Diagnose overfitting and underfitting via bias/variance analysis (Sec. 4.2).
   • Investigate the largest errors qualitatively; they often reveal dataset-integrity problems or unmodeled regimes.
   • Improve performance iteratively by tuning hyper-parameters, adding data, or refining the model. If the goal remains out of reach, revisit phase 1 rather than over-tuning.
6. Deployment: integrate the model into the target application. From this point on, the system is governed by the MLOps practices of the next section: serving, monitoring, retraining triggers, and versioning of both data and model.
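Two of the data-preparation guidelines in phase 3, the temporal split and standardization, can be sketched as follows. The record layout (a time index `t` and a single feature `x`), the 60/20/20 proportions, and the helper names are hypothetical.

```python
import statistics


def temporal_split(rows, train_frac=0.6, val_frac=0.2):
    """Split time-ordered records so that no future sample leaks into training."""
    rows = sorted(rows, key=lambda r: r["t"])
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_standardizer(values):
    """Estimate mean and standard deviation from the training set only."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0.0:
        sigma = 1.0  # guard against constant features
    return mu, sigma


def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]


# Hypothetical records, deliberately given in shuffled time order.
rows = [{"t": t, "x": float(10 + 2 * t)} for t in [7, 2, 9, 0, 4, 1, 8, 3, 6, 5]]

train, val, test = temporal_split(rows)
mu, sigma = fit_standardizer([r["x"] for r in train])

# The same training statistics are reused for validation and test data.
z_train = standardize([r["x"] for r in train], mu, sigma)
z_val = standardize([r["x"] for r in val], mu, sigma)
z_test = standardize([r["x"] for r in test], mu, sigma)
print([r["t"] for r in train], [r["t"] for r in val], [r["t"] for r in test])
```

Fitting the standardizer on the training portion alone mirrors deployment, where statistics of future data are not available.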
7.6 MLOps
ML projects differ from traditional software: outputs are probabilistic, development is experimental, and up to 80% of the effort may go into data preparation. ML operations (MLOps) extends software development operations (DevOps) principles to address these challenges, bridging model development and production deployment (Fig. 7.3).
CI/CD Continuous Integration (CI) automatically builds and tests code changes upon each commit. Continuous Delivery (CD) extends CI by automating the release of validated changes to production. In MLOps, CI/CD pipelines additionally handle data validation, model training, and model deployment.
The core MLOps practices are:
• Experiment tracking: logging parameters, metrics, and artifacts for reproducibility.
• Data and model versioning: tracking dataset and model changes over time.
• Automated testing: data and model validation before deployment.
• ML pipeline automation: automating the training workflow via CI/CD pipelines.
• Model serving and inference: deploying trained models for real-time or batch predictions.
• Performance monitoring: tracking model accuracy, data drift, and concept drift in production.
MLOps maturity is commonly described by three levels of automation:
• Level 0, Manual: all steps (data preparation, training, deployment) are performed manually.
• Level 1, ML pipeline automation: model training and validation are automated via pipelines.
• Level 2, CI/CD automation: automated testing, deployment, and monitoring close the feedback loop.
Concept drift: A change over time in the relationship between the input features and the target; a shift in the input distribution alone is termed data drift. Both are especially relevant for time-series models, where the underlying process may evolve. Monitoring for drift and triggering model retraining are essential for maintaining prediction quality.
Example 7.1: A demand-forecasting model for a grocery chain is trained on two years of pre-pandemic sales. After deployment, consumer behavior shifts: online orders surge, certain staple items are stockpiled, and the seasonality of fresh-produce sales changes. The mapping from features (day-of-week, promotion, weather) to demand \(y\) is no longer the same as in the training data, and forecast errors grow even though the model itself is unchanged. Monitoring the rolling prediction error and the input-feature distribution detects the drift and triggers retraining on recent data.
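The error-monitoring part of Example 7.1 can be sketched as below. The class `DriftMonitor`, the frozen model, the window length, the alarm factor, and the simulated demand are all hypothetical; monitoring of the input-feature distribution is not covered by this sketch.

```python
import random
from collections import deque


class DriftMonitor:
    """Raise an alarm when the rolling mean absolute error exceeds
    factor * baseline_error over a full window of recent samples."""

    def __init__(self, baseline_error, window=30, factor=2.0):
        self.baseline_error = baseline_error
        self.factor = factor
        self.errors = deque(maxlen=window)

    def update(self, y_true, y_pred):
        self.errors.append(abs(y_true - y_pred))
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough history yet
        rolling_error = sum(self.errors) / len(self.errors)
        return rolling_error > self.factor * self.baseline_error


def frozen_model(promo):
    """Hypothetical deployed model, fitted before the drift and never updated."""
    return 100.0 + 10.0 * promo


rng = random.Random(0)
# Assumed baseline: the mean absolute error observed at validation time.
monitor = DriftMonitor(baseline_error=4.0, window=30, factor=2.0)

first_alarm = None
for day in range(200):
    promo = 1 if day % 7 == 0 else 0
    shift = 30.0 if day >= 100 else 0.0  # simulated change of behavior at day 100
    demand = 100.0 + shift + 10.0 * promo + rng.gauss(0.0, 5.0)
    if monitor.update(demand, frozen_model(promo)) and first_alarm is None:
        first_alarm = day  # in practice: trigger retraining on recent data

print(f"drift alarm raised on day {first_alarm}")
```

The window length trades detection delay against false alarms: a shorter window reacts sooner but is more sensitive to ordinary noise.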