Machine Learning & Signals Learning
17 From Machine Learning to Deep Learning
17.1 Three Paradigms of Learning
Predictive systems can be grouped into three families along an axis of how much of the feature engineering is delegated to the learner (Fig. 17.1).
Rule-based: The mapping from input to output is hand-crafted by a domain expert; no parameters are estimated from data. Examples: classical computer-vision filters, fire-control logic in automatic weapon systems, expert rule engines in medicine.
Classical machine learning: A human designs a feature representation and a learning algorithm fits the mapping from features to output. Examples: signal classification from engineered features (amplitude, dominant frequency, spectral entropy), random-forest credit scoring on tabular features.
Deep learning: Both the feature representation and the mapping are learned end-to-end from raw data. Examples: large language models such as ChatGPT, diffusion-based image generators, end-to-end speech recognition.
17.2 Prominent Historical Points
Two public competitions mark the shift from classical machine learning to deep learning. Together they bracket the transition: the first was won by classical methods just before deep learning broke through; the second was overturned by it.
17.2.1 The Netflix Prize (2006–2009)
A public competition to predict user ratings of films from previous ratings alone, with no auxiliary information about users or films. The goal was to beat the in-house Netflix algorithm (Cinematch) by \(10\,\%\), with a grand prize of $1,000,000 and several $50,000 progress prizes along the way. The dataset contained \(100{,}480{,}507\) ratings given by \(480{,}189\) users to \(17{,}770\) movies on a discrete scale of \(1\)–\(5\); predictions were allowed to be real-valued. Each \((\text {user},\text {movie})\) entry was timestamped, but no demographic, textual, or visual information about users or films was provided.
Performance was measured by the root-mean-square error
\begin{equation*} \text {RMSE} = \sqrt {\frac {1}{M}\sum _{i=1}^{M}(\hat {r}_i - r_i)^2}, \end{equation*}
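on a held-out test set of \(M\) ratings, where \(\hat {r}_i\) is the predicted and \(r_i\) the true rating of entry \(i\). To win, a team had to reduce the Cinematch baseline of \(0.9525\) to \(0.8572\). The contest ran from October 2, 2006 to July 26, 2009 and attracted about \(2{,}000\) teams from over \(150\) countries.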
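The metric and the target can be checked with a few lines of Python. This is a minimal sketch: the four toy predictions and ratings are hypothetical, and only the baseline of \(0.9525\) and the \(10\,\%\) goal come from the text.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy data: real-valued predictions against integer 1-5 ratings.
predicted = [3.8, 2.1, 4.6, 1.2]
actual = [4, 2, 5, 1]
toy_rmse = rmse(predicted, actual)

# A 10 % improvement over the Cinematch baseline gives the winning threshold.
cinematch_rmse = 0.9525
target_rmse = cinematch_rmse * (1 - 0.10)  # 0.85725, quoted as 0.8572
```

Because the errors are squared before averaging, a single badly mispredicted rating costs more than several small misses of the same total size.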
Within a few weeks of launch, an independent contestant (Simon Funk) publicly described a stochastic-gradient form of matrix factorization that beat Cinematch by several percent; for the rest of the contest, latent-factor models stayed at the core of every leading entry. The grand prize was awarded to the team BellKor’s Pragmatic Chaos, a late-stage merger of three groups, whose final submission combined more than a hundred classical predictors (matrix factorization, restricted Boltzmann machines, gradient-boosted trees, and \(k\)-nearest-neighbor models) through linear blending. A rival team, The Ensemble, matched the score numerically but submitted twenty minutes later and lost on the tie-breaking rule. Deep architectures had not yet emerged as competitive predictors.
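The idea behind stochastic-gradient matrix factorization can be sketched in plain Python. This is an illustrative toy, not Funk's code or any contest entry: the model \(\hat {r}_{ui} = \mu + p_u \cdot q_i\), the hyperparameters, the function names `train_mf` and `predict`, and the \(4\times 4\) rating pattern are all assumptions chosen for readability.

```python
import math
import random

def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.05, reg=0.02,
             n_epochs=500, seed=0):
    """Fit r_ui ~ mu + p_u . q_i by SGD over the observed ratings only."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    order = list(ratings)
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit the ratings in a fresh random order
        for u, i, r in order:
            err = r - (mu + sum(pf * qf for pf, qf in zip(P[u], Q[i])))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularisation.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q

def predict(mu, P, Q, u, i):
    """Predicted rating of item i by user u."""
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Hypothetical data: users 0-1 like items 0-1, users 2-3 like items 2-3.
# Four of the sixteen (user, item, rating) entries are withheld.
ratings = [(0, 0, 5), (0, 2, 1), (0, 3, 1), (1, 0, 5), (1, 1, 5), (1, 3, 1),
           (2, 0, 1), (2, 1, 1), (2, 2, 5), (3, 1, 1), (3, 2, 5), (3, 3, 5)]
mu, P, Q = train_mf(ratings, n_users=4, n_items=4)
train_rmse = math.sqrt(
    sum((r - predict(mu, P, Q, u, i)) ** 2 for u, i, r in ratings) / len(ratings))
held_out_like = predict(mu, P, Q, 0, 1)     # withheld entry, true rating 5
held_out_dislike = predict(mu, P, Q, 1, 2)  # withheld entry, true rating 1
```

The latent vectors are learned only from observed entries, yet they yield a prediction for every (user, movie) pair, which is what made the approach a natural fit for the sparse Netflix matrix.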
The lasting influence of the prize was less on the algorithms themselves than on the surrounding practices: large-scale public benchmarks with held-out test sets, public leaderboards, and team-merger dynamics later codified by Kaggle. Two cautionary footnotes also followed: the winning ensemble was never deployed in production because its engineering cost outweighed the rating-quality gain over a simpler in-house model, and a planned sequel competition was canceled in 2010 after researchers showed that supposedly anonymized users in the released dataset could be re-identified by cross-referencing public ratings on other sites.
17.2.2 The ImageNet Challenge (2010–2017)
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual computer-vision competition held from 2010 to 2017. The ILSVRC subset (ImageNet-1K) contains \(1{,}000\) classes with \(1{,}281{,}167\) training, \(50{,}000\) validation, and \(100{,}000\) test images; the larger ImageNet-21K contains \(14{,}197{,}122\) images across \(21{,}841\) classes. Performance was reported using both top-\(1\) and top-\(5\) error, where the top-\(5\) error is the fraction of test images for which the correct label is not among the model’s five most probable predictions (formal definitions in Sec. 18.2.1).
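The top-\(k\) error can be computed directly from per-class scores. The sketch below is a toy illustration with four hypothetical examples over four classes, so it uses \(k=1\) and \(k=2\) rather than the \(k=5\) of ILSVRC; the function name `top_k_error` is an assumption.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k top-scoring classes."""
    misses = 0
    for class_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

# Hypothetical model outputs (one row per image) and true class indices.
scores = [
    [0.60, 0.20, 0.10, 0.10],  # true class 0: correct at rank 1
    [0.10, 0.50, 0.30, 0.10],  # true class 2: correct only at rank 2
    [0.40, 0.30, 0.20, 0.10],  # true class 3: ranked last
    [0.05, 0.15, 0.10, 0.70],  # true class 3: correct at rank 1
]
labels = [0, 2, 3, 3]

top1 = top_k_error(scores, labels, k=1)  # 2 of 4 missed
top2 = top_k_error(scores, labels, k=2)  # 1 of 4 missed
```

The top-\(k\) error can never exceed the top-\(1\) error, since enlarging \(k\) only adds candidate labels.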
On September 30, 2012, AlexNet, a deep convolutional neural network trained on GPUs, decisively beat the classical Fisher-vector pipelines that had won in previous years (Table 17.1). The jump is widely credited with launching the modern deep-learning era: within five years deep models dominated the leaderboard, error fell below the human reference of about \(5\,\%\), and the competition was retired.
| Year | Entry | Approach | Top-\(1\) err. (%) | Top-\(5\) err. (%) |
| 2011 | XRCE (winner) | hand-crafted features | \(\sim 50\) | 25.8 |
| 2012 | ISI (runner-up) | hand-crafted features | \(\sim 46\) | 26.2 |
| 2012 | AlexNet (winner) | 8-layer CNN + GPU debut | 37.5 | 15.3 |
| 2014 | GoogLeNet (winner) | 22-layer CNN | \(\sim 26\) | 6.7 |
| 2015 | ResNet (winner) | 152-layer CNN | \(\sim 19\) | 3.57 |
| 2017 | SENet (winner) | CNN + attention | \(\sim 17\) | 2.25 |
The challenge was retired after 2017 because it had stopped discriminating between top entries: \(29\) of the \(38\) submitting teams had top-\(5\) error below \(5\,\%\), and the leading entries had fallen to roughly \(2\,\%\), comparable to the estimated label-noise floor of the test set itself, so further gains were no longer measuring image-recognition quality. The community moved on to harder benchmarks (detection, segmentation, larger and noisier web-scale corpora, and self-supervised pre-training).
17.3 Enablers
What changed between the Netflix Prize and the ImageNet jump was not the theoretical idea of deep networks (it had been around for decades), but three practical enablers that arrived together: the data needed to train an over-parametrised model (Sec. 17.3.1), the hardware to fit it in tractable wall-clock time (Sec. 17.3.2), and the empirical observation that deep networks keep improving as both grow (Sec. 17.3.3).
17.3.1 The Data Explosion
A modern camera recording in Full HD produces \(1920\times 1080\) pixels per frame, each described by three colour channels (RGB) of \(1\) byte (\(8\) bit) each. At a typical frame rate of \(30\) fps, the uncompressed stream is
\begin{equation*} 1920\times 1080\times 3\times 30 \;\approx \; 2\times 10^8 \text {\,bytes/s} \;\approx \; 200\text {\,MB/s}. \end{equation*}
A single hour of raw HD video therefore comes to roughly \(670\) GB; multiply by millions of cameras, microphones, and clickstreams across the internet and the corpus available for training is enormous.
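The arithmetic is easy to reproduce. The sketch below assumes decimal units (\(1\) MB \(= 10^6\) bytes, \(1\) GB \(= 10^9\) bytes) and no compression; the exact products come out slightly below the rounded \(200\) MB/s.

```python
# Uncompressed Full-HD RGB video, using the figures from the text.
width, height = 1920, 1080
channels = 3            # R, G, B
bytes_per_channel = 1   # 8 bit
fps = 30

bytes_per_frame = width * height * channels * bytes_per_channel
bytes_per_second = bytes_per_frame * fps
bytes_per_hour = bytes_per_second * 3600

mb_per_second = bytes_per_second / 1e6  # about 187 MB/s
gb_per_hour = bytes_per_hour / 1e9      # about 672 GB per hour
```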
17.3.2 Hardware: GPUs and TPUs
Training and inference in neural networks consist almost entirely of dense matrix and tensor operations, which parallelise trivially across thousands of cores. GPUs are designed for exactly this workload (Table 17.2).
| Parameter | CPU | GPU (NVIDIA) | Note |
| Single-thread speed | Fast | Slow | Latency-optimised vs. throughput-optimised |
| Parallel cores | 8–64 | \(\sim 10^4\) | GPU has many simple cores |
| Execution model | Independent threads (MIMD) | Same program, many data items (SIMT) | GPU well-suited to tensor ops |
| Cache | Hardware-managed, multi-level | Programmer-managed shared memory | GPU memory hierarchy is explicit |
| Memory | DDR, large, slower | HBM, \(16\)–\(80\) GB onboard, much faster bandwidth | GPU memory dominates throughput |
| Power | \(\sim 65\) W | \(150\)–\(700\) W | Datacentre GPUs need dedicated cooling |
| Multi-device scaling | Limited | NVLink / InfiniBand fabrics | Routine scaling to thousands of GPUs |
| Precision | FP64 standard | FP32 / FP16 / BF16 / FP8 | Lower precision \(\Rightarrow \) higher throughput |
Two GPU-specific features were decisive for deep learning:
• Tensor Cores (NVIDIA Volta and later): dedicated units that perform a fused mixed-precision matrix-multiply-accumulate in a single cycle, giving an order-of-magnitude throughput advantage on the operations that dominate training.
• A mature software stack (CUDA, cuDNN, cuBLAS), on which all the major deep-learning frameworks (PyTorch, TensorFlow, JAX) are built.
An example of further specialization is Google’s Tensor Processing Unit (TPU), an application-specific integrated circuit (ASIC) built around a large systolic-array matrix multiplier and optimized exclusively for tensor workloads. NVIDIA, AMD, and several startups now ship comparable accelerators.
17.3.3 Scaling Behaviour
Data and hardware would be of little use if the model could not benefit from them. The defining property of deep networks is that they keep improving as both grow.
Scalability of a model: A model is scalable if both of the following hold:
• Statistical scalability: test-set performance keeps improving as the training set grows.
• Computational scalability: the training procedure can absorb the larger dataset at a tolerable wall-clock and memory cost (e.g., through parallelization across GPUs).
A linear or otherwise low-capacity model has an asymptotic performance ceiling: once enough data is available to estimate its parameters reliably, further samples bring diminishing returns because the hypothesis class itself is too restrictive. Deep networks behave differently: when the model is enlarged in tandem with the dataset, performance keeps improving across many orders of magnitude in dataset size. This empirical observation is the operational definition of “deep learning at scale” (Fig. 17.2), and it is what made the AlexNet jump in Sec. 17.2 possible.
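The contrast between a capacity ceiling and continued improvement can be illustrated without a neural network. In the sketch below, a straight-line fit plays the low-capacity model, and a histogram regressor whose bin count grows with the sample size stands in for a model enlarged in tandem with the dataset. It is not a deep network. The target function, noise level, and bin rule are assumptions chosen for illustration.

```python
import math
import random

def target(x):
    """Hypothetical ground-truth function on [0, 1]."""
    return math.sin(2 * math.pi * x)

def sample(n, rng, noise=0.1):
    xs = [rng.random() for _ in range(n)]
    ys = [target(x) + rng.gauss(0, noise) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    """Closed-form least-squares line: fixed capacity."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def fit_histogram(xs, ys, n_bins):
    """Piecewise-constant fit: capacity is set by n_bins."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    fallback = sum(ys) / len(ys)  # used only if a bin is empty
    means = [s / c if c else fallback for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]

def grid_mse(model, n_grid=1000):
    """Mean squared error against the noise-free target on a fixed grid."""
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    return sum((model(x) - target(x)) ** 2 for x in grid) / n_grid

rng = random.Random(0)
sizes = [100, 1000, 10000, 100000]
linear_mse, flexible_mse = [], []
for n in sizes:
    xs, ys = sample(n, rng)
    linear_mse.append(grid_mse(fit_linear(xs, ys)))
    # Capacity grows with the data: roughly n^(1/3) bins.
    flexible_mse.append(grid_mse(fit_histogram(xs, ys, n_bins=round(n ** (1 / 3)))))
```

Under these assumptions the line's error stays near the residual that no straight line can remove, regardless of sample size, while the growing-capacity model keeps reducing its error at every step in `sizes`.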
17.4 Applications and Architectures
Neural networks are flexible enough to tackle problems across very different data modalities. Table 17.3 maps a few canonical pairings of data type, task, and architecture; the rest of this part of the book introduces these architectures one by one.
| Learning type | Data modality | Example | Task | Architecture |
| Supervised | Tabular | Home / car description | Price prediction | Fully-connected NN |
| Supervised | Sequential (time) | Audio waveform | Speech transcription | Recurrent NN |
| Supervised | Sequential (text) | Sentence | Translation | Recurrent NN / Transformer |
| Supervised | Image (2D) | Photographs | Image classification | Convolutional NN |
| Unsupervised | Image (2D) | Unlabelled images | Representation learning | Contrastive / self-supervised NN |
| Unsupervised | Image (2D) | Noisy images | Denoising | Autoencoder |