Machine Learning & Signals Learning


17 From Machine Learning to Deep Learning

  • Goal: Frame where neural networks fit among learning algorithms and motivate the shift to deep learning.

17.1 Three Paradigms of Learning

Predictive systems can be grouped into three families along an axis of how much of the feature engineering is delegated to the learner (Fig. 17.1).

(image)

Figure 17.1: Three learning paradigms, ordered by how much of the pipeline is learned from data rather than designed by hand.

Rule-based: The mapping from input to output is hand-crafted by a domain expert; no parameters are estimated from data. Examples: classical computer-vision filters, fire-control logic in automatic weapon systems, expert rule engines in medicine.

Classical machine learning: A human designs a feature representation and a learning algorithm fits the mapping from features to output. Examples: signal classification from engineered features (amplitude, dominant frequency, spectral entropy), random-forest credit scoring on tabular features.

Deep learning: Both the feature representation and the mapping are learned end-to-end from raw data. Examples: large language models such as ChatGPT, diffusion-based image generators, end-to-end speech recognition.
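The difference between the first two paradigms can be made concrete with a toy signal-classification task. The sketch below is illustrative only: the sampling rate, the test frequencies, the 50 Hz rule, and the midpoint-of-class-means learner are all assumptions chosen for the example. Both classifiers use the same hand-designed feature (dominant frequency); only the origin of the decision threshold differs. A deep-learning approach would instead take the raw waveform as input and learn the feature as well, which is not shown here.

```python
import numpy as np

FS = 1000  # sampling rate in Hz (assumed for this toy example)

def dominant_frequency(x, fs=FS):
    """Hand-designed feature: frequency of the largest non-DC spectral peak."""
    spectrum = np.abs(np.fft.rfft(x))
    spectrum[0] = 0.0  # ignore the DC component
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

def make_signal(freq, rng, n=1000, fs=FS):
    """Noisy sinusoid at the given frequency (synthetic data)."""
    t = np.arange(n) / fs
    return np.sin(2 * np.pi * freq * t) + 0.1 * rng.standard_normal(n)

rng = np.random.default_rng(0)
signals = [make_signal(f, rng) for f in (20, 25, 30, 35, 65, 70, 75, 80)]
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = low band, 1 = high band

features = np.array([dominant_frequency(x) for x in signals])

# Rule-based: the threshold is fixed by a (hypothetical) domain expert.
RULE_THRESHOLD_HZ = 50.0
rule_pred = (features > RULE_THRESHOLD_HZ).astype(int)

# Classical ML: same hand-designed feature, but the threshold is
# estimated from labelled data (midpoint between the class means).
learned_threshold = 0.5 * (features[labels == 0].mean()
                           + features[labels == 1].mean())
ml_pred = (features > learned_threshold).astype(int)

print("rule-based accuracy:", (rule_pred == labels).mean())
print("learned threshold [Hz]:", learned_threshold)
print("classical-ML accuracy:", (ml_pred == labels).mean())
```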

17.2 Prominent Historical Points

Two public competitions mark the turning point from classical machine learning to deep learning. Together they bracket the transition: the first was won by classical methods just before deep learning broke through; the second was overturned by it.

17.2.1 The Netflix Prize (2006–2009)

A public competition to predict user ratings of films from previous ratings alone, with no auxiliary information about users or films. The goal was to beat the in-house Netflix algorithm (Cinematch) by \(10\,\%\), with a grand prize of $1,000,000 and several $50,000 progress prizes along the way. The dataset contained \(100{,}480{,}507\) ratings given by \(480{,}189\) users to \(17{,}770\) movies on a discrete scale of \(1\)–\(5\); predictions were allowed to be real-valued. Each \((\text {user},\text {movie})\) entry was timestamped, but no demographic, textual, or visual information about users or films was provided.

Performance was measured by the root-mean-square error

\begin{equation*} \text {RMSE} = \sqrt {\frac {1}{M}\sum _{i=1}^{M}(\hat {r}_i - r_i)^2}, \end{equation*}

on a held-out test set. The Cinematch baseline of \(0.9525\) had to be reduced to \(0.8572\). The contest ran from October 2, 2006 to July 26, 2009 and attracted about \(2{,}000\) teams from over \(150\) countries.
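As a quick check of the metric and of the prize threshold, the following minimal sketch computes the RMSE for a handful of made-up ratings and the target implied by a \(10\,\%\) improvement over the Cinematch baseline. The function name and the toy ratings are assumptions for illustration; only the baseline value comes from the text.

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Toy example (made-up numbers): real-valued predictions vs. integer ratings.
toy_predictions = [3.5, 4.0, 2.0, 5.0]
toy_ratings = [4, 4, 1, 5]
toy_rmse = rmse(toy_predictions, toy_ratings)

# Prize threshold: a 10 % reduction of the Cinematch baseline.
CINEMATCH_RMSE = 0.9525
target_rmse = 0.9 * CINEMATCH_RMSE

print(f"toy RMSE    = {toy_rmse:.4f}")
print(f"target RMSE = {target_rmse:.5f}")
```

The product \(0.9 \times 0.9525 = 0.85725\) agrees with the \(0.8572\) threshold quoted above.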

Within a few weeks of launch, an independent contestant (Simon Funk) publicly described a stochastic-gradient form of matrix factorization that beat Cinematch by several percent; for the rest of the contest, latent-factor models stayed at the core of every leading entry. The grand prize was awarded to the team BellKor’s Pragmatic Chaos, a late-stage merger of three groups, whose final submission combined more than a hundred classical predictors (matrix factorization, restricted Boltzmann machines, gradient-boosted trees, and \(k\)-nearest-neighbor models) through linear blending. A rival team, The Ensemble, matched the score numerically but submitted twenty minutes later and lost on the tie-breaking rule. Deep architectures had not yet emerged as competitive predictors.
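The latent-factor idea can be sketched in a few lines. The code below is a minimal illustration of matrix factorization trained by stochastic gradient descent, not a reproduction of Funk's or any team's actual code: the global-mean offset, the hyperparameters, and the tiny synthetic rating matrix are all assumptions. Each user \(u\) and movie \(i\) gets a short vector, and the rating is predicted as the global mean plus their inner product.

```python
import numpy as np

def fit_latent_factors(ratings, n_users, n_items, n_factors=2,
                       lr=0.05, reg=0.02, n_epochs=200, seed=0):
    """SGD matrix factorization on a list of (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, n_factors))  # user factors
    Q = 0.1 * rng.standard_normal((n_items, n_factors))  # item factors
    mu = float(np.mean([r for _, _, r in ratings]))      # global mean rating
    for _ in range(n_epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - (mu + P[u] @ Q[i])
            p_old = P[u].copy()
            # Gradient step on the squared error with L2 regularization.
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, P, Q

def train_rmse(ratings, mu, P, Q):
    errs = [r - (mu + P[u] @ Q[i]) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# Synthetic data: two user groups with opposite tastes, four entries unobserved.
held_out = {(0, 3), (1, 0), (2, 1), (3, 2)}
ratings = [(u, i, 5.0 if (u < 2) == (i < 2) else 1.0)
           for u in range(4) for i in range(4) if (u, i) not in held_out]

mu, P, Q = fit_latent_factors(ratings, n_users=4, n_items=4)
baseline = float(np.sqrt(np.mean([(r - mu) ** 2 for _, _, r in ratings])))
fitted = train_rmse(ratings, mu, P, Q)
print(f"global-mean RMSE = {baseline:.3f}, latent-factor RMSE = {fitted:.3f}")
```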

The lasting influence of the prize was less on the algorithms themselves than on the surrounding practices: large-scale public benchmarks with held-out test sets, public leaderboards, and team-merger dynamics later codified by Kaggle. Two cautionary footnotes also followed: the winning ensemble was never deployed in production because its engineering cost outweighed the rating-quality gain over a simpler in-house model, and a planned sequel competition was canceled in 2010 after researchers showed that supposedly anonymized users in the released dataset could be re-identified by cross-referencing public ratings on other sites.

17.2.2 The ImageNet Challenge (2010–2017)

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual computer-vision competition held from 2010 to 2017. The ILSVRC subset (ImageNet-1K) contains \(1{,}000\) classes with \(1{,}281{,}167\) training, \(50{,}000\) validation, and \(100{,}000\) test images; the larger ImageNet-21K contains \(14{,}197{,}122\) images across \(21{,}841\) classes. Performance was reported using both top-\(1\) and top-\(5\) error, where the top-\(5\) error is the fraction of test images for which the correct label is not among the model’s five most probable predictions (formal definitions in Sec. 18.2.1).
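The top-\(k\) error can be computed directly from a matrix of class scores. The following is a minimal sketch under the definition given above; the function name and the tiny six-class example are assumptions for illustration.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest scores.

    scores: array of shape (n_samples, n_classes)
    labels: integer array of shape (n_samples,)
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    top_k = np.argsort(scores, axis=1)[:, -k:]       # indices of the k largest scores
    hit = (top_k == labels[:, None]).any(axis=1)     # true label among them?
    return float(1.0 - hit.mean())

# Toy example: 4 samples, 6 classes (made-up scores).
scores = np.array([
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],
    [0.40, 0.30, 0.15, 0.10, 0.04, 0.01],
    [0.05, 0.05, 0.10, 0.20, 0.25, 0.35],
])
labels = np.array([1, 2, 5, 5])

top1 = top_k_error(scores, labels, k=1)
top5 = top_k_error(scores, labels, k=5)
print("top-1 error:", top1)
print("top-5 error:", top5)
```

In this example the second sample is wrong at top-\(1\) but counted as correct at top-\(5\), which is why the top-\(5\) error is always at most the top-\(1\) error.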

On September 30, 2012, the convolutional neural network AlexNet won the challenge by a wide margin over the classical Fisher-vector pipelines that had won in previous years, cutting the top-\(5\) error from \(26.2\,\%\) (runner-up) to \(15.3\,\%\) (Table 17.1). The result was enabled by GPU training of a deep CNN. The jump is widely credited with launching the modern deep-learning era: within five years deep models dominated the leaderboard, error fell below the human reference of about \(5\,\%\), and the competition was retired.

Table 17.1: Selected ILSVRC classification results (top-\(1\) and top-\(5\) error). The 2011 winner and the 2012 runner-up still rely on hand-engineered features; AlexNet replaces them with a learned deep representation.
| Year | Entry | Approach | Top-\(1\) err. (%) | Top-\(5\) err. (%) |
|---|---|---|---|---|
| 2011 | XRCE (winner) | hand-crafted features | \(\sim 50\) | 25.8 |
| 2012 | ISI (runner-up) | hand-crafted features | \(\sim 46\) | 26.2 |
| 2012 | AlexNet (winner) | 8-layer CNN + GPU debut | 37.5 | 15.3 |
| 2014 | GoogLeNet (winner) | 22-layer CNN | \(\sim 26\) | 6.7 |
| 2015 | ResNet (winner) | 152-layer CNN | \(\sim 19\) | 3.57 |
| 2017 | SENet (winner) | CNN + attention | \(\sim 17\) | 2.25 |

The challenge was retired after 2017 because it had stopped discriminating between top entries: \(29\) of the \(38\) submitting teams had top-\(5\) error below \(5\,\%\), and the leading entries had fallen to roughly \(2\,\%\), comparable to the estimated label-noise floor of the test set itself, so further gains were no longer measuring image-recognition quality. The community moved on to harder benchmarks (detection, segmentation, larger and noisier web-scale corpora, and self-supervised pre-training).

17.3 Enablers

What changed between the Netflix Prize and the ImageNet jump was not the theoretical idea of deep networks (it had been around for decades), but three practical enablers that arrived together: the data needed to train an over-parametrised model (Sec. 17.3.1), the hardware to fit it in tractable wall-clock time (Sec. 17.3.2), and the empirical observation that deep networks keep improving as both grow (Sec. 17.3.3).

17.3.1 The Data Explosion

A Full-HD camera produces frames of \(1920\times 1080\) pixels (about \(2\) megapixels), each pixel described by three colour channels (RGB) of \(1\) byte (\(8\) bit) each. At a typical frame rate of \(30\) fps, the uncompressed stream is

\begin{equation*} 1920\times 1080\times 3\times 30 \;\approx \; 2\times 10^8 \text {\,bytes/s} \;\approx \; 200\text {\,MB/s}. \end{equation*}

At the exact rate of about \(187\) MB/s, a single hour of raw HD video amounts to roughly \(670\) GB; multiply by millions of cameras, microphones, and clickstreams across the internet and the corpus available for training is enormous.
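The arithmetic can be checked in a few lines. This is a plain calculation under the assumptions stated above (Full-HD resolution, three \(8\)-bit channels, \(30\) fps, no compression); the variable names are illustrative.

```python
# Uncompressed video data rate for a single Full-HD camera.
width, height = 1920, 1080      # pixels per frame
bytes_per_pixel = 3             # RGB, 1 byte per channel
fps = 30                        # frames per second

bytes_per_second = width * height * bytes_per_pixel * fps
bytes_per_hour = bytes_per_second * 3600

print(f"{bytes_per_second:,} bytes/s  (~{bytes_per_second / 1e6:.0f} MB/s)")
print(f"{bytes_per_hour:,} bytes/h  (~{bytes_per_hour / 1e9:.0f} GB/h)")
```

The exact figure is \(186{,}624{,}000\) bytes/s, or about \(672\) GB per hour; using the rounded \(200\) MB/s instead gives \(720\) GB per hour.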

17.3.2 Hardware: GPUs and TPUs

Training and inference in neural networks consist almost entirely of dense matrix and tensor operations, which parallelise trivially across thousands of cores. GPUs are designed for exactly this workload (Table 17.2).

Table 17.2: High-level comparison between CPU and GPU for neural-network workloads.
| Parameter | CPU | GPU (NVIDIA) | Note |
|---|---|---|---|
| Single-thread speed | Fast | Slow | Latency-optimised vs. throughput-optimised |
| Parallel cores | 8–64 | \(\sim 10^4\) | GPU has many simple cores |
| Execution model | Independent threads (MIMD) | Same program, many data items (SIMT) | GPU well-suited to tensor ops |
| Cache | Hardware-managed, multi-level | Programmer-managed shared memory | GPU memory hierarchy is explicit |
| Memory | DDR, large, slower | HBM, \(16\)–\(80\) GB onboard, much faster bandwidth | GPU memory dominates throughput |
| Power | \(\sim 65\) W | \(150\)–\(700\) W | Datacentre GPUs need dedicated cooling |
| Multi-device scaling | Limited | NVLink / InfiniBand fabrics | Routine scaling to thousands of GPUs |
| Precision | FP64 standard | FP32 / FP16 / BF16 / FP8 | Lower precision \(\Rightarrow \) higher throughput |

Two GPU-specific features were decisive for deep learning:

  • Tensor Cores (NVIDIA Volta and later): dedicated units that perform a fused mixed-precision matrix-multiply-accumulate in a single cycle, giving an order-of-magnitude throughput advantage on the operations that dominate training.

  • Mature software stack (CUDA, cuDNN, cuBLAS), on which all the major deep-learning frameworks (PyTorch, TensorFlow, JAX) are built.
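The numerical idea behind mixed-precision matrix-multiply-accumulate, \(D = AB + C\) with low-precision inputs and a higher-precision accumulator, can be emulated on a CPU. The sketch below uses NumPy only and says nothing about actual Tensor Core hardware or its tile sizes; the matrix dimensions and random inputs are assumptions. It compares FP16 inputs with FP32 accumulation against a computation in which every partial sum is rounded back to FP16.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 8, 512, 8  # illustrative sizes, not actual hardware tile dimensions

# Low-precision inputs (FP16) and a higher-precision accumulator input (FP32).
A = rng.uniform(0.0, 1.0, (M, K)).astype(np.float16)
B = rng.uniform(0.0, 1.0, (K, N)).astype(np.float16)
C = rng.uniform(0.0, 1.0, (M, N)).astype(np.float32)

# Reference result in FP64, using exactly the same FP16-rounded inputs.
reference = A.astype(np.float64) @ B.astype(np.float64) + C.astype(np.float64)

# Mixed precision: FP16 inputs, products and running sum kept in FP32.
D_mixed = A.astype(np.float32) @ B.astype(np.float32) + C

# Pure FP16: every partial sum is rounded back to FP16.
D_fp16 = C.astype(np.float16)
for k in range(K):
    D_fp16 = D_fp16 + np.outer(A[:, k], B[k, :])  # all operands stay float16

err_mixed = float(np.max(np.abs(D_mixed - reference)))
err_fp16 = float(np.max(np.abs(D_fp16.astype(np.float64) - reference)))

print(f"max error, FP16 inputs + FP32 accumulation: {err_mixed:.2e}")
print(f"max error, FP16 throughout:                 {err_fp16:.2e}")
```

Keeping the accumulator in FP32 preserves most of the accuracy while the inputs, which dominate memory traffic, stay in the cheaper format.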

An example of further specialization is Google’s Tensor Processing Unit (TPU), an application-specific integrated circuit (ASIC) built around a large systolic-array matrix multiplier and optimized exclusively for tensor workloads. NVIDIA, AMD, and several startups now ship comparable accelerators.

17.3.3 Scaling Behaviour

Data and hardware would be of little use if the model could not benefit from them. The defining property of deep networks is that they keep improving as both grow.

Scalability of a model: A model is scalable if both of the following hold:

  • Statistical scalability: test-set performance keeps improving as the training set grows.

  • Computational scalability: the training procedure can absorb the larger dataset at a tolerable wall-clock and memory cost (e.g., through parallelization across GPUs).

A linear or otherwise low-capacity model has an asymptotic performance ceiling: once enough data is available to estimate its parameters reliably, further samples bring diminishing returns because the hypothesis class itself is too restrictive. Deep networks behave differently: when the model is enlarged in tandem with the dataset, performance keeps improving across many orders of magnitude in dataset size. This empirical observation is the operational definition of “deep learning at scale” (Fig. 17.2), and it is what made the AlexNet jump in Sec. 17.2 possible.

(image)

Figure 17.2: Schematic performance vs. dataset size. Rule-based, linear, and shallow models saturate at successively higher levels; deep networks continue to improve as data and capacity grow together.

17.4 Applications and Architectures

Neural networks are flexible enough to tackle problems across very different data modalities. Table 17.3 maps a few canonical pairings of data type, task, and architecture; the rest of this part of the book introduces these architectures one by one.

Table 17.3: Canonical pairings of data modality, task, and neural-network architecture.
| Learning type | Data modality | Example | Task | Architecture |
|---|---|---|---|---|
| Supervised | Tabular | Home / car description | Price prediction | Fully-connected NN |
| Supervised | Sequential (time) | Audio waveform | Speech transcription | Recurrent NN |
| Supervised | Sequential (text) | Sentence | Translation | Recurrent NN / Transformer |
| Supervised | Image (2D) | Photographs | Image classification | Convolutional NN |
| Unsupervised | Image (2D) | Unlabelled images | Representation learning | Contrastive / self-supervised NN |
| Unsupervised | Image (2D) | Noisy images | Denoising | Autoencoder |