Machine Learning & Signals Learning


17 From Machine Learning to Deep Learning

Goal: Frame where neural networks fit among learning algorithms and motivate the shift to deep learning.

17.1 Three Paradigms of Learning

Predictive systems can be grouped into three families along an axis of how much of the feature engineering is delegated to the learner (Fig. 17.1).

Rule-based: The mapping from input to output is hand-crafted by a domain expert; no parameters are estimated from data. Examples: classical computer-vision filters, fire-control logic in automatic weapon systems, expert rule engines in medicine.

Classical machine learning: A human designs a feature representation and a learning algorithm fits the mapping from features to output. Examples: signal classification from engineered features (amplitude, dominant frequency, spectral entropy), random-forest credit scoring on tabular features.

Deep learning: Both the feature representation and the mapping are learned end-to-end from raw data. Examples: large language models such as ChatGPT, diffusion-based image generators, end-to-end speech recognition.
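As a rough illustration of the classical machine-learning paradigm, the three engineered features named above can be computed by hand before any learning takes place. The sketch below is not part of the original material: the function name `engineered_features`, the naive DFT, and the definition of spectral entropy as the Shannon entropy of the normalised power spectrum (DC bin excluded) are illustrative assumptions.

```python
import cmath
import math


def engineered_features(signal, sample_rate):
    """Return (amplitude, dominant frequency in Hz, spectral entropy in bits).

    Hypothetical helper: the feature definitions are one reasonable choice,
    not a standard fixed by the text.
    """
    n = len(signal)
    amplitude = max(abs(s) for s in signal)

    # Naive O(n^2) DFT over bins 1 .. n/2 (the DC bin k = 0 is skipped).
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(signal))
        power.append(abs(coeff) ** 2)

    total = sum(power)
    if total == 0:
        return amplitude, 0.0, 0.0  # constant signal: no spectral content

    dominant_bin = 1 + max(range(len(power)), key=power.__getitem__)
    dominant_freq = dominant_bin * sample_rate / n

    # Shannon entropy of the normalised power spectrum.
    probs = [p / total for p in power]
    entropy = -sum(q * math.log2(q) for q in probs if q > 0)
    return amplitude, dominant_freq, entropy


# A pure 5 Hz tone sampled at 64 Hz for one second.
tone = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
print(engineered_features(tone, sample_rate=64))
```

A classical learner (for example a random forest) would then be trained on these three numbers per signal, whereas a deep model would receive the raw samples directly.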

17.2 Prominent Historical Points

The outcomes of two public competitions mark the shift from classical machine learning to deep learning. Together they bracket the transition: the first was won by classical methods just before deep learning broke through; the second was overturned by it.

17.2.1 The Netflix Prize (2006–2009)

The Netflix Prize was a public competition to predict user ratings of films from previous ratings alone. The goal was to improve on the RMSE of the in-house Netflix algorithm (Cinematch) by $10\,\%$, with a grand prize of 1,000,000 USD and several 50,000 USD progress prizes along the way. The dataset contained $100{,}480{,}507$ ratings given by $480{,}189$ users to $17{,}770$ movies on a discrete scale of $1$–$5$; predictions were allowed to be real-valued. Each $(\text {user},\text {movie})$ entry was timestamped, but no demographic, textual, or visual information about users or films was provided.

Performance was measured by the root-mean-square error

\begin{equation*} \text {RMSE} = \sqrt {\frac {1}{M}\sum _{i=1}^{M}(\hat {r}_i - r_i)^2}, \end{equation*}

on a held-out test set of $M$ ratings, where $\hat {r}_i$ is the predicted and $r_i$ the true rating of entry $i$. To win, a team had to reduce the Cinematch baseline of $0.9525$ to $0.8572$. The contest ran from October 2, 2006 to July 26, 2009 and attracted about $2{,}000$ teams from over $150$ countries.
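The RMSE formula and the $10\,\%$ target can be checked with a few lines of Python. This is a minimal sketch; the function name `rmse` and the three toy ratings are illustrative and not taken from the contest data.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equally long rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Toy example: real-valued predictions against integer ratings.
print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))

# A 10 % improvement over the Cinematch baseline.
baseline = 0.9525
target = 0.90 * baseline
print(target)  # about 0.857, in line with the 0.8572 quoted in the text
```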

Within a few weeks of launch, an independent contestant (Simon Funk) publicly described a stochastic-gradient form of matrix factorization that beat Cinematch by several percent; for the rest of the contest, latent-factor models stayed at the core of every leading entry. The grand prize was awarded to the team BellKor’s Pragmatic Chaos, a late-stage merger of three groups, whose final submission combined more than a hundred classical predictors (matrix factorization, restricted Boltzmann machines, gradient-boosted trees, and $k$-nearest-neighbor models) through linear blending. A rival team, The Ensemble, matched the score numerically but submitted twenty minutes later and lost on the tie-breaking rule. Deep architectures had not yet emerged as competitive predictors.
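The latent-factor idea can be sketched in a few lines: each user $u$ and each movie $i$ receives a short vector, the predicted rating is their dot product, and both vectors are nudged along the gradient of the squared error for one observed rating at a time. The code below is a generic regularised SGD factorization under assumed hyperparameters (`n_factors`, `lr`, `reg`, `epochs`) and toy data; it is not a reconstruction of any contest entry.

```python
import math
import random


def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.02, reg=0.02, epochs=2000, seed=0):
    """Fit user factors P and item factors Q by stochastic gradient descent.

    ratings: list of (user_index, item_index, rating) triples.
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error plus L2 penalty.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def train_rmse(P, Q, ratings):
    """RMSE of the fitted model on the observed ratings."""
    sq = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))


# Toy data: 4 users, 3 movies, 9 of the 12 possible ratings observed.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2),
           (2, 2, 5), (3, 0, 1), (3, 1, 1), (3, 2, 4)]
P, Q = factorize(ratings, n_users=4, n_items=3)
print("training RMSE:", train_rmse(P, Q, ratings))
print("predicted rating for user 0, movie 2:", predict(P, Q, 0, 2))
```

The last line fills in an unobserved entry, which is exactly the prediction task of the contest, here on a toy scale.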

The lasting influence of the prize was less on the algorithms themselves than on the surrounding practices: large-scale public benchmarks with held-out test sets, public leaderboards, and team-merger dynamics later codified by Kaggle. Two cautionary footnotes also followed: the winning ensemble was never deployed in production because its engineering cost outweighed the rating-quality gain over a simpler in-house model, and a planned sequel competition was canceled in 2010 after researchers showed that supposedly anonymized users in the released dataset could be re-identified by cross-referencing public ratings on other sites.

17.2.2 The ImageNet Challenge (2010–2017)

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual computer-vision competition held from 2010 to 2017. The ILSVRC subset (ImageNet-1K) contains $1{,}000$ classes with $1{,}281{,}167$ training, $50{,}000$ validation, and $100{,}000$ test images; the larger ImageNet-21K contains $14{,}197{,}122$ images across $21{,}841$ classes. Performance was reported using both top-$1$ and top-$5$ error, where the top-$5$ error is the fraction of test images for which the correct label is not among the model’s five most probable predictions (formal definitions in Sec. 18.2.1).
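The top-$k$ error described above can be computed directly from per-class scores. The following sketch uses a hypothetical helper `top_k_error` and made-up scores for three images and four classes; it only illustrates the counting rule, not the official evaluation code.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest scores.

    scores: one list of per-class scores per sample.
    labels: true class index per sample.
    """
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Made-up scores for 3 images over 4 classes.
scores = [
    [0.10, 0.60, 0.20, 0.10],  # true class 1: ranked first
    [0.50, 0.10, 0.30, 0.10],  # true class 2: ranked second
    [0.25, 0.15, 0.10, 0.50],  # true class 1: ranked third
]
labels = [1, 2, 1]

print(top_k_error(scores, labels, k=1))  # 2 of 3 images missed
print(top_k_error(scores, labels, k=2))  # 1 of 3 images missed
```

Because the top-$5$ set always contains the top-$1$ prediction, top-$5$ error can never exceed top-$1$ error on the same predictions.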

On September 30, 2012, the convolutional neural network AlexNet decisively beat the classical Fisher-vector pipelines that had won in previous years, cutting the top-$5$ error from about $26\,\%$ to $15.3\,\%$ (Table 17.1). The gain was enabled by GPU training of a deep CNN. The jump is widely credited with launching the modern deep-learning era: within five years deep models dominated the leaderboard, error fell below the human reference of about $5\,\%$, and the competition was retired.

Table 17.1: Selected ILSVRC classification results (top-$1$ and top-$5$ error). The 2011 winner and the 2012 runner-up still rely on hand-engineered features; AlexNet replaces them with a learned deep representation.

Year	Entry	Approach	Top-$1$ err. (%)	Top-$5$ err. (%)
2011	XRCE (winner)	hand-crafted features	$\sim 50$	25.8
2012	ISI (runner-up)	hand-crafted features	$\sim 46$	26.2
2012	AlexNet (winner)	8-layer CNN + GPU debut	37.5	15.3
2014	GoogLeNet (winner)	22-layer CNN	$\sim 26$	6.7
2015	ResNet (winner)	152-layer CNN	$\sim 19$	3.57
2017	SENet (winner)	CNN + attention	$\sim 17$	2.25

The challenge was retired after 2017 because it had stopped discriminating between top entries: $29$ of the $38$ submitting teams had top-$5$ error below $5\,\%$, and the leading entries had fallen to roughly $2\,\%$, comparable to the estimated label-noise floor of the test set itself, so further gains were no longer measuring image-recognition quality. The community moved on to harder benchmarks (detection, segmentation, larger and noisier web-scale corpora, and self-supervised pre-training).

17.3 Enablers

What changed between the Netflix Prize and the ImageNet jump was not the theoretical idea of deep networks (it had been around for decades), but three practical enablers that arrived together: the data needed to train an over-parametrised model (Sec. 17.3.1), the hardware to fit it in tractable wall-clock time (Sec. 17.3.2), and the empirical observation that deep networks keep improving as both grow (Sec. 17.3.3).

17.3.1 The Data Explosion

A Full-HD camera produces frames of $1920\times 1080$ pixels (about two megapixels), each pixel described by three colour channels (RGB) of $1$ byte ($8$ bits) each. At a typical frame rate of $30$ fps, the uncompressed stream is

\begin{equation*} 1920\times 1080\times 3\times 30 \;\approx \; 2\times 10^8 \text {\,bytes/s} \;\approx \; 200\text {\,MB/s}. \end{equation*}

A single hour of raw HD video therefore amounts to roughly $670$ GB; multiply that by millions of cameras, microphones, and clickstreams across the internet, and the corpus available for training is enormous.
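The arithmetic is easy to reproduce. The short sketch below evaluates the product without rounding, using decimal units ($1$ MB $=10^6$ bytes, $1$ GB $=10^9$ bytes); the variable names are illustrative. The rounded figure of $200$ MB/s would give $720$ GB per hour, while the unrounded product gives about $672$ GB.

```python
width, height = 1920, 1080   # Full-HD frame
channels = 3                 # RGB, 1 byte per channel
fps = 30                     # frames per second

bytes_per_second = width * height * channels * fps
bytes_per_hour = bytes_per_second * 3600

print(bytes_per_second / 1e6, "MB/s")    # 186.624 MB/s
print(bytes_per_hour / 1e9, "GB/hour")   # 671.8464 GB/hour
```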

17.3.2 Hardware: GPUs and TPUs

Training and inference in neural networks consist almost entirely of dense matrix and tensor operations, which parallelise trivially across thousands of cores. GPUs are designed for exactly this workload (Table 17.2).

Table 17.2: High-level comparison between CPU and GPU for neural-network workloads.

Parameter	CPU	GPU (NVIDIA)	Note
Single-thread speed	Fast	Slow	Latency-optimised vs. throughput-optimised
Parallel cores	8–64	$\sim 10^4$	GPU has many simple cores
Execution model	Independent threads (MIMD)	Same program, many data items (SIMT)	GPU well-suited to tensor ops
Cache	Hardware-managed, multi-level	Programmer-managed shared memory	GPU memory hierarchy is explicit
Memory	DDR, large, lower bandwidth	HBM, $16$–$80$ GB onboard, much higher bandwidth	Memory bandwidth often limits throughput
Power	$\sim 65$ W	$150$–$700$ W	Datacentre GPUs need dedicated cooling
Multi-device scaling	Limited	NVLink / InfiniBand fabrics	Routine scaling to thousands of GPUs
Precision	FP64 standard	FP32 / FP16 / BF16 / FP8	Lower precision $\Rightarrow $ higher throughput
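To make the "dense matrix operations" claim concrete, the forward pass of a fully-connected layer is a single matrix product plus a bias, and every output element can be computed independently of the others. The pure-Python sketch below is for illustration only (real frameworks dispatch this to optimised GPU kernels); the names `dense_forward` and `dense_flops` and the layer sizes are assumptions.

```python
def dense_forward(X, W, b):
    """Compute Y = X W + b for a batch X (B x I), weights W (I x O), bias b (O).

    Each Y[n][o] depends only on row n of X and column o of W, so all
    B * O outputs could be computed in parallel.
    """
    n_in, n_out = len(W), len(b)
    return [[sum(x[i] * W[i][o] for i in range(n_in)) + b[o]
             for o in range(n_out)]
            for x in X]


def dense_flops(batch, n_in, n_out):
    """Floating-point operations of one forward pass.

    Per output: n_in multiplies, n_in - 1 additions, and 1 bias addition.
    """
    return 2 * batch * n_in * n_out


print(dense_forward([[1.0, 2.0]],
                    [[1.0, 0.0, 2.0],
                     [0.0, 1.0, 3.0]],
                    [0.5, 0.5, 0.5]))      # [[1.5, 2.5, 8.5]]

# One hypothetical 4096 -> 4096 layer on a batch of 256 samples.
print(dense_flops(256, 4096, 4096))        # 8589934592, about 8.6 GFLOP
```

A sequential loop performs these operations one after another; hardware with thousands of cores can spread the independent outputs across them.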

Two GPU-specific features were decisive for deep learning:

• Tensor Cores (NVIDIA Volta and later): dedicated units that perform a fused mixed-precision matrix-multiply-accumulate in a single cycle, giving an order-of-magnitude throughput advantage on the operations that dominate training.
• A mature software stack (CUDA, cuDNN, cuBLAS), on which the GPU back ends of all the major deep-learning frameworks (PyTorch, TensorFlow, JAX) are built.

An example of further specialization is Google’s Tensor Processing Unit (TPU), an application-specific integrated circuit (ASIC) built around a large systolic-array matrix multiplier and optimized exclusively for tensor workloads. NVIDIA, AMD, and several startups now ship comparable accelerators.

17.3.3 Scaling Behaviour

Data and hardware would be of little use if the model could not benefit from them. The defining property of deep networks is that they keep improving as both grow.

Scalability of a model: A model is scalable if both of the following hold:

• Statistical scalability: test-set performance keeps improving as the training set grows.
• Computational scalability: the training procedure can absorb the larger dataset at a tolerable wall-clock and memory cost (e.g., through parallelization across GPUs).

A linear or otherwise low-capacity model has an asymptotic performance ceiling: once enough data is available to estimate its parameters reliably, further samples bring diminishing returns because the hypothesis class itself is too restrictive. Deep networks behave differently: when the model is enlarged in tandem with the dataset, performance keeps improving across many orders of magnitude in dataset size. This empirical observation is the operational definition of “deep learning at scale” (Fig. 17.2), and it is what made the AlexNet jump in Sec. 17.2 possible.

17.4 Applications and Architectures

Neural networks are flexible enough to tackle problems across very different data modalities. Table 17.3 maps a few canonical pairings of data type, task, and architecture; the rest of this part of the book introduces these architectures one by one.

Table 17.3: Canonical pairings of data modality, task, and neural-network architecture.

Learning type	Data modality	Example	Task	Architecture
Supervised	Tabular	Home / car description	Price prediction	Fully-connected NN
Supervised	Sequential (time)	Audio waveform	Speech transcription	Recurrent NN
Supervised	Sequential (text)	Sentence	Translation	Recurrent NN / Transformer
Supervised	Image (2D)	Photographs	Image classification	Convolutional NN
Unsupervised	Image (2D)	Unlabelled images	Representation learning	Contrastive / self-supervised NN
Unsupervised	Image (2D)	Noisy images	Denoising	Autoencoder