Machine Learning & Signals Learning

\(\newcommand{\footnotename}{footnote}\) \(\def \LWRfootnote {1}\) \(\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\let \LWRorighspace \hspace \) \(\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }\) \(\newcommand {\TextOrMath }[2]{#2}\) \(\newcommand {\mathnormal }[1]{{#1}}\) \(\newcommand \ensuremath [1]{#1}\) \(\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } \) \(\newcommand {\setlength }[2]{}\) \(\newcommand {\addtolength }[2]{}\) \(\newcommand {\setcounter }[2]{}\) \(\newcommand {\addtocounter }[2]{}\) \(\newcommand {\arabic }[1]{}\) \(\newcommand {\number }[1]{}\) \(\newcommand {\noalign }[1]{\text {#1}\notag \\}\) \(\newcommand {\cline }[1]{}\) \(\newcommand {\directlua }[1]{\text {(directlua)}}\) \(\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}\) \(\newcommand {\protect }{}\) \(\def \LWRabsorbnumber #1 {}\) \(\def \LWRabsorbquotenumber "#1 {}\) \(\newcommand {\LWRabsorboption }[1][]{}\) \(\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }\) \(\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }\) \(\def \mathcode #1={\mathchar }\) \(\let \delcode \mathcode \) \(\let \delimiter \mathchar \) \(\def \oe {\unicode {x0153}}\) \(\def \OE {\unicode {x0152}}\) \(\def \ae {\unicode {x00E6}}\) \(\def \AE {\unicode {x00C6}}\) \(\def \aa {\unicode {x00E5}}\) \(\def \AA {\unicode {x00C5}}\) \(\def \o {\unicode {x00F8}}\) \(\def \O {\unicode {x00D8}}\) \(\def \l {\unicode {x0142}}\) \(\def \L {\unicode {x0141}}\) \(\def \ss {\unicode {x00DF}}\) \(\def \SS {\unicode {x1E9E}}\) \(\def \dag {\unicode {x2020}}\) \(\def \ddag {\unicode {x2021}}\) \(\def \P {\unicode {x00B6}}\) \(\def \copyright {\unicode {x00A9}}\) \(\def \pounds {\unicode {x00A3}}\) \(\let \LWRref \ref \) \(\renewcommand {\ref }{\ifstar \LWRref \LWRref }\) \( \newcommand {\multicolumn }[3]{#3}\) \(\require {textcomp}\) \( \newcommand {\abs }[1]{\lvert #1\rvert } \) \( \DeclareMathOperator {\sign }{sign} \) \(\newcommand {\intertext }[1]{\text {#1}\notag \\}\) \(\let \Hat \hat \) \(\let \Check \check \) \(\let \Tilde \tilde \) \(\let \Acute \acute \) \(\let \Grave \grave \) \(\let \Dot \dot \) \(\let \Ddot \ddot \) \(\let \Breve \breve \) \(\let \Bar \bar \) \(\let \Vec \vec \) \(\newcommand {\bm }[1]{\boldsymbol {#1}}\) \(\require {physics}\) \(\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}\) \(\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}\) \(\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}\) \(\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}\) \(\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}\) \(\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}\) \(\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}\) \(\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}\) \(\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}\) \(\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}\) \(\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}\) \(\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}\) \(\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}\) \(\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}\) \(\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}\) \(\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}\) \(\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}\) \(\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}\) \(\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}\) \(\renewcommand {\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}\) \(\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}\) \(\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}\) \(\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}\) \(\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}\) \(\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}\) \(\require {cancel}\) \(\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}\) \(\DeclareMathOperator *{\argmax }{argmax}\) \(\DeclareMathOperator *{\argmin }{arg\,min}\) \(\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}\) \(\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}\) \(\def \Cov [#1]{\operatorname {Cov}\!\left [ #1 \right ]}\) \(\newcommand {\floor }[1]{\lfloor #1 \rfloor }\) \(\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}\) \(\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}\) \(\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}\) \(\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}\) \(\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}\) \(\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}\) \(\renewcommand {\real }{\mathbb {R}}\) \(\newcommand {\ba }{\mathbf {a}}\) \(\newcommand {\bb }{\mathbf {b}}\) \(\newcommand {\bc }{\mathbf {c}}\) \(\newcommand {\bd }{\mathbf {d}}\) \(\newcommand {\be }{\mathbf {e}}\) \(\newcommand {\bf }{\mathbf {f}}\) \(\newcommand {\bh }{\mathbf {h}}\) \(\newcommand {\bi }{\mathbf {i}}\) \(\newcommand {\bn }{\mathbf {n}}\) \(\newcommand {\bo }{\mathbf {o}}\) \(\newcommand {\bp }{\mathbf {p}}\) \(\newcommand {\bq }{\mathbf {q}}\) \(\newcommand {\br }{\mathbf {r}}\) \(\newcommand {\bs }{\mathbf {s}}\) \(\newcommand {\bt }{\mathbf {t}}\) \(\newcommand {\bu }{\mathbf {u}}\) \(\newcommand {\bv }{\mathbf {v}}\) \(\newcommand {\bw }{\mathbf {w}}\) \(\newcommand {\bx }{\mathbf {x}}\) \(\newcommand {\bxx }{\mathbf {xx}}\) \(\newcommand {\bxy }{\mathbf {xy}}\) \(\newcommand {\by }{\mathbf {y}}\) \(\newcommand {\byy }{\mathbf {yy}}\) \(\newcommand {\bz }{\mathbf {z}}\) \(\newcommand {\bA }{\mathbf {A}}\) \(\newcommand {\bB }{\mathbf {B}}\) \(\newcommand {\bC }{\mathbf {C}}\) \(\newcommand {\bD }{\mathbf {D}}\) \(\newcommand {\bH }{\mathbf {H}}\) \(\newcommand {\bI }{\mathbf {I}}\) \(\newcommand {\bK }{\mathbf {K}}\) \(\newcommand {\bM }{\mathbf {M}}\) \(\newcommand {\bP }{\mathbf {P}}\) \(\newcommand {\bQ }{\mathbf {Q}}\) \(\newcommand {\bR }{\mathbf {R}}\) \(\newcommand {\bS }{\mathbf {S}}\) \(\newcommand {\bU }{\mathbf {U}}\) \(\newcommand {\bW }{\mathbf {W}}\) \(\newcommand {\bX }{\mathbf {X}}\) \(\newcommand {\bY }{\mathbf {Y}}\) \(\newcommand {\bZ }{\mathbf {Z}}\) \(\newcommand {\balpha }{\bm {\alpha }}\) \(\newcommand {\bth }{{\bm {\theta }}}\) \(\newcommand {\bepsilon }{{\bm {\epsilon }}}\) \(\newcommand {\bmu }{{\bm {\mu }}}\) \(\newcommand {\bphi }{\bm {\phi }}\) \(\newcommand {\bOne }{\mathbf {1}}\) \(\newcommand {\bZero }{\mathbf {0}}\) \(\newcommand {\indFunc }{\mathbb {1}}\) \(\newcommand {\btx }{\tilde {\bx }}\) \(\newcommand {\loss }{\mathcal {L}}\) \(\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}\) \(\newcommand {\SSE }{\mathrm {SSE}}\) \(\newcommand {\MSE }{\mathrm {MSE}}\) \(\newcommand {\RMSE }{\mathrm {RMSE}}\) \(\newcommand {\toprule }[1][]{\hline }\) \(\let \midrule \toprule \) \(\let \bottomrule \toprule \) \(\def \LWRbooktabscmidruleparen (#1)#2{}\) \(\newcommand {\LWRbooktabscmidrulenoparen }[1]{}\) \(\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }\) \(\newcommand {\morecmidrules }{}\) \(\newcommand {\specialrule }[3]{\hline }\) \(\newcommand {\addlinespace }[1][]{}\) \(\newcommand {\LWRsubmultirow }[2][]{#2}\) \(\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }\) \(\newcommand {\multirow }[2][]{\LWRmultirow }\) \(\newcommand {\mrowcell }{}\) \(\newcommand {\mcolrowcell }{}\) \(\newcommand {\STneed }[1]{}\) \(\newcommand {\tcbset }[1]{}\) \(\newcommand {\tcbsetforeverylayer }[1]{}\) \(\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}\) \(\newcommand {\tcboxfit }[2][]{\boxed {#2}}\) \(\newcommand {\tcblower }{}\) \(\newcommand {\tcbline }{}\) \(\newcommand {\tcbtitle }{}\) \(\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}\) \(\newcommand {\tcboxmath }[2][]{\boxed {#2}}\) \(\newcommand {\tcbhighmath }[2][]{\boxed {#2}}\) \(\require {colortbl}\) \(\let \LWRorigcolumncolor \columncolor \) \(\renewcommand {\columncolor }[2][named]{\LWRorigcolumncolor [#1]{#2}\LWRabsorbtwooptions }\) \(\let \LWRorigrowcolor \rowcolor \) \(\renewcommand {\rowcolor }[2][named]{\LWRorigrowcolor [#1]{#2}\LWRabsorbtwooptions }\) \(\let \LWRorigcellcolor \cellcolor \) \(\renewcommand {\cellcolor }[2][named]{\LWRorigcellcolor [#1]{#2}\LWRabsorbtwooptions }\)

11 Dimensionality Reduction

  • Goal: Map \(N\)-dimensional data vectors \(\bx _1,\ldots ,\bx _M\) into a lower-dimensional representation
    \(\bz _1,\ldots ,\bz _M\in \mathbb {R}^K\) with \(K\ll N\), while preserving as much of the structure of the data as possible.

Common motivations: visualization in two or three dimensions, noise reduction, decorrelation of features, lossy compression, and mitigating the curse of dimensionality before downstream classification or regression.

11.1 Characterization

11.1.1 Local vs. global structure

The methods in this chapter are compared by how well they preserve the local and global structure of the data:

  • Local structure is the geometry of small neighbourhoods: which points are close to which, the shape of a single cluster, fine-grained variation within a group. Preserving local structure means that points that were near each other in \(\mathbb {R}^N\) remain near each other in \(\mathbb {R}^K\).

  • Global structure is the large-scale layout: relative positions of clusters, overall spread, and the distances between distant points. Preserving global structure means that clusters that were far apart in \(\mathbb {R}^N\) remain far apart, and their relative arrangement is maintained.

Both aspects are illustrated on the same five-cluster ring dataset in Fig. 11.1. Panels (a) and (b) define the two notions on the original data; panels (c) and (d) show the two failure modes a low-dimensional embedding can fall into when the goals come into conflict. The trade-off is unavoidable: mapping \(\mathbb {R}^N\) to \(\mathbb {R}^K\) with \(K\ll N\) must discard some pairwise-distance information, and which distances survive is dictated by the criterion the embedding optimises.

(image)

Figure 11.1: Local vs. global structure on a single five-cluster ring dataset. (a) Local structure: a focal point (\(\star \)) and its nearest neighbours inside one cluster. (b) Global structure: the five cluster centroids (X) and their ring arrangement. (c) Embedding that preserves the global ring arrangement but collapses each cluster (global preserved, local lost). (d) Embedding that keeps each cluster intact but scrambles their relative positions (local preserved, global distorted).
11.1.2 General Features
Deterministic vs. stochastic fits

A method is deterministic if it returns the same embedding every time it is run on a given dataset; it is stochastic if the result depends on a random initialisation or a randomised optimiser, so successive runs of the same code on the same data produce different layouts. Determinism matters when the embedding is used as a fixed preprocessing step in a pipeline (the same input must always give the same output) and for the reproducibility of published figures. Stochastic methods are usually made reproducible by fixing a random seed.

Embedding new points

A method embeds new points if, after fitting on a training set, it can place unseen samples in the same coordinate system without re-running the optimisation. This is essential when the low-dimensional representation is part of a deployed system (e.g., a learned 2D map onto which test or streaming data must be plotted). Methods that lack this property must be re-fit from scratch whenever the dataset changes, which makes them suitable mainly for one-off visualisation rather than for reusable preprocessing.

Supervised vs. unsupervised

A dimensionality-reduction method is unsupervised if it uses only the inputs \(\bx _1,\ldots ,\bx _M\) and supervised if it also uses the targets \(\by _1,\ldots ,\by _M\) to choose the embedding. Unsupervised methods optimise a purely geometric criterion (variance, neighbour structure), so they may discard directions that have low variance but are highly informative about \(\by \).

11.2 Principal Component Analysis (PCA)

11.2.1 Mean as a zero-dimensional representation

The simplest dimensionality reduction maps every data point to a single point \(\mathbf {c}\in \mathbb {R}^N\). The choice that minimizes the sum of squared distances,

\begin{equation} \mathbf {c}^\star = \argmin _{\mathbf {c}} \sum _{i=1}^{M}\|\bx _i-\mathbf {c}\|^2, \end{equation}

is the sample mean

\begin{equation} \bmu = \frac {1}{M}\sum _{i=1}^{M}\bx _i. \end{equation}

We use \(K\) to denote the dimension of the approximation: a single point has dimension \(K=0\), a line \(K=1\), a plane \(K=2\), and so on. The sample mean is therefore the best \(K=0\) approximation of the dataset (Fig. 11.2).

(image)

Figure 11.2: The sample mean \(\bmu \) minimizes the sum of squared distances to all data points; the best zero-dimensional approximation of the dataset.
11.2.2 Definition

Intuition For \(K\ge 1\), PCA fits the data with a \(K\)-dimensional subspace through the sample mean (a line for \(K=1\), a plane for \(K=2\), and so on). This subspace is oriented along the directions in which the data spreads the most. The first such direction is the line that best fits the centered points; the second is perpendicular to the first and captures the next-largest spread; and so on (Fig. 11.3).

(image)

Figure 11.3: First and second principal axes of a 2D point cloud, drawn at the centroid; arrow lengths are proportional to \(\sqrt {\lambda _i}\) of the covariance matrix \(\bS \).

Objective To turn this picture into a computation, we need a formula for the reconstruction \(\hat \bx _i\) of each data point on the subspace. Write it as \(\hat \bx _i=\mathbf {c}+\bW _K\bz _i\), where \(\mathbf {c}\in \mathbb {R}^N\) is a point the subspace passes through, the columns of \(\bW _K\in \mathbb {R}^{N\times K}\) are the \(K\) direction vectors of the subspace (unit-length and mutually perpendicular), and \(\bz _i\in \mathbb {R}^K\) holds the coordinates of point \(i\) within it. With this parametrization, the same squared-distance criterion as in the \(K=0\) case becomes

\begin{equation} \label {eq-pca-objective} (\mathbf {c}^\star ,\bW _K^\star ,\{\bz _i^\star \}) = \argmin _{\substack {\mathbf {c},\,\bW _K,\,\{\bz _i\}\\ \bW _K^T\bW _K=\bI _K}} \sum _{i=1}^{M}\bigl \|\bx _i-(\mathbf {c}+\bW _K\bz _i)\bigr \|^2. \end{equation}

Setting \(K=0\) removes the \(\bW _K\bz _i\) term and recovers the mean problem.

Solution Solving (11.3) splits into two steps: locating the subspace (choosing \(\mathbf {c}^\star \)) and orienting it (choosing \(\bW _K^\star \)). Both are read off the data through the sample mean and the sample covariance matrix. Let \(\bX \in \mathbb {R}^{M\times N}\) be the data matrix (rows are samples). Center it by subtracting the sample mean,

\begin{equation} \tilde \bX = \bX - \bOne \bmu ^T,\qquad \bmu = \frac {1}{M}\sum _{i=1}^{M}\bx _i, \end{equation}

and form the sample covariance matrix

\begin{equation} \bS = \frac {1}{M-1}\tilde \bX ^T\tilde \bX . \end{equation}

The best subspace passes through the sample mean, so \(\mathbf {c}^\star =\bmu \). Its directions are the top \(K\) eigenvectors of \(\bS \): \(\bW _K^\star =[\bw _1,\ldots ,\bw _K]\), where the eigenvectors \(\bw _1,\ldots ,\bw _N\) are ordered by decreasing eigenvalue \(\lambda _1\ge \lambda _2\ge \cdots \ge \lambda _N\ge 0\). Each eigenvalue \(\lambda _i\) equals the variance of the data along \(\bw _i\), so picking the top \(K\) eigenvectors picks the \(K\) directions of largest spread, which matches the geometric picture from the intuition above. The eigendecomposition itself is reviewed in the appendix.

Encoding and decoding Once the directions \(\bW _K\) are fixed, each data point is encoded as

\begin{equation} \bz = \bW _K^T(\bx -\bmu ) \in \mathbb {R}^K \end{equation}

and reconstructed as

\begin{equation} \hat \bx = \bmu + \bW _K\bz . \end{equation}

For \(K=0\) this collapses to \(\hat \bx =\bmu \), the zero-dimensional approximation. For \(K\ge 1\), no other \(K\)-dimensional subspace through \(\bmu \) gives a smaller reconstruction error in (11.3).

Computation. In practice PCA is computed from the singular value decomposition of \(\tilde \bX \) rather than by forming \(\bS \) explicitly. SVD is more numerically stable, especially when features are nearly collinear or \(N\) is large.

(image)

Figure 11.4: PCA minimises orthogonal (perpendicular) reconstruction distance; linear regression minimises vertical residuals. The two fits coincide only in special cases.

The PCA fit is not the same as the linear regression fit. PCA minimises the perpendicular distance from each point to the fitted line (or subspace), treating all coordinates symmetrically. Linear regression minimises the vertical distance, treating \(x_2\) as a function of \(x_1\), and changes if the roles of the variables are swapped.

PCA can also be viewed as a linear autoencoder with a single hidden layer of width \(K\) and tied weights; this connection is taken up again in the chapter on neural networks.

11.2.3 Choosing the number of components

The fraction of variance explained by component \(i\) is \(\lambda _i/\sum _j\lambda _j\). The cumulative explained variance by the top \(K\) components is

\begin{equation} \mathrm {EV}(K) = \frac {\sum _{i=1}^{K}\lambda _i}{\sum _{j=1}^{N}\lambda _j}\times 100\%. \end{equation}

Two common heuristics for picking \(K\) are: (i) the smallest \(K\) such that \(\mathrm {EV}(K)\) exceeds a target threshold (typically 90% or 95%), and (ii) the elbow of the scree plot, where the variance ratio drops sharply (Fig. 11.5).

(image)

Figure 11.5: Left: variance ratio per component (scree plot). Right: cumulative explained variance with 90% and 95% thresholds; here the 90% threshold is reached at \(K=4\).
  • Example 11.1 (Choosing \(K\)): For the eigenvalue spectrum \(\lambda =(4,\,2,\,1,\,0.5,\,0.1)\) with total \(\sum _j\lambda _j=7.6\), the variance ratios are \((0.526,\,0.263,\,0.132,\,0.066,\,0.013)\) and the cumulative explained variance is

    \begin{equation*} \mathrm {EV}(1)=52.6\%,\;\mathrm {EV}(2)=78.9\%,\;\mathrm {EV}(3)=92.1\%,\;\mathrm {EV}(4)=98.7\%. \end{equation*}

    The 90% threshold is met at \(K=3\).

PCA is deterministic, embeds new points trivially via \(\bz _{\mathrm {new}}=\bW _K^\top (\bx _{\mathrm {new}}-\bmu )\), and is unsupervised: it does not use the target \(\by \), so it may discard directions that are low-variance but discriminative.

The number of components \(K\) is the only hyperparameter of PCA. It is typically chosen by an explained-variance threshold (e.g. 90% or 95%) or by the elbow of the scree plot, and validated with cross-validation when PCA feeds a downstream supervised model.

Practical notes

  • Scale sensitivity: when input features are in different units, standardise them first (or apply PCA to the correlation matrix instead of the covariance matrix); otherwise the largest-variance feature dominates.

  • Use in supervised pipelines: validate the chosen \(K\) with cross-validation rather than picking it from the scree plot alone.

  • Lossy compression and denoising: dropping the small-eigenvalue components removes the directions of smallest variance, which often correspond to noise (eigenfaces are a classical example).

  • For visualization and clustering: PCA tends to preserve global (inter-cluster) geometry but distorts local (intra-cluster) structure.

11.3 t-SNE

t-SNE (t-distributed stochastic neighbor embedding) is a popular tool for visualizing complex data in 2D or 3D. Instead of trying to keep the exact mathematical distances between points like PCA does, t-SNE simply tries to group similar data points together while pushing dissimilar points far apart.

t-SNE is stochastic and does not embed new points: each call re-runs the optimization from a random initialisation, and the resulting layout cannot be reused to place unseen samples without refitting on the combined data.

Hyper-parameter: Perplexity The main control knob is perplexity, typically set between 5 and 50. Think of this as the number of close neighbors each point cares about. If your plot looks like a cloud, adjusting the perplexity is the best way to uncover hidden structure.

Because t-SNE forces clusters to separate, it can be deceiving:

  • Don’t trust the gaps: t-SNE exaggerates the empty space between clusters. The physical distance between two different clusters on the plot is meaningless.

  • Cluster size doesn’t matter: A large, spread-out cluster on the plot isn’t necessarily more diverse than a tight, dense one. t-SNE heavily distorts the density of the data.

11.4 UMAP

UMAP (uniform manifold approximation and projection) is a newer alternative to t-SNE. It targets the same goal (creating 2D or 3D plots of complex data) but uses a more efficient approach that gives it several major practical advantages.

Improvements over t-SNE:

  • Faster

  • Better preserves global structure.

  • Can place unseen samples on a previously fitted map without refitting.

Hyper-parameters UMAP gives you two main ways to control the final plot:

  • n_neighbors: Similar to t-SNE’s perplexity. A low number focuses on grouping tiny local details, while a high number forces the algorithm to look at the broader, global structure.

  • min_dist: Controls how tightly packed the points are allowed to be. Set it low (e.g., 0.1) if you want distinct, separated clusters; set it higher (e.g., 0.5) if you want to see the individual points without them overlapping too much.

11.5 PaCMAP

PaCMAP (pairwise controlled manifold approximation and projection) is a more recent visualisation method that explicitly aims to preserve both local clusters and global layout at the same time. It is built on top of the same general idea as UMAP, but with a different recipe for choosing which pairs of points the optimisation should care about.

The distinguishing idea of PaCMAP is to balance three types of point pairs in the loss function:

  • Near pairs: a few close neighbours of each point, used to keep local clusters tight.

  • Mid-near pairs: pairs that are not the closest, but not far either, used to anchor the medium-scale layout.

  • Further pairs: random distant pairs, used to push unrelated points apart.

By tuning the relative weight of these three types over the course of optimisation, PaCMAP recovers fine cluster structure without losing the big picture.

Hyper-parameter The main control is n_neighbors (size of the local neighbourhood, similar in spirit to UMAP’s n_neighbors or t-SNE’s perplexity). Defaults work well in most cases; the implementation also exposes ratios that control how many mid-near and further pairs are used per near pair, which usually do not need to be changed.

PaCMAP is stochastic, unsupervised, and embeds new points into an existing layout . As with the other nonlinear methods, absolute distances in a PaCMAP plot should not be over-interpreted, but the relative arrangement of clusters tends to be more trustworthy than in t-SNE.

11.6 Summary

How each method handles the local/global trade-off. The local vs. global distinction introduced at the start of the chapter (Fig. 11.1) is dictated by the criterion each method optimises. A loss built on overall variance or all pairwise distances (as in PCA) is dominated by the largest distances, so global layout is retained but within-cluster geometry is spent first when a direction is dropped. A loss built on neighbour relations (as in t-SNE) penalises only short-range errors, so within-cluster structure is preserved, but the optimiser is free to rearrange clusters because faraway distances do not enter the loss. Hybrid methods such as UMAP and PaCMAP mix near-neighbour and farther-pair terms in the loss, which is why their plots tend to keep clusters tight and their relative arrangement readable.

PCA is a linear projection, so it preserves global structure (large pairwise distances and overall spread) but compresses everything along the dropped directions, which often destroys fine local detail. t-SNE optimises a local objective, so it reveals clusters and within-cluster structure clearly but distorts global geometry: cluster sizes and inter-cluster distances in a t-SNE plot are not meaningful. UMAP sits between the two, recovering local structure similarly to t-SNE while retaining more of the global layout.

Figure 11.6 illustrates this on the handwritten-digits dataset (10 classes, 64 features): PCA shows heavy class overlap, while t-SNE and UMAP both separate the digit classes into compact clusters. Table 11.1 summarises how the four methods compare on the properties introduced earlier in Sec. 11.1.

(image)

Figure 11.6: The same handwritten-digits dataset (10 classes, 64 features) embedded into 2D by PCA, t-SNE, and UMAP. Colours encode the digit label.
Table 11.1: Comparison of PCA, t-SNE, UMAP, and PaCMAP.
.
Property PCA t-SNE UMAP PaCMAP
Mapping linear nonlinear nonlinear nonlinear
Deterministic yes no no no
Supervised no no no/yes (2) no
Preserves global yes no partially yes
Preserves local no yes yes yes
Embeds new points yes (trivial) no yes yes
Complexity (1) \(\mathcal {O}(\min (MN^2,M^2N))\) \(\mathcal {O}(M^2)\) / \(\mathcal {O}(M\log M)\) \(\sim \mathcal {O}(M^{1.14})\) \(\mathcal {O}(M)\)

(1) Complexity with \(M\) samples and \(N\) input features.

(2) UMAP is unsupervised by default; a supervised variant exists that uses labels \(\by \) to guide the embedding.

Choosing a method

  • Use PCA as a default first step for small datasets. Its fit is a single SVD (seconds on \(M,N\sim 10^4\), scaling as \(\mathcal {O}(\min (MN^2,M^2N))\)), it has no random seed, and the same fitted projection \(\bW _K\) can be applied to streaming or test data. Use it as a preprocessor that decorrelates features and trims dimension before regression, classification, or clustering, and as a quick sanity-check plot. Its blind spots are that it is linear (it cannot unfold curved manifolds) and unsupervised (it may drop low-variance directions that are discriminative for \(\by \)).

  • Use t-SNE, UMAP, or PaCMAP when the goal is a 2D or 3D plot that reveals cluster structure that PCA hides. Prefer t-SNE only for small datasets (\(M\) up to a few thousand) where you mostly care about whether clusters exist. Prefer UMAP as the modern default for visualisation: it scales to millions of points, retains more of the global layout than t-SNE, and can embed new samples without refitting. Prefer PaCMAP when the relative arrangement of clusters carries meaning (hierarchies, smooth gradients between groups). In all three cases, treat absolute distances, cluster sizes, and inter-cluster gaps as not meaningful, fix the random seed for reproducibility, and verify that the layout is stable across a few seeds and hyper-parameter values before drawing conclusions.

Further Reading