Machine Learning & Signals Learning
10 k-Nearest Neighbors (k-NN)
Unlike logistic regression which learns a parametric decision boundary, k-NN classifies a new point by majority vote among its \(k\) nearest neighbors in the training set.
10.1 Algorithm
For a new sample \(\bx _0\) and the dataset \(\left \{x_k,y_k\right \}_{k=1}^M\)
-
1. Compute distances between \(\bx _0\) to each of the dataset samples, \(d(\bx _0, \bx _i)\) (\(M\) distances).
-
2. Select the \(k\) samples with smallest distances.
-
3. Assign the class by majority vote among the \(k\) neighbors.
-
Example 10.1: Fig. 10.1 shows k-NN classification for a query point (star) with \(k=1\), \(k=5\), and \(k=15\). The decision boundary (dashed) becomes smoother as \(k\) increases: \(k=1\) produces a complex, noisy boundary, while \(k=15\) yields a simpler one at the cost of reduced local sensitivity.
Tie-Breaking
When \(k>1\) neighbors sometimes yield equal votes. The ways to handle this situation are:
-
• Random selection among tied classes.
-
• Select class of the nearest neighbor among tied samples, i.e. use \(k=1\).
10.2 Distance Metrics
The choice of distance defines what “near” means for k-NN and is therefore as important as the choice of \(k\) itself. Different distance metrics emphasize different aspects of the data. The metric is a hyperparameter and is typically selected by cross-validation (Sec. 4.5). Most metrics are not scale-invariant, so features should be standardized before distances are computed (Sec. 4.7).
10.2.1 Properties
A function \(d:\mathbb {R}^N\times \mathbb {R}^N\to \mathbb {R}\) is a distance (or metric) if for all \(\ba ,\bb ,\bc \):
-
• Non-negativity: \(d(\ba ,\bb ) \ge 0\).
-
• Identity of indiscernibles: \(d(\ba ,\bb ) = 0 \;\Longleftrightarrow \; \ba = \bb \).
-
• Symmetry: \(d(\ba ,\bb ) = d(\bb ,\ba )\).
-
• Triangle inequality: \(d(\ba ,\bc ) \le d(\ba ,\bb ) + d(\bb ,\bc )\).
A function satisfying only a subset of these is a semi-metric or pseudo-metric; cosine distance is a common example that fails the triangle inequality (see below).
The most popular distance metrics are summarized below.
10.2.2 Euclidean (\(L_2\))
The straight-line distance in \(\real ^N\),
\(\seteqnumber{0}{}{0}\)\begin{equation} d(\ba ,\bb ) = \lVert \ba -\bb \rVert = \sqrt {\sum _{j=1}^N (a_j-b_j)^2} \end{equation}
The default choice for continuous, real-valued features. Because the differences are squared, features with large numerical ranges dominate the sum, so standardization (zero mean, unit variance) is essentially mandatory (Sec. 4.7). Euclidean distance is invariant to rotations and translations of the coordinate system.
10.2.3 Manhattan (\(L_1\), city block)
The sum of absolute coordinate differences,
\(\seteqnumber{0}{}{1}\)\begin{equation} d(\ba ,\bb ) = \sum _{j=1}^N \abs {a_j-b_j} \end{equation}
The distance one would travel on a grid of city blocks, hence the name. Because deviations are not squared, \(L_1\) is less sensitive to outliers and large per-feature gaps than \(L_2\).
10.2.4 Minkowski (generalized, hyperparameter \(p\))
The Minkowski distance is a family that interpolates between Manhattan, Euclidean, and Chebyshev as the exponent \(p\) varies; \(p\) itself can be treated as a hyperparameter and tuned by cross-validation. For \(p\ge 1\),
\(\seteqnumber{0}{}{2}\)\begin{equation} d(\ba ,\bb ) = \left (\sum _{j=1}^N \abs {a_j-b_j}^p\right )^{1/p} \end{equation}
Special cases: \(p=1\) gives Manhattan, \(p=2\) gives Euclidean. As \(p\to \infty \) the sum is dominated by its largest term, yielding the Chebyshev (\(L_\infty \)) distance
\(\seteqnumber{0}{}{3}\)\begin{equation} d_\infty (\ba ,\bb ) = \lim _{p\to \infty }\left (\sum _{j=1}^N \abs {a_j-b_j}^p\right )^{1/p} = \max _{j}\abs {a_j-b_j}. \end{equation}
For \(p<1\), \(d(\ba ,\bb ) = \left (\sum _{j=1}^N \abs {a_j-b_j}^p\right )\) should be used.
10.2.5 Canberra
A weighted \(L_1\) distance that normalizes each coordinate gap by the sum of the magnitudes:
\(\seteqnumber{0}{}{4}\)\begin{equation} d(\ba ,\bb ) = \sum _{j=1}^{N} \frac {\abs {a_j-b_j}}{\abs {a_j}+\abs {b_j}}, \end{equation}
with the convention that the \(j\)-th term is \(0\) when \(a_j=b_j=0\). Each term is bounded in \([0,1]\), so every coordinate contributes on the same scale regardless of its magnitude, giving the metric an implicit per-feature normalization and a total bounded by \(N\).
Canberra is most useful for non-negative data such as counts, intensities, or ranked features, where relative differences matter more than absolute ones. Compared to Manhattan, a small absolute gap near the origin counts as much as a large gap far from it, so the metric is particularly sensitive to changes near zero. Conversely, when coordinates can be negative or pass through zero, the denominator becomes unstable and the metric loses its appeal.
10.2.6 Cosine similarity
Measures the angle between two non-zero vectors rather than their absolute separation. It is the natural choice for high-dimensional sparse data where vector magnitude is uninformative and only direction carries meaning:
\(\seteqnumber{0}{}{5}\)\begin{equation} \cos (\ba ,\bb ) \;=\; \frac {\ba ^\top \bb }{\lVert \ba \rVert \,\lVert \bb \rVert } \;=\; \frac {\sum _{j=1}^{N} a_j b_j}{\sqrt {\sum _{j=1}^{N} a_j^2}\,\sqrt {\sum _{j=1}^{N} b_j^2}} \;\in \; [-1, 1]. \end{equation}
A value of \(1\) means the vectors point in the same direction, \(0\) means they are orthogonal, and \(-1\) means opposite directions. The corresponding cosine distance
\(\seteqnumber{0}{}{6}\)\begin{equation} d_{\cos }(\ba ,\bb ) \;=\; 1 - \cos (\ba ,\bb ) \end{equation}
ranges in \([0,2]\). Unlike \(L_p\) distances, \(d_{\cos }\) is scale-invariant: \(d_{\cos }(\ba ,\bb ) = d_{\cos }(\alpha \ba ,\beta \bb )\) for any \(\alpha ,\beta >0\), so only the direction of the vectors matters.
Strictly speaking, \(d_{\cos }\) is not a metric in the mathematical sense (the triangle inequality fails), but on the unit sphere it is monotonically related to Euclidean distance: \(\lVert \hat \ba -\hat \bb \rVert _2^2 = 2\,d_{\cos }(\ba ,\bb )\) for unit-normalized \(\hat \ba =\ba /\lVert \ba \rVert \) and \(\hat \bb =\bb /\lVert \bb \rVert \).
-
Example 10.2: 2-D illustration (See Fig. 10.2) For \(\ba =(1,1)\) and \(\bb =(4,5)\) the coordinate differences are \(|a_1-b_1|=3\) and \(|a_2-b_2|=4\).
-
• Manhattan (\(L_1\)): \(d_1 = 3+4 = 7\) (sum along axes).
-
• Euclidean (\(L_2\)): \(d_2 = \sqrt {3^2+4^2} = 5\), the hypotenuse of the right triangle formed by the two legs (Pythagorean theorem).
-
• Chebyshev (\(L_\infty \)): \(d_\infty = \max (3,4) = 4\), the larger coordinate gap.
-
• Canberra: \(d = 3/5 + 4/6 \approx 1.267\), each coordinate contributing a value in \([0,1]\) regardless of its magnitude (no visualization).
-
• Cosine: \(\cos \theta = 9/\sqrt {82}\approx 0.994\), \(d_{\cos }\approx 0.006\). Seen as vectors from the origin, \(\ba \) and \(\bb \) are nearly parallel, so cosine treats them as very similar despite the non-trivial \(L_2\) gap.
-
Many additional, domain-specific distances exist beyond the presented ones. The right choice depends on the data type and the notion of similarity the task requires.
10.3 Bias-Variance Trade-Off
The hyperparameter \(k\) directly controls the bias-variance trade-off (Sec. 4.4) of the classifier.
Small \(k\) (high variance, low bias)
At \(k=1\), the prediction at every point equals the label of its nearest training sample. The decision boundary follows the data closely and can represent very fine local structure (low bias), but it is also extremely sensitive to noise and to the particular training set drawn: a single mislabeled point or a small shift in the sample changes the boundary in its neighborhood. Different training sets produce visibly different classifiers, which is the sign of high variance.
Large \(k\) (low variance, high bias)
As \(k\) grows, each prediction averages over more neighbors and the boundary becomes smoother. Random label noise in any individual sample is washed out by the vote, so the classifier is stable across training sets (low variance). The cost is that fine local structure also gets smoothed away: a small minority region surrounded by the opposite class can be outvoted and absorbed (high bias). In the limit \(k=M\), every query receives the global majority label.
Fig. 10.1 visualizes this progression: \(k=1\) produces a jagged, noise-following boundary, while \(k=15\) yields a much smoother one that may miss local detail.
Choosing \(k\)
\(k\) is typically selected by cross-validation (Sec. 4.5) over a grid of candidate values. A common rule of thumb is \(k\approx \sqrt {M}\), but cross-validation is preferred whenever the cost permits. Odd \(k\) avoids ties in binary problems.
Asymptotic behavior (*)
In the limit of infinite data, k-NN is Bayes-consistent: its error converges to the Bayes-optimal error \(R^{*}\) (Sec. 4.3), provided
-
• \(M\to \infty \): the training set grows without bound;
-
• \(k\to \infty \): the vote averages more samples, so by the law of large numbers its variance vanishes and it becomes a reliable estimate of \(P(y\mid \bx )\);
-
• \(k/M\to 0\) (e.g. \(k=\sqrt {M}\)): \(k\) grows slower than \(M\), so the neighborhood still shrinks around the query point and the bias vanishes.
With all three conditions met, both bias and variance go to zero, so the error reaches the irreducible floor \(R^{*}\).
The \(1\)-NN classifier, in contrast, fixes \(k=1\) and therefore does no averaging at all. Cover & Hart (1967) showed that even so its asymptotic error satisfies
\(\seteqnumber{0}{}{7}\)\begin{equation} R^{*} \;\le \; R_{1\text {-NN}} \;\le \; 2R^{*}(1-R^{*}) \;\le \; 2R^{*}, \end{equation}
so 1-NN never closes the gap to optimality but stays within a factor of two. This is the price of pure variance: minimum bias, no averaging, a permanent gap above \(R^{*}\).
10.4 Curse of Dimensionality
10.4.1 Volume concentrates at the corners
A ball centered at the origin of a hypercube occupies a vanishing fraction of the cube’s volume as \(d\) grows; almost all the mass is in the corners.
The volume of a \(d\)-dimensional ball of radius \(r\) is
\(\seteqnumber{0}{}{8}\)\begin{equation} \label {eq-volume-sphere} V_d(r) = \frac {\pi ^{d/2}}{\Gamma (d/2 + 1)}\, r^d, \end{equation}
where \(\Gamma (\cdot )\) is the gamma function, the continuous extension of the factorial (\(\Gamma (n) = (n-1)!\) for positive integers \(n\)). The familiar cases are \(V_1(r) = 2r\), \(V_2(r) = \pi r^2\), and \(V_3(r) = \tfrac {4}{3}\pi r^3\).
Consider the hypercube \([-a, a]^d\), of volume \((2a)^d\), and an inscribed hypersphere of radius \(r = a\) (Fig. 10.3, left). The fraction of the cube’s volume contained in the sphere is
\[ f_d = \frac {V_d(a)}{(2a)^d} = \frac {\pi ^{d/2}}{2^d\,\Gamma (d/2 + 1)} = \frac {\pi ^{d/2}}{d \, 2^{d-1} \, \Gamma (d/2)}, \qquad \lim _{d \to \infty } f_d = 0, \]
using \(\Gamma (d/2 + 1) = (d/2)\,\Gamma (d/2)\). The cube’s volume therefore concentrates at its corners rather than at its center.
-
Example 10.3: In 2-D the sphere fills \(\pi /4 \approx 78.5\%\) of the square (Fig. 10.3). By \(d=5\) the fraction is already below \(20\%\), by \(d=10\) it is about \(2.5 \cdot 10^{-3}\), and by \(d=15\) it is on the order of \(10^{-5}\). A Monte Carlo check (draw \(N\) uniform points in \([-a,a]^d\) and count the fraction with \(\lVert \bx \rVert \le a\)) reproduces \(f_d\) to within \(1/\sqrt {N}\).
k-NN consequence The “neighborhood” is no longer local: the \(k\) chosen points span most of the feature space, so the vote reflects the overall class proportions rather than the class composition near the query. In bias-variance terms (Sec. 10.3), this drives bias up even at small \(k\), since the de-facto neighborhood is almost global; tuning \(k\) downward no longer recovers the low-bias regime that small \(k\) provides in low dimensions.
10.4.2 Volume concentrates near the surface
For a hypersphere, almost all volume sits in a thin shell at the boundary; the interior is essentially empty.
Consider two concentric hyperspheres of radii \(r\) and \(r - \epsilon \) (Fig. 10.4, left). The fraction of the larger sphere’s volume (10.9) in the thin shell between them is
\[ f_s = \frac {V_d(r) - V_d(r-\epsilon )}{V_d(r)} = 1 - \left (1 - \frac {\epsilon }{r}\right )^d, \qquad \lim _{d \to \infty } f_s = 1. \]
Equivalently, the interior fraction \(1 - f_s = (1 - \epsilon /r)^d\) decays geometrically in \(d\).
-
Example 10.4: A thin 10% shell takes the volume. For \(\epsilon /r = 0.1\) (a shell occupying only the outer tenth of the radius), the interior fraction is \(0.9^d\): about \(73\%\) at \(d=3\), \(35\%\) at \(d=10\), and \(12\%\) at \(d=20\) (Fig. 10.4). Data distributed uniformly over a hypersphere therefore clusters near the boundary, leaving interior regions empty.
k-NN consequence. Pairwise distances concentrate: the ratio of farthest to nearest neighbor distance approaches \(1\), so the ranking of training points by distance becomes nearly arbitrary; the same \(k\) points might as well have been chosen at random.
In bias-variance terms (Sec. 10.3), this is a variance explosion at every \(k\); different draws of the training set yield different neighbor sets and different predictions at the same query.
10.4.3 Samples live in the tails of the density
For the standard multivariate normal \(\bx \sim \mathcal {N}(\mathbf {0}, \bI _d)\) (zero mean, unit variance per coordinate, uncorrelated coordinates), samples cluster far from the mean1 as \(d\) grows; in high dimensions essentially every sample is a tail outlier (Sec. 1.3).
This sounds paradoxical, since the bell curve peaks at the origin. The catch: the chance of landing in a region \(A\) equals density times volume,
\[ P(\bx \in A) = \int _A f(\bx )\,\mathrm {d}\bx , \]
and there is far more volume far from the origin than near it. In \(d\) dimensions, the volume of a thin shell at radius \(\rho \) grows like \(\rho ^{d-1}\) (the surface area of a \(d\)-sphere), while the density drops as \(e^{-\rho ^2/2}\). So the radial density (the chance of landing at distance \(\rho \), summed over all directions) is
\[ g(\rho ) \;\propto \; \rho ^{d-1} e^{-\rho ^2/2}, \]
which is largest at \(\rho ^{*} = \sqrt {d-1}\). Samples therefore concentrate on a thin shell at distance roughly \(\sqrt {d}\) from the origin, where the density itself is already tiny. The origin is the single most likely point; the bulk of the probability sits on the shell.
How concentrated this shell is depends on \(d\). The squared distance \(\lVert \bx \rVert ^2\) follows a \(\chi ^2(d)\) distribution, so for any radius \(r\) the probability of landing outside the sphere of radius \(r\) is
\[ \Pr (\lVert \bx \rVert > r) = \Pr (\chi ^2(d) > r^2), \qquad \lim _{d \to \infty } \Pr (\chi ^2(d) > r^2) = 1, \]
i.e. for any fixed radius, the sample eventually escapes it almost surely as \(d\) grows.
1 Because the standard normal is symmetric and unimodal, the mean, the median, and the mode, all coincide at the origin \(\mathbf {0}\), so the three terms are interchangeable here.
-
Example 10.5: From bulk to tails as \(d\) grows. Take as reference the density level set \(\{\bx : f(\bx ) = 0.01\,f(\mathbf {0})\}\), i.e. the contour where the density equals \(1\%\) of its peak value at the origin (Fig. 10.5, left). Since \(\ln (f(\bx )/f(\mathbf {0})) = -\lVert \bx \rVert ^2/2\), this contour is the sphere of radius \(r_{1\%} = \sqrt {-2\ln 0.01} \approx 3.03\), the same numerical threshold in every dimension. Note that “\(1\%\)” here is the density value on the contour, not the probability mass outside it. The actual tail probability \(P_{\text {tail}}(d) = \Pr (\chi ^2(d) > -2\ln 0.01) = \Pr (\chi ^2(d) > 9.21)\) depends on \(d\):
-
• \(d = 1\): \(P_{\text {tail}} = 2(1-\Phi (3.03)) \approx 0.0024\), only about \(1\) sample in \(400\) lies outside, far below \(1\%\). The threshold \(3.03\sigma \) is much stricter than the actual \(1\%\) tail of a 1-D Gaussian (at \(\approx 2.58\sigma \)), and the 1-D “volume” at radius \(3\) is tiny, so very little mass reaches that far.
-
• \(d = 2\): \(P_{\text {tail}} = 0.01\) exactly. This is a coincidence of \(\chi ^2(2)\) being exponential: \(\Pr (\chi ^2(2) > c) = e^{-c/2}\), and with \(c = -2\ln 0.01\) this evaluates to \(0.01\).
-
• Between \(d = 5\) and \(d = 10\) the mass migrates rapidly into the tails (Fig. 10.5, right): the radial volume factor \(\rho ^{d-1}\) inflates fast enough to push the tail mass well past \(1\%\).
-
• \(d = 10\): more than half the samples lie outside the \(1\%\)-of-mode contour; equivalently \(\Pr (\lVert \bx \rVert \ge 1.6) = 0.99\).
-
• \(d = 20\): \(P_{\text {tail}} \approx 0.98\), so almost every sample is outside the contour.
-
• In very high dimensions, virtually every sample is a tail outlier.
-
Center: same construction in 2-D, where the interior region is a disk and a few more samples land in the tails.
Right: probability that a sample from a \(d\)-dimensional standard normal falls inside the contour; this interior probability collapses, so beyond \(d \approx 10\) virtually every sample is in the tails.
k-NN consequence k-NN guesses the class of a query by looking at the labels of nearby training points. This only works only in the area of the actual training points. When samples concentrate in the tails, every test query sits in a region where only a few training points landed, so the \(k\) nearest neighbors come from far away and do not really reflect the query’s surroundings.
In bias-variance terms (Sec. 10.3), small \(k\) gives high variance (a handful of far-away neighbors are noisy), and large \(k\) gives high bias (the average drifts toward the global majority); no choice of \(k\) works well.
10.4.4 Consequences for Classification
The three phenomena above, corner concentration, shell concentration, and tail concentration, are not specific to k-NN. Every classifier relies on some kind of geometric contrast (distances, densities, or margins), and each kind weakens in its own way as \(d\) grows.
-
• Distance-based methods (e.g., k-NN, kernel SVM, RBF networks, kernel density classifiers). Hurt most directly: the contrast between near and far collapses, so neighborhood votes and kernel weights stop carrying information. A learned or supervised metric helps only if the data actually lives on a low-dimensional manifold.
-
• Generative and density-based classifiers (Naive Bayes with continuous features, QDA, GMMs, KDE). Needs sample sizes that grow exponentially in \(d\); tail concentration means every test point sits in a region the training set never sampled densely, so the density is extrapolated rather than estimated.
-
• Linear and parametric classifiers (logistic regression, linear SVM). Geometrically less exposed, but vulnerable to a different face of the curse: when \(d\) approaches the sample size \(M\), the data becomes almost surely linearly separable for any labeling, so the training loss can be driven to zero without learning anything. Strong regularization (\(L_2\)/ridge, \(L_1\)/lasso, early stopping; Sec. 4.9) becomes mandatory rather than optional.
-
• Tree-based methods (decision trees, random forests, gradient boosting). Splits use one coordinate at a time, so they ignore distance concentration entirely, but they still need enough samples per leaf; the chance of selecting a useful split shrinks as the number of noise features dilutes the informative ones.
Practical remedies all reduce the effective dimension well below the nominal \(d\):
-
• Feature selection (Ch. 9): keep only the features that carry signal, discard the rest.
-
• Dimensionality reduction (Ch. 11): project the data onto a lower-dimensional subspace (e.g. PCA) that preserves the variance that matters.
-
• Regularization (Sec. 4.9): shrink the parameter space the model can explore, so even in nominally high \(d\) the fitted model has few effective degrees of freedom.
-
• Learned representations (deep networks): map raw features to a much lower intrinsic dimension that captures the task structure.
-
• Domain-driven feature engineering: encode prior knowledge directly, replacing many noisy raw features with a few informative ones.
Whichever route is chosen, more samples never hurt, but the required \(M\) to undo the curse grows exponentially in \(d\), so doubling the data has far less effect than removing irrelevant features.
10.5 Summary
Hyperparameters:
-
• the distance metric (Sec. 10.2),
-
• for Minkowski distance, the exponent \(p\) (Sec. 10.2.4), and
-
• the number of neighbors \(k\) (Sec. 10.3).
All three are typically chosen by cross-validation (Sec. 4.5).
Pros.
-
• Instance-based and non-parametric: no training phase, no model to fit, few hyperparameters, and new data is absorbed by adding points to the index.
-
• Probabilistic reading: the fraction of class-\(c\) neighbors estimates the class empirically.
-
• Flexible decision boundary: small \(k\) tracks fine local structure; large \(k\) smooths automatically.
-
• Strong baseline for small, low-dimensional, well-normalized datasets.
Cons.
-
• No learned structure: k-NN memorizes, it does not summarize; it cannot share information across queries or identify which features matter.
-
• High inference cost: \(\mathcal {O}(M\,N)\) per query (one \(N\)-dimensional distance to each of \(M\) training points). Approximate-nearest-neighbor indexes help up to moderate \(N\).
-
• Memory cost: the entire training set must be kept at inference time.
-
• Sensitive to feature scale and to irrelevant features; standardization (Sec. 4.7) and feature selection (Ch. 9) are essentially mandatory.
-
• Sensitive to outliers and label noise, especially at small \(k\).
-
• Performance degrades quickly in high dimensions (curse of dimensionality); shrink the effective dimension before tuning \(k\).
Further Reading
-
• Cover and Hart, “Nearest neighbor pattern classification” (IEEE Trans. Inf. Theory, 1967): the \(R^{*} \le R_{1\text {-NN}} \le 2R^{*}(1-R^{*})\) bound used in Sec. 10.3.
-
• Paradoxical behavior of multidimensional data: interactive MATLAB illustration of the three phenomena in Sec. sec-curse-of-dimensionality.