Machine Learning & Signals

\(\newcommand{\footnotename}{footnote}\) \(\def \LWRfootnote {1}\) \(\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\let \LWRorighspace \hspace \) \(\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }\) \(\newcommand {\TextOrMath }[2]{#2}\) \(\newcommand {\mathnormal }[1]{{#1}}\) \(\newcommand \ensuremath [1]{#1}\) \(\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } \) \(\newcommand {\setlength }[2]{}\) \(\newcommand {\addtolength }[2]{}\) \(\newcommand {\setcounter }[2]{}\) \(\newcommand {\addtocounter }[2]{}\) \(\newcommand {\arabic }[1]{}\) \(\newcommand {\number }[1]{}\) \(\newcommand {\noalign }[1]{\text {#1}\notag \\}\) \(\newcommand {\cline }[1]{}\) \(\newcommand {\directlua }[1]{\text {(directlua)}}\) \(\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}\) \(\newcommand {\protect }{}\) \(\def \LWRabsorbnumber #1 {}\) \(\def \LWRabsorbquotenumber "#1 {}\) \(\newcommand {\LWRabsorboption }[1][]{}\) \(\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }\) \(\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }\) \(\def \mathcode #1={\mathchar }\) \(\let \delcode \mathcode \) \(\let \delimiter \mathchar \) \(\def \oe {\unicode {x0153}}\) \(\def \OE {\unicode {x0152}}\) \(\def \ae {\unicode {x00E6}}\) \(\def \AE {\unicode {x00C6}}\) \(\def \aa {\unicode {x00E5}}\) \(\def \AA {\unicode {x00C5}}\) \(\def \o {\unicode {x00F8}}\) \(\def \O {\unicode {x00D8}}\) \(\def \l {\unicode {x0142}}\) \(\def \L {\unicode {x0141}}\) \(\def \ss {\unicode {x00DF}}\) \(\def \SS {\unicode {x1E9E}}\) \(\def \dag {\unicode {x2020}}\) \(\def \ddag {\unicode {x2021}}\) \(\def \P {\unicode {x00B6}}\) \(\def \copyright {\unicode {x00A9}}\) \(\def \pounds {\unicode {x00A3}}\) \(\let \LWRref \ref \) \(\renewcommand {\ref }{\ifstar \LWRref \LWRref }\) \( \newcommand {\multicolumn }[3]{#3}\) \(\require {textcomp}\) \( \newcommand {\abs }[1]{\lvert #1\rvert } \) \( \DeclareMathOperator {\sign }{sign} \) \(\newcommand {\intertext }[1]{\text {#1}\notag \\}\) \(\let \Hat \hat \) \(\let \Check \check \) \(\let \Tilde \tilde \) \(\let \Acute \acute \) \(\let \Grave \grave \) \(\let \Dot \dot \) \(\let \Ddot \ddot \) \(\let \Breve \breve \) \(\let \Bar \bar \) \(\let \Vec \vec \) \(\newcommand {\bm }[1]{\boldsymbol {#1}}\) \(\require {physics}\) \(\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}\) \(\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}\) \(\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}\) \(\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}\) \(\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}\) \(\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}\) \(\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}\) \(\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}\) \(\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}\) \(\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}\) \(\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}\) \(\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}\) \(\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}\) \(\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}\) \(\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}\) \(\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}\) \(\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}\) \(\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}\) \(\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}\) \(\renewcommand 
{\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}\) \(\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}\) \(\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}\) \(\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}\) \(\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}\) \(\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}\) \(\require {cancel}\) \(\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}\) \(\DeclareMathOperator *{\argmax }{argmax}\) \(\DeclareMathOperator *{\argmin }{arg\,min}\) \(\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}\) \(\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}\) \(\def \Cov [#1]{\operatorname {Cov}\!\left [ #1 \right ]}\) \(\newcommand {\floor }[1]{\lfloor #1 \rfloor }\) \(\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}\) \(\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}\) \(\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}\) \(\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}\) \(\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}\) \(\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}\) \(\renewcommand {\real }{\mathbb {R}}\) \(\newcommand {\ba }{\mathbf {a}}\) \(\newcommand {\bb }{\mathbf {b}}\) \(\newcommand {\bd }{\mathbf {d}}\) \(\newcommand {\be }{\mathbf {e}}\) \(\newcommand {\bh }{\mathbf {h}}\) \(\newcommand {\bn }{\mathbf {n}}\) \(\newcommand {\bq }{\mathbf {q}}\) \(\newcommand {\br }{\mathbf {r}}\) \(\newcommand {\bt }{\mathbf {t}}\) \(\newcommand {\bv }{\mathbf {v}}\) \(\newcommand {\bw }{\mathbf {w}}\) \(\newcommand {\bx }{\mathbf {x}}\) \(\newcommand {\bxx }{\mathbf {xx}}\) \(\newcommand {\bxy }{\mathbf {xy}}\) \(\newcommand {\by }{\mathbf {y}}\) \(\newcommand {\byy }{\mathbf {yy}}\) \(\newcommand {\bz }{\mathbf {z}}\) \(\newcommand {\bA }{\mathbf {A}}\) \(\newcommand {\bB }{\mathbf {B}}\) \(\newcommand {\bI }{\mathbf {I}}\) \(\newcommand {\bK }{\mathbf {K}}\) \(\newcommand {\bP }{\mathbf {P}}\) \(\newcommand {\bQ }{\mathbf {Q}}\) \(\newcommand {\bR }{\mathbf {R}}\) \(\newcommand {\bU }{\mathbf {U}}\) \(\newcommand {\bW }{\mathbf {W}}\) \(\newcommand {\bX }{\mathbf {X}}\) \(\newcommand {\bY }{\mathbf {Y}}\) \(\newcommand {\bZ }{\mathbf {Z}}\) \(\newcommand {\balpha }{\bm {\alpha }}\) \(\newcommand {\bth }{{\bm {\theta }}}\) \(\newcommand {\bepsilon }{{\bm {\epsilon }}}\) \(\newcommand {\bmu }{{\bm {\mu }}}\) \(\newcommand {\bOne }{\mathbf {1}}\) \(\newcommand {\bZero }{\mathbf {0}}\) \(\newcommand {\loss }{\mathcal {L}}\) \(\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}\) \(\newcommand {\SSE }{\mathrm {SSE}}\) \(\newcommand {\MSE }{\mathrm {MSE}}\) \(\newcommand {\RMSE }{\mathrm {RMSE}}\) \(\newcommand {\toprule }[1][]{\hline }\) \(\let \midrule \toprule \) \(\let \bottomrule \toprule \) \(\def \LWRbooktabscmidruleparen (#1)#2{}\) \(\newcommand {\LWRbooktabscmidrulenoparen }[1]{}\) \(\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }\) \(\newcommand {\morecmidrules }{}\) \(\newcommand {\specialrule }[3]{\hline }\) \(\newcommand {\addlinespace }[1][]{}\) \(\newcommand {\LWRsubmultirow }[2][]{#2}\) \(\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }\) \(\newcommand {\multirow }[2][]{\LWRmultirow }\) \(\newcommand {\mrowcell }{}\) \(\newcommand {\mcolrowcell }{}\) \(\newcommand {\STneed }[1]{}\) 
\(\newcommand {\tcbset }[1]{}\) \(\newcommand {\tcbsetforeverylayer }[1]{}\) \(\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}\) \(\newcommand {\tcboxfit }[2][]{\boxed {#2}}\) \(\newcommand {\tcblower }{}\) \(\newcommand {\tcbline }{}\) \(\newcommand {\tcbtitle }{}\) \(\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}\) \(\newcommand {\tcboxmath }[2][]{\boxed {#2}}\) \(\newcommand {\tcbhighmath }[2][]{\boxed {#2}}\)

10 Classifiers

10.1 Support Vector Machine (SVM)

  • Goal: Binary classifier that finds the decision boundary maximizing the margin between classes.

The central idea is to find a hyperplane that separates two classes while maximizing the distance (margin) to the nearest training points. Only the data points closest to the boundary, called support vectors, determine the classifier (Fig. 10.1).

(image)

Figure 10.1: Left: linear SVM with decision boundary (solid), margin boundaries (dashed), and support vectors (circled). Right: effect of the regularization parameter \(C\) on margin width — small \(C\) yields a wider margin, large \(C\) a narrower one.

10.1.1 Linear SVM

Consider a binary classification problem with labeled data:

\[ \{(x_i, y_i)\}_{i=1}^n, \quad y_i \in \{-1, +1\} \]

A linear classifier is defined as:

\[ f(x) = w^\top x + b \]

The decision boundary is given by:

\[ w^\top x + b = 0 \]

For a hard-margin SVM, the constraints are:

\[ y_i (w^\top x_i + b) \ge 1 \quad \forall i \]

The distance from a point \(x\) to the hyperplane \(w^\top x + b = 0\) is \(|w^\top x + b|/\|w\|\). The two margin boundaries \(w^\top x + b = \pm 1\) each lie at distance \(1/\|w\|\) from the decision boundary, so the total margin width is

\[ \text {Margin width} = \frac {2}{\|w\|}. \]

Maximizing the margin is therefore equivalent to minimizing \(\|w\|^2\).

Hard margin

When data is perfectly separable, the hard-margin SVM solves:

\begin{equation} \min _{w,b} \; \frac {1}{2}\|w\|^2 \qquad \text {s.t.} \quad y_i (w^\top x_i + b) \ge 1 \quad \forall i \end{equation}

This is a convex quadratic program with a unique global optimum.

Soft margin

Real-world data often contains noise and class overlap. Soft margin SVM introduces slack variables \(\xi _i \ge 0\) that allow margin violations:

\begin{equation} \min _{w,b,\xi } \; \frac {1}{2}\|w\|^2 + C \sum _{i=1}^n \xi _i \qquad \text {s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi _i, \quad \xi _i \ge 0 \quad \forall i \end{equation}

Interpretation of \(\xi _i\):

\(\xi _i\)              Interpretation
\(\xi _i = 0\)          Correctly classified, outside the margin
\(0 < \xi _i < 1\)      Correctly classified, inside the margin
\(\xi _i = 1\)          On the decision boundary
\(\xi _i > 1\)          Misclassified

The parameter \(C > 0\) controls the trade-off: large \(C\) penalizes violations heavily (narrow margin, risk of overfitting), small \(C\) allows more violations (wider margin, better robustness).
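A small scikit-learn sketch of this trade-off (synthetic blobs; the values of \(C\) are illustrative), fitting a linear SVM for several \(C\) and reporting the margin width \(2/\|w\|\) from above:

# Margin width 2/||w|| as a function of C (cf. Fig. 10.1, right).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]  # weight vector of the fitted hyperplane
    print(f"C = {C}: margin width = {2 / np.linalg.norm(w):.3f}")

A very large \(C\) approximates the hard-margin solution above.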

Hinge loss

Soft margin SVM is equivalent to minimizing hinge loss with L2 regularization:

\begin{equation} \min _{w,b} \; \frac {1}{2}\|w\|^2 + C \sum _i \max (0, 1 - y_i f(x_i)) \end{equation}

Only points violating the margin (\(y_i f(x_i) < 1\)) contribute to the loss (Fig. 10.2).
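A tiny NumPy sketch (the labels and scores are made up) showing that only points with \(y_i f(x_i) < 1\) incur a loss:

import numpy as np

def hinge_loss(y, f):
    # max(0, 1 - y*f): zero exactly for points outside the margin
    return np.maximum(0.0, 1.0 - y * f)

y = np.array([+1, +1, +1, -1])        # true labels
f = np.array([2.0, 0.5, -0.3, -2.0])  # classifier scores f(x_i)
print(hinge_loss(y, f))               # [0.  0.5 1.3 0. ]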

(image)

Figure 10.2: Hinge loss (solid) compared to the 0–1 loss (dashed). The shaded region shows where hinge loss penalizes points inside or violating the margin (\(y \cdot f(x) < 1\)).

10.1.2 Kernel SVM

When data is not linearly separable, SVM maps inputs into a higher-dimensional feature space \(\phi : \mathbb {R}^d \rightarrow \mathbb {R}^D\). Kernels compute inner products in this space without explicit mapping:

\begin{equation} K(x, z) = \phi (x)^\top \phi (z) \end{equation}

Common kernels: linear, polynomial, radial basis function (RBF), and sigmoid.
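The following scikit-learn sketch compares three of these kernels on the concentric-circle data of Example 10.1 below (sample size, noise level, and polynomial degree are illustrative):

# Kernel comparison on data that is not linearly separable (cf. Fig. 10.3).
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.5, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    # SVMs need scaled features; degree=2 only affects the polynomial kernel.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, degree=2))
    score = clf.fit(X, y).score(X, y)
    print(f"{kernel}: training accuracy = {score:.2f}")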

  • Example 10.1: Fig. 10.3 shows SVM with three kernels on concentric-circle data (not linearly separable). The linear kernel fails to separate the classes. The polynomial kernel (degree 2) and the RBF kernel both produce circular decision boundaries that correctly separate the inner and outer rings. Support vectors are circled.

(image)

Figure 10.3: Kernel SVM on non-linearly separable data. Linear kernel (left) cannot separate the classes; polynomial (center) and RBF (right) kernels produce nonlinear boundaries. Dashed lines show the margin.

10.1.3 Summary

Pros:

  • Strong generalization via margin maximization.

  • Effective in high-dimensional spaces.

  • Robust against overfitting (controlled by \(C\)).

Cons:

  • Not suitable for very large datasets (training scales poorly).

  • Non-probabilistic output (no direct class probabilities).

  • Kernel and \(C\) selection require cross-validation.

Hyperparameters: \(C\) (regularization), kernel type, kernel parameters (e.g. \(\gamma \) for RBF). Feature scaling is required.

In Python: scikit-learn (SVC, LinearSVC).

10.2 Decision Trees

  • Goal: Recursive partitioning classifier that splits the feature space into axis-aligned regions, each assigned a class label.

A decision tree classifies a sample by traversing a binary tree from root to leaf. At each internal node, a single feature \(x_j\) is compared to a threshold \(t\): the sample goes left if \(x_j \le t\) and right otherwise. Each leaf assigns a class label.

  • Example 10.2: Fig. 10.4 compares three classifiers on the same dataset. A shallow tree (depth\(\,=2\)) produces coarse axis-aligned regions. An unrestricted tree perfectly fits the training data but creates a complex, overfitting boundary. A random forest (100 trees) smooths the boundary by averaging many individual trees, reducing variance.

(image)

Figure 10.4: Decision boundaries for a shallow tree (left), an unrestricted tree (center), and a random forest (right). Note the axis-aligned splits characteristic of tree-based methods.

10.2.1 Tree Construction (CART)

The Classification and Regression Tree (CART) algorithm builds the tree greedily (see the sketch after the steps):

  • 1. For each node, consider all features and all possible thresholds.

  • 2. Select the split \((j, t)\) that maximizes a purity criterion (see below).

  • 3. Partition the data and recurse on each child node.

  • 4. Stop when a stopping criterion is met (e.g. maximum depth, minimum samples per leaf, or no further purity improvement).
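A minimal scikit-learn sketch of these steps (the dataset and depth limit are illustrative); export_text prints the greedy axis-aligned splits \(x_j \le t\):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Each internal node tests a single feature against a threshold.
print(export_text(tree, feature_names=data.feature_names))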

10.2.2 Splitting Criteria

Let \(p_c\) denote the fraction of samples belonging to class \(c\) at a given node.

Gini impurity

\begin{equation} G = 1 - \sum _{c} p_c^2 \end{equation}

Measures the probability that a randomly chosen sample would be misclassified if it were labeled randomly according to the class distribution at the node. \(G=0\) for a pure node.

Entropy (information gain)

\begin{equation} H = -\sum _{c} p_c \log _2 p_c \end{equation}

The split that yields the largest reduction in entropy (information gain) is selected. Entropy and Gini often produce similar trees; Gini is computationally cheaper.
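Both criteria are simple functions of the class proportions at a node; a small NumPy sketch (the proportions are made up):

import numpy as np

def gini(p):
    # Gini impurity 1 - sum(p_c^2)
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Entropy -sum(p_c log2 p_c), with 0 log 0 taken as 0
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0 (maximally impure)
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0 (pure node)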

10.2.3 Regularization

Without constraints, a tree can grow until every leaf contains a single sample, perfectly fitting the training data but overfitting. Common regularization strategies:

  • Maximum depth: limit the number of levels.

  • Minimum samples per leaf: require at least \(n_{\min }\) samples in each leaf.

  • Pruning: grow a full tree, then remove branches that do not improve validation performance.

10.2.4 Random Forest

A random forest is an ensemble of decision trees trained on bootstrap samples of the data. At each split, only a random subset of features is considered. The final prediction is the majority vote across all trees.

This reduces variance compared to a single tree while preserving the ability to model complex boundaries.
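A short sketch of this variance reduction (dataset, noise level, and number of trees are illustrative), comparing cross-validated accuracy:

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

single = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Averaging 100 bootstrapped trees typically beats one unrestricted tree.
print("tree:  ", cross_val_score(single, X, y).mean())
print("forest:", cross_val_score(forest, X, y).mean())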

10.2.5 Summary

Pros:

  • Interpretable: the tree can be visualized and each decision explained.

  • No feature scaling required.

  • Handles both numerical and categorical features.

  • Fast inference: \(\mathcal {O}(\text {depth})\) per prediction.

Cons:

  • Single trees are prone to overfitting (high variance).

  • Axis-aligned splits limit expressiveness for rotated decision boundaries.

  • Greedy construction does not guarantee a globally optimal tree.

Hyperparameters: maximum depth, minimum samples per leaf, splitting criterion (Gini or entropy). For random forests: number of trees, number of features per split.

In Python: scikit-learn (DecisionTreeClassifier, RandomForestClassifier).

10.3 k-Nearest Neighbors (k-NN)

  • Goal: Non-parametric classifier based on local similarity.

Unlike logistic regression, which learns a parametric decision boundary, k-NN classifies a new point by a majority vote among its \(k\) nearest neighbors in the training set.

10.3.1 Algorithm

For a new sample \(\mathbf {x}\) (see the sketch after these steps):

  • 1. Compute distance \(d(\mathbf {x}, \mathbf {x}_i)\) to all training samples.

  • 2. Select the \(k\) samples with smallest distances.

  • 3. Assign the class by majority vote among the \(k\) neighbors.
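A brute-force NumPy sketch of these three steps (the toy data is made up); in practice scikit-learn's KNeighborsClassifier implements the same idea with faster neighbor search:

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)  # 1. distances to all samples
    nearest = np.argsort(d)[:k]              # 2. indices of the k nearest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]         # 3. majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 0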

  • Example 10.3: Fig. 10.5 shows k-NN classification for a query point (star) with \(k=1\), \(k=5\), and \(k=15\). The decision boundary (dashed) becomes smoother as \(k\) increases: \(k=1\) produces a complex, noisy boundary, while \(k=15\) yields a simpler one at the cost of reduced local sensitivity.

(image)

Figure 10.5: k-NN decision boundaries for different values of \(k\). The query point (star) and its \(k\) nearest neighbors (connected by lines) are shown. Larger \(k\) smooths the boundary.

10.3.2 Distance Metrics

The most popular distance metrics are:

Euclidean (\(L_2\))

\begin{equation} d(\mathbf {a},\mathbf {b}) = \lVert \mathbf {a}-\mathbf {b}\rVert = \sqrt {\sum _{j=1}^N (a_j-b_j)^2} \end{equation}

Manhattan (\(L_1\), city block)

\begin{equation} d(\mathbf {a},\mathbf {b}) = \sum _{j=1}^N \lvert a_j-b_j\rvert \end{equation}

Minkowski (generalized, hyperparameter \(p\))

\begin{equation} d(\mathbf {a},\mathbf {b}) = \left (\sum _{j=1}^N \lvert a_j-b_j\rvert ^p\right )^{1/p} \end{equation}

Special cases: \(p=1\) gives Manhattan, \(p=2\) gives Euclidean.
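A tiny sketch verifying the special cases on one pair of vectors:

import numpy as np

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
print(minkowski(a, b, 1))  # 5.0, the Manhattan distance
print(minkowski(a, b, 2))  # ~3.606 = sqrt(13), the Euclidean distance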

10.3.3 Tie-Breaking

When \(k>1\), the neighbors can produce a tied vote between two or more classes. Ways to handle this situation:

  • Random selection among tied classes.

  • Select class of the nearest neighbor among tied samples.

10.3.4 Curse of Dimensionality

In high-dimensional spaces, distances become less meaningful:

  • All points tend to be equidistant from each other.

  • The ratio of nearest to farthest neighbor distance approaches 1.

  • Most volume of a hypersphere concentrates near its surface.

As the dimension \(N\) grows, k-NN performance degrades unless the number of training samples \(M\) grows exponentially with it.
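A quick empirical sketch of the nearest-to-farthest ratio (uniform random points; the sample sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)  # distances to one query point
    print(f"dim = {dim}: min/max distance ratio = {d.min() / d.max():.3f}")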

10.3.5 Summary

Pros:

  • Simple, non-parametric, no training phase.

  • Natural probabilistic interpretation (fraction of neighbors).

  • Can model complex decision boundaries.

Cons:

  • High inference cost: \(\mathcal {O}(M \cdot N)\) per prediction (\(M\) distances, each computed in \(N\) dimensions).

  • Sensitive to irrelevant features and scale (normalization required).

  • Sensitive to outliers (especially for small \(k\)).

  • Performance degrades in high dimensions (curse of dimensionality).

Hyperparameters: \(k\) (number of neighbors), distance metric.