Machine Learning & Signals Learning
10 Classifiers
10.1 Support Vector Machine (SVM)
The central idea is to find a hyperplane that separates two classes while maximizing the distance (margin) to the nearest training points. Only the data points closest to the boundary, called support vectors, determine the classifier (Fig. 10.1).
10.1.1 Linear SVM
Consider a binary classification problem with labeled data:
\[ \{(x_i, y_i)\}_{i=1}^n, \quad y_i \in \{-1, +1\} \]
A linear classifier is defined as:
\[ f(x) = w^\top x + b \]
The decision boundary is given by:
\[ w^\top x + b = 0 \]
For a hard-margin SVM, the constraints are:
\[ y_i (w^\top x_i + b) \ge 1 \quad \forall i \]
The distance from a point \(x\) to the hyperplane \(w^\top x + b = 0\) is \(|w^\top x + b|/\|w\|\). The two margin boundaries \(w^\top x + b = \pm 1\) each lie at distance \(1/\|w\|\) from the decision boundary, so the total margin width is
\[ \text {Margin width} = \frac {2}{\|w\|}. \]
Maximizing the margin is therefore equivalent to minimizing \(\|w\|^2\).
Hard margin
When data is perfectly separable, the hard-margin SVM solves:
\(\seteqnumber{0}{}{0}\)\begin{equation} \min _{w,b} \; \frac {1}{2}\|w\|^2 \qquad \text {s.t.} \quad y_i (w^\top x_i + b) \ge 1 \quad \forall i \end{equation}
This is a convex quadratic program with a unique global optimum.
Soft margin
Real-world data often contains noise and class overlap. Soft margin SVM introduces slack variables \(\xi _i \ge 0\) that allow margin violations:
\(\seteqnumber{0}{}{1}\)\begin{equation} \min _{w,b,\xi } \; \frac {1}{2}\|w\|^2 + C \sum _{i=1}^n \xi _i \qquad \text {s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi _i, \quad \xi _i \ge 0 \quad \forall i \end{equation}
Interpretation of \(\xi _i\):
| \(\xi _i\) | Interpretation |
| \(\xi _i = 0\) | Correctly classified, outside the margin |
| \(0 < \xi _i < 1\) | Correctly classified, inside the margin |
| \(\xi _i = 1\) | On the decision boundary |
| \(\xi _i > 1\) | Misclassified |
The parameter \(C > 0\) controls the trade-off: large \(C\) penalizes violations heavily (narrow margin, risk of overfitting), small \(C\) allows more violations (wider margin, better robustness).
Hinge loss
Soft margin SVM is equivalent to minimizing hinge loss with L2 regularization:
\(\seteqnumber{0}{}{2}\)\begin{equation} \min _{w,b} \; \frac {1}{2}\|w\|^2 + C \sum _i \max (0, 1 - y_i f(x_i)) \end{equation}
Only points violating the margin (\(y_i f(x_i) < 1\)) contribute to the loss (Fig. 10.2).
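The hinge loss term is easy to verify numerically. The following NumPy sketch (toy labels and scores are illustrative assumptions) computes \(\max(0, 1 - y_i f(x_i))\) per sample and confirms that only margin violators contribute:

```python
import numpy as np

def hinge_loss(y, scores):
    """Total hinge loss: sum of max(0, 1 - y*f(x)).

    Labels y are in {-1, +1}; scores are f(x) = w.x + b
    from any linear classifier. Only samples with
    y*f(x) < 1 (margin violators) contribute.
    """
    return np.maximum(0.0, 1.0 - y * scores).sum()

# Toy example (assumed values, for illustration only)
y = np.array([+1, +1, -1, -1])
scores = np.array([2.0, 0.5, -3.0, 0.2])  # last point is misclassified
# Margins y*f(x): 2.0, 0.5, 3.0, -0.2
# Per-sample losses: 0, 0.5, 0, 1.2  ->  total 1.7
print(hinge_loss(y, scores))  # ~1.7
```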
10.1.2 Kernel SVM
When data is not linearly separable, SVM maps inputs into a higher-dimensional feature space \(\phi : \mathbb {R}^d \rightarrow \mathbb {R}^D\). Kernels compute inner products in this space without explicit mapping:
\(\seteqnumber{0}{}{3}\)\begin{equation} K(x, z) = \phi (x)^\top \phi (z) \end{equation}
Common kernels: linear, polynomial, radial basis function (RBF), and sigmoid.
Example 10.1: Fig. 10.3 shows SVM with three kernels on concentric-circle data (not linearly separable). The linear kernel fails to separate the classes. The polynomial kernel (\(d=2\)) and RBF kernel both produce circular decision boundaries that correctly separate the inner and outer rings. Support vectors are circled.
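A setup analogous to Example 10.1 can be reproduced with scikit-learn's `SVC`. The dataset parameters and kernel settings below are illustrative assumptions, not the figure's exact configuration:

```python
# Concentric-circle data: linearly inseparable, but RBF-separable.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)

# The linear kernel cannot separate the rings; the RBF kernel can.
print("linear kernel accuracy:", linear.score(X, y))  # roughly 0.5
print("RBF kernel accuracy:   ", rbf.score(X, y))     # close to 1.0
print("support vectors (RBF): ", rbf.n_support_.sum())
```

`n_support_` reports the number of support vectors per class, corresponding to the circled points in the figure.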
10.1.3 Summary
Pros:
• Strong generalization via margin maximization.
• Effective in high-dimensional spaces.
• Robust against overfitting (controlled by \(C\)).
Cons:
• Not suitable for very large datasets (kernel SVM training typically scales between quadratic and cubic in the number of samples).
• Non-probabilistic output (no direct class probabilities).
• Kernel and \(C\) selection require cross-validation.
Hyperparameters: \(C\) (regularization), kernel type, kernel parameters (e.g. \(\gamma \) for RBF). Feature scaling is required.
In Python: scikit-learn (SVC, LinearSVC).
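A minimal usage sketch combining `SVC` with the feature scaling required above. The dataset and hyperparameter values are illustrative assumptions:

```python
# SVC inside a Pipeline so scaling is fit on training data only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# StandardScaler handles the required feature scaling;
# C and the kernel would normally be chosen by cross-validation.
clf = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf"))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```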
10.2 Decision Trees
A decision tree classifies a sample by traversing a binary tree from root to leaf. At each internal node, a single feature \(x_j\) is compared to a threshold \(t\): the sample goes left if \(x_j \le t\) and right otherwise. Each leaf assigns a class label.
Example 10.2: Fig. 10.4 compares three classifiers on the same dataset. A shallow tree (depth\(\,=2\)) produces coarse axis-aligned regions. An unrestricted tree perfectly fits the training data but creates a complex, overfitting boundary. A random forest (100 trees) smooths the boundary by averaging many individual trees, reducing variance.
10.2.1 Tree Construction (CART)
The Classification and Regression Tree (CART) algorithm builds the tree greedily:
1. For each node, consider all features and all possible thresholds.
2. Select the split \((j, t)\) that maximizes a purity criterion (see below).
3. Partition the data and recurse on each child node.
4. Stop when a stopping criterion is met (e.g. maximum depth, minimum samples per leaf, or no further purity improvement).
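The split-selection step can be sketched as an exhaustive search minimizing the weighted Gini impurity of the two children. This is a toy illustration of the search, not a full recursive CART implementation; real implementations use midpoints between sorted sample values as thresholds, which is simplified here:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p**2)

def best_split(X, y):
    """Exhaustive search over (feature j, threshold t).

    Returns the split minimizing the sample-weighted Gini
    impurity of the left (x_j <= t) and right (x_j > t) children.
    """
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):
        # Candidate thresholds: sample values (all but the max,
        # which would put every sample on one side).
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Toy data (assumed values): feature 0 separates the classes perfectly.
X = np.array([[1.0, 5.0], [2.0, 4.0], [8.0, 4.5], [9.0, 5.5]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # feature 0, threshold 2.0, impurity 0.0
```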
10.2.2 Splitting Criteria
Let \(p_c\) denote the fraction of samples belonging to class \(c\) at a given node.
Gini impurity
\(\seteqnumber{0}{}{4}\)\begin{equation} G = 1 - \sum _{c} p_c^2 \end{equation}
Measures the probability that a randomly chosen sample would be misclassified if labeled according to the class distribution at the node. \(G=0\) for a pure node.
Entropy (information gain)
\(\seteqnumber{0}{}{5}\)\begin{equation} H = -\sum _{c} p_c \log _2 p_c \end{equation}
The split that yields the largest reduction in entropy (information gain) is selected. Entropy and Gini often produce similar trees; Gini is computationally cheaper.
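A quick numeric check of both criteria on a node's class distribution (the probability vectors below are illustrative assumptions):

```python
import numpy as np

def gini(p):
    """Gini impurity from a vector of class fractions p_c."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p**2)

def entropy(p):
    """Entropy in bits from a vector of class fractions p_c."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # pure node: 0.0, 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # max impurity: 0.5, 1.0
```

Both criteria are zero for a pure node and maximal for a uniform class distribution; they differ only in scale and curvature, which is why they usually select similar splits.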
10.2.3 Regularization
Without constraints, a tree can grow until every leaf contains a single sample, perfectly fitting the training data but overfitting. Common regularization strategies:
• Maximum depth: limit the number of levels.
• Minimum samples per leaf: require at least \(n_{\min }\) samples in each leaf.
• Pruning: grow a full tree, then remove branches that do not improve validation performance.
10.2.4 Random Forest
A random forest is an ensemble of decision trees trained on bootstrap samples of the data. At each split, only a random subset of features is considered. The final prediction is the majority vote across all trees.
This reduces variance compared to a single tree while preserving the ability to model complex boundaries.
10.2.5 Summary
Pros:
• Interpretable: the tree can be visualized and each decision explained.
• No feature scaling required.
• Handles both numerical and categorical features.
• Fast inference: \(\mathcal {O}(\text {depth})\) per prediction.
Cons:
• Single trees are prone to overfitting (high variance).
• Axis-aligned splits limit expressiveness for rotated decision boundaries.
• Greedy construction does not guarantee a globally optimal tree.
Hyperparameters: maximum depth, minimum samples per leaf, splitting criterion (Gini or entropy). For random forests: number of trees, number of features per split.
In Python: scikit-learn (DecisionTreeClassifier, RandomForestClassifier).
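A usage sketch mirroring the shallow-tree / full-tree / forest comparison of Example 10.2. The dataset and its parameters are illustrative assumptions, not the figure's exact settings:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
full = DecisionTreeClassifier(random_state=0).fit(X, y)  # unrestricted
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The unrestricted tree memorizes the training set (train accuracy 1.0);
# the depth-2 tree cannot, and the forest averages 100 bootstrapped trees.
print("shallow:", shallow.score(X, y))
print("full:   ", full.score(X, y))
print("forest: ", forest.score(X, y))
```

Training accuracy alone favors the unrestricted tree, but its perfect fit is exactly the overfitting discussed in Section 10.2.3; held-out data would be needed to compare generalization.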
10.3 k-Nearest Neighbors (k-NN)
Unlike logistic regression, which learns a parametric decision boundary, k-NN classifies a new point by majority vote among its \(k\) nearest neighbors in the training set.
10.3.1 Algorithm
For a new sample \(\bx \):
1. Compute distance \(d(\bx , \bx _i)\) to all training samples.
2. Select the \(k\) samples with smallest distances.
3. Assign the class by majority vote among the \(k\) neighbors.
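The three steps above can be sketched in a few lines of NumPy (Euclidean distance; the toy training set is an illustrative assumption):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)  # step 1: all distances
    nearest = np.argsort(dists)[:k]              # step 2: k smallest
    votes = y_train[nearest]                     # step 3: majority vote
    # np.bincount + argmax resolves ties in favor of the lowest label.
    return np.bincount(votes).argmax()

# Toy training set (assumed values): two clusters, labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3))  # 0
```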
Example 10.3: Fig. 10.5 shows k-NN classification for a query point (star) with \(k=1\), \(k=5\), and \(k=15\). The decision boundary (dashed) becomes smoother as \(k\) increases: \(k=1\) produces a complex, noisy boundary, while \(k=15\) yields a simpler one at the cost of reduced local sensitivity.
10.3.2 Distance Metrics
The most popular distance metrics are:
Euclidean (\(L_2\))
\(\seteqnumber{0}{}{6}\)\begin{equation} d(\ba ,\bb ) = \lVert \ba -\bb \rVert = \sqrt {\sum _{j=1}^N (a_j-b_j)^2} \end{equation}
Manhattan (\(L_1\), city block)
\(\seteqnumber{0}{}{7}\)\begin{equation} d(\ba ,\bb ) = \sum _{j=1}^N \abs {a_j-b_j} \end{equation}
Minkowski (generalized, hyperparameter \(p\))
\(\seteqnumber{0}{}{8}\)\begin{equation} d(\ba ,\bb ) = \left (\sum _{j=1}^N \abs {a_j-b_j}^p\right )^{1/p} \end{equation}
Special cases: \(p=1\) gives Manhattan, \(p=2\) gives Euclidean.
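All three metrics can be expressed through one Minkowski routine, as this NumPy sketch shows (the vectors are an illustrative assumption, chosen so both results are exact):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: (sum |a_j - b_j|^p)^(1/p)."""
    return np.sum(np.abs(a - b)**p)**(1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, 1))  # p=1, Manhattan: 7.0
print(minkowski(a, b, 2))  # p=2, Euclidean: 5.0
```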
10.3.3 Tie-Breaking
When \(k>1\), the neighbors may produce a tied vote. Common strategies to resolve ties:
• Random selection among the tied classes.
• Select the class of the nearest neighbor among the tied samples.
10.3.4 Curse of Dimensionality
In high-dimensional spaces, distances become less meaningful:
• All points tend to be equidistant from each other.
• The ratio of nearest to farthest neighbor distance approaches 1.
• Most of the volume of a hypersphere concentrates near its surface.
As the input dimension \(N\) grows, k-NN performance degrades unless the number of training samples \(M\) grows exponentially with it.
10.3.5 Summary
Pros:
• Simple, non-parametric, no training phase.
• Natural probabilistic interpretation (fraction of neighbors voting for each class).
• Can model complex decision boundaries.
Cons:
• High inference cost: \(\mathcal {O}(M \cdot N)\) per prediction (\(M\) distances, each over \(N\) dimensions).
• Sensitive to irrelevant features and feature scale (normalization required).
• Sensitive to outliers (especially for small \(k\)).
• Performance degrades in high dimensions (curse of dimensionality).
Hyperparameters: \(k\) (number of neighbors), distance metric.
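In Python, scikit-learn's `KNeighborsClassifier` exposes both hyperparameters: `n_neighbors` for \(k\), and `p` for the Minkowski metric (its default metric, with \(p=2\) giving Euclidean). A minimal sketch on an assumed toy dataset, with the normalization the Cons above call for:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit scaling on training data only, then apply to both splits.
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=5, p=2)  # k=5, Euclidean distance
knn.fit(scaler.transform(X_tr), y_tr)
print("test accuracy:", knn.score(scaler.transform(X_te), y_te))
```

Note that `fit` merely stores the training set; all distance computation happens at prediction time, reflecting the "no training phase / high inference cost" trade-off above.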