Machine Learning & Signals Learning

\(\newcommand{\footnotename}{footnote}\) \(\def \LWRfootnote {1}\) \(\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\let \LWRorighspace \hspace \) \(\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }\) \(\newcommand {\TextOrMath }[2]{#2}\) \(\newcommand {\mathnormal }[1]{{#1}}\) \(\newcommand \ensuremath [1]{#1}\) \(\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } \) \(\newcommand {\setlength }[2]{}\) \(\newcommand {\addtolength }[2]{}\) \(\newcommand {\setcounter }[2]{}\) \(\newcommand {\addtocounter }[2]{}\) \(\newcommand {\arabic }[1]{}\) \(\newcommand {\number }[1]{}\) \(\newcommand {\noalign }[1]{\text {#1}\notag \\}\) \(\newcommand {\cline }[1]{}\) \(\newcommand {\directlua }[1]{\text {(directlua)}}\) \(\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}\) \(\newcommand {\protect }{}\) \(\def \LWRabsorbnumber #1 {}\) \(\def \LWRabsorbquotenumber "#1 {}\) \(\newcommand {\LWRabsorboption }[1][]{}\) \(\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }\) \(\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }\) \(\def \mathcode #1={\mathchar }\) \(\let \delcode \mathcode \) \(\let \delimiter \mathchar \) \(\def \oe {\unicode {x0153}}\) \(\def \OE {\unicode {x0152}}\) \(\def \ae {\unicode {x00E6}}\) \(\def \AE {\unicode {x00C6}}\) \(\def \aa {\unicode {x00E5}}\) \(\def \AA {\unicode {x00C5}}\) \(\def \o {\unicode {x00F8}}\) \(\def \O {\unicode {x00D8}}\) \(\def \l {\unicode {x0142}}\) \(\def \L {\unicode {x0141}}\) \(\def \ss {\unicode {x00DF}}\) \(\def \SS {\unicode {x1E9E}}\) \(\def \dag {\unicode {x2020}}\) \(\def \ddag {\unicode {x2021}}\) \(\def \P {\unicode {x00B6}}\) \(\def \copyright {\unicode {x00A9}}\) \(\def \pounds {\unicode {x00A3}}\) \(\let \LWRref \ref \) \(\renewcommand {\ref }{\ifstar \LWRref \LWRref }\) \( \newcommand {\multicolumn }[3]{#3}\) \(\require {textcomp}\) \( \newcommand {\abs }[1]{\lvert #1\rvert } \) \( \DeclareMathOperator {\sign }{sign} \) \(\newcommand {\intertext }[1]{\text {#1}\notag \\}\) \(\let \Hat \hat \) \(\let \Check \check \) \(\let \Tilde \tilde \) \(\let \Acute \acute \) \(\let \Grave \grave \) \(\let \Dot \dot \) \(\let \Ddot \ddot \) \(\let \Breve \breve \) \(\let \Bar \bar \) \(\let \Vec \vec \) \(\newcommand {\bm }[1]{\boldsymbol {#1}}\) \(\require {physics}\) \(\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}\) \(\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}\) \(\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}\) \(\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}\) \(\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}\) \(\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}\) \(\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}\) \(\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}\) \(\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}\) \(\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}\) \(\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}\) \(\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}\) \(\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}\) \(\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}\) \(\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}\) \(\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}\) \(\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}\) \(\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}\) \(\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}\) \(\renewcommand 
{\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}\) \(\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}\) \(\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}\) \(\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}\) \(\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}\) \(\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}\) \(\require {cancel}\) \(\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}\) \(\DeclareMathOperator *{\argmax }{argmax}\) \(\DeclareMathOperator *{\argmin }{arg\,min}\) \(\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}\) \(\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}\) \(\def \Cov [#1]{\operatorname {Cov}\!\left [ #1 \right ]}\) \(\newcommand {\floor }[1]{\lfloor #1 \rfloor }\) \(\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}\) \(\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}\) \(\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}\) \(\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}\) \(\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}\) \(\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}\) \(\renewcommand {\real }{\mathbb {R}}\) \(\newcommand {\ba }{\mathbf {a}}\) \(\newcommand {\bb }{\mathbf {b}}\) \(\newcommand {\bd }{\mathbf {d}}\) \(\newcommand {\be }{\mathbf {e}}\) \(\newcommand {\bh }{\mathbf {h}}\) \(\newcommand {\bn }{\mathbf {n}}\) \(\newcommand {\bq }{\mathbf {q}}\) \(\newcommand {\br }{\mathbf {r}}\) \(\newcommand {\bt }{\mathbf {t}}\) \(\newcommand {\bv }{\mathbf {v}}\) \(\newcommand {\bw }{\mathbf {w}}\) \(\newcommand {\bx }{\mathbf {x}}\) \(\newcommand {\bxx }{\mathbf {xx}}\) \(\newcommand {\bxy }{\mathbf {xy}}\) \(\newcommand {\by }{\mathbf {y}}\) \(\newcommand {\byy }{\mathbf {yy}}\) \(\newcommand {\bz }{\mathbf {z}}\) \(\newcommand {\bA }{\mathbf {A}}\) \(\newcommand {\bB }{\mathbf {B}}\) \(\newcommand {\bI }{\mathbf {I}}\) \(\newcommand {\bK }{\mathbf {K}}\) \(\newcommand {\bP }{\mathbf {P}}\) \(\newcommand {\bQ }{\mathbf {Q}}\) \(\newcommand {\bR }{\mathbf {R}}\) \(\newcommand {\bU }{\mathbf {U}}\) \(\newcommand {\bW }{\mathbf {W}}\) \(\newcommand {\bX }{\mathbf {X}}\) \(\newcommand {\bY }{\mathbf {Y}}\) \(\newcommand {\bZ }{\mathbf {Z}}\) \(\newcommand {\balpha }{\bm {\alpha }}\) \(\newcommand {\bth }{{\bm {\theta }}}\) \(\newcommand {\bepsilon }{{\bm {\epsilon }}}\) \(\newcommand {\bmu }{{\bm {\mu }}}\) \(\newcommand {\bphi }{\bm {\phi }}\) \(\newcommand {\bOne }{\mathbf {1}}\) \(\newcommand {\bZero }{\mathbf {0}}\) \(\newcommand {\btx }{\tilde {\bx }}\) \(\newcommand {\loss }{\mathcal {L}}\) \(\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}\) \(\newcommand {\SSE }{\mathrm {SSE}}\) \(\newcommand {\MSE }{\mathrm {MSE}}\) \(\newcommand {\RMSE }{\mathrm {RMSE}}\) \(\newcommand {\toprule }[1][]{\hline }\) \(\let \midrule \toprule \) \(\let \bottomrule \toprule \) \(\def \LWRbooktabscmidruleparen (#1)#2{}\) \(\newcommand {\LWRbooktabscmidrulenoparen }[1]{}\) \(\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }\) \(\newcommand {\morecmidrules }{}\) \(\newcommand {\specialrule }[3]{\hline }\) \(\newcommand {\addlinespace }[1][]{}\) \(\newcommand {\LWRsubmultirow }[2][]{#2}\) \(\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }\) \(\newcommand {\multirow }[2][]{\LWRmultirow }\) \(\newcommand 
{\mrowcell }{}\) \(\newcommand {\mcolrowcell }{}\) \(\newcommand {\STneed }[1]{}\) \(\newcommand {\tcbset }[1]{}\) \(\newcommand {\tcbsetforeverylayer }[1]{}\) \(\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}\) \(\newcommand {\tcboxfit }[2][]{\boxed {#2}}\) \(\newcommand {\tcblower }{}\) \(\newcommand {\tcbline }{}\) \(\newcommand {\tcbtitle }{}\) \(\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}\) \(\newcommand {\tcboxmath }[2][]{\boxed {#2}}\) \(\newcommand {\tcbhighmath }[2][]{\boxed {#2}}\)

9 Feature Engineering and Selection

  • Goal: Working with features:

    • Feature engineering (FE) transforms raw features into representations that improve model performance. It extends existing features, \(M\times N\rightarrow M\times N'\), such that
      \(N<N'\) or \(N\ll N'\).

    • Feature selection (FS) selects a subset of relevant features, \(M\times N\rightarrow M\times N'\), such that \(N>N'\) or \(N\gg N'\).

9.1 Feature Engineering

Feature engineering transforms raw variables into representations that are easier for a model to use. Instead of only removing columns, it creates, reformulates, or encodes features so that useful patterns become more explicit. Common goals are to expose non-linear relationships and better match the limitations of a learning algorithm.

Feature engineering and feature selection serve different purposes. Feature engineering creates new or improved representations, while feature selection removes irrelevant or redundant ones. In a typical pipeline, feature engineering is applied first, followed by feature selection.

9.1.1 Feature Transformations
  • Goal: Transform numerical features into representations that better match the structure of the problem and the assumptions of the model.

Many models are easiest to train when the relationship between the features and the target is relatively simple. Feature transformations apply a mapping \(x \mapsto \phi (x)\) that can linearize trends, expose interactions, or reduce the influence of extreme values.

Polynomial features

Polynomial features generate interaction and higher-order terms from the original features. For \(N\) features and degree \(d\), the transformation produces all monomials up to degree \(d\). This allows a linear model to represent non-linear relationships in the original input space. It is closely related to the polynomial kernel approach (see Sec. 4.1 and Kernels in chapter 5).

  • Example 9.1: For two features \(x_1, x_2\) with degree \(d=2\), the expanded feature set is:

    \begin{equation} [x_1, x_2] \rightarrow [1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2]. \end{equation}

    The dimensionality increases from \(N=2\) to \(N'=6\) (including the bias term).

The main advantage of polynomial features is that they let a linear model fit non-linear patterns and feature interactions without changing the learner itself. Their main drawback is dimensionality growth: the number of terms scales as \(\binom {N+d}{d}\), so both training cost and the risk of overfitting grow rapidly with \(N\) or \(d\).
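The expansion of Example 9.1 can be sketched with scikit-learn's PolynomialFeatures; the array X below is an illustrative placeholder, not data from the text:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    # Two features x1, x2 for three samples (illustrative values).
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

    # Degree-2 expansion with a bias column: [1, x1, x2, x1^2, x1*x2, x2^2].
    poly = PolynomialFeatures(degree=2, include_bias=True)
    X_poly = poly.fit_transform(X)

    print(X_poly.shape)                  # (3, 6): N = 2 grows to N' = 6
    print(poly.get_feature_names_out())  # names of the generated monomials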

Log transform applies \(x' = \log (x)\) (for \(x > 0\)) to reduce right-skewness and compress the range of features spanning several orders of magnitude. It is useful for variables such as income, population, or concentration, where multiplicative changes are often more meaningful than additive ones. The transform converts multiplicative relationships into additive ones. Its main strength is that it stabilizes variance and tames heavy-tailed distributions, which typically helps linear models. Its main limitation is that it is defined only for strictly positive inputs, so a shift such as \(\log (1+x)\) is required when zeros are present.
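A minimal NumPy sketch of the shifted log transform \(\log (1+x)\), using illustrative values that span several orders of magnitude:

    import numpy as np

    x = np.array([0.0, 10.0, 1_000.0, 1_000_000.0])  # heavy-tailed, includes a zero

    # log1p computes log(1 + x), which handles zeros that plain log(x) cannot.
    x_log = np.log1p(x)
    print(x_log)  # compressed range: [0.0, ~2.4, ~6.9, ~13.8]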

Binning (discretization) converts continuous features into categorical ones by partitioning the value range into intervals. For example, an age feature can be binned into groups: \([0,18)\), \([18,65)\), \([65,\infty )\). This can reduce the effect of outliers and capture non-linear step-like relationships when only coarse ranges are important. On the positive side, binning is robust to outliers and can expose coarse threshold-like effects that a linear model would otherwise miss. On the negative side, all information within a bin is discarded, and the bin edges introduce somewhat arbitrary discontinuities.
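A short NumPy sketch of the age-binning example; the sample ages and bin labels are illustrative:

    import numpy as np

    ages = np.array([5, 17, 18, 42, 64, 65, 80])

    # Edges 18 and 65 reproduce the bins [0, 18), [18, 65), [65, inf).
    bin_index = np.digitize(ages, bins=[18, 65])  # 0, 1, or 2 per sample
    labels = np.array(["child", "adult", "senior"])
    print(labels[bin_index])  # ['child' 'child' 'adult' 'adult' 'adult' 'senior' 'senior']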

9.1.2 Date and Time Features
  • Goal: Extract informative features from date and time variables to capture temporal patterns.

A raw datetime variable (e.g., a timestamp) usually carries little direct meaning for a learning algorithm. If it is used as a single number, the model may treat time as only increasing, while ignoring weekly, monthly, or daily patterns. Extracting its components creates features that expose cyclic and calendar-based structure:

  • Day of week (1–7): captures weekly seasonality (e.g., weekday vs. weekend behavior).

  • Day of month (1–31): captures monthly patterns (e.g., salary-related spending).

  • Month (1–12): captures annual seasonality (e.g., weather, holidays).

  • Hour of day (0–23): captures intra-day patterns (e.g., traffic peaks).

  • Quarter (1–4): captures business-cycle effects.

  • Is weekend (binary): a derived indicator for weekend days.

Because many of these features are cyclic (e.g., hour 23 is close to hour 0), encoding them as a single integer can mislead distance-based models. A common solution is cyclical encoding using sine and cosine:

\begin{equation} x_{\sin } = \sin \!\left (\frac {2\pi \, x}{P}\right ), \quad x_{\cos } = \cos \!\left (\frac {2\pi \, x}{P}\right ), \end{equation}

where \(P\) is the period (e.g., \(P=7\) for day of week, \(P=24\) for hour of day). This maps each cyclic value onto the unit circle, so that the first and last values in the cycle are adjacent.

  • Example 9.2: Given a timestamp 2024-12-20 14:30, the extracted features are:

    • Day of week: 5 (Friday)

    • Day of month: 20

    • Month: 12

    • Hour: 14

    • Is weekend: 0

Component extraction exposes weekly, monthly, and daily patterns that are invisible in a raw timestamp, and cyclical sine/cosine encoding preserves the natural proximity between the first and last values of a cycle. The cost is higher dimensionality, since cyclical encoding produces two features per component, and the choice of which components to extract depends on domain knowledge about the problem at hand.
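A sketch that reproduces Example 9.2 and adds the sine/cosine encoding of the hour component, using only the Python standard library and NumPy:

    from datetime import datetime
    import numpy as np

    t = datetime(2024, 12, 20, 14, 30)

    features = {
        "day_of_week": t.isoweekday(),           # 1-7, here 5 (Friday)
        "day_of_month": t.day,                   # 20
        "month": t.month,                        # 12
        "hour": t.hour,                          # 14
        "is_weekend": int(t.isoweekday() >= 6),  # 0
    }

    # Cyclical encoding of the hour with period P = 24.
    P = 24
    features["hour_sin"] = np.sin(2 * np.pi * t.hour / P)
    features["hour_cos"] = np.cos(2 * np.pi * t.hour / P)

    print(features)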

9.2 Feature Selection Concept

  • Goal: Select a subset of relevant features to reduce dimensionality and improve model performance.

Feature selection chooses a subset of the original features. Given a dataset

\begin{equation} X \in \mathbb {R}^{M\times N}, \end{equation}

it constructs a reduced representation

\begin{equation} X' \in \mathbb {R}^{M\times N'}, \quad N' < N, \end{equation}

by keeping only the most informative columns. The objective is not only to reduce dimensionality, but to preserve or improve predictive performance on unseen data.

Why perform feature selection?

  • Some features are irrelevant or noisy and do not contribute useful predictive information.

  • Some features are redundant, for example when two features are highly correlated and carry nearly the same information.

  • In high-dimensional problems, unnecessary features can increase overfitting and reduce generalization.

  • Each feature may have a cost: measurement, storage, preprocessing, and model-training complexity.

  • A smaller feature set often improves interpretability by making the model easier to analyze.

Feature selection also introduces a tradeoff. Removing useful features may degrade performance, so the goal is not to find the smallest possible subset, but rather the smallest informative subset according to a chosen criterion.

Main families of methods. Feature selection methods are commonly divided into three groups:

  • Filter methods score features using statistical properties of the data, independently of the final predictor. In this chapter, variance threshold and ANOVA are examples.

  • Wrapper methods evaluate subsets by training and testing a predictor. The classifier-based greedy search methods in the next section belong to this family.

  • Embedded methods perform feature selection as part of model training itself. Tree-based feature importance is an example of this idea.

In addition, some model-comparison criteria explicitly penalize the number of features or parameters. An example of a feature-number-aware comparison metric is adjusted \(R^2\) in Sec. 9.5.

  • Example 9.3: Suppose two sensors measure nearly the same physical quantity, producing features \(x_1\) and \(x_2\), while a third feature \(x_3\) is mostly random noise. A useful feature-selection method may keep only \(x_1\), discard \(x_2\) as redundant, and discard \(x_3\) as irrelevant. The dimensionality decreases from \(N=3\) to \(N'=1\) with little or no loss in predictive information.

Search complexity. Exhaustive search over all non-empty subsets of \(N\) features requires

\begin{equation} \sum _{k=1}^{N}\binom {N}{k} = 2^N - 1 \end{equation}

candidate subsets. This becomes computationally prohibitive even for moderate \(N\). For example, for \(N=15\) features there are

\begin{equation} 2^{15}-1 = 32767 \end{equation}

possible subsets to examine.

Fixed subset size. When the target size \(N'\) is fixed in advance, the search is restricted to subsets of exactly that size, costing \(\binom {N}{N'}\) model fits. For \(N=15\) and \(N'=3\) this gives \(\binom {15}{3}=455\) candidate subsets, compared with \(2^{15}-1=32767\) for the full exhaustive search. This approach is guaranteed to find the best subset of size \(N'\), but its cost grows combinatorially once \(N'\) is no longer very small relative to \(N\).
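These counts can be checked directly with the Python standard library (math.comb and itertools.combinations); the snippet below only illustrates the subset bookkeeping, not the model fitting:

    from itertools import combinations
    from math import comb

    N, N_prime = 15, 3
    print(2**N - 1)          # 32767 subsets examined by exhaustive search
    print(comb(N, N_prime))  # 455 subsets of size exactly 3

    # Enumerating the size-3 subsets that a fixed-size exhaustive search would evaluate:
    subsets = list(combinations(range(N), N_prime))
    print(len(subsets))      # 455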

Feature selection must be learned using training data only. If features are selected before the train/test split, or before each cross-validation fold is formed, information from the validation or test set leaks into the model-design process. This leads to overly optimistic performance estimates.

9.3 Statistical-Based

There are also statistical FS methods that score features from the data alone, independently of any particular classifier.

9.3.1 Variance Threshold
  • Goal: Eliminate features with little or no variation across samples.

The variance threshold method removes features whose variance falls below a specified threshold, on the premise that a feature which barely changes across samples cannot help discriminate between them. The typical use case is dropping features with zero (or near-zero) variance.

Variance threshold: For a given feature, let \(M\) be the number of samples, \(x_i\) the \(i\)-th sample, \(\bar {x}\) the feature mean, and \(s_x^2 = \frac {1}{M-1}\sum _{i=1}^{M}(x_i-\bar {x})^2\) the sample variance. Features with \(s_x^2 < \tau \) are removed, where \(\tau \) is the threshold parameter. Setting \(\tau = 0\) drops only exactly constant columns, but floating-point arithmetic during data generation or preprocessing may leave a nominally constant feature with a tiny nonzero variance (on the order of \(10^{-15}\)). A small positive \(\tau \) (e.g., \(10^{-8}\)) is therefore used in practice to also remove such numerically constant columns.

Variance threshold is unsupervised: it does not consult the target variable.
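A minimal NumPy sketch of the rule \(s_x^2 < \tau \); the data matrix X is illustrative, and scikit-learn's VarianceThreshold transformer implements the same idea:

    import numpy as np

    # Columns: informative, numerically constant, exactly constant (illustrative data).
    X = np.array([[1.0, 2.0 + 1e-15, 7.0],
                  [3.0, 2.0,         7.0],
                  [5.0, 2.0 - 1e-15, 7.0]])

    tau = 1e-8
    variances = X.var(axis=0, ddof=1)   # sample variance s_x^2 per feature
    keep = variances >= tau
    X_reduced = X[:, keep]

    print(variances)        # approx. [4.0, 1e-30, 0.0]
    print(X_reduced.shape)  # (3, 1): only the first column survives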

9.3.2 ANOVA
  • Goal: Rank features by how well they separate the classes, using only per-class statistics.

Intuition

Analysis of Variance (ANOVA) asks a single question for each feature: do the classes differ from each other by more than they differ internally?

A feature is informative when the class means are far apart (large between-class variation) relative to the spread of samples inside each class (small within-class variation). If the two are comparable, the apparent gap between class means could be explained by noise alone, and the feature carries little discriminative signal.

This comparison is summarized by a single number, the F-statistic: larger \(F\) means the feature separates the classes more cleanly than random fluctuation would suggest. Features are then ranked by \(F\) and the top-scoring ones are kept (Fig. 9.1).

(image)

Figure 9.1: Illustration of ANOVA feature selection.
Calculation

For a given feature, let:

  • \(K\) be the number of classes,

  • \(M_k\) the number of samples in class \(k\),

  • \(x_{k,i}\) the \(i\)-th sample in class \(k\),

  • \(\bar {x}_k\) the mean of class \(k\), and

  • \(\bar {x}\) the global mean.

The between-class sum of squares measures how much class means deviate from the global mean:

\begin{equation} SS_B = \sum _{k=1}^{K} M_k (\bar {x}_k - \bar {x})^2. \end{equation}

The within-class sum of squares measures variability within each class:

\begin{equation} SS_W = \sum _{k=1}^{K} \sum _{i=1}^{M_k} (x_{k,i} - \bar {x}_k)^2. \end{equation}

The F-statistic is the ratio of mean squares:

\begin{equation} F = \frac {SS_B / (K - 1)}{SS_W / (M - K)} = \frac {MS_B}{MS_W}, \end{equation}

where \(M = \sum _{k=1}^{K} M_k\) is the total number of samples, and \(K-1\) and \(M-K\) are the corresponding degrees of freedom that normalize each sum of squares. ANOVA assumes features are independent and normally distributed within each class.

  • Example 9.4: Consider \(K=2\) classes with \(M_1 = M_2 = 3\) samples each, and two candidate features:

    Feature 1 (good separation): Class 1: \(\{2, 4, 6\}\), Class 2: \(\{8, 10, 12\}\).

    • Class means: \(\bar {x}_1 = 4\), \(\bar {x}_2 = 10\), global mean \(\bar {x} = 7\).

    • \(SS_B = 3(4-7)^2 + 3(10-7)^2 = 27 + 27 = 54\).

    • \(SS_W = (2-4)^2 + (4-4)^2 + (6-4)^2 + (8-10)^2 + (10-10)^2 + (12-10)^2 = 16\).

    • \(F = \frac {54/1}{16/4} = \frac {54}{4} = 13.5\).

    Feature 2 (poor separation): Class 1: \(\{5, 6, 7\}\), Class 2: \(\{4, 6, 8\}\).

    • Class means: \(\bar {x}_1 = 6\), \(\bar {x}_2 = 6\), global mean \(\bar {x} = 6\).

    • \(SS_B = 3(6-6)^2 + 3(6-6)^2 = 0\).

    • \(SS_W = 1 + 0 + 1 + 4 + 0 + 4 = 10\).

    • \(F = 0\).

    Feature 1 (\(F = 13.5\)) is selected over Feature 2 (\(F = 0\)) since it better separates the classes.
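A short NumPy sketch of the F-statistic computation, verified against the numbers of Example 9.4; the helper function anova_f is an illustrative implementation, not a library routine:

    import numpy as np

    def anova_f(groups):
        """F-statistic for one feature, given a list of per-class sample arrays."""
        x = np.concatenate(groups)
        grand_mean = x.mean()
        K, M = len(groups), x.size
        ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between-class
        ss_w = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within-class
        return (ss_b / (K - 1)) / (ss_w / (M - K))

    feat1 = [np.array([2.0, 4.0, 6.0]), np.array([8.0, 10.0, 12.0])]
    feat2 = [np.array([5.0, 6.0, 7.0]), np.array([4.0, 6.0, 8.0])]

    print(anova_f(feat1))  # 13.5
    print(anova_f(feat2))  # 0.0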

Pros:

  • Fast computation: closed-form solution with no model training required.

  • Classifier-independent: can be used as a preprocessing step for any model.

Cons:

  • Assumes linear separability: only measures differences in means, not complex relationships.

  • Evaluates features independently: does not capture feature interactions.

  • Sensitive to assumptions: assumes normally distributed data within classes.

9.4 Classifier-Based

  • Goal: Evaluate each feature by its contribution to a trained classifier’s performance.

Greedy search builds a feature subset incrementally, making the locally optimal choice at each step. It does not guarantee a globally optimal subset, but it reduces the search cost from exponential to polynomial in \(N\).

Forward feature selection starts from an empty set and, at each step, adds the feature whose inclusion most improves model performance, measured by cross-validated accuracy or another evaluation metric. The process stops when no remaining feature improves performance or when a predetermined number \(N'\) of features is reached.

Backward feature selection starts from the full feature set and, at each step, removes the feature whose removal causes the smallest drop (or the largest gain) in performance. The process stops according to the same kinds of criteria as forward selection.

Forward selection requires about \(N \cdot N'\) model fits, and backward selection about \(N \cdot (N - N')\), both far fewer than the \(2^{N}-1\) fits of exhaustive search. Forward selection is therefore cheaper when the target subset is small (\(N' \ll N\)), while backward selection is preferred when only a few features need to be removed.
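A hedged scikit-learn sketch of greedy forward selection with cross-validated accuracy; the synthetic dataset, logistic-regression estimator, and target size \(N'=3\) are illustrative choices, not part of the text:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    # Synthetic data: 5 informative features out of 15 (illustrative).
    X, y = make_classification(n_samples=300, n_features=15,
                               n_informative=5, random_state=0)

    # Forward greedy search: add the feature that most improves CV accuracy,
    # until N' = 3 features are selected. direction="backward" gives backward elimination.
    sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                    n_features_to_select=3,
                                    direction="forward",
                                    scoring="accuracy", cv=5)
    sfs.fit(X, y)
    print(np.flatnonzero(sfs.get_support()))  # indices of the selected features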

Pros:

  • Classifier-tailored: feature relevance is evaluated in the context of the actual model.

  • Captures feature interactions that affect model performance.

Cons:

  • May converge to a local optimum due to its greedy nature.

  • Computationally expensive: requires retraining the classifier for each candidate subset.

  • Hyperparameter tuning may be needed at every subset size.

9.5 Adjusted Coefficient of Determination

  • Goal: Provide a comparison metric for multivariate linear models that accounts for different numbers or subsets of features.

While Section 2.3.1 discusses the standard \(R^2\) metric, the regular coefficient of determination has a critical flaw in multivariate regression: it mathematically cannot decrease when new features are added to the model, even if those features are entirely uninformative.

To address this, the adjusted \(R^2\) (\(\bar {R}^2\)) is introduced. It uses a penalty term based on the number of features \(N\) relative to the number of samples \(M\). It is given by:

\begin{equation} \bar {R}^2 = 1 - \left (1-R^2\right )\frac {M-1}{M-N-1} \end{equation}

The primary advantage of \(\bar {R}^2\) is that it penalizes the model for adding irrelevant variables. It will only increase if a new feature improves the model’s predictive power more than would be expected by chance.
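A minimal sketch of \(\bar {R}^2\) as a plain function, assuming \(R^2\) has already been computed (e.g., with sklearn.metrics.r2_score); the numbers are illustrative:

    def adjusted_r2(r2, M, N):
        """Adjusted coefficient of determination for M samples and N features."""
        return 1.0 - (1.0 - r2) * (M - 1) / (M - N - 1)

    # Same R^2 = 0.85, but more features: the adjusted value drops as N grows.
    print(adjusted_r2(0.85, M=100, N=5))   # ~0.842
    print(adjusted_r2(0.85, M=100, N=40))  # ~0.748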

Further Reading