Machine Learning & Signals Learning


11 Feature Engineering and Selection

  • Goal: Working with features:

    • Feature engineering (FE) transforms raw features into representations that improve model performance.

    • Feature selection (FS) selects a subset of relevant features, \(M\times N\rightarrow M\times N'\), such that \(N>N'\) or \(N\gg N'\).

11.1 Feature Engineering

Feature engineering creates or transforms features to improve model performance, while feature selection chooses a subset from existing ones. In a typical pipeline, feature engineering is applied first, followed by feature selection.

11.1.1 Feature Transformations
  • Goal: Create new features from existing ones to capture non-linear relationships.

Polynomial features generate interaction and higher-order terms from the original features. For \(N\) features and degree \(d\), the transformation produces all monomials up to degree \(d\). This is closely related to the polynomial kernel approach (see Kernels chapter).

  • Example 11.1: For two features \(x_1, x_2\) with degree \(d=2\), the expanded feature set is:

    \begin{equation} [x_1, x_2] \rightarrow [1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2]. \end{equation}

    The dimensionality increases from \(N=2\) to \(N'=6\) (including the bias term).
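
A minimal sketch of this expansion, assuming scikit-learn is available (PolynomialFeatures reproduces Example 11.1):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[3.0, 5.0]])  # one sample with x1 = 3, x2 = 5

    # degree-2 expansion including the bias term, as in Example 11.1
    poly = PolynomialFeatures(degree=2, include_bias=True)
    X_poly = poly.fit_transform(X)

    print(poly.get_feature_names_out())  # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
    print(X_poly)                        # [[ 1.  3.  5.  9. 15. 25.]]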

Log transform applies \(x' = \log (x)\) (for \(x > 0\)) to reduce right-skewness and compress the range of features spanning several orders of magnitude. It converts multiplicative relationships into additive ones.

Binning (discretization) converts continuous features into categorical ones by partitioning the value range into intervals. For example, an age feature can be binned into groups: \([0,18)\), \([18,65)\), \([65,\infty )\). This can reduce the effect of outliers and capture non-linear step-like relationships.
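
Both transforms are short in practice; a sketch with NumPy and pandas (the bin labels are illustrative):

    import numpy as np
    import pandas as pd

    income = np.array([1e3, 1e4, 1e5, 1e6])  # spans several orders of magnitude
    log_income = np.log(income)               # valid since all values are > 0

    age = pd.Series([5, 17, 30, 64, 80])
    # bins [0,18), [18,65), [65,inf) as above; right=False closes intervals
    # on the left and opens them on the right
    age_group = pd.cut(age, bins=[0, 18, 65, np.inf], right=False,
                       labels=["child", "adult", "senior"])
    print(age_group.tolist())  # ['child', 'child', 'adult', 'adult', 'senior']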

Pros:

  • Polynomial features enable linear models to fit non-linear patterns.

  • Log transform stabilizes variance and reduces the influence of outliers.

  • Binning is robust to outliers and can simplify the model.

Cons:

  • Polynomial features cause rapid dimensionality growth: \(\binom {N+d}{d}\) terms for degree \(d\).

  • Log transform is only applicable to strictly positive values.

  • Binning discards information within each bin.

11.1.2 Date and Time Features
  • Goal: Extract informative features from date and time variables to capture temporal patterns.

A single datetime variable (e.g., a timestamp) carries no direct numerical meaning for most models. Extracting its components creates features that expose cyclic and calendar-based patterns:

  • Day of week (1–7): captures weekly seasonality (e.g., weekday vs. weekend behavior).

  • Day of month (1–31): captures monthly patterns (e.g., salary-related spending).

  • Month (1–12): captures annual seasonality (e.g., weather, holidays).

  • Hour of day (0–23): captures intra-day patterns (e.g., traffic peaks).

  • Quarter (1–4): captures business-cycle effects.

  • Is weekend (binary): a derived indicator for weekend days.

Because many of these features are cyclic (e.g., hour 23 is close to hour 0), encoding them as a single integer can mislead distance-based models. A common solution is cyclical encoding using sine and cosine:

\begin{equation} x_{\sin } = \sin \!\left (\frac {2\pi \, x}{P}\right ), \quad x_{\cos } = \cos \!\left (\frac {2\pi \, x}{P}\right ), \end{equation}

where \(P\) is the period (e.g., \(P=7\) for day of week, \(P=24\) for hour of day). This maps each cyclic value onto the unit circle, so that the first and last values in the cycle are adjacent.
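
A sketch of cyclical encoding for the hour of day (\(P=24\)); the encoded distance between hour 23 and hour 0 is small, unlike the raw integer gap:

    import numpy as np

    hour = np.array([0, 6, 12, 23])
    P = 24  # period of the cycle

    hour_sin = np.sin(2 * np.pi * hour / P)
    hour_cos = np.cos(2 * np.pi * hour / P)

    # Euclidean distance between hour 23 and hour 0 on the unit circle
    d = np.hypot(hour_sin[3] - hour_sin[0], hour_cos[3] - hour_cos[0])
    print(round(d, 3))  # 0.261, versus a raw gap of |23 - 0| = 23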

  • Example 11.2: Given a timestamp 2024-12-20 14:30, the extracted features are:

    Component       Value
    ---------------------------
    Day of week     5 (Friday)
    Day of month    20
    Month           12
    Hour            14
    Is weekend      0
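
The same components can be read off with Python's standard datetime module; a minimal sketch reproducing Example 11.2:

    from datetime import datetime

    ts = datetime(2024, 12, 20, 14, 30)

    features = {
        "day_of_week": ts.isoweekday(),           # 1=Mon ... 7=Sun -> 5 (Friday)
        "day_of_month": ts.day,                   # 20
        "month": ts.month,                        # 12
        "hour": ts.hour,                          # 14
        "quarter": (ts.month - 1) // 3 + 1,       # 4
        "is_weekend": int(ts.isoweekday() >= 6),  # 0
    }
    print(features)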

Pros:

  • Exposes temporal patterns that are invisible in raw timestamps.

  • Cyclical encoding preserves the natural proximity of cyclic values.

Cons:

  • Increases dimensionality, especially with cyclical encoding (two features per component).

  • Requires domain knowledge to choose which components are relevant.

11.2 Feature Selection

  • Goal: Select a subset of relevant features to reduce dimensionality and improve model performance.

Motivation:

  • Some features are redundant, i.e., very similar to (or highly correlated with) other features.

  • Some features are noise, i.e., they do not contribute to the performance of a model.

  • Computing each feature carries its own (possibly significant) computational cost.

  • Fewer features enhance generalization by reducing overfitting.

  • Fewer features yield shorter training times and lower computational cost.

Exhaustive search over all possible feature combinations is typically computationally prohibitive. For example, checking every non-empty subset of \(N=15\) features requires examining \(\sum _{k=1}^{N}\binom {N}{k} = 2^N - 1 = 32767\) combinations.
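
This count follows from the binomial theorem, \(\sum _{k=0}^{N}\binom {N}{k} = 2^N\), so excluding the empty set leaves \(2^N-1\) subsets. A quick check in Python:

    from math import comb

    N = 15
    n_subsets = sum(comb(N, k) for k in range(1, N + 1))
    print(n_subsets, 2**N - 1)  # 32767 32767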

All practical FS methods use a sub-optimal search that significantly reduces the number of examined combinations. When comparing subsets of different sizes, a feature-count-aware metric is useful; an example is the adjusted \(R^2\) in Sec. 3.2.2.

11.3 Classifier-Based

  • Goal: Evaluate each feature by its effect on the performance of the actual classifier.

Greedy search

Greedy search methods build a feature subset incrementally by making locally optimal choices at each step. While they do not guarantee a globally optimal solution, they significantly reduce computational complexity from exponential to polynomial.

Forward feature selection starts with an empty feature set and iteratively adds the feature that most improves model performance. At each step, every remaining feature is evaluated, and the one yielding the best performance gain is added. This process continues until adding more features no longer improves performance or a predetermined number of features is reached.

Backward feature selection begins with a model trained on all available features. At each step, it removes the feature whose elimination yields the best-performing remaining model according to the chosen evaluation metric, i.e., the least significant feature. This process is repeated, eliminating one feature at a time, until a predetermined criterion is satisfied.

Forward selection is computationally cheaper when the target subset is small (\(N' \ll N\)), while backward selection is preferred when only a few features need to be removed. Forward selection requires \(\mathcal {O}(N \cdot N')\) model evaluations and backward selection \(\mathcal {O}(N (N - N'))\), compared to \(\mathcal {O}(2^N)\) for exhaustive search.
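
A sketch of both directions, assuming scikit-learn's SequentialFeatureSelector (the estimator and dataset are placeholders):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000)

    # forward: start empty, greedily add the best feature at each step
    fwd = SequentialFeatureSelector(clf, n_features_to_select=2,
                                    direction="forward").fit(X, y)
    # backward: start with all features, greedily drop the least useful one
    bwd = SequentialFeatureSelector(clf, n_features_to_select=2,
                                    direction="backward").fit(X, y)

    print(fwd.get_support())  # boolean mask over the original features
    print(bwd.get_support())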

Pros:

  • Classifier-tailored: feature relevance is evaluated in the context of the actual model.

  • Captures feature interactions that affect model performance.

Cons:

  • May converge to a local optimum due to greedy nature.

  • Computationally expensive: requires retraining the classifier for each candidate subset.

  • Hyperparameter tuning may be needed for each candidate subset or subset size.

Constant number of selected features. Fixing the subset size in advance makes exhaustive search feasible: for a pre-defined size of \(N'=3\) out of \(N=15\) features, only \(\binom {N}{N'} = 455\) feature combinations need to be examined, as in the sketch below.
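
A sketch of such a fixed-size exhaustive search with itertools (cv_score is a hypothetical scoring function, e.g., cross-validated accuracy):

    from itertools import combinations
    from math import comb

    N, N_prime = 15, 3
    subsets = list(combinations(range(N), N_prime))
    print(len(subsets), comb(N, N_prime))  # 455 455

    # each subset is then scored and the best one kept, e.g.:
    # best = max(subsets, key=lambda s: cv_score(X[:, list(s)], y))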

11.4 Statistical-Based

There are also statistical FS methods that do not depend on a particular classifier.

11.4.1 Variance Threshold
  • Goal: Eliminate features based on their statistical properties.

Variance threshold removes features with variance below a specified threshold. For a given feature, let \(M\) be the number of samples, \(x_i\) the \(i\)-th sample, and \(\bar {x}\) the mean. The variance is:

\begin{equation} \Var [x] = \frac {1}{M} \sum _{i=1}^{M} (x_i - \bar {x})^2. \end{equation}

Features with \(\Var [x] < \tau \) are removed, where \(\tau \) is the threshold parameter.

This method is useful for removing constant or near-constant features that provide no discriminative information. A common use case is removing features with zero variance (all identical values).

Note that variance threshold is unsupervised and does not consider the target variable.
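
A minimal sketch using scikit-learn's VarianceThreshold; the constant third column is removed:

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    X = np.array([[1.0, 2.0, 7.0],
                  [3.0, 2.1, 7.0],
                  [5.0, 1.9, 7.0]])

    # tau = 0 keeps only features with strictly positive variance,
    # i.e., removes constant columns
    selector = VarianceThreshold(threshold=0.0)
    X_reduced = selector.fit_transform(X)
    print(X_reduced.shape)      # (3, 2)
    print(selector.variances_)  # per-feature variances (1/M, as in the formula)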

11.4.2 ANOVA
  • Goal: Eliminate features based on their statistical relationship with the class labels.

Analysis of Variance (ANOVA) evaluates feature relevance by comparing the variance between class means to the variance within classes. For a given feature, let:

  • \(K\) be the number of classes,

  • \(M_k\) the number of samples in class \(k\),

  • \(x_{k,i}\) the \(i\)-th sample in class \(k\),

  • \(\bar {x}_k\) the mean of class \(k\), and

  • \(\bar {x}\) the global mean.

The between-class sum of squares measures how much class means deviate from the global mean:

\begin{equation} SS_B = \sum _{k=1}^{K} M_k (\bar {x}_k - \bar {x})^2. \end{equation}

The within-class sum of squares measures variability within each class:

\begin{equation} SS_W = \sum _{k=1}^{K} \sum _{i=1}^{M_k} (x_{k,i} - \bar {x}_k)^2. \end{equation}

The F-statistic is the ratio of mean squares:

\begin{equation} F = \frac {SS_B / (K - 1)}{SS_W / (M - K)} = \frac {MS_B}{MS_W}, \end{equation}

where \(M = \sum _{k=1}^{K} M_k\) is the total number of samples.

A higher F-value indicates that the feature better separates the classes. Features are ranked by their F-scores, and those with the highest scores are selected. ANOVA assumes features are independent and normally distributed within each class.


Figure 11.1: Illustration of ANOVA FS.
  • Example 11.3: Consider \(K=2\) classes with \(M_1 = M_2 = 3\) samples each, and two candidate features:

    Feature 1 (good separation): Class 1: \(\{2, 4, 6\}\), Class 2: \(\{8, 10, 12\}\).

    • Class means: \(\bar {x}_1 = 4\), \(\bar {x}_2 = 10\), global mean \(\bar {x} = 7\).

    • \(SS_B = 3(4-7)^2 + 3(10-7)^2 = 27 + 27 = 54\).

    • \(SS_W = (2-4)^2 + (4-4)^2 + (6-4)^2 + (8-10)^2 + (10-10)^2 + (12-10)^2 = 16\).

    • \(F = \frac {54/1}{16/4} = \frac {54}{4} = 13.5\).

    Feature 2 (poor separation): Class 1: \(\{5, 6, 7\}\), Class 2: \(\{4, 6, 8\}\).

    • Class means: \(\bar {x}_1 = 6\), \(\bar {x}_2 = 6\), global mean \(\bar {x} = 6\).

    • \(SS_B = 3(6-6)^2 + 3(6-6)^2 = 0\).

    • \(SS_W = 1 + 0 + 1 + 4 + 0 + 4 = 10\).

    • \(F = 0\).

    Feature 1 (\(F = 13.5\)) is selected over Feature 2 (\(F = 0\)) since it better separates the classes.
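
The F-values of Example 11.3 can be reproduced with scikit-learn's f_classif, which computes this one-way ANOVA F-statistic per feature:

    import numpy as np
    from sklearn.feature_selection import f_classif

    # rows are samples; columns are Feature 1 and Feature 2 of Example 11.3
    X = np.array([[ 2, 5],
                  [ 4, 6],
                  [ 6, 7],
                  [ 8, 4],
                  [10, 6],
                  [12, 8]])
    y = np.array([0, 0, 0, 1, 1, 1])  # class labels

    F, p = f_classif(X, y)
    print(F)  # [13.5  0. ]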

Pros:

  • Fast computation: closed-form solution with no model training required.

  • Classifier-independent: can be used as a preprocessing step for any model.

Cons:

  • Assumes linear separability: only measures differences in means, not complex relationships.

  • Evaluates features independently: does not capture feature interactions.

  • Sensitive to assumptions: assumes normally distributed data within classes.

11.5 Tree-Based Feature Importance

Decision tree models provide built-in feature importance scores based on how much each feature contributes to reducing impurity (e.g., Gini impurity or entropy) across all splits in the tree.
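
A minimal sketch extracting impurity-based importances from a fitted tree (the dataset is a placeholder):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    data = load_iris()
    tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

    # total Gini-impurity reduction attributed to each feature,
    # normalized to sum to 1
    for name, imp in zip(data.feature_names, tree.feature_importances_):
        print(f"{name}: {imp:.3f}")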

Pros:

  • Fast computation: importance is a byproduct of model training.

  • Captures non-linear relationships and feature interactions.

Cons:

  • Biased toward high-cardinality features (many unique values).

  • Importance is split among correlated features, underestimating individual relevance.