Machine Learning & Signals Learning
Part III Signal Classification
24 Signal Feature-Extraction Pipeline
24.1 Preface
Classical ML operates on fixed-length feature vectors, but signals are long and often of variable length. The feature-extraction (FE) pipeline bridges this gap by reducing each signal (or each \(L\)-sample window, when the signal is long or continuous) to a single \(N\)-dimensional feature vector (\(N\ll L\)), producing a tabular dataset that any regressor or classifier from earlier chapters can use. Here we focus on the surrounding pipeline choices.
24.2 Workflow
FE from signals is summarized in Fig. 24.1. Starting from a raw signal, the pipeline proceeds through the following stages, in the order shown. The stages are grouped as follows:
Signal preprocessing (optional).
-
1. Signal transformation (Sec. 24.3): a point-wise mapping reshapes each signal’s distribution (log, Box–Cox, etc.).
-
2. Windowing (Sec. 24.4): long or continuous signals are split into \(M\) windows of \(L\) samples each (non-overlapping or overlapping).
-
3. Train/test split (Sec. 24.5): signals (or windows) are partitioned into training and test sets before feature extraction, so that the leakage-sensitive downstream steps (feature selection, normalization) never see test data. Feature extraction itself is per-signal and does not leak; the split is placed here only to set up those later steps.
-
4. Feature extraction (Sec. 24.6): each window (or whole signal) is mapped to an \(N\)-dimensional feature vector via \(f:\real ^L\to \real ^N\). Stacking the \(M\) feature vectors row-wise yields the \(M\times N\) tabular dataset.
Downstream ML (covered in earlier chapters of the book).
-
5. Feature selection (Sec. 9.4): retain the subset of the \(N\) features most informative for the task.
-
6. Normalization (Sec. 4.7): rescale the retained features (e.g. standardization) with parameters fit on the training set only.
-
7. ML model: fit a regressor or classifier on the normalized training features and evaluate on the held-out test set, producing the final outputs (predictions, class labels, performance metrics).
Some classifiers to be introduced in Ch. 25 already have several of these pipeline stages built in (e.g. internal feature selection, normalization, or end-to-end feature learning). When such a classifier is used, the corresponding stand-alone stages may be redundant and can be skipped.
24.3 Signal transformation (optional)
A signal transformation can produce either:
-
• a univariate signal from a univariate input (basic case);
-
• a multivariate set of signals from a univariate input, for further multivariate processing (advanced case).
Notes:
-
• When the transformation is applied to a regression target (univariate to univariate), it must be invertible so predictions can be mapped back to the original scale. For feature-generating transforms (univariate to multivariate) and for classification, invertibility is not required.
-
• Some transformations are restricted to positive signals, \(y[n]>0\,\forall n\).
Common transformations:
-
• Logarithmic transformation,
\(\seteqnumber{0}{}{0}\)\begin{equation} \tilde {y}[n] = \log (y[n]+1), \qquad y[n]\ge 0. \end{equation}
The \(+1\) shift avoids \(\log 0\); the transform requires non-negative input.
-
• Square root transformation,
\(\seteqnumber{0}{}{1}\)\begin{equation} \tilde {y}[n]=\sqrt {y[n]} \end{equation}
-
• Power transformation (fixed exponent \(m\)),
\(\seteqnumber{0}{}{2}\)\begin{equation} \tilde {y}[n]=y^m[n] \end{equation}
-
Example 24.1: A sensor signal \(y[n]\) with values in \([0,\,10{,}000]\) is highly right-skewed (Fig. 24.2). Applying \(\tilde {y}[n]=\log (y[n]+1)\) compresses the dynamic range to approximately \([0,\,9.21]\), which makes the distribution more symmetric and can improve downstream classifier performance.
In Python: sklearn.preprocessing.PowerTransformer implements the Box–Cox (strictly positive input only) and Yeo–Johnson transformations.
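The log transform of Example 24.1 and its inverse can be sketched with NumPy alone. The sample values below are illustrative assumptions, not the data behind Fig. 24.2. `np.log1p` computes \(\log (y+1)\) and `np.expm1` undoes it, which is the invertibility needed when the transformed signal is a regression target.

```python
import numpy as np

# Illustrative non-negative samples spanning the range of Example 24.1
y = np.array([0.0, 3.0, 50.0, 700.0, 10_000.0])

y_log = np.log1p(y)        # log(y + 1), defined for y >= 0
y_back = np.expm1(y_log)   # inverse transform: exp(y_log) - 1

# The transformed range runs from 0 to log(10001)
print(y_log.min(), y_log.max())
```

Mapping predictions back with `np.expm1` returns them to the original scale.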
24.4 Windowing (optional)
This stage is skipped when each input signal is already a short, fixed-length record. It is required when the input is a long recording or a continuous stream that must be segmented before FE.
Non-overlapping windows (Fig. 24.3a) are the standard choice. Overlapping windows (Fig. 24.3b) are generally not recommended: the resulting feature vectors become highly similar, which may reduce performance.
For overlapping windows, the overlap is controlled by the step size \(S\) (number of samples between consecutive window starts). The overlap ratio is
\(\seteqnumber{0}{}{3}\)\begin{equation} r = \frac {L - S}{L}, \end{equation}
where \(r=0\) corresponds to non-overlapping windows and \(r=0.5\) to 50% overlap.
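A minimal sketch of windowing with step size \(S\), assuming that trailing samples which do not fill a complete window are dropped (the text does not specify how the tail is handled). The helper names `split_into_windows` and `overlap_ratio` are hypothetical.

```python
import numpy as np

def split_into_windows(y, L, S):
    """Return an (M, L) array of windows whose starts are S samples apart.

    Trailing samples that do not fill a complete window are dropped.
    """
    y = np.asarray(y)
    if len(y) < L:
        raise ValueError("signal is shorter than one window")
    M = (len(y) - L) // S + 1          # number of complete windows
    starts = np.arange(M) * S          # first sample index of each window
    return np.stack([y[s:s + L] for s in starts])

def overlap_ratio(L, S):
    """Overlap ratio r = (L - S) / L."""
    return (L - S) / L

y = np.arange(10)
windows_no_overlap = split_into_windows(y, L=4, S=4)   # r = 0
windows_half = split_into_windows(y, L=4, S=2)         # r = 0.5
```

With \(S=L\) the windows tile the signal without overlap. With \(S=L/2\), consecutive rows share half of their samples.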
Feature redundancy
Overlapping windows produce highly correlated feature vectors: adjacent windows share most of their samples, so each new vector adds little information. The same overlap also opens a leakage path at the train/test split (Sec. 24.5).
Each window contains the same number of samples, \(L\). The value of \(L\) is chosen in one of the following ways:
-
• Field-related, e.g. 20–40 ms in acoustic and speech processing.
-
• Hand-picked by visual analysis.
-
• Treated as a hyper-parameter and tuned by search (least preferred, since no prior knowledge is used).
24.5 Train–test split in signals
Possible data leakage
Random train-test splitting, which is standard for i.i.d. tabular data, is invalid for time-series signals. Adjacent windows share temporal context (and samples, if overlapping), so random splitting causes data leakage and over-optimistic performance estimates. Train and test windows must never be adjacent.
24.5.1 Classification
When the goal is to classify signals (e.g. fault detection, speaker identification), the split must be done by source rather than by individual windows. Each source (e.g. speaker, device, or recording session) appears entirely in either the train or the test set, never both.
This is known as grouped CV (Sec. 4.5). It ensures the model is evaluated on truly unseen sources, not on different windows from a source it has already learned.
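A minimal NumPy sketch of a split by source, assuming each window carries a source label. The function name `group_train_test_split` and the speaker labels are hypothetical. scikit-learn offers ready-made equivalents such as `GroupShuffleSplit` and `GroupKFold`.

```python
import numpy as np

def group_train_test_split(groups, test_fraction=0.25, seed=0):
    """Return boolean (train, test) masks with every group on one side only."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    unique_groups = rng.permutation(np.unique(groups))
    n_test = max(1, int(round(test_fraction * len(unique_groups))))
    test_groups = unique_groups[:n_test]        # whole sources held out
    test_mask = np.isin(groups, test_groups)
    return ~test_mask, test_mask

# Hypothetical source labels: 4 speakers, 3 windows each
groups = np.repeat(["spk_a", "spk_b", "spk_c", "spk_d"], 3)
train_mask, test_mask = group_train_test_split(groups, test_fraction=0.25)
```

Because the randomness is applied to sources rather than to windows, no speaker contributes windows to both sets.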
24.5.2 Prediction
A temporal split is used: all data before a cutoff time \(t_c\) is used for training, and data after \(t_c\) for testing. The test set must always lie in the future relative to the training set.
Time-series cross-validation Standard \(k\)-fold CV, with or without shuffling, places some training folds after the test fold, violating temporal order. Instead, time-series CV (Fig. 24.4) uses an expanding (or sliding) window:
-
1. Start with a minimal training set of the earliest samples.
-
2. Train the model and evaluate on the next time step (or window).
-
3. Expand the training set to include the previous test point, and repeat.
Each fold preserves the temporal ordering: training data always precedes test data.
Optionally, a gap of \(g\) samples can be inserted between the training and test sets. This prevents leakage from auto-correlated signals where adjacent windows carry similar information. The gap size \(g\) is a hyper-parameter that depends on the correlation length of the signal.
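The expanding-window scheme with an optional gap can be sketched as a small index generator. The function name `expanding_window_splits` and its parameters are hypothetical. scikit-learn's `TimeSeriesSplit` provides a comparable splitter with a `gap` argument.

```python
import numpy as np

def expanding_window_splits(n_samples, n_initial, test_size=1, gap=0):
    """Yield (train_idx, test_idx) pairs for expanding-window CV.

    The training set always starts at index 0 and ends `gap` samples
    before the test block, so training data strictly precedes test data.
    """
    train_end = n_initial
    while train_end + gap + test_size <= n_samples:
        test_start = train_end + gap
        yield np.arange(train_end), np.arange(test_start, test_start + test_size)
        train_end += test_size   # grow the training set by one test block

splits = list(expanding_window_splits(n_samples=10, n_initial=4, test_size=2, gap=1))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

In every fold, the last training index and the first test index are separated by exactly `gap` unused samples.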
24.6 Feature Extraction
24.6.1 Motivation
An alternative to operating on raw samples is to first describe each signal by a handful of interpretable quantities (its mean, its dominant frequency, its peak rate) and then hand the resulting vector to any tabular classifier (logistic regression, SVM, random forest). This decouples the signal-processing step from the learning step and, by collapsing \(L\) raw samples down to \(N\ll L\) features, directly mitigates the overfitting issue.
Feature: A feature is a numerical function of the signal that captures a characteristic of interest. Formally, feature extraction is a mapping
\(\seteqnumber{0}{}{4}\)\begin{equation} f:\real ^L \to \real ^N, \end{equation}
that converts a signal of \(L\) samples into a single scalar (\(N=1\)) or a vector of \(N\) values.
Dimensions
-
• Signal length \(L\): from hundreds to tens of thousands of samples.
-
• Feature count \(N\): from a few to thousands.
The strong reduction \(L\to N\) is precisely what makes feature-based classification tractable on small labeled datasets.
24.6.2 Common Features
Features are commonly grouped by the domain they summarize: time, frequency, joint time–frequency, and field-specific. The remainder of this section walks through these families.
Statistical (time-domain) features
Descriptive statistics computed directly on the signal samples \(y[1],\ldots ,y[L]\):
-
• Mean:
\(\seteqnumber{0}{}{5}\)\begin{equation} \bar {y} = \frac {1}{L}\sum _{n=1}^{L} y[n] \end{equation}
-
• Variance:
\(\seteqnumber{0}{}{6}\)\begin{equation} s_y^2 = \frac {1}{L}\sum _{n=1}^{L}\bigl (y[n]-\bar {y}\bigr )^2 \end{equation}
-
• Root mean square (RMS):
\(\seteqnumber{0}{}{7}\)\begin{equation} y_{\text {rms}} = \sqrt {\frac {1}{L}\sum _{n=1}^{L} y^2[n]} \end{equation}
-
• Zero-crossing rate (ZCR):
\(\seteqnumber{0}{}{8}\)\begin{equation} \text {ZCR} = \frac {1}{L-1}\sum _{n=2}^{L}\mathbf {1}\bigl [y[n]\cdot y[n-1]<0\bigr ] \end{equation}
where \(\mathbf {1}[\cdot ]\) is the indicator function.
-
• Additional: maximum, minimum, median, skewness, kurtosis, energy (\(\sum y^2[n]\)), number of peaks.
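The four formulas above translate almost line by line into NumPy. This is a minimal sketch: the function name `time_domain_features` is hypothetical, and the variance uses the \(1/L\) normalization of the text.

```python
import numpy as np

def time_domain_features(y):
    """Return [mean, variance, RMS, ZCR] for a 1-D signal."""
    y = np.asarray(y, dtype=float)
    mean = y.mean()
    var = np.mean((y - mean) ** 2)       # 1/L normalization, as in the text
    rms = np.sqrt(np.mean(y ** 2))
    zcr = np.mean(y[1:] * y[:-1] < 0)    # fraction of sign changes over L-1 pairs
    return np.array([mean, var, rms, zcr])

y = np.array([1.0, -1.0, 1.0, -1.0])
features = time_domain_features(y)
```

For this alternating test signal every consecutive pair changes sign, so the ZCR reaches its maximum of 1.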
Spectral (frequency-domain) features
Features derived from the magnitude spectrum \(|Y[k]|\) obtained via the Fourier transform of the signal. They summarize where energy sits in frequency rather than when it occurs in time.
-
• Spectral centroid, the “center of mass” of the spectrum:
\(\seteqnumber{0}{}{9}\)\begin{equation} \text {SC} = \frac {\sum _{k=1}^{K} f_k\,|Y[k]|^2}{\sum _{k=1}^{K} |Y[k]|^2}, \end{equation}
where \(f_k\) is the frequency of the \(k\)-th bin. A high SC indicates a signal dominated by high-frequency content.
-
• Spectral bandwidth, spectral rolloff, spectral flatness, and the energy in pre-defined frequency bands.
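A minimal sketch of the spectral centroid. It assumes a real-valued signal, a known sampling rate `fs` (a hypothetical parameter name), and the one-sided spectrum including the DC bin, so the bin indexing may differ slightly from the sum over \(k=1,\ldots ,K\) in the text.

```python
import numpy as np

def spectral_centroid(y, fs):
    """Power-weighted mean frequency of the one-sided spectrum, in Hz."""
    y = np.asarray(y, dtype=float)
    power = np.abs(np.fft.rfft(y)) ** 2            # |Y[k]|^2 for non-negative frequencies
    freqs = np.fft.rfftfreq(len(y), d=1.0 / fs)    # f_k in Hz
    return np.sum(freqs * power) / np.sum(power)

fs = 1000.0                            # assumed sampling rate in Hz
t = np.arange(1000) / fs
tone = np.sin(2 * np.pi * 50.0 * t)    # pure 50 Hz tone, integer number of cycles
sc = spectral_centroid(tone, fs)
```

For a pure tone that falls exactly on a frequency bin, the centroid coincides with the tone frequency. An all-zero signal would need a guard against division by zero.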
Auto-correlation function based features
Quantities derived from the ACF of the signal. They capture periodicity and temporal structure (note the relation between ACF and PSD (Sec. 20.2)).
Model-based (parametric) features
Coefficients of a parametric time-series model fitted to the signal. Fitting an AR(\(p\)) model (Sec. 20.3), or more general ARMA model (Sec. 20.6), yields a compact set of coefficients that summarizes the signal’s spectral shape.
The AR (linear-prediction) coefficients \(h_1,\ldots ,h_p\) (LPC) are obtainable from the ACF via the Yule–Walker equations (Sec. 20.3.1) and form a \(p\)-dimensional feature vector; the residual (prediction-error) variance is a useful scalar companion feature. LPC are standard features in speech and audio processing.
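A minimal sketch of LPC estimation from the ACF. It assumes the sign convention \(y[n]=\sum _{k=1}^{p} h_k\,y[n-k]+e[n]\), which may differ from the one in Sec. 20.3, and uses the biased sample autocorrelation. The function name `lpc_yule_walker` and the synthetic AR(2) coefficients are illustrative choices.

```python
import numpy as np

def lpc_yule_walker(y, p):
    """Estimate AR(p) coefficients and residual variance via Yule-Walker.

    Assumed model: y[n] = h_1 y[n-1] + ... + h_p y[n-p] + e[n].
    """
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    L = len(y)
    # Biased sample autocorrelation r[0..p]
    r = np.array([np.dot(y[:L - k], y[k:]) / L for k in range(p + 1)])
    # Toeplitz system R h = r[1..p], with R[i, j] = r[|i - j|]
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    h = np.linalg.solve(r[idx], r[1:])
    residual_var = r[0] - np.dot(h, r[1:])
    return h, residual_var

# Synthetic AR(2) signal with known coefficients (illustrative values)
rng = np.random.default_rng(0)
true_h = np.array([0.75, -0.5])
e = rng.standard_normal(20_000)
y = np.zeros_like(e)
for n in range(2, len(y)):
    y[n] = true_h[0] * y[n - 1] + true_h[1] * y[n - 2] + e[n]

h_hat, sigma2_hat = lpc_yule_walker(y, p=2)
```

The feature vector for one signal would then be `h_hat`, optionally extended with `sigma2_hat`.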
Mixed-domain features
Features built on joint time–frequency representations, such as the short-time Fourier transform (STFT, Sec. 19.7) or the wavelet transform. These are appropriate when the discriminative information is non-stationary, i.e. when the spectral content changes over the duration of the signal.
Field-tailored features
Quantities developed within a specific application domain to encode domain knowledge. A canonical example is the Mel-frequency cepstral coefficients (MFCC) used in speech and audio processing.
Inherent windowing
Some of the FE methods, such as STFT or MFCC, have inherent windowing.
For ready-made implementations of dozens of features in each of the families above, see, for example, the tsfel feature list.
Computational cost Some features, especially spectral or wavelet-based ones, are non-trivial to compute for large \(L\). Two practical remedies:
-
• Parallel computation: signals are processed independently, so the per-signal feature pipeline parallelizes trivially across CPU or GPU cores.
-
• Feature selection (Ch. 9): keep only a compact, discriminative subset so the heaviest features are not computed unless they pay for themselves.
Learned embeddings (representation learning)
Instead of manually designing signal features based on domain knowledge, representation learning uses deep neural networks to automatically learn informative representations (embeddings) directly from raw signals or time-frequency representations (e.g., spectrograms).
These learned features can be obtained through several paradigms:
-
• Supervised pre-training: A deep neural network (e.g., a 1D CNN, ResNet, or Transformer) is trained on a large labeled signal classification dataset. Once trained, the final classification layer is discarded, and the activations of the penultimate layer are extracted as a fixed-size feature vector (embedding) \(\bz \in \real ^d\) for downstream tasks.
-
• Self-supervised learning (SSL): Representations are learned from unlabeled data by solving a "pretext" task. Popular SSL approaches for signals include:
-
– Autoencoders (AE): An encoder network maps the signal to a low-dimensional bottleneck state, and a decoder network reconstructs the original signal. The bottleneck state serves as the learned embedding.
-
– Contrastive learning: The network is trained to maximize similarity between different augmented views (e.g., adding noise, cropping, jittering) of the same signal window (positive pairs) while minimizing similarity with views from other signals (negative pairs). Contrastive Predictive Coding (CPC) uses temporal forecasting as a pretext task to learn representations.
-
– Masked autoencoders (MAE): Portions of the signal are masked (zeroed out), and the network is trained to predict or reconstruct the missing segments.
-
Pros and Cons
-
• Advantages: Reduces the need for hand-crafted feature engineering, captures complex non-linear patterns, and scales well with large datasets.
-
• Disadvantages: Requires large volumes of data to learn generalizable representations, lacks direct physical interpretability (black-box features), and has high training and inference computational costs compared to basic statistical features.
General Note: Feature Normalization
Features from different domains may have very different scales (a centroid in Hertz vs. a unit-less skewness). Normalize each feature (Sec. 4.7), with parameters fit on the training set only, before passing the combined feature vectors to a classifier; otherwise distance- and gradient-based classifiers will be dominated by the largest-magnitude feature.
Output of the pipeline Each input signal of \(L\) samples is mapped to a single \(N\)-dimensional feature vector. Stacking the vectors of all training signals row-wise yields a tabular dataset of shape (\(M,N\)) on which any standard classifier from earlier chapters can be trained directly.
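A minimal sketch of the final assembly, assuming a toy three-feature map and random signals as stand-ins for real data. `extract_features` is a hypothetical name, and the fixed 6/2 row split only illustrates that the scaling parameters come from the training rows.

```python
import numpy as np

def extract_features(window):
    """Hypothetical feature map f: R^L -> R^3 (mean, standard deviation, RMS)."""
    window = np.asarray(window, dtype=float)
    return np.array([window.mean(), window.std(), np.sqrt(np.mean(window ** 2))])

rng = np.random.default_rng(0)
signals = rng.standard_normal((8, 256)) * 5.0 + 2.0    # M = 8 signals, L = 256 samples

X = np.stack([extract_features(s) for s in signals])   # shape (M, N) = (8, 3)

# Fit the scaling parameters on the training rows only
X_train, X_test = X[:6], X[6:]
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma                    # reuse training statistics
```

The test rows are scaled with the training mean and standard deviation, so no test information enters the normalization.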
24.6.3 Interpretation
A single feature is rarely tied to one branch of the theory. To illustrate, take the root-mean-square (RMS) of a signal,
\(\seteqnumber{0}{}{10}\)\begin{equation} \label {eq:signal-class:rms} \mathrm {RMS}(\by ) = \sqrt {\frac {1}{L}\sum _{n=1}^{L} y^2[n]} = \frac {1}{\sqrt {L}}\norm {\by }, \end{equation}
and read it through five complementary approaches. Each approach points to neighbouring features in the catalogue above.
Statistical interpretation Under the common zero-mean assumption \(\bar y\approx 0\), Eq. (24.11) reduces to the (biased) sample standard deviation of \(\{y[1],\dots ,y[L]\}\) (Sec. 1.1). The natural extensions are higher-order sample moments such as kurtosis.
Distance interpretation Eq. (24.11) is, up to the factor \(1/\sqrt {L}\), the Euclidean (\(L_2\)) distance from the sample vector \(\by =(y[1],\dots ,y[L])\) to the origin. Replacing the \(L_2\) norm by another \(L_p\) or Minkowski distance (Sec. 10.2) yields a family of analogous "size" features, of which RMS is the \(p=2\) representative.
Signal characterization Squaring RMS recovers the signal power \(P_\by =\frac {1}{L}\norm {\by }^2=\mathrm {RMS}(\by )^2\), and the corresponding energy is \(E_\by = L\cdot P_\by \) (Sec. 19.1.2). RMS is therefore the simplest amplitude-based characterization of the signal.
Spectral interpretation By Parseval’s theorem (Eq. (19.51)), the same energy can be read off the magnitude spectrum \(|Y[k]|\), so RMS is the zeroth-order summary of the spectrum. Higher-order summaries of the same spectrum (the spectral centroid, bandwidth, rolloff, and band energies) sit naturally beside it.
ACF interpretation The energy identity \(E_\by = R_\by [0]\) gives \(\mathrm {RMS}(\by ) = \sqrt {R_\by [0]/L}\), i.e. RMS is the square root of the auto-correlation function at lag zero, normalized by \(L\) (Sec. 20.1). Other ACF-related features, such as the lag and height of the first non-trivial peak of \(R_\by [k]\) at \(k\ne 0\), extend the same family and capture periodicity rather than amplitude.
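The equivalences above can be checked numerically. The sketch assumes NumPy's unnormalized DFT convention, under which Parseval reads \(\sum _n y^2[n]=\frac {1}{L}\sum _k |Y[k]|^2\) (the normalization in Eq. (19.51) may differ), and an unnormalized lag-zero ACF \(R_\by [0]=\sum _n y^2[n]\).

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(512)
L = len(y)

rms_time = np.sqrt(np.mean(y ** 2))                  # definition
rms_norm = np.linalg.norm(y) / np.sqrt(L)            # distance view
Y = np.fft.fft(y)
rms_spec = np.sqrt(np.sum(np.abs(Y) ** 2) / L ** 2)  # spectral view via Parseval
R0 = np.dot(y, y)                                    # unnormalized ACF at lag zero
rms_acf = np.sqrt(R0 / L)                            # ACF view

print(rms_time, rms_norm, rms_spec, rms_acf)
```

All four values agree up to floating-point error.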
24.6.4 Dedicated libraries
Several open-source libraries provide large collections of pre-implemented signal features, eliminating the need to code them from scratch.
tsfresh (Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests)
-
• Python-based, \(150+\) features covering statistical, spectral, and non-linear domains.
-
• Built-in statistical feature selection (relevance testing via hypothesis tests).
-
• Seamless scikit-learn integration (transformers, pipelines).
tsfel (Time Series Feature Extraction Library)
-
• Python-based, \(60+\) features organized by domain (statistical, temporal, spectral).
hctsa (highly comparative time-series analysis)
-
• Matlab-based, \(7{,}700+\) features, the largest available feature library.
-
• Designed for exploratory comparison across many time-series datasets.
catch22 (CAnonical Time-series CHaracteristics)
-
• A curated subset of 22 features selected from \(4{,}791\) hctsa features on 93 publicly available datasets.
-
• Provides strong baseline performance on a broad variety of signals.
-
• Applied on internally normalized signals; optional mean and std features.
-
• Fast C implementation with Python, R, and Matlab wrappers.
scikit-learn
-
• Does not include signal-level feature extraction, but provides classifiers, regressors, and feature selection methods.
-
• Feature selection can be embedded in a Pipeline for end-to-end workflows.
Complementary: mlxtend supplies sequential feature selection (forward/backward) and model-evaluation utilities on top of scikit-learn.
24.7 Summary
-
• Pipeline order (Fig. 24.1): (optional) signal transformation \(\to \) (optional) windowing \(\to \) train/test split \(\to \) feature extraction, producing an \(M\times N\) tabular dataset via \(f:\real ^L\to \real ^N\).
-
• Signal transformations (log, Box–Cox, etc.) reshape distributions to improve downstream performance; they must be invertible when applied to a regression target.
-
• Windowing applies only to long or continuous signals; short fixed-length signals skip directly to the split. When used, non-overlapping windows are preferred and the overlap ratio \(r=(L-S)/L\) controls redundancy and leakage risk.
-
• Train–test splitting happens before feature extraction and must respect structure: group-split for classification, temporal/expanding-window CV for prediction, never random.