Machine Learning & Signals Learning

$\newcommand{\footnotename}{footnote}$ $\def \LWRfootnote {1}$ $\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}$ $\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}$ $\let \LWRorighspace \hspace $ $\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }$ $\newcommand {\TextOrMath }[2]{#2}$ $\newcommand {\mathnormal }[1]{{#1}}$ $\newcommand \ensuremath [1]{#1}$ $\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } $ $\newcommand {\setlength }[2]{}$ $\newcommand {\addtolength }[2]{}$ $\newcommand {\setcounter }[2]{}$ $\newcommand {\addtocounter }[2]{}$ $\newcommand {\arabic }[1]{}$ $\newcommand {\number }[1]{}$ $\newcommand {\noalign }[1]{\text {#1}\notag \\}$ $\newcommand {\cline }[1]{}$ $\newcommand {\directlua }[1]{\text {(directlua)}}$ $\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}$ $\newcommand {\protect }{}$ $\def \LWRabsorbnumber #1 {}$ $\def \LWRabsorbquotenumber "#1 {}$ $\newcommand {\LWRabsorboption }[1][]{}$ $\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }$ $\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }$ $\def \mathcode #1={\mathchar }$ $\let \delcode \mathcode $ $\let \delimiter \mathchar $ $\def \oe {\unicode {x0153}}$ $\def \OE {\unicode {x0152}}$ $\def \ae {\unicode {x00E6}}$ $\def \AE {\unicode {x00C6}}$ $\def \aa {\unicode {x00E5}}$ $\def \AA {\unicode {x00C5}}$ $\def \o {\unicode {x00F8}}$ $\def \O {\unicode {x00D8}}$ $\def \l {\unicode {x0142}}$ $\def \L {\unicode {x0141}}$ $\def \ss {\unicode {x00DF}}$ $\def \SS {\unicode {x1E9E}}$ $\def \dag {\unicode {x2020}}$ $\def \ddag {\unicode {x2021}}$ $\def \P {\unicode {x00B6}}$ $\def \copyright {\unicode {x00A9}}$ $\def \pounds {\unicode {x00A3}}$ $\let \LWRref \ref $ $\renewcommand {\ref }{\ifstar \LWRref \LWRref }$ $ \newcommand {\multicolumn }[3]{#3}$ $\require {textcomp}$ $ \newcommand {\abs }[1]{\lvert #1\rvert } $ $ \DeclareMathOperator {\sign }{sign} $ $\newcommand {\intertext 
}[1]{\text {#1}\notag \\}$ $\let \Hat \hat $ $\let \Check \check $ $\let \Tilde \tilde $ $\let \Acute \acute $ $\let \Grave \grave $ $\let \Dot \dot $ $\let \Ddot \ddot $ $\let \Breve \breve $ $\let \Bar \bar $ $\let \Vec \vec $ $\newcommand {\bm }[1]{\boldsymbol {#1}}$ $\require {physics}$ $\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}$ $\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}$ $\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}$ $\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}$ $\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}$ $\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}$ $\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}$ $\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}$ $\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}$ $\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}$ $\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}$ $\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}$ $\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}$ $\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}$ $\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}$ $\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}$ $\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}$ $\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}$ $\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}$ $\renewcommand {\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}$ $\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}$ $\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}$ $\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}$ $\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}$ $\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}$ $\require {cancel}$ $\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}$ $\DeclareMathOperator *{\argmax }{argmax}$ $\DeclareMathOperator *{\argmin }{arg\,min}$ $\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}$ $\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}$ $\def \Cov 
[#1]{\operatorname {Cov}\!\left [ #1 \right ]}$ $\newcommand {\floor }[1]{\lfloor #1 \rfloor }$ $\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}$ $\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}$ $\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}$ $\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}$ $\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}$ $\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}$ $\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}$ $\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}$ $\renewcommand {\real }{\mathbb {R}}$ $\newcommand {\ba }{\mathbf {a}}$ $\newcommand {\bb }{\mathbf {b}}$ $\newcommand {\bc }{\mathbf {c}}$ $\newcommand {\bd }{\mathbf {d}}$ $\newcommand {\be }{\mathbf {e}}$ $\newcommand {\bf }{\mathbf {f}}$ $\newcommand {\bh }{\mathbf {h}}$ $\newcommand {\bi }{\mathbf {i}}$ $\newcommand {\bn }{\mathbf {n}}$ $\newcommand {\bo }{\mathbf {o}}$ $\newcommand {\bp }{\mathbf {p}}$ $\newcommand {\bq }{\mathbf {q}}$ $\newcommand {\br }{\mathbf {r}}$ $\newcommand {\bs }{\mathbf {s}}$ $\newcommand {\bt }{\mathbf {t}}$ $\newcommand {\bu }{\mathbf {u}}$ $\newcommand {\bv }{\mathbf {v}}$ $\newcommand {\bw }{\mathbf {w}}$ $\newcommand {\bx }{\mathbf {x}}$ $\newcommand {\bxx }{\mathbf {xx}}$ $\newcommand {\bxy }{\mathbf {xy}}$ $\newcommand {\by }{\mathbf {y}}$ $\newcommand {\byy }{\mathbf {yy}}$ $\newcommand {\bz }{\mathbf {z}}$ $\newcommand {\bA }{\mathbf {A}}$ $\newcommand {\bB }{\mathbf {B}}$ $\newcommand {\bC }{\mathbf {C}}$ $\newcommand {\bD }{\mathbf {D}}$ $\newcommand {\bH }{\mathbf {H}}$ $\newcommand {\bI }{\mathbf {I}}$ $\newcommand {\bK }{\mathbf {K}}$ $\newcommand {\bM }{\mathbf {M}}$ $\newcommand {\bP }{\mathbf {P}}$ $\newcommand {\bQ }{\mathbf {Q}}$ $\newcommand {\bR }{\mathbf {R}}$ $\newcommand {\bS }{\mathbf {S}}$ $\newcommand {\bU }{\mathbf {U}}$ $\newcommand {\bW }{\mathbf {W}}$ $\newcommand {\bX }{\mathbf {X}}$ $\newcommand {\bY }{\mathbf {Y}}$ $\newcommand 
{\bZ }{\mathbf {Z}}$ $\newcommand {\balpha }{\bm {\alpha }}$ $\newcommand {\bth }{{\bm {\theta }}}$ $\newcommand {\bepsilon }{{\bm {\epsilon }}}$ $\newcommand {\bmu }{{\bm {\mu }}}$ $\newcommand {\bphi }{\bm {\phi }}$ $\newcommand {\bOne }{\mathbf {1}}$ $\newcommand {\bZero }{\mathbf {0}}$ $\newcommand {\indFunc }{\mathbb {1}}$ $\newcommand {\btx }{\tilde {\bx }}$ $\newcommand {\loss }{\mathcal {L}}$ $\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}$ $\newcommand {\SSE }{\mathrm {SSE}}$ $\newcommand {\MSE }{\mathrm {MSE}}$ $\newcommand {\RMSE }{\mathrm {RMSE}}$ $\newcommand {\toprule }[1][]{\hline }$ $\let \midrule \toprule $ $\let \bottomrule \toprule $ $\def \LWRbooktabscmidruleparen (#1)#2{}$ $\newcommand {\LWRbooktabscmidrulenoparen }[1]{}$ $\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }$ $\newcommand {\morecmidrules }{}$ $\newcommand {\specialrule }[3]{\hline }$ $\newcommand {\addlinespace }[1][]{}$ $\newcommand {\LWRsubmultirow }[2][]{#2}$ $\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }$ $\newcommand {\multirow }[2][]{\LWRmultirow }$ $\newcommand {\mrowcell }{}$ $\newcommand {\mcolrowcell }{}$ $\newcommand {\STneed }[1]{}$ $\newcommand {\tcbset }[1]{}$ $\newcommand {\tcbsetforeverylayer }[1]{}$ $\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}$ $\newcommand {\tcboxfit }[2][]{\boxed {#2}}$ $\newcommand {\tcblower }{}$ $\newcommand {\tcbline }{}$ $\newcommand {\tcbtitle }{}$ $\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}$ $\newcommand {\tcboxmath }[2][]{\boxed {#2}}$ $\newcommand {\tcbhighmath }[2][]{\boxed {#2}}$ $\require {colortbl}$ $\let \LWRorigcolumncolor \columncolor $ $\renewcommand {\columncolor }[2][named]{\LWRorigcolumncolor [#1]{#2}\LWRabsorbtwooptions }$ $\let \LWRorigrowcolor \rowcolor $ $\renewcommand {\rowcolor }[2][named]{\LWRorigrowcolor [#1]{#2}\LWRabsorbtwooptions }$ $\let 
\LWRorigcellcolor \cellcolor $ $\renewcommand {\cellcolor }[2][named]{\LWRorigcellcolor [#1]{#2}\LWRabsorbtwooptions }$

27 Multivariate Signal Classification

Goal: Extend the univariate feature-extraction pipeline of the previous chapter to signals with $C>1$ channels, where cross-channel structure carries information the per-channel pipeline discards.

27.1 Preface

A multivariate signal is a collection of $C$ synchronously sampled channels, organized as a matrix

\begin{equation} \bX = \begin{bmatrix} \bx _1^\top \\ \bx _2^\top \\ \vdots \\ \bx _C^\top \end {bmatrix} \in \real ^{C\times L}, \qquad \bx _c\in \real ^L, \end{equation}

where row $c$ is the $L$-sample record of channel $c$ and column $n$ is the instantaneous snapshot across all channels at sample $n$ (Fig. 27.1).

Typical sources:

• 3-axis IMU: accelerometer or gyroscope with $C=3$ orthogonal axes.
• Multi-lead ECG ($C=2,3,12$) and multi-channel EEG ($C=8$–$256$).
• Microphone arrays for speech or acoustic event detection.

Applicability note

This chapter does not explicitly cover multi-rate sensor measurements, as arise in multi-sensor industrial monitoring (vibration, temperature, and current on the same machine).

Applying the univariate pipeline channel-by-channel is a valid baseline, but it discards the inter-channel structure (correlation, relative phase, shared latent sources), which often carries the discriminative information (e.g. an IMU gesture is defined by how the three axes move together, not by any one axis alone).

The pipeline reuses every stage from the previous chapter (Ch. 25) except feature extraction, which now produces two concatenated blocks:

• Per-channel features (Sec. 27.2): the univariate $f:\real ^L\to \real ^N$ applied independently to each of the $C$ channels, yielding $C\cdot N$ features.
• Cross-channel features (Sec. 27.3): values that quantify inter-channel structure, yielding $N_\text {cc}$ features.

Stacking gives a row of $N' = C\cdot N + N_\text {cc}$ features (per window).

27.2 Channel-wise feature extraction

Goal: Treat the $C$ channels as $C$ separate sources and combine them with the fusion strategies of Ch. 15.

The $C$ channels are $C$ synchronously sampled sources describing the same sample, so the fusion taxonomy of Ch. 15 applies directly, with each channel playing the role of a source. Three strategies result.

27.2.1 Data fusion (early fusion)

Stack the raw channel records into a single length-$C\cdot L$ vector and apply one univariate feature map $f:\real ^{CL}\to \real ^N$ (data concatenation, Sec. 15.1). This is the simplest option, but it loses channel identity and implicitly assumes the channels share a common representation (units, scale, sampling rate).

27.2.2 Feature fusion (early fusion)

Apply a univariate feature map $f:\real ^L\to \real ^N$ (time, frequency, or wavelet features as in Ch. 25) to each channel independently and concatenate the per-channel feature vectors (feature concatenation, Sec. 15.2):

\begin{equation} \mathbf {z}_\text {pc}(\bX ) = \bigl [\,f(\bx _1)^\top ,\;f(\bx _2)^\top ,\;\dots ,\;f(\bx _C)^\top \,\bigr ]^\top \in \real ^{CN}. \end{equation}

This is the trivial extension of the univariate pipeline and a reasonable baseline. Its weakness is that every inter-channel statistic is discarded: two datasets that differ only in how channels move relative to each other produce identical $\mathbf {z}_\text {pc}$.

Example 27.1: On a 3-axis accelerometer ($C=3$) with $N=10$ univariate features per channel, $\mathbf {z}_\text {pc}\in \real ^{30}$. Walking and waving can produce near-identical per-axis statistics (mean, variance, dominant frequency) while differing sharply in axis-to-axis correlation, which this baseline cannot see.
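
The blind spot of the per-channel block can be checked numerically. The following is a minimal sketch, assuming a toy feature map with only $N=2$ features (mean and variance) and $C=2$ synthetic channels; the function name `per_channel_features` and the shuffled-channel construction are illustrative assumptions, not part of the chapter's pipeline.

```python
import numpy as np

def per_channel_features(X):
    """Concatenate toy univariate features (mean, variance) of every row of X (C x L)."""
    feats = [np.array([x.mean(), x.var()]) for x in X]
    return np.concatenate(feats)                      # shape (C * N,), here N = 2

rng = np.random.default_rng(0)
a = rng.standard_normal(1000)

# Two windows with identical per-channel statistics but different coupling:
X_coupled = np.vstack([a, a])                         # channel 2 copies channel 1
X_shuffled = np.vstack([a, rng.permutation(a)])       # same values, coupling destroyed

z_coupled = per_channel_features(X_coupled)
z_shuffled = per_channel_features(X_shuffled)

print(np.allclose(z_coupled, z_shuffled))             # per-channel block cannot tell them apart
print(np.corrcoef(X_coupled)[0, 1], np.corrcoef(X_shuffled)[0, 1])
```

Shuffling keeps the mean and variance of the second channel but removes its correlation with the first, so the two windows share one $\mathbf {z}_\text {pc}$ while their cross-channel correlation differs.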

27.2.3 Late fusion

Process each channel with its own feature-extraction and classification pipeline, then combine the $C$ per-channel predictions with a fusion rule (classifier fusion, Sec. 15.3). This is attractive when the channels are heterogeneous or come from different modalities, since each pipeline can be specialized. Like feature fusion, however, it does not model the joint cross-channel structure, since each channel is classified in isolation. Capturing that structure is the role of the cross-channel features of Sec. 27.3.

Time synchronization

Cross-channel features are effective only if all channels share a common time base. Random jitter or misalignment between channels can effectively destroy some of these features, such as relative phase and lagged correlation.

27.3 Cross-channel features

Goal: Summarize the inter-channel structure of $\bX $ into a small vector of scalar features that complements the per-channel block.

27.3.1 Multivariate ACF

Viewing each channel as a signal, the auto-correlation (Sec. 20.1,Eq. (20.11)) and the cross-correlation (Sec. 21.1,Eq. (21.6)) merge into a single matrix-valued, lag-indexed object:

\begin{equation} \label {eq:multi-acf} \begin{aligned} \bR _{\bX \bX }[\tau ]&\in \real ^{C\times C},\\ \bigl (\bR _{\bX \bX }[\tau ]\bigr )_{ij} &= R_{ij}[\tau ] = \sum _{n}\bigl (x_i[n]-\bar {x}_i\bigr )\bigl (x_j[n+\tau ]-\bar {x}_j\bigr ). \end {aligned} \end{equation}

The following special cases recover the constructs of Part II:

• $i=j$: the auto-correlation $R_{\bxx }[\tau ]$ of channel $i$ (Sec. 20.1).
• $i\ne j$: the cross-correlation between channels $i$ and $j$ (Sec. 21.1).
• $\tau =0$, biased normalization (Eq. (20.14)): the covariance matrix used below.

The biased, unbiased, and normalized variants of Sec. 20.1.2 and Sec. 21.1 apply entry-wise to $\bR _{\bX \bX }[\tau ]$.

Covariance and correlation

At $\tau =0$, the biased form of the multivariate ACF (Sec. 27.3.1, Eq. (20.14)) is the covariance matrix

\begin{equation} \boldsymbol {\Sigma }=\frac {1}{L}\bR _{\bX \bX }[0] \in \real ^{C\times C},\qquad \boldsymbol {\Sigma }_{ij} = \Cov [\bx _i,\bx _j] = R_{ij,\text {biased}}[0], \end{equation}

or its normalized counterpart, the Pearson correlation matrix

\begin{equation} \bR _{ij} = \frac {\boldsymbol {\Sigma }_{ij}}{\sqrt {\boldsymbol {\Sigma }_{ii}\,\boldsymbol {\Sigma }_{jj}}} \in [-1,1]. \end{equation}

$\bR $ is symmetric with unit diagonal, so the $C(C-1)/2$ strictly upper-triangular entries are used as features (Fig. 27.2). For $C=3$ this adds 3 features; for $C=16$, 120.
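
A minimal NumPy sketch of this feature block, assuming a window stored as a `(C, L)` array; the function name `correlation_features` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def correlation_features(X):
    """Strictly upper-triangular entries of the Pearson correlation matrix of X (C x L)."""
    Xc = X - X.mean(axis=1, keepdims=True)            # remove per-channel mean
    Sigma = Xc @ Xc.T / X.shape[1]                    # biased covariance, (1/L) R_XX[0]
    d = np.sqrt(np.diag(Sigma))
    R = Sigma / np.outer(d, d)                        # Pearson correlation matrix
    iu = np.triu_indices(X.shape[0], k=1)             # index pairs with i < j
    return R[iu]                                      # C (C - 1) / 2 features

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 500))
X[1] += 0.8 * X[0]                                    # couple channel 2 to channel 1
z_corr = correlation_features(X)
print(z_corr.shape)                                   # (3,) for C = 3
```

The normalization constant $1/L$ cancels in the correlation, so the same values result from the biased or the unbiased covariance estimate.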

Lagged cross-correlation

For $\tau \ne 0$, the off-diagonal entries of the multivariate ACF $\bR _{\bX \bX }[\tau ]$ (Eq. (27.3)) encode delayed coupling between channels and the timing of that coupling. Writing $r_{ij}[\tau ]\triangleq (\bR _{\bX \bX }[\tau ])_{ij}$ for the $(i,j)$-entry, two scalar features per channel pair are typical:

• the peak value $\max \limits _{|\tau |\le \tau _\text {max}} r_{ij}[\tau ]$, $i\ne j$, and
• the argmax lag $\argmax \limits _{|\tau |\le \tau _\text {max}} r_{ij}[\tau ]$, $i\ne j$.
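
The two features can be sketched as follows for one channel pair. This is an illustration under stated assumptions: the normalized variant of $r_{ij}[\tau ]$ is used so that the peak lies in $[-1,1]$, and the function name `lagged_xcorr_features` and the synthetic 5-sample delay are choices made here, not taken from the text.

```python
import numpy as np

def lagged_xcorr_features(xi, xj, max_lag):
    """Peak value and argmax lag of the normalized r_ij[tau] for |tau| <= max_lag."""
    xi = xi - xi.mean()
    xj = xj - xj.mean()
    norm = np.sqrt(np.sum(xi**2) * np.sum(xj**2))
    L = len(xi)
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.empty(len(lags))
    for idx, tau in enumerate(lags):
        if tau >= 0:
            r[idx] = np.sum(xi[:L - tau] * xj[tau:])      # sum_n x_i[n] x_j[n + tau]
        else:
            r[idx] = np.sum(xi[-tau:] * xj[:L + tau])     # same sum for negative tau
    r /= norm
    k = np.argmax(r)
    return r[k], lags[k]

# Synthetic pair: channel j repeats channel i five samples later.
rng = np.random.default_rng(0)
s = rng.standard_normal(505)
xi, xj = s[5:], s[:500]                                   # x_j[n + 5] = x_i[n]
peak, lag = lagged_xcorr_features(xi, xj, max_lag=20)
print(peak, lag)
```

With the sign convention of Eq. (27.3), a positive argmax lag means that channel $j$ lags channel $i$.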

27.3.2 VAR coefficients

Fitting a vector AR (VAR($p$), Sec. 21.9) model to the $C$-channel window estimates the coefficient matrices $\bA _1,\ldots ,\bA _p\in \real ^{C\times C}$. Their off-diagonal entries $(\bA _m)_{ij}$, $i\ne j$, quantify how the past of channel $j$ predicts the present of channel $i$: a directed, lagged, conditional measure of coupling¹, in contrast to the symmetric, undirected correlation of Sec. 27.3.1. The $p\,C(C-1)$ off-diagonal entries across all lags are stacked as features; the $pC$ diagonal entries are per-channel self-prediction and may be assigned to the per-channel block instead.

¹ The significance of these off-diagonal coefficients is tested by Granger causality (Sec. 22.5).
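
A minimal least-squares sketch of the VAR fit and the resulting feature vector, assuming the model form $\bx [n]=\sum _{m=1}^{p}\bA _m\bx [n-m]+\be [n]$ and an ordinary least-squares estimator; the helper names `fit_var` and `var_cross_features` and the simulated two-channel system are hypothetical.

```python
import numpy as np

def fit_var(X, p):
    """Least-squares VAR(p) fit on a window X (C x L); returns A with shape (p, C, C)."""
    C, L = X.shape
    X = X - X.mean(axis=1, keepdims=True)
    Y = X[:, p:].T                                        # targets x[n], shape (L - p, C)
    Z = np.hstack([X[:, p - m:L - m].T                    # regressors x[n - m]
                   for m in range(1, p + 1)])             # shape (L - p, p * C)
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)             # Y ~ Z B, B has shape (p * C, C)
    return B.T.reshape(C, p, C).transpose(1, 0, 2)        # A[m - 1] is the matrix A_m

def var_cross_features(A):
    """Stack the off-diagonal entries of all lag matrices: p * C * (C - 1) features."""
    p, C, _ = A.shape
    mask = ~np.eye(C, dtype=bool)
    return A[:, mask].ravel()

# Simulated VAR(1): the past of channel 0 drives channel 1, not the reverse.
rng = np.random.default_rng(0)
A_true = np.array([[0.5, 0.0],
                   [0.4, 0.5]])
X = np.zeros((2, 5000))
for n in range(1, X.shape[1]):
    X[:, n] = A_true @ X[:, n - 1] + rng.standard_normal(2)

A_hat = fit_var(X, p=1)
z_var = var_cross_features(A_hat)                         # [(A_1)_01, (A_1)_10]
print(np.round(A_hat[0], 2))
```

The asymmetry of the estimate, with $(\bA _1)_{10}$ large and $(\bA _1)_{01}$ near zero, is the directed information that a symmetric correlation matrix cannot express.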

27.3.3 Coherence

For a pair of channels, the cross-coherence $\gamma _{ij}[k]$ (Sec. 21.2.2) between $\bx _i$ and $\bx _j$ can be estimated at each frequency bin $k$ (e.g., by a Welch estimate). Features are the band-averaged magnitude-squared coherences

\begin{equation} \bar {\gamma }_{ij,b} = \frac {1}{|B_b|}\sum _{k\in B_b}\gamma ^2_{ij}[k] \end{equation}

over frequency bands $B_b$ of interest (e.g. $\alpha $, $\beta $, $\gamma $ bands in EEG).
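
A short sketch of the band-averaged coherence, using `scipy.signal.coherence`, which returns a Welch estimate of the magnitude-squared coherence. The sampling rate, segment length, band edges, and the helper name `band_coherence` are assumptions made for this illustration.

```python
import numpy as np
from scipy.signal import coherence

def band_coherence(xi, xj, fs, bands, nperseg=256):
    """Mean magnitude-squared coherence of (xi, xj) inside each band (lo, hi) in Hz."""
    f, msc = coherence(xi, xj, fs=fs, nperseg=nperseg)    # Welch estimate of gamma^2
    return np.array([msc[(f >= lo) & (f < hi)].mean() for lo, hi in bands])

# Two channels sharing a 10 Hz rhythm (with a phase offset) plus independent noise.
fs = 128.0
t = np.arange(int(20 * fs)) / fs
rng = np.random.default_rng(0)
xi = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(t.size)
xj = np.sin(2 * np.pi * 10 * t + 0.7) + 0.5 * rng.standard_normal(t.size)

bands = [(8, 13), (13, 30)]                               # assumed alpha and beta edges
z_coh = band_coherence(xi, xj, fs, bands)
print(z_coh)                                              # alpha-band value exceeds beta-band value
```

Coherence is insensitive to a constant phase offset between the channels, which is why the shared rhythm is detected despite the 0.7 rad shift.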

27.3.4 Phase-locking value

For EEG-style phase coupling, let $\phi _c[n]$ be the instantaneous phase of channel $c$ (from the Hilbert transform or a band-limited analytic signal). The phase-locking value (PLV) between channels $i,j$ is

\begin{equation} \mathrm {PLV}_{ij} = \left |\frac {1}{L}\sum _{n=0}^{L-1}\mathrm {e}^{\,j\bigl (\phi _i[n]-\phi _j[n]\bigr )}\right | \in [0,1]. \end{equation}

PLV$=1$ indicates perfect phase synchrony (a constant phase difference); PLV$\approx 0$ indicates a phase difference spread uniformly over the circle, i.e. no consistent phase relation.
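
The PLV can be sketched in a few lines, assuming the instantaneous phase is taken from the analytic signal returned by `scipy.signal.hilbert`. The band-pass step that normally precedes the Hilbert transform is omitted because the synthetic channels below are already narrowband; the function name `plv` and the test frequencies are illustrative.

```python
import numpy as np
from scipy.signal import hilbert

def plv(xi, xj):
    """Phase-locking value between two (already band-limited) channels."""
    phi_i = np.angle(hilbert(xi))                         # instantaneous phase of channel i
    phi_j = np.angle(hilbert(xj))
    return np.abs(np.mean(np.exp(1j * (phi_i - phi_j))))

fs = 256.0
t = np.arange(512) / fs                                   # 2 s window
x_ref = np.sin(2 * np.pi * 10 * t)
x_locked = np.sin(2 * np.pi * 10 * t + 1.0)               # same frequency, constant offset
x_drift = np.sin(2 * np.pi * 13 * t)                      # different frequency, drifting phase

plv_locked = plv(x_ref, x_locked)
plv_drift = plv(x_ref, x_drift)
print(plv_locked, plv_drift)
```

The locked pair gives a value close to one regardless of the size of the constant offset, while the drifting pair sweeps the phase difference through full cycles and averages out.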

27.3.5 Common spatial patterns (CSP)

For binary classification, CSP learns $K$ spatial filters $\bw _k\in \real ^C$ that maximize the variance ratio between the two classes:

\begin{equation} \bw _k = \argmax _{\bw } \frac {\bw ^\top \boldsymbol {\Sigma }^{(1)}\bw }{\bw ^\top \boldsymbol {\Sigma }^{(2)}\bw }, \end{equation}

where $\boldsymbol {\Sigma }^{(c)}$ is the average covariance matrix of class $c$. Each filter $\bw _k$ projects the window onto a latent channel $\bw _k^\top \bX \in \real ^{L}$; its variance is large for one class and small for the other, so the $K$ log-variances

\begin{equation} \label {eq-csp-features} z_k = \log \mathrm {Var}[\bw _k^\top \bX ], \qquad k=1,\dots ,K, \end{equation}

form the $K$-dimensional CSP feature vector for that window (the log makes the features more nearly Gaussian for the downstream classifier). CSP is standard in brain–computer interfaces; like any supervised feature, $\bw _k$ must be fitted on the training set only.
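
A compact CSP sketch under stated assumptions: the filters are obtained by whitening $\boldsymbol {\Sigma }^{(1)}+\boldsymbol {\Sigma }^{(2)}$ and diagonalizing the whitened $\boldsymbol {\Sigma }^{(1)}$, which yields the same filters as the ratio criterion above up to a monotone change of eigenvalue, and the $K$ filters are taken from both ends of the spectrum. The names `fit_csp` and `csp_features` and the synthetic classes are hypothetical.

```python
import numpy as np

def fit_csp(windows_1, windows_2, K):
    """Fit K CSP filters from training windows of shape (M, C, L) per class; returns W (K x C)."""
    def avg_cov(windows):
        covs = []
        for X in windows:
            Xc = X - X.mean(axis=1, keepdims=True)
            covs.append(Xc @ Xc.T / X.shape[1])
        return np.mean(covs, axis=0)

    S1, S2 = avg_cov(windows_1), avg_cov(windows_2)
    C = S1.shape[0]
    d, U = np.linalg.eigh(S1 + S2)
    P = U / np.sqrt(d)                                    # whitening: P.T (S1 + S2) P = I
    lam, V = np.linalg.eigh(P.T @ S1 @ P)                 # ascending class-1 variance share
    W = (P @ V).T                                         # rows are spatial filters
    n_top = (K + 1) // 2
    idx = np.r_[np.arange(C - 1, C - 1 - n_top, -1),      # largest class-1 variance first
                np.arange(K // 2)]                        # then smallest (class-2 dominated)
    return W[idx]

def csp_features(W, X):
    """Log-variance of each latent channel w_k^T X."""
    return np.log(np.var(W @ X, axis=1))

# Synthetic classes: class 1 is strong on channel 0, class 2 on channel 1.
rng = np.random.default_rng(0)
s1, s2 = np.array([3.0, 1.0, 1.0]), np.array([1.0, 3.0, 1.0])
train_1 = rng.standard_normal((40, 3, 200)) * s1[None, :, None]
train_2 = rng.standard_normal((40, 3, 200)) * s2[None, :, None]
W_csp = fit_csp(train_1, train_2, K=2)                    # fitted on training windows only

z_1 = csp_features(W_csp, rng.standard_normal((3, 200)) * s1[:, None])
z_2 = csp_features(W_csp, rng.standard_normal((3, 200)) * s2[:, None])
print(z_1, z_2)
```

The ordering of the two log-variances flips between the classes, which is what makes the feature pair linearly separable.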

27.3.6 Learned representations

End-to-end deep models (1D-CNN, RNN, transformer) operate directly on $\bX $ and subsume both per-channel and cross-channel feature extraction as learned representations. They trade interpretability and small-data robustness for raw performance on large datasets.

High number of cross-features

Feature selection (Sec. 9.4) is strongly recommended: the number of cross-channel features grows quickly with $C$ (the correlation block alone contributes $C(C-1)/2$ entries), and many of these features are redundant.

27.4 Dimensionality reduction across channels

Goal: When $C$ is large, project onto $C'\ll C$ latent channels before per-channel feature extraction.

For $C\gtrsim 10$ the cross-channel feature count, which grows as $O(C^2)$, and the per-channel block both become unwieldy. A remedy is to first reduce the $C$ channels to $C'\ll C$ latent channels with a global linear projection, fixed once for the whole record, then run the per-channel pipeline of Sec. 27.2 on the reduced set.

Linear projection

Let $\bX \in \real ^{C\times L}$ be a window. Fit a linear projection $\bW \in \real ^{C'\times C}$ and form

\begin{equation} \bX ' = \bW \bX \in \real ^{C'\times L}, \end{equation}

then apply the univariate feature map to each of the $C'$ rows of $\bX '$. Common choices for $\bW $:

• PCA (Sec. 11.2): rows of $\bW $ are the top-$C'$ eigenvectors of the training covariance matrix $\boldsymbol {\Sigma }_\text {train}$. Captures maximum variance; label-agnostic.
• Independent component analysis (ICA): rows of $\bW $ are spatial filters that unmix the channels into statistically independent sources. Where PCA only decorrelates the channels (a second-order property), ICA seeks full statistical independence, so it can separate physically distinct generators that PCA leaves mixed. Unsupervised; widely used for artifact removal in EEG, where ocular and muscle components are isolated and dropped before reconstruction.
• CSP (Sec. 27.3.5): a supervised filter for binary classification. Its rows maximize the variance of the projection for one class while minimizing it for the other, obtained from the generalized eigendecomposition of the two class covariance matrices. The leading and trailing filters give the most class-discriminative latent channels, but, being supervised, $\bW $ must be fitted on the training set only.
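
For the PCA choice, the projection can be sketched as below. The pooling of all training windows into one covariance estimate, the helper name `fit_pca_projection`, and the two-source synthetic mixture are assumptions made for illustration.

```python
import numpy as np

def fit_pca_projection(train_windows, C_red):
    """W (C_red x C): top eigenvectors of the covariance pooled over training windows (M, C, L)."""
    M, C, L = train_windows.shape
    pooled = train_windows.transpose(1, 0, 2).reshape(C, M * L)
    pooled = pooled - pooled.mean(axis=1, keepdims=True)
    Sigma_train = pooled @ pooled.T / pooled.shape[1]
    evals, evecs = np.linalg.eigh(Sigma_train)            # ascending eigenvalues
    return evecs[:, ::-1][:, :C_red].T                    # rows = leading eigenvectors

# Synthetic data: C = 8 observed channels mixing 2 latent sources plus weak noise.
rng = np.random.default_rng(0)
A_mix = rng.standard_normal((8, 2))
sources = rng.standard_normal((30, 2, 256))
train = A_mix @ sources + 0.05 * rng.standard_normal((30, 8, 256))

W = fit_pca_projection(train, C_red=2)                    # fitted on training windows only

X = A_mix @ rng.standard_normal((2, 256)) + 0.05 * rng.standard_normal((8, 256))
X_red = W @ X                                             # X' = W X, shape (C', L)
kept = np.var(X_red, axis=1).sum() / np.var(X, axis=1).sum()
print(X_red.shape, round(kept, 3))
```

The univariate feature map is then applied to the two rows of `X_red` instead of the eight rows of `X`.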

27.5 Multivariate decomposition (*)

Goal: Opposite to the linear projection of Sec. 27.4: rather than reduce the $C$ channels to fewer latent ones, re-represent each channel as its own set of adaptive, narrowband modes, so the feature map sees cleaner components and a narrowband signature lands in a single mode.

The linear projection of Sec. 27.4 applies one global matrix $\bW $, fitted across windows, to reduce the channels. An adaptive decomposition goes the other way: it lets each signal dictate, per window, a small set of constituent modes, each isolating a single oscillation, system dynamic, or fault signature. This expands one channel into several mode-channels rather than shrinking $C$; the univariate feature map of Ch. 25 is then applied mode-by-mode rather than channel-by-channel, which concentrates a narrowband fault signature in one mode instead of spreading it across the full-band channel record. Dimensionality is reduced only if a few informative modes are then kept and the rest discarded. These methods are common in vibration-based fault diagnosis and in biomedical signals. The four methods detailed below differ in what fixes the modes: the data alone (EMD), a prescribed mode budget (VMD), a linear dynamical model (DMD), or an eigenbasis of the signal’s own delayed copies (MSSA). They are illustrated throughout on a common test signal (Fig. 27.3), a slow tone plus a short high-frequency burst,

\begin{equation} \label {eq-decomp-testsignal} x(t) = \underbrace {\sin (2\pi f_1 t)}_{\text {slow tone}} \;+\; \underbrace {a\,\mathrm {e}^{-(t-t_0)^2/2\tau ^2}\,\sin (2\pi f_2 t)}_{\text {HF burst}} \;+\; \varepsilon (t), \end{equation}

with slow frequency $f_1 = 3$ Hz, burst frequency $f_2 = 18$ Hz, burst amplitude $a = 0.8$, center $t_0 = 1$ s, width $\tau = 0.18$ s, and a small noise term $\varepsilon (t)$. The decompositions react differently: EMD adapts the number of modes to the data and tends to spread the burst across the first two of them; VMD splits the signal into a prescribed number $K$ of band-limited modes that isolate the slow tone and the burst; DMD fits a handful of modes, each a single sinusoid with its own frequency and growth rate; and MSSA recovers each oscillation as a pair of data-adaptive components. Each subsection below defines a method, illustrates it on this signal, and explains how its number of modes is set.
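
The test signal of Eq. (27.11) can be generated as follows. The sampling rate, the 2 s duration, the noise standard deviation, and the function name `make_test_signal` are not specified in the text and are assumptions of this sketch.

```python
import numpy as np

def make_test_signal(fs=200.0, duration=2.0, f1=3.0, f2=18.0,
                     a=0.8, t0=1.0, tau=0.18, noise_std=0.05, seed=0):
    """Slow tone + Gaussian-windowed HF burst + white noise; fs, duration, noise_std are assumed."""
    t = np.arange(int(fs * duration)) / fs
    tone = np.sin(2 * np.pi * f1 * t)
    burst = a * np.exp(-(t - t0) ** 2 / (2 * tau ** 2)) * np.sin(2 * np.pi * f2 * t)
    noise = noise_std * np.random.default_rng(seed).standard_normal(t.size)
    return t, tone + burst + noise

t, x = make_test_signal()
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(x.size, d=1 / 200.0)
f_peak = freqs[np.argmax(spectrum)]
print(f_peak)                                             # dominant spectral line of the window
```

Over the full window the slow tone dominates the spectrum, while the burst is localized in time around $t_0$; separating the two is the task the decompositions below are compared on.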

27.5.1 Empirical mode decomposition (EMD)

EMD is fully data-driven and fixes no basis in advance. It expands the signal into intrinsic mode functions (IMFs): components that are oscillatory enough to have a meaningful instantaneous frequency. Concretely, an IMF must satisfy two conditions: (i) the number of local extrema and the number of zero-crossings are equal or differ by at most one (no riding waves), and (ii) at every instant the mean of the upper envelope (through the maxima) and the lower envelope (through the minima) is zero, so the waveform is locally symmetric about zero.

An IMF is extracted by sifting, the iterative inner loop of EMD. One sift spline-interpolates the local maxima into an upper envelope and the local minima into a lower envelope, and subtracts their mean from the signal; the result is sifted again, and again, until the two IMF conditions hold. That IMF is then subtracted from the signal and the whole sifting loop restarts on the residual to yield the next IMF, peeling components from highest to lowest frequency until only a monotonic trend remains:

\begin{equation} x[n] = \sum _{k=1}^{K_\text {EMD}} d_k[n] + r[n], \end{equation}

with IMFs $d_k$ and residual trend $r$.

Choosing $k$. There is nothing to choose: the count $K_\text {EMD}$ is decided by the data (and the sift stopping rule), and is of order $\log _2 L$ for a broadband window. The band occupied by IMF $k$ is signal-dependent, so $k$ carries no fixed physical meaning and can denote different content from one window or channel to the next. EMD is also prone to mode mixing, where one scale spreads across adjacent IMFs (the burst in Fig. 27.4) or one IMF mixes disparate scales; the ensemble variants EEMD and CEEMDAN average sifts over many added-noise realizations to suppress it.

27.5.2 Variational mode decomposition (VMD)

VMD splits the signal into a prescribed number $K$ of modes that add back to it,

\begin{equation} \label {eq-vmd-recon} x(t) = \sum _{k=1}^{K} u_k(t), \end{equation}

with each mode $u_k$ a narrowband oscillation concentrated around its own center frequency $\omega _k$. The modes are found jointly, not peeled off one at a time as in EMD, and their number $K$ is fixed in advance, which makes VMD markedly more robust to noise and to mode mixing. A bandwidth parameter $\alpha $ controls how tightly each mode is held to its band.

Choosing $k$. Here $K$ is a hyperparameter fixed in advance, the number of modes the user expects. Too small a $K$ under-bins the signal, forcing distinct components to share a mode; at $K=1$ in Fig. 27.5 the burst is lost into the dominant tone. Too large a $K$ over-splits it, so surplus modes latch onto noise (the extra modes at $K=4$). The matched $K=2$ isolates the slow tone and the burst. In practice $K$ is set from the number of narrowband components expected in the application, or by increasing it until the last mode carries only residual energy or two center frequencies $\omega _k$ collide. Because $K$ and the band structure are prescribed, the $\omega _k$ are stable across windows and channels, which is what makes VMD modes alignable where EMD modes are not.

27.5.3 Dynamic mode decomposition (DMD)

DMD models the signal as evolving under a single linear rule, $\bx _{n+1}\approx \bA \bx _n$, and returns spatio-temporal modes: each mode is a fixed spatial pattern $\boldsymbol {\phi }_k$ (the where, constant over time) that evolves as a single complex exponential $\lambda _k^{\,n}$ (the when). From its eigenvalue $\lambda _k$ each mode carries an explicit frequency and growth or decay rate,

\begin{equation} f_k = \frac {|\angle \lambda _k|}{2\pi \,\Delta t},\qquad \sigma _k = \frac {\ln |\lambda _k|}{\Delta t}, \end{equation}

with $\sigma _k<0$ a decaying transient, $\sigma _k>0$ a growing one, and $\sigma _k\approx 0$ a sustained oscillation. “Explicit growth rate” means $\sigma _k$ is a number the method reports for each mode, unlike the EMD and VMD modes, which are plain time series with no such tag. A single channel carries no spatial dimension, so it is first time-delay (Hankel) embedded into a stack of shifted copies before DMD is applied.

Choosing $k$. The number of modes is the rank $r$ at which the dynamics are truncated. Real signals give complex-conjugate eigenvalue pairs, so $r$ modes describe about $r/2$ oscillations, each with its own $(f_k,\sigma _k)$. Figure 27.6 sweeps $r$: too small ($r=2$) captures only the $3$ Hz tone and misses the burst; the matched $r=4$ recovers both the $3$ and $18$ Hz oscillations as sustained modes ($\sigma \approx 0$); too large ($r=8$) adds spurious high-frequency, fast-decaying modes that are then discarded. The rank is chosen by keeping the dominant singular values of the data.
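
A minimal sketch of rank-$r$ DMD on a Hankel-embedded scalar signal, assuming the standard SVD-based (exact DMD) estimate of the reduced operator. To keep the result exactly checkable, it is run on a noise-free sum of two sustained tones at $3$ and $18$ Hz rather than on the burst signal; the function name `hankel_dmd`, the number of delays, and the sampling rate are illustrative assumptions.

```python
import numpy as np

def hankel_dmd(x, dt, n_delays, r):
    """Rank-r DMD of a delay-embedded scalar signal; returns (f_k, sigma_k) for each mode."""
    L = len(x)
    H = np.array([x[i:L - n_delays + 1 + i] for i in range(n_delays)])  # Hankel stack
    X0, X1 = H[:, :-1], H[:, 1:]                          # snapshot pairs x_n -> x_{n+1}
    U, s, Vh = np.linalg.svd(X0, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]                    # truncate to rank r
    A_tilde = U.T @ X1 @ Vh.T / s                         # reduced operator U^T X1 V S^-1
    lam = np.linalg.eigvals(A_tilde)
    f = np.abs(np.angle(lam)) / (2 * np.pi * dt)          # mode frequency in Hz
    sigma = np.log(np.abs(lam)) / dt                      # growth (> 0) or decay (< 0) rate
    return f, sigma

fs = 200.0
t = np.arange(400) / fs
x = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 18 * t)

f_k, sigma_k = hankel_dmd(x, dt=1 / fs, n_delays=50, r=4)
print(np.sort(f_k), sigma_k)
```

The four modes form two complex-conjugate pairs, one per tone, with growth rates near zero, which is the $r$ modes for about $r/2$ oscillations pattern described above.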

27.5.4 Multichannel eigendecomposition (MSSA)

Like DMD, singular spectrum analysis (SSA) starts from the time-delay (Hankel) embedding of the signal: the length-$L$ record is cut into overlapping length-$L_w$ segments stacked as the columns of a trajectory matrix $\bX _\text {traj}\in \real ^{L_w\times (L-L_w+1)}$. SSA then eigendecomposes this matrix (an SVD of $\bX _\text {traj}$, equivalently an eigendecomposition of its covariance), giving a set of eigentriples, each a data-adaptive temporal pattern with a weight. Grouping the eigentriples and averaging each group back along the anti-diagonals reconstructs additive components,

\begin{equation} \label {eq-ssa-recon} x[n] = \sum _{g} c_g[n], \end{equation}

so the modes are an orthogonal basis learned from the signal’s own delayed copies rather than a fixed or dynamical one.

Eigentriple pairing. A pure oscillation does not map to one eigentriple but to a near-equal pair (its sine and cosine phases), so a clean tone occupies two adjacent components; a slow trend takes the leading triple, and broadband noise spreads over the long tail of small eigentriples. The paired plateaus in the scree plot of Fig. 27.7 make this structure visible and tell where the informative components end and the noise tail begins.

Multichannel (MSSA). Where this section’s EMD, VMD, and DMD act on one channel at a time, the multichannel form stacks the per-channel trajectory matrices into one block-Hankel matrix and eigendecomposes them jointly, yielding a common set of temporal components shared by all $C$ channels. Mode $k$ is therefore the same band in every channel by construction, an eigendecomposition route to the cross-channel alignment that MEMD and MVMD achieve, so MSSA needs no separate alignment step.

Choosing $k$. Two knobs set the result: the window length $L_w$ (the embedding dimension, which fixes the frequency resolution, finer as $L_w$ grows toward $L/2$) and the number of leading eigentriples kept and how they are grouped. Because oscillations come in pairs, the retained rank is about $2\times (\text {number of tones})$ plus one for a trend; the scree’s paired plateaus (Fig. 27.7) guide the cut, with the small-eigenvalue tail discarded as noise.
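
The single-channel SSA steps (embedding, SVD, anti-diagonal averaging) can be sketched as follows. The function name `ssa_components`, the window length $L_w=100$, and the noisy 3 Hz test tone are assumptions of this illustration; grouping of eigentriples is left to the caller.

```python
import numpy as np

def ssa_components(x, Lw):
    """Singular values and one reconstructed series per eigentriple of the trajectory matrix."""
    L = len(x)
    K = L - Lw + 1
    traj = np.array([x[i:i + K] for i in range(Lw)])      # (Lw, K), entry (i, j) = x[i + j]
    U, s, Vh = np.linalg.svd(traj, full_matrices=False)
    comps = np.empty((len(s), L))
    for k in range(len(s)):
        Xk = s[k] * np.outer(U[:, k], Vh[k])              # rank-one piece of the trajectory matrix
        # Average over each anti-diagonal i + j = n to map back to sample n.
        comps[k] = [np.mean(Xk[::-1].diagonal(n - (Lw - 1))) for n in range(L)]
    return s, comps

fs = 200.0
t = np.arange(400) / fs
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 3 * t) + 0.05 * rng.standard_normal(t.size)

s, comps = ssa_components(x, Lw=100)
tone = comps[0] + comps[1]                                # group the leading eigentriple pair
print(s[:4] / s[0])                                       # paired plateau, then the noise tail
```

Summing all elementary components returns the original window, and the two leading singular values form the near-equal pair expected for a single oscillation.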

27.5.5 Multivariate variants and mode alignment

The EMD and VMD above are univariate: run independently per channel, they break a property the cross-channel pipeline needs, mode alignment, i.e. the requirement that the $k$-th mode mean the same thing across channels before its features can be compared or fused. Without it, the mode count and the center frequency of mode $k$ drift from channel to channel. Their multivariate counterparts, multivariate EMD (MEMD) and multivariate VMD (MVMD), fix this by construction, extracting a common set of modes jointly across all $C$ channels so that mode $k$ occupies the same frequency band in every channel.

Example 27.2 (Aligning a resonance burst across axes): A 3-axis accelerometer ($C=3$) on a rotating machine records, per window, a slow imbalance tone plus a short high-frequency burst from an incipient bearing fault that excites the three axes with different amplitudes and arrival times. A fixed band split (per-channel Fourier or wavelet features) smears the burst across several bins, and per-channel EMD returns a different number of IMFs per axis, so “IMF 2” of the $x$-axis need not match “IMF 2” of the $y$-axis. Running MVMD with, say, $K=4$ common modes instead yields one burst-dominated mode shared by all three axes. The univariate map then produces aligned per-mode features, for example the log-energy $\log \mathrm {Var}[\cdot ]$ of the burst mode on each axis, that are directly comparable across channels and feed the cross-channel block of Sec. 27.3 (the same energy-of-a-latent-channel idea as the CSP features of Eq. (27.9)).

Figure 27.8 makes this alignment problem concrete on the three-axis signal: per-channel EMD returns a different number of IMFs per axis and the center frequency of mode $k$ drifts from axis to axis, whereas per-channel VMD with a common $K$ locks the modes onto shared frequency bands; MVMD enforces this equality exactly.

Possible data leakage

The linear projection $\bW $ (PCA, ICA, CSP) is fitted across samples and must be estimated on the training set only, then applied unchanged to the test set (Sec. 4.7). Fitting it on the combined train+test set leaks information and inflates accuracy. The decompositions above (EMD, VMD, DMD, MSSA) are computed independently per window and carry no such cross-sample fit.

27.6 Cross-channel model transfer and linear alignment

Goal: The per-channel pipelines of Sec. 27.2 require training $C$ separate models. Test whether a classifier fitted on one source channel can be applied to the remaining target channels without retraining.

27.6.1 Setup: from vectors to channels

In vector terms, a fitted classifier is a fixed function on the feature space, and applying it to samples drawn from a shifted distribution is exactly the dataset-shift situation of Sec. 13.2.2. The multivariate signal restates this generic vector setting with two additional structural properties.

Let channel $s$ be the source and channel $t$ a target. Applying the same per-channel feature map $f:\real ^L\to \real ^N$ (Sec. 27.2) to the $M$ training windows of each channel gives two training feature matrices

\begin{equation} \bZ _s,\ \bZ _t\in \real ^{M\times N}, \end{equation}

with the following properties:

• Shared feature space. The same feature map is applied to every channel, so the columns of $\bZ _s$ and $\bZ _t$ have identical dimension and meaning. Generic multimodal sources (e.g. EEG combined with EMG, Ch. 15) lack this property.
• Paired rows. The channels are synchronously sampled (chapter preface), so row $i$ of $\bZ _s$ and row $i$ of $\bZ _t$ describe the same window of the same physical event. The two matrices are paired row by row, and the pairing requires no labels. Generic tabular domains rarely offer this.

The transfer experiment: train the full pipeline (feature extraction, normalization, classifier) on the training windows of the source channel and apply it, unchanged, to the test windows of every channel, with the same grouped train/test split (Sec. 4.5) used for all channels. Repeating this with each channel in turn as the source and collecting the accuracies yields a $C\times C$ transfer matrix. Its diagonal holds each channel’s native within-channel accuracy, and each off-diagonal entry is directly comparable to the diagonal entry of its target channel, because both are measured on the same test windows.
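The transfer experiment can be sketched as follows. The nearest-centroid classifier is only a stand-in for the full pipeline, and the function names, array layout `(C, M, N)`, and synthetic two-channel data (the second channel sees the first with half the gain and an offset) are assumptions for illustration. The train/test split is assumed to have been made already.

```python
import numpy as np

def fit_nearest_centroid(Z, y):
    """Stand-in pipeline: standardize, then store one centroid per class."""
    mu, sigma = Z.mean(axis=0), Z.std(axis=0)
    Zn = (Z - mu) / sigma
    classes = np.unique(y)
    centroids = np.stack([Zn[y == c].mean(axis=0) for c in classes])
    return mu, sigma, classes, centroids

def predict(model, Z):
    """Apply the stored normalization and centroids without refitting."""
    mu, sigma, classes, centroids = model
    Zn = (Z - mu) / sigma
    dist = ((Zn[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def transfer_matrix(Z_train, y_train, Z_test, y_test):
    """Entry [s, t]: model trained on channel s, applied unchanged to channel t."""
    C = Z_train.shape[0]
    acc = np.zeros((C, C))
    for s in range(C):
        model = fit_nearest_centroid(Z_train[s], y_train)
        for t in range(C):
            acc[s, t] = np.mean(predict(model, Z_test[t]) == y_test)
    return acc

rng = np.random.default_rng(0)

def make_channels(M):
    """Two synthetic channels observing the same events (assumption)."""
    y = np.arange(M) % 2
    latent = (4.0 * y - 2.0)[:, None] + 0.2 * rng.normal(size=(M, 1))
    return np.stack([latent, 0.5 * latent + 5.0]), y

Z_train, y_train = make_channels(200)
Z_test, y_test = make_channels(100)
T = transfer_matrix(Z_train, y_train, Z_test, y_test)
print(T)  # rows: source channel, columns: target channel
```

In this toy setting both channels are equally accurate with their own model, yet the off-diagonal entries fall to chance.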

27.6.2 Why direct transfer fails

Each channel observes the same underlying event through its own measurement chain. In feature space this appears as a differently calibrated coordinate system: per-feature gain and offset (sensor sensitivity, distance to the event) and feature mixing (frequency response, mechanical or electrical coupling). The decision boundary learned in the source coordinates then cuts the target clouds in the wrong place, even when the target channel, with its own trained model, is just as accurate as the source.

Example 27.3 (Gain and offset mismatch): A single feature $z$ separates two classes on the source channel: class A training values $\{-2,-1\}$, class B training values $\{1,2\}$, and the fitted rule predicts B when $z>0$. The target channel observes the same four events with half the gain and an offset of $+2$:
$\seteqnumber{0}{}{16}$
\begin{equation*} z' = \tfrac {1}{2}z + 2\quad \Rightarrow \quad \text {A}\to \{1,\ 1.5\},\qquad \text {B}\to \{2.5,\ 3\}. \end{equation*}

All four target values are positive, so the source rule predicts B everywhere: accuracy drops from $100\%$ to $50\%$, chance level, although the target values separate the classes perfectly.
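The numbers of Example 27.3 can be checked in a few lines. The variable names are illustrative, and the native target threshold $z'>2$ is an assumption chosen for the check (any threshold between $1.5$ and $2.5$ would do).

```python
import numpy as np

# Source-channel training values and labels from Example 27.3.
z_source = np.array([-2.0, -1.0, 1.0, 2.0])
labels = np.array(["A", "A", "B", "B"])

# Target channel: same events, half the gain, offset +2.
z_target = 0.5 * z_source + 2.0

def source_rule(z):
    """Rule fitted on the source channel: predict B when z > 0."""
    return np.where(z > 0, "B", "A")

acc_source = np.mean(source_rule(z_source) == labels)
acc_direct = np.mean(source_rule(z_target) == labels)

# A threshold placed in the target's own coordinates (assumed value 2).
acc_native = np.mean(np.where(z_target > 2.0, "B", "A") == labels)

print(z_target, acc_source, acc_direct, acc_native)
```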

Equivalent sensors, inconsistent domains

Channels can be equivalent as sensors (each reaches the same accuracy with its own model) and still be inconsistent as domains: the classifier two-sample test of Sec. 13.2.2 easily distinguishes their feature distributions. Direct transfer failure is a dataset-shift event, not evidence that the target channel is less informative.

27.6.3 Linear alignment

An alignment is a map, learned on training data only, that transports target features into the source coordinate system before the source classifier reads them. The class labels are never used. The three standard linear alignments below are vector tools presented earlier, repurposed here; they are ordered by the statistics they match, and therefore by the distortions they can undo.

Moment matching

Each feature is standardized with the target’s training statistics and de-standardized with the source’s, i.e. the standardization of Sec. 4.7 applied twice:

\begin{equation} \label {eq-moment-matching} z'_j = \frac {z_j-\mu _{t,j}}{\sigma _{t,j}}\,\sigma _{s,j}+\mu _{s,j},\qquad j=1,\dots ,N, \end{equation}

where $(\mu _{t,j},\sigma _{t,j})$ and $(\mu _{s,j},\sigma _{s,j})$ are the per-feature training means and standard deviations of the target and source channels. Moment matching undoes any per-feature offset and positive gain, as in Example 27.3, where (27.17) recovers the source values exactly. It cannot undo a sign flip, because standard deviations are non-negative, and it cannot undo feature mixing, since each feature is corrected in isolation.
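A minimal sketch of Eq. (27.17), checked on the values of Example 27.3. The function names and the guard against zero-variance features are assumptions added for the example.

```python
import numpy as np

def fit_moment_matching(Z_s_train, Z_t_train):
    """Per-feature training means and standard deviations of both channels."""
    mu_s, mu_t = Z_s_train.mean(axis=0), Z_t_train.mean(axis=0)
    sd_s, sd_t = Z_s_train.std(axis=0), Z_t_train.std(axis=0)
    # Guard (assumption): leave constant target features unscaled.
    sd_t = np.where(sd_t > 0, sd_t, 1.0)
    return mu_s, sd_s, mu_t, sd_t

def apply_moment_matching(Z_t, params):
    """Standardize with target statistics, de-standardize with source ones."""
    mu_s, sd_s, mu_t, sd_t = params
    return (Z_t - mu_t) / sd_t * sd_s + mu_s

# Example 27.3: one feature, four paired training windows.
z_s = np.array([[-2.0], [-1.0], [1.0], [2.0]])
z_t = 0.5 * z_s + 2.0

params = fit_moment_matching(z_s, z_t)
z_aligned = apply_moment_matching(z_t, params)
print(z_aligned.ravel())  # matches the source values up to rounding
```

The pairing of the rows is not used here: only the four summary statistics per feature enter the map.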

Covariance alignment (CORAL)

Correlation alignment (CORAL) matches the full second-order structure: the target features are whitened with the target training covariance and re-colored with the source covariance,

\begin{equation} \label {eq-coral} \bz ' = \boldsymbol {\Sigma }_s^{1/2}\,\boldsymbol {\Sigma }_t^{-1/2}\left (\bz -\bmu _t\right )+\bmu _s, \end{equation}

where $\boldsymbol {\Sigma }_s,\boldsymbol {\Sigma }_t\in \real ^{N\times N}$ are the training covariance matrices (the $\tau =0$ object of Sec. 27.3.1) and the matrix square roots are computed from their eigendecompositions, as in PCA (Sec. 11.2).
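Eq. (27.18) can be sketched with matrix powers computed from eigendecompositions. The function names, the eigenvalue floor `eps` (a safeguard for near-singular covariances), and the synthetic data are assumptions for illustration.

```python
import numpy as np

def sym_power(S, p, eps=1e-10):
    """Power of a symmetric positive semi-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    w = np.clip(w, eps, None)      # floor tiny eigenvalues (assumption)
    return (V * w**p) @ V.T        # V diag(w^p) V^T

def fit_coral(Z_s_train, Z_t_train):
    """Whitening-recoloring matrix and means from training data only."""
    mu_s, mu_t = Z_s_train.mean(axis=0), Z_t_train.mean(axis=0)
    S_s = np.cov(Z_s_train, rowvar=False)
    S_t = np.cov(Z_t_train, rowvar=False)
    A = sym_power(S_s, 0.5) @ sym_power(S_t, -0.5)
    return A, mu_s, mu_t

def apply_coral(Z_t, A, mu_s, mu_t):
    """Row-wise form of z' = A (z - mu_t) + mu_s."""
    return (Z_t - mu_t) @ A.T + mu_s

# Synthetic check (assumption): the target is a mixed, shifted copy of the source.
rng = np.random.default_rng(0)
Z_s = rng.normal(size=(500, 3)) @ (np.eye(3) + 0.3 * rng.normal(size=(3, 3)))
B = np.eye(3) + 0.3 * rng.normal(size=(3, 3))
Z_t = Z_s @ B + np.array([1.0, -2.0, 0.5])

A, mu_s, mu_t = fit_coral(Z_s, Z_t)
Z_aligned = apply_coral(Z_t, A, mu_s, mu_t)
```

After alignment the training mean and covariance of the target equal those of the source, but the individual rows of `Z_aligned` need not equal the rows of `Z_s`: only the second-order statistics are matched.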

Distribution matching is blind to rotations

Any transform that preserves the feature distribution is invisible to an alignment that only matches that distribution. A rotation or reflection of the feature space can leave all means and covariances unchanged, so neither moment matching (27.17) nor CORAL (27.18) can detect it, let alone undo it. Worse, the re-coloring in (27.18) is only determined up to such a rotation, and the arbitrary choice may rotate exactly the directions the source classifier relies on; CORAL can then underperform plain moment matching.

Paired linear map

The strongest linear alignment uses the property the previous two ignore: the rows of $\bZ _s$ and $\bZ _t$ are paired. Center each matrix with its own training means, $\tilde {\bZ }_s = \bZ _s - \bOne _M\bmu _s^\top $ and $\tilde {\bZ }_t = \bZ _t - \bOne _M\bmu _t^\top $, and fit a regularized linear map from the target space to the source space,

\begin{equation} \label {eq-paired-map} \bW = \argmin _{\bW \in \real ^{N\times N}}\ \norm {\tilde {\bZ }_s - \tilde {\bZ }_t\bW }_F^2 + \lambda \norm {\bW }_F^2 = \left (\tilde {\bZ }_t^\top \tilde {\bZ }_t+\lambda \bI \right )^{-1}\tilde {\bZ }_t^\top \tilde {\bZ }_s, \end{equation}

where $\norm {\cdot }_F$ is the Frobenius norm (the square root of the sum of squared entries). At test time the alignment is $\bz ' = \bW ^\top (\bz -\bmu _t)+\bmu _s$. Eq. (27.19) is ridge regression (Sec. 4.9) with $N$ outputs, in which the source features play the role of the regression targets; no class labels appear anywhere. Distribution matching can never decide which target direction corresponds to which source direction, but the row pairing resolves the correspondence by construction, so the paired map recovers arbitrary invertible linear mixing, including the rotations that defeat moment matching and CORAL.
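A minimal sketch of Eq. (27.19) and its test-time use. The function names, the default `lam`, and the synthetic target (a rotated, scaled, and shifted copy of the source) are assumptions for the example.

```python
import numpy as np

def fit_paired_map(Z_s_train, Z_t_train, lam=1e-3):
    """Ridge fit from centered target features to centered source features."""
    mu_s, mu_t = Z_s_train.mean(axis=0), Z_t_train.mean(axis=0)
    Zs_c, Zt_c = Z_s_train - mu_s, Z_t_train - mu_t
    N = Zt_c.shape[1]
    # Solve (Zt^T Zt + lam I) W = Zt^T Zs instead of forming the inverse.
    W = np.linalg.solve(Zt_c.T @ Zt_c + lam * np.eye(N), Zt_c.T @ Zs_c)
    return W, mu_s, mu_t

def apply_paired_map(Z_t, W, mu_s, mu_t):
    """Row-wise form of z' = W^T (z - mu_t) + mu_s."""
    return (Z_t - mu_t) @ W + mu_s

# Synthetic paired windows (assumption): rotation by 40 degrees, gains, offset.
rng = np.random.default_rng(0)
Z_s = rng.normal(size=(200, 2))
theta = np.deg2rad(40.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = R @ np.diag([0.5, 2.0])
Z_t = Z_s @ B + np.array([2.0, -1.0])

W, mu_s, mu_t = fit_paired_map(Z_s, Z_t, lam=1e-8)
Z_aligned = apply_paired_map(Z_t, W, mu_s, mu_t)
```

With a very small `lam` and noise-free pairs, `W` approaches the inverse of the mixing matrix `B`; a larger `lam` shrinks `W` toward zero, which matters when the paired windows are few or noisy.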

Example 27.4 (Feature swap): Four paired training windows are recorded by both channels. On the source channel the class is carried by $z_1$ alone:
$\seteqnumber{0}{}{19}$
\begin{equation*} \bZ _s = \begin{pmatrix} -1 & 1\\ -1 & -1\\ 1 & 1\\ 1 & -1 \end {pmatrix} \quad \begin{matrix} \text {A}\\ \text {A}\\ \text {B}\\ \text {B} \end {matrix} \qquad \qquad \bZ _t = \begin{pmatrix} 1 & -1\\ -1 & -1\\ 1 & 1\\ -1 & 1 \end {pmatrix}, \end{equation*}

with the fitted source rule: predict B when $z_1>0$. The target channel swaps the two features, $\bz ' = (z_2, z_1)$. Compare the three alignments.

Solution: Direct transfer. The target’s first feature equals the source $z_2$, which is independent of the class: accuracy $50\%$.

Moment matching. In both channels every feature has training mean $0$ and standard deviation $1$, so (27.17) is the identity map: still $50\%$.

CORAL. Both training covariances equal $\bI $ (the features are uncorrelated with unit variance), so (27.18) is also the identity map: still $50\%$. The swap preserves the feature distribution, so no distribution-matching alignment can see it.

Paired map. The means are zero, so centering changes nothing, and
$\seteqnumber{0}{}{19}$
\begin{equation*} \bZ _t^\top \bZ _t = 4\,\bI , \qquad \bZ _t^\top \bZ _s = 4\begin{pmatrix} 0 & 1\\ 1 & 0\end {pmatrix} = 4\,\bP , \end{equation*}

where $\bP $ is the swap (permutation) matrix. With $\lambda \to 0$, Eq. (27.19) gives $\bW = \bP $ exactly; with $\lambda =1$, $\bW = \frac {4}{5}\,\bP $, which shrinks the features but produces the same decisions. The pairing identifies which target column corresponds to which source column, the swap is undone, and the accuracy returns to $100\%$.
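The four results of Example 27.4 can be reproduced numerically. This is a sketch: the variable names are illustrative, population statistics (division by $M$) are assumed for the standard deviations and covariances, and classes A and B are coded as 0 and 1.

```python
import numpy as np

Z_s = np.array([[-1, 1], [-1, -1], [1, 1], [1, -1]], dtype=float)
y = np.array([0, 0, 1, 1])            # 0 = class A, 1 = class B
Z_t = Z_s[:, ::-1].copy()             # the target swaps the two features
P = np.array([[0.0, 1.0], [1.0, 0.0]])

def accuracy(Z):
    """Source rule: predict B when the first feature is positive."""
    return np.mean((Z[:, 0] > 0).astype(int) == y)

def sym_power(S, p):
    """Power of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w**p) @ V.T

acc_direct = accuracy(Z_t)

# Moment matching, Eq. (27.17): identity here (means 0, standard deviations 1).
Z_mm = (Z_t - Z_t.mean(0)) / Z_t.std(0) * Z_s.std(0) + Z_s.mean(0)
acc_mm = accuracy(Z_mm)

# CORAL, Eq. (27.18): both covariances are I, so the map is again the identity.
S_s = np.cov(Z_s, rowvar=False, bias=True)
S_t = np.cov(Z_t, rowvar=False, bias=True)
A = sym_power(S_s, 0.5) @ sym_power(S_t, -0.5)
acc_coral = accuracy((Z_t - Z_t.mean(0)) @ A.T + Z_s.mean(0))

# Paired map, Eq. (27.19), for lambda = 0 and lambda = 1 (means are zero).
W0 = np.linalg.solve(Z_t.T @ Z_t, Z_t.T @ Z_s)
W1 = np.linalg.solve(Z_t.T @ Z_t + np.eye(2), Z_t.T @ Z_s)
acc_paired0 = accuracy(Z_t @ W0)
acc_paired1 = accuracy(Z_t @ W1)

print(acc_direct, acc_mm, acc_coral, acc_paired0, acc_paired1)
```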

Figure 27.9 illustrates the same hierarchy on Gaussian feature clouds whose target version is rotated, scaled, and shifted: direct transfer collapses to chance, moment matching recovers the per-feature scale and offset but not the rotation, and the paired map restores essentially the native accuracy.

27.6.4 Summary and practice

Method	Statistics matched	Paired rows	Distortions undone
Moment matching (27.17)	per-feature mean and std	not needed	per-feature gain and offset
CORAL (27.18)	mean and full covariance	not needed	correlation structure, up to a rotation ambiguity
Paired map (27.19)	row correspondence (ridge fit)	required	any invertible linear mixing

Table 27.1: Linear alignment methods for cross-channel model transfer.

Alignment is fitted, so it can leak

All alignment parameters ($\bmu $, $\sigma _j$, $\boldsymbol {\Sigma }$, $\bW $) are estimated from data. Like normalization (Sec. 4.7) and PCA, they must be estimated on the training partition only and applied unchanged to the test partition; estimating them on the full dataset inflates the transfer accuracy.

Calibration transfer

The paired map gives a practical recipe for commissioning an additional channel: record paired, unlabeled data with the already-deployed channel, fit (27.19), and reuse the existing classifier, instead of collecting a labeled training set per channel. Native per-channel training remains the accuracy ceiling, so alignment is a deployment shortcut, not a replacement for it.

Beyond signals

Nothing in (27.17)–(27.19) is signal-specific: the methods apply to any tabular problem in which source and target share one feature space, such as the same instrument type at two sites or two scanners measuring the same specimens. The multichannel signal setting is special only in that synchronous sampling supplies the paired rows for free.

Whether an aligned transfer still falls short of the target channel’s native model is a paired comparison of two classifiers on the same test windows, assessed with the tools of Ch. 14, in particular McNemar’s test (Sec. 14.1.4).
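The paired comparison can be sketched with the exact (binomial) form of McNemar's test, which may differ from the variant presented in Sec. 14.1.4. The function name and the 20 test outcomes below are made up for illustration and are not results from the chapter.

```python
from math import comb
import numpy as np

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on per-window correctness of two classifiers."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n01 = int(np.sum(a & ~b))     # a right, b wrong
    n10 = int(np.sum(~a & b))     # a wrong, b right
    n = n01 + n10                 # only discordant windows carry information
    if n == 0:
        return n01, n10, 1.0
    k = min(n01, n10)
    # Binomial(n, 1/2) tail probability, doubled for a two-sided test.
    p = 2.0 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return n01, n10, min(1.0, p)

# Hypothetical outcomes on the same 20 test windows.
native = np.ones(20, dtype=bool)
native[19] = False                # native model wrong on one window
aligned = np.ones(20, dtype=bool)
aligned[:8] = False               # aligned transfer wrong on eight windows

n01, n10, p_value = mcnemar_exact(native, aligned)
print(n01, n10, p_value)
```

Windows on which both classifiers agree, whether both right or both wrong, do not enter the test, so the two overall accuracies alone are not sufficient to compute it.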