Machine Learning & Signals Learning
27 Multivariate Signal Classification
27.1 Preface
A multivariate signal is a collection of \(C\) synchronously sampled channels, organized as a matrix
\(\seteqnumber{0}{}{0}\)\begin{equation} \bX = \begin{bmatrix} \bx _1^\top \\ \bx _2^\top \\ \vdots \\ \bx _C^\top \end {bmatrix} \in \real ^{C\times L}, \qquad \bx _c\in \real ^L, \end{equation}
where row \(c\) is the \(L\)-sample record of channel \(c\) and column \(n\) is the instantaneous snapshot across all channels at sample \(n\) (Fig. 27.1).
Typical sources:
• 3-axis IMU: accelerometer or gyroscope with \(C=3\) orthogonal axes.
• Multi-lead ECG (\(C=2,3,12\)) and multi-channel EEG (\(C=8\)–\(256\)).
• Microphone arrays for speech or acoustic event detection.
Applicability note
This chapter does not explicitly cover multi-rate sensor measurements, which arise for example in multi-sensor industrial monitoring (vibration, temperature, and current measured on the same machine).
Applying the univariate pipeline channel-by-channel is a valid baseline, but it discards the inter-channel structure (correlation, relative phase, shared latent sources), which often carries the discriminative information (e.g. an IMU gesture is defined by how the three axes move together, not by any one axis alone).
The multivariate pipeline reuses every stage of the univariate pipeline (Ch. 25) except feature extraction, which now produces two concatenated blocks:
• Per-channel features: the univariate \(f:\real ^L\to \real ^N\) applied independently to each of the \(C\) channels, yielding \(C\cdot N\) features.
• Cross-channel features (Sec. 27.2): values that quantify inter-channel structure, yielding \(N_\text {cc}\) features.
Stacking gives a row of \(N' = C\cdot N + N_\text {cc}\) features (per window).
Channel-wise feature extraction
For a signal \(\bX =[\bx _1,\dots ,\bx _C]^\top \) and a univariate feature map \(f:\real ^L\to \real ^N\) (time, frequency, or wavelet features as in Ch. 25), define
\(\seteqnumber{0}{}{1}\)\begin{equation} \mathbf {z}_\text {pc}(\bX ) = \bigl [\,f(\bx _1)^\top ,\;f(\bx _2)^\top ,\;\dots ,\;f(\bx _C)^\top \,\bigr ]^\top \in \real ^{CN}. \end{equation}
This is the trivial extension of the univariate pipeline and a reasonable baseline. Its weakness is that every inter-channel statistic is discarded: two windows that differ only in how the channels move relative to each other produce identical \(\mathbf {z}_\text {pc}\).
Example 27.1: On a 3-axis accelerometer (\(C=3\)) with \(N=10\) univariate features per channel, \(\mathbf {z}_\text {pc}\in \real ^{30}\). Walking and waving can produce near-identical per-axis statistics (mean, variance, dominant frequency) while differing sharply in axis-to-axis correlation, which this baseline cannot see.
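The blind spot of the per-channel baseline can be illustrated with a minimal NumPy sketch. The toy feature map `univariate_features` (mean, variance, dominant frequency bin) and the synthetic sinusoidal "axes" are assumptions made for illustration, not the feature set of Ch. 25.

```python
import numpy as np

def univariate_features(x):
    """Toy feature map f: R^L -> R^3 (mean, variance, dominant frequency bin)."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    return np.array([x.mean(), x.var(), float(np.argmax(spectrum))])

def per_channel_features(X, f=univariate_features):
    """Apply f to each row of X (shape C x L) and concatenate the results."""
    return np.concatenate([f(x_c) for x_c in X])

L = 256
n = np.arange(L)
base = np.sin(2 * np.pi * 8 * n / L)               # 8 full cycles per window

X_a = np.stack([base, base, base])                 # all three axes in phase
X_b = np.stack([base, np.roll(base, 16), base])    # second axis shifted by half a period

z_a = per_channel_features(X_a)                    # vector of length C * N = 9
z_b = per_channel_features(X_b)

# Zero-lag correlation between the first two axes of each window.
corr_a = np.corrcoef(X_a)[0, 1]
corr_b = np.corrcoef(X_b)[0, 1]

print(np.allclose(z_a, z_b), corr_a, corr_b)
```

Both windows yield the same per-channel feature vector, while the correlation between the first two axes flips sign; only a cross-channel feature separates them.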
Time synchronization
Cross-channel features are only meaningful if all channels share a common time base. Random jitter or misalignment between channels can effectively destroy some of these features.
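As a rough illustration of this sensitivity, the sketch below measures the zero-lag correlation between a sinusoid and misaligned copies of itself. It assumes a constant offset of a few samples rather than random jitter, and a signal with a period of 8 samples; both choices are made only to keep the example deterministic.

```python
import numpy as np

L = 256
n = np.arange(L)
x = np.sin(2 * np.pi * 32 * n / L)      # 32 cycles per window, period of 8 samples

def zero_lag_corr(a, b):
    """Pearson correlation of two equally long records at lag 0."""
    return float(np.corrcoef(a, b)[0, 1])

aligned = zero_lag_corr(x, x)                  # perfectly synchronized channels
offset_1 = zero_lag_corr(x, np.roll(x, 1))     # 1-sample misalignment
offset_2 = zero_lag_corr(x, np.roll(x, 2))     # 2-sample (quarter-period) misalignment

print(aligned, offset_1, offset_2)
```

For this signal the correlation follows the cosine of the phase offset, so a misalignment of only two samples is enough to drive the zero-lag correlation to approximately zero.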
27.2 Cross-channel features
27.2.1 Multivariate ACF
Viewing each channel as a signal, the auto-correlation (Sec. 20.1, Eq. (20.11)) and the cross-correlation (Sec. 21.1, Eq. (21.6)) merge into a single matrix-valued, lag-indexed object:
\(\seteqnumber{0}{}{2}\)\begin{equation} \label {eq:multi-acf} \begin{aligned} \bR _{\bX \bX }[\tau ]&\in \real ^{C\times C},\\ \bigl (\bR _{\bX \bX }[\tau ]\bigr )_{ij} &= R_{ij}[\tau ] = \sum _{n}\bigl (x_i[n]-\bar {x}_i\bigr )\bigl (x_j[n+\tau ]-\bar {x}_j\bigr ). \end {aligned} \end{equation}
The following special cases recover the constructs of Part II:
• \(i=j\): the auto-correlation \(R_{\bxx }[\tau ]\) of channel \(i\) (Sec. 20.1).
• \(i\ne j\): the cross-correlation between channels \(i\) and \(j\) (Sec. 21.1).
• \(\tau =0\), biased normalization (Eq. (20.14)): the covariance matrix used below.
The biased, unbiased, and normalized variants of Sec. 20.1.2 and Sec. 21.1 apply entry-wise to \(\bR _{\bX \bX }[\tau ]\).
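A minimal NumPy sketch of the raw (unnormalized) matrix \(\bR _{\bX \bX }[\tau ]\) follows. The function name `multivariate_acf` and the convention of summing only over samples for which both indices lie inside the window are assumptions of this sketch.

```python
import numpy as np

def multivariate_acf(X, tau):
    """Raw C x C matrix R_XX[tau] for a window X of shape (C, L).

    Entry (i, j) sums (x_i[n] - mean_i) * (x_j[n + tau] - mean_j) over all n
    for which both n and n + tau lie inside the window.
    """
    C, L = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)     # remove per-channel means
    if abs(tau) >= L:
        return np.zeros((C, C))                # no overlapping samples
    if tau >= 0:
        return Xc[:, :L - tau] @ Xc[:, tau:].T
    return Xc[:, -tau:] @ Xc[:, :L + tau].T

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200))

R0 = multivariate_acf(X, 0)
R5 = multivariate_acf(X, 5)
Rm5 = multivariate_acf(X, -5)

# tau = 0 with 1/L normalization is the biased covariance matrix.
print(np.allclose(R0 / X.shape[1], np.cov(X, bias=True)))
# Negative lags follow from positive lags by transposition.
print(np.allclose(Rm5, R5.T))
```

The second check reflects the identity \(\bR _{\bX \bX }[-\tau ]=\bR _{\bX \bX }[\tau ]^\top \), so only non-negative lags need to be computed in practice.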
Covariance and correlation
At \(\tau =0\), the biased form of the multivariate ACF (Sec. 27.2.1, Eq. (20.14)) is the covariance matrix
\(\seteqnumber{0}{}{3}\)\begin{equation} \boldsymbol {\Sigma }=\frac {1}{L}\bR _{\bX \bX }[0] \in \real ^{C\times C},\qquad \boldsymbol {\Sigma }_{ij} = \Cov [\bx _i,\bx _j] = R_{ij,\text {biased}}[0], \end{equation}
or its normalized counterpart, the Pearson correlation matrix
\(\seteqnumber{0}{}{4}\)\begin{equation} \bR _{ij} = \frac {\boldsymbol {\Sigma }_{ij}}{\sqrt {\boldsymbol {\Sigma }_{ii}\,\boldsymbol {\Sigma }_{jj}}} \in [-1,1]. \end{equation}
\(\bR \) is symmetric with unit diagonal, so the \(C(C-1)/2\) strictly upper-triangular entries are used as features (Fig. 27.2). For \(C=3\) this adds 3 features; for \(C=16\), 120.
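The correlation features can be sketched in a few lines of NumPy. The helper name `correlation_features` is hypothetical, and the sketch assumes that no channel is constant within the window (otherwise the normalization divides by zero).

```python
import numpy as np

def correlation_features(X):
    """Strictly upper-triangular Pearson correlations of a window X (C x L)."""
    C, L = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    Sigma = (Xc @ Xc.T) / L                  # biased covariance matrix
    d = np.sqrt(np.diag(Sigma))              # per-channel standard deviations
    R = Sigma / np.outer(d, d)               # Pearson correlation matrix
    iu = np.triu_indices(C, k=1)             # index pairs with i < j
    return R[iu]

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 300))
feats = correlation_features(X)
print(feats.shape)                           # C(C-1)/2 entries for C = 16
```

The result agrees with `np.corrcoef(X)` restricted to the same index pairs; the explicit version is shown only to mirror the two equations above.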
Lagged cross-correlation
For \(\tau \ne 0\), the off-diagonal entries of the multivariate ACF \(\bR _{\bX \bX }[\tau ]\) (Eq. (27.3)) encode delayed coupling between channels and the timing of that coupling. Writing \(r_{ij}[\tau ]\triangleq (\bR _{\bX \bX }[\tau ])_{ij}\) for the \((i,j)\)-entry, two scalar features per channel pair are typical:
• the peak value \(\max \limits _{|\tau |\le \tau _\text {max}} r_{ij}[\tau ]\), \(i\ne j\), and
• the argmax lag \(\argmax \limits _{|\tau |\le \tau _\text {max}} r_{ij}[\tau ]\), \(i\ne j\).
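The two lag features can be sketched as follows. As an assumption of this sketch, \(r_{ij}[\tau ]\) is divided by \(\sqrt {R_{ii}[0]R_{jj}[0]}\) so that the peak value is comparable across channel pairs; the function name `lagged_xcorr_features` and the test signal (a white-noise channel and a copy delayed by 7 samples) are likewise illustrative.

```python
import numpy as np

def lagged_xcorr_features(x_i, x_j, tau_max):
    """Peak value and argmax lag of the normalized cross-correlation r_ij[tau]."""
    a = x_i - x_i.mean()
    b = x_j - x_j.mean()
    L = len(a)
    norm = np.sqrt(np.sum(a**2) * np.sum(b**2))
    lags = np.arange(-tau_max, tau_max + 1)      # requires tau_max < L
    r = np.empty(len(lags))
    for idx, tau in enumerate(lags):
        if tau >= 0:
            r[idx] = np.sum(a[:L - tau] * b[tau:])     # x_i[n] * x_j[n + tau]
        else:
            r[idx] = np.sum(a[-tau:] * b[:L + tau])
    r /= norm
    k = int(np.argmax(r))
    return float(r[k]), int(lags[k])

rng = np.random.default_rng(0)
x_i = rng.standard_normal(500)
x_j = np.roll(x_i, 7)                # channel j lags channel i by 7 samples

peak, lag = lagged_xcorr_features(x_i, x_j, tau_max=20)
print(peak, lag)
```

With the sign convention of Eq. (27.3), a positive argmax lag means that channel \(j\) follows channel \(i\).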
27.2.2 VAR coefficients
Fitting a vector AR (VAR(\(p\)), Sec. 21.9) model to the \(C\)-channel window estimates the coefficient matrices \(\bA _1,\ldots ,\bA _p\in \real ^{C\times C}\). Their off-diagonal entries \((\bA _m)_{ij}\), \(i\ne j\), quantify how the past of channel \(j\) predicts the present of channel \(i\): a directed, lagged, conditional measure of coupling\(^{1}\), in contrast to the symmetric, undirected correlation of Sec. 27.2.1. The \(p\,C(C-1)\) off-diagonal entries across all lags are stacked as features; the \(pC\) diagonal entries are per-channel self-prediction and may be assigned to the per-channel block instead.
\(^{1}\) The significance of these off-diagonal coefficients is tested by Granger causality (Sec. 22.5).
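One possible way to obtain the coefficient matrices is an ordinary least-squares fit, sketched below with NumPy. The estimator choice, the function names `fit_var` and `var_cross_features`, and the simulated two-channel system are assumptions of this sketch; Sec. 21.9 may use a different estimator.

```python
import numpy as np

def fit_var(X, p):
    """Least-squares VAR(p) fit of a window X (C x L); returns A of shape (p, C, C)."""
    C, L = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    Y = Xc[:, p:]                                     # targets x[n], n = p..L-1
    # Stack the lagged regressors x[n-1], ..., x[n-p] into a (p*C, L-p) matrix.
    Z = np.vstack([Xc[:, p - m:L - m] for m in range(1, p + 1)])
    # Solve Y ~ B Z with B = [A_1 ... A_p] via Z.T B.T ~ Y.T.
    B_T, *_ = np.linalg.lstsq(Z.T, Y.T, rcond=None)
    return B_T.T.reshape(C, p, C).transpose(1, 0, 2)

def var_cross_features(A):
    """Stack the p*C*(C-1) off-diagonal coefficients across all lags."""
    C = A.shape[1]
    off_diagonal = ~np.eye(C, dtype=bool)
    return A[:, off_diagonal].ravel()

# Simulated VAR(1): channel 2 drives channel 1, but not the other way round.
rng = np.random.default_rng(1)
A_true = np.array([[0.5, 0.3],
                   [0.0, 0.4]])
L = 5000
X = np.zeros((2, L))
for n in range(1, L):
    X[:, n] = A_true @ X[:, n - 1] + rng.standard_normal(2)

A_hat = fit_var(X, p=1)
features = var_cross_features(A_hat)      # [(A_1)_12, (A_1)_21]
print(np.round(A_hat[0], 2), features)
```

The asymmetry of the estimated off-diagonal entries (one clearly nonzero, one close to zero) is the directed information that a symmetric correlation matrix cannot express.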
27.2.3 Coherence
For a pair of channels \(i,j\), the cross-coherence (Sec. 21.2.2) \(\gamma _{ij}[k]\) between \(\bx _i\) and \(\bx _j\) can be estimated, e.g. with Welch's method. The features are the band-averaged squared coherences
\(\seteqnumber{0}{}{5}\)\begin{equation} \bar {\gamma }_{ij,b} = \frac {1}{|B_b|}\sum _{k\in B_b}\gamma ^2_{ij}[k] \end{equation}
over frequency bands \(B_b\) of interest (e.g. \(\alpha \), \(\beta \), \(\gamma \) bands in EEG).
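A sketch of the band-averaged feature using SciPy is given below. `scipy.signal.coherence` returns the Welch estimate of the magnitude-squared coherence, so its output already corresponds to \(\gamma ^2_{ij}[k]\) and is not squared again. The helper name `band_coherence`, the sampling rate, the band edges, and the synthetic signals (a shared 10 Hz component plus independent noise) are assumptions for illustration.

```python
import numpy as np
from scipy.signal import coherence

def band_coherence(x_i, x_j, fs, bands, nperseg=256):
    """Band-averaged magnitude-squared coherence for a list of (low, high) bands in Hz."""
    f, Cxy = coherence(x_i, x_j, fs=fs, nperseg=nperseg)   # Welch estimate
    return np.array([Cxy[(f >= lo) & (f < hi)].mean() for lo, hi in bands])

fs = 256
t = np.arange(20 * fs) / fs                     # 20 seconds of data
rng = np.random.default_rng(0)
shared = 2.0 * np.sin(2 * np.pi * 10 * t)       # common 10 Hz component
x_i = shared + rng.standard_normal(t.size)
x_j = shared + rng.standard_normal(t.size)

bands = [(8, 13), (13, 30)]                     # illustrative alpha and beta bands
gamma_bar = band_coherence(x_i, x_j, fs, bands)
print(gamma_bar)
```

The band containing the shared component shows a clearly higher average than the band that contains only independent noise. The segment length `nperseg` trades frequency resolution against the number of averaged segments and should be chosen with the window length \(L\) in mind.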
27.2.4 Phase-locking value
For EEG-style phase coupling, let \(\phi _c[n]\) be the instantaneous phase of channel \(c\) (from the Hilbert transform or a band-limited analytic signal). The phase-locking value (PLV) between channels \(i,j\) is
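\(\seteqnumber{0}{}{6}\)\begin{equation} \mathrm {PLV}_{ij} = \left |\frac {1}{L}\sum _{n=0}^{L-1}\mathrm {e}^{\,\mathrm {j}\bigl (\phi _i[n]-\phi _j[n]\bigr )}\right | \in [0,1], \end{equation}
where \(\mathrm {j}\) is the imaginary unit (not the channel index \(j\)). PLV\(=1\) indicates perfect phase synchrony (a constant phase difference); PLV\(=0\) indicates a phase difference spread uniformly over the circle.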
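The PLV can be sketched with `scipy.signal.hilbert`, which returns the analytic signal. The function name `plv` and the test signals are assumptions of this sketch; in practice each channel would first be band-pass filtered to the band of interest, a step omitted here.

```python
import numpy as np
from scipy.signal import hilbert

def plv(x_i, x_j):
    """Phase-locking value between two equally long records."""
    phi_i = np.angle(hilbert(x_i - x_i.mean()))   # instantaneous phase of channel i
    phi_j = np.angle(hilbert(x_j - x_j.mean()))   # instantaneous phase of channel j
    return float(np.abs(np.mean(np.exp(1j * (phi_i - phi_j)))))

fs, L = 256, 2048
n = np.arange(L)
x_a = np.sin(2 * np.pi * 10 * n / fs)
x_b = np.sin(2 * np.pi * 10 * n / fs + np.pi / 3)   # same rhythm, constant phase offset

rng = np.random.default_rng(0)
plv_locked = plv(x_a, x_b)
plv_noise = plv(rng.standard_normal(L), rng.standard_normal(L))
print(plv_locked, plv_noise)
```

The constant phase offset does not reduce the PLV, which is close to 1 for the locked pair and small for the two independent noise records.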
27.2.5 Common spatial patterns (CSP)
For binary classification, CSP learns \(K\) spatial filters \(\bw _k\in \real ^C\) that maximize the variance ratio between the two classes:
\(\seteqnumber{0}{}{7}\)\begin{equation} \bw _k = \argmax _{\bw } \frac {\bw ^\top \boldsymbol {\Sigma }^{(1)}\bw }{\bw ^\top \boldsymbol {\Sigma }^{(2)}\bw }, \end{equation}
where \(\boldsymbol {\Sigma }^{(c)}\) is the average covariance matrix of class \(c\). Features are the log-variances \(\log \mathrm {Var}[\bw _k^\top \bX ]\) of the filtered channels. CSP is standard in brain–computer interfaces; like any supervised feature, it must be fitted on the training set only.
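The maximization above is a generalized eigenvalue problem, which the following sketch solves with `scipy.linalg.eigh`. It follows the formulation of the text and keeps the \(K\) eigenvectors with the largest variance ratio; many practical implementations additionally keep filters from the other end of the spectrum, which is not shown. The function names `fit_csp` and `csp_features` and the synthetic two-class data are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def fit_csp(windows_1, windows_2, K):
    """Fit K CSP filters from training windows of shape (n_windows, C, L) per class."""
    def mean_cov(windows):
        covs = []
        for X in windows:
            Xc = X - X.mean(axis=1, keepdims=True)
            covs.append(Xc @ Xc.T / X.shape[1])
        return np.mean(covs, axis=0)              # average class covariance

    S1, S2 = mean_cov(windows_1), mean_cov(windows_2)
    # Generalized eigenproblem S1 w = lambda S2 w; eigenvalues come back ascending.
    eigvals, eigvecs = eigh(S1, S2)
    order = np.argsort(eigvals)[::-1]             # largest variance ratio first
    return eigvecs[:, order[:K]].T                # rows are the filters w_k

def csp_features(W, X):
    """Log-variance of each spatially filtered channel of a window X (C x L)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return np.log(np.var(W @ Xc, axis=1))

rng = np.random.default_rng(0)
std_1 = np.array([3.0, 1.0, 1.0])[:, None]        # class 1: channel 1 is strong
std_2 = np.array([1.0, 1.0, 3.0])[:, None]        # class 2: channel 3 is strong
windows_1 = rng.standard_normal((20, 3, 500)) * std_1
windows_2 = rng.standard_normal((20, 3, 500)) * std_2

# Fit on the first 15 windows of each class only; the rest is held out.
W = fit_csp(windows_1[:15], windows_2[:15], K=1)
f1 = csp_features(W, windows_1[15])
f2 = csp_features(W, windows_2[15])
print(W, f1, f2)
```

The learned filter weights the first channel most strongly, and the log-variance feature is larger for the held-out class-1 window than for the held-out class-2 window.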
Cross-channel features grow as \(O(C^2)\) in \(C\): for \(C=16\) channels, pairwise correlation alone contributes 120 features, and per-band coherence multiplies this by the number of bands. Feature selection (Sec. 9.4) and/or dimensionality reduction (Sec. 27.3) are usually needed for \(C\gtrsim 10\).
27.2.6 Learned representations
End-to-end deep models (1D-CNN, RNN, transformer) operate directly on \(\bX \) and subsume both per-channel and cross-channel feature extraction as learned representations. They trade interpretability and small-data robustness for raw performance on large datasets.
High number of cross-features
Feature selection (Sec. 9.4) is strongly recommended: the number of cross-channel features grows quickly with \(C\), and many of these features are redundant.
27.3 Dimensionality reduction across channels
Let \(\bX \in \real ^{C\times L}\) be a window. Fit a linear projection \(\bW \in \real ^{C'\times C}\) and form
\(\seteqnumber{0}{}{8}\)\begin{equation} \bX ' = \bW \bX \in \real ^{C'\times L}, \end{equation}
then apply the univariate feature map to each of the \(C'\) rows of \(\bX '\). Common choices for \(\bW \):
• PCA: rows of \(\bW \) are the top-\(C'\) eigenvectors of the training covariance matrix \(\boldsymbol {\Sigma }_\text {train}\). Captures maximum variance; label-agnostic.
• ICA: rows of \(\bW \) are unmixing filters chosen so that the rows of \(\bX '\) are as statistically independent as possible. Useful for artifact removal in EEG (ocular, muscle).
• CSP (binary classification, Sec. 27.2.5): supervised; uses class labels.
Possible data leakage
\(\bW \) is a data-dependent transform and must be fitted on the training set only, then applied unchanged to the test set (Sec. 4.7). Fitting PCA on the combined train+test set leaks information and inflates accuracy.
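A leakage-free PCA projection across channels might look like the NumPy sketch below. Pooling all training snapshots into a single covariance estimate, the function names `fit_pca_projection` and `project`, and the synthetic data (four channels mixed from two latent sources) are assumptions of this sketch.

```python
import numpy as np

def fit_pca_projection(train_windows, n_components):
    """Fit W (C' x C) and the channel means on training windows (n_windows, C, L)."""
    n, C, L = train_windows.shape
    # Pool the snapshots of all training windows into one (C, n*L) matrix.
    samples = train_windows.transpose(1, 0, 2).reshape(C, n * L)
    mu = samples.mean(axis=1, keepdims=True)
    Sigma_train = np.cov(samples, bias=True)
    eigvals, eigvecs = np.linalg.eigh(Sigma_train)      # ascending eigenvalues
    W = eigvecs[:, ::-1][:, :n_components].T            # top-C' eigenvectors as rows
    return W, mu

def project(W, mu, X):
    """Apply the fitted projection to a single window X (C x L)."""
    return W @ (X - mu)

rng = np.random.default_rng(0)
mixing = rng.standard_normal((4, 2))                    # 2 latent sources, 4 channels

def make_windows(n_windows, L=100):
    sources = rng.standard_normal((n_windows, 2, L))
    noise = 0.05 * rng.standard_normal((n_windows, 4, L))
    return mixing @ sources + noise

train = make_windows(30)
test = make_windows(10)

W, mu = fit_pca_projection(train, n_components=2)       # fitted on training data only
X_prime = project(W, mu, test[0])                       # applied unchanged to a test window
print(W.shape, X_prime.shape)
```

Both `W` and `mu` are computed from `train` alone; the test windows enter only through `project`, which keeps the evaluation free of the leakage described above.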