Machine Learning & Signals Learning
26 Exploratory Analysis of Univariate Signal Samples
The signal-classification chapters so far assume that each input sample is a clean, uniformly sampled univariate record of comparable length and energy. In practice, raw acquisitions may fail any of these assumptions. This chapter provides a partial diagnostic checklist to apply before signal classification.
26.1 Visual inspection for artifacts
Every quantitative test below answers a pre-specified question. Visual inspection answers the question you did not think to ask. A few minutes spent plotting raw samples typically exposes problems that would otherwise survive into the model: a class with a constant DC offset, a sensor that dropped to zero for a fraction of a second, a mains-hum line at \(50\)/\(60\) Hz, isolated spikes from electrical interference, or stretches of missing samples filled with zeros or NaN.
• Plot at least \(5\) randomly drawn samples per class, with the same y-axis range across classes.
• For each class plot one sample’s amplitude histogram and one sample’s spectrogram (Sec. 19.7) or PSD (Sec. 19.4.2).
• Report the count of samples containing NaN, \(\pm \infty \), or runs of identical-valued samples.
• DC offset: is it specific to one class or consistent across classes?
• Mains hum: narrow spectral lines at \(50\) Hz (Europe) or \(60\) Hz (US) and their harmonics.
• Saturation: flat tops at the A/D rails, treated quantitatively in Sec. 26.2.
• Spikes and glitches: isolated single-sample excursions far outside the local envelope. They inflate \(P_i\) (Sec. 26.5) and bias high-order spectral features.
• Dropouts: stretches of repeated values (often exactly zero) where the sensor lost lock or the buffer was empty. Indistinguishable in features from genuine silence unless explicitly flagged.
• Phase or polarity flip: a class consistently inverted relative to the other.
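The integrity counts in the checklist above (NaN, \(\pm \infty \), runs of identical values) can be automated. The following is a minimal sketch, assuming each sample is a 1-D NumPy array; the helper names and the run-length cutoff `min_run` are illustrative choices, not values prescribed by this chapter.

```python
import numpy as np

def longest_constant_run(x):
    """Length of the longest run of consecutive identical values in x."""
    x = np.asarray(x, dtype=float)
    if x.size == 0:
        return 0
    # Indices where the value changes. NaN != NaN, so NaNs never extend a run.
    change = np.flatnonzero(x[1:] != x[:-1]) + 1
    edges = np.concatenate(([0], change, [x.size]))
    return int(np.diff(edges).max())

def integrity_report(samples, min_run=16):
    """Count samples containing NaN, +/-inf, or a constant run of >= min_run points.

    min_run is a hypothetical cutoff; tune it to the sensor and sample rate.
    """
    report = {"nan": 0, "inf": 0, "constant_run": 0}
    for x in samples:
        x = np.asarray(x, dtype=float)
        report["nan"] += int(np.isnan(x).any())
        report["inf"] += int(np.isinf(x).any())
        report["constant_run"] += int(longest_constant_run(x) >= min_run)
    return report

# Synthetic illustration: one clean sample, one with a NaN, one with a dropout.
rng = np.random.default_rng(0)
clean = rng.standard_normal(200)
with_nan = clean.copy()
with_nan[10] = np.nan
dropout = clean.copy()
dropout[50:90] = 0.0          # 40 repeated zeros
report = integrity_report([clean, with_nan, dropout])
```

Running the report per class rather than over the pooled dataset shows whether an artifact is concentrated in one class.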
Visual inspection does not scale, and that is fine
On a dataset of \(10^5\) samples no human inspects them all. Inspect a stratified sample, e.g. \(20\) per class, plus the longest, shortest, highest-power, and lowest-power (Sec. 19.1.2) recording per class. The goal is not full coverage; it is to discover failure modes.
Re-run on any change
Re-run the inspection plots after any change to the acquisition pipeline. Every protocol change is an opportunity for a new artifact.
26.2 Detecting saturated samples
A sample is saturated when the input exceeded the converter’s full-scale range \(\pm A_{\max }\), so the recorded values are clipped at the rails. The signature appears in three views:
• Time domain: flat-top runs of consecutive points pinned at \(\pm A_{\max }\).
• Amplitude histogram: an isolated spike of mass at the rails, separated from the interior distribution.
• Spectrum: inflated higher harmonics of the fundamental (Sec. 26.7).
Three per-sample diagnostics make this quantitative:
1. Rail fraction: \(\rho _i = \frac {1}{N_i}\sum _{n=0}^{N_i-1}\indFunc \!\left [\abs {x_i[n]} \ge (1-\epsilon )\,A_{\max }\right ]\) for a small tolerance \(\epsilon \) (e.g. \(10^{-3}\)).
2. Longest rail run: \(\ell _i = \max \{k : x_i[n],\ldots ,x_i[n+k-1]\text { all satisfy }\abs {\cdot }\ge (1-\epsilon )A_{\max }\}\).
3. Histogram inspection: visible spike of mass at \(\pm A_{\max }\), separated from the interior distribution.
Fig. 26.1 contrasts a clean and a saturated sinusoid across the three views.
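The first two diagnostics translate directly into code. The sketch below assumes the full-scale value \(A_{\max }\) is known (here `a_max`) and that samples are NumPy arrays; the function names are illustrative. As in the definition of \(\ell _i\), a run counts points at either rail.

```python
import numpy as np

def rail_mask(x, a_max, eps=1e-3):
    """True where |x[n]| >= (1 - eps) * a_max."""
    return np.abs(np.asarray(x, dtype=float)) >= (1.0 - eps) * a_max

def rail_fraction(x, a_max, eps=1e-3):
    """Fraction of points at the rails (rho_i)."""
    return float(rail_mask(x, a_max, eps).mean())

def longest_rail_run(x, a_max, eps=1e-3):
    """Length of the longest run of consecutive rail points (ell_i)."""
    mask = rail_mask(x, a_max, eps).astype(int)
    # Pad with zeros so every run has both a rising and a falling edge.
    d = np.diff(np.concatenate(([0], mask, [0])))
    starts = np.flatnonzero(d == 1)
    ends = np.flatnonzero(d == -1)
    return int((ends - starts).max()) if starts.size else 0

# Synthetic illustration: a clean sinusoid and one driven 50% past full scale.
fs, f0, a_max = 1000.0, 5.0, 1.0
t = np.arange(1000) / fs
clean = 0.8 * np.sin(2 * np.pi * f0 * t)
clipped = np.clip(1.5 * np.sin(2 * np.pi * f0 * t), -a_max, a_max)

rho_clean = rail_fraction(clean, a_max)
rho_clipped = rail_fraction(clipped, a_max)
run_clean = longest_rail_run(clean, a_max)
run_clipped = longest_rail_run(clipped, a_max)
```

For this clipped sinusoid roughly half of the points sit at the rails, in runs of about a quarter period, while the clean one yields zero for both diagnostics.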
Handle explicitly
When saturation is detected, choose explicitly how to handle it: discard the sample, re-acquire with adjusted gain, or record saturation status as an explicit feature.
26.3 Signal boundaries: onset and offset
The nominal duration \(T_i = N_i / f_s\) counts every recorded sample, including pre-event silence and post-event decay. The signal of interest sometimes occupies only a sub-interval \([n_{\mathrm {on}},\,n_{\mathrm {off}}]\) with effective duration
\(\seteqnumber{0}{}{0}\)\begin{equation} T_i^{\mathrm {eff}} = (n_{\mathrm {off}} - n_{\mathrm {on}}) / f_s. \end{equation}
A small ratio \(T_i^{\mathrm {eff}} / T_i\) is itself a quality metric: most of the recording is not signal.
Onset and offset
Given a per-sample log envelope \(20\log _{10} e_i[n]\) (e.g. short-time \(\RMSE \) over a window of a few fundamental periods; the factor \(20\) is the amplitude-to-dB convention, replace it with \(10\) for a squared/power envelope), locate the peak and define
\(\seteqnumber{0}{}{1}\)\begin{equation} n_{\mathrm {on}} = \min \{n : 20\log _{10} e_i[n] \ge \mathrm {peak}_{\mathrm {dB}} - \theta \}, \quad n_{\mathrm {off}} = \max \{n : 20\log _{10} e_i[n] \ge \mathrm {peak}_{\mathrm {dB}} - \theta \}, \end{equation}
with a threshold \(\theta \) measured in dB (e.g. \(\theta =20\)) and a hangover of \(L_h\) samples that keeps the segment marked active across an occasional short drop in amplitude.
If \(\theta \) is tuned per class (e.g. to each class’s noise floor), it may itself become a class-discriminative feature.
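The onset and offset definition can be sketched as follows, assuming a short-time RMS envelope computed with a centered moving average. The window length `win`, the `floor` guard against \(\log 0\), and the function names are illustrative assumptions. The hangover logic is left out of this sketch: taking the first and last threshold crossings already spans any dip in between.

```python
import numpy as np

def rms_envelope(x, win):
    """Short-time RMS over a centered moving window of `win` samples."""
    kernel = np.ones(win) / win
    power = np.convolve(np.asarray(x, dtype=float) ** 2, kernel, mode="same")
    return np.sqrt(power)

def onset_offset(x, win, theta_db=20.0, floor=1e-12):
    """First and last index whose envelope is within theta_db of the peak."""
    env_db = 20.0 * np.log10(np.maximum(rms_envelope(x, win), floor))
    active = np.flatnonzero(env_db >= env_db.max() - theta_db)
    return int(active[0]), int(active[-1])

# Synthetic illustration: a 50 Hz burst between n = 1000 and n = 1999,
# surrounded by weak noise (about 60 dB below the burst).
fs = 1000.0
n = np.arange(3000)
rng = np.random.default_rng(0)
x = 0.001 * rng.standard_normal(n.size)
x[1000:2000] += np.sin(2 * np.pi * 50.0 * n[1000:2000] / fs)

n_on, n_off = onset_offset(x, win=40)        # 40 samples = two periods of 50 Hz
t_eff = (n_off - n_on) / fs                  # effective duration in seconds
```

The detected boundaries land slightly outside the true burst edges, by up to about half a window, because the envelope smears the edges.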
Thresholding may cause leakage
Do not use the test set to derive the threshold.
Exponential decay
When the signal has the form
\(\seteqnumber{0}{}{2}\)\begin{equation} x_i[n] \approx a\,e^{-(n - n_{\mathrm {on}})/\tau }\cos (2\pi f_0 n / f_s + \phi ), \end{equation}
the log envelope is linear in \(n\) between onset and the late noise floor. A linear fit of \(\ln e_i[n]\) on \([n_{\mathrm {on}},\,n_{\mathrm {off}}]\) has slope \(-1/\tau \) (with \(\tau \) in samples), so \(\hat \tau \) is the negative reciprocal of the fitted slope. The per-class distribution of \(\hat \tau \) is sometimes discriminative.
For decay-dominated signals, \(\hat \tau \) becomes a per-class feature in its own right.
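A minimal sketch of the decay fit, assuming an envelope array `env` is already available and strictly positive on the fit interval. The fit uses the natural logarithm, so the slope equals \(-1/\tau \) with \(\tau \) in samples; the function name and the synthetic envelope are illustrative.

```python
import numpy as np

def fit_decay_tau(env, n_on, n_off):
    """Least-squares line through ln(env) on [n_on, n_off]; returns tau in samples."""
    n = np.arange(n_on, n_off + 1)
    slope, _intercept = np.polyfit(n, np.log(env[n_on:n_off + 1]), deg=1)
    return -1.0 / slope

# Synthetic illustration: exponential envelope with mild multiplicative noise.
rng = np.random.default_rng(0)
n = np.arange(1200)
tau_true = 300.0
env = 2.0 * np.exp(-n / tau_true) * np.exp(0.01 * rng.standard_normal(n.size))

tau_hat = fit_decay_tau(env, n_on=0, n_off=900)   # in samples
# Divide by the sample rate f_s to express tau_hat in seconds.
```

The interval should stop before the envelope reaches the noise floor; points on the floor flatten the fitted line and bias \(\hat \tau \) upward.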
Takeaway
Trim samples to the active interval \([n_{\mathrm {on}},\,n_{\mathrm {off}}]\) before any downstream diagnostic.
26.4 Sample-rate adequacy
The Nyquist criterion \(f_s > 2 f_{\max }\) (see the signals chapter) is the lower bound. The classification literature obsesses over the under-sampling case; the over-sampling case is rarely discussed but just as common in practice.
The too-high case
When \(f_s\) is many times the informative bandwidth, every per-sample vector \(\bx _i\) is artificially long. The concrete costs:
• Feature extraction costs more computation per sample.
• Some features grow in dimension or become less informative.
• The curse of dimensionality applies to a representation whose intrinsic dimension is unchanged.
Spectral-occupancy diagnostic
Estimate the PSD of a representative subset of samples and locate the frequency \(f_{\mathrm {occ}}\) below which a chosen fraction (e.g. 99%) of the signal power lies. Set the working sample rate to \(f_s' \gtrsim 2.5\,f_{\mathrm {occ}}\) via decimation (with an anti-alias prefilter). Fig. 26.3 shows that PSDs at \(2.5\,f_{\max }\) and \(25\,f_{\max }\) contain the same informative band.
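One way to compute \(f_{\mathrm {occ}}\) is from the cumulative sum of a periodogram. The sketch below makes two assumptions of its own: it uses a single Hann-windowed periodogram rather than an averaged PSD estimate, and it removes the mean so that a DC offset does not dominate the total power. The function name is illustrative.

```python
import numpy as np

def occupied_bandwidth(x, fs, fraction=0.99):
    """Smallest frequency below which `fraction` of the (non-DC) power lies."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                                   # assumption: ignore DC
    psd = np.abs(np.fft.rfft(x * np.hanning(x.size))) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    cumulative = np.cumsum(psd) / psd.sum()
    # First bin at which the cumulative power reaches the requested fraction.
    return float(freqs[np.searchsorted(cumulative, fraction)])

# Synthetic illustration: tones at 100 Hz and 300 Hz, sampled at 8 kHz.
fs = 8000.0
t = np.arange(8000) / fs
rng = np.random.default_rng(0)
x = (np.sin(2 * np.pi * 100.0 * t)
     + 0.5 * np.sin(2 * np.pi * 300.0 * t)
     + 0.01 * rng.standard_normal(t.size))

f_occ = occupied_bandwidth(x, fs)
fs_work = 2.5 * f_occ          # suggested working rate, far below 8 kHz
```

In practice the estimate would be repeated over a representative subset of training samples, and a high quantile of the per-sample \(f_{\mathrm {occ}}\) values used to set \(f_s'\).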
Set \(f_s\) from spectral occupancy
The acquisition sample rate is usually a sensor default, not a modeling choice. Estimate the spectral occupancy of representative recordings and decimate to a working \(f_s' \gtrsim 2.5\,f_{\mathrm {occ}}\) before feature extraction.
Terminology: decimation vs. undersampling
Undersampling keeps every \(k\)-th sample as-is, so any content above the new Nyquist frequency \(f_s'/2\) aliases into the retained band. Decimation first applies a low-pass anti-alias filter with cutoff at \(f_s'/2\), then keeps every \(k\)-th sample. Use decimation. Prefer a linear-phase (FIR) anti-alias filter: a non-linear-phase one (e.g. IIR Butterworth) introduces frequency-dependent group delay that distorts onset positions and decay envelopes used downstream (Sec. 26.3).
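The difference between the two operations shows up on a signal with out-of-band content. The sketch below assumes a Hamming-windowed sinc FIR design with 101 taps and a cutoff at 80% of the new Nyquist frequency; these design values and the helper names are illustrative, not prescribed by the text.

```python
import numpy as np

def fir_lowpass(num_taps, cutoff, fs):
    """Hamming-windowed sinc low-pass; symmetric taps give linear phase."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    h = np.sinc(2.0 * cutoff / fs * n) * np.hamming(num_taps)
    return h / h.sum()                       # unity gain at DC

def decimate_fir(x, k, fs, num_taps=101):
    """Low-pass filter, then keep every k-th sample. Returns (y, new_fs)."""
    # Cutoff slightly below the new Nyquist frequency to leave a transition band.
    h = fir_lowpass(num_taps, cutoff=0.8 * fs / (2 * k), fs=fs)
    # With an odd, symmetric kernel, mode="same" removes the constant filter delay.
    y = np.convolve(np.asarray(x, dtype=float), h, mode="same")
    return y[::k], fs / k

def tone_amplitude(y, fs, f):
    """Amplitude estimate from the DFT bin nearest to frequency f."""
    spectrum = np.abs(np.fft.rfft(y)) / (y.size / 2)
    return float(spectrum[int(round(f * y.size / fs))])

# Synthetic illustration: 50 Hz (wanted) plus 1800 Hz (out of band after k = 8).
fs, k = 8000.0, 8
t = np.arange(8000) / fs
x = np.sin(2 * np.pi * 50.0 * t) + np.sin(2 * np.pi * 1800.0 * t)

y_naive = x[::k]                             # undersampling: 1800 Hz aliases to 200 Hz
y_dec, fs_dec = decimate_fir(x, k, fs)

alias_naive = tone_amplitude(y_naive, fs / k, 200.0)
alias_dec = tone_amplitude(y_dec, fs_dec, 200.0)
kept_dec = tone_amplitude(y_dec, fs_dec, 50.0)
```

Plain undersampling leaves a full-amplitude alias at 200 Hz, whereas the decimated signal keeps the 50 Hz tone and suppresses the alias. The first and last few output samples are affected by the zero-padded filter edges and may need trimming.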
Re-check with the classifier
The spectral-occupancy cut \(f_s'\) is necessary but not sufficient. After decimating the training set to \(f_s'\) (identical decimation applied to the test set, never with parameters fit on test), refit the chosen classifier and compare its held-out accuracy to a baseline fit at the original \(f_s\). Comparable accuracy indicates that the discarded frequencies carried no class-discriminative information the classifier could use.
Sweep recipe
Decimate to a small grid of working rates (e.g. \(1\times \), \(0.5\times \), \(0.25\times \) of the original \(f_s\)) and plot held-out performance (e.g., accuracy) vs. \(f_s'\).
Leakage warning
Decimation parameters (cutoff, filter order) must be derived from training-set statistics only. This is the same end-to-end check pattern used in the permutation diagnostics of Sec. 26.8.
26.5 Energy and power across classes
For each sample define
\(\seteqnumber{0}{}{3}\)\begin{equation} E_i = \sum _{n=0}^{N_i-1} x_i[n]^2, \qquad P_i = \frac {E_i}{N_i}. \end{equation}
When sample lengths \(N_i\) differ across the dataset, \(E_i\) is misleading and the average power \(P_i\) is the fair cross-class quantity. When the recordings include silence or decay tails, restrict the sum to the signal interval \([n_{\mathrm {on}},\,n_{\mathrm {off}}]\) recovered in Sec. 26.3; otherwise silence padding deflates \(P_i\) and a class with a longer recorded tail looks quieter than it is.
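The effect of silence padding on \(P_i\) can be checked numerically. The sketch below assumes the active interval is given as inclusive indices `n_on` and `n_off`; the function name is illustrative.

```python
import numpy as np

def energy_and_power(x, n_on=0, n_off=None):
    """E and P over the inclusive interval [n_on, n_off] (whole record by default)."""
    x = np.asarray(x, dtype=float)
    n_off = x.size - 1 if n_off is None else n_off
    segment = x[n_on:n_off + 1]
    energy = float(np.sum(segment ** 2))
    return energy, energy / segment.size

# Synthetic illustration: a unit-amplitude burst followed by an equal length of silence.
n = np.arange(1000)
burst = np.sin(2 * np.pi * 50.0 * n / 1000.0)      # 50 full periods
padded = np.concatenate((burst, np.zeros(1000)))

e_full, p_full = energy_and_power(padded)          # silence included
e_active, p_active = energy_and_power(padded, 0, 999)
```

The energy is the same in both cases, but the power over the full record is half of the power over the active interval, which is the deflation described above.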
A per-class boxplot of \(P_i\) (Fig. 26.4) answers two questions at once: is amplitude class-discriminative, and is the discrimination credible? Two distinct interpretations apply:
• Phenomenon-driven: classes genuinely differ in radiated power (e.g. loud vs. quiet sound source). Amplitude is a valid feature.
• Acquisition-driven: classes were collected on different hardware, gains, or distances. Amplitude is a confound and the classifier will exploit it instead of the phenomenon.
1. Compare \(P_i\) distributions before and after per-sample amplitude normalization (e.g. divide by \(\sqrt {P_i}\) or scale to \(\max \abs {x_i} = 1\)).
2. Refit the classifier on the normalized samples. If accuracy collapses to chance, amplitude was the only working signal; whether that is acceptable depends on the answer to the phenomenon-vs-acquisition question above.
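The two normalizations mentioned in step 1 can be sketched as below. The `floor` guard against division by zero for all-zero samples and the function names are assumptions of this sketch.

```python
import numpy as np

def rms_normalize(x, floor=1e-12):
    """Divide by sqrt(P_i) so that the normalized sample has unit average power."""
    x = np.asarray(x, dtype=float)
    return x / max(float(np.sqrt(np.mean(x ** 2))), floor)

def peak_normalize(x, floor=1e-12):
    """Scale so that max |x| = 1."""
    x = np.asarray(x, dtype=float)
    return x / max(float(np.max(np.abs(x))), floor)

# Synthetic illustration: two "classes" that differ only in gain.
rng = np.random.default_rng(0)
quiet = 0.1 * rng.standard_normal(2000)
loud = 5.0 * rng.standard_normal(2000)

powers_before = [float(np.mean(s ** 2)) for s in (quiet, loud)]
powers_after = [float(np.mean(rms_normalize(s) ** 2)) for s in (quiet, loud)]
peak_after = float(np.max(np.abs(peak_normalize(loud))))
```

After RMS normalization both samples have the same power, so any remaining class separation must come from something other than overall level. Peak normalization is sensitive to single spikes (Sec. 26.1), which is a reason to prefer the RMS form when glitches are present.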
Interaction with saturation
A class whose samples are saturated while the other’s are not has its \(P_i\) artificially pinned toward \(A_{\max }^2\).
Acquisition-driven amplitude is leakage
If the amplitude difference between classes traces to a hardware or protocol difference at acquisition time rather than to the phenomenon, the classifier is learning the acquisition setup.
The leak is invisible to a held-out test set drawn from the same acquisition and is exposed only when the model is deployed on a new setup.
Normalize amplitude away or rerun acquisition under a single protocol.
26.6 Effective signal length
Two failure modes appear at the ends of the \(T_i\) scale.
Too short
A sample that contains only a fraction of a fundamental period cannot yield a reliable \(\hat f_0\) (Sec. 26.7), and its spectral-resolution floor is \(\Delta f = 1/T_i\). As a practical lower bound, target
\(\seteqnumber{0}{}{4}\)\begin{equation} T_i \cdot \hat f_0 \gtrsim 5, \end{equation}
i.e. at least five fundamental periods per sample. Plot the distribution of \(T_i \hat f_0\) per class; an excess of samples below this threshold flags a recording protocol that needs to be lengthened.
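A small helper makes the check explicit. This sketch assumes \(\hat f_0\) has already been estimated (Sec. 26.7); the function name and the example numbers are illustrative.

```python
def record_length_check(n_samples, fs, f0_hat, min_cycles=5.0):
    """Number of fundamental periods in the record and its resolution floor."""
    duration = n_samples / fs                 # T_i in seconds
    cycles = duration * f0_hat                # T_i * f0_hat
    return {
        "cycles": cycles,
        "delta_f": 1.0 / duration,            # spectral-resolution floor in Hz
        "too_short": cycles < min_cycles,
    }

# Illustration: 400 samples at 8 kHz with a 60 Hz fundamental.
check = record_length_check(n_samples=400, fs=8000.0, f0_hat=60.0)
```

Here the record lasts 0.05 s and holds about three periods of the fundamental, so it falls below the five-period target, and its resolution floor is 20 Hz.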
Too long
A long recording is often a mixture of regimes (onset transient, steady state, decay) that should be classified separately. The diagnostic is intra-sample non-stationarity: split each \(\bx _i\) into \(Q\) contiguous sub-windows of equal length, compute a summary statistic on each (e.g. power or \(\hat f_0\)), and examine its spread across the sub-windows.
Significant spread within a single sample means either the sample should be re-windowed (Sec. 24.4) or the class label refers to a regime that occupies only part of the recording.
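The sub-window diagnostic can be sketched with one possible statistic: per-window power in dB, with the spread measured as the range across windows. Both choices, along with the function names, are assumptions of this sketch; any other per-window feature could be substituted.

```python
import numpy as np

def subwindow_power_db(x, q, floor=1e-12):
    """Average power (dB) in each of q contiguous, equal-length sub-windows."""
    x = np.asarray(x, dtype=float)
    usable = (x.size // q) * q               # drop the remainder so windows are equal
    windows = x[:usable].reshape(q, -1)
    return 10.0 * np.log10(np.maximum((windows ** 2).mean(axis=1), floor))

def power_spread_db(x, q):
    """Range of sub-window power in dB; near zero for a stationary sample."""
    p = subwindow_power_db(x, q)
    return float(p.max() - p.min())

# Synthetic illustration: a steady tone and the same tone with exponential decay.
fs = 1000.0
n = np.arange(4000)
steady = np.sin(2 * np.pi * 50.0 * n / fs)
decaying = steady * np.exp(-n / 1000.0)

spread_steady = power_spread_db(steady, q=8)
spread_decaying = power_spread_db(decaying, q=8)
```

The steady tone shows essentially no spread, while the decaying one spans roughly 30 dB between its first and last sub-window.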
Spectral-resolution floor
No spectral feature can resolve detail finer than \(\Delta f = 1/T_i\). A classifier asked to discriminate two classes whose spectra differ on a finer scale will fail regardless of the model; the fix is longer recordings, not a richer model.
Class-conditional duration histogram
A per-class histogram of \(T_i\) frequently reveals acquisition mismatch: one class consistently recorded longer than another (Fig. 26.5).
Acquisition-driven time-length is leakage
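When recording length differs between classes for acquisition reasons, \(T_i\) itself leaks into duration-sensitive features (e.g. the energy \(E_i\) of Sec. 26.5).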
26.7 Fundamental frequency
For periodic or quasi-periodic signals, the fundamental frequency \(f_0\) is often the single most informative scalar summary. The standard estimators are:
• Autocorrelation peak (Sec. 20.2.2): \(\hat f_0 = f_s / \arg \max _{\tau \ge \tau _{\min }} r_{xx}[\tau ]\), where \(r_{xx}\) is the biased autocorrelation and \(\tau _{\min }\) excludes the trivial peak at \(\tau =0\).
• Periodogram peak: \(\hat f_0 = \arg \max _f \abs {X(f)}^2\), restricted to the expected band (Sec. 19.5).
• Cepstral peak: \(\hat f_0 = f_s / \arg \max _{q \ge q_{\min }} c_{x}[q]\), where \(c_x\) is the real cepstrum and \(q_{\min }\) excludes the low-quefrency region.
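The autocorrelation and periodogram estimators can be sketched as follows. The search band `[f_min, f_max]` is an assumption of this sketch: it plays the role of \(\tau _{\min }\) for the autocorrelation (and adds an upper lag limit) and of the expected band for the periodogram. The function names are illustrative.

```python
import numpy as np

def f0_autocorr(x, fs, f_min, f_max):
    """f0 from the biased-autocorrelation peak at lags fs/f_max .. fs/f_min."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    # Biased autocorrelation for lags 0..n-1 via a zero-padded FFT.
    spectrum = np.fft.rfft(x, 2 * n)
    r = np.fft.irfft(np.abs(spectrum) ** 2)[:n] / n
    lag_lo = max(int(fs / f_max), 1)
    lag_hi = min(int(fs / f_min), n - 1)
    lag = lag_lo + int(np.argmax(r[lag_lo:lag_hi + 1]))
    return fs / lag

def f0_periodogram(x, fs, f_min, f_max):
    """f0 from the largest Hann-windowed periodogram bin inside [f_min, f_max]."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x * np.hanning(x.size))) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    band = (freqs >= f_min) & (freqs <= f_max)
    return float(freqs[band][np.argmax(power[band])])

# Synthetic illustration: 100 Hz fundamental with two harmonics and mild noise.
fs = 8000.0
t = np.arange(8000) / fs
rng = np.random.default_rng(0)
x = (np.sin(2 * np.pi * 100.0 * t)
     + 0.5 * np.sin(2 * np.pi * 200.0 * t)
     + 0.3 * np.sin(2 * np.pi * 300.0 * t)
     + 0.05 * rng.standard_normal(t.size))

f0_ac = f0_autocorr(x, fs, f_min=50.0, f_max=400.0)
f0_pg = f0_periodogram(x, fs, f_min=50.0, f_max=400.0)
```

The autocorrelation estimate is quantized to integer lags, so its resolution degrades as \(f_0\) approaches \(f_s\); the periodogram estimate is quantized to the bin spacing \(1/T_i\). Neither sketch interpolates around the peak.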
A per-class histogram of \(\hat f_0\) (Fig. 26.6) often reveals class separability via \(f_0\) alone, and an empty overlap region is itself an important finding: a one-dimensional decision rule on \(\hat f_0\) may already saturate accuracy and a complex classifier is unnecessary.
Interaction with saturation (Sec. 26.2). Saturation inflates higher harmonics but does not shift the fundamental, so \(\hat f_0\) from autocorrelation or cepstrum remains usable for mildly saturated samples. A periodogram-based estimator restricted to the fundamental band is also robust; only an estimator that searches the full spectrum can be misled into reporting a harmonic.
\(f_0\) ambiguity at short records
When the sample length \(T_i\) contains fewer than \(\approx 3\) fundamental periods, \(\hat f_0\) from any of the three estimators has high variance and bias. This is a recording-length issue, not an estimator issue (Sec. 26.6).
26.8 Data-integrity permutation tests
Two of the diagnostics in Ch. 13 are particularly relevant once a signal-classification pipeline is in place: the label-permutation null baseline (Sec. 13.3.1), which establishes the dataset- and pipeline-specific chance accuracy, and the cross-class sample permutation (Sec. 13.3.5), which exposes per-sample leakage cues (filename, timestamp, recording-session id, residual saturation signature, duration) that survive a swap between class buckets. Run both before trusting any held-out score.
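The procedure of Sec. 13.3.1 is not reproduced here. As a reminder of the idea, the following sketch shows one common form of a label-permutation null, under assumptions of its own: a nearest-centroid classifier stands in for the real pipeline, the features are synthetic, and labels are shuffled in both the training and the held-out split so that any relation between features and labels is destroyed.

```python
import numpy as np

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Fit class centroids on the training split, score on the held-out split."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dist = ((X_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dist.argmin(axis=1)] == y_te).mean())

def label_permutation_null(X_tr, y_tr, X_te, y_te, n_perm=200, seed=0):
    """Held-out accuracy under shuffled labels, repeated n_perm times."""
    rng = np.random.default_rng(seed)
    return np.array([
        nearest_centroid_accuracy(X_tr, rng.permutation(y_tr),
                                  X_te, rng.permutation(y_te))
        for _ in range(n_perm)
    ])

# Synthetic illustration: two well-separated Gaussian classes in two dimensions.
data_rng = np.random.default_rng(1)

def make_split(n_per_class):
    X = np.vstack((data_rng.normal(0.0, 1.0, (n_per_class, 2)),
                   data_rng.normal(3.0, 1.0, (n_per_class, 2))))
    y = np.repeat([0, 1], n_per_class)
    return X, y

X_tr, y_tr = make_split(100)
X_te, y_te = make_split(100)

true_acc = nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te)
null_acc = label_permutation_null(X_tr, y_tr, X_te, y_te)
p_value = (1 + np.sum(null_acc >= true_acc)) / (1 + null_acc.size)
```

The spread of `null_acc` gives the pipeline-specific chance level against which the real held-out score is compared. In a real pipeline the whole chain, including any trimming, decimation, and normalization fitted on training data, would be rerun inside each permutation.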