Machine Learning & Signals Learning


13 Sanity Checks for Cross-Validation of Classifiers

  • Goal: Verify that cross-validation results reflect genuine predictive signal rather than artifacts of the evaluation procedure.

Cross-validation (Sec. 4.5) provides an estimate of model performance on unseen data. However, this estimate can be misleading—especially with small datasets (\(M\lesssim 10^2\)), high-dimensional feature spaces (\(N\gg M\)), or data collected from heterogeneous sources. A classifier may report high accuracy not because it has learned the true relationship between \(\bX \) and \(\by \), but because the evaluation procedure itself is flawed.

Common sources of misleadingly optimistic CV results include:

  • • An inappropriate CV splitting strategy that leaks information between folds.

  • • Data-dependent pre-processing steps (e.g., feature selection, normalization) fitted on the entire dataset before splitting.

  • • High-dimensional feature spaces where spurious correlations arise by chance.

  • • Domain-specific artifacts (e.g., scanner calibration, patient baselines) that the model memorizes instead of learning generalizable patterns.

This chapter presents a series of sanity checks designed to detect these problems. The checks are organized from general methodology (CV strategy and data integrity) to specific diagnostic tests (statistical validation and image-specific techniques). Each check is independent and can be applied selectively based on the problem domain.

13.1 Cross-Validation Strategy Pitfalls

  • Goal: Ensure that the CV splitting strategy does not artificially inflate performance estimates.

The choice of how to partition data into folds is itself a modeling decision that can introduce bias. This section reviews two common pitfalls: ignoring group structure in the data, and using leave-one-out CV in high-dimensional settings.

13.1.1 Grouped, Stratified, and Grouped Stratified CV

Standard random cross-validation implicitly assumes that samples are independent and identically distributed (i.i.d.), and that the classes are relatively balanced. When these assumptions fail, specialized splitting strategies are required:

  • • Grouped CV (introduced in Sec. 4.5) ensures that all samples from a given group remain in the same fold, preventing leakage when multiple samples are drawn from the same underlying entity (e.g., a patient or sensor).

  • • Stratified CV preserves the original class proportions in every fold, addressing artifacts caused by class imbalance (Sec. 12.4).

  • • Grouped stratified CV attempts to combine both constraints simultaneously: no group spans multiple folds, and the class distribution is maintained as closely as possible across folds.

Stratification for imbalanced data When a dataset is highly imbalanced (Sec. 12.4), standard random splitting can result in test folds where the minority class is underrepresented or entirely absent. Stratified CV enforces that the target class distribution is maintained in each fold, ensuring that performance metrics such as recall and \(F_1\)-score are evaluated on a representative sample. This is particularly crucial for small datasets, where random variation in fold composition can substantially change the overall performance estimate.

  • Example 13.1 (Stratified CV for Rare Classes): Consider a dataset with \(M=100\) samples, containing \(90\) majority and \(10\) minority class samples, split into \(5\) folds of \(20\) samples each:

              Standard 5-fold        Stratified 5-fold
    Fold    Majority  Minority     Majority  Minority
    1          17        3            18        2
    2          20        0            18        2
    3          19        1            18        2
    4          16        4            18        2
    5          18        2            18        2

    With standard splitting, fold 2 contains zero minority samples, making it impossible to compute precision or recall on that test fold. Stratified CV prevents this by strictly preserving the \(9\!:\!1\) class ratio in every fold, providing stable performance estimates.
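The stratified column of Example 13.1 can be reproduced with a few lines of code. The following is a minimal sketch, not a reference implementation: the helper name `stratified_folds` is hypothetical, and it deals samples to folds round-robin within each class without shuffling (in practice the indices within each class would be shuffled first).

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign every sample index to one of k folds, dealing each class round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k  # spreads each class evenly over the folds
    return fold_of

# Dataset of Example 13.1: 90 majority (0) and 10 minority (1) samples, 5 folds.
labels = [0] * 90 + [1] * 10
fold_of = stratified_folds(labels, k=5)

# Count (fold, class) pairs to inspect the composition of every fold.
composition = Counter(zip(fold_of, labels))
for fold in range(5):
    print(fold + 1, composition[(fold, 0)], composition[(fold, 1)])
```

Every fold receives 18 majority and 2 minority samples, matching the table. Library splitters such as scikit-learn's `StratifiedKFold` follow the same idea with shuffling and remainder handling added.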

Combining grouping and stratification When data is both grouped and imbalanced (e.g., medical data with multiple scans per patient, where some patients have a rare condition), grouped stratified CV is necessary. However, unlike pure stratification, combining grouping and stratification is an inherently challenging optimization problem. Because groups can vary significantly in size and contain mixed class distributions, perfect stratification is often impossible without violating the grouping constraint.

Grouped stratified CV algorithms therefore seek a compromise: they strictly enforce the grouping constraint (ensuring no data leakage) while heuristically assigning groups to folds to approximate the global class distribution as closely as possible.

  • Example 13.2 (The Challenge of Grouped Stratified CV): A medical dataset contains scans from three patients. The total is \(25\) normal and \(15\) abnormal scans (ratio \(5\!:\!3\)).

                 Normal  Abnormal  Total
    Patient A      10       10       20
    Patient B      15        0       15
    Patient C       0        5        5
    Dataset        25       15       40

    The grouped stratified algorithm assigns whole patients to folds, approximating the global class ratio as closely as possible:

                  Patients  Normal  Abnormal  Ratio
    Ideal fold    -         12.5    7.5       \(5:3\)
    Fold 1        A         10      10        \(1:1\)
    Fold 2        B, C      15      5         \(3:1\)

    Both folds deviate from the ideal \(5\!:\!3\) ratio, but this is the best achievable split that strictly prevents patient leakage between folds.
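For a dataset this small, the claim in Example 13.2 can be checked by brute force. The sketch below enumerates every way of assigning the three patients to two folds and scores each assignment. The deviation measure (summed absolute difference between each fold's abnormal fraction and the global one) is an assumption chosen for illustration; real grouped stratified splitters use heuristics rather than enumeration.

```python
from itertools import combinations

# (normal, abnormal) scan counts per patient, taken from Example 13.2.
groups = {"A": (10, 10), "B": (15, 0), "C": (0, 5)}

def abnormal_fraction(names):
    """Fraction of abnormal scans among the scans of the given patients."""
    normal = sum(groups[n][0] for n in names)
    abnormal = sum(groups[n][1] for n in names)
    return abnormal / (normal + abnormal)

target = abnormal_fraction(groups)  # global fraction, 15/40

def deviation(fold_1, fold_2):
    """Assumed score: total distance of both folds from the global class fraction."""
    return (abs(abnormal_fraction(fold_1) - target)
            + abs(abnormal_fraction(fold_2) - target))

names = sorted(groups)
candidates = []
for size in range(1, len(names)):
    for fold_1 in combinations(names, size):
        fold_2 = tuple(n for n in names if n not in fold_1)
        if fold_1 < fold_2:  # keep one copy of each mirrored pair
            candidates.append((deviation(fold_1, fold_2), fold_1, fold_2))

best = min(candidates)
print(best)
```

Under this score, the split with Patient A alone in one fold and Patients B and C in the other has the smallest deviation among the three possible two-fold assignments.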

13.1.2 Group Leakage Diagnostic
  • Goal: Detect whether a model exploits domain-specific signatures rather than generalizable patterns.

When data is collected from distinct logical groups, such as multiple patients, different sensors, or separate geographic locations, standard random cross-validation can inadvertently place samples from the same group into both the training and test folds. Consequently, the model may learn to recognize group-specific signatures (e.g., a specific patient’s baseline signal amplitude or a particular sensor’s calibration offset) instead of uncovering the true, generalizable underlying phenomena.

A simple and effective diagnostic for this issue is to evaluate the same ML pipeline using two different splitting strategies:

  • 1. Standard (ungrouped) random CV.

  • 2. Grouped CV (Sec. 4.5), which strictly ensures all samples from a given group are isolated within a single fold.

Confounder: A variable that correlates with both the input features and the target label but is not part of the true underlying relationship. A model that relies on confounders achieves high apparent accuracy without learning the genuine pattern, and fails when the confounding correlation no longer holds (e.g., when deployed on data from a new domain).

  • Example 13.3 (Confounder in Medical Imaging): A classifier is trained to detect pneumonia from chest X-rays collected at two hospitals. Hospital A uses portable X-ray machines (mostly for bedridden, sicker patients) and Hospital B uses fixed machines (mostly for ambulatory patients). The classifier learns to distinguish portable from fixed X-ray artifacts (a confounder that correlates with disease severity) rather than the actual lung pathology. It achieves high accuracy on data from both hospitals but fails on X-rays from a third hospital with a different equipment mix.

If the ungrouped cross-validation yields substantially higher accuracy than the grouped cross-validation, it is a strong indication that the model is exploiting group-specific confounders. In such scenarios, relying solely on grouped CV for evaluation may be insufficient if the ultimate goal is to build a highly accurate, generalizable model. The underlying data representation itself must be improved to prevent the model from learning the confounders in the first place. Common mitigation strategies include:

  • • Subject-specific normalization: Removing baseline offsets by standardizing features on a per-group basis (e.g., subtracting a patient’s resting baseline from their active measurements) rather than computing statistics globally across the entire dataset.

  • • Feature engineering: Designing or extracting features that are fundamentally invariant to the known group-specific confounders (e.g., focusing on relative frequency-domain changes instead of absolute raw time-domain amplitudes).

  • • Domain adaptation techniques: Employing models, architectures, or loss functions (such as adversarial training) specifically designed to learn invariant representations across different domains or groups.
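The two-split diagnostic described in this section can be illustrated on synthetic data. The sketch below is built on assumptions: the data are hypothetical (eight patients whose single feature is only a patient-specific baseline), the classifier is a 1-nearest-neighbour rule chosen for brevity, and folds are assigned deterministically instead of randomly so the result is reproducible.

```python
def one_nn_cv_accuracy(X, y, fold_of):
    """Cross-validated accuracy of a 1-nearest-neighbour classifier on 1-D features."""
    correct = 0
    for i, x in enumerate(X):
        # Training set: every sample whose fold differs from the test sample's fold.
        train = [j for j in range(len(X)) if fold_of[j] != fold_of[i]]
        nearest = min(train, key=lambda j: abs(X[j] - x))
        correct += y[nearest] == y[i]
    return correct / len(X)

# Hypothetical data: 8 patients x 10 recordings. The feature is a patient baseline
# plus a tiny drift; it says nothing about the label beyond who the patient is.
patient_label = [0, 0, 1, 1, 0, 0, 1, 1]
X, y, groups = [], [], []
for patient, label in enumerate(patient_label):
    for recording in range(10):
        X.append(10.0 * patient + 0.01 * recording)
        y.append(label)
        groups.append(patient)

k = 4
ungrouped_folds = [i % k for i in range(len(X))]  # a patient's recordings span folds
grouped_folds = [g % k for g in groups]           # each patient stays in one fold

acc_ungrouped = one_nn_cv_accuracy(X, y, ungrouped_folds)
acc_grouped = one_nn_cv_accuracy(X, y, grouped_folds)
print(f"ungrouped CV accuracy: {acc_ungrouped:.3f}")
print(f"grouped CV accuracy:   {acc_grouped:.3f}")
```

With ungrouped folds, the nearest neighbour of every test recording is another recording of the same patient, so the score is perfect even though the feature contains no generalizable signal. With grouped folds that shortcut is unavailable and the accuracy drops sharply, which is the gap the diagnostic looks for.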

13.1.3 Leave-One-Out Cross-Validation (LOOCV)
  • Goal: Recognize when the cross-validation strategy itself inflates performance estimates because it is applied in the wrong dataset regime. A prime example is LOOCV on high-dimensional data.

LOOCV is \(k\)-fold cross-validation with \(k=M\), leaving exactly one sample out per fold. While sometimes necessary for extremely small datasets (Sec. 4.5), LOOCV suffers from several drawbacks that make it a potential source of misleadingly optimistic results:

  • 1. High computational cost. LOOCV requires training the model \(M\) times. For large datasets this is impractical; it is only feasible when \(M\) is small.

  • 2. High variance in performance estimates. Because only one sample is left out, each model is trained on almost the full dataset, so the performance estimate has low bias. However, the \(M\) training sets are nearly identical, so the resulting models are highly correlated, and the average of highly correlated estimates can have higher variance than the \(5\)-fold or \(10\)-fold estimate.

  • 3. Sensitivity to outliers. Each data point serves as the sole validation sample in exactly one fold. A single outlier or noisy point can strongly influence that fold’s error, producing unreliable overall estimates compared to \(k\)-fold where outliers are diluted within larger folds.

  • 4. Potential for overfitting. Because each training set contains \(M-1\) points, the resulting models are very similar to the full-data model. In high-dimensional or noisy settings (\(N\gg M\)), LOOCV can favor overly complex models that fit the training data too closely.

  • Example 13.4 (LOOCV Overfitting on Random Noise): LOOCV with \(N \gg M\) can yield near-perfect accuracy even on pure noise.

    Consider \(M=80\) samples with \(N=2048\) features drawn entirely from i.i.d. \(\mathcal {N}(0,1)\) (pure noise) and binary labels with class sizes \(51\) vs. \(29\) (majority baseline \(\approx 0.64\)). A pipeline of standardization \(\rightarrow \) PCA(\(5\)) \(\rightarrow \) RBF-SVM \((C=0.05)\) with LOOCV achieves \(100\%\) accuracy across \(10\) random seeds—despite the features containing no real signal.

    This illustrates how LOOCV combined with high dimensionality (\(N \gg M\)) can produce misleadingly optimistic performance estimates. The diagnostic methods described in the remainder of this chapter—particularly the random feature baseline (Sec. 13.3.4) and the permutation test (Sec. 13.3.1)—can detect such conditions.

13.2 Data Integrity

  • Goal: Identify situations where the data itself—rather than the model—is responsible for inflated performance.

Even with a correctly configured CV strategy, problems in the data can produce misleading results. This section covers two common issues: data leakage (information from outside the training set influencing model training) and domain inconsistency (distribution shifts between training and deployment).

13.2.1 Data Leakage

Data leakage (also termed train-test contamination): Information from outside the training set (for example, test labels, future observations, or global statistics) influences model training, producing overly optimistic performance estimates.

Common sources of leakage:

  • • Pre-processing leakage: fitting data-dependent transformations (e.g., standardization, PCA, feature selection) on the entire dataset before splitting into CV folds. The fitted transformation then carries information about the test fold into training.

  • • Temporal leakage: in time-series problems, using future observations to predict the past. Standard random splits violate the causal ordering; temporal (forward-chaining) splits should be used instead.

  • • Target leakage: including features that are derived from or strongly correlated with the target variable (e.g., a feature computed from the label, or a proxy that is unavailable at prediction time).

Pre-processing leakage example. Consider a pipeline that applies standardization (Sec. 4.7) before classification with k-fold CV (Sec. 4.5). If the mean \(\bar {\bx }\) and standard deviation \(s_\bx \) are computed on the entire dataset (all \(M\) samples) and then the standardized features are split into CV folds, the test fold’s features encode information about the global data distribution, including the test samples themselves. The correct approach is to compute \(\bar {\bx }\) and \(s_\bx \) inside each CV fold on the training partition only, and then transform the test partition using the training-derived statistics. The same principle applies to any other data-dependent transformation.
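The difference between the leaky and the correct standardization can be made concrete with a toy feature. In the sketch below the data values and the function names are hypothetical; the sample standard deviation from Python's `statistics` module stands in for \(s_\bx \), and only the test-partition values are returned to keep the code short.

```python
from statistics import mean, stdev

def standardize_leaky(values):
    """Leaky variant: mean and standard deviation computed on ALL samples."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def standardize_test_folds(values, fold_of):
    """Correct variant: statistics fitted on each fold's training partition only,
    then applied to that fold's test samples."""
    z = [None] * len(values)
    for fold in sorted(set(fold_of)):
        train = [v for v, f in zip(values, fold_of) if f != fold]
        mu, sigma = mean(train), stdev(train)
        for i, f in enumerate(fold_of):
            if f == fold:
                z[i] = (values[i] - mu) / sigma
    return z

# One feature, five samples, one sample per fold; the last sample is an outlier.
values = [2.0, 4.0, 6.0, 8.0, 100.0]
fold_of = [0, 1, 2, 3, 4]

z_leaky = standardize_leaky(values)
z_correct = standardize_test_folds(values, fold_of)
print(f"outlier, leaky statistics:    {z_leaky[4]:.2f}")
print(f"outlier, training-only stats: {z_correct[4]:.2f}")
```

With global statistics the outlier has already shaped the mean and standard deviation used to scale it, so it looks far less extreme than it does relative to the training data alone. Pipeline abstractions such as scikit-learn's `Pipeline` apply the training-only fit automatically when the whole pipeline is passed to the CV routine.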

Prevention & Best Practices

  • • All data-dependent pre-processing steps (e.g., supervised feature selection, PCA, normalization as in Sec. 4.7) must be fitted strictly inside each CV fold on the training partition only.

  • • Audit feature provenance to ensure no target-derived or future-derived information enters the feature matrix.

  • • Ensure an appropriate number of folds: too few folds, especially when combined with class imbalance, can produce biased estimates; too many folds (approaching LOOCV) can produce high-variance estimates and mask overfitting in high-dimensional settings (Sec. 13.1.3).

13.2.2 Domain Consistency
  • Goal: Test whether two datasets share the same underlying distribution before combining them for training or expecting cross-domain generalization.

Even when grouped CV (Sec. 13.1.2) correctly prevents within-group leakage, the model may still fail if the deployment domain differs systematically from the training domains.

Domain Inconsistency (Dataset Shift): A scenario where the distribution of the data (features or labels) in the deployment environment differs from the distribution in the training data. Common sources include differences in acquisition hardware (e.g., scanner manufacturer), environmental conditions, population demographics, or labeling conventions.

Given two datasets \(\bX _A\in \mathbb {R}^{M_A\times N}\) and \(\bX _B\in \mathbb {R}^{M_B\times N}\), a natural question is: can a classifier tell them apart? If yes, the domains are inconsistent.

Classifier two-sample test

The idea is to recast the distributional question as a binary classification problem:

  • 1. Assign artificial labels \(y=0\) to all samples in \(\bX _A\) and \(y=1\) to all samples in \(\bX _B\).

  • 2. Concatenate into \(\bX = [\bX _A;\,\bX _B]\in \mathbb {R}^{M\times N}\), \(M=M_A+M_B\), with label vector \(\by \).

  • 3. Train a binary classifier on \((\bX ,\by )\) using stratified \(k\)-fold CV and record the accuracy \(J_{\text {real}}\).

The chance-level accuracy is the majority-class rate \(J_{\text {chance}} = \max (M_A, M_B)/M\) (Eq. 13.2). If \(J_{\text {real}} \approx J_{\text {chance}}\), the classifier cannot tell \(\bX _A\) and \(\bX _B\) apart, which is consistent with both being drawn from the same distribution. If \(J_{\text {real}} \gg J_{\text {chance}}\), the distributions differ.
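The three steps and the comparison against \(J_{\text {chance}}\) can be sketched as follows. This is an illustration under assumptions: the data are a hypothetical one-dimensional feature, the domain classifier is a toy nearest-centroid rule, and stratification is done by dealing each domain round-robin into folds.

```python
def c2st_accuracy(X_a, X_b, k=5):
    """Classifier two-sample test with a toy 1-D nearest-centroid domain classifier."""
    # Steps 1-2: artificial domain labels and concatenation.
    X = list(X_a) + list(X_b)
    y = [0] * len(X_a) + [1] * len(X_b)
    # Step 3: stratified k-fold CV (each domain dealt round-robin into the folds).
    fold_of = [i % k for i in range(len(X_a))] + [i % k for i in range(len(X_b))]
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if fold_of[i] != fold]
        centroid = {}
        for label in (0, 1):
            members = [X[i] for i in train if y[i] == label]
            centroid[label] = sum(members) / len(members)
        for i in range(len(X)):
            if fold_of[i] == fold:
                pred = min((0, 1), key=lambda c: abs(X[i] - centroid[c]))
                correct += pred == y[i]
    j_real = correct / len(X)
    j_chance = max(len(X_a), len(X_b)) / len(X)
    return j_real, j_chance

# Hypothetical feature values: domain B is shifted relative to domain A.
X_a = [0.1 * i for i in range(60)]
X_b = [4.0 + 0.1 * i for i in range(40)]
j_real, j_chance = c2st_accuracy(X_a, X_b)
print(f"shifted domains:   J_real = {j_real:.2f}, J_chance = {j_chance:.2f}")

# Control: the same data on both sides cannot be told apart.
j_same, j_same_chance = c2st_accuracy(X_a, X_a)
print(f"identical domains: J_real = {j_same:.2f}, J_chance = {j_same_chance:.2f}")
```

The shifted pair is separated well above the majority-class rate, while the control lands on it. Whether a gap is statistically meaningful is a separate question that a point comparison like this does not answer.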

Statistical significance via permutation test To determine whether \(J_{\text {real}}\) is significantly above \(J_{\text {chance}}\), apply the permutation test (Sec. 13.3.1) with the following application-specific interpretation:

  • • Labels shuffled: the domain labels \(\by \) (the artificial \(0/1\) assignments), breaking any association between features and domain membership.

  • • Null distribution: the classifier accuracy expected when samples from \(\bX _A\) and \(\bX _B\) are interchangeable, i.e., when no domain difference exists.

  • • Decision: if \(p < \alpha \), reject \(H_0\) — the two datasets are statistically distinguishable and domain inconsistency is present.

  • Example 13.5 (Classifier Two-Sample Test): Dataset A contains \(M_A=60\) EEG recordings from Hospital 1 and dataset B contains \(M_B=40\) recordings from Hospital 2, both with \(N=128\) spectral features:

                                                       Hospital 1 (\(y=0\))    Hospital 2 (\(y=1\))
    Samples                                            60                      40
    Chance accuracy \(J_{\text {chance}}\)             \(60/100 = 0.60\)
    Domain classifier accuracy \(J_{\text {real}}\)    \(0.91\gg J_{\text {chance}}\)
    Permutation test (\(\alpha =0.05\))                \(p < 0.001 \Rightarrow \) reject \(H_0\)

    The classifier easily separates the two hospitals (\(J_{\text {real}}=0.91 \gg 0.60\), \(p<\alpha \)), indicating that the feature distributions differ substantially—likely due to hardware or acquisition protocol differences. A task classifier trained on Hospital 1 data alone should not be expected to generalize to Hospital 2 without domain adaptation.

Rejecting \(H_0\) does not necessarily mean the task classifier will fail—only that the feature distributions differ. However, high domain separability is a strong warning that cross-domain generalization may be poor.

13.3 Statistical Validation

  • Goal: Apply formal diagnostic tests to determine whether the observed CV performance reflects genuine signal or statistical artifacts.

Both tests in this section share a common logic: destroy the signal that a valid pipeline should rely on, and verify that performance drops to chance. The permutation test destroys the labels; the random feature baseline destroys the features.

13.3.1 Permutation Test
  • Goal: Test whether the model has learned a genuine relationship or the observed accuracy is indistinguishable from chance.

Permutation test: A statistical test that evaluates whether the observed CV performance is significantly better than what would be obtained by chance:

  • 1. Train and evaluate the model using CV, obtaining the real score \(J_{\text {real}}\).

  • 2. Randomly shuffle the labels \(\by \), breaking any true relationship between \(\bX \) and \(\by \).

  • 3. Re-run the same CV on the shuffled labels, obtaining a permuted score \(J_{\text {perm}}^{(i)}\).

  • 4. Repeat steps 2–3 \(T\) times (typically \(T=1000\)) to build a null distribution, i.e., the distribution of scores when there is no true relationship between \(\bX \) and \(\by \).

The test is framed as a hypothesis test:

  • • \(H_0\): The features \(\bX \) and labels \(\by \) are independent — the model has no predictive power.

  • • \(H_1\): \(\bX \) carries genuine information about \(\by \).

Under \(H_0\), shuffling the labels does not change the joint distribution, so the permuted scores \(J_{\text {perm}}^{(i)}\) represent the distribution of performance expected by chance.

The p-value is the fraction of permuted scores that are at least as good as the real score,

\begin{equation} p = \frac {1}{T+1}\left (\sum _{i=1}^{T} \bm {1}\left [J_{\text {perm}}^{(i)} \geq J_{\text {real}}\right ] + 1\right ) \end{equation}

where \(\bm {1}[\cdot ]\) is the indicator function. The \(+1\) in the numerator and denominator counts the real score as one draw from the null distribution, so the estimated p-value can never be exactly zero.

Decision rule The decision is made by comparing the p-value to a predetermined significance level \(\alpha \), which is the maximum acceptable probability of incorrectly rejecting \(H_0\). Common choices are \(\alpha = 0.05\) (5%) and \(\alpha = 0.01\) (1%).

  • • If \(p < \alpha \), reject \(H_0\): the model has learned a genuine relationship between \(\bX \) and \(\by \).

  • • If \(p \ge \alpha \), fail to reject \(H_0\): the observed accuracy is not statistically distinguishable from chance.

Choosing \(T\) The number of permutations \(T\) determines the resolution of the p-value: the smallest achievable p-value is \(\frac {1}{T+1}\). For standard significance testing at \(\alpha = 0.05\), \(T=100\) suffices. For more precise p-values (e.g., \(p < 0.001\)), use \(T \geq 1000\). Each permutation requires a full CV run, so the total cost is \((T+1)\) times a single CV evaluation.
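The procedure and the p-value estimator of this section can be sketched in a few lines. The code below is a minimal illustration: `cv_score` is a placeholder for whatever pipeline is being validated, the toy nearest-centroid classifier and the perfectly separable data are hypothetical, and \(T=200\) is used only to keep the run short.

```python
import random

def permutation_p_value(j_real, j_perm):
    """Share of scores (permuted ones plus the real one) that reach the real score."""
    exceed = sum(score >= j_real for score in j_perm)
    return (exceed + 1) / (len(j_perm) + 1)

def permutation_test(cv_score, X, y, T=1000, seed=0):
    """Steps 1-4: real score, T label shuffles, p-value."""
    rng = random.Random(seed)
    j_real = cv_score(X, y)                      # step 1: real score
    j_perm = []
    for _ in range(T):                           # step 4: repeat T times
        y_shuffled = list(y)
        rng.shuffle(y_shuffled)                  # step 2: break the X-y relationship
        j_perm.append(cv_score(X, y_shuffled))   # step 3: same CV, shuffled labels
    return j_real, permutation_p_value(j_real, j_perm)

def centroid_cv_accuracy(X, y, k=5):
    """Toy stand-in pipeline: k-fold CV of a 1-D nearest-centroid classifier."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        classes = sorted(set(y[i] for i in train))
        centroid = {c: sum(X[i] for i in train if y[i] == c)
                       / sum(1 for i in train if y[i] == c) for c in classes}
        for i in range(fold, len(X), k):         # test samples of this fold
            pred = min(classes, key=lambda c: abs(X[i] - centroid[c]))
            correct += pred == y[i]
    return correct / len(X)

# Hypothetical, perfectly separable data: two well-separated clusters.
X = [0.1 * i for i in range(20)] + [5.0 + 0.1 * i for i in range(20)]
y = [0] * 20 + [1] * 20
j_real, p = permutation_test(centroid_cv_accuracy, X, y, T=200)
print(f"J_real = {j_real:.2f}, p = {p:.4f}")
```

Because no shuffled labelling matches the real score here, the p-value sits at its floor of \(\frac {1}{T+1}\), which illustrates why \(T\) bounds the resolution of the test. scikit-learn offers `permutation_test_score` for the same purpose on real pipelines.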

(image)

Figure 13.1: Distribution of test accuracy under \(T\) label permutations. The vertical line marks the true-label accuracy; the area to the right is the permutation \(p\)-value.
13.3.2 Omnibus Permutation Test
  • Goal: Obtain a single significance verdict on whether a multi-class classifier carries any genuine predictive signal, without the multiple-comparison inflation of running one test per class.

With \(K\) classes, a tempting approach is to run the permutation test (Sec. 13.3.1) separately for each class in a one-vs-rest fashion. This gives \(K\) chances to reject \(H_0\) by accident: even when no class is predictable, the probability that at least one of the \(K\) tests is spuriously significant (the family-wise error rate) far exceeds the nominal \(\alpha \).

The omnibus permutation test avoids this by reducing the entire multi-class problem to a single test. The procedure is identical to Sec. 13.3.1, but each CV run is scored with one global statistic that aggregates all classes, such as overall accuracy, balanced (macro) accuracy, or macro-\(F_1\) (preferred under class imbalance, Sec. 12.4). The full label vector \(\by \) is shuffled jointly, producing one null distribution and one p-value via the same \(\frac {1}{T+1}\) estimator. The hypotheses become:

  • • \(H_0\): \(\bX \) and \(\by \) are independent — no class is predictable.

  • • \(H_1\): \(\bX \) carries information about at least one class.

  • Example 13.6 (Omnibus Test on a 4-Class Problem): A pipeline is evaluated on \(K=4\) balanced classes (chance accuracy \(\approx 0.25\)). The observed macro-\(F_1\) is \(0.41\), and across \(T=1000\) label permutations no permuted run reaches this value, giving \(p < 0.001\). The omnibus test rejects \(H_0\): the classifier carries genuine multi-class signal.

Rejecting \(H_0\) establishes that the pipeline learned something, but not which classes. The per-class follow-up of Sec. 13.3.3 localizes the signal.

The omnibus test inherits every assumption of the base permutation test: the same CV splitting strategy and all pre-processing fitted strictly inside each fold (Sec. 13.2.1). A leaky pipeline produces a misleadingly significant omnibus p-value just as readily as a per-class one.

13.3.3 Per-Class Permutation Test
  • Goal: After the omnibus test rejects, identify which classes carry signal while controlling the error rate across the \(K\) simultaneous tests.

For each class \(k\), collapse the labels to one-vs-rest (\(y=1\) for class \(k\), \(y=0\) otherwise) and run the permutation test of Sec. 13.3.1 with a per-class statistic (e.g. that class’s \(F_1\) or one-vs-rest AUC), yielding raw p-values \(p_1,\dots ,p_K\). Because \(K\) hypotheses are tested at once, these raw p-values must be adjusted:

  • • Bonferroni: reject \(H_0\) for class \(k\) if \(p_k < \alpha /K\). This controls the family-wise error rate (the probability of any false positive) but is conservative.

  • • Benjamini-Hochberg (FDR): sort the p-values \(p_{(1)}\le \dots \le p_{(K)}\), find the largest \(i\) with \(p_{(i)} \le \frac {i}{K}\alpha \), and reject \(H_0\) for that rank and all smaller ranks. This controls the expected fraction of false positives among the rejected classes, and is more powerful when several classes are genuinely predictable.

Standard practice is to gate the per-class scan on a significant omnibus result (Sec. 13.3.2): the omnibus answers “is there any signal?” and the per-class scan answers “which classes?” only once the global test has rejected.

  • Example 13.7 (Localizing the Signal): Continuing the \(K=4\) example (Sec. 13.3.2), the raw per-class p-values are \((0.002,\,0.02,\,0.31,\,0.6)\). Under Bonferroni, \(\alpha /K = 0.05/4 = 0.0125\), so only class 1 survives (\(0.002 < 0.0125\), whereas \(0.02 > 0.0125\)). Benjamini-Hochberg additionally retains class 2, since \(p_{(2)}=0.02 \le \frac {2}{4}\cdot 0.05 = 0.025\), illustrating how the choice of correction trades power against strictness.
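Both corrections are short enough to write out directly. The sketch below applies them to the p-values of Example 13.7; the function names are hypothetical, and each function returns one rejection flag per class in the original class order.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for class k when p_k < alpha / K."""
    K = len(p_values)
    return [p < alpha / K for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject H0 for the largest rank i with p_(i) <= (i / K) * alpha and all smaller ranks."""
    K = len(p_values)
    order = sorted(range(K), key=lambda i: p_values[i])  # class indices, smallest p first
    largest = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / K * alpha:
            largest = rank
    reject = [False] * K
    for i in order[:largest]:
        reject[i] = True
    return reject

p_values = [0.002, 0.02, 0.31, 0.6]
print("Bonferroni:        ", bonferroni(p_values))          # [True, False, False, False]
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))  # [True, True, False, False]
```

Bonferroni keeps only class 1, while Benjamini-Hochberg also keeps class 2, in agreement with the example.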

Never report the single most significant class without a multiple-comparison correction. Selecting the best of \(K\) raw p-values is precisely the inflation the omnibus test (Sec. 13.3.2) was designed to guard against.

13.3.4 Random Feature Baseline
  • Goal: Verify that the pipeline does not achieve high accuracy on meaningless input.

While the permutation test (Sec. 13.3.1) keeps the real features and destroys the labels, the random feature baseline takes the complementary approach: keep the real labels but destroy the features.

Replace the real feature matrix \(\bX \) with random Gaussian noise \(\bX _{\text {rand}} \sim \mathcal {N}(0,1)\) of the same dimensions, and run the identical pipeline (same pre-processing, same model, same CV). The expected accuracy should be close to the majority-class rate,

\begin{equation} \label {eq-chance-accuracy} J_{\text {chance}} = \frac {\max (M_0, M_1)}{M} \end{equation}

where \(M_0\) and \(M_1\) are the class counts.

If the pipeline achieves accuracy substantially above \(J_{\text {chance}}\) on random features, the evaluation procedure itself is flawed (e.g., data leakage as in Sec. 13.2.1, or LOOCV in high dimensions as in Example 13.4).

On imbalanced datasets with stratified CV, the random-feature baseline may slightly exceed \(J_{\text {chance}}\) due to the interaction between stratification and the classifier’s bias toward the majority class. A deviation of a few percentage points above \(J_{\text {chance}}\) is not necessarily alarming; a substantial deviation (e.g., more than \(10\) percentage points above chance) indicates a pipeline problem.
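A random feature baseline needs only a noise generator, the unchanged pipeline, and the chance level for comparison. The sketch below is illustrative: the nearest-centroid classifier is a toy stand-in for the real pipeline, the class sizes \(51\) vs. \(29\) are borrowed from Example 13.4, \(N=20\) is an arbitrary choice, and the 10-point margin follows the rule of thumb stated above.

```python
import random

def chance_accuracy(y):
    """Majority-class rate J_chance = max(M_0, M_1) / M."""
    return max(y.count(c) for c in set(y)) / len(y)

def random_features(M, N, seed=0):
    """M x N matrix of i.i.d. N(0, 1) entries that replaces the real features."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]

def centroid_cv_accuracy(X, y, k=5):
    """Toy stand-in pipeline: k-fold CV of a nearest-centroid classifier."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        centroid = {}
        for c in set(y):
            rows = [X[i] for i in train if y[i] == c]
            centroid[c] = [sum(col) / len(rows) for col in zip(*rows)]
        for i in range(fold, len(X), k):  # test samples of this fold
            dist = {c: sum((a - b) ** 2 for a, b in zip(X[i], centroid[c]))
                    for c in centroid}
            correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)

y = [0] * 51 + [1] * 29                      # real labels are kept
X_rand = random_features(M=len(y), N=20)     # real features are replaced by noise
j_rand = centroid_cv_accuracy(X_rand, y)
j_chance = chance_accuracy(y)
suspicious = j_rand > j_chance + 0.10        # rule-of-thumb margin
print(f"J_rand = {j_rand:.3f}, J_chance = {j_chance:.3f}, suspicious = {suspicious}")
```

A classifier without a majority-class bias, like this stand-in, may well score below \(J_{\text {chance}}\) on noise; only a score well above it is the warning sign.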

13.3.5 Cross-class sample permutation
  • Goal: Detect per-sample leakage cues (filename, timestamp, recording-session id, residual artifacts) that survive a swap of samples between classes.

Where the permutation test (Sec. 13.3.1) destroys labels globally and the random-feature baseline (Sec. 13.3.4) destroys features, the cross-class swap destroys the class-conditional signal incrementally. Swap a controlled fraction \(\alpha \in [0, 0.5]\) of samples between class buckets (here \(\alpha \) denotes the swap fraction, not the significance level of Sec. 13.3.1) and refit the full pipeline. Accuracy should degrade approximately monotonically with \(\alpha \) and approach the label-shuffle null at \(\alpha = 0.5\).

  • • Monotone degradation: the model is learning a class-conditional signal that is being progressively corrupted.

  • • Flat curve: the model is not exploiting the class-conditional signal at all; it relies on a per-sample cue that is preserved under the swap, e.g. filename, timestamp, recording-session id, duration, or any other metadata that survived into the feature vector (Sec. 13.2.2).

  • • Non-monotone curve: typically a sign of two competing cues, one of which is leakage that the swap inadvertently reinforces.

(image)

Figure 13.2: Accuracy vs. cross-class swap fraction \(\alpha \). A monotone curve (blue) is healthy; a flat curve (red) indicates per-sample leakage.

Flat cross-class curve = silent leakage

A flat accuracy curve under cross-class swap is the strongest leakage signature among the permutation diagnostics. The model has identified each sample individually (often via metadata that survived into the feature vector) and is reporting test accuracy by recognition rather than by classification. Trace back the per-sample cues introduced by the acquisition or storage pipeline.
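The swap curve of this section can be computed with a small helper. The sketch below rests on one interpretation, stated here as an assumption: swapping a fraction \(\alpha \) of samples between class buckets is implemented as exchanging the labels of \(\alpha \cdot \min (M_0, M_1)\) randomly chosen samples from each class, which keeps the class counts unchanged. A fixed threshold rule stands in for the refitted pipeline purely to keep the example short.

```python
import random

def swap_between_classes(y, fraction, seed=0):
    """Copy of the binary labels y with `fraction` of each class moved to the other class."""
    rng = random.Random(seed)
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    n_swap = round(fraction * min(len(idx0), len(idx1)))
    y_swapped = list(y)
    for i in rng.sample(idx0, n_swap):
        y_swapped[i] = 1
    for i in rng.sample(idx1, n_swap):
        y_swapped[i] = 0
    return y_swapped

def swap_curve(score_fn, X, y, fractions, seed=0):
    """Score of the pipeline for each swap fraction."""
    return [score_fn(X, swap_between_classes(y, f, seed)) for f in fractions]

# Hypothetical data with a clean class-conditional signal.
X = [0.0] * 20 + [5.0] * 20
y = [0] * 20 + [1] * 20

def threshold_score(X, y):
    """Stand-in for 'refit and evaluate': accuracy of a fixed threshold rule."""
    return sum(int(x > 2.5) == label for x, label in zip(X, y)) / len(y)

curve = swap_curve(threshold_score, X, y, fractions=[0.0, 0.25, 0.5])
print(curve)  # [1.0, 0.75, 0.5]
```

The score falls steadily from perfect to chance as the swap fraction grows, which is the monotone pattern described above. A pipeline that still scored highly at a fraction of \(0.5\) would be the flat-curve case.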