Machine Learning & Signals Learning
11 Sanity Checks for Cross-Validation of Classifiers
Cross-validation (Sec. 4.4) provides an estimate of model performance on unseen data. However, this estimate can be misleading—especially with small datasets (\(M\lesssim 10^2\)), high-dimensional feature spaces (\(N\gg M\)), or data collected from heterogeneous sources. A classifier may report high accuracy not because it has learned the true relationship between \(\bX \) and \(\by \), but because the evaluation procedure itself is flawed.
Common sources of misleadingly optimistic CV results include:
• An inappropriate CV splitting strategy that leaks information between folds.
• Data-dependent pre-processing steps (e.g., feature selection, normalization) fitted on the entire dataset before splitting.
• High-dimensional feature spaces where spurious correlations arise by chance.
• Domain-specific artifacts (e.g., scanner calibration, patient baselines) that the model memorizes instead of learning generalizable patterns.
This chapter presents a series of sanity checks designed to detect these problems. The checks are organized from general methodology (CV strategy and data integrity) to specific diagnostic tests (statistical validation and image-specific techniques). Each check is independent and can be applied selectively based on the problem domain.
11.1 Cross-Validation Strategy Pitfalls
The choice of how to partition data into folds is itself a modeling decision that can introduce bias. This section reviews two common pitfalls: ignoring group structure in the data, and using leave-one-out CV in high-dimensional settings.
11.1.1 Grouped, Stratified, and Grouped Stratified CV
Standard random cross-validation implicitly assumes that samples are independent and identically distributed (i.i.d.), and that the classes are relatively balanced. When these assumptions fail, specialized splitting strategies are required:
• Grouped CV (introduced in Sec. 4.4) ensures that all samples from a given group remain in the same fold, preventing leakage when multiple samples are drawn from the same underlying entity (e.g., a patient or sensor).
• Stratified CV preserves the original class proportions in every fold, addressing artifacts caused by class imbalance (Sec. 10.4).
• Grouped stratified CV attempts to combine both constraints simultaneously: no group spans multiple folds, and the class distribution is maintained as closely as possible across folds (see the sketch after this list).
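These strategies map directly onto standard splitter classes. Below is a minimal sketch, assuming scikit-learn (version 1.0 or later for StratifiedGroupKFold) and NumPy; the dataset and group assignments are hypothetical:

```python
# Sketch: comparing four splitting strategies on a synthetic imbalanced, grouped dataset.
# Assumes scikit-learn >= 1.0 (for StratifiedGroupKFold) and NumPy.
import numpy as np
from sklearn.model_selection import (KFold, StratifiedKFold, GroupKFold,
                                     StratifiedGroupKFold)

rng = np.random.default_rng(0)
M, N = 40, 8
X = rng.standard_normal((M, N))                      # feature matrix (M samples, N features)
y = rng.permutation(np.repeat([0, 1], [30, 10]))     # imbalanced labels (30 vs. 10)
groups = rng.integers(0, 10, size=M)                 # 10 groups (e.g., patients)

splitters = {
    "standard":           KFold(n_splits=4, shuffle=True, random_state=0),
    "stratified":         StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    "grouped":            GroupKFold(n_splits=4),
    "grouped stratified": StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=0),
}

for name, cv in splitters.items():
    print(f"--- {name} ---")
    for fold, (tr, te) in enumerate(cv.split(X, y, groups=groups)):
        shared = np.intersect1d(groups[tr], groups[te])   # groups appearing on both sides
        print(f"fold {fold}: test class counts {np.bincount(y[te], minlength=2)}, "
              f"leaked groups: {len(shared)}")
```

Printing the per-fold class counts and the number of leaked groups makes the trade-offs of each splitter directly visible on a small dataset.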
Stratification for imbalanced data When a dataset is highly imbalanced (Sec. 10.3.6), standard random splitting can result in test folds where the minority class is underrepresented or entirely absent. Stratified CV enforces that the target class distribution is maintained in each fold, ensuring that performance metrics such as recall and \(F_1\)-score are evaluated on a representative sample. This is particularly crucial for small datasets, where random variation in fold composition can substantially alter the overall performance estimate.
Example 11.1: Consider a dataset with \(M=100\) samples, containing \(90\) majority and \(10\) minority class samples, split into \(5\) folds of \(20\) samples each:
        Standard 5-fold        Stratified 5-fold
Fold    Majority   Minority    Majority   Minority
1       17         3           18         2
2       20         0           18         2
3       19         1           18         2
4       16         4           18         2
5       18         2           18         2

With standard splitting, fold 2 contains zero minority samples, making it impossible to compute precision or recall on that test fold. Stratified CV prevents this by strictly preserving the \(9\!:\!1\) class ratio in every fold, providing stable performance estimates.
Combining grouping and stratification When data is both grouped and imbalanced (e.g., medical data with multiple scans per patient, where some patients have a rare condition), grouped stratified CV is necessary. However, unlike pure stratification, combining the grouping and stratification constraints is an inherently constrained optimization problem. Because groups can vary significantly in size and contain mixed class distributions, achieving perfect stratification is generally mathematically impossible without violating the grouping constraint.
Grouped stratified CV algorithms therefore seek a compromise: they strictly enforce the grouping constraint (ensuring no data leakage) while heuristically assigning groups to folds to approximate the global class distribution as closely as possible.
Example 11.2: A medical dataset contains scans from three patients. The total is \(25\) normal and \(15\) abnormal scans (ratio \(5\!:\!3\)).
            Normal   Abnormal   Total
Patient A   10       10         20
Patient B   15       0          15
Patient C   0        5          5
Dataset     25       15         40

The grouped stratified algorithm assigns whole patients to folds, approximating the global class ratio as closely as possible:

             Patients   Normal   Abnormal   Ratio
Ideal fold   —          12.5     7.5        \(5:3\)
Fold 1       A          10       10         \(1:1\)
Fold 2       B, C       15       5          \(3:1\)

Both folds deviate from the ideal \(5\!:\!3\) ratio, but this is the best achievable split that strictly prevents patient leakage between folds.
11.1.2 Group Leakage Diagnostic
When data is collected from distinct logical groups, such as multiple patients, different sensors, or separate geographic locations, standard random cross-validation can inadvertently place samples from the same group into both the training and test folds. Consequently, the model may learn to recognize group-specific signatures (e.g., a specific patient’s baseline signal amplitude or a particular sensor’s calibration offset) instead of uncovering the true, generalizable underlying phenomena.
A simple and effective diagnostic for this issue is to evaluate the same ML pipeline using two different splitting strategies:
1. Standard (ungrouped) random CV.
2. Grouped CV (Sec. 4.4), which strictly ensures all samples from a given group are isolated within a single fold (see the sketch after this list).
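A minimal sketch of this diagnostic, assuming scikit-learn; the synthetic data below is hypothetical and deliberately constructed so that group-specific offsets (a confounder) are the only learnable structure:

```python
# Group-leakage diagnostic: run the identical pipeline under ungrouped and grouped CV.
# The synthetic features contain a per-group signature but no true class signal.
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_groups, per_group, N = 20, 10, 30
groups = np.repeat(np.arange(n_groups), per_group)
y = np.repeat(rng.integers(0, 2, size=n_groups), per_group)     # one label per group
signatures = 3.0 * rng.standard_normal((n_groups, N))           # group-specific offsets
X = rng.standard_normal((n_groups * per_group, N)) + signatures[groups]

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

acc_ungrouped = cross_val_score(
    pipe, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
acc_grouped = cross_val_score(pipe, X, y, groups=groups, cv=GroupKFold(n_splits=5))

print(f"ungrouped CV accuracy: {acc_ungrouped.mean():.2f}")   # typically high: memorized signatures
print(f"grouped   CV accuracy: {acc_grouped.mean():.2f}")     # typically near chance
# A large gap between the two is the warning sign discussed below.
```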
Confounder: A variable that correlates with both the input features and the target label but is not part of the true underlying relationship. A model that relies on confounders achieves high apparent accuracy without learning the genuine pattern, and fails when the confounding correlation no longer holds (e.g., when deployed on data from a new domain).
Example 11.3: A classifier is trained to detect pneumonia from chest X-rays collected at two hospitals. Hospital A uses portable X-ray machines (mostly for bedridden, sicker patients) and Hospital B uses fixed machines (mostly for ambulatory patients). The classifier learns to distinguish portable from fixed X-ray artifacts (a confounder that correlates with disease severity) rather than the actual lung pathology. It achieves high accuracy on data from both hospitals but fails on X-rays from a third hospital with a different equipment mix.
If the ungrouped cross-validation yields substantially higher accuracy than the grouped cross-validation, it is a strong indication that the model is exploiting group-specific confounders. In such scenarios, relying solely on grouped CV for evaluation may be insufficient if the ultimate goal is to build a highly accurate, generalizable model. The underlying data representation itself must be improved to prevent the model from learning the confounders in the first place. Common mitigation strategies include:
• Subject-specific normalization: Removing baseline offsets by standardizing features on a per-group basis (e.g., subtracting a patient’s resting baseline from their active measurements) rather than computing statistics globally across the entire dataset (a sketch follows this list).
• Feature engineering: Designing or extracting features that are fundamentally invariant to the known group-specific confounders (e.g., focusing on relative frequency-domain changes instead of absolute raw time-domain amplitudes).
• Domain adaptation techniques: Employing models, architectures, or loss functions (such as adversarial training) specifically designed to learn invariant representations across different domains or groups.
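For the first mitigation, a minimal NumPy sketch of per-group standardization follows; normalize_per_group is a hypothetical helper, not a library function:

```python
# Subject-specific normalization: z-score each group's rows with that group's own
# statistics, removing group-level baseline offsets before any modelling.
import numpy as np

def normalize_per_group(X, groups, eps=1e-12):
    """Return a copy of X in which each group is standardized with its own mean/std."""
    X_norm = np.empty_like(X, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        mu = X[idx].mean(axis=0)
        sigma = X[idx].std(axis=0)
        X_norm[idx] = (X[idx] - mu) / (sigma + eps)
    return X_norm
```

Because the statistics are computed per group and without labels, and under grouped CV each group lies entirely in one fold, this step does not mix training and test samples.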
11.1.3 Leave-One-Out Cross-Validation (LOOCV)
LOOCV uses \(k\)-fold cross-validation with \(k=M\), leaving exactly one sample out per fold. While sometimes necessary for extremely small datasets (Sec. 4.4), LOOCV suffers from several drawbacks that make it a potential source of misleadingly optimistic results:
1. High computational cost. LOOCV requires training the model \(M\) times. For large datasets this is impractical; it is only feasible when \(M\) is small.
2. High variance in performance estimates. Because only one sample is left out, the \(M\) training sets are nearly identical to each other and have low bias. However, the resulting models are highly correlated, and the average of correlated estimates can have higher variance than \(5\)-fold or \(10\)-fold cross-validation.
3. Sensitivity to outliers. Each data point serves as the sole validation sample in exactly one fold. A single outlier or noisy point can strongly influence that fold’s error, producing unreliable overall estimates compared to \(k\)-fold where outliers are diluted within larger folds.
4. Potential for overfitting. Because each training set contains \(M-1\) points, the resulting models are very similar to the full-data model. In high-dimensional or noisy settings (\(N\gg M\)), LOOCV can favor overly complex models that fit the training data too closely.
Example 11.4: LOOCV with \(N \gg M\) can yield near-perfect accuracy even on pure noise.
Consider \(M=80\) samples with \(N=2048\) features drawn entirely from i.i.d. \(\mathcal {N}(0,1)\) (pure noise) and binary labels with class sizes \(51\) vs. \(29\) (majority baseline \(\approx 0.64\)). A pipeline of standardization \(\rightarrow \) PCA(\(5\)) \(\rightarrow \) RBF-SVM \((C=0.05)\) with LOOCV achieves \(100\%\) accuracy across \(10\) random seeds—despite the features containing no real signal.
This illustrates how LOOCV combined with high dimensionality (\(N \gg M\)) can produce misleadingly optimistic performance estimates. The diagnostic methods described in the remainder of this chapter—particularly the random feature baseline (Sec. 11.2.2) and the permutation test (Sec. 11.2.1)—can detect such conditions.
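A sketch of the setup in Example 11.4 is given below, assuming scikit-learn; the exact accuracy you observe depends on the seed and library version, so treat the printed numbers as a diagnostic comparison against the majority baseline rather than a guaranteed reproduction:

```python
# LOOCV on pure noise with N >> M, using the pipeline of Example 11.4:
# standardization -> PCA(5) -> RBF-SVM (C = 0.05).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
M, N = 80, 2048
X = rng.standard_normal((M, N))            # pure i.i.d. N(0,1) noise, no signal
y = np.array([0] * 51 + [1] * 29)          # majority baseline = 51/80 ~= 0.64

pipe = make_pipeline(StandardScaler(), PCA(n_components=5), SVC(kernel="rbf", C=0.05))
acc = cross_val_score(pipe, X, y, cv=LeaveOneOut()).mean()

print(f"LOOCV accuracy on pure noise: {acc:.2f}")
print(f"majority baseline:            {max(np.bincount(y)) / M:.2f}")
# Any accuracy well above the baseline is an evaluation artifact, because the
# features carry no information about the labels.
```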
11.2 Statistical Validation
Both tests in this section share a common logic: destroy the signal that a valid pipeline should rely on, and verify that performance drops to chance. The permutation test destroys the labels; the random feature baseline destroys the features.
11.2.1 Permutation Test
Permutation test: A statistical test that evaluates whether the observed CV performance is significantly better than what would be obtained by chance:
1. Train and evaluate the model using CV, obtaining the real score \(J_{\text {real}}\).
2. Randomly shuffle the labels \(\by \), breaking any true relationship between \(\bX \) and \(\by \).
3. Re-run the same CV on the shuffled labels, obtaining a permuted score \(J_{\text {perm}}^{(i)}\).
4. Repeat steps 2–3 \(T\) times (typically \(T=1000\)) to build a null distribution—the distribution of scores when there is no true relationship between \(\bX \) and \(\by \).
The test is framed as a hypothesis test:
• \(H_0\): The features \(\bX \) and labels \(\by \) are independent — the model has no predictive power.
• \(H_1\): \(\bX \) carries genuine information about \(\by \).
Under \(H_0\), shuffling the labels does not change the joint distribution, so the permuted scores \(J_{\text {perm}}^{(i)}\) represent the distribution of performance expected by chance.
The p-value is the fraction of permuted scores that are at least as good as the real score,
\begin{equation} p = \frac {1}{T+1}\left (\sum _{i=1}^{T} \bm {1}\left [J_{\text {perm}}^{(i)} \geq J_{\text {real}}\right ] + 1\right ) \end{equation}
where \(\bm {1}[\cdot ]\) is the indicator function. The \(+1\) in numerator and denominator accounts for the real score itself.
Decision rule The decision is made by comparing the p-value to a predetermined significance level \(\alpha \), which is the maximum acceptable probability of incorrectly rejecting \(H_0\). Common choices are \(\alpha = 0.05\) (5%) and \(\alpha = 0.01\) (1%).
• If \(p < \alpha \), reject \(H_0\): the model has learned a genuine relationship between \(\bX \) and \(\by \).
• If \(p \ge \alpha \), fail to reject \(H_0\): the observed accuracy is not statistically distinguishable from chance.
Choosing \(T\) The number of permutations \(T\) determines the resolution of the p-value: the smallest achievable p-value is \(\frac {1}{T+1}\). For standard significance testing at \(\alpha = 0.05\), \(T=100\) suffices. For more precise p-values (e.g., \(p < 0.001\)), use \(T \geq 1000\). Each permutation requires a full CV run, so the total cost is \((T+1)\) times a single CV evaluation.
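A minimal sketch of the test, assuming scikit-learn, whose permutation_test_score routine implements this procedure (including the \(+1\) correction in the p-value); the dataset below is a hypothetical stand-in:

```python
# Permutation test: compare the real CV score against a null distribution built by
# re-running the same CV on label-shuffled copies of the data.
import numpy as np
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
M, N = 100, 20
X = rng.standard_normal((M, N))
y = (X[:, 0] + 0.5 * rng.standard_normal(M) > 0).astype(int)   # weak true signal in feature 0

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
j_real, null_scores, p_value = permutation_test_score(
    pipe, X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_permutations=1000,        # T = 1000, so the smallest achievable p-value is 1/1001
    random_state=0,
)

print(f"J_real = {j_real:.2f}, null mean = {null_scores.mean():.2f}, p = {p_value:.4f}")
# Reject H0 at alpha = 0.05 when p < 0.05.
```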
11.2.2 Random Feature Baseline
While the permutation test (Sec. 11.2.1) keeps the real features and destroys the labels, the random feature baseline takes the complementary approach: keep the real labels but destroy the features.
Replace the real feature matrix \(\bX \) with random Gaussian noise \(\bX _{\text {rand}} \sim \mathcal {N}(0,1)\) of the same dimensions, and run the identical pipeline (same pre-processing, same model, same CV). The expected accuracy should be close to the majority-class rate,
\begin{equation} \label {eq-chance-accuracy} J_{\text {chance}} = \frac {\max (M_0, M_1)}{M} \end{equation}
where \(M_0\) and \(M_1\) are the class counts.
If the pipeline achieves accuracy substantially above \(J_{\text {chance}}\) on random features, the evaluation procedure itself is flawed (e.g., data leakage as in Sec. 11.3.1, or LOOCV in high dimensions as in Example 11.4).
On imbalanced datasets with stratified CV, the random-feature baseline may slightly exceed \(J_{\text {chance}}\) due to the interaction between stratification and the classifier’s bias toward the majority class. A deviation of a few percentage points above \(J_{\text {chance}}\) is not necessarily alarming; a substantial deviation (e.g., \(>10\%\) above chance) indicates a pipeline problem.
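A minimal sketch of the baseline, assuming scikit-learn; random_feature_baseline is a hypothetical helper that re-uses whatever pipeline and CV splitter the real evaluation employs:

```python
# Random feature baseline: keep the labels and the full pipeline, but replace the
# features with Gaussian noise of identical shape. Accuracy should stay near J_chance.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def random_feature_baseline(pipe, X, y, cv, seed=0):
    """Run the given pipeline/CV on noise features of the same shape, with the real labels."""
    rng = np.random.default_rng(seed)
    X_rand = rng.standard_normal(X.shape)        # same dimensions, zero information
    return cross_val_score(pipe, X_rand, y, cv=cv).mean()

# Hypothetical usage with a placeholder pipeline and an imbalanced label vector:
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 50))
y = rng.permutation(np.repeat([0, 1], [84, 36]))
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

acc_rand = random_feature_baseline(pipe, X, y, cv)
j_chance = max(np.bincount(y)) / len(y)
print(f"random-feature accuracy {acc_rand:.2f} vs J_chance {j_chance:.2f}")
```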
11.3 Data Integrity
Even with a correctly configured CV strategy, problems in the data can produce misleading results. This section covers two common issues: data leakage (information from outside the training set influencing model training) and domain inconsistency (distribution shifts between training and deployment).
11.3.1 Data Leakage
Data leakage: Information from outside the training set—such as test labels, future observations, or global statistics—influences model training, producing overly optimistic performance estimates.
Common sources of leakage:
• Pre-processing leakage: fitting data-dependent transformations (e.g., standardization, PCA, feature selection) on the entire dataset before splitting into CV folds. The test fold then contains information that was used during training.
• Temporal leakage: in time-series problems, using future observations to predict the past. Standard random splits violate the causal ordering; temporal (forward-chaining) splits should be used instead.
• Target leakage: including features that are derived from or strongly correlated with the target variable (e.g., a feature computed from the label, or a proxy that is unavailable at prediction time).
Pre-processing leakage example. Consider a pipeline that applies standardization (Sec. 4.5) before classification with k-fold CV (Sec. 4.4). If the mean \(\bar {\bx }\) and standard deviation \(s_\bx \) are computed on the entire dataset (all \(M\) samples) and then the standardized features are split into CV folds, the test fold’s features encode information about the global data distribution, including the test samples themselves. The correct approach is to compute \(\bar {\bx }\) and \(s_\bx \) inside each CV fold on the training partition only, and then transform the test partition using the training-derived statistics. The same principle applies to PCA, supervised feature selection (Sec. 9), and any other data-dependent transformation.
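A minimal sketch of this contrast, assuming scikit-learn; supervised feature selection is used as the data-dependent step because it makes the leak-induced optimism visible even on pure noise, but the same pattern applies to standardization and PCA:

```python
# Pre-processing leakage: a data-dependent step fitted on ALL samples before CV (leaky)
# versus the same step fitted inside each fold via a Pipeline (correct).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
M, N = 60, 5000
X = rng.standard_normal((M, N))                  # pure noise: no true relationship with y
y = rng.permutation(np.repeat([0, 1], [30, 30]))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Leaky: features selected using the labels of ALL samples, then cross-validated.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
acc_leaky = cross_val_score(clf, X_leaky, y, cv=cv).mean()

# Correct: the scaler and selector live inside the Pipeline, so they are re-fitted
# on the training partition of every fold only.
pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), clf)
acc_correct = cross_val_score(pipe, X, y, cv=cv).mean()

print(f"leaky pipeline accuracy:   {acc_leaky:.2f}")    # typically far above 0.5
print(f"correct pipeline accuracy: {acc_correct:.2f}")  # typically close to 0.5
```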
Prevention & Best Practices.
• All data-dependent pre-processing steps (e.g., supervised feature selection, PCA, normalization as in Sec. 4.5) must be fitted strictly inside each CV fold on the training partition only.
• Audit feature provenance to ensure no target-derived or future-derived information enters the feature matrix.
• Ensure an appropriate number of folds: too few folds, especially when combined with class imbalance, can produce biased estimates; too many folds (approaching LOOCV) can produce high-variance estimates and mask overfitting in high-dimensional settings (Sec. 11.1.3).
11.3.2 Domain Consistency
Even when grouped CV (Sec. 11.1.2) correctly prevents within-group leakage, the model may still fail if the deployment domain differs systematically from the training domains.
Domain Inconsistency (Dataset Shift): A scenario where the distribution of the data (features or labels) in the deployment environment differs from the distribution in the training data. Common sources include differences in acquisition hardware (e.g., scanner manufacturer), environmental conditions, population demographics, or labeling conventions.
Given two datasets \(\bX _A\in \mathbb {R}^{M_A\times N}\) and \(\bX _B\in \mathbb {R}^{M_B\times N}\), a natural question is: can a classifier tell them apart? If yes, the domains are inconsistent.
Classifier two-sample test
The idea is to recast the distributional question as a binary classification problem:
1. Assign artificial labels \(y=0\) to all samples in \(\bX _A\) and \(y=1\) to all samples in \(\bX _B\).
2. Concatenate into \(\bX = [\bX _A;\,\bX _B]\in \mathbb {R}^{M\times N}\), \(M=M_A+M_B\), with label vector \(\by \).
3. Train a binary classifier on \((\bX ,\by )\) using stratified \(k\)-fold CV and record the accuracy \(J_{\text {real}}\).
The chance-level accuracy is the majority-class rate \(J_{\text {chance}} = \max (M_A, M_B)/M\) (Eq. 11.2). If \(J_{\text {real}} \approx J_{\text {chance}}\), the two datasets are consistent with being drawn from the same distribution; if \(J_{\text {real}} \gg J_{\text {chance}}\), the distributions differ.
Statistical significance via permutation test To determine whether \(J_{\text {real}}\) is significantly above \(J_{\text {chance}}\), apply the permutation test (Sec. 11.2.1) with the following application-specific interpretation:
• Labels shuffled: the domain labels \(\by \) (the artificial \(0/1\) assignments), breaking any association between features and domain membership.
• Null distribution: the classifier accuracy expected when samples from \(\bX _A\) and \(\bX _B\) are interchangeable, i.e., when no domain difference exists.
• Decision: if \(p < \alpha \), reject \(H_0\) — the two datasets are statistically distinguishable and domain inconsistency is present (a code sketch of the full test follows this list).
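A minimal sketch of the full procedure, assuming scikit-learn; domain_consistency_check is a hypothetical helper, and the two synthetic "domains" below differ by a small mean shift:

```python
# Classifier two-sample test: label the domains, train a domain classifier with
# stratified CV, and assess significance via a permutation test over the domain labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def domain_consistency_check(X_A, X_B, n_permutations=1000, seed=0):
    """Return (J_real, J_chance, p_value) for the classifier two-sample test."""
    X = np.vstack([X_A, X_B])
    y = np.concatenate([np.zeros(len(X_A), dtype=int), np.ones(len(X_B), dtype=int)])
    j_chance = max(len(X_A), len(X_B)) / len(X)

    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    j_real, _, p_value = permutation_test_score(
        pipe, X, y,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
        n_permutations=n_permutations,
        random_state=seed,
    )
    return j_real, j_chance, p_value

# Hypothetical usage: two synthetic domains with a mean shift of 0.5 in every feature.
rng = np.random.default_rng(0)
X_A = rng.standard_normal((60, 128))
X_B = rng.standard_normal((40, 128)) + 0.5
j_real, j_chance, p = domain_consistency_check(X_A, X_B)
print(f"J_real = {j_real:.2f}, J_chance = {j_chance:.2f}, p = {p:.4f}")
```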
Example 11.5: Dataset A contains \(M_A=60\) EEG recordings from Hospital 1 and dataset B contains \(M_B=40\) recordings from Hospital 2, both with \(N=128\) spectral features:
Samples: Hospital 1 (\(y=0\)): 60; Hospital 2 (\(y=1\)): 40
Chance accuracy: \(J_{\text {chance}} = 60/100 = 0.60\)
Domain classifier accuracy: \(J_{\text {real}} = 0.91 \gg J_{\text {chance}}\)
Permutation test (\(\alpha =0.05\)): \(p < 0.001 \Rightarrow \) reject \(H_0\)

The classifier easily separates the two hospitals (\(J_{\text {real}}=0.91 \gg 0.60\), \(p<\alpha \)), indicating that the feature distributions differ substantially—likely due to hardware or acquisition protocol differences. A task classifier trained on Hospital 1 data alone should not be expected to generalize to Hospital 2 without domain adaptation.
Rejecting \(H_0\) does not necessarily mean the task classifier will fail—only that the feature distributions differ. However, high domain separability is a strong warning that cross-domain generalization may be poor.
11.4 Image-Specific Sanity Checks
The sanity checks in this section are specific to classifiers operating on image data (raw pixels or spatial feature maps). They test whether the model exploits spatial structure—edges, textures, shapes—or relies on superficial statistics that happen to correlate with the labels.
11.4.1 Pixel Shuffle Test
Pixel shuffle test: Given a dataset of \(M\) images, randomly permute all pixel locations within each image independently, producing a shuffled dataset \(\bX _{\text {shuf}}\). Re-extract features from \(\bX _{\text {shuf}}\) and repeat the cross-validation procedure.
Shuffling destroys all spatial structure (edges, textures, object shapes) while preserving the per-image pixel histogram (mean intensity, variance, and higher-order marginal statistics remain identical). A classifier that genuinely relies on spatial patterns will see its accuracy drop to chance level on \(\bX _{\text {shuf}}\). Conversely, if the accuracy remains high after shuffling, the classifier exploits per-pixel statistics rather than spatial structure, and the original cross-validation result is unreliable.
The pixel shuffle test is specific to classifiers that operate on raw pixel features or spatial feature maps (e.g., convolutional networks). It does not apply when hand-crafted, non-spatial features are extracted before classification.
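For classifiers that do operate on raw pixels, a minimal NumPy sketch of the shuffle step follows; shuffle_pixels is a hypothetical helper and assumes images stored as an array of shape (num_images, height, width):

```python
# Pixel shuffle test: destroy all spatial structure while preserving each image's
# pixel histogram, by permuting pixel positions independently within every image.
import numpy as np

def shuffle_pixels(images, seed=0):
    """Return a copy of `images` (num_images, H, W) with pixels permuted per image."""
    rng = np.random.default_rng(seed)
    shuffled = np.empty_like(images)
    for i, img in enumerate(images):
        shuffled[i] = rng.permutation(img.ravel()).reshape(img.shape)
    return shuffled

# Re-extract features from shuffle_pixels(images) and repeat the identical CV; accuracy
# should fall to chance if the classifier genuinely relies on spatial structure.
```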
11.4.2 Black-Patch Test
Black-patch test: Occlude a randomly positioned square region of each image with a black (zero or other constant-valued) patch of side length \(s\) pixels. Re-extract features and repeat cross-validation. Repeat for progressively larger patch sizes (e.g., \(s \in \{16, 32, 56, 112\}\)) to obtain an accuracy-versus-occlusion curve.
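A minimal NumPy sketch of the occlusion step; occlude_random_patch is a hypothetical helper and assumes grayscale images of shape (num_images, height, width), with the patch position drawn uniformly per image:

```python
# Black-patch test: occlude one randomly positioned s x s square per image with a
# constant value, then repeat feature extraction and CV for several patch sizes s.
import numpy as np

def occlude_random_patch(images, patch_size, fill_value=0, seed=0):
    """Return a copy of `images` (num_images, H, W) with one random patch occluded per image."""
    rng = np.random.default_rng(seed)
    occluded = images.copy()
    n, h, w = images.shape
    for i in range(n):
        top = rng.integers(0, h - patch_size + 1)
        left = rng.integers(0, w - patch_size + 1)
        occluded[i, top:top + patch_size, left:left + patch_size] = fill_value
    return occluded

# Sweep patch sizes to build the accuracy-versus-occlusion curve, e.g.
# for s in (16, 32, 56, 112): evaluate the pipeline on occlude_random_patch(images, s)
```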
The black-patch test probes whether the classifier uses distributed spatial information or relies on a localized region. Three representative outcomes:
• Gradual, monotonic decline in accuracy as \(s\) increases—the classifier uses information spread across the image, which is the expected behavior for a well-trained model.
• Sharp drop at a specific patch size—the classifier depends on a localized region. This may indicate a genuine region of interest, but could also signal reliance on a confounding artifact (e.g., a timestamp, label overlay, or acquisition marker embedded in a fixed image location).
• Stable accuracy across all patch sizes—the classifier is largely insensitive to spatial content, raising concern that it exploits non-image-based confounders (e.g., differences in file encoding or image dimensions between classes).
When interpreting the black-patch test, consider the patch area relative to the total image area. A \(112 \times 112\) patch occludes roughly \(44\%\) of a \(168 \times 168\) image but only \(5\%\) of a \(512 \times 512\) image. Report patch sizes as fractions of the image dimensions to allow meaningful comparison across datasets.