Machine Learning & Signals Learning
14 Classifier Comparison
14.1 Comparison Between Two Non-Probabilistic Classifiers
General methods focus on the performance of a single classifier. This section presents dedicated methods for comparing two classifiers.
14.1.1 Summary
The following methods compare two classifiers by analyzing their hard label predictions on a shared test set. All are built on the four counts of the contingency table (Sec. 14.1.2).
| Method | Question answered |
| Contingency table (Sec. 14.1.2) | Where do the classifiers disagree? |
| Disagreement (Sec. 14.1.3) | What fraction of samples do they disagree on? |
| McNemar’s test (Sec. 14.1.4) | Is the difference statistically significant? |
| Cohen’s \(\kappa \) (Sec. 14.1.5) | How much do they agree beyond chance? |
| Yule’s \(Q\) (Sec. 14.1.6) | How correlated are their errors? |
14.1.2 Contingency Table
| | Classifier 1 (C1) correct, \(\hat {Y}_1=Y\) | Classifier 1 (C1) incorrect, \(\hat {Y}_1\ne Y\) |
| Classifier 2 (C2) correct, \(\hat {Y}_2=Y\) | Both correct, \(n_{11}\) | C1 incorrect, C2 correct, \(n_{01}\) |
| Classifier 2 (C2) incorrect, \(\hat {Y}_2\ne Y\) | C1 correct, C2 incorrect, \(n_{10}\) | Both incorrect, \(n_{00}\) |
Denote the four cell entries (Fig. 14.1) as:
-
• \(n_{11}\) – both classifiers correct
-
• \(n_{10}\) – C1 correct, C2 incorrect
-
• \(n_{01}\) – C1 incorrect, C2 correct
-
• \(n_{00}\) – both classifiers incorrect
with \(n_{11}+n_{10}+n_{01}+n_{00} = M\).
The accuracy of each classifier in terms of the contingency table entries is
\(\seteqnumber{0}{}{0}\)\begin{equation} \text {Accuracy}_1 = \frac {n_{11}+n_{10}}{M}, \qquad \text {Accuracy}_2 = \frac {n_{11}+n_{01}}{M} \end{equation}
The diagonal cells (\(n_{11}\), \(n_{00}\)) represent samples on which both classifiers agree. The off-diagonal cells (\(n_{10}\), \(n_{01}\)) are the informative ones: they count samples where exactly one classifier succeeds and the other fails. If \(n_{10} \gg n_{01}\), classifier 1 is superior; if \(n_{01} \gg n_{10}\), classifier 2 is superior; if \(n_{10} \approx n_{01}\), they perform similarly despite possibly making errors on different samples.
-
Example 14.1: Two classifiers (Table 14.2) are evaluated on \(M=200\) test samples. Classifier 1 accuracy: \((150+25)/200 = 87.5\%\). Classifier 2 accuracy: \((150+15)/200 = 82.5\%\).
| | C1 correct | C1 incorrect |
| C2 correct | 150 | 15 |
| C2 incorrect | 25 | 10 |
Table 14.2: Contingency table for two classifiers on \(M=200\) samples.
The accuracies are relatively close (a difference of 5 percentage points). However, the contingency table reveals a clearer picture: classifier 1 uniquely gets \(n_{10}=25\) samples right that classifier 2 misses, while classifier 2 uniquely gets only \(n_{01}=15\). Classifier 1 is correct on \(25-15=10\) more “disputed” samples.
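The four counts can be tallied directly from hard predictions. The following is a minimal sketch; the function names and the toy label arrays are hypothetical and serve only as an illustration.

```python
def contingency_counts(y_true, pred1, pred2):
    """Return (n11, n10, n01, n00) for two classifiers' hard predictions."""
    n11 = n10 = n01 = n00 = 0
    for y, a, b in zip(y_true, pred1, pred2):
        c1_ok, c2_ok = (a == y), (b == y)
        if c1_ok and c2_ok:
            n11 += 1          # both correct
        elif c1_ok:
            n10 += 1          # C1 correct, C2 incorrect
        elif c2_ok:
            n01 += 1          # C1 incorrect, C2 correct
        else:
            n00 += 1          # both incorrect
    return n11, n10, n01, n00


def accuracies(n11, n10, n01, n00):
    """Accuracy of each classifier from the contingency counts."""
    m = n11 + n10 + n01 + n00
    return (n11 + n10) / m, (n11 + n01) / m


# Hypothetical toy data (not from the text).
y_true = [1, 0, 1, 1, 0, 0]
pred1 = [1, 0, 0, 1, 1, 0]
pred2 = [1, 1, 0, 1, 0, 0]
counts = contingency_counts(y_true, pred1, pred2)

# Counts of Table 14.2.
acc1, acc2 = accuracies(150, 25, 15, 10)
print(counts, acc1, acc2)
```

Working from counts rather than raw predictions keeps the later measures (disagreement, \(\kappa \), \(Q\)) as simple functions of four integers.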
14.1.3 Disagreement Measure
The disagreement measure is the ratio of instances on which the two classifiers produce different outcomes to the total number of instances:
\(\seteqnumber{0}{}{1}\)\begin{equation} \text {Dis} = \frac {n_{10} + n_{01}}{M} \end{equation}
Range: \(\text {Dis}\in [0,\,1]\). \(\text {Dis}=0\) indicates that the classifiers succeed and fail on exactly the same samples (zero diversity). \(\text {Dis}=1\) indicates that on every sample exactly one of the two classifiers is correct.
-
Example 14.1: Back to the previous example (Table 14.2), \(n_{10}=25\), \(n_{01}=15\), \(M=200\).
\(\seteqnumber{0}{}{2}\)\begin{equation} \text {Dis} = \frac {25 + 15}{200} = \frac {40}{200} = 0.2 \end{equation}
The classifiers disagree on \(20\%\) of the samples.
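As a quick numerical check, the disagreement measure can be written as a one-line function of the four counts. The function name is an illustrative assumption.

```python
def disagreement(n11, n10, n01, n00):
    """Fraction of samples on which exactly one classifier is correct."""
    return (n10 + n01) / (n11 + n10 + n01 + n00)


# Counts of Table 14.2.
dis = disagreement(150, 25, 15, 10)
print(dis)
```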
14.1.4 McNemar’s Test
The contingency table shows how two classifiers differ, but does not indicate whether the difference is due to chance. McNemar’s test addresses this by testing the null hypothesis that the two off-diagonal counts are equal in expectation:
\(\seteqnumber{0}{}{3}\)\begin{equation} \begin{aligned} H_0\colon \E [n_{10}] = \E [n_{01}]\\ H_1\colon \E [n_{10}] \ne \E [n_{01}] \end {aligned} \end{equation}
i.e., the two classifiers have the same error rate, and any observed difference is due to sampling variability.
McNemar’s test only uses the off-diagonal cells (\(n_{10}\), \(n_{01}\)). Samples where both classifiers agree provide no information about which is better.
Parametric and non-parametric forms. McNemar’s test is the parametric (asymptotic) form: under \(H_0\), its statistic \(\chi ^2=(n_{10}-n_{01})^2/(n_{10}+n_{01})\) approximately follows a chi-square distribution with one degree of freedom. Its non-parametric (distribution-free) counterpart is a permutation test (Sec. 13.3.1).
p-value interpretation. The p-value is the probability of observing a test statistic at least as extreme as the one actually observed, assuming \(H_0\) is true\(^{1}\). The decision is made by comparing the p-value to a predetermined significance level \(\alpha \), which is the maximum probability of incorrectly rejecting \(H_0\) (i.e., concluding the classifiers differ when they actually do not). Common choices are \(\alpha = 0.05\) (5%) and \(\alpha = 0.01\) (1%).
-
• If \(p < \alpha \), we reject \(H_0\): the observed difference is statistically significant at level \(\alpha \).
-
• If \(p \ge \alpha \), we fail to reject \(H_0\): we cannot conclude that one classifier is better.
1 The derivation of the p-value for this test is outside the scope of this chapter. It can be computed numerically with a standard implementation (in Python, e.g., statsmodels.stats.contingency_tables.mcnemar).
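For a numerical illustration, the sketch below evaluates the uncorrected statistic \(\chi ^2=(n_{10}-n_{01})^2/(n_{10}+n_{01})\) and its asymptotic p-value, together with an exact binomial variant. It is a minimal sketch under stated assumptions: no continuity correction is applied, the function names are hypothetical, and the chi-square tail with one degree of freedom is obtained from the identity \(\Pr (\chi ^2_1>x)=\mathrm {erfc}(\sqrt {x/2})\).

```python
import math


def mcnemar_asymptotic(n10, n01):
    """Uncorrected McNemar statistic and chi-square (1 dof) p-value."""
    if n10 + n01 == 0:
        return 0.0, 1.0       # no discordant samples: nothing to test
    chi2 = (n10 - n01) ** 2 / (n10 + n01)
    p = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail of chi-square, 1 dof
    return chi2, p


def mcnemar_exact(n10, n01):
    """Exact two-sided p-value: under H0, n10 ~ Binomial(n10 + n01, 0.5)."""
    n = n10 + n01
    k = min(n10, n01)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Off-diagonal counts of Table 14.2.
chi2, p_asym = mcnemar_asymptotic(25, 15)
p_exact = mcnemar_exact(25, 15)
print(chi2, p_asym, p_exact)
```

The exact variant is usually preferred when \(n_{10}+n_{01}\) is small, where the chi-square approximation is less reliable.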
14.1.5 Cohen’s Kappa Coefficient
McNemar’s test determines whether two classifiers differ significantly, but does not measure how much they agree. Cohen’s kappa coefficient \(\kappa \) addresses this by comparing observed agreement with the agreement expected under independent (random) predictions.
Using the contingency table notation (Fig. 14.1), define the observed agreement as the fraction of samples on which both classifiers give the same result (both correct or both incorrect):
\(\seteqnumber{0}{}{4}\)\begin{equation} p_o = \frac {n_{11} + n_{00}}{M} \end{equation}
Recall that the accuracy of each classifier is
\(\seteqnumber{0}{}{5}\)\begin{equation} \text {Accuracy}_1 = \frac {n_{11}+n_{10}}{M}, \qquad \text {Accuracy}_2 = \frac {n_{11}+n_{01}}{M} \end{equation}
If the two classifiers were independent, the probability that both are correct on the same sample is \(\text {Accuracy}_1 \cdot \text {Accuracy}_2\), and the probability that both are incorrect is \((1-\text {Accuracy}_1)(1-\text {Accuracy}_2)\). The expected agreement by chance is the sum of these two cases:
\(\seteqnumber{0}{}{6}\)\begin{equation} p_e = \underbrace {\text {Accuracy}_1 \cdot \text {Accuracy}_2}_{\text {both correct}} + \underbrace {(1-\text {Accuracy}_1)(1-\text {Accuracy}_2)}_{\text {both incorrect}} \end{equation}
Cohen’s kappa is then
\(\seteqnumber{0}{}{7}\)\begin{equation} \kappa = \frac {p_o - p_e}{1 - p_e} \end{equation}
Range: \(\kappa = 1\) indicates perfect agreement (the classifiers always agree). \(\kappa = 0\) indicates agreement no better than chance. \(\kappa < 0\) indicates agreement worse than chance (systematic disagreement).
-
Example 14.1: Back to the previous example (Table 14.2), \(n_{11}=150\), \(n_{10}=25\), \(n_{01}=15\), \(n_{00}=10\), \(M=200\).
\(\seteqnumber{0}{}{8}\)\begin{equation} \begin{aligned} p_o &= \frac {150+10}{200} = 0.80 \\[3pt] \text {Accuracy}_1 &= \frac {150+25}{200} = 0.875, \quad \text {Accuracy}_2 = \frac {150+15}{200} = 0.825 \\[3pt] p_e &= 0.875 \times 0.825 + 0.125 \times 0.175 = 0.7219 + 0.0219 = 0.7438 \\[3pt] \kappa &= \frac {0.80 - 0.7438}{1 - 0.7438} = \frac {0.0562}{0.2562} \approx 0.219 \end {aligned} \end{equation}
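The low \(\kappa \approx 0.22\) indicates only fair agreement beyond chance. For the same table, McNemar’s statistic is \(\chi ^2=(25-15)^2/(25+15)=2.5\) (\(p\approx 0.11\)), which is not significant at \(\alpha =0.05\): the classifiers do not differ significantly in accuracy, yet they agree only modestly.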
Kappa measures agreement, not individual accuracy. Two equally poor classifiers that make the same errors on the same samples will have high \(\kappa \) despite low accuracy.
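The computation of \(\kappa \) from the four counts can be sketched as follows. The function name is an illustrative assumption, and the degenerate case \(p_e=1\) (both classifiers always correct or always incorrect) is mapped to NaN by choice.

```python
def cohen_kappa(n11, n10, n01, n00):
    """Cohen's kappa on correct/incorrect outcomes of two classifiers."""
    m = n11 + n10 + n01 + n00
    p_o = (n11 + n00) / m                     # observed agreement
    acc1 = (n11 + n10) / m
    acc2 = (n11 + n01) / m
    p_e = acc1 * acc2 + (1 - acc1) * (1 - acc2)   # chance agreement
    if p_e == 1.0:
        return float("nan")                   # kappa undefined
    return (p_o - p_e) / (1 - p_e)


# Counts of Table 14.2.
kappa = cohen_kappa(150, 25, 15, 10)
print(round(kappa, 3))
```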
14.1.6 Q-Statistic (Yule’s Q)
While Cohen’s \(\kappa \) measures the absolute level of agreement, the Q-statistic (Yule’s Q) measures the direction and strength of association between the classifiers’ outcomes. Using the contingency table notation (Fig. 14.1):
\(\seteqnumber{0}{}{9}\)\begin{equation} Q = \frac {n_{11}\, n_{00} - n_{10}\, n_{01}}{n_{11}\, n_{00} + n_{10}\, n_{01}} \end{equation}
Range: \(Q\in [-1,\,1]\).
-
• \(Q=1\): perfect positive association, attained when \(n_{10}=0\) or \(n_{01}=0\) (e.g., whenever one classifier is correct, so is the other).
-
• \(0<Q<1\): the classifiers are positively correlated, meaning they tend to recognize the same instances correctly and incorrectly.
-
• \(Q=0\): the classifiers are independent.
-
• \(-1<Q<0\): the classifiers are negatively correlated, meaning they tend to make errors on different instances.
-
• \(Q=-1\): perfect negative association, attained when \(n_{11}=0\) or \(n_{00}=0\) (e.g., the classifiers never fail on the same sample).
-
Example 14.1: Back to the previous example (Table 14.2), \(n_{11}=150\), \(n_{10}=25\), \(n_{01}=15\), \(n_{00}=10\).
\(\seteqnumber{0}{}{10}\)\begin{equation} Q = \frac {150 \times 10 - 25 \times 15}{150 \times 10 + 25 \times 15} = \frac {1500 - 375}{1500 + 375} = \frac {1125}{1875} = 0.6 \end{equation}
The moderate positive \(Q=0.6\) indicates that the classifiers’ errors are positively correlated: they tend to succeed and fail on the same samples. Compare with \(\kappa \approx 0.22\), which reflects only fair agreement beyond chance. This illustrates the distinction: \(Q\) captures the direction and strength of the association, while \(\kappa \) measures the level of agreement beyond chance.
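A minimal sketch of the \(Q\) computation is given below. The function name is hypothetical, and returning NaN when the denominator vanishes is a choice made here, not something prescribed by the text.

```python
def yule_q(n11, n10, n01, n00):
    """Yule's Q from the contingency counts."""
    num = n11 * n00 - n10 * n01
    den = n11 * n00 + n10 * n01
    if den == 0:
        return float("nan")   # Q undefined when both products are zero
    return num / den


# Counts of Table 14.2.
q = yule_q(150, 25, 15, 10)
print(q)
```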
Summary of Non-Probabilistic Comparisons
For quick reference, Table 14.3 summarizes how to interpret each subrange of the four metrics.
| Metric | Range | Subrange | Interpretation |
| Disagreement \(d=\text {Dis}\) (Sec. 14.1.3) | \([0,\,1]\) | \(d \approx 0\) | Classifiers behave nearly identically on the test set |
| | | \(d \to 1\) | Classifiers make different predictions on many samples |
| McNemar’s \(p\)-value (Sec. 14.1.4) | \([0,\,1]\) | \(p < 0.05\) | Accuracy difference is statistically significant at \(\alpha =0.05\) |
| | | \(p \ge 0.05\) | No significant evidence that the classifiers differ in accuracy |
| Cohen’s \(\kappa \) (Sec. 14.1.5) | \([-1,\,1]\) | \(\kappa < 0\) | Worse than chance (systematic disagreement) |
| | | \(0 \le \kappa < 0.20\) | Slight agreement beyond chance |
| | | \(0.20 \le \kappa < 0.40\) | Fair agreement |
| | | \(0.40 \le \kappa < 0.60\) | Moderate agreement |
| | | \(0.60 \le \kappa < 0.80\) | Substantial agreement |
| | | \(0.80 \le \kappa \le 1\) | Almost perfect agreement |
| Yule’s \(Q\) (Sec. 14.1.6) | \([-1,\,1]\) | \(Q = -1\) | Perfect negative association: \(n_{11}=0\) or \(n_{00}=0\) |
| | | \(-1 < Q < 0\) | Errors anti-correlated, a desirable property for ensembling |
| | | \(Q \approx 0\) | Classifier outcomes independent |
| | | \(0 < Q < 1\) | Errors positively correlated |
| | | \(Q = 1\) | Perfect positive association: \(n_{10}=0\) or \(n_{01}=0\) |
The four methods above answer related but distinct questions. The contrasts between them clarify when each one is needed:
-
• Disagreement (Sec. 14.1.3) vs. McNemar’s test (Sec. 14.1.4). Disagreement reports the raw fraction of samples on which the classifiers differ. McNemar adds a significance check: it asks whether the asymmetry between \(n_{10}\) and \(n_{01}\) is real or could plausibly arise by chance.
-
Example 14.2: Two classifier pairs (\(M=100\)) share the same disagreement rate \(d=0.30\) but differ sharply on McNemar.
Symmetric pair:
| | C1 correct | C1 incorrect |
| C2 correct | 60 | 15 |
| C2 incorrect | 15 | 10 |
Asymmetric pair:
| | C1 correct | C1 incorrect |
| C2 correct | 60 | 3 |
| C2 incorrect | 27 | 10 |
Symmetric pair: \(d=30/100=0.30\), \(\chi ^2=(15-15)^2/(15+15)=0\), \(p=1\). Asymmetric pair: \(d=0.30\), \(\chi ^2=(27-3)^2/(27+3)=19.2\), \(p\approx 10^{-5}\). The same raw disagreement leads to opposite conclusions about whether the accuracy gap is real.
-
-
• McNemar’s test (Sec. 14.1.4) vs. Cohen’s \(\kappa \) (Sec. 14.1.5). McNemar tests whether the two classifiers have significantly different accuracy. Cohen’s \(\kappa \) measures how much they agree beyond chance. Two classifiers can pass McNemar (no significant accuracy gap) and still have low \(\kappa \) if their joint agreement is no better than independent guessing.
-
Example 14.3: A table with all four cells equal, \(M=100\).
| | C1 correct | C1 incorrect |
| C2 correct | 25 | 25 |
| C2 incorrect | 25 | 25 |
Both accuracies equal \(0.50\) and the off-diagonals match, so McNemar gives \(\chi ^2=0\), \(p=1\). But \(p_o=(25+25)/100=0.50\) and \(p_e=0.5^2+0.5^2=0.50\), hence \(\kappa =0\). Passing McNemar is compatible with chance-level agreement.
-
-
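• Cohen’s \(\kappa \) (Sec. 14.1.5) vs. Yule’s \(Q\) (Sec. 14.1.6). Both share the sign of \(n_{11}n_{00}-n_{10}n_{01}\), so \(Q=0\) exactly when \(\kappa =0\): independent outcomes (uncorrelated errors, a desirable property for ensemble methods) yield chance-level agreement, however often the classifiers agree. Their magnitudes, however, can differ widely: high \(Q\) with low \(\kappa \) occurs when the outcomes are strongly associated but the chance baseline \(p_e\) is already high.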
-
Example 14.4: Two highly accurate classifiers, \(M=100\).
| | C1 correct | C1 incorrect |
| C2 correct | 95 | 2 |
| C2 incorrect | 2 | 1 |
\(Q=(95\cdot 1-2\cdot 2)/(95\cdot 1+2\cdot 2)=91/99\approx 0.92\), a strong positive association. Yet \(\text {Accuracy}_1=\text {Accuracy}_2=0.97\) gives \(p_e=0.97^2+0.03^2=0.9418\), and with \(p_o=0.96\) we obtain \(\kappa \approx 0.31\), only fair agreement beyond chance. When accuracy is very high the chance baseline \(p_e\) is already close to \(1\), so \(\kappa \) shrinks even when \(Q\) reports strong association.
-
-
• Disagreement (Sec. 14.1.3) vs. Yule’s \(Q\) (Sec. 14.1.6). Disagreement counts how many predictions differ. \(Q\) asks whether the errors line up on the same samples. Two classifiers can share an identical disagreement rate yet have very different \(Q\), depending on how their mistakes overlap.
-
Example 14.5: Two classifier pairs (\(M=100\)) with identical \(d=0.20\) but opposite \(Q\).
First pair:
| | C1 correct | C1 incorrect |
| C2 correct | 80 | 10 |
| C2 incorrect | 10 | 0 |
Second pair:
| | C1 correct | C1 incorrect |
| C2 correct | 40 | 10 |
| C2 incorrect | 10 | 40 |
First pair: \(d=0.20\), \(Q=(80\cdot 0-10\cdot 10)/(80\cdot 0+10\cdot 10)=-1\). Both classifiers are accurate but their few errors never coincide, an ideal regime for ensembling. Second pair: \(d=0.20\), \(Q=(40\cdot 40-10\cdot 10)/(40\cdot 40+10\cdot 10)=1500/1700\approx 0.88\). The same off-diagonal mass now sits beside a large \(n_{00}\), so \(Q\) flips from perfect anti-correlation to strong positive association.
-
14.2 Comparison Between Two Probabilistic Classifiers
The non-probabilistic methods (Sec. 14.1) reduce each classifier’s output to a hard label \(\hat {y}\in \{0,1\}\), discarding the confidence information. When classifiers produce probabilistic outputs
\[{p_k = f_{\bth _k}(\bx ) = \Pr (\hat {y}=1\mid \bx )},\]
richer comparisons are possible.
14.2.1 Summary
The following methods in Table 14.4 exploit the full predicted probability \(p_i\in [0,1]\) rather than a hard label. They are grouped into three tasks: scoring accuracy (Brier score, calibration curve, paired score difference), significance testing (paired t-test, Wilcoxon, DeLong), and diversity assessment (error correlation, Spearman rank correlation).
| Method | Question answered |
| Scoring accuracy | How well do the predicted probabilities approximate the true labels? |
| Brier score (Sec. 14.2.2) | How accurate are each classifier’s predicted probabilities? |
| Calibration curve / ECE (Sec. 14.2.3) | Where in the probability range does calibration break down? |
| Paired score difference (Sec. 14.2.4) | Which classifier has better-calibrated probabilities? |
| Significance testing | Could the observed difference arise by chance? |
| Paired t-test (Sec. 14.2.5) | Is the score difference statistically significant (normal \(\delta _i\))? |
| Wilcoxon signed-rank (Sec. 14.2.6) | Is the score difference significant (non-parametric)? |
| DeLong’s test (Sec. 14.2.7) | Do the classifiers differ in discrimination (AUC)? |
| Diversity assessment | Do the classifiers fail on the same or different samples? |
| Error correlation (Sec. 14.2.8) | How correlated are the classifiers’ errors (linear)? |
| Spearman rank correlation (Sec. 14.2.9) | How correlated are the classifiers’ errors (non-parametric)? |
14.2.2 Brier Score
AUC (Sec. 12.5) is scale-invariant: it is insensitive to the absolute predicted probability values \(p_i\) and only evaluates whether positive samples receive higher probabilities than negative ones. The Brier score complements AUC by evaluating how close the predicted probabilities are to the actual outcomes. It is defined as the mean squared error (MSE) between the predicted probability and the actual class label:
\(\seteqnumber{0}{}{11}\)\begin{equation} \text {BS} = \frac {1}{M}\sum _{i=1}^{M}\left (p_i - y_i\right )^2 \end{equation}
where \(p_i\in [0,1]\) is the predicted probability and \(y_i\in \{0,1\}\) is the actual label.
Range and reference values: \(0 \le \text {BS} \le 1\), lower is better.
-
• \(\text {BS}=0\): perfect classifier, assigns probability \(1\) to the correct class on every sample.
-
• \(\text {BS}=1\): worst case, assigns probability \(1\) to the wrong class on every sample (confidently wrong).
-
• \(\text {BS}=0.25\): constant prediction \(p_i=0.5\) for all samples, regardless of the label.
-
• \(\text {BS}=\pi _1(1-\pi _1)\): base-rate classifier that always predicts the marginal positive rate \(p_i=\pi _1=\Pr (y=1)\) as a probability. This is the optimal constant-probability predictor (it minimizes expected Brier among all constants) and serves as the natural “no-information” probabilistic baseline: on a balanced set it equals \(0.25\); on an imbalanced set (\(\pi _1=0.1\)) it drops to \(0.09\), so a small Brier score on imbalanced data is not automatically impressive.
-
• \(\text {BS}=\min (\pi _1,\,1-\pi _1)\): majority-vote classifier that always predicts the hard majority label (\(p_i=0\) if \(\pi _1<0.5\), else \(p_i=1\)). It is the accuracy-optimal constant classifier but is worse than the base-rate baseline as a probabilistic predictor: at \(\pi _1=0.1\) it gives \(0.10\) versus \(0.09\) for the base-rate classifier.
-
• Skill score: \(\text {BSS}=1-\text {BS}/\text {BS}_{\text {ref}}\) with \(\text {BS}_{\text {ref}}=\pi _1(1-\pi _1)\) rescales the score so that \(\text {BSS}=1\) is perfect, \(\text {BSS}=0\) is no better than the base-rate baseline, and \(\text {BSS}<0\) is worse.
-
Example 14.6: Consider \(M=4\) samples with actual labels and predicted probabilities:
| \(y_i\) | \(p_i\) | \((p_i - y_i)^2\) |
| 1 | 0.9 | 0.01 |
| 0 | 0.2 | 0.04 |
| 1 | 0.7 | 0.09 |
| 0 | 0.1 | 0.01 |
\(\seteqnumber{0}{}{12}\)\begin{equation} \text {BS} = \frac {0.01 + 0.04 + 0.09 + 0.01}{4} = 0.0375 \end{equation}
The low Brier score indicates that the predicted probabilities are close to the actual labels.
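The Brier score and the skill score relative to the base-rate baseline can be sketched in a few lines. The function names are hypothetical, and the base rate \(\pi _1\) is estimated here from the same labels, which is an assumption made for illustration.

```python
def brier_score(p, y):
    """Mean squared difference between predicted probability and label."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)


def brier_skill_score(p, y):
    """BSS = 1 - BS / BS_ref, with BS_ref = pi1 * (1 - pi1)."""
    pi1 = sum(y) / len(y)            # empirical positive rate
    bs_ref = pi1 * (1 - pi1)         # Brier score of the base-rate predictor
    if bs_ref == 0:
        return float("nan")          # only one class present
    return 1 - brier_score(p, y) / bs_ref


# Data of the four-sample example above.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.7, 0.1]
bs = brier_score(p, y)
bss = brier_skill_score(p, y)
print(round(bs, 4), round(bss, 2))
```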
Comparison with AUC:
-
• AUC measures discrimination: do positive samples receive higher predicted probabilities \(p_i\) than negative ones? It is scale-invariant and threshold-invariant.
-
• Brier score measures calibration: are the predicted probabilities \(p_i\) close to the actual labels \(y_i\)? It is sensitive to the absolute probability values.
A classifier can have high AUC (good discrimination) but poor Brier score (poorly calibrated probabilities), or vice versa.
Fig. 14.2 illustrates this distinction. Both classifiers achieve the same AUC (all positive samples are scored higher than all negative samples), but Classifier A predicts probabilities close to the actual labels (\(y_i=0\) or \(y_i=1\)), yielding a low Brier score. Classifier B compresses its predictions toward \(0.5\), degrading calibration and increasing the Brier score, while maintaining the same discrimination ability.
14.2.3 Calibration Curve
The Brier score (Sec. 14.2.2) summarizes calibration with a single number. The calibration curve (also called reliability diagram) shows where along the probability range the classifier is miscalibrated.
Construction. The construction is a probability-type histogram (Sec. 1.4) of the predicted probabilities \(\{p_i\}_{i=1}^M\), but with each bin reporting the empirical positive rate \(\bar {y}_b\) instead of the relative frequency \(n_i/N\) of eq. (1.15).
-
1. Partition the predicted probabilities \(\{p_i\}_{i=1}^M\) into \(B\) bins \(I_1,\dots ,I_B\) over \([0,1]\) (equal-width or equal-count/quantile).
-
2. For each bin \(b\), compute the mean predicted probability and the observed positive fraction,
\(\seteqnumber{0}{}{13}\)\begin{equation} \bar {p}_b = \frac {1}{|I_b|}\sum _{i\in I_b} p_i,\qquad \bar {y}_b = \frac {1}{|I_b|}\sum _{i\in I_b} y_i \end{equation}
-
3. Plot \(\bar {y}_b\) versus \(\bar {p}_b\), together with the diagonal \(\bar {y}=\bar {p}\) as the reference for perfect calibration.
-
• Curve on the diagonal: well-calibrated. A predicted probability of \(0.7\) corresponds to a positive in \(70\%\) of such cases.
-
• Curve below the diagonal (\(\bar {y}_b<\bar {p}_b\)): the classifier is over-confident, i.e. predicted probabilities exceed the realized frequencies.
-
• Curve above the diagonal (\(\bar {y}_b>\bar {p}_b\)): the classifier is under-confident.
-
• Deviations are common (e.g., naive Bayes, SVM) and can be corrected by post-hoc methods such as Platt scaling or isotonic regression\(^{2}\).
2 In Python: sklearn.calibration.calibration_curve and CalibratedClassifierCV.
-
Example 14.7: Fig. 14.3 compares two classifiers on a synthetic 1D dataset (\(M=2000\)): a logistic regression (well-calibrated by construction) and an over-confident variant obtained by sharpening the logits, \(\tilde {p} = \sigma (\alpha \cdot \mathrm {logit}(p))\) with \(\alpha =3\). Because \(\tilde {p}\) is a strictly monotonic function of \(p\), the two classifiers order the samples identically: any pair \((i,j)\) with \(p_i>p_j\) also satisfies \(\tilde {p}_i>\tilde {p}_j\). Since AUC depends only on this ranking and not on the absolute probability values, both classifiers achieve the same AUC (Sec. 12.5). The over-confident one nevertheless produces a calibration curve that bows below the diagonal in the mid-probability range, a markedly higher Brier score, and a higher Expected Calibration Error (both BS and ECE are reported in the figure legend).
Figure 14.3: Calibration curve (reliability diagram) for a well-calibrated logistic regression and an over-confident variant. The dashed diagonal marks perfect calibration; points below the diagonal indicate over-confidence. The legend reports the Brier score (BS) and Expected Calibration Error (ECE) of each classifier.
Expected Calibration Error. The calibration curve can be summarized by a single scalar that weights each bin’s deviation from the diagonal by its share of samples:
\(\seteqnumber{0}{}{14}\)\begin{equation} \text {ECE}_k = \sum _{b=1}^{B} \frac {|I_b|}{M} \left |\bar {y}_b - \bar {p}_b\right | \end{equation}
A lower ECE indicates better calibration. Comparing \(\text {ECE}_1\) and \(\text {ECE}_2\) complements the Brier score by isolating the calibration component from the refinement (sharpness) component.
Relation to AUC and Brier score:
-
• AUC (Sec. 12.5) is invariant to any monotonic rescaling of \(p_i\), so a classifier can have \(\mathsf {AUC}\approx 1\) while being severely miscalibrated.
-
• Brier score gives a scalar summary of calibration plus refinement; the calibration curve shows the shape of the miscalibration across probability ranges, which is the actionable information for choosing a recalibration method.
The Brier score decomposes into calibration and refinement (sharpness) terms. Sharpness refers to how concentrated the predicted probabilities are toward the extremes \(0\) and \(1\): a sharp classifier commits to confident predictions (most \(p_i\) near \(0\) or \(1\)), while an unsharp one hedges with probabilities clustered near the base rate (e.g., \(0.5\)). Two classifiers with similar Brier scores can have very different calibration curves: one may be well-calibrated but unsharp, the other miscalibrated but sharp. The calibration curve reveals this distinction, which the Brier score alone cannot.
-
Example 14.8: Two probabilistic classifiers are evaluated on \(M=6\) samples with labels \(y\) and predicted probabilities \(p_{1}, p_{2}\):
| \(i\) | 1 | 2 | 3 | 4 | 5 | 6 |
| \(y_i\) | 1 | 0 | 1 | 0 | 1 | 0 |
| \(p_{1,i}\) | 0.90 | 0.20 | 0.70 | 0.30 | 0.60 | 0.15 |
| \(p_{2,i}\) | 0.75 | 0.10 | 0.85 | 0.40 | 0.80 | 0.25 |
With \(B=2\) bins, \(I_1=[0,\,0.5)\) and \(I_2=[0.5,\,1]\):
| Bin | \(\bar {p}_{b,1}\) | \(\bar {y}_{b,1}\) | \(\bar {p}_{b,2}\) | \(\bar {y}_{b,2}\) |
| \([0,\,0.5)\) | 0.217 | 0 | 0.250 | 0 |
| \([0.5,\,1]\) | 0.733 | 1 | 0.800 | 1 |
\(\text {ECE}_1 = \tfrac {3}{6}|0 - 0.217| + \tfrac {3}{6}|1 - 0.733| = 0.108 + 0.133 = 0.242\).
\(\text {ECE}_2 = \tfrac {3}{6}|0 - 0.250| + \tfrac {3}{6}|1 - 0.800| = 0.125 + 0.100 = 0.225\).
Classifier 2 has slightly better calibration. Both classifiers are underconfident in the upper bin (\(\bar {p}_b < \bar {y}_b = 1\)) and overconfident in the lower bin (\(\bar {p}_b > \bar {y}_b = 0\)).
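The binning steps and the ECE sum can be sketched as follows. This is a minimal illustration assuming equal-width bins; the function names are hypothetical, and values lying exactly on a bin edge may fall on either side because of floating-point rounding.

```python
def calibration_bins(p, y, n_bins=2):
    """Per-bin (count, mean predicted probability, positive fraction)."""
    members = [[] for _ in range(n_bins)]
    for pi, yi in zip(p, y):
        b = min(int(pi * n_bins), n_bins - 1)   # p = 1 goes to the last bin
        members[b].append((pi, yi))
    out = []
    for group in members:
        if group:                                # skip empty bins
            p_bar = sum(pi for pi, _ in group) / len(group)
            y_bar = sum(yi for _, yi in group) / len(group)
            out.append((len(group), p_bar, y_bar))
    return out


def expected_calibration_error(p, y, n_bins=2):
    """Sample-weighted mean absolute gap between y_bar and p_bar."""
    m = len(p)
    return sum(n / m * abs(y_bar - p_bar)
               for n, p_bar, y_bar in calibration_bins(p, y, n_bins))


# Data of the six-sample example above.
y = [1, 0, 1, 0, 1, 0]
p1 = [0.90, 0.20, 0.70, 0.30, 0.60, 0.15]
p2 = [0.75, 0.10, 0.85, 0.40, 0.80, 0.25]
ece1 = expected_calibration_error(p1, y)
ece2 = expected_calibration_error(p2, y)
print(round(ece1, 3), round(ece2, 3))
```

The triples returned by `calibration_bins` are also the points \((\bar {p}_b,\bar {y}_b)\) of the calibration curve.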
14.2.4 Paired Scoring Rule Comparison
Given a proper scoring rule \(S\), such as the Brier score \(S_{\text {BS}}(p,y) = (p - y)^2\) (Sec. 14.2.2) or the cross-entropy (CE) loss \(S_{\text {LL}}(p,y) = -[y\log p + (1{-}y)\log (1{-}p)]\) (Sec. 8.4), denote the per-sample score \(S_{k,i} = S(p_{k,i},\, y_i)\) and compute the score difference:
\(\seteqnumber{0}{}{15}\)\begin{equation} \delta _i = S_{1,i} - S_{2,i} \end{equation}
where \(p_{k,i} = f_{\bth _k}(\bx _i)\) is the probability predicted by classifier \(k\) for sample \(i\). Both scores are losses, so lower is better: if \(\delta _i > 0\), classifier 2 is better on that sample; if \(\delta _i < 0\), classifier 1 is better.
The average score difference summarizes the overall comparison:
\(\seteqnumber{0}{}{16}\)\begin{equation} \bar {\delta } = \frac {1}{M}\sum _{i=1}^{M} \delta _i = \bar {S}_1 - \bar {S}_2 \end{equation}
where
\(\seteqnumber{0}{}{17}\)\begin{equation} \bar {S}_k = \frac {1}{M}\sum _{i=1}^{M} S_{k,i} \end{equation}
is the mean score of classifier \(k\). A negative \(\bar {\delta }\) favors classifier 1; a positive \(\bar {\delta }\) favors classifier 2.
-
Example 14.9 (Paired Scoring Rule Comparison): Two probabilistic classifiers are evaluated on \(M=6\) test samples using the Brier score \(S_{\text {BS}}(p,y) = (p-y)^2\):
| \(y_i\) | \(p_{1,i}\) | \(S_{\text {BS},1}\) | \(p_{2,i}\) | \(S_{\text {BS},2}\) | \(\delta _i\) |
| 1 | 0.90 | 0.010 | 0.75 | 0.063 | \(-0.053\) |
| 0 | 0.20 | 0.040 | 0.10 | 0.010 | \(+0.030\) |
| 1 | 0.70 | 0.090 | 0.85 | 0.023 | \(+0.068\) |
| 0 | 0.30 | 0.090 | 0.40 | 0.160 | \(-0.070\) |
| 1 | 0.60 | 0.160 | 0.80 | 0.040 | \(+0.120\) |
| 0 | 0.15 | 0.023 | 0.25 | 0.063 | \(-0.040\) |
| Mean | | \(\bar {S}_1 = 0.069\) | | \(\bar {S}_2 = 0.060\) | \(\bar {\delta } = +0.009\) |
The small positive \(\bar {\delta } = 0.009\) slightly favors classifier 2, but the per-sample differences \(\delta _i\) alternate in sign, suggesting the advantage is not consistent. A statistical test is needed to determine whether this difference is significant.
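The per-sample differences in the table can be reproduced with a short sketch. The function names and the clipping constant `eps` in the cross-entropy score are illustrative assumptions.

```python
import math


def brier(p, y):
    """Per-sample Brier score."""
    return (p - y) ** 2


def cross_entropy(p, y, eps=1e-12):
    """Per-sample CE loss; p is clipped away from 0 and 1 to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def score_differences(p1, p2, y, score=brier):
    """delta_i = S(p1_i, y_i) - S(p2_i, y_i); positive favors classifier 2."""
    return [score(a, t) - score(b, t) for a, b, t in zip(p1, p2, y)]


# Data of Example 14.9.
y = [1, 0, 1, 0, 1, 0]
p1 = [0.90, 0.20, 0.70, 0.30, 0.60, 0.15]
p2 = [0.75, 0.10, 0.85, 0.40, 0.80, 0.25]
delta = score_differences(p1, p2, y)
mean_delta = sum(delta) / len(delta)
print([round(d, 3) for d in delta], round(mean_delta, 3))
```

Passing `score=cross_entropy` gives the same comparison under the CE loss; the two scoring rules need not agree on which classifier is favored.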
14.2.5 Paired t-test on Score Differences
The paired scoring rule comparison provides a descriptive measure \(\bar {\delta }\), but does not indicate whether the difference could have arisen by chance. A paired statistical test addresses this by testing:
-
• \(H_0\): \(\E [\delta _i] = 0\) — the classifiers are equally good.
-
• \(H_1\): \(\E [\delta _i] \neq 0\) — one classifier is systematically better.
The paired t-test on \(\{\delta _1, \ldots , \delta _M\}\) yields a test statistic:
\(\seteqnumber{0}{}{18}\)\begin{equation} t = \frac {\bar {\delta }}{s_\delta / \sqrt {M}} \end{equation}
where \(s_\delta \) is the unbiased sample standard deviation:
\(\seteqnumber{0}{}{19}\)\begin{equation} s_\delta = \sqrt {\frac {1}{M-1}\sum _{i=1}^{M}(\delta _i - \bar {\delta })^2} \end{equation}
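Under \(H_0\), \(t\) follows a Student \(t\) distribution with \(M-1\) degrees of freedom, from which the two-sided p-value is obtained. If \(p < \alpha \), reject \(H_0\): the scoring difference is statistically significant.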
The paired t-test assumes that the \(\delta _i\) are approximately normally distributed. When this assumption is violated (e.g., small \(M\), heavy-tailed or skewed score distributions), the Wilcoxon signed-rank test (below) provides a non-parametric alternative.
-
Example 14.9: Back to the previous example (Table in Example 14.9), \(\bar {\delta } = 0.0092\), \(s_\delta = 0.0756\), \(M=6\).
\(\seteqnumber{0}{}{20}\)\begin{equation} t = \frac {0.0092}{0.0756/\sqrt {6}} = \frac {0.0092}{0.0309} \approx 0.30 \end{equation}
The corresponding two-sided p-value (Student \(t\) distribution with \(M-1=5\) degrees of freedom) is \(p \approx 0.78\). Since \(p > 0.05\), we fail to reject \(H_0\). Despite the slight difference in mean Brier score, the two classifiers are not significantly different.
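The test statistic can be evaluated with the standard library alone, as in the minimal sketch below. The function name is hypothetical. The p-value itself requires the Student \(t\) distribution, which the standard library does not provide; if SciPy is available, `scipy.stats.t.sf` can supply the tail probability.

```python
import math
import statistics


def paired_t_statistic(delta):
    """Return (t, degrees of freedom) for H0: E[delta] = 0."""
    m = len(delta)
    mean = statistics.fmean(delta)
    s = statistics.stdev(delta)        # unbiased estimate, divides by M - 1
    return mean / (s / math.sqrt(m)), m - 1


# Per-sample Brier score differences for the data of Example 14.9.
y = [1, 0, 1, 0, 1, 0]
p1 = [0.90, 0.20, 0.70, 0.30, 0.60, 0.15]
p2 = [0.75, 0.10, 0.85, 0.40, 0.80, 0.25]
delta = [(a - t) ** 2 - (b - t) ** 2 for a, b, t in zip(p1, p2, y)]

t_stat, dof = paired_t_statistic(delta)
print(t_stat, dof)
# Two-sided p-value, assuming SciPy is installed:
#   from scipy import stats
#   p = 2 * stats.t.sf(abs(t_stat), dof)
```

Note that `statistics.stdev` raises an error for fewer than two samples, and a constant \(\delta _i\) gives \(s_\delta =0\), for which the statistic is undefined.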
14.2.6 Wilcoxon Signed-Rank Test on Score Differences
The Wilcoxon signed-rank test uses the ranks of \(|\delta _i|\) rather than their magnitudes, making it robust to outliers and non-normal distributions. The procedure is:
-
1. Discard any samples where \(\delta _i = 0\) (ties). Let \(M'\) denote the remaining count.
-
2. Rank the absolute differences \(|\delta _1|, \ldots , |\delta _{M'}|\) from smallest to largest. Assign average ranks to tied values.
-
3. Compute the signed-rank sums:
\(\seteqnumber{0}{}{21}\)\begin{equation} W^+ = \sum _{\delta _i > 0} R_i, \qquad W^- = \sum _{\delta _i < 0} R_i \end{equation}
where \(R_i\) is the rank of \(|\delta _i|\). The test statistic is \(W = \min (W^+, W^-)\).
The hypotheses are:
-
• \(H_0\): the median of \(\delta _i\) is zero — the classifiers are equally good.
-
• \(H_1\): the median of \(\delta _i\) is not zero — one classifier is systematically better.
Under \(H_0\), positive and negative differences are equally likely at each rank, so \(W^+ \approx W^-\). A very small \(W\) indicates that one sign dominates the high ranks, i.e., one classifier is consistently better on the samples with the largest differences. If \(p < \alpha \), reject \(H_0\).
-
Example 14.9: Back to the previous example (Table in Example 14.9):
| \(\delta _i\) | \(|\delta _i|\) | Rank \(R_i\) | Sign |
| \(+0.030\) | 0.030 | 1 | \(+\) |
| \(-0.040\) | 0.040 | 2 | \(-\) |
| \(-0.053\) | 0.053 | 3 | \(-\) |
| \(+0.068\) | 0.068 | 4 | \(+\) |
| \(-0.070\) | 0.070 | 5 | \(-\) |
| \(+0.120\) | 0.120 | 6 | \(+\) |
\(W^+ = 1 + 4 + 6 = 11\), \(W^- = 2 + 3 + 5 = 10\), \(W = \min (11, 10) = 10\).
For \(M'=6\), the critical value at \(\alpha =0.05\) (two-sided) is \(W_{\text {crit}}=0\). Since \(W = 10 > 0\), we fail to reject \(H_0\), consistent with the paired t-test result. The positive and negative ranks are nearly balanced, confirming that neither classifier is systematically better.
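The ranking procedure can be sketched as below, together with an exact two-sided p-value obtained by enumerating all \(2^{M'}\) sign assignments, which is feasible only for small \(M'\). The function names are hypothetical, and the enumeration is one possible way to obtain the p-value, not necessarily the one intended by the text.

```python
from itertools import product


def signed_ranks(delta):
    """Drop zeros; rank |delta| ascending, averaging ranks over ties."""
    nz = [d for d in delta if d != 0]
    order = sorted(range(len(nz)), key=lambda i: abs(nz[i]))
    ranks = [0.0] * len(nz)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(nz[order[j + 1]]) == abs(nz[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1            # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return nz, ranks


def wilcoxon_w(delta):
    """Return (W+, W-, W = min(W+, W-))."""
    nz, ranks = signed_ranks(delta)
    w_plus = sum(r for d, r in zip(nz, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(nz, ranks) if d < 0)
    return w_plus, w_minus, min(w_plus, w_minus)


def wilcoxon_exact_p(delta):
    """P(W <= observed W) when every sign pattern is equally likely."""
    _, ranks = signed_ranks(delta)
    w_obs = wilcoxon_w(delta)[2]
    total = sum(ranks)
    hits = 0
    for signs in product((False, True), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(w_plus, total - w_plus) <= w_obs:
            hits += 1
    return hits / 2 ** len(ranks)


# Rounded differences from the table of Example 14.9.
delta = [-0.053, 0.030, 0.068, -0.070, 0.120, -0.040]
w = wilcoxon_w(delta)
p_exact = wilcoxon_exact_p(delta)
print(w, p_exact)
```

With six non-zero differences, only \(W=0\) yields an exact p-value below \(0.05\) (namely \(2/64\)), which agrees with the critical value quoted above.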
14.2.7 DeLong’s Test for AUC Comparison
The paired scoring rule test compares calibration (how close probabilities are to the true labels). DeLong’s test addresses a complementary question: do the classifiers differ in their ability to rank positive samples above negative ones, as measured by the AUC (Sec. 12.5)?
Given two classifiers with \(\text {AUC}_1\) and \(\text {AUC}_2\) evaluated on the same test set, the hypotheses are:
-
• \(H_0\): \(\text {AUC}_1 = \text {AUC}_2\).
-
• \(H_1\): \(\text {AUC}_1 \neq \text {AUC}_2\).
The DeLong test statistic is:
\(\seteqnumber{0}{}{22}\)\begin{equation} z = \frac {\text {AUC}_1 - \text {AUC}_2}{\sqrt {\text {Var}(\text {AUC}_1 - \text {AUC}_2)}} \end{equation}
Under \(H_0\), \(z\) is approximately standard normal. The variance in the denominator accounts for the correlation between \(\text {AUC}_1\) and \(\text {AUC}_2\), since both are computed on the same test samples and their estimates are not independent\(^{3}\). If \(p < \alpha \), the classifiers have significantly different discrimination.
3 The computation of DeLong’s variance is based on placement values and is outside the scope of this chapter. It is available numerically via standard implementations (e.g., roc.test in the R package pROC); scipy.stats does not include a DeLong test.
DeLong’s test evaluates discrimination (ranking ability) while the paired scoring rule test evaluates calibration (probability accuracy). Two classifiers can have identical AUC but different Brier scores, or vice versa. Both tests should be considered for a complete comparison.
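Although the variance derivation is outside the scope of the chapter, a compact numerical sketch may help make the statistic concrete. It follows the usual placement-value formulation of DeLong’s estimator; the function names and the toy data are hypothetical, at least two positive and two negative samples are assumed, and NaN is returned when the estimated variance is zero (for example, when both classifiers separate the classes perfectly).

```python
import math


def _psi(x, y):
    """Pairwise kernel: 1 if the positive outranks the negative, 0.5 on ties."""
    return 1.0 if x > y else (0.5 if x == y else 0.0)


def _cov(u, mu_u, v, mu_v):
    """Unbiased sample covariance with known means."""
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / (len(u) - 1)


def delong_test(y, p1, p2):
    """Return (AUC1, AUC2, z, two-sided p) for two paired classifiers."""
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    m, n = len(pos), len(neg)
    auc, v10, v01 = [], [], []
    for p in (p1, p2):
        a = [sum(_psi(p[i], p[j]) for j in neg) / n for i in pos]  # per positive
        b = [sum(_psi(p[i], p[j]) for i in pos) / m for j in neg]  # per negative
        v10.append(a)
        v01.append(b)
        auc.append(sum(a) / m)
    var = 0.0
    # Var(AUC1 - AUC2) = S11 + S22 - 2*S12, with S = S10/m + S01/n
    for r, s, weight in ((0, 0, 1.0), (1, 1, 1.0), (0, 1, -2.0)):
        s10 = _cov(v10[r], auc[r], v10[s], auc[s])
        s01 = _cov(v01[r], auc[r], v01[s], auc[s])
        var += weight * (s10 / m + s01 / n)
    if var <= 0:
        return auc[0], auc[1], float("nan"), float("nan")
    z = (auc[0] - auc[1]) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail
    return auc[0], auc[1], z, p_value


# Hypothetical toy data (not from the text).
y = [1, 1, 0, 0]
p1 = [0.9, 0.4, 0.5, 0.1]
p2 = [0.9, 0.8, 0.5, 0.1]
auc1, auc2, z, p_value = delong_test(y, p1, p2)
print(auc1, auc2, z, p_value)
```

The pairwise loops cost \(O(mn)\) per classifier, which is adequate for illustration but slow for large test sets.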
14.2.8 Error Correlation
Yule’s Q (Sec. 14.1.6) measures the association between two classifiers’ binary correct/incorrect outcomes. For probabilistic classifiers, we can measure the same concept on the continuous per-sample scores \(S_{k,i}\): when classifier 1 has a large error, does classifier 2 also have a large error?
The error correlation is the Pearson correlation coefficient between the per-sample scores:
\(\seteqnumber{0}{}{23}\)\begin{equation} r = \frac {\sum _{i=1}^{M}(S_{1,i} - \bar {S}_1)(S_{2,i} - \bar {S}_2)}{\sqrt {\sum _{i=1}^{M}(S_{1,i} - \bar {S}_1)^2}\;\sqrt {\sum _{i=1}^{M}(S_{2,i} - \bar {S}_2)^2}} \end{equation}
Range: \(r \in [-1, 1]\).
-
• \(r \approx 1\): errors are positively correlated — both classifiers fail on the same samples. Combining them provides little benefit.
-
• \(r \approx 0\): errors are uncorrelated — ensembles can potentially average out errors effectively.
-
• \(r < 0\): errors are negatively correlated — the classifiers fail on different samples, which gives high potential for ensembles.
-
Example 14.9: Back to the previous example (Table in Example 14.9), the per-sample Brier scores are:
| \(i\) | \(S_{\text {BS},1}\) | \(S_{\text {BS},2}\) |
| 1 | 0.010 | 0.063 |
| 2 | 0.040 | 0.010 |
| 3 | 0.090 | 0.023 |
| 4 | 0.090 | 0.160 |
| 5 | 0.160 | 0.040 |
| 6 | 0.023 | 0.063 |
Computing the Pearson correlation from these six pairs gives \(r \approx 0.04\). The near-zero correlation indicates that the classifiers’ errors are essentially uncorrelated: a large error of one classifier says little about the size of the other’s error on the same sample. This suggests moderate potential for improvement via ensemble combination (Ch. 15.3).
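The error correlation can be checked directly from the tabulated scores. The sketch below assumes the rounded values shown in the table; the helper name `error_correlation` is illustrative.

```python
import numpy as np

# Per-sample Brier scores as tabulated (rounded to three decimals)
s1 = np.array([0.010, 0.040, 0.090, 0.090, 0.160, 0.023])
s2 = np.array([0.063, 0.010, 0.023, 0.160, 0.040, 0.063])

def error_correlation(a, b):
    """Pearson correlation between two per-sample score vectors."""
    da, db = a - a.mean(), b - b.mean()
    return float((da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum()))

r = error_correlation(s1, s2)
print(f"r = {r:.3f}")
```

With only \(M=6\) samples the estimate is very noisy, so it is best read as a qualitative indicator.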
14.2.9 Spearman Rank Correlation of Scores
The Pearson error correlation (Sec. 14.2.8) measures the linear association between \(S_{1,i}\) and \(S_{2,i}\). When the relationship is monotonic but non-linear, or when scores contain outliers, the Spearman rank correlation provides a more robust measure.
Let \(R_{k,i}\) denote the rank of \(S_{k,i}\) among \(\{S_{k,1},\ldots ,S_{k,M}\}\) (with average ranks for ties). The Spearman rank correlation (Sec. 2.5) is the Pearson correlation computed on the ranks:
\(\seteqnumber{0}{}{24}\)\begin{equation} r_s = \frac {\sum _{i=1}^{M}(R_{1,i} - \bar {R})(R_{2,i} - \bar {R})}{\sqrt {\sum _{i=1}^{M}(R_{1,i} - \bar {R})^2}\;\sqrt {\sum _{i=1}^{M}(R_{2,i} - \bar {R})^2}} \end{equation}
where \(\bar {R} = (M+1)/2\) is the mean rank.
Range: \(r_s \in [-1, 1]\). The interpretation mirrors the Pearson error correlation:
-
• \(r_s \approx 1\): classifiers find the same samples difficult — limited ensemble benefit.
-
• \(r_s \approx 0\): difficulty rankings are unrelated — good ensemble diversity.
-
• \(r_s < 0\): when one classifier struggles, the other excels — high ensemble potential.
The Spearman rank correlation relates to the Pearson error correlation as the Wilcoxon test relates to the paired t-test: a non-parametric alternative that is robust to outliers and non-linear score distributions.
-
Example 14.9: Back to the previous example (Table in Example 14.9). Ranking each classifier’s Brier scores (average ranks for ties):
| \(i\) | \(S_{\text {BS},1}\) | \(R_{1,i}\) | \(S_{\text {BS},2}\) | \(R_{2,i}\) |
| 1 | 0.010 | 1 | 0.063 | 4.5 |
| 2 | 0.040 | 3 | 0.010 | 1 |
| 3 | 0.090 | 4.5 | 0.023 | 2 |
| 4 | 0.090 | 4.5 | 0.160 | 6 |
| 5 | 0.160 | 6 | 0.040 | 3 |
| 6 | 0.023 | 2 | 0.063 | 4.5 |
Computing \(r_s = -3/17 \approx -0.18\): the rank correlation is weak in magnitude, as is the Pearson coefficient. The classifiers do not rank sample difficulty similarly, confirming modest complementarity.
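The ranking step and the resulting coefficient can be reproduced as follows. This is a small sketch using `scipy.stats.rankdata`, whose default tie handling assigns average ranks, as the definition requires.

```python
import numpy as np
from scipy.stats import rankdata

s1 = np.array([0.010, 0.040, 0.090, 0.090, 0.160, 0.023])
s2 = np.array([0.063, 0.010, 0.023, 0.160, 0.040, 0.063])

# Average ranks for ties (rankdata default: method="average")
r1 = rankdata(s1)
r2 = rankdata(s2)

# Spearman = Pearson correlation computed on the ranks
r_s = float(np.corrcoef(r1, r2)[0, 1])
print(r1, r2, round(r_s, 3))
```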
Summary of Probabilistic Comparisons
For quick reference, Table 14.5 summarizes how to interpret each subrange of the probabilistic comparison metrics.
| Method | Range | Subrange | Interpretation | |
| Brier score (Sec. 14.2.2) | \([0,\,1]\) | \(\text {BS} \approx 0\) | Probabilities track the labels accurately | |
| \(\text {BS} \approx \pi _1(1-\pi _1)\) | No better than the base-rate (no-information) baseline | |||
| \(\text {BS} \to 1\) | Confidently wrong on most samples | |||
| ECE (Sec. 14.2.3) | \([0,\,1]\) | \(\text {ECE} \approx 0\) | Bin-averaged probabilities match empirical frequencies | |
| \(\text {ECE} \gtrsim 0.1\) | Visible miscalibration in at least one probability band | |||
| Paired score \(\bar {\delta }\) (Sec. 14.2.4) | \(\mathbb {R}\) | \(\bar {\delta } < 0\) | Classifier 1 scores better (lower loss) on average | |
| \(\bar {\delta } \approx 0\) | Classifiers tied on the chosen scoring rule | |||
| \(\bar {\delta } > 0\) | Classifier 2 scores better on average | |||
| Paired t-test \(p\) (Sec. 14.2.5) | \([0,\,1]\) | \(p < 0.05\) | Mean score gap is significant (assumes \(\delta _i\) approximately normal) | |
| \(p \ge 0.05\) | No evidence the mean score differs | |||
| Wilcoxon \(p\) (Sec. 14.2.6) | \([0,\,1]\) | \(p < 0.05\) | Median score gap is significant (non-parametric) | |
| \(p \ge 0.05\) | No evidence the median score differs | |||
| DeLong \(p\) (Sec. 14.2.7) | \([0,\,1]\) | \(p < 0.05\) | AUC (discrimination) differs significantly | |
| \(p \ge 0.05\) | No evidence the classifiers rank samples differently | |||
| Pearson error corr. \(r\) (Sec. 14.2.8) | \([-1,\,1]\) | \(r < 0\) | Errors anti-correlated, high ensemble potential | |
| \(r \approx 0\) | Errors uncorrelated, good ensemble diversity | |||
| \(0 < r < 0.6\) | Mildly correlated errors, modest ensemble benefit | |||
| \(r \ge 0.6\) | Strongly correlated errors, little ensemble benefit | |||
| Spearman \(r_s\) (Sec. 14.2.9) | \([-1,\,1]\) | \(r_s < 0\) | Difficulty rankings reversed, high ensemble potential | |
| \(r_s \approx 0\) | Rankings unrelated, good diversity | |||
| \(r_s \gtrsim 0.6\) | Same samples found hard by both, limited ensemble gain |
The methods above answer related but distinct questions. The contrasts between them clarify when each one is appropriate:
-
• Brier score (Sec. 14.2.2) vs. ECE (Sec. 14.2.3). Brier mixes calibration and refinement into one scalar. ECE isolates the calibration component by bin, so it reveals where along the probability range miscalibration concentrates; two classifiers with similar Brier can have very different ECE shapes.
-
• Paired score \(\bar {\delta }\) (Sec. 14.2.4) vs. paired t-test (Sec. 14.2.5). \(\bar {\delta }\) is the descriptive mean gap. The t-test asks whether that gap could plausibly arise by chance; a sizeable \(|\bar {\delta }|\) with large per-sample variance can be non-significant, and conversely a tiny but consistent \(\bar {\delta }\) on a large \(M\) can be highly significant.
-
• Paired t-test (Sec. 14.2.5) vs. Wilcoxon (Sec. 14.2.6). The t-test compares means under approximate normality of the \(\delta _i\). Wilcoxon ranks the magnitudes, so it is robust to heavy tails and outliers; the two can disagree when a few extreme \(\delta _i\) dominate the mean.
-
• DeLong (Sec. 14.2.7) vs. paired t-test on Brier (Sec. 14.2.5). DeLong tests AUC, a ranking property invariant to monotone rescaling of the scores. The Brier-difference t-test tests probability accuracy. A monotone sharpening transform therefore leaves DeLong’s verdict unchanged while it can badly degrade the Brier score (Fig. 14.3).
-
• Pearson error correlation (Sec. 14.2.8) vs. Spearman rank correlation (Sec. 14.2.9). Pearson measures the linear association between per-sample error scores; Spearman uses only ranks. They diverge whenever the score-error relationship is monotonic but non-linear, or when a few large errors inflate the Pearson value.
-
• Scoring accuracy (Brier, ECE, \(\bar {\delta }\)) vs. diversity (\(r\), \(r_s\)). The first family asks how good each classifier is individually; the second asks whether their failures overlap. A pair of strong but redundant classifiers (\(r \approx 1\)) gives less ensemble benefit than a pair of weaker but complementary ones (\(r \approx 0\)).
14.3 Comparison Between Two Multi-Class Classifiers
In the binary setting, the joint behaviour of two classifiers is captured by a \(2\times 2\) contingency table (Sec. 14.1.2). With \(K> 2\) classes, the natural object is a \(K\times K\) agreement matrix, whose entries count samples by the pair of predicted labels.
14.3.1 Summary
Table 14.6 maps each binary comparison method to its multi-class counterpart. Methods marked “same formula” need no modification; per-sample-score methods (Sec. 14.2.4, 14.2.5, 14.2.6, 14.2.8, 14.2.9) apply unchanged because they operate on scalar scores \(S_{k,i}\) that do not see the class count \(K\).
| Question / aspect | \(K=2\) method | \(K>2\) method | |||
| Joint prediction table | |||||
| Where do classifiers disagree? | Contingency table (Sec. 14.1.2; uses \(y\)) | Agreement matrix (Sec. 14.3.2; no \(y\)) | |||
| Fraction of disagreements | Disagreement (Sec. 14.1.3) | Disagreement (Sec. 14.3.3; same formula) | |||
| Significance testing | |||||
| Is the accuracy difference significant? | McNemar (Sec. 14.1.4) | apply on the correct/incorrect \(2\times 2\) collapse | |||
| Per-pair confusion symmetry | (degenerate at \(K=2\)) | Bowker (Sec. 14.3.4) | |||
| Equal class-prediction rates | (degenerate at \(K=2\)) | Stuart-Maxwell (Sec. 14.3.4) | |||
| Distribution-free difference test | permutation McNemar (Sec. 14.1.4) | Omnibus permutation (Sec. 14.3.5) | |||
| Agreement and error correlation | |||||
| Agreement beyond chance | Cohen’s \(\kappa \) (Sec. 14.1.5) | Cohen’s \(\kappa \) (Sec. 14.3.6) | |||
| Direction of error correlation | Yule’s \(Q\) (Sec. 14.1.6) | apply binary \(Q\) to correct/incorrect \(2\times 2\) collapse | |||
| Probabilistic scoring | |||||
| Probability-vector accuracy | Brier (Sec. 14.2.2) | Multi-class Brier (Sec. 14.3.7) | |||
| Calibration diagnosis | Calibration curve / ECE (Sec. 14.2.3) | Top-label / classwise ECE (Sec. 14.3.8) | |||
| Discrimination significance (AUC) | DeLong (Sec. 14.2.7) | Multi-class AUC + DeLong, OvR or OvO (Sec. 14.3.9) | |||
| Per-sample-score methods (unchanged) | |||||
| Paired score difference, t-test, Wilcoxon, Pearson, Spearman | Sec. 14.2.4, 14.2.5, 14.2.6, 14.2.8, 14.2.9 | Apply unchanged to the scalar score \(S_{k,i}\) (multi-class Brier or cross-entropy); see Sec. 14.3.10 | |||
14.3.2 Agreement Matrix
Let \(\hat {y}_{k,i}\in \{1,\dots ,K\}\) denote the prediction of classifier \(k\) on sample \(i\). The agreement matrix \(N=[n_{jk}]\in \mathbb {N}^{K\times K}\) has entries
\(\seteqnumber{0}{}{25}\)\begin{equation} n_{jk} = \sum _{i=1}^{M} \indFunc [\hat {y}_{1,i}=j]\,\indFunc [\hat {y}_{2,i}=k], \qquad \sum _{j,k} n_{jk} = M. \end{equation}
The diagonal entries \(n_{jj}\) count samples on which both classifiers predict class \(j\), and their sum \(\sum _{j=1}^{K} n_{jj}\) is the total number of samples on which the two classifiers issue the same label (whether correct or not). The off-diagonal entry \(n_{jk}\) with \(j\ne k\) counts samples where \(C_1\) predicts class \(j\) and \(C_2\) predicts class \(k\), revealing which class confusions drive the disagreement.
Confusion, contingency, agreement
The agreement matrix does not reference the ground truth \(y\): both axes are classifier predictions, \(\hat {y}_1\) and \(\hat {y}_2\). It is therefore a fundamentally different object from the \(2\times 2\) contingency table of Sec. 14.1.2, whose axes are correct/incorrect (i.e. \(\hat {y}_k=y\) vs. \(\hat {y}_k\ne y\)).
-
• Contingency table (Sec. 14.1.2): includes \(y\). Tells you which classifier is more often right.
-
• Agreement matrix (this section): excludes \(y\). Tells you how the classifiers differ from each other.
The agreement matrix can be computed without labels (e.g. on a live, unlabelled stream); the contingency table cannot. When labels are available, recoding each prediction as correct or incorrect (“predicted label equals \(y\)”) yields the contingency table, but the two objects serve different purposes and should not be confused.
-
Example 14.10: Two classifiers are evaluated on \(M=200\) samples drawn from \(K=3\) classes (A, B, C). The agreement matrix is:
| Classifier 1 (C1) \ Classifier 2 (C2) | A | B | C | Row sum |
| A | 70 | 6 | 4 | 80 |
| B | 10 | 55 | 5 | 70 |
| C | 8 | 7 | 35 | 50 |
| Col. sum | 88 | 68 | 44 | 200 |
Table 14.7: Agreement matrix between two 3-class classifiers on \(M=200\) samples.
The diagonal sums to \(70+55+35=160\), so the classifiers issue the same label on \(160/200=80\%\) of samples. The largest off-diagonal cell is \(n_{BA}=10\): among the \(40\) disagreements, the most frequent pattern is “\(C_1\) says B while \(C_2\) says A.”
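In practice the agreement matrix is built from two vectors of predicted labels. The following sketch assumes integer labels \(0,\dots ,K-1\) (A=0, B=1, C=2); the helper `agreement_matrix` is a hypothetical name, and the label vectors are synthesized here only so that they reproduce the counts of the example.

```python
import numpy as np

def agreement_matrix(pred1, pred2, n_classes):
    """N[j, k] = number of samples where C1 predicts j and C2 predicts k."""
    N = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(N, (np.asarray(pred1), np.asarray(pred2)), 1)
    return N

# Synthesize prediction vectors that match the example's cell counts
table = np.array([[70, 6, 4],
                  [10, 55, 5],
                  [8, 7, 35]])
j_idx, k_idx = np.indices(table.shape)
pred1 = np.repeat(j_idx.ravel(), table.ravel())   # C1 labels
pred2 = np.repeat(k_idx.ravel(), table.ravel())   # C2 labels

N = agreement_matrix(pred1, pred2, n_classes=3)
print(N)
print("row sums (C1):", N.sum(axis=1), "col sums (C2):", N.sum(axis=0))
```

Note that no ground-truth vector appears anywhere in the construction.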
14.3.3 Disagreement Measure (Multi-Class)
The binary formula of Sec. 14.1.3 generalizes directly:
\(\seteqnumber{0}{}{26}\)\begin{equation} \text {Dis} = \frac {1}{M}\sum _{i=1}^{M}\indFunc [\hat {y}_{1,i}\ne \hat {y}_{2,i}] = 1 - \frac {1}{M}\sum _{j=1}^{K} n_{jj}, \end{equation}
where \(n_{jj}\) are the diagonal entries of the agreement matrix (Sec. 14.3.2) that count the samples on which both classifiers predict the same class \(j\).
-
Example 14.10: Back to the previous example (Table 14.7):
\(\seteqnumber{0}{}{27}\)\begin{equation} \text {Dis} = 1 - \frac {160}{200} = 0.20. \end{equation}
The classifiers disagree on \(20\%\) of the samples.
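Both forms of the disagreement formula are one-liners. The sketch below uses illustrative helper names and shows the per-sample form and the matrix form side by side.

```python
import numpy as np

def disagreement(pred1, pred2):
    """Per-sample form: fraction of samples with differing predicted labels."""
    return float(np.mean(np.asarray(pred1) != np.asarray(pred2)))

def disagreement_from_matrix(N):
    """Matrix form: one minus the diagonal mass of the agreement matrix."""
    N = np.asarray(N)
    return float(1.0 - np.trace(N) / N.sum())

# Agreement matrix of the running 3-class example
N = np.array([[70, 6, 4],
              [10, 55, 5],
              [8, 7, 35]])
dis = disagreement_from_matrix(N)
print(dis)
```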
14.3.4 Bowker’s and Stuart-Maxwell Tests (*)
-
Goal: Diagnose whether two \(K\)-class classifiers behave differently from each other on a shared test set, without requiring ground truth \(y\). Two questions are answered separately:
-
• Do they confuse specific class pairs asymmetrically (Bowker)?
-
• Do they predict each class at different overall rates (Stuart-Maxwell)?
-
Unlike McNemar’s test (Sec. 14.1.4), which uses the correct/incorrect contingency table and answers “which classifier is more accurate,” Bowker and Stuart-Maxwell operate on the agreement matrix of Sec. 14.3.2, where ground truth \(y\) is absent. They cannot rank classifiers by accuracy; they detect whether the two classifiers behave systematically differently from each other. For multi-class accuracy comparison, apply McNemar unchanged to the correct/incorrect \(2\times 2\) collapse: each prediction is right or wrong regardless of \(K\).
The actionable outcomes are: (i) detect divergence between two models before labels are available (e.g. champion vs. challenger on live traffic); (ii) catch prior or threshold drift between two versions of a model (Stuart-Maxwell); (iii) localize which class pairs drive disagreement, by inspecting the per-pair Bowker contributions before summation (debugging); and (iv) decide whether an ensemble of the two should use symmetric (majority vote) or class-asymmetric weighting.
With \(K\) classes there are \(K(K-1)/2\) off-diagonal pairs, and the single binary equality \(n_{10}=n_{01}\) splits into two distinct, non-equivalent null hypotheses:
-
• Cell-wise symmetry. Every individual pair is balanced: \(n_{jk}=n_{kj}\) for all \(j<k\). This asks whether each specific class-confusion pattern (e.g. “\(C_1\) predicts A while \(C_2\) predicts B”) occurs as often as its mirror (“\(C_1\) predicts B while \(C_2\) predicts A”). Bowker’s test addresses this null. Rejecting \(H_0\) means that at least one off-diagonal pair \((j,k)\) is directionally biased: one classifier systematically substitutes class \(j\) for class \(k\) more often than the other does the reverse, so the two classifiers do not make the same kinds of confusions in the same proportions.
-
• Marginal homogeneity. For each class \(j\), the row marginal \(\sum _k n_{jk}\) (number of times \(C_1\) predicts class \(j\)) equals the column marginal \(\sum _k n_{kj}\) (number of times \(C_2\) predicts class \(j\)). This asks only whether the two classifiers issue each label at the same overall rate, without constraining which pairs of classes are confused: an excess of “\(C_1\) says A, \(C_2\) says B” can be cancelled out in the A-marginal by an excess of “\(C_1\) says C, \(C_2\) says A”, leaving the marginals balanced while individual off-diagonal pairs are not. Stuart-Maxwell’s test addresses this null. Rejecting \(H_0\) means that at least one class is predicted at a different overall rate by the two classifiers, indicating an aggregate bias in class usage (e.g. \(C_1\) predicts class A more often than \(C_2\) does, regardless of which other class \(C_2\) substitutes in its place).
The two coincide at \(K=2\) (where each marginal equation is equivalent to the single off-diagonal equality), but for \(K\ge 3\) cell-wise symmetry implies marginal homogeneity and not vice versa: asymmetries on different off-diagonal pairs can cancel in the marginals.
Bowker’s test of symmetry. \(H_0\colon n_{jk}=n_{kj}\) for every \(j<k\), i.e. the agreement matrix is symmetric about its diagonal:
\(\seteqnumber{0}{}{28}\)\begin{equation} \chi ^2_{\text {B}} = \sum _{1\le j<k\le K} \frac {(n_{jk}-n_{kj})^2}{n_{jk}+n_{kj}}, \qquad \text {df} = \frac {K(K-1)}{2}. \end{equation}
The degrees of freedom \(K(K-1)/2\) count the number of off-diagonal pairs above the diagonal, i.e. the number of distinct equalities \(n_{jk}=n_{kj}\) being tested; it is the parameter passed to the \(\chi ^2\) table (or to scipy.stats.chi2.sf) to obtain the p-value. At \(K=2\) this collapses to McNemar’s statistic (Sec. 14.1.4)\(^{4}\).
Stuart-Maxwell test of marginal homogeneity. A weaker null only requires that the row marginals match the column marginals:
\(\seteqnumber{0}{}{29}\)\begin{equation} H_0\colon \sum _{k=1}^{K} n_{jk} = \sum _{k=1}^{K} n_{kj} \quad \text {for all } j. \end{equation}
The test statistic \(\chi ^2_{\text {SM}}\) combines the \(K\) marginal differences into a single number using a covariance correction (one marginal is redundant because the totals must match), and is referred to a \(\chi ^2\) distribution with df\({}=K-1\). Standard implementations such as SquareTable.homogeneity in statsmodels compute it directly from \(N\). The p-value is interpreted as in Sec. 14.1.4: reject \(H_0\) when \(p<\alpha \).
Bowker’s test asks whether every individual confusion pattern \((j,k)\!\leftrightarrow \!(k,j)\) is balanced. Stuart-Maxwell asks only whether the two classifiers issue each class at the same overall rate. The two tests can disagree: a strongly asymmetric individual pair \((n_{jk}\gg n_{kj})\) can be masked by a compensating asymmetry on another pair, leaving the marginals balanced.
-
Example 14.10: Back to Example 14.10.
Bowker’s test (cell-wise symmetry). The three off-diagonal pairs above the diagonal are \((n_{AB},n_{BA})=(6,10)\), \((n_{AC},n_{CA})=(4,8)\), \((n_{BC},n_{CB})=(5,7)\).
\(\seteqnumber{0}{}{30}\)\begin{equation} \chi ^2_{\text {B}} = \frac {(6-10)^2}{6+10} + \frac {(4-8)^2}{4+8} + \frac {(5-7)^2}{5+7} = 1.00 + 1.33 + 0.33 = 2.67. \end{equation}
The three pairs contribute df\({}=3\), giving a p-value \(p\approx 0.45\). Since \(p>0.05\), we fail to reject \(H_0\): no individual confusion pattern shows a statistically significant directional bias, i.e. the classifiers swap each pair of classes in roughly balanced proportions.
Stuart-Maxwell test (marginal homogeneity). The row marginals of Table 14.7 are \((80,70,50)\) for \(C_1\) and the column marginals \((88,68,44)\) for \(C_2\), giving differences \(\bd =(-8,+2,+6)\) (summing to zero, as required). Plugging these into SquareTable.homogeneity yields \(\chi ^2_{\text {SM}}\approx 2.64\) with df\({}=2\) and \(p\approx 0.27\). We fail to reject \(H_0\): the two classifiers predict each class at statistically indistinguishable overall rates.
Both tests agree in this example: the disagreement on \(20\%\) of samples (Sec. 14.3.3) is the kind of fluctuation one would expect from a finite test set even if the two classifiers had no underlying systematic difference. Neither a per-pair bias (Bowker) nor an aggregate class-rate bias (Stuart-Maxwell) is detected.
Bowker rejection signals behavioural diversity between two classifiers, which is a necessary but not sufficient condition for ensemble benefit. Two models can reject Bowker’s null (lots of asymmetric label substitution) while still being wrong on the same samples, in which case averaging or voting will not help. To decide whether an ensemble will actually improve accuracy, use error correlation (Sec. 14.2.8) or Yule’s \(Q\) on the correct/incorrect \(2\times 2\) collapse (Sec. 14.1.6).
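Both statistics can be computed from the agreement matrix with a few lines of linear algebra. The sketch below is a hand-rolled illustration using only NumPy and `scipy.stats.chi2`. It assumes the usual Stuart-Maxwell covariance (variance \(n_{j\cdot }+n_{\cdot j}-2n_{jj}\), covariance \(-(n_{jk}+n_{kj})\)) with the last class dropped as the redundant one; variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Agreement matrix of the running example (rows: C1, columns: C2)
N = np.array([[70, 6, 4],
              [10, 55, 5],
              [8, 7, 35]], dtype=float)
K = N.shape[0]

# --- Bowker: sum over pairs j<k of (n_jk - n_kj)^2 / (n_jk + n_kj) ---
iu = np.triu_indices(K, k=1)
num = (N[iu] - N.T[iu]) ** 2
den = N[iu] + N.T[iu]
keep = den > 0                      # empty pairs carry no information
chi2_b = float((num[keep] / den[keep]).sum())
df_b = int(keep.sum())              # equals K(K-1)/2 when no pair is empty
p_b = float(chi2.sf(chi2_b, df_b))

# --- Stuart-Maxwell: quadratic form in K-1 marginal differences ---
row, col = N.sum(axis=1), N.sum(axis=0)
d = (row - col)[:-1]                # drop the redundant last class
V = -(N + N.T)
np.fill_diagonal(V, row + col - 2.0 * np.diag(N))
V = V[:-1, :-1]
chi2_sm = float(d @ np.linalg.solve(V, d))
p_sm = float(chi2.sf(chi2_sm, K - 1))

print(f"Bowker:         chi2={chi2_b:.3f}, df={df_b}, p={p_b:.3f}")
print(f"Stuart-Maxwell: chi2={chi2_sm:.3f}, df={K - 1}, p={p_sm:.3f}")
```

Keeping the per-pair terms `num / den` before summation is what allows the localization of the class pairs that drive a Bowker rejection.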
14.3.5 Omnibus Permutation Test (*)
Just as the permutation test is the non-parametric counterpart of McNemar (Sec. 14.1.4), the omnibus permutation test is the non-parametric counterpart of Bowker and Stuart-Maxwell (Sec. 14.3.4).
The test summarizes the asymmetry of the agreement matrix (Sec. 14.3.2) with a single scalar, for example Bowker’s \(\chi ^2_{\text {B}}\) itself or the total off-diagonal asymmetry
\(\seteqnumber{0}{}{31}\)\begin{equation} S = \sum _{1\le j<k\le K} \lvert n_{jk}-n_{kj}\rvert , \end{equation}
used directly as a discrepancy score rather than referred to a reference distribution. The null distribution is built under exchangeability of the two classifiers: for each disagreeing sample \(i\) (where \(\hat {y}_{1,i}\ne \hat {y}_{2,i}\)), the ordered prediction pair is swapped, \((\hat {y}_{1,i},\hat {y}_{2,i})\to (\hat {y}_{2,i},\hat {y}_{1,i})\), with probability \(1/2\). The agreement matrix is rebuilt, the statistic recomputed, and the procedure repeated \(T\) times (Sec. 13.3.1) to obtain the null distribution; the p-value is the fraction of the \(T+1\) statistics (the observed one included, hence the \(\frac {1}{T+1}\) normalization) that are at least as large as the observed \(S\). Swapping only disagreeing samples leaves the diagonal \(n_{jj}\) fixed, mirroring how the permutation form of McNemar reassigns only the discordant pairs.
-
Example 14.10: Back to Example 14.10 (Table 14.7). The off-diagonal pairs give
\(\seteqnumber{0}{}{32}\)\begin{equation} S = \lvert 6-10\rvert + \lvert 4-8\rvert + \lvert 5-7\rvert = 4 + 4 + 2 = 10. \end{equation}
Permuting the \(40\) disagreeing samples \(T=1000\) times yields a null distribution of \(S\) whose right tail places the observed \(S=10\) at \(p\approx 0.4\). This agrees with Bowker’s asymptotic \(p\approx 0.45\) (Sec. 14.3.4): the classifiers swap class pairs in roughly balanced proportions. Unlike Bowker’s \(\chi ^2\), the permutation p-value does not rely on the asymptotic approximation and would remain valid even if the cell counts had been too small for it.
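The swapping procedure can be sketched as below. The function names, the fixed seed, and the synthesized label vectors (built to match the example's cell counts) are assumptions made for illustration; the p-value uses the add-one form \((1+\#\{S_t\ge S_{\text {obs}}\})/(T+1)\).

```python
import numpy as np

def asymmetry(N):
    """S = sum over pairs j<k of |n_jk - n_kj|."""
    iu = np.triu_indices(N.shape[0], k=1)
    return int(np.abs(N[iu] - N.T[iu]).sum())

def build(a, b, n_classes):
    N = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(N, (a, b), 1)
    return N

def omnibus_permutation(pred1, pred2, n_classes, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    pred1, pred2 = np.asarray(pred1), np.asarray(pred2)
    disagree = pred1 != pred2
    s_obs = asymmetry(build(pred1, pred2, n_classes))
    hits = 0
    for _ in range(T):
        # Swap each disagreeing pair with probability 1/2; diagonal is untouched
        swap = disagree & (rng.random(pred1.size) < 0.5)
        a = np.where(swap, pred2, pred1)
        b = np.where(swap, pred1, pred2)
        hits += int(asymmetry(build(a, b, n_classes)) >= s_obs)
    return s_obs, (hits + 1) / (T + 1)

# Label vectors reproducing the example's agreement matrix (A=0, B=1, C=2)
table = np.array([[70, 6, 4], [10, 55, 5], [8, 7, 35]])
j_idx, k_idx = np.indices(table.shape)
pred1 = np.repeat(j_idx.ravel(), table.ravel())
pred2 = np.repeat(k_idx.ravel(), table.ravel())

s_obs, p_perm = omnibus_permutation(pred1, pred2, n_classes=3)
print(s_obs, p_perm)
```

Because the p-value is a Monte Carlo estimate, it varies slightly with the seed and with \(T\).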
14.3.6 Cohen’s Kappa (Multi-Class)
The definition of Sec. 14.1.5 extends without change; only the ingredients are computed on the \(K\times K\) agreement matrix. The observed agreement is the diagonal mass:
\(\seteqnumber{0}{}{33}\)\begin{equation} p_o = \frac {1}{M}\sum _{j=1}^{K} n_{jj}. \end{equation}
Define the empirical class-prediction frequency of each classifier from the marginals:
\(\seteqnumber{0}{}{34}\)\begin{equation} p_{1,j} = \frac {1}{M}\sum _{k=1}^{K} n_{jk}, \qquad p_{2,j} = \frac {1}{M}\sum _{k=1}^{K} n_{kj}. \end{equation}
Under independence, the probability that both classifiers issue the same class \(j\) is \(p_{1,j}p_{2,j}\), so the expected chance agreement is
\(\seteqnumber{0}{}{35}\)\begin{equation} p_e = \sum _{j=1}^{K} p_{1,j}\,p_{2,j}, \end{equation}
and Cohen’s kappa is \(\kappa =(p_o-p_e)/(1-p_e)\).
Range: \(\kappa = 1\) indicates perfect agreement, \(\kappa = 0\) indicates agreement no better than chance, and \(\kappa < 0\) indicates systematic disagreement. The interpretation is identical to the binary case (Sec. 14.1.5).
-
Example 14.10: Back to Example 14.10. Row marginals of \(C_1\) give \(p_{1,j}=(0.40,\,0.35,\,0.25)\); column marginals of \(C_2\) give \(p_{2,j}=(0.44,\,0.34,\,0.22)\).
\(\seteqnumber{0}{}{36}\)\begin{equation} \begin{aligned} p_o &= 160/200 = 0.80 \\[3pt] p_e &= 0.40\cdot 0.44 + 0.35\cdot 0.34 + 0.25\cdot 0.22 = 0.176 + 0.119 + 0.055 = 0.350 \\[3pt] \kappa &= \frac {0.80 - 0.350}{1 - 0.350} = \frac {0.450}{0.650} \approx 0.692. \end {aligned} \end{equation}
The substantial \(\kappa \approx 0.69\) indicates strong agreement beyond chance, well above the slight agreement observed in the binary example (Sec. 14.1.5, \(\kappa \approx 0.22\)).
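The three ingredients \(p_o\), \(p_e\), and \(\kappa \) follow directly from the agreement matrix. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def cohen_kappa_from_matrix(N):
    """Cohen's kappa from a K x K agreement matrix (rows: C1, columns: C2)."""
    N = np.asarray(N, dtype=float)
    M = N.sum()
    p_o = np.trace(N) / M                       # observed agreement
    p1 = N.sum(axis=1) / M                      # C1 class-prediction frequencies
    p2 = N.sum(axis=0) / M                      # C2 class-prediction frequencies
    p_e = float(p1 @ p2)                        # chance agreement under independence
    return float((p_o - p_e) / (1.0 - p_e))

N = np.array([[70, 6, 4],
              [10, 55, 5],
              [8, 7, 35]])
kappa = cohen_kappa_from_matrix(N)
print(round(kappa, 3))
```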
14.3.7 Multi-Class Brier Score
-
Goal: Extend the Brier score (Sec. 14.2.2) to predicted class-probability vectors \(\bp _i\in [0,1]^K\) with \(\sum _{k=1}^{K} p_{i,k}=1\).
With one-hot labels \(\by _i\in \{0,1\}^K\) (where \(y_{i,k}=1\) iff sample \(i\) belongs to class \(k\)), the multi-class Brier score is the mean squared \(L_2\) distance between \(\bp _i\) and \(\by _i\):
\(\seteqnumber{0}{}{37}\)\begin{equation} \text {BS} = \frac {1}{M}\sum _{i=1}^{M}\sum _{k=1}^{K}\left (p_{i,k}-y_{i,k}\right )^2. \end{equation}
Range: \(0 \le \text {BS} \le 2\). A perfect classifier (\(\bp _i=\by _i\)) achieves \(\text {BS}=0\). A uniform predictor (\(p_{i,k}=1/K\)) achieves \(\text {BS}=1-1/K\).
When \(K=2\), the formula above is twice the binary Brier score of Sec. 14.2.2 because both classes contribute identical squared errors. Some libraries (e.g. sklearn.metrics.brier_score_loss) report the binary form without the factor of \(2\); verify the convention before comparing values across sources.
-
Example 14.11: Three samples with \(K=3\) classes and predicted vectors:
| \(i\) | True class | \(\bp _i\) | \(\lVert \bp _i-\by _i\rVert _2^2\) |
| 1 | A | \((0.8,\,0.1,\,0.1)\) | \(0.04+0.01+0.01 = 0.06\) |
| 2 | B | \((0.3,\,0.5,\,0.2)\) | \(0.09+0.25+0.04 = 0.38\) |
| 3 | C | \((0.2,\,0.2,\,0.6)\) | \(0.04+0.04+0.16 = 0.24\) |
\(\seteqnumber{0}{}{38}\)\begin{equation} \text {BS} = \frac {0.06+0.38+0.24}{3} = \frac {0.68}{3} \approx 0.227. \end{equation}
The score lies well below the uniform-predictor baseline \(1-1/K=2/3\approx 0.667\), indicating that the classifier’s probability vectors are substantially more informative than a uniform guess across classes. The per-sample contributions also point to where it is weakest: sample 2 dominates the average (contribution \(0.38\) out of \(0.68\)), because the predicted vector \((0.3,0.5,0.2)\) assigns only \(0.5\) to the true class B and leaks \(0.3\) to class A; samples 1 and 3 are tighter (\(0.06\) and \(0.24\)). Inspecting the per-sample term \(\|\bp _i-\by _i\|_2^2\) is a simple way to flag the test points where the predicted vector is farthest from the label.
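The computation reduces to a one-hot encoding followed by a row-wise squared distance. A short sketch on the three samples of the example (A=0, B=1, C=2; variable names are illustrative):

```python
import numpy as np

P = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.2, 0.6]])     # predicted probability vectors
y = np.array([0, 1, 2])             # true classes A, B, C

Y = np.eye(P.shape[1])[y]                       # one-hot labels
per_sample = ((P - Y) ** 2).sum(axis=1)         # squared L2 distance per sample
bs = float(per_sample.mean())                   # multi-class Brier score

K = P.shape[1]
uniform_baseline = 1.0 - 1.0 / K                # Brier score of p_{i,k} = 1/K
print(per_sample, round(bs, 3), round(uniform_baseline, 3))
```

The vector `per_sample` is also the scalar score \(S_{k,i}\) that the paired tests of Sec. 14.2 consume.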
14.3.8 Multi-Class Calibration: Top-Label and Classwise ECE
-
Goal: Extend the calibration curve and the Expected Calibration Error of Sec. 14.2.3 to \(K\) classes.
The binary ECE bins predictions by their probability and compares the bin mean to the empirical positive rate. With \(K\) classes, two standard reductions to the binary machinery are used.
Top-label ECE Each sample is reduced to its winning probability and predicted class,
\(\seteqnumber{0}{}{39}\)\begin{equation} \hat {p}_i = \max _{k} p_{i,k}, \qquad \hat {y}_i = \argmax _k p_{i,k}. \end{equation}
Bin the values \(\{\hat {p}_i\}_{i=1}^{M}\) as in Sec. 14.2.3. In each bin \(b\), compare the mean confidence \(\bar {\hat {p}}_b\) to the bin accuracy \(\bar {a}_b = (1/|I_b|)\sum _{i\in I_b}\indFunc [\hat {y}_i=y_i]\):
\(\seteqnumber{0}{}{40}\)\begin{equation} \text {ECE}_{\text {top}} = \sum _{b=1}^{B}\frac {|I_b|}{M}\,\left |\bar {a}_b - \bar {\hat {p}}_b\right |. \end{equation}
This is the metric most commonly reported in the literature (e.g. for deep network calibration).
Classwise (marginal) ECE Treat each class as its own binary calibration problem with \((p_{i,k},\,\indFunc [y_i=k])\) and average:
\(\seteqnumber{0}{}{41}\)\begin{equation} \text {ECE}_{\text {cw}} = \frac {1}{K}\sum _{k=1}^{K}\text {ECE}^{(k)}, \qquad \text {ECE}^{(k)} = \sum _{b=1}^{B}\frac {|I_b^{(k)}|}{M}\,\left |\bar {y}^{(k)}_b - \bar {p}^{(k)}_b\right |. \end{equation}
Both reductions produce reliability diagrams of the same shape as Fig. 14.3: one curve per class for classwise, or a single curve in confidence space for top-label.
Top-label ECE is cheap and inspects only the predicted class, so it ignores miscalibration on the non-winning classes. Classwise ECE catches that residual miscalibration but is noisier per class, especially for imbalanced datasets where minority classes populate few bins.
-
Example 14.12: A probabilistic classifier is evaluated on \(M=6\) samples with \(K=3\) classes (A, B, C); the true labels are two of each class.
| \(i\) | \(y_i\) | \(p_{i,A}\) | \(p_{i,B}\) | \(p_{i,C}\) | \(\hat {p}_i=\max _c p_{i,c}\) | \(\hat {y}_i\) |
| 1 | A | 0.80 | 0.15 | 0.05 | 0.80 | A |
| 2 | A | 0.45 | 0.40 | 0.15 | 0.45 | A |
| 3 | B | 0.10 | 0.70 | 0.20 | 0.70 | B |
| 4 | B | 0.30 | 0.50 | 0.20 | 0.50 | B |
| 5 | C | 0.20 | 0.25 | 0.55 | 0.55 | C |
| 6 | C | 0.35 | 0.40 | 0.25 | 0.40 | B |
The classifier is correct on samples 1–5 and wrong on sample 6, so accuracy is \(5/6\approx 83\%\). Use \(B=2\) bins, \(I_1=[0,\,0.5)\) and \(I_2=[0.5,\,1]\).
Top-label ECE. Bin by \(\hat {p}_i\):
| Bin | Samples | \(\bar {\hat {p}}_b\) (mean confidence) | \(\bar {a}_b\) (accuracy) |
| \([0,\,0.5)\) | 2, 6 | \((0.45+0.40)/2 = 0.425\) | \(1/2 = 0.500\) |
| \([0.5,\,1]\) | 1, 3, 4, 5 | \((0.80+0.70+0.50+0.55)/4 = 0.6375\) | \(4/4 = 1.000\) |
\(\seteqnumber{0}{}{42}\)\begin{equation} \text {ECE}_{\text {top}} = \tfrac {2}{6}\,|0.500 - 0.425| + \tfrac {4}{6}\,|1.000 - 0.6375| = 0.025 + 0.2417 \approx 0.267. \end{equation}
The high-confidence bin dominates: the classifier is underconfident there (predicts \(0.64\) but is right \(100\%\) of the time).
Classwise ECE. For each class, treat \((p_{i,c},\,\indFunc [y_i=c])\) as a binary calibration problem.
| Class \(c\) | Bin | Samples in bin | \(\bar {p}^{(c)}_b\) | \(\bar {y}^{(c)}_b\) | \(\lvert \bar {y}^{(c)}_b - \bar {p}^{(c)}_b\rvert \) |
| A | \([0,\,0.5)\) | 2, 3, 4, 5, 6 | \(0.280\) | \(0.200\) | \(0.080\) |
| A | \([0.5,\,1]\) | 1 | \(0.800\) | \(1.000\) | \(0.200\) |
| B | \([0,\,0.5)\) | 1, 2, 5, 6 | \(0.300\) | \(0.000\) | \(0.300\) |
| B | \([0.5,\,1]\) | 3, 4 | \(0.600\) | \(1.000\) | \(0.400\) |
| C | \([0,\,0.5)\) | 1, 2, 3, 4, 6 | \(0.170\) | \(0.200\) | \(0.030\) |
| C | \([0.5,\,1]\) | 5 | \(0.550\) | \(1.000\) | \(0.450\) |
The per-class ECEs are
\(\seteqnumber{0}{}{43}\)\begin{equation} \text {ECE}^{(A)} = \tfrac {5}{6}(0.080) + \tfrac {1}{6}(0.200) = 0.100,\quad \text {ECE}^{(B)} = \tfrac {4}{6}(0.300) + \tfrac {2}{6}(0.400) \approx 0.333, \end{equation}
\(\seteqnumber{0}{}{44}\)\begin{equation} \text {ECE}^{(C)} = \tfrac {5}{6}(0.030) + \tfrac {1}{6}(0.450) = 0.100, \end{equation}
and the aggregate is
\(\seteqnumber{0}{}{45}\)\begin{equation} \text {ECE}_{\text {cw}} = \tfrac {1}{3}(0.100 + 0.333 + 0.100) \approx 0.178. \end{equation}
Comparison. The top-label score (\(0.267\)) flags overall underconfidence in the high-probability bin but says nothing about which class is responsible. The classwise score (\(0.178\)) is smaller but localizes the problem: class B contributes \(0.333\), more than three times the per-class contribution of A or C, because the model assigns moderate probability mass to B on samples where B is absent (bin 1) and underestimates B on samples where it is correct (bin 2). For deployment, the actionable recommendation is to recalibrate class B specifically rather than the whole classifier.
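Both reductions reuse one binary ECE routine. The sketch below reproduces the two-bin computation of the example; the helper `binary_ece` and the use of `np.digitize` for binning are implementation choices assumed here, not prescribed by the text.

```python
import numpy as np

def binary_ece(p, hit, edges):
    """Sum over bins of |I_b|/M * |mean(hit) - mean(p)|.

    Bins are [e_0, e_1), ..., [e_{B-1}, e_B] (last bin closed on the right).
    """
    p, hit = np.asarray(p, dtype=float), np.asarray(hit, dtype=float)
    idx = np.digitize(p, edges[1:-1])          # interior edges only
    ece = 0.0
    for b in range(len(edges) - 1):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(hit[mask].mean() - p[mask].mean())
    return float(ece)

# Data of the example (A=0, B=1, C=2)
P = np.array([[0.80, 0.15, 0.05],
              [0.45, 0.40, 0.15],
              [0.10, 0.70, 0.20],
              [0.30, 0.50, 0.20],
              [0.20, 0.25, 0.55],
              [0.35, 0.40, 0.25]])
y = np.array([0, 0, 1, 1, 2, 2])
edges = [0.0, 0.5, 1.0]

# Top-label: winning probability vs. correctness of the predicted class
ece_top = binary_ece(P.max(axis=1), P.argmax(axis=1) == y, edges)

# Classwise: one binary problem per class, then average
per_class = [binary_ece(P[:, k], y == k, edges) for k in range(P.shape[1])]
ece_cw = float(np.mean(per_class))

print(round(ece_top, 3), np.round(per_class, 3), round(ece_cw, 3))
```

With so few samples and bins the numbers are illustrative only; realistic evaluations typically use 10 to 15 bins and far larger \(M\).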
14.3.9 Multi-Class AUC and DeLong
-
Goal: Extend AUC-based discrimination comparison (Sec. 14.2.7) to \(K\) classes, i.e. test whether two multi-class classifiers differ significantly in AUC.
With \(K\) classes, the ROC curve is not unique. Two standard aggregations summarize discrimination in a single scalar.
One-vs-Rest macro-AUC. For each class \(k\) compute the binary AUC of \(\{(p_{i,k},\indFunc [y_i=k])\}_i\), then average:
\(\seteqnumber{0}{}{46}\)\begin{equation} \text {AUC}_{\text {OvR}} = \frac {1}{K}\sum _{k=1}^{K}\text {AUC}^{(k)}. \end{equation}
One-vs-One AUC. For every ordered pair of classes \((j,k)\) with \(j\ne k\), compute the AUC restricted to samples whose true label is \(j\) or \(k\), using \(p_{i,k}\) as the score for class \(k\). Average the two orderings of each pair, then average over the \(\binom {K}{2}\) unordered pairs (Hand and Till’s \(M\)-measure).
DeLong for multi-class. DeLong’s test (Sec. 14.2.7) gives a paired variance estimate for the difference between two AUCs. In the multi-class setting it is applied per class on the OvR scores, yielding \(K\) p-values that can be combined (e.g. Bonferroni correction across classes), or on \(\text {AUC}_{\text {OvR}}\) directly using the per-class DeLong covariances aggregated into the macro variance. Standard implementations (sklearn.metrics.roc_auc_score with multi_class='ovr' or 'ovo') compute the point estimates; dedicated libraries\(^{5}\) provide the DeLong variance.
Macro-AUC weights classes equally, regardless of frequency, so a poorly discriminated minority class can dominate the score. OvO weights class pairs equally and is less sensitive to imbalance, but more expensive. The two can rank a pair of classifiers differently; report the one matching the operational cost structure.
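The two aggregations can be written on top of a single pairwise AUC helper. The following NumPy-only sketch is for illustration (function names are hypothetical, ties count as \(1/2\)); for real use, `sklearn.metrics.roc_auc_score` with the `multi_class` argument is the more convenient route for the point estimates.

```python
import numpy as np
from itertools import combinations

def binary_auc(pos, neg):
    """P(score of a positive > score of a negative), ties counted as 1/2."""
    diff = np.asarray(pos)[:, None] - np.asarray(neg)[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def auc_ovr(P, y):
    """Macro average of the K one-vs-rest AUCs."""
    K = P.shape[1]
    return float(np.mean([binary_auc(P[y == k, k], P[y != k, k])
                          for k in range(K)]))

def auc_ovo(P, y):
    """Hand-Till style: average both orderings per pair, then over pairs."""
    K = P.shape[1]
    vals = []
    for j, k in combinations(range(K), 2):
        a_jk = binary_auc(P[y == j, j], P[y == k, j])   # score p_j, class j vs k
        a_kj = binary_auc(P[y == k, k], P[y == j, k])   # score p_k, class k vs j
        vals.append(0.5 * (a_jk + a_kj))
    return float(np.mean(vals))

# Made-up toy data with K = 3 classes, for illustration only
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.5, 0.4, 0.1],
              [0.1, 0.2, 0.7]])
y = np.array([0, 0, 1, 2])
print(auc_ovr(P, y), auc_ovo(P, y))
```

Each class must appear at least once in `y`; otherwise the corresponding binary AUC is undefined.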
14.3.10 Per-Sample-Score Methods (Carry Over Unchanged)
A comparison method is \(K\)-aware only if its formula directly indexes a class label or a class-conditional probability. Methods that reduce each sample to a single scalar \(S_{k,i}\) before doing anything else are \(K\)-agnostic: once you have a column of scalars per classifier, the test code is the same whether the scalars came from a binary or a multi-class problem.
The scalar score. For probabilistic classifiers the natural per-sample score is either the multi-class Brier (Sec. 14.3.7) or the cross-entropy loss (Sec. 8.4):
\(\seteqnumber{0}{}{47}\)\begin{equation} S_{k,i} = \sum _{c=1}^{K}\left (p_{k,i,c}-y_{i,c}\right )^2 \qquad \text {or}\qquad S_{k,i} = -\sum _{c=1}^{K} y_{i,c}\,\log p_{k,i,c} \end{equation}
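where \(y_{i,c}\in \{0,1\}\) is the one-hot label and \(p_{k,i,c}\) is classifier \(k\)’s probability for class \(c\) on sample \(i\). When \(K=2\) the cross-entropy reduces exactly to its binary form, while the Brier sum equals twice the binary Brier \((p_i-y_i)^2\), because both class terms contribute the same squared error; the constant factor does not affect the sign of any score difference or the outcome of the tests below.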
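A minimal sketch of the two per-sample scores, assuming probabilities and one-hot labels are given as plain sequences. The function names and the clipping constant `eps` (which guards against \(\log 0\)) are assumptions added for illustration.

```python
import math


def brier_score(p, y_onehot):
    """Multi-class Brier score of one sample: sum_c (p_c - y_c)^2."""
    return sum((pc - yc) ** 2 for pc, yc in zip(p, y_onehot))


def cross_entropy(p, y_onehot, eps=1e-12):
    """Cross-entropy of one sample: -sum_c y_c log p_c (eps avoids log(0))."""
    return -sum(yc * math.log(max(pc, eps)) for pc, yc in zip(p, y_onehot))


# One sample with K = 3 classes whose true class is the first one.
p = (0.80, 0.15, 0.05)
y = (1, 0, 0)
print(brier_score(p, y), cross_entropy(p, y))

# K = 2: the sum has two equal terms, i.e. 2 * (p - y)^2.
print(brier_score((0.7, 0.3), (1, 0)), 2 * (0.7 - 1.0) ** 2)
```

Applying either function row by row yields the column \(\{S_{k,i}\}\) that every method below consumes.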
What each method answers and what it sees.
-
• Paired score difference (Sec. 14.2.4). Goal: quantify which classifier produces the better probabilistic predictions (lower loss), sample by sample. Input: the per-sample gap \(\delta _i=S_{1,i}-S_{2,i}\); since \(S\) is a loss, \(\bar \delta <0\) favours classifier 1 and \(\bar \delta >0\) favours classifier 2.
-
• Paired t-test (Sec. 14.2.5). Goal: test whether the score difference is statistically significant, assuming approximately normal \(\delta _i\). Input: the vector \(\{\delta _i\}_{i=1}^{M}\); tests \(\E [\delta _i]=0\).
-
• Wilcoxon signed-rank (Sec. 14.2.6). Goal: non-parametric significance test on the median score difference, robust to outliers and non-normality. Input: ranks of \(|\delta _i|\) with their signs.
-
• Pearson error correlation (Sec. 14.2.8). Goal: measure how correlated the two classifiers’ errors are (probabilistic analog of Yule’s \(Q\)); low correlation signals ensemble potential. Input: the two columns \(\{S_{1,i}\},\{S_{2,i}\}\); returns \(r\in [-1,1]\).
-
• Spearman rank correlation (Sec. 14.2.9). Goal: non-parametric error-correlation diagnostic; test whether the two classifiers rank samples by difficulty in the same order. Input: ranks of those columns; returns \(r_s\in [-1,1]\).
None of these formulas index a class. The class count \(K\) is absorbed into \(S\) at the score-construction step and is invisible from then on.
What is gained and what is lost. The same library calls (scipy.stats.ttest_rel, wilcoxon, pearsonr, spearmanr) work for any classification task; only the score function changes. The price is that these tests answer only “do the classifiers’ overall score distributions differ?” and not “on which class does one classifier do worse?” For per-class diagnosis use the classwise ECE (Sec. 14.3.8) or per-class DeLong (Sec. 14.3.9), both of which loop over \(K\) explicitly.
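The library calls named above can be wrapped in one helper that never sees \(K\). This is a sketch under stated assumptions: the wrapper name `compare_scores`, the dictionary keys, and the two score columns are hypothetical, and the results are unpacked as (statistic, p-value) pairs.

```python
from scipy import stats


def compare_scores(s1, s2):
    """Run the K-agnostic tests on two columns of per-sample losses."""
    t_stat, t_p = stats.ttest_rel(s1, s2)      # paired t-test on delta_i
    w_stat, w_p = stats.wilcoxon(s1, s2)       # signed-rank test on delta_i
    r, _ = stats.pearsonr(s1, s2)              # error correlation
    r_s, _ = stats.spearmanr(s1, s2)           # rank correlation of difficulty
    mean_gap = sum(a - b for a, b in zip(s1, s2)) / len(s1)
    return {
        "mean_gap": mean_gap,
        "t": float(t_stat), "t_p": float(t_p),
        "wilcoxon_w": float(w_stat), "wilcoxon_p": float(w_p),
        "pearson_r": float(r), "spearman_r": float(r_s),
    }


# Hypothetical per-sample losses of two classifiers (illustration only);
# they could come from a binary or a multi-class problem.
s1 = [0.10, 0.42, 0.31, 0.05, 0.66, 0.27]
s2 = [0.15, 0.35, 0.40, 0.08, 0.58, 0.37]
for name, value in compare_scores(s1, s2).items():
    print(f"{name}: {value:.4f}")
```

Switching from Brier to cross-entropy, or from \(K=2\) to \(K=10\), changes only how `s1` and `s2` are built; the helper itself stays untouched.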
Rule of thumb: if a comparison method sees the classifier output only through a per-sample loss \(S_{k,i}\in \mathbb {R}\), it is \(K\)-agnostic and needs no multi-class version. If it indexes class probabilities \(p_{k,i,c}\) or class pairs \((j,k)\), it is \(K\)-aware and requires the multi-class treatment given earlier in this section.
-
Example 14.13: Two probabilistic classifiers are evaluated on \(M=6\) samples with \(K=3\) classes (A, B, C). The per-sample score is the multi-class Brier \(S_{k,i}=\sum _{c=1}^{K}(p_{k,i,c}-y_{i,c})^2\).
| \(i\) | \(y_i\) | \(\bp _{1,i}\) | \(\bp _{2,i}\) | \(S_{1,i}\) | \(S_{2,i}\) | \(\delta _i\) |
| 1 | A | \((0.80,\,0.15,\,0.05)\) | \((0.60,\,0.30,\,0.10)\) | 0.065 | 0.260 | \(-0.195\) |
| 2 | A | \((0.45,\,0.40,\,0.15)\) | \((0.55,\,0.30,\,0.15)\) | 0.485 | 0.315 | \(+0.170\) |
| 3 | B | \((0.10,\,0.70,\,0.20)\) | \((0.20,\,0.55,\,0.25)\) | 0.140 | 0.305 | \(-0.165\) |
| 4 | B | \((0.30,\,0.50,\,0.20)\) | \((0.25,\,0.60,\,0.15)\) | 0.380 | 0.245 | \(+0.135\) |
| 5 | C | \((0.20,\,0.25,\,0.55)\) | \((0.30,\,0.20,\,0.50)\) | 0.305 | 0.380 | \(-0.075\) |
| 6 | C | \((0.35,\,0.40,\,0.25)\) | \((0.20,\,0.35,\,0.45)\) | 0.845 | 0.465 | \(+0.380\) |
| means | | | | \(\bar {S}_1\!=\!0.370\) | \(\bar {S}_2\!=\!0.328\) | \(\bar {\delta }\!=\!+0.042\) |
The mean gap \(\bar \delta \approx +0.042\) is small and positive, slightly favouring classifier 2 (smaller Brier is better).
Paired t-test (Sec. 14.2.5). With \(s_\delta \approx 0.225\) and \(M=6\),
\(\seteqnumber{0}{}{48}\)\begin{equation} t = \frac {0.042}{0.225/\sqrt {6}} \approx 0.46, \qquad p\approx 0.67. \end{equation}
Fail to reject; the score difference is not significant.
Pearson error correlation (Sec. 14.2.8). Computing on the two score columns,
\(\seteqnumber{0}{}{49}\)\begin{equation} r = \frac {\sum _i(S_{1,i}-\bar {S}_1)(S_{2,i}-\bar {S}_2)}{\sqrt {\sum _i(S_{1,i}-\bar {S}_1)^2}\,\sqrt {\sum _i(S_{2,i}-\bar {S}_2)^2}} \approx 0.75. \end{equation}
Strong positive error correlation: when one classifier scores poorly on a sample, the other does too. By the Step 2 threshold of the Ensemble Recommendation (Sec. 14.3.11), \(r\approx 0.75\) is approaching the \(r\ge 0.8\) redundancy zone, so ensembling would likely produce only marginal gains.
Observation. The procedure is byte-for-byte identical to the binary Example 14.9: only the per-sample score function changes from binary Brier \((p_i-y_i)^2\) to multi-class Brier \(\sum _c(p_{i,c}-y_{i,c})^2\). The downstream library calls (scipy.stats.ttest_rel, pearsonr) do not see \(K\).
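The numbers of Example 14.13 can be re-derived from the probability vectors with the standard library alone. This is a sketch for checking the arithmetic; the integer encoding of classes A, B, C as 0, 1, 2 and all variable names are assumptions made for illustration.

```python
import math
import statistics

labels = [0, 0, 1, 1, 2, 2]  # A, B, C encoded as 0, 1, 2
p1 = [(0.80, 0.15, 0.05), (0.45, 0.40, 0.15), (0.10, 0.70, 0.20),
      (0.30, 0.50, 0.20), (0.20, 0.25, 0.55), (0.35, 0.40, 0.25)]
p2 = [(0.60, 0.30, 0.10), (0.55, 0.30, 0.15), (0.20, 0.55, 0.25),
      (0.25, 0.60, 0.15), (0.30, 0.20, 0.50), (0.20, 0.35, 0.45)]


def brier(p, label):
    """Multi-class Brier score of one sample against an integer label."""
    return sum((pc - (1.0 if c == label else 0.0)) ** 2 for c, pc in enumerate(p))


s1 = [brier(p, y) for p, y in zip(p1, labels)]
s2 = [brier(p, y) for p, y in zip(p2, labels)]
delta = [a - b for a, b in zip(s1, s2)]

# Paired t statistic: mean gap over its standard error.
mean_gap = statistics.mean(delta)
s_delta = statistics.stdev(delta)  # sample standard deviation (divides by M - 1)
t_stat = mean_gap / (s_delta / math.sqrt(len(delta)))

# Pearson correlation of the two score columns.
m1, m2 = statistics.mean(s1), statistics.mean(s2)
cov = sum((a - m1) * (b - m2) for a, b in zip(s1, s2))
r = cov / math.sqrt(sum((a - m1) ** 2 for a in s1) * sum((b - m2) ** 2 for b in s2))

print(f"mean gap {mean_gap:+.3f}, s_delta {s_delta:.3f}, t {t_stat:.2f}, r {r:.2f}")
```

Carried at full precision the statistic is about 0.45; the value 0.46 in the text results from inserting the rounded inputs 0.042 and 0.225. The conclusion is unaffected.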
14.3.11 Ensemble Recommendation
-
Goal: Decide whether the two compared classifiers should be combined into an ensemble (Sec. 15.3), and if so, whether to use a symmetric (majority vote, score averaging) or asymmetric (weighted, class-conditional) fusion rule. The decision rests on the diagnostics built up in this chapter.
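The diagnostics produce four checkpoints. The recommendation is conservative: an ensemble is justified only when the first two checkpoints pass (both classifiers are individually useful and their errors are not redundant); the last two then select and prepare the fusion rule. Otherwise, deploy the better single classifier.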
Step 1. Both classifiers are individually useful. Verify that each classifier outperforms the trivial baseline (constant prediction at the majority class). Then run McNemar (Sec. 14.1.4, applied to the correct/incorrect \(2\times 2\) for \(K>2\)). If the test rejects and the accuracy gap is large (rule of thumb: \(>5\) percentage points), drop the weaker classifier; ensembling rarely recovers from a wide accuracy gap because the better model’s votes are diluted by the weaker one.
Step 2. The errors are not redundant. This is the critical condition: ensembling helps only when the two classifiers fail on different samples.
-
• Hard predictions: Yule’s \(Q\) on the correct/incorrect \(2\times 2\) (Sec. 14.1.6; for \(K>2\), apply to the collapse). \(Q\ge 0.8\) signals highly correlated errors, so ensembling will produce marginal gains at best.
-
• Probabilistic outputs: Pearson error correlation \(r\) (Sec. 14.2.8) or Spearman \(r_s\) (Sec. 14.2.9). \(r\ge 0.8\) is the same red flag; \(r\le 0\) is the ideal case (uncorrelated or negatively correlated errors).
If both diagnostics indicate near-perfect correlation, stop: the two classifiers are effectively the same model and an ensemble will not help.
Step 3. Choose symmetric vs. asymmetric fusion. With diversity confirmed, the choice of fusion rule depends on the shape of disagreement:
-
• Bowker (Sec. 14.3.4) fails to reject: disagreements are symmetric across class pairs. Use symmetric fusion (majority vote, soft vote / score averaging, Sec. 15.3).
-
• Bowker rejects: at least one class pair is asymmetrically confused. Use asymmetric fusion: weighted averaging (weight by per-class accuracy) or class-conditional routing (send disputed-pair samples to the classifier known to be stronger on that boundary).
Step 4. Calibration before soft averaging (optional). Soft vote and probability averaging assume the inputs are comparable. Before averaging probabilities, check the calibration of both classifiers (Sec. 14.2.3 for binary, Sec. 14.3.8 for multi-class). If either is poorly calibrated, recalibrate first (Platt scaling, isotonic regression) or fall back to hard voting, which is unaffected by calibration.
Cohen’s \(\kappa \) measures agreement on labels, not on errors. Two classifiers with high \(\kappa \) (they label most samples the same way) can still benefit from ensembling if their few disagreements happen to be precisely on the samples each one gets wrong, so \(\kappa \) is not the ensemble-benefit signal. The signal is Yule’s \(Q\) or the error correlation, both of which condition on the truth \(y\).
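The four checkpoints can be sketched as a single decision function on the contingency counts. Several assumptions are made here and are not fixed by the text: McNemar is taken in its uncorrected chi-square form \((n_{10}-n_{01})^2/(n_{10}+n_{01})\), Yule's \(Q\) in the form \((n_{11}n_{00}-n_{10}n_{01})/(n_{11}n_{00}+n_{10}n_{01})\), and the baseline check, the Bowker outcome and the calibration verdict are passed in as booleans because they are computed elsewhere. Function names and return keys are hypothetical.

```python
import math


def mcnemar_p(n10, n01):
    """p-value of the uncorrected McNemar chi-square statistic (1 d.o.f.)."""
    if n10 + n01 == 0:
        return 1.0  # no disagreements, nothing to test
    chi2 = (n10 - n01) ** 2 / (n10 + n01)
    return math.erfc(math.sqrt(chi2 / 2.0))  # survival function of chi-square(1)


def yule_q(n11, n10, n01, n00):
    """Yule's Q on the correct/incorrect 2x2 table."""
    den = n11 * n00 + n10 * n01
    if den == 0:
        raise ValueError("Yule's Q is undefined for this table")
    return (n11 * n00 - n10 * n01) / den


def recommend(n11, n10, n01, n00, bowker_rejects, well_calibrated,
              beats_baseline=True, alpha=0.05, max_gap_pts=5.0, q_max=0.8):
    """Walk through Steps 1-4 and return the recommendation as a dict."""
    m = n11 + n10 + n01 + n00
    gap_pts = 100.0 * abs(n10 - n01) / m  # accuracy gap in percentage points
    # Step 1: both classifiers useful, no significant and wide accuracy gap.
    if not beats_baseline:
        return {"ensemble": False, "reason": "a classifier fails the baseline"}
    if mcnemar_p(n10, n01) < alpha and gap_pts > max_gap_pts:
        return {"ensemble": False, "reason": "accuracy gap too wide"}
    # Step 2: errors must not be redundant.
    if yule_q(n11, n10, n01, n00) >= q_max:
        return {"ensemble": False, "reason": "errors are redundant"}
    # Step 3: shape of disagreement selects the fusion family.
    fusion = "asymmetric" if bowker_rejects else "symmetric"
    # Step 4: probability averaging only with calibrated inputs.
    combine = "soft averaging" if well_calibrated else "hard voting"
    return {"ensemble": True, "fusion": fusion, "combine": combine}


# Counts of the binary worked example used in Example 14.14.
result = recommend(150, 25, 15, 10, bowker_rejects=False, well_calibrated=True)
print(mcnemar_p(25, 15), yule_q(150, 25, 15, 10), result)
```

The thresholds are exposed as keyword arguments because they are rules of thumb, not fixed constants.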
-
Example 14.14: Apply the recipe to the binary worked example of Sec. 14.1.1 (\(n_{11}=150\), \(n_{10}=25\), \(n_{01}=15\), \(n_{00}=10\)).
-
• McNemar’s p-value \(\approx 0.11\) (Sec. 14.1.4), so the \(5\)-point accuracy gap is not significant; both classifiers stay in play.
-
• \(Q\approx 0.6\) (Sec. 14.1.6), well below the \(0.8\) redundancy threshold, so diversity is sufficient.
-
• Bowker reduces to McNemar at \(K=2\) and again fails to reject, so disagreements are symmetric: use majority vote or soft averaging.
-
• With \(K=2\) and assuming the classifiers are reasonably calibrated, soft averaging is the default; otherwise, fall back to majority vote.
-