12 Classifier Comparison
12.1 Comparison Between Two Non-Probabilistic Classifiers
General evaluation methods concentrate on the performance of a single classifier. In the following, dedicated inter-classifier comparison methods are provided.
12.1.1 Contingency Table
| | | Classifier 1 (C1) | Classifier 1 (C1) |
| | | Correct, \(\hat {Y}_1=Y\) | Incorrect, \(\hat {Y}_1\ne Y\) |
| Classifier 2 (C2) | Correct, \(\hat {Y}_2=Y\) | Both correct, \(n_{11}\) | C1 incorrect, C2 correct, \(n_{01}\) |
| | Incorrect, \(\hat {Y}_2\ne Y\) | C1 correct, C2 incorrect, \(n_{10}\) | Both incorrect, \(n_{00}\) |

Figure 12.1: Contingency table of two classifiers' correct/incorrect outcomes.
Denote the four cell entries (Fig. 12.1) as:
• \(n_{11}\) – both classifiers correct
• \(n_{10}\) – C1 correct, C2 incorrect
• \(n_{01}\) – C1 incorrect, C2 correct
• \(n_{00}\) – both classifiers incorrect
with \(n_{11}+n_{10}+n_{01}+n_{00} = M\).
The accuracy of each classifier in terms of the contingency table entries is
\(\seteqnumber{0}{}{0}\)\begin{equation} \text {Accuracy}_1 = \frac {n_{11}+n_{10}}{M}, \qquad \text {Accuracy}_2 = \frac {n_{11}+n_{01}}{M} \end{equation}
The diagonal cells (\(n_{11}\), \(n_{00}\)) represent samples on which both classifiers agree. The off-diagonal cells (\(n_{10}\), \(n_{01}\)) are the informative ones: they count samples where exactly one classifier succeeds and the other fails. If \(n_{10} \gg n_{01}\), classifier 1 is superior; if \(n_{01} \gg n_{10}\), classifier 2 is superior; if \(n_{10} \approx n_{01}\), they perform similarly despite possibly making errors on different samples.
Example 12.1: Two classifiers (Table 12.1) are evaluated on \(M=200\) test samples. Classifier 1 accuracy: \((150+25)/200 = 87.5\%\). Classifier 2 accuracy: \((150+15)/200 = 82.5\%\).

| | | Classifier 1 | Classifier 1 |
| | | Correct | Incorrect |
| Classifier 2 | Correct | 150 | 15 |
| | Incorrect | 25 | 10 |

Table 12.1: Contingency table for two classifiers on \(M=200\) samples.

The accuracies are relatively close (\(5\%\) difference). However, the contingency table reveals a clearer picture: classifier 1 uniquely gets \(n_{10}=25\) samples right that classifier 2 misses, while classifier 2 uniquely gets only \(n_{01}=15\). Classifier 1 is correct on \(25-15=10\) more “disputed” samples.
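In practice the four counts are obtained directly from the prediction vectors. Below is a minimal sketch, assuming NumPy arrays of hard predictions `yhat1`, `yhat2` and true labels `y`; the function name `contingency_counts` is illustrative, not a standard API:

```python
import numpy as np

def contingency_counts(y, yhat1, yhat2):
    """Return (n11, n10, n01, n00) for two classifiers on the same test set."""
    c1 = (yhat1 == y)             # classifier 1 correct indicator
    c2 = (yhat2 == y)             # classifier 2 correct indicator
    n11 = int(np.sum(c1 & c2))    # both correct
    n10 = int(np.sum(c1 & ~c2))   # C1 correct, C2 incorrect
    n01 = int(np.sum(~c1 & c2))   # C1 incorrect, C2 correct
    n00 = int(np.sum(~c1 & ~c2))  # both incorrect
    return n11, n10, n01, n00

# With the counts of Table 12.1 (150, 25, 15, 10) and M = 200:
# Accuracy_1 = (n11 + n10) / M = 0.875 and Accuracy_2 = (n11 + n01) / M = 0.825.
```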
12.1.2 Disagreement Measure
The disagreement measure is the ratio of instances on which the two classifiers produce different outcomes to the total number of instances:
\(\seteqnumber{0}{}{1}\)\begin{equation} \text {Dis} = \frac {n_{10} + n_{01}}{M} \end{equation}
Range: \(\text {Dis}\in [0,\,1]\). \(\text {Dis}=0\) indicates that the classifiers always produce the same outcome (zero diversity), while \(\text {Dis}=1\) indicates that they disagree on every sample (maximum diversity).
Example 12.1: Back to the previous example (Table 12.1), \(n_{10}=25\), \(n_{01}=15\), \(M=200\).
\(\seteqnumber{0}{}{2}\)\begin{equation} \text {Dis} = \frac {25 + 15}{200} = \frac {40}{200} = 0.2 \end{equation}
The classifiers disagree on \(20\%\) of the samples.
12.1.3 McNemar’s Test
The contingency table shows how two classifiers differ, but does not indicate whether the difference is due to chance. McNemar’s test addresses this by testing the null hypothesis:
\(\seteqnumber{0}{}{3}\)\begin{equation} \begin{aligned} H_0\colon n_{10} = n_{01}\\ H_1\colon n_{10} \ne n_{01} \end {aligned} \end{equation}
i.e., the two classifiers have the same error rate, and any observed difference is due to sampling variability.
The test statistic is
\(\seteqnumber{0}{}{4}\)\begin{equation} \chi ^2 = \frac {(n_{10} - n_{01})^2}{n_{10} + n_{01}} \end{equation}
The numerator measures the squared disagreement between the classifiers, and the denominator normalizes by the total number of disputed samples.
McNemar’s test only uses the off-diagonal cells (\(n_{10}\), \(n_{01}\)). Samples where both classifiers agree provide no information about which is better.
p-value interpretation The p-value is the probability of observing a test statistic at least as extreme as the one actually obtained, assuming \(H_0\) is true.\(^1\) The decision is made by comparing the p-value to a predetermined significance level \(\alpha \), which is the maximum acceptable probability of incorrectly rejecting \(H_0\) (i.e., concluding the classifiers differ when they actually do not). Common choices are \(\alpha = 0.05\) (5%) and \(\alpha = 0.01\) (1%).
• If \(p < \alpha \), we reject \(H_0\): the observed difference is statistically significant at level \(\alpha \).
• If \(p \ge \alpha \), we fail to reject \(H_0\): we cannot conclude that one classifier is better.
1 The calculation of the exact p-value for this test is beyond the scope of this chapter. It can be obtained numerically with standard statistical software; a sketch is given after the example below.
Example 12.1: Back to the previous example (Table 12.1), \(n_{10}=25\), \(n_{01}=15\).
\(\seteqnumber{0}{}{5}\)\begin{equation} \chi ^2 = \frac {(25-15)^2}{25+15} = \frac {100}{40} = 2.5 \end{equation}
The corresponding p-value is \(p \approx 0.114\). Since \(p = 0.114 > 0.05\), we fail to reject \(H_0\). Despite classifier 1 having higher accuracy, the difference is not statistically significant at the \(5\%\) level.
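The p-value referenced in footnote 1 follows from the fact that, under \(H_0\), the statistic is approximately \(\chi^2\)-distributed with one degree of freedom. A sketch of the numerical computation with SciPy, using the counts of Table 12.1:

```python
from scipy.stats import chi2

n10, n01 = 25, 15                        # disputed samples from Table 12.1
chi2_stat = (n10 - n01) ** 2 / (n10 + n01)
p_value = chi2.sf(chi2_stat, df=1)       # survival function of chi-square with 1 dof
print(chi2_stat, p_value)                # 2.5, ~0.114 -> fail to reject H0 at the 5% level
```

When \(n_{10}+n_{01}\) is small, an exact binomial version of the test is usually preferred; for example, statsmodels provides one via `statsmodels.stats.contingency_tables.mcnemar`.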
12.1.4 Cohen’s Kappa Coefficient
McNemar’s test determines whether two classifiers differ significantly, but does not measure how much they agree. Cohen’s kappa coefficient \(\kappa \) addresses this by comparing observed agreement with the agreement expected under independent (random) predictions.
Using the contingency table notation (Fig. 12.1), define the observed agreement as the fraction of samples on which both classifiers give the same result (both correct or both incorrect):
\(\seteqnumber{0}{}{6}\)\begin{equation} p_o = \frac {n_{11} + n_{00}}{M} \end{equation}
Recall that the accuracy of each classifier is
\(\seteqnumber{0}{}{7}\)\begin{equation} \text {Accuracy}_1 = \frac {n_{11}+n_{10}}{M}, \qquad \text {Accuracy}_2 = \frac {n_{11}+n_{01}}{M} \end{equation}
If the two classifiers were independent, the probability that both are correct on the same sample is \(\text {Accuracy}_1 \cdot \text {Accuracy}_2\), and the probability that both are incorrect is \((1-\text {Accuracy}_1)(1-\text {Accuracy}_2)\). The expected agreement by chance is the sum of these two cases:
\(\seteqnumber{0}{}{8}\)\begin{equation} p_e = \underbrace {\text {Accuracy}_1 \cdot \text {Accuracy}_2}_{\text {both correct}} + \underbrace {(1-\text {Accuracy}_1)(1-\text {Accuracy}_2)}_{\text {both incorrect}} \end{equation}
Cohen’s kappa is then
\(\seteqnumber{0}{}{9}\)\begin{equation} \kappa = \frac {p_o - p_e}{1 - p_e} \end{equation}
Range: \(\kappa = 1\) indicates perfect agreement (the classifiers always agree). \(\kappa = 0\) indicates agreement no better than chance. \(\kappa < 0\) indicates agreement worse than chance (systematic disagreement).
Example 12.1: Back to the previous example (Table 12.1), \(n_{11}=150\), \(n_{10}=25\), \(n_{01}=15\), \(n_{00}=10\), \(M=200\).
\(\seteqnumber{0}{}{10}\)\begin{equation} \begin{aligned} p_o &= \frac {150+10}{200} = 0.80 \\[3pt] \text {Accuracy}_1 &= \frac {150+25}{200} = 0.875, \quad \text {Accuracy}_2 = \frac {150+15}{200} = 0.825 \\[3pt] p_e &= 0.875 \times 0.825 + 0.125 \times 0.175 = 0.7219 + 0.0219 = 0.7438 \\[3pt] \kappa &= \frac {0.80 - 0.7438}{1 - 0.7438} = \frac {0.0562}{0.2562} \approx 0.219 \end {aligned} \end{equation}
The low \(\kappa \approx 0.22\) indicates only slight agreement beyond chance. This complements the McNemar test result: the classifiers do not differ significantly in overall accuracy, yet they agree only modestly on individual samples.
Kappa measures agreement, not individual accuracy. Two equally poor classifiers that make the same errors on the same samples will have high \(\kappa \) despite low accuracy.
12.1.5 Q-Statistic (Yule’s Q)
While Cohen’s \(\kappa \) measures the absolute level of agreement, the Q-statistic (Yule’s Q) measures the direction and strength of association between the classifiers’ outcomes. Using the contingency table notation (Fig. 12.1):
\(\seteqnumber{0}{}{11}\)\begin{equation} Q = \frac {n_{11}\, n_{00} - n_{10}\, n_{01}}{n_{11}\, n_{00} + n_{10}\, n_{01}} \end{equation}
Range: \(Q\in [-1,\,1]\).
• \(Q=1\): the classifiers are perfectly positively associated (whenever one is correct, so is the other).
• \(0<Q<1\): the classifiers are positively correlated, meaning they tend to recognize the same instances correctly and incorrectly.
• \(Q=0\): the classifiers are independent.
• \(-1<Q<0\): the classifiers are negatively correlated, meaning they tend to make errors on different instances.
• \(Q=-1\): perfect negative association (one is correct if and only if the other is incorrect).
Example 12.1: Back to the previous example (Table 12.1), \(n_{11}=150\), \(n_{10}=25\), \(n_{01}=15\), \(n_{00}=10\).
\(\seteqnumber{0}{}{12}\)\begin{equation} Q = \frac {150 \times 10 - 25 \times 15}{150 \times 10 + 25 \times 15} = \frac {1500 - 375}{1500 + 375} = \frac {1125}{1875} = 0.6 \end{equation}
The moderate positive \(Q=0.6\) indicates that the classifiers’ errors are positively correlated: they tend to succeed and fail on the same samples. Compare with \(\kappa \approx 0.22\), which reflects only slight agreement beyond chance. This illustrates the distinction: \(Q\) captures the correlation direction, while \(\kappa \) measures absolute agreement level.
Note that \(\kappa\) and \(Q\) always share the same sign (the numerator of \(\kappa\) satisfies \(p_o - p_e = 2(n_{11}n_{00}-n_{10}n_{01})/M^2\)), but their magnitudes can differ substantially, as in this example (\(Q=0.6\) versus \(\kappa\approx 0.22\)). Low values of both indicate classifiers whose errors are nearly independent, a desirable property for ensemble methods.
12.1.6 Summary
| Method | Question answered | Key quantity |
| Contingency table | Where do the classifiers disagree? | Off-diagonal cells \(n_{10}\), \(n_{01}\) |
| Disagreement | What fraction of samples do they disagree on? | \(\text {Dis} = \dfrac {n_{10}+n_{01}}{M}\) |
| McNemar’s test | Is the difference statistically significant? | \(\chi ^2 = \dfrac {(n_{10}-n_{01})^2}{n_{10}+n_{01}}\) |
| Cohen’s \(\kappa \) | How much do they agree beyond chance? | \(\kappa = \dfrac {p_o - p_e}{1 - p_e}\) |
| Yule’s \(Q\) | How correlated are their errors? | \(Q = \dfrac {n_{11}\, n_{00} - n_{10}\, n_{01}}{n_{11}\, n_{00} + n_{10}\, n_{01}}\) |
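All of these quantities follow from the four contingency counts. A small helper sketch that reproduces the numbers of Example 12.1 (the function name `pairwise_stats` is illustrative):

```python
def pairwise_stats(n11, n10, n01, n00):
    """Disagreement, McNemar statistic, Cohen's kappa and Yule's Q from the counts."""
    M = n11 + n10 + n01 + n00
    acc1, acc2 = (n11 + n10) / M, (n11 + n01) / M
    dis = (n10 + n01) / M                                    # disagreement measure
    chi2_stat = (n10 - n01) ** 2 / (n10 + n01)               # McNemar statistic
    p_o = (n11 + n00) / M                                    # observed agreement
    p_e = acc1 * acc2 + (1 - acc1) * (1 - acc2)              # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)                          # Cohen's kappa
    q = (n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01)    # Yule's Q
    return {"Dis": dis, "chi2": chi2_stat, "kappa": kappa, "Q": q}

print(pairwise_stats(150, 25, 15, 10))
# {'Dis': 0.2, 'chi2': 2.5, 'kappa': 0.219..., 'Q': 0.6}
```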
12.2 Comparison Between Two Probabilistic Classifiers
The non-probabilistic methods (Sec. 12.1) reduce each classifier’s output to a hard label \(\hat {y}\in \{0,1\}\), discarding the confidence information. When classifiers produce probabilistic outputs \(p_k = f_{\bth _k}(\bx ) = \Pr (\hat {y}=1\mid \bx )\), richer comparisons are possible.
12.2.1 Paired Scoring Rule Comparison
Given a proper scoring rule \(S\), such as the Brier score \(S_{\text {BS}}(p,y) = (p - y)^2\) (Sec. 10.5.3) or the CE loss \(S_{\text {LL}}(p,y) = -[y\log p + (1{-}y)\log (1{-}p)]\) (Sec. 8.4), denote the per-sample score \(S_{k,i} = S(p_{k,i},\, y_i)\) and compute the score difference:
\(\seteqnumber{0}{}{13}\)\begin{equation} \delta _i = S_{1,i} - S_{2,i} \end{equation}
where \(p_{k,i} = f_{\bth _k}(\bx _i)\) is the probability predicted by classifier \(k\) for sample \(i\). If \(\delta _i > 0\), classifier 2 is better on that sample; if \(\delta _i < 0\), classifier 1 is better.
The average score difference summarizes the overall comparison:
\(\seteqnumber{0}{}{14}\)\begin{equation} \bar {\delta } = \frac {1}{M}\sum _{i=1}^{M} \delta _i = \bar {S}_1 - \bar {S}_2 \end{equation}
where
\(\seteqnumber{0}{}{15}\)\begin{equation} \bar {S}_k = \frac {1}{M}\sum _{i=1}^{M} S_{k,i} \end{equation}
is the mean score of classifier \(k\). A negative \(\bar {\delta }\) favors classifier 1; a positive \(\bar {\delta }\) favors classifier 2.
Example 12.2: Two probabilistic classifiers are evaluated on \(M=6\) test samples using the Brier score \(S_{\text {BS}}(p,y) = (p-y)^2\):
| \(y_i\) | \(p_{1,i}\) | \(S_{\text {BS},1}\) | \(p_{2,i}\) | \(S_{\text {BS},2}\) | \(\delta _i\) |
| 1 | 0.90 | 0.010 | 0.75 | 0.063 | \(-0.053\) |
| 0 | 0.20 | 0.040 | 0.10 | 0.010 | \(+0.030\) |
| 1 | 0.70 | 0.090 | 0.85 | 0.023 | \(+0.068\) |
| 0 | 0.30 | 0.090 | 0.40 | 0.160 | \(-0.070\) |
| 1 | 0.60 | 0.160 | 0.80 | 0.040 | \(+0.120\) |
| 0 | 0.15 | 0.023 | 0.25 | 0.063 | \(-0.040\) |
| | | \(\bar {S}_1 = 0.069\) | | \(\bar {S}_2 = 0.060\) | \(\bar {\delta } = +0.009\) |

The small positive \(\bar {\delta } = 0.009\) slightly favors classifier 2, but the per-sample differences \(\delta _i\) alternate in sign, suggesting the advantage is not consistent. A statistical test is needed to determine whether this difference is significant.
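A short sketch reproducing this table with NumPy; the arrays `y`, `p1`, `p2` are taken directly from Example 12.2:

```python
import numpy as np

y  = np.array([1, 0, 1, 0, 1, 0])
p1 = np.array([0.90, 0.20, 0.70, 0.30, 0.60, 0.15])
p2 = np.array([0.75, 0.10, 0.85, 0.40, 0.80, 0.25])

s1 = (p1 - y) ** 2      # per-sample Brier scores of classifier 1
s2 = (p2 - y) ** 2      # per-sample Brier scores of classifier 2
delta = s1 - s2         # positive -> classifier 2 better on that sample

print(s1.mean(), s2.mean(), delta.mean())   # ~0.069, ~0.060, ~0.009
```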
12.2.2 Paired t-test on Score Differences
The paired scoring rule comparison provides a descriptive measure \(\bar {\delta }\), but does not indicate whether the difference could have arisen by chance. A paired statistical test addresses this by testing:
• \(H_0\): \(\E [\delta _i] = 0\) — the classifiers are equally good.
• \(H_1\): \(\E [\delta _i] \neq 0\) — one classifier is systematically better.
The paired t-test on \(\{\delta _1, \ldots , \delta _M\}\) yields a test statistic:
\(\seteqnumber{0}{}{16}\)\begin{equation} t = \frac {\bar {\delta }}{s_\delta / \sqrt {M}} \end{equation}
where \(s_\delta \) is the unbiased sample standard deviation:
\(\seteqnumber{0}{}{17}\)\begin{equation} s_\delta = \sqrt {\frac {1}{M-1}\sum _{i=1}^{M}(\delta _i - \bar {\delta })^2} \end{equation}
If \(p < \alpha \), reject \(H_0\): the scoring difference is statistically significant.
The paired t-test assumes that the \(\delta _i\) are approximately normally distributed. When this assumption is violated (e.g., small \(M\), heavy-tailed or skewed score distributions), the Wilcoxon signed-rank test (below) provides a non-parametric alternative.
Example 12.2: Back to the previous example (Table in Example 12.2), \(\bar {\delta } = 0.009\), \(s_\delta \approx 0.076\), \(M=6\).
\(\seteqnumber{0}{}{18}\)\begin{equation} t = \frac {0.009}{0.076/\sqrt {6}} \approx 0.30 \end{equation}
The corresponding p-value is \(p \approx 0.78\). Since \(p > 0.05\), we fail to reject \(H_0\). Despite the slight difference in mean Brier score, the two classifiers are not significantly different.
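A sketch of the test with SciPy on the score differences of Example 12.2 (equivalently, `scipy.stats.ttest_rel` applied to the two raw score vectors); small deviations from the hand-rounded figures above are due to rounding:

```python
import numpy as np
from scipy.stats import ttest_1samp

# Score differences delta_i = S_BS,1 - S_BS,2 from the table in Example 12.2
delta = np.array([-0.0525, 0.030, 0.0675, -0.070, 0.120, -0.040])

t_stat, p_value = ttest_1samp(delta, popmean=0.0)
print(t_stat, p_value)   # t ~ 0.30, p ~ 0.78 -> fail to reject H0 at the 5% level
```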
12.2.3 Wilcoxon Signed-Rank Test on Score Differences
The Wilcoxon signed-rank test uses the ranks of \(|\delta _i|\) rather than their magnitudes, making it robust to outliers and non-normal distributions. The procedure is:
1. Discard any samples where \(\delta _i = 0\) (ties). Let \(M'\) denote the remaining count.
2. Rank the absolute differences \(|\delta _1|, \ldots , |\delta _{M'}|\) from smallest to largest. Assign average ranks to tied values.
3. Compute the signed-rank sums:
\(\seteqnumber{0}{}{19}\)\begin{equation} W^+ = \sum _{\delta _i > 0} R_i, \qquad W^- = \sum _{\delta _i < 0} R_i \end{equation}
where \(R_i\) is the rank of \(|\delta _i|\). The test statistic is \(W = \min (W^+, W^-)\).
The hypotheses are:
• \(H_0\): the median of \(\delta _i\) is zero — the classifiers are equally good.
• \(H_1\): the median of \(\delta _i\) is not zero — one classifier is systematically better.
Under \(H_0\), positive and negative differences are equally likely at each rank, so \(W^+ \approx W^-\). A very small \(W\) indicates that one sign dominates the high ranks, i.e., one classifier is consistently better on the samples with the largest differences. If \(p < \alpha \), reject \(H_0\).
Example 12.2: Back to the previous example (Table in Example 12.2):
| \(\delta _i\) | \(|\delta _i|\) | Rank \(R_i\) | Sign |
| \(+0.030\) | 0.030 | 1 | \(+\) |
| \(-0.040\) | 0.040 | 2 | \(-\) |
| \(-0.053\) | 0.053 | 3 | \(-\) |
| \(+0.068\) | 0.068 | 4 | \(+\) |
| \(-0.070\) | 0.070 | 5 | \(-\) |
| \(+0.120\) | 0.120 | 6 | \(+\) |

\(W^+ = 1 + 4 + 6 = 11\), \(W^- = 2 + 3 + 5 = 10\), \(W = \min (11, 10) = 10\).
For \(M'=6\), the critical value at \(\alpha =0.05\) (two-sided) is \(W_{\text {crit}}=0\). Since \(W = 10 > 0\), we fail to reject \(H_0\), consistent with the paired t-test result. The positive and negative ranks are nearly balanced, confirming that neither classifier is systematically better.
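A sketch of the same test in SciPy, which returns the smaller signed-rank sum and, for small \(M'\), an exact p-value:

```python
import numpy as np
from scipy.stats import wilcoxon

delta = np.array([-0.0525, 0.030, 0.0675, -0.070, 0.120, -0.040])
w_stat, p_value = wilcoxon(delta)   # two-sided test, exact distribution for small samples
print(w_stat, p_value)              # W = 10, p close to 1 -> fail to reject H0
```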
12.2.4 DeLong’s Test for AUC Comparison
The paired scoring rule test compares calibration (how close probabilities are to the true labels). DeLong’s test addresses a complementary question: do the classifiers differ in their ability to rank positive samples above negative ones, as measured by the AUC?
Given two classifiers with \(\text {AUC}_1\) and \(\text {AUC}_2\) evaluated on the same test set, the hypotheses are:
• \(H_0\): \(\text {AUC}_1 = \text {AUC}_2\).
• \(H_1\): \(\text {AUC}_1 \neq \text {AUC}_2\).
The DeLong test statistic is:
\(\seteqnumber{0}{}{20}\)\begin{equation} z = \frac {\text {AUC}_1 - \text {AUC}_2}{\sqrt {\text {Var}(\text {AUC}_1 - \text {AUC}_2)}} \end{equation}
Under \(H_0\), \(z\) is approximately standard normal. The variance in the denominator accounts for the correlation between \(\text {AUC}_1\) and \(\text {AUC}_2\): since both are computed on the same test samples, their estimates are not independent.\(^2\) If \(p < \alpha \), the classifiers have significantly different discrimination.
2 The computation of DeLong’s variance is based on placement values and is outside the scope of this chapter. It is available numerically in standard implementations (e.g., the R package pROC or open-source Python packages); a sketch is given below.
DeLong’s test evaluates discrimination (ranking ability) while the paired scoring rule test evaluates calibration (probability accuracy). Two classifiers can have identical AUC but different Brier scores, or vice versa. Both tests should be considered for a complete comparison.
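For reference, a compact sketch of the computation referenced in footnote 2, based on placement values; the function `delong_test` and its interface are illustrative, not a library API:

```python
import numpy as np
from scipy.stats import norm

def delong_test(y, p1, p2):
    """Two-sided test of AUC_1 = AUC_2 for two classifiers scored on the same test set."""
    y, p1, p2 = map(np.asarray, (y, p1, p2))
    pos, neg = (y == 1), (y == 0)
    m, n = pos.sum(), neg.sum()

    def auc_and_placements(p):
        # psi(x_i, y_j) = 1 if x_i > y_j, 0.5 on ties, 0 otherwise
        psi = (p[pos][:, None] > p[neg][None, :]) + 0.5 * (p[pos][:, None] == p[neg][None, :])
        return psi.mean(), psi.mean(axis=1), psi.mean(axis=0)

    auc1, v10_1, v01_1 = auc_and_placements(p1)   # AUC and placement values, classifier 1
    auc2, v10_2, v01_2 = auc_and_placements(p2)   # AUC and placement values, classifier 2

    s10 = np.cov(np.vstack([v10_1, v10_2]))       # covariance over the positive samples
    s01 = np.cov(np.vstack([v01_1, v01_2]))       # covariance over the negative samples
    var = ((s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m
           + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n)

    z = (auc1 - auc2) / np.sqrt(var)
    return auc1, auc2, z, 2 * norm.sf(abs(z))     # AUCs, z statistic, two-sided p-value
```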
12.2.5 Calibration Curve Comparison
The Brier score provides a single scalar summary of calibration quality. A calibration curve (reliability diagram) reveals how the calibration breaks down by comparing predicted probabilities to observed frequencies across bins.
Partition the predicted probabilities into \(B\) equal-width bins \(I_1,\ldots ,I_B\). For each bin \(b\), compute:
• Average predicted probability: \(\;\bar {p}_b = \frac {1}{|I_b|}\sum _{i\in I_b} p_{k,i}\)
• Observed frequency: \(\;\bar {y}_b = \frac {1}{|I_b|}\sum _{i\in I_b} y_i\)
A perfectly calibrated classifier satisfies \(\bar {y}_b = \bar {p}_b\) for all bins, corresponding to the diagonal line. Deviations above the diagonal indicate underconfidence; deviations below indicate overconfidence.
The Expected Calibration Error (ECE) summarizes the calibration curve as a single scalar:
\(\seteqnumber{0}{}{21}\)\begin{equation} \text {ECE}_k = \sum _{b=1}^{B} \frac {|I_b|}{M} \left |\bar {y}_b - \bar {p}_b\right | \end{equation}
A lower ECE indicates better calibration. Comparing \(\text {ECE}_1\) and \(\text {ECE}_2\) complements the Brier score by isolating the calibration component from the refinement (sharpness) component.
The Brier score decomposes into calibration and refinement terms. Two classifiers with similar Brier scores can have very different calibration curves: one may be well-calibrated but unsharp, the other miscalibrated but sharp. The calibration curve reveals this distinction, which the Brier score alone cannot.
Example 12.2: Back to the previous example (Table in Example 12.2) with \(B=2\) bins: \(I_1=[0,\,0.5)\) and \(I_2=[0.5,\,1]\).
| Bin | \(\bar {p}_{b,1}\) | \(\bar {y}_{b,1}\) | \(\bar {p}_{b,2}\) | \(\bar {y}_{b,2}\) |
| \([0,\,0.5)\) | 0.217 | 0 | 0.250 | 0 |
| \([0.5,\,1]\) | 0.733 | 1 | 0.800 | 1 |

\(\text {ECE}_1 = \tfrac {3}{6}|0 - 0.217| + \tfrac {3}{6}|1 - 0.733| = 0.108 + 0.133 = 0.242\).

\(\text {ECE}_2 = \tfrac {3}{6}|0 - 0.250| + \tfrac {3}{6}|1 - 0.800| = 0.125 + 0.100 = 0.225\).

Classifier 2 has slightly better calibration. Both classifiers are underconfident in the upper bin (\(\bar {p}_b < \bar {y}_b = 1\)) and overconfident in the lower bin (\(\bar {p}_b > \bar {y}_b = 0\)).
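A sketch of the ECE computation with equal-width bins, applied to the probabilities of Example 12.2; the helper `ece` is illustrative:

```python
import numpy as np

y  = np.array([1, 0, 1, 0, 1, 0])
p1 = np.array([0.90, 0.20, 0.70, 0.30, 0.60, 0.15])
p2 = np.array([0.75, 0.10, 0.85, 0.40, 0.80, 0.25])

def ece(p, y, n_bins=2):
    """Expected Calibration Error with equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = (p < hi) if hi < 1.0 else (p <= hi)          # last bin closed on the right
        in_bin = (p >= lo) & upper
        if in_bin.any():
            gap = abs(y[in_bin].mean() - p[in_bin].mean())   # |obs. frequency - avg. probability|
            total += in_bin.mean() * gap                     # weight by bin size |I_b| / M
    return total

print(ece(p1, y), ece(p2, y))   # ~0.242 and ~0.225, matching the hand computation
```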
12.2.6 Error Correlation
Yule’s Q (Sec. 12.1) measures the association between two classifiers’ binary correct/incorrect outcomes. For probabilistic classifiers, we can measure the same concept on the continuous per-sample scores \(S_{k,i}\): when classifier 1 has a large error, does classifier 2 also have a large error?
The error correlation is the Pearson correlation coefficient between the per-sample scores:
\(\seteqnumber{0}{}{22}\)\begin{equation} r = \frac {\sum _{i=1}^{M}(S_{1,i} - \bar {S}_1)(S_{2,i} - \bar {S}_2)}{\sqrt {\sum _{i=1}^{M}(S_{1,i} - \bar {S}_1)^2}\;\sqrt {\sum _{i=1}^{M}(S_{2,i} - \bar {S}_2)^2}} \end{equation}
Range: \(r \in [-1, 1]\).
• \(r \approx 1\): errors are positively correlated — both classifiers fail on the same samples. Combining them provides little benefit.
• \(r \approx 0\): errors are uncorrelated — ensembles can potentially average out errors effectively.
• \(r < 0\): errors are negatively correlated — the classifiers fail on different samples, which gives high potential for ensembles.
Example 12.2: Back to the previous example (Table in Example 12.2), the per-sample Brier scores are:
| \(i\) | \(S_{\text {BS},1}\) | \(S_{\text {BS},2}\) |
| 1 | 0.010 | 0.063 |
| 2 | 0.040 | 0.010 |
| 3 | 0.090 | 0.023 |
| 4 | 0.090 | 0.160 |
| 5 | 0.160 | 0.040 |
| 6 | 0.023 | 0.063 |

Computing the Pearson correlation gives \(r \approx +0.05\). The correlation is essentially zero: a large error by one classifier carries little information about the other's error on the same sample. This suggests moderate potential for improvement via ensemble combination (Ch. 13.3).
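A sketch of the computation on the per-sample Brier scores of Example 12.2 (`numpy.corrcoef` implements the Pearson formula above):

```python
import numpy as np

# Per-sample Brier scores of the two classifiers (Example 12.2)
s1 = np.array([0.010, 0.040, 0.090, 0.090, 0.160, 0.0225])
s2 = np.array([0.0625, 0.010, 0.0225, 0.160, 0.040, 0.0625])

r = np.corrcoef(s1, s2)[0, 1]
print(r)    # ~ +0.05: the per-sample errors are essentially uncorrelated
```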
12.2.7 Spearman Rank Correlation of Scores
The Pearson error correlation (Sec. 12.2.6) measures the linear association between \(S_{1,i}\) and \(S_{2,i}\). When the relationship is monotonic but non-linear, or when scores contain outliers, the Spearman rank correlation provides a more robust measure.
Let \(R_{k,i}\) denote the rank of \(S_{k,i}\) among \(\{S_{k,1},\ldots ,S_{k,M}\}\) (with average ranks for ties). The Spearman rank correlation (Sec. 2.5) is the Pearson correlation computed on the ranks:
\(\seteqnumber{0}{}{23}\)\begin{equation} r_s = \frac {\sum _{i=1}^{M}(R_{1,i} - \bar {R})(R_{2,i} - \bar {R})}{\sqrt {\sum _{i=1}^{M}(R_{1,i} - \bar {R})^2}\;\sqrt {\sum _{i=1}^{M}(R_{2,i} - \bar {R})^2}} \end{equation}
where \(\bar {R} = (M+1)/2\) is the mean rank.
Range: \(r_s \in [-1, 1]\). The interpretation mirrors the Pearson error correlation:
• \(r_s \approx 1\): classifiers find the same samples difficult — limited ensemble benefit.
• \(r_s \approx 0\): difficulty rankings are unrelated — good ensemble diversity.
• \(r_s < 0\): when one classifier struggles, the other excels — high ensemble potential.
The Spearman rank correlation relates to the Pearson error correlation as the Wilcoxon test relates to the paired t-test: a non-parametric alternative that is robust to outliers and non-linear score distributions.
Example 12.2: Back to the previous example (Table in Example 12.2). Ranking each classifier’s Brier scores (average ranks for ties):
| \(i\) | \(S_{\text {BS},1}\) | \(R_{1,i}\) | \(S_{\text {BS},2}\) | \(R_{2,i}\) |
| 1 | 0.010 | 1 | 0.063 | 4.5 |
| 2 | 0.040 | 3 | 0.010 | 1 |
| 3 | 0.090 | 4.5 | 0.023 | 2 |
| 4 | 0.090 | 4.5 | 0.160 | 6 |
| 5 | 0.160 | 6 | 0.040 | 3 |
| 6 | 0.023 | 2 | 0.063 | 4.5 |

Computing \(r_s \approx -0.18\), close to zero like the Pearson result; the rank-based estimate is slightly more negative because it is less influenced by the few samples with the largest score magnitudes. The classifiers do not rank sample difficulty similarly, confirming modest complementarity.
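A sketch of the same computation with SciPy, which assigns average ranks to ties automatically:

```python
import numpy as np
from scipy.stats import spearmanr

s1 = np.array([0.010, 0.040, 0.090, 0.090, 0.160, 0.0225])
s2 = np.array([0.0625, 0.010, 0.0225, 0.160, 0.040, 0.0625])

r_s, p_value = spearmanr(s1, s2)   # Pearson correlation on the (tie-averaged) ranks
print(r_s)                         # ~ -0.18, matching the hand-ranked computation above
```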
12.2.8 Summary
| Method | Question answered |
| Paired score difference | Which classifier has better calibrated probabilities? |
| Paired t-test | Is the score difference statistically significant (normal \(\delta _i\))? |
| Wilcoxon signed-rank | Is the score difference significant (non-parametric)? |
| DeLong’s test | Do the classifiers differ in discrimination (AUC)? |
| Calibration curve / ECE | Where in the probability range does calibration break down? |
| Error correlation | How correlated are the classifiers’ errors (linear)? |
| Spearman rank correlation | How correlated are the classifiers’ errors (non-parametric)? |