Machine Learning & Signals Learning
13 Combining Classifiers
When multiple data sources are available, there are several strategies for incorporating them into a classification pipeline. Figure 13.1 illustrates five common approaches.
-
• Single source (a): a single dataset is processed through one feature-extraction and classification pipeline.
-
• Data concatenation (b): raw data from multiple sources are concatenated before feature extraction, producing a single feature vector per sample.
-
• Feature concatenation (c): each data source undergoes independent feature extraction; the resulting feature vectors are concatenated before classification.
-
• Classifier fusion (d): each data source is processed by an independent feature-extraction and classification pipeline; the individual predictions are combined by a fusion rule (Sec. 13.3).
-
• Classifiers ensemble (e): a single data source undergoes feature extraction, then multiple different classifiers are applied in parallel; their predictions are combined by an ensemble rule (e.g., majority voting or averaging) (Sec. 13.3).
Approaches (b) and (c) are collectively termed early fusion because they merge information before the classifier; approaches (d) and (e) are late fusion because independent classifiers produce separate outputs that are combined afterward.
13.1 Data Concatenation
Given \(K\) data sources producing raw vectors \(\bd _1, \bd _2, \ldots , \bd _K\) for the same sample, form the concatenated input:
\(\seteqnumber{0}{}{0}\)\begin{equation} \bd = [\bd _1;\, \bd _2;\, \ldots ;\, \bd _K] \end{equation}
A single feature-extraction pipeline then maps \(\bd \) to a feature vector \(\bx \), which is passed to the classifier (Fig. 13.1b).
-
• When combining multimodal data (e.g., EEG and EMG, or images from different scanners), each source may have its own acquisition characteristics, noise profile, and feature distribution.
-
• Before concatenation verify that the sources share a compatible representation (e.g., same sampling rate, resolution, or coordinate system).
-
• The concatenated vector may be very high-dimensional, increasing the risk of overfitting when \(M\) is small relative to \(N\).
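As a minimal sketch of data concatenation, the raw vectors from each source are stacked into one input before a single feature-extraction step. The source names and sizes below are illustrative assumptions, not from the text:

```python
import numpy as np

# Data concatenation (Fig. 13.1b): stack raw vectors from K sources into
# one input, then apply a single feature-extraction pipeline.
rng = np.random.default_rng(0)
d_eeg = rng.normal(size=256)   # hypothetical raw segment from source 1
d_emg = rng.normal(size=128)   # hypothetical raw segment from source 2

# The concatenated raw vector d = [d_1; d_2]; note the dimension grows
# with the number of sources, which can aggravate overfitting.
d = np.concatenate([d_eeg, d_emg])
assert d.shape == (384,)
```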
13.2 Feature Concatenation
Each source \(k\) is processed by its own feature-extraction pipeline, yielding \(\bx _k \in \mathbb {R}^{N_k}\). The concatenated feature vector is
\(\seteqnumber{0}{}{1}\)\begin{equation} \bx = [\bx _1;\, \bx _2;\, \ldots ;\, \bx _K] \in \mathbb {R}^{N}, \quad N = \sum _{k=1}^{K} N_k \end{equation}
A single classifier operates on \(\bx \) (Fig. 13.1c).
Feature concatenation allows each source to use a specialized extraction pipeline suited to its modality. However, the concatenated dimension \(N\) grows with \(K\), increasing the risk of overfitting when \(M\) is small relative to \(N\).
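A minimal sketch of feature concatenation, with two hypothetical per-source extractors (the extractor names and feature choices are illustrative assumptions):

```python
import numpy as np

# Feature concatenation (Fig. 13.1c): each source has its own extraction
# pipeline; the resulting feature vectors are stacked before one classifier.
def variance_features(signal):
    # Hypothetical extractor for source 1: two summary statistics.
    return np.array([signal.var(), np.abs(signal).mean()])

def histogram_features(signal, bins=4):
    # Hypothetical extractor for source 2: a normalized histogram.
    counts, _ = np.histogram(signal, bins=bins)
    return counts / counts.sum()

rng = np.random.default_rng(1)
x1 = variance_features(rng.normal(size=256))     # x_1 in R^2
x2 = histogram_features(rng.normal(size=128))    # x_2 in R^4

x = np.concatenate([x1, x2])                     # N = N_1 + N_2 = 6
assert x.shape == (6,)
```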
13.3 Ensemble of Classifiers
Given \(K\) classifiers, each producing a prediction for the same input \(\bx \), fusion methods aggregate these predictions into a single output. The methods in this section apply to two settings (Fig. 13.1d,e):
-
• Classifiers ensemble: \(K\) different classifiers operate on the same data source. The ensemble benefits from diversity among classifiers; if they make similar errors, combining them provides little gain. Classifier comparison methods (Sec. 12.1 and 12.2.6) can quantify this diversity by measuring:
-
– Yule’s Q (Sec. 12.1.5): measures the association between binary correct/incorrect outcomes. \(Q \approx 0\) indicates independent errors, while \(Q < 0\) suggests complementary performance.
-
– Error correlation (Sec. 12.2.6 and 12.2.7): the Pearson/Spearman correlation between per-sample probabilistic scores. \(r < 0\) or \(r_s < 0\) indicates that the classifiers’ errors are negatively correlated, suggesting high ensemble potential.
-
-
• Classifier fusion: a single classifier is applied to \(K\) different data sources (e.g., multimodal sensors). When the sources originate from different domains, a domain consistency check (Sec. 11.3.2) should verify distributional compatibility before fusion. Classifier fusion methods are a subset of the ensemble methods; applicability is summarized in Sec. 13.3.8.
13.3.1 Majority Vote
Each classifier \(k\) produces a binary prediction \(\hat {y}_k\in \{0,1\}\). The fused prediction is
\(\seteqnumber{0}{}{2}\)\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & \displaystyle \sum _{k=1}^{K} \hat {y}_k > K/2 \\[6pt] 0 & \text {otherwise} \end {cases} \end{equation}
Choosing odd \(K\) avoids ties.
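A minimal sketch of the majority-vote rule above:

```python
import numpy as np

# Majority vote over K hard predictions y_hat_k in {0, 1}:
# predict 1 iff more than K/2 classifiers vote for class 1.
def majority_vote(y_hat):
    y_hat = np.asarray(y_hat)
    return int(y_hat.sum() > len(y_hat) / 2)

y_a = majority_vote([1, 0, 0])   # one vote for class 1 out of three
y_b = majority_vote([1, 1, 0])   # two votes for class 1 out of three
```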
13.3.2 Soft Vote
Each probabilistic classifier \(k\) outputs \(p_k = f_k(\bx ) = \Pr (\hat {y}=1\mid \bx )\). The soft vote averages these probabilities:
\(\seteqnumber{0}{}{3}\)\begin{equation} p_{\text {soft}} = \frac {1}{K}\sum _{k=1}^{K} p_k \end{equation}
The fused decision is
\(\seteqnumber{0}{}{4}\)\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & p_{\text {soft}} \ge 0.5\\ 0 & p_{\text {soft}} < 0.5 \end {cases} \end{equation}
-
Example 13.1: Three classifiers produce probabilities \(p_1 = 0.9\), \(p_2 = 0.4\), \(p_3 = 0.45\) for a given sample.
-
• Majority vote: \(\hat {y}_1 = 1\), \(\hat {y}_2 = 0\), \(\hat {y}_3 = 0\). Fused prediction: \(\hat {y}_{\text {fused}} = 0\) (two out of three vote for class 0).
-
• Soft vote: \(p_{\text {soft}} = (0.9 + 0.4 + 0.45)/3 = 0.583\). Fused prediction: \(\hat {y}_{\text {fused}} = 1\) (average probability exceeds \(0.5\)).
The high confidence of classifier 1 tips the soft vote toward class 1, whereas majority vote ignores this confidence.
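Example 13.1 can be checked with a minimal sketch of both rules:

```python
import numpy as np

# Soft vote: average the probabilities, threshold at 0.5.
def soft_vote(p):
    p_soft = float(np.mean(p))
    return int(p_soft >= 0.5), p_soft

# Majority vote on the thresholded (hard) predictions, for comparison.
def majority_vote(p):
    votes = [int(pk >= 0.5) for pk in p]
    return int(sum(votes) > len(votes) / 2)

p = [0.9, 0.4, 0.45]          # probabilities from Example 13.1
y_soft, p_soft = soft_vote(p)  # p_soft = 0.583 -> class 1
y_major = majority_vote(p)     # votes (1, 0, 0) -> class 0
```

The two rules disagree on this sample: the soft vote keeps classifier 1's high confidence, while the majority vote discards it at thresholding.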
-
For the special case of \(K=2\), if the classifiers disagree (\(\hat {y}_1 \neq \hat {y}_2\)), the fused prediction follows the more confident classifier:
\(\seteqnumber{0}{}{5}\)\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} \hat {y}_1 & |p_1 - 0.5| > |p_2 - 0.5| \\ \hat {y}_2 & |p_2 - 0.5| > |p_1 - 0.5| \end {cases} \end{equation}
If one classifier is very confident (e.g., \(p_k = 0.95\)) while the other is weakly opposed (e.g., \(p_j = 0.45\)), the confident vote contributes more to the average than a single hard vote would.
13.3.3 Linearly Weighted Combining
Each probabilistic classifier \(k\) outputs \(p_k = f_{k}(\bx )\). Assign non-negative weights \(w_k\ge 0\) with \(\sum _{k=1}^{K}w_k = 1\). The fused probability is
\(\seteqnumber{0}{}{6}\)\begin{equation} p_{\text {fused}} = \sum _{k=1}^{K} w_k \, p_k \end{equation}
The fused decision is
\(\seteqnumber{0}{}{7}\)\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & p_{\text {fused}} \ge 0.5\\ 0 & p_{\text {fused}} < 0.5 \end {cases} \end{equation}
A common approach is to normalize validation accuracies [19]:
\(\seteqnumber{0}{}{8}\)\begin{equation} w_k = \frac {a_k}{\sum _{j=1}^{K} a_j} \end{equation}
where \(a_k\) is the validation accuracy (or other performance metric, e.g., \(F_1\)) of classifier \(k\). Equal weights \(w_k = 1/K\) recover simple averaging.
Alternatively, weights can be learned by minimizing the Brier score or cross-entropy loss on a validation set, treating \(\{w_k\}\) as optimization variables subject to \(w_k \ge 0\) and \(\sum _k w_k = 1\). However, this approach incurs additional computational cost.
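A minimal sketch of accuracy-weighted combining; the probabilities and validation accuracies below are illustrative assumptions:

```python
import numpy as np

# Linearly weighted combining: weights are normalized validation accuracies,
# so w_k >= 0 and sum_k w_k = 1 by construction.
def weighted_fusion(p, acc):
    w = np.asarray(acc, dtype=float) / np.sum(acc)
    p_fused = float(np.dot(w, p))
    return int(p_fused >= 0.5), p_fused, w

p = [0.9, 0.4, 0.45]       # hypothetical classifier outputs
acc = [0.85, 0.70, 0.65]   # hypothetical validation accuracies
y, p_fused, w = weighted_fusion(p, acc)
# The most accurate classifier gets the largest weight, pulling the
# fused probability above 0.5.
```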
13.3.4 Log-Likelihood Ratio (LLR)
-
Goal: Combine probabilistic classifier outputs using log-odds (Sec. 8.6).
Each probabilistic classifier \(k\) outputs \(p_k = f_{k}(\bx ) = \Pr (\hat {y}=1\mid \bx )\). Recall that the logit (log-odds) is given by
\(\seteqnumber{0}{}{9}\)\begin{equation} \text {logit}_k = \ln \frac {p_k}{1 - p_k} \end{equation}
Under the assumption that the \(K\) classifiers (or classifications) are conditionally independent given the true label, the fused log-odds is the sum of individual logits:
\(\seteqnumber{0}{}{10}\)\begin{equation} \text {logit}_{\text {fused}} = \sum _{k=1}^{K}\text {logit}_k = \sum _{k=1}^{K} \ln \frac {p_k}{1 - p_k} = \ln \prod _{k=1}^{K}\frac {p_k}{1 - p_k} \end{equation}
The fused probability is recovered by applying the sigmoid: \(p_{\text {fused}} = \sigma (\text {logit}_{\text {fused}})\).
When the classes are imbalanced, the log-prior odds should not be dropped. The general fusion formula is
\(\seteqnumber{0}{}{11}\)\begin{equation} \text {logit}_{\text {fused}} = \ln \frac {M_1}{M_0} + \sum _{k=1}^{K}\text {logit}_k \end{equation}
where \(M_0\) and \(M_1\) are the class counts. The log-prior term acts as a constant bias that shifts the decision boundary toward the minority class. When the classes are balanced (\(M_0 = M_1\)), the bias vanishes and the standard formula is recovered.
The fused decision is
\(\seteqnumber{0}{}{12}\)\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & \text {logit}_{\text {fused}} \ge 0 \quad (p_{\text {fused}} \ge 0.5)\\ 0 & \text {logit}_{\text {fused}} < 0 \end {cases} \end{equation}
With imbalanced priors, the log-prior bias in \(\text {logit}_{\text {fused}}\) effectively shifts the decision boundary away from \(0.5\) toward the minority class.
LLR fusion requires probabilistic classifiers and assumes independence between classifier errors. When the independence assumption is violated, the fused score may be overconfident.
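A minimal sketch of LLR fusion, including the optional log-prior term; the probabilities are illustrative:

```python
import numpy as np

# LLR fusion: sum the logits, add the log-prior odds ln(M1/M0), then map
# back through the sigmoid. Equal class counts recover the standard formula.
def llr_fusion(p, m1=1, m0=1):
    p = np.asarray(p, dtype=float)
    logits = np.log(p / (1 - p))
    logit_fused = np.log(m1 / m0) + logits.sum()
    p_fused = 1.0 / (1.0 + np.exp(-logit_fused))
    return int(logit_fused >= 0), float(p_fused)

# Two independent, mildly confident classifiers reinforce each other:
y, p_fused = llr_fusion([0.7, 0.7])
# Odds multiply: (0.7/0.3)^2 = 49/9, so p_fused = 49/58, about 0.845 --
# more confident than either classifier alone.
```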
-
Proof. Why does conditional independence lead to summing log-odds? By Bayes’ rule in odds form, the posterior odds given evidence \(\bx _k\) from classifier \(k\) are
\(\seteqnumber{0}{}{13}\)\begin{equation} \frac {\Pr (y{=}1\mid \bx _k)}{\Pr (y{=}0\mid \bx _k)} = \frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)} \cdot \frac {\Pr (y{=}1)}{\Pr (y{=}0)} \end{equation}
where \(\frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)}\) is \(\text {likelihood ratio}_k\). When the \(K\) classifiers are conditionally independent given the true label, the joint likelihood ratio factorizes:
\(\seteqnumber{0}{}{14}\)\begin{equation} \frac {\Pr (\bx _1,\ldots ,\bx _K\mid y{=}1)}{\Pr (\bx _1,\ldots ,\bx _K\mid y{=}0)} = \prod _{k=1}^{K}\frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)} = \prod _{k=1}^{K} \text {likelihood ratio}_k \end{equation}
Multiplying by the prior odds and taking the logarithm converts the product into a sum. Absorbing the log-prior into a constant (or assuming equal priors), the fused log-odds reduces to the sum of individual LLRs.
13.3.5 Learned Combiner (Stacking)
The previous fusion methods use fixed rules. When the classifiers’ outputs are probabilistic, a stronger ensemble can be obtained by fitting a logistic regression model on the classifiers’ log-odds:
\(\seteqnumber{0}{}{15}\)\begin{equation} \text {logit}(p_{\text {fused}}) = \beta _0 + \sum _{k=1}^{K} \beta _k\, \text {logit}_k \end{equation}
where \(\beta _0\) is a bias term and \(\beta _k\) are learned coefficients. The parameters \(\{\beta _0, \beta _1, \ldots , \beta _K\}\) are fit on a validation set, and the stacked model is evaluated on a separate test set to avoid overfitting.
The fused probability is recovered via the sigmoid: \(p_{\text {fused}} = \sigma \!\left (\beta _0 + \sum _{k} \beta _k\, \text {logit}_k\right )\).
When \(\beta _0 = 0\) and \(\beta _k = 1\) for all \(k\), the learned combiner reduces to LLR fusion. The learned coefficients allow the model to down-weight redundant or poorly calibrated classifiers.
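A minimal sketch of the learned combiner, fitting the logistic regression on logits by plain gradient descent on a synthetic validation set (the data generation and optimizer settings are illustrative assumptions; in practice the fit uses a held-out validation set and is evaluated on a separate test set):

```python
import numpy as np

# Stacking: logistic regression on the base classifiers' log-odds.
rng = np.random.default_rng(0)
M, K = 400, 3
y = rng.integers(0, 2, size=M)

# Synthetic base-classifier probabilities: noisy views of the true label.
p_base = np.clip(0.5 + 0.3 * (2 * y[:, None] - 1)
                 + 0.15 * rng.normal(size=(M, K)), 0.01, 0.99)
Z = np.log(p_base / (1 - p_base))        # logit features, one column per k

X = np.hstack([np.ones((M, 1)), Z])      # prepend bias column for beta_0
beta = np.zeros(K + 1)                   # [beta_0, beta_1, ..., beta_K]
for _ in range(2000):                    # gradient descent on mean BCE loss
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta -= 0.01 * X.T @ (p - y) / M

p_fused = 1.0 / (1.0 + np.exp(-X @ beta))
acc = float(np.mean((p_fused >= 0.5) == y))
```

Setting `beta = np.array([0.0, 1.0, 1.0, 1.0])` instead of fitting would reproduce LLR fusion; the learned coefficients deviate from 1 when a base classifier is redundant or poorly calibrated.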
13.3.6 General Non-Linear Combining
The stacking approach can be generalized by replacing the linear logistic regression model with any non-linear classifier (e.g., a neural network, random forest, or gradient boosting machine). The fused prediction is formulated as:
\(\seteqnumber{0}{}{16}\)\begin{equation} p_{\text {fused}} = g_{\bphi }(p_1, \ldots , p_K) \end{equation}
where \(g_{\bphi }\) is the non-linear combiner parameterized by \(\bphi \). As with the linear learned combiner, the parameters \(\bphi \) are fit on a validation set, and the stacked model is evaluated on a separate test set.
While non-linear combiners can capture complex dependencies between base classifiers, they are more prone to overfitting than linear combiners, especially if the validation set is small.
13.3.7 Held-Out Average Log-Likelihood Gain (*)
The ensemble gain can be quantified using the log-likelihood. Recall (Sec. 8.4) that the per-sample log-likelihood for a probabilistic classifier with output \(p_i = f_\bw (\bx _i)\) is
\(\seteqnumber{0}{}{17}\)\begin{equation} \log p(y_i \mid \bx _i) = y_i \log p_i + (1-y_i)\log (1-p_i) \end{equation}
The held-out average log-likelihood gain of the ensemble over the best base model is
\(\seteqnumber{0}{}{18}\)\begin{equation} \Delta \overline {\ell } = \frac {1}{M_{\text {test}}} \sum _{i=1}^{M_{\text {test}}} \left [\log p_{\text {fused}}(y_i \mid \bx _i) - \log p_{\text {best}}(y_i \mid \bx _i)\right ] \end{equation}
where \(p_{\text {fused}}\) and \(p_{\text {best}}\) are the predicted probabilities of the stacked model and the best individual classifier, respectively, evaluated on the test set.
Positive \(\Delta \overline {\ell }\) indicates that the ensemble provides better calibrated probabilities than the best base model. Equivalently, \(\Delta \overline {\ell }\) equals the reduction in BCE loss: \(\Delta \overline {\ell } = \loss _{\text {best}} - \loss _{\text {fused}}\).
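A minimal sketch of computing \(\Delta \overline {\ell }\); the test-set labels and probabilities below are illustrative, not from the text:

```python
import numpy as np

# Per-sample average log-likelihood of probabilistic predictions p under
# labels y (the negative of the mean BCE loss).
def avg_log_likelihood(p, y):
    p = np.clip(p, 1e-12, 1 - 1e-12)   # guard against log(0)
    return float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1, 0, 1, 1, 0])
p_best = np.array([0.80, 0.30, 0.60, 0.70, 0.40])   # best base model
p_fused = np.array([0.85, 0.20, 0.70, 0.80, 0.30])  # stacked model

delta = avg_log_likelihood(p_fused, y) - avg_log_likelihood(p_best, y)
# delta > 0 here: the fused model assigns higher probability to every
# true label, i.e., its BCE loss is lower.
```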
13.3.8 Summary
The methods above apply both to a classifiers ensemble of \(K\) classifiers and to classifier fusion of inferences from \(K\) different data sources (or noisy measurements \(\bx _1, \ldots , \bx _K\) of the same quantity). In the classifier-fusion setting, the independence assumption required for LLR fusion corresponds to independent measurement noise, and methods that assign different learned parameters per classifier (Linearly Weighted Combining, Learned Combiner) are not applicable, since there is only one classifier. The method summary is presented in Table 13.1.
Cross-validation of ensembles When evaluating an ensemble with cross-validation, the procedure depends on whether the fusion rule has learned parameters:
-
• Fixed-rule methods (Majority vote, Soft vote, LLR fusion): no parameters are learned from data. A standard outer CV loop suffices. In each fold, train the \(K\) base classifiers on the training partition, apply the fixed fusion rule, and evaluate on the held-out test partition.
-
• Learned-parameter methods (Weighted combining, Learned combiner): the combiner weights (\(w_k\) or \(\beta _k\)) must be fitted on data that is separate from both the base-classifier training data and the test data. Within each outer CV fold, further split the training partition into a base-training set (to train the \(K\) classifiers) and a validation set (to fit the combiner weights). The fused model is then evaluated on the outer test fold (Sec. 4.4).
Fitting combiner weights on the same data used to train the base classifiers is a form of data leakage (Sec. 11.3.1): the base classifiers’ predictions on their own training samples are overconfident, leading to inflated ensemble performance estimates.
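The three-way split inside one outer CV fold can be sketched with index sets (the sizes are illustrative assumptions):

```python
import numpy as np

# One outer CV fold for a learned-parameter ensemble: the combiner weights
# must be fit on data disjoint from both base training and the test fold.
rng = np.random.default_rng(0)
idx = rng.permutation(100)             # 100 sample indices, shuffled

test = idx[:20]                        # outer test fold (final evaluation)
train = idx[20:]                       # outer training partition
base_train = train[:60]                # train the K base classifiers here
val = train[60:]                       # fit combiner weights here only

# Any overlap between the three parts is data leakage (Sec. 11.3.1).
assert not set(test) & set(base_train)
assert not set(test) & set(val)
assert not set(base_train) & set(val)
```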
| Method | Input type | Key idea | Classifier fusion | Validation set |
| Majority vote | Hard (\(\hat {y}_k\in \{0,1\}\)) | Most common class | Yes | No |
| Soft vote | Probabilistic (\(p_k\)) | Average of probabilities | Yes | No |
| Weighted combining | Probabilistic (\(p_k\)) | Weighted average of probabilities | No | Yes |
| LLR fusion | Probabilistic (\(p_k\)) | Sum of log-odds (assumes independence) | Yes | No |
| Learned combiner | Probabilistic (\(p_k\)) | Logistic regression on log-odds | No | Yes |
| Non-linear combiner | Any | Non-linear model on outputs | No | Yes |
13.4 Sequential Combining
The ensemble methods in the previous section assume that all \(K\) classifier outputs (or measurements) are available simultaneously. In many practical settings, however, measurements arrive one at a time (streaming data), and a decision can, or must, be updated after each new observation.
13.4.1 Sequential Log-Odds Update
Consider a single classifier applied to a sequence of \(K\) independent noisy measurements \(\bx _1, \bx _2, \ldots , \bx _K\) of the same underlying quantity. After measurement \(\bx _t\), the classifier outputs the probability \(p_t = f_\bw (\bx _t)\).
In the batch LLR fusion, all measurements are combined at once:
\(\seteqnumber{0}{}{19}\)\begin{equation} \text {logit}_{\text {fused}} = \sum _{k=1}^{K} \ln \frac {p_k}{1 - p_k} \end{equation}
The same result can be obtained sequentially. Define the accumulated log-odds after \(t\) measurements as
\(\seteqnumber{0}{}{20}\)\begin{equation} L_t = L_{t-1} + \underbrace {\ln \frac {p_t}{1-p_t}}_{\text {logit}_t}, \quad L_0 = 0 \end{equation}
After each update, the fused probability is
\(\seteqnumber{0}{}{21}\)\begin{equation} p_{\text {fused},t} = \sigma (L_t) = \frac {1}{1 + e^{-L_t}} \end{equation}
The sequential formulation produces identical results to the batch LLR fusion: \(L_K = \text {logit}_{\text {fused}}\). Its advantage is that the decision can be monitored and potentially made before all \(K\) measurements are collected.
The initialization \(L_0 = 0\) corresponds to an uninformative (equal) prior, i.e., \(p_{\text {fused},0} = \sigma (0) = 0.5\). A non-equal prior \(\pi _1 = \Pr (y=1)\) can be incorporated by setting \(L_0 = \ln \frac {\pi _1}{1-\pi _1}\).
A key advantage of the sequential formulation is that the decision can be made as soon as sufficient evidence has accumulated, without waiting for all \(K\) measurements.
Fixed-threshold decision At each step \(t\), compare the accumulated log-odds to a threshold:
\(\seteqnumber{0}{}{22}\)\begin{equation} \hat {y}_t = \begin{cases} 1 & L_t \ge \tau _1\\ 0 & L_t \le \tau _0\\ \text {continue} & \tau _0 < L_t < \tau _1 \end {cases} \end{equation}
where \(\tau _1 > 0\) and \(\tau _0 < 0\) are decision thresholds. If \(L_t\) falls between the two thresholds, no decision is made and the next measurement is collected.
Confidence-based stopping Equivalently, the decision can be expressed in terms of the fused probability \(p_{\text {fused},t} = \sigma (L_t)\). Stop when the confidence exceeds a desired level \(\gamma \):
\(\seteqnumber{0}{}{23}\)\begin{equation} \text {stop at time } t^* = \min \left \{t : p_{\text {fused},t} \ge \gamma \;\text { or }\; p_{\text {fused},t} \le 1-\gamma \right \} \end{equation}
where \(\gamma \in (0.5, 1)\) is the confidence threshold. For example, \(\gamma = 0.95\) means the decision is made when the fused probability exceeds \(95\%\) for either class.
-
Example 13.2: A classifier is applied to \(K=4\) sequential noisy measurements with outputs \(p_1=0.7\), \(p_2=0.8\), \(p_3=0.6\), \(p_4=0.75\).
| \(t\) | \(p_t\) | \(\text {logit}_t\) | \(L_t\) | \(p_{\text {fused},t} = \sigma (L_t)\) |
| 1 | 0.7 | \(\ln \frac {0.7}{0.3} = 0.847\) | 0.847 | 0.700 |
| 2 | 0.8 | \(\ln \frac {0.8}{0.2} = 1.386\) | 2.233 | 0.903 |
| 3 | 0.6 | \(\ln \frac {0.6}{0.4} = 0.405\) | 2.638 | 0.933 |
| 4 | 0.75 | \(\ln \frac {0.75}{0.25} = 1.099\) | 3.737 | 0.977 |
-
• Fixed-threshold stopping (\(\tau _1 = 2\), \(\tau _0 = -2\)): the decision \(\hat {y}=1\) is reached at \(t=2\), since \(L_2 = 2.233 > \tau _1\). Measurements \(p_3\), \(p_4\) are not needed.
-
• Confidence-based stopping \((\gamma = 0.95)\): With a strict confidence threshold \(\gamma = 0.95\), the decision is delayed to \(t=4\) (\(p_{\text {fused},4} = 0.977 > 0.95\)). A higher threshold requires more evidence but yields a more reliable result (\(97.7\%\) confidence vs. \(90.3\%\)).
-
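Example 13.2 can be reproduced with a minimal sketch of the sequential update and fixed-threshold stopping rule:

```python
import numpy as np

# Sequential log-odds update with fixed thresholds tau_1 > 0 > tau_0.
# Returns (decision, stopping time, fused probability at the stop).
def sequential_llr(ps, tau1=2.0, tau0=-2.0):
    L = 0.0                                  # L_0 = 0: uninformative prior
    for t, p in enumerate(ps, start=1):
        L += np.log(p / (1 - p))             # accumulate logit_t
        if L >= tau1:
            return 1, t, 1 / (1 + np.exp(-L))
        if L <= tau0:
            return 0, t, 1 / (1 + np.exp(-L))
    # No threshold crossed: decide by the sign of the final log-odds.
    return int(L >= 0), len(ps), 1 / (1 + np.exp(-L))

y, t_stop, p_f = sequential_llr([0.7, 0.8, 0.6, 0.75])
# Stops at t = 2 with L_2 = 2.233 >= tau_1 and p_fused = 0.903,
# matching the table in Example 13.2; p_3 and p_4 are never used.
```

The confidence-based rule is the same loop with the conditions written as `p >= gamma` or `p <= 1 - gamma` on the fused probability.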
13.4.2 Multiple Independent Sensors
The previous subsection considers repeated measurements from a single classifier. In many applications, \(N\) different independent sensors (or classifiers) observe the same underlying state, each producing its own probability estimate. In this case:
-
1. Since the sensors are independent, the log-odds combine by summation (LLR fusion).
-
2. Sequential update (Sec. 13.4.1) is performed with combined log-odds.
13.4.3 Connection to Bayesian Learning (*)
The sequential log-odds update is a special case of Bayesian belief updating.
Belief: A belief \(\pi (H_i)\) is a probability assigned to hypothesis \(H_i\), representing the agent’s confidence that \(H_i\) is the true state of the world. The belief vector satisfies \(\pi (H_i) \ge 0\) and \(\sum _i \pi (H_i) = 1\).
Consider a binary hypothesis set \(\{H_0, H_1\}\) (e.g., \(y=0\) vs. \(y=1\)) with prior beliefs \(\pi _{\text {old}}(H_0)\) and \(\pi _{\text {old}}(H_1)\). After observing new evidence \(x\), the posterior is updated using Bayes’ rule:
\(\seteqnumber{0}{}{24}\)\begin{equation} \label {eq-bayes-belief-update} \pi _{\text {new}}(H_i) = \frac {\pi _{\text {old}}(H_i) \cdot L(H_i; x)}{\displaystyle \sum _{j} \pi _{\text {old}}(H_j) \cdot L(H_j; x)} \end{equation}
where \(L(H_i; x)\) is the likelihood of the observation \(x\) under hypothesis \(H_i\).
In the sequential setting, the posterior after observation \(x_t\) becomes the prior for observation \(x_{t+1}\). Taking the log-ratio of the two hypotheses after \(t\) observations yields
\(\seteqnumber{0}{}{25}\)\begin{equation} \label {eq-log-odds-bayesian} \ln \frac {\pi (H_1)}{\pi (H_0)} = \underbrace {\ln \frac {\pi _0(H_1)}{\pi _0(H_0)}}_{\text {log-prior odds}} + \sum _{k=1}^{t} \underbrace {\ln \frac {L(H_1; x_k)}{L(H_0; x_k)}}_{\text {log-likelihood ratio}_k} \end{equation}
This is exactly the sequential log-odds update with \(L_0 = \ln \frac {\pi _0(H_1)}{\pi _0(H_0)}\) and \(\text {logit}_k = \ln \frac {L(H_1; x_k)}{L(H_0; x_k)}\).
When a probabilistic classifier outputs \(p_k = f_\bw (\bx _k)\), its logit \(\ln \frac {p_k}{1-p_k}\) serves as an estimate of the log-likelihood ratio. The sequential log-odds update therefore implements approximate Bayesian inference using classifier outputs as surrogate likelihood ratios.
-
Example 13.3: A sensor classifies objects as planes (\(H_1\)) or non-planes (\(H_0\)). The prior belief is \(\pi _{\text {old}}(H_1) = 0.3\), \(\pi _{\text {old}}(H_0) = 0.7\). A new measurement \(x\) has likelihoods \(L(H_1; x) = 0.8\) and \(L(H_0; x) = 0.2\). Applying Eq. (13.25):
\(\seteqnumber{0}{}{26}\)\begin{equation} \begin{aligned} \pi _{\text {new}}(H_1) &= \frac {0.3 \times 0.8}{0.3 \times 0.8 + 0.7 \times 0.2} = \frac {0.24}{0.38} \approx 0.632 \\[3pt] \pi _{\text {new}}(H_0) &= \frac {0.7 \times 0.2}{0.38} \approx 0.368 \end {aligned} \end{equation}
A single observation shifted the belief from \(30\%\) to \(63.2\%\) in favor of “plane.” This posterior becomes the prior for the next measurement.
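Example 13.3's update can be checked with a minimal sketch of Eq. (13.25):

```python
# Bayes belief update: posterior proportional to prior times likelihood,
# normalized over the hypothesis set.
def bayes_update(prior, likelihood):
    joint = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(joint.values())                 # normalizing constant
    return {h: joint[h] / z for h in joint}

# Example 13.3: prior 0.3/0.7, likelihoods 0.8/0.2 for plane/non-plane.
post = bayes_update({"H1": 0.3, "H0": 0.7}, {"H1": 0.8, "H0": 0.2})
# post["H1"] = 0.24 / 0.38, about 0.632; in the sequential setting this
# posterior becomes the prior for the next measurement.
```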