Machine Learning & Signals Learning


13 Combining Classifiers

Goal: Combine information from multiple data sources, feature sets, or classifiers to improve classification performance.

When multiple data sources are available, there are several strategies for incorporating them into a classification pipeline. Figure 13.1 illustrates five common approaches.

• Single source (a): a single dataset is processed through one feature-extraction and classification pipeline.
• Data concatenation (b): raw data from multiple sources are concatenated before feature extraction, producing a single feature vector per sample.
• Feature concatenation (c): each data source undergoes independent feature extraction; the resulting feature vectors are concatenated before classification.
• Classifier fusion (d): each data source is processed by an independent feature-extraction and classification pipeline; the individual predictions are combined by a fusion rule (Sec. 13.3).
• Classifiers ensemble (e): a single data source undergoes feature extraction, then multiple different classifiers are applied in parallel; their predictions are combined by an ensemble rule (e.g., majority voting or averaging) (Sec. 13.3).

Approaches (b) and (c) are collectively termed early fusion because they merge information before the classifier; approaches (d) and (e) are late fusion because independent classifiers produce separate outputs that are combined afterward.

13.1 Data Concatenation

Goal: Combine (multimodal) raw data from multiple sources before feature extraction.

Given $K$ data sources producing raw vectors $\bd _1, \bd _2, \ldots , \bd _K$ for the same sample, form the concatenated input:

\begin{equation} \bd = [\bd _1;\, \bd _2;\, \ldots ;\, \bd _K] \end{equation}

A single feature-extraction pipeline then maps $\bd $ to a feature vector $\bx $, which is passed to the classifier (Fig. 13.1b).

• When combining multimodal data (e.g., EEG and EMG, or images from different scanners), each source may have its own acquisition characteristics, noise profile, and feature distribution.
• Before concatenation, verify that the sources share a compatible representation (e.g., same sampling rate, resolution, or coordinate system).
• The concatenated vector may be very high-dimensional, increasing the risk of overfitting when the number of training samples $M$ is small relative to the dimension $N$.

13.2 Feature Concatenation

Goal: Combine independently extracted features from multiple sources into a single feature vector.

Each source $k$ is processed by its own feature-extraction pipeline, yielding $\bx _k \in \mathbb {R}^{N_k}$. The concatenated feature vector is

\begin{equation} \bx = [\bx _1;\, \bx _2;\, \ldots ;\, \bx _K] \in \mathbb {R}^{N}, \quad N = \sum _{k=1}^{K} N_k \end{equation}

A single classifier operates on $\bx $ (Fig. 13.1c).

Feature concatenation allows each source to use a specialized extraction pipeline suited to its modality. However, the concatenated dimension $N$ grows with $K$, increasing the risk of overfitting when the number of training samples $M$ is small relative to $N$.
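The concatenation step can be sketched in a few lines. This is a minimal illustration, assuming each source's features for one sample are already available as plain Python lists; the names `concat_features`, `eeg_features`, and `emg_features` and all numeric values are hypothetical.

```python
from itertools import chain

def concat_features(per_source_features):
    """Concatenate K per-source feature vectors x_1..x_K into one vector x."""
    return list(chain.from_iterable(per_source_features))

# Hypothetical features for one sample from two sources (N_1 = 3, N_2 = 2).
eeg_features = [0.12, 0.80, 0.33]
emg_features = [1.50, 0.07]

x = concat_features([eeg_features, emg_features])
N = len(x)  # N = N_1 + N_2
print(x, N)
```

The same operation applied to raw vectors $\bd _k$ before feature extraction gives data concatenation (Sec. 13.1).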

13.3 Ensemble of Classifiers

Goal: Combine predictions from multiple binary classifiers to improve overall performance.

Given $K$ classifiers, each producing a prediction for the same input $\bx $, fusion methods aggregate these predictions into a single output. The methods in this section apply to two settings (Fig. 13.1d,e):

• Classifiers ensemble: $K$ different classifiers operate on the same data source. The ensemble benefits from diversity among classifiers; if they make similar errors, combining them provides little gain. Classifier comparison methods (Sec. 12.1 and 12.2.6) can quantify this diversity by measuring:
  – Yule’s Q (Sec. 12.1.5): measures the association between binary correct/incorrect outcomes. $Q \approx 0$ indicates independent errors, while $Q < 0$ suggests complementary performance.
  – Error correlation (Sec. 12.2.6 and 12.2.7): the Pearson/Spearman correlation between per-sample probabilistic scores. $r < 0$ or $r_s<0$ indicates that the classifiers’ errors are negatively correlated, which suggests high ensemble potential.
• Classifier fusion: a single classifier is applied to $K$ different data sources (e.g., multimodal sensors). When the sources originate from different domains, a domain consistency check (Sec. 11.3.2) should verify distributional compatibility before fusion. Classifier fusion methods are a subset of the ensemble methods; applicability is summarized in Sec. 13.3.8.
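As an illustration of the diversity check, the sketch below computes Yule's Q from two classifiers' per-sample correctness indicators. It assumes the standard definition $Q = (N_{11}N_{00} - N_{10}N_{01})/(N_{11}N_{00} + N_{10}N_{01})$, where $N_{11}$ counts samples both classifiers get right and $N_{00}$ samples both get wrong; the notation of Sec. 12.1.5 may differ, and the function name `yules_q` and the data are hypothetical.

```python
def yules_q(correct_a, correct_b):
    """Yule's Q from per-sample correctness flags (1 = correct, 0 = wrong)."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1          # both correct
        elif not a and not b:
            n00 += 1          # both wrong
        elif a:
            n10 += 1          # only classifier A correct
        else:
            n01 += 1          # only classifier B correct
    num = n11 * n00 - n10 * n01
    den = n11 * n00 + n10 * n01
    return num / den if den else float("nan")  # undefined when den == 0

# Hypothetical correctness flags on six validation samples.
q = yules_q([1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0])
print(q)
```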

13.3.1 Majority Vote

Goal: Combine predictions by selecting the most common class.

Each classifier $k$ produces a binary prediction $\hat {y}_k\in \{0,1\}$. The fused prediction is

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & \displaystyle \sum _{k=1}^{K} \hat {y}_k > K/2 \\[6pt] 0 & \text {otherwise} \end {cases} \end{equation}

Choosing odd $K$ avoids ties.
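A minimal sketch of the majority-vote rule above, assuming the hard predictions are given as a list of 0/1 values (the function name `majority_vote` is illustrative):

```python
def majority_vote(y_hat):
    """Return 1 if strictly more than half of the hard votes are 1, else 0."""
    K = len(y_hat)
    return 1 if sum(y_hat) > K / 2 else 0

print(majority_vote([1, 0, 1]))     # two of three votes for class 1
print(majority_vote([1, 0, 1, 0]))  # even K: a 2-2 tie falls to class 0
```

With even $K$, the strict inequality in the rule sends an exact tie to class 0, which is why odd $K$ is the simpler choice.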

13.3.2 Soft Vote

Goal: Combine probabilistic classifier outputs by averaging their predicted probabilities.

Each probabilistic classifier $k$ outputs $p_k = f_k(\bx ) = \Pr (\hat {y}=1\mid \bx )$. The soft vote averages these probabilities:

\begin{equation} p_{\text {soft}} = \frac {1}{K}\sum _{k=1}^{K} p_k \end{equation}

The fused decision is

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & p_{\text {soft}} \ge 0.5\\ 0 & p_{\text {soft}} < 0.5 \end {cases} \end{equation}

Example 13.1: Three classifiers produce probabilities $p_1 = 0.9$, $p_2 = 0.4$, $p_3 = 0.45$ for a given sample.
  • Majority vote: $\hat {y}_1 = 1$, $\hat {y}_2 = 0$, $\hat {y}_3 = 0$. Fused prediction: $\hat {y}_{\text {fused}} = 0$ (two out of three vote for class 0).
  • Soft vote: $p_{\text {soft}} = (0.9 + 0.4 + 0.45)/3 = 0.583$. Fused prediction: $\hat {y}_{\text {fused}} = 1$ (average probability exceeds $0.5$).

The high confidence of classifier 1 tips the soft vote toward class 1, whereas majority vote ignores this confidence.
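The numbers in Example 13.1 can be reproduced with a short sketch. The helper names (`hard_label`, `majority_from_probs`, `soft_vote`) are illustrative, and hard votes are assumed to come from thresholding each $p_k$ at $0.5$.

```python
def hard_label(p, threshold=0.5):
    """Threshold a probability into a hard 0/1 prediction."""
    return 1 if p >= threshold else 0

def majority_from_probs(probs):
    """Majority vote over the thresholded individual predictions."""
    votes = [hard_label(p) for p in probs]
    return 1 if sum(votes) > len(votes) / 2 else 0

def soft_vote(probs):
    """Average the probabilities, then threshold the average."""
    p_soft = sum(probs) / len(probs)
    return p_soft, hard_label(p_soft)

probs = [0.9, 0.4, 0.45]           # values from Example 13.1
y_majority = majority_from_probs(probs)
p_soft, y_soft = soft_vote(probs)
print(y_majority, round(p_soft, 3), y_soft)
```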

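For the special case of $K=2$, if the classifiers disagree ($\hat {y}_1 \neq \hat {y}_2$), the fused prediction follows the more confident classifier (if both are equally confident, $p_{\text {soft}} = 0.5$ and the decision rule above gives $\hat {y}_{\text {fused}} = 1$):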

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} \hat {y}_1 & |p_1 - 0.5| > |p_2 - 0.5| \\ \hat {y}_2 & |p_2 - 0.5| > |p_1 - 0.5| \end {cases} \end{equation}

If one classifier is very confident (e.g., $p_k = 0.95$) while the others are weakly opposed (e.g., $p_j = 0.45$), the confident vote contributes more to the average than a single hard vote would.

13.3.3 Linearly Weighted Combining

Goal: Combine probabilistic classifier outputs using a weighted average.

Each probabilistic classifier $k$ outputs $p_k = f_{k}(\bx )$. Assign non-negative weights $w_k\ge 0$ with $\sum _{k=1}^{K}w_k = 1$. The fused probability is

\begin{equation} p_{\text {fused}} = \sum _{k=1}^{K} w_k \, p_k \end{equation}

The fused decision is

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & p_{\text {fused}} \ge 0.5\\ 0 & p_{\text {fused}} < 0.5 \end {cases} \end{equation}

A common approach to setting the weights is to normalize validation accuracies [19]:

\begin{equation} w_k = \frac {a_k}{\sum _{j=1}^{K} a_j} \end{equation}

where $a_k$ is the validation accuracy (or other performance metric, e.g., $F_1$) of classifier $k$. Equal weights $w_k = 1/K$ recover simple averaging.

Alternatively, weights can be learned by minimizing the Brier score or cross-entropy loss on a validation set, treating $\{w_k\}$ as optimization variables subject to $w_k \ge 0$ and $\sum _k w_k = 1$. However, this approach incurs a significant computational cost.
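A minimal sketch of accuracy-normalized weighting, assuming the validation accuracies $a_k$ have already been measured; the accuracy values and function names below are hypothetical.

```python
def accuracy_weights(accuracies):
    """w_k = a_k / sum_j a_j, so the weights are non-negative and sum to 1."""
    total = sum(accuracies)
    return [a / total for a in accuracies]

def weighted_combine(probs, weights):
    """Weighted average of probabilities followed by a 0.5 threshold."""
    p_fused = sum(w * p for w, p in zip(weights, probs))
    return p_fused, (1 if p_fused >= 0.5 else 0)

val_acc = [0.90, 0.60, 0.50]        # hypothetical validation accuracies
w = accuracy_weights(val_acc)       # approximately [0.45, 0.30, 0.25]
p_fused, y_fused = weighted_combine([0.9, 0.4, 0.45], w)
print([round(v, 2) for v in w], round(p_fused, 4), y_fused)
```

Because accuracies are divided only by their sum, classifiers with similar accuracies receive nearly equal weights, and the result stays close to the soft vote.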

13.3.4 Log-Likelihood Ratio (LLR)

Goal: Combine probabilistic classifier outputs using log-odds (Sec. 8.6).

Each probabilistic classifier $k$ outputs $p_k = f_{k}(\bx ) = \Pr (\hat {y}=1\mid \bx )$. Recall that the logit (log-odds) is given by

\begin{equation} \text {logit}_k = \ln \frac {p_k}{1 - p_k} \end{equation}

Under the assumption that the $K$ classifiers (or classifications) are conditionally independent given the true label, the fused log-odds is the sum of individual logits:

\begin{equation} \text {logit}_{\text {fused}} = \sum _{k=1}^{K}\text {logit}_k = \sum _{k=1}^{K} \ln \frac {p_k}{1 - p_k} = \ln \prod _{k=1}^{K}\frac {p_k}{1 - p_k} \end{equation}

The fused probability is recovered by applying the sigmoid: $p_{\text {fused}} = \sigma (\text {logit}_{\text {fused}})$.

When the classes are imbalanced, the log-prior odds should not be dropped. The general fusion formula is

\begin{equation} \text {logit}_{\text {fused}} = \ln \frac {M_1}{M_0} + \sum _{k=1}^{K}\text {logit}_k \end{equation}

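where $M_0$ and $M_1$ are the class counts, so that $M_1/M_0$ estimates the prior odds. The log-prior term acts as a constant bias in favor of the majority class: when class 1 is the minority ($M_1 < M_0$), the term is negative and more classifier evidence is needed to predict class 1. When the classes are balanced ($M_0 = M_1$), the bias vanishes and the standard formula is recovered.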

The fused decision is

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & \text {logit}_{\text {fused}} \ge 0 \quad (p_{\text {fused}} \ge 0.5)\\ 0 & \text {logit}_{\text {fused}} < 0 \end {cases} \end{equation}

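With imbalanced priors, the condition $\text {logit}_{\text {fused}} \ge 0$ is equivalent to $\sum _{k=1}^{K}\text {logit}_k \ge \ln \frac {M_0}{M_1}$: the log-prior bias moves the threshold on the summed classifier evidence away from $0$ (probability $0.5$), so that predicting the minority class requires stronger evidence.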

LLR fusion requires probabilistic classifiers and assumes independence between classifier errors. When the independence assumption is violated, the fused score may be overconfident.
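The sketch below illustrates LLR fusion with and without the log-prior term. It assumes every $p_k$ lies strictly between 0 and 1 (otherwise the logit is undefined) and uses hypothetical class counts; the function names are illustrative.

```python
import math

def logit(p):
    """Log-odds of a probability; assumes 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def llr_fuse(probs, M1=1, M0=1):
    """Sum of individual logits plus the log-prior odds ln(M1/M0)."""
    logit_fused = math.log(M1 / M0) + sum(logit(p) for p in probs)
    y_fused = 1 if logit_fused >= 0 else 0
    return logit_fused, sigmoid(logit_fused), y_fused

probs = [0.9, 0.4, 0.45]
balanced = llr_fuse(probs)                    # M0 = M1: no prior bias
imbalanced = llr_fuse(probs, M1=100, M0=900)  # hypothetical counts, class 1 is rare
print(balanced, imbalanced)
```

With these inputs the balanced fusion decides class 1, while the prior odds of 1:9 are strong enough to flip the decision to class 0.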

Proof (why conditional independence leads to summing log-odds). By Bayes’ rule in odds form, the posterior odds given evidence $\bx _k$ from classifier $k$ are
\begin{equation} \frac {\Pr (y{=}1\mid \bx _k)}{\Pr (y{=}0\mid \bx _k)} = \frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)} \cdot \frac {\Pr (y{=}1)}{\Pr (y{=}0)} \end{equation}

where $\frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)}$ is $\text {likelihood ratio}_k$. When the $K$ classifiers are conditionally independent given the true label, the joint likelihood ratio factorizes:
\begin{equation} \frac {\Pr (\bx _1,\ldots ,\bx _K\mid y{=}1)}{\Pr (\bx _1,\ldots ,\bx _K\mid y{=}0)} = \prod _{k=1}^{K}\frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)} = \prod _{k=1}^{K} \text {likelihood ratio}_k \end{equation}

Multiplying by the prior odds and taking the logarithm converts the product into a sum. Absorbing the log-prior into a constant (or assuming equal priors), the fused log-odds reduces to the sum of individual LLRs.

13.3.5 Learned Combiner (Stacking)

Goal: Learn optimal combination weights from data, rather than assuming equal weights or independence.

The previous fusion methods use fixed rules. When the classifiers’ outputs are probabilistic, a stronger ensemble can be obtained by fitting a logistic regression model on the classifiers’ log-odds:

\begin{equation} \text {logit}(p_{\text {fused}}) = \beta _0 + \sum _{k=1}^{K} \beta _k\, \text {logit}_k \end{equation}

where $\beta _0$ is a bias term and $\beta _k$ are learned coefficients. The parameters $\{\beta _0, \beta _1, \ldots , \beta _K\}$ are fit on a validation set, and the stacked model is evaluated on a separate test set so that the reported performance is not optimistically biased.

The fused probability is recovered via the sigmoid: $p_{\text {fused}} = \sigma \!\left (\beta _0 + \sum _{k} \beta _k\, \text {logit}_k\right )$.

When $\beta _0 = 0$ and $\beta _k = 1$ for all $k$, the learned combiner reduces to LLR fusion. The learned coefficients allow the model to down-weight redundant or poorly calibrated classifiers.
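A minimal sketch of such a learned combiner, written with the standard library only. It assumes plain batch gradient descent on the mean cross-entropy, initialized at the LLR-fusion solution ($\beta _0 = 0$, $\beta _k = 1$); the validation data, learning rate, and iteration count are hypothetical, and in practice a library logistic-regression routine would normally be used instead.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fused_prob(beta, probs):
    """sigma(beta_0 + sum_k beta_k * logit_k) for one sample."""
    return sigmoid(beta[0] + sum(b * logit(p) for b, p in zip(beta[1:], probs)))

def mean_bce(beta, P, y):
    """Mean binary cross-entropy of the combiner on (P, y)."""
    total = 0.0
    for probs, yi in zip(P, y):
        pf = fused_prob(beta, probs)
        total -= yi * math.log(pf) + (1 - yi) * math.log(1.0 - pf)
    return total / len(y)

def fit_stacker(P_val, y_val, lr=0.1, n_iter=2000):
    """Fit beta_0..beta_K by batch gradient descent on the mean BCE."""
    K = len(P_val[0])
    beta = [0.0] + [1.0] * K          # start from LLR fusion
    for _ in range(n_iter):
        grad = [0.0] * (K + 1)
        for probs, yi in zip(P_val, y_val):
            err = fused_prob(beta, probs) - yi
            grad[0] += err
            for k, p in enumerate(probs):
                grad[k + 1] += err * logit(p)
        beta = [b - lr * g / len(y_val) for b, g in zip(beta, grad)]
    return beta

# Hypothetical validation outputs of K = 2 base classifiers and true labels.
P_val = [[0.9, 0.6], [0.8, 0.3], [0.7, 0.7], [0.35, 0.8],
         [0.3, 0.6], [0.2, 0.4], [0.4, 0.2], [0.6, 0.5], [0.7, 0.6]]
y_val = [1, 1, 1, 1, 0, 0, 0, 0, 0]

beta_llr = [0.0, 1.0, 1.0]
beta = fit_stacker(P_val, y_val)
print(beta, mean_bce(beta_llr, P_val, y_val), mean_bce(beta, P_val, y_val))
```

The comparison of the two loss values is on the validation data itself and only confirms that the fit improved on its starting point; an honest estimate requires a separate test set.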

13.3.6 General Non-Linear Combining

Goal: Use a non-linear model to combine base classifier predictions, capturing complex interactions between them.

The stacking approach can be generalized by replacing the linear logistic regression model with any non-linear classifier (e.g., a neural network, random forest, or gradient boosting machine). The fused prediction is formulated as:

\begin{equation} p_{\text {fused}} = g_{\bphi }(p_1, \ldots , p_K) \end{equation}

where $g_{\bphi }$ is the non-linear combiner parameterized by $\bphi $. As with the linear learned combiner, the parameters $\bphi $ are fit on a validation set, and the stacked model is evaluated on a separate test set.

While non-linear combiners can capture complex dependencies between base classifiers, they are more prone to overfitting than linear combiners, especially if the validation set is small.

13.3.7 Held-Out Average Log-Likelihood Gain (*)

The ensemble gain can be quantified using the log-likelihood. Recall (Sec. 8.4) that the per-sample log-likelihood for a probabilistic classifier with output $p_i = f_\bw (\bx _i)$ is

\begin{equation} \log p(y_i \mid \bx _i) = y_i \log p_i + (1-y_i)\log (1-p_i) \end{equation}

The held-out average log-likelihood gain of the ensemble over the best base model is

\begin{equation} \Delta \overline {\ell } = \frac {1}{M_{\text {test}}} \sum _{i=1}^{M_{\text {test}}} \left [\log p_{\text {fused}}(y_i \mid \bx _i) - \log p_{\text {best}}(y_i \mid \bx _i)\right ] \end{equation}

where $p_{\text {fused}}$ and $p_{\text {best}}$ are the predicted probabilities of the stacked model and the best individual classifier, respectively, evaluated on the test set.

Positive $\Delta \overline {\ell }$ indicates that the ensemble assigns, on average, higher probability to the true test labels than the best base model, i.e., it provides better probabilistic predictions (including calibration). Equivalently, $\Delta \overline {\ell }$ equals the reduction in BCE loss: $\Delta \overline {\ell } = \loss _{\text {best}} - \loss _{\text {fused}}$.
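The gain and its equivalence to the BCE reduction can be checked numerically. The sketch below uses hypothetical test labels and probabilities, and the function names are illustrative.

```python
import math

def log_likelihood(y, p):
    """Per-sample log-likelihood y*log(p) + (1-y)*log(1-p)."""
    return y * math.log(p) + (1 - y) * math.log(1 - p)

def bce(y_true, probs):
    """Mean binary cross-entropy (negative average log-likelihood)."""
    return -sum(log_likelihood(y, p) for y, p in zip(y_true, probs)) / len(y_true)

def avg_ll_gain(y_true, p_fused, p_best):
    """Held-out average log-likelihood gain of the ensemble over the best base model."""
    diffs = [log_likelihood(y, pf) - log_likelihood(y, pb)
             for y, pf, pb in zip(y_true, p_fused, p_best)]
    return sum(diffs) / len(diffs)

y_test = [1, 0, 1, 0]               # hypothetical test labels
p_fused = [0.9, 0.2, 0.7, 0.1]      # hypothetical ensemble outputs
p_best = [0.8, 0.3, 0.6, 0.2]       # hypothetical best-base-model outputs

gain = avg_ll_gain(y_test, p_fused, p_best)
print(round(gain, 4), round(bce(y_test, p_best) - bce(y_test, p_fused), 4))
```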

13.3.8 Summary

The methods above apply to a classifiers ensemble of $K$ classifiers; a subset of them also applies to classifier fusion, i.e., inference from $K$ different data sources (or noisy measurements $\bx _1, \ldots , \bx _K$ of the same quantity). In the classifier-fusion setting, the independence assumption required for LLR fusion corresponds to independent measurement noise, and methods that assign different learned parameters per classifier (Linearly Weighted Combining, Learned Combiner) are not applicable, since there is only one classifier. The methods are summarized in Table 13.1.

Cross-validation of ensembles. When evaluating an ensemble with cross-validation, the procedure depends on whether the fusion rule has learned parameters:

• Fixed-rule methods (Majority vote, Soft vote, LLR fusion): no parameters are learned from data. A standard outer CV loop suffices. In each fold, train the $K$ base classifiers on the training partition, apply the fixed fusion rule, and evaluate on the held-out test partition.
• Learned-parameter methods (Weighted combining, Learned combiner): the combiner weights ($w_k$ or $\beta _k$) must be fitted on data that is separate from both the base-classifier training data and the test data. Within each outer CV fold, further split the training partition into a base-training set (to train the $K$ classifiers) and a validation set (to fit the combiner weights). The fused model is then evaluated on the outer test fold (Sec. 4.4).

Fitting combiner weights on the same data used to train the base classifiers is a form of data leakage (Sec. 11.3.1): the base classifiers’ predictions on their own training samples are overconfident, leading to inflated ensemble performance estimates.
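The data partitioning for learned-parameter methods can be sketched as follows. This is an index-only illustration that assumes the samples have already been shuffled and ignores stratification; the function name `nested_folds` and the `val_fraction` parameter are hypothetical.

```python
def nested_folds(M, n_folds=5, val_fraction=0.25):
    """Yield (base_train, validation, test) index lists for each outer fold."""
    indices = list(range(M))
    for f in range(n_folds):
        test = indices[f::n_folds]                 # outer test fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        n_val = max(1, int(len(train) * val_fraction))
        validation = train[:n_val]                 # used to fit combiner weights
        base_train = train[n_val:]                 # used to train the K base classifiers
        yield base_train, validation, test

folds = list(nested_folds(20, n_folds=5))
for base_train, validation, test in folds:
    print(len(base_train), len(validation), len(test))
```

In each fold the three index sets are disjoint, so the combiner never sees samples used to train the base classifiers, and neither stage sees the test fold.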

Method	Input type	Key idea	Classifier fusion	Validation set
Majority vote	Hard ($\hat {y}_k\in \{0,1\}$)	Most common class	Yes	No
Soft vote	Probabilistic ($p_k$)	Average of probabilities	Yes	No
Weighted combining	Probabilistic ($p_k$)	Weighted average of probabilities	No	Yes
LLR fusion	Probabilistic ($p_k$)	Sum of log-odds (assumes independence)	Yes	No
Learned combiner	Probabilistic ($p_k$)	Logistic regression on log-odds	No	Yes
Non-linear combiner	Any	Non-linear model on outputs	No	Yes

Table 13.1: Summary of classifier fusion and ensemble methods. All the methods are applicable to a classifiers ensemble. “Classifier fusion” indicates applicability when the same classifier is applied to $K$ independent data sources. “Validation set” indicates whether the method requires a held-out validation set to fit combiner parameters.

13.4 Sequential Combining

Goal: Process independent noisy measurements sequentially, updating the decision as each one arrives.

The ensemble methods in the previous section assume that all $K$ classifier outputs (or measurements) are available simultaneously. In many practical settings, however, measurements arrive one at a time (streaming data), and a decision can (or must) be updated after each new observation.

13.4.1 Sequential Log-Odds Update

Consider a single classifier applied to a sequence of $K$ independent noisy measurements $\bx _1, \bx _2, \ldots , \bx _K$ of the same underlying quantity. After measurement $\bx _t$, the classifier outputs the probability $p_t = f_\bw (\bx _t)$.

In the batch LLR fusion, all measurements are combined at once:

\begin{equation} \text {logit}_{\text {fused}} = \sum _{k=1}^{K} \ln \frac {p_k}{1 - p_k} \end{equation}

The same result can be obtained sequentially. Define the accumulated log-odds after $t$ measurements as

\begin{equation} L_t = L_{t-1} + \underbrace {\ln \frac {p_t}{1-p_t}}_{\text {logit}_t}, \quad L_0 = 0 \end{equation}

After each update, the fused probability is

\begin{equation} p_{\text {fused},t} = \sigma (L_t) = \frac {1}{1 + e^{-L_t}} \end{equation}

The sequential formulation produces identical results to the batch LLR fusion: $L_K = \text {logit}_{\text {fused}}$. Its advantage is that the decision can be monitored and potentially made before all $K$ measurements are collected.

The initialization $L_0 = 0$ corresponds to an uninformative (equal) prior, i.e., $p_{\text {fused},0} = \sigma (0) = 0.5$. A non-equal prior $\pi _1 = \Pr (y=1)$ can be incorporated by setting $L_0 = \ln \frac {\pi _1}{1-\pi _1}$.
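A minimal sketch of the recursive update, checking that the final accumulated value matches the batch sum. The probabilities are hypothetical and the function name `sequential_log_odds` is illustrative; the `prior_pi1` argument sets $L_0$ as described above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sequential_log_odds(probs, prior_pi1=0.5):
    """Return [(L_t, p_fused_t)] for t = 1..K, starting from the log-prior odds."""
    L = math.log(prior_pi1 / (1.0 - prior_pi1))   # L_0 (0 for an equal prior)
    history = []
    for p in probs:
        L += math.log(p / (1.0 - p))              # L_t = L_{t-1} + logit_t
        history.append((L, sigmoid(L)))
    return history

probs = [0.6, 0.7, 0.55]                          # hypothetical classifier outputs
history = sequential_log_odds(probs)
batch = sum(math.log(p / (1.0 - p)) for p in probs)
print(history[-1][0], batch)
```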

Two stopping rules determine when enough evidence has accumulated to decide without waiting for all $K$ measurements.

Fixed-threshold decision. At each step $t$, compare the accumulated log-odds to a threshold:

\begin{equation} \hat {y}_t = \begin{cases} 1 & L_t \ge \tau _1\\ 0 & L_t \le \tau _0\\ \text {continue} & \tau _0 < L_t < \tau _1 \end {cases} \end{equation}

where $\tau _1 > 0$ and $\tau _0 < 0$ are decision thresholds. If $L_t$ falls between the two thresholds, no decision is made and the next measurement is collected.

Confidence-based stopping. Equivalently, the decision can be expressed in terms of the fused probability $p_{\text {fused},t} = \sigma (L_t)$. Stop when the confidence exceeds a desired level $\gamma $:

\begin{equation} \text {stop at time } t^* = \min \left \{t : p_{\text {fused},t} \ge \gamma \;\text { or }\; p_{\text {fused},t} \le 1-\gamma \right \} \end{equation}

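where $\gamma \in (0.5, 1)$ is the confidence threshold. For example, $\gamma = 0.95$ means the decision is made when the fused probability reaches $95\%$ for either class. This rule coincides with the fixed-threshold rule for the symmetric thresholds $\tau _1 = -\tau _0 = \ln \frac {\gamma }{1-\gamma }$; for $\gamma = 0.95$, $\tau _1 = \ln 19 \approx 2.944$.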

Example 13.2: A classifier is applied to $K=4$ sequential noisy measurements with outputs $p_1=0.7$, $p_2=0.8$, $p_3=0.6$, $p_4=0.75$.

$t$	$p_t$	$\text {logit}_t$	$L_t$	$p_{\text {fused},t} = \sigma (L_t)$
1	0.7	$\ln \frac {0.7}{0.3} = 0.847$	0.847	0.700
2	0.8	$\ln \frac {0.8}{0.2} = 1.386$	2.233	0.903
3	0.6	$\ln \frac {0.6}{0.4} = 0.405$	2.638	0.933
4	0.75	$\ln \frac {0.75}{0.25} = 1.099$	3.737	0.977

• Fixed-threshold stopping ($\tau _1 = 2$, $\tau _0 = -2$): the decision $\hat {y}=1$ is reached at $t=2$, since $L_2 = 2.233 > \tau _1$. Measurements $p_3$, $p_4$ are not needed.
• Confidence-based stopping ($\gamma = 0.95$): the decision $\hat {y}=1$ is delayed to $t=4$, the first step at which $p_{\text {fused},t} \ge 0.95$ ($p_{\text {fused},3} = 0.933 < 0.95$, $p_{\text {fused},4} = 0.977$). A higher threshold requires more evidence but yields a more confident decision ($97.7\%$ vs. $90.3\%$ at $t=2$).
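Both stopping rules of Example 13.2 can be reproduced with the sketch below. The function names are illustrative, and returning `(None, None)` when no threshold is reached is an assumption about how to report an undecided sequence.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fixed_threshold_stop(probs, tau1, tau0):
    """Return (t, decision) at the first step where L_t leaves (tau0, tau1)."""
    L = 0.0
    for t, p in enumerate(probs, start=1):
        L += math.log(p / (1.0 - p))
        if L >= tau1:
            return t, 1
        if L <= tau0:
            return t, 0
    return None, None                    # no decision within the sequence

def confidence_stop(probs, gamma):
    """Return (t, decision) at the first step where the confidence reaches gamma."""
    L = 0.0
    for t, p in enumerate(probs, start=1):
        L += math.log(p / (1.0 - p))
        p_fused = sigmoid(L)
        if p_fused >= gamma:
            return t, 1
        if p_fused <= 1.0 - gamma:
            return t, 0
    return None, None

probs = [0.7, 0.8, 0.6, 0.75]            # outputs from Example 13.2
fixed = fixed_threshold_stop(probs, tau1=2.0, tau0=-2.0)
confident = confidence_stop(probs, gamma=0.95)
print(fixed, confident)
```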

13.4.2 Multiple Independent Sensors

The previous subsection considers repeated measurements from a single classifier. In many applications, $N$ different independent sensors (or classifiers) observe the same underlying state, each producing its own probability estimate. In this case:

1. Since the sensors are independent, the log-odds combine by summation (LLR fusion).
2. The sequential update (Sec. 13.4.1) is then performed using the combined log-odds of all sensors as the increment at each step.
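The two steps can be sketched as a single loop. This is an illustration with hypothetical readings from two sensors over three time steps; the function name `multi_sensor_sequence` is an assumption.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def multi_sensor_sequence(readings, L0=0.0):
    """readings[t] holds the probabilities reported by all sensors at step t."""
    L = L0
    trace = []
    for sensor_probs in readings:
        combined = sum(logit(p) for p in sensor_probs)  # step 1: LLR fusion across sensors
        L += combined                                   # step 2: sequential update
        trace.append(L)
    return trace

readings = [[0.6, 0.7], [0.55, 0.8], [0.4, 0.65]]       # hypothetical sensor outputs
trace = multi_sensor_sequence(readings)
print(trace)
```

Under the independence assumption, the final value equals the sum of all individual logits, regardless of how they are grouped by sensor or time step.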

13.4.3 Connection to Bayesian Learning (*)

Goal: Relate the sequential log-odds update to the Bayesian belief update framework.

The sequential log-odds update is a special case of Bayesian belief updating.

Belief: A belief $\pi (H_i)$ is a probability assigned to hypothesis $H_i$, representing the agent’s confidence that $H_i$ is the true state of the world. The belief vector satisfies $\pi (H_i) \ge 0$ and $\sum _i \pi (H_i) = 1$.

Consider a binary hypothesis set $\{H_0, H_1\}$ (e.g., $y=0$ vs. $y=1$) with prior beliefs $\pi _{\text {old}}(H_0)$ and $\pi _{\text {old}}(H_1)$. After observing new evidence $x$, the posterior is updated using Bayes’ rule:

\begin{equation} \label {eq-bayes-belief-update} \pi _{\text {new}}(H_i) = \frac {\pi _{\text {old}}(H_i) \cdot L(H_i; x)}{\displaystyle \sum _{j} \pi _{\text {old}}(H_j) \cdot L(H_j; x)} \end{equation}

where $L(H_i; x)$ is the likelihood of the observation $x$ under hypothesis $H_i$.

In the sequential setting, the posterior after observation $x_t$ becomes the prior for observation $x_{t+1}$. Taking the log-ratio of the two hypotheses after $t$ observations yields

\begin{equation} \label {eq-log-odds-bayesian} \ln \frac {\pi (H_1)}{\pi (H_0)} = \underbrace {\ln \frac {\pi _0(H_1)}{\pi _0(H_0)}}_{\text {log-prior odds}} + \sum _{k=1}^{t} \underbrace {\ln \frac {L(H_1; x_k)}{L(H_0; x_k)}}_{\text {log-likelihood ratio}_k} \end{equation}

This is exactly the sequential log-odds update with $L_0 = \ln \frac {\pi _0(H_1)}{\pi _0(H_0)}$ and $\text {logit}_k = \ln \frac {L(H_1; x_k)}{L(H_0; x_k)}$.

When a probabilistic classifier outputs $p_k = f_\bw (\bx _k)$, its logit $\ln \frac {p_k}{1-p_k}$ serves as an estimate of the log-likelihood ratio. The sequential log-odds update therefore implements approximate Bayesian inference using classifier outputs as surrogate likelihood ratios.

Example 13.3: A sensor classifies objects as planes ($H_1$) or non-planes ($H_0$). The prior belief is $\pi _{\text {old}}(H_1) = 0.3$, $\pi _{\text {old}}(H_0) = 0.7$. A new measurement $x$ has likelihoods $L(H_1; x) = 0.8$ and $L(H_0; x) = 0.2$. Applying Eq. (13.25):
\begin{equation} \begin{aligned} \pi _{\text {new}}(H_1) &= \frac {0.3 \times 0.8}{0.3 \times 0.8 + 0.7 \times 0.2} = \frac {0.24}{0.38} \approx 0.632 \\[3pt] \pi _{\text {new}}(H_0) &= \frac {0.7 \times 0.2}{0.38} \approx 0.368 \end {aligned} \end{equation}

A single observation shifted the belief from $30\%$ to $63.2\%$ in favor of “plane.” This posterior becomes the prior for the next measurement.
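The belief update of Example 13.3 can be reproduced in a few lines, together with a check that the log-odds form gives the same posterior. The function name `belief_update` and the list ordering $[H_0, H_1]$ are illustrative choices.

```python
import math

def belief_update(prior, likelihood):
    """Bayes' rule: posterior_i is proportional to prior_i * likelihood_i."""
    unnormalized = [pi * lik for pi, lik in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

prior = [0.7, 0.3]            # [pi_old(H0), pi_old(H1)] from Example 13.3
likelihood = [0.2, 0.8]       # [L(H0; x), L(H1; x)]
posterior = belief_update(prior, likelihood)

# Same update in log-odds form: log-prior odds plus log-likelihood ratio.
log_odds = math.log(prior[1] / prior[0]) + math.log(likelihood[1] / likelihood[0])
p_h1 = 1.0 / (1.0 + math.exp(-log_odds))
print([round(v, 3) for v in posterior], round(p_h1, 3))
```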
