Machine Learning & Signals Learning


15 Classifier Fusion

Goal: Combine information from multiple data sources, feature sets, or classifiers to improve classification performance.

When multiple data sources are available, there are several strategies for incorporating them into a classification pipeline. Figure 15.1 illustrates five common approaches.

• Single source (a): a single dataset is processed through one feature-extraction and classification pipeline.
• Data concatenation (b): raw data from multiple sources are concatenated before feature extraction, producing a single feature vector per sample.
• Feature concatenation (c): each data source undergoes independent feature extraction; the resulting feature vectors are concatenated before classification.
• Classifier fusion (d): each data source is processed by an independent feature-extraction and classification pipeline; the individual predictions are combined by a fusion rule (Sec. 15.3).
• Classifiers ensemble (e): a single data source undergoes feature extraction, then multiple different classifiers are applied in parallel; their predictions are combined by an ensemble rule (e.g., majority voting or averaging) (Sec. 15.3).

Approaches (b) and (c) are collectively termed early fusion because they merge information before the classifier; approaches (d) and (e) are late fusion because independent classifiers produce separate outputs that are combined afterward.

15.1 Data Concatenation

Goal: Combine (multimodal) raw data from multiple sources before feature extraction.

Given $K$ data sources producing raw vectors $\bd _1, \bd _2, \ldots , \bd _K$ for the same sample, form the concatenated input:

\begin{equation} \bd = [\bd _1;\, \bd _2;\, \ldots ;\, \bd _K] \end{equation}

A single feature-extraction pipeline then maps $\bd $ to a feature vector $\bx $, which is passed to the classifier (Fig. 15.1b).

• When combining multimodal data (e.g., EEG and EMG, or images from different scanners), each source may have its own acquisition characteristics, noise profile, and feature distribution.
• Before concatenation, verify that the sources share a compatible representation (e.g., same sampling rate, resolution, or coordinate system).
• The concatenated vector may be very high-dimensional, increasing the risk of overfitting when the number of training samples $M$ is small relative to the resulting dimension $N$.

15.2 Feature Concatenation

Goal: Combine independently extracted features from multiple sources into a single feature vector.

Each source $k$ is processed by its own feature-extraction pipeline, yielding $\bx _k \in \mathbb {R}^{N_k}$. The concatenated feature vector is

\begin{equation} \bx = [\bx _1;\, \bx _2;\, \ldots ;\, \bx _K] \in \mathbb {R}^{N}, \quad N = \sum _{k=1}^{K} N_k \end{equation}

A single classifier operates on $\bx $ (Fig. 15.1c).

Feature concatenation allows each source to use a specialized extraction pipeline suited to its modality. However, the concatenated dimension $N$ grows with $K$, increasing the risk of overfitting when the number of training samples $M$ is small relative to $N$.
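The concatenation step itself is simple and can be sketched as follows. This is a minimal illustration in plain Python; the function name `concatenate_features` and the feature values are hypothetical and stand in for the outputs of the per-source extraction pipelines.

```python
def concatenate_features(feature_vectors):
    """Stack the per-source vectors x_1, ..., x_K into one list x."""
    fused = []
    for x_k in feature_vectors:
        fused.extend(x_k)
    return fused


# Hypothetical features from K = 3 sources with N_1 = 2, N_2 = 3, N_3 = 1.
x_1 = [0.12, 0.80]
x_2 = [1.5, 0.3, 2.2]
x_3 = [9.81]

x = concatenate_features([x_1, x_2, x_3])
N = len(x)  # N = N_1 + N_2 + N_3 = 6
```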

15.3 Ensemble of Classifiers

Goal: Combine predictions from multiple binary classifiers to improve overall performance.

15.3.1 Preface

Given $K$ classifiers, each producing a prediction for the same input $\bx $, fusion methods aggregate these predictions into a single output. The methods in this section apply to two settings (Fig. 15.1d,e):

• Classifiers ensemble: $K$ different classifiers operate on the same data source. The ensemble benefits from diversity among classifiers; if they make similar errors, combining them provides little gain. Classifier comparison methods (Sec. 14.1 and 14.2.8) can quantify this diversity by measuring:
  – Yule’s Q (Sec. 14.1.6): measures the association between binary correct/incorrect outcomes. $Q \approx 0$ indicates independent errors, while $Q < 0$ suggests complementary performance.
  – Error correlation (Sec. 14.2.8 and 14.2.9): the Pearson/Spearman correlation between per-sample probabilistic scores. Negative values ($r < 0$, $r_s<0$) mean that the classifiers’ errors are negatively correlated, which suggests high ensemble potential.
• Classifier fusion: a single classifier is applied to $K$ different data sources (e.g., multimodal sensors). When the sources originate from different domains, a domain consistency check (Sec. 13.2.2) should verify distributional compatibility before fusion. Classifier fusion methods are a subset of the ensemble methods; applicability is summarized in Sec. 15.3.9.
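The Yule's Q diversity measure mentioned for the classifiers-ensemble setting can be sketched as below. The sketch assumes the common contingency-table form $Q = (N_{11}N_{00} - N_{01}N_{10})/(N_{11}N_{00} + N_{01}N_{10})$, where $N_{11}$ counts samples both classifiers get right and $N_{00}$ samples both get wrong; Sec. 14.1.6 is not reproduced here, so treat this form and the name `yules_q` as assumptions.

```python
def yules_q(correct_a, correct_b):
    """Yule's Q from two equal-length lists of booleans (True = correct)."""
    pairs = list(zip(correct_a, correct_b))
    n11 = sum(1 for a, b in pairs if a and b)          # both correct
    n00 = sum(1 for a, b in pairs if not a and not b)  # both wrong
    n10 = sum(1 for a, b in pairs if a and not b)      # only A correct
    n01 = sum(1 for a, b in pairs if not a and b)      # only B correct
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Yule's Q is undefined for this contingency table")
    return (n11 * n00 - n01 * n10) / denominator


# Hypothetical per-sample outcomes: the two classifiers never fail together.
correct_a = [True, True, True, False, False, True, True, False]
correct_b = [True, False, True, True, True, False, True, True]
q = yules_q(correct_a, correct_b)  # -1.0: fully complementary errors
```

In this toy case the classifiers never err on the same sample, so $Q = -1$, the most favorable value for an ensemble.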

15.3.2 Majority Vote

Goal: Combine predictions by selecting the most common class.

Condorcet’s jury theorem

Condorcet’s jury theorem (1785) is a political science theorem about the probability that a group of individuals arrives at a correct decision. It considers a decision between two options, one of which is assumed to be correct. Given $n$ voters who choose independently, each with probability $p>\frac {1}{2}$ of making the correct choice, the theorem states that (for odd $n$) the majority vote is correct with probability at least $p$, and that this probability grows as more voters are added. Moreover, for $n\rightarrow \infty $ the probability of the group reaching the correct decision by majority vote approaches 1.
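The theorem can be checked numerically with the binomial distribution. The sketch below computes the probability that more than half of $n$ independent voters are correct; the function name `majority_correct_probability` is illustrative, and $n$ is assumed odd so that no ties occur.

```python
from math import comb


def majority_correct_probability(n, p):
    """P(more than n/2 of n independent voters are correct), each correct w.p. p."""
    return sum(
        comb(n, m) * p**m * (1 - p) ** (n - m)
        for m in range(n // 2 + 1, n + 1)
    )


# With p = 0.6 the majority becomes more reliable as voters are added.
for n in (1, 3, 11, 101):
    print(n, round(majority_correct_probability(n, 0.6), 4))
```

For $p < \frac{1}{2}$ the effect reverses: with $p = 0.4$ and $n = 3$ the majority is correct with probability $0.352$, which is worse than a single voter.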

Definition

Each classifier $k$ produces a binary prediction $\hat {y}_k\in \{0,1\}$. The fused prediction is

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & \displaystyle \sum _{k=1}^{K} \hat {y}_k > K/2 \\[6pt] 0 & \text {otherwise} \end {cases} \end{equation}

Choosing odd $K$ avoids ties.
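A direct transcription of the majority-vote rule is sketched below (the name `majority_vote` is illustrative). Note that, following the definition above, a tie with even $K$ falls into the "otherwise" branch and yields class 0.

```python
def majority_vote(hard_labels):
    """Return 1 if strictly more than half of the binary votes are 1, else 0."""
    K = len(hard_labels)
    return 1 if sum(hard_labels) > K / 2 else 0


print(majority_vote([1, 1, 0]))  # two of three vote for class 1
print(majority_vote([1, 0]))     # tie with even K resolves to class 0
```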

15.3.3 Soft Vote

Goal: Combine probabilistic classifier outputs by averaging their predicted probabilities.

Each probabilistic classifier $k$ outputs $p_k = f_k(\bx ) = \Pr (\hat {y}=1\mid \bx )$. The soft vote averages these probabilities:

\begin{equation} p_{\text {soft}} = \frac {1}{K}\sum _{k=1}^{K} p_k \end{equation}

The fused decision is

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & p_{\text {soft}} \ge 0.5\\ 0 & p_{\text {soft}} < 0.5 \end {cases} \end{equation}

Example 15.1: Three classifiers produce probabilities $p_1 = 0.9$, $p_2 = 0.4$, $p_3 = 0.45$ for a given sample.
• Majority vote: $\hat {y}_1 = 1$, $\hat {y}_2 = 0$, $\hat {y}_3 = 0$. Fused prediction: $\hat {y}_{\text {fused}} = 0$ (two out of three vote for class 0).
• Soft vote: $p_{\text {soft}} = (0.9 + 0.4 + 0.45)/3 = 0.583$. Fused prediction: $\hat {y}_{\text {fused}} = 1$ (average probability exceeds $0.5$).
The high confidence of classifier 1 tips the soft vote toward class 1, whereas majority vote ignores this confidence.
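Example 15.1 can be reproduced with a few lines of Python. The helper names `soft_vote` and `hard_majority` are illustrative; hard votes are obtained by thresholding each probability at $0.5$.

```python
def soft_vote(probs):
    """Average the probabilities and threshold the mean at 0.5."""
    p_soft = sum(probs) / len(probs)
    return p_soft, (1 if p_soft >= 0.5 else 0)


def hard_majority(probs):
    """Threshold each probability at 0.5, then take the majority vote."""
    votes = [1 if p >= 0.5 else 0 for p in probs]
    return 1 if sum(votes) > len(votes) / 2 else 0


probs = [0.9, 0.4, 0.45]
p_soft, y_soft = soft_vote(probs)   # p_soft is about 0.583, y_soft is 1
y_majority = hard_majority(probs)   # 0
```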

For the special case of $K=2$, if the classifiers disagree ($\hat {y}_1 \neq \hat {y}_2$), the fused prediction follows the more confident classifier:

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} \hat {y}_1 & |p_1 - 0.5| > |p_2 - 0.5| \\ \hat {y}_2 & |p_2 - 0.5| > |p_1 - 0.5| \end {cases} \end{equation}

If one classifier is very confident (e.g., $p_k = 0.95$) while the others are weakly opposed (e.g., $p_j = 0.45$), the confident vote contributes more to the average than a single hard vote would.

15.3.4 Linearly Weighted Combining

Goal: Combine probabilistic classifier outputs using a weighted average.

Each probabilistic classifier $k$ outputs $p_k = f_{k}(\bx )$. Assign non-negative weights $w_k\ge 0$ with $\sum _{k=1}^{K}w_k = 1$. The fused probability is

\begin{equation} p_{\text {fused}} = \sum _{k=1}^{K} w_k \, p_k \end{equation}

The fused decision is

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & p_{\text {fused}} \ge 0.5\\ 0 & p_{\text {fused}} < 0.5 \end {cases} \end{equation}

A common approach is to normalize validation accuracies [17]:

\begin{equation} w_k = \frac {a_k}{\sum _{j=1}^{K} a_j} \end{equation}

where $a_k$ is the validation accuracy (or other performance metric, e.g., $F_1$) of classifier $k$. Equal weights $w_k = 1/K$ recover simple averaging.

Alternatively, weights can be learned by minimizing the Brier score or cross-entropy loss on a validation set, treating $\{w_k\}$ as optimization variables subject to $w_k \ge 0$ and $\sum _k w_k = 1$. However, this constrained optimization is computationally more expensive than the closed-form accuracy-based weights.
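The accuracy-normalized weighting can be sketched as follows. The validation accuracies below are hypothetical numbers chosen for illustration, and the function names are not from the source.

```python
def accuracy_weights(accuracies):
    """Normalize validation accuracies a_k into weights w_k that sum to 1."""
    total = sum(accuracies)
    return [a / total for a in accuracies]


def weighted_combine(probs, weights):
    """Weighted average of probabilities, thresholded at 0.5."""
    p_fused = sum(w * p for w, p in zip(weights, probs))
    return p_fused, (1 if p_fused >= 0.5 else 0)


accuracies = [0.9, 0.6, 0.6]   # hypothetical validation accuracies a_k
probs = [0.9, 0.4, 0.45]       # classifier outputs p_k for one sample
weights = accuracy_weights(accuracies)
p_fused, y_fused = weighted_combine(probs, weights)  # about 0.629, class 1
```

Passing equal weights `[1/3, 1/3, 1/3]` to `weighted_combine` gives the soft-vote result for the same sample.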

15.3.5 Log-Likelihood Ratio (LLR)

Goal: Combine probabilistic classifier outputs using log-odds (Sec. 8.6).

Each probabilistic classifier $k$ outputs $p_k = f_{k}(\bx ) = \Pr (\hat {y}=1\mid \bx )$. Recall that the logit (log-odds) is given by

\begin{equation} \text {logit}_k = \ln \frac {p_k}{1 - p_k} \end{equation}

Under the assumption that the $K$ classifiers (or classifications) are conditionally independent given the true label, the fused log-odds is the sum of individual logits:

\begin{equation} \text {logit}_{\text {fused}} = \sum _{k=1}^{K}\text {logit}_k = \sum _{k=1}^{K} \ln \frac {p_k}{1 - p_k} = \ln \prod _{k=1}^{K}\frac {p_k}{1 - p_k} \end{equation}

The fused probability is recovered by applying the sigmoid: $p_{\text {fused}} = \sigma (\text {logit}_{\text {fused}})$.

When the classes are imbalanced, the log-prior odds should not be dropped. The general fusion formula is

\begin{equation} \text {logit}_{\text {fused}} = \ln \frac {M_1}{M_0} + \sum _{k=1}^{K}\text {logit}_k \end{equation}

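where $M_0$ and $M_1$ are the class counts, so that $M_1/M_0$ estimates the prior odds $\Pr (y{=}1)/\Pr (y{=}0)$. The log-prior term acts as a constant bias: when class 1 is the minority ($M_1 < M_0$), the bias is negative and stronger evidence is required before class 1 is predicted. When the classes are balanced ($M_0 = M_1$), the bias vanishes and the standard formula is recovered.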

The fused decision is

\begin{equation} \hat {y}_{\text {fused}} = \begin{cases} 1 & \text {logit}_{\text {fused}} \ge 0 \quad (p_{\text {fused}} \ge 0.5)\\ 0 & \text {logit}_{\text {fused}} < 0 \end {cases} \end{equation}

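With imbalanced priors, thresholding $\text {logit}_{\text {fused}}$ at $0$ is equivalent to thresholding the prior-free sum $\sum _k \text {logit}_k$ at $\ln (M_0/M_1)$. The effective probability threshold therefore moves away from $0.5$, and the minority class is predicted only when the accumulated evidence is strong enough to overcome the prior.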

LLR fusion requires probabilistic classifiers and assumes independence between classifier errors. When the independence assumption is violated, the fused score may be overconfident.
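A minimal sketch of LLR fusion, including the optional log-prior term $\ln (M_1/M_0)$, is given below. Function names are illustrative, and the inputs are assumed to lie strictly between 0 and 1 (a probability of exactly 0 or 1 has an infinite logit and would raise an error here).

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def llr_fuse(probs, m0=1, m1=1):
    """Sum the logits, add the log-prior odds ln(M1/M0), and decide at 0."""
    fused_logit = math.log(m1 / m0) + sum(logit(p) for p in probs)
    return sigmoid(fused_logit), (1 if fused_logit >= 0 else 0)


probs = [0.9, 0.4, 0.45]
p_balanced, y_balanced = llr_fuse(probs)                 # about 0.831, class 1
p_imbalanced, y_imbalanced = llr_fuse(probs, m0=9, m1=1)  # about 0.353, class 0
```

With balanced classes the confident classifier dominates and class 1 is chosen; with a 9:1 prior against class 1 the same evidence is no longer sufficient.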

Proof (why conditional independence leads to summing log-odds). By Bayes’ rule in odds form, the posterior odds given evidence $\bx _k$ from classifier $k$ are
\begin{equation} \frac {\Pr (y{=}1\mid \bx _k)}{\Pr (y{=}0\mid \bx _k)} = \frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)} \cdot \frac {\Pr (y{=}1)}{\Pr (y{=}0)} \end{equation}

where $\frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)}$ is $\text {likelihood ratio}_k$. When the $K$ classifiers are conditionally independent given the true label, the joint likelihood ratio factorizes:
\begin{equation} \frac {\Pr (\bx _1,\ldots ,\bx _K\mid y{=}1)}{\Pr (\bx _1,\ldots ,\bx _K\mid y{=}0)} = \prod _{k=1}^{K}\frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)} = \prod _{k=1}^{K} \text {likelihood ratio}_k \end{equation}

Multiplying by the prior odds and taking the logarithm converts the product into a sum. Absorbing the log-prior into a constant (or assuming equal priors), the fused log-odds reduces to the sum of individual LLRs.
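Written out, the final step of the argument reads as follows (a worked restatement of the two equations above, using the same notation):

\begin{equation*} \ln \frac {\Pr (y{=}1\mid \bx _1,\ldots ,\bx _K)}{\Pr (y{=}0\mid \bx _1,\ldots ,\bx _K)} = \ln \frac {\Pr (y{=}1)}{\Pr (y{=}0)} + \sum _{k=1}^{K} \ln \frac {\Pr (\bx _k\mid y{=}1)}{\Pr (\bx _k\mid y{=}0)} \end{equation*}

With equal priors the first term is zero, and by Bayes' rule in odds form each summand equals the posterior log-odds of classifier $k$, i.e., $\text {logit}_k$.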

15.3.6 Learned Combiner (Stacking)

Goal: Learn optimal combination weights from data, rather than assuming equal weights or independence.

The previous fusion methods use fixed rules. When the classifiers’ outputs are probabilistic, a stronger ensemble can be obtained by fitting a logistic regression model on the classifiers’ log-odds:

\begin{equation} \text {logit}(p_{\text {fused}}) = \beta _0 + \sum _{k=1}^{K} \beta _k\, \text {logit}_k \end{equation}

where $\beta _0$ is a bias term and $\beta _k$ are learned coefficients. The parameters $\{\beta _0, \beta _1, \ldots , \beta _K\}$ are fit on a validation set, and the stacked model is evaluated on a separate test set so that the performance estimate is not optimistically biased.

The fused probability is recovered via the sigmoid: $p_{\text {fused}} = \sigma \!\left (\beta _0 + \sum _{k} \beta _k\, \text {logit}_k\right )$.

When $\beta _0 = 0$ and $\beta _k = 1$ for all $k$, the learned combiner reduces to LLR fusion. The learned coefficients allow the model to down-weight redundant or poorly calibrated classifiers.
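The learned combiner can be sketched with a small, dependency-free logistic regression. This is only an illustration under stated assumptions: plain gradient descent without regularization, coefficients initialized at the LLR-fusion values, and a hypothetical eight-sample validation set in which classifier 1 is informative and classifier 2 is not. In practice a library implementation would normally be used.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def stacked_probability(beta0, betas, probs):
    """p_fused = sigmoid(beta0 + sum_k beta_k * logit(p_k))."""
    z = beta0 + sum(b * logit(p) for b, p in zip(betas, probs))
    return 1 / (1 + math.exp(-z))


def mean_bce(beta0, betas, prob_rows, labels):
    """Mean binary cross-entropy of the stacked model."""
    total = 0.0
    for probs, y in zip(prob_rows, labels):
        p = stacked_probability(beta0, betas, probs)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def fit_combiner(prob_rows, labels, lr=0.1, steps=500):
    """Gradient descent on the mean cross-entropy, starting from LLR fusion."""
    K, M = len(prob_rows[0]), len(labels)
    beta0, betas = 0.0, [1.0] * K
    for _ in range(steps):
        g0, g = 0.0, [0.0] * K
        for probs, y in zip(prob_rows, labels):
            err = stacked_probability(beta0, betas, probs) - y
            g0 += err
            for k, p in enumerate(probs):
                g[k] += err * logit(p)
        beta0 -= lr * g0 / M
        betas = [b - lr * gk / M for b, gk in zip(betas, g)]
    return beta0, betas


# Hypothetical validation outputs (p_1, p_2) and true labels.
val_probs = [[0.9, 0.6], [0.8, 0.3], [0.7, 0.7], [0.4, 0.2],
             [0.2, 0.6], [0.3, 0.4], [0.1, 0.8], [0.6, 0.3]]
val_labels = [1, 1, 1, 1, 0, 0, 0, 0]

loss_llr = mean_bce(0.0, [1.0, 1.0], val_probs, val_labels)
beta0, betas = fit_combiner(val_probs, val_labels)
loss_fit = mean_bce(beta0, betas, val_probs, val_labels)
```

The fitted loss on the validation data is lower than the LLR-fusion loss by construction; whether the gain carries over must be checked on a separate test set, as stated above.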

15.3.7 General Non-Linear Combining

Goal: Use a non-linear model to combine base classifier predictions, capturing complex interactions between them.

The stacking approach can be generalized by replacing the linear logistic regression model with any non-linear classifier (e.g., a neural network, random forest, or gradient boosting machine). The fused prediction is formulated as:

\begin{equation} p_{\text {fused}} = g_{\bphi }(p_1, \ldots , p_K) \end{equation}

where $g_{\bphi }$ is the non-linear combiner parameterized by $\bphi $. As with the linear learned combiner, the parameters $\bphi $ are fit on a validation set, and the stacked model is evaluated on a separate test set.

While non-linear combiners can capture complex dependencies between base classifiers, they are more prone to overfitting than linear combiners, especially if the validation set is small.

15.3.8 Held-Out Average Log-Likelihood Gain (*)

The ensemble gain can be quantified using the log-likelihood. Recall (Sec. 8.4) that the per-sample log-likelihood for a probabilistic classifier with output $p_i = f_\bw (\bx _i)$ is

\begin{equation} \log p(y_i \mid \bx _i) = y_i \log p_i + (1-y_i)\log (1-p_i) \end{equation}

The held-out average log-likelihood gain of the ensemble over the best base model is

\begin{equation} \Delta \overline {\ell } = \frac {1}{M_{\text {test}}} \sum _{i=1}^{M_{\text {test}}} \left [\log p_{\text {fused}}(y_i \mid \bx _i) - \log p_{\text {best}}(y_i \mid \bx _i)\right ] \end{equation}

where $p_{\text {fused}}$ and $p_{\text {best}}$ are the predicted probabilities of the stacked model and the best individual classifier, respectively, evaluated on the test set.

Positive $\Delta \overline {\ell }$ indicates that the ensemble assigns, on average, higher probability to the true held-out labels than the best base model, i.e., its probabilistic predictions are better. Equivalently, $\Delta \overline {\ell }$ equals the reduction in BCE loss: $\Delta \overline {\ell } = \loss _{\text {best}} - \loss _{\text {fused}}$.
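The gain is straightforward to compute from held-out predictions. The sketch below uses hypothetical test-set probabilities; function names are illustrative.

```python
import math


def avg_log_likelihood(probs, labels):
    """Mean of y*log(p) + (1-y)*log(1-p) over the samples."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(probs, labels)
    ) / len(labels)


def log_likelihood_gain(p_fused, p_best, labels):
    """Held-out average log-likelihood gain of the ensemble over the best base model."""
    return avg_log_likelihood(p_fused, labels) - avg_log_likelihood(p_best, labels)


# Hypothetical test-set predictions for four samples.
y_test = [1, 0, 1, 0]
p_best = [0.7, 0.4, 0.6, 0.3]
p_fused = [0.8, 0.3, 0.7, 0.2]
gain = log_likelihood_gain(p_fused, p_best, y_test)  # about 0.144 nats per sample
```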

15.3.9 Summary

The methods above apply both to a classifiers ensemble of $K$ classifiers and to classifier fusion of inferences from $K$ different data sources (or noisy measurements $\bx _1, \ldots , \bx _K$ of the same quantity). In the classifier-fusion setting, the independence assumption required for LLR fusion corresponds to independent measurement noise, and methods that assign different learned parameters per classifier (Linearly Weighted Combining, Learned Combiner) are not applicable, since there is only one classifier. The methods are summarized in Table 15.1.

Cross-validation of ensembles

When evaluating an ensemble with cross-validation, the procedure depends on whether the fusion rule has learned parameters:

• Fixed-rule methods (Majority vote, Soft vote, LLR fusion): no parameters are learned from data. A standard outer CV loop suffices. In each fold, train the $K$ base classifiers on the training partition, apply the fixed fusion rule, and evaluate on the held-out test partition.
• Learned-parameter methods (Weighted combining, Learned combiner, Non-linear combiner): the combiner parameters ($w_k$, $\beta _k$, or $\bphi $) must be fitted on data that is separate from both the base-classifier training data and the test data. Within each outer CV fold, further split the training partition into a base-training set (to train the $K$ classifiers) and a validation set (to fit the combiner parameters). The fused model is then evaluated on the outer test fold (Sec. 4.5).

Possible data leakage

Fitting combiner weights on the same data used to train the base classifiers is a form of data leakage (Sec. 13.2.1): the base classifiers’ predictions on their own training samples are overconfident, leading to inflated ensemble performance estimates.
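The three-way partition for learned-parameter methods can be sketched as an index generator. This is a simplified illustration: it assigns samples to folds by position, without the shuffling or stratification that a real evaluation would normally add, and the name `nested_splits` and the 25% validation fraction are assumptions.

```python
def nested_splits(n_samples, n_folds=5, val_fraction=0.25):
    """Yield (base_train, validation, test) index lists for each outer fold."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]            # outer test fold
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        n_val = max(1, int(len(train) * val_fraction))
        validation = train[:n_val]               # used to fit the combiner only
        base_train = train[n_val:]               # used to train the K classifiers
        yield base_train, validation, test


splits = list(nested_splits(20, n_folds=5, val_fraction=0.25))
```

Within each fold, the base classifiers see only `base_train`, the combiner is fitted on their predictions for `validation`, and the fused model is scored on `test`, so no sample plays two roles.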

Method	Input type	Key idea	Classifier fusion	Validation set
Majority vote	Hard ($\hat {y}_k\in \{0,1\}$)	Most common class	Yes	No
Soft vote	Probabilistic ($p_k$)	Average of probabilities	Yes	No
Weighted combining	Probabilistic ($p_k$)	Weighted average of probabilities	No	Yes
LLR fusion	Probabilistic ($p_k$)	Sum of log-odds (assumes independence)	Yes	No
Learned combiner	Probabilistic ($p_k$)	Logistic regression on log-odds	No	Yes
Non-linear combiner	Any	Non-linear model on outputs	No	Yes

Table 15.1: Summary of classifier fusion and ensemble methods. All the methods are applicable for classifiers ensemble. “Classifier fusion” indicates applicability when the same classifier is applied to $K$ independent data sources. “Validation set” indicates whether the method requires a held-out validation set to fit combiner parameters.

15.4 Multi-Class Ensemble

Goal: Extend the binary fusion rules of Sec. 15.3 to $C$-class classification.

Consider $K$ base classifiers and $C\ge 2$ classes. Each classifier $k$ outputs either a hard label $\hat {y}_k\in \{1,\ldots ,C\}$ or a probability vector $\bp _k = (p_{k,1},\ldots ,p_{k,C})$ with $p_{k,c}\ge 0$ and $\sum _{c=1}^{C} p_{k,c} = 1$. The fused output is either a single label $\hat {y}_{\text {fused}}$ or a probability vector $\bp _{\text {fused}}$. The methods below are direct lifts of their binary counterparts; Table 15.2 summarizes them upfront.

Method	Input	Fused output	Decision rule	Validation set
Plurality vote	Hard $\hat {y}_k$	Vote counts per class	$\arg \max $ counts	No
Soft vote	Probabilistic $\bp _k$	Mean probability vector	$\arg \max _c \bar {p}_c$	No
Weighted combining	Probabilistic $\bp _k$	Weighted mean vector	$\arg \max _c \sum _k w_k p_{k,c}$	Yes
Log-probability fusion	Probabilistic $\bp _k$	Softmax of summed log-probabilities	$\arg \max _c \sum _k \log p_{k,c}$	No
Learned combiner	Probabilistic $\bp _k$	Multinomial logistic on log-probabilities	$\arg \max _c$ of softmax output	Yes
Non-linear combiner	Any	Any model with softmax head	$\arg \max _c$ of model output	Yes

Table 15.2: Multi-class ensemble methods. Each row is the $C$-class analogue of the corresponding binary rule in Table 15.1.

15.4.1 Plurality Vote

Goal: Multi-class analogue of majority vote: select the class with the largest number of votes.

Each classifier $k$ produces a hard label $\hat {y}_k\in \{1,\ldots ,C\}$. The fused prediction is

\begin{equation} \hat {y}_{\text {fused}} = \arg \max _{c\in \{1,\ldots ,C\}}\, \sum _{k=1}^{K} \indFunc [\hat {y}_k = c] \end{equation}

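Condorcet’s theorem extends to the multi-class case with the threshold $p > 1/C$: if each classifier is correct with probability $p > 1/C$, its wrong votes are spread evenly over the remaining $C-1$ classes, and the classifiers vote independently, the plurality vote converges to the correct class as $K\to \infty $. Without the even-spread condition, errors concentrated on a single wrong class can outvote the correct class whenever that wrong class is chosen with probability larger than $p$.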

Ties are more frequent than in the binary case, especially when $K$ is small relative to $C$. Common tie-breakers: (i) weight votes by per-classifier validation accuracy, (ii) fall back to soft vote (Sec. 15.4.2), or (iii) prefer the class with the smaller index (deterministic but arbitrary).
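A minimal sketch of plurality vote with tie-breaker (iii), the smallest class index, is shown below. The function name is illustrative; the other tie-breakers would replace the final `min`.

```python
from collections import Counter


def plurality_vote(hard_labels):
    """Most-voted class; ties are resolved in favor of the smallest class index."""
    counts = Counter(hard_labels)
    top = max(counts.values())
    return min(c for c, n in counts.items() if n == top)


print(plurality_vote([1, 1, 2]))        # class 1 wins with two votes
print(plurality_vote([3, 2, 1]))        # three-way tie, smallest index is chosen
print(plurality_vote([2, 3, 3, 2, 1]))  # tie between classes 2 and 3
```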

15.4.2 Soft Vote

Goal: Average the probability vectors of probabilistic classifiers.

Each classifier $k$ outputs $\bp _k$. The fused probability vector is the elementwise mean:

\begin{equation} \bp _{\text {soft}} = \frac {1}{K}\sum _{k=1}^{K} \bp _k, \qquad \hat {y}_{\text {fused}} = \arg \max _{c} p_{\text {soft},c} \end{equation}

Soft vote averages raw probabilities, so it implicitly assumes that the classifiers’ scores are on a comparable scale; if a base classifier is poorly calibrated and overconfident, its extreme scores can dominate the average. Verify per-class calibration (Sec. 14.3.8) before averaging.

15.4.3 Linearly Weighted Combining

Goal: Weight each classifier’s probability vector by its validation performance.

Assign non-negative weights $w_k$ with $\sum _k w_k = 1$. The fused probability vector and decision are

\begin{equation} \bp _{\text {fused}} = \sum _{k=1}^{K} w_k\, \bp _k, \qquad \hat {y}_{\text {fused}} = \arg \max _{c}\, p_{\text {fused},c} \end{equation}

A common choice is $w_k = a_k / \sum _j a_j$ where $a_k$ is the validation accuracy (or macro-$F_1$) of classifier $k$.

When Bowker’s test (Sec. 14.3.4) rejects symmetry, the per-class error rates differ across classifiers. In that case, class-specific weights $w_{k,c}$ (re-normalized so $\sum _c p_{\text {fused},c} = 1$) can route disputed class pairs to the classifier known to be stronger on that boundary. See Sec. 14.3.11, Step 3.

15.4.4 Log-Probability Fusion

Goal: Multi-class analogue of LLR fusion (Sec. 15.3.5): sum log-probabilities under conditional independence.

Under the assumption that the $K$ classifiers are conditionally independent given the true class, the joint log-posterior decomposes, up to an additive constant that does not depend on $c$, as

\begin{equation} \log \Pr (y = c\mid \bx _1,\ldots ,\bx _K) = \log \pi _c + \sum _{k=1}^{K} \log p_{k,c} + \text {const} \end{equation}

where $\pi _c = \Pr (y = c)$ is the class prior. Normalizing by the softmax across classes yields

\begin{equation} p_{\text {fused},c} = \frac {\pi _c \prod _{k=1}^{K} p_{k,c}}{\sum _{c'=1}^{C} \pi _{c'} \prod _{k=1}^{K} p_{k,c'}} \end{equation}

The fused decision is

\begin{equation} \hat {y}_{\text {fused}} = \arg \max _{c}\left [\log \pi _c + \sum _{k=1}^{K} \log p_{k,c}\right ] \end{equation}

When classes are balanced ($\pi _c = 1/C$), the prior term is a constant that drops out of the $\arg \max $. The binary case (Sec. 15.3.5) is recovered by taking the log-ratio against class $0$.

If any classifier assigns $p_{k,c}\approx 0$ to a class, $\log p_{k,c}\to -\infty $ and that class is permanently excluded by a single overconfident vote. Floor probabilities at $\varepsilon $ (e.g., $\varepsilon = 10^{-6}$): $p_{k,c}\leftarrow \max (p_{k,c}, \varepsilon )$. This is the multi-class form of the overconfidence warning for binary LLR fusion.
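The fusion rule with probability flooring can be sketched as follows. The function name, the default $\varepsilon$, and the toy inputs are illustrative; labels are returned 1-based to match the text, and the maximum score is subtracted before exponentiating for numerical stability.

```python
import math


def log_prob_fusion(prob_vectors, priors=None, eps=1e-6):
    """Return (fused probability vector, 1-based fused label)."""
    C = len(prob_vectors[0])
    if priors is None:
        priors = [1.0 / C] * C  # balanced classes
    scores = [
        math.log(priors[c]) + sum(math.log(max(p[c], eps)) for p in prob_vectors)
        for c in range(C)
    ]
    top = max(scores)
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized], scores.index(top) + 1


# Classifier 1 assigns zero probability to class 1; the other two favor it.
P = [[0.0, 0.6, 0.4], [0.7, 0.2, 0.1], [0.8, 0.1, 0.1]]
_, label_small_floor = log_prob_fusion(P, eps=1e-6)  # class 2: the veto holds
_, label_large_floor = log_prob_fusion(P, eps=0.05)  # class 1: the veto is softened
```

The choice of $\varepsilon$ therefore controls how strongly a single near-zero probability can veto a class.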

15.4.5 Learned Combiner (Multinomial Stacking)

Goal: Fit a multinomial logistic regression on the base classifiers’ log-probabilities.

The fused log-posterior is, up to an additive constant that does not depend on $c$,

\begin{equation} \log \Pr (y = c\mid \cdot ) = \beta _{0,c} + \sum _{k=1}^{K}\sum _{c'=1}^{C} \beta _{k,c,c'}\, \log p_{k,c'} + \text {const} \end{equation}

followed by a softmax across $c$. Parameters $\{\beta _{0,c}, \beta _{k,c,c'}\}$ are fitted on a held-out validation set and the stacked model is evaluated on a separate test set.

The log-probability fusion of Sec. 15.4.4 is the special case $\beta _{0,c} = \log \pi _c$, $\beta _{k,c,c} = 1$, and $\beta _{k,c,c'} = 0$ for $c'\ne c$. The learned coefficients allow the stacked model to down-weight redundant or poorly calibrated classifiers, and the off-diagonal terms ($c'\ne c$) capture systematic class confusions in individual base classifiers.
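The forward pass of the multinomial combiner, and its reduction to log-probability fusion, can be sketched as below. Only inference is shown; fitting the coefficients is omitted. The nested-list layout `beta[k][c][c_prime]` and the function name are assumptions, and all input probabilities are assumed strictly positive.

```python
import math


def multinomial_combiner(prob_vectors, beta0, beta):
    """Softmax over scores beta0[c] + sum_k sum_c' beta[k][c][c'] * log p[k][c']."""
    K, C = len(prob_vectors), len(beta0)
    log_p = [[math.log(p) for p in vec] for vec in prob_vectors]
    scores = [
        beta0[c] + sum(beta[k][c][cp] * log_p[k][cp]
                       for k in range(K) for cp in range(C))
        for c in range(C)
    ]
    top = max(scores)
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Special case: beta0[c] = log(pi_c), identity coefficients per classifier.
K, C = 3, 3
beta0 = [math.log(1 / C)] * C
beta = [[[1.0 if cp == c else 0.0 for cp in range(C)] for c in range(C)]
        for _ in range(K)]
P = [[0.55, 0.40, 0.05], [0.50, 0.45, 0.05], [0.10, 0.85, 0.05]]
p_fused = multinomial_combiner(P, beta0, beta)  # matches the product rule
```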

Non-linear combiners (Sec. 15.3.7) apply unchanged: replace the multinomial logistic head with any classifier that accepts the stacked probability vectors $(\bp _1,\ldots ,\bp _K)$ as input and produces a $C$-dimensional probability vector.

15.4.6 Hard versus Soft Outputs

The rules above produce a probability vector $\bp _{\text {fused}}$, from which one can take $\arg \max $ (top-1) or report the top-$k$ predictions ranked by $p_{\text {fused},c}$. If only hard labels $\hat {y}_k$ are available from the base classifiers, only plurality vote (and a degenerate weighted vote where each classifier contributes a one-hot indicator) applies.

Example 15.2: Three classifiers produce probability vectors for a $C=3$ classification task:

Classifier	$p_{k,1}$	$p_{k,2}$	$p_{k,3}$	$\hat {y}_k$
$k=1$	0.55	0.40	0.05	1
$k=2$	0.50	0.45	0.05	1
$k=3$	0.10	0.85	0.05	2

Assume balanced priors ($\pi _c = 1/3$).
• Plurality vote: class 1 receives 2 votes, class 2 receives 1 vote. $\hat {y}_{\text {fused}} = 1$.
• Soft vote: $\bp _{\text {soft}} = (0.383,\, 0.567,\, 0.050)$. $\hat {y}_{\text {fused}} = 2$.
• Log-probability fusion (natural logarithm): $\sum _k \log p_{k,1} = -3.59$, $\sum _k \log p_{k,2} = -1.88$, $\sum _k \log p_{k,3} = -8.99$. $\hat {y}_{\text {fused}} = 2$.
The confident vote of classifier 3 for class 2 tips both soft vote and log-probability fusion toward class 2, while plurality vote ignores the confidence and follows the count.
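The three rules of Example 15.2 can be recomputed directly from the probability table. The sketch below uses natural logarithms and 1-based class labels; variable names are illustrative.

```python
import math
from collections import Counter

P = [[0.55, 0.40, 0.05],
     [0.50, 0.45, 0.05],
     [0.10, 0.85, 0.05]]
C = 3


def argmax_label(values):
    """1-based index of the largest entry."""
    return max(range(len(values)), key=lambda c: values[c]) + 1


# Plurality vote on the hard labels (no tie occurs in this example).
hard_labels = [argmax_label(p) for p in P]
votes = Counter(hard_labels)
y_plurality = max(votes, key=votes.get)

# Soft vote: elementwise mean of the probability vectors.
p_soft = [sum(p[c] for p in P) / len(P) for c in range(C)]
y_soft = argmax_label(p_soft)

# Log-probability fusion with balanced priors: sum of log-probabilities.
log_sums = [sum(math.log(p[c]) for p in P) for c in range(C)]
y_logprob = argmax_label(log_sums)

print(hard_labels, y_plurality, y_soft, y_logprob)
```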


15.5 Sequential Combining

Goal: Sequential processing of independent noisy measurements.

The ensemble methods in the previous section assume that all $K$ classifier outputs (or measurements) are available simultaneously. In many practical settings, however, measurements arrive one at a time (streaming data), and the decision can (or must) be updated after each new observation.

15.5.1 Sequential Log-Odds Update

Consider a single classifier applied to a sequence of $K$ independent noisy measurements $\bx _1, \bx _2, \ldots , \bx _K$ of the same underlying quantity. After measurement $\bx _t$, the classifier outputs the probability $p_t = f_\bw (\bx _t)$.

In the batch LLR fusion, all measurements are combined at once:

\begin{equation} \text {logit}_{\text {fused}} = \sum _{k=1}^{K} \ln \frac {p_k}{1 - p_k} \end{equation}

The same result can be obtained sequentially. Define the accumulated log-odds after $t$ measurements as

\begin{equation} L_t = L_{t-1} + \underbrace {\ln \frac {p_t}{1-p_t}}_{\text {logit}_t}, \quad L_0 = 0 \end{equation}

After each update, the fused probability is

\begin{equation} p_{\text {fused},t} = \sigma (L_t) = \frac {1}{1 + e^{-L_t}} \end{equation}

The sequential formulation produces identical results to the batch LLR fusion: $L_K = \text {logit}_{\text {fused}}$. Its advantage is that the decision can be monitored and potentially made before all $K$ measurements are collected.

The initialization $L_0 = 0$ corresponds to an uninformative (equal) prior, i.e., $p_{\text {fused},0} = \sigma (0) = 0.5$. A non-equal prior $\pi _1 = \Pr (y=1)$ can be incorporated by setting $L_0 = \ln \frac {\pi _1}{1-\pi _1}$.
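
The recursion and its prior initialization can be sketched as follows. This is an illustrative implementation, assuming every $p_t$ lies strictly between 0 and 1; the function name `sequential_log_odds` and the sample probabilities are not from the text.

```python
import math

def logit(p):
    """Log-odds of a probability in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sequential_log_odds(probs, prior=0.5):
    """Yield (L_t, p_fused_t) after each measurement.

    prior is Pr(y=1); prior=0.5 gives L_0 = 0.
    """
    L = logit(prior)
    for p in probs:
        L += logit(p)          # L_t = L_{t-1} + logit_t
        yield L, sigmoid(L)

probs = [0.6, 0.9, 0.3]        # illustrative classifier outputs
trace = list(sequential_log_odds(probs))

# Batch LLR fusion for comparison: the final accumulated value must match
batch = sum(logit(p) for p in probs)
```

Because the generator yields after every measurement, a caller can stop consuming it as soon as a decision criterion is met.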

Deciding early requires a rule that states when the accumulated evidence is sufficient. Two common rules are given next.

Fixed-threshold decision. At each step $t$, compare the accumulated log-odds to a pair of thresholds:

\begin{equation} \hat {y}_t = \begin{cases} 1 & L_t \ge \tau _1\\ 0 & L_t \le \tau _0\\ \text {continue} & \tau _0 < L_t < \tau _1 \end {cases} \end{equation}

where $\tau _1 > 0$ and $\tau _0 < 0$ are decision thresholds. If $L_t$ falls between the two thresholds, no decision is made and the next measurement is collected.

Confidence-based stopping. Equivalently, the decision can be expressed in terms of the fused probability $p_{\text {fused},t} = \sigma (L_t)$. Stop when the confidence exceeds a desired level $\gamma $:

\begin{equation} \text {stop at time } t^* = \min \left \{t : p_{\text {fused},t} \ge \gamma \;\text { or }\; p_{\text {fused},t} \le 1-\gamma \right \} \end{equation}

where $\gamma \in (0.5, 1)$ is the confidence threshold. For example, $\gamma = 0.95$ means the decision is made when the fused probability exceeds $95\%$ for either class.
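
The link between the confidence level and the log-odds thresholds can be made explicit. Since $\sigma $ is strictly increasing, applying its inverse (the logit) to both stopping conditions gives

\begin{equation*} p_{\text {fused},t} \ge \gamma \iff L_t \ge \ln \frac {\gamma }{1-\gamma }, \qquad p_{\text {fused},t} \le 1-\gamma \iff L_t \le -\ln \frac {\gamma }{1-\gamma } \end{equation*}

so confidence-based stopping is the fixed-threshold rule with the symmetric choice $\tau _1 = -\tau _0 = \ln \frac {\gamma }{1-\gamma }$. For instance, $\gamma = 0.95$ corresponds to $\tau _1 = \ln 19 \approx 2.944$.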

Example 15.3: A classifier is applied to $K=4$ sequential noisy measurements with outputs $p_1=0.7$, $p_2=0.8$, $p_3=0.6$, $p_4=0.75$.

$t$	$p_t$	$\text {logit}_t$	$L_t$	$p_{\text {fused},t} = \sigma (L_t)$
1	0.7	$\ln \frac {0.7}{0.3} = 0.847$	0.847	0.700
2	0.8	$\ln \frac {0.8}{0.2} = 1.386$	2.233	0.903
3	0.6	$\ln \frac {0.6}{0.4} = 0.405$	2.638	0.933
4	0.75	$\ln \frac {0.75}{0.25} = 1.099$	3.737	0.977

• Fixed-threshold stopping ($\tau _1 = 2$, $\tau _0 = -2$): the decision $\hat {y}=1$ is reached at $t=2$, since $L_2 = 2.233 > \tau _1$. Measurements $p_3$, $p_4$ are not needed.
• Confidence-based stopping ($\gamma = 0.95$): the decision is delayed to $t=4$, since $p_{\text {fused},3} = 0.933 < 0.95$ but $p_{\text {fused},4} = 0.977 > 0.95$. A higher threshold requires more evidence but yields a more confident decision ($97.7\%$ vs. $90.3\%$).
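
Both stopping times in this example can be reproduced in code. The sketch below is illustrative: `first_stop` is a hypothetical helper, and the confidence rule is implemented by converting $\gamma $ to a log-odds threshold through the logit, which is valid because $\sigma $ is monotone.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def first_stop(probs, tau1, tau0, L0=0.0):
    """Return (t, decision, L_t) at the first threshold crossing, or None
    if the measurements run out before either threshold is reached."""
    L = L0
    for t, p in enumerate(probs, start=1):
        L += logit(p)
        if L >= tau1:
            return t, 1, L
        if L <= tau0:
            return t, 0, L
    return None

probs = [0.7, 0.8, 0.6, 0.75]   # outputs from Example 15.3

# Fixed thresholds tau_1 = 2, tau_0 = -2
fixed = first_stop(probs, tau1=2.0, tau0=-2.0)

# Confidence level gamma = 0.95 expressed as symmetric log-odds thresholds
gamma = 0.95
conf = first_stop(probs, tau1=logit(gamma), tau0=-logit(gamma))

print(fixed[0], conf[0])  # 2 4
```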

15.5.2 Multiple Independent Sensors

The previous subsection considers repeated measurements from a single classifier. In many applications, $N$ different independent sensors (or classifiers) observe the same underlying state, each producing its own probability estimate. In this case:

1. Since the sensors are independent, the log-odds combine by summation (LLR fusion).
2. Sequential update (Sec. 15.5.1) is performed with combined log-odds.
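
The two steps can be sketched together. This is a minimal illustration, assuming that at each time step every sensor reports a probability strictly between 0 and 1 and that the sensors are independent; the function name and the readings are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_sensors_over_time(readings, L0=0.0):
    """readings[t][n] is the probability reported by sensor n at time step t.
    Returns the fused probability after each time step."""
    L = L0
    fused = []
    for step in readings:
        combined = sum(logit(p) for p in step)  # step 1: sum over sensors
        L += combined                           # step 2: sequential update
        fused.append(sigmoid(L))
    return fused

# Two hypothetical sensors observed at two time steps
readings = [[0.6, 0.7], [0.55, 0.8]]
fused = fuse_sensors_over_time(readings)
```

The number of sensors may vary between time steps, since each step simply sums whatever logits are available.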

15.5.3 Connection to Bayesian Learning (*)

Goal: Relate the sequential log-odds update to the Bayesian belief update framework.

The sequential log-odds update is a special case of Bayesian belief updating.

Belief: A belief $\pi (H_i)$ is a probability assigned to hypothesis $H_i$, representing the agent’s confidence that $H_i$ is the true state of the world. The belief vector satisfies $\pi (H_i) \ge 0$ and $\sum _i \pi (H_i) = 1$.

Consider a binary hypothesis set $\{H_0, H_1\}$ (e.g., $y=0$ vs. $y=1$) with prior beliefs $\pi _{\text {old}}(H_0)$ and $\pi _{\text {old}}(H_1)$. After observing new evidence $x$, the posterior is updated using Bayes’ rule:

\begin{equation} \label {eq-bayes-belief-update} \pi _{\text {new}}(H_i) = \frac {\pi _{\text {old}}(H_i) \cdot L(H_i; x)}{\displaystyle \sum _{j} \pi _{\text {old}}(H_j) \cdot L(H_j; x)} \end{equation}

where $L(H_i; x)$ is the likelihood of the observation $x$ under hypothesis $H_i$.

In the sequential setting, the posterior after observation $x_t$ becomes the prior for observation $x_{t+1}$. Taking the log-ratio of the two hypotheses after $t$ observations yields

\begin{equation} \label {eq-log-odds-bayesian} \ln \frac {\pi _t(H_1)}{\pi _t(H_0)} = \underbrace {\ln \frac {\pi _0(H_1)}{\pi _0(H_0)}}_{\text {log-prior odds}} + \sum _{k=1}^{t} \underbrace {\ln \frac {L(H_1; x_k)}{L(H_0; x_k)}}_{\text {log-likelihood ratio}_k} \end{equation}

This is exactly the sequential log-odds update with $L_0 = \ln \frac {\pi _0(H_1)}{\pi _0(H_0)}$ and $\text {logit}_k = \ln \frac {L(H_1; x_k)}{L(H_0; x_k)}$.

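When a probabilistic classifier outputs $p_k = f_\bw (\bx _k)$, its logit $\ln \frac {p_k}{1-p_k}$ serves as an estimate of the log-likelihood ratio. The identification holds for a calibrated classifier trained with balanced class priors; with unbalanced training priors, the logit also contains the training log-prior odds, which should be subtracted before accumulation. The sequential log-odds update therefore implements approximate Bayesian inference using classifier outputs as surrogate likelihood ratios.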

Example 15.4: A sensor classifies objects as planes ($H_1$) or non-planes ($H_0$). The prior belief is $\pi _{\text {old}}(H_1) = 0.3$, $\pi _{\text {old}}(H_0) = 0.7$. A new measurement $x$ has likelihoods $L(H_1; x) = 0.8$ and $L(H_0; x) = 0.2$. Applying Eq. (15.32):
$\seteqnumber{0}{}{33}$
\begin{equation} \begin{aligned} \pi _{\text {new}}(H_1) &= \frac {0.3 \times 0.8}{0.3 \times 0.8 + 0.7 \times 0.2} = \frac {0.24}{0.38} \approx 0.632 \\[3pt] \pi _{\text {new}}(H_0) &= \frac {0.7 \times 0.2}{0.38} \approx 0.368 \end {aligned} \end{equation}

A single observation shifted the belief from $30\%$ to $63.2\%$ in favor of “plane.” This posterior becomes the prior for the next measurement.
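
The same numbers can be obtained both from the belief update and from its log-odds form. The sketch below is illustrative; `bayes_update` is a hypothetical helper name, and hypotheses are stored in the order $[H_0, H_1]$.

```python
import math

def bayes_update(prior, likelihood):
    """prior[i] and likelihood[i] refer to hypothesis H_i.
    Returns the normalized posterior belief vector."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    z = sum(joint)
    return [j / z for j in joint]

prior = [0.7, 0.3]        # [pi_old(H0), pi_old(H1)]
likelihood = [0.2, 0.8]   # [L(H0; x), L(H1; x)]
post = bayes_update(prior, likelihood)

# Same update in log-odds form: log-prior odds plus log-likelihood ratio
L = math.log(prior[1] / prior[0]) + math.log(likelihood[1] / likelihood[0])
post_h1_from_log_odds = 1.0 / (1.0 + math.exp(-L))

print(round(post[1], 3), round(post_h1_from_log_odds, 3))  # 0.632 0.632
```

Feeding `post` back in as the next `prior` implements the sequential setting, where each posterior becomes the prior for the following measurement.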

15.6 Multiple Instance Learning (MIL)

Goal: Classify bags of instances under weak supervision: only the bag label is observed during training, while the labels of the individual instances remain hidden.

In MIL, the $i$-th training example is a bag of $n_i$ instances,

\begin{equation} B_i = \{\bx _{i,1},\, \bx _{i,2},\, \ldots ,\, \bx _{i,n_i}\}, \end{equation}

with a single observed bag label $Y_i \in \{0,1\}$. Each instance has a latent label $y_{i,j} \in \{0,1\}$ that is not available during training. The standard MIL assumption relates the two:

\begin{equation} \label {eq-mil-assumption} Y_i = \max _{j=1,\ldots ,n_i} y_{i,j}, \end{equation}

i.e., a bag is positive if and only if at least one of its instances is positive. This setting reduces annotation cost whenever labeling every instance is impractical (e.g., a whole-slide pathology image labeled only as “tumor/no tumor,” with the individual image patches left unlabeled).

Two learning strategies are common: (i) train an instance-level classifier $f(\bx _{i,j}) = p_{i,j}$ and aggregate the per-instance probabilities into a bag probability $P_i$; or (ii) train a bag-level classifier that maps the whole set $B_i$ directly to $P_i$. Both strategies ultimately depend on an aggregation rule

\begin{equation} P_i = g(p_{i,1},\, p_{i,2},\, \ldots ,\, p_{i,n_i}), \end{equation}

which plays the same role that the fusion rules of Sec. 15.3 play for ensembles of classifiers, except the index now runs over instances within one bag rather than classifiers on one sample. MIL is therefore ensemble aggregation applied along the instance axis.

The aggregated bag prediction $P_i$ acts as the bridge between the micro-level instances and the macro-level bag. It represents the model’s overall confidence that the entire bag belongs to the positive class ($Y_i = 1$). Because the true labels of individual instances are hidden, the model must combine its per-instance predictions into this single collective score so the loss can be calculated against the only available ground truth: the bag label.

Four aggregation rules appear frequently:

\begin{equation} \begin{aligned} P_i^{\text {max}} &= \max _{j} p_{i,j}, \\[2pt] P_i^{\text {mean}} &= \tfrac {1}{n_i}\sum _{j=1}^{n_i} p_{i,j}, \\[2pt] P_i^{\text {noisy-OR}} &= 1 - \prod _{j=1}^{n_i} (1 - p_{i,j}), \\[2pt] P_i^{\text {att}} &= \sum _{j=1}^{n_i} \alpha _{i,j}\, p_{i,j}, \qquad \alpha _{i,j} = \frac {\exp (\bw ^\top \phi (\bx _{i,j}))}{\sum _{j'} \exp (\bw ^\top \phi (\bx _{i,j'}))}. \end {aligned} \end{equation}

Max-pooling is the direct probabilistic counterpart of Eq. (15.36). Mean-pooling is the instance-axis analog of soft vote combining (Sec. 15.3.3). Noisy-OR is the analog of LLR fusion (Sec. 15.3.5) under the assumption that instance predictions are conditionally independent. Attention pooling learns the mixing weights $\alpha _{i,j}$ from the data and plays the role of a learned combiner (Sec. 15.3.7).
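
Attention pooling is the least obvious of the four rules, so a small sketch may help. It assumes the scores $\bw ^\top \phi (\bx _{i,j})$ have already been computed and are passed in as plain numbers; the function names and the sample values are hypothetical.

```python
import math

def attention_weights(scores):
    """Softmax over per-instance scores, shifted by the max for stability."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [v / z for v in e]

def attention_pool(probs, scores):
    """Return (P_att, alpha): the weighted bag probability and the weights."""
    alpha = attention_weights(scores)
    return sum(a * p for a, p in zip(alpha, probs)), alpha

probs = [0.1, 0.9, 0.2]    # hypothetical instance probabilities p_ij
scores = [0.0, 2.0, 0.0]   # hypothetical scores w^T phi(x_ij)
P_att, alpha = attention_pool(probs, scores)

# With equal scores the weights are uniform and the rule reduces to mean-pooling
P_uniform, _ = attention_pool(probs, [0.0, 0.0, 0.0])
```

Because the weights are non-negative and sum to one, the pooled value always lies between the smallest and largest instance probability. Very peaked scores push the result toward the probability of a single instance.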

Example 15.5 (The Security Guard Analogy): Imagine a security guard tasked with determining if a 5-room building (the bag) is dangerous ($Y_i = 1$). The guard sweeps each room (the instances) and assigns a threat probability $p_{i,j}$ to each:
• Room 1: $1\%$ ($0.01$)
• Room 2: $2\%$ ($0.02$)
• Room 3: $95\%$ ($0.95$) – Weapon found
• Room 4: $0\%$ ($0.00$)
• Room 5: $2\%$ ($0.02$)
How the guard reports the overall threat level of the building ($P_i$) depends entirely on the aggregation rule:
• Max-pooling ($P_i = 0.95$): The guard reports the building is $95\%$ dangerous. The building is only as safe as its most dangerous room. This perfectly aligns with the standard MIL assumption.
• Mean-pooling ($P_i = 0.20$): The guard averages the threat levels and reports the building is $20\%$ dangerous. Because there are four relatively safe rooms, the severe threat in Room 3 gets dangerously diluted.
• Noisy-OR ($P_i = 1 - (0.99)(0.98)(0.05)(1.00)(0.98) \approx 0.952$): The guard calculates the probability that at least one room is dangerous, assuming independent events. The combined probability remains very high.
Under the standard MIL assumption, a single strongly positive instance should determine the bag label, so max-pooling and noisy-OR are consistent aggregators. Mean-pooling mirrors Soft Vote and tends to dangerously under-weight rare but decisive positive instances.
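
The three bag probabilities of this example follow directly from the aggregation formulas. A minimal sketch (function names are illustrative; `math.prod` requires Python 3.8 or later):

```python
import math

def max_pool(p):
    return max(p)

def mean_pool(p):
    return sum(p) / len(p)

def noisy_or(p):
    """Probability that at least one instance is positive, assuming independence."""
    return 1.0 - math.prod(1.0 - q for q in p)

rooms = [0.01, 0.02, 0.95, 0.00, 0.02]   # threat probabilities from Example 15.5

print(max_pool(rooms), round(mean_pool(rooms), 2), round(noisy_or(rooms), 3))
# 0.95 0.2 0.952
```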

Because individual instance labels $y_{i,j}$ are completely absent during training, a standard per-instance loss cannot be computed directly. Instead, training an instance-level classifier relies on one of three main paradigms:

End-to-End Backpropagation

The instance-level classifier (parameterized by $\bw $) and the aggregation function $g$ are optimized simultaneously. Because instance-level labels are unknown, the loss is evaluated strictly at the bag level. The aggregated bag prediction $P_i$ is compared to the true bag label $Y_i$ using standard Binary Cross-Entropy (Sec. 8.4):

\begin{equation} \loss = - \left [ Y_i \log P_i + (1 - Y_i) \log (1 - P_i) \right ] \end{equation}

During backpropagation, the error signal must pass through the aggregation function $g$ to reach the instance-level predictions $p_{i,j}$. By the chain rule, $g$ acts as a structural bottleneck that dictates exactly how the bag-level gradient is distributed among the individual instances. The choice of aggregation directly determines this gradient flow:

• Max-pooling: The max operation acts as a hard switch. The gradient routes entirely to the single instance that yielded the maximum probability. The remaining $n_i - 1$ instances in the bag receive a gradient of zero, so they contribute nothing to the parameter update for this bag.
• Mean-pooling: Each instance receives the bag-level gradient scaled by $1/n_i$. Every instance is weighted equally, which means noisy or non-informative instances receive exactly the same learning signal as the truly predictive ones.
• Attention-pooling: The gradient is scaled proportionally by the learned attention weights $\alpha _{i,j}$. The model dynamically routes stronger learning signals to the instances it deems most relevant (high $\alpha $), while suppressing the updates for instances treated as background noise (low $\alpha $).
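
The gradient routing described in these three cases can be written out with the chain rule, $\partial \loss / \partial p_{i,j} = (\partial \loss / \partial P_i)\,(\partial P_i / \partial p_{i,j})$. The sketch below is illustrative only: function names and numbers are hypothetical, and for attention pooling the weights $\alpha _{i,j}$ are treated as constants, so the additional gradient path through the attention scores is deliberately omitted.

```python
def grad_max(p):
    """dP/dp_j for max-pooling: 1 for the arg-max instance, 0 elsewhere."""
    j_star = max(range(len(p)), key=lambda j: p[j])
    return [1.0 if j == j_star else 0.0 for j in range(len(p))]

def grad_mean(p):
    """dP/dp_j for mean-pooling: 1/n for every instance."""
    return [1.0 / len(p)] * len(p)

def grad_attention(alpha):
    """dP/dp_j for attention pooling with alpha held fixed (simplification)."""
    return list(alpha)

def bce_grad_wrt_bag(P, Y):
    """Derivative of -[Y log P + (1 - Y) log(1 - P)] with respect to P."""
    return -(Y / P) + (1.0 - Y) / (1.0 - P)

def instance_grads(Y, P, dP_dp):
    """Chain rule: bag-level gradient times the aggregator's local derivative."""
    g = bce_grad_wrt_bag(P, Y)
    return [g * d for d in dP_dp]

p = [0.2, 0.7, 0.4]        # hypothetical instance probabilities
alpha = [0.1, 0.8, 0.1]    # hypothetical attention weights
Y = 1                      # positive bag

g_max = instance_grads(Y, max(p), grad_max(p))
g_mean = instance_grads(Y, sum(p) / len(p), grad_mean(p))
P_att = sum(a * q for a, q in zip(alpha, p))
g_att = instance_grads(Y, P_att, grad_attention(alpha))
```

In `g_max` only the second entry is non-zero, in `g_mean` all entries are identical, and in `g_att` the entries are proportional to `alpha`.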

Label Inheritance

A heuristic approach where every instance inside a bag is simply assigned the bag’s label ($y_{i,j} = Y_i$). A standard binary classifier is then trained on the instances directly. This introduces significant noise into positive bags (since a positive bag may contain many negative instances), but it is computationally cheap and often used as a pre-training step to initialize weights.

Iterative Pseudo-Labeling

An alternating expectation-maximization (EM) style approach to refine instance predictions:

1. Train an initial classifier (often using Label Inheritance).
2. Use this classifier to score all instances in the training bags. Assign hard “pseudo-labels” based on these scores (e.g., in a positive bag, label the top-scoring instance as $1$ and the rest as $0$).
3. Retrain the classifier using these new pseudo-labels, and repeat the process until convergence.
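
The three steps can be sketched on a toy problem. Everything here is an assumption made for illustration: instances are scalar features, the classifier is a small gradient-descent logistic regression written from scratch, the pseudo-labeling rule is the "top-scoring instance only" variant mentioned in step 2, and convergence is taken to mean that the pseudo-labels stop changing.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, lr=0.5, epochs=500):
    """Gradient-descent logistic regression on scalar features (illustrative)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def pseudo_label_mil(bags, bag_labels, rounds=5):
    # Step 1: label inheritance provides the initial instance labels.
    labels = [[Y] * len(bag) for bag, Y in zip(bags, bag_labels)]
    for _ in range(rounds):
        xs = [x for bag in bags for x in bag]
        ys = [y for lab in labels for y in lab]
        w, b = fit_logistic_1d(xs, ys)
        # Step 2: in each positive bag, only the top-scoring instance keeps label 1.
        new_labels = []
        for bag, Y in zip(bags, bag_labels):
            if Y == 1:
                scores = [sigmoid(w * x + b) for x in bag]
                top = max(range(len(bag)), key=lambda j: scores[j])
                new_labels.append([1 if j == top else 0 for j in range(len(bag))])
            else:
                new_labels.append([0] * len(bag))
        # Step 3: stop once the pseudo-labels no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return (w, b), labels

# Toy data (hypothetical): positive instances have large feature values.
bags = [[0.1, 0.3, 2.5], [0.2, 0.0, 0.4], [2.8, 0.5], [0.3, 0.1, 0.2]]
bag_labels = [1, 0, 1, 0]
(w, b), labels = pseudo_label_mil(bags, bag_labels)
```

On this toy data the loop settles with exactly one instance marked positive in each positive bag. Other labeling rules (for example, thresholding the scores) fit the same loop by changing step 2.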

Iterative pseudo-labeling is vulnerable to a confirmation-bias failure mode: the classifier’s own predictions become the next round’s training targets, so the loop can converge to a self-consistent solution with no genuine predictive power while still reporting high training and cross-validation accuracy. This is the same spurious-signal pathology that the sanity checks of Chapter 13 are designed to detect. After pseudo-labeling converges, the final model should be validated with:

• A permutation test on the bag labels (Sec. 13.3.1): shuffle only the outer $Y_i$, re-run the full pseudo-labeling loop, and verify that the bag-level accuracy collapses to the chance level $J_{\text {chance}}$ (Eq. 13.2).
• A random feature baseline (Sec. 13.3.4): replace the instance features $\bx _{i,j}$ with noise of the same dimensions and confirm that the pseudo-labeling loop no longer produces above-chance bag accuracy.

A pseudo-labeling pipeline that passes both checks provides meaningful evidence that the recovered instance labels reflect genuine signal rather than a self-reinforced artifact.
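
A permutation test on the bag labels can be organized as in the sketch below. It is illustrative rather than the procedure of Chapter 13: `fit_and_score` is a hypothetical callable that must re-run the entire pipeline (including pseudo-labeling) and return a bag-level accuracy, ideally a cross-validated one, and the $+1$ correction in the p-value is one common convention assumed here. The toy scorer at the bottom involves no training and only serves to exercise the function.

```python
import random

def permutation_test(bags, bag_labels, fit_and_score, n_perm=200, seed=0):
    """Compare the observed score with scores obtained on shuffled bag labels.

    fit_and_score(bags, labels) -> float is a hypothetical callable that
    retrains the full pipeline and returns bag-level accuracy.
    """
    rng = random.Random(seed)
    observed = fit_and_score(bags, bag_labels)
    null_scores = []
    for _ in range(n_perm):
        shuffled = list(bag_labels)
        rng.shuffle(shuffled)                 # shuffle only the outer bag labels
        null_scores.append(fit_and_score(bags, shuffled))
    # Share of permutations that score at least as well as the real labels
    p_value = (1 + sum(s >= observed for s in null_scores)) / (1 + n_perm)
    return observed, null_scores, p_value

def toy_fit_and_score(bags, labels):
    """Stand-in pipeline: predict positive if any instance exceeds 1.0."""
    preds = [1 if max(bag) > 1.0 else 0 for bag in bags]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

# Hypothetical bags of scalar instances with alternating bag labels
bags = [[0.1, 2.0], [0.2, 0.3], [1.5], [0.0, 0.4], [3.0, 0.1],
        [0.2], [2.2, 2.4], [0.5, 0.1], [1.9], [0.3, 0.6]]
bag_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

observed, null_scores, p_value = permutation_test(bags, bag_labels, toy_fit_and_score)
```

A pipeline with genuine signal should show `observed` well above the bulk of `null_scores`, whose mean should sit near the chance level. The random feature baseline follows the same pattern with the bags replaced by noise instead of shuffling the labels.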
