Machine Learning & Signals Learning
8 Logistic Regression
For binary classification, each entry of vector \(\by \) is \(y_j\in \{0,1\}\).
8.1 Generalized Linear Classification Models
Generalized linear model: A generalized linear model applies a non-linear function \(g(\cdot ):\real \rightarrow \real \) to the linear combination \(\bw ^T\bx _i\):
\(\seteqnumber{0}{}{0}\)\begin{equation} \label {eq-general-linear-model} \hat {y}_i = g(\bw ^T\bx _i) \end{equation}
The choice of \(g(\cdot )\) determines the model type. The decision boundary \(\{\bx : \bw ^T\bx = thr\}\) is linear (a hyperplane), regardless of the choice of \(g(\cdot )\) (Fig. 8.1).
Two important questions arise:
• Selection of a loss function.
• Minimization of the selected loss function.
8.2 Basic Linear Model
Linear model: The basic linear model uses
\(\seteqnumber{0}{}{1}\)\begin{equation} \label {eq-basic-linear-model} g(x) = x \end{equation}
and MSE loss. Its output is unbounded and continuous.
Linear regression can be applied to classification by thresholding the output, as follows:
1. Compute regression weights \(\bw \) according to data \(\bX \) and the binary vector \(\by \) (e.g., by (3.5)).
2. Compute the regression output according to (8.1) with (8.2) substituted. Then apply the threshold \(0.5\) to obtain class labels:
\(\seteqnumber{0}{}{2}\)\begin{equation} \hat {y}_i = \begin{cases} 1 & \bw ^T\bx _i > 0.5\\ 0 & \bw ^T\bx _i \leqslant 0.5 \end {cases} \end{equation}
Example 8.1: Consider \(M=20\) samples with a single feature \(x_1\in [10,29]\): the first ten labeled \(y=1\) and the last ten \(y=0\). The design matrix \(\bX =[\bOne _M\;\;\bx ]\in \real ^{M\times 2}\) includes a column of ones for the intercept (Sec. 3.1). Fitting \(\bw \) by least squares and thresholding \(\hat {y}=\bw ^T\bx \lessgtr 0.5\) classifies all points correctly (Fig. 8.2(a)).
Adding ten class-0 outliers at \(x_1\in [60,69]\) pulls the regression line toward them, shifting the decision boundary and causing misclassification near the original boundary (Fig. 8.2(b)).
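The behavior described in Example 8.1 can be reproduced numerically. The sketch below (assuming NumPy; the helper name `lstsq_classify` is illustrative) fits the least-squares line, thresholds it at \(0.5\), and shows how the outliers shift the boundary:

```python
import numpy as np

def lstsq_classify(x, y):
    """Fit y ~ w0 + w1*x by least squares, then threshold the output at 0.5."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix [1  x]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form LS solution
    y_hat = (X @ w > 0.5).astype(float)         # threshold to class labels
    return y_hat, w

# Example 8.1: x1 in [10, 29]; first ten samples labeled 1, last ten labeled 0
x = np.arange(10, 30, dtype=float)
y = np.array([1.0] * 10 + [0.0] * 10)
y_hat, _ = lstsq_classify(x, y)
print("clean accuracy:", np.mean(y_hat == y))           # 1.0: all correct

# Ten class-0 outliers at x1 in [60, 69] pull the line toward them
x_out = np.concatenate([x, np.arange(60, 70, dtype=float)])
y_out = np.concatenate([y, np.zeros(10)])
y_hat_out, _ = lstsq_classify(x_out, y_out)
print("with outliers:", np.mean(y_hat_out == y_out))    # < 1.0: boundary shifted
```

The misclassifications occur for class-0 points just above the original boundary, exactly as in Fig. 8.2(b).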
The basic linear model has two key limitations:
• Unbounded output: \(\hat {y}\) can be significantly larger than 1 or smaller than 0, with no probabilistic interpretation.
• Outlier sensitivity: distant points disproportionately influence the regression line, shifting the decision boundary (Fig. 8.2(b)).
These limitations motivate the logistic model.
8.3 Logistic Model
The logistic model addresses the limitations of the basic linear model by applying a sigmoid function (Fig. 8.3):
\(\seteqnumber{0}{}{3}\)\begin{equation} \sigma (x) = \frac {\exp (x)}{1+\exp (x)} = \frac {1}{1+\exp (-x)} \end{equation}
Because \(\sigma (x)\in [0,1]\), the output is bounded and can be interpreted as a probability. The saturating tails reduce the influence of distant points, improving robustness to outliers.
Logistic regression model: Substituting \(g(x) = \sigma (x)\) into (8.1) gives the logistic model. The decision rule with threshold \(thr\) is
\(\seteqnumber{0}{}{4}\)\begin{equation} \hat {y} = \begin{cases} 1 & \sigma (\bw ^T\bx ) > thr\\ 0 & \sigma (\bw ^T\bx ) \le thr \end {cases} \end{equation}
With the default \(thr=0.5\), the rule simplifies to
\(\seteqnumber{0}{}{5}\)\begin{equation} \hat {y} = \begin{cases} 1 & \bw ^T\bx > 0\\ 0 & \bw ^T\bx \le 0 \end {cases} \end{equation}
since \(\sigma (0)=0.5\).
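The equivalence of the two decision rules can be checked numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 13)
# sigma(0) = 0.5, so thresholding sigma(z) at 0.5 equals thresholding z at 0
assert sigmoid(0.0) == 0.5
assert np.array_equal(sigmoid(z) > 0.5, z > 0)
print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # output stays bounded in (0, 1)
```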
Example 8.2: Using the same 1D dataset from Sec. 8.2, the logistic model correctly separates the two classes (Fig. 8.4(a)). Unlike linear regression, adding ten class-0 outliers at \(x_1\in [60,69]\) does not significantly shift the decision boundary (Fig. 8.4(b)), because the sigmoid saturates for large \(|\bw ^T\bx |\).
Why not MSE? Applying MSE loss to the logistic model,
\(\seteqnumber{0}{}{6}\)\begin{equation} \loss (\cdot ) = \frac {1}{2M}\norm {\hat {\by } - \by }^2 \end{equation}
yields a non-convex optimization problem with multiple local minima and no closed-form solution. This motivates the cross-entropy loss in the following section (Sec. 8.4).
A loss function is not necessarily a metric.
8.4 Cross-Entropy Loss
The logistic model output \(\hat {y}=\sigma (\bw ^T\bx )\in [0,1]\) can be interpreted as the probability \(\Pr (y=1\mid \bx ,\bw )\). The loss function should therefore measure how far the predicted distribution is from the target distribution. Cross-entropy, rooted in information theory, provides exactly such a measure.
8.4.1 Entropy
Entropy: For a discrete distribution \(P = \left \{p_i = \Pr [X=x_i]\right \}\), the entropy is
\(\seteqnumber{0}{}{7}\)\begin{equation} H(P) = - \sum _i p_i\log (p_i) \end{equation}
Entropy measures the uncertainty of a distribution \(P\). It is maximized when all outcomes are equally likely (\(p_i=p_j\;\forall \, i,j\)) and decreases as the distribution becomes more concentrated.
Coding interpretation: With base-2 logarithm, entropy gives the theoretical minimum average number of bits needed to encode outcomes drawn from \(P\).
Example 8.3: For a binary distribution:
\(\seteqnumber{0}{}{8}\)\begin{align*} p_1 = p_2 = \dfrac {1}{2} &\Rightarrow H(P) = -2\cdot \tfrac {1}{2}\log _2\!\left (\tfrac {1}{2}\right ) =1 \text { bit}\\ p_1 = \dfrac {1}{10},\; p_2 = \dfrac {9}{10} &\Rightarrow H(P) = -\tfrac {1}{10}\log _2\!\left (\tfrac {1}{10}\right ) - \tfrac {9}{10}\log _2\!\left (\tfrac {9}{10}\right ) \approx 0.469 \text { bits} \end{align*} The more concentrated distribution has lower entropy and theoretically requires fewer bits on average.
Example 8.4: Consider encoding a four-letter alphabet. Equal probabilities. If all four letters are equally likely (\(p_i = 0.25\)), an optimal code is \(\{00,01,10,11\}\), i.e., two bits per letter. The entropy confirms this: \(H(P)=-4\cdot \frac {1}{4}\log _2\!\left (\frac {1}{4}\right )=2\) bits.
Unequal probabilities. With the distribution in Table 8.1, shorter codewords are assigned to more probable letters. The average code length is
\[ \sum _{i}\mathrm {length}_i\cdot p_i = 1\cdot 0.7 + 2\cdot 0.26 + 3\cdot 0.02 + 3\cdot 0.02 = 1.34 \text { bits} \]
while the entropy gives the theoretical minimum:
\(\seteqnumber{0}{}{8}\)\begin{equation*} H(P) = - 0.7\log _2(0.7) - 0.26\log _2(0.26) - 2\cdot 0.02\log _2(0.02)\approx 1.091 \text { bits} \end{equation*}
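The entropy values in the examples above can be verified numerically. A sketch assuming NumPy (the `entropy` helper is illustrative; it applies the \(0\log 0 = 0\) convention by skipping zero-probability outcomes):

```python
import numpy as np

def entropy(p, base=2):
    """H(P) = -sum p_i log(p_i), with 0 * log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                                       # skip zero-probability outcomes
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

print(entropy([0.5, 0.5]))                   # 1.0 bit: maximal binary uncertainty
print(entropy([0.1, 0.9]))                   # ~0.469 bits: more concentrated
print(entropy([0.25, 0.25, 0.25, 0.25]))     # 2.0 bits: four equally likely letters
print(entropy([0.7, 0.26, 0.02, 0.02]))      # ~1.091 bits: theoretical minimum

# Average length of the unequal-probability code exceeds the entropy bound:
avg_len = 1 * 0.7 + 2 * 0.26 + 3 * 0.02 + 3 * 0.02
print(avg_len)                               # 1.34 bits > 1.091 bits
```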
8.4.2 Cross-Entropy
Cross-entropy: For two discrete distributions \(p\) and \(q\) over the same outcomes, the cross-entropy is
\(\seteqnumber{0}{}{8}\)\begin{equation} H(p,q) = - \sum _i p_i\log (q_i) \end{equation}
Cross-entropy satisfies \(H(p,q) \ge H(p)\), with equality if and only if \(p=q\).
Coding interpretation: With base-2 logarithm, \(H(p,q)\) is the average number of bits needed to encode outcomes drawn from \(p\) using a code optimized for \(q\).
Example 8.5: Let \(q_i=\left \{\frac {1}{4},\frac {1}{4},\frac {1}{4},\frac {1}{4}\right \}\) (optimal code: two bits per letter) and \(p_i= \left \{\frac {1}{2},\frac {1}{2},0,0\right \}\). Then \(H(p) = 1\) bit, but \(H(p,q)=2\) bits: using a code designed for \(q\) wastes one bit per symbol on average when the true distribution is \(p\).
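Example 8.5 can be checked with a small sketch (assuming NumPy; `cross_entropy` is an illustrative helper that skips terms with \(p_i=0\)):

```python
import numpy as np

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum p_i log(q_i); terms with p_i = 0 contribute nothing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz])) / np.log(base)

p = [0.5, 0.5, 0.0, 0.0]      # true distribution
q = [0.25, 0.25, 0.25, 0.25]  # distribution the code was designed for
print(cross_entropy(p, p))    # 1.0 bit  = H(p): code matched to p
print(cross_entropy(p, q))    # 2.0 bits = H(p, q): one wasted bit per symbol
```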
The convention \(\lim \limits _{x\to 0} x\log (x) = 0\) is used so that zero-probability events contribute nothing. For loss functions, the natural logarithm (\(\ln \)) is used. Minimizing cross-entropy with respect to model parameters \(\bw \) is equivalent to maximum likelihood estimation (MLE).
8.4.3 Binary Cross-Entropy (BCE) Loss
For a binary outcome, the cross-entropy reduces to
\(\seteqnumber{0}{}{9}\)\begin{equation} \label {eq-bce-univariate} H(p,q) = -p_0\log (q_0) - p_1\log (q_1) \end{equation}
which is visualized in Fig. 8.5. When \(y=1\) (i.e., \(p_0=0,\, p_1=1\)), the expression reduces to \(H(p,q)=-\log (q_1)\), which penalizes small predicted probabilities \(q_1\).
For a single sample with true label \(y\in \{0,1\}\) and predicted probability \(\hat {y}=f_\bth (\bx )\in (0,1)\), the target and predicted distributions are
\(\seteqnumber{0}{}{10}\)\begin{align*} p_0 = 1-y,\quad p_1 &= y\\ q_0 = 1-f_\bth (\bx ),\quad q_1 &= f_\bth (\bx ) \end{align*} Substituting into Eq. (8.10):
\(\seteqnumber{0}{}{10}\)\begin{equation*} H(p,q) =-(1-y)\log \!\left (1- f_\bth (\bx )\right ) -y\log \!\left (f_\bth (\bx )\right ) \end{equation*}
BCE loss: Binary cross-entropy (BCE) loss for a single sample:
\(\seteqnumber{0}{}{10}\)\begin{equation} \loss (y,\hat {y}) = -(1-y)\log (1-\hat {y}) - y\log (\hat {y}) \end{equation}
For \(M\) samples, the loss is averaged over all elements:
\(\seteqnumber{0}{}{11}\)\begin{equation} \begin{aligned} \loss &=-\frac {1}{M}\sum _{j=1}^M \left [(1-y_j)\log (1-\hat {y}_j) + y_j\log (\hat {y}_j)\right ]\\ &=-\frac {1}{M}\left [(1-\by )^T\log (1-\hat {\by }) + \by ^T\log (\hat {\by })\right ] \end {aligned} \end{equation}
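The averaged BCE loss (8.12) translates directly into vectorized code. A sketch assuming NumPy; the `eps` clipping is a practical safeguard against \(\log (0)\), not part of the definition:

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy over M samples; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean((1 - y) * np.log(1 - y_hat) + y * np.log(y_hat))

y = np.array([1.0, 0.0, 1.0, 0.0])
print(bce_loss(y, np.array([0.9, 0.1, 0.8, 0.2])))   # ~0.164: confident and correct
print(bce_loss(y, np.array([0.5, 0.5, 0.5, 0.5])))   # -ln(0.5) ~ 0.693: uninformative
print(bce_loss(y, np.array([0.1, 0.9, 0.2, 0.8])))   # ~1.956: confident but wrong
```

Note the asymmetry: confidently wrong predictions are penalized far more heavily than uninformative ones.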
Properties: The BCE loss:
• Is continuous, differentiable, and convex, and thus suitable for gradient-based optimization.
• Has a unique global minimum.
• Provides probabilistic predictions:
\(\seteqnumber{0}{}{12}\)\begin{equation} \label {eq-lr-decision} \begin{aligned} \Pr (y=1|\bx ,\bw ) &= \sigma (\bw ^T\bx )\\ \Pr (y=0|\bx ,\bw ) &= 1-\sigma (\bw ^T\bx ) \end {aligned} \end{equation}
• Yields a classification decision via thresholding, e.g., \(\hat {y}\lessgtr \frac {1}{2}\).
8.5 BCE Loss for Logistic Regression
Probabilistic interpretation
The predicted probability \(\hat {y}=\sigma (\bw ^T\bx )\) partitions the feature space into regions of varying confidence. Fig. 8.6 illustrates four regions: high-confidence positive (\(\hat {y}>0.999\)), moderate positive (\(0.8\ge \hat {y}\ge 0.5\)), moderate negative (\(0.5>\hat {y}\ge 0.2\)), and high-confidence negative (\(0.2>\hat {y}\)). Points near the decision boundary have predictions closer to \(0.5\), reflecting greater classification uncertainty.
The probabilistic output provides confidence levels but does not eliminate classification errors.
Loss minimization
Substituting \(\hat {\by }=\sigma (\bX \bw )\) into the BCE loss gives the logistic regression objective in vector form:
\(\seteqnumber{0}{}{13}\)\begin{align} \loss = \frac {1}{M}\left [-\by ^T\log (\sigma (\bX \bw )) - (1-\by )^T\log (1-\sigma (\bX \bw ))\right ] \end{align} Setting \(\nabla _\bw \loss =\mathbf {0}\) yields no closed-form solution for \(\bw \). However, the gradient has a compact form:
\(\seteqnumber{0}{}{14}\)\begin{equation} \nabla _\bw \loss (\bw ) = \frac {1}{M}\bX ^T\left (\sigma \left (\bX \bw \right )-\by \right ) \end{equation}
which enables gradient descent using only matrix–vector operations:
\(\seteqnumber{0}{}{15}\)\begin{equation} \bw _{n+1} = \bw _{n} - \alpha \nabla _\bw \mathcal {L}(\bw ) \end{equation}
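The gradient (8.15) and update rule (8.16) can be sketched as follows (assuming NumPy; the dataset and the helper name `fit_logistic` are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, n_iter=2000):
    """Gradient descent on the BCE loss: w <- w - alpha * X^T(sigma(Xw) - y) / M."""
    M = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / M   # compact gradient, Eq. (8.15)
        w -= alpha * grad                       # update step, Eq. (8.16)
    return w

# Toy separable 1D dataset with an intercept column (values chosen for illustration)
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

w = fit_logistic(X, y)
y_hat = (X @ w > 0).astype(float)        # decision rule: w^T x > 0 <=> sigma > 0.5
print("accuracy:", np.mean(y_hat == y))  # 1.0: the classes are linearly separable
```

Only matrix-vector products appear in the loop, matching the text; no matrix inversion is required.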
Decision boundary
From Eq. (8.13), the classification rule is
\(\seteqnumber{0}{}{16}\)\begin{equation} \hat {y} = \begin{cases} 1 & \bw ^T\bx \ge 0\\ 0 & \bw ^T\bx < 0\\ \end {cases} \end{equation}
The decision boundary is the set \(\{\bx : \bw ^T\bx = 0\}\), i.e., \(w_0+w_1x_1 +w_2x_2 + \cdots =0\). Geometrically, \(\bw ^T\bx =\norm {\bx }\norm {\bw }\cos (\theta )\), so the boundary consists of all points \(\bx \) perpendicular to \(\bw \) (\(\theta =90^\circ \)), forming a hyperplane with normal vector \(\bw \).
Regularization and feature mapping
• Regularization. \(L_2\) regularization can be applied, similar to Eq. (5.13):
\(\seteqnumber{0}{}{17}\)\begin{equation} \mathcal {L}_{\mathrm {reg}} = \loss (\by ,\hat {\by }) + \frac {\lambda }{2M}\sum _{i=1}^N w_i^2 \end{equation}
• Feature mapping. Mapping functions or kernels extend the model to non-linear decision boundaries.
Example 8.6: A polynomial mapping
\[ \varphi (x_1,x_2) = \langle 1,x_1,x_1^2,x_2,x_2^2,x_1x_2,x_1^2x_2,x_1x_2^2,x_1^2x_2^2 \rangle \]
replaces \(\bx \) with \(\varphi (\bx )\), yielding a non-linear boundary in the original input space while remaining linear in the feature space (Fig. 8.7).
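As a sketch (assuming NumPy), the mapping of Example 8.6 together with a hand-picked weight vector (an assumption for illustration, not a fitted model) shows how a model linear in \(\varphi (\bx )\) yields a circular boundary in the original space:

```python
import numpy as np

def poly_map(x1, x2):
    """phi(x1,x2) = <1, x1, x1^2, x2, x2^2, x1*x2, x1^2*x2, x1*x2^2, x1^2*x2^2>."""
    return np.array([1.0, x1, x1**2, x2, x2**2,
                     x1 * x2, x1**2 * x2, x1 * x2**2, x1**2 * x2**2])

print(poly_map(1.0, 2.0))   # [1. 1. 1. 2. 4. 2. 2. 4. 4.]

# Hypothetical weights: w^T phi(x) = x1^2 + x2^2 - 1, the unit circle as boundary
w = np.array([-1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
inside = w @ poly_map(0.5, 0.5)   # negative -> class 0 (inside the circle)
outside = w @ poly_map(1.0, 1.0)  # positive -> class 1 (outside the circle)
print(inside, outside)
```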
8.6 Odds and Logit
If \(\Pr (y=1)=p\), then the odds (the probability of the event divided by the probability of no event) are defined by the ratio
\(\seteqnumber{0}{}{18}\)\begin{equation} \text {odds} = \frac {\Pr (y=1)}{\Pr (y=0)} = \frac {\Pr (y=1)}{1-\Pr (y=1)} = \frac {p}{1-p} \end{equation}
The odds range over \([0, \infty )\).
Taking the natural logarithm gives the logit function (or log-odds):
\(\seteqnumber{0}{}{19}\)\begin{equation} \text {logit}(p) = \ln \left (\frac {p}{1-p}\right ) \end{equation}
The logit maps probabilities from the range \((0, 1)\) to the entire real line \((-\infty , \infty )\).
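A quick numerical check (assuming NumPy) that the logit maps \((0,1)\) to the real line and inverts the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: ln(p / (1 - p)), mapping (0, 1) onto the whole real line."""
    return np.log(p / (1 - p))

print(logit(np.array([0.1, 0.5, 0.9])))        # symmetric about 0: [-2.197 0. 2.197]
for z in [-3.0, 0.0, 4.2]:
    assert abs(logit(sigmoid(z)) - z) < 1e-9   # logit is the inverse of sigmoid
```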
Logistic Regression: As already mentioned in (8.13),
\(\seteqnumber{0}{}{20}\)\begin{equation} \Pr (y=1) = \sigma (\bw ^T\bx ) = \frac {1}{1+\exp (-\bw ^T\bx )} = p \end{equation}
By reformulation,
\(\seteqnumber{0}{}{21}\)\begin{align} \frac {p}{1-p} &= \frac {\Pr (y=1)}{\Pr (y=0)} = \exp (\bw ^T\bx )\\[3pt] \ln \left (\frac {p}{1-p}\right ) & = \bw ^T\bx \end{align}
To illustrate the influence of a change in a single feature, let \(\tilde {x}_i = x_i + \Delta x\):
\(\seteqnumber{0}{}{23}\)\begin{equation} \begin{aligned} \frac {\text {odds}_{\Delta x}}{\text {odds}} &= \frac {\exp (w_0+w_1x_1 + \cdots + w_i(x_i + \Delta x) + \cdots + w_Nx_N)}{\exp (w_0+w_1x_1 + \cdots + w_ix_i + \cdots + w_Nx_N)}\\[3pt] &= \exp (w_i(x_i + \Delta x) - w_ix_i) = \exp (w_i\Delta x) \end {aligned} \end{equation}
For example, with \(w_i = 0.7\) and \(\Delta x = 1\) (illustrative values), the odds are multiplied by \(\exp (0.7)\approx 2.01\): a unit increase in \(x_i\) roughly doubles the odds of \(y=1\).
8.7 Summary
• Logistic regression applies the sigmoid function to a linear model, producing a bounded output with a probabilistic interpretation.
• BCE loss is convex and differentiable, and minimizing it is equivalent to maximum likelihood estimation.
• The decision boundary is a hyperplane in input space; non-linear boundaries require feature mapping.
• Optimization is performed via gradient descent (there is no closed-form solution for \(\bw \)).
• Regularization and polynomial/kernel mappings extend naturally from linear regression.
• Complete separation failure: if a feature perfectly separates the two classes, the weight for that feature does not converge, because the optimal weight would be infinite.