Machine Learning & Signals Learning


8 Logistic Regression

  • Goal: Binary (two-class) classification with linear decision boundary.

For binary classification, each entry of vector \(\by \) is \(y_j\in \{0,1\}\).

8.1 Generalized Linear Classification Models

  • Goal: Extend linear models to classification by applying a non-linear activation function.

Generalized linear model: A generalized linear model applies a non-linear function \(g(\cdot ):\real \rightarrow \real \) to the linear combination \(\bw ^T\bx _i\):

\begin{equation} \label {eq-general-linear-model} \hat {y}_i = g(\bw ^T\bx _i) \end{equation}

The choice of \(g(\cdot )\) determines the model type. The decision boundary \(\{\bx : \bw ^T\bx \lessgtr thr\}\) is linear (hyperplane), regardless of the choice of \(g(\cdot )\) (Fig. 8.1).

Two important questions:

  • Selection of a loss function.

  • Minimization of a selected loss function.

(image)

Figure 8.1: An example of linear classification boundary.

8.2 Basic Linear Model

  • Goal: Use linear regression for classification as a baseline approach.

The basic linear model uses the identity activation

\begin{equation} \label {eq-basic-linear-model} g(x) = x \end{equation}

together with the MSE loss. Its output is continuous and unbounded.

Linear regression can be applied to classification by thresholding the output, as follows:

  • 1. Compute regression weights \(\bw \) according to data \(\bX \) and the binary vector \(\by \) (e.g., by (3.5)).

  • 2. Compute regression output according to (8.1) with (8.2) substituted. Then, apply threshold \(0.5\) to obtain class labels:

    \begin{equation} \hat {y}_i = \begin{cases} 1 & \bw ^T\bx _i > 0.5\\ 0 & \bw ^T\bx _i \leqslant 0.5 \end {cases} \end{equation}

  • Example 8.1: Consider \(M=20\) samples with a single feature \(x_1\in [10,29]\): the first ten labeled \(y=1\) and the last ten \(y=0\). The design matrix \(\bX =[\bOne _M\;\;\bx ]\in \real ^{M\times 2}\) includes a column of ones for the intercept (Sec. 3.1). Fitting \(\bw \) by least squares and thresholding \(\hat {y}=\bw ^T\bx \lessgtr 0.5\) classifies all points correctly (Fig. 8.2(a)).

    Adding ten class-0 outliers at \(x_1\in [60,69]\) pulls the regression line toward them, shifting the decision boundary and causing misclassification near the original boundary (Fig. 8.2(b)).
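The two steps above can be sketched with plain least squares. The sample values below are assumptions consistent with the ranges stated in Example 8.1:

```python
import numpy as np

# 1D data in the spirit of Example 8.1 (exact values are assumed):
# class 1 at x_1 = 10..19, class 0 at x_1 = 20..29.
x = np.arange(10, 30, dtype=float)
y = np.concatenate([np.ones(10), np.zeros(10)])
X = np.column_stack([np.ones_like(x), x])      # design matrix [1  x]

# Step 1: least-squares weights, w = argmin ||Xw - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2: threshold the regression output at 0.5.
y_hat = (X @ w > 0.5).astype(int)
print("errors without outliers:", int(np.sum(y_hat != y)))   # 0

# Ten class-0 outliers at x_1 = 60..69 pull the line toward them,
# shifting the boundary and misclassifying points near x_1 = 20.
x2 = np.concatenate([x, np.arange(60, 70, dtype=float)])
y2 = np.concatenate([y, np.zeros(10)])
X2 = np.column_stack([np.ones_like(x2), x2])
w2, *_ = np.linalg.lstsq(X2, y2, rcond=None)
y2_hat = (X2 @ w2 > 0.5).astype(int)
print("errors with outliers:", int(np.sum(y2_hat != y2)))
```

Running the sketch shows zero errors on the clean data and several misclassified class-0 points once the outliers are added, matching Fig. 8.2.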

(image)

Figure 8.2: Linear regression for 1D binary classification: (a) least-squares fit with threshold at \(0.5\) correctly separates the two classes; (b) adding class-0 outliers shifts the decision boundary and causes misclassification.

Limitations

  • Unbounded output: \(\hat {y}\) can be significantly larger than 1 or smaller than 0, with no probabilistic interpretation.

  • Outlier sensitivity: Distant points disproportionately influence the regression line, shifting the decision boundary (Fig. 8.2(b)).

These limitations motivate the logistic model.

8.3 Logistic Model

  • Goal: Binary classification model with:

    • Generalized linear model

    • Outlier robustness

    • Probabilistic interpretation

The logistic model addresses the limitations of the basic linear model by applying a sigmoid function (Fig. 8.3):

\begin{equation} \sigma (x) = \frac {\exp (x)}{1+\exp (x)} = \frac {1}{1+\exp (-x)} \end{equation}

Because \(\sigma (x)\in (0,1)\), the output is bounded and can be interpreted as a probability. The saturating tails reduce the influence of distant points, improving robustness to outliers.

(image)

Figure 8.3: Sigmoid function: bounded output \(\sigma (x)\in (0,1)\) with saturating tails.

Logistic regression model Substituting \(g(x) = \sigma (x)\) into (8.1) gives the logistic model. The decision rule with threshold \(thr\) is

\begin{equation} \hat {y} = \begin{cases} 1 & \sigma (\bw ^T\bx ) > thr\\ 0 & \sigma (\bw ^T\bx ) \le thr \end {cases} \end{equation}

With the default \(thr=0.5\), the rule simplifies to

\begin{equation} \hat {y} = \begin{cases} 1 & \bw ^T\bx > 0\\ 0 & \bw ^T\bx \le 0 \end {cases} \end{equation}

since \(\sigma (0)=0.5\).
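As a quick sketch (the weights and samples here are hypothetical), the equivalence of the two thresholding rules follows from the monotonicity of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # sigma(z) = 1 / (1 + e^{-z})

w = np.array([-1.0, 0.5])               # hypothetical weights [w_0, w_1]
X = np.array([[1.0, 0.0],               # rows are samples [1, x_1]
              [1.0, 2.0],
              [1.0, 5.0]])

z = X @ w                               # w^T x for each sample
p = sigmoid(z)                          # bounded predictions in (0, 1)

# Thresholding sigma(w^T x) at 0.5 equals thresholding w^T x at 0,
# since sigma is monotonically increasing and sigma(0) = 0.5.
print(np.array_equal(p > 0.5, z > 0))   # True
```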

  • Example 8.2: Using the same 1D dataset from Sec. 8.2, the logistic model correctly separates the two classes (Fig. 8.4(a)). Unlike linear regression, adding ten class-0 outliers at \(x_1\in [60,69]\) does not significantly shift the decision boundary (Fig. 8.4(b)), because the sigmoid saturates for large \(|\bw ^T\bx |\).

(image)

Figure 8.4: Logistic regression for 1D binary classification: (a) sigmoid fit correctly separates the two classes; (b) adding class-0 outliers does not shift the decision boundary, unlike linear regression (Fig. 8.2).

Why not MSE? Applying MSE loss to the logistic model,

\begin{equation} \loss (\cdot ) = \frac {1}{2M}\norm {\hat {\by } - \by }^2 \end{equation}

yields a non-convex optimization problem with multiple local minima and no closed-form solution. This motivates the cross-entropy loss in the following section (Sec. 8.4).

A loss function is not necessarily a metric.

8.4 Cross-Entropy Loss

  • Goal: Probabilistic loss that quantifies the distance between target and predicted distributions.

The logistic model output \(\hat {y}=\sigma (\bw ^T\bx )\in (0,1)\) can be interpreted as the probability \(\Pr (y=1\mid \bx ,\bw )\). The loss function should therefore measure how far the predicted distribution is from the target distribution. Cross-entropy, rooted in information theory, provides exactly such a measure.

8.4.1 Entropy

Entropy: For a discrete distribution \(P = \left \{p_i = \Pr [X=x_i]\right \}\), the entropy is

\begin{equation} H(P) = - \sum _i p_i\log (p_i) \end{equation}

Entropy measures the uncertainty of a distribution \(P\). It is maximized when all outcomes are equally likely (\(p_i=p_j\;\forall \, i,j\)) and decreases as the distribution becomes more concentrated.

Coding interpretation: With base-2 logarithm, entropy gives the theoretical minimum average number of bits needed to encode outcomes drawn from \(P\).

  • Example 8.3: For a binary distribution:

    \begin{align*} p_1 = p_2 = \dfrac {1}{2} &\Rightarrow H(P) = -2\cdot \tfrac {1}{2}\log _2\!\left (\tfrac {1}{2}\right ) =1 \text { bit}\\ p_1 = \dfrac {1}{10},\; p_2 = \dfrac {9}{10} &\Rightarrow H(P) = -\tfrac {1}{10}\log _2\!\left (\tfrac {1}{10}\right ) - \tfrac {9}{10}\log _2\!\left (\tfrac {9}{10}\right ) \approx 0.469 \text { bits} \end{align*} The more concentrated distribution has lower entropy and theoretically requires fewer bits on average.
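Example 8.3 can be checked numerically with a minimal entropy helper (using the \(0\log 0 := 0\) convention):

```python
import numpy as np

def entropy(p, base=2):
    """H(P) = -sum_i p_i log(p_i), with the convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # zero-probability outcomes add nothing
    return float(-np.sum(p * np.log(p) / np.log(base)))

print(entropy([0.5, 0.5]))              # 1.0 bit (maximal for two outcomes)
print(round(entropy([0.1, 0.9]), 3))    # 0.469 bits (more concentrated)
```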

  • Example 8.4: Consider transmitting letters \(\{A,B,C,D\}\) over a binary channel.

Equal probabilities. If all four letters are equally likely (\(p_i = 0.25\)), an optimal code is \(\{00,01,10,11\}\), i.e., two bits per letter. The entropy confirms this: \(H(P)=-4\cdot \frac {1}{4}\log _2\!\left (\frac {1}{4}\right )=2\) bits.

Unequal probabilities. With the distribution in Table 8.1, shorter codewords are assigned to more probable letters. The average code length is

\[ \sum _{i}\mathrm {length}_i\cdot p_i = 1\cdot 0.7 + 2\cdot 0.26 + 3\cdot 0.02 + 3\cdot 0.02 = 1.34 \text { bits} \]

while the entropy gives the theoretical minimum:

\begin{equation*} H(P) = - 0.7\log _2(0.7) - 0.26\log _2(0.26) - 2\cdot 0.02\log _2(0.02)\approx 1.091 \text { bits} \end{equation*}

Table 8.1: Variable-length coding example for an unequal distribution.

Letter  Probability \(p_i\)  Codeword  Length
A       0.70                 0         1
B       0.26                 10        2
C       0.02                 110       3
D       0.02                 111       3
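The numbers in Example 8.4 can be reproduced directly from the probabilities and codeword lengths:

```python
import numpy as np

probs   = np.array([0.70, 0.26, 0.02, 0.02])   # letters A, B, C, D (Table 8.1)
lengths = np.array([1, 2, 3, 3])               # codewords 0, 10, 110, 111

avg_len = float(np.sum(lengths * probs))       # average bits per letter
H = float(-np.sum(probs * np.log2(probs)))     # entropy: theoretical minimum

print(f"average code length: {avg_len:.2f} bits")  # 1.34
print(f"entropy (minimum):   {H:.3f} bits")        # 1.091
```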
8.4.2 Cross-Entropy

  • Goal: Quantify distance between distributions \(p\) and \(q\).

Cross-entropy: For two discrete distributions \(p\) and \(q\) over the same outcomes, the cross-entropy is

\begin{equation} H(p,q) = - \sum _i p_i\log (q_i) \end{equation}

Cross-entropy satisfies \(H(p,q) \ge H(p)\), with equality if and only if \(p=q\).

Coding interpretation: With base-2 logarithm, \(H(p,q)\) is the average number of bits needed to encode outcomes drawn from \(p\) using a code optimized for \(q\).

  • Example 8.5: Let \(q_i=\left \{\frac {1}{4},\frac {1}{4},\frac {1}{4},\frac {1}{4}\right \}\) (optimal code: two bits per letter) and \(p_i= \left \{\frac {1}{2},\frac {1}{2},0,0\right \}\). Then \(H(p) = 1\) bit, but \(H(p,q)=2\) bits: using a code designed for \(q\) wastes one bit per symbol on average when the true distribution is \(p\).

The convention \(\lim \limits _{x\to 0} x\log (x) = 0\) is used so that zero-probability events contribute nothing. For loss functions, the natural logarithm (\(\ln \)) is used. Minimizing cross-entropy with respect to model parameters \(\bw \) is equivalent to maximum likelihood estimation (MLE).
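A sketch of the cross-entropy computation, reproducing Example 8.5 (base-2 logarithms, with \(p_i=0\) terms skipped per the convention above):

```python
import numpy as np

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum_i p_i log(q_i); terms with p_i = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz]) / np.log(base)))

p = [0.5, 0.5, 0.0, 0.0]                # true distribution
q = [0.25, 0.25, 0.25, 0.25]            # distribution the code was designed for

print(cross_entropy(p, p))              # H(p)    = 1 bit
print(cross_entropy(p, q))              # H(p, q) = 2 bits: one wasted bit
```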

8.4.3 Binary Cross-Entropy (BCE) Loss

  • Goal: Convex loss for binary classification with probabilistic interpretation.

For a binary outcome, the cross-entropy reduces to

\begin{equation} \label {eq-bce-univariate} H(p,q) = -p_0\log (q_0) - p_1\log (q_1) \end{equation}

which is visualized in Fig. 8.5. When \(y=1\) (i.e., \(p_0=0,\, p_1=1\)), the expression reduces to \(H(p,q)=-\log (q_1)\), which penalizes small predicted probabilities \(q_1\).

(image)

Figure 8.5: Binary cross-entropy for the two cases: \(y=1\) (left) and \(y=0\) (right) (Eq. (8.10)).

For a single sample with true label \(y\in \{0,1\}\) and predicted probability \(\hat {y}=f_\bth (\bx )\in (0,1)\), the target and predicted distributions are

\begin{align*} p_0 = 1-y,\quad p_1 &= y\\ q_0 = 1-f_\bth (\bx ),\quad q_1 &= f_\bth (\bx ) \end{align*} Substituting into Eq. (8.10):

\begin{equation*} H(p,q) =-(1-y)\log \!\left (1- f_\bth (\bx )\right ) -y\log \!\left (f_\bth (\bx )\right ) \end{equation*}

BCE loss: Binary cross-entropy (BCE) loss for a single sample:

\begin{equation} \loss (y,\hat {y}) = -(1-y)\log (1-\hat {y}) - y\log (\hat {y}) \end{equation}

For \(M\) samples, the loss is averaged over all elements:

\begin{equation} \begin{aligned} \loss &=-\frac {1}{M}\sum _{j=1}^M \left [(1-y_j)\log (1-\hat {y}_j) + y_j\log (\hat {y}_j)\right ]\\ &=-\frac {1}{M}\left [(1-\by )^T\log (1-\hat {\by }) + \by ^T\log (\hat {\by })\right ] \end {aligned} \end{equation}
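The averaged vector form above translates directly into code. The small constant `eps` is an implementation assumption that guards against \(\log(0)\):

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Averaged binary cross-entropy in vector form."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)      # avoid log(0)
    M = len(y)
    return float(-((1 - y) @ np.log(1 - y_hat) + y @ np.log(y_hat)) / M)

y     = np.array([1.0, 0.0, 1.0, 0.0])          # true labels
y_hat = np.array([0.9, 0.1, 0.8, 0.3])          # predicted probabilities

print(round(bce_loss(y, y_hat), 4))             # 0.1976
print(bce_loss(y, y) < 1e-9)                    # perfect prediction: loss -> 0
```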

Properties: The BCE loss:

  • Is continuous, differentiable, and convex—suitable for gradient-based optimization.

  • Has a unique global minimum.

  • Provides probabilistic predictions:

    \begin{equation} \label {eq-lr-decision} \begin{aligned} \Pr (y=1|\bx ,\bw ) &= \sigma (\bw ^T\bx )\\ \Pr (y=0|\bx ,\bw ) &= 1-\sigma (\bw ^T\bx ) \end {aligned} \end{equation}

  • Yields a classification decision via thresholding, e.g. \(\hat {y}\lessgtr \frac {1}{2}\).

8.5 BCE Loss for Logistic Regression

Probabilistic interpretation

The predicted probability \(\hat {y}=\sigma (\bw ^T\bx )\) partitions the feature space into regions of varying confidence. Fig. 8.6 illustrates four regions: high-confidence positive (\(\hat {y}>0.8\)), moderate positive (\(0.8\ge \hat {y}\ge 0.5\)), moderate negative (\(0.5>\hat {y}\ge 0.2\)), and high-confidence negative (\(\hat {y}<0.2\)). Points near the decision boundary have predictions closer to \(0.5\), reflecting greater classification uncertainty.

The probabilistic output provides confidence levels but does not eliminate classification errors.

(image)

Figure 8.6: Probabilistic interpretation: regions of predicted probability \(\hat {y}=\sigma (\bw ^T\bx )\) with boundaries at \(0.2\) and \(0.8\). Misclassified points are marked in red (see also Fig. 8.1).

Loss minimization

Substituting \(\hat {\by }=\sigma (\bX \bw )\) into the BCE loss gives the logistic regression objective in vector form:

\begin{align} \loss = \frac {1}{M}\left [-\by ^T\log (\sigma (\bX \bw )) - (1-\by )^T\log (1-\sigma (\bX \bw ))\right ] \end{align} Setting \(\nabla _\bw \loss =\mathbf {0}\) has no closed-form solution for \(\bw \). However, the gradient has a compact form:

\begin{equation} \nabla _\bw \loss (\bw ) = \frac {1}{M}\bX ^T\left (\sigma \left (\bX \bw \right )-\by \right ) \end{equation}

which enables gradient descent using only matrix–vector operations:

\begin{equation} \bw _{n+1} = \bw _{n} - \alpha \nabla _\bw \mathcal {L}(\bw ) \end{equation}
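A minimal gradient-descent sketch for this objective; the toy dataset, step size \(\alpha\), and iteration count are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1D dataset: class 0 around x_1 = -2, class 1 around x_1 = +2.
rng = np.random.default_rng(0)
x1 = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
X = np.column_stack([np.ones_like(x1), x1])     # design matrix [1  x_1]
y = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(2)
alpha = 0.1                                     # learning rate (assumed)
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)  # (1/M) X^T (sigma(Xw) - y)
    w -= alpha * grad                           # gradient-descent update

y_hat = (X @ w >= 0).astype(int)                # decision rule: w^T x >= 0
print("training accuracy:", float(np.mean(y_hat == y)))
```

Only matrix-vector products appear in the loop, as noted above; no closed-form solution is used.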

Decision boundary

From Eq. (8.13), the classification rule is

\begin{equation} \hat {y} = \begin{cases} 1 & \bw ^T\bx \ge 0\\ 0 & \bw ^T\bx < 0 \end {cases} \end{equation}

The decision boundary is the set \(\{\bx : \bw ^T\bx = 0\}\), i.e., \(w_0+w_1x_1 +w_2x_2 + \cdots =0\). Geometrically, \(\bw ^T\bx =\norm {\bx }\norm {\bw }\cos (\theta )\), so the boundary consists of all points \(\bx \) perpendicular to \(\bw \) (\(\theta =90^\circ \)), forming a hyperplane with normal vector \(\bw \).

Regularization and feature mapping

  • Regularization. \(L_2\) regularization can be applied, similar to Eq. (5.13):

    \begin{equation} \mathcal {L}_{\mathrm {reg}} = \loss (\by ,\hat {\by }) + \frac {\lambda }{2M}\sum _{i=1}^N w_i^2 \end{equation}

  • Feature mapping. Mapping functions or kernels extend the model to non-linear decision boundaries.

  • Example 8.6: A degree-2 polynomial mapping

    \[ \varphi (x_1,x_2) = \langle 1,x_1,x_1^2,x_2,x_2^2,x_1x_2,x_1^2x_2,x_1x_2^2,x_1^2x_2^2 \rangle \]

    replaces \(\bx \) with \(\varphi (\bx )\), yielding a non-linear boundary in the original input space while remaining linear in the feature space (Fig. 8.7).

    (image)

    (a) Same dataset as Figs. 8.1 and 8.6: the quadratic boundary separates both classes without errors. Note the boundary in the upper-left corner.

    (image)

    (b) A different dataset with a non-convex class structure; the elliptical boundary captures the inner class, though some points (red) are misclassified.

    Figure 8.7: Degree-2 polynomial feature mapping \(\varphi (x_1,x_2)\) produces non-linear decision boundaries in the original input space.
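The mapping in Example 8.6 can be sketched as a simple feature-construction function; the logistic model is then fit on \(\varphi (\bx )\) instead of \(\bx \):

```python
import numpy as np

def phi(x1, x2):
    """Degree-2 polynomial mapping from Example 8.6 (with cross terms)."""
    return np.column_stack([np.ones_like(x1), x1, x1**2, x2, x2**2,
                            x1 * x2, x1**2 * x2, x1 * x2**2, x1**2 * x2**2])

x1 = np.array([0.0, 1.0, -1.0])         # illustrative sample coordinates
x2 = np.array([0.0, 2.0, 0.5])
Phi = phi(x1, x2)
print(Phi.shape)        # (3, 9): nine features per sample, model stays linear
```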

8.6 Odds and Logit

If \(\Pr (Y=1)=p\), then the odds (the probability of the event divided by the probability of no event) are defined by the ratio

\begin{equation} \text {odds} = \frac {\Pr (y=1)}{\Pr (y=0)} = \frac {\Pr (y=1)}{1-\Pr (y=1)} = \frac {p}{1-p} \end{equation}

The odds range over \([0, \infty )\).

Taking the natural logarithm gives the logit function (or log-odds):

\begin{equation} \text {logit}(p) = \ln \left (\frac {p}{1-p}\right ) \end{equation}

The logit maps probabilities from the range \((0, 1)\) to the entire real line \((-\infty , \infty )\).

Logistic Regression As already mentioned in (8.13),

\begin{equation} \Pr (y=1) = \sigma (\bw ^T\bx ) = \frac {1}{1+\exp (-\bw ^T\bx )} = p \end{equation}

By reformulation,

\begin{align} \frac {p}{1-p} &= \frac {\Pr (y=1)}{\Pr (y=0)} = \exp (\bw ^T\bx )\\[3pt] \ln \left (\frac {p}{1-p}\right ) & = \bw ^T\bx \end{align}

To illustrate the influence of a change in a single feature, let \(\tilde {x}_i = x_i + \Delta x\) and consider the ratio of odds:

\begin{equation} \begin{aligned} \frac {\text {odds}_{\Delta x}}{\text {odds}} &= \frac {\exp (w_0+w_1x_1 + \cdots + w_i(x_i + \Delta x) + \cdots + w_Nx_N)}{\exp (w_0+w_1x_1 + \cdots + w_ix_i + \cdots + w_Nx_N)}\\[3pt] &= \exp (w_i(x_i + \Delta x) - w_ix_i) = \exp (w_i\Delta x) \end {aligned} \end{equation}

For example, with \(w_i = 0.7\) and \(\Delta x = 1\), the odds are multiplied by \(\exp (0.7)\approx 2.01\): increasing \(x_i\) by one unit roughly doubles the odds of \(y=1\).

8.7 Summary

  • Logistic regression applies the sigmoid function to a linear model, producing bounded output with probabilistic interpretation.

  • BCE loss is convex, differentiable, and equivalent to maximum likelihood estimation.

  • The decision boundary is a hyperplane in input space; non-linear boundaries require feature mapping.

  • Optimization is performed via gradient descent (no closed-form solution for \(\bw \)).

  • Regularization and polynomial/kernel mappings extend naturally from linear regression.

  • Complete separation failure: If a feature perfectly separates the two classes, the corresponding weight does not converge: the BCE loss can always be decreased further by scaling \(\bw \) up, so the optimal weight is infinite. Regularization mitigates this.