Machine Learning & Signals Learning
8 Logistic Regression
For binary classification, each entry of vector \(\by \) is \(y_j\in \{0,1\}\).
8.1 Generalized Linear Classification Models
Generalized linear model: A generalized linear model applies a non-linear function \(g(\cdot ):\real \rightarrow \real \) to the linear combination \(\bw ^T\bx _i\):
\(\seteqnumber{0}{}{0}\)\begin{equation} \label {eq-general-linear-model} \hat {y}_i = g(\bw ^T\bx _i) \end{equation}
The choice of \(g(\cdot )\) determines the model type. The decision boundary \(\{\bx : \bw ^T\bx = thr\}\) is linear (a hyperplane), regardless of the choice of \(g(\cdot )\) (Fig. 8.1).
Two important questions arise:
• Selection of a loss function.
• Minimization of the selected loss function.
8.2 Basic Linear Model
Linear model: The basic linear model uses
\(\seteqnumber{0}{}{1}\)\begin{equation} \label {eq-basic-linear-model} g(x) = x \end{equation}
and MSE loss. Its output is unbounded and continuous.
Linear regression can be applied to classification by thresholding the output, as follows:
1. Compute regression weights \(\bw \) according to data \(\bX \) and the binary vector \(\by \) (e.g., by (3.5)).
2. Compute the regression output according to (8.1) with (8.2) substituted. Then apply the threshold \(0.5\) to obtain class labels:
\(\seteqnumber{0}{}{2}\)\begin{equation} \hat {y}_i = \begin{cases} 1 & \bw ^T\bx _i > 0.5\\ 0 & \bw ^T\bx _i \leqslant 0.5 \end {cases} \end{equation}
Example 8.1: Consider \(M=20\) samples with a single feature \(x_1\in [10,29]\): the first ten labeled \(y=1\) and the last ten \(y=0\). The design matrix \(\bX =[\bOne _M\;\;\bx ]\in \real ^{M\times 2}\) includes a column of ones for the intercept (Sec. 3.1). Fitting \(\bw \) by least squares and thresholding \(\hat {y}=\bw ^T\bx \lessgtr 0.5\) classifies all points correctly (Fig. 8.2(a)).
Adding ten class-0 outliers at \(x_1\in [60,69]\) pulls the regression line toward them, shifting the decision boundary and causing misclassification near the original boundary (Fig. 8.2(b)).
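The behavior described in Example 8.1 can be reproduced numerically. The sketch below (assuming NumPy; the helper name `lstsq_classify` is illustrative) fits the least-squares line, thresholds it at \(0.5\), and shows how the outliers shift the boundary:

```python
import numpy as np

def lstsq_classify(x, y):
    """Fit y ~ w0 + w1*x by least squares, then threshold the output at 0.5."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix [1  x]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form LS solution
    y_hat = (X @ w > 0.5).astype(float)         # threshold to class labels
    return y_hat, w

# Example 8.1: x1 in [10, 29]; first ten samples labeled 1, last ten labeled 0
x = np.arange(10, 30, dtype=float)
y = np.array([1.0] * 10 + [0.0] * 10)
y_hat, _ = lstsq_classify(x, y)
print("clean accuracy:", np.mean(y_hat == y))           # 1.0: all correct

# Ten class-0 outliers at x1 in [60, 69] pull the line toward them
x_out = np.concatenate([x, np.arange(60, 70, dtype=float)])
y_out = np.concatenate([y, np.zeros(10)])
y_hat_out, _ = lstsq_classify(x_out, y_out)
print("with outliers:", np.mean(y_hat_out == y_out))    # < 1.0: boundary shifted
```

The misclassifications occur for class-0 points just above the original boundary, exactly as in Fig. 8.2(b).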
The basic linear model has two key limitations:
• Unbounded output: \(\hat {y}\) can be significantly larger than 1 or smaller than 0, with no probabilistic interpretation.
• Outlier sensitivity: distant points disproportionately influence the regression line, shifting the decision boundary (Fig. 8.2(b)).
These limitations motivate the logistic model.
8.3 Logistic Model
The logistic model addresses the limitations of the basic linear model by applying a sigmoid function (Fig. 8.3):
\(\seteqnumber{0}{}{3}\)\begin{equation} \sigma (x) = \frac {\exp (x)}{1+\exp (x)} = \frac {1}{1+\exp (-x)} \end{equation}
Because \(\sigma (x)\in [0,1]\), the output is bounded and can be interpreted as a probability. The saturating tails reduce the influence of distant points, improving robustness to outliers.
Logistic regression model: Substituting \(g(x) = \sigma (x)\) into (8.1) gives the logistic model. The decision rule with threshold \(thr\) is
\(\seteqnumber{0}{}{4}\)\begin{equation} \hat {y} = \begin{cases} 1 & \sigma (\bw ^T\bx ) > thr\\ 0 & \sigma (\bw ^T\bx ) \le thr \end {cases} \end{equation}
With the default \(thr=0.5\), the rule simplifies to
\(\seteqnumber{0}{}{5}\)\begin{equation} \hat {y} = \begin{cases} 1 & \bw ^T\bx > 0\\ 0 & \bw ^T\bx \le 0 \end {cases} \end{equation}
since \(\sigma (0)=0.5\).
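The equivalence of the two decision rules can be checked numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 13)
# sigma(0) = 0.5, so thresholding sigma(z) at 0.5 equals thresholding z at 0
assert sigmoid(0.0) == 0.5
assert np.array_equal(sigmoid(z) > 0.5, z > 0)
print(sigmoid(np.array([-2.0, 0.0, 2.0])))   # output stays bounded in (0, 1)
```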
Example 8.2: Using the same 1D dataset from Sec. 8.2, the logistic model correctly separates the two classes (Fig. 8.4(a)). Unlike linear regression, adding ten class-0 outliers at \(x_1\in [60,69]\) does not significantly shift the decision boundary (Fig. 8.4(b)), because the sigmoid saturates for large \(|\bw ^T\bx |\).
Why not MSE? Applying MSE loss to the logistic model,
\(\seteqnumber{0}{}{6}\)\begin{equation} \loss (\cdot ) = \frac {1}{2M}\norm {\hat {\by } - \by }^2 \end{equation}
yields a non-convex optimization problem with multiple local minima and no closed-form solution. This motivates the cross-entropy loss in the following section (Sec. 8.4).
A loss function is not necessarily a metric.
8.4 Cross-Entropy Loss
The logistic model output \(\hat {y}=\sigma (\bw ^T\bx )\in [0,1]\) can be interpreted as the probability \(\Pr (y=1\mid \bx ,\bw )\). The loss function should therefore measure how far the predicted distribution is from the target distribution. Cross-entropy, rooted in information theory, provides exactly such a measure.
8.4.1 Entropy
Entropy: For a discrete distribution \(P = \left \{p_i = \Pr [X=x_i]\right \}\), the entropy is
\(\seteqnumber{0}{}{7}\)\begin{equation} H(P) = - \sum _i p_i\log (p_i) \end{equation}
Entropy measures the uncertainty of a distribution \(P\). It is maximized when all outcomes are equally likely (\(p_i=p_j\;\forall \, i,j\)) and decreases as the distribution becomes more concentrated.
Coding interpretation: With base-2 logarithm, entropy gives the theoretical minimum average number of bits needed to encode outcomes drawn from \(P\).
Example 8.3: For a binary distribution:
\(\seteqnumber{0}{}{8}\)\begin{align*} p_1 = p_2 = \dfrac {1}{2} &\Rightarrow H(P) = -2\cdot \tfrac {1}{2}\log _2\!\left (\tfrac {1}{2}\right ) =1 \text { bit}\\ p_1 = \dfrac {1}{10},\; p_2 = \dfrac {9}{10} &\Rightarrow H(P) = -\tfrac {1}{10}\log _2\!\left (\tfrac {1}{10}\right ) - \tfrac {9}{10}\log _2\!\left (\tfrac {9}{10}\right ) \approx 0.469 \text { bits} \end{align*} The more concentrated distribution has lower entropy and theoretically requires fewer bits on average.
Example 8.4: Consider encoding a four-letter alphabet. Equal probabilities. If all four letters are equally likely (\(p_i = 0.25\)), an optimal code is \(\{00,01,10,11\}\), i.e., two bits per letter. The entropy confirms this: \(H(P)=-4\cdot \frac {1}{4}\log _2\!\left (\frac {1}{4}\right )=2\) bits.
Unequal probabilities. With the distribution in Table 8.1, shorter codewords are assigned to more probable letters. The average code length is
\[ \sum _{i}\mathrm {length}_i\cdot p_i = 1\cdot 0.7 + 2\cdot 0.26 + 3\cdot 0.02 + 3\cdot 0.02 = 1.34 \text { bits} \]
while the entropy gives the theoretical minimum:
\(\seteqnumber{0}{}{8}\)\begin{equation*} H(P) = - 0.7\log _2(0.7) - 0.26\log _2(0.26) - 2\cdot 0.02\log _2(0.02)\approx 1.091 \text { bits} \end{equation*}
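The entropy values in the examples above can be verified numerically. A sketch assuming NumPy (the `entropy` helper is illustrative; it applies the \(0\log 0 = 0\) convention by skipping zero-probability outcomes):

```python
import numpy as np

def entropy(p, base=2):
    """H(P) = -sum p_i log(p_i), with 0 * log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                                       # skip zero-probability outcomes
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

print(entropy([0.5, 0.5]))                   # 1.0 bit: maximal binary uncertainty
print(entropy([0.1, 0.9]))                   # ~0.469 bits: more concentrated
print(entropy([0.25, 0.25, 0.25, 0.25]))     # 2.0 bits: four equally likely letters
print(entropy([0.7, 0.26, 0.02, 0.02]))      # ~1.091 bits: theoretical minimum

# Average length of the unequal-probability code exceeds the entropy bound:
avg_len = 1 * 0.7 + 2 * 0.26 + 3 * 0.02 + 3 * 0.02
print(avg_len)                               # 1.34 bits > 1.091 bits
```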
8.4.2 Cross-Entropy
Cross-entropy: For two discrete distributions \(p\) and \(q\) over the same outcomes, the cross-entropy is
\(\seteqnumber{0}{}{8}\)\begin{equation} H(p,q) = - \sum _i p_i\log (q_i) \end{equation}
Cross-entropy satisfies \(H(p,q) \ge H(p)\), with equality if and only if \(p=q\).
Coding interpretation: With base-2 logarithm, \(H(p,q)\) is the average number of bits needed to encode outcomes drawn from \(p\) using a code optimized for \(q\).
Example 8.5: Let \(q_i=\left \{\frac {1}{4},\frac {1}{4},\frac {1}{4},\frac {1}{4}\right \}\) (optimal code: two bits per letter) and \(p_i= \left \{\frac {1}{2},\frac {1}{2},0,0\right \}\). Then \(H(p) = 1\) bit, but \(H(p,q)=2\) bits: using a code designed for \(q\) wastes one bit per symbol on average when the true distribution is \(p\).
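Example 8.5 can be checked with a small sketch (assuming NumPy; `cross_entropy` is an illustrative helper that skips terms with \(p_i=0\)):

```python
import numpy as np

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum p_i log(q_i); terms with p_i = 0 contribute nothing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz])) / np.log(base)

p = [0.5, 0.5, 0.0, 0.0]      # true distribution
q = [0.25, 0.25, 0.25, 0.25]  # distribution the code was designed for
print(cross_entropy(p, p))    # 1.0 bit  = H(p): code matched to p
print(cross_entropy(p, q))    # 2.0 bits = H(p, q): one wasted bit per symbol
```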
The convention \(\lim \limits _{x\to 0} x\log (x) = 0\) is used so that zero-probability events contribute nothing. For loss functions, the natural logarithm (\(\ln \)) is used. Minimizing cross-entropy with respect to model parameters \(\bw \) is equivalent to maximum likelihood estimation (MLE).
8.4.3 Binary Cross-Entropy (BCE) Loss
For a binary outcome, the cross-entropy reduces to
\(\seteqnumber{0}{}{9}\)\begin{equation} \label {eq-bce-univariate} H(p,q) = -p_0\log (q_0) - p_1\log (q_1) \end{equation}
which is visualized in Fig. 8.5. When \(y=1\) (i.e., \(p_0=0,\, p_1=1\)), the expression reduces to \(H(p,q)=-\log (q_1)\), which penalizes small predicted probabilities \(q_1\).
For a single sample with true label \(y\in \{0,1\}\) and predicted probability \(\hat {y}=f_\bth (\bx )\in (0,1)\), the target and predicted distributions are
\(\seteqnumber{0}{}{10}\)\begin{align*} p_0 = 1-y,\quad p_1 &= y\\ q_0 = 1-f_\bth (\bx ),\quad q_1 &= f_\bth (\bx ) \end{align*} Substituting into Eq. (8.10):
\(\seteqnumber{0}{}{10}\)\begin{equation*} H(p,q) =-(1-y)\log \!\left (1- f_\bth (\bx )\right ) -y\log \!\left (f_\bth (\bx )\right ) \end{equation*}
BCE loss: Binary cross-entropy (BCE) loss for a single sample:
\(\seteqnumber{0}{}{10}\)\begin{equation} \loss (y,\hat {y}) = -(1-y)\log (1-\hat {y}) - y\log (\hat {y}) \end{equation}
For \(M\) samples, the loss is averaged over all elements:
\(\seteqnumber{0}{}{11}\)\begin{equation} \begin{aligned} \loss &=-\frac {1}{M}\sum _{j=1}^M \left [(1-y_j)\log (1-\hat {y}_j) + y_j\log (\hat {y}_j)\right ]\\ &=-\frac {1}{M}\left [(1-\by )^T\log (1-\hat {\by }) + \by ^T\log (\hat {\by })\right ] \end {aligned} \end{equation}
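The averaged BCE loss (8.12) translates directly into vectorized code. A sketch assuming NumPy; the `eps` clipping is a practical safeguard against \(\log (0)\), not part of the definition:

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy over M samples; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean((1 - y) * np.log(1 - y_hat) + y * np.log(y_hat))

y = np.array([1.0, 0.0, 1.0, 0.0])
print(bce_loss(y, np.array([0.9, 0.1, 0.8, 0.2])))   # ~0.164: confident and correct
print(bce_loss(y, np.array([0.5, 0.5, 0.5, 0.5])))   # -ln(0.5) ~ 0.693: uninformative
print(bce_loss(y, np.array([0.1, 0.9, 0.2, 0.8])))   # ~1.956: confident but wrong
```

Note the asymmetry: confidently wrong predictions are penalized far more heavily than uninformative ones.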
Properties: The BCE loss:
• Is continuous, differentiable, and convex, and thus suitable for gradient-based optimization.
• Has a unique global minimum.
• Provides probabilistic predictions:
\(\seteqnumber{0}{}{12}\)\begin{equation} \label {eq-lr-decision} \begin{aligned} \Pr (y=1|\bx ,\bw ) &= \sigma (\bw ^T\bx )\\ \Pr (y=0|\bx ,\bw ) &= 1-\sigma (\bw ^T\bx ) \end {aligned} \end{equation}
• Yields a classification decision via thresholding, e.g., \(\hat {y}\lessgtr \frac {1}{2}\).
8.5 BCE Loss for Logistic Regression
Probabilistic interpretation
The predicted probability \(\hat {y}=\sigma (\bw ^T\bx )\) partitions the feature space into regions of varying confidence. Fig. 8.6 illustrates four regions: high-confidence positive (\(\hat {y}>0.999\)), moderate positive (\(0.8\ge \hat {y}\ge 0.5\)), moderate negative (\(0.5>\hat {y}\ge 0.2\)), and high-confidence negative (\(0.2>\hat {y}\)). Points near the decision boundary have predictions closer to \(0.5\), reflecting greater classification uncertainty.
The probabilistic output provides confidence levels but does not eliminate classification errors.
Loss minimization
Substituting \(\hat {\by }=\sigma (\bX \bw )\) into the BCE loss gives the logistic regression objective in vector form:
\(\seteqnumber{0}{}{13}\)\begin{align} \loss = \frac {1}{M}\left [-\by ^T\log (\sigma (\bX \bw )) - (1-\by )^T\log (1-\sigma (\bX \bw ))\right ] \end{align} Setting \(\nabla _\bw \loss =\mathbf {0}\) yields no closed-form solution for \(\bw \). However, the gradient has a compact form:
\(\seteqnumber{0}{}{14}\)\begin{equation} \nabla _\bw \loss (\bw ) = \frac {1}{M}\bX ^T\left (\sigma \left (\bX \bw \right )-\by \right ) \end{equation}
which enables gradient descent using only matrix–vector operations:
\(\seteqnumber{0}{}{15}\)\begin{equation} \bw _{n+1} = \bw _{n} - \alpha \nabla _\bw \mathcal {L}(\bw ) \end{equation}
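The gradient (8.15) and update rule (8.16) can be sketched as follows (assuming NumPy; the dataset and the helper name `fit_logistic` are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.5, n_iter=2000):
    """Gradient descent on the BCE loss: w <- w - alpha * X^T(sigma(Xw) - y) / M."""
    M = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ w) - y) / M   # compact gradient, Eq. (8.15)
        w -= alpha * grad                       # update step, Eq. (8.16)
    return w

# Toy separable 1D dataset with an intercept column (values chosen for illustration)
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

w = fit_logistic(X, y)
y_hat = (X @ w > 0).astype(float)        # decision rule: w^T x > 0 <=> sigma > 0.5
print("accuracy:", np.mean(y_hat == y))  # 1.0: the classes are linearly separable
```

Only matrix-vector products appear in the loop, matching the text; no matrix inversion is required.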
Decision boundary
From Eq. (8.13), the classification rule is
\(\seteqnumber{0}{}{16}\)\begin{equation} \hat {y} = \begin{cases} 1 & \bw ^T\bx \ge 0\\ 0 & \bw ^T\bx < 0\\ \end {cases} \end{equation}
The decision boundary is the set \(\{\bx : \bw ^T\bx = 0\}\), i.e., \(w_0+w_1x_1 +w_2x_2 + \cdots =0\). Geometrically, \(\bw ^T\bx =\norm {\bx }\norm {\bw }\cos (\theta )\), so the boundary consists of all points \(\bx \) perpendicular to \(\bw \) (\(\theta =90^\circ \)), forming a hyperplane with normal vector \(\bw \).
Regularization and feature mapping
• Regularization. \(L_2\) regularization can be applied, similar to Eq. (5.13):
\(\seteqnumber{0}{}{17}\)\begin{equation} \mathcal {L}_{\mathrm {reg}} = \loss (\by ,\hat {\by }) + \frac {\lambda }{2M}\sum _{i=1}^N w_i^2 \end{equation}
• Feature mapping. Mapping functions or kernels extend the model to non-linear decision boundaries.
Example 8.6: A polynomial mapping
\[ \varphi (x_1,x_2) = \langle 1,x_1,x_1^2,x_2,x_2^2,x_1x_2,x_1^2x_2,x_1x_2^2,x_1^2x_2^2 \rangle \]
replaces \(\bx \) with \(\varphi (\bx )\), yielding a non-linear boundary in the original input space while remaining linear in the feature space (Fig. 8.7).
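As a sketch (assuming NumPy), the mapping of Example 8.6 together with a hand-picked weight vector (an assumption for illustration, not a fitted model) shows how a model linear in \(\varphi (\bx )\) yields a circular boundary in the original space:

```python
import numpy as np

def poly_map(x1, x2):
    """phi(x1,x2) = <1, x1, x1^2, x2, x2^2, x1*x2, x1^2*x2, x1*x2^2, x1^2*x2^2>."""
    return np.array([1.0, x1, x1**2, x2, x2**2,
                     x1 * x2, x1**2 * x2, x1 * x2**2, x1**2 * x2**2])

print(poly_map(1.0, 2.0))   # [1. 1. 1. 2. 4. 2. 2. 4. 4.]

# Hypothetical weights: w^T phi(x) = x1^2 + x2^2 - 1, the unit circle as boundary
w = np.array([-1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
inside = w @ poly_map(0.5, 0.5)   # negative -> class 0 (inside the circle)
outside = w @ poly_map(1.0, 1.0)  # positive -> class 1 (outside the circle)
print(inside, outside)
```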
8.6 Odds and Logit
If \(\Pr (y=1)=p\), then the odds (the probability of the event divided by the probability of no event) are defined by the ratio
\(\seteqnumber{0}{}{18}\)\begin{equation} \text {odds} = \frac {\Pr (y=1)}{\Pr (y=0)} = \frac {\Pr (y=1)}{1-\Pr (y=1)} = \frac {p}{1-p} \end{equation}
The odds range over \([0, \infty )\).
Taking the natural logarithm gives the logit function (or log-odds):
\(\seteqnumber{0}{}{19}\)\begin{equation} \text {logit}(p) = \ln \left (\frac {p}{1-p}\right ) \end{equation}
The logit maps probabilities from the range \((0, 1)\) to the entire real line \((-\infty , \infty )\).
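A quick numerical check (assuming NumPy) that the logit maps \((0,1)\) to the real line and inverts the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: ln(p / (1 - p)), mapping (0, 1) onto the whole real line."""
    return np.log(p / (1 - p))

print(logit(np.array([0.1, 0.5, 0.9])))        # symmetric about 0: [-2.197 0. 2.197]
for z in [-3.0, 0.0, 4.2]:
    assert abs(logit(sigmoid(z)) - z) < 1e-9   # logit is the inverse of sigmoid
```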
Logistic Regression: As already mentioned in (8.13),
\(\seteqnumber{0}{}{20}\)\begin{equation} \Pr (y=1) = \sigma (\bw ^T\bx ) = \frac {1}{1+\exp (-\bw ^T\bx )} = p \end{equation}
By reformulation,
\(\seteqnumber{0}{}{21}\)\begin{align} \frac {p}{1-p} &= \frac {\Pr (y=1)}{\Pr (y=0)} = \exp (\bw ^T\bx )\\[3pt] \ln \left (\frac {p}{1-p}\right ) & = \bw ^T\bx \end{align}
To illustrate the influence of a change in a single feature, let \(\tilde {x}_i = x_i + \Delta x\):
\(\seteqnumber{0}{}{23}\)\begin{equation} \begin{aligned} \frac {\text {odds}_{\Delta x}}{\text {odds}} &= \frac {\exp (w_0+w_1x_1 + \cdots + w_i(x_i + \Delta x) + \cdots + w_Nx_N)}{\exp (w_0+w_1x_1 + \cdots + w_ix_i + \cdots + w_Nx_N)}\\[3pt] &= \exp (w_i(x_i + \Delta x) - w_ix_i) = \exp (w_i\Delta x) \end {aligned} \end{equation}
For example, with \(w_i = 0.7\) and \(\Delta x = 1\) (illustrative values), the odds are multiplied by \(\exp (0.7)\approx 2.01\): a unit increase in \(x_i\) roughly doubles the odds of \(y=1\).
8.7 Summary
• Logistic regression applies the sigmoid function to a linear model, producing a bounded output with a probabilistic interpretation.
• BCE loss is convex and differentiable, and minimizing it is equivalent to maximum likelihood estimation.
• The decision boundary is a hyperplane in input space; non-linear boundaries require feature mapping.
• Optimization is performed via gradient descent (there is no closed-form solution for \(\bw \)).
• Regularization and polynomial/kernel mappings extend naturally from linear regression.
• Complete separation failure: if a feature perfectly separates the two classes, the weight for that feature does not converge, because the optimal weight would be infinite.