Machine Learning & Signals Learning
12 Vanilla NN
12.1 Definition
12.1.1 Single neuron
A single neuron is the neural-network realization of the generalized linear model \(\hat {y}_i = g(\bw ^T\bx _i)\) introduced in the logistic regression chapter (Eq. (8.1)).
Single neuron: A single neuron computes its output in two steps (Fig. 12.1):
\(\seteqnumber{0}{}{0}\)\begin{align} a &= w_1x_1 + w_2x_2 + \cdots + w_Nx_N + b = \bw ^T\bx + b\label {eq-intermediate-linear-combination}\\ z &= g(a) \end{align} where
• \(x_i\) are the input features, \(i=1,\ldots ,N\),
• \(w_i\) are learnable weights and \(\bw \) is the weight vector,
• \(b\) is an explicit bias term, kept separate from the weight vector,
• \(a\) is the intermediate linear combination (pre-activation) in (12.1),
• \(g(\cdot )\) is the activation function, e.g. \(\sigma (\cdot )\) or the identity,
• \(z\) is the output after applying the activation function (neuron output or post-activation).
When \(g=\sigma \) (sigmoid), the single neuron reduces to logistic regression.
In the ML part of this book the design matrix \(\bX \in \Re ^{M\times N}\) has samples as rows. In DL notation the transposed form \(\bX ^T\) is commonly used, so that samples are columns.
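As a sketch, the two-step computation of Eqs. (12.1)–(12.2) in NumPy (function and variable names here are illustrative, not from the text):

```python
import numpy as np

def sigmoid(a):
    """Sigmoid activation g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b, g=sigmoid):
    """Single neuron: pre-activation a = w^T x + b, output z = g(a)."""
    a = w @ x + b          # Eq. (12.1): linear combination plus bias
    return g(a)            # Eq. (12.2): apply the activation

# With g = sigmoid this is exactly logistic regression.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
z = neuron(x, w, b=0.0)    # sigmoid(0.5*1 - 0.25*2) = sigmoid(0) = 0.5
```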
12.1.2 Layered representation
Each layer is indexed by a superscript \([k]\). Applying the single-neuron definition to every neuron in layer \(k\) yields the vector form:
\(\seteqnumber{0}{}{2}\)\begin{align} \ba ^{[k]} &= \left (\bW ^{[k]}\right )^T \bz ^{[k-1]} + \bb ^{[k]}\\ \bz ^{[k]} &= g_k\!\left (\ba ^{[k]}\right ) \end{align} For the one-layer network in Fig. 12.2a (\(k=1\), four inputs, five hidden neurons) the pre-activation vector expands as
\(\seteqnumber{0}{}{4}\)\begin{equation} \underbrace {\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \end {bmatrix}}_{\ba ^{[1]}} = \underbrace {\begin{bmatrix} w_{11} & w_{21} & w_{31} & w_{41} \\ w_{12} & w_{22} & w_{32} & w_{42} \\ w_{13} & w_{23} & w_{33} & w_{43} \\ w_{14} & w_{24} & w_{34} & w_{44} \\ w_{15} & w_{25} & w_{35} & w_{45} \end {bmatrix}}_{\left (\bW ^{[1]}\right )^T} \underbrace {\begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ x_{4} \end {bmatrix}}_{\bx } + \underbrace {\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end {bmatrix}}_{\bb ^{[1]}} \end{equation}
Multi-layer model (single sample):
\(\seteqnumber{0}{}{5}\)\begin{align} \ba ^{[k]} &= \left (\bW ^{[k]}\right )^T \bz ^{[k-1]} + \bb ^{[k]}\\ \bz ^{[k]} &= g_k\!\left (\ba ^{[k]}\right )\\ \bz ^{[0]} &=\bx _i \end{align}
When the entire dataset is processed at once, each column of \(\bZ ^{[k]}\) corresponds to one sample.
Multi-layer model (all dataset):
\(\seteqnumber{0}{}{8}\)\begin{align} \bZ ^{[k]} &= g_k\!\left (\left (\bW ^{[k]}\right )^T\bZ ^{[k-1]} + \bb ^{[k]}\right )\\ \bZ ^{[0]} & =\bX ^T \end{align} The output is the value of \(\bZ ^{[k]}\) at the output layer.
Fig. 12.2 illustrates three common architectures. A one-layer network with a single output neuron (Fig. 12.2a) performs scalar regression or binary classification. Adding output neurons (Fig. 12.2b) enables multi-target regression or multi-class classification. Stacking hidden layers (Fig. 12.2c) increases the model capacity, allowing the network to learn hierarchical feature representations.
The number of hidden layers and the number of neurons per layer are hyperparameters chosen before training begins. Increasing the number of layers (depth) and the number of neurons per layer (width) enlarges the model complexity and the number of trainable parameters, so they must be balanced against the available training data to avoid overfitting.
Most of the calculations are matrix additions and multiplications; the commonly used activation functions (e.g., ReLU) add only cheap elementwise operations.
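The layered forward pass can be sketched as follows, applying the per-layer recursion \(\ba ^{[k]} = (\bW ^{[k]})^T\bz ^{[k-1]} + \bb ^{[k]}\) column-wise over all samples at once; the helper names and toy dimensions are illustrative:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def identity(a):
    return a

def forward(X_cols, params, activations):
    """Forward pass with one sample per COLUMN (the DL convention).
    params: list of (W, b) per layer; activations: list of g_k."""
    Z = X_cols                          # Z^[0] = input
    for (W, b), g in zip(params, activations):
        A = W.T @ Z + b                 # a^[k] = (W^[k])^T z^[k-1] + b^[k]
        Z = g(A)                        # z^[k] = g_k(a^[k])
    return Z

# Toy network matching Fig. 12.2a: 4 inputs, 5 hidden neurons, 1 output.
rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(4, 5)), np.zeros((5, 1))),   # (W^[1])^T is 5x4
    (rng.normal(size=(5, 1)), np.zeros((1, 1))),   # (W^[2])^T is 1x5
]
Z_out = forward(rng.normal(size=(4, 3)), params, [relu, identity])  # 3 samples
```

Each column of `Z_out` is the network output for one sample.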
12.2 Activation Functions
Activation functions are essential components of neural networks:
• Every activation function must be differentiable (or piecewise differentiable) to support back-propagation (Sec. 12.5 below).
• A linear activation function \(g(z)=z\) reduces the entire network to linear regression, regardless of depth.
• The choice of activation function affects convergence speed, gradient flow, and computational cost.
Table 12.1 summarizes the common activation functions and their derivatives.
| Name | Function | Range | Derivative |
| Sigmoid | \(f(z)=\dfrac {1}{1+e^{-z}}\) | \([0,1]\) | \(f'(z) = f(z)(1-f(z))\) |
| Tanh | \(f(z)=\dfrac {e^{z}-e^{-z}}{e^{z} + e^{-z}}\) | \([-1,1]\) | \(f'(z) = 1 - f^2(z)\) |
| ReLU | \(f(z)=\max (0,z)\) | \([0,\infty )\) | \(f'(z)=\begin {cases} 1 & z > 0\\ 0 & z < 0 \end {cases}\) |
| Leaky ReLU | \(f(z)=\max (\alpha z,z)\) | \((-\infty ,\infty )\) | \(f'(z)=\begin {cases} 1 & z \ge 0\\ \alpha & z < 0 \end {cases}\) |
Sigmoid
The sigmoid function maps any real value to the interval \((0,1)\), making it suitable for probabilistic interpretation.
• Outputs can be interpreted as probabilities.
• Suffers from vanishing gradients: for large \(|z|\), the derivative approaches zero, slowing learning.
• Computationally expensive due to the exponential operation.
• Not zero-centered, which can slow convergence.
Tanh
The hyperbolic tangent is a scaled and shifted sigmoid, \(\tanh (z) = 2\sigma (2z) - 1\).
• Zero-centered output in \([-1,1]\), which often improves convergence.
• Still suffers from vanishing gradients for large \(|z|\).
• Computationally expensive.
ReLU
The Rectified Linear Unit is currently the most widely used activation function.
• Faster convergence than sigmoid or tanh due to non-saturating gradients for \(z>0\).
• Computationally efficient: only requires a threshold comparison.
• Unbounded output can lead to exploding activations.
• “Dying ReLU” problem: neurons with negative inputs have zero gradient and stop learning.
• Sensitive to weight initialization.
Leaky ReLU
Addresses the dying ReLU problem by allowing a small gradient for negative inputs.
• Uses a small slope \(\alpha \) (typically \(\alpha = 0.01\)) for \(z<0\).
• Prevents neurons from becoming permanently inactive.
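The functions and derivatives of Table 12.1 can be sketched directly in NumPy; as an implementation choice (not fixed by the table), the ReLU derivative at \(z=0\) is set to 0 here:

```python
import numpy as np

def sigmoid(z):        return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):      s = sigmoid(z); return s * (1.0 - s)

def d_tanh(z):         return 1.0 - np.tanh(z) ** 2     # f(z) = np.tanh(z)

def relu(z):           return np.maximum(0.0, z)
def d_relu(z):         return (z > 0).astype(float)     # subgradient 0 at z = 0

def leaky_relu(z, alpha=0.01):    return np.maximum(alpha * z, z)
def d_leaky_relu(z, alpha=0.01):  return np.where(z >= 0, 1.0, alpha)
```

A quick finite-difference check confirms each derivative matches its function.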
12.3 Softmax Layer
In multi-class classification with \(K\) classes, the target label is represented as a one-hot vector.
One-hot encoding: A categorical label \(c\in \{1,\ldots ,K\}\) is encoded as a binary vector \(\by \in \{0,1\}^K\) with exactly one element equal to 1:
\(\seteqnumber{0}{}{10}\)\begin{equation} y_j = \begin{cases} 1 & j = c \\ 0 & j \neq c \end {cases} \qquad j=1,\ldots ,K \end{equation}
For example, with \(K=4\) classes, label \(c=3\) is encoded as \(\by =(0,0,1,0)^T\).
The softmax function produces a probability distribution over \(K\) classes that can be compared directly against this one-hot target.
Softmax function: Given a real-valued score \(s_i\) for each class, the softmax is defined by
\(\seteqnumber{0}{}{11}\)\begin{equation} \hat {s_i} = f_S(s_i) = \frac {\exp (s_i)}{\sum _{j=1}^{K}\exp (s_j)}\quad i=1,\ldots ,K \end{equation}
The corresponding loss (categorical cross-entropy) for a one-hot target \(\by \) is given by
\(\seteqnumber{0}{}{12}\)\begin{equation} \loss = -\sum _{j=1}^{K} y_j\log (\hat {s}_j) \end{equation}
which reduces to \(-\log (\hat {s}_c)\) for the true class \(c\).
For example, \({(1,2,8)}\rightarrow {(0.001,0.002,0.997)}\).
ArgMax The softmax may be viewed as a smooth approximation of the \(\arg \max \) operator: when one score dominates, its output is close to 1 while all others are close to 0.
Sigmoid The sigmoid is a special case of the softmax function with \(K=2\) and \(s_1=0\), since then \(\hat {s}_2 = \sigma (s_2)\).
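A sketch of softmax and one-hot encoding in NumPy; subtracting the maximum score before exponentiating is a standard numerical-stability trick added here, not part of the definition above:

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax: the shift by max(s) cancels in the
    ratio but prevents overflow in exp for large scores."""
    e = np.exp(s - np.max(s))
    return e / e.sum()

def one_hot(c, K):
    """One-hot encoding of a label c in {1, ..., K}, as in Eq. (12.11)."""
    y = np.zeros(K)
    y[c - 1] = 1.0
    return y

p = softmax(np.array([1.0, 2.0, 8.0]))   # the (1, 2, 8) example above
```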
12.4 Loss Function
The loss function measures the discrepancy between predictions and targets. The choice depends on the task:
• Regression – Mean Squared Error (MSE):
\(\seteqnumber{0}{}{13}\)\begin{equation} \loss _{\mathrm {MSE}} = \frac {1}{M}\sum _{i=1}^{M}(y_i - \hat {y}_i)^2 \end{equation}
Typically paired with a linear (identity) output activation. See Ch. 7 for derivation and properties.
• Binary classification – Binary Cross-Entropy (BCE):
\(\seteqnumber{0}{}{14}\)\begin{equation} \loss _{\mathrm {BCE}} = -\frac {1}{M}\sum _{i=1}^{M}\bigl [y_i\log \hat {y}_i + (1-y_i)\log (1-\hat {y}_i)\bigr ] \end{equation}
Paired with a sigmoid output activation. See Sec. 8.4 for derivation.
• Multi-class classification – Categorical Cross-Entropy (CCE):
\(\seteqnumber{0}{}{15}\)\begin{equation} \loss _{\mathrm {CCE}} = -\frac {1}{M}\sum _{i=1}^{M}\sum _{j=1}^{K} y_{ij}\log \hat {y}_{ij} \end{equation}
Paired with a softmax output activation.
Table 12.2 summarizes the common task–loss–activation combinations.
| Task | Loss | Output activation |
| Regression | MSE | Linear (identity) |
| Binary classification | BCE | Sigmoid |
| Multi-class classification | CCE | Softmax |
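The three losses can be sketched as follows; the `eps` clipping is an implementation detail added here to guard against \(\log (0)\), not part of the definitions:

```python
import numpy as np

def mse(y, y_hat):
    """Mean Squared Error, Eq. (12.14)."""
    return np.mean((y - y_hat) ** 2)

def bce(y, y_hat, eps=1e-12):
    """Binary Cross-Entropy, Eq. (12.15)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cce(Y, Y_hat, eps=1e-12):
    """Categorical Cross-Entropy, Eq. (12.16); Y has one-hot rows,
    Y_hat has softmax rows, both of shape (M, K)."""
    return -np.mean(np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0)), axis=1))
```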
12.5 Back-propagation
12.5.1 Concept
Goal: Compute the gradient of the loss with respect to every weight in the network; use it for gradient-descent training (Sec. 3.4).
Because every layer applies differentiable operations, the chain rule gives the partial derivative of the loss with respect to any weight:
\(\seteqnumber{0}{}{16}\)\begin{equation} \frac {\partial \loss (y,\hat {y})}{\partial w_{ij}^{[k]}} = \frac {\partial \loss (y,\hat {y})}{\partial \hat {y}} \frac {\partial \hat {y}}{\partial w_{ij}^{[k]}} \end{equation}
Gradient descent iteratively updates every weight in the negative-gradient direction. For non-convex losses the procedure converges to a local minimum. The training loop is:
1. Initialize weights.
2. Forward propagation \(\Rightarrow \) evaluate \(\loss ()\).
3. Back-propagation: compute \(\dfrac {\partial \loss (y,\hat {y})}{\partial w_{ij}^{[k]}}\) for all weights.
4. Update weights by GD.
5. Return to step 2.
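The five-step loop can be sketched for a single sigmoid neuron. This relies on the standard simplification that, for a sigmoid output with cross-entropy loss, the gradient with respect to the pre-activation is \(\hat {y}-y\); the data and hyperparameters are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: M = 100 samples, N = 2 features, linearly separable labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, alpha = np.zeros(2), 0.0, 0.5           # step 1: initialize
for _ in range(200):                          # steps 2-5
    y_hat = sigmoid(X @ w + b)                # step 2: forward propagation
    grad_a = (y_hat - y) / len(y)             # step 3: dL/da (sigmoid + BCE)
    w -= alpha * (X.T @ grad_a)               # step 4: GD update
    b -= alpha * grad_a.sum()

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == (y > 0.5))
```

On this separable toy set the loop drives the training accuracy close to 1.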
12.5.2 General Example
Consider MSE loss with a \(\frac {1}{2}\) scaling factor for convenience:
\(\seteqnumber{0}{}{17}\)\begin{equation} \loss (\by ,\hat {\by }) =\frac {1}{2M}\sum _{i=1}^{M}\left (y_i - \hat {y}_i\right )^2 \end{equation}
Its derivative with respect to the prediction is
\(\seteqnumber{0}{}{18}\)\begin{equation} \frac {\partial \loss (\by ,\hat {\by })}{\partial \hat {y}_i} =-\frac {1}{M}\left (y_i - \hat {y}_i\right ) \end{equation}
The output layer applies a sigmoid activation, \(\hat {y}_i=\sigma (a_i^{[l]})\), whose derivative is
\(\seteqnumber{0}{}{19}\)\begin{equation} \frac {\partial \hat {y}_i}{\partial a_i^{[l]}} =\sigma (a_i^{[l]})\bigl (1-\sigma (a_i^{[l]})\bigr ) \end{equation}
Combining via the chain rule:
\(\seteqnumber{0}{}{20}\)\begin{equation} \frac {\partial \loss (\by ,\hat {\by })}{\partial a_i^{[l]}} =\frac {\partial \loss (\by ,\hat {\by })}{\partial \hat {y}_i}\; \frac {\partial \hat {y}_i}{\partial a_i^{[l]}} \end{equation}
The pre-activation is \(a_i^{[l]}=\sum _j w_{ij}^{[l]}z_j^{[l-1]}+b_i^{[l]}\), so \(\frac {\partial a_i^{[l]}}{\partial w_{ij}^{[l]}}=z_j^{[l-1]}\). The full gradient is therefore
\(\seteqnumber{0}{}{21}\)\begin{equation} \frac {\partial \loss (\by ,\hat {\by })}{\partial w_{ij}^{[l]}} =\frac {\partial \loss (\by ,\hat {\by })}{\partial \hat {y}_i}\; \frac {\partial \hat {y}_i}{\partial a_i^{[l]}}\; \frac {\partial a_i^{[l]}}{\partial w_{ij}^{[l]}} \end{equation}
This pattern extends to deeper networks: the chain rule propagates the gradient backward through each layer.
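A sketch of the chain-rule gradient (12.22) for one output neuron and a single sample (\(M=1\)), verified against a finite difference; all numeric values are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One sigmoid output neuron, L = 0.5 * (y - yhat)^2, yhat = sigmoid(w^T z + b).
z = np.array([0.2, -0.4, 0.1])   # previous-layer output z^[l-1]
w = np.array([0.3, 0.5, -0.2])
b, y = 0.1, 1.0

def loss(w_):
    return 0.5 * (y - sigmoid(w_ @ z + b)) ** 2

y_hat = sigmoid(w @ z + b)
# Chain rule, Eq. (12.22): dL/dw_j = dL/dyhat * dyhat/da * da/dw_j
grad_w = -(y - y_hat) * y_hat * (1.0 - y_hat) * z

# Finite-difference check of the first component
h = 1e-6
e0 = np.array([h, 0.0, 0.0])
num = (loss(w + e0) - loss(w - e0)) / (2 * h)
```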
12.6 Learning
12.6.1 Stochastic and Mini-Batch Gradient Descent
Basic gradient descent (Section 3.4) applies a single, fixed learning rate to every parameter. In larger and deeper networks this can be inefficient. The modifications below address this issue.
Recall that batch GD computes the gradient over all \(M\) samples, mini-batch GD uses a random subset of size \(B\ll M\), and the special case \(B=1\) is stochastic gradient descent (SGD). These concepts, together with the notion of an epoch, are defined in Section 3.4.
Mini-batch GD algorithm:
• Initialize weights \(\bw \) and learning rate \(\alpha \).
• Repeat until a stop condition is met. In each epoch:
– Randomly shuffle the dataset.
– Partition it into mini-batches \((\bX _i,\by _i)\).
– For each mini-batch, update
\(\seteqnumber{0}{}{22}\)\begin{equation} \bw _{i+1} = \bw _{i} - \alpha \nabla \loss _\bw (\by _i,\hat {\by }_i)\nonumber \end{equation}
Properties of mini-batch GD:
• Shuffling makes each mini-batch approximately representative of the overall dataset.
• The noise in mini-batch gradients allows the optimizer to escape shallow local minima.
• Smaller batches may improve generalization [?].
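The shuffle-and-partition step can be sketched as follows (the helper name `minibatches` is illustrative):

```python
import numpy as np

def minibatches(X, y, B, rng):
    """Shuffle the dataset, then partition it into mini-batches of size B
    (the last batch may be smaller when B does not divide M)."""
    idx = rng.permutation(len(y))
    for start in range(0, len(y), B):
        sel = idx[start:start + B]
        yield X[sel], y[sel]

rng = np.random.default_rng(0)
X, y = np.arange(20.0).reshape(10, 2), np.arange(10.0)
batches = list(minibatches(X, y, B=4, rng=rng))   # sizes 4, 4, 2
```

Each epoch would call this generator once and apply one GD update per yielded batch.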
Advanced optimizers Several methods improve upon basic SGD by accumulating gradient history and/or adapting the learning rate per parameter. Notable examples include Momentum, RMSProp, and Adam (Adaptive Moment Estimation). Adam is currently the default choice in most deep-learning frameworks.
12.6.2 Learning control
Two hyperparameters govern the training loop (Table 12.3):
• the learning rate,
• the stopping criterion.
| Hyperparameter | Strategy | Description |
| Learning rate | Constant | Pre-defined fixed value |
| | Exponential decay | Decrease as a function of the epoch number |
| | Loss-based | Reduce when the loss plateaus for several epochs |
| Stopping | Fixed budget | Pre-defined number of epochs |
| | Early stopping | Halt when validation loss stops improving |
Instead of a fixed learning rate, \(\alpha \) may vary as a function of the iteration number \(k\),
\(\seteqnumber{0}{}{22}\)\begin{equation} \bw _{k+1} = \bw _{k} - \alpha _k\nabla \loss _\bw \nonumber \end{equation}
Common strategies include constant rate, step decay (reduce \(\alpha \) by a factor every few epochs), and exponential decay \(\alpha _k = \alpha _0 e^{-\lambda k}\).
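The two decay schedules can be sketched as follows (function names are illustrative):

```python
import numpy as np

def exp_decay(alpha0, lam, k):
    """Exponential decay: alpha_k = alpha_0 * exp(-lambda * k)."""
    return alpha0 * np.exp(-lam * k)

def step_decay(alpha0, factor, every, k):
    """Step decay: multiply alpha by `factor` once every `every` epochs."""
    return alpha0 * factor ** (k // every)
```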
12.7 Weight Initialization
Problem with ReLU With zero initialization the gradients are zero and no learning occurs. More generally, suppose we initialize all biases to 0 and all weights to some constant \(\alpha \). Forward-propagating an input \(\bx \) then gives every hidden unit the same output \(g(\bw ^T\bx )\), so all hidden units have identical influence on the cost and receive identical gradients. All neurons therefore evolve symmetrically throughout training, effectively preventing different neurons from learning different things.
The common approach is random initialization from either a Gaussian or a uniform distribution with suitably chosen parameters. Examples are the Xavier/Glorot (Gaussian or uniform) and He (Gaussian or uniform) initializations [?]. The methods differ in how the distribution parameters are set from the number of neurons in the current and previous layers.
Xavier initialization An example is Xavier initialization (or one of its derived methods), which for every layer \(l\) uses
\(\seteqnumber{0}{}{22}\)\begin{equation} \begin{aligned} \bW ^{[l]}&\sim N(\mu =0,\sigma ^2 = \frac {1}{n^{[l-1]}})\\ b^{[l]} &= 0 \end {aligned} \end{equation}
where \(n^{[l-1]}\) is the number of neurons in layer \(l-1\) (the previous layer).
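A sketch of this initialization in NumPy (the helper name `xavier_init` is illustrative):

```python
import numpy as np

def xavier_init(n_prev, n_cur, rng):
    """W^[l] ~ N(0, 1/n^[l-1]) elementwise; biases start at zero."""
    W = rng.normal(0.0, np.sqrt(1.0 / n_prev), size=(n_prev, n_cur))
    b = np.zeros(n_cur)
    return W, b

rng = np.random.default_rng(0)
W, b = xavier_init(1000, 500, rng)   # layer with 1000 inputs, 500 neurons
```

With large layers the empirical standard deviation of \(\bW \) is close to \(\sqrt{1/n^{[l-1]}}\), keeping activation variance roughly constant across layers.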
12.8 Summary
• Data
– Train/test split, plus a validation set for hyper-parameter evaluation
– Batch size
• NN model
– Neural layers: number of layers, number of neurons in each layer, activation functions
– Special layers
– Output layer that matches the loss function
– Expressive power can grow exponentially with the number of layers
• Loss
– Minimization goal according to the task, e.g. regression (MSE) or classification (CE)
• Metric
– Performance evaluation, e.g. RMSE for regression or accuracy for classification
• Learning algorithm: Adam, ...
– Learning rate
– Optimizer hyper-parameters (rarely tuned)
– Learning rate control, e.g. exponential decay
– Early stopping by loss or metric convergence
– Number of epochs (GD steps)