Machine Learning & Signals Learning
6 Kernels
The goal is to effectively apply non-linear models.
6.1 Uni-variate Polynomial Model
The \(L\)-degree uni-variate polynomial regression model is
\(\seteqnumber{0}{}{0}\)\begin{equation} \begin{aligned}\label {eq:poly:poly_pred} \hat {y} = f(x;\bw ) &= w_0 + w_1x + w_2x^2 + \cdots + w_Lx^L\\ &=\sum _{j=0}^{L} w_jx^j \end {aligned} \end{equation}
This problem is linear under the change of variables \(z_{j} = x^j,\ j=0,\ldots ,L\),
\(\seteqnumber{0}{}{1}\)\begin{equation} \hat {y} =\sum _{j=0}^{L} w_jz_{j} \end{equation}
The corresponding prediction for the dataset \(\left \{x_k,y_k\right \}_{k=1}^M\) can be written compactly in matrix notation, using the so-called Vandermonde matrix,
\(\seteqnumber{0}{}{2}\)\begin{equation} \label {eq-vandermonde} \renewcommand *{\arraystretch }{1.3} \bX = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^L\\ 1 & x_2 & x_2^2 & \cdots & x_2^L\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_M & x_M^2 & \cdots & x_M^L\\ \end {bmatrix} \end{equation}
The weights vector \(\bw \) then follows from the standard least-squares formula.
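A minimal NumPy sketch of this fit; the toy data and degree below are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Illustrative toy data: noisy samples of a cubic polynomial.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * rng.standard_normal(x.size)

L = 3  # polynomial degree
# Vandermonde matrix: column j holds x**j, matching the model above.
X = np.vander(x, N=L + 1, increasing=True)  # shape (M, L + 1)

# Standard least-squares solution of min_w ||X w - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w  # fitted values
```

Because the model is linear in \(\bw \) after the change of variables, an ordinary least-squares solver suffices despite the non-linear dependence on \(x\).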
6.2 Mapping function
The data-linearization results (e.g., in (6.3)) can be formulated by a mapping function.
Univariate polynomial mapping The polynomial model uses a uni-variate to multi-variate mapping, \(\phi (\cdot ):\real \rightarrow \real ^{L+1}\),
\(\seteqnumber{0}{}{3}\)\begin{equation} \phi (x) = \begin{bmatrix} 1 & x & x^2 & \cdots & x^L \end {bmatrix} \end{equation}
Applying \(\phi \) to each value in \(\left \{x_k\right \}_{k=1}^M\) builds the matrix \(\bX \) in (6.3).
Multivariate mapping example More general multi-variate to multi-variate mappings \(\phi (\cdot ):\real ^N\rightarrow \real ^L\) can be applied. For the regression model
\(\seteqnumber{0}{}{4}\)\begin{equation*} f(\bx ;\bw ) = w_0 + w_1x_1 + w_2x_2 + w_3x_1x_2 + w_4x_1^2 + w_5x_2^2 \end{equation*}
the corresponding \(N=2\) and \(L=6\) mapping function (ellipse equation elements) is
\(\seteqnumber{0}{}{4}\)\begin{equation} \phi (\bx ) = \phi (x_{1},x_{2}) = \begin{bmatrix} 1, x_{1}, x_{2}, x_{1}x_{2}, x_{1}^2,x_{2}^2 \end {bmatrix} \end{equation}
that for each value in \(\left \{x_{1_k},x_{2_k}\right \}_{k=1}^M\) produces the corresponding matrix
\(\seteqnumber{0}{}{5}\)\begin{equation} \label {eq-mapping-example-X} \phi (\bX ) = \bX _\phi = \begin{bmatrix} 1 & x_{1_1} & x_{2_1} & x_{1_1}x_{2_1} & x_{1_1}^2 & x_{2_1}^2 \\ 1 & x_{1_2} & x_{2_2} & x_{1_2}x_{2_2} & x_{1_2}^2 & x_{2_2}^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & x_{1_M} & x_{2_M} & x_{1_M}x_{2_M} & x_{1_M}^2 & x_{2_M}^2 \\ \end {bmatrix} \end{equation}
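A minimal sketch of this mapping; the helper name `phi` and the two data points are illustrative:

```python
import numpy as np

def phi(X):
    # Map each row (x1, x2) to the six ellipse-equation features
    # [1, x1, x2, x1*x2, x1**2, x2**2].
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

# Two illustrative data points, one per row.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
X_phi = phi(X)  # shape (M, 6), one mapped row per data point
```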
Problem
Even for a relatively small \(N\), a multivariate mapping produces a large number of features (large \(L\)), which is less convenient to implement. Moreover, for \(\bX \in \real ^{M\times L}\), the matrix \(\bX ^T\bX \in \real ^{L\times L}\) becomes prohibitive in both memory and computation.
Example of \(\bX ^T\bX \) For \(L=3\) and \(\bX = \begin {bmatrix} \bx _1 & \bx _2 & \bx _3 \end {bmatrix}\in \real ^{M\times 3}\) (using (3.7) vector notation) the resulting \(\bX ^T\bX \in \real ^{3\times 3}\),
\(\seteqnumber{0}{}{6}\)\begin{align} \label {eq:kernel:xtx} \bX ^T\bX &= \begin{bmatrix} \bx _1^T \\ \bx _2^T \\ \bx _3^T \end {bmatrix} \begin{bmatrix}\bx _1 & \bx _2 & \bx _3\end {bmatrix} = \begin{bmatrix} \bx _1^T\bx _1 & \bx _1^T\bx _2 & \bx _1^T\bx _3\\ \bx _2^T\bx _1 & \bx _2^T\bx _2 & \bx _2^T\bx _3 \\ \bx _3^T\bx _1 & \bx _3^T\bx _2 & \bx _3^T\bx _3 \\ \end {bmatrix}, \end{align} where \(\bx _j^T\bx _k\in \real \) is a scalar.
6.3 Kernel Trick
The goal is:

• To replace \(\bX ^T\bX \in \real ^{L\times L}\) with \(\bX \bX ^T\in \real ^{M\times M}\), a matrix that is advantageous in the \(L>M\) case.

• To use arbitrarily high \(L\).

• To provide vector-based and more general expressions for \(\phi (\bx )\).
6.3.1 LS Identity
The method is based on the identity
\(\seteqnumber{0}{}{7}\)\begin{equation} \begin{aligned} \bw &= \bX ^T\left (\bX \bX ^T\right )^{-1}\by \\ &=\left (\bX ^T\bX \right )^{-1}\bX ^T \by \end {aligned} \end{equation}
The key property of this representation is that the \(M\times M\) dimensions are independent of the number of original features \(N\) and mapped features \(L\). The prediction
\[ \hat {\by } = \bX \bw \]
is independent of the way \(\bw \) is evaluated.
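A quick numerical check of this identity; adding a small ridge term \(\lambda \mathbf {I}\) (as in the regularized form used later) keeps both inverses well defined, and the data is synthetic:

```python
import numpy as np

# Synthetic data; small shapes keep the comparison cheap.
rng = np.random.default_rng(1)
M, L = 8, 5
X = rng.standard_normal((M, L))
y = rng.standard_normal(M)
lam = 1e-3  # small ridge term so both matrices are invertible

# "Kernel-side" form: only an M x M system is solved.
w_kernel = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(M), y)
# Classical form: an L x L system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(L), X.T @ y)
# Both forms yield the same weight vector.
```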
Example of \(\bX \bX ^T\) for \(N=3\) For \(\bX = \begin {bmatrix} \bx _1 & \bx _2 & \bx _3 \end {bmatrix}\in \real ^{M\times 3}\) (using (3.7) vector notation) the resulting \(\bX \bX ^T\in \real ^{M\times M}\),
\(\seteqnumber{0}{}{8}\)\begin{align} \label {eq:kernel:xxt} \bX \bX ^T &= \begin{bmatrix}\bx _1 & \bx _2 & \bx _3\end {bmatrix} \begin{bmatrix} \bx _1^T \\ \bx _2^T \\ \bx _3^T \end {bmatrix} = \bx _1\bx _1^T + \bx _2\bx _2^T + \bx _3\bx _3^T \end{align} where each term \(\bx _j\bx _j^T\in \real ^{M\times M}\) is an outer product.
6.3.2 Kernel Matrix
The mapping function \(\phi (\cdot ):\real ^{M\times N}\rightarrow \real ^{M\times L}\) is applied to the full data matrix \(\bX \) (as in (6.6))
\(\seteqnumber{0}{}{9}\)\begin{equation} \bX _\phi = \phi (\bX ) \end{equation}
By using the identity
\(\seteqnumber{0}{}{10}\)\begin{equation} \phi \left (\bX ^T\right ) = \phi \left (\bX \right )^T \end{equation}
the resulting kernel matrix \(\bK \in \real ^{M\times M}\) is defined as
\(\seteqnumber{0}{}{11}\)\begin{equation} \label {eq-kernel-matrix} \bK (\bX ,\bX ) = \phi \left (\bX \right )\phi \left (\bX \right )^T \end{equation}
Kernel matrix properties:

• Symmetry: \(\bK = \bK ^T\).

• Positive semi-definiteness: \(\bx ^T\bK \bx \ge 0\) for all \(\bx \), so all eigenvalues are non-negative.
6.3.3 Handling Overfitting
Large \(L\) values, e.g. in the polynomial model, result in a high potential for overfitting. Therefore, ridge regression (5.11) is applied together with the kernel trick,
\(\seteqnumber{0}{}{12}\)\begin{equation} \label {eq:kernels:w} \bw = \bX ^T\left (\bX \bX ^T +\lambda \mathbf {I}\right )^{-1} \by \end{equation}
6.3.4 Kernel-Based Model
The kernel-trick relation in Eq. (6.13) is calculated by the following steps:

1. Evaluate the kernel matrix \(\bK \) using (6.12).

2. Calculate \(\balpha \),
\(\seteqnumber{0}{}{13}\)\begin{equation} \label {eq:kernels:alpha} \bm {\alpha } = (\mathbf {K}+\lambda \mathbf {I})^{-1} \by \end{equation}
where \(\bK \) replaces \(\bX \bX ^T\), yielding \(\balpha \in \real ^{M}\). After this step, \(\bK \) no longer needs to be stored. The weight vector \(\bw \in \real ^{L}\) can then (theoretically) be recovered as
\(\seteqnumber{0}{}{14}\)\begin{equation} \bw = \varphi \left (\bX \right )^T\bm {\alpha } = \sum _{j=1}^M\alpha _j\varphi \left (\tilde {\bx }_j\right ) \end{equation}
but is not required for prediction.
3. Model inference: the regression for a new test point \(\tilde {\bx }_0\) is evaluated by
\(\seteqnumber{0}{}{15}\)\begin{equation} \begin{aligned}\label {eq:kernels:pred} \hat {y}_0 &= \varphi (\tilde {\bx }_0)^T\bw \\ &= \sum _{j=1}^M\alpha _j\varphi (\tilde {\bx }_0)^T\varphi \left (\tilde {\bx }_j\right )\\ &= \varphi (\tilde {\bx }_0)^T\varphi \left (\bX \right )\balpha \\ &=K\left (\tilde {\bx }_0,\bX \right )\balpha \end {aligned} \end{equation}
where
\(\seteqnumber{0}{}{16}\)\begin{equation} \begin{aligned} \varphi (\tilde {\bx }_0)^T&\in \real ^{1\times L},\varphi (\tilde {\bx }_0) \in \real ^{L\times 1}\\ \varphi \left (\bX \right )^T &\in \real ^{L\times M},\varphi \left (\bX \right )\in \real ^{M\times L}\\ K\left (\tilde {\bx }_0,\bX \right )&\in \real ^{1\times M}\\ \end {aligned} \end{equation}
No \(L\)-dimensional calculations are required, and \(\bw \) need not be evaluated explicitly.
In summary, the order is:

1. Kernel matrix, (6.12).

2. \(\balpha \), (6.14).

3. For test data \(\tilde {\bX }_0\), the prediction (6.16) becomes
\(\seteqnumber{0}{}{17}\)\begin{equation} \hat {\by }_0 = K\left (\tilde {\bX }_0,\bX \right )\balpha \end{equation}
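The three numbered steps can be sketched end-to-end; the polynomial kernel choice, toy target, and \(\lambda \) value below are illustrative assumptions:

```python
import numpy as np

def poly_kernel(A, B, c=1.0, d=2):
    # Polynomial kernel: K_{ij} = (a_i . b_j + c)^d, row against row.
    return (A @ B.T + c) ** d

# Illustrative data: the target is quadratic, so a d = 2 kernel can fit it.
rng = np.random.default_rng(2)
M, N = 30, 2
X_train = rng.standard_normal((M, N))
y_train = X_train[:, 0] ** 2 + X_train[:, 1]
lam = 1e-6

# Step 1: kernel matrix K(X, X) in R^{M x M}.
K = poly_kernel(X_train, X_train)
# Step 2: alpha = (K + lam I)^{-1} y.
alpha = np.linalg.solve(K + lam * np.eye(M), y_train)
# Step 3: prediction for test points -- no L-dimensional objects appear.
X_test = rng.standard_normal((5, N))
y_hat = poly_kernel(X_test, X_train) @ alpha
```

Note that only \(M\times M\) and \(M_{\text {test}}\times M\) objects are formed; the mapped feature dimension never appears explicitly.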
6.4 Kernel Types
6.4.1 Polynomial Kernel
\(\seteqnumber{0}{}{18}\)\begin{equation} K(\bx ,\by ) = \left (\bx ^\mathsf {T} \by + c\right )^{d} \end{equation}
where \(c\ge 0\) is a constant (optionally \(c=0\)) and \(d\) is the polynomial degree.
The number of mapped dimensions is \(L=\displaystyle \binom {d+N-1}{d}\). For example, for \(N=10\) and \(d = 4\), \(L=\binom {13}{4}=715\).
Example for \(c=0\), \(d=3\), and \(N=2\):
\(\seteqnumber{0}{}{19}\)\begin{align*} K(\ba ,\bb ) &=\left (\ba ^T\bb \right )^3 = \left (\left [a_1, a_2\right ] \begin{bmatrix} b_1 \\ b_2 \end {bmatrix} \right )^3\\ &=\left (a_1b_1 + a_2b_2\right )^3\\ &= a_1^3b_1^3 + 3a_1^2b_1^2a_2b_2 + 3a_1b_1a_2^2b_2^2 + a_2^3b_2^3\\ &=\left [a_1^3, \sqrt {3}a_1^2a_2,\sqrt {3}a_1a_2^2,a_2^3\right ]\cdot \left [b_1^3, \sqrt {3}b_1^2b_2,\sqrt {3}b_1b_2^2,b_2^3\right ]^T\\ &=\varphi (\ba )^T\varphi (\bb ) \end{align*}
Example for \(c=1\), \(d=2\), and an arbitrary input dimension (denoted \(L\) below):
\(\seteqnumber{0}{}{19}\)\begin{align} K(\ba ,\bb ) &=\left (\ba ^T\bb + 1\right )^2\\ &=\left (\ba ^T\bb \right )^2 + 2\ba ^T\bb + 1\nonumber \\ &=\left (\sum _{i=1}^{L}a_ib_i\right )^2 + 2\sum _{i=1}^{L}a_ib_i + 1\nonumber \\ &=\sum _{i=1}^{L}a_ib_i\sum _{j=1}^{L}a_jb_j + 2\sum _{i=1}^{L}a_ib_i + 1\nonumber \\ &=\sum _{i=1}^{L}a_i^2b_i^2 + 2\sum _{i=1}^{L}\sum _{j=i+1}^{L}a_ib_ia_jb_j + 2\sum _{i=1}^{L}a_ib_i + 1\nonumber \\ &=\varphi (\ba )^T\varphi (\bb )\nonumber \end{align} That corresponds to the mapping functions
\(\seteqnumber{0}{}{20}\)\begin{align*} \varphi (\ba ) &= \langle 1,\sqrt {2}a_1,\sqrt {2}a_2,\ldots , \sqrt {2}a_L, a_1^2,a_2^2,\ldots ,a_L^2,\\ &\quad \sqrt {2}a_1a_2,\sqrt {2}a_1a_3,\ldots ,\sqrt {2}a_1a_L,\sqrt {2}a_2a_3,\ldots , \sqrt {2}a_{L-1}a_L\rangle \\ \varphi (\bb ) &= \langle \underbrace {1}_1,\underbrace {\sqrt {2}b_1,\sqrt {2}b_2,\ldots , \sqrt {2}b_L}_{2\sum _{i=1}^{L}a_ib_i}, \underbrace {b_1^2,b_2^2,\ldots ,b_L^2}_{\sum _{i=1}^{L}a_i^2b_i^2},\\ &\quad \underbrace {\sqrt {2}b_1b_2,\sqrt {2}b_1b_3,\ldots ,\sqrt {2}b_1b_L,\sqrt {2}b_2b_3,\ldots , \sqrt {2}b_{L-1}b_L}_{2\sum _{i=1}^{L}\sum _{j=i+1}^{L}a_ib_ia_jb_j}\rangle \end{align*} with all the combinations above.
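The \(c=0\), \(d=3\) expansion above can be verified numerically; the helper `phi3` implements the derived four-component map, and the test vectors are arbitrary:

```python
import numpy as np

def phi3(v):
    # Explicit feature map for K(a, b) = (a^T b)^3 with N = 2, as derived above.
    a1, a2 = v
    return np.array([a1**3, np.sqrt(3) * a1**2 * a2,
                     np.sqrt(3) * a1 * a2**2, a2**3])

a = np.array([0.5, -1.5])
b = np.array([2.0, 0.25])

k_direct = (a @ b) ** 3        # kernel evaluated directly in R^2
k_mapped = phi3(a) @ phi3(b)   # inner product in the 4-D mapped space
```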
6.4.2 Gaussian Radial Basis Kernel (RBK)
The kernel definition is
\(\seteqnumber{0}{}{20}\)\begin{equation} K\!\left (\bx ,\by \right ) = \exp \left (-\frac {\norm {\bx - \by }^2}{2b}\right ) \end{equation}
Due to the Taylor expansion
\[ \exp (x) = 1 + x + \frac {x^2}{2!} + \frac {x^3}{3!} + \cdots \]
this kernel corresponds to \(L\rightarrow \infty \).
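A sketch of evaluating the RBF kernel matrix, which also exhibits the symmetry and positive semi-definiteness properties of Section 6.3.2; the data is synthetic:

```python
import numpy as np

def rbf_kernel(A, B, b=0.5):
    # K_{ij} = exp(-||a_i - b_j||^2 / (2 b)), using the expansion
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 for vectorization.
    sq = (A**2).sum(1)[:, None] - 2.0 * A @ B.T + (B**2).sum(1)[None, :]
    return np.exp(-sq / (2.0 * b))

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 2))
K = rbf_kernel(X, X)  # symmetric, PSD, with unit diagonal
```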
Typically:

• Too low \(b\) (\(b\rightarrow 0\)):

– The kernel becomes \(\delta \)-function-like.

– Each training point affects only its close neighbors.

– Overfitting (high variance, low bias).

• Too high \(b\) (\(b\rightarrow \infty \)):

– The kernel becomes nearly constant, \(\bK _{ij}\approx 1\).

– All training points affect the whole domain.

– Underfitting (low variance, high bias).
The numerical example (same dataset as in Fig. 5.6) is presented in Fig. 6.1.
6.4.3 Summary
• Allows a virtually unlimited number of mapped features \(L\).

• Requires regularization parameter (\(\lambda \)) tuning.

• The kernel type is a hyper-parameter, and each kernel has its own hyper-parameters.

• Somewhat limited by the \(M\times M\) kernel matrix size.

• Since \(\bw \) is not evaluated explicitly, the influence of each \(\bx _j\) and the inter-data relations between \(\bx _j\) and \(\bx _k\) remain unknown.
6.5 Kernel Smoothing
Nadaraya–Watson smoothing For the point of interest \(\tilde {\bx }_0\), kernel smoothing is a weighted average given by
\(\seteqnumber{0}{}{21}\)\begin{equation} \label {eq:kernels:nadaraya} \hat {y}_0 = \sum _{j=1}^M w_jy_j \end{equation}
where the weights are
\(\seteqnumber{0}{}{22}\)\begin{equation} \tilde {\bw }= K\!\left (\tilde {\bx }_0,\bX \right ) \end{equation}
with normalization
\(\seteqnumber{0}{}{23}\)\begin{equation} w_j = \frac {\tilde {w}_j}{\sum _{i=1}^{M}\tilde {w}_i} \end{equation}
Finally, for a test dataset \(\bX _{\text {test}}\) (in matrix notation),
\(\seteqnumber{0}{}{24}\)\begin{align} \tilde {\bW }& = K\!\left (\bX _{\text {test}},\bX _{\text {train}}\right ) &\in \real ^{M_{\text {test}}\times M_{\text {train}}} \\ \bW &= \tilde {\bW }/\tilde {\bW }\bOne _{M_{\text {train}}}\\ \hat {\by }_{\text {test}} &= \bW \by _{\text {train}} \end{align}
The numerical example is presented in Fig. 6.2.
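The matrix relations above can be sketched as follows; the 1-D dataset, target function, and bandwidth are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, b=0.02):
    # Gaussian RBF weights with bandwidth b (an illustrative choice).
    sq = (A**2).sum(1)[:, None] - 2.0 * A @ B.T + (B**2).sum(1)[None, :]
    return np.exp(-sq / (2.0 * b))

# Illustrative 1-D dataset: noiseless samples of sin(2x).
X_train = np.linspace(-2.0, 2.0, 80)[:, None]
y_train = np.sin(2.0 * X_train[:, 0])
X_test = np.linspace(-1.5, 1.5, 25)[:, None]

# Unnormalized weights, row normalization, then the weighted average.
W_tilde = rbf_kernel(X_test, X_train)              # (M_test, M_train)
W = W_tilde / W_tilde.sum(axis=1, keepdims=True)   # rows sum to one
y_hat = W @ y_train
```

Unlike kernel ridge regression, no linear system is solved; prediction is pure weighted averaging of the training targets.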
Summary
• Low numerical complexity.

• The kernel type is a hyper-parameter, and each kernel has its own hyper-parameters; no additional method hyper-parameters are required.

• Non-trivial hyper-parameter optimization on multivariate datasets.

• Requires a sufficiently dense dataset.

• Captures nonlinear relationships without assuming a specific parametric form.