Machine Learning & Signals Learning


6 Kernels

The goal of this chapter is to apply non-linear models effectively, while reusing the machinery of linear regression.

6.1 Uni-variate Polynomial Model

The \(L\)-degree uni-variate polynomial regression model\(^{1}\) is

\begin{equation} \begin{aligned}\label {eq:poly:poly_pred} \hat {y} = f(x;\bw ) &= w_0 + w_1x + w_2x^2 + \cdots + w_Lx^L\\ &=\sum _{j=0}^{L} w_jx^j \end {aligned} \end{equation}

This problem is linear under the change of variables \(z_{j} = x^j,\ j=0,\ldots ,L\),

\begin{equation} \hat {y} =\sum _{j=0}^{L} w_jz_{j} \end{equation}

The corresponding prediction for the dataset \(\left \{x_k,y_k\right \}_{k=1}^M\) can be written compactly in matrix notation, using the Vandermonde matrix,

\begin{equation} \label {eq-vandermonde} \renewcommand *{\arraystretch }{1.3} \bX = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^L\\ 1 & x_2 & x_2^2 & \cdots & x_2^L\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_M & x_M^2 & \cdots & x_M^L\\ \end {bmatrix} \end{equation}

The weights vector \(\bw \) then follows from the standard least-squares formula.

\(^{1}\) The model is also presented in Sec. 4.1. The selection of \(L\) is discussed in Ch. 5.
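A minimal numerical sketch of this fit, assuming numpy (the toy dataset and degree below are hypothetical):

```python
import numpy as np

# Hypothetical toy data: noisy cubic samples
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = x**3 - x + 0.05 * rng.standard_normal(30)

L = 3                                       # polynomial degree
X = np.vander(x, L + 1, increasing=True)    # Vandermonde matrix, as in (6.3)
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # standard least-squares weights
y_hat = X @ w                               # predictions of the model (6.1)
```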

6.2 Mapping Function

The data-linearization construction (e.g., in (6.3)) can be formalized by a mapping function.

Univariate polynomial mapping The polynomial model uses a uni-variate to multi-variate mapping, \(\phi (\cdot ):\real \rightarrow \real ^{L+1}\),

\begin{equation} \phi (x) = \begin{bmatrix} 1 & x & x^2 & \cdots & x^L \end {bmatrix} \end{equation}

Applying \(\phi \) to each value in \(\left \{x_k\right \}_{k=1}^M\) builds the matrix \(\bX \) in (6.3).

Multivariate mapping example More general multi-variate to multi-variate mappings \(\phi (\cdot ):\real ^N\rightarrow \real ^L\) can be applied. For the regression model

\begin{equation*} f(\bx ;\bw ) = w_0 + w_1x_1 + w_2x_2 + w_3x_1x_2 + w_4x_1^2 + w_5x_2^2 \end{equation*}

the corresponding mapping function with \(N=2\) and \(L=6\) (the terms of a general ellipse equation) is

\begin{equation} \phi (\bx ) = \phi (x_{1},x_{2}) = \begin{bmatrix} 1, x_{1}, x_{2}, x_{1}x_{2}, x_{1}^2,x_{2}^2 \end {bmatrix} \end{equation}

that for each value in \(\left \{x_{1_k},x_{2_k}\right \}_{k=1}^M\) produces the corresponding matrix

\begin{equation} \label {eq-mapping-example-X} \phi (\bX ) = \bX _\phi = \begin{bmatrix} 1 & x_{1_1} & x_{2_1} & x_{1_1}x_{2_1} & x_{1_1}^2 & x_{2_1}^2 \\ 1 & x_{1_2} & x_{2_2} & x_{1_2}x_{2_2} & x_{1_2}^2 & x_{2_2}^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 1 & x_{1_M} & x_{2_M} & x_{1_M}x_{2_M} & x_{1_M}^2 & x_{2_M}^2 \\ \end {bmatrix} \end{equation}
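A sketch of this mapping as code, assuming numpy (the function name `phi` is illustrative):

```python
import numpy as np

def phi(X):
    """Row-wise mapping of (6.5): [1, x1, x2, x1*x2, x1^2, x2^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

X = np.random.default_rng(1).standard_normal((5, 2))  # M=5 samples, N=2
X_phi = phi(X)                  # shape (M, L) = (5, 6), as in (6.6)
```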

Problem

Even for a relatively small \(N\), a multivariate mapping produces a large number of features (large \(L\)), which is inconvenient to implement. Moreover, the matrix \(\bX \in \real ^{M\times L}\) leads to the memory- and computation-prohibitive matrix \(\bX ^T\bX \in \real ^{L\times L}\).

Example of \(\bX ^T\bX \) For \(L=3\) and \(\bX = \begin {bmatrix} \bx _1 & \bx _2 & \bx _3 \end {bmatrix}\in \real ^{M\times 3}\) (using (3.7) vector notation) the resulting \(\bX ^T\bX \in \real ^{3\times 3}\),

\begin{align} \label {eq:kernel:xtx} \bX ^T\bX &= \begin{bmatrix} \bx _1^T \\ \bx _2^T \\ \bx _3^T \end {bmatrix} \begin{bmatrix}\bx _1 & \bx _2 & \bx _3\end {bmatrix} = \begin{bmatrix} \bx _1^T\bx _1 & \bx _1^T\bx _2 & \bx _1^T\bx _3\\ \bx _2^T\bx _1 & \bx _2^T\bx _2 & \bx _2^T\bx _3 \\ \bx _3^T\bx _1 & \bx _3^T\bx _2 & \bx _3^T\bx _3 \\ \end {bmatrix}, \end{align} where \(\bx _j^T\bx _k\in \real \) is a scalar.

6.3 Kernel Trick

The goals are:

  • To replace \(\bX ^T\bX \in \real ^{L\times L}\) with \(\bX \bX ^T\in \real ^{M\times M}\), a matrix that is advantageous in the \(L>M\) case.

  • To use arbitrarily high \(L\).

  • To provide vector-based and more general expressions for \(\phi (\bx )\).

6.3.1 LS Identity

The basis of the method is the identity (valid whenever both inverses exist)

\begin{equation} \begin{aligned} \bw &= \bX ^T\left (\bX \bX ^T\right )^{-1}\by \\ &=\left (\bX ^T\bX \right )^{-1}\bX ^T \by \end {aligned} \end{equation}

The key property of the first representation is that the inverted matrix is \(M\times M\); its size is independent of the number of original features \(N\) and mapped features \(L\). The prediction

\[ \hat {\by } = \bX \bw \]

is independent of the way \(\bw \) is evaluated.

Example of \(\bX \bX ^T\) for \(L=3\) For \(\bX = \begin {bmatrix} \bx _1 & \bx _2 & \bx _3 \end {bmatrix}\in \real ^{M\times 3}\) (using (3.7) vector notation) the resulting \(\bX \bX ^T\in \real ^{M\times M}\),

\begin{align} \label {eq:kernel:xxt} \bX \bX ^T &= \begin{bmatrix}\bx _1 & \bx _2 & \bx _3\end {bmatrix} \begin{bmatrix} \bx _1^T \\ \bx _2^T \\ \bx _3^T \end {bmatrix} = \bx _1\bx _1^T + \bx _2\bx _2^T + \bx _3\bx _3^T \end{align} where each term \(\bx _j\bx _j^T\in \real ^{M\times M}\) is an outer product.
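The two Gram matrices can be contrasted numerically; a minimal sketch (assuming numpy):

```python
import numpy as np

M, L = 100, 3
X = np.random.default_rng(2).standard_normal((M, L))

G_feat = X.T @ X    # (L, L): inner products of feature columns, as in (6.7)
G_data = X @ X.T    # (M, M): inner products of data rows, as in (6.9)
print(G_feat.shape, G_data.shape)   # (3, 3) (100, 100)
```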

6.3.2 Kernel Matrix

The mapping function \(\phi (\cdot ):\real ^{M\times N}\rightarrow \real ^{M\times L}\) is applied to the full data matrix \(\bX \) (as in (6.6))

\begin{equation} \bX _\phi = \phi (\bX ) \end{equation}

By using the identity

\begin{equation} \phi \left (\bX ^T\right ) = \phi \left (\bX \right )^T \end{equation}

the resulting kernel matrix \(\bK \in \real ^{M\times M}\) is defined as

\begin{equation} \label {eq-kernel-matrix} \bK (\bX ,\bX ) = \phi \left (\bX \right )\phi \left (\bX \right )^T \end{equation}

Kernel matrix properties (verified numerically after the list):

  • Symmetry, \(\bK = \bK ^T\)

  • Positive semi-definiteness, \(\bx ^T\bK \bx \ge 0\) for any \(\bx \in \real ^{M}\), so all eigenvalues are non-negative.
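A minimal check of both properties, assuming numpy (a random matrix stands in for the mapped data \(\phi (\bX )\)):

```python
import numpy as np

rng = np.random.default_rng(3)
X_phi = rng.standard_normal((20, 6))          # stand-in for phi(X): M=20, L=6

K = X_phi @ X_phi.T                           # kernel matrix (6.12), shape (M, M)
assert np.allclose(K, K.T)                    # symmetry
assert np.linalg.eigvalsh(K).min() > -1e-10   # PSD: eigenvalues >= 0 up to round-off
```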

6.3.3 Handling Overfitting

Large \(L\) values, e.g., in the polynomial model, carry a high potential for overfitting. Therefore, ridge regression (5.11) is applied together with the kernel trick,

\begin{equation} \label {eq:kernels:w} \bw = \bX ^T\left (\bX \bX ^T +\lambda \mathbf {I}\right )^{-1} \by \end{equation}
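The agreement between (6.13) and the primal ridge solution \(\left (\bX ^T\bX +\lambda \bI \right )^{-1}\bX ^T\by \) can be checked numerically; a sketch (assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(4)
M, L, lam = 15, 40, 0.1                      # L > M: the case the kernel trick targets
X = rng.standard_normal((M, L))
y = rng.standard_normal(M)

w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(M), y)    # (6.13): M x M system
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(L), X.T @ y)  # ridge:  L x L system
assert np.allclose(w_dual, w_primal)
```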

6.3.4 Kernel-Based Model

The calculation of the kernel trick relation in Eq. (6.13) is performed by the following steps:

  • 1. Evaluate the kernel matrix \(\bK \) using (6.12).

  • 2. Calculate \(\balpha \),

    \begin{equation} \label {eq:kernels:alpha} \bm {\alpha } = (\mathbf {K}+\lambda \mathbf {I})^{-1} \by \end{equation}

    where \(\bK \) replaces \(\bX \bX ^T\), yielding \(\balpha \in \real ^{M}\). After this step, \(\bK \) no longer needs to be stored. The weight vector \(\bw \in \real ^{L}\) can then (theoretically) be recovered as

    \begin{equation} \bw = \varphi \left (\bX \right )^T\bm {\alpha } = \sum _{j=1}^M\alpha _j\varphi \left (\tilde {\bx }_j\right ) \end{equation}

    but is not required for prediction.

  • 3. Model inference: The regression for a new test point \(\tilde {\bx }_0\) is evaluated by

    \begin{equation} \begin{aligned}\label {eq:kernels:pred} \hat {y}_0 &= \varphi (\tilde {\bx }_0)^T\bw \\ &= \sum _{j=1}^M\alpha _j\varphi (\tilde {\bx }_0)^T\varphi \left (\tilde {\bx }_j\right )\\ &= \varphi (\tilde {\bx }_0)^T\varphi \left (\bX \right )^T\balpha \\ &=K\left (\tilde {\bx }_0,\bX \right )\balpha \end {aligned} \end{equation}

    where

    \begin{equation} \begin{aligned} \varphi (\tilde {\bx }_0)^T&\in \real ^{1\times L},\varphi (\tilde {\bx }_0) \in \real ^{L\times 1}\\ \varphi \left (\bX \right )^T &\in \real ^{L\times M},\varphi \left (\bX \right )\in \real ^{M\times L}\\ K\left (\tilde {\bx }_0,\bX \right )&\in \real ^{1\times M}\\ \end {aligned} \end{equation}

    No \(L\)-dimensional calculations are required, and \(\bw \) need not be evaluated explicitly.

In summary, the order is (a minimal sketch follows the list):

  • 1. Kernel matrix, (6.12).

  • 2. \(\balpha \), (6.14).

  • 3. For test data \(\tilde {\bX }_0\), the prediction (6.16) becomes

    \begin{equation} \hat {\by }_0 = K\left (\tilde {\bX }_0,\bX \right )\balpha \end{equation}
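An end-to-end sketch of these three steps, assuming numpy (the polynomial kernel, toy data, and parameter values are illustrative choices):

```python
import numpy as np

def poly_kernel(A, B, c=1.0, d=2):
    """Polynomial kernel (6.19) between the rows of A and B."""
    return (A @ B.T + c) ** d

# Hypothetical 1-D training data and test grid
rng = np.random.default_rng(5)
X_train = rng.uniform(-1, 1, (30, 1))
y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.standard_normal(30)
X_test = np.linspace(-1, 1, 100)[:, None]

K = poly_kernel(X_train, X_train)                        # step 1: kernel matrix (6.12)
alpha = np.linalg.solve(K + 1e-3 * np.eye(30), y_train)  # step 2: alpha (6.14)
y_hat = poly_kernel(X_test, X_train) @ alpha             # step 3: prediction (6.18)
```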

6.4 Kernel Types

6.4.1 Polynomial Kernel

\begin{equation} K(\bx ,\by ) = \left (\bx ^\mathsf {T} \by + c\right )^{d} \end{equation}

where \(c\) is a constant (optionally \(c=0\)) and \(d\) is the polynomial degree.

For \(c=0\), the number of mapped dimensions is \(L=\displaystyle \binom {d+N-1}{d}\). For example, for \(N=10\) and \(d=4\), \(L=\binom {13}{4}=715\).

Example for \(c=0\), \(d=3\), \(N=2\):

\begin{align*} K(\ba ,\bb ) &=\left (\ba ^T\bb \right )^3 = \left (\left [a_1, a_2\right ] \begin{bmatrix} b_1 \\ b_2 \end {bmatrix} \right )^3\\ &=\left (a_1b_1 + a_2b_2\right )^3\\ &= a_1^3b_1^3 + 3a_1^2b_1^2a_2b_2 + 3a_1b_1a_2^2b_2^2 + a_2^3b_2^3\\ &=\left [a_1^3, \sqrt {3}a_1^2a_2,\sqrt {3}a_1a_2^2,a_2^3\right ]\cdot \left [b_1^3, \sqrt {3}b_1^2b_2,\sqrt {3}b_1b_2^2,b_2^3\right ]^T\\ &=\varphi (\ba )^T\varphi (\bb ) \end{align*}
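This factorization can be spot-checked numerically; a sketch (assuming numpy):

```python
import numpy as np

a, b = np.array([0.7, -1.2]), np.array([2.0, 0.5])

K = (a @ b) ** 3                                   # kernel value for c=0, d=3
phi = lambda v: np.array([v[0]**3, np.sqrt(3) * v[0]**2 * v[1],
                          np.sqrt(3) * v[0] * v[1]**2, v[1]**3])
assert np.isclose(K, phi(a) @ phi(b))              # K(a, b) = phi(a)^T phi(b)
```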

Example for \(c=1\), \(d=2\), and arbitrary \(N\):

\begin{align} K(\ba ,\bb ) &=\left (\ba ^T\bb + 1\right )^2\\ &=\left (\ba ^T\bb \right )^2 + 2\ba ^T\bb + 1\nonumber \\ &=\left (\sum _{i=1}^{N}a_ib_i\right )^2 + 2\sum _{i=1}^{N}a_ib_i + 1\nonumber \\ &=\sum _{i=1}^{N}a_ib_i\sum _{j=1}^{N}a_jb_j + 2\sum _{i=1}^{N}a_ib_i + 1\nonumber \\ &=\sum _{i=1}^{N}a_i^2b_i^2 + 2\sum _{i=1}^{N}\sum _{j=i+1}^{N}a_ib_ia_jb_j + 2\sum _{i=1}^{N}a_ib_i + 1\nonumber \\ &=\varphi (\ba )^T\varphi (\bb )\nonumber \end{align} This corresponds to the mapping functions

\begin{align*} \varphi (\ba ) &= \langle 1,\sqrt {2}a_1,\sqrt {2}a_2,\ldots , \sqrt {2}a_N, a_1^2,a_2^2,\ldots ,a_N^2,\\ &\quad \sqrt {2}a_1a_2,\sqrt {2}a_1a_3,\ldots ,\sqrt {2}a_1a_N,\sqrt {2}a_2a_3,\ldots , \sqrt {2}a_{N-1}a_N\rangle \\ \varphi (\bb ) &= \langle \underbrace {1}_1,\underbrace {\sqrt {2}b_1,\sqrt {2}b_2,\ldots , \sqrt {2}b_N}_{2\sum _{i=1}^{N}a_ib_i}, \underbrace {b_1^2,b_2^2,\ldots ,b_N^2}_{\sum _{i=1}^{N}a_i^2b_i^2},\\ &\quad \underbrace {\sqrt {2}b_1b_2,\sqrt {2}b_1b_3,\ldots ,\sqrt {2}b_1b_N,\sqrt {2}b_2b_3,\ldots , \sqrt {2}b_{N-1}b_N}_{2\sum _{i=1}^{N}\sum _{j=i+1}^{N}a_ib_ia_jb_j}\rangle \end{align*} with all the combinations above.

6.4.2 Gaussian Radial Basis Function (RBF) Kernel

The kernel definition is

\begin{equation} K\!\left (\bx ,\by \right ) = \exp (-\frac {\norm {\bx - \by }^2}{2b}) \end{equation}

Due to the Taylor expansion

\[ \exp (x) = 1 + x + \frac {x^2}{2!} + \frac {x^3}{3!} + \cdots \]

the implicit feature mapping of this kernel has infinitely many components, i.e., \(L\rightarrow \infty \).

Typically,

  • Too low \(b\), when \(b\rightarrow 0\)

    • Kernel becomes \(\delta \)-function-like.

    • Each training point affects only its immediate neighborhood.

    • Overfitting (high variance, low bias).

  • Too high \(b\), when \(b\rightarrow \infty \)

    • Kernel becomes constant, \(\bK _{ij}\approx 1\).

    • All training points affect the prediction everywhere.

    • Underfitting (low variance, high bias).

The numerical example (same dataset as in Fig. 5.6) is presented in Fig. 6.1; a small sketch of the two limiting regimes follows the figure.

(image)

Figure 6.1: Kernel regression example with the Gaussian RBF kernel. Direct comparison of RBF kernel regression with three bandwidth values. The medium bandwidth (\(b=0.1\), green) achieves the best balance between fitting the data and maintaining smoothness.
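The two limiting regimes can be made concrete numerically; a minimal sketch (assuming numpy):

```python
import numpy as np

X = np.linspace(0, 1, 8)[:, None]
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances

K_small_b = np.exp(-sq_dist / (2 * 1e-4))   # b -> 0: nearly the identity (delta-like)
K_large_b = np.exp(-sq_dist / (2 * 1e4))    # b -> inf: K_ij ~ 1 everywhere (constant)
print(np.round(K_small_b, 2))
print(np.round(K_large_b, 2))
```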
6.4.3 Summary
  • Enables virtually unbounded \(L\).

  • Requires tuning of the regularization parameter \(\lambda \).

  • The kernel type is a hyper-parameter, and each kernel has its own hyper-parameters.

  • Somewhat limited by the \(M\times M\) kernel matrix size.

  • Since the values of \(\bw \) are not evaluated explicitly, the influence of each \(\bx _j\) and the inter-data relations \(\bx _j,\bx _k\) remain unknown.

6.5 Kernel Smoothing

Nadaraya–Watson smoothing For a point of interest \(\tilde {\bx }_0\), kernel smoothing is a weighted average given by

\begin{equation} \label {eq:kernels:nadaraya} \hat {y}_0 = \sum _{j=1}^M w_jy_j \end{equation}

where the weights are

\begin{equation} \tilde {\bw }= K\!\left (\tilde {\bx }_0,\bX \right ) \end{equation}

with normalization

\begin{equation} w_j = \frac {\tilde {w}_j}{\sum _{i=1}^{M}\tilde {w}_i} \end{equation}

Finally, for a test dataset, \(\bX _{\text {test}}\) (vector notation)

\begin{align} \tilde {\bW }& = K\!\left (\bX _{\text {test}},\bX _{\text {train}}\right ) &\in \real ^{M_{\text {test}}\times M_{\text {train}}} \\ \bW &= \tilde {\bW }/\tilde {\bW }\bOne _{M_{\text {train}}}\\ \hat {\by }_{\text {test}} &= \bW \by _{\text {train}} \end{align}
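A vectorized sketch of (6.25)-(6.27), assuming numpy and a Gaussian kernel (the toy data below is hypothetical):

```python
import numpy as np

def nw_smoother(X_test, X_train, y_train, b=0.01):
    """Nadaraya-Watson estimate with a Gaussian kernel, as in (6.25)-(6.27)."""
    sq_dist = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq_dist / (2 * b))           # raw weights, (M_test, M_train)
    W /= W.sum(axis=1, keepdims=True)        # row-normalize, as in (6.24)
    return W @ y_train                       # weighted averages of training targets

rng = np.random.default_rng(6)
X_train = np.sort(rng.uniform(0, 1, (40, 1)), axis=0)
y_train = np.sin(2 * np.pi * X_train[:, 0]) + 0.1 * rng.standard_normal(40)
X_test = np.linspace(0, 1, 200)[:, None]
y_hat = nw_smoother(X_test, X_train, y_train)
```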

The numerical example is presented in Fig. 6.2.

(image)

Figure 6.2: Kernel smoothing example with exponential kernel. Incorrect selection of the \(b\) parameter results in overfitting. The kernel smoother provides a non-parametric estimate by computing locally weighted averages of nearby training points, where weights decay exponentially with distance.
Summary
  • Low numerical complexity.

  • The kernel type is a hyper-parameter, and each kernel has its own hyper-parameters; no additional method-specific hyper-parameters are required.

  • Non-trivial hyper-parameter optimization on multivariate datasets.

  • Requires a sufficiently dense dataset.

  • Captures nonlinear relationships without assuming a specific parametric form.