Machine Learning & Signals Learning
5 Overfitting and underfitting
5.1 Definition
Overfitting happens when a model is too complex, i.e. has too many parameters. Such a model:
• Has too many parameters relative to the number of observations. As a rule of thumb, we expect at least an order of magnitude more observations than parameters.
• Follows the training data very closely.
• Fails to generalize well to unseen data.
The result is a significant performance difference between the train and test datasets.
Underfitting happens when a model is too simple. Such a model:
• Is unable to capture the underlying pattern of the data and hence misses the trends in the data.
• Performs poorly on the training data and fails to generalize.
The result is poor performance on both the train and test datasets.
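The two failure modes above can be observed numerically. The sketch below uses assumed synthetic data (a noisy sine) and a helper name `poly_mse` introduced here for illustration: a low-degree polynomial underfits (high train and test error), while a high-degree one overfits (tiny train error, larger test error).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: noisy sine, split into train and test
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
x_tr, y_tr, x_te, y_te = x[:30], y[:30], x[30:], y[30:]

def poly_mse(deg):
    """Fit a degree-`deg` polynomial on the train split; return train/test MSE."""
    w = np.polyfit(x_tr, y_tr, deg)
    mse = lambda xs, ys: np.mean((np.polyval(w, xs) - ys) ** 2)
    return mse(x_tr, y_tr), mse(x_te, y_te)

for deg in (1, 3, 15):
    tr, te = poly_mse(deg)
    print(f"degree {deg:2d}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

Train error decreases monotonically with degree (nested models), but the train-test gap grows for the high-degree fit, which is the overfitting signature described above.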
Overfitting and underfitting are complementary, and balancing between them is key to building robust machine learning models that perform well on new, unseen data, i.e. generalize well. This principle is illustrated in Fig. 5.1.
A numerical example of the underfitting-overfitting trade-off for the polynomial model (Sec. 4.1) is presented in Fig. 5.2.
5.2 Bias-Variance Trade-off
5.2.1 Noisy Deterministic Function Interpretation
The general model underlying the dataset is
\(\seteqnumber{0}{}{0}\)\begin{equation} y_i = h(\bx _i) + \epsilon _i, \end{equation}
where \(h(\bx )\) is some unknown function and \(\epsilon \) is random noise with an unknown distribution, zero mean and variance \(\sigma ^2\) \((\E [\epsilon ]=0,\Var [\epsilon ]=\sigma ^2)\).
A model predicts outputs via \(\hat {y}_i = f(\bx _i;\bw )\).
Bias The bias of the model is the expected difference between predictions and the true function,
\(\seteqnumber{0}{}{1}\)\begin{equation} \mathrm {Bias}[\hat {y}] = \E [\hat {y}] - \E [y] = \E [f(\bx ;\bw )] - h(\bx ) \end{equation}
where the expectation is taken over all possible training samples.
For a specific dataset, the empirical bias can be estimated as:
\(\seteqnumber{0}{}{2}\)\begin{equation} \mathrm {Bias}_{\text {empirical}} = \frac {1}{M}\sum _{i=1}^M \hat {y}_i - \frac {1}{M}\sum _{i=1}^M h(\bx _i) \end{equation}
The bias and variance are illustrated in Fig. 5.3.
5.2.2 Bias and Variance Trade-off
The expected prediction error (MSE) for a new point can be decomposed as:
\(\seteqnumber{0}{}{3}\)\begin{equation} \begin{aligned} \loss (\by ,\hat {\by }) &= \E [(\hat {y}-y)^2] = \frac {1}{M}\sum _{i=1}^M \left (\hat {y}_i-y_i\right )^2\\ &=\E [\hat {y}^2] -2 \E [y]\E [\hat {y}] + \E [y^2]\\ \E [\hat {y}^2] &= \Var [\hat {y}] + \mathbb {E}^2\left [\hat {y}\right ]\\ \E [y] &=\E [h(\bx )]=h(\bx )\\ \E [y^2] &= \E [(h(\bx ) + \epsilon )^2]= \frac {1}{M}\sum _{i=1}^M \left (h(\bx _i) + \epsilon _i\right )^2\\ &= \E [h^2(\bx )] + 2\E [h(\bx )]\cancelto {0}{\E [\epsilon ]} + \E [\epsilon ^2]\\ &= \E [h^2(\bx )] + \sigma ^2\\ &= h^2(\bx ) + \sigma ^2\\ \loss &=\Var [\hat {y}] + \mathbb {E}^2\left [\hat {y}\right ] - 2h(\bx )\E [\hat {y}] + h^2(\bx ) + \sigma ^2 \\ &=\underbrace {\left (\E [\hat {y}] - h(\bx )\right )^2}_{bias^2} + \underbrace {\Var [\hat {y}]}_{variance} + \underbrace {\sigma ^2}_{noise} \end {aligned} \end{equation}
This decomposition shows that the total prediction error consists of three components: squared bias, variance, and irreducible noise. The inherent bias-variance trade-off is presented in Fig. 5.4. Underfitting corresponds to low variance and high bias; overfitting corresponds to high variance and low bias. The best-performing model balances the two. However, a poorly chosen model can have high bias and high variance simultaneously.
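The decomposition can be checked by Monte Carlo simulation. This sketch assumes a known true function \(h(x)=\sin (2\pi x)\) and a known noise level, refits a cubic model on many independently drawn training sets, and compares the empirical MSE at one point against bias\(^2\) + variance + \(\sigma ^2\):

```python
import numpy as np

rng = np.random.default_rng(1)
h = lambda x: np.sin(2 * np.pi * x)  # assumed true function h(x)
sigma = 0.1                          # noise standard deviation
x0 = 0.3                             # evaluation point
deg, M, runs = 3, 30, 4000

# Refit the model on many independent training sets
preds = np.empty(runs)
for r in range(runs):
    x = rng.uniform(0, 1, M)
    y = h(x) + sigma * rng.normal(size=M)
    preds[r] = np.polyval(np.polyfit(x, y, deg), x0)

bias2 = (preds.mean() - h(x0)) ** 2
variance = preds.var()
# Empirical MSE against fresh noisy observations y0 = h(x0) + eps
y0 = h(x0) + sigma * rng.normal(size=runs)
mse = np.mean((preds - y0) ** 2)
# mse should approximately equal bias2 + variance + sigma**2
```

The expectation over "all possible training samples" in the text is approximated here by averaging over `runs` resampled datasets.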
5.3 Normalization and Standardization
Values in different columns of \(\bX \) (vectors \(\bx _j\)) may differ by orders of magnitude, i.e. \(\norm {\bx _i}\gg \norm {\bx _j}\). This results in:
• Some columns having a significantly higher influence on \(\hat {\by }\).
• Numerical instabilities.
Standardization (z-score normalization)
Mapping all the input values such that they follow a distribution with zero mean and unit variance.
\(\seteqnumber{0}{}{4}\)\begin{equation} z_{std} = \frac {\bx -\bar {\bx }}{\sigma _\bx }, \end{equation}
where
\(\seteqnumber{0}{}{5}\)\begin{align*} \bar {\bx } &= \frac {1}{M}\sum _{j=1}^Mx_j\\ \sigma _\bx ^2 &= \frac {1}{M}\sum _{j=1}^M\left (x_j - \bar {\bx }\right )^2 \end{align*}
Implementation steps:
1. On the train dataset, evaluate \(\bar {\bx }\) and \(\sigma _\bx \).
2. Apply the normalization to the train dataset, using \(\bar {\bx }\) and \(\sigma _\bx \).
3. Apply the normalization to the test dataset, using the same \(\bar {\bx }\) and \(\sigma _\bx \) (no recalculation).
When normalization is applied to \(\by \), the output of the model is transformed back, \(\hat {\by } = \hat {\by }_{std}\sigma _\by + \bar {\by }\).
Example: zscore command in Matlab and sklearn.preprocessing.StandardScaler in Python.
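The steps above can be sketched directly in NumPy (the function names `fit_scaler` and `transform` are introduced here for illustration; synthetic data is assumed):

```python
import numpy as np

def fit_scaler(X_train):
    """Step 1: estimate per-feature mean and std on the train split only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return mu, sigma

def transform(X, mu, sigma):
    """Steps 2-3: apply the same train-set parameters to any split."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, (100, 3))
X_test = rng.normal(5.0, 2.0, (20, 3))

mu, sigma = fit_scaler(X_train)
Z_train = transform(X_train, mu, sigma)
Z_test = transform(X_test, mu, sigma)  # same mu, sigma: no recalculation
```

By construction the standardized train split has zero mean and unit variance per feature; the test split is only approximately so, which is expected.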
Normalization
Mapping all values of a feature to be in the range \([0,1]\) by the transformation1
\(\seteqnumber{0}{}{5}\)\begin{equation} x_{norm} = \frac {x-x_{min}}{x_{max}-x_{min}} \end{equation}
Implementation steps for normalization are similar to standardization.
1 In Python: sklearn.preprocessing.MinMaxScaler
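A minimal min-max sketch, analogous to the standardization steps (`minmax_fit` and `minmax_transform` are names introduced here; synthetic data is assumed). Note that test values falling outside the train range map outside \([0,1]\):

```python
import numpy as np

def minmax_fit(X_train):
    """Estimate per-feature min and max on the train split."""
    return X_train.min(axis=0), X_train.max(axis=0)

def minmax_transform(X, lo, hi):
    """Map values to [0, 1] relative to the train-set range."""
    return (X - lo) / (hi - lo)

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 7, (50, 2))
lo, hi = minmax_fit(X_train)
Z = minmax_transform(X_train, lo, hi)  # train values lie in [0, 1]
```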
Beware: the terms normalization and standardization are often used interchangeably.
Pros:
• Improves convergence speed of gradient-based optimization.
• Prevents features with large magnitudes from dominating distance-based models.
• Standardization of both \(\bX \) and \(\by \) results in \(w_0=0\) and removes the requirement for \(\bOne \) column in \(\bX \).
Cons:
• Min-max normalization is sensitive to outliers.
• Scaling parameters must be computed on the training set and applied consistently to the test set.
5.4 Dataset size
Adding data usually improves the performance of an overly complex model, as presented in Fig. 5.5.
5.5 Regularization
Regularization: a penalty added to the loss function
\(\seteqnumber{0}{}{6}\)\begin{equation} \loss _{\mathrm {new}} = \loss + \lambda g(\bw ) \end{equation}
where \(\lambda \) is termed the regularization parameter.
\(L_2\)-regularization: Special case with
\(\seteqnumber{0}{}{7}\)\begin{equation} g(\bw ) = \dfrac {1}{2M}\left \| \bw \right \|^2 =\frac {1}{2M}\sum _{i=1}^N w_i^2 \end{equation}
where the vector of weights \(\bw \) does not include the \(w_0\) weight. Moreover, when standardization is used, no \(w_0\) weight is needed. Two special cases are:
• \(\lambda \rightarrow 0\) recovers the original loss function (overfitting).
• \(\lambda \rightarrow \infty \) makes the loss function independent of \(\bw \) (underfitting).
Polynomial regression example The resulting coefficients \(w_i\) are significantly larger for higher \(i\) (Fig. 5.2). Reducing them results in a smooth prediction \(\hat {y}\). The influence of \(\lambda \) is illustrated in Fig. 5.6.
The corresponding underfitting-overfitting \(\lambda \)-related trade-off is presented in Fig. 5.7.
5.5.1 Ridge Regression
Ridge regression: \(L_2\)-regularized linear regression.
\(\seteqnumber{0}{}{8}\)\begin{equation} \begin{aligned} \mathcal {L}(\bw ) &= \frac {1}{2M}\left \| \by - \bX \bw \right \|^2 + \frac {\lambda }{2M}\underbrace {\left \| \bw \right \|^2}_{\sum _{i=1}^{N}w_i^2}\\ &= \frac {1}{2M}\left (\by - \bX \bw \right )^T \left (\by - \bX \bw \right ) + \frac {\lambda }{2M}\bw ^T\bw \end {aligned} \end{equation}
Derivative
\(\seteqnumber{0}{}{9}\)\begin{equation} \nabla \mathcal {L}(\bw ) =\frac {1}{M}\left (-\bX ^T\left (\by - \bX \bw \right ) + \lambda \bw \right )= 0 \end{equation}
Solution
\(\seteqnumber{0}{}{10}\)\begin{align*} &-\bX ^T\left (\by - \bX \bw \right ) + \lambda \bw = -\bX ^T \by + \bX ^T\bX \bw + \lambda \bw = 0\\ &\bX ^T \by = \bX ^T\bX \bw + \lambda \bw \end{align*} Finally, the regularized weights are given by
\(\seteqnumber{0}{}{10}\)\begin{equation} \label {eq-rigde-regression-w} \bw = \left (\lambda \mathbf {I} + \bX ^T\bX \right )^{-1}\bX ^T \by \end{equation}
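A direct sketch of this closed-form solution on synthetic data (the true weights below are assumed for illustration). Increasing \(\lambda \) shrinks the weight norm, consistent with the slope interpretation below:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 50, 5
X = rng.normal(size=(M, N))
w_true = np.array([3.0, -2.0, 1.0, 0.5, -1.5])  # assumed ground truth
y = X @ w_true + 0.1 * rng.normal(size=M)

def ridge(X, y, lam):
    """Closed-form ridge weights: (lam*I + X^T X)^{-1} X^T y."""
    N = X.shape[1]
    return np.linalg.solve(lam * np.eye(N) + X.T @ X, X.T @ y)

w0 = ridge(X, y, 0.0)   # lam = 0: ordinary least squares
w1 = ridge(X, y, 10.0)  # lam > 0: weights shrunk toward zero
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the standard numerically stable choice here.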
Ridge regression can be viewed as a special case of linear regression, of the form
\(\seteqnumber{0}{}{11}\)\begin{align} \tilde {y} &= \tilde {\bX }\bw \nonumber \\ \begin{bmatrix} \by \\ 0 \\ \vdots \\ 0 \end {bmatrix} &= \begin{bmatrix} - & \bX & - \\ \sqrt {\lambda } & & 0 \\ & \ddots & \\ 0 & & \sqrt {\lambda } \end {bmatrix} \begin{bmatrix} w_1\\ \vdots \\ w_N \end {bmatrix} +w_0 \end{align}
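This equivalence can be checked numerically (assuming standardized data, so the \(w_0\) term is dropped): ordinary least squares on the augmented system reproduces the closed-form ridge weights, since \(\tilde {\bX }^T\tilde {\bX } = \bX ^T\bX + \lambda \mathbf {I}\) and \(\tilde {\bX }^T\tilde {\by } = \bX ^T\by \).

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, lam = 40, 4, 2.0
X = rng.normal(size=(M, N))
y = rng.normal(size=M)

# Closed-form ridge solution
w_ridge = np.linalg.solve(lam * np.eye(N) + X.T @ X, X.T @ y)

# Augmented formulation: stack sqrt(lam)*I under X and zeros under y,
# then solve the ordinary least-squares problem
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(N)])
y_aug = np.concatenate([y, np.zeros(N)])
w_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
```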
Slope interpretation A higher value of \(\lambda \) reduces (mostly) the largest slopes (weights \(w_i\)) \(\Rightarrow \) the dependency on the features with these weights is reduced.
Eigenvalues interpretation Adding \(\lambda \mathbf {I}\) shifts every eigenvalue of \(\bX ^T\bX \) up by \(\lambda \), so the smallest eigenvalue of the inverted matrix is at least \(\lambda \); this improves the numerical stability of the inversion.
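A quick numerical check of the eigenvalue shift on an assumed random design matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
lam = 0.5

eig_raw = np.linalg.eigvalsh(X.T @ X)                    # eigenvalues of X^T X
eig_reg = np.linalg.eigvalsh(lam * np.eye(4) + X.T @ X)  # regularized matrix
# Each eigenvalue is shifted up by exactly lam, so the smallest is >= lam
print(eig_reg - eig_raw)
```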
Bias-Variance Trade-off The bias-variance trade-off is illustrated in Fig. 5.8.
GD Solution
\(\seteqnumber{0}{}{12}\)\begin{equation} \label {eq-gd-lr-ls} \begin{aligned} \bw _{n+1} &= \bw _{n} - \frac {\alpha }{M}\left [\bX ^T\left (\bX \bw _n - \by \right ) +\lambda \bw _n\right ]\\ &=\bw _{n}\left (1-\frac {\alpha }{M}\lambda \right ) - \frac {\alpha }{M}\bX ^T\left (\bX \bw _n - \by \right ) \end {aligned} \end{equation}
The \(\alpha \) value is typically chosen such that
\(\seteqnumber{0}{}{13}\)\begin{equation} 0.99\gtrsim \left (1-\dfrac {\alpha }{M}\lambda \right )\gtrsim 0.95 \end{equation}
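A sketch of this gradient-descent update on synthetic data, verifying convergence to the closed-form ridge solution. The step size \(\alpha \) below is an assumed value for which \(1-\frac {\alpha }{M}\lambda \approx 0.99\), consistent with the guideline above:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, lam = 60, 3, 1.0
X = rng.normal(size=(M, N))
y = rng.normal(size=M)

# Closed-form reference solution
w_ref = np.linalg.solve(lam * np.eye(N) + X.T @ X, X.T @ y)

# Gradient descent in the weight-decay form of the update
alpha = 0.5  # assumed step size; here 1 - alpha*lam/M is about 0.99
w = np.zeros(N)
for _ in range(5000):
    w = w * (1 - alpha / M * lam) - alpha / M * (X.T @ (X @ w - y))
```

The factor \((1-\frac {\alpha }{M}\lambda )\) multiplying the previous iterate is why \(L_2\)-regularized gradient descent is often called weight decay.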