Machine Learning & Signals Learning

\(\newcommand{\footnotename}{footnote}\) \(\def \LWRfootnote {1}\) \(\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}\) \(\let \LWRorighspace \hspace \) \(\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }\) \(\newcommand {\TextOrMath }[2]{#2}\) \(\newcommand {\mathnormal }[1]{{#1}}\) \(\newcommand \ensuremath [1]{#1}\) \(\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } \) \(\newcommand {\setlength }[2]{}\) \(\newcommand {\addtolength }[2]{}\) \(\newcommand {\setcounter }[2]{}\) \(\newcommand {\addtocounter }[2]{}\) \(\newcommand {\arabic }[1]{}\) \(\newcommand {\number }[1]{}\) \(\newcommand {\noalign }[1]{\text {#1}\notag \\}\) \(\newcommand {\cline }[1]{}\) \(\newcommand {\directlua }[1]{\text {(directlua)}}\) \(\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}\) \(\newcommand {\protect }{}\) \(\def \LWRabsorbnumber #1 {}\) \(\def \LWRabsorbquotenumber "#1 {}\) \(\newcommand {\LWRabsorboption }[1][]{}\) \(\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }\) \(\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }\) \(\def \mathcode #1={\mathchar }\) \(\let \delcode \mathcode \) \(\let \delimiter \mathchar \) \(\def \oe {\unicode {x0153}}\) \(\def \OE {\unicode {x0152}}\) \(\def \ae {\unicode {x00E6}}\) \(\def \AE {\unicode {x00C6}}\) \(\def \aa {\unicode {x00E5}}\) \(\def \AA {\unicode {x00C5}}\) \(\def \o {\unicode {x00F8}}\) \(\def \O {\unicode {x00D8}}\) \(\def \l {\unicode {x0142}}\) \(\def \L {\unicode {x0141}}\) \(\def \ss {\unicode {x00DF}}\) \(\def \SS {\unicode {x1E9E}}\) \(\def \dag {\unicode {x2020}}\) \(\def \ddag {\unicode {x2021}}\) \(\def \P {\unicode {x00B6}}\) \(\def \copyright {\unicode {x00A9}}\) \(\def \pounds {\unicode {x00A3}}\) \(\let \LWRref \ref \) \(\renewcommand {\ref }{\ifstar \LWRref \LWRref }\) \( \newcommand {\multicolumn }[3]{#3}\) \(\require {textcomp}\) \( \newcommand {\abs }[1]{\lvert #1\rvert } \) \( \DeclareMathOperator {\sign }{sign} \) \(\newcommand {\intertext }[1]{\text {#1}\notag \\}\) \(\let \Hat \hat \) \(\let \Check \check \) \(\let \Tilde \tilde \) \(\let \Acute \acute \) \(\let \Grave \grave \) \(\let \Dot \dot \) \(\let \Ddot \ddot \) \(\let \Breve \breve \) \(\let \Bar \bar \) \(\let \Vec \vec \) \(\newcommand {\bm }[1]{\boldsymbol {#1}}\) \(\require {physics}\) \(\newcommand {\LWRphystrig }[2]{\ifblank {#1}{\textrm {#2}}{\textrm {#2}^{#1}}}\) \(\renewcommand {\sin }[1][]{\LWRphystrig {#1}{sin}}\) \(\renewcommand {\sinh }[1][]{\LWRphystrig {#1}{sinh}}\) \(\renewcommand {\arcsin }[1][]{\LWRphystrig {#1}{arcsin}}\) \(\renewcommand {\asin }[1][]{\LWRphystrig {#1}{asin}}\) \(\renewcommand {\cos }[1][]{\LWRphystrig {#1}{cos}}\) \(\renewcommand {\cosh }[1][]{\LWRphystrig {#1}{cosh}}\) \(\renewcommand {\arccos }[1][]{\LWRphystrig {#1}{arcos}}\) \(\renewcommand {\acos }[1][]{\LWRphystrig {#1}{acos}}\) \(\renewcommand {\tan }[1][]{\LWRphystrig {#1}{tan}}\) \(\renewcommand {\tanh }[1][]{\LWRphystrig {#1}{tanh}}\) \(\renewcommand {\arctan }[1][]{\LWRphystrig {#1}{arctan}}\) \(\renewcommand {\atan }[1][]{\LWRphystrig {#1}{atan}}\) \(\renewcommand {\csc }[1][]{\LWRphystrig {#1}{csc}}\) \(\renewcommand {\csch }[1][]{\LWRphystrig {#1}{csch}}\) \(\renewcommand {\arccsc }[1][]{\LWRphystrig {#1}{arccsc}}\) \(\renewcommand {\acsc }[1][]{\LWRphystrig {#1}{acsc}}\) \(\renewcommand {\sec }[1][]{\LWRphystrig {#1}{sec}}\) \(\renewcommand {\sech }[1][]{\LWRphystrig {#1}{sech}}\) \(\renewcommand {\arcsec }[1][]{\LWRphystrig {#1}{arcsec}}\) \(\renewcommand {\asec }[1][]{\LWRphystrig {#1}{asec}}\) \(\renewcommand {\cot }[1][]{\LWRphystrig {#1}{cot}}\) \(\renewcommand {\coth }[1][]{\LWRphystrig {#1}{coth}}\) \(\renewcommand {\arccot }[1][]{\LWRphystrig {#1}{arccot}}\) \(\renewcommand {\acot }[1][]{\LWRphystrig {#1}{acot}}\) \(\require {cancel}\) \(\newcommand *{\underuparrow }[1]{{\underset {\uparrow }{#1}}}\) \(\DeclareMathOperator *{\argmax }{argmax}\) \(\DeclareMathOperator *{\argmin }{arg\,min}\) \(\def \E [#1]{\mathbb {E}\!\left [ #1 \right ]}\) \(\def \Var [#1]{\operatorname {Var}\!\left [ #1 \right ]}\) \(\def \Cov [#1]{\operatorname {Cov}\!\left [ #1 \right ]}\) \(\newcommand {\floor }[1]{\lfloor #1 \rfloor }\) \(\newcommand {\DTFTH }{ H \brk 1{e^{j\omega }}}\) \(\newcommand {\DTFTX }{ X\brk 1{e^{j\omega }}}\) \(\newcommand {\DFTtr }[1]{\mathrm {DFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtr }[1]{\mathrm {DTFT}\left \{#1\right \}}\) \(\newcommand {\DTFTtrI }[1]{\mathrm {DTFT^{-1}}\left \{#1\right \}}\) \(\newcommand {\Ftr }[1]{ \mathcal {F}\left \{#1\right \}}\) \(\newcommand {\FtrI }[1]{ \mathcal {F}^{-1}\left \{#1\right \}}\) \(\newcommand {\Zover }{\overset {\mathscr Z}{\Longleftrightarrow }}\) \(\renewcommand {\real }{\mathbb {R}}\) \(\newcommand {\ba }{\mathbf {a}}\) \(\newcommand {\bb }{\mathbf {b}}\) \(\newcommand {\bc }{\mathbf {c}}\) \(\newcommand {\bd }{\mathbf {d}}\) \(\newcommand {\be }{\mathbf {e}}\) \(\newcommand {\bf }{\mathbf {f}}\) \(\newcommand {\bh }{\mathbf {h}}\) \(\newcommand {\bi }{\mathbf {i}}\) \(\newcommand {\bn }{\mathbf {n}}\) \(\newcommand {\bo }{\mathbf {o}}\) \(\newcommand {\bp }{\mathbf {p}}\) \(\newcommand {\bq }{\mathbf {q}}\) \(\newcommand {\br }{\mathbf {r}}\) \(\newcommand {\bs }{\mathbf {s}}\) \(\newcommand {\bt }{\mathbf {t}}\) \(\newcommand {\bu }{\mathbf {u}}\) \(\newcommand {\bv }{\mathbf {v}}\) \(\newcommand {\bw }{\mathbf {w}}\) \(\newcommand {\bx }{\mathbf {x}}\) \(\newcommand {\bxx }{\mathbf {xx}}\) \(\newcommand {\bxy }{\mathbf {xy}}\) \(\newcommand {\by }{\mathbf {y}}\) \(\newcommand {\byy }{\mathbf {yy}}\) \(\newcommand {\bz }{\mathbf {z}}\) \(\newcommand {\bA }{\mathbf {A}}\) \(\newcommand {\bB }{\mathbf {B}}\) \(\newcommand {\bC }{\mathbf {C}}\) \(\newcommand {\bD }{\mathbf {D}}\) \(\newcommand {\bH }{\mathbf {H}}\) \(\newcommand {\bI }{\mathbf {I}}\) \(\newcommand {\bK }{\mathbf {K}}\) \(\newcommand {\bM }{\mathbf {M}}\) \(\newcommand {\bP }{\mathbf {P}}\) \(\newcommand {\bQ }{\mathbf {Q}}\) \(\newcommand {\bR }{\mathbf {R}}\) \(\newcommand {\bS }{\mathbf {S}}\) \(\newcommand {\bU }{\mathbf {U}}\) \(\newcommand {\bW }{\mathbf {W}}\) \(\newcommand {\bX }{\mathbf {X}}\) \(\newcommand {\bY }{\mathbf {Y}}\) \(\newcommand {\bZ }{\mathbf {Z}}\) \(\newcommand {\balpha }{\bm {\alpha }}\) \(\newcommand {\bth }{{\bm {\theta }}}\) \(\newcommand {\bepsilon }{{\bm {\epsilon }}}\) \(\newcommand {\bmu }{{\bm {\mu }}}\) \(\newcommand {\bphi }{\bm {\phi }}\) \(\newcommand {\bOne }{\mathbf {1}}\) \(\newcommand {\bZero }{\mathbf {0}}\) \(\newcommand {\indFunc }{\mathbb {1}}\) \(\newcommand {\btx }{\tilde {\bx }}\) \(\newcommand {\loss }{\mathcal {L}}\) \(\newcommand {\appropto }{\mathrel {\vcenter { \offinterlineskip \halign {\hfil $##$\cr \propto \cr \noalign {\kern 2pt}\sim \cr \noalign {\kern -2pt}}}}}\) \(\newcommand {\SSE }{\mathrm {SSE}}\) \(\newcommand {\MSE }{\mathrm {MSE}}\) \(\newcommand {\RMSE }{\mathrm {RMSE}}\) \(\newcommand {\toprule }[1][]{\hline }\) \(\let \midrule \toprule \) \(\let \bottomrule \toprule \) \(\def \LWRbooktabscmidruleparen (#1)#2{}\) \(\newcommand {\LWRbooktabscmidrulenoparen }[1]{}\) \(\newcommand {\cmidrule }[1][]{\ifnextchar (\LWRbooktabscmidruleparen \LWRbooktabscmidrulenoparen }\) \(\newcommand {\morecmidrules }{}\) \(\newcommand {\specialrule }[3]{\hline }\) \(\newcommand {\addlinespace }[1][]{}\) \(\newcommand {\LWRsubmultirow }[2][]{#2}\) \(\newcommand {\LWRmultirow }[2][]{\LWRsubmultirow }\) \(\newcommand {\multirow }[2][]{\LWRmultirow }\) \(\newcommand {\mrowcell }{}\) \(\newcommand {\mcolrowcell }{}\) \(\newcommand {\STneed }[1]{}\) \(\newcommand {\tcbset }[1]{}\) \(\newcommand {\tcbsetforeverylayer }[1]{}\) \(\newcommand {\tcbox }[2][]{\boxed {\text {#2}}}\) \(\newcommand {\tcboxfit }[2][]{\boxed {#2}}\) \(\newcommand {\tcblower }{}\) \(\newcommand {\tcbline }{}\) \(\newcommand {\tcbtitle }{}\) \(\newcommand {\tcbsubtitle [2][]{\mathrm {#2}}}\) \(\newcommand {\tcboxmath }[2][]{\boxed {#2}}\) \(\newcommand {\tcbhighmath }[2][]{\boxed {#2}}\) \(\require {colortbl}\) \(\let \LWRorigcolumncolor \columncolor \) \(\renewcommand {\columncolor }[2][named]{\LWRorigcolumncolor [#1]{#2}\LWRabsorbtwooptions }\) \(\let \LWRorigrowcolor \rowcolor \) \(\renewcommand {\rowcolor }[2][named]{\LWRorigrowcolor [#1]{#2}\LWRabsorbtwooptions }\) \(\let \LWRorigcellcolor \cellcolor \) \(\renewcommand {\cellcolor }[2][named]{\LWRorigcellcolor [#1]{#2}\LWRabsorbtwooptions }\)

B Essential formulas

B.1 Linear algebra

(image)

Figure B.1: Vector and matrix dimensions notation.
B.1.1 Basic operations

For a column vector \(\bz = [z_1,\,z_2,\,\ldots ,\,z_N]^T \in \mathbb {R}^N\), the squared Euclidean norm is

\begin{equation} \norm {\bz }^2 = \bz ^T\bz = z_1^2 + z_2^2 + \cdots + z_N^2. \end{equation}

For matrices \(\bA \) and \(\bB \) of compatible sizes, the transpose of a product is

\begin{equation} (\bA \bB )^T = \bB ^T\bA ^T. \end{equation}

B.1.2 Dot product

For two vectors \(\ba , \bb \in \mathbb {R}^N\),

\begin{equation} \ba \cdot \bb = \ba ^T\bb = \bb ^T\ba = \norm {\ba }\,\norm {\bb }\,\cos \theta , \end{equation}

where \(\theta \) is the angle between them. In particular, \(\ba \perp \bb \;\Leftrightarrow \; \theta = 90^\circ \;\Leftrightarrow \; \ba ^T\bb = 0\).

B.1.3 Pseudo-inverse

When \(\bX \in \mathbb {R}^{M\times N}\) has full column rank (\(M \ge N\) and the columns are linearly independent), the matrix \(\bX ^T\bX \) is invertible and the (left, Moore-Penrose) pseudo-inverse is

\begin{equation} \bX ^+ = \left (\bX ^T\bX \right )^{-1}\bX ^T, \end{equation}

which satisfies \(\bX ^+\bX = \bI \) and provides the least-squares solution to overdetermined linear systems.

B.1.4 Least squares

The least-squares loss for a linear model \(\bX \bw \approx \bY \) is

\begin{align} \loss (\bw ) &= \norm {\bY -\bX \bw }^2 = (\bY -\bX \bw )^T(\bY -\bX \bw ) \\ &= \bY ^T\bY - 2\bw ^T\bX ^T\bY + \bw ^T\bX ^T\bX \bw . \end{align} Setting the gradient to zero,

\begin{equation} \frac {\partial \loss (\bw )}{\partial \bw } = -2\bX ^T\bY + 2\bX ^T\bX \bw = \bZero , \end{equation}

gives the normal equations \(\bX ^T\bX \bw = \bX ^T\bY \) and the closed-form solution

\begin{equation} \bw ^\star = \left (\bX ^T\bX \right )^{-1}\bX ^T\bY = \bX ^+\bY . \end{equation}

B.1.5 Eigenvalues and eigenvectors

Definition. Let \(\bA \in \mathbb {R}^{n\times n}\). A non-zero vector \(\bv \in \mathbb {R}^n\) is an eigenvector of \(\bA \) with associated eigenvalue \(\lambda \in \mathbb {R}\) if

\begin{equation} \bA \bv = \lambda \bv . \end{equation}

Geometrically, \(\bA \) acts on any vector along the direction of \(\bv \) by scaling it by the factor \(\lambda \), without rotation.

The eigenvalues of \(\bA \) satisfy the identities

\begin{align} \operatorname {tr}(\bA ) &= \sum _{i=1}^n a_{ii} = \sum _{i=1}^n \lambda _i,\\ \det (\bA ) &= \prod _{i=1}^n \lambda _i. \end{align}

Eigendecomposition

Assume \(\bA \) has \(n\) linearly independent eigenvectors \(\bv _1, \bv _2, \ldots , \bv _n\) (this assumption, called diagonalisability, is non-trivial; not every square matrix satisfies it). Stack them as the columns of

\begin{equation} \bQ = \begin{bmatrix} \bv _1 & \bv _2 & \cdots & \bv _n \end {bmatrix}. \end{equation}

Applying \(\bA \) to each column scales it by the corresponding eigenvalue, so column-by-column

\begin{equation} \bA \bQ = \begin{bmatrix} \lambda _1\bv _1 & \lambda _2\bv _2 & \cdots & \lambda _n\bv _n \end {bmatrix} = \bQ \bm {\Lambda }, \end{equation}

where \(\bm {\Lambda }=\diag (\lambda _1, \lambda _2, \ldots , \lambda _n)\). Multiplying on the right by \(\bQ ^{-1}\) gives the eigendecomposition of \(\bA \),

\begin{align} \bA &= \bQ \bm {\Lambda }\bQ ^{-1},\\ \bQ ^{-1}\bA \bQ &= \bm {\Lambda }. \end{align}

Symmetric case (spectral theorem)

When \(\bA \) is real symmetric, \(\bA = \bA ^T\), the eigendecomposition always exists and has stronger properties: the eigenvalues \(\lambda _i\) are real, and the eigenvectors \(\bv _i\) can be chosen orthonormal. In that case \(\bQ \) is an orthogonal matrix, \(\bQ ^{-1} = \bQ ^T\), and

\begin{equation} \bA = \bQ \bm {\Lambda }\bQ ^T. \end{equation}

A sample covariance matrix is symmetric and positive semi-definite, so its eigendecomposition has real non-negative eigenvalues and orthonormal eigenvectors.

Positive (semi-)definite matrices

A real symmetric matrix \(\bA \in \mathbb {R}^{n\times n}\) is called

  • positive definite (PD), written \(\bA \succ 0\), if \(\bx ^T\bA \bx > 0\) for every non-zero \(\bx \in \mathbb {R}^n\);

  • positive semi-definite (PSD), written \(\bA \succeq 0\), if \(\bx ^T\bA \bx \ge 0\) for every \(\bx \in \mathbb {R}^n\) (equality is allowed for non-zero \(\bx \)).

Equivalent characterisations in terms of the eigenvalues \(\lambda _1,\ldots ,\lambda _n\) (which are real because \(\bA \) is symmetric):

  • \(\bA \succ 0 \;\Leftrightarrow \; \lambda _i > 0\) for all \(i\);

  • \(\bA \succeq 0 \;\Leftrightarrow \; \lambda _i \ge 0\) for all \(i\).

A PD matrix is invertible (none of its eigenvalues is zero); a PSD matrix may be singular. Common examples:

  • For any matrix \(\bX \), the Gram matrix \(\bX ^T\bX \) is PSD; it is PD when \(\bX \) has full column rank.

  • A sample covariance matrix is PSD; it is PD when the data spans the full feature space.

  • The Hessian of a strictly convex function is PD at every point; the Hessian of a convex function is PSD.

Negative definite (\(\bA \prec 0\)) and negative semi-definite (\(\bA \preceq 0\)) are defined analogously, with all eigenvalues strictly negative or non-positive.

B.1.6 Projections
Vector projection

The vector projection of \(\ba \) onto \(\bb \) is the orthogonal projection of \(\ba \) onto the direction of \(\bb \) (Fig. B.2). With \(\hat \bb = \bb /\norm {\bb }\) the unit vector in the direction of \(\bb \),

\begin{equation} \ba _1 = (\ba \cdot \hat \bb )\,\hat \bb = \frac {\ba ^T\bb }{\norm {\bb }^2}\,\bb . \end{equation}

(image)

Figure B.2: \(\ba _1\) is the projection of \(\ba \) onto \(\bb \).
Subspace projection

The projection of a vector \(\ba \in \mathbb {R}^N\) onto a subspace spanned by the columns of \(\bB = [\bb _1,\,\bb _2,\,\ldots ,\,\bb _k]\in \mathbb {R}^{N\times k}\) (Fig. B.3) is

\begin{equation} \ba _1 = \bP \ba , \qquad \bP = \bB (\bB ^T\bB )^{-1}\bB ^T, \end{equation}

where \(\bP \) is the orthogonal projection matrix onto the column space of \(\bB \). It satisfies \(\bP ^2 = \bP \) and \(\bP ^T = \bP \). The single-vector formula above is the special case \(k=1\).

(image)

Figure B.3: Orthogonal projection of \(\ba \) onto the subspace spanned by \(\bb _1,\bb _2\).
Condition number

The condition number of a matrix measures how much small perturbations of the input can be amplified in the output of a linear system. For a non-singular matrix \(\bA \),

\begin{align} \label {app-math-poor-coditioning} \kappa (\bA ) &= \frac {\sigma _{\max }(\bA )}{\sigma _{\min }(\bA )},\\ \kappa (\bA ^T\bA ) &= \kappa (\bA )^2, \end{align} where \(\sigma _{\max }\) and \(\sigma _{\min }\) are the largest and smallest singular values of \(\bA \). For a symmetric positive-definite \(\bA \), the singular values coincide with the absolute eigenvalues, so \(\kappa (\bA ) = |\lambda _{\max }|/|\lambda _{\min }|\). A large condition number indicates that the matrix is close to singular, leading to numerical inaccuracies in matrix calculations; in particular, forming \(\bA ^T\bA \) (as in the normal equations) squares the condition number, which is one reason SVD-based solvers are preferred for ill-conditioned least-squares problems.

For example,

\begin{equation} \bA = \begin{bmatrix} 1 & 1 & -\epsilon \\ \epsilon & 0 & 1 \\ 0 & \epsilon & 1 \end {bmatrix} \;\Rightarrow \; \kappa (\bA )\approx \sqrt {2}/\epsilon ,\quad \kappa (\bA ^T\bA )\approx 2/\epsilon ^2. \end{equation}

B.2 Indicator functions and counting

The indicator \(\indFunc [\text {predicate}]\) equals \(1\) when the predicate is true and \(0\) otherwise. Two notational patterns recur throughout the book.

B.2.1 Sum of a single indicator

Given labels \(Y_1,\ldots ,Y_M\) and a fixed class \(c\), the expression

\begin{equation} \sum _{i=1}^{M} \indFunc [Y_i = c] \end{equation}

turns the label vector into a column of 0s and 1s, with a 1 in row \(i\) whenever \(Y_i=c\), and then adds those entries. The sum is simply the number of 1s in that column, i.e. the count of samples whose label is \(c\).

B.2.2 Sum of a product of indicators

A product of two indicators is itself an indicator of the conjunction,

\begin{equation} \indFunc [A]\,\indFunc [B] = \indFunc [A \text { and } B], \end{equation}

because the product is \(1\) only when both factors are \(1\). For two label vectors (say the truth \(Y_i\) and a prediction \(\hat Y_i\)), the sum

\begin{equation} \sum _{i=1}^{M} \indFunc [Y_i = c]\,\indFunc [\hat Y_i = c'] \end{equation}

reads two 0/1 columns row by row, multiplies entry by entry, and counts the rows where both columns held a 1. It is the number of samples whose true class is \(c\) and whose predicted class is \(c'\).

The entry-by-entry product of two vectors is called the Hadamard product and written with \(\odot \): if \(\bv ,\bu \in \{0,1\}^M\) are the two indicator columns, then \((\bv \odot \bu )_i = v_i u_i\), and the sum above is the total of the entries of \(\bv \odot \bu \).

  • Example B.1: With \(M=5\), true labels \(\bY =(c,\, c,\, b,\, c,\, a)\) and predictions \(\hat \bY =(c,\, b,\, b,\, c,\, b)\), taking \(c\) as the positive class and asking “true \(c\) predicted as \(c\)”:

    \begin{equation*} \begin{array}{c|ccccc|c} i & 1 & 2 & 3 & 4 & 5 & \text {sum} \\[3pt] \hline \indFunc [Y_i = c] & 1 & 1 & 0 & 1 & 0 & 3 \\[3pt] \indFunc [\hat Y_i = c] & 1 & 0 & 0 & 1 & 0 & 2 \\[3pt] \text {product} & 1 & 0 & 0 & 1 & 0 & 2 \end {array} \end{equation*}

    The single-indicator column sums give the marginal counts (3 true \(c\), 2 predicted \(c\)); the product column counts samples in both at once (2 correctly predicted as \(c\)).

This is the mechanism behind confusion-matrix entries such as Eq. (12.22) and the agreement-matrix entries in Sec. 14.3.2.

B.3 Function analysis

B.3.1 Minimum, maximum, argmin, argmax

For a real-valued function \(f(x)\):

  • \(\min _x f(x)\) is the smallest value attained by \(f\) over all possible \(x\).

  • \(\argmin _x f(x)\) is the location where that minimum is attained: if \(y = \argmin _x f(x)\), then \(f(y) = \min _x f(x)\).

The pair \(\max _x f(x)\) and \(\argmax _x f(x)\) are defined analogously. The distinction matters in optimisation: a learning algorithm reports the argmin (the parameter values), not the min (the residual loss).

B.3.2 Vector and matrix calculus

For a scalar-valued function \(f:\mathbb {R}^N \to \mathbb {R}\) of a vector \(\bz = [z_1,\ldots ,z_N]^T\), the gradient is the column vector of partial derivatives,

\begin{equation} \nabla _\bz f(\bz ) = \begin{bmatrix} \dfrac {\partial f}{\partial z_1} \\[5pt] \dfrac {\partial f}{\partial z_2} \\ \vdots \\ \dfrac {\partial f}{\partial z_N} \end {bmatrix} \in \mathbb {R}^N. \end{equation}

The Hessian is the symmetric \(N\times N\) matrix of second partial derivatives,

\begin{equation} \bH = \nabla _\bz ^2 f(\bz ),\qquad H_{ij} = \frac {\partial ^2 f}{\partial z_i\,\partial z_j}. \end{equation}

At a stationary point (\(\nabla _\bz f = \bZero \)), the function has a local minimum if \(\bH \) is positive definite, a local maximum if \(\bH \) is negative definite, and a saddle point if \(\bH \) has eigenvalues of both signs.

Bibliography

  • [1]  Tomas Andersson. Selected topics in frequency estimation. PhD thesis, KTH Royal Institute of Technology, 2003.

  • [2]  Dima Bykhovsky. Experimental lognormal modeling of harmonics power of switched-mode power supplies. Energies, 15(2), 2022.

  • [3]  Dima Bykhovsky and Asaf Cohen. Electrical network frequency (ENF) maximum-likelihood estimation via a multitone harmonic model. IEEE Transactions on Information Forensics and Security, 8(5):744–753, 2013.

  • [4]  Lorenzo Ciampiconi, Adam Elwood, Marco Leonardi, Ashraf Mohamed, and Alessandro Rozza. A survey and taxonomy of loss functions in machine learning. arXiv preprint arXiv:2301.05579, 2023.

  • [5]  Angus Dempster, François Petitjean, and Geoffrey I Webb. Rocket: exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery, 34(5):1454–1495, 2020.

  • [6]  Angus Dempster, Daniel F Schmidt, and Geoffrey I Webb. Minirocket: A very fast (almost) deterministic transform for time series classification. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pages 248–257, 2021.

  • [7]  Bo Diao, Kun Wen, Jian Chen, Yueping Liu, Zilin Yuan, Chao Han, Jiahui Chen, Yuxian Pan, Li Chen, Yunjie Dan, Jing Wang, Yongwen Chen, Guohong Deng, Hongwei Zhou, and Yuzhang Wu. Diagnosis of acute respiratory syndrome coronavirus 2 infection by detection of nucleocapsid protein. medRxiv, 2020.

  • [8]  Sharon Gannot, Zheng-Hua Tan, Martin Haardt, Nancy F Chen, Hoi-To Wai, Ivan Tashev, Walter Kellermann, and Justin Dauwels. Data science education: The signal processing perspective [sp education]. IEEE Signal Processing Magazine, 40(7):89–93, 2023.

  • [9]  Toni Giorgino. Computing and visualizing dynamic time warping alignments in r: the dtw package. Journal of statistical Software, 31:1–24, 2009.

  • [10]  Monson H Hayes. Statistical Digital Signal Processing and Modeling. John Wiley & Sons, 1996.

  • [11]  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.

  • [12]  Steven M. Kay. Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory. Prentice Hall, 1993.

  • [13]  Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2017.

  • [14]  Jason Lines, Sarah Taylor, and Anthony Bagnall. Hive-cote: The hierarchical vote collective of transformation-based ensembles for time series classification. In 2016 IEEE 16th international conference on data mining (ICDM), pages 1041–1046. IEEE, 2016.

  • [15]  Boaz Porat. Digital processing of random signals: theory and methods. Courier Dover Publications, 2008.

  • [16]  Pavel Senin and Sergey Malinchik. Sax-vsm: Interpretable time series classification using sax and vector space model. In 2013 IEEE 13th international conference on data mining, pages 1175–1180. IEEE, 2013.

  • [17]  Albert Wong, Athena Nguyen, Eugene Li, Yew-Wei Lim, Mike Wu, and Shuk Wai Tsang. Combining classifiers for improved accuracies -voting and linearly weighted algorithms, Feb 2026.

  • [18]  Lexiang Ye and Eamonn Keogh. Time series shapelets: a new primitive for data mining. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 947–956, 2009.