Machine Learning & Signals Learning


9 Classification Performance Metrics

  • Goal: Quantify the performance of a binary classifier on a test dataset.

Definitions:

  • \(\by \) – target values vector of the test set, \(\by \in \real ^M\)

  • \(\hat {\by }\) – predicted values vector, \(\hat {\by }\in \real ^M\), the output of some classifier \(\hat {\by } = f_\bth (\bX )\).

Typically, in binary classification, \(y_i\in \left \{0,1\right \}\).

9.1 Definitions

  • Goal: Classification between two groups (only).

Basic terminology:

  • ‘1’ – positive group or result

  • ‘0’ – negative group or result

  • \(Y\) – actual class

  • \(\hat {Y}\) – predicted class

Positive/negative terminology is rather arbitrary. Typically, the result of interest is termed positive.

9.2 Confusion matrix

  • Goal: Summarize classification results of a test set of a particular database.

The summarization is in the form of a 2D non-normalized histogram of \((Y,\hat {Y})\).

                            Predicted values
                            Positive, \(\hat {Y}=1\)    Negative, \(\hat {Y}=0\)
Actual    Positive, \(Y=1\)    TP, True Positive        FN, False Negative
values    Negative, \(Y=0\)    FP, False Positive       TN, True Negative

Figure 9.1: Confusion matrix. Note that the transposed representation is sometimes used.

The (test) database has \(M\) values, among them:

  • TP + FN positive values

  • FP + TN negative values

This is the most common way to summarize the performance of a particular classifier on a particular dataset. It can be easily extended for multi-class classifiers.

9.3 Performance Metrics

  • Goal: Characterize classifier performance with metrics that enable comparison between classifiers and/or across different datasets.

9.3.1 Accuracy
  • Goal: The most intuitive metric, fraction of \(Y=\hat {Y}\), among all the classification results, \(\Pr (Y=\hat {Y})\).

\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {\text {correct predictions}}{\text {total predictions}}\\[3pt] &= \frac {TP + TN}{TP + TN + FP + FN} \end {aligned} \end{equation}

  • Example 9.1: Covid antibody (fast non-PCR) test performance. The example includes test statistics of 239 participants [?], as presented below.

                    Predicted
                    Yes    No
    Actual  Yes     141    67
            No        0    31

    The resulting accuracy is

    \begin{equation} \text {Accuracy} = \frac {141+31}{239} = 0.7197 \approx 72.0\% \end{equation}

In the example, FN=67 is a bad performance, and FP=0 is probably something good. However, accuracy does not reflect the discrepancy between these two. Additional metrics are used to quantify these aspects.

9.3.2 Precision
  • Goal: Proportion of positive classifications that are actually correct, \(\Pr (Y=1|\hat {Y}=1)\).

From probability theory,

\begin{align} \Pr (Y=1|\hat {Y}=1) &= \frac {\Pr (Y=1,\hat {Y}=1)}{\Pr (\hat {Y}=1)} \\ \Pr (\hat {Y}=1) &= \Pr (\hat {Y}=1,Y=0) + \Pr (\hat {Y}=1,Y=1) \end{align}

\begin{equation} \text {Precision} = \dfrac {TP}{FP+TP} = \dfrac {\text {Correctly predicted 1's}}{\text {All predicted 1's}} \end{equation}


Term                  Radar Interpretation
Accuracy              Percentage of all objects correctly identified, as planes or non-planes
Precision             Among all objects classified as planes, the proportion that are actually planes
Recall (sensitivity)  Among all actual planes, the proportion correctly classified as planes
Specificity           Among all actual non-planes, the proportion correctly classified as non-planes

Table 9.1: Radar interpretation of the classification metrics.
  • Example 9.1: Back to the previous example,

    \begin{equation} \text {Precision} = \dfrac {141}{0+141} = 1 = 100\% \end{equation}

    The high value of the precision is due to the low FP. From the medical point of view, all positive results are actually positive. Whoever was identified by this test as Covid-positive is really positive.

9.3.3 Recall (sensitivity)
  • Goal: Proportion of positives identified correctly, \(\Pr (\hat {Y}=1| Y=1)\).

From probability theory,

\begin{align} \Pr (\hat {Y}=1|Y=1) &= \frac {\Pr (Y=1,\hat {Y}=1)}{\Pr (Y=1)} \\ \Pr (Y=1) &= \Pr (Y=1,\hat {Y}=0) + \Pr (Y=1,\hat {Y}=1) \end{align}

\begin{equation} \text {Recall} = \frac {TP}{TP + FN} = \frac {\text {Correctly predicted 1's}}{\text {Actual 1's}} \end{equation}

Medical meaning: the proportion of correctly identified ill among all the ill.

  • Example 9.1: Back to the previous example,

    \begin{equation} \text {Recall} = \dfrac {141}{141 + 67} = 0.678 = 67.8\% \end{equation}

    The low value of the recall is due to the high FN. From the medical point of view, among all actually positive (ill) individuals, only 67.8% were correctly identified.

9.3.4 Specificity
  • Goal: Proportion of negatives identified correctly, \(\Pr (\hat {Y}=0|Y=0)\).

\begin{equation} \text {Specificity} = \frac {TN}{FP+TN} = \frac {\text {Correctly predicted 0's}} {\text {Actual 0's}} \end{equation}

Medical meaning: the proportion of correctly identified healthy among all the healthy.

  • Example 9.1: Back to the previous example,

    \begin{equation} \text {Specificity} = \dfrac {31}{0 + 31} = 1 = 100\% \end{equation}

    From the medical point of view, all negative results are really negative.

9.3.5 F1-score
  • Goal: Combination of precision and recall.

The harmonic mean between precision and recall,

\begin{equation} F_1 = \frac {2}{\frac {1}{\text {recall}} + \frac {1}{\text {precision}}} = \frac {TP}{TP + \frac {1}{2}\left (FP+FN\right )} \end{equation}

  • Example 9.1: Back to the previous example,

    \begin{equation} F_1 = \dfrac {141}{141+\frac {1}{2}\left (0+67\right )} = 0.808 = 80.8\% \end{equation}

  • Example 9.2: The logistic regression classifier from Figs. 8.1 and 8.6 has the following confusion matrix:

                    Predicted
                    Yes    No
    Actual  Yes      55     5
            No        5    35

    The resulting metrics are

    \begin{equation} \begin{aligned} \text {Accuracy} &= \frac {55+35}{100} = 0.9 = 90\%\\ \text {Precision} &= \dfrac {55}{5+55} = \frac {55}{60} \approx 91.7\%\\ \text {Recall} &= \frac {55}{55+5} = \frac {55}{60} \approx 91.7\%\\ \text {Specificity} &= \frac {35}{5+35} = \frac {35}{40} = 87.5\%\\ F_1 &= \frac {55}{55 + \frac {1}{2}\left (5+5\right )} = \frac {55}{60} \approx 91.7\% \end {aligned} \end{equation}

    Note, since \(FP = FN\), precision, recall, and \(F_1\) are all equal.
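As a sanity check, the metric definitions from the sections above can be collected into a short function; a minimal plain-Python sketch, with the counts taken from the confusion matrix of Example 9.2:

```python
def metrics(TP, FP, FN, TN):
    """Binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy":    (TP + TN) / (TP + TN + FP + FN),
        "precision":   TP / (TP + FP),
        "recall":      TP / (TP + FN),
        "specificity": TN / (FP + TN),
        "f1":          TP / (TP + 0.5 * (FP + FN)),
    }

# Example 9.2: TP=55, FP=5, FN=5, TN=35
m = metrics(55, 5, 5, 35)
print(m["accuracy"])   # 0.9
print(m["precision"])  # 0.9166...
```

Note that the division-by-zero cases (e.g., precision when no positives are predicted at all, as in Example 9.3 below) would need explicit handling in practice.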

9.3.6 Per-class performance

The metrics above are defined with respect to the positive class (\(Y=1\)). To assess both classes, the metrics can be computed separately for each class by treating it as the positive class.

For class \(c\), define:

  • \(TP_c\) – samples of class \(c\) correctly predicted as class \(c\)

  • \(FP_c\) – samples of other classes incorrectly predicted as class \(c\)

  • \(FN_c\) – samples of class \(c\) incorrectly predicted as another class

The per-class metrics are then

\begin{equation} \text {Precision}_c = \frac {TP_c}{TP_c + FP_c}, \quad \text {Recall}_c = \frac {TP_c}{TP_c + FN_c}, \quad F_{1,c} = \frac {2\cdot \text {Precision}_c\cdot \text {Recall}_c}{\text {Precision}_c + \text {Recall}_c} \end{equation}

Note the symmetry: for binary classification, \(FP_0 = FN_1\) and \(FP_1 = FN_0\). Hence the precision of one class and the recall of the other class share the same error count in their denominators; in particular, when \(FP_1 = 0\), both the precision of class 1 and the recall of class 0 are perfect.

  • Example 9.1: Back to the Covid antibody test example (Example 9.1, with \(TP=141\), \(FP=0\), \(FN=67\), \(TN=31\)).

    The asymmetry is significant: \(FP=0\) yields perfect precision for class 1 but also perfect recall for class 0, while the high \(FN=67\) degrades recall for class 1 and precision for class 0. This illustrates the symmetry \(FP_1 = FN_0\) noted above.

    Metric       Class 1 (positive)                                          Class 0 (negative)
    Precision    \(\dfrac {141}{141+0} = 100\%\)                             \(\dfrac {31}{31+67} \approx 31.6\%\)
    Recall       \(\dfrac {141}{141+67} \approx 67.8\%\)                     \(\dfrac {31}{31+0} = 100\%\)
    \(F_1\)      \(\dfrac {2\cdot 1.0\cdot 0.678}{1.0+0.678} \approx 80.8\%\)    \(\dfrac {2\cdot 0.316\cdot 1.0}{0.316+1.0} \approx 48.1\%\)

    Table 9.2: Per-class metrics for the Covid antibody test.
  • Example 9.2: Back to the logistic regression example. The classifier performs better on class 1 (the majority class, with 60 samples) than on class 0 (40 samples).

    Metric       Class 1                                                     Class 0
    Precision    \(\dfrac {55}{55+5} \approx 91.7\%\)                        \(\dfrac {35}{35+5} = 87.5\%\)
    Recall       \(\dfrac {55}{55+5} \approx 91.7\%\)                        \(\dfrac {35}{35+5} = 87.5\%\)
    \(F_1\)      \(\dfrac {2\cdot 0.917\cdot 0.917}{0.917+0.917} \approx 91.7\%\)    \(\dfrac {2\cdot 0.875\cdot 0.875}{0.875+0.875} = 87.5\%\)

    Table 9.3: Per-class metrics for the logistic regression example.
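The per-class bookkeeping can be sketched as follows; a minimal plain-Python version, with label vectors reconstructed from the confusion matrix of Example 9.2:

```python
def per_class_metrics(y_true, y_pred, c):
    """Precision, recall, and F1 for class c, treating c as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == c and p == c for t, p in pairs)
    fp = sum(t != c and p == c for t, p in pairs)
    fn = sum(t == c and p != c for t, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Labels reproducing the confusion matrix of Example 9.2
y_true = [1] * 60 + [0] * 40
y_pred = [1] * 55 + [0] * 5 + [1] * 5 + [0] * 35
print(per_class_metrics(y_true, y_pred, 1))  # class 1: all ~0.917
print(per_class_metrics(y_true, y_pred, 0))  # class 0: all 0.875
```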

9.4 Imbalanced Dataset

Imbalanced dataset: a dataset with a significant difference between the numbers of samples in each class. The following examples present a few problems related to imbalanced datasets.

  • Example 9.3: Let’s take a dataset with 1000 samples:

    • 990 samples labeled ‘0’

    • 10 samples labeled ‘1’

    What are the performance metrics of the classifier that always predicts \(\hat {Y}=0\)?

    • Solution: The resulting confusion matrix is

                      Predicted
                      Yes     No
      Actual  Yes       0     10
              No        0    990

      and the resulting quantities are

      \begin{equation} \begin{aligned} \text {Accuracy} &= \frac {990}{1000} = 0.99 = 99\%\\ \text {Precision} &= \dfrac {TP}{FP+TP} = \frac {0}{0 + 0} = \text {Undefined}\\ \text {Recall} &= \frac {TP}{TP + FN} = \frac {0}{0+10} = 0\\ \text {Specificity} &= \frac {TN}{FP+TN} = \frac {990}{0+990} = 1 = 100\%\\ F_1 &= \frac {TP}{TP + \frac {1}{2}\left (FP+FN\right )} = \frac {0}{0 + \frac {1}{2}\left (0+10\right )} = 0 \end {aligned} \end{equation}

    • Note, accuracy is an insufficient metric!

    • Note, while the convention is to label the class outcome of interest as ‘1’, the labels are sometimes interchanged.

Majority classifier

Majority classifier is a classifier that always predicts the most frequent class in the dataset, as in Example 9.3.

Despite achieving high accuracy on imbalanced data, a majority classifier has zero recall and zero \(F_1\) for the minority class. It is commonly used as a baseline: any useful classifier should outperform it on metrics beyond accuracy.

Small dataset problem

In imbalanced datasets, the minority class may contain very few samples. Performance metrics computed on small subsets are subject to high sampling variability, making them unreliable estimates of the true classifier performance.

  • Example 9.4: Consider the dataset from Example 9.3 with only \(n=10\) positive samples. Assume a classifier with true per-sample accuracy of \(p=0.8\) on class ‘1’. What is the probability that it correctly classifies 6 or fewer of the 10 positive samples?

    • Solution: Each classification is an independent Bernoulli trial with success probability \(p=0.8\), so the number of correct classifications follows a binomial distribution, \(X\sim \text {Bin}(n=10,\, p=0.8)\). The cumulative probability is

      \begin{equation*} \Pr (X\le 6) = \sum _{k=0}^{6}\binom {10}{k}p^k(1-p)^{10-k} \approx 12.09\% \end{equation*}

      There is a \({\approx }12\%\) chance of observing recall \(\le 60\%\), despite the true accuracy being \(80\%\). Conversely, \(\Pr (X=10)=10.74\%\): even perfect recall is plausible by chance for an imperfect classifier.
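The binomial computation above can be reproduced in a few lines; a sketch using only the Python standard library:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, p = 10, 0.8                       # 10 positive samples, true accuracy 80%
print(round(binom_cdf(6, n, p), 4))  # ~0.1209: recall <= 60% by chance
print(round(p**n, 4))                # ~0.1074: perfect recall by chance
```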

This is a problem of confidence in performance evaluation. While confidence interval analysis is out of the scope of this document, the key takeaway is: metrics computed on small subsets can deviate significantly from the true performance and should be interpreted with caution.

Anomaly detection

Anomaly detection is a sub-field of classification where the class of interest (the anomaly) is extremely rare compared to the normal class. Examples include fraud detection, network intrusion detection, and equipment failure prediction. The extreme class imbalance (e.g., 1:1000) makes accuracy meaningless and amplifies the small dataset problem for the minority class. Precision and recall are the primary evaluation metrics in this setting.

9.5 Multi-class performance

9.5.1 Categorical Encoding
  • Goal: Convert categorical (non-numeric) features into numeric representations suitable for mathematical models.

Label encoding assigns each category an integer. For a feature with categories \(\{A, B, C\}\), label encoding maps \(A \mapsto 0\), \(B \mapsto 1\), \(C \mapsto 2\). This is appropriate for ordinal features where the order is meaningful (e.g., low, medium, high).

One-hot encoding creates a binary indicator column for each category. A feature with \(C\) categories is replaced by \(C\) binary features, increasing the dimensionality from \(N\) to \(N + C - 1\).

  • Example 9.5: Consider a feature “Color” with three categories: \(\{\text {Red}, \text {Green}, \text {Blue}\}\) and \(M=4\) samples:

    Color    Red   Green   Blue
    Red       1      0      0
    Blue      0      0      1
    Green     0      1      0
    Red       1      0      0

    The single categorical feature is replaced by three binary features.
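A minimal sketch of one-hot encoding in plain Python (column order follows first appearance; in practice a library routine such as pandas.get_dummies would be used):

```python
def one_hot(values):
    """One-hot encode a list of categorical values."""
    categories = list(dict.fromkeys(values))  # unique categories, order-preserving
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

cats, rows = one_hot(["Red", "Blue", "Green", "Red"])
print(cats)  # ['Red', 'Blue', 'Green']
print(rows)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```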

Pros:

  • One-hot encoding avoids imposing artificial ordinal relationships between categories.

  • Label encoding is memory-efficient and preserves dimensionality.

Cons:

  • One-hot encoding increases dimensionality significantly for high-cardinality features.

  • Label encoding introduces a spurious ordinal relationship for nominal features.

9.5.2 Performance

For multi-class classification, confusion matrices, accuracy and per-class performance (Sec. 9.3.6) are used.

To get a single number, two averaging strategies are applied.

Macro-averaging

Averaging with equal weight to each class, regardless of the number of instances.

Micro-averaging

Equal weight to each instance, regardless of the class label and the number of instances in the class. All TP, FP, etc. are summed across all classes.
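The two averaging strategies can be sketched from per-class counts; the three-class counts below are illustrative, not taken from the text:

```python
def macro_micro_precision(counts):
    """counts: dict mapping class -> (TP, FP); returns (macro, micro) precision."""
    per_class = [tp / (tp + fp) for tp, fp in counts.values()]
    macro = sum(per_class) / len(per_class)  # equal weight per class
    tp_sum = sum(tp for tp, _ in counts.values())
    fp_sum = sum(fp for _, fp in counts.values())
    micro = tp_sum / (tp_sum + fp_sum)       # equal weight per instance
    return macro, micro

counts = {"A": (90, 10), "B": (5, 5), "C": (40, 10)}  # hypothetical counts
macro, micro = macro_micro_precision(counts)
print(round(macro, 3), round(micro, 3))  # 0.733 0.844
```

The rare class B drags the macro average down, while the micro average is dominated by the abundant class A.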

Further reading: Accuracy, precision, and recall in multi-class classification

9.6 Decision threshold

  • Goal: Quantify the trade-off between confusion matrix elements as a function of the decision threshold.

With a probabilistic classifier, the output is the estimated probability of the positive class,

\begin{equation} \Pr (\hat {y}=1\mid \bx ) = f_\bth (\bx ) \end{equation}

The binary decision is obtained by comparing \(f_\bth (\bx )\) with a threshold \(\mathsf {thr}\):

\begin{equation} \hat {y} = \begin{cases} 1 & f_\bth (\bx ) \ge \mathsf {thr}\\ 0 & f_\bth (\bx ) < \mathsf {thr} \end {cases} \end{equation}

The default threshold is \(\mathsf {thr}=0.5\). Changing \(\mathsf {thr}\) shifts the trade-off between TP, FP, FN, and TN, and therefore changes all derived metrics (precision, recall, specificity, \(F_1\)).
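The decision rule and its effect on the confusion-matrix counts can be sketched as follows (the probability scores are illustrative):

```python
def confusion_at(scores, labels, thr=0.5):
    """Confusion-matrix counts (TP, FP, FN, TN) when thresholding
    predicted positive-class probabilities at thr."""
    preds = [int(s >= thr) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    return tp, fp, fn, tn

scores = [0.1, 0.4, 0.35, 0.8]            # illustrative classifier outputs
labels = [0, 0, 1, 1]
print(confusion_at(scores, labels, 0.5))  # (1, 0, 1, 2)
print(confusion_at(scores, labels, 0.3))  # (2, 1, 0, 1): lower thr, more positives
```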

9.6.1 Receiver Operating Characteristics (RoC)
  • Goal: Visualize classifier performance across all thresholds.

The RoC curve plots the following two quantities as \(\mathsf {thr}\) varies from \(0\) to \(1\):

  • True Positive Rate (TPR): synonym for recall,

    \begin{equation} TPR = \frac {TP}{TP+FN} = \text {recall} \end{equation}

  • False Positive Rate (FPR): complement of specificity,

    \begin{equation} FPR = \frac {FP}{FP+TN} = 1 - \text {specificity} \end{equation}

As \(\mathsf {thr}\to 0\), everything is classified as positive (\(TPR\to 1\), \(FPR\to 1\)). As \(\mathsf {thr}\to 1\), everything is classified as negative (\(TPR\to 0\), \(FPR\to 0\)).

RoC is a legacy term from detection theory and communication systems.

9.6.2 Area under curve (AUC)
  • Goal: Quantify threshold-independent performance with a single scalar.

AUC: the area under the RoC curve.

Range: A random (coin-toss) classifier has \(\mathsf {AUC}=0.5\) and an ideal classifier has \(\mathsf {AUC}=1\). All practical classifiers fall in the range \(0.5 \le \mathsf {AUC} \le 1\).
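Equivalently, the AUC can be computed from its ranking interpretation: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as half. A minimal sketch with illustrative toy scores:

```python
def auc(labels, scores):
    """AUC via the ranking interpretation: fraction of (positive, negative)
    pairs in which the positive sample scores higher (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 (positive, negative) pairs are ranked correctly
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```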

The relationship between classifier quality and the RoC curve is illustrated in Fig. 9.2. When the two class distributions (positive and negative) are well separated, the classifier achieves high TPR at low FPR, producing a RoC curve that hugs the top-left corner (\(\mathsf {AUC}\to 1\)). As the distributions overlap, the FN and FP regions grow, and the RoC curve shifts toward the diagonal (\(\mathsf {AUC}\to 0.5\)).

(image)

Figure 9.2: RoC curves (top) and corresponding class distributions (bottom) for three classifiers. Well-separated distributions yield \(\mathsf {AUC}\approx 1\); overlapping distributions produce FN/FP errors and \(\mathsf {AUC}\to 0.5\).

The choice of threshold \(\mathsf {thr}\) controls which part of the overlap region is assigned to each class. Lowering \(\mathsf {thr}\) classifies more samples as positive, increasing TP but also FP. Raising \(\mathsf {thr}\) increases TN but also FN. Fig. 9.3 illustrates this trade-off: moving the threshold left or right redistributes errors between FN and FP while tracing the RoC curve.

(image)

Figure 9.3: Effect of threshold on classification: lowering \(\mathsf {thr}\) (left) reduces FN but increases FP; raising \(\mathsf {thr}\) (right) reduces FP but increases FN. The arrows indicate the direction of change relative to the default threshold (center).

(Receiver Operating Characteristics example)

Figure 9.4: RoC of logistic regression example in Fig. 8.1. The model operation point is \(\mathsf {thr}=0.5\).

Advantages:

  • Scale-invariant: AUC measures how well predictions are ranked, rather than their absolute values.

  • Threshold-invariant: AUC summarizes performance across all thresholds, without requiring a specific threshold choice.

Limitations:

  • Scale invariance may be undesirable when well-calibrated probabilities are needed (e.g., for risk assessment), since AUC is insensitive to the predicted probability values.

  • Threshold invariance may be undesirable when the application requires a specific trade-off between false negatives and false positives. For example, in spam detection, minimizing false positives (legitimate email marked as spam) is more important than minimizing false negatives. AUC does not capture such asymmetric cost preferences.