Machine Learning & Signals Learning
9 Classification Performance Metrics
Definitions:
-
• \(\by \) - target values vector of the test database, \(\by \in \real ^M\)
-
• \(\hat {\by }\) - predicted values vector, \(\hat {\by }\in \real ^M\), the output of some classifier \(\hat {\by } = f_\bth (\bX )\).
Typically, in binary classification, \(y_i\in \left \{0,1\right \}\).
9.1 Definitions
Basic terminology:
-
• ‘1’ – positive group or result
-
• ‘0’ – negative group or result
-
• \(Y\) – actual class
-
• \(\hat {Y}\) – predicted class
Positive/negative terminology is rather arbitrary. Typically, the result of interest is termed positive.
9.2 Confusion matrix
The summarization is in the form of a 2D non-normalized histogram of \((Y,\hat {Y})\).
| | | Predicted values | |
| | | Positive, \(\hat {Y}=1\) | Negative, \(\hat {Y}=0\) |
| Actual values | Positive, \(Y=1\) | TP, True Positive | FN, False Negative |
| | Negative, \(Y=0\) | FP, False Positive | TN, True Negative |
The (test) database has \(M\) values, among them:
-
• TP + FN positive values
-
• FP + TN negative values
This is the most common way to summarize the performance of a particular classifier on a particular dataset. It can be easily extended for multi-class classifiers.
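The four tallies can be computed directly from the label vectors; a minimal Python sketch (the function name `confusion_matrix` and the variables `y_true`, `y_pred` are illustrative, not from the text):

```python
def confusion_matrix(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels in {0, 1}."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return tp, fn, fp, tn
```

The tuple ordering (TP, FN, FP, TN) follows the row-major order of the confusion matrix above.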
9.3 Performance Metrics
9.3.1 Accuracy
\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {\text {correct predictions}}{\text {total predictions}}\\[3pt] &= \frac {TP + TN}{TP + TN + FP + FN} \end {aligned} \end{equation}
-
Example 9.1: Covid antibody (fast non-PCR) test performance. The example includes test statistics of 239 participants [?], as presented below.
| | Predicted Yes | Predicted No |
| Actual Yes | 141 | 67 |
| Actual No | 0 | 31 |

The resulting accuracy is
\(\seteqnumber{0}{}{1}\)\begin{equation} \text {Accuracy} = \frac {141+31}{239} = 0.7196652 \approx 72.0\% \end{equation}
In the example, FN=67 indicates poor performance, while FP=0 is desirable. However, accuracy does not reflect the difference between these two error types. Additional metrics are used to quantify these aspects.
9.3.2 Precision
From the probability theory,
\(\seteqnumber{0}{}{2}\)\begin{align} \Pr (Y=1|\hat {Y}=1) &= \frac {\Pr (Y=1,\hat {Y}=1)}{\Pr (\hat {Y}=1)} \\ \Pr (\hat {Y}=1) &= \Pr (\hat {Y}=1,Y=0) + \Pr (\hat {Y}=1,Y=1) \end{align}
\(\seteqnumber{0}{}{4}\)\begin{equation} \text {Precision} = \dfrac {TP}{FP+TP} = \dfrac {\text {Correctly predicted 1's}}{\text {All predicted 1's}} \end{equation}
| Term | Radar Interpretation |
| Accuracy | Portion of all objects correctly identified as planes or non-planes |
| Precision | Among all classified as planes, the portion correctly classified as planes |
| Recall (sensitivity) | Among all actual planes, the portion correctly classified as planes |
| Specificity | Among all actual non-planes, the portion correctly classified as non-planes |
-
Example 9.1: Back to the previous example,
\(\seteqnumber{0}{}{5}\)\begin{equation} \text {Precision} = \dfrac {141}{0+141} = 1 = 100\% \end{equation}
The high value of the precision is due to the low FP. From the medical point of view, all positive results are actually positive. Whoever was identified by this test as Covid-positive is really positive.
9.3.3 Recall (sensitivity)
From the probability theory,
\(\seteqnumber{0}{}{6}\)\begin{align} \Pr (\hat {Y}=1|Y=1) &= \frac {\Pr (Y=1,\hat {Y}=1)}{\Pr (Y=1)} \\ \Pr (Y=1) &= \Pr (Y=1,\hat {Y}=0) + \Pr (Y=1,\hat {Y}=1) \end{align}
\(\seteqnumber{0}{}{8}\)\begin{equation} \text {Recall} = \frac {TP}{TP + FN} = \frac {\text {Correctly predicted 1's}}{\text {Actual 1's}} \end{equation}
Medical meaning: portion of correctly classified ill among all the ill.
-
Example 9.1: Back to the previous example,
\(\seteqnumber{0}{}{9}\)\begin{equation} \text {Recall} = \dfrac {141}{141 + 67} = 0.678 = 67.8\% \end{equation}
The low value of the recall is due to the high FN. From the medical point of view, among all actually positive (ill) individuals, only 67.8% were correctly identified.
9.3.4 Specificity
\(\seteqnumber{0}{}{10}\)\begin{equation} \text {Specificity} = \frac {TN}{FP+TN} = \frac {\text {Correctly predicted 0's}} {\text {Actual 0's}} \end{equation}
Medical meaning: portion of correctly classified healthy among all the healthy.
9.3.5 F1-score
The harmonic mean between precision and recall,
\(\seteqnumber{0}{}{12}\)\begin{equation} F_1 = \frac {2}{\frac {1}{recall} + \frac {1}{precision}} = \frac {TP}{TP + \frac {1}{2}\left (FP+FN\right )} \end{equation}
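The five metrics defined above can be collected into one helper; a minimal Python sketch (the function name `metrics` is illustrative), checked below against the numbers of Example 9.1 (TP=141, FN=67, FP=0, TN=31):

```python
def metrics(tp, fn, fp, tn):
    """Compute the basic binary classification metrics from the four tallies."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        # Precision is undefined when nothing is predicted positive
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        # F1 written via counts, equivalent to the harmonic-mean form
        "f1": tp / (tp + 0.5 * (fp + fn)),
    }
```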
-
Example 9.2: The logistic regression classifier from Figs. 8.1 and 8.6 has the following confusion matrix:
| | Predicted Yes | Predicted No |
| Actual Yes | 55 | 5 |
| Actual No | 5 | 35 |

The resulting metrics are
\(\seteqnumber{0}{}{14}\)\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {55+35}{100} = 0.9 = 90\%\\ \text {Precision} &= \dfrac {55}{5+55} = \frac {55}{60} \approx 91.7\%\\ \text {Recall} &= \frac {55}{55+5} = \frac {55}{60} \approx 91.7\%\\ \text {Specificity} &= \frac {35}{5+35} = \frac {35}{40} = 87.5\%\\ F_1 &= \frac {55}{55 + \frac {1}{2}\left (5+5\right )} = \frac {55}{60} \approx 91.7\% \end {aligned} \end{equation}
Note, since \(FP = FN\), precision, recall, and \(F_1\) are all equal.
9.3.6 Per-class performance
The metrics above are defined with respect to the positive class (\(Y=1\)). To assess both classes, the metrics can be computed separately for each class by treating it as the positive class.
For class \(c\), define:
-
• \(TP_c\) – samples of class \(c\) correctly predicted as class \(c\)
-
• \(FP_c\) – samples of other classes incorrectly predicted as class \(c\)
-
• \(FN_c\) – samples of class \(c\) incorrectly predicted as another class
The per-class metrics are then
\(\seteqnumber{0}{}{15}\)\begin{equation} \text {Precision}_c = \frac {TP_c}{TP_c + FP_c}, \quad \text {Recall}_c = \frac {TP_c}{TP_c + FN_c}, \quad F_{1,c} = \frac {2\cdot \text {Precision}_c\cdot \text {Recall}_c}{\text {Precision}_c + \text {Recall}_c} \end{equation}
Note the symmetry: for binary classification, \(FP_0 = FN_1\) and \(FP_1 = FN_0\), i.e., every false positive for one class is a false negative for the other. Consequently, the precision of one class and the recall of the other are degraded by the same errors.
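The per-class definitions can be sketched for an arbitrary number of classes; a minimal Python sketch (the function name `per_class_metrics` and the nested-dict count layout are illustrative assumptions):

```python
def per_class_metrics(counts):
    """counts[c][k]: number of samples of actual class c predicted as class k.
    Returns {class: (precision, recall, f1)}."""
    classes = counts.keys()
    out = {}
    for c in classes:
        tp = counts[c][c]
        fn = sum(counts[c][k] for k in classes if k != c)  # c mislabeled as other
        fp = sum(counts[a][c] for a in classes if a != c)  # others labeled as c
        prec = tp / (tp + fp) if tp + fp else float("nan")
        rec = tp / (tp + fn) if tp + fn else float("nan")
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out[c] = (prec, rec, f1)
    return out
```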
-
Example 9.1: Back to the Covid antibody test example (Example 9.1, with \(TP=141\), \(FP=0\), \(FN=67\), \(TN=31\)).
The asymmetry is significant: \(FP=0\) yields perfect precision for class 1 but also perfect recall for class 0, while the high \(FN=67\) degrades recall for class 1 and precision for class 0. This illustrates the symmetry \(FP_1 = FN_0\) noted above.
| Metric | Class 1 (positive) | Class 0 (negative) |
| Precision | \(\dfrac {141}{141+0} = 100\%\) | \(\dfrac {31}{31+67} \approx 31.6\%\) |
| Recall | \(\dfrac {141}{141+67} \approx 67.8\%\) | \(\dfrac {31}{31+0} = 100\%\) |
| \(F_1\) | \(\dfrac {2\cdot 1.0\cdot 0.678}{1.0+0.678} \approx 80.8\%\) | \(\dfrac {2\cdot 0.316\cdot 1.0}{0.316+1.0} \approx 48.1\%\) |

Table 9.2: Per-class metrics for the Covid antibody test.
-
Example 9.2: Back to Example 9.2. The classifier performs better on class 1 (the majority class with 60 samples) than on class 0 (40 samples).
| Metric | Class 1 | Class 0 |
| Precision | \(\dfrac {55}{55+5} \approx 91.7\%\) | \(\dfrac {35}{35+5} = 87.5\%\) |
| Recall | \(\dfrac {55}{55+5} \approx 91.7\%\) | \(\dfrac {35}{35+5} = 87.5\%\) |
| \(F_1\) | \(\dfrac {2\cdot 0.917\cdot 0.917}{0.917+0.917} \approx 91.7\%\) | \(\dfrac {2\cdot 0.875\cdot 0.875}{0.875+0.875} = 87.5\%\) |

Table 9.3: Per-class metrics for the logistic regression example.
9.4 Imbalanced Dataset
Imbalanced dataset: a dataset with a significant difference between the number of samples of each class. The following examples present a few problems related to imbalanced datasets.
-
Example 9.3: Let’s take a dataset with 1000 samples:
-
• 990 samples labeled ‘0’
-
• 10 samples labeled ‘1’
What are the performance metrics of the classifier that always predicts \(\hat {Y}=0\)?
-
Solution: The resulting confusion matrix is
| | Predicted Yes | Predicted No |
| Actual Yes | 0 | 10 |
| Actual No | 0 | 990 |

and the resulting quantities are
\(\seteqnumber{0}{}{16}\)\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {990}{1000} = 0.99 = 99\%\\ \text {Precision} &= \dfrac {TP}{FP+TP} = \frac {0}{0 + 0} = \text {Undefined}\\ \text {Recall} &= \frac {TP}{TP + FN} = \frac {0}{0+10} = 0\\ \text {Specificity} &= \frac {TN}{FP+TN} = \frac {990}{0+990}= 1 = 100\%\\ F_1 &= \frac {TP}{TP + \frac {1}{2}\left (FP+FN\right )} = \frac {0}{0 + \frac {1}{2}\left (0+10\right )} = 0 \end {aligned} \end{equation}
-
• Note, accuracy alone is an insufficient metric!
-
• Note, while the convention is to label the outcome of interest as ‘1’, the labeling is sometimes interchangeable.
-
Majority classifier
Majority classifier is a classifier that always predicts the most frequent class in the dataset, as in Example 9.3.
Despite achieving high accuracy on imbalanced data, a majority classifier has zero recall and zero \(F_1\) for the minority class. It is commonly used as a baseline: any useful classifier should outperform it on metrics beyond accuracy.
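The majority baseline can be sketched in a few lines of Python (the names `majority_predict` and `y_train` are illustrative):

```python
from collections import Counter

def majority_predict(y_train, m):
    """Always predict the most frequent class seen in the training labels."""
    majority = Counter(y_train).most_common(1)[0][0]
    return [majority] * m
```

On the dataset of Example 9.3 this baseline predicts ‘0’ for every sample, reproducing the 99% accuracy and zero recall above.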
Small dataset problem
In imbalanced datasets, the minority class may contain very few samples. Performance metrics computed on small subsets are subject to high sampling variability, making them unreliable estimates of the true classifier performance.
-
Example 9.4: Consider the dataset from Example 9.3 with only \(n=10\) positive samples. Assume a classifier with true per-sample accuracy of \(p=0.8\) on class ‘1’. What is the probability that it correctly classifies 6 or fewer of the 10 positive samples?
-
Solution: Each classification is an independent Bernoulli trial with success probability \(p=0.8\), so the number of correct classifications follows a binomial distribution, \(X\sim \text {Bin}(n=10,\, p=0.8)\). The cumulative probability is
\(\seteqnumber{0}{}{17}\)\begin{equation*} \Pr (X\le 6) = \sum _{k=0}^{6}\binom {10}{k}p^k(1-p)^{10-k} \approx 12.09\% \end{equation*}
There is a \({\approx }12\%\) chance of observing recall \(\le 60\%\), despite the true accuracy being \(80\%\). Conversely, \(\Pr (X=10)=10.74\%\): even perfect recall is plausible by chance for an imperfect classifier.
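The cumulative binomial probability can be verified with the standard library (a sketch; `binom_cdf` is an illustrative helper, not a library function):

```python
from math import comb

def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ Bin(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))
```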
-
This is a problem of confidence in performance evaluation. While confidence interval analysis is out of the scope of this document, the key takeaway is: metrics computed on small subsets can deviate significantly from the true performance and should be interpreted with caution.
Anomaly detection
Anomaly detection is a sub-field of classification where the class of interest (the anomaly) is extremely rare compared to the normal class. Examples include fraud detection, network intrusion detection, and equipment failure prediction. The extreme class imbalance (e.g., 1:1000) makes accuracy meaningless and amplifies the small dataset problem for the minority class. Precision and recall are the primary evaluation metrics in this setting.
9.5 Multi-class performance
9.5.1 Categorical Encoding
Label encoding assigns each category an integer. For a feature with categories \(\{A, B, C\}\), label encoding maps \(A \mapsto 0\), \(B \mapsto 1\), \(C \mapsto 2\). This is appropriate for ordinal features where the order is meaningful (e.g., low, medium, high).
One-hot encoding creates a binary indicator column for each category. A feature with \(C\) categories is replaced by \(C\) binary features, increasing the dimensionality from \(N\) to \(N + C - 1\).
-
Example 9.5: Consider a feature “Color” with three categories: \(\{\text {Red}, \text {Green}, \text {Blue}\}\) and \(M=4\) samples:
| Color | Red | Green | Blue |
| Red | 1 | 0 | 0 |
| Blue | 0 | 0 | 1 |
| Green | 0 | 1 | 0 |
| Red | 1 | 0 | 0 |

The single categorical feature is replaced by three binary features.
Pros:
-
• One-hot encoding avoids imposing artificial ordinal relationships between categories.
-
• Label encoding is memory-efficient and preserves dimensionality.
Cons:
-
• One-hot encoding increases dimensionality significantly for high-cardinality features.
-
• Label encoding introduces a spurious ordinal relationship for nominal features.
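One-hot encoding can be sketched without any library (the function name `one_hot` is illustrative; categories are ordered here by first appearance, unlike the example table above, which lists Red, Green, Blue):

```python
def one_hot(values):
    """Replace a categorical column by C binary indicator columns."""
    cats = []
    for v in values:           # collect categories in order of first appearance
        if v not in cats:
            cats.append(v)
    encoded = [[1 if v == c else 0 for c in cats] for v in values]
    return cats, encoded
```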
9.5.2 Performance
For multi-class classification, confusion matrices, accuracy, and per-class performance (Sec. 9.3.6) are used.
To get a single number, two averaging strategies are applied.
Macro-averaging
Averaging with equal weight to each class, regardless of the number of instances.
Micro-averaging
Equal weight to each instance, regardless of the class label and the number of instances in the class. All TP, FP, etc. are summed across all the classes.
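The two averaging strategies can be contrasted on \(F_1\); a minimal sketch (the function name `macro_micro_f1` and the per-class count layout are illustrative assumptions). On the imbalanced data of Example 9.3, the macro average exposes the failure on the minority class while the micro average does not:

```python
def macro_micro_f1(per_class_counts):
    """per_class_counts: {class: (tp, fp, fn)}. Returns (macro_f1, micro_f1)."""
    f1s = []
    TP = FP = FN = 0
    for tp, fp, fn in per_class_counts.values():
        # Per-class F1 via counts; 0 when the class is never involved
        f1s.append(tp / (tp + 0.5 * (fp + fn)) if tp + fp + fn else 0.0)
        TP += tp
        FP += fp
        FN += fn
    macro = sum(f1s) / len(f1s)          # equal weight per class
    micro = TP / (TP + 0.5 * (FP + FN))  # equal weight per instance
    return macro, micro
```

Note that for single-label classification the total FP and total FN are equal, so the micro-averaged \(F_1\) coincides with the accuracy.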
Further reading: Accuracy, precision, and recall in multi-class classification
9.6 Decision threshold
With a probabilistic classifier, the output is the estimated probability of the positive class,
\(\seteqnumber{0}{}{17}\)\begin{equation} \Pr (\hat {y}=1\mid \bx ) = f_\bth (\bx ) \end{equation}
The binary decision is obtained by comparing \(f_\bth (\bx )\) with a threshold \(\mathsf {thr}\):
\(\seteqnumber{0}{}{18}\)\begin{equation} \hat {y} = \begin{cases} 1 & f_\bth (\bx ) \ge \mathsf {thr}\\ 0 & f_\bth (\bx ) < \mathsf {thr} \end {cases} \end{equation}
The default threshold is \(\mathsf {thr}=0.5\). Changing \(\mathsf {thr}\) shifts the trade-off between TP, FP, FN, and TN, and therefore changes all derived metrics (precision, recall, specificity, \(F_1\)).
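The thresholding rule is a one-liner; a minimal sketch (the function name `decide` is illustrative):

```python
def decide(probs, thr=0.5):
    """Convert estimated positive-class probabilities to binary decisions."""
    return [1 if p >= thr else 0 for p in probs]
```

Raising `thr` turns some positive decisions into negative ones, trading FP for FN.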
9.6.1 Receiver Operating Characteristic (RoC)
The RoC curve plots the following two quantities as \(\mathsf {thr}\) varies from \(0\) to \(1\):
-
• True Positive Rate (TPR): synonym for recall,
\(\seteqnumber{0}{}{19}\)\begin{equation} TPR = \frac {TP}{TP+FN} = \text {recall} \end{equation}
-
• False Positive Rate (FPR): complement of specificity,
\(\seteqnumber{0}{}{20}\)\begin{equation} FPR = \frac {FP}{FP+TN} = 1 - \text {specificity} \end{equation}
As \(\mathsf {thr}\to 0\), everything is classified as positive (\(TPR\to 1\), \(FPR\to 1\)). As \(\mathsf {thr}\to 1\), everything is classified as negative (\(TPR\to 0\), \(FPR\to 0\)).
RoC is a legacy term from detection theory and communication system theory.
9.6.2 Area under curve (AUC)
AUC (Area Under the Curve): the area under the RoC curve.
Range: A random (coin-toss) classifier has \(\mathsf {AUC}=0.5\) and an ideal classifier has \(\mathsf {AUC}=1\). All practical classifiers fall in the range \(0.5 \le \mathsf {AUC} \le 1\); a classifier with \(\mathsf {AUC}<0.5\) can be improved simply by inverting its decisions.
The relationship between classifier quality and the RoC curve is illustrated in Fig. 9.2. When the two class distributions (positive and negative) are well separated, the classifier achieves high TPR at low FPR, producing a RoC curve that hugs the top-left corner (\(\mathsf {AUC}\to 1\)). As the distributions overlap, the FN and FP regions grow, and the RoC curve shifts toward the diagonal (\(\mathsf {AUC}\to 0.5\)).
The choice of threshold \(\mathsf {thr}\) controls which part of the overlap region is assigned to each class. Lowering \(\mathsf {thr}\) classifies more samples as positive, increasing TP but also FP. Raising \(\mathsf {thr}\) increases TN but also FN. Fig. 9.3 illustrates this trade-off: moving the threshold left or right redistributes errors between FN and FP while tracing the RoC curve.
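AUC equals the probability that a randomly chosen positive sample is ranked above a randomly chosen negative one (with ties counted as half). A minimal sketch of this rank-based computation (the function name `roc_auc` is illustrative; for large datasets a sort-based method is used instead of this quadratic loop):

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the win rate of positive scores over negative scores."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_pos) * len(scores_neg))
```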
Advantages:
-
• Scale-invariant: AUC measures how well predictions are ranked, rather than their absolute values.
-
• Threshold-invariant: AUC summarizes performance across all thresholds, without requiring a specific threshold choice.
Limitations:
-
• Scale invariance may be undesirable when well-calibrated probabilities are needed (e.g., for risk assessment), since AUC is insensitive to the predicted probability values.
-
• Threshold invariance may be undesirable when the application requires a specific trade-off between false negatives and false positives. For example, in spam detection, minimizing false positives (legitimate email marked as spam) is more important than minimizing false negatives. AUC does not capture such asymmetric cost preferences.