Machine Learning & Signals Learning
10 Classification Performance Metrics
Definitions:
• \(\by \) – target values vector of the test database, \(\by \in \real ^M\)
• \(\hat {\by }\) – predicted values vector, \(\hat {\by }\in \real ^M\), the output of some classifier \(\hat {\by } = f_\bw (\bX )\).
Typically, in binary classification, \(y_i\in \left \{0,1\right \}\).
10.1 Definitions
Basic terminology:
• ‘1’ – positive group or result
• ‘0’ – negative group or result
• \(Y\) – actual class
• \(\hat {Y}\) – predicted class
Positive/negative terminology is rather arbitrary. Typically, the result of interest is termed positive.
10.2 Confusion matrix
The summarization is in the form of a 2D non-normalized histogram of \((Y,\hat {Y})\).
| | | Predicted values | |
| | | Positive, \(\hat {Y}=1\) | Negative, \(\hat {Y}=0\) |
| Actual values | Positive, \(Y=1\) | TP, True Positive | FN, False Negative |
| | Negative, \(Y=0\) | FP, False Positive | TN, True Negative |
The (test) database has \(M\) values, among them:
• TP + FN positive values
• FP + TN negative values
This is the most common way to summarize the performance of a particular classifier on a particular dataset. It can be easily extended for multi-class classifiers.
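As a sketch of how these counts arise in code (the function name and interface are illustrative, not from the notes), the four confusion-matrix entries can be computed directly from the label vectors:

```python
import numpy as np

def confusion_counts(y, y_hat):
    """Return (TP, FN, FP, TN) for binary labels y, y_hat in {0, 1}."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    tp = int(np.sum((y == 1) & (y_hat == 1)))  # actual 1, predicted 1
    fn = int(np.sum((y == 1) & (y_hat == 0)))  # actual 1, predicted 0
    fp = int(np.sum((y == 0) & (y_hat == 1)))  # actual 0, predicted 1
    tn = int(np.sum((y == 0) & (y_hat == 0)))  # actual 0, predicted 0
    return tp, fn, fp, tn
```

Summing the tuple recovers \(M\), and the first two entries sum to the number of positive samples, matching the counts listed above.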
10.3 Performance Metrics
10.3.1 Accuracy
\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {\text {correct predictions}}{\text {total predictions}}\\[3pt] &= \frac {TP + TN}{TP + TN + FP + FN} \end {aligned} \end{equation}
Example 10.1: Covid antibody (fast non-PCR) test performance. The example includes test statistics of 239 participants [9], as presented below.
| | | Predicted | |
| | | Yes | No |
| Actual | Yes | 141 | 67 |
| | No | 0 | 31 |
The resulting accuracy is
\(\seteqnumber{0}{}{1}\)\begin{equation} \text {Accuracy} = \frac {141+31}{239} = 0.7196652 \approx 72.0\% \end{equation}
In the example, FN = 67 indicates poor performance, while FP = 0 is probably a good sign. However, accuracy does not reflect the discrepancy between these two error types. Additional metrics are used to quantify these aspects.
10.3.2 Precision
From probability theory,
\(\seteqnumber{0}{}{2}\)\begin{align} \Pr (Y=1|\hat {Y}=1) &= \frac {\Pr (Y=1,\hat {Y}=1)}{\Pr (\hat {Y}=1)} \\ \Pr (\hat {Y}=1) &= \Pr (\hat {Y}=1,Y=0) + \Pr (\hat {Y}=1,Y=1) \end{align}
\(\seteqnumber{0}{}{4}\)\begin{equation} \text {Precision} = \dfrac {TP}{FP+TP} = \dfrac {\text {Correctly predicted 1's}}{\text {All predicted 1's}} \end{equation}
TP = Correctly predicted 1’s
TP + FP = All predicted 1’s
| Term | Radar Interpretation |
| Accuracy | Percentage of all correctly identified as planes or not planes |
| Precision | Among all classified as planes, the portion correctly classified as planes |
| Recall (sensitivity) | Among all existing planes, the portion correctly classified as planes |
| Specificity | Among all actual non-planes, the portion correctly classified as non-planes |
Example 10.1: Back to the previous example,
\(\seteqnumber{0}{}{5}\)\begin{equation} \text {Precision} = \dfrac {141}{0+141} = 1 = 100\% \end{equation}
The high precision value is due to the low FP. From the medical point of view, all positive results are actually positive: whoever was identified by this test as Covid-positive is really positive.
10.3.3 Recall (sensitivity)
From probability theory,
\(\seteqnumber{0}{}{6}\)\begin{align} \Pr (\hat {Y}=1|Y=1) &= \frac {\Pr (Y=1,\hat {Y}=1)}{\Pr (Y=1)} \\ \Pr (Y=1) &= \Pr (Y=1,\hat {Y}=0) + \Pr (Y=1,\hat {Y}=1) \end{align}
\(\seteqnumber{0}{}{8}\)\begin{equation} \text {Recall} = \frac {TP}{TP + FN} = \frac {\text {Correctly predicted 1's}}{\text {Actual 1's}} \end{equation}
Medical meaning: the portion of correctly classified ill individuals among all the ill.
Example 10.1: Back to the previous example,
\(\seteqnumber{0}{}{9}\)\begin{equation} \text {Recall} = \dfrac {141}{141 + 67} = 0.678 = 67.8\% \end{equation}
The low value of the recall is due to the high FN. From the medical point of view, among all actually positive (ill) individuals, only 67.8% were correctly identified.
10.3.4 Specificity
\(\seteqnumber{0}{}{10}\)\begin{equation} \text {Specificity} = \frac {TN}{FP+TN} = \frac {\text {Correctly predicted 0's}} {\text {Actual 0's}} \end{equation}
Medical meaning: the portion of correctly classified healthy individuals among all the healthy.
10.3.5 F1-score
The harmonic mean of precision and recall,
\(\seteqnumber{0}{}{12}\)\begin{equation} F_1 = \frac {2}{\frac {1}{recall} + \frac {1}{precision}} = \frac {TP}{TP + \frac {1}{2}\left (FP+FN\right )} \end{equation}
Example 10.2: The logistic regression classifier from Figs. 8.1 and 8.6 has the following confusion matrix:
| | | Predicted | |
| | | Yes | No |
| Actual | Yes | 55 | 5 |
| | No | 5 | 35 |
The resulting metrics are
\(\seteqnumber{0}{}{14}\)\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {55+35}{100} = 0.9 = 90\%\\ \text {Precision} &= \dfrac {55}{5+55} = \frac {55}{60} \approx 91.7\%\\ \text {Recall} &= \frac {55}{55+5} = \frac {55}{60} \approx 91.7\%\\ \text {Specificity} &= \frac {35}{5+35} = \frac {35}{40} = 87.5\%\\ F_1 &= \frac {55}{55 + \frac {1}{2}\left (5+5\right )} = \frac {55}{60} \approx 91.7\% \end {aligned} \end{equation}
Note, since \(FP = FN\), precision, recall, and \(F_1\) are all equal.
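The five metrics above can be collected in one short helper; a minimal sketch (the function name is illustrative), checked here against the counts of Example 10.2:

```python
def binary_metrics(tp, fn, fp, tn):
    """Basic binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1":          tp / (tp + 0.5 * (fp + fn)),
    }

m = binary_metrics(tp=55, fn=5, fp=5, tn=35)     # counts of Example 10.2
print(round(m["accuracy"], 3), round(m["f1"], 3))  # prints: 0.9 0.917
```

As noted above, with \(FP = FN\) the precision, recall, and \(F_1\) entries of the returned dictionary coincide.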
10.3.6 Per-class performance
The metrics above are defined with respect to the positive class (\(Y=1\)). To assess both classes, the metrics can be computed separately for each class by treating it as the positive class.
For class \(c\), define:
• \(TP_c\) – samples of class \(c\) correctly predicted as class \(c\)
• \(FP_c\) – samples of other classes incorrectly predicted as class \(c\)
• \(FN_c\) – samples of class \(c\) incorrectly predicted as another class
The per-class metrics are then
\(\seteqnumber{0}{}{15}\)\begin{equation} \text {Precision}_c = \frac {TP_c}{TP_c + FP_c}, \quad \text {Recall}_c = \frac {TP_c}{TP_c + FN_c}, \quad F_{1,c} = \frac {2\cdot \text {Precision}_c\cdot \text {Recall}_c}{\text {Precision}_c + \text {Recall}_c} \end{equation}
Note the symmetry: for binary classification, \(FP_0 = FN_1\) and \(FP_1 = FN_0\). Consequently, precision for one class and recall for the other class share the same error count, so \(FP_1 = 0\) simultaneously yields perfect precision for class 1 and perfect recall for class 0.
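In the binary case, the per-class metrics follow by swapping the roles of the two classes. A sketch with an illustrative interface, checked against the counts of Example 10.1:

```python
def per_class_metrics(tp, fn, fp, tn):
    """Per-class precision/recall/F1 for binary labels.
    For class 0, the roles swap: TP_0 = TN, FP_0 = FN, FN_0 = FP."""
    out = {}
    for c, (tp_c, fp_c, fn_c) in {1: (tp, fp, fn), 0: (tn, fn, fp)}.items():
        p = tp_c / (tp_c + fp_c)
        r = tp_c / (tp_c + fn_c)
        out[c] = {"precision": p, "recall": r, "f1": 2 * p * r / (p + r)}
    return out

per_class_metrics(tp=141, fn=67, fp=0, tn=31)   # counts of Example 10.1
```

With \(FP = 0\), class 1 gets precision 1 and class 0 gets recall 1, illustrating the shared error count noted above.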
Example 10.1: Back to the Covid antibody test example (Example 10.1, with \(TP=141\), \(FP=0\), \(FN=67\), \(TN=31\)).
The asymmetry is significant: \(FP=0\) yields perfect precision for class 1 but also perfect recall for class 0, while the high \(FN=67\) degrades recall for class 1 and precision for class 0. This illustrates the symmetry \(FP_1 = FN_0\) noted above.
| Metric | Class 1 (positive) | Class 0 (negative) |
| Precision | \(\dfrac {141}{141+0} = 100\%\) | \(\dfrac {31}{31+67} \approx 31.6\%\) |
| Recall | \(\dfrac {141}{141+67} \approx 67.8\%\) | \(\dfrac {31}{31+0} = 100\%\) |
| \(F_1\) | \(\dfrac {2\cdot 1.0\cdot 0.678}{1.0+0.678} \approx 80.8\%\) | \(\dfrac {2\cdot 0.316\cdot 1.0}{0.316+1.0} \approx 48.1\%\) |
Table 10.2: Per-class metrics for the Covid antibody test.
Example 10.2: Back to the previous example. The classifier performs better on class 1 (the majority class with 60 samples) than on class 0 (40 samples).
| Metric | Class 1 | Class 0 |
| Precision | \(\dfrac {55}{55+5} \approx 91.7\%\) | \(\dfrac {35}{35+5} = 87.5\%\) |
| Recall | \(\dfrac {55}{55+5} \approx 91.7\%\) | \(\dfrac {35}{35+5} = 87.5\%\) |
| \(F_1\) | \(\dfrac {2\cdot 0.917\cdot 0.917}{0.917+0.917} \approx 91.7\%\) | \(\dfrac {2\cdot 0.875\cdot 0.875}{0.875+0.875} = 87.5\%\) |
Table 10.3: Per-class metrics for the logistic regression example.
10.4 Imbalanced Dataset
Imbalanced dataset: a dataset with a significant difference between the number of samples in each class. The following examples present a few problems related to imbalanced datasets.
Example 10.3: Let’s take a dataset with 1000 samples:
• 990 samples labeled ‘0’
• 10 samples labeled ‘1’
What are the performance metrics of the classifier that always predicts \(\hat {Y}=0\)?
Solution: The resulting confusion matrix is
| | | Predicted | |
| | | Yes | No |
| Actual | Yes | 0 | 10 |
| | No | 0 | 990 |
and the resulting quantities are
\(\seteqnumber{0}{}{16}\)\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {990}{1000} = 0.99 = 99\%\\ \text {Precision} &= \dfrac {TP}{FP+TP} = \frac {0}{0 + 0} = \text {Undefined}\\ \text {Recall} &= \frac {TP}{TP + FN} = \frac {0}{0+10} = 0\\ \text {Specificity} &= \frac {TN}{FP+TN} = \frac {990}{0+990} = 1 = 100\%\\ F_1 &= \frac {TP}{TP + \frac {1}{2}\left (FP+FN\right )} = \frac {0}{\cdots } = 0 \end {aligned} \end{equation}
• Note, accuracy alone is an insufficient metric!
• Note, while the convention is to label the outcome of interest as ‘1’, the labels are sometimes interchangeable.
Majority classifier
A majority classifier always predicts the most frequent class in the dataset, as in Example 10.3.
Despite achieving high accuracy on imbalanced data, a majority classifier has zero recall and zero \(F_1\) for the minority class. It is commonly used as a baseline: any useful classifier should outperform it on metrics beyond accuracy.
The accuracy of a majority classifier equals the majority-class rate,
\(\seteqnumber{0}{}{17}\)\begin{equation} J_{\mathrm {majority-class}} = \frac {\max (M_0, M_1)}{M} \end{equation}
where \(M_0\) and \(M_1\) are the class counts and \(M\) is the total number of samples.
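The majority-class baseline is a one-liner; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def majority_baseline_accuracy(y):
    """Accuracy of a classifier that always predicts the most frequent label in y."""
    counts = Counter(y)                       # per-class sample counts M_0, M_1, ...
    return max(counts.values()) / len(y)

majority_baseline_accuracy([0] * 990 + [1] * 10)   # 0.99, as in Example 10.3
```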
Small dataset problem
In imbalanced datasets, the minority class may contain very few samples. Performance metrics computed on small subsets are subject to high sampling variability, making them unreliable estimates of the true classifier performance.
Example 10.4: Consider the dataset from Example 10.3 with only \(n=10\) positive samples. Assume a classifier with true per-sample accuracy of \(p=0.8\) on class ‘1’. What is the probability that it correctly classifies 6 or fewer of the 10 positive samples?
Solution: Each classification is an independent Bernoulli trial with success probability \(p=0.8\), so the number of correct classifications follows a binomial distribution, \(X\sim \text {Bin}(n=10,\, p=0.8)\). The cumulative probability is
\(\seteqnumber{0}{}{18}\)\begin{equation*} \Pr (X\le 6) = \sum _{k=0}^{6}\binom {10}{k}p^k(1-p)^{10-k} \approx 12.09\% \end{equation*}
There is a \({\approx }12\%\) chance of observing recall \(\le 60\%\), despite the true accuracy being \(80\%\). Conversely, \(\Pr (X=10)=10.74\%\): even perfect recall is plausible by chance for an imperfect classifier.
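The binomial tail probability above can be reproduced with the standard library; a minimal sketch:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

print(round(binom_cdf(6, 10, 0.8), 4))   # prints: 0.1209
print(round(0.8 ** 10, 4))               # P(X = 10): prints 0.1074
```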
This is a problem of confidence in performance evaluation. While confidence interval analysis is beyond the scope of this document, the key takeaway is: metrics computed on small subsets can deviate significantly from the true performance and should be interpreted with caution.
Anomaly detection
Anomaly detection is a sub-field of classification where the class of interest (the anomaly) is extremely rare compared to the normal class. Examples include fraud detection, network intrusion detection, and equipment failure prediction. The extreme class imbalance (e.g., 1:1000) makes accuracy meaningless and amplifies the small dataset problem for the minority class. Precision and recall are the primary evaluation metrics in this setting.
10.5 Metrics for Probabilistic Classifiers
Goal: Quantify the performance of classifiers with probabilistic outputs (probabilistic predictions). These methods extend the methods in Sec. 12.1.
With a probabilistic classifier, the output is the estimated probability of the positive class,
\(\seteqnumber{0}{}{18}\)\begin{equation} \Pr (\hat {y}=1\mid \bx ) = f_\bw (\bx ) \end{equation}
10.5.1 Receiver Operating Characteristic (RoC)
The binary decision is obtained by comparing \(f_\bw (\bx )\) with a threshold \(\mathsf {thr}\):
\(\seteqnumber{0}{}{19}\)\begin{equation} \hat {y} = \begin{cases} 1 & f_\bw (\bx ) \ge \mathsf {thr}\\ 0 & f_\bw (\bx ) < \mathsf {thr} \end {cases} \end{equation}
The default threshold is \(\mathsf {thr}=0.5\). Changing \(\mathsf {thr}\) shifts the trade-off between TP, FP, FN, and TN, and therefore changes all derived metrics (precision, recall, specificity, \(F_1\)).
The RoC curve plots the following two quantities as \(\mathsf {thr}\) varies from \(0\) to \(1\):
• True Positive Rate (TPR): synonym for recall,
\(\seteqnumber{0}{}{20}\)\begin{equation} TPR = \frac {TP}{TP+FN} = \text {recall} \end{equation}
• False Positive Rate (FPR): complement of specificity,
\(\seteqnumber{0}{}{21}\)\begin{equation} FPR = \frac {FP}{FP+TN} = 1 - \text {specificity} \end{equation}
As \(\mathsf {thr}\to 0\), everything is classified as positive (\(TPR\to 1\), \(FPR\to 1\)). As \(\mathsf {thr}\to 1\), everything is classified as negative (\(TPR\to 0\), \(FPR\to 0\)).
RoC is a legacy term from detection theory and communication systems.
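The threshold sweep can be sketched numerically; an illustrative helper (not from the notes) that returns (FPR, TPR) pairs for a list of thresholds:

```python
import numpy as np

def roc_points(y, scores, thresholds):
    """(FPR, TPR) pairs as the decision threshold thr varies."""
    y = np.asarray(y)
    s = np.asarray(scores)
    pts = []
    for thr in thresholds:
        y_hat = (s >= thr).astype(int)      # decision rule: f_w(x) >= thr -> 1
        tp = np.sum((y == 1) & (y_hat == 1))
        fn = np.sum((y == 1) & (y_hat == 0))
        fp = np.sum((y == 0) & (y_hat == 1))
        tn = np.sum((y == 0) & (y_hat == 0))
        pts.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    return pts
```

At \(\mathsf {thr}=0\) every sample is predicted positive, giving the point \((1,1)\); a threshold above the maximal score gives \((0,0)\), matching the limits discussed above.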
10.5.2 Area under curve (AUC)
AUC: The area under the RoC curve.
Range: A random (coin-toss) classifier has \(\mathsf {AUC}=0.5\) and an ideal classifier has \(\mathsf {AUC}=1\). A classifier with \(\mathsf {AUC}<0.5\) can be improved by simply inverting its predictions, so all practical classifiers fall in the range \(0.5 \le \mathsf {AUC} \le 1\).
The relationship between classifier quality and the RoC curve is illustrated in Fig. 10.2a. When the two class distributions (positive and negative) are well separated, the classifier achieves high TPR at low FPR, producing a RoC curve that hugs the top-left corner (\(\mathsf {AUC}\to 1\)). As the distributions overlap, the FN and FP regions grow, and the RoC curve shifts toward the diagonal (\(\mathsf {AUC}\to 0.5\)).
The choice of threshold \(\mathsf {thr}\) controls which part of the overlap region is assigned to each class. Lowering \(\mathsf {thr}\) classifies more samples as positive, increasing TP but also FP. Raising \(\mathsf {thr}\) increases TN but also FN. Fig. 10.2b illustrates this trade-off: moving the threshold left or right redistributes errors between FN and FP while tracing the RoC curve.
Advantages:
• Scale-invariant: AUC measures how well predictions are ranked, rather than their absolute values.
• Threshold-invariant: AUC summarizes performance across all thresholds, without requiring a specific threshold choice.
Limitations:
• Scale invariance may be undesirable when well-calibrated probabilities are needed (e.g., for risk assessment), since AUC is insensitive to the predicted probability values.
• Threshold invariance may be undesirable when the application requires a specific trade-off between false negatives and false positives. For example, in spam detection, minimizing false positives (legitimate email marked as spam) is more important than minimizing false negatives. AUC does not capture such asymmetric cost preferences.
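The scale-invariance can be made concrete: AUC equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (ties counted as 1/2). A brute-force sketch of this rank interpretation:

```python
def auc_rank(y, scores):
    """AUC as P(score of a random positive > score of a random negative),
    counting ties as 1/2; equivalent to the area under the RoC curve."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc_rank([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4])   # 1.0: every positive outranks every negative
```

Because only the ordering of the scores enters, any monotone rescaling of the scores leaves the result unchanged.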
10.5.3 Brier Score
AUC is scale-invariant: it is insensitive to the absolute predicted probability values \(\hat {y}_i\) and only evaluates whether positive samples receive higher probabilities than negative ones. The Brier score complements AUC by evaluating how close the predicted probabilities are to the actual outcomes. It is defined as the MSE between the predicted probability and the actual class label:
\(\seteqnumber{0}{}{22}\)\begin{equation} \text {BS} = \frac {1}{M}\sum _{i=1}^{M}\left (p_i - y_i\right )^2 \end{equation}
where \(p_i\in [0,1]\) is the predicted probability and \(y_i\in \{0,1\}\) is the actual label.
Range: \(0 \le \text {BS} \le 1\). A perfect classifier that assigns probability 1 to the correct class achieves \(\text {BS}=0\). A random classifier that always predicts \(0.5\) achieves \(\text {BS}=0.25\).
Example 10.5: Consider \(M=4\) samples with actual labels and predicted probabilities:
| \(y_i\) | \(p_i\) | \((p_i - y_i)^2\) |
| 1 | 0.9 | 0.01 |
| 0 | 0.2 | 0.04 |
| 1 | 0.7 | 0.09 |
| 0 | 0.1 | 0.01 |
\(\seteqnumber{0}{}{23}\)\begin{equation} \text {BS} = \frac {0.01 + 0.04 + 0.09 + 0.01}{4} = 0.0375 \end{equation}
The low Brier score indicates well-calibrated probability predictions.
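A one-line sketch of the Brier score, checked against Example 10.5:

```python
def brier_score(p, y):
    """Mean squared error between predicted probabilities p and binary labels y."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)

brier_score([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])   # ≈ 0.0375
```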
Comparison with AUC:
• AUC measures discrimination: do positive samples receive higher predicted probabilities \(f_\bw (\bx _i)\) than negative ones? It is scale-invariant and threshold-invariant.
• Brier score measures calibration: are the predicted probabilities \(p_i\) close to the actual labels \(y_i\)? It is sensitive to the absolute probability values.
A classifier can have high AUC (good discrimination) but poor Brier score (poorly calibrated probabilities), or vice versa.
Fig. 10.4 illustrates this distinction. Both classifiers achieve the same AUC (all positive samples are scored higher than all negative samples), but Classifier A predicts probabilities close to the actual labels (\(y_i=0\) or \(y_i=1\)), yielding a low Brier score. Classifier B compresses its predictions toward \(0.5\), degrading calibration and increasing the Brier score, while maintaining the same discrimination ability.
10.6 Multi-class performance
10.6.1 Categorical Encoding
Label encoding assigns each category an integer. For a feature with categories \(\{A, B, C\}\), label encoding maps \(A \mapsto 0\), \(B \mapsto 1\), \(C \mapsto 2\). This is appropriate for ordinal features where the order is meaningful (e.g., low, medium, high).
One-hot encoding creates a binary indicator column for each category. A feature with \(C\) categories is replaced by \(C\) binary features.
Example 10.6: Consider a feature “Color” with three categories: \(\{\text {Red}, \text {Green}, \text {Blue}\}\) and \(M=4\) samples:
| Color | Red | Green | Blue |
| Red | 1 | 0 | 0 |
| Blue | 0 | 0 | 1 |
| Green | 0 | 1 | 0 |
| Red | 1 | 0 | 0 |
The single categorical feature is replaced by three binary features.
• Label encoding is suitable for ordinal features (where the order is meaningful) and for tree-based models (e.g., decision trees, random forests), which split on thresholds and are therefore unaffected by the particular integer assignment.
• One-hot encoding is preferred for nominal features (no natural order) and for linear/logistic models, which interpret numeric values as quantities and would otherwise assume a meaningful ordering among categories.
Label encoding can mislead distance-based models such as \(k\)-NN or SVM with an RBF kernel. For example, encoding \(\{\text {Red}\mapsto 0,\;\text {Green}\mapsto 1,\;\text {Blue}\mapsto 2\}\) implies that the distance between Red and Blue is twice the distance between Red and Green—a relationship that is meaningless for nominal features. One-hot encoding avoids this problem: in the one-hot representation every pair of distinct categories is equidistant (Euclidean distance \(\sqrt {2}\)).
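A minimal sketch of one-hot encoding (column order here is alphabetical; libraries such as pandas `get_dummies` or scikit-learn `OneHotEncoder` provide the same functionality with more options):

```python
def one_hot(values):
    """One-hot encode a list of nominal categories; one binary column per category."""
    cats = sorted(set(values))                  # e.g. ['Blue', 'Green', 'Red']
    return [[1 if v == c else 0 for c in cats] for v in values]

one_hot(["Red", "Blue", "Green", "Red"])
# [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Every pair of distinct rows differs in exactly two positions, so any two categories sit at Euclidean distance \(\sqrt {2}\), as noted above.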
10.6.2 Performance
For multi-class classification, confusion matrices, accuracy and per-class performance (Sec. 10.3.6) are used.
To get a single number, two averaging strategies are applied.
Macro-averaging
Averaging with equal weight to each class, regardless of the number of instances.
Micro-averaging
Equal weight to each instance, regardless of the class label and the number of instances in the class. All TP, FP, etc. are summed across all the classes.
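The two strategies can be sketched for precision (illustrative interface; `per_class` is a list of hypothetical \((TP_c, FP_c)\) count pairs, one per class):

```python
def macro_micro_precision(per_class):
    """per_class: list of (TP_c, FP_c) pairs, one per class.
    Macro: unweighted mean of the per-class precisions (equal weight per class).
    Micro: precision of the pooled counts (equal weight per instance)."""
    precisions = [tp / (tp + fp) for tp, fp in per_class]
    macro = sum(precisions) / len(precisions)
    tp_sum = sum(tp for tp, _ in per_class)
    fp_sum = sum(fp for _, fp in per_class)
    micro = tp_sum / (tp_sum + fp_sum)
    return macro, micro

macro_micro_precision([(90, 10), (5, 5)])   # ≈ (0.7, 0.864)
```

The gap between the two values grows with class imbalance: macro-averaging lets the small class pull the score down, while micro-averaging is dominated by the large class. For single-label multi-class classification, micro-averaged precision, recall, and \(F_1\) all coincide with accuracy.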
Further reading: Accuracy, precision, and recall in multi-class classification