Machine Learning & Signals Learning
12 Classification Performance Metrics
Definitions:
• \(\by \) - vector of target values of the test database, \(\by \in \real ^M\)
• \(\hat {\by }\) - vector of predicted values, \(\hat {\by }\in \real ^M\), output of some classifier \(\hat {\by } = f_\bw (\bX )\).
Typically, in binary classification, \(y_i\in \left \{0,1\right \}\).
12.1 Definitions
Basic terminology:
• ‘1’ – positive group or result
• ‘0’ – negative group or result
• \(Y\) – actual class
• \(\hat {Y}\) – predicted class
• ‘True’ – prediction matches the actual class, \(\hat {Y}=Y\) (correct classification)
• ‘False’ – prediction does not match the actual class, \(\hat {Y}\ne Y\) (incorrect classification)
Positive/negative terminology is rather arbitrary. Typically, the result of interest is termed positive.
12.2 Confusion matrix
The confusion matrix summarizes the results in the form of a 2D non-normalized histogram of \((Y,\hat {Y})\).
| | Predicted positive, \(\hat {Y}=1\) | Predicted negative, \(\hat {Y}=0\) |
| Actual positive, \(Y=1\) | TP, True Positive | FN, False Negative |
| Actual negative, \(Y=0\) | FP, False Positive | TN, True Negative |
The (test) database has \(M\) values, among them:
• TP + FN positive values
• FP + TN negative values
This is the most common way to summarize the performance of a particular classifier on a particular dataset. It can be easily extended for multi-class classifiers.
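As a minimal sketch (not part of the original notes), the four cells can be counted directly from the label vectors. The function name `confusion_counts` and the toy labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels in {0, 1}."""
    tp = fn = fp = tn = 0
    for y, y_hat in zip(y_true, y_pred):
        if y == 1 and y_hat == 1:
            tp += 1          # actual positive, predicted positive
        elif y == 1 and y_hat == 0:
            fn += 1          # actual positive, predicted negative
        elif y == 0 and y_hat == 1:
            fp += 1          # actual negative, predicted positive
        else:
            tn += 1          # actual negative, predicted negative
    return tp, fn, fp, tn


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print("TP, FN, FP, TN =", tp, fn, fp, tn)
print("positives:", tp + fn, "negatives:", fp + tn)
```

The row sums reproduce the class sizes: TP + FN is the number of actual positives and FP + TN the number of actual negatives.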
12.3 Performance Metrics
12.3.1 Accuracy
\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {\text {correct predictions}}{\text {total predictions}}\\[3pt] &= \frac {TP + TN}{TP + TN + FP + FN} \end {aligned} \end{equation}
Example 12.1: Covid antibody (fast non-PCR) test performance. The example includes test statistics of 239 participants [7], as presented below.
| | Predicted Yes | Predicted No |
| Actual Yes | 141 | 67 |
| Actual No | 0 | 31 |
The resulting accuracy is
\(\seteqnumber{0}{}{1}\)\begin{equation} \text {Accuracy} = \frac {141+31}{239} = 0.7196652 \approx 72.0\% \end{equation}
In this example, the high FN = 67 indicates poor performance (many missed positives), while FP = 0 is a good result (no false alarms). Accuracy, however, does not distinguish between these two error types. Additional metrics are used to quantify these aspects.
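The following sketch illustrates why accuracy cannot separate the two error types. The second set of counts, with FN and FP swapped, is a hypothetical classifier introduced only for comparison; it is not part of the cited study.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of correct predictions among all predictions."""
    return (tp + tn) / (tp + fn + fp + tn)


# Counts from Example 12.1.
acc_test = accuracy(tp=141, fn=67, fp=0, tn=31)

# Hypothetical classifier with FN and FP swapped (assumption, for comparison).
acc_swapped = accuracy(tp=141, fn=0, fp=67, tn=31)

print(f"accuracy, Example 12.1:     {acc_test:.3f}")
print(f"accuracy, FN/FP swapped:    {acc_swapped:.3f}")
```

Both classifiers get the same accuracy, although one misses 67 positives and the other raises 67 false alarms.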
12.3.2 Precision
From probability theory,
\(\seteqnumber{0}{}{2}\)\begin{align} \Pr (Y=1|\hat {Y}=1) &= \frac {\Pr (Y=1,\hat {Y}=1)}{\Pr (\hat {Y}=1)} \\ \Pr (\hat {Y}=1) &= \Pr (\hat {Y}=1,Y=0) + \Pr (\hat {Y}=1,Y=1) \end{align}
\(\seteqnumber{0}{}{4}\)\begin{equation} \text {Precision} = \dfrac {TP}{FP+TP} = \dfrac {\text {Correctly predicted 1's}}{\text {All predicted 1's}} \end{equation}
TP = Correctly predicted 1’s
TP + FP = All predicted 1’s
| Term | Radar interpretation |
| Accuracy | Among all objects, the portion correctly identified as planes or non-planes |
| Precision | Among all objects classified as planes, the portion that are actually planes |
| Recall (sensitivity) | Among all actual planes, the portion correctly classified as planes |
| Specificity | Among all actual non-planes, the portion correctly classified as non-planes |
Example 12.1: Back to the previous example,
\(\seteqnumber{0}{}{5}\)\begin{equation} \text {Precision} = \dfrac {141}{0+141} = 1 = 100\% \end{equation}
The high value of the precision is due to the low FP. From the medical point of view, all positive results are actually positive. Whoever was identified by this test as Covid-positive is really positive.
12.3.3 Recall (sensitivity)
From probability theory,
\(\seteqnumber{0}{}{6}\)\begin{align} \Pr (\hat {Y}=1|Y=1) &= \frac {\Pr (Y=1,\hat {Y}=1)}{\Pr (Y=1)} \\ \Pr (Y=1) &= \Pr (Y=1,\hat {Y}=0) + \Pr (Y=1,\hat {Y}=1) \end{align}
\(\seteqnumber{0}{}{8}\)\begin{equation} \text {Recall} = \frac {TP}{TP + FN} = \frac {\text {Correctly predicted 1's}}{\text {Actual 1's}} \end{equation}
Medical meaning: portion of correctly classified ill among all the ill.
Example 12.1: Back to the previous example,
\(\seteqnumber{0}{}{9}\)\begin{equation} \text {Recall} = \dfrac {141}{141 + 67} = 0.678 = 67.8\% \end{equation}
The low value of the recall is due to the high FN. From the medical point of view, among all actually positive (ill) individuals, only 67.8% were correctly identified.
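A minimal sketch of both ratios for the counts of Example 12.1. Returning `None` for a zero denominator is a design choice assumed here (the metric is undefined in that case); the helper name `safe_ratio` is illustrative.

```python
def safe_ratio(num, den):
    """Return num/den, or None when the denominator is zero (undefined metric)."""
    return num / den if den else None


def precision(tp, fp):
    """Estimate of Pr(Y=1 | Y_hat=1): correct among all predicted positives."""
    return safe_ratio(tp, tp + fp)


def recall(tp, fn):
    """Estimate of Pr(Y_hat=1 | Y=1): correct among all actual positives."""
    return safe_ratio(tp, tp + fn)


p = precision(tp=141, fp=0)
r = recall(tp=141, fn=67)
print(f"precision = {p:.3f}, recall = {r:.3f}")

# A classifier that never predicts '1' has no predicted positives at all.
print("precision with TP = FP = 0:", precision(tp=0, fp=0))
```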
12.3.4 Specificity
\(\seteqnumber{0}{}{10}\)\begin{equation} \text {Specificity} = \frac {TN}{FP+TN} = \frac {\text {Correctly predicted 0's}} {\text {Actual 0's}} \end{equation}
Medical meaning: portion of correctly classified healthy among all the healthy.
12.3.5 F1-score
The harmonic mean of precision and recall,
\(\seteqnumber{0}{}{12}\)\begin{equation} F_1 = \frac {2}{\frac {1}{\text {recall}} + \frac {1}{\text {precision}}} = \frac {TP}{TP + \frac {1}{2}\left (FP+FN\right )} \end{equation}
Example 12.2: The logistic regression classifier from Figs. 8.1 and 8.6 has the following confusion matrix:
| | Predicted Yes | Predicted No |
| Actual Yes | 55 | 5 |
| Actual No | 5 | 35 |
The resulting metrics are
\(\seteqnumber{0}{}{14}\)\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {55+35}{100} = 0.9 = 90\%\\ \text {Precision} &= \dfrac {55}{5+55} = \frac {55}{60} \approx 91.7\%\\ \text {Recall} &= \frac {55}{55+5} = \frac {55}{60} \approx 91.7\%\\ \text {Specificity} &= \frac {35}{5+35} = \frac {35}{40} = 87.5\%\\ F_1 &= \frac {55}{55 + \frac {1}{2}\left (5+5\right )} = \frac {55}{60} \approx 91.7\% \end {aligned} \end{equation}
Note, since \(FP = FN\), precision, recall, and \(F_1\) are all equal.
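The two forms of \(F_1\) (harmonic mean and count-based) can be checked numerically on the counts of Example 12.2. This is a sketch; the function names are illustrative.

```python
def f1_harmonic(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 / (1.0 / recall + 1.0 / precision)


def f1_counts(tp, fp, fn):
    """Equivalent count-based form, TP / (TP + (FP + FN) / 2)."""
    return tp / (tp + 0.5 * (fp + fn))


# Counts from Example 12.2.
tp, fn, fp, tn = 55, 5, 5, 35

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1_a = f1_harmonic(precision, recall)
f1_b = f1_counts(tp, fp, fn)

print(f"precision   = {precision:.3f}")
print(f"recall      = {recall:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"F1 (harmonic) = {f1_a:.3f}, F1 (counts) = {f1_b:.3f}")
```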
12.3.6 Summary
| Metric | Formula | Question answered |
| Accuracy | \(\dfrac {TP+TN}{TP+TN+FP+FN}\) | Of all predictions, what fraction is correct? |
| Precision | \(\dfrac {TP}{TP+FP}\) | Of all samples predicted positive, what fraction is actually positive? |
| Recall (sensitivity) | \(\dfrac {TP}{TP+FN}\) | Of all actual positives, what fraction is correctly identified? |
| Specificity | \(\dfrac {TN}{TN+FP}\) | Of all actual negatives, what fraction is correctly identified? |
| \(F_1\) | \(\dfrac {TP}{TP+\frac {1}{2}(FP+FN)}\) | Harmonic mean of precision and recall. |
12.4 Imbalanced Dataset
12.4.1 Definition
Imbalanced dataset: Dataset with significant differences between the numbers of labels of each class. The following examples present a few problems related to imbalanced datasets.
Example 12.3: Let’s take a dataset with 1000 samples:
• 990 samples labeled ‘0’
• 10 samples labeled ‘1’
What are the performance metrics of the classifier that always predicts \(\hat {Y}=0\)?
Solution: The resulting confusion matrix is
| | Predicted Yes | Predicted No |
| Actual Yes | 0 | 10 |
| Actual No | 0 | 990 |
and the resulting quantities are
\(\seteqnumber{0}{}{15}\)\begin{equation} \begin{aligned} \text {Accuracy} &= \frac {990}{1000} = 0.99 = 99\%\\ \text {Precision} &= \dfrac {TP}{FP+TP} = \frac {0}{0 + 0} = \text {Undefined}\\ \text {Recall} &= \frac {TP}{TP + FN} = \frac {0}{0+10} = 0\\ \text {Specificity} &= \frac {TN}{FP+TN} = \frac {990}{0+990}= 1 = 100\%\\ F_1 &= \frac {TP}{TP + \frac {1}{2}\left (FP+FN\right )} = \frac {0}{0 + \frac {1}{2}\left (0+10\right )} = 0 \end {aligned} \end{equation}
• Note, accuracy alone is an insufficient metric!
• Note, while the convention is to label the most important class outcome as ‘1’, the assignment is sometimes interchangeable.
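The example can be reproduced with a short sketch that builds the label vectors explicitly. Mapping the undefined precision to `None` is an assumption of this sketch.

```python
# Dataset of Example 12.3: 10 positives and 990 negatives.
y_true = [1] * 10 + [0] * 990
# Classifier that always predicts the negative class.
y_pred = [0] * len(y_true)

tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# No sample is predicted positive, so precision is undefined (0/0).
precision = tp / (tp + fp) if (tp + fp) > 0 else None
f1 = tp / (tp + 0.5 * (fp + fn))

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
print("F1       :", f1)
```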
12.4.2 Majority classifier
A majority classifier is a classifier that always predicts the most frequent class in the dataset, as in Example 12.3.
Despite achieving high accuracy on imbalanced data, a majority classifier has zero recall and zero \(F_1\) for the minority class. It is commonly used as a baseline: any useful classifier should outperform it on metrics beyond accuracy.
The accuracy of a majority classifier equals the majority-class rate,
\(\seteqnumber{0}{}{16}\)\begin{equation} J_{\mathrm {majority-class}} = \frac {\max (M_0, M_1)}{M} \end{equation}
where \(M_0\) and \(M_1\) are the class counts and \(M\) is the total number of samples.
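A sketch of the baseline computation follows. The function name is illustrative; the second label vector is built from the counts of Example 12.1 (141 + 67 = 208 actual positives, 31 actual negatives).

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label, max(M_0, M_1) / M."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


labels_imbalanced = [0] * 990 + [1] * 10      # Example 12.3
labels_covid = [1] * 208 + [0] * 31           # class sizes of Example 12.1

base_imbalanced = majority_baseline_accuracy(labels_imbalanced)
base_covid = majority_baseline_accuracy(labels_covid)

print(f"baseline, Example 12.3: {base_imbalanced:.3f}")
print(f"baseline, Example 12.1: {base_covid:.3f}")
```

For the test set of Example 12.1, the baseline (about 87%) exceeds the 72% accuracy of the antibody test itself, which again shows that accuracy must be read against the class proportions.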
12.4.3 Small or imbalanced datasets
The same issue arises in two related situations: the test set itself is small, or the test set is large but one class is heavily under-represented so that the per-class subset is small. In either case, a per-class metric is estimated from very few samples and is therefore subject to high sampling variability. The observed value can deviate substantially from the true classifier performance purely by chance.
The following example quantifies this gap for a class with only ten samples.
Example 12.4: Consider the dataset from Example 12.3, in which the minority class contains only \(n=10\) positive samples. Assume a classifier whose true per-sample accuracy on class ‘1’ is \(p=0.8\). What is the probability that it correctly classifies 6 or fewer of these 10 positive samples (i.e. that the observed recall is at most \(60\%\))?
Solution: Each classification is an independent Bernoulli trial with success probability \(p=0.8\), so the number of correct classifications follows a binomial distribution, \(X\sim \text {Bin}(n=10,\, p=0.8)\). The cumulative probability is
\(\seteqnumber{0}{}{17}\)\begin{equation*} \Pr (X\le 6) = \sum _{k=0}^{6}\binom {10}{k}p^k(1-p)^{10-k} \approx 12.09\%. \end{equation*}
Thus there is a \({\approx }12\%\) chance of observing a recall of \(60\%\) or less, even though the true accuracy is \(80\%\). Conversely, \(\Pr (X=10)\approx 10.74\%\): a perfect observed recall is plausible by chance, even for an imperfect classifier.
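Both probabilities can be verified with the standard library only. This is a sketch; `binom_cdf` is an illustrative helper, not a library function.

```python
from math import comb


def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ Bin(n, p), by direct summation of the pmf."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n, p = 10, 0.8
prob_le_6 = binom_cdf(6, n, p)   # observed recall of at most 60%
prob_eq_10 = p**n                # all ten positives classified correctly

print(f"Pr(X <= 6) = {prob_le_6:.4f}")
print(f"Pr(X = 10) = {prob_eq_10:.4f}")
```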
This is fundamentally a problem of confidence in performance evaluation: a single point estimate computed from a small (sub-)set says little about the underlying classifier. A formal treatment via confidence intervals is out of scope for this chapter; the practical takeaway is that such estimates must be interpreted with caution, and small differences between classifiers on small or imbalanced datasets should not be over-interpreted.
Anomaly detection: Anomaly detection is a sub-field of classification where the class of interest (the anomaly) is extremely rare compared to the normal class. Examples include fraud detection, network intrusion detection, and equipment failure prediction. The extreme class imbalance (e.g., 1:1 000) makes accuracy meaningless and amplifies the small-dataset problem for the minority class. Precision and recall are the primary evaluation metrics in this setting.
12.5 Metrics for Probabilistic Classifiers (ROC & AUC)
Goal: Quantify performance of classifiers with probabilistic loss (probabilistic predictions). These methods extend the methods in Sec. 14.1.
With a probabilistic classifier, the output is the estimated probability of the positive class,
\(\seteqnumber{0}{}{17}\)\begin{equation} \Pr (\hat {y}=1\mid \bx ) = f_\bw (\bx ) \end{equation}
12.5.1 Receiver Operating Characteristics (RoC)
The binary decision is obtained by comparing \(f_\bw (\bx )\) with a threshold \(\mathsf {thr}\):
\(\seteqnumber{0}{}{18}\)\begin{equation} \hat {y} = \begin{cases} 1 & f_\bw (\bx ) \ge \mathsf {thr}\\ 0 & f_\bw (\bx ) < \mathsf {thr} \end {cases} \end{equation}
The default threshold is \(\mathsf {thr}=0.5\). Changing \(\mathsf {thr}\) shifts the trade-off between TP, FP, FN, and TN, and therefore changes all derived metrics (precision, recall, specificity, \(F_1\)).
The RoC curve plots the following two quantities as \(\mathsf {thr}\) varies from \(0\) to \(1\):
• True Positive Rate (TPR): synonym for recall,
\(\seteqnumber{0}{}{19}\)\begin{equation} TPR = \frac {TP}{TP+FN} = \text {recall} \end{equation}
• False Positive Rate (FPR): complement of specificity,
\(\seteqnumber{0}{}{20}\)\begin{equation} FPR = \frac {FP}{FP+TN} = 1 - \text {specificity} \end{equation}
As \(\mathsf {thr}\to 0\), everything is classified as positive (\(TPR\to 1\), \(FPR\to 1\)). As \(\mathsf {thr}\to 1\), everything is classified as negative (\(TPR\to 0\), \(FPR\to 0\)).
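The threshold sweep can be sketched as follows. The scores, labels, and threshold grid are invented for illustration; `roc_points` is an illustrative name.

```python
def roc_points(y_true, scores, thresholds):
    """Return a list of (FPR, TPR) pairs, one per threshold."""
    num_pos = sum(y_true)
    num_neg = len(y_true) - num_pos
    points = []
    for thr in thresholds:
        # Decision rule: predict '1' when the score is at least thr.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= thr)
        points.append((fp / num_neg, tp / num_pos))
    return points


# Toy data, invented for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
thresholds = [0.0, 0.2, 0.4, 0.65, 0.8, 1.0]

points = roc_points(y_true, scores, thresholds)
for thr, (fpr, tpr) in zip(thresholds, points):
    print(f"thr = {thr:.2f}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

The first point is \((1,1)\) and the last is \((0,0)\), matching the two limits above; both rates decrease monotonically as the threshold grows.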
RoC is a legacy term from the fields of detection theory and communication system theory.
12.5.2 Area Under Curve (AUC)
AUC: AUC is the area under the RoC curve.
Range: A random (coin-toss) classifier has \(\mathsf {AUC}=0.5\) and an ideal classifier has \(\mathsf {AUC}=1\). All practical classifiers fall in the range \(0.5 \le \mathsf {AUC} \le 1\).
The relationship between classifier quality and the RoC curve is illustrated in Fig. 12.2a. When the two class distributions (positive and negative) are well separated, the classifier achieves high TPR at low FPR, producing a RoC curve that hugs the top-left corner (\(\mathsf {AUC}\to 1\)). As the distributions overlap, the FN and FP regions grow, and the RoC curve shifts toward the diagonal (\(\mathsf {AUC}\to 0.5\)).
The choice of threshold \(\mathsf {thr}\) controls which part of the overlap region is assigned to each class. Lowering \(\mathsf {thr}\) classifies more samples as positive, increasing TP but also FP. Raising \(\mathsf {thr}\) increases TN but also FN. Fig. 12.2b illustrates this trade-off: moving the threshold left or right redistributes errors between FN and FP while tracing the RoC curve.
Advantages:
• Scale-invariant: AUC measures how well predictions are ranked, rather than their absolute values.
• Threshold-invariant: AUC summarizes performance across all thresholds, without requiring a specific threshold choice.
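Scale invariance can be illustrated with the standard pairwise (rank-based) formulation of AUC, which is not derived in these notes: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The data and the cubic transform below are assumptions for illustration.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0      # positive ranked above negative
            elif sp == sn:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

# Strictly increasing transform: changes the values, keeps the ranking.
squashed = [s**3 for s in scores]

auc = auc_pairwise(y_true, scores)
auc_squashed = auc_pairwise(y_true, squashed)
print(f"AUC = {auc:.3f}, AUC after transform = {auc_squashed:.3f}")
```

The two values agree because only the ordering of the scores enters the computation; the transformed scores are no longer meaningful as probabilities, which is the limitation discussed next.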
Limitations:
• Scale invariance may be undesirable when well-calibrated probabilities are needed (e.g., for risk assessment), since AUC is insensitive to the predicted probability values.
• Threshold invariance may be undesirable when the application requires a specific trade-off between false negatives and false positives. For example, in spam detection, minimizing false positives (legitimate email marked as spam) is more important than minimizing false negatives. AUC does not capture such asymmetric cost preferences.
12.6 Multi-class performance
Single-label multi-class classification: Each sample carries exactly one true class label out of \(C\ge 2\) classes, and the classifier outputs exactly one predicted class. Both target and prediction are elements of \(\{1,\ldots ,C\}\).
Multi-label multi-class classification: Each sample may carry any subset of the \(C\) class labels (possibly empty), and the classifier outputs an independent binary decision per class. Both target and prediction are subsets of \(\{1,\ldots ,C\}\), equivalently binary vectors in \(\{0,1\}^C\).
The remainder of this section assumes single-label classification; the multi-label setting is contrasted briefly in Sec. 12.6.3.
For single-label multi-class classification, the confusion matrix becomes a \(C\times C\) table whose entries
\(\seteqnumber{0}{}{21}\)\begin{equation} N_{cc'} \;=\; \sum _{i=1}^{M} \indFunc \!\left [Y_i = c\right ]\,\indFunc \!\left [\hat {Y}_i = c'\right ] \label {eq-classperf-confmat} \end{equation}
count the samples with true class \(c\) predicted as \(c'\). Accuracy generalizes naturally as the diagonal sum normalized by the total number of samples,
\(\seteqnumber{0}{}{22}\)\begin{equation} \text {Accuracy} \;=\; \frac {\sum _c N_{cc}}{\sum _c N_{cc} + \sum _c \sum _{c'\ne c} N_{cc'}} \;=\; \frac {1}{M}\sum _c N_{cc} . \end{equation}
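A minimal sketch of the \(C\times C\) count matrix and the resulting accuracy. Classes are indexed from 0 here (the text uses \(1,\ldots ,C\)); the labels are invented for illustration.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """n[c][c2] = number of samples with true class c predicted as class c2."""
    n = [[0] * num_classes for _ in range(num_classes)]
    for y, y_hat in zip(y_true, y_pred):
        n[y][y_hat] += 1
    return n


def multiclass_accuracy(n):
    """Diagonal sum divided by the total number of samples."""
    total = sum(sum(row) for row in n)
    correct = sum(n[c][c] for c in range(len(n)))
    return correct / total


# Toy labels with classes 0, 1, 2 (illustrative only).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

n = confusion_matrix(y_true, y_pred, num_classes=3)
acc = multiclass_accuracy(n)
for row in n:
    print(row)
print(f"accuracy = {acc:.3f}")
```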
Precision, recall, and \(F_1\), however, are defined with respect to a single positive class, so they must first be computed per class and then aggregated.
Example 12.5: Consider a single-label 3-class problem with \(M=100\) samples and the following confusion matrix (rows = actual class, columns = predicted class):
| | Predicted A | Predicted B | Predicted C | Total |
| Actual A | 72 | 6 | 2 | 80 |
| Actual B | 8 | 6 | 1 | 15 |
| Actual C | 2 | 1 | 2 | 5 |
| Total | 82 | 13 | 5 | 100 |
The classes are heavily imbalanced (A: 80, B: 15, C: 5). Taking class A as the positive class, the diagonal cell (green) is \(TP_A\), the off-diagonal cells in column A (red) are \(FP_A\), and the off-diagonal cells in row A (yellow) are \(FN_A\). Reading the matrix row-by-row and column-by-column gives the per-class counts
| Class | TP | FP | FN |
| A | \(TP_A=72\) | \(FP_A=10\) | \(FN_A=8\) |
| B | \(TP_B=6\) | \(FP_B=7\) | \(FN_B=9\) |
| C | \(TP_C=2\) | \(FP_C=3\) | \(FN_C=3\) |
This running example is used throughout the rest of the section.
12.6.1 Per-class performance
The metrics in Sec. 12 are defined with respect to the positive class (\(Y=1\)). To assess all classes, the metrics can be computed separately for each class by treating it, in turn, as the positive class.
For class \(c\), define:
• \(TP_c\) – samples of class \(c\) correctly predicted as class \(c\)
• \(FP_c\) – samples of other classes incorrectly predicted as class \(c\)
• \(FN_c\) – samples of class \(c\) incorrectly predicted as another class
Equivalently, in terms of the confusion-matrix entries \(N_{cc'}\) of Eq. 12.22,
\(\seteqnumber{0}{}{23}\)\begin{equation} TP_c = N_{cc},\qquad FP_c = \sum _{c'\ne c} N_{c'c},\qquad FN_c = \sum _{c'\ne c} N_{cc'} . \end{equation}
The per-class metrics are then
\(\seteqnumber{0}{}{24}\)\begin{align} \text {Precision}_c &= \frac {TP_c}{TP_c + FP_c},\\[3pt] \text {Recall}_c &= \frac {TP_c}{TP_c + FN_c}, \\[3pt] F_{1,c} &= \frac {2\cdot \text {Precision}_c\cdot \text {Recall}_c}{\text {Precision}_c + \text {Recall}_c} \end{align}
Note the symmetry: for binary classification, \(FP_0 = FN_1\) and \(FP_1 = FN_0\), so the errors that lower precision for one class are exactly the errors that lower recall for the other class. Within a single class, precision equals recall only when \(FP_0 = FP_1\).
Example 12.5: Back to Example 12.5. Plugging the per-class counts into the per-class formulas:
| Metric | Class A (80) | Class B (15) | Class C (5) |
| Precision | \(72/82 \approx 87.8\%\) | \(6/13 \approx 46.2\%\) | \(2/5 = 40.0\%\) |
| Recall | \(72/80 = 90.0\%\) | \(6/15 = 40.0\%\) | \(2/5 = 40.0\%\) |
| \(F_1\) | \({\approx }88.9\%\) | \({\approx }42.9\%\) | \(40.0\%\) |
Table 12.3: Per-class metrics for the running 3-class example.
Class A (the majority) is classified well, while the rare classes B and C are classified poorly on every per-class metric.
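The per-class counts and metrics can be read off the confusion matrix programmatically. In this sketch, the matrix entries are those consistent with the row and column totals and the per-class counts of Example 12.5; the function names are illustrative.

```python
# Rows = actual class (A, B, C), columns = predicted class (A, B, C).
N = [
    [72, 6, 2],
    [8, 6, 1],
    [2, 1, 2],
]
CLASSES = ["A", "B", "C"]


def per_class_counts(n, c):
    """Return (TP_c, FP_c, FN_c) for class index c."""
    tp = n[c][c]
    fp = sum(n[r][c] for r in range(len(n)) if r != c)   # column c, off-diagonal
    fn = sum(n[c][k] for k in range(len(n)) if k != c)   # row c, off-diagonal
    return tp, fp, fn


def per_class_metrics(n, c):
    """Return (precision, recall, F1) for class index c."""
    tp, fp, fn = per_class_counts(n, c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


metrics = [per_class_metrics(N, c) for c in range(len(N))]
for name, (p, r, f1) in zip(CLASSES, metrics):
    print(f"class {name}: precision = {p:.3f}, recall = {r:.3f}, F1 = {f1:.3f}")
```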
Example 12.1: Back to the Covid antibody test example (Example 12.1, with \(TP=141\), \(FP=0\), \(FN=67\), \(TN=31\)).
The asymmetry is significant: \(FP=0\) yields perfect precision for class 1 but also perfect recall for class 0, while the high \(FN=67\) degrades recall for class 1 and precision for class 0. This illustrates the symmetry \(FP_1 = FN_0\) noted above.
| Metric | Class 1 (positive) | Class 0 (negative) |
| Precision | \(\dfrac {141}{141+0} = 100\%\) | \(\dfrac {31}{31+67} \approx 31.6\%\) |
| Recall | \(\dfrac {141}{141+67} \approx 67.8\%\) | \(\dfrac {31}{31+0} = 100\%\) |
| \(F_1\) | \(\dfrac {2\cdot 1.0\cdot 0.678}{1.0+0.678} \approx 80.8\%\) | \(\dfrac {2\cdot 0.316\cdot 1.0}{0.316+1.0} \approx 48.1\%\) |
Table 12.4: Per-class metrics for the Covid antibody test.
To collapse the \(C\) per-class metrics into a single scalar, two averaging strategies are commonly used.
12.6.2 Macro-averaging
The arithmetic mean of the per-class metric across all \(C\) classes:
\(\seteqnumber{0}{}{27}\)\begin{align} \text {Precision}_{\text {macro}} &= \frac {1}{C}\sum _{c=1}^{C}\text {Precision}_c, \\ \text {Recall}_{\text {macro}} &= \frac {1}{C}\sum _{c=1}^{C}\text {Recall}_c, \\ F_{1,\text {macro}} &= \frac {1}{C}\sum _{c=1}^{C} F_{1,c}. \end{align}
Each class contributes equally, regardless of its support (the number of samples in the class). A rare class with poor performance pulls the macro score down as much as a frequent class. Macro-averaging is therefore the natural choice when minority-class performance matters, for example in imbalanced datasets or when fairness across categories is required.
Example 12.5: Back to Example 12.5. Averaging the per-class metrics:
\(\seteqnumber{0}{}{30}\)\begin{equation*} \begin{aligned} \text {Precision}_{\text {macro}} &= \tfrac {1}{3}(0.878+0.462+0.400) \approx 58.0\%,\\ \text {Recall}_{\text {macro}} &= \tfrac {1}{3}(0.900+0.400+0.400) \approx 56.7\%,\\ F_{1,\text {macro}} &= \tfrac {1}{3}(0.889+0.429+0.400) \approx 57.3\%. \end {aligned} \end{equation*}
The macro \(F_1\) (\(57.3\%\)) is far below the overall accuracy of \(80\%\), because the rare classes B and C, on which the classifier performs poorly, contribute equally to the macro mean despite holding only \(20\) of the \(100\) samples.
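A sketch of the macro averages computed without intermediate rounding, starting from the per-class counts of Example 12.5. The helper name `prf` is illustrative.

```python
# Per-class counts (TP, FP, FN) of Example 12.5.
counts = {"A": (72, 10, 8), "B": (6, 7, 9), "C": (2, 3, 3)}


def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


per_class = [prf(*c) for c in counts.values()]
num_classes = len(per_class)

# Macro average: unweighted mean over classes.
macro_precision = sum(m[0] for m in per_class) / num_classes
macro_recall = sum(m[1] for m in per_class) / num_classes
macro_f1 = sum(m[2] for m in per_class) / num_classes

print(f"macro precision = {macro_precision:.4f}")
print(f"macro recall    = {macro_recall:.4f}")
print(f"macro F1        = {macro_f1:.4f}")
```

With exact per-class values the macro \(F_1\) comes out a fraction of a percentage point lower than the figure obtained from the rounded per-class values; the difference is a rounding effect only.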
12.6.3 Micro-averaging
Pool the confusion-matrix counts across all classes first, then compute the metric once on the pooled counts:
\(\seteqnumber{0}{}{30}\)\begin{align} \text {Precision}_{\text {micro}} &= \frac {\sum _c TP_c}{\sum _c (TP_c+FP_c)}, \\ \text {Recall}_{\text {micro}} &= \frac {\sum _c TP_c}{\sum _c (TP_c+FN_c)}. \end{align}
Each instance contributes equally, regardless of the class label, so frequent classes dominate the score.
Example 12.5: Back to Example 12.5. Pooling the counts across the three classes:
\(\seteqnumber{0}{}{32}\)\begin{equation*} \text {Precision}_{\text {micro}} = \frac {TP_A+TP_B+TP_C}{(TP_A+FP_A)+(TP_B+FP_B)+(TP_C+FP_C)} = \frac {72+6+2}{82+13+5} = \frac {80}{100} = 80.0\%. \end{equation*}
This value coincides with the overall accuracy of the classifier, \((72+6+2)/100=80\%\). The next subsubsection shows that this is not a coincidence in the single-label setting.
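The pooling step can be sketched as follows, again from the per-class counts of Example 12.5. Variable names are illustrative.

```python
# Per-class counts (TP, FP, FN) of Example 12.5.
counts = {"A": (72, 10, 8), "B": (6, 7, 9), "C": (2, 3, 3)}

# Pool the counts over all classes first.
tp_sum = sum(tp for tp, _, _ in counts.values())
fp_sum = sum(fp for _, fp, _ in counts.values())
fn_sum = sum(fn for _, _, fn in counts.values())

# Then compute each metric once on the pooled counts.
micro_precision = tp_sum / (tp_sum + fp_sum)
micro_recall = tp_sum / (tp_sum + fn_sum)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print("pooled TP, FP, FN:", tp_sum, fp_sum, fn_sum)
print(f"micro precision = {micro_precision:.3f}")
print(f"micro recall    = {micro_recall:.3f}")
print(f"micro F1        = {micro_f1:.3f}")
```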
Equivalence to accuracy in single-label classification
In the single-label multi-class setting (every sample has exactly one true class, and the model predicts exactly one class), micro-averaged precision, recall, and \(F_1\) all collapse to accuracy:
\(\seteqnumber{0}{}{32}\)\begin{equation} \text {Precision}_{\text {micro}} = \text {Recall}_{\text {micro}} = F_{1,\text {micro}} = \text {Accuracy}. \end{equation}
The argument has three steps.
1. Misclassifications are double-counted across classes: Take any wrong prediction: a sample whose true class is \(a\) but predicted as \(b\) (\(a\ne b\)). From class \(b\)’s perspective it is a false positive (predicted \(b\) but not actually \(b\)), contributing \(+1\) to \(FP_b\). From class \(a\)’s perspective it is a false negative (actually \(a\) but missed), contributing \(+1\) to \(FN_a\). The same error therefore raises exactly one \(FP_c\) and exactly one \(FN_c\). Summing over all classes:
\(\seteqnumber{0}{}{33}\)\begin{equation} \sum _c FP_c = \sum _c FN_c = \text {(total number of misclassified samples)}. \end{equation}
2. \(\sum _c TP_c\) is the total number of correct predictions: Each correct prediction adds \(1\) to \(TP_c\) for its single true class, so
\(\seteqnumber{0}{}{34}\)\begin{equation} \sum _c TP_c = M\cdot \text {Accuracy}, \end{equation}
where \(M\) is the dataset size.
3. Plug into the micro formulas:
\(\seteqnumber{0}{}{35}\)\begin{equation} \text {Precision}_{\text {micro}} = \frac {\sum _c TP_c}{\sum _c TP_c + \sum _c FP_c} = \frac {\text {correct}}{\text {correct}+\text {wrong}} = \frac {\text {correct}}{M} = \text {Accuracy}. \end{equation}
The denominator for recall is identical because \(\sum _c FN_c = \sum _c FP_c\), so \(\text {Recall}_{\text {micro}} = \text {Accuracy}\). Since precision equals recall, their harmonic mean \(F_{1,\text {micro}}\) takes the same value.
The identity \(\sum _c FP_c = \sum _c FN_c\) relies on each error contributing exactly one FP and one FN. In multi-label classification (a sample can belong to several classes simultaneously), a single sample can be wrong on multiple labels independently, so the bookkeeping no longer pairs up and micro-precision, micro-recall, and accuracy become genuinely different quantities.
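A small multi-label sketch shows the pairing breaking down. The label sets below are invented for illustration: the first sample misses a label (one FN, no FP) and the second predicts two extra labels (two FP, no FN).

```python
# Toy multi-label data: each sample has a set of true and predicted labels.
true_sets = [{"A", "B"}, {"C"}, {"A"}]
pred_sets = [{"A"}, {"A", "B", "C"}, {"A"}]

tp_sum = fp_sum = fn_sum = 0
for true_labels, pred_labels in zip(true_sets, pred_sets):
    tp_sum += len(true_labels & pred_labels)   # labels correctly predicted
    fp_sum += len(pred_labels - true_labels)   # predicted but not true
    fn_sum += len(true_labels - pred_labels)   # true but not predicted

micro_precision = tp_sum / (tp_sum + fp_sum)
micro_recall = tp_sum / (tp_sum + fn_sum)

print("pooled TP, FP, FN:", tp_sum, fp_sum, fn_sum)
print(f"micro precision = {micro_precision:.2f}")
print(f"micro recall    = {micro_recall:.2f}")
```

Here the pooled FP and FN counts differ, so micro-precision and micro-recall no longer coincide.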
Practical takeaway. In a single-label multi-class setting, reporting micro-\(F_1\) alongside accuracy is redundant: they are the same number. To expose minority-class behavior, use macro-averaging instead.
With balanced classes, macro and micro averages converge. With imbalance, they can differ substantially: micro tracks accuracy, while macro exposes poor minority-class performance.
Example 12.5: Back to Example 12.5 once more, the proof can be checked numerically. The total number of misclassified samples is \(M-(72+6+2)=20\), and summing the per-class counts:
\(\seteqnumber{0}{}{36}\)\begin{equation*} \sum _c FP_c = 10+7+3 = 20, \qquad \sum _c FN_c = 8+9+3 = 20, \end{equation*}
matching the count of misclassifications, as predicted by step 1 of the argument. Therefore \(\text {Precision}_{\text {micro}}=\text {Recall}_{\text {micro}}=F_{1,\text {micro}}=80\%=\text {Accuracy}\), while the macro \(F_1\) of \(57.3\%\) exposes the poor performance on the rare classes B and C.
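The same check can be run on any single-label confusion matrix. In this sketch the first matrix holds the entries consistent with the totals of Example 12.5, and the second is an arbitrary 4-class matrix invented for illustration.

```python
def error_sums(n):
    """Return (misclassified, sum of FP_c, sum of FN_c) for a square count matrix."""
    size = len(n)
    total = sum(sum(row) for row in n)
    correct = sum(n[c][c] for c in range(size))
    # FP_c: off-diagonal entries of column c; FN_c: off-diagonal entries of row c.
    fp_sum = sum(n[r][c] for c in range(size) for r in range(size) if r != c)
    fn_sum = sum(n[c][k] for c in range(size) for k in range(size) if k != c)
    return total - correct, fp_sum, fn_sum


example = [[72, 6, 2], [8, 6, 1], [2, 1, 2]]
arbitrary = [[5, 1, 0, 2], [3, 4, 1, 0], [0, 0, 7, 1], [2, 2, 2, 9]]

result_example = error_sums(example)
result_arbitrary = error_sums(arbitrary)
print("Example 12.5 :", result_example)
print("arbitrary 4x4:", result_arbitrary)
```

Both sums run over the same off-diagonal cells, once column by column and once row by row, which is the bookkeeping argument of step 1 in code form.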
Further reading
Multi-class classification performance
• Accuracy, precision, and recall in multi-class classification (accessed 27-Apr-2026).