Evaluation metrics for classification models

Evaluation metrics for classification models are used to assess the performance of a model in correctly predicting class labels. These metrics provide insight into various aspects of the model’s behavior, especially when the classes are imbalanced or there are different costs associated with misclassifications. 

Below are some of the key evaluation metrics for classification models. However, a bare list of formulas is not very helpful on its own; what matters is understanding the ideas behind the different approaches. So treat this list as a starting point that keeps everything on one page. In a future post, I’ll look into each of these metrics more deeply.

1a. Accuracy
  • Definition: The proportion of correctly classified instances over the total number of instances.
  • Formula

[math]\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}[/math]

  • Use case: It’s a commonly used metric, especially when the classes are balanced. However, it can be misleading if the dataset has imbalanced classes.
1b. Misclassification Rate (Error Rate)
  • Definition: The proportion of incorrectly classified instances compared to the total number of instances.
  • Formula

[math]\text{Misclassification Rate} = \frac{\text{Number of Incorrect Predictions}}{\text{Total Number of Predictions}}[/math]

  • Use case: The misclassification rate carries exactly the same information as accuracy, just framed in terms of errors instead of successes. In practice, accuracy is usually preferred because it is the more intuitive of the two (i.e., “how many instances were predicted correctly out of the total?”).

The misclassification rate is the complement of accuracy. Specifically:

[math]\text{Misclassification Rate} = 1 - \text{Accuracy}[/math]
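
As a quick sketch of both metrics, here is a minimal example assuming scikit-learn and a made-up toy label vector (the numbers are purely illustrative):

```python
from sklearn.metrics import accuracy_score

# Toy ground-truth labels and model predictions (illustrative only)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # correct predictions / total predictions
error_rate = 1 - accuracy                  # misclassification rate is the complement

print(accuracy)    # 0.75 (6 of the 8 predictions are correct)
print(error_rate)  # 0.25
```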

2. Precision
  • Definition: The proportion of positive predictions that are actually correct. Precision tells you how many of the predicted positive instances were truly positive.
  • Formula

[math]\text{Precision} = \frac{TP}{TP + FP}[/math] 

where:

  • TP = True Positives (correctly predicted positives)
  • FP = False Positives (incorrectly predicted as positive)
  • Use case: Precision is particularly useful in scenarios where false positives are costly (e.g., spam detection, fraud detection).
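
A minimal sketch of the precision formula in code, again assuming scikit-learn and the same kind of toy 0/1 labels:

```python
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Recover the TP/FP counts and apply the formula directly
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp))                   # precision from TP / (TP + FP)
print(precision_score(y_true, y_pred))  # same value via scikit-learn
```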
3. Recall (Sensitivity or True Positive Rate)
  • Definition: The proportion of actual positives that were correctly identified. Recall tells you how many of the true positives were captured by the model.
  • Formula

[math]\text{Recall} = \frac{TP}{TP + FN}[/math] 

where:

  • FN = False Negatives (actual positives that were predicted as negative)
  • Use case: Recall is important when the cost of missing a positive instance is high, such as in medical diagnoses (e.g., detecting cancer).
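
The same kind of sketch for recall, with scikit-learn and toy labels assumed:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # recall from TP / (TP + FN)
print(recall_score(y_true, y_pred))  # same value via scikit-learn
```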
4. F1 Score
  • Definition: The harmonic mean of precision and recall. The F1 score balances the trade-off between precision and recall, providing a single measure of model performance when both metrics are important.
  • Formula

[math]\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}[/math]

  • Use case: F1 score is a good metric when dealing with imbalanced datasets, where both precision and recall need to be balanced.
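
A short sketch showing that the harmonic-mean formula and scikit-learn’s f1_score agree (toy labels, illustrative only):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(2 * p * r / (p + r))       # harmonic mean of precision and recall
print(f1_score(y_true, y_pred))  # same value via scikit-learn
```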
5. Specificity (True Negative Rate)
  • Definition: The proportion of actual negatives that were correctly identified. Specificity tells you how many of the true negatives were captured.
  • Formula

[math]\text{Specificity} = \frac{TN}{TN + FP}[/math] 

where:

  • TN = True Negatives (correctly predicted negatives)
  • Use case: Specificity is useful in medical diagnostics where false positives are costly, and you want to correctly identify the negative class.
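
As far as I know, scikit-learn has no dedicated specificity function, so a minimal sketch computes it from the confusion matrix (or, equivalently, as recall of the negative class):

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))                             # specificity from TN / (TN + FP)
print(recall_score(y_true, y_pred, pos_label=0))  # equivalently, recall of the negative class
```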
6. Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
  • Definition: The ROC curve is a graphical plot that shows the true positive rate (recall) versus the false positive rate (1 – specificity) at various thresholds. The AUC is the area under this curve and gives an aggregate measure of a model’s ability to distinguish between classes.
  • Range: The AUC value ranges from 0 to 1, where 1 represents perfect classification and 0.5 represents random classification.
  • Use case: AUC-ROC is especially useful when dealing with imbalanced datasets, as it focuses on the model’s ability to differentiate between classes, irrespective of the threshold.
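
Unlike the metrics above, AUC-ROC needs predicted probabilities (or scores) rather than hard labels. A minimal sketch, with scikit-learn and made-up scores as assumptions:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7, 0.6, 0.1]  # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve at different thresholds
print(roc_auc_score(y_true, y_score))              # area under that curve, threshold-independent
```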
7. Confusion Matrix
  • Definition: A table used to describe the performance of a classification model by comparing the predicted labels with the true labels. It shows the counts of true positives, false positives, true negatives, and false negatives.
  • Structure

[math]\begin{array}{|c|c|c|} \hline & \text{Predicted Positive} & \text{Predicted Negative} \\ \hline \text{Actual Positive} & TP & FN \\ \hline \text{Actual Negative} & FP & TN \\ \hline \end{array}[/math]

  • Use case: It provides a complete picture of a model’s performance, from which other metrics like precision, recall, and F1 score can be derived.
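
A small sketch with scikit-learn’s confusion_matrix on toy labels. Note that scikit-learn orders rows and columns by label value (0 first), so its layout is flipped relative to the table above:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
tn, fp, fn, tp = cm.ravel()            # labels ordered 0, 1 -> [[TN, FP], [FN, TP]]
print(cm)
print(tp, fp, tn, fn)
```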
8. Log Loss (Logarithmic Loss)
  • Definition: Log loss measures the uncertainty of the model’s predictions based on the probabilities it assigns to the possible classes. It penalizes wrong predictions more when the model is confident but wrong.
  • Formula

[math]\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)[/math]

where:

  • y_i is the true label (0 or 1)
  • p_i is the predicted probability of the positive class
  • N is the total number of samples
  • Use case: Log loss is useful when the model outputs probabilities, and you want to measure how well those probabilities match the actual outcomes.
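
A minimal sketch applying the formula directly and comparing it with scikit-learn’s log_loss, assuming toy labels and made-up predicted probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]
p_pos  = [0.9, 0.2, 0.6, 0.8, 0.4]  # predicted probability of the positive class

# Direct application of the formula
y, p = np.array(y_true), np.array(p_pos)
print(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(log_loss(y_true, p_pos))  # same value via scikit-learn
```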
9. Matthews Correlation Coefficient (MCC)
  • Definition: The Matthews correlation coefficient is a measure of the quality of binary classifications. It takes into account true and false positives and negatives, making it a more balanced metric than accuracy, especially for imbalanced datasets.
  • Formula

[math]MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}[/math]

  • Use case: MCC is particularly useful for imbalanced classes, as it gives a more balanced view of classification performance.
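
A short sketch comparing the formula with scikit-learn’s matthews_corrcoef on toy labels (illustrative only):

```python
from sklearn.metrics import matthews_corrcoef, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual = (tp * tn - fp * fn) / ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
print(manual)                             # MCC from the formula; ranges from -1 to +1
print(matthews_corrcoef(y_true, y_pred))  # same value via scikit-learn
```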
10. Cohen’s Kappa
  • Definition: Cohen’s Kappa measures the agreement between two raters (or a model and the true labels) while accounting for the possibility of chance agreement.
  • Formula

[math]\kappa = \frac{P_o - P_e}{1 - P_e}[/math]

where:

  • P_o is the observed agreement
  • P_e is the expected agreement by chance
  • Use case: Kappa is useful when evaluating classification models, especially in cases where there is a need to account for random chance.
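
A minimal sketch, assuming binary toy labels, that builds P_o and P_e explicitly and checks the result against scikit-learn’s cohen_kappa_score:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

p_o = np.mean(y_true == y_pred)                      # observed agreement (same as accuracy)
p_e = (np.mean(y_true) * np.mean(y_pred)             # chance agreement on the positive class
       + np.mean(1 - y_true) * np.mean(1 - y_pred))  # plus chance agreement on the negative class
print((p_o - p_e) / (1 - p_e))                       # kappa from the formula
print(cohen_kappa_score(y_true, y_pred))             # same value via scikit-learn
```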
11. Balanced Accuracy
  • Definition: Balanced accuracy is the average of recall obtained on each class. It is particularly useful for imbalanced datasets, where traditional accuracy might be misleading.
  • Formula

[math]\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)[/math]

  • Use case: Balanced accuracy is a good alternative to accuracy when the classes are imbalanced.
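
A final sketch, again assuming scikit-learn and toy labels, showing that balanced accuracy is just the mean of the two per-class recalls:

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_pos = tp / (tp + fn)  # recall on the positive class
recall_neg = tn / (tn + fp)  # recall on the negative class (specificity)
print((recall_pos + recall_neg) / 2)            # balanced accuracy from the formula
print(balanced_accuracy_score(y_true, y_pred))  # same value via scikit-learn
```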
Summary of Metrics:
Metric | Formula/Key Focus | Use Case/Importance
Accuracy | Proportion of correct predictions | Balanced classes, general use
Precision | Proportion of true positives among predicted positives | Cost of false positives, e.g., fraud detection
Recall | Proportion of true positives among actual positives | Cost of false negatives, e.g., medical diagnoses
F1 Score | Harmonic mean of precision and recall | Balancing precision and recall
Specificity | Proportion of true negatives among actual negatives | Cost of false positives, e.g., medical tests
AUC-ROC | Area under the ROC curve | Class separation ability, especially for imbalanced data
Confusion Matrix | Counts of TP, FP, TN, FN | Visualize performance and calculate other metrics
Log Loss | Penalizes incorrect predictions made with confidence | Probabilistic predictions
MCC | Balanced measure for binary classification | Imbalanced classes
Cohen’s Kappa | Agreement measure considering chance | Evaluate agreement beyond accuracy
Balanced Accuracy | Average of recall for each class | Imbalanced classes

Each of these metrics is important depending on the specific problem we’re solving, and it’s often useful to consider multiple metrics to get a complete picture of model performance.