Evaluation metrics for classification models

Evaluation metrics for classification models are used to assess the performance of a model in correctly predicting class labels. These metrics provide insight into various aspects of the model’s behavior, especially when the classes are imbalanced or there are different costs associated with misclassifications. 

Below are some of the key evaluation metrics for classification models. However, a bare list of formulas is not very helpful on its own; what matters is understanding the ideas behind the different approaches. So treat this list as a starting point that keeps everything on one page. In a future post, I’ll look into each of these metrics more deeply.

1a. Accuracy
  • Definition: The proportion of correctly classified instances over the total number of instances.
  • Formula

[math]\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}[/math]

  • Use case: It’s a commonly used metric, especially when the classes are balanced. However, it can be misleading if the dataset has imbalanced classes.
1b. Misclassification Rate (Error Rate)
  • Definition: The proportion of incorrectly classified instances compared to the total number of instances.
  • Formula

[math]\text{Misclassification Rate} = \frac{\text{Number of Incorrect Predictions}}{\text{Total Number of Predictions}}[/math]

  • Use case: The misclassification rate carries exactly the same information as accuracy, just framed in terms of errors instead of successes. In practice, accuracy is usually preferred because it is the more intuitive of the two (i.e., “how many instances were predicted correctly out of the total?”).

The misclassification rate is the complement of accuracy. Specifically:

[math]\text{Misclassification Rate} = 1 - \text{Accuracy}[/math]
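
As a quick sketch of both metrics, here is a minimal example assuming scikit-learn and a made-up toy label vector (the numbers are purely illustrative):

```python
from sklearn.metrics import accuracy_score

# Toy ground-truth labels and model predictions (illustrative only)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # correct predictions / total predictions
error_rate = 1 - accuracy                  # misclassification rate is the complement

print(accuracy)    # 0.75 (6 of the 8 predictions are correct)
print(error_rate)  # 0.25
```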

2. Precision
  • Definition: The proportion of positive predictions that are actually correct. Precision tells you how many of the predicted positive instances were truly positive.
  • Formula

[math]\text{Precision} = \frac{TP}{TP + FP}[/math] 

where:

  • TP = True Positives (correctly predicted positives)
  • FP = False Positives (incorrectly predicted as positive)
  • Use case: Precision is particularly useful in scenarios where false positives are costly (e.g., spam detection, fraud detection).
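
A minimal sketch of the precision formula in code, again assuming scikit-learn and the same kind of toy 0/1 labels:

```python
from sklearn.metrics import confusion_matrix, precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Recover the TP/FP counts and apply the formula directly
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp))                   # precision from TP / (TP + FP)
print(precision_score(y_true, y_pred))  # same value via scikit-learn
```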
3. Recall (Sensitivity or True Positive Rate)
  • Definition: The proportion of actual positives that were correctly identified. Recall tells you how many of the true positives were captured by the model.
  • Formula

[math]\text{Recall} = \frac{TP}{TP + FN}[/math] 

where:

  • FN = False Negatives (actual positives that were predicted as negative)
  • Use case: Recall is important when the cost of missing a positive instance is high, such as in medical diagnoses (e.g., detecting cancer).
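
The same kind of sketch for recall, with scikit-learn and toy labels assumed:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # recall from TP / (TP + FN)
print(recall_score(y_true, y_pred))  # same value via scikit-learn
```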
4. F1 Score
  • Definition: The harmonic mean of precision and recall. The F1 score balances the trade-off between precision and recall, providing a single measure of model performance when both metrics are important.
  • Formula

[math]\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}[/math]

  • Use case: F1 score is a good metric when dealing with imbalanced datasets, where both precision and recall need to be balanced.
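
A short sketch showing that the harmonic-mean formula and scikit-learn’s f1_score agree (toy labels, illustrative only):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(2 * p * r / (p + r))       # harmonic mean of precision and recall
print(f1_score(y_true, y_pred))  # same value via scikit-learn
```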
5. Specificity (True Negative Rate)
  • Definition: The proportion of actual negatives that were correctly identified. Specificity tells you how many of the true negatives were captured.
  • Formula

[math]\text{Specificity} = \frac{TN}{TN + FP}[/math] 

where:

  • TN = True Negatives (correctly predicted negatives)
  • Use case: Specificity is useful in medical diagnostics where false positives are costly, and you want to correctly identify the negative class.
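
As far as I know, scikit-learn has no dedicated specificity function, so a minimal sketch computes it from the confusion matrix (or, equivalently, as recall of the negative class):

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))                             # specificity from TN / (TN + FP)
print(recall_score(y_true, y_pred, pos_label=0))  # equivalently, recall of the negative class
```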
6. Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
  • Definition: The ROC curve is a graphical plot that shows the true positive rate (recall) versus the false positive rate (1 – specificity) at various thresholds. The AUC is the area under this curve and gives an aggregate measure of a model’s ability to distinguish between classes.
  • Range: The AUC value ranges from 0 to 1, where 1 represents perfect classification and 0.5 represents random classification.
  • Use case: AUC-ROC is especially useful when dealing with imbalanced datasets, as it focuses on the model’s ability to differentiate between classes, irrespective of the threshold.
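
Unlike the metrics above, AUC-ROC needs predicted probabilities (or scores) rather than hard labels. A minimal sketch, with scikit-learn and made-up scores as assumptions:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7, 0.6, 0.1]  # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve at different thresholds
print(roc_auc_score(y_true, y_score))              # area under that curve, threshold-independent
```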
7. Confusion Matrix
  • Definition: A table used to describe the performance of a classification model by comparing the predicted labels with the true labels. It shows the counts of true positives, false positives, true negatives, and false negatives.
  • Structure

[math]\begin{array}{|c|c|c|} \hline & \text{Predicted Positive} & \text{Predicted Negative} \\ \hline \text{Actual Positive} & TP & FN \\ \hline \text{Actual Negative} & FP & TN \\ \hline \end{array}[/math]

  • Use case: It provides a complete picture of a model’s performance, from which other metrics like precision, recall, and F1 score can be derived.
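
A small sketch with scikit-learn’s confusion_matrix on toy labels. Note that scikit-learn orders rows and columns by label value (0 first), so its layout is flipped relative to the table above:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
tn, fp, fn, tp = cm.ravel()            # labels ordered 0, 1 -> [[TN, FP], [FN, TP]]
print(cm)
print(tp, fp, tn, fn)
```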
8. Log Loss (Logarithmic Loss)
  • Definition: Log loss measures the uncertainty of the model’s predictions based on the probabilities it assigns to the possible classes. It penalizes wrong predictions more when the model is confident but wrong.
  • Formula

[math]\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)[/math]

where:

  • y_i is the true label (0 or 1)
  • p_i is the predicted probability of the positive class
  • N is the total number of samples
  • Use case: Log loss is useful when the model outputs probabilities, and you want to measure how well those probabilities match the actual outcomes.
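
A minimal sketch applying the formula directly and comparing it with scikit-learn’s log_loss, assuming toy labels and made-up predicted probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]
p_pos  = [0.9, 0.2, 0.6, 0.8, 0.4]  # predicted probability of the positive class

# Direct application of the formula
y, p = np.array(y_true), np.array(p_pos)
print(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(log_loss(y_true, p_pos))  # same value via scikit-learn
```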
9. Matthews Correlation Coefficient (MCC)
  • Definition: The Matthews correlation coefficient is a measure of the quality of binary classifications. It takes into account true and false positives and negatives, making it a more balanced metric than accuracy, especially for imbalanced datasets.
  • Formula

[math]MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}[/math]

  • Use case: MCC is particularly useful for imbalanced classes, as it gives a more balanced view of classification performance.
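
A short sketch comparing the formula with scikit-learn’s matthews_corrcoef on toy labels (illustrative only):

```python
from sklearn.metrics import matthews_corrcoef, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
manual = (tp * tn - fp * fn) / ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
print(manual)                             # MCC from the formula; ranges from -1 to +1
print(matthews_corrcoef(y_true, y_pred))  # same value via scikit-learn
```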
10. Cohen’s Kappa
  • Definition: Cohen’s Kappa measures the agreement between two raters (or a model and the true labels) while accounting for the possibility of chance agreement.
  • Formula

[math]\kappa = \frac{P_o - P_e}{1 - P_e}[/math]

where:

  • P_o is the observed agreement
  • P_e is the expected agreement by chance
  • Use case: Kappa is useful when evaluating classification models, especially in cases where there is a need to account for random chance.
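
A minimal sketch, assuming binary toy labels, that builds P_o and P_e explicitly and checks the result against scikit-learn’s cohen_kappa_score:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

p_o = np.mean(y_true == y_pred)                      # observed agreement (same as accuracy)
p_e = (np.mean(y_true) * np.mean(y_pred)             # chance agreement on the positive class
       + np.mean(1 - y_true) * np.mean(1 - y_pred))  # plus chance agreement on the negative class
print((p_o - p_e) / (1 - p_e))                       # kappa from the formula
print(cohen_kappa_score(y_true, y_pred))             # same value via scikit-learn
```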
11. Balanced Accuracy
  • Definition: Balanced accuracy is the average of recall obtained on each class. It is particularly useful for imbalanced datasets, where traditional accuracy might be misleading.
  • Formula

[math]\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)[/math]

  • Use case: Balanced accuracy is a good alternative to accuracy when the classes are imbalanced.
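
A final sketch, again assuming scikit-learn and toy labels, showing that balanced accuracy is just the mean of the two per-class recalls:

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_pos = tp / (tp + fn)  # recall on the positive class
recall_neg = tn / (tn + fp)  # recall on the negative class (specificity)
print((recall_pos + recall_neg) / 2)            # balanced accuracy from the formula
print(balanced_accuracy_score(y_true, y_pred))  # same value via scikit-learn
```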
Summary of Metrics:
Metric | Formula/Key Focus | Use Case/Importance
Accuracy | Proportion of correct predictions | Balanced classes, general use
Precision | Proportion of true positives among predicted positives | Cost of false positives, e.g., fraud detection
Recall | Proportion of true positives among actual positives | Cost of false negatives, e.g., medical diagnoses
F1 Score | Harmonic mean of precision and recall | Balancing precision and recall
Specificity | Proportion of true negatives among actual negatives | Cost of false positives, e.g., medical tests
AUC-ROC | Area under the ROC curve | Class separation ability, especially for imbalanced data
Confusion Matrix | Counts of TP, FP, TN, FN | Visualize performance and calculate other metrics
Log Loss | Penalizes incorrect predictions made with confidence | Probabilistic predictions
MCC | Balanced measure for binary classification | Imbalanced classes
Cohen’s Kappa | Agreement measure considering chance | Evaluate agreement beyond accuracy
Balanced Accuracy | Average of recall for each class | Imbalanced classes

Each of these metrics is important depending on the specific problem we’re solving, and it’s often useful to consider multiple metrics to get a complete picture of model performance.