Encoding Methods

When a dataset includes categorical variables, we need to preprocess them before running certain calculations or visualizing relationships, for instance in a correlation matrix.

In most cases, we encode the categorical variables into numerical values. I have written about some of these techniques before, but they deserve a deeper look, and there are many more of them. Here I’ll examine only some of the most commonly used ones:

  1. Label Encoding
    Converts categories into integer codes.
  2. Ordinal Encoding
    Converts categorical data with an inherent order into integer labels.
  3. One-Hot Encoding
    Converts categorical variables into binary columns, each representing one category.
  4. Target Encoding
    Encodes categories based on the mean of the target variable.
  5. Frequency Encoding
    Encodes categories based on their frequency in the dataset.
  6. Binary Encoding
    Converts categories into binary format.

It can be hard to grasp them all at once, since they are closely related. The main ideas are similar; the differences come from the needs of the analysis and the shape and size of the dataset.

That’s why I found it more helpful to start with two of them and build up from there. Those two are Label Encoding and Ordinal Encoding; they look quite similar and are easy to mix up with each other.

1. Label Encoding

Assigns arbitrary integer values to categories without any consideration of order.

Example:
[“apple”, “banana”, “cherry”] → [0, 1, 2]

Use Case:

  • When the categories are nominal (no inherent order).
  • Works well with algorithms that treat the values as labels rather than ordinal (e.g., tree-based algorithms like decision trees or random forests).
  • Pitfall: Algorithms like linear regression might interpret the numbers as ordinal, leading to misleading results.
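
As a minimal sketch (assuming scikit-learn is available), label encoding the fruit example above could look like this:

# Label encoding with scikit-learn's LabelEncoder
from sklearn.preprocessing import LabelEncoder

fruits = ["apple", "banana", "cherry", "banana"]

le = LabelEncoder()
codes = le.fit_transform(fruits)   # categories get integers in alphabetical order
print(codes)                       # [0 1 2 1]
print(list(le.classes_))           # ['apple', 'banana', 'cherry']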

2. Ordinal Encoding

Assigns integers based on a predefined order that reflects the natural ranking of the categories.

Example:
[“low”, “medium”, “high”] → [0, 1, 2]

Use Case:

  • When the categories are ordinal and there is an inherent hierarchy or ranking among them.
  • The encoding reflects this order, making it meaningful for algorithms that interpret numerical relationships.
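
A minimal sketch (again assuming scikit-learn), where the category order is passed explicitly so the integers reflect the ranking:

# Ordinal encoding with an explicit category order
from sklearn.preprocessing import OrdinalEncoder

sizes = [["low"], ["high"], ["medium"], ["low"]]

encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(encoder.fit_transform(sizes))   # [[0.] [2.] [1.] [0.]]
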
What is the difference: Label vs. Ordinal

When we check both outputs, it looks like there is no difference between these two methods. The key difference is that ordinal encoding intentionally reflects a meaningful order, while label encoding creates an unintended one. 

Key Differences:

Aspect                  Label Encoding                  Ordinal Encoding
Assignment of Numbers   Arbitrary                       Reflects category order
Use for Nominal Data    Yes                             No
Use for Ordinal Data    Rarely                          Yes
Semantic Meaning        No inherent meaning in order    Encodes meaningful ranking

Key Problem with Label Encoding for Nominal Data

Label encoding introduces numerical values that can imply a false order to nominal categories, which can confuse algorithms that assume numerical relationships. This is especially problematic for:

  • Linear models (e.g., regression or logistic regression): These treat the encoded values as numerical, implying order and distance.
  • Clustering algorithms (e.g., k-means): These use Euclidean distance, which can lead to spurious results if the numerical encoding is arbitrary.

When Label Encoding Works

Despite its pitfalls, label encoding is not inherently flawed; it works well in specific contexts:

  1. Tree-based algorithms: Algorithms like decision trees, random forests, and gradient boosting (e.g., XGBoost, LightGBM) do not assume numerical relationships between features. They split data based on categories directly, so label encoding is safe for nominal data in these cases.
  2. Quick preprocessing: Label encoding can be a simple and quick way to preprocess data when exploratory analysis or lightweight models are the goal.

Alternative Solutions for Nominal Data

To avoid introducing false order, other encoding methods (like the ones below) are often better suited for nominal data.

3. One-Hot Encoding:

Each category is represented as a binary column.

Example:
[“apple”, “banana”, “cherry”] → 

apple   banana   cherry
1       0        0
0       1        0
0       0        1

  • Advantages: Prevents false order by treating all categories as equally distant.
  • Disadvantages: Increases dimensionality (one new column per category).
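
A quick sketch with pandas (pd.get_dummies is one common way to do this; scikit-learn's OneHotEncoder is another):

# One binary column per category
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "banana", "cherry"]})
print(pd.get_dummies(df, columns=["fruit"], dtype=int))
#    fruit_apple  fruit_banana  fruit_cherry
# 0            1             0             0
# 1            0             1             0
# 2            0             0             1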

4. Target Encoding (in supervised learning):

Replaces each category with a statistic of the target variable for that category (typically the mean, though other aggregates can be used).

Example:
Category: [“A”, “B”, “A”, “C”, “B”, “C”, “A”]

Target (numeric): [100, 200, 150, 300, 250, 350, 175]

Encoded feature becomes: [141.67, 225.00, 141.67, 325.00, 225.00, 325.00, 141.67]

Advantages:

  • Reduces the dimensionality compared to one-hot encoding.
  • Captures the relationship between the categorical variable and the target variable.
  • Useful for high-cardinality categorical features (features with many unique categories).
  • Can be used with algorithms that don’t natively handle categorical data, like linear regression or neural networks.

Disadvantages:

  • Can lead to data leakage if not done carefully. Data leakage occurs when information from the test set influences the training process.
  • Sensitive to outliers in the target variable.
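
Using the example above, a naive sketch with pandas (this version computes the means on the full data, so it ignores the leakage issue discussed next):

# Replace each category with the mean target value of that category
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "B", "C", "A"],
    "target":   [100, 200, 150, 300, 250, 350, 175],
})

means = df.groupby("category")["target"].mean()
df["category_encoded"] = df["category"].map(means)
print(df["category_encoded"].round(2).tolist())
# [141.67, 225.0, 141.67, 325.0, 225.0, 325.0, 141.67]
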
Avoiding Data Leakage

To prevent data leakage, the target mean for each category must be computed only from the training data available at the time. Typically, this is done using techniques like:

  • K-Fold Mean Encoding: Compute the mean within each fold during cross-validation.
  • Leave-One-Out Encoding: Exclude the current row’s target value when computing the mean for that category.
  • Smoothing: Combine the category mean with the overall target mean to reduce the effect of small sample sizes or outliers.
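
A rough sketch of K-fold mean encoding (assuming pandas and scikit-learn; the fold setup and the fallback for unseen categories are choices made here for illustration):

# K-fold mean encoding: each row is encoded with means computed on the other folds
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "B", "C", "A"],
    "target":   [100, 200, 150, 300, 250, 350, 175],
})

df["category_kfold"] = float("nan")
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    # Means come only from the "fit" rows, then get applied to the held-out rows
    fold_means = df.iloc[fit_idx].groupby("category")["target"].mean()
    df.loc[df.index[enc_idx], "category_kfold"] = df.iloc[enc_idx]["category"].map(fold_means)

# Categories unseen in a fit fold end up as NaN; fall back to the global mean
df["category_kfold"] = df["category_kfold"].fillna(df["target"].mean())
print(df)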

5. Frequency Encoding:

Replaces each category with its frequency in the dataset.

Example:
[“apple”, “apple”, “banana”, “cherry”] →

[0.5, 0.5, 0.25, 0.25]
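
A one-liner-style sketch with pandas:

# Map each category to its relative frequency in the column
import pandas as pd

s = pd.Series(["apple", "apple", "banana", "cherry"])
freq = s.value_counts(normalize=True)
print(s.map(freq).tolist())   # [0.5, 0.5, 0.25, 0.25]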

Advantages:

  • Encodes categories into a single column, reducing dimensionality compared to one-hot encoding.
  • Useful for datasets with high cardinality (many unique categories).
  • The frequency of occurrence often carries meaningful information, especially in imbalanced datasets, which makes it a good fit for tree-based models.
  • Easy to implement and computationally efficient, requiring just a count and mapping operation.

Disadvantages:

  • Implies that categories with higher frequencies are more significant, which may not always align with the underlying data distribution or context.
  • Reduces interpretability since the original categories are replaced by numerical values.
  • Models might overfit to the specific frequency values if they are highly correlated with the target, especially in small datasets.
  • Linear models may interpret frequency values as numerical magnitudes rather than categorical indicators, leading to misleading relationships.

When to use:

  • Best for high-cardinality categorical variables in tree-based models or when frequency itself is meaningful (e.g., popularity, ranking).
  • Avoid for linear models or datasets where frequency doesn’t represent an inherent relationship with the target.

6. Binary Encoding

Combines the ideas of label encoding and one-hot encoding but is more memory-efficient than one-hot.

Each category is first assigned a unique integer (as in label encoding), and then the integers are converted to binary. Each binary digit becomes a separate column.

Example:

Categories: [“A”, “B”, “C”, “D”]

Label Encoding: [0, 1, 2, 3]

Binary Representation:

0 → 0 0
1 → 0 1
2 → 1 0
3 → 1 1

The encoded matrix:

Col1  Col2
0     0
0     1
1     0
1     1

  • Advantages: Reduces dimensionality compared to one-hot encoding.
  • Disadvantages: The binary digits group unrelated categories together, so some algorithms may read spurious structure into nominal variables.
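
The category_encoders package offers a BinaryEncoder for this, but the mechanics can be sketched with plain pandas (the column names here are made up for illustration):

# Manual binary encoding: label-encode, then spread the bits over columns
import pandas as pd

cats = pd.Series(["A", "B", "C", "D"])
codes = cats.astype("category").cat.codes           # A=0, B=1, C=2, D=3

n_bits = max(int(codes.max()).bit_length(), 1)       # number of binary columns needed
binary_cols = {
    f"col{i + 1}": (codes // 2 ** (n_bits - 1 - i)) % 2
    for i in range(n_bits)
}
print(pd.DataFrame(binary_cols))
#    col1  col2
# 0     0     0
# 1     0     1
# 2     1     0
# 3     1     1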

Comparing the Encoding Methods

As seen above, the most commonly used encoding techniques primarily differ in terms of dimensionality and whether they account for the importance of order in the categorical data. 

Key Dimensions for Comparison:
  1. Dimensionality: How many new features or columns are created by the encoding process.
  2. Order Sensitivity: Whether the encoding method preserves or imposes an order among categories.
  3. Information Preservation: How well the method retains information about the categories’ relationships.
  4. Interpretability: Whether the encoded values are easy to interpret in terms of the original data.

Encoding Method      Dimensionality                    Preserves / Imposes Order                              Use Case
Label Encoding       Low (one column)                  Yes (imposes order unintentionally for nominal data)   Works with tree-based models; avoid for nominal data with order-sensitive algorithms.
Ordinal Encoding     Low (one column)                  Yes (imposes order)                                    For ordinal data where order matters (e.g., “low”, “medium”, “high”).
One-Hot Encoding     High (one column per category)    No                                                     For nominal data; avoids false order and works with most models.
Target Encoding      Low (one column)                  No (depends on target distribution)                    Captures relationship with target; good for high-cardinality features.
Frequency Encoding   Low (one column)                  No                                                     Simplifies high-cardinality features; retains category prevalence.
Binary Encoding      Moderate (~log₂ of categories)    No                                                     Reduces dimensionality while preserving category information.

Summary tables like the one above are helpful, but sometimes I need a visual to get a better sense. Let’s imagine a graph where the x-axis is order sensitivity and the y-axis is dimensionality. In such a graph, the placements of the encoding techniques mentioned above would be:

Quadrant 1 (Top-Left: High Dimensionality, No Order Sensitivity):

  • One-Hot Encoding: High dimensionality because it creates one column per category. Completely ignores order.

Quadrant 2 (Top-Right: High Dimensionality, High Order Sensitivity):

  • Rare or Inapplicable: High dimensionality is uncommon for order-sensitive methods, as ordinal data is usually compactly encoded.

Quadrant 3 (Bottom-Left: Low Dimensionality, No Order Sensitivity):

  • Frequency Encoding: Low dimensionality with no order sensitivity; captures prevalence of categories.
  • Binary Encoding: Reduces dimensionality while maintaining nominal information.
  • Target Encoding: Low dimensionality; order-agnostic, though the encoded values depend on the target distribution.

Quadrant 4 (Bottom-Right: Low Dimensionality, High Order Sensitivity):

  • Ordinal Encoding: Assigns numbers based on a known order.
  • Label Encoding: Implicitly imposes order (although unintentionally for nominal data).
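
If I actually wanted to draw it, a rough matplotlib sketch would do (the coordinates below are just my rough placements, not measured values):

# Rough quadrant plot: x = order sensitivity, y = dimensionality
import matplotlib.pyplot as plt

placements = {
    "One-Hot":   (0.1, 0.9),
    "Binary":    (0.1, 0.4),
    "Frequency": (0.2, 0.2),
    "Target":    (0.25, 0.3),
    "Label":     (0.8, 0.2),
    "Ordinal":   (0.9, 0.2),
}

fig, ax = plt.subplots()
for name, (x, y) in placements.items():
    ax.scatter(x, y)
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))

ax.axhline(0.5, linestyle="--")   # split between low/high dimensionality
ax.axvline(0.5, linestyle="--")   # split between low/high order sensitivity
ax.set_xlabel("Order sensitivity")
ax.set_ylabel("Dimensionality")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.show()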

Advanced Encoding Methods

There are other encoding methods like:

  • Hash Encoding (Feature Hashing)
  • BaseN Encoding
  • Leave-One-Out Encoding
  • Count Encoding
  • Weighted Target Encoding
  • Probability Ratio Encoding
  • Embedding Encoding

And there are many more advanced ones:

  • Polynomial Coding (Helmert Encoding)
  • Contrast Encoding (Deviation Coding)
  • Weight of Evidence (WoE) Encoding
  • CatBoost Encoding
  • James-Stein Encoding
  • Gaussian Encoding
  • Entity Embeddings
  • Ordinal Encoding with Thresholding
  • Cluster-Based Encoding
  • Ordinal Logistic Regression Encoding
  • Principal Component Encoding (PCA Encoding)

I listed them just for my future reference; I won’t study them now. But here is a quick summary of some of the less common methods:

Encoding Method            Use Case                                   Key Benefit
Polynomial Coding          Statistical modeling                       Captures group-level contrasts
Weight of Evidence (WoE)   Credit scoring, binary targets             Log odds for binary target
CatBoost Encoding          Gradient boosting with CatBoost            Reduces data leakage
Gaussian Encoding          Avoiding overfitting in small datasets     Adds noise to target encoding
Entity Embeddings          Complex relationships in large datasets    Dense, continuous representations
Cluster-Based Encoding     Grouping similar categories                Simplifies complex categories

Sticking to the more commonly used techniques, such as one-hot encoding, target encoding, or ordinal encoding, is often sufficient for most applications.