Below are the six encoding methods covered earlier, implemented with both pandas and scikit-learn, along with sample output for each.
| Encoding Method | pandas Implementation | scikit-learn Implementation |
|---|---|---|
| Label Encoding | pd.factorize() | LabelEncoder() |
| Ordinal Encoding | map() with custom order | OrdinalEncoder() |
| One-Hot Encoding | pd.get_dummies() | OneHotEncoder() |
| Target Encoding | groupby() + map() | TargetEncoder() (scikit-learn 1.3+), or TargetEncoder from category_encoders |
| Frequency Encoding | value_counts() + map() | Custom implementation |
| Binary Encoding | Custom with cat.codes and apply() | BinaryEncoder() from category_encoders |
Let’s quickly demonstrate each.
Label Encoding
import pandas as pd
data = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B']})
# Label Encoding using pandas (factorize numbers categories in order of first appearance)
data['LabelEncoded'] = pd.factorize(data['Category'])[0]
print(data)
from sklearn.preprocessing import LabelEncoder
# Label Encoding using scikit-learn (LabelEncoder is designed for target labels and numbers classes in sorted order)
label_encoder = LabelEncoder()
data['LabelEncoded'] = label_encoder.fit_transform(data['Category'])
print(data)
Output:
Category LabelEncoded
0 A 0
1 B 1
2 A 0
3 C 2
4 B 1
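The two approaches agree above only because the categories first appear in alphabetical order. A minimal sketch of the difference, using pandas alone and a hypothetical series whose first-appearance order is not alphabetical: by default `pd.factorize` numbers values as it meets them, while `sort=True` numbers the sorted unique values, which matches the sorted order that `LabelEncoder` uses.

```python
import pandas as pd

# Hypothetical data: first-appearance order (C, A, B) differs from sorted order
s = pd.Series(['C', 'A', 'B', 'A', 'C'])

# Default: codes follow order of first appearance (C=0, A=1, B=2)
codes_appearance, uniques_appearance = pd.factorize(s)

# sort=True: codes follow the sorted unique values (A=0, B=1, C=2)
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)

print(codes_appearance.tolist(), list(uniques_appearance))  # [0, 1, 2, 1, 0] ['C', 'A', 'B']
print(codes_sorted.tolist(), list(uniques_sorted))          # [2, 0, 1, 0, 2] ['A', 'B', 'C']
```

If the integer codes need to be reproducible across datasets, the sorted variant is usually the safer choice, since it does not depend on row order.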
Ordinal Encoding
import pandas as pd
data = pd.DataFrame({'Category': ['low', 'medium', 'high', 'medium', 'low']})
# Ordinal Encoding using pandas
order = ['low', 'medium', 'high']
data['OrdinalEncoded'] = data['Category'].map({value: idx for idx, value in enumerate(order)})
print(data)
from sklearn.preprocessing import OrdinalEncoder
# Ordinal Encoding using scikit-learn
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
data['OrdinalEncoded'] = ordinal_encoder.fit_transform(data[['Category']]).ravel()  # flatten the (n, 1) result to 1-D
print(data)
Output (scikit-learn version; the pandas version yields the integers 0, 1, 2 instead of floats):
Category OrdinalEncoded
0 low 0.0
1 medium 1.0
2 high 2.0
3 medium 1.0
4 low 0.0
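One thing the dictionary `map()` approach leaves implicit is what happens to values outside the declared order. The sketch below compares it with an ordered `pd.Categorical`, which is another pandas way to get the same codes. The value `'extreme'` is a hypothetical out-of-order entry added for illustration.

```python
import pandas as pd

order = ['low', 'medium', 'high']
# 'extreme' is a hypothetical value that is not part of the declared order
s = pd.Series(['low', 'medium', 'high', 'extreme'])

# Dictionary map: unknown values become NaN, which turns the column into floats
mapped = s.map({value: idx for idx, value in enumerate(order)})

# Ordered categorical: unknown values get the sentinel code -1 and the dtype stays integer
cat = pd.Categorical(s, categories=order, ordered=True)
codes = pd.Series(cat.codes, index=s.index)

print(mapped.tolist())  # [0.0, 1.0, 2.0, nan]
print(codes.tolist())   # [0, 1, 2, -1]
```

Either way, unknown values should be checked for explicitly before modelling, because neither NaN nor -1 carries a meaningful rank.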
One-Hot Encoding
import pandas as pd
data = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B']})
# One-Hot Encoding using pandas
one_hot = pd.get_dummies(data['Category'], prefix='Category')
print(one_hot)
from sklearn.preprocessing import OneHotEncoder
# One-Hot Encoding using scikit-learn
encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2 and removed in 1.4
one_hot_sklearn = encoder.fit_transform(data[['Category']])
print(pd.DataFrame(one_hot_sklearn, columns=encoder.get_feature_names_out(['Category'])))
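Output (scikit-learn version; pd.get_dummies produces the same columns, as booleans in pandas 2.0 and later):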
Category_A Category_B Category_C
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
3 0.0 0.0 1.0
4 0.0 1.0 0.0
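A practical catch with `pd.get_dummies()` is that it only creates columns for the categories present in the data it is given, so new data can end up with a different set of columns. A minimal sketch of one way to handle this, assuming a hypothetical `new` frame that lacks `C` and contains an unseen `D`: reindex the new dummies to the training columns.

```python
import pandas as pd

train = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B']})
# Hypothetical new data: 'C' is absent and 'D' was never seen in training
new = pd.DataFrame({'Category': ['B', 'D', 'A']})

train_dummies = pd.get_dummies(train['Category'], prefix='Category', dtype=int)
new_dummies = pd.get_dummies(new['Category'], prefix='Category', dtype=int)

# Align to the training columns: missing columns are filled with 0,
# and the column for the unseen category (Category_D) is dropped
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(new_aligned)
```

The row for `D` becomes all zeros, which is similar in spirit to what scikit-learn's `OneHotEncoder(handle_unknown='ignore')` does for unseen categories.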
Target Encoding
import pandas as pd
data = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B'], 'Target': [1, 0, 1, 0, 1]})
# Target Encoding using pandas
target_means = data.groupby('Category')['Target'].mean()
data['TargetEncoded'] = data['Category'].map(target_means)
print(data)
Note: scikit-learn 1.3 and later provide sklearn.preprocessing.TargetEncoder, which adds smoothing and cross-fitting. On older versions, implement target encoding manually (as above) or use the category_encoders library.
Output:
Category Target TargetEncoded
0 A 1 1.0
1 B 0 0.5
2 A 1 1.0
3 C 0 0.0
4 B 1 0.5
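The plain per-category mean above is noisy for rare categories: `C` appears once, so its encoding is simply that single row's target. A common remedy is to shrink each category mean toward the global mean. The sketch below is one simple form of this, not a specific library's formula, and the smoothing strength `m` is a hypothetical parameter chosen for illustration.

```python
import pandas as pd

data = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B'],
                     'Target': [1, 0, 1, 0, 1]})

m = 2  # hypothetical smoothing strength; larger values pull harder toward the global mean
global_mean = data['Target'].mean()  # 3 / 5 = 0.6

# Per-category mean and row count
stats = data.groupby('Category')['Target'].agg(['mean', 'count'])

# Weighted blend of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

data['TargetSmoothed'] = data['Category'].map(smoothed)
print(data)
```

With these values, `A` moves from 1.0 to 0.8, `B` from 0.5 to 0.55, and `C` from 0.0 to 0.4. In a real pipeline the statistics should be computed on training data only (ideally out-of-fold), since encoding a row with its own target leaks label information.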
Frequency Encoding
import pandas as pd
data = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B']})
# Frequency Encoding using pandas
frequency = data['Category'].value_counts()
data['FrequencyEncoded'] = data['Category'].map(frequency)
print(data)
Note: scikit-learn has no built-in frequency encoder, so a manual implementation like the one above is the usual approach.
Output:
Category FrequencyEncoded
0 A 2
1 B 2
2 A 2
3 C 1
4 B 2
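Note that `A` and `B` receive the same code in the output above because they occur equally often, so frequency encoding cannot tell them apart. Two small variations are often useful: relative frequencies instead of raw counts, and an explicit default for categories that were not in the training data. The sketch below assumes a hypothetical `new` frame containing an unseen category `D`, and the choice of 0.0 as the fallback is an assumption, not a rule.

```python
import pandas as pd

train = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B']})
# Hypothetical new data: 'D' never appeared in training
new = pd.DataFrame({'Category': ['A', 'D']})

# Relative frequencies learned from the training data only
rel_freq = train['Category'].value_counts(normalize=True)

train['FrequencyEncoded'] = train['Category'].map(rel_freq)

# Unseen categories map to NaN; here they are treated as frequency 0.0
new['FrequencyEncoded'] = new['Category'].map(rel_freq).fillna(0.0)

print(train)
print(new)
```

Relative frequencies keep the feature on a 0 to 1 scale regardless of dataset size, which makes the encoding comparable between training data and later batches.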
Binary Encoding
import pandas as pd
data = pd.DataFrame({'Category': ['A', 'B', 'C', 'D']})
# Convert to categorical and get the codes
data['Category_Code'] = data['Category'].astype('category').cat.codes
# Convert the numeric codes into binary
binary_encoded = data['Category_Code'].apply(lambda x: bin(x)[2:].zfill(2)) # Ensure 2-bit binary representation
# Split the binary string into separate columns
binary_encoded_df = binary_encoded.apply(lambda x: pd.Series(list(x))).astype(int)
# Rename columns
binary_encoded_df.columns = [f'Category_{i}' for i in range(binary_encoded_df.shape[1])]
print(binary_encoded_df)
import category_encoders as ce
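# Binary Encoding using category_encoders
# Pass only the original column so the helper Category_Code column is not carried into the result
binary_encoder = ce.BinaryEncoder(cols=['Category'])
binary_encoded = binary_encoder.fit_transform(data[['Category']])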
print(binary_encoded)
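Output (manual pandas version; category_encoders may produce a different number of bit columns because it assigns its own integer codes):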
Category_0 Category_1
0 0 0
1 0 1
2 1 0
3 1 1
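The manual version above hard-codes a 2-bit width through `zfill(2)`, which only covers up to four categories. A minimal sketch of a generalization, assuming the same `cat.codes` numbering, derives the bit width from the largest code and extracts the bits with NumPy shifts instead of string manipulation. The five-category frame is a hypothetical example chosen to need three bits.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five categories need 3 bits (codes 0..4)
data = pd.DataFrame({'Category': ['A', 'B', 'C', 'D', 'E']})

codes = data['Category'].astype('category').cat.codes.to_numpy()

# Number of bits needed for the largest code (at least 1)
n_bits = max(1, int(codes.max()).bit_length())

# Shift amounts from most significant bit to least significant bit
shifts = np.arange(n_bits - 1, -1, -1)

# Broadcast: one row per sample, one column per bit
bits = (codes[:, None] >> shifts) & 1

binary_df = pd.DataFrame(bits,
                         columns=[f'Category_{i}' for i in range(n_bits)],
                         index=data.index)
print(binary_df)
```

For `k` categories this yields roughly log2(k) columns, which is the main appeal of binary encoding over one-hot encoding for high-cardinality features.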