Transforming Data: Categorical Data and Label Encoding

Categorical data

Data that is divided into a limited number of qualitative groups. For example, demographic information tends to be categorical: occupation, ethnicity, educational attainment, and so on. In other words, categorical data is described with words or qualities rather than numbers.

Many data models and algorithms don’t work as well with categorical data as they do with numerical data. Assigning representative numerical values to categorical data is often the quickest and most effective way to learn about the distribution of categorical values in a dataset.

(Some algorithms, such as decision trees, work well with categorical variables in word form. With many datasets, however, the categorical data will need to be converted into numerical data.)

There are several ways to convert categorical data to numerical data. We will focus on two common methods: creating dummy variables and label encoding.

Dummy variables

Variables with values of 0 or 1, which indicate the presence or absence of something. Dummy variables are especially useful in statistical models and machine learning algorithms.

Label encoding

Data transformation technique where each category is assigned a unique number instead of a qualitative value.

Data is much simpler to clean, join, and group when it is all numbers, and it also takes up less storage space. An algorithm or model typically runs more smoothly when we take the time to transform our categorical data into numerical data.
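As a minimal sketch of both techniques (using a small, hypothetical fruit column rather than a real dataset), here is how each looks in pandas:

```python
import pandas as pd

# A small, hypothetical categorical column
df = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'cherry']})

# Label encoding: each category becomes a unique integer code
df['fruit_code'] = df['fruit'].astype('category').cat.codes

# Dummy variables: one 0/1 column per category
dummies = pd.get_dummies(df['fruit'])

print(df)
print(dummies)
```

Note that pandas assigns the codes in alphabetical order of the categories by default (apple=0, banana=1, cherry=2).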


A label encoding example

Let’s bring back NOAA’s lightning dataset.

Objective 

We will be examining monthly lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) for 2016–2018. The dataset includes three columns: date, number_of_strikes, and center_point_geom. 

The objective is to assign the monthly number of strikes to the following categories: mild, scattered, heavy, or severe. Then we will create a heatmap of the three years so we can get a high-level understanding of monthly lightning severity from a simple diagram.

Import libraries and read the data file
import datetime
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv('eda_label_encoding_dataset.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10479003 entries, 0 to 10479002
Data columns (total 3 columns):
 #   Column             Dtype
---  ------             -----
 0   date               object
 1   number_of_strikes  int64
 2   center_point_geom  object
dtypes: int64(1), object(2)
memory usage: 239.8+ MB

Prepare the data to work with date

First, let’s convert the date column to datetime, as we did a couple of times before.

# Convert `date` column to datetime
df['date'] = pd.to_datetime(df['date'])

# Create new `month` column
df['month'] = df['date'].dt.month_name().str.slice(stop=3)

df.head()
         date  number_of_strikes   center_point_geom month
0  2016-08-05                 16  POINT(-101.5 24.7)   Aug
1  2016-08-05                 16     POINT(-85 34.3)   Aug
2  2016-08-05                 16       POINT(-89 41.4)  Aug
3  2016-08-05                 16   POINT(-89.8 30.7)   Aug
4  2016-08-05                 16   POINT(-86.2 37.9)   Aug

Now we need to encode the months as categorical information. This allows us to specifically designate them as categories that adhere to a specific order, which is helpful when we plot them later. We’ll also create a new year column. Then we’ll group the data by year and month, sum the remaining columns, and assign the results to a new dataframe.

# Create categorical designations
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Encode `month` column as categoricals 
df['month'] = pd.Categorical(df['month'], categories=months, ordered=True)

# Create `year` column by extracting the year info from the datetime object
df['year'] = df['date'].dt.strftime('%Y')

# Create a new df of month, year, total strikes
# NOTE: In pandas 2.x, we must pass numeric_only=True, or sum() will raise an error
df_by_month = df.groupby(['year', 'month']).sum(numeric_only=True).reset_index()
df_by_month.head()

   year month  number_of_strikes
0  2016   Jan             313595
1  2016   Feb             312676
2  2016   Mar            2057527
3  2016   Apr            2636427
4  2016   May            5800500

Create a categorical variable

We’ll create a new column called strike_level that contains a categorical variable representing the lightning strikes for each month as mild, scattered, heavy, or severe. The pandas function pd.qcut() makes this easy: we pass in the column to be categorized, the number of quantiles to sort the data into, and the labels for each quantile.

# Create a new column that categorizes number_of_strikes into 1 of 4 categories
df_by_month['strike_level'] = pd.qcut(
    df_by_month['number_of_strikes'],
    4,
    labels = ['Mild', 'Scattered', 'Heavy', 'Severe'])
df_by_month.head()
   year month  number_of_strikes strike_level
0  2016   Jan             313595         Mild
1  2016   Feb             312676         Mild
2  2016   Mar            2057527    Scattered
3  2016   Apr            2636427        Heavy
4  2016   May            5800500       Severe

Encode strike_level into numerical values

Now that we have a categorical strike_level column, we can extract a numerical code from it using the cat.codes accessor (which assigns each category a numeric code) and assign this number to a new column.

# Create new column representing numerical value of strike level
df_by_month['strike_level_code'] = df_by_month['strike_level'].cat.codes
df_by_month.head()
   year month  number_of_strikes strike_level  strike_level_code
0  2016   Jan             313595         Mild                  0
1  2016   Feb             312676         Mild                  0
2  2016   Mar            2057527    Scattered                  1
3  2016   Apr            2636427        Heavy                  2
4  2016   May            5800500       Severe                  3

Creating dummy variables (demonstrational)
We can also create binary “dummy” variables from the strike_level column. This is a useful tool if we’d like to pass the categorical variable into a model. To do this, we could use the function pd.get_dummies().
pd.get_dummies(df_by_month['strike_level'])
    Mild  Scattered  Heavy  Severe
0      1          0      0       0
1      1          0      0       0
2      0          1      0       0
3      0          0      1       0
4      0          0      0       1
5      0          0      0       1
6      0          0      0       1
7      0          0      0       1
8      0          0      1       0
9      0          1      0       0
10     1          0      0       0
11     1          0      0       0
12     0          1      0       0
13     1          0      0       0
14     0          1      0       0
15     0          0      1       0
16     0          0      1       0
17     0          0      1       0
18     0          0      0       1
19     0          0      0       1
20     0          0      1       0
21     0          1      0       0
22     1          0      0       0
23     1          0      0       0
24     0          1      0       0
25     0          0      1       0
26     0          1      0       0
27     0          1      0       0
28     0          0      1       0
29     0          0      0       1
30     0          0      0       1
31     0          0      0       1
32     0          0      1       0
33     0          1      0       0
34     1          0      0       0
35     1          0      0       0

Simply calling the function as we did above displays the result but does not change the dataframe; we must reassign the output for the conversion to persist.

pd.get_dummies(df['column'])       → df unchanged
df = pd.get_dummies(df['column'])  → df changed

We don’t need to create dummy variables for our heatmap, so let’s continue without converting the dataframe.

Create a heatmap of number of strikes per month

Let’s see what we can learn from these groupings. To do that, we’ll plot our data.

We want our heatmap to have the months on the x-axis and the years on the y-axis, and the color gradient should represent the severity (mild, scattered, heavy, severe) of lightning for each month. A simple way of preparing the data for the heatmap is to pivot it so the rows are years, columns are months, and the values are the numeric code of the lightning severity. We can do this with the df.pivot() method.

# Create new df that pivots the data
df_by_month_plot = df_by_month.pivot(index='year', columns='month', values='strike_level_code')
df_by_month_plot.head()
month  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
year
2016     0    0    1    2    3    3    3    3    2    1    0    0
2017     1    0    1    2    2    2    3    3    2    1    0    0
2018     1    2    1    1    2    3    3    3    2    1    0    0
ax = sns.heatmap(df_by_month_plot, cmap = 'Blues')
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([0, 1, 2, 3])
colorbar.set_ticklabels(['Mild', 'Scattered', 'Heavy', 'Severe'])
plt.show()

The heatmap indicates that for all three years, the most lightning strikes occurred during the summer months. A heatmap is an easily digestible way to understand a lot of data in a single graphic.

To give more information about the code above:

  • cmap='Blues' applies a preset color gradient that runs from light (low values) to dark blue (high values).
  • ax.collections[0].colorbar retrieves the heatmap’s color bar object, which saves us from having to build and color a color bar of our own.
  • We then set the color bar ticks to 0 through 3 and label them according to our four strike levels: Mild, Scattered, Heavy, and Severe.

Some potential problems with label encoding 

Imagine we’re analyzing a dataset with categories of music genres. We label encode “Blues,” “Electronic Dance Music (EDM),” “Hip Hop,” “Jazz,” “K-Pop,” “Metal,” and “Rock” with the numeric values 1, 2, 3, 4, 5, 6, and 7. 

  • With this label encoding, the resulting machine learning model could derive not only a ranking, but also a closer connection between Blues (1) and EDM (2), because they are numerically closer than, say, Blues (1) and Jazz (4). 
  • In addition to these presumed relationships (which we may or may not want in our analysis), we should also notice that each code is equidistant from the next in the numeric sequence: the distance from 1 to 2 is the same as from 5 to 6, and so on. 
  • The question is, does that equidistant relationship accurately represent the relationships between the music genres in our dataset? 
  • To ask another question, after encoding, will the visualization or model we build treat the encoded labels as a ranking? 
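To make the problem concrete, here is a sketch of that genre encoding in pandas. (pandas assigns 0-based codes in alphabetical order by default, rather than the 1–7 labels above; the distance problem is identical.)

```python
import pandas as pd

# Hypothetical genre data, label encoded alphabetically (Blues=0 ... Rock=6)
genres = pd.Series(['Blues', 'EDM', 'Hip Hop', 'Jazz', 'K-Pop', 'Metal', 'Rock'])
codes = pd.Series(pd.Categorical(genres).codes, index=genres)

# The codes imply numeric distances with no real-world meaning:
# Blues (0) is "1 away" from EDM (1), but "3 away" from Jazz (3),
# even though neither gap reflects any actual similarity between genres.
print(codes)
print('Blues-EDM distance: ', abs(codes['Blues'] - codes['EDM']))
print('Blues-Jazz distance:', abs(codes['Blues'] - codes['Jazz']))
```

A model that treats these codes as plain numbers will pick up this spurious geometry.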

In summary, label encoding may introduce unintended relationships between the categorical data in our dataset. There is another method for categorical encoding that may help with these potential problems.

One-hot Encoding

As mentioned at the beginning, we can also create dummy variables. Creating dummies in this way is called one-hot encoding. A one-hot encoded table could look like this:

   Mild  Scattered  Heavy  Severe
0     1          0      0       0
1     1          0      0       0
2     0          1      0       0
3     0          0      1       0
4     0          0      0       1
5     0          0      0       1
6     0          0      0       1
7     0          0      0       1
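Applied to the hypothetical genre example from earlier, one-hot encoding removes the implied ordering entirely; no genre ends up numerically “closer” to any other:

```python
import pandas as pd

# Hypothetical genre column
genres = pd.Series(['Blues', 'EDM', 'Jazz', 'Blues'])

# One-hot encode: one binary column per genre, no implied ranking
one_hot = pd.get_dummies(genres)
print(one_hot)
```

Each row has exactly one “hot” column, which is where the name comes from.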

Label encoding or one-hot encoding: Which one to choose?

There is no simple answer to whether we should use label encoding or one-hot encoding. The decision needs to be made on a case-by-case, or dataset-by-dataset basis. But here are some guidelines: 

We can use label encoding when: 

  • There are a large number of different categorical variables, because label encoding uses far less data than one-hot encoding. 
  • The categorical values have a particular order to them (for example, age groups can be grouped as youngest to oldest or oldest to youngest). 
  • We plan to use a decision tree or random forest machine learning model.

We can use one-hot encoding when: 

  • There is a relatively small number of categorical variables, because one-hot encoding uses much more data than label encoding. 
  • The categorical variables have no particular order.
  • We use a machine learning model in combination with dimensionality reduction, like Principal Component Analysis (PCA).
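The storage difference in the first guideline can be checked directly. Here is a sketch using a made-up high-cardinality column (exact byte counts vary by pandas version, but the ratio roughly tracks the number of categories):

```python
import numpy as np
import pandas as pd

# Hypothetical column with 50 distinct categories and 10,000 rows
rng = np.random.default_rng(0)
s = pd.Series(rng.choice([f'cat_{i}' for i in range(50)], size=10_000))

# Label encoding: a single small-integer column
label_encoded = s.astype('category').cat.codes

# One-hot encoding: one column per category
one_hot = pd.get_dummies(s)

print('label-encoded bytes:', label_encoded.to_numpy().nbytes)
print('one-hot bytes:      ', one_hot.to_numpy().nbytes)
```

With 50 categories, the one-hot representation is roughly 50 times larger, which is why label encoding is preferred when the number of categories is large.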
