Basic Feature Engineering with Python

Overview

The data we will use here is customer data from a European bank, and our task is to predict whether a customer of the bank will churn. If a customer churns, it means they left the bank and took their business elsewhere. If we can predict which customers are likely to churn, we can take measures to retain them before they do.

Topics of focus in this activity include:

  • Feature selection
    • Removing uninformative features
  • Feature extraction
    • Creating new features from existing features
  • Feature transformation
    • Modifying existing features to better suit our objectives
    • Encoding of categorical features as dummies

Target variable

# Import packages for data manipulation
import numpy as np
import pandas as pd

The column called Exited is a Boolean value that indicates whether or not a customer left the bank (0 = did not leave, 1 = did leave). This will be our target variable. In other words, for each customer, our model should predict whether they have a 0 or a 1 in the Exited column.

This is a supervised learning classification task, because we are predicting a binary class label.

# Read in data
df_original = pd.read_csv('Churn_Modelling.csv')
df_original.head()
   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          1    15634602  Hargrave          619    France  Female   42       2       0.00              1          1               1        101348.88       1
1          2    15647311      Hill          608     Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2          3    15619304      Onio          502    France  Female   42       8  159660.80              3          1               0        113931.57       1
3          4    15701354      Boni          699    France  Female   39       1       0.00              2          0               0         93826.63       0
4          5    15737888  Mitchell          850     Spain  Female   43       2  125510.82              1          1               1         79084.10       0
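Although it is not part of the original walkthrough, it can be useful at this point to check how balanced the target classes are, since churn datasets are often imbalanced. A minimal sketch:

# Proportion of customers in each Exited class
df_original['Exited'].value_counts(normalize=True)

In this dataset, churners are the minority class (roughly 20% of customers), which is worth keeping in mind later when choosing evaluation metrics.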

When modeling, a best practice is to perform a rigorous examination of the data before beginning feature engineering and feature selection. Here, however, we will skip that essential part of the modeling process. Let’s get a quick overview of the data:

# Print high-level info about data
df_original.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB

From this output, we can confirm that the data has 14 columns (including the target) and 10,000 observations. We also know that nine columns are integers, two are floats, and three are objects (strings). Finally, the non-null counts tell us that there are no null values.
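If we wanted to confirm the absence of nulls explicitly (a supplementary check, not in the original walkthrough), one line of pandas suffices:

# Count null values in each column; every count should be zero
df_original.isna().sum()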

Feature engineering

Feature selection

Feature selection is the process of choosing features to be used for modeling. In practice, feature selection takes place at multiple points in the PACE process (Plan, Analyze, Construct, Execute). If we decide that the problem requires a model, we will then have to:

  • Consider what data is available to us
  • Decide on what kind of model we need
  • Decide on a target variable
  • Assemble a collection of features that we think might help predict on our chosen target

This all takes place during the Plan phase.

Then, during the Analyze phase, we perform EDA on the data and reevaluate our variables for appropriateness. For example, can our model handle null values? If not, what do we do with features that have a lot of nulls? Perhaps we drop them. This too is feature selection.

Feature selection also occurs during the Construct phase. This usually involves building a model, examining which features are most predictive, and then removing the features that contribute little, as sketched below.
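As an illustration of what that might look like (scikit-learn is not otherwise used in this post, so treat this as a hedged sketch rather than the course’s method), we could fit a quick tree-based model on the numeric columns and inspect its feature importances:

from sklearn.ensemble import RandomForestClassifier

# Illustration only: fit on the numeric columns, excluding identifiers
X = df_original.select_dtypes(include='number').drop(
    columns=['RowNumber', 'CustomerId', 'Exited'])
y = df_original['Exited']

rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Rank features by importance; low scores mark candidates for removal
pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)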

There’s a lot of work involved in feature selection. In this case, we already have a dataset, and we are not performing thorough EDA on it. However, we can still examine the data to ensure that all the features can reasonably be expected to have predictive potential.

Examining the features
  • The first column is called RowNumber, and it just enumerates the rows. We should drop this feature, because row number shouldn’t have any correlation with whether or not a customer churned.
  • The same is true for CustomerId, which appears to be a number assigned to the customer for administrative purposes, and Surname, which is the customer’s last name. Since these cannot be expected to have any influence over the target variable, we can remove them from the modeling dataset.
  • Finally, for ethical reasons, we should remove the Gender column: we don’t want our model making predictions (and, therefore, the bank offering promotions or financial incentives) based on a person’s gender.
# Create a new df that drops RowNumber, CustomerId, Surname, and Gender cols
churn_df = df_original.drop(['RowNumber', 'CustomerId', 'Surname', 'Gender'], 
                            axis=1)
churn_df.head()
   CreditScore Geography  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          619    France   42       2       0.00              1          1               1        101348.88       1
1          608     Spain   41       1   83807.86              1          0               1        112542.58       0
2          502    France   42       8  159660.80              3          1               0        113931.57       1
3          699    France   39       1       0.00              2          0               0         93826.63       0
4          850     Spain   43       2  125510.82              1          1               1         79084.10       0
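As a quick supplementary check, the shape of the new DataFrame should confirm that four of the original 14 columns are gone:

# Confirm the drop: expect (10000, 10)
churn_df.shape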
Feature extraction

Depending on our data, we may be able to create brand new features from our existing features. Oftentimes, features that we create ourselves are some of the most important features selected by our model. Usually this is the case when we have both domain knowledge for the problem we’re solving and the right combinations of data.

For example, suppose we knew that our bank had a computer glitch that caused many credit card transactions to be mistakenly declined in October. It would be reasonable to suspect that people who experienced this might be at increased risk of leaving the bank. If we had a feature that represented each customer’s number of credit card transactions each month, we could create a new feature; for example, OctUseRatio, where:

OctUseRatio = num of Oct. transactions / avg num monthly transactions

This new feature would then give us a ratio that might be indicative of whether the customer experienced declined transactions.
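In code, with hypothetical column names (neither OctTransactionCount nor AvgMonthlyTransactions exists in our dataset), the calculation might look like this:

# Hypothetical illustration: these columns are not in our churn data
toy = pd.DataFrame({
    'OctTransactionCount':    [4, 20, 15],
    'AvgMonthlyTransactions': [20, 22, 14],
})

# A ratio well below 1 flags customers whose October activity dropped
toy['OctUseRatio'] = toy['OctTransactionCount'] / toy['AvgMonthlyTransactions']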

We don’t have this kind of specific circumstantial knowledge, and we don’t have many features to choose from, but we can create a new feature that might help improve the model.

Let’s create a Loyalty feature that represents the proportion of each customer’s life for which they have been a customer. We can do this by dividing Tenure by Age:

Loyalty = Tenure / Age

The intuition here is that people who have been customers for a greater proportion of their lives might be less likely to churn.

# Create Loyalty variable
churn_df['Loyalty'] = churn_df['Tenure'] / churn_df['Age']
churn_df.head()
   CreditScore Geography  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited   Loyalty
0          619    France   42       2       0.00              1          1               1        101348.88       1  0.047619
1          608     Spain   41       1   83807.86              1          0               1        112542.58       0  0.024390
2          502    France   42       8  159660.80              3          1               0        113931.57       1  0.190476
3          699    France   39       1       0.00              2          0               0         93826.63       0  0.025641
4          850     Spain   43       2  125510.82              1          1               1         79084.10       0  0.046512
Feature transformation

Another step is to transform our features to get them ready for modeling. Different models have different requirements for how the data should be prepared and also different assumptions about their distributions, independence, and so on.

The models we will be building with this data are all classification models, and classification models generally need categorical variables to be encoded. Our dataset has one categorical feature: Geography. Let’s check how many categories appear in the data for this feature.

# Print unique values of Geography col
churn_df['Geography'].unique()
array(['France', 'Spain', 'Germany'], dtype=object)

There are three unique values: France, Spain, and Germany. By encoding this feature, we can represent it using Boolean columns. We will use the pandas function pd.get_dummies() to do this.

When we call pd.get_dummies() on this feature, it will replace the Geography column with three new Boolean columns, one for each possible category contained in the column being dummied.

When we specify drop_first=True in the function call, Geography is replaced with just two new columns instead of three. We can do this because no information is lost, while the dataset becomes smaller and simpler; dropping the first category also avoids a perfectly redundant column, which can cause multicollinearity problems for some models.

In this case, we end up with two new columns called Geography_Germany and Geography_Spain. We don’t need a Geography_France column, because if a customer’s values in Geography_Germany and Geography_Spain are both 0, we will know they are from France!
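To see the encoding in isolation before transforming the whole DataFrame (a supplementary illustration), we could dummy the Geography column by itself:

# Preview the encoding on the Geography column alone
pd.get_dummies(churn_df['Geography'], prefix='Geography', drop_first=True).head()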

# Dummy encode categorical variables
churn_df = pd.get_dummies(churn_df, drop_first=True)
churn_df.head()
   CreditScore  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited   Loyalty  Geography_Germany  Geography_Spain
0          619   42       2       0.00              1          1               1        101348.88       1  0.047619                  0                0
1          608   41       1   83807.86              1          0               1        112542.58       0  0.024390                  0                1
2          502   42       8  159660.80              3          1               0        113931.57       1  0.190476                  0                0
3          699   39       1       0.00              2          0               0         93826.63       0  0.025641                  0                0
4          850   43       2  125510.82              1          1               1         79084.10       0  0.046512                  0                1

We can now use our new dataset to build a model.
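The modeling itself is beyond the scope of this post, but as a hedged sketch of the immediate next step, we might separate the features from the target and create a stratified train/test split with scikit-learn:

from sklearn.model_selection import train_test_split

# Separate features (X) from the target (y)
X = churn_df.drop(columns=['Exited'])
y = churn_df['Exited']

# Stratify on y so both splits keep the same churn rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)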


Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.