Overview
The data used here is customer data from a European bank, and we will use it to predict whether a customer will churn. If a customer churns, it means they left the bank and took their business elsewhere. If we can predict which customers are likely to churn, we can take measures to retain them before they do.
Topics of focus in this activity include:
- Feature selection
  - Removing uninformative features
- Feature extraction
  - Creating new features from existing features
- Feature transformation
  - Modifying existing features to better suit our objectives
  - Encoding categorical features as dummies
Target variable
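# Import packages for data manipulation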
import numpy as np
import pandas as pd
The column called Exited is a Boolean value that indicates whether or not a customer left the bank (0 = did not leave, 1 = did leave). This will be our target variable. In other words, for each customer, our model should predict whether they will have a 0 or a 1 in the Exited column.
This is a supervised learning classification task because we will predict on a binary class.
# Read in data
df_original = pd.read_csv('Churn_Modelling.csv')
df_original.head()
|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
When modeling, a best practice is to perform a rigorous examination of the data before beginning feature engineering and feature selection. But here we will skip that essential part of the modeling process. Let’s have a quick overview of the data:
# Print high-level info about data
df_original.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RowNumber 10000 non-null int64
1 CustomerId 10000 non-null int64
2 Surname 10000 non-null object
3 CreditScore 10000 non-null int64
4 Geography 10000 non-null object
5 Gender 10000 non-null object
6 Age 10000 non-null int64
7 Tenure 10000 non-null int64
8 Balance 10000 non-null float64
9 NumOfProducts 10000 non-null int64
10 HasCrCard 10000 non-null int64
11 IsActiveMember 10000 non-null int64
12 EstimatedSalary 10000 non-null float64
13 Exited 10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
From this table, we can confirm that the data has 14 columns and 10,000 observations. We also know that nine columns are integers, two are floats, and three are objects (strings). Finally, we can tell that there are no null values.
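Since Exited is our target variable, it is also worth a quick glance at how many customers actually churned. A minimal check, using the value_counts() method on the already-loaded df_original:

# Check the class balance of the target variable
df_original['Exited'].value_counts(normalize=True)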
Feature engineering
Feature selection
Feature selection is the process of choosing features to be used for modeling. In practice, feature selection takes place at multiple points in the PACE process. If we decide that the problem requires a model, we will then have to:
- Consider what data is available to us
- Decide on what kind of model we need
- Decide on a target variable
- Assemble a collection of features that we think might help predict our chosen target
This all takes place during the Plan phase.
Then, during the Analyze phase, we perform EDA on the data and reevaluate our variables for appropriateness. For example, can our model handle null values? If not, what do we do with features that have a lot of nulls? Perhaps we drop them. This too is feature selection.
Feature selection also occurs during the Construct phase. This usually involves building a model, examining which features are most predictive, and then removing the unpredictive features.
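As a rough illustration of what that Construct-phase step can look like, here is a minimal sketch that fits a tree-based model and ranks features by importance. The choice of a random forest, the quick one-hot encoding, and the availability of scikit-learn are assumptions made only for this example; the feature selection for this dataset is actually done by inspection in the next section.

# Minimal sketch: rank features by importance using a tree-based model
from sklearn.ensemble import RandomForestClassifier

# For illustration only: drop identifier columns and the target, then
# one-hot encode the remaining categorical columns
X = pd.get_dummies(df_original.drop(['RowNumber', 'CustomerId', 'Surname', 'Exited'], axis=1),
                   drop_first=True)
y = df_original['Exited']

rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Features sorted from most to least predictive; consistently unpredictive
# features are candidates for removal
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))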
There’s a lot of work involved in feature selection. In this case, we already have a dataset, and we are not performing thorough EDA on it. However, we can still examine the data to ensure that all the features can reasonably be expected to have predictive potential.
Examining the features
- The first column is called RowNumber, and it just enumerates the rows. We should drop this feature, because row number shouldn’t have any correlation with whether or not a customer churned.
- The same is true for CustomerId, which appears to be a number assigned to the customer for administrative purposes, and Surname, which is the customer’s last name. Since these cannot be expected to have any influence over the target variable, we can remove them from the modeling dataset.
- Finally, for ethical reasons, we should remove the Gender column: we don’t want our model making predictions (and therefore offering promotions or financial incentives) based on a person’s gender.
# Create a new df that drops RowNumber, CustomerId, Surname, and Gender cols
churn_df = df_original.drop(['RowNumber', 'CustomerId', 'Surname', 'Gender'],
axis=1)
churn_df.head()
|   | CreditScore | Geography | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
Feature extraction
Depending on our data, we may be able to create brand new features from our existing features. Oftentimes, features that we create ourselves are some of the most important features selected by our model. Usually this is the case when we have both domain knowledge for the problem we’re solving and the right combinations of data.
For example, suppose we knew that our bank had a computer glitch that caused many credit card transactions to be mistakenly declined in October. It would be reasonable to suspect that people who experienced this might be at increased risk of leaving the bank. If we had a feature that represented each customer’s number of credit card transactions each month, we could create a new feature, for example, OctUseRatio, where:
OctUseRatio = num of Oct. transactions / avg num monthly transactions
This new feature would then give us a ratio that might be indicative of whether the customer experienced declined transactions.
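If we did have such data, computing the feature would be a one-line operation. The columns oct_transactions and avg_monthly_transactions below are hypothetical and do not exist in this dataset; the toy DataFrame is only meant to illustrate the idea.

# Hypothetical example only: these columns are not in the churn dataset
toy = pd.DataFrame({'oct_transactions': [2, 15, 9],
                    'avg_monthly_transactions': [20, 18, 10]})

# A low ratio means the customer used their card far less than usual in October
toy['OctUseRatio'] = toy['oct_transactions'] / toy['avg_monthly_transactions']
toy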
We don’t have this kind of specific circumstantial knowledge, and we don’t have many features to choose from, but we can create a new feature that might help improve the model.
Let’s create a Loyalty feature that represents the proportion of each customer’s life for which they have been a customer. We can do this by dividing Tenure by Age:
Loyalty = Tenure / Age
The intuition here is that people who have been customers for a greater proportion of their lives might be less likely to churn.
# Create Loyalty variable
churn_df['Loyalty'] = churn_df['Tenure'] / churn_df['Age']
churn_df.head()
|   | CreditScore | Geography | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Loyalty |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | 0.047619 |
| 1 | 608 | Spain | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0.024390 |
| 2 | 502 | France | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | 0.190476 |
| 3 | 699 | France | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 | 0.025641 |
| 4 | 850 | Spain | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | 0.046512 |
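As a quick sanity check, the new feature should fall between 0 and 1, since a customer’s tenure with the bank cannot exceed their age:

# Verify the range of the engineered feature
churn_df['Loyalty'].describe()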
Feature transformation
Another step is to transform our features to get them ready for modeling. Different models have different requirements for how the data should be prepared and also different assumptions about their distributions, independence, and so on.
The models we will be building with this data are all classification models, and classification models generally need categorical variables to be encoded. Our dataset has one categorical feature: Geography. Let’s check how many categories appear in the data for this feature.
# Print unique values of Geography col
churn_df['Geography'].unique()
array(['France', 'Spain', 'Germany'], dtype=object)
There are three unique values: France, Spain, and Germany. By encoding this data, it can be represented using Boolean features. We will use the pandas function pd.get_dummies() to do this.

When we call pd.get_dummies() on this feature, it will replace the Geography column with three new Boolean columns, one for each possible category contained in the column being dummied.

When we specify drop_first=True in the function call, it will instead replace Geography with only two columns. We can do this because no information is lost, but the dataset becomes simpler, with one fewer column.

In this case, we end up with two new columns called Geography_Germany and Geography_Spain. We don’t need a Geography_France column, because if a customer’s values in Geography_Germany and Geography_Spain are both 0, we know they are from France!
# Dummy encode categorical variables
churn_df = pd.get_dummies(churn_df, drop_first=True)
churn_df.head()
|   | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Loyalty | Geography_Germany | Geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | 0.047619 | 0 | 0 |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0.024390 | 0 | 1 |
| 2 | 502 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | 0.190476 | 0 | 0 |
| 3 | 699 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 | 0.025641 | 0 | 0 |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | 0.046512 | 0 | 1 |
We can now use our new dataset to build a model.
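As a brief sketch of that next step, the prepared data would typically be split into features and target and then into training and test sets. The decision tree below is only an example model, and scikit-learn is assumed to be available.

# Minimal sketch: split the prepared data and fit a simple classifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = churn_df.drop('Exited', axis=1)   # features
y = churn_df['Exited']                # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))    # accuracy on the held-out test set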
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.