Introduction
This activity continues the project we began by modeling with decision trees for an airline. Here, we will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. The data includes fields such as class, flight distance, and inflight entertainment. Our random forest model will be used to predict whether a customer will be satisfied with their flight experience.
Step 1: Imports
import numpy as np
import pandas as pd
import pickle as pkl
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
air_data = pd.read_csv("Invistico_Airline.csv")
Step 2: Data cleaning
air_data.head(10)
| | satisfaction | Customer Type | Age | Type of Travel | Class | Flight Distance | Seat comfort | Departure/Arrival time convenient | Food and drink | Gate location | … | Online support | Ease of Online booking | On-board service | Leg room service | Baggage handling | Checkin service | Cleanliness | Online boarding | Departure Delay in Minutes | Arrival Delay in Minutes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | satisfied | Loyal Customer | 65 | Personal Travel | Eco | 265 | 0 | 0 | 0 | 2 | … | 2 | 3 | 3 | 0 | 3 | 5 | 3 | 2 | 0 | 0.0 |
1 | satisfied | Loyal Customer | 47 | Personal Travel | Business | 2464 | 0 | 0 | 0 | 3 | … | 2 | 3 | 4 | 4 | 4 | 2 | 3 | 2 | 310 | 305.0 |
2 | satisfied | Loyal Customer | 15 | Personal Travel | Eco | 2138 | 0 | 0 | 0 | 3 | … | 2 | 2 | 3 | 3 | 4 | 4 | 4 | 2 | 0 | 0.0 |
3 | satisfied | Loyal Customer | 60 | Personal Travel | Eco | 623 | 0 | 0 | 0 | 3 | … | 3 | 1 | 1 | 0 | 1 | 4 | 1 | 3 | 0 | 0.0 |
4 | satisfied | Loyal Customer | 70 | Personal Travel | Eco | 354 | 0 | 0 | 0 | 3 | … | 4 | 2 | 2 | 0 | 2 | 4 | 2 | 5 | 0 | 0.0 |
5 | satisfied | Loyal Customer | 30 | Personal Travel | Eco | 1894 | 0 | 0 | 0 | 3 | … | 2 | 2 | 5 | 4 | 5 | 5 | 4 | 2 | 0 | 0.0 |
6 | satisfied | Loyal Customer | 66 | Personal Travel | Eco | 227 | 0 | 0 | 0 | 3 | … | 5 | 5 | 5 | 0 | 5 | 5 | 5 | 3 | 17 | 15.0 |
7 | satisfied | Loyal Customer | 10 | Personal Travel | Eco | 1812 | 0 | 0 | 0 | 3 | … | 2 | 2 | 3 | 3 | 4 | 5 | 4 | 2 | 0 | 0.0 |
8 | satisfied | Loyal Customer | 56 | Personal Travel | Business | 73 | 0 | 0 | 0 | 3 | … | 5 | 4 | 4 | 0 | 1 | 5 | 4 | 4 | 0 | 0.0 |
9 | satisfied | Loyal Customer | 22 | Personal Travel | Eco | 1556 | 0 | 0 | 0 | 3 | … | 2 | 2 | 2 | 4 | 5 | 3 | 4 | 2 | 30 | 26.0 |
10 rows × 22 columns
# Display variable names and types
air_data.dtypes
satisfaction                          object
Customer Type                         object
Age                                   int64
Type of Travel                        object
Class                                 object
Flight Distance                       int64
Seat comfort                          int64
Departure/Arrival time convenient     int64
Food and drink                        int64
Gate location                         int64
Inflight wifi service                 int64
Inflight entertainment                int64
Online support                        int64
Ease of Online booking                int64
On-board service                      int64
Leg room service                      int64
Baggage handling                      int64
Checkin service                       int64
Cleanliness                           int64
Online boarding                       int64
Departure Delay in Minutes            int64
Arrival Delay in Minutes              float64
dtype: object
# Identify the number of rows and the number of columns
air_data.shape
(129880, 22)
# Get the number of rows that contain missing values
air_data.isna().any(axis=1).sum()
393
# Drop missing values
air_data_subset = air_data.dropna(axis=0)
air_data_subset.head(10)
| | satisfaction | Customer Type | Age | Type of Travel | Class | Flight Distance | Seat comfort | Departure/Arrival time convenient | Food and drink | Gate location | … | Online support | Ease of Online booking | On-board service | Leg room service | Baggage handling | Checkin service | Cleanliness | Online boarding | Departure Delay in Minutes | Arrival Delay in Minutes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | satisfied | Loyal Customer | 65 | Personal Travel | Eco | 265 | 0 | 0 | 0 | 2 | … | 2 | 3 | 3 | 0 | 3 | 5 | 3 | 2 | 0 | 0.0 |
1 | satisfied | Loyal Customer | 47 | Personal Travel | Business | 2464 | 0 | 0 | 0 | 3 | … | 2 | 3 | 4 | 4 | 4 | 2 | 3 | 2 | 310 | 305.0 |
2 | satisfied | Loyal Customer | 15 | Personal Travel | Eco | 2138 | 0 | 0 | 0 | 3 | … | 2 | 2 | 3 | 3 | 4 | 4 | 4 | 2 | 0 | 0.0 |
3 | satisfied | Loyal Customer | 60 | Personal Travel | Eco | 623 | 0 | 0 | 0 | 3 | … | 3 | 1 | 1 | 0 | 1 | 4 | 1 | 3 | 0 | 0.0 |
4 | satisfied | Loyal Customer | 70 | Personal Travel | Eco | 354 | 0 | 0 | 0 | 3 | … | 4 | 2 | 2 | 0 | 2 | 4 | 2 | 5 | 0 | 0.0 |
5 | satisfied | Loyal Customer | 30 | Personal Travel | Eco | 1894 | 0 | 0 | 0 | 3 | … | 2 | 2 | 5 | 4 | 5 | 5 | 4 | 2 | 0 | 0.0 |
6 | satisfied | Loyal Customer | 66 | Personal Travel | Eco | 227 | 0 | 0 | 0 | 3 | … | 5 | 5 | 5 | 0 | 5 | 5 | 5 | 3 | 17 | 15.0 |
7 | satisfied | Loyal Customer | 10 | Personal Travel | Eco | 1812 | 0 | 0 | 0 | 3 | … | 2 | 2 | 3 | 3 | 4 | 5 | 4 | 2 | 0 | 0.0 |
8 | satisfied | Loyal Customer | 56 | Personal Travel | Business | 73 | 0 | 0 | 0 | 3 | … | 5 | 4 | 4 | 0 | 1 | 5 | 4 | 4 | 0 | 0.0 |
9 | satisfied | Loyal Customer | 22 | Personal Travel | Eco | 1556 | 0 | 0 | 0 | 3 | … | 2 | 2 | 2 | 4 | 5 | 3 | 4 | 2 | 30 | 26.0 |
10 rows × 22 columns
Next, we confirm that air_data_subset does not contain any missing values.
# Count of missing values
air_data_subset.isna().sum()
satisfaction                          0
Customer Type                         0
Age                                   0
Type of Travel                        0
Class                                 0
Flight Distance                       0
Seat comfort                          0
Departure/Arrival time convenient     0
Food and drink                        0
Gate location                         0
Inflight wifi service                 0
Inflight entertainment                0
Online support                        0
Ease of Online booking                0
On-board service                      0
Leg room service                      0
Baggage handling                      0
Checkin service                       0
Cleanliness                           0
Online boarding                       0
Departure Delay in Minutes            0
Arrival Delay in Minutes              0
dtype: int64
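The two missing-value checks above count different things, which is easy to mix up. Here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the airline file): `isna().any(axis=1).sum()` counts rows with at least one gap, while `isna().sum()` counts gaps per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the airline survey
toy = pd.DataFrame({
    "Age": [65, 47, 15],
    "Departure Delay in Minutes": [0, 310, 0],
    "Arrival Delay in Minutes": [0.0, np.nan, 0.0],
})

rows_with_missing = int(toy.isna().any(axis=1).sum())  # rows with at least one NaN
missing_per_column = toy.isna().sum()                  # NaN count per column

toy_clean = toy.dropna(axis=0)                         # drop incomplete rows
print(rows_with_missing, len(toy), len(toy_clean))
```

By the same logic, dropping the 393 incomplete rows from the 129,880 survey responses should leave 129,487 rows in air_data_subset.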
Next, we’ll convert the categorical features to indicator (one-hot encoded) features.
# Convert categorical features to one-hot encoded features
air_data_subset_dummies = pd.get_dummies(air_data_subset,
columns=['Customer Type','Type of Travel','Class'])
# Display the first 10 rows
air_data_subset_dummies.head(10)
| | satisfaction | Age | Flight Distance | Seat comfort | Departure/Arrival time convenient | Food and drink | Gate location | Inflight wifi service | Inflight entertainment | Online support | … | Online boarding | Departure Delay in Minutes | Arrival Delay in Minutes | Customer Type_Loyal Customer | Customer Type_disloyal Customer | Type of Travel_Business travel | Type of Travel_Personal Travel | Class_Business | Class_Eco | Class_Eco Plus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | satisfied | 65 | 265 | 0 | 0 | 0 | 2 | 2 | 4 | 2 | … | 2 | 0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | satisfied | 47 | 2464 | 0 | 0 | 0 | 3 | 0 | 2 | 2 | … | 2 | 310 | 305.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
2 | satisfied | 15 | 2138 | 0 | 0 | 0 | 3 | 2 | 0 | 2 | … | 2 | 0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | satisfied | 60 | 623 | 0 | 0 | 0 | 3 | 3 | 4 | 3 | … | 3 | 0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
4 | satisfied | 70 | 354 | 0 | 0 | 0 | 3 | 4 | 3 | 4 | … | 5 | 0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
5 | satisfied | 30 | 1894 | 0 | 0 | 0 | 3 | 2 | 0 | 2 | … | 2 | 0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
6 | satisfied | 66 | 227 | 0 | 0 | 0 | 3 | 2 | 5 | 5 | … | 3 | 17 | 15.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
7 | satisfied | 10 | 1812 | 0 | 0 | 0 | 3 | 2 | 0 | 2 | … | 2 | 0 | 0.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
8 | satisfied | 56 | 73 | 0 | 0 | 0 | 3 | 5 | 3 | 5 | … | 4 | 0 | 0.0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
9 | satisfied | 22 | 1556 | 0 | 0 | 0 | 3 | 2 | 0 | 2 | … | 2 | 30 | 26.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
10 rows × 26 columns
Let’s check the variables of air_data_subset_dummies.
# Display variables
air_data_subset_dummies.dtypes
satisfaction                          object
Age                                   int64
Flight Distance                       int64
Seat comfort                          int64
Departure/Arrival time convenient     int64
Food and drink                        int64
Gate location                         int64
Inflight wifi service                 int64
Inflight entertainment                int64
Online support                        int64
Ease of Online booking                int64
On-board service                      int64
Leg room service                      int64
Baggage handling                      int64
Checkin service                       int64
Cleanliness                           int64
Online boarding                       int64
Departure Delay in Minutes            int64
Arrival Delay in Minutes              float64
Customer Type_Loyal Customer          uint8
Customer Type_disloyal Customer       uint8
Type of Travel_Business travel        uint8
Type of Travel_Personal Travel        uint8
Class_Business                        uint8
Class_Eco                             uint8
Class_Eco Plus                        uint8
dtype: object
One-hot encoding replaced each categorical column with one indicator column per category:
- Customer Type –> Customer Type_Loyal Customer and Customer Type_disloyal Customer
- Type of Travel –> Type of Travel_Business travel and Type of Travel_Personal Travel
- Class –> Class_Business, Class_Eco, and Class_Eco Plus
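To see the renaming pattern in isolation, here is a small sketch of `pd.get_dummies` on a hypothetical three-row frame that reuses the Class values from the survey. The `dtype=int` argument is our own choice to keep 0/1 output like the table above; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy rows using category values seen in the survey data
toy = pd.DataFrame({
    "Age": [65, 47, 30],
    "Class": ["Eco", "Business", "Eco Plus"],
})

# dtype=int keeps 0/1 output; recent pandas versions default to booleans
toy_dummies = pd.get_dummies(toy, columns=["Class"], dtype=int)
print(list(toy_dummies.columns))
```

Each row has exactly one 1 across the three new Class_ columns, and the original Class column is gone.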
Step 3: Model building
# Separate the dataset into labels (y) and features (X)
y = air_data_subset_dummies["satisfaction"]
X = air_data_subset_dummies.drop("satisfaction", axis=1)
# Separate into train, validate, test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = 0)
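Because the second split takes 25% of the 75% that remains after the first, the final proportions are roughly 56.25% train, 18.75% validate, and 25% test. A quick sketch that checks this on hypothetical stand-in data (1,600 rows chosen only so the fractions come out exact):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,600 rows, one feature
X = np.arange(1600).reshape(-1, 1)
y = np.arange(1600) % 2

# Same two-stage split as above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

n = len(X)
fractions = {"train": len(X_tr) / n, "validate": len(X_val) / n, "test": len(X_test) / n}
print(fractions)
```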
Tune the model
Now, we’ll fit and tune a random forest model with a separate validation set. We begin by defining the set of hyperparameters that GridSearchCV will search over.
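# Determine set of hyperparameters
# For min_samples_leaf, min_samples_split, and max_samples, floats are fractions
# of the samples and integers are absolute counts
cv_params = {'n_estimators' : [50,100],
             'max_depth' : [10,50],
             'min_samples_leaf' : [0.5,1],
             'min_samples_split' : [0.001, 0.01],
             'max_features' : ["sqrt"],
             'max_samples' : [.5,.9]}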
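Next, we create a list of split indices: 0 marks rows of X_train that belong to the validation fold, and -1 marks rows that are always used for training.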
# Create list of split indices
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)
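To check what PredefinedSplit does with these markers, here is a minimal sketch on hypothetical index labels (six training rows, two of them held out). It also builds the marker array with `Index.isin`, a vectorized alternative to the list comprehension; that swap is our suggestion, not part of the original activity.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import PredefinedSplit

# Hypothetical index labels: six training rows, two of them held out for validation
train_index = pd.Index([10, 11, 12, 13, 14, 15])
val_index = pd.Index([12, 15])

# 0 = row belongs to the validation fold, -1 = row is always used for training
split_index = np.where(train_index.isin(val_index), 0, -1)
custom_split = PredefinedSplit(split_index)

# The single split yields positional indices for the train and validation rows
train_pos, val_pos = next(custom_split.split())
print(split_index.tolist(), train_pos.tolist(), val_pos.tolist())
```

With only one fold marked 0, the search fits each candidate once, which is why the log below reports 1 fold per candidate.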
Now, we instantiate our model.
# Instantiate model
rf = RandomForestClassifier(random_state=0)
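Next, we use GridSearchCV to search over the specified parameters. Note that no scoring argument is passed, so candidates are ranked by the classifier’s default score (accuracy); with a single metric, refit='f1' only tells GridSearchCV to refit the best candidate and does not switch the selection metric to F1.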
# Search over specified parameters
rf_val = GridSearchCV(rf, cv_params, cv=custom_split, refit='f1', n_jobs = -1, verbose = 1)
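If we want the search itself to rank candidates by F1, one option is to pass an explicit scorer. Because the target uses string labels, the scorer needs to know which label counts as positive. The sketch below uses hypothetical synthetic data, a smaller grid, and an assumed negative label name ("dissatisfied"); it is not the configuration that produced the results in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Hypothetical synthetic data with string labels (negative label name is assumed)
X_demo, y_num = make_classification(n_samples=400, n_features=8, random_state=0)
y_demo = np.where(y_num == 1, "satisfied", "dissatisfied")

# Hold out the last 100 rows as the single validation fold
test_fold = np.r_[np.full(300, -1), np.zeros(100, dtype=int)]

scoring = {
    "f1": make_scorer(f1_score, pos_label="satisfied"),
    "accuracy": "accuracy",
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [3, 5]},
    scoring=scoring,
    refit="f1",  # refers to the "f1" entry in `scoring`
    cv=PredefinedSplit(test_fold),
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With a dictionary of scorers, cv_results_ holds one set of columns per metric (for example mean_test_f1 and mean_test_accuracy), so both can be inspected after a single search.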
Now, we fit our model.
%%time
# Fit the model
rf_val.fit(X_train, y_train)
Fitting 1 folds for each of 32 candidates, totaling 32 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 32 out of 32 | elapsed: 40.8s finished
CPU times: user 4.99 s, sys: 87.7 ms, total: 5.08 s
Wall time: 45.5 s
GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ..., -1, -1])), error_score=nan, estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weig... n_estimators=100, n_jobs=None, oob_score=False, random_state=0, verbose=0, warm_start=False), iid='deprecated', n_jobs=-1, param_grid={'max_depth': [10, 50], 'max_features': ['sqrt'], 'max_samples': [0.5, 0.9], 'min_samples_leaf': [0.5, 1], 'min_samples_split': [0.001, 0.01], 'n_estimators': [50, 100]}, pre_dispatch='2*n_jobs', refit='f1', return_train_score=False, scoring=None, verbose=1)
Finally, we obtain the optimal parameters.
# Obtain optimal parameters
rf_val.best_params_
{'max_depth': 50, 'max_features': 'sqrt', 'max_samples': 0.9, 'min_samples_leaf': 1, 'min_samples_split': 0.001, 'n_estimators': 50}
Step 4: Results and evaluation
Now we use the selected model to predict on our test data. We instantiate a new model with the optimal parameters found via GridSearchCV.
# Instantiate the model with the optimal parameters found by GridSearchCV
rf_opt = RandomForestClassifier(n_estimators = 50, max_depth = 50,
min_samples_leaf = 1, min_samples_split = 0.001,
max_features="sqrt", max_samples = 0.9, random_state = 0)
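Retyping the values by hand works, but it is easy to introduce a typo. As a small sketch of an alternative, the dictionary reported by best_params_ can be unpacked directly into the constructor. The name `rf_from_params` is hypothetical; the values are copied from the output above.

```python
from sklearn.ensemble import RandomForestClassifier

# Best parameters as reported by rf_val.best_params_ above
best_params = {
    "max_depth": 50,
    "max_features": "sqrt",
    "max_samples": 0.9,
    "min_samples_leaf": 1,
    "min_samples_split": 0.001,
    "n_estimators": 50,
}

# Unpack the dictionary instead of retyping each value
rf_from_params = RandomForestClassifier(**best_params, random_state=0)
print(rf_from_params.get_params()["max_depth"])
```

In the notebook itself this could be written as RandomForestClassifier(**rf_val.best_params_, random_state=0). Since refit was enabled in the search, rf_val.best_estimator_ should also already hold the best candidate refit on X_train.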
Once again, we fit the optimal model.
# Fit the optimal model
rf_opt.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=50, max_features='sqrt', max_leaf_nodes=None, max_samples=0.9, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=0.001, min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None, oob_score=False, random_state=0, verbose=0, warm_start=False)
And we predict on the test set using the optimal model.
# Predict on test set
y_pred = rf_opt.predict(X_test)
Obtain performance scores
First, we get our precision score.
# Get precision score
pc_test = precision_score(y_test, y_pred, pos_label = "satisfied")
print("The precision score is {pc:.3f}".format(pc = pc_test))
The precision score is 0.950
Then, we collect the recall score.
# Get recall score
rc_test = recall_score(y_test, y_pred, pos_label = "satisfied")
print("The recall score is {rc:.3f}".format(rc = rc_test))
The recall score is 0.945
Next, we obtain our accuracy score.
# Get accuracy score
ac_test = accuracy_score(y_test, y_pred)
print("The accuracy score is {ac:.3f}".format(ac = ac_test))
The accuracy score is 0.942
Finally, we collect our F1-score.
# Get F1 score
f1_test = f1_score(y_test, y_pred, pos_label = "satisfied")
print("The F1 score is {f1:.3f}".format(f1 = f1_test))
The F1 score is 0.947
Question: What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?
Pros:
- The coding workload is reduced.
- The scripts for data splitting are shorter.
- It’s only necessary to evaluate test dataset performance once, instead of two evaluations (validate and test).
Cons:
- If a model is evaluated using samples that were also used to build or fine-tune that model, it likely will provide a biased evaluation.
- Tuning hyperparameters against test-set scores can overfit the model to the test data, so the reported test performance may not generalize to new data.
Evaluate the model
Now that we have results, let’s evaluate the model by interpreting each score: precision, recall, accuracy, and F1.
# Precision score on test data set
print("\nThe precision score is: {pc:.3f}".format(pc = pc_test), "for the test set,", "\nwhich means that of all positive predictions,", "{pc_pct:.1f}% are true positives.".format(pc_pct = pc_test * 100))
The precision score is: 0.950 for the test set,
which means that of all positive predictions, 95.0% are true positives.
# Recall score on test data set
print("\nThe recall score is: {rc:.3f}".format(rc = rc_test), "for the test set,", "\nwhich means that of all actual positive cases in the test set,", "{rc_pct:.1f}% are predicted positive.".format(rc_pct = rc_test * 100))
The recall score is: 0.945 for the test set,
which means that of all actual positive cases in the test set, 94.5% are predicted positive.
# Accuracy score on test data set
print("\nThe accuracy score is: {ac:.3f}".format(ac = ac_test), "for the test set,", "\nwhich means that of all cases in the test set,", "{ac_pct:.1f}% are classified correctly (true positives or true negatives).".format(ac_pct = ac_test * 100))
The accuracy score is: 0.942 for the test set,
which means that of all cases in the test set, 94.2% are classified correctly (true positives or true negatives).
# F1 score on test data set
print("\nThe F1 score is: {f1:.3f}".format(f1 = f1_test), "for the test set,", "\nwhich means the harmonic mean of precision and recall on the test set is {f1_pct:.1f}%.".format(f1_pct = f1_test * 100))
The F1 score is: 0.947 for the test set,
which means the harmonic mean of precision and recall on the test set is 94.7%.
The model performs well according to all four performance metrics. Its precision score is slightly higher than the other three.
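To connect these interpretations to the underlying counts, here is a minimal sketch that derives all four scores from a confusion matrix and compares two of them with scikit-learn's own functions. The eight labels are hypothetical, and the negative label name ("dissatisfied") is an assumption.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, not taken from the airline data
y_true = ["satisfied"] * 4 + ["dissatisfied"] * 4
y_hat = ["satisfied", "satisfied", "satisfied", "dissatisfied",
         "satisfied", "satisfied", "dissatisfied", "dissatisfied"]

# With labels=[negative, positive], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(
    y_true, y_hat, labels=["dissatisfied", "satisfied"]).ravel()

precision = tp / (tp + fp)                   # share of positive predictions that are right
recall = tp / (tp + fn)                      # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all cases classified correctly
f1 = 2 * precision * recall / (precision + recall)

# Cross-check against scikit-learn
sk_precision = precision_score(y_true, y_hat, pos_label="satisfied")
sk_recall = recall_score(y_true, y_hat, pos_label="satisfied")
print(precision, recall, accuracy, f1)
```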
Finally, we create a table of results to compare the tuned random forest with the tuned decision tree from the earlier activity, whose scores are entered by hand.
# Create table of results
table = pd.DataFrame({'Model': ["Tuned Decision Tree", "Tuned Random Forest"],
'F1': [0.945422, f1_test],
'Recall': [0.935863, rc_test],
'Precision': [0.955197, pc_test],
'Accuracy': [0.940864, ac_test]
}
)
table
| | Model | F1 | Recall | Precision | Accuracy |
---|---|---|---|---|---|
0 | Tuned Decision Tree | 0.945422 | 0.935863 | 0.955197 | 0.940864 |
1 | Tuned Random Forest | 0.947306 | 0.944501 | 0.950128 | 0.942450 |
The tuned random forest has higher F1, recall, and accuracy scores than the tuned decision tree, while the decision tree has slightly higher precision (0.955 vs. 0.950). Overall, the random forest is the better model. In particular, its higher F1 score indicates that it may do better at classification when both false positives and false negatives are taken into account.
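As a quick sanity check on the table, the random forest's F1 score can be recomputed from its precision and recall, since F1 is their harmonic mean. The two inputs below are copied from the table; the variable names are ours.

```python
# Precision and recall of the tuned random forest, copied from the table above
precision_rf = 0.950128
recall_rf = 0.944501

# F1 is the harmonic mean of precision and recall
f1_rf = 2 * precision_rf * recall_rf / (precision_rf + recall_rf)
print(round(f1_rf, 4))
```

The result agrees with the F1 value in the table to within the rounding of the inputs.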
Considerations
What summary could we provide to stakeholders?
- The random forest model predicted satisfaction with about 94.2% accuracy on the test set. Its precision is about 95.0% and its recall about 94.5%.
- The random forest model outperformed the tuned decision tree on three of the four scores (F1, recall, and accuracy); the decision tree had slightly higher precision. This indicates that the random forest model may perform better overall.
- Because stakeholders were interested in learning which factors matter most to customer satisfaction, we would share those factors as identified by the tuned random forest.
- In addition, we would provide details about the precision, recall, accuracy, and F1 scores to support our findings.
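The activity does not show how the most important factors would be pulled from the model. One common option, sketched here under our own assumptions, is the fitted forest's impurity-based `feature_importances_` attribute. The data and column names below are synthetic placeholders, not the airline features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data; the column names are placeholders
X_arr, y_demo = make_classification(n_samples=300, n_features=5,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, one per column, sorted from most to least important
importances = (pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
               .sort_values(ascending=False))
print(importances)
```

Applied to this post's model, the same pattern would use rf_opt.feature_importances_ with X_train.columns as the index. Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking.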
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.