Build a random forest model in Python

Introduction

This activity continues the project in which we modeled airline customer satisfaction with decision trees. Here, we will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. The data includes fields such as class, flight distance, and inflight entertainment. Our random forest model will be used to predict whether a customer will be satisfied with their flight experience.

Step 1: Imports

import numpy as np
import pandas as pd

import pickle as pkl  # for optionally saving the fitted model to disk
 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

# Load the dataset into a DataFrame
air_data = pd.read_csv("Invistico_Airline.csv")

Step 2: Data cleaning

# Display the first 10 rows
air_data.head(10)
[Output: first 10 rows of air_data (10 rows × 22 columns), showing satisfaction, Customer Type, Age, Type of Travel, Class, Flight Distance, the survey rating columns, and the departure/arrival delay columns]
# Display variable names and types
air_data.dtypes
satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object
# Identify the number of rows and the number of columns
air_data.shape
(129880, 22)
# Get the number of rows that contain missing values
air_data.isna().any(axis=1).sum()
393

These 393 rows are only about 0.3% of the 129,880 total, so dropping them is a reasonable choice.

# Drop missing values
air_data_subset = air_data.dropna(axis=0)
air_data_subset.head(10)
[Output: first 10 rows of air_data_subset (10 rows × 22 columns), with the same structure as above]

Next, we confirm that the subset does not contain any missing values.

# Count of missing values
air_data_subset.isna().sum()
satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64

Next, we’ll convert the categorical features to indicator (one-hot encoded) features.

# Convert categorical features to one-hot encoded features
air_data_subset_dummies = pd.get_dummies(air_data_subset, 
                                         columns=['Customer Type','Type of Travel','Class'])
# Display the first 10 rows
air_data_subset_dummies.head(10)
[Output: first 10 rows of air_data_subset_dummies (10 rows × 26 columns); Customer Type, Type of Travel, and Class now appear as one-hot indicator columns]

Let’s check the variables of air_data_subset_dummies.

# Display variable names and types
air_data_subset_dummies.dtypes
satisfaction                          object
Age                                    int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
Customer Type_Loyal Customer           uint8
Customer Type_disloyal Customer        uint8
Type of Travel_Business travel         uint8
Type of Travel_Personal Travel         uint8
Class_Business                         uint8
Class_Eco                              uint8
Class_Eco Plus                         uint8
dtype: object

The following changes can be observed:

  • Customer Type → Customer Type_Loyal Customer and Customer Type_disloyal Customer
  • Type of Travel → Type of Travel_Business travel and Type of Travel_Personal Travel
  • Class → Class_Business, Class_Eco, and Class_Eco Plus
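
As a quick check (an optional snippet of ours, not part of the original notebook), we can list the indicator columns that get_dummies created:

# List the new one-hot encoded columns
new_cols = [c for c in air_data_subset_dummies.columns
            if c not in air_data_subset.columns]
print(new_cols)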

Step 3: Model building

# Separate the dataset into labels (y) and features (X)
y = air_data_subset_dummies["satisfaction"]
X = air_data_subset_dummies.drop("satisfaction", axis=1)
# Separate into train, validate, and test sets: 75/25, then 75/25 again,
# giving 56.25% train, 18.75% validation, and 25% test overall
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = 0)

Tune the model

Now, we’ll fit and tune a random forest model with a separate validation set. We begin by determining a set of hyperparameters for tuning the model using GridSearchCV.

# Determine set of hyperparameters
cv_params = {'n_estimators' : [50,100], 
              'max_depth' : [10,50],        
              'min_samples_leaf' : [0.5,1], 
              'min_samples_split' : [0.001, 0.01],
              'max_features' : ["sqrt"], 
              'max_samples' : [.5,.9]}

This grid defines 2 × 2 × 2 × 2 × 1 × 2 = 32 candidate models, which matches the "32 candidates" reported during fitting below.

Next, we create a list of split indices.

# Create list of split indices
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)
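
In this list, rows marked 0 form the single validation fold and rows marked -1 are used only for training. As a quick optional check (our addition, not in the original notebook), we can confirm the fold matches our validation set:

# Optional sanity check: the single test fold should contain exactly the validation rows
train_idx, val_idx = next(custom_split.split())
print(len(val_idx) == len(X_val))  # expected: True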

Now, we instantiate our model.

# Instantiate model
rf = RandomForestClassifier(random_state=0)

Next, we use GridSearchCV to search over the specified parameters. Note that no scoring argument is passed, so candidates are scored with the estimator's default metric (accuracy); with a single metric, refit='f1' is simply truthy and behaves like refit=True.

# Search over specified parameters
rf_val = GridSearchCV(rf, cv_params, cv=custom_split, refit='f1', n_jobs = -1, verbose = 1)
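
If we instead wanted the search to select candidates by F1, one approach (a sketch, not what was run above; f1_scorer and rf_val_f1 are our names) is to pass an explicit scorer, since the labels here are strings rather than 0/1:

from sklearn.metrics import make_scorer

# Build an F1 scorer that treats "satisfied" as the positive class
f1_scorer = make_scorer(f1_score, pos_label='satisfied')
rf_val_f1 = GridSearchCV(rf, cv_params, cv=custom_split,
                         scoring=f1_scorer, refit=True, n_jobs=-1, verbose=1)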

Now, we fit our model.

%%time

# Fit the model
rf_val.fit(X_train, y_train)
Fitting 1 folds for each of 32 candidates, totaling 32 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 32 out of 32 | elapsed: 40.8s finished
CPU times: user 4.99 s, sys: 87.7 ms, total: 5.08 s
Wall time: 45.5 s
GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ..., -1, -1])),
             error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weig...
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=0,
                                              verbose=0, warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'max_depth': [10, 50], 'max_features': ['sqrt'],
                         'max_samples': [0.5, 0.9],
                         'min_samples_leaf': [0.5, 1],
                         'min_samples_split': [0.001, 0.01],
                         'n_estimators': [50, 100]},
             pre_dispatch='2*n_jobs', refit='f1', return_train_score=False,
             scoring=None, verbose=1)

Finally, we obtain the optimal parameters.

# Obtain optimal parameters
rf_val.best_params_
{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.9,
 'min_samples_leaf': 1,
 'min_samples_split': 0.001,
 'n_estimators': 50}
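
Besides best_params_, GridSearchCV exposes the best validation score and the refit estimator; the latter could be reused directly instead of re-instantiating the model below (best_rf is our name for it):

# Best score on the validation fold (accuracy here, since scoring was left at its default)
print(rf_val.best_score_)

# The estimator refit on all of X_train with the best parameters
best_rf = rf_val.best_estimator_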

Step 4: Results and evaluation

Now we build a model with the optimal parameters found via GridSearchCV and use it to predict on our test data.

# Use optimal parameters from GridSearchCV
rf_opt = RandomForestClassifier(n_estimators = 50, max_depth = 50, 
                                min_samples_leaf = 1, min_samples_split = 0.001,
                                max_features="sqrt", max_samples = 0.9, random_state = 0)

Once again, we fit the optimal model.

# Fit the optimal model
rf_opt.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=50, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=0.9,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=0.001,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
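
Since pickle was imported in Step 1, this is also a natural point to save the fitted model for reuse; the file name below is our choice:

# Save the fitted model to disk
with open('rf_opt_model.pickle', 'wb') as to_write:
    pkl.dump(rf_opt, to_write)

# It can later be reloaded with:
# with open('rf_opt_model.pickle', 'rb') as to_read:
#     rf_opt = pkl.load(to_read)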

And we predict on the test set using the optimal model.

# Predict on test set
y_pred = rf_opt.predict(X_test)

Obtain performance scores

First, we get our precision score.

# Get precision score
pc_test = precision_score(y_test, y_pred, pos_label = "satisfied")
print("The precision score is {pc:.3f}".format(pc = pc_test))
The precision score is 0.950

Then, we collect the recall score.

# Get recall score
rc_test = recall_score(y_test, y_pred, pos_label = "satisfied")
print("The recall score is {rc:.3f}".format(rc = rc_test))
The recall score is 0.945

Next, we obtain our accuracy score.

# Get accuracy score
ac_test = accuracy_score(y_test, y_pred)
print("The accuracy score is {ac:.3f}".format(ac = ac_test))
The accuracy score is 0.942

Finally, we collect our F1-score.

# Get F1 score
f1_test = f1_score(y_test, y_pred, pos_label = "satisfied")
print("The F1 score is {f1:.3f}".format(f1 = f1_test))
The F1 score is 0.947
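
To see the raw counts behind these four scores, one optional addition (not part of the original analysis) is a confusion matrix:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes, ordered as in rf_opt.classes_
cm = confusion_matrix(y_test, y_pred, labels=rf_opt.classes_)
print(rf_opt.classes_)
print(cm)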

Question: What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?

Pros:

  • The coding workload is reduced.
  • The scripts for data splitting are shorter.
  • Test set performance needs to be evaluated only once, instead of twice (validation and test).

Cons:

  • If a model is evaluated on samples that were also used to build or fine-tune it, the evaluation is likely biased.
  • Tuning a model on the test data risks overfitting to that data, so the reported scores would likely overestimate performance on truly unseen data.

Evaluate the model

Now that we have results, let’s evaluate the model. We restate the four scores, precision, recall, accuracy, and F1, with an interpretation of each.

# Precision score on the test data set
print("\nThe precision score is: {pc:.3f}".format(pc = pc_test), "for the test set,", "\nwhich means of all positive predictions,", "{pc_pct:.1f}% are true positives.".format(pc_pct = pc_test * 100))
The precision score is: 0.950 for the test set, 
which means of all positive predictions, 95.0% are true positives.
# Recall score on the test data set
print("\nThe recall score is: {rc:.3f}".format(rc = rc_test), "for the test set,", "\nwhich means of all real positive cases in the test set,", "{rc_pct:.1f}% are predicted positive.".format(rc_pct = rc_test * 100))
The recall score is: 0.945 for the test set, 
which means of all real positive cases in the test set, 94.5% are predicted positive.
# Accuracy score on the test data set
print("\nThe accuracy score is: {ac:.3f}".format(ac = ac_test), "for the test set,", "\nwhich means of all cases in the test set,", "{ac_pct:.1f}% are predicted correctly (true positives or true negatives).".format(ac_pct = ac_test * 100))
The accuracy score is: 0.942 for the test set, 
which means of all cases in the test set, 94.2% are predicted correctly (true positives or true negatives).
# F1 score on the test data set
print("\nThe F1 score is: {f1:.3f}".format(f1 = f1_test), "for the test set,", "\nwhich means the harmonic mean of precision and recall is {f1_pct:.1f}%.".format(f1_pct = f1_test * 100))
The F1 score is: 0.947 for the test set, 
which means the harmonic mean of precision and recall is 94.7%.

The model performs well according to all four performance metrics. The model’s precision score is slightly higher than the other three metrics.

Compare model results

Finally, we create a table of results that we can use to compare the performance of the two models.

# Create table of results
table = pd.DataFrame({'Model': ["Tuned Decision Tree", "Tuned Random Forest"],
                        'F1':  [0.945422, f1_test],
                        'Recall': [0.935863, rc_test],
                        'Precision': [0.955197, pc_test],
                        'Accuracy': [0.940864, ac_test]
                      }
                    )
table
Model                  F1        Recall    Precision  Accuracy
Tuned Decision Tree    0.945422  0.935863  0.955197   0.940864
Tuned Random Forest    0.947306  0.944501  0.950128   0.942450

The tuned random forest has higher scores overall, so it is the better model. In particular, it shows a better F1 score than the decision tree model, which indicates that the random forest may do better at classification when both false positives and false negatives are taken into account.

Considerations

What summary could we provide to stakeholders?

  • The random forest model predicted satisfaction with more than 94.2% accuracy. Its precision is over 95% and its recall is approximately 94.5%.
  • The random forest model outperformed the tuned decision tree with the best hyperparameters on most of the four scores, indicating that the random forest model may perform better overall.
  • Because stakeholders were interested in learning which factors matter most to customer satisfaction, those insights should be drawn from the tuned random forest (a sketch of how to extract them follows this list).
  • In addition, we would provide the precision, recall, accuracy, and F1 scores to support our findings.
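
As a sketch of how those factors could be extracted from the tuned model (the actual ranking would come from running this on the fitted model):

# Rank features by impurity-based importance in the tuned random forest
importances = pd.Series(rf_opt.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))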

Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.