Build a random forest model in Python

Introduction

This activity continues the project in which we modeled airline customer satisfaction with decision trees. Here, we will train, tune, and evaluate a random forest model using data from a spreadsheet of survey responses from 129,880 customers. The data includes fields such as class, flight distance, and inflight entertainment. Our random forest model will be used to predict whether a customer will be satisfied with their flight experience.

Step 1: Imports

import numpy as np
import pandas as pd

import pickle as pkl  # for optionally saving the fitted model to disk
 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, PredefinedSplit, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

# Load the dataset into a DataFrame
air_data = pd.read_csv("Invistico_Airline.csv")

Step 2: Data cleaning

# Display the first 10 rows
air_data.head(10)
[Output: first 10 rows of air_data (10 rows × 22 columns), showing satisfaction, Customer Type, Age, Type of Travel, Class, Flight Distance, the survey rating columns, and the departure/arrival delay columns]
# Display variable names and types
air_data.dtypes
satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object
# Identify the number of rows and the number of columns
air_data.shape
(129880, 22)
# Get the number of rows that contain missing values
air_data.isna().any(axis=1).sum()
393

These 393 rows are only about 0.3% of the 129,880 total, so dropping them is a reasonable choice.

# Drop missing values
air_data_subset = air_data.dropna(axis=0)
air_data_subset.head(10)
[Output: first 10 rows of air_data_subset (10 rows × 22 columns), with the same structure as above]

Next, we confirm that the subset does not contain any missing values.

# Count of missing values
air_data_subset.isna().sum()
satisfaction                         0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
dtype: int64

Next, we’ll convert the categorical features to indicator (one-hot encoded) features.

# Convert categorical features to one-hot encoded features
air_data_subset_dummies = pd.get_dummies(air_data_subset, 
                                         columns=['Customer Type','Type of Travel','Class'])
# Display the first 10 rows
air_data_subset_dummies.head(10)
[Output: first 10 rows of air_data_subset_dummies (10 rows × 26 columns); Customer Type, Type of Travel, and Class now appear as one-hot indicator columns]

Let’s check the variables of air_data_subset_dummies.

# Display variable names and types
air_data_subset_dummies.dtypes
satisfaction                          object
Age                                    int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
Customer Type_Loyal Customer           uint8
Customer Type_disloyal Customer        uint8
Type of Travel_Business travel         uint8
Type of Travel_Personal Travel         uint8
Class_Business                         uint8
Class_Eco                              uint8
Class_Eco Plus                         uint8
dtype: object

The following changes can be observed:

  • Customer Type → Customer Type_Loyal Customer and Customer Type_disloyal Customer
  • Type of Travel → Type of Travel_Business travel and Type of Travel_Personal Travel
  • Class → Class_Business, Class_Eco, and Class_Eco Plus
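
As a quick check (an optional snippet of ours, not part of the original notebook), we can list the indicator columns that get_dummies created:

# List the new one-hot encoded columns
new_cols = [c for c in air_data_subset_dummies.columns
            if c not in air_data_subset.columns]
print(new_cols)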

Step 3: Model building

# Separate the dataset into labels (y) and features (X)
y = air_data_subset_dummies["satisfaction"]
X = air_data_subset_dummies.drop("satisfaction", axis=1)
# Separate into train, validate, and test sets: 75/25, then 75/25 again,
# giving 56.25% train, 18.75% validation, and 25% test overall
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state = 0)

Tune the model

Now, we’ll fit and tune a random forest model with a separate validation set. We begin by determining a set of hyperparameters for tuning the model using GridSearchCV.

# Determine set of hyperparameters
cv_params = {'n_estimators' : [50,100], 
              'max_depth' : [10,50],        
              'min_samples_leaf' : [0.5,1], 
              'min_samples_split' : [0.001, 0.01],
              'max_features' : ["sqrt"], 
              'max_samples' : [.5,.9]}

This grid defines 2 × 2 × 2 × 2 × 1 × 2 = 32 candidate models, which matches the "32 candidates" reported during fitting below.

Next, we create a list of split indices.

# Create list of split indices
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
custom_split = PredefinedSplit(split_index)
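
In this list, rows marked 0 form the single validation fold and rows marked -1 are used only for training. As a quick optional check (our addition, not in the original notebook), we can confirm the fold matches our validation set:

# Optional sanity check: the single test fold should contain exactly the validation rows
train_idx, val_idx = next(custom_split.split())
print(len(val_idx) == len(X_val))  # expected: True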

Now, we instantiate our model.

# Instantiate model
rf = RandomForestClassifier(random_state=0)

Next, we use GridSearchCV to search over the specified parameters. Note that no scoring argument is passed, so candidates are scored with the estimator's default metric (accuracy); with a single metric, refit='f1' is simply truthy and behaves like refit=True.

# Search over specified parameters
rf_val = GridSearchCV(rf, cv_params, cv=custom_split, refit='f1', n_jobs = -1, verbose = 1)
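
If we instead wanted the search to select candidates by F1, one approach (a sketch, not what was run above; f1_scorer and rf_val_f1 are our names) is to pass an explicit scorer, since the labels here are strings rather than 0/1:

from sklearn.metrics import make_scorer

# Build an F1 scorer that treats "satisfied" as the positive class
f1_scorer = make_scorer(f1_score, pos_label='satisfied')
rf_val_f1 = GridSearchCV(rf, cv_params, cv=custom_split,
                         scoring=f1_scorer, refit=True, n_jobs=-1, verbose=1)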

Now, we fit our model.

%%time

# Fit the model
rf_val.fit(X_train, y_train)
Fitting 1 folds for each of 32 candidates, totaling 32 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 32 out of 32 | elapsed: 40.8s finished
CPU times: user 4.99 s, sys: 87.7 ms, total: 5.08 s
Wall time: 45.5 s
GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ..., -1, -1])),
             error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weig...
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=0,
                                              verbose=0, warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'max_depth': [10, 50], 'max_features': ['sqrt'],
                         'max_samples': [0.5, 0.9],
                         'min_samples_leaf': [0.5, 1],
                         'min_samples_split': [0.001, 0.01],
                         'n_estimators': [50, 100]},
             pre_dispatch='2*n_jobs', refit='f1', return_train_score=False,
             scoring=None, verbose=1)

Finally, we obtain the optimal parameters.

# Obtain optimal parameters
rf_val.best_params_
{'max_depth': 50,
 'max_features': 'sqrt',
 'max_samples': 0.9,
 'min_samples_leaf': 1,
 'min_samples_split': 0.001,
 'n_estimators': 50}
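
Besides best_params_, GridSearchCV exposes the best validation score and the refit estimator; the latter could be reused directly instead of re-instantiating the model below (best_rf is our name for it):

# Best score on the validation fold (accuracy here, since scoring was left at its default)
print(rf_val.best_score_)

# The estimator refit on all of X_train with the best parameters
best_rf = rf_val.best_estimator_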

Step 4: Results and evaluation

Now we build a model with the optimal parameters found via GridSearchCV and use it to predict on our test data.

# Use optimal parameters from GridSearchCV
rf_opt = RandomForestClassifier(n_estimators = 50, max_depth = 50, 
                                min_samples_leaf = 1, min_samples_split = 0.001,
                                max_features="sqrt", max_samples = 0.9, random_state = 0)

Once again, we fit the optimal model.

# Fit the optimal model
rf_opt.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=50, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=0.9,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=0.001,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
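
Since pickle was imported in Step 1, this is also a natural point to save the fitted model for reuse; the file name below is our choice:

# Save the fitted model to disk
with open('rf_opt_model.pickle', 'wb') as to_write:
    pkl.dump(rf_opt, to_write)

# It can later be reloaded with:
# with open('rf_opt_model.pickle', 'rb') as to_read:
#     rf_opt = pkl.load(to_read)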

And we predict on the test set using the optimal model.

# Predict on test set
y_pred = rf_opt.predict(X_test)

Obtain performance scores

First, we get our precision score.

# Get precision score
pc_test = precision_score(y_test, y_pred, pos_label = "satisfied")
print("The precision score is {pc:.3f}".format(pc = pc_test))
The precision score is 0.950

Then, we collect the recall score.

# Get recall score
rc_test = recall_score(y_test, y_pred, pos_label = "satisfied")
print("The recall score is {rc:.3f}".format(rc = rc_test))
The recall score is 0.945

Next, we obtain our accuracy score.

# Get accuracy score
ac_test = accuracy_score(y_test, y_pred)
print("The accuracy score is {ac:.3f}".format(ac = ac_test))
The accuracy score is 0.942

Finally, we collect our F1-score.

# Get F1 score
f1_test = f1_score(y_test, y_pred, pos_label = "satisfied")
print("The F1 score is {f1:.3f}".format(f1 = f1_test))
The F1 score is 0.947
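
To see the raw counts behind these four scores, one optional addition (not part of the original analysis) is a confusion matrix:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes, ordered as in rf_opt.classes_
cm = confusion_matrix(y_test, y_pred, labels=rf_opt.classes_)
print(rf_opt.classes_)
print(cm)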

Question: What are the pros and cons of performing the model selection using test data instead of a separate validation dataset?

Pros:

  • The coding workload is reduced.
  • The scripts for data splitting are shorter.
  • Test set performance needs to be evaluated only once, instead of twice (validation and test).

Cons:

  • If a model is evaluated on samples that were also used to build or fine-tune it, the evaluation is likely biased.
  • Tuning a model on the test data risks overfitting to that data, so the reported scores would likely overestimate performance on truly unseen data.

Evaluate the model

Now that we have results, let’s evaluate the model. We restate the four scores, precision, recall, accuracy, and F1, with an interpretation of each.

# Precision score on the test data set
print("\nThe precision score is: {pc:.3f}".format(pc = pc_test), "for the test set,", "\nwhich means of all positive predictions,", "{pc_pct:.1f}% are true positives.".format(pc_pct = pc_test * 100))
The precision score is: 0.950 for the test set, 
which means of all positive predictions, 95.0% are true positives.
# Recall score on the test data set
print("\nThe recall score is: {rc:.3f}".format(rc = rc_test), "for the test set,", "\nwhich means of all real positive cases in the test set,", "{rc_pct:.1f}% are predicted positive.".format(rc_pct = rc_test * 100))
The recall score is: 0.945 for the test set, 
which means of all real positive cases in the test set, 94.5% are predicted positive.
# Accuracy score on the test data set
print("\nThe accuracy score is: {ac:.3f}".format(ac = ac_test), "for the test set,", "\nwhich means of all cases in the test set,", "{ac_pct:.1f}% are predicted correctly (true positives or true negatives).".format(ac_pct = ac_test * 100))
The accuracy score is: 0.942 for the test set, 
which means of all cases in the test set, 94.2% are predicted correctly (true positives or true negatives).
# F1 score on the test data set
print("\nThe F1 score is: {f1:.3f}".format(f1 = f1_test), "for the test set,", "\nwhich means the harmonic mean of precision and recall is {f1_pct:.1f}%.".format(f1_pct = f1_test * 100))
The F1 score is: 0.947 for the test set, 
which means the harmonic mean of precision and recall is 94.7%.

The model performs well according to all four performance metrics. The model’s precision score is slightly higher than the other three metrics.

Compare model results

Finally, we create a table of results that we can use to compare the performance of the two models.

# Create table of results
table = pd.DataFrame({'Model': ["Tuned Decision Tree", "Tuned Random Forest"],
                        'F1':  [0.945422, f1_test],
                        'Recall': [0.935863, rc_test],
                        'Precision': [0.955197, pc_test],
                        'Accuracy': [0.940864, ac_test]
                      }
                    )
table
Model                  F1        Recall    Precision  Accuracy
Tuned Decision Tree    0.945422  0.935863  0.955197   0.940864
Tuned Random Forest    0.947306  0.944501  0.950128   0.942450

The tuned random forest has higher scores overall, so it is the better model. In particular, it shows a better F1 score than the decision tree model, which indicates that the random forest may do better at classification when both false positives and false negatives are taken into account.

Considerations

What summary could we provide to stakeholders?

  • The random forest model predicted satisfaction with more than 94.2% accuracy. Its precision is over 95% and its recall is approximately 94.5%.
  • The random forest model outperformed the tuned decision tree with the best hyperparameters on most of the four scores, indicating that the random forest model may perform better overall.
  • Because stakeholders were interested in learning which factors matter most to customer satisfaction, those insights should be drawn from the tuned random forest (a sketch of how to extract them follows this list).
  • In addition, we would provide the precision, recall, accuracy, and F1 scores to support our findings.
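
As a sketch of how those factors could be extracted from the tuned model (the actual ranking would come from running this on the fitted model):

# Rank features by impurity-based importance in the tuned random forest
importances = pd.Series(rf_opt.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))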

Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.