Perform logistic regression in Python

Introduction

Here we’ll assume, we work as a consultant for an airline. The airline is interested in knowing if a better in-flight entertainment experience leads to higher customer satisfaction. They would like us to construct and evaluate a model that predicts whether a future customer would be satisfied with their services given previous customer feedback about their flight experience.

The data for this activity is for a sample size of 129,880 customers. It includes data points such as class, flight distance, and in-flight entertainment, among others. Our goal will be to utilize a binomial logistic regression model to help the airline model and better understand this data.

Imports & Load

# Standard operational package imports
import numpy as np
import pandas as pd

# Important imports for preprocessing, modeling, and evaluation
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics

# Visualization package imports
import matplotlib.pyplot as plt
import seaborn as sns
df_original = pd.read_csv("Invistico_Airline.csv")
df_original.head(n = 10)
satisfactionCustomer TypeAgeType of TravelClassFlight DistanceSeat comfortDeparture/Arrival time convenientFood and drinkGate locationOnline supportEase of Online bookingOn-board serviceLeg room serviceBaggage handlingCheckin serviceCleanlinessOnline boardingDeparture Delay in MinutesArrival Delay in Minutes
0satisfiedLoyal Customer65Personal TravelEco26500022330353200.0
1satisfiedLoyal Customer47Personal TravelBusiness2464000323444232310305.0
2satisfiedLoyal Customer15Personal TravelEco213800032233444200.0
3satisfiedLoyal Customer60Personal TravelEco62300033110141300.0
4satisfiedLoyal Customer70Personal TravelEco35400034220242500.0
5satisfiedLoyal Customer30Personal TravelEco189400032254554200.0
6satisfiedLoyal Customer66Personal TravelEco2270003555055531715.0
7satisfiedLoyal Customer10Personal TravelEco181200032233454200.0
8satisfiedLoyal Customer56Personal TravelBusiness7300035440154400.0
9satisfiedLoyal Customer22Personal TravelEco15560003222453423026.0
10 rows × 22 columns

Data exploration, data cleaning, and model preparation

Remember that logistic regression models expect numeric data.

df_original.dtypes
satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object
df_original['satisfaction'].value_counts(dropna = False)
satisfied       71087
dissatisfied    58793
Name: satisfaction, dtype: int64

54.7 percent (71,087/129,880) of customers were satisfied. While this is a simple calculation, this value can be compared to a logistic regression model’s accuracy.

Check for missing values

An assumption of logistic regression models is that there are no missing values.

df_original.isnull().sum()

Out[6]:

satisfaction                           0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Seat comfort                           0
Departure/Arrival time convenient      0
Food and drink                         0
Gate location                          0
Inflight wifi service                  0
Inflight entertainment                 0
Online support                         0
Ease of Online booking                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Cleanliness                            0
Online boarding                        0
Departure Delay in Minutes             0
Arrival Delay in Minutes             393
dtype: int64

For this activity, the airline is specifically interested in knowing if a better in-flight entertainment experience leads to higher customer satisfaction. The Arrival Delay in Minutes column won’t be included in the binomial logistic regression model; however, the airline might become interested in this column in the future.

For now, the missing values should be removed for two reasons:

  • There are only 393 missing values out of the total of 129,880, so these are a small percentage of the total.
  • This column might impact the relationship between entertainment and satisfaction.
df_subset = df_original.dropna(axis=0).reset_index(drop = True)
Prepare the data

If we want to create a plot (sns.regplot) of our model to visualize results later in the notebook, the independent variable Inflight entertainment cannot be “of type int” and the dependent variable satisfaction cannot be “of type object.”

df_subset = df_subset.astype({"Inflight entertainment": float})
# Convert the categorical column satisfaction into numeric through one-hot encoding

df_subset['satisfaction'] = OneHotEncoder(drop='first').fit_transform(df_subset[['satisfaction']]).toarray()

To examine what one-hot encoding did to the DataFrame, let’s output the first 10 rows of df_subset.

df_subset.head(10)
satisfactionCustomer TypeAgeType of TravelClassFlight DistanceSeat comfortDeparture/Arrival time convenientFood and drinkGate locationOnline supportEase of Online bookingOn-board serviceLeg room serviceBaggage handlingCheckin serviceCleanlinessOnline boardingDeparture Delay in MinutesArrival Delay in Minutes
01.0Loyal Customer65Personal TravelEco26500022330353200.0
11.0Loyal Customer47Personal TravelBusiness2464000323444232310305.0
21.0Loyal Customer15Personal TravelEco213800032233444200.0
31.0Loyal Customer60Personal TravelEco62300033110141300.0
41.0Loyal Customer70Personal TravelEco35400034220242500.0
51.0Loyal Customer30Personal TravelEco189400032254554200.0
61.0Loyal Customer66Personal TravelEco2270003555055531715.0
71.0Loyal Customer10Personal TravelEco181200032233454200.0
81.0Loyal Customer56Personal TravelBusiness7300035440154400.0
91.0Loyal Customer22Personal TravelEco15560003222453423026.0
10 rows × 22 columns
Create the training and testing data
X = df_subset[["Inflight entertainment"]]
y = df_subset["satisfaction"]

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)

Other variables, like Departure Delay in Minutes seem like they can be potentially influential to customer satisfaction. This is why only using one independent variable might not be ideal.

Model building

clf = LogisticRegression().fit(X_train,y_train)
clf.coef_
array([[0.99751462]])
clf.intercept_
array([-3.19355406])
sns.regplot(x="Inflight entertainment", y="satisfaction", data=df_subset, logistic=True, ci=None)

The graph seems to indicate that the higher the inflight entertainment value, the higher the customer satisfaction, though this is currently not the most informative plot. The graph currently doesn’t provide much insight into the data points, as Inflight entertainment is categorical.

Results and evaluation

Now that we’ve completed our regression, let’s review and analyze our results. We’ll input the holdout dataset into the predict function to get the predicted labels from the model and save these predictions as a variable called y_pred.

# Save predictions.
y_pred = clf.predict(X_test)
print(y_pred)
[1. 0. 0. ... 0. 0. 0.]
# Use predict_proba to output a probability
clf.predict_proba(X_test)
array([[0.14258068, 0.85741932],
       [0.55008402, 0.44991598],
       [0.89989329, 0.10010671],
       ...,
       [0.89989329, 0.10010671],
       [0.76826225, 0.23173775],
       [0.55008402, 0.44991598]])
# Use predict to output 0's and 1's
clf.predict(X_test)
array([1., 0., 0., ..., 0., 0., 0.])
Analyze the results
print("Accuracy:", "%.6f" % metrics.accuracy_score(y_test, y_pred))
print("Precision:", "%.6f" % metrics.precision_score(y_test, y_pred))
print("Recall:", "%.6f" % metrics.recall_score(y_test, y_pred))
print("F1 Score:", "%.6f" % metrics.f1_score(y_test, y_pred))
Accuracy: 0.801529
Precision: 0.816142
Recall: 0.821530
F1 Score: 0.818827
Produce a confusion matrix

Data professionals often like to know the types of errors made by an algorithm. To obtain this information, we’ll produce a confusion matrix.

cm = metrics.confusion_matrix(y_test, y_pred, labels = clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix = cm,display_labels = clf.classes_)
disp.plot()
  • Two of the quadrants are under 4,000, which are relatively low numbers. These numbers relate to false positives and false negatives. Additionally, the other two quadrants—the true positives and true negatives—are both high numbers above 13,000.
  • There isn’t a large difference in the number of false positives and false negatives.
  • Using more than a single independent variable in the model training process could improve model performance. This is because other variables, like Departure Delay in Minutes, seem like they could potentially influence customer satisfaction.

Considerations

What could we recommend to stakeholders?

  • Customers who rated in-flight entertainment highly were more likely to be satisfied. Improving in-flight entertainment should lead to better customer satisfaction.
  • The model is 80.2 percent accurate. This is an improvement over the dataset’s customer satisfaction rate of 54.7 percent.
  • The success of the model suggests that the airline should invest more in model development to examine if adding more independent variables leads to better results. Building this model could not only be useful in predicting whether or not a customer would be satisfied but also lead to a better understanding of what independent variables lead to happier customers.

Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate.