Perform multiple linear regression in Python

Introduction

Here, we will analyze a small business's historical marketing promotion data. Each row corresponds to an independent marketing promotion in which the business used TV, social media, radio, and influencer promotions to increase sales. The business previously asked us to find a single variable that predicts sales, and now they hope to expand the analysis to include other variables that can help them target their marketing efforts.

To address the business request, we will conduct a multiple linear regression analysis to estimate sales from a combination of independent variables. This will include:

  • Exploring and cleaning data
  • Using plots and descriptive statistics to select the independent variables
  • Creating and fitting a multiple linear regression model
  • Checking model assumptions
  • Interpreting model outputs and communicating the results to non-technical stakeholders

Imports & Load

# Import libraries and modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Load the data
data = pd.read_csv('marketing_sales_data.csv')
# Display the first five rows.
data.head()
   TV      Radio      Social Media  Influencer  Sales
0  Low     3.518070   2.293790      Micro       55.261284
1  Low     7.756876   2.572287      Mega        67.574904
2  High    20.348988  1.227180      Micro       272.250108
3  Medium  20.108487  2.728374      Mega        195.102176
4  High    31.653200  7.776978      Nano        273.960377

Data exploration

The features in the data are:

  • TV promotional budget (in “Low,” “Medium,” and “High” categories)
  • Social media promotional budget (in millions of dollars)
  • Radio promotional budget (in millions of dollars)
  • Sales (in millions of dollars)
  • Influencer size (in “Mega,” “Macro,” “Micro,” and “Nano” categories)
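Before plotting, it can also help to skim descriptive statistics for each column. The describe() call below is an optional sketch (not part of the original walkthrough); include='all' adds counts and unique category values for the categorical columns.

# Review descriptive statistics for numeric and categorical columns (optional)
data.describe(include='all')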
Create a pairplot of the data
# Create a pairplot of the data
sns.pairplot(data)
  • Radio and Social Media both appear to have linear relationships with Sales. Given this, Radio and Social Media may be useful as independent variables in a multiple linear regression model estimating Sales.
  • TV and Influencer are excluded from the pairplot because they are not numeric.
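Because TV and Influencer are categorical, they do not appear in the pairplot. One optional way to visualize them (a minimal sketch, not in the original analysis) is with boxplots of Sales by category.

# Visualize Sales across the categorical variables with boxplots (optional)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Boxplot of Sales for each TV category
sns.boxplot(x='TV', y='Sales', data=data, ax=axes[0])
axes[0].set_title("Sales by TV category")
# Boxplot of Sales for each Influencer category
sns.boxplot(x='Influencer', y='Sales', data=data, ax=axes[1])
axes[1].set_title("Sales by Influencer category")
plt.tight_layout()
plt.show()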
Calculate the mean sales for each categorical variable
# Calculate the mean sales for each TV category 
print(data.groupby('TV')['Sales'].mean())
print('')
# Calculate the mean sales for each Influencer category 
print(data.groupby('Influencer')['Sales'].mean())
TV
High 300.853195
Low 90.984101
Medium 195.358032
Name: Sales, dtype: float64

Influencer
Macro 181.670070
Mega 194.487941
Micro 188.321846
Nano 191.874432
Name: Sales, dtype: float64
  • The average Sales for High TV promotions is considerably higher than for Medium and Low TV promotions. TV may be a strong predictor of Sales.
  • The categories for Influencer have different average Sales, but the variation is not substantial. Influencer may be a weak predictor of Sales.
  • These results can be investigated further when fitting the multiple linear regression model.
Remove missing data
# Drop rows that contain missing data and update the DataFrame
data = data.dropna(axis=0)
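To confirm the drop removed all missing values, a quick check (a minimal sketch) is to count the remaining missing values per column; every count should now be zero.

# Confirm that no missing values remain after dropping rows
print(data.isna().sum())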
Clean column names

The ols() formula interface raises an error when a variable name contains a space, so we need to rename the affected column.

# Rename all columns in data that contain a space 
data = data.rename(columns={'Social Media': 'Social_Media'})
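The rename above targets the one affected column. If several column names contained spaces, a more general approach (an equivalent sketch for this dataset) would be to replace spaces across all column names.

# Replace spaces with underscores in every column name
data.columns = data.columns.str.replace(' ', '_')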

Model building

Fit a multiple linear regression model that predicts sales
# Define the OLS formula
ols_formula = 'Sales ~ C(TV) + Radio'

# Create an OLS model
OLS = ols(formula = ols_formula, data = data)

# Fit the model
model = OLS.fit()

# Save the results summary
model_results = model.summary()

# Display the model results
model_results
[Output: the OLS regression results summary table. Its R-squared, coefficients, p-values, and 95% confidence intervals are interpreted in the Results and evaluation section below. The summary also notes that the standard errors assume the covariance matrix of the errors is correctly specified.]
  • TV was selected, as the preceding analysis showed a strong relationship between the TV promotional budget and the average Sales.
  • Radio was selected because the pairplot showed a strong linear relationship between Radio and Sales.
  • Social Media was not selected because it did not increase model performance, and it was later determined to be correlated with another independent variable: Radio (a quick comparison sketch appears after this list).
  • Influencer was not selected because it did not show a strong relationship to Sales in the preceding analysis.
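To support the point about Social Media, one optional check (a sketch; the formula and the model_sm name below are illustrative, not from the original notebook) is to fit a second model that adds Social_Media and compare adjusted R-squared values; a negligible difference suggests Social_Media adds little explanatory power beyond TV and Radio.

# Fit an alternative model that also includes Social_Media (optional comparison)
ols_formula_sm = 'Sales ~ C(TV) + Radio + Social_Media'
model_sm = ols(formula = ols_formula_sm, data = data).fit()

# Compare adjusted R-squared values for the two models
print('Adjusted R-squared (TV + Radio):               ', round(model.rsquared_adj, 4))
print('Adjusted R-squared (TV + Radio + Social_Media):', round(model_sm.rsquared_adj, 4))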

Check model assumptions

For multiple linear regression, one assumption is added to the four simple linear regression assumptions: no multicollinearity. We'll check all five multiple linear regression assumptions below.

Model assumption: Linearity
# Create a scatterplot for each independent variable and the dependent variable
# Create a 1x2 plot figure
fig, axes = plt.subplots(1, 2, figsize = (8,4))

# Create a scatterplot between Radio and Sales
sns.scatterplot(x = data['Radio'], y = data['Sales'],ax=axes[0])
# Set the title of the first plot
axes[0].set_title("Radio and Sales")

# Create a scatterplot between Social Media and Sales
sns.scatterplot(x = data['Social_Media'], y = data['Sales'],ax=axes[1])
# Set the title of the second plot
axes[1].set_title("Social Media and Sales")
# Set the xlabel of the second plot
axes[1].set_xlabel("Social Media")

# Use matplotlib's tight_layout() function to add space between plots for a cleaner appearance
plt.tight_layout()
  • The linearity assumption holds for Radio, as there is a clear linear relationship in the scatterplot between Radio and Sales.
  • Social Media was not included in the preceding multiple linear regression model, but it does appear to have a linear relationship with Sales.
Model assumption: Independence

The independent observation assumption states that each observation in the dataset is independent. Because each marketing promotion (i.e., row) represents a separate, independent promotion, the independence assumption is not violated.
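The promotions here have no meaningful time ordering, so no formal test is required. If the rows did follow a sequence (for example, promotions over time), one optional check for autocorrelated residuals (a sketch, not part of the original analysis) is the Durbin-Watson statistic, which is also reported in the model summary; values near 2 suggest little autocorrelation.

# Optional: Durbin-Watson statistic for residual autocorrelation
from statsmodels.stats.stattools import durbin_watson
print(durbin_watson(model.resid))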

Model assumption: Normality

We’ll create the following plots to check the normality assumption:

  • Plot 1: Histogram of the residuals
  • Plot 2: Q-Q plot of the residuals
# Calculate the residuals
residuals = model.resid

# Create a 1x2 plot figure
fig, axes = plt.subplots(1, 2, figsize = (8,4))
# Create a histogram with the residuals. 
sns.histplot(residuals, ax=axes[0])

# Set the x label of the residual plot
axes[0].set_xlabel("Residual Value")
# Set the title of the residual plot
axes[0].set_title("Histogram of Residuals")

# Create a Q-Q plot of the residuals
sm.qqplot(residuals, line='s',ax = axes[1])
# Set the title of the Q-Q plot
axes[1].set_title("Normal QQ Plot")

# Use matplotlib's tight_layout() function to add space between plots for a cleaner appearance
plt.tight_layout()
# Show the plot
plt.show()
  • The histogram of the residuals is approximately normally distributed, which supports that the normality assumption is met for this model. The residuals in the Q-Q plot form a straight line, further supporting that this assumption is met.
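As an optional numeric complement to the plots (a sketch not in the original notebook, assuming SciPy is available), the Shapiro-Wilk test can be applied to the residuals; a large p-value is consistent with normality, though the test is sensitive to sample size.

# Optional: Shapiro-Wilk test on the residuals
from scipy import stats
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print('Shapiro-Wilk statistic:', shapiro_stat)
print('Shapiro-Wilk p-value:', shapiro_p)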
Model assumption: Constant variance
# Create a scatterplot with the fitted values from the model and the residuals
ax = sns.scatterplot(x = model.fittedvalues, y = model.resid)

# Set the x axis label
ax.set_xlabel("Fitted Values")
# Set the y axis label
ax.set_ylabel("Residuals")
# Set the title
ax.set_title("Fitted Values v. Residuals")

# Add a line at y = 0 to visualize the variance of residuals above and below 0
ax.axhline(0)
# Show the plot
plt.show()
  • The fitted values fall into three groups because the categorical TV variable dominates this model; that is, TV is the biggest factor driving Sales.
  • However, the residuals are spread similarly around zero across the range of fitted values, supporting that the constant variance assumption is met.
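A numeric complement to the visual check (an optional sketch, not part of the original walkthrough) is the Breusch-Pagan test; a small p-value would indicate heteroscedasticity.

# Optional: Breusch-Pagan test for heteroscedasticity
from statsmodels.stats.diagnostic import het_breuschpagan
bp_lm_stat, bp_lm_pvalue, bp_f_stat, bp_f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print('Breusch-Pagan LM p-value:', bp_lm_pvalue)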
Model assumption: No multicollinearity

Two common ways to check for multicollinearity are to:

  • Create scatterplots to show the relationship between pairs of independent variables
  • Use the variance inflation factor to detect multicollinearity
# Create a pairplot of the data
sns.pairplot(data)
# Calculate the variance inflation factor (optional)
# Import variance_inflation_factor from statsmodels
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a subset of the data with the continuous independent variables 
X = data[['Radio','Social_Media']]

# Calculate the variance inflation factor for each variable.
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Create a DataFrame with the VIF results for the column names in X.
df_vif = pd.DataFrame(vif, index=X.columns, columns = ['VIF'])

# Display the VIF results.
df_vif
              VIF
Radio         5.170922
Social_Media  5.170922
  • The preceding model has only one continuous independent variable (Radio), so there are no multicollinearity issues in that model.
  • However, if a model used both Radio and Social_Media as predictors, the moderate linear relationship between Radio and Social_Media would violate the no multicollinearity assumption.
  • Furthermore, the variance inflation factor when both Radio and Social_Media are included in the model is 5.17 for each variable, indicating high multicollinearity.

Results and evaluation

Let’s display the OLS regression results again.

# Display the model results summary
model_results
  • Using TV and Radio as the independent variables results in a multiple linear regression model with R² = 0.904. In other words, the model explains 90.4% of the variation in Sales. This makes the model an excellent predictor of Sales.
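The R² and adjusted R² can also be read directly from the fitted model object (a minimal sketch) rather than from the summary table.

# Access R-squared and adjusted R-squared directly from the fitted model
print('R-squared:', round(model.rsquared, 3))
print('Adjusted R-squared:', round(model.rsquared_adj, 3))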
Interpreting model coefficients
  • When TV and Radio are used to predict Sales, the model coefficients are:
    • β_0 = 218.5261
    • β_TVLow = −154.2971
    • β_TVMedium = −75.3120
    • β_Radio = 2.9669
  • We could write the relationship between Sales and the independent variables as a linear equation:
    • Sales = β_0 + β_1 ∗ X_1 + β_2 ∗ X_2 + β_3 ∗ X_3
    • Sales = β_0 + β_TVLow ∗ X_TVLow + β_TVMedium ∗ X_TVMedium + β_Radio ∗ X_Radio
    • Sales = 218.5261 − 154.2971 ∗ X_TVLow − 75.3120 ∗ X_TVMedium + 2.9669 ∗ X_Radio
  • The default TV category for the model is High, since there are coefficients for only the other two TV categories, Medium and Low. Because the coefficients for the Medium and Low TV categories are negative, average Sales are lower for the Medium and Low TV categories than for the High TV category when Radio is held at the same level.
    • For example, the model predicts that a Low TV promotion yields sales that are $154.2971 million lower on average than a High TV promotion, given the same Radio promotion.
  • The coefficient for Radio is positive, confirming the positive linear relationship shown earlier during the exploratory data analysis.
  • The p-value for all coefficients is 0.000, meaning all coefficients are statistically significant at p = 0.05.
  • The 95% confidence intervals for each coefficient should be reported when presenting results to stakeholders.
    • For example, there is a 95% chance that the interval [−163.979, −144.616] contains the true parameter of the slope of β_TVLow, which is the estimated difference in promotion sales when a Low TV promotion is chosen instead of a High TV promotion.
  • Potential areas to explore include:
    • Providing the business with the estimated sales for different combinations of TV promotions and radio budgets (see the prediction sketch after this list).
    • Creating additional plots to help convey the results, such as using seaborn's regplot() to plot the data with a best-fit regression line.
  • Given how accurate TV was as a predictor, the model could be improved by getting a more granular view of the TV promotions, such as by considering more categories or the actual TV promotional budgets.
  • Furthermore, additional variables, such as the location of the marketing campaign or the time of year, could increase model accuracy.
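The first potential area above mentions estimating sales for different promotion plans; the fitted model can generate these via model.predict(). The TV categories and Radio budgets below are illustrative assumptions, not values taken from the dataset.

# Estimate sales for a few hypothetical promotion plans (illustrative values)
new_promotions = pd.DataFrame({
    'TV': ['High', 'Medium', 'Low'],
    'Radio': [30.0, 20.0, 10.0]
})

# The formula interface applies the same C(TV) encoding used when fitting
predicted_sales = model.predict(new_promotions)
print(predicted_sales)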

Considerations

What findings could we share with others?

According to the model, high TV promotional budgets result in significantly more sales than medium and low TV promotional budgets. For example, the model predicts that a Low TV promotion yields sales that are $154.2971 million lower on average than a High TV promotion, given the same Radio promotion.

The coefficient for radio is positive, confirming the positive linear relationship shown earlier during the exploratory data analysis.

The p-value for all coefficients is 0.000, meaning all coefficients are statistically significant at p = 0.05. The 95% confidence intervals for each coefficient should be reported when presenting results to stakeholders.

For example, there is a 95% chance the interval [−163.979, −144.616] contains the true parameter of the slope of β_TVLow, which is the estimated difference in promotion sales when a low TV promotional budget is chosen instead of a high TV promotional budget.
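When compiling these intervals for a report, they can be extracted programmatically from the fitted model (a minimal sketch; alpha=0.05 gives 95% intervals).

# Extract the 95% confidence intervals for all model coefficients
conf_intervals = model.conf_int(alpha=0.05)
conf_intervals.columns = ['2.5%', '97.5%']
print(conf_intervals)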

How could we frame our findings to stakeholders?

High TV promotional budgets have a substantial positive influence on sales. The model estimates that switching from a high to medium TV promotional budget reduces sales by $75.3120 million (95% CI [−82.431,−68.193]), and switching from a high to low TV promotional budget reduces sales by $154.297 million (95% CI [−163.979,−144.616]).

The model also estimates that an increase of $1 million in the radio promotional budget will yield a $2.9669 million increase in sales (95% CI [2.551,3.383]).

Thus, it is recommended that the business allot a high promotional budget to TV when possible and invest in radio promotions to increase sales.


Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate.