Model assumptions in multiple linear regression

As a reminder, the linear regression assumptions are:

  • Linearity – Each predictor variable (Xi) is linearly related to the outcome variable (Y)
    → by plotting scatter plots
  • Independent observations – Each observation in the dataset is independent
    → by checking the data collection process
  • Normality – The residuals are normally distributed
    → by creating a Q-Q plot or plotting a histogram of residuals
    Note: It’s a common misunderstanding that the independent and/or dependent variables must be normally distributed when performing linear regression. This is not the case. Only the model’s residuals are assumed to be normal. 
  • Homoscedasticity – The variation of the residuals (errors) is constant or similar across the model
    → by creating a plot of the residuals vs fitted values
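
For example, here is a minimal sketch of how the normality and homoscedasticity checks from the list above might look in Python (assuming a fitted statsmodels OLS results object named model_results, a hypothetical name):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# model_results is assumed to be a fitted statsmodels OLS results object
residuals = model_results.resid
fitted = model_results.fittedvalues

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))

# Normality: histogram and Q-Q plot of the residuals
ax1.hist(residuals, bins=20)
ax1.set_title('Histogram of residuals')
sm.qqplot(residuals, line='s', ax=ax2)
ax2.set_title('Q-Q plot of residuals')

# Homoscedasticity: residuals vs fitted values should show no clear pattern
ax3.scatter(fitted, residuals, alpha=0.5)
ax3.axhline(0, color='red')
ax3.set_title('Residuals vs fitted values')

plt.show()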

And there is one more assumption when working with multiple regression:

  • No multicollinearity – No two independent variables (Xi and Xj) can be highly correlated with each other (they cannot be linearly related to each other)

The scatter plot matrix creates a scatter plot for every pair of variables. If we observe a linear relationship between an independent variable and the dependent variable, we should consider including it in our multiple regression model. However, if we observe a linear relationship between two independent variables, including both of them in the model will likely violate the no multicollinearity assumption.
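
As a quick sketch (assuming the data lives in a pandas DataFrame named df, a hypothetical name), a scatter plot matrix can be drawn with seaborn:

import seaborn as sns
import matplotlib.pyplot as plt

# One scatter plot for every pair of variables in df
sns.pairplot(df)
plt.show()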

Variance Inflation Factors

EDA and visualizations are powerful tools, but they can’t detect every relationship, so we turn back to math. Conveniently, our computer can calculate the VIF for us.

The variance inflation factor (VIF) quantifies how correlated each independent variable is with all of the other independent variables. The minimum value of VIF is 1, and it can grow very large. The larger the VIF, the more multicollinearity there is in the model.

Once we have identified multicollinearity, here are a couple of ways to handle it:

  • dropping one or more of the variables that exhibit multicollinearity
  • creating new variables using existing data 

Then we can recalculate the VIFs to confirm that the no multicollinearity assumption is now satisfied.

When it might be OK to have multicollinearity 

X variables that are linearly related can muddle the interpretation of the model’s results, so it is usually best to remove some of the collinear independent variables from the model.

However, the assumption of no multicollinearity is most important when we are using our regression model to make inferences about our data, because the inclusion of collinear data increases the standard errors of the model’s beta parameter estimates. 

But there may be times when the primary purpose of our model is to make predictions and the need for accurate predictions outweighs the need to make inferences about our data. In this case, including the collinear independent variables may be justified because it’s possible that their inclusion would result in better predictions. 

How to check the no multicollinearity assumption

Below are a couple of ways to check the no multicollinearity assumption.

  • A visual way to identify multicollinearity between independent (X) variables is using scatterplots or scatterplot matrices. The process is the same as when we checked the linearity assumption, except now we’re just focusing on the X variables, not the relationship between the X variables and the Y variable.
  • Calculating the variance inflation factor, or VIF, for each independent (X) variable is a way to quantify how much the variance of each variable is “inflated” due to correlation with other X variables. 

The details of calculating VIF are beyond the scope of this post, but it’s helpful to know that √VIFi represents the factor by which the standard error of coefficient βi increases relative to a situation in which all of the predictor variables are uncorrelated.

Here is an example of how we might obtain VIFs for our predictor variables.

from statsmodels.stats.outliers_influence import variance_inflation_factor

# X holds only the predictor (independent) columns of the DataFrame df
X = df[['col_1', 'col_2', 'col_3']]
# Compute the VIF of each predictor against all of the other predictors
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Pair each column name with its VIF
print(list(zip(X.columns, vif)))

The smallest value a VIF can take on is 1, which would indicate zero correlation between the X variable in question and the other predictor variables in the model. According to the statsmodels documentation, a high VIF, such as 5 or above, can indicate the presence of multicollinearity.

What to do if there is multicollinearity in our model 

Variable Selection 

The easiest way to handle multicollinearity is simply to use only a subset of the independent variables in our model. There are a few statistical techniques we can use to select variables strategically: forward selection and backward elimination.

Advanced Techniques 

In addition to the techniques listed above, there are more advanced techniques, such as: Ridge regression, Lasso regression, and Principal component analysis (PCA). These techniques can result in more accurate and predictive models, but can complicate the interpretation of regression results.
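
As a rough sketch of the regularized options (using scikit-learn, which is not covered in this post; X and y are assumed to already hold the predictors and the outcome), ridge and lasso regression can be fit like this:

from sklearn.linear_model import Ridge, Lasso

# X (predictors) and y (outcome) are assumed to already be defined
# alpha controls the strength of the penalty and needs to be tuned
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)  # lasso can shrink some coefficients to exactly 0

print(ridge.coef_)
print(lasso.coef_)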

Top variable selection methods

Variable selection, also known as feature selection, is the process of determining which variables or features to include in a given model. Variable selection is iterative. We’ll cover forward selection and backward elimination, which are based on extra sums of squares F-tests.

We know that a model with zero independent variables is probably not the best choice, and a model with all of the possible independent variables is also probably not the best choice. 

Forward selection

Forward selection is a stepwise variable selection process. It begins with the null model with zero independent variables and considers all possible variables to add. It incorporates the independent variable that contributes the most explanatory power to the model, based on the chosen metric and threshold. The process continues one variable at a time until no more variables can be added to the model. 
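
As an illustration only (not code from the course), here is a simplified sketch of forward selection that uses adjusted R² as the chosen metric and assumes a DataFrame df whose outcome column is named 'y' and whose other columns are valid formula terms:

import statsmodels.formula.api as smf

candidates = [col for col in df.columns if col != 'y']
selected = []
best_adj_r2 = float('-inf')

while candidates:
    # Try adding each remaining candidate and keep the one that helps most
    scores = []
    for col in candidates:
        formula = 'y ~ ' + ' + '.join(selected + [col])
        adj_r2 = smf.ols(formula, data=df).fit().rsquared_adj
        scores.append((adj_r2, col))
    top_score, top_col = max(scores)
    if top_score <= best_adj_r2:
        break  # no remaining variable improves the model; stop adding
    selected.append(top_col)
    candidates.remove(top_col)
    best_adj_r2 = top_score

print(selected)

A p-value or the extra-sum-of-squares F-test described below could be used as the metric instead of adjusted R².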

Backward elimination

Backward elimination is a stepwise variable selection process that begins with the full model (all possible independent variables) and removes the independent variable that adds the least explanatory power to the model, based on the chosen metric and threshold. The process continues one variable at a time until no more variables can be removed from the model.

Both forward selection and backward elimination require a cutoff point, or threshold, to determine when to add or remove variables, respectively. One common test is the extra-sum-of-squares F-test.

The extra-sum-of-squares F-test quantifies how much of the variance that is left unexplained by a reduced model is explained by the full model.

The reduced model can be any model that is less complex than, and nested within, the full model. Like other hypothesis tests, the F-test is usually evaluated based on a p-value: a small p-value lets us be fairly confident that the full model is explaining important additional variance.
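
For example, statsmodels can run this comparison directly when given two nested fitted models; the column names below are placeholders:

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical columns: y is the outcome, x1/x2/x3 are predictors in df
reduced = smf.ols('y ~ x1', data=df).fit()
full = smf.ols('y ~ x1 + x2 + x3', data=df).fit()

# Extra-sum-of-squares F-test: does the full model explain significantly
# more variance than the reduced model? A small p-value suggests it does.
print(anova_lm(reduced, full))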


Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate.