Underfitting and overfitting

Underfitting 

In the case of underfitting, a multiple regression model fails to capture the underlying pattern in the outcome variable. An underfitting model has a low R-squared value. 

A model can underfit the data for a variety of reasons: 

  • The independent variables in the model might not have a strong relationship with the outcome variable. In this situation, different or additional predictors are needed. 
  • It could be that the sample dataset is too small for the model to learn the relationship between the predictors and the outcome. Using more sample data to build the model might reduce the problem of underfitting.

There are additional reasons that a multiple regression model might underfit the data, and the methods used to reduce this obstacle depend on the specific context.

Before building a multiple regression model, data scientists divide the sample data into two subsets: training data and test data. Training data is used to build the model, and test data is used to evaluate the model's performance after it has been built.

Splitting the sample data in this way is also called holdout sampling, with the holdout sample being the test data. Holdout sampling allows data scientists to evaluate how a model performs on data it has not experienced yet. The holdout sample might also be called the validation data.
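As a minimal sketch of holdout sampling, assuming the sample lives in a pandas DataFrame with an outcome column named price (the file name, column names, and 70/30 split below are illustrative assumptions, not part of the original material), scikit-learn's train_test_split can be used like this:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("housing.csv")              # hypothetical dataset file
X = df[["sqft", "bedrooms", "age"]]          # hypothetical predictor columns
y = df["price"]                              # hypothetical outcome column

# Hold out 30% of the rows as the test (holdout) sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)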

Overfitting 

Underfitting causes a multiple regression model to perform poorly on the training data, which indicates that its performance on the test data will also be substandard. In contrast, overfitting causes a model to perform well on the training data but considerably worse on the unseen test data. That is why we compare model performance on training data versus test data to identify overfitting.

An overfitting model fits the observed (training) data too closely, making it unable to generate suitable estimates for the general population. Such a multiple regression model has captured both the signal (i.e. the relationship between the predictors and the outcome variable) and the noise (i.e. the randomness in the dataset that is not part of that relationship). We cannot use an overfitting model to draw conclusions about the population, because the model only applies to the data used to build it.
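As a rough illustration (synthetic data, not taken from the original material), the sketch below fits an ordinary least-squares model on a small sample with many mostly irrelevant predictors and compares R-squared on the training data with R-squared on the test data; a training score far above the test score is the usual sign of overfitting.

# Sketch: spotting overfitting by comparing training vs. test R-squared
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 60, 40                              # small sample, many predictors
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + rng.normal(size=n)       # only the first predictor carries signal

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("Training R-squared:", model.score(X_train, y_train))   # typically close to 1
print("Test R-squared:    ", model.score(X_test, y_test))     # typically much lower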

Why does overfitting result in a higher R-squared value? 

R-squared is a goodness-of-fit measure: it tells us the proportion of variance in the outcome variable that is captured by the independent variables in the multiple regression model. However, as we add more independent variables to a model, the R-squared value will increase (or at worst stay the same) regardless of whether those predictors have any real relationship with the outcome variable.
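To see why, recall the standard definition of R-squared (the notation below is the conventional one rather than anything specific to the source material):

$$ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

Adding another predictor can only keep the residual sum of squares the same or make it smaller on the data used to fit the model, so R-squared never decreases when a variable is added, even a useless one.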

When to use adjusted R-squared instead 

Along with the R-squared value, a multiple regression model also has an associated adjusted R-squared value. The adjusted R-squared penalizes the addition of more independent variables to the multiple regression model. 

Additionally, the adjusted R-squared effectively credits only the variation explained by independent variables that genuinely contribute to the model, rather than rewarding every added predictor. These differences prevent the adjusted R-squared value from becoming inflated the way the R-squared value can be.

  • Like R-squared, adjusted R-squared generally ranges from 0 to 1, although it can dip slightly below 0 when a model explains almost no variance.
  • Adjusted R-squared is useful for comparing models of varying complexity, since it helps us decide whether adding another variable is worthwhile (see the sketch after this list).
  • R-squared is more useful when interpreting the results of a regression model, since it tells us how much variation in the dependent variable is explained by the model.
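For reference, adjusted R-squared is usually computed as 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the sample size and p is the number of predictors, so each additional predictor must explain enough extra variance to offset the penalty. As a small illustration (synthetic data; statsmodels is an assumption here, not a requirement of the source material), adding columns of pure noise to a model nudges R-squared upward while adjusted R-squared typically stays flat or falls:

# Sketch: R-squared vs. adjusted R-squared when useless predictors are added
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)             # one genuine predictor

X_real = sm.add_constant(x)
X_noisy = sm.add_constant(np.column_stack([x, rng.normal(size=(n, 10))]))  # plus 10 noise columns

fit_real = sm.OLS(y, X_real).fit()
fit_noisy = sm.OLS(y, X_noisy).fit()

print("R-squared:     ", fit_real.rsquared, "->", fit_noisy.rsquared)          # goes up
print("Adj. R-squared:", fit_real.rsquared_adj, "->", fit_noisy.rsquared_adj)  # typically flat or down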

Bias versus Variance 

A model that underfits the sample data is described as having high bias, whereas a model that fits the training data well but performs poorly on new data is described as having high variance. In data science, this tension is known as the bias-variance tradeoff:

  • The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). 
  • The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).
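For squared-error loss, these two sources of error combine in the standard bias-variance decomposition of the expected prediction error (a textbook identity, included here only for reference):

$$ \mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2 $$

where σ² is the irreducible error, the noise that no model can remove.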

This tradeoff is a dilemma that data scientists face when building any machine learning model because an ideal model should have low bias and low variance. This is another way of saying that it should neither underfit nor overfit. 

However, as we try to lower bias, variance tends to increase, and vice versa. This is why we can never fully resolve the problems of underfitting and overfitting. Instead, we should focus on reducing these problems in our multiple regression model as much as possible.

Regularization: Lasso, Ridge, and Elastic Net regression

The problem of overfitting is related to the bias-variance tradeoff, a concept at the heart of statistics and machine learning. The bias-variance tradeoff balances two model qualities, bias and variance, to minimize overall error for unobserved data.

Bias simplifies the model predictions by making assumptions about the variable relationships. A highly biased model may oversimplify the relationship, underfitting to the observed data and generating inaccurate estimates.

For example, a model that always predicts y = 2 regardless of the input values is an extremely biased model.

Variance in a model allows for flexibility and complexity, so the model can learn from existing data. But a model with high variance can overfit to the observed data and generate inaccurate estimates for unseen data.

Note: this variance is not to be confused with the variance of a distribution.

Regularized regression 

Regularization is a set of regression techniques that shrink the regression coefficient estimates towards zero, deliberately adding bias in order to reduce variance. By shrinking the estimates, regularization helps reduce the risk of overfitting the model. There are three common regularization techniques:

  • Lasso regression
  • Ridge regression 
  • Elastic net regression

For all three kinds of regularized regression, some bias is introduced to lower the variance of the model.

  • Lasso regression can completely remove variables that are not important for predicting the outcome variable of interest, by shrinking their coefficients all the way to zero.
  • In ridge regression, the impact of less relevant variables is minimized, but none of the variables drop out of the equation. Ridge is a good option when we want to keep all of the variables in the model.
  • When working with large datasets, we can't always know in advance whether we want variables to drop out of the model. Elastic net regression lets us test the benefit of lasso, ridge, and a hybrid of the two at once (see the sketches below).
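In their usual formulation (standard notation, not copied from the course material), the three techniques differ only in the penalty term added to the least-squares objective:

$$ \text{Lasso:}\quad \min_{\beta}\ \sum_{i}\big(y_i - x_i^{\top}\beta\big)^2 + \lambda \sum_{j}\lvert\beta_j\rvert $$

$$ \text{Ridge:}\quad \min_{\beta}\ \sum_{i}\big(y_i - x_i^{\top}\beta\big)^2 + \lambda \sum_{j}\beta_j^2 $$

$$ \text{Elastic net:}\quad \min_{\beta}\ \sum_{i}\big(y_i - x_i^{\top}\beta\big)^2 + \lambda_1 \sum_{j}\lvert\beta_j\rvert + \lambda_2 \sum_{j}\beta_j^2 $$

The absolute-value (L1) penalty is what lets lasso push some coefficients exactly to zero, while the squared (L2) penalty in ridge shrinks coefficients without eliminating them; elastic net blends the two.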

Each regularized regression technique is trying to help us better fit our model. But we should keep in mind that the estimated parameters are much harder to interpret than with simple linear regression or multiple regression.
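Below is a minimal scikit-learn sketch of all three techniques on synthetic data (the data, the alpha values, and the l1_ratio are illustrative assumptions; in practice these hyperparameters are usually tuned with cross-validation):

# Sketch: fitting lasso, ridge, and elastic net with scikit-learn
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))                        # 20 predictors, most irrelevant
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only two carry signal

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# alpha and l1_ratio below are placeholder values, not recommendations
models = {
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "elastic net": make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test R-squared:", round(model.score(X_test, y_test), 3))

# Lasso's L1 penalty drives some coefficients exactly to zero (variable removal)
print("lasso coefficients equal to zero:", (models["lasso"][-1].coef_ == 0).sum())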


Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate.