Evaluate uncertainty in regression analysis

Overview

Let’s recap the PACE (Plan, Analyze, Construct, Execute) stages so far:

We planned by thinking through the problem and subsetting the available data appropriately, guided by one question: how can we better understand the relationship between the variables?
In the analyze stage, we performed EDA and checked model assumptions.
Then we moved on to the construct stage, which has two main steps:
- We built our model and were able to pull out some parameter estimates.
- Now we’re going to focus on the next step: model evaluation.

With OLS (ordinary least squares), we could determine the best-fit line. But randomness and unpredictability are characteristics of every regression model, and they make it impossible to predict outcomes with 100% certainty. After all, there is still a difference between our observed and predicted values. We are just finding the line that fits the observed data best, in the sense of minimizing the sum of squared differences.
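To make the gap between observed and predicted values concrete, here is a minimal sketch that fits a line with the closed-form OLS formulas and lists the residuals. The data is a hypothetical toy example (not the penguin dataset), and the variable names are ours.

```python
# Hypothetical toy data (NOT the penguin dataset): x = predictor, y = outcome.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form OLS estimates for simple linear regression.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Residual = observed value minus predicted value, one per data point.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# OLS picks the line that makes this sum of squared residuals as small as possible.
sse = sum(r ** 2 for r in residuals)

print(f"slope={slope:.3f}, intercept={intercept:.3f}, SSE={sse:.4f}")
print("residuals:", [round(r, 3) for r in residuals])
```

Even for the best-fit line the residuals are not zero. They only cancel out on average, and that leftover scatter is what the rest of this post tries to quantify.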

In the OLS summary table:

P>|t| is the p-value associated with each coefficient estimate.
The two columns to the right of the p-value column give the lower and upper bounds of a 95% confidence interval around each coefficient estimate.

When evaluating simple linear regression results, we’ll focus less on the intercept row and more on the row for our independent variable of interest. (In this case, that’s bill length.)

Confidence Interval

A confidence interval is a range of values that describes the uncertainty surrounding an estimate. In the case of linear regression, we are estimating parameters. So a 95% confidence interval for the slope comes from a procedure that, across repeated samples, would produce intervals containing the true slope about 95% of the time.

Confidence Band

What if the slope and intercept were slightly different?

We can draw out a few different lines on our plot with slightly different slopes and intercepts all within our confidence intervals. We get a region around the regression line that is tight around the center and fans out a bit towards either end of the line.

The resulting shape would be similar to the one we get from Seaborn’s regplot function: together, these lines trace out the shaded region around the regression line.

Essentially, the confidence intervals around the parameter estimates give rise to what is called a confidence band. A confidence band is the area surrounding the line that describes the uncertainty around the predicted outcome at every value of X. It is typically drawn as a shaded region around the best-fit line on a scatter plot.

A confidence band reveals the confidence interval for each point on a regression line. Confidence bands are simply another way to report our findings responsibly.
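One way to make the "many slightly different lines" idea concrete is a bootstrap: resample the data with replacement, refit the line each time, and look at the spread of the fitted lines at each value of X. Seaborn’s documentation describes regplot’s shaded region as bootstrap-based as well, but the sketch below is our own stdlib-only illustration, not Seaborn’s implementation. The data is synthetic, and `fit_line` and `band_at` are hypothetical helper names.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical synthetic data: y = 2 + 3x + noise (NOT the penguin dataset).
x = [i / 2 for i in range(40)]  # 0.0, 0.5, ..., 19.5
y = [2.0 + 3.0 * xi + random.gauss(0, 4.0) for xi in x]


def fit_line(xs, ys):
    """Return (intercept, slope) from the closed-form OLS formulas."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in xs)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope


# Refit the line on many resampled datasets (rows drawn with replacement).
n_boot = 1000
lines = []
for _ in range(n_boot):
    idx = [random.randrange(len(x)) for _ in x]
    lines.append(fit_line([x[i] for i in idx], [y[i] for i in idx]))


def band_at(x0, level=0.95):
    """Percentile interval of the bootstrapped predictions at x0."""
    preds = sorted(b0 + b1 * x0 for b0, b1 in lines)
    lo = preds[int((1 - level) / 2 * n_boot)]
    hi = preds[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


x_mid = sum(x) / len(x)
for x0 in (min(x), x_mid, max(x)):
    lo, hi = band_at(x0)
    print(f"x={x0:5.2f}: band [{lo:6.2f}, {hi:6.2f}], width {hi - lo:.2f}")
```

The printed widths should be smallest near the middle of the X range and larger at both ends, which is the "tight in the center, fanning out" shape described above.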

We must remember that data is noisy and results can be uncertain. When using regression models like simple linear regression, even the best data doesn’t tell a complete story. As data analytics professionals, we should always aim not only to evaluate the performance and accuracy of our models, but also to report uncertainty. Communicating about confidence intervals and confidence bands is part of being a responsible data professional.

Interpret measures of uncertainty in regression

Now we can give a little more detail about the terms mentioned above.

The Concepts

We can represent a simple linear regression line as y = β₀ + β₁ X.

Since regression analysis utilizes estimation techniques, there is always a level of uncertainty surrounding the predictions made by regression models. To represent the error, we can actually rewrite the equation to include an error term, represented by the letter ϵ (“epsilon”):

y = β₀ + β₁ X + ϵ.

There is a residual (the difference between the predicted and actual value) for each data point in the dataset used to construct the model. We can then quantify how uncertain the entire model is through a few measures of uncertainty:

- Confidence intervals around beta coefficients
- P-values for the beta coefficients
- Confidence band around the regression line

A confidence interval is a range of values that describes the uncertainty surrounding an estimate.

A p-value is the probability of observing results at least as extreme as those observed, assuming the null hypothesis is true.

Interpreting Uncertainty

Let’s bring back the summary results from a linear regression model:

According to the simple linear regression model we built, the estimated slope β̂₁ is 141.1904. So for every one-millimeter increase in bill length, we would expect a penguin’s body mass to be about 141.19 grams higher, on average.

The estimate has a p-value reported as 0.000 (that is, it rounds to zero at three decimal places), which is less than 0.05, meaning that the coefficient is “statistically significant.”

Additionally, our estimate has a 95% confidence interval of [131.788, 150.592].
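As a quick sanity check on the reported numbers, we can work backwards from the interval. The sketch below only uses the three values quoted above. The sample size isn’t stated in this post, so it assumes a normal critical value (about 1.96) rather than the t value the summary table would use. That makes the standard error and test statistic approximations, not values read from the table.

```python
from statistics import NormalDist

# Values reported in the summary results discussed above.
slope_hat = 141.1904
ci_low, ci_high = 131.788, 150.592

# A symmetric interval is centered on the estimate.
midpoint = (ci_low + ci_high) / 2
half_width = (ci_high - ci_low) / 2

# ASSUMPTION: normal critical value instead of the t critical value,
# which is a reasonable stand-in only when the sample is large.
z_crit = NormalDist().inv_cdf(0.975)

# Interval = estimate +/- critical value * standard error, so we can
# back out an approximate standard error and test statistic.
approx_se = half_width / z_crit
approx_stat = slope_hat / approx_se

print(f"midpoint   = {midpoint:.3f}")
print(f"half-width = {half_width:.3f}")
print(f"approx SE  = {approx_se:.3f}")
print(f"approx estimate / SE = {approx_stat:.1f}")
```

The midpoint lands on the slope estimate, and the estimate is many standard errors away from zero, which is consistent with the p-value being reported as 0.000.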

Let’s review these more:

Previously, we learned about p-values and confidence intervals in the context of hypothesis testing. It may seem unintuitive, but in regression analysis we are testing hypotheses too.

P-values

When running regression analysis, we want to know whether X is really associated with y. So we do a hypothesis test on the regression results. In regression analysis, for each beta coefficient, we are testing the following set of null and alternative hypotheses:

H₀ (null hypothesis): β₁ = 0
H₁ (alternative hypothesis): β₁ ≠ 0

In our example, because the p-value is less than 0.05, we can reject the null hypothesis that β₁ is equal to 0 and state that the coefficient is statistically significant. In other words, the data provide strong evidence that a difference in bill length is associated with a difference in body mass.
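To see what "testing β₁ = 0" means without any distribution tables, here is a minimal sketch of a permutation test. This is a swapped-in technique, not the t-test that produces the P>|t| column. If the null hypothesis were true, shuffling y would not matter, so we compare the observed slope with slopes from shuffled data. The data is synthetic and `slope_of` is a hypothetical helper name.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical synthetic data with a real linear relationship (true slope 0.8).
x = [float(i) for i in range(30)]
y = [5.0 + 0.8 * xi + random.gauss(0, 3.0) for xi in x]

x_mean = sum(x) / len(x)
s_xx = sum((xi - x_mean) ** 2 for xi in x)


def slope_of(ys):
    """OLS slope of ys regressed on the fixed x values."""
    y_mean = sum(ys) / len(ys)
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, ys)) / s_xx


observed = slope_of(y)

# Under H0 (slope = 0), any pairing of x with y is equally plausible,
# so shuffled data shows what slopes arise by chance alone.
n_perm = 2000
count = 0
shuffled = y[:]
for _ in range(n_perm):
    random.shuffle(shuffled)
    if abs(slope_of(shuffled)) >= abs(observed):
        count += 1

# Two-sided p-value; the add-one correction avoids reporting exactly zero.
p_value = (count + 1) / (n_perm + 1)

print(f"observed slope = {observed:.3f}, permutation p-value = {p_value:.4f}")
```

A small p-value here has the same reading as in the summary table: a slope this far from zero would be very unusual if X and y were unrelated.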

Confidence Intervals

Each beta coefficient also has a confidence interval associated with its estimate. Loosely speaking, a 95% interval is often described as having a 95% chance of containing the true value of the coefficient, leaving a 5% chance that an interval such as [131.788, 150.592] does not contain the true value of β₁.

More precisely, this means that if we were to repeat this experiment many times, about 95% of the resulting confidence intervals would contain the true value of β₁. But since there is uncertainty in both of the estimated beta coefficients, the estimated y values also have uncertainty. This is where confidence bands become useful.
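The "repeat the experiment many times" reading can be checked by simulation. The sketch below generates many synthetic datasets from a line whose true slope we choose, builds an interval for the slope each time, and counts how often the interval contains the truth. All constants are hypothetical, and the interval again assumes a normal critical value rather than a t value, which is close enough with 100 points per dataset.

```python
import math
import random
from statistics import NormalDist

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical "true" model that we control in the simulation.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 1.0, 2.0, 5.0

x = [i / 10 for i in range(100)]  # same x values in every experiment
n = len(x)
x_mean = sum(x) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
z_crit = NormalDist().inv_cdf(0.975)  # ASSUMPTION: normal approximation


def slope_interval(y):
    """Approximate 95% confidence interval for the OLS slope."""
    y_mean = sum(y) / n
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
    intercept = y_mean - slope * x_mean
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(sse / (n - 2) / s_xx)  # standard error of the slope
    return slope - z_crit * se, slope + z_crit * se


n_experiments = 2000
hits = 0
for _ in range(n_experiments):
    y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + random.gauss(0, NOISE_SD) for xi in x]
    lo, hi = slope_interval(y)
    if lo <= TRUE_SLOPE <= hi:
        hits += 1

coverage = hits / n_experiments
print(f"share of intervals containing the true slope: {coverage:.3f}")
```

The printed share should be close to 0.95. Any single interval either contains the true slope or it doesn’t; the 95% describes how often the procedure succeeds.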

Confidence band

A confidence band is the area surrounding the line that describes the uncertainty around the predicted outcome. We can think of the confidence band as representing the confidence interval surrounding each point estimate of y.

Since there is uncertainty at every point on the line, we use the confidence band to summarize the confidence intervals across the regression model. The confidence band is always narrowest near the mean of X in the sample and widest at the extremes.
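The narrow-in-the-middle shape follows from the standard error of the fitted line at a point x0, which is s · sqrt(1/n + (x0 − x̄)² / Sxx): the second term is zero at the sample mean and grows as x0 moves away from it. Here is a minimal sketch with hypothetical toy data. `band_half_width` is our own helper name, and the normal critical value is an assumption. For a sample this small, a t critical value (for example from `scipy.stats.t.ppf`) would give a wider band.

```python
import math
from statistics import NormalDist

# Hypothetical toy data (NOT the penguin dataset).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 4.1, 5.8, 8.4, 9.9, 12.2, 13.8, 16.1]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
intercept = y_mean - slope * x_mean

# Residual standard deviation s, with n - 2 degrees of freedom.
sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
resid_sd = math.sqrt(sse / (n - 2))

z_crit = NormalDist().inv_cdf(0.975)  # ASSUMPTION: normal approximation


def band_half_width(x0):
    """Half-width of the approximate 95% band for the fitted line at x0."""
    se_fit = resid_sd * math.sqrt(1 / n + (x0 - x_mean) ** 2 / s_xx)
    return z_crit * se_fit


for x0 in (min(x), x_mean, max(x)):
    fit = intercept + slope * x0
    hw = band_half_width(x0)
    print(f"x={x0:4.1f}: fit={fit:6.2f}, band=[{fit - hw:6.2f}, {fit + hw:6.2f}]")
```

The half-width is smallest at x̄ and equal at points the same distance on either side of it, which matches the shaded region we see on a regression plot.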

As a summary:

Regression analysis utilizes estimation techniques, so there is always uncertainty around the predictions.
We can measure uncertainty using confidence intervals, p-values, and confidence bands.
For every coefficient estimate, we are testing the hypothesis that the coefficient equals 0.

Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate.