Making predictions with different types of regression
Key regression techniques that we will encounter include linear regression, hypothesis testing, and logistic regression. When our goal is to make predictions with data, it is important to consider these different approaches and think about which approach will best help us achieve our task.
How to choose a regression technique
When choosing a regression technique, it is important to consider the data we are working with and the question we want to address.
Things to consider
- What is the question we want to answer? In other words, what do we want to predict?
- Which variable in our data can be the outcome variable?
- How is the outcome variable measured? If the outcome variable is continuous, it is more likely that either linear regression or hypothesis testing will be most appropriate. However, if the outcome variable is binary, we will find logistic regression to be more useful.
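The considerations above can be sketched as a small decision helper. This is only an illustration of the rule of thumb described here, not a complete selection procedure; the function name `suggest_technique` and its parameters are hypothetical.

```python
def suggest_technique(outcome_type: str, comparing_groups: bool = False) -> str:
    """Suggest a starting technique from the outcome type and the question asked."""
    if outcome_type == "binary":
        return "binomial logistic regression"
    if outcome_type == "continuous":
        # A question about differences between groups points to a hypothesis test;
        # a question about how much each factor matters points to linear regression.
        return "hypothesis testing" if comparing_groups else "linear regression"
    raise ValueError(f"unsupported outcome type: {outcome_type!r}")


print(suggest_technique("continuous"))
print(suggest_technique("continuous", comparing_groups=True))
print(suggest_technique("binary"))
```

The suggestion is a starting point only; the chosen model still has to be built and evaluated.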
Sample 1: Music Streams
Let’s imagine we’re a data professional at a recording studio and we are thinking about the number of times each song is streamed. Some questions that come up include:
- What factors influence the number of music streams?
- How much does each factor influence the number of streams?
Since the outcome variable, the number of music streams, can be treated as continuous, we could consider linear regression or hypothesis testing. But because the question asks how much each factor influences music streams, linear regression is the better fit.
Remember that linear regression:
- allows for accessible interpretation of the coefficients, which indicate how much each factor changes the outcome variable,
- provides R-squared, which helps explain how much of the variation in the outcome variable the factors account for.
We want to make sure that the model assumptions are met to add validity to our conclusions and insights.
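As a minimal sketch of reading coefficients and R-squared, the example below fits an ordinary least squares model with NumPy. The data and the factor names (`playlist_adds`, `ad_spend`) are made up for illustration and do not come from any real dataset; checking the model assumptions is not shown.

```python
import numpy as np

# Hypothetical data: six songs with two made-up factors.
playlist_adds = np.array([10, 20, 30, 40, 50, 60], dtype=float)
ad_spend = np.array([5, 3, 8, 2, 7, 4], dtype=float)
streams = np.array([1605, 2055, 2663, 3037, 3642, 4078], dtype=float)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(streams), playlist_adds, ad_spend])
coef, *_ = np.linalg.lstsq(X, streams, rcond=None)

# R-squared: share of the variation in streams explained by the model.
predicted = X @ coef
ss_res = np.sum((streams - predicted) ** 2)
ss_tot = np.sum((streams - streams.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"intercept:            {coef[0]:.1f}")
print(f"streams per playlist add: {coef[1]:.2f}")
print(f"streams per unit of ad spend: {coef[2]:.2f}")
print(f"R-squared: {r_squared:.4f}")
```

Each coefficient is read as the expected change in streams for a one-unit increase in that factor, holding the other factor fixed.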
Sample 2: Coffee Beans
Now let’s imagine a different scenario: We’re working for a cafe. They are sampling different coffee beans from different countries and want to figure out which kinds of coffee beans sell better. The team has put together some projections about how they expect the beans to sell. But they’re curious if these bean sales are independent of their pastry sales.
The country of origin of the beans is a known differentiator. The pastry sales would be more of the covariate. In this case, although the outcome variable, sales, is still continuous, the focus of the question is about comparing different groups, such as different kinds of coffee beans and different countries of origin. Therefore, we should focus more on hypothesis testing, which can be a great way to conduct A/B tests.
The null hypothesis (H0) would be that the cafe sells approximately the same number of bags of each coffee bean type. The alternative hypothesis (H1) would be that the cafe does not sell the same number of bags of each coffee bean type.
By conducting a series of tests, we can either reject or fail to reject the null hypothesis, using the p-value from each test to judge whether the observed differences in sales are statistically significant. The team will then be able to better understand which beans to order to sell more coffee at the cafe.
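One possible test for these hypotheses is a one-way ANOVA, sketched below with `scipy.stats.f_oneway`. The text does not say which test the cafe would use, so this choice is an assumption, and the weekly sales figures are made up. The sketch also ignores the pastry-sales covariate.

```python
from scipy import stats

# Hypothetical weekly bags sold for three bean types (made-up numbers).
bean_a = [20, 22, 19, 24, 21, 23]
bean_b = [28, 30, 27, 31, 29, 32]
bean_c = [21, 20, 23, 22, 19, 24]

# H0: mean weekly sales are the same for every bean type.
result = stats.f_oneway(bean_a, bean_b, bean_c)

alpha = 0.05  # significance level chosen for this sketch
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"F = {result.statistic:.2f}, p = {result.pvalue:.4g} -> {decision}")
```

Rejecting H0 only says that at least one bean type differs; identifying which one would need a follow-up comparison.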
Sample 3: Social Media Post
Let’s consider one last example in which we’re working at a social media company and we’re interested in exploring why some posts do or do not go viral. We decide on a question:
- How can I predict whether a post will go viral?
Since the outcome variable is binary (the post either goes viral or does not), binomial logistic regression might be the first model we consider. The best way to determine whether logistic regression is the right choice is to build and evaluate the model.
There are many metrics we can use:
- P-value
- Confusion matrices
- Precision
- Recall
- Accuracy
- ROC/AUC
- AIC
- BIC
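Several of the metrics listed above follow directly from the four counts in a confusion matrix. The sketch below computes precision, recall, and accuracy from hypothetical counts for a "goes viral" classifier; the numbers are made up.

```python
# Hypothetical confusion-matrix counts (made-up numbers).
tp = 40   # predicted viral, actually viral
fp = 10   # predicted viral, actually not viral
fn = 20   # predicted not viral, actually viral
tn = 130  # predicted not viral, actually not viral

precision = tp / (tp + fp)                  # of posts predicted viral, share truly viral
recall = tp / (tp + fn)                     # of truly viral posts, share the model caught
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that were correct

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```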
Choosing the best metrics will depend on the situation. When interpreting the coefficients of a logistic regression model, we need to remember to exponentiate them, because the raw coefficients are on the log-odds scale. Recall that when sharing results, an exponentiated coefficient is an odds ratio, which can be reported as the percentage by which a one-unit increase in a factor increases or decreases the odds of the outcome.
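To make the exponentiation step concrete, here is a minimal sketch that converts a log-odds coefficient into an odds ratio and a percentage change in the odds. The coefficient value and the feature name `includes_video` are hypothetical.

```python
import math

# Hypothetical fitted coefficient (log-odds scale) for a made-up feature.
coef_includes_video = 0.405

# Exponentiating a log-odds coefficient gives an odds ratio.
odds_ratio = math.exp(coef_includes_video)
pct_change_in_odds = (odds_ratio - 1) * 100

print(f"odds ratio = {odds_ratio:.2f}")
print(f"change in odds of going viral = {pct_change_in_odds:+.0f}%")
```

Note that the percentage describes a change in the odds, not in the probability itself.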
As data professionals, we will encounter many different questions that require a variety of approaches to address.
More example contexts for regression
The following examples demonstrate how the questions about prediction, outcome variable, and measurement can be navigated in order to choose a regression technique. This time, scenarios similar to the ones above are introduced within each context.
Example context: User engagement
In our work as a data professional, imagine that we are interested in making predictions about user engagement for a mobile app.
Approach 1
First, we might ask, what is the question we want to answer? One possible question could be “How much does each in-app feature influence user engagement?”
The in-app features might include a live chat with customer support, an FAQ section that updates weekly, and a community space to connect with other users.
Next, we might ask, which variable in our data can be the outcome variable? If we have access to data about users’ session lengths (in other words, how long users spend in the app each time they open it), the outcome variable can be session length.
Our next question might be: how is the outcome variable measured? Session length can be measured by number of minutes, which is continuous. Because the outcome variable is continuous, and we are interested in how much each feature influences the outcome variable, we could proceed with linear regression and check the relevant model assumptions.
If there is only one feature of interest, we would build a simple linear regression model. If there are multiple features of interest, we would build a multiple linear regression model.
Approach 2
Another question of interest could be “Does a dynamic landing page versus a static landing page make a difference in user engagement?”
The outcome variable can be session length, measured by number of minutes, for this example, too.
Since the outcome variable is continuous, and the target question is about whether there is a difference in user engagement when one type of landing page is used over the other, we could proceed with hypothesis testing.
We can then frame the hypotheses, which could be the following:
- Null hypothesis (H0): Users spend approximately the same amount of time in the app when the landing page is dynamic versus when it is static.
- Alternative hypothesis (H1): Users do NOT spend approximately the same amount of time in the app when the landing page is dynamic versus when it is static.
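These hypotheses could be checked in several ways. The sketch below uses a two-sided permutation test on the difference in mean session length, chosen here only because it needs nothing beyond the standard library; the text does not prescribe a specific test. The session lengths are made up.

```python
import random
from statistics import mean

# Hypothetical session lengths in minutes (made-up numbers).
dynamic_page = [12.1, 14.3, 13.8, 15.2, 12.9, 14.7, 13.5, 15.0]
static_page = [10.2, 11.5, 9.8, 12.0, 10.9, 11.1, 10.4, 11.8]

observed = abs(mean(dynamic_page) - mean(static_page))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
pooled = dynamic_page + static_page
n_dynamic = len(dynamic_page)
n_permutations = 5000
at_least_as_extreme = 0

# Under H0 the page labels are interchangeable, so shuffle them repeatedly
# and count how often a difference at least as large as the observed one appears.
for _ in range(n_permutations):
    rng.shuffle(pooled)
    diff = abs(mean(pooled[:n_dynamic]) - mean(pooled[n_dynamic:]))
    if diff >= observed:
        at_least_as_extreme += 1

# The add-one correction keeps the estimated p-value above zero.
p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
decision = "reject H0" if p_value < 0.05 else "fail to reject H0"
print(f"observed difference = {observed:.2f} min, p = {p_value:.4f} -> {decision}")
```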
Approach 3
Another question we might be interested in is “Will a user engage with the new line of products in-app?”
Next, we might ask, which variable in our data can be the outcome variable? If we have access to data about whether a user clicks to view the new line of products, that could be the outcome variable.
The next question is: how is the outcome variable measured? Whether a user clicks to view that content can be represented as a binary variable, with 1 indicating they clicked to view the content and 0 indicating that they did not click to view that content. Since this outcome variable is binary, we could proceed with binomial logistic regression.
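A minimal sketch of fitting a binomial logistic regression to a 0/1 outcome like this one follows. It uses Newton's method written out in NumPy rather than a modeling library, which is a choice made for this illustration only. The predictor (`minutes`, time spent in the app) and all data values are hypothetical.

```python
import numpy as np

# Hypothetical data: minutes in the app, and whether the user clicked (1) or not (0).
minutes = np.array([2, 4, 5, 7, 8, 10, 12, 14, 15, 18], dtype=float)
clicked = np.array([0, 0, 1, 0, 0, 1, 1, 0, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(minutes), minutes])  # intercept + predictor
beta = np.zeros(X.shape[1])                            # [intercept, slope] on the log-odds scale

# Newton's method for the logistic log-likelihood.
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))            # predicted click probabilities
    gradient = X.T @ (clicked - p)
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    beta = beta + np.linalg.solve(hessian, gradient)

# Predicted probability of a click for a user who spends 20 minutes in the app.
prob_at_20 = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * 20)))
print(f"intercept = {beta[0]:.3f}, slope = {beta[1]:.3f}")
print(f"predicted click probability at 20 minutes = {prob_at_20:.2f}")
```

The model outputs a probability between 0 and 1; turning it into a yes/no prediction requires choosing a threshold, commonly 0.5.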
Example context: Patient response
Now imagine that we are tasked with making predictions about patient responses to medical treatments.
Approach 1
We can start by asking, what is the question we want to answer? A possible question could be “How much does each factor influence a patient’s response to a medical treatment?”
If the goal of the treatment is to improve white blood cell (WBC) count and we have access to that data, WBC count can be the outcome variable.
The outcome variable is a continuous measure, and we could use linear regression to address this task.
Approach 2
Another question of interest could be “Will Treatment A, Treatment B, or Treatment C have the strongest impact on a patient’s WBC count?”
The outcome variable in this case would also be WBC count, which is continuous.
Since the target question is about comparing different treatments, it would be best to proceed with hypothesis testing.
We can then form the hypotheses, which could be the following.
- Null hypothesis (H0): Patients have approximately the same white blood cell count with each treatment.
- Alternative hypothesis (H1): Patients do NOT have approximately the same white blood cell count with each treatment.
Approach 3
A different question we might be interested in: “With Treatment A, will a patient’s WBC count reach the ideal range?”
If we have access to the associated data, the outcome variable would be whether a patient’s WBC count reaches the ideal range or not, which is a binary variable: 1 indicating that their WBC count falls within the ideal range and 0 indicating that it does not. We could build a logistic regression model to make predictions in this scenario.
Key takeaways
- Consider the question we want to answer and the data we have access to when choosing a regression technique for making predictions.
- Identifying the outcome variable of interest and how it is measured will help us decide which regression technique is most suitable for our task.
The following flowchart captures a high-level approach for choosing a regression technique, starting from the outcome variable, as discussed in this post. Also note that hypothesis testing is connected to regression analysis. For example, in linear regression, the process of testing whether there is a correlation between two variables (in other words, determining if the coefficients are statistically significant in the linear model) involves a hypothesis test.
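The link between linear regression and hypothesis testing mentioned above can be sketched with `scipy.stats.linregress`, which reports a p-value for the null hypothesis that the slope is zero. The data below are made up for illustration.

```python
from scipy import stats

# Hypothetical data: ad spend (made-up units) and streams for eight songs.
ad_spend = [1, 2, 3, 4, 5, 6, 7, 8]
streams = [110, 135, 128, 160, 171, 165, 198, 205]

# linregress fits a simple linear regression and tests H0: slope = 0 (two-sided).
fit = stats.linregress(ad_spend, streams)

alpha = 0.05  # significance level chosen for this sketch
significant = fit.pvalue < alpha
print(f"slope = {fit.slope:.2f}, p = {fit.pvalue:.4g}, significant = {significant}")
```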
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate.