ANCOVA: Analysis of covariance

Using simple linear regression, we could explore the relationship between the weather and iced coffee sales. With multiple regression, we could then include temperature, distance to public transportation, or other variables we think might be related to coffee sales.

There’s a similar relationship between ANOVA and ANCOVA. Analysis of covariance, or ANCOVA, is a statistical technique that tests the difference of means between two or more groups while controlling for the effects of covariates. Covariates are variables that are not of direct interest to the question we are trying to address, but that may still influence the outcome we are measuring.

For example, when examining iced coffee sales, ANCOVA allows us to analyze how sales differ on workdays versus weekends while controlling for the temperature of the day. Let’s say we have a drop in sales on weekends. We could assume that no one buys coffee on weekends because they’re not going to work, but perhaps it was especially cold that weekend. ANCOVA allows us to check whether temperature, rather than the day of the week, explains the drop.
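The cold-weekend scenario can be sketched with made-up numbers. This is only an illustration under stated assumptions: the data is simulated, the variable names (is_weekend, temperature, sales) are hypothetical, and sales are generated to depend on temperature only. The sketch compares the raw weekend-versus-workday difference with the difference left over once temperature is included in a least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): 20 weeks of daily sales.
days = np.arange(140)
is_weekend = (days % 7 >= 5).astype(float)

# In this made-up sample, weekends happen to be colder than workdays.
temperature = np.where(is_weekend == 1, 18.0, 25.0) + rng.normal(0, 3, size=days.size)

# Sales are generated from temperature only: there is no true weekend effect.
sales = 50 + 4 * temperature + rng.normal(0, 5, size=days.size)

# Unadjusted comparison: mean weekend sales minus mean workday sales.
raw_diff = sales[is_weekend == 1].mean() - sales[is_weekend == 0].mean()

# Adjusted comparison: fit sales on an intercept, the weekend indicator,
# and temperature. The weekend coefficient is the group difference
# after controlling for temperature.
X = np.column_stack([np.ones_like(temperature), is_weekend, temperature])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
adjusted_diff = coef[1]

print(f"raw weekend minus workday difference: {raw_diff:.1f}")
print(f"difference after adjusting for temperature: {adjusted_diff:.1f}")
```

In this simulation the raw difference is large and negative, while the adjusted difference is much closer to zero, which is the pattern we would expect if temperature rather than the weekend itself were driving sales.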

Why would a data analytics professional use ANCOVA when we already have linear regression analysis we can use? There are many similarities between ANCOVA and linear regression. For example, both

allow for continuous and categorical independent variables,
focus on a continuous Y variable,
center on understanding relationships between variables.

But the use cases depend on which variable we’re most interested in understanding:

With ANCOVA,

while we are not focused on the covariates,
we are including covariates to gain a clear understanding of the categorical variable.

With linear regression,

we might be interested in all of the independent variables,
or in predicting the Y variable for unseen data.

What kinds of questions can ANCOVA help us answer?

Let’s say that we’re working at a bookstore and we’re interested in the relationship between book genres and sales. It seems new books tend to get more attention because authors are traveling to promote their recent work. Since that could affect sales on its own, ANCOVA lets us control for how long ago each book was published.

Controlling for other variables is important so we don’t draw inaccurate conclusions. In this example,

The categorical independent variable (X) is book genre.
The covariate is the years since the book was published.
The continuous dependent variable (Y) is the number of books sold in the last month.

The null hypothesis, or H₀, is that mean book sales are equal for all genres after accounting for the number of years since publication.

The alternative hypothesis, or H₁, is that mean book sales differ for at least one genre after accounting for the number of years since publication.

Just like with ANOVA, we have Python and our computers available to run the test and the math, but of course we need to interpret the results. Typically, if the test yields a p-value below the chosen significance level (commonly 0.05), we can reject the null hypothesis that all of the group means are the same once the covariates are controlled for.

More dependent variables: MANOVA and MANCOVA

MANOVA, or multivariate analysis of variance, is an extension of ANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables.

Like ANOVA, the two most common versions of MANOVA involve one or two categorical independent variables:

One-way MANOVA: One categorical independent variable
Two-way MANOVA: Two categorical variables

The independent variable must be categorical, and the outcome variables must be continuous.

Since we’re still dealing with hypothesis testing, we need some hypotheses to test. So let’s return to the bookstore example to generate and test new hypotheses about factors relating to book sales.

The one categorical variable will be book genre, and the two continuous dependent variables could be the number of books sold per month and the profit from book sales. Let’s say we’re working with a one-way MANOVA test. In this case,

Null hypothesis
- The mean number of books sold per month is the same for each book genre
- The mean profit from selling books is the same for each book genre
Alternative hypothesis
- The mean number of books sold per month is NOT the same for each book genre
- The mean profit from selling books is NOT the same for each book genre

For example, perhaps the profit made from self help books differs from the profit made from science fiction books. Or maybe the number of graphic novels sold per month differs from the number of historical fiction books sold per month.

In either case, we could reject the null hypothesis. MANOVA allows us to think of each data point as having a number of characteristics, which are the continuous Y variables, that we want to understand based on one or two sets of groups we care about: the one or two categorical X variables.

If, however, we’re only interested in one categorical variable and we want to control for another variable, we can use MANCOVA.
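A one-way MANOVA along these lines can be sketched with the MANOVA class in statsmodels. The data below is simulated and the column names (genre, books_sold, profit) and genre labels are hypothetical. The formula lists both outcome variables on the left side and the categorical variable on the right.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(7)
genres = ["self_help", "science_fiction", "graphic_novel"]
n_per_genre = 50

# Simulated data (hypothetical values): two continuous outcomes per book.
genre = np.repeat(genres, n_per_genre)
mean_sold = pd.Series(genre).map(
    {"self_help": 120, "science_fiction": 100, "graphic_novel": 90}
).to_numpy()
books_sold = mean_sold + rng.normal(0, 15, size=genre.size)
profit = 2.5 * books_sold + rng.normal(0, 40, size=genre.size)

books = pd.DataFrame({"genre": genre, "books_sold": books_sold, "profit": profit})

# One-way MANOVA: both Y variables on the left, one categorical X on the right.
fit = MANOVA.from_formula("books_sold + profit ~ C(genre)", data=books)
result = fit.mv_test()
print(result.summary())

# Pull out one of the reported multivariate statistics for the genre term.
stat = result.results["C(genre)"]["stat"]
p_wilks = float(stat.loc["Wilks' lambda", "Pr > F"])
print(f"Wilks' lambda p-value for genre: {p_wilks:.4g}")
```

The summary reports several multivariate test statistics (Wilks' lambda, Pillai's trace, and others) for each term. A small p-value for the genre term suggests that at least one of the outcome variables differs across genres, but it does not say which one, so follow-up comparisons would still be needed.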

MANCOVA

MANCOVA, or multivariate analysis of covariance, is an extension of ANCOVA and MANOVA that compares how two or more continuous outcome variables vary according to categorical independent variables, while controlling for covariates.

Let’s say we’re still interested in whether the book genre is related to the number of books sold, and the amount of profit, but we want to control for the popularity of the author. Then we could use MANCOVA.

The categorical independent variable (X) is still book genre.
Now we add a covariate, which is the author’s social media follower count (which we are controlling for).
The continuous dependent variables (Y) remain the same: the number of books sold per month and the monthly profit.

The null hypothesis is that mean book sales and mean monthly profits are equal for all genres after accounting for the author’s social media following.

The alternative hypothesis is that mean book sales or mean monthly profits differ for at least one genre after accounting for the author’s social media following.
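As far as I know, statsmodels does not provide a separate class named MANCOVA. One common approach, used in the sketch below, is to add the covariate to the right side of the MANOVA formula so that the genre test is adjusted for it. The data is simulated, and the column names (genre, followers_k, books_sold, profit) are hypothetical, with followers_k standing for follower count in thousands.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(11)
genres = ["self_help", "science_fiction", "graphic_novel"]
n_per_genre = 50

# Simulated data (hypothetical values).
genre = np.repeat(genres, n_per_genre)
followers_k = rng.gamma(shape=2.0, scale=20.0, size=genre.size)
mean_sold = pd.Series(genre).map(
    {"self_help": 120, "science_fiction": 100, "graphic_novel": 90}
).to_numpy()

# Sales depend on both the genre and the author's following.
books_sold = mean_sold + 0.5 * followers_k + rng.normal(0, 15, size=genre.size)
profit = 2.5 * books_sold + rng.normal(0, 40, size=genre.size)

books = pd.DataFrame({
    "genre": genre,
    "followers_k": followers_k,
    "books_sold": books_sold,
    "profit": profit,
})

# MANCOVA-style model: two outcomes, one categorical X, one covariate.
fit = MANOVA.from_formula(
    "books_sold + profit ~ C(genre) + followers_k", data=books
)
result = fit.mv_test()
print(result.summary())

# The genre term is now tested with the covariate already in the model.
genre_stat = result.results["C(genre)"]["stat"]
p_genre = float(genre_stat.loc["Wilks' lambda", "Pr > F"])
print(f"Wilks' lambda p-value for genre, adjusted for followers: {p_genre:.4g}")
```

The output contains a block for followers_k as well, but as with ANCOVA, the covariate is there to sharpen the genre comparison rather than to be interpreted in its own right.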

Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate.