Analyzing Categorical Data

Analyzing one categorical variable

Let’s start with a simple data set.

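| Drink | Type | Calories | Sugars (g) | Caffeine (mg) |
| --- | --- | --- | --- | --- |
| Brewed coffee | Hot | 4 | 0 | 260 |
| Caffè latte | Hot | 100 | 14 | 75 |
| Caffè mocha | Hot | 170 | 27 | 95 |
| Cappuccino | Hot | 60 | 8 | 75 |
| Iced brewed coffee | Cold | 60 | 15 | 120 |
| Chai latte | Hot | 120 | 25 | 60 |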

The individuals in this data set are drinks.

This data set contains 4 variables (type, calories, sugars, and caffeine): 1 is categorical (type) and 3 are quantitative.
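As a rough illustration of analyzing one categorical variable, the sketch below counts how often each drink type occurs. The rows are copied from the drinks table as I read it; the variable names (`drinks`, `type_counts`, `type_share`) are my own and not part of the original material.

```python
from collections import Counter

# Rows from the drinks table: (drink, type, calories, sugars_g, caffeine_mg).
drinks = [
    ("Brewed coffee", "Hot", 4, 0, 260),
    ("Caffè latte", "Hot", 100, 14, 75),
    ("Caffè mocha", "Hot", 170, 27, 95),
    ("Cappuccino", "Hot", 60, 8, 75),
    ("Iced brewed coffee", "Cold", 60, 15, 120),
    ("Chai latte", "Hot", 120, 25, 60),
]

# A categorical variable is summarized by counting each category.
type_counts = Counter(row[1] for row in drinks)

# Relative frequency = count / number of individuals.
total = len(drinks)
type_share = {category: count / total for category, count in type_counts.items()}

print(type_counts)
print(type_share)
```

Averaging the type column would be meaningless, which is one practical way to tell a categorical variable from a quantitative one.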

Two-way tables

Sometimes (unlike the sample above) a data point belongs to more than one category. For example, cakes might have chocolate, coconut, both, or neither. We can use Venn diagrams and two-way tables to represent those data points.

Venn diagrams show sets and their overlaps. Two-way tables organize data in rows and columns and show how many data points fit in each combination of categories. Both methods help show relationships between categories.
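A minimal sketch of how a two-way table can be tallied from individual records. The list of cakes is reconstructed from the counts in this section's cake example (3 with both, 6 with chocolate only, 1 with coconut only, 2 with neither); the record layout and names are assumptions made for illustration.

```python
from collections import Counter

# Each cake is recorded as (has_chocolate, has_coconut).
cakes = (
    [(True, True)] * 3      # chocolate and coconut
    + [(True, False)] * 6   # chocolate only
    + [(False, True)] * 1   # coconut only
    + [(False, False)] * 2  # neither
)

# Cell counts of the two-way table, keyed by (chocolate, coconut).
cells = Counter(cakes)

# Row totals (chocolate / no chocolate) and column totals (coconut / no coconut).
row_totals = {choc: cells[(choc, True)] + cells[(choc, False)] for choc in (True, False)}
col_totals = {coco: cells[(True, coco)] + cells[(False, coco)] for coco in (True, False)}
grand_total = sum(cells.values())

for choc in (True, False):
    label = "chocolate" if choc else "no chocolate"
    print(label, cells[(choc, True)], cells[(choc, False)], row_totals[choc])
print("Σ", col_totals[True], col_totals[False], grand_total)
```

The cell keyed `(True, True)` corresponds to the overlap of the two circles in a Venn diagram.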

The Venn diagram and the two-way table below show the same data.

| | coconut | no coconut | Σ |
| --- | --- | --- | --- |
| chocolate | 3 | 6 | 9 |
| no chocolate | 1 | 2 | 3 |
| Σ | 4 | 8 | 12 |

Two-way relative frequency tables

Relative frequencies show how often something happens compared to the total number of times it could happen, so they are expressed as percentages or fractions rather than counts. They are useful for checking whether there is an association between two variables.

Two-way relative frequency tables show what percent of data points fit in each category. We can use row relative frequencies or column relative frequencies, depending on the context of the problem.

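| | x | y |
| --- | --- | --- |
| A | 28 | 35 |
| B | 97 | 104 |

| | x | y |
| --- | --- | --- |
| A | 0.22 | 0.25 |
| B | 0.78 | 0.75 |
| Total | 1.00 | 1.00 |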

Sometimes our percentages won’t add up to 100% even when we round properly. This is called round-off error.
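The column relative frequencies for the x/y counts can be reproduced with a short sketch like the one below. The three-way split at the end is a hypothetical example, not from the original data, added only to show how round-off error arises.

```python
# Counts from the x/y table.
counts = {"A": {"x": 28, "y": 35}, "B": {"x": 97, "y": 104}}

# Column relative frequency = cell count / column total.
col_totals = {col: sum(row[col] for row in counts.values()) for col in ("x", "y")}
col_rel = {
    r: {col: round(counts[r][col] / col_totals[col], 2) for col in ("x", "y")}
    for r in counts
}
print(col_totals)
print(col_rel)

# Hypothetical round-off illustration: three equal categories, in whole percents.
thirds = [round(100 * 1 / 3) for _ in range(3)]
print(thirds, sum(thirds))
```

Each third rounds to 33%, so the rounded percentages sum to 99% even though every value was rounded correctly.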

Two-way relative frequency tables are useful when the groups in a data set have different sample sizes, because percentages can be compared directly while raw counts cannot.

Distributions in two-way tables

We can talk about two types of distributions.

A marginal distribution focuses on one variable only. We read it from the margins of the table (the total row or the total column). It can be expressed as counts or as percentages.

A conditional distribution describes one variable given that something is true about the other variable. In other words, it shows how the distribution of one variable changes with the value of the other. Conditional distributions are usually expressed as percentages.

| Correct answers | 0-20 min | 21-40 min | 41-60 min | >60 min | Total |
| --- | --- | --- | --- | --- | --- |
| 80-100 % | 0 | 4 | 16 | 20 | 40 |
| 60-79 % | 0 | 20 | 30 | 10 | 60 |
| 40-59 % | 2 | 4 | 32 | 32 | 70 |
| 20-39 % | 10 | 2 | 8 | 0 | 20 |
| 0-19 % | 2 | 0 | 0 | 8 | 10 |
| Total | 14 | 30 | 86 | 70 | 200 |

Two-way table for a class of 200 students, relating the amount of time studied to the percentage of correct answers on an exam.

To represent the marginal and conditional distributions, let’s add their respective percentages to the table as well.

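| Correct answers | 0-20 min | 21-40 min | 41-60 min | >60 min | Total |
| --- | --- | --- | --- | --- | --- |
| 80-100 % | 0 | *4 (13%)* | 16 | 20 | **40 (20%)** |
| 60-79 % | 0 | *20 (67%)* | 30 | 10 | **60 (30%)** |
| 40-59 % | 2 | *4 (13%)* | 32 | 32 | **70 (35%)** |
| 20-39 % | 10 | *2 (7%)* | 8 | 0 | **20 (10%)** |
| 0-19 % | 2 | *0 (0%)* | 0 | 8 | **10 (5%)** |
| Total | **14 (7%)** | **30 (15%)** | **86 (43%)** | **70 (35%)** | **200** |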

The marginal distributions are marked in bold (shown as counts and percentages).

A conditional distribution is marked in italics: the distribution of the percentage of correct answers given that students studied for 21-40 minutes (shown as percentages).
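The percentages in the table can be recomputed with a small sketch like the following. The counts come from the study-time table; the dictionary layout and variable names are assumptions made for illustration.

```python
# Rows: percentage of correct answers; columns: time studied.
time_bins = ["0-20 min", "21-40 min", "41-60 min", ">60 min"]
table = {
    "80-100 %": [0, 4, 16, 20],
    "60-79 %": [0, 20, 30, 10],
    "40-59 %": [2, 4, 32, 32],
    "20-39 %": [10, 2, 8, 0],
    "0-19 %": [2, 0, 0, 8],
}

grand_total = sum(sum(row) for row in table.values())

# Marginal distribution of the score variable: row totals / grand total.
marginal_score = {score: sum(row) / grand_total for score, row in table.items()}

# Marginal distribution of study time: column totals / grand total.
col_totals = [sum(row[i] for row in table.values()) for i in range(len(time_bins))]
marginal_time = {t: c / grand_total for t, c in zip(time_bins, col_totals)}

# Conditional distribution of score given "21-40 min":
# divide that column by its own total, not by the grand total.
j = time_bins.index("21-40 min")
conditional = {score: row[j] / col_totals[j] for score, row in table.items()}

for score, share in conditional.items():
    print(f"{score}: {share:.0%}")
```

The key difference is the denominator: marginal distributions divide by the grand total (200), while the conditional distribution divides by the total of the condition's own column (30).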


Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Khan Academy’s Statistics and Probability series.