Experimental Design | Alper Kokcu

Data professionals often work with experimental data previously collected by other researchers. However, the right data for a specific project might not always be available or accessible. When that happens, data professionals can design their own experiments and collect their own data.

Let’s discuss how data professionals design experiments to collect data, test hypotheses, and discover relationships between variables.

Context: Experimental design

Experimental design refers to planning an experiment so that the data we collect can answer our research question.

The typical purpose of an experiment is to discover a cause-and-effect relationship between variables. For example, a data professional might design an experiment to discover whether:

A new medicine leads to faster recovery time
A new website design increases product sales
A new fertilizer increases crop growth
A new training program improves athletic performance

It’s important to understand experimental design because it affects the quality of our data, and the validity of any conclusions we draw based on our results. A poor design might lead to invalid results, which can be costly for companies and consumers.

Let’s explore an example to get a better understanding of experimental design.

Example: Clinical trial

Imagine we work as data professionals at a pharmaceutical company. The company has developed a new medicine to treat the common cold. Our team leader asks us to design an experiment to test the effectiveness of the medicine. We want to find out whether taking the medicine leads to faster recovery time.

There are at least three key steps in designing an experiment:

Define your variables
Formulate your hypothesis
Assign test subjects to treatment and control groups

Note: These are basic steps that apply to controlled experiments. Experimental design is a complex topic that covers far more than these three steps.

Step 1: Define your variables

Data professionals often begin by defining the independent and dependent variables in their experiment. This helps clarify the relationship between the variables.

The independent variable refers to the cause we’re interested in investigating. A researcher changes or controls the independent variable to determine how it affects the dependent variable. “Independent” means it’s not influenced by other variables in the experiment.
The dependent variable refers to the effect we’re interested in measuring. “Dependent” means its value is influenced by the independent variable.

In our clinical trial, we want to find out how the medicine affects recovery time. Therefore:

Our independent variable is the medicine—the cause we want to investigate.
Our dependent variable is recovery time—the effect we want to measure.

In a more complex experiment, we might test the effect of different medicines on recovery time, or different doses of the same medicine. In each case, we manipulate our independent variable (medicine) to measure its effect on our dependent variable (recovery time).

Step 2: Formulate your hypothesis

Our hypothesis states the relationship between our independent and dependent variables and predicts the outcome of our experiment.

For our clinical trial:

Our null hypothesis (H0) is that the medicine has no effect on recovery time.
Our alternative hypothesis (Ha) is that the medicine shortens recovery time.
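
One way to write these hypotheses formally is in terms of mean recovery time. The symbols below are assumptions introduced here for illustration: the two means stand for the mean recovery time in days of people who take the medicine and of people who do not. The sketch also assumes a one-sided alternative, because we are asking whether recovery is faster; a two-sided version would replace the inequality with "not equal".

```latex
\begin{aligned}
H_0 &: \mu_{\text{treatment}} = \mu_{\text{control}} \\
H_a &: \mu_{\text{treatment}} < \mu_{\text{control}}
\end{aligned}
```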

Step 3: Assign test subjects to treatment and control groups

Treatment and control groups

Experiments such as clinical trials and A/B tests are controlled experiments. In a controlled experiment, each test subject is assigned to either a treatment group or a control group. The treatment is the change being tested in the experiment. The treatment group is exposed to the treatment; the control group is not. The difference in the outcome metric between the two groups estimates the treatment’s effect on the test subjects.

In our clinical trial, the treatment is the medicine that the subjects in the treatment group are given. The subjects in the control group are not given the medicine. Imagine our results show that mean recovery time is lower in the treatment group (6.2 days) than in the control group (7.5 days). The difference between the two groups, 7.5 - 6.2 = 1.3 days, is our estimate of the treatment’s impact. In other words, subjects who took the medicine recovered 1.3 days faster on average.

Note: After a data professional designs and runs their experiment, they use statistical testing to analyze the results. As a next step, we might conduct a two-sample t-test to determine whether the observed difference in recovery time is statistically significant or due to chance.
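
To make that next step concrete, here is a minimal sketch that computes the difference in means and a two-sample t statistic using only Python's standard library. The recovery times are made-up illustration data, chosen so the group means match the 6.2 and 7.5 days in our example, and the helper name welch_t is hypothetical. The sketch uses Welch's version of the two-sample t-test, which does not assume equal variances in the two groups; that choice is an assumption, not something the example dictates.

```python
import math
import statistics

# Hypothetical recovery times in days (illustration data only, not from a
# real trial), chosen so the group means are 6.2 and 7.5.
treatment = [5.5, 6.0, 6.5, 6.0, 7.0, 6.2]
control = [7.0, 8.0, 7.5, 7.2, 7.8, 7.5]


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for mean(a) - mean(b)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


mean_difference = statistics.mean(control) - statistics.mean(treatment)
t_stat, dof = welch_t(control, treatment)

print(f"difference in means: {mean_difference:.2f} days")
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the t statistic into a p-value requires the t distribution, which the standard library does not provide. If SciPy is available, scipy.stats.ttest_ind(control, treatment, equal_var=False) performs the same Welch test and also returns the p-value.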

Ideally, exposure to the treatment is the only significant difference between the two groups. This design allows researchers to control for other factors that might influence the test results and draw causal conclusions about the effect of the treatment.

Randomization

Typically, data professionals randomly assign test subjects to treatment and control groups. Randomization helps control the effect of other factors on the outcome of an experiment. Two common methods for assigning subjects to treatment and control groups are completely randomized design and randomized block design.

In a completely randomized design, test subjects are assigned to treatment and control groups using a random process. For example, in a clinical trial, we might use a computer program to label each subject with a number and then randomly select numbers for each group.
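
A completely randomized assignment like this can be sketched in a few lines of Python. The function name, the subject numbering from 1 to 20, and the fixed seed are all assumptions made for illustration; the seed only makes the example reproducible.

```python
import random


def completely_randomized(subject_ids, seed=None):
    """Shuffle the subjects and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)        # local generator, so the global state is untouched
    shuffled = list(subject_ids)     # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical trial with 20 subjects labeled 1 through 20
subjects = list(range(1, 21))
treatment_group, control_group = completely_randomized(subjects, seed=42)

print("treatment:", sorted(treatment_group))
print("control:  ", sorted(control_group))
```

Every subject lands in exactly one group, and every possible split into two groups of ten is equally likely.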

Sometimes, however, a completely randomized design might not be the most effective approach. When designing an experiment, data professionals must account for nuisance factors. These are factors that can affect the result of an experiment, but are not of primary interest to the researcher.

Blocking

Researchers can use a randomized block design to minimize the impact of known nuisance factors. Blocking means arranging test subjects into groups, or blocks, whose members are similar to one another with respect to the nuisance factor. In a block design, we first divide subjects into blocks, and then we randomly assign the subjects within each block to treatment and control groups.

For example, suppose we know that age is a significant factor in recovery time from the common cold. In particular, we know that people under the age of 35 tend to recover faster than older people. In this scenario, age is a nuisance factor because it might affect the results of our experiment. In a clinical trial with a completely randomized design and a small sample size, we might by chance get a large proportion of young people in the treatment group. This would make it more difficult to determine whether any observed decrease in recovery time is due to the treatment (medicine) or to the nuisance factor (age).
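
To get a feel for how much sample size matters here, we can compute the chance of a lopsided split exactly. The sketch below assumes a hypothetical trial in which half the subjects are young and half are older, half of all subjects are assigned to treatment, and "lopsided" means that at least 70% of the treatment group is young. These numbers and the function name are illustrative assumptions. The count of young subjects in the treatment group then follows a hypergeometric distribution.

```python
from math import comb


def prob_young_in_treatment_at_least(n_young, n_older, n_treatment, k_min):
    """Probability that a random treatment group contains at least k_min young subjects."""
    total = comb(n_young + n_older, n_treatment)   # all possible treatment groups
    k_max = min(n_young, n_treatment)
    favorable = sum(
        comb(n_young, k) * comb(n_older, n_treatment - k)
        for k in range(k_min, k_max + 1)
    )
    return favorable / total


# Small trial: 20 subjects, 10 in treatment, at least 7 of them young
p_small = prob_young_in_treatment_at_least(10, 10, 10, 7)
# Larger trial: 200 subjects, 100 in treatment, at least 70 of them young
p_large = prob_young_in_treatment_at_least(100, 100, 100, 70)

print(f"small trial: {p_small:.4f}")
print(f"large trial: {p_large:.2e}")
```

Under these assumptions the small trial ends up with a treatment group that is at least 70% young roughly 9% of the time, while in the larger trial the same imbalance is many orders of magnitude less likely.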

In this case, blocking for the age factor is a more effective way to design our experiment. First, we divide the test subjects into blocks based on age, such as 21-35, 36-50, and 51-65. Next, we randomly assign the subjects within each block to treatment and control groups. This way, if there is a significant difference in recovery time within a specific block, we can be more confident that this result is due to the treatment (medicine) and not to the nuisance factor (age).
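
The two steps of the block design can be sketched as follows. The subject list, the function names, and the seed are hypothetical; the age ranges are the three blocks from our example.

```python
import random
from collections import defaultdict

# Age blocks from the example, as inclusive (low, high) ranges
AGE_BLOCKS = [(21, 35), (36, 50), (51, 65)]


def age_block(age):
    """Return the block that contains the given age."""
    for low, high in AGE_BLOCKS:
        if low <= age <= high:
            return (low, high)
    raise ValueError(f"age {age} is outside every block")


def randomized_block_assignment(subjects, seed=None):
    """Map each subject id to 'treatment' or 'control', randomizing within age blocks.

    subjects is a list of (subject_id, age) pairs.
    """
    rng = random.Random(seed)
    # Step 1: divide the subjects into blocks by age
    blocks = defaultdict(list)
    for subject_id, age in subjects:
        blocks[age_block(age)].append(subject_id)
    # Step 2: within each block, shuffle and split into the two groups
    assignment = {}
    for block in AGE_BLOCKS:
        members = blocks[block]
        rng.shuffle(members)
        half = len(members) // 2
        for subject_id in members[:half]:
            assignment[subject_id] = "treatment"
        for subject_id in members[half:]:
            assignment[subject_id] = "control"
    return assignment


# Hypothetical subjects: four people in each age block
ages = [22, 25, 29, 33, 38, 41, 45, 50, 52, 57, 61, 64]
subjects = [(i + 1, age) for i, age in enumerate(ages)]
assignment = randomized_block_assignment(subjects, seed=7)

for subject_id, age in subjects:
    print(subject_id, age, assignment[subject_id])
```

Because the split happens inside each block, every age range contributes the same number of subjects to the treatment and control groups (up to one subject when a block has an odd size), so age cannot pile up on one side by chance.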

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate.