Decision trees can be good predictors. But we also know that they’re prone to overfitting and they’re very sensitive to variations in the training data. We can solve these problems by using the wisdom of the crowd.
Ensemble learning (or ensembling) involves building multiple models and then aggregating their outputs to make a final prediction.
A best practice when building an ensemble is to use very different methodologies for each model it contains, such as a logistic regression, a Naive Bayes model, and a decision tree classifier. In this way, when the models make errors (and they always will), the errors will be uncorrelated. The goal is for them to not all make the same errors for the same reasons.
But there’s another way to build an ensemble, a way that uses the same methodology for every contributing model. In this kind of ensemble, each individual model that makes up the ensemble is called a base learner. For this method to work, we usually need a lot of base learners, and each is trained on a unique random subset of the training data.
If the base learners were all trained on exactly the same data, their errors would be highly correlated, which weakens the ensemble. (A base learner whose predictions are only slightly better than a random guess is called a weak learner.) To address this, data professionals use a technique called bagging, which gives each base learner its own random subset of the data. The word bagging comes from bootstrap aggregating.
Bootstrapping
Remember from statistics that bootstrapping refers to sampling with replacement. That’s what happens during bagging too: each base learner samples from the data with replacement. For bagging, this means that different base learners can sample the same observation, and a single learner can even sample the same observation multiple times during training.
Suppose we have a dataset of 1,000 observations, and we bootstrap sample it to generate a new dataset of 1,000 observations. On average, we should find about 632 of those observations in our sampled dataset (~63.2%).
Alper’s Note: How to explain this?
Sampling with Replacement:
- We start with a dataset of 1,000 observations.
- We randomly pick one observation at a time and add it to our bootstrap sample.
- Since we replace the observation before picking the next one, the same observation can appear multiple times in the bootstrap sample.
Expected Coverage (~63.2%):
- Since each of the 1,000 draws in the bootstrap sample is made independently, there’s a chance that some original observations are never selected.
- The probability that any given observation is not selected in a single draw is:
P(not chosen) = 1 − (1 / 1000) = 0.999
- Since you do this 1,000 times, the probability that an observation is never selected is:
(0.999)^1000 ≈ e^(-1) ≈ 0.368
- This means about 36.8% of the original observations are not included in the bootstrap sample, leaving around 63.2% that do appear at least once.
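As a quick sanity check, here is a minimal NumPy sketch (my own, not from the source) that draws one bootstrap sample from 1,000 observations and counts how many unique observations it contains; the seed is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
n = 1_000                        # size of the original dataset

# Draw one bootstrap sample: n indices sampled with replacement
bootstrap_idx = rng.integers(low=0, high=n, size=n)

unique_count = np.unique(bootstrap_idx).size
print(f"Unique observations: {unique_count} of {n} (~{unique_count / n:.1%})")
# Typically prints something close to 632 (~63.2%)
```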
Ensemble learning
Building a single model with bootstrapped data probably wouldn’t be very useful. To use the example above, if we start with 1,000 unique observations and use bootstrapping to create a sampled dataset of 1,000 observations, we’d only expect to get an average of 632 unique observations in that new dataset. This means that we’d lose whatever information was contained in the 368 observations that didn’t make it into the new sampled dataset.
This is when ensemble learning, or ensembling, comes to the rescue. Ensemble learning refers to building multiple models and aggregating their predictions. Sure, those 368 observations might not make it into that particular sampled dataset, but if we keep repeating the bootstrapping process (once for each base learner) eventually our overall ensemble of base learners will see all of the observations.
Alper’s Note: How to explain this?
Each Base Learner Sees a Different Subset
- In Bagging, multiple base learners (e.g., decision trees in a random forest) are trained on different bootstrap samples.
- Since each bootstrap sample includes only ~63.2% of unique observations, each base learner is trained on a slightly different dataset.
What Happens to the 36.8% That Are Left Out?
- Although one bootstrap sample misses about 36.8% of the data, the next bootstrap sample is independently drawn.
- The observations that were missing in one sample have a chance of being included in another.
Over Many Bootstrap Samples, All Observations Get Used
- Since bagging involves training multiple models (e.g., 100 trees in a Random Forest), each new base learner sees a new, slightly different dataset.
- Over multiple bootstrap iterations, every observation will almost certainly appear in at least some of the bootstrap samples, even if it was skipped in others.
- This ensures that no data point is completely ignored by the ensemble.
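Here is a small follow-up simulation (again my own sketch, with arbitrary sizes) that tracks how many of the original observations have been seen at least once as more bootstrap samples are drawn:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n = 1_000                        # original dataset size
n_learners = 10                  # number of bootstrap samples / base learners

seen = np.zeros(n, dtype=bool)   # has each observation appeared in any sample yet?
for i in range(1, n_learners + 1):
    idx = rng.integers(0, n, size=n)   # one bootstrap sample
    seen[idx] = True
    print(f"After {i:2d} bootstrap samples: {seen.mean():.1%} of observations seen")
# Coverage climbs quickly: ~63% after 1 sample, ~99% after 5, essentially 100% after 10
```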
Aggregating
The aggregation part of bagging refers to the fact that the predictions of all the individual models are aggregated to produce a final prediction. For regression models, this is typically the average of all the predictions. For classification models, it’s often whichever class receives the most predictions, which is the mode.
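To make the aggregation step concrete, here is a minimal NumPy sketch (my own, with made-up predictions) that averages regression outputs and takes a majority vote for binary classification labels:

```python
import numpy as np

# Hypothetical predictions from three base learners for four observations
reg_preds = np.array([[2.1, 0.5, 3.3, 1.0],
                      [1.9, 0.7, 3.0, 1.2],
                      [2.3, 0.6, 3.6, 0.9]])
clf_preds = np.array([[1, 0, 1, 1],
                      [1, 0, 0, 1],
                      [0, 0, 1, 1]])

# Regression: average the predictions across base learners
print(reg_preds.mean(axis=0))                      # [2.1  0.6  3.3  1.03...]

# Classification: majority vote, i.e. the mode of the predicted labels
print((clf_preds.mean(axis=0) > 0.5).astype(int))  # [1 0 1 1]
```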
When bagging is used with decision trees, we get a random forest. A random forest is an ensemble of decision tree base learners that are trained on bootstrapped data. The base learners’ predictions are all aggregated to determine a final prediction.
Random forest takes the randomization from bagging one step further. A regular decision tree model seeks the single best feature to split each node. A random forest model grows each of its trees by considering only a random subset of the available features at each node and then splitting that node on the best feature within the subset. This means that each base learner in a random forest model has different combinations of features available to it, which helps to prevent the problem of correlated errors between learners in the ensemble.
Each individual base learner is a decision tree. It may be fully grown, so that each leaf contains a single observation, or it may be very shallow, depending on how we choose to tune the model. Ensembling many base learners helps reduce the high variance that we typically get from a single decision tree.
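In scikit-learn, this setup corresponds roughly to the sketch below; the synthetic data and hyperparameter values are my own illustrative choices, not from the source:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data purely for illustration
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of decision-tree base learners
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample
    random_state=0,
)
rf.fit(X_train, y_train)
print(f"Test accuracy: {rf.score(X_test, y_test):.3f}")
```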
Ensemble learning is powerful because it combines the results of many models to help make more reliable final predictions, and these predictions typically have lower variance than those of a single standalone model without a meaningful increase in bias.
Example: bagging vs. single decision tree
Here is some test data taken from a dataset containing two classes:
And here is a comparison of the predictions on this test data made by a single decision tree versus the predictions made by an ensemble of 50 decision trees using bagging (where the Xs indicate incorrect predictions):
Notice that the single decision tree got 11 predictions wrong out of 60, an accuracy score of 81.7%. Meanwhile, the ensemble of decision trees with bagging only got 6 wrong, an accuracy score of 90%. Bagging improved accuracy by more than 8 percentage points (roughly a 10% relative improvement).
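The dataset behind this example isn’t reproduced here, but a comparison along the same lines can be sketched in scikit-learn; the synthetic moons data, noise level, and split below are my own assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data stands in for the example's dataset
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,       # an ensemble of 50 trees, as in the example
    random_state=0,
).fit(X_train, y_train)

print(f"Single tree accuracy:  {single_tree.score(X_test, y_test):.3f}")
print(f"Bagged trees accuracy: {bagged_trees.score(X_test, y_test):.3f}")
```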
Another way to examine the results of these models is to plot their decision boundaries:
Notice that the decision boundary for a tree-based model isn’t linear like those of logistic regression or Naive Bayes models. This is illustrative of decision trees’ tendency to fit (and overfit) the data. The single decision tree has many more “decision islands,” or areas where one class is surrounded by the other class.
For the single tree, even slightly different training data would likely result in a very different decision boundary plot, which is indicative of greater variance. Because bagging aggregates the predictions of many different trees, its resulting decision boundary is more stable, reflecting the model’s lower variance.
Why use it
- Reduces variance: Standalone models can result in high variance. Aggregating base models’ predictions in an ensemble helps reduce it.
- Fast: Training can happen in parallel across CPU cores and even across different servers.
- Good for big data: Bagging doesn’t require an entire training dataset to be stored in memory during model training. We can set the sample size for each bootstrap to a fraction of the overall data, train a base learner, and string these base learners together without ever reading in the entire dataset all at once.
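For example, scikit-learn’s BaggingClassifier exposes both of these knobs; this is a minimal sketch, and the 10% sample fraction is an arbitrary illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.1,   # each base learner sees a bootstrap sample of only 10% of the rows
    n_jobs=-1,         # train base learners in parallel across all CPU cores
    random_state=0,
)
bag.fit(X, y)
```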
Alper’s Note: Bonus Explanation in four questions
Question 1
How many bootstrap samples (base learners) do we need to ensure that every data point gets used at least once?
a. Probability of an Observation Never Being Selected:
We know that the probability of one specific observation NOT being included in a single bootstrap sample is:
P(not chosen in one sample) ≈ e^(-1) ≈ 0.368
So, if we perform N independent bootstrap samples (i.e., train N base models), the probability that this specific observation is never selected in any of them is:
P(never chosen in any of N samples) = (0.368)^N
b. How Many Bootstrap Samples (N) Are Needed?
We want to find N such that this probability becomes very small—let’s say less than 1% (0.01):
(0.368)^N < 0.01
Taking the natural logarithm on both sides:
N ⋅ ln(0.368) < ln(0.01)
N ⋅ (−1) < −4.6 (since ln(0.368) ≈ −1 and ln(0.01) ≈ −4.6)
N > 4.6 (dividing both sides by a negative number flips the inequality)
c. Conclusion:
After about 5 bootstrap samples, each data point has a 99% chance of being included at least once in the ensemble.
If we want to be even more certain, using 10-20 bootstrap samples practically guarantees that every observation is seen at least once.
This is partly why Random Forests often use hundreds of trees: with that many diverse training subsets, it is all but certain that no data point is left out.
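A quick numeric check of this calculation (my own sketch, not from the source):

```python
import math

p_miss_one = math.exp(-1)  # ~0.368: probability an observation misses one bootstrap sample
for n_samples in (1, 5, 10, 20):
    p_seen = 1 - p_miss_one ** n_samples
    print(f"N = {n_samples:2d}: P(seen at least once) ≈ {p_seen:.4f}")
# Coverage rises from ~63% with one sample to over 99% with five
```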
Question 2
How Does Bootstrapping Reduce Variance?
The key idea behind bagging (Bootstrap Aggregating) is that by training multiple models on different bootstrapped datasets and averaging their predictions, we reduce the overall variance of the final model. Here’s why:
Independent Errors Cancel Out
- Each base learner (e.g., a decision tree) is trained on a slightly different dataset due to bootstrapping.
- Individual models might overfit their specific training data, but each one overfits in a different way because they see different subsets.
- When we average predictions (or take a majority vote in classification), the random errors from each model tend to cancel each other out, leading to lower overall variance (see the simulation sketch below).
Analogy: Wisdom of the Crowd
- If one person makes a guess, it might be biased. But if 100 people make independent guesses, the average is usually more accurate.
- Similarly, combining multiple base models reduces the risk that any one model is overly influenced by noise.
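The “errors cancel out” point can be demonstrated with a tiny simulation (my own sketch, using made-up independent noise rather than real base learners): averaging 100 noisy estimates has roughly 1/100 the variance of a single estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 3.5
n_models, n_trials = 100, 10_000

# Each "model" makes an independent, noisy estimate of the true value
estimates = true_value + rng.normal(scale=1.0, size=(n_trials, n_models))

single = estimates[:, 0]            # predictions from one model
averaged = estimates.mean(axis=1)   # ensemble average across 100 models

print(f"Variance of a single model:     {single.var():.4f}")   # ~1.0
print(f"Variance of the 100-model mean: {averaged.var():.4f}") # ~0.01
```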
Question 3
Isn’t Bootstrapping Changing the Dataset?
Yes, but in a useful way! Bootstrapping doesn’t alter the underlying distribution; it just creates diverse training sets.
What Stays the Same?
- The original dataset remains unchanged; we’re just sampling from it.
- The distribution of features is still representative of the full dataset.
What Changes?
- Each base model sees a different version of the dataset due to sampling with replacement.
- Some data points appear multiple times, while others might be left out in a given bootstrap sample.
- This randomness decorrelates the base models, making the final ensemble more stable.
Question 4: Sum up
We are not changing the data itself, but how is it that the resampled distribution doesn’t bias the results (especially when some points can appear more often than others)?
- Yes, individual models are slightly biased due to sampling variation, but the ensemble as a whole isn’t.
- Example: If a data point is missing in one sample, it will likely appear in another. So while a single tree may give biased predictions due to the sample it saw, the combined effect across all trees cancels out this bias.
- The ensemble averages out those biases, keeping the overall model unbiased.
- Analogy: Imagine rolling a die. If we roll just once, we might get a 6, which is not representative of the true probabilities. But if we roll 100 times, the average result will be much closer to the expected value of 3.5.
- The real effect is variance reduction, leading to better generalization.
- Bagging (bootstrap aggregating) does not significantly increase bias because each model is trained on roughly the same underlying distribution (just with slight variations).
- The primary impact is variance reduction. By averaging over multiple models, we minimize the risk that a single model’s high-variance error dominates.
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate program.