Previously we discussed ensemble learning (or ensembling), which involves building multiple models and then aggregating their outputs to make a final prediction. We also saw how decision trees lead us to random forests: an ensemble that uses the same methodology for every contributing model.
A random forest is an ensemble of decision trees whose predictions are aggregated to determine a final prediction. Each tree in a random forest model uses bootstrapping to randomly sample the observations in the training data with replacement. This means that different trees in the model can train on the same observation, and a single tree can sample the same observation more than once.
Bootstrapping is a critical component of random forest models. It ensures that every base learner in the ensemble is trained on different data while allowing each learner to train on a dataset that’s the same size as the original training data. Because each tree’s training data contains duplicated observations, it will also be missing some of the observations from the original training data set.
One more important principle of random forest models is that all trees in the ensemble are trained on a random subset of the available features in the data set. No single tree sees all the features. Again, this is to introduce another element of randomness and ensure that each tree is as different from the others as possible.
One of the main weaknesses of decision trees is that they are very sensitive to the data they are trained on, which makes them prone to overfitting. Randomizing both the observations and the features used by each base learner means that no single tree can overfit to all the data, because no single tree sees all the data. In fact, the individual trees tend to underfit: they have higher bias, but together they can be very powerful predictors that are more stable than a regular single decision tree.
In addition, random forests are very scalable. All the base learners they rely on can be trained in parallel on different processing units, even across many different servers. And just like decision trees, random forest models need to be tuned to find the combination of hyperparameter settings that results in the best predictions.
Bagging + random feature sampling = random forest
We now know that bootstrap aggregating (or bagging) can be an effective way to make predictions by building many base learners that are each trained on bootstrapped data and then combining their results. If we build a bagging ensemble of decision trees but take it one step further by randomizing the features used to train each base learner, the result is called a random forest.
Why randomize?
Random forest models leverage randomness to reduce the likelihood that a given base learner will make the same mistakes as other base learners. When the learners’ mistakes are uncorrelated, the ensemble’s variance is reduced. In bagging, this randomization occurs by training each base learner on a random sample of the observations, drawn with replacement.
Let’s consider a dataset with five observations: 1, 2, 3, 4, and 5. If we were to create a new, bootstrapped dataset of five observations from this original data, it might look like 1, 1, 3, 5, 5. It’s still five observations long, but some observations are missing and some are counted twice. The result is that the base learners are trained on data that is randomized by observation.
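Here is a minimal sketch of that idea using NumPy; the five-observation array and the seed are just for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
observations = np.array([1, 2, 3, 4, 5])

# Sample with replacement to build a bootstrapped dataset of the same size.
bootstrap_sample = rng.choice(observations, size=len(observations), replace=True)
print(bootstrap_sample)  # e.g. something like [1 1 3 5 5]: some values repeat, others are missing
```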
Random forest goes further. It randomizes the data by features too. This means that if there are five available features: A, B, C, D, and E, we can set the model to only sample from a subset of them. In other words, each base learner will only have a limited number of features available to it, but what those features are will vary between learners.
Here’s an example to illustrate how this might work when combining bootstrapping and feature sampling. The sample below contains five observations and four features from a larger dataset related to cars.
| Model | Year | Kilometers | Price |
| --- | --- | --- | --- |
| Honda Civic | 2007 | 54,000 | $2,739 |
| Toyota Corolla | 2018 | 25,000 | $22,602 |
| Ford Fiesta | 2012 | 90,165 | $6,164 |
| Audi A4 | 2013 | 86,000 | $21,643 |
| BMW X5 | 2019 | 30,000 | $67,808 |
If we were to build a random forest model of 3 base learners, each trained on bootstrapped samples of 3 observations and 2 features, it may result in the following three samples:
Sample 1:

| Kilometers | Price |
| --- | --- |
| 54,000 | $2,739 |
| 54,000 | $2,739 |
| 90,165 | $6,164 |

Sample 2:

| Year | Kilometers |
| --- | --- |
| 2012 | 90,165 |
| 2013 | 86,000 |
| 2019 | 30,000 |

Sample 3:

| Model | Price |
| --- | --- |
| Honda Civic | $2,739 |
| Ford Fiesta | $6,164 |
| Ford Fiesta | $6,164 |
Notice each sample contains three observations of just two features, and it’s possible that some of the observations may be repeated (because they’re sampled with replacement). A unique base learner would then be trained on each sample.
In practice, we’ll have much more data, so there will be a lot more available to grow each base learner. But as we can imagine, randomizing the samples of both the observations and the features of a very large dataset allows for a near-infinite number of combinations, thus ensuring that no two training samples are identical.
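For reference, here’s a rough pandas sketch of how one such bootstrapped, feature-sampled training set could be drawn. The DataFrame simply re-creates the toy car data above, and the sample sizes and random seeds are arbitrary:

```python
import pandas as pd

cars = pd.DataFrame({
    "Model": ["Honda Civic", "Toyota Corolla", "Ford Fiesta", "Audi A4", "BMW X5"],
    "Year": [2007, 2018, 2012, 2013, 2019],
    "Kilometers": [54000, 25000, 90165, 86000, 30000],
    "Price": [2739, 22602, 6164, 21643, 67808],
})

n_obs, n_feats = 3, 2  # 3 bootstrapped observations, 2 randomly chosen features

# Bootstrap the rows (with replacement), then randomly pick a subset of columns.
sample = (
    cars.sample(n=n_obs, replace=True, random_state=1)      # randomize observations
        .sample(n=n_feats, axis="columns", random_state=1)  # randomize features
)
print(sample)
```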
How does all this sampling affect predictions?
The effect of all this sampling is that the base learners each see only a fraction of the possible data that’s available to them. Surely this would result in a model that’s not as good as one that was trained on the full dataset, right?
No! In fact, not only can model scores improve with sampling, but the models also take significantly less time to train, since each tree is built from less data.
Here is a comparison of five different models, each trained and 5-fold cross-validated on the bank churn dataset from earlier. The full training data had 7,500 observations and 10 features. Aside from the bootstrap sample size and number of features sampled, all other hyperparameters remained the same. The accuracy score is from each model’s performance on the test data.
| Model | Bootstrap sample size | Features sampled | Accuracy score | Runtime |
| --- | --- | --- | --- | --- |
| Bagging | 100% | 10 | 0.8596 | 15m 49s |
| Bagging | 30% | 10 | 0.8692 | 7m 41s |
| Random forest | 100% | 4 | 0.8704 | 8m 19s |
| Random forest | 30% | 4 | 0.8736 | 4m 53s |
| Random forest | 5% | 4 | 0.8652 | 3m 41s |
The bagging model with only 30% bootstrapped samples performed better than the one that used 100% samples, and the random forest model that used 30% bootstrapped samples and just 4 features performed better than all the others. Not only that, but runtime was cut by nearly 70% using the random forest model with 30% bootstrap samples.
It may seem counterintuitive, but we can often build a well-performing model with even smaller bootstrap samples. Take, for example, the random forest model above whose base learners were each built from just 5% samples of the training data. It was still able to achieve an accuracy score of 0.8652, not much worse than the champion model.
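As a hedged sketch, a comparison like the one in the table could be set up in scikit-learn roughly as follows. The synthetic data here merely stands in for the churn dataset, the n_estimators value is arbitrary, and the exact scores and runtimes will of course differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn training data (7,500 rows, 10 features).
X, y = make_classification(n_samples=7500, n_features=10, random_state=0)

# Bagging: every tree can use all 10 features; each trains on a 30% bootstrap sample.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `base_estimator=` on older scikit-learn versions
    n_estimators=100,
    max_samples=0.3,
)

# Random forest: 30% bootstrap samples, 4 features considered when looking for each split.
forest = RandomForestClassifier(
    n_estimators=100,
    max_samples=0.3,
    max_features=4,
)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```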
Tuning a random forest
Recall that decision trees continue splitting until one of a set of stopping conditions is met:
- All leaf nodes are pure
- A minimum leaf size or maximum depth is reached
- A target performance metric is achieved
Recall that settings such as these are known as hyperparameters. They can be tuned to improve model performance because they directly affect how the model is fit to the data.
Some of the most important hyperparameters in a decision tree are described below, with a short scikit-learn sketch after the list:
- max_depth specifies how many levels the tree can have and ultimately determines how many splits it can make. Remember, every time a node splits, our data gets separated into smaller subsets. The model is drawing another decision boundary.
- min_samples_leaf defines the minimum number of samples a leaf node must contain. With min_samples_leaf, a split can only occur if it guarantees at least that many observations in each resulting node.
- min_samples_split sets the minimum number of samples a node must contain in order to be split; any node with fewer samples becomes a leaf.
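For reference, a minimal scikit-learn sketch using these three settings; the values are arbitrary placeholders, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,           # at most 5 levels of splits
    min_samples_leaf=10,   # every leaf must contain at least 10 observations
    min_samples_split=20,  # a node needs at least 20 observations to be split
)
# tree.fit(X_train, y_train)  # X_train, y_train assumed to be defined elsewhere
```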
Random forest models have the same hyperparameters because they are ensembles of decision trees. These hyperparameters control how the learner trees are grown. But random forests also have some other hyperparameters which control the ensemble itself.
- max_features specifies the number of features that each tree selects randomly from the training data to determine its splits.
For example, if we have a dataset with features A, B, C, D, and E, and we build a random forest model with max_features set to three, our first tree might use features A, C, and E to determine its splits, the next tree might use features B, D, and E, and so on.
- n_estimators (number of estimators) controls how many decision trees our model will build for its ensemble.
For example, if we set our number of estimators to 300, our model will train 300 individual trees. If we’re building regression trees, the model’s final prediction would be the average of the predictions of all 300 trees. If we’re building classification trees, the final prediction would be determined by whichever class receives the majority vote from the 300 individual trees.
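To make the regression case concrete, here is a small sketch on synthetic data. Scikit-learn’s RandomForestRegressor exposes its fitted trees through estimators_, and averaging their individual predictions reproduces the ensemble’s output; the dataset and settings below are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=500, n_features=5, random_state=0)

forest = RandomForestRegressor(n_estimators=300, random_state=0)
forest.fit(X, y)

# The ensemble's prediction is the average of its 300 individual trees' predictions.
tree_preds = np.stack([tree.predict(X[:3]) for tree in forest.estimators_])
print(tree_preds.mean(axis=0))  # manual average across the trees
print(forest.predict(X[:3]))    # matches, up to floating-point rounding
```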
For random forest models, performance will typically increase as trees are added to the ensemble, but only to a certain point. After this point, improvement will level off and adding new trees will only increase our computing time. This happens because the new trees will become very similar to existing trees so they won’t contribute anything new to the model.
As a final point, many data professionals build models without hand-setting each hyperparameter. In fact, when using scikit-learn, the model might perform well with no hyperparameters specified at all, because the library ships with effective default settings. Even so, remember to make the most of grid search to iterate toward the best combination.
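A hedged sketch of what that grid search might look like; the grid values are placeholders, and X_train/y_train are assumed to hold the training data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", 4],
    "min_samples_leaf": [1, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,  # evaluate candidate models in parallel
)
# grid.fit(X_train, y_train)                  # X_train, y_train assumed to be defined elsewhere
# print(grid.best_params_, grid.best_score_)  # best hyperparameter combination and its CV score
```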
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate program.