Introduction to boosting
Boosting is a supervised learning technique where we build an ensemble of weak learners sequentially, with each consecutive base learner trying to correct the errors of the one before.
As a reminder, a weak learner is a model whose prediction is only slightly better than a random guess, and a base learner is any individual model in an ensemble.
This practice is similar to random forest and bagging. Like random forest, boosting is an ensembling technique, and it also builds many weak learners, then aggregates their predictions.
But there are some key differences. Unlike random forest, which builds base learners in parallel, boosting builds learners sequentially. This is because each new base learner in the sequence focuses on what the preceding learner got wrong. Another difference from random forest is that for boosting models, the methodology we choose for the weak learner isn’t limited to tree-based methods. (However, we will use tree-based implementations here, because these are common and effective ways of building boosting models.)
AdaBoost
There are various boosting methods available; we'll explore two of the most commonly used. The first is called Adaptive Boosting, or AdaBoost. AdaBoost is a tree-based boosting methodology in which each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner.
AdaBoost builds its first tree on training data that gives equal weight to each observation. Then the algorithm evaluates which observations were incorrectly predicted by this first tree.
It increases the weights for the observations that the first tree got wrong, and decreases the weights for those that it got right.
This process repeats until either a tree makes a perfect prediction or the ensemble reaches the maximum number of trees, which is a hyperparameter that is specified by the data professional.
Once all the trees have been built, the ensemble makes predictions by aggregating the predictions of every model in the ensemble.
Because AdaBoost can be used for both classification and regression problems, this final step is a little different depending on which type is being addressed. For classification, the ensemble uses a voting process that places weight on each vote. Base learners that make more accurate predictions are weighted more heavily in the final aggregation. For regression, the model calculates a weighted mean prediction for all the trees in the ensemble.
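For reference, here is a minimal sketch of how an AdaBoost classifier could be fit with scikit-learn; the synthetic dataset and hyperparameter values are purely illustrative.

```python
# A minimal AdaBoost sketch with scikit-learn; dataset and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# By default the base learner is a depth-1 decision tree (a "stump");
# n_estimators caps how many trees the ensemble may grow.
ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
```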
Pros and Cons of Boosting
There is one disadvantage to note about boosting. We can’t train our model in parallel across many different servers, because each model in the ensemble is dependent on the one that preceded it. This means that in terms of computational efficiency, it doesn’t scale well to very large datasets when compared to bagging. However, this generally isn’t a concern unless we’re working with particularly large datasets.
But there are many noteworthy advantages:
- One of the more accurate methodologies available today.
- The problem of high variance is reduced. This is because no single tree weighs too heavily in the ensemble.
- It reduces bias.
- It’s also easy to understand and doesn’t require the data to be scaled or normalized.
- Boosting can handle both numeric and categorical features, and it can still function well even with multicollinearity among the features.
- It’s robust to outliers.
Note that resilience to outliers is a major strength of all tree-based methodologies. This is because the model splits the data the same way, regardless of how extreme a value is. Here’s an example:
Suppose we have six elephants. Three are females that weigh 2,000, 2,500, and 3,000 kilograms, and three are males that weigh 4,000, 4,500, and 5,000 kilograms. If we drew a decision tree using this data, it would draw a decision boundary between males and females at 3,500 kilos, the midpoint between the weights of the heaviest female and the lightest male.
Now, suppose that instead of weighing 5,000 kilos, the last male elephant weighed 10,000 kilos. Our model would still divide males and females at 3,500 kilos. It doesn't matter that the last elephant's weight doubled.
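As a quick sanity check, here is a small sketch of the elephant example using a depth-one decision tree in scikit-learn; the weights come from the example above, and the code is only meant to show that the decision boundary ignores how extreme the outlier is.

```python
# A quick check of the elephant example: the split point ignores how extreme
# the heaviest weight is. Data values come from the example above.
from sklearn.tree import DecisionTreeClassifier

weights = [[2000], [2500], [3000], [4000], [4500], [5000]]   # kg
sex     = ["F", "F", "F", "M", "M", "M"]

tree = DecisionTreeClassifier(max_depth=1).fit(weights, sex)
print(tree.tree_.threshold[0])   # 3500.0, the midpoint between 3000 and 4000

# Replace the heaviest male with an extreme outlier: the boundary is unchanged.
weights[-1] = [10000]
tree = DecisionTreeClassifier(max_depth=1).fit(weights, sex)
print(tree.tree_.threshold[0])   # still 3500.0
```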
Gradient boosting machines
Gradient Boosting is different from Adaptive Boosting because instead of reweighting the observations that were incorrectly predicted, each base learner in the sequence is built to predict the residual errors of the model that preceded it.
Although GBMs do not have to be tree-based, tree ensembles are the most common implementation of this technique. There are two key features of tree-based gradient boosting that set it apart from other modeling techniques:
- It works by building an ensemble of decision tree base learners, wherein each base learner is trained successively, attempts to predict the error (also known as the "residual") of the previous tree, and thereby compensates for it.
- Its base learner trees are known as "weak learners" or "decision stumps" (trees with a single split). They are generally very shallow.
Imagine that the target is a continuous variable, so each base learner is a decision tree regressor. Let's start with a set of features X and a target variable Y.
- We’ll train the first base learner decision tree on this data and call it learner1.
- Learner1 makes its predictions, which we'll call y_hat_1.
- The residual errors of learner1's predictions are found by subtracting the predicted values from the actual values. Call this set of residual errors error_1 (so error_1 = Y - y_hat_1).
Now train a new base learner using the same X data but instead of the original Y data, use error_1 as the target. That’s because this learner is predicting the error made by learner1.
Call this new base learner learner2.
Learner2's predictions are assigned to error_hat_1.
Then compare learner2's predictions to the actual values and assign the difference to error_2 (so error_2 = error_1 - error_hat_1). In this case, the actual values are the errors made by learner1.
This process will continue for as many base learners as we specify. For now, repeat it just once more.
Stopping here results in an ensemble that contains three base learners. To get the final prediction for any new X, add together the predictions of all three learners.
Ensembles that use gradient boosting are called gradient boosting machines (GBM).
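Here is a from-scratch sketch of the three-learner walkthrough above, using depth-one regression trees from scikit-learn; the synthetic data is illustrative, and the variable names (learner1, error_1, and so on) mirror the text.

```python
# From-scratch sketch of the three-learner walkthrough above.
# The data here is synthetic and illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

# Learner 1 predicts y; its residuals become the next learner's target.
learner1 = DecisionTreeRegressor(max_depth=1).fit(X, y)
y_hat_1 = learner1.predict(X)
error_1 = y - y_hat_1

# Learner 2 predicts learner 1's errors.
learner2 = DecisionTreeRegressor(max_depth=1).fit(X, error_1)
error_hat_1 = learner2.predict(X)
error_2 = error_1 - error_hat_1

# Learner 3 predicts learner 2's errors.
learner3 = DecisionTreeRegressor(max_depth=1).fit(X, error_2)

# The ensemble's prediction for new data is the sum of all three learners.
X_new = np.array([[2.5], [7.5]])
final_prediction = (
    learner1.predict(X_new) + learner2.predict(X_new) + learner3.predict(X_new)
)
print(final_prediction)
```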
Pros and Cons of GBM
GBMs have many advantages:
- High accuracy: Many machine-learning competition winners succeeded largely because of the accuracy of their boosting models.
- Scalable: Even though their base learners are developed sequentially and therefore can't be trained in parallel like random forests, they still scale well to large datasets.
- Work well with missing data: The fact that a value is missing is viewed as valuable information. So GBMs treat missing values just like any other value when determining how to split a feature. This makes gradient boosting relatively easy to use with messy data.
- Because they are tree-based, GBMs don’t require the data to be scaled and they can handle outliers easily.
Gradient boosting also has its drawbacks:
- Have a lot of hyperparameters, and tuning them can be a time-consuming process.
- Can be difficult to interpret: GBMs can provide feature importance but unlike linear models, they do not have coefficients or directionality. They only show how important each feature is relative to the other features. *
- Have difficulty with extrapolation. **
- Prone to overfitting if not trained carefully: Usually this is caused by tuning too many hyperparameters, which can result in the trees growing to fit the training data, but not generalizing well to unseen data.
* Because of this, they're often called black-box models: models whose predictions cannot be precisely explained. In some industries such as medicine and banking, it's essential that our model's predictions be explainable. Therefore, GBMs are not well suited for some applications.
** Extrapolation is a model's ability to predict new values that fall outside of the range of values in the training data. For instance, suppose one loaf of bread costs one dollar, two loaves of bread cost two dollars, and three loaves cost three dollars. A linear regression model would have no trouble predicting that 10 loaves cost $10, but a GBM wouldn't be able to unless it saw the cost of 10 loaves in the training data.
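A small illustration of the bread-price idea: under the assumption that any off-the-shelf GBM behaves similarly here, scikit-learn's GradientBoostingRegressor is used as a stand-in, and the settings are illustrative.

```python
# Illustration of the extrapolation limit using the bread-price example above.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

loaves = np.array([[1], [2], [3]])
price = np.array([1.0, 2.0, 3.0])

linear = LinearRegression().fit(loaves, price)
gbm = GradientBoostingRegressor(n_estimators=50, learning_rate=0.5).fit(loaves, price)

print(linear.predict([[10]]))  # ~10: the fitted line extends beyond the training range
print(gbm.predict([[10]]))     # ~3: capped near the largest target seen in training
```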
An example of Gradient Boosting
Below is a table of how many bottles of water were sold on different days and the noontime temperature of each day.
| Temperature (℃) | Sales (bottles) |
| --- | --- |
| 14 | 72 |
| 17 | 80 |
| 20 | 98 |
| 23 | 137 |
| 26 | 192 |
| 29 | 225 |
| 32 | 290 |
| 35 | 201 |
| 38 | 95 |
| 41 | 81 |
Step one is to fit a model to the data. This example uses temperature in Celsius as a single X feature and sales as the target variable. The model is a regular decision tree that is only allowed to grow to a maximum depth of one (i.e., it only makes one split), to replicate the weak learners used by GBMs.
| index | temp (℃) | sales | tree 1 predictions | tree 1 error | tree 2 predictions | tree 2 error | tree 3 predictions | tree 3 error | final prediction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 14 | 72 | 80 | -8 | 4.5 | -12.5 | -4.5 | -8 | 80 |
| 1 | 17 | 80 | 80 | 0 | 4.5 | -4.5 | -4.5 | 0 | 80 |
| 2 | 20 | 98 | 80 | 18 | 4.5 | 13.5 | -4.5 | 18 | 80 |
| 3 | 23 | 137 | 192 | -55 | 4.5 | -59.5 | -4.5 | -55 | 192 |
| 4 | 26 | 192 | 192 | 0 | 4.5 | -4.5 | -4.5 | 0 | 192 |
| 5 | 29 | 225 | 192 | 33 | 4.5 | 28.5 | 7 | 21.5 | 203.5 |
| 6 | 32 | 290 | 192 | 98 | 4.5 | 93.5 | 7 | 86.5 | 203.5 |
| 7 | 35 | 201 | 192 | 9 | 4.5 | 4.5 | 7 | -2.5 | 203.5 |
| 8 | 38 | 95 | 192 | -97 | -104 | 7 | 7 | 0 | 95 |
| 9 | 41 | 81 | 192 | -111 | -104 | -7 | 7 | -14 | 95 |
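The table can be approximately reproduced with three depth-one regression trees, as in the sketch below; note that the exact split points and predicted values depend on how a given implementation chooses splits, so the numbers may differ slightly from those above.

```python
# Rebuilding the worked example with depth-1 regression trees.
# The exact splits and predictions may differ slightly from the table above,
# depending on how the implementation chooses split points.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

temp = np.array([[14], [17], [20], [23], [26], [29], [32], [35], [38], [41]])
sales = np.array([72, 80, 98, 137, 192, 225, 290, 201, 95, 81])

tree1 = DecisionTreeRegressor(max_depth=1).fit(temp, sales)
error1 = sales - tree1.predict(temp)          # what tree 1 got wrong

tree2 = DecisionTreeRegressor(max_depth=1).fit(temp, error1)
error2 = error1 - tree2.predict(temp)         # what remains after tree 2

tree3 = DecisionTreeRegressor(max_depth=1).fit(temp, error2)

final = tree1.predict(temp) + tree2.predict(temp) + tree3.predict(temp)
print(np.round(final, 1))
```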
Here are some figures that depict the predicted vs. actual number of bottles sold for each step of the process.
In the first graph, the Xs indicate the actual number of water bottles sold and the purple dots indicate the predicted number. Each vertical part of the line that connects the dots represents a split, or decision boundary, of the model. Notice that for a single tree there’s only one vertical line, because it only makes one split.
The next graph depicts the predictions of the first two trees vs. the actual values. Each predicted point represents the sum of the predictions of the first and second trees. The predictions match the actual values more closely, but the model still underfits the data.
The third graph depicts the next iteration of the model. Now there are three trees whose predictions are summed. Each additional base learner adds more nuance to the final prediction; it adds a decision boundary (represented here by the vertical segments of the yellow line), which allows the predictions to become more accurate. In this case, every sample is assigned to one of four different values, whereas in the previous example every sample was assigned to one of three different values.
Here is the prediction curve for 30 trees:
The predicted values of this model ensemble very closely match the actual values of the samples. Note, however, that they are not perfect predictions. Indeed, a single decision tree could fit this data perfectly if it were allowed to grow to a depth of five. However, a benefit of ensemble methods like gradient boosting is that they are less likely to overfit the data than a single decision tree.
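For anyone who wants to recreate these prediction curves, here is a sketch using scikit-learn's GradientBoostingRegressor and its staged_predict method; learning_rate=1.0 roughly mirrors the plain summing of tree predictions described above, and the plotting details are illustrative.

```python
# Sketch of how the staged prediction curves could be reproduced.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

temp = np.array([[14], [17], [20], [23], [26], [29], [32], [35], [38], [41]])
sales = np.array([72, 80, 98, 137, 192, 225, 290, 201, 95, 81])

# learning_rate=1.0 so each stump's prediction is added in full, as in the text.
gbm = GradientBoostingRegressor(n_estimators=30, max_depth=1, learning_rate=1.0)
gbm.fit(temp, sales)

grid = np.arange(14, 41.1, 0.1).reshape(-1, 1)
for n, pred in enumerate(gbm.staged_predict(grid), start=1):
    if n in (1, 2, 3, 30):                      # the stages shown above
        plt.step(grid.ravel(), pred, where="mid", label=f"{n} tree(s)")

plt.scatter(temp.ravel(), sales, marker="x", color="black", label="actual")
plt.xlabel("Temperature (℃)")
plt.ylabel("Sales (bottles)")
plt.legend()
plt.show()
```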
Tuning a GBM model
XGBoost stands for Extreme Gradient Boosting, and it is an optimized GBM package. Scikit-learn has its own, similar GBM implementation, but XGBoost is a commonly used gradient boosting package because it includes many useful optimizations: fast training, effective regularization, and tunable hyperparameters that can improve model predictions.
max_depth
This hyperparameter works the same way in XGBoost as it does in decision trees and random forests: it controls how deep each base learner tree is allowed to grow.
The best way to find this value is through cross-validation. The final max_depth value is usually low. The deeper the tree, the more a model learns feature interactions that could be very specific to the training data but may not generalize well to new information. Even shallow trees are powerful because of the ensemble.
Typical values for max_depth are 2-10, but this depends on the number of features and observations in the data.
n_estimators
This is the number of estimators, i.e., the maximum number of base learners that the ensemble will grow. It is best determined using grid search. For smaller datasets, more trees may be better than fewer; for very large datasets, the opposite could be true. Typical ranges are 50-500.
learning_rate
Remember that each time an ensemble builds a new base learner, it fits the data to the error from the previous model. In a basic implementation, the predictions of all the trees could then be summed to determine a final prediction. In this case, each tree’s prediction is considered equally important to the final prediction.
In practice, we use the learning rate to indicate how much weight the model should give to each consecutive base learner's prediction. Lower learning rates mean that each subsequent tree contributes less to the ensemble's final prediction. This helps prevent over-correction and overfitting.
Another common name for this concept is shrinkage, because each tree's correction is shrunk before being added to the ensemble, so each consecutive correction tends to affect the final prediction less than the one before it. Also, if we use a low learning rate, our model will often require more trees to compensate. Again, this is best determined using grid search. Typical values range from 0.01 to 0.3.
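To make shrinkage concrete, here is a small sketch that rebuilds the water-bottle ensemble with a learning rate of 0.1 and more trees to compensate; the values are illustrative.

```python
# A tiny sketch of shrinkage, continuing the water-bottle example: each
# residual-correcting stump's contribution is scaled by the learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

temp = np.array([[14], [17], [20], [23], [26], [29], [32], [35], [38], [41]])
sales = np.array([72, 80, 98, 137, 192, 225, 290, 201, 95, 81])

learning_rate = 0.1
prediction = np.full_like(sales, sales.mean(), dtype=float)  # start from the mean
for _ in range(30):
    residual = sales - prediction
    stump = DecisionTreeRegressor(max_depth=1).fit(temp, residual)
    prediction += learning_rate * stump.predict(temp)        # shrunken correction
print(np.round(prediction, 1))
```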
min_child_weight
A tree will not split a node if the split would result in any child node with less weight than what we specify in this hyperparameter; instead, that node becomes a leaf.
This is a regularization parameter, and values that are too high cause the model to underfit the data. The range of this setting is zero to infinity. If set between 0 and 1, the algorithm interprets the value as a percentage of our data, so 0.1 would mean that a node could not split unless each of its children would hold at least 10 percent of the training observations.
Generally, we can think of values greater than one as being equivalent to the number of observations in a child node, so a value of 10 would mean no child node could contain fewer than 10 observations.
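Putting these four hyperparameters together, here is a hedged sketch of a grid search with XGBoost's scikit-learn wrapper; it assumes the xgboost package is installed, and the synthetic data and search ranges are illustrative starting points rather than recommendations.

```python
# Sketch of tuning the hyperparameters above with XGBoost and grid search.
# Synthetic data and search ranges are illustrative only.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {
    "max_depth": [2, 4, 6],
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1, 0.3],
    "min_child_weight": [1, 5, 10],
}

search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))   # negative MSE on held-out data
```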
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was Google Advanced Data Analytics Professional Certificate program.