Each dataset presents its own characteristics, and the data professional must tailor the model to them. The best models neither underfit nor overfit the data. They identify intrinsic patterns within it, but do not capture randomness or noise. One way to help achieve this balance is through the use of hyperparameters.
Hyperparameter tuning
A popular and widely used technique for improving a model's performance after it is first built is hyperparameter tuning. Hyperparameters are aspects of a model that we set before the model is trained, and that affect how the model fits the data. They are not learned from the data itself. Hyperparameter tuning is the process of adjusting these settings to build the model that best fits the data.
Note that we'll often hear hyperparameters referred to simply as "parameters," much like we might hear the word "theory" used to mean "idea," when it actually has a very specific scientific meaning. Strictly speaking, a model's parameters are learned from the data during training, while its hyperparameters are set beforehand. The loose usage is okay, as long as we understand the difference.
When we built a K-means model before, we set the value of K to produce different cluster results. Each time we changed that value, we were performing hyperparameter tuning.
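To make this concrete, here is a minimal sketch of that idea, assuming scikit-learn's KMeans and a synthetic dataset (both assumptions; the original example may have used different data and tooling). The point is simply that K, called n_clusters here, is chosen before fitting:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn from 4 blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# K (n_clusters) is a hyperparameter: we choose it before fitting.
for k in [2, 4, 6]:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X)
    # inertia_ is the within-cluster sum of squares; lower means tighter clusters.
    print(f"K={k}: inertia={model.inertia_:.1f}")
```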
Hyperparameters for decision trees
max_depth
One of the most basic hyperparameters for a decision tree is max_depth. Setting this hyperparameter places a limit on how deep the tree can grow.
Setting a value for max_depth can help reduce overfitting by limiting how deep the tree will go. It can also reduce the computational cost of training and using the model.
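As a minimal sketch of the effect, again assuming scikit-learn and synthetic data (the cap of 4 is an arbitrary value for illustration), we can compare an unconstrained tree with one capped at max_depth=4:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data for illustration.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare an unconstrained tree with one capped at max_depth=4.
for depth in [None, 4]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: actual depth={tree.get_depth()}, "
          f"train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")
```

An unconstrained tree will typically score near-perfectly on the training split while doing worse on the test split; capping the depth narrows that gap.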
min_samples_split
min_samples_split is the minimum number of samples a node must have for it to split into child nodes. For example, if we set this to 10, then any node containing nine or fewer samples automatically becomes a leaf node; it will not continue splitting.
min_samples_leaf
min_samples_leaf is similar to min_samples_split, but instead of defining the minimum number of samples the parent node must have before splitting, it defines the minimum number of samples that must be in each child node after the parent splits.

For example, suppose min_samples_leaf is set to three and consider a small tree: the right branch becomes a leaf at depth 1, while the left branch continues splitting until its leaf nodes reach class purity at depth 2. The right branch stops splitting because splitting it further would leave one of the child nodes with just a single sample, which is below the threshold of three.
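Here is a minimal sketch showing both stopping rules set at once, using the values from the text (scikit-learn and the synthetic dataset are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data for illustration.
X, y = make_classification(n_samples=200, random_state=1)

# min_samples_split=10: any node with 9 or fewer samples becomes a leaf.
# min_samples_leaf=3: no split may leave a child with fewer than 3 samples.
tree = DecisionTreeClassifier(min_samples_split=10,
                              min_samples_leaf=3,
                              random_state=1)
tree.fit(X, y)
print("depth:", tree.get_depth(), "| leaves:", tree.get_n_leaves())
```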
Finding the optimal set of hyperparameters
The values that these hyperparameters can take are limited only by the number of samples in our dataset. That leaves open the possibility for millions of combinations! How do we know what the values should be? The answer is to train a lot of different models to find out. There are a number of ways to do this. Performing a grid search is one of the more popular methods.
Grid search
A grid search is a technique that will train a model for every combination of preset ranges of hyperparameter values. The aim is to find the combination of values that results in a model that both fits the training data well and generalizes well enough to predict accurately on unseen data. After all these models have been trained, we then compare them to find this ideal model, if it exists.
When performing a grid search, the first step is to specify the hyperparameters we want to tune and the set of values we want to search over for each one. The search then trains and evaluates a model for each combination of those values, continuing until every combination has been tried. We can try any values, and any number of values, if we believe the benefits are worth the cost of our computing time.
Here is a very basic example of how a grid search might be used to find the best combination of two hyperparameters: max_depth and min_samples_leaf. We can define the set of max_depth values as [6, 8, None] and the set of min_samples_leaf values as [1, 3, 5], which gives 3 × 3 = 9 combinations to evaluate.
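A minimal sketch of this search, assuming scikit-learn's GridSearchCV and a synthetic dataset (the grids match the text; everything else is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=1000, random_state=0)

# The grid from the text: 3 x 3 = 9 combinations.
param_grid = {
    "max_depth": [6, 8, None],
    "min_samples_leaf": [1, 3, 5],
}

# Each combination is trained and cross-validated; the best is kept.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV score:", round(search.best_score_, 3))
```

The cross-validation built into the search is what checks the "generalizes well" half of the goal: each candidate is scored on data it was not trained on.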
Alternative approaches
With more hyperparameters and a more expansive array of values to search over, grid searches can quickly become computationally expensive. One helpful search strategy is to try a wider array of values for each hyperparameter—say, [3, 6, 9], instead of [3, 4, 5]. If the best model has 6 as the value for this hyperparameter, perhaps try another grid search using [5, 6, 7] as potential values. This technique uses multiple search iterations to progressively home in on an optimal set of values.
Another technique is to define a more comprehensive set of search values from the beginning—say, [3, 4, 5, 6, 7, 8, 9]—and let the model train for what could be a very long time. Which approach we take will depend on our computing environment, our computational resources, and how much time we have.
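The coarse-to-fine strategy above can be expressed as two successive searches. This is a hypothetical sketch (scikit-learn is assumed, and the grid values follow the text):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Pass 1: a coarse, widely spaced grid.
coarse = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [3, 6, 9]}, cv=5).fit(X, y)
best = coarse.best_params_["max_depth"]

# Pass 2: a finer grid centered on the coarse winner.
fine = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_depth": [best - 1, best, best + 1]},
                    cv=5).fit(X, y)
print("refined max_depth:", fine.best_params_["max_depth"])
```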
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.