Verify performance using validation
Fitting a model to training data and evaluating it on test data might be an adequate way of estimating how well a single model generalizes to new data, but it’s not a recommended way to compare multiple models to determine which one is best. That’s because, by selecting the model that performs best on the test data, we never get a truly objective measure of future performance. The measure would be optimistic.
This may seem counterintuitive. We may ask ourselves: How is it not objective? After all, we’re not using the test data to tune our models. Well, if we compare how all of the models score on this data and then select the model with the best score as champion, we are, in a way, “tuning” another hyperparameter: the model itself! The selection of the final model behaves as a tuning step, because we’re using the test data to go back and make an upstream decision about our model.
Put simply, if we want to use the test data to get a true measure of future performance, then we must never use it to make a modeling choice; the test data should be used only to evaluate our final model. As data professionals, we will likely encounter scenarios where the test data is used to select a final model. It’s not best practice, but it’s unlikely to break our model. However, there are better, more rigorous ways of evaluating models and selecting a champion.
Model validation
One such way is through a process called validation. Model validation is the whole process of evaluating different models, selecting one, and then continuing to analyze the performance of the selected model to better understand its strengths and limitations.
This post will focus on evaluating different models and selecting a champion. Post-model-selection validation and analysis is a discipline unto itself and beyond the scope of this writing.
Validation sets
The simplest way to maintain the objectivity of the test data is to create another partition in the data, called a validation set, and save the test data until after we select the final model. The validation set is then used, instead of the test set, to compare different models.
This method (using a separate validation set to compare models) is most commonly used when we have a very large dataset. The reason for this is that the more data we use for validation, the less we have for training and testing. However, if we don’t have enough validation data, then our models’ scores cannot be expected to give a reliable measure that we can use to select a model, because there’s a greater chance that the distributions in the validation data are not representative of those in the entire dataset.
When building a model using a separate validation set, once the final model is selected, best practice is to go back and fit the selected model to all the non-test data (i.e., the training data + validation data) before scoring this final model on the test data.
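As a minimal sketch of this workflow (assuming scikit-learn; the dataset, candidate models, split sizes, and accuracy metric are illustrative choices rather than anything prescribed above), it might look like the following:

```python
# Minimal sketch of the validation-set workflow (illustrative assumptions:
# scikit-learn, a built-in dataset, two candidate models, accuracy as the metric).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Carve out the test set first, then split the remainder into train / validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# Compare the candidates on the validation set only; the test set stays untouched.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

champion_name = max(val_scores, key=val_scores.get)

# Refit the champion on all non-test data (train + validation) before the
# one and only evaluation on the test set.
champion = candidates[champion_name].fit(X_rest, y_rest)
test_score = accuracy_score(y_test, champion.predict(X_test))
print(champion_name, val_scores, test_score)
```

Note that the test set is touched exactly once, and only by the refitted champion.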
Cross-validation
There is another approach to model validation that avoids having to split the data into three partitions (train / validate / test) in advance. Cross-validation makes more efficient use of the training data by splitting it into k “folds” (partitions), training a model on k – 1 of the folds, and using the held-out fold to get a validation score.
The training process occurs k times, each time holding out a different fold as the validation set, so every fold serves as validation data exactly once. At the end, the final validation score is the average of all k fold scores. This process is commonly referred to as k-fold cross-validation.
After a model is selected using cross-validation, that selected model is then refitted to the entire training set (i.e., it’s retrained on all k folds combined).
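A brief sketch of this, again assuming scikit-learn (the estimator, k = 5, and the accuracy metric are arbitrary choices for illustration):

```python
# Minimal sketch of k-fold cross-validation followed by the refit on the full
# training set. The estimator, k = 5, and metric are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)

# Each of the 5 folds takes a turn as the held-out validation fold;
# the validation score is the mean of the 5 fold scores.
fold_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(fold_scores.mean(), fold_scores.std())

# After a model is selected, refit it to the entire training set
# (all k folds combined) before any final evaluation on the test set.
model.fit(X_train, y_train)
```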
Which validation technique?
Cross-validation reduces the likelihood of significant divergence of the distributions in the validation data from those in the full dataset. For this reason, it’s often the preferred technique when working with smaller datasets, which are more susceptible to randomness. The more folds we use, the more thorough the validation. But, adding folds increases the time needed to train, and may not be useful beyond a certain point.
However, cross-validation is not necessary when working with very large datasets: there is so much data that squeezing maximum utility out of every observation isn’t required, and training k models instead of one can actually be problematic depending on the computing resources at our disposal. If neither limited computing resources nor constraints in the data are an issue, cross-validation is almost always applied.
Model selection
Once we’ve trained and validated our candidate models, it’s time to select a champion. Of course, our models’ validation scores factor heavily into this decision, but score is seldom the only criterion. Often we’ll need to consider other factors too.
- How explainable is our model?
- How complex is it?
- How resilient is it against fluctuations in input values?
- How well does it perform on data not found in the training data?
- How computationally expensive is it to make predictions?
- Does it add much latency to any production system?
It’s not uncommon for a model with a slightly lower validation score to be selected over the highest-scoring model due to it being simpler, less computationally expensive, or more stable.
Once we have selected a champion model, it’s time to evaluate it using the test data. The test data is used only for this final model. Our model’s score on this data is how we can expect the model to perform on completely new data. Any changes we make to the model at this point that are based on its performance on the test data contaminate the objectivity of the score.
However, this does not mean that we can’t make changes to the model. For instance, we might want to retrain the champion model on the entire dataset (train + validate + test) so it makes use of all available data prior to deployment. This is acceptable, but we need to understand that at this point we have no way of meaningfully evaluating the model unless we acquire new data that the model hasn’t encountered.
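A hedged sketch of this final step (the champion and split below simply stand in for whatever came out of the earlier selection process):

```python
# Sketch of the final evaluation and optional full retrain. The model and data
# here are placeholders for an already-selected champion, not a prescription.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Assume this is the champion, already selected and refit on all non-test data.
champion = LogisticRegression(max_iter=5000).fit(X_rest, y_rest)

# One-time, final evaluation: this score is our estimate of future performance.
final_score = accuracy_score(y_test, champion.predict(X_test))
print(final_score)

# Optional: retrain on every row (train + validation + test) before deployment.
# After this refit there is no held-out data left to evaluate the model on.
deployed_model = LogisticRegression(max_iter=5000).fit(
    np.concatenate([X_rest, X_test]), np.concatenate([y_rest, y_test])
)
```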
The model development process
There is no single way to develop a model. Project-specific conditions will dictate the best approach. A rigorous approach to model development might use both cross-validation and a separate validation set.
The cross-validation can be used to tune hyperparameters, while the separate validation set lets us compare the scores of different algorithms (e.g., logistic regression vs. Naive Bayes vs. decision tree) to select a champion model. Finally, the test set gives us a benchmark score for performance on new data, as illustrated in the diagram below.
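A compressed sketch of that combined workflow might look like the following; the candidate algorithms, hyperparameter grids, and split sizes are assumptions made for illustration:

```python
# Sketch of the combined workflow: cross-validated hyperparameter tuning per
# algorithm, champion selection on a separate validation set, and a final test
# score. Algorithms, grids, and split sizes are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# One cross-validated hyperparameter search per candidate algorithm.
searches = {
    "logistic_regression": GridSearchCV(LogisticRegression(max_iter=5000),
                                        {"C": [0.1, 1.0, 10.0]}, cv=5),
    "naive_bayes": GridSearchCV(GaussianNB(),
                                {"var_smoothing": [1e-9, 1e-8]}, cv=5),
    "decision_tree": GridSearchCV(DecisionTreeClassifier(random_state=42),
                                  {"max_depth": [3, 5, None]}, cv=5),
}

# Tune each algorithm with cross-validation on the training data, then compare
# the tuned models on the validation set.
val_scores = {}
for name, search in searches.items():
    search.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, search.best_estimator_.predict(X_val))

champion_name = max(val_scores, key=val_scores.get)

# Only the champion is refit on all non-test data and scored on the test set,
# giving the benchmark for performance on new data.
champion = searches[champion_name].best_estimator_.fit(X_rest, y_rest)
print(champion_name, accuracy_score(y_test, champion.predict(X_test)))
```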
Some common variations of this process are described below.
A. The data is split into train and test sets. Models are trained on the training data, and all of them are scored on the test data. The model with the best score on the test data is selected as champion. This approach does not iteratively tune hyperparameters or test the champion model on new data. The champion model’s score on the test data would be an optimistic indicator of future performance.
B. The data is split into training, validation, and testing sets. Models are fit to the training data, and the champion model is the one that performs best on the validation set. This model alone is then scored on the test data. This is the same approach as example A, but with the added step of testing the champion model by itself to get a more reliable indicator of future performance.
C. This method splits the data into train and test sets. Models are trained and cross-validated using the training data. The model with the best cross-validation score is selected as the champion model, and this model alone is scored on the test data. The cross-validation makes the model more robust, and using the test data to evaluate only the champion model allows for a good understanding of future performance. However, selection of the champion model based on the cross-validation results alone increases the risk of overfitting the model to the training data.
D. In this variant, the data is split into training and testing sets. Models are trained and cross-validated using the training data, then all are scored on the test data. The model with the best performance on the test data is the champion. This is a very common approach, but note that it does not score the champion model on completely new data, so expected future performance may be optimistic. However, compared to approach C, this method mitigates the risk of overfitting the model to the training data.
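For concreteness, variant D might be sketched as follows (the candidate models and k = 5 are illustrative assumptions):

```python
# Minimal sketch of variant D: cross-validate candidates on the training data,
# then score all of them on the test set and pick the best test score.
# Candidate models and k = 5 are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    # Cross-validation scores on the training data (used here as a stability check)...
    cv_mean = cross_val_score(model, X_train, y_train, cv=5).mean()
    # ...then fit on the full training set and score on the test set.
    model.fit(X_train, y_train)
    results[name] = {"cv": cv_mean, "test": accuracy_score(y_test, model.predict(X_test))}

# The champion is the model with the best test score. As noted above, that same
# score is then an optimistic indicator of performance on completely new data.
champion_name = max(results, key=lambda name: results[name]["test"])
print(champion_name, results)
```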
Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources.
I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.