Ground Truth
The ground truth is the correct answer to the question the model was asked to solve. Knowing the ground truth for the predictions a model has made makes it possible to judge how well that model is performing.
In a spam detection model, for example, the ground truth would be whether a specific email was actually spam. In a recommendation engine, it would be whether the customer clicked on, or ultimately bought, one of the recommended products.
Even though the term contains "truth", ground truth is not always clear-cut: in the recommendation case, a click and a purchase can tell different stories about whether the recommendation was right.
Ground truth for newly scored data is not available at prediction time, which is why it is important to capture some of the live data at the point where its ground truth can be assessed and use it to retrain models.
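A minimal sketch of capturing that data, assuming a simple append-only file store and hypothetical `log_prediction` / `log_ground_truth` helpers (none of these names come from a specific library):

```python
import json
import uuid
from datetime import datetime, timezone

PREDICTION_LOG = "predictions.jsonl"    # hypothetical append-only stores
GROUND_TRUTH_LOG = "ground_truth.jsonl"

def log_prediction(features: dict, prediction: float) -> str:
    """Record a live prediction with a unique id so it can later be joined
    with its ground truth."""
    record_id = str(uuid.uuid4())
    record = {
        "id": record_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "features": features,
        "prediction": prediction,
    }
    with open(PREDICTION_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record_id

def log_ground_truth(record_id: str, label: int) -> None:
    """Record the ground truth once it is known (e.g., the email was
    reported as spam, or the customer bought the recommended product)."""
    with open(GROUND_TRUTH_LOG, "a") as f:
        f.write(json.dumps({"id": record_id, "label": label}) + "\n")
```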
Ground Truth Evaluation
The ground truth can be used to evaluate the performance of a model.
Collect samples of data for which the ground truth has become known and compare the model's predictions against it. If performance deviates too much from what is expected, the model needs to be retrained.
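A minimal sketch of that check, assuming the prediction/ground-truth pairs have already been joined (for instance via the logging sketch above) and using an accuracy threshold chosen purely for illustration:

```python
def needs_retraining(pairs: list[tuple[int, int]], threshold: float = 0.90) -> bool:
    """pairs: (prediction, ground_truth) for samples whose ground truth
    is now known. Returns True when accuracy drops below the threshold."""
    correct = sum(1 for prediction, truth in pairs if prediction == truth)
    accuracy = correct / len(pairs)
    print(f"accuracy on resolved samples: {accuracy:.3f}")
    return accuracy < threshold

# Example: three resolved spam predictions, one of them wrong.
print(needs_retraining([(1, 1), (0, 0), (1, 0)]))
```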
The metrics to be monitored can be of two varieties:
- Statistical metrics like accuracy, ROC AUC, log loss, etc. Since the model designer has probably already chosen one of these metrics to select the best model, it is a first-choice candidate for monitoring. For more complex models, where average performance is not enough, it may be necessary to look at metrics computed per subpopulation (see the sketch after this list).
- Business metrics, like cost-benefit assessment. For example, the credit scoring business has developed its own specific metrics.
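As a sketch of both flavors, using scikit-learn for the statistical metrics; the subpopulation labels and the cost/benefit figures are made-up placeholders, not values from any particular business:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([1, 0, 1, 1, 0, 0])               # resolved ground truth
y_prob = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1])   # model scores
groups = np.array(["new", "new", "existing", "existing", "new", "existing"])

# Statistical metrics, overall and per subpopulation.
print("ROC AUC :", roc_auc_score(y_true, y_prob))
print("log loss:", log_loss(y_true, y_prob))
for g in np.unique(groups):
    mask = groups == g
    accuracy = ((y_prob[mask] >= 0.5) == y_true[mask]).mean()
    print(f"accuracy ({g}): {accuracy:.2f}")

# Business metric: a simple cost-benefit assessment with placeholder values.
benefit_per_true_positive = 100.0   # e.g., loss avoided by a correct alert
cost_per_false_positive = 20.0      # e.g., cost of a needless manual review
y_pred = (y_prob >= 0.5).astype(int)
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
print("net benefit:", tp * benefit_per_true_positive - fp * cost_per_false_positive)
```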
When available, ground truth monitoring is the best solution. But it can be problematic in some cases:
- Ground truth can take a long time to become available; you can't wait months to find out whether your model is working.
- The system strongly decouples the model's input from its ground truth. The most obvious case is when the input sample is no longer available: both need to be available so they can be paired, packaged, and used for evaluation later.
- Ground truth is never resolved, or is simply not recorded.
- Ground truth is only partially available. It can be expensive to keep track of the ground truth for all the data, so some sampling is required, and the sampling logic itself can lead to a biased model (see the sketch after this list).
- There might not be quantitative ground truth. In general conversational models, for example, the ground truth is the user's satisfaction, so you are solely dependent on user feedback.
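For the partial-availability case, a stratified sample at least keeps the labeled subset representative of a known segmentation of the traffic. A minimal sketch, assuming the live records sit in a pandas DataFrame with a hypothetical `segment` column:

```python
import pandas as pd

def sample_for_labeling(df: pd.DataFrame, frac: float = 0.05,
                        by: str = "segment", seed: int = 0) -> pd.DataFrame:
    """Draw a stratified sample of live records to collect ground truth for,
    so each segment stays represented in proportion to its traffic."""
    return df.groupby(by).sample(frac=frac, random_state=seed)
```

Whether this removes the bias depends on the chosen segmentation actually capturing what matters; it is a mitigation, not a guarantee.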