*Evaluating Model Performance*

Common approaches to evaluate machine learning model performance:

**Held-out validation:** Split off the validation set before you start training; the model is never shown the held-out set before the actual validation.

**Cross-validation:** Change the training and testing split after each iteration; the model will eventually see all of the cross-validation data, but it never knows the current split.

**Progressive validation:** Compare the predictions side-by-side with live data and the production model.

**Visualization:** A high average prediction accuracy can still hide a significant loss in a single use-case of a big machine learning model. Aggregate metrics make this hard to spot, so human-readable visualizations are a better approach to start debugging the problem. The best approach is to visualize slices by use-case, like "country" or "time of day" (e.g. how GridViz at Google does).

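A minimal sketch of the slicing idea, assuming plain Python and hypothetical records (the country labels and numbers are made up for illustration):

```python
from collections import defaultdict

# Hypothetical records: (slice label, prediction, actual)
records = [
    ("US", 5, 3), ("US", 10, 12),
    ("DE", 15, 12), ("DE", 8, 8),
]

# Group absolute errors by slice so that a weak slice stands out
# even when the overall average looks fine.
errors_by_slice = defaultdict(list)
for country, prediction, actual in records:
    errors_by_slice[country].append(abs(prediction - actual))

for country, errors in errors_by_slice.items():
    print(country, sum(errors) / len(errors))
```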
# Deviation

**Deviation is the difference between two or more values,** e.g. between a prediction and the actual value, and is thus a measure of accuracy or performance. Deviation is frequently simply called "error".

**Deviation is useful mainly for the accuracy of continuous variables.** It is less useful for classification predictions.

**Root-mean-square error (RMSE) is by far the most common deviation.** More details below.

## Mean Absolute Error (MAE)

MAE is one of the simplest measures of deviation.

```
the average of the absolute errors:
n = the number of i pairs
SUM(ABS(prediction_i - actual_i)) / n
e.g.
predictions = [5, 10, 15]
actuals = [3, 12, 12]
deviations = [5-3, 10-12, 15-12]
= [2, -2, 3]
absolute_deviations = [ABS(2), ABS(-2), ABS(3)]
= [2, 2, 3]
MAE = (2 + 2 + 3) / 3
= 7 / 3
= 2.333333333...
```
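The calculation above as a short Python sketch (the function name is my own):

```python
def mean_absolute_error(predictions, actuals):
    """Average of the absolute differences between paired values."""
    n = len(predictions)
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / n

print(mean_absolute_error([5, 10, 15], [3, 12, 12]))  # ≈ 2.3333
```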

## Mean Bias Error (MBE)

**The same as MAE, but without taking the absolute values of the deviations.** Usually intended to measure average model bias, but it requires careful interpretation because positive and negative errors cancel each other out.
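A sketch of how the cancellation plays out on the same numbers as the MAE example (the function name is my own):

```python
def mean_bias_error(predictions, actuals):
    # Signed errors: positive and negative deviations cancel out.
    return sum(p - a for p, a in zip(predictions, actuals)) / len(predictions)

# Deviations are [2, -2, 3]: the +2 and -2 cancel, leaving 3 / 3 = 1.0,
# even though the MAE on the same data is ≈ 2.33.
print(mean_bias_error([5, 10, 15], [3, 12, 12]))  # 1.0
```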

## Mean Absolute Scaled Error (MASE)

MASE is a good choice if your accuracy must be comparable with accuracies based on different datasets: the error is scaled to be unit-free, so the score has the same meaning regardless of the data's scale.
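A sketch, assuming the common time-series form of MASE, where the error is scaled by the mean absolute difference between consecutive actual values (i.e. the in-sample MAE of a naive "previous value" forecast):

```python
def mean_absolute_scaled_error(predictions, actuals):
    n = len(actuals)
    mae = sum(abs(p - a) for p, a in zip(predictions, actuals)) / n
    # Scale by the MAE of a naive forecast that predicts the previous value.
    naive_mae = sum(abs(actuals[i] - actuals[i - 1]) for i in range(1, n)) / (n - 1)
    return mae / naive_mae
```

A MASE below 1 means the model beats the naive forecast on average; above 1 means it does worse.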

## Mean Squared Error (MSE)

**Square-based deviations are sensitive to outliers.** They are used because large errors have a big effect on the score while small errors have little effect, which makes them more useful when large errors are particularly undesirable.

```
e.g.
predictions = [5, 10, 15]
actuals = [3, 12, 12]
deviations = [5-3, 10-12, 15-12]
= [2, -2, 3]
squared_deviations = [2^2, (-2)^2, 3^2]
= [4, 4, 9]
MSE = (4 + 4 + 9) / 3
= 17 / 3
= 5.666666666...
```
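The same calculation as a Python sketch (the function name is my own):

```python
def mean_squared_error(predictions, actuals):
    """Average of the squared differences between paired values."""
    n = len(predictions)
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / n

print(mean_squared_error([5, 10, 15], [3, 12, 12]))  # ≈ 5.6667
```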

## Root-Mean-Square Error (RMSE)

```
the square root of the average of squared errors
n = the number of i pairs
SQRT(SUM((prediction_i - actual_i)^2) / n)
e.g.
predictions = [5, 10, 15]
actuals = [3, 12, 12]
MSE = 5.666666666...
RMSE = SQRT(5.666666666...)
= 2.380476142...
```
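The same calculation as a self-contained Python sketch (the function name is my own):

```python
import math

def root_mean_square_error(predictions, actuals):
    # RMSE is simply the square root of the MSE.
    n = len(predictions)
    mse = sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / n
    return math.sqrt(mse)

print(root_mean_square_error([5, 10, 15], [3, 12, 12]))  # ≈ 2.3805
```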

## Normalized Root-Mean-Square Error (NRMSE)

**NRMSE is useful if you are comparing accuracies across different datasets.** Plain RMSE is scale-dependent, so it has no direct meaning between datasets or models with different scales; in that respect NRMSE is similar to MASE.

```
NRMSE = RMSE / (max_value_in_actuals - min_value_in_actuals)
e.g.
actuals = [3, 12, 12]
RMSE = SQRT(5.666666666...)
= 2.380476142...
NRMSE = 2.380476142... / (12 - 3)
= 0.264497349...
```
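A Python sketch, assuming the convention used in the example above of normalizing by the range of the actual values (other conventions exist, e.g. normalizing by the mean; the function name is my own):

```python
import math

def normalized_rmse(predictions, actuals):
    n = len(predictions)
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / n)
    # Normalize by the range of the actual values to make the score scale-free.
    return rmse / (max(actuals) - min(actuals))

print(normalized_rmse([5, 10, 15], [3, 12, 12]))  # ≈ 0.2645
```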

## Coefficient of Variation of Root-Mean-Square Error, CV(RMSE)

The same as NRMSE, but it uses the mean of the actual values for normalization. It is easily mixed up with NRMSE, and both answer a similar question: how large is the error relative to the scale of the data.

```
CV(RMSE) = RMSE / mean_of_actuals
e.g.
actuals = [3, 12, 12]
RMSE = SQRT(5.666666666...)
= 2.380476142...
CV(RMSE) = 2.380476142... / ((3 + 12 + 12) / 3)
= 0.264497349...
```
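A Python sketch of the same calculation (the function name is my own):

```python
import math

def cv_rmse(predictions, actuals):
    n = len(predictions)
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / n)
    # Normalize by the mean of the actual values instead of their range.
    return rmse / (sum(actuals) / len(actuals))

print(cv_rmse([5, 10, 15], [3, 12, 12]))  # ≈ 0.2645
```

On this toy data the range of the actuals (12 - 3 = 9) happens to equal their mean (27 / 3 = 9), so NRMSE and CV(RMSE) coincide; on real data they generally differ.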