🤖 Machine Learning - Drift
Model retraining is not a question of if, but of when.
Drift signals that assumptions made during model development may no longer hold in production. The model was trained to detect specific patterns in the data, and those patterns have changed.
The most common types of drift are:
- Feature Drift: the statistical properties of individual features change over time.
- Concept Drift: the relationship between the features and the target variable changes over time, e.g. sudden seasonal changes in product demand, or gradual changes due to sensor wear and tear.
- Data Drift: the distribution of the features changes over time, e.g. a change in the distribution of customer demographics.
- Label Drift: the distribution of the target variable changes over time, e.g. a change in the proportion of spam to non-spam emails.
Drift can also be broadly categorized as "sudden" or "gradual."
Drift Detection
Often no automatic action can be taken when drift is detected; detection is instead a signal for the data scientist to investigate further.
Data Drift Detection
There are two frequent root causes of data drift:
- Sample Selection Bias, where the training sample is not representative of the population.
- Non-stationary Environment, where the environment changes over time, so training data collected from the source population no longer represents the target population.
Data drift detection compares recent production data against the training data. A significant difference between the two distributions suggests that model performance may have degraded.
Statistical tests such as the Kolmogorov-Smirnov test and the Chi-squared test can detect data drift:
- For continuous features, the Kolmogorov-Smirnov Test is a nonparametric hypothesis test used to check whether two samples come from the same distribution. It measures a distance between the empirical distribution functions.
- For categorical features, the Chi-squared Test is a practical choice that checks whether the observed frequencies for a categorical feature in the target data match the expected frequencies seen from the training data.
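The two tests above can be sketched with SciPy. This is a minimal example on synthetic data: the feature samples, category counts, and the 0.01 significance threshold are illustrative assumptions, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Continuous feature: training-time sample vs. a production sample
# whose mean has shifted (synthetic drift).
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
prod_feature = rng.normal(loc=0.5, scale=1.0, size=5000)

# Kolmogorov-Smirnov test: distance between the two empirical CDFs.
ks_stat, ks_p = stats.ks_2samp(train_feature, prod_feature)
drift_detected = ks_p < 0.01  # reject "same distribution"

# Categorical feature: compare observed production counts against
# expected counts derived from the training proportions.
train_counts = np.array([700, 200, 100])  # category frequencies in training
prod_counts = np.array([500, 300, 200])   # category frequencies in production
expected = train_counts / train_counts.sum() * prod_counts.sum()
chi2_stat, chi2_p = stats.chisquare(f_obs=prod_counts, f_exp=expected)
cat_drift = chi2_p < 0.01
```

In practice these tests are run per feature on a rolling window of production data, and the significance threshold is tuned to control the false-alarm rate.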
Domain Classifier
A domain classifier is a model that tries to predict the origin of a sample; whether it comes from the training dataset or the production dataset.
If the domain classifier performs well, the training and production datasets are distinguishable, and data drift has likely occurred.
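A minimal domain-classifier check might look like the following, assuming scikit-learn is available; the feature matrices, the amount of shift, and the 0.75 AUC threshold are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Hypothetical feature matrices: the production features have shifted.
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))
X_prod = rng.normal(loc=0.8, scale=1.0, size=(1000, 4))

# Label each sample by its origin: 0 = training, 1 = production.
X = np.vstack([X_train, X_prod])
y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])

# If the classifier separates the two origins well (AUC well above 0.5),
# the distributions differ and data drift is likely.
auc = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc").mean()
drift_suspected = auc > 0.75
```

An AUC near 0.5 means the classifier cannot tell the datasets apart, which is the desired outcome when no drift has occurred.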
Solving Drift
- Reweighting the training dataset: Adjust the weights of the training data to reflect the current distribution. For example, if customers above 60 now represent 60% of users but were only 30% in the training set, then double their weight and retrain the model.
- Maintain multiple models: Deploy multiple models and select the best-performing one based on the most recent data or predefined criteria. For example, train a separate model to handle seasonal fluctuations.
- More feature engineering: Continuously update the features used in the model to capture the new patterns. For example, add a feature that indicates the time of year.
- More regular model re-training: Periodically retrain the model with recent data to ensure it remains accurate.
- Use online learning: Update the model incrementally as new data comes in, allowing it to adapt to changes in real-time.
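The reweighting strategy above can be sketched as a small calculation: each group's weight is its production share divided by its training share, reproducing the "double the weight" example. The group names and the second group's shares are illustrative assumptions.

```python
# Per-group shares in the training set vs. in current production traffic.
train_share = {"over_60": 0.30, "under_60": 0.70}
prod_share = {"over_60": 0.60, "under_60": 0.40}

# Importance weight per group = production share / training share.
weights = {g: prod_share[g] / train_share[g] for g in train_share}
# over_60 -> 2.0 (doubled, as in the example above); under_60 -> ~0.57
```

These per-group weights would typically be expanded to per-sample weights and passed to the model's training routine (e.g. a `sample_weight` argument, where the library supports one).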