Drift
Drift means that assumptions made during model development may no longer hold in production. The model was trained to detect specific patterns in the data, and those patterns have changed.
The most common types of drift are:
- Feature Drift: The statistical properties of the input features change over time.
- Concept Drift: The statistical properties of the relationship between the input features and the target variable change over time. Examples: sudden seasonal changes in product demand; or gradual changes in sensor wear and tear.
- Data/Covariate Shift: The distribution of the input features changes over time. Example: changes in the distribution of customer demographics.
- Label Drift: The distribution of the target variable changes over time. Example: a change in the proportion of spam to non-spam emails over time.
Drift can also be broadly categorized as "sudden" or "gradual."
Drift Detection
In most cases, detecting drift does not trigger an automatic fix; it is a signal for the data scientist to investigate further.
Data Drift Detection
There are two frequent root causes of data drift:
- Sample selection bias, where the training sample is not representative of the population.
- Non-stationary environment, where the environment changes over time, so data collected at training time no longer represents the current population.
Data drift detection compares recent production data against the training data. Significant differences between the two suggest that model performance may have degraded.
Statistical tests such as the Kolmogorov-Smirnov test and the Chi-squared test can detect data drift.
- For continuous features, the Kolmogorov-Smirnov test is a nonparametric hypothesis test used to check whether two samples come from the same distribution. It measures a distance between the empirical distribution functions.
- For categorical features, the Chi-squared test is a practical choice that checks whether the observed frequencies for a categorical feature in the target data match the expected frequencies seen from the training data.
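The two tests above can be sketched with SciPy. This is a minimal illustration, not a production monitoring setup: the sample sizes, category counts, and the 0.05 significance threshold are all assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Continuous feature: training sample vs. drifted production sample.
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
prod_feature = rng.normal(loc=0.5, scale=1.0, size=1000)  # mean has shifted

# Kolmogorov-Smirnov test: a small p-value suggests the two samples
# do not come from the same distribution.
ks_stat, ks_p = stats.ks_2samp(train_feature, prod_feature)
print(f"KS statistic={ks_stat:.3f}, p-value={ks_p:.3g}")

# Categorical feature: compare observed production counts against the
# frequencies expected from the training distribution (categories A, B, C).
train_counts = np.array([700, 200, 100])
prod_counts = np.array([500, 300, 200])
expected = train_counts / train_counts.sum() * prod_counts.sum()

chi2_stat, chi2_p = stats.chisquare(prod_counts, f_exp=expected)
print(f"Chi2 statistic={chi2_stat:.3f}, p-value={chi2_p:.3g}")

# Flag drift if either test rejects at the (illustrative) 0.05 level.
drift_detected = ks_p < 0.05 or chi2_p < 0.05
```

In practice these tests run per feature on a rolling window of production data, and multiple-testing corrections matter when monitoring many features at once.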
Combating Drift
- Reweight the training dataset according to the drifted feature. For example, if customers above 60 now represent 60% of users but made up only 30% of the training set, double their weight and retrain the model.
- More regular model re-training: This is the most common approach, periodically retraining the model with recent data.
- Maintain multiple models: Deploy several models and select the best performer on the most recent data, or by a manually defined rule. For example, train a separate model to handle seasonal fluctuations.
- More feature engineering: Continuously update the features used in the model to capture new patterns. For example, add a feature that indicates the time of year.
- Use online learning: This is a technique where the model is updated as new data comes in.
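The reweighting strategy above can be sketched in a few lines of NumPy. The age-group feature and the 30%/60% proportions are hypothetical, matching the example; the resulting weights would be passed as `sample_weight` to a library that supports it when retraining.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary feature per training row: True = customer over 60.
# Roughly 30% of the training set, as in the example above.
is_over_60 = rng.random(1000) < 0.30

train_prop = is_over_60.mean()  # observed training proportion (~0.30)
prod_prop = 0.60                # proportion now seen in production

# Importance weight per row: production share / training share, so the
# reweighted training set mimics the production mix. Over-60 rows get
# roughly double weight (0.60 / 0.30), the rest are down-weighted.
weights = np.where(is_over_60,
                   prod_prop / train_prop,
                   (1 - prod_prop) / (1 - train_prop))

# Sanity check: the weighted proportion of over-60 rows matches production.
weighted_prop = (weights * is_over_60).sum() / weights.sum()
print(f"weighted over-60 proportion: {weighted_prop:.2f}")
```

By construction the weighted proportion equals the production proportion exactly; in a real pipeline the same ratio trick extends to multi-valued categorical features or binned continuous ones.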