🤖 Machine Learning Systems

Updated at 2023-12-29 19:07

To improve the results of a machine learning system,
improve the productivity of humans who operate it.

A machine learning system is an end-to-end machine learning solution, from pre-processing the data to serving predictions.

Minimal features of a machine learning system:

It extracts features from data. (aka. Data Preparation)
It feeds features to training to produce models. (aka. Training)
It uses these models to make predictions. (aka. Deployment)

But a production machine learning system usually involves:

 1. Gathering training data
 2. Applying feature extraction
 3. Training the model
 4. Verifying that the model is good
 5. Deploying the model

All of this is usually dubbed "the pipeline."
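
As a rough illustration, the five steps above can be chained as plain functions. Everything here is hypothetical: the toy messages, the word list, the one-threshold "model" and all function names are assumptions made for this sketch, not a real framework.

```python
from statistics import mean

# Hypothetical list of "spammy" words used by the toy feature extractor.
SPAM_WORDS = {"win", "free", "money", "prize", "claim"}

def gather_training_data():
    # Step 1: toy raw records of (message text, is_spam label).
    return [
        ("win money now", 1),
        ("free free prize", 1),
        ("claim your free money", 1),
        ("lunch at noon?", 0),
        ("meeting notes attached", 0),
        ("see you tomorrow", 0),
    ]

def extract_features(text):
    # Step 2: a single numeric feature, the share of spammy words.
    words = [w.strip("?!.,") for w in text.lower().split()]
    return sum(w in SPAM_WORDS for w in words) / max(len(words), 1)

def train(examples):
    # Step 3: the "model" is one threshold halfway between the class means.
    spam = [x for x, y in examples if y == 1]
    ham = [x for x, y in examples if y == 0]
    return (mean(spam) + mean(ham)) / 2

def predict(model, x):
    return int(x > model)

def verify(model, examples, min_accuracy=0.9):
    # Step 4: refuse to deploy a model that is not good enough.
    # A real pipeline would verify on held-out data, not the training set.
    correct = sum(predict(model, x) == y for x, y in examples)
    return correct / len(examples) >= min_accuracy

def run_pipeline():
    raw = gather_training_data()
    examples = [(extract_features(text), label) for text, label in raw]
    model = train(examples)
    if not verify(model, examples):
        raise RuntimeError("model failed verification, not deploying")
    # Step 5: "deploying" here just means returning a prediction function.
    return lambda text: predict(model, extract_features(text))

serve = run_pipeline()
print(serve("free money"), serve("notes from lunch"))
```

Note that serving reuses the same extract_features function as training, which is the property the later sections on skew ask for.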

A machine learning system can be minimal. The system can consist of only a few pieces of software running on your laptop. That isn't very scalable though.

Note that using machine learning-based tools does not automatically make the solution a machine learning system.

For example, if you have an application that sends prompts to the OpenAI or Anthropic APIs, that is not a machine learning system. This kind of applied machine learning can suffer from similar issues as a machine learning system, but it's not one, as it's missing integral parts like training (there is no learning).

Machine learning systems are inherently more complex than normal software. Machine learning systems have all the same code complexity issues as normal software, but also have high system-level complexity that will cause compounding technical debt.

Make sure that data scientists and infrastructure engineers communicate. Code and pipelines are more manageable if engineers and researchers work closely together, or are the same person.

Data scientists must go beyond optimizing the business metrics. Prediction bias, fairness, transparency and accountability are as important as security and privacy.

Security doesn't directly generate business value but, as we have recently noticed, data breaches can cost a company a lot of money in plummeting stocks and lawsuits.

Privacy is the same: no direct business value, but you don't want to find yourself in the middle of a lawsuit.

Fairness is achieved by segmenting your test data and evaluation. You monitor prediction error rates per dataset segment.
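
A minimal sketch of per-segment monitoring, assuming each evaluation record carries a segment name next to the true and predicted label. The segment names, the toy records and the 0.1 gap are made-up values for illustration only.

```python
from collections import defaultdict

def error_rate_per_segment(records):
    """records: iterable of (segment, true_label, predicted_label)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for segment, truth, predicted in records:
        totals[segment] += 1
        errors[segment] += int(truth != predicted)
    return {segment: errors[segment] / totals[segment] for segment in totals}

def unfair_segments(rates, max_gap=0.1):
    # Flag segments whose error rate exceeds the best segment by more than max_gap.
    best = min(rates.values())
    return sorted(s for s, rate in rates.items() if rate - best > max_gap)

records = [
    ("millennial", 1, 1), ("millennial", 0, 0),
    ("millennial", 1, 1), ("millennial", 0, 0),
    ("senior", 1, 0), ("senior", 0, 0),
]
rates = error_rate_per_segment(records)
print(rates)
print(unfair_segments(rates))
```

An overall error rate of 1/6 would hide that half of the predictions for the smaller segment are wrong.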

Say you have a health care model suggesting treatments, but your dataset has a lot more millennials than senior citizens. Without proper normalization, this will cause more errors for senior citizen treatments. The predictions become unfair for senior citizens.

If you have a recruitment model that is mostly trained and evaluated on data about people of white ethnicity, it might start discriminating against other ethnicities if the data distribution is not kept in check. The model essentially becomes racist.

It is not uncommon for face detection models to poorly detect people of African ethnicity, or even to mislabel them as animals, as Google Photos sometimes did in 2015.

Who is responsible for a biased and unfair model? The question of accountability is hard to answer. Is it the data scientist who trained the model or the manager who chose the business metrics to optimize? Whoever is accountable, the blame will fall on the company using the model.

Other features of a good machine learning system:

Pipelining with dynamic, non-blocking task creation on the fly.
Task input files should act as futures and continue when available.
Support for different execution times and resource requirements.
Support for different data sources.
Transparent fault tolerance where task crashes just get restarted.
Low prediction latency for high-throughput serving.

Entanglement

Data is as important as code. The behavior of a machine learning system is not specified directly in code but is learned from data. This makes the system innately tightly coupled with the data used, which usually changes a lot over time.

Entanglement is unavoidable. Machine learning models create entanglement and make the isolation of improvements effectively impossible.

If you take a 100-feature health record neural network model and add a 101st feature, you risk the performance of the whole system even if the code itself stays the same. It can be done but requires tweaking, retraining and a lot of testing.

No input or configuration value is independent. Adding, removing or modifying features or hyperparameters can cause a cascade of changes in the model. Machine learning is fundamentally tightly coupled with itself and the data it utilizes.

If nothing is provided, some form of centralizing configuration management usually materializes by itself.

Solution Strategies:

Use multiple separate models. Works if the problem decomposes naturally into subproblems, and you can combine the partial results into a meaningful outcome.
Visualize relationships between features. See effects across many dimensions and slices. Working on a slice-by-slice basis may be extremely helpful. High-dimension plotting might also be useful.
Test relationships between features. For example, try removing a feature and see how the model behaves.
Calculate the cost of each feature. e.g. a feature causes a lot of latency or RAM increase but provides only a small accuracy improvement.

Unstable Data Dependencies

Unstable data sources change qualitatively or quantitatively over time. A common implicit example is using the predictions of another machine learning model, as its accuracy will undoubtedly change over time. It can also happen explicitly when ownership of the input signal is separate from the engineering ownership of the model that consumes it.

Better signals are not always an improvement. Even improvements to input quality might have negative effects on your own predictions.

Solution Strategies:

Training data should be sanity checked. What to check depends on the use-case, e.g. unit test for NaNs and infinite values that are fed into your model.
Limit sample value ranges. e.g. feature A is an integer that takes values from 1 to 5.
Keep track of feature distributions. e.g. the most common value of feature B is "Harry" and it accounts for 10% of all values.
Educate people designing the models. Don't use predictions straight from other machine learning systems.
Add versioning to the data source. It might be expensive, but it's essential to allow freezing a version if it is potentially used as input to other models.
Monitor upstream feature stability. e.g. alert if one source stops sending data or signal provider does a major version upgrade.
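
The first three checks above can be sketched as small predicates. The function names, the 5% tolerance and the sample data are assumptions for this example; what to check in practice depends on your data.

```python
import math
from collections import Counter

def check_finite(values):
    # Reject NaN and infinite values before they reach the model.
    return all(math.isfinite(v) for v in values)

def check_int_range(values, low, high):
    # e.g. feature A must be an integer from 1 to 5 (bool is excluded on purpose).
    return all(
        isinstance(v, int) and not isinstance(v, bool) and low <= v <= high
        for v in values
    )

def top_value_share(values):
    # Most common value and the share of samples it accounts for.
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

def distribution_drifted(values, expected_value, expected_share, tolerance=0.05):
    # True when the most common value or its share moved away from expectations.
    value, share = top_value_share(values)
    return value != expected_value or abs(share - expected_share) > tolerance

names = ["Harry"] * 2 + [f"name{i}" for i in range(18)]
print(check_finite([0.5, float("nan")]))
print(check_int_range([1, 3, 5], 1, 5))
print(distribution_drifted(names, "Harry", 0.10))
```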

Underutilized Data Dependencies

Data dependencies with little modeling benefit should be removed. It's the same with data as with code: underutilized dependencies provide little value but make the system unnecessarily vulnerable to changes.

How you get underutilized data dependencies:

Legacy Features: Feature F was required earlier, but as time goes on, other added features have made F redundant.
Bundled Features: Under deadline pressure, a bunch of features are added together and are seen to be beneficial as a group. This most likely makes some features redundant or totally useless.
ε-Features: As researchers, it's satisfying to improve the model accuracy even a tiny bit. This can cause unnecessary complexity.
Correlated Features: Two features F1 and F2 have high correlation, and this usually means one of them is more causal and prominent. There is no automatic way to detect this, and the two features are credited equally; or in the worst case, the non-causal one is credited more.

Underutilized features make models brittle. As the system has been trained to take these features into account, it might lead to catastrophic failures in the future if the weight of the feature is high.

Solution Strategies:

Educate data scientists about underutilized data dependencies. When they understand the risks, they can choose if it's worth the added complexity.
Regularly evaluate and remove individual features. Train models that each have one of the features removed, or train models with just one or two features.
Make sure that features manually flagged as unsuitable are not used. e.g. one researcher determines a feature as unreliable but other co-workers still keep on using it in other models. This can also be solved by automatic feature management discussed later.
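
A leave-one-feature-out evaluation could look roughly like this. The nearest-centroid classifier is only a stand-in chosen to keep the sketch dependency-free; the function names and the toy data are assumptions, and you would plug in your own training and scoring code.

```python
def nearest_centroid_accuracy(train, test, features):
    """Train a nearest-centroid classifier on the selected feature indices and
    return its accuracy on the test rows. Rows are (feature_list, label)."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += x[f]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        return min(
            centroids,
            key=lambda y: sum((x[f] - c) ** 2 for f, c in zip(features, centroids[y])),
        )

    return sum(predict(x) == y for x, y in test) / len(test)

def ablation_report(train, test, n_features):
    # For each feature: how much accuracy is lost when only that feature is removed.
    all_features = list(range(n_features))
    baseline = nearest_centroid_accuracy(train, test, all_features)
    report = {}
    for f in all_features:
        kept = [g for g in all_features if g != f]
        report[f] = baseline - nearest_centroid_accuracy(train, test, kept)
    return baseline, report

# Feature 0 separates the classes, feature 1 is constant and carries no signal.
train_rows = [([0.0, 5.0], "a"), ([0.2, 5.0], "a"), ([1.0, 5.0], "b"), ([0.8, 5.0], "b")]
test_rows = [([0.1, 5.0], "a"), ([0.9, 5.0], "b")]
baseline, report = ablation_report(train_rows, test_rows, n_features=2)
print(baseline, report)
```

A feature whose removal costs nothing, like feature 1 here, is a candidate for removal.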

Static Analysis of Data Dependencies

It's hard to apply static analysis to data dependency debt.

Nobody alone knows the status of every single feature.
Nobody alone knows all the places where a single feature is used.

You can't follow references of data like you can with code. A single data point can be just a partial input to form a feature.

Are there any production solutions using old binaries that rely on the data?

Solution Strategies:

Encode expectations about the data in a "schema". Schemas can be automatically checked, e.g. alert if the most common word in English text is not 'the' while training or serving.
Enforce meta-level feature requirements. e.g. don't use features derived from user data or don't use a deprecated feature.
Combine these into an automated feature management system. Enable data sources and features to be annotated. Automated checks can be run to ensure that all dependencies have the appropriate annotations, and dependency trees can be resolved.
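
A minimal sketch of the schema idea, using the most-common-word expectation from the first item. TextSchema and validate_text are hypothetical names, and the whitespace tokenization is deliberately naive.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class TextSchema:
    # Hypothetical expectations about an English text column.
    expected_top_word: str = "the"
    min_documents: int = 1

def validate_text(documents, schema=TextSchema()):
    """Return a list of human-readable schema violations (empty means OK)."""
    problems = []
    if len(documents) < schema.min_documents:
        problems.append(
            f"need at least {schema.min_documents} document(s), got {len(documents)}"
        )
        return problems
    words = Counter(w for doc in documents for w in doc.lower().split())
    top_word = words.most_common(1)[0][0] if words else None
    if top_word != schema.expected_top_word:
        problems.append(
            f"most common word is {top_word!r}, expected {schema.expected_top_word!r}"
        )
    return problems

print(validate_text(["The cat sat on the mat", "the end"]))
print(validate_text(["le chat est sur le tapis"]))
```

The same check can run on training batches and on serving traffic, so both sides are held to one set of expectations.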

Automated Feature Management

Data sources produce signals. Signals, e.g. words in the advertisement or country of origin, can be translated to a set of numerical features for learning.

Signals should be annotated both manually and automatically:

Availability: time spans the data has been available
Deprecation: new version available and waiting for consumers to update
Domain-specific applicability: which use-cases are allowed

Automatic feature management applies the feature extraction. What feature management should handle:

Automatically applies feature extraction to new data, continuously or periodically.
Adds access control to sensitive data versions and transformations.
Keeps track of upstream data sources and signals.
Keeps track of upstream reliability.
Keeps track of who uses which data source and signal source through features.
Keeps track of how much a feature change adds to RAM usage and serving latency.
Alerts feature subscribers when a new feature is available.
Alerts feature subscribers when a new version of a feature is available.
Suggests removing a feature when nobody is using it.
Suggests removing a signal when nobody is using it.
Suggests unsubscribing from a data source when nobody is using it.

Correction Cascades

Even declared consumers cause dependency problems.

You have model A that solves problem alpha very well.

Then you get problem beta, which could be solved with model A with a few corrections; thus you add a correction and effectively create model A'.

Acceptable at the start but needs fixing in the long run. These cascading systems will halt development at some point. Making even totally meaningful and valid improvements to the first model can cause the second model to perform poorly; this is because the second model is learning all the minor nuances of the first model and relies on those.

Solution Strategies:

Bake it all into the original model. Make the original model A more robust by adding features that allow distinguishing between the use-cases. Then you can query the model with the feature flags for the appropriate test distribution. Greedy Unsupervised Layer-wise Pre-training and freezing previous layers might also work.

Undeclared Consumers

Undeclared consumers are external systems we know little about. Predictions are commonly made accessible and consumed by other systems. Without access control and versioning, this will make it impossible to change the original prediction engine without affecting the dependent systems.

Solution Strategies:

Add access control to your model deployment. So you know who is using the model and when.
Add versioning to your model deployment. So you can notify your consumers about updates and deprecate unused versions when they upgrade.
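
Both strategies can be sketched with an in-memory stand-in for a deployed model. ModelEndpoint, the API key and the consumer name are hypothetical; a real deployment would put this behind an authenticated service.

```python
from datetime import datetime, timezone

class ModelEndpoint:
    """In-memory stand-in for a deployed model with API keys and versions."""

    def __init__(self):
        self._versions = {}    # version string -> prediction function
        self._api_keys = {}    # API key -> consumer name
        self.access_log = []   # (timestamp, consumer, version)

    def register_consumer(self, api_key, name):
        self._api_keys[api_key] = name

    def deploy(self, version, predict_fn):
        self._versions[version] = predict_fn

    def retire(self, version):
        del self._versions[version]

    def predict(self, api_key, version, features):
        consumer = self._api_keys.get(api_key)
        if consumer is None:
            raise PermissionError("unknown API key")
        if version not in self._versions:
            raise LookupError(f"model version {version!r} is not deployed")
        # Record who used which version and when.
        self.access_log.append((datetime.now(timezone.utc), consumer, version))
        return self._versions[version](features)

    def consumers_of(self, version):
        # Who must be notified before this version is deprecated?
        return sorted({c for _, c, v in self.access_log if v == version})

endpoint = ModelEndpoint()
endpoint.register_consumer("key-123", "billing-service")
endpoint.deploy("v1", lambda features: sum(features) > 1.0)
print(endpoint.predict("key-123", "v1", [0.7, 0.6]))
print(endpoint.consumers_of("v1"))
```

Because every call names a version and a known consumer, there are no undeclared consumers left to break when a version is retired.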

Direct Feedback Loops

Direct feedback loops are easy to notice but tedious to tackle. A direct feedback loop is formed when a model directly influences the selection of its own future training data.

Direct feedback loop importance depends on the approach you use. Direct feedback loops are not a large problem in supervised learning as the data is prepared manually, but real-time reinforcement and unsupervised learning must be built to combat this.

Solution Strategies:

Add randomization to the training data. Your mileage may vary depending on your data.
Add force-feeding. Isolate a certain portion of data from being influenced by the model.
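
One way to read "force-feeding" is a stable holdout group whose traffic ignores the model, so the feedback collected there is not shaped by the model's own choices. This is a sketch under that interpretation; the 5% share, the hashing scheme and the function names are assumptions.

```python
import hashlib
import random

def in_holdout(user_id, holdout_percent=5):
    # Stable assignment: the same user always lands in the same bucket.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < holdout_percent

def choose_item(user_id, candidates, model_score, rng, holdout_percent=5):
    """Return (chosen item, source), where source says who made the choice."""
    if in_holdout(user_id, holdout_percent):
        # Holdout traffic gets a random item, independent of the model.
        return rng.choice(candidates), "holdout"
    return max(candidates, key=model_score), "model"

scores = {"a": 0.1, "b": 0.9, "c": 0.3}
rng = random.Random(0)
print(choose_item("user-42", list(scores), scores.get, rng))
```

Logging the source next to each example lets training use, or at least weight, the holdout data separately.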

Hidden Feedback Loops

Hidden feedback loops are slow to notice. Two systems influencing each other indirectly through the real world create a feedback loop. Gradual changes are not visible in quick experiments, which makes hidden feedback loops hard to find.

We have two companies that use stock-market prediction models. Improvements to one of those models can influence bidding behavior of the other model.

Solution Strategies:

Remove hidden feedback loops whenever feasible. They make training the model much harder. The problem is that they are hard to detect automatically.

Training Troubles

Machine learning training can be seen as code compilation. The source of this compilation is both the training code and the data. Thus training data needs testing like the code does, and a trained model needs production practices like debugging, rollbacks and monitoring.

Solution Strategies:

Compare your model to simpler models. e.g. host a trained linear model next to your complex model and compare the results.
All model training must be reproducible. Training twice on the same data should produce two identical models. Non-deterministic training should still have a defined seed.
Continuously search for the optimal hyperparameters. Grid search and other hyperparameter search strategies improve performance and uncover reliability issues, but when data changes, hyperparameters might need to change too.
Drill down to model quality with data slices. e.g. error for users from Finland must be <5%.
Unit test training code. e.g. that a model can restore from a checkpoint after a mid-training crash. Assertions during training might work too.
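
The reproducibility requirement can be turned into a unit test. A minimal sketch, assuming a toy linear model trained with SGD where all randomness comes from one explicit seed; train_linear and its defaults are made up for this example.

```python
import random

def train_linear(data, seed, epochs=50, lr=0.05):
    """Fit y ~ w * x + b with SGD. All randomness comes from the given seed."""
    rng = random.Random(seed)   # local RNG, no hidden global state
    w, b = rng.uniform(-1, 1), 0.0
    data = list(data)           # copy, so shuffling does not touch the caller's list
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
    return w, b

def test_training_is_reproducible():
    data = [(x / 10, 2 * (x / 10) + 1) for x in range(10)]
    # Same data and same seed must give an identical model.
    assert train_linear(data, seed=42) == train_linear(data, seed=42)

test_training_is_reproducible()
```

Passing a dedicated random.Random instance instead of seeding the global generator keeps other code from disturbing the sequence.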

Infrastructure Robustness

Using machine learning in the real world is more complicated than small examples or even large research experiments. You will require a big infrastructure of components that make up the whole system.

Solution Strategies:

Access control should cover the whole pipeline. e.g. checkpoint data during training should be accessible only by known systems.
Minimize calendar time needed to add a new feature to a production model. It shouldn't take days to try out new things on production data.
Minimize calendar days it takes to try a new approach at full scale.
Make sure training and deployment have the same feature extraction code. It will cause catastrophes if production uses different feature extraction than the model training. This is called "feature extraction skew".
Track your training speed and resource usage. e.g. RAM, CPU, GPU.
Track your serving latency and resource usage. e.g. RAM, CPU.
Unit test individual components.
Add integration tests to the full pipeline. Tests should run all the way from data sources to serving.
Add regression tests. When you encounter a prediction error in a data slice, make it a reproducible test and always run that before serving in production.

Glue Code

For every 5% machine learning code, there is 95% glue code

Using already built packages and cloud platforms results in glue code. Glue code tends to freeze a system to the specific package or approach. This makes it expensive to experiment with other packages and approaches, which is a real problem when your problem domain changes.

Solution Strategies:

Make your machine learning infrastructure general enough.
Wrap black-box packages into common APIs.
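
Wrapping a black-box package could look like the sketch below. VendorClassifier is an invented stand-in for a third-party package with its own calling convention; Model, VendorAdapter and evaluate are hypothetical names as well.

```python
from typing import Protocol, Sequence

class Model(Protocol):
    """The common API the rest of the system depends on."""
    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[int]) -> None: ...
    def predict(self, row: Sequence[float]) -> int: ...

class VendorClassifier:
    """Invented stand-in for a third-party package with its own conventions."""
    def learn(self, table):
        positives = [sum(r["x"]) for r in table if r["y"] == 1]
        negatives = [sum(r["x"]) for r in table if r["y"] == 0]
        self.cut = (min(positives) + max(negatives)) / 2

    def classify(self, x):
        return {"label": int(sum(x) > self.cut)}

class VendorAdapter:
    """Glue code lives here only; callers see the Model API."""
    def __init__(self):
        self._inner = VendorClassifier()

    def fit(self, rows, labels):
        self._inner.learn([{"x": list(r), "y": y} for r, y in zip(rows, labels)])

    def predict(self, row):
        return self._inner.classify(list(row))["label"]

def evaluate(model: Model, rows, labels):
    # Works with any object that satisfies the Model protocol.
    return sum(model.predict(r) == y for r, y in zip(rows, labels)) / len(rows)

rows = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.7, 0.9]]
labels = [0, 1, 0, 1]
model = VendorAdapter()
model.fit(rows, labels)
print(evaluate(model, rows, labels))
```

Swapping the package then means writing one new adapter instead of touching every caller.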

Pipeline Jungles

Pipeline jungles evolve organically. Preparing data into a machine learning-friendly format is a jungle of scrapes, joins and sampling, which should be kept in check.

Solution Strategies:

Plan your data collection and feature extraction in advance. Try to cover all foreseeable use-cases with appropriate generalization but avoid over-engineering it.
Add end-to-end integration tests for the most common use-cases. Without actual tests, it's hard to develop the pipeline systems further.
Add automated error monitoring and recovery. Problems in backend pipelines can surface days or weeks after the actual pipeline breakage.
Test all feature extraction code. Feature extraction code may seem simple, but it is a persistent origin of bugs and the problems are hard to detect.

Dead Experimentation Code

Experimental code creates debt over time. Maintaining backward compatibility with unnecessary experimental code is a hassle and accidentally running experimental code in production can cause catastrophes.

Knight Capital lost $465 million in 45 minutes in 2012, apparently because its production system unexpectedly ran obsolete experimental code.

Solution Strategies:

Examine and remove experimental branches frequently. See which of the experimental branches are actually in use. In a healthy machine learning system, experimental code should be well isolated, which requires rethinking the code APIs.
Clearly indicate which parts of your work are experimental. This can be done by flagging task runs and using git branches.
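
In code, experimental parts can also be flagged with a decorator that refuses to run in production. A minimal sketch; the APP_ENV environment variable and all names here are assumptions, not an established convention.

```python
import functools
import os

class ExperimentalCodeError(RuntimeError):
    pass

def experimental(func):
    """Mark a function as experimental; it refuses to run in production."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # APP_ENV is a hypothetical environment variable for this sketch.
        if os.environ.get("APP_ENV") == "production":
            raise ExperimentalCodeError(f"{func.__name__} is experimental")
        return func(*args, **kwargs)
    wrapper.is_experimental = True  # lets tooling list experimental code paths
    return wrapper

@experimental
def rank_with_new_heuristic(scores):
    return sorted(scores, reverse=True)
```

The is_experimental attribute gives a cleanup script something to search for when experimental branches are reviewed.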

Abstraction Debt

Machine learning solutions have no widely accepted abstraction layers. What is the right interface to describe a stream of data, a model, or a prediction? The relational database has grown to be a basic layer of abstraction in software development in general, but machine learning systems have nothing like this.

A couple of years ago you could've argued that MapReduce would've been a good abstraction layer but recent years have shown that MapReduce is not general enough for the vast majority of modern use-cases.

Lack of abstractions makes it hard to naturally separate components. It might be required to define standards in-house if nothing better is available.

Solution Strategies:

Define your data formats.
Define what your models should look like and how they are used and deployed.

Configuration Debt

Machine learning system configuration is hard. Managing configuration of a live machine learning system is harder than most engineers estimate it to be.

Which data is used?
Which features are extracted from the data?
Which hyperparameters are used for learning?
What pre-processing is needed for the data?
What post-processing is needed for the model?
How can we verify that the model is working well?

Features alone have rules as complex as working with time zones:

Feature A was incorrectly logged for a week between 1.10. and 7.10.2017.
Feature B is not available before 7.10.2017.
Feature C pre-processing has to change before 1.11.2017 as the format changes.
Feature D is not available for production so D' and D'' must be used.
Feature E adds training memory overhead because of its lookup tables.
Feature F cannot be used as it causes latency overhead for the use-case.

And configuration is not a small problem. Mistakes in configuration can lead to serious loss of time, waste of resources and production issues. Lines of code dedicated to configuration can far exceed the number of lines of the code that actually does machine learning.

Good configuration systems:

It should be easy to specify configuration as a small change from a previous configuration.
It should be hard to make manual errors, omissions and oversights.
It should be easy to compare configuration between two models.
It should be easy to detect unused or redundant settings.
Configuration should go through full code review and be checked in version control.

Solution Strategies:

Assertions about silly configuration can reduce the number of mistakes.
A useful tool is the ability to present visual side-by-side differences of two configurations. Configurations are usually copy-paste + modifications anyway, so diffs make a lot of sense.
Configuration changes should be treated with the same level of seriousness as code changes, even to the extent of being carefully reviewed by peers.
Prohibit accidentally applying closely similar configuration to different models.
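
Several of these points fit in a small typed configuration object. A sketch, assuming a frozen dataclass is enough for your configuration; TrainingConfig, its fields and the dataset name are invented for illustration.

```python
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class TrainingConfig:
    dataset: str
    features: tuple
    learning_rate: float
    batch_size: int = 32

    def __post_init__(self):
        # Assertions about silly configuration catch mistakes early.
        if not 0 < self.learning_rate < 1:
            raise ValueError(f"learning_rate {self.learning_rate} is outside (0, 1)")
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")
        if len(set(self.features)) != len(self.features):
            raise ValueError("duplicate feature names")

def diff_configs(old, new):
    """Return {field: (old_value, new_value)} for every field that differs."""
    a, b = asdict(old), asdict(new)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}

base = TrainingConfig("clicks-2017-10", ("age", "country"), learning_rate=0.01)
# A new configuration is a small, explicit change from a previous one.
experiment = replace(base, learning_rate=0.05)
print(diff_configs(base, experiment))
```

Since replace goes through the constructor, the derived configuration is validated the same way as the original.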

Fixed Thresholds in Dynamic Systems

It's often necessary to pick a decision threshold for a prediction. Systems commonly have a whole set of such thresholds, and they are frequently set manually.

This mail is spam with 70% confidence => Is it considered spam now?

With new data coming in, an old manual threshold might become invalid. This can cause problems in production that are hard to detect.

Solution Strategies:

Learn thresholds via simple evaluation on held-out validation data.
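
A minimal sketch of learning a threshold from held-out data, assuming accuracy is the metric you care about (it often is not, e.g. when false positives are costlier). The scores and labels are toy values.

```python
def learn_threshold(scores, labels):
    """Pick the score threshold that maximizes accuracy on held-out data.
    An example is predicted positive when score >= threshold."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(scores)):
        correct = sum((s >= candidate) == bool(y) for s, y in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best_accuracy:  # ties keep the lowest threshold
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy

validation_scores = [0.10, 0.35, 0.40, 0.65, 0.70, 0.90]
validation_labels = [0, 0, 0, 1, 1, 1]
print(learn_threshold(validation_scores, validation_labels))
```

Re-running this whenever the model or the data changes replaces the manual threshold with one that follows the data.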

Collaboration

Solution Strategies:

Track how fast new members of the team can be brought up to speed.
Do experiment results get recorded automatically?
All machine learning code is checked in version control.
All production machine learning code changes undergo a code review.

Serving Robustness

Solution Strategies:

Avoid training/serving skew. Make sure that training and serving use the same feature extraction code.
Test model quality before attempting to serve it in production. e.g. testing against data with known correct outputs or comparing predictions to a previous version of the model while both are live. This also allows periodically testing simpler models against your complex model.
Allow canary testing of models. A canary test or canary deployment pushes changes to only a small group of end users.
Keep track of how fast model updates can be rolled back. It should take less than a minute to roll back to a previous version of any model.
Allow observing the internal state of models. e.g. sending a single example or a small batch should allow monitoring the internal state for debugging.
Track the relationship between prediction metrics and actual impact metrics. e.g. when your model gets 1% better in product predictions, use A/B testing to figure out how much user satisfaction increases.
Measure the impact of model staleness. If predictions were based on a model trained last month or year, what is the difference in impact? Age vs. quality graphs are a good visualization.

Live Monitoring

Allow live monitoring of system behavior in real time for debugging.

Solution Strategies:

Add action limits. Add action limits as a smoke test, but they should be broad enough not to trigger misleadingly. If a spam detector starts to mark 1000 emails per second, it should raise an alert.
Track prediction bias. The distribution of predicted labels should be close to the distribution of training labels, if both are based on real data. This can cause false positives but helps to detect if something is really wrong.
Track that training and serving inputs hold data invariants. e.g. two specific features have the same number of non-zero values, or a feature is always in the range between 0 and 100.
Track pipeline blockages. Visualize and detect blockages in data pipelines.
Allow subscribing to notifications. e.g. when input data statistics, training speed, serving latency or RAM usage changes drastically.
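
The first two checks can be sketched as follows. ActionLimit uses a sliding time window, and prediction_bias compares label shares; both names, the window size and the limits are assumptions for this example.

```python
from collections import Counter, deque

class ActionLimit:
    """Alert when more than max_actions happen within window_seconds."""

    def __init__(self, max_actions, window_seconds=1.0):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        # Returns True when the limit is exceeded and an alert should fire.
        self._timestamps.append(timestamp)
        while timestamp - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_actions

def prediction_bias(training_labels, predicted_labels):
    """Largest absolute gap between label shares in training and in predictions."""
    train, served = Counter(training_labels), Counter(predicted_labels)
    return max(
        abs(train[label] / len(training_labels) - served[label] / len(predicted_labels))
        for label in set(train) | set(served)
    )

limit = ActionLimit(max_actions=3, window_seconds=1.0)
print([limit.record(t) for t in (0.0, 0.1, 0.2, 0.3)])
print(prediction_bias(["spam"] * 2 + ["ham"] * 8, ["spam"] * 5 + ["ham"] * 5))
```

Timestamps are passed in explicitly, which keeps the limit easy to test without waiting for real time to pass.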

Smells

Plain-Old-Data Type Smell: Data is just raw floats and integers. In a robust system, a parameter should know if it's a log-odds multiplier or a decision threshold.

Multiple-Language Smell: Using multiple programming languages increases system complexity.

Prototype Smell: Working a lot in prototype environments indicates that the full-scale system is too brittle. Improved abstractions would make the system easier to develop and manage.
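
For the Plain-Old-Data Type Smell, one possible remedy is to give parameters small types of their own. A sketch with hypothetical class names; the runtime isinstance check is optional if you rely on a static type checker.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionThreshold:
    value: float

    def __post_init__(self):
        if not 0.0 <= self.value <= 1.0:
            raise ValueError("a decision threshold must be a probability in [0, 1]")

@dataclass(frozen=True)
class LogOddsMultiplier:
    value: float

def is_spam(probability: float, threshold: DecisionThreshold) -> bool:
    # Reject raw numbers and other parameter types passed by mistake.
    if not isinstance(threshold, DecisionThreshold):
        raise TypeError("threshold must be a DecisionThreshold, not a raw number")
    return probability >= threshold.value

print(is_spam(0.8, DecisionThreshold(0.7)))
```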

Sources

Detecting Adversarial Advertisements in the Wild, KDD 2011
Ad Click Prediction: a View from the Trenches, KDD 2013
Machine Learning: The High-Interest Credit Card of Technical Debt, NIPS 2014
Hidden Technical Debt in Machine Learning Systems, NIPS 2015
Toward ethical, transparent and fair AI/ML: a critical reading list
What's your ML Test Score? A rubric for ML production systems, Eric Breck et al.
Real-Time Machine Learning: The Missing Pieces, Robert Nishihara et al.