
🧠 MLOps

Updated at 2025-02-02 21:33

MLOps is the standardization of machine learning development.

MLOps (Machine Learning Operations) processes help organizations generate long-term value and reduce risk associated with data science, machine learning, and AI initiatives.

MLOps processes can be enforced by a platform, shared guidelines, or both.

Core requirements that MLOps should cover can be summarized as VAPOR:

  • Version everything, especially early experiments and pipelines
  • Assess if new models are better than the previous versions
  • Promote better-performing models to production
  • Observe when model performance degrades over time
  • Risk management of machine learning initiatives

Collaboration

The machine learning life cycle involves people from business, data science, software development, operations, legal, and the problem domain, among others.

AI efforts are collaborative, but not everyone involved speaks the same language.

These groups are not used to the same tools or, in many cases, don't share the same fundamental skills to serve as a baseline of communication.

Data scientists are not software engineers. Most are specialized in model building and assessment, and they are not necessarily experts in writing applications or even in the subject matter they are working with.

There is certainly overlap between the roles, but the skill sets are not identical.

Upper management should be able to understand what machine learning models are deployed in production and what effect they're having on the business. This is critical for business leaders to make informed decisions.

Arguably, they should also be able to drill down to understand the steps taken to go from raw data to final output behind those machine learning models.

Common Terms

Explainability
With deep learning, it is much harder to understand what features are used to determine a prediction, which in turn can make it much harder to demonstrate that models comply with the necessary regulatory or internal governance requirements.

Neural network decision-making is unexplainable by default and requires additional techniques to make it explainable if that is a hard requirement.
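
A minimal sketch of one such technique, using the shap library (an assumed dependency) to attribute a tree-based model's predictions to individual features; the dataset and model choice are purely illustrative:

```python
# Minimal post-hoc explainability sketch with the shap library (assumed
# dependency), using a tree-based regressor on a built-in example dataset.
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shapley values attribute each prediction to individual feature contributions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:100])

# The summary plot shows which features drive predictions across the sample.
shap.summary_plot(shap_values, X.iloc[:100])
```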

Intentionality includes:

  • Assurance that models behave in ways aligned with their purpose
  • Assurance that data comes from compliant and unbiased sources
  • A collaborative approach to AI projects that ensures multiple checks and balances on potential model bias

This is closely related to explainability but focuses on ensuring that models are used in ways aligned with the business goals.

Accountability includes:

  • Having an overall view of which teams are using what data, how, and in which models
  • Trust that data is reliable and being collected in accordance with regulations
  • A centralized understanding of which models are used for what processes

This is closely tied to traceability: if something goes wrong, is it easy to find where in the pipeline it happened?

Auditability includes:

  • Access to the full model version history
  • Artifacts that allow running each model version
  • Test results that verify how each version was trained
  • Detailed model logs and monitoring metadata that show how it is being used

This is closely related to explainability but focuses on the ability to audit the model and its usage, not just to understand the model.

Machine Learning Pipeline in a Nutshell

The process of developing a machine learning model starts with a business goal.

"Reducing fraudulent transactions to <0.1%"

"Gain the ability to identify people's faces on their social media photos."

With clear business goals defined, it is time to bring together subject-matter experts and data scientists to begin the journey of developing a solution.

Core Dependencies

Business dictates the need for machine learning. And as business needs shift over time, the assumptions made when the model was first built might change.

Code is the foundation of machine learning. Machine learning systems are often built on a stack of open source software (e.g., scikit-learn, Python, Linux), and having versions of this software in production that match those that the model was verified on is critically important.

Data is the lifeblood of machine learning. Data is constantly changing, and the data used to train a model may not be the same as the data that is used to make predictions. This is called drift.
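
A minimal drift-check sketch, assuming pandas and scipy are available: each numeric feature's training distribution is compared against recent serving data with a two-sample Kolmogorov-Smirnov test. The 0.05 threshold is an illustrative choice, not a universal rule.

```python
# Minimal drift check sketch: compare each numeric feature's distribution in
# the training data against recent serving data with a two-sample KS test.
# The alpha threshold and the DataFrames are illustrative assumptions.
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(train_df: pd.DataFrame, serving_df: pd.DataFrame, alpha=0.05):
    drifted = {}
    for column in train_df.select_dtypes("number").columns:
        statistic, p_value = ks_2samp(train_df[column], serving_df[column])
        if p_value < alpha:  # distributions differ more than chance would explain
            drifted[column] = {"ks_statistic": statistic, "p_value": p_value}
    return drifted
```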

Finding Data

The algorithms analyze sample data, known as training data, to build a software model that can make predictions.

Key questions to consider when sourcing data for machine learning models:

  1. Data Availability and Quality

    • What relevant datasets are available?
    • Is this data accurate enough and reliable?
    • How can stakeholders get access to this data?
  2. Feature Engineering

    • What data properties (known as features) can be made available by combining multiple sources of data?
    • Will this data be available in real time or historically?
  3. Data Labeling

    • Is there a need to label data with the "ground truth" to be predicted?
    • Does unsupervised learning make sense?
    • If labeling is needed, how much will this cost in terms of resources?
  4. Infrastructure and Deployment

    • What platform(s) should be used?
    • How will data be updated once the model is deployed?
  5. Metrics

    • How will the KPIs, which were established along with the business goals, be measured?
  6. Data Governance

    • Can the selected datasets be used for this purpose?
    • What are the terms of use?
    • Is there personally identifiable information (PII) to be anonymized?
    • Are there features, such as gender, that legally cannot be used in this business context?
    • Are minority populations represented well enough that the model performs equivalently on each group?

Exploratory data analysis (EDA) techniques help with reasoning about the data:

  • Build hypotheses about the data
  • Identify data cleaning requirements
  • Inform the process of selecting potentially significant features

EDA is done visually for intuitive insight or rigorously with statistical analysis.

EDA naturally leads into feature engineering and feature selection. Feature engineering is the process of taking raw data from the selected datasets and transforming it into features that better represent the underlying problem.
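
A small pandas sketch of what feature engineering can look like; the transaction columns are hypothetical examples, not from any specific dataset:

```python
# Feature engineering sketch with pandas; the transaction columns are
# hypothetical, not from any specific dataset.
import pandas as pd

def engineer_features(transactions: pd.DataFrame) -> pd.DataFrame:
    df = transactions.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    # Raw timestamps rarely help; derived calendar features often do.
    df["hour_of_day"] = df["timestamp"].dt.hour
    df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5
    # Ratios can capture behavior better than absolute values.
    df["amount_vs_avg"] = df["amount"] / df.groupby("customer_id")["amount"].transform("mean")
    return df
```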

After feature engineering and selection, the next step is training.

Data Validation Checks

Are the following validation criteria met for the new training data?

  • Data completeness and consistency checks (see the sketch after this list)
  • Feature distribution comparison with a previous training set
  • Predefined metric validation
  • Alignment with model refinement goals
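
A minimal sketch of completeness and consistency checks; the expectations dictionary and its rules are illustrative assumptions, not a standard format:

```python
# Minimal validation sketch for new training data: missing columns, expected
# dtypes, null rates, and value ranges. The expectations are illustrative.
import pandas as pd

EXPECTATIONS = {
    "amount": {"dtype": "float64", "max_null_rate": 0.01, "min": 0.0},
    "customer_id": {"dtype": "object", "max_null_rate": 0.0},
}

def validate(df: pd.DataFrame) -> list[str]:
    problems = []
    for column, rules in EXPECTATIONS.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        if str(df[column].dtype) != rules["dtype"]:
            problems.append(f"{column}: unexpected dtype {df[column].dtype}")
        if df[column].isna().mean() > rules["max_null_rate"]:
            problems.append(f"{column}: too many missing values")
        if "min" in rules and (df[column] < rules["min"]).any():
            problems.append(f"{column}: values below {rules['min']}")
    return problems
```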

Training

The aim in machine learning is to save enough information about the environment the model was developed in so that the model can be reproduced with the same results from scratch.
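
One way to sketch this is to record random seeds, package versions, and a fingerprint of the training data alongside the model; the file names and metadata layout below are illustrative assumptions.

```python
# Sketch of capturing enough environment detail to reproduce a training run:
# random seeds, package versions, and a hash of the training data.
# File names and the metadata layout are illustrative assumptions.
import hashlib
import json
import random
import sys
from importlib.metadata import version

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

def dataset_fingerprint(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

run_metadata = {
    "seed": SEED,
    "python": sys.version,
    "packages": {name: version(name) for name in ["numpy", "scikit-learn"]},
    "training_data_sha256": dataset_fingerprint("data/train.csv"),
}
with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```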

Deployment

Deploying models is a key part of MLOps that presents an entirely different set of technical challenges than developing the model.

The usual deployment pipeline:
1. Build the model
2. Build the model artifacts
3. Send the artifacts to long-term storage
4. Run basic smoke and sanity checks
5. Generate fairness and explainability reports
6. Deploy to a test environment
7. Run tests to validate performance
8. Validate manually
9. Deploy to production environment
10. Deploy the model as canary
11. Fully deploy the model

Before production, there should be a clearly defined validation step. What is being validated is use-case-specific, but at the very least unit tests should be run.

Conformal Prediction:
estimate how confident a prediction is by producing a prediction interval with a known error rate. If your predictor says a car's price is between $50 and $500k, you shouldn't automatically make business decisions based on that output.
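
A split conformal regression sketch, assuming scikit-learn and a held-out calibration set; it turns point predictions into intervals with roughly the requested coverage. The model and split are illustrative choices.

```python
# Split conformal regression sketch: use a held-out calibration set to turn
# point predictions into intervals with roughly (1 - alpha) coverage.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def conformal_intervals(X, y, X_new, alpha=0.1):
    X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
    # Calibration residuals describe how wrong the model tends to be.
    residuals = np.abs(y_cal - model.predict(X_cal))
    q = np.quantile(residuals, 1 - alpha)
    predictions = model.predict(X_new)
    return predictions - q, predictions + q  # lower and upper bounds
```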

Take a look at model formats such as ONNX, PMML, PFA, or POJO. These aim to improve model portability between systems and simplify deployment. However, they come at a cost: each format supports a limited range of algorithms, and sometimes the portable models behave in subtly different ways than the original, so they must be thoroughly tested.
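
A sketch of exporting a scikit-learn model to ONNX and scoring it with onnxruntime, assuming the skl2onnx and onnxruntime packages are installed; the tiny synthetic dataset is only for illustration.

```python
# Export a scikit-learn model to ONNX and score it with onnxruntime;
# assumes skl2onnx and onnxruntime are installed.
import numpy as np
import onnxruntime as rt
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(int)
model = LogisticRegression().fit(X, y)

# Convert; the input name "input" is a choice, not a requirement.
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Score with the ONNX runtime and compare against the original model.
session = rt.InferenceSession("model.onnx")
onnx_pred = session.run(None, {"input": X[:5]})[0]
print(onnx_pred, model.predict(X[:5]))  # verify the behaviors match
```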

There are commonly three types of model deployment/usage:

  1. Real-time, or "online"
    • Typically, the model is deployed into a simple framework to provide an HTTP API endpoint that responds to requests in real time, scoring one or a few samples (see the sketch after this list).
  2. Batch, "deferred", or "offline"
    • Whole datasets are processed using a model, such as in daily scheduled jobs. The model is usually downloaded for the processing and then discarded.
  3. Embedded, or "local"
    • The model is packaged into an application, which is then published. A common example is an application that provides batch-scoring of requests.
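
A minimal real-time scoring sketch using FastAPI (an assumed choice of framework); the model file and feature names are illustrative.

```python
# Minimal real-time scoring service sketch with FastAPI (assumed dependency);
# the model file and feature names are illustrative.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # loaded once at startup, reused per request

class Features(BaseModel):
    amount: float
    hour_of_day: int
    amount_vs_avg: float

@app.post("/predict")
def predict(features: Features):
    row = [[features.amount, features.hour_of_day, features.amount_vs_avg]]
    return {"prediction": float(model.predict(row)[0])}

# Run locally with, for example: uvicorn scoring_service:app --port 8000
```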

Scoring a single sample is usually technically identical to scoring a batch of them. In both cases, multiple instances of the model can be deployed to increase throughput and potentially lower latency.

Batch scoring can also be parallelized, for example, by using a parallel processing runtime like Apache Spark, but also by splitting datasets and scoring the partitions independently.
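
A sketch of the partition-and-score approach using only pandas and the standard library; the model file, input path, and partition count are illustrative assumptions.

```python
# Batch scoring sketch: split a large dataset into partitions and score them
# in parallel processes; paths and partition count are illustrative.
from concurrent.futures import ProcessPoolExecutor

import joblib
import pandas as pd

def score_partition(partition: pd.DataFrame) -> pd.Series:
    model = joblib.load("model.joblib")  # each worker process loads its own copy
    return pd.Series(model.predict(partition), index=partition.index)

def score_dataset(path: str, n_partitions: int = 8) -> pd.Series:
    data = pd.read_csv(path)
    chunk_size = -(-len(data) // n_partitions)  # ceiling division
    partitions = [data.iloc[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ProcessPoolExecutor(max_workers=n_partitions) as pool:
        results = list(pool.map(score_partition, partitions))
    return pd.concat(results)
```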

Production environments take a wide variety of forms:

  • Custom-built services
  • Data science platforms
  • Dedicated services like TensorFlow Serving
  • Low-level infrastructure like Kubernetes clusters
  • JVMs on embedded systems

To make things even more complex, consider that in some organizations, multiple heterogeneous production environments coexist.

With distillation, a smaller "student" network is trained to mimic a bigger, more powerful network. Done appropriately, this can lead to better models (as compared to trying to train the smaller network directly from the data).
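
A minimal distillation-loss sketch with PyTorch (an assumed dependency): the student is trained to match the teacher's temperature-softened outputs in addition to the true labels. The temperature and mixing weight are illustrative values.

```python
# Knowledge distillation loss sketch with PyTorch (assumed dependency):
# blend a soft-target KL term with regular cross-entropy on the labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable.
    soft_loss = F.kl_div(soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Inside the training loop, the teacher runs in inference mode:
# with torch.no_grad():
#     teacher_logits = teacher(batch)
# loss = distillation_loss(student(batch), teacher_logits, batch_labels)
```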

Teams should ask the uncomfortable questions:

  • What if the model acts in the worst imaginable way?
  • What if a user manages to extract training data or internal logic of the model?
  • What are the financial, business, legal, safety, and reputation risks?

Rollout strategies:

  • Blue-Green Deployment
    • Deploy the new model alongside the old model in a parallel production environment.
    • Switch traffic to the new model once it has been validated.
    • If the new model performs poorly, roll back by routing traffic back to the old model.
  • Canary Deployment
    • Deploy the new model to a subset of users and monitor its performance.
    • If the new model performs well, gradually increase the number of users who are exposed to it (see the routing sketch after this list).
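
A routing sketch for a canary rollout; the traffic share, model objects, and user-id hashing scheme are illustrative assumptions. Hashing the user id keeps each user on a single model version.

```python
# Canary routing sketch: a small, configurable fraction of users is served by
# the candidate model, the rest by the current production model.
import hashlib

CANARY_TRAFFIC_SHARE = 0.05  # start small, increase as confidence grows

def route(user_id: str, features, production_model, canary_model):
    # Hash the user id so the same user always hits the same model version.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < CANARY_TRAFFIC_SHARE * 100:
        return "canary", canary_model.predict([features])[0]
    return "production", production_model.predict([features])[0]
```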

Measuring Model Performance

Measuring model performance is not always straightforward. Before building and deploying a better model, you need to be able to identify performance degradation. Projects may use vastly different approaches to assess model performance.

There are two common approaches to detect if a model's performance is degrading:

  1. Ground Truth Evaluation
  2. Drift Detection

Comparing Model Versions

Comparing models is a critical part of the machine learning process. It is essential to understand how models compare to each other and to the baseline.

Model comparison is not always straightforward. There are many ways to compare models, and the best approach depends on the use case and on which metrics matter for the business goal.

Remember to use the same data to ensure a fair comparison.

Production Comparison Approaches:

  • A/B Testing

    • Candidate model scores a portion of live requests
    • Deployed model scores the remaining requests
    • Goes nicely with canary deployments
  • Champion/Challenger aka. Shadow Deployment

    • Candidate model shadows the deployed model
    • Scores the same live requests

Monitoring

Machine learning models need to be monitored at two levels:

  1. At the resource level (health and resource monitoring): ensuring the model is running correctly in the production environment.
    • Is the system alive, and is latency within expectations?
    • Are CPU, RAM, network usage, and disk space as expected?
    • Are requests being processed at the expected rate?
  2. At the performance level (performance monitoring): monitoring the pertinence of the model over time.
    • Is the model still an accurate representation of the patterns in new incoming data?
    • Is it performing as well as it did during the design phase, compared to a past period or to another model version?

Practically, every deployed model should come with monitoring and corresponding warning thresholds to detect meaningful business performance drops as quickly as possible.
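
A sketch of such a warning threshold on a single metric; the baseline accuracy, allowed drop, and choice of metric are illustrative assumptions.

```python
# Performance monitoring sketch: compare recent accuracy against the accuracy
# measured at deployment time and warn when it drops too far.
import logging

from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.92   # measured when the model was validated (illustrative)
MAX_RELATIVE_DROP = 0.05   # warn on a 5% relative degradation (illustrative)

def check_performance(y_true_recent, y_pred_recent):
    current = accuracy_score(y_true_recent, y_pred_recent)
    if current < BASELINE_ACCURACY * (1 - MAX_RELATIVE_DROP):
        logging.warning("model accuracy dropped to %.3f (baseline %.3f)",
                        current, BASELINE_ACCURACY)
    return current
```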

Logging

A model training record should detail:

  • The list of features used
  • The preprocessing techniques that are applied to each feature
  • The algorithm used with the chosen hyperparameters
  • The training dataset
  • The test dataset used to evaluate the model

An event log record should include (see the sketch after this list):

  1. Timestamp
    • The time the event occurred
  2. Model Identity
    • Identification of the model and the version
  3. Prediction Input
    • The processed features of new observations
    • Optionally, the raw data as well; sampling a portion, such as 5% of the data, can be enough
    • Allows verification of incoming data
    • Enables detection of feature drift and data drift
  4. Prediction Output
    • Predictions made by the model
    • Combined with ground truth for production performance evaluation
  5. System Action
    • System's response based on model prediction
    • For example, in fraud detection, a high fraud probability can either block the transaction or trigger a warning
    • Important for understanding user reactions and feedback data
  6. Model Explanation
    • Required in regulated domains (finance, healthcare)
    • Predictions must include feature influence explanations
    • Computed using techniques like Shapley values
    • Logged to identify potential issues like bias and overfitting
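
A sketch of what one structured event log entry covering these fields might look like; the field names and values are illustrative assumptions.

```python
# Sketch of a structured prediction event log entry; field names and values
# are illustrative, not a standard schema.
import json
from datetime import datetime, timezone

event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": {"name": "fraud-detector", "version": "1.4.2"},
    "prediction_input": {"amount": 512.0, "hour_of_day": 3, "amount_vs_avg": 4.8},
    "prediction_output": {"fraud_probability": 0.93},
    "system_action": "transaction_blocked",
    "explanation": {"amount_vs_avg": 0.61, "hour_of_day": 0.22, "amount": 0.10},
}
print(json.dumps(event))  # ship to the logging pipeline as one JSON line
```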

A/B Testing

A/B testing is a method of comparing two versions of a solution against each other to determine which one performs better. In machine learning context, it can be used to compare two model versions.

A/B testing is more suitable if your predictions invoke action from the user. Shadow deployment is more suitable if you can and want to compare the performance of two models from the responses alone.

Before:

  • Define a clear goal as a quantitative business metric.
  • Define the control group and the treatment group; random split or more complex. Consider your demographic and minimize bias between groups.
  • Make sure that the same version handles all requests from a single actor.
  • Decide the sample size and duration of the test. If you don't decide them before the test, you may be tempted to stop the test as soon as you see the result you want to see.

During:

  • Do not stop the test before the duration is over. You will be tempted, but stopping early when the results look favorable leads to biased conclusions. This is a form of p-hacking or data dredging, a common mistake in A/B testing (as is reporting only significant results).

After:

  • Run statistical analysis on the results (see the sketch after this list).
  • If the results are not significant, you can either extend the test or conclude that there is no difference between the models.
  • If the results are significant, you can decide to deploy the new model.
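
A sketch of such an analysis for a binary business metric (for example, conversion), using a two-proportion z-test from statsmodels (an assumed dependency); the counts are illustrative.

```python
# A/B test analysis sketch for a binary metric with a two-proportion z-test;
# the conversion counts and visitor totals are illustrative.
from statsmodels.stats.proportion import proportions_ztest

conversions = [310, 352]   # control, treatment
visitors = [5000, 5000]

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
if p_value < 0.05:
    print(f"significant difference (p={p_value:.4f}); consider promoting the new model")
else:
    print(f"no significant difference (p={p_value:.4f}); extend the test or keep the old model")
```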

Sources

  • Introducing MLOps: How To Scale Machine Learning In The Enterprise