MLOps
Machine learning is the high-interest credit card of technical debt.
MLOps is about the standardization and streamlining of machine learning life cycle management.
MLOps is a process that helps organizations generate long-term value and reduce risk associated with data science, machine learning, and AI initiatives.
The MLOps process can be enforced by a platform, shared guidelines, or both.
MLOps Process Requirements
Must-haves:
- Keep track of versioning, especially during early experimentation
- Assess whether new models are better than the previous versions (a minimal sketch of this comparison follows the list)
- Facilitate promoting better-performing models to production
- Ensure that model production performance is not degrading over time
- Manage the risks associated with machine learning models
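As an illustration of the compare-and-promote must-haves above, here is a minimal sketch; the models, the metric (ROC AUC), and the promotion margin are illustrative choices rather than prescriptions, and a real setup would normally record them in a model registry or experiment tracker.

```python
# Minimal sketch: compare a challenger model against the current champion and
# decide whether to promote it. Metric and threshold are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

champion = LogisticRegression(max_iter=1000).fit(X_train, y_train)
challenger = RandomForestClassifier(random_state=0).fit(X_train, y_train)

champion_auc = roc_auc_score(y_test, champion.predict_proba(X_test)[:, 1])
challenger_auc = roc_auc_score(y_test, challenger.predict_proba(X_test)[:, 1])

# Promote only if the challenger beats the champion by a meaningful margin.
PROMOTION_MARGIN = 0.01  # illustrative threshold
if challenger_auc > champion_auc + PROMOTION_MARGIN:
    print(f"Promote challenger (AUC {challenger_auc:.3f} > {champion_auc:.3f})")
else:
    print(f"Keep champion (AUC {champion_auc:.3f})")
```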
Collaboration
AI efforts need to be collaborative. But not everyone speaks the same language.
The machine learning life cycle involves people from business, data science, software development, operations, and legal, as well as subject-matter experts from the problem domain.
These groups are not used to the same tools or, in many cases, don't share the same fundamental skills to serve as a baseline of communication.
Data scientists are not software engineers. Most are specialized in model building and assessment, and they are not necessarily experts in writing applications or even in the subject-matter they are working with.
There is certainly overlap between the roles, but the skill sets are not identical.
Upper management should be able to understand what machine learning models are deployed in production and what effect they're having on the business. This is critical for business leaders to make informed decisions.
Arguably, they should also be able to drill down to understand the steps taken to go from raw data to final output behind those machine learning models.
Common Terms
Explainability: With deep learning, it is much harder to understand what features are used to determine a prediction, which in turn can make it much harder to demonstrate that models comply with the necessary regulatory or internal governance requirements.
Neural network decision-making is unexplainable by default and requires additional techniques to make it explainable if that is a hard requirement.
Intentionality includes:
- Ensure that models behave in ways aligned with their purpose
- Assurance that data comes from compliant and unbiased sources
- A collaborative approach to AI projects that ensures multiple checks and balances on potential model bias
- Explainability, meaning the results of the systems should be explainable by humans (ideally, not just the humans who created the system)
Accountability includes:
- Having an overall view of which teams are using what data, how, and in which models
- Trust that data is reliable and being collected in accordance with regulations
- A centralized understanding of which models are used for what business processes
This is closely tied to traceability: if something goes wrong, is it easy to find where in the pipeline it happened?
Machine Learning Pipeline in a Nutshell
The process of developing a machine learning model should start with a business goal.
"Reducing fraudulent transactions to
< 0.1%
"
"Gain the ability to identify people's faces on their social media photos."
With clear business goals defined, it is time to bring together subject-matter experts and data scientists to begin the journey of developing a solution.
Core Dependencies
Business dictates the need for machine learning. Business needs also shift over time, so the assumptions that were made when the model was first built might change.
Code is the foundation of machine learning. Machine learning systems are often built on a stack of open source software (e.g., scikit-learn, Python, or Linux), and having versions of this software in production that match those the model was verified on is critically important.
Data is the lifeblood of machine learning. Data is constantly changing, and the data used to train a model may not be the same as the data that is used to make predictions.
Finding Data
The algorithms analyze sample data, known as training data, to build a software model that can make predictions.
Key questions to consider when sourcing data for machine learning models:
Data Availability and Quality
- What relevant datasets are available?
- Is this data accurate enough and reliable?
- How can stakeholders get access to this data?
Feature Engineering
- What data properties (known as features) can be made available by combining multiple sources of data?
- Will this data be available in real time or historically?
Data Labeling
- Is there a need to label some data with the "ground truth" that is to be predicted?
- Does unsupervised learning make sense?
- If labeling is needed, how much will this cost in terms of time and resources?
Infrastructure and Deployment
- What platform should be used?
- How will data be updated once the model is deployed?
- Will the use of the model itself reduce the representativeness of the data?
Metrics
- How will the KPIs, which were established along with the business goals, be measured?
Data Governance
- Can the selected datasets be used for this purpose?
- What are the terms of use?
- Is there personally identifiable information (PII) that must be redacted or anonymized?
- Are there features, such as gender, that legally cannot be used in this business context?
- Are minority populations sufficiently well represented that the model performs equivalently on each group?
Exploratory data analysis (EDA) techniques help teams reason about the data:
- Build hypotheses about the data
- Identify data cleaning requirements
- Inform the process of selecting potentially significant features
EDA is carried out visually for intuitive insight or more rigorously with statistical analysis.
EDA naturally leads into feature engineering and feature selection. Feature engineering is the process of taking raw data from the selected datasets and transforming it into "features" that better represent the underlying problem to be solved.
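A minimal sketch of how EDA can flow into feature engineering with pandas, using a small made-up transactions table; the column names and derived features are illustrative.

```python
# Minimal EDA and feature-engineering sketch on a hypothetical transactions table.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 09:30", "2024-01-01 23:10", "2024-01-02 14:05"]),
    "amount": [12.5, 980.0, 43.2],
    "merchant": ["grocery", "electronics", "grocery"],
})

# EDA: summary statistics and missing-value counts to build hypotheses about the data.
print(df.describe(include="all"))
print(df.isna().sum())

# Feature engineering: transform raw columns into features that better
# represent the underlying problem (here, a hypothetical fraud use case).
df["hour"] = df["timestamp"].dt.hour
df["is_night"] = (df["hour"] < 6) | (df["hour"] >= 22)
df["log_amount"] = np.log1p(df["amount"])
df = pd.get_dummies(df, columns=["merchant"])  # one-hot encode the categorical column
```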
After feature engineering and selection, the next step is training.
Data Validation Checks
Are the following validation criteria met for the new training data? (A minimal sketch of such checks follows the list.)
- Data completeness and consistency checks
- Feature distribution comparison with a previous training set
- Predefined metric validation
- Alignment with model refinement goals
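A minimal sketch of such checks in pandas; the column name, schema rules, and the 10% shift threshold are illustrative assumptions.

```python
# Minimal sketch of training-data validation checks; data is synthetic.
import numpy as np
import pandas as pd

previous = pd.DataFrame({"amount": np.random.lognormal(3.0, 1.0, 5000)})
new = pd.DataFrame({"amount": np.random.lognormal(3.05, 1.0, 5000)})

# Completeness and consistency checks.
assert new["amount"].notna().all(), "missing values in 'amount'"
assert (new["amount"] >= 0).all(), "negative amounts violate the expected schema"

# Feature distribution comparison with the previous training set, here using
# the relative difference in means as a simple predefined metric.
relative_shift = abs(new["amount"].mean() - previous["amount"].mean()) / previous["amount"].mean()
print(f"relative mean shift: {relative_shift:.2%}")
assert relative_shift < 0.10, "feature distribution shifted more than the allowed 10%"
```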
Training
The aim is to save enough information about the environment in which the model was developed so that the model can be reproduced from scratch with the same results.
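A minimal sketch of recording that environment information, assuming a scikit-learn stack; experiment trackers such as MLflow or DVC typically capture much of this automatically, and the hyperparameters and dataset path below are illustrative.

```python
# Minimal sketch: record enough information to reproduce a training run.
import json
import platform
import random

import numpy as np
import sklearn

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

run_record = {
    "random_seed": SEED,
    "python_version": platform.python_version(),
    "scikit_learn_version": sklearn.__version__,
    "numpy_version": np.__version__,
    "hyperparameters": {"n_estimators": 100, "max_depth": 8},          # illustrative
    "training_data": {"path": "data/train.parquet", "rows": 100_000},  # illustrative
}

with open("run_record.json", "w") as f:
    json.dump(run_record, f, indent=2)
```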
Explainability also needs to be addressed during training and evaluation: teams should be able to understand how a trained model arrives at its predictions (a small sketch of one technique appears after the list below). The techniques most commonly used today include:
- Partial dependence plots, which look at the marginal impact of features on the predicted outcome
- Subpopulation analyses, which look at how the model treats specific subpopulations and which are the basis of many fairness analyses
- Individual prediction explanations, such as Shapley values, which show how much each feature contributes to a specific prediction
- What-if analysis, which helps the ML model user to understand the sensitivity of the prediction to its inputs
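As a small example of the first technique, the sketch below computes partial dependence with scikit-learn on synthetic data; Shapley values would typically require a dedicated library such as shap, and the feature index here is arbitrary.

```python
# Minimal sketch of a partial dependence computation: the marginal impact of
# one feature on the predicted outcome, averaged over the rest of the data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Average predicted probability as feature 0 is varied over a grid,
# marginalizing over the other features.
pdp = partial_dependence(model, X, features=[0])
print(pdp["average"])
```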
Deployment
Deploying models is a key part of MLOps that presents an entirely different set of technical challenges than developing the model.
There are commonly two types of model deployment:
- Model-as-a-Service, or live-scoring model
  - Typically, the model is deployed into a simple framework to provide a REST API endpoint that responds to requests in real time (a minimal sketch of such an endpoint follows this list).
- Embedded model
  - Here the model is packaged into an application, which is then published.
  - A common example is an application that provides batch scoring of requests.
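A minimal sketch of a live-scoring endpoint using Flask; the route, payload format, and saved model file are illustrative assumptions, not a prescribed framework.

```python
# Minimal sketch of a Model-as-a-Service REST endpoint with Flask.
# Assumes a scikit-learn classifier previously saved as model.joblib.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # e.g. {"features": [[0.1, 3.2, 0.0, 7.5]]}
    scores = model.predict_proba(payload["features"])[:, 1]
    return jsonify({"scores": scores.tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```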
Export the model to a portable format such as PMML, PFA, ONNX, or POJO. These formats aim to improve model portability between systems and simplify deployment. However, they come at a cost: each format supports a limited range of algorithms, and sometimes the portable models behave in subtly different ways than the original.
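A minimal sketch of an ONNX export with the skl2onnx package, assuming a small scikit-learn classifier; comparing the exported model's outputs with the original helps catch the subtle behavioral differences mentioned above.

```python
# Minimal sketch: export a scikit-learn model to ONNX and score it with onnxruntime.
# Requires `pip install skl2onnx onnxruntime`; the 4-feature model is illustrative.
import numpy as np
import onnxruntime as rt
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Verify the ONNX model agrees with the original on a few records.
session = rt.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
onnx_labels = session.run(None, {"input": X[:5].astype(np.float32)})[0]
print(onnx_labels, model.predict(X[:5]))
```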
There are two main ways to score with a deployed model:
- Batch scoring, where whole datasets are processed using a model, such as in daily scheduled jobs.
- Real-time scoring, where one or a small number of records are scored, such as when an advertisement is displayed on a website and a user session is scored by models to decide what to display.
There is a continuum between these two approaches, and in fact, in some systems, scoring one record is technically identical to requesting a batch of one. In both cases, multiple instances of the model can be deployed to increase throughput and potentially lower latency.
Batch scoring can also be parallelized, for example by using a parallel processing runtime like Apache Spark, but also by splitting datasets (usually called partitioning or sharding) and scoring the partitions independently.
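A minimal sketch of partitioned batch scoring using Python's multiprocessing; a production pipeline might use Apache Spark instead, and the shard count and model are illustrative.

```python
# Minimal sketch: split a dataset into shards and score them in parallel.
from concurrent.futures import ProcessPoolExecutor
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def score_partition(model, partition):
    # Each shard is scored independently with the same model.
    return model.predict_proba(partition)[:, 1]

if __name__ == "__main__":
    X, y = make_classification(n_samples=50_000, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X[:1_000], y[:1_000])

    shards = np.array_split(X, 8)  # partition ("shard") the dataset
    with ProcessPoolExecutor(max_workers=4) as pool:
        scores = np.concatenate(list(pool.map(partial(score_partition, model), shards)))
    print(scores.shape)  # (50000,)
```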
Production environments take a wide variety of forms:
- Custom-built services
- Data science platforms
- Dedicated services like TensorFlow Serving
- Low-level infrastructure like Kubernetes clusters
- JVMs on embedded systems
To make things even more complex, consider that in some organizations, multiple heterogeneous production environments coexist.
With distillation, a smaller "student" network is trained to mimic a bigger, more powerful network. Done appropriately, this can lead to better models (as compared to trying to train the smaller network directly from the data).
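Although distillation is usually described for neural networks, the idea can be sketched with scikit-learn models: a small student is fit to the teacher's predicted probabilities (soft targets) rather than to the raw labels. Everything below is illustrative.

```python
# Minimal distillation sketch: the student mimics the teacher's soft predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

teacher = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
soft_targets = teacher.predict_proba(X)[:, 1]  # the teacher's "knowledge"

# The student learns to reproduce the teacher's scores with far fewer parameters.
student = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, soft_targets)

student_scores = student.predict(X)
print(((student_scores > 0.5) == y).mean())  # agreement of the student with the labels
```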
Teams should ask the uncomfortable questions:
- What if the model acts in the worst imaginable way?
- What if a user manages to extract the training data or the internal logic of the model?
- What are the financial, business, legal, safety, and reputation risks?
Measuring Model Performance
Measuring model performance is not always straightforward. Before building and deploying a better model, you need to be able to identify performance degradation. Projects may use vastly different approaches to assess model performance.
There are two common approaches to detect whether a model's performance is degrading:
- Ground truth evaluation, which compares the model's predictions against the true outcomes once they become available
- Input drift detection, which compares the statistical properties of the data being scored against those of the training data
Monitoring
Machine learning models need to be monitored at two levels:
At the resource level, ensuring the model is running correctly in the production environment:
- Is the system alive?
- Are the CPU, RAM, network usage, and disk space as expected?
- Are requests being processed at the expected rate?
At the performance level, monitoring the pertinence of the model over time (a minimal drift-check sketch follows below):
- Is the model still an accurate representation of the patterns in new incoming data?
- Is it performing as well as it did during the design phase?
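A minimal sketch of one performance-level check, input drift detection, which compares a production feature's distribution with the training distribution using a two-sample Kolmogorov-Smirnov test; the feature and threshold are illustrative.

```python
# Minimal drift-check sketch on synthetic data.
import numpy as np
from scipy.stats import ks_2samp

training_feature = np.random.normal(loc=0.0, scale=1.0, size=10_000)
production_feature = np.random.normal(loc=0.3, scale=1.0, size=2_000)  # drifted

stat, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"Input drift detected (KS={stat:.3f}); investigate or consider retraining")
else:
    print("No significant drift detected")
```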
Logging
A model training detail record should capture the following (a minimal sketch follows the list):
- The list of features used
- The preprocessing techniques that are applied to each feature
- The algorithm used with the chosen hyperparameters
- The training dataset
- The test dataset used to evaluate the model
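A minimal sketch of what such a training detail record might look like; every value is illustrative.

```python
# Illustrative training detail record; in practice this would be written by the
# training pipeline alongside the model artifact.
import json

training_record = {
    "features": ["log_amount", "hour", "merchant_category"],
    "preprocessing": {"log_amount": "log1p", "merchant_category": "one-hot"},
    "algorithm": "RandomForestClassifier",
    "hyperparameters": {"n_estimators": 200, "max_depth": 10},
    "training_dataset": "s3://example-bucket/train-2024-01.parquet",  # illustrative path
    "test_dataset": "s3://example-bucket/test-2024-01.parquet",       # illustrative path
}
print(json.dumps(training_record, indent=2))
```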
An event log record should include the following (a sketch of one entry follows the list):
- Timestamp: the time the event occurred
- Model identity: identification of the model and its version
- Prediction input: the processed features of the new observations
  - Optionally, the raw data as well; a sampled portion (e.g., 5% of the data) can be enough
  - Allows verification of incoming data
  - Enables detection of feature drift and data drift
- Prediction output: the predictions made by the model
  - Combined with ground truth for production performance evaluation
- System action: the system's response based on the model's prediction
  - For example, in fraud detection, a high fraud probability can trigger either blocking the transaction or sending a warning
  - Important for understanding user reactions and feedback data
- Model explanation
  - Required in regulated domains (e.g., finance, healthcare)
  - Predictions must include explanations of how much each feature influenced them
  - Computed using techniques such as Shapley values
  - Logged to identify potential issues such as bias and overfitting
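A minimal sketch of a single event log entry with these fields; all values are illustrative, and in practice such entries would be written to a log store for later analysis.

```python
# Illustrative prediction event log entry.
import json
from datetime import datetime, timezone

event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": {"name": "fraud-detector", "version": "1.4.2"},             # illustrative
    "prediction_input": {"log_amount": 6.89, "hour": 23, "is_night": True},
    "prediction_output": {"fraud_probability": 0.91},
    "system_action": "block_transaction",
    "explanation": {"log_amount": 0.42, "hour": 0.31, "is_night": 0.18},  # e.g. Shapley values
}
print(json.dumps(event))
```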
Sources
- Introducing MLOps: How To Scale Machine Learning In The Enterprise