Inference Taxonomy
Updated at 2024-12-18 19:39
Different ML models have different requirements for inference.
Here are some dimensions to consider when designing an inference pipeline:
Response Mode
- Real-time
- the response is expected right after the request, a.k.a. "online"
- Deferred
- no immediate response is needed, a.k.a. "offline"
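The difference shows up in the calling code: a real-time path returns the prediction synchronously, while a deferred path only enqueues the work. A minimal Python sketch, assuming a scikit-learn-style model and using an in-process queue as a stand-in for a real job queue:

```python
import queue

# In-process queue standing in for a real job queue (e.g. Celery, SQS);
# a background worker would consume requests from it.
pending_requests = queue.Queue()


def realtime_predict(model, features: list[float]):
    """Real-time ("online"): the caller blocks until the result arrives."""
    return model.predict([features])[0]


def deferred_predict(features: list[float]) -> None:
    """Deferred ("offline"): enqueue the request and return immediately;
    a worker scores it later and stores the result for pickup."""
    pending_requests.put({"features": features})
```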
Model Location
- Remote
- the model runs somewhere remote, typically behind a service endpoint reached over HTTPS
- Local
- the model runs on the local machine or an edge device
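Both locations can serve the same prediction; what changes is where the compute runs. A minimal sketch of the two call paths; the endpoint URL and model path are hypothetical:

```python
import json
import urllib.request

import joblib  # assumes a scikit-learn-style model serialized with joblib


def predict_remote(features: list[float]) -> dict:
    """Remote: send the features to a service endpoint over HTTPS."""
    payload = json.dumps({"features": features}).encode("utf-8")
    request = urllib.request.Request(
        "https://models.example.com/v1/predict",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def predict_local(features: list[float]):
    """Local: load the model from disk and run it in-process."""
    model = joblib.load("model.joblib")  # hypothetical path
    return model.predict([features])[0]
```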
Trigger
- On-demand
- inference happens on demand, e.g. a direct API call
- Event-based
- inference is triggered by another event, e.g. new data becoming available
- Scheduled
- inference happens on a schedule, e.g. hourly, daily, or weekly
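As a toy illustration of the scheduled trigger, the sketch below runs a hypothetical scoring job every hour. In practice this loop is usually replaced by a cron entry or an orchestrator such as Airflow:

```python
import time
from datetime import datetime, timezone


def run_scoring_job() -> None:
    """Hypothetical stand-in for the actual inference job."""
    print(f"scoring run at {datetime.now(timezone.utc).isoformat()}")


INTERVAL_SECONDS = 3600  # hourly

while True:  # illustration only; a scheduler normally owns this
    run_scoring_job()
    time.sleep(INTERVAL_SECONDS)
```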
Processing Pattern
- Batch
- multiple samples at once, common in scheduled scoring
- Single
- one sample at a time, usually paired with real-time models
- Stream
- incremental samples; useful when there is a constant flow of samples and you want predictions without waiting for a big chunk to accumulate
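A minimal sketch of the three patterns, assuming a scikit-learn-style model with a vectorized predict(); the streaming variant scores small chunks as samples arrive instead of waiting for the full dataset:

```python
from typing import Iterable, Iterator

import numpy as np


def predict_single(model, sample: np.ndarray):
    """Single: one sample per call, typical for real-time serving."""
    return model.predict(sample.reshape(1, -1))[0]


def predict_batch(model, samples: np.ndarray) -> np.ndarray:
    """Batch: many samples at once, typical for scheduled scoring."""
    return model.predict(samples)


def predict_stream(
    model, samples: Iterable[np.ndarray], chunk_size: int = 32
) -> Iterator[np.ndarray]:
    """Stream: score small chunks as they arrive."""
    chunk: list[np.ndarray] = []
    for sample in samples:
        chunk.append(sample)
        if len(chunk) == chunk_size:
            yield model.predict(np.stack(chunk))
            chunk = []
    if chunk:  # flush the final partial chunk
        yield model.predict(np.stack(chunk))
```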
Response Type
- Object Response
- the inference result is a complete object
- Continuous Response
- the prediction is streamed back to the client piece by piece, potentially open-ended
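The contrast is easiest to see in the shape of the return value: one complete object versus a lazily produced stream of partial results. A minimal sketch with dummy outputs; in a web service the continuous case typically maps to chunked responses or server-sent events:

```python
from typing import Iterator


def object_response(prompt: str) -> dict:
    """Object response: one complete result, returned when it is ready."""
    return {"prompt": prompt, "label": "positive", "score": 0.93}  # dummy values


def continuous_response(prompt: str) -> Iterator[str]:
    """Continuous response: pieces of the prediction are yielded as they
    are produced, e.g. token-by-token decoding."""
    for token in ["The", " answer", " is", " 42", "."]:  # dummy tokens
        yield token


for token in continuous_response("What is the answer?"):
    print(token, end="", flush=True)
print()
```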
Training Strategy
- Online Training
- the model is updated continuously as new data arrives
- Periodic Training
- a separate pipeline periodically produces a new model version
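One way to see the difference is in the training calls, sketched below with scikit-learn's SGDClassifier (picked here only because it supports incremental updates via partial_fit; the surrounding data plumbing is assumed):

```python
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()  # the live model, supports incremental updates


def update_online(X_new, y_new, classes) -> None:
    """Online training: fold new labeled samples into the live model."""
    model.partial_fit(X_new, y_new, classes=classes)


def retrain_periodic(X_all, y_all) -> SGDClassifier:
    """Periodic training: a separate pipeline refits from scratch and
    publishes the result as a new model version."""
    new_model = SGDClassifier()
    new_model.fit(X_all, y_all)
    return new_model
```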
"embedded" is usually "real-time local on-demand single with periodic training"
"batch" is usually "deferred local scheduled batch with periodic training"