Inference Taxonomy

Updated at 2024-12-18 19:39

Different ML models have different requirements for inference.

Here are some dimensions to consider when designing an inference pipeline:

Response Mode

Real-time
a response is expected right after the request, a.k.a. "online" inference
Deferred
no immediate response is needed, a.k.a. "offline" inference
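
A minimal sketch of the difference, using a stand-in predict function (assumed for illustration): real-time inference blocks the caller until the result is ready, while deferred inference only enqueues the work for a worker to pick up later.

```python
import queue

def predict(sample: dict) -> float:
    return 0.42  # stand-in for a real model, assumed for illustration

# Real-time ("online"): the caller blocks until the result is ready.
def handle_request(sample: dict) -> float:
    return predict(sample)

# Deferred ("offline"): the caller only enqueues the work;
# a worker processes the queue whenever it gets to it.
jobs: queue.Queue = queue.Queue()

def submit_request(sample: dict) -> None:
    jobs.put(sample)

def drain_jobs() -> list[float]:
    results = []
    while not jobs.empty():
        results.append(predict(jobs.get()))
    return results
```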

Model Location

Remote
the model runs somewhere remote, typically behind a service endpoint called over HTTPS
Local
the model runs on the local machine or an edge device
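
A rough sketch of both, where the endpoint URL and model artifact path are hypothetical: remote inference serializes the sample and calls a service, while local inference loads the model into the same process and calls it directly.

```python
import json
import pickle
import urllib.request

# Remote: send the sample to a model served behind an HTTPS endpoint.
def predict_remote(sample: dict) -> dict:
    request = urllib.request.Request(
        "https://models.example.com/v1/predict",  # hypothetical endpoint
        data=json.dumps(sample).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Local: load the model artifact into this process, once at startup,
# and call it directly with no network hop.
def load_local_model(path: str = "model.pkl"):  # hypothetical artifact
    with open(path, "rb") as f:
        return pickle.load(f)
```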

Trigger

On-demand
inference happens when explicitly requested, e.g. through an API call
Event-based
inference is triggered by another event, e.g. new data becoming available
Scheduled
inference happens on a schedule, e.g. hourly, daily or weekly
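
A sketch of the three trigger styles with a stand-in predict function; in practice the scheduled case would be cron or a workflow scheduler rather than a sleep loop.

```python
import time

def predict(sample: dict) -> float:
    return 0.42  # stand-in model, assumed for illustration

# On-demand: runs when someone explicitly asks, e.g. an API handler.
def on_api_call(sample: dict) -> float:
    return predict(sample)

# Event-based: runs as a reaction to another event,
# e.g. a message on a queue or a file landing in object storage.
def on_new_data(event: dict) -> float:
    return predict(event["sample"])

# Scheduled: runs at fixed intervals.
def scheduled_loop(interval_seconds: float = 3600.0) -> None:
    while True:
        predict({"feature": 1.0})
        time.sleep(interval_seconds)
```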

Processing Pattern

Batch
multiple samples at once, common in scheduled scoring
Single
one sample at a time, typical for real-time models
Stream
incremental samples; useful when samples arrive in a constant flow and you want predictions without waiting for a big chunk to accumulate
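
The three patterns map naturally to different function shapes, sketched here with a stand-in predict function; the stream case yields each prediction as soon as its sample arrives.

```python
from typing import Iterable, Iterator

def predict(sample: dict) -> float:
    return 0.42  # stand-in model, assumed for illustration

# Batch: score many samples in one call, e.g. in a nightly job.
def predict_batch(samples: list[dict]) -> list[float]:
    return [predict(sample) for sample in samples]

# Single: score one sample per call, typical for real-time APIs.
def predict_single(sample: dict) -> float:
    return predict(sample)

# Stream: score samples as they arrive, without waiting for
# a big chunk to accumulate.
def predict_stream(samples: Iterable[dict]) -> Iterator[float]:
    for sample in samples:
        yield predict(sample)
```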

Response Type

Object Response
the inference result is a complete object
Continuous Response
the prediction is streamed back to the client, potentially without a defined end
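
Sketched below with hypothetical return values: an object response comes back whole, while a continuous response is yielded piece by piece, like token-by-token text generation.

```python
from typing import Iterator

# Object response: one complete result per request.
def classify(sample: dict) -> dict:
    return {"label": "cat", "confidence": 0.93}  # illustrative values

# Continuous response: the result is streamed back piece by piece,
# e.g. token-by-token text generation over a long-lived connection.
def generate(prompt: str) -> Iterator[str]:
    for token in ["Hello", ",", " world"]:  # stand-in generative model
        yield token
```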

Training Strategy

Online Training
the model is updated continuously as new data arrives
Periodic Training
a separate pipeline periodically creates a new model version
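
One concrete sketch using scikit-learn (the data here is made up): online training keeps updating the same model object with partial_fit, while periodic training fits a fresh model version from the accumulated dataset.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online training: the deployed model object is updated incrementally
# as new labeled samples arrive.
online_model = SGDClassifier()
online_model.partial_fit(
    np.array([[0.1, 0.2]]), np.array([0]), classes=np.array([0, 1])
)
online_model.partial_fit(np.array([[0.9, 0.8]]), np.array([1]))

# Periodic training: a separate pipeline fits a fresh model version
# from scratch, which then replaces the deployed one.
def train_new_version(X: np.ndarray, y: np.ndarray) -> SGDClassifier:
    return SGDClassifier().fit(X, y)
```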

"embedded" is usually "real-time local on-demand single with periodic training"

"batch" is usually "deferred local scheduled batch with periodic training"