Inference Taxonomy

Updated at 2024-12-18 19:39

Different ML models have different requirements for inference.

Here are some dimensions to consider when designing an inference pipeline:

Response Mode

Real-time
a response is expected right after the request, a.k.a. "online" inference
Deferred
no immediate response is needed, a.k.a. "offline" inference
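
A minimal sketch of the difference, using a stand-in predict function (assumed for illustration): real-time inference blocks the caller until the result is ready, while deferred inference only enqueues the work for a worker to pick up later.

```python
import queue

def predict(sample: dict) -> float:
    return 0.42  # stand-in for a real model, assumed for illustration

# Real-time ("online"): the caller blocks until the result is ready.
def handle_request(sample: dict) -> float:
    return predict(sample)

# Deferred ("offline"): the caller only enqueues the work;
# a worker processes the queue whenever it gets to it.
jobs: queue.Queue = queue.Queue()

def submit_request(sample: dict) -> None:
    jobs.put(sample)

def drain_jobs() -> list[float]:
    results = []
    while not jobs.empty():
        results.append(predict(jobs.get()))
    return results
```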

Model Location

Remote
the model runs somewhere remote, typically behind a service endpoint called over HTTPS
Local
the model runs on the local machine or an edge device
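
A rough sketch of both, where the endpoint URL and model artifact path are hypothetical: remote inference serializes the sample and calls a service, while local inference loads the model into the same process and calls it directly.

```python
import json
import pickle
import urllib.request

# Remote: send the sample to a model served behind an HTTPS endpoint.
def predict_remote(sample: dict) -> dict:
    request = urllib.request.Request(
        "https://models.example.com/v1/predict",  # hypothetical endpoint
        data=json.dumps(sample).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Local: load the model artifact into this process, once at startup,
# and call it directly with no network hop.
def load_local_model(path: str = "model.pkl"):  # hypothetical artifact
    with open(path, "rb") as f:
        return pickle.load(f)
```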

Trigger

On-demand
inference happens when explicitly requested, e.g. through an API call
Event-based
inference is triggered by another event, e.g. new data becoming available
Scheduled
inference happens on a schedule, e.g. hourly, daily or weekly
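
A sketch of the three trigger styles with a stand-in predict function; in practice the scheduled case would be cron or a workflow scheduler rather than a sleep loop.

```python
import time

def predict(sample: dict) -> float:
    return 0.42  # stand-in model, assumed for illustration

# On-demand: runs when someone explicitly asks, e.g. an API handler.
def on_api_call(sample: dict) -> float:
    return predict(sample)

# Event-based: runs as a reaction to another event,
# e.g. a message on a queue or a file landing in object storage.
def on_new_data(event: dict) -> float:
    return predict(event["sample"])

# Scheduled: runs at fixed intervals.
def scheduled_loop(interval_seconds: float = 3600.0) -> None:
    while True:
        predict({"feature": 1.0})
        time.sleep(interval_seconds)
```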

Processing Pattern

Batch
multiple samples at once, common in scheduled scoring
Single
one sample at a time, typical for real-time models
Stream
incremental samples; useful when samples arrive in a constant flow and you want predictions without waiting for a big chunk to accumulate
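
The three patterns map naturally to different function shapes, sketched here with a stand-in predict function; the stream case yields each prediction as soon as its sample arrives.

```python
from typing import Iterable, Iterator

def predict(sample: dict) -> float:
    return 0.42  # stand-in model, assumed for illustration

# Batch: score many samples in one call, e.g. in a nightly job.
def predict_batch(samples: list[dict]) -> list[float]:
    return [predict(sample) for sample in samples]

# Single: score one sample per call, typical for real-time APIs.
def predict_single(sample: dict) -> float:
    return predict(sample)

# Stream: score samples as they arrive, without waiting for
# a big chunk to accumulate.
def predict_stream(samples: Iterable[dict]) -> Iterator[float]:
    for sample in samples:
        yield predict(sample)
```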

Response Type

Object Response
the inference result is a complete object
Continuous Response
the prediction is streamed back to the client, potentially without a defined end
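
Sketched below with hypothetical return values: an object response comes back whole, while a continuous response is yielded piece by piece, like token-by-token text generation.

```python
from typing import Iterator

# Object response: one complete result per request.
def classify(sample: dict) -> dict:
    return {"label": "cat", "confidence": 0.93}  # illustrative values

# Continuous response: the result is streamed back piece by piece,
# e.g. token-by-token text generation over a long-lived connection.
def generate(prompt: str) -> Iterator[str]:
    for token in ["Hello", ",", " world"]:  # stand-in generative model
        yield token
```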

Training Strategy

Online Training
the model is updated continuously as new data arrives
Periodic Training
a separate pipeline periodically creates a new model version
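
One concrete sketch using scikit-learn (the data here is made up): online training keeps updating the same model object with partial_fit, while periodic training fits a fresh model version from the accumulated dataset.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Online training: the deployed model object is updated incrementally
# as new labeled samples arrive.
online_model = SGDClassifier()
online_model.partial_fit(
    np.array([[0.1, 0.2]]), np.array([0]), classes=np.array([0, 1])
)
online_model.partial_fit(np.array([[0.9, 0.8]]), np.array([1]))

# Periodic training: a separate pipeline fits a fresh model version
# from scratch, which then replaces the deployed one.
def train_new_version(X: np.ndarray, y: np.ndarray) -> SGDClassifier:
    return SGDClassifier().fit(X, y)
```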

"embedded" is usually "real-time local on-demand single with periodic training"

"batch" is usually "deferred local scheduled batch with periodic training"