The process of running a trained AI model to generate predictions or outputs.
Inference is the process of running a trained AI model to generate predictions, completions, or other outputs from input data. When you call an LLM API, you're performing inference. This contrasts with training, which is the process of creating or updating the model's parameters.
For language models, inference involves tokenizing the input text, processing it through the model's layers, sampling from the resulting probability distribution to produce the next token, and repeating until a stop condition is reached. Inference costs depend on model size, input/output length, and computational resources.
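A minimal sketch of that loop, assuming the Hugging Face Transformers library and the small "gpt2" model (any causal LM would work the same way):

```python
# Autoregressive inference: tokenize, run the model's layers,
# sample the next token, repeat until an end-of-sequence token appears.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate up to 20 new tokens
        logits = model(input_ids).logits[:, -1, :]             # scores for the next token
        probs = torch.softmax(logits / 0.8, dim=-1)             # temperature-scaled distribution
        next_token = torch.multinomial(probs, num_samples=1)    # sample one token
        input_ids = torch.cat([input_ids, next_token], dim=-1)  # append and continue
        if next_token.item() == tokenizer.eos_token_id:         # stop condition
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```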
Understanding inference is important for AI engineers because: API costs are based on inference (tokens processed), latency depends on inference speed, self-hosted models require inference infrastructure, and optimization often focuses on making inference faster or cheaper. Factors like quantization, batching, and caching all relate to inference efficiency.
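Because API pricing is typically per token, a back-of-the-envelope cost estimate is straightforward. The per-token prices below are hypothetical placeholders, not any provider's actual pricing:

```python
# Rough inference cost estimate from token counts (prices are assumed, not real).
PRICE_PER_1K_INPUT = 0.0005   # USD per 1,000 input tokens (assumption)
PRICE_PER_1K_OUTPUT = 0.0015  # USD per 1,000 output tokens (assumption)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a single API call."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Example: a 2,000-token prompt that produces a 500-token completion.
print(f"${estimate_cost(2000, 500):.4f}")  # -> $0.0018
```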