The time delay between sending a request and receiving a response from AI.
Latency in AI applications refers to the time delay between initiating a request and receiving a complete or usable response. For LLM-powered applications, latency is a critical factor affecting user experience and system design decisions.
LLM latency has multiple components: network round-trip time, time to first token (TTFT), which reflects queuing and prompt processing, and time per output token (generation speed). Larger models generally have higher latency but better quality, while smaller models offer faster responses with some capability tradeoffs.
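As a rough illustration of these components, here is a minimal Python sketch that measures TTFT and average time per output token. It assumes a hypothetical streaming client, `stream_tokens(prompt)`, that yields tokens as they arrive; substitute your own client.

```python
import time

def measure_latency(stream_tokens, prompt):
    """Measure time to first token (TTFT) and average time per output token.

    `stream_tokens(prompt)` is a hypothetical generator that yields tokens
    as they arrive from the model.
    """
    start = time.perf_counter()
    first_token_at = None
    token_count = 0

    for _token in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # first token received: end of TTFT window
        token_count += 1

    end = time.perf_counter()
    ttft = first_token_at - start if first_token_at is not None else None
    # Per-token time is averaged over the generation phase, after the first token.
    per_token = (
        (end - first_token_at) / max(token_count - 1, 1)
        if first_token_at is not None
        else None
    )
    return ttft, per_token
```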
Strategies for managing latency include: using streaming to show responses incrementally, caching common responses, choosing appropriately-sized models for each task, running inference closer to users (edge deployment), optimizing prompts to reduce output length, and parallel processing where possible; a brief sketch of two of these follows. AI engineers must balance latency requirements against cost and quality for each use case.
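The sketch below illustrates two of these strategies, response caching and task-based model selection, under stated assumptions: `call_model` is a hypothetical stand-in for your LLM client, and the model names are placeholders.

```python
import functools

def call_model(model: str, prompt: str) -> str:
    # Hypothetical placeholder for a real LLM client call.
    raise NotImplementedError("replace with your LLM client")

@functools.lru_cache(maxsize=1024)
def cached_completion(model: str, prompt: str) -> str:
    # Identical (model, prompt) pairs skip the network round trip and
    # generation time entirely on repeat requests.
    return call_model(model, prompt)

def complete(prompt: str, simple_task: bool) -> str:
    # Route simple tasks to a smaller, faster model and reserve the larger
    # model for requests that need its quality.
    model = "small-fast-model" if simple_task else "large-quality-model"
    return cached_completion(model, prompt)
```

In practice the cache key and routing rule would be more nuanced (for example, normalizing prompts or classifying task difficulty), but the latency logic is the same: avoid the model call when possible, and use the smallest model that meets the quality bar when it isn't.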