
Latency

The time delay between sending a request and receiving a response from AI.

Full Definition

Latency in AI applications refers to the time delay between initiating a request and receiving a complete or usable response. For LLM-powered applications, latency is a critical factor affecting user experience and system design decisions.

LLM latency has multiple components: network round-trip time, time to first token (TTFT), and time per output token (generation speed). Larger models generally have higher latency but better quality, while smaller models offer faster responses with some capability tradeoffs.
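As a concrete illustration, the sketch below times TTFT, per-token generation speed, and total latency for a streamed chat completion. It assumes the OpenAI Python SDK (openai >= 1.x) with an API key in the environment; the model name is a placeholder, and chunk count is used as a rough proxy for token count.

```python
import time

from openai import OpenAI  # assumes the OpenAI Python SDK (openai >= 1.x) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
first_token_at = None
chunk_count = 0

# Stream the response so the arrival time of each chunk can be observed.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name; substitute your own
    messages=[{"role": "user", "content": "Summarize latency in one sentence."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if not delta:
        continue  # skip role-only / empty chunks
    if first_token_at is None:
        first_token_at = time.perf_counter()  # marks time to first token (TTFT)
    chunk_count += 1  # rough proxy for output token count

end = time.perf_counter()
if first_token_at is None:
    raise RuntimeError("No tokens were received from the stream")

print(f"Time to first token: {first_token_at - start:.3f}s")
print(f"Time per output chunk: {(end - first_token_at) / chunk_count:.4f}s")
print(f"Total latency: {end - start:.3f}s")
```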

Strategies for managing latency include: streaming responses so users see output incrementally, caching common responses, choosing appropriately sized models for each task, running inference closer to users (edge deployment), optimizing prompts to reduce output length, and parallelizing independent requests where possible. AI engineers must balance latency requirements against cost and quality needs for each use case.
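To illustrate one of these strategies, here is a minimal caching sketch: requests are keyed by their parameters, and repeat requests are served from an in-process dictionary instead of calling the model again. The `call_model` callable and the dictionary cache are assumptions for the example; a production system would more likely use a shared cache such as Redis with an expiry policy.

```python
import hashlib
import json
from typing import Callable

# Hypothetical in-process cache; production systems would typically use a
# shared store (e.g. Redis) with a TTL instead of a plain dict.
_cache: dict[str, str] = {}


def cache_key(model: str, prompt: str) -> str:
    """Derive a stable key from the request parameters."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def cached_completion(model: str, prompt: str,
                      call_model: Callable[[str, str], str]) -> str:
    """Return a cached response when available; otherwise call the model.

    `call_model` is a hypothetical callable that performs the real
    (slow, costly) LLM request and returns its text.
    """
    key = cache_key(model, prompt)
    if key in _cache:
        return _cache[key]  # cache hit: near-zero latency, no model call
    response = call_model(model, prompt)  # cache miss: pay full model latency
    _cache[key] = response
    return response
```

This only helps when identical requests recur; for paraphrased queries, some systems extend the idea to semantic caching keyed on embeddings rather than exact text.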

