The basic units of text that language models process, typically word fragments.
Tokens are the fundamental units that language models use to process and generate text. Rather than working with individual characters or complete words, most LLMs use subword tokenization, which breaks text into meaningful chunks that balance vocabulary size with representation efficiency.
Common tokenization schemes include BPE (Byte Pair Encoding) and SentencePiece. A rough approximation is that 1 token equals about 4 characters or 0.75 words in English, though this varies by language and content type. Code often tokenizes less efficiently than prose.
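To make the merge process behind BPE concrete, here is a toy sketch in plain Python (not a real tokenizer; the corpus and merge count are arbitrary illustrations). It starts from individual characters and repeatedly fuses the most frequent adjacent pair into a new subword symbol:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def bpe_train(text, num_merges):
    """Learn a merge table by repeatedly fusing the most frequent pair."""
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merged = pair[0] + pair[1]
        merges.append(pair)
        # Replace every occurrence of the pair with the merged symbol.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", num_merges=3)
print(tokens)  # subword units after 3 merges, e.g. ['low', ' ', 'lowe', 'r', ...]
print(merges)  # the learned merge rules, e.g. [('l', 'o'), ('lo', 'w'), ...]
```

Production tokenizers apply the same idea at byte level over massive corpora with tens of thousands of merges, which is how frequent words end up as single tokens while rare words split into fragments.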
Understanding tokens is essential for AI engineers because: (1) API pricing is typically per-token, (2) context windows are measured in tokens, (3) generation speed is measured in tokens per second, and (4) token boundaries can affect model behavior. Tools like tiktoken help count tokens for specific models, enabling cost estimation and context management in production applications.
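A minimal sketch of token counting with tiktoken, the library named above. The choice of the cl100k_base encoding and the per-token price are assumptions for illustration; match the encoding to your target model and check your provider's current rates:

```python
import tiktoken

# Load an encoding used by many recent OpenAI models (an assumption;
# use tiktoken.encoding_for_model(...) to match a specific model).
enc = tiktoken.get_encoding("cl100k_base")

prose = "Tokens are the fundamental units that language models process."
code = "def f(x):\n    return x ** 2 + 1"

# Compare how efficiently prose and code tokenize.
for label, text in [("prose", prose), ("code", code)]:
    n_tokens = len(enc.encode(text))
    print(f"{label}: {len(text)} chars -> {n_tokens} tokens "
          f"({len(text) / n_tokens:.1f} chars/token)")

# Hypothetical input price, illustrative only -- not a real rate.
PRICE_PER_1K_INPUT_TOKENS = 0.0025  # USD per 1,000 tokens (assumed)
prompt_tokens = len(enc.encode(prose))
cost = prompt_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"Estimated input cost: ${cost:.6f}")
```

Counting tokens this way before sending a request lets you budget context-window usage and estimate cost without calling the API.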