The neural network architecture that powers modern language models.
The Transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" that has become the foundation of modern language models. Its key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input when processing each position.
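To make the self-attention idea concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The function name `self_attention`, the single-head setup, and the toy shapes are illustrative assumptions, not the full multi-head layer from the paper or any specific library's API.

```python
import numpy as np

def self_attention(x: np.ndarray, w_q: np.ndarray, w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices (toy shapes)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # project inputs to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (seq_len, seq_len) pairwise similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each position's attention distribution
    return weights @ v                              # weighted sum of values for each position

# Toy usage: 4 tokens, model width 8, head width 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```

Each row of the attention weights sums to 1, so every output position is a learned mixture of the whole sequence rather than only its neighbors.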
Transformers replaced earlier recurrent architectures (RNNs, LSTMs) because they can process entire sequences in parallel, enabling much more efficient training on large datasets. The original architecture pairs an encoder stack with a decoder stack, though many LLMs are decoder-only variants (like GPT) or encoder-only variants (like BERT), depending on their purpose.
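The practical difference between the variants is largely about masking. The sketch below, under the same illustrative NumPy setup as above, shows the causal mask a decoder-only model applies to its attention scores so that each position can only attend to earlier tokens; an encoder-only model like BERT omits this mask and lets every position see the whole input. The helper names are hypothetical.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular boolean mask: True where attention is allowed."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def apply_causal_mask(scores: np.ndarray) -> np.ndarray:
    """Set disallowed (future) positions to -inf so softmax assigns them zero weight."""
    mask = causal_mask(scores.shape[-1])
    return np.where(mask, scores, -np.inf)

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```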
Understanding transformer basics helps AI engineers make better decisions about model selection, understand performance characteristics, and troubleshoot issues. Key concepts include attention heads, layer normalization, positional encodings, and the quadratic scaling of attention with sequence length (which limits context windows).
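The quadratic scaling claim can be seen with a back-of-the-envelope calculation: the attention score matrix has one entry per pair of positions, so it grows with the square of the sequence length, and doubling the context roughly quadruples that memory. The head count and precision below are illustrative assumptions, not figures for any particular model.

```python
# Rough memory for attention score matrices alone, assuming float32 scores
# and 32 attention heads (illustrative numbers, not a specific model).
BYTES_PER_FLOAT32 = 4
NUM_HEADS = 32

for seq_len in (1_024, 4_096, 16_384, 131_072):
    score_bytes = seq_len * seq_len * BYTES_PER_FLOAT32 * NUM_HEADS
    print(f"{seq_len:>7} tokens -> {score_bytes / 2**30:8.1f} GiB of attention scores")
```

Running this prints roughly 0.1, 2, 32, and 2048 GiB, which is why long context windows require approximation or optimization techniques rather than naive attention.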