Transformers
Definition
Transformers are neural architectures built on self-attention: each token attends to all others to compute contextual representations. By avoiding recurrence they can be trained in parallel across sequence positions, which enables scaling to very large models (BERT, GPT, etc.).
They underpin modern LLMs and have been extended to multimodal and vision models. Encoder-only (BERT) and decoder-only (GPT) variants are most common today; the encoder-decoder layout remains in use for sequence-to-sequence tasks such as translation.
How it works
- Attention: queries, keys, and values are linear projections of the input; softmax-normalized query-key similarities weight the values.
- Multi-head attention: Multiple attention heads capture different relations.
- Encoder vs. decoder: an encoder (e.g., BERT) attends over the full sequence; a decoder (e.g., GPT) uses causal masking for autoregressive generation.
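The attention step above can be sketched in a few lines. The function below is a minimal single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, written with NumPy purely for illustration; the optional boolean mask implements the causal masking mentioned for decoders.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V for one head; mask[i, j] = False blocks key j for query i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # query-key similarities, shape (seq, seq)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

# Causal (decoder-style) mask: position i may only attend to positions <= i.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                # 4 tokens, feature size 8
causal = np.tril(np.ones((4, 4), dtype=bool))
out, w = scaled_dot_product_attention(x, x, x, mask=causal)
```

With the causal mask, row i of the weight matrix is zero for all j > i, so generation can proceed left to right; omitting the mask gives the bidirectional attention used in encoders.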
One block works as follows: the input passes through multi-head attention (with a residual add and layer norm), then a position-wise feed-forward network (FFN), then another add and norm. Encoder stacks use bidirectional attention; decoder stacks use causal (masked) attention so each position only sees past tokens. Residual connections and layer normalization stabilize training. Stacking many such blocks and scaling width and depth yields the large models used for NLP and beyond.
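The block just described can be sketched as follows. This is a minimal post-norm, single-head NumPy version; the weight shapes and initialization are illustrative assumptions, and a real implementation would use multiple heads and learned parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token's features to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    # 1) self-attention sublayer with residual connection and layer norm
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)                 # attention weights
    x = layer_norm(x + (w @ V) @ Wo)                 # add & norm
    # 2) position-wise feed-forward sublayer with residual and norm
    h = np.maximum(0.0, x @ W1) @ W2                 # two-layer ReLU FFN
    return layer_norm(x + h)                         # add & norm

rng = np.random.default_rng(1)
d, d_ff, seq = 8, 32, 5
params = [rng.standard_normal(s) * 0.1 for s in
          [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]
y = transformer_block(rng.standard_normal((seq, d)), *params)
```

Stacking such blocks, and splitting attention across several heads, gives the deep encoder or decoder stacks described above.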
Use cases
Transformers underpin most modern NLP and multimodal systems; encoder-only, decoder-only, and encoder-decoder variants suit different tasks.
- BERT-style: named entity recognition, search relevance, question answering
- GPT-style: text generation, code completion, chat and dialogue
- Multimodal transformers for vision-language tasks
Pros and cons
| Pros | Cons |
|---|---|
| Parallelizable, scalable | High compute and memory |
| Strong at long-range dependencies | Requires large data |
| Unified architecture for many tasks | Interpretability challenges |
External documentation
- Attention Is All You Need (Vaswani et al.) — Original transformer paper
- Hugging Face – Summary of the models — Transformer model families
- The Illustrated Transformer — Visual explanation of the architecture