
Transformers

Definition

Transformers are neural architectures based on self-attention: each token attends to all other tokens in the sequence to compute contextual representations. Because they avoid recurrence, all positions can be processed in parallel during training, which is what allows scaling to very large models (BERT, GPT, etc.).

They underpin modern LLMs and have been extended to multimodal and vision models. Encoder-only (BERT) and decoder-only (GPT) variants are most common today; the encoder-decoder layout remains used for sequence-to-sequence tasks.

How it works

  • Attention: each input is projected into query, key, and value vectors; attention weights (a softmax over scaled query-key dot products) form a weighted sum of the values.
  • Multi-head attention: Multiple attention heads capture different relations.
  • Encoder-decoder or decoder-only: Encoder (e.g. BERT) sees full sequence; decoder (e.g. GPT) uses causal masking for autoregressive generation.
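The attention step above can be sketched directly. This is a minimal NumPy version of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, with an optional mask; the toy self-attention call at the bottom (using the same matrix as query, key, and value) is an illustrative assumption, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~zero weight
    # Row-wise softmax turns scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy self-attention: 3 tokens, dimension 4, Q = K = V = X
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(X, X, X)
```

Each row of `w` sums to 1, so every output vector is a convex combination of the value vectors. Multi-head attention simply runs several such computations in parallel on learned projections of the input and concatenates the results.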

The diagram below shows one block: input goes through multi-head attention (with add and norm), then a feed-forward network (FFN), then add and norm again. Encoder stacks use bidirectional attention; decoder stacks use causal (masked) attention so each position only sees past tokens. Residual connections and layer norm stabilize training. Stacking many such blocks and scaling width and depth yields the large models used for NLP and beyond.
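The block structure described above can be sketched end to end. This is a single-head, NumPy-only sketch under simplifying assumptions (one head instead of multi-head, post-layer-norm, ReLU in the FFN, random untrained weights); the weight names (`Wq`, `Wk`, etc.) are illustrative, not from any framework.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2, causal=False):
    # --- self-attention sublayer ---
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        # Causal mask: position i may only attend to positions <= i
        mask = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(mask, scores, -1e9)
    attn = softmax(scores) @ V
    x = layer_norm(x + attn @ Wo)          # add & norm
    # --- feed-forward sublayer (two linear layers with ReLU) ---
    ffn = np.maximum(x @ W1, 0.0) @ W2
    return layer_norm(x + ffn)             # add & norm

# Toy run: 5 tokens, model width 8, FFN width 16, random weights
d, d_ff, seq = 8, 16, 5
rng = np.random.default_rng(0)
p = lambda *shape: rng.normal(scale=0.1, size=shape)
y = transformer_block(p(seq, d), p(d, d), p(d, d), p(d, d),
                      p(d, d), p(d, d_ff), p(d_ff, d), causal=True)
```

With `causal=True` the block behaves like a decoder layer; with `causal=False` it behaves like an encoder layer. Real models stack dozens of these blocks and use multiple heads per attention sublayer.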

Use cases

Transformers underpin most modern NLP and multimodal systems; encoder-only, decoder-only, and encoder-decoder variants suit different tasks.

  • BERT-style: named entity recognition, search relevance, question answering
  • GPT-style: text generation, code completion, chat and dialogue
  • Multimodal transformers for vision-language tasks
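The GPT-style generation loop above follows one simple pattern: repeatedly score the next token given everything generated so far, append the chosen token, and feed the longer sequence back in. The sketch below illustrates that loop with a deliberately trivial stand-in "model" (it just prefers the successor of the last token); in a real system the logits would come from a trained decoder-only transformer.

```python
import numpy as np

def toy_next_token_logits(tokens, vocab_size=5):
    """Stand-in for a language model: prefer (last token + 1) mod vocab_size."""
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

def greedy_generate(prompt, steps, vocab_size=5):
    """Autoregressive greedy decoding: append the highest-scoring token each step."""
    tokens = list(prompt)
    for _ in range(steps):
        logits = toy_next_token_logits(tokens, vocab_size)
        tokens.append(int(np.argmax(logits)))
    return tokens

greedy_generate([0], 4)  # → [0, 1, 2, 3, 4]
```

Sampling-based decoding (temperature, top-k, nucleus) replaces the `argmax` with a draw from the softmax distribution, but the outer loop is the same.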

Pros and cons

Pros:
  • Parallelizable and scalable
  • Strong at long-range dependencies
  • Unified architecture for many tasks

Cons:
  • High compute and memory cost
  • Requires large training data
  • Interpretability challenges
