BERT
Definition
BERT is a transformer encoder model pretrained with masked language modeling (MLM) and next-sentence prediction (NSP). It produces contextual embeddings and is typically fine-tuned for downstream NLP tasks.
Unlike GPT-style decoders, BERT uses bidirectional context (both left and right of each token), which suits understanding tasks (e.g. text classification, NER, QA) rather than open-ended generation. It is often used as a frozen or fine-tuned encoder in RAG and search pipelines.
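A minimal pure-Python sketch (not BERT's actual implementation) of this difference: bidirectional self-attention lets every position attend to every other position, while a decoder's causal mask restricts each position to its left context.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(scores, causal=False):
    """scores[i][j] is the raw attention score of query position i toward key j.
    causal=False -> encoder-style (bidirectional); causal=True -> decoder-style."""
    n = len(scores)
    out = []
    for i in range(n):
        row = scores[i]
        if causal:
            # decoder: position i may only attend to positions j <= i
            row = [row[j] if j <= i else float("-inf") for j in range(n)]
        out.append(softmax(row))
    return out

# With uniform scores, a bidirectional token attends equally to all 3 positions,
# while a causal token 0 can only attend to itself.
scores = [[0.0] * 3 for _ in range(3)]
bi = attention_weights(scores)
uni = attention_weights(scores, causal=True)
```

With uniform scores, `bi[0]` spreads weight over all three positions, whereas `uni[0]` puts all weight on position 0.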
How it works
Input text is tokenized and embedded (token + segment + position embeddings). The encoder layers apply bidirectional self-attention and feed-forward networks, so each token's representation is influenced by every other token. The output can be pooled (e.g. the [CLS] vector for sentence-level tasks) or kept as a sequence (one vector per token, for NER or QA). Pretraining combines two objectives: randomly mask tokens and predict them (MLM), and predict whether two sentences are consecutive (NSP). Fine-tuning adds a task head (e.g. a linear classifier) and updates the whole model, or only the head, on labeled data.
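The MLM masking step above can be sketched in pure Python. This follows BERT's published 80/10/10 rule: of the positions selected as prediction targets, 80% become [MASK], 10% become a random token, and 10% keep the original token (function name and signature are illustrative, not a real library API).

```python
import random

def mlm_mask(tokens, vocab, mask_rate=0.15, rng=None):
    """BERT-style masking: select ~15% of positions as prediction targets.
    Of those, 80% -> [MASK], 10% -> random vocab token, 10% -> unchanged."""
    rng = rng or random.Random(0)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"     # 80%: replace with the mask token
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
            # else 10%: keep the original token (label is still set)
    return inputs, labels
```

Keeping some targets unchanged (the final 10%) discourages the model from assuming that every non-[MASK] token is already correct.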
Use cases
BERT-style models excel when you need rich contextual representations for understanding (classification, NER, QA) rather than generation.
- Named entity recognition and relation extraction
- Search and retrieval (semantic matching, relevance ranking)
- Question answering and natural language inference
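For classification use cases, fine-tuning typically adds a small task head over the pooled [CLS] vector. A minimal pure-Python sketch of such a head (the weights here are illustrative placeholders, not trained parameters):

```python
import math

def classify(cls_vector, weights, biases):
    """Linear task head over the pooled [CLS] vector, as added in fine-tuning.
    weights: one row of coefficients per class; biases: one bias per class.
    Returns a softmax distribution over classes."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# toy example: a 4-dim "[CLS]" vector scored against 2 classes
probs = classify([0.5, -0.2, 0.1, 0.0],
                 weights=[[1.0, 0.0, 0.0, 0.0], [-1.0, 0.0, 0.0, 0.0]],
                 biases=[0.0, 0.0])
```

In practice the head and (usually) the encoder weights are updated jointly on labeled data; in real pipelines this head is a single linear layer in a framework such as PyTorch.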