Benchmarks
Definition
Benchmarks are standardized datasets and evaluation protocols (e.g., GLUE and SuperGLUE for NLP; MMLU for broad knowledge; HumanEval for code). They enable comparison across models and over time.
They rely on evaluation metrics and fixed dataset splits so that results are comparable. Overfitting to benchmarks is a well-known problem; supplement them with out-of-distribution and human evaluation when deploying LLMs or production systems.
How it works
A model is run on a benchmark dataset (fixed prompts or inputs, standard split). Metrics (e.g., accuracy, pass@k) are computed per task and often averaged; results are reported on a leaderboard or in papers. Protocols define what inputs to use, how to parse outputs, and which metrics to report. Reusing the same benchmark over time lets the community track progress. Care is needed: models can overfit to benchmark quirks, and benchmarks may not reflect real-world quality, so use them as one signal among others.
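The loop described here (run the model on fixed inputs, parse outputs, compute a metric) can be sketched minimally. The toy model and the tiny evaluation set below are hypothetical placeholders, not a real benchmark:

```python
# Minimal benchmark-evaluation sketch: run a model over a fixed split,
# parse its output, and compute accuracy. The "model" and the dataset
# are hypothetical stand-ins for illustration only.

def toy_model(prompt: str) -> str:
    # Placeholder model: answers arithmetic prompts of the form "a + b = ?"
    a, b = prompt.split(" = ")[0].split(" + ")
    return str(int(a) + int(b))

# Fixed evaluation split: (prompt, reference answer) pairs.
EVAL_SET = [
    ("1 + 1 = ?", "2"),
    ("2 + 3 = ?", "5"),
    ("10 + 4 = ?", "14"),
]

def evaluate(model, dataset) -> float:
    """Accuracy: fraction of examples whose parsed output matches the reference."""
    correct = sum(model(prompt).strip() == ref for prompt, ref in dataset)
    return correct / len(dataset)

print(evaluate(toy_model, EVAL_SET))  # → 1.0
```

Real harnesses add per-task metric definitions, standardized output parsing, and aggregation across tasks, but the shape of the loop is the same.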
Use cases
Benchmarks give a common yardstick to compare models and methods; use them together with task-specific and human evaluation.
- Comparing NLP models (e.g., GLUE, SuperGLUE, MMLU)
- Evaluating code generation (e.g., HumanEval) or reasoning
- Tracking model and method progress over time
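For code-generation benchmarks such as HumanEval, pass@k is commonly reported with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c samples that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    pass the tests, is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per problem, 3 pass the unit tests.
print(pass_at_k(10, 3, 1))  # → 0.3 (for k=1 this reduces to c/n)
```

Averaging this value over all problems in the benchmark gives the reported pass@k score.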
External documentation
- Papers with Code – Leaderboards
- MMLU (Hendrycks et al.) — Broad knowledge benchmark
- HumanEval — Code generation benchmark