
Evaluation Metrics

Definition

Evaluation metrics quantify how well models perform: accuracy, F1, BLEU, ROUGE, perplexity, human preference, and so on. The right choice depends on the task (classification, generation, retrieval) and on the goals (fairness, robustness).
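As a concrete illustration, accuracy and F1 can be computed directly from predictions and reference labels. This is a minimal pure-Python sketch with made-up toy labels; in practice a library such as scikit-learn would typically be used.

```python
def accuracy(preds, refs):
    # Fraction of predictions that exactly match the reference labels.
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def f1_binary(preds, refs, positive=1):
    # F1 = harmonic mean of precision and recall for the positive class.
    tp = sum(p == positive and r == positive for p, r in zip(preds, refs))
    fp = sum(p == positive and r != positive for p, r in zip(preds, refs))
    fn = sum(p != positive and r == positive for p, r in zip(preds, refs))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example (labels are invented for illustration):
preds = [1, 0, 1, 1, 0]
refs  = [1, 0, 0, 1, 1]
print(accuracy(preds, refs))   # 0.6
print(f1_binary(preds, refs))  # 0.666...
```

Note that accuracy and F1 can disagree: on imbalanced data, F1 is usually the more informative of the two because it ignores true negatives.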

They are used in benchmarks, during development, and in production (A/B tests, monitoring). No single metric captures everything; for LLMs and subjective tasks, combine automated metrics with human evaluation. See bias in AI for fairness-related metrics.

How It Works

Predictions (model outputs) and references (ground truth or human answers) are fed into a metric, which produces a score. Classification: accuracy, F1, AUC. Generation: BLEU, ROUGE, BERTScore, or learned metrics. Retrieval: recall@k, MRR. For LLMs, benchmarks (MMLU, HumanEval) run fixed prompts and aggregate metrics; human evaluation (preference, correctness) is often needed for open-ended quality. Metrics should align with the product goal and be reported on held-out or standard splits.
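The retrieval metrics mentioned above follow the same pattern of comparing outputs to references: recall@k asks how many relevant documents appear in the top k results, and MRR averages the reciprocal rank of the first relevant hit. A minimal sketch with invented document IDs:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the relevant documents that appear in the top-k results.
    hits = sum(doc in relevant_ids for doc in ranked_ids[:k])
    return hits / len(relevant_ids)

def mrr(queries):
    # Mean reciprocal rank over queries: average of 1/rank of the
    # first relevant result (0 if no relevant result is retrieved).
    total = 0.0
    for ranked_ids, relevant_ids in queries:
        for rank, doc in enumerate(ranked_ids, start=1):
            if doc in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(queries)

# Toy example (document IDs are invented for illustration):
print(recall_at_k(["d3", "d1", "d2"], {"d1", "d2"}, k=2))   # 0.5
print(mrr([(["d3", "d1", "d2"], {"d1"}),
           (["d2", "d4"], {"d5"})]))                        # 0.25
```

Recall@k is the natural metric when any relevant result in the top k suffices (e.g. retrieval-augmented generation); MRR rewards ranking a relevant result as high as possible.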

Use Cases

Evaluation metrics are needed whenever you train or ship a model: to compare runs, track quality, and audit fairness or safety.

  • Comparing models on classification (accuracy, F1), generation (BLEU, ROUGE), or retrieval
  • Tracking progress in development and A/B tests
  • Auditing for fairness, robustness, or safety
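One common fairness audit mentioned above is comparing a metric across demographic groups. This is a minimal sketch, assuming group labels are available per example; large accuracy gaps between groups can flag disparities worth investigating (the data here is invented for illustration).

```python
from collections import defaultdict

def group_accuracies(preds, refs, groups):
    # Accuracy computed separately for each group; a large gap between
    # the per-group values is a simple signal of disparate performance.
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, r, g in zip(preds, refs, groups):
        total[g] += 1
        correct[g] += (p == r)
    return {g: correct[g] / total[g] for g in total}

# Toy example (labels and group tags are invented):
accs = group_accuracies(preds=[1, 0, 1, 1],
                        refs=[1, 0, 1, 0],
                        groups=["a", "a", "b", "b"])
print(accs)  # {'a': 1.0, 'b': 0.5}
```

The same pattern applies to other metrics (F1, error rate) and to more formal criteria such as demographic parity or equalized odds.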

External Documentation

See Also