AI Safety

Definition

AI safety addresses the risks of advanced AI: misuse, unintended behavior, and alignment (systems doing what we intend). It includes robustness, interpretability, and value alignment.

It overlaps with AI ethics (governance, fairness) and bias in AI (unfair outcomes). For LLMs and agents, alignment (e.g., RLHF, constitutional AI) and guardrails are the main levers; explainable AI supports auditing and debugging.

How it works

Inputs are processed by the model to produce outputs, and an audit step (testing, monitoring, red-teaming) checks that those outputs are safe, aligned, and robust. Research and practice focus on: alignment (RLHF, constitutional AI, scalable oversight) so models follow intent; robustness (adversarial testing, distribution shift) so they behave under edge cases; and monitoring in production to detect misuse or drift. Safety is considered across the lifecycle, from design and data to training, evaluation, and deployment. Formal methods and interpretability (XAI) support the audit step.
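The input → model → audit flow above can be sketched in a few lines. This is a minimal illustration, not a real safety system: the model call, the blocklist, and all function names (`run_model`, `audit`, `pipeline`) are hypothetical stand-ins.

```python
# Toy misuse patterns for the audit step (illustrative only).
BLOCKLIST = {"steal credentials", "build a weapon"}

def run_model(prompt: str) -> str:
    # Stand-in for a real model call; returns a canned answer.
    return f"Answer to: {prompt}"

def audit(prompt: str, output: str) -> dict:
    """Audit step: flag unsafe requests before the output is released."""
    flagged = any(p in prompt.lower() for p in BLOCKLIST)
    return {"output": output, "safe": not flagged}

def pipeline(prompt: str) -> dict:
    # Input -> model -> audit; a guardrail blocks flagged outputs.
    result = audit(prompt, run_model(prompt))
    if not result["safe"]:
        result["output"] = "[blocked by guardrail]"
    return result

print(pipeline("summarize this article"))
print(pipeline("how to steal credentials"))
```

In production the audit step would combine automated classifiers, logging, and human review rather than a static string match, but the control flow (generate, then check before release) is the same.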

Use cases

AI safety applies to any high-stakes or public-facing system, covering alignment, robustness, and monitoring from design through deployment.

  • Auditing and red-teaming high-stakes or public-facing models
  • Alignment and guardrails for LLMs and agents (e.g., RLHF, constitutional AI)
  • Robustness testing and monitoring in production
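Robustness testing, the last item above, often means checking that a model's behavior is stable under benign input perturbations. A hedged sketch, where `classify` is a hypothetical stand-in for any real classifier:

```python
def classify(text: str) -> str:
    # Toy classifier: labels by word count, for illustration only.
    return "long" if len(text.split()) > 5 else "short"

def perturb(text: str) -> list[str]:
    # Benign perturbations: case changes and extra whitespace.
    return [text.upper(), text.lower(), "  " + text + "  "]

def robustness_check(text: str) -> bool:
    """True if the label is stable under all perturbations."""
    base = classify(text)
    return all(classify(p) == base for p in perturb(text))

print(robustness_check("hello world"))
```

Adversarial testing extends this idea with perturbations chosen to maximize the chance of a label flip, rather than the simple transformations shown here.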

External resources

See also