Anthropic ★★ · Frequent · Medium · Evals / Alignment / Red-Team

A39 · Design an Evals Platform for Alignment Research

Verified source

Similar to OpenAI O30 but specialised for safety research. Anthropic publishes its eval work. Credibility: B.

Key decisions

  • **Safety-first gating**: a checkpoint cannot promote past ASL-N until its safety evals pass; capability evals are secondary.
  • **Adversarial evals are append-only**: red-team findings become fixed evals, so regressions against past findings are always caught.
  • **Human-in-the-loop arbitration** for ambiguous cases, with inter-annotator agreement tracked.
  • **Replayability**: the artefact store keeps prompts, completions, and grader versions for audit.
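The decisions above can be sketched in a few types. This is a minimal illustration, not Anthropic's implementation: `EvalResult`, `Artefact`, `AdversarialEvalSuite`, and `can_promote` are all hypothetical names, and real ASL gating involves far more than a boolean.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class EvalResult:
    name: str
    passed: bool
    is_safety: bool          # safety evals gate promotion; capability evals do not

@dataclass(frozen=True)
class Artefact:
    """Replayable record: everything needed to re-run and audit one grading."""
    prompt: str
    completion: str
    grader_version: str

class AdversarialEvalSuite:
    """Append-only suite: red-team findings become fixed evals, never deleted."""
    def __init__(self):
        self._evals: List[str] = []

    def add_from_red_team(self, finding: str) -> None:
        self._evals.append(finding)   # intentionally no remove() method

    @property
    def evals(self) -> tuple:
        return tuple(self._evals)     # read-only view for consumers

def can_promote(results: List[EvalResult]) -> bool:
    """Safety-first gate: a checkpoint promotes past ASL-N only if every
    safety eval passed; capability results cannot override a safety failure."""
    return all(r.passed for r in results if r.is_safety)
```

Note the design choice: a failing capability eval (`is_safety=False`) does not block promotion, which is exactly the "capability evals are secondary" rule.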

Follow-ups

  • LLM-as-judge calibration? Pin to a frozen judge model; human spot-check 1% of samples.
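One way to make the 1% spot-check itself replayable is to select samples by hashing their IDs rather than by random draw. A sketch under assumptions: the judge identifier and function names are hypothetical, and the stable point is only that the pinned version never changes silently.

```python
import hashlib

# Hypothetical pinned judge identifier; the point is it never silently changes.
JUDGE_MODEL = "judge-v3-frozen-2024-01"

def needs_human_spot_check(sample_id: str, rate: float = 0.01) -> bool:
    """Deterministically route ~`rate` of samples to human review by hashing
    the sample id, so re-runs pick exactly the same samples (auditable)."""
    digest = int(hashlib.sha256(sample_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < int(rate * 10_000)
```

Hash-based routing trades a true random sample for reproducibility: any auditor can recompute which samples were human-reviewed from the IDs alone.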

Related study-guide topics