A39 · Design an Evals Platform for Alignment Research
Verified source
Similar to OpenAI O30 but specialised for safety research; Anthropic publishes its eval work. Credibility: B.
Key decisions
- **Safety-first gating**: a checkpoint cannot be promoted past ASL-N until its safety evals pass; capability evals are secondary.
- **Adversarial evals are append-only**: red-team findings are converted into permanent regression evals so the model never "forgets" a lesson.
- **Human-in-the-loop arbitration** for ambiguous cases; inter-annotator agreement is tracked.
- **Replayability**: the artefact store retains prompts, completions, and grader versions for audit.
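The first two decisions can be sketched together: an append-only registry of adversarial evals feeding a promotion gate that checks only safety results. All names here (`EvalResult`, `EvalRegistry`, `can_promote`) are illustrative assumptions, not a real platform API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EvalResult:
    eval_id: str
    kind: str          # "safety" or "capability"
    passed: bool

@dataclass
class EvalRegistry:
    """Append-only: red-team findings are added as fixed evals, never removed."""
    _safety_evals: list = field(default_factory=list)

    def add_adversarial_eval(self, eval_id: str) -> None:
        self._safety_evals.append(eval_id)

    @property
    def required_safety_evals(self) -> tuple:
        return tuple(self._safety_evals)   # read-only view

def can_promote(results: list, registry: EvalRegistry) -> bool:
    """Safety-first gate: every registered safety eval must have a passing
    result before promotion past ASL-N; capability results never unblock it."""
    safety = {r.eval_id: r.passed for r in results if r.kind == "safety"}
    return all(safety.get(eid, False) for eid in registry.required_safety_evals)
```

Note the asymmetry: a missing safety result blocks promotion (fail-closed), while a failing capability eval does not, matching the "capability evals are secondary" decision.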
Follow-ups
- LLM-as-judge calibration? Pin to a frozen judge model version; human spot-check 1% of verdicts.
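A minimal sketch of the 1% spot-check, assuming the sample selection should be deterministic (so re-runs audit the same items) and judge-independent. Hashing the sample ID gives both properties; the model name and helper names are hypothetical.

```python
import hashlib

JUDGE_MODEL = "judge-v3-frozen"   # pinned version; never silently upgraded

def needs_human_review(sample_id: str, rate: float = 0.01) -> bool:
    """Deterministically select ~`rate` of samples for human spot-check
    by mapping the sample ID's hash onto [0, 1)."""
    digest = hashlib.sha256(sample_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

def judge_human_agreement(pairs) -> float:
    """Fraction of spot-checked (judge_verdict, human_verdict) pairs that
    agree; a drop signals the frozen judge is drifting from human intent."""
    if not pairs:
        return 1.0
    return sum(1 for j, h in pairs if j == h) / len(pairs)
```

Hash-based selection also means the audit set is reproducible from the artefact store alone, which fits the replayability decision above.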