Anthropic ★★ · Frequent · Medium · Evals / Alignment / Red-Team

A39 · Design an Evals Platform for Alignment Research

Verified source

Similar to OpenAI O30 but specialised for safety research. Anthropic publishes its eval work. Credibility: B.

Key decisions

  • **Safety-first gating**: a checkpoint cannot promote past ASL-N until its safety evals pass; capability evals are secondary.
  • **Adversarial evals are append-only**: red-team findings become fixed evals, so regressions against past findings are always caught.
  • **Human-in-the-loop arbitration** for ambiguous cases, with inter-annotator agreement tracked.
  • **Replayability**: the artefact store keeps prompts, completions, and grader versions for audit.
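The decisions above can be sketched in a few types. This is a minimal illustration, not Anthropic's implementation: `EvalResult`, `Artefact`, `AdversarialEvalSuite`, and `can_promote` are all hypothetical names, and real ASL gating involves far more than a boolean.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class EvalResult:
    name: str
    passed: bool
    is_safety: bool          # safety evals gate promotion; capability evals do not

@dataclass(frozen=True)
class Artefact:
    """Replayable record: everything needed to re-run and audit one grading."""
    prompt: str
    completion: str
    grader_version: str

class AdversarialEvalSuite:
    """Append-only suite: red-team findings become fixed evals, never deleted."""
    def __init__(self):
        self._evals: List[str] = []

    def add_from_red_team(self, finding: str) -> None:
        self._evals.append(finding)   # intentionally no remove() method

    @property
    def evals(self) -> tuple:
        return tuple(self._evals)     # read-only view for consumers

def can_promote(results: List[EvalResult]) -> bool:
    """Safety-first gate: a checkpoint promotes past ASL-N only if every
    safety eval passed; capability results cannot override a safety failure."""
    return all(r.passed for r in results if r.is_safety)
```

Note the design choice: a failing capability eval (`is_safety=False`) does not block promotion, which is exactly the "capability evals are secondary" rule.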

Follow-ups

  • LLM-as-judge calibration? Pin to a frozen judge model; human spot-check 1% of samples.
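One way to make the 1% spot-check itself replayable is to select samples by hashing their IDs rather than by random draw. A sketch under assumptions: the judge identifier and function names are hypothetical, and the stable point is only that the pinned version never changes silently.

```python
import hashlib

# Hypothetical pinned judge identifier; the point is it never silently changes.
JUDGE_MODEL = "judge-v3-frozen-2024-01"

def needs_human_spot_check(sample_id: str, rate: float = 0.01) -> bool:
    """Deterministically route ~`rate` of samples to human review by hashing
    the sample id, so re-runs pick exactly the same samples (auditable)."""
    digest = int(hashlib.sha256(sample_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < int(rate * 10_000)
```

Hash-based routing trades a true random sample for reproducibility: any auditor can recompute which samples were human-reviewed from the IDs alone.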

Related study-guide topics