Anthropic ★★★ · Frequent · Hard · Tags: Safety, Constitutional AI, Classifiers

A32 · Design Anthropic's Safety Pipeline

Verified source

Anthropic publishes its safety policies and the ASL framework as part of its Responsible Scaling Policy (RSP); interview reports confirm this question appears. Credibility: A.

Architecture

flowchart LR
  Req --> IN[Input classifier - jailbreak, abuse]
  IN --> CTX[Context builder]
  CTX --> MODEL
  MODEL --> OUT[Output classifier]
  OUT --> TOOL[Tool-call guard]
  TOOL --> Deliver
  Deliver --> TRACE[(Trace + labels)]
  TRACE --> EVAL[Eval / Red-team]
  EVAL --> TRAIN[Constitutional AI training]
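The flow above can be sketched as a request handler whose every stage records a verdict into the trace. This is a minimal illustrative sketch; the function and stage names are hypothetical, not Anthropic's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Trace:
    """Trace + labels node: each stage appends its verdict for later evals."""
    labels: List[Tuple[str, str]] = field(default_factory=list)

def handle_request(
    req: str,
    input_clf: Callable[[str], str],   # input classifier: jailbreak/abuse
    model: Callable[[str], str],
    output_clf: Callable[[str], str],  # output classifier
) -> Tuple[str, Trace]:
    trace = Trace()
    verdict = input_clf(req)
    trace.labels.append(("input", verdict))
    if verdict == "block":
        return "Refused by input policy.", trace
    ctx = f"[system policy]\n{req}"     # context builder (stub)
    draft = model(ctx)
    verdict = output_clf(draft)
    trace.labels.append(("output", verdict))
    if verdict == "block":
        return "Refused by output policy.", trace
    return draft, trace                 # deliver; trace feeds eval/red-team

# Stub components, for illustration only.
reply, trace = handle_request(
    "How do I bake bread?",
    input_clf=lambda s: "allow",
    model=lambda ctx: "Mix flour, water, yeast...",
    output_clf=lambda s: "allow",
)
```

The key design point is that the trace is produced on every path, including refusals, so the eval/red-team loop sees blocked traffic too.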

Key decisions

  • **Policy taxonomy as schema**: harm categories (self-harm, CBRN, CSAM, etc.) with per-category severity thresholds.
  • **Constitutional classifier ensemble**: a fast shield screens every request first; a deeper classifier re-checks edge cases.
  • **ASL level gating**: higher model capabilities require stricter safeguards; deploy only after passing the RSP checks.
  • **Continuous red-team loop**: red-team findings become evals; failed evals block the next release.
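The first two decisions can be combined in a cascade: the cheap shield resolves clear-cut cases alone, and only scores in an edge band around the per-category threshold escalate to the expensive classifier. A minimal sketch, assuming hypothetical category names, thresholds, and stub scorers:

```python
# Per-category severity thresholds (hypothetical values).
SEVERITY_THRESHOLD = {"self-harm": 0.30, "cbrn": 0.10, "csam": 0.01}

def fast_shield(text: str, category: str) -> float:
    """Cheap first-pass risk score in [0, 1] (stub)."""
    return 0.05 if "recipe" in text else 0.5

def deep_classifier(text: str, category: str) -> float:
    """Expensive model-based risk score in [0, 1] (stub)."""
    return 0.2

def classify(text: str, category: str) -> str:
    thr = SEVERITY_THRESHOLD[category]
    score = fast_shield(text, category)
    # Clear-cut cases: the shield decides alone.
    if score < thr / 2:
        return "allow"
    if score > 2 * thr:
        return "block"
    # Edge band: escalate to the deeper classifier.
    return "allow" if deep_classifier(text, category) < thr else "block"
```

The escalation band widths here are arbitrary; in practice they would be tuned per category against the calibration data mentioned under Follow-ups.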

Follow-ups

  • Agentic tool misuse? The tool-call guard checks the tool target and arguments against policy before execution.
  • How to keep the false-positive rate low? A/B test against user satisfaction and calibrate thresholds per category.
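A tool-call guard of the kind described above can be sketched as an allowlist check on the tool name, the argument schema, and the argument payloads. All tool names and patterns here are illustrative assumptions.

```python
# Hypothetical policy: which tools exist and which args each accepts.
ALLOWED_TOOLS = {
    "web_search": {"query"},
    "calculator": {"expression"},
}
BLOCKED_ARG_PATTERNS = ("rm -rf", "DROP TABLE")

def guard_tool_call(tool: str, args: dict) -> bool:
    if tool not in ALLOWED_TOOLS:
        return False                      # unknown tool target
    if set(args) - ALLOWED_TOOLS[tool]:
        return False                      # unexpected parameters
    for value in args.values():
        if any(p in str(value) for p in BLOCKED_ARG_PATTERNS):
            return False                  # dangerous payload in an argument
    return True
```

Checking both the target and the argument contents matters: an allowed tool can still be misused through its parameters, which is why the guard sits after the output classifier in the flow above.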

Related study-guide topics