Anthropic ★★★ · Frequent · Hard · Tags: Safety, Constitutional AI, Classifiers

A32 · Design Anthropic's Safety Pipeline

Verified source

Anthropic publishes its safety policies and the ASL framework as part of its Responsible Scaling Policy (RSP); interview reports confirm this question appears. Credibility: A.

Architecture

flowchart LR
  Req --> IN[Input classifier - jailbreak, abuse]
  IN --> CTX[Context builder]
  CTX --> MODEL
  MODEL --> OUT[Output classifier]
  OUT --> TOOL[Tool-call guard]
  TOOL --> Deliver
  Deliver --> TRACE[(Trace + labels)]
  TRACE --> EVAL[Eval / Red-team]
  EVAL --> TRAIN[Constitutional AI training]
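The flow above can be sketched as a request handler whose every stage records a verdict into the trace. This is a minimal illustrative sketch; the function and stage names are hypothetical, not Anthropic's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Trace:
    """Trace + labels node: each stage appends its verdict for later evals."""
    labels: List[Tuple[str, str]] = field(default_factory=list)

def handle_request(
    req: str,
    input_clf: Callable[[str], str],   # input classifier: jailbreak/abuse
    model: Callable[[str], str],
    output_clf: Callable[[str], str],  # output classifier
) -> Tuple[str, Trace]:
    trace = Trace()
    verdict = input_clf(req)
    trace.labels.append(("input", verdict))
    if verdict == "block":
        return "Refused by input policy.", trace
    ctx = f"[system policy]\n{req}"     # context builder (stub)
    draft = model(ctx)
    verdict = output_clf(draft)
    trace.labels.append(("output", verdict))
    if verdict == "block":
        return "Refused by output policy.", trace
    return draft, trace                 # deliver; trace feeds eval/red-team

# Stub components, for illustration only.
reply, trace = handle_request(
    "How do I bake bread?",
    input_clf=lambda s: "allow",
    model=lambda ctx: "Mix flour, water, yeast...",
    output_clf=lambda s: "allow",
)
```

The key design point is that the trace is produced on every path, including refusals, so the eval/red-team loop sees blocked traffic too.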

Key decisions

  • **Policy taxonomy as schema**: harm categories (self-harm, CBRN, CSAM, etc.) with per-category severity thresholds.
  • **Constitutional classifier ensemble**: a fast shield screens every request first; a deeper classifier re-checks edge cases.
  • **ASL level gating**: higher model capabilities require stricter safeguards; deploy only after passing the RSP checks.
  • **Continuous red-team loop**: red-team findings become evals; failed evals block the next release.
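The first two decisions can be combined in a cascade: the cheap shield resolves clear-cut cases alone, and only scores in an edge band around the per-category threshold escalate to the expensive classifier. A minimal sketch, assuming hypothetical category names, thresholds, and stub scorers:

```python
# Per-category severity thresholds (hypothetical values).
SEVERITY_THRESHOLD = {"self-harm": 0.30, "cbrn": 0.10, "csam": 0.01}

def fast_shield(text: str, category: str) -> float:
    """Cheap first-pass risk score in [0, 1] (stub)."""
    return 0.05 if "recipe" in text else 0.5

def deep_classifier(text: str, category: str) -> float:
    """Expensive model-based risk score in [0, 1] (stub)."""
    return 0.2

def classify(text: str, category: str) -> str:
    thr = SEVERITY_THRESHOLD[category]
    score = fast_shield(text, category)
    # Clear-cut cases: the shield decides alone.
    if score < thr / 2:
        return "allow"
    if score > 2 * thr:
        return "block"
    # Edge band: escalate to the deeper classifier.
    return "allow" if deep_classifier(text, category) < thr else "block"
```

The escalation band widths here are arbitrary; in practice they would be tuned per category against the calibration data mentioned under Follow-ups.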

Follow-ups

  • Agentic tool misuse? The tool-call guard checks the tool target and arguments against policy before execution.
  • How to keep the false-positive rate low? A/B test against user satisfaction and calibrate thresholds per category.
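A tool-call guard of the kind described above can be sketched as an allowlist check on the tool name, the argument schema, and the argument payloads. All tool names and patterns here are illustrative assumptions.

```python
# Hypothetical policy: which tools exist and which args each accepts.
ALLOWED_TOOLS = {
    "web_search": {"query"},
    "calculator": {"expression"},
}
BLOCKED_ARG_PATTERNS = ("rm -rf", "DROP TABLE")

def guard_tool_call(tool: str, args: dict) -> bool:
    if tool not in ALLOWED_TOOLS:
        return False                      # unknown tool target
    if set(args) - ALLOWED_TOOLS[tool]:
        return False                      # unexpected parameters
    for value in args.values():
        if any(p in str(value) for p in BLOCKED_ARG_PATTERNS):
            return False                  # dangerous payload in an argument
    return True
```

Checking both the target and the argument contents matters: an allowed tool can still be misused through its parameters, which is why the guard sits after the output classifier in the flow above.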

Related study-guide topics