A32 · Design Anthropic's Safety Pipeline
Verified source
Anthropic publishes its safety policies and the AI Safety Level (ASL) framework as part of its Responsible Scaling Policy (RSP). Interview reports confirm. Credibility: A.
Architecture
flowchart LR
  Req --> IN[Input classifier - jailbreak, abuse]
  IN --> CTX[Context builder]
  CTX --> MODEL
  MODEL --> OUT[Output classifier]
  OUT --> TOOL[Tool-call guard]
  TOOL --> Deliver
  Deliver --> TRACE[(Trace + labels)]
  TRACE --> EVAL[Eval / Red-team]
  EVAL --> TRAIN[Constitutional AI training]
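The request path of the flowchart can be sketched as a chain of checks around a model call. This is a minimal illustration, not Anthropic's implementation: `Trace`, `run_pipeline`, the keyword-based classifiers, and the context placeholder are all hypothetical names invented here, and the tool-call guard and training loop are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Trace:
    """Record of one request, kept for eval and red-team review."""
    request: str
    labels: List[str] = field(default_factory=list)
    response: Optional[str] = None  # None means the request was blocked


# A check returns a label when it wants to block, otherwise None.
Check = Callable[[str], Optional[str]]


def input_classifier(text: str) -> Optional[str]:
    # Toy stand-in (assumption): a real system would use a trained classifier.
    return "jailbreak" if "ignore previous instructions" in text.lower() else None


def output_classifier(text: str) -> Optional[str]:
    # Toy stand-in (assumption) for an output policy classifier.
    return "policy_violation" if "FORBIDDEN" in text else None


def run_pipeline(
    request: str,
    model: Callable[[str], str],
    input_checks: List[Check],
    output_checks: List[Check],
    trace_log: List[Trace],
) -> Trace:
    trace = Trace(request=request)
    try:
        for check in input_checks:
            label = check(request)
            if label:
                trace.labels.append(f"input:{label}")
                return trace  # blocked before the model runs
        context = f"[system policy]\n{request}"  # context builder placeholder
        draft = model(context)
        for check in output_checks:
            label = check(draft)
            if label:
                trace.labels.append(f"output:{label}")
                return trace  # blocked before delivery
        trace.response = draft
        return trace
    finally:
        # Every request is traced, whether it was delivered or blocked.
        trace_log.append(trace)
```

Logging in the `finally` clause means blocked requests reach the trace store too, which is what lets the eval and red-team stage see them.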
Key decisions
- **Policy taxonomy as schema**: define harm categories (self-harm, CBRN, CSAM, etc.) in a schema, each with its own severity thresholds.
- **Constitutional classifier ensemble**: a fast, lightweight shield screens every request first; a deeper classifier reviews the edge cases.
- **ASL level gating**: higher capability levels require stricter safeguards; a model deploys only if it passes the RSP checks.
- **Continuous red-team loop**: red-team findings become evals, and a failed eval blocks the next release.
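The taxonomy, the two-stage classifier, and the release gate above can be sketched together. This is an illustration under stated assumptions: the threshold values, `CategoryPolicy`, `classify`, and `release_allowed` are hypothetical, and the scorers are plain callables standing in for real classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class CategoryPolicy:
    block_at: float     # a score at or above this is blocked
    escalate_at: float  # a fast score at or above this goes to the deep classifier


# Hypothetical thresholds (assumption): stricter categories block at lower scores.
TAXONOMY: Dict[str, CategoryPolicy] = {
    "self_harm": CategoryPolicy(block_at=0.90, escalate_at=0.40),
    "cbrn": CategoryPolicy(block_at=0.70, escalate_at=0.20),
    "csam": CategoryPolicy(block_at=0.50, escalate_at=0.10),
}

# A scorer maps text to a per-category score in [0, 1].
Scorer = Callable[[str], Dict[str, float]]


def classify(text: str, fast: Scorer, deep: Scorer) -> str:
    """Fast shield first; the deep classifier runs only on edge cases."""
    fast_scores = fast(text)
    deep_scores: Optional[Dict[str, float]] = None  # computed lazily, at most once
    for category, policy in TAXONOMY.items():
        score = fast_scores.get(category, 0.0)
        if score >= policy.block_at:
            return f"block:{category}"
        if score >= policy.escalate_at:
            if deep_scores is None:
                deep_scores = deep(text)
            if deep_scores.get(category, 0.0) >= policy.block_at:
                return f"block:{category}"
    return "allow"


def release_allowed(eval_results: Dict[str, bool]) -> bool:
    """Any failed eval blocks the release; an empty suite does not count as passing."""
    return bool(eval_results) and all(eval_results.values())
```

Keeping thresholds in data rather than in code means a policy change is a schema edit, and the same table can drive both stages of the cascade.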
Follow-ups
- Agentic tool misuse? The tool-call guard checks each call's target and arguments against policy.
- Low false-positive rate? A/B test against user satisfaction and calibrate thresholds per category.
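Both follow-ups can be made concrete with a short sketch. Everything here is an assumption for illustration: the tool names, the host allowlist, the `/workspace/` root, and the helper names are hypothetical, and the calibration uses a simple quantile on benign scores rather than any documented method. The A/B comparison against user satisfaction is not modeled.

```python
import posixpath
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: Dict[str, Any]


ALLOWED_HOSTS = {"api.example.com"}  # hypothetical allowlist
WORKSPACE = "/workspace/"            # hypothetical sandbox root


def _http_get_ok(args: Dict[str, Any]) -> bool:
    host = urlparse(str(args.get("url", ""))).hostname
    return host in ALLOWED_HOSTS


def _read_file_ok(args: Dict[str, Any]) -> bool:
    # Normalize first so "/workspace/../etc/passwd" cannot slip through.
    path = posixpath.normpath(str(args.get("path", "")))
    return path.startswith(WORKSPACE)


# Per-tool argument checks; tools not listed here are denied by default.
TOOL_POLICY: Dict[str, Callable[[Dict[str, Any]], bool]] = {
    "http_get": _http_get_ok,
    "read_file": _read_file_ok,
}


def guard_tool_call(call: ToolCall) -> Tuple[bool, str]:
    check = TOOL_POLICY.get(call.tool)
    if check is None:
        return False, "unknown tool"
    if not check(call.args):
        return False, "arguments violate policy"
    return True, "ok"


def calibrate_threshold(benign_scores: List[float], max_fpr: float) -> float:
    """Return a threshold t so that blocking on score > t flags at most
    max_fpr of the benign samples. Run once per category."""
    if not benign_scores:
        raise ValueError("need at least one benign score")
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # benign samples we may wrongly flag
    return ranked[allowed]
```

The guard denies by default, so adding a tool requires adding a policy entry. The calibration blocks on scores strictly above the returned threshold, which keeps the false-positive bound valid even when benign scores tie.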