A41 · Design a Red-Team Detection System A41 · 设计红队攻击检测系统
Verified source经核实出处
Discussed in Anthropic safety blog posts. Interview reports 2025. Credibility B.
Architecture架构
flowchart LR Prompt --> CLF[Online classifier] CLF --> FLAG[Flag store] FLAG --> CLUS[Offline clustering] CLUS --> ANAL[Attack pattern DB] ANAL --> SHIELD[Feed into shield retraining] ANAL --> OPS[Safety team dashboard]
Key decisions关键决策
- **Two-timescale detection**: online per-request classifier (fast) + offline embedding clustering (finds campaigns).**两时间尺度**:在线单请求分类器(快)+ 离线嵌入聚类(发现攻势)。
- **Cross-account signal**: same attack pattern on 100 accounts -> campaign; 1 account rare -> noise.**跨账号信号**:同模式攻击 100 账号 -> 攻势;单账号 -> 噪音。
- **Minimal data retention**: prompts hashed/embedded for analytics; raw text purged per policy.**最小留存**:prompt 做哈希/嵌入;原文按策略清理。
- **Closed feedback loop**: confirmed attacks become training signal for shield models.**闭环反馈**:确认攻击回流到 shield 模型训练。
Follow-ups追问
- False positives? review queue; auto re-enable after decay.误报?复核队列;冷却后自动解除。
- Coordinated attacks? shared threat intel between orgs with privacy preservation.协同攻击?跨组织共享威胁情报 + 保隐私。