Anthropic · ★★ · Frequent · Hard · Red Team · Detection · Adversarial

A41 · Design a Red-Team Detection System

Verified source

Discussed in Anthropic safety blog posts; also appears in interview reports from 2025. Credibility: B.

Architecture

flowchart LR
  Prompt --> CLF[Online classifier]
  CLF --> FLAG[Flag store]
  FLAG --> CLUS[Offline clustering]
  CLUS --> ANAL[Attack pattern DB]
  ANAL --> SHIELD[Feed into shield retraining]
  ANAL --> OPS[Safety team dashboard]
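
The first two stages of the flowchart (online classifier feeding a flag store) can be sketched roughly as below. This is a minimal illustration, not the actual system: `Flag`, `FlagStore`, `toy_classifier`, `handle_request`, the phrase list, and the `0.3` threshold are all hypothetical names and values, and the keyword scorer merely stands in for a trained classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Flag:
    account_id: str
    prompt: str
    score: float


@dataclass
class FlagStore:
    """In-memory stand-in for the flag store (hypothetical)."""
    flags: List[Flag] = field(default_factory=list)

    def add(self, flag: Flag) -> None:
        self.flags.append(flag)


def toy_classifier(prompt: str) -> float:
    """Placeholder scorer: fraction of known suspicious phrases found in the prompt."""
    suspicious = (
        "ignore previous instructions",
        "jailbreak",
        "pretend you have no rules",
    )
    text = prompt.lower()
    hits = sum(phrase in text for phrase in suspicious)
    return hits / len(suspicious)


def handle_request(
    account_id: str,
    prompt: str,
    store: FlagStore,
    classify: Callable[[str], float] = toy_classifier,
    threshold: float = 0.3,  # assumed cutoff, chosen only for illustration
) -> bool:
    """Score one request online; record a flag if the score reaches the threshold."""
    score = classify(prompt)
    flagged = score >= threshold
    if flagged:
        store.add(Flag(account_id, prompt, score))
    return flagged
```

Passing the classifier as a parameter keeps the online path swappable, so a real model could replace the keyword scorer without touching the flag store.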

Key decisions

**Two-timescale detection**: an online per-request classifier (fast) plus offline embedding clustering (finds campaigns).
**Cross-account signal**: the same attack pattern on 100 accounts indicates a campaign; a rare pattern on a single account is likely noise.
**Minimal data retention**: prompts are hashed or embedded for analytics; raw text is purged per policy.
**Closed feedback loop**: confirmed attacks become training signal for the shield models.
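
The cross-account and minimal-retention decisions can be illustrated together with a small sketch. Note the substitution: instead of embedding clustering, this groups flags by an exact SHA-256 fingerprint of the normalized prompt, which only catches near-verbatim reuse; a real offline job would cluster embeddings to catch paraphrases. The names `fingerprint` and `find_campaigns` and the default `min_accounts=100` are assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def fingerprint(prompt: str) -> str:
    """Hash a case- and whitespace-normalized prompt so raw text need not be stored."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_campaigns(
    flags: Iterable[Tuple[str, str]],  # (account_id, fingerprint) pairs
    min_accounts: int = 100,           # assumed campaign threshold
) -> Dict[str, int]:
    """Return {fingerprint: distinct account count} for patterns seen on many accounts."""
    accounts_by_pattern: Dict[str, Set[str]] = defaultdict(set)
    for account_id, fp in flags:
        accounts_by_pattern[fp].add(account_id)
    return {
        fp: len(accounts)
        for fp, accounts in accounts_by_pattern.items()
        if len(accounts) >= min_accounts
    }
```

Counting distinct accounts rather than raw hits means one account replaying the same prompt many times does not look like a campaign.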

Follow-ups

False positives? Send flags to a review queue; lift restrictions automatically after a cool-down (decay) period.
Coordinated attacks? Share threat intelligence between organizations, with privacy preservation.
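
One way to read the decay-based re-enable in the false-positive answer is as a risk score that shrinks over time until it drops below a restriction threshold. The source only says "decay"; the exponential half-life form, the function names, the one-day half-life, and the `0.5` threshold below are all assumptions made for this sketch.

```python
def decayed_score(
    initial_score: float,
    elapsed_seconds: float,
    half_life_seconds: float = 86_400.0,  # assumed: score halves every 24 hours
) -> float:
    """Exponentially decay a risk score by its age."""
    return initial_score * 0.5 ** (elapsed_seconds / half_life_seconds)


def is_restricted(
    initial_score: float,
    elapsed_seconds: float,
    threshold: float = 0.5,               # assumed restriction cutoff
    half_life_seconds: float = 86_400.0,
) -> bool:
    """An account stays restricted only while its decayed score is at or above the threshold."""
    return decayed_score(initial_score, elapsed_seconds, half_life_seconds) >= threshold
```

With these assumed values, an account flagged at score 0.9 is restricted immediately and is re-enabled within a day unless a reviewer confirms the flag or new flags arrive.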
