Source cross-reference

Primary sources: Gulli, Agentic Design Patterns Ch.18 "Guardrails / Safety Patterns" and Ch.19 "Evaluation and Monitoring"; Bai et al. 2022 "Constitutional AI" (arXiv:2212.08073); Chip Huyen Ch.11 "The Human Side of ML". This page is the single highest-leverage topic for Anthropic interviews — every round touches it.

Why safety is an architecture, not a filter

Beginners picture safety as "a classifier that blocks bad content". This fails immediately under real traffic. A jailbreak like "ignore previous instructions and..." will bypass a single classifier; a prompt-injection via a retrieved webpage will bypass it from a completely different direction; a data-exfil attack will slip through by phrasing the request benignly. The only defensible posture is defence in depth: many cheap layers, each targeting a different failure mode, with monitoring that detects when any layer fires.

Anthropic treats this as a first-class architecture problem. Expect interviewers to push past the "add a classifier" answer and ask: what exact layers, in what order, with what failure behaviour, at what latency cost, with what observability, and with what human backstop when the automated layers are wrong. OpenAI's Trust & Safety team asks the same questions with different emphasis.

Harm taxonomies every candidate should know

You cannot design safety without a categorisation of what you are preventing. The canonical set covers seven harm families:

  1. CSAM (child sexual abuse material): legal obligation, zero-tolerance, specialised hash-matching (PhotoDNA) and specialised classifiers. Never handled by general-purpose filters alone.
  2. PII leakage: credit cards, SSNs, health records, addresses. Both input (users pasting PII) and output (model memorising training data) directions.
  3. CBRN (chemical, biological, radiological, nuclear): uplift for mass-casualty weapons. Covered explicitly by Anthropic's RSP (Responsible Scaling Policy) and OpenAI's Preparedness framework.
  4. Self-harm & suicide: dedicated escalation paths to safety messaging plus human review.
  5. Violent extremism / targeted harassment: content, but also users organising to attack individuals.
  6. Jailbreaks: adversarial prompts aimed at bypassing the model's policy (DAN, "roleplay as an evil AI", many-shot jailbreaking).
  7. Prompt injection (a 2023-era addition): hostile instructions smuggled through tool outputs, retrieved documents, or images. Distinct from jailbreak because the attacker is not the user.

A good answer lists all seven in the first 90 seconds, then deep-dives wherever the interviewer pushes.
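In code, the taxonomy usually becomes a routing table: each family carries a fail-mode and a review policy. A minimal sketch — the enum names and the per-family settings are illustrative assumptions, not a real production taxonomy (the text only specifies fail-closed for CSAM and CBRN; the rest is inferred):

```python
from enum import Enum

class Harm(Enum):
    """The seven harm families; names are illustrative, not a real taxonomy."""
    CSAM = "csam"
    PII = "pii"
    CBRN = "cbrn"
    SELF_HARM = "self_harm"
    EXTREMISM_HARASSMENT = "extremism_harassment"
    JAILBREAK = "jailbreak"
    PROMPT_INJECTION = "prompt_injection"

# fail_closed=True: block when the classifier is uncertain; False: allow + log.
# Only the CSAM/CBRN rows follow the text directly; the others are assumed.
POLICY = {
    Harm.CSAM:                 {"fail_closed": True,  "human_review": "always"},
    Harm.CBRN:                 {"fail_closed": True,  "human_review": "always"},
    Harm.SELF_HARM:            {"fail_closed": True,  "human_review": "escalate"},
    Harm.PII:                  {"fail_closed": False, "human_review": "sampled"},
    Harm.EXTREMISM_HARASSMENT: {"fail_closed": False, "human_review": "sampled"},
    Harm.JAILBREAK:            {"fail_closed": False, "human_review": "sampled"},
    Harm.PROMPT_INJECTION:     {"fail_closed": False, "human_review": "sampled"},
}
```

Having the table as data rather than scattered `if` statements is what makes the per-category block-rate analytics in the later sections possible.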

Layered defences blueprint

flowchart LR
  u[User prompt] --> in1["L1 · Input filter<br/>regex · PII redact · CSAM hash"]
  in1 --> in2["L2 · Intent classifier<br/>harm-family routing"]
  in2 --> pol["L3 · Policy router<br/>small · medium · human"]
  pol -->|safe| llm[LLM generation]
  pol -->|risky| cai["L4 · Constitutional check<br/>self-critique prompt"]
  cai --> llm
  llm --> out1["L5 · Output classifier<br/>harm categories"]
  out1 --> out2[L6 · PII / code / URL scrub]
  out2 --> log["L7 · Audit log<br/>sampled -> human review"]
  log --> u2[User response]
  in1 -.block.-> refusal["Structured refusal<br/>with safer-alternative"]
  out1 -.block.-> refusal

Budget-conscious design rules:

  • Cheap layers first: regex / hash / small classifier run in 1-3 ms per check; heavy LLM self-critique only fires on the ~5% flagged subset.
  • Fail closed for high-harm categories (CSAM, CBRN), fail open with logging for borderline cases — over-refusal is itself a safety metric (hurts helpfulness, drives users to less-safe alternatives).
  • Always return a structured refusal, never a silent block. Refusal carries a category code for analytics and a safer-alternative suggestion for the user.
  • Sample every interaction into an audit stream (1-5%) for human review. Without this you cannot close the loop.
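The four rules above can be sketched as a single guard function: cheap checks on every request, the heavy check only on the flagged subset, a structured refusal (never a silent block), and a sampled audit stream. Everything here — the regex patterns, the `heavy_critique` stub, the category codes — is invented for illustration:

```python
import random
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical category codes; a real taxonomy has many more patterns.
PII_PATTERNS = {
    "pii.ssn":  re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "pii.card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

@dataclass
class Refusal:
    category: str            # machine-readable code for analytics
    safer_alternative: str   # shown to the user; never a silent block

def cheap_input_filter(prompt: str) -> Optional[Refusal]:
    """L1: regex / PII checks, ~1-3 ms. Runs on every request."""
    for code, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            return Refusal(code, "Please remove personal identifiers and retry.")
    return None

def heavy_critique(prompt: str) -> Optional[Refusal]:
    """L4 stub: expensive LLM self-critique; fires only on the flagged subset."""
    return None  # stand-in for a constitutional-check model call

def guard(prompt: str, flagged: bool) -> Optional[Refusal]:
    refusal = cheap_input_filter(prompt)
    if refusal:
        return refusal
    if flagged:              # the ~5% the intent classifier routed as risky
        return heavy_critique(prompt)
    return None

def audit_sample(rate: float = 0.02) -> bool:
    """L7: sample 1-5% of interactions into the human-review stream."""
    return random.random() < rate
```

The ordering is the point: the guard never pays for `heavy_critique` unless an earlier, cheaper layer has already flagged the request.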

Concrete numbers

Typical production targets: input+output filter p99 < 50 ms combined; overall false-positive (over-refusal) rate < 2% on a benign prompt eval; false-negative rate < 0.1% on a red-team eval set of 10k adversarial prompts. Human review queue SLA: high-severity < 15 min, medium < 4 h.
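The two error rates are trivial to compute but easy to conflate in an interview; a minimal sketch with invented verdict lists sized to the targets above:

```python
def over_refusal_rate(benign_blocked: list[bool]) -> float:
    """False-positive rate: share of benign prompts the stack refused."""
    return sum(benign_blocked) / len(benign_blocked)

def miss_rate(adversarial_blocked: list[bool]) -> float:
    """False-negative rate: share of adversarial prompts that got through."""
    misses = sum(1 for blocked in adversarial_blocked if not blocked)
    return misses / len(adversarial_blocked)

# Invented verdicts sized against the targets (< 2% FPR, < 0.1% FNR):
benign = [False] * 990 + [True] * 10         # 10 over-refusals per 1 000
adversarial = [True] * 9_995 + [False] * 5   # 5 misses per 10 000

assert over_refusal_rate(benign) == 0.01     # 1.0%  -> inside the 2% budget
assert miss_rate(adversarial) == 0.0005      # 0.05% -> inside the 0.1% budget
```

Note the asymmetry: the FNR budget is twenty times tighter than the FPR budget, measured on a set ten times larger.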

Red-team platform & eval infra

A red-team platform is an internal product, not a one-off exercise. Four subsystems:

  1. Attack library: versioned corpus of known adversarial prompts, tagged by harm category, attack technique (DAN, prompt injection, many-shot, encoded/cipher), severity, discovery date. 10k-100k at mature orgs.
  2. Automated red-teaming: an LLM attacker that generates adversarial variants, scored by a judge model or a classifier. Produces fresh attacks at scale; essential because human red-teamers saturate.
  3. Evaluation harness: run a candidate model against the attack library nightly, produce per-category refusal rates and regression alerts. Integrated into the model registry so that no promotion to production can happen without passing the harness.
  4. Human red team: paid experts, domain specialists (bio, cyber, legal), external contractors, bug-bounty participants. Automated attackers cannot replace the creativity of a good human adversary.
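Subsystem 3 — the evaluation harness — reduces to a per-category refusal-rate computation plus a regression check against the production baseline. A sketch under assumed shapes (attack records as dicts, the model-under-test wrapped as a `prompt -> refused?` callable; none of this is a real API):

```python
from collections import defaultdict

def harness(attacks: list[dict], refuses) -> dict[str, float]:
    """Per-category refusal rate for one candidate model.

    `attacks`: records like {"prompt": ..., "category": ...} from the
    attack library. `refuses(prompt)` returns True if the model refused.
    """
    hit, total = defaultdict(int), defaultdict(int)
    for a in attacks:
        total[a["category"]] += 1
        hit[a["category"]] += refuses(a["prompt"])
    return {c: hit[c] / total[c] for c in total}

def regressions(candidate: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.01) -> list[str]:
    """Categories where the candidate's refusal rate dropped vs baseline.
    A non-empty result should block promotion in the model registry."""
    return sorted(c for c, rate in candidate.items()
                  if rate < baseline.get(c, 1.0) - tolerance)
```

The registry-gating rule in item 3 is then one line: promote only if `regressions(candidate_rates, baseline_rates)` is empty.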

Latency note: the attack-evaluation loop should run in minutes, not days. A safety regression caught at canary is an incident; a safety regression caught in human review after GA is a crisis.

RLHF vs DPO vs CAI

Three fine-tuning techniques for aligning a base model to a policy; be able to whiteboard all three.

  • RLHF (Christiano 2017, Ouyang 2022): (1) collect human preference pairs, (2) train a reward model, (3) optimise the policy with PPO against the reward model, under a KL penalty that keeps it close to the reference model. Expensive and unstable, but strong results. OpenAI's dominant technique through 2023.
  • DPO (Rafailov 2023): skip the reward model and RL loop by optimising a closed-form loss directly on preference pairs. Simpler, more stable, comparable quality on many benchmarks. Now the default in many open-source efforts.
  • CAI (Bai 2022, Anthropic): replace most of the human preference labelling with a constitution — a written set of principles the model uses to self-critique and self-revise its outputs. "Is this response harmful under principle 3? Rewrite if so." Dramatically reduces the human labelling required; makes the alignment target inspectable and contestable. Anthropic's signature contribution.
flowchart LR
  base[Base LLM] --> rlhf["RLHF<br/>pref pairs + RM + PPO"]
  base --> dpo["DPO<br/>closed-form on pref pairs"]
  base --> cai["CAI<br/>self-critique by written constitution<br/>+ RLAIF final pass"]
  rlhf --> aligned[Aligned model]
  dpo --> aligned
  cai --> aligned
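DPO's "closed-form loss directly on preference pairs" is compact enough to whiteboard. A per-pair sketch (real implementations batch this over token-level log-probabilities in a tensor framework; this scalar version just shows the shape of the loss):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss (Rafailov et al. 2023): -log sigmoid(beta * margin).

    The margin is the policy's log-ratio advantage on the chosen answer over
    the frozen reference model. No reward model, no RL loop.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log 2; it decays toward zero as the policy separates chosen from rejected faster than the reference does, with `beta` playing the role of the KL-strength knob that PPO handles explicitly.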

In interviews, the expected move is: "I would use CAI for the broad harm categories where the principles are inspectable, keep a thin human-preference layer for subjective quality, and use DPO rather than PPO for stability and cost."

Human review & incident response

Automated filters are necessary but insufficient; a production safety system always has humans in the loop:

  • Reviewer tooling: triage UI with conversation context, suggested label, keyboard-friendly label actions, wellbeing protections (rotation, hour caps, mandatory counselling for exposure to distressing content).
  • Labels feed back into: classifier training data, prompt-template fixes, policy documentation, attack-library additions.
  • Incident response: pager rotation, severity taxonomy (SEV1 = live harm, SEV2 = pattern, SEV3 = near-miss), kill-switches at the policy router and the model-serving layer, public transparency for SEV1 (post-mortem within a week).
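The severity taxonomy and the two kill-switches wire together in a few lines. The SLA values follow the queue targets under "Concrete numbers"; the SEV3 value and the function names are assumptions for illustration:

```python
from dataclasses import dataclass

# High < 15 min and medium < 4 h come from the text; SEV3 is assumed.
SLA_MINUTES = {"SEV1": 15, "SEV2": 240, "SEV3": 24 * 60}

@dataclass
class KillSwitches:
    """Both switches must exist before launch; they cannot be retrofitted."""
    policy_router_to_human: bool = False   # route all risky traffic to review
    model_serving_disabled: bool = False   # hard stop at the serving layer

def handle_incident(severity: str, switches: KillSwitches) -> int:
    """Page the rotation and, for live harm, flip both kill-switches first.
    Returns the response SLA in minutes."""
    if severity == "SEV1":                 # live harm: stop the bleeding
        switches.policy_router_to_human = True
        switches.model_serving_disabled = True
    return SLA_MINUTES[severity]
```

The point of modelling the switches as pre-existing fields rather than ad-hoc deploys is exactly the "safety later" anti-pattern below: a rollback path that is not designed in cannot be invoked during an incident.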

Anti-pattern: "we will add safety later"

Retrofitting safety onto a shipped product costs 5-10x more than building it in, and always leaves permanent gaps. The classifier training data depends on early interaction logs; the human-review workflow depends on early product decisions; rollback requires a kill-switch that has to be designed in. "Safety later" is how high-profile failures happen.

Anti-pattern: over-block as a KPI

If your only metric is "blocks per 1k requests", you will over-refuse legitimate queries. Anthropic's own guidance treats unnecessary refusal as a safety failure — it erodes trust and pushes users to jailbreaks. Always pair block-rate with a benign-prompt false-positive rate.

Anthropic-specific interview expectations

What Anthropic specifically probes

  • Constitutional thinking: given a policy question, can you express the trade-off as competing principles rather than a single rule? ("helpfulness vs honesty vs harm avoidance").
  • RSP ladder: show that you know about ASL-2 / ASL-3 capability thresholds and that deployment decisions are gated by measured capability, not vibes.
  • Prompt injection: almost every agent-design interview at Anthropic includes "how do you defend against a malicious tool response?" Show the confused-deputy framing and concrete mitigations (per-tool allowlists, output schemas, cross-tool info-flow policy).
  • Transparency: you log enough to write an honest post-mortem, and you proactively publish safety results.
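The per-tool allowlist and output-schema mitigations from the prompt-injection bullet amount to a gate between the tool and the model: tool output is data, never instructions. A sketch in which the tool names, schemas, and marker pattern are all invented for illustration:

```python
import re

# Hypothetical per-tool contracts: which fields a tool may return, and a
# shape check per field. Anything outside the contract is dropped, so a
# hostile webpage cannot smuggle instructions through an unexpected field.
TOOL_SCHEMAS = {
    "weather": {"city":   re.compile(r"[A-Za-z .'-]{1,64}"),
                "temp_c": re.compile(r"-?\d{1,3}")},
}

# Crude marker scan; real systems pair this with a classifier.
INJECTION_MARKERS = re.compile(r"ignore (all|previous) instructions", re.I)

def sanitise_tool_response(tool: str, payload: dict[str, str]) -> dict[str, str]:
    """Confused-deputy defence: enforce the allowlist, then the schema."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        raise PermissionError(f"tool {tool!r} is not on the allowlist")
    clean = {}
    for key, pattern in schema.items():
        value = payload.get(key, "")
        if pattern.fullmatch(value) and not INJECTION_MARKERS.search(value):
            clean[key] = value
    return clean
```

Note what the confused-deputy framing buys you: the gate does not try to decide whether the content is "malicious", it simply refuses to pass anything the tool's contract did not promise.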

OpenAI framing differences

OpenAI gives safety equal weight but biases toward product-integrated answers: the Moderation API, spec-driven policy updates, scalable human review, fine-tuning partnerships with large customers. Expect more emphasis on enterprise boundary cases (data residency, customer-specific policies) and less on the alignment technique itself.

The winning interview narrative for either company: name the seven harms, draw the layered blueprint, place a red-team platform as an explicit subsystem, explain CAI/RLHF/DPO trade-offs, and close with the human review + incident response loop. Show that safety is an engineering discipline with SLOs, regressions, and post-mortems — not a vibe.
