Source cross-reference

Primary sources: Gulli, Agentic Design Patterns Ch.18 "Guardrails / Safety Patterns" and Ch.19 "Evaluation and Monitoring"; Bai et al. 2022 "Constitutional AI" (arXiv:2212.08073); Chip Huyen Ch.11 "The Human Side of ML". This page is the single highest-leverage topic for Anthropic interviews — every round touches it.

Why safety is an architecture, not a filter

Beginners picture safety as "a classifier that blocks bad content". This fails immediately under real traffic. A jailbreak like "ignore previous instructions and..." will bypass a single classifier; a prompt-injection via a retrieved webpage will bypass it from a completely different direction; a data-exfil attack will slip through by phrasing the request benignly. The only defensible posture is defence in depth: many cheap layers, each targeting a different failure mode, with monitoring that detects when any layer fires.

Anthropic treats this as a first-class architecture problem. Expect interviewers to push past the "add a classifier" answer and ask: what exact layers, in what order, with what failure behaviour, at what latency cost, with what observability, and with what human backstop when the automated layers are wrong. OpenAI's Trust & Safety team asks the same questions with different emphasis.

Harm taxonomies every candidate should know

You cannot design safety without a categorisation of what you are preventing. The canonical set covers seven harm families:

  1. CSAM (child sexual abuse material): legal obligation, zero-tolerance, specialised hash-matching (PhotoDNA) and specialised classifiers. Never handled by general-purpose filters alone.
  2. PII leakage: credit cards, SSNs, health records, addresses. Both input (users pasting PII) and output (model memorising training data) directions.
  3. CBRN (chemical, biological, radiological, nuclear): uplift for mass-casualty weapons. Covered explicitly by Anthropic's RSP (Responsible Scaling Policy) and OpenAI's Preparedness framework.
  4. Self-harm & suicide: dedicated escalation paths to safety messaging plus human review.
  5. Violent extremism / targeted harassment: content, but also users organising to attack individuals.
  6. Jailbreaks: adversarial prompts aimed at bypassing the model's policy (DAN, "roleplay as an evil AI", many-shot jailbreaking).
  7. Prompt injection (a 2023-era addition): hostile instructions smuggled through tool outputs, retrieved documents, or images. Distinct from jailbreak because the attacker is not the user.

A good answer lists all seven in the first 90 seconds, then deep-dives wherever the interviewer pushes.
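In code, the taxonomy usually becomes a routing table: each family carries a fail-mode and a review policy. A minimal sketch — the enum names and the per-family settings are illustrative assumptions, not a real production taxonomy (the text only specifies fail-closed for CSAM and CBRN; the rest is inferred):

```python
from enum import Enum

class Harm(Enum):
    """The seven harm families; names are illustrative, not a real taxonomy."""
    CSAM = "csam"
    PII = "pii"
    CBRN = "cbrn"
    SELF_HARM = "self_harm"
    EXTREMISM_HARASSMENT = "extremism_harassment"
    JAILBREAK = "jailbreak"
    PROMPT_INJECTION = "prompt_injection"

# fail_closed=True: block when the classifier is uncertain; False: allow + log.
# Only the CSAM/CBRN rows follow the text directly; the others are assumed.
POLICY = {
    Harm.CSAM:                 {"fail_closed": True,  "human_review": "always"},
    Harm.CBRN:                 {"fail_closed": True,  "human_review": "always"},
    Harm.SELF_HARM:            {"fail_closed": True,  "human_review": "escalate"},
    Harm.PII:                  {"fail_closed": False, "human_review": "sampled"},
    Harm.EXTREMISM_HARASSMENT: {"fail_closed": False, "human_review": "sampled"},
    Harm.JAILBREAK:            {"fail_closed": False, "human_review": "sampled"},
    Harm.PROMPT_INJECTION:     {"fail_closed": False, "human_review": "sampled"},
}
```

Having the table as data rather than scattered `if` statements is what makes the per-category block-rate analytics in the later sections possible.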

Layered defences blueprint

flowchart LR
  u[User prompt] --> in1["L1 · Input filter<br/>regex · PII redact · CSAM hash"]
  in1 --> in2["L2 · Intent classifier<br/>harm-family routing"]
  in2 --> pol["L3 · Policy router<br/>small · medium · human"]
  pol -->|safe| llm[LLM generation]
  pol -->|risky| cai["L4 · Constitutional check<br/>self-critique prompt"]
  cai --> llm
  llm --> out1["L5 · Output classifier<br/>harm categories"]
  out1 --> out2[L6 · PII / code / URL scrub]
  out2 --> log["L7 · Audit log<br/>sampled -> human review"]
  log --> u2[User response]
  in1 -.block.-> refusal["Structured refusal<br/>with safer-alternative"]
  out1 -.block.-> refusal

Budget-conscious design rules:

  • Cheap layers first: regex / hash / small classifier run in 1-3 ms per check; heavy LLM self-critique only fires on the ~5% flagged subset.
  • Fail closed for high-harm categories (CSAM, CBRN), fail open with logging for borderline cases — over-refusal is itself a safety metric (hurts helpfulness, drives users to less-safe alternatives).
  • Always return a structured refusal, never a silent block. Refusal carries a category code for analytics and a safer-alternative suggestion for the user.
  • Sample every interaction into an audit stream (1-5%) for human review. Without this you cannot close the loop.
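The four rules above can be sketched as a single guard function: cheap checks on every request, the heavy check only on the flagged subset, a structured refusal (never a silent block), and a sampled audit stream. Everything here — the regex patterns, the `heavy_critique` stub, the category codes — is invented for illustration:

```python
import random
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical category codes; a real taxonomy has many more patterns.
PII_PATTERNS = {
    "pii.ssn":  re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "pii.card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

@dataclass
class Refusal:
    category: str            # machine-readable code for analytics
    safer_alternative: str   # shown to the user; never a silent block

def cheap_input_filter(prompt: str) -> Optional[Refusal]:
    """L1: regex / PII checks, ~1-3 ms. Runs on every request."""
    for code, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            return Refusal(code, "Please remove personal identifiers and retry.")
    return None

def heavy_critique(prompt: str) -> Optional[Refusal]:
    """L4 stub: expensive LLM self-critique; fires only on the flagged subset."""
    return None  # stand-in for a constitutional-check model call

def guard(prompt: str, flagged: bool) -> Optional[Refusal]:
    refusal = cheap_input_filter(prompt)
    if refusal:
        return refusal
    if flagged:              # the ~5% the intent classifier routed as risky
        return heavy_critique(prompt)
    return None

def audit_sample(rate: float = 0.02) -> bool:
    """L7: sample 1-5% of interactions into the human-review stream."""
    return random.random() < rate
```

The ordering is the point: the guard never pays for `heavy_critique` unless an earlier, cheaper layer has already flagged the request.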

Concrete numbers

Typical production targets: input+output filter p99 < 50 ms combined; overall false-positive (over-refusal) rate < 2% on a benign prompt eval; false-negative rate < 0.1% on a red-team eval set of 10k adversarial prompts. Human review queue SLA: high-severity < 15 min, medium < 4 h.
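The two error rates are trivial to compute but easy to conflate in an interview; a minimal sketch with invented verdict lists sized to the targets above:

```python
def over_refusal_rate(benign_blocked: list[bool]) -> float:
    """False-positive rate: share of benign prompts the stack refused."""
    return sum(benign_blocked) / len(benign_blocked)

def miss_rate(adversarial_blocked: list[bool]) -> float:
    """False-negative rate: share of adversarial prompts that got through."""
    misses = sum(1 for blocked in adversarial_blocked if not blocked)
    return misses / len(adversarial_blocked)

# Invented verdicts sized against the targets (< 2% FPR, < 0.1% FNR):
benign = [False] * 990 + [True] * 10         # 10 over-refusals per 1 000
adversarial = [True] * 9_995 + [False] * 5   # 5 misses per 10 000

assert over_refusal_rate(benign) == 0.01     # 1.0%  -> inside the 2% budget
assert miss_rate(adversarial) == 0.0005      # 0.05% -> inside the 0.1% budget
```

Note the asymmetry: the FNR budget is twenty times tighter than the FPR budget, measured on a set ten times larger.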

Red-team platform & eval infra

A red-team platform is an internal product, not a one-off exercise. Four subsystems:

  1. Attack library: versioned corpus of known adversarial prompts, tagged by harm category, attack technique (DAN, prompt injection, many-shot, encoded/cipher), severity, discovery date. 10k-100k at mature orgs.
  2. Automated red-teaming: an LLM attacker that generates adversarial variants, scored by a judge model or a classifier. Produces fresh attacks at scale; essential because human red-teamers saturate.
  3. Evaluation harness: run a candidate model against the attack library nightly, produce per-category refusal rates and regression alerts. Integrated into the model registry so that no promotion to production can happen without passing the harness.
  4. Human red team: paid experts, domain specialists (bio, cyber, legal), external contractors, bug-bounty participants. Automated attackers cannot replace the creativity of a good human adversary.
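Subsystem 3 — the evaluation harness — reduces to a per-category refusal-rate computation plus a regression check against the production baseline. A sketch under assumed shapes (attack records as dicts, the model-under-test wrapped as a `prompt -> refused?` callable; none of this is a real API):

```python
from collections import defaultdict

def harness(attacks: list[dict], refuses) -> dict[str, float]:
    """Per-category refusal rate for one candidate model.

    `attacks`: records like {"prompt": ..., "category": ...} from the
    attack library. `refuses(prompt)` returns True if the model refused.
    """
    hit, total = defaultdict(int), defaultdict(int)
    for a in attacks:
        total[a["category"]] += 1
        hit[a["category"]] += refuses(a["prompt"])
    return {c: hit[c] / total[c] for c in total}

def regressions(candidate: dict[str, float], baseline: dict[str, float],
                tolerance: float = 0.01) -> list[str]:
    """Categories where the candidate's refusal rate dropped vs baseline.
    A non-empty result should block promotion in the model registry."""
    return sorted(c for c, rate in candidate.items()
                  if rate < baseline.get(c, 1.0) - tolerance)
```

The registry-gating rule in item 3 is then one line: promote only if `regressions(candidate_rates, baseline_rates)` is empty.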

Latency note: the attack-evaluation loop should run in minutes, not days. A safety regression caught at canary is an incident; a safety regression caught in human review after GA is a crisis.

RLHF vs DPO vs CAI

Three fine-tuning techniques for aligning a base model to a policy; be able to whiteboard all three.

  • RLHF (Christiano 2017, Ouyang 2022): (1) collect human preference pairs, (2) train a reward model, (3) optimise the policy with PPO against the reward model, under a KL penalty that keeps it close to the reference model. Expensive and unstable, but strong results. OpenAI's dominant technique through 2023.
  • DPO (Rafailov 2023): skip the reward model and RL loop by optimising a closed-form loss directly on preference pairs. Simpler, more stable, comparable quality on many benchmarks. Now the default in many open-source efforts.
  • CAI (Bai 2022, Anthropic): replace most of the human preference labelling with a constitution — a written set of principles the model uses to self-critique and self-revise its outputs. "Is this response harmful under principle 3? Rewrite if so." Dramatically reduces the human labelling required; makes the alignment target inspectable and contestable. Anthropic's signature contribution.
flowchart LR
  base[Base LLM] --> rlhf["RLHF<br/>pref pairs + RM + PPO"]
  base --> dpo["DPO<br/>closed-form on pref pairs"]
  base --> cai["CAI<br/>self-critique by written constitution<br/>+ RLAIF final pass"]
  rlhf --> aligned[Aligned model]
  dpo --> aligned
  cai --> aligned
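DPO's "closed-form loss directly on preference pairs" is compact enough to whiteboard. A per-pair sketch (real implementations batch this over token-level log-probabilities in a tensor framework; this scalar version just shows the shape of the loss):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss (Rafailov et al. 2023): -log sigmoid(beta * margin).

    The margin is the policy's log-ratio advantage on the chosen answer over
    the frozen reference model. No reward model, no RL loop.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log 2; it decays toward zero as the policy separates chosen from rejected faster than the reference does, with `beta` playing the role of the KL-strength knob that PPO handles explicitly.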

In interviews, the expected move is: "I would use CAI for the broad harm categories where the principles are inspectable, keep a thin human-preference layer for subjective quality, and use DPO rather than PPO for stability and cost."

Human review & incident response

Automated filters are necessary but insufficient; a production safety system always has humans in the loop:

  • Reviewer tooling: triage UI with conversation context, suggested label, keyboard-friendly label actions, wellbeing protections (rotation, hour caps, mandatory counselling for exposure to distressing content).
  • Labels feed back into: classifier training data, prompt-template fixes, policy documentation, attack-library additions.
  • Incident response: pager rotation, severity taxonomy (SEV1 = live harm, SEV2 = pattern, SEV3 = near-miss), kill-switches at the policy router and the model-serving layer, public transparency for SEV1 (post-mortem within a week).
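The severity taxonomy and the two kill-switches wire together in a few lines. The SLA values follow the queue targets under "Concrete numbers"; the SEV3 value and the function names are assumptions for illustration:

```python
from dataclasses import dataclass

# High < 15 min and medium < 4 h come from the text; SEV3 is assumed.
SLA_MINUTES = {"SEV1": 15, "SEV2": 240, "SEV3": 24 * 60}

@dataclass
class KillSwitches:
    """Both switches must exist before launch; they cannot be retrofitted."""
    policy_router_to_human: bool = False   # route all risky traffic to review
    model_serving_disabled: bool = False   # hard stop at the serving layer

def handle_incident(severity: str, switches: KillSwitches) -> int:
    """Page the rotation and, for live harm, flip both kill-switches first.
    Returns the response SLA in minutes."""
    if severity == "SEV1":                 # live harm: stop the bleeding
        switches.policy_router_to_human = True
        switches.model_serving_disabled = True
    return SLA_MINUTES[severity]
```

The point of modelling the switches as pre-existing fields rather than ad-hoc deploys is exactly the "safety later" anti-pattern below: a rollback path that is not designed in cannot be invoked during an incident.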

Anti-pattern: "we will add safety later"

Retrofitting safety onto a shipped product costs 5-10x more than building it in, and always leaves permanent gaps. The classifier training data depends on early interaction logs; the human-review workflow depends on early product decisions; rollback requires a kill-switch that has to be designed in. "Safety later" is how high-profile failures happen.

Anti-pattern: over-block as a KPI

If your only metric is "blocks per 1k requests", you will over-refuse legitimate queries. Anthropic's own guidance treats unnecessary refusal as a safety failure — it erodes trust and pushes users to jailbreaks. Always pair block-rate with a benign-prompt false-positive rate.

Anthropic-specific interview expectations

What Anthropic specifically probes

  • Constitutional thinking: given a policy question, can you express the trade-off as competing principles rather than a single rule? ("helpfulness vs honesty vs harm avoidance").
  • RSP ladder: show that you know about ASL-2 / ASL-3 capability thresholds and that deployment decisions are gated by measured capability, not vibes.
  • Prompt injection: almost every agent-design interview at Anthropic includes "how do you defend against a malicious tool response?" Show the confused-deputy framing and concrete mitigations (per-tool allowlists, output schemas, cross-tool info-flow policy).
  • Transparency: you log enough to write an honest post-mortem, and you proactively publish safety results.
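The per-tool allowlist and output-schema mitigations from the prompt-injection bullet amount to a gate between the tool and the model: tool output is data, never instructions. A sketch in which the tool names, schemas, and marker pattern are all invented for illustration:

```python
import re

# Hypothetical per-tool contracts: which fields a tool may return, and a
# shape check per field. Anything outside the contract is dropped, so a
# hostile webpage cannot smuggle instructions through an unexpected field.
TOOL_SCHEMAS = {
    "weather": {"city":   re.compile(r"[A-Za-z .'-]{1,64}"),
                "temp_c": re.compile(r"-?\d{1,3}")},
}

# Crude marker scan; real systems pair this with a classifier.
INJECTION_MARKERS = re.compile(r"ignore (all|previous) instructions", re.I)

def sanitise_tool_response(tool: str, payload: dict[str, str]) -> dict[str, str]:
    """Confused-deputy defence: enforce the allowlist, then the schema."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        raise PermissionError(f"tool {tool!r} is not on the allowlist")
    clean = {}
    for key, pattern in schema.items():
        value = payload.get(key, "")
        if pattern.fullmatch(value) and not INJECTION_MARKERS.search(value):
            clean[key] = value
    return clean
```

Note what the confused-deputy framing buys you: the gate does not try to decide whether the content is "malicious", it simply refuses to pass anything the tool's contract did not promise.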

OpenAI framing differences

OpenAI gives safety equal weight but biases toward product-integrated answers: the Moderation API, spec-driven policy updates, scalable human review, fine-tuning partnerships with large customers. Expect more emphasis on enterprise boundary cases (data residency, customer-specific policies) and less on the alignment technique itself.

The winning interview narrative for either company: name the seven harms, draw the layered blueprint, place a red-team platform as an explicit subsystem, explain CAI/RLHF/DPO trade-offs, and close with the human review + incident response loop. Show that safety is an engineering discipline with SLOs, regressions, and post-mortems — not a vibe.
