OpenAI ★★ Frequent · Hard · Moderation / Safety / Classifier

O32 · Design a Content Moderation Pipeline

Verified source

OpenAI Moderation API (docs). Interview reports confirm that design questions of this kind are asked. Credibility: A.

Architecture

```mermaid
flowchart LR
  Req --> IN[Input Moderation]
  IN -->|block| Reject
  IN -->|allow| LLM
  LLM --> OUT[Output Moderation]
  OUT -->|block| Scrub
  OUT -->|flag| Q[Review Queue]
  OUT -->|allow| Deliver
  Q --> Human
  Human --> LBL[(Labels DB)]
  LBL --> Trainer
```
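The request flow above can be sketched in a few lines. Everything here is a toy stand-in, not a real API: `moderate` is a placeholder keyword classifier, the default `llm` just echoes the prompt, and `review_queue` represents the path to human review and the labels DB.

```python
from dataclasses import dataclass

@dataclass
class ModerationResult:
    blocked: bool
    flagged: bool = False

def moderate(text: str) -> ModerationResult:
    # Placeholder classifier: block an obvious policy term, flag a borderline one.
    lowered = text.lower()
    if "forbidden" in lowered:
        return ModerationResult(blocked=True)
    return ModerationResult(blocked=False, flagged="borderline" in lowered)

review_queue: list[str] = []  # feeds human review -> Labels DB -> Trainer

def handle_request(prompt: str, llm=lambda p: f"echo: {p}") -> str:
    if moderate(prompt).blocked:          # input moderation
        return "[rejected]"
    completion = llm(prompt)
    out = moderate(completion)            # output moderation
    if out.blocked:
        return "[scrubbed]"               # scrub before delivery
    if out.flagged:
        review_queue.append(completion)   # queue for async human review
    return completion                     # deliver (flagged content still ships)
```

Note that, as in the diagram, a *flagged* completion is both delivered and queued for review; only a *blocked* one is scrubbed.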

Key decisions

  • **Two-sided moderation**: score both prompts and completions, since jailbreaks can craft benign prompts that elicit harmful outputs.
  • **Classifier ensemble**: a fast small model screens everything; a deeper model re-scores edge cases.
  • **Policy categories as a structured schema**, not a binary flag; each category gets its own threshold.
  • **Shadow eval in prod**: 1% of traffic is also scored by a canary classifier.
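Two of these decisions compose naturally: per-category thresholds over structured scores, with borderline fast-model scores escalated to the deep model. A minimal sketch, in which the category names, threshold values, escalation band, and `deep_model` callable are all illustrative assumptions:

```python
# Per-category thresholds (illustrative numbers, not real policy values).
THRESHOLDS = {"hate": 0.40, "self_harm": 0.20, "violence": 0.55}
ESCALATION_BAND = 0.15  # fast scores this close to a threshold go to the deep model

def decide(fast_scores: dict[str, float], deep_model=None) -> str:
    """Return 'block' or 'allow' given the fast model's per-category scores.

    deep_model is a hypothetical callable: category name -> refined score.
    """
    for cat, score in fast_scores.items():
        t = THRESHOLDS[cat]
        if score >= t:
            return "block"                      # fast model is confident enough
        if deep_model and score >= t - ESCALATION_BAND:
            # Edge case: re-score only this borderline sample with the heavy model.
            if deep_model(cat) >= t:
                return "block"
    return "allow"
```

The escalation band is the key cost lever: widening it buys recall on hard cases at the price of more deep-model calls.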

Follow-ups

  • Multilingual? Per-language thresholds plus a pivot classifier that translates to English before scoring.
  • Latency budget? Input moderation ≤ 30 ms p99; output moderation runs concurrently with generation.
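One way to keep output moderation off the critical path is to score each streamed chunk as it arrives rather than waiting for the full completion. A toy `asyncio` sketch, where the generator and scorer are stand-ins and the chunking/scrubbing policy is an assumption:

```python
import asyncio

async def generate():                      # stand-in for a streaming LLM
    for chunk in ["fine ", "also fine ", "bad-chunk"]:
        await asyncio.sleep(0)             # yield control, as real I/O would
        yield chunk

async def score(chunk: str) -> bool:       # True = policy violation (toy rule)
    return "bad" in chunk

async def stream_with_moderation() -> list[str]:
    delivered = []
    async for chunk in generate():
        if await score(chunk):             # moderation overlaps generation latency
            delivered.append("[scrubbed]")
            break                          # halt generation on a violation
        delivered.append(chunk)
    return delivered

# asyncio.run(stream_with_moderation()) -> ["fine ", "also fine ", "[scrubbed]"]
```

Chunk-level scoring trades some context (a violation split across chunks can slip through) for latency, which is why a full-completion re-score often runs behind it.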

Related study-guide topics