OpenAI ★★ Frequent · Hard · Moderation Classifier

O13 · NSFW / Safety Detection for ChatGPT Outputs

Verified source

Prompt: "Design a System to Detect NSFW Content in ChatGPT Outputs." — Jobright, Glassdoor. Credibility C/D.

What you need to cover

  1. Data collection: sampled model outputs, red-team prompts, public benchmarks, user reports.
  2. Model choice: rule filters (regex/keyword) → ML classifier (fast) → LLM judge (high quality, expensive).
  3. Latency budget: block inline or moderate asynchronously; inline moderation adds to TTFT (time to first token).
  4. Feedback loop: label disagreements trigger retraining; policy updates ship as a new classifier version.
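The tiered model choice above can be sketched as a cascade that always uses the cheapest sufficient tier. This is a minimal illustration, not a production filter: `BLOCKLIST`, the thresholds, and the injected `score_fn`/`judge_fn` callables are all hypothetical stand-ins that a real system would tune on labeled policy data.

```python
import re

# Hypothetical rule filter; real blocklists are far larger and policy-driven.
BLOCKLIST = re.compile(r"\bforbidden_term\b", re.IGNORECASE)

def moderate(text, score_fn, judge_fn, low=0.1, high=0.9):
    """Return "pass" or "block" via the rules -> classifier -> judge cascade.

    score_fn: fast ML classifier, returns P(unsafe) in [0, 1].
    judge_fn: expensive LLM judge, returns True if unsafe.
    """
    if BLOCKLIST.search(text):
        return "block"                 # tier 1: rule filter catches the obvious
    p = score_fn(text)
    if p <= low:
        return "pass"                  # tier 2: confidently safe, skip the judge
    if p >= high:
        return "block"                 # tier 2: confidently unsafe
    return "block" if judge_fn(text) else "pass"   # tier 3: uncertain band only
```

Only the uncertain band (`low < p < high`) pays for an LLM-judge call, which is how the cascade keeps average cost and latency close to the fast classifier's.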

Architecture

```mermaid
flowchart LR
  LLM[LLM Output] --> RULES[Rule Filter]
  RULES -->|pass| CLS[ML Classifier]
  RULES -->|flag| ACTION
  CLS -->|pass| RET[Return to user]
  CLS -->|uncertain| LJ[LLM Judge]
  LJ --> ACTION[Action: block / rewrite / warn]
  ACTION --> AUDIT[(Audit Log)]
  RET --> SAMPLE[Async Sampler]
  SAMPLE --> TRAIN[Retraining Data]
```
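The `RET → SAMPLE → TRAIN` edge needs a sampler that keeps an unbiased subset of returned outputs for later labeling without storing the whole stream. One standard way to do that is reservoir sampling; the sketch below is an assumed design, not something specified in the prompt.

```python
import random

class ReservoirSampler:
    """Keep a uniform random sample of size k from an unbounded stream
    of returned outputs, to be labeled and fed into retraining data."""

    def __init__(self, k, seed=0):
        self.k = k
        self.n = 0                     # total items seen so far
        self.sample = []
        self.rng = random.Random(seed)

    def offer(self, item):
        self.n += 1
        if len(self.sample) < self.k:
            self.sample.append(item)   # fill the reservoir first
        else:
            j = self.rng.randrange(self.n)
            if j < self.k:
                self.sample[j] = item  # replace with probability k/n
```

Because each item survives with probability k/n, the retraining set stays representative of production traffic rather than over-weighting recent outputs.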

Design trade-offs

  • Inline vs. async: inline catches unsafe content before egress (safer) but blocks streaming; async lets streaming proceed but requires downstream correction (editing or deleting content the user may have already seen).
  • Token-by-token moderation for streaming: run a chunk-level classifier every K tokens and roll back flagged chunks before they reach the user.
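The chunked-streaming trade-off above can be sketched as a hold-and-release generator: tokens are buffered in chunks of K and only released once the chunk passes moderation, so "rollback" means dropping a buffered chunk rather than retracting text the client has already rendered. The helper name and the injected `is_unsafe` predicate are illustrative assumptions.

```python
def stream_with_moderation(token_iter, is_unsafe, k=16):
    """Yield tokens in chunks of k, holding each chunk until it passes
    moderation. A flagged chunk is dropped and the stream stops."""
    buf = []
    for tok in token_iter:
        buf.append(tok)
        if len(buf) == k:
            if is_unsafe("".join(buf)):
                return                 # rollback: held tokens never reach the user
            yield from buf
            buf.clear()
    if buf and not is_unsafe("".join(buf)):
        yield from buf                 # flush the final partial chunk
```

The choice of K is the knob: larger chunks give the classifier more context and fewer calls, but add up to K tokens of extra latency to the visible stream.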

Related study-guide topics