O13 · NSFW / Safety Detection for ChatGPT Outputs
Verified source
Prompt: "Design a System to Detect NSFW Content in ChatGPT Outputs." — Jobright, Glassdoor. Credibility C/D.
What you need to cover
- Data collection: sampled model outputs, red-team prompts, public benchmarks, user reports.
- Model choice: rule filters (regex/keyword) → ML classifier (fast) → LLM judge (high quality, expensive).
- Latency budget: block inline or moderate asynchronously. Inline checks add to time-to-first-token (TTFT).
- Feedback loop: label disagreement → retrain; policy updates → new classifier version.
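The rule → classifier → judge cascade above can be sketched as follows; the blocklist pattern, the confidence thresholds, and the classifier/judge stubs are illustrative assumptions, not a real policy.

```python
import re
from dataclasses import dataclass

# Tier 1: cheap rule filter. The pattern is a placeholder, not a real blocklist.
BLOCKLIST = [re.compile(p, re.IGNORECASE) for p in [r"\bexplicit-term\b"]]

@dataclass
class Verdict:
    action: str   # "pass" | "block"
    source: str   # which tier decided

def rule_filter(text: str) -> Verdict:
    for pat in BLOCKLIST:
        if pat.search(text):
            return Verdict("block", "rules")
    return Verdict("pass", "rules")

def ml_classifier(text: str) -> float:
    """Stub: would return P(nsfw) from a fast trained model."""
    return 0.0

def llm_judge(text: str) -> str:
    """Stub: would call a strong LLM with a policy rubric."""
    return "pass"

def moderate(text: str, t_low: float = 0.2, t_high: float = 0.8) -> Verdict:
    v = rule_filter(text)
    if v.action == "block":
        return v
    score = ml_classifier(text)
    if score >= t_high:        # confident NSFW: block without paying for the judge
        return Verdict("block", "classifier")
    if score <= t_low:         # confident safe: return to user
        return Verdict("pass", "classifier")
    # Uncertain band: escalate to the expensive LLM judge.
    return Verdict(llm_judge(text), "judge")
```

The two thresholds define the escalation band: widening it improves quality at the cost of more judge calls.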
Architecture

```mermaid
flowchart LR
  LLM[LLM Output] --> RULES[Rule Filter]
  RULES -->|pass| CLS[ML Classifier]
  RULES -->|flag| ACTION
  CLS -->|pass| RET[Return to user]
  CLS -->|uncertain| LJ[LLM Judge]
  LJ --> ACTION[Action: block / rewrite / warn]
  ACTION --> AUDIT[(Audit Log)]
  RET --> SAMPLE[Async Sampler]
  SAMPLE --> TRAIN[Retraining Data]
```
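The audit-log and async-sampler legs of the diagram can be sketched minimally; the file path, sample rate, and record shape are assumptions.

```python
import json
import random
import time

SAMPLE_RATE = 0.01  # assumed: sample ~1% of passed outputs for labeling/retraining

def audit_log(record: dict, path: str = "audit.jsonl") -> None:
    """Append-only audit trail of moderation decisions (JSONL)."""
    record["ts"] = time.time()
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def maybe_sample(text: str, verdict: str, buffer: list,
                 rate: float = SAMPLE_RATE) -> None:
    """Async sampler: randomly routes passed outputs into the retraining set."""
    if verdict == "pass" and random.random() < rate:
        buffer.append({"text": text, "verdict": verdict})
```

Sampling only passed outputs targets false negatives; flagged outputs already reach labelers via the audit log.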
Design trade-offs
- Inline vs async: inline catches content before egress (safer) but blocks streaming; async requires downstream correction (editing or deleting an already-delivered message).
- Token-by-token moderation for streaming: run a chunk-level classifier every K tokens, with rollback of flagged chunks.
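The chunk-level streaming check can be sketched as a generator; the chunk size, the classifier stub, and the rollback marker string are assumptions.

```python
from typing import Callable, Iterator

K = 16  # assumed: classify every K tokens

def chunk_is_safe(text: str) -> bool:
    """Stub: fast chunk-level classifier; always passes here."""
    return True

def moderated_stream(tokens: Iterator[str],
                     classify: Callable[[str], bool] = chunk_is_safe) -> Iterator[str]:
    """Buffer K tokens and emit only chunks that pass. On a flag, roll back:
    the buffered chunk is withheld and the stream ends with a marker."""
    buf: list[str] = []
    for tok in tokens:
        buf.append(tok)
        if len(buf) == K:
            chunk = "".join(buf)
            if not classify(chunk):
                yield "[content removed]"  # rollback marker for the client
                return
            yield chunk
            buf = []
    tail = "".join(buf)
    if tail:
        yield tail if classify(tail) else "[content removed]"
```

Note the cost of rollback: each chunk is withheld until classified, so K trades added latency (large K) against classifier calls and context per call (small K).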