O23 · Enterprise GPT for a 20K-Employee Company (RAG + ACL)
Verified source
小红书 @Justin (2026) — the candidate got a Strong Hire with a 4-layer decomposition. Distinct from O16 (LLM-powered Enterprise Search) in five ways:
① explicit 20K-employee scale, ② per-document ACL as a first-class requirement, ③ a P95 < 2 s latency budget, ④ EN/中 multilingual support, ⑤ a continuous-learning loop driven by feedback. Credibility B/C (candidate-attributed, detailed).
Requirements (the exact interview spec)
- Accurate answers with traceable citations (down to the exact paragraph / version).
- ACL correctness: an employee sees only content they're authorised to see.
- P95 < 2 s (excluding model inference, which can stream).
- Multilingual: EN + 中 at minimum.
- Continuous learning: offline metrics improve steadily, online complaints drop.
- Sources: Confluence, Google Docs, PDFs, Slack announcements, ticket KBs, PRDs.
The 4-layer architecture (the Strong-Hire answer)
```mermaid
flowchart TB
  subgraph L1["① Ingestion & Chunking"]
    CONN[Connectors<br/>Confluence · GDocs · Slack · PDFs · Jira] --> NORM[Normalizer<br/>dedupe, lang-detect, PII scrub]
    NORM --> CHUNK[Chunker<br/>semantic + overlap]
    CHUNK --> EMB[Embedding<br/>multilingual e5 / bge-m3]
    EMB --> IDX[(Vector index<br/>+ keyword index)]
    CHUNK --> META[(Doc metadata<br/>acl_ids, version, lang)]
  end
  subgraph L2["② Retriever"]
    Q[Query] --> QR[Query rewrite<br/>LLM lite + HyDE opt.]
    QR --> HY[Hybrid search<br/>BM25 + vector]
    HY --> ACL[ACL filter<br/>user_acl ∩ chunk_acl]
    ACL --> RR[Reranker<br/>cross-encoder]
  end
  subgraph L3["③ Evaluator"]
    RR --> EV["Answerability & coverage check"]
    EV -->|insufficient| FB["Refuse / 'I don't know'"]
    EV -->|ok| GEN
  end
  subgraph L4["④ Generator"]
    GEN[Prompt assembly<br/>system + ctx + citations] --> LLM[LLM inference<br/>streaming]
    LLM --> POST["Post-check:<br/>citation binding + PII guard"]
    POST --> USR[User + feedback capture]
  end
  USR -. thumb/complaint .-> FB2[Feedback store] --> EVAL[Offline eval<br/>factuality, citation@k]
  EVAL --> L1
  EVAL --> L2
```
Per-document ACL done right
- Store the ACL with the chunk, not only the doc. Each chunk inherits the doc's ACL but can be overridden by section-level rules (e.g., a PRD appendix).
- ACL as a first-class filter in retrieval. Post-filtering leaks information (if the top-K are all restricted, "I know nothing" reveals their existence). Pre-filter via a posting-list intersection at the vector/keyword index level.
- Principal expansion: flatten user → groups → roles into a set of acl_ids at request time; cache per user in Redis with a short TTL.
- Never train on restricted content. Fine-tuning and eval sets are filtered by ACL; log hashes only, keeping an audit trail without leakage.
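The principal expansion and pre-filter steps above can be sketched in a few lines. This is a minimal in-memory illustration (all names are hypothetical): a real system would push the intersection into the vector/keyword index as a posting-list filter and cache the expanded principal set per user in Redis.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    chunk_id: int
    acl_ids: frozenset  # principals allowed to read this chunk
    text: str = ""

def expand_principals(user: str, user_groups: dict, group_roles: dict) -> frozenset:
    """Flatten user -> groups -> roles into one acl_id set (cacheable per user)."""
    groups = set(user_groups.get(user, ()))
    roles = {r for g in groups for r in group_roles.get(g, ())}
    return frozenset({user} | groups | roles)

def acl_prefilter(chunks, user_acl: frozenset):
    """Pre-filter BEFORE ranking: a chunk survives only if the user's principal
    set intersects its acl_ids. Restricted chunks never enter the top-K, so an
    empty answer cannot reveal that they exist."""
    return [c for c in chunks if user_acl & c.acl_ids]
```

Because the filter runs before scoring, the candidate pool handed to the reranker is already leak-free, which is the property the post-filter approach cannot guarantee.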
P95 < 2 s latency budget (retrieval stack; inference excluded)
| Stage | Target | Technique |
|---|---|---|
| Query rewrite | < 150 ms | Small model; skipped when the query is unambiguous. |
| Hybrid retrieval | < 250 ms | Parallel BM25 + vector; ACL pre-filter at the index level. |
| Rerank | < 400 ms | Top-100 → top-8 cross-encoder; batched; cache top rerank results. |
| Evaluator + prompt build | < 200 ms | Simple heuristics + template fill. |
| First token (model) | streamed | Out of budget by design → the user sees streaming output. |
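The hybrid-retrieval row can be sketched as two legs run concurrently under the 250 ms budget. The search functions below are stubs, and the merge uses reciprocal-rank fusion, an assumption on my part (the post names no fusion method); RRF is a common choice because the BM25 and vector score scales then need no calibration.

```python
import asyncio

async def bm25_search(query: str, user_acl):
    await asyncio.sleep(0)  # stands in for keyword-index I/O
    return [("d1:3", 7.2), ("d2:1", 5.1)]

async def vector_search(query: str, user_acl):
    await asyncio.sleep(0)  # stands in for ANN-index I/O
    return [("d2:1", 0.91), ("d3:0", 0.80)]

async def hybrid_search(query: str, user_acl, timeout_s: float = 0.25):
    """Run both legs in parallel under one latency budget, then merge by
    reciprocal-rank fusion: score(c) = sum over legs of 1 / (60 + rank)."""
    kw, vec = await asyncio.wait_for(
        asyncio.gather(bm25_search(query, user_acl),
                       vector_search(query, user_acl)),
        timeout=timeout_s)
    fused = {}
    for results in (kw, vec):
        for rank, (chunk_id, _score) in enumerate(results):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (60 + rank)
    return sorted(fused, key=fused.get, reverse=True)
```

A chunk that appears in both legs (here `d2:1`) accumulates two reciprocal-rank terms and reliably wins, which is exactly the agreement signal hybrid search is after.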
Multilingual (EN + 中)
- Multilingual embeddings (bge-m3, e5-multilingual): one shared vector space, no translate-at-query step.
- Chunk-language tags → boost chunks in the user's language, but don't exclude cross-language hits.
- Answer language = the user's question language by default; quote citations in the original language.
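The language-boost rule amounts to a small post-rerank score adjustment. A minimal sketch, where the 1.15 factor is an illustrative assumption (it would be tuned on the offline eval, not hard-coded):

```python
def language_boost(results, user_lang: str, boost: float = 1.15):
    """Multiply the rerank score of same-language chunks by a small factor.
    Cross-language hits stay in the candidate set, just slightly behind,
    rather than being filtered out.

    results: iterable of (chunk_id, lang, score) tuples.
    Returns (boosted_score, chunk_id) pairs, best first."""
    return sorted(
        ((score * (boost if lang == user_lang else 1.0), chunk_id)
         for chunk_id, lang, score in results),
        reverse=True)
```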
Continuous learning loop
- Capture (query, retrieved, answer, thumb, complaint) tuples.
- Weekly offline eval: factuality, citation@k, refusal correctness, ACL-leak probes.
- Hard-negative mining for retriever fine-tuning; prompt-template A/B tests.
- Gate rollout on "complaint rate ↓ and factuality ↑, both statistically significant".
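One way to implement that rollout gate is a one-sided two-proportion z-test on the complaint rate, combined with whatever p-value the offline factuality eval reports. This is a sketch under assumptions of mine (the post names no specific test, and the α = 0.05 threshold is illustrative):

```python
import math

def ztest_proportions(x1: int, n1: int, x2: int, n2: int) -> float:
    """One-sided two-proportion z-test. Returns the p-value for
    H1: rate2 < rate1 (e.g., complaints dropped after rollout)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                        # pooled rate under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 0.5 * math.erfc(z / math.sqrt(2))          # P(Z >= z), standard normal

def gate_rollout(complaints_before: int, n_before: int,
                 complaints_after: int, n_after: int,
                 fact_gain_p: float, alpha: float = 0.05) -> bool:
    """Ship only if the complaint-rate drop AND the factuality gain are
    both significant (the factuality p-value comes from the offline eval)."""
    drop_p = ztest_proportions(complaints_before, n_before,
                               complaints_after, n_after)
    return drop_p < alpha and fact_gain_p < alpha
```

Requiring both tests to pass keeps a prompt-template A/B from shipping on a complaint-rate fluke alone.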
Follow-ups (the original post says the interviewer drilled into the Generator)
- Citation binding: force the model to emit [doc_id:chunk_id] tokens; a post-checker verifies every claim maps to a cited chunk; drop uncited sentences or refuse.
- Fresh write → instant availability? Dual-write to a "recent" index (small, cheap) while the batch indexer catches up; merge at query time.
- An eval that catches ACL leaks? A red-team probe set: known sensitive questions issued as different personas; assert that no restricted chunk appears in retrieval.
- Cost guardrails? A per-employee monthly token budget; shrink context aggressively when near the cap.
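The citation-binding post-check can be sketched as a sentence-level filter. This is a deliberately simplified version: it only checks that every cited [doc_id:chunk_id] token resolves to a chunk that was actually retrieved and drops sentences with no citation; real claim-to-source verification would add an entailment/NLI check on top.

```python
import re

CITE = re.compile(r"\[([\w-]+):(\d+)\]")   # matches tokens like [vpn-guide:3]

def bind_citations(answer: str, retrieved_ids: set) -> str:
    """Keep only sentences whose [doc_id:chunk_id] tokens all resolve to
    retrieved chunks; drop uncited or mis-cited sentences; refuse when
    nothing survives."""
    kept = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cites = [(doc, int(chunk)) for doc, chunk in CITE.findall(sent)]
        if cites and all(c in retrieved_ids for c in cites):
            kept.append(sent)
    return " ".join(kept) if kept else "I don't know."
```

Forcing the refusal path when no sentence survives is what turns the post-check from a cosmetic filter into the hallucination guard the follow-up is asking about.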