OpenAI · ★★ · Frequent · Hard · Autocomplete / Latency / Speculative

O33 · Design an Autocomplete Service (Codex/Copilot-like)

Verified source

Asked at OpenAI onsite (Codex team), confirmed on Blind 2025. Credibility B.

Architecture

```mermaid
flowchart LR
  IDE --> GW[Low-latency GW]
  GW --> CTX[Context Builder - file + repo RAG]
  CTX --> CACHE[Prefix Cache]
  CACHE --> MODEL[Inference - small fast model]
  MODEL --> GW
  GW --> IDE
```
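The hot path above can be sketched in a few lines: check the prefix cache before paying for inference, and populate it on a miss. This is a minimal illustration with a stubbed model; `PrefixCache`, `complete`, and `run_model` are hypothetical names, not the real service's API.

```python
from dataclasses import dataclass, field

@dataclass
class PrefixCache:
    """Maps a prompt prefix to a previously computed completion."""
    store: dict = field(default_factory=dict)

    def get(self, prefix: str):
        return self.store.get(prefix)

    def put(self, prefix: str, value: str):
        self.store[prefix] = value

def run_model(prompt: str) -> str:
    # Stand-in for the small, fast inference model in the diagram.
    return prompt.split()[-1] + "_completion"

def complete(prompt: str, cache: PrefixCache) -> str:
    cached = cache.get(prompt)
    if cached is not None:
        return cached            # cache hit: skip inference entirely
    out = run_model(prompt)
    cache.put(prompt, out)       # cache miss: compute, then remember
    return out
```

In a real deployment the cache would hold KV-cache state keyed by tokenized prefix rather than finished strings, but the control flow is the same.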

Key decisions

  • **Cancellation is first-class**: 80-90% of completions are cancelled before they return; abort generation as soon as the client disconnects.
  • **Speculative decoding**: a tiny draft model proposes tokens that a large verifier checks in one pass; roughly 2-3x speedup for ~15% extra GPU.
  • **Bounded context**: FIM (fill-in-the-middle) prompts plus top-K RAG chunks, hard-capped at 4k tokens to keep TTFT low.
  • **Privacy mode**: retention off, prompts never logged; org-level opt-out.
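First-class cancellation can be sketched with `asyncio`: generation runs as a task that is torn down the moment the client goes away, so no decode steps are wasted on a result nobody will see. The `generate`/`serve` names and the timeout-as-disconnect trick are illustrative, not the real gateway.

```python
import asyncio

async def generate(n_tokens: int, out: list):
    # Stand-in for the decode loop: one sleep per token.
    for i in range(n_tokens):
        await asyncio.sleep(0.01)
        out.append(f"tok{i}")

async def serve(disconnect_after: float) -> list:
    out: list = []
    try:
        # wait_for cancels the generate() coroutine on timeout,
        # which here models the client disconnecting mid-stream.
        await asyncio.wait_for(generate(100, out), timeout=disconnect_after)
    except asyncio.TimeoutError:
        pass  # disconnected: generation was aborted, keep partial output
    return out

tokens = asyncio.run(serve(disconnect_after=0.035))
```

The key property is that cancellation propagates into the decode loop itself; in a GPU-backed server the equivalent is freeing the request's slot in the batcher and its KV-cache.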
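The speculative-decoding decision can be shown with a deterministic toy: a cheap draft proposes up to k tokens, the verifier checks them in one batched pass and keeps the longest agreeing prefix, substituting its own token at the first disagreement. `TARGET`, `draft_propose`, and the planted error at position 4 are all made up for the example.

```python
TARGET = list("speculative")  # the verifier's greedy output, by position

def verifier_next(pos: int) -> str:
    return TARGET[pos]

def draft_propose(pos: int, k: int) -> list:
    # Cheap draft: right most of the time, wrong at position 4 (planted).
    return ["x" if i == 4 else TARGET[i]
            for i in range(pos, min(pos + k, len(TARGET)))]

def speculative_decode(k: int = 4):
    tokens, verifier_calls = [], 0
    while len(tokens) < len(TARGET):
        proposal = draft_propose(len(tokens), k)
        verifier_calls += 1                  # one batched verifier pass
        for tok in proposal:
            expected = verifier_next(len(tokens))
            if tok == expected:
                tokens.append(tok)           # draft token accepted
            else:
                tokens.append(expected)      # rejected: take verifier's token
                break
    return tokens, verifier_calls
```

Here 11 tokens cost only 4 verifier passes instead of 11; the draft's extra compute is the "~15% extra GPU" traded for the speedup.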
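Bounded-context FIM assembly might look like the sketch below: retrieved chunks are packed first under a sub-budget, then the local prefix/suffix are trimmed to fit the hard cap. The sentinel tokens, the half-budget split for RAG, and the 4-chars-per-token estimate are assumptions for illustration, not the production tokenizer.

```python
def est_tokens(s: str) -> int:
    return max(1, len(s) // 4)       # crude stand-in for a real tokenizer

def build_fim_prompt(prefix: str, suffix: str, chunks: list, budget: int = 4096) -> str:
    parts, used = [], 0
    for c in chunks:                 # chunks assumed pre-sorted by score
        t = est_tokens(c)
        if used + t > budget // 2:   # cap RAG at half the budget (assumed)
            break
        parts.append(c)
        used += t
    remaining = budget - used
    # Keep the end of the prefix and the start of the suffix: the text
    # nearest the cursor matters most for a completion.
    prefix = prefix[-(remaining // 2) * 4:]
    suffix = suffix[:(remaining // 2) * 4]
    return ("\n".join(parts)
            + f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>")
```

Because the cap is enforced before inference, TTFT stays bounded no matter how large the file or how many chunks retrieval returns.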

Follow-ups

  • How to bias toward acceptance? Monitor per-user acceptance rate and gate risky completions.
  • Non-English codebases? Use a multilingual tokenizer and evaluate per language.
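The acceptance-rate gating idea above can be sketched as a moving window per user: if a user rarely accepts, only surface completions the model is confident about. The class name, window size, and thresholds are invented for the example.

```python
from collections import defaultdict, deque

class AcceptanceGate:
    def __init__(self, window: int = 50, min_rate: float = 0.2,
                 min_confidence: float = 0.6):
        # Fixed-size window of recent accept/reject outcomes per user.
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.min_rate = min_rate
        self.min_confidence = min_confidence

    def record(self, user: str, accepted: bool):
        self.history[user].append(accepted)

    def rate(self, user: str) -> float:
        h = self.history[user]
        return sum(h) / len(h) if h else 1.0  # optimistic for new users

    def should_show(self, user: str, confidence: float) -> bool:
        # Users with a low acceptance rate only see high-confidence
        # completions; everyone else sees everything.
        if self.rate(user) < self.min_rate:
            return confidence >= self.min_confidence
        return True
```

Showing fewer but better completions to low-acceptance users raises the observed acceptance rate without touching the model itself.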

Related study-guide topics