O33 · Design an Autocomplete Service (Codex/Copilot-like)
Verified source
Asked at OpenAI onsite (Codex team), confirmed on Blind 2025. Credibility B.
Architecture
flowchart LR
    IDE --> GW[Low-latency GW]
    GW --> CTX[Context Builder - file + repo RAG]
    CTX --> CACHE[Prefix Cache]
    CACHE --> MODEL[Inference - small fast model]
    MODEL --> GW
    GW --> IDE
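A minimal sketch of what the Context Builder stage might do: assemble a fill-in-the-middle prompt from the text around the cursor plus the best-scoring retrieved chunks, all under a fixed token budget. Everything named here is an assumption for illustration: the sentinel strings, the 4-characters-per-token estimate, the 1/2 and 1/4 budget split, and the choice to place retrieved chunks ahead of the prefix. A real service would use the model's own tokenizer and special tokens.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sentinels; real FIM models define their own special tokens.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

CHARS_PER_TOKEN = 4  # rough heuristic standing in for a real tokenizer


def count_tokens(text: str) -> int:
    """Estimate the token count as ceil(len(text) / CHARS_PER_TOKEN)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def keep_tail(text: str, max_tokens: int) -> str:
    """Keep the end of the text (the code nearest the cursor, for the prefix)."""
    return text[-max_tokens * CHARS_PER_TOKEN:] if max_tokens > 0 else ""


def keep_head(text: str, max_tokens: int) -> str:
    """Keep the start of the text (the code nearest the cursor, for the suffix)."""
    return text[: max(max_tokens, 0) * CHARS_PER_TOKEN]


@dataclass
class Chunk:
    path: str
    text: str
    score: float  # retrieval similarity, higher is better


def build_fim_prompt(prefix: str, suffix: str, chunks: List[Chunk],
                     budget: int = 4096, top_k: int = 3) -> str:
    # Spend the budget on nearby code first: half for prefix, a quarter for suffix.
    prefix = keep_tail(prefix, budget // 2)
    suffix = keep_head(suffix, budget // 4)
    remaining = budget - count_tokens(prefix) - count_tokens(suffix)

    # Fill what is left with the top-K retrieved chunks, best score first.
    parts = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]:
        block = f"# {chunk.path}\n{chunk.text}\n"
        cost = count_tokens(block)
        if cost > remaining:
            continue  # skip chunks that do not fit; never truncate mid-chunk
        parts.append(block)
        remaining -= cost

    # Sentinels are not counted against the budget in this sketch.
    return "".join(parts) + FIM_PREFIX + prefix + FIM_SUFFIX + suffix + FIM_MIDDLE
```

Truncating the prefix from the left and the suffix from the right keeps the code closest to the cursor, which is usually the most predictive. Skipping oversized chunks rather than cutting them keeps each retrieved snippet syntactically whole.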
Key decisions
- **Cancellation is first-class**: 80-90% of completion requests are cancelled before they return, so abort generation as soon as the client disconnects.
- **Speculative decoding** with a tiny draft model and a large verifier: 2-3x speedup for roughly 15% extra GPU.
- **Bounded context**: FIM (fill-in-the-middle) prompt plus top-K RAG chunks, with a hard cap of 4k tokens to keep TTFT (time to first token) low.
- **Privacy mode**: retention is off and prompts are never logged or written to disk; opt-out is available at the org level.
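To make the cancellation point concrete, here is a minimal sketch using `asyncio` task cancellation. `generate_tokens` is a hypothetical stand-in for a streaming model call, and `Session` is an illustrative name, not part of any real service. The sketch also assumes that a newer request from the same editor session supersedes the older one, which is one common way requests end up cancelled.

```python
import asyncio
from typing import List, Optional

N_TOKENS = 20  # length of the simulated completion


async def generate_tokens(prompt: str):
    """Hypothetical streaming model call: yields one token per decode step."""
    for i in range(N_TOKENS):
        await asyncio.sleep(0.01)  # simulated per-token latency
        yield f"tok{i}"


async def complete(prompt: str, out: List[str]) -> None:
    try:
        async for tok in generate_tokens(prompt):
            out.append(tok)
    except asyncio.CancelledError:
        # A real server would free the GPU slot / KV cache here.
        raise  # re-raise so the task is actually marked cancelled


class Session:
    """At most one in-flight completion per editor session."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None

    def request(self, prompt: str, out: List[str]) -> asyncio.Task:
        self.cancel()  # a newer request supersedes the old one
        self._task = asyncio.create_task(complete(prompt, out))
        return self._task

    def cancel(self) -> None:
        """Call on client disconnect as well as on a superseding request."""
        if self._task is not None and not self._task.done():
            self._task.cancel()


async def demo():
    session = Session()
    first_out: List[str] = []
    second_out: List[str] = []
    first = session.request("def f", first_out)
    await asyncio.sleep(0.035)  # the user keeps typing mid-generation
    second = session.request("def fo", second_out)
    await second
    return first.cancelled(), len(first_out), len(second_out)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The key property is that cancellation interrupts the decode loop itself rather than just discarding the result, so the aborted request stops consuming compute after its current step.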
Follow-ups
- How to bias towards acceptance? Monitor the per-user acceptance rate and gate risky completions.
- Non-English codebases? Use a multilingual tokeniser and evaluate per language.
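One possible reading of "monitor per-user acceptance rate, gate risky completions", sketched under assumptions: track an exponentially weighted acceptance rate per user, and only show a completion when the model's confidence clears a threshold that tightens for users who reject often. The class name, the use of mean token log-probability as the confidence signal, and all constants are hypothetical choices for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List


class AcceptanceGate:
    def __init__(self, alpha: float = 0.1, prior: float = 0.3,
                 min_threshold: float = 0.5, max_threshold: float = 0.9) -> None:
        self.alpha = alpha                  # EWMA smoothing factor
        self.min_threshold = min_threshold  # bar for users who accept everything
        self.max_threshold = max_threshold  # bar for users who reject everything
        # New users start at an assumed prior acceptance rate.
        self.rate: Dict[str, float] = defaultdict(lambda: prior)

    def record(self, user_id: str, accepted: bool) -> None:
        """Update the user's exponentially weighted acceptance rate."""
        signal = 1.0 if accepted else 0.0
        self.rate[user_id] = (1 - self.alpha) * self.rate[user_id] + self.alpha * signal

    def threshold(self, user_id: str) -> float:
        """Interpolate between the two bars: lower acceptance means a stricter bar."""
        span = self.max_threshold - self.min_threshold
        return self.max_threshold - span * self.rate[user_id]

    def should_show(self, user_id: str, token_logprobs: List[float]) -> bool:
        if not token_logprobs:
            return False
        # Geometric-mean token probability as a crude confidence score.
        confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
        return confidence >= self.threshold(user_id)
```

Usage: call `record` on every accept or dismiss event from the IDE, and call `should_show` before returning a completion. Suppressing low-confidence completions raises the measured acceptance rate but also hides some useful suggestions, so the thresholds would need tuning against an offline eval.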