O25 · Design OpenAI Batch Inference API
Verified source
OpenAI public product (Batch API docs). Interview reports on Blind/LeetCode indicate candidates are asked to design it end-to-end. Credibility A.
Architecture
```mermaid
flowchart LR
    U[User] -->|upload JSONL| S3[(Blob store)]
    U -->|POST /batches| API
    API --> Q[(Batch queue)]
    SCHED[Scheduler] --> Q
    SCHED -->|fills idle GPU| GPU[Inference fleet]
    GPU -->|write outputs| S3
    SCHED --> META[(Batch metadata DB)]
    U -->|poll /batches/id| API
```
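The uploaded JSONL file holds one independent request per line. A minimal sketch of building such a file and its content hash (used below for dedup), assuming the request-per-line shape from the public Batch API docs; the `custom_id`/`method`/`url`/`body` fields follow those docs, while the row naming and request body are illustrative:

```python
import hashlib
import json

def build_batch_input(requests):
    """Serialize one request per line (JSONL); each line is retryable on its own."""
    lines = []
    for i, body in enumerate(requests):
        lines.append(json.dumps({
            "custom_id": f"row-{i}",   # caller's key to match outputs back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": body,
        }))
    payload = "\n".join(lines).encode()
    # Content hash of the immutable input file: identical payload => identical
    # batch, which enables dedup and retry without re-uploading.
    return payload, hashlib.sha256(payload).hexdigest()

payload, digest = build_batch_input([
    {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]},
])
```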
Key design choices
- **24h SLA, not 24h latency**: the scheduler backfills idle GPU capacity left over from interactive traffic, giving higher utilisation at lower cost.
- **Immutable input file**: a content hash enables dedup and retry without re-upload.
- **Partial progress visible**: polling returns {completed, failed, total}.
- **Per-row timeout + retry**: each line is independent; failed rows go to errors.jsonl.
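The last two choices can be sketched together: a hypothetical per-row state table in the batch metadata DB drives both the progress counts a poll returns and the routing of exhausted rows into errors.jsonl. Row shape and retry limit are assumptions, not from the source:

```python
import json

# Hypothetical per-row states tracked in the batch metadata DB.
PENDING, COMPLETED, FAILED = "pending", "completed", "failed"

def progress(rows):
    """Partial progress that polling GET /batches/{id} would surface."""
    return {
        "completed": sum(1 for r in rows if r["state"] == COMPLETED),
        "failed":    sum(1 for r in rows if r["state"] == FAILED),
        "total":     len(rows),
    }

def error_file(rows, max_retries=2):
    """Rows are independent: each gets its own timeout and retries. Rows that
    exhaust max_retries land in errors.jsonl instead of failing the batch."""
    lines = [
        json.dumps({"custom_id": r["custom_id"], "error": r.get("error", "timeout")})
        for r in rows
        if r["state"] == FAILED and r.get("attempts", 0) > max_retries
    ]
    return "\n".join(lines)
```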
Scale math
50 M requests/day × 2 k output tokens = 100 B tokens/day. Batch absorbs the capacity beyond the ~70 B-token/day online budget. KV-cache warm-up across a batch further reduces cost.
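A back-of-envelope check of those numbers (the 70 B online budget is the figure stated above, not derived here):

```python
# Back-of-envelope check of the scale math above.
requests_per_day = 50_000_000
tokens_per_request = 2_000
total_tokens = requests_per_day * tokens_per_request  # 100 B tokens/day
online_budget = 70_000_000_000                        # ~70 B tokens/day served online
batch_overflow = total_tokens - online_budget         # absorbed by the batch tier
```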
Follow-ups
- How do you guarantee 24h even on quiet days? Bump a batch's priority as its deadline nears.
- Prompt cache across a batch? Detect the shared prefix, cache it once, reuse it per row.
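Both follow-ups reduce to short functions. A sketch under assumed constants (the linear urgency ramp and the weight of 10 are illustrative tuning knobs, not from the source):

```python
import os.path

def effective_priority(base, hours_left, sla_hours=24):
    """Deadline-aware priority: rises from base at submit time toward
    base + 10 at the SLA deadline, so quiet-day batches still finish."""
    urgency = max(0.0, (sla_hours - hours_left) / sla_hours)  # 0 at submit, 1 at deadline
    return base + urgency * 10  # weight of 10 is an assumed tuning knob

def shared_prefix(prompts):
    """Detect the prefix shared by every row so its KV cache can be
    computed once and reused per row (the prompt-cache follow-up)."""
    return os.path.commonprefix(prompts)
```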