OpenAI ★★★ Frequent Hard Batch LLM Inference SLA

O25 · Design OpenAI Batch Inference API

Verified source

OpenAI public product (Batch API docs). Interview reports on Blind/LeetCode indicate candidates are asked to design it end-to-end. Credibility: A.

Architecture

```mermaid
flowchart LR
  U[User] -->|upload JSONL| S3[(Blob store)]
  U -->|POST /batches| API
  API --> Q[(Batch queue)]
  Q --> SCHED[Scheduler]
  SCHED -->|fills idle GPUs| GPU[Inference fleet]
  GPU -->|write outputs| S3
  SCHED --> META[(Batch metadata DB)]
  U -->|poll /batches/:id| API
```
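The user-facing flow above can be sketched as the request shapes a client would build, following the field names in the public Batch API docs (`custom_id`, `input_file_id`, `completion_window`); the values like `"gpt-4o"` and `"file-abc"` below are placeholders, and no network call is made:

```python
import json

def batch_row(custom_id: str, model: str, prompt: str) -> str:
    """One line of the uploaded JSONL file: each row is an independent
    request, and custom_id ties the output row back to the input row."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    })

def create_batch_body(input_file_id: str) -> dict:
    """Body of POST /batches, sent after the JSONL file is uploaded."""
    return {
        "input_file_id": input_file_id,
        "endpoint": "/v1/chat/completions",
        "completion_window": "24h",  # an SLA window, not a latency target
    }
```

After submission, the client polls `GET /batches/:id` until the batch reaches a terminal state, then downloads the output file from blob storage.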

Key design choices

  • **24h SLA, not 24h latency**: the scheduler backfills idle GPU capacity left over from interactive traffic; higher utilisation, lower cost.
  • **Immutable input file**: a content hash enables dedup and retry without re-upload.
  • **Partial progress visible**: polling returns {completed, failed, total}.
  • **Per-row timeout + retry**: each line is independent; failed rows go to errors.jsonl.
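Two of the choices above are easy to make concrete: content-hash dedup over the immutable input file, and the `{completed, failed, total}` poll payload. A minimal sketch (the row states `"completed"`/`"failed"`/`"running"` are assumed names for illustration):

```python
import hashlib

def file_hash(jsonl_bytes: bytes) -> str:
    """Content hash of the immutable input file; two identical uploads
    hash the same, so the second can be deduped or retried re-upload-free."""
    return hashlib.sha256(jsonl_bytes).hexdigest()

def progress(row_states) -> dict:
    """Aggregate per-row states into the payload a poll would return."""
    counts = {"completed": 0, "failed": 0, "total": 0}
    for state in row_states:
        counts["total"] += 1
        if state == "completed":
            counts["completed"] += 1
        elif state == "failed":
            counts["failed"] += 1  # these rows end up in errors.jsonl
    return counts
```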

Scale math

50 M requests/day × 2 k output tokens = 100 B tokens/day. Batch absorbs the capacity beyond the ~70 B-token/day online budget. KV-cache warm-up across a batch further reduces cost.
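Spelling out the arithmetic with the numbers from the paragraph above:

```python
requests_per_day = 50_000_000          # 50 M requests/day
output_tokens_per_request = 2_000      # 2 k output tokens each
total_tokens = requests_per_day * output_tokens_per_request  # 100 B tokens/day

online_budget = 70_000_000_000         # ~70 B tokens/day served interactively
batch_overflow = total_tokens - online_budget  # ~30 B tokens/day left for batch
```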

Follow-ups

  • How do you guarantee 24 h even on quiet days? Priority bumps as the deadline nears.
  • Prompt cache across a batch? Detect the shared prefix, cache it once, reuse it per row.
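Both follow-up answers can be sketched in a few lines; the linear priority-aging formula is a hypothetical choice for illustration, not how the production scheduler necessarily works:

```python
import os

def shared_prefix(prompts: list[str]) -> str:
    """Longest common prefix across a batch's prompts; if it is long
    (e.g. a shared system prompt), cache its KV state once and reuse it."""
    return os.path.commonprefix(prompts)

def effective_priority(base: float, hours_to_deadline: float) -> float:
    """Hypothetical linear aging: priority rises from `base` toward
    base + 1 as the 24 h deadline approaches, so quiet-day batches
    still preempt backfill work before they breach the SLA."""
    urgency = max(0.0, (24 - hours_to_deadline) / 24)
    return base + urgency
```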

Related study-guide topics