A14 · Batch Inference API
Verified source
Prompt: "Design an inference service API where clients POST a job and later poll for results…queued/running/succeeded/failed…idempotency…partial failures within a batch." — PracHub, Onsite. Credibility B.
API design
POST /v1/jobs {model, inputs[...], idempotency_key} → {job_id}
GET /v1/jobs/{job_id} → {status, progress, counts}
GET /v1/jobs/{job_id}/results?cursor= # paginated; supports partial results
POST /v1/jobs/{job_id}:cancel
Architecture
flowchart LR
  C[Client] --> S[Submit Job]
  S --> J[(Job DB)]
  S --> Q[Job Queue]
  Q --> W[Workers]
  W --> R[(Result Store)]
  C --> P[Poll Status / Results]
  P --> J
  P --> R
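The submit/poll flow above can be sketched as a tiny in-memory service. This is illustrative only: `JobService` and its internals are assumptions, not part of the design; a real implementation would back the maps with the Job DB and enqueue work.

```python
import uuid


class JobService:
    """In-memory sketch of POST /v1/jobs (with idempotency) and GET status."""

    def __init__(self):
        self.jobs = {}         # job_id -> job record
        self.by_idem_key = {}  # idempotency_key -> job_id

    def submit(self, model, inputs, idempotency_key):
        # Replaying the same idempotency_key returns the original job_id
        # instead of creating a duplicate job (safe client retries).
        if idempotency_key in self.by_idem_key:
            return self.by_idem_key[idempotency_key]
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {
            "model": model,
            "status": "queued",  # queued -> running -> succeeded/failed
            "total": len(inputs),
            "done": 0,
        }
        self.by_idem_key[idempotency_key] = job_id
        return job_id

    def status(self, job_id):
        job = self.jobs[job_id]
        return {
            "status": job["status"],
            "progress": job["done"] / job["total"],
            "counts": {"total": job["total"], "done": job["done"]},
        }
```

A retried submit with the same key is a no-op, which is what makes client-side retry loops safe.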
Data model (two layers)
Job(job_id, tenant_id, model, status, created_at, idempotency_key, input_ref)
JobItem(job_id, item_id, status, output_ref, error)
The two-level model handles partial failures within a batch and supports pulling results incrementally.
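One way to see how the two levels interact is to roll job status up from item statuses. A minimal sketch, assuming a simple rollup rule (a job with any successes "succeeds" and partial failures surface per item); the dataclass shapes mirror the model above:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobItem:
    item_id: str
    status: str = "queued"  # queued / running / succeeded / failed
    output_ref: Optional[str] = None
    error: Optional[str] = None


@dataclass
class Job:
    job_id: str
    items: List[JobItem] = field(default_factory=list)

    def counts(self):
        c = {"succeeded": 0, "failed": 0, "pending": 0}
        for it in self.items:
            if it.status in ("succeeded", "failed"):
                c[it.status] += 1
            else:
                c["pending"] += 1
        return c

    def status(self):
        c = self.counts()
        if c["pending"]:
            return "running"
        # Partial failure: the job still completes even if some items
        # failed; clients inspect per-item errors via the results endpoint.
        return "failed" if c["succeeded"] == 0 else "succeeded"
```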
Cost & scaling
- Batch traffic is naturally off-peak, so use it to fill capacity valleys left by online serving.
- Batching can be more aggressive than online serving (higher W_MAX, larger B_MAX).
- Typical savings are 40-50% versus online inference (Anthropic's public batch pricing is consistent with this).
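The W_MAX/B_MAX knobs above can be sketched as a flush policy. `flush_points` and the arrival-time representation are illustrative assumptions: it batches items until either B_MAX items accumulate or the oldest waiting item has waited W_MAX seconds.

```python
from typing import List


def flush_points(arrivals: List[float], b_max: int, w_max: float) -> List[int]:
    """Given item arrival times (seconds), return the batch sizes produced
    by a policy that flushes when the open batch reaches b_max items or its
    oldest item has waited w_max seconds. Pure function for illustration."""
    batches, open_count, t0 = [], 0, None
    for t in arrivals:
        # Time-based flush: the oldest queued item has waited too long.
        if open_count and t - t0 >= w_max:
            batches.append(open_count)
            open_count, t0 = 0, None
        if t0 is None:
            t0 = t
        open_count += 1
        # Size-based flush: the batch is full.
        if open_count == b_max:
            batches.append(open_count)
            open_count, t0 = 0, None
    if open_count:
        batches.append(open_count)
    return batches
```

With online-style limits (small b_max, tight w_max) the same arrival stream fragments into many small batches; batch-job limits let it coalesce into one large batch, which is where the GPU-utilization savings come from.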