A14 · Batch Inference API
Verified source
Prompt: "Design an inference service API where clients POST a job and later poll for results…queued/running/succeeded/failed…idempotency…partial failures within a batch." — PracHub, Onsite. Credibility B.
API design
POST /v1/jobs {model, inputs[...], idempotency_key} → {job_id}
GET /v1/jobs/{job_id} → {status, progress, counts}
GET /v1/jobs/{job_id}/results?cursor= # paginated; supports partial results
POST /v1/jobs/{job_id}:cancel
Architecture
flowchart LR
  C[Client] --> S[Submit Job]
  S --> J[(Job DB)]
  S --> Q[Job Queue]
  Q --> W[Workers]
  W --> R[(Result Store)]
  C --> P[Poll Status / Results]
  P --> J
  P --> R
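The submit/poll flow above can be sketched as a tiny in-memory service. This is illustrative only: `JobService` and its internals are assumptions, not part of the design; a real implementation would back the maps with the Job DB and enqueue work.

```python
import uuid


class JobService:
    """In-memory sketch of POST /v1/jobs (with idempotency) and GET status."""

    def __init__(self):
        self.jobs = {}         # job_id -> job record
        self.by_idem_key = {}  # idempotency_key -> job_id

    def submit(self, model, inputs, idempotency_key):
        # Replaying the same idempotency_key returns the original job_id
        # instead of creating a duplicate job (safe client retries).
        if idempotency_key in self.by_idem_key:
            return self.by_idem_key[idempotency_key]
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {
            "model": model,
            "status": "queued",  # queued -> running -> succeeded/failed
            "total": len(inputs),
            "done": 0,
        }
        self.by_idem_key[idempotency_key] = job_id
        return job_id

    def status(self, job_id):
        job = self.jobs[job_id]
        return {
            "status": job["status"],
            "progress": job["done"] / job["total"],
            "counts": {"total": job["total"], "done": job["done"]},
        }
```

A retried submit with the same key is a no-op, which is what makes client-side retry loops safe.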
Data model (two layers)
Job(job_id, tenant_id, model, status, created_at, idempotency_key, input_ref)
JobItem(job_id, item_id, status, output_ref, error)
The two-level model handles partial failures within a batch and supports pulling results incrementally.
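One way to see how the two levels interact is to roll job status up from item statuses. A minimal sketch, assuming a simple rollup rule (a job with any successes "succeeds" and partial failures surface per item); the dataclass shapes mirror the model above:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobItem:
    item_id: str
    status: str = "queued"  # queued / running / succeeded / failed
    output_ref: Optional[str] = None
    error: Optional[str] = None


@dataclass
class Job:
    job_id: str
    items: List[JobItem] = field(default_factory=list)

    def counts(self):
        c = {"succeeded": 0, "failed": 0, "pending": 0}
        for it in self.items:
            if it.status in ("succeeded", "failed"):
                c[it.status] += 1
            else:
                c["pending"] += 1
        return c

    def status(self):
        c = self.counts()
        if c["pending"]:
            return "running"
        # Partial failure: the job still completes even if some items
        # failed; clients inspect per-item errors via the results endpoint.
        return "failed" if c["succeeded"] == 0 else "succeeded"
```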
Cost & scaling
- Batch traffic is naturally off-peak, so use it to fill capacity valleys left by online serving.
- Batching can be more aggressive than online serving (higher W_MAX, larger B_MAX).
- Typical savings are 40-50% versus online inference (Anthropic's public batch pricing is consistent with this).
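The W_MAX/B_MAX knobs above can be sketched as a flush policy. `flush_points` and the arrival-time representation are illustrative assumptions: it batches items until either B_MAX items accumulate or the oldest waiting item has waited W_MAX seconds.

```python
from typing import List


def flush_points(arrivals: List[float], b_max: int, w_max: float) -> List[int]:
    """Given item arrival times (seconds), return the batch sizes produced
    by a policy that flushes when the open batch reaches b_max items or its
    oldest item has waited w_max seconds. Pure function for illustration."""
    batches, open_count, t0 = [], 0, None
    for t in arrivals:
        # Time-based flush: the oldest queued item has waited too long.
        if open_count and t - t0 >= w_max:
            batches.append(open_count)
            open_count, t0 = 0, None
        if t0 is None:
            t0 = t
        open_count += 1
        # Size-based flush: the batch is full.
        if open_count == b_max:
            batches.append(open_count)
            open_count, t0 = 0, None
    if open_count:
        batches.append(open_count)
    return batches
```

With online-style limits (small b_max, tight w_max) the same arrival stream fragments into many small batches; batch-job limits let it coalesce into one large batch, which is where the GPU-utilization savings come from.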