A12 · GPU Inference Request Dynamic Batching
Verified source
Prompt: "Design a system that serves online model-inference requests on GPUs…batching…balance throughput against latency SLOs…overload, failures, observability." — PracHub, 2026-03, Onsite. Credibility B.
Make batching a controllable decision
if batch_size >= B_MAX: flush                            # batch is full
elif oldest_wait_ms > W_MAX: flush                       # oldest request is about to miss its latency SLO
elif predicted_exec_time_spread > T_SPREAD_MAX: flush    # requests too heterogeneous to batch efficiently
else: wait a tiny delta (1-5 ms) and re-evaluate
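A minimal asyncio sketch of this policy, assuming one batcher task per GPU worker. The thresholds (B_MAX, W_MAX, T_SPREAD_MAX, DELTA_MS), the request dict layout, and the predict_exec_ms cost model are illustrative placeholders to be tuned from profiling, not values given in the prompt.

```python
import asyncio
import time

# Illustrative thresholds; real values come from profiling the model/GPU.
B_MAX = 32            # max requests per batch
W_MAX = 8.0           # max ms the oldest request may wait before we flush
T_SPREAD_MAX = 20.0   # max tolerated spread (ms) in predicted per-request runtime
DELTA_MS = 2.0        # re-evaluation interval when we decide to keep waiting

def predict_exec_ms(req: dict) -> float:
    # Hypothetical cost model, e.g. roughly linear in sequence length.
    return 0.5 * req["num_tokens"]

def should_flush(batch: list, now: float) -> bool:
    if not batch:
        return False
    if len(batch) >= B_MAX:                               # batch is full
        return True
    if (now - batch[0]["arrival"]) * 1000.0 > W_MAX:      # oldest request near its SLO
        return True
    costs = [predict_exec_ms(r) for r in batch]
    return max(costs) - min(costs) > T_SPREAD_MAX         # too heterogeneous to batch

async def batcher(queue: asyncio.Queue, run_batch) -> None:
    batch: list = []
    while True:
        # Drain whatever is currently queued, without blocking.
        while not queue.empty() and len(batch) < B_MAX:
            batch.append(queue.get_nowait())
        if should_flush(batch, time.monotonic()):
            await run_batch(batch)                        # dispatch to the GPU worker
            batch = []
        else:
            await asyncio.sleep(DELTA_MS / 1000.0)        # wait a tiny delta and re-evaluate
```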
Architecture

flowchart LR
  C[Client RPC] --> API[Inference API]
  API --> B[Batcher]
  B --> GPU[GPU Worker]
  GPU --> API
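To make the request path concrete, a sketch of the API and GPU-worker ends of that pipeline, reusing the batcher loop above. The infer / run_batch names, the bounded request_queue, and the future-based hand-off are assumptions for illustration, not details from the source.

```python
import asyncio
import time

request_queue: asyncio.Queue = asyncio.Queue(maxsize=1024)    # bounded, for backpressure

async def infer(payload: dict, num_tokens: int) -> dict:
    """Inference API: enqueue the request and await its batched result."""
    fut = asyncio.get_running_loop().create_future()
    await request_queue.put({
        "payload": payload,
        "num_tokens": num_tokens,
        "arrival": time.monotonic(),
        "future": fut,
    })
    return await fut                                           # resolved by the GPU worker

async def run_batch(batch: list) -> None:
    """GPU worker: one forward pass over the whole batch (model call stubbed out)."""
    outputs = [{"echo": req["payload"]} for req in batch]      # placeholder for model.forward(...)
    for req, out in zip(batch, outputs):
        req["future"].set_result(out)                          # unblocks the waiting API handler
```

Starting the batcher alongside the API handlers is then one line, e.g. asyncio.create_task(batcher(request_queue, run_batch)).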
Bottlenecks & mitigations
- Long tail: bucket requests by sequence length / model spec so one long request does not stall a whole batch; prioritize TTFT for small models.
- Overload: admission control (token bucket + concurrency cap) + queue timeout + degradation (fall back to CPU / return 429); see the admission-control sketch after this list.
- Observability: TTFT, per-token latency, batch-size distribution, drop/timeout rate, GPU memory pressure; see the metrics sketch below.
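A sketch of the admission-control idea from the overload bullet. The TokenBucket class, the rate/capacity/MAX_IN_FLIGHT values, and the choice to charge cost per token rather than per request are all illustrative assumptions.

```python
import time

class TokenBucket:
    """Token-bucket admission control; rate and capacity are illustrative."""
    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

MAX_IN_FLIGHT = 256                                   # concurrency cap per GPU worker
bucket = TokenBucket(rate_per_s=500, capacity=1000)   # sustained tokens/s plus burst allowance

def admit(num_tokens: int, in_flight: int) -> bool:
    """False means reject (429) or degrade to a CPU replica."""
    return in_flight < MAX_IN_FLIGHT and bucket.allow(cost=num_tokens)
```

The API handler calls admit(...) before enqueuing; on False it returns 429 or routes to a CPU replica, and a separate queue-timeout check drops requests that wait past their deadline.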
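A sketch of the metric set from the observability bullet, assuming prometheus_client is available; metric names and bucket boundaries are illustrative and should track the actual latency SLOs.

```python
from prometheus_client import Counter, Gauge, Histogram

TTFT = Histogram("inference_ttft_seconds", "Time to first token",
                 buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5))
PER_TOKEN = Histogram("inference_per_token_seconds", "Per-token decode latency")
BATCH_SIZE = Histogram("inference_batch_size", "Requests per dispatched batch",
                       buckets=(1, 2, 4, 8, 16, 32, 64))
DROPPED = Counter("inference_dropped_total", "Requests rejected or timed out in queue")
GPU_MEM_FREE = Gauge("inference_gpu_memory_free_bytes", "Free GPU memory on the worker")
```

The batcher observes BATCH_SIZE on every flush, the worker updates GPU_MEM_FREE after each step, and DROPPED is incremented wherever admission control rejects or a queue timeout fires.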