A15 · Multi-Model GPU Inference API
Verified source
Prompt: "Design a GPU-backed inference API…multi-model…dynamic batching…A/B routing…autoscaling…KV/cache" — PracHub, Onsite. Credibility B.
Two-layer design
- Data plane: request ingress → batcher → GPU worker → token stream.
- Control plane: model registry, version rollouts, routing policy, autoscaling, cost policy.
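The "batcher" step in the data plane is where dynamic batching (from the prompt) lives: hold requests briefly so the GPU sees full batches, but bound the added latency. A minimal sketch, with hypothetical max_batch / max_wait_ms parameters:

```python
import time
from collections import deque

class DynamicBatcher:
    """Dynamic batching sketch: release a batch when it is full OR the
    oldest queued request has waited past a latency deadline."""

    def __init__(self, max_batch=8, max_wait_ms=10):
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.queue = deque()  # (request, enqueue_time) pairs, FIFO

    def submit(self, request):
        self.queue.append((request, time.monotonic()))

    def next_batch(self):
        """Return a list of requests when full/expired, else None."""
        if not self.queue:
            return None
        oldest_age_ms = (time.monotonic() - self.queue[0][1]) * 1000
        if len(self.queue) >= self.max_batch or oldest_age_ms >= self.max_wait_ms:
            n = min(self.max_batch, len(self.queue))
            return [self.queue.popleft()[0] for _ in range(n)]
        return None  # keep waiting: batch not full, deadline not hit
```

The two knobs trade throughput (bigger batches) against tail latency (shorter waits); real schedulers like vLLM's go further with continuous batching at the token level.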
Architecture
```mermaid
flowchart LR
  C[Client] --> GW[Gateway]
  GW --> SCH[Scheduler / Batcher]
  SCH --> GPU[GPU Workers]
  CTRL[Control Plane] --> SCH
  CTRL --> REG[(Model Registry)]
  GPU --> OBS[Observability]
```
A/B routing
- Routing key: tenant_id + experiment_id + user_id for stickiness.
- Rollback triggers: error-budget burn, p95 TTFT over threshold, rising OOM ratio.
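Stickiness follows from hashing the routing key with a stable hash (not Python's per-process-randomized `hash()`), so the same user always lands in the same variant. A sketch, with illustrative names:

```python
import hashlib

def route_variant(tenant_id, experiment_id, user_id, variants, weights):
    """Sticky A/B assignment: the same (tenant, experiment, user)
    key deterministically maps to the same variant bucket."""
    key = f"{tenant_id}:{experiment_id}:{user_id}".encode()
    # Stable hash -> uniform bucket in [0, 1)
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]  # guard against float rounding in weights
```

Including experiment_id in the key reshuffles users between experiments, so one experiment's bucketing doesn't bias the next.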
Cold start is the hidden cost
- Model load + warmup can take minutes. Mitigation: an always-warm pool for the top-N models plus tiered SKU pools.
- Reuse KV cache / weights across tenants; citing vLLM PagedAttention earns credit here.
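The always-warm pool above is essentially an LRU of loaded models: the top-N stay resident, and loading a new model evicts the coldest. A minimal sketch, with a hypothetical loader callable standing in for the minutes-long load + warmup:

```python
from collections import OrderedDict

class WarmPool:
    """Always-warm pool sketch: keep at most `capacity` models
    resident, evicting the least recently used on overflow."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # callable: model_id -> loaded model
        self.resident = OrderedDict()   # model_id -> model, LRU order

    def get(self, model_id):
        if model_id in self.resident:
            self.resident.move_to_end(model_id)  # mark as recently used
            return self.resident[model_id]
        model = self.loader(model_id)   # cold start: minutes in practice
        self.resident[model_id] = model
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)    # evict coldest model
        return model
```

In production the eviction policy would also weigh model size and SLA tier, not just recency.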
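The cross-tenant KV reuse point can be sketched as block-level prefix caching: split the prompt into fixed-size token blocks and key each block by a hash chain over its prefix, so identical shared prefixes (e.g. a common system prompt) hit the same cache blocks. This is a toy simplification of the idea behind vLLM's PagedAttention prefix caching, not its actual API:

```python
import hashlib

class PrefixKVCache:
    """Toy prefix cache: identical token-block prefixes map to the
    same cache block, enabling KV reuse across requests/tenants."""

    BLOCK = 4  # tokens per cache block (toy size)

    def __init__(self):
        self.blocks = {}  # prefix-chain hash -> block id

    def lookup_or_insert(self, tokens):
        """Return (reused_blocks, new_blocks) for a token sequence."""
        reused, new, parent = 0, 0, ""
        full = len(tokens) - len(tokens) % self.BLOCK  # only full blocks
        for i in range(0, full, self.BLOCK):
            chunk = tokens[i:i + self.BLOCK]
            # Chain the parent hash so a block is shared only when its
            # entire prefix matches, not just the block contents.
            key = hashlib.sha1(
                (parent + "|" + ",".join(map(str, chunk))).encode()
            ).hexdigest()
            if key in self.blocks:
                reused += 1
            else:
                self.blocks[key] = len(self.blocks)
                new += 1
            parent = key
        return reused, new
```

Hash-chaining is what makes reuse safe: two requests share a block only when everything before it is also identical.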