A11 · High-Concurrency LLM Inference Service
Verified source
Prompt: "Design a high-concurrency LLM inference platform…support streaming token output…must discuss prefill vs decode, KV cache, batching strategy, tail latency…" — PracHub, 2026-02, Onsite. Credibility B.
Open with the prefill / decode split (first minute)
- Prefill: one large compute pass over the whole prompt. Throughput-bound; benefits greatly from batching.
- Decode: autoregressive, one token per step. Latency-bound; vulnerable to tail latency from long-running requests.
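The asymmetry can be sketched in a few lines of toy Python (dummy "model" and invented names, purely to show the shape of the two phases; a real engine's prefill pass would also populate the KV cache):

```python
# Prefill: one pass over all prompt positions. Decode: one token per
# step, each step depending on the previous token's output.

def prefill(prompt_tokens):
    # One batched pass over the whole prompt -> initial state.
    return {"seen": list(prompt_tokens)}

def decode_step(state):
    # Autoregressive: consumes the state, appends exactly one token.
    next_token = len(state["seen"])  # dummy "model": emits position index
    state["seen"].append(next_token)
    return next_token

state = prefill([101, 102, 103])
generated = [decode_step(state) for _ in range(4)]
print(generated)  # [3, 4, 5, 6] — each step depends on the one before
```

The prefill pass parallelizes across prompt positions; the decode loop cannot, which is why the two phases want different scheduling.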
Reference architecture
```mermaid
flowchart LR
    C[Client] --> GW[API Gateway]
    GW --> RL[Rate Limit & Auth]
    RL --> SCH[Scheduler / Continuous Batcher]
    SCH --> W[GPU Workers]
    W --> TOK[Token Stream]
    TOK --> C
    SCH --> MET[Metrics / Tracing]
    W --> KV[(PagedAttention KV Cache Mgr)]
```
KV cache: speak it like you've operated it
- The KV cache avoids recomputing attention keys/values at every decode step: a classic time-space trade-off.
- Under high concurrency it dominates GPU memory and fragments it.
- vLLM's PagedAttention applies OS-style virtual-memory paging to the KV cache: memory waste from fragmentation drops from 60-80% to <4%, enabling up to ~24× throughput over naive (HF Transformers) serving.
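A back-of-envelope sizing makes the memory pressure concrete (assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, fp16; real models with grouped-query attention store fewer KV heads per layer):

```python
# Per-token KV footprint = 2 (K and V) x layers x kv_heads x head_dim x bytes.

def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token()        # 524288 B = 0.5 MiB per token
seq_4k_gib = per_tok * 4096 / 2**30   # one 4k-token sequence, in GiB
print(per_tok, seq_4k_gib)            # 524288 2.0
```

At 2 GiB per 4k-token sequence, roughly 40 concurrent long sequences already exceed an 80 GB A100, which is why paging and eviction policy dominate the design.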
Batching & scheduling (give an executable strategy)
- Continuous batching: iteration-level scheduling; after each generated token, check for completed sequences and swap in new requests. Eliminates the stragglers of static batching.
- Bucket requests by predicted remaining tokens to reduce intra-batch length variance.
- Objective to minimize: w1·TTFT + w2·per-token latency − w3·GPU_util + w4·drop_rate; A/B-test the weights.
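The iteration-level swap can be sketched as follows (illustrative only; request shapes and names are invented):

```python
# Minimal continuous-batching loop: after every decode iteration,
# finished sequences are evicted and freed slots are refilled from the
# wait queue — no sequence waits for a whole static batch to drain.

from collections import deque

def run(requests, batch_size=2):
    waiting = deque(requests)            # (req_id, tokens_to_generate)
    active, done, steps = [], [], 0
    while waiting or active:
        # Refill free slots at iteration granularity.
        while waiting and len(active) < batch_size:
            active.append(list(waiting.popleft()))
        for seq in active:
            seq[1] -= 1                  # one decode step per active sequence
        steps += 1
        done += [s[0] for s in active if s[1] == 0]
        active = [s for s in active if s[1] > 0]
    return done, steps

done, steps = run([("a", 1), ("b", 3), ("c", 2)])
print(done, steps)  # ['a', 'b', 'c'] 3
```

Static batching would run [a, b] for 3 steps, then [c] for 2 more (5 total); refilling at iteration granularity finishes all three in 3.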
Further optimizations (papers to name-drop)
- FlashAttention: IO-aware exact attention; big win for long contexts.
- Speculative decoding: a small draft model proposes tokens, the large model verifies them in one pass; 1.5-2× speedup when the acceptance rate is high.
- Chunked prefill: split long prefills into chunks so decode steps stay interleaved and responsive.
- Prefix caching: cache the KV for repeated system-prompt prefixes.
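For the follow-up "why does speculative decoding help?", a toy propose/verify skeleton shows the control flow (dummy models and invented names; a real implementation verifies via rejection sampling against the target distribution to stay lossless):

```python
# Draft proposes k tokens cheaply; target checks them in ONE batched
# pass and keeps the longest accepted prefix, so each accepted token
# costs ~1/k of a target forward pass.

def draft_propose(prefix, k):
    return [prefix[-1] + 1 + i for i in range(k)]  # dummy draft model

def target_verify(prefix, proposed):
    accepted = []
    for t in proposed:
        if t % 2 == 0:          # dummy acceptance rule: evens only
            accepted.append(t)
        else:
            break               # first rejection ends the accepted run
    return accepted

proposed = draft_propose([7], k=4)   # [8, 9, 10, 11]
print(target_verify([7], proposed))  # [8]
```

The speedup is exactly the average accepted-run length, which is why the 1.5-2× figure holds only when draft and target agree often.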
Cost model
$ / token ≈ GPU_hourly_cost / tokens_per_hour_per_gpu
Reduce the numerator via spot or reserved pricing; boost the denominator via larger batches and the optimizations above. Rough anchor: AWS p4d.24xlarge (8× A100) ≈ $22/hr; H100 instances roughly 2× that. Quote a framework, not a final number.
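Plugging the anchor into the formula (the 2,000 tok/s aggregate node throughput is an assumed number, chosen only to show the arithmetic):

```python
# $/token framework: hourly hardware cost over hourly token output.

def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# p4d.24xlarge anchor from the text: ~$22/hr for the whole 8xA100 node.
print(round(cost_per_million_tokens(22.0, 2000), 2))  # 3.06
```

Doubling effective throughput via continuous batching halves $/token at fixed hardware cost, which is why the batching section and the cost model are the same argument.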
Anthropic-style follow-ups
(1) How do you avoid head-of-line blocking? → bucketing, chunked prefill, priority queues.
(2) What signal drives autoscaling? → token-queue depth beats raw GPU utilization.
(3) Release process? → warmup, canary, rollback thresholds, shadow traffic.
(4) Where does safety plug in? → input/output filters and policy routing; cite Constitutional AI as framing.