Anthropic · ★★★★ · Frequent · Hard · Tags: LLM Serving, KV Cache, PagedAttention

A11 · High-Concurrency LLM Inference Service

Verified source

Prompt: "Design a high-concurrency LLM inference platform…support streaming token output…must discuss prefill vs decode, KV cache, batching strategy, tail latency…" — PracHub, 2026-02, Onsite. Credibility B.

Open with the prefill / decode split (first minute)

  • Prefill: one large compute pass over the entire prompt. Throughput-bound; benefits heavily from batching.
  • Decode: autoregressive, one token per step. Latency-bound; tail latency is vulnerable to long-running requests sharing the batch.
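The split above maps directly onto the two latency metrics you should name: TTFT (time to first token, dominated by queueing + prefill) and TPOT (time per output token, dominated by decode). A minimal sketch of computing both from per-token timestamps; `RequestTrace` and the field names are illustrative, not any framework's API:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical shape)."""
    arrival: float
    token_times: list  # wall-clock time each output token was emitted

def ttft(trace: RequestTrace) -> float:
    """Time To First Token: queueing delay + prefill."""
    return trace.token_times[0] - trace.arrival

def tpot(trace: RequestTrace) -> float:
    """Mean Time Per Output Token after the first: the decode loop."""
    deltas = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(deltas) / len(deltas)

# 0.8 s to the first token, then a steady 50 ms per decode step
trace = RequestTrace(arrival=0.0, token_times=[0.80, 0.85, 0.90, 0.95])
assert abs(ttft(trace) - 0.80) < 1e-9   # prefill-dominated
assert abs(tpot(trace) - 0.05) < 1e-9   # decode-dominated
```

Reporting p50/p99 of each separately is what makes the "prefill is throughput-bound, decode is latency-bound" claim operational.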

Reference architecture参考架构

flowchart LR
  C[Client] --> GW[API Gateway]
  GW --> RL[Rate Limit & Auth]
  RL --> SCH[Scheduler / Continuous Batcher]
  SCH --> W[GPU Workers]
  W --> TOK[Token Stream]
  TOK --> C
  SCH --> MET[Metrics / Tracing]
  W --> KV[(PagedAttention KV Cache Mgr)]

KV cache — speak about it like you've operated it

  • The KV cache stores attention keys/values so each decode step avoids recomputing them for the whole prefix — a time-space trade-off.
  • Under high concurrency it dominates GPU memory and fragments it.
  • vLLM's PagedAttention applies OS-style virtual-memory paging to the KV cache: memory waste drops from the 60-80% typical of contiguous allocation to under 4%, which vLLM reports as 2-4× throughput over prior serving systems (and up to ~24× over naive HuggingFace serving).
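The paging idea can be sketched in a few lines: KV memory is carved into fixed-size physical blocks handed out on demand, so a sequence never reserves its maximum length up front and freed blocks are immediately reusable. A toy allocator, assuming nothing about vLLM's real internals:

```python
class PagedKVAllocator:
    """Toy PagedAttention-style block allocator (illustrative, not vLLM's API)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free-list of physical block ids
        self.tables = {}                     # seq_id -> (block ids, token count)

    def append_token(self, seq_id: str) -> None:
        blocks, n = self.tables.get(seq_id, ([], 0))
        if n == len(blocks) * self.block_size:
            # Last block is full: page in exactly one more, on demand.
            if not self.free:
                raise MemoryError("KV pool exhausted; preempt or swap a sequence")
            blocks = blocks + [self.free.pop()]
        self.tables[seq_id] = (blocks, n + 1)

    def release(self, seq_id: str) -> None:
        blocks, _ = self.tables.pop(seq_id)
        self.free.extend(blocks)  # blocks go straight back to the pool

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):                          # 17 tokens -> exactly 2 blocks
    alloc.append_token("req-1")
assert len(alloc.tables["req-1"][0]) == 2    # no max-length reservation
alloc.release("req-1")
assert len(alloc.free) == 4                  # no fragmentation after release
```

The waste per sequence is bounded by one partially filled block, which is where the "under 4%" figure comes from.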

Batching & scheduling (give an executable strategy)

  • Continuous batching: iteration-level scheduling — after each generated token, evict completed sequences and backfill their slots with waiting requests. Eliminates the stragglers of static batching.
  • Bucket requests by predicted remaining tokens to reduce intra-batch length variance.
  • Objective: minimize w1·TTFT + w2·per-token latency − w3·GPU_util + w4·drop_rate; A/B-test the weights.
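The continuous-batching bullet can be sketched as a toy scheduler loop (all names are illustrative; `step_fn` stands in for one fused decode step across the batch):

```python
from collections import deque

def continuous_batch(waiting, step_fn, max_batch=8):
    """Iteration-level scheduling: after every decode step, finished
    sequences leave the batch and waiting requests take their slots."""
    waiting = deque(waiting)
    running, steps = [], 0
    while waiting or running:
        while waiting and len(running) < max_batch:   # backfill freed slots
            running.append(waiting.popleft())
        running = [r for r in running if not step_fn(r)]  # one token each
        steps += 1
    return steps

# Requests needing 3, 1 and 5 more tokens, batch capacity 2.
reqs = [{"left": n} for n in (3, 1, 5)]
def step(r):
    r["left"] -= 1
    return r["left"] == 0

# Continuous batching finishes in 6 steps; static batching would take 8
# (3 steps for the [3, 1] batch, then 5 for [5]) — no straggler padding.
assert continuous_batch(reqs, step, max_batch=2) == 6
```

The real scheduler additionally decides *which* waiting request backfills a slot — that is where the bucketing and the weighted objective above plug in.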

Further optimizations (papers to name-drop)

  • FlashAttention: IO-aware attention kernels; a big win for long contexts.
  • Speculative decoding: a small draft model proposes tokens, the big model verifies them — roughly 1.5-2× speedup when the acceptance rate is high.
  • Chunked prefill: split long prefills into chunks interleaved with decode steps, so decode latency stays bounded.
  • Prefix caching: cache the KV for repeated system-prompt prefixes across requests.
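For speculative decoding it helps to show *why* the acceptance rate gates the win. Under the usual simplifying assumption of an independent per-token acceptance rate α (a Leviathan-et-al-style back-of-envelope, not an exact model), the expected tokens accepted per target-model pass with draft length k is (1 − α^(k+1)) / (1 − α):

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Back-of-envelope speculative-decoding speedup.
    alpha: per-token acceptance rate (assumed i.i.d. — a simplification),
    k: draft length, c: draft-model cost relative to the target model."""
    tokens_per_pass = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost_per_pass = 1 + k * c   # one target verify pass + k draft steps
    return tokens_per_pass / cost_per_pass

# High acceptance makes it worthwhile; low acceptance eats the whole win.
assert expected_speedup(0.8, 4, 0.05) > 1.5
assert expected_speedup(0.2, 4, 0.05) < 1.1
```

This is the quantitative version of the interview answer: speculative decoding is an acceptance-rate bet, so measure α on your real traffic before shipping it.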

Cost model

$ / token ≈ GPU_hourly_cost / tokens_per_hour_per_gpu
          — reduce the numerator via spot or reserved pricing; boost the denominator via batching and the optimizations above

Rough anchor: AWS p4d.24xlarge (8× A100) runs on the order of $32/hr on-demand, with spot often far less; H100 instances are roughly double. Quote a framework, not a final number.
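The framework above, as three lines of arithmetic — the input numbers here are illustrative assumptions to be replaced with measured throughput, not quoted figures:

```python
def dollars_per_million_tokens(gpu_hourly_usd: float,
                               tokens_per_sec_per_gpu: float) -> float:
    """$ / 1M tokens = hourly GPU cost / hourly token throughput."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return gpu_hourly_usd / tokens_per_hour * 1e6

# Hypothetical: an 8-GPU node at $32/hr serving 10k tok/s in aggregate
# => $4/hr and 1,250 tok/s per GPU => roughly $0.89 per million tokens.
cost = dollars_per_million_tokens(32 / 8, 10_000 / 8)
assert 0.88 < cost < 0.90
```

In the interview, walk the two levers: spot pricing halves the numerator, continuous batching plus the kernel optimizations multiply the denominator.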

Anthropic-style follow-ups

  (1) How do you avoid head-of-line blocking? → bucketing, chunked prefill, priority queues.
  (2) What's the autoscale signal? → token-queue depth beats GPU utilization (which saturates early).
  (3) Release process? → warmup, canary, rollback thresholds, shadow traffic.
  (4) Where does safety plug in? → input/output filters and policy routing; cite Constitutional AI as framing.
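Follow-up (2) deserves one concrete line of logic: size the fleet so the current token backlog drains within an SLO window. A hedged sketch — the function name, parameters, and numbers are all hypothetical:

```python
import math

def target_replicas(queued_tokens: int,
                    tok_per_sec_per_replica: float,
                    drain_slo_sec: float,
                    max_replicas: int = 64) -> int:
    """Scale on token backlog, not GPU utilization: utilization pins
    near 100% under any overload and can't tell you how far behind you are."""
    needed = math.ceil(queued_tokens / (tok_per_sec_per_replica * drain_slo_sec))
    return min(max(needed, 1), max_replicas)

assert target_replicas(0, 1000, 30) == 1        # idle: keep a floor of one
assert target_replicas(90_000, 1000, 30) == 3   # 90k-token backlog, 30 s SLO
```

In practice you would smooth `queued_tokens` over a window and add hysteresis so canary or burst traffic doesn't flap the fleet.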

Related study-guide topics