A11 · High-Concurrency LLM Inference Service
Verified source
Prompt: "Design a high-concurrency LLM inference platform…support streaming token output…must discuss prefill vs decode, KV cache, batching strategy, tail latency…" — PracHub, 2026-02, Onsite. Credibility B.
Open with the prefill / decode split (first minute)
- Prefill: one large compute pass over the whole prompt. Throughput-bound; benefits greatly from batching.
- Decode: autoregressive, one token per step. Latency-bound; vulnerable to tail latency from long-running requests.
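The asymmetry can be sketched in a few lines of toy Python (dummy "model" and invented names, purely to show the shape of the two phases; a real engine's prefill pass would also populate the KV cache):

```python
# Prefill: one pass over all prompt positions. Decode: one token per
# step, each step depending on the previous token's output.

def prefill(prompt_tokens):
    # One batched pass over the whole prompt -> initial state.
    return {"seen": list(prompt_tokens)}

def decode_step(state):
    # Autoregressive: consumes the state, appends exactly one token.
    next_token = len(state["seen"])  # dummy "model": emits position index
    state["seen"].append(next_token)
    return next_token

state = prefill([101, 102, 103])
generated = [decode_step(state) for _ in range(4)]
print(generated)  # [3, 4, 5, 6] — each step depends on the one before
```

The prefill pass parallelizes across prompt positions; the decode loop cannot, which is why the two phases want different scheduling.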
Reference architecture
```mermaid
flowchart LR
    C[Client] --> GW[API Gateway]
    GW --> RL[Rate Limit & Auth]
    RL --> SCH[Scheduler / Continuous Batcher]
    SCH --> W[GPU Workers]
    W --> TOK[Token Stream]
    TOK --> C
    SCH --> MET[Metrics / Tracing]
    W --> KV[(PagedAttention KV Cache Mgr)]
```
KV cache: speak it like you've operated it
- The KV cache avoids recomputing attention keys/values at every decode step: a classic time-space trade-off.
- Under high concurrency it dominates GPU memory and fragments it.
- vLLM's PagedAttention applies OS-style virtual-memory paging to the KV cache: memory waste from fragmentation drops from 60-80% to <4%, enabling up to ~24× throughput over naive (HF Transformers) serving.
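A back-of-envelope sizing makes the memory pressure concrete (assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, fp16; real models with grouped-query attention store fewer KV heads per layer):

```python
# Per-token KV footprint = 2 (K and V) x layers x kv_heads x head_dim x bytes.

def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_tok = kv_bytes_per_token()        # 524288 B = 0.5 MiB per token
seq_4k_gib = per_tok * 4096 / 2**30   # one 4k-token sequence, in GiB
print(per_tok, seq_4k_gib)            # 524288 2.0
```

At 2 GiB per 4k-token sequence, roughly 40 concurrent long sequences already exceed an 80 GB A100, which is why paging and eviction policy dominate the design.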
Batching & scheduling (give an executable strategy)
- Continuous batching: iteration-level scheduling; after each generated token, check for completed sequences and swap in new requests. Eliminates the stragglers of static batching.
- Bucket requests by predicted remaining tokens to reduce intra-batch length variance.
- Objective to minimize: w1·TTFT + w2·per-token latency − w3·GPU_util + w4·drop_rate; A/B-test the weights.
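The iteration-level swap can be sketched as follows (illustrative only; request shapes and names are invented):

```python
# Minimal continuous-batching loop: after every decode iteration,
# finished sequences are evicted and freed slots are refilled from the
# wait queue — no sequence waits for a whole static batch to drain.

from collections import deque

def run(requests, batch_size=2):
    waiting = deque(requests)            # (req_id, tokens_to_generate)
    active, done, steps = [], [], 0
    while waiting or active:
        # Refill free slots at iteration granularity.
        while waiting and len(active) < batch_size:
            active.append(list(waiting.popleft()))
        for seq in active:
            seq[1] -= 1                  # one decode step per active sequence
        steps += 1
        done += [s[0] for s in active if s[1] == 0]
        active = [s for s in active if s[1] > 0]
    return done, steps

done, steps = run([("a", 1), ("b", 3), ("c", 2)])
print(done, steps)  # ['a', 'b', 'c'] 3
```

Static batching would run [a, b] for 3 steps, then [c] for 2 more (5 total); refilling at iteration granularity finishes all three in 3.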
Further optimizations (papers to name-drop)
- FlashAttention: IO-aware exact attention; big win for long contexts.
- Speculative decoding: a small draft model proposes tokens, the large model verifies them in one pass; 1.5-2× speedup when the acceptance rate is high.
- Chunked prefill: split long prefills into chunks so decode steps stay interleaved and responsive.
- Prefix caching: cache the KV for repeated system-prompt prefixes.
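For the follow-up "why does speculative decoding help?", a toy propose/verify skeleton shows the control flow (dummy models and invented names; a real implementation verifies via rejection sampling against the target distribution to stay lossless):

```python
# Draft proposes k tokens cheaply; target checks them in ONE batched
# pass and keeps the longest accepted prefix, so each accepted token
# costs ~1/k of a target forward pass.

def draft_propose(prefix, k):
    return [prefix[-1] + 1 + i for i in range(k)]  # dummy draft model

def target_verify(prefix, proposed):
    accepted = []
    for t in proposed:
        if t % 2 == 0:          # dummy acceptance rule: evens only
            accepted.append(t)
        else:
            break               # first rejection ends the accepted run
    return accepted

proposed = draft_propose([7], k=4)   # [8, 9, 10, 11]
print(target_verify([7], proposed))  # [8]
```

The speedup is exactly the average accepted-run length, which is why the 1.5-2× figure holds only when draft and target agree often.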
Cost model
$ / token ≈ GPU_hourly_cost / tokens_per_hour_per_gpu
Reduce the numerator via spot or reserved pricing; boost the denominator via larger batches and the optimizations above. Rough anchor: AWS p4d.24xlarge (8× A100) ≈ $22/hr; H100 instances roughly 2× that. Quote a framework, not a final number.
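Plugging the anchor into the formula (the 2,000 tok/s aggregate node throughput is an assumed number, chosen only to show the arithmetic):

```python
# $/token framework: hourly hardware cost over hourly token output.

def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# p4d.24xlarge anchor from the text: ~$22/hr for the whole 8xA100 node.
print(round(cost_per_million_tokens(22.0, 2000), 2))  # 3.06
```

Doubling effective throughput via continuous batching halves $/token at fixed hardware cost, which is why the batching section and the cost model are the same argument.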
Anthropic-style follow-ups
(1) How do you avoid head-of-line blocking? → bucketing, chunked prefill, priority queues.
(2) What signal drives autoscaling? → token-queue depth beats raw GPU utilization.
(3) Release process? → warmup, canary, rollback thresholds, shadow traffic.
(4) Where does safety plug in? → input/output filters and policy routing; cite Constitutional AI as framing.