Why LLM serving is different
Classic web serving is "compute small response, return". LLM serving is "stream thousands of autoregressive tokens, each requiring a full forward pass through 70B parameters, while sharing a GPU with dozens of other requests". The constraints:
- Memory-bound decode. Generating one token touches every parameter. For Llama-70B fp16 = 140 GB weights. On an H100 (HBM 3.35 TB/s), one forward pass reads 140 GB / 3350 GB/s ≈ 42 ms — that is your theoretical lower bound per token on a single GPU. You need tensor parallelism across multiple GPUs to get faster.
- KV cache explosion. For each in-flight request, you keep the attention Key/Value tensors for every prior token: kv_bytes ≈ 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes (kv_heads, not query heads, since Llama-70B uses grouped-query attention). Llama-70B at 4096 tokens ≈ 1.3 GB per request; an 80 GB H100 fits only ~50 concurrent requests at that context.
- TTFT vs ITL trade-off. Time-to-first-token (TTFT) depends on prompt length (prefill). Inter-token latency (ITL) depends on decode speed. A chat app cares about both; a batch summarizer cares only about throughput.
Interview number to memorize: a 7B model in fp16 is ~14 GB; 13B ≈ 26 GB; 70B ≈ 140 GB; 405B ≈ 810 GB. Quantization (fp8, int4) cuts this by 2–4×.
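The numbers above fall out of three one-line formulas. A minimal sizing sketch (helper names are illustrative; geometry assumes a Llama-70B-like model: 80 layers, 8 KV heads under GQA, head_dim 128, fp16):

```python
# Back-of-envelope sizing for LLM serving. All constants are from the text
# above; the function names are illustrative, not any real library's API.

def weights_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB for `params_b` billion parameters (fp16 default)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, dtype_bytes: int = 2) -> float:
    """KV cache per request: 2 (K and V) * layers * kv_heads * head_dim * seq_len."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes / 1e9

def decode_floor_ms(weights: float, hbm_tb_s: float = 3.35) -> float:
    """Bandwidth-bound lower bound per decode step: read every weight once."""
    return weights / (hbm_tb_s * 1000) * 1000  # GB / (GB/s) -> s -> ms

w = weights_gb(70)                  # ~140 GB of weights
kv = kv_cache_gb(80, 8, 128, 4096)  # ~1.34 GB of KV per 4k-token request
floor = decode_floor_ms(w)          # ~42 ms/token on one H100, as above
ceiling = int(80 // kv)             # ~59 requests per 80 GB, before weights/headroom
```

With weight memory and scheduler headroom subtracted, the ~59 raw ceiling drops to roughly the ~50 concurrent requests quoted above.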
Source cross-reference
Read the vLLM paper (Kwon et al., 2023, "Efficient Memory Management for Large Language Model Serving with PagedAttention"), FlashAttention (Dao 2022), and Speculative Decoding (Leviathan et al. 2023, "Fast Inference from Transformers via Speculative Decoding"). These three papers explain 80% of every modern inference server.
Prefill vs decode: two different workloads
A single generation is two phases:
- Prefill: consume the prompt in one parallel forward pass. Compute-bound (matmuls over the whole prompt); dominates TTFT. Latency scales roughly linearly with prompt length.
- Decode: generate output tokens one at a time. Memory-bound (each step reads the full weight matrix to produce one token). Dominates total latency for long outputs.
These phases have opposite optimal batch sizes. Prefill saturates the GPU at batch 1 (the prompt itself is already a big matmul). Decode needs batch >16 to saturate memory bandwidth efficiently. Mixing them naively wastes both. Two solutions in modern servers:
- Disaggregated prefill/decode (DistServe, TensorRT-LLM): separate GPU pools for prefill and decode. KV cache transferred over NVLink/IB once. Improves both TTFT and ITL at the cost of operational complexity.
- Chunked prefill (vLLM, SGLang): split a long prompt into chunks and interleave with decode batches. One pool, no transfer, but harder scheduling. Sarathi-Serve paper shows 2–3× higher throughput than naive schedulers.
```mermaid
flowchart LR
    R[Request] --> S[Scheduler]
    S -->|prompt| P[Prefill GPU<br/>compute-bound]
    P -->|KV cache| D[Decode GPU pool<br/>continuous batching]
    D --> T[Token stream]
    T --> C[Client SSE]
```
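The chunked-prefill idea reduces to a per-step token budget: running decodes get one token each, and whatever budget remains goes to the head-of-queue prompt as a chunk. A scheduling sketch (illustrative, not vLLM's actual scheduler):

```python
# Chunked-prefill scheduling sketch. Each engine step has a fixed token
# budget; decodes are served first (one token per running request), and the
# leftover budget becomes a prefill chunk, so long prompts never stall decode.
from collections import deque

def schedule_step(decodes: int, prefill_queue: deque, token_budget: int = 512):
    """Return (decode_tokens, prefill_chunk) for one engine step."""
    decode_tokens = min(decodes, token_budget)   # one token per running request
    leftover = token_budget - decode_tokens
    chunk = 0
    if leftover > 0 and prefill_queue:
        remaining = prefill_queue[0]             # tokens of prompt still unprefilled
        chunk = min(remaining, leftover)
        if remaining == chunk:
            prefill_queue.popleft()              # prompt fully prefilled
        else:
            prefill_queue[0] = remaining - chunk
    return decode_tokens, chunk

q = deque([2000])                 # one 2000-token prompt waiting
step = schedule_step(100, q)      # 100 decode tokens + a 412-token chunk
```

With 100 running decodes and a 512-token budget, each step advances the 2000-token prompt by 412 tokens instead of blocking all decodes for a full-prompt prefill.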
KV cache, PagedAttention, prefix caching
The KV cache is the single largest lever in LLM serving. Three techniques matter:
PagedAttention (vLLM)
Classic KV cache allocates one contiguous buffer per request sized to max_seq_len. Result: 60–80% of that memory is wasted on padding. PagedAttention, modeled on OS virtual memory, breaks KV into 16-token pages. A request holds a page table; pages are allocated on demand. vLLM's paper reports 2–4× higher throughput vs HuggingFace pipelines at equal latency.
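The bookkeeping is exactly an OS page table: a request maps logical token positions to physical pages, allocating a page only when the previous one fills. A sketch (assumes the 16-token pages above; names are illustrative, not vLLM internals):

```python
# PagedAttention bookkeeping sketch: on-demand 16-token pages drawn from a
# shared physical pool, so no request reserves max_seq_len up front.
PAGE_TOKENS = 16

class PageTable:
    def __init__(self, free_pool: list):
        self.free_pool = free_pool      # shared pool of physical page ids
        self.pages = []                 # this request's logical -> physical map
        self.used_tokens = 0

    def append_token(self) -> None:
        """Allocate a new physical page only when the current page is full."""
        if self.used_tokens % PAGE_TOKENS == 0:
            self.pages.append(self.free_pool.pop())
        self.used_tokens += 1

pool = list(range(100))
req = PageTable(pool)
for _ in range(33):                     # 33 tokens -> ceil(33/16) = 3 pages
    req.append_token()
assert len(req.pages) == 3              # vs. a max_seq_len-sized buffer
```

Waste per request is bounded by one partial page (< 16 tokens) instead of max_seq_len minus actual length.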
Prefix caching
If many requests share a long system prompt (think 2000-token Claude system prompt), you can cache that prefix's KV once. Subsequent requests skip that prefill work entirely. Anthropic explicitly exposes this as prompt caching: the first request pays full cost, subsequent requests within a 5-minute window pay ~10% for the cached prefix. OpenAI has similar automatic prefix caching for long system prompts.
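Mechanically, prefix caching is a lookup keyed on the prefix token ids; a hit returns the stored KV and skips that prefill. A toy sketch (real engines match page-sized chunks via radix trees, e.g. SGLang's RadixAttention; the handle here is a stand-in):

```python
# Prefix-cache sketch: key the stored KV by the exact prefix token ids.
# First request pays full prefill; identical prefixes afterwards pay zero.
cache = {}                             # prefix tokens -> KV handle (stand-in)

def prefill(tokens: list) -> tuple:
    """Return (kv_handle, tokens_actually_prefilled)."""
    key = tuple(tokens)
    if key in cache:
        return cache[key], 0           # hit: reuse KV, skip prefill entirely
    handle = f"kv-{len(cache)}"        # pretend we ran the real prefill here
    cache[key] = handle
    return handle, len(tokens)

system = list(range(2000))             # a 2000-token shared system prompt
_, cost1 = prefill(system)             # pays 2000 tokens of prefill
_, cost2 = prefill(system)             # pays 0
```

This is why the billing model works: the provider can meter exactly the tokens that were actually prefilled.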
Continuous batching
Static batching (wait until the batch is full, generate all together) stalls: a short request finishes in 3 tokens while the batch must wait for 1000-token companions. Continuous batching (Orca, vLLM) admits new requests every step and evicts finished ones mid-batch. Throughput improves 2–3× over static batching at the same latency.
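The admit/evict loop can be sketched in a few lines (illustrative; a real scheduler also checks KV-memory headroom before admitting):

```python
# Continuous batching sketch: admit waiting requests and evict finished
# ones at every decode step, instead of waiting for the whole batch.

def run(waiting: list, max_batch: int = 4) -> int:
    """Each request is its remaining token count; returns total decode steps."""
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.pop(0))           # admit mid-flight
        running = [r - 1 for r in running if r > 1]  # decode one step; evict done
        steps += 1
    return steps

# Four 3-token requests finish and free their slots after 3 steps,
# then the 1000-token request runs: 1003 steps total. Under static
# batching the short requests would have waited out all 1000 steps.
total = run([3, 3, 3, 3, 1000], max_batch=4)
```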
FlashAttention, speculative decoding, quantization
FlashAttention (Dao 2022, v2 2023, v3 2024)
Standard attention materializes an N×N matrix — at N=8k, that's 256 MB per head. FlashAttention tiles the computation to keep it in SRAM, fusing softmax, matmul, and masking in one kernel. Result: 2–4× faster attention, linear memory in sequence length. Every production engine uses it (vLLM, TensorRT-LLM, xFormers, SGLang).
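The enabling trick is online softmax: process K/V in tiles, keeping only a running max, running denominator, and rescaled accumulator, so the N-length score row is never materialized. A NumPy sketch of the algorithm for a single query (numerically faithful, but not a fused GPU kernel):

```python
# Online-softmax tiling, the core idea behind FlashAttention, in NumPy.
import numpy as np

def flash_attention(q, K, V, tile=64):
    """Attention output for one query without materializing all N scores."""
    m = -np.inf                                  # running max of scores
    l = 0.0                                      # running softmax denominator
    acc = np.zeros(V.shape[1], dtype=np.float64) # running weighted sum of V
    for i in range(0, len(K), tile):
        s = K[i:i + tile] @ q                    # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                # rescale old state to new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i + tile]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.normal(size=16)
K = rng.normal(size=(256, 16)); V = rng.normal(size=(256, 16))
ref = np.exp(K @ q - (K @ q).max()); ref /= ref.sum()   # naive softmax
assert np.allclose(flash_attention(q, K, V), ref @ V)
```

On a GPU the tiles live in SRAM and the loop is fused into one kernel; the math is identical.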
Speculative decoding (Leviathan 2023)
A small "draft" model (e.g., Llama-8B) generates K=5 candidate tokens; the large target model (Llama-70B) verifies them in one forward pass. Expected acceptance rate ~70%, yielding 2–3× decode speedup with no quality loss. Variants: Medusa (multi-head draft), EAGLE, lookahead decoding.
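The acceptance logic, in its simplest greedy-verification form: keep draft tokens until the first position where the target disagrees, then take the target's token there. (Leviathan et al.'s scheme does probabilistic rejection sampling over the two distributions; this greedy variant is the easy-to-whiteboard version.)

```python
# Speculative decoding acceptance sketch, greedy-verification variant.
# One target forward pass scores all K draft positions at once; every
# accepted draft token is a decode step the big model didn't pay for.

def verify(draft_tokens: list, target_tokens: list) -> list:
    """target_tokens[i] = token the target model would emit at position i."""
    out = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            out.append(d)          # draft was right: accepted for free
        else:
            out.append(t)          # first miss: take target's token, stop
            break
    return out

# Draft guessed 5 tokens; target agrees on the first 3, so one target
# forward pass yields 4 tokens instead of 1.
accepted = verify([7, 8, 9, 1, 2], [7, 8, 9, 4, 5])
```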
Quantization
- fp8 (H100 native): 2× vs fp16, tiny quality drop for most models.
- int4 (GPTQ, AWQ): 4× memory reduction, 1–2% benchmark loss; used by most open-weights serving (llama.cpp, Exllama).
- int8 with smoothquant: popular for latency-critical smaller models.
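All of these reduce to storing weights at low precision with per-channel scales. A minimal symmetric int8 sketch of the underlying idea (GPTQ/AWQ/SmoothQuant add calibration and outlier handling on top):

```python
# Symmetric per-channel int8 quantization sketch: one fp scale per output
# channel, weights stored as int8 (4x smaller than fp32, 2x vs fp16).
import numpy as np

def quantize(w: np.ndarray):
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # per-row scale
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=(4, 64)).astype(np.float32)
q, s = quantize(w)
err = np.abs(dequantize(q, s) - w).max()    # bounded by scale / 2 per entry
assert q.dtype == np.int8 and err < 0.05
```

The rounding error per weight is at most half a scale step, which is why int8 is usually harmless and int4 (16 levels instead of 255) needs the smarter calibration the named schemes provide.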
OpenAI-specific
OpenAI runs GPT-4-class models on clusters of H100s with custom kernels likely derived from FlashAttention-3 and TensorRT-LLM. Observable behaviors: strong prefix caching (cached system prompts run ~50% faster), aggressive streaming, and per-request timeouts. Their per-token pricing reflects the prefill vs decode cost split.
Anthropic-specific
Anthropic's API surfaces prompt caching as a first-class feature with explicit cache-control blocks in the Messages API. Cached tokens are billed at ~10% of input rate. This is unique — OpenAI caches but doesn't give you control over cache breakpoints. In interviews, mention that Anthropic's public cache-hit rate on long Claude Code system prompts exceeds 90%.
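The request shape looks roughly like the following (a sketch based on Anthropic's documented cache_control content blocks; the model id and prompt text are placeholders, so check the current API docs for exact fields):

```python
# Sketch of a Messages API body with an explicit prompt-cache breakpoint.
# Everything up to and including the block carrying cache_control is
# cached; later requests with an identical prefix hit the cache.
request_body = {
    "model": "<model-id>",                 # fill in a real model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<long shared system prompt>",
            "cache_control": {"type": "ephemeral"},  # cache breakpoint here
        }
    ],
    "messages": [{"role": "user", "content": "hello"}],
}
```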
Scheduling, autoscaling, failure modes
Scheduling signals
You do not autoscale LLM inference on GPU utilization — an idle GPU with 50 queued requests looks "90% utilized" but is starved. Right signals:
- Queue depth (requests waiting for prefill).
- TTFT p95 — scale up when it exceeds SLO (e.g., >1s).
- KV cache utilization — when running hot, reject or queue new requests.
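The three signals combine into a small decision function; a sketch using the thresholds from this section (the SLO numbers are examples, not universal constants):

```python
# Autoscaling decision sketch driven by the right signals: queue depth,
# TTFT p95, and KV-cache utilization. GPU utilization is deliberately
# absent because memory-bound decode makes it lie.

def scale_decision(queue_depth: int, ttft_p95_s: float, kv_util: float) -> str:
    if kv_util > 0.95:
        return "shed"          # KV cache nearly full: reject/queue new work
    if queue_depth > 20 or ttft_p95_s > 1.0:
        return "scale_up"      # SLO breach (e.g., TTFT p95 > 1s from the text)
    if queue_depth == 0 and ttft_p95_s < 0.2:
        return "scale_down"    # idle headroom: shrink the pool
    return "hold"
```

Note the ordering: admission control (shed) wins over scaling up, because new replicas take minutes to load weights while the KV cache fills in seconds.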
Architecture
```mermaid
flowchart LR
    C[Client] --> LB[Global LB]
    LB --> RT[Router<br/>model + region]
    RT --> Q[Priority queue<br/>by tier]
    Q --> SCH[Scheduler<br/>continuous batching]
    SCH --> P1[Prefill pool H100x8]
    SCH --> D1[Decode pool H100x8]
    P1 -->|KV page table| D1
    D1 --> SSE[Token streamer]
    SSE --> C
    M[Metrics: TTFT, ITL, QD] --> A[Autoscaler]
    A --> P1
    A --> D1
```
Failure modes and anti-patterns
Anti-patterns
- Static batching in a chat product. You'll eat 10× the latency of your shortest request.
- Autoscale on GPU util. Memory-bound decode pegs HBM but not SM; metric lies.
- Unbounded max_tokens. A user sets max_tokens=32k, holds KV cache for 5 minutes, starves everyone else. Cap and enforce.
- Shared prefill+decode pool with naive scheduling. Long prompts block decode steps; TTFT and ITL both degrade.
- No draining on deploy. Killing a pod mid-stream drops hundreds of open SSE connections; use lame-duck mode.
Systems to name in interviews: vLLM (open-source, PagedAttention), TensorRT-LLM (NVIDIA, fastest on H100), SGLang (structured gen + RadixAttention for prefix sharing), TGI (HuggingFace), DeepSpeed-MII, Ray Serve + vLLM (the common startup stack).
Interview checklist
- Clarify: model size, max context, target TTFT, target throughput, streaming or batch.
- Back-of-envelope: weights memory, KV cache per request, concurrent request ceiling.
- Pick engine (vLLM default; TensorRT-LLM if latency-critical).
- Continuous batching + PagedAttention + FlashAttention (all three).
- Prefix caching / prompt caching for repeated system prompts.
- Disaggregate or chunked-prefill for large prompt mix.
- Quantization if memory-bound (fp8 or int4/AWQ).
- Speculative decoding for quality-matched speedup.
- Autoscale on queue depth + TTFT p95, not GPU util.
- Per-request cap on max_tokens; lame-duck on deploy; SSE backpressure.