Anthropic ★★ Frequent Hard Throughput Scale

A23 · Handle 100K RPS for LLM Token Generation

Prompt: "Handle 100K RPS for LLM Token-Generation." — Exponent. Credibility B/C.

Back-of-envelope estimate

100K RPS × ~1K average output tokens per request ≈ 100M tokens/sec (peak).
A single H100 can do ~5–10K tokens/sec for a 70B model → need ~10–20K H100s. Clearly multi-datacenter scale.
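The arithmetic above can be checked with a short script. This is only a sketch of the estimate: the per-GPU throughput range is the card's own assumption, the function name `gpus_needed` is illustrative, and no headroom, batching effects, or prefill cost are modelled.

```python
import math

def gpus_needed(rps: int, avg_output_tokens: int, tokens_per_sec_per_gpu: int) -> int:
    """GPUs required to sustain the aggregate decode rate, with zero headroom."""
    total_tokens_per_sec = rps * avg_output_tokens
    return math.ceil(total_tokens_per_sec / tokens_per_sec_per_gpu)

RPS = 100_000              # peak requests per second (from the prompt)
AVG_OUTPUT_TOKENS = 1_000  # assumed average output length per request

total = RPS * AVG_OUTPUT_TOKENS
low = gpus_needed(RPS, AVG_OUTPUT_TOKENS, 10_000)   # optimistic per-GPU throughput
high = gpus_needed(RPS, AVG_OUTPUT_TOKENS, 5_000)   # pessimistic per-GPU throughput

print(f"{total:,} tokens/sec -> {low:,} to {high:,} H100s")
```

In an interview it is worth saying out loud that the real fleet would be larger than this floor, since the estimate leaves no spare capacity for failures or regional imbalance.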

Horizontal replicas of the inference cluster, with sticky sessions keyed by conversation_id.
Regional serving: route to the nearest DC; go cross-region only when capacity forces it.
Autoscale signal: token-queue depth per model (not CPU or GPU utilization).
Graceful degradation: under saturation, return 429 for the free tier, fall back to a smaller model for the paid tier, and keep the full model for Enterprise.
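Three of these fleet-level policies can be sketched in a few lines. Everything here is hypothetical: the function names, the tier labels, and the `target_drain_seconds` parameter are illustrative, and rendezvous hashing is just one common way to get sticky routing, not something the card prescribes.

```python
import hashlib
import math
from dataclasses import dataclass
from typing import Optional, Sequence

def pick_replica(conversation_id: str, replicas: Sequence[str]) -> str:
    """Sticky routing via rendezvous hashing: a conversation maps to the
    same replica for as long as that replica stays in the set."""
    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{replica}:{conversation_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(replicas, key=score)

def desired_replicas(queued_tokens: int,
                     tokens_per_sec_per_replica: float,
                     target_drain_seconds: float) -> int:
    """Autoscale on token-queue depth: enough replicas to drain the
    current backlog within the target time (assumed policy)."""
    capacity_per_replica = tokens_per_sec_per_replica * target_drain_seconds
    return max(1, math.ceil(queued_tokens / capacity_per_replica))

@dataclass
class Decision:
    status: int            # HTTP status returned to the caller
    model: Optional[str]   # model that serves the request, if any

def degrade(tier: str, saturated: bool, full_model: str, small_model: str) -> Decision:
    """Tiered degradation: Enterprise keeps the full model, paid falls
    back to a smaller one, free is shed with 429."""
    if not saturated or tier == "enterprise":
        return Decision(200, full_model)
    if tier == "paid":
        return Decision(200, small_model)
    return Decision(429, None)

# Usage with made-up replica and model names.
replicas = ["dc1-a", "dc1-b", "dc1-c"]
print(pick_replica("conv-42", replicas))
print(desired_replicas(600_000, 5_000, 10))
print(degrade("free", True, "full-70b", "small-8b"))
```

Scaling on queued tokens rather than request count reflects that request cost varies widely with output length; the drain-time target is one simple way to turn that backlog into a replica count.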

For the micro-architecture of each inference cluster, see A11. This question is really about fleet-level orchestration.