A23 · Handle 100K RPS for LLM Token Generation
Verified source
Prompt: "Handle 100K RPS for LLM Token-Generation." — Exponent. Credibility B/C.
Back-of-envelope estimate
- 100K RPS × ~1K output tokens per request on average ≈ 100M tokens/sec at peak.
- A single H100 can do ~5-10K tokens/sec for a 70B model, so 100M ÷ (5-10K) ≈ 10-20K H100s. That is clearly multi-datacenter scale.
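The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the function name `gpus_needed` is made up for illustration, and the inputs are the rough figures quoted in the bullets, not measured throughput.

```python
import math

def gpus_needed(rps: float, avg_output_tokens: float,
                tokens_per_sec_per_gpu: float) -> int:
    """GPUs required to sustain rps * avg_output_tokens generated tokens/sec."""
    total_tokens_per_sec = rps * avg_output_tokens
    return math.ceil(total_tokens_per_sec / tokens_per_sec_per_gpu)

RPS = 100_000              # peak requests per second (from the prompt)
AVG_OUTPUT_TOKENS = 1_000  # assumed average output length per request

# Bracket the estimate with the optimistic and pessimistic per-GPU figures.
low = gpus_needed(RPS, AVG_OUTPUT_TOKENS, 10_000)
high = gpus_needed(RPS, AVG_OUTPUT_TOKENS, 5_000)
print(f"{low:,} to {high:,} H100s")
```

The estimate counts output tokens only and leaves out prompt processing, redundancy, and headroom for regional failover, so a real fleet would likely need more than this floor.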
How to scale
- Horizontal replicas of the inference cluster, with sticky routing by conversation_id.
- Regional serving: route to the nearest DC; go cross-region only when capacity forces it.
- Autoscale signal: token-queue depth per model (not CPU or GPU utilization).
- Graceful degradation: under saturation, return 429 to the free tier, fall back to a smaller model for the paid tier, and keep the full model for Enterprise.
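The sticky-routing and degradation bullets above could be sketched roughly as follows. Everything here is illustrative: `Tier`, `pick_replica`, `admit`, and the model names passed in are hypothetical, and the plain hash-modulo scheme is an assumption chosen for brevity rather than a recommended production design.

```python
import hashlib
from enum import Enum
from typing import Optional, Sequence, Tuple

class Tier(Enum):
    FREE = "free"
    PAID = "paid"
    ENTERPRISE = "enterprise"

def pick_replica(conversation_id: str, replicas: Sequence[str]) -> str:
    """Sticky routing: a given conversation_id maps to the same replica
    for as long as the replica list is unchanged."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]

def admit(tier: Tier, saturated: bool,
          full_model: str, small_model: str) -> Tuple[int, Optional[str]]:
    """Return (http_status, model); model is None when the request is shed."""
    if not saturated:
        return 200, full_model
    if tier is Tier.FREE:
        return 429, None          # shed free-tier load first
    if tier is Tier.PAID:
        return 200, small_model   # degrade paid tier to the smaller model
    return 200, full_model        # Enterprise keeps the full model
```

Hash-modulo routing remaps most conversations whenever the replica list changes; consistent hashing or a session table would keep more conversations in place during scale events.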
For the micro-architecture of each inference cluster, see A11. This question is really about fleet-level orchestration.
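As one piece of that fleet-level orchestration, the queue-depth autoscale signal listed under "How to scale" could be turned into a replica target along these lines. This is a minimal sketch under stated assumptions: `desired_replicas`, the drain-time target, and the sample numbers are all hypothetical, and a real autoscaler would also smooth the signal and rate-limit scale-downs.

```python
import math

def desired_replicas(queued_tokens: int,
                     tokens_per_sec_per_replica: float,
                     target_drain_seconds: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Replicas needed to drain the current per-model token queue within
    target_drain_seconds, clamped to [min_replicas, max_replicas]."""
    capacity_per_replica = tokens_per_sec_per_replica * target_drain_seconds
    needed = math.ceil(queued_tokens / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))

# Hypothetical numbers: 3M queued tokens, 40K tokens/sec per replica,
# and a goal of draining the queue within 5 seconds.
target = desired_replicas(3_000_000, 40_000, 5, min_replicas=2, max_replicas=500)
print(target)
```

Scaling on queued tokens rather than utilization reflects the point made in the list: a saturated GPU looks the same at 100% utilization whether the backlog is small or large, while queue depth shows how far demand exceeds capacity.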