X1 · Design Grok's Inference Serving Stack
Verified source
xAI engineering blog + multiple posts on Colossus + Elon/Igor podcast interviews (2024-25). Credibility C.
Problem
Grok is embedded directly in X (formerly Twitter) and exposed via grok.com. Traffic is bursty (news events), global, and cost-sensitive given xAI's funding runway. Design the serving stack from API gateway → routing → inference → streaming.
Architecture

```mermaid
flowchart LR
    X[X clients / grok.com] --> GW[Global gateway]
    GW --> R[Region router]
    R --> Q[Token-aware queue]
    Q --> S[Grok inference pods]
    S --> KV[(KV cache)]
    S --> STR[SSE stream back]
    S --> TEL[Telemetry & cost]
```
Key decisions
- Continuous batching with token-aware scheduling (as in vLLM) to maximize GPU utilization on H100/H200 clusters.
- Separate prefill and decode pools: prefill is compute-bound, decode is memory-bandwidth-bound; mixing them hurts both.
- KV-cache offload to high-bandwidth CPU memory for long conversations; recompute on eviction.
- Speculative decoding with a small draft Grok model for a 2-3x decode-throughput gain.
- Multi-region active-active; route to the nearest healthy region, fail over on GPU outage.
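The token-aware queue in the diagram corresponds to the first decision above. A minimal sketch of vLLM-style continuous batching, assuming a single per-step token budget per GPU; the `Request`/`TokenAwareBatcher` names and the FIFO admission policy are illustrative, not xAI's implementation:

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: str
    prompt_tokens: int       # prefill cost when admitted
    generated_tokens: int = 0

class TokenAwareBatcher:
    """Toy continuous-batching scheduler: admit waiting requests into the
    running batch whenever the summed token footprint fits the budget,
    and evict requests as they finish (greatly simplified vs. vLLM)."""

    def __init__(self, token_budget: int):
        self.token_budget = token_budget   # max tokens one GPU step can hold
        self.waiting: deque = deque()
        self.running: list = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def _tokens_in_flight(self) -> int:
        return sum(r.prompt_tokens + r.generated_tokens for r in self.running)

    def schedule_step(self) -> list:
        # Admit FIFO-waiting requests while their prefill fits the budget;
        # whatever is running decodes one token each step (not modeled here).
        while self.waiting:
            nxt = self.waiting[0]
            if self._tokens_in_flight() + nxt.prompt_tokens > self.token_budget:
                break
            self.running.append(self.waiting.popleft())
        return self.running

    def complete(self, rid: str) -> None:
        self.running = [r for r in self.running if r.rid != rid]
```

New requests join mid-flight as older ones finish, which is what keeps GPU utilization high under bursty traffic.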
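The speculative-decoding bullet rests on a verify-and-accept loop. A toy sketch using exact greedy matching; real systems accept draft tokens probabilistically against the target model's distribution, and the function name and integer token IDs here are hypothetical:

```python
def verify_draft(draft_tokens: list, target_tokens: list) -> list:
    """One speculative-decoding step: a small draft model proposes
    draft_tokens; one parallel forward pass of the target model yields
    target_tokens for the same positions. Keep the longest matching
    prefix, then append the target's first disagreeing token, so each
    step emits at least one token the target model endorses."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # target's correction ends the step
            break
    return accepted
```

When most draft tokens are accepted, each target forward pass emits several tokens instead of one, which is where the claimed 2-3x decode speedup comes from.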
Follow-ups
- How do you handle a 10x traffic spike during a breaking-news event?
- Cost model: $/1M tokens at $3/hr per GPU; show the math.
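For the cost follow-up, the arithmetic is straightforward once you fix a throughput figure. The 1,500 output tokens/s per GPU below is an assumption for illustration only; xAI publishes no Grok throughput numbers:

```python
# Hypothetical inputs: GPU rental at $3/hr, sustained decode throughput
# of 1,500 output tokens/s per GPU under continuous batching (assumed).
gpu_cost_per_hour = 3.00
tokens_per_second = 1_500

tokens_per_hour = tokens_per_second * 3600            # 5,400,000 tokens/hr
cost_per_million = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million:.3f} per 1M output tokens")  # → $0.556 per 1M output tokens
```

The interesting part of the interview answer is the sensitivity: halving throughput (e.g. long contexts crowding the KV cache) doubles the per-token cost at fixed GPU pricing.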
Credibility note
Watch-out
xAI has not published detailed serving numbers. Answers are based on vLLM/SGLang community best practice plus Grok's publicly observable behavior (streaming speed on grok.com).