xAI ★★ · Frequent · Hard · Distributed Training · GPU · Fault Tolerance

X5 · Design Training Orchestration for 100k+ GPU Colossus

Verified source

xAI publicly built Colossus (the Memphis supercluster) in 2024-25; the scale is reported in the xAI blog, the WSJ, and multiple engineer podcasts. Credibility: B.

Problem

Colossus has 100k+ H100 GPUs (scaling toward 1M). At that scale, an individual GPU fails every few minutes and a whole node fails every hour. Design a training orchestrator that maintains high MFU (model FLOPs utilization) despite constant hardware churn.
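The headline failure rate falls out of simple MTBF scaling. A rough sketch; the ~1-year effective per-GPU MTBF is an assumption for illustration (real fleet numbers depend on thermals, firmware, and workload), not a published xAI figure:

```python
# Back-of-envelope: with n independent components, the cluster-wide
# mean time between failures shrinks by a factor of n.
HOURS_PER_YEAR = 8760
gpu_mtbf_h = 1 * HOURS_PER_YEAR   # assumed ~1-year effective MTBF per GPU
n_gpus = 100_000

cluster_mtbf_min = gpu_mtbf_h / n_gpus * 60
print(f"one GPU failure every ~{cluster_mtbf_min:.1f} minutes")
```

At 100k GPUs that works out to roughly one failure every five minutes, which is why restart-the-world recovery is a non-starter at this scale.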

Architecture

flowchart TB
  SC[Scheduler] --> TP[Topology-aware placement]
  TP --> PG[Pipeline + tensor parallel groups]
  PG --> NODE1[Node 1..N]
  NODE1 --> MON[Health monitor]
  MON -->|fail| REC[Reconfig + hot-spare swap]
  REC --> CKPT[Checkpoint manager]
  CKPT --> S3[(Object store)]
  REC --> PG
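The fail → reconfig path in the diagram can be sketched as a tiny state machine. All names here are hypothetical; real elastic-training orchestrators (e.g. the patterns in NVIDIA NeMo and DeepSpeed) are far more involved:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Cluster:
    """Toy model of the health monitor's view: healthy nodes + hot spares."""
    healthy: set = field(default_factory=set)
    spares: set = field(default_factory=set)

    def fail(self, node: str) -> Optional[str]:
        """Swap a failed node for a hot spare; return the replacement,
        or None when the pool is exhausted (forcing a checkpoint restore)."""
        self.healthy.discard(node)
        if not self.spares:
            return None
        spare = self.spares.pop()
        self.healthy.add(spare)
        return spare

c = Cluster(healthy={"node-1", "node-2"}, spares={"spare-1"})
assert c.fail("node-2") == "spare-1"   # swapped in, job keeps running
assert c.fail("node-1") is None        # pool empty -> restore from checkpoint
```

The key design point the sketch captures: a failure first consumes a spare, and only an empty spare pool escalates to the checkpoint-restore path.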

Key decisions

  • 3D parallelism (tensor × pipeline × data) with topology-aware placement: tensor parallelism stays within an NVLink island; pipeline stages span the InfiniBand fabric.
  • Hot-spare pool of ~5% idle GPUs that can be swapped in when a node fails, avoiding a full-job restart.
  • Asynchronous checkpointing every 15 minutes to an NVMe tier, then staged to the object store; blocks training for <5 seconds.
  • Silent-data-corruption detection via periodic gradient-norm sanity checks plus cross-replica hash comparison.
  • Cooling and power at 100k scale are first-class system constraints: the scheduler must de-rate jobs during thermal events.
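The silent-data-corruption bullet can be made concrete. A minimal sketch, assuming data-parallel replicas produce bit-identical all-reduced gradient buffers (true for a deterministic ring all-reduce), so a lone divergent hash points at a corrupted rank; the function names are illustrative, not from any real orchestrator:

```python
import hashlib
import math

def grad_norm_ok(grad_norm: float, ema_norm: float, tol: float = 10.0) -> bool:
    """Sanity check: reject NaN/inf norms, or norms far above the running EMA."""
    return math.isfinite(grad_norm) and grad_norm <= tol * ema_norm

def replica_digest(grad_buffer: bytes) -> str:
    """Hash a replica's all-reduced gradient buffer for cross-replica comparison."""
    return hashlib.sha256(grad_buffer).hexdigest()

# Healthy step: every data-parallel replica holds the same reduced gradients,
# so all digests agree and no rank is flagged.
digests = [replica_digest(b"\x01\x02" * 8) for _ in range(4)]
suspect_ranks = [r for r, d in enumerate(digests) if d != digests[0]]
print("corrupted ranks:", suspect_ranks)   # -> []
```

The norm check catches gross corruption cheaply every step; the hash comparison is more expensive and runs periodically, which matches the "periodic" wording above.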

Follow-ups

  • What's the math for checkpoint size at 400B params? How fast can you restore?
  • How do you debug a run where loss diverges at step 50k?
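For the first follow-up, the arithmetic is straightforward. A sketch assuming mixed-precision Adam state at 14 bytes/param (bf16 weights + fp32 master weights, momentum, and variance) and an assumed aggregate NVMe read bandwidth; both figures are assumptions for illustration:

```python
# Checkpoint size: 400B params × 14 bytes/param of model + optimizer state.
PARAMS = 400e9
BYTES_PER_PARAM = 2 + 4 + 4 + 4            # bf16 weights + fp32 master/m/v
ckpt_bytes = PARAMS * BYTES_PER_PARAM      # 5.6e12 bytes ≈ 5.6 TB

# Restore time at an assumed 500 GB/s aggregate read bandwidth
# across the NVMe tier (sharded reads in parallel across nodes).
agg_bw = 500e9
restore_s = ckpt_bytes / agg_bw
print(f"checkpoint ≈ {ckpt_bytes/1e12:.1f} TB, restore ≈ {restore_s:.1f} s")
```

So a ~5.6 TB checkpoint restores in seconds only because it is sharded and read in parallel; pulling it through a single node would take orders of magnitude longer.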

Credibility note

Watch-out

The scale numbers are public; internal orchestration details are not. The answer above follows Megatron-LM / NVIDIA NeMo / DeepSpeed patterns that the xAI team is known to use.

Related study-guide topics