X5 · Design Training Orchestration for 100k+ GPU Colossus
Verified source
xAI publicly built Colossus (Memphis supercluster) in 2024-25; scale reported in xAI blog + WSJ + multiple engineer podcasts. Credibility B.
Problem
Colossus has 100k+ H100 GPUs (scaling toward 1M). At that scale, an individual GPU fails every few minutes and a node fails roughly every hour. Design a training orchestrator that maintains high MFU (model FLOPs utilization) despite constant hardware churn.
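The "fails every few minutes" claim follows directly from fleet size. A quick sanity check, assuming an illustrative per-GPU MTBF of about one year (the real figure is not public) and independent, exponentially distributed failures:

```python
# Back-of-envelope fleet failure rate.
# Assumptions (not published numbers): per-GPU MTBF ~1 year,
# failures independent, so fleet MTBF = per-GPU MTBF / fleet size.
GPUS = 100_000
MTBF_GPU_HOURS = 365 * 24  # assumed per-GPU MTBF: 1 year

fleet_mtbf_minutes = MTBF_GPU_HOURS * 60 / GPUS
print(f"Expected time between GPU failures: {fleet_mtbf_minutes:.1f} min")
# -> ~5.3 min, consistent with "a GPU fails every few minutes"
```

Even doubling the assumed per-GPU MTBF only stretches the fleet-level interval to ~10 minutes, so the orchestrator must treat failure as the steady state, not an exception.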
Architecture

```mermaid
flowchart TB
    SC[Scheduler] --> TP[Topology-aware placement]
    TP --> PG[Pipeline + tensor parallel groups]
    PG --> NODE1[Node 1..N]
    NODE1 --> MON[Health monitor]
    MON -->|fail| REC[Reconfig + hot-spare swap]
    REC --> CKPT[Checkpoint manager]
    CKPT --> S3[(Object store)]
    REC --> PG
```
Key decisions
- 3D parallelism (tensor × pipeline × data) with topology-aware placement: tensor parallel stays within an NVLink island; pipeline stages span the InfiniBand fabric.
- Hot-spare pool of ~5% idle GPUs that can be swapped in when a node fails, avoiding a full-job restart.
- Asynchronous checkpointing every 15 minutes to an NVMe tier, then staged to object store; blocks training for <5 seconds.
- Silent-data-corruption detection via periodic gradient-norm sanity checks plus cross-replica hash comparison.
- Cooling and power at 100k scale are first-class system constraints: the scheduler must de-rate jobs during thermal events.
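The topology-aware placement decision can be made concrete with a rank-to-coordinates mapping. This is a sketch in the Megatron-LM style the note already cites, with the group sizes (`TP`, `PP`, `DP`) as illustrative assumptions, not xAI's actual configuration; the key idea is that tensor parallel is the fastest-varying dimension, so each TP group occupies consecutive ranks inside one NVLink island:

```python
# Sketch of 3D-parallel rank grouping (illustrative sizes, not
# xAI's real config). TP innermost => each tensor-parallel group
# lands on consecutive ranks, i.e., one node / NVLink island.
TP, PP, DP = 8, 16, 64  # tensor x pipeline x data = 8192 GPUs


def rank_to_coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp) coordinates, TP fastest-varying."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp


# All 8 ranks of the first TP group are consecutive -> same node.
tp_group_0 = [r for r in range(TP * PP * DP) if rank_to_coords(r)[:2] == (0, 0)]
print(tp_group_0)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Pipeline groups then stride across nodes over InfiniBand, where the lower bandwidth only carries activations at stage boundaries rather than per-layer all-reduces.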
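The silent-data-corruption bullet combines two independent signals, which can be sketched as follows. The thresholds and helper names here are hypothetical illustrations, not a real framework API; the point is that a gradient-norm outlier check catches gross corruption cheaply, while a cross-replica digest catches bit-level divergence that data-parallel replicas should never exhibit after all-reduce:

```python
import hashlib
import math

# Sketch of SDC checks (hypothetical helpers; threshold k is an assumption).

def grad_norm_ok(grad_norm: float, history: list[float], k: float = 5.0) -> bool:
    """Flag a step whose gradient norm deviates wildly from recent history."""
    if len(history) < 10:  # not enough history to judge yet
        return True
    mean = sum(history) / len(history)
    std = math.sqrt(sum((g - mean) ** 2 for g in history) / len(history))
    return abs(grad_norm - mean) <= k * max(std, 1e-8)


def replica_digest(grads: list[float]) -> str:
    """Hash a replica's reduced gradients. After all-reduce, data-parallel
    replicas should agree bit-for-bit; a mismatch implicates SDC on one rank."""
    return hashlib.sha256(
        b"".join(float(g).hex().encode() for g in grads)
    ).hexdigest()


assert replica_digest([0.1, 0.2]) == replica_digest([0.1, 0.2])
assert not grad_norm_ok(100.0, [1.0] * 20)  # 100x spike -> flagged
```

Running the digest comparison only every N steps keeps the overhead negligible while still bounding how far a corrupted replica can drift before detection.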
Follow-ups
- What's the math for checkpoint size at 400B params? How fast can you restore?
- How do you debug a run where loss diverges at step 50k?
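The first follow-up has a standard back-of-envelope answer. Assuming mixed-precision Adam (bf16 weights plus fp32 master weights, momentum, and variance, ~14 bytes per parameter; actual layouts vary by framework) and an assumed aggregate sharded-read bandwidth:

```python
# Checkpoint-size arithmetic for a 400B-parameter model.
# Assumption: mixed-precision Adam state layout,
# bf16 weights (2 B) + fp32 master (4 B) + momentum (4 B) + variance (4 B).
PARAMS = 400e9
BYTES_PER_PARAM = 2 + 4 + 4 + 4  # = 14 B/param

ckpt_tb = PARAMS * BYTES_PER_PARAM / 1e12
print(f"Checkpoint: ~{ckpt_tb:.1f} TB")  # ~5.6 TB

# Restore time if sharded parallel reads reach 1 TB/s aggregate (assumption):
restore_s = ckpt_tb / 1.0
print(f"Restore: ~{restore_s:.0f} s at 1 TB/s aggregate")
```

This is why the design stages checkpoints through a local NVMe tier: thousands of ranks reading their own shards in parallel can plausibly sustain TB/s-class aggregate bandwidth, whereas restoring ~5.6 TB through a single object-store path would take far longer.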
Credibility note
Watch-out
Scale numbers are public; internal orchestration details are not. The answer above follows Megatron-LM / NVIDIA NeMo / DeepSpeed patterns that the xAI team is known to use.