A37 · Design a Training Checkpoint Service
Verified source
Discussed in frontier-lab training blog posts (Anthropic, OpenAI, Meta); confirmed by onsite interview reports. Credibility: B.
Architecture
```mermaid
flowchart LR
  Trainer[GPU worker] --> SNAP[In-GPU snapshot]
  SNAP --> SHARD[Shard & compress]
  SHARD --> UPL[Async uploader]
  UPL --> OBJ[(Object store)]
  Master --> PLAN[Checkpoint plan]
  PLAN --> UPL
```
Key decisions
- **Snapshot to GPU RAM, then async upload**: training resumes within 1-2 s; the upload runs in the background.
- **Per-rank sharded writes** to the object store: ~10 k parallel PUTs with content-addressable keys.
- **Incremental/delta checkpoints** for optimiser state; a full checkpoint every M steps.
- **Fail-fast on corruption**: the manifest records a SHA per shard; restore validates each shard before loading.
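The snapshot-then-async-upload path and the SHA-per-shard manifest can be sketched together. This is a minimal illustration, not a real trainer's API: a `dict` stands in for the object store, `pickle` for shard serialization, and a thread for the async uploader; all class and key names are assumptions.

```python
import hashlib
import pickle
import threading

class CheckpointWriter:
    """Sketch: snapshot fast, upload shards in the background,
    and record a SHA-256 per shard for fail-fast restore."""

    def __init__(self, store):
        self.store = store  # dict stands in for the object store

    def save(self, step, state_shards):
        # "Snapshot": serialize each rank's shard up front, so training can
        # mutate its live state as soon as save() returns (the 1-2 s window).
        snapshot = [pickle.dumps(s) for s in state_shards]
        t = threading.Thread(target=self._upload, args=(step, snapshot))
        t.start()
        return t  # caller may join() before the next save if needed

    def _upload(self, step, snapshot):
        manifest = {}
        for rank, blob in enumerate(snapshot):
            digest = hashlib.sha256(blob).hexdigest()
            # Content-addressed key: per-rank shards PUT in parallel in reality.
            key = f"ckpt/{step}/shard-{rank}-{digest[:12]}"
            self.store[key] = blob          # stands in for an object-store PUT
            manifest[key] = digest
        self.store[f"ckpt/{step}/MANIFEST"] = pickle.dumps(manifest)

    def restore(self, step):
        manifest = pickle.loads(self.store[f"ckpt/{step}/MANIFEST"])
        shards = []
        for key, digest in manifest.items():
            blob = self.store[key]
            if hashlib.sha256(blob).hexdigest() != digest:
                raise IOError(f"corrupt shard {key}")  # fail fast, before load
            shards.append(pickle.loads(blob))
        return shards
```

The manifest is written last, so a restore never sees a checkpoint whose shards are still uploading; a missing manifest simply means "fall back to the previous step".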
Follow-ups
- Cross-region DR? Asynchronously replicate manifests, plus keep hot standby regions.
- Recovery time at 10 k GPUs? Parallel restore; bounded by object-store read throughput.
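The throughput bound on parallel restore lends itself to a back-of-envelope estimate. All the numbers below are illustrative assumptions, not measurements:

```python
def restore_time_s(ckpt_bytes, per_reader_bps, readers, store_aggregate_bps):
    """Parallel restore is bounded by whichever is smaller: the ranks'
    combined pull rate or the store's aggregate read throughput."""
    effective_bps = min(readers * per_reader_bps, store_aggregate_bps)
    return ckpt_bytes / effective_bps

# Hypothetical: a 10 TB checkpoint, 10,000 ranks each pulling 1 GB/s,
# against a store capped at 2 TB/s aggregate -> the store is the bottleneck.
t = restore_time_s(10e12, 1e9, 10_000, 2e12)  # 5.0 seconds
```

With these numbers the readers could collectively pull 10 TB/s, so the object store's 2 TB/s cap dominates, which is why the answer above cites read throughput rather than GPU count as the limit.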