Anthropic ★★ Frequent · Hard · Checkpoint · Distributed Training · Storage

A37 · Design a Training Checkpoint Service

Verified source

Discussed in frontier-lab training write-ups (Anthropic, OpenAI, Meta); onsite interview reports confirm the question. Credibility: B.

Architecture

```mermaid
flowchart LR
  Trainer[GPU worker] --> SNAP[In-GPU snapshot]
  SNAP --> SHARD[Shard & compress]
  SHARD --> UPL[Async uploader]
  UPL --> OBJ[(Object store)]
  Master --> PLAN[Checkpoint plan]
  PLAN --> UPL
```
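The trainer-side half of the pipeline above can be sketched as a blocking in-memory snapshot followed by a background upload, so the training loop only pays for the copy. This is a minimal sketch: `put_fn` stands in for a hypothetical object-store PUT, and a plain dict plays the object store.

```python
import queue
import threading


class CheckpointUploader:
    """Sketch: snapshot state into RAM, then upload asynchronously."""

    def __init__(self, put_fn):
        self.put_fn = put_fn                      # hypothetical object-store PUT
        self.q = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def snapshot(self, step, state):
        # The blocking copy into host RAM is the only pause the trainer sees;
        # training resumes as soon as this returns.
        snap = {name: bytes(buf) for name, buf in state.items()}
        self.q.put((step, snap))                  # hand off to background upload

    def _drain(self):
        while True:
            step, snap = self.q.get()
            for name, blob in snap.items():
                self.put_fn(f"ckpt/{step}/{name}", blob)
            self.q.task_done()


store = {}                                        # stand-in for the object store
up = CheckpointUploader(store.__setitem__)
up.snapshot(100, {"rank0.bin": bytearray(b"weights")})
up.q.join()                                       # demo only; the trainer would not wait
```

In a real deployment the snapshot would copy GPU tensors to pinned host memory and the uploader would run many concurrent PUTs, but the hand-off structure is the same.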

Key decisions

  • **Snapshot to GPU RAM, then async upload**: training resumes within 1-2 s; the upload runs in the background.
  • **Per-rank sharded writes** to the object store: ~10 k parallel PUTs under content-addressable keys.
  • **Incremental/delta checkpoints** for optimiser state, with a full checkpoint every M steps.
  • **Fail-fast on corruption**: the manifest records a SHA per shard; restore validates every digest before loading.
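The sharded-write and fail-fast decisions above can be sketched together: each rank's shard is stored under a key derived from its SHA-256 (content-addressed, so identical shards dedupe), the manifest records key and digest per shard, and restore re-hashes every shard before handing anything to the loader. Function names and the dict-backed store are illustrative, not from the source.

```python
import hashlib
import json


def write_sharded(step, shards, store):
    """Write per-rank shards under content-addressable keys, plus a manifest."""
    manifest = {"step": step, "shards": {}}
    for rank, blob in shards.items():
        digest = hashlib.sha256(blob).hexdigest()
        key = f"shards/{digest}"                  # content-addressed key
        store[key] = blob                         # each rank PUTs in parallel in practice
        manifest["shards"][str(rank)] = {"key": key, "sha256": digest}
    store[f"manifests/{step}.json"] = json.dumps(manifest).encode()
    return manifest


def restore(step, store):
    """Fail fast: validate every shard digest before loading any of it."""
    manifest = json.loads(store[f"manifests/{step}.json"])
    out = {}
    for rank, meta in manifest["shards"].items():
        blob = store[meta["key"]]
        if hashlib.sha256(blob).hexdigest() != meta["sha256"]:
            raise IOError(f"corrupt shard for rank {rank} at step {step}")
        out[int(rank)] = blob
    return out


store = {}                                        # stand-in for the object store
write_sharded(500, {0: b"w0", 1: b"w1"}, store)
restored = restore(500, store)
```

Validating before load means a corrupt shard aborts the restore immediately instead of poisoning a resumed run.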

Follow-ups

  • Cross-region DR? Asynchronously replicate manifests (and shard data) to hot standby regions.
  • Recovery time at 10 k GPUs? Parallel per-rank restore, bounded by aggregate object-store read throughput.
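The throughput-bound claim in the last follow-up is easy to make concrete with a back-of-envelope calculation. The numbers below are assumptions for illustration, not figures from the source.

```python
# Assumed: 10,000 ranks, ~2 GB of checkpoint state per rank, and an object
# store sustaining ~1 TB/s aggregate read throughput across the fleet.
ranks = 10_000
bytes_per_rank = 2 * 10**9
aggregate_read_bps = 1 * 10**12

restore_seconds = ranks * bytes_per_rank / aggregate_read_bps
# 20 TB / (1 TB/s) -> ~20 s of pure read time; manifest fan-out and
# per-shard validation add overhead on top of this floor.
```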

Related study-guide topics