OpenAI ★★ Frequent · Staff-level · 3D Parallelism · ZeRO · Checkpoint

O19 · Distributed ML Training Platform

Verified source

Prompt: "Design a Distributed ML Training Platform." — Hello Interview (Staff-level). Credibility D.

Clarify scope

  • Target model sizes? 1B vs 100B parameters pull toward wildly different architectures.
  • Users: internal researchers or external customers?
  • Hardware: homogeneous GPU cluster, or multi-region with mixed GPU types?

Three orthogonal parallelism strategies

  • Data Parallel (DP): replicate the model, shard the data, AllReduce gradients. ZeRO-1/2/3 progressively shards optimizer states, then gradients, then parameters.
  • Tensor Parallel (TP): split weight matrices within a layer. Needs very high bandwidth, so keep it inside an NVLink domain.
  • Pipeline Parallel (PP): split layers across stages and feed micro-batches to fill the pipe. Keeping the bubble ratio under 25% needs micro-batches > 4× the stage count.
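The memory and bubble claims above can be sanity-checked with back-of-envelope formulas. A minimal sketch, assuming mixed-precision Adam (2 B fp16 weights + 2 B fp16 grads + 12 B fp32 optimizer state per parameter) and the GPipe-style bubble fraction (p − 1)/(m + p − 1); the function names are illustrative, not from any framework:

```python
def zero_memory_gb(params_b: float, dp: int, stage: int) -> float:
    """Per-GPU training-state memory (GB) for a params_b-billion-param
    model under mixed-precision Adam. ZeRO-1 shards optimizer state
    across the DP group, ZeRO-2 also gradients, ZeRO-3 also parameters."""
    p, g, o = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        o /= dp
    if stage >= 2:
        g /= dp
    if stage >= 3:
        p /= dp
    return params_b * (p + g + o)

def bubble_fraction(stages: int, micro_batches: int) -> float:
    """GPipe pipeline bubble ratio: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)

# 7B model, DP=64: ZeRO-3 cuts 112 GB/GPU of training state to 1.75 GB.
print(zero_memory_gb(7, 64, 0))            # -> 112.0
print(zero_memory_gb(7, 64, 3))            # -> 1.75
# 8 stages with 32 micro-batches (4x): bubble ~18%, under the 25% target.
print(round(bubble_fraction(8, 32), 3))    # -> 0.179
```

Activation memory is deliberately excluded; it depends on batch size, sequence length, and activation checkpointing, not on the ZeRO stage.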

Production recipe (3D parallelism)

Intra-node TP (NVLink) + PP across nodes + DP across replicas. Meta's Llama 3 training uses topology-aware communication patterns on this layout.
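The layout logic reduces to how global ranks map onto the three axes: a sketch, assuming TP varies fastest so that, with ranks packed node by node, each TP group lands inside one node's NVLink domain (the function name and sizes are illustrative):

```python
def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp_index, pp_index, tp_index), TP innermost.

    Adjacent ranks differ only in their TP index, so packing ranks
    node-by-node keeps every TP group inside one NVLink domain, while
    PP hops cross nodes and DP spans whole replicas.
    """
    assert 0 <= rank < tp * pp * dp
    return (rank // (tp * pp), (rank // tp) % pp, rank % tp)

# 64 GPUs = TP 8 (intra-node) x PP 4 (across nodes) x DP 2 (replicas)
print(rank_to_coords(0, 8, 4, 2))   # -> (0, 0, 0)
print(rank_to_coords(7, 8, 4, 2))   # -> (0, 0, 7)  same node, same stage
print(rank_to_coords(8, 8, 4, 2))   # -> (0, 1, 0)  next pipeline stage
```

Putting the bandwidth-hungriest axis (TP) innermost is the core of the "topology-aware" pattern; frameworks express the same idea as a device mesh.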

Platform services

flowchart LR
  UI[CLI / UI] --> SCH[Job Scheduler]
  SCH --> RES[Resource Manager]
  RES --> GPU[GPU Fleet]
  SCH --> CKPT[Checkpoint Service]
  CKPT --> S3[(Object Store)]
  GPU --> METRICS[Metrics / Profiler]
  GPU --> ELA[Elastic Reshard]

Fault tolerance is mandatory

  • Periodic checkpoints every N steps (memory-mapped write + async upload).
  • Elastic training: detect GPU failure, reschedule the job against the same checkpoint, resume.
  • Loss-spike detection: auto-rollback to the prior checkpoint; apply gradient clipping.
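The loss-spike detector in the list above can be sketched as a rolling-statistics guard; the class name, window size, and z-score threshold are all illustrative assumptions, not from any production system:

```python
import collections
import statistics

class SpikeGuard:
    """Flag a loss spike when the current loss jumps far above the
    recent moving mean; the caller then rolls back to the prior
    checkpoint instead of letting the spike poison the run."""

    def __init__(self, window: int = 100, z: float = 4.0):
        self.losses = collections.deque(maxlen=window)
        self.z = z  # illustrative threshold: mean + z * stddev

    def is_spike(self, loss: float) -> bool:
        if len(self.losses) < 10:       # warm-up: accept everything
            self.losses.append(loss)
            return False
        mean = statistics.fmean(self.losses)
        std = statistics.pstdev(self.losses) or 1e-8
        spiked = loss > mean + self.z * std
        if not spiked:                  # spikes don't pollute the window
            self.losses.append(loss)
        return spiked

# Driver-loop usage (hypothetical step/restore calls):
#   if guard.is_spike(loss):
#       restore_checkpoint(last_good)   # rollback, maybe skip the batch
```

Excluding the spiking value from the window is deliberate: otherwise one spike inflates the stddev and masks the next one.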

Anthropic's take

RE interviews explicitly ask: "Pre-training a 100B model — a loss spike appears. Data issue? Learning rate? Hardware?" Candidates are expected to walk through the diagnosis and a checkpoint-revert strategy.

Related study-guide topics