O19 · Distributed ML Training Platform
Verified source
Prompt: "Design a Distributed ML Training Platform." — Hello Interview (Staff-level). Credibility D.
Clarify scope
- Target model sizes? 1B vs 100B parameters pull toward wildly different architectures.
- Users: internal researchers or external customers?
- Hardware: homogeneous GPU cluster or multi-region, multi-type?
Three orthogonal parallelism strategies
- Data Parallel (DP): replicate the model, shard the data, AllReduce gradients. ZeRO-1/2/3 progressively shards optimizer states, gradients, and parameters.
- Tensor Parallel (TP): split weight matrices within a layer. Needs very high bandwidth, so keep TP within a node's NVLink domain.
- Pipeline Parallel (PP): split layers across stages and feed micro-batches to fill the pipe. Keeping the bubble ratio below 25% needs micro-batch count > 4× stage count.
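The bubble claim above can be checked numerically. A minimal sketch (function name is mine) using the standard GPipe pipeline-bubble fraction (p − 1)/(m + p − 1) for p stages and m micro-batches:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Fraction of total pipeline time spent idle (GPipe-style schedule).

    With p stages, each stage sits idle while the pipe fills and drains,
    contributing (p - 1) slot-times of bubble against (m + p - 1) total.
    """
    p, m = stages, micro_batches
    return (p - 1) / (m + p - 1)


# Rule of thumb from the notes: m > 4 * p keeps the bubble under 25%.
for p in (4, 8, 16):
    m = 4 * p
    print(f"stages={p:2d} micro_batches={m:3d} bubble={bubble_fraction(p, m):.3f}")
```

At m = 4p the fraction is (p − 1)/(5p − 1) ≈ 0.2, comfortably under the 25% budget, which is where the "> 4× stages" rule of thumb comes from.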
Production recipe (3D parallelism)
Intra-node TP (NVLink) + PP across nodes + DP across replicas. Meta's Llama 3 training uses topology-aware communication patterns.
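To make the 3D layout concrete, here is a hypothetical rank-to-coordinate mapping (names and layout are illustrative, not Meta's actual code). Placing TP innermost means consecutive global ranks form a TP group, so TP peers land on the same node and can use NVLink:

```python
def rank_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp_index, pp_stage, tp_rank).

    TP is the fastest-varying axis: ranks 0..tp-1 form one TP group
    (intra-node), then PP stages span nodes, then DP replicas.
    """
    assert 0 <= rank < tp * pp * dp
    t = rank % tp
    p = (rank // tp) % pp
    d = rank // (tp * pp)
    return d, p, t


# Example: 64 GPUs = 8-way TP x 4-way PP x 2-way DP.
for r in (0, 7, 8, 63):
    print(r, rank_coords(r, tp=8, pp=4, dp=2))
```

The design choice here is ordering: putting the most bandwidth-hungry axis (TP) on the ranks that share the fastest interconnect, and the most latency-tolerant axis (DP) on the slowest links.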
Platform services
```mermaid
flowchart LR
    UI[CLI / UI] --> SCH[Job Scheduler]
    SCH --> RES[Resource Manager]
    RES --> GPU[GPU Fleet]
    SCH --> CKPT[Checkpoint Service]
    CKPT --> S3[(Object Store)]
    GPU --> METRICS[Metrics / Profiler]
    GPU --> ELA[Elastic Reshard]
```
Fault tolerance is mandatory
- Periodic checkpoints every N steps (memory-map + async upload).
- Elastic training: detect GPU failure, reschedule the job against the same checkpoint, resume.
- Loss-spike detection: auto-rollback to the prior checkpoint; gradient clipping.
Anthropic's take
RE interviews explicitly ask: "You are pre-training a 100B model and a loss spike appears. Data issue? Learning rate? Hardware?" Candidates are expected to walk through the diagnosis path and the checkpoint-revert strategy.