O19 · Distributed ML Training Platform
Verified source
Prompt: "Design a Distributed ML Training Platform." — Hello Interview (Staff-level). Credibility D.
Clarify scope
- Target model sizes? 1B vs 100B parameters pull toward wildly different architectures.
- Users: internal researchers or external customers?
- Hardware: homogeneous GPU cluster or multi-region, multi-type?
Three orthogonal parallelism strategies
- Data Parallel (DP): replicate the model, shard the data, AllReduce gradients. ZeRO-1/2/3 progressively shards optimizer states, gradients, and parameters.
- Tensor Parallel (TP): split weight matrices within a layer. Needs very high bandwidth, so keep TP within a node's NVLink domain.
- Pipeline Parallel (PP): split layers across stages and feed micro-batches to fill the pipe. Keeping the bubble ratio below 25% needs micro-batch count > 4× stage count.
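The bubble claim above can be checked numerically. A minimal sketch (function name is mine) using the standard GPipe pipeline-bubble fraction (p − 1)/(m + p − 1) for p stages and m micro-batches:

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Fraction of total pipeline time spent idle (GPipe-style schedule).

    With p stages, each stage sits idle while the pipe fills and drains,
    contributing (p - 1) slot-times of bubble against (m + p - 1) total.
    """
    p, m = stages, micro_batches
    return (p - 1) / (m + p - 1)


# Rule of thumb from the notes: m > 4 * p keeps the bubble under 25%.
for p in (4, 8, 16):
    m = 4 * p
    print(f"stages={p:2d} micro_batches={m:3d} bubble={bubble_fraction(p, m):.3f}")
```

At m = 4p the fraction is (p − 1)/(5p − 1) ≈ 0.2, comfortably under the 25% budget, which is where the "> 4× stages" rule of thumb comes from.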
Production recipe (3D parallelism)
Intra-node TP (NVLink) + PP across nodes + DP across replicas. Meta's Llama 3 training uses topology-aware communication patterns.
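To make the 3D layout concrete, here is a hypothetical rank-to-coordinate mapping (names and layout are illustrative, not Meta's actual code). Placing TP innermost means consecutive global ranks form a TP group, so TP peers land on the same node and can use NVLink:

```python
def rank_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp_index, pp_stage, tp_rank).

    TP is the fastest-varying axis: ranks 0..tp-1 form one TP group
    (intra-node), then PP stages span nodes, then DP replicas.
    """
    assert 0 <= rank < tp * pp * dp
    t = rank % tp
    p = (rank // tp) % pp
    d = rank // (tp * pp)
    return d, p, t


# Example: 64 GPUs = 8-way TP x 4-way PP x 2-way DP.
for r in (0, 7, 8, 63):
    print(r, rank_coords(r, tp=8, pp=4, dp=2))
```

The design choice here is ordering: putting the most bandwidth-hungry axis (TP) on the ranks that share the fastest interconnect, and the most latency-tolerant axis (DP) on the slowest links.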
Platform services
```mermaid
flowchart LR
    UI[CLI / UI] --> SCH[Job Scheduler]
    SCH --> RES[Resource Manager]
    RES --> GPU[GPU Fleet]
    SCH --> CKPT[Checkpoint Service]
    CKPT --> S3[(Object Store)]
    GPU --> METRICS[Metrics / Profiler]
    GPU --> ELA[Elastic Reshard]
```
Fault tolerance is mandatory
- Periodic checkpoints every N steps (memory-map + async upload).
- Elastic training: detect GPU failure, reschedule the job against the same checkpoint, resume.
- Loss-spike detection: auto-rollback to the prior checkpoint; gradient clipping.
Anthropic's take
RE interviews explicitly ask: "You are pre-training a 100B model and a loss spike appears. Data issue? Learning rate? Hardware?" Candidates are expected to walk through the diagnosis path and the checkpoint-revert strategy.