G5 · Design a TPU Cluster Scheduler
Verified source
Inspired by the Google Borg and Omega papers. Internal TPU scheduling is asked at infra onsites. Credibility: B.
Key decisions
- **Topology-aware placement**: chips are wired as a 3D torus, so prefer contiguous slices to keep all-reduce traffic inside one slice and minimise cross-slice all-reduce.
- **Gang scheduling**: a 512-chip job is either allocated all 512 chips at once or waits; a partial run holds chips and interconnect without making progress.
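- **Shared-state scheduler** (à la Omega; the Omega paper contrasts this with two-level designs such as Mesos): each scheduler plans against an optimistic snapshot of cluster state and commits with a CAS on the master copy, retrying on conflict.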
- **Preemption budget**: every job is tagged with a class; when capacity is short, jobs in the lowest class are preempted first.
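
The placement, gang, and optimistic-commit decisions above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not how any production scheduler is implemented: `Cluster`, `find_slice`, and `schedule` are hypothetical names, the search is a brute-force first fit, and conflict detection is coarse-grained (one version counter for the whole cell).

```python
from dataclasses import dataclass, field
from itertools import product

Coord = tuple[int, int, int]


@dataclass
class Cluster:
    """Master copy of cell state (hypothetical): a 3D torus of chips plus a version."""
    dims: Coord
    owner: dict = field(default_factory=dict)  # chip coordinate -> job id
    version: int = 0

    def snapshot(self):
        """Return (version, copy of the owner map) for a scheduler to plan against."""
        return self.version, dict(self.owner)

    def commit(self, seen_version, job, coords):
        """CAS: apply the whole claim only if nothing changed since the snapshot."""
        if seen_version != self.version:
            return False  # another scheduler committed first; caller must re-plan
        for c in coords:
            self.owner[c] = job
        self.version += 1
        return True


def find_slice(dims, owner, shape):
    """First free contiguous block of `shape`, wrapping around each torus axis."""
    if any(s > d for s, d in zip(shape, dims)):
        return None
    for origin in product(*(range(d) for d in dims)):
        block = [
            tuple((o + k) % d for o, k, d in zip(origin, offset, dims))
            for offset in product(*(range(s) for s in shape))
        ]
        if all(c not in owner for c in block):
            return block
    return None


def schedule(cluster, job, shape, retries=3):
    """Gang-schedule: return the full slice, or None; never a partial allocation."""
    for _ in range(retries):
        version, owner = cluster.snapshot()
        block = find_slice(cluster.dims, owner, shape)
        if block is None:
            return None  # no contiguous slice is free: the job waits
        if cluster.commit(version, job, block):
            return block
    return None  # kept losing the CAS race; caller can requeue the job


cluster = Cluster(dims=(8, 8, 8))
slice_a = schedule(cluster, "job-a", (8, 8, 4))  # 256 chips, all or nothing
```

A whole-cell version counter rejects any concurrent commit, even one touching unrelated chips. Checking only the claimed coordinates would reduce spurious retries at the cost of a more involved commit path.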
Follow-ups
- Utilisation target? 90%+ via bin-packing and preemption; hold 5% back for disaster recovery (DR), which caps achievable utilisation at 95%.
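
To make the arithmetic behind that answer concrete, here is a small sketch of how a 5% DR reserve and class-ordered preemption could interact. Everything in it is an assumption for illustration: the `Job` type, the convention that a higher `priority` number means a more important class, and the largest-first tie-break within a class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    priority: int  # hypothetical convention: higher number = more important class
    chips: int


def schedulable_chips(total, reserve_percent=5):
    """Chips left for normal jobs after holding back the DR reserve (rounded up)."""
    reserve = (total * reserve_percent + 99) // 100
    return total - reserve


def pick_victims(running, incoming, total, reserve_percent=5):
    """Lowest-class jobs to preempt so `incoming` fits, or None if it cannot fit."""
    budget = schedulable_chips(total, reserve_percent)
    needed = sum(j.chips for j in running) + incoming.chips - budget
    victims = []
    # Only strictly lower classes are eligible; lowest class first,
    # and within a class the largest job first to limit the victim count.
    eligible = sorted(
        (j for j in running if j.priority < incoming.priority),
        key=lambda j: (j.priority, -j.chips),
    )
    for job in eligible:
        if needed <= 0:
            break
        victims.append(job)
        needed -= job.chips
    return victims if needed <= 0 else None


running = [Job("batch", 0, 400), Job("research", 1, 300), Job("prod", 2, 200)]
victims = pick_victims(running, Job("prod-2", 2, 256), total=1000)
```

This only counts chips. Under gang scheduling on a torus, the freed chips would also need to form a contiguous slice, so a real victim search would have to be topology-aware as well.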