O24 · Design a Distributed Job Queue for ML Workloads
Verified source
Reported on LeetCode Discuss and 一亩三分地 - "design a job queue that schedules training jobs across our cluster". Credibility B.
Scope & goals
Multi-tenant submission of heterogeneous GPU jobs (train / eval / batch-infer). SLOs: p95 scheduling latency < 30 s for small jobs; long jobs are checkpointable and preemptible; fair-share quotas across teams.
Architecture

```mermaid
flowchart LR
  Clients --> API[Submit API]
  API --> Q[(Priority Queue / Redis streams)]
  S[Scheduler] --> Q
  S --> P[Placement / Bin-pack]
  P --> N1[Node agent 1]
  P --> N2[Node agent 2]
  N1 --> CK[(Checkpoint Store)]
  S --> DB[(Postgres jobs & leases)]
```
Job state machine

```
QUEUED -> ASSIGNED -> RUNNING -> (SUCCEEDED | FAILED | PREEMPTED)
PREEMPTED -> QUEUED (resumes from last checkpoint)
```
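The state machine above can be sketched as an enum plus a transition table; validating every change against the table keeps illegal moves (e.g. SUCCEEDED back to QUEUED) out of the database. Names here are illustrative, not from the original design.

```python
from enum import Enum, auto

class JobState(Enum):
    QUEUED = auto()
    ASSIGNED = auto()
    RUNNING = auto()
    SUCCEEDED = auto()
    FAILED = auto()
    PREEMPTED = auto()

# Legal transitions; PREEMPTED loops back to QUEUED (with the last checkpoint).
TRANSITIONS = {
    JobState.QUEUED: {JobState.ASSIGNED},
    JobState.ASSIGNED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.PREEMPTED},
    JobState.PREEMPTED: {JobState.QUEUED},
}

def transition(current: JobState, nxt: JobState) -> JobState:
    """Validate a state change before persisting it."""
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```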
Key decisions
- **Leased work stealing**: workers atomically CAS a lease row valid for N minutes; no central dispatcher bottleneck, and a crashed worker's lease simply expires.
- **Priority classes** (P0 online eval, P1 training, P2 batch) with weighted fair-share; higher classes preempt lower.
- **Gang scheduling** for multi-node training: only start when all N slots are ready, to avoid partial-allocation deadlock.
- **Checkpoint on preempt**: on SIGTERM the worker flushes state to the object store, then the job resumes elsewhere from that checkpoint.
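The leased work-stealing decision reduces to a compare-and-swap on a lease row. This in-memory sketch stands in for the usual Postgres pattern (`UPDATE ... WHERE owner IS NULL OR expires_at < now() RETURNING ...`, so exactly one worker wins); the table shape and 5-minute TTL are assumptions, not from the original design.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 300.0  # the design's "N minutes"; the exact value is an assumption

@dataclass
class Lease:
    owner: Optional[str] = None
    expires_at: float = 0.0

def try_acquire(lease: Lease, worker_id: str, now: float) -> bool:
    """CAS-style acquire: succeed only if the lease is free or expired.
    In Postgres this would be a single atomic UPDATE, so concurrent
    workers cannot both win; here the check-and-set models that."""
    if lease.owner is None or lease.expires_at < now:
        lease.owner = worker_id
        lease.expires_at = now + LEASE_SECONDS
        return True
    return False
```

A worker that dies mid-job never releases anything explicitly: its lease expires, another worker's CAS succeeds, and the job re-enters QUEUED.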
Follow-ups
- How do you prevent head-of-line blocking by a 1-week job? Per-class queues, bin-packing, and backfilling small jobs into idle slots.
- Exactly-once task execution? Not achievable over the network; use at-least-once delivery plus an idempotent handler keyed by task_id.
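The at-least-once + idempotency answer can be sketched with a dedup table keyed by task_id; the in-memory dict below stands in for a durable store with a unique index (e.g. Postgres) or Redis `SETNX`, and the handler body is a placeholder.

```python
# task_id -> cached result; stands in for a durable dedup table.
processed: dict = {}

def handle(task_id: str, payload: str) -> str:
    """At-least-once delivery means the same message may arrive twice;
    checking task_id first makes the *effect* exactly-once."""
    if task_id in processed:       # duplicate delivery: return the cached result
        return processed[task_id]
    result = payload.upper()       # placeholder for the real side effect
    processed[task_id] = result
    return result
```

In production the "check + write" must itself be atomic (an INSERT hitting a unique constraint, or SETNX), otherwise two racing deliveries can both pass the check.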