OpenAI · ★★ Frequent · Hard · Job Queue · Scheduler · GPU

O24 · Design a Distributed Job Queue for ML Workloads

Verified source

Reported on LeetCode Discuss and 一亩三分地 - "design a job queue that schedules training jobs across our cluster". Credibility B.

Scope & goals

Multi-tenant submission of heterogeneous GPU jobs (train / eval / batch-infer). SLOs: p95 scheduling latency < 30 s for small jobs; long jobs are checkpointable and preemptible; GPU capacity is fair-shared across teams.

Architecture

flowchart LR
  Clients --> API[Submit API]
  API --> Q[(Priority Queue / Redis streams)]
  S[Scheduler] --> Q
  S --> P[Placement / Bin-pack]
  P --> N1[Node agent 1]
  P --> N2[Node agent 2]
  N1 --> CK[(Checkpoint Store)]
  S --> DB[(Postgres jobs & leases)]

Job state machine

QUEUED -> ASSIGNED -> RUNNING -> (SUCCEEDED | FAILED | PREEMPTED)
PREEMPTED -> QUEUED (with last checkpoint)
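The transitions above can be encoded as a small lookup table that rejects illegal moves. This is a minimal sketch: the names `JobState`, `TRANSITIONS`, and `advance` are illustrative, and the table contains only the edges drawn in the diagram (for example, it adds no path for an expired lease on an ASSIGNED job).

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "QUEUED"
    ASSIGNED = "ASSIGNED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    PREEMPTED = "PREEMPTED"


# Allowed transitions, copied edge-for-edge from the state machine above.
# SUCCEEDED and FAILED are terminal; PREEMPTED goes back to the queue.
TRANSITIONS = {
    JobState.QUEUED: {JobState.ASSIGNED},
    JobState.ASSIGNED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.PREEMPTED},
    JobState.PREEMPTED: {JobState.QUEUED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return target if current -> target is a legal edge, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the table in one place means the scheduler and the node agents can share a single definition of what a legal state change is.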

Key decisions

  • **Leased work stealing**: workers atomically CAS (compare-and-swap) a lease row that is valid for N minutes, so there is no central dispatcher bottleneck.
  • **Priority classes** (P0 online eval, P1 training, P2 batch) with weighted fair-share; higher classes preempt lower ones.
  • **Gang scheduling** for multi-node training: start a job only when all N slots are ready, so that partially placed jobs cannot hold GPUs while waiting on each other and deadlock.
  • **Checkpoint on preempt**: the worker receives SIGTERM, flushes a checkpoint to the object store, and the job resumes on another node.
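The lease CAS in the first bullet can be expressed as a single conditional UPDATE. This is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for the Postgres table in the diagram so that it runs without a server, and the `leases` schema and the `make_db` / `try_acquire` names are hypothetical. Timestamps are passed in explicitly to keep the example deterministic.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # In-memory SQLite stands in for the Postgres "jobs & leases" store.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE leases ("
        " job_id TEXT PRIMARY KEY,"
        " owner TEXT,"
        " expires_at REAL NOT NULL DEFAULT 0)"
    )
    return conn


def try_acquire(conn, job_id, worker, now, ttl_s):
    """Compare-and-set: claim the lease only if it is free or already expired."""
    cur = conn.execute(
        "UPDATE leases SET owner = ?, expires_at = ? "
        "WHERE job_id = ? AND (owner IS NULL OR expires_at <= ?)",
        (worker, now + ttl_s, job_id, now),
    )
    conn.commit()
    # Exactly one row changes when this worker won the lease; zero otherwise.
    return cur.rowcount == 1


conn = make_db()
conn.execute("INSERT INTO leases (job_id) VALUES ('job-1')")
first = try_acquire(conn, "job-1", "worker-a", now=0.0, ttl_s=600)     # free lease: acquired
second = try_acquire(conn, "job-1", "worker-b", now=1.0, ttl_s=600)    # still held: refused
stolen = try_acquire(conn, "job-1", "worker-b", now=601.0, ttl_s=600)  # expired: taken over
```

Because the ownership check and the write happen in one statement, two workers racing for the same row cannot both see `rowcount == 1`. A crashed worker needs no cleanup, since its lease simply expires and another worker takes it over.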

Follow-ups

  • How do you prevent head-of-line blocking by a 1-week job? Use per-class queues, bin-packing, and backfill of small jobs into idle capacity.
  • Exactly-once task execution? Use at-least-once delivery plus an idempotent handler keyed by task_id.
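The at-least-once plus idempotent-handler answer can be sketched as a wrapper that records results by `task_id` and skips work it has already completed. The `IdempotentRunner` class is hypothetical, and its in-memory dict stands in for a durable store. A real system would also need to commit the result record atomically with the task's side effects, which this sketch does not attempt.

```python
from typing import Callable, Dict


class IdempotentRunner:
    """Runs each task_id at most once; redeliveries return the stored result."""

    def __init__(self) -> None:
        # Assumption: a dict stands in for a durable results table.
        self._results: Dict[str, object] = {}

    def handle(self, task_id: str, fn: Callable[[], object]) -> object:
        if task_id in self._results:
            # Duplicate delivery: do not re-run the side effect.
            return self._results[task_id]
        result = fn()
        self._results[task_id] = result
        return result


calls = []


def train_step() -> str:
    calls.append("ran")  # visible side effect, used to count executions
    return "checkpoint-0001"


runner = IdempotentRunner()
out_first = runner.handle("task-42", train_step)
out_retry = runner.handle("task-42", train_step)  # simulated redelivery
```

After both calls, `calls` holds a single entry and both results are identical. The queue may deliver a task more than once, but its effect is applied only once.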

Related study-guide topics