O24 · Design a Distributed Job Queue for ML Workloads
Verified source
Reported on LeetCode Discuss and 一亩三分地 - "design a job queue that schedules training jobs across our cluster". Credibility B.
Scope & goals
Multi-tenant submission of heterogeneous GPU jobs (train / eval / batch-infer). SLOs: p95 scheduling latency < 30 s for small jobs; long jobs are checkpointable and preemptible; fair-share quotas across teams.
Architecture

```mermaid
flowchart LR
  Clients --> API[Submit API]
  API --> Q[(Priority Queue / Redis streams)]
  S[Scheduler] --> Q
  S --> P[Placement / Bin-pack]
  P --> N1[Node agent 1]
  P --> N2[Node agent 2]
  N1 --> CK[(Checkpoint Store)]
  S --> DB[(Postgres jobs & leases)]
```
Job state machine

```
QUEUED -> ASSIGNED -> RUNNING -> (SUCCEEDED | FAILED | PREEMPTED)
PREEMPTED -> QUEUED (resumes from last checkpoint)
```
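The state machine above can be sketched as an enum plus a transition table; validating every change against the table keeps illegal moves (e.g. SUCCEEDED back to QUEUED) out of the database. Names here are illustrative, not from the original design.

```python
from enum import Enum, auto

class JobState(Enum):
    QUEUED = auto()
    ASSIGNED = auto()
    RUNNING = auto()
    SUCCEEDED = auto()
    FAILED = auto()
    PREEMPTED = auto()

# Legal transitions; PREEMPTED loops back to QUEUED (with the last checkpoint).
TRANSITIONS = {
    JobState.QUEUED: {JobState.ASSIGNED},
    JobState.ASSIGNED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.PREEMPTED},
    JobState.PREEMPTED: {JobState.QUEUED},
}

def transition(current: JobState, nxt: JobState) -> JobState:
    """Validate a state change before persisting it."""
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```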
Key decisions
- **Leased work stealing**: workers atomically CAS a lease row valid for N minutes; no central dispatcher bottleneck, and a crashed worker's lease simply expires.
- **Priority classes** (P0 online eval, P1 training, P2 batch) with weighted fair-share; higher classes preempt lower.
- **Gang scheduling** for multi-node training: only start when all N slots are ready, to avoid partial-allocation deadlock.
- **Checkpoint on preempt**: on SIGTERM the worker flushes state to the object store, then the job resumes elsewhere from that checkpoint.
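The leased work-stealing decision reduces to a compare-and-swap on a lease row. This in-memory sketch stands in for the usual Postgres pattern (`UPDATE ... WHERE owner IS NULL OR expires_at < now() RETURNING ...`, so exactly one worker wins); the table shape and 5-minute TTL are assumptions, not from the original design.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 300.0  # the design's "N minutes"; the exact value is an assumption

@dataclass
class Lease:
    owner: Optional[str] = None
    expires_at: float = 0.0

def try_acquire(lease: Lease, worker_id: str, now: float) -> bool:
    """CAS-style acquire: succeed only if the lease is free or expired.
    In Postgres this would be a single atomic UPDATE, so concurrent
    workers cannot both win; here the check-and-set models that."""
    if lease.owner is None or lease.expires_at < now:
        lease.owner = worker_id
        lease.expires_at = now + LEASE_SECONDS
        return True
    return False
```

A worker that dies mid-job never releases anything explicitly: its lease expires, another worker's CAS succeeds, and the job re-enters QUEUED.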
Follow-ups
- How do you prevent head-of-line blocking by a 1-week job? Per-class queues, bin-packing, and backfilling small jobs into idle slots.
- Exactly-once task execution? Not achievable over the network; use at-least-once delivery plus an idempotent handler keyed by task_id.
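The at-least-once + idempotency answer can be sketched with a dedup table keyed by task_id; the in-memory dict below stands in for a durable store with a unique index (e.g. Postgres) or Redis `SETNX`, and the handler body is a placeholder.

```python
# task_id -> cached result; stands in for a durable dedup table.
processed: dict = {}

def handle(task_id: str, payload: str) -> str:
    """At-least-once delivery means the same message may arrive twice;
    checking task_id first makes the *effect* exactly-once."""
    if task_id in processed:       # duplicate delivery: return the cached result
        return processed[task_id]
    result = payload.upper()       # placeholder for the real side effect
    processed[task_id] = result
    return result
```

In production the "check + write" must itself be atomic (an INSERT hitting a unique constraint, or SETNX), otherwise two racing deliveries can both pass the check.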