OpenAI · ★★★ Frequent · Hard · Workflow Engine · Lease · Multi-tenant

O5 · Design a CI/CD System (like GitHub Actions)

Verified source

Original prompt: "design a CI/CD system, pretty similar to Github Actions…focus on high reliability first" — LeetCode, 2026-02-12 and Jointaro, 2025-07-31. Credibility C.

Requirements clarification

  • Triggers: push / PR / cron / manual?
  • Job model: workflow → jobs → steps; DAG or linear? (The scope often expands from linear to DAG.)
  • Runners: self-hosted or managed? What isolation is required (container/VM)?
  • If the control plane dies, do in-flight jobs continue?
  • Outputs: logs, artifacts, status webhooks.
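
If the job model does expand from linear to a DAG, the scheduler needs to know which jobs are runnable at each point. A minimal sketch, assuming each job declares the jobs it depends on in a hypothetical `needs` mapping (the name and the "waves" grouping are illustrative, not part of the original prompt):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_waves(needs):
    """Group jobs into waves; every job in a wave can run in parallel.

    `needs` maps job name -> set of job names it depends on (hypothetical shape).
    """
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished, unblocking successors
    return waves


workflow = {
    "build": set(),
    "lint": set(),
    "test": {"build"},
    "deploy": {"test", "lint"},
}
waves = execution_waves(workflow)
```

A linear workflow is just the special case where every wave contains exactly one job, so the same code path can serve both models.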

Reliability-first design: persistent state machine

The core is a persistent, auditable state machine for every job and step, so that any failure can be replayed and recovered.
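
One way to picture the state machine is an explicit transition table plus an append-only history that can be replayed. This is a sketch under assumptions: the state names, the `ALLOWED` table, and the in-memory `history` list are illustrative stand-ins; in a real system the history would be rows in the State DB.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    CLAIMED = "claimed"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Assumed transition table; a real system would likely add CANCELLED, TIMED_OUT, etc.
ALLOWED = {
    JobState.QUEUED: {JobState.CLAIMED},
    JobState.CLAIMED: {JobState.RUNNING, JobState.QUEUED},  # back to QUEUED on lease expiry
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.QUEUED},
    JobState.SUCCEEDED: set(),  # terminal
    JobState.FAILED: set(),     # terminal
}


@dataclass
class JobRecord:
    job_id: str
    state: JobState = JobState.QUEUED
    history: list = field(default_factory=list)  # append-only audit trail

    def transition(self, new_state, reason):
        """Apply one transition, rejecting anything the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.history.append((self.state, new_state, reason))
        self.state = new_state


def replay(job_id, history):
    """Rebuild a job's current state purely from its recorded history."""
    job = JobRecord(job_id)
    for _old, new, reason in history:
        job.transition(new, reason)
    return job
```

Because every change goes through `transition`, the history is both the audit log and the recovery mechanism: a restarted component can call `replay` to reconstruct where each job stood.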

flowchart LR
  U[User/Repo] --> API[CI API]
  API --> DB[(State DB)]
  API --> Q[Job Queue]
  Q --> S[Scheduler]
  S --> R[Runner Fleet]
  R --> L[Log Store]
  R --> A[Artifact Store]
  R --> DB

API (give just four)

POST /repos/{repo}/workflows/{wf}/dispatch
GET  /runs/{run_id}                  -- aggregate status
GET  /runs/{run_id}/logs             -- paginated / streamed
POST /runners/{runner_id}/heartbeat  -- runner capability + lease + health
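
The heartbeat endpoint is what keeps a runner's lease alive. A rough sketch of the server-side bookkeeping, assuming a fixed lease TTL and an in-memory registry (`RunnerRegistry`, `LEASE_TTL_SECONDS`, and the explicit `now` argument are hypothetical, chosen to keep the example deterministic):

```python
from dataclasses import dataclass

LEASE_TTL_SECONDS = 30.0  # assumed value, not from the original design


@dataclass
class RunnerInfo:
    runner_id: str
    capabilities: frozenset
    lease_expires_at: float


class RunnerRegistry:
    """In-memory stand-in for the runner table behind the heartbeat endpoint."""

    def __init__(self, ttl=LEASE_TTL_SECONDS):
        self.ttl = ttl
        self.runners = {}

    def heartbeat(self, runner_id, capabilities, now):
        """Record capabilities and push the runner's lease expiry forward."""
        info = RunnerInfo(runner_id, frozenset(capabilities), now + self.ttl)
        self.runners[runner_id] = info
        return info

    def expired(self, now):
        """Runners whose lease has lapsed; their jobs become candidates for reclaim."""
        return sorted(r.runner_id for r in self.runners.values()
                      if r.lease_expires_at <= now)
```

A runner that stops sending heartbeats simply ages out of the registry, so the control plane never needs a separate "runner died" signal.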

Data model

WorkflowRun(run_id, repo_id, trigger, status, created_at)
Job(job_id, run_id, status, requirements, assigned_runner, attempt, lease_id)
Step(step_id, job_id, status, started_at, ended_at, exit_code)
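
The three records map naturally onto plain data classes. The sketch below mirrors the fields above; the status strings and the choice to derive the run's aggregate status from its jobs (rather than store it) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    step_id: str
    job_id: str
    status: str = "queued"
    started_at: Optional[float] = None
    ended_at: Optional[float] = None
    exit_code: Optional[int] = None


@dataclass
class Job:
    job_id: str
    run_id: str
    status: str = "queued"
    requirements: tuple = ()           # e.g. ("linux", "gpu"); shape is assumed
    assigned_runner: Optional[str] = None
    attempt: int = 0
    lease_id: Optional[str] = None


@dataclass
class WorkflowRun:
    run_id: str
    repo_id: str
    trigger: str
    created_at: float
    jobs: list = field(default_factory=list)

    @property
    def status(self):
        """Aggregate status derived from job statuses (assumed rules)."""
        states = {job.status for job in self.jobs}
        if "failed" in states:
            return "failed"
        if states == {"succeeded"}:
            return "succeeded"
        if states <= {"queued"}:  # also covers a run with no jobs yet
            return "queued"
        return "running"
```

Deriving the aggregate on read avoids a second write that could disagree with the job rows after a crash; a stored `status` column, as in the schema above, would instead be updated in the same transaction as the job change.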

Scale & isolation

  • Multi-tenant: per-org/repo quotas; partitioned queues; separate runner pools.
  • Runner security: least-privilege tokens, short-lived credentials, container/VM sandbox.
  • Scheduling: priority queue (paid/urgent) + fair-share quota.
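
The priority and fair-share ideas can be combined by keeping one priority heap per tenant and always serving the tenant that is using the smallest fraction of its quota. A minimal sketch; `FairScheduler`, the quota semantics (maximum concurrent jobs), and the tie-break by tenant name are all assumptions:

```python
import heapq
from collections import defaultdict


class FairScheduler:
    def __init__(self, quotas):
        self.quotas = quotas                # tenant -> max concurrent jobs (assumed meaning)
        self.running = defaultdict(int)     # tenant -> jobs currently running
        self.pending = defaultdict(list)    # tenant -> heap of (priority, seq, job_id)
        self._seq = 0                       # keeps FIFO order within one priority

    def submit(self, tenant, job_id, priority=10):
        """Queue a job; a lower priority number is served first."""
        self._seq += 1
        heapq.heappush(self.pending[tenant], (priority, self._seq, job_id))

    def next_job(self):
        """Pick the next job from the tenant furthest below its quota."""
        eligible = [t for t, heap in self.pending.items()
                    if heap and self.running[t] < self.quotas.get(t, 0)]
        if not eligible:
            return None
        tenant = min(eligible, key=lambda t: (self.running[t] / self.quotas[t], t))
        _priority, _seq, job_id = heapq.heappop(self.pending[tenant])
        self.running[tenant] += 1
        return tenant, job_id

    def finish(self, tenant):
        """Release one quota slot when a job completes."""
        self.running[tenant] -= 1
```

With this policy a tenant that floods the queue cannot starve others: once it reaches its quota it is skipped until one of its jobs finishes.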

Common follow-up

"What if the scheduler dies?" State lives in the DB; jobs use a lease plus an idempotent claim, so after a failure another scheduler instance reclaims any job whose lease has expired. Two runners claiming the same job is prevented by a compare-and-swap on lease_id.
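
The compare-and-swap claim can be expressed as a single conditional UPDATE. The sketch below uses SQLite only to stay self-contained; the `lease_expires_at` column, the `try_claim` helper, and the 60-second TTL are assumptions added for illustration.

```python
import sqlite3
import uuid


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE job ("
        " job_id TEXT PRIMARY KEY, status TEXT NOT NULL,"
        " assigned_runner TEXT, lease_id TEXT,"
        " lease_expires_at REAL,"  # assumed column, not in the original data model
        " attempt INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def try_claim(db, job_id, runner_id, seen_lease_id, now, ttl=60.0):
    """Claim a job only if lease_id still equals the value the caller read earlier.

    Returns the new lease_id on success, or None if another claimer won
    or the existing lease has not expired yet.
    """
    new_lease = uuid.uuid4().hex
    cur = db.execute(
        "UPDATE job SET status = 'claimed', assigned_runner = ?, lease_id = ?,"
        " lease_expires_at = ?, attempt = attempt + 1"
        " WHERE job_id = ?"
        " AND lease_id IS ?"  # compare: IS also matches NULL in SQLite
        " AND (lease_id IS NULL OR lease_expires_at <= ?)",
        (runner_id, new_lease, now + ttl, job_id, seen_lease_id, now),
    )
    db.commit()
    return new_lease if cur.rowcount == 1 else None


db = make_db()
db.execute("INSERT INTO job (job_id, status) VALUES ('j1', 'queued')")
# Two claimers both read lease_id = NULL, then race; only the first UPDATE matches.
first = try_claim(db, "j1", "runner-1", seen_lease_id=None, now=0.0)
second = try_claim(db, "j1", "runner-2", seen_lease_id=None, now=0.0)
```

Because the comparison and the write happen in one statement, the database serializes the race; the loser sees zero affected rows and moves on. After the lease expires, a claimer that read the stale `lease_id` can take over using the same call.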

Cost & perf

  • Runner compute dominates cost. Optimize with idle reclamation, a warm pool, and a spot-instance mix for jobs that tolerate preemption through checkpointing.
  • Log explosion: segmented upload, compression, tiered cold storage after 30 days.
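
Segmented upload plus compression can be sketched as a generator that cuts the log stream into bounded chunks and gzips each one. The 64 KiB segment size and the `segment_log` / `read_log` helpers are assumptions; the upload call itself is omitted because it depends on the Log Store's API.

```python
import gzip


def segment_log(lines, max_bytes=64 * 1024):
    """Yield (segment_index, gzip_bytes) pairs, each holding at most
    max_bytes of uncompressed text (a single oversized line gets its own segment)."""
    buf, size, index = [], 0, 0
    for line in lines:
        data = line.encode("utf-8") + b"\n"
        if buf and size + len(data) > max_bytes:
            yield index, gzip.compress(b"".join(buf))
            buf, size, index = [], 0, index + 1
        buf.append(data)
        size += len(data)
    if buf:  # flush the final partial segment
        yield index, gzip.compress(b"".join(buf))


def read_log(segments):
    """Reassemble the original lines from segments in index order."""
    ordered = sorted(segments, key=lambda seg: seg[0])
    raw = b"".join(gzip.decompress(blob) for _index, blob in ordered)
    return raw.decode("utf-8").splitlines()


lines = [f"step {i}: ok" for i in range(1000)]
segments = list(segment_log(lines, max_bytes=1024))
```

Independent segments let a runner upload as it goes, so a crash loses at most the current segment, and they give the paginated logs endpoint a natural unit to serve.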

Related study-guide topics