OpenAI · ★★★ Frequent · Hard · Workflow Engine · Lease · Multi-tenant

O5 · Design a CI/CD System (like GitHub Actions)

Verified source

Original prompt: "design a CI/CD system, pretty similar to Github Actions…focus on high reliability first" — LeetCode, 2026-02-12 and Jointaro, 2025-07-31. Credibility C.

Requirements clarification

Triggers: push / PR / cron / manual?
Job model: workflow → jobs → steps; DAG or linear? (The question often starts linear and expands to a DAG.)
Runners: self-hosted or managed? What isolation is required (container/VM)?
If the control plane dies, do in-flight jobs continue?
Outputs: logs, artifacts, status webhooks.

Reliability-first design: persistent state machine

The core is a persistent, auditable state machine for every job and step, so that after any failure the state can be replayed and recovered.
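A minimal sketch of such a state machine follows. The status names and the allowed transitions are assumptions for illustration (the source does not enumerate them), and the history is kept in memory here, whereas a real system would append each transition to the State DB before acting on it.

```python
from dataclasses import dataclass, field

# Hypothetical status names and transitions; not specified in the source.
ALLOWED = {
    "queued": {"running", "cancelled"},
    # running -> queued models a retry after a lost lease
    "running": {"succeeded", "failed", "queued", "cancelled"},
    "succeeded": set(),
    "failed": set(),
    "cancelled": set(),
}


@dataclass
class JobStateMachine:
    job_id: str
    status: str = "queued"
    # Append-only audit log of (old_status, new_status, reason).
    history: list = field(default_factory=list)

    def transition(self, new_status: str, reason: str) -> None:
        """Apply a transition, rejecting any move the table does not allow."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.history.append((self.status, new_status, reason))
        self.status = new_status


def replay(job_id: str, history: list) -> JobStateMachine:
    """Rebuild a job's current state from its recorded transitions."""
    job = JobStateMachine(job_id)
    for _old, new, reason in history:
        job.transition(new, reason)
    return job
```

Because every change goes through `transition`, the history is both the audit trail and the recovery input: replaying it after a crash yields the same status the job had before.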

flowchart LR
  U[User/Repo] --> API[CI API]
  API --> DB[(State DB)]
  API --> Q[Job Queue]
  Q --> S[Scheduler]
  S --> R[Runner Fleet]
  R --> L[Log Store]
  R --> A[Artifact Store]
  R --> DB

API (give just four)

POST /repos/{repo}/workflows/{wf}/dispatch
GET  /runs/{run_id}                  -- aggregate status
GET  /runs/{run_id}/logs             -- paginated / streamed
POST /runners/{runner_id}/heartbeat  -- runner capability + lease + health
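One way the heartbeat endpoint could keep a lease alive is sketched below. The in-memory `leases` map, the 30-second TTL, and the idea that each heartbeat carries the `job_id` and `lease_id` the runner believes it holds are all assumptions for illustration, not details from the source.

```python
LEASE_TTL_SECONDS = 30.0  # assumed value


def handle_heartbeat(leases: dict, job_id: str, lease_id: str, now: float) -> bool:
    """Extend the lease if the caller still holds it; otherwise report False.

    A False result tells the runner its lease was lost (expired or reassigned),
    so it should stop work rather than race a replacement runner.
    """
    current = leases.get(job_id)
    if current is None:
        return False
    if current["lease_id"] != lease_id or current["expires_at"] <= now:
        return False
    current["expires_at"] = now + LEASE_TTL_SECONDS
    return True


def expired_jobs(leases: dict, now: float) -> list:
    """Jobs whose lease has lapsed and which can be put back on the queue."""
    return sorted(job for job, lease in leases.items() if lease["expires_at"] <= now)
```

A periodic sweep over `expired_jobs` is what lets the system recover from a runner that disappears without reporting a failure.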

Data model

WorkflowRun(run_id, repo_id, trigger, status, created_at)
Job(job_id, run_id, status, requirements, assigned_runner, attempt, lease_id)
Step(step_id, job_id, status, started_at, ended_at, exit_code)

Scale & isolation

Multi-tenant: per-org/repo quotas; partitioned queues; separate runner pools.
Runner security: least-privilege tokens, short-lived credentials, container/VM sandboxing.
Scheduling: priority queue (paid/urgent) plus fair-share quotas.
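The scheduling bullet can be illustrated with a small sketch that combines a priority class with a per-org concurrency quota. The class name `FairScheduler`, the "lower number is more urgent" convention, and the tie-break by fewest running jobs are assumptions chosen for illustration; real fair-share policies vary.

```python
from collections import defaultdict, deque


class FairScheduler:
    def __init__(self, quota_per_org: int):
        self.quota = quota_per_org        # max concurrent jobs per org (assumed policy)
        self.running = defaultdict(int)   # org -> jobs currently running
        self.queues = defaultdict(deque)  # (priority, org) -> FIFO of job ids

    def submit(self, org: str, job_id: str, priority: int = 1) -> None:
        """Enqueue a job; a lower priority number means more urgent."""
        self.queues[(priority, org)].append(job_id)

    def next_job(self):
        """Pick the most urgent job, preferring the org with the fewest running jobs."""
        candidates = [
            (priority, self.running[org], org)
            for (priority, org), queue in self.queues.items()
            if queue and self.running[org] < self.quota
        ]
        if not candidates:
            return None  # nothing queued, or every waiting org is at quota
        priority, _, org = min(candidates)
        self.running[org] += 1
        return org, self.queues[(priority, org)].popleft()

    def finish(self, org: str) -> None:
        """Release one slot of the org's quota when a job completes."""
        self.running[org] -= 1
```

With a quota of 2, an org that submits many jobs gets at most two slots at a time, while other orgs at the same priority are served first whenever they have fewer jobs running.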

Common follow-up

“What if the scheduler dies?” State lives in the DB, and jobs use a lease plus an idempotent claim, so after a failure another scheduler reclaims them once the lease expires. Two runners claiming the same job is prevented by a compare-and-swap (CAS) on lease_id.
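The compare-and-swap claim can be sketched as a single conditional UPDATE. The example below uses Python's built-in `sqlite3` as a stand-in for the State DB; the `lease_expires` column, the 30-second TTL, and the `claim` helper are assumptions added for illustration and are not part of the data model given in the source.

```python
import sqlite3
import uuid


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE job (job_id TEXT PRIMARY KEY, status TEXT, "
        "assigned_runner TEXT, attempt INTEGER, lease_id TEXT, "
        "lease_expires REAL)"  # lease_expires is an assumed extra column
    )
    return db


def claim(db, job_id, runner_id, seen_lease_id, now, ttl=30.0):
    """Claim a job only if lease_id still equals the value the caller read.

    Returns the new lease id on success, or None if another claimer won
    or the existing lease has not expired yet.
    """
    new_lease = uuid.uuid4().hex
    cur = db.execute(
        "UPDATE job SET lease_id = ?, assigned_runner = ?, status = 'running', "
        "attempt = attempt + 1, lease_expires = ? "
        "WHERE job_id = ? AND lease_id IS ? "          # the compare step
        "AND (lease_id IS NULL OR lease_expires <= ?)",  # unclaimed or expired
        (new_lease, runner_id, now + ttl, job_id, seen_lease_id, now),
    )
    db.commit()
    return new_lease if cur.rowcount == 1 else None
```

The database applies the comparison and the write atomically, so when two runners race with the same `seen_lease_id`, exactly one UPDATE matches a row and the other sees `rowcount == 0`.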

Cost & perf

Runner compute dominates cost. Optimize with idle reclamation, a warm pool, and a spot-instance mix for jobs that tolerate interruption via checkpoints.
Log explosion: segmented upload, compression, and tiering to cold storage after 30 days.
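The log-handling bullet can be sketched as follows. The 64 KiB segment size, the helper names, and the reading of "after 30 days" as "older than 30 days" are assumptions for illustration; actual upload to a log store is omitted.

```python
import gzip

SEGMENT_BYTES = 64 * 1024  # assumed segment size


def segment_log(raw: bytes, segment_bytes: int = SEGMENT_BYTES) -> list:
    """Split a log into fixed-size segments and gzip each one independently.

    Independent segments can be uploaded as they fill up and fetched
    individually for paginated reads.
    """
    return [
        gzip.compress(raw[i:i + segment_bytes])
        for i in range(0, len(raw), segment_bytes)
    ]


def read_log(segments: list) -> bytes:
    """Reassemble the original log from its compressed segments."""
    return b"".join(gzip.decompress(segment) for segment in segments)


def storage_tier(age_days: int) -> str:
    """Assumed tiering rule: logs older than 30 days move to cold storage."""
    return "cold" if age_days > 30 else "hot"
```

Compressing per segment costs a little ratio compared with compressing the whole log, but it means a runner that dies mid-job has already persisted everything up to its last full segment.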
