O5 · Design a CI/CD System (like GitHub Actions)
Verified source
Original prompt: "design a CI/CD system, pretty similar to Github Actions…focus on high reliability first" — LeetCode, 2026-02-12 and Jointaro, 2025-07-31. Credibility C.
Requirements clarification
- Triggers: push / PR / cron / manual?
- Job model: workflow → jobs → steps; DAG or linear? (The discussion often expands from linear to DAG.)
- Runners: self-hosted or managed? What isolation is required (container/VM)?
- If the control plane dies, do in-flight jobs continue?
- Outputs: logs, artifacts, status webhooks.
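The linear-versus-DAG question matters because a DAG lets independent jobs run in parallel, while a linear pipeline is just the special case with one job per stage. A minimal sketch of that idea, using Python's standard `graphlib`; the job names and the `execution_waves` helper are hypothetical and chosen only for illustration:

```python
from graphlib import TopologicalSorter

def execution_waves(needs: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into waves; every job inside one wave can run in parallel."""
    sorter = TopologicalSorter(needs)  # maps each job to the jobs it depends on
    sorter.prepare()                   # raises graphlib.CycleError on a cyclic workflow
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)                 # assume the whole wave succeeded
    return waves

# Hypothetical workflow: lint and build are independent, test needs build,
# deploy needs both lint and test.
workflow = {
    "lint": set(),
    "build": set(),
    "test": {"build"},
    "deploy": {"lint", "test"},
}
waves = execution_waves(workflow)
print(waves)
```

In a real scheduler a job would be marked done only after its runner reports success, so a failed job would keep its dependents from ever becoming ready.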
Reliability-first design: persistent state machine
The core is a persistent, auditable state machine for every job and step, so that after any failure the recorded transitions can be replayed and the job recovered.
flowchart LR
  U[User/Repo] --> API[CI API]
  API --> DB[(State DB)]
  API --> Q[Job Queue]
  Q --> S[Scheduler]
  S --> R[Runner Fleet]
  R --> L[Log Store]
  R --> A[Artifact Store]
  R --> DB
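One way to make the state machine concrete is an explicit transition table plus an append-only audit log. The sketch below is a minimal in-memory illustration. The state names (`queued`, `leased`, `running`, and so on) and the `JobStateMachine` class are assumptions made for this example, and a real system would persist every transition in the State DB rather than in a Python list.

```python
from dataclasses import dataclass, field

# Assumed state set; the notes only say that each job/step carries a status.
TRANSITIONS = {
    "queued":    {"leased", "cancelled"},
    "leased":    {"running", "queued", "cancelled"},              # queued again if the lease expires
    "running":   {"succeeded", "failed", "queued", "cancelled"},  # queued = retry after runner loss
    "succeeded": set(),
    "failed":    set(),
    "cancelled": set(),
}

class IllegalTransition(Exception):
    pass

@dataclass
class JobStateMachine:
    job_id: str
    status: str = "queued"
    history: list = field(default_factory=list)  # append-only audit trail

    def transition(self, new_status: str, reason: str) -> None:
        if new_status not in TRANSITIONS[self.status]:
            raise IllegalTransition(f"{self.status} -> {new_status}")
        self.history.append((self.status, new_status, reason))
        self.status = new_status

    @classmethod
    def replay(cls, job_id: str, history: list) -> "JobStateMachine":
        """Rebuild the current state from the audit log alone."""
        sm = cls(job_id)
        for _old, new, reason in history:
            sm.transition(new, reason)
        return sm

job = JobStateMachine("job-1")
job.transition("leased", "claimed by runner-a")
job.transition("running", "runner-a started step 1")
job.transition("queued", "runner-a heartbeat lost")

# After a crash, the same state can be recovered from the log.
recovered = JobStateMachine.replay("job-1", job.history)
print(recovered.status)
```

Rejecting illegal transitions at write time is what keeps the log trustworthy enough to replay.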
API (give just four)
POST /repos/{repo}/workflows/{wf}/dispatch
GET /runs/{run_id} -- aggregate status
GET /runs/{run_id}/logs -- paginated / streamed
POST /runners/{runner_id}/heartbeat -- runner capability + lease + health
Data model
WorkflowRun(run_id, repo_id, trigger, status, created_at)
Job(job_id, run_id, status, requirements, assigned_runner, attempt, lease_id)
Step(step_id, job_id, status, started_at, ended_at, exit_code)
Scale & isolation
- Multi-tenancy: per-org/repo quotas, partitioned queues, separate runner pools.
- Runner security: least-privilege tokens, short-lived credentials, container/VM sandboxing.
- Scheduling: priority queue (paid/urgent) plus fair-share quotas.
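The scheduling bullet combines two ideas: order jobs by priority, but do not let one tenant monopolize the fleet. A minimal sketch, assuming a per-org concurrency cap as the simplest form of fair share; the `FairScheduler` class, the org names, and the convention that a lower number means higher priority are all illustrative assumptions:

```python
import heapq
from collections import defaultdict
from itertools import count

class FairScheduler:
    def __init__(self, quota_per_org: int):
        self.quota = quota_per_org
        self.running = defaultdict(int)  # org -> jobs currently running
        self.heap = []                   # entries: (priority, seq, org, job_id)
        self.seq = count()               # tie-breaker keeps FIFO order within a priority

    def submit(self, org: str, job_id: str, priority: int) -> None:
        heapq.heappush(self.heap, (priority, next(self.seq), org, job_id))

    def next_job(self):
        """Pop the best job whose org is under quota; None if nothing is runnable."""
        skipped, picked = [], None
        while self.heap:
            item = heapq.heappop(self.heap)
            _prio, _seq, org, job_id = item
            if self.running[org] < self.quota:
                self.running[org] += 1
                picked = job_id
                break
            skipped.append(item)         # org is at quota; the job stays queued
        for item in skipped:
            heapq.heappush(self.heap, item)
        return picked

    def finish(self, org: str) -> None:
        self.running[org] -= 1

sched = FairScheduler(quota_per_org=1)
sched.submit("acme", "a1", priority=0)   # paid tenant, two urgent jobs
sched.submit("acme", "a2", priority=0)
sched.submit("hobby", "h1", priority=5)  # free tenant, low priority
order = [sched.next_job() for _ in range(3)]
print(order)
```

Here `a2` outranks `h1` on priority, yet `h1` runs first because `acme` is already at its quota; `a2` becomes runnable once `sched.finish("acme")` is called.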
Common follow-up
"What if the scheduler dies?" The state lives in the DB, and jobs use a lease plus an idempotent claim, so after a failure another scheduler can reclaim any job whose lease has lapsed. Two runners claiming the same job is prevented by a compare-and-swap on lease_id.
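The lease and compare-and-swap mechanics can be sketched with a conditional `UPDATE`, here against an in-memory SQLite database. This is an illustration under stated assumptions, not a definitive design: the `lease_expires_at` column, the 30-second TTL, and the helper names `try_claim`, `reclaim_if_expired`, and `heartbeat` are not in the data model above and are introduced only for the example.

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE job (job_id TEXT PRIMARY KEY, status TEXT, assigned_runner TEXT,"
    " attempt INTEGER, lease_id TEXT, lease_expires_at REAL)"  # last column is an assumption
)
db.execute("INSERT INTO job VALUES ('job-1', 'queued', NULL, 0, NULL, NULL)")

def try_claim(job_id, expected_lease, runner_id, now, ttl=30.0):
    """CAS on lease_id: succeeds only if the row still holds the lease we observed."""
    new_lease = uuid.uuid4().hex
    cur = db.execute(
        "UPDATE job SET lease_id = ?, assigned_runner = ?, status = 'running',"
        " attempt = attempt + 1, lease_expires_at = ?"
        " WHERE job_id = ? AND lease_id IS ?",  # in SQLite, IS also matches NULL
        (new_lease, runner_id, now + ttl, job_id, expected_lease),
    )
    db.commit()
    return new_lease if cur.rowcount == 1 else None

def reclaim_if_expired(job_id, runner_id, now):
    """What a replacement scheduler might do: reassign a job whose lease has lapsed."""
    lease, expires = db.execute(
        "SELECT lease_id, lease_expires_at FROM job WHERE job_id = ?", (job_id,)
    ).fetchone()
    if lease is not None and expires > now:
        return None  # lease is still live; the current runner keeps the job
    return try_claim(job_id, lease, runner_id, now)

def heartbeat(job_id, lease_id, now, ttl=30.0):
    """Extend the lease only if the caller still owns it."""
    cur = db.execute(
        "UPDATE job SET lease_expires_at = ? WHERE job_id = ? AND lease_id = ?",
        (now + ttl, job_id, lease_id),
    )
    db.commit()
    return cur.rowcount == 1

# Both runners saw the job unleased (lease_id NULL) and race to claim it.
lease_a = try_claim("job-1", None, "runner-a", now=0.0)   # wins
lease_b = try_claim("job-1", None, "runner-b", now=0.0)   # loses: lease_id changed
# runner-a stops heartbeating; its lease expires at t=30.
early = reclaim_if_expired("job-1", "runner-b", now=10.0)     # too early, lease still live
lease_c = reclaim_if_expired("job-1", "runner-b", now=100.0)  # lease lapsed, reclaim works
stale_ok = heartbeat("job-1", lease_a, now=101.0)             # old owner is rejected
print(lease_a is not None, lease_b, early, lease_c is not None, stale_ok)
```

Because every write is conditioned on the lease the caller last observed, a scheduler restart cannot double-assign a job: at most one conditional update matches the row, and a runner holding a stale lease learns that it lost ownership on its next heartbeat.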
Cost & performance
- Runner compute dominates cost. Optimize with idle reclamation, a warm pool, and a spot-instance mix for jobs that can tolerate interruption through checkpoints.
- Log explosion: segmented upload, compression, and tiering to cold storage after 30 days.
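Segmented upload and compression can be illustrated with a small generator that cuts a log stream into bounded chunks and gzips each one. This is a sketch under assumptions: the 64 KiB segment size and the `segment_log` helper are hypothetical, the upload itself is left out, and the 30-day move to cold storage would typically be a storage lifecycle policy rather than application code.

```python
import gzip

def segment_log(lines, max_bytes=64 * 1024):
    """Yield gzip-compressed segments, each holding at most about max_bytes of raw text."""
    buf, size = [], 0
    for line in lines:
        data = line.encode("utf-8") + b"\n"
        if buf and size + len(data) > max_bytes:
            yield gzip.compress(b"".join(buf))  # segment is full: emit it
            buf, size = [], 0
        buf.append(data)
        size += len(data)
    if buf:
        yield gzip.compress(b"".join(buf))      # final partial segment

# Synthetic, repetitive build log standing in for real runner output.
log = [f"step 3: compiling module {i}" for i in range(10_000)]
segments = list(segment_log(log))

raw_bytes = sum(len(line) + 1 for line in log)
packed_bytes = sum(len(s) for s in segments)
restored = b"".join(gzip.decompress(s) for s in segments).decode("utf-8").splitlines()
print(len(segments), raw_bytes, packed_bytes)
```

Independent segments let a runner upload while the job is still running, and let `GET /runs/{run_id}/logs` page through the log one segment at a time.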