What consensus actually is

Consensus = making N nodes agree on a single value even though some crash, some are slow, and the network drops or reorders messages. DDIA Ch.9 formalizes it as four properties: uniform agreement (no two nodes decide differently), integrity (no node decides twice), validity (the decided value was proposed by some node), termination (every non-crashed node eventually decides).

Every important use case reduces to consensus: leader election, atomic commit across shards (2PC on top of consensus = "transactions" in Spanner), linearizable register, uniqueness constraint. If you need linearizability across nodes, you need consensus or something equivalent.

Source cross-reference

DDIA Ch.8 ("Trouble with Distributed Systems") motivates why consensus is hard (no global clock, unreliable networks, partial failure). Ch.9 delivers the solutions. The two chapters belong together.

Paxos intuition

Lamport's 1998 Paxos (single-decree) has two phases: Prepare (the proposer picks a ballot number n and asks a majority: "promise not to accept anything numbered < n, and tell me the highest-numbered value you have accepted"), and Accept (the proposer sends a value to the majority; if any acceptor reported a previously accepted value, the proposer must re-propose the highest-numbered one; once a majority accepts, the value is decided). A ballot succeeds only if a majority quorum participates, so two concurrent ballots cannot decide different values — any two majorities overlap in at least one node, and that node's promise blocks the older ballot.
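The two phases can be sketched in a few lines. This is a minimal, in-memory illustration, not a real implementation: the names (Acceptor, prepare, accept, propose) are made up for the sketch, there is no networking or persistence, and all calls are synchronous.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0          # highest ballot number promised
        self.accepted = (0, None)  # (ballot, value) last accepted

    def prepare(self, n):
        """Phase 1: promise to ignore anything numbered < n,
        and report the highest-numbered value accepted so far."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        """Phase 2: accept (n, value) unless a higher ballot was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run one ballot; a value is decided only if a majority accepts it."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) < majority:
        return None  # could not gather a promise quorum
    # Key safety rule: adopt the highest-ballot value already accepted, if any.
    prior = max(granted, key=lambda t: t[0])
    chosen = prior[1] if prior[1] is not None else value
    acks = sum(a.accept(n, chosen) for a in acceptors)
    return chosen if acks >= majority else None
```

Note how a later ballot proposing "y" still decides the earlier value: the Prepare responses force the proposer to adopt it, which is exactly why two ballots cannot decide differently.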

Multi-Paxos amortizes the Prepare phase by electing a stable leader, who can then skip Prepare and run only Accept for a sequence of log slots. In practice, Multi-Paxos and Raft converge on nearly the same algorithm; the Raft paper (Ongaro & Ousterhout, 2014) explicitly redesigned consensus for teachability.

Raft: consensus you can understand

Raft has exactly three roles (follower, candidate, leader), one log per node, and just two RPCs (AppendEntries, RequestVote). A leader is elected for a term (a monotonically increasing integer). All writes go through the leader; the leader replicates them to followers; an entry is committed once a majority has stored it.

sequenceDiagram
    participant C as Client
    participant L as Leader (term 5)
    participant F1 as Follower 1
    participant F2 as Follower 2
    C->>L: write(x=42)
    L->>L: append to log[idx=10]
    par Replicate
      L->>F1: AppendEntries(term=5, idx=10, x=42)
      L->>F2: AppendEntries(term=5, idx=10, x=42)
    end
    F1-->>L: ok
    F2-->>L: ok
    L->>L: majority reached, commit idx=10
    L-->>C: committed
    L->>F1: AppendEntries(commitIdx=10)
    L->>F2: AppendEntries(commitIdx=10)
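The "majority reached, commit idx=10" step in the diagram is a simple calculation on the leader's bookkeeping. A sketch, assuming matchIndex is the leader's per-server record of the highest replicated log index (leader included); real Raft additionally requires the entry at that index to be from the leader's current term before committing.

```python
def commit_index(match_index):
    """Highest log index stored on a majority of servers:
    sort descending and take the index at the majority position."""
    ranked = sorted(match_index, reverse=True)
    majority = len(match_index) // 2 + 1
    return ranked[majority - 1]
```

With the 3-node cluster above, matchIndex [10, 10, 9] commits index 10 even before the slowest follower catches up.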

Concrete latency on a 3-node Raft cluster in one DC: median write ~2 ms (one intra-DC round trip to reach a majority). Across two DCs in the same region: ~5 ms. Across continents: 80–150 ms, which is why Spanner keeps most writes within a region and runs one Paxos group per partition rather than one global group.

Anti-pattern

Writing Raft from scratch for an interview. Interviewers want to hear "I'd use etcd/ZooKeeper for leader election and lock management." They do not want a bug-for-bug reimplementation; production Raft took the CockroachDB team 5 engineer-years to stabilize.

Leader election & fencing tokens

Raft/Paxos gives you a leader for a term. Two dangers remain:

  1. Stale leader: the old leader hasn't noticed it was deposed (paused VM, long GC). It may send a stale write to storage.
  2. Zombie writes: write issued before lease expired but arriving at storage after a new leader's write.

Solution: fencing tokens. Every lease carries a monotonically increasing token. The storage layer rejects writes carrying a token older than the highest it has seen. Martin Kleppmann's blog post on "How to do distributed locking" (2016) popularized the pattern after Redis's Redlock was found to allow a similar failure.

# Pseudocode
lease = etcd.acquire_lease("db-leader")  # returns (ok, fence_token=47)
storage.write(key, value, fence=47)  # storage rejects if fence < previously_seen
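The pseudocode above shows the client side; the storage side is where the guarantee actually lives. A minimal in-memory sketch, assuming the storage layer tracks the highest token it has seen (FencedStore and StaleFence are illustrative names, not from any library):

```python
class StaleFence(Exception):
    """Raised when a write carries a token older than the highest seen."""


class FencedStore:
    def __init__(self):
        self.data = {}
        self.highest_fence = -1

    def write(self, key, value, fence):
        # A deposed leader's late write carries a stale token; reject it.
        # Strictly-lower comparison: the current leader reuses its own
        # token across many writes.
        if fence < self.highest_fence:
            raise StaleFence(f"fence {fence} < seen {self.highest_fence}")
        self.highest_fence = fence
        self.data[key] = value
```

This is why the resource, not the lock service, is the source of truth on fence ordering: only the storage layer sees the writes arrive.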

etcd, ZooKeeper, and the right layer

System       | Algorithm        | Sweet spot
ZooKeeper    | ZAB (Paxos-like) | Config mgmt, leader election, service discovery. Hadoop/Kafka heritage.
etcd         | Raft             | Kubernetes' source of truth, service registry, small KV. Read-mostly workloads.
Consul       | Raft             | Service mesh + KV + multi-DC.
FoundationDB | Paxos + MVCC     | Strongly consistent ordered KV, ACID. Snowflake/Apple use.

Rule of thumb: consensus systems handle 10k–50k writes/sec cluster-wide. They are not your primary data store for user data; they are the coordination layer that tells the primary store who the leader is, which shards exist, and what the config is. When a team puts 1M ops/sec through ZooKeeper, they page at 3 a.m.

Split-brain, FLP, and time

FLP impossibility (Fischer, Lynch, Paterson 1985): in a fully asynchronous model with even one faulty node, no deterministic consensus algorithm can guarantee termination. Real systems escape by assuming partial synchrony — the network is "eventually" timely — and using timeouts. This is why Raft's election timeout (150–300 ms) matters: too short causes spurious elections, too long stretches outage windows.
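The timeout escape hatch is small enough to sketch. An illustrative fragment (function names are made up; the 150–300 ms range is the one quoted above): a follower starts an election only after hearing no heartbeat for a randomized interval, and the randomization makes simultaneous candidacies, and thus split votes, unlikely.

```python
import random


def election_timeout_ms(lo=150.0, hi=300.0):
    """Each follower draws its own timeout so candidates rarely tie."""
    return random.uniform(lo, hi)


def should_start_election(now_ms, last_heartbeat_ms, timeout_ms):
    """Partial-synchrony assumption in one line: silence past the
    timeout is treated as leader failure."""
    return now_ms - last_heartbeat_ms >= timeout_ms
```
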

Split-brain happens when two nodes both believe they are leader. Prevention: majority quorums (you cannot have two disjoint majorities in one cluster). A 4-node cluster is strictly worse than a 3-node one: the same fault tolerance (one failure) at a larger quorum (3 of 4 instead of 2 of 3). Always use odd-sized clusters: 3, 5, or 7.

  • 3 nodes: tolerate 1 failure; cheapest production choice.
  • 5 nodes: tolerate 2; standard for planetary services (etcd in GKE).
  • 7 nodes: tolerate 3; only if you have really loud pagers.
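The sizing arithmetic behind the list above in two lines (a trivial helper, not from any library):

```python
def quorum(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1


def faults_tolerated(n):
    """Nodes that can fail while a quorum survives."""
    return n - quorum(n)
```

Evaluating it for n=4 vs n=3 shows why even sizes lose: quorum(4) is 3 but faults_tolerated(4) is still 1, so the fourth node buys nothing.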

OpenAI-specific

Model-weight registry and experiment tracking are classic consensus-layer workloads: small data, linearizable reads, rare writes. Answer "how do you pick which model version serves traffic?" with "etcd cluster holding the current pointer + fencing token on the routing layer."

Interview-ready patterns

  1. Distributed lock with lease + fencing: acquire lease in etcd, get fence token, pass it to the resource on every call. The resource is the source of truth on fence ordering.
  2. Config distribution: store config blob in etcd; services watch and reload on change. Push via watches; pull periodically as backup.
  3. Leader of a worker pool: one worker holds an etcd lease, runs the singleton job (e.g., cron, compaction scheduler). Others wait. On lease expiry, they race; fencing protects the shared resource.
  4. 2PC on top of consensus: Spanner pattern. Each partition's Paxos group is a participant; the coordinator writes the commit record to its own Paxos group. See transactions.
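Pattern 3 can be sketched end to end with an in-memory stand-in for the lock service. Everything here is hypothetical scaffolding (LeaseManager and try_acquire are invented names, not an etcd API): the point is that the token increments on every acquisition, so the fenced resource can reject a worker whose lease silently expired.

```python
import time


class LeaseManager:
    """In-memory stand-in for an etcd-style lease on a singleton job."""

    def __init__(self, ttl_s):
        self.ttl = ttl_s
        self.holder = None
        self.expires = 0.0
        self.token = 0

    def try_acquire(self, worker, now=None):
        """Grant the lease if free or expired; return a fence token,
        or None if another worker currently holds it."""
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expires:
            self.holder = worker
            self.expires = now + self.ttl
            self.token += 1  # monotonic: every new holder gets a higher fence
            return self.token
        return None
```

On expiry the waiting workers race; whoever wins gets a strictly higher token, and the shared resource rejects anything the old holder sends afterwards.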

Anthropic-specific

Anthropic's interviewers probe "what happens during a rolling deploy?" Answer: Raft's joint-consensus membership change (or etcd's learner mode) lets you add/remove nodes one at a time without ever losing quorum. Describe the C_old,new joint configuration transition explicitly.
