Why replicate at all

Three orthogonal reasons, in DDIA Ch.5's framing: latency (serve reads from a replica close to the user), availability (survive node failure), and throughput (scale reads horizontally). Durability is a bonus. If you need none of these, do not replicate — a single Postgres with good backups beats a mis-tuned 5-node cluster every time.

The hard part is not copying bytes; it is deciding what the reader sees when writes and reads race. That is why Kleppmann spends Ch.5 on mechanics and Ch.9 on what the resulting guarantees are called.

Source cross-reference

DDIA Ch.5 covers leader/follower, multi-leader, leaderless with Dynamo-style quorums. DDIA Ch.9 defines linearizability, causal consistency, and consensus. Acing SDI Ch.4 gives the SQL-vs-NoSQL framing most interviewers expect.

Single-leader replication

One node accepts writes; followers replay its log. Synchronous followers block the writer until they ack (zero RPO, slow tail latency); asynchronous followers lag (fast, but recent writes can be lost on failover). Most real systems run one sync + many async followers, the shape that Postgres (synchronous_standby_names), MySQL semi-synchronous replication, and AWS Aurora all converge on.

```mermaid
flowchart LR
    C[Client] -->|write| L[Leader]
    L -->|sync WAL| F1[Follower 1 sync]
    L -->|async| F2[Follower 2]
    L -->|async| F3[Follower 3]
    R[Read client] --> F2
```

  • Replication lag: typical 10–100 ms intra-region, seconds cross-region. Must-know for read-your-writes discussion.
  • Failover: who promotes the new leader? A consensus service (etcd, ZooKeeper) holds the lease; see consensus.
  • Throughput ceiling: single-leader Postgres caps at ~50 k writes/sec on a fat NVMe box. Beyond that, partition (see partitioning) or move to leaderless.
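
The one-sync-plus-async chain above can be sketched as a toy model (all class and method names here are illustrative, not any real replication API):

```python
class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []                 # replayed WAL entries

    def replay(self, entry):
        self.log.append(entry)

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.log = []
        self.sync = sync_follower
        self.asyncs = async_followers

    def write(self, entry):
        self.log.append(entry)
        self.sync.replay(entry)       # blocks until the sync follower acks
        # async followers replay later, on their own schedule
        return len(self.log)          # LSN handed back to the client

    def ship_async(self):
        # in reality a background process streams the WAL; here we batch it
        for f in self.asyncs:
            f.log = list(self.log)

f1, f2 = Follower("sync"), Follower("async")
leader = Leader(f1, [f2])
lsn = leader.write("UPDATE ...")
# f1 already has the entry; f2 lags until ship_async() runs
```

The zero-RPO claim holds only for f1: if the leader dies before ship_async(), f2 is the follower you must not promote.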

Anti-pattern

Running "multi-master" Postgres via logical replication as a write-scaling strategy. Conflicts are not automatic; you own every last-writer-wins bug. Don't do it unless you have a conflict resolver (CRDT, per-row ownership).

Multi-leader and leaderless

Multi-leader (e.g., Postgres BDR, CouchDB, calendar apps with offline clients): each region accepts writes, replicates async to others. Concurrent writes to the same key create conflicts resolved by vector clocks, LWW timestamps, or CRDT merge. Best fit: geographically distributed with mostly-disjoint write sets (per-user data).
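
The "mostly-disjoint write sets" case is where CRDT merges shine; a minimal grow-only counter sketch (hand-rolled for illustration, not any library's API):

```python
# Toy G-counter CRDT: each leader increments only its own region's slot;
# merge is an element-wise max, so concurrent increments never conflict
# and merge order does not matter.
def increment(counter, region, n=1):
    counter = dict(counter)
    counter[region] = counter.get(region, 0) + n
    return counter

def merge(a, b):
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

def value(counter):
    return sum(counter.values())

us = increment({}, "us-east")        # {'us-east': 1}
eu = increment({}, "eu-west", 2)     # {'eu-west': 2}
merged = merge(us, eu)               # value(merged) == 3, either merge order
```

This commutativity is exactly what LWW timestamps lack: LWW silently drops one side of a concurrent pair.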

Leaderless / Dynamo-style (Cassandra, Riak, ScyllaDB): clients write to N replicas, consider it successful when W ack; reads query R replicas. If W+R > N you get quorum overlap. Anti-entropy via Merkle trees + read-repair + hinted handoff patches divergence.
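
The W/R quorum mechanics with read-repair can be sketched with toy in-memory replicas, assuming per-key version numbers for ordering (an illustration, not Cassandra's actual code path):

```python
import random

N, W, R = 3, 2, 2    # W + R > N guarantees the read quorum overlaps the write quorum

replicas = [dict() for _ in range(N)]     # each maps key -> (version, value)

def write(key, value, version):
    acked = 0
    for rep in replicas:
        rep[key] = (version, value)
        acked += 1
        if acked == W:
            break                         # success after W acks; the rest lag

def read(key):
    responses = [rep.get(key) for rep in random.sample(replicas, R)]
    latest = max((r for r in responses if r), default=None)
    if latest:
        for rep in replicas:              # read-repair: push the newest version back
            if rep.get(key, (0, None))[0] < latest[0]:
                rep[key] = latest
    return latest

write("k", "v1", version=1)
write("k", "v2", version=2)
# any R-of-N sample must include a replica that saw version 2
```

Note the read is deterministic despite the random sample: overlap, not luck, is what W+R > N buys you.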

| Style | Writes scale | Conflict risk | Example |
|---|---|---|---|
| Single-leader | 1 node | None (serialized) | Postgres, MySQL |
| Multi-leader | N regions | High | CouchDB, BDR |
| Leaderless | N nodes | Medium (quorum) | Cassandra, Dynamo |

CAP, PACELC, and what they actually say

CAP (Brewer, 2000) says during a network Partition you must choose Consistency or Availability. It does not say "pick 2 of 3"; partitions are not optional. Useful but oversimplified.

PACELC (Abadi, 2010) repairs the gap: If P, choose A or C; Else choose Latency or Consistency. This is the version staff engineers actually use. Spanner is a PC/EC system (consistent even when healthy, at a latency cost); Dynamo is PA/EL (available and fast, eventually consistent). State the letters explicitly in your answer.

OpenAI-specific

For conversation-history stores, OpenAI-style answers often choose PA/EC: available during partitions but strongly consistent when healthy (same-session read-your-writes must hold). Achieve this with a single-leader region + causal token the client returns on each request.
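
A minimal sketch of that causal-token flow, with illustrative names (a real system would carry the token in a request header and compare it against the replica's applied LSN):

```python
# Toy read-your-writes: the client echoes back the LSN of its last write;
# a replica may serve the read only if it has applied at least that LSN,
# otherwise the request falls back to the leader, which is always current.
class Store:
    def __init__(self):
        self.applied_lsn = 0
        self.data = {}

    def apply(self, lsn, key, value):
        self.data[key] = value
        self.applied_lsn = lsn

def read(key, token, replica, leader):
    if replica.applied_lsn >= token:      # replica has caught up to the client's write
        return replica.data.get(key)
    return leader.data.get(key)

leader, replica = Store(), Store()
leader.apply(1, "profile", "new")         # write goes to the leader; client keeps token=1
stale_read = read("profile", token=1, replica=replica, leader=leader)   # lagging replica -> leader
replica.apply(1, "profile", "new")        # replication catches up
fresh_read = read("profile", token=1, replica=replica, leader=leader)   # replica now serves it
```

The fallback path is what makes this PA/EC rather than PC/EC: during a partition you can loosen the check instead of refusing the read.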

Consistency models ladder

From strongest to weakest (DDIA Ch.9):

  1. Linearizable (atomic): every operation appears to take effect at a single point in real time. Required for leader election, distributed locks, uniqueness constraints. Cost: at least one RTT to a consensus quorum.
  2. Sequential: all clients see the same order, but the order can lag real time.
  3. Causal: causally related operations are seen in order; concurrent ones may differ. Achievable with vector clocks without consensus.
  4. Read-your-writes (session guarantee): after you write, you see it. Cheap to implement with sticky sessions or client-tracked last-LSN.
  5. Monotonic reads: once you see a value, you never see an earlier one. Stick a user to one replica.
  6. Eventual: converges "eventually" (undefined duration). Most leaderless defaults. Fine for social-media like counts.
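
Rungs 4–5 are cheap because they need only routing discipline, not consensus. A sketch of monotonic reads via stable user-to-replica hashing (illustrative; naive modulo hashing reshuffles users when the replica set changes, which is why production systems use consistent hashing instead):

```python
import hashlib

replicas = ["replica-a", "replica-b", "replica-c"]

def route(user_id):
    # Stable hash: the same user always lands on the same replica, so they
    # can never observe a replica that is further behind than the one they
    # read from last time -- which is all monotonic reads requires.
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]
```

Read-your-writes needs the extra token from the previous section because the user's own write may land on the leader, not their sticky replica.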

Interview heuristic: name the weakest model the product can tolerate, then the mechanism that delivers it. "User profiles need read-your-writes after save; monotonic-read across refreshes is enough elsewhere; analytics dashboards can run on 10-minute stale replicas."

Failure modes and real outages

  • Split-brain: network partition isolates the old leader; a new leader is elected; both accept writes. Mitigation: fencing tokens, STONITH, quorum-based leases. See consensus.
  • Replication lag spikes: a slow follower falls minutes behind; reads served from it look stale. Monitor seconds_behind_master and evict from read pool beyond 1 s.
  • 2015 AWS DynamoDB metadata storm: a 4-hour us-east-1 outage in which storage nodes' retry storms against an internal metadata service amplified a slowdown triggered by replica lag. Lesson: backpressure + jittered exponential backoff, not unbounded retries.
  • 2017 GitLab Postgres deletion: the standby was ~4 GB of WAL behind when the primary's data directory was accidentally wiped, and the backup mechanisms turned out to have silently failed. Runbook the RPO you actually have, not the one on the slide.
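
The fencing-token mitigation for split-brain can be sketched directly (toy storage class, names mine): storage tracks the highest leader epoch it has seen and rejects anything stamped with an older one.

```python
# A deposed leader that still believes it is in charge cannot corrupt state,
# because every write carries the epoch issued at election time and the
# store refuses writes from any epoch older than the highest seen.
class FencedStore:
    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            raise PermissionError(f"stale epoch {epoch}, current is {self.highest_epoch}")
        self.highest_epoch = epoch
        self.data[key] = value

store = FencedStore()
store.write(epoch=1, key="x", value="from old leader")
store.write(epoch=2, key="x", value="from new leader")   # failover happened
try:
    store.write(epoch=1, key="x", value="zombie write")  # old leader wakes up
except PermissionError:
    pass                                                 # fenced off
```

This is the same epoch the Anthropic-style audit answer below bakes into response headers: one monotonic number serves both safety and forensics.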

Anthropic-specific

Anthropic interviewers push on auditability: after a failover, can you reconstruct which requests were served by which leader? Answer: write-ahead log with monotonic LSN + leader-epoch baked into every response header, plus a separate append-only audit store.

Interview decision tree

  1. Do you need strong consistency on writes? → single-leader (Postgres/Spanner) or consensus-backed (etcd/CockroachDB).
  2. Need globally-distributed writes with disjoint keys? → multi-leader with per-user home region.
  3. Need extreme write throughput + tolerate LWW? → leaderless (Cassandra/Dynamo), R=W=quorum.
  4. Only reads scale? → single-leader + async read replicas, with a causal token for read-your-writes.

State the chosen model, then its PACELC letters, then the one anti-pattern you are explicitly avoiding. That three-sentence answer is what Kleppmann would write on a whiteboard.
