Why replicate at all

Three orthogonal reasons, in DDIA Ch.5's framing: latency (serve reads from a replica close to the user), availability (survive node failure), and throughput (scale reads horizontally). Durability is a bonus. If you need none of these, do not replicate — a single Postgres with good backups beats a mis-tuned 5-node cluster every time.

The hard part is not copying bytes; it is deciding what the reader sees when writes and reads race. That is why Kleppmann spends Ch.5 on mechanics and Ch.9 on what the resulting guarantees are called.

Source cross-reference

DDIA Ch.5 covers leader/follower, multi-leader, leaderless with Dynamo-style quorums. DDIA Ch.9 defines linearizability, causal consistency, and consensus. Acing SDI Ch.4 gives the SQL-vs-NoSQL framing most interviewers expect.

Single-leader replication

One node accepts writes; followers replay its log. Synchronous followers block the writer until they ack (zero RPO, slow tail latency); asynchronous followers lag (fast, but recent writes can be lost on failover). Most real systems run one sync + many async followers, the shape that Postgres (synchronous_standby_names), MySQL semi-synchronous replication, and AWS Aurora all converge on.

```mermaid
flowchart LR
    C[Client] -->|write| L[Leader]
    L -->|sync WAL| F1[Follower 1 sync]
    L -->|async| F2[Follower 2]
    L -->|async| F3[Follower 3]
    R[Read client] --> F2
```

  • Replication lag: typical 10–100 ms intra-region, seconds cross-region. Must-know for read-your-writes discussion.
  • Failover: who promotes the new leader? A consensus service (etcd, ZooKeeper) holds the lease; see consensus.
  • Throughput ceiling: single-leader Postgres caps at ~50 k writes/sec on a fat NVMe box. Beyond that, partition (see partitioning) or move to leaderless.
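
The one-sync-plus-async chain above can be sketched as a toy model (all class and method names here are illustrative, not any real replication API):

```python
class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []                 # replayed WAL entries

    def replay(self, entry):
        self.log.append(entry)

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.log = []
        self.sync = sync_follower
        self.asyncs = async_followers

    def write(self, entry):
        self.log.append(entry)
        self.sync.replay(entry)       # blocks until the sync follower acks
        # async followers replay later, on their own schedule
        return len(self.log)          # LSN handed back to the client

    def ship_async(self):
        # in reality a background process streams the WAL; here we batch it
        for f in self.asyncs:
            f.log = list(self.log)

f1, f2 = Follower("sync"), Follower("async")
leader = Leader(f1, [f2])
lsn = leader.write("UPDATE ...")
# f1 already has the entry; f2 lags until ship_async() runs
```

The zero-RPO claim holds only for f1: if the leader dies before ship_async(), f2 is the follower you must not promote.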

Anti-pattern

Running "multi-master" Postgres via logical replication as a write-scaling strategy. Conflicts are not automatic; you own every last-writer-wins bug. Don't do it unless you have a conflict resolver (CRDT, per-row ownership).

Multi-leader and leaderless

Multi-leader (e.g., Postgres BDR, CouchDB, calendar apps with offline clients): each region accepts writes, replicates async to others. Concurrent writes to the same key create conflicts resolved by vector clocks, LWW timestamps, or CRDT merge. Best fit: geographically distributed with mostly-disjoint write sets (per-user data).
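
The "mostly-disjoint write sets" case is where CRDT merges shine; a minimal grow-only counter sketch (hand-rolled for illustration, not any library's API):

```python
# Toy G-counter CRDT: each leader increments only its own region's slot;
# merge is an element-wise max, so concurrent increments never conflict
# and merge order does not matter.
def increment(counter, region, n=1):
    counter = dict(counter)
    counter[region] = counter.get(region, 0) + n
    return counter

def merge(a, b):
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

def value(counter):
    return sum(counter.values())

us = increment({}, "us-east")        # {'us-east': 1}
eu = increment({}, "eu-west", 2)     # {'eu-west': 2}
merged = merge(us, eu)               # value(merged) == 3, either merge order
```

This commutativity is exactly what LWW timestamps lack: LWW silently drops one side of a concurrent pair.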

Leaderless / Dynamo-style (Cassandra, Riak, ScyllaDB): clients write to N replicas, consider it successful when W ack; reads query R replicas. If W+R > N you get quorum overlap. Anti-entropy via Merkle trees + read-repair + hinted handoff patches divergence.
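
The W/R quorum mechanics with read-repair can be sketched with toy in-memory replicas, assuming per-key version numbers for ordering (an illustration, not Cassandra's actual code path):

```python
import random

N, W, R = 3, 2, 2    # W + R > N guarantees the read quorum overlaps the write quorum

replicas = [dict() for _ in range(N)]     # each maps key -> (version, value)

def write(key, value, version):
    acked = 0
    for rep in replicas:
        rep[key] = (version, value)
        acked += 1
        if acked == W:
            break                         # success after W acks; the rest lag

def read(key):
    responses = [rep.get(key) for rep in random.sample(replicas, R)]
    latest = max((r for r in responses if r), default=None)
    if latest:
        for rep in replicas:              # read-repair: push the newest version back
            if rep.get(key, (0, None))[0] < latest[0]:
                rep[key] = latest
    return latest

write("k", "v1", version=1)
write("k", "v2", version=2)
# any R-of-N sample must include a replica that saw version 2
```

Note the read is deterministic despite the random sample: overlap, not luck, is what W+R > N buys you.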

| Style | Writes scale | Conflict risk | Example |
|---|---|---|---|
| Single-leader | 1 node | None (serialized) | Postgres, MySQL |
| Multi-leader | N regions | High | CouchDB, BDR |
| Leaderless | N nodes | Medium (quorum) | Cassandra, Dynamo |

CAP, PACELC, and what they actually say

CAP (Brewer, 2000) says during a network Partition you must choose Consistency or Availability. It does not say "pick 2 of 3"; partitions are not optional. Useful but oversimplified.

PACELC (Abadi, 2010) repairs the gap: If P, choose A or C; Else choose Latency or Consistency. This is the version staff engineers actually use. Spanner is a PC/EC system (consistent even when healthy, at a latency cost); Dynamo is PA/EL (available and fast, eventually consistent). State the letters explicitly in your answer.

OpenAI-specific

For conversation-history stores, OpenAI-style answers often choose PA/EC: available during partitions but strongly consistent when healthy (same-session read-your-writes must hold). Achieve this with a single-leader region + causal token the client returns on each request.
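
A minimal sketch of that causal-token flow, with illustrative names (a real system would carry the token in a request header and compare it against the replica's applied LSN):

```python
# Toy read-your-writes: the client echoes back the LSN of its last write;
# a replica may serve the read only if it has applied at least that LSN,
# otherwise the request falls back to the leader, which is always current.
class Store:
    def __init__(self):
        self.applied_lsn = 0
        self.data = {}

    def apply(self, lsn, key, value):
        self.data[key] = value
        self.applied_lsn = lsn

def read(key, token, replica, leader):
    if replica.applied_lsn >= token:      # replica has caught up to the client's write
        return replica.data.get(key)
    return leader.data.get(key)

leader, replica = Store(), Store()
leader.apply(1, "profile", "new")         # write goes to the leader; client keeps token=1
stale_read = read("profile", token=1, replica=replica, leader=leader)   # lagging replica -> leader
replica.apply(1, "profile", "new")        # replication catches up
fresh_read = read("profile", token=1, replica=replica, leader=leader)   # replica now serves it
```

The fallback path is what makes this PA/EC rather than PC/EC: during a partition you can loosen the check instead of refusing the read.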

Consistency models ladder

From strongest to weakest (DDIA Ch.9):

  1. Linearizable (atomic): every operation appears to take effect at a single point in real time. Required for leader election, distributed locks, uniqueness constraints. Cost: at least one RTT to a consensus quorum.
  2. Sequential: all clients see the same order, but the order can lag real time.
  3. Causal: causally related operations are seen in order; concurrent ones may differ. Achievable with vector clocks without consensus.
  4. Read-your-writes (session guarantee): after you write, you see it. Cheap to implement with sticky sessions or client-tracked last-LSN.
  5. Monotonic reads: once you see a value, you never see an earlier one. Stick a user to one replica.
  6. Eventual: converges "eventually" (undefined duration). Most leaderless defaults. Fine for social-media like counts.
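
Rungs 4–5 are cheap because they need only routing discipline, not consensus. A sketch of monotonic reads via stable user-to-replica hashing (illustrative; naive modulo hashing reshuffles users when the replica set changes, which is why production systems use consistent hashing instead):

```python
import hashlib

replicas = ["replica-a", "replica-b", "replica-c"]

def route(user_id):
    # Stable hash: the same user always lands on the same replica, so they
    # can never observe a replica that is further behind than the one they
    # read from last time -- which is all monotonic reads requires.
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]
```

Read-your-writes needs the extra token from the previous section because the user's own write may land on the leader, not their sticky replica.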

Interview heuristic: name the weakest model the product can tolerate, then the mechanism that delivers it. "User profiles need read-your-writes after save; monotonic-read across refreshes is enough elsewhere; analytics dashboards can run on 10-minute stale replicas."

Failure modes and real outages

  • Split-brain: network partition isolates the old leader; a new leader is elected; both accept writes. Mitigation: fencing tokens, STONITH, quorum-based leases. See consensus.
  • Replication lag spikes: a slow follower falls minutes behind; reads served from it look stale. Monitor seconds_behind_master and evict from read pool beyond 1 s.
  • 2015 AWS DynamoDB metadata storm: a 4-hour us-east-1 outage in which storage nodes' retry storms against an internal metadata service amplified a slowdown triggered by replica lag. Lesson: backpressure + jittered exponential backoff, not unbounded retries.
  • 2017 GitLab Postgres deletion: the standby was ~4 GB of WAL behind when the primary's data directory was accidentally wiped, and the backup mechanisms turned out to have silently failed. Runbook the RPO you actually have, not the one on the slide.
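
The fencing-token mitigation for split-brain can be sketched directly (toy storage class, names mine): storage tracks the highest leader epoch it has seen and rejects anything stamped with an older one.

```python
# A deposed leader that still believes it is in charge cannot corrupt state,
# because every write carries the epoch issued at election time and the
# store refuses writes from any epoch older than the highest seen.
class FencedStore:
    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            raise PermissionError(f"stale epoch {epoch}, current is {self.highest_epoch}")
        self.highest_epoch = epoch
        self.data[key] = value

store = FencedStore()
store.write(epoch=1, key="x", value="from old leader")
store.write(epoch=2, key="x", value="from new leader")   # failover happened
try:
    store.write(epoch=1, key="x", value="zombie write")  # old leader wakes up
except PermissionError:
    pass                                                 # fenced off
```

This is the same epoch the Anthropic-style audit answer below bakes into response headers: one monotonic number serves both safety and forensics.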

Anthropic-specific

Anthropic interviewers push on auditability: after a failover, can you reconstruct which requests were served by which leader? Answer: write-ahead log with monotonic LSN + leader-epoch baked into every response header, plus a separate append-only audit store.

Interview decision tree

  1. Do you need strong consistency on writes? → single-leader (Postgres/Spanner) or consensus-backed (etcd/CockroachDB).
  2. Need globally-distributed writes with disjoint keys? → multi-leader with per-user home region.
  3. Need extreme write throughput + tolerate LWW? → leaderless (Cassandra/Dynamo), R=W=quorum.
  4. Only reads scale? → single-leader + async read replicas, with a causal token for read-your-writes.

State the chosen model, then its PACELC letters, then the one anti-pattern you are explicitly avoiding. That three-sentence answer is what Kleppmann would write on a whiteboard.
