Why storage engines matter

Picking "Postgres" or "Cassandra" is picking a storage engine architecture. DDIA Ch.3 argues — correctly — that understanding the B-tree vs LSM-tree distinction is what lets you predict the performance envelope of any new database you meet. Two rules:

  • B-trees are tuned for read-heavy, mutation-heavy workloads with low latency. MySQL/InnoDB, Postgres, SQL Server, Oracle.
  • LSM-trees are tuned for write-heavy workloads with time-series or append-like access. RocksDB, Cassandra, HBase, DynamoDB, ScyllaDB.

Source cross-reference

DDIA Ch.3 is the single best explanation of both engines. Pair with the original LSM paper (O'Neil et al. 1996) and Google's Bigtable paper (2006) for depth.

B-tree mechanics

A balanced n-ary tree where each page (4–16 KB) holds a sorted array of keys and child pointers. A 4-level B-tree with 256-key fan-out indexes 4 billion rows; a point read costs log_256(N) ≈ 4 disk seeks even for a huge table. With buffer-pool caching of the inner pages, only the leaf read hits disk (~100 µs on NVMe).
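The height arithmetic above is easy to sanity-check with a few lines (a back-of-envelope sketch; the function name is ours):

```python
import math

def btree_height(n_rows: int, fanout: int) -> int:
    # Levels needed for a B-tree with the given fan-out to index n_rows keys.
    return max(1, math.ceil(math.log(n_rows, fanout)))

print(btree_height(4_000_000_000, 256))  # 4 levels, matching the text
```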

Writes are in-place: find leaf, modify, optionally split. To survive crashes mid-split, every write first goes to a write-ahead log (WAL). Recovery replays WAL into the tree.

  • Update latency: ~1 WAL append (fsync: 50–200 µs) + occasional leaf fsync.
  • Write amplification: ~1–3× (WAL + page write).
  • Space amplification: low; each logical byte is ~1.3 physical bytes after indexes.
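The WAL-before-page ordering described above can be sketched minimally (the record format and names are illustrative, not any real engine's):

```python
import os
import struct

class WalWriter:
    """Append-then-fsync a log record before mutating the page in memory."""
    def __init__(self, path: str):
        self.f = open(path, "ab")

    def log_update(self, page_no: int, payload: bytes) -> None:
        # Record layout (illustrative): 4-byte page number + 2-byte length + payload.
        self.f.write(struct.pack(">IH", page_no, len(payload)) + payload)
        self.f.flush()
        os.fsync(self.f.fileno())  # durability point: the 50-200 µs fsync cited above

def update(wal: WalWriter, pages: dict, page_no: int, payload: bytes) -> None:
    wal.log_update(page_no, payload)  # 1. WAL append + fsync (crash-safe)
    pages[page_no] = payload          # 2. in-place page mutation, flushed lazily
```

Recovery would replay the log records in order, re-applying each to its page.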

Anti-pattern

Claiming B-trees have "no write amp." They write both WAL and page. A random-write workload against a 10 TB B-tree with 16 KB pages produces 16 KB I/O per 100-byte logical write = 160× amplification in the worst case.

LSM-tree mechanics

Writes go to an in-memory memtable (sorted by key; typically a skip list). When memtable hits threshold (~64 MB), it is flushed as an immutable SSTable to disk. Background compaction merges SSTables into larger sorted runs.

flowchart TB
    W[Write] --> M[Memtable<br/>RAM, sorted]
    M -- flush 64 MB --> L0[(L0 SSTables)]
    L0 -- compact --> L1[(L1 SSTables)]
    L1 -- compact --> L2[(L2 SSTables)]
    L2 -- compact --> L3[(L3 SSTables)]
    R[Read] --> M
    R --> L0
    R --> L1
    R --> L2
    R --> L3

Reads must check memtable + all SSTable levels; bloom filters (see below) short-circuit most negative lookups. Writes are always sequential appends, which SSDs prefer.

  • Write throughput: 10–100 MB/s sustained per disk for LSM vs 1–10 MB/s for B-tree on random-write workloads.
  • Read latency: higher tail (may visit many SSTables).
  • Write amplification: 10–30× (leveled compaction); 2–10× (tiered). See next section.
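A toy version of the write and read paths above, with a dict standing in for the skip-list memtable and sorted lists for SSTables (all names hypothetical; no bloom filters or compaction here):

```python
class TinyLSM:
    """Toy LSM: dict memtable flushed to immutable sorted runs; reads check newest first."""
    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}
        self.sstables = []          # immutable sorted runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # in-memory write; disk I/O happens only on flush
        if len(self.memtable) >= self.limit:
            self.sstables.append(sorted(self.memtable.items()))  # flush as sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:    # memtable holds the newest version
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest SSTable first
            for k, v in run:        # real engines binary-search and consult a bloom filter
                if k == key:
                    return v
        return None
```

The `get` path visiting every run is exactly the read amplification the bullets above describe.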

Amplification: write, read, space

| Amplification | Definition | B-tree | LSM (leveled) | LSM (tiered) |
|---|---|---|---|---|
| Write | physical bytes / logical bytes | 1–3× | 10–30× | 2–10× |
| Read | physical reads / logical read | 1 (leaf) | 1–3 | 5–10 |
| Space | physical bytes / logical bytes | 1.3× | 1.1× | 1.5–2× |

Quote to memorize: "Leveled compaction gives write amp 10–30× at the cost of space amp 1.1×; tiered gives 2–10× at the cost of read amp 5–10×." State it with numbers; interviewers love the specific ratios.
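As a back-of-envelope check on those ratios: a common approximation is that leveled compaction rewrites each byte roughly size_ratio times per level, while tiered rewrites it about once per level merge (a simplification; real engines vary):

```python
def leveled_write_amp(size_ratio: int, levels: int) -> int:
    # Each byte is rewritten ~size_ratio times per level as the level fills up.
    return size_ratio * levels

def tiered_write_amp(levels: int) -> int:
    # Each byte is rewritten roughly once per level merge.
    return levels

print(leveled_write_amp(10, 3))  # 30, the top of the 10-30x range quoted above
print(tiered_write_amp(4))       # 4, inside the 2-10x range
```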

Compaction strategies

Leveled compaction (LevelDB, RocksDB default, CockroachDB): L_i is ~10× L_{i-1}; each level is a single sorted run (except L0). On compaction, one SSTable from L_i merges with overlapping SSTables in L_{i+1}. Minimizes space amp and read amp, pays in write amp.

Tiered compaction (Cassandra Size-Tiered STCS, many time-series DBs): L_i holds K similar-sized SSTables; when full, merge all K into one at L_{i+1}. Minimizes write amp, pays in read amp (many overlapping runs) and space amp (pre-compaction duplicates).

Time-window compaction (Cassandra TWCS): partition data by write-time window (1 hour, 1 day); compact only within window. Ideal for time-series with TTL — drop whole SSTables on expiry with zero compaction cost.
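The TWCS idea (bucket SSTables by write-time window, drop whole expired windows) fits in a few lines; the helper names are ours:

```python
from datetime import datetime, timedelta, timezone

def window_key(ts: datetime, window: timedelta) -> int:
    # Bucket a write timestamp into a fixed window (e.g. 1 hour).
    return int(ts.timestamp() // window.total_seconds())

def expired_windows(windows: dict, now: datetime, ttl: timedelta,
                    window: timedelta) -> list:
    # Whole windows older than the TTL can be dropped as entire SSTables,
    # with zero compaction I/O.
    cutoff = window_key(now - ttl, window)
    return [w for w in windows if w < cutoff]
```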

Real numbers: a Cassandra node running STCS on 500 GB of data generates ~8 TB of compaction I/O per day; switching to LCS drops that to ~15 TB amortized over the life of the data, and p99 read latency falls ~40%. Instagram's 2019 write-up of their STCS-to-LCS migration is the canonical reference.

Bloom filters and read-path tricks

A Bloom filter is a compact probabilistic set: "might contain, or definitely does not." 10 bits per key gives ~1% false-positive rate. For every SSTable, keep an in-memory bloom; on a point lookup, skip SSTables whose bloom says "definitely not here." Dramatically reduces LSM read I/O on missing keys.
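A minimal Bloom filter at the 10-bits-per-key operating point described above (SHA-256 as the hash family is our choice for brevity; real engines use cheaper hashes like xxHash):

```python
import hashlib

class BloomFilter:
    """~10 bits/key with 7 hash functions gives ~1% false positives, zero false negatives."""
    def __init__(self, n_keys: int, bits_per_key: int = 10, k: int = 7):
        self.m = n_keys * bits_per_key   # total bit count
        self.k = k                       # number of hash functions
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions by salting one hash; illustrative, not RocksDB's scheme.
        for i in range(self.k):
            h = hashlib.sha256(i.to_bytes(2, "big") + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely not here" -> the SSTable can be skipped.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

On a point lookup, an LSM engine queries each SSTable's filter first and only reads the tables whose filters answer "maybe".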

Other read-path tricks:

  • Block cache: LRU of recently-read data blocks. 4–32 GB per node typical.
  • Index block cache: index blocks of SSTables kept hot.
  • Prefix bloom: bloom for key prefixes, enabling range prefix skip.
  • Ribbon filter (Facebook, 2021): ~30% smaller than a bloom filter at the same FP rate; used in newer RocksDB versions.

OpenAI-specific

Vector DB backends (Turbopuffer, LanceDB) are essentially LSM engines with bloom-filter equivalents on IVF partitions. For a "design a vector DB" question, mention LSM + WAL replication + bloom filters on partition IDs. See the vector DB arena.

Column stores and OLAP

Columnar storage (Parquet, Apache Arrow, C-Store/Vertica, ClickHouse, Snowflake, DuckDB) stores each column contiguously instead of each row. Benefits:

  • Compression: same-type values compress 5–20× with dictionary encoding + RLE + delta.
  • Vectorized execution: SIMD over tight column arrays; 10–100× faster scans than row stores.
  • Column pruning: SELECT a, b from a 200-column table reads only 1% of bytes.
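Dictionary encoding plus RLE, the compression combo named above, is simple to sketch (function name is ours):

```python
def dict_rle_encode(column):
    # Dictionary-encode a column (value -> small integer code),
    # then run-length-encode the resulting code stream as [code, run_length] pairs.
    dictionary = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in column]
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return dictionary, runs
```

Low-cardinality columns (country codes, enum flags) collapse dramatically under this scheme, which is where the 5–20× ratios come from.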

Cost: individual row lookups and updates are expensive; column stores are OLAP (analytics), not OLTP (transactions). Typical split: OLTP row store (Postgres) → CDC → columnar warehouse (Snowflake/BigQuery) for reporting and ML features.

Anthropic-specific

For model evaluation data (millions of (prompt, completion, eval_score) rows) you want a columnar store, not Postgres. Mention Parquet on S3 + DuckDB or ClickHouse for interactive eval dashboards. This is a routine Anthropic interview hook — showing familiarity with evaluation infrastructure signals fit.

Closing anti-pattern

Defaulting to "use Postgres" for a 100 TB analytical workload. Postgres is a row store; scan times are hours. The right answer is "ETL to Parquet on S3, query with ClickHouse or Snowflake" — one sentence that shows you know the OLTP/OLAP divide.
