Why storage engines matter
Picking "Postgres" or "Cassandra" is picking a storage engine architecture. DDIA Ch.3 argues — correctly — that understanding the B-tree vs LSM-tree distinction is what lets you predict the performance envelope of any new database you meet. Two rules:
- B-trees are tuned for read-heavy, mutation-heavy workloads with low latency. MySQL/InnoDB, Postgres, SQL Server, Oracle.
- LSM-trees are tuned for write-heavy workloads with time-series or append-like access. RocksDB, Cassandra, HBase, DynamoDB, ScyllaDB.
Source cross-reference
DDIA Ch.3 is the single best explanation of both engines. Pair with the original LSM paper (O'Neil et al. 1996) and Google's Bigtable paper (2006) for depth.
B-tree mechanics
A balanced n-ary tree where each page (4–16 KB) holds a sorted array of keys and child pointers. With 256-way fan-out, a 4-level B-tree indexes ~4 billion rows, and a read costs log_256(N) ≈ 4 page fetches. With buffer-pool caching of inner pages, only the leaf fetch hits disk: roughly 100 µs on NVMe.
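That depth arithmetic is worth being able to reproduce on the spot; a quick sanity check (a sketch, since real engines vary fan-out per page):

```python
import math

def btree_levels(n_keys: int, fanout: int) -> int:
    """Levels needed for a B-tree with the given fan-out to index n_keys."""
    return math.ceil(math.log(n_keys, fanout))

# 256-way fan-out over 4 billion keys -> 4 levels, as stated above.
print(btree_levels(4_000_000_000, 256))  # 4
```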
Writes are in-place: find leaf, modify, optionally split. To survive crashes mid-split, every write first goes to a write-ahead log (WAL). Recovery replays WAL into the tree.
- Update latency: ~1 WAL append (fsync: 50–200 µs) + occasional leaf fsync.
- Write amplification: ~1–3× (WAL + page write).
- Space amplification: low; each logical byte is ~1.3 physical bytes after indexes.
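The WAL-before-page discipline can be sketched in a few lines. This is a toy, not any engine's actual format; `TinyWAL` and its JSON record layout are made up for illustration:

```python
import json
import os
import tempfile

class TinyWAL:
    """Toy write-ahead log: every update is fsynced to the log before the
    in-memory 'tree' is touched, so a crash mid-update can be replayed."""

    def __init__(self, path):
        self.path = path
        self.tree = {}  # stands in for the B-tree pages
        self._replay()

    def _replay(self):
        """Recovery: replay every logged record into the tree."""
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    k, v = json.loads(line)
                    self.tree[k] = v

    def put(self, key, value):
        with open(self.path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())  # the 50-200 µs fsync from the bullet above
        self.tree[key] = value    # in-place page update happens only after

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyWAL(wal_path)
db.put("a", 1)
db2 = TinyWAL(wal_path)  # simulate a crash + restart: replay rebuilds the tree
print(db2.tree)          # {'a': 1}
```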
Anti-pattern
Claiming B-trees have "no write amp." They write both WAL and page. A random-write workload against a 10 TB B-tree with 16 KB pages produces 16 KB I/O per 100-byte logical write = 160× amplification in the worst case.
LSM-tree mechanics
Writes go to an in-memory memtable (sorted by key; typically a skip list). When the memtable reaches a size threshold (commonly ~64 MB), it is flushed to disk as an immutable SSTable. Background compaction merges SSTables into larger sorted runs.
```mermaid
flowchart TB
    W[Write] --> M["Memtable<br/>RAM, sorted"]
    M -- flush 64 MB --> L0[(L0 SSTables)]
    L0 -- compact --> L1[(L1 SSTables)]
    L1 -- compact --> L2[(L2 SSTables)]
    L2 -- compact --> L3[(L3 SSTables)]
    R[Read] --> M
    R --> L0
    R --> L1
    R --> L2
    R --> L3
```
Reads must check memtable + all SSTable levels; bloom filters (see below) short-circuit most negative lookups. Writes are always sequential appends, which SSDs prefer.
- Write throughput: 10–100 MB/s sustained per disk for LSM vs 1–10 MB/s for B-tree on random-write workloads.
- Read latency: higher tail (may visit many SSTables).
- Write amplification: 10–30× (leveled compaction); 2–10× (tiered). See next section.
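The write/flush/read path above reduces to something like this toy sketch (tiny memtable limit, no compaction or bloom filters, names invented here):

```python
import bisect

class MiniLSM:
    """Toy LSM: writes land in a sorted in-memory memtable; a flush produces
    an immutable sorted run; reads check the memtable, then runs newest-first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.limit = memtable_limit
        self.sstables = []  # list of sorted [(key, value)] runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())
        self.sstables.append(run)  # sequential append, never in-place
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run shadows older ones
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

db = MiniLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3)]:
    db.put(k, v)
print(db.get("a"))  # 3 -- the newer memtable version shadows the flushed one
```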
Amplification: write, read, space
| Amplification | Definition | B-tree | LSM (leveled) | LSM (tiered) |
|---|---|---|---|---|
| Write | physical bytes / logical bytes | 1–3× | 10–30× | 2–10× |
| Read | physical reads / logical read | 1 (leaf) | 1–3 | 5–10 |
| Space | physical bytes / logical bytes | 1.3× | 1.1× | 1.5–2× |
Quote to memorize: "Leveled compaction gives write amp 10–30× at the cost of space amp 1.1×; tiered gives 2–10× at the cost of read amp 5–10×." State it with numbers; interviewers love the specific ratios.
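Those ratios fall out of a standard back-of-envelope model (an assumed simplification, not any engine's exact cost: leveled rewrites each byte roughly `fanout` times per level it passes through; tiered writes each byte about once per level):

```python
def write_amp(levels: int, fanout: int, strategy: str) -> int:
    """Back-of-envelope write amplification for an LSM tree."""
    if strategy == "leveled":
        return fanout * levels  # each byte rewritten ~fanout times per level
    return levels               # tiered: each byte written ~once per level

print(write_amp(3, 10, "leveled"))  # 30 -- top of the 10-30x range above
print(write_amp(4, 10, "tiered"))   # 4  -- inside the 2-10x range
```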
Compaction strategies
Leveled compaction (LevelDB, RocksDB default, CockroachDB): L_i is ~10× L_{i-1}; each level is a single sorted run (except L0). On compaction, one SSTable from L_i merges with overlapping SSTables in L_{i+1}. Minimizes space amp and read amp, pays in write amp.
Tiered compaction (Cassandra Size-Tiered STCS, many time-series DBs): L_i holds K similar-sized SSTables; when full, merge all K into one at L_{i+1}. Minimizes write amp, pays in read amp (many overlapping runs) and space amp (pre-compaction duplicates).
Time-window compaction (Cassandra TWCS): partition data by write-time window (1 hour, 1 day); compact only within window. Ideal for time-series with TTL — drop whole SSTables on expiry with zero compaction cost.
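At its core, TWCS is just bucketing SSTables by write timestamp; compaction only merges within a bucket, and an expired bucket is dropped whole. A sketch (table names hypothetical):

```python
from collections import defaultdict

def twcs_buckets(sstables, window_s=3600):
    """Group (name, write_ts) SSTables into time-window buckets.
    Compaction merges only within a bucket; a fully expired bucket
    is deleted outright, which is the zero-cost TTL drop."""
    buckets = defaultdict(list)
    for name, write_ts in sstables:
        buckets[write_ts // window_s].append(name)
    return dict(buckets)

tables = [("sst1", 10), ("sst2", 3599), ("sst3", 3601)]
print(twcs_buckets(tables))  # {0: ['sst1', 'sst2'], 1: ['sst3']}
```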
Real numbers: a Cassandra node running STCS over 500 GB of data generates ~8 TB of compaction I/O per day; LCS writes more compaction I/O in total (~15 TB over the data's lifetime) but amortizes it smoothly, and p99 read latency drops ~40%. Instagram's 2019 STCS-to-LCS migration write-up is the canonical reference.
Bloom filters and read-path tricks
A Bloom filter is a compact probabilistic set: "might contain, or definitely does not." 10 bits per key gives ~1% false-positive rate. For every SSTable, keep an in-memory bloom; on a point lookup, skip SSTables whose bloom says "definitely not here." Dramatically reduces LSM read I/O on missing keys.
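A minimal Bloom filter looks like this. It uses SHA-256 as the probe hash purely for self-containment; real engines use cheaper hashes, and the `Bloom` class here is an illustration, not any library's API:

```python
import hashlib

class Bloom:
    """Minimal Bloom filter: k hash probes into an m-bit array.
    'Might contain' when all probed bits are set; 'definitely not' otherwise."""

    def __init__(self, m_bits: int = 1024, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits)  # one byte per bit, for simplicity

    def _probes(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str):
        for p in self._probes(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p] for p in self._probes(key))

bf = Bloom()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True -- never a false negative
print(bf.might_contain("user:43"))  # almost certainly False (small FP chance)
```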
Other read-path tricks:
- Block cache: LRU of recently-read data blocks. 4–32 GB per node typical.
- Index block cache: index blocks of SSTables kept hot.
- Prefix bloom: bloom for key prefixes, enabling range prefix skip.
- Ribbon filter (Facebook 2022): ~30% smaller than bloom for same FP rate; used in newer RocksDB.
OpenAI-specific
Vector DB backends (Turbopuffer, LanceDB) are essentially LSM engines with bloom-filter equivalents on IVF partitions. For "design a vector DB" mention LSM + WAL replication + bloom on partition IDs. See vector DB arena.
Column stores and OLAP
Columnar storage (Parquet, Apache Arrow, C-Store/Vertica, ClickHouse, Snowflake, DuckDB) stores each column contiguously instead of each row. Benefits:
- Compression: same-type values compress 5–20× with dictionary encoding + RLE + delta.
- Vectorized execution: SIMD over tight column arrays; 10–100× faster scans than row stores.
- Column pruning: `SELECT a, b` against a 200-column table reads only ~1% of the bytes.
Cost: individual row lookups and updates are expensive; column stores are OLAP (analytics), not OLTP (transactions). Typical split: OLTP row store (Postgres) → CDC → columnar warehouse (Snowflake/BigQuery) for reporting and ML features.
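The dictionary + RLE compression mentioned in the benefits list can be sketched in a few lines (a toy; Parquet's actual encodings are more involved):

```python
def dict_rle_encode(column):
    """Columnar compression sketch: dictionary-encode values to small ints,
    then run-length encode the resulting code stream."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in column]
    runs, prev, count = [], encoded[0], 0
    for c in encoded:
        if c == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = c, 1
    runs.append((prev, count))
    return codes, runs

col = ["US", "US", "US", "DE", "DE", "US"]
print(dict_rle_encode(col))  # ({'US': 0, 'DE': 1}, [(0, 3), (1, 2), (0, 1)])
```

Low-cardinality columns like country codes are where this wins big: long runs of identical codes collapse into single (code, count) pairs.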
Anthropic-specific
For model evaluation data (millions of (prompt, completion, eval_score) rows) you want a columnar store, not Postgres. Mention Parquet on S3 + DuckDB or ClickHouse for interactive eval dashboards. This is a routine Anthropic interview hook — showing familiarity with evaluation infrastructure signals fit.
Closing anti-pattern
Defaulting to "use Postgres" for a 100 TB analytical workload. Postgres is a row store; scan times are hours. The right answer is "ETL to Parquet on S3, query with ClickHouse or Snowflake" — one sentence that shows you know the OLTP/OLAP divide.