Why storage engines matter
Picking "Postgres" or "Cassandra" is picking a storage engine architecture. DDIA Ch.3 argues — correctly — that understanding the B-tree vs LSM-tree distinction is what lets you predict the performance envelope of any new database you meet. Two rules:
- B-trees are tuned for read-heavy, mutation-heavy workloads with low latency. MySQL/InnoDB, Postgres, SQL Server, Oracle.
- LSM-trees are tuned for write-heavy workloads with time-series or append-like access. RocksDB, Cassandra, HBase, DynamoDB, ScyllaDB.
Source cross-reference
DDIA Ch.3 is the single best explanation of both engines. Pair with the original LSM paper (O'Neil et al. 1996) and Google's Bigtable paper (2006) for depth.
B-tree mechanics
A balanced n-ary tree where each page (4–16 KB) holds a sorted array of keys and child pointers. With 256-way fan-out, a 4-level B-tree indexes ~4 billion rows, and a read costs log_256(N) ≈ 4 page fetches. With buffer-pool caching of inner pages, only the leaf fetch hits disk: roughly 100 µs on NVMe.
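That depth arithmetic is worth being able to reproduce on the spot; a quick sanity check (a sketch, since real engines vary fan-out per page):

```python
import math

def btree_levels(n_keys: int, fanout: int) -> int:
    """Levels needed for a B-tree with the given fan-out to index n_keys."""
    return math.ceil(math.log(n_keys, fanout))

# 256-way fan-out over 4 billion keys -> 4 levels, as stated above.
print(btree_levels(4_000_000_000, 256))  # 4
```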
Writes are in-place: find leaf, modify, optionally split. To survive crashes mid-split, every write first goes to a write-ahead log (WAL). Recovery replays WAL into the tree.
- Update latency: ~1 WAL append (fsync: 50–200 µs) + occasional leaf fsync.
- Write amplification: ~1–3× (WAL + page write).
- Space amplification: low; each logical byte is ~1.3 physical bytes after indexes.
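The WAL-before-page discipline can be sketched in a few lines. This is a toy, not any engine's actual format; `TinyWAL` and its JSON record layout are made up for illustration:

```python
import json
import os
import tempfile

class TinyWAL:
    """Toy write-ahead log: every update is fsynced to the log before the
    in-memory 'tree' is touched, so a crash mid-update can be replayed."""

    def __init__(self, path):
        self.path = path
        self.tree = {}  # stands in for the B-tree pages
        self._replay()

    def _replay(self):
        """Recovery: replay every logged record into the tree."""
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    k, v = json.loads(line)
                    self.tree[k] = v

    def put(self, key, value):
        with open(self.path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())  # the 50-200 µs fsync from the bullet above
        self.tree[key] = value    # in-place page update happens only after

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyWAL(wal_path)
db.put("a", 1)
db2 = TinyWAL(wal_path)  # simulate a crash + restart: replay rebuilds the tree
print(db2.tree)          # {'a': 1}
```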
Anti-pattern
Claiming B-trees have "no write amp." They write both WAL and page. A random-write workload against a 10 TB B-tree with 16 KB pages produces 16 KB I/O per 100-byte logical write = 160× amplification in the worst case.
LSM-tree mechanics
Writes go to an in-memory memtable (sorted by key; typically a skip list). When the memtable reaches a size threshold (commonly ~64 MB), it is flushed to disk as an immutable SSTable. Background compaction merges SSTables into larger sorted runs.
```mermaid
flowchart TB
    W[Write] --> M["Memtable<br/>RAM, sorted"]
    M -- flush 64 MB --> L0[(L0 SSTables)]
    L0 -- compact --> L1[(L1 SSTables)]
    L1 -- compact --> L2[(L2 SSTables)]
    L2 -- compact --> L3[(L3 SSTables)]
    R[Read] --> M
    R --> L0
    R --> L1
    R --> L2
    R --> L3
```
Reads must check memtable + all SSTable levels; bloom filters (see below) short-circuit most negative lookups. Writes are always sequential appends, which SSDs prefer.
- Write throughput: 10–100 MB/s sustained per disk for LSM vs 1–10 MB/s for B-tree on random-write workloads.
- Read latency: higher tail (may visit many SSTables).
- Write amplification: 10–30× (leveled compaction); 2–10× (tiered). See next section.
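The write/flush/read path above reduces to something like this toy sketch (tiny memtable limit, no compaction or bloom filters, names invented here):

```python
import bisect

class MiniLSM:
    """Toy LSM: writes land in a sorted in-memory memtable; a flush produces
    an immutable sorted run; reads check the memtable, then runs newest-first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.limit = memtable_limit
        self.sstables = []  # list of sorted [(key, value)] runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())
        self.sstables.append(run)  # sequential append, never in-place
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run shadows older ones
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

db = MiniLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3)]:
    db.put(k, v)
print(db.get("a"))  # 3 -- the newer memtable version shadows the flushed one
```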
Amplification: write, read, space
| Amplification | Definition | B-tree | LSM (leveled) | LSM (tiered) |
|---|---|---|---|---|
| Write | physical bytes / logical bytes | 1–3× | 10–30× | 2–10× |
| Read | physical reads / logical read | 1 (leaf) | 1–3 | 5–10 |
| Space | physical bytes / logical bytes | 1.3× | 1.1× | 1.5–2× |
Quote to memorize: "Leveled compaction gives write amp 10–30× at the cost of space amp 1.1×; tiered gives 2–10× at the cost of read amp 5–10×." State it with numbers; interviewers love the specific ratios.
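Those ratios fall out of a standard back-of-envelope model (an assumed simplification, not any engine's exact cost: leveled rewrites each byte roughly `fanout` times per level it passes through; tiered writes each byte about once per level):

```python
def write_amp(levels: int, fanout: int, strategy: str) -> int:
    """Back-of-envelope write amplification for an LSM tree."""
    if strategy == "leveled":
        return fanout * levels  # each byte rewritten ~fanout times per level
    return levels               # tiered: each byte written ~once per level

print(write_amp(3, 10, "leveled"))  # 30 -- top of the 10-30x range above
print(write_amp(4, 10, "tiered"))   # 4  -- inside the 2-10x range
```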
Compaction strategies
Leveled compaction (LevelDB, RocksDB default, CockroachDB): L_i is ~10× L_{i-1}; each level is a single sorted run (except L0). On compaction, one SSTable from L_i merges with overlapping SSTables in L_{i+1}. Minimizes space amp and read amp, pays in write amp.
Tiered compaction (Cassandra Size-Tiered STCS, many time-series DBs): L_i holds K similar-sized SSTables; when full, merge all K into one at L_{i+1}. Minimizes write amp, pays in read amp (many overlapping runs) and space amp (pre-compaction duplicates).
Time-window compaction (Cassandra TWCS): partition data by write-time window (1 hour, 1 day); compact only within window. Ideal for time-series with TTL — drop whole SSTables on expiry with zero compaction cost.
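At its core, TWCS is just bucketing SSTables by write timestamp; compaction only merges within a bucket, and an expired bucket is dropped whole. A sketch (table names hypothetical):

```python
from collections import defaultdict

def twcs_buckets(sstables, window_s=3600):
    """Group (name, write_ts) SSTables into time-window buckets.
    Compaction merges only within a bucket; a fully expired bucket
    is deleted outright, which is the zero-cost TTL drop."""
    buckets = defaultdict(list)
    for name, write_ts in sstables:
        buckets[write_ts // window_s].append(name)
    return dict(buckets)

tables = [("sst1", 10), ("sst2", 3599), ("sst3", 3601)]
print(twcs_buckets(tables))  # {0: ['sst1', 'sst2'], 1: ['sst3']}
```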
Real numbers: a Cassandra node running STCS over 500 GB of data generates ~8 TB of compaction I/O per day; LCS writes more compaction I/O in total (~15 TB over the data's lifetime) but amortizes it smoothly, and p99 read latency drops ~40%. Instagram's 2019 STCS-to-LCS migration write-up is the canonical reference.
Bloom filters and read-path tricks
A Bloom filter is a compact probabilistic set: "might contain, or definitely does not." 10 bits per key gives ~1% false-positive rate. For every SSTable, keep an in-memory bloom; on a point lookup, skip SSTables whose bloom says "definitely not here." Dramatically reduces LSM read I/O on missing keys.
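A minimal Bloom filter looks like this. It uses SHA-256 as the probe hash purely for self-containment; real engines use cheaper hashes, and the `Bloom` class here is an illustration, not any library's API:

```python
import hashlib

class Bloom:
    """Minimal Bloom filter: k hash probes into an m-bit array.
    'Might contain' when all probed bits are set; 'definitely not' otherwise."""

    def __init__(self, m_bits: int = 1024, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits)  # one byte per bit, for simplicity

    def _probes(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str):
        for p in self._probes(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p] for p in self._probes(key))

bf = Bloom()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True -- never a false negative
print(bf.might_contain("user:43"))  # almost certainly False (small FP chance)
```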
Other read-path tricks:
- Block cache: LRU of recently-read data blocks. 4–32 GB per node typical.
- Index block cache: index blocks of SSTables kept hot.
- Prefix bloom: bloom for key prefixes, enabling range prefix skip.
- Ribbon filter (Facebook 2022): ~30% smaller than bloom for same FP rate; used in newer RocksDB.
OpenAI-specific
Vector DB backends (Turbopuffer, LanceDB) are essentially LSM engines with bloom-filter equivalents on IVF partitions. For "design a vector DB" mention LSM + WAL replication + bloom on partition IDs. See vector DB arena.
Column stores and OLAP
Columnar storage (Parquet, Apache Arrow, C-Store/Vertica, ClickHouse, Snowflake, DuckDB) stores each column contiguously instead of each row. Benefits:
- Compression: same-type values compress 5–20× with dictionary encoding + RLE + delta.
- Vectorized execution: SIMD over tight column arrays; 10–100× faster scans than row stores.
- Column pruning: `SELECT a, b` against a 200-column table reads only ~1% of the bytes.
Cost: individual row lookups and updates are expensive; column stores are OLAP (analytics), not OLTP (transactions). Typical split: OLTP row store (Postgres) → CDC → columnar warehouse (Snowflake/BigQuery) for reporting and ML features.
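The dictionary + RLE compression mentioned in the benefits list can be sketched in a few lines (a toy; Parquet's actual encodings are more involved):

```python
def dict_rle_encode(column):
    """Columnar compression sketch: dictionary-encode values to small ints,
    then run-length encode the resulting code stream."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in column]
    runs, prev, count = [], encoded[0], 0
    for c in encoded:
        if c == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = c, 1
    runs.append((prev, count))
    return codes, runs

col = ["US", "US", "US", "DE", "DE", "US"]
print(dict_rle_encode(col))  # ({'US': 0, 'DE': 1}, [(0, 3), (1, 2), (0, 1)])
```

Low-cardinality columns like country codes are where this wins big: long runs of identical codes collapse into single (code, count) pairs.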
Anthropic-specific
For model evaluation data (millions of (prompt, completion, eval_score) rows) you want a columnar store, not Postgres. Mention Parquet on S3 + DuckDB or ClickHouse for interactive eval dashboards. This is a routine Anthropic interview hook — showing familiarity with evaluation infrastructure signals fit.
Closing anti-pattern
Defaulting to "use Postgres" for a 100 TB analytical workload. Postgres is a row store; scan times are hours. The right answer is "ETL to Parquet on S3, query with ClickHouse or Snowflake" — one sentence that shows you know the OLTP/OLAP divide.