Why partition

Replication solves reads and availability; partitioning solves writes and storage. Once your data exceeds a single node's disk (~30 TB NVMe today) or your write rate exceeds a single leader (~50 k writes/sec for Postgres on a fat box), you must split. DDIA Ch.6 defines a partition as a subset of rows assigned to a node; in Cassandra it's a "ring segment", in DynamoDB it's a "partition", in MongoDB it's a "chunk". Same concept, different branding.

Source cross-reference

DDIA Ch.6 is the gold standard. Alex Xu V1 Ch.5 on consistent hashing is the interview-template version (virtual nodes, add/remove math). Acing SDI Ch.4 frames sharding as a scaling lever.

Range vs hash partitioning

Range: partition by key order, e.g. usernames A–E, F–J, .... Pro: range scans are cheap (SELECT ... WHERE user BETWEEN 'a' AND 'd'). Con: skew — sequential keys like timestamps hammer one partition. Used by HBase, BigTable, Spanner.

Hash: partition = hash(key) % N. Pro: uniform distribution. Con: range scans require scatter-gather. Used by Cassandra, DynamoDB, Redis Cluster.

Most real systems are hash-of-prefix, range-within: the partition key is hashed (to spread load), but the sort key within a partition is ordered (for efficient scans). DynamoDB's (PK, SK) design, Cassandra's (partition key, clustering key) — same idea.

DynamoDB item: PK = user#42, SK = ts#2026-04-16T10:00:00Z
  → partition chosen by hash(user#42)
  → within partition, items sorted by ts
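A minimal sketch of this layout in Python (partition count and helper names are illustrative, not any database's actual API): the partition key is hashed to pick a partition, and items within a partition stay sorted by the sort key, so single-partition range scans stay cheap.

```python
# Hash-of-prefix, range-within: hash the partition key to spread load,
# keep each partition ordered by (pk, sk) for cheap range scans.
import bisect
import hashlib

NUM_PARTITIONS = 4  # illustrative; real systems use many more

def partition_for(pk: str) -> int:
    # Hash the partition key to choose a partition.
    return int(hashlib.md5(pk.encode()).hexdigest(), 16) % NUM_PARTITIONS

partitions = [[] for _ in range(NUM_PARTITIONS)]  # each kept sorted by (pk, sk)

def put(pk: str, sk: str, value: str) -> None:
    part = partitions[partition_for(pk)]
    bisect.insort(part, ((pk, sk), value))  # ordered insert preserves sort-key order

def range_scan(pk: str, sk_lo: str, sk_hi: str):
    # Touches only the partition owning pk; scans the contiguous sorted run.
    part = partitions[partition_for(pk)]
    lo = bisect.bisect_left(part, ((pk, sk_lo), ""))
    return [v for (k, s), v in part[lo:] if k == pk and s <= sk_hi]

put("user#42", "ts#2026-04-16T10:00:00Z", "msg-a")
put("user#42", "ts#2026-04-16T11:00:00Z", "msg-b")
put("user#7",  "ts#2026-04-16T10:30:00Z", "msg-c")
print(range_scan("user#42", "ts#2026-04-16T09:00:00Z", "ts#2026-04-16T23:59:59Z"))
```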

Consistent hashing & virtual nodes

Naïve hash(key) % N breaks when N changes — all keys remap. Karger et al. 1997 consistent hashing places nodes on a hash ring; each key is assigned to the next node clockwise. Adding a node only takes 1/N of the keys from its neighbor.

flowchart LR
    subgraph Ring
      N1((Node A)) --> N2((Node B))
      N2 --> N3((Node C))
      N3 --> N4((Node D))
      N4 --> N1
    end
    K1[Key k1] -.-> N2
    K2[Key k2] -.-> N3
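The ring above can be sketched with a sorted token list and binary search. This is a toy model with virtual nodes folded in (it is not Cassandra's actual token allocator); the remap fraction it prints demonstrates the 1/N claim:

```python
# Consistent-hash ring with virtual nodes: each physical node claims `vnodes`
# positions; a key belongs to the first token clockwise from its hash.
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, vnodes: int = 128):
        self.vnodes = vnodes
        self.tokens = []   # sorted hash positions on the ring
        self.owner = {}    # token -> physical node

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            t = h(f"{node}#{i}")
            bisect.insort(self.tokens, t)
            self.owner[t] = node

    def lookup(self, key: str) -> str:
        # First token clockwise from hash(key), wrapping around the ring.
        i = bisect.bisect_right(self.tokens, h(key)) % len(self.tokens)
        return self.owner[self.tokens[i]]

ring = Ring()
for n in ("A", "B", "C", "D"):
    ring.add(n)
before = {k: ring.lookup(k) for k in (f"key{i}" for i in range(10_000))}
ring.add("E")  # a fifth node joins the ring
moved = sum(1 for k, n in before.items() if ring.lookup(k) != n)
print(f"keys remapped: {moved / len(before):.0%}")  # close to 1/5
```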

Problem: with a single token per node and N=5, per-node load varies on the order of ±1/√N, roughly ±45%. Ugly. Solution: virtual nodes (vnodes). Each physical node claims 100–256 positions on the ring; variance drops to around ±5%. Cassandra long defaulted to 256 vnodes/node (4.0 lowered the default to 16); DynamoDB hides vnodes entirely.

Add-node math: 256 vnodes × 10 nodes = 2560 tokens on the ring. Adding an 11th node adds 256 more tokens, claiming 1/11 ≈ 9% of the key space, taken evenly from the other 10 nodes. Network traffic during rebalance: ~9% of the total dataset moved. For a 100 TB cluster that is ~9 TB → at 1 GB/s sustained, about 2.5 hours. Throttle the rebalance or you will stomp foreground traffic.

Anti-pattern

Using hash(key) % N with N baked into client code. Scaling out requires rewriting every client. The 2010 reddit incident: migrating from mod-hashing to consistent hashing took months. Start with consistent hashing on day one.
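A quick demonstration of why baked-in mod-N is the trap: growing from 10 to 11 servers remaps nearly every key. (MD5 stands in for whatever uniform hash the client uses.)

```python
# Mod-N anti-pattern: going from N to N+1 servers remaps ~N/(N+1) of all
# keys, so caches go cold and nearly the whole dataset must move.
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

keys = [f"user#{i}" for i in range(100_000)]
moved = sum(1 for k in keys if h(k) % 10 != h(k) % 11)
print(f"{moved / len(keys):.0%} of keys change servers")  # ~10/11 ≈ 91%
```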

Hot partitions

A celebrity Twitter user with 100 M followers, a viral URL in a short-link service, Black Friday top-SKU in e-commerce — any skewed workload concentrates on one partition. Symptoms: one node at 95% CPU, the rest idle; p99 latency explodes while p50 stays fine.

Mitigations, in order of cost:

  • Cache the hot keys: Redis in front of the partition, TTL 1–60 s. Cheapest.
  • Write sharding: append a 2-digit random suffix to the partition key (celeb#42-07). Aggregate at read time. DynamoDB's recommended hot-partition pattern.
  • Split the partition: double-hash or adaptive splitting (DynamoDB auto-splits partitions past 3000 RCU/1000 WCU since 2018).
  • Dedicated shard: Twitter's "celebrity cluster" moves top-N accounts to a special pool with fan-out optimized for them.
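The write-sharding bullet can be sketched in a few lines (suffix count and the in-memory store are illustrative stand-ins for the real datastore): writes for one logical key fan out over random suffixes, and reads aggregate every suffix.

```python
# Write sharding for a hot key: spread writes for one logical key across
# SUFFIXES physical partition keys; pay for it with fan-out at read time.
import random
from collections import defaultdict

SUFFIXES = 10
store = defaultdict(list)  # stand-in for the partitioned datastore

def write(logical_key: str, value: str) -> None:
    shard = random.randrange(SUFFIXES)
    store[f"{logical_key}-{shard:02d}"].append(value)  # e.g. celeb#42-07

def read_all(logical_key: str) -> list:
    # Read fans out to every suffix: the cost of cheap, spread-out writes.
    out = []
    for shard in range(SUFFIXES):
        out.extend(store[f"{logical_key}-{shard:02d}"])
    return out

for i in range(1000):
    write("celeb#42", f"like-{i}")
print(len(read_all("celeb#42")))  # all 1000 writes recovered
```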

OpenAI-specific

For prompt-caching or chat-history by user, the "power-user" tail is long: 1% of users drive 30% of tokens. Don't partition by user_id alone; prepend a bucket dimension, e.g. (user_id, day_bucket), to spread the heaviest users across partitions.
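A sketch of the bucketed key, assuming a daily bucket and MD5-based placement (identifiers and partition count are illustrative, not a real schema):

```python
# Bucketed partition key: mixing a day bucket into the key lets one heavy
# user's traffic land on different partitions on different days.
import datetime
import hashlib

NUM_PARTITIONS = 64  # illustrative

def partition_key(user_id: str, ts: datetime.datetime) -> str:
    day_bucket = ts.strftime("%Y-%m-%d")
    return f"{user_id}#{day_bucket}"

def partition_for(pk: str) -> int:
    return int(hashlib.md5(pk.encode()).hexdigest(), 16) % NUM_PARTITIONS

# One power user over a month now spreads across many partitions:
days = [datetime.datetime(2026, 4, d) for d in range(1, 29)]
placements = {partition_for(partition_key("user#1", d)) for d in days}
print(f"28 days of one user landed on {len(placements)} distinct partitions")
```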

Secondary indexes: local vs global

Primary partitioning is chosen by access pattern; secondary access (by a different column) needs an index. Two strategies (DDIA Ch.6):

  • Local (document-partitioned): every partition has its own index on its rows. Writes are cheap (local). Reads by secondary key must scatter-gather every partition. Used by Elasticsearch shards, MongoDB local indexes.
  • Global (term-partitioned): one index keyed by the secondary column, partitioned by that column's hash. Reads are single-partition. Writes to the base table now also hit a different partition (2PC or async). DynamoDB GSIs are eventually consistent precisely because keeping the index in sync synchronously would be expensive.

Trade-off slogan: "Local indexes make writes cheap and reads expensive; global indexes do the opposite."
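A toy model of both index styles makes the slogan concrete (four partitions, MD5 placement, a users table with a secondary column city; all names are illustrative): the local index answers via scatter-gather over every partition, the global index via a single-partition lookup.

```python
# Local vs global secondary index (DDIA Ch.6 terminology), sketched.
import hashlib

N = 4  # partitions
def p(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % N

partitions = [dict() for _ in range(N)]  # base rows, partitioned by user_id
local_idx  = [dict() for _ in range(N)]  # each partition indexes only its own rows
global_idx = [dict() for _ in range(N)]  # one index, partitioned by hash(city)

def insert(user_id: str, city: str) -> None:
    i = p(user_id)
    partitions[i][user_id] = city
    local_idx[i].setdefault(city, []).append(user_id)         # same-partition write
    global_idx[p(city)].setdefault(city, []).append(user_id)  # extra cross-partition write

def read_local(city: str) -> list:
    # Scatter-gather: every partition's local index must be consulted.
    return [u for i in range(N) for u in local_idx[i].get(city, [])]

def read_global(city: str) -> list:
    # Single-partition read: the global index is partitioned by city.
    return global_idx[p(city)].get(city, [])

for u, c in [("u1", "sf"), ("u2", "nyc"), ("u3", "sf")]:
    insert(u, c)
print(sorted(read_local("sf")), sorted(read_global("sf")))  # same answer, different cost
```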

Rebalancing strategies

  • Fixed partitions: create 1000 partitions on day one across 10 nodes (100 each). Adding node 11 → move 1/11 of partitions to it. Elasticsearch, Riak. Must guess the high-water mark upfront.
  • Dynamic partitioning: start with 1 partition; split when it exceeds threshold (default 10 GB in HBase). Shrinks when undersize. Works well for unbounded growth.
  • Proportional to nodes: Cassandra's token-per-node (256 vnodes). Adding a node reshuffles 1/N of tokens.
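The fixed-partition bullet's math can be checked in a few lines (round-robin initial placement is an assumption for illustration): with 1000 partitions on 10 nodes, an 11th node should receive about 1/11 of the partitions and nothing else should move.

```python
# Fixed-partition rebalancing: 1000 partitions, 10 nodes; node 11 steals
# roughly 1/11 of the partitions, taken evenly from the existing owners.
P, N = 1000, 10
assign = {part: part % N for part in range(P)}  # round-robin initial placement

target = P // (N + 1)  # how many partitions the new node should own
moved = 0
for part in range(P):
    if moved == target:
        break
    # Take every (N+1)-th partition, which spreads the theft across owners.
    if part % (N + 1) == 0:
        assign[part] = N
        moved += 1
print(f"moved {moved} of {P} partitions (~{moved / P:.0%})")
```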

Automatic vs manual rebalancing: automatic is tempting but can stampede during a network blip. Production-grade systems require operator confirmation for moves above a threshold. CockroachDB and Spanner both gate large rebalances.

Request routing

Who knows which partition holds user#42?

  1. Smart client: client library embeds the partition map. Fast, but requires pushing updates to all clients. Cassandra drivers do this.
  2. Coordinator / routing tier: a stateless proxy reads the map from ZK/etcd, forwards requests. Extra hop (+0.5 ms), easier ops. Used by Kafka, Vitess.
  3. Partition-aware DNS / anycast: rare; only for coarse-grained routing.

Either way, there is a metadata store (ZK/etcd) that holds the authoritative partition map. This is exactly why the 2015 DynamoDB metadata storm was catastrophic — losing the map means losing the ability to route anything. Cache the map on every node with a 5–30 s TTL and graceful staleness.
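A sketch of the cached map with graceful staleness (class name, TTL, and the fetch callable are illustrative): serve from the local copy inside the TTL, refresh from the metadata store afterward, and keep routing on the stale copy if the store is unreachable.

```python
# Partition-map cache with TTL and graceful staleness: losing ZK/etcd
# briefly must not mean losing the ability to route.
import time

class PartitionMapCache:
    def __init__(self, fetch, ttl: float = 10.0):
        self.fetch = fetch      # callable that hits the metadata store
        self.ttl = ttl
        self.map = None
        self.fetched_at = 0.0

    def get(self) -> dict:
        now = time.monotonic()
        if self.map is None or now - self.fetched_at > self.ttl:
            try:
                self.map = self.fetch()
                self.fetched_at = now
            except Exception:
                if self.map is None:
                    raise  # no stale copy to fall back on
                # Metadata store unreachable: keep routing with the stale map.
        return self.map

calls = []
def fetch():
    calls.append(1)
    return {"p0": "node-a", "p1": "node-b"}

cache = PartitionMapCache(fetch, ttl=10.0)
cache.get(); cache.get()
print(len(calls))  # second lookup is served from the cache
```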

Anthropic-specific

For tenant-isolated inference (Claude for Enterprise), partition by tenant_id at the top level so data never shares a physical node across tenants. This simplifies audit and SOC2: one partition = one tenant = one bucket of logs.
