Why partition

Replication solves reads and availability; partitioning solves writes and storage. Once your data exceeds a single node's disk (~30 TB NVMe today) or your write rate exceeds a single leader (~50 k writes/sec for Postgres on a fat box), you must split. DDIA Ch.6 defines a partition as a subset of rows assigned to a node; in Cassandra it's a "ring segment", in DynamoDB it's a "partition", in MongoDB it's a "chunk". Same concept, different branding.

Source cross-reference

DDIA Ch.6 is the gold standard. Alex Xu V1 Ch.5 on consistent hashing is the interview-template version (virtual nodes, add/remove math). Acing SDI Ch.4 frames sharding as a scaling lever.

Range vs hash partitioning

Range: partition by key order, e.g. usernames A–E, F–J, .... Pro: range scans are cheap (SELECT ... WHERE user BETWEEN 'a' AND 'd'). Con: skew — sequential keys like timestamps hammer one partition. Used by HBase, BigTable, Spanner.

Hash: partition = hash(key) % N. Pro: uniform distribution. Con: range scans require scatter-gather. Used by Cassandra, DynamoDB, Redis Cluster.

Most real systems are hash-of-prefix, range-within: the partition key is hashed (to spread load), but the sort key within a partition is ordered (for efficient scans). DynamoDB's (PK, SK) design, Cassandra's (partition key, clustering key) — same idea.

DynamoDB item: PK = user#42, SK = ts#2026-04-16T10:00:00Z
  → partition chosen by hash(user#42)
  → within partition, items sorted by ts
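A minimal sketch of this layout in Python (partition count and helper names are illustrative, not any database's actual API): the partition key is hashed to pick a partition, and items within a partition stay sorted by the sort key, so single-partition range scans stay cheap.

```python
# Hash-of-prefix, range-within: hash the partition key to spread load,
# keep each partition ordered by (pk, sk) for cheap range scans.
import bisect
import hashlib

NUM_PARTITIONS = 4  # illustrative; real systems use many more

def partition_for(pk: str) -> int:
    # Hash the partition key to choose a partition.
    return int(hashlib.md5(pk.encode()).hexdigest(), 16) % NUM_PARTITIONS

partitions = [[] for _ in range(NUM_PARTITIONS)]  # each kept sorted by (pk, sk)

def put(pk: str, sk: str, value: str) -> None:
    part = partitions[partition_for(pk)]
    bisect.insort(part, ((pk, sk), value))  # ordered insert preserves sort-key order

def range_scan(pk: str, sk_lo: str, sk_hi: str):
    # Touches only the partition owning pk; scans the contiguous sorted run.
    part = partitions[partition_for(pk)]
    lo = bisect.bisect_left(part, ((pk, sk_lo), ""))
    return [v for (k, s), v in part[lo:] if k == pk and s <= sk_hi]

put("user#42", "ts#2026-04-16T10:00:00Z", "msg-a")
put("user#42", "ts#2026-04-16T11:00:00Z", "msg-b")
put("user#7",  "ts#2026-04-16T10:30:00Z", "msg-c")
print(range_scan("user#42", "ts#2026-04-16T09:00:00Z", "ts#2026-04-16T23:59:59Z"))
```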

Consistent hashing & virtual nodes

Naïve hash(key) % N breaks when N changes — all keys remap. Karger et al. 1997 consistent hashing places nodes on a hash ring; each key is assigned to the next node clockwise. Adding a node only takes 1/N of the keys from its neighbor.

flowchart LR
    subgraph Ring
      N1((Node A)) --> N2((Node B))
      N2 --> N3((Node C))
      N3 --> N4((Node D))
      N4 --> N1
    end
    K1[Key k1] -.-> N2
    K2[Key k2] -.-> N3
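The ring above can be sketched with a sorted token list and binary search. This is a toy model with virtual nodes folded in (it is not Cassandra's actual token allocator); the remap fraction it prints demonstrates the 1/N claim:

```python
# Consistent-hash ring with virtual nodes: each physical node claims `vnodes`
# positions; a key belongs to the first token clockwise from its hash.
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, vnodes: int = 128):
        self.vnodes = vnodes
        self.tokens = []   # sorted hash positions on the ring
        self.owner = {}    # token -> physical node

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            t = h(f"{node}#{i}")
            bisect.insort(self.tokens, t)
            self.owner[t] = node

    def lookup(self, key: str) -> str:
        # First token clockwise from hash(key), wrapping around the ring.
        i = bisect.bisect_right(self.tokens, h(key)) % len(self.tokens)
        return self.owner[self.tokens[i]]

ring = Ring()
for n in ("A", "B", "C", "D"):
    ring.add(n)
before = {k: ring.lookup(k) for k in (f"key{i}" for i in range(10_000))}
ring.add("E")  # a fifth node joins the ring
moved = sum(1 for k, n in before.items() if ring.lookup(k) != n)
print(f"keys remapped: {moved / len(before):.0%}")  # close to 1/5
```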

Problem: with a single token per node and N=5, per-node load varies on the order of ±1/√N, roughly ±45%. Ugly. Solution: virtual nodes (vnodes). Each physical node claims 100–256 positions on the ring; variance drops to around ±5%. Cassandra long defaulted to 256 vnodes/node (4.0 lowered the default to 16); DynamoDB hides vnodes entirely.

Add-node math: 256 vnodes × 10 nodes = 2560 tokens on the ring. Adding an 11th node adds 256 more tokens, claiming 1/11 ≈ 9% of the key space, taken evenly from the other 10 nodes. Network traffic during rebalance: ~9% of the total dataset moved. For a 100 TB cluster that is ~9 TB → at 1 GB/s sustained, about 2.5 hours. Throttle the rebalance or you will stomp foreground traffic.

Anti-pattern

Using hash(key) % N with N baked into client code. Scaling out requires rewriting every client. The 2010 reddit incident: migrating from mod-hashing to consistent hashing took months. Start with consistent hashing on day one.
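A quick demonstration of why baked-in mod-N is the trap: growing from 10 to 11 servers remaps nearly every key. (MD5 stands in for whatever uniform hash the client uses.)

```python
# Mod-N anti-pattern: going from N to N+1 servers remaps ~N/(N+1) of all
# keys, so caches go cold and nearly the whole dataset must move.
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

keys = [f"user#{i}" for i in range(100_000)]
moved = sum(1 for k in keys if h(k) % 10 != h(k) % 11)
print(f"{moved / len(keys):.0%} of keys change servers")  # ~10/11 ≈ 91%
```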

Hot partitions

A celebrity Twitter user with 100 M followers, a viral URL in a short-link service, Black Friday top-SKU in e-commerce — any skewed workload concentrates on one partition. Symptoms: one node at 95% CPU, the rest idle; p99 latency explodes while p50 stays fine.

Mitigations, in order of cost:

  • Cache the hot keys: Redis in front of the partition, TTL 1–60 s. Cheapest.
  • Write sharding: append a 2-digit random suffix to the partition key (celeb#42-07). Aggregate at read time. DynamoDB's recommended hot-partition pattern.
  • Split the partition: double-hash or adaptive splitting (DynamoDB auto-splits partitions past 3000 RCU/1000 WCU since 2018).
  • Dedicated shard: Twitter's "celebrity cluster" moves top-N accounts to a special pool with fan-out optimized for them.
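The write-sharding bullet can be sketched in a few lines (suffix count and the in-memory store are illustrative stand-ins for the real datastore): writes for one logical key fan out over random suffixes, and reads aggregate every suffix.

```python
# Write sharding for a hot key: spread writes for one logical key across
# SUFFIXES physical partition keys; pay for it with fan-out at read time.
import random
from collections import defaultdict

SUFFIXES = 10
store = defaultdict(list)  # stand-in for the partitioned datastore

def write(logical_key: str, value: str) -> None:
    shard = random.randrange(SUFFIXES)
    store[f"{logical_key}-{shard:02d}"].append(value)  # e.g. celeb#42-07

def read_all(logical_key: str) -> list:
    # Read fans out to every suffix: the cost of cheap, spread-out writes.
    out = []
    for shard in range(SUFFIXES):
        out.extend(store[f"{logical_key}-{shard:02d}"])
    return out

for i in range(1000):
    write("celeb#42", f"like-{i}")
print(len(read_all("celeb#42")))  # all 1000 writes recovered
```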

OpenAI-specific

For prompt-caching or chat-history by user, the "power-user" tail is long: 1% of users drive 30% of tokens. Don't partition by user_id alone; prepend a bucket dimension, e.g. (user_id, day_bucket), to spread the heaviest users across partitions.
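A sketch of the bucketed key, assuming a daily bucket and MD5-based placement (identifiers and partition count are illustrative, not a real schema):

```python
# Bucketed partition key: mixing a day bucket into the key lets one heavy
# user's traffic land on different partitions on different days.
import datetime
import hashlib

NUM_PARTITIONS = 64  # illustrative

def partition_key(user_id: str, ts: datetime.datetime) -> str:
    day_bucket = ts.strftime("%Y-%m-%d")
    return f"{user_id}#{day_bucket}"

def partition_for(pk: str) -> int:
    return int(hashlib.md5(pk.encode()).hexdigest(), 16) % NUM_PARTITIONS

# One power user over a month now spreads across many partitions:
days = [datetime.datetime(2026, 4, d) for d in range(1, 29)]
placements = {partition_for(partition_key("user#1", d)) for d in days}
print(f"28 days of one user landed on {len(placements)} distinct partitions")
```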

Secondary indexes: local vs global

Primary partitioning is chosen by access pattern; secondary access (by a different column) needs an index. Two strategies (DDIA Ch.6):

  • Local (document-partitioned): every partition has its own index on its rows. Writes are cheap (local). Reads by secondary key must scatter-gather every partition. Used by Elasticsearch shards, MongoDB local indexes.
  • Global (term-partitioned): one index keyed by the secondary column, partitioned by that column's hash. Reads are single-partition. Writes to the base table now also hit a different partition (2PC or async). DynamoDB GSIs are eventually consistent precisely because keeping the index in sync synchronously would be expensive.

Trade-off slogan: "Local indexes make writes cheap and reads expensive; global indexes do the opposite."
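A toy model of both index styles makes the slogan concrete (four partitions, MD5 placement, a users table with a secondary column city; all names are illustrative): the local index answers via scatter-gather over every partition, the global index via a single-partition lookup.

```python
# Local vs global secondary index (DDIA Ch.6 terminology), sketched.
import hashlib

N = 4  # partitions
def p(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % N

partitions = [dict() for _ in range(N)]  # base rows, partitioned by user_id
local_idx  = [dict() for _ in range(N)]  # each partition indexes only its own rows
global_idx = [dict() for _ in range(N)]  # one index, partitioned by hash(city)

def insert(user_id: str, city: str) -> None:
    i = p(user_id)
    partitions[i][user_id] = city
    local_idx[i].setdefault(city, []).append(user_id)         # same-partition write
    global_idx[p(city)].setdefault(city, []).append(user_id)  # extra cross-partition write

def read_local(city: str) -> list:
    # Scatter-gather: every partition's local index must be consulted.
    return [u for i in range(N) for u in local_idx[i].get(city, [])]

def read_global(city: str) -> list:
    # Single-partition read: the global index is partitioned by city.
    return global_idx[p(city)].get(city, [])

for u, c in [("u1", "sf"), ("u2", "nyc"), ("u3", "sf")]:
    insert(u, c)
print(sorted(read_local("sf")), sorted(read_global("sf")))  # same answer, different cost
```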

Rebalancing strategies

  • Fixed partitions: create 1000 partitions on day one across 10 nodes (100 each). Adding node 11 → move 1/11 of partitions to it. Elasticsearch, Riak. Must guess the high-water mark upfront.
  • Dynamic partitioning: start with 1 partition; split when it exceeds threshold (default 10 GB in HBase). Shrinks when undersize. Works well for unbounded growth.
  • Proportional to nodes: Cassandra's token-per-node (256 vnodes). Adding a node reshuffles 1/N of tokens.
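The fixed-partition bullet's math can be checked in a few lines (round-robin initial placement is an assumption for illustration): with 1000 partitions on 10 nodes, an 11th node should receive about 1/11 of the partitions and nothing else should move.

```python
# Fixed-partition rebalancing: 1000 partitions, 10 nodes; node 11 steals
# roughly 1/11 of the partitions, taken evenly from the existing owners.
P, N = 1000, 10
assign = {part: part % N for part in range(P)}  # round-robin initial placement

target = P // (N + 1)  # how many partitions the new node should own
moved = 0
for part in range(P):
    if moved == target:
        break
    # Take every (N+1)-th partition, which spreads the theft across owners.
    if part % (N + 1) == 0:
        assign[part] = N
        moved += 1
print(f"moved {moved} of {P} partitions (~{moved / P:.0%})")
```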

Automatic vs manual rebalancing: automatic is tempting but can stampede during a network blip. Production-grade systems require operator confirmation for moves above a threshold. CockroachDB and Spanner both gate large rebalances.

Request routing

Who knows which partition holds user#42?

  1. Smart client: client library embeds the partition map. Fast, but requires pushing updates to all clients. Cassandra drivers do this.
  2. Coordinator / routing tier: a stateless proxy reads the map from ZK/etcd, forwards requests. Extra hop (+0.5 ms), easier ops. Used by Kafka, Vitess.
  3. Partition-aware DNS / anycast: rare; only for coarse-grained routing.

Either way, there is a metadata store (ZK/etcd) that holds the authoritative partition map. This is exactly why the 2015 DynamoDB metadata storm was catastrophic — losing the map means losing the ability to route anything. Cache the map on every node with a 5–30 s TTL and graceful staleness.
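A sketch of the cached map with graceful staleness (class name, TTL, and the fetch callable are illustrative): serve from the local copy inside the TTL, refresh from the metadata store afterward, and keep routing on the stale copy if the store is unreachable.

```python
# Partition-map cache with TTL and graceful staleness: losing ZK/etcd
# briefly must not mean losing the ability to route.
import time

class PartitionMapCache:
    def __init__(self, fetch, ttl: float = 10.0):
        self.fetch = fetch      # callable that hits the metadata store
        self.ttl = ttl
        self.map = None
        self.fetched_at = 0.0

    def get(self) -> dict:
        now = time.monotonic()
        if self.map is None or now - self.fetched_at > self.ttl:
            try:
                self.map = self.fetch()
                self.fetched_at = now
            except Exception:
                if self.map is None:
                    raise  # no stale copy to fall back on
                # Metadata store unreachable: keep routing with the stale map.
        return self.map

calls = []
def fetch():
    calls.append(1)
    return {"p0": "node-a", "p1": "node-b"}

cache = PartitionMapCache(fetch, ttl=10.0)
cache.get(); cache.get()
print(len(calls))  # second lookup is served from the cache
```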

Anthropic-specific

For tenant-isolated inference (Claude for Enterprise), partition by tenant_id at the top level so data never shares a physical node across tenants. This simplifies audit and SOC2: one partition = one tenant = one bucket of logs.
