Anthropic ★★★ · Frequent · Hard · Scale · Sharding

A28 · Distributed Search for 1B Documents @ 1M QPS

Verified source

Prompt: "Design a Distributed Search System for 1B documents, 1M QPS." — Medium Anqi Silvia 2025; linkjob.ai. Credibility C.

Scale math first

  • 1B docs × avg 10 KB = 10 TB index → ~100 shards at 100 GB each.
  • 1M QPS → with document-sharded scatter-gather, every query hits all shards, so each shard needs multiple replicas to spread the load (a replica sustaining ~10K QPS implies ~100 replicas per shard).
  • If LLM re-ranking / generation is involved, GPU cost dominates; rate-limit or skip it for bulk queries.
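The capacity math above can be checked with a quick back-of-envelope script (all figures are the assumed targets from the bullets, not measured values):

```python
# Back-of-envelope sizing for the 1B-doc / 1M-QPS design above.
DOCS = 1_000_000_000   # 1B documents
DOC_SIZE_KB = 10       # average indexed size per document
SHARD_SIZE_GB = 100    # target shard size
TOTAL_QPS = 1_000_000  # peak query load
REPLICA_QPS = 10_000   # assumed sustainable QPS per replica

index_tb = DOCS * DOC_SIZE_KB / 1e9       # KB -> TB
shards = index_tb * 1000 / SHARD_SIZE_GB  # TB -> GB / shard size
# Scatter-gather: every query fans out to every shard, so each
# shard group absorbs the full 1M QPS; replicas split that load.
replicas_per_shard = TOTAL_QPS / REPLICA_QPS

print(index_tb)            # 10.0 (TB)
print(shards)              # 100.0 (shards)
print(replicas_per_shard)  # 100.0 (replicas per shard)
```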

Architecture

flowchart LR
  Q[Query] --> GW[Gateway + Rate Limit]
  GW --> CACHE[("L1 Cache / Edge")]
  CACHE --> L2[("Semantic Cache")]
  L2 --> FAN[Scatter-Gather]
  FAN --> S1[Shard 1] & S2[Shard 2] & S3[Shard N]
  S1 & S2 & S3 --> MER[Top-K Merge]
  MER --> RE["Reranker / LLM (optional)"]
  RE --> RESP[Response]
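The scatter-gather and top-K merge stages can be sketched in a few lines. This is a minimal in-memory sketch: the shard contents, the `search_shard` stand-in for a network call, and the scores are all hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each returns its local top-k as
# (score, doc_id) pairs, sorted descending by score.
SHARDS = [
    [(0.98, "d1"), (0.71, "d4")],
    [(0.95, "d7"), (0.60, "d9")],
    [(0.88, "d2"), (0.80, "d5")],
]

def search_shard(shard, query, k):
    # Stand-in for an RPC to one shard replica.
    return shard[:k]

def scatter_gather(query, k=3):
    # Fan out to all shards in parallel, then merge the
    # per-shard top-k lists into a global top-k.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: search_shard(s, query, k), SHARDS)
        hits = [h for p in partials for h in p]
    return heapq.nlargest(k, hits)  # global top-k by score

print(scatter_gather("any query"))
# -> [(0.98, 'd1'), (0.95, 'd7'), (0.88, 'd2')]
```

Note that each shard only returns its local top-k; the merger never needs more than shards × k candidates, which keeps the gather step cheap even at 100 shards.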

Hot-spot mitigation

  • Cache the top 5% of queries at the edge (cuts 40%+ of load at large scale).
  • Re-shard on hot keys; use consistent hashing with virtual nodes.
  • A semantic cache (embedding-proximity lookup) catches near-duplicate queries.
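The consistent-hashing-with-virtual-nodes idea above can be sketched as follows (shard names and the vnode count are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: each physical shard
    owns many points on the ring, so keys spread evenly and adding
    or removing a shard only remaps ~1/N of the keyspace."""

    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s)
            for s in shards for v in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def locate(self, key):
        # First ring point clockwise of the key's hash (wraps around).
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["shard-1", "shard-2", "shard-3"])
print(ring.locate("hot-query-term"))  # deterministic shard assignment
```

A hot key can additionally be split across replicas of its shard, or pinned into the edge cache so it never reaches the ring at all.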

GPU memory optimization

  • PagedAttention for the LLM rerank layer.
  • Prefix caching for a shared system prompt.
  • Quantization (FP8/INT8) for the reranker.
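To make the INT8 point concrete, here is a minimal sketch of symmetric INT8 weight quantization in pure Python; real rerankers use library kernels, and the weight values here are made up. Each weight drops from 4 bytes (FP32) to 1 byte, at the cost of a bounded rounding error:

```python
def quantize_int8(weights):
    # Symmetric quantization: map the largest |weight| to 127.
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]  # int8 codes, 1 byte each
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(q)        # int8 codes in [-127, 127]
print(max_err)  # rounding error, bounded by scale / 2
```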

Related study-guide topics