A28 · Distributed Search for 1B Documents @ 1M QPS
Verified source
Prompt: "Design a Distributed Search System for 1B documents, 1M QPS." — Anqi Silvia, Medium, 2025; linkjob.ai. Credibility C.
Scale math first
- 1B docs × avg 10 KB = 10 TB index → ~100 shards at ~100 GB each.
- 1M QPS → with 100 shards, each shard sees ~10K QPS; each shard needs multiple replicas to sustain that.
- If an LLM re-rank / generate step is involved, GPU cost dominates; rate-limit it or skip it for bulk queries.
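The sizing above can be checked with quick back-of-envelope arithmetic. All inputs are the assumptions stated in the bullets, except the per-replica QPS capacity, which is an illustrative assumption:

```python
# Back-of-envelope sizing; inputs match the bullets above except
# REPLICA_QPS, which is an assumed per-replica serving capacity.
DOCS = 1_000_000_000           # 1B documents
AVG_DOC_KB = 10                # 10 KB of index per doc
SHARD_GB = 100                 # target shard size
TOTAL_QPS = 1_000_000          # 1M QPS
REPLICA_QPS = 2_000            # assumed sustainable QPS per replica

index_tb = DOCS * AVG_DOC_KB / 1e9           # KB -> TB (decimal units)
shards = int(index_tb * 1000 / SHARD_GB)     # TB -> GB, then per-shard split
qps_per_shard = TOTAL_QPS // shards
replicas = -(-qps_per_shard // REPLICA_QPS)  # ceiling division

print(index_tb, shards, qps_per_shard, replicas)  # 10.0 100 10000 5
```

At the assumed 2K QPS per replica, each shard needs 5 replicas, i.e. ~500 serving nodes before any cache hits are counted.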
Architecture
```mermaid
flowchart LR
    Q[Query] --> GW[Gateway + Rate Limit]
    GW --> CACHE[(L1 Cache / Edge)]
    CACHE --> L2[(Semantic Cache)]
    L2 --> FAN[Scatter-Gather]
    FAN --> S1[Shard 1] & S2[Shard 2] & SN[Shard N]
    S1 & S2 & SN --> MER[Top-K Merge]
    MER --> RE["Reranker / LLM (optional)"]
    RE --> RESP[Response]
```
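The Top-K Merge stage of the scatter-gather can be sketched as a k-way streaming merge: each shard returns its local top-k sorted by descending score, and the coordinator only consumes at most k items per shard. A minimal sketch (names and the sample data are illustrative):

```python
import heapq
from itertools import islice

def merge_topk(shard_results, k):
    """Merge per-shard top-k lists into a global top-k.

    Each shard list is [(score, doc_id), ...], already sorted by
    descending score, so heapq.merge(reverse=True) streams them in
    global descending order without materializing everything.
    """
    merged = heapq.merge(*shard_results, reverse=True)
    return list(islice(merged, k))

# Two shards, each returning its local top-2:
s1 = [(0.9, "d3"), (0.5, "d7")]
s2 = [(0.8, "d1"), (0.4, "d9")]
print(merge_topk([s1, s2], 3))  # [(0.9, 'd3'), (0.8, 'd1'), (0.5, 'd7')]
```

In production the merge also handles shard timeouts (return partial results rather than blocking on a straggler), which this sketch omits.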
Hot-spot mitigation
- Cache the top 5% of queries at the edge (cuts 40%+ of load at large scale).
- Re-shard hot keys; use consistent hashing with virtual nodes.
- A semantic cache (embedding proximity) catches near-duplicate queries.
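Consistent hashing with virtual nodes, as mentioned above, can be sketched as follows. Each physical shard owns many points on a hash ring, so adding, removing, or splitting a hot shard only moves a small, evenly spread slice of keys (class and parameter names are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: each physical shard owns
    `vnodes` points on the ring, smoothing load across shards."""

    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._h(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _h(key):
        # Stable hash onto the ring (md5 used only for distribution).
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self.points, self._h(key)) % len(self.points)
        return self.ring[i][1]

ring = ConsistentHashRing([f"shard-{i}" for i in range(100)])
print(ring.lookup("user:42"))  # always routes this key to the same shard
```

With 100 virtual nodes per shard, removing one shard redistributes roughly 1% of keys, split across all remaining shards instead of dumping them on one neighbor.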
GPU memory optimization
- PagedAttention for the LLM rerank layer.
- Prefix caching for the shared system prompt.
- Quantize the reranker (FP8/INT8).
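The memory win from quantization is simple arithmetic. A rough sketch, assuming an illustrative 7B-parameter reranker (the parameter count is not from the source):

```python
# Weight memory at different precisions for an assumed 7B-parameter
# reranker; figures are illustrative, not measurements.
PARAMS = 7e9
BYTES_PER_PARAM = {"FP16": 2, "FP8": 1, "INT8": 1}

for prec, b in BYTES_PER_PARAM.items():
    gib = PARAMS * b / 2**30
    print(f"{prec}: {gib:.1f} GiB of weights")
# FP8/INT8 halve weight memory vs FP16, freeing HBM for the KV cache
# that PagedAttention manages in fixed-size pages.
```

On an 80 GB GPU, the ~6.5 GiB saved per replica goes directly to KV-cache pages, i.e. longer candidate lists per rerank batch.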