O10 · Fault-Tolerant Polite Web Crawler @ 10M RPS O10 · 容错礼貌型网页爬虫(10M RPS)
Verified source经核实出处
Prompt: "Design a fault-tolerant, polite web crawling service that can scale to 10M requests per second." — IGotAnOffer, 2026-04-13. Credibility D.
Core architecture核心架构
flowchart LR S[Seed URLs] --> F[URL Frontier] F --> P[Politeness Scheduler] P --> Q[Fetch Queue] Q --> W[Crawler Workers] W --> D[Parser + Dedup] D --> F W --> ST[(Content Store)]
Politeness mechanics礼貌机制
- Per-host/per-domain token-bucket rate limit.per-host/per-domain 令牌桶限流。
- robots.txt parse & cache (24h TTL).robots.txt 解析并缓存(24h TTL)。
- Cap concurrent connections per host (≤ N).每 host 并发连接上限(≤ N)。
Dedup at 10M RPS (the hard part)10M RPS 下去重(最难点)
- URL canonicalization: normalize scheme/host, sort query params, strip tracking.URL 规范化:归一化 scheme/host、排序 query、去追踪参数。
- Bloom filter (approximate) + exact set for hot domains.Bloom filter(近似)+ 热域精确集。
- Content dedup via simhash/minhash for near-duplicates (bonus).内容去重用 simhash/minhash 检测近似重复(加分)。
Fault tolerance容错
- Stateless workers; the queue + scheduler own retry/backoff.Worker 无状态;重试/退避由队列与调度器管。
- Idempotent writes keyed on content_hash or canonical_url.按 content_hash 或 canonical_url 的幂等写入。