OpenAI ★★ Frequent Hard FrontierPolitenessBloom Filter

O10 · Fault-Tolerant Polite Web Crawler @ 10M RPS O10 · 容错礼貌型网页爬虫(10M RPS)

Verified source经核实出处

Prompt: "Design a fault-tolerant, polite web crawling service that can scale to 10M requests per second." — IGotAnOffer, 2026-04-13. Credibility D.

Core architecture核心架构

flowchart LR
  S[Seed URLs] --> F[URL Frontier]
  F --> P[Politeness Scheduler]
  P --> Q[Fetch Queue]
  Q --> W[Crawler Workers]
  W --> D[Parser + Dedup]
  D --> F
  W --> ST[(Content Store)]

Politeness mechanics礼貌机制

  • Per-host/per-domain token-bucket rate limit.per-host/per-domain 令牌桶限流。
  • robots.txt parse & cache (24h TTL).robots.txt 解析并缓存(24h TTL)。
  • Cap concurrent connections per host (≤ N).每 host 并发连接上限(≤ N)。

Dedup at 10M RPS (the hard part)10M RPS 下去重(最难点)

  • URL canonicalization: normalize scheme/host, sort query params, strip tracking.URL 规范化:归一化 scheme/host、排序 query、去追踪参数。
  • Bloom filter (approximate) + exact set for hot domains.Bloom filter(近似)+ 热域精确集。
  • Content dedup via simhash/minhash for near-duplicates (bonus).内容去重用 simhash/minhash 检测近似重复(加分)。

Fault tolerance容错

  • Stateless workers; the queue + scheduler own retry/backoff.Worker 无状态;重试/退避由队列与调度器管。
  • Idempotent writes keyed on content_hash or canonical_url.按 content_hash 或 canonical_url 的幂等写入。

Related study-guide topics相关学习手册专题