Google ★★★ Frequent Hard CrawlerBFSPoliteness

G6 · Design Google Web Crawler G6 · 设计 Google 网络爬虫

Verified source经核实出处

Classic Google SD interview question; confirmed on Glassdoor, LeetCode Discuss for Google onsites. Credibility A.

Architecture架构

flowchart LR
  Seed --> FR[Frontier - BFS queue]
  FR --> POL[Politeness gate - per host]
  POL --> FETCH[Fetcher pool]
  FETCH --> ROB[Robots check]
  FETCH --> DL[Downloader]
  DL --> DEDUP[Dup detector - simhash]
  DEDUP --> INDEX[Indexer]
  INDEX --> FR_NEW[New URLs]
  FR_NEW --> FR

Key decisions关键决策

  • **Frontier per-host bucket**: per-host queue throttled to politeness delay.**按域分桶的 frontier**:每域限流至礼貌延迟。
  • **Re-crawl priority** by PageRank + historical update frequency.**重爬优先级**按 PageRank + 历史更新频率。
  • **SimHash near-duplicate detection** dedups mirror/boilerplate pages.**SimHash 近似去重**处理镜像/模板页。
  • **Distributed frontier**: consistent-hash URL to crawler node; avoids cross-cluster duplicates.**分布式 frontier**:URL 一致性哈希到节点;避免集群重复抓取。

Follow-ups追问

  • JS-heavy sites? dedicated headless renderer tier, cached.JS 重站?专用无头渲染层,带缓存。
  • Frontier persistence? LSM store (like Bigtable).frontier 持久?LSM 存储(如 Bigtable)。

Related study-guide topics相关学习手册专题