G6 · Design Google Web Crawler G6 · 设计 Google 网络爬虫
Verified source经核实出处
Classic Google SD interview question; confirmed on Glassdoor, LeetCode Discuss for Google onsites. Credibility A.
Architecture架构
flowchart LR Seed --> FR[Frontier - BFS queue] FR --> POL[Politeness gate - per host] POL --> FETCH[Fetcher pool] FETCH --> ROB[Robots check] FETCH --> DL[Downloader] DL --> DEDUP[Dup detector - simhash] DEDUP --> INDEX[Indexer] INDEX --> FR_NEW[New URLs] FR_NEW --> FR
Key decisions关键决策
- **Frontier per-host bucket**: per-host queue throttled to politeness delay.**按域分桶的 frontier**:每域限流至礼貌延迟。
- **Re-crawl priority** by PageRank + historical update frequency.**重爬优先级**按 PageRank + 历史更新频率。
- **SimHash near-duplicate detection** dedups mirror/boilerplate pages.**SimHash 近似去重**处理镜像/模板页。
- **Distributed frontier**: consistent-hash URL to crawler node; avoids cross-cluster duplicates.**分布式 frontier**:URL 一致性哈希到节点;避免集群重复抓取。
Follow-ups追问
- JS-heavy sites? dedicated headless renderer tier, cached.JS 重站?专用无头渲染层,带缓存。
- Frontier persistence? LSM store (like Bigtable).frontier 持久?LSM 存储(如 Bigtable)。