How does a distributed web crawler fetch 10B+ pages across the internet while respecting robots.txt, maintaining politeness (not overwhelming hosts), prioritizing high-value pages, and deduplicating content efficiently?
Core challenge: The web has 10B+ indexable pages that change constantly. A crawler must fetch politely (1 req/sec per host), prioritize by PageRank and freshness, detect near-duplicate content (30%+ of the web is duplicated), and handle traps (infinite calendars, session URLs).
10B+: pages crawled
Politeness: 1 req/sec per host
PageRank: priority-based scheduling
Dedup: SimHash fingerprinting
Architecture
URL Frontier design: A two-level queue. Front queues (priority-based, ordered by PageRank and freshness) feed into back queues (one per host), which enforce politeness. Each host gets its own queue with a minimum delay between requests.
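The two-level frontier can be sketched in a few dozen lines. This is a minimal single-process illustration, not a production design: the `Frontier` class, its method names, the `min_delay` parameter, and the injectable `clock` are all assumptions made for the example. It also drains the whole front queue on every call, which a real crawler would not do.

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """Hypothetical two-level frontier: priority front queue, per-host back queues."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay      # seconds between requests to one host
        self.clock = clock              # injectable so the sketch can be tested
        self.front = []                 # heap of (-priority, seq, url)
        self.back = defaultdict(deque)  # host -> URLs waiting for that host
        self.ready = []                 # heap of (earliest_fetch_time, host)
        self.scheduled = set()          # hosts that currently have an entry in `ready`
        self.next_ok = {}               # host -> earliest time the next fetch is allowed
        self.seq = 0                    # tie-breaker for equal priorities

    def add(self, url, priority):
        heapq.heappush(self.front, (-priority, self.seq, url))
        self.seq += 1

    def _drain_front(self):
        # Route URLs to their host's back queue, highest priority first.
        while self.front:
            _, _, url = heapq.heappop(self.front)
            host = urlsplit(url).netloc
            self.back[host].append(url)
            if host not in self.scheduled:
                ready_at = max(self.clock(), self.next_ok.get(host, 0.0))
                heapq.heappush(self.ready, (ready_at, host))
                self.scheduled.add(host)

    def next_url(self):
        """Return a URL whose host may be fetched now, or None if all hosts are cooling down."""
        self._drain_front()
        now = self.clock()
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.back[host].popleft()
        self.next_ok[host] = now + self.min_delay
        if self.back[host]:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        else:
            self.scheduled.discard(host)
            del self.back[host]
        return url
```

Within a host, URLs come out in priority order. Across hosts, the host that became eligible earliest goes first. The `next_ok` map keeps the delay in force even after a host's back queue has emptied and later refills.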
Content deduplication: SimHash, a locality-sensitive hash, detects near-duplicate pages; exact duplicates are caught by a content hash (MD5/SHA). 30%+ of web content is duplicated, so dedup saves substantial storage and avoids indexing redundant pages.
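A rough sketch of both checks follows. The function names, the 3-word shingle features, and the 64-bit fingerprint width are assumptions chosen for illustration; real crawlers typically weight features and tune the Hamming-distance threshold empirically.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed for this sketch)


def exact_hash(text):
    """Content hash for exact-duplicate detection."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def simhash(text):
    """Return a 64-bit SimHash fingerprint built from 3-word shingles."""
    tokens = re.findall(r"\w+", text.lower())
    # Overlapping 3-word shingles; very short texts yield a single shingle.
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(len(tokens) - 2, 1))]
    counts = [0] * BITS
    for feature in shingles:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each feature votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps the bits where the positive votes won.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two pages would be treated as near-duplicates when `hamming_distance(simhash(a), simhash(b))` falls at or below a small threshold; the threshold itself is a tuning choice and is not specified here.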
Anti-patterns: no politeness enforcement (the crawler gets IP-banned and harms small sites); BFS without priority (wastes crawl budget on low-value pages); no trap detection (infinite calendars and session URLs exhaust the crawler); a single DNS resolver (becomes a bottleneck).
Robots.txt + traps: Always fetch and cache robots.txt before crawling a domain. Detect spider traps: URLs with ever-increasing depth, calendar pages that generate infinite dates, and session IDs that create an infinite URL space. Use a maximum URL depth and URL pattern detection as safeguards.
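The robots.txt check and a few trap heuristics can be sketched with Python's standard `urllib.robotparser`. The thresholds (`MAX_DEPTH`, `MAX_REPEATED_SEGMENT`), the `SESSION_PARAMS` set, and the helper names are illustrative assumptions, not established values. The robots.txt text is parsed from a string here so the example runs without network access.

```python
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

MAX_DEPTH = 12             # assumed cap on path segments
MAX_REPEATED_SEGMENT = 2   # assumed cap on how often one segment may repeat
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}  # assumed names


def load_robots(robots_txt):
    """Parse robots.txt content that has already been fetched and cached."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def looks_like_trap(url):
    """Heuristic trap check: excessive depth, looping segments, or session IDs."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    if any(segments.count(s) > MAX_REPEATED_SEGMENT for s in set(segments)):
        return True
    params = {key.lower() for key, _ in parse_qsl(parts.query)}
    return bool(params & SESSION_PARAMS)


robots = load_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
allowed = robots.can_fetch("examplebot", "http://example.com/public/page")
blocked = robots.can_fetch("examplebot", "http://example.com/private/data")
delay = robots.crawl_delay("examplebot")
```

Here `examplebot` is a placeholder user agent. These heuristics would not catch a calendar trap on their own, since calendar URLs can stay shallow; a per-domain crawl budget is the usual backstop for that case.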
Interview Cheat Sheet
1. URL Frontier: priority queues (PageRank) + per-host back queues (politeness)
2. Politeness: respect robots.txt crawl-delay, 1 req/sec/host default, distributed across workers
3. Deduplication: URL-level (Bloom filter) + content-level (SimHash for near-duplicates)
4. Priority scheduling: PageRank, change_frequency, and freshness drive crawl budget allocation
5. Trap detection: max URL depth, pattern recognition, domain-level crawl budgets
6. DNS caching: local DNS cache to avoid a resolver bottleneck at billions of lookups
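For the URL-level dedup in item 3, a Bloom filter can be sketched as below. The class, its default sizes, and the double-hashing scheme derived from one SHA-256 digest are assumptions for illustration. A Bloom filter may report a URL as seen when it was not (a false positive) but never misses one that was added.

```python
import hashlib


class BloomFilter:
    """Hypothetical fixed-size Bloom filter for seen-URL checks."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter()
url = "http://example.com/page"
was_seen_before = url in seen   # checked before the URL is recorded
seen.add(url)
is_seen_after = url in seen     # checked after the URL is recorded
```

In a crawler the URL would normally be canonicalized (lowercased host, fragment removed, and so on) before it is added; that step is omitted here.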