How does Google crawl 10B+ pages across the internet?

Design a web crawler: 10B+ pages, politeness, priority by PageRank, content deduplication

Concepts Involved

Message Queues · Rate Limiting · Bloom Filters · Sharding · DNS

Problem Statement

How does a distributed web crawler fetch 10B+ pages across the internet while respecting robots.txt, maintaining politeness (not overwhelming hosts), prioritizing high-value pages, and deduplicating content efficiently?

Core challenge: The web has 10B+ indexable pages that change constantly. A crawler must fetch politely (1 req/sec per host), prioritize by PageRank and freshness, detect near-duplicate content (30%+ of the web is duplicated), and handle traps (infinite calendars, session URLs).
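
A quick back-of-envelope calculation shows why the per-host limit forces wide parallelism. The sketch below takes the 10B page count and the 1 req/sec per-host rate from the problem statement; the 30-day recrawl window is an assumption chosen only for illustration.

```python
import math

TOTAL_PAGES = 10_000_000_000   # 10B pages, from the problem statement
PER_HOST_RATE = 1.0            # requests/sec per host, from the politeness rule
RECRAWL_WINDOW_DAYS = 30       # assumption: one full pass over the corpus per 30 days


def required_fetch_rate(total_pages, window_days):
    """Average pages/sec needed to fetch every page once within the window."""
    return total_pages / (window_days * 86_400)


def min_concurrent_hosts(fetch_rate, per_host_rate):
    """Fewest distinct hosts that must be in flight to sustain fetch_rate politely."""
    return math.ceil(fetch_rate / per_host_rate)


rate = required_fetch_rate(TOTAL_PAGES, RECRAWL_WINDOW_DAYS)
hosts = min_concurrent_hosts(rate, PER_HOST_RATE)
print(f"{rate:,.0f} pages/sec, at least {hosts:,} hosts in flight")
```

Under these assumptions the crawler needs to sustain about 3,858 pages per second, so it must be fetching from several thousand distinct hosts at any moment. A shorter recrawl window raises both numbers proportionally.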

10B+: pages crawled
Politeness: 1 req/sec per host
PageRank: priority-based scheduling
Dedup: SimHash fingerprinting

Architecture

URL Frontier design: A two-level queue. Front queues (priority-based, sorted by PageRank × freshness) feed back queues (one per host), which enforce politeness. Each host gets its own queue with a minimum delay between requests.
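
The two-level frontier can be sketched in a few lines of Python. This is a single-process illustration, not a production design: the `Frontier` class, its method names, and the `score` argument are hypothetical, and the front level is collapsed into one priority heap rather than several queues. Time is passed in explicitly so the behaviour is easy to follow.

```python
import heapq
import itertools
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """Priority front queue feeding per-host back queues with a minimum delay."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._front = []                 # heap of (-score, seq, url): highest score first
        self._back = defaultdict(deque)  # host -> URLs waiting for that host
        self._ready = []                 # heap of (next_allowed_time, seq, host)
        self._next_ok = {}               # host -> earliest time of its next request
        self._seq = itertools.count()    # tie-breaker that keeps insertion order

    def add(self, url, score):
        heapq.heappush(self._front, (-score, next(self._seq), url))

    def _route_to_back_queues(self, now):
        # Move URLs to their host queue in priority order.
        while self._front:
            _, _, url = heapq.heappop(self._front)
            host = urlsplit(url).hostname or ""
            if not self._back[host]:
                # Host is not scheduled yet: it becomes ready once its delay has passed.
                ready_at = max(now, self._next_ok.get(host, 0.0))
                heapq.heappush(self._ready, (ready_at, next(self._seq), host))
            self._back[host].append(url)

    def next_url(self, now):
        """Return the next URL that may be fetched at time `now`, or None."""
        self._route_to_back_queues(now)
        if not self._ready or self._ready[0][0] > now:
            return None
        _, _, host = heapq.heappop(self._ready)
        url = self._back[host].popleft()
        self._next_ok[host] = now + self.min_delay
        if self._back[host]:
            heapq.heappush(self._ready, (self._next_ok[host], next(self._seq), host))
        return url


frontier = Frontier(min_delay=1.0)
frontier.add("http://a.example/1", score=0.9)
frontier.add("http://a.example/2", score=0.8)
frontier.add("http://b.example/1", score=0.5)
first = frontier.next_url(now=0.0)    # highest-scored URL on host a
second = frontier.next_url(now=0.0)   # host a is cooling down, so host b is served
third = frontier.next_url(now=0.5)    # None: no host is allowed yet
fourth = frontier.next_url(now=1.0)   # host a is allowed again
```

In a real worker, `now` would typically come from `time.monotonic()`, and the per-host delay would be overridden by any crawl delay the host requests.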

Content deduplication: SimHash (a locality-sensitive hash) detects near-duplicate pages, while exact duplicates are caught by a content hash (MD5/SHA). With 30%+ of web content duplicated, dedup saves substantial storage and avoids indexing redundant pages.
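
A minimal sketch of both checks follows. It assumes unweighted word tokens and a 64-bit fingerprint built from MD5 token hashes; real systems usually weight features and tune the Hamming-distance threshold, so treat those choices as illustrative. The function names are hypothetical.

```python
import hashlib
import re

FINGERPRINT_BITS = 64


def _token_hash(token):
    """64-bit hash of one token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(text):
    """SimHash fingerprint: each token votes +1 or -1 on every bit position."""
    votes = [0] * FINGERPRINT_BITS
    for token in re.findall(r"\w+", text.lower()):
        h = _token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def exact_hash(content):
    """Exact-duplicate key: SHA-256 of the raw bytes."""
    return hashlib.sha256(content).hexdigest()


def is_near_duplicate(text_a, text_b, max_distance=3):
    # max_distance is an assumed threshold; it needs tuning on real data.
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Pages whose fingerprints differ in only a few bits are treated as near-duplicates, whereas the exact hash changes completely on any byte difference. That is why the two checks complement each other.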

Anti-patterns: Skipping politeness enforcement gets the crawler IP-banned and harms small sites. Plain BFS without priority wastes crawl budget on low-value pages. Without trap detection, infinite calendars and session URLs exhaust the crawler. A single DNS resolver becomes a bottleneck.
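
The DNS bottleneck is commonly eased with a cache local to each worker. The sketch below is one possible shape, with a hypothetical `DnsCache` class and a fixed TTL, which is an assumption: `socket.gethostbyname` does not expose the record's real TTL. The resolver and clock are injectable so the cache can be exercised without network access.

```python
import socket
import time


class DnsCache:
    """Per-worker hostname cache with a fixed time-to-live."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolver = resolver
        self._clock = clock
        self._entries = {}  # host -> (address, expires_at)

    def resolve(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]              # fresh cache hit: no resolver call
        address = self._resolver(host)   # miss or expired: ask the resolver
        self._entries[host] = (address, now + self._ttl)
        return address
```

Because the frontier already groups URLs by host, most lookups within a worker repeat the same few hostnames, which is what makes even a simple cache like this effective.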

Robots.txt + traps: Always fetch and cache robots.txt before crawling a domain. Detect spider traps: URLs with ever-increasing depth, calendar pages generating infinite dates, and session IDs creating an infinite URL space. Use a maximum depth and URL pattern detection as safeguards.
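
Both safeguards can be sketched with the Python standard library. `urllib.robotparser` handles the robots.txt rules, including the non-standard `Crawl-delay` directive that some sites publish. The `looks_like_trap` heuristic, its thresholds, and the session parameter names are assumptions for illustration. Calendar traps usually also need a per-host crawl budget, which is not shown here.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

# Assumed list of query parameters that commonly carry session IDs.
SESSION_PARAMS = {"sid", "sessionid", "session_id", "phpsessid", "jsessionid"}


def parse_robots(robots_txt):
    """Build a rule set from robots.txt text that was already fetched and cached."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def looks_like_trap(url, max_depth=10, max_segment_repeats=3):
    """Heuristic trap check: excessive depth, repeating segments, session IDs."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > max_depth:
        return True
    if segments and max(Counter(segments).values()) >= max_segment_repeats:
        return True
    params = {key.lower() for key, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return bool(params & SESSION_PARAMS)


rules = parse_robots("User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n")
allowed = rules.can_fetch("examplebot", "https://example.com/index.html")
blocked = rules.can_fetch("examplebot", "https://example.com/private/a")
delay = rules.crawl_delay("examplebot")
```

A fetcher would call `can_fetch` and `looks_like_trap` before enqueueing a URL, and use `crawl_delay` (when present) in place of the default per-host delay.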

Interview Cheat Sheet

1. URL Frontier · priority queues (PageRank) + per-host back queues (politeness)
2. Politeness · respect robots.txt crawl-delay, 1 req/sec/host default, distributed across workers
3. Deduplication · URL-level (Bloom filter) + content-level (SimHash for near-duplicates)
4. Priority scheduling · PageRank × change_frequency × freshness for crawl budget allocation
5. Trap detection · max URL depth, pattern recognition, domain-level crawl budgets
6. DNS caching · local DNS cache to avoid resolver bottleneck at billions of lookups
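
The URL-level Bloom filter from item 3 can be sketched as follows. The `BloomFilter` class is hypothetical; it uses the standard sizing formulas and derives its bit positions from two halves of a SHA-256 digest (double hashing), which is one common construction among several.

```python
import hashlib
import math


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self._bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter(expected_items=1000, false_positive_rate=0.01)
seen.add("https://example.com/")
already_crawled = "https://example.com/" in seen
```

At a 1% false-positive rate the filter needs roughly 9.6 bits per URL, so 10B URLs come to about 12 GB. That is one reason the filter is usually sharded, for example by host, alongside the frontier. A false positive means a new URL is occasionally skipped, which is generally acceptable for a crawler.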
