System Design Case Study

How does Google crawl 10B+ pages across the internet?

??? Design a web crawler: 10B+ pages, politeness, priority by PageRank, content deduplication
Concepts Involved

Problem Statement

How does a distributed web crawler fetch 10B+ pages across the internet while respecting robots.txt, maintaining politeness (not overwhelming hosts), prioritizing high-value pages, and deduplicating content efficiently?

Core challenge: The web has 10B+ indexable pages changing constantly. Must crawl politely (1 req/sec per host), prioritize by PageRank/freshness, detect near-duplicate content (30%+ of web is duplicated), and handle traps (infinite calendars, session URLs).
10B+
pages crawled
Politeness
1 req/sec per host
PageRank
priority-based scheduling
Dedup
SimHash fingerprinting

Architecture

LAYER 1 · FRONTIER (Two-Level Queue: Priority + Politeness) Seed URLs Initial high-quality domains to start + discovered URLs from previous crawls BFS expansion URL Frontier (Two-Level Design) Front Queues Priority by PageRank · freshness · change_freq High-value pages first Multiple priority levels Back Queues Per-host politeness 1 req/sec per host robots.txt crawl-delay One queue per domain Seen URL Store (Bloom Filter) 10B URLs tracked ~1.2GB memory (1% false positive) O(1) "have we seen this URL?" Prevents re-crawling same page URL normalization before check (strip params, lowercase, resolve redirects) LAYER 2 · FETCHING (Distributed Workers ? Content Processing) Fetcher Pool (1000s Workers) 1. Check robots.txt (cached per domain) 2. DNS resolve (local cache, avoid bottleneck) 3. HTTP GET (connection pool, keep-alive) Timeout: 30s | Max size: 10MB Distributed across data centers Content Processor 1. Parse HTML (extract text, metadata, structured data) 2. Extract outlinks ? normalize ? check Bloom ? add to frontier 3. SimHash fingerprint (near-duplicate detection) 4. Spider trap detection (max depth, URL pattern, domain budget) 5. Language detection + content classification LAYER 3 · OUTPUT (Index Pipeline + URL Store + Content Dedup) Index Pipeline Build inverted index (MapReduce) term ? [doc_id, position, freq] Feeds search engine PageRank computed offline URL Store (? Back to Frontier) New URLs extracted ? add to frontier BFS expansion of the web graph Track last-crawled time per URL Schedule re-crawl by change frequency Content Dedup (SimHash) Near-duplicate detection 30%+ of web is duplicated content Hamming distance on SimHash fingerprints Keep canonical, discard duplicates new URLs ? back to frontier (BFS expansion) Politeness: 1 req/sec/host | Priority: PageRank · freshness · change_frequency | Traps: max depth + URL pattern + domain budget Breaking news: real-time crawl pipeline (bypass normal queue) | Bloom filter: 10B URLs in 1.2GB | SimHash: 30%+ web is duped
URL Frontier design: Two-level queue: front queues (priority-based, sorted by PageRank · freshness) feed into back queues (per-host, enforce politeness). Each host gets its own queue with minimum delay between requests.
Content deduplication: SimHash (locality-sensitive hashing) detects near-duplicate pages. Exact duplicates caught by content hash (MD5/SHA). 30%+ of web content is duplicated · dedup saves massive storage and avoids indexing redundant pages.
Anti-patterns: No politeness enforcement · get IP-banned, harm small sites. BFS without priority · waste budget on low-value pages. No trap detection · infinite calendars/session URLs exhaust crawler. Single DNS resolver · becomes bottleneck.
Robots.txt + traps: Always fetch and cache robots.txt before crawling a domain. Detect spider traps: URLs with ever-increasing depth, calendar pages generating infinite dates, session IDs creating infinite URL space. Max depth + URL pattern detection as safeguards.

Interview Cheat Sheet

1. URL Frontier · priority queues (PageRank) + per-host back queues (politeness)
2. Politeness · respect robots.txt crawl-delay, 1 req/sec/host default, distributed across workers
3. Deduplication · URL-level (Bloom filter) + content-level (SimHash for near-duplicates)
4. Priority scheduling · PageRank · change_frequency · freshness for crawl budget allocation
5. Trap detection · max URL depth, pattern recognition, domain-level crawl budgets
6. DNS caching · local DNS cache to avoid resolver bottleneck at billions of lookups