A comprehensive deep-dive into designing a production-grade web crawler capable of discovering, fetching, and indexing billions of web pages. Covers URL frontier design with priority and politeness queues, the fetch cycle including DNS pre-fetching and connection pooling, URL deduplication with Bloom filters and SimHash, distributed multi-worker architecture with consistent hashing, robots.txt compliance and ethical crawling, spider trap detection and URL normalization, and adaptive recrawl scheduling for freshness. This is a core system design interview question at FAANG companies because it tests breadth (networking, storage, distributed coordination) and depth (probabilistic data structures, graph traversal, politeness constraints) in a single session.
A web crawler — also called a spider or bot — is a program that systematically browses the internet by following hyperlinks from page to page, downloading content, and extracting new URLs to visit. It is the foundational component of every search engine, powering the index that makes search possible. Without crawlers, search engines would have no data to search over.
Google operates the largest web crawler in the world (Googlebot), which discovers and re-fetches pages across an index of over 400 billion documents. Bing, Yandex, Baidu, and DuckDuckGo each operate their own crawlers. Beyond search, crawlers power academic datasets like Common Crawl (which produces a free, open archive of the web containing over 3 billion pages per monthly snapshot), price comparison engines, archival services like the Internet Archive, SEO analysis tools, and machine learning training pipelines.
Why Interviewers Love This Question
A web crawler touches nearly every distributed systems concept: graph traversal at internet scale, queue management and scheduling, network I/O optimization, deduplication with probabilistic data structures, politeness and rate limiting, handling adversarial content (spider traps), and freshness maintenance. It can be discussed at L4 depth (basic BFS + queue) or L7 depth (Bloom filter sizing, consistent hashing for URL partitioning, adaptive recrawl with exponential backoff). This range makes it one of the top system design interview questions at Google, Meta, and Amazon.
The scale of the modern web is staggering. Estimates put the indexed web at around 60 billion pages, with the deep web (content behind forms, logins, and dynamic generation) being orders of magnitude larger. A production crawler must handle pages in hundreds of languages, content types ranging from static HTML to JavaScript-rendered single-page applications, and an adversarial environment where some sites actively try to trap or block crawlers.
Core functions of a web crawler
Scale Numbers to Remember
The indexed web contains roughly 60 billion pages. Google crawls hundreds of millions of pages per day. A well-designed crawler targeting 1 billion pages on a 1-day cycle needs to sustain about 12,000 pages per second (1B / 86,400 seconds). Each page averages about 2 MB when including images and scripts, but crawlers typically fetch only HTML (average 50-100 KB), so 1 billion pages of HTML is roughly 50-100 TB of raw content per crawl cycle.
Understanding these numbers is critical because they drive every architectural decision. A toy crawler on your laptop can fetch a few pages per second using a single thread. A production crawler serving a search engine must fetch thousands of pages per second, coordinate across hundreds of machines, avoid re-fetching billions of already-seen URLs, and do all of this while being a polite guest on every web server it visits.
Crawler vs Scraper
A crawler discovers and fetches pages by following links across the web. A scraper extracts specific data from known pages. In interviews, "design a web crawler" means the large-scale discovery and fetching system, not a targeted data extraction tool. The crawler feeds pages into downstream systems (indexers, scrapers, ML pipelines).
Every system design interview should begin with clarifying requirements and estimating scale. For a web crawler, the functional and non-functional requirements are well-defined, but the scale numbers drive radically different architectural choices.
Functional requirements
Non-functional requirements
Start With the Math
In an interview, always show your back-of-envelope calculation. This demonstrates that you think about scale before jumping into architecture. Interviewers expect you to derive the pages-per-second target, bandwidth requirements, and storage needs from first principles.
| Metric | Calculation | Result |
|---|---|---|
| Target pages | 1 billion pages per crawl cycle | 1,000,000,000 |
| Crawl cycle duration | 1 day (aggressive) to 1 week (conservative) | 86,400 to 604,800 seconds |
| Pages per second (1-day cycle) | 1B / 86,400 | ~11,574 pages/sec |
| Pages per second (1-week cycle) | 1B / 604,800 | ~1,653 pages/sec |
| Avg HTML page size | Typical HTML body without images | ~100 KB |
| Bandwidth (1-day cycle) | 11,574 x 100 KB/sec | ~1.16 GB/sec = ~9.3 Gbps |
| Bandwidth (1-week cycle) | 1,653 x 100 KB/sec | ~165 MB/sec = ~1.3 Gbps |
| Raw storage per cycle | 1B x 100 KB | ~100 TB |
| Storage with compression | ~5:1 gzip ratio for HTML | ~20 TB |
| URL frontier size | 10B discovered URLs (10x pages) | ~500 GB at 50 bytes/URL avg |
These numbers reveal important constraints. At 11,574 pages per second with a 1-day cycle, a single machine doing 50 fetches per second would need 230+ machines in the fetcher pool alone. The bandwidth requirement of 9.3 Gbps requires careful network planning. And 100 TB of raw HTML per cycle means storage costs add up quickly, making compression and deduplication essential.
DNS is the Hidden Bottleneck
Every new domain requires a DNS lookup. With 1 billion pages across 100 million domains, you need to resolve 100 million DNS queries per crawl cycle. Public DNS resolvers (8.8.8.8) rate-limit heavy users. Production crawlers run their own caching DNS resolvers and pre-fetch DNS records for domains in the frontier before the fetcher needs them. Without DNS caching, DNS lookup latency (50-200ms) dominates total page fetch time.
Interview estimation walkthrough
1. State your assumption: "Let us design for 1 billion pages crawled in 1 week."
2. Calculate pages per second: 1B / (7 x 86,400) = approximately 1,653 pages/sec.
3. Estimate bandwidth: 1,653 x 100 KB = 165 MB/sec inbound, or about 1.3 Gbps. This is manageable on a single data center uplink.
4. Estimate storage: 1B x 100 KB = 100 TB raw. With gzip compression at 5:1, about 20 TB per crawl cycle.
5. Estimate fetcher count: if each fetcher machine sustains 50 pages per second across its concurrent connections, we need 1,653 / 50 = 33 fetcher machines. Round up to 50 for headroom.
6. Estimate frontier size: assume 10 billion discovered URLs at 50 bytes each = 500 GB. This fits in distributed memory or on-disk queues.
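The estimation walkthrough above can be reproduced as a short script. All constants are the article's stated assumptions (1 billion pages, 1-week cycle, 100 KB of HTML per page, 5:1 gzip, 50 pages/sec per fetcher machine); this is a back-of-envelope sketch, not a capacity plan.

```python
# Back-of-envelope crawl sizing, using the assumptions from the walkthrough above.
PAGES = 1_000_000_000          # target pages per crawl cycle
CYCLE_SECONDS = 7 * 86_400     # 1-week cycle
AVG_HTML_BYTES = 100_000       # ~100 KB of HTML per page
COMPRESSION_RATIO = 5          # gzip achieves roughly 5:1 on HTML
PER_FETCHER_PPS = 50           # pages/sec one fetcher machine sustains

pages_per_sec = PAGES / CYCLE_SECONDS                          # ~1,653
bandwidth_mb_per_sec = pages_per_sec * AVG_HTML_BYTES / 1e6    # ~165 MB/s
raw_storage_tb = PAGES * AVG_HTML_BYTES / 1e12                 # 100 TB raw
compressed_tb = raw_storage_tb / COMPRESSION_RATIO             # 20 TB gzipped
fetchers = -(-pages_per_sec // PER_FETCHER_PPS)                # ceiling division

print(f"{pages_per_sec:.0f} pages/sec, {bandwidth_mb_per_sec:.0f} MB/s, "
      f"{raw_storage_tb:.0f} TB raw, {compressed_tb:.0f} TB gz, "
      f"{fetchers:.0f} fetcher machines")
```

Changing `CYCLE_SECONDS` to `86_400` reproduces the more aggressive 1-day numbers from the table above.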
These estimates form the skeleton of your design. Every subsequent architectural decision — how many fetcher workers, how to partition the URL frontier, whether to use Bloom filters or a database for deduplication — follows from these numbers.
The URL frontier is the heart of a web crawler. It is the data structure that holds all URLs that have been discovered but not yet fetched. The frontier determines what the crawler visits next, making it the single most important component for crawl efficiency, politeness, and coverage. A naive frontier (a simple FIFO queue) works for toy crawlers but fails catastrophically at scale.
Seed URLs bootstrap the crawl. For a general-purpose search engine crawler, seeds typically include the homepages of major websites (top 10,000 by traffic), URLs submitted by webmasters through tools like Google Search Console, and URLs discovered from sitemaps. For a focused crawler (e.g., crawling only news sites), seeds are a curated list of target domains.
Seed Selection Strategy
Use the DMOZ/Curlie directory, Common Crawl URL lists, or Alexa/Tranco top sites as seeds. Good seeds have two properties: (1) high PageRank, meaning they link to many other important pages, and (2) broad coverage, spanning many domains and topics. Starting with 50,000-100,000 diverse seed URLs is typical for a general web crawl.
A production URL frontier has two layers: a front queue system that handles prioritization, and a back queue system that handles politeness. This two-layer design, described in the Mercator crawler paper (Heydon and Najork, 1999), remains the standard architecture used by modern crawlers.
graph TB
subgraph "URL Frontier (Mercator Design)"
direction TB
NEW["Newly Discovered URLs"] --> NORM["URL Normalizer"]
NORM --> DEDUP["URL Dedup Filter<br/>(Bloom Filter)"]
DEDUP -->|"New URL"| PRIO["Priority Assigner<br/>(PageRank, freshness,<br/>domain authority)"]
DEDUP -->|"Already seen"| DROP["Drop"]
subgraph "Front Queues (Priority)"
PRIO --> F1["Queue 1<br/>HIGH priority"]
PRIO --> F2["Queue 2<br/>MEDIUM priority"]
PRIO --> F3["Queue 3<br/>LOW priority"]
end
subgraph "Back Queues (Politeness)"
F1 --> ROUTER["Back-Queue Router<br/>(hash domain → queue)"]
F2 --> ROUTER
F3 --> ROUTER
ROUTER --> B1["Queue: example.com<br/>next_fetch: 10:00:05"]
ROUTER --> B2["Queue: news.org<br/>next_fetch: 10:00:02"]
ROUTER --> B3["Queue: shop.io<br/>next_fetch: 10:00:08"]
end
B1 --> FETCH["Fetcher Workers"]
B2 --> FETCH
B3 --> FETCH
end
style PRIO fill:#3b82f6,stroke:#333,color:#fff
style ROUTER fill:#f59e0b,stroke:#333,color:#fff
style FETCH fill:#10b981,stroke:#333,color:#fff
style DROP fill:#ef4444,stroke:#333,color:#fff

Mercator-style two-layer URL frontier: front queues handle priority, back queues enforce per-domain politeness with next-fetch timestamps.
Front queue (priority) design
Back queue (politeness) design
Back Queue Explosion
With 100 million domains, maintaining one in-memory queue per domain is infeasible (it would require terabytes of RAM). Production crawlers use a hybrid approach: keep back queues for the top 1 million domains in memory, and use disk-backed queues (RocksDB, LevelDB) for the long tail. The BerkeleyDB-backed frontier used by Apache Nutch is a classic example of this pattern.
URL Normalization Before Enqueueing
Before adding a URL to the frontier, normalize it: lowercase the scheme and host, remove default port numbers (80 for HTTP, 443 for HTTPS), resolve relative paths (/../), remove fragment identifiers (#section), sort query parameters alphabetically, and decode unnecessary percent-encoding. Without normalization, http://Example.com:80/path and http://example.com/path are treated as different URLs, wasting crawl budget on duplicates.
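A minimal sketch of the normalization steps just described, built on Python's standard `urllib.parse`. It covers case-folding, default-port removal, relative-path resolution, query sorting, and fragment removal; percent-encoding normalization and tracking-parameter stripping are deliberately omitted to keep it short.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop the default port (80 for http, 443 for https).
    port = parts.port
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"
    # Resolve relative segments (/a/../b -> /b) and collapse // -> /.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"  # normpath strips trailing slashes; restore them
    # Sort query parameters so ordering does not create duplicate URLs.
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Drop the fragment entirely -- it never reaches the server.
    return urlunsplit((scheme, netloc, path, query, ""))
```

With this, `http://Example.com:80/path` and `http://example.com/path` normalize to the same string and the dedup filter sees them as one URL.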
The fetch cycle is where the crawler actually retrieves content from the web. Each cycle consists of three stages: DNS resolution to find the server IP address, HTTP request to download the page, and HTML parsing to extract content and new URLs. Optimizing each stage is critical because the fetch cycle runs billions of times.
Anatomy of a single page fetch
1. Pop the next URL from the frontier (from the back queue with the earliest next_fetch timestamp).
2. Check if we have a cached robots.txt for this domain. If not, fetch robots.txt first and cache it (TTL of 24 hours). If the target path is disallowed, skip the URL.
3. Resolve the hostname to an IP address via DNS. Check the local DNS cache first; if miss, query the crawler-local caching DNS resolver.
4. Open an HTTP connection (or reuse one from the connection pool for this domain). Send a GET request with appropriate headers: User-Agent (identifying the crawler), Accept-Encoding (gzip), If-Modified-Since (for recrawls).
5. Read the response. If 200 OK, proceed. If 301/302 redirect, extract the Location header and add the redirect target to the frontier (up to 5 redirects max). If 304 Not Modified, mark the page as unchanged. If 4xx/5xx, log the error and apply a back-off policy.
6. Decompress the response body (gzip/brotli). Parse HTML to extract text content, metadata (title, description, canonical URL), and all hyperlinks (<a href>, <link>, <frame src>).
7. Normalize extracted URLs and feed them back to the URL frontier through the deduplication filter.
8. Store the raw HTML and extracted metadata in the content store for downstream indexing.
DNS Pre-fetching
DNS resolution adds 50-200ms of latency per new domain. Production crawlers run a dedicated DNS pre-fetch thread that scans URLs in the frontier and resolves their hostnames before the fetcher needs them. Results are cached in a local DNS cache (TTL 30 minutes to 24 hours). Running your own caching DNS resolver (BIND, Unbound, or CoreDNS) avoids rate-limiting from public resolvers and reduces latency to under 1ms for cached entries.
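A minimal TTL cache in front of a resolver illustrates the caching half of this tip. The resolver is injected as a plain callable (in production it would wrap your local caching resolver; `socket.gethostbyname` would work for a toy crawler), and the clock is injectable so the cache can be tested without waiting.

```python
import time

class DnsCache:
    """Minimal TTL cache in front of a resolver callable (hostname -> IP)."""

    def __init__(self, resolver, ttl_seconds=1800, clock=time.monotonic):
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # hostname -> (ip, expires_at)

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        hit = self._entries.get(hostname)
        if hit and hit[1] > now:
            return hit[0]                    # cache hit: no network round trip
        ip = self._resolver(hostname)        # cache miss: ask the real resolver
        self._entries[hostname] = (ip, now + self._ttl)
        return ip
```

A separate pre-fetch thread would simply call `resolve()` for hostnames of URLs near the head of the frontier, so the fetcher's later call is a guaranteed cache hit.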
| Optimization | What it does | Impact |
|---|---|---|
| Connection pooling | Reuse TCP connections across multiple requests to the same domain | Eliminates 50-100ms TCP handshake + TLS setup per request. Keep-alive pools of 2-4 connections per domain. |
| Async I/O | Use epoll/kqueue to handle thousands of concurrent connections per thread | A single fetcher thread can manage 1,000+ concurrent HTTP requests without thread overhead. |
| DNS pre-fetching | Resolve hostnames before the fetcher needs them | Eliminates DNS latency from the critical path. Local cache hit rate should be above 95%. |
| Conditional GET | Send If-Modified-Since or If-None-Match headers during recrawls | Server returns 304 Not Modified with no body, saving bandwidth. Effective for 30-50% of recrawls. |
| gzip/brotli | Request compressed content via Accept-Encoding header | Reduces transferred bytes by 60-80%. Nearly all web servers support gzip. |
| Timeout policies | Set connect timeout (5s), read timeout (30s), and total timeout (60s) | Prevents slow or unresponsive servers from blocking fetcher threads indefinitely. |
HTML parsing is the next major bottleneck after network I/O. A production crawler must handle malformed HTML gracefully because the real web is full of broken markup — unclosed tags, invalid nesting, mixed encodings, and embedded scripts. Libraries like jsoup (Java), BeautifulSoup/lxml (Python), and Go's x/net/html package are designed for this forgiving parsing. Link extraction must handle both absolute and relative URLs, resolve base tags, handle meta refresh redirects, and recognize link-like patterns in JavaScript strings.
JavaScript-Rendered Content
An increasing percentage of the web (estimated 10-30%) renders content using JavaScript frameworks (React, Angular, Vue). A simple HTML parser sees an empty div and a script tag. To crawl these pages, you need a headless browser (Puppeteer, Playwright) to execute JavaScript and capture the rendered DOM. This is 10-100x more expensive than simple HTML parsing. Google runs a two-pass system: a first pass fetches raw HTML, and a second "rendering" pass uses a headless Chrome fleet for pages that need JavaScript execution.
robots.txt Caching Strategy
Fetch robots.txt once per domain per day and cache it. If robots.txt returns a 5xx error, the conservative choice — and what RFC 9309 specifies for an unreachable robots.txt — is to treat the whole site as disallowed and retry within an hour; treating server errors as "allow all" risks crawling pages the webmaster meant to block. If it returns 404, there are no restrictions for that domain. Always respect Crawl-delay directives — they are a strong signal that the webmaster wants you to slow down. Cache robots.txt in a fast key-value store (Redis, Memcached) keyed by domain.
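Python's standard library already includes a robots.txt parser, `urllib.robotparser`, which handles User-agent matching, Allow/Disallow precedence, and Crawl-delay. A sketch of using it against a cached robots.txt body (the file contents here are a made-up example):

```python
from urllib.robotparser import RobotFileParser

# In production this string would come from the Redis/Memcached cache,
# keyed by domain; here it is inlined for illustration.
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "https://example.com/admin/users"))
print(rp.can_fetch("MyCrawler", "https://example.com/products"))
print(rp.crawl_delay("MyCrawler"))
```

The `crawl_delay()` value feeds directly into the back-queue `next_allowed_fetch` calculation described later in the politeness section.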
Ignoring Redirect Chains
Some URLs redirect multiple times (HTTP to HTTPS, www to non-www, short URL to final page). Follow redirects up to a maximum depth (5 is standard), but add the final destination URL to the frontier, not the intermediate redirects. Also detect redirect loops (A -> B -> A) and break them. Failing to handle redirects correctly results in duplicate content, wasted bandwidth, and missing pages.
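The redirect rules above — cap the hop count, detect loops, and enqueue only the final destination — fit in a few lines. `get_location` here is a hypothetical stand-in for an HTTP request that returns the Location header for a 3xx response and None for a terminal response:

```python
def resolve_redirects(start_url, get_location, max_hops=5):
    """Follow a redirect chain and return the final URL.

    Returns None if the chain loops (A -> B -> A) or exceeds max_hops;
    callers should enqueue only the returned URL, never the hops.
    """
    seen = {start_url}
    url = start_url
    for _ in range(max_hops):
        nxt = get_location(url)
        if nxt is None:
            return url        # terminal response: this is the canonical target
        if nxt in seen:
            return None       # redirect loop detected: abandon the URL
        seen.add(nxt)
        url = nxt
    return None               # too many hops: likely misconfigured or a trap
```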
The web is full of duplicate content. The same page may be reachable through multiple URLs (with and without www, with different query parameter orderings, with tracking parameters appended). Different URLs may serve identical or near-identical content (printer-friendly versions, paginated views, A/B test variants). Without robust deduplication, a crawler wastes bandwidth, storage, and crawl budget re-fetching content it already has.
Deduplication happens at two levels: URL-level (have we already seen this exact URL?) and content-level (does this page contain the same content as another page we have already fetched?). Both are essential, and they use different techniques.
URL-level deduplication
graph LR
subgraph "URL Dedup Pipeline"
URL["Extracted URL"] --> NORM2["Normalize<br/>(lowercase, strip params,<br/>remove fragment)"]
NORM2 --> HASH["Compute Hash<br/>(MurmurHash3 128-bit)"]
HASH --> BF{"Bloom Filter<br/>Check"}
BF -->|"Probably new"| ADD["Add to Frontier<br/>+ Insert into Bloom Filter"]
BF -->|"Definitely seen"| SKIP["Skip URL"]
end
subgraph "Content Dedup Pipeline"
PAGE["Fetched Page HTML"] --> EXTRACT["Extract Text<br/>(strip tags, scripts, styles)"]
EXTRACT --> SH["Compute SimHash<br/>(64-bit fingerprint)"]
SH --> COMPARE{"Hamming Distance<br/>vs Existing<br/>Fingerprints"}
COMPARE -->|"Distance > 3"| UNIQUE["Unique Content<br/>→ Store"]
COMPARE -->|"Distance ≤ 3"| NEARDUP["Near-Duplicate<br/>→ Mark + Skip"]
end
style BF fill:#3b82f6,stroke:#333,color:#fff
style COMPARE fill:#f59e0b,stroke:#333,color:#fff
style SKIP fill:#ef4444,stroke:#333,color:#fff
style NEARDUP fill:#ef4444,stroke:#333,color:#fff
style UNIQUE fill:#10b981,stroke:#333,color:#fff

Two-stage deduplication: URL-level uses Bloom filters to avoid refetching known URLs; content-level uses SimHash to detect near-duplicate page content.
Bloom Filter Sizing Formula
A Bloom filter for n items with false positive rate p requires m = -(n * ln(p)) / (ln(2))^2 bits and k = (m/n) * ln(2) hash functions. For 10 billion URLs at 1% FP rate: m = ~96 billion bits = ~12 GB, k = 7 hash functions. At 0.1% FP rate: m = ~144 billion bits = ~18 GB, k = 10. These sizes fit comfortably in server RAM.
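The sizing formula above translates directly into code, which is handy in an interview for sanity-checking your memory math:

```python
import math

def bloom_sizing(n: int, p: float):
    """Return (bits, hash_functions) for n items at false-positive rate p.

    m = -(n * ln p) / (ln 2)^2,  k = (m / n) * ln 2
    """
    m_bits = -(n * math.log(p)) / (math.log(2) ** 2)
    k = (m_bits / n) * math.log(2)
    return math.ceil(m_bits), round(k)

for p in (0.01, 0.001):
    m, k = bloom_sizing(10_000_000_000, p)
    print(f"p={p}: {m / 8 / 1e9:.1f} GB, k={k} hash functions")
```

For 10 billion URLs this reproduces the figures quoted above: roughly 12 GB with 7 hash functions at a 1% false-positive rate, and roughly 18 GB with 10 hash functions at 0.1%.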
Content-level deduplication detects when different URLs serve the same or nearly identical content. This is surprisingly common: product pages with different sorting orders, forum threads with different pagination, and syndicated articles appearing on multiple sites. Two key algorithms dominate this space: SimHash for near-duplicate detection and MinHash for document similarity estimation.
| Algorithm | How it works | Output | Use case |
|---|---|---|---|
| Exact hash (MD5/SHA-256) | Hash the entire page content | 128/256-bit digest | Exact duplicate detection. Any change (even a single byte) produces a different hash. |
| SimHash (Charikar, 2002) | Compute weighted hash of content tokens, producing a fingerprint where similar documents have similar fingerprints | 64-bit fingerprint | Near-duplicate detection. Two documents with Hamming distance <= 3 (out of 64 bits) are considered near-duplicates. |
| MinHash (Broder, 1997) | Apply k hash functions to the set of shingles (n-grams) in each document; keep the minimum hash value for each function | k minimum hash values (signature) | Jaccard similarity estimation. Two documents with 80%+ matching MinHash values have high content overlap. |
| SimHash + Hamming lookup | Index SimHash fingerprints in a table partitioned by bit blocks; query by checking all permutations within Hamming distance d | Boolean: near-duplicate or not | Scalable near-duplicate detection. Can check a fingerprint against billions of stored fingerprints in milliseconds. |
SimHash in Practice
SimHash works by treating a document as a bag of weighted features (typically word bigrams or trigrams). For each feature, compute a hash, then for each bit position, add the weight if the bit is 1 and subtract it if the bit is 0. The final fingerprint is constructed by setting each bit to 1 if the sum for that position is positive. Two documents differing by a small amount of content will differ in only a few bit positions. Google uses SimHash internally to detect near-duplicate web pages at scale.
Shingling for Content Fingerprinting
Before computing SimHash or MinHash, convert the page into a set of shingles — overlapping n-grams of words. For example, "the quick brown fox" with 3-word shingles becomes {"the quick brown", "quick brown fox"}. Shingle size of 5-8 words works well for web pages. Using word-level shingles (rather than character-level) is more robust to minor formatting changes while still detecting meaningful content similarity.
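The two tips above combine into a compact SimHash sketch: shingle the text into word n-grams, weight each shingle by frequency, vote per bit position, and compare fingerprints by Hamming distance. This toy version uses MD5 for the per-shingle hash purely because it is in the standard library; production systems would use a faster non-cryptographic hash.

```python
import hashlib
from collections import Counter

def shingles(text: str, k: int = 3):
    """Overlapping k-word shingles: 'the quick brown fox' -> 2 trigrams."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def simhash(text: str, bits: int = 64) -> int:
    sums = [0] * bits
    for shingle, weight in Counter(shingles(text)).items():
        # 64-bit hash of the shingle (first 8 bytes of MD5, for illustration).
        h = int.from_bytes(hashlib.md5(shingle.encode()).digest()[:8], "big")
        for i in range(bits):
            # Vote: add the weight where the bit is 1, subtract where it is 0.
            sums[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i in range(bits):
        if sums[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")
```

Two fetched pages would be flagged as near-duplicates when `hamming(simhash(a), simhash(b)) <= 3`, matching the threshold in the table above.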
A single machine cannot crawl the web at scale. With targets of thousands of pages per second and a frontier containing billions of URLs, the crawler must be distributed across many machines. The key challenge is partitioning the work so that each machine operates independently as much as possible, while still coordinating enough to avoid duplicate work and maintain global politeness.
The standard distributed crawler architecture assigns each worker a partition of the URL space. The most common partitioning strategy is by domain: all URLs belonging to a given domain are assigned to the same worker. This has a critical advantage — per-domain politeness (rate limiting, robots.txt caching) is handled locally within each worker without any cross-worker coordination.
graph TB
subgraph "Distributed Crawler Architecture"
SEEDS["Seed URLs"] --> DIST["URL Distributor<br/>(Consistent Hash<br/>by domain)"]
subgraph "Worker Node 1"
DIST -->|"domains A-F"| F1["URL Frontier 1"]
F1 --> FET1["Fetcher Pool<br/>(50 async connections)"]
FET1 --> PARSE1["Parser"]
PARSE1 -->|"New URLs"| DEDUP1["Dedup Filter"]
end
subgraph "Worker Node 2"
DIST -->|"domains G-N"| F2["URL Frontier 2"]
F2 --> FET2["Fetcher Pool<br/>(50 async connections)"]
FET2 --> PARSE2["Parser"]
PARSE2 -->|"New URLs"| DEDUP2["Dedup Filter"]
end
subgraph "Worker Node 3"
DIST -->|"domains O-Z"| F3["URL Frontier 3"]
F3 --> FET3["Fetcher Pool<br/>(50 async connections)"]
FET3 --> PARSE3["Parser"]
PARSE3 -->|"New URLs"| DEDUP3["Dedup Filter"]
end
DEDUP1 -->|"Route by domain"| DIST
DEDUP2 -->|"Route by domain"| DIST
DEDUP3 -->|"Route by domain"| DIST
FET1 --> STORE["Content Store<br/>(S3 / HDFS)"]
FET2 --> STORE
FET3 --> STORE
PARSE1 --> META["Metadata DB<br/>(URL, timestamp,<br/>fingerprint, status)"]
PARSE2 --> META
PARSE3 --> META
end
style DIST fill:#3b82f6,stroke:#333,color:#fff
style STORE fill:#10b981,stroke:#333,color:#fff
style META fill:#f59e0b,stroke:#333,color:#fff

Distributed crawler with domain-based URL partitioning. Each worker owns a slice of domains and handles all fetching, parsing, and politeness for those domains independently.
URL partitioning strategies
Consistent Hashing for Elastic Scaling
When a worker node fails or new capacity is added, consistent hashing with virtual nodes (100-200 virtual nodes per physical node) ensures that only a small fraction of domains are reassigned. The frontier data for reassigned domains must be transferred to the new owner. Use a message queue (Kafka, RabbitMQ) as the inter-worker URL exchange: when a parser on Worker 1 discovers a URL belonging to Worker 2 domain partition, it publishes the URL to Worker 2 input queue.
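A minimal consistent-hash ring with virtual nodes shows the routing side of this design. Worker names and the vnode count are illustrative; the key property is that removing a worker leaves every other domain's assignment untouched.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Domain -> worker assignment with virtual nodes, so adding or removing
    a worker remaps only the domains that worker owned."""

    def __init__(self, workers, vnodes=150):
        ring = []
        for w in workers:
            for v in range(vnodes):
                ring.append((self._point(f"{w}#{v}"), w))
        ring.sort()
        self._ring = ring
        self._points = [p for p, _ in ring]

    @staticmethod
    def _point(key: str) -> int:
        # 64-bit position on the ring (first 8 bytes of MD5, for illustration).
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def owner(self, domain: str) -> str:
        # First virtual node clockwise from the domain's point (with wraparound).
        i = bisect.bisect(self._points, self._point(domain)) % len(self._ring)
        return self._ring[i][1]
```

The URL Distributor in the diagram above would call `owner(domain)` for every discovered URL and publish it to that worker's input queue.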
The Hot Domain Problem
Hash-by-domain creates load imbalance because domain sizes follow a power law. Wikipedia has over 60 million pages, while most domains have fewer than 100 pages. Solutions: (1) split hot domains across multiple workers using sub-domain or path-prefix partitioning, (2) use work-stealing, where idle workers pull URLs from overloaded workers' queues, or (3) maintain a separate "large domain" pool with dedicated workers.
Inter-worker communication is the other key design decision. When a parser on Worker 1 extracts a URL belonging to a domain owned by Worker 2, it must route that URL to Worker 2. The two common approaches are: (1) a shared message queue (Kafka or RabbitMQ) where each worker consumes from a topic partition mapped to its domain set, or (2) direct RPC where workers send URLs to the owning worker via gRPC. The message queue approach is preferred because it decouples workers and handles backpressure naturally.
Content Storage Architecture
Store raw HTML in an object store (S3, GCS, HDFS) keyed by a content hash (SHA-256 of the page body). This provides built-in deduplication at the storage layer — identical pages from different URLs map to the same storage object. Maintain a metadata database (PostgreSQL, Cassandra) that maps URL to content hash, fetch timestamp, HTTP status, response headers, and extracted metadata. The metadata DB is queried frequently for scheduling decisions; the content store is accessed only during indexing.
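The content-addressing scheme is simple enough to sketch in a few lines. The key prefix and record fields here are illustrative, not a fixed schema:

```python
import hashlib

def content_key(html_body: bytes) -> str:
    """Object-store key derived from the page body: identical pages fetched
    from different URLs map to the same object, deduplicating storage."""
    return "content/" + hashlib.sha256(html_body).hexdigest()

def metadata_record(url: str, html_body: bytes, status: int, fetched_at: float) -> dict:
    # One row per URL in the metadata DB; many URLs may share one content_key.
    return {"url": url, "content_key": content_key(html_body),
            "status": status, "fetched_at": fetched_at}
```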
Politeness is not optional — it is a fundamental requirement of any responsible web crawler. An impolite crawler can overwhelm small web servers, trigger DDoS protection systems, get its IP addresses blocked, and potentially violate computer fraud and abuse laws. Every production crawler must implement strict politeness policies at multiple levels.
The robots.txt standard (formally the Robots Exclusion Protocol, standardized as RFC 9309 in 2022) is the primary mechanism by which webmasters communicate their crawling preferences. Every crawler must fetch, parse, and obey robots.txt before crawling any page on a domain.
| robots.txt Directive | Meaning | Example |
|---|---|---|
| User-agent | Specifies which crawler the following rules apply to | User-agent: Googlebot (rules for Google) or User-agent: * (rules for all) |
| Disallow | Paths the crawler must not visit | Disallow: /admin/ prevents crawling anything under /admin/ |
| Allow | Overrides a broader Disallow for specific sub-paths | Allow: /admin/public-page.html permits this page even if /admin/ is disallowed |
| Crawl-delay | Minimum seconds between consecutive requests to this domain | Crawl-delay: 10 means wait at least 10 seconds between requests |
| Sitemap | Location of XML sitemaps for the site | Sitemap: https://example.com/sitemap.xml helps the crawler discover all pages |
robots.txt is Advisory, Not Enforceable
robots.txt is a convention, not a technical barrier. Crawlers can technically ignore it. However, respecting robots.txt is both ethically required and practically essential. Sites that detect crawler policy violations will block your IP addresses, complain to your hosting provider, or pursue legal action. The long-running hiQ Labs v. LinkedIn litigation — in which LinkedIn ultimately prevailed, largely on breach-of-contract grounds — shows that scraping disputes do end up in court, even when the scraped pages are publicly accessible.
Politeness policies beyond robots.txt
Implementing Per-Domain Rate Limiting
The back-queue design from the frontier section naturally enforces per-domain politeness. Each back queue has a next_allowed_fetch timestamp. After fetching from a domain, set next_allowed_fetch = now() + max(default_delay, robots_crawl_delay). The min-heap scheduler automatically skips domains that are still in their delay window. This is O(log N) per fetch operation where N is the number of active domains — efficient even at millions of domains.
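A sketch of that min-heap scheduler, with an injectable clock so the delay logic is testable without real waiting. The class and method names are illustrative:

```python
import heapq

class PolitenessScheduler:
    """Min-heap of (next_allowed_fetch, domain); O(log N) per operation."""

    def __init__(self, default_delay=1.0):
        self._heap = []
        self._delay = {}          # domain -> effective per-domain delay (sec)
        self._default = default_delay

    def add_domain(self, domain, ready_at=0.0, crawl_delay=None):
        # Effective delay = max(default_delay, robots Crawl-delay).
        self._delay[domain] = max(self._default, crawl_delay or 0.0)
        heapq.heappush(self._heap, (ready_at, domain))

    def next_domain(self, now):
        """Return a domain whose delay window has passed, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

    def mark_fetched(self, domain, now):
        # Re-enqueue with next_allowed_fetch = now + effective delay.
        heapq.heappush(self._heap, (now + self._delay[domain], domain))
```

A fetcher loop calls `next_domain(now)`, fetches one URL from that domain's back queue, then calls `mark_fetched` — domains still inside their delay window are simply never returned.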
Legal considerations vary by jurisdiction. In the United States, the Computer Fraud and Abuse Act (CFAA) prohibits unauthorized access to computer systems. Courts have generally held that crawling publicly accessible pages is not a CFAA violation, but crawling pages behind login walls or ignoring explicit robots.txt restrictions may be. In the European Union, the GDPR imposes restrictions on collecting personal data from web pages, and the EU Database Directive protects databases, which can include structured website content.
Ethical Crawling Checklist
Always identify your crawler via the User-Agent header. Respect robots.txt fully. Honor Crawl-delay directives. Provide a webpage explaining your crawler with a contact email for exclusion requests. Do not crawl content behind login walls. Do not crawl at rates that degrade site performance. Implement exponential back-off for errors. Store and respect per-domain crawl preferences. Respond promptly to webmaster complaints.
Crawling During Site Outages
If a site returns 503 Service Unavailable, it means the server is temporarily overloaded. Continuing to crawl worsens the problem. A polite crawler detects repeated 5xx responses and puts the domain into a cool-down period (exponential back-off starting at 1 hour). Never retry failed requests immediately — this is the equivalent of repeatedly ringing a doorbell when nobody is answering.
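The cool-down policy reduces to one function. The doubling schedule starting at one hour comes from the tip above; the one-week cap is an assumed ceiling, not from the article:

```python
def cooldown_seconds(consecutive_5xx: int, base=3600.0, cap=7 * 86400.0) -> float:
    """Exponential back-off for a failing domain: 1h after the first
    5xx streak, doubling each time, capped (here, at one week)."""
    return min(base * 2 ** (consecutive_5xx - 1), cap)
```

On each 5xx response, increment the domain's failure counter and push its `next_allowed_fetch` out by `cooldown_seconds(counter)`; reset the counter on the first successful fetch.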
The web is not a friendly graph. It contains intentional and unintentional structures designed to trap crawlers into infinite loops, consuming resources while providing no useful content. A production crawler must detect and escape these traps to protect its crawl budget.
Spider traps are the most common adversarial pattern. A spider trap is a set of URLs that generates an effectively infinite number of pages. The most notorious example is a dynamic calendar that generates pages for every day into the infinite future: /calendar/2024/01/01, /calendar/2024/01/02, ..., /calendar/9999/12/31. Each page links to the next day, creating an endless chain that a naive BFS crawler will follow forever.
Common spider trap patterns
| Defense | How it works | Limitations |
|---|---|---|
| Max URL depth | Refuse to crawl URLs that are more than N clicks away from any seed URL | Simple and effective. Typical limit: 15-20 hops. May miss deeply nested legitimate content. |
| Max URL length | Skip URLs exceeding a length threshold (e.g., 2048 characters) | Trap URLs tend to be very long due to repeated path segments or many parameters. |
| Max pages per domain | Limit the number of pages crawled from any single domain per crawl cycle | Prevents any one domain from consuming the entire crawl budget. Caps can be scaled by domain authority. |
| URL pattern detection | Detect repeating path patterns: /a/b/a/b/a/b suggests a loop | Use regex or suffix tree to detect repeating segments. Catches most path-based traps. |
| Content fingerprinting | If consecutive pages from the same domain have identical or near-identical SimHash fingerprints, stop crawling that path | Catches traps that generate slightly different URLs but identical content. |
| Robots.txt nofollow | Trap URLs are often excluded via robots.txt by well-meaning sites | Only works if the site operator has configured robots.txt correctly. |
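The URL pattern defense from the table can be sketched as a brute-force scan for repeated runs of path segments. This is a minimal illustration (a production crawler might use a suffix tree instead); the function name and thresholds are assumptions:

```python
from urllib.parse import urlparse

def has_repeating_path(url, max_repeats=2):
    """Flag URLs whose path repeats the same run of segments more than
    max_repeats consecutive times, e.g. /a/b/a/b/a/b."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    n = len(segments)
    # Try every candidate run length; a run repeated more than
    # max_repeats times in a row suggests a path loop.
    for run in range(1, n // (max_repeats + 1) + 1):
        for start in range(n - run * (max_repeats + 1) + 1):
            window = segments[start:start + run]
            repeats = 1
            pos = start + run
            while segments[pos:pos + run] == window:
                repeats += 1
                pos += run
            if repeats > max_repeats:
                return True
    return False
```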
URL Normalization Defenses
Apply aggressive URL normalization before deduplication: remove known tracking parameters (utm_*, fbclid, gclid), remove session IDs (JSESSIONID, PHPSESSID), sort query parameters, remove empty parameters, and collapse repeated path separators (/a//b -> /a/b). This collapses many trap variations into a single canonical URL that the dedup filter catches. Maintain a configurable list of known tracking parameters and update it regularly.
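These normalization rules compose naturally into a single canonicalization function. The sketch below covers the steps named above; the parameter denylist is a small illustrative subset of what a real crawler would maintain:

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed, configurable denylist of tracking/session parameters.
TRACKING_PARAMS = {"fbclid", "gclid", "jsessionid", "phpsessid"}
TRACKING_PREFIXES = ("utm_",)

def normalize_url(url):
    """Canonicalize a URL so trap variants collapse to one dedup entry."""
    scheme, netloc, path, query, _frag = urlsplit(url)
    # Collapse repeated path separators: /a//b -> /a/b
    path = re.sub(r"/{2,}", "/", path)
    # Drop tracking/session parameters and empty values, sort the rest.
    kept = sorted(
        (k, v) for k, v in parse_qsl(query, keep_blank_values=True)
        if v and k.lower() not in TRACKING_PARAMS
        and not k.lower().startswith(TRACKING_PREFIXES)
    )
    return urlunsplit((scheme.lower(), netloc.lower(), path, urlencode(kept), ""))
```

For example, `HTTP://Example.com/a//b?utm_source=x&b=2&a=1` normalizes to `http://example.com/a/b?a=1&b=2`, so its tracking-parameter variants all hit the same Bloom filter entry.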
Max Depth is Not Enough
Max depth prevents the simplest traps but misses sophisticated ones. A trap at depth 5 that generates 1 million pages per level will still consume a million page fetches before hitting any depth limit. Combine depth limits with per-domain page caps, URL pattern detection, and content fingerprinting for defense in depth. No single technique catches all traps.
Loop detection requires tracking the crawl path. Maintain a per-domain visit history and detect when the crawler is revisiting the same content (as measured by content fingerprints) through different URLs. If more than 10 consecutive fetches from the same domain produce near-identical SimHash fingerprints, flag the domain as a potential trap and stop crawling beyond a conservative page limit.
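The near-identical-fingerprint check boils down to a Hamming distance on 64-bit SimHash values plus a per-domain streak counter. This is a minimal sketch; the class name and the distance threshold of 3 bits are assumptions:

```python
def hamming(a, b):
    """Hamming distance between two SimHash fingerprints (as integers)."""
    return bin(a ^ b).count("1")

class TrapDetector:
    """Flags a domain once `threshold` consecutive fetches return
    near-identical fingerprints (distance <= max_distance)."""

    def __init__(self, threshold=10, max_distance=3):
        self.threshold = threshold
        self.max_distance = max_distance
        self.last_fp = {}  # domain -> last fingerprint seen
        self.streak = {}   # domain -> consecutive near-duplicate count

    def observe(self, domain, fingerprint):
        prev = self.last_fp.get(domain)
        if prev is not None and hamming(prev, fingerprint) <= self.max_distance:
            self.streak[domain] = self.streak.get(domain, 0) + 1
        else:
            self.streak[domain] = 0  # genuinely new content resets the streak
        self.last_fp[domain] = fingerprint
        return self.streak[domain] >= self.threshold  # True -> likely trap
```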
The OPIC Algorithm for Crawl Budget Allocation
OPIC (Online Page Importance Computation) is an algorithm that estimates PageRank-like importance scores during the crawl itself. Pages that receive many inbound links accumulate more "cash," and the crawler prioritizes pages with higher cash values. This naturally deprioritizes trap pages because they receive links only from other trap pages, not from authoritative sources. OPIC is used as a crawl scheduler signal by several open-source crawlers including Apache Nutch.
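The "cash" mechanic of OPIC can be illustrated in a few lines: crawling a page banks its cash into a history total (the importance estimate) and splits it evenly among outlinks. This is a simplified sketch of one step, not the full algorithm (which also handles dangling pages and normalization):

```python
def opic_step(cash, page, outlinks, history):
    """One OPIC step: when `page` is crawled, bank its cash into its
    history (importance estimate) and split it evenly among outlinks."""
    amount = cash.get(page, 0.0)
    history[page] = history.get(page, 0.0) + amount
    cash[page] = 0.0
    if outlinks:
        share = amount / len(outlinks)
        for link in outlinks:
            cash[link] = cash.get(link, 0.0) + share
    return cash, history
```

The scheduler then always picks the uncrawled page with the most accumulated cash, so trap pages, fed only by other trap pages, starve.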
A crawler that only visits each page once produces a stale index. The web is constantly changing: news sites update every minute, e-commerce prices change daily, and even static blogs get occasional edits. Recrawl scheduling determines how often to revisit previously crawled pages to maintain freshness — a critical quality metric for search engines.
The core challenge is that crawl budget is finite. If you can fetch 10,000 pages per second, you must allocate those fetches between discovering new pages and recrawling known pages. Spending too much budget on recrawling means slow discovery of new content; spending too little means your index becomes stale. The optimal allocation depends on page change rates, which vary enormously across the web.
| Page category | Typical change frequency | Recommended recrawl interval | Examples |
|---|---|---|---|
| Breaking news | Minutes | 5-15 minutes | CNN.com homepage, Reuters breaking news feeds, Twitter trending topics |
| News articles | Hours | 1-6 hours | Individual news articles (edits, corrections, comment counts) |
| E-commerce | Daily | 12-24 hours | Product prices, stock availability, reviews |
| Blogs and forums | Weekly | 3-7 days | Blog posts, forum threads (new replies) |
| Static reference | Monthly or less | 14-30 days | Wikipedia articles, documentation, government sites |
| Archived content | Rarely or never | 30-90 days | Historical pages, archived news, old forum posts |
Adaptive recrawl scheduling uses historical change data to set per-page recrawl intervals. The idea is simple: if a page changed the last 5 times you checked, check it more often; if it has not changed in the last 10 checks, check it less often. This is a bandit-style optimization problem.
Adaptive recrawl algorithm
1. After each fetch, compare the content fingerprint (SimHash) to the previous version.
2. If the content changed, decrease the recrawl interval by a factor (e.g., multiply by 0.7). Minimum interval: 5 minutes.
3. If the content did not change, increase the recrawl interval by a factor (e.g., multiply by 1.5). Maximum interval: 30 days.
4. Weight the priority by page importance (PageRank, domain authority, traffic estimates). A high-importance page that changes daily gets priority over a low-importance page that also changes daily.
5. Maintain a recrawl priority queue sorted by next_scheduled_recrawl = last_fetch_time + current_interval.
6. On each scheduling cycle, pop pages from the recrawl queue and inject them into the frontier at elevated priority.
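The interval-update steps above reduce to a small clamped multiplication. A minimal sketch, using the example factors and bounds from the algorithm (function names are assumptions):

```python
MIN_INTERVAL = 5 * 60           # 5 minutes, in seconds
MAX_INTERVAL = 30 * 24 * 3600   # 30 days, in seconds

def update_interval(current, changed, shrink=0.7, grow=1.5):
    """Shrink the recrawl interval when content changed, grow it when
    unchanged, clamped to the [5 min, 30 day] bounds."""
    factor = shrink if changed else grow
    return min(MAX_INTERVAL, max(MIN_INTERVAL, current * factor))

def next_recrawl(last_fetch_time, interval):
    """Sort key for the recrawl priority queue."""
    return last_fetch_time + interval
```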
Using HTTP Conditional Requests for Efficient Recrawling
On recrawls, send If-Modified-Since (with the previous Last-Modified date) or If-None-Match (with the previous ETag). If the server responds 304 Not Modified, skip downloading the body entirely — the page has not changed. This saves bandwidth (no body transfer) and processing time (no parsing needed). In practice, 30-50% of recrawl requests receive 304 responses, cutting recrawl bandwidth nearly in half.
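In code, this amounts to building the revalidation headers from values saved on the previous fetch and short-circuiting on a 304 status. A minimal sketch with assumed helper names (the actual HTTP client is omitted):

```python
def conditional_headers(prev_etag=None, prev_last_modified=None):
    """Build revalidation headers from the previous fetch's metadata."""
    headers = {}
    if prev_etag:
        headers["If-None-Match"] = prev_etag
    if prev_last_modified:
        headers["If-Modified-Since"] = prev_last_modified
    return headers

def should_reparse(status_code):
    """304 Not Modified means the stored copy is still valid:
    skip the body download and parsing entirely."""
    return status_code != 304
```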
Sitemap.xml as a Recrawl Signal
XML sitemaps (sitemap.xml) include a lastmod timestamp and a changefreq hint for each URL. While these values are not always accurate (many sites set them incorrectly), they provide a useful initial estimate for recrawl scheduling before the crawler has enough history to compute adaptive intervals. Parse sitemaps from every domain and use the lastmod values as the initial recrawl schedule, then refine with observed change data.
Change Detection Beyond Content Hash
Simple content hash comparison misses meaningful changes buried in noise. Ads, timestamps ("Page generated at 10:42 AM"), view counters, and dynamic sidebars change on every fetch without indicating real content change. Robust change detection strips boilerplate (headers, footers, ads, navigation) before fingerprinting, focusing only on the main content area. Libraries like boilerpipe and readability.js extract main content from web pages.
Priority scoring for recrawl combines multiple signals into a single score that determines which pages get recrawl budget. The formula typically looks like: recrawl_priority = page_importance_score * change_probability * (time_since_last_crawl / expected_interval). Pages that are important, likely to have changed, and overdue for a check receive the highest priority.
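The scoring formula translates directly into a pure function, shown here as a sketch mirroring the expression in the text (argument names are assumptions):

```python
def recrawl_priority(importance, change_probability,
                     time_since_last_crawl, expected_interval):
    """Priority = page importance x change probability x overdue ratio."""
    overdue = time_since_last_crawl / expected_interval
    return importance * change_probability * overdue
```

An important, volatile, overdue page (e.g., `recrawl_priority(0.9, 0.8, 7200, 3600)`) scores far above a low-importance static one, which is exactly the ordering the recrawl queue needs.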
The Freshness-Coverage Tradeoff
Every recrawl fetch is a fetch that could have been used to discover a new page. At Google scale, the crawl budget is split roughly 60-70% recrawl and 30-40% new discovery. If you over-index on freshness, your coverage (fraction of the web in your index) shrinks. If you over-index on coverage, your existing pages go stale. There is no universally correct split — it depends on your use case. A news search engine prioritizes freshness; a web archive prioritizes coverage.
Fixed Recrawl Intervals
A common mistake is assigning a fixed recrawl interval to all pages (e.g., "recrawl everything weekly"). This wastes budget on static pages that never change and under-crawls dynamic pages that change hourly. Always use adaptive intervals based on observed change history. Even a simple exponential backoff (double the interval each time a page is unchanged, halve it each time it changes) dramatically improves freshness compared to fixed scheduling.
Web crawler design is a top-tier system design interview question at Google, Meta, Amazon, and Microsoft. It appears directly ("Design a web crawler for a search engine") and indirectly in questions about SEO indexing, content aggregation, or building a search engine. Interviewers use it to test your understanding of graph traversal at scale, distributed system coordination, probabilistic data structures, network I/O optimization, and the ability to make trade-offs between competing constraints (coverage vs. freshness, speed vs. politeness, memory vs. accuracy). Strong candidates structure their answer around the URL frontier design first, then layer on fetching, deduplication, distribution, and recrawl scheduling.
Common questions:
Clarifying questions to ask the interviewer: What is the target scale (number of pages, crawl cycle duration)? Are we building a general web crawler or a focused crawler for specific domains? Do we need to handle JavaScript-rendered pages? What is the freshness requirement for the index? Are there specific legal or ethical constraints to consider?
Strong answer: Drawing the Mercator two-layer frontier design from memory. Calculating Bloom filter size for the target URL count. Discussing SimHash for near-duplicate content detection. Mentioning consistent hashing for domain-based URL partitioning. Explaining adaptive recrawl intervals with change detection. Bringing up DNS pre-fetching as a latency optimization.
Red flags: Using a simple FIFO queue as the frontier without discussing priority or politeness. Not mentioning robots.txt or crawl delays. Ignoring spider traps entirely. Proposing to store all URLs in a hash set without considering memory (10B URLs x 50 bytes = 500 GB). Not discussing how to distribute the crawler across multiple machines.
Key takeaways
Why does a production web crawler need both URL-level deduplication (Bloom filter) and content-level deduplication (SimHash)?
URL-level dedup catches exact URL matches (same URL seen before) but misses cases where different URLs serve identical content — printer-friendly versions, paginated views, tracking parameter variants, and syndicated content across domains. Content-level dedup (SimHash) catches these cases by comparing page content fingerprints regardless of the URL. Together they prevent both refetching known URLs and storing duplicate content from different URLs, which can account for 20-30% of all pages on the web.
Explain why domain-based URL partitioning is preferred over URL-based partitioning in a distributed crawler.
Domain-based partitioning assigns all URLs from a given domain to the same worker node. This means per-domain politeness enforcement (rate limiting, Crawl-delay compliance, robots.txt caching) is entirely local — no cross-worker coordination needed. URL-based partitioning distributes individual URLs across workers, so two workers might simultaneously fetch from the same domain, violating politeness constraints unless they coordinate via a shared rate limiter. The local politeness advantage of domain-based partitioning outweighs its load imbalance downside (hot domains), which can be mitigated with work-stealing or domain splitting.
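One way to implement stable domain-to-worker assignment is rendezvous (highest-random-weight) hashing, a close cousin of the consistent-hash-ring approach: every URL from a domain lands on the same worker, and removing a worker only reassigns that worker's domains. A minimal sketch (function name is an assumption):

```python
import hashlib

def assign_worker(domain, workers):
    """Rendezvous hashing: pick the worker whose hash paired with this
    domain scores highest. Stable under worker removal."""
    def score(worker):
        digest = hashlib.sha256(f"{worker}:{domain}".encode()).hexdigest()
        return int(digest, 16)
    return max(workers, key=score)
```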
How does adaptive recrawl scheduling determine the optimal interval for revisiting a page?
Adaptive scheduling tracks each page's change history across multiple recrawl visits. If a page changed since the last visit, the recrawl interval is shortened (e.g., multiplied by 0.7). If it was unchanged, the interval is lengthened (e.g., multiplied by 1.5). This converges toward an interval that roughly matches the page's actual change frequency. The interval is further weighted by page importance (high-PageRank pages get more frequent recrawls for the same change rate) and bounded by minimum (5 minutes) and maximum (30 days) limits to prevent over-crawling or stale content.
💡 Analogy
Think of a web crawler as a massive library cataloging expedition. You start with a list of known library locations (seed URLs). Scout teams (fetcher workers) are dispatched to each library, where they photograph every book (download pages) and record which other libraries are referenced in bibliographies (extract links). Back at headquarters, catalog cards (the URL frontier) are organized into priority trays — rare collections get visited first, common branches can wait. Each scout follows strict visiting rules: only visit during open hours, wait between requests so you do not overwhelm the librarian, and always check the posted visiting policy (robots.txt) before entering. Meanwhile, a deduplication team (Bloom filter + SimHash) checks if we already have a copy of each book, and a scheduling team decides when to send scouts back to check for new editions. The expedition never truly ends — the library system keeps growing, and books keep getting updated.
⚡ Core Idea
A web crawler is a distributed graph traversal system that explores the hyperlink graph of the web. The core challenge is not fetching pages — that is straightforward HTTP. The challenge is deciding what to fetch next (frontier prioritization), avoiding redundant work (deduplication), being a respectful visitor (politeness), surviving adversarial content (trap detection), and keeping your data current (recrawl scheduling) — all at a scale of billions of pages across hundreds of millions of domains.
🎯 Why It Matters
Web crawlers are the invisible foundation of the search economy. Every Google search result, every SEO optimization, every price comparison tool, and every AI training dataset built from web data depends on a crawler having fetched that content. At FAANG scale, crawler efficiency directly translates to search quality and business outcomes. A crawler that wastes budget on duplicate or trapped content provides a worse index; a crawler that is too aggressive gets blocked by major sites; a crawler that recrawls too slowly serves stale results. Getting the balance right across these competing concerns is what makes this a deep system design problem.