
© 2026 TheSimplifiedTech. All rights reserved.

Interactive Explainer

Designing a Web Crawler

A comprehensive deep-dive into designing a production-grade web crawler capable of discovering, fetching, and indexing billions of web pages. Covers URL frontier design with priority and politeness queues, the fetch cycle including DNS pre-fetching and connection pooling, URL deduplication with Bloom filters and SimHash, distributed multi-worker architecture with consistent hashing, robots.txt compliance and ethical crawling, spider trap detection and URL normalization, and adaptive recrawl scheduling for freshness. This is a core system design interview question at FAANG companies because it tests breadth (networking, storage, distributed coordination) and depth (probabilistic data structures, graph traversal, politeness constraints) in a single session.

🎯Key Takeaways
The URL frontier is the most critical data structure in a crawler: a two-layer design with front queues for priority and back queues for per-domain politeness (Mercator architecture) is the industry standard. Without this design, the crawler either wastes budget on low-value pages or overwhelms individual web servers.
Deduplication operates at two levels — URL-level (Bloom filters for "have I seen this URL?") and content-level (SimHash for "have I seen this content?"). A Bloom filter for 10 billion URLs at 1% false positive rate requires only 12 GB of memory, making it practical for even the largest crawls.
Politeness is non-negotiable: respect robots.txt directives, enforce per-domain crawl delays (minimum 1 second between requests), identify your crawler via User-Agent, and implement exponential back-off for errors. Impolite crawling leads to IP blocks, legal action, and degraded access to the web.
Spider traps (infinite calendars, session IDs, parameter explosions) can consume unlimited crawl budget. Defense requires multiple layers: per-domain page caps, max crawl depth, URL pattern detection, and content novelty decay — no single technique catches all traps.
Adaptive recrawl scheduling allocates finite crawl budget optimally by adjusting per-page revisit intervals based on observed change rates and page importance. Pages that change frequently and rank highly get recrawled most often; static low-value pages wait weeks or months between visits.


What a Web Crawler Does and Why It Matters

A web crawler — also called a spider or bot — is a program that systematically browses the internet by following hyperlinks from page to page, downloading content, and extracting new URLs to visit. It is the foundational component of every search engine, powering the index that makes search possible. Without crawlers, search engines would have no data to search over.

Google operates the largest web crawler in the world (Googlebot), which discovers and re-fetches pages across an index of over 400 billion documents. Bing, Yandex, Baidu, and DuckDuckGo each operate their own crawlers. Beyond search, crawlers power academic datasets like Common Crawl (which produces a free, open archive of the web containing over 3 billion pages per monthly snapshot), price comparison engines, archival services like the Internet Archive, SEO analysis tools, and machine learning training pipelines.

Why Interviewers Love This Question

A web crawler touches nearly every distributed systems concept: graph traversal at internet scale, queue management and scheduling, network I/O optimization, deduplication with probabilistic data structures, politeness and rate limiting, handling adversarial content (spider traps), and freshness maintenance. It can be discussed at L4 depth (basic BFS + queue) or L7 depth (Bloom filter sizing, consistent hashing for URL partitioning, adaptive recrawl with exponential backoff). This range makes it one of the top system design interview questions at Google, Meta, and Amazon.

The scale of the modern web is staggering. Estimates put the indexed web at around 60 billion pages, with the deep web (content behind forms, logins, and dynamic generation) being orders of magnitude larger. A production crawler must handle pages in hundreds of languages, content types ranging from static HTML to JavaScript-rendered single-page applications, and an adversarial environment where some sites actively try to trap or block crawlers.

Core functions of a web crawler

  • URL discovery — Starting from seed URLs, extract links from fetched pages and add them to the frontier for future crawling. This creates a breadth-first or priority-based graph traversal of the web.
  • Content fetching — Issue HTTP requests to download pages, respecting connection limits, timeouts, and retry policies. Handle redirects (301/302), error codes, and varied content types.
  • Content parsing and extraction — Parse HTML to extract text, metadata, links, and structured data (schema.org, Open Graph). Handle malformed HTML gracefully.
  • Deduplication — Detect and skip duplicate or near-duplicate pages to avoid wasting bandwidth and storage. Use URL normalization and content fingerprinting.
  • Politeness enforcement — Respect robots.txt directives, enforce per-domain crawl delays, and avoid overwhelming any single web server.
  • Freshness maintenance — Re-crawl previously visited pages on a schedule to detect changes, prioritizing pages that change frequently or are highly valued.

Scale Numbers to Remember

The indexed web contains roughly 60 billion pages. Google crawls hundreds of millions of pages per day. A well-designed crawler targeting 1 billion pages needs to sustain about 12,000 pages per second (1B / 86,400 seconds). Each page averages about 2 MB when including images and scripts, but crawlers typically fetch only HTML (average 50-100 KB), so 1 billion pages of HTML is roughly 50-100 TB of raw content per crawl cycle.

Understanding these numbers is critical because they drive every architectural decision. A toy crawler on your laptop can fetch a few pages per second using a single thread. A production crawler serving a search engine must fetch thousands of pages per second, coordinate across hundreds of machines, avoid re-fetching billions of already-seen URLs, and do all of this while being a polite guest on every web server it visits.

Crawler vs Scraper

A crawler discovers and fetches pages by following links across the web. A scraper extracts specific data from known pages. In interviews, "design a web crawler" means the large-scale discovery and fetching system, not a targeted data extraction tool. The crawler feeds pages into downstream systems (indexers, scrapers, ML pipelines).

Requirements and Scale Estimation

Every system design interview should begin with clarifying requirements and estimating scale. For a web crawler, the functional and non-functional requirements are well-defined, but the scale numbers drive radically different architectural choices.

Functional requirements

  • Seed URL ingestion — Accept a set of seed URLs as the starting point for the crawl. Seeds may come from a curated list, a sitemap index, or a previous crawl state.
  • Link discovery and traversal — Parse fetched pages to extract hyperlinks, normalize them, and add new URLs to the crawl frontier.
  • Content download and storage — Fetch page content over HTTP/HTTPS and store raw HTML (and optionally rendered content) for downstream processing.
  • Robots.txt compliance — Fetch and cache robots.txt for every domain before crawling, and respect all Disallow, Crawl-delay, and Allow directives.
  • Deduplication — Avoid fetching the same URL twice and detect near-duplicate content across different URLs.
  • Recrawl scheduling — Periodically revisit pages to detect changes, with frequency proportional to each page's importance and change rate.

Non-functional requirements

  • Scalability — Must scale horizontally to handle billions of pages. Adding more machines should linearly increase crawl throughput.
  • Politeness — Must not overwhelm any single web server. Enforce per-domain request spacing of at least 1 second (or as specified by Crawl-delay in robots.txt).
  • Robustness — Handle network failures, malformed HTML, infinite redirect chains, spider traps, and adversarial pages without crashing or hanging.
  • Extensibility — Support adding new content types (PDF, images), new protocols (FTP), or new processing modules without redesigning the core architecture.
  • Freshness — High-value pages (news homepages, social media feeds) should be re-crawled within minutes; low-value static pages can be re-crawled weekly or monthly.

Start With the Math

In an interview, always show your back-of-envelope calculation. This demonstrates that you think about scale before jumping into architecture. Interviewers expect you to derive the pages-per-second target, bandwidth requirements, and storage needs from first principles.

| Metric | Calculation | Result |
| --- | --- | --- |
| Target pages | 1 billion pages per crawl cycle | 1,000,000,000 |
| Crawl cycle duration | 1 day (aggressive) to 1 week (conservative) | 86,400 to 604,800 seconds |
| Pages per second (1-day cycle) | 1B / 86,400 | ~11,574 pages/sec |
| Pages per second (1-week cycle) | 1B / 604,800 | ~1,653 pages/sec |
| Avg HTML page size | Typical HTML body without images | ~100 KB |
| Bandwidth (1-day cycle) | 11,574 x 100 KB/sec | ~1.16 GB/sec = ~9.3 Gbps |
| Bandwidth (1-week cycle) | 1,653 x 100 KB/sec | ~165 MB/sec = ~1.3 Gbps |
| Raw storage per cycle | 1B x 100 KB | ~100 TB |
| Storage with compression | ~5:1 gzip ratio for HTML | ~20 TB |
| URL frontier size | 10B discovered URLs (10x pages) | ~500 GB at 50 bytes/URL avg |

These numbers reveal important constraints. At 11,574 pages per second with a 1-day cycle, a single machine doing 50 fetches per second would need 230+ machines in the fetcher pool alone. The bandwidth requirement of 9.3 Gbps requires careful network planning. And 100 TB of raw HTML per cycle means storage costs add up quickly, making compression and deduplication essential.
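The back-of-envelope math above is easy to sanity-check in a few lines of Python. This is a sketch of the same arithmetic, using the assumed figures from the text (1 billion pages, ~100 KB average HTML, 5:1 gzip ratio) and decimal (SI) units:

```python
# Back-of-envelope crawl sizing, mirroring the table above.
# Inputs (1B pages, 100 KB avg HTML, 5:1 gzip) are the text's assumptions.
def crawl_estimates(pages: int, cycle_seconds: int,
                    avg_page_kb: int = 100, gzip_ratio: int = 5) -> dict:
    pages_per_sec = pages / cycle_seconds
    bandwidth_mb_per_sec = pages_per_sec * avg_page_kb / 1000   # decimal MB
    raw_storage_tb = pages * avg_page_kb / 1000 ** 3            # KB -> TB
    return {
        "pages_per_sec": round(pages_per_sec),
        "bandwidth_gbps": round(bandwidth_mb_per_sec * 8 / 1000, 1),
        "raw_storage_tb": round(raw_storage_tb),
        "compressed_storage_tb": round(raw_storage_tb / gzip_ratio),
    }

one_day = crawl_estimates(1_000_000_000, 86_400)        # ~11,574 pages/sec
one_week = crawl_estimates(1_000_000_000, 7 * 86_400)   # ~1,653 pages/sec
```

Changing a single assumption (cycle length, page size) immediately propagates to throughput, bandwidth, and storage, which is exactly the conversation an interviewer wants to have.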

DNS is the Hidden Bottleneck

Every new domain requires a DNS lookup. With 1 billion pages across 100 million domains, you need to resolve 100 million DNS queries per crawl cycle. Public DNS resolvers (8.8.8.8) rate-limit heavy users. Production crawlers run their own caching DNS resolvers and pre-fetch DNS records for domains in the frontier before the fetcher needs them. Without DNS caching, DNS lookup latency (50-200ms) dominates total page fetch time.

Interview estimation walkthrough

  1. State your assumption: "Let us design for 1 billion pages crawled in 1 week."
  2. Calculate pages per second: 1B / (7 x 86,400) = approximately 1,653 pages/sec.
  3. Estimate bandwidth: 1,653 x 100 KB = 165 MB/sec inbound, or about 1.3 Gbps. This is manageable on a single data center uplink.
  4. Estimate storage: 1B x 100 KB = 100 TB raw. With gzip compression at 5:1, about 20 TB per crawl cycle.
  5. Estimate fetcher count: if each fetcher handles 50 concurrent connections, we need 1,653 / 50 = 33 fetcher machines. Round up to 50 for headroom.
  6. Estimate frontier size: assume 10 billion discovered URLs at 50 bytes each = 500 GB. This fits in distributed memory or on-disk queues.


These estimates form the skeleton of your design. Every subsequent architectural decision — how many fetcher workers, how to partition the URL frontier, whether to use Bloom filters or a database for deduplication — follows from these numbers.

Seed URLs and URL Frontier Design

The URL frontier is the heart of a web crawler. It is the data structure that holds all URLs that have been discovered but not yet fetched. The frontier determines what the crawler visits next, making it the single most important component for crawl efficiency, politeness, and coverage. A naive frontier (a simple FIFO queue) works for toy crawlers but fails catastrophically at scale.

Seed URLs bootstrap the crawl. For a general-purpose search engine crawler, seeds typically include the homepages of major websites (top 10,000 by traffic), URLs submitted by webmasters through tools like Google Search Console, and URLs discovered from sitemaps. For a focused crawler (e.g., crawling only news sites), seeds are a curated list of target domains.

Seed Selection Strategy

Use the DMOZ/Curlie directory, Common Crawl URL lists, or Alexa/Tranco top sites as seeds. Good seeds have two properties: (1) high PageRank, meaning they link to many other important pages, and (2) broad coverage, spanning many domains and topics. Starting with 50,000-100,000 diverse seed URLs is typical for a general web crawl.

A production URL frontier has two layers: a front queue system that handles prioritization, and a back queue system that handles politeness. This two-layer design, described in the Mercator crawler paper (Heydon and Najork, 1999), remains the standard architecture used by modern crawlers.

graph TB
    subgraph "URL Frontier (Mercator Design)"
        direction TB
        NEW["Newly Discovered URLs"] --> NORM["URL Normalizer"]
        NORM --> DEDUP["URL Dedup Filter<br/>(Bloom Filter)"]
        DEDUP -->|"New URL"| PRIO["Priority Assigner<br/>(PageRank, freshness,<br/>domain authority)"]
        DEDUP -->|"Already seen"| DROP["Drop"]

        subgraph "Front Queues (Priority)"
            PRIO --> F1["Queue 1<br/>HIGH priority"]
            PRIO --> F2["Queue 2<br/>MEDIUM priority"]
            PRIO --> F3["Queue 3<br/>LOW priority"]
        end

        subgraph "Back Queues (Politeness)"
            F1 --> ROUTER["Back-Queue Router<br/>(hash domain → queue)"]
            F2 --> ROUTER
            F3 --> ROUTER
            ROUTER --> B1["Queue: example.com<br/>next_fetch: 10:00:05"]
            ROUTER --> B2["Queue: news.org<br/>next_fetch: 10:00:02"]
            ROUTER --> B3["Queue: shop.io<br/>next_fetch: 10:00:08"]
        end

        B1 --> FETCH["Fetcher Workers"]
        B2 --> FETCH
        B3 --> FETCH
    end

    style PRIO fill:#3b82f6,stroke:#333,color:#fff
    style ROUTER fill:#f59e0b,stroke:#333,color:#fff
    style FETCH fill:#10b981,stroke:#333,color:#fff
    style DROP fill:#ef4444,stroke:#333,color:#fff

Mercator-style two-layer URL frontier: front queues handle priority, back queues enforce per-domain politeness with next-fetch timestamps.

Front queue (priority) design

  • Priority scoring — Each URL receives a priority score based on signals: estimated PageRank, domain authority, how recently the URL was discovered, the depth from seed URLs, and whether the page has historically changed frequently.
  • Multiple priority buckets — Maintain N front queues (typically 3-10), each representing a priority tier. A biased selector pulls from higher-priority queues more frequently (e.g., 70% from high, 20% from medium, 10% from low).
  • Starvation prevention — Even low-priority URLs must eventually be crawled. Implement aging: URLs that have waited longer than a threshold get promoted to the next priority tier.
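The biased selector described above can be sketched as a weighted draw over priority tiers. The 70/20/10 split comes from the text; the tier names, empty queues, and function names are illustrative:

```python
import random

# Illustrative front-queue structure; the 70/20/10 weights are from the text.
FRONT_QUEUES = {"high": [], "medium": [], "low": []}
WEIGHTS = {"high": 0.7, "medium": 0.2, "low": 0.1}

def next_front_queue(rng: random.Random) -> str:
    """Pick a priority tier with probability proportional to its weight,
    skipping empty tiers so no URL waits forever behind empty queues."""
    tiers = [t for t, q in FRONT_QUEUES.items() if q]
    if not tiers:
        raise LookupError("frontier is empty")
    weights = [WEIGHTS[t] for t in tiers]
    return rng.choices(tiers, weights=weights, k=1)[0]
```

Note that weighted selection alone does not prevent starvation; the aging promotion described in the bullet above is still needed so that long-waiting low-priority URLs eventually move up a tier.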

Back queue (politeness) design

  • One queue per domain — Map each domain to exactly one back queue using consistent hashing. All URLs for example.com go into the same queue, ensuring that the crawler fetches only one page from that domain at a time.
  • Next-fetch timestamp — Each back queue has a next_allowed_fetch timestamp. After fetching a page from a domain, set next_allowed_fetch = now() + crawl_delay (default 1 second, or as specified in robots.txt).
  • Heap-based scheduling — A min-heap ordered by next_fetch timestamp selects which back queue to pull from next. The fetcher pops the queue with the earliest timestamp, fetches the URL, then updates the timestamp and re-inserts into the heap.
  • Queue overflow handling — When a back queue exceeds a size limit (e.g., 10,000 URLs), drop the lowest-priority URLs. This prevents a single prolific domain from consuming all frontier memory.
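The heap-based politeness scheduling above can be sketched in a few dozen lines. This is a minimal, single-process model under stated assumptions (a fixed default crawl delay, FIFO order within each domain queue; class and method names are illustrative):

```python
import heapq

class PolitenessScheduler:
    """Min-heap of (next_allowed_fetch, domain) over per-domain back queues."""

    def __init__(self, crawl_delay: float = 1.0):
        self.crawl_delay = crawl_delay
        self.heap = []       # entries: (next_allowed_fetch_ts, domain)
        self.queues = {}     # domain -> pending URLs for that domain

    def add_url(self, domain: str, url: str, now: float) -> None:
        if domain not in self.queues:
            self.queues[domain] = []
            heapq.heappush(self.heap, (now, domain))  # new domain: fetchable now
        self.queues[domain].append(url)

    def pop_next(self, now: float):
        """Return (url, seconds_to_wait) for the earliest-due domain."""
        next_ts, domain = heapq.heappop(self.heap)
        wait = max(0.0, next_ts - now)
        url = self.queues[domain].pop(0)
        if self.queues[domain]:
            # Re-insert the domain; not fetchable again until crawl_delay elapses
            heapq.heappush(self.heap, (max(now, next_ts) + self.crawl_delay, domain))
        else:
            del self.queues[domain]   # allow re-registration on the next add_url
        return url, wait
```

Because the heap is keyed by next-fetch time, a domain with thousands of queued URLs still yields at most one fetch per crawl-delay interval, while other domains interleave freely.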

Back Queue Explosion

With 100 million domains, maintaining one in-memory queue per domain is infeasible (it would require terabytes of RAM). Production crawlers use a hybrid approach: keep back queues for the top 1 million domains in memory, and use disk-backed queues (RocksDB, LevelDB) for the long tail. The BerkeleyDB-backed frontier used by Apache Nutch is a classic example of this pattern.

URL Normalization Before Enqueueing

Before adding a URL to the frontier, normalize it: lowercase the scheme and host, remove default port numbers (80 for HTTP, 443 for HTTPS), resolve relative paths (/../), remove fragment identifiers (#section), sort query parameters alphabetically, and decode unnecessary percent-encoding. Without normalization, http://Example.com:80/path and http://example.com/path are treated as different URLs, wasting crawl budget on duplicates.
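The normalization rules above map directly onto Python's standard library. This is a sketch, not an exhaustive normalizer; the tracking-parameter list is illustrative, and percent-encoding cleanup is omitted for brevity:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DEFAULT_PORTS = {"http": 80, "https": 443}
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}  # illustrative

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Drop default ports (80 for HTTP, 443 for HTTPS)
    netloc = host if parts.port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{parts.port}"
    # Resolve dot-segments like /a/../b -> /b
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and path != "/":
        path += "/"
    # Sort query parameters, strip tracking parameters, drop the fragment
    query = urlencode(sorted((k, v) for k, v in parse_qsl(parts.query)
                             if k not in TRACKING_PARAMS))
    return urlunsplit((scheme, netloc, path, query, ""))
```

With this, http://Example.com:80/a/../b?z=1&a=2#frag and http://example.com/b?a=2&z=1 collapse to the same frontier key.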

The Fetch Cycle — DNS, HTTP, and Parsing

The fetch cycle is where the crawler actually retrieves content from the web. Each cycle consists of three stages: DNS resolution to find the server IP address, HTTP request to download the page, and HTML parsing to extract content and new URLs. Optimizing each stage is critical because the fetch cycle runs billions of times.

Anatomy of a single page fetch

  1. Pop the next URL from the frontier (from the back queue with the earliest next_fetch timestamp).
  2. Check if we have a cached robots.txt for this domain. If not, fetch robots.txt first and cache it (TTL of 24 hours). If the target path is disallowed, skip the URL.
  3. Resolve the hostname to an IP address via DNS. Check the local DNS cache first; if miss, query the crawler-local caching DNS resolver.
  4. Open an HTTP connection (or reuse one from the connection pool for this domain). Send a GET request with appropriate headers: User-Agent (identifying the crawler), Accept-Encoding (gzip), If-Modified-Since (for recrawls).
  5. Read the response. If 200 OK, proceed. If 301/302 redirect, extract the Location header and add the redirect target to the frontier (up to 5 redirects max). If 304 Not Modified, mark the page as unchanged. If 4xx/5xx, log the error and apply a back-off policy.
  6. Decompress the response body (gzip/brotli). Parse HTML to extract text content, metadata (title, description, canonical URL), and all hyperlinks (<a href>, <link>, <frame src>).
  7. Normalize extracted URLs and feed them back to the URL frontier through the deduplication filter.
  8. Store the raw HTML and extracted metadata in the content store for downstream indexing.

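The status-code branching in the read-response step can be expressed as a pure dispatch function, which keeps the policy testable without any network I/O. The action names and the handling of statuses beyond those the text lists are illustrative:

```python
def classify_response(status: int, redirects_followed: int, max_redirects: int = 5) -> str:
    """Map an HTTP status to the crawler's next action for a fetched URL."""
    if status == 200:
        return "parse"
    if status in (301, 302):
        # Follow the Location header, but cap chain length to bound loops
        return "follow_redirect" if redirects_followed < max_redirects else "give_up"
    if status == 304:
        return "unchanged"          # conditional GET hit; no body transferred
    if 400 <= status < 500:
        return "log_and_skip"       # client error: do not retry aggressively
    if 500 <= status < 600:
        return "backoff_and_retry"  # server error: exponential back-off
    return "log_and_skip"
```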

DNS Pre-fetching

DNS resolution adds 50-200ms of latency per new domain. Production crawlers run a dedicated DNS pre-fetch thread that scans URLs in the frontier and resolves their hostnames before the fetcher needs them. Results are cached in a local DNS cache (TTL 30 minutes to 24 hours). Running your own caching DNS resolver (BIND, Unbound, or CoreDNS) avoids rate-limiting from public resolvers and reduces latency to under 1ms for cached entries.

| Optimization | What it does | Impact |
| --- | --- | --- |
| Connection pooling | Reuse TCP connections across multiple requests to the same domain | Eliminates 50-100ms TCP handshake + TLS setup per request. Keep-alive pools of 2-4 connections per domain. |
| Async I/O | Use epoll/kqueue to handle thousands of concurrent connections per thread | A single fetcher thread can manage 1,000+ concurrent HTTP requests without thread overhead. |
| DNS pre-fetching | Resolve hostnames before the fetcher needs them | Eliminates DNS latency from the critical path. Local cache hit rate should be above 95%. |
| Conditional GET | Send If-Modified-Since or If-None-Match headers during recrawls | Server returns 304 Not Modified with no body, saving bandwidth. Effective for 30-50% of recrawls. |
| gzip/brotli | Request compressed content via Accept-Encoding header | Reduces transferred bytes by 60-80%. Nearly all web servers support gzip. |
| Timeout policies | Set connect timeout (5s), read timeout (30s), and total timeout (60s) | Prevents slow or unresponsive servers from blocking fetcher threads indefinitely. |

HTML parsing is the next major bottleneck after network I/O. A production crawler must handle malformed HTML gracefully because the real web is full of broken markup — unclosed tags, invalid nesting, mixed encodings, and embedded scripts. Libraries like jsoup (Java), BeautifulSoup/lxml (Python), and Go's net/html package are designed for this kind of forgiving parsing. Link extraction must handle both absolute and relative URLs, resolve base tags, handle meta refresh redirects, and recognize link-like patterns in JavaScript strings.

JavaScript-Rendered Content

An increasing percentage of the web (estimated 10-30%) renders content using JavaScript frameworks (React, Angular, Vue). A simple HTML parser sees an empty div and a script tag. To crawl these pages, you need a headless browser (Puppeteer, Playwright) to execute JavaScript and capture the rendered DOM. This is 10-100x more expensive than simple HTML parsing. Google runs a two-pass system: a first pass fetches raw HTML, and a second "rendering" pass uses a headless Chrome fleet for pages that need JavaScript execution.

robots.txt Caching Strategy

Fetch robots.txt once per domain per day and cache it. If robots.txt returns a 5xx error, treat it as "allow all" but retry within an hour. If it returns 404, there are no restrictions for that domain. Always respect Crawl-delay directives — they are a strong signal that the webmaster wants you to slow down. Cache robots.txt in a fast key-value store (Redis, Memcached) keyed by domain.
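Python's standard library ships a robots.txt parser, so the per-domain check above needs little code. This sketch substitutes a plain dict for the Redis/Memcached cache the text suggests, and the example rules and crawler name are made up:

```python
from urllib.robotparser import RobotFileParser

# One parsed robots.txt per domain; a dict stands in for the Redis cache.
robots_cache: dict[str, RobotFileParser] = {}

def allowed(domain: str, path: str, robots_txt: str, agent: str = "MyCrawler") -> bool:
    """Parse-and-cache the domain's robots.txt, then check the target path."""
    if domain not in robots_cache:
        rp = RobotFileParser()
        rp.parse(robots_txt.splitlines())
        robots_cache[domain] = rp
    return robots_cache[domain].can_fetch(agent, f"https://{domain}{path}")

rules = "User-agent: *\nDisallow: /private/\n"
```

A production cache entry would also carry the fetch timestamp so it can be expired after the 24-hour TTL.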

Ignoring Redirect Chains

Some URLs redirect multiple times (HTTP to HTTPS, www to non-www, short URL to final page). Follow redirects up to a maximum depth (5 is standard), but add the final destination URL to the frontier, not the intermediate redirects. Also detect redirect loops (A -> B -> A) and break them. Failing to handle redirects correctly results in duplicate content, wasted bandwidth, and missing pages.
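The loop and chain-length handling above fits in one small function. In this sketch a plain dict stands in for the Location headers a live fetcher would see; the function name is illustrative:

```python
def resolve_redirects(url: str, location_of: dict, max_hops: int = 5):
    """Follow a redirect chain; return the final URL, or None on a loop
    or a chain longer than max_hops."""
    seen = {url}
    for _ in range(max_hops):
        target = location_of.get(url)
        if target is None:
            return url            # no further redirect: this is the destination
        if target in seen:
            return None           # redirect loop (A -> B -> A)
        seen.add(target)
        url = target
    return None                    # chain exceeded max_hops
```

Only the returned final URL should be enqueued; the intermediate hops are dropped, as the text recommends.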

URL Deduplication and Content Fingerprinting

The web is full of duplicate content. The same page may be reachable through multiple URLs (with and without www, with different query parameter orderings, with tracking parameters appended). Different URLs may serve identical or near-identical content (printer-friendly versions, paginated views, A/B test variants). Without robust deduplication, a crawler wastes bandwidth, storage, and crawl budget re-fetching content it already has.

Deduplication happens at two levels: URL-level (have we already seen this exact URL?) and content-level (does this page contain the same content as another page we have already fetched?). Both are essential, and they use different techniques.

URL-level deduplication

  • URL normalization — Before checking for duplicates, normalize the URL: lowercase scheme and host, remove default ports, resolve dot-segments, remove fragments, sort query parameters, strip tracking parameters (utm_source, fbclid, etc.). This eliminates trivial URL variations.
  • Bloom filter — A space-efficient probabilistic data structure that answers "have I seen this URL before?" with zero false negatives and a configurable false positive rate. For 10 billion URLs with a 1% false positive rate, a Bloom filter needs approximately 12 GB of memory — vastly less than storing all URLs explicitly.
  • Hash set fallback — For smaller crawls, a simple hash set of URL fingerprints (e.g., 8-byte MurmurHash of the normalized URL) works. 10 billion URLs at 8 bytes each = 80 GB, which fits in memory on modern servers.
graph LR
    subgraph "URL Dedup Pipeline"
        URL["Extracted URL"] --> NORM2["Normalize<br/>(lowercase, strip params,<br/>remove fragment)"]
        NORM2 --> HASH["Compute Hash<br/>(MurmurHash3 128-bit)"]
        HASH --> BF{"Bloom Filter<br/>Check"}
        BF -->|"Probably new"| ADD["Add to Frontier<br/>+ Insert into Bloom Filter"]
        BF -->|"Definitely seen"| SKIP["Skip URL"]
    end

    subgraph "Content Dedup Pipeline"
        PAGE["Fetched Page HTML"] --> EXTRACT["Extract Text<br/>(strip tags, scripts, styles)"]
        EXTRACT --> SH["Compute SimHash<br/>(64-bit fingerprint)"]
        SH --> COMPARE{"Hamming Distance<br/>vs Existing<br/>Fingerprints"}
        COMPARE -->|"Distance > 3"| UNIQUE["Unique Content<br/>→ Store"]
        COMPARE -->|"Distance ≤ 3"| NEARDUP["Near-Duplicate<br/>→ Mark + Skip"]
    end

    style BF fill:#3b82f6,stroke:#333,color:#fff
    style COMPARE fill:#f59e0b,stroke:#333,color:#fff
    style SKIP fill:#ef4444,stroke:#333,color:#fff
    style NEARDUP fill:#ef4444,stroke:#333,color:#fff
    style UNIQUE fill:#10b981,stroke:#333,color:#fff

Two-stage deduplication: URL-level uses Bloom filters to avoid refetching known URLs; content-level uses SimHash to detect near-duplicate page content.

Bloom Filter Sizing Formula

A Bloom filter for n items with false positive rate p requires m = -(n * ln(p)) / (ln(2))^2 bits and k = (m/n) * ln(2) hash functions. For 10 billion URLs at 1% FP rate: m = ~96 billion bits = ~12 GB, k = 7 hash functions. At 0.1% FP rate: m = ~144 billion bits = ~18 GB, k = 10. These sizes fit comfortably in server RAM.
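The sizing formula is easy to verify numerically; the numbers below reproduce the 12 GB / 7-hash and 18 GB / 10-hash figures quoted above:

```python
import math

def bloom_size(n: int, p: float):
    """Bits m and hash count k for n items at false-positive rate p."""
    m = -n * math.log(p) / math.log(2) ** 2   # bits
    k = (m / n) * math.log(2)                  # hash functions
    return m, round(k)

m, k = bloom_size(10_000_000_000, 0.01)   # 10B URLs at 1% FP
gigabytes = m / 8 / 1e9                   # bits -> decimal GB, ~12 GB with k = 7
```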

Content-level deduplication detects when different URLs serve the same or nearly identical content. This is surprisingly common: product pages with different sorting orders, forum threads with different pagination, and syndicated articles appearing on multiple sites. Two key algorithms dominate this space: SimHash for near-duplicate detection and MinHash for document similarity estimation.

| Algorithm | How it works | Output | Use case |
| --- | --- | --- | --- |
| Exact hash (MD5/SHA-256) | Hash the entire page content | 128/256-bit digest | Exact duplicate detection. Any change (even a single byte) produces a different hash. |
| SimHash (Charikar, 2002) | Compute weighted hash of content tokens, producing a fingerprint where similar documents have similar fingerprints | 64-bit fingerprint | Near-duplicate detection. Two documents with Hamming distance <= 3 (out of 64 bits) are considered near-duplicates. |
| MinHash (Broder, 1997) | Apply k hash functions to the set of shingles (n-grams) in each document; keep the minimum hash value for each function | k minimum hash values (signature) | Jaccard similarity estimation. Two documents with 80%+ matching MinHash values have high content overlap. |
| SimHash + Hamming lookup | Index SimHash fingerprints in a table partitioned by bit blocks; query by checking all permutations within Hamming distance d | Boolean: near-duplicate or not | Scalable near-duplicate detection. Can check a fingerprint against billions of stored fingerprints in milliseconds. |

SimHash in Practice

SimHash works by treating a document as a bag of weighted features (typically word bigrams or trigrams). For each feature, compute a hash, then for each bit position, add the weight if the bit is 1 and subtract it if the bit is 0. The final fingerprint is constructed by setting each bit to 1 if the sum for that position is positive. Two documents differing by a small amount of content will differ in only a few bit positions. Google uses SimHash internally to detect near-duplicate web pages at scale.
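A toy 64-bit SimHash following the description above. For brevity this sketch uses unit weights and single-word features rather than the weighted bigrams/trigrams a production system would use:

```python
import hashlib

def simhash64(text: str) -> int:
    """64-bit SimHash: per bit position, sum +1/-1 over feature hashes,
    then set the bit if the sum is positive."""
    sums = [0] * 64
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            sums[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if sums[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bit positions between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents that differ by one word out of a dozen land only a few bits apart, while unrelated documents land around 32 bits apart, which is what makes the Hamming-distance-3 threshold usable.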

Shingling for Content Fingerprinting

Before computing SimHash or MinHash, convert the page into a set of shingles — overlapping n-grams of words. For example, "the quick brown fox" with 3-word shingles becomes {"the quick brown", "quick brown fox"}. Shingle size of 5-8 words works well for web pages. Using word-level shingles (rather than character-level) is more robust to minor formatting changes while still detecting meaningful content similarity.
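The shingling step, using the example from the note above:

```python
def shingles(text: str, k: int = 3) -> set:
    """Overlapping k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
```

For instance, shingles("the quick brown fox", 3) yields {"the quick brown", "quick brown fox"}; these sets are what MinHash signatures or SimHash features are computed from.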

Distributed Architecture

A single machine cannot crawl the web at scale. With targets of thousands of pages per second and a frontier containing billions of URLs, the crawler must be distributed across many machines. The key challenge is partitioning the work so that each machine operates independently as much as possible, while still coordinating enough to avoid duplicate work and maintain global politeness.

The standard distributed crawler architecture assigns each worker a partition of the URL space. The most common partitioning strategy is by domain: all URLs belonging to a given domain are assigned to the same worker. This has a critical advantage — per-domain politeness (rate limiting, robots.txt caching) is handled locally within each worker without any cross-worker coordination.

graph TB
    subgraph "Distributed Crawler Architecture"
        SEEDS["Seed URLs"] --> DIST["URL Distributor<br/>(Consistent Hash<br/>by domain)"]

        subgraph "Worker Node 1"
            DIST -->|"domains A-F"| F1["URL Frontier 1"]
            F1 --> FET1["Fetcher Pool<br/>(50 async connections)"]
            FET1 --> PARSE1["Parser"]
            PARSE1 -->|"New URLs"| DEDUP1["Dedup Filter"]
        end

        subgraph "Worker Node 2"
            DIST -->|"domains G-N"| F2["URL Frontier 2"]
            F2 --> FET2["Fetcher Pool<br/>(50 async connections)"]
            FET2 --> PARSE2["Parser"]
            PARSE2 -->|"New URLs"| DEDUP2["Dedup Filter"]
        end

        subgraph "Worker Node 3"
            DIST -->|"domains O-Z"| F3["URL Frontier 3"]
            F3 --> FET3["Fetcher Pool<br/>(50 async connections)"]
            FET3 --> PARSE3["Parser"]
            PARSE3 -->|"New URLs"| DEDUP3["Dedup Filter"]
        end

        DEDUP1 -->|"Route by domain"| DIST
        DEDUP2 -->|"Route by domain"| DIST
        DEDUP3 -->|"Route by domain"| DIST

        FET1 --> STORE["Content Store<br/>(S3 / HDFS)"]
        FET2 --> STORE
        FET3 --> STORE

        PARSE1 --> META["Metadata DB<br/>(URL, timestamp,<br/>fingerprint, status)"]
        PARSE2 --> META
        PARSE3 --> META
    end

    style DIST fill:#3b82f6,stroke:#333,color:#fff
    style STORE fill:#10b981,stroke:#333,color:#fff
    style META fill:#f59e0b,stroke:#333,color:#fff

Distributed crawler with domain-based URL partitioning. Each worker owns a slice of domains and handles all fetching, parsing, and politeness for those domains independently.

URL partitioning strategies

  • Hash-by-domain — hash(domain) mod N assigns each domain to a worker. Simple, ensures per-domain politeness is local. Downside: hot domains (wikipedia.org, reddit.com) create load imbalance.
  • Consistent hashing — Map domains to a hash ring with virtual nodes. Adding or removing a worker redistributes only 1/N of the domains, minimizing frontier data movement. This is the preferred approach for elastic scaling.
  • Hash-by-URL — hash(full_url) mod N distributes individual URLs. Better load balance, but URLs from the same domain may land on different workers, requiring cross-worker coordination for politeness. Generally avoided.
  • Static domain assignment — Manually assign domain ranges to workers based on known domain sizes. Handles hot domains explicitly but requires operational overhead for rebalancing.

Consistent Hashing for Elastic Scaling

When a worker node fails or new capacity is added, consistent hashing with virtual nodes (100-200 virtual nodes per physical node) ensures that only a small fraction of domains are reassigned. The frontier data for reassigned domains must be transferred to the new owner. Use a message queue (Kafka, RabbitMQ) as the inter-worker URL exchange: when a parser on Worker 1 discovers a URL belonging to Worker 2's domain partition, it publishes the URL to Worker 2's input queue.
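A minimal hash ring with virtual nodes might look like the following sketch. The worker names and the 150-virtual-node default are placeholders, not values from a particular system.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map domains to workers via a hash ring with virtual nodes."""

    def __init__(self, workers, vnodes=150):
        self._ring = []  # sorted list of (hash, worker)
        for worker in workers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{worker}#{i}"), worker))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Stable 64-bit position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def owner(self, domain):
        # First virtual node clockwise from the domain's position (wraps around).
        idx = bisect.bisect(self._keys, self._hash(domain)) % len(self._keys)
        return self._ring[idx][1]
```

Adding a fourth worker to a three-worker ring reassigns only the domains captured by the new worker's virtual nodes, roughly a quarter of them, rather than rehashing everything.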

The Hot Domain Problem

Hash-by-domain creates load imbalance because domain sizes follow a power law. Wikipedia has over 60 million pages, while most domains have fewer than 100 pages. Solutions: (1) split hot domains across multiple workers using sub-domain or path-prefix partitioning, (2) use work-stealing, where idle workers pull URLs from overloaded workers' queues, or (3) maintain a separate "large domain" pool with dedicated workers.

Inter-worker communication is the other key design decision. When a parser on Worker 1 extracts a URL belonging to a domain owned by Worker 2, it must route that URL to Worker 2. The two common approaches are: (1) a shared message queue (Kafka or RabbitMQ) where each worker consumes from a topic partition mapped to its domain set, or (2) direct RPC where workers send URLs to the owning worker via gRPC. The message queue approach is preferred because it decouples workers and handles backpressure naturally.

Content Storage Architecture

Store raw HTML in an object store (S3, GCS, HDFS) keyed by a content hash (SHA-256 of the page body). This provides built-in deduplication at the storage layer — identical pages from different URLs map to the same storage object. Maintain a metadata database (PostgreSQL, Cassandra) that maps URL to content hash, fetch timestamp, HTTP status, response headers, and extracted metadata. The metadata DB is queried frequently for scheduling decisions; the content store is accessed only during indexing.

Politeness, Throttling, and robots.txt

Politeness is not optional — it is a fundamental requirement of any responsible web crawler. An impolite crawler can overwhelm small web servers, trigger DDoS protection systems, get its IP addresses blocked, and potentially violate computer fraud and abuse laws. Every production crawler must implement strict politeness policies at multiple levels.

The robots.txt standard (formally the Robots Exclusion Protocol, standardized as RFC 9309 in 2022) is the primary mechanism by which webmasters communicate their crawling preferences. Every crawler must fetch, parse, and obey robots.txt before crawling any page on a domain.

| robots.txt Directive | Meaning | Example |
| --- | --- | --- |
| User-agent | Specifies which crawler the following rules apply to | User-agent: Googlebot (rules for Google) or User-agent: * (rules for all) |
| Disallow | Paths the crawler must not visit | Disallow: /admin/ prevents crawling anything under /admin/ |
| Allow | Overrides a broader Disallow for specific sub-paths | Allow: /admin/public-page.html permits this page even if /admin/ is disallowed |
| Crawl-delay | Minimum seconds between consecutive requests to this domain | Crawl-delay: 10 means wait at least 10 seconds between requests |
| Sitemap | Location of XML sitemaps for the site | Sitemap: https://example.com/sitemap.xml helps the crawler discover all pages |
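Python's standard library ships a robots.txt parser, which is handy for experimenting with these directives. A sketch of offline use; note that `urllib.robotparser` applies rules in file order (first match wins), so the Allow line is placed before the broader Disallow here:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public-page.html
Disallow: /admin/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
rp.modified()  # mark the rules as loaded; read() does this when fetching over HTTP

rp.can_fetch("MyCrawler", "https://example.com/admin/secret")            # → False
rp.can_fetch("MyCrawler", "https://example.com/index.html")              # → True
rp.can_fetch("MyCrawler", "https://example.com/admin/public-page.html")  # → True
rp.crawl_delay("MyCrawler")                                              # → 10
```

In production you would call `rp.set_url("https://example.com/robots.txt")` and `rp.read()` instead of `parse()`, and cache the result per domain.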

robots.txt is Advisory, Not Enforceable

robots.txt is a convention, not a technical barrier: a crawler can technically ignore it. Respecting it is nonetheless both ethically required and practically essential. Sites that detect policy violations will block your IP addresses, complain to your hosting provider, or pursue legal action, and the years-long hiQ Labs v. LinkedIn litigation over scraping shows how expensive such disputes can become.

Politeness policies beyond robots.txt

  • Per-domain rate limiting — Even if robots.txt does not specify a Crawl-delay, limit requests to 1 per second per domain by default. This is the community standard followed by major search engine crawlers.
  • Concurrent connection limit — Open at most 1-2 simultaneous connections to any single domain. Multiple concurrent requests can overwhelm small servers even if the rate is low.
  • Total bandwidth throttling — Cap the crawler's total outbound bandwidth (e.g., 1 Gbps) to avoid saturating your own network and to limit aggregate impact on the internet.
  • Respectful User-Agent — Identify your crawler with a descriptive User-Agent string that includes a URL where webmasters can learn more and request exclusion. Example: "MyCrawler/1.0 (+https://mycrawler.com/about)".
  • Time-of-day awareness — For smaller sites, reduce crawl intensity during business hours (when human traffic peaks) and increase during off-peak hours. Major crawlers do this for sensitive domains.
  • Back-off on errors — If a server returns 5xx errors or connection timeouts, implement exponential back-off: wait 1 minute, then 2 minutes, then 4 minutes, up to a maximum of 24 hours before retrying.

Implementing Per-Domain Rate Limiting

The back-queue design from the frontier section naturally enforces per-domain politeness. Each back queue has a next_allowed_fetch timestamp. After fetching from a domain, set next_allowed_fetch = now() + max(default_delay, robots_crawl_delay). The min-heap scheduler automatically skips domains that are still in their delay window. This is O(log N) per fetch operation where N is the number of active domains — efficient even at millions of domains.
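The min-heap scheduler described above can be sketched as follows. This is illustrative only; `time.monotonic` stands in for the crawler's clock, and the 1-second default delay matches the community standard mentioned earlier.

```python
import heapq
import time

class PolitenessScheduler:
    """Min-heap of (next_allowed_fetch, domain) enforcing per-domain delays."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.heap = []           # entries: (next_allowed_fetch, domain)
        self.crawl_delays = {}   # robots.txt Crawl-delay overrides, per domain

    def add_domain(self, domain, crawl_delay=None):
        if crawl_delay is not None:
            self.crawl_delays[domain] = crawl_delay
        heapq.heappush(self.heap, (0.0, domain))  # immediately eligible

    def next_domain(self, now=None):
        """Return a domain ready to fetch, or None if all are in their delay window."""
        now = time.monotonic() if now is None else now
        if not self.heap or self.heap[0][0] > now:
            return None
        _, domain = heapq.heappop(self.heap)
        # next_allowed_fetch = now + max(default_delay, robots_crawl_delay)
        delay = max(self.default_delay, self.crawl_delays.get(domain, 0))
        heapq.heappush(self.heap, (now + delay, domain))
        return domain
```

Each pop-and-reinsert is O(log N) in the number of active domains, matching the cost stated above.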

Legal considerations vary by jurisdiction. In the United States, the Computer Fraud and Abuse Act (CFAA) prohibits unauthorized access to computer systems. Courts have generally held that crawling publicly accessible pages is not a CFAA violation, but crawling pages behind login walls or ignoring explicit robots.txt restrictions may be. The European Union GDPR imposes restrictions on collecting personal data from web pages. The EU Database Directive protects databases, which can include structured website content.

Ethical Crawling Checklist

  • Always identify your crawler via the User-Agent header.
  • Respect robots.txt fully, and honor Crawl-delay directives.
  • Provide a webpage explaining your crawler, with a contact email for exclusion requests.
  • Do not crawl content behind login walls.
  • Do not crawl at rates that degrade site performance.
  • Implement exponential back-off for errors.
  • Store and respect per-domain crawl preferences.
  • Respond promptly to webmaster complaints.

Crawling During Site Outages

If a site returns 503 Service Unavailable, it means the server is temporarily overloaded. Continuing to crawl worsens the problem. A polite crawler detects repeated 5xx responses and puts the domain into a cool-down period (exponential back-off starting at 1 hour). Never retry failed requests immediately — this is the equivalent of repeatedly ringing a doorbell when nobody is answering.

Handling Traps, Loops, and Dark Patterns

The web is not a friendly graph. It contains intentional and unintentional structures designed to trap crawlers into infinite loops, consuming resources while providing no useful content. A production crawler must detect and escape these traps to protect its crawl budget.

Spider traps are the most common adversarial pattern. A spider trap is a set of URLs that generates an effectively infinite number of pages. The most notorious example is a dynamic calendar that generates pages for every day into the infinite future: /calendar/2024/01/01, /calendar/2024/01/02, ..., /calendar/9999/12/31. Each page links to the next day, creating an endless chain that a naive BFS crawler will follow forever.

Common spider trap patterns

  • Infinite calendars — Calendar widgets that generate pages for every future date. The crawler discovers a link to tomorrow, which links to the day after, creating an unbounded chain.
  • Session ID traps — Sites that append a unique session ID to every URL: /page?sid=abc123. Each visit generates a new session ID, making every URL appear unique while the content is identical.
  • Infinite pagination — Pages with "next" links that never end: /results?page=1, /results?page=2, ... /results?page=999999. Each page contains the same template with slightly different (or even empty) content.
  • URL parameter explosion — Sort/filter combinations that multiply URLs exponentially: /products?sort=name&color=red&size=large. With 5 sort options, 10 colors, and 8 sizes, one product listing spawns 400 unique URLs.
  • Symbolic link loops (file:// crawlers) — In filesystem-based crawlers, symbolic links can create directory cycles: /a/b/c -> /a creates an infinite path.
  • Honeypot traps — Deliberately placed hidden links (invisible to human users, visible to crawlers) that serve infinite generated content to identify and block crawlers.
| Defense | How it works | Notes and limitations |
| --- | --- | --- |
| Max URL depth | Refuse to crawl URLs that are more than N clicks away from any seed URL | Simple and effective. Typical limit: 15-20 hops. May miss deeply nested legitimate content. |
| Max URL length | Skip URLs exceeding a length threshold (e.g., 2048 characters) | Trap URLs tend to be very long due to repeated path segments or many parameters. |
| Max pages per domain | Limit the number of pages crawled from any single domain per crawl cycle | Prevents any one domain from consuming the entire crawl budget. Set per domain authority. |
| URL pattern detection | Detect repeating path patterns: /a/b/a/b/a/b suggests a loop | Use regex or a suffix tree to detect repeating segments. Catches most path-based traps. |
| Content fingerprinting | If consecutive pages from the same domain have identical or near-identical SimHash fingerprints, stop crawling that path | Catches traps that generate slightly different URLs but identical content. |
| robots.txt exclusions | Trap URLs are often excluded via robots.txt by well-meaning sites | Only works if the site operator has configured robots.txt correctly. |

URL Normalization Defenses

Apply aggressive URL normalization before deduplication: remove known tracking parameters (utm_*, fbclid, gclid), remove session IDs (JSESSIONID, PHPSESSID), sort query parameters, remove empty parameters, and collapse repeated path separators (/a//b -> /a/b). This collapses many trap variations into a single canonical URL that the dedup filter catches. Maintain a configurable list of known tracking parameters and update it regularly.
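A sketch of such a normalizer using only the standard library. The tracking-parameter set here is a small illustrative subset of the real, regularly updated list a production crawler would maintain.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative subset; real deployments maintain a much longer, updated list.
TRACKING_PARAMS = {"fbclid", "gclid", "jsessionid", "phpsessid"}

def normalize_url(url):
    """Canonicalize a URL so trap and tracking variants collapse to one key."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    path = re.sub(r"/{2,}", "/", path) or "/"  # /a//b -> /a/b; empty path -> /
    # Drop tracking/session parameters and empty values, then sort the rest.
    params = sorted(
        (k, v) for k, v in parse_qsl(query, keep_blank_values=True)
        if v and k.lower() not in TRACKING_PARAMS and not k.lower().startswith("utm_")
    )
    return urlunsplit((scheme.lower(), netloc, path, urlencode(params), ""))
```

With this in place, `?b=2&a=1&utm_source=x` and `?a=1&b=2` hash to the same frontier key, so the Bloom filter sees one URL instead of many.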

Max Depth is Not Enough

Max depth prevents the simplest traps but misses sophisticated ones. A trap at depth 5 that generates 1 million pages per level will still consume a million page fetches before hitting any depth limit. Combine depth limits with per-domain page caps, URL pattern detection, and content fingerprinting for defense in depth. No single technique catches all traps.

Loop detection requires tracking the crawl path. Maintain a per-domain visit history and detect when the crawler is revisiting the same content (as measured by content fingerprints) through different URLs. If more than 10 consecutive fetches from the same domain produce near-identical SimHash fingerprints, flag the domain as a potential trap and stop crawling beyond a conservative page limit.
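The per-domain fingerprint window described above might be sketched like this; the window size and Hamming threshold are the values suggested in the text.

```python
from collections import defaultdict, deque

class TrapDetector:
    """Flag a domain when consecutive fetches yield near-identical content."""

    def __init__(self, window=10, max_hamming=3):
        self.window = window
        self.max_hamming = max_hamming
        # Last `window` SimHash fingerprints per domain.
        self.recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, domain, fingerprint):
        """Record a fetch; return True if the domain now looks like a trap."""
        history = self.recent[domain]
        history.append(fingerprint)
        if len(history) < self.window:
            return False  # not enough evidence yet
        # Trap signal: every fingerprint in the window is near the first one.
        first = history[0]
        return all(bin(first ^ fp).count("1") <= self.max_hamming for fp in history)
```

A True return would put the domain behind a conservative page cap rather than removing it outright, since legitimate template-heavy sites can also trip the detector.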

The OPIC Algorithm for Crawl Budget Allocation

OPIC (Online Page Importance Computation) is an algorithm that estimates PageRank-like importance scores during the crawl itself. Pages that receive many inbound links accumulate more "cash," and the crawler prioritizes pages with higher cash values. This naturally deprioritizes trap pages because they receive links only from other trap pages, not from authoritative sources. OPIC is used as a crawl scheduler signal by several open-source crawlers including Apache Nutch.

Recrawl Scheduling and Freshness

A crawler that only visits each page once produces a stale index. The web is constantly changing: news sites update every minute, e-commerce prices change daily, and even static blogs get occasional edits. Recrawl scheduling determines how often to revisit previously crawled pages to maintain freshness — a critical quality metric for search engines.

The core challenge is that crawl budget is finite. If you can fetch 10,000 pages per second, you must allocate those fetches between discovering new pages and recrawling known pages. Spending too much budget on recrawling means slow discovery of new content; spending too little means your index becomes stale. The optimal allocation depends on page change rates, which vary enormously across the web.

| Page category | Typical change frequency | Recommended recrawl interval | Examples |
| --- | --- | --- | --- |
| Breaking news | Minutes | 5-15 minutes | CNN.com homepage, Reuters breaking news feeds, Twitter trending topics |
| News articles | Hours | 1-6 hours | Individual news articles (edits, corrections, comment counts) |
| E-commerce | Daily | 12-24 hours | Product prices, stock availability, reviews |
| Blogs and forums | Weekly | 3-7 days | Blog posts, forum threads (new replies) |
| Static reference | Monthly or less | 14-30 days | Wikipedia articles, documentation, government sites |
| Archived content | Rarely or never | 30-90 days | Historical pages, archived news, old forum posts |

Adaptive recrawl scheduling uses historical change data to set per-page recrawl intervals. The idea is simple: if a page changed the last 5 times you checked, check it more often; if it has not changed in the last 10 checks, check it less often. This is a bandit-style optimization problem.

Adaptive recrawl algorithm

1. After each fetch, compare the content fingerprint (SimHash) to the previous version.
2. If the content changed, decrease the recrawl interval by a factor (e.g., multiply by 0.7). Minimum interval: 5 minutes.
3. If the content did not change, increase the recrawl interval by a factor (e.g., multiply by 1.5). Maximum interval: 30 days.
4. Weight the priority by page importance (PageRank, domain authority, traffic estimates). A high-importance page that changes daily gets priority over a low-importance page that also changes daily.
5. Maintain a recrawl priority queue sorted by next_scheduled_recrawl = last_fetch_time + current_interval.
6. On each scheduling cycle, pop pages from the recrawl queue and inject them into the frontier at elevated priority.
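The interval update and recrawl queue from the steps above can be sketched as follows (times in seconds; the 0.7/1.5 factors and 5-minute/30-day bounds are the example values from the text):

```python
import heapq

MIN_INTERVAL = 5 * 60           # 5 minutes
MAX_INTERVAL = 30 * 24 * 3600   # 30 days

def update_interval(interval, changed):
    """Shrink the interval when content changed, grow it when it did not."""
    factor = 0.7 if changed else 1.5
    return min(MAX_INTERVAL, max(MIN_INTERVAL, interval * factor))

class RecrawlQueue:
    """Priority queue ordered by next_scheduled_recrawl = last_fetch + interval."""

    def __init__(self):
        self.heap = []  # entries: (next_recrawl_time, url)

    def schedule(self, url, last_fetch_time, interval):
        heapq.heappush(self.heap, (last_fetch_time + interval, url))

    def due(self, now):
        """Pop every URL whose scheduled recrawl time has arrived."""
        ready = []
        while self.heap and self.heap[0][0] <= now:
            ready.append(heapq.heappop(self.heap)[1])
        return ready
```

On each scheduling cycle, the crawler calls `due(now)` and injects the returned URLs into the frontier at elevated priority.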

Using HTTP Conditional Requests for Efficient Recrawling

On recrawls, send If-Modified-Since (with the previous Last-Modified date) or If-None-Match (with the previous ETag). If the server responds 304 Not Modified, skip downloading the body entirely — the page has not changed. This saves bandwidth (no body transfer) and processing time (no parsing needed). In practice, 30-50% of recrawl requests receive 304 responses, cutting recrawl bandwidth nearly in half.
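A sketch of a conditional refetch with the standard library. The User-Agent URL is a placeholder, and note that `urllib.request` surfaces a 304 response as an `HTTPError` rather than a normal response.

```python
import urllib.request
from urllib.error import HTTPError

def conditional_headers(meta):
    """Build revalidation headers from metadata saved at the previous fetch."""
    headers = {"User-Agent": "MyCrawler/1.0 (+https://mycrawler.example/about)"}
    if meta.get("etag"):
        headers["If-None-Match"] = meta["etag"]
    if meta.get("last_modified"):
        headers["If-Modified-Since"] = meta["last_modified"]
    return headers

def refetch(url, meta):
    """Return (changed, body); a 304 means the stored copy is still current."""
    req = urllib.request.Request(url, headers=conditional_headers(meta))
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return True, resp.read()
    except HTTPError as err:
        if err.code == 304:
            return False, None  # unchanged: skip the download and re-parse
        raise
```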

Sitemap.xml as a Recrawl Signal

XML sitemaps (sitemap.xml) include a lastmod timestamp and a changefreq hint for each URL. While these values are not always accurate (many sites set them incorrectly), they provide a useful initial estimate for recrawl scheduling before the crawler has enough history to compute adaptive intervals. Parse sitemaps from every domain and use the lastmod values as the initial recrawl schedule, then refine with observed change data.

Change Detection Beyond Content Hash

Simple content hash comparison misses meaningful changes buried in noise. Ads, timestamps ("Page generated at 10:42 AM"), view counters, and dynamic sidebars change on every fetch without indicating real content change. Robust change detection strips boilerplate (headers, footers, ads, navigation) before fingerprinting, focusing only on the main content area. Libraries like boilerpipe and readability.js extract main content from web pages.

Priority scoring for recrawl combines multiple signals into a single score that determines which pages get recrawl budget. The formula typically looks like: recrawl_priority = page_importance_score * change_probability * (time_since_last_crawl / expected_interval). Pages that are important, likely to have changed, and overdue for a check receive the highest priority.
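The scoring formula translates directly into code. This is a sketch; real systems also clamp and normalize each signal before combining them.

```python
def recrawl_priority(importance, change_probability, time_since_last_crawl, expected_interval):
    """recrawl_priority = importance * change_probability * overdue ratio."""
    overdue = time_since_last_crawl / expected_interval
    return importance * change_probability * overdue
```

A page that is twice as overdue as its expected interval gets double the score of an otherwise identical page that is exactly on schedule.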

The Freshness-Coverage Tradeoff

Every recrawl fetch is a fetch that could have been used to discover a new page. At Google scale, the crawl budget is split roughly 60-70% recrawl and 30-40% new discovery. If you over-index on freshness, your coverage (fraction of the web in your index) shrinks. If you over-index on coverage, your existing pages go stale. There is no universally correct split — it depends on your use case. A news search engine prioritizes freshness; a web archive prioritizes coverage.

Fixed Recrawl Intervals

A common mistake is assigning a fixed recrawl interval to all pages (e.g., "recrawl everything weekly"). This wastes budget on static pages that never change and under-crawls dynamic pages that change hourly. Always use adaptive intervals based on observed change history. Even a simple exponential backoff (double the interval each time a page is unchanged, halve it each time it changes) dramatically improves freshness compared to fixed scheduling.

How this might come up in interviews

Web crawler design is a top-tier system design interview question at Google, Meta, Amazon, and Microsoft. It appears directly ("Design a web crawler for a search engine") and indirectly in questions about SEO indexing, content aggregation, or building a search engine. Interviewers use it to test your understanding of graph traversal at scale, distributed system coordination, probabilistic data structures, network I/O optimization, and the ability to make trade-offs between competing constraints (coverage vs. freshness, speed vs. politeness, memory vs. accuracy). Strong candidates structure their answer around the URL frontier design first, then layer on fetching, deduplication, distribution, and recrawl scheduling.

Common questions:

  • Design a web crawler that can crawl 1 billion pages in one week. Walk me through the architecture from seed URLs to stored content.
  • How would you ensure your crawler does not overwhelm any single web server? Describe the politeness mechanisms in detail.
  • Your crawler is stuck in an infinite loop on a calendar site generating pages for every future date. How do you detect and escape this?
  • Explain how you would deduplicate URLs at the scale of 10 billion discovered URLs. What data structure would you use and how much memory does it need?
  • How would you design the recrawl scheduling system to keep high-value pages fresh while still discovering new content?
  • Compare BFS, DFS, and priority-based crawling. Which would you use for a search engine and why?

Try this question: Ask the interviewer: What is the target scale (number of pages, crawl cycle duration)? Are we building a general web crawler or a focused crawler for specific domains? Do we need to handle JavaScript-rendered pages? What is the freshness requirement for the index? Are there specific legal or ethical constraints to consider?

Strong answer: Drawing the Mercator two-layer frontier design from memory. Calculating Bloom filter size for the target URL count. Discussing SimHash for near-duplicate content detection. Mentioning consistent hashing for domain-based URL partitioning. Explaining adaptive recrawl intervals with change detection. Bringing up DNS pre-fetching as a latency optimization.

Red flags: Using a simple FIFO queue as the frontier without discussing priority or politeness. Not mentioning robots.txt or crawl delays. Ignoring spider traps entirely. Proposing to store all URLs in a hash set without considering memory (10B URLs x 50 bytes = 500 GB). Not discussing how to distribute the crawler across multiple machines.

Key takeaways

  • The URL frontier is the most critical data structure in a crawler: a two-layer design with front queues for priority and back queues for per-domain politeness (Mercator architecture) is the industry standard. Without this design, the crawler either wastes budget on low-value pages or overwhelms individual web servers.
  • Deduplication operates at two levels — URL-level (Bloom filters for "have I seen this URL?") and content-level (SimHash for "have I seen this content?"). A Bloom filter for 10 billion URLs at 1% false positive rate requires only 12 GB of memory, making it practical for even the largest crawls.
  • Politeness is non-negotiable: respect robots.txt directives, enforce per-domain crawl delays (minimum 1 second between requests), identify your crawler via User-Agent, and implement exponential back-off for errors. Impolite crawling leads to IP blocks, legal action, and degraded access to the web.
  • Spider traps (infinite calendars, session IDs, parameter explosions) can consume unlimited crawl budget. Defense requires multiple layers: per-domain page caps, max crawl depth, URL pattern detection, and content novelty decay — no single technique catches all traps.
  • Adaptive recrawl scheduling allocates finite crawl budget optimally by adjusting per-page revisit intervals based on observed change rates and page importance. Pages that change frequently and rank highly get recrawled most often; static low-value pages wait weeks or months between visits.
Before you move on: can you answer these?

Why does a production web crawler need both URL-level deduplication (Bloom filter) and content-level deduplication (SimHash)?

URL-level dedup catches exact URL matches (same URL seen before) but misses cases where different URLs serve identical content — printer-friendly versions, paginated views, tracking parameter variants, and syndicated content across domains. Content-level dedup (SimHash) catches these cases by comparing page content fingerprints regardless of the URL. Together they prevent both refetching known URLs and storing duplicate content from different URLs, which can account for 20-30% of all pages on the web.

Explain why domain-based URL partitioning is preferred over URL-based partitioning in a distributed crawler.

Domain-based partitioning assigns all URLs from a given domain to the same worker node. This means per-domain politeness enforcement (rate limiting, Crawl-delay compliance, robots.txt caching) is entirely local — no cross-worker coordination needed. URL-based partitioning distributes individual URLs across workers, so two workers might simultaneously fetch from the same domain, violating politeness constraints unless they coordinate via a shared rate limiter. The local politeness advantage of domain-based partitioning outweighs its load imbalance downside (hot domains), which can be mitigated with work-stealing or domain splitting.

How does adaptive recrawl scheduling determine the optimal interval for revisiting a page?

Adaptive scheduling tracks each page's change history across multiple recrawl visits. If a page changed since the last visit, the recrawl interval is shortened (e.g., multiplied by 0.7). If it was unchanged, the interval is lengthened (e.g., multiplied by 1.5). This converges toward an interval that roughly matches the page's actual change frequency. The interval is further weighted by page importance (high-PageRank pages get more frequent recrawls for the same change rate) and bounded by minimum (5 minutes) and maximum (30 days) limits to prevent over-crawling or stale content.

🧠Mental Model

💡 Analogy

Think of a web crawler as a massive library cataloging expedition. You start with a list of known library locations (seed URLs). Scout teams (fetcher workers) are dispatched to each library, where they photograph every book (download pages) and record which other libraries are referenced in bibliographies (extract links). Back at headquarters, catalog cards (the URL frontier) are organized into priority trays — rare collections get visited first, common branches can wait. Each scout follows strict visiting rules: only visit during open hours, wait between requests so you do not overwhelm the librarian, and always check the posted visiting policy (robots.txt) before entering. Meanwhile, a deduplication team (Bloom filter + SimHash) checks if we already have a copy of each book, and a scheduling team decides when to send scouts back to check for new editions. The expedition never truly ends — the library system keeps growing, and books keep getting updated.

⚡ Core Idea

A web crawler is a distributed graph traversal system that explores the hyperlink graph of the web. The core challenge is not fetching pages — that is straightforward HTTP. The challenge is deciding what to fetch next (frontier prioritization), avoiding redundant work (deduplication), being a respectful visitor (politeness), surviving adversarial content (trap detection), and keeping your data current (recrawl scheduling) — all at a scale of billions of pages across hundreds of millions of domains.

🎯 Why It Matters

Web crawlers are the invisible foundation of the search economy. Every Google search result, every SEO optimization, every price comparison tool, and every AI training dataset built from web data depends on a crawler having fetched that content. At FAANG scale, crawler efficiency directly translates to search quality and business outcomes. A crawler that wastes budget on duplicate or trapped content provides a worse index; a crawler that is too aggressive gets blocked by major sites; a crawler that recrawls too slowly serves stale results. Getting the balance right across these competing concerns is what makes this a deep system design problem.
