Skip to content

Designing a distributed crawler: frontier, dedup, politeness, and backpressure

· 21 min read
Copyright: MIT
Wordmark reading CRAWLER over a frontier-to-fetch pipeline diagram with an orange accent

The algorithm fits on an index card. Take a URL off a list, fetch the page, pull the links out of it, add the new ones back to the list, repeat. A first-year student can write it in an afternoon. The version that runs at Google or the Internet Archive is a different animal, and the gap between the two is almost entirely about scale: a billion-page crawl needs roughly 400 fetches a second sustained, every data structure outgrows RAM, and the whole thing has to stay polite enough that web admins do not start filing abuse complaints or blackholing your IP range.

So the interesting question is not “how do I crawl a page.” It is “how do I keep hundreds of worker threads busy, fetching from tens of thousands of distinct hosts at once, without hammering any single server, without re-fetching the same content under forty different URLs, and without the bookkeeping structures spilling off the end of memory and dragging the whole crawl to a halt.” That is a systems problem, and the answer has a lineage that runs from the late-1990s Mercator design straight into how Googlebot decides your crawl rate in 2026.

This post walks the pipeline in the order a URL travels it. First the frontier, the data structure that decides what gets fetched and when. Then duplicate elimination, the URL-seen and content-seen tests that keep the crawl from chasing its own tail. Then politeness and scheduling, the part that turns a fast crawler into a tolerable one. Then fetching, DNS, and parsing, where the practical bottlenecks hide. Then backpressure, the feedback loop where the crawled server gets a vote. The named systems along the way, Mercator, the original Googlebot, IRLbot, and the modern multi-fetcher Googlebot, are reference points; the structure is common to all of them.

The shape of the whole thing

Before the components, the dataflow. A worker thread pulls an absolute URL from the frontier, resolves the host to an IP, checks robots rules, fetches the document, runs a content-seen test to skip duplicate bodies, extracts links, normalizes and filters each one, and submits the survivors to a duplicate-URL eliminator that drops anything already seen. New URLs go back into the frontier. That loop, run by hundreds of threads against a set of shared structures that mostly live on disk, is the crawler.

the fetch loop, one worker among hundreds frontier DNS + fetch content-seen? drop if dup body parse links URL filter robots, scope URL-seen? dedup (DUE) *The loop a single worker runs. The two orange boxes are the structures every new URL must clear, the dedup gate and the frontier, before it can be fetched again.*

Almost everything hard about a crawler is in two of those boxes. The frontier decides ordering and pacing. The duplicate eliminator decides membership. Get those two right and scale them past memory, and the rest is plumbing that you can mostly buy off the shelf. Get them wrong and the crawl either melts a stranger’s web server or grinds to a crawl, no pun intended, the first time the seen-set stops fitting in RAM.

The URL frontier

The frontier is the to-do list, but calling it a queue undersells it. It controls the download schedule. Clients (the link extractors) decide what goes in; the frontier alone decides what comes out and in what order. Two properties have to hold at once, and they pull against each other. High-value pages should come out sooner, which wants a priority order. And the crawler must never pound a single host, which wants a per-host rate limit. A naive FIFO queue gives you neither.

The reason FIFO fails is link locality. Most links on a page point back into the same site, so a breadth-first queue fills up with long runs of URLs that share a host. Drain that queue with a few hundred threads and they all converge on the same unlucky server at the same instant. The Mercator authors saw exactly this in production and described it plainly: such behavior is socially unacceptable, and it has the potential to crash some web servers.

The fix that the field settled on is a two-stage frontier, and the canonical description is in the Mercator work and the Stanford IR textbook chapter built on it. A front end of FIFO queues handles priority. A back end of FIFO queues handles politeness. A heap sits between the back end and the workers and handles timing.

front end: priority — back end: politeness prioritizer queue 1 (high) queue 2 queue k (low) biased random pick host A → queue host C → queue host F → queue n back queues, one host each (n ≈ 3×threads) min-heap, keyed by next-allowed-fetch time worker pops root *A URL enters at a priority queue, gets routed to its host's back queue, and waits there. A worker takes whatever host the heap says is ready next, which is what enforces the gap between hits on any one server.*

Here is how a URL moves through it. On the way in, a pluggable prioritizer assigns it a value from 1 to k, based on the URL and its history (how often the page has changed, for instance), and drops it into that front queue. On the way out, a worker does not read the front queues directly. It pops the root of a heap whose entries are back queues keyed by a timestamp, the earliest time that host may be contacted again. If that timestamp is in the future, the worker blocks. When a back queue empties, it gets refilled by pulling from a front queue chosen at random with a bias toward high priority, and each new host is mapped into a back queue so that every back queue holds exactly one host’s URLs. Priority lives in the choice of which front queue to drain; politeness lives in the one-host-per-queue rule plus the heap timestamps.

Two numbers from those production crawls are worth carrying around because they are good defaults. Run about three times as many back queues as worker threads, so there are always enough distinct hosts ready that politeness delays do not leave threads idle. And set each host’s next-fetch time to the current time plus ten times how long the last fetch from that host took, so a slow server automatically gets a longer gap. Those two settings, the authors noted, were enough to keep all threads busy and the complaint rate near zero. The frontier design earns its own deep dive in URL frontier design: from Mercator to modern priority-queue crawlers; the short version is that almost every serious crawler since 1999 is a variation on this two-stage structure.

One more thing the frontier has to do: not fit in memory. A web-scale frontier holds hundreds of millions to billions of pending URLs, far past RAM. The standard trick is to keep each FIFO queue mostly on disk and buffer only a fixed slice of its head and tail in memory. The queue is logically huge and physically a small in-RAM window over a disk file, which is a pattern you will see again in the dedup structures.

Duplicate elimination: the URL-seen and content-seen tests

Two different duplication problems live in a crawler and people conflate them constantly. The first is the same URL discovered many times, because dozens of pages link to your homepage. The second is the same content under different URLs, because of session IDs, tracking parameters, print versions, and mirror sites. They need different tests.

The URL-seen test guards the frontier. Before a freshly extracted URL is allowed in, a duplicate-URL eliminator checks whether it has ever been seen, in the frontier or already downloaded, and silently drops it if so. The naive implementation is a hash set of URL strings, and it explodes immediately: a billion URLs averaging tens of bytes each is tens of gigabytes before overhead. The first compression is to store a fixed-width fingerprint of the URL instead of the string. Mercator used an 8-byte checksum from Rabin’s fingerprinting algorithm, which has good spectral properties and a vanishingly small collision probability, and then squeezed harder by using the high-order 3 bytes to index the hash table and storing only the low 5 bytes in the bucket. Even so, a billion URLs came out to slightly over 5 GB in memory, and that cost grows linearly with the crawl. Past a point you cannot buy your way out with RAM.

So the seen-set goes to disk, and now you have a locality problem. Fingerprints are deliberately uniform, which is great for hashing and terrible for caching, because the stream of cache-missing fingerprints has essentially no locality and every miss becomes a disk seek. At roughly 8 ms a seek, that caps you near 125 seeks per second, which at ten links per page is around 75 page downloads per second. Not nearly enough. The escape is to stop doing point lookups and start batching: buffer incoming fingerprints, and periodically sort the buffer and merge it against a big sorted on-disk file of known fingerprints in one sequential linear pass. Sequential I/O instead of random seeks. Mercator’s batched DUE roughly doubled the seek-limited rate, though even it degrades as the merge grows.

IRLbot pushed this idea to its conclusion with a structure called DRUM, the Disk Repository with Update Management. DRUM buckets key-value pairs (the key being a URL or host hash) across many disk files, fills large in-memory buffers, and processes a whole bucket at a time so the cost per element stays nearly flat as the set grows into the billions. On a single server over a 41-day crawl, IRLbot fetched 6.38 billion unique HTML pages at 1,789 pages per second, and the paper’s headline claim is that at the eight-trillion-page scale a Mercator-style seen-test would have collapsed to roughly 1.4 pages per second while DRUM stayed fast. Whether or not you accept the exact multiplier, the structural point holds: at web scale the bottleneck is not the network, it is the dedup structure’s disk behavior, and the only fix is to convert random access into sequential access.

the URL-seen test, three ways to pay for it in-RAM hash set fast, but capped by memory; grows linearly on-disk, per-lookup seek ~75 pages/s — one 8 ms seek per miss batched sequential merge scales to billions; sort + linear merge *The bar length is throughput, not to scale. The win at the bottom comes entirely from replacing random disk seeks with one big sorted pass.*

There is also the memory-only option that trades exactness for space: a Bloom filter. Burton Bloom’s 1970 paper introduced a bit-array structure that answers set membership with no false negatives and a tunable false-positive rate, at a cost of a small fixed number of bits per element regardless of key size. For a crawler that means you can hold the seen-set entirely in RAM at well under twenty bits per URL. The catch is the failure mode. A Bloom-filter false positive says “seen” for a URL you have never crawled, so you silently skip a real page; there is no false-negative risk, so you never re-crawl a duplicate, but you do quietly lose coverage. Most large crawlers therefore use a Bloom filter as a cheap first gate and fall back to an exact on-disk store only on a positive, or accept the small coverage loss as the price of staying in memory. The tradeoffs are their own topic in Bloom filters and the URL-seen problem in web-scale crawling.

The content-seen test is the other half, and it is what saves you from the mirror-and-session-ID swamp. After a document is fetched but before its links are extracted, the crawler fingerprints the document body and checks whether that exact content has appeared before under any URL. Mercator computed a fingerprint over the document content and kept a set of those fingerprints; a hit means the body is a duplicate, so the worker drops it and does not bother extracting links it has already followed elsewhere. This is a coarse exact-match test, not near-duplicate detection. Catching near-duplicates, the same article with a different ad rail, is a harder problem that the same authors addressed elsewhere with shingling and min-wise fingerprints, and it sits closer to the indexer than the crawler.

Politeness and scheduling

Politeness is the constraint that separates a crawler you can run from one that gets you banned. It has a written half and an unwritten half.

The written half is the Robots Exclusion Protocol. For twenty-eight years it was a de facto convention from a 1994 mailing-list proposal by Martijn Koster, honored by convention and interpreted inconsistently. In September 2022 it finally became RFC 9309, a Proposed Standard authored by Google engineers, which pinned down the parts that had drifted. A crawler fetches /robots.txt before touching real content, matches its user-agent against the groups, and obeys the allow and disallow rules. The RFC also nailed down the operational edges that matter at scale. A crawler should not use a cached robots.txt for more than 24 hours unless the file is unreachable. On a 4xx the crawler may assume no restrictions and crawl freely; on a 5xx or network failure it must assume complete disallow and stay out, which is a deliberately conservative default. And a parser must handle at least the first 500 kibibytes of the file. Notably, crawl-delay never made it into the standard. Different engines read it differently, Bing treats it as a one-request time window and Yandex as a literal inter-request gap, while Google ignores it entirely and exposes a crawl-rate control elsewhere. So a portable crawler cannot rely on crawl-delay meaning any one thing, which is exactly why the real pacing logic lives in the frontier’s per-host timing rather than in a directive you parse off the page. The unwritten rules around robots.txt at scale get their own treatment in Crawl politeness: robots.txt, crawl-delay, and the unwritten rules of scale.

To make robots.txt cheap, you cache it. Mercator kept a fixed-size LRU cache mapping hosts to their parsed rules, default 2^18 entries, so common hosts never re-fetched the file. That cache, plus the 24-hour freshness rule, is the standard arrangement today.

The unwritten half is the pacing the protocol does not specify, and it is enforced structurally rather than declaratively. One open connection per host at a time, and a real gap between successive requests. The Mercator team learned the hard way that one-connection-at-a-time, which they called the weak politeness guarantee, was not enough on its own, because nothing stopped a thread from firing request after request at the same host with no pause between them. Complaints kept coming. The stronger guarantee, the back-queue-plus-heap design above with the next-fetch time set to a multiple of the last fetch duration, is what finally quieted the complaints. The adaptive part matters: tying the delay to observed response time means a struggling server automatically gets hit less, without anyone configuring a per-host rate by hand.

In a distributed crawl, politeness also constrains how you shard. If two machines could both be fetching from the same host, neither can enforce a per-host gap, so the partition has to be by host. Mercator partitioned the host space across a queen and its drones, with each process owning a disjoint set of hosts and the full state for them, no replication. When a link extractor finds a URL belonging to a peer’s host, it routes it there, batching the cross-machine sends because around 80 percent of links are relative and stay local anyway. Hash the host, mod by machine count, send the URL to that machine, and per-host politeness composes cleanly across the cluster because each host has exactly one owner.

Fetching, DNS, and parsing

With ordering and politeness handled, the per-fetch path is where throughput is won or lost, and the bottlenecks are rarely where a newcomer expects. The network is usually not the limit. DNS is.

A single DNS resolution can take seconds, because it may chase referrals across several authoritative servers around the world, and a web-scale crawl resolves an enormous number of distinct hostnames. Caching helps but only partly. The subtler trap is that the standard resolver path is synchronized: the Mercator team found that Java’s DNS interface, and underneath it the BIND gethostbyname on most Unixes, serialized lookups so that effectively one DNS request per address space could be in flight on a cache miss, which throttled hundreds of worker threads down to a trickle. Their fix was to replace the OS resolver with a custom multi-threaded one that issued many lookups concurrently, which cut the share of crawl time spent waiting on DNS to about 14 percent. The CPU cost of DNS was always trivial; the entire problem was concurrency, threads parked waiting on a serialized resolver. Any crawler built today should treat asynchronous DNS as non-negotiable and run its own caching resolver.

The fetch itself wants an abstraction that lets you re-read the response stream, because you fingerprint the body, possibly run several processing modules over it, and extract links, all from one download. Mercator called this a RewindInputStream. You also want a hard cap on how many bytes you pull per resource, because some responses are effectively unbounded and a crawler that reads them all is a crawler waiting to be wedged. Modern Googlebot makes its caps explicit: per its 2026 description, it fetches up to 2 MB of an individual HTML resource, including the HTTP headers, and 64 MB for PDFs. Past the HTML cutoff it does not reject the page; it stops the fetch at exactly 2 MB and hands the truncated bytes downstream as if they were the whole file. That is the right default behavior. Truncate, keep what you got, move on, never let one pathological URL stall a worker.

Parsing is mostly link extraction. Pull the hrefs, derelativize each against the page’s base URL so http://x/a and a relative ../b both become canonical absolute URLs, then run each through a URL filter that applies scope rules and robots before the dedup gate ever sees it. URL normalization is quietly load-bearing here: if your canonicalization is sloppy, the same page slips through under several spellings, your dedup misses them, and the frontier bloats with junk. Lowercasing the host, stripping default ports, resolving dot segments, and deciding what to do with trailing slashes and fragments is unglamorous and it is where a lot of real-world crawl waste comes from.

Backpressure: the crawled server gets a vote

Everything so far is the crawler imposing its will on the network. Backpressure is the network pushing back, and a crawler that ignores it is both rude and, increasingly, simply ineffective, because the modern web answers aggressive crawlers with throttling and blocks rather than complaint emails.

The cleanest backpressure signal is HTTP status. A burst of 429 (too many requests), 500, or 503 responses means back off, and a well-behaved crawler reads it that way. Google’s crawling infrastructure documents exactly this: a significant number of 500, 503, or 429 responses causes it to reduce the crawl rate for the whole hostname, not just the offending URL. The same logic runs in the other direction. Fast, stable, error-free responses let Google raise the rate. This is a closed control loop. Google calls the cap a crawl capacity limit and adjusts it continuously from observed response time, error rate, and connection stability, and in its 2026 account it states plainly that when a server struggles to serve bytes, its fetchers automatically back off and the crawl frequency drops. The lever a site owner has over crawl budget today is mostly server speed, because host load is the binding constraint.

host load as a closed loop crawler per-host rate r origin server capacity C requests at rate r status + latency 200 + fast → raise r 429 / 500 / 503 → cut r *The crawler sets a per-host rate, the server's status codes and latency are the feedback, and the rate moves up on healthy responses and down on the back-off codes. The crawled server effectively dictates its own ceiling.*

Backpressure also runs inside the crawler, not only between crawler and origin. The frontier, the disk-based seen-set, and the outbound cross-machine URL queues are all bounded buffers, and any of them can fill. When the dedup merge falls behind, or a multi-million-page host backs up in its politeness queue and threatens to overflow RAM, the crawler has to stop accepting new work for that host or that structure. IRLbot hit precisely this: large sites, spam and legitimate alike, backed up in politeness rate-limiting until they overflowed memory, and the only fix was to couple politeness with site reputation so that a low-value site exceeding its budget got shed rather than buffered. Their STAR and BEAST mechanisms allocate a per-domain crawl budget from a reputation estimate and push over-budget URLs aside instead of letting them swamp the queue. That is internal backpressure, the crawler protecting itself from its own input, and it is the same shape as a saturated queue anywhere in distributed systems. The thing that makes web crawling distinctive is that the adversary, spam farms generating effectively infinite URLs, is actively trying to overflow you.

The modern wrinkle, mostly out of scope here but worth naming, is that backpressure now arrives dressed as anti-bot defense. A 429 is honest. A challenge page, a JavaScript interstitial, or a silent block from an edge vendor is backpressure delivered as obfuscation, and telling “you are going too fast” apart from “you are not welcome at all” has become its own discipline. The polite-crawler reading of a 503 is to slow down; the reading a defended site wants is to leave. A crawler operating against modern edge defenses has to model both.

What has and has not changed

The surprising thing about reading the 1999 Mercator paper in 2026 is how little of the skeleton has moved. The two-stage frontier, the fingerprint-based seen-set, the per-host politeness heap, the host-partitioned shard, asynchronous DNS, bounded per-resource fetches: all of it is still the architecture, and a crawler you design today will have those same boxes in roughly the same arrangement. The numbers grew by orders of magnitude and the dedup structures got cleverer about disk, IRLbot’s DRUM being the clearest example, but the shape held.

What changed is the environment, and that is where the design pressure now sits. In 1999 the binding constraints were your own disks and your own DNS resolver, and the failure mode was an annoyed sysadmin sending an email. In 2026 the binding constraint is the crawled side: host-load limits that the server controls through its response codes and latency, byte caps the crawler imposes on itself to avoid wedging, and an entire industry of edge defenses that treat unfamiliar crawlers as hostile by default. The frontier still decides what to fetch and when. But the answer to “how fast may I go” no longer lives in your config file. It lives in the 429s coming back at you, and the crawler that does not read them carefully gets throttled to nothing or blocked outright, with all its clever frontier math intact and useless. The hard part of crawling moved from the inside of the box to its edges, and it is still moving.


Sources & further reading

Frequently asked questions

why does a FIFO queue fail as a URL frontier for a large crawler

A plain FIFO queue cannot enforce priority or politeness at once, and link locality breaks it. Most links on a page point back into the same site, so a breadth-first queue fills with long runs of URLs sharing a host. Drained by hundreds of threads, they all converge on one server at the same instant, which the Mercator authors called socially unacceptable behavior with the potential to crash some web servers.

how does a two-stage URL frontier separate priority from politeness

The front end is a set of FIFO queues that handle priority: a prioritizer assigns each URL a value from 1 to k and drops it into a queue. The back end is FIFO queues that each hold exactly one host's URLs, handling politeness. A min-heap between them is keyed by each host's next-allowed-fetch time, and a worker pops the host that is ready soonest, which enforces the gap between hits on any single server.

why does the URL-seen test become disk-bound at web scale and how is it fixed

Storing seen URLs as fingerprints still grows past RAM, so the set moves to disk. Fingerprints are deliberately uniform, which is good for hashing but terrible for caching, so each miss becomes a roughly 8 ms disk seek, capping throughput near 75 page downloads per second. The fix is to stop point lookups and batch: buffer incoming fingerprints, then sort and merge them against a big sorted on-disk file in one sequential pass, turning random seeks into linear I/O.

what is the tradeoff of using a Bloom filter for crawler URL deduplication

A Bloom filter holds the seen-set entirely in RAM at well under twenty bits per URL, regardless of key size, with no false negatives. Its failure mode is false positives: it can report seen for a URL never crawled, so you silently skip a real page and lose coverage. Because there are no false negatives, you never re-crawl a duplicate. Many crawlers use it as a cheap first gate backed by an exact on-disk store, or accept the small coverage loss to stay in memory.

how does a web server influence the rate at which a crawler fetches from it

Backpressure makes the crawled server a participant in setting the rate. A burst of 429, 500, or 503 responses signals the crawler to back off, and Google reduces the crawl rate for the whole hostname, not just the offending URL. Fast, stable, error-free responses let it raise the rate. Google adjusts this crawl capacity limit continuously from observed response time, error rate, and connection stability, so a site owner's main lever over crawl budget is server speed.

Further reading