Skip to content

URL frontier design: from Mercator to modern priority-queue crawlers

· 22 min read
Copyright: MIT
URL FRONTIER wordmark with a column of host queues feeding a single fetch arrow

A web crawler is, stripped down, a loop: take a URL, fetch it, pull the links out, add them back to the pile, repeat. The hard part is the pile. Once you have billions of URLs in it, the order you pull them out in decides almost everything that matters. Pull the wrong ones and you waste a week re-fetching a spam farm’s infinite calendar. Hit the same host too fast and you get your IP banned, or worse, take down someone’s server and earn a complaint to your upstream provider. Pull them in pure discovery order and your index of news sites goes stale while you grind through link rot.

That pile is the URL frontier. It is one component of a crawler, and it carries two jobs that pull against each other. It has to decide which URL is most worth fetching next, and it has to make sure the next URL it hands out does not violate politeness toward the host it points at. Priority and politeness, in one data structure, at a scale where most of it lives on disk. This post is about how that structure is built.

We will start with the canonical answer, the front-queue/back-queue design from Compaq’s Mercator crawler, because almost every later system is a variation on it. Then the politeness machinery in detail: the per-host queues, the host-time heap, the gap heuristic. Then the part textbooks skip, which is what happens when the frontier outgrows RAM, where IRLbot’s disk-batched approach changed the economics. Then freshness versus coverage, the scheduling question that turns a frontier from a queue into a planner. And finally the modern shape of the thing: Common Crawl’s batch model, and the gRPC frontier service that decouples the queue from the crawler entirely.

The two jobs of a frontier

Before the design, the contract. A frontier sits between the part of the crawler that discovers URLs (the parser, pulling links out of fetched pages) and the part that fetches them (a pool of worker threads, each wanting a URL to download). It is a producer-consumer buffer with opinions.

The opinions are three constraints, and the classic statement of them comes from the Mercator design as written up in Manning, Raghavan, and Schütze’s Introduction to Information Retrieval. A good frontier ensures only one connection is open to any host at a time, leaves a polite gap between successive requests to the same host, and keeps high-priority pages moving ahead of low-priority ones. Those three are not independent. The first two are about not hurting hosts. The third is about spending a finite fetch budget well. A naive FIFO queue gets you none of them: it interleaves hosts randomly (so you might hammer one host that happens to have many links on a popular page) and it treats a homepage and a session-id-bloated search result as equals.

So the frontier is really two schedulers stacked on top of each other. One picks what to fetch by importance. The other picks what to fetch by who is allowed to be contacted right now. Mercator’s insight was that you can build each as its own bank of simple FIFO queues, and wire them together.

The Mercator design: front queues and back queues

Mercator was Allan Heydon and Marc Najork’s crawler at Compaq’s Systems Research Center, written entirely in Java and described in a 1999 World Wide Web journal paper. It became the reference architecture, and its frontier is the one every IR textbook draws. The structure has two tiers.

discovered URL in prioritizer (1..F) F1 F2 .. FF front queues router: pick queue by host hostA hostB .. hostB back queues host-time heap min t_e -> worker thread fetches *The two tiers: front queues sort by priority, back queues sort by host, and the heap picks which host is allowed to be fetched next.*

The upper tier is a set of F front queues, all FIFO. When a URL arrives, a prioritizer assigns it an integer priority between 1 and F and appends it to the matching queue. The priority comes from fetch history, specifically, as the IR book puts it, “the rate at which the web page at this URL has changed between previous crawls.” A page that changes often gets a higher number and rides a higher-priority front queue. That is the entire prioritization mechanism. The number of front queues, plus the policy for assigning priorities and for choosing which front queue to pull from, is the knob that defines the crawl’s importance model.

The lower tier is a set of B back queues, also FIFO, and these carry the politeness invariants. Each back queue holds URLs from exactly one host, and the system maintains two properties: a back queue is non-empty while the crawl is in progress, and it contains URLs from a single host only. An auxiliary table maps each host to its back queue, so the router can find the right one in constant time. When a back queue drains empty, the router refills it by pulling from a front queue (biased toward the higher-priority ones) and routing each pulled URL to the back queue for its host, creating new back queues as new hosts appear.

The wiring between the tiers is where the two schedulers meet. The front queues decide the rate at which a host’s URLs flow into the back tier. The back queues decide that within a host, fetches happen in arrival order and one at a time. Priority shapes the inflow, politeness shapes the outflow, and the two never have to know about each other directly.

Politeness: the host-time heap and the gap

A worker thread that wants something to fetch does not scan the back queues looking for a non-empty one. That would not respect timing. Instead the frontier keeps a heap with one entry per back queue, and each entry is the earliest time the host for that queue may be contacted again. Call it t_e. The thread extracts the root of the heap (the host whose next-allowed time is soonest), waits until that time if it is in the future, fetches the URL at the head of that back queue, and then computes a fresh t_e for the host and pushes it back onto the heap.

How long is the gap? Mercator’s heuristic is deliberately adaptive: insert a gap before the next request to a host that is an order of magnitude larger than the time the last fetch from that host took. Fetch a page that took 200 ms to download and you wait roughly two seconds before the next request to that host. The slower a host responds, the more breathing room the crawler gives it, which is exactly the right reflex, because a slow response is often a sign the server is loaded. A fixed delay cannot do that. It either over-throttles fast servers or under-throttles struggling ones.

This is also where the robots.txt rules land in practice. The Robots Exclusion Protocol became a real IETF standard in 2022 as RFC 9309, nearly three decades after Martijn Koster’s original 1994 convention. The RFC formalizes the allow/disallow matching and the caching rules, but it pointedly leaves out Crawl-delay. There was never consistent behavior to standardize. The directive means different things to different crawlers: Yandex reads it as seconds to wait between visits, Bing treats it as a time window during which it will hit the site once, and Google ignores it outright. A frontier that honors crawl-delay therefore has to fold a per-host minimum into the t_e computation, taking the max of its own adaptive gap and whatever the host asked for. The robots rules themselves get checked at fetch time, not at enqueue time, because a robots.txt file can change while a URL sits in the frontier for hours.

One subtlety the IR book is explicit about. To keep all the worker threads busy without ever letting two of them touch the same host, you want comfortably more back queues than threads. The Mercator designers recommend roughly three times as many back queues as crawler threads. With B at three times the thread count, when a thread finishes with a host and that host’s next-allowed time is far off, there is almost always another host whose time has come, so threads rarely idle. Too few back queues and threads stall waiting on politeness timers. Too many and you are tracking more hosts in memory than you have work for.

host A host B fetch (fast) gap ~ 10x last fetch fetch (slow) -> longer gap one connection per host at a time; threads fill the gaps with other hosts *The gap is adaptive: a slow host earns a longer wait. The heap interleaves hosts so no thread sits idle during a gap.*

The single-connection-per-host rule has one well-known sharp edge. Hosting many logical sites behind one IP, or one site across many subdomains, breaks the assumption that “host” is the right unit of politeness. A crawler keying back queues purely on hostname can open many simultaneous connections to one physical server hosting a thousand subdomains. Production crawlers often key politeness on something coarser, the registered domain or the resolved IP, to avoid this. The URL Frontier API makes the choice explicit, letting the client decide whether the queue key is hostname, domain, IP, or anything else. More on that design below. The trade-off is real: key on IP and a CDN that fronts millions of unrelated sites becomes one giant politeness bottleneck.

For a fuller treatment of where these rules come from and how they behave at scale, see our companion post on crawl politeness and robots.txt.

When the frontier outgrows RAM

The textbook diagram has a comforting smallness to it. A few dozen front queues, a few thousand back queues, a heap. In a real crawl the frontier holds hundreds of millions of URLs, far more than fits in memory, so most of it lives on disk. The IR book is matter-of-fact about this: the FIFO subqueues keep fixed-size enqueue and dequeue buffers in memory (the reference implementation used buffers holding 600 URLs each) and spill the body of each queue to disk. The buffers amortize disk I/O so a thread enqueuing or dequeuing rarely waits on a seek.

That works until you ask the harder question, which is not how to store the frontier but how to test whether a URL is already in it. Every link parsed out of every page has to be checked against the set of URLs already seen, or the crawler re-fetches the same pages forever and the frontier fills with duplicates. At a few million pages you keep the seen-set in RAM. At billions you cannot, and the check becomes the bottleneck that decides whether the whole crawl scales. This is the URL-seen problem, and it is worth its own post; we cover the probabilistic approach in Bloom filters and the URL-seen problem. But the frontier and the seen-test are coupled at the disk layer, and the cleanest demonstration of why came from a crawler built at Texas A&M.

IRLbot, by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov, won best paper at WWW 2008 on the strength of a single-server crawl that ran for 41 days and pulled down 6,380,051,942 unique HTML pages. The full numbers are worth stating because they are concrete: 7,606,109,371 connection requests issued, 7,437,281,300 HTTP responses received from 117,576,295 hosts, an average download rate of 319 mb/s or 1,789 pages per second, 394,619,023,142 links parsed, and 41,502,195,631 unique pages discovered across 641,982,061 hosts. One server. The point of the paper is that getting there required rethinking how the seen-test touches disk.

The paper’s diagnosis of the prior art is precise. Earlier disk-based seen-tests, which it groups under the name Mercator-B and a system called Polybot, accumulate a buffer of new URLs in memory, then merge that buffer against a sorted on-disk file of seen-URL hashes in one pass. The merge is correct but its cost grows with the size of the seen-file, and the seen-file grows with the crawl. The paper works through the arithmetic and shows the overhead has a quadratic term in N, the number of pages crawled, which means the per-URL cost of the check rises as the crawl proceeds until it dominates everything. Their estimate for what that does in practice is brutal: at the scale of their crawl, a Mercator-B style check would have run roughly 3,075 times slower than their approach and admitted an average rate of only about 1.4 pages per second. A 41-day crawl becomes a multi-century one.

Their fix is a structure they call DRUM, for Disk Repository with Update Management. The idea is to never seek. DRUM spreads incoming key-value pairs (a URL hash as key, metadata as value) across many disk buckets, each paired with a RAM bucket. Pairs accumulate in the RAM buckets. When a RAM bucket fills, its disk bucket is read, merged, and written back in one large sequential batch using bucket sort, and the check, update, and check+update operations all ride that same batched pass. Because every disk access is a big sequential read or write rather than a seek, DRUM’s throughput stays close to the disk’s streaming bandwidth as N grows, instead of decaying toward the seek rate. The paper frames the gap starkly: a disk seek cannot be pushed below about 3 to 5 milliseconds even with RAID, whereas the equivalent in-RAM operation in their crawler took 5 to 10 microseconds, three orders of magnitude apart, and a general-purpose database degraded to 10 to 50 milliseconds per lookup once it held around 100 million records. Caching helps, but only at hit rates around 99.7 percent, which you cannot count on.

IRLbot’s other contribution sits squarely in the frontier’s priority logic, and it is the part most relevant to anyone who has watched a crawl get stuck. BFS order, the default discovery order, eventually traps a large crawler. The pending-URL queue fills with links from spam farms and dynamically generated infinite webs (think a calendar that links to next month forever) because those structures have enormous branching factors and outvote legitimate URLs. IRLbot’s answer is to budget each domain by its reputation. Their STAR algorithm (Spam Tracking and Avoidance through Reputation) computes a domain’s budget from the number of in-degree links it receives from other domains, a quantity spammers cannot inflate cheaply, and scales each domain’s allowed download rate to that budget. BEAST (Budget Enforcement with Anti-Spam Tactics) then enforces it without the live-lock that naive budget-checking causes: rather than repeatedly re-scanning over-budget URLs, it spreads URLs across a growing number of disk queues by whether they fit the budget, so over-budget URLs sink into a low-frequency queue instead of burning CPU on every pass. The frontier stops being a pure discovery queue and becomes a spend-control system.

cost per URL-seen check as the crawl grows (schematic) N pages crawled → cost / URL → seek-based merge: ~quadratic DRUM batched: near streaming bw seek floor 3-5 ms RAM op 5-10 us *Why disk strategy decides scale: a per-check seek cost compounds with N, while a batched sequential merge tracks the disk's streaming bandwidth.*

The lesson generalizes past IRLbot. Whenever the frontier and its companion structures outgrow RAM, the design question stops being “what data structure” and becomes “what access pattern.” Sequential beats random by three orders of magnitude on spinning disks and still wins handily on SSDs once you count write amplification. Mercator’s 600-URL buffers, IRLbot’s RAM buckets, and the segment-batched generate step in Apache Nutch are the same move in different costumes: collect work in memory, touch disk in big contiguous gulps. We go deeper on the system-wide version of this in designing a distributed crawler.

Freshness versus coverage

So far the priority in the front queues has been a single number from change-rate history. The harder, longer-running version of that question is the one a continuously running crawler faces. Given a fixed fetch budget, do you spend the next fetch discovering a page you have never seen (coverage) or re-fetching a page you already have to check whether it changed (freshness)? Every fetch is one or the other. The frontier’s priority policy is where that trade-off gets resolved, fetch by fetch.

The formal grounding is older than most production crawlers. Junghoo Cho and Hector Garcia-Molina’s work on refresh policies, with the fullest version in their 2003 ACM Transactions on Database Systems paper, modeled page change as a Poisson process and defined the two metrics a freshness-seeking crawler optimizes. Freshness is the fraction of your local copies that are currently up to date. Age is how stale your copies are on average. They verified the Poisson assumption empirically against 720,000 pages from 270 sites crawled daily for four months, then derived the policy that maximizes freshness under a fixed crawl rate.

Their result is the counterintuitive one worth remembering. The obvious policy, re-fetch each page at a frequency proportional to how often it changes, is not optimal. A page that changes on every visit is, past a point, a waste of budget: you will never hold a fresh copy of it for long no matter how often you re-fetch, so pouring fetches into it has diminishing returns. The freshness-optimal policy is closer to uniform than proportional, and in the extreme it tells you to give up on the most frantically changing pages and spend that budget on pages where a re-fetch actually buys you a meaningful window of freshness. Coverage interacts with this the same way: a newly discovered page that turns out to be a dead, never-changing leaf does not deserve another fetch, while a homepage that reliably surfaces new links earns frequent revisits not for its own content but for its value as a discovery point.

A modern frontier encodes these as the priority that feeds the front queues. Christopher Olston and Marc Najork’s 2010 survey Web Crawling lays out the incremental-crawl version cleanly: balance fetching new pages against refreshing old ones, where the former raises coverage and the latter raises freshness, and let per-page importance (often a PageRank-like score) and per-page change-rate estimates jointly set the revisit interval. In practice that means the prioritizer is not assigning a static 1-to-F bucket once. It is recomputing a score each time a page is fetched, folding in observed change history, an importance estimate, and how long the page has sat unfetched, and feeding the result back into the front queues. The frontier becomes a scheduler with feedback, not a queue.

The modern shape: batch crawls and frontier services

Two production designs show where the architecture went after Mercator. They diverge on a single question: should the frontier be embedded in the crawler, or a service the crawler talks to?

Common Crawl is the embedded, batch answer, and it runs on Apache Nutch. Nutch’s frontier is not a live queue at all. It is a CrawlDb, a big on-disk table of every URL the project knows about with its fetch state and score, and the crawl proceeds in discrete rounds. Each round runs a generate step that selects the top-scoring URLs (capped per host to enforce politeness and capped globally to bound the round), a fetch step that downloads them with one fetch queue per host or domain or IP and a configurable server delay between requests, a parse step, and an update step that merges newly discovered links back into the CrawlDb. The whole loop is a MapReduce-style batch job. Common Crawl reports a CrawlDb on the order of 25 billion URLs costing roughly 40 dollars a month to keep on S3, and aggregate crawl speeds around 40,000 pages per second, with the ceiling set mostly by the politeness policy and the machine count rather than by any single queue. The batch model trades the low latency of a live frontier for operational simplicity: there is no long-lived queue server to keep healthy, just a table and a recurring job.

The other answer is to make the frontier a standalone service with a clean API, so any crawler in any language can use it. That is exactly what Julien Nioche’s URL Frontier project did, funded under the EU’s NGI Zero Discovery program and built as a crawler-neutral gRPC service. The API is small and revealing about what a frontier really needs to expose. A crawler calls GetURLs to pull the next batch ready to fetch, and the service enforces politeness by limiting how many URLs it returns per queue. It calls PutURLs both to add freshly discovered URLs and to report back the outcome of ones it already fetched. URLs carry a queue key the client chooses (hostname, domain, IP, whatever the politeness unit should be), arbitrary metadata, and a refetchable_from_date expressed as seconds since the Unix epoch; the service only emits a URL once that timestamp is at or before now, which is how scheduled re-crawls and politeness windows are expressed in one field. A crawlID namespaces queues so several crawls can share a service without colliding.

crawler fetch + parse frontier service queues by key GetURLs (N, politeness-limited) PutURLs (outcomes + new URLs) URL fields: key, metadata, refetchable_from_date, status service emits a URL only when refetchable_from_date ≤ now *The frontier as a gRPC service: the crawler asks for work and reports outcomes, and a single date field expresses both re-crawl scheduling and politeness windows.*

What that API makes obvious is how little the consumer side of a frontier actually is. Hand me the next N URLs I am allowed to fetch; take back my results and my new discoveries. Everything hard, the prioritization, the per-queue politeness timing, the dedup, the disk batching, the freshness scheduling, lives behind those two calls. The same separation shows up inside StormCrawler, the Apache Storm-based pipeline Nioche also built, where a frontier component emits URLs into the topology and a partitioner routes them to fetcher bolts by host key so one bolt instance owns one host’s politeness. Decoupling the frontier from the fetch loop is the architectural through-line from Mercator’s two-tier queues to today’s services: keep the part that decides what and when separate from the part that does the downloading.

What the design has and has not changed

The striking thing, looking across thirty years of these systems, is how stable the core idea has been. Heydon and Najork’s split, priority on the way in and politeness on the way out, mediated by a per-host structure and a time-ordered heap, is still the shape of every serious frontier. Nutch’s CrawlDb and the gRPC URL Frontier are recognizably the same machine with different I/O models bolted on. The front-queue/back-queue diagram from a 1999 paper survives because the two constraints it separates, importance and harm-avoidance, are genuinely orthogonal, and the cleanest design keeps orthogonal concerns in separate structures.

What changed is everything around the edges of that core, and it changed because of scale. The seen-test went from an in-memory set to a Bloom filter to IRLbot’s disk-batched DRUM, because at six billion pages the cost of a single disk seek per URL is the difference between a six-week crawl and one you never finish. Priority went from a static change-rate bucket to a feedback scheduler balancing coverage against Cho and Garcia-Molina’s freshness math, because a frontier that only discovers will watch its index rot, and one that only refreshes will never grow. Politeness keying drifted from hostname toward domain and IP, because the web consolidated behind CDNs and shared hosts and the naive unit stopped matching the thing you were trying not to overload.

If there is one number to carry away, it is IRLbot’s 3,075. That is the factor by which the wrong disk strategy would have slowed the same crawl on the same hardware, not from a smarter algorithm in the abstract sense, but from arranging the work so the disk reads sequentially instead of seeking. A frontier is a scheduling problem dressed as a queue, and at web scale the scheduler that wins is the one that respects the storage hierarchy underneath it. The priority math and the politeness rules are the part you can reason about on a whiteboard. The part that decides whether your crawl finishes is whether the bytes move in batches.


Sources & further reading

  • Heydon, A. and Najork, M. (1999), Mercator: A Scalable, Extensible Web Crawler — the Compaq SRC crawler whose two-tier frontier became the reference design; World Wide Web journal, vol. 2.
  • Manning, C., Raghavan, P. and Schütze, H. (2008), The URL frontier (Introduction to Information Retrieval, ch. 20) — the canonical textbook statement of front queues, back queues, the host-time heap, and the politeness gap heuristic.
  • Manning, C., Raghavan, P. and Schütze, H. (2008), Crawler architecture — where the frontier sits relative to the seen-filter, DNS cache, and robots.txt cache.
  • Lee, H.-T., Leonard, D., Wang, X. and Loguinov, D. (2008), IRLbot: Scaling to 6 Billion Pages and Beyond — WWW 2008 best paper; DRUM, the STAR/BEAST budget control, and the disk-access argument behind the 3,075x figure.
  • Cho, J. and Garcia-Molina, H. (2003), Effective Page Refresh Policies for Web Crawlers — the Poisson change model and the result that freshness-optimal revisiting is closer to uniform than proportional; ACM TODS 28(4).
  • Olston, C. and Najork, M. (2010), Web Crawling — survey covering incremental crawling and the coverage-versus-freshness trade-off; Foundations and Trends in Information Retrieval.
  • IETF (2022), RFC 9309: Robots Exclusion Protocol — the standardized robots.txt; note its deliberate omission of Crawl-delay.
  • Nioche, J. et al. (2021), URL Frontier — API and reference implementation — the crawler-neutral gRPC frontier; GetURLs/PutURLs, queue keys, and refetchable_from_date.
  • NLnet (2021), URL Frontier project page — the NGI Zero Discovery grant and the project’s goal of a language-neutral frontier service.
  • Apache StormCrawler, Documentation — the Storm-based pipeline where a frontier spout feeds host-partitioned fetcher bolts.
  • Common Crawl (2023), Common Crawl’s move to Nutch — the batch generate/fetch/update model and the CrawlDb scale and cost figures.
  • Najork, M. and Heydon, A. (2001), High-Performance Web Crawling — Mercator follow-up detailing the in-memory enqueue/dequeue buffers and on-disk subqueue storage.

Further reading