Bloom filters and the URL-seen problem in web-scale crawling

A crawler that has fetched a billion pages has, by then, extracted somewhere north of fifty billion links. Most of them are duplicates of links it has already seen. Before it enqueues any one of them, it has to answer a single question, billions of times a second across the fleet: have I seen this URL before? Get the answer wrong in one direction and you re-crawl the same pages forever, wasting bandwidth and politeness budget on hosts that did nothing to deserve it. Get it wrong in the other direction and you silently drop pages you were supposed to fetch, leaving holes in the index that nobody notices until a customer complains.

The obvious data structure for that question is a set. Put every URL you have seen into a hash set, test each new URL against it. The obvious data structure is also the one that falls over first, because a set of every URL the crawler has ever touched does not fit in memory, and the moment it spills to disk every membership test becomes a seek. This post is about the structure that bought crawlers room to breathe: the Bloom filter, a 1970 idea that trades a small, bounded lie for a large, real reduction in space. We will start with the problem as Mercator and IRLbot actually framed it, derive the false-positive math from the original sources, then walk the family of variants the crawling and networking literature reached for when the plain filter was not enough.

The URL-seen problem, stated honestly

Frame the crawl as a loop. Pull a URL from the frontier, fetch it, parse out the links, canonicalize each one, and for each ask whether it is new. New URLs go into the frontier; URLs you have seen are dropped. The check at the center of that loop is the URL-seen test, and it is run once per extracted link, not once per fetched page. That distinction is the whole problem. Pages number in the billions; extracted links number in the hundreds of billions.

The 1999 Mercator paper from Compaq’s Systems Research Center put hard numbers on it. To run the URL-seen test, Mercator kept every URL it had seen, in canonical form, in a structure called the URL set. There were far too many entries to hold in memory, so the URL set lived mostly on disk. To save space it did not store the text of each URL but a fixed-size checksum. That is already a concession: the crawler is willing to confuse two distinct URLs that happen to collide on a 64-bit fingerprint, because the probability is negligible and the space saving is not.

What makes Mercator worth reading is what it did instead of reaching for a probabilistic filter. The URL stream has locality. Some links are wildly popular, so an in-memory cache of hot URLs catches most repeats before they touch disk. Mercator reported a 66.2 percent hit rate on a cache of 2^18 popular URLs, a further 9.5 percent on a table of recently added URLs, for a net 75.7 percent of membership tests answered without a disk operation. They went further and engineered locality into the checksum itself: the high-order bits of each URL’s checksum were derived from a fingerprint of its host name, so URLs on the same server landed numerically close together on disk, and the host-name locality of real link graphs turned into access locality on the backing file. The net result was an average of 0.16 seeks and 0.17 reads per URL-set test. That is a disk-backed exact set made cheap by exploiting the shape of the data, and it is a useful baseline precisely because it does not use a Bloom filter at all.

IRLbot, the Texas A&M crawler that fetched 6.38 billion pages over 41 days on a single server in 2007, took the same disk-first position and pushed it harder. It parsed out 394 billion links and found 41.5 billion unique nodes. Its answer to URL-seen was a structure called DRUM (Disk Repository with Update Management), which batches incoming hashed keys in RAM, then merges them against the on-disk set with a bucket sort so the disk is read sequentially rather than seeked per key. The authors modeled DRUM’s overhead and argued it stays close to the theoretical best even as the crawl reaches into the trillions of pages. The point of mentioning both is that the two most-cited web-scale crawlers solved URL-seen with disk discipline, not with a Bloom filter in the hot path. The Bloom filter is the answer when you want to keep the test in memory and you can tolerate the lie. Knowing where it does and does not earn its place is most of the engineering.

What Bloom actually proposed in 1970

Burton H. Bloom, then at Computer Usage Company in Newton Upper Falls, Massachusetts, published “Space/Time Trade-offs in Hash Coding with Allowable Errors” in the July 1970 Communications of the ACM, volume 13, number 7, pages 422 to 426. The paper does not mention crawlers, web pages, or networks. It mentions hyphenation.

The motivating example is worth keeping because it is the cleanest statement of the whole idea. Suppose a program hyphenates English words. A few simple rules handle 90 percent of words correctly; the other 10 percent need a dictionary lookup. The dictionary is too big for core memory, so it lives on disk. Bloom’s move is to keep a small in-core structure that, given a word, tells you whether it is in the hard 10 percent. The structure is allowed to occasionally say yes when the answer is no. When it does, you pay one needless disk lookup, find nothing, and fall back to the simple rules. You never pay for a missed lookup, because the structure never says no when the answer is yes. The error is one-sided by construction, and the rare false yes is cheap to absorb. Swap “hard word” for “URL I have already seen” and “disk lookup” for “frontier enqueue” and the structure transfers wholesale.

The paper describes two methods. Method 1 keeps the conventional cell layout but shrinks each cell to a short code rather than the full message, accepting collisions between distinct messages that hash to the same code. Method 2 is the one everyone now means by “Bloom filter.” Quoting the mechanism: the hash area is treated as N individually addressable bits, all initially 0; each message is hash-coded into d distinct bit addresses, and all d of those bits are set to 1; to test a new message, you generate its d addresses and accept if all d bits are 1, reject if any is 0. A zero anywhere is a definite no. All ones is a probable yes.

*A Bloom filter answers with certainty in exactly one direction. A single zero among the probed positions rules an element out; all-ones means probably in, possibly a collision.*

Bloom also saw the tuning rule that the modern math formalizes. He noted that adding another hash function helps until the array fills past the point where each probed bit is too likely to already be 1, and that this crossover lands when half the bits are 1 and half are 0. That observation, in a 1970 paper, is the same half-full optimum the survey literature later derived with calculus.

The false-positive math

The clean treatment is Broder and Mitzenmacher’s “Network Applications of Bloom Filters: A Survey,” in Internet Mathematics volume 1, number 4, pages 485 to 509. Take an array of m bits, n inserted elements, and k hash functions assumed to be independent and uniform over the m positions. After all n elements are inserted, the probability that any specific bit is still 0 is (1 - 1/m)^(kn), which is close to e^(-kn/m). Call the expected fraction of zero bits p. A query for an element not in the set probes k positions; it falsely reports membership only if all k of them happen to be 1. So the false-positive rate is approximately (1 - p)^k, which expands to:

1
f  ≈  (1 - e^(-kn/m))^k

That formula has the trade-off built in. More hash functions give a non-member more chances to land on a zero and be correctly rejected, but more hash functions also set more bits per insert, which fills the array faster and pushes p down. Minimizing f over k, the survey shows the optimum is:

1
k  =  ln 2 · (m/n)   ≈   0.693 · (m/n)

At that k the array is exactly half full, p equals one half, and the false-positive rate becomes (1/2)^k, equivalently (0.6185)^(m/n). The pleasant fact that the optimum sits at a half-full array does not depend on the asymptotic approximation; it falls out of the exact expression too. Read the second form as a budgeting tool. The false-positive rate decays geometrically in bits-per-element. Want one-in-a-hundred? You need about 9.6 bits per element. One-in-a-thousand costs about 14.4. One-in-a-million, about 28.8. Those are flat costs per URL regardless of how long the URLs are, which is the entire appeal: a hundred-byte URL costs the same two bytes of filter as a thirty-byte URL.

*Each tenfold cut in the false-positive rate costs a fixed increment of bits per element, roughly 3.3, because the rate decays geometrically in m/n. URL length never enters into it.*

One caveat the survey flags and that bites in practice: a Bloom filter is about 44 percent larger than the information-theoretic minimum for an approximate-membership structure at the same false-positive rate. The 1.44 factor, give or take, is the tax you pay for the filter’s simplicity. It is usually worth paying, but it is the gap that later structures set out to close.

There is a second, sharper caveat for crawling specifically. The math above assumes a fixed n chosen in advance. A crawl does not know n in advance; the set of seen URLs grows without an a priori bound. If you size the array for a billion URLs and the crawl finds ten billion, the array fills, p climbs toward one, and the false-positive rate degrades toward certainty. A false positive in URL-seen means the crawler decides it has already seen a URL it has not, and silently drops it. That is the failure mode IRLbot’s authors were wary of when they kept the exact set on disk. It is also exactly the problem the scalable variant was built to fix.

When false positives are not symmetric

It is worth being precise about what a Bloom filter false positive costs in a crawler, because the cost is direction-dependent and the direction matters.

The filter never produces a false negative. If a URL was inserted, every probe will find its bits set, so the filter will never claim a seen URL is new. Good: you will never re-crawl a page because the dedup structure forgot it. The error is the other way. The filter can claim a brand-new URL has been seen. When that happens the crawler drops a page it should have fetched. There is no second chance and no log line; the URL simply never enters the frontier. At a false-positive rate of one in a million and a frontier of tens of billions of URLs, that is tens of thousands of pages quietly missing from the crawl.

Whether that is acceptable is a product decision, not an algorithmic one. For a broad web index where coverage is statistical anyway, losing one URL in a million to a filter collision is noise next to the URLs lost to timeouts, robots rules, and dead hosts. For a targeted crawl that must fetch every product page on a single site, a one-in-a-million silent drop might be unacceptable, and you would back the filter with an exact on-disk check for any URL the filter reports as seen, paying a disk lookup only on the rare positive rather than on every query. That two-tier arrangement is exactly Bloom’s 1970 hyphenation design: cheap probabilistic gate in front, authoritative slow check behind, and the gate sized so the slow check runs rarely. The dedup stage sits inside the larger frontier machinery, and if you are building that machinery from scratch the trade-offs around it are their own topic, covered in designing a distributed crawler and URL frontier design.

Counting Bloom filters, when you need to delete

The plain filter cannot delete. You cannot clear the bits for one element, because each bit may also be holding up some other element that hashed to the same position; zero it and you risk a false negative for that other element, which breaks the one-sided guarantee. For a crawler that only ever grows its seen-set, this is fine. For a crawler that wants to forget URLs after a recrawl interval, or a cache that evicts, it is not.

The counting Bloom filter fixes deletion by replacing each bit with a small counter. The structure comes from Li Fan, Pei Cao, Jussara Almeida, and Andrei Broder’s “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” from SIGCOMM 1998, where proxy caches needed to gossip compact summaries of their contents and update them as objects entered and left. Insert by incrementing the k counters; delete by decrementing them; an element is present if all k counters are non-zero. Deletion is safe now because a counter records how many elements landed on it, so decrementing for one element does not wrongly clear a slot another element still needs.

*The counting variant trades space for deletion. Four-bit counters, enough to make overflow vanishingly rare in practice, cost three to four times the memory of a plain bit array at the same false-positive rate.*

The cost is space. Four bits per slot is the usual choice, which makes counter overflow rare enough to ignore, and that is three to four times the memory of the plain bit array for the same false-positive rate. For a crawler the question is whether you actually need in-structure deletion. Often you do not; you would rather roll over to a fresh filter on a schedule, or partition the seen-set by time window and drop whole filters. So counting filters show up more in caches and routers, the summary-cache lineage, than in the URL-seen path of a crawler. They matter here as the first answer to deletion and as the baseline the cuckoo filter later beat on space.

Scalable Bloom filters, when you do not know n

The deeper mismatch between Bloom’s structure and a crawl is the fixed size. The standard analysis hands you m once you fix n and the target rate, and a crawler does not know n. Almeida, Baquero, Preguiça, and Hutchison addressed this directly in “Scalable Bloom Filters,” Information Processing Letters volume 101, issue 6, 2007, pages 255 to 261.

The construction is a sequence of plain filters. Start with one filter sized for some initial capacity at false-positive probability P0. When it fills to its target load, freeze it read-only and add a new filter with larger capacity and a tighter false-positive probability P1 = P0·r, where r is a tightening ratio strictly between 0 and 1. Each subsequent filter tightens by another factor of r, so the per-stage rates form a geometric progression. A query checks every filter in the sequence and reports present if any one matches; an insert goes only into the active, last filter.

The reason the rates have to tighten is the union bound. A query touches all the filters, so the compounded false-positive probability of the whole stack is the probability that at least one stage produces a false positive. Because the stage rates shrink geometrically as P0, P0·r, P0·r^2 and so on, their sum converges, and the compounded bound for the whole structure stays under P0 · 1/(1 - r). Choose r around one half and the geometric tail is small, so the overall rate is held near a fixed ceiling no matter how many stages accrete. The crawler gets a filter that grows with the seen-set and never lets the false-positive rate drift, at the cost of querying a handful of filters per lookup instead of one. For a URL-seen structure that must run an unbounded crawl without a recrawl from RAM exhaustion, this is the variant that fits the workload, and it is what the open-source scalable-Bloom implementations in Python and Go are modeling.

Blocked Bloom filters, when memory bandwidth is the wall

A plain filter at the optimal k probes k independent positions, and for a large array those positions land on k different cache lines. Every lookup is therefore k cache misses. When the filter is large and the lookup rate is high, the structure stops being bound by computation and starts being bound by how fast the memory system can serve random lines. That is the regime a saturated crawler dedup stage runs in.

Putze, Sanders, and Singler addressed it in “Cache-, Hash-, and Space-Efficient Bloom Filters” (2007). The blocked Bloom filter cuts the array into blocks sized to one cache line, hashes each element once to pick a block, then sets all k bits inside that single block. A lookup now touches one cache line instead of k. The trade is a small, analyzable rise in false-positive rate, because confining all k bits to one block makes the per-block fill less uniform than spreading them across the whole array; the paper quantifies the extra space needed to compensate. The technique stuck. Parquet’s column statistics use split block Bloom filters built on exactly this idea, and the pattern shows up across storage engines where filter lookups are on the hot path. If your dedup filter is large enough to live outside the last-level cache, blocking is usually the single biggest throughput win available, and it composes with everything above it.

Cuckoo filters, and closing the space gap

The 44 percent overhead the survey flagged is the opening the cuckoo filter aimed at. Bin Fan, David Andersen, Michael Kaminsky, and Michael Mitzenmacher introduced it in “Cuckoo Filter: Practically Better Than Bloom,” at CoNEXT 2014, with a claim narrow enough to be useful: for applications that store many items and target a moderately low false-positive rate, specifically below about 3 percent, the cuckoo filter uses less space than a space-optimized Bloom filter while also supporting deletion.

A cuckoo filter is a cuckoo hash table that stores only a short fingerprint of each item instead of the item itself. Each item has two candidate buckets, derived so that either bucket’s index can be recomputed from the other and the fingerprint alone, which is the trick (partial-key cuckoo hashing) that lets the structure relocate fingerprints during inserts without ever storing the original keys. A lookup checks both candidate buckets for the fingerprint. Because the table is packed densely, around 95 percent occupancy, and stores nothing but fingerprints, it is compact. Deletion is genuinely supported: find the fingerprint in either candidate bucket and remove it.

The same paper is candid about where the cuckoo filter loses. It is asymptotically worse than a Bloom filter: the minimum fingerprint length grows logarithmically with the number of entries, so per-item overhead creeps up as the table gets enormous. And it tabulates the alternatives it beats. Counting Bloom filters generally take three to four times the space of a space-optimized Bloom filter at equal false-positive rate; d-left counting Bloom filters are still about 1.5 times larger; quotient filters trade comparable space for degraded lookups. The cuckoo filter is the variant that gives you deletion and a tight false-positive rate without the counting-filter space penalty, which makes it the natural modern default when a crawler’s dedup structure needs both to delete and to stay small.

*No single structure wins every column. Blocking minimizes memory accesses, cuckoo filters add deletion without the counting-filter space tax, scalable filters handle an unknown final size, and the plain filter remains the simplest thing that works.*

Streams, and the variant that forgets

One more shape of the problem deserves a mention, because crawlers sometimes face it. When the input is an unbounded stream and you only care about recent duplicates, holding every key forever is wrong by design; you want a structure that forgets old entries to keep room for new ones. Fan Deng and Davood Rafiei’s stable Bloom filter, from SIGMOD 2006, does exactly that. Each insert randomly decrements a few cells before setting its own, so the structure reaches a steady state where old information decays out. The trade is a deliberate one: the stable Bloom filter accepts a small, bounded false-negative rate, something the classic filter never has, in exchange for a tight cap on false positives over an endless stream. For a crawler that revisits and wants to suppress near-term duplicate fetches without remembering the whole history, that is the right shape, and it is a reminder that the one-sided-error guarantee is a design choice, not a law.

Where this leaves a crawler in 2026

The honest summary is that the Bloom filter is rarely the whole answer to URL-seen and almost always part of it. The two best-documented web-scale crawlers, Mercator and IRLbot, kept their authoritative seen-set on disk and made it cheap with locality and batching, not with a probabilistic filter in the critical path. Where the filter earns its place is as a memory-resident gate in front of that disk set, or as the entire seen-set for a crawl small enough that a bounded false-positive rate is acceptable and a re-fetch of a wrongly dropped URL is not catastrophic. The 1970 hyphenation design (fast probabilistic gate, slow authoritative check behind, gate tuned so the slow check runs rarely) is still the architecture, fifty-six years on.

The variants map onto real decisions rather than fashion. If the crawl’s final size is unknown, reach for the scalable filter so the false-positive rate cannot drift into silent page loss. If the dedup structure has to live outside the last-level cache and the lookup rate is high, block it so each query is one cache miss instead of k. If you need to delete, the cuckoo filter gives you that without the three-to-four-times space penalty of counting filters, as long as your target false-positive rate is below a few percent. And if the stream is endless and only recent duplicates matter, a structure that forgets beats one that remembers forever. The plain filter is still the thing to reach for first, precisely because it is the simplest structure that turns an impossible exact set into a tractable approximate one, and you only graduate to a variant when a specific column in that comparison table becomes the thing that hurts.

What has not changed is the asymmetry at the center. A Bloom filter lies in exactly one direction, and a crawler’s job is to make sure that direction is the cheap one. Re-crawling a page you already fetched wastes bandwidth and you will notice. Dropping a page you were supposed to fetch wastes nothing visible and you will not. The entire discipline of using these structures in a crawler comes down to keeping the lie pointed at the failure you can see.

Sources & further reading

Bloom, B. H. (1970), Space/Time Trade-offs in Hash Coding with Allowable Errors — the original paper; Communications of the ACM 13(7), 422-426, introduces Method 2, the modern bit-array filter, and the half-full optimum.
Broder, A. and Mitzenmacher, M. (2004), Network Applications of Bloom Filters: A Survey — Internet Mathematics 1(4), 485-509; the clean derivation of the false-positive rate, the k = (m/n)·ln 2 optimum, and the 1.44 space-overhead result.
Heydon, A. and Najork, M. (1999), Mercator: A Scalable, Extensible Web Crawler — the URL set, the host-locality checksum trick, and the 75.7 percent net cache hit rate on the URL-seen test.
Lee, H.-T., Leonard, D., Wang, X. and Loguinov, D. (2008), IRLbot: Scaling to 6 Billion Pages and Beyond — 6.38 billion pages on one server, 394 billion links, and DRUM as the disk-based URL-seen structure.
Fan, L., Cao, P., Almeida, J. and Broder, A. (1998), Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol — SIGCOMM ‘98; origin of the counting Bloom filter to support deletion in gossiped cache summaries.
Almeida, P. S., Baquero, C., Preguiça, N. and Hutchison, D. (2007), Scalable Bloom Filters — Information Processing Letters 101(6), 255-261; the geometric tightening ratio that bounds the compounded rate over an unbounded insert stream.
Putze, F., Sanders, P. and Singler, J. (2007), Cache-, Hash-, and Space-Efficient Bloom Filters — the blocked Bloom filter that turns k cache misses per lookup into one, now used in storage engines and Parquet.
Fan, B., Andersen, D. G., Kaminsky, M. and Mitzenmacher, M. (2014), Cuckoo Filter: Practically Better Than Bloom — CoNEXT ‘14; deletion plus less space than a space-optimized Bloom filter below a roughly 3 percent false-positive rate.
Deng, F. and Rafiei, D. (2006), Approximately Detecting Duplicates for Streaming Data Using Stable Bloom Filters — SIGMOD ‘06; a filter that forgets old entries to bound false positives over an endless stream, at the cost of bounded false negatives.
Manning, C., Raghavan, P. and Schütze (2008), Introduction to Information Retrieval, ch. 20: Web crawling and indexes — textbook treatment of the URL frontier, the URL-seen and content-seen tests, and the duplicate-detection problem.