Bloom filters and the URL-seen problem in web-scale crawling
A primary-source walk through the URL-seen problem in large crawlers: why naive dedup fails at scale, how Bloom filters answer it, the false-positive math, and the counting, scalable, blocked, and cuckoo variants that followed.
23 min read