Tagged: crawling

Designing a distributed crawler: frontier, dedup, politeness, and backpressure

Traces the architecture of a web-scale crawler from Mercator and the early Googlebot through IRLbot to today: the URL frontier, duplicate elimination, politeness scheduling, and how servers push back.

crawling distributed-systems infrastructure

Sun, March 29, 2026 · 21 min read

URL frontier design: from Mercator to modern priority-queue crawlers

How the URL frontier orders a crawl: the Mercator front-queue/back-queue split, per-host politeness, freshness versus coverage, and the disk-backed and gRPC designs that run at web scale today.

crawling distributed-systems infrastructure

Sat, March 28, 2026 · 22 min read

Crawl politeness: robots.txt, crawl-delay, and the unwritten rules of scale

Traces how crawl politeness works in practice: RFC 9309 robots.txt parsing, the crawl-delay split between Google, Bing, and Yandex, per-host rate limits, sitemaps, and the cryptographic verification replacing the honor system.

crawling robots-txt web-standards

Fri, March 27, 2026 · 25 min read

Bloom filters and the URL-seen problem in web-scale crawling

A primary-source walk through the URL-seen problem in large crawlers: why naive dedup fails at scale, how Bloom filters answer it, the false-positive math, and the counting, scalable, blocked, and cuckoo variants that followed.

crawling distributed-systems algorithms

Thu, March 26, 2026 · 23 min read

Proxy pool management: rotation, health checks, and burn-rate economics

Traces how a working proxy pool is operated: rotation strategies, the difference between a banned IP and a dead one, health-check state machines, sticky versus rotating sessions, and the per-GB cost model that decides whether a crawl is profitable.

proxies crawling infrastructure

Wed, March 25, 2026 · 22 min read

Residential vs datacenter vs mobile proxies: detection, cost, and use cases

A vendor-neutral comparison of the three proxy types: how each is sourced, how each gets detected at the ASN and reputation layer, what a gigabyte actually costs, and which job each one fits.

proxies crawling anti-bot

Tue, March 24, 2026 · 19 min read

How proxy networks source IPs: SDKs, residential peers, and the ethics question

Traces where residential and mobile proxy IPs actually come from: bundled SDKs, free-VPN monetization, peer-payout apps, and outright malware, plus the consent gap that runs through all of them.

proxies crawling privacy

Mon, March 23, 2026 · 19 min read

Session and cookie management across a proxy fleet

How identity stays coherent when a crawler rotates IPs: binding cookies and sessions to exit nodes, what breaks when a session leaks across IPs, and the signals anti-bot systems use to catch the mismatch.

proxies crawling cookies

Sun, March 22, 2026 · 22 min read

Sticky sessions vs rotating IPs: when each makes or breaks a scrape

The strategic choice between holding one exit IP for a session and rotating per request: where statefulness forces stickiness, where rotation buys throughput, and how session-consistency checks punish the wrong call.

proxies crawling anti-bot

Sat, March 21, 2026 · 21 min read

Rate limiting yourself: token buckets, adaptive throttling, and 429 backoff

Traces client-side rate control for crawlers: token and leaky buckets applied to your own requests, per-host concurrency, adaptive throttling on 429 and Retry-After, and exponential backoff with jitter.

crawling rate-limiting infrastructure

Fri, March 20, 2026 · 23 min read

The economics of a scraping operation: proxy cost, solve cost, and success rate

Traces how proxy dollars per gigabyte, CAPTCHA-solve dollars per thousand, and browser compute combine through a success rate into the one number that actually matters: cost per successful record.

crawling proxies industry

Thu, March 19, 2026 · 19 min read

Caching and incremental recrawl: ETags, Last-Modified, and change detection

How crawlers avoid re-fetching unchanged pages: conditional requests with ETag and Last-Modified, 304 handling, content hashing for change detection, and recrawl scheduling driven by per-page change rate.

crawling http caching

Wed, March 18, 2026 · 22 min read

Parsing at scale: when to use a real browser vs an HTTP client

A decision framework for choosing between a headless browser and a plain HTTP client at extraction scale: JS-dependence, per-page cost, fingerprint surface, brittleness, and the hybrid path most large crawlers actually take.

crawling browser-automation infrastructure

Tue, March 17, 2026 · 18 min read

The headless-browser tax: memory, CPU, and why HTTP clients win when they can

Traces the real resource cost of driving headless Chrome at scale: per-instance RAM, the multi-process tax, container failure modes, concurrency math, and the cost gap that pushes teams back to HTTP clients.

crawling browser-automation infrastructure

Mon, March 16, 2026 · 22 min read

Building a CAPTCHA-solving pipeline: human farms, ML solvers, and the cost curve

Traces how CAPTCHA solving is operationalized: the human-farm relay, the shift to ML and audio-transcription solvers, the per-solve price curve from 2010 to 2026, and the latency-accuracy-binding tradeoffs that decide whether a token is worth anything.

captcha crawling anti-bot

Sun, March 15, 2026 · 18 min read

Scraping observability: success metrics, block-rate dashboards, and silent failures

Traces how to instrument a scraping system end to end: the metrics that matter, why HTTP 200 is a lie, how to detect soft blocks and empty-payload garbage, and how to build dashboards and alerts that catch silent failure before the data does.

crawling infrastructure observability

Sat, March 14, 2026 · 26 min read

Handling JavaScript-rendered content without a browser: API discovery and XHR replay

How to pull JavaScript-rendered data without launching a browser: finding the backend JSON, XHR, and GraphQL endpoints a page calls, replaying them, handling tokens and request signatures, and where the approach stops working.

crawling browser-automation reverse-engineering

Fri, March 13, 2026 · 20 min read

The history of web scraping: from wget to headless Chrome, 1994-2026

Traces automated web extraction from the 1993 Wanderer and JumpStation through wget, Perl LWP, the API era, Scrapy, Selenium, the headless-Chrome shift, and the AI-training wave, with the legal landmarks along the way.

history crawling web-standards

Sat, November 1, 2025 · 25 min read