Blog — page 4

Detecting virtualized and containerized browsers: GPU, screen, and timing artifacts

How detectors spot a browser running in a VM or container: software WebGL renderers like SwiftShader and llvmpipe, default 800x600 screens, quantized device memory, and timing artifacts under virtualization.

browser-automation anti-bot fingerprinting infrastructure

Wed, April 1, 2026 · 23 min read

The navigator.plugins array as a fingerprint and automation tell

Traces navigator.plugins from a 15-bit fingerprinting signal to the five hard-coded PDF entries Chrome and Firefox ship today, the empty array that gave away old headless Chrome, and why fabricating a PluginArray still leaks.

browser-automation fingerprinting anti-bot

Tue, March 31, 2026 · 21 min read

The lifecycle of a stealth patch: discovery, fix, detection, and re-discovery

Traces how a single browser-automation stealth patch moves through its life: a signal is found, the patch hides it, the patch itself becomes a fingerprint, and a new signal replaces the old one. With real examples and the economics of the treadmill.

browser-automation stealth anti-bot

Mon, March 30, 2026 · 19 min read

Designing a distributed crawler: frontier, dedup, politeness, and backpressure

Traces the architecture of a web-scale crawler from Mercator and the early Googlebot through IRLbot to today: the URL frontier, duplicate elimination, politeness scheduling, and how servers push back.

crawling distributed-systems infrastructure

Sun, March 29, 2026 · 21 min read

URL frontier design: from Mercator to modern priority-queue crawlers

How the URL frontier orders a crawl: the Mercator front-queue/back-queue split, per-host politeness, freshness versus coverage, and the disk-backed and gRPC designs that run at web scale today.

crawling distributed-systems infrastructure

Sat, March 28, 2026 · 22 min read

Crawl politeness: robots.txt, crawl-delay, and the unwritten rules of scale

Traces how crawl politeness works in practice: RFC 9309 robots.txt parsing, the crawl-delay split between Google, Bing, and Yandex, per-host rate limits, sitemaps, and the cryptographic verification replacing the honor system.

crawling robots-txt web-standards

Fri, March 27, 2026 · 25 min read

Bloom filters and the URL-seen problem in web-scale crawling

A primary-source walk through the URL-seen problem in large crawlers: why naive dedup fails at scale, how Bloom filters answer it, the false-positive math, and the counting, scalable, blocked, and cuckoo variants that followed.

crawling distributed-systems algorithms

Thu, March 26, 2026 · 23 min read

Proxy pool management: rotation, health checks, and burn-rate economics

Traces how a working proxy pool is operated: rotation strategies, the difference between a banned IP and a dead one, health-check state machines, sticky versus rotating sessions, and the per-GB cost model that decides whether a crawl is profitable.

proxies crawling infrastructure

Wed, March 25, 2026 · 22 min read

Residential vs datacenter vs mobile proxies: detection, cost, and use cases

A vendor-neutral comparison of the three proxy types: how each is sourced, how each gets detected at the ASN and reputation layer, what a gigabyte actually costs, and which job each one fits.

proxies crawling anti-bot

Tue, March 24, 2026 · 19 min read

How proxy networks source IPs: SDKs, residential peers, and the ethics question

Traces where residential and mobile proxy IPs actually come from: bundled SDKs, free-VPN monetization, peer-payout apps, and outright malware, plus the consent gap that runs through all of them.

proxies crawling privacy

Mon, March 23, 2026 · 19 min read

Session and cookie management across a proxy fleet

How identity stays coherent when a crawler rotates IPs: binding cookies and sessions to exit nodes, what breaks when a session leaks across IPs, and the signals anti-bot systems use to catch the mismatch.

proxies crawling cookies

Sun, March 22, 2026 · 22 min read

Sticky sessions vs rotating IPs: when each makes or breaks a scrape

The strategic choice between holding one exit IP for a session and rotating per request: where statefulness forces stickiness, where rotation buys throughput, and how session-consistency checks punish the wrong call.

proxies crawling anti-bot

Sat, March 21, 2026 · 21 min read

Rate limiting yourself: token buckets, adaptive throttling, and 429 backoff

Traces client-side rate control for crawlers: token and leaky buckets applied to your own requests, per-host concurrency, adaptive throttling on 429 and Retry-After, and exponential backoff with jitter.

crawling rate-limiting infrastructure

Fri, March 20, 2026 · 23 min read

The economics of a scraping operation: proxy cost, solve cost, and success rate

Traces how proxy dollars per gigabyte, CAPTCHA-solve dollars per thousand, and browser compute combine through a success rate into the one number that actually matters: cost per successful record.

crawling proxies industry

Thu, March 19, 2026 · 19 min read

Caching and incremental recrawl: ETags, Last-Modified, and change detection

How crawlers avoid re-fetching unchanged pages: conditional requests with ETag and Last-Modified, 304 handling, content hashing for change detection, and recrawl scheduling driven by per-page change rate.

crawling http caching

Wed, March 18, 2026 · 22 min read

Parsing at scale: when to use a real browser vs an HTTP client

A decision framework for choosing between a headless browser and a plain HTTP client at extraction scale: JS-dependence, per-page cost, fingerprint surface, brittleness, and the hybrid path most large crawlers actually take.

crawling browser-automation infrastructure

Tue, March 17, 2026 · 18 min read

The headless-browser tax: memory, CPU, and why HTTP clients win when they can

Traces the real resource cost of driving headless Chrome at scale: per-instance RAM, the multi-process tax, container failure modes, concurrency math, and the cost gap that pushes teams back to HTTP clients.

crawling browser-automation infrastructure

Mon, March 16, 2026 · 22 min read

Building a CAPTCHA-solving pipeline: human farms, ML solvers, and the cost curve

Traces how CAPTCHA solving is operationalized: the human-farm relay, the shift to ML and audio-transcription solvers, the per-solve price curve from 2010 to 2026, and the latency-accuracy-binding tradeoffs that decide whether a token is worth anything.

captcha crawling anti-bot

Sun, March 15, 2026 · 18 min read

Scraping observability: success metrics, block-rate dashboards, and silent failures

Traces how to instrument a scraping system end to end: the metrics that matter, why HTTP 200 is a lie, how to detect soft blocks and empty-payload garbage, and how to build dashboards and alerts that catch silent failure before the data does.

crawling infrastructure observability

Sat, March 14, 2026 · 26 min read

Handling JavaScript-rendered content without a browser: API discovery and XHR replay

How to pull JavaScript-rendered data without launching a browser: finding the backend JSON, XHR, and GraphQL endpoints a page calls, replaying them, handling tokens and request signatures, and where the approach stops working.

crawling browser-automation reverse-engineering

Fri, March 13, 2026 · 20 min read

Reverse-engineering a mobile app's API: certificate pinning, TLS, and the protobuf wall

A primary-source walk through intercepting a mobile app's backend: proxying TLS, why certificate pinning stops you, how runtime unpinning works conceptually, and decoding schema-less protobuf payloads.

reverse-engineering mobile tls

Thu, March 12, 2026 · 22 min read

The TLS ClientHello, field by field: a fingerprinting reference

A field-by-field dissection of the TLS ClientHello, tracing exactly which bytes JA3 and JA4 read: version, cipher suites, compression, extensions, supported_groups, signature_algorithms, supported_versions, key_share, and ALPN.

tls fingerprinting anti-bot

Wed, March 11, 2026 · 19 min read

JA4+ in depth: JA4, JA4S, JA4H, JA4L, JA4X and what each captures

A reference walk through the full JA4+ suite: how each of JA4, JA4S, JA4H, JA4L, JA4X, JA4T and JA4SSH is constructed, what it captures, and how the a_b_c format lets the parts compose.

tls fingerprinting ja4

Tue, March 10, 2026 · 22 min read

How uTLS mimics browser ClientHellos at the Go layer

A deep dive into uTLS: how the Go library forges a chosen browser's ClientHello through ClientHelloID parrots and handshake control, why Go's crypto/tls is otherwise easy to fingerprint, and where the mimicry still leaks.

tls fingerprinting go stealth

Mon, March 9, 2026 · 19 min read