
Parsing at scale: when to use a real browser vs an HTTP client

· 18 min read
Copyright: MIT

Every crawler eventually arrives at the same fork. A page you need does not render its data in the HTML you fetch. The DOM you get from a plain GET is a skeleton: a few empty divs, a bundle of JavaScript, and a loading spinner that a curl request will never see spin. The obvious fix is to throw a real browser at it. Spin up Chrome, let it execute, read the rendered DOM, move on. It works. It also costs you somewhere between ten and fifty times the memory and several times the wall-clock time of the request you were making before, and at a million pages a day that multiplier is the whole budget.

So the question is not “can a browser do this.” A browser can always do it. The question is whether this particular page, on this particular site, at this particular volume, justifies paying for one. That decision gets made badly more often than any other in a scraping stack, usually by defaulting to a browser because it is easier to reason about, and the bill shows up later as proxy bandwidth, compute, and a fleet of Chrome processes that spend most of their life rendering ad iframes you will throw away.

This post is a decision framework for that fork. We start with what actually forces a browser into the loop, then look at how to tell whether a given page is one of those cases without guessing. We cost both paths concretely. We walk the fingerprint surface, because the “just use a browser, it looks more human” reasoning is half wrong in 2026. Then the hybrid pattern that most large operations converge on, and a short checklist you can run against any new target.

What a browser actually buys you

Strip away the marketing and a headless browser gives you exactly one thing an HTTP client cannot: a JavaScript runtime with a DOM attached. Everything else people reach for browsers to get, you can usually get cheaper somewhere else.

The runtime matters when the data you want does not exist in the initial HTML response and only comes into being after script execution. That happens in a few distinct ways, and they are worth separating because they have different cheap escape hatches. The page might fetch its data from an XHR or fetch call after load, assembling the DOM client-side. It might ship the data inline in the HTML as a serialized blob and only need JavaScript to mount it into visible elements. It might gate content behind an interaction: a click, a scroll, a form submit that triggers the real request. Or it might compute something in JavaScript that you genuinely cannot reproduce without running the script, such as a signed token or a value derived from an obfuscated routine.

Only the last of those truly needs a browser, and even then not always. The first three have well-worn HTTP-only paths, and recognizing which case you are in is most of the skill.

1. Is the data in the raw GET response? yes → HTTP client. done.
2. Is it in an inline JSON blob (__NEXT_DATA__, __NUXT__)? yes → HTTP client + parse the blob.
3. Does an XHR/fetch endpoint return it as JSON? yes → HTTP client, call the endpoint directly.
4. Only after JS runs / behind a JS-computed token → browser.

*The browser is the last branch, not the first. Each branch you can answer "yes" to above it removes the runtime from your hot path.*

Telling which case you are in, without guessing

The cheapest diagnostic is the oldest one. Disable JavaScript in a browser, load the page, and look. If the content you want is still there, the server rendered it and an HTTP client will see exactly what you see. If it vanishes, the page depends on script execution to produce that content. That single test sorts most targets in under a minute, and it is more reliable than reading the framework’s marketing about whether it is “server-side rendered,” because plenty of nominally SSR apps still hydrate the interesting fields client-side.

When JS-off shows nothing, the next move is to open the network panel and watch what loads. Filter to Fetch/XHR and reload. You are looking for the request that carries your data, and it announces itself: a response body that is JSON with the fields you wanted, usually to a path under /api/, sometimes a single POST to /graphql carrying operationName, query, and variables in its payload. Find that request and you have found a way to skip the browser entirely. You replay the endpoint with an HTTP client, send the headers it needs, and parse JSON instead of HTML. This is the API-discovery path, and it is the single highest-value habit in the whole discipline; we go deep on it in handling JavaScript-rendered content without a browser.

One caution on replay. The endpoint you found is the one the page used, which means it expects the headers the page sent. Some of those are load-bearing and some are noise, and the only way to know is to strip them one at a time. A common minimum is a referer, an accept header the server checks, and sometimes a custom header the front-end injects to mark a request as same-origin. Occasionally there is a token in there too, issued earlier in the session, which moves the target up a tier because now you need to mint that token before you can replay anything. But the common case is a handful of static headers and a clean JSON response, and that case is worth a few minutes of probing before you concede the page to a browser.
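
The strip-one-at-a-time probe is easy to automate. The sketch below is one way to do it, under stated assumptions: `send` is a hypothetical callable you supply that replays the request with a given header set and returns True when the response still carries the fields you want, and `fake_send` is a stand-in for a real endpoint so the example runs offline. A single pass like this finds a sufficient set when headers are independently required; it does not handle servers that accept either of two alternative headers.

```python
from typing import Callable, Dict

Headers = Dict[str, str]


def minimal_headers(send: Callable[[Headers], bool], captured: Headers) -> Headers:
    """Drop captured headers one at a time, keeping only the load-bearing ones.

    `send` is a hypothetical callable: it replays the request with the given
    headers and returns True when the response still contains the data.
    """
    if not send(captured):
        raise ValueError("the full captured header set fails; re-capture first")
    required = dict(captured)
    for name in list(captured):
        trial = {k: v for k, v in required.items() if k != name}
        if send(trial):
            required = trial  # the request survived without it, so it was noise
    return required


def fake_send(headers: Headers) -> bool:
    """Stand-in endpoint (assumption): needs a referer and a same-origin marker."""
    return "referer" in headers and headers.get("x-requested-with") == "XMLHttpRequest"


captured_headers = {
    "user-agent": "Mozilla/5.0",
    "accept": "application/json",
    "referer": "https://example.com/products",
    "x-requested-with": "XMLHttpRequest",
    "accept-language": "en-US",
}
needed = minimal_headers(fake_send, captured_headers)
```

In practice `send` would wrap your HTTP client and check for an expected field in the parsed JSON rather than a status code, for the reasons covered later in this post.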

There is a second pattern that looks like it needs a browser and does not. Many React and Vue applications render server-side and then ship the entire dataset inline so the client can hydrate without a second round trip. In Next.js this lived for years in a single tag, <script id="__NEXT_DATA__" type="application/json">, holding a JSON document with a props.pageProps object that frequently contains more than the page visibly displays. Nuxt does the equivalent in a window.__NUXT__ assignment. A plain GET returns that blob in the HTML body. No execution required. You locate the script tag, parse the JSON, and read structured fields straight out of it, which is cleaner than scraping the rendered DOM would have been.
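
As a minimal sketch of that extraction, using only the Python standard library: the function name `extract_page_props` and the sample HTML are illustrative assumptions, and the only structure relied on is the script tag id and the props.pageProps path described above.

```python
import json
from html.parser import HTMLParser
from typing import List, Optional


class _NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    @property
    def payload(self) -> str:
        return "".join(self._chunks)


def extract_page_props(html: str) -> Optional[dict]:
    """Return props.pageProps from the inline blob, or None if the tag is absent."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.payload.strip():
        return None
    document = json.loads(parser.payload)
    return document.get("props", {}).get("pageProps")


# Illustrative response body; a real one comes from a plain GET.
sample_html = """<html><body><div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":{"product":{"sku":"A-100","price":19.5}}},"page":"/p/[sku]"}</script>
</body></html>"""
props = extract_page_props(sample_html)
```

A None return is worth treating as a signal in its own right: it usually means the site moved to a different serialization, not that the data is gone.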

That tag is moving, though, and it is worth knowing where. With the App Router introduced in Next.js 13, the single __NEXT_DATA__ document gave way to a streamed React Server Components payload delivered as a series of self.__next_f.push(...) calls scattered through the HTML. The data is still there in the response, still parseable without a browser, but it is now chunked in React’s flight wire format rather than sitting in one tidy object. Libraries exist to reassemble it; the njsparser project parses the flight stream and lets you traverse the result as plain objects. The point is structural: the data did not leave the HTTP response when Next.js changed its serialization, it just got harder to grab with a single JSON parse. The browser is still not required.
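
To make the chunking concrete, here is a rough sketch of the first step only: pulling the pushed string fragments out of the HTML and joining them back into one stream. It assumes, from observed pages rather than any documented contract, that each fragment arrives as a script whose body is a single self.__next_f.push([1, "..."]) call and that the argument is valid JSON. Parsing the rows of the resulting stream is left to a dedicated library such as njsparser. The function name and sample HTML are illustrative.

```python
import json
import re
from html.parser import HTMLParser
from typing import List

# Matches a script body that is one push call and captures its argument.
_PUSH_CALL = re.compile(r"self\.__next_f\.push\((.*)\)\s*;?\s*$", re.DOTALL)


class _ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_script = False
        self._buffer: List[str] = []
        self.scripts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._buffer))
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)


def flight_stream(html: str) -> str:
    """Concatenate the string fragments pushed onto self.__next_f, in order."""
    collector = _ScriptCollector()
    collector.feed(html)
    collector.close()
    parts: List[str] = []
    for body in collector.scripts:
        match = _PUSH_CALL.search(body.strip())
        if not match:
            continue
        try:
            arg = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # not the shape we assumed; skip rather than guess
        # Assumption: [1, "<text>"] entries carry the stream text.
        if isinstance(arg, list) and len(arg) == 2 and arg[0] == 1 and isinstance(arg[1], str):
            parts.append(arg[1])
    return "".join(parts)


# Illustrative HTML: one logical row split across two pushes.
sample_html = r"""<html><body>
<script>(self.__next_f=self.__next_f||[]).push([0])</script>
<script>self.__next_f.push([1,"1:{\"sku\":\"A-1"])</script>
<script>self.__next_f.push([1,"00\",\"price\":19.5}\n"])</script>
</body></html>"""
stream = flight_stream(sample_html)
```

The sample deliberately splits one value across two pushes, which is why a per-script JSON parse is not enough and the fragments have to be joined before anything downstream reads them.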

Where the data hides in a single GET response:

- Inline blob (__NEXT_DATA__, __NUXT__): one JSON.parse, no execution.
- Streamed RSC (self.__next_f.push(...)): reassemble flight chunks, no browser.
- XHR endpoint (/api/... JSON, /graphql POST): replay the call, parse JSON.

*All three sit inside a response a plain HTTP client already receives. The browser is only needed when the value is computed at runtime and never serialized.*

The GraphQL case has one wrinkle worth flagging. Some sites use persisted queries, where the client sends only a SHA-256 hash of the query text instead of the full query, under a persistedQuery extension carrying a sha256Hash field. When the server has that hash cached it runs the query; when it does not, it returns a PersistedQueryNotFound error and the client retransmits the full query for the server to cache. For a scraper this is mostly a discovery hurdle: you observe the hash-plus-variables shape once, capture the full query the client sends on a cache miss, and then replay the cached hash on subsequent calls. None of that needs a rendering engine. It does need patience and a clean capture of the real client’s traffic, which is the recurring theme of the HTTP-only path.
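
The hash-first, full-query-on-miss exchange can be sketched as follows. Several things here are assumptions for illustration: `post` is a hypothetical callable that sends a JSON body and returns the decoded reply, `FakeServer` imitates the caching behaviour so the example runs offline, the `"version": 1` field follows the common Apollo-style convention, and the error is matched on its message text. In a real capture you would normally reuse the hash you observed, since it only matches if the query text is byte-for-byte what the client sent.

```python
import hashlib
from typing import Any, Callable, Dict

Json = Dict[str, Any]


def persisted_query_body(operation_name: str, query: str, variables: Json,
                         include_query: bool = False) -> Json:
    """Build a request body that identifies the query by its SHA-256 hash."""
    body: Json = {
        "operationName": operation_name,
        "variables": variables,
        "extensions": {
            "persistedQuery": {
                "version": 1,  # assumption: Apollo-style convention
                "sha256Hash": hashlib.sha256(query.encode("utf-8")).hexdigest(),
            }
        },
    }
    if include_query:
        body["query"] = query
    return body


def run_persisted(post: Callable[[Json], Json], operation_name: str,
                  query: str, variables: Json) -> Json:
    """Try the hash alone; on a cache miss, resend with the full query text."""
    reply = post(persisted_query_body(operation_name, query, variables))
    errors = reply.get("errors") or []
    if any(e.get("message") == "PersistedQueryNotFound" for e in errors):
        reply = post(persisted_query_body(operation_name, query, variables,
                                          include_query=True))
    return reply


class FakeServer:
    """Offline stand-in (assumption) that caches queries by hash."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}
        self.calls = 0

    def post(self, body: Json) -> Json:
        self.calls += 1
        digest = body["extensions"]["persistedQuery"]["sha256Hash"]
        if "query" in body:
            self.cache[digest] = body["query"]
        if digest not in self.cache:
            return {"errors": [{"message": "PersistedQueryNotFound"}]}
        return {"data": {"echo": body["variables"]}}


server = FakeServer()
PRODUCT_QUERY = "query Product($id: ID!) { product(id: $id) { name } }"
first = run_persisted(server.post, "Product", PRODUCT_QUERY, {"id": "1"})
second = run_persisted(server.post, "Product", PRODUCT_QUERY, {"id": "2"})
```

The first call costs two round trips and every later one costs a single round trip, which is the whole reason the pattern exists on the client side too.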

The cost, counted honestly

Here is the multiplier that drives the whole decision. A bare HTTP request fetches the bytes of one document and stops. A browser fetches that document, then fetches every subresource it references, parses and executes the JavaScript, builds a DOM and a layout tree, runs the page’s network calls, and holds all of that in a renderer process while it works. The published numbers cluster tightly across independent sources. A single Playwright or Puppeteer browser instance sits around 200 to 500 MB of RAM under load. Per-page, browser requests run roughly 10 to 50 times slower and 5 to 20 times more expensive than the HTTP equivalent. In wall-clock terms that is a browser page landing in the 3-to-15-second range against 0.5-to-2 seconds for a direct request. For static content an HTTP client is on the order of 10 to 100 times faster.

Chrome’s memory cost is not an accident you can configure away; it is architectural. Site Isolation puts each site’s documents in their own renderer process, locked to a single site, defined as scheme plus eTLD+1. The Chromium docs are explicit that this runs web instances in parallel “at the cost of some memory overhead for each process.” A renderer carries its own V8 isolate, its own Blink instance, and the DOM for that page, so a minimal renderer starts in the tens of megabytes and a heavy app pushes a single tab past 300 to 500 MB. Chrome applies a soft process limit tied to available memory and starts reusing same-site processes once you cross it, which means the cost does not scale linearly: it grows until the machine is full, and then your concurrency stalls. On a typical 8-core box you get something like 5 to 10 healthy parallel pages per browser before contention shows, and 15 to 30 concurrent contexts if you share one browser process across isolated contexts rather than launching fresh browsers.

The slowness compounds in a way the per-page number hides. A browser does not just take longer per page; it takes longer in a way you cannot fully parallelize away, because each parallel page is also competing for the same finite renderer budget and the same CPU during layout and script execution. Doubling your browser concurrency on a fixed machine does not double throughput once renderers start fighting for memory and the GPU process and main thread saturate. An HTTP client, by contrast, is mostly waiting on the network, so its concurrency scales cleanly with available sockets and proxy capacity until something upstream rate-limits you, which is a much higher ceiling. This is why a thousand concurrent HTTP requests is an ordinary afternoon and a thousand concurrent real browsers is a cluster.
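
A back-of-envelope calculation shows what those ceilings mean at a million pages a day. This is a deliberately crude steady-state model, not a capacity plan: it assumes one browser per machine, takes the latency and concurrency ranges quoted above at face value, and ignores retries, proxy limits, and rate limiting. The function name is illustrative.

```python
import math


def machines_needed(pages_per_day: int, seconds_per_page: float,
                    concurrent_per_machine: int) -> int:
    """Machines required if every slot stays busy all day (steady state)."""
    pages_per_machine_per_day = concurrent_per_machine * 86_400 / seconds_per_page
    return math.ceil(pages_per_day / pages_per_machine_per_day)


TARGET = 1_000_000  # pages per day

# Browser, mid-range: 10 parallel pages at 8 s each.
browser_typical = machines_needed(TARGET, 8.0, 10)      # 10 machines

# Browser, pessimistic end: 5 parallel pages at 15 s each.
browser_slow = machines_needed(TARGET, 15.0, 5)         # 35 machines

# HTTP client: 1,000 concurrent requests at 1 s each.
http_client = machines_needed(TARGET, 1.0, 1000)        # 1 machine
```

The absolute numbers move with every assumption, but the gap between the two paths stays at an order of magnitude or more across the quoted ranges.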

There is a cost most people miss until the proxy invoice arrives: bandwidth. A browser loads the page’s images, fonts, analytics scripts, ad iframes, and tracking pixels, all of it flowing through your proxies and metered by the gigabyte. An HTTP client fetching one JSON endpoint moves a few kilobytes. When you are paying residential-proxy rates, the difference between pulling 2 MB of full-page assets and 4 KB of API response is the difference between a viable margin and a loss, which is the arithmetic we walk through in the economics of a scraping operation. You can claw some of it back by blocking image and media requests in the browser, and you should, but a blocked-resource browser is still a browser, with the V8 and renderer overhead intact.
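
The arithmetic is short enough to write down. The page sizes below are the 2 MB and 4 KB figures from the paragraph above; the price per gigabyte is a placeholder assumption chosen only to make the ratio visible, so substitute your own provider's rate.

```python
def monthly_proxy_cost(pages_per_day: int, bytes_per_page: int,
                       usd_per_gb: float, days: int = 30) -> float:
    """Metered proxy spend for a month, counting decimal gigabytes."""
    gigabytes = pages_per_day * bytes_per_page * days / 1_000_000_000
    return gigabytes * usd_per_gb


PAGES_PER_DAY = 1_000_000
ASSUMED_USD_PER_GB = 5.0  # hypothetical rate, not a quoted price

# Full page through a browser: about 2 MB of assets per page.
browser_cost = monthly_proxy_cost(PAGES_PER_DAY, 2_000_000, ASSUMED_USD_PER_GB)  # 300000.0

# One JSON endpoint through an HTTP client: about 4 KB per page.
http_cost = monthly_proxy_cost(PAGES_PER_DAY, 4_000, ASSUMED_USD_PER_GB)         # 600.0

ratio = browser_cost / http_cost  # 500.0, independent of the assumed price
```

The ratio is the part that survives any change in pricing: it depends only on bytes moved per page.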

Per-page cost, relative (HTTP client = 1):

- Latency: HTTP 1x, browser ~10-50x.
- Memory: HTTP ~5-10 MB, browser 200-500 MB.
- Bandwidth: HTTP fetches 1 document, browser fetches the document plus all subresources.

Bars are illustrative ranges from published benchmarks, not a single measurement.

*The browser tax is paid three ways at once. The full accounting, including the CPU side, is in [the headless-browser tax](/blog/headless-browser-tax).*

The fingerprint surface cuts both ways

There is a common belief that a real browser is the safer choice against anti-bot systems because it “looks human.” That was closer to true in 2018 than it is now, and treating it as a rule will cost you on both ends.

Start with the HTTP client’s weakness, because it is real. Every TLS connection opens with a ClientHello, and the ordered set of cipher suites, extensions, elliptic curves, and supported versions in that message forms a fingerprint, hashed as JA3 (2017) and its successor JA4. The fingerprint is computed before any HTTP header is sent, before encryption is even fully negotiated, so it identifies the client stack underneath whatever User-Agent you set. Cloudflare describes the result as a stable identifier across destination IPs, ports, and certificates, and uses it in bot analytics, WAF rules, and Workers. The trap for HTTP-only scrapers is mismatch: your headers announce Chrome, your TLS stack is Python’s default OpenSSL, and the two disagree. Edge systems treat that contradiction as a high-confidence automation signal, and a default requests or stock-Go client will lose on exactly this. We cover the full chain in TLS fingerprinting: from ClientHello bytes to JA4.

That weakness is closable. Tools that replace the client’s TLS stack with a browser’s exact handshake exist precisely to make the ClientHello match the claimed User-Agent. JA4 itself is a defensive instrument here, not an offensive one: John Althouse and FoxIO designed it, building on JA3, and shipped the core TLS-client fingerprint as open-source under BSD 3-Clause, with the extended JA4+ suite under FoxIO’s own license. The relevant design choice for our purposes is that JA4 sorts the ClientHello extensions before hashing. JA3 hashed them in observed order, which made it brittle once browsers began randomizing extension order; by sorting, JA4 stays stable across that randomization and groups modern browsers more tightly. For the scraper that means matching a browser’s TLS fingerprint is a question of replicating the right cipher and extension set, not chasing a moving order, but it also means the defender’s grouping got harder to fool with noise alone.
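
The effect of sorting is easy to see in a toy example. The first function below follows the JA3 recipe (five comma-separated fields, values joined with dashes, MD5 of the result). The second is explicitly not the real JA4 format, which has its own multi-part layout; it only isolates the one design choice discussed here, sorting extensions before hashing. The ClientHello values are made up for illustration.

```python
import hashlib
from typing import Sequence


def ja3_style_hash(version: int, ciphers: Sequence[int], extensions: Sequence[int],
                   curves: Sequence[int], point_formats: Sequence[int]) -> str:
    """JA3-style digest: field values are hashed in the order observed."""
    fields = [str(version)] + [
        "-".join(str(v) for v in field)
        for field in (ciphers, extensions, curves, point_formats)
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


def sorted_extension_hash(version: int, ciphers: Sequence[int], extensions: Sequence[int],
                          curves: Sequence[int], point_formats: Sequence[int]) -> str:
    """Simplified illustration of sort-before-hash; NOT the actual JA4 layout."""
    return ja3_style_hash(version, ciphers, sorted(extensions), curves, point_formats)


# Illustrative ClientHello contents (assumed values).
VERSION = 771
CIPHERS = [4865, 4866, 4867]
EXTENSIONS = [0, 23, 65281, 10, 11, 35, 16, 5, 13]
CURVES = [29, 23, 24]
POINT_FORMATS = [0]

# The same client, with its extensions sent in a different order.
reordered = list(reversed(EXTENSIONS))

order_sensitive_changed = (
    ja3_style_hash(VERSION, CIPHERS, EXTENSIONS, CURVES, POINT_FORMATS)
    != ja3_style_hash(VERSION, CIPHERS, reordered, CURVES, POINT_FORMATS)
)
sorted_stable = (
    sorted_extension_hash(VERSION, CIPHERS, EXTENSIONS, CURVES, POINT_FORMATS)
    == sorted_extension_hash(VERSION, CIPHERS, reordered, CURVES, POINT_FORMATS)
)
```

Both flags come out True: reordering changes the order-sensitive digest and leaves the sorted one untouched, which is why shuffling extensions no longer moves a client out of its group.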

Now the part people forget. A real browser is not automatically clean. Running headless leaks its own tells. The navigator.webdriver flag, the HeadlessChrome token that older headless builds put in the User-Agent, missing or inconsistent plugin and permission states, the Chrome DevTools Protocol connection the automation driver uses, all of these are queryable from inside the page and all of them are signals an anti-bot script collects. A headless Chrome under Playwright presents a large and well-studied surface that detection vendors enumerate field by field; the long tail of those signals is its own subject in headless Chrome detection: every tell. So you do not escape the fingerprint problem by choosing a browser. You trade a TLS-layer mismatch for a JavaScript-runtime fingerprint, and the runtime surface is broader and harder to fully sanitize than the handshake is to match. Neither path is fingerprint-free. The honest statement is that an HTTP client’s tell is concentrated and fixable, while a browser’s tells are diffuse and numerous, and which one you would rather defend depends on the target.

The hybrid path most operations land on

Watch a mature scraping operation long enough and you see the same shape emerge. It is not browsers everywhere and it is not HTTP everywhere. It is a tiered system where the cheap path handles the bulk and the expensive path is reserved for the cases that genuinely need it.

The tiering usually goes like this. The default tier is a hardened HTTP client with a browser-matched TLS fingerprint and a correct header set, handling every target whose data lives in the raw response, an inline blob, or a discoverable endpoint. This is where the volume is, and it is cheap enough that you can run it wide. A second tier handles targets that need a runtime but can be solved once: you drive a real browser through the flow a single time, capture the JavaScript-computed token or the exact request the page makes, and then replay that token or request with the cheap HTTP tier for as long as it stays valid. The browser becomes a token mint rather than a per-page renderer, which is the whole game, because token issuance is rare and page fetching is constant. The top tier is the full browser, used only where the page recomputes something per request that you cannot capture and reuse, or where interaction genuinely drives the data.
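
The token-mint tier reduces to a small cache in front of an expensive call. The sketch below assumes a fixed time-to-live and a hypothetical `mint` callable that would, in a real system, drive a browser through the flow and return the captured value; here it is a cheap stand-in so the example runs offline. The injectable clock exists only to make expiry testable.

```python
import time
from typing import Callable, List, Optional


class TokenMint:
    """Caches an expensively minted token and re-mints only when it expires."""

    def __init__(self, mint: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._mint = mint            # hypothetical: the browser-driven flow
        self._ttl = ttl_seconds      # assumption: validity is a fixed window
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0
        self.mint_count = 0

    def get(self) -> str:
        """Return a valid token, paying the browser cost only when needed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._mint()
            self._expires_at = now + self._ttl
            self.mint_count += 1
        return self._token

    def invalidate(self) -> None:
        """Call when the cheap tier sees the token rejected before its TTL."""
        self._token = None


# Offline demonstration with a fake clock and a fake mint.
fake_now: List[float] = [0.0]
issued: List[str] = []


def fake_browser_mint() -> str:
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1]


mint = TokenMint(fake_browser_mint, ttl_seconds=600, clock=lambda: fake_now[0])
tokens_in_window = {mint.get() for _ in range(1000)}  # 1000 cheap-tier fetches
fake_now[0] = 601.0                                   # the token has expired
token_after_expiry = mint.get()
```

One thousand fetches cost one mint. The `invalidate` hook matters as much as the TTL, because real tokens often die early when the target rotates keys or flags the session.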

What pushes a target up the tiers is rarely the framework and usually the defense. A statically rendered page behind an aggressive bot manager can need a browser not because the data is hidden but because the cheap client cannot survive the challenge. A heavy single-page app with no anti-bot can often be solved entirely at the HTTP tier once you find its API. So the rendering decision and the anti-bot decision are entangled, and you cannot make one without the other in front of you. That entanglement is also why the token-mint pattern is durable: the costly browser work concentrates on the moment a defense issues a clearance cookie or a signed value, and the cheap tier rides on it afterward, the way a cf_clearance cookie earned once carries many subsequent requests until it expires.

Tiered routing: cheap path carries the volume

- Tier 1, HTTP client + matched TLS: raw HTML, inline blob, discovered API; the bulk of pages.
- Tier 2, browser as token mint: solve once, capture token/request, replay with tier 1.
- Tier 3, full browser per page: per-request recomputation or real interaction only.

*The width of each tier is roughly its share of traffic in a tuned pipeline. Tier 3 is the exception, not the workhorse.*

Operationally, the hybrid only works if you can tell when the cheap path stopped working. A replayed token expires. An API adds a new required header. A site flips from SSR to client-rendering after a deploy and your HTML parser starts returning empty. None of these throw errors; they return 200s with the wrong body, the silent-failure mode that quietly drains a crawl. You need success metrics that key on extracted-field presence, not HTTP status, so a structural change trips an alarm instead of polluting your dataset, which is the case made in scraping observability. Pair that with sane throttling so a tier that starts failing does not hammer the target while it fails; the backoff mechanics live in rate limiting yourself.
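
A success metric keyed on field presence can be very small. In this sketch the field names, the 95% threshold, and the sample batch are all illustrative assumptions; the point is only that the check looks at what was extracted and never at the status code.

```python
from typing import Any, Dict, Iterable, List, Mapping, Sequence

_EMPTY = (None, "", [], {})


def field_fill_rates(records: Iterable[Mapping[str, Any]],
                     required: Sequence[str]) -> Dict[str, float]:
    """Fraction of records in which each required field is present and non-empty."""
    batch = list(records)
    if not batch:
        return {name: 0.0 for name in required}
    return {
        name: sum(1 for r in batch if r.get(name) not in _EMPTY) / len(batch)
        for name in required
    }


def failing_fields(records: Iterable[Mapping[str, Any]], required: Sequence[str],
                   min_rate: float = 0.95) -> List[str]:
    """Fields whose fill rate fell below the alarm threshold."""
    rates = field_fill_rates(records, required)
    return [name for name in required if rates[name] < min_rate]


# Every response below was a 200; the second half came back after a
# hypothetical deploy that moved the price into client-side rendering.
batch = (
    [{"status": 200, "title": "Widget", "price": "19.50"}] * 50
    + [{"status": 200, "title": "Widget", "price": ""}] * 50
)
alarms = failing_fields(batch, required=["title", "price"])
```

Here `alarms` names the price field even though every status was 200, which is the alarm the paragraph above is asking for.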

A checklist for the next target

When a new site lands on your desk, the framework collapses to a short sequence you can run in a few minutes. Fetch the page with a plain client and search the response for the fields you want. If they are there, you are done: parse the HTML or the inline blob and never open a browser. If they are not, load the page with JavaScript disabled to confirm the dependency is real and not a header or cookie issue. Then open the network panel and hunt for the endpoint that carries the data, because most of the time it exists and replaying it is the entire solution. Only when there is no such endpoint, or the request to it is gated by a value the page computes at runtime that you cannot capture and reuse, does a browser earn its place in the loop.
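
The first step of that sequence can be scripted. The sketch below takes the raw response body and a few values you can see on the rendered page, then reports where they turn up. The function name, return labels, and sample pages are illustrative assumptions, and plain substring matching is a simplification: a value serialized differently in a blob (19.5 versus "19.50") will be missed, so treat a negative result as a prompt to look, not a verdict.

```python
from html.parser import HTMLParser
from typing import List, Sequence


class _TextSplitter(HTMLParser):
    """Separates text inside <script> elements from all other text."""

    def __init__(self) -> None:
        super().__init__()
        self._in_script = False
        self.markup_text: List[str] = []
        self.script_text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        (self.script_text if self._in_script else self.markup_text).append(data)


def triage(raw_html: str, sample_values: Sequence[str]) -> str:
    """Return 'html', 'inline_blob', or 'network_panel' for a raw GET body."""
    if not sample_values:
        raise ValueError("provide at least one value seen on the rendered page")
    splitter = _TextSplitter()
    splitter.feed(raw_html)
    splitter.close()
    markup = " ".join(splitter.markup_text)
    scripts = " ".join(splitter.script_text)
    if all(value in markup for value in sample_values):
        return "html"            # server-rendered: parse the HTML
    if all(value in markup or value in scripts for value in sample_values):
        return "inline_blob"     # serialized in a script tag: parse the blob
    return "network_panel"       # not in the response: go find the endpoint


server_rendered = "<html><body><h1>Widget</h1><span>19.50</span></body></html>"
hydrated = ('<div id="root"></div><script id="__NEXT_DATA__" type="application/json">'
            '{"name":"Widget","price":"19.50"}</script>')
client_rendered = '<div id="root"></div><script src="/bundle.js"></script>'

results = [triage(page, ["Widget", "19.50"])
           for page in (server_rendered, hydrated, client_rendered)]
```

Only the third result sends you to the network panel, and even that is a step before a browser earns its place.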

And even then, scope it. A browser used as a one-time token mint costs you one expensive operation amortized across thousands of cheap ones. A browser used as your default renderer costs you the full tax on every page, most of which never needed it. The mistake that shows up on the invoice is almost never “we should have used a browser here and didn’t.” It is the reverse: a fleet of Chrome processes faithfully rendering pages whose data was sitting in a JSON endpoint the whole time, found in ninety seconds by anyone who thought to look at the network tab before reaching for the heavy tool.

