Handling JavaScript-rendered content without a browser: API discovery and XHR replay
A page loads. You view the HTML source and the thing you want is not there. It is an empty <div id="root">, a spinner, maybe a <noscript> apology. The data arrives a half-second later, painted in by JavaScript that ran a few HTTP requests you never saw. The reflexive fix is to reach for a headless browser, let it run the JavaScript, and read the DOM after the dust settles. That works. It is also the most expensive way to solve the problem, and frequently the least reliable.
The cheaper path is to ask a different question. The JavaScript did not invent the data. It fetched it, from somewhere, over HTTP, and that somewhere is almost always a JSON or GraphQL endpoint that returns exactly the structured payload you want, minus the HTML, minus the rendering, minus the browser. This post is about finding those endpoints, replaying them from a plain HTTP client, dealing with the tokens and signatures that guard them, and recognising the cases where the endpoint is so well defended that the browser was the cheaper option after all.
The sections that follow trace the full arc: why client-side rendering creates a hidden API in the first place, how to find it in the network panel and in the JavaScript bundle, the specific case of GraphQL persisted queries, replaying a request and the headers that matter, the auth and signature schemes you will run into, the data that ships inside the initial HTML on framework sites, and the limits where this stops being worth it.
Why the data is hidden in the first place
A server-rendered page from 2010 put the data in the HTML. The server queried a database, filled a template, and sent you a complete document. Single-page applications inverted that. The server now sends a near-empty shell plus a JavaScript bundle, and the bundle calls back to a data API once it boots in your browser. The HTML is a loader. The data lives behind a second request.
That second request is the prize. It exists because the front-end team needed a clean contract between their React or Vue code and their backend, so they built one: a REST endpoint returning JSON, or a GraphQL endpoint, or a gRPC-web channel. The endpoint is not secret. It is just not advertised. It speaks JSON because a browser’s fetch and XMLHttpRequest speak JSON, and the response is the same structured data the page is about to render, before any of it gets wrapped in markup.
This is the whole reason the technique pays off. Parsing HTML is parsing someone’s presentation layer, and presentation layers churn. A CSS class renames, a wrapping <div> appears, and your selector breaks. The JSON endpoint underneath changes far less often, because changing it means changing a contract that the front-end depends on. When you can hit the API directly you get typed fields with stable names instead of brittle DOM paths, you skip downloading and executing megabytes of JavaScript, and you skip the entire render. For the trade-offs between this and driving a real browser, the headless-browser tax and parsing at scale: browser vs HTTP posts go deeper on the cost side.
Finding the endpoint in the network panel
The first place to look is the browser’s own network panel, because the browser already did the reconnaissance for you. Open developer tools, go to the Network tab, and turn on two settings before you reload: preserve log, so requests survive navigation, and disable cache, so you see the real traffic rather than a 304. Then reload and watch what the page asks for.
Filter to XHR and Fetch. That single filter strips out the images, fonts, stylesheets, and the JavaScript bundle itself, and leaves you with the data calls, the XMLHttpRequest and fetch traffic the application made to populate the page. Most of what remains is the API surface. You are looking for a request whose response body, viewed under the Preview or Response tab, contains the records you came for. Trigger more of them by interacting: paginate, search, open a detail view, scroll a feed that lazy-loads. Each action that fetches data shows up as a new entry, and the URL patterns start to rhyme: /api/v2/items?page=2, /graphql, /_next/data/<buildId>/....
Once you have the request, the browser hands you a replay for free. Right-click it and choose “Copy as cURL.” That captures the exact method, URL, headers, cookies, and body the browser sent, as a runnable command. The standard next move is to drop that cURL into a converter such as the open-source curlconverter to get the equivalent Python requests or Node fetch call, then start deleting headers one at a time to find the minimum set the endpoint actually requires. Most of what the browser sends is noise. A handful of headers carry weight, and finding which ones is the core of the work.
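The header-stripping pass described above can be sketched as a small greedy loop. This is a minimal sketch, not a definitive tool: `minimal_headers` and the `still_works` callback are hypothetical names introduced here, and `still_works` stands in for whatever you use to send the trimmed request and decide whether the response still contains the data.

```python
from typing import Callable, Dict


def minimal_headers(
    headers: Dict[str, str],
    still_works: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Greedily drop captured headers one at a time, keeping only the needed ones.

    `still_works` is a hypothetical callback: it should replay the request with
    the given headers and return True if the endpoint still returns the data.
    """
    if not still_works(headers):
        raise ValueError("the full captured request must succeed before trimming")

    kept = dict(headers)
    for name in list(headers):
        # Try the request without this one header.
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial  # the endpoint did not miss it, so leave it out
    return kept
```

One pass like this costs one request per captured header. Run it at a polite rate, and rerun it occasionally, because the set of headers a server checks can change between deployments.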
When the network panel is not enough
Some endpoints do not show up cleanly. The call might fire from a web worker, get bundled into a batch, or be obscured by a service worker that serves a cached response on the second load. When the panel comes up short, the source of truth moves to the JavaScript bundle, because the URL and the request-building logic are in there as literal strings.
Pull the bundle and search it. Grep the deobfuscated or pretty-printed JavaScript for the obvious markers: /api, /v1/, /v2/, fetch(, axios, XMLHttpRequest, .graphql, and the gql` template tag. A minified bundle is hard to read but not hard to search, and the endpoint paths usually survive minification intact because they are string constants the minifier cannot rename. From there you can often reconstruct how a request is assembled: which query parameters it takes, what the body shape is, and where the headers come from. An intercepting proxy such as mitmproxy, Charles, or Fiddler is the other half of this kit, capturing traffic that the dev tools panel hides and letting you replay and modify it outside the page. The mobile equivalent of this work, where the bundle is a compiled binary and the transport is pinned, lives in reverse-engineering a mobile app’s API; the desktop-web version is the friendly case by comparison.
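The bundle search can be as crude as a regular expression over the downloaded JavaScript. The sketch below assumes the routes appear as quoted string literals beginning with /api, a version segment such as /v1, or /graphql; the pattern and the `find_endpoint_strings` name are illustrative choices, not an exhaustive detector.

```python
import re
from typing import List

# Quoted literals (", ' or backtick) that start with /api, /v<digits> or /graphql.
# This pattern is an assumption about how routes look in a bundle; widen it as needed.
ENDPOINT_PATTERN = re.compile(
    r"""["'`](/(?:api|v\d+|graphql)(?:[/?][^"'`\s]*)?)["'`]"""
)


def find_endpoint_strings(bundle_source: str) -> List[str]:
    """Return unique route-like string literals in first-seen order."""
    seen = {}
    for match in ENDPOINT_PATTERN.finditer(bundle_source):
        seen.setdefault(match.group(1), None)
    return list(seen)
```

Routes assembled by string concatenation show up only as fragments, so treat the output as leads to confirm against real traffic rather than as a finished endpoint list.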
One reliable shortcut on framework sites is to skip the API entirely and read the data the server already embedded in the HTML. More on that below, because it deserves its own section.
The GraphQL persisted-query wrinkle
GraphQL endpoints are common and, for the most part, friendly to replay. You POST a query and variables to a single /graphql URL and get JSON back. The complication is a performance optimisation called Automatic Persisted Queries, and it is worth understanding because at first glance it looks like the query has been hidden from you.
Under APQ the client does not send the query text on every request. It computes the SHA-256 hash of the query string and sends only that hash, inside an extensions object. The shape, from Apollo’s own documentation, is an extensions.persistedQuery object carrying "version": 1 and a "sha256Hash" field holding the hex digest. The server keeps a cache mapping hash to query. On a hit it runs the cached query. So the request you capture in the network panel may contain a hash and some variables and no readable query at all, which is what makes it look obfuscated.
It is not obfuscated, and the protocol itself tells you how to recover the query. The contract has a defined miss path. When the server has no query cached for a given hash it returns a PersistedQueryNotFound error, and the client’s documented response is to retry, this time sending the full query text alongside the same hash so the server can cache it. That round-trip is observable. Reload with the network panel open and you can usually catch the registration request that carries the complete query string, because the client had to send it at least once to prime the cache. Apollo’s docs spell out the three-step flow: hash-only first, full query on the miss, hash-only thereafter, with hashed reads optionally sent as GET when useGETForHashedQueries is set and mutations staying on POST. The Crawlee team’s write-up on reverse engineering the persistedQuery extension documents the same recovery using mitmproxy to deliberately send a bad hash and force the server to reveal the error and the resend.
There is a catch worth stating plainly. A hash you capture today can stop working tomorrow. Hashes are derived from the exact query text, so when the site ships a new front-end bundle with a tweaked query, the hash changes, and a server cache that has been flushed will no longer recognise the old one. The mitigation people use is to keep replaying the registration request, full query plus hash, so the entry stays warm, but the durable answer is to extract the query text and send it yourself rather than depend on a hash you do not control.
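The hash-then-retry flow is short enough to sketch. In the code below, `post` is a hypothetical callable that sends a JSON body to the GraphQL endpoint and returns the parsed JSON reply; matching on the error message string, with the PERSISTED_QUERY_NOT_FOUND error code as a fallback, is an assumption about how the miss is reported. The hash only matches what the site's own client sends if the query text is byte-for-byte identical to the one in the bundle.

```python
import hashlib
from typing import Any, Callable, Dict


def apq_extensions(query: str) -> Dict[str, Any]:
    """Build the extensions object: SHA-256 hex digest of the exact query text."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return {"persistedQuery": {"version": 1, "sha256Hash": digest}}


def _is_apq_miss(reply: Dict[str, Any]) -> bool:
    """True if the reply looks like a persisted-query cache miss (assumed shape)."""
    for error in reply.get("errors") or []:
        code = (error.get("extensions") or {}).get("code")
        if error.get("message") == "PersistedQueryNotFound":
            return True
        if code == "PERSISTED_QUERY_NOT_FOUND":
            return True
    return False


def run_persisted(
    post: Callable[[Dict[str, Any]], Dict[str, Any]],
    query: str,
    variables: Dict[str, Any],
) -> Dict[str, Any]:
    """Send hash-only first; on a miss, resend with the full query text attached."""
    hashed_only = {"variables": variables, "extensions": apq_extensions(query)}
    reply = post(hashed_only)
    if _is_apq_miss(reply):
        # Registration request: same hash, plus the query so the server can cache it.
        reply = post({"query": query, **hashed_only})
    return reply
```

Because the function always holds the query text, it does not depend on the server's cache staying warm, which is the durable option the paragraph above recommends.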
Replaying the request: the headers that matter
Once you have a target URL, a method, and a body, the work is figuring out the smallest request the server will accept. Browsers attach a long list of headers. The server checks a few of them. Your job is to separate the two.
Three categories show up. There are baseline browser headers like User-Agent and Accept-Language that some servers sanity-check and most ignore. There are contextual headers the browser sets based on how the request was initiated, most importantly Referer and Origin, which a server can use to confirm the call came from its own pages. And there are application headers, usually X- prefixed, that the front-end code attaches deliberately: X-CSRF-Token, X-Requested-With, X-Api-Key, or some bespoke X-Client-Version. Missing a required one of these typically earns a 400 or a 403 rather than the data. The discipline is the same in every case: replicate the browser’s full request, confirm it works, then strip headers one at a time and watch for the first removal that breaks it.
Two parts of the request deserve specific attention because they are hard to get right from outside a browser. The first is the Sec-Fetch metadata set, standardised by the W3C WebAppSec group: Sec-Fetch-Site, Sec-Fetch-Mode, Sec-Fetch-Dest, and Sec-Fetch-User. The browser, not the page’s JavaScript, sets these, and it sets them based on the real relationship between the request’s initiator and its target. Sec-Fetch-Site reports whether the request is same-origin, same-site, cross-site, or none, and Sec-Fetch-Mode reports navigate for a top-level page load versus cors or no-cors for a sub-resource fetch. Nothing stops an HTTP client outside a browser from writing any value it likes into these headers, so they are not unforgeable. The point is that a value which is internally inconsistent with the rest of the request is a cheap tell for the server. A fetch-style data call that arrives claiming Sec-Fetch-Mode: navigate did not come from a browser.
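To make "internally inconsistent" concrete, here is a rough sketch of the kind of cross-check a server could run on these headers, which doubles as a lint for your own replayed requests. The three rules are assumptions drawn from how browsers populate the headers (navigations carry a document-like destination, fetch and XHR calls carry an empty one, and Sec-Fetch-User accompanies only navigations); no particular bot-management product is known to use exactly these rules.

```python
from typing import Dict, List

SUBRESOURCE_MODES = {"cors", "no-cors", "same-origin"}


def fetch_metadata_tells(headers: Dict[str, str]) -> List[str]:
    """Return human-readable reasons the Sec-Fetch-* headers look inconsistent."""
    # Header names are case-insensitive, so normalise before comparing.
    h = {k.lower(): v.strip().lower() for k, v in headers.items()}
    mode = h.get("sec-fetch-mode")
    dest = h.get("sec-fetch-dest")

    tells = []
    if mode == "navigate" and dest == "empty":
        tells.append("navigate mode with an empty destination")
    if mode in SUBRESOURCE_MODES and dest == "document":
        tells.append("sub-resource mode with a document destination")
    if "sec-fetch-user" in h and mode != "navigate":
        tells.append("Sec-Fetch-User on a non-navigation request")
    return tells
```

An empty list does not mean the request passes for a browser. It only means these particular headers do not contradict each other.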
The second is the cookie jar. Many endpoints depend on a session cookie set during page load or a CSRF token that the server issued in one response and expects echoed in a header on the next. That means you cannot replay the API call in isolation; you have to perform the earlier request that mints the cookie or token first, carry it forward, and keep it consistent across the session. Whether you keep that session pinned to one IP or rotate it matters once volume goes up, which is the subject of sticky sessions vs rotating IPs and session and cookie management across a proxy fleet.
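The mint-then-replay sequence looks roughly like this with only the standard library. It is a sketch under stated assumptions: the token is assumed to sit in a meta tag named csrf-token with the name attribute before content, and to be echoed in an X-CSRF-Token header, which is one common convention rather than a rule. `extract_csrf_token` and `fetch_with_session` are hypothetical helper names, and real sites may place the token in a cookie, an inline script, or a separate endpoint.

```python
import re
import urllib.request
from http.cookiejar import CookieJar
from typing import Optional

# Assumed convention: <meta name="csrf-token" content="...">
CSRF_META = re.compile(
    r'<meta[^>]+name=["\']csrf-token["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def extract_csrf_token(html: str) -> Optional[str]:
    """Pull the token out of the page HTML, or return None if it is absent."""
    match = CSRF_META.search(html)
    return match.group(1) if match else None


def fetch_with_session(page_url: str, api_url: str) -> bytes:
    """Load the page to mint cookies and a token, then call the API with both."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    # Step 1: the page load sets the session cookie in `jar`.
    with opener.open(page_url, timeout=30) as page:
        html = page.read().decode("utf-8", errors="replace")

    token = extract_csrf_token(html)
    if token is None:
        raise RuntimeError("no CSRF token found in the page")

    # Step 2: same opener, so the cookie travels with the API call.
    request = urllib.request.Request(
        api_url,
        headers={
            "X-CSRF-Token": token,
            "Referer": page_url,
            "Accept": "application/json",
        },
    )
    with opener.open(request, timeout=30) as response:
        return response.read()
```

The important design choice is reusing one opener, and therefore one cookie jar, for both requests. A fresh client for the second call would arrive with a token the server never issued to that session.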
Beyond the headers themselves there is the matter of how the connection looks at the TLS and HTTP/2 layer, which the headers cannot fix. A Python client and Chrome negotiate TLS differently, advertise different cipher orders, and frame HTTP/2 differently, and that difference is summarised in fingerprints like JA3 and JA4. An endpoint sitting behind a bot-management product can read that fingerprint before it ever looks at your headers. The mechanics of that are covered in TLS fingerprinting: from ClientHello bytes to JA4, and it is the reason a request that looks header-perfect can still get a 403.
Tokens, signatures, and the wall they build
The endpoints discussed so far check who you are with cookies and tokens you can capture. A harder class signs the request itself, so that replaying captured headers is not enough; the server recomputes a value from the request contents and rejects anything that does not match.
The common form is HMAC request signing. The client takes some canonical string built from the request, often the method, the path, a timestamp, and a hash of the body, and computes an HMAC of that string with a secret key, then sends the result in a header. There is no single standard header name for this; in practice you see pairs like X-Request-Timestamp and X-Request-Signature, or vendor-specific names, carrying a base64 or hex MAC. The timestamp is there so the server can reject a signature that is more than a few seconds or minutes old, which kills naive replay: capture a request, wait, resend it, and the stale timestamp fails the freshness check even though the signature was valid when you grabbed it.
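A generic version of this scheme fits in a few lines. Everything specific here is an assumption for illustration: the canonical string layout (method, path, timestamp, body hash joined by newlines), the hex encoding, and the five-minute freshness window. Real deployments choose their own canonical form, which is exactly the part you have to recover from the client code.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # assumed freshness window; sites pick their own


def canonical_string(method: str, path: str, timestamp: int, body: bytes) -> bytes:
    """Assumed layout: METHOD, path, timestamp, SHA-256 of body, newline-joined."""
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, str(timestamp), body_hash]).encode("utf-8")


def sign(secret: bytes, method: str, path: str, timestamp: int, body: bytes) -> str:
    """Client side: HMAC-SHA256 over the canonical string, hex-encoded."""
    message = canonical_string(method, path, timestamp, body)
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(
    secret: bytes,
    method: str,
    path: str,
    timestamp: int,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
) -> bool:
    """Server side: reject stale timestamps, then recompute and compare."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > MAX_SKEW_SECONDS:
        return False  # a captured request replayed later fails here
    expected = sign(secret, method, path, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

The freshness check is why a captured signature has a short shelf life, and the body hash is why you cannot change the page number in a signed request without re-signing it.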
What makes signing a real wall rather than a speed bump is where the key lives. If the secret is embedded in client-side JavaScript, it is recoverable; obfuscation raises the cost of reading it but does not change the fact that the browser must hold the key to sign, and what the browser holds, a determined reader can extract. The harder deployments move the signing logic into WebAssembly or, on mobile, into a native library, so that the algorithm is compiled machine code rather than readable script. The work then becomes static analysis of a binary rather than reading JavaScript, which is the same wall mobile reversers hit and the reason reverse-engineering a mobile app’s API treats the protobuf-and-native-code combination as the genuine stopping point.
*Where the secret key lives decides everything. In readable JavaScript it is recoverable. Compiled into WebAssembly or a native library, it becomes a static-analysis problem, which is where the wall is real.*
For a formal version of the same idea, RFC 9421, published in 2024, standardises HTTP message signatures with Signature-Input and Signature headers and a defined canonicalisation of the covered components. You will not see RFC 9421 on most consumer SPAs yet, but it is the direction reputable APIs are heading, and recognising its header names saves you from treating a documented standard as a bespoke obstacle.
The data that ships inside the HTML
Before reaching for any API, check whether the framework already handed you the data. Server-side-rendered React frameworks frequently embed the page’s initial data directly in the HTML so the client can hydrate without an extra round-trip, which means the JSON is sitting in the document you already downloaded with a single GET.
On Next.js using the Pages Router, the place to look is a script tag with the id __NEXT_DATA__ and type application/json. It holds a JSON document with the props passed to the page, which is to say the data the server fetched to render it. Next.js documents this openly, with the warning that anything in getServerSideProps props is visible to the client in that initial HTML, so developers should not pass secrets through it. That warning is exactly why the data is so easy to read: it was never meant to be hidden, only to avoid a second request. Parse the script tag’s contents as JSON and the structured data is yours, no JavaScript execution required.
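Reading that tag takes only the standard library. The sketch below walks the HTML with html.parser, collects the contents of the script whose id is __NEXT_DATA__, and parses it as JSON. `extract_next_data` is a hypothetical helper name, and the layout of whatever sits under props is site-specific, so inspect it before relying on any key.

```python
import json
from html.parser import HTMLParser
from typing import Any, List, Optional


class _NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._done = False
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self._done and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_data(self, data):
        # Script text can arrive in several pieces, so accumulate it.
        if self._inside:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            self._done = True


def extract_next_data(html: str) -> Optional[Any]:
    """Return the parsed __NEXT_DATA__ document, or None if the tag is missing."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.chunks:
        return None
    return json.loads("".join(parser.chunks))
```

A None result is the signal to move on, either to the App Router format or to the live endpoint.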
The App Router, introduced in Next.js 13, changed the wire format. Instead of one __NEXT_DATA__ blob, it streams React Server Component output through a series of inline scripts that call self.__next_f.push(...), each push a chunk of a React-specific serialisation often called the Flight format. It is messier to parse than a single JSON object, but it is still data sitting in the HTML, and the open-source njsparser Python module exists to walk those chunks and pull structured objects back out. Client-side navigations in this model fetch RSC payloads with a ?_rsc= query parameter and an Rsc: 1 request header, which is another endpoint pattern to recognise in the network panel. Nuxt does the analogous thing for Vue with a window.__NUXT__ payload, and SvelteKit and others have their own conventions; the common thread is that hydration data is, by design, readable in the response. When it is there, it is the cheapest path on offer, because it costs one request and no execution.
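A first pass at the App Router case is to collect the push calls and leave the Flight decoding for later. The sketch below assumes each inline script consists of a single self.__next_f.push(...) call whose argument is a JSON-compatible array, which is an assumption about the current output and not a documented contract. It does not interpret the Flight rows themselves, and the helper names are hypothetical; njsparser, mentioned above, is the fuller tool.

```python
import json
from html.parser import HTMLParser
from typing import Any, List

PUSH_PREFIX = "self.__next_f.push("


class _ScriptCollector(HTMLParser):
    """Collects the full text of every <script> element."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._buffer: List[str] = []
        self.scripts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.scripts.append("".join(self._buffer))
            self._inside = False


def extract_flight_chunks(html: str) -> List[Any]:
    """Return the parsed argument of each self.__next_f.push(...) script."""
    collector = _ScriptCollector()
    collector.feed(html)
    collector.close()

    chunks = []
    for source in collector.scripts:
        text = source.strip().rstrip(";")
        if text.startswith(PUSH_PREFIX) and text.endswith(")"):
            try:
                chunks.append(json.loads(text[len(PUSH_PREFIX):-1]))
            except json.JSONDecodeError:
                continue  # not a JSON-compatible literal; skip rather than guess
    return chunks


def flight_text(chunks: List[Any]) -> str:
    """Concatenate the string payloads of two-element [kind, text] chunks."""
    return "".join(
        c[1] for c in chunks
        if isinstance(c, list) and len(c) == 2 and isinstance(c[1], str)
    )
```

The concatenated text is the raw serialised stream. Searching it for a known field name is often enough to confirm the data is present before investing in a proper decoder.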
The honest caveat is that embedded hydration data only covers what the server rendered. Anything the page fetches later, on interaction or after scroll, will not be in the initial HTML, and for that you are back to finding the live endpoint.
Where this stops being worth it
API discovery wins most of the time. It does not win all of the time, and knowing the failure modes saves you from sinking hours into an approach the site has specifically defended against.
The first wall is server-side bot management. If the endpoint sits behind a product like DataDome, Akamai Bot Manager, Cloudflare Bot Management, or Kasada, then reproducing the headers and the body is not sufficient, because the defence reads signals your HTTP client emits below the application layer. The TLS ClientHello, the HTTP/2 frame ordering, and the absence of a browser-generated challenge token all feed a score before your carefully reconstructed request is even parsed. Crawlex has separate write-ups on how those systems read the first request and how Cloudflare turns TLS and HTTP/2 fingerprints into a score. When the endpoint demands a token that only the vendor’s obfuscated JavaScript can mint, you are no longer doing API discovery; you are solving the anti-bot challenge, and a real browser may be the lower-effort route.
The second wall is client-side computation you cannot cheaply reproduce. A signature scheme with the key buried in WebAssembly, a token derived from running a proof-of-work, a payload encrypted by logic that only makes sense inside the running page: each of these can be reversed, but each raises the cost, and at some point the cost of reading compiled bytecode exceeds the cost of just running the browser that produces the value for you. The proof-of-work renaissance post covers one version of this directly.
The third wall is non-technical and the one most easily forgotten. Finding an undocumented endpoint does not grant permission to hammer it. The Ninth Circuit’s 2022 decision in hiQ v. LinkedIn, issued on remand from the Supreme Court, held that scraping public data is unlikely to be “without authorization” under the Computer Fraud and Abuse Act, but the same litigation ended with a judgment against hiQ on breach-of-contract and CFAA grounds tied to fake accounts and access to password-protected pages. The line that matters is between public data and authenticated data, and between reading at a polite rate and overwhelming a service. Replaying a private API at volume, behind login, against a site whose terms forbid it, is a different act with a different legal exposure than reading a public JSON feed at a human pace. Rate-limiting yourself is both courtesy and self-protection, which is the subject of rate limiting yourself.
What the technique is really buying
The pattern underneath all of this is simple. A rendered page is a projection of data that already moved over HTTP in a cleaner form, and the projection is the expensive part to consume. Skip the projection and you read the source. That is why the network panel is the first tool and the headless browser is the last resort, and why the strongest version of the skill is recognising, fast, which of the two a given site calls for.
The walls are getting taller on the high-value targets. A decade ago almost every SPA had a naked JSON endpoint you could hit with curl and a User-Agent. Now the same class of site fronts that endpoint with TLS fingerprinting, request signing, and a challenge token minted by obfuscated WebAssembly, and the gap between “found the endpoint” and “can replay the endpoint” has widened into most of the work. But the long tail has not moved. The enormous majority of JavaScript-rendered sites, the ones that are not Ticketmaster or a sneaker drop, still serve their data from an endpoint that checks a cookie and a referer and nothing more, or embed it in a __NEXT_DATA__ tag they forgot was readable. For those, launching a browser to render a page whose data is sitting one GET away is not thoroughness. It is paying the rendering tax on a bill that was already settled.
Sources & further reading
- Apollo GraphQL (2024), Automatic Persisted Queries. The canonical APQ flow, the extensions.persistedQuery object with version and sha256Hash, and the PersistedQueryNotFound round-trip.
- Crawlee (2024), Reverse engineering GraphQL persistedQuery extension. Recovering the full query text by forcing a hash miss with mitmproxy, and the hash-rotation failure mode.
- Backman, Richer, Sporny (2024), RFC 9421: HTTP Message Signatures. IETF standard for signing HTTP components, with Signature-Input, Signature, and the covered-components canonicalisation.
- W3C WebAppSec (2024), Fetch Metadata Request Headers. The spec defining Sec-Fetch-Site, Sec-Fetch-Mode, Sec-Fetch-Dest, and Sec-Fetch-User and the browser semantics behind them.
- MDN Web Docs (2025), Sec-Fetch-Site. The per-value meaning of same-origin, same-site, cross-site, and none, and why the browser rather than the page sets it.
- Vercel / Next.js (2025), Data Fetching: getServerSideProps. Official confirmation that props serialise into the client-visible initial HTML, the basis for __NEXT_DATA__ extraction.
- Ed Spencer (2024), Decoding React Server Component Payloads. Walkthrough of the self.__next_f.push Flight format and the chunk-tuple structure of App Router payloads.
- Trickster Dev (2025), Scraping Next.js web sites in 2025. The __NEXT_DATA__ script tag, the Flight-data App Router case, the njsparser module, and the buildId.
- Scrapfly (2024), How to Scrape Hidden APIs. The network-panel-to-cURL workflow, the three header categories, and where dynamic tokens hide in the page.
- curlconverter (2025), curlconverter. Open-source tool that turns a browser “Copy as cURL” into a runnable HTTP client call in Python, Node, and other languages.
- US Court of Appeals, Ninth Circuit (2022), hiQ Labs v. LinkedIn. The decision on remand holding public-data scraping unlikely to violate the CFAA, and the contract and fake-account judgment that followed.