Session and cookie management across a proxy fleet
A cookie is a claim about who you are. It says: this client already passed a check, already logged in, already proved it could run JavaScript. The server trusts the claim because, at issue time, it was earned. Then the next request arrives carrying that same cookie from a different IP address, in a different city, on a network owned by a different company. The claim has not changed. The context around it has. And the gap between the two is where a surprising amount of anti-bot detection now lives.
Run a crawler at any scale and you rotate proxies. You have to. A single IP making a thousand requests a minute is the easiest thing in the world to rate-limit. So you spread the load across a pool, and the moment you do, you have introduced a coherence problem that a single-IP scraper never has. The session state your crawler carries (cookies, tokens, server-side session rows) was minted in the context of one exit node, and now it is being presented from another. This post is about that seam: how identity is bound to an IP in the first place, what actually happens when a session leaks across IPs, and how the major anti-bot vendors turn an IP-session mismatch into a block.
The roadmap. First, the plumbing: what a session is at the HTTP layer, and the difference between affinity and persistence as load-balancer people use the words. Then the binding mechanisms, from the boring-but-instructive Dataverse CrmOwinAuth cookie up through the anti-bot vendors who bind cookies to IP and fingerprint on purpose. Then the failure mode, the leak, and the specific detectors that fire on it: impossible travel, ASN incoherence, and cross-session scoring that survives a fresh cookie. We close on the architecture that keeps a fleet coherent, and on why the honest version of that architecture is slower than the naive one.
What a session actually is
Strip away the vendor language and a web session is two things glued together. There is a token the client holds, almost always a cookie, and there is state the server keeps, often a row in a store keyed by that token. RFC 6265 is the spec that defines the mechanism. It exists because HTTP is stateless, and a server that wants to remember anything across requests has to hand the client a value and ask for it back. The Set-Cookie header sends it out; the Cookie header carries it home.
The security attributes on that cookie matter for what follows. Secure restricts the cookie to TLS-protected requests. HttpOnly keeps it out of document.cookie and therefore out of reach of injected JavaScript. SameSite controls whether the cookie rides along on cross-site requests at all. None of these three binds the cookie to a network address. That is the important part. A standard session cookie is a bearer token. Whoever presents it is treated as the party it was issued to, no questions asked about where they are presenting it from. The whole apparatus of IP-bound sessions exists precisely because the default cookie offers no such guarantee.
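To make the bearer-token point concrete, here is a small sketch using Python's standard http.cookies module. The cookie name sid and its value are invented for illustration. It parses a Set-Cookie value and lists the attributes that were actually set, and nothing among them names a network address.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value; the name "sid" and the token are invented.
raw = "sid=abc123; Path=/; Secure; HttpOnly; SameSite=Lax"

jar = SimpleCookie()
jar.load(raw)
morsel = jar["sid"]

# Keep only the attributes that were actually set on this cookie.
attrs = {name: value for name, value in morsel.items() if value}
print(attrs)
```

Every attribute here constrains transport, script access, or cross-site behaviour. Where the cookie is presented from is simply not part of the vocabulary.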
There is a second confusion worth clearing early, because the words get reused. In the load-balancer world, “sticky session” does not mean what it means to a scraper. HAProxy draws the line cleanly: affinity is using information below the application layer (typically the source IP) to keep a client pinned to one backend server, and persistence is using application-layer information (a cookie) to do the same. Their own phrasing is that with persistence “we’re 100% sure that a user will get redirected to a single server,” while with affinity “the user may be redirected to the same server.” Affinity is a hint. Persistence is a contract. A reverse proxy doing source-IP affinity will route you to the same backend only as long as your IP holds steady, which is exactly the assumption a rotating fleet violates by design.
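The affinity-versus-persistence split can be sketched in a few lines. This is an illustration of the idea, not HAProxy's algorithm: the backend names, the SERVERID cookie name, and the hash-modulo choice are all assumptions made for the example.

```python
import hashlib

BACKENDS = ["app-1", "app-2", "app-3"]  # hypothetical backend names


def pick_by_affinity(source_ip: str) -> str:
    """Source-IP affinity: hash the client address onto a backend."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]


def pick_by_persistence(cookies: dict, source_ip: str) -> str:
    """Cookie persistence: honour the backend named in the cookie.

    "SERVERID" is a hypothetical cookie name chosen for this sketch.
    """
    pinned = cookies.get("SERVERID")
    if pinned in BACKENDS:
        return pinned
    # First request, no cookie yet: fall back to any placement rule.
    return pick_by_affinity(source_ip)


# The cookie keeps the client on app-2 even when its address changes.
print(pick_by_persistence({"SERVERID": "app-2"}, "203.0.113.10"))
print(pick_by_persistence({"SERVERID": "app-2"}, "198.51.100.20"))
```

With persistence the answer depends only on the cookie. With affinity it depends only on the address, so a rotated exit IP is, as far as the balancer can tell, a different client.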
When a proxy provider sells you a “sticky session,” they mean the opposite end of the pipe: the provider keeps your exit IP stable for some window, typically 1 to 30 minutes, so that everything you send during that window leaves from the same address. That is the feature you are buying to solve the coherence problem. It is worth naming the two meanings out loud because they collide constantly in scraping discussions, and the collision hides the real question, which is what the target server is keying your identity on.
*The token the client carries holds no IP. The binding, when a server enforces one, lives in the server-side state and is checked at request time.*
Binding a cookie to an IP, the plain version
Before the anti-bot vendors, look at the simplest possible implementation, because it makes the mechanism legible. Microsoft Dataverse ships an optional feature called IP address-based cookie binding. When it is on, the server writes an IP-address claim into the session cookie at issue time. On every subsequent request it compares the current source IP against the one stored in the cookie, and if they differ, the request is denied with an HTTP 403. The feature targets the CrmOwinAuth cookie, takes about five minutes to propagate after you flip it on, and evaluates in real time on every request after the first.
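The issue-then-compare mechanism is simple enough to sketch. What follows is a minimal illustration of the general pattern, not Microsoft's implementation: the cookie layout, the HMAC signature, and the names issue_cookie and check_request are assumptions made for the example.

```python
import hashlib
import hmac

SECRET = b"demo-key"  # hypothetical signing key, for illustration only


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def issue_cookie(session_id: str, client_ip: str) -> str:
    """Write the issuing IP into the cookie as a signed claim."""
    payload = f"{session_id}|{client_ip}"
    return f"{payload}|{_sign(payload)}"


def check_request(cookie: str, current_ip: str) -> int:
    """Return an HTTP status: 200 if the claim is intact and the IP matches."""
    try:
        session_id, bound_ip, signature = cookie.split("|")
    except ValueError:
        return 403  # malformed cookie
    if not hmac.compare_digest(signature, _sign(f"{session_id}|{bound_ip}")):
        return 403  # claim was tampered with
    return 200 if bound_ip == current_ip else 403


cookie = issue_cookie("session-1", "203.0.113.7")
print(check_request(cookie, "203.0.113.7"))   # same exit IP
print(check_request(cookie, "198.51.100.9"))  # rotated exit IP
```

The signature matters in this sketch because the claim travels with the client. Without it, a client could rewrite the bound address to match whatever exit it happens to be using.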
What is instructive is the failure list Microsoft publishes. Cookie binding forces a reauthentication “when any VPN client is turned on or off,” “when connecting to a wireless hotspot,” “when the Internet connection is reset by the Internet service provider,” and “when a router is reset or restarted.” Every one of those is a benign event that changes a real user’s IP. The binding cannot tell a roaming laptop from a stolen cookie, so it treats both the same: new IP, new proof required. That is the central tension of IP binding stated by the people who shipped it. The control is strong precisely because it is dumb, and being dumb, it punishes legitimate mobility as readily as it punishes theft.
One more detail from that doc, easy to skip, that matters enormously for crawlers. Microsoft notes that if traffic is routed through a reverse proxy with a dynamic IP, cookie binding “won’t work,” and to make it work you must configure the proxy to forward the real client IP in the Forwarded header. This is the whole game in miniature. The server does not see your IP. It sees the last hop’s IP, unless something earlier in the chain told it otherwise via X-Forwarded-For, True-Client-IP, or RFC 7239’s Forwarded. Whose IP “the session IP” refers to depends entirely on which header the server trusts, and that trust boundary is one of the most error-prone parts of the whole stack.
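One common way a server resolves "the client IP" is to walk X-Forwarded-For from the right and stop at the first hop it does not itself trust. The sketch below assumes a hypothetical trusted range of 10.0.0.0/8 and well-formed addresses; real deployments differ in which header they read and how they handle malformed input.

```python
import ipaddress
from typing import Optional

# Hypothetical range for proxies the server operates itself.
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]


def is_trusted(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TRUSTED_PROXIES)


def effective_client_ip(peer_ip: str, xff: Optional[str]) -> str:
    """Pick the address the server will treat as the session IP."""
    # A header sent by an untrusted peer is just client-supplied text.
    if not xff or not is_trusted(peer_ip):
        return peer_ip
    hops = [hop.strip() for hop in xff.split(",") if hop.strip()]
    # Walk right to left: the first hop we do not operate is the client.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return peer_ip


print(effective_client_ip("10.0.0.5", "198.51.100.9, 10.0.0.7"))
print(effective_client_ip("203.0.113.7", "192.0.2.1"))
```

The second call is the interesting one for a crawler: a header supplied by an untrusted peer is ignored here, but a server that skips the trust check would have accepted 192.0.2.1 as the session IP.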
The cryptographically serious version of the same idea is Token Binding, RFC 8471. Instead of writing an IP into a cookie, it binds the token to a private key held by the client, and the client proves possession of that key on each TLS connection by signing exported keying material, so a stolen cookie cannot be replayed by anyone who lacks the key, even from the same IP. The exporter label is EXPORTER-Token-Binding, the export is 32 bytes, and the protocol genuinely defeats the theft it targets. It also went nowhere on the web. Chromium removed support in version 70, and the browser side never recovered. The reason is worth holding onto, because it recurs: a binding that is too tight breaks too many legitimate flows, and the web routes around it. IP binding survives where Token Binding died because IP binding degrades to “solve a challenge again” rather than “you are locked out.”
How anti-bot vendors bind on purpose
Anti-bot vendors took the Dataverse idea and made it the point. Where a normal application binds a session to an IP as a hardening option, a bot-management product binds its own trust cookie to IP and device fingerprint as the core of how it decides whether to trust you at all. The cookie is not your login. It is the vendor’s verdict on you, written down so it does not have to recompute it every request, and the verdict is meaningless if it can be carried to a different machine.
DataDome is the clean example. Its primary cookie is named datadome, it is roughly 128 bytes of encrypted data, and it persists for about a year. The public docs are deliberately quiet on the binding mechanics, and you should be too when you describe them: the exact field layout is not published, and what follows about IP and fingerprint binding is inferred from the vendor’s own behavioural descriptions plus widely reported observation, not from a spec. With that caveat, the cookie carries the result of DataDome’s checks, including whether the client has cleared a CAPTCHA or passed Device Check. On the server side, the Protection API that validates each request takes a mandatory ClientID field whose stated job is “to track the user session,” sourced either from an X-DataDome-ClientID header or, failing that, from the datadome cookie value. The same validation call receives the request IP and the forwarded-for chain (capped at 512 bytes) as separate fields. The cookie identifies the session; the IP and headers describe the request; and the verdict comes from holding them against each other. Rotate the IP under a cookie that was issued for a different one, and the two halves disagree.
Cloudflare’s cf_clearance is the interesting counter-case, because Cloudflare’s own documentation says the cookie is not bound to IP. It states the cookie “is securely tied to the specific visitor and device it was issued to,” device-level binding rather than network-level. By default the clearance lasts 30 minutes (the Challenge Passage value, recommended between 15 and 45), it tops out at the 4096-byte cookie ceiling, and it comes in three privilege tiers from non-interactive up to interactive. So the official line is that you could move the cookie to a new IP and keep clearance. In practice, operators report the opposite often enough that it is worth stating carefully: a clearance cookie presented from an IP wildly different from the solve, or from an IP whose reputation has since soured, frequently triggers a fresh challenge. The reconciliation is that clearance and bot score are different layers. The cookie can be valid while the IP underneath it has independently earned a block, and the second layer does not care that the first one is satisfied. This is the same separation DataDome makes explicit, dressed differently.
Akamai’s _abck is the third pattern, and the one most clearly built around cross-request continuity. It is the cookie set after the client posts a sensor_data payload to Akamai’s challenge endpoint, and it carries forward the result of that telemetry so later requests can be judged against it. The published reverse-engineering work is candid that the server-side validation is not fully understood; the client side (the obfuscated sensor script, the signal collection) is well mapped, the matching logic on Akamai’s side is not. What is observable is the behaviour: a mismatch between the _abck value and the rest of the request fingerprint marks the visitor as a bot, and a low score attached to an IP-and-fingerprint pair persists, so a freshly minted cookie does not rescue a combination that has already been judged. That last property is the one that defeats the naive fix of just clearing cookies and starting over.
If you want the per-cookie detail, Crawlex has dedicated write-ups on the DataDome cookie lifecycle, Cloudflare’s cf_clearance cookie, and Akamai Bot Manager’s _abck cookie. This post is about the layer above all three: keeping any of them coherent across a fleet.
The leak, and why it is the expensive bug
Here is the failure in one sentence. You acquire session state on exit IP A, your proxy rotation hands the next request exit IP B, and you present the IP-A state from IP B. That is a session leak across IPs, and depending on the target it ranges from “lose the session and retry” to “burn the cookie, the IP, and the fingerprint together.”
The mildest case is a plain application doing source-IP affinity or IP-bound cookies, like the Dataverse example. Rotate mid-session and you get a 403 or a redirect to login. Annoying, recoverable, cheap. Re-authenticate from a stable IP and move on. The discussions around this consistently arrive at the same advice: pin one sticky exit IP per logical session, run the entire login-to-logout flow through it, and never rotate between the login POST and the authenticated GETs that depend on it. The session cookie minted during login is only meaningful from the IP that minted it.
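The pinning advice translates directly into how the HTTP client is built. Here is a minimal sketch using only the standard library, assuming a hypothetical sticky proxy URL: one opener owns one proxy and one cookie jar, and the whole login-to-logout flow goes through that opener and nothing else.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_pinned_opener(proxy_url: str):
    """Return an opener whose cookie jar only ever travels via proxy_url."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}),
        urllib.request.HTTPCookieProcessor(jar),
    )
    return opener, jar


# Hypothetical sticky endpoint; real providers encode the session differently.
opener, jar = build_pinned_opener("http://user:pass@proxy.example:8000")

# Every request in the flow would then use this opener, for example:
#   opener.open("https://target.example/login", data=...)
#   opener.open("https://target.example/account")
```

The design point is that the jar is not reachable except through the opener that holds its proxy, so there is no code path that presents these cookies from a different exit.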
The expensive case is an anti-bot-protected target with cross-session scoring. Now the leak does not just fail the current request. It teaches the defender something. When a single datadome or _abck value appears from IPs in three countries inside a minute, that pattern is not ambiguous; no human roams that fast, and the cookie-to-IP fan-out is itself a high-confidence bot signal. The defender does not merely reject the request. It associates the misbehaviour with the cookie, with each IP that presented it, and with the fingerprint that ties them together. A clean residential IP that you paid a premium for can be spent in a single careless rotation, because the system that saw it carry a leaked session will remember the IP, not just the request.
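A fan-out check of this kind needs very little state. The sketch below is a guess at the general shape, not any vendor's logic: the class name, the 60-second window, and the three-country trigger are assumptions taken from the example in the paragraph above.

```python
from collections import defaultdict, deque


class FanOutDetector:
    """Flag a cookie that shows up from too many countries in a short window."""

    def __init__(self, window_s: float = 60.0, max_countries: int = 2):
        self.window_s = window_s
        self.max_countries = max_countries
        self._events = defaultdict(deque)  # cookie -> deque of (ts, ip, country)
        self._ips = defaultdict(set)       # cookie -> every IP that presented it

    def observe(self, cookie: str, ip: str, country: str, ts: float) -> bool:
        events = self._events[cookie]
        events.append((ts, ip, country))
        self._ips[cookie].add(ip)
        # Drop sightings that have aged out of the window.
        while events and ts - events[0][0] > self.window_s:
            events.popleft()
        countries = {c for _, _, c in events}
        return len(countries) > self.max_countries

    def tainted_ips(self, cookie: str) -> set:
        """Every IP that ever carried this cookie, not just the latest one."""
        return set(self._ips[cookie])


detector = FanOutDetector()
detector.observe("cookie-a", "203.0.113.1", "DE", 0.0)
detector.observe("cookie-a", "198.51.100.2", "US", 20.0)
flagged = detector.observe("cookie-a", "192.0.2.3", "JP", 40.0)
print(flagged, detector.tainted_ips("cookie-a"))
```

Note that tainted_ips keeps addresses after they leave the sliding window. That is the part that makes the leak expensive: the verdict outlives the request that triggered it.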
This is the property that makes the naive fix useless. Clear the cookies, you think, and start fresh. But the score that got attached to the IP-and-fingerprint pair persists independently of the cookie, so the fresh cookie inherits the old verdict. Akamai’s behaviour is the textbook version: a low score on an IP/fingerprint combo is not rescued by a new _abck. You have to change the IP and the fingerprint and the cookie together to actually present as a new entity, and if your fleet is built so that the cookie store and the proxy pool and the fingerprint generator are three independent subsystems that do not coordinate, you cannot do that reliably. The coherence has to be designed in. It does not emerge from rotating each layer on its own clock.
The detectors that catch the mismatch
The mismatch is easy to catch because the signals are cheap and the giveaway is loud. Walk through the main ones.
Impossible travel is the oldest and most intuitive. Take two requests carrying the same session identifier, geolocate each source IP to a coordinate pair, measure the great-circle distance with the Haversine formula, divide by the elapsed time, and compare the implied speed to a threshold somewhere around 1000 km/h. Above it, no human could have made both requests, so either the session was stolen or it is being driven by a distributed bot. The technique came from account-security teams catching credential theft, and it ports directly onto bot detection because a rotating proxy fleet is, geometrically, indistinguishable from a hijacked session being used in three places at once. The known weakness is that GeoIP is an estimate, not ground truth, and mobile carrier IPs can be off by hundreds of kilometres, which is exactly why mature detectors layer other signals on top rather than blocking on velocity alone.
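The velocity check is short enough to write out. This sketch uses the standard Haversine formula with a mean Earth radius of 6371 km and the roughly 1000 km/h threshold mentioned above; the function names and the handling of zero elapsed time are assumptions, and the coordinates stand in for whatever a GeoIP lookup would return.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 1000.0  # threshold from the text; tune per deployment


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_impossible_travel(p1, p2, elapsed_s: float) -> bool:
    """True if moving from p1 to p2 in elapsed_s implies a speed over the threshold."""
    distance = haversine_km(p1[0], p1[1], p2[0], p2[1])
    if elapsed_s <= 0:
        # Simultaneous sightings: any separation at all is impossible.
        return distance > 0
    return distance / (elapsed_s / 3600.0) > MAX_SPEED_KMH


berlin, new_york, paris = (52.52, 13.405), (40.7128, -74.0060), (48.8566, 2.3522)
print(is_impossible_travel(berlin, new_york, 60))        # one minute apart
print(is_impossible_travel(berlin, paris, 2 * 3600))     # two hours apart
```

Because the inputs are GeoIP estimates, a production check would likely add a distance floor or an accuracy radius before trusting the implied speed, which is the layering the paragraph above describes.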
ASN and IP-reputation scoring is the layer that fires before your request body is even read. Every IP belongs to an autonomous system, and anti-bot vendors classify ASNs in real time as datacenter, residential, or mobile, then score them by how much abuse historically comes from that network. Datacenter and hosting ASNs (the AWS, OVH, Hetzner kind) start in a penalty box around 50 to 85 on a 0-to-100 risk scale even when the specific IP is clean, while consumer ISP ASNs sit near 0 to 20. The practical effect reported across 2025 and 2026 is stark: datacenter proxies that reach 60 to 90 percent success rates on unprotected targets fall to 20 to 40 percent against Cloudflare or Akamai, because the ASN classification fires first. For session coherence this matters in a specific way. If your fleet mixes a residential exit for the cookie-issuing request and a datacenter exit for a follow-up, you have not just leaked the session across IPs, you have leaked it across IP classes, and the jump from a residential ASN to a hosting ASN under one cookie is its own bright flag. The deeper mechanics live in the Crawlex post on how anti-bot vendors detect residential proxies and ASN reputation.
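The class-jump signal can be sketched as a set comparison. Everything here is illustrative: the table is hypothetical, the ASNs are taken from the range reserved for documentation (RFC 5398), and a real detector would work from a maintained IP-to-ASN dataset rather than a hand-written dictionary.

```python
from typing import Dict, Iterable, Set

# Hypothetical classification table keyed by documentation-range ASNs.
ASN_CLASS: Dict[int, str] = {
    64496: "residential",
    64497: "datacenter",
    64498: "mobile",
}


def classes_seen(asns: Iterable[int]) -> Set[str]:
    """Network classes observed under one session cookie."""
    return {ASN_CLASS.get(asn, "unknown") for asn in asns}


def crosses_ip_class(asns: Iterable[int]) -> bool:
    """True if a single session was presented from more than one class."""
    return len(classes_seen(asns)) > 1


print(crosses_ip_class([64496, 64496]))  # stayed residential
print(crosses_ip_class([64496, 64497]))  # residential, then datacenter
```

The same check is worth running on the crawler side before dispatch, since it is cheaper to refuse a mixed-class assignment than to find out from the block rate.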
Header and fingerprint incoherence is the third detector, and it catches the sloppy fleet rather than the leaking one. If IP #1 sends a Chrome user-agent and IP #2 sends Safari and IP #3 sends Firefox, all under the same session cookie, you have told the defender that one identity is being driven by three different clients, which no real browsing session does. The same applies below HTTP: a session whose TLS fingerprint changes between requests, or whose HTTP/2 settings shift, is incoherent in a way a single browser never is. The cookie says “one user,” the transport says “several machines,” and the contradiction is the signal. This is where the proxy problem meets the broader fingerprinting problem, covered in TLS fingerprinting from ClientHello bytes to JA4; the point here is only that the session cookie is what ties the disparate requests together into a single judged identity. Without the cookie they are unrelated strangers. With it they are one suspect with an alibi that does not hold.
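The contradiction between "one user" and "several machines" reduces to counting distinct values per trait under one session key. A minimal sketch, with the class name and the trait names (user_agent, tls_fingerprint) chosen for illustration:

```python
from collections import defaultdict


class CoherenceTracker:
    """Track which client traits have taken more than one value per session."""

    def __init__(self):
        # session id -> trait name -> set of values seen
        self._seen = defaultdict(lambda: defaultdict(set))

    def observe(self, session_id: str, **traits: str) -> list:
        """Record one request; return the traits that are now inconsistent."""
        inconsistent = []
        for name, value in traits.items():
            values = self._seen[session_id][name]
            values.add(value)
            if len(values) > 1:
                inconsistent.append(name)
        return sorted(inconsistent)


tracker = CoherenceTracker()
tracker.observe("sess-1", user_agent="Chrome", tls_fingerprint="fp-a")
print(tracker.observe("sess-1", user_agent="Safari", tls_fingerprint="fp-a"))
```

Keying on the session identifier is what makes this work. Drop the cookie and the two requests land in different buckets, which is the "unrelated strangers" case from the paragraph above.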
DNS leakage is the subtle one. If your exit IP is in Germany but your resolver queries land on a US resolver, the geographic story your traffic tells is internally inconsistent, and some detectors check for exactly that. It is not strictly a session-coherence signal, but it rhymes with one: a coherent identity has a coherent geography across every layer that exposes geography, and DNS is one of those layers.
Building a fleet that stays coherent
The fix is structural, and it starts by deciding what your unit of identity is. Call it an identity: one logical actor with one cookie jar, one fingerprint, and one exit IP that stay bound together for the actor’s lifetime. The fleet is then a population of identities, not a pool of IPs and a pile of cookies that get shuffled independently. Once the identity is the unit, the rules write themselves. State acquired by an identity is presented only by that identity. The cookie jar travels with the IP, never apart from it.
Concretely that means a sticky session from the proxy provider, held for the full lifetime of the work that identity is doing, with the cookie store keyed by identity rather than by target domain. The login flow runs start to finish on one exit. The cart, the checkout, the authenticated reads, all of it goes through the IP that minted the session. When the sticky window expires (most providers cap it at 10 to 30 minutes) you do not roll the IP under the existing identity and keep going. You retire the identity. The cookies it holds were minted for an IP that is about to disappear, and carrying them onto the next IP is the leak you are trying to avoid. Retire the identity, mint a new one with its own fresh IP and fingerprint and empty jar, and let it earn its own cookies from scratch.
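As a data structure, the identity is small. The sketch below is one possible shape, not a reference design: the Identity and Fleet names, the proxy_session label, and the fingerprint profile names are hypothetical. The part that matters is the renew method, which never moves an existing jar onto a new IP.

```python
import time
import uuid
from dataclasses import dataclass, field
from http.cookiejar import CookieJar


@dataclass
class Identity:
    proxy_session: str   # hypothetical label for the provider's sticky session
    fingerprint: str     # hypothetical name of a client profile
    expires_at: float    # when the sticky window ends, in clock seconds
    jar: CookieJar = field(default_factory=CookieJar)

    def usable(self, now: float) -> bool:
        return now < self.expires_at


class Fleet:
    def __init__(self, sticky_window_s: float, fingerprints, clock=time.monotonic):
        self.sticky_window_s = sticky_window_s
        self.fingerprints = list(fingerprints)
        self.clock = clock
        self.retired = 0
        self._minted = 0

    def mint(self) -> Identity:
        """New exit session, new fingerprint slot, empty cookie jar."""
        fingerprint = self.fingerprints[self._minted % len(self.fingerprints)]
        self._minted += 1
        return Identity(
            proxy_session=uuid.uuid4().hex,
            fingerprint=fingerprint,
            expires_at=self.clock() + self.sticky_window_s,
        )

    def renew(self, identity: Identity) -> Identity:
        """Keep a live identity; retire an expired one instead of re-homing it."""
        if identity.usable(self.clock()):
            return identity
        self.retired += 1
        return self.mint()


fleet = Fleet(sticky_window_s=600, fingerprints=["profile-a", "profile-b"])
worker = fleet.mint()
worker = fleet.renew(worker)  # still inside the window: same identity comes back
```

The cookie store is keyed by identity simply because the jar is a field on it. There is no fleet-wide jar to reach into, so sharing cookies across exits would require deliberately writing code to do it.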
This is slower than the naive design, and the slowness is the honest cost. A fleet that rotates IPs freely under shared cookies gets more requests per IP and per cookie, right up until a cross-session detector notices and burns the lot. A fleet that binds identity coherently spends IPs faster, holds fewer requests per session, and re-pays the cost of acquiring fresh session state every time an identity retires. You are trading throughput for survival. The burn-rate economics of that trade are their own subject, worked through in proxy pool management; the design consequence for sessions is simply that coherence is not free and pretending it is shows up as a sudden cliff in success rate rather than a gentle decline.
Two implementation details save a lot of grief. First, decide explicitly which forwarded-IP header your target trusts, because if you are running your own egress proxy chain you can leak your real origin or your internal hop IPs through X-Forwarded-For without meaning to, and a server that reads that header sees an IP your fleet did not intend to present. Strip or set those headers deliberately at the egress. Second, treat the fingerprint as part of the identity, not a global default. If every identity in your fleet shares one TLS fingerprint and one user-agent, the cookies differ but the clients are identical, and a vendor correlating across sessions sees one machine wearing many cookies. That is the inverse leak: not one cookie across many IPs, but many cookies across one fingerprint. Both collapse the population back into a single judged entity. The lifecycle of how those fingerprint patches get found and refixed is its own arms race, traced in the lifecycle of a stealth patch.
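Both details can be checked mechanically. The sketch below assumes headers arrive as a plain dictionary and identities as a mapping from an id to a fingerprint name; the header set includes X-Real-IP and Via in addition to the three named earlier, as an assumption about what an egress chain might add.

```python
from collections import Counter

# Headers that can reveal an origin or internal hop. X-Real-IP and Via are
# additions for this sketch; adjust the set to the proxies actually in use.
LEAKY_HEADERS = {"x-forwarded-for", "forwarded", "true-client-ip", "x-real-ip", "via"}


def scrub_egress_headers(headers: dict) -> dict:
    """Drop hop-revealing headers before a request leaves the fleet."""
    return {k: v for k, v in headers.items() if k.lower() not in LEAKY_HEADERS}


def overused_fingerprints(identity_fingerprints: dict, limit: int = 1) -> set:
    """Fingerprints shared by more than `limit` identities (the inverse leak)."""
    counts = Counter(identity_fingerprints.values())
    return {fp for fp, n in counts.items() if n > limit}


outgoing = {"User-Agent": "example", "X-Forwarded-For": "10.0.0.12", "Accept": "*/*"}
print(scrub_egress_headers(outgoing))
print(overused_fingerprints({"id-1": "fp-a", "id-2": "fp-a", "id-3": "fp-b"}))
```

Whether to strip these headers or set them to a deliberate value depends on what the target reads, which is the decision the paragraph above asks you to make explicitly.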
What coherence really costs
The thing worth sitting with is that the defender’s job here is easier than the attacker’s, and the asymmetry runs the wrong way for the crawler. Binding a cookie to an IP is a single comparison on the server: does the address on this request match the address recorded at issue time. Defeating that comparison while still rotating IPs requires the crawler to keep an entire population of identities internally consistent across cookies, IPs, fingerprints, headers, and DNS, every one of them, every request. The server checks one seam. The fleet has to seal all of them at once. Any layer that rotates on its own clock, out of step with the others, opens the seam back up.
That is why the leak is such a durable bug. It is not a flaw in any one component. It is what happens when components that should move together move independently, and the default architecture (a proxy pool here, a cookie store there, a fingerprint generator somewhere else) makes independent movement the path of least resistance. You have to fight the architecture to keep them locked, and the win condition is invisible. A coherent fleet looks exactly like a slower one until you compare its block rate against the fast fleet that is quietly being scored into the ground. The cost of coherence is paid up front in throughput. The cost of incoherence is paid later, all at once, when a cross-session detector cashes in every IP a leaked cookie ever touched.
Sources & further reading
- Barth, A. (2011), RFC 6265: HTTP State Management Mechanism. The cookie spec, including the Secure and HttpOnly attributes (SameSite was added later, in the RFC 6265bis draft); none of them binds a cookie to a network address.
- Popov, A., Nystroem, M., Balfanz, D., Hodges, J. (2018), RFC 8471: The Token Binding Protocol Version 1.0. Cryptographically binds tokens to the TLS channel; abandoned on the web after Chromium dropped support in v70.
- Microsoft (2026), Safeguarding Dataverse sessions with IP cookie binding. The plainest documented IP-to-cookie binding, including the benign events (VPN, hotspot, ISP reset) that force reauthentication.
- HAProxy (2019), Load balancing, affinity, persistence, sticky sessions: what you need to know. The canonical distinction between source-IP affinity and cookie-based persistence.
- DataDome (2026), Cookies and stored data. Names the datadome cookie, its ~128-byte size and ~1-year lifetime, and warns against altering its attributes.
- DataDome (2026), Protection API: validate request. Documents the mandatory ClientID for session tracking and the separate IP and X-Forwarded-For fields each request is judged against.
- Cloudflare (2026), Clearance. States the cf_clearance cookie is tied to visitor and device rather than IP, with three privilege tiers and a 4096-byte ceiling.
- Cloudflare (2026), Challenge Passage. Documents the 30-minute default clearance lifetime and the 15-to-45-minute recommended range.
- WorkOS (2025), Impossible travel: what it is, how it works, and how to defend against it. The Haversine-plus-velocity method and the ~1000 km/h threshold, with GeoIP accuracy as the main weakness.
- IPinfo (2025), Impossible travel detection with IP data accuracy. On why GeoIP estimates, especially on mobile carriers, drive false positives in velocity detection.
- Proxies.sx (2026), Proxy IP reputation and ASN scoring: the 2026 guide. ASN classification fires before the request body is read; datacenter ASNs start 50-85 on the risk scale, consumer ISPs near 0-20.
- Edioff (2026), akamai-analysis: deep technical analysis of Akamai Bot Manager v2. Maps the sensor_data POST to the challenge endpoint and is candid that server-side _abck validation is not fully reverse-engineered.