DataDome's detection model: every signal it collects on the first request
By the time a DataDome-protected site decides whether to serve you the page, run a JavaScript challenge, or drop a 403, a lot has already happened that you never see. No script has executed in your browser. No cookie has been validated. No mouse has moved. The decision, or at least the first cut of it, is made on the bytes of the request itself: the TLS ClientHello, the HTTP/2 frames that follow it, the header set, and the network address those packets arrived from.
That first-request layer is the part of the system most automation gets wrong before it even reaches the interesting client-side puzzles. A scraper can carry a perfect cookie and a flawless browser fingerprint and still be flagged because its TLS stack says Python while its User-Agent says Chrome 138. This post is about that layer alone: what DataDome looks at before JavaScript runs, where each signal comes from, and how the signals combine. Not the cookie internals, not the scoring infrastructure, not the client-side payload. The signal taxonomy of the first request.
A note on sourcing before we start. DataDome publishes a fair amount about its detection categories and, usefully, a public integration reference that names exactly which request attributes its detection engine receives. The exact weights, thresholds, and model internals are not public, and where this post describes behaviour rather than documented mechanism, it says so. The protocol-level fingerprinting techniques DataDome uses are documented in primary sources from Akamai, FoxIO, and the relevant RFCs, and those are the same techniques regardless of vendor.
The sections below walk the request from the outside in. First the shape of the model and where the first-request signals sit inside it. Then the network identity (IP and ASN reputation). Then the TLS handshake and what JA3 and JA4 read from it. Then the HTTP/2 layer, which carries its own fingerprint most clients forget about. Then the header set and the consistency checks that tie everything together. Then a short tour of what is and is not knowable from the outside.
The shape of the model
DataDome describes its detection as a set of model categories rather than a single classifier. Its documentation groups them four ways: signature-based detection, behavioral detection, reputational detection, and vulnerability-scanner detection. Signature-based detection is the one that runs on raw request data, and it leans on fingerprinting: TLS fingerprint, browser fingerprint, and HTTP header signatures. Reputational detection covers the IP side, flagging requests that came from an address that misbehaved recently or that has been tagged as a datacenter or residential proxy. Behavioral detection needs a history of requests to work, so it is largely out of scope for a single first request.
The split that matters for this post is not those four categories but a different axis: server-side versus client-side. Server-side signals are everything DataDome can read from the request without running any code in the client. They are available on request number one. Client-side signals come from the JavaScript tag, which collects hundreds of device and environment signals and only runs after the page (or a challenge) is served. The DataDome JS tag and its client payload are a separate story. The point here is that the server-side set is what gates access to everything else. If the first request looks wrong enough, the client-side collection never gets a chance to run.
*The server-side signals available on the first request, and the consistency check that turns four independent reads into one verdict.*
There is one more wrinkle worth naming up front. DataDome does not have to make a binary block-or-allow call on the first request. It can answer with a challenge instead, and since late 2023 one of those challenges, Device Check, is an invisible proof-of-work that the vendor pitches as catching advanced bots on their initial request rather than after a pattern of bad behaviour builds up. Device Check still needs the client to execute something. But the trigger for it comes from this same first-request signal set: a request that looks suspicious but not damning gets a challenge, a request that looks clean gets the page, and a request that looks plainly automated gets blocked outright.
Where the packets came from: IP and ASN reputation
The cheapest signal to evaluate is the source address, because it requires reading exactly zero bytes of the request body. DataDome’s reputational models score an IP on two broad questions. Has this address, or its neighbours, done something hostile recently? And what kind of network does it belong to?
The second question is the more structural one. Every routable IP belongs to an Autonomous System, identified by an ASN, and the ASN tells you whether the address lives in a consumer ISP’s pool, a mobile carrier’s range, a hosting provider, or a cloud region. That distinction carries enormous weight. An address out of Comcast, Orange, or Deutsche Telekom is the kind of place real users sit. An address out of AWS, Hetzner, OVH, or DigitalOcean is the kind of place servers sit, and servers do not browse retail checkout pages by hand. Datacenter ASNs are a strong negative signal on their own, which is exactly why the proxy market exists: the entire residential-proxy industry is a machine for moving requests off datacenter ASNs and onto addresses that look like households.
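To make the ASN distinction concrete, here is a minimal sketch of the kind of lookup involved. The table holds only the well-known AS numbers of the networks named above; the `NetworkKind` enum and `classify_asn` function are hypothetical names for illustration, and a real system would draw on a full, continuously updated ASN database rather than a hand-written dictionary.

```python
from enum import Enum


class NetworkKind(Enum):
    CONSUMER_ISP = "consumer_isp"
    HOSTING = "hosting"
    UNKNOWN = "unknown"


# Illustrative table only: a handful of well-known AS numbers for the
# networks mentioned in the text. Not a real reputation feed.
ASN_KIND = {
    7922: NetworkKind.CONSUMER_ISP,   # Comcast
    3215: NetworkKind.CONSUMER_ISP,   # Orange
    3320: NetworkKind.CONSUMER_ISP,   # Deutsche Telekom
    16509: NetworkKind.HOSTING,       # Amazon (AWS)
    24940: NetworkKind.HOSTING,       # Hetzner
    16276: NetworkKind.HOSTING,       # OVH
    14061: NetworkKind.HOSTING,       # DigitalOcean
}


def classify_asn(asn: int) -> NetworkKind:
    """Map an AS number to a coarse network type (hypothetical helper)."""
    return ASN_KIND.get(asn, NetworkKind.UNKNOWN)
```

The lookup only sets a prior. A hosting ASN says where the request came from, not what sent it.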
DataDome layers a proxy and VPN classification on top of the raw ASN lookup. Its models flag addresses that have been identified as datacenter proxies, residential proxies, shared proxies, or free open proxies, and it does this continuously rather than from a static list, because proxy pools rotate. The reputational signal is not just “is this a datacenter” but “has this specific address been seen fronting for automation lately.” An IP that was clean last week and is now the exit node for a residential proxy network can flip from trusted to suspect without anyone touching a configuration.
The headers that carry forwarding information feed straight into this. The Protection API that DataDome’s server-side modules use forwards the client IP along with X-Forwarded-For, X-Real-IP, and a true-client-IP value, plus the Via header that proxies stamp onto requests they relay. A request that arrives with a Via header, or with a forwarding chain that does not match the connecting address, is announcing that it passed through an intermediary. That is not automatically hostile, since plenty of legitimate traffic goes through corporate proxies and CDNs, but it is a data point that gets weighed against everything else.
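A rough sketch of how those forwarding headers can be turned into flags, using only the standard library. The function name, the flag names, and the rule that the leftmost X-Forwarded-For entry should equal the connecting address are all assumptions made for illustration. How DataDome actually interprets the chain is not documented.

```python
import ipaddress


def _parse_ip(text: str):
    """Return an ip_address object, or None if the text is not a bare IP."""
    try:
        return ipaddress.ip_address(text.strip())
    except ValueError:
        return None


def forwarding_signals(connecting_ip: str, headers: dict) -> dict:
    """Derive intermediary hints from Via and X-Forwarded-For (illustrative)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    chain = [hop.strip() for hop in lowered.get("x-forwarded-for", "").split(",")
             if hop.strip()]
    parsed = [_parse_ip(hop) for hop in chain]
    peer = _parse_ip(connecting_ip)
    return {
        "has_via": "via" in lowered,
        "has_forwarding_chain": bool(chain),
        "malformed_chain": any(hop is None for hop in parsed),
        # Assumption: compare the leftmost (claimed original) hop to the peer.
        "chain_mismatch": bool(chain) and parsed[0] != peer,
    }
```

A direct browser connection produces all-false flags. A relayed request sets one or more of them, which is a data point rather than a verdict.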
None of this is decisive in isolation. A real user on a mobile carrier behind carrier-grade NAT shares an IP with thousands of other people, some of whom are running bots, and DataDome cannot block the whole pool. That is the point of the layered model. IP reputation narrows the field and sets a prior. The protocol fingerprints below decide what to do with a request whose address is ambiguous.
The TLS handshake: JA3, and why JA4 replaced it
The first thing a client says on an HTTPS connection, before any HTTP, is the TLS ClientHello. It is sent in plaintext because no key has been agreed yet, and it is one of the most revealing things a piece of software emits on the network. It lists the cipher suites the client supports, the TLS extensions it advertises, the elliptic curves and signature algorithms it will accept, and the application protocols it speaks via ALPN. Two stacks built against the same RFCs produce visibly different ClientHellos. The full anatomy of that packet is its own topic, covered in “TLS fingerprinting: from ClientHello bytes to JA4”. Here the question is narrower: what DataDome does with it.
The DataDome Protection API receives four TLS-derived fields per request: the negotiated TLS protocol version, the chosen cipher, and both a JA3 and a JA4 fingerprint. JA3, published by Salesforce engineers in 2017, hashes a concatenation of the ClientHello’s version, cipher list, extension list, supported groups, and elliptic-curve point formats into a single MD5. For years that worked beautifully: Chrome’s hash was Chrome’s hash, and Python’s requests produced something that looked nothing like a browser.
Two things broke JA3. The first was that the extension order, which JA3 hashes in the order it appears on the wire, turned out to be easy to spoof and also unstable. The second, and the fatal one, was Chrome’s TLS extension randomization, shipped in Chrome 110 in early 2023. From that point Chrome deliberately shuffles the order of its TLS extensions on every connection, so the JA3 hash of a single Chrome install changes from request to request. A fingerprint that changes every time is useless as an identity. JA3 did not stop existing, and DataDome still receives it, but as a stable browser identifier it was finished.
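The failure mode is easy to reproduce. The sketch below builds a JA3-style hash (decimal values joined with dashes, fields joined with commas, then MD5) and shows that reordering the extension list changes the result. The sample ClientHello values are illustrative placeholders, not a capture from any real browser, and GREASE filtering is omitted for brevity.

```python
import hashlib


def ja3(version, ciphers, extensions, groups, point_formats) -> str:
    """JA3-style hash: MD5 over five comma-separated, dash-joined fields."""
    lists = (ciphers, extensions, groups, point_formats)
    fields = [str(version)] + ["-".join(str(v) for v in values) for values in lists]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Illustrative values only (771 is the decimal form of 0x0303).
SAMPLE_VERSION = 771
SAMPLE_CIPHERS = [4865, 4866, 4867]
SAMPLE_EXTENSIONS = [0, 23, 65281, 10, 11, 35, 16, 5, 13]
SAMPLE_GROUPS = [29, 23, 24]
SAMPLE_POINT_FORMATS = [0]

original = ja3(SAMPLE_VERSION, SAMPLE_CIPHERS, SAMPLE_EXTENSIONS,
               SAMPLE_GROUPS, SAMPLE_POINT_FORMATS)
# Same client, same extensions, different wire order.
reordered = ja3(SAMPLE_VERSION, SAMPLE_CIPHERS, list(reversed(SAMPLE_EXTENSIONS)),
                SAMPLE_GROUPS, SAMPLE_POINT_FORMATS)
print(original != reordered)
```

Because the extension list is hashed in wire order, the two calls describe the same client yet produce two unrelated hashes.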
JA4, created by John Althouse at FoxIO in 2023, was built to survive randomization. The TLS-client variant, JA4, has a readable three-part a_b_c structure rather than a single opaque hash, and the design choice that matters is that it sorts before it hashes.
Part a is human-readable metadata: the transport (t for TLS over TCP, q for QUIC), the TLS version, whether SNI is present, the cipher and extension counts, and the first ALPN value. Part b is a twelve-character truncated SHA256 of the cipher suites sorted into hex order. Part c is a twelve-character truncated SHA256 of the extensions, also sorted, followed by the signature algorithms in their original order. GREASE values, the deliberately random placeholders browsers inject to keep middleboxes honest, are stripped before hashing everywhere they appear. Because the ciphers and extensions are sorted before hashing, Chrome’s randomized extension order collapses back to a single stable fingerprint. The thing that killed JA3 does nothing to JA4.
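The sort-before-hash step can be sketched in a few lines. This is a simplified reading of the FoxIO format, not a reference implementation. It assumes the caller supplies the TLS version as a two-character string such as "13", it excludes the SNI and ALPN extensions from the part c hash as the published specification does, and it ignores edge cases such as empty cipher lists. The function and parameter names are hypothetical.

```python
import hashlib


def is_grease(value: int) -> bool:
    """GREASE code points have the form 0x0a0a, 0x1a1a, ..., 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def _hash12(text: str) -> str:
    return hashlib.sha256(text.encode("ascii")).hexdigest()[:12]


def ja4_sketch(transport, tls_version, has_sni, ciphers, extensions,
               sig_algs, first_alpn) -> str:
    ciphers = [c for c in ciphers if not is_grease(c)]
    extensions = [e for e in extensions if not is_grease(e)]
    sig_algs = [s for s in sig_algs if not is_grease(s)]

    sni_flag = "d" if has_sni else "i"
    alpn = first_alpn[0] + first_alpn[-1] if first_alpn else "00"
    part_a = (f"{transport}{tls_version}{sni_flag}"
              f"{min(len(ciphers), 99):02d}{min(len(extensions), 99):02d}{alpn}")

    # Part b: ciphers sorted, then hashed, so wire order does not matter.
    part_b = _hash12(",".join(f"{c:04x}" for c in sorted(ciphers)))

    # Part c: sorted extensions (SNI 0x0000 and ALPN 0x0010 left out of the
    # hash), then signature algorithms in their original order.
    hashed = sorted(e for e in extensions if e not in (0x0000, 0x0010))
    ext_text = ",".join(f"{e:04x}" for e in hashed)
    sig_text = ",".join(f"{s:04x}" for s in sig_algs)
    part_c = _hash12(f"{ext_text}_{sig_text}")

    return f"{part_a}_{part_b}_{part_c}"
```

Feeding the same extension set in two different orders, with or without GREASE values mixed in, returns the same string. That is the property that makes the fingerprint survive Chrome's shuffling.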
For DataDome the value of all this is a lookup. The TLS fingerprint maps to a device class. From the cipher list and extension set a client advertises, you can tell whether you are talking to a real browser’s TLS stack or to OpenSSL, Go’s crypto/tls, Python’s ssl module, or a curl build, and you can often pin the browser family and rough version. The fingerprint by itself is not the verdict. It becomes a verdict when it disagrees with the User-Agent. A request whose JA4 says Go and whose User-Agent says “Chrome/138 on Windows” has told DataDome two incompatible stories about what it is, and that contradiction is the signal. The deeper mechanics of how DataDome uses network-layer fingerprints are covered in “How DataDome uses HTTP/2 and network fingerprints as a signal”.
The HTTP/2 layer: a second fingerprint most clients forget
Pass the TLS check and you are not done leaking identity. The HTTP/2 connection that rides on top of the TLS session carries its own fingerprint, and it is one of the most common things automation gets wrong, because most HTTP libraries treat HTTP/2 as a transport detail rather than something an adversary reads.
The technique comes from Akamai. In a Black Hat Europe 2017 whitepaper, three Akamai threat researchers, Ory Segal, Aharon Fridman, and Elad Shuster, showed that you can passively fingerprint an HTTP/2 client from the structure of its connection setup, and they did it across more than ten million HTTP/2 connections drawn from Akamai’s edge. HTTP/2 is a binary protocol, and clients differ in how they configure it. The whitepaper proposed a fingerprint format built from four observable behaviours, and that format, or close variants of it, is what commercial anti-bot systems including DataDome use today.
The first source of entropy is the SETTINGS frame, which both endpoints send before any request data. SETTINGS parameters are not negotiated; they describe the sender. RFC 7540 defines six of them, and clients differ in which they send, in what order, and with what values. The Akamai paper’s example for Chrome on macOS reads 1:65536;3:1000;4:6291456|15663105|0, meaning a header-table size of 65536, a max-concurrent-streams of 1000, an initial-window-size of 6291456, then a window-update increment, then no priority frames. A Go HTTP/2 client in the same paper reads 2:0;4:4194304;6:10485760|1073741824|0, a completely different SETTINGS profile. Which parameters are absent is as telling as the values present.
The second component is the WINDOW_UPDATE frame the client sends right after SETTINGS to enlarge its flow-control window. The increment value is consistent per implementation and differs between them. The third is the set of PRIORITY frames. Some clients, Firefox notably, open the connection by sending several PRIORITY frames for streams that have not been requested yet, building a dependency tree, and the exact tree is a signature. Chrome on modern versions sends none, recorded as 0.
The fourth component is the one that catches the most automation: pseudo-header order. Every HTTP/2 request begins with four pseudo-headers, :method, :authority, :scheme, and :path, and the order a client emits them is fixed by its implementation. The Akamai paper documents Chrome as :method, :authority, :scheme, :path and Firefox as :method, :path, :authority, :scheme. Safari, curl, and Go each have their own order. A library that hand-rolls HTTP/2 and happens to emit pseudo-headers in a sequence no browser uses has signed its own name. The full fingerprint string concatenates all four parts. A Firefox profile, for example, reads 1:65536;4:131072;5:16384|12517377|3:0:0:201,5:0:0:101,7:0:0:1,9:0:7:1,11:0:3:1|m,p,a,s.
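The pipe-delimited format is simple enough to take apart by hand. The parser below is a minimal sketch that splits a fingerprint string into its SETTINGS pairs, WINDOW_UPDATE increment, PRIORITY entries, and pseudo-header order. The `H2Fingerprint` class and `parse_h2_fingerprint` function are hypothetical names. The parser also accepts the three-part form used in the SETTINGS examples earlier, where the pseudo-header part is left off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class H2Fingerprint:
    settings: tuple             # ((parameter_id, value), ...) in wire order
    window_update: int          # WINDOW_UPDATE increment
    priorities: tuple           # ((stream, exclusive, depends_on, weight), ...)
    pseudo_header_order: tuple  # e.g. ("m", "p", "a", "s"); empty if absent


def _int_tuples(text: str, separator: str) -> tuple:
    return tuple(tuple(int(n) for n in item.split(":"))
                 for item in text.split(separator) if item)


def parse_h2_fingerprint(text: str) -> H2Fingerprint:
    parts = text.strip().split("|")
    if len(parts) not in (3, 4):
        raise ValueError("expected 3 or 4 pipe-separated sections")
    settings = _int_tuples(parts[0], ";")
    window_update = int(parts[1])
    # A literal "0" in the third section means no PRIORITY frames were sent.
    priorities = () if parts[2] == "0" else _int_tuples(parts[2], ",")
    order = tuple(parts[3].split(",")) if len(parts) == 4 else ()
    return H2Fingerprint(settings, window_update, priorities, order)


firefox = parse_h2_fingerprint(
    "1:65536;4:131072;5:16384|12517377|"
    "3:0:0:201,5:0:0:101,7:0:0:1,9:0:7:1,11:0:3:1|m,p,a,s"
)
print(firefox.pseudo_header_order)
```

Keeping the SETTINGS pairs as an ordered tuple rather than a dictionary is deliberate, because the order in which parameters appear is part of the signature.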
The whitepaper named the immediate use case bluntly: spoofed User-Agent detection. The HTTP/2 fingerprint by itself does not have enough entropy to track an individual user, but it reliably exposes the client’s vendor, OS type, and rough version. Set that against the User-Agent string and you have another consistency check. A request claiming to be Chrome whose HTTP/2 profile matches Go’s net/http is lying about one of the two, and DataDome does not need to know which.
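The spoofed User-Agent check reduces to a comparison. The sketch below uses only the two pseudo-header orders the Akamai paper documents for Chrome and Firefox. The substring matching on the User-Agent is deliberately crude, and the table, function names, and rule are assumptions for illustration rather than anything DataDome has published.

```python
# Pseudo-header orders as documented in the Akamai paper, abbreviated
# m = :method, a = :authority, s = :scheme, p = :path.
EXPECTED_ORDER = {
    "chrome": ("m", "a", "s", "p"),
    "firefox": ("m", "p", "a", "s"),
}


def claimed_family(user_agent: str):
    """Very rough User-Agent classification (hypothetical helper)."""
    ua = user_agent.lower()
    if "firefox/" in ua:
        return "firefox"
    if "chrome/" in ua:
        return "chrome"
    return None


def pseudo_header_mismatch(user_agent: str, observed_order) -> bool:
    """True when the wire order contradicts the browser the UA claims."""
    family = claimed_family(user_agent)
    if family is None:
        return False  # no documented expectation to compare against
    return tuple(observed_order) != EXPECTED_ORDER[family]
```

The function reports that the two stories disagree without deciding which one is false, which is all the check needs.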
The header set, the order, and the consistency check
The most familiar signals are the HTTP headers themselves, and they are also where the layered model does its real work, because the headers are where every other signal gets cross-examined. The DataDome Protection API forwards a long, specific list of request attributes to the detection engine. It is worth knowing the shape of that list, because it tells you precisely what the engine sees.
On the network and identity side it receives the connecting IP, the true-client-IP, X-Real-IP, the X-Forwarded-For chain, the port, the connection protocol, and the TLS fields already covered. On the HTTP side it receives User-Agent, Accept, Accept-Language, Accept-Encoding, Accept-Charset, Host, Origin, Referer, Connection, Cache-Control, Pragma, Via, and the full Sec-CH-UA client-hint family along with the Sec-Fetch-* fetch-metadata headers. Critically, it also receives a HeadersList value, a list of the header names that were present, and a CookiesList and CookiesLen. The engine is told not just what each header contained but which headers were sent and in what order, and how many cookies came along and how big they were.
That HeadersList field is the quiet one. The presence and ordering of headers is a fingerprint in its own right. Real browsers emit a stable, well-known set of headers in a stable order, and they send the modern metadata headers that come with that browser generation. A Chromium-based browser of a recent vintage sends Sec-Fetch-Site, Sec-Fetch-Mode, Sec-Fetch-Dest, and Sec-Fetch-User on navigations, and it sends Sec-CH-UA client hints that have to agree with the User-Agent it claims. An HTTP client that sets a Chrome User-Agent but omits the Sec-Fetch headers, or sends client hints whose version does not match the User-Agent’s version, has produced an internally inconsistent request. The headers a browser sends are not arbitrary; they are coupled to each other and to the browser’s identity, and that coupling is what gets checked.
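Here is a sketch of what that coupling check might look like for a request claiming to be Chrome. It treats Sec-Fetch-User as optional, since browsers send it only on user-activated navigations, and it compares the Chromium major version in Sec-CH-UA against the major version in the User-Agent. The function name and the exact rules are assumptions. DataDome's real checks are not public.

```python
import re

# Sec-Fetch-User is left out because it appears only on user-activated
# navigations, so its absence alone proves nothing.
REQUIRED_FETCH_METADATA = ("sec-fetch-site", "sec-fetch-mode", "sec-fetch-dest")


def header_inconsistencies(headers: dict) -> list:
    """List contradictions between a Chrome User-Agent and its other headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    match = re.search(r"Chrome/(\d+)", lowered.get("user-agent", ""))
    if match is None:
        return []  # this sketch only covers requests claiming to be Chrome
    ua_major = match.group(1)
    problems = []

    missing = [name for name in REQUIRED_FETCH_METADATA if name not in lowered]
    if missing:
        problems.append("missing fetch metadata: " + ", ".join(missing))

    hints = lowered.get("sec-ch-ua")
    if hints is None:
        problems.append("missing sec-ch-ua")
    else:
        # Brand entries look like: "Chromium";v="138"
        versions = dict(re.findall(r'"([^"]+)";\s*v="(\d+)"', hints))
        if versions.get("Chromium") != ua_major:
            problems.append("sec-ch-ua version disagrees with user-agent")
    return problems
```

An empty list means the headers agree with each other. It does not mean the request is genuine, only that this layer offers no contradiction.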
This is the heart of why the first-request model works. None of these four signal families is reliable alone. IP reputation has too many false positives on shared mobile addresses. TLS fingerprints can be borrowed by tooling that mimics a browser’s ClientHello. HTTP/2 profiles can be matched by libraries built for the purpose. Header sets can be copied verbatim from a real browser’s request. The defence is not any single signal but the demand that all of them tell the same story. A request that is Chrome must be Chrome at the TLS layer, Chrome at the HTTP/2 layer, Chrome in its header set and client hints, and arriving from somewhere a Chrome user plausibly sits. Faking one layer convincingly is routine. Faking all of them, in agreement, on every request, is the actual work, and the more layers a system reads, the more places a mismatch can surface.
*The consistency check in one picture. The User-Agent claims Chrome on both sides; only the right-hand request's lower layers disagree with that claim.*
What is knowable from the outside, and what is not
A fair amount of the first-request model is documented, and the rest can be reasoned about from the protocol mechanics. What DataDome publishes are the categories of detection and, in its integration reference, the exact field names its engine consumes. What it does not publish are the weights. We know the engine receives JA3, JA4, the TLS version and cipher, the IP and forwarding chain, the full header set, and the cookie and header inventories. We do not know how a datacenter ASN trades off against an otherwise-perfect browser fingerprint, or what JA4-versus-User-Agent disagreement costs a request in the final score. Those thresholds are model internals, they shift as the models retrain, and any specific number you read about them is a guess unless it comes from DataDome.
It is also worth being precise about what the first request can and cannot conclude. The protocol fingerprints are powerful at catching crude automation and at catching tools that fake one layer but not the others. They are weaker against tooling that faithfully reproduces a real browser’s TLS and HTTP/2 behaviour, which is why the model does not stop at the first request. A request that survives the server-side gate without looking clearly automated, but that the engine is not confident about, gets handed to the client side: a JavaScript challenge that collects the device signals the wire cannot reveal, or an invisible Device Check proof-of-work. What happens to those signals after collection, how they roll into a single decision at the edge, is the job of DataDome’s server-side scoring pipeline. The first request’s role is narrower and earlier. It decides who is obviously a bot, who is obviously fine, and who has to prove themselves.
The thing that has changed most over the life of this model is not the signals but their stability. TLS fingerprinting in the JA3 era was a single MD5 that Chrome’s randomization eventually scrambled. The current generation combines JA4 with the HTTP/2 fingerprint and header consistency. JA4 was designed after that lesson, and it sorts and normalizes specifically so that the defender’s fingerprint stays put while the browser’s presentation jitters. The asymmetry that defines this layer is simple to state. A browser leaks a consistent identity across four independent layers for free, as a side effect of being a browser. Automation has to manufacture that same consistency deliberately, on every layer, on every request, and the first place it usually slips is the layer the author forgot was being read.
Sources & further reading
- DataDome (2024), Protection API reference: validate-request — the integration endpoint that names every request attribute the detection engine receives, including JA3, JA4, TlsProtocol, TlsCipher, the forwarding headers, HeadersList, and CookiesList.
- DataDome (2024), AI Threats Detection — documents the four detection model categories: signature-based, behavioral, reputational, and vulnerability-scanner.
- DataDome (2023), How TLS fingerprinting reinforces DataDome’s protection — vendor engineering write-up on reading device class from ClientHello ciphers and extensions, and JA3’s place among its signals.
- DataDome (2023), Device Check — the invisible proof-of-work challenge that targets advanced bots from the first request.
- Help Net Security (2023), DataDome Device Check blocks bots from the first request — December 2023 announcement coverage with vendor quotes on first-request detection.
- O. Segal, A. Fridman, E. Shuster, Akamai (2017), Passive Fingerprinting of HTTP/2 Clients — Black Hat EU whitepaper defining the SETTINGS / WINDOW_UPDATE / PRIORITY / pseudo-header fingerprint format from 10M+ connections.
- J. Althouse, FoxIO (2023), JA4+ Network Fingerprinting — introduction to the JA4 suite and the design goals behind a randomization-resistant TLS fingerprint.
- FoxIO-LLC (2023), JA4 technical specification — the exact a_b_c format, the sort-before-hash step, and GREASE handling.
- M. Belshe, R. Peon, M. Thomson (2015), RFC 7540: Hypertext Transfer Protocol Version 2 (HTTP/2) — defines the SETTINGS parameters, WINDOW_UPDATE, PRIORITY, and pseudo-header fields the HTTP/2 fingerprint reads, and notes the passive-fingerprinting risk in Section 10.8.
- Scrapfly (2024), HTTP/2 and HTTP/3 fingerprinting: protocol-level bot detection — practitioner walkthrough of HTTP/2 fingerprint components with current browser example values.
- Stamus Networks (2023), JA3 fingerprints fade as browsers embrace TLS extension randomization — analysis of how Chrome 110’s extension shuffling broke JA3 as a stable identifier.
Frequently asked questions
What signals can DataDome evaluate before any JavaScript runs on the page?
On the first request DataDome reads only what arrives in the raw bytes: the source IP and its ASN reputation, the TLS ClientHello as a JA3 and JA4 fingerprint along with the negotiated version and cipher, the HTTP/2 frame profile, and the full header set including order, client hints, and Sec-Fetch metadata. These are the server-side signals available before the JavaScript tag executes. If the request looks wrong enough at this layer, the client-side collection never gets to run.
Why did JA4 replace JA3 for TLS fingerprinting of browsers?
JA3 hashed the ClientHello's extension list in wire order, which proved easy to spoof and unstable. Chrome 110 in early 2023 shipped TLS extension randomization, shuffling extension order on every connection so a single Chrome install produced a different JA3 hash each request, making it useless as a stable identity. JA4, created at FoxIO in 2023, sorts the ciphers and extensions before hashing, so the randomized order collapses back to one stable fingerprint.
How does DataDome catch a request that fakes a Chrome User-Agent?
The model treats no single signal as the verdict and instead demands that every layer tell the same story. A request claiming Chrome must look like Chrome at the TLS layer, in its HTTP/2 profile, in its header set and client hints, and arrive from a plausible address. When the JA4 says Go or OpenSSL, or the HTTP/2 pseudo-header order matches a library rather than Chrome, or the Sec-Fetch headers are missing, the contradiction with the User-Agent is the signal that flags the request.
What does the HTTP/2 fingerprint reveal that catches automation libraries?
The Akamai HTTP/2 fingerprint reads four behaviors from connection setup: the SETTINGS frame parameters and order, the WINDOW_UPDATE increment, any PRIORITY frames, and the order of the four pseudo-headers. Pseudo-header order catches the most automation because each implementation emits :method, :authority, :scheme, and :path in a fixed sequence. Chrome, Firefox, Safari, and most libraries each use a distinct order, so a client whose order matches no browser exposes its real identity regardless of its User-Agent.
Which parts of DataDome's first-request detection are publicly documented versus unknown?
DataDome publishes its detection categories and an integration reference that names the exact request attributes its engine consumes, including JA3, JA4, the TLS version and cipher, the IP and forwarding chain, the full header set, and the cookie and header inventories. What it does not publish are the weights and thresholds: how a datacenter ASN trades off against an otherwise clean fingerprint, or what a JA4-versus-User-Agent mismatch costs in the final score. Those internals shift as the models retrain, so any specific number is a guess unless it comes from DataDome.
Further reading
DataDome's server-side scoring pipeline: from edge to decision in milliseconds
Traces how DataDome turns an HTTP request into an allow, challenge, or block verdict at the edge: the module-to-API split, the form fields it ships, the regional inference layer, and the latency budget that keeps it synchronous.
How DataDome uses HTTP/2 and network fingerprints as a signal
A reference on the network-layer fingerprints DataDome reads: HTTP/2 SETTINGS frames, flow control, pseudo-header order, and how a mismatch between the claimed user agent and the wire profile flags a client.
The DataDome cookie lifecycle: token issuance, rotation, and validation
Traces the datadome cookie end to end: how it is issued after a challenge, what the 128-byte token encodes, when it rotates, how long it lives, and how the edge validates it on every request through the Protection API.