DataDome's server-side scoring pipeline: from edge to decision in milliseconds
A bot-detection verdict has to be wrong almost never and arrive almost instantly. Those two requirements pull against each other. To be confident about whether a request is a human or a script, you want to look at everything: the TLS handshake, the header order, the IP’s history, what this session did on the last forty requests, whether the JavaScript tag ever fired. To be fast enough that the request can block synchronously without anyone noticing, you have a few milliseconds and a single network hop to spend. DataDome’s published number for that hop is two milliseconds of average computing time per request. The interesting engineering is entirely about how you spend a budget that small while still consulting a model trained on trillions of signals.
This post stays on the server side of that problem. Not what the JavaScript tag collects in the browser, which is a separate story, but what happens after a request reaches the edge: how the integration module captures a request, what it ships to DataDome’s API, how the API scores it against rules and machine-learned models, and how a one-byte verdict comes back as an allow, a challenge, or a block. The roadmap is the request flow itself. First the module-and-API split that makes the whole thing work behind a CDN. Then the exact form fields that travel north. Then the scoring engine, the rules-versus-ML question, and the regional inference layer. Then how the verdict is enforced and the cookie reissued. Then the failure modes, because a synchronous security check sitting in front of every request is a liability the moment it stops answering.
The split that makes it synchronous
DataDome does not run as a reverse proxy that all your traffic flows through. It runs as a module inside infrastructure you already operate, paired with a detection API it operates. That division is the whole architecture. The module lives in your Nginx, your HAProxy, your Cloudflare Worker, your Fastly Compute service, your AWS Lambda@Edge function, or one of a couple dozen other supported integration points. On each request the module does something deliberately cheap: it reads the request metadata, packs a subset of it into a small form-encoded body, and makes one call to DataDome’s Protection API at api.datadome.co. The API does the thinking and returns a verdict. The module enforces it.
Keeping the heavy computation off your box and inside DataDome’s is what lets the model be enormous while the per-request cost on your side stays near zero. The module is a few hundred lines of glue. The model behind the API is the product. The cost of that design is one synchronous round trip per request, which is exactly why so much of the system is built to make that round trip short and to survive it when it is not.
*The two-phase flow: the module ships request metadata north, the API returns a verdict in a header, and the module turns that verdict into an HTTP outcome. The exact status-to-action mapping is configurable per rule.*
The placement of the module matters more than it first looks. DataDome’s own guidance is to integrate at the edge, at the CDN level, whenever possible. On Cloudflare the worker runs on the incoming request before the cache lookup, which means a blocked bot never touches your origin and never pollutes your cache. The trade is that edge integrations sit behind whatever IP rotation the CDN does, so DataDome keeps a dynamic list of trusted proxy ranges and resolves the real client IP from forwarding headers rather than the connection’s source address. That detail is why the API payload carries several IP-bearing fields at once, which we will get to.
The trusted-proxy problem is worth dwelling on, because it is where a lot of naive bot-detection deployments quietly break. If the engine scored on the connection’s source IP without reconciliation, then every request arriving through a CDN would appear to come from a handful of CDN edge addresses, and reputational detection would either flag the entire CDN as a botnet or trust it blindly. Neither is acceptable. DataDome’s answer is a dynamic IP discovery list of known CDN and proxy ranges that it keeps current as those providers rotate addresses, combined with the multiple forwarding fields in the payload so the engine can walk the chain and recover the genuine origin. When a request’s connection IP belongs to a trusted proxy, the engine trusts the forwarded client address; when it does not, the forwarded headers are themselves suspect and become a signal. That asymmetry is deliberate. Spoofed X-Forwarded-For headers from an untrusted source are a classic evasion attempt, and the engine treats a forwarded-for value arriving from outside the trusted-proxy set very differently from the same value arriving through a Cloudflare or Akamai range it recognizes.
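The asymmetry described above can be sketched in a few lines. This is an illustration, not DataDome's implementation: the trusted ranges are documentation-only address blocks standing in for the dynamic proxy list, and the names TRUSTED_PROXIES, is_trusted, and resolve_client_ip are hypothetical.

```python
import ipaddress
from typing import Optional, Tuple

# Hypothetical trusted ranges (documentation-only blocks), standing in for the
# dynamic list of CDN and proxy ranges that the engine keeps current.
TRUSTED_PROXIES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_trusted(ip: str) -> bool:
    """True when the address falls inside one of the trusted proxy ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TRUSTED_PROXIES)


def resolve_client_ip(connection_ip: str, x_forwarded_for: Optional[str]) -> Tuple[str, bool]:
    """Return (client_ip, forwarded_header_is_suspect)."""
    hops = [h.strip() for h in (x_forwarded_for or "").split(",") if h.strip()]
    if not is_trusted(connection_ip):
        # Untrusted source: score on the connection IP, and treat any
        # forwarded-for value it supplied as a signal rather than a fact.
        return connection_ip, bool(hops)
    # Trusted proxy: walk the chain right to left, skipping trusted hops,
    # and take the first address that is not one of ours.
    for hop in reversed(hops):
        try:
            if not is_trusted(hop):
                return hop, False
        except ValueError:
            # A malformed hop is not a usable client address.
            return connection_ip, True
    return connection_ip, False
```

The same forwarded value produces two different outcomes depending on where it arrived from, which is the point: the header is evidence only when the hop that delivered it is known.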
What the module actually sends
The module does not forward the request. It forwards a description of the request. The Protection API expects a POST to /validate-request/ with an application/x-www-form-urlencoded body, and the fields in that body are a fixed, documented set. This is the part most write-ups skip, so it is worth being precise: the field names below are taken directly from DataDome’s Protection API reference, not inferred.
The core identity and routing fields are Key (your license key), RequestModuleName and ModuleVersion (which integration sent this), ServerName and ServerHostname, IP, Port, Protocol, Method, and Request (the path and query string). The timing field is TimeRequest, a microsecond timestamp the engine uses to reason about request rate. Then a block of header-derived fields: UserAgent, Referer, Accept, AcceptEncoding, AcceptLanguage, AcceptCharset, Origin, Host, From, Via, CacheControl, Pragma, ContentType, and X-Requested-With. Client-Hint headers travel in their own SecCH* family of fields.
Three fields carry the IP story for the trusted-proxy problem above: XForwardedForIP, TrueClientIP, and X-Real-IP. Behind a CDN the connection-level IP belongs to the proxy, so the engine reconciles these to recover the genuine client address. Identity continuity comes from ClientID, which is the value of the DataDome cookie or, when an upstream component has already extracted it, the X-DataDome-ClientID header. That ClientID is the thread tying this request to the session history the engine has accumulated.
Two design choices in that field list deserve a beat. First, the engine asks for CookiesLen, PostParamLen, and AuthorizationLen, the lengths of those values rather than the values. It wants the shape of the cookie jar and the body without forcing the module to copy potentially large or sensitive payloads over the wire. A session carrying zero cookies, or a wildly atypical body size, is a signal on its own. Second, HeadersList ships the names and order of the request’s headers. Header ordering is a strong tell, because real browsers emit headers in a stable, version-specific order that most HTTP clients do not bother to reproduce. The engine reconstructs that ordering from HeadersList rather than trusting any single header’s value.
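To make the shape of that body concrete, here is a minimal sketch of a module assembling a subset of the fields named above. The field names come from the text; everything else is an assumption for illustration: the build_payload helper, the comma-joined serialization of HeadersList, the cookie name datadome, and the placeholder license key.

```python
import time
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlencode


def build_payload(license_key, client_ip, method, path, headers, body=b""):
    """Serialize a description of the request, not the request itself.

    `headers` is a list of (name, value) pairs in arrival order, so the
    ordering survives into HeadersList.
    """
    by_name = {name.lower(): value for name, value in headers}
    cookie_header = by_name.get("cookie", "")
    jar = SimpleCookie()
    jar.load(cookie_header)
    # Assumption: the DataDome cookie is named "datadome".
    client_id = jar["datadome"].value if "datadome" in jar else ""
    fields = {
        "Key": license_key,
        "IP": client_ip,
        "Method": method,
        "Request": path,
        "TimeRequest": str(int(time.time() * 1_000_000)),  # microseconds
        "UserAgent": by_name.get("user-agent", ""),
        "Host": by_name.get("host", ""),
        "XForwardedForIP": by_name.get("x-forwarded-for", ""),
        "ClientID": client_id,
        # Lengths only: the shape of the values, never the values.
        "CookiesLen": str(len(cookie_header)),
        "AuthorizationLen": str(len(by_name.get("authorization", ""))),
        "PostParamLen": str(len(body)),
        # Assumption: names joined with commas, in arrival order.
        "HeadersList": ",".join(name for name, _ in headers),
    }
    return urlencode(fields)


example_headers = [
    ("Host", "shop.example"),
    ("User-Agent", "Mozilla/5.0"),
    ("Cookie", "datadome=abc123; theme=dark"),
]
decoded = parse_qs(
    build_payload("LICENSE_KEY_PLACEHOLDER", "192.0.2.10", "GET",
                  "/search?q=shoes", example_headers)
)
```

Note that the cookie value itself never appears in the body apart from the extracted ClientID; the rest of the jar is reduced to a length.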
Network-layer fingerprints ride along when the integration can supply them. The Protection API accepts JA3 and JA4 TLS fingerprints plus TlsProtocol and TlsCipher as optional fields, which an Nginx or edge integration maps in from the TLS terminator. That ties the server side of this pipeline to the transport layer: a request whose UserAgent claims a recent Chrome but whose JA4 hash belongs to a Go HTTP client is contradicting itself before the engine looks at anything else. The mechanics of those handshake hashes are their own topic, covered in TLS fingerprinting: from ClientHello bytes to JA4 and, for the protocol layer above it, how DataDome uses HTTP/2 and network fingerprints as a signal. What matters here is that they arrive as a couple of short strings in the same form body as everything else, so the cost of consulting them at scoring time is a hash lookup, not a packet capture.
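The "hash lookup" claim is easy to picture. The sketch below assumes a hypothetical in-memory table mapping JA4 strings to client families; the keys are placeholders rather than real JA4 hashes, and the user-agent parsing is deliberately crude.

```python
# Placeholder fingerprint strings, NOT real JA4 hashes. A real table would be
# built from observed handshakes; this mapping is purely illustrative.
JA4_FAMILIES = {
    "ja4_placeholder_chrome": "chrome",
    "ja4_placeholder_go": "go-http-client",
}


def claimed_family(user_agent):
    """Very rough guess at the client family a User-Agent string claims."""
    ua = user_agent.lower()
    if "go-http-client" in ua:
        return "go-http-client"
    if "chrome/" in ua:
        return "chrome"
    return None


def contradicts(user_agent, ja4):
    """True when the TLS fingerprint and the User-Agent name different clients."""
    observed = JA4_FAMILIES.get(ja4)  # one dictionary lookup
    claimed = claimed_family(user_agent)
    return observed is not None and claimed is not None and observed != claimed
```

An unknown fingerprint or an unrecognized User-Agent returns False here: absence of a match is not treated as a contradiction, only a disagreement between two known values is.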
What the module ships is only half of the signal supply. The other half is whatever the in-browser JavaScript tag collected on earlier requests, keyed to the same ClientID and waiting in DataDome’s session store. The full inventory of first-request signals is its own subject, treated in DataDome’s detection model and inside the DataDome JS tag. The server-side payload is the part that exists on every single request, with or without JavaScript, which is why a request from a headless client that never runs the tag is conspicuous: the engine has the HTTP-level description and an empty client-side record to go with it.
Scoring: rules and models, not rules versus models
Once the form body lands at the API, the request becomes a feature vector and the engine scores it. DataDome describes this as a multi-layered detection engine, and the layering is the point. Four categories of detection run against the same request. Signature-based detection matches known-bad TLS and browser fingerprints and known-bad header signatures. Behavioral detection looks at whether the activity on this session, IP, or fingerprint is abnormal or aggressive over a time window. Reputational detection draws on IP and ASN history. Anomaly detection catches the request that does not match any signature but does not look like anything a human would produce either.
Those layers consume signals aggregated at four levels: the individual request, the session, the IP, and the fingerprint. The reason the levels matter is that the cheapest attacks are invisible at the request level and obvious at the aggregate. A single request with clean headers and a plausible IP looks fine. Ten thousand such requests sharing one fingerprint across a rotating IP pool, inside a sixty-second window, do not. The engine carries state across requests so the aggregate is available at decision time, which is the entire reason ClientID and TimeRequest are in the payload. This is the same machinery that let DataDome attribute a single coordinated campaign across many requests and hard-block it on behavioral signature alone, the published case being a wave of more than 214 million malicious requests caught on server-side behavior.
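The request-versus-aggregate point can be shown with a small sliding-window counter. This is a sketch under stated assumptions: the FingerprintWindow class and its in-process dictionary are hypothetical, and nothing here reflects how DataDome actually stores session state.

```python
from collections import defaultdict, deque


class FingerprintWindow:
    """Track, per fingerprint, the requests and distinct IPs in a time window."""

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        # fingerprint -> deque of (timestamp, ip), oldest first
        self.events = defaultdict(deque)

    def observe(self, fingerprint, ip, now):
        """Record one request and return (request_count, distinct_ip_count)."""
        q = self.events[fingerprint]
        q.append((now, ip))
        # Drop events that have aged out of the window.
        while q and q[0][0] <= now - self.window:
            q.popleft()
        return len(q), len({seen_ip for _, seen_ip in q})


window = FingerprintWindow(window_seconds=60.0)
first = window.observe("fp-a", "192.0.2.1", now=0.0)
second = window.observe("fp-a", "192.0.2.2", now=30.0)
third = window.observe("fp-a", "192.0.2.3", now=61.0)  # the event at t=0 expires
```

Each individual call looks unremarkable; the pair of counts it returns is what separates one fingerprint on one address from one fingerprint spread across a rotating pool.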
It is tempting to read this as machine learning replacing rules, but that is not how the system is built. DataDome runs a large body of precision rules alongside its models, and the two do different jobs. Rules give deterministic, auditable, instant verdicts for the cases you can name: this exact fingerprint, this known scraper, this header combination that no browser emits. Models cover the cases you cannot enumerate, scoring novel traffic by its resemblance to labelled history. The company’s own public numbers put the scale at tens of thousands of models tailored per customer and use case, backed by hundreds of thousands of rules, with the engine ingesting on the order of trillions of signals per day across its customer base. The ML side spans supervised, unsupervised, and semi-supervised models, plus behavioral analysis, time-series anomaly detection, and genetic algorithms, depending on the signal.
The division of labor between the two is not arbitrary, and it maps onto the latency budget directly. A rule is a lookup. Matching a request against a known-bad fingerprint or a header signature is constant-time and cheap, so rules can run first and short-circuit the obvious cases without ever invoking a model. That is good for both speed and explainability: when a request is blocked by a rule, there is a specific, nameable reason, which matters for the customer triaging a false positive in the dashboard. Models are more expensive to evaluate and harder to explain, so they earn their place on exactly the traffic rules cannot adjudicate, the novel or evasive request that does not match any signature but does not behave like a human either. The per-customer tailoring is part of why the model count runs into the tens of thousands. A login endpoint on a bank and a product-search endpoint on a marketplace have different normal traffic and different attacker economics, so the model that scores one would misjudge the other. Tailoring by customer and use case is how a single engine avoids applying one site’s notion of normal to another’s.
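The rules-first ordering reduces to a short-circuit. In the sketch below the rule sets, the threshold of 0.8, and the score function are all hypothetical; the model is any callable returning a probability-like number, and the returned reason string stands in for the explainability a rule hit provides.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical rule data for illustration only.
BLOCKED_FINGERPRINTS = {"fp-known-scraper"}
IMPOSSIBLE_HEADER_ORDERS = {("Accept", "Host", "User-Agent")}


def rule_verdict(features: Dict) -> Optional[Tuple[str, str]]:
    """Constant-time lookups; returns (action, reason) or None when no rule fires."""
    if features.get("fingerprint") in BLOCKED_FINGERPRINTS:
        return "block", "rule: known-bad fingerprint"
    if tuple(features.get("header_order", ())) in IMPOSSIBLE_HEADER_ORDERS:
        return "block", "rule: header order no browser emits"
    return None


def score(features: Dict, model: Callable[[Dict], float],
          threshold: float = 0.8) -> Tuple[str, str]:
    """Run rules first and only invoke the model when no rule adjudicates."""
    hit = rule_verdict(features)
    if hit is not None:
        return hit  # short-circuit: the model is never evaluated
    probability = model(features)
    if probability >= threshold:
        return "challenge", "model score %.2f" % probability
    return "allow", "no rule hit, model score %.2f" % probability
```

A request blocked on the first branch carries a nameable reason and costs one set lookup; only traffic the rules cannot name pays for model evaluation.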
Supervised models are only as good as their labels, and labelling bot traffic is genuinely hard because you rarely know ground truth at the moment a request arrives. A scraper that perfectly mimics a browser produces a clean request; the only thing that betrays it is the aggregate behavior over a session or the eventual fraud it commits downstream. DataDome’s engineering writing on label engineering describes building training labels from exactly these delayed and aggregate signals rather than from any single request feature, which is why the session-level and fingerprint-level aggregation in the payload matters as much for training as it does for inference. The labels that train tomorrow’s model come from the aggregated history the engine accumulates today.
*Four detection layers feed signals aggregated at four levels over multiple time windows. Rules and models share the input and combine into one verdict; they are not competing pipelines.*
Keeping that many models accurate is its own engineering discipline, and one piece of it is public. In 2022 DataDome open-sourced Sliceline, a Python implementation of the SliceLine algorithm from a SIGMOD 2021 paper by Svetlana Sagadeeva and Matthias Boehm of Graz University of Technology. Sliceline finds the slices of a dataset where a model performs significantly worse than its average, using sparse linear algebra to enumerate and prune candidate slices fast. For a bot detector, those slices are exactly where the false positives and false negatives hide: a particular browser-and-locale combination the model misreads as a bot, or a botnet variant it misreads as human. The tool is released under the 3-clause BSD license, and DataDome’s threat-research write-ups describe using it to keep both error rates low. That a vendor publishes the technique it uses to hunt its own model’s blind spots tells you something about how central the error-rate management is to the product.
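The idea of a slice is simple enough to show naively. The sketch below is not the SliceLine algorithm and does not use the Sliceline package: it enumerates single-feature slices by brute force, with none of the sparse linear algebra, pruning, or multi-feature conjunctions the paper contributes. The worst_slices function and the sample rows are hypothetical.

```python
from collections import defaultdict


def worst_slices(rows, errors, min_support=2):
    """Rank single-feature slices by how far their mean error exceeds the overall mean.

    rows   -- list of dicts mapping feature name to categorical value
    errors -- per-row model error (for example 1 for a misclassification, else 0)
    Returns a list of (excess_error, (feature, value), support), worst first.
    """
    overall = sum(errors) / len(errors)
    buckets = defaultdict(list)
    for row, err in zip(rows, errors):
        for feature, value in row.items():
            buckets[(feature, value)].append(err)
    ranked = [
        (sum(errs) / len(errs) - overall, key, len(errs))
        for key, errs in buckets.items()
        if len(errs) >= min_support  # ignore slices too small to trust
    ]
    return sorted(ranked, key=lambda item: item[0], reverse=True)


rows = [
    {"browser": "firefox", "locale": "tr"},
    {"browser": "firefox", "locale": "tr"},
    {"browser": "chrome", "locale": "en"},
    {"browser": "chrome", "locale": "en"},
]
errors = [1, 1, 0, 0]
top = worst_slices(rows, errors)
```

Here the firefox and tr slices surface at the top because every row in them is an error while the overall error rate is one half. The real algorithm earns its keep when the culprit is a conjunction of features and the search space is too large to enumerate this way.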
The output of all this is a verdict, not a raw probability exposed to the customer. The engine resolves the combined signal into one of a small set of actions, and the action that comes back depends as much on your configuration as on the score. DataDome’s rule engine supports several response types: allow, captcha, block, device check, a custom mode that just tags the request with a header and forwards it for you to handle, plus rate limiting and time-based variants. So the score is internal; what crosses the API boundary is the decision.
The verdict, and how it comes back
The Protection API answers with an HTTP status that the module reads as the verdict, and a set of X-DataDome* headers that carry the details. A 200 means allow: let the request proceed to your backend. A status in the 301, 302, 401, 403, or 429 family means the engine wants the module to challenge or block, with the exact code depending on the configured response. The most important header in the reply is X-DataDomeResponse. It echoes the status the engine intends, and the module is expected to verify it matches before acting. If that header is missing, the module treats the API response as malformed; DataDome’s modules even have a named error for it (error 704: the API response did not carry the expected X-DataDomeResponse header), which exists precisely so a module never enforces a verdict it cannot confirm came from the engine.
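The module's side of that exchange can be sketched as a small classifier. The status codes and the X-DataDomeResponse check follow the description above; the interpret function, its return labels, and the decision to report a mismatch as "malformed" without choosing what happens next are assumptions of this sketch.

```python
ENFORCE_STATUSES = {301, 302, 401, 403, 429}


def interpret(status, headers):
    """Classify an API reply as 'allow', 'enforce', 'malformed', or 'other'.

    `headers` is a dict of response headers; lookup is case-insensitive.
    """
    if status != 200 and status not in ENFORCE_STATUSES:
        return "other"  # outside the documented verdict statuses
    lowered = {name.lower(): value for name, value in headers.items()}
    echoed = lowered.get("x-datadomeresponse")
    # The header must be present and must echo the status before the module
    # acts on the verdict; otherwise the reply cannot be confirmed.
    if echoed is None or echoed.strip() != str(status):
        return "malformed"
    return "allow" if status == 200 else "enforce"
```

Checking the echo before acting means a reply that merely happens to carry a 403, for instance from an intermediary, is never enforced as if the engine had issued it.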
Alongside the status, the reply carries instructions for rewriting the exchange. X-DataDome-request-headers and X-DataDome-headers tell the module which headers to inject upstream toward the backend and which to add downstream toward the client. X-DataDome-isbot, X-DataDome-botname, and X-DataDome-botfamily classify the verdict when the engine has identified a specific bot, which is what feeds the dashboard’s attribution and what a custom-mode integration reads to make its own call. And a Set-Cookie directive in the reply is how the DataDome cookie gets issued or rotated, which closes the loop on the ClientID field that opened the request.
The challenge path is where the server side hands off to a second round of client work, and DataDome’s modern default for it is Device Check rather than a visible CAPTCHA. Launched in December 2023 and announced as an invisible CAPTCHA alternative, Device Check redirects a suspicious request to an interstitial page that runs a JavaScript proof-of-work and fingerprinting routine in the browser, collects the result, and returns it to DataDome for a final verdict. No checkbox, no image grid. Because no human interaction is involved, the behavioral models do not apply to what Device Check gathers; it leans on environment fingerprinting and proof-of-work to catch spoofed or automated runtimes. From the module’s point of view this is still just the challenge branch of the same state machine. The interstitial either resolves to an allow, and the original request proceeds, or it does not, and the request stays blocked. The cookie lifecycle that carries the resulting trust forward, including how a passed challenge upgrades the token, is its own subject in the DataDome cookie lifecycle.
Spending the latency budget
Now the hard constraint. All of the above sits synchronously in front of every protected request, so its latency is added to every protected request. DataDome’s published figure is roughly two milliseconds of average computing time per request for the detection itself. Hitting that consistently, from anywhere in the world, is a geography problem before it is a modeling problem.
The answer is to put the inference close to the module. DataDome runs its Protection API from more than two dozen locations, described in the AWS reference architecture as 26 deployments spanning over 20 AWS regions, fronted by latency-aware routing. The default api.datadome.co hostname resolves by geo-proximity to the nearest healthy endpoint, and customers who want to pin a region can target a specific dynamic endpoint directly, for example api-eu-west-1.datadome.co in Ireland or api-us-east-1.datadome.co in Virginia. The routing layer uses health checks to pull a degraded region out of the proximity map automatically, so a regional incident reroutes rather than stalls.
The scoring itself has to be in-memory to fit the budget. There is no time inside two milliseconds to query a database on the request path. DataDome’s public material describes examining each request against a large in-memory pattern database of models. The aggregated state that makes session-level and IP-level detection possible, the running counts and fingerprint histories keyed by ClientID and address, has to live in fast storage that the scoring node can read in microseconds, with the heavy work of training and slice-finding happening offline and shipping fresh models out to the edge of the inference layer. The split is the familiar one for low-latency ML: training is slow and asynchronous, inference is fast and synchronous, and the model artifact is what crosses between them.
When the API does not answer
A synchronous check in front of every request is a single point of failure unless you design for the case where it fails. DataDome’s modules do. The Protection API contract treats 200 as allow and the challenge-or-block statuses as enforce, but it also defines a third bucket: any other response, including a timeout or an unreachable endpoint, is a fail-open. The module proceeds as if the request were allowed rather than blocking traffic because the detector is unavailable.
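The three buckets map onto a wrapper like the one below. It is a sketch, not a DataDome module: call_api is a hypothetical callable that performs the round trip and returns the HTTP status, and the 150-millisecond timeout is an arbitrary illustrative value.

```python
ENFORCE_STATUSES = {301, 302, 401, 403, 429}


def decide(call_api, timeout_seconds=0.15):
    """Return (action, reason), failing open on anything that is not a verdict.

    `call_api(timeout_seconds)` is assumed to return the API's HTTP status and
    to raise TimeoutError or another OSError when the endpoint cannot be
    reached in time.
    """
    try:
        status = call_api(timeout_seconds)
    except OSError as exc:  # TimeoutError is a subclass of OSError
        return "allow", "fail-open: %s" % type(exc).__name__
    if status == 200:
        return "allow", "API allowed"
    if status in ENFORCE_STATUSES:
        return "enforce", "API returned %d" % status
    # Third bucket: any other response is treated like no response at all.
    return "allow", "fail-open: unexpected status %d" % status
```

Every path through this function returns an action, so the protected site never waits on or breaks because of the detector; the price is that the two fail-open branches let requests through unscored.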
That choice is the right one for almost every customer, and it has a cost worth naming. Fail-open means a sufficiently severe outage of, or attack against, the detection API degrades into no detection rather than a site-wide outage. Availability wins over security in the default. A determined adversary who can induce timeouts on the validation path, or who simply attacks during a regional incident, gets a window where requests pass unscored. The module mitigates this with tight timeouts and the regional redundancy above, so the fail-open window is small and rare, but it exists by design, and any threat model of a DataDome-protected endpoint has to account for it. This is a general property of synchronous inline security, not a DataDome-specific weakness; Akamai’s Bot Manager makes a comparable availability-first tradeoff, discussed in Akamai Bot Manager scoring.
The exact internal layout of DataDome’s inference nodes, the state store behind the aggregation levels, and the precise wire format of the model artifacts that ship to them are not publicly documented. What is documented is the boundary: the form fields going north, the status and X-DataDome* headers coming back, the fail-open rule, and the latency target the whole thing is built around. Everything above the API line is inferred from that contract plus DataDome’s own descriptions of its engine, and where this post has reasoned past the documented boundary it has said so.
What the contract tells you
The most revealing thing about DataDome’s server side is how little of it lives on your infrastructure. The module is deliberately dumb. It captures a request, serializes a known set of fields, makes one call, and reads one status code back. All the intelligence is on the far side of an HTTP boundary you can read the schema for. That is a defensible architecture: the model can grow without touching customer deployments, the rules can update continuously, and a customer integration written against the field list keeps working as the engine behind it changes.
It also means the entire system rests on the round trip being fast and the answer being trustworthy. The two-millisecond compute target, the geo-proximity routing across twenty-plus regions, the in-memory scoring, the X-DataDomeResponse integrity check the module performs before it enforces anything, and the fail-open default when the answer does not come: those are not separate features. They are five facets of one constraint, which is that a security decision has to happen inside the time a user is willing to wait, and has to keep the site up when it cannot happen at all. The field list is the easy part to document. The latency budget is the part that shapes everything else.
Sources & further reading
- DataDome (2026), Protection API reference — the documented form-field schema the module posts to validate-request, the status-code semantics, and the X-DataDome* response headers.
- DataDome (2026), Using DataDome behind a CDN — trusted-proxy handling and why the payload reconciles multiple IP fields at the edge.
- DataDome (2024), Cloudflare Worker module for Bot Protect — edge integration that runs before the cache lookup and injects the JS tag into HTML responses.
- DataDome (2026), Rule responses — the configurable response actions: allow, captcha, block, device check, custom, and rate limiting.
- DataDome (2026), Regional API endpoints — the geo-proximity api.datadome.co default and the per-region dynamic endpoints.
- Amazon Web Services (2023), Preventing online fraud with AWS and DataDome’s real-time bot protection — the reference architecture: 26 API deployments across 20+ regions, geo-proximity routing, health-check failover, and the two-millisecond average compute figure.
- DataDome (2025), How DataDome blocked 214M+ malicious requests with server-side behavioral detection — a worked example of behavioral aggregation across requests producing a hard-block verdict.
- DataDome (2022), Sliceline — open-source slice-finding for model debugging, implementing the SIGMOD 2021 SliceLine algorithm; used to manage the detector’s false-positive and false-negative rates.
- Sagadeeva and Boehm (2021), SliceLine: Fast, Linear-Algebra-based Slice Finding for ML Model Debugging — the SIGMOD paper behind DataDome’s open-source package.
- DataDome (2023), DataDome launches Device Check, an invisible CAPTCHA alternative — the December 2023 launch of the proof-of-work interstitial that replaces the visible challenge.
- DataDome (2024), Multi-layered AI-powered detection — the four detection categories, the four aggregation levels, and the model and signal-volume scale.
- DataDome (2023), Label engineering for supervised bot-detection models — how training labels are built from delayed and aggregate signals rather than single-request features.