
FingerprintJS internals: the open-source signals vs the commercial Pro entropy

Copyright: MIT

Open the FingerprintJS source, find the file where the visitor identifier gets made, and you will be surprised how short it is. The agent reads a fixed list of browser attributes, drops them into a dictionary, serializes that dictionary to a string, and runs the string through a 128-bit hash. The output is the visitorId. There is no machine learning in the open-source library, no server, no probabilistic matching. It is a pure function of whatever the browser chose to hand over at that moment. That simplicity is exactly why it drifts, and exactly why the same company sells a commercial version that throws the client-side hash away and recomputes identity on a server it controls.

This post is a read of both halves at the level a senior engineer would want before citing them. What signals the open-source agent actually collects, named the way they are named in the repo. How x64hash128 turns them into a 32-character hex string. Where the confidence score comes from and why the open-source one is close to a constant. Then the harder half: what Fingerprint Pro adds once the payload leaves the browser, the Smart Signals catalogue with its real JSON field names, how bot and incognito and VPN detection work, and where the public documentation stops and inference begins. Where a detail is not documented, this post says so rather than inventing a field name.

The roadmap. First the open-source agent: its source modules, the hashing path, and the confidence formula. Then a section on why the client-side hash is unstable, which is the whole reason Pro exists. Then the Pro architecture and the Smart Signals reference. Then bot detection, since FingerprintJS ships that as a separate library too. Then incognito and VPN detection as worked examples of server-plus-client signals. A closing synthesis on what the split between the two products tells you about the limits of pure client-side fingerprinting.

The open-source agent, source by source

The open-source library is MIT-licensed and published as @fingerprintjs/fingerprintjs on npm. The current release line is v5, with v5.2.0 tagged in April 2026. The README is blunt about the trade-off: because fingerprints are generated and processed in the browser itself, the accuracy is significantly lower than the commercial version, and the values are vulnerable to spoofing and reverse engineering. That sentence is the thesis of the whole product split, and it is the vendor saying it, not a critic.

Internally the agent is a registry of entropy sources. Each source is a small function that reads one attribute and returns a value (or an error if the API is missing). The sources index registers them by name, and the names are worth reading in full because they are the actual signal set, not a marketing summary of it. As of the v5 source tree the registered sources are:

fonts, domBlockers, fontPreferences, audio, screenFrame, canvas, osCpu, languages, colorDepth, deviceMemory, screenResolution, hardwareConcurrency, timezone, sessionStorage, localStorage, indexedDB, openDatabase, cpuClass, platform, plugins, touchSupport, vendor, vendorFlavors, cookiesEnabled, colorGamut, invertedColors, forcedColors, monochrome, contrast, reducedMotion, reducedTransparency, hdr, math, pdfViewerEnabled, architecture, applePay, privateClickMeasurement, audioBaseLatency, dateTimeLocale, webGlBasics, webGlExtensions, and userAgentData.

That is a little over forty sources. Group them by the kind of thing they probe and the design becomes legible.

Open-source entropy sources, by category

Rendering & hardware: canvas · webGlBasics · webGlExtensions · audio · audioBaseLatency · math · deviceMemory · hardwareConcurrency · osCpu · cpuClass · architecture
Display & media features: screenResolution · screenFrame · colorDepth · colorGamut · hdr · monochrome · contrast · invertedColors · forcedColors · reducedMotion
Locale & environment: timezone · languages · dateTimeLocale · platform · vendor · vendorFlavors
Fonts & text: fonts · fontPreferences · domBlockers
Capabilities & storage: plugins · touchSupport · cookiesEnabled · pdfViewerEnabled · applePay · localStorage · sessionStorage · indexedDB · openDatabase · userAgentData

*The roughly forty registered v5 sources sorted into the kinds of attribute they probe. Rendering and font signals carry most of the entropy; the capability flags are low-entropy tie-breakers.*

A few of these reward a second look. canvas renders text and shapes to a <canvas> element and hashes the pixel output, which differs by GPU, driver, and font rasterizer. webGlBasics and webGlExtensions read the WebGL renderer and the list of supported extensions, a deeper version of the same idea covered in WebGL fingerprinting. audio runs an oscillator through the Web Audio graph and reads back the floating-point output, which varies by audio stack. domBlockers is the odd one: it injects bait elements whose class names match common ad-blocker filter lists and checks which get hidden, turning the user’s installed blocker into a signal. screenFrame measures the window-to-screen offset, a proxy for whether the browser is maximized. None of these is documented field by field in a spec; the canonical description is the source itself, which is why this post points at the repo rather than paraphrasing internals it cannot see.

The sources that lean on media-feature queries (colorGamut, hdr, monochrome, contrast, invertedColors, forcedColors, reducedMotion, reducedTransparency) are CSS @media checks. Each splits the population a little. None is individually identifying. This is the entropy-budget logic at work: a pile of weak, mostly-independent signals summing to something unique, which the sibling post Device fingerprinting in anti-bot stacks treats in information-theoretic depth, and which the entropy budget every detector balances treats as a general principle.
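The "pile of weak signals" arithmetic is easy to make concrete. The sketch below assumes the signals are statistically independent (a strong assumption) and uses made-up population shares; none of the numbers are measurements from FingerprintJS or from any study, and the function names are this post's own.

```typescript
// Surprisal, in bits, of observing a value that a fraction `share`
// of the population also reports. Rarer values carry more bits.
function surprisalBits(share: number): number {
  if (!(share > 0 && share <= 1)) {
    throw new Error("share must be in (0, 1]");
  }
  return -Math.log(share) / Math.LN2;
}

// Under the independence assumption, the bits of separate signals add up.
function totalBits(shares: number[]): number {
  return shares.reduce((sum, share) => sum + surprisalBits(share), 0);
}

// Hypothetical: eight media-feature values, each shared by 90% of visitors.
const mediaFeatureShares: number[] = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9];

// Each contributes about 0.15 bits; together about 1.2 bits,
// i.e. they narrow the crowd to roughly 43% of visitors (0.9^8).
const mediaBits: number = totalBits(mediaFeatureShares);
```

Individually each flag is nearly worthless. Added to a few high-entropy rendering signals, the same flags are what separate two otherwise identical devices.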

It is worth dwelling on what is and is not in this list, because the absences are as telling as the contents. There is no IP address, no TLS signature, no HTTP header order, and no behavioral signal. A mouse path, a keystroke rhythm, a scroll cadence: none of it appears, because those are the province of behavioral biometrics, a different discipline covered in mouse-movement biometrics and keystroke dynamics. The open-source agent is a static-attribute reader. It takes a snapshot of the device’s configuration and hashes it. It never watches the user. That bounds both its power and its privacy cost: it cannot tell a human from a script by how they move, but it also does not need to record any interaction to produce an identifier.

The list has also grown and shifted across major versions, which is itself a stability hazard the README hints at and the version field makes explicit. Sources have been added (the media-feature queries, applePay, privateClickMeasurement, architecture, audioBaseLatency) and the serialization has changed, so a visitorId from a v3 agent is not comparable to one from a v5 agent for the same browser. The library treats this as expected behavior rather than a bug, because the version field announces the algorithm generation. Anyone storing visitor IDs across an agent upgrade is storing identifiers from two different functions and should not expect them to join.
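One way to act on that warning is to store the version next to every identifier and refuse to compare across versions. The sketch below is a hypothetical storage policy, not something the library provides; the record shape and function name are assumptions. It takes the conservative reading that any version difference makes two IDs incomparable, and a looser policy (say, same major version) would be a further assumption to verify against the changelog.

```typescript
// Hypothetical record for a stored open-source result.
interface StoredVisitor {
  version: string;   // the `version` field from the agent's result
  visitorId: string; // the hex identifier from the same result
}

type Comparison = "same" | "different" | "incomparable";

// IDs produced by different agent versions came from different functions,
// so equality or inequality between them says nothing about the device.
function compareStored(a: StoredVisitor, b: StoredVisitor): Comparison {
  if (a.version !== b.version) {
    return "incomparable";
  }
  return a.visitorId === b.visitorId ? "same" : "different";
}
```

The useful property is the third outcome: a join across an agent upgrade yields "incomparable" instead of a silent false negative.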

Some of the sources also probe APIs that are themselves privacy-sensitive and have been the subject of standards-body attention. deviceMemory and hardwareConcurrency expose coarse hardware counts whose entropy has been deliberately limited: the Device Memory specification rounds and clamps the reported value, and browsers may cap the reported core count, a trade-off the hardware concurrency and device memory post covers. plugins once carried heavy entropy in the Flash era and now returns a near-empty, near-constant list in modern Chrome, which is why it sits among the low-value tie-breakers rather than the high-value rendering signals. Reading the source list with an eye to which entries still carry information, versus which are vestigial, is the difference between understanding the library and reciting it.

From sources to a visitorId

The public API is small. You load() an agent, then call agent.get(), which resolves to a GetResult. That object has four fields: visitorId, confidence, components, and version. The components dictionary is the raw evidence, one entry per source, each carrying either a value on success or an error on failure plus a duration in milliseconds. The version field is documented as equal to the library version, which is the agent’s honest admission that the identifier is only stable within a fingerprinting-algorithm generation. Bump the version, change a source, and the hash can move.
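The result shape and the value-or-error convention can be sketched in a few lines. The types below paraphrase the description above and are not copied from the library's type definitions, which may differ in detail. The collector is synchronous for brevity, and the two demo sources are hypothetical stand-ins for functions that would read browser APIs.

```typescript
// One entry per source: a value on success or an error on failure,
// plus how long the source took in milliseconds.
type Component =
  | { value: unknown; duration: number }
  | { error: unknown; duration: number };

type Components = { [name: string]: Component };

// The four fields named above (a paraphrase, not the library's own type).
interface GetResultSketch {
  visitorId: string;
  confidence: { score: number; comment?: string };
  components: Components;
  version: string;
}

// A source reads one attribute and may throw if the API is missing.
type Source = () => unknown;

function collectComponents(sources: { [name: string]: Source }): Components {
  const components: Components = {};
  const names = Object.keys(sources);
  for (let i = 0; i < names.length; i++) {
    const start = Date.now();
    try {
      components[names[i]] = { value: sources[names[i]](), duration: Date.now() - start };
    } catch (error) {
      components[names[i]] = { error: error, duration: Date.now() - start };
    }
  }
  return components;
}

// Names of the sources that failed, useful when debugging an unexpected ID.
function failedSources(result: Pick<GetResultSketch, "components">): string[] {
  return Object.keys(result.components).filter((name) => "error" in result.components[name]);
}

// Hypothetical stand-in sources.
const demoComponents: Components = collectComponents({
  hardwareConcurrency: () => 8,
  openDatabase: () => { throw new Error("API missing"); },
});
```

A failed source still occupies a slot in the dictionary, so an API that disappears between visits changes the evidence just as a changed value does.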

The visitorId itself is computed lazily. When you read the field, the agent serializes the components to a canonical string and hashes that string. The hash function is x64hash128, a 128-bit MurmurHash3 variant implemented in the library’s hashing utility. MurmurHash3 is a non-cryptographic hash designed by Austin Appleby for fast, well-distributed lookup keys. That choice matters for two reasons. It is fast, which keeps get() cheap. And it is not cryptographic, so the digest is purely an identity key, not a commitment you could verify or that resists a preimage. The output is 128 bits rendered as 32 hexadecimal characters. That is the string you see as a visitor ID.

~40 sources (canvas, audio…) → components {value|error, duration} → canonical string → x64hash128 (MurmurHash3) → 128-bit digest, rendered as 32 hex characters: 3e7a1f… (visitorId)

*The open-source path is a pure function. Read sources, build the components dictionary, serialize, hash with x64hash128. No server, no model, no memory of past visits.*

Because the whole thing is a deterministic function of the inputs, two different machines that happen to produce identical component values get identical IDs (a collision), and one machine whose component values shift between visits gets two different IDs (drift). Both failure modes are baked into a stateless client-side hash. The Pro product exists to address both by never trusting the client’s hash as final.
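Both failure modes fall out of a few lines of code. The sketch below mirrors the serialize-then-hash path with two loudly labeled substitutions: the serialization format is invented for illustration, and the hash is 32-bit FNV-1a over UTF-16 code units, standing in for the library's x64hash128 only because it fits in a few lines. The output is therefore 8 hex characters, not 32.

```typescript
// Simplified component map: source name -> already-collected value.
type Components = { [name: string]: string | number | boolean };

// Canonical serialization: sort the keys so insertion order cannot
// change the string. (Illustrative format, not the library's.)
function canonicalString(components: Components): string {
  const names = Object.keys(components).sort();
  const parts: string[] = [];
  for (let i = 0; i < names.length; i++) {
    parts.push(names[i] + ":" + JSON.stringify(components[names[i]]));
  }
  return parts.join("|");
}

// Stand-in hash: 32-bit FNV-1a. NOT MurmurHash3 and NOT x64hash128.
function fnv1a32(input: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    // Multiply by the FNV prime 16777619 via shifts, modulo 2^32.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

function visitorIdOf(components: Components): string {
  return fnv1a32(canonicalString(components));
}

// Two hypothetical machines with identical values: identical ID (collision).
const machineA: Components = { timezone: "Europe/Paris", hardwareConcurrency: 8, canvas: "c1" };
const machineB: Components = { canvas: "c1", hardwareConcurrency: 8, timezone: "Europe/Paris" };

// Machine A after an update nudges its canvas output: a new ID (drift).
const machineAUpdated: Components = { timezone: "Europe/Paris", hardwareConcurrency: 8, canvas: "c2" };
```

Here visitorIdOf(machineA) equals visitorIdOf(machineB), while visitorIdOf(machineAUpdated) differs from both. Nothing in the function can tell which of those outcomes is the mistake.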

The confidence score, and why it is nearly a constant

The confidence field is an object with a score between 0 and 1 and an optional comment. The documented meaning: a number that tells how sure the agent is about the visitor identifier, where higher is better. The interesting part is how the open-source library actually computes it, because reading the confidence source dispels a common misreading that the score reflects something measured about this specific browser.

It does not. The score is a lookup by platform, then a fixed transform. The library defines an open confidence score per platform: roughly 0.4 on Android, 0.3 to 0.5 on WebKit/Safari depending on version, 0.6 on Windows, 0.5 on macOS, and about 0.7 on other platforms. Then the value the library actually exposes is derived by the formula round(0.99 + 0.01 * openConfidenceScore) to four decimal places. Run the arithmetic and every platform lands between roughly 0.993 and 0.997. The comment field, when present, is a literal string pointing at the Pro upgrade URL. So the open-source confidence score is close to a constant near 0.995, varying in the third decimal by platform, and it carries a referral link. It is not a per-visit certainty measurement. Treating it as one is a mistake worth naming, because plenty of integrations log that number as if it meant something dynamic.

open confidence score, by platform (the base value before the 0.99 + 0.01·x transform)

Android: 0.4
WebKit: 0.3–0.5
Windows: 0.6
macOS: 0.5
other: 0.7

*The base score is a hard-coded constant per platform, not a measurement of this browser. The exposed confidence then maps every platform into roughly 0.993–0.997.*
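The arithmetic is small enough to check directly. This sketch applies the transform as described above to the approximate base values quoted in the text. The platform keys and function names are this post's own, and which field of the result each number ends up in is best confirmed against the confidence module in the repo.

```typescript
// Approximate base scores as quoted above; the keys are hypothetical labels.
const openConfidenceByPlatform: { [platform: string]: number } = {
  android: 0.4,
  webkitLow: 0.3,
  webkitHigh: 0.5,
  windows: 0.6,
  macos: 0.5,
  other: 0.7,
};

// Round a number to a given count of decimal places.
function roundTo(value: number, decimals: number): number {
  const factor = Math.pow(10, decimals);
  return Math.round(value * factor) / factor;
}

// The transform described above: 0.99 + 0.01 * base, to four decimals.
function transformedScore(openScore: number): number {
  return roundTo(0.99 + 0.01 * openScore, 4);
}

// Evaluate the transform for every platform in the table.
const transformedByPlatform: { [platform: string]: number } = {};
Object.keys(openConfidenceByPlatform).forEach((platform) => {
  transformedByPlatform[platform] = transformedScore(openConfidenceByPlatform[platform]);
});
```

A base of 0.3 maps to 0.993 and a base of 0.7 maps to 0.997, so the whole spread across platforms is four thousandths. No input other than the platform lookup reaches the formula.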

The honest reading is that the open-source confidence score signals which platform you are on (lower on mobile and Safari, where fingerprints are blurrier) and otherwise advertises Pro. Real per-visit confidence requires the server-side matching that only Pro has, because confidence in an identity is a statement about how that identity matched against history, and the open-source library has no history.

Why the client-side hash drifts

A stateless hash of mutable inputs is unstable by construction. The inputs change for ordinary reasons that have nothing to do with the user trying to hide.

Browser updates are the loudest cause. A Chrome point release that changes how the canvas rasterizer antialiases text, or that ships a new WebGL extension, or that adjusts the audio pipeline, moves the corresponding source value, which moves the hash. The user did nothing. Their visitor ID is now different. Multiply this across the canvas, audio, and WebGL sources, all of which are downstream of GPU drivers and OS graphics stacks, and you get an identifier that decays on the browser’s own release cadence. The entropy budget post frames this as the stability-versus-uniqueness tension: the most uniquely identifying signals (canvas, WebGL, audio) are also among the least stable, because they ride on exactly the components vendors update most.

Then there is the version coupling baked into the API. The version field equals the library version on purpose, a warning that a site upgrading the agent can shift IDs for its whole population at once. And there is plain spoofing. Anything the browser hands over for free, a script can lie about, which is the entire premise of the anti-detect-browser ecosystem covered in how anti-detect browsers spoof fingerprints at the engine level. The open-source library has no way to tell a real canvas value from a randomized one, because it reads the value the browser gives it and hashes that.

This is the gap. A pure client-side hash is cheap, transparent, and unstable, and it cannot distinguish a genuine signal from a forged one. Everything Pro adds is in service of closing that gap, and it does so by moving the decision off the client.

What Pro adds: the server-side architecture

Fingerprint Pro keeps a JavaScript agent in the browser, but the browser stops being the place where identity is decided. The agent collects signals and ships them to Fingerprint’s backend; the backend returns a stable identifier and a set of derived signals, and the application reads the result server-to-server using an event identifier rather than trusting whatever the client computed. The vendor’s public material describes the agent as collecting on the order of a hundred browser and device signals, analyzed server-side together with network-level data the browser cannot see. The reported headline accuracy for the commercial version is around 99.5%, against the open-source library’s own admission that its in-browser accuracy is significantly lower.

Two structural advantages come from this move, and they map exactly onto the two failure modes above. First, the server can do fuzzy matching instead of exact hash equality. Where the open-source library produces a different ID the instant any source value shifts, the server can match a drifted fingerprint against the same visitor’s history and decide it is probably the same device. The exact matching algorithm is not public, and this post will not pretend otherwise; what is documented is that the result is a stable visitor identifier that survives the kinds of drift that break a client-side hash. Second, the server sees signals the page cannot: the TLS ClientHello, HTTP request ordering, and IP reputation. Those are server-side fingerprints, and the TLS fingerprinting and HTTP/2 fingerprinting posts cover how much identity those carry on their own. A forged JavaScript canvas value does not change the TLS handshake, so the server can catch a client whose story does not hold together across layers.
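To see why history changes the picture, consider a deliberately naive matcher. This is not Fingerprint's algorithm, which is unpublished; it is a toy that scores the share of agreeing sources against each known visitor and accepts the best match above an arbitrary threshold. Every name and number in it is an assumption made for illustration.

```typescript
type Snapshot = { [source: string]: string };

// Fraction of sources, over the union of both key sets, on which two snapshots agree.
function agreement(a: Snapshot, b: Snapshot): number {
  const names: string[] = [];
  const all = Object.keys(a).concat(Object.keys(b));
  for (let i = 0; i < all.length; i++) {
    if (names.indexOf(all[i]) === -1) names.push(all[i]);
  }
  if (names.length === 0) return 0;
  let same = 0;
  for (let i = 0; i < names.length; i++) {
    if (a[names[i]] !== undefined && a[names[i]] === b[names[i]]) same++;
  }
  return same / names.length;
}

interface KnownVisitor {
  id: string;
  last: Snapshot; // most recent snapshot stored for this visitor
}

// Best-matching known visitor at or above the threshold, else null.
function matchVisitor(incoming: Snapshot, known: KnownVisitor[], threshold: number): string | null {
  let bestId: string | null = null;
  let bestScore = 0;
  for (let i = 0; i < known.length; i++) {
    const score = agreement(incoming, known[i].last);
    if (score >= threshold && score > bestScore) {
      bestScore = score;
      bestId = known[i].id;
    }
  }
  return bestId;
}

const knownVisitors: KnownVisitor[] = [
  { id: "v1", last: { canvas: "c1", audio: "a1", timezone: "Europe/Paris", platform: "Win32" } },
  { id: "v2", last: { canvas: "c9", audio: "a9", timezone: "Asia/Tokyo", platform: "MacIntel" } },
];

// v1's browser after an update changed only its canvas output.
const drifted: Snapshot = { canvas: "c2", audio: "a1", timezone: "Europe/Paris", platform: "Win32" };
```

With a threshold of 0.7, matchVisitor(drifted, knownVisitors, 0.7) returns "v1" because three of four sources still agree, where an exact hash would have minted a new ID. A real system would weight sources by stability and entropy and guard against false merges, none of which this toy attempts.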

That cross-layer consistency check is the real reason the commercial version is harder to fool. The open-source library asks the browser one question and trusts the answer. Pro asks several layers the same question and looks for the layer that lies.

Smart Signals: the documented catalogue

The derived server-side signals Pro exposes are documented as Smart Signals, and the reference lists them with real JSON field names. This is the most concrete public surface of the commercial product, so it is worth laying out accurately. The signals split into ones common to browsers and mobile, ones specific to browsers, and ones specific to mobile apps.

Smart Signals, by scope (JSON field)

Common (browser + mobile): suspect_score · velocity · ip_info · proxy / proxy_confidence · ip_blocklist · high_activity_device · raw_device_attributes · proximity
Browser-specific: bot / bot_type · incognito · vpn / vpn_confidence / vpn_methods · tampering / tampering_confidence · virtual_machine · privacy_settings · developer_tools
Mobile-specific (Android / iOS): emulator (Android) · simulator (iOS) · cloned_app · factory_reset_timestamp · frida · location_spoofing · jailbroken (iOS) · root_apps (Android) · mitm_attack · vpn / vpn_origin_country · developer_tools

suspect_score is a weighted integer combining the others against global probabilities. velocity tracks distinct IPs, countries, and linked_ids across 5-minute, 1-hour, and 24-hour windows. All delivered via the Server API and webhooks; proximity is server-side only.

*The Smart Signals reference, as documented. Field names are the vendor's. suspect_score rolls the rest into one weighted integer; velocity is a temporal aggregate, not a per-request attribute.*

A few of these deserve unpacking. suspect_score is a weighted integer that combines the other Smart Signals against global probabilities, which means it is the product’s one-number summary of risk, conceptually similar to the bot scores other vendors expose like Cloudflare’s 1–99 score or Akamai’s bot-score header. velocity is a temporal aggregate: it counts distinct IPs, countries, and linked_ids seen for an identity across 5-minute, 1-hour, and 24-hour windows, which only a server with history can compute. proxy, ip_blocklist, and ip_info are IP-reputation signals, the kind discussed in how anti-bot vendors detect residential proxies and ASN reputation. tampering flags anti-detect browsers and anomalous signatures, which is the server-side answer to the spoofing problem the open-source library cannot address. The mobile signals (frida, jailbroken, root_apps, emulator, cloned_app, mitm_attack) move the same philosophy onto native apps, where the threats are instrumentation toolkits and rooted devices rather than headless Chrome.
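The velocity idea, distinct values per time window, is simple to sketch once you have history to count over. The event shape, window labels, and function names below are assumptions of this sketch and not the Server API's schema; only the three window lengths and the three counted dimensions come from the documented description.

```typescript
// Hypothetical per-visitor event record (not the API's field names).
interface SeenEvent {
  timestampMs: number;
  ip: string;
  country: string;
  linkedId: string;
}

// The three documented window lengths, in milliseconds.
const WINDOWS_MS: { [label: string]: number } = {
  "5m": 5 * 60 * 1000,
  "1h": 60 * 60 * 1000,
  "24h": 24 * 60 * 60 * 1000,
};

function countDistinct(values: string[]): number {
  const unique: string[] = [];
  for (let i = 0; i < values.length; i++) {
    if (unique.indexOf(values[i]) === -1) unique.push(values[i]);
  }
  return unique.length;
}

// For one visitor's history, count distinct values of `field`
// inside each window ending at `nowMs`.
function velocity(
  events: SeenEvent[],
  field: "ip" | "country" | "linkedId",
  nowMs: number
): { [label: string]: number } {
  const result: { [label: string]: number } = {};
  const labels = Object.keys(WINDOWS_MS);
  for (let i = 0; i < labels.length; i++) {
    const cutoff = nowMs - WINDOWS_MS[labels[i]];
    const inWindow = events.filter((e) => e.timestampMs > cutoff && e.timestampMs <= nowMs);
    result[labels[i]] = countDistinct(inWindow.map((e) => e[field]));
  }
  return result;
}

const MINUTE = 60 * 1000;
const HOUR = 60 * MINUTE;
const nowMs = 48 * HOUR; // arbitrary reference time for the demo

// Made-up history using documentation-reserved IP ranges.
const seenEvents: SeenEvent[] = [
  { timestampMs: nowMs - 2 * MINUTE, ip: "203.0.113.1", country: "FR", linkedId: "acct-1" },
  { timestampMs: nowMs - 30 * MINUTE, ip: "203.0.113.2", country: "FR", linkedId: "acct-1" },
  { timestampMs: nowMs - 10 * HOUR, ip: "198.51.100.7", country: "US", linkedId: "acct-2" },
  { timestampMs: nowMs - 30 * HOUR, ip: "192.0.2.9", country: "DE", linkedId: "acct-3" },
];
```

For this history, distinct IPs come out as 1, 2, and 3 across the 5-minute, 1-hour, and 24-hour windows; the 30-hour-old event falls outside all three. The whole computation is a query over stored events, which is exactly what a stateless client-side hash does not have.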

The internal computation behind each of these is not published, and where this post names a field it is naming the documented output, not claiming to know the algorithm that produces it. The Smart Signals reference gives the field and a one-line description. It does not give the model.

Bot detection: BotD and what Pro layers on top

FingerprintJS ships bot detection as a separate open-source library, BotD, also MIT-licensed and running entirely client-side. BotD’s job is narrower than fingerprinting: it tries to decide whether the current environment is an automation tool rather than to identify the device. It detects headless browsers, Selenium, Playwright, PhantomJS, Nightmare, Electron, and similar frameworks by checking for the distinctive properties those tools leave in the environment. Those tells are the same family of artifacts catalogued in headless Chrome detection and the Chrome DevTools Protocol as a detection vector: the navigator.webdriver flag, the HeadlessChrome user-agent token, the property leaks that automation frameworks add to the runtime. The BotD README itself notes the project is in maintenance: critical fixes land, new features are unlikely soon.
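Two of those tells are simple enough to show. The sketch below is not BotD's detector list or its code; it checks only the two artifacts named above, and it takes them as plain inputs so it runs outside a browser. In a page the values would come from navigator.webdriver and navigator.userAgent.

```typescript
// The two environment facts this sketch looks at.
interface EnvSnapshot {
  webdriver: boolean; // value of navigator.webdriver
  userAgent: string;  // value of navigator.userAgent
}

// Returns a human-readable list of the automation tells that were found.
function automationTells(env: EnvSnapshot): string[] {
  const tells: string[] = [];
  if (env.webdriver) {
    tells.push("navigator.webdriver is true");
  }
  if (env.userAgent.indexOf("HeadlessChrome") !== -1) {
    tells.push("HeadlessChrome token in user agent");
  }
  return tells;
}

// Illustrative user-agent strings.
const headlessEnv: EnvSnapshot = {
  webdriver: true,
  userAgent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36",
};
const ordinaryEnv: EnvSnapshot = {
  webdriver: false,
  userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
};
```

Both checks read values the automation tool controls, so an empty list proves little. That weakness is the client-side library's ceiling in miniature.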

Pro’s bot detection uses BotD as a base and adds proprietary technology plus server-side signals for accuracy and stability. The documented output classifies traffic three ways: a good bot (verified search crawlers, monitoring services, and authorized AI agents), a bad bot (browser automation like Selenium and Puppeteer, plus anything impersonating a verified tool), or bot not detected. The result object carries the classification and, for AI tools specifically, extra metadata including category, provider, bot name, and an identity status of signed, verified, unknown, or spoofed. The good-versus-bad split is the operationally important part: a search-engine crawler and a scraping bot can present nearly identical client-side signatures, and the thing that separates them is server-side reputation and reverse-DNS verification, which is exactly the kind of check the in-browser library cannot perform.
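On the consuming side, the three-way verdict usually ends up in a small policy function. The sketch below is one hypothetical policy, not vendor guidance. The type and field names are this post's labels and not necessarily the API's exact values; only the three classes and the four identity statuses come from the description above.

```typescript
type BotVerdict = "good" | "bad" | "notDetected";
type IdentityStatus = "signed" | "verified" | "unknown" | "spoofed";

// Hypothetical view of the result object described above.
interface BotResultSketch {
  verdict: BotVerdict;
  // Extra metadata, present only for AI tools.
  ai?: { category: string; provider: string; name: string; identity: IdentityStatus };
}

type Action = "allow" | "challenge" | "block";

// One possible application policy; the action choices are assumptions.
function decide(result: BotResultSketch): Action {
  // A spoofed identity means something is impersonating a known tool.
  if (result.ai !== undefined && result.ai.identity === "spoofed") return "block";
  if (result.verdict === "bad") return "block";
  // A self-declared AI agent whose identity could not be confirmed.
  if (result.ai !== undefined && result.ai.identity === "unknown") return "challenge";
  // Verified good bots and visits with no bot verdict proceed.
  return "allow";
}
```

Note what "notDetected" means here: the absence of a bot verdict, not proof of a human. A policy that treats it as the latter inherits every gap in the detector.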

two questions, two libraries

fingerprintjs: "which device is this?" · returns a visitorId · identity, persistent
botd: "is this automated?" · returns good / bad / none · classification, per-visit

people conflate these constantly; the same vendor sells them as separate libraries

*Identity and automation are different questions. Fingerprinting answers "which device," bot detection answers "is this a script." Conflating them is the most common mistake in reading the product.*

Incognito and VPN detection as worked examples

Two Smart Signals make good case studies because their mechanisms are partly public and partly browser-version-dependent, which is a good test of stating what is known versus inferred.

Incognito detection has a documented history of techniques that browsers have steadily closed. The vendor’s own write-up walks through four. The storage-quota method called navigator.storage.estimate() and treated a quota under roughly 120 MB as private mode; that worked from about Chrome 74 and stopped working after Chrome 84, when Chrome stopped tying the reported quota to real disk. A filesystem timing method exploited the fact that Chrome’s temporary filesystem in incognito writes faster against its smaller quota, measurable by repeated large writes; this still functions but is unreliable. A Firefox-specific method tested whether indexedDB.open() throws in private mode, a restriction Firefox introduced in version 60; it works on current Firefox but never applied to Chrome or Safari. And an old Safari method keyed on localStorage write failures in private mode; it has been broken since Safari 14. The pattern is cat-and-mouse: each technique is a browser quirk, and each gets patched. What incognito detection looks like in 2026 is therefore browser-specific and partial, and Pro’s value is bundling the surviving techniques and updating them as browsers change, not any single durable trick. The internal current method is not published; the above is the documented public history.

VPN detection exposes its method in its field names. The Smart Signal carries vpn, vpn_confidence, and vpn_methods, and the documented detection approaches are timezone mismatch, known public-VPN provider ranges, OS-versus-IP mismatch, and relay-service identification. Timezone mismatch is the clean example: the browser’s JavaScript timezone (from the timezone source, which the timezone and Intl fingerprint post covers) is a client-side signal, while the IP’s geolocation is a server-side one. If the browser says Europe/Paris and the IP geolocates to a datacenter in Virginia, the two disagree, and that disagreement is the detection. This is the cross-layer consistency check again, made concrete: a client signal and a server signal that should agree, used as evidence when they do not. A pure client-side library can read the timezone but cannot see the IP geolocation, so it structurally cannot run this check. That is the architectural line between the two products, drawn in one signal.
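The timezone check reduces to comparing one client-side value against one server-side lookup. The sketch below assumes a hypothetical GeoIP step has already produced the list of timezones plausible for the request's IP; the type and function names are illustrative, and the vendor's actual logic, including how the result feeds vpn_confidence, is not published.

```typescript
interface VpnEvidence {
  // Client side: IANA timezone reported by the browser.
  browserTimezone: string;
  // Server side: timezones plausible for the IP's geolocation
  // (hypothetical output of a GeoIP lookup).
  ipTimezones: string[];
}

// True when the browser's timezone is not among those the IP could be in.
function timezoneMismatch(evidence: VpnEvidence): boolean {
  if (evidence.ipTimezones.length === 0) {
    return false; // no server-side data, so no verdict either way
  }
  return evidence.ipTimezones.indexOf(evidence.browserTimezone) === -1;
}

// Browser says Paris, IP geolocates to Virginia.
const parisViaVirginia: VpnEvidence = {
  browserTimezone: "Europe/Paris",
  ipTimezones: ["America/New_York"],
};

// Browser and IP agree.
const parisViaParis: VpnEvidence = {
  browserTimezone: "Europe/Paris",
  ipTimezones: ["Europe/Paris"],
};
```

A mismatch is evidence and not proof: travellers and misconfigured clocks produce it too, which is one reason to combine it with the other documented methods before acting on it. The function needs both inputs, and only a server holds the second one.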

What the split tells you

The two halves of FingerprintJS are an unusually clean natural experiment in the limits of client-side fingerprinting, because the same company built both and is candid about where the open-source one falls short. The open-source library is a deterministic hash of about forty browser attributes, fast and transparent and MIT-licensed, with a confidence score that is essentially a per-platform constant pointing at the upgrade page. It works until a browser update moves a canvas value, or a user spoofs an attribute, at which point it has no recourse, because it asked the browser one question and trusted the answer.

Everything that makes the commercial version harder to evade comes from refusing to make the browser the final authority. Identity gets recomputed on a server with history, so drift can be matched through instead of breaking the ID. The decision draws on signals the page cannot see (TLS, HTTP ordering, IP reputation), so a forged JavaScript value gets caught by the layer it could not forge. The Smart Signals (suspect_score, velocity, bot, vpn, tampering) are all things that require either history or a vantage point above the browser. None of them can exist in a stateless client-side hash, and that is the whole point: the entropy that matters most for telling a real visitor from a forged one lives off the client, which is precisely why the vendor moved it there and charges for it.

The concrete takeaway for anyone citing these tools is to keep the two questions apart. A visitorId is an answer to “which device,” computed by x64hash128 over a documented source list, and it is only as stable as the least stable source in that list. A bot verdict is an answer to “is this automated,” computed somewhere you cannot inspect, and good-versus-bad turns on server-side reputation rather than anything in the browser. Read the open-source repo to know what the agent collects. Read the Smart Signals reference to know what the server returns. The gap between those two documents is the product.

