
The cold-start problem in behavioral biometrics

Copyright: MIT

A behavioral biometric system answers one question well: is the person driving this session the same person who drove the last forty? That is a comparison. It needs two operands, the live session and a stored model of the account holder, and the model is built from history. On a mature account with months of logins, the comparison is sharp. On the first session of a brand-new account, one of the two operands does not exist. The system is being asked to judge a typing rhythm against a profile that contains zero samples of that person typing. There is nothing to compare to.

That is the cold-start problem, and it is not a tuning issue that better models fix. It is structural. The strongest claim behavioral biometrics makes, that your motor patterns are yours and stay yours, is also the claim that disqualifies it on session one. New-account fraud and the first login after enrollment are exactly the moments fraud teams care about most, and they are exactly the moments the per-user model is silent. This is the same problem the fraud-detection pillar gestures at when it says the most expensive fraud happens after authentication, except here the account has no after to lean on yet.

This post is about what fills that silence. We will define the cold-start gap precisely, then walk the three fallbacks vendors actually ship: population models that score you against everyone else before they can score you against yourself, device and network signals that carry no enrollment requirement, and progressive trust schemes that ration confidence over the first several sessions. A short detour covers synthetic enrollment, the one research direction that shortens the window directly. Each fallback has a failure mode. The closing section is about what none of them can do.

Why the per-user model is empty on session one

Start with the mechanism, because the cold-start gap falls out of it directly. A behavioral profile is a statistical model of one person’s motor output. For keystroke dynamics that means distributions over dwell time (how long a key is held) and flight time (the gap between releasing one key and pressing the next), usually broken down per digraph. For mouse dynamics it is curvature, velocity and acceleration profiles, pause locations, the shape of the approach to a click target. For touch it is the swipe geometry, pressure, and contact area that the Touchalytics work reduced to 30 features per stroke. Whatever the channel, the profile is an estimate of a distribution, and you cannot estimate a distribution from nothing.
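To make the two keystroke features concrete, here is a minimal sketch of how dwell and flight times fall out of raw key events. The `KeyEvent` shape, the millisecond timestamps, and the sample values are all assumptions for illustration, not any vendor's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyEvent:
    """Hypothetical event record: one key press with its release."""
    key: str
    down_ms: float  # timestamp of the key press
    up_ms: float    # timestamp of the key release


def dwell_times(events):
    """Dwell time: how long each key was held."""
    return [e.up_ms - e.down_ms for e in events]


def flight_times(events):
    """Flight time per digraph: gap between releasing one key and pressing the next."""
    per_digraph = {}
    for prev, nxt in zip(events, events[1:]):
        per_digraph.setdefault(prev.key + nxt.key, []).append(nxt.down_ms - prev.up_ms)
    return per_digraph


# Made-up timings for someone typing "pass".
events = [
    KeyEvent("p", 0.0, 95.0),
    KeyEvent("a", 140.0, 230.0),
    KeyEvent("s", 300.0, 380.0),
    KeyEvent("s", 450.0, 540.0),
]
dwells = dwell_times(events)
flights = flight_times(events)
```

A profile is then a set of distributions over values like these, one per key or digraph. Four keystrokes yield one observation for each of three digraphs, which is why a single session says so little.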

How much is nothing, in practice? The literature is consistent that one sample is far too few and the useful range starts in the high single digits. The Touchalytics classifier was trained on an enrollment phase and reached its low error rates (a median equal error rate near zero within a session and 2 to 3 percent across sessions) only once it had enough strokes to characterize the user. Keystroke surveys put typical enrollment somewhere under ten samples for password-length strings, with some protocols demanding 20 or more repetitions and a few research setups going to several hundred. IBM’s own identity product documents a baseline that needs at least eight sessions of activity before behavioral scoring is reliable. The exact number is modality- and vendor-specific, but the shape is universal: there is a warm-up period measured in sessions, and during it the per-user verdict is either unavailable or low-confidence.
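One way to see why the warm-up is measured in sessions is to watch the uncertainty of a single feature estimate shrink as sessions accumulate. The sketch below uses the standard error of the mean as a stand-in for confidence. The `MIN_SESSIONS` value echoes the eight-session figure above but is an illustrative threshold, and `profile_estimate` is a hypothetical helper.

```python
import math
from statistics import mean, stdev

MIN_SESSIONS = 8  # illustrative gate, not a vendor constant


def profile_estimate(session_means_ms):
    """Return (mean, standard_error, usable) for one per-session feature.

    With fewer than two sessions no spread can be estimated at all,
    so the standard error is None and the estimate is not usable.
    """
    n = len(session_means_ms)
    if n < 2:
        return (session_means_ms[0] if n else None, None, False)
    m = mean(session_means_ms)
    se = stdev(session_means_ms) / math.sqrt(n)
    return (m, se, n >= MIN_SESSIONS)


empty = profile_estimate([])
two = profile_estimate([90.0, 110.0])
eight = profile_estimate([90.0, 110.0] * 4)
```

With the same per-session spread, the eight-session estimate has a much smaller standard error than the two-session one. That is the climbing confidence curve in numeric form.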

[Figure: per-user model confidence (none to high) plotted against accumulated sessions, from session 0 through a warm-up region where the verdict is weak, to a warmed-up profile at roughly 8 sessions.]

*The per-user model is a distribution estimate, so its confidence climbs only as samples accumulate. The orange marker at session zero is where new-account and first-login fraud actually happens.*

There is a second, sharper version of the problem that gets less attention. Even on an established account, every session has its own cold start inside it. A verdict that needs forty keystrokes or a few seconds of cursor motion cannot exist at millisecond zero of a page load. The model warms up within the session too, which is why the most dangerous automated actions are the ones that complete before enough behavior has accrued to score them: a form submitted in 200 milliseconds, a single fetch with no pointer events at all. The session-level cold start is why behavioral signals are usually fused with request-time signals that need no warm-up, a point the session-replay telemetry side of these systems leans on heavily.
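The within-session cold start can be sketched as a scorer that refuses to answer until enough behavior has accrued. The forty-keystroke floor reuses the figure from the paragraph above as an assumed threshold. The scoring rule, a mean absolute z-score against a stored dwell-time profile, is a deliberately simple stand-in for whatever a real system computes.

```python
MIN_KEYSTROKES = 40  # assumed floor, taken from the example figure in the text


def session_behavior_score(dwell_ms, profile_mean, profile_std):
    """Return the mean absolute z-score of this session's dwell times,
    or None when too little behavior has accrued to say anything."""
    if len(dwell_ms) < MIN_KEYSTROKES or profile_std <= 0:
        return None
    return sum(abs(d - profile_mean) / profile_std for d in dwell_ms) / len(dwell_ms)


# A form submitted after three keystrokes gets no behavioral verdict at all.
too_early = session_behavior_score([100.0, 98.0, 101.0], 90.0, 10.0)

# Forty keystrokes later the same profile can produce a number.
warmed = session_behavior_score([100.0] * 40, 90.0, 10.0)
```

Returning `None` rather than a default score keeps "no evidence" distinct from "evidence of normal behavior". A fast automated submission otherwise looks the same as a benign one.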

Template drift makes the cold-start problem reappear long after enrollment, which is worth flagging because it is the same gap wearing a different hat. A user’s typing rhythm shifts as they grow accustomed to a password, mature in proficiency, or switch input devices. Keystroke surveys note this aging directly and observe that fewer than a fifth of studied systems implement any retraining to track it. When a profile drifts far enough from the live behavior, or when the user genuinely changes (a new laptop, a hand injury, a different keyboard), the per-user model is effectively stale, and the system is back to leaning on the same fallbacks it used at signup until the profile re-converges. Cold start is not only the first session. It is any moment the per-user model and the live user have diverged, which is why these systems keep the fallback machinery running permanently rather than switching it off once an account warms up.
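A minimal sketch of the two halves of drift handling: noticing that a template has gone stale, and nudging it toward recent behavior. The z-score limit, the exponential-moving-average update, and the `Template` shape are all assumptions chosen for brevity. The surveys cited above do not prescribe a particular retraining rule.

```python
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical one-feature template (e.g. mean dwell time in ms)."""
    mean_ms: float
    std_ms: float


def is_stale(template, recent_mean_ms, z_limit=3.0):
    """Flag the template when recent behavior sits far from the stored mean."""
    return abs(recent_mean_ms - template.mean_ms) / template.std_ms > z_limit


def retrain(template, recent_mean_ms, alpha=0.25):
    """Exponential moving average update toward recent sessions.

    Assumption: the caller only retrains on sessions verified by some other
    factor, otherwise an impostor could drag the template toward themselves.
    """
    new_mean = (1 - alpha) * template.mean_ms + alpha * recent_mean_ms
    return Template(new_mean, template.std_ms)


template = Template(mean_ms=100.0, std_ms=10.0)
small_shift = is_stale(template, 105.0)   # ordinary day-to-day variation
new_keyboard = is_stale(template, 140.0)  # far enough to distrust the template
updated = retrain(template, 140.0)
```

While `is_stale` is true the per-user verdict is unreliable in the same way it was at signup, so the fallback signals have to carry the session until the template catches up.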

So the per-user model is empty at account creation, thin at the start of each session, and periodically stale as behavior drifts. Everything below is about what a vendor scores you against in the meantime.

Population models: scoring you against everyone else

The first fallback is the oldest idea in biometrics, borrowed from speaker verification. When you cannot compare a sample to the enrolled user’s model, compare it to a model of the general population and ask whether it looks like a plausible human at all, and which sub-population it resembles. In speaker verification this is the universal background model, a speaker-independent mixture trained on a large pool of voices, against which an individual’s likelihood is normalized. The verdict is a likelihood ratio: how much better does the enrolled-speaker model explain this sample than the generic background model does. Cohort models are the discriminative cousin, a set of other-people templates that bound the impostor distribution.
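The likelihood-ratio verdict can be sketched in a few lines. Real background models are mixtures over many features. This sketch swaps in a single Gaussian per model over one feature (dwell time), and every parameter value is made up for illustration.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def log_likelihood_ratio(samples, user, background):
    """Sum over samples of log p(x | user model) - log p(x | background model).

    Positive: the enrolled-user model explains the samples better than the
    generic background does. Negative: the background explains them better.
    """
    return sum(gaussian_logpdf(x, *user) - gaussian_logpdf(x, *background) for x in samples)


user_model = (95.0, 8.0)           # hypothetical enrolled user: (mean, std) in ms
background_model = (110.0, 25.0)   # hypothetical population-wide (mean, std) in ms

genuine = [93.0, 97.0, 96.0, 94.0]
stranger = [130.0, 125.0, 140.0, 135.0]

llr_genuine = log_likelihood_ratio(genuine, user_model, background_model)
llr_stranger = log_likelihood_ratio(stranger, user_model, background_model)
```

The background model is a normalizer here. The ratio still needs a `user_model`, which is exactly the operand that is missing on session one.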

Behavioral biometrics vendors run the same play under different names. BioCatch’s public material describes scoring activity against the historical profile for the individual account and, separately, against population-level patterns that capture statistically observed norms for good and bad behavior. The second comparison is the one that works on a user the bank has never seen. It cannot tell you this is Jane, because there is no Jane model yet. It can tell you this session resembles the population of genuine new-account openings, or it resembles the population of mule-account openings, or it does not resemble a human at all. BioCatch states it analyzes thousands of parameters, with its marketing settling on figures around 2,000 behavioral parameters in older material and roughly 3,000 signals in current product pages. The population comparison is what lets a vendor make a new-account-fraud claim with a straight face despite never having profiled the applicant.
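The two comparisons can be sketched as one function with a fallback branch. This is not BioCatch's method, which is not public at this level of detail. It is a toy with one feature, single-Gaussian cohorts, and invented parameters, meant only to show which branch answers when the per-user model is missing.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def score_session(samples, user_model, genuine_cohort, fraud_cohort):
    """Return (branch, score); in both branches a positive score leans genuine."""
    def loglik(model):
        return sum(gaussian_logpdf(x, *model) for x in samples)

    if user_model is not None:
        # Warm account: does the user's own model beat the generic genuine cohort?
        return ("per_user", loglik(user_model) - loglik(genuine_cohort))
    # Cold account: no self to compare against, so compare cohort to cohort.
    return ("population", loglik(genuine_cohort) - loglik(fraud_cohort))


genuine_cohort = (110.0, 25.0)  # hypothetical dwell times (ms) for genuine new accounts
fraud_cohort = (40.0, 5.0)      # hypothetical: fast, uniform, script-like input

human_like = score_session([100.0, 120.0, 95.0], None, genuine_cohort, fraud_cohort)
script_like = score_session([38.0, 41.0, 40.0], None, genuine_cohort, fraud_cohort)
```

Both calls pass `None` for the user model, so both verdicts come from the population branch. The toy also illustrates the limitation discussed next: it separates a script from a human easily, and would say nothing useful about two different humans.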

[Figure: two comparisons, one live session. The live session feeds a per-user model (empty on a new account, no output) and a population model (genuine vs. fraud cohorts, score still produced).]

*On a new account the per-user branch returns nothing, so the verdict rides entirely on the population branch: does this session look like a genuine cohort or a fraud cohort.*

The cost of the population fallback is precision, and the cost is not small. A per-user model can flag a deviation of a few percent from your own baseline. A population model only knows aggregate norms, so its discriminating power is whatever separates the genuine cohort from the fraud cohort on average, and the survey literature is blunt that behavioral universality across a large population is low because people are not that different from each other in bulk. The same touch and keystroke features that separate individuals cleanly inside a known set blur together when the question is which anonymous stranger you are. So the population verdict is good at catching the obvious (a script with no human kinematics at all, a session that matches a known mule pattern) and weak at the subtle (a competent human fraudster opening an account by hand). It buys you a floor, not a sharp edge.

There is also a quieter risk in the cohort itself. A population model is only as fair as its training distribution. If genuine behavior in the reference pool skews toward one demographic, age band, or set of input devices, then users outside that distribution look anomalous on session one through no fault of their own, and the cold-start period is exactly when there is no personal history to override the population prior. The fairness literature on biometrics keeps returning to this: behavioral models trained on skewed data raise false-reject rates for older users, users with motor disabilities, and users on unusual hardware. During warm-up, the population prior is the whole verdict, so any bias in it lands hardest on the newest users. The UK ICO’s biometric guidance is explicit that a recognition system a disabled person cannot use, with no alternative route, is unlawful, and a system whose only early signal is a population prior tuned to a median user is exactly the kind of design that produces that outcome at the edges.

The population fallback also raises a regulatory question that the per-user model partly dodges. Building a cohort means processing behavioral data from a large pool of people, and under the GDPR behavioral characteristics can qualify as biometric data, a special category that needs both an Article 6 lawful basis and an Article 9 condition before processing is legal. Profiling a person against a population model is itself a processing operation, and the legal analysis keeps noting that ongoing behavioral profiling can create new biometric data rather than merely reading existing data. So the population fallback that makes a new-account verdict possible is also the part of the pipeline most exposed to the consent and special-category arguments, and it processes the most people to do the least precise work. That is an awkward trade to defend on a privacy-impact assessment.

Device and network signals: the verdict that needs no history

The second fallback sidesteps behavior entirely. Device and network signals are available on the very first request, before a single keystroke, and they carry no per-user enrollment requirement because they describe the machine and the path, not the person. This is why a behavioral SDK is almost never deployed alone. The behavioral score warms up; the device score does not.

The device side covers a wide span. At the network layer it is the TLS fingerprint of the client, the HTTP/2 settings and header ordering, the TCP/IP stack characteristics, the IP’s ASN and reputation, whether the address belongs to a hosting provider or a residential proxy pool. At the browser or app layer it is the canvas and WebGL outputs, the navigator properties, the timezone and locale, and in mobile apps the device integrity signals: is this an emulator, is the OS jailbroken, are sensors missing that a real handset would report. BioCatch packaged a chunk of this into a product it calls DeviceIQ, pitched precisely at catching device spoofing, emulators, cloaked browsers, and jailbroken devices, the techniques that let one fraudster present as many fresh users. The point of that product, in cold-start terms, is that it gives a verdict on a device the bank has never seen, using properties that exist on request one.
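As a rough illustration of a verdict that needs no history, here is a rule-based device risk score computed from signals available on the first request. The field names, weights, and additive scheme are all hypothetical. This is not DeviceIQ's logic or any vendor's, only a sketch of scoring a device the system has never seen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSignals:
    """Hypothetical request-time signals; none require prior sessions."""
    is_emulator: bool = False
    is_jailbroken: bool = False
    ip_is_hosting_provider: bool = False
    ip_is_residential_proxy: bool = False
    tls_fingerprint_known_bad: bool = False


# Illustrative weights only; a real system would fit these to labeled data.
WEIGHTS = {
    "is_emulator": 0.5,
    "is_jailbroken": 0.25,
    "ip_is_hosting_provider": 0.25,
    "ip_is_residential_proxy": 0.25,
    "tls_fingerprint_known_bad": 0.5,
}


def device_risk(signals):
    """Sum the weights of every raised flag, capped at 1.0."""
    raw = sum((w for name, w in WEIGHTS.items() if getattr(signals, name)), 0.0)
    return min(1.0, raw)


clean = device_risk(RequestSignals())
emulated_from_datacenter = device_risk(
    RequestSignals(is_emulator=True, ip_is_hosting_provider=True)
)
```

Every input here is self-reported by the client or inferred from the network path, which is what makes the score available at request one and also what makes it spoofable.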

[Figure: signal availability within one session, from t=0 to t=seconds. TLS / HTTP2 / TCP fingerprint: present. IP / ASN reputation and device integrity: present. Behavioral score: accrues over the session, starting at 0.]

*Device and network signals are at full strength at the first byte. The behavioral channel starts at the floor and climbs, which is why the two are fused rather than used in isolation.*
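One simple way to fuse the two channels is to let the behavioral weight grow with the amount of behavior observed. The linear ramp, the 0.5 cap, and the `warm_at` threshold below are assumptions picked for clarity. Production systems more likely learn the combination from data.

```python
def fused_risk(device_risk, behavior_risk, keystrokes_seen, warm_at=40, max_weight=0.5):
    """Blend device and behavioral risk, both in [0, 1].

    The behavioral weight ramps linearly from 0 to max_weight as evidence
    accrues; with no behavioral verdict the device score stands alone.
    """
    if behavior_risk is None:
        return device_risk
    weight = min(1.0, keystrokes_seen / warm_at) * max_weight
    return (1 - weight) * device_risk + weight * behavior_risk


at_first_byte = fused_risk(0.2, None, keystrokes_seen=0)
mid_session = fused_risk(0.2, 0.8, keystrokes_seen=20)
warmed_up = fused_risk(0.2, 0.8, keystrokes_seen=40)
```

At the first byte the fused score is just the device score. The behavioral channel can only pull it up or down once there is something to measure.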

The catch is that device and network signals are the most spoofable layer in the stack, and the adversary who matters for new-account fraud is the one who knows it. Anti-detect browsers exist to make the device fingerprint say whatever the operator wants. Residential proxies launder the network origin into something that looks like a home connection. Emulator-hardening and integrity-bypass tooling chip at the mobile device checks. None of this is hypothetical, and it means the device fallback is strong against low-effort abuse and porous against the dedicated fraud operation that farms accounts for a living. There is a deeper structural point too: device and network identity describe the conduit, not the human. A genuine user on a flagged residential proxy and a fraudster on a clean fresh handset both break the assumption that device reputation is a proxy for user intent. So the device fallback inherits all the server-side versus client-side detection tradeoffs, and it inherits the proxy arms race wholesale. It fills the cold-start gap with a signal that is available but not the signal you actually wanted.

Progressive trust: rationing confidence over the first sessions

The third fallback is procedural rather than statistical. If the model cannot be confident early, do not pretend it is. Ration what the account can do until enough history exists to judge it, and spend friction where the population and device signals say risk is high. This is the behavioral biometrics version of risk-based, or adaptive, authentication, and it is the layer where cold start stops being a pure modeling problem and becomes a product decision.

The shape is familiar from any adaptive-auth deployment. A first session from an unrecognized device, an unrecognized location, with no behavioral history, scores as higher risk almost by definition, and the system responds with a step-up: an additional factor, a hold on a transfer, a lower limit, a manual review of a new-account application. As the account accumulates clean sessions, the per-user model fills in, the device becomes recognized, and the friction relaxes. Trust is earned over time rather than granted at enrollment. The standards bodies have caught up to this: NIST’s SP 800-63B revision adds a session-monitoring section that treats behavioral signals like typing cadence as continuous-evaluation inputs for spotting fraud during a live session, alongside weaker signals like browser traits, geolocation and IP reputation.

That same NIST language draws the line that progressive trust must respect. Behavioral biometrics is explicitly not treated as an authenticator on its own, because it is not a secret and the verdict is probabilistic. Session monitoring reduces risk during a session but does not raise the assurance level by itself. In cold-start terms this is the honest position: during warm-up the behavioral channel contributes a risk signal, not a pass-fail credential, and the system must still rest its actual authentication on something that works on session one, a possession or knowledge factor. Progressive trust is the scaffolding that lets a weak early signal be useful without being load-bearing.

[Figure: progressive trust as a state machine. States: new (high friction), warming (step-up on risk), trusted (low friction). Clean sessions and a filling model move the account forward; an anomaly resets trust.]

*Trust is a state the account moves through, not a property granted at signup. A strong anomaly during any state can drop it back toward the restricted state.*
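The state machine is small enough to sketch directly. The state names and friction levels come from the figure. The eight-session promotion threshold and the rule that any strong anomaly drops the account all the way back to `NEW` are assumptions, and a real policy might demote one level at a time instead.

```python
from enum import Enum


class Trust(Enum):
    NEW = "new"
    WARMING = "warming"
    TRUSTED = "trusted"


# Friction per state, as labeled in the figure.
FRICTION = {
    Trust.NEW: "high friction",
    Trust.WARMING: "step-up on risk",
    Trust.TRUSTED: "low friction",
}

CLEAN_SESSIONS_TO_TRUST = 8  # illustrative threshold


def next_state(state, clean_sessions, anomaly):
    """Advance the account's trust state after one session.

    clean_sessions counts consecutive clean sessions; the caller is assumed
    to reset that counter whenever an anomaly sends the account back to NEW.
    """
    if anomaly:
        return Trust.NEW
    if state is Trust.NEW:
        return Trust.WARMING
    if state is Trust.WARMING and clean_sessions >= CLEAN_SESSIONS_TO_TRUST:
        return Trust.TRUSTED
    return state


state = Trust.NEW
for n in range(1, 9):  # eight clean sessions in a row
    state = next_state(state, clean_sessions=n, anomaly=False)
after_warmup = state
after_anomaly = next_state(after_warmup, clean_sessions=9, anomaly=True)
```

The tradeoff described in the text lives in `CLEAN_SESSIONS_TO_TRUST` and in what `FRICTION` actually gates. Raising the threshold taxes every honest new user for longer, and lowering it shortens the period in which a fraudster meets resistance.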

The failure mode here is the user, not the attacker. Progressive trust spends friction precisely on the people with no history, and the people with no history include every legitimate new customer. Set the friction too high and onboarding bleeds: abandoned signups, support tickets, false rejects that fall hardest on users who already deviate from the population norm. Set it too low and the cold-start window becomes the soft target, the place fraud rushes because the per-user model is not watching yet. There is no setting that escapes the tradeoff. The window can be shortened, by enrolling faster with synthetic data or by leaning harder on device signals, but it cannot be closed, and every day it stays open is a day the account is judged by everything except its own behavior.

Shortening the window: synthetic enrollment

One line of research attacks cold start head-on by manufacturing the missing history. If the problem is too few enrollment samples, generate plausible extra ones from the few you have. The UserBoost work trains a regularized autoencoder on a handful of a user’s real gestures and synthesizes more, reporting a 40 percent cut in the number of real gestures a user must supply at enrollment without degrading error rates, on a wrist-motion smartwatch dataset. The generic version of the idea, synthetic profiles that model both between-user spread and within-user variability, shows up in the patent literature for training biometric verifiers when real data is scarce.

It is worth being precise about what this does and does not solve. Synthetic enrollment shortens the warm-up by squeezing more model out of each real sample. It does not create history for a user who has produced zero real samples, because the generator needs a seed of genuine behavior to extrapolate from. So it compresses the cold-start window; it does not eliminate session zero. And it carries its own risk: a generator tuned to make enrollment smoother also smooths over the very idiosyncrasies that make a user distinguishable, trading a shorter warm-up for a slightly blunter steady-state model. The honest summary is that synthetic data moves the curve left, not to the origin.
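The dependence on a real seed is easy to show in code. The sketch below is not the UserBoost autoencoder. It swaps in the crudest possible generator, multiplicative Gaussian jitter around real samples, purely to illustrate the shape of the idea. The function name, jitter fraction, and feature vectors are all hypothetical.

```python
import random


def synthesize(real_samples, n_extra, jitter_frac=0.05, seed=0):
    """Generate n_extra synthetic feature vectors by jittering real ones.

    Raises ValueError when there are no real samples: a generator can
    stretch a seed of genuine behavior, but it cannot invent one.
    """
    if not real_samples:
        raise ValueError("synthetic enrollment needs at least one real sample")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_extra):
        base = rng.choice(real_samples)
        synthetic.append([v * (1 + rng.gauss(0, jitter_frac)) for v in base])
    return synthetic


# Two real enrollment samples, each a made-up [dwell_ms, flight_ms] vector.
real = [[95.0, 45.0], [90.0, 50.0]]
augmented = real + synthesize(real, n_extra=6)
```

Every synthetic vector stays close to one of the real ones, so the enlarged set carries no information the two real samples did not already contain. The gain is a model that trains more stably on few samples, and the cost is the blunting risk described above.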

What none of the fallbacks can do

Step back and the three fallbacks are doing the same thing from different angles: substituting something that exists on session one for the per-user model that does not. Population models substitute the crowd. Device and network signals substitute the machine. Progressive trust substitutes time and friction. Each is a real engineering answer and each is a downgrade, because none of them is the thing the system was actually built to measure, which is this specific person’s motor signature compared against itself.

The structural truth underneath is that behavioral biometrics is a recognition technology pretending, at cold start, to be a classification technology. Recognition asks whether two samples came from the same source and it is sharp. Classification asks which broad bucket an unknown sample falls into and it is coarse. On session one there is only the coarse question available, and the coarse question is the one the open-set survey literature keeps reporting low universality on, because a large population of humans is not cleanly separable by typing rhythm alone. This is also why the strongest claim in vendor material, catching fraud on a user you have never seen, is true and oversold at the same time. It is true that the population and device layers produce a verdict on a stranger. It is oversold to imply that verdict has the precision of the per-user comparison the marketing screenshots show on a warmed-up account.

For anyone reasoning about these systems from the outside, the cold-start window is the part of the lifecycle where the behavioral channel is least like its reputation and the request-time signals carry the most weight. The per-user model that the whole category is sold on is silent at exactly the moment of highest fraud value, and what answers in its place is a population prior with low resolution, a device fingerprint that the serious adversary already spoofs, and a friction budget that taxes every honest new user to ration risk on the few bad ones. The window closes on its own as history accrues. It just does not close on the first session, and the first session is the one that was worth the most to get right.

