
Mouse-movement biometrics: how curvature and velocity profiles classify humans

Copyright: MIT

A mouse pointer travelling from a link to a Submit button leaves a trail. Sample that trail at a few hundred hertz and you get a sequence of (x, y, t) triples, which looks like nothing until you take its derivatives. The first derivative is velocity. The second is acceleration. The third, the rate of change of acceleration, is jerk. Layer in the geometry of the path itself, how sharply it bends and how far it strays from a straight line, and you have a surprisingly high-dimensional signal coming out of one of the cheapest input devices ever made. The question a bot detector asks is narrow and specific: does this signal look like it came from a hand, or from code that computed where the cursor should be next?

That question turns out to be answerable with decent accuracy from a single trajectory. Not because humans are unpredictable, but because they are predictable in a way that is hard to fake cheaply. A real arm accelerates at the start of a reach and decelerates into the target, overshoots slightly, corrects, and produces a velocity curve with a characteristic asymmetric bell shape. A naive bot moves the cursor in a straight line at constant speed, or teleports it, or follows a spline so smooth it has no correction at all. This post is about how detectors turn that difference into a number.

The sections below walk through the raw event stream and how it gets cut into actions, the feature families that come out of each action (velocity, acceleration, jerk, curvature, angular change, pauses), the classifiers that consume those features and the error rates they reach, the public datasets the whole field is calibrated against (Balabit above all), how the same machinery gets reused for web bot detection, and finally the arms race against synthetic trajectories built to defeat it. The bot-generation side, why a real path is genuinely hard to synthesize, lives in a companion post on Fitts’s law; here the focus is the detector.

From event stream to mouse actions

The browser, or the native OS hook, hands the detector a stream of pointer events. Each carries a position and a timestamp, plus button state when a click or release happens. Sampling is irregular: the OS coalesces moves, the page may throttle mousemove to animation frames, and a fast flick produces fewer samples than a slow drag. So the first job is not feature extraction. It is segmentation.

The convention that almost every paper in the field follows splits the raw stream into three action types. A point-and-click (PC) is a movement that ends in a button press and release on a target. A drag-and-drop (DD) starts with the button already down, moves, and ends on release. A plain mouse-movement (MM) is location change with no button activity at all. The Balabit-derived work cuts the stream at every button-release event, which naturally terminates PC and DD actions, then isolates MM actions by splitting on a time gap between consecutive events, a 10-second threshold in the Antal intrusion-detection work. Actions with fewer than four recorded points get discarded, because the spline interpolation used downstream needs at least that many to fit.
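The cutting rule above can be sketched in a few lines. This is a minimal illustration, not the code from any paper: the `Event` class, the lowercase state labels and the `segment` function are hypothetical names chosen here, and the handling of a long gap in the middle of a drag is deliberately simplified.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    t: float      # timestamp in seconds
    x: int
    y: int
    state: str    # hypothetical labels: "move", "pressed", "drag", "released"


def segment(events: List[Event], gap_s: float = 10.0,
            min_points: int = 4) -> List[Tuple[str, List[Event]]]:
    """Cut a raw stream into (kind, events) actions: PC, DD or MM."""
    actions: List[Tuple[str, List[Event]]] = []
    current: List[Event] = []

    def close(kind: str) -> None:
        # Keep the action only if it has enough points for later interpolation.
        if len(current) >= min_points:
            actions.append((kind, list(current)))
        current.clear()

    for ev in events:
        # A long silence ends whatever free movement was in progress.
        if current and ev.t - current[-1].t > gap_s:
            close("MM")
        current.append(ev)
        if ev.state == "released":
            dragged = any(e.state == "drag" for e in current)
            close("DD" if dragged else "PC")

    close("MM")  # trailing movement with no click
    return actions
```

Feeding it five moves, a press and a release, then a 30-second silence followed by four more moves, yields one PC action and one MM action. A bare press-and-release with no movement is dropped by the four-point rule.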

This segmentation matters more than it first appears, because the three action types carry different amounts of information. Drag-and-drop is rare, roughly a tenth of the actions in the Balabit set, but it is the most discriminative per action: the same study found classification on DD actions ran about 3 percent better than on PC or MM alone, because holding the button down while moving recruits a slightly different and more individual motor pattern. A detector that pools all actions into one undifferentiated bucket throws that structure away.

Figure: the raw event stream (x, y, t, button), sampled at irregular intervals, is cut on button-release and MM is split on a 10 s gap, giving PC (move then click), DD (drag with button down) and MM (free movement). DD is ~10% of actions but the most discriminative per action. *The raw pointer stream is cut into point-and-click, drag-and-drop, and mouse-move actions before any feature is computed; the orange points mark button-release events that terminate a segment.*

Once segmented, each action is a short trajectory: an ordered list of points with timestamps. That list is the unit of analysis. Everything below is computed per action, then either classified action-by-action or aggregated over a window of several actions to push the error rate down.

The feature families

The position-and-time list is resampled and differentiated to build a set of derived time series, and then statistics of those series become the feature vector. The survey literature groups the result into six families plus a pause family, and they are worth taking one at a time because each captures a different aspect of how a hand differs from a script.

Velocity, acceleration, jerk

Velocity is the first derivative of position with respect to time, computed separately for the horizontal and vertical components and combined into a speed magnitude. Acceleration is the derivative of velocity, and jerk is the derivative of acceleration. These three form the kinematic backbone of the feature set. A human reach is not constant-velocity: it ramps up, peaks, and ramps down, and the shape of that ramp is individual. The Antal work computes vx, vy, the combined v, acceleration a and jerk j for each action, then ranks the resulting features by information gain. Jerk-related features carried more signal than velocity-related ones. The single most informative feature was the duration of the acceleration phase at the very start of an action, ahead of the count of critical points where the velocity profile turns over. Minimum jerk ranked third.

That ordering is the interesting part. Velocity is the obvious feature, the one a naive engineer reaches for first, and it is not where the discrimination lives. The discrimination lives in the higher derivatives and in the timing of the initial ballistic phase, exactly the part of a movement that a constant-speed or smoothly-splined synthetic path gets wrong. A human throws the cursor toward the target ballistically and then closes the last gap under visual feedback. The handoff between those two phases shows up as a jerk signature that is hard to reproduce without modelling the underlying motor control.
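The derivative chain is easy to make concrete. The sketch below uses plain finite differences over irregular timestamps; the function names are illustrative, and it skips the resampling and spline smoothing a real pipeline would apply first, so treat it as the skeleton of the computation rather than a faithful reproduction of any published feature extractor.

```python
import math
from typing import Dict, List, Sequence, Tuple


def derivative(values: Sequence[float],
               times: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Finite-difference derivative; returns the rates and their midpoint times."""
    rates, mids = [], []
    for i in range(1, len(values)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            continue  # duplicate or out-of-order timestamp: skip the step
        rates.append((values[i] - values[i - 1]) / dt)
        mids.append((times[i] + times[i - 1]) / 2)
    return rates, mids


def kinematics(points: Sequence[Tuple[float, float, float]]) -> Dict[str, List[float]]:
    """points is a list of (x, y, t); returns the derived time series."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    ts = [p[2] for p in points]
    vx, tv = derivative(xs, ts)
    vy, _ = derivative(ys, ts)
    v = [math.hypot(a, b) for a, b in zip(vx, vy)]  # speed magnitude
    a, ta = derivative(v, tv)                       # acceleration
    j, _ = derivative(a, ta)                        # jerk
    return {"vx": vx, "vy": vy, "v": v, "a": a, "j": j}
```

A path with x = t² sampled once per second gives speeds of 1, 3, 5, 7, a constant acceleration of 2, and zero jerk. Real hands never produce that last series, and that gap is what the ranking reflects.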

Figure: velocity profile of a single reach, speed against time. Human: asymmetric bell with a ballistic launch and a feedback correction. Naive bot: constant velocity. *The human profile rises fast to an early peak, then tails off through a feedback-driven correction phase; the constant-velocity line is the cheapest bot to catch.*

Curvature and angular change

The kinematic features describe how fast the cursor moved. The geometric features describe the shape of the path it traced, independent of speed. Curvature is the headline one. Intuitively it is how tightly the path is bending at each point, defined as one over the radius of the circle that best fits the path locally; a straight line has zero curvature, a tight loop has high curvature. The discrete form used in the survey computes it from first and second differences of the coordinates,

$$\kappa = \frac{x'\,y'' - y'\,x''}{\left(x'^2 + y'^2\right)^{3/2}}$$

where the primes are successive differences along the trajectory. Some detectors prefer a simpler proxy: curvature as the ratio of the change in movement angle to the distance travelled over that step, which is cheaper and noisier but captures the same idea. Either way the curvature time series, summarised by its mean, spread and extremes, tells you whether the path is a clean geometric arc or the slightly wandering, jittery line a hand produces.
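As a sanity check on the discrete formula, here is a small sketch that takes the primes literally as successive differences along the sampled path. The helper names are mine, and returning zero for a stationary step is a convenience choice rather than something the survey prescribes.

```python
import math
from typing import List, Sequence


def diff(values: Sequence[float]) -> List[float]:
    """Successive differences: values[i + 1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]


def curvature(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Signed discrete curvature from first and second coordinate differences."""
    dx, dy = diff(xs), diff(ys)
    ddx, ddy = diff(dx), diff(dy)
    kappa = []
    for i in range(len(ddx)):
        speed_sq = dx[i] ** 2 + dy[i] ** 2
        if speed_sq == 0:
            kappa.append(0.0)  # stationary step: curvature is undefined, use 0
            continue
        kappa.append((dx[i] * ddy[i] - dy[i] * ddx[i]) / speed_sq ** 1.5)
    return kappa


# A circle of radius 50 sampled every degree should come out near 1 / 50.
angles = [math.radians(i) for i in range(90)]
circle = curvature([50 * math.cos(a) for a in angles],
                   [50 * math.sin(a) for a in angles])
```

A straight line returns zeros, and the sampled circle returns values within a fraction of a percent of 0.02. The sign encodes turning direction, so detectors that only care about how sharply the path bends usually take the absolute value before summarising.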

Closely related is the angular family: movement direction at each step, the change in that direction between steps (angular velocity is that change per unit time), and histograms of direction over the whole action. The Ahmed and Traore work that opened the field in 2007 built its entire signature out of histograms of this kind: a movement-direction histogram, a travelled-distance histogram, and a movement-elapsed-time histogram, alongside average speed against distance. Direction on its own turns out to be weak. It ranked last by information gain in the later Antal analysis, because where on the screen you happen to be moving says little about who you are. It is the distribution of direction changes, the small constant corrections, that carries the signal.
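A rough sketch of the direction-change series and its histogram follows. The eight-bin layout and the function names are assumptions made for illustration; published feature sets use their own bin counts.

```python
import math
from typing import List, Sequence, Tuple


def turning_angles(points: Sequence[Tuple[float, float]]) -> List[float]:
    """Change in heading between consecutive steps, wrapped to [-pi, pi)."""
    headings = [math.atan2(y1 - y0, x1 - x0)
                for (x0, y0), (x1, y1) in zip(points, points[1:])
                if (x0, y0) != (x1, y1)]  # ignore steps with no movement
    return [(b - a + math.pi) % (2 * math.pi) - math.pi
            for a, b in zip(headings, headings[1:])]


def histogram(values: Sequence[float], bins: int = 8,
              lo: float = -math.pi, hi: float = math.pi) -> List[int]:
    """Fixed-width histogram; values at the upper edge fall in the last bin."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts
```

A perfectly straight path puts every turning angle in the bin that starts at zero. A hand spreads its corrections across the neighbouring bins, and that spread is the distribution the classifier sees.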

Path geometry and critical points

Beyond curvature, a handful of whole-trajectory descriptors capture the gestalt of the path. Straightness is the ratio of the straight-line distance between start and end to the actual path length; a perfectly straight move scores 1, a wandering one scores lower. The largest single deviation from the start-to-end line measures how far the hand strayed. The count of critical points, the local extrema in the velocity profile where the cursor sped up or slowed down, captures how many sub-movements made up the action. Real reaches over any distance are rarely a single smooth stroke; they decompose into a ballistic primary movement plus one or more corrective sub-movements, and each correction shows up as a critical point. A bot that emits one clean acceleration-deceleration arc has exactly one critical point where a human has three or four.
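These three descriptors are short enough to write out. The sketch below is one reasonable reading of the definitions in the paragraph above; the function names and the choices for degenerate inputs (no movement at all, a closed loop) are mine.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def straightness(points: Sequence[Point]) -> float:
    """Start-to-end distance divided by the travelled path length."""
    path = sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    if path == 0:
        return 1.0  # no movement at all: nothing to deviate from
    (xs, ys), (xe, ye) = points[0], points[-1]
    return math.hypot(xe - xs, ye - ys) / path


def max_deviation(points: Sequence[Point]) -> float:
    """Largest perpendicular distance from the start-to-end line."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    chord = math.hypot(dx, dy)
    if chord == 0:  # closed loop: fall back to distance from the start point
        return max(math.hypot(px - x0, py - y0) for px, py in points)
    return max(abs(dx * (py - y0) - dy * (px - x0)) / chord for px, py in points)


def count_critical_points(series: Sequence[float]) -> int:
    """Number of local extrema: sign changes in the first difference."""
    signs = [1 if b > a else -1 for a, b in zip(series, series[1:]) if b != a]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)
```

An L-shaped path from (0, 0) through (3, 0) to (3, 4) has a straightness of 5/7 and a maximum deviation of 2.4. A speed series such as 0, 1, 3, 2, 1, 2, 0 has three critical points, which is the multi-stroke pattern the paragraph attributes to a human reach.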

Pauses and dwell

The last family is temporal rather than spatial. Humans pause. They stop the cursor mid-screen to read, hover before clicking, hesitate over a choice. These pauses, their count, duration, and where they fall within an action, are a feature in their own right, sometimes encoded as a stop-duration value rather than a binary moving/stopped flag. The dwell before a click, the gap between arriving at a target and pressing the button, is individual and hard to script convincingly. A bot that clicks the instant the cursor reaches the target coordinate produces a zero dwell that stands out against a human distribution centred well above zero.
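Because a browser emits no move events while the cursor is still, a pause shows up as a gap between consecutive timestamps. The sketch below leans on that assumption. The 0.25-second threshold and both function names are illustrative choices, not values taken from the literature.

```python
from typing import List, Sequence


def pause_durations(times: Sequence[float], min_pause_s: float = 0.25) -> List[float]:
    """Gaps between consecutive movement samples long enough to count as a pause."""
    return [b - a for a, b in zip(times, times[1:]) if b - a >= min_pause_s]


def click_dwell(move_times: Sequence[float], press_time: float) -> float:
    """Time between the last movement sample and the button press."""
    return press_time - move_times[-1]


# Movement samples with a 0.75 s hesitation in the middle, then a click at 1.5 s.
times = [0.0, 0.125, 0.25, 1.0, 1.125]
pauses = pause_durations(times)
dwell = click_dwell(times, 1.5)
```

Here `pauses` holds a single 0.75-second entry and `dwell` is 0.375 seconds. A script that presses the button on the same timestamp as its final move produces a dwell of exactly zero.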

Put together, the Antal feature set for a single action is 39 numbers: four summary statistics (mean, standard deviation, min, max) over each of seven derived time series give 28, and the geometric and timing descriptors supply the remaining 11. Other work runs far higher. The survey notes feature sets defined across six categories totalling 217 features, and the curvature distribution alone can be expanded into 180 features, one per degree of angle change. There is no canonical count. The point is that one short cursor move expands into a vector rich enough for a classifier to separate users, and humans from bots, with the right model on top.
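The 4 × 7 = 28 arithmetic is just a loop over series and statistics. In the sketch below the series names `vx`, `vy`, `v`, `a` and `j` follow the text, while `omega` and `curvature` are placeholders assumed here for the remaining two. The population standard deviation is likewise a choice made for the example.

```python
import statistics
from typing import Dict, Sequence


def summarise(series: Sequence[float]) -> Dict[str, float]:
    """The four summary statistics used per derived time series."""
    return {
        "mean": statistics.mean(series),
        "std": statistics.pstdev(series),  # population std; a sample std also works
        "min": min(series),
        "max": max(series),
    }


def feature_vector(series_by_name: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Flatten {series name: values} into {name_stat: value}."""
    features = {}
    for name, series in series_by_name.items():
        for stat, value in summarise(series).items():
            features[f"{name}_{stat}"] = value
    return features


# Toy values only; "omega" and "curvature" are assumed names for the last two series.
names = ["vx", "vy", "v", "a", "j", "omega", "curvature"]
features = feature_vector({name: [1.0, 2.0, 3.0] for name in names})
```

`features` ends up with 28 entries keyed like `v_mean` or `j_max`. The geometric and timing descriptors are appended to that dictionary to reach the full vector.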

The classifiers and what they reach

The model on top is, more often than not, a tree ensemble. Random Forest is the workhorse of the published mouse-dynamics literature, with support vector machines, k-nearest-neighbours and feed-forward neural networks appearing alongside it, and gradient-boosted trees (CatBoost and similar) in the more recent work. The reason trees dominate is mundane and practical: the feature vector is a few dozen heterogeneous engineered numbers with non-linear, interacting effects, which is exactly the regime where a Random Forest needs almost no tuning to do well and where it also hands back a feature-importance ranking for free. That ranking is how the field learned that jerk beats velocity and acceleration-onset beats everything.

The headline numbers are usually reported as equal error rate (EER), the operating point where the false-accept rate equals the false-reject rate, or as area under the ROC curve (AUC). The foundational 2007 work reached a 2.46 percent EER, but it needed 2,000 mouse actions to get there, far too many for a transparent web check; a session that long is a research artefact, not a login. Later work on the Balabit set with a Random Forest tells a more useful story about the speed-accuracy tradeoff. Classifying a single action is weak, around 80 percent accuracy and an AUC near 0.87 on the cleaner training-only split. Aggregate a window of actions and the curve climbs fast: roughly a dozen actions push the AUC to 1.0 on that split, and a 20-action window reaches an EER of 0.04 percent. On the harder test split, where the impostor data is genuinely held out, the same 20-action aggregate lands at an AUC of 0.89 and a session-level AUC of 0.92, with the per-action accuracy down at 72 percent. The gap between those two splits is the honest measure of how much the easy numbers owe to favourable evaluation.
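For readers who have not computed an EER by hand, a minimal sketch follows. It assumes higher scores mean "more likely genuine", sweeps every observed score as a threshold, and reports the point where the two error rates are closest. Production evaluations interpolate the ROC curve instead, so this is an approximation.

```python
from typing import Sequence


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Approximate EER: the error rate where false accepts equal false rejects."""
    best_gap, best_rate = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # Genuine sessions scored below the threshold are wrongly rejected.
        frr = sum(g < threshold for g in genuine) / len(genuine)
        # Impostor sessions scored at or above it are wrongly accepted.
        far = sum(i >= threshold for i in impostor) / len(impostor)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate
```

With genuine scores of 0.9, 0.6, 0.4, 0.8 and impostor scores of 0.1, 0.5, 0.3, 0.7, a threshold of 0.6 rejects one genuine session in four and accepts one impostor in four, so the EER is 0.25. Perfectly separated scores give 0.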

Figure: AUC vs number of aggregated actions (Balabit, Random Forest). Bars: 1 action, 0.87; test, 20 actions, 0.89; test, session, 0.92; train, 12+ actions, 1.00. Grey: held-out test split. Orange: easier training-only split. More actions, higher AUC. *Accuracy is a function of how many actions the detector gets to see; the gap between the held-out test bars and the training-only bar is the cost of an optimistic evaluation.*

The operational lesson is that mouse dynamics is a confidence-accumulating signal, not a one-shot verdict. A single move is a weak classifier, right roughly 72 to 80 percent of the time depending on the split. A window of a dozen to twenty actions, the kind of interaction a real session produces in well under two minutes, is where the error rate collapses. This is why the signal lives server-side in a scoring pipeline that updates as telemetry arrives, rather than gating a single click. The detector is patient.
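One simple way to accumulate confidence is to average the per-action scores over a trailing window. The sketch below assumes that scheme. The function name is hypothetical, and real pipelines may weight actions by type or use a different combiner.

```python
from typing import List, Sequence


def windowed_scores(action_scores: Sequence[float], window: int = 20) -> List[float]:
    """Running mean of the last `window` per-action scores, one value per action."""
    out = []
    for i in range(len(action_scores)):
        chunk = action_scores[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Early in a session the window is short and the aggregate is as noisy as the single-action score. Once twenty actions have arrived, one odd trajectory moves the average by at most a twentieth of the score range.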

The datasets the field runs on

Every number above is anchored to a dataset, and the supply of good ones is thin. Mouse dynamics has nothing like the scale of face or fingerprint corpora, because collecting it means logging real users at their real machines for long enough to capture natural behaviour, which is awkward to do at scale and worse to release without privacy headaches.

The reference set is Balabit, released in 2016 by the company of the same name (later acquired by One Identity) as a public data-science challenge and still the single most-cited mouse-dynamics benchmark. It covers ten user accounts. Each session is a CSV of records over a Remote Desktop link, with six fields per record: a network-side record timestamp, a client-side timestamp, the button, a state field, and the x and y coordinates. The training files are remote sessions known to belong to the legitimate account owner. The test files are sessions of unknown provenance for the same ten accounts, and the trick that makes it a detection benchmark rather than a pure authentication one is in the construction: to simulate account misuse, the test data for each user is salted with mouse data drawn from other users. A public_labels.csv marks which test sessions are legitimate and which are the injected impostor ones, and submissions were scored by AUC across all predictions. The dataset’s weakness is its size, ten users and some very short test sessions, which makes the held-out numbers noisy and is exactly why the train-versus-test gap above is so wide.

Figure: one Balabit session record with fields record_t (network side), client_t (RDP client), button, state, x, y. 10 users. Training = legitimate owner. Test = unknown, salted with other users' data. public_labels.csv marks each test session legitimate or illegal; scored by global AUC. ~816 public benchmark test sessions in the standard split. *The six-field Balabit record; everything in the feature pipeline is reconstructed from the x and y columns (orange) plus the timestamps.*
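Reading one of these session files takes only the standard library. In the sketch below the header names and the `NoButton`/`Move`/`Pressed`/`Released` labels are what I recall from the public release, so treat them as assumptions and check them against the actual files. The four sample rows are invented for the example.

```python
import csv
import io
from typing import Dict, List, TextIO

# Invented rows in the six-field layout; not real Balabit data.
SAMPLE = """record timestamp,client timestamp,button,state,x,y
0.000,0.000,NoButton,Move,399,962
0.110,0.109,NoButton,Move,402,958
0.250,0.218,Left,Pressed,402,958
0.360,0.327,Left,Released,402,958
"""


def read_session(handle: TextIO) -> List[Dict[str, object]]:
    """Parse one session CSV into dicts keyed t, button, state, x, y."""
    rows = []
    for rec in csv.DictReader(handle):
        rows.append({
            "t": float(rec["client timestamp"]),  # client-side clock
            "button": rec["button"],
            "state": rec["state"],
            "x": int(rec["x"]),
            "y": int(rec["y"]),
        })
    return rows


session = read_session(io.StringIO(SAMPLE))
```

The sketch keeps the client-side timestamp and drops the network-side one. That is a choice made here, on the reasoning that the client clock sits closer to the hand.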

Balabit is not alone, but it has no large rival. The DFL set from Antal and Denes-Fazakas, released in 2019, covers 21 users under unrestricted, completely free mouse usage and was built partly to give the Balabit work a second testbed. Going back further, the original ISOT data behind the 2007 Ahmed and Traore study covered 48 users, also under free usage. The Shen et al. sets from 2012 and 2014 cover 28 and 58 users respectively but under more controlled, fixed-sequence collection, which makes them cleaner but less representative of real web behaviour. BB-MAS (2019) is larger at 117 users but is a fixed-static-sequence multimodal set. For the specific problem of separating humans from bots rather than one user from another, the most directly relevant public resource is the BeCAPTCHA-Mouse benchmark, which ships human trajectories alongside synthetically generated ones precisely so a detector can be trained to tell them apart.

The thinness of this list is itself a fact about the field. A handful of datasets, most under 60 users, several of them more than a decade old, underpin almost every published EER you will read. Production detectors at the large anti-bot vendors train on traffic volumes that dwarf all of these combined, which is why their real-world performance is unknowable from the literature. The academic numbers are a floor, not a description of what Akamai or DataDome actually achieve.

The same machinery, pointed at the web

User authentication and bot detection are the same problem wearing two hats. Authentication asks whether the mouse belongs to the enrolled user; bot detection asks whether it belongs to any human at all. The feature pipeline is identical. Only the labels and the deployment differ.

On the web, the telemetry comes from JavaScript. An anti-bot script attaches listeners to mousemove, mousedown, mouseup, click, wheel and the pointer events, buffers the resulting (x, y, t) stream along with target and button metadata, and ships batches back to a scoring endpoint. The commercial stacks fold mouse data into a much wider signal set, IP and ASN reputation, TLS and HTTP/2 fingerprints, the JavaScript-runtime and device fingerprint, then score the lot. Vendor and reverse-engineering write-ups consistently list mouse movement, scroll velocity, click coordinates and typing cadence among the behavioural inputs the major stacks collect, though the exact field layout each one ships is proprietary and not publicly documented; what is public is the category of signal, not the wire format. If you want the field-level detail of where these signals sit in a real payload, the DataDome JS-tag breakdown and the Akamai sensor-data analysis cover two of them as closely as the obfuscation allows, and the broader behavioral-biometrics in fraud detection post sits one level up from this one.

The most-studied public example is Google’s reCAPTCHA. The v3 design dropped the visible challenge and runs a transparent behavioural score from 0.0 to 1.0, and mouse dynamics is one of the inputs feeding that score, which is why a headless client that never moves a cursor scores badly without ever failing an explicit test. The 2024 ETH Zurich work on reCAPTCHA v2 is a useful, sobering data point here: the researchers reached a 100 percent solve rate on the image grids with YOLO-based segmentation, up from the 68 to 71 percent of earlier attacks, and found, importantly, that the system leaned heavily on cookie and browser-history signals when deciding how hard to push back, with no significant difference in the number of challenges humans and bots had to clear. The behavioural layer, mouse included, is real, but it is one weighted input among many, and a session with good cookies and history can sail through on a weak trajectory. That is the right way to read every behavioural signal: a contributor to a score, not a gate. The reCAPTCHA v3 scoring post goes deeper on how that 0.0-to-1.0 number is assembled.

The arms race against synthetic trajectories

If a detector can score a trajectory, an attacker can try to generate one that scores well. This is where the field gets interesting, and where the defensive value of understanding the features becomes clear: the harder the synthetic trajectory has to work to pass, the more expensive each bot request becomes.

The first generation of fakes was trivially separable. Constant-velocity straight lines, instantaneous teleports, perfectly Bezier-smooth curves with no correction phase, all fail on the very features that rank highest: no acceleration-onset structure, one critical point instead of several, zero jerk variability, zero dwell. The cheapest possible bot is the easiest possible catch.

The second generation models the motor control. The strongest published approach builds on the kinematic theory of rapid human movements and its Sigma-Lognormal model, which represents a complex trajectory as a sum of primitive strokes, each with a lognormal velocity profile. The BeCAPTCHA-Mouse work (Acien, Morales, Fierrez and Vera-Rodríguez, arXiv 2020, published in Pattern Recognition in 2022) used this both ways. As a feature extractor, it derived 37 neuromotor features per trajectory from the lognormal parameters, fine-tuned over trajectory halves. And as a generator, it produced synthetic trajectories two ways: a function-based method combining linear, quadratic and exponential spatial shapes with constant, logarithmic and Gaussian velocity profiles for 15 trajectory types, and a GAN whose generator synthesised paths from 100-dimensional Gaussian noise while an LSTM-based discriminator tried to tell real from fake. Trained on a benchmark of 15,000 trajectories from 58 users, evenly split between human, function-generated and GAN-generated samples, the detector hit 93 percent accuracy from a single trajectory using the neuromotor features, and fusing them with global mouse-dynamics features pushed accuracy past 99 percent. The headline result is that adding neuromotor modelling improved bot detection by more than 36 percent relative over the global-feature baseline, which is a direct measurement of how much the velocity-profile shape matters once a synthesizer is good enough to get the obvious features right.
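The lognormal primitive behind that model is compact enough to evaluate directly. In the kinematic theory a single stroke covering distance D, starting at time t0, has speed D / (σ·√(2π)·(t − t0)) · exp(−(ln(t − t0) − μ)² / (2σ²)). The sketch below samples one such stroke; the parameter values are arbitrary illustrations, not fitted to any dataset.

```python
import math


def lognormal_speed(t: float, D: float, t0: float, mu: float, sigma: float) -> float:
    """Speed of a single lognormal stroke at time t (zero before onset t0)."""
    if t <= t0:
        return 0.0
    z = (math.log(t - t0) - mu) / sigma
    return D / (sigma * math.sqrt(2 * math.pi) * (t - t0)) * math.exp(-0.5 * z * z)


# Illustrative parameters: a 300-pixel stroke starting at t0 = 0 seconds.
D, t0, mu, sigma = 300.0, 0.0, -1.5, 0.4
dt = 0.001
times = [i * dt for i in range(1, 2001)]  # 1 ms steps up to 2 s
speeds = [lognormal_speed(t, D, t0, mu, sigma) for t in times]

peak_time = times[speeds.index(max(speeds))]  # where the stroke is fastest
distance = sum(s * dt for s in speeds)        # area under the speed curve
```

The peak lands near exp(μ − σ²), about 0.19 seconds here, which is earlier than the stroke's median time of exp(μ), about 0.22 seconds. The area under the curve recovers the 300 pixels. That early peak with a long tail is the asymmetric bell described earlier, and a full trajectory in this model is a sum of several such strokes.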

The third generation is current research. Entropy-controlled diffusion models like the 2024 DMTG work generate mouse curves with a tunable complexity knob, aiming to land synthetic trajectories inside the human distribution rather than at a detectable edge of it, and reinforcement-learning agents have been shown to synthesize trajectories good enough to move reCAPTCHA v3 scores. The detectors respond by adding features that capture ever finer motor structure and by leaning harder on the parts of the signal a generator cannot see, the relationship between the trajectory and the actual page layout, the timing against network and rendering events, the consistency of the motor signature across a whole session rather than one move. Why the bot side keeps losing the cheap rounds while staying competitive on the expensive ones is the subject of the stealth-plugin and Fitts’s-law posts; the short version is that faking one trajectory is solved and faking a coherent session of them, against features chosen specifically because they are hard to fake, is not.

What the signal is really worth

Mouse dynamics is a strong feature and a weak gate. As a feature it is genuinely discriminative: a hand produces a velocity profile with an early asymmetric peak, a jerk signature at the ballistic-to-feedback handoff, several corrective sub-movements per reach, and pauses where a script would have none, and a tree ensemble over a few dozen engineered features picks all of that up well enough to drive the error rate toward zero given a window of a dozen to twenty actions. The features that matter are not the obvious ones. Velocity is weak, direction is weaker, and the discrimination concentrates in the higher derivatives and in the timing of the first hundred milliseconds of a movement, the part of a reach that is hardest to fake because it reflects motor control rather than geometry.

As a gate, though, it is patient and probabilistic, never categorical. No serious system blocks on one trajectory, because one trajectory barely beats a coin and because a confident synthesizer can now produce a passing path on demand. The signal earns its place by accumulating confidence over a session and by combining with everything else in the scoring pipeline, which is the same reason the ETH Zurich team could walk through reCAPTCHA v2 on the strength of good cookies despite the behavioural layer watching. The honest summary is that mouse dynamics raises the cost of a convincing bot without ever setting it to infinity, and the published gap between a constant-velocity line caught instantly and a neuromotor-modelled trajectory that needs a 99-percent-accurate detector to catch is a fair map of how much that cost has risen. The thing worth watching is not whether the features work. It is that the entire public literature rests on ten Balabit users and a handful of small successors, while the systems that actually decide whether your cursor looks human train on traffic none of us will ever see.

