
Keystroke dynamics: dwell time, flight time, and the typing-rhythm fingerprint

Copyright: MIT
[Figure: timeline of two keystrokes showing dwell time per key and the flight time gap between them.]

Type the same word twice and the characters land in the same order both times. The timing between them does not. The gap between your t and your h, how long you hold the spacebar, the half-beat pause your right hand takes before reaching for a number: those intervals are stable enough, across enough of your typing, that a model trained on a few hundred of your keystrokes can pick you out of a hundred thousand strangers. That claim is old. It predates the keyboard.

The question this post answers is a narrow one. Not “can behavior identify a person” in general, but specifically: what are the timing features a keystroke system measures, how are they captured down to the millisecond, what error rates do the published benchmarks actually report, and where does the whole approach quietly fall apart. Keystroke dynamics is the oldest behavioral biometric we have, and it is the one most often oversold.

What follows starts with the two primitives every system is built on, dwell and flight, then the digraph and n-graph latencies layered on top. From there: the telegraph-operator origin and the 1980 RAND study that turned it into a computing problem, the CMU benchmark that fixed the field’s evaluation methodology, the deep-learning systems that scaled it to free text and 100,000 users, how the browser exposes (and now deliberately blunts) the timestamps, and finally the spoofing work that shows how thin the liveness guarantee really is.

Two primitives: dwell and flight

Every keystroke produces two events the operating system can timestamp: a key-down (press) and a key-up (release). Everything in keystroke dynamics is built from differences between those two event types across a sequence of keys. There are only two raw measurements, and they have had stable names since the 1990s.

Dwell time, also called hold time, is how long a single key stays pressed. For key n it is the release timestamp minus the press timestamp of that same key. Flight time is the interval between two successive keystrokes. The survey literature notes flight time has four legitimate definitions depending on which event of each key you anchor to: press-to-press, release-to-press, press-to-release, and release-to-release. The release-to-press variant can go negative, because fast typists press the next key before lifting the current one. That overlap is itself informative, and a system that clamps it to zero throws away signal.

[Figure: press and release markers for keys "t" and "h", showing each key's dwell and the release-to-press flight between them.]

*Dwell is the press-to-release span of one key. Flight is one of four possible gaps between adjacent keys; the orange release-to-press variant goes negative when the next press precedes the current release.*
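The definitions above fit in a few lines. This is a minimal sketch, assuming each keystroke has already been reduced to a key name plus press and release timestamps in milliseconds; the `Keystroke` type, the function names, and the timings are illustrative, not taken from any library or dataset.

```python
from typing import Dict, NamedTuple


class Keystroke(NamedTuple):
    key: str
    press: float    # key-down timestamp, ms
    release: float  # key-up timestamp, ms


def dwell(k: Keystroke) -> float:
    """Hold time: release minus press of the same key."""
    return k.release - k.press


def flights(a: Keystroke, b: Keystroke) -> Dict[str, float]:
    """The four flight-time variants between consecutive keys a and b."""
    return {
        "press_to_press": b.press - a.press,
        "release_to_press": b.press - a.release,  # negative when the keys overlap
        "press_to_release": b.release - a.press,
        "release_to_release": b.release - a.release,
    }


# Hypothetical timings: "h" goes down 10 ms before "t" comes up.
t = Keystroke("t", press=0.0, release=95.0)
h = Keystroke("h", press=85.0, release=170.0)
print(dwell(t), dwell(h))
print(flights(t, h))
```

The release-to-press value here comes out at -10 ms, which is the overlap a clamp-to-zero would erase.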

That is the entire raw feature set. Two numbers per key. Everything else, every digraph table and every neural network, is a transformation of dwell and flight sequences into a representation a classifier can compare. The CMU benchmark dataset records exactly three derived columns per key transition (hold time H, keydown-keydown DD, and keyup-keydown UD), which is dwell plus two of the four flight variants. For the study's ten-character password plus the Enter key that ends it, that is 11 keys: 11 hold times, 10 DD values, and 10 UD values, for 31 timing features, and that 31-dimensional vector is what the detectors in that study compare.
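The arithmetic behind the 31 is easy to check. The sketch below assumes the dataset's layout as I understand it: 11 keys (the ten password characters plus Enter), with H, DD, and UD interleaved per key and a final hold time for the last key. The function name and the evenly spaced timings are invented for illustration.

```python
from typing import List, Sequence


def cmu_style_vector(press: Sequence[float], release: Sequence[float]) -> List[float]:
    """Interleave H, DD, UD for each key, ending with the last key's hold time."""
    if len(press) != len(release):
        raise ValueError("need one press and one release per key")
    vec: List[float] = []
    for i in range(len(press)):
        vec.append(release[i] - press[i])          # H: hold time of key i
        if i + 1 < len(press):
            vec.append(press[i + 1] - press[i])    # DD: keydown to next keydown
            vec.append(press[i + 1] - release[i])  # UD: keyup to next keydown
    return vec


# Synthetic timings in seconds for 11 keys (10 password characters + Enter).
press = [i * 0.20 for i in range(11)]
release = [p + 0.09 for p in press]
vec = cmu_style_vector(press, release)
print(len(vec))  # 11 H + 10 DD + 10 UD = 31
```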

Digraphs, trigraphs, and the n-graph

A digraph is the pair of keystrokes for two consecutive characters, and the digraph latency is the timing signature of that pair. Most often it means the time from pressing the first key to pressing the second, though again the anchoring varies. Digraphs are the workhorse of the field. One survey puts them at roughly 80 percent of the published literature, for a simple reason: the same letter pairs recur constantly in natural text, so you accumulate many samples of th, he, in, er from a short paragraph, and the per-digraph distributions tighten quickly.

Trigraphs extend this to three keys, and the general n-graph to any run of n consecutive presses. The elapsed-time formula generalizes cleanly: the n-graph time is the press timestamp of key k+n−1 minus the press of key k, so a digraph (n = 2) is the press-to-press gap between adjacent keys. Longer n-graphs carry more context (the rhythm of a whole common word like “the” or “and”) but recur less often, so you trade sample count for specificity. In practice systems blend several orders, weighting the ones that appear enough times to estimate a stable mean and variance.
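A short sketch of that bookkeeping, taking an n-graph to span n consecutive keys (so a digraph is the press-to-press gap between neighbours) and keeping only the n-graphs seen often enough to estimate a spread. The function names, the `min_count` cutoff, and the timestamps are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def ngraph_latencies(keys: Sequence[str], press: Sequence[float], n: int = 2) -> Dict[str, List[float]]:
    """Group press-to-press latencies by the n consecutive keys they span."""
    out: Dict[str, List[float]] = defaultdict(list)
    for k in range(len(keys) - n + 1):
        gram = "".join(keys[k:k + n])
        out[gram].append(press[k + n - 1] - press[k])
    return dict(out)


def stable_profile(latencies: Dict[str, List[float]], min_count: int = 2) -> Dict[str, Tuple[float, float]]:
    """Mean and population std dev for n-graphs with at least min_count samples."""
    return {g: (mean(v), pstdev(v)) for g, v in latencies.items() if len(v) >= min_count}


keys = list("the then")
press = [0, 110, 200, 330, 450, 565, 650, 790]  # hypothetical ms timestamps
digraphs = ngraph_latencies(keys, press, n=2)
trigraphs = ngraph_latencies(keys, press, n=3)
print(digraphs["th"])            # [110, 115]
print(stable_profile(digraphs))  # only "th" and "he" recur in this string
```

Even in eight keystrokes the trade-off shows up: two digraphs recur, but only one trigraph does.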

The reason digraphs work better than single keys is motor. Typing a familiar pair is one practiced gesture, not two independent keystrokes, so its timing variance is low and its mean is personal. The 1980 RAND experiment found that out of the hundreds of distinct digraphs in an ordinary paragraph, just five, considered together, separated their seven typists. You do not need the whole keyboard. You need the handful of transitions where a given typist’s motor habit diverges sharply from the population.

Which transitions those are is itself revealing. Same-hand digraphs and same-finger digraphs carry more discriminative weight than alternating-hand ones, because they force a sequential motion that each person’s hand resolves slightly differently, while an alternating-hand pair like th overlaps the two presses and compresses the timing toward a population norm. Transitions that cross to a number row, reach for punctuation, or involve a shift-modified key tend to be the most personal of all, since they pull a hand off the home row and the recovery path varies by hand size and habit. The survey literature reports a sweet spot in input length around 13 to 15 characters for fixed text, long enough to capture several of these telling transitions but short enough that a user will retype it consistently. Below that you have too few stable digraphs to estimate a profile; well above it the marginal transitions add little while fatigue and correction start adding noise.

From telegraph fists to a computing problem

The behavioral observation is from the telegraph era. By the late nineteenth century, operators sending Morse over a wire had recognizably individual rhythm, the cadence and spacing of their dots and dashes, and this came to be called the operator’s “fist.” A receiving operator could often name the sender before any call sign came through, the way you recognize a friend’s footsteps. During the Second World War, military signals intelligence used fist recognition to track individual enemy transmitters across the network, tying a given hand to a ship or a unit even when call signs changed. The identity rode in the timing, not the content.

The leap from telegraph keys to computer keyboards came in 1980. R. Stockton Gaines and colleagues at the RAND Corporation published Authentication by Keystroke Timing: Some Preliminary Results, report R-2526-NSF. Seven professional typists each typed a fixed paragraph of prose; the system recorded the time between successive keystrokes; four months later the same typists retyped the same text. The digraph timing distributions held up across that four-month gap well enough that a small set of digraphs distinguished the typists. The report’s conclusion was deliberately cautious, since these were preliminary results from seven people, but it set the idea that has driven the field since: a touch typist has a timing “signature,” and a computer could check it.

[Figure: timeline running from the 1890s telegraph "fist", through the 1980 RAND report R-2526, the 2009 CMU benchmark, and 2021 TypeNet at 100K users, to 2024 PSD2 / patents.]

*The arc from a telegraph-era listening skill to an internet-scale auth signal. The orange node is the 1980 RAND report that turned the telegraph observation into a computing problem.*

Through the 1980s and 1990s the work stayed mostly statistical. Researchers fit means and variances to per-digraph timing, then scored a fresh sample by its distance from the enrolled profile, whether Euclidean, Manhattan, or Mahalanobis. The survey tally is striking: even now, roughly 61 percent of the published methods are statistical distance or probability measures, with about 37 percent machine learning and the small remainder using sequence-alignment and other tricks. The field never fully left its statistical roots, because for short fixed text the simple detectors are hard to beat.

Static versus continuous, and why it matters

Two deployment shapes have always coexisted, and they have very different security properties. Static (or fixed-text) verification checks the rhythm of a known string, almost always a password or a short enrollment phrase, at a single moment. You type your password; the system checks both that the characters are right and that the timing matches your enrolled profile. Continuous (or free-text) verification watches you type whatever you type, indefinitely, and keeps asking “is this still the same person.” A session that started with the right login can be flagged mid-stream if the rhythm drifts to someone else’s.

Static is easier to make accurate because the string is fixed, so every sample is directly comparable. Continuous is harder because you cannot rely on the same digraphs appearing, but it is the one that closes the obvious hole in any one-shot check: the attacker who is handed an already-authenticated session. Continuous keystroke monitoring is the reason the modality survives in fraud and account-takeover detection, where the threat is not “guess the password” but “take over a live session.” It sits alongside the other always-on behavioral signals, mouse-movement biometrics and, on phones, touchscreen pressure and swipe dynamics, in the same continuous-authentication bucket, and the broader behavioral-biometrics fraud stack usually fuses all three rather than betting on keystrokes alone.

Turning timing into a comparable vector

Raw dwell and flight numbers are not directly comparable across samples, and most of the engineering in a keystroke system goes into making them so. Three problems have to be solved before a distance metric means anything.

The first is alignment. For a fixed password the keys are always the same in the same order, so the feature vector has a fixed layout and sample i of feature j always means the same transition. Free text has no such luxury. The same person types different strings, so two samples share only the digraphs that happen to appear in both. Systems either restrict comparison to a shared set of common digraphs, or, as TypeNet does, hand the raw sequence plus keycodes to a network and let it learn an alignment-free embedding. The shift from hand-built digraph tables to learned embeddings is the single biggest methodological change in the field since 2015, and it is what made free-text matching at scale practical.
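The shared-digraph option is the simpler of the two to sketch. The version below assumes each sample has already been summarized as a mapping from digraph to mean latency in milliseconds, and it refuses to score when too few digraphs overlap. The function name, the `min_shared` cutoff, and the numbers are illustrative rather than drawn from any published system.

```python
from typing import Dict, Optional


def shared_digraph_distance(a: Dict[str, float], b: Dict[str, float],
                            min_shared: int = 3) -> Optional[float]:
    """Mean absolute latency difference over digraphs present in both samples.

    Returns None when the overlap is too small to compare.
    """
    shared = sorted(set(a) & set(b))
    if len(shared) < min_shared:
        return None
    return sum(abs(a[g] - b[g]) for g in shared) / len(shared)


enrolled = {"th": 110.0, "he": 90.0, "in": 130.0, "er": 100.0}
sample = {"th": 118.0, "he": 86.0, "er": 109.0, "ou": 150.0}
unrelated = {"qu": 140.0, "th": 111.0}

print(shared_digraph_distance(enrolled, sample))     # 7.0
print(shared_digraph_distance(enrolled, unrelated))  # None: one shared digraph
```

The `None` branch is the free-text problem in miniature: two samples from the same person can simply fail to overlap.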

The second is scale. Dwell times cluster in a different range than flight times, and a slow typist’s intervals are uniformly larger than a fast typist’s. A plain Euclidean distance lets the large-magnitude features dominate. This is why the scaled and filtered Manhattan variants beat plain Euclidean on the CMU set: dividing each feature by a measure of its spread in the enrolled samples (the scaled Manhattan detector uses the mean absolute deviation) puts every dimension on equal footing, so a transition that is highly consistent for you counts more than one that is naturally noisy. The best detector in that study won precisely because it weighted each feature by how stable that feature is for the enrolled user.
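A small sketch makes the effect of scaling concrete. It uses the mean absolute deviation as the per-feature spread, which is my understanding of the scaled Manhattan detector in the CMU study; swapping in a standard deviation would change only the `enroll` function. The names, the `eps` guard, and the two-feature toy data are assumptions.

```python
from typing import List, Sequence, Tuple


def enroll(samples: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature mean and mean absolute deviation over the enrollment samples."""
    n, dims = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(dims)]
    spread = [sum(abs(s[j] - means[j]) for s in samples) / n for j in range(dims)]
    return means, spread


def manhattan(x: Sequence[float], means: Sequence[float]) -> float:
    return sum(abs(xi - mi) for xi, mi in zip(x, means))


def scaled_manhattan(x: Sequence[float], means: Sequence[float],
                     spread: Sequence[float], eps: float = 1e-6) -> float:
    """Manhattan distance with each term divided by that feature's spread."""
    return sum(abs(xi - mi) / max(si, eps) for xi, mi, si in zip(x, means, spread))


# Feature 0 is very consistent for this user; feature 1 is naturally noisy.
means, spread = enroll([[100.0, 300.0], [102.0, 340.0], [98.0, 260.0]])
off_on_stable = [110.0, 300.0]  # 10 ms off on the consistent feature
off_on_noisy = [100.0, 310.0]   # 10 ms off on the noisy feature

print(manhattan(off_on_stable, means), manhattan(off_on_noisy, means))
print(scaled_manhattan(off_on_stable, means, spread),
      scaled_manhattan(off_on_noisy, means, spread))
```

Plain Manhattan scores both probes identically at 10. The scaled version treats the deviation on the consistent feature as roughly twenty times more suspicious, which is the behavior the paragraph describes.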

The third is drift, and it is the one with no clean fix. A profile enrolled today does not describe how you type next month. People speed up on a string they repeat, slow down when tired, and rebuild their whole rhythm on an unfamiliar keyboard. Production systems answer with adaptive templates that fold each accepted sample back into the stored profile, so the model tracks you as you change. The cost is direct: every update that widens your acceptance region to follow your drift also widens the gap an impostor can slip through. Template aging is the quiet reason long-term keystroke accuracy is worse than any single-session benchmark suggests, and it is why the lab-versus-field gap exists at all.
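One common shape for an adaptive template is an exponential moving average that only updates on accepted samples. The sketch below is that generic idea, not any vendor's method; the threshold, the update rate, and the linear drift are made-up values chosen to show the contrast with a frozen template.

```python
from typing import List, Sequence, Tuple


def distance(template: Sequence[float], sample: Sequence[float]) -> float:
    """Plain Manhattan distance between a template and a sample."""
    return sum(abs(t - x) for t, x in zip(template, sample))


def verify_and_adapt(template: Sequence[float], sample: Sequence[float],
                     threshold: float, rate: float = 0.1) -> Tuple[bool, List[float]]:
    """Accept if within threshold; fold accepted samples into the template."""
    accepted = distance(template, sample) <= threshold
    if accepted:
        template = [(1 - rate) * t + rate * x for t, x in zip(template, sample)]
    return accepted, list(template)


enrolled = [100.0, 200.0]
# Hypothetical drift: the user gets 1 ms and 2 ms faster per session.
sessions = [[100.0 - i, 200.0 - 2 * i] for i in range(1, 21)]

adaptive = list(enrolled)
adaptive_results = []
for s in sessions:
    ok, adaptive = verify_and_adapt(adaptive, s, threshold=40.0)
    adaptive_results.append(ok)

static_results = [distance(enrolled, s) <= 40.0 for s in sessions]
print(all(adaptive_results), static_results[-1])  # True False
```

The adaptive template keeps accepting the drifting user while the frozen one starts rejecting them. The same update rule would also follow an impostor whose samples land just inside the threshold, which is the cost described above.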

The benchmark that fixed the methodology

For its first three decades the field had a measurement problem. Every paper used its own subjects, its own text, its own evaluation protocol, so the reported error rates were not comparable and were often optimistic. In 2009 Kevin Killourhy and Roy Maxion at Carnegie Mellon published Comparing Anomaly-Detection Algorithms for Keystroke Dynamics and, more importantly, released the dataset behind it.

The setup is worth knowing in detail because so much later work cites it. Fifty-one subjects each typed the password .tie5Roanl 400 times, in eight sessions of 50 repetitions spread across separate days. Timestamps came from an external reference clock accurate to within ±200 microseconds, not the noisy system clock, which removed a major source of measurement error. The result is a clean, public matrix: 51 users, 400 samples each, 31 timing features per sample. They then implemented 14 detectors drawn from the keystroke and pattern-recognition literature and ran them all through one identical evaluation procedure.

*Equal error rate, CMU benchmark (lower is better). 51 users, password ".tie5Roanl", 31 timing features per sample.*

| Detector | Equal error rate |
| --- | --- |
| Manhattan (scaled) | 9.6% |
| Outlier count (z) | 10.3% |
| NN (Mahalanobis) | 10.8% |
| SVM (one-class) | 12.1% |
| Manhattan | 15.0% |
| NN (auto-assoc) | 16.4% |
| Euclidean | 16.9% |

*Selected detectors from the 14 compared on the CMU set. The best, scaled Manhattan, lands near a 9.6% equal error rate on a single 10-character password. Useful, not a fingerprint.*

The headline number is the one to internalize before you trust any vendor’s marketing. The best detector in that study, scaled Manhattan distance, achieved an equal error rate of about 9.6 percent. Equal error rate is the operating point where the false-accept rate equals the false-reject rate, and on a single 10-character password the best-known method got roughly one in ten wrong at that balance point. That is a useful second factor. It is nowhere near a fingerprint. Anyone quoting sub-one-percent error for fixed-text keystroke auth is either using much longer input, fusing other signals, or measuring something more forgiving than this clean public protocol.
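Equal error rate is simple to compute once you have scores. The sketch below assumes distance-style scores (lower means more like the enrolled user), sweeps every observed score as a candidate threshold, and reports the point where false accepts and false rejects are closest. The function names and the score lists are invented; a real evaluation would use far more samples and would often interpolate between thresholds.

```python
from typing import Sequence, Tuple


def far_frr(genuine: Sequence[float], impostor: Sequence[float],
            threshold: float) -> Tuple[float, float]:
    """False-accept and false-reject rates when accepting scores <= threshold."""
    far = sum(s <= threshold for s in impostor) / len(impostor)
    frr = sum(s > threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> Tuple[float, float]:
    """Return (approximate EER, threshold) where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


genuine = [1, 2, 2, 3, 3, 4, 5, 6, 7, 9]          # hypothetical distances
impostor = [4, 6, 7, 8, 8, 9, 10, 11, 12, 13]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)  # 0.2 6
```

Moving the threshold off that point trades one error for the other, which is why a single EER figure hides the operating point a deployment actually picks.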

A later study from the same group made the warning sharper still by showing that error rates measured in a controlled lab were optimistic relative to the same task run in the field, where hardware, posture, and attention vary. The lab number is a best case, not a forecast.

Scaling to free text and a hundred thousand users

The statistical detectors plateau where the input is short and fixed. The modern jump came from treating a keystroke sequence as a time series and learning a representation with a recurrent network. The clearest published example is TypeNet, from a group at the Universidad Autónoma de Madrid working with John V. Monaco.

TypeNet feeds five features per keystroke into a Siamese Long Short-Term Memory network. Four are timing (hold latency or dwell, inter-key latency or release-to-press flight, press latency, and release latency), and the fifth is the keycode itself, so the model knows which key the timing belongs to. Sequences are cut to a fixed length of M = 50 keystrokes; the paper reports that beyond about M = 70 there is no meaningful accuracy gain, so a short paragraph is enough. The Siamese setup means the network learns an embedding where two samples from the same person sit close and two from different people sit far, rather than learning a fixed roster of users. New users enroll without retraining.
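The input side of that design can be sketched without the network itself. The code below builds the five per-keystroke features and pads or truncates to M rows. Several details are assumptions on my part rather than something the text above states: timestamps in seconds, the keycode scaled by 255, zero padding, and zeros for the last key's next-key latencies. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Key = Tuple[int, float, float]  # (keycode, press time in s, release time in s)


def typenet_style_features(keys: Sequence[Key], M: int = 50) -> List[List[float]]:
    """Rows of [hold, inter-key, press, release latency, scaled keycode], length M."""
    rows: List[List[float]] = []
    for i, (code, press, release) in enumerate(keys[:M]):
        if i + 1 < len(keys):
            _, next_press, next_release = keys[i + 1]
            inter_key = next_press - release    # release-to-press flight
            press_lat = next_press - press      # press-to-press
            release_lat = next_release - release
        else:
            inter_key = press_lat = release_lat = 0.0  # no following key
        rows.append([release - press, inter_key, press_lat, release_lat, code / 255.0])
    rows.extend([[0.0] * 5 for _ in range(M - len(rows))])  # zero-pad short input
    return rows


# Hypothetical keystrokes for "the" (keycodes 84, 72, 69).
seq = typenet_style_features([(84, 0.000, 0.095), (72, 0.085, 0.170), (69, 0.210, 0.290)])
print(len(seq), len(seq[0]))
print(seq[0])
```

A Siamese model would then embed two such M-by-5 arrays and compare the embeddings; that part is omitted here.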

The scale is what makes it notable. Training used a subset of the Aalto University free-text databases, which together hold more than 136 million keystrokes from 168,000 subjects on physical keyboards. With five enrollment sequences per user, TypeNet reports an equal error rate around 4.8 percent on free text for the physical-keyboard case, dropping toward 2.2 percent in the most favorable configuration, and roughly 9.2 percent on mobile touchscreens where the input surface is noisier. The result that matters for anyone building at internet scale: holding the enrollment fixed and growing the test population from 1,000 users to 100,000, accuracy decayed by less than five percent relative. The signal does not collapse as the crowd gets large, which had been the open worry about behavioral biometrics for two decades.

Two cautions on those numbers. They are free-text rates with a strong model and generous training data, not the fixed-password rates from the CMU study, so they are not directly comparable. And touchscreen at 9.2 percent is barely better than the lab fixed-text baseline, which tells you how much the input device matters. The rhythm is yours; the keyboard mediates how cleanly it comes through.

How the browser hands over the timing

On the web, none of this needs special hardware. The DOM raises a keydown event when a key goes down and a keyup when it comes up, and each carries a timeStamp. Subtract the keydown timestamp from the keyup timestamp of the same key and you have dwell. Subtract adjacent keys’ timestamps and you have flight. A few hundred lines of JavaScript collects the full timing matrix for a login form without the user noticing, which is exactly how commercial typing-biometric SDKs work. That same passive collection is one entry in the larger menu of JavaScript-runtime signals an anti-bot or anti-fraud agent gathers on a page.

[Figure: the web pipeline. keydown and keyup timeStamp values give dwell = up − down and flight = next − this; these fill the feature vector [H, DD, UD, ...], which is scored by distance to the enrolled profile and accepted or rejected at a threshold.]

*The web pipeline. Two DOM events per key, two subtractions, one vector, one distance against the enrolled profile. The mechanism is trivial; the discrimination lives in the model and the data.*
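The one non-obvious step in that pipeline is pairing events, because a fast typist's next keydown arrives before the current keyup. The sketch below shows the pairing logic in Python over a recorded event list instead of live DOM handlers; each tuple mimics the `type`, `code`, and `timeStamp` fields of a `KeyboardEvent`. The function name and the event data are invented.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str, float]  # (type, code, timeStamp in ms)


def pair_events(events: List[Event]) -> List[Tuple[str, float, float]]:
    """Match each keydown with its keyup; returns (code, press, release) in press order."""
    pending: Dict[str, float] = {}
    done: List[Tuple[str, float, float]] = []
    for kind, code, ts in events:
        if kind == "keydown":
            pending.setdefault(code, ts)  # keep the first keydown, ignore auto-repeat
        elif kind == "keyup" and code in pending:
            done.append((code, pending.pop(code), ts))
    return sorted(done, key=lambda k: k[1])


events = [
    ("keydown", "KeyT", 1000.0),
    ("keydown", "KeyH", 1085.0),  # pressed before KeyT is released
    ("keydown", "KeyH", 1090.0),  # auto-repeat while held
    ("keyup", "KeyT", 1095.0),
    ("keyup", "KeyH", 1170.0),
]
strokes = pair_events(events)
dwells = [release - press for _, press, release in strokes]
ud_flight = strokes[1][1] - strokes[0][2]  # next press minus this release
print(strokes)
print(dwells, ud_flight)
```

Keying the pending table by `code` is what lets overlapping presses resolve correctly. A handler that assumed strict down-up-down-up ordering would mis-pair exactly the fast transitions that carry the most signal.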

There is a catch the browser vendors introduced on purpose, and it directly degrades the signal. Event timestamps are a DOMHighResTimeStamp, the same high-resolution clock as performance.now(). After Spectre showed that high-resolution timers can be turned into side channels, browsers coarsened that clock. In a normal page that is not cross-origin isolated, the resolution is clamped to 100 microseconds. A page can opt into 5-microsecond resolution, but only by becoming cross-origin isolated with the Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers, which most login pages will not set. Firefox applies its own coarsening and adds randomized jitter on top in its stricter privacy modes. For keystroke dynamics this clamping is mostly harmless, since human dwell and flight times live in the tens-to-hundreds-of-milliseconds range, far coarser than 100 microseconds, but it is a reminder that the timing surface the browser exposes is a deliberately blunted version of what the OS sees, and a privacy mode that adds jitter will inject noise into the very intervals a verifier is measuring.
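To see why a 100-microsecond clamp barely matters while millisecond-scale jitter does, here is a small simulation. It works in integer microseconds to avoid floating-point rounding, models the clamp as rounding down, and models jitter as a uniform random offset. Both are simplifying assumptions, not a description of how any particular browser implements its coarsening.

```python
import random
from typing import Optional


def coarsen(ts_us: int, resolution_us: int = 100, jitter_us: int = 0,
            rng: Optional[random.Random] = None) -> int:
    """Round a microsecond timestamp down to the clock resolution, with optional jitter."""
    if jitter_us:
        rng = rng or random.Random()
        ts_us += rng.randint(-jitter_us, jitter_us)
    return ts_us - ts_us % resolution_us


press_us, release_us = 1_000_037, 1_095_482  # hypothetical key-down and key-up
true_dwell = release_us - press_us           # 95_445 us, about 95.4 ms

clamped_dwell = coarsen(release_us) - coarsen(press_us)

rng = random.Random(0)  # seeded so the run is repeatable
jittered_dwell = (coarsen(release_us, jitter_us=1000, rng=rng)
                  - coarsen(press_us, jitter_us=1000, rng=rng))

print(true_dwell, clamped_dwell, jittered_dwell)
```

The clamped dwell is off by 45 microseconds on a 95-millisecond interval. With a hypothetical ±1 ms of jitter the error can reach about 2 ms, which starts to matter for a per-feature spread measured in single-digit milliseconds.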

Where it goes into production

The modality has a real commercial and regulatory foothold, and it is concentrated in one place: financial fraud. Under the EU’s PSD2 directive, strong customer authentication requires two factors from independent categories, and behavioral biometrics including keystroke dynamics has been accepted by regulators as an “inherence” factor for that purpose. That regulatory blessing is why typing-biometric vendors exist as a business rather than a research curiosity. TypingDNA, the most visible of them, sells an authentication API that records press, release, and inter-key movement timing during enrollment and matches against it later; the company was granted a US patent on its typing-biometrics multi-factor method in February 2024 and partners with identity platforms to slot the signal into existing login flows.

The honest pitch for these systems is not “replace the password.” At a roughly 10-percent fixed-text error rate, keystroke timing cannot be a sole factor for anything that matters. It is a second factor that costs the user nothing (no phone, no token, no extra tap) and a continuous one that keeps scoring a session after login. For account-takeover detection, where a stolen credential gets you in but the attacker’s hands type differently from the owner’s, a low-friction continuous signal that is right most of the time is genuinely valuable even at 10-percent error, because it runs constantly and fuses with everything else the fraud engine sees.

The liveness problem

The weakness that should worry anyone deploying this as security rather than friction is replay and synthesis. Keystroke dynamics has no inherent liveness guarantee. The “biometric” is a sequence of intervals, and intervals are trivial to record and replay. If an attacker can observe a victim’s timing, through a keylogger that captures timestamps, or through traffic that carries the timing matrix, or even, as one 2022 study showed, by extracting the inter-character delays from a screen recording of someone typing, they can reproduce the rhythm by injecting synthetic key events with the stolen timing.

That screen-recording attack reported an evasion rate as high as 64 percent against a keystroke verifier, using only video of the victim typing and no access to the victim’s machine. Generative approaches go further: conditional GANs have been trained to synthesize keystroke timing sequences that impersonate a target user’s profile, learning the distribution rather than replaying one capture. The defensive response is liveness detection: train a separate classifier to tell genuine human timing from synthetic or replayed timing, using the synthetic samples themselves as the negative class. It helps, but it is an arms race of the familiar shape: every generator that learns to fool the current liveness model becomes training data for the next one. The structural problem is that the signal is just numbers, and numbers replay. This is the same wall that synthetic-input research keeps hitting on the mouse side, where producing event streams that survive a behavioral classifier is far harder than producing ones that merely look plausible.

There is a quieter limitation that matters more in practice than spoofing: the signal is not stationary. Your typing rhythm shifts with the keyboard, with fatigue, with injury, with caffeine, with whether you are typing your own password for the thousandth time or a phrase you have never seen. A profile enrolled on your laptop does not transfer cleanly to your phone or a borrowed machine. Systems handle this with adaptive templates that update as you type and with per-device profiles, but every adaptation widens the acceptance region, and a wider acceptance region is exactly what an impostor needs. The tension between stability and uniqueness here is the same one every detector fights, and it is worth reading alongside the general treatment of the entropy budget a fingerprint has to spend.

What the rhythm is actually worth

Keystroke dynamics is the oldest behavioral biometric and the one with the longest unbroken line of evidence: a telegraph-era listening skill, a careful 1980 experiment on seven typists, a clean 2009 benchmark that nailed down the error rates, and a 2021 system that carried the idea to a hundred thousand users without it falling apart. Across all of that the central finding has been remarkably consistent. There is real, reproducible identity in how you type, and it is stable enough to be useful for months at a time.

The number to carry away is the 9.6-percent equal error rate on a single password from the CMU set, because it sets the honest baseline for the simplest deployment and every stronger claim has to explain what it is doing differently, whether longer text, a learned embedding, signal fusion, or a more forgiving test. Keystroke timing is a good second factor and a good continuous one. It is a poor sole factor and a worse liveness guarantee, because the entire signal is a vector of millisecond intervals, and a vector of intervals can be recorded, learned, and replayed by anyone who watches you type. The telegraph operators knew their own fists could be imitated by a skilled hand on the same key. A century and a half later, with screen recordings and GANs standing in for the skilled hand, that is still the binding constraint.

