
Speech-synthesis voices as a fingerprint: the getVoices enumeration leak

Copyright: MIT

Call speechSynthesis.getVoices() from any web page and you get back an array. No prompt. No permission. No audio plays, no microphone opens, nothing is spoken. The array holds the text-to-speech voices the system has installed, and on a real desktop it is rarely short. A Windows machine hands back Microsoft David and Zira. A Mac hands back Samantha, Daniel, Alex, and a long tail of localized voices. An Android phone hands back the Google TTS set. A headless Linux box, more often than not, hands back nothing at all.

That array is a fingerprint. Not because any single voice is rare, but because the set of voices, their names, their language tags, and the URIs that identify them is a fairly precise readout of which operating system you run, which version it is, and which language packs you have installed. The voice list is bundled by the OS, so reading it is close to reading a property of the OS itself, through a JavaScript API that was designed to make web pages talk. This post traces where the signal comes from, the quirky async-loading behavior that makes the API awkward to read and awkward to fake, how much entropy the list actually carries, and what the browsers that noticed have done about it.

A roadmap

We start with the shape of the API: what getVoices() returns, the five fields on each voice object, and why voiceURI leaks more than name. Then the async quirk, the part most explainers get wrong, because on Chrome the first call returns an empty array and only a later event fills it in, and that timing is itself a tell. After that comes the OS-and-locale signal, voice by voice across Windows, macOS, iOS, Android, and Linux, and the empty-list case that flags automation. Then entropy: how much the list distinguishes you, and why it sits in the awkward middle of the stability-versus-uniqueness trade-off. Finally the defenses, from Firefox’s blanket empty list under resistFingerprinting to Brave’s per-session farbling, and where the vector sits in 2026.

The API and its five fields

The Web Speech API is old in web-platform terms and still not a standard. It lives as a Draft Community Group Report under the W3C Web Audio Working Group, last revised 21 May 2026, and it has been in that draft-but-shipping limbo for over a decade. The synthesis half of it (the half that makes the browser speak) has been available across major browsers since around September 2018, which is the date MDN now lists as its baseline.

The piece that matters for fingerprinting is one method and one interface. SpeechSynthesis.getVoices() returns a sequence<SpeechSynthesisVoice>, and each SpeechSynthesisVoice is a small read-only object. The IDL is short enough to quote in full:

interface SpeechSynthesisVoice {
    readonly attribute DOMString voiceURI;
    readonly attribute DOMString name;
    readonly attribute DOMString lang;
    readonly attribute boolean localService;
    readonly attribute boolean default;
};

Five fields. name is the human-readable label, “Samantha” or “Microsoft Zira Desktop”, and the spec is explicit that uniqueness is not guaranteed. lang is a BCP 47 language tag, en-US, fr-FR, ru-RU. default is true for at most one voice per language, chosen by the user agent. localService is a boolean: true if the voice is synthesized on-device, false if it routes to a remote service. And voiceURI is the identifier and location of the synthesis service for that voice, a string the spec describes as a generic URI that can point to a local URN or a remote URL.
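As a minimal sketch of reading those five fields, the helper below copies them into plain objects. The name snapshotVoices is hypothetical, and the synth parameter stands in for window.speechSynthesis so the function can be exercised outside a browser. The explicit copy is there because a voice object typically exposes its attributes as accessors on the interface prototype, so serializing it directly tends to yield an empty object.

```javascript
// Copy the five spec-defined attributes of each voice into a plain object.
// `synth` is any object with a getVoices() method; in a browser this would
// be window.speechSynthesis. The function name is illustrative only.
function snapshotVoices(synth) {
  return Array.from(synth.getVoices(), (v) => ({
    voiceURI: v.voiceURI,
    name: v.name,
    lang: v.lang,
    localService: v.localService,
    default: v.default,
  }));
}
```

In a page, JSON.stringify(snapshotVoices(window.speechSynthesis)) would then serialize whatever the browser has loaded at that moment, which on some browsers may still be an empty list.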

That last field is the one that leaks. The name is often just a first name, and several platforms reuse the same names. The voiceURI carries the plumbing. On a Mac, Firefox reports a value like urn:moz-tts:osx:com.apple.speech.synthesis.voice.daniel, which embeds the platform (osx), the synthesis backend (moz-tts), and the system’s own internal voice identifier (com.apple.speech.synthesis.voice.daniel). Read a list of those and you are not guessing at the OS, you are reading its private voice-registry keys through a public API.

[Figure: one SpeechSynthesisVoice entry in the array getVoices() returns. name: "Daniel"; lang: "en-GB"; default: false; localService: true; voiceURI: "urn:moz-tts:osx:com.apple.speech.synthesis.voice.daniel". The voiceURI embeds the platform and the OS's private voice-registry key.]
*The five read-only fields of one voice object. The orange field, voiceURI, carries the OS-internal identifier and leaks more than the human-readable name.*
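To make the structure of that URI concrete, here is a small sketch that splits a Firefox-style value into its parts. It assumes the urn:moz-tts:&lt;platform&gt;:&lt;id&gt; shape seen in the macOS example above. The function name is hypothetical, and platform strings other than osx are passed through uninterpreted rather than guessed at.

```javascript
// Split a Firefox-style voice URN of the form urn:moz-tts:<platform>:<id>.
// Only the "osx" platform segment is attested in the example above; any
// other value is returned as-is, not mapped to an OS.
function parseMozTtsUri(voiceURI) {
  const match = /^urn:moz-tts:([^:]+):(.+)$/.exec(voiceURI);
  if (!match) return null; // not a moz-tts URN; other browsers use other shapes
  return { backend: "moz-tts", platform: match[1], nativeId: match[2] };
}
```

For the Daniel voice this yields platform "osx" and nativeId "com.apple.speech.synthesis.voice.daniel". A bare name such as "Samantha" yields null, which is itself a hint that the value came from a browser that does not use this URN scheme.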

The async quirk that nobody gets right

Here is where the API earns its reputation for being awkward. getVoices() is defined to return synchronously, an array, right now. But the array it returns depends on whether the browser has finished loading its voice catalog, and the browsers disagree about when that happens.

On Firefox desktop and Safari desktop, voices are generally ready by the time your script runs, so the first call returns a populated list. On Chrome desktop, Chrome on Android, and sometimes Firefox on Android, the very first getVoices() call returns an empty array. The catalog loads asynchronously in the background, and only when it is ready does the browser fire a voiceschanged event on the speechSynthesis object. Call getVoices() again inside that event handler and the list is finally there.

So the synchronous method has an asynchronous data dependency, which is exactly the kind of design that produces bug reports. The W3C bug tracker has an entry, filed back in the early life of the spec, literally titled “getVoices should be asynchronous.” It never was. Instead the platform bolted on the voiceschanged event and left developers to reconcile the two. The standard advice is to call once, and if the array is empty, wait for either the event or a timeout, whichever comes first. A Promise.race between the voiceschanged event and a two-second timer is the pattern most libraries settle on, the timeout being necessary because on a platform with no voices the event may never fire at all.
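The race described above can be sketched roughly as follows. The function name loadVoices and the synth parameter are illustrative, and the two-second default simply mirrors the figure in the text. On timeout the sketch re-reads the list once rather than assuming it is empty.

```javascript
// Resolve with the voice list, waiting for voiceschanged if the first call
// comes back empty. `synth` is a SpeechSynthesis-like EventTarget with a
// getVoices() method; in a browser, pass window.speechSynthesis.
function loadVoices(synth, timeoutMs = 2000) {
  const first = synth.getVoices();
  if (first.length > 0) return Promise.resolve(first);

  let timer;
  let onChange;
  const changed = new Promise((resolve) => {
    onChange = () => resolve(synth.getVoices());
    synth.addEventListener("voiceschanged", onChange, { once: true });
  });
  const timedOut = new Promise((resolve) => {
    // On a voiceless platform the event may never fire, so give up eventually
    // and report whatever the list holds at that point.
    timer = setTimeout(() => resolve(synth.getVoices()), timeoutMs);
  });

  return Promise.race([changed, timedOut]).finally(() => {
    clearTimeout(timer);
    synth.removeEventListener("voiceschanged", onChange);
  });
}
```

A caller would write something like loadVoices(window.speechSynthesis).then(render). An empty result after the timeout means either no voices or a very slow catalog, and the sketch makes no attempt to tell those apart.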

[Figure: the first getVoices() call after page load. Firefox / Safari desktop: getVoices() returns [12 voices]. Chrome desktop / Android: getVoices() returns [] (empty), then voiceschanged fires, then getVoices() returns [18 voices]. The empty-then-populated pattern is real browser behavior. A list that appears fully populated on the very first synchronous call, claiming to be Chrome, is suspect.]
*Chrome populates the voice list asynchronously and signals readiness with a voiceschanged event. A spoofed environment that returns a full list on the first synchronous call while presenting a Chrome user-agent contradicts that timing.*

For a tracker this quirk is a gift, not an obstacle. The honest behavior of a real Chrome is observable: empty first call, then a voiceschanged event, then a populated list. A naive automation tool that patches getVoices() to return a hard-coded array will return that array immediately and synchronously, and will never fire voiceschanged. The timing of the leak is part of the leak. This is the same lesson that keeps recurring across fingerprinting vectors, that faking the value is easy and faking the mechanism that produces the value is hard, and it shows up again in how anti-bot systems fingerprint the JavaScript runtime.
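A detector might record that timing along these lines. This is a sketch under stated assumptions, not any vendor's implementation: both function names are hypothetical, the rule in timingLooksPatched is deliberately crude, and it only means something when the probe runs at page start. A script that runs late on a real Chrome can see a populated first call and miss the event entirely.

```javascript
// Record what a page can observe about voice-loading timing.
// `synth` is a SpeechSynthesis-like EventTarget with getVoices().
function observeVoiceTiming(synth, waitMs = 2000) {
  const firstCallCount = synth.getVoices().length;
  return new Promise((resolve) => {
    let sawVoicesChanged = false;
    const onChange = () => { sawVoicesChanged = true; };
    synth.addEventListener("voiceschanged", onChange);
    setTimeout(() => {
      synth.removeEventListener("voiceschanged", onChange);
      resolve({
        firstCallCount,
        sawVoicesChanged,
        finalCount: synth.getVoices().length,
      });
    }, waitMs);
  });
}

// Hypothetical rule: a session that claims Chrome, had a full list on the
// first synchronous call, and never fired voiceschanged. Treat the result
// as a weak signal to combine with others, not as a verdict.
function timingLooksPatched(claimsChrome, timing) {
  return claimsChrome && timing.firstCallCount > 0 && !timing.sawVoicesChanged;
}
```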

What the list reveals about the OS

The reason the voice set is a strong OS signal is mundane: nobody installs these voices by hand. They ship with the operating system, and each OS ships its own. So the list is less a property of the browser than a property of the platform underneath it, surfaced through the browser.

Windows ships the Microsoft voices. The classic set is David and Zira for US English, Mark on some versions, and a long roster of localized SAPI voices keyed by language. On a Windows machine the names carry the “Microsoft” prefix and the voiceURI values reference the SAPI registry. macOS ships the Apple voices, the names everyone recognizes from the Speech preferences pane: Samantha, Alex, Daniel, Victoria, and dozens more across languages, each with a com.apple.speech.synthesis.voice.* identifier inside the URI. iOS ships a subset of the same Apple voices, though Safari on iOS is famously stingy about what it exposes and has historically not returned the full installed set. Android ships the Google TTS engine and its voice list. And desktop Linux ships, often, nothing, because no speech engine is installed by default; where one is present it is usually eSpeak, with its own unmistakable voice names.

Two facts compound the signal. First, version. Apple and Microsoft revise their voice rosters between OS releases, adding new neural voices, retiring old ones, renaming. The presence or absence of a particular voice narrows the OS version, sometimes to a single major release. Second, language packs. The base install ships the voices for the system language, and additional language voices appear only when the user has installed the matching language support. That makes certain voices a near-perfect locale beacon. The standard example, repeated across the fingerprinting literature, is Russian: a ru-RU voice is typically present only on a Russian-localized system or one where the user has explicitly added Russian language support, so its mere presence in the list is a strong tell about the user’s locale and language history.

[Figure: representative getVoices() output by platform. Windows 11: Microsoft David, Microsoft Zira ... macOS: Samantha, Alex, Daniel, Victoria ... Android: Google US English, Google ... (TTS). Linux (espeak): english, english-us, ... or none. Headless Linux: [] (empty). A ru-RU voice present in any of the above narrows locale to Russian-localized or Russian-language-pack systems. The empty list under a desktop UA is the headless tell.]
*The voice set tracks the OS that bundled it. Names and language tags vary by platform and version; an empty list under a desktop or mobile user-agent contradicts the claimed platform.*
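A collector could turn those platform markers into counts with something like the sketch below. The marker strings are only the ones mentioned in this post, real rosters vary by OS version and browser, and the function name is hypothetical. The Apple marker fires only where the browser exposes the system identifier in voiceURI, as in the Firefox example earlier.

```javascript
// Count platform markers in a voice list. Markers are taken from the text
// above and are illustrative, not an exhaustive or authoritative roster.
function platformHints(voices) {
  const hints = { windows: 0, apple: 0, google: 0 };
  for (const v of voices) {
    if (v.name.startsWith("Microsoft ")) hints.windows++;
    if (v.voiceURI.includes("com.apple.speech.synthesis.voice.")) hints.apple++;
    if (v.name.startsWith("Google ")) hints.google++;
  }
  return hints;
}
```

Returning counts rather than a single label keeps the contradiction visible: a list with both windows and apple hits is more interesting than either alone.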

This is where the vector earns its keep for bot detection rather than ad tracking. A headless browser running on a Linux server typically has no speech engine installed, so getVoices() returns an empty array. By itself that is not damning, plenty of real configurations are voiceless, but in context it is loud. A request arrives with a Windows 11 user-agent, a Windows-shaped set of headers, and a navigator object that says Win32, and then getVoices() comes back empty. Real Windows is never voiceless. The contradiction is the signal. The same logic catches the cruder spoof in the other direction: macOS voice URIs showing up under a Windows user-agent, or a fixed list that does not match the claimed platform at all. Cross-checking one declared attribute against an independently sourced one is the core move of server-side and client-side bot detection alike, and the voice list is a convenient second source because it is hard to forge consistently.

Chrome on Linux is worth a special note here, because it muddies the empty-list rule. Chrome does not enable speech synthesis on Linux at all and will always return an empty list there, regardless of what voices the system has installed. So an empty list under a Linux user-agent is unremarkable and proves little. The detection value of the empty list is highest precisely on the platforms where a populated list is near-certain, which is to say Windows and macOS. The asymmetry matters: a tracker reads the empty list as suspicious only when the rest of the fingerprint claims a platform that never ships voiceless, and treats it as neutral when the claimed platform plausibly has no engine. That nuance is the difference between a detector that flags real Linux desktop users as bots and one that does not.
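The asymmetry fits in a few lines. In this sketch the platform labels and the function name are assumptions, and claimedPlatform is whatever the detector has already derived from the user-agent and headers.

```javascript
// Platforms where, per the reasoning above, a populated list is near-certain.
const VOICED_PLATFORMS = new Set(["windows", "macos"]);

// Classify an observed voice count against the platform the session claims.
// The labels ("windows", "macos", "linux", ...) are hypothetical.
function emptyListVerdict(claimedPlatform, voiceCount) {
  if (voiceCount > 0) return "populated";
  return VOICED_PLATFORMS.has(claimedPlatform) ? "suspicious" : "neutral";
}
```

So emptyListVerdict("windows", 0) returns "suspicious" while emptyListVerdict("linux", 0) returns "neutral", which is the whole point: the same observation carries different weight depending on the claim it is checked against.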

There is a subtler version of the OS readout that does not depend on the empty case at all. Even a fully populated list carries an ordering and a default-flag pattern that the platform produces deterministically. The voice marked default: true for a given language, the order voices appear in the array, and the exact lang tags attached to each are all set by the OS, not by the page. Two stock machines of the same OS version and locale produce the same ordering; a machine that has had voices added produces a different one. A spoofing tool that assembles a plausible-looking list from scratch has to reproduce not just the names but the order and the default flags, and getting the names right while getting the ordering wrong is its own kind of tell.

How much entropy is in a voice list

The honest answer is a useful amount, but not enough to make the list a unique identifier on its own, and no single authoritative measurement pins down the exact figure. The fingerprinting-survey literature treats voice enumeration as one signal among many rather than a headline vector like canvas or WebGL, and the per-vector entropy numbers that get published for audio-adjacent techniques are modest. A 2021 study of Web Audio fingerprints, a different technique that renders an oscillator rather than reading the voice list, reported Shannon entropy on the order of 2 to 2.6 bits for its strongest vector. Voice enumeration is in a comparable range for most users and higher for the unusual ones, but you should treat any single hard number you see quoted for it with suspicion, because it depends entirely on the population measured.
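To see why the number depends on the population, it helps to compute it. Shannon entropy over a sample of observed values is H = -Σ p log2(p), where p is the share of the sample holding each distinct value. The sketch below applies that to an array of tokens, one per visitor. The function name is illustrative and the sample data is made up.

```javascript
// Shannon entropy, in bits, of a sample of observed fingerprint tokens.
// Each element is one visitor's value (for example, a voice-list hash).
function shannonEntropyBits(tokens) {
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);

  let bits = 0;
  for (const n of counts.values()) {
    const p = n / tokens.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Four visitors, four distinct lists: 2 bits.
console.log(shannonEntropyBits(["a", "b", "c", "d"]));
// Three visitors share a list and one differs: about 0.81 bits.
console.log(shannonEntropyBits(["w", "w", "w", "m"]));
```

The same API yields very different figures depending on who is in the sample, which is why a number measured on one site's traffic does not transfer to another's.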

The reason it is not higher is that the common cases collapse together. Most Windows 11 users in a given locale have the same base voice set. Most stock Macs of a given OS version have the same Apple voices. So for the median user the voice list is shared with millions of others and contributes only a few bits, much like the reasoning in the entropy budget every detector balances. The distribution has a long tail, though. A user who has installed three extra language packs, or added a premium third-party voice, or is running an OS version whose voice roster differs from the current one, sits in a much smaller bucket. For those users the list is far more identifying. The signal is unevenly distributed, generous to the outliers and stingy to everyone in the middle.

There is also the stability question, which cuts the other way. A canvas or audio fingerprint is stable for years because it is a property of the hardware and the rendering stack. A voice list changes whenever the user updates the OS, installs a language pack, or adds a voice, and on some platforms it changes across browser versions for the same machine. That makes it less reliable as a long-term identifier and more useful as a short-term, cross-checking signal. It is a good “does this session hang together” check and a mediocre “is this the same person as last month” check. In the language of the trade-off, it leans toward uniqueness at the cost of stability, which is why detectors tend to use it as corroboration rather than as a primary key, and why it pairs naturally with the more stable vectors covered in the cross-browser fingerprint.

Where it gets more interesting is in combination with the rest of the audio and media surface. The set of installed voices, the set of supported codecs from media capabilities, and the audio-rendering float all describe the same underlying machine from different angles, and a detector that collects all three can cross-validate them. A voice list that says macOS, a codec profile that says Windows, and an audio float that says Linux do not describe a real device.

A practical collector does not store the raw list. The names are long, the URIs longer, and the array can run to dozens of entries, so the usual approach is to hash a normalized representation into a compact token. Sort the voices by voiceURI to make the order deterministic, concatenate the URI, name, and lang of each, and hash the result. That collapses a verbose list into a few bytes that can sit alongside the canvas hash, the WebGL renderer string, and the rest of the signal vector. The normalization step matters because the raw ordering can differ run to run on some platforms, and a detector that hashed the unsorted list would see the same machine produce different tokens. The trade-off is that hashing throws away the structure, the individual lang tags and the locale tells that a richer collector might want to read directly. Most commercial SDKs keep both: a hash for the dedup key and a small set of extracted features, the voice count, the set of distinct language tags, and a few platform-marker booleans, for the heuristics.
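A minimal sketch of that normalize-then-hash step follows. The hash is FNV-1a (32-bit), swapped in only because it is tiny and dependency-free. It is not cryptographic, and nothing here says which hash commercial SDKs actually use. Both function names are hypothetical. Sorting the joined rows orders by voiceURI first and breaks ties on name and lang, so duplicate URIs still sort deterministically.

```javascript
// FNV-1a, 32-bit: a small non-cryptographic hash used here to keep the
// sketch self-contained. A real collector would likely pick something else.
function fnv1a32(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep 32 bits
  }
  return hash.toString(16).padStart(8, "0");
}

// Reduce a voice list to a short token that ignores the array order.
function voiceListToken(voices) {
  const rows = voices
    // Unit separator between fields avoids ambiguous concatenations.
    .map((v) => [v.voiceURI, v.name, v.lang].join("\u001f"))
    .sort();
  return fnv1a32(rows.join("\u001e"));
}
```

The token is eight hex characters. Two lists containing the same voices in a different order map to the same token, which is the property the normalization step is there to provide.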

That extraction is where the locale signal becomes actionable. A collector that pulls the distinct lang values out of the list gets a second, independent read on the user’s language history to compare against navigator.language, the Accept-Language header, and the timezone. Those four should agree. A browser reporting en-US for its language, sending en-US in its headers, sitting in a US timezone, and then exposing a ru-RU and a zh-CN voice in its list is describing a user whose system has been configured for languages their browser claims not to prefer. That is not necessarily a bot, plenty of multilingual users look exactly like this, but it is a discriminating feature, and discriminating features are what entropy budgets are built from.
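That comparison might be sketched as follows. The function names are hypothetical, and declaredTags is assumed to be whatever the detector has gathered from navigator.language, navigator.languages, and a parsed Accept-Language header. The output is a feature to feed into scoring, not a verdict, and a platform that bundles many voices by default will naturally produce a long list.

```javascript
// Primary language subtag of a BCP 47 tag: "en-US" -> "en".
// Underscore separators are tolerated in case a platform reports them.
function primarySubtag(tag) {
  return tag.toLowerCase().split(/[-_]/)[0];
}

// Languages that have an installed voice but are not among the languages
// the browser declares.
function undeclaredVoiceLanguages(voices, declaredTags) {
  const declared = new Set(declaredTags.map(primarySubtag));
  const voiced = new Set(voices.map((v) => primarySubtag(v.lang)));
  return [...voiced].filter((lang) => !declared.has(lang)).sort();
}
```

For the example in the paragraph above, voices tagged en-US, ru-RU, and zh-CN checked against a declared ["en-US"] come back as ["ru", "zh"].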

What the browsers did about it

The privacy side noticed early. The first Mozilla bug arguing that the WebSpeech synthesis API exposes information about installed TTS engines was filed in 2015 by a reporter going by KOLANICH, and the concern was stated plainly: getVoices() exposes info about TTS engines installed in the system. That bug is, remarkably, still open in NEW status a decade later, because the underlying spec issue was never redesigned. What changed is the mitigation around it.

Firefox shipped the first concrete defense in Firefox 56, in 2017, tied to the privacy.resistFingerprinting preference. With that flag on, the browser takes a blunt approach. speechSynthesis.getVoices() always reports an empty list. The onvoiceschanged event is blocked. And speechSynthesis.speak() fails immediately, firing an error event rather than speaking. The work landed under the Tor Uplift project, the multi-year effort to fold Tor Browser’s anti-fingerprinting defenses back into mainline Firefox, and it was assigned to Tim Huang with Tom Ritter later confirming the empty-list behavior under RFP. Tor Browser itself goes further and disables the API outright, setting media.webspeech.synth.enabled to false, which removes speechSynthesis rather than emptying it.

The empty-list approach has an obvious cost. An empty list is itself a fingerprint, a small one, because most browsers are not configured this way. A site that sees zero voices learns that you are either on a voiceless platform or running fingerprinting resistance, and the second is a rarer and therefore more identifying state than having a normal voice set. This is the recurring bind of anti-fingerprinting: every defense that makes you different from the default crowd makes you more visible in a different dimension.
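From the page's side, the states discussed here are distinguishable with a check along these lines. The function name and return labels are illustrative, and win stands in for the window object. Because of the async loading quirk, "empty" is only meaningful once the list has had a chance to load.

```javascript
// Classify what a page sees: no speechSynthesis object at all, an object
// with an empty list, or an object with voices. Labels are hypothetical.
function speechSynthesisState(win) {
  if (!("speechSynthesis" in win) || !win.speechSynthesis) return "absent";
  return win.speechSynthesis.getVoices().length === 0 ? "empty" : "populated";
}
```

Neither "absent" nor "empty" identifies a specific browser by itself. Each only shrinks the set of configurations that could have produced it, which is the bind this section describes.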

Brave took a more elaborate route precisely to avoid that bind. Rather than emptying the list, Brave’s default protection applies what it calls farbling, the same per-session, per-origin randomization it uses across its fingerprinting defenses. For voices, the default behavior adds a farbling-determined fake voice name to the list, an alias keyed to the first real voice, so the list differs slightly and unpredictably between sites and sessions without breaking the API for pages that legitimately want to speak. The aggressive setting falls back to the Firefox approach and returns an empty list. The design was tracked in Brave’s issue 18062, opened in September 2021 by Peter Snyder (the pes10k account) and assigned to the pilgrim-brave developer, after an earlier and blunter request to simply disable the API (issue 5279, from 2019) was closed as invalid. The progression from “turn it off” to “perturb it per session” mirrors how the whole field matured, and it is the same philosophy behind the noise injection that FingerprintJS and the commercial detectors now have to account for.

Chrome and Safari, notably, have not shipped a comparable web-facing defense for the voice list by default. The information remains readable in a stock configuration of either, which is why the vector is still worth collecting in 2026 and still appears in commercial fingerprinting SDKs as a routine line item. The async quirk on Chrome is a side effect of how voices load, not a privacy measure, even though it happens to make naive spoofing detectable.

Closing: a property of the OS, surfaced by accident

The voice list is a small leak with an outsized clarity. It was never meant to identify anyone. It exists so a web page can offer a dropdown of voices before it reads an article aloud, and the fields on each voice object are there to populate that dropdown. But because the list is bundled by the operating system and surfaced verbatim, reading it is close to reading the OS’s own voice registry, complete with the internal identifiers in the voiceURI. The information was always there in the system; the API just handed the web a window onto it.

What makes the vector durable is the gap between the value and the mechanism. A tracker does not only care that you report eighteen macOS voices. It cares that you report them with the right timing, that a Chrome user-agent comes with the empty-then-populated voiceschanged sequence, that the voice URIs match the platform the rest of your fingerprint claims, and that a ru-RU voice in the list lines up with a timezone and an Accept-Language that also say Russia. Each of those is easy to fake in isolation and hard to fake all at once and consistently. The voice list is rarely the signal that catches you by itself. It is the signal that catches the seam where a forged environment stops agreeing with itself.

The most telling detail is the 2015 Mozilla bug still sitting open in 2026. The fix was never to redesign the API, only to empty it under a flag almost nobody enables. For the default browser on the default OS, the list is as readable today as it was eleven years ago, and the only thing that has changed is how many detectors now bother to read it.

