
The history of the CAPTCHA-solving economy: from human farms to ML solvers

Copyright: MIT

A CAPTCHA is supposed to be a test only a human can pass. The quiet fact that broke that promise is that humans are cheap, and they will sit at a terminal solving someone else’s puzzles for a fraction of a cent each. You do not need a machine that can read warped text if you can rent a person in Dhaka who already can, pay them a tenth of a cent, and wire the answer back to your bot in twenty seconds. That is the whole trick, and it is older than most of the CAPTCHAs it defeats.

So the history of CAPTCHA solving is two histories braided together. One is a labor story: a real-time piecework market that priced human attention down to roughly a dollar per thousand solves and routed it to whichever country was cheapest that quarter. The other is a machine-learning story: OCR, then segmentation attacks, then convolutional nets, then the audio breaks, and finally the multimodal models that read a grid of traffic-light tiles as easily as they read this sentence. The interesting part is how the two markets fed each other, and why the human one refused to die even after the machines got good.

What follows traces that arc through the primary record. The early OCR-and-human services. The 2010 UC San Diego paper that put real numbers on the wages. The generic text-CAPTCHA solvers that arrived around 2011 to 2014. Google’s own Street View net reading reCAPTCHA at accuracy no human matches. The audio-channel breaks that turned a speech-to-text API against the system that called it. hCaptcha turning the test itself into a data-labeling business. And the 2025-era agentic vision-language solvers that no longer need a farm at all. Through all of it, one number to keep in mind: the retail price of a solved CAPTCHA barely moved in fifteen years.

2006–2009: OCR, humans, and the first services

The earliest solving services did not announce themselves as a market. They grew out of two things happening at once. Distorted-text CAPTCHAs were everywhere by the mid-2000s, gating signups at Yahoo, Hotmail, and a thousand forums. And the same period produced the first practical optical-character-recognition attacks on those exact images.

The OCR side was never very hard for the easy schemes. A service like DeCaptcher advertised itself as one of the first image-to-text shops on the web, running custom OCR for the simple arithmetic and clean-font CAPTCHAs and routing anything harder to human operators. That hybrid shape became the template. The software solved what it could solve for nothing; people solved the rest. The customer never knew or cared which path their answer came down, because the API was identical either way: post an image, get back a string.
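The hybrid shape described above can be sketched in a few lines. This is a toy illustration, not any service's actual code: `software_path`, `human_path`, `route`, and `solve` are hypothetical names, the "image" is just ASCII bytes, and the rule that short payloads count as easy is an assumption made only so the example runs.

```python
from typing import Callable, Optional, Tuple

def software_path(image: bytes) -> Optional[str]:
    """Free path: answers 'easy' images, returns None for anything else.
    Toy rule (assumption): payloads of four bytes or fewer count as easy."""
    if len(image) <= 4:
        return image.decode("ascii").upper()
    return None

def human_path(image: bytes) -> str:
    """Paid path: a stand-in for handing the job to a worker queue."""
    return image.decode("ascii").upper()

def route(image: bytes,
          software: Callable[[bytes], Optional[str]] = software_path,
          human: Callable[[bytes], str] = human_path) -> Tuple[str, str]:
    """Try the free path first, fall back to the paid one.
    Returns (answer, path) so the operator can account for cost."""
    answer = software(image)
    if answer is not None:
        return answer, "software"
    return human(image), "human"

def solve(image: bytes) -> str:
    """Customer-facing call: post an image, get back a string.
    The path taken is deliberately not exposed."""
    answer, _path = route(image)
    return answer
```

The point of the split between `route` and `solve` is the one the text makes: the operator sees which path was billed, while the customer sees only the string.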

What made the human path viable was not technology at all. It was a labor arbitrage. Solving a CAPTCHA is unskilled, language-neutral, and trivially piece-rateable, which makes it one of the most outsourceable tasks imaginable. A service could put a job queue on the internet, let workers anywhere in the world pull images off it, and pay per correct answer. The work needed no training, no trust, and no fixed location. By the late 2000s the services were mature enough to publish SLAs on response time and accuracy, the way a real API vendor would.

Two names from this period still run today. Antigate, later folded into the Anti-Captcha brand, dates its operation to 2007. The service that became 2Captcha launched around the same window. Both started as text-CAPTCHA shops and both survived every technology shift since by treating the underlying puzzle as a black box. Whatever the challenge is, find someone or something that can solve it, charge a markup, keep the pipeline warm. That indifference to the actual mechanism is exactly why the industry outlived the CAPTCHAs it was built against.

2010: the paper that priced the market

The study that turned anecdote into data came out of UC San Diego in 2010. Re: CAPTCHAs — Understanding CAPTCHA-Solving Services in an Economic Context was presented at the 19th USENIX Security Symposium, and it remains the cleanest measurement of how the solving market actually worked. The authors did the obvious thing nobody had documented rigorously: they became customers. They bought solving capacity from eight services, measured response time and accuracy, and traced the labor behind the prices.

The retail numbers were already low. The cheapest services, mostly Russian-fronted, charged around a dollar per thousand solved CAPTCHAs. DeCaptcher’s standard rate sat near two dollars per thousand. The most expensive shop in the study charged twenty. Median response time across services was roughly twenty seconds, and error rates stayed under twenty percent even at the bottom of the price range. For a dollar you could buy a thousand human solves, delivered in under half a minute each, accurate four times out of five or better. The paper’s blunt framing is the line worth keeping: a CAPTCHA should be regarded not as a technological impediment but an economic one, because a mature solving industry already bypassed the technology completely.

The labor side is where the paper bites. Solving was sourced from the cheapest available market, and the cheapest market moved over time. The authors watched advertisers shift recruitment from Eastern Europe toward Bangladesh, China, India, and Vietnam as wages in the first markets ticked up. Because the task is unskilled and remote, the floor on what a worker could be paid was set by the poorest country with reliable internet, not by anything intrinsic to the work. Wages per thousand solves had already fallen from around ten dollars in the early years to one or two by the time of the study.

[Figure: a bot or customer hits a gated form and posts the image to the solving service (API + job queue). Easy images take the free OCR / model path; hard ones go to a human worker at ~$1/1000. The text answer returns through the same API.]

*The hybrid pipeline the early services standardized: software solves what it can for free, humans solve the rest, and the customer's API call looks identical either way.*

There is a detail in the paper that ages well. Human solvers in the study were estimating somewhere between a thousand and fifteen hundred solves in an eight-hour day. Run the arithmetic against the wages and the daily take is brutal. Years later, when a researcher signed up as a worker on 2Captcha and wrote it up for F5 Labs, the figures had if anything gotten worse: thirty cents per thousand traditional CAPTCHAs paid to the solver, against the dollar or more the service charged the customer. Solving for eleven hours straight came out near a dollar twenty for the day. The worker captured something like three to four percent of what the customer paid. The rest was the service’s margin on running the queue.
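The arithmetic is short enough to write out. The sketch below uses only figures quoted in this article and assumes the one-to-two-dollar wage per thousand applies to the workers doing a thousand to fifteen hundred solves a day; the function name is illustrative, and amounts are kept in whole cents so the numbers stay exact.

```python
def daily_take_cents(solves_per_day: int, wage_cents_per_thousand: int) -> float:
    """Daily earnings in US cents for a given volume and piece rate."""
    return solves_per_day * wage_cents_per_thousand / 1000

# UCSD-era range: 1,000 to 1,500 solves a day at $1 to $2 per thousand.
low = daily_take_cents(1000, 100)
high = daily_take_cents(1500, 200)
print(f"2010 range: ${low / 100:.2f} to ${high / 100:.2f} per day")

# F5 Labs write-up: $0.30 per thousand, about $1.20 for an eleven-hour day.
# Working backwards gives the volume that day's pay implies.
implied_solves = 120 * 1000 // 30
solves_per_hour = implied_solves / 11
print(f"Implied volume: {implied_solves} solves, about {solves_per_hour:.0f} per hour")
```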

This is the through-line for everything that follows. The solving economy was never bottlenecked on whether a CAPTCHA could be read. It was bottlenecked on price, and the price was set by the global wage floor, which is very low. A defender raising the difficulty of a puzzle did not break the market. It just nudged a few more images from the free OCR path to the paid human path, which raised the attacker’s cost from almost nothing to slightly more than almost nothing.

2011–2014: the generic solvers arrive

The machine side caught up in stages, and the stages matter because each one collapsed a different assumption about what made text CAPTCHAs hard.

The first assumption to fall was that you needed a custom attack per scheme. Early breaks were artisanal. A researcher would study Yahoo’s particular distortions, hand-build a segmenter for them, and publish a result that did not transfer to the next site. The 2011 work on text-CAPTCHA strengths and weaknesses started turning that into a method. It identified segmentation, the step of cutting a blob of overlapping glyphs into individual characters, as the real weak point. Recognizing a single clean character was a solved problem by the late 2000s. The security of a text CAPTCHA lived entirely in how hard it was to chop the word into letters in the first place. That paper also made the uncomfortable observation that pushing distortion far enough to stop a machine pushed human success rates down below twenty percent, which is no longer a usable test.

The decisive result came in 2014. A generic solver, built by a team that included Google’s own anti-abuse researchers, dropped the per-scheme segmentation pipeline for a single machine-learned model that handled segmentation and recognition together. Trained the same way across many designs, it reached around thirty-three percent on reCAPTCHA’s distorted-text challenges, thirty-nine percent on Baidu, and fifty-one percent on the scheme used by CNN, the news site, with no scheme-specific hand-tuning. A solver does not need to be right every time. At a one-in-three success rate and effectively zero marginal cost, distorted text is finished as a defense, because the attacker simply retries.
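The retry argument is a geometric-distribution calculation. The sketch below assumes every attempt is independent and free, which is a simplification (a defender can rate-limit or penalize failures), and uses the per-attempt accuracies as rounded in the text.

```python
def expected_attempts(p: float) -> float:
    """Mean number of tries until the first success (geometric distribution)."""
    return 1.0 / p

def success_within(p: float, tries: int) -> float:
    """Probability of at least one success in `tries` independent attempts."""
    return 1.0 - (1.0 - p) ** tries

# Per-attempt accuracies as rounded in the text.
rates = {"reCAPTCHA": 0.33, "Baidu": 0.39, "CNN": 0.51}
for scheme, p in rates.items():
    print(f"{scheme}: ~{expected_attempts(p):.1f} tries on average, "
          f"{success_within(p, 5):.0%} chance of passing within 5")
```

Under those assumptions a one-in-three solver needs about three tries on average, which is why per-attempt accuracy well below human level was still enough.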

[Figure: Generic single-model solver accuracy, 2014 (one model, no per-scheme tuning). CNN 51.1%, Baidu 38.7%, reCAPTCHA 33.3%, Yahoo 5.3%.]

*A solver does not need to win every time. At a third success rate and near-zero marginal cost, the attacker just retries until it gets through.*

Months before that solver paper, Google had already said the quiet part in public. Its Street View team and reCAPTCHA team published a joint result in April 2014 showing that the convolutional network built to read house numbers from street imagery could read the hardest distorted-text reCAPTCHA at 99.8 percent accuracy. The system meant to tell humans from machines was being read more reliably by a machine than by the humans it was protecting. Google’s own framing in that announcement was that distorted text was no longer something it could lean on, and that reCAPTCHA would shift to scoring a broad range of behavioral cues instead. That shift, announced months later as the “No CAPTCHA reCAPTCHA” checkbox, was the defense conceding the original game and changing what it measured.

This is the moment the two markets diverged in an interesting way. Once the puzzle moved from “read this text” to “convince a risk engine you are human,” the pure-OCR solving services lost their grip. You could read the image perfectly and still fail, because the verdict no longer depended on the answer. It depended on the cookies, the mouse path, the browser fingerprint, the IP reputation, everything the reCAPTCHA v3 scoring pipeline rolled into a single risk number. The solving economy’s response was not to get smarter at images. It was to get a real browser, real residential IPs, and a token-passing flow, which is a different and more expensive game. That story runs through the browser fingerprinting and anti-detect-browser histories, and it is why a modern solving service quotes a higher price for a reCAPTCHA token than for a plain image.

2017–2019: the audio channel and the speech-to-text irony

There was a back door in reCAPTCHA the whole time, and it existed for a good reason. Accessibility. A visual challenge excludes blind users, so reCAPTCHA offered an audio alternative: a recording of spoken digits you transcribed instead of reading tiles. The audio path had to be solvable by a human ear, which meant it had to be clean enough for a speech model too.

In 2017 a team at the University of Maryland built unCaptcha to exploit exactly that. The system took the audio challenge, fed it to a handful of free online speech-to-text engines, combined their outputs with a phonetic mapping step to fix the digits the engines mangled, and submitted the answer. It solved reCAPTCHA’s audio challenge at 85.15 percent accuracy in about 5.42 seconds on average, measured over more than 450 live challenges. The resources required were close to nothing. The whole point of the paper was that you did not need a research budget or a GPU farm, just public APIs and a clever post-processing trick.
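The post-processing idea, several transcribers plus a phonetic mapping, can be illustrated on plain strings. This is a toy sketch, not the paper's implementation: the homophone table is invented for the example, the function names are hypothetical, and no speech engine is called.

```python
from collections import Counter
from typing import List, Optional

# Illustrative homophone table (assumption); the paper's mapping is not reproduced.
PHONETIC = {
    "zero": "0", "oh": "0", "one": "1", "won": "1",
    "two": "2", "to": "2", "too": "2", "three": "3", "tree": "3",
    "four": "4", "for": "4", "five": "5", "six": "6",
    "seven": "7", "eight": "8", "ate": "8", "nine": "9",
}

def to_digit(word: str) -> Optional[str]:
    """Map one transcribed token to a digit, or None if it is unrecognized."""
    w = word.strip().lower()
    if len(w) == 1 and w in "0123456789":
        return w
    return PHONETIC.get(w)

def vote(candidates: List[str]) -> Optional[str]:
    """Majority vote over several transcriptions of the same spoken digit."""
    digits = [d for d in map(to_digit, candidates) if d is not None]
    if not digits:
        return None
    return Counter(digits).most_common(1)[0][0]
```

For example, `vote(["for", "4", "fore"])` settles on `"4"`: two transcriptions agree once the homophone is normalized, and the unrecognized third is ignored.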

Google responded by changing the audio challenge from spoken digits to spoken phrases, on the theory that full phrases would be harder to transcribe and reassemble. That theory did not survive contact. By 2019 the same group released unCaptcha2, which broke the phrase-based audio challenge at around 90 percent, higher than the original, partly by pointing Google’s own Speech-to-Text API at Google’s own CAPTCHA audio. The team had been in contact with the reCAPTCHA team for over six months before release. Google classified the weakness as out of scope for its bug bounty.

The audio break is the cleanest illustration of a structural problem with CAPTCHA design. Any channel you open for genuine humans who cannot use the main channel is a channel an attacker can use too, and the accessibility channel is held to a usability bar that all but guarantees a machine can clear it. You cannot make the audio noisy enough to stop a bot without failing the blind user first; at that point the test inverts, passing machines and rejecting people. Solving services absorbed audio attacks the same way they absorbed everything else. It became one more code path behind the same API, priced into the per-token rate.

2017–2020: hCaptcha and the labeling inversion

While the breaks accumulated, a different idea reshaped the defender’s side, and it changed the economics of the whole field. If users are going to spend collective millennia clicking on images anyway, that clicking is labeled training data, and labeled training data is worth money.

hCaptcha launched out of Intuition Machines, a company founded in 2017 by Eli-Shaoul Khedouri. The origin is almost too on-the-nose. The team became a heavy buyer of human image-annotation labor for its own machine-learning work, tried staffing an annotation team directly, found the workload too spiky to keep people busy, and realized the cheapest annotation workforce on earth was already sitting in front of CAPTCHA widgets for free. So they built a CAPTCHA whose challenges double as labeling tasks and sold the resulting datasets to third parties. Websites running hCaptcha could even take a cut. The test stopped being a pure cost center and became a two-sided market.

This is the same von Ahn insight that made the original reCAPTCHA digitize books and Street View numbers, turned into a standalone business and aimed at the open data-labeling market rather than one company’s internal corpus. When Cloudflare moved off reCAPTCHA in 2020, it chose hCaptcha, citing privacy and the fact that Google had started charging for reCAPTCHA at scale. For a stretch, hCaptcha became the default challenge behind a very large slice of the web.

The labeling inversion has a strange consequence for the solving economy. Every CAPTCHA solved, by a human farm or a model, is also a labeled example. The defender is training on the same images the attacker is solving. And once a class of challenge has been answered a few million times, a solver can train on those answers too. The dataset that makes the defense valuable is the same dataset that erodes it. That tension never resolved. It just moved into the next generation of challenges, the behavioral and proof-of-work ones that try not to depend on a solvable image at all, which is the territory the Cloudflare Turnstile and proof-of-work approaches occupy now.

2020–2026: solvers eat the image, and the farm gets a competitor

For most of this history the human farm and the machine solver were complementary. Software handled the cheap cases; humans handled the rest. Around 2023 the boundary started moving fast, because the machines got good enough to handle most of the rest too.

Multimodal large language models are the reason. A model that reads an arbitrary image and answers questions about it does not care whether the image is a photo, a chart, or a grid of nine tiles asking which contain a bus. The CAPTCHA’s core assumption, that a particular visual recognition task is hard for software and easy for people, is precisely the assumption a general vision model dissolves. By 2025 the solving services had reorganized around this. The marketing language shifted from “human-powered” to AI and LLM pipelines that pick a model per challenge type, and the older services bolted vision-model paths onto their existing APIs the way they had bolted on OCR fifteen years earlier.

The research numbers are uneven, which is itself informative. A general-purpose benchmark of multimodal agents against a wide spread of live challenge types still humbles the best models. On one 2025 web-based benchmark the strongest single model cleared only forty percent, and several well-known models landed between five and twenty. A general model handed a novel, well-designed challenge cold is not a reliable solver. But a purpose-built agentic system is a different animal. One 2025 USENIX result reported an agentic vision-language solver clearing 60.7 percent across twenty-six visual CAPTCHA types in a controlled set, and, more pointedly, 70.6 percent on previously unseen challenges pulled from real CAPTCHA farms over a thirty-day window. That second number is the one defenders should read twice. The machine was beating the live distribution of challenges that human farms were being paid to solve.

[Timeline: 2007 Antigate / human farms; 2010 USENIX prices the market; 2014 generic solver and 99.8% Street View; 2017 unCaptcha audio break; 2020 hCaptcha + Cloudflare; 2025 agentic VLM solvers.]

*Two decades on one line. The human-farm era starts it, the machine breaks punctuate it, and by the end the model is beating the live distribution of challenges the farms were paid to solve.*

There is a cost wrinkle that keeps the farm alive even now. Running a frontier vision model on every challenge is not free; the per-image cost of a large multimodal model can run one to two orders of magnitude above a purpose-trained computer-vision classifier. So a rational solving service still routes the easy, high-volume, well-understood challenges to cheap small models or to humans, and reserves the expensive general model for the novel cases. The hybrid pipeline from 2006 is intact. Only the boxes changed. Free OCR became cheap CV, the human tail became a vision-model tail with a human fallback, and the customer’s API call still looks the same.
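The routing logic comes down to cost per successful solve, which is cost per attempt divided by success rate. Every number below is a made-up placeholder chosen only to respect the rough ratios in the text (a large model costing 25 times a small classifier per image, a human at a dollar per thousand); the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_attempt: float  # USD, placeholder value
    success_rate: float      # fraction of attempts that pass, placeholder value

    @property
    def cost_per_solve(self) -> float:
        """Expected spend per successful answer."""
        return self.cost_per_attempt / self.success_rate

def cheapest(tiers: List[Tier]) -> Tier:
    """Pick the tier with the lowest expected cost per solve."""
    viable = [t for t in tiers if t.success_rate > 0]
    return min(viable, key=lambda t: t.cost_per_solve)

# A familiar, high-volume challenge: the small classifier is accurate.
familiar = [
    Tier("small_cv", 0.00002, 0.90),
    Tier("large_vlm", 0.0005, 0.70),
    Tier("human", 0.001, 0.90),
]
# A novel challenge: the small classifier mostly fails.
novel = [
    Tier("small_cv", 0.00002, 0.01),
    Tier("large_vlm", 0.0005, 0.70),
    Tier("human", 0.001, 0.90),
]
print(cheapest(familiar).name, cheapest(novel).name)
```

With these placeholders the small classifier wins on the familiar challenge and the large model wins on the novel one. Shift the assumed prices or accuracies and the human tier wins instead, which is the sense in which the farm survives as a fallback.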

The other reason farms persist is the part of the modern challenge that is not an image at all. The hardest commercial systems stopped relying on the puzzle years ago. A current reCAPTCHA v3 or enterprise deployment, or a DataDome or Akamai Bot Manager verdict, is a risk score assembled from TLS fingerprint, header order, behavioral telemetry, IP reputation, and a dozen other signals before any visible challenge appears. A model that reads the tiles perfectly still loses if the request around it looks automated. That is why the solving economy quietly became a proxy-and-browser economy, and why the genuinely hard problem moved from “solve the image” to “look like a real session,” which is a fingerprinting problem more than a vision one.

What the two decades actually show

Strip the technology changes away and the same shape repeats at every layer. A CAPTCHA tries to find a task that is cheap for humans and expensive for machines. The solving economy finds the cheapest supplier of that task, whoever or whatever it is, wraps it in an API, and sells the answer for a markup. When the cheapest supplier was a worker in a low-wage market, the price floor was a global labor floor of roughly a dollar per thousand. When the cheapest supplier became a small vision model, the floor fell toward the cost of inference. The defender’s improvements never raised that floor by much. They mostly moved demand from one supplier to another.

The labor story is the part that should sit uncomfortably. For fifteen years the engine under the solving market was people doing mind-numbing piecework for cents an hour, keeping three or four percent of what the customer paid, in whichever country was poorest that quarter. The machine-learning breaks did not free those workers so much as compete them out of the easy tiers while leaving them the hard residue. And on the defender’s side, the most successful business model of the era was the labeling inversion that reCAPTCHA pioneered and hCaptcha turned into a standalone business. It monetized the unpaid clicking of ordinary users so thoroughly that one 2023 study put reCAPTCHA’s total human cost at 819 million hours and at least 6.1 billion dollars in equivalent wages, and called the system a tracking and data operation wearing a security badge. Whichever side you stand on, the value was extracted from human attention that nobody priced honestly.

The unfinished part is what happens when the marginal cost of a solve approaches zero from the machine side. A test that distinguishes humans from machines by asking them to perform a task cannot survive a machine that performs the task as well as a human, and the tasks fell one by one: clean text, distorted text, audio, object grids. What is left for defenders is not a better puzzle. It is everything around the puzzle, the session-level signals a solving API cannot easily forge, plus cryptographic schemes that attest to a real device or a sanctioned agent rather than testing a skill at all. The puzzle was always the part the market could buy. It just took twenty years and a pile of speech-to-text APIs to prove that the price was a rounding error.

