
Building a CAPTCHA-solving pipeline: human farms, ML solvers, and the cost curve

Copyright: MIT
A pipeline diagram showing a captcha routed to either a human farm or an ML solver, returning a token, with the per-solve cost highlighted orange

A CAPTCHA is supposed to be a wall. On one side, a human who can read warped text or pick out the bicycles. On the other, a bot that cannot. The whole premise is that solving the puzzle proves something the machine cannot fake. So it is a little strange that you can buy a solved CAPTCHA for roughly a tenth of a cent, in bulk, over a documented HTTP API, with a dashboard and a support queue. The wall has a turnstile in it, and the turnstile has a price list.

That price list is the subject here. Not how to defeat any particular challenge, and not a single line of working bypass code, but how the act of solving got turned into an industrial process with a cost curve, a latency budget, and a measurable accuracy floor. The interesting question is not “can a CAPTCHA be solved” (it can, all of them, for money) but how the pipeline behind that solve is built, what each stage costs, and why the price has barely moved in fifteen years even as the technology underneath it flipped from human eyes to neural nets.

Here is the route. First comes the original economic study that named the market, because the numbers it measured in 2010 are still the anchor everyone reasons from. Then the two engines of a modern solving pipeline, the human farm and the ML solver, and how a real service blends them. Then the part that matters most to anyone actually running this and is least understood from the outside: a solved token is worthless unless it is delivered inside the exact session, IP, and fingerprint the challenge expected. Then the cost curve itself, with current per-thousand prices and the latency that comes attached. Last, a closing note on what fifteen years of flat pricing tells you about where the defense is actually being mounted.

The 2010 study that set the baseline

The market was already mature when academics first measured it carefully. A 2010 USENIX Security paper, Re: CAPTCHAs — Understanding CAPTCHA-Solving Services in an Economic Context, bought solves from the services of the day and instrumented the whole thing. The numbers it reported are the baseline the entire industry is still measured against.

Retail price for a thousand human-solved CAPTCHAs sat around one dollar at the cheapest Russian-fronted services, with Decaptcher charging two dollars and a few specialty endpoints higher. Median response time across the eight services they tested was about twenty seconds, and over eighty percent of solves came back inside thirty. Error rates stayed under twenty percent, with the worst service near that ceiling and most comfortably below it. The largest provider they measured, Antigate, was clearing roughly forty-one solutions per second, which the authors estimated meant four to five hundred workers online at once. A single worker, they reckoned, could clear something like a thousand to fifteen hundred CAPTCHAs in an eight-hour shift.
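The worker estimate can be sanity-checked with Little's law: the number of workers busy at once equals throughput times the time each solve takes. The sketch below is a back-of-envelope check using only the figures quoted above. The function names are illustrative and do not come from the paper.

```python
# Back-of-envelope check on the 2010 figures quoted above.
# Little's law: workers busy at once = throughput * time per solve,
# so time per solve = workers / throughput.


def implied_seconds_per_solve(workers: int, solves_per_second: float) -> float:
    """Seconds each worker spends per solve if every worker is kept busy."""
    return workers / solves_per_second


def shift_average_seconds_per_solve(solves_per_shift: int, shift_hours: float = 8.0) -> float:
    """Average seconds per solve across a whole shift, idle time included."""
    return shift_hours * 3600 / solves_per_shift


if __name__ == "__main__":
    antigate_rate = 41.0  # solutions per second, as reported for Antigate
    for workers in (400, 500):
        secs = implied_seconds_per_solve(workers, antigate_rate)
        print(f"{workers} workers at peak -> {secs:.1f} s per solve")
    for per_shift in (1000, 1500):
        secs = shift_average_seconds_per_solve(per_shift)
        print(f"{per_shift} solves per shift -> {secs:.1f} s per solve on average")
```

On these figures the peak-rate estimate implies roughly 10 to 12 seconds per solve, while the per-shift estimate averages about 19 to 29 seconds. The two need not agree: one describes workers at peak load, the other averages over a full shift.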

The labor side is where the study earned its title. The work followed the wage gradient. The authors found strong concentrations of workers in Russia, Eastern Europe, China, and India, with the cheapest capacity flowing to wherever a thousand solves was worth a worker’s time. At a dollar of retail revenue per thousand and a chain of intermediaries taking cuts, the person actually typing the answer was earning a few cents an hour by the math, and the paper noted that the solving task is the kind of unskilled, fully remote piecework that races to the lowest-cost labor market available. That is the quiet finding underneath the technical one. A CAPTCHA does not raise the cost of an attack to the attacker’s local wage. It raises it to the cheapest wage on earth at which someone will look at a picture.
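The wage arithmetic is short enough to write down. This is a minimal sketch: the retail price and shift throughput are the figures quoted above, while `worker_share` (the fraction of retail revenue that reaches the person typing) is a hypothetical parameter chosen for illustration, not a number from the study.

```python
def hourly_earnings(
    retail_per_thousand: float,
    worker_share: float,
    solves_per_shift: int,
    shift_hours: float = 8.0,
) -> float:
    """Dollars per hour earned by a worker paid per solve.

    worker_share is an assumed fraction (0..1) of retail revenue left
    after intermediaries take their cuts.
    """
    pay_per_solve = retail_per_thousand / 1000 * worker_share
    return pay_per_solve * solves_per_shift / shift_hours


if __name__ == "__main__":
    # Assumption for illustration: half of a $1-per-thousand price reaches the worker.
    for per_shift in (1000, 1500):
        wage = hourly_earnings(1.00, 0.5, per_shift)
        print(f"{per_shift} solves per shift -> ${wage:.3f} per hour")
```

Under that assumed 50 percent share, the result is between six and ten cents an hour. Even with the whole dollar passed through, the figure stays under twenty cents.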

Figure: Fifteen years of (roughly) flat pricing. Retail price per thousand solves on a $0 to $3 axis, across four eras: 2010 human, 2017 human + audio, 2023 hybrid, 2026 ML-first.

*The retail price for a thousand solves has stayed near one to three dollars since 2010 even as the engine underneath flipped from human labor to neural nets. Flat price, changing cost structure. That gap is where the margin moved.*

The human farm, as a system

Strip the marketing away and a human-solver service is a job queue with a global labor pool attached. The architecture has not changed much since the 2010 study described it, because the problem has not changed: take a challenge from a client who needs it solved now, get it in front of a person who can solve it, return the answer before the challenge expires.

A client integrates by calling an API. The submission carries whatever the challenge needs: for a distorted-text or image CAPTCHA, the actual image bytes or a URL; for the interactive Google and hCaptcha widgets, not an image at all but the site key, the page URL, and the parameters the widget was instantiated with. The service drops that into a queue. On the other side, workers are connected through a web dashboard or a desktop client that pulls the next job, shows it, and takes their answer. For image and text challenges the worker types what they see. The answer goes back up the queue, the client polls for it, and the transaction closes. The client never knows whether a person or a model produced the answer, and the pricing rarely tells them.
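The queue in the middle can be modeled abstractly. The toy below is a generic deadline-aware job queue, not any service's actual design: the earliest-deadline-first ordering, the class names, and the `ttl_seconds` field are all assumptions made for illustration. It shows only the one property the description depends on, that a job nobody picks up before its challenge expires is worthless and gets dropped.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class Job:
    deadline: float                      # time after which the answer is useless
    seq: int                             # tie-breaker so equal deadlines stay ordered
    payload: dict = field(compare=False)


class DeadlineQueue:
    """Hands out the job closest to expiry first and discards expired ones."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = itertools.count()

    def submit(self, payload: dict, ttl_seconds: float, now: float) -> int:
        job = Job(now + ttl_seconds, next(self._seq), payload)
        heapq.heappush(self._heap, job)
        return job.seq

    def next_job(self, now: float) -> Optional[Job]:
        # Pop until a job that has not yet expired is found.
        while self._heap:
            job = heapq.heappop(self._heap)
            if job.deadline > now:
                return job
        return None


if __name__ == "__main__":
    q = DeadlineQueue()
    q.submit({"kind": "image"}, ttl_seconds=120, now=0.0)
    q.submit({"kind": "widget"}, ttl_seconds=30, now=0.0)
    print(q.next_job(now=10.0).payload)  # the job closest to expiry
    print(q.next_job(now=200.0))         # remaining job has expired
```

Passing `now` explicitly keeps the model deterministic. A real system would read a clock and would also need to return answers to the caller, which this sketch leaves out.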

The piecework economics that the 2010 paper measured still govern the worker end. Pay is per solve, fractions of a cent, settled in bulk. Throughput per worker is bounded by human reading speed, which is why the per-worker numbers from 2010, on the order of a thousand-plus solves a shift, are still roughly right. To scale a farm you do not make workers faster, you add workers, and the only lever that adds workers cheaply is reaching further down the wage curve. That is the structural reason a purely human farm cannot get much below the dollar-per-thousand floor. There is a person in the loop and the person has to eat.

What the human farm buys you, and the reason it has never fully died, is generality. A person can solve a challenge type the service has never seen before. Throw a brand-new puzzle variant at an ML pipeline and it fails until someone trains a model; throw it at a human and they squint and solve it. The 2010 study caught this directly: workers handed a novel CAPTCHA design improved their accuracy over a couple of weeks of exposure with no code changes at all. Humans are the fallback that covers the long tail of weird challenges and the first response to anything new. That is a real capability, and it is why even aggressively ML-first services keep a human pool warm behind the model.

The ML solver, and where it actually wins

The other engine is a model. For the challenge classes that have stable structure, machine solving is now both cheaper and faster than a person, and the academic record makes the scale of that clear.

Text CAPTCHAs fell first and hardest; segment-and-classify pipelines, then end-to-end CNN and recurrent models, chewed through warped-character schemes until the format was essentially retired from serious use. Image-grid challenges took longer but went the same way. In 2024 a group at ETH Zurich published Breaking reCAPTCHAv2 at COMPSAC, reporting that a pipeline built on a fine-tuned YOLOv8 object detector could clear one hundred percent of the image challenges it was given, against the sixty-eight to seventy-one percent that earlier published attacks had managed. The model recognizes the bicycles and crosswalks as well as a person does, and faster.

The most important result in that paper is not the hundred percent. It is what they found about how few challenges they had to solve at all. With a normal browsing history and cookies present, the widget served a median of two image challenges before passing; without that history, a median of five, with a mean climbing past eight. And the difference between a human tester and their bot, measured by how many challenges each had to clear, was not statistically significant. The image puzzle, in other words, had already stopped being the thing that decides. The decision was made upstream from the picture, on the reputation of the session, and the puzzle was mostly theater for whoever the upstream signals could not place. That finding reframes the whole solving problem and it is worth holding onto for the next section.

Two engines, one queue

| | human farm | ML solver |
|---|---|---|
| novel challenge | solves it cold | fails until retrained |
| cost per 1000 | wage-floored, ~$1+ | cents, near-zero marginal |
| latency | ~20s, reading speed | single-digit seconds |
| throughput scaling | add workers | add GPUs |
| role in pipeline | fallback, long tail | first pass, bulk |

*The two engines have opposite strengths. The model is cheap, fast, and brittle to anything it has not seen; the human is slow, wage-floored, and general. A real service runs the model first and keeps the humans for the cases it cannot place.*

Audio was the side door, and it is the cleanest example of a defense being turned against itself. The accessibility track on reCAPTCHA offers a spoken alternative to the visual puzzle. In 2017 a team at the University of Maryland published unCaptcha at WOOT, which grabbed the audio clip, fed it to free public speech-to-text services, and typed back the transcription. They reported 85.15 percent accuracy at an average of 5.42 seconds per solve. Google moved from spoken digits to spoken phrases in response; the follow-up, unCaptcha2, adapted and pushed the success rate to around ninety percent. The lesson held: when a system offers a machine-readable channel for accessibility reasons, a machine can read it, and speech recognition got good enough that the audio path became the easiest path rather than the hard one.

In practice no serious service runs one engine. The modern shape is a cascade. A submitted challenge hits the cheap automated path first, a model or a scripted browser-emulation routine, and only the cases that fail or fall below a confidence threshold get escalated to a human. That is the hybrid model, and it is why the published price for a category like reCAPTCHA can sit at a couple of dollars per thousand while the service’s actual cost to produce most of those solves is a fraction of a cent. The price is set by the human fallback and the willingness-to-pay of the buyer; the cost is set by how often the model can avoid calling a human. The margin lives in that gap, and it widens every time the models get better.
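The gap between price and cost in a cascade comes down to one line of arithmetic. In the sketch below every job pays for the automated pass, and only the escalated fraction also pays for a human. All three inputs (`model_cost`, `human_cost`, `escalation_rate`) are hypothetical values chosen to illustrate the shape, not measured figures from any service.

```python
def blended_cost_per_thousand(
    model_cost: float, human_cost: float, escalation_rate: float
) -> float:
    """Expected cost of a thousand solves when the model runs first.

    escalation_rate is the fraction (0..1) of jobs the model cannot
    place and hands to a human.
    """
    return model_cost + escalation_rate * human_cost


def margin_per_thousand(
    price: float, model_cost: float, human_cost: float, escalation_rate: float
) -> float:
    """Retail price minus blended cost, per thousand solves."""
    return price - blended_cost_per_thousand(model_cost, human_cost, escalation_rate)


if __name__ == "__main__":
    # Hypothetical numbers: $2.00 retail, $0.05 model pass, $1.00 human pass.
    for rate in (0.50, 0.20, 0.05):
        margin = margin_per_thousand(2.00, 0.05, 1.00, rate)
        print(f"escalation {rate:.0%} -> margin ${margin:.2f} per thousand")
```

With the price held fixed, each drop in the escalation rate goes straight to margin, which is the widening gap described above.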

Why a solved token is usually not enough

Here is the part that separates people who have run a solving pipeline from people who have only read the price list. Buying the solve is the easy half. Getting the solve to count is the hard half, and the reason is that on the modern interactive challenges, the answer the service returns is not a verdict. It is a token, and the token is bound.

When a service solves a reCAPTCHA or an hCaptcha or a Turnstile widget, what comes back over the API is the response token, the value the page would have put in the g-recaptcha-response field (or the equivalent) after a real user passed. Your client injects that token into the form or the callback and submits. But the issuing system can bind that token to the context in which it was minted. The IP address that requested the challenge. The browser fingerprint. The TLS and HTTP/2 characteristics of the connection. The session cookies in flight. If the token is solved in one context and redeemed in another, a sufficiently strict verifier rejects it, and you have paid for a solve that buys nothing.

This is why a CAPTCHA failure is so often not a solving failure at all. A scrape that gets blocked again within a few requests of a clean solve is almost never being beaten at the puzzle. It is being beaten on the IP reputation or the fingerprint mismatch that triggered the challenge in the first place and is still triggering after it. The token was real. The session it was redeemed into was not the session it was issued for. The defenses that matter here are documented in our writeups of the Cloudflare cf_clearance cookie and Turnstile’s internals, where the cleared token is explicitly tied to a fingerprint, an IP, and a short lifetime, and on the reputation side in how anti-bot vendors detect residential proxies and ASN reputation.

Figure: The token is bound to its context. Your client sends the sitekey and URL to the solving service (ML, then human), receives a token, then injects and submits it to the target site, which verifies the token. At redemption the verifier can check:

IP: did the request come from the IP that got the challenge?
fp: does the browser fingerprint match?
tls: same TLS / HTTP-2 signature on the connection?
ttl: is the token still inside its short lifetime?

Any mismatch and a real, paid-for solve is rejected.

*A solving service returns a token, not a pass. Whether that token counts depends on the IP, fingerprint, TLS signature, and session it is redeemed into matching the ones it was issued for. This is why most solve failures are really proxy or fingerprint failures.*
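Seen from the verifier's side, the four checks amount to comparing the context a token was issued in with the context it is redeemed in. The sketch below is a hypothetical model of that comparison, written only to make the failure modes concrete. The field names and the strict equality rules are assumptions and do not describe any vendor's actual verification logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Context:
    ip: str
    fingerprint: str
    tls_signature: str


@dataclass(frozen=True)
class IssuedToken:
    context: Context     # context recorded when the challenge was passed
    issued_at: float     # seconds, on any consistent clock
    ttl_seconds: float   # short lifetime of the token


def binding_failures(token: IssuedToken, redeemed_in: Context, now: float) -> List[str]:
    """Return the names of every binding check that fails (empty list = accept)."""
    failures = []
    if redeemed_in.ip != token.context.ip:
        failures.append("ip")
    if redeemed_in.fingerprint != token.context.fingerprint:
        failures.append("fp")
    if redeemed_in.tls_signature != token.context.tls_signature:
        failures.append("tls")
    if now - token.issued_at > token.ttl_seconds:
        failures.append("ttl")
    return failures


if __name__ == "__main__":
    issued = Context("203.0.113.7", "fp-a", "tls-a")
    token = IssuedToken(issued, issued_at=0.0, ttl_seconds=120.0)
    print(binding_failures(token, issued, now=30.0))
    print(binding_failures(token, Context("198.51.100.9", "fp-a", "tls-a"), now=300.0))
```

A token that is valid in every cryptographic sense still fails here the moment the redeeming request arrives from a different address or after the lifetime has lapsed.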

The operational consequence is that the solve has to happen inside the same session that will use it. If the scraper is running through a residential proxy, the challenge must be requested and the token minted through that same proxy IP, and the session has to carry the request through to redemption without the IP rotating underneath it. This is the exact tension covered in sticky sessions versus rotating IPs: rotation is great for spreading load and terrible for anything that binds a credential to an address. A token-bound flow wants a sticky session for the whole solve-and-redeem window, then can rotate. The cost of the solve is the line item on the invoice. The cost of the infrastructure that makes the solve redeemable, clean residential IPs and a consistent fingerprint, is usually larger and shows up nowhere on the solving service’s price list. The honest version of CAPTCHA economics has to count both, which is the through-line of the economics of a scraping operation.
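Counting both line items is simple once the unit is the usable token and not the purchased one. In this sketch, `infra_per_thousand` (proxy and fingerprint infrastructure spend per thousand attempts) and `redemption_rate` (the share of purchased tokens the target actually accepts) are hypothetical inputs, and the example values are illustrative only.

```python
def cost_per_thousand_usable(
    solve_per_thousand: float, infra_per_thousand: float, redemption_rate: float
) -> float:
    """Cost of a thousand tokens that actually redeem.

    Both the solve and the infrastructure are paid on every attempt,
    including attempts whose token is later rejected.
    """
    if not 0 < redemption_rate <= 1:
        raise ValueError("redemption_rate must be in (0, 1]")
    return (solve_per_thousand + infra_per_thousand) / redemption_rate


if __name__ == "__main__":
    # Hypothetical: $1.45 invoiced per thousand solves, $4.00 of infrastructure
    # per thousand attempts, and 80% of tokens accepted at redemption.
    invoice_only = 1.45
    full = cost_per_thousand_usable(1.45, 4.00, 0.80)
    print(f"invoice says ${invoice_only:.2f}, usable tokens cost ${full:.2f} per thousand")
```

The invoice figure is the smaller term in the numerator, and a falling redemption rate inflates the total however cheap the solve itself is.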

The cost curve in 2026

With the binding caveat in hand, the headline numbers are easy to state. Published retail pricing in 2026 still clusters where it did in 2010, in low single dollars per thousand, scaled by how interactive and reputation-sensitive the challenge is. On 2Captcha’s public price list, a normal image or text CAPTCHA runs $0.50 to $1.00 per thousand. reCAPTCHA v2 sits at $1.00 to $2.99. reCAPTCHA v3, which is sold by target score, lists at $1.45 for the low band and $2.99 for a high-score token. Cloudflare Turnstile is priced at $1.45 per thousand, GeeTest at $2.99, and Arkose Labs FunCaptcha spans an unusually wide $1.45 to $50 depending on difficulty. The price climbs with how much work and how much session-context the challenge demands, not with the difficulty of the picture.
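For budgeting, the price list above fits in a small lookup table. The prices are the ones quoted in the preceding paragraph. The dictionary keys and the function name are made up for this sketch, and the result covers the solve invoice only.

```python
from typing import Dict, Tuple

# (low, high) retail price in dollars per thousand solves, as quoted above.
PRICE_PER_THOUSAND: Dict[str, Tuple[float, float]] = {
    "image_or_text": (0.50, 1.00),
    "recaptcha_v2": (1.00, 2.99),
    "recaptcha_v3": (1.45, 2.99),
    "turnstile": (1.45, 1.45),
    "geetest": (2.99, 2.99),
    "funcaptcha": (1.45, 50.00),
}


def spend_range(kind: str, solves: int) -> Tuple[float, float]:
    """Low and high invoice totals in dollars for a given number of solves."""
    low, high = PRICE_PER_THOUSAND[kind]
    return (low * solves / 1000, high * solves / 1000)


if __name__ == "__main__":
    for kind in PRICE_PER_THOUSAND:
        low, high = spend_range(kind, 250_000)
        print(f"{kind:14s} ${low:>9.2f} to ${high:>9.2f}")
```

At a quarter of a million solves, the low and high ends of the FunCaptcha band differ by more than thirty-fold, while the Turnstile and GeeTest rows have no spread at all.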

The shape of that curve tells you where solving is cheap and where it is not. The flat, sub-dollar floor is the commodity zone: static text and image challenges a model handles end to end with no human and no session juggling. The middle band, the interactive widgets at one to three dollars, is where the price reflects the need to either drive a real browser or stand up the fingerprint and proxy context that makes a token redeemable. The Arkose ceiling at fifty dollars per thousand is the tell for a challenge that resists the cheap path hard enough that a human in an expensive loop becomes the only reliable option. Price tracks not the puzzle but the cost of the pipeline that produces a usable answer.

Latency is the other axis, and it moved more than price did. The 2010 study measured a roughly twenty-second median because a person had to read every challenge. In 2026, ML-first services are faster on the challenges they can automate and the human-heavy paths are still slow. A 2026 benchmark of five major services found Cloudflare Turnstile solved in 6.24 seconds at the fastest service (CapMonster) and around 16 to 20 seconds at the human-heavier ones, while reCAPTCHA v2, which leans harder on human escalation, ran from about 32 seconds at the fast end to over 90 seconds at the slow end. The spread inside a single challenge type is the hybrid model showing through: a service that solves Turnstile in six seconds is running a model, a service that takes ninety on reCAPTCHA is routing to people. Success rates on the interactive widgets clustered near one hundred percent across services; the wide variance showed up on plain image CAPTCHAs, where reported success ranged from the low teens to the mid-sixties, which is the long tail of weird images that only humans reliably clear.
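Latency translates directly into a throughput ceiling. By Little's law, the number of solves in flight equals the completion rate times the time each one takes. The sketch below applies that to the benchmark latencies quoted above. The `in_flight_limit` parameter (how many solves a pipeline is willing to have outstanding at once) is a hypothetical knob introduced for illustration.

```python
import math


def max_solves_per_second(in_flight_limit: int, latency_seconds: float) -> float:
    """Highest sustained completion rate for a given in-flight limit."""
    return in_flight_limit / latency_seconds


def in_flight_needed(target_rate: float, latency_seconds: float) -> int:
    """Solves that must be outstanding at once to sustain target_rate per second."""
    return math.ceil(target_rate * latency_seconds)


if __name__ == "__main__":
    limit = 50  # hypothetical cap on outstanding solves
    for label, latency in (("Turnstile, fastest", 6.24), ("reCAPTCHA v2, slow end", 90.0)):
        rate = max_solves_per_second(limit, latency)
        print(f"{label}: {rate:.2f} solves/s with {limit} in flight")
```

The same cap of fifty outstanding solves yields about eight completions a second at six-second latency and roughly one every two seconds at ninety.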

Two practical notes fall out of those latency numbers. First, a solve that takes a minute is an eternity in a crawl, and it forces real backpressure on the pipeline feeding it; you cannot fire challenges faster than you can clear them without building an unbounded queue, which is the argument in rate limiting yourself. Second, the cheapest way to keep the cost curve low is to not trigger the challenge at all. The ETH Zurich finding bears directly on this: a session with clean history and reputation got a median of two challenges, a cold one got five or more. Every challenge avoided is a solve not bought and a half-minute not spent. The economics of solving are dominated, in the end, by how often you have to solve, and that is decided before the puzzle ever appears, by the same fingerprint and reputation signals covered across the DataDome and Akamai writeups.
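Both notes reduce to arithmetic. The sketch below first shows how a backlog grows when challenges arrive faster than they clear, then compares the per-session overhead of a trusted session with that of a cold one. The challenge counts reuse the reported medians of two and five as stand-ins for typical values, which is a simplification. The price and latency are taken from the figures quoted earlier for reCAPTCHA v2, and challenges are assumed to be solved one after another.

```python
from typing import Tuple


def backlog_after(arrival_rate: float, service_rate: float, elapsed_seconds: float) -> float:
    """Unsolved challenges queued after elapsed_seconds (rates are per second)."""
    return max(0.0, (arrival_rate - service_rate) * elapsed_seconds)


def session_overhead(
    challenges: int, price_per_thousand: float, seconds_per_solve: float
) -> Tuple[float, float]:
    """(dollars, seconds) spent on challenges in one session, solved sequentially."""
    return (challenges * price_per_thousand / 1000, challenges * seconds_per_solve)


if __name__ == "__main__":
    # Hypothetical load: 2 challenges/s arriving, 0.5/s being cleared, for 10 minutes.
    print(f"backlog after 10 min: {backlog_after(2.0, 0.5, 600):.0f} challenges")

    # Medians from the ETH Zurich result, $2.99 per thousand, 32 s per solve.
    for label, n in (("warm session", 2), ("cold session", 5)):
        dollars, seconds = session_overhead(n, 2.99, 32.0)
        print(f"{label}: ${dollars:.5f} and {seconds:.0f} s per session")
```

The dollar difference per session is a fraction of a cent. The time difference, on these assumptions, is over a minute and a half per session, which is usually the cost that matters in a crawl.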

What flat pricing tells you

Fifteen years is a long time for a price to hold. The engine under the CAPTCHA-solving market flipped completely, from rooms full of people typing answers to GPUs running object detectors and speech recognizers, and the retail number a buyer sees barely moved off the dollar-to-three-dollar band it occupied in 2010. That is not because the technology stalled. It is because the price was never really about the puzzle. It was about the cheapest available way to produce a usable answer, and as the cheap way shifted from human labor to inference, the savings went to the service’s margin rather than to the buyer’s invoice. The wall got a cheaper turnstile; the toll stayed the same.

The deeper signal is that the defenders worked this out and moved the contest. When researchers can clear the image puzzle at one hundred percent and the audio at ninety, the puzzle is no longer where the decision happens, and the systems that still ship CAPTCHAs know it. The challenge is a tax on sessions the reputation engine could not already place, and the number you see, two challenges for a trusted session, five-plus for a cold one, is the real output. The picture became a sideshow. The solving market adapted by becoming a token-binding and proxy-context business with a thin layer of actual recognition on top, which is why a solved token so often counts for nothing without the IP and fingerprint to back it. The thing being sold stopped being “can you read this” some time ago. What it costs a tenth of a cent to do, nobody bothers to defend; what costs real money is being the right session when the token lands, and no price list quotes that.

