
The economics of a scraping operation: proxy cost, solve cost, and success rate

· 19 min read
Copyright: MIT

A scraping operation has exactly one number that matters, and it is almost never the one on the invoice. The invoice shows dollars per gigabyte of proxy bandwidth, dollars per thousand CAPTCHA solves, dollars per hour of compute. Those are inputs. The output is cost per successful record: one validated row of the data you actually wanted, parsed and stored. Everything between the input prices and that output number is leakage, and the leakage is where operations quietly go broke.

The trap is that every input price has been falling for years while the output price has been rising. Residential bandwidth that cost fifteen dollars a gigabyte in 2020 is closer to five now. Breaking a reCAPTCHA costs a fraction of a cent. And yet teams that scraped ten million rows on a budget five years ago get three million for the same money today. The cost did not move to a new line item. It moved into the gap between a request sent and a record kept, and that gap is governed by a single multiplier almost nobody prices correctly: the success rate.

This post works through the unit economics piece by piece. First the raw input prices as they stand in 2026, proxy bandwidth split by network type. Then the solve costs, human and machine, and why the machine side collapsed the price floor. Then browser compute, the line item that grew fastest. Then the assembly: how a success rate turns a clean per-request price into a much uglier per-record price, with the arithmetic written out. The piece ends on the second-order effects, the retry storms and the silent failures, that make the naive model wrong in the direction that costs you money.

The input prices: proxy bandwidth by network type

Proxies are sold by the gigabyte for residential, mobile, and most ISP pools, and by the IP for datacenter. That billing split is the first thing that distorts the economics, because the two units fail differently.

Residential proxy pricing in 2026 sits in a wide band. Pay-as-you-go or small-volume tiers run roughly three to fifteen dollars per gigabyte depending on the provider, with premium vendors anchoring the top of that range. Bright Data’s entry tier is around $8.40/GB at ten gigabytes, dropping toward $3.30/GB at ten-terabyte commitments; Oxylabs sits near $8.00/GB at small volume. Value-led providers advertise $1 to $2 per gigabyte. The structure is consistent everywhere: per-gigabyte price falls fifty to seventy percent as you commit to volume, so the same byte costs a startup running ten gigabytes a month roughly two to three times what it costs an operation committing at terabyte scale.

Datacenter is the cheap floor. Sold per IP rather than per byte, a single datacenter proxy can be a dollar or two a month with effectively unlimited bandwidth pushed through it. On the cloud side, a raw public IPv4 rents for about $0.005 per address-hour, roughly $3.60 a month per address. The catch is detectability: datacenter ASNs are the easiest signal an anti-bot vendor has, and on a protected target a datacenter IP fails before it finishes the handshake. Cheap bytes you cannot spend are not cheap.

Mobile is the expensive ceiling. Carrier-grade NAT means thousands of real subscribers share one egress IP, so blocking a mobile IP risks blocking real customers, and the trust premium follows. Mobile bandwidth runs roughly eight times the residential per-gigabyte rate at comparable providers, two dollars a gigabyte against twenty-five cents in one published reference set. You buy mobile when nothing cheaper passes.

[Figure: Indicative cost per GB by network type, 2026. Order-of-magnitude bands, not quotes; volume tier moves every number. Datacenter: ~$0.5/IP. Residential: $3-15. Mobile: $2-15+ (trust premium).]

*Datacenter is billed per IP with near-unlimited bytes; residential and mobile bill per gigabyte, so the bars are not strictly comparable. The point is the ordering and the gap, not the exact heights.*

The choice between these is not a pure price decision, because the cheaper byte fails more often on hard targets, and a failed byte still bills. That coupling between network type and success rate is the whole game, and it is the reason a thorough treatment of residential, datacenter, and mobile proxies reads as much like an economics piece as a networking one. The session model matters too. Holding one IP across a multi-request flow versus rotating per request changes both your block exposure and your bandwidth bill, so it is a cost lever and not just a reliability one.

Why per-GB is the wrong meter

Bandwidth billing rewards the wrong behavior. You pay for bytes transferred, but the thing you want is records parsed, and the ratio between them is set by the target, not by you.

Consider the meter directly. A plain HTML page might be 100 kilobytes. The same page rendered in a real browser pulls the document plus every CSS file, font, tracker, analytics beacon, and lazy-loaded image, which can be two to five megabytes on the wire. Same record, twenty to fifty times the billed bandwidth, because the browser fetched everything the page references and the proxy charged you for all of it. This is the single largest hidden multiplier in a residential-proxy bill, and it is invisible on the line item, which just reads gigabytes.

There is a worked figure that makes the scale concrete. Ten million HTML requests a day at a hundred kilobytes each is roughly a terabyte of transfer; at a cloud egress rate near nine cents a gigabyte that is about ninety dollars a day before a single proxy, IP, or compute charge. Push those same ten million requests through residential proxies at even two dollars a gigabyte and the bandwidth line alone is two thousand dollars a day. Render them in a browser and multiply by the asset-weight factor. The request count never changed. The bytes did.

So the per-gigabyte meter punishes exactly the workloads that need protection-grade infrastructure, because those are the workloads that tend to need a browser, which inflates bytes, on a residential network, which prices bytes high. The two multipliers stack. Anyone budgeting from a per-GB price and a request count, without the asset-weight factor and the network premium, will under-forecast by an order of magnitude and not understand why.
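The worked figure above can be reproduced in a few lines. This is a minimal sketch, not a forecasting tool: the function name is made up for illustration, gigabytes are taken as decimal (10^9 bytes), and the prices and page weights are the ones quoted in the text.

```python
def daily_bandwidth_cost(requests_per_day, bytes_per_request, price_per_gb):
    """Dollars per day of billed transfer, using decimal gigabytes (1e9 bytes)."""
    gigabytes = requests_per_day * bytes_per_request / 1e9
    return gigabytes * price_per_gb


REQUESTS = 10_000_000   # requests per day, from the worked figure
HTML_BYTES = 100_000    # ~100 KB plain HTML document

# Same request count, three different meters.
cloud_egress = daily_bandwidth_cost(REQUESTS, HTML_BYTES, 0.09)      # ~$90/day
residential_http = daily_bandwidth_cost(REQUESTS, HTML_BYTES, 2.00)  # ~$2,000/day

# Asset-weight factor: a browser pulls 2-5 MB per page instead of 100 KB.
browser_low = daily_bandwidth_cost(REQUESTS, 2_000_000, 2.00)    # ~$40,000/day
browser_high = daily_bandwidth_cost(REQUESTS, 5_000_000, 2.00)   # ~$100,000/day

print(round(cloud_egress), round(residential_http),
      round(browser_low), round(browser_high))
```

The request count is identical in all four calls. Only the bytes per request and the price per byte move, and they multiply.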

Solve costs: the human floor and the machine collapse

CAPTCHA solving has two price regimes that look similar on a price list and behave completely differently underneath.

The human regime is a labor market. CAPTCHA farms route challenges to people, historically concentrated in low-wage regions, who solve them for a per-thousand piece rate. That rate has fallen over fifteen years from around ten dollars per thousand solves in the early era to roughly one to three dollars per thousand today, with some services quoting figures below a dollar. The published economics of the farms put worker compensation around $0.75 to $2 per thousand, with the service taking its margin on top. This is a real human spending real seconds per challenge, and the price floor is set by how cheaply that labor can be sourced. It cannot go to zero because a person has to look at the image.

The machine regime has no such floor. Once a CAPTCHA type can be solved by a model, the marginal cost per solve drops to inference cost, which is fractions of a cent. The hCaptcha team’s own 2025 analysis puts the average cost of breaking a reCAPTCHA at under one dollar per thousand solves, roughly a tenth of a cent or less per answer, and notes the figure has stayed low since 2016. That is the machine floor, and it is well below the human floor for the same challenge.

The research record explains why the machine floor keeps dropping. The audio side fell first: unCaptcha, presented at USENIX WOOT in 2017, defeated reCAPTCHA’s audio challenge at about 85 percent accuracy using free public speech-to-text engines, and the 2018 follow-up reached roughly 90 percent by piping the audio through a commercial speech API. The image side took longer but landed harder. A 2024 paper from ETH Zurich reported solving 100 percent of reCAPTCHA v2 image challenges with a fine-tuned YOLOv8 model, against 68 to 71 percent in prior work, needing a median of two challenge rounds, which is statistically indistinguishable from the human median they measured. When a published academic result reads “100 percent,” the per-solve price for that challenge type is on its way to inference cost, and the human farms only retain the challenge types the models have not yet eaten.

[Figure: Published machine solve rates against reCAPTCHA; each result pushes that challenge type toward inference-cost pricing. 2017 audio: ~85%. 2018 audio: ~90%. Prior image work: 68-71%. 2024 image (ETH): 100%.]

*The accuracy curve and the price curve are the same curve inverted. As a challenge type becomes solvable by a model, its per-solve price falls from the human piece rate toward the cost of running inference.*

There is a quiet irony in the pricing on the defender’s side. Google charges reCAPTCHA Enterprise customers one dollar per thousand assessments above the free ten thousand a month, which is the same order of magnitude as the cost to defeat the challenge from the attacker’s side. The defender and the attacker pay roughly the same per interaction. That symmetry is unusual and it tells you the protection is not being sold on the strength of the puzzle anymore; it is sold on the telemetry collected while the puzzle is on screen, which is a different product. The full cost curve of building and running a solving pipeline, human and ML side by side, is its own subject covered in building a CAPTCHA-solving pipeline.

For an operation, the practical takeaway is that solve cost is rarely the dominant line item by raw dollars. A one-percent CAPTCHA rate on ten million daily requests at two dollars per thousand adds roughly two hundred dollars a day. That is real money, but it is a tenth of the residential bandwidth bill in the same scenario and a rounding error against browser compute. Solve cost dominates the budget only when the challenge rate is high, and a high challenge rate is itself a symptom that something upstream, the IP quality or the fingerprint, is wrong.
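The sensitivity to challenge rate is easy to see with the same scenario plugged into a small function. A sketch only: the function name is illustrative, and the ten and thirty percent rates are hypothetical values chosen to show the slope, not measurements.

```python
def daily_solve_cost(requests_per_day, challenge_rate, price_per_thousand):
    """Dollars per day spent on CAPTCHA solves."""
    solves = requests_per_day * challenge_rate
    return solves / 1000 * price_per_thousand


REQUESTS = 10_000_000
PRICE = 2.00  # dollars per thousand solves, from the scenario in the text

# 1% is the rate used in the text; 10% and 30% are hypothetical.
for rate in (0.01, 0.10, 0.30):
    print(f"challenge rate {rate:.0%}: ${daily_solve_cost(REQUESTS, rate, PRICE):,.0f}/day")
```

At one percent the line is about $200 a day. At ten percent it is about $2,000 a day, the same size as the residential bandwidth line in the earlier worked figure, which is the point at which the challenge rate stops being a cost and starts being a diagnostic.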

Browser compute: the line item that grew

The fastest-growing cost in a modern scraping operation is the one that did not exist as a major line in 2018: running a real browser.

The reason is detection. As anti-bot vendors moved signal collection client-side, into JavaScript that fingerprints the runtime and watches input timing, a plain HTTP client stopped being able to retrieve a growing share of pages. ScrapeOps tracked this share rising from about two percent of requests needing residential IPs or rendering in 2020 to about twenty-five percent in 2025. A quarter of the modern web, by that measure, now wants a browser or a residential exit before it will hand over the content, and the browser is the expensive half of that pair.

The per-page resource cost of a browser is brutal next to an HTTP fetch. A single headless Chromium or Playwright instance holds roughly 200 to 500 megabytes of resident memory, against the few kilobytes of state an HTTP request needs. A browser navigation takes seconds where an HTTP fetch takes hundreds of milliseconds, often three to fifteen seconds against half a second to two seconds, and the browser burns CPU the whole time decoding, laying out, and executing JavaScript. The common rule of thumb is ten to thirty times the cost per page, and the memory ceiling is what bites in practice: a four-gigabyte box realistically holds five to eight concurrent browsers before it starts thrashing, so throughput per dollar of compute falls off a cliff exactly when you need to scale.

[Figure: Per-page cost, HTTP client vs headless browser]

Memory: ~few KB of state (HTTP client) vs ~200-500 MB per instance (headless browser)
Latency: ~0.5-2 s (HTTP client) vs ~3-15 s (headless browser)
Bytes: ~100 KB, document only (HTTP client) vs ~2-5 MB, all assets (headless browser)
Concurrency: hundreds per box (HTTP client) vs 5-8 per 4 GB (headless browser)
Overall: the headless browser runs 10-30x the cost per page.

*The browser inflates three meters at once: compute, latency, and proxy bytes. That triple inflation is why running a browser on a page that did not need one is the most common way to overspend.*

The bandwidth coupling is the part teams miss. The browser does not just cost CPU and memory; it loads every asset the page references, and on a residential proxy you pay for all of those bytes at residential rates. So the browser tax is really two taxes welded together, a compute tax and a bandwidth-multiplier tax, and the second one lands on the most expensive bytes you buy. The detailed accounting of where that compute goes is in the headless-browser tax. The single highest-leverage cost optimization in most operations is finding the underlying JSON or XHR endpoint and replaying it with an HTTP client, which drops the page out of all three browser meters at once.
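Replaying an endpoint usually looks something like the sketch below. Everything specific in it is an assumption made for illustration: the URL, the `items` key in the response, and the header values are hypothetical, and a real target will have its own path, parameters, and response shape, which you read off the browser's network tab.

```python
import json
import urllib.request

# Hypothetical endpoint discovered in the browser's network tab.
API_URL = "https://example.com/api/products?page=1"


def build_request(url):
    """Build a plain HTTP request for a JSON endpoint (no browser, no assets)."""
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            "User-Agent": "example-client/1.0",  # placeholder value
        },
    )


def parse_records(body):
    """Pull the record list out of a JSON body; the 'items' key is assumed."""
    payload = json.loads(body)
    return payload.get("items", [])


def fetch_records(url, timeout=10):
    """Fetch and parse one page of records over plain HTTP."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return parse_records(resp.read())
```

One response of a few kilobytes replaces a multi-megabyte page load, and nothing in the path needs a rendering engine. Whether the endpoint stays reachable without the browser's cookies or tokens depends on the target.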

Assembly: how success rate turns price into cost

Here is the arithmetic that ties it together, and it is the part most budgets get wrong because it is multiplicative, not additive.

Define the per-attempt cost as everything you spend to make one request: the proxy bytes for that attempt, the fraction of a solve if it hit a challenge, the slice of compute. Call it C per attempt. Now define the success rate s as the fraction of attempts that yield a kept record. The naive instinct is that cost per record is just C. It is not. You pay C on every attempt, including the failures, and only a fraction s of attempts produce a record. So:

cost per successful record = C / s

That single division is the whole story. At a 95 percent success rate you pay about 1.05 times the per-attempt cost per record, a five-percent overhead. At 60 percent you pay 1.67 times. At 40 percent you pay 2.5 times. At 25 percent you pay four times the per-attempt cost for every record you keep, and the curve is convex, so each step down costs more than the last. One published treatment runs exactly this: residential bytes at two to fifteen dollars a gigabyte become an effective $3.33 to $25 per gigabyte of kept data at a 60 percent success rate, because you paid for the failed fetches too.
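The multipliers above fall straight out of the division. A minimal sketch, with an illustrative function name, that reproduces them and the effective per-gigabyte figures at a 60 percent success rate:

```python
def cost_per_record(cost_per_attempt, success_rate):
    """Cost per kept record: every attempt is paid for, only a fraction succeed."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Multiples of the per-attempt cost C at the success rates used in the text.
for s in (0.95, 0.60, 0.40, 0.25):
    print(f"s = {s:.0%}: {cost_per_record(1.0, s):.2f}x")

# Effective price per gigabyte of kept data at a 60% success rate.
low = cost_per_record(2.00, 0.60)    # about $3.33
high = cost_per_record(15.00, 0.60)  # about $25
print(f"${low:.2f} to ${high:.2f} per GB of kept data")
```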

[Figure: cost per record = per-attempt cost / success rate. Vertical axis in multiples of the per-attempt cost C, with ticks at 1x, 2.5x, and 4x; points marked at success rates of 95%, 60%, 40%, and 25%. The curve is convex: each point of success rate lost costs more than the last.]

*Because the relationship is 1/s, the penalty accelerates as success rate falls. Moving from 95 to 90 percent barely registers; moving from 40 to 35 percent is brutal.*

The convexity is why success rate is the highest-leverage variable in the whole model, far more than the proxy unit price. Halving your per-gigabyte cost saves you a linear fifty percent on that one input. Moving success rate from 60 to 95 percent cuts the attempts needed to reach the same record count by roughly 37 percent (1/0.60 is about 1.67 attempts per record, 1/0.95 about 1.05), and that saving applies to every input at once: fewer proxy bytes, fewer solves, fewer browser-seconds, all falling together. This is the mathematical reason the cheap-byte strategy backfires. A datacenter IP at a tenth the price of residential, but a 25 percent success rate against a protected target instead of 90, is not a tenth the cost. On the proxy line alone the 1/s penalty shrinks the tenfold discount to under threefold, and the compute and solve cost of each attempt does not get cheaper with the IP. Once those non-proxy costs exceed roughly a quarter of the residential proxy cost per attempt, the datacenter route is more expensive per record, and you spent engineering time discovering that.
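The datacenter comparison depends on how much of each attempt is not proxy spend, which a short sketch makes explicit. The units are arbitrary, and the `OTHER` value (compute plus solve cost per attempt) is a hypothetical number chosen for illustration; only the tenfold price gap and the 25 versus 90 percent success rates come from the text.

```python
def cost_per_record(proxy_cost, other_cost, success_rate):
    """Per-record cost when each attempt pays proxy plus non-proxy cost."""
    return (proxy_cost + other_cost) / success_rate


def break_even_other_cost(cheap_proxy, cheap_s, dear_proxy, dear_s):
    """Non-proxy cost per attempt at which both routes cost the same per record."""
    return (dear_proxy / dear_s - cheap_proxy / cheap_s) / (1 / cheap_s - 1 / dear_s)


RESIDENTIAL_PROXY = 1.0  # proxy cost per attempt, arbitrary unit
DATACENTER_PROXY = 0.1   # a tenth of the residential price
OTHER = 0.5              # hypothetical compute + solve cost per attempt

# Proxy line only: the tenfold discount shrinks but survives.
print(cost_per_record(DATACENTER_PROXY, 0.0, 0.25))   # 0.4
print(cost_per_record(RESIDENTIAL_PROXY, 0.0, 0.90))  # about 1.11

# With non-proxy cost included, the cheap IP loses.
print(cost_per_record(DATACENTER_PROXY, OTHER, 0.25))   # about 2.4
print(cost_per_record(RESIDENTIAL_PROXY, OTHER, 0.90))  # about 1.67

# Crossover: non-proxy cost of about 0.25 units per attempt.
print(break_even_other_cost(DATACENTER_PROXY, 0.25, RESIDENTIAL_PROXY, 0.90))
```

Under these assumptions the cheap IP wins only while compute and solve cost stay below about a quarter of the residential proxy cost per attempt, which a browser-rendered workload rarely manages.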

Retries make the picture worse than the static formula suggests, because a retry is not free and it is not always independent. If a target blocks an IP, retrying through the same IP fails again at near-certainty, so the second attempt is pure waste; you need a fresh exit, which is another full-price attempt. Worse, a wave of failures often triggers a wave of retries that arrives faster than the original traffic, which looks even more like a bot and depresses the success rate further, a feedback loop that can turn a 2 percent block rate into a self-sustaining 30 percent one. The defense against that loop is disciplined backoff and rate control. The token bucket is a budget instrument here, not just a politeness one.
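A token bucket plus jittered backoff is one common way to enforce that discipline. The sketch below is a generic single-threaded version, not any particular library's API: the class and function names are illustrative, the clock is injectable so the behavior can be checked without sleeping, and the backoff uses the "full jitter" variant.

```python
import random
import time


class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self, cost=1.0):
        """Refill by elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)
```

Retries draw from the same bucket as first attempts, so a wave of failures cannot arrive faster than the original traffic did. The jitter spreads retries out so they do not land as a synchronized burst.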

The second-order costs the formula hides

The C / s model is correct and still optimistic, because three real costs sit outside it.

The first is the silent failure, the request that returns HTTP 200 with a body that is a soft block, a near-empty shell, or a CAPTCHA interstitial rendered as ordinary HTML. Your success counter increments. Your parser finds nothing useful, or worse, finds plausible garbage. The s in the formula is your measured success rate, but the number that matters is the validated success rate, records that actually parse into correct data, and the gap between them is invisible unless you instrument for it. An operation that thinks it runs at 90 percent and actually delivers 70 percent validated is mispricing every record by nearly thirty percent and does not know it. Catching this is the entire reason scraping observability exists: block-rate dashboards and content-validity checks are how you make sure the s in your cost model is the real one.
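Measuring the gap means counting two rates side by side. The sketch below is one way to do it under loud assumptions: the block markers, the minimum body length, and the required field names are all hypothetical and need tuning per target, and a real check would validate field values as well as their presence.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical soft-block markers and thresholds; tune per target.
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")
MIN_BODY_CHARS = 2_000


@dataclass
class Outcome:
    status: int
    body: str
    record: Optional[dict]  # parser output, None if nothing was parsed


def is_validated(outcome, required_fields=("id", "price")):
    """True only if the response looks like real content and parsed completely."""
    if outcome.status != 200:
        return False
    if len(outcome.body) < MIN_BODY_CHARS:
        return False  # near-empty shell
    lowered = outcome.body.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return False  # interstitial served with HTTP 200
    record = outcome.record
    return record is not None and all(
        record.get(field) not in (None, "") for field in required_fields
    )


def success_rates(outcomes):
    """Return (measured HTTP-200 rate, validated rate) over all attempts."""
    if not outcomes:
        raise ValueError("no outcomes to measure")
    total = len(outcomes)
    http_ok = sum(o.status == 200 for o in outcomes)
    validated = sum(is_validated(o) for o in outcomes)
    return http_ok / total, validated / total
```

The second number is the one that belongs in the cost model. The first is only useful as the thing you compare it against.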

The second is freshness waste. A large share of scraped pages have not changed since you last fetched them, and refetching unchanged pages spends full per-attempt cost for zero new records. Conditional requests with ETags and Last-Modified headers, and change detection on top, let you skip the unchanged ones and pay only for genuine deltas, which can cut effective cost per useful record by a large multiple on slowly-changing targets. On the right dataset that single optimization dwarfs every proxy negotiation combined.
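A conditional fetch with the standard library might look like the sketch below. The function names are illustrative, and it assumes the target actually returns `ETag` or `Last-Modified` validators and honors them, which many dynamic pages do not. Note that `urllib` surfaces a 304 as an `HTTPError`, so the unchanged case is handled in the `except` branch.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build validator headers from values saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10):
    """Return (body, etag, last_modified); body is None when the page is unchanged."""
    request = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the stored copy and validators
            return None, etag, last_modified
        raise
```

A 304 still costs a round trip, but it carries no body, so on a per-gigabyte meter the unchanged page is close to free.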

The third is the egress you spend on pages that never reach parsing at all, the hard blocks and timeouts. Field reports put this at twenty to forty percent of total egress in some operations, which is to say that around a third of the bandwidth bill can be buying nothing but failure responses. That number does not appear on any invoice as waste; it appears as gigabytes, indistinguishable from the gigabytes that carried real data. Only your own instrumentation separates them.
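The instrumentation can be as simple as tagging every response with an outcome and summing bytes per tag. A minimal sketch, with illustrative function names and hypothetical outcome labels:

```python
from collections import Counter


def egress_by_outcome(events):
    """Sum bytes transferred per outcome label from (label, bytes) pairs."""
    totals = Counter()
    for label, nbytes in events:
        totals[label] += nbytes
    return totals


def wasted_share(totals, useful_label="parsed"):
    """Fraction of all transferred bytes that did not end in a parsed record."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1 - totals[useful_label] / total


# Hypothetical sample: labels and byte counts are made up for illustration.
events = [("parsed", 700_000), ("hard_block", 200_000), ("timeout", 100_000)]
totals = egress_by_outcome(events)
print(f"wasted share of egress: {wasted_share(totals):.0%}")
```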

Put those together and the honest cost model is C divided not by measured success but by validated, useful, novel success: the fraction of attempts that yield a correct record you did not already have. That denominator is always smaller than the one on the dashboard, sometimes much smaller, and the distance between the two is the real cost of running blind.
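Written as code, the honest denominator is a product of fractions. This is a sketch of one way to factor it, not a standard formula: the function and parameter names are illustrative, the 90 and 70 percent figures reuse the example from earlier in this section, and the 50 percent novelty share is a hypothetical value.

```python
def honest_cost_per_record(cost_per_attempt, measured_success,
                           validated_share, novel_share):
    """C divided by the fraction of attempts yielding a correct, new record.

    validated_share: fraction of measured successes that parse into correct data.
    novel_share: fraction of validated records you did not already have.
    """
    denominator = measured_success * validated_share * novel_share
    if not 0 < denominator <= 1:
        raise ValueError("fractions must multiply to a value in (0, 1]")
    return cost_per_attempt / denominator


dashboard = honest_cost_per_record(1.0, 0.90, 1.0, 1.0)         # what the dashboard implies
validated = honest_cost_per_record(1.0, 0.90, 0.70 / 0.90, 1.0)  # 70% validated overall
novel = honest_cost_per_record(1.0, 0.90, 0.70 / 0.90, 0.50)     # half of those are new
print(f"{dashboard:.2f}x  {validated:.2f}x  {novel:.2f}x")
```

Under these assumptions the dashboard suggests about 1.11 times the per-attempt cost per record, while the real figure is closer to 2.86 times.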

What the numbers are telling you

Strip the operation down and the same shape appears every time. The input prices are commoditized and falling, residential bandwidth cheaper every year, machine solving racing toward inference cost, compute per core dropping with the rest of the cloud. The output price, cost per validated record, is rising anyway, because the share of the web that demands a browser and a residential exit keeps climbing and the success rate on hard targets keeps falling. The cost did not disappear when proxy prices dropped. It migrated into the 1/s term and into the silent-failure gap, where no vendor invoices it and most teams never measure it.

That points at where the money actually is. It is not in negotiating another fifty cents off the per-gigabyte rate, though that helps at volume. It is in the success rate, because the relationship is convex and every point you recover compounds across every input at once. It is in not running a browser on a page that did not need one, because that single decision inflates three meters simultaneously. And it is in measuring validated success rather than HTTP success, because the difference between them is a cost you are already paying whether or not you can see it. An operation that prices itself honestly on cost per good record, and instruments hard enough to know what that number really is, will quietly outcompete one that watched the per-gigabyte line on the invoice and felt reassured as it ticked downward.

The cheapest byte you can buy is still the one you never had to fetch twice.

