Proxy pool management: rotation, health checks, and burn-rate economics
A proxy pool is the least interesting part of a crawler right up until the moment it becomes the only part that matters. The parser is written, the frontier is humming, the politeness budget is tuned, and then the success rate quietly slides from ninety-eight percent to sixty over an afternoon, and nobody touched the code. The IPs went bad. Not all at once, not loudly. They aged, they got flagged, they got reused by someone else’s botnet an hour before you borrowed them, and the target started returning a 200 with an empty body instead of the data you paid for.
The hard problems here are not “buy more IPs.” They are: how do you tell a banned IP apart from a slow one, how do you decide which proxy gets the next request, how long do you keep a session pinned to one address before the target notices, and how do you do all of this without setting fire to a per-gigabyte bill that grows faster than the data you extract. This post is about operating the pool, not sourcing it. It walks through rotation strategies and their selection algorithms, the health-check state machine that separates dead proxies from blocked targets, sticky versus rotating sessions and when each is the wrong choice, ban detection that survives a soft block, and the burn-rate model that ties all of it back to money. Where the internal behaviour of a commercial proxy network is not publicly documented, that is called out rather than guessed.
What a pool actually is
Strip away the marketing and a proxy pool is a set of egress identities plus a policy for assigning them to outbound requests. The identities might be a few hundred dedicated datacenter IPs you lease by the month, or a rotating slice of a residential network where the IP you get on this request belongs to a real person’s home router and the IP you get on the next one belongs to someone three countries away. The policy is the interesting half. It decides selection (which IP next), stickiness (how long one identity persists for a logical session), health (which IPs are currently allowed to serve traffic), and backoff (how a failed IP earns its way back).
Most teams start with a flat list and round-robin. That works until it does not, and the failure mode is always the same: the pool has no memory. A flat round-robin will happily hand you an IP that returned a 403 four seconds ago, then hand it to you again on the next pass, because nothing recorded the failure. The whole discipline of pool management is adding state to that list, so the pool remembers what each IP did last time and routes around the ones that are burning.
The vocabulary splits cleanly along three axes that the rest of this post leans on. Proxies come in three network types, with very different detection and cost profiles, covered in depth in residential vs datacenter vs mobile proxies. They get sourced in ways that range from clean ISP partnerships to consent-questionable SDK bundling, the subject of how proxy networks source IPs. And they get billed two ways: per IP per month for datacenter, per gigabyte of transferred traffic for residential and mobile. Hold those three axes in mind, because every operational decision below is a trade between them.
Rotation strategies and how the next IP is chosen
Rotation is the question “which proxy serves this request,” asked thousands of times a second. There are four answers worth knowing, and they sit on a spectrum from stateless to stateful.
Round-robin walks the list in order. It is trivial to implement, distributes load evenly, and is blind to health. Random selection picks uniformly at random, which avoids the synchronised-marching problem where two crawler instances with the same list hit the same IP at the same offset, but it is just as blind. Weighted or least-recently-used selection is where memory enters: each IP carries a score derived from its recent success rate and the time since it last served a request, and the scheduler prefers the IP with the best score that has rested longest. This is the first strategy that routes around a degrading IP without being told to. The fourth answer is to delegate the whole question to a gateway endpoint, which most commercial residential networks offer. You connect to a single hostname, and the provider’s own load balancer picks the exit IP. You trade visibility for simplicity, and you inherit whatever selection logic the vendor runs, which they do not publish.
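The weighted, least-recently-used strategy described above can be sketched in a few lines. This is a minimal illustration, not any vendor's algorithm: `ProxyStats`, `pick_proxy`, the `rest_weight` factor, and the 60-second cap on rest time are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProxyStats:
    """Per-proxy memory. Field names are illustrative."""
    url: str
    successes: int = 0
    failures: int = 0
    last_used: float = 0.0  # clock reading at the last request, 0.0 if never used

    def success_rate(self) -> float:
        total = self.successes + self.failures
        # An untested proxy gets the benefit of the doubt.
        return self.successes / total if total else 1.0


def pick_proxy(pool: List[ProxyStats], now: Optional[float] = None,
               rest_weight: float = 0.01) -> ProxyStats:
    """Prefer proxies with a good success rate that have rested longest."""
    now = time.monotonic() if now is None else now

    def score(p: ProxyStats) -> float:
        # Cap the rest bonus so a long-idle bad proxy cannot outrank a healthy one forever.
        rested = min(now - p.last_used, 60.0)
        return p.success_rate() + rest_weight * rested

    best = max(pool, key=score)
    best.last_used = now
    return best
```

The caller is expected to update `successes` and `failures` from real responses; without that feedback this degrades into plain least-recently-used.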
Figure: The four common ways to answer "which IP next," from a stateless counter to a vendor endpoint that makes the choice for you and tells you nothing about how.
The frequency of rotation is a separate dial from the selection algorithm. Per-request rotation gives every request a fresh identity, which suits stateless work: search-result pages, price checks, a sitemap walk where no two requests need to look like the same visitor. The cost is that you can never build trust with the target, because from the target’s side you are a thousand strangers who each show up once. Pinned rotation keeps one IP for a window or a session, which you need the moment state enters the picture: a login, a cart, a multi-page flow guarded by a cookie that the server ties to the IP that obtained it. Most providers expose this as “sticky sessions,” and the cookie-and-state side of it is its own discipline, covered in session and cookie management across a proxy fleet.
A working scheduler in 2026 usually combines weighted selection with a per-session pin. The Apify and Crawlee stack is a clean public example of the pattern: a SessionPool holds a bounded set of sessions (the documented examples use maxPoolSize: 50), each session carries a maxUsageCount and a maxErrorCount, and a session is retired and replaced the moment it hits either limit. The defaults in their guides run around thirty uses and two or three errors before a session is dropped. The principle generalises past any one library. A session is a small budget of trust. You spend it on requests, you lose it on errors, and when the budget is gone you throw the identity away rather than nurse it.
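The "budget of trust" idea can be sketched independently of any library. The classes below are a hypothetical illustration and are not Crawlee's API; the limits (30 uses, 3 errors, pool size 50) simply echo the figures quoted above.

```python
import itertools
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    proxy_url: str
    max_usage: int = 30   # illustrative limits, taken from the ranges quoted in the text
    max_errors: int = 3
    usage: int = 0
    errors: int = 0

    def record(self, ok: bool) -> None:
        """Spend one unit of the budget; an error also counts against the error limit."""
        self.usage += 1
        if not ok:
            self.errors += 1

    @property
    def spent(self) -> bool:
        return self.usage >= self.max_usage or self.errors >= self.max_errors


class SessionPool:
    def __init__(self, proxy_urls: List[str], max_pool_size: int = 50):
        self._proxy_cycle = itertools.cycle(proxy_urls)
        self._max = max_pool_size
        self.sessions: List[Session] = []
        self.retired = 0

    def acquire(self) -> Session:
        # Throw away spent sessions rather than nursing them.
        live = [s for s in self.sessions if not s.spent]
        self.retired += len(self.sessions) - len(live)
        self.sessions = live
        if len(self.sessions) < self._max:
            session = Session(next(self._proxy_cycle))
            self.sessions.append(session)
            return session
        # Pool is full: reuse the session with the most budget left.
        return min(self.sessions, key=lambda s: s.usage)
```

Retirement happens lazily on `acquire`, which keeps the sketch short; a real pool would probably also expire sessions by age.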
The difference between a dead proxy and a banned one
This is the distinction that separates a pool that self-heals from one that thrashes. A proxy can fail for two unrelated reasons, and treating them the same wrecks your success rate either way.
A dead proxy is broken at the network layer. The TCP connection times out, the TLS handshake fails, the upstream returns a 502 or 503 because the exit node fell off the residential network, the response body comes back empty because the tunnel collapsed mid-transfer. This is the proxy’s fault. The right response is to pull that IP from rotation and stop sending it traffic, because every request you route through it is wasted.
A banned IP is working perfectly. The connection succeeds, the handshake completes, the target’s edge looks at the IP, decides it does not like the look of it, and returns a 403, a CAPTCHA interstitial, or the cruelest variant, a 200 with a stripped or fake body that looks like success to a naive client. This is not the proxy’s fault and it is not even, strictly, a property of the proxy. The same IP that is banned on one target may be perfectly clean on another. Pull it from the whole pool and you have thrown away a good IP because one site frowned at it.
The open-source scrapy-rotating-proxies middleware draws this line in a way worth studying because the code is public and the reasoning is explicit. Its default ban policy marks a proxy dead when “a response status code is not 200, response body is empty, or if there was an exception.” But the package layers a second concept on top: ROTATING_PROXY_PAGE_RETRY_TIMES, default five. A page is retried through different proxies up to that many times, and only after the retries are exhausted is the failure reclassified from a proxy problem into a page problem. If five different IPs all fail to fetch the same URL, the IPs are probably fine and the target is the one refusing you. That reclassification is the whole game. It stops a single hostile target from draining a healthy pool one IP at a time.
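The reclassification step can be sketched as a retry loop that keeps a list of suspects and exonerates them if every attempt fails. This is an illustration of the idea as described here, not the middleware's code; `fetch_via_pool`, `PageFailure`, and the `fetch` callback (which returns `True` on a usable response) are hypothetical.

```python
from itertools import islice
from typing import Callable, Iterable, List, Tuple


class PageFailure(Exception):
    """Every proxy tried failed on the same URL, so blame the page, not the proxies."""

    def __init__(self, url: str, exonerated: List[str]):
        super().__init__(f"{url} failed through {len(exonerated)} different proxies")
        self.url = url
        self.exonerated = exonerated


def fetch_via_pool(url: str, proxies: Iterable[str],
                   fetch: Callable[[str, str], bool],
                   max_proxies: int = 5) -> Tuple[str, List[str]]:
    """Try up to max_proxies different proxies for one URL.

    Returns (winning_proxy, proxies_that_failed). The failed list is only
    meaningful as proxy failures when some other proxy succeeded.
    """
    failed: List[str] = []
    for proxy in islice(proxies, max_proxies):
        if fetch(url, proxy):
            return proxy, failed
        failed.append(proxy)
    # Nobody got through: reclassify as a page problem and exonerate the proxies.
    raise PageFailure(url, failed)
```

A caller would mark the returned `failed` proxies as suspect on success, and leave the pool untouched when `PageFailure` is raised.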
Detecting the soft block, the 200 with junk, is the part that resists a generic rule. A status code is cheap to check. A response that returns HTTP 200 and the right content-type but carries a “please verify you are human” page, or a truncated stub, or last week’s cached data, needs a content-level assertion: a selector that must be present, a JSON field that must parse, a minimum body length, a checksum that should change between requests. The major detection vendors lean on exactly this ambiguity, serving a clean-looking response to a flagged client precisely so the scraper keeps spending IPs against a wall. How those systems decide you are a bot in the first place, and what they hand back when they do, is the subject of the server-side versus client-side bot detection and proxy-detection ASN posts.
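A content-level assertion of the kind described above might look like the following. It is a sketch under stated assumptions: the phrases in `BLOCK_PHRASES`, the 500-character minimum, and the `required_marker` argument are placeholders that have to be tuned per target.

```python
from typing import Tuple

# Hypothetical phrases; real block pages vary by target and by detection vendor.
BLOCK_PHRASES: Tuple[str, ...] = ("verify you are human", "captcha", "access denied")


def classify_response(status: int, body: str, required_marker: str,
                      min_length: int = 500) -> str:
    """Return 'ok', 'bad_status', or 'soft_block' for a single response."""
    if status != 200:
        return "bad_status"
    lowered = body.lower()
    if any(phrase in lowered for phrase in BLOCK_PHRASES):
        return "soft_block"
    # A 200 that is too short, or lacks the element we came for, is not success.
    if len(body) < min_length or required_marker not in body:
        return "soft_block"
    return "ok"
```

For example, a price crawler might pass `required_marker='class="price"'` so that a page without a price element is never counted as a success.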
Health checks: active probes and passive observation
Once you accept that proxies have state, you need a way to keep that state current. Two mechanisms do it, and a real pool runs both.
Active health checking sends synthetic requests on a schedule to measure each proxy independently of your crawl traffic. Hit a known-stable endpoint, record the status, the latency, and whether the body matches what you expect. Active checks give you a clean signal on a quiet pool and catch an IP that died while idle, before you waste a real request on it. They cost bandwidth, which on a per-GB residential plan is bandwidth you are paying for, so the probe target and frequency are an economic decision, not just a technical one. Probing every IP in a hundred-thousand-IP residential pool every minute is absurd; the pool churns faster than you could ever probe it.
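The economics of active probing are easy to put a number on. The function below is back-of-envelope arithmetic, assuming a gigabyte is billed as 10^9 bytes and that probe size and price are your own figures; the example values are illustrative, not quoted from any provider.

```python
def monthly_probe_cost(num_proxies: int, probe_bytes: int,
                       interval_seconds: float, price_per_gb: float,
                       days: int = 30) -> float:
    """Dollar cost of probing every proxy on a fixed interval for a month."""
    probes_per_proxy = days * 86_400 / interval_seconds
    total_bytes = num_proxies * probes_per_proxy * probe_bytes
    return total_bytes / 1e9 * price_per_gb


# 100,000 proxies, a 2 KB probe every minute, at an assumed $4 per GB.
print(monthly_probe_cost(100_000, 2_000, 60, 4.0))  # 34560.0
```

Stretching the interval to an hour divides that figure by sixty, which is the kind of trade the probe schedule is really making.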
Passive health checking reads the outcomes of your real traffic and updates each proxy’s score from them. No extra requests, no extra bytes. Every response your crawler receives is already a health signal, and a passive checker simply records it: this IP returned a 403, that one timed out, this one came back clean in 200 milliseconds. The trade is latency in the signal. You only learn an IP is bad after a real request has already failed through it. On a busy pool passive checking is nearly free and nearly current, which is why it tends to dominate, with active probes reserved for warming a cold pool or vetting newly added IPs.
This is the same split the service-mesh world settled on, and the parallel is exact enough to borrow their proven numbers. Envoy’s outlier detection is the reference implementation. It watches real request outcomes (passive) and ejects a host on consecutive_5xx errors or on consecutive_gateway_failure for 502, 503, and 504 specifically, and it also runs a statistical success-rate detector that flags hosts whose success rate falls far below the cluster mean, but only once a host has served more than success_rate_request_volume requests in the interval, so a low-traffic host is not condemned on a tiny sample. A proxy pool wants the same guardrail. Do not eject an IP that has handled three requests; the sample is too small to mean anything.
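A passive tracker with the same guardrail can be sketched as follows. It is a simplification: Envoy compares each host against the cluster's success-rate distribution, whereas this sketch uses a fixed threshold, and every name and default here is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PassiveHealth:
    """Per-proxy health state fed only by real traffic outcomes."""
    min_requests: int = 20             # plays the role of a minimum request volume
    max_consecutive_failures: int = 5
    min_success_rate: float = 0.5
    requests: int = 0
    successes: int = 0
    consecutive_failures: int = 0

    def record(self, ok: bool) -> None:
        self.requests += 1
        if ok:
            self.successes += 1
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    def should_eject(self) -> bool:
        # A run of failures is a strong signal on its own.
        if self.consecutive_failures >= self.max_consecutive_failures:
            return True
        # Do not judge a success rate on a tiny sample.
        if self.requests < self.min_requests:
            return False
        return self.successes / self.requests < self.min_success_rate
```

With these defaults a proxy that has failed two of its first three requests stays in rotation, while one that drops below a 50 percent success rate over twenty or more requests is ejected.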
The recovery state machine and exponential backoff
Ejecting a proxy is half the job. The other half is letting it back in, because a ban is usually temporary and a dead exit node sometimes recovers. The mechanism almost everyone converges on is randomised exponential backoff, and the convergence is not a coincidence. It is the only scheme that re-checks a flaky IP often enough to recover it quickly while not hammering an IP that is genuinely gone.
The shape is simple. When an IP fails, quarantine it and schedule a re-check after a base delay. If the re-check also fails, double the delay. Keep doubling on each failure up to a cap. scrapy-rotating-proxies ships this with a ROTATING_PROXY_BACKOFF_BASE of 300 seconds and a ROTATING_PROXY_BACKOFF_CAP of 3600 seconds, so a stubbornly dead proxy gets re-checked on the order of once an hour instead of every few seconds. Envoy expresses a closely related idea on the ejection side: the ejection duration is base_ejection_time multiplied by the number of consecutive times the host has been ejected, growing until it hits max_ejection_time. That growth is linear rather than doubling, but the intent is the same and the framing is reversed. One counts up the delay before a retry; the other counts up the time-out before re-admission.
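The doubling-with-a-cap curve, with randomisation applied as "full jitter" (one common choice, assumed here), fits in two small functions. The 300 and 3600 second defaults mirror the settings quoted above; the function names are hypothetical.

```python
import random


def backoff_ceiling(failed_rechecks: int, base: float = 300.0, cap: float = 3600.0) -> float:
    """Upper bound on the delay: base, doubled per failed re-check, clipped at cap."""
    return min(cap, base * 2 ** failed_rechecks)


def backoff_delay(failed_rechecks: int, base: float = 300.0, cap: float = 3600.0) -> float:
    """Full jitter: pick uniformly between zero and the ceiling so re-checks do not synchronise."""
    return random.uniform(0, backoff_ceiling(failed_rechecks, base, cap))


# Ceilings for the first five consecutive failures.
print([backoff_ceiling(n) for n in range(5)])  # [300.0, 600.0, 1200.0, 2400.0, 3600.0]
```

The jitter matters most when many proxies are quarantined at once, for example after a target-wide block, because it spreads their re-checks out instead of firing them together.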
There is a residential-specific wrinkle the mesh analogy misses. In a service mesh the hosts are stable; the IP you ejected is the same IP an hour later, so re-admission is meaningful. In a large residential network it is often not. The measurement data is blunt about this. An analysis of more than 170 million residential proxy IPs across 101 providers, run over a ninety-day window in early 2026, found that the average residential proxy IP stays visible for roughly four and a half days, and 78 percent of IPs do not persist beyond thirty days from first observation. For IPv6 it is far worse: those addresses are visible for about 1.29 days on average and 99 percent are gone within the month. An exponential backoff that schedules a re-check an hour out is, for a chunk of a residential pool, scheduling a re-check on an identity that no longer exists. The practical consequence is that for rotating residential traffic you lean on the gateway and let the provider handle recovery, and you only run an elaborate quarantine state machine on the stable IPs: datacenter, ISP, or a sticky residential session you are actively holding.
Sticky versus rotating, and the trust you are buying
The choice between a sticky session and per-request rotation is not a preference, it is dictated by whether your workload carries state. Get it backwards in either direction and you pay.
Rotate per request on a stateful flow and you break it outright. A login sets a cookie the server associates with the originating IP; the very next request arrives from a different IP carrying that cookie, and a half-decent backend treats the mismatch as session hijacking and kills the session. You will see this as an endless redirect to the login page, or a cart that empties between steps, and you will chase it as a cookie bug when it is a rotation bug. Anything spanning more than one request that the server stitches together needs the IP to hold still.
Pin a sticky session on stateless bulk extraction and you waste the pool’s diversity. The entire value of a large rotating pool is that no single IP carries enough request volume to look abnormal. Pin a thousand sequential price checks to one IP and you have manufactured exactly the velocity signal that gets an IP flagged: one address, hundreds of requests a minute, the textbook shape of automation. You converted a diverse pool into a single loud client by holding still when you should have moved.
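On a pool you operate yourself, the two modes can share one router: requests that carry a session key get a pinned proxy, and requests without one rotate. The sketch below is hypothetical (`StickyRouter`, `proxy_for`, and the ten-minute TTL are illustrative) and uses plain round-robin underneath to stay short.

```python
import itertools
from typing import Dict, List, Optional, Tuple


class StickyRouter:
    def __init__(self, proxies: List[str], ttl_seconds: float = 600.0):
        self._cycle = itertools.cycle(proxies)
        self._ttl = ttl_seconds
        self._pins: Dict[str, Tuple[str, float]] = {}  # session key -> (proxy, expiry)

    def proxy_for(self, session_key: Optional[str], now: float) -> str:
        if session_key is None:
            # Stateless request: take the next proxy in rotation.
            return next(self._cycle)
        pinned = self._pins.get(session_key)
        if pinned is not None and pinned[1] > now:
            # Stateful flow inside its window: the IP holds still.
            return pinned[0]
        # New or expired session: pin a fresh proxy for one TTL.
        proxy = next(self._cycle)
        self._pins[session_key] = (proxy, now + self._ttl)
        return proxy
```

A login flow would pass the same `session_key` for every step, while bulk price checks pass `None` and spread across the pool.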
Providers expose stickiness as a configurable session window. The durations on offer in 2026 run from a few minutes to several days. IPRoyal documents sticky residential sessions held “up to 7 days,” and the same length appears across competing networks. The pricing detail that matters operationally is that sticky and rotating almost always cost the same per gigabyte. You are billed for the bytes you move, not for the rotation policy, so the choice is purely technical. The cost lever is bandwidth, which the next section is about.
Figure: The pairing is forced by the workload, not chosen. Two of these four cells cost you data; the diagonal is where you want to be.
There is a reputation dimension underneath all of this that rotation cannot paper over. With a rotating pool, the IP you draw carries whatever score every prior user built on it, and that history is not even your own: in the IPinfo data, 46 percent of residential proxy IPs appear in two or more providers, and some surface in dozens of pools at once, up to 98 in the extreme case. You can rotate to a fresh IP, but you cannot rotate to a fresh reputation, because the address you land on may have been someone else’s scraper an hour ago. That shared-history problem is the structural weakness of rotating residential, and it is exactly why detection vendors invest so heavily in ASN and subnet reputation rather than per-IP blocklists.
Burn rate: the cost model that decides everything
Now the money. Two billing models dominate, and they reward opposite traffic shapes.
Datacenter proxies bill per IP per month with effectively unlimited bandwidth, running roughly one to two dollars per dedicated IP in 2026, with some plans below a dollar at volume. Residential and mobile proxies bill per gigabyte of traffic transferred, and the market spreads wide: budget networks start around $1.75 per GB, mid-market sits in the three-to-six-dollar range, and the enterprise tier (Bright Data, Oxylabs) runs roughly eight to twelve dollars per gigabyte at entry volumes, sliding down with commitment. Those are list prices from vendor pages and aggregator surveys in early 2026; the high-volume contract numbers are lower and not public.
The shape of your traffic decides which model is cheaper, and the crossover is not subtle. Per-IP-per-month wins when you move heavy bandwidth across a handful of stable identities: large file pulls, video, long authenticated sessions on a few accounts. Per-GB wins when you touch many distinct IPs but transfer little data on each, which is the shape of most scraping, where a request is a few kilobytes of HTML and the value is in the diversity of egress, not the volume per IP. For that profile residential’s per-GB model is usually the lower true cost even at a higher headline rate, because you are not paying for bandwidth you do not use.
Figure: On a per-GB plan every byte is metered, including the ones you throw away. Blocking media and assets you never parse is usually a bigger lever than negotiating the rate.
The number that should govern a residential crawl is not the per-GB rate, it is bytes per useful request. Burn rate is roughly bytes-per-request times request rate times the per-GB price, and the first term is the one engineers most often ignore. A raw HTML page might be thirty kilobytes; the same page rendered in a real browser, pulling JavaScript, CSS, web fonts, analytics beacons, and a carousel of images, can move well over half a megabyte, every byte metered at the residential rate. Driving a headless browser through a residential proxy when a plain HTTP fetch would have returned the same data is one of the most expensive mistakes in the discipline, and it is common, because browser automation is the path of least resistance for getting past client-side checks. Block images, media, fonts, and third-party beacons at the network layer, and request the bare document where the data lives in the initial HTML, and the bill can fall by an order of magnitude with no change to what you extract.
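The burn-rate product is worth running with your own numbers. The sketch below assumes a gigabyte is billed as 10^9 bytes; the request rate and the $4 per GB price are illustrative inputs, and the two page sizes are the ones mentioned above.

```python
def burn_rate_per_hour(bytes_per_request: float, requests_per_second: float,
                       price_per_gb: float) -> float:
    """Dollars per hour: bytes per request x request rate x per-GB price."""
    gb_per_hour = bytes_per_request * requests_per_second * 3600 / 1e9
    return gb_per_hour * price_per_gb


raw_html = burn_rate_per_hour(30_000, 10, 4.0)    # plain HTTP fetch of a 30 KB document
rendered = burn_rate_per_hour(500_000, 10, 4.0)   # same page with all assets loaded
print(round(raw_html, 2), round(rendered, 2))     # 4.32 72.0
```

At the same request rate and the same price, the only term that changed is bytes per request, and it moved the hourly bill by a factor of about seventeen.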
The cruelest line item is the one the diagram flags at the bottom. A request that gets banned still transferred bytes, so you paid for it and got nothing. A pool with a fifty percent block rate needs two requests for every useful record, so whenever a block response weighs about as much as a real page, the cost per useful record doubles. This is the economic argument for everything earlier in this post. Good ban detection, fast ejection, and sane backoff are not hygiene, they are the difference between paying once per record and paying twice. The cheapest gigabyte is the one you never send through a burning IP, and the way you avoid sending it is the health-check machinery that knows the IP is burning before you spend the next request on it.
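The effect of block rate on cost per record follows from the expected number of blocked requests per success, which is block_rate / (1 - block_rate). The function below is a sketch with hypothetical names; `blocked_bytes` defaults to the full page size, which is the worst case assumed in the text, and can be lowered if your block pages are small.

```python
from typing import Optional


def cost_per_useful_record(bytes_per_request: float, price_per_gb: float,
                           block_rate: float,
                           blocked_bytes: Optional[float] = None) -> float:
    """Expected dollars spent per successfully extracted record."""
    if not 0 <= block_rate < 1:
        raise ValueError("block_rate must be in [0, 1)")
    if blocked_bytes is None:
        blocked_bytes = bytes_per_request  # worst case: a block costs as much as a page
    cost_ok = bytes_per_request / 1e9 * price_per_gb
    cost_blocked = blocked_bytes / 1e9 * price_per_gb
    # Each success is accompanied, on average, by block_rate / (1 - block_rate) blocks.
    return cost_ok + cost_blocked * block_rate / (1 - block_rate)


clean = cost_per_useful_record(30_000, 4.0, block_rate=0.0)
half_blocked = cost_per_useful_record(30_000, 4.0, block_rate=0.5)
print(half_blocked / clean)  # 2.0
```

At an 80 percent block rate the same calculation gives five times the clean cost, which is why block rate usually deserves more attention than the headline price.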
There is a quieter cost that does not show on the invoice: the operational tax of running the pool itself. Active health probes consume billable bandwidth. A large rotating pool needs a control plane to track per-IP scores, and that state has to live somewhere and survive restarts. Logging every request outcome at the volume a serious crawl generates is its own storage and processing line. None of this is large next to the proxy bill, but it is real, and it is the reason many teams pay the markup for a managed gateway rather than operate selection and health themselves. They are buying back the engineering time, and for rotating residential traffic, where the IPs churn out from under any state you try to keep, that trade is often correct.
Putting the pieces together
A proxy pool that holds up under a real crawl is a small distributed system with one job: keep a current, honest opinion about which egress identities are worth using right now, and route around the rest cheaply. The selection algorithm decides who serves the next request. The health checks, active for cold pools and passive for hot ones, keep the opinion current. The dead-versus-banned distinction keeps a single hostile target from draining the whole pool through misattributed failures. The backoff state machine recovers what is recoverable and writes off what is not, tempered by the reality that a residential IP often churns away before any backoff timer fires. And the burn-rate model sits underneath all of it, because every one of those decisions is ultimately a decision about whether the next gigabyte buys data or buys a block page.
The detail that ties the operational story to the economic one is that bans and burn rate are the same problem wearing two hats. A blocked request still costs metered bandwidth, sometimes as much as a successful one, and returns nothing, so the health machinery that keeps you off burning IPs is also one of your biggest cost controls, often larger than the per-GB rate you negotiated. Teams that obsess over the headline price per gigabyte and ignore their block rate are optimising the wrong term. The measurement data from 2026 sharpens the point from the other side: with the average residential IP visible for under five days, nearly half of all such IPs shared across two or more providers, and 78 percent of malicious sessions slipping past reputation feeds entirely, the pool you rent is a fast-moving, shared, partly-poisoned resource whose individual IPs you will never fully trust. You manage it the way you would manage any resource like that. You measure constantly, you eject fast, you recover cautiously, and you never spend a request you could have known not to send.
Sources & further reading
- IPinfo (2026), The Residential Proxy Problem: Shared Infrastructure and Rapid Rotation — measurement of 170M+ residential proxy IPs across 101 providers; churn, lifespan, and 46 percent cross-provider overlap figures.
- GreyNoise and IPinfo, via BleepingComputer (2026), Residential proxies evaded IP reputation checks in 78% of 4B sessions — three-month study; 78 percent reputation evasion, 39 percent residential origin, 683 ISPs.
- TeamHG-Memex (2024), scrapy-rotating-proxies README — open-source ban policy, page-retry reclassification, and the backoff base/cap defaults quoted here.
- Envoy Project (2026), Outlier detection — architecture overview — consecutive-5xx, gateway-failure, success-rate detection and the base-ejection-time backoff multiplier.
- Apify (2026), Proxy Rotation Strategies for Web Scraping: The Technical Reference — SessionPool defaults, maxUsageCount and maxErrorCount session-retirement pattern, exponential backoff guidance.
- IPRoyal (2026), Residential Proxies pricing — per-GB tiers and the up-to-7-day sticky session window referenced in the stickiness section.
- AIMultiple (2026), How Much Does a Proxy Cost? 2026 Proxy Pricing Comparison — per-IP datacenter versus per-GB residential billing models and market-tier price ranges.
- IPQualityScore (2024), Detecting Residential Proxies: Unmasking Fraudulent IP Addresses — the multi-signal detection approach (ASN, behaviour, history) that a soft block draws on.
- Kong (2026), Health checks and circuit breakers — Kong Gateway — active versus passive health-check definitions that map directly onto proxy-pool probing.
- TorchProxies (2026), Proxy IP Reputation and ASN Scoring — how a rotated IP inherits the reputation built by every prior user of that address.