Rate limiting yourself: token buckets, adaptive throttling, and 429 backoff

Most writing about rate limiting is written from the server’s chair. The server has a resource to protect, a per-user quota to enforce, a load shedder to keep the fleet alive when traffic spikes. The crawler sits on the other side of that exchange and almost nobody writes the crawler’s half. Yet the crawler has the harder problem, because it has no quota handed to it, no documented limit to obey, and a fleet of workers that will cheerfully flatten a host into a wall of 429s if nobody tells them to slow down. The interesting question is not how a server limits you. It is how you limit yourself before the server has to.

That self-discipline is a real engineering problem with real algorithms behind it, and the same primitives keep showing up: a bucket of tokens that meters your outbound rate, a cap on how many requests you have in flight to one host at once, a feedback loop that reads the server’s distress signals and backs off, and a retry schedule that spreads collisions out in time instead of synchronizing them into a thundering herd. This post works through each. It starts with the two classic bucket algorithms and why a crawler usually wants one of them specifically, then per-host concurrency as the limit people forget, then the HTTP signals a server actually sends you (429, 503, Retry-After) and what the RFCs really say about them, then adaptive throttling that tunes your rate from those signals, then exponential backoff and the jitter that makes it work, and finally why backoff alone never solves overload and what you pair it with.

Two buckets: token and leaky

The two algorithms everyone reaches for are the token bucket and the leaky bucket, and they are easy to confuse because both involve a bucket and a constant rate. The difference is which side the constant rate sits on.

In a token bucket, tokens drip into a bucket at a fixed rate up to some maximum capacity. Every request you want to send must take a token. If a token is available, you send and decrement. If the bucket is empty, you wait until it refills. The refill rate sets your long-run average; the bucket capacity sets how large a burst you can fire when the bucket is full. That second property is the whole point. A token bucket lets you sit idle for a while, accumulate tokens up to the cap, and then spend them in a quick burst, which is exactly the traffic shape a crawler produces when it finishes parsing a page and suddenly has forty new links to fetch.

The leaky bucket inverts this. Requests pour into the bucket and drain out the bottom at a fixed rate, like water through a hole. If requests arrive faster than the drain rate, the bucket fills, and once it overflows the excess is dropped. The output is perfectly smooth no matter how spiky the input, because the drain rate is constant by construction. That smoothness is the leaky bucket’s selling point and also its limitation: it refuses to let you burst even when bursting would be harmless, because the drain rate is the only rate it knows.

*Both algorithms enforce the same long-run average. The token bucket stores up permission to burst; the leaky bucket refuses to burst and smooths the output into a constant stream.*

For a crawler limiting itself, the token bucket is usually the better fit, and the reason is the burst capacity. Crawling is not a steady stream of independent requests. It is wave after wave: fetch a listing page, parse it, discover thirty product URLs, want to fetch all thirty. A token bucket sized with a small burst capacity lets you clear that wave quickly when the host can take it, while the refill rate still holds your sustained average to something the host tolerates. A pure leaky bucket would meter those thirty requests out one drip at a time even on a host that would happily serve all thirty at once, which wastes throughput for no politeness benefit.

This is not a niche preference. Stripe’s API rate limiter, documented in their engineering blog, uses the token bucket precisely because it “lets users burst briefly if needed (like during a flash sale) but still enforces a steady average rate over time.” Their implementation drips tokens into a per-user bucket backed by Redis, rejects when the bucket is empty, and that single primitive carries the most-triggered limiter in their stack. The same shape works whether you are the server handing out tokens or the client spending them against yourself.

One implementation detail separates a good token bucket from a naive one, and it is worth getting right: how the refill happens. The Bucket4j library, a widely used Java implementation, names the two strategies. Greedy refill adds tokens as continuously as it can. A bucket configured for “10 tokens per second” under greedy refill actually adds one token roughly every 100 milliseconds, so capacity recovers smoothly. Intervally refill waits for the whole window to elapse and then adds the entire batch at once, all ten tokens at the top of each second. Greedy refill gives you smoother pacing and is almost always what a crawler wants; intervally refill produces a sawtooth where you can fire a full burst the instant the window rolls over, which tends to synchronize your fleet against the wall clock. The distinction is small in code and large in behavior.

You do not have to materialize tokens as objects at all. The standard trick is to store two numbers per bucket, the current token count and the timestamp of the last refill, and compute the refill lazily on each request: tokens gained equals elapsed seconds times the refill rate, capped at the bucket size. That makes the whole thing a couple of arithmetic operations and a compare-and-swap, which is what lets a token bucket scale to the request volumes a real crawler produces.

1
# lazy token bucket, evaluated per request
2
def allow(bucket, now, rate, capacity):
3
    elapsed = now - bucket.last_refill
4
    bucket.tokens = min(capacity, bucket.tokens + elapsed * rate)
5
    bucket.last_refill = now
6
    if bucket.tokens >= 1:
7
        bucket.tokens -= 1
8
        return True          # send
9
    return False             # wait for (1 - tokens) / rate seconds

Per-host concurrency, the limit people forget

Rate is requests per second. Concurrency is requests in flight at the same moment. They are different axes and a crawler needs to bound both, but the second one gets forgotten constantly, and forgetting it is how a fleet that looks polite on paper still knocks a host over.

Here is the trap. You set a token bucket at five requests per second to a host, you feel responsible, and then one of those requests hits a slow endpoint that takes eight seconds to respond. Your bucket keeps dripping tokens and keeps launching new requests on schedule, because the bucket has no idea the earlier ones never came back. After ten seconds you have fifty requests piled up against a host that can clearly only handle a fraction of that, every one of them holding a connection open, and the host’s connection pool or worker threads are now the bottleneck you created. Rate limiting alone did not save you, because the thing that fell over was concurrency, not rate.

The fix is a per-host concurrency cap that is independent of the rate limit. A counting semaphore is the usual tool: acquire a slot before a request leaves, release it when the response (or the timeout, or the connection error) comes back. If all slots are held, the request waits, regardless of how many tokens the rate bucket is offering. The two limits compose. The token bucket answers “may I start a request this second,” the semaphore answers “is there room for another in-flight request to this host,” and a request needs a yes from both.

*Rate and concurrency are independent. A slow endpoint can blow past a concurrency cap while the rate limit stays satisfied, which is why both gates exist.*

The unit that both limits attach to is the host, not the crawler. A distributed crawler fetches thousands of distinct hosts at once, and the polite rate for a small WordPress blog on shared hosting is nothing like the polite rate for a CDN-backed marketplace. Hard-coding one global rate either crawls the big sites at a crawl or flattens the small ones. The practical pattern is a registry of per-host limiter objects keyed by hostname (or by registrable domain, or by resolved IP if you want to be careful about many hostnames sharing one server), each holding its own token bucket and its own semaphore, created lazily the first time you touch a host. This is the same per-host bookkeeping a polite crawler already keeps for robots.txt and crawl-delay, so the limiter slots in next to it. If you run more than one machine, the limiter state has to be shared, and that usually means a Redis-backed bucket so all workers spend from the same per-host budget instead of each quietly running its own copy of the limit.

How you key the limiter interacts with how you route traffic. If you rotate through a pool of proxy IPs, the host sees several source addresses and the limit it applies to any one of them is looser than the limit it applies to the aggregate. Whether you rate-limit per origin IP, per destination host, or per (IP, host) pair is a real decision with cost and detection consequences, and it connects to the broader question of sticky versus rotating sessions. Self-limiting per destination host, regardless of how many source IPs you spread across, is the conservative default, because the host’s pain is a function of total load on it, not load per source.

What the server actually tells you

Before you can throttle adaptively you have to read the signals, and HTTP gives you a small, specific vocabulary for “you are sending too much.” The canonical one is status code 429.

The 429 status code was added in RFC 6585, “Additional HTTP Status Codes,” published in April 2012 by Mark Nottingham and Roy Fielding. Its definition is one sentence: “The 429 status code indicates that the user has sent too many requests in a given amount of time (rate limiting).” That is the entire normative meaning. The RFC adds that the response “MAY include a Retry-After header indicating how long to wait before making a new request,” and gives the example value Retry-After: 3600. Note the MAY. A server is free to send you a 429 with no guidance at all about when to come back, and many do.

The Retry-After header itself is older than 429 and is defined for the general case in RFC 9110, the consolidated HTTP Semantics document that replaced RFC 7231 in 2022. It takes one of two syntactic forms, and a correct client has to parse both. The first is a non-negative integer count of seconds, Retry-After: 120, meaning wait 120 seconds. The second is an absolute HTTP date, Retry-After: Wed, 21 Oct 2015 07:28:00 GMT, meaning do not retry before that wall-clock instant. A client that only handles the integer form will silently misread the date form as zero and hammer the server immediately, so both are mandatory. The same header shows up on three different status codes with slightly different meanings: on a 503 it says how long the service expects to be unavailable, on a 429 it says how long until your rate limit resets, and on a 301 redirect it asks the client to wait before following. The crawler cares mostly about the first two.

There is a subtlety worth flagging because it bites people. When Retry-After carries an integer, that integer is the delay you should use, and it overrides whatever backoff your own logic computed. When it carries a date, you compute the delay by subtracting your current clock from that date, which means clock skew between you and the server can produce a negative or absurd value, and you should clamp it. And when there is no Retry-After at all, which RFC 6585 explicitly permits, you fall back to your own backoff schedule. A well-built client treats Retry-After as a hint that, when present and sane, wins, and otherwise computes its own delay.

Not every “slow down” arrives as a clean 429. Some servers return 503 Service Unavailable under load. Some sit behind anti-bot or WAF layers that answer with a 429 or a challenge page that is really a block rather than a rate limit, and telling a genuine rate-limit 429 apart from a soft block matters, because the right response differs. A real rate limit wants patience. A block wants a change of approach, and conflating the two leads to backing off politely against a wall that will never open no matter how long you wait. Watching the mix of status codes over time is part of scraping observability, and it is what lets you notice that your 429 rate is really a block-rate dashboard in disguise.

Adaptive throttling: let the server set your rate

A fixed per-host rate is a guess. You picked five requests per second because it felt safe, but the host might tolerate fifty, or it might start shedding at two on a bad day. Adaptive throttling closes that gap by treating the server’s responses as a control signal and continuously adjusting your rate toward whatever the host will actually bear right now.

The cleanest framework borrowed for this is AIMD, additive increase / multiplicative decrease, the same control law that governs TCP congestion. The idea transfers almost directly. While requests succeed, raise your allowed rate by a small fixed step each interval, additive increase, probing gently upward to find more capacity. The moment you see a congestion signal, a 429 or a 503 or a timeout, cut your rate by a multiplicative factor, typically halving it, additive increase, multiplicative decrease. The asymmetry is deliberate and it is the whole trick. You climb slowly and you fall fast, which means you spend most of your time just below the host’s real limit and you retreat hard the instant you cross it. TCP uses exactly this shape to share a link fairly among many flows, and a crawler uses it to track a moving limit it was never told.

*The sawtooth hugs the host's real limit from below. Slow climb, fast cut. The crawler never knows the limit; it only knows when it just crossed it.*

This is not a thought experiment. Adaptive rate limiters built on exactly this principle exist as off-the-shelf middleware, and the pattern was extracted from production traffic pipelines. The open-source rate_limiter_aimd crate, lifted from the Vector observability project, adjusts an HTTP client’s concurrency up and down on the TCP-style additive-increase / multiplicative-decrease rule based on observed errors and latency. The gadget-inc aimd-bucket does the same for JavaScript clients as an adaptive token bucket whose size grows on success and shrinks on failure. The common goal these state plainly is to “discover and adapt to unknown remote system limits dynamically,” which is the crawler’s situation exactly: a limit exists, nobody published it, find it by feel without crossing it hard.

The biggest crawler on earth runs a version of this and recently bet the whole product on it. Google deprecated the Search Console Crawl Rate Limiter Tool, the manual control that let site owners cap Googlebot, on January 8, 2024, retiring a tool that had existed since 2008. The stated reason is that Googlebot’s automatic throttling had become good enough that the manual knob no longer earned its place. Google’s own description of the replacement behavior is pure adaptive throttling: “if the server persistently returns HTTP 500 status codes for a range of URLs, Googlebot will automatically, and almost immediately slow down crawling,” and separately, “Googlebot slows down automatically if the response time for requests gets significantly longer.” Rising latency and error rates are the congestion signals; the crawl rate is the control variable. The largest crawler in the world threw away its manual override because the feedback loop reads server health better than a human setting a number ever did.

Latency deserves its own mention as a signal because it leads the error code. A host under strain gets slow before it starts returning 429s and 503s, so a throttle that watches response-time percentiles can ease off before the host is actually shedding load, which is gentler on the host and on your success rate. Watching p50 and p95 latency per host and treating a sustained rise the same way you treat an error, as a reason to back off, is what separates a crawler that is merely polite from one that is genuinely responsive. The cost of being too slow is real too, since every backoff is throughput you did not get, and the economics of a scraping operation come down partly to how tightly you can ride that limit without crossing it.

Backoff and the necessity of jitter

When a request does fail in a way that deserves a retry, the question is when to retry, and the canonical answer is exponential backoff: wait a base interval after the first failure, double it after the second, double again after the third, capped at some ceiling. The doubling is what keeps a struggling host from being hammered on a fixed short interval while it is trying to recover. So far this is textbook.

The part that is not textbook, and that a lot of crawlers get wrong, is that plain exponential backoff makes things worse at scale, because it synchronizes. Picture a host that returns 429 to a hundred of your workers at the same instant, perhaps because a quota window rolled over. All hundred compute the same backoff, base times two to the first attempt, all hundred wait the identical interval, and all hundred retry at the same moment. You have rebuilt the exact spike that caused the problem, just shifted a few seconds into the future. Doubling the interval does nothing about this; it just spaces out the synchronized spikes. The collision is the disease and deterministic backoff does not treat it.

Jitter treats it. You add randomness to the backoff so the hundred workers scatter across a window instead of firing together. The definitive analysis is Marc Brooker’s 2015 AWS Architecture Blog post “Exponential Backoff and Jitter,” which simulated the approaches and measured both completed work and server contention. It compares three:

1
# base = initial delay, cap = ceiling, attempt = retry number
2
sleep = min(cap, base * 2 ** attempt)                          # no jitter
3
sleep = random(0, min(cap, base * 2 ** attempt))               # full jitter
4
sleep = base * 2**attempt / 2 + random(0, base * 2**attempt/2) # equal jitter
5
sleep = min(cap, random(0, sleep * 3))                         # decorrelated

The finding was clear. Full jitter, where you pick a uniformly random delay anywhere between zero and the full exponential ceiling, came out best on client work, using less total work than equal jitter and decorrelated jitter, while decorrelated jitter finished marginally faster. No-jitter exponential backoff was, in Brooker’s word, “the clear loser.” The headline he draws is that adding jitter is nearly free and the win is large, so jittered backoff “should be considered a standard approach for remote clients.” For a crawler self-limiting against many hosts from many workers, full jitter is the sensible default. It is one line different from the naive version and it is the line that stops your fleet from resonating.

*Deterministic backoff preserves the collision and shifts it forward in time. Jitter dissolves it. The interval matters far less than the randomness.*

A few guardrails go around the backoff itself. Cap the number of retries, because a URL that has failed five times is usually telling you something a sixth attempt will not fix. Cap the maximum delay, or your tail latency drifts into minutes. And honor Retry-After over your computed backoff when the server sent one, since the server’s number is authoritative about its own reset window and your exponential guess is not. The order of precedence is: a sane Retry-After wins, otherwise full-jitter exponential backoff, always under a retry-count ceiling.

Why backoff is not enough on its own

It is tempting to think backoff plus jitter plus adaptive throttling is the complete answer. It is not, and the sharpest statement of why comes from Marc Brooker again, in his 2022 post “What is Backoff For?” His argument is uncomfortable and correct: “Backoff helps in the short term. It is only valuable in the long term if it reduces total work.” Backoff reschedules work; it does not delete it. If a hundred requests need to happen and the host can only serve ten per second, backing off spreads those hundred over more time but the host still has to do all hundred. For a brief spike that is exactly what you want, smoothing a transient into something digestible. For sustained overload it accomplishes nothing, because, as Brooker puts it bluntly, “deferring load does not work” when the load is not actually transient.

The failure mode this guards against is retry amplification. Every retry is a new request the server has to process, even the ones that fail again, so a naive retry policy under load can multiply the original traffic instead of reducing it, turning a host that was merely slow into a host that is fully down. Backoff slows the amplification but does not stop it, because the retries still happen, just later. Brooker’s recommendation is to bound the retries themselves, and the elegant tool he names is a token bucket for retries: give each client a bucket that retries draw from, refilled slowly, so that a healthy client with the occasional failure can always retry but a client whose every request is failing runs its retry bucket dry and stops amplifying. The same primitive that meters your outbound rate also meters your retry budget, which is a tidy result. The full prescription he lands on is backoff and jitter for the short-term spike, a retry token bucket and circuit breaking for the long-term amplification, with no single mechanism doing the whole job.

So the complete self-limiting stack for a crawler has more than one moving part. A per-host token bucket caps your sustained rate and allows controlled bursts. A per-host concurrency semaphore caps requests in flight so a slow endpoint cannot pile up connections. An AIMD-style adaptive loop tunes the bucket rate from the host’s latency and error signals so you ride the real limit instead of a guess. Full-jitter exponential backoff, bounded by a retry count and capped delay and deferring to Retry-After, handles the individual failures without synchronizing your fleet. And a retry budget keeps the retries themselves from becoming the attack. Each piece covers a gap the others leave open.

Closing

The asymmetry at the center of all of this is worth sitting with. A crawler climbs toward a host’s limit one cautious step at a time and retreats from it in a single hard cut, and that shape is not an accident or a heuristic. It is the same additive-increase / multiplicative-decrease law that keeps the entire internet’s TCP traffic from collapsing into congestion, applied to a crawler that was never told what its limit is. You probe gently because capacity is cheap to test for and you back off violently because being wrong is expensive. The whole discipline of rate-limiting yourself reduces to learning a number nobody will tell you, by watching for the moment you exceed it and treating that moment as information rather than failure.

The strongest evidence that this is the right model is that the largest crawler on the planet reached the same conclusion and acted on it. When Google retired its manual crawl-rate control in January 2024 and let Googlebot set its own pace from server latency and error codes, it was conceding that a feedback loop reading the host’s health in real time beats any number a human picks in advance. A site owner setting a static cap is guessing, the same way a crawler hard-coding five requests per second is guessing. The host already knows its own limit and it will tell you, in 429s and 503s and rising response times, if you build the thing that listens. Most crawlers that get blocked were not too aggressive on average. They were deaf to the one signal that would have kept them under the line.

Sources & further reading

Nottingham, M. and Fielding, R. (2012), RFC 6585: Additional HTTP Status Codes — the standard that defined the 429 status code and its one-sentence meaning, with the optional Retry-After guidance.
IETF (2022), RFC 9110: HTTP Semantics — the consolidated HTTP semantics document defining the Retry-After header and its two syntactic forms.
MDN Web Docs (2024), Retry-After header — reference for the delay-seconds and http-date forms and the status codes Retry-After accompanies.
Brooker, M. (2015), Exponential Backoff and Jitter — the AWS simulation comparing full, equal, and decorrelated jitter and concluding full jitter is the default for remote clients.
Brooker, M. (2022), What is Backoff For? — the argument that backoff reschedules rather than reduces load, and that retries need their own token-bucket budget.
Stripe (2017), Scaling your API with rate limiters — Stripe’s production token-bucket limiter on Redis and the four limiter and load-shedder types in their stack.
Bucket4j (2024), Bucket4j reference documentation — a token-bucket library documenting the greedy-versus-intervally refill distinction precisely.
Google Search Central (2023), Upcoming deprecation of Crawl Rate Limiter Tool in Search Console — the announcement that Googlebot’s automatic throttling on 500s and latency replaced the manual crawl-rate knob, effective January 8, 2024.
TwistingTwists (2024), rate_limiter_aimd — an adaptive HTTP rate limiter extracted from Vector that applies TCP-style AIMD to client request rate.
gadget-inc (2024), aimd-bucket — an adaptive leaky-token-bucket for JavaScript clients that grows on success and shrinks on failure to discover unknown limits.
Wikipedia (2024), Additive increase/multiplicative decrease — background on the AIMD control law and its convergence properties from TCP congestion control.

Rate limiting yourself: token buckets, adaptive throttling, and 429 backoff

Two buckets: token and leaky

Per-host concurrency, the limit people forget

What the server actually tells you

Adaptive throttling: let the server set your rate

Backoff and the necessity of jitter

Why backoff is not enough on its own

Closing

Sources & further reading

Further reading

Designing a distributed crawler: frontier, dedup, politeness, and backpressure

URL frontier design: from Mercator to modern priority-queue crawlers

Proxy pool management: rotation, health checks, and burn-rate economics