Rate-limiting algorithms for defense: token bucket, sliding window, and GCRA

A rate limiter is the cheapest defensive primitive you can put in front of a service, and the one most likely to be wrong in a way nobody notices until it matters. The happy path hides the bugs. Under normal traffic almost any counter will do; the request comes in, you bump a number, you check it against a threshold, you let the request through. The interesting questions only show up at the edges. What happens at the boundary between two counting windows? What happens when a client sends a burst that is technically under the average but arrives all at once? What happens when the counter lives in Redis and forty application servers all read it, decide, and write it back in the same millisecond?

Those edges are where an abuse defense lives or dies. A limiter that lets through twice its configured rate at every window boundary is not really a limiter against an attacker who knows where the boundary is. A limiter that drops legitimate bursts is a limiter against your own users. The algorithms that solve these problems are old, most of them older than the web, and they were designed for a different problem entirely: pacing fixed-size cells through 1990s telecom switches. This post is about how those algorithms work, why each one fails in a specific way, and what it takes to run them across a distributed fleet without the counting itself becoming the bottleneck.

Here is the route. First the fixed window, the simplest thing that counts, and exactly how its boundary leaks. Then the two sliding windows, the log and the counter, and the memory-versus-accuracy tradeoff between them. Then the bucket family: leaky bucket, token bucket, and GCRA, which is the leaky bucket rewritten so it never has to tick. Then how Redis makes any of this atomic across a fleet, where the race conditions hide, and how the big shops actually deploy it. The through-line is that rate limiting is a counting problem, and every algorithm here is a different answer to the question of what, exactly, you count and how much you are willing to pay to count it precisely. This is the defensive complement to rate limiting yourself, which looks at the same algorithms from the client’s side of the 429.

What you are actually defending against

Before the algorithms, the threat model, because it decides which algorithm is correct. Rate limiting sits in front of several distinct abuses and the right counter looks different for each.

The volumetric flood is the loud one. A botnet points a few hundred thousand hosts at your login endpoint and tries to exhaust a connection pool or a database. At that scale the rate limiter is rarely the last line; volumetric traffic gets absorbed upstream by anycast dispersion and scrubbing long before it reaches an application counter, which is its own subject in how CDNs absorb volumetric DDoS. The application-layer limiter handles the quieter abuses. Credential stuffing, where an attacker replays a breach dump against your login form at a few requests per second per IP, slow enough to look human, spread across enough IPs to dodge a naive per-IP cap. Scraping, where a crawler pulls your whole catalog as fast as your origin will answer. Brute force against a password or a token. A misconfigured client in a retry loop, which is not malicious at all but will take you down just as effectively.

These differ in a way that matters for the math. The flood wants raw request suppression: count requests, drop the ones over the line. Credential stuffing wants the counter keyed on something more durable than an IP, because the IP rotates; this is where credential stuffing mechanics and rate limiting meet, and where pure rate limiting starts to lose to fingerprinting and reputation. The retry-loop client wants a limiter that says no politely, with a Retry-After it will actually honor, rather than one that just drops packets. One mechanism, several jobs, and the choice of counting window and key is where you encode which job you are doing.

There is a second axis that shapes everything below: where the limiter runs. A limiter at the CDN edge sees every request but has no idea what your application considers expensive. A limiter in your application code knows exactly what a request costs but only sees the fraction of traffic that already reached it. The edge is cheaper and absorbs attacks before they touch your origin; the application is precise. Most serious deployments run both, and the two layers count different things on purpose.

Fixed window: the counter that leaks at the seam

Start with the simplest possible thing. Divide time into fixed windows, one per minute say, keyed by the truncated timestamp. For each window keep one integer per client. A request arrives, you increment the integer for the current window, and if it crosses the threshold you reject. When the clock rolls to the next minute the key changes, the old counter is abandoned, and everyone starts fresh.

This is one increment and one comparison per request. The state per client is a single integer that you can let expire automatically. In Redis it is the canonical two-command pattern: INCR on a key shaped like client:minute, and EXPIRE to set a time-to-live so the key cleans itself up. It is fast, it is trivial to reason about, and it has one flaw that an attacker will find immediately.

The flaw is the seam between windows. The counter resets instantaneously at the window boundary, which means a client can spend its entire budget in the last second of one window and its entire budget again in the first second of the next. Two windows’ worth of requests land inside a span shorter than a single window. Configure 100 requests per minute and an attacker who lines up on your boundary gets 200 requests inside roughly one minute, all of them conforming. The burst is exactly at the seam, and the seam is at a predictable wall-clock time, so this is not a corner case you can wave away. It is a doubling, on demand, available to anyone who can read a clock.

*The seam between fixed windows is a predictable wall-clock instant; lining a burst up across it admits two windows' worth of traffic in less than one window.*

For a lot of internal uses this does not matter. If you are protecting a database from a runaway script, a brief doubling at minute boundaries is harmless. The fixed window earns its keep precisely because it is cheap and simple, and a great deal of production rate limiting is fixed-window underneath. It stops being acceptable the moment your adversary is deliberate, because then the boundary is not a statistical curiosity, it is an attack surface with a known location.

Sliding window log: precise and expensive

The honest fix for the boundary problem is to stop using fixed windows at all and measure the actual trailing interval. Keep, for each client, the timestamp of every request inside the last window. When a new request arrives, drop every timestamp older than the window, count what remains, and admit only if the count is under the limit. Now the window genuinely slides; there is no seam, because “the last sixty seconds” is computed fresh on every request relative to now.

This is the sliding window log, and it is exact. No boundary effect, no approximation, the count is the true number of requests in the true trailing window. In Redis it maps cleanly onto a sorted set: ZADD the new timestamp, ZREMRANGEBYSCORE to evict everything older than the cutoff, ZCARD to count what is left, all wrapped so they run as one atomic unit. The structure is elegant and the answer is correct.

The cost is memory, and it scales with traffic, not with clients. Every admitted request stores a timestamp that lives for the whole window. A client allowed 500 requests in the window holds up to 500 timestamps at once. Figma worked the arithmetic when they built their limiter and put numbers on it: 500 requests per user per day across 10,000 users would be roughly 20 MB of timestamps in the naive log. That is for ten thousand users. The structure that gives you perfect accuracy is the structure whose memory grows linearly with how much traffic you are trying to admit, which is exactly backwards from what you want under attack, when traffic is highest and you can least afford to spend memory recording it. The sliding window log is the right tool when the limit is small and precision is non-negotiable, and the wrong tool the moment the limits or the user count get large.

Sliding window counter: the approximation everyone ships

The version that actually runs at scale keeps the sliding window’s smoothness and throws away the per-request storage. Keep two fixed-window counters, the current window and the one immediately before it, and estimate the rate over the trailing window by weighting the previous window by how much of it still overlaps the sliding interval.

The arithmetic is a weighted blend. Suppose a one-minute limit and you are fifteen seconds into the current window. The current window has counted some requests directly. The previous window’s count gets weighted by the fraction of it still inside the trailing sixty seconds, which is the forty-five seconds that have not yet aged out, so three quarters of it. Cloudflare’s published walk-through uses concrete numbers: 42 requests in the previous minute, 18 so far in the current minute, fifteen seconds elapsed, giving 42 * ((60 - 15) / 60) + 18 = 42 * 0.75 + 18 = 49.5. Round and compare against the limit. Two counters per client, a multiply and an add, no timestamps.

*Two counters and a weighted blend approximate the trailing window without storing a single timestamp.*

It is an approximation, and it is wrong in a defined direction. The blend assumes the previous window’s requests were spread evenly across that window, which they were not. If the previous window’s traffic was bunched at its end, just before the boundary, the linear weighting undercounts what is actually inside the trailing interval, and the limiter is slightly too lenient; bunched at the start, slightly too strict. Cloudflare quantified the error on real traffic: across 400 million requests from 270,000 distinct sources, 0.003 percent of requests were wrongly allowed or wrongly limited, with an average gap of 6 percent between the true rate and the approximation, and crucially none of the sources that got mitigated were actually under the threshold. The approximation errs small and it errs in a way that does not produce false positives against legitimate clients below the line. That combination, almost-right accuracy with no per-request storage and no boundary doubling, is why the sliding window counter is the default in most edge rate limiters today, Cloudflare’s among them.

Figma landed on a close cousin of this for the same reasons. Rather than two windows they slice the enforcement period into sixty sub-windows of equal size, increment the one for the current slice, and sum all sixty when checking, storing the counters in a Redis hash that expires after the period. Sixty integers per user instead of hundreds of timestamps. Their measured footprint was about 2.4 MB for the same ten thousand users that the naive log would have cost 20 MB, an order of magnitude saved for an accuracy hit small enough to be irrelevant against the abuse they were stopping. The general lesson holds across both shapes: you trade a bounded, characterizable error for memory that no longer scales with traffic.

The bucket family: leaky, token, and why they are the same picture

The window algorithms count events inside an interval. The bucket algorithms take a different view: they model a reservoir and a flow, and ask whether the reservoir has room. They come from telecom, where the problem was not abuse but pacing, smoothing bursty sources down to a contracted rate before they hit a switch.

The leaky bucket is the original. Jonathan Turner described it in 1986 as a rate-control scheme for variable-rate sources, and it was later written into ITU-T Recommendation I.371 for traffic control in ATM networks. The picture is a bucket with a hole. Each arriving request pours a unit of water in; the bucket drains at a fixed rate through the hole. If a request would overflow the bucket, it is dropped. Because the drain is constant, the output is perfectly smooth regardless of how bursty the input was, which is exactly what a switch wants. The bucket’s capacity is the burst tolerance: a deep bucket absorbs a big clump of arrivals and meters them out slowly, a shallow one barely tolerates any clumping at all. The defining property is that the leaky bucket shapes output to a constant rate. It does not let bursts through, it absorbs them and releases them evenly.

The token bucket inverts the flow and that one change is the whole difference. Instead of requests filling a bucket that drains, tokens drip into a bucket at a fixed rate and requests remove them. A bucket of capacity B fills with tokens at rate r up to B; each request must take a token to proceed, and if the bucket is empty the request is rejected or made to wait. When the bucket is full the client can spend all B tokens at once, which means the token bucket permits a burst up to the bucket size and then throttles to the refill rate once the bucket drains. This is the property that makes it the favorite for public APIs. Real clients are bursty in a benign way; they batch, they page, they fire a cluster of calls on a user action and then go quiet. The token bucket says yes to that burst as long as the long-run average stays under r, which matches how legitimate traffic actually looks. Stripe built its primary request limiter on exactly this, a centralized token bucket in Redis from which every request takes a token, and it is the most frequently triggered of their limiters, rejecting millions of requests a month.

*Same reservoir picture, opposite flow: the leaky bucket meters output to a constant rate, the token bucket lets a full bucket spend all at once.*

Note that you rarely implement either by literally running a timer that drips. Maintaining a background process that adds a token to millions of buckets every interval is wasteful, most of all because the overwhelming majority of buckets are idle at any moment. The standard trick is lazy: store the last update time and the current level, and on each request compute how many tokens should have accrued since you last looked (elapsed * r, capped at B) before deciding. The bucket only does work when a request touches it. That insight, that you can replace the tick with a timestamp and a subtraction, is the doorway to GCRA.

GCRA: the leaky bucket that never ticks

The generic cell rate algorithm is the leaky bucket rewritten so it stores a single number and never runs a clock. It was the ATM Forum’s reference algorithm for policing traffic contracts, standardized alongside I.371, and the word “cell” is the ATM word for a fixed-size packet, which is the only reason a server-side rate limiter carries a piece of 1990s telecom vocabulary in its name. The reason it keeps showing up in modern stacks is that it does the work of a leaky bucket with the storage of a single timestamp and no background process at all.

The trick is to track one value: the theoretical arrival time, or TAT, the wall-clock instant at which the next request would arrive if the client were sending at exactly the contracted rate. Define an emission interval T, the spacing between requests at the sustained rate, so a rate of ten per second is a T of 100 milliseconds. Define a tolerance τ, the burst allowance, which is how far ahead of schedule a request may arrive and still conform. The conformance test, stated in the standard for the virtual scheduling formulation, is that a request arriving at time ta conforms when ta > TAT − τ. If the request is too far ahead of its theoretical schedule, that is, ta ≤ TAT − τ, it is non-conforming and rejected. When a request conforms, you advance the schedule: the new TAT becomes the later of the current time or the existing TAT, plus T. A conforming request that arrived late resets the schedule to now plus one interval; one that arrived early pushes the TAT further into the future, and once it is τ ahead of real time the next request gets rejected.

This is exactly the continuous-state leaky bucket viewed from the other side. In the bucket formulation the reservoir has capacity T + τ and drains at one unit per unit time; a request conforms if it would not overflow. The standard treats the two as equivalent because they are, the TAT being the leaky bucket’s fill level expressed as a time instead of a depth. The payoff is that GCRA stores one number per client and touches it only on request. There is no drip, no timer, no per-tick work across millions of idle buckets, which is the property that breaks naive leaky-bucket implementations at scale.

*GCRA stores one instant, the theoretical arrival time. A request conforms if it arrives after TAT minus the burst tolerance; conforming requests push the TAT forward by the emission interval.*

The cleanest production form is Brandur Weinstein’s redis-cell, a Redis module written in Rust that exposes the whole algorithm as one command. CL.THROTTLE key max_burst count period quantity does the GCRA arithmetic inside Redis and returns a five-integer array: whether the request was limited, the total limit, the tokens remaining, the seconds until you may retry, and the seconds until the bucket fully resets. A call like CL.THROTTLE user123 15 30 60 1 allows thirty per sixty seconds with a burst of fifteen. The last three return values map directly onto the response headers a client expects, the retry-after slot becoming Retry-After and the reset slot becoming X-RateLimit-Reset. Because the whole thing executes as a single Redis command, it is atomic by construction, which sidesteps the race that makes a hand-rolled token bucket dangerous in a distributed setting. That race is the next problem.

Doing this across a fleet without losing the count

Every algorithm above assumes the counter is a single source of truth that you read, decide on, and write atomically. The moment you have more than one application server, that assumption is the thing that breaks. The counter has to live somewhere shared, which in practice means Redis, and the danger is the gap between reading the count and writing the new one.

Picture the naive token bucket spread over a fleet. A server reads the bucket level from Redis, sees one token left, decides to allow the request, and writes back zero. Between its read and its write, a second server reads the same bucket, also sees one token, also allows, also writes back zero. Two requests admitted against a single token. This is the classic read-modify-write race, and under load it does not happen occasionally, it happens constantly, because load is exactly when many servers hit the same hot key at once. The limiter is silently too lenient in precisely the conditions an attacker creates. Figma called this out as their reason for rejecting a hand-rolled token bucket: the Redis operations are not atomic, and the read-then-write opens a window where the limiter undercounts.

The fix is to make the whole decision one indivisible operation. Redis runs Lua scripts atomically, with no other command interleaving from start to finish, so the standard pattern is to push the entire read-check-write into a single EVAL. The script increments, checks the threshold, sets expiry if it is the first request in the window, and returns the verdict, all without another client slipping in. The fixed-window version is just INCR plus a conditional EXPIRE inside one script. Without the script the two commands are separate round trips, and a crash or a race between them can leave a key incremented but never given a TTL, a counter that leaks forever and eventually blocks the client permanently. The sliding window log becomes the sorted-set dance, ZADD then ZREMRANGEBYSCORE then ZCARD, wrapped in the same atomic envelope. redis-cell’s CL.THROTTLE is the same idea taken to its conclusion: the atomicity is the entire point, the GCRA math done server-side as one command so there is no window to race in at all.

Atomicity per key solves correctness; it does not solve the hot key. When one client, or one attack, concentrates on a single counter, every server in the fleet serializes through one Redis key, and that key becomes a bottleneck and a single point of failure. The shops that operate this at scale answer it in two ways. One is to shard and approximate: keep local per-node counters, let each node enforce a fraction of the global budget, and reconcile centrally rather than consulting a shared counter on every request. This is the shape of large-scale designs that distribute a computed drop ratio to the data plane and let each node enforce locally with probabilistic dropping, trading a little global precision for the elimination of the per-request shared read. The other is to accept the approximation the sliding window counter already gives you and run it at the edge, where the request was going to be inspected anyway.

That edge model is what Cloudflare’s rate limiting became. The original product counted requests per IP at the edge across hundreds of thousands of domains. The 2022 advanced version separated two things that used to be fused, the counting expression and the mitigation expression, so you can count one thing and block another. Count only requests that hit /login and return a 401, then block a broader slice of traffic from that source once the count crosses the line. The characteristics you bucket on are no longer just the IP; you can key the counter on an ASN, a header, a cookie, a query parameter, or a JA3 fingerprint, which matters because the IP is the least durable identifier an abuser has and the cookie or fingerprint is often the most. Keying the counter on the right characteristic is half of making rate limiting work against a determined adversary, and the TLS fingerprinting that produces a JA3 value is what lets the counter survive an IP rotation.

Saying no without giving the game away

The algorithm decides whether to admit. What you do with the rejection is a separate decision, and against a thoughtful attacker it is nearly as important. The polite answer is the one the standards describe: HTTP 429, “Too Many Requests,” defined in RFC 6585 in 2012, optionally carrying a Retry-After header that tells a well-behaved client how long to wait. For the misconfigured retry loop and the honest heavy user this is exactly right; the client backs off, the load drops, nobody is harmed. A 429 with an accurate Retry-After is cooperative rate limiting, and most of your traffic deserves it.

Against an adversary the 429 is also a signal you are handing over. It confirms a limit exists, and a careful return value, the reset time, the remaining count, tells the attacker precisely how to pace themselves to stay just under the line or when to resume. Figma hit this directly: they started by returning 429s to the invitation spammers, and the attackers simply created new accounts to dodge the per-account ban, treating the explicit rejection as feedback. So Figma switched to shadow banning. The limited request returns a normal 200, the user believes the invitation was sent, and nothing actually happens. The attacker gets no signal that they were caught, no reset timer to optimize against, no reason to rotate accounts. The same instinct shows up in how large platforms describe their secondary limits. GitHub publishes hard numbers for its primary rate limits but keeps its secondary, abuse-oriented limits deliberately vague, telling developers the limits are subject to change without notice and may trigger for undisclosed reasons, and that there is no way to check your secondary-limit status. The published primary limit is a budget you can plan against; the secondary limit is a trap you are not supposed to be able to map.

This is the tension at the center of defensive rate limiting. A limit you document and explain is a limit your legitimate users can build against and your attackers can game. A limit you hide protects against the adversary but punishes the honest client who cannot tell why their requests are vanishing. There is no clean answer, only a split: be generous and transparent with the limits that exist to keep honest clients from hurting you by accident, and be quiet and unpredictable with the limits that exist to stop people who are trying.

What the algorithm choice actually buys you

Step back and the family resolves into a small number of real tradeoffs. The fixed window is the cheapest counter that works and the easiest to attack at its seam, fine for incidental protection and wrong for a deliberate adversary. The sliding window log is the only exact answer and it pays for that exactness in memory that grows with the traffic you least want to spend memory on. The sliding window counter is the approximation that won, because its error is small, bounded, and biased away from false positives, and because it costs two integers regardless of load; it is the default at the edge for good reasons. The token bucket is the one that matches how benign traffic actually behaves, bursty within an average, which is why it fronts so many public APIs. GCRA is the token bucket’s discipline with a single number of state and no clock to run, which is why it keeps getting reinvented in Redis modules four decades after ATM.

None of this is where the hard part of an abuse defense lives, and that is the thing worth being honest about. The algorithm is the easy half. The hard half is the key, the characteristic you count on, because an attacker’s whole job is to make each request look like it came from a different client. Rate limiting by IP fell over the moment residential proxy pools made IPs disposable, which is why the serious systems moved the counter onto cookies, fingerprints, and ASN reputation, and why pure rate limiting blurs into the bot-detection stack rather than standing apart from it. The counter is solved. Choosing what to count is the open problem, and it gets harder every year the identifiers get cheaper to rotate.

The detail that has aged best is the oldest one. Turner’s leaky bucket from 1986 was built to pace cells through a switch, and the same arithmetic, lightly rewritten as GCRA and wrapped in a single atomic Redis command, is what paces login attempts through your front door today. The problem changed completely, from congestion to abuse, and the math did not have to.

Sources & further reading

Desgats, J. (2017), How we built rate limiting capable of scaling to millions of domains — Cloudflare’s sliding-window-counter approach, the weighted-blend formula, and the 400M-request accuracy analysis (0.003% error).
Weinstein, B. (2015), Rate Limiting, Cells, and GCRA — a clear derivation of GCRA from the leaky bucket, the TAT formulation, and the design of the redis-cell module.
Brandur (n.d.), redis-cell: a Redis module that provides rate limiting in Redis as a single command — the Rust module, the CL.THROTTLE command, and its five-integer return array mapping onto rate-limit headers.
Tarjan, P. (2017), Scaling your API with rate limiters — Stripe’s four limiter types, the centralized token bucket in Redis, and rejection volumes.
Mahdi, N. (2017), An alternative approach to rate limiting — Figma’s sliding-window-counter hybrid, the memory arithmetic versus the log, and the move to shadow banning.
Molteni, D. (2022), Introducing Advanced Rate Limiting — separating counting from mitigation expressions and counting on characteristics beyond IP, including JA3.
Wikipedia contributors (n.d.), Generic cell rate algorithm — the two equivalent GCRA formulations, the conformance condition, and the ATM Forum / ITU-T I.371 standardization.
Wikipedia contributors (n.d.), Leaky bucket — Turner’s 1986 origin and the formalization in ITU-T Recommendation I.371 (1993).
Nottingham, M. and Fielding, R. (2012), RFC 6585: Additional HTTP Status Codes — the definition of 429 Too Many Requests and the optional Retry-After header.
GitHub (n.d.), Rate limits for the REST API — the documented primary limits versus the deliberately vague secondary, abuse-oriented limits and their concurrency and points-per-minute caps.
Redis (n.d.), Build 5 Rate Limiters with Redis — the INCR/EXPIRE fixed-window and sorted-set sliding-log patterns and why Lua scripting makes them atomic.