Skip to content

Designing a fair queue at scale: lessons from high-demand ticket on-sales

· 24 min read
Copyright: MIT
The words FAIR QUEUE in monospace with an orange admission arrow cutting through a row of waiting dots

At 10:00:00 on a Friday morning, a tour goes on sale. By 10:00:01 there are two million browsers pointed at one checkout path, and roughly forty thousand tickets behind it. The site can serve maybe a few thousand concurrent shoppers before its database falls over. So the real product on sale that morning is not the ticket. It is a position in line, and the line has to be built, ordered, and defended in the same second that the herd arrives.

That is a distributed-systems problem wearing a consumer-facing costume. You need to admit traffic at a rate the origin survives, decide who goes first when everyone arrived “at the same time,” agree on that ordering across machines that do not share a clock, hand each visitor a proof they can present later, and do all of it while a meaningful fraction of the herd is automated and actively trying to jump the line. None of those four requirements is hard alone. The interesting part is that they fight each other, and the fight is what this post is about.

The sections below walk through the shape of the herd and why ordinary autoscaling does not save you, the admission-control machinery (token and leaky buckets, and where the counter actually lives), the fairness-versus-throughput tradeoff that pushed the ticketing industry from first-come-first-served to randomized draws, why clock skew makes “first” a slippery word, the token design that lets a stateless edge trust a stateful core, and finally the failure modes, the ways a queue leaks the very slots it was built to protect.

The herd, and why scaling out does not save you

The defining feature of a ticket on-sale is not the peak request rate. It is the derivative. Traffic does not ramp; it steps. A waiting-room vendor’s own writeup of the problem notes that a sale can pull reasonable, steady traffic from North America and Europe and then take a sudden major spike from South America the instant a regional promo fires, with no warning the autoscaler could have used. The load arrives faster than any reactive scaling loop can respond, because the scaling loop’s feedback delay (scrape metrics, decide, boot instances, warm caches, register with the load balancer) is measured in tens of seconds and the herd assembles in under one.

This is the thundering herd, and the classic statement of it is a pile of clients all blocked on the same resource, released at once, who then stampede that resource into the ground. The standard mitigations are familiar: rate limiting, load shedding, exponential backoff with jitter, caching. A waiting room is what you get when you take those mitigations and make them the product instead of an implementation detail. You do not try to serve the herd. You hold almost all of it outside the gate, deliberately, and meter it in.

The reason you cannot simply scale the origin to meet the peak is economic before it is technical. Provisioning a checkout path and its backing inventory database for two million concurrent users, to absorb a spike that lasts ninety seconds a few times a year, means paying for that capacity every other minute of the year. Worse, much of the herd is not buyable demand at all. It is retries, duplicate tabs, and bots. Scaling to serve it is scaling to serve waste. The waiting room flips the problem: instead of growing the origin to fit the herd, you shrink the herd to fit the origin, and you give the people you turn away a fair, legible place to stand while they wait.

req/s time incoming herd origin capacity this gap is the queue *The on-sale traffic profile: a near-vertical step, not a ramp. Everything above the origin-capacity line has to be held outside and metered in, which is what the queue does.*

Admission control: the bucket and where the counter lives

Once you accept that you are metering, the question becomes: meter with what? The answer the whole industry converged on is a bucket, in one of two complementary shapes.

The token bucket is the rate-limiter’s workhorse. Tokens drip into a bucket at a configured rate, up to a maximum capacity, and each admitted request consumes one. AWS’s own writeup of fairness in multi-tenant systems describes it plainly: tokens refill at a configured rate up to a maximum, and the maximum is the burst capacity. The burst capacity is the interesting knob. A little burst tolerance lets you absorb the natural non-uniformity of real traffic without rejecting legitimate spikes; too much, and the bucket stops limiting anything. The same writeup notes the trick of composing two buckets, one with a low rate and high burst and another with a high rate and low burst, so you get short-term elasticity without unbounded long-term bursts.

The leaky bucket is the dual view, and it is the one the ticketing systems tend to describe. SeatGeek’s virtual waiting room, presented in detail by its engineers, uses a leaky bucket per protected path, where each protected zone (a single checkout or ticketing path) gets its own bucket with its own exit rate. The bucket size equals the protected-zone capacity. When a request arrives and there is room, an access token is added to the bucket and the user proceeds; when the bucket is full, the user is routed to the waiting room and the edge returns HTTP 429. The optimization that makes this work is that the 429 is cached at the CDN (Fastly, in SeatGeek’s case) for a short interval, so the herd’s retries are absorbed at the edge and never touch the backend that is already at capacity. That single decision (cache the rejection) is what keeps the admission layer itself from becoming the next thundering-herd victim.

Both shapes share a problem that is the actual hard part: the counter. A bucket is only a bucket because every admission decrements one shared count. In a single process that is an atomic integer. Across a fleet of edge nodes in two hundred data centers, “one shared count” is a distributed-consensus question dressed up as arithmetic. You have three broad options, each with a different failure personality.

The first is a central counter: every node calls one authoritative service to claim a slot. Correct, simple to reason about, and a bottleneck and single point of failure precisely when you need it most. The second is local sharding: divide the global rate by the number of nodes and let each enforce its slice independently with no coordination. AWS’s fairness writeup describes exactly this, and names its weakness: it assumes traffic is spread uniformly across nodes, which holds under round-robin load balancing and breaks the moment the herd is geographically lopsided. Give every edge node 1/200th of the global rate and then send 90% of South America’s spike to three nodes, and those three reject floods of legitimate users while 197 others sit idle. The third option is the interesting middle: nodes keep local counts and reconcile asynchronously, gossiping their tallies so the global view converges without every decision blocking on a round trip.

Cloudflare’s Waiting Room sits in that third camp, and the mechanism is worth being precise about because it is publicly documented at the architecture level but not at the byte level. The room runs as a Worker in the data center nearest each visitor, and the state (how many users are active, how many are queued per arrival-minute bucket) is shared across the network using counters that converge rather than counters that lock. Cloudflare’s engineering writeups describe the coordination as keeping each location synced with global demand almost instantaneously, built on Workers and Durable Objects, and a later post refers to distributed bucket counters of the conflict-free, mergeable kind. The exact replication protocol and the on-the-wire counter format are not public; what is public is the shape, eventually-consistent counters at the edge so that no admission decision has to wait for a global lock. The cost you pay for that is precisely the cost of eventual consistency: for a brief window, the sum of the local views can exceed the true global count, and you can admit slightly more than the configured ceiling. For a waiting room that is a tolerable overshoot. For an inventory system it would be a double-sell, which is why the inventory check stays authoritative and synchronous behind the gate even when the gate itself is eventually consistent.

central sharded gossiping N correct, single bottleneck N/4 N/4 N/4 N/4 fast, breaks on skewed load converges, can briefly overshoot *Three places the admission counter can live. The edge waiting rooms favour the rightmost: eventually-consistent counters that gossip, trading a small overshoot for never blocking an admission on a global lock.*

For a deeper walk through the edge-coordination mechanics, the companion piece on Cloudflare Waiting Room internals takes apart the JWT and the estimated-wait computation, and the token-bucket and queue-position overview covers the general model.

Fairness versus throughput, and why the industry split on it

Here is where the design stops being purely mechanical. Suppose you can admit, say, two thousand users per minute, cleanly and at the right rate. Which two thousand, out of the two million waiting? The honest answer is that there is no single correct policy, only a tradeoff with the throughput dialed against the perceived fairness, and the two big ticketing approaches landed on opposite ends of it on purpose.

First, a precise note on what “fair” even buys you here. AWS’s multi-tenant fairness writeup makes a point that sounds paradoxical and is exactly right: enforcing a quota both increases and decreases availability. Rejecting one tenant’s excess requests with a 429 protects everyone else’s predictable access, so aggregate availability goes up, while the rejected tenant experiences a drop. The same writeup is blunt that rate-based admission control is not a truly fair scheduler. It is first-come-first-served with a rate limit, and it achieves its appearance of fairness partly by rejecting work, which costs throughput. You cannot have maximum throughput and strict fairness at once; the rejections that produce fairness are the same rejections that lower the ceiling.

The first-come-first-served camp is represented cleanly by SeatGeek, whose engineers said outright that they chose FIFO and rejected randomization for their business: the visitor who arrived earlier should get the earlier chance to buy. A visitor token is minted on arrival, carrying the arrival time, and an exchanger function later swaps that visitor token for an access token in arrival order. FIFO is legible. A user understands “I was here first.” It is also, and this is the catch, the policy that most rewards being fast, which in practice means rewarding whoever has the best automation, the closest network path, and the most parallel clients. Pure arrival-time ordering hands a structural edge to the bots, because a bot can register its arrival in the first millisecond and a human with a phone on a train cannot.

The randomized camp answered exactly that. Queue-it uses FIFO for visitors who arrive after a sale opens, but for the high-stakes scheduled on-sale it does something different: early arrivals wait in a pre-queue behind a countdown, and the instant the sale starts, everyone in that pre-queue is assigned a random position before the queue collapses back to FIFO from there. Ticketmaster moved the same way, introducing randomization in 2016 specifically to blunt scalper bots; once the sale launches, everyone in the pre-queue is shuffled, so hitting the page a millisecond after open buys you nothing. The point is not that randomization is fairer in the abstract. It is that randomization removes the reward for speed, and speed was the thing the bots were monetizing. By making the first ten minutes of arrival time carry no information, you delete the asset the automation was built to capture.

That is the whole tradeoff in one sentence: FIFO is fairer to the punctual human and a gift to the fast bot; randomization is indifferent to arrival time and therefore a tax on the bot’s only structural advantage. Neither is free. Randomization annoys the person who genuinely showed up first and feels cheated, and it does nothing about a bot that simply registers a hundred thousand entries into the draw, which is why randomization only works bolted to identity and bot defenses. The Verified Fan approach is the other half of that bolt: gate who is even allowed into the draw before you randomize the draw.

FIFO: position = arrival order earliest arrival wins (bot-friendly) bot → serve order randomized: position = shuffle(pre-queue) arrival time carries no information bot → serve order *Same five arrivals, two policies. FIFO puts the first-registered bot at the front; the shuffle drops it to wherever chance lands. Randomization does not detect bots, it just removes the prize for being fast.*

Clock skew, and why “first” is harder than it sounds

FIFO needs an ordering, and an ordering needs a clock. This is where the consumer-product framing quietly hands you a genuinely hard distributed-systems problem, because in a fleet there is no global “now.” Every machine keeps time on its own crystal oscillator, and those oscillators drift with temperature, voltage, and age. Inside a single cloud region you should assume a few milliseconds of skew between hosts even with disciplined time sync; across regions the spread is far wider. NTP narrows the gap but does not close it, because it is a probabilistic protocol that gets you within some bound, not to a shared instant.

Now imagine two visitors who both clicked at “the same time.” One lands on an edge node in São Paulo whose clock reads 10:00:00.014 and the other on a node in Madrid reading 10:00:00.009. If you stamp each arrival with the local wall clock and then sort globally by that stamp, you have just ordered your queue partly by whose data center had the faster crystal that morning. The skew is small, single-digit milliseconds, but the herd is so dense in time that single-digit milliseconds reorder thousands of people. The ordering you present as “fairness” is contaminated by clock error.

There is a deeper point underneath, the one Leibniz-via-Lamport made decades ago and which still bites here: in a distributed system you cannot, even in principle, totally order independent events by physical time, because simultaneity is not well defined across nodes that do not share a clock. Two arrivals on two nodes with no causal link between them have no true “which came first”; any answer you give is a convention. The honest engineering responses to this are all about admitting it rather than denying it. You can bound the skew with tighter time discipline (the Spanner-style answer is a per-datacenter time source that hands out an interval [earliest, latest] rather than a point, so you know your uncertainty instead of pretending it is zero). You can attach a logical component to the timestamp, a hybrid logical clock that carries a physical reading plus a counter, so that ordering stays monotonic and ties break deterministically rather than by oscillator luck. Or, and this is the quietly clever move, you can stop ordering by arrival time at all.

That last option is why randomization is not only an anti-bot measure. It is also a clean way out of the clock problem. If the final position is a random draw assigned at sale-open, then the millisecond-level skew in arrival stamps simply does not matter, because arrival order is not an input to the result. The pre-queue still records who is present, but it does not have to agree, to the millisecond, on the order in which they arrived. Randomization launders away both the bot’s speed advantage and the system’s own inability to keep a consistent clock, in one move. SeatGeek, having chosen FIFO, has to confront the clock problem head-on; the randomizing systems get to sidestep a chunk of it. That is a real, and under-discussed, second reason the high-stakes on-sales drifted toward the draw.

The token: how a stateless edge trusts a stateful core

Whatever the ordering policy, the queue eventually has to say “you, go in” to a specific visitor, and that permission has to survive the trip from the queue’s brain to the origin’s door. Doing this with a server-side session lookup on every request would put a stateful database back in the hot path, which is the thing the whole architecture exists to avoid. So the industry uses a signed, self-describing token: the queue mints a cryptographic proof, the visitor carries it, and the gate verifies the signature locally without calling home.

The clearest public specimen is the AWS Virtual Waiting Room reference architecture, because it is open source and documented field by field. The flow is two-phase. First, a visitor calls a public endpoint and is handed a queue position; the docs name the response fields directly, a queue_number integer with an entry_time timestamp. Separately, the system tracks a serving position, exposed as a serving_counter, which advances as the origin signals it can take more people. An admin-side process moves it forward by calling a private increment_serving_counter API with an increment_by value. The rule that ties the two together is simple: your queue_number has to be less than or equal to the current serving_counter before you are allowed to ask for a token. Until then, your browser just polls.

When your number is up, you call generate_token, and the system issues a JWT. The reference implementation’s token carries standard registered claims, and they map onto the waiting-room concepts cleanly: iss is the issuer URL, aud is the event ID, sub is the request ID for that specific visitor, iat and nbf and exp are the issued-at, not-before, and expiry times, and a token_use claim distinguishes an access token from an id or refresh token. The position itself rides along as a queue_position claim. The token is signed RS256, and the gate (a Lambda authorizer in this design) verifies it against a published RSA public key served from a /public_key endpoint that returns the usual JWK fields, kty RSA, alg RS256, and a kid. Critically, generate_token is idempotent: ask twice for the same event and request ID and you get the same token back, with the same expiry, rather than a fresh one. That idempotency is a security property as much as a convenience, and the failure-modes section explains why.

signed access token (JWT, RS256) iss issuer URL aud event_id sub request_id (this visitor) queue_position + iat / nbf / exp / token_use queue_number ≤ serving_counter verify sig vs /public_key (kid) queue mint token visitor carries it gate verifies locally *The AWS reference token, claim by claim. The gate checks the RS256 signature against a published key and never calls the queue's database; that is the whole point of making the proof self-describing.*

Cloudflare’s design reaches the same goal with a different container. Instead of putting the token in an Authorization header, it lives in an encrypted cookie named __cfwaitingroom. The cookie carries the visitor’s group ID, which corresponds to the minute they entered the waiting room, and an acceptedAt value for the minute they were let into the application. While the visitor is still queuing, the cookie’s expiry is always set to five minutes but renews every twenty seconds as the page auto-refreshes, so a visitor who keeps the tab open stays in line and one who wanders off ages out. The cookie is encrypted so the visitor cannot edit their own position or fabricate an acceptedAt. Same trick as the JWT, different envelope: a tamper-evident, self-contained proof the edge can check without a database round trip. Queue-it’s connector model follows the pattern too, with a signed token appended to the redirect URL and a session cookie that lets the connector recognize an already-admitted visitor; the Queue-it architecture piece takes that token and cookie apart in detail.

The session-revocation feature is worth a line because it shows the token model’s seam. Because the proof is self-contained and time-boxed, the natural way to end a session early (say, the moment a purchase completes, so the slot frees immediately instead of idling for the full duration) is an explicit out-of-band signal. Cloudflare exposes this as an origin-returned header, Cf-Waiting-Room-Command: revoke, that tells the edge to tear the session down now. That header exists precisely because a self-describing token has no built-in revocation; you trust it until it expires, so you need a side channel to kill it sooner.

Failure modes: the ways a fair queue leaks

A waiting room is a security control, and like any security control it is best understood through its failure modes. Most of them reduce to one of two faults: admitting more than the ceiling, or letting the wrong party hold the slot.

The first family is the overshoot, and it falls straight out of the eventually-consistent counter discussed earlier. If admission decisions are made against local counts that gossip toward a global total, there is always a window where the sum of local admissions exceeds the configured ceiling before the views reconcile. For the waiting room this is benign; you let in a few hundred extra and the origin, sized with headroom, absorbs them. It stops being benign the instant someone treats the admitted count as an inventory count. The discipline that saves you is keeping the inventory decrement strictly serialized and authoritative behind the gate, never inferred from “you were admitted, so a ticket is yours.” The queue controls concurrency; it must not be trusted to control stock.

The second family is token reuse, and it is the one that turns a fair queue into an unfair one. The whole token design rests on a proof being bound to a single visitor and spent once. Break either property and a single admitted slot multiplies. Concrete instances of this pattern recur across token systems: a 2026 advisory against a widely used identity server, tracked as CVE-2026-1035, describes a time-of-check to time-of-use race in which concurrent requests exchange one refresh token for multiple valid access tokens before the single-use counter is written. The same race shape applies to a queue token. If a visitor can take one valid admission proof and replay it across many parallel connections faster than the system marks it consumed, one position in line becomes many sessions inside the gate, and every one of those extra sessions is a slot stolen from someone who waited. This is exactly why the AWS reference makes generate_token idempotent: returning the same token for the same request ID, rather than minting a fresh one each call, removes the incentive and the ability to farm a stream of distinct tokens from a single queue position. Idempotency here is not an optimization. It is the anti-amplification control.

There is a softer leak that matters just as much in practice: the queue that does not know who is in it. A purely arrival-ordered, identity-free FIFO queue treats every connection as a fresh, equal visitor, which means an attacker who can open a hundred thousand connections simply occupies a hundred thousand positions. Randomization alone does not fix this either; shuffling a draw that one party entered a hundred thousand times still hands that party most of the front of the line. The real fix lives upstream of the ordering policy entirely, in deciding which arrivals are even allowed to take a position. Cloudflare’s published approach pipes failed bot challenges (it reports that roughly one in twenty “users” that run its Turnstile challenge do not pass) into a separate infinite queue whose countdown asymptotically approaches but never reaches zero, so the automated client burns resources waiting for an entry that will never come, while real users queue normally. The economics piece on why waiting rooms leak and the Verified Fan writeup both trace how identity and bot defense have to wrap the queue for the ordering policy inside it to mean anything.

The last failure mode is the one operators feel most and design for least: the queue that becomes the outage it was meant to prevent. Every visitor in the waiting room is polling, and a poorly chosen refresh interval turns the holding pen into its own thundering herd against the queue service. The mitigations are the same ones the queue applies to the origin, turned inward. Cache the status response aggressively, as SeatGeek caches its 429 at Fastly so retries never reach the backend. Pick a refresh cadence that balances freshness against load, as Cloudflare’s twenty-second auto-refresh does. Add jitter so a million tabs that entered in the same second do not all re-poll in the same later second. A waiting room that forgets to defend itself against its own waiting visitors has merely moved the thundering herd one hop upstream and renamed it.

What the queue is really doing

Strip the consumer story away and a virtual waiting room is a small, sharp answer to a specific distributed-systems question: how do you impose a single global ordering and a single global rate limit on a herd, using machinery that is itself distributed, eventually consistent, and missing a shared clock? Every design choice in the space is a different answer to that question. The leaky bucket caches its own rejections so the admission layer survives the herd. The eventually-consistent edge counter trades a tolerable overshoot for never blocking on a lock. The self-describing token lets a stateless gate trust a stateful brain without a database round trip. And the move from FIFO to a randomized draw quietly solves two problems at once, deleting both the bot’s reward for speed and the system’s own inability to agree, to the millisecond, on who arrived first.

What stays with me is that the cleverest decision in the whole stack is a refusal. The systems that scaled best did not try harder to order the herd correctly by arrival time. They noticed that arrival time was a contaminated signal (gameable by automation, unreliable across skewed clocks) and stopped using it as the input that decides who wins. The fairest queue at scale turned out to be the one that gave up on measuring “first” precisely, and replaced a measurement it could not trust with a coin it could.


Sources & further reading

Further reading