Load balancing algorithms: round-robin, least-connections, and consistent hashing

A load balancer has one job that sounds trivial and is not: given an incoming request and a pool of backends that can serve it, pick one. The naive answer is “pick the next one in line.” That answer is wrong often enough that the question has produced four decades of papers, a handful of patents, and a recurring argument inside every infrastructure team that runs more than one server. The hard part is not the picking. The hard part is that the thing you are optimizing for, even load, depends on information the load balancer usually does not have, changes faster than it can measure, and means different things depending on whether your requests last a millisecond or an hour.

This post is a survey of the algorithms that answer the picking question. Round-robin and its weighted variant, least-connections and least-response-time, the power-of-two-choices trick that makes random selection competitive with global state, and the hash-based family that trades even distribution for stickiness. Each one is a different bet about what you know and what you can afford to measure. The goal here is to make those bets explicit, ground the behavior in the algorithms’ real implementations in nginx and HAProxy and Google’s Maglev, and say plainly where the public record stops and folklore begins.

What the algorithm is actually optimizing

Before any algorithm, fix the objective. A load balancer wants every backend to do roughly equal work, where “work” might be request count, concurrent connections, CPU, or tail latency. It wants to react to backends that slow down or fail. And if the backends hold state (a cache, a session, a shard), it may also want the same request to keep landing on the same backend, which is the opposite goal and pulls against evenness.

Those three pulls (spread the load, route around trouble, stay sticky) are in tension. No single algorithm wins all three. The reason there are so many algorithms is that production teams sit at different points in that tradeoff space, and the right answer for a stateless API in front of identical pods is the wrong answer for a fleet of caches where a miss costs a round trip to origin.

There is also a measurement problem that shapes everything. The load balancer’s view of backend state is always stale. By the time it has counted connections or timed responses, the world has moved. With a single load balancer the staleness is small. With many independent load balancers, each acting on its own partial view, staleness becomes the dominant failure mode, and the obvious greedy strategies turn pathological. Keep that in mind as a thread running through the whole survey.

Round-robin and weighted round-robin

Round-robin is the algorithm everyone starts with. Keep a pointer, hand each new request to the next backend, wrap around at the end. It needs no state about the backends, no measurement, no coordination. With N identical backends and short uniform requests, it spreads load about as well as anything. HAProxy exposes it as balance roundrobin and nginx uses it as the default upstream method when you specify nothing.

The trouble starts when backends are not identical. A pool of mixed instance sizes, or a migration where new hardware sits next to old, breaks the assumption that one request equals one request everywhere. The fix is weighting: give each backend a weight proportional to its capacity and hand it that share of requests. In nginx you write server backend1 weight=5; and that backend gets five times the traffic of a weight=1 peer.

The naive way to honor weights is to expand them into a list and walk it. Weights {5, 1, 1} become the sequence a, a, a, a, a, b, c. That distributes the right ratio over a full cycle, but look at the shape: backend a gets five requests in a burst, then b and c get one each, then nothing for five more. A high-weight backend receives its share in clumps. For long-lived connections or bursty arrival that clumping defeats the point.

nginx solved this in 2014 with what it calls smooth weighted round-robin, committed by Maxim Dounin. The algorithm keeps a current_weight per backend. On each pick it adds every backend’s configured weight to its running current_weight, selects the backend with the greatest current_weight, then subtracts the total weight from the winner. For {5, 1, 1} this produces a, a, b, a, c, a, a, the same five-to-one-to-one ratio, but with the high-weight backend interleaved instead of bunched.

*The running counters interleave the heavy backend instead of front-loading it, and return to zero after a full cycle so the long-run ratio is exactly 5:1:1.*

The same nginx code carries a second variable, effective_weight, which normally equals the configured weight but is reduced temporarily when a backend fails a health probe and recovers gradually as the backend starts answering again. That detail matters in production: it means a flapping backend does not get its full share back the instant it returns, which damps oscillation. The algorithm has since been copied widely; you will find it in service meshes and in standalone Go and Rust libraries under the same name.

Round-robin’s blind spot is that it counts requests, not work. If request cost varies, equal request counts produce unequal load. A backend that drew ten slow requests sits behind one that drew ten fast ones, and round-robin keeps feeding both at the same rate. For uniform workloads it is hard to beat on simplicity. For anything with variance in request duration, you want an algorithm that looks at the backend before it picks.

Least-connections

Least-connections is the first algorithm that measures. Track the number of in-flight connections to each backend and send the next request to whichever has the fewest. HAProxy spells it balance leastconn; nginx spells it least_conn. The intuition is that active connection count is a cheap proxy for how busy a backend is right now, and unlike request count it naturally accounts for requests that are still running.

This is the right default when request durations vary. A backend stuck on a slow query accumulates open connections, its count rises, and the algorithm steers new work away from it without anyone configuring a timeout or a weight. HAProxy’s own documentation recommends leastconn for long sessions such as LDAP, SQL, or TSE, and warns against it for short sessions like HTTP where the connection count churns too fast to mean much. HAProxy also rotates servers with equal connection counts in round-robin fashion, so ties do not all pile onto the lowest-indexed backend.

Least-connections has weighted forms too, where the target is the lowest ratio of connections to weight rather than the lowest raw count. That lets a bigger backend carry proportionally more open connections before it stops looking attractive.

The catch is the same staleness problem from earlier, and it gets sharp when you run more than one load balancer. Each load balancer only counts the connections it opened itself. It has no idea how many connections the other load balancers have pointed at the same backend. So every load balancer independently decides the same backend is least loaded, and they all rush it at once. The backend that looked quiet a millisecond ago is now buried. This is the herd problem, and it is exactly why the next algorithm exists.

Least-response-time

Least-response-time pushes the measurement further. Instead of counting connections, time them, and prefer backends that answer fastest. HAProxy describes its variant as routing to the server with the quickest response time among those that also hold the fewest connections, so it is really a tie-break refinement layered on least-connections rather than a pure latency race. nginx Plus exposes least_time, with a parameter choosing whether “time” means the moment the response header arrives (header) or the last byte of the body (last_byte).

Response time captures things connection count misses. A backend with a degraded disk, a noisy neighbor on shared hardware, or a cold cache will show normal connection counts and slow responses. Timing catches it; counting does not. The cost is that latency is noisier than connection count and needs smoothing, usually an exponentially weighted moving average so a single slow request does not evict a healthy backend. The smoothing window becomes a tuning knob, and a badly chosen window either reacts too slowly to real degradation or chases noise.

Least-response-time shares least-connections’ herd vulnerability and arguably makes it worse, because latency measurements are even staler than connection counts (you only learn a backend was slow after it has already been slow for you). Across a fleet of load balancers all optimizing on cached latency, the greedy “always pick the fastest” rule converges on whichever backend looked fastest last, and overwhelms it. Marc Brooker’s writing on this is worth reading: using stale data to pick the single best host produces a herd that stampedes a previously quiet backend for far longer than it takes to make that backend very busy indeed.

The power of two choices

The fix for the herd is almost insultingly simple. Do not pick the best backend. Pick two backends at random, and choose the less loaded of those two. That is the whole idea, and it is one of the genuinely surprising results in the field.

The surprise is in the math. Throw n balls into n bins by choosing a bin uniformly at random for each, and the fullest bin ends up with about log n / log log n balls — a slowly growing but unbounded gap above the average of one. Now change one thing: for each ball pick two bins at random and drop it in the emptier of the two. The fullest bin now holds about log log n / log 2 balls. The maximum load drops from logarithmic to doubly-logarithmic, an exponential improvement, from a change that still uses almost no information. The generalization to d choices gives log log n / log d, so most of the benefit arrives at d = 2 and the curve flattens fast after that. The result is due to Azar, Broder, Karlin, and Upfal, with the asymptotic analysis and the survey title popularized by Michael Mitzenmacher.

*One extra sample per request collapses the worst-case imbalance from logarithmic to doubly-logarithmic. The second bar is not to exact scale, but the gap really is exponential.*

What makes this matter for real load balancers is not the worst-case bound on its own. It is what two choices does to the herd. A greedy “pick the best” rule funnels every load balancer’s traffic toward the same momentarily-quiet backend, because they all share the same stale view and all reach the same conclusion. Two random choices breaks that lockstep. Even if backend X genuinely looks best to everyone, any given load balancer only routes to it when X happens to be one of its two random draws and beats the other draw. The randomness scatters the herd while the comparison still uses real load information.

nginx ships this as random two, a variant of its Random method. The directive picks two servers at random and then applies a tie-break: in open-source nginx the better of the two is the one with fewer active connections, and in nginx Plus you can write random two least_time=last_byte to break the tie on response time instead. F5’s own writeup frames it with a queue-at-the-airport analogy and is explicit that the method exists for the multiple-load-balancer case, where classic least-connections “works very well when you operate a single active load balancer” and falls apart when several share the work. Two choices has since become the default recommendation for large stateless fleets, and it is the algorithm behind a lot of the “least-loaded” routing in modern service meshes even when the docs do not name it.

Hash-based routing and session affinity

Everything so far tries to spread load and treats every backend as interchangeable. Hash-based routing makes the opposite bet. It computes a hash of some stable attribute of the request and uses that hash to pick the backend, so the same input always lands on the same place. Evenness is no longer the goal; stickiness is.

The cheapest version hashes the client IP. HAProxy’s balance source and nginx’s ip_hash both do this. The same client keeps hitting the same backend, which gives you session affinity without a shared session store. It is also the bluntest version: clients behind a large NAT or a corporate proxy share an IP and pile onto one backend, and mobile clients whose IP changes lose their affinity. For routing by something more specific you can hash a URL path (HAProxy balance uri, useful in front of a cache so each object has one home), a header, or a cookie.

The naive way to turn a hash into a backend index is modulo: backend = hash(key) % N. It works until N changes. Add or remove one backend and almost every key remaps, because changing the divisor reshuffles the whole modular arithmetic. For a cache fleet that is catastrophic. One backend removed means nearly every cached object is now looked for on the wrong server, every lookup misses, and the whole fleet stampedes the origin at once. This is the precise problem consistent hashing was invented to solve, and it is worth its own treatment. The consistent hashing deep-dive covers the ring construction and virtual nodes in detail. Here it is enough to know what it buys you.

Consistent hashing and bounded loads

Consistent hashing, from Karger and colleagues at MIT in 1997, maps both keys and backends onto the same circular hash space, and assigns each key to the first backend it meets going clockwise. The payoff is in what happens when the backend set changes. Add or remove a backend and only the keys in that backend’s arc move; on average about k/N keys remap rather than nearly all of them. The 1997 paper’s title names the original use case directly: relieving hot spots for distributed web caching. Amazon’s 2007 Dynamo paper pulled it into the mainstream of distributed databases, and it is now under the hood of essentially every cache fleet and partitioned datastore.

*Removing a backend only displaces the keys in its clockwise arc, which is why adding or removing capacity does not invalidate the whole cache.*

Plain consistent hashing has a weakness that the cache people live with and the load people cannot: it does not promise even load. Random placement on the ring leaves gaps of uneven size, so some backends own much larger arcs than others, and a single hot key can bury whichever backend owns it regardless of how the ring is balanced. Virtual nodes smooth the arcs but do nothing for a hot key.

The 2017 answer from Google’s NYC algorithms team, working with Mikkel Thorup of Copenhagen, was consistent hashing with bounded loads. Each backend gets a hard capacity cap set to (1 + ε) times the average load. Keys still hash to the ring and walk clockwise, but if the first backend they meet is already at capacity they skip to the next one with room. The ε parameter dials the tradeoff directly: small ε forces near-perfect evenness at the cost of more reassignment when things change, large ε allows more imbalance and moves fewer keys. Google reported deploying it in Cloud Pub/Sub for better load uniformity, and Andrew Rodland of Vimeo implemented the same idea in HAProxy and reported cutting cache bandwidth by a factor of nearly eight, which removed a scaling bottleneck. HAProxy ships it today as a hash balancing option; the bounded-loads variant is the reason the hash family is usable for load balancing and not just for caches.

Maglev: hashing built for a load balancer

Google’s Maglev, described at NSDI 2016, is worth a section because it shows what consistent hashing looks like when even distribution is a first-class requirement rather than an afterthought. Maglev is the software network load balancer in front of Google’s user-facing services, and it does not use a ring at all. It builds a fixed-size lookup table and fills it so that every backend ends up owning almost exactly the same number of slots.

The construction is a permutation trick. For each backend it derives an offset and a skip from two hash functions, and generates that backend’s preference order over table slots as (offset + j × skip) mod M, where M is the table size and must be prime so that skip is coprime to M and the sequence visits every slot. Then it fills the table by round-robin among backends: each backend in turn claims its most-preferred slot that is still empty, until the table is full. Because every backend takes a slot each round, they all end up with within one slot of an equal share.

Maglev’s table size M is a deliberate knob, and the paper gives the exact numbers. The default is the prime 65537; a larger configuration uses 655373. The larger table distributes more evenly but costs more to build: generation time rises from 1.8 ms to 22.9 ms as M grows from 65537 to 655373, which is why they cap it rather than growing it without limit. At the small table size, ring-based consistent hashing (Karger) and Rendezvous hashing would require backends to be overprovisioned by 29.7% and 49.5% respectively to absorb their imbalance; at the larger table those numbers fall to 10.3% and 12.3%, while Maglev stays nearly perfectly even at either size. That is the whole pitch: Maglev gives up some of consistent hashing’s minimal-disruption property in exchange for distribution tight enough that you do not have to overprovision the fleet.

It pays for that tradeoff in churn. When a backend is added or removed, Maglev does not promise that only that backend’s keys move; rebuilding the table reshuffles some fraction of every backend’s slots. The paper accepts this because Maglev pairs the hash table with a per-machine connection tracking table keyed on the 5-tuple. Existing flows stay pinned to their backend through the connection table even as the lookup table changes underneath them, so in steady state a table rebuild does not reset live connections. The hash decides where new flows go; the flow table keeps old flows where they were.

*The round-robin fill is what guarantees near-equal share. The price is that a rebuild moves more keys than a ring would, which the flow table absorbs for connections already in flight.*

When each one fits

The choice falls out of two questions. Do your backends hold state that the request needs to find again? And how many independent load balancers share the work?

If backends are stateless and identical, round-robin is the floor and it is fine for uniform short requests. The moment request cost varies, move to least-connections, which is the workhorse default for variable-duration traffic and costs almost nothing to run. Add weights when the hardware is mixed, and prefer the smooth variant so high-weight backends do not receive their share in bursts. Reach for least-response-time only when you have a real latency-skew problem that connection count misses, such as backends on shared or degrading hardware, and accept that you now own a smoothing-window tuning knob.

The number of load balancers is the hinge that people miss. A single load balancer with an accurate connection count can run a greedy least-loaded policy safely. The moment you have several, each acting on its own partial view, greedy least-loaded turns into a herd and the right move is power-of-two-choices, which keeps the load-aware comparison while randomization scatters the stampede. That is why random two and its equivalents are the default in large fleets and meshes, not least-connections.

Hashing is for when stickiness beats evenness. In front of a cache, hash the object so each lives on one backend; in front of sharded state, hash the shard key; for crude session affinity without a shared store, hash the client IP and accept the NAT lumpiness. Use plain modulo only if your backend set never changes, which in practice means never use it. Use consistent hashing the moment capacity is elastic, and reach for the bounded-loads variant when a single hot key or an uneven ring would otherwise bury a backend. If you are building the load balancer itself rather than configuring one, Maglev’s table approach is the reference for getting near-perfect distribution out of a hash while a separate flow table protects live connections.

The thread running through all of it is the gap between what the algorithm knows and what is true. Round-robin knows nothing and is honest about it. Least-connections and least-response-time know something slightly stale and get into trouble exactly when that staleness compounds across many deciders. Power-of-two-choices is the cheap insurance against that compounding, and it is cheap because it deliberately throws away the ability to always pick the best. Hashing gives up evenness for memory of where things went, and consistent hashing is the patch that keeps that memory from being wiped every time the fleet resizes. Each algorithm is a different answer to the same question, which is not “which backend is best” but “how wrong is my picture of the backends, and what does that wrongness cost me.” Pick the algorithm whose failure mode you can live with, because every one of them has a workload that makes it look foolish.

Sources & further reading

Azar, Broder, Karlin, Upfal (1999), Balanced Allocations — the original “two choices” result giving max load log log n / log d; summarized with the asymptotics on the balls-into-bins page.
Mitzenmacher (2001), The Power of Two Choices in Randomized Load Balancing — IEEE TPDS survey that popularized the analysis and the name.
Brooker (2012), The power of two random choices — why best-of-two beats best-of-all when load data is stale, with simulation of the herd effect.
F5/NGINX (2018), NGINX and the “Power of Two Choices” Load-Balancing Algorithm — the random two directive and the multiple-load-balancer herd problem it addresses.
Dounin / nginx (2014), Upstream: smooth weighted round-robin balancing — the commit introducing current_weight/effective_weight and the {5,1,1} → a a b a c a a sequence.
HAProxy (n.d.), What are load-balancing algorithms? — vendor reference for roundrobin, leastconn, least-response-time, source, uri, and random with two choices.
Karger et al. (1997), Consistent Hashing and Random Trees — the STOC paper that introduced consistent hashing for relieving web-cache hot spots.
Mirrokni, Zadimoghaddam, Thorup (2017), Consistent Hashing with Bounded Loads — Google Research post on the (1+ε) capacity cap, Cloud Pub/Sub use, and Vimeo’s near-8x cache-bandwidth result.
Mirrokni, Thorup, Zadimoghaddam (2016), Consistent Hashing with Bounded Loads (arXiv) — the full paper with the load-bound proofs behind the blog post.
Eisenbud et al. (2016), Maglev: A Fast and Reliable Software Network Load Balancer — NSDI paper with the permutation-table construction, M = 65537/655373, and the overprovisioning comparison against Karger and Rendezvous.