BGP explained: how the internet's routing table actually converges
Type an address into a browser and a packet leaves your machine with no idea how to reach the far end. It knows the destination IP. It does not know which of the roughly seventy-eight thousand independent networks that make up the internet will carry it, or in what order. That decision was made earlier, by routers that had already agreed on a path, and the protocol they used to agree is BGP. The agreement is the interesting part. There is no central map. No authority hands out the routes. Each network tells its neighbors which destinations it can reach, the neighbors pass that on with their own name appended, and out of millions of these gossiped fragments a more-or-less consistent routing table emerges. Most of the time.
The “most of the time” is where this gets technical. BGP converges slowly, trusts almost everyone by default, and can take a network offline because a router on another continent made a typo. None of that is a bug in the usual sense. It falls out of the design choices that let the protocol scale to the whole planet in the first place. This post walks those choices from the bottom up.
The sections below move in order. First the units BGP deals in: autonomous systems and prefixes. Then path-vector routing and the AS_PATH attribute that makes it work. Then the split between external and internal BGP, and the scaling problem that split creates. Then the route-selection algorithm, step by step, straight from RFC 4271. Then convergence: why it is slow, what path exploration is, and what the timers do. Finally the trust problem, RPKI, and where the defenses actually stand in 2026.
Autonomous systems and prefixes
The internet is not one network. It is a mesh of independently operated networks, each called an autonomous system. An AS is a collection of IP prefixes under a single routing policy, run by one organization: a transit provider, a university, a content company, a national ISP. Each AS has a number. The original AS number was a 16-bit value, which capped the space at 65,536. That ran low, and RFC 6793 (2012) extended ASNs to 32 bits, giving 4,294,967,296 possible numbers. To let new 4-byte-capable routers interoperate with old 2-byte ones, the spec reserved the placeholder AS number 23456, called AS_TRANS, which a 4-byte ASN collapses into when it has to be represented in a 2-byte field.
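The AS_TRANS substitution can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical helper name (`asn_in_two_byte_field`); it shows only the number mapping, not the AS4_PATH machinery that carries the real 4-byte value alongside.

```python
AS_TRANS = 23456            # placeholder ASN reserved by RFC 6793
MAX_2BYTE_ASN = 0xFFFF      # largest value a 16-bit field can hold
MAX_4BYTE_ASN = 0xFFFFFFFF  # largest value a 32-bit field can hold


def asn_in_two_byte_field(asn: int) -> int:
    """Return the value a 2-byte AS field would carry for this ASN."""
    if not 0 <= asn <= MAX_4BYTE_ASN:
        raise ValueError(f"not a valid AS number: {asn}")
    # A 2-byte ASN is carried unchanged; anything larger collapses to AS_TRANS.
    return asn if asn <= MAX_2BYTE_ASN else AS_TRANS


print(asn_in_two_byte_field(64500))
print(asn_in_two_byte_field(4200000000))
```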
As of January 2026 the routing system carries roughly 77,900 active ASes in the IPv4 table and about 36,100 in IPv6, per APNIC’s year-end measurements. Most of those are stub networks (around 66,500 in IPv4) that only originate their own address space and buy transit from someone else. The minority, about 11,400, are transit ASes that carry other people’s traffic.
The thing an AS advertises is a prefix: a block of contiguous IP addresses written as a network address and a mask length, like 192.0.2.0/24. That /24 covers 256 addresses. A /16 covers 65,536. BGP does not route individual addresses; it routes prefixes, and a router forwards a packet by finding the longest (most specific) prefix in its table that matches the destination. The full IPv4 table at the end of 2025 sat between roughly 1,040,000 and 1,050,000 prefixes, with IPv6 around 241,800. That number keeps climbing, mostly because operators advertise more-specific sub-blocks of address space they already hold rather than because new address space appears. The CIDR Report tracks how much of the table is redundant more-specifics that could be aggregated; in early 2026 that class accounted for over 460,000 routes.
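Prefix sizes and longest-prefix matching are easy to check with Python's standard `ipaddress` module. The sketch below assumes a hypothetical three-entry table and a linear scan; real routers use tries or hardware lookup structures, so treat this as an illustration of the rule rather than of any implementation.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Return the most specific prefix in `table` that contains `destination`."""
    addr = ipaddress.ip_address(destination)
    matches = [prefix for prefix in table if addr in prefix]
    # The longest mask length is the most specific match; None if nothing matches.
    return max(matches, key=lambda prefix: prefix.prefixlen, default=None)


# Hypothetical routing table, least specific to most specific.
table = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("10.1.0.0/16"),
    ipaddress.ip_network("10.1.2.0/24"),
]

print(ipaddress.ip_network("192.0.2.0/24").num_addresses)
print(longest_prefix_match(table, "10.1.2.7"))
print(longest_prefix_match(table, "10.1.9.9"))
```

The same rule is why an advertised more-specific always attracts the traffic for the addresses it covers, regardless of who announced the covering block.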
Path-vector routing and the AS_PATH
Inside a single network you run an interior gateway protocol, OSPF or IS-IS, that computes shortest paths over a known topology. That works because everyone inside the AS trusts everyone else and shares a full view of the links. Between ASes none of that holds. Networks are competitors, their internal topology is private, and “shortest” is meaningless when a path crosses six different companies with six different commercial agreements. BGP solves a different problem, and it solves it with a different mechanism.
BGP is a path-vector protocol. When an AS advertises a prefix, it attaches the AS_PATH attribute: the ordered list of AS numbers the advertisement has passed through. RFC 4271 calls AS_PATH a well-known mandatory attribute that identifies the autonomous systems through which the routing information in an UPDATE has passed. Each time a route crosses an AS boundary outbound, the sending router prepends its own AS number to the front of the path. So a route that started at AS 64520 and reached you via AS 64510 and AS 64500 arrives carrying 64500 64510 64520. The leftmost number is your neighbor; the rightmost is the origin.
That list does two jobs at once. It is a loop-detection mechanism: when a router receives a route whose AS_PATH already contains its own AS number, it drops the route, because accepting it would mean traffic looping back through a network it already traversed. And it is the closest thing BGP has to a distance metric. The number of ASes in the path is a coarse proxy for path length, and it is the first real tie-breaker the selection algorithm reaches. Coarse is the right word. A three-AS path across three continents can be slower than a five-AS path within one region. AS_PATH counts hops between networks, not latency, not bandwidth, not anything a packet experiences.
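Both jobs fit in a few lines. The following is a minimal sketch, with hypothetical function names, that models AS_PATH as a plain list of AS numbers and ignores AS_SETs and confederation segments.

```python
def send_over_ebgp(as_path, local_as):
    """Prepend the sender's AS number, as happens on every eBGP advertisement."""
    return [local_as, *as_path]


def accepts(as_path, local_as):
    """Loop check: reject any route whose path already contains our own AS."""
    return local_as not in as_path


# The route from the text: originated by 64520, then relayed by 64510 and 64500.
path = []
for asn in (64520, 64510, 64500):
    path = send_over_ebgp(path, asn)

print(path)                  # neighbor first, origin last
print(len(path))             # the coarse "distance" BGP compares
print(accepts(path, 64999))  # an AS not on the path keeps the route
print(accepts(path, 64510))  # an AS already on the path drops it
```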
This is why BGP is sometimes described as routing on policy rather than on shortest path. The path that wins is not the fastest one; it is the one the local network’s policy prefers, with AS_PATH length as a fallback when policy is silent. That distinction matters enormously for how anycast routing steers traffic, where the same prefix is announced from many locations and BGP picks a winner per vantage point with no idea which copy is actually closest in milliseconds.
There is a deliberate trick operators play with AS_PATH length, called prepending. An AS that wants one of its links to be a backup rather than primary can advertise the same prefix out that link with its own AS number repeated several times, padding the path artificially so that neighbors see it as longer and prefer the other link. It is a crude lever, since LOCAL_PREF at the receiving end overrides it entirely, and it only works on the segment of the internet that has not set a stronger local policy. But it is cheap and it requires no coordination with anyone else, which is exactly why traffic engineering on the public internet is still done largely by stuffing extra copies of a number into a list.
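A small sketch of what prepending does to the comparison, assuming a hypothetical origin AS 64520 with two upstreams, 64500 (primary) and 64510 (backup), and a remote network that has no local policy and falls back to path length.

```python
def advertise(local_as, extra_copies=0):
    """Path as sent by the origin: its own AS once, plus any padding."""
    return [local_as] * (1 + extra_copies)


# Each upstream prepends its own AS number before passing the route on.
via_primary = [64500, *advertise(64520)]
via_backup = [64510, *advertise(64520, extra_copies=3)]

# With no stronger policy in play, the shorter AS_PATH wins.
best = min((via_primary, via_backup), key=len)

print(via_primary)
print(via_backup)
print(best is via_primary)
```

A receiver that sets a higher LOCAL_PREF on the backup path would never reach this comparison, which is the limit of the lever described above.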
Path-vector also dodges a problem that sank earlier distance-vector protocols. Pure distance-vector routing (RIP and its kin) advertises only a metric, not a path, so a router cannot tell whether a route it hears is genuinely new or an echo of something it advertised a moment ago. That causes the count-to-infinity problem, where a withdrawn route bounces between routers with the metric ticking slowly upward. Carrying the full path mostly kills that, because the loop is visible in the AS_PATH itself. Mostly. As the convergence section gets into, a weaker version of the same pathology survives in BGP.
eBGP, iBGP, and the full-mesh problem
There are two flavors of BGP session, and the difference is not cosmetic. External BGP runs between routers in different ASes. Internal BGP runs between routers in the same AS. They follow different rules for one specific reason: loop prevention works differently inside and outside.
Across an eBGP session, the sending router prepends its AS number, so the AS_PATH grows and loops are catchable. Across an iBGP session, the AS_PATH is not modified, because both routers are in the same AS and prepending would be wrong. That creates a gap. If iBGP does not change the path, the AS_PATH loop check cannot catch a loop that forms inside the AS. BGP closes the gap with a blunt rule: a route learned via iBGP is never re-advertised to another iBGP peer. A router will pass an iBGP-learned route out an eBGP session, but not to another internal router.
The consequence is a scaling disaster waiting to happen. If internal routers cannot relay routes to each other, then every BGP-speaking router in the AS must peer directly with every other one, so that each hears every external route firsthand. That is a full mesh, and the session count grows as N(N−1)/2. Ten routers need 45 sessions. A hundred routers need 4,950. Each session is state, configuration, and memory the router has to hold. Past a few dozen routers the full mesh stops being practical.
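The arithmetic is quick to verify. A minimal sketch, with a hypothetical function name:

```python
def full_mesh_sessions(routers: int) -> int:
    """Number of iBGP sessions when every router peers with every other one."""
    return routers * (routers - 1) // 2


for n in (10, 50, 100):
    print(n, full_mesh_sessions(n))
```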
*Full-mesh iBGP requires a session between every pair of routers; a route reflector lets clients peer only with the reflector, which relays routes on their behalf.*
Two mechanisms break the full mesh. Route reflectors (RFC 4456) designate one or more routers as reflectors that are allowed to re-advertise iBGP-learned routes to their clients, relaxing the no-relay rule in a controlled way. Clients peer only with the reflector. To keep loop prevention working without AS_PATH, reflectors attach two attributes: ORIGINATOR_ID, the router ID of the route’s originator, and CLUSTER_LIST, the chain of reflector clusters the route has passed through. A router that finds its own router ID in ORIGINATOR_ID, or a reflector that finds its own cluster ID in CLUSTER_LIST, drops the route. Confederations (RFC 5065) take a different tack, splitting one AS into smaller sub-ASes that run eBGP between themselves while presenting a single AS number to the outside. Both trade some optimality for scale. Research on iBGP convergence has measured the cost: in one study the median time a network spends with an inconsistent forwarding state rose from about 1.3 seconds under full mesh to between 2.3 and 2.8 seconds with one to three reflectors, because the indirection adds a relay hop to every update.
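The reflector's two loop-prevention attributes can be sketched as follows. This is a simplified model, assuming hypothetical names (`Route`, `reflect`, `accepts`) and string router IDs; it covers only the ORIGINATOR_ID and CLUSTER_LIST handling described above, not the rules for which peers a reflector relays to.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    originator_id: Optional[str] = None
    cluster_list: Tuple[str, ...] = ()


def reflect(route, cluster_id, learned_from):
    """Stamp a route the way a reflector does before relaying it."""
    return replace(
        route,
        # ORIGINATOR_ID is set once, by the first reflector, and then kept.
        originator_id=route.originator_id or learned_from,
        # Each reflector prepends its own cluster ID.
        cluster_list=(cluster_id, *route.cluster_list),
    )


def accepts(route, router_id, cluster_id=None):
    """Drop a route that has already passed through this router or cluster."""
    if route.originator_id == router_id:
        return False
    if cluster_id is not None and cluster_id in route.cluster_list:
        return False
    return True


learned = Route("198.51.100.0/24")
reflected = reflect(learned, cluster_id="10.0.0.1", learned_from="10.0.0.11")
print(reflected)
print(accepts(reflected, router_id="10.0.0.12"))  # another client keeps it
print(accepts(reflected, router_id="10.0.0.11"))  # the originator drops it
```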
The route-selection algorithm
A router commonly hears several routes to the same prefix, from different neighbors, with different AS_PATHs and attributes. It installs exactly one as best. The decision is deterministic and ordered, which is what lets a network of routers reach a consistent result without a central coordinator. RFC 4271 splits the decision process into three phases: Phase 1 computes a degree of preference for each route, Phase 2 selects the best route per destination, and Phase 3 disseminates the result to peers.
The degree of preference in Phase 1 is set by LOCAL_PREF for internally learned routes, or by local policy for externally learned ones. LOCAL_PREF is a well-known attribute included in every UPDATE a speaker sends to its internal peers, and the higher value wins. This is the policy knob operators reach for first: an AS sets a high LOCAL_PREF on routes through a cheap or preferred provider so those routes win before any other attribute is even consulted. Because LOCAL_PREF is evaluated before AS_PATH length, a longer path through a preferred provider routinely beats a shorter path through an expensive one. Money outranks distance.
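A minimal sketch of why the ordering matters, assuming two hypothetical routes to the same prefix and arbitrary LOCAL_PREF values chosen for illustration:

```python
routes = [
    {"via": "preferred provider", "local_pref": 200,
     "as_path": [64500, 64510, 64530, 64520]},
    {"via": "expensive provider", "local_pref": 100,
     "as_path": [64505, 64520]},
]

# Highest LOCAL_PREF first; AS_PATH length only matters among equals.
best = max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))

print(best["via"])
print(len(best["as_path"]))
```

The four-AS path wins over the two-AS path because the comparison never gets as far as path length.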
When two routes tie on degree of preference, Phase 2 runs the tie-breaking procedure from RFC 4271 Section 9.1.2.2. The criteria are applied strictly in order, and the algorithm stops as soon as one route remains:
*The RFC 4271 Section 9.1.2.2 tie-break order. Vendor implementations insert proprietary steps (Cisco's WEIGHT, for instance) ahead of step a, but the standard order is what the spec defines.*
- Step a removes every route not tied for the smallest number of AS numbers in the AS_PATH, with an AS_SET counting as one regardless of how many ASes it contains.
- Step b keeps the lowest ORIGIN value, where a route learned from an IGP is preferred over one learned from the old EGP, which is preferred over “incomplete” (typically redistributed).
- Step c compares MULTI_EXIT_DISC, the metric a neighboring AS uses to hint which of several entry points it prefers; the RFC is explicit that MED is only comparable between routes from the same neighboring AS, and a route with no MED is treated as MED 0.
- Step d prefers eBGP-learned routes over iBGP-learned ones, which is why an AS prefers to exit through its own border rather than haul traffic across its backbone to someone else’s exit.
- Step e picks the lowest interior cost to the NEXT_HOP, the “hot potato” behavior of dumping traffic toward the nearest exit.
- Steps f and g are pure determinism: lowest router ID, then lowest peer address, so that two routers facing an otherwise perfect tie still pick the same winner.
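The ordered elimination can be sketched as a chain of filters. This is an illustration of the RFC order only, assuming a hypothetical `Route` record with integer router IDs and peer addresses and an AS_PATH length that already counts each AS_SET as one; it does not model any vendor's extra steps.

```python
from dataclasses import dataclass
from typing import Optional

ORIGIN_IGP, ORIGIN_EGP, ORIGIN_INCOMPLETE = 0, 1, 2


@dataclass
class Route:
    as_path_len: int      # AS_SET already counted as one
    origin: int           # ORIGIN code, lower is preferred
    med: Optional[int]    # None means the attribute is absent
    neighbor_as: int
    is_ebgp: bool
    igp_cost: int         # interior cost to the NEXT_HOP
    router_id: int
    peer_addr: int


def keep_lowest(routes, key):
    """Keep only the routes tied for the lowest value of `key`."""
    lowest = min(key(r) for r in routes)
    return [r for r in routes if key(r) == lowest]


def step_c_med(routes):
    """Drop a route if a route from the same neighbor AS has a lower MED."""
    def med(r):
        return 0 if r.med is None else r.med  # missing MED treated as 0
    return [m for m in routes
            if not any(n.neighbor_as == m.neighbor_as and med(n) < med(m)
                       for n in routes)]


def step_d_ebgp(routes):
    """If any eBGP-learned route remains, drop the iBGP-learned ones."""
    ebgp = [r for r in routes if r.is_ebgp]
    return ebgp or routes


STEPS = [
    lambda rs: keep_lowest(rs, lambda r: r.as_path_len),  # a
    lambda rs: keep_lowest(rs, lambda r: r.origin),       # b
    step_c_med,                                           # c
    step_d_ebgp,                                          # d
    lambda rs: keep_lowest(rs, lambda r: r.igp_cost),     # e
    lambda rs: keep_lowest(rs, lambda r: r.router_id),    # f
    lambda rs: keep_lowest(rs, lambda r: r.peer_addr),    # g
]


def tie_break(candidates):
    """Apply steps a to g in order, stopping once a single route remains."""
    routes = list(candidates)
    if not routes:
        raise ValueError("no candidate routes")
    for step in STEPS:
        if len(routes) == 1:
            break
        routes = step(routes)
    return routes[0]
```

Feeding it an iBGP route with a low interior cost and an eBGP route with a high one, tied on steps a to c, returns the eBGP route, because step d runs before step e.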
One detail trips people up. The order above is the RFC’s order, but no two vendors implement exactly this list. Cisco IOS inserts a WEIGHT step (locally significant, never advertised) before everything, and most implementations slot in checks the RFC never mentions. The FRRouting project’s documented selection algorithm runs to roughly a dozen steps once you count multipath and EVPN extensions. The standard defines the skeleton; the products flesh it out. When you debug a selection that surprises you, you check the specific platform’s order, not the RFC alone.
NEXT_HOP threads through all of this. It is the well-known mandatory attribute that names the IP address a router should forward to in order to reach the destinations in an UPDATE. Across eBGP the next hop is usually the advertising router’s interface. Carried into iBGP it is not rewritten by default, which means internal routers must have an IGP route to that external next hop or the BGP route fails the resolvability check and never gets installed. That single requirement (the next hop must be reachable through the IGP) is behind a large fraction of “the route is there but not in the table” problems operators hit.
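The resolvability check amounts to a lookup of the next hop in the IGP's view of the network. A minimal sketch, assuming a hypothetical function name and a hypothetical IGP table that covers only internal address space:

```python
import ipaddress


def next_hop_resolvable(next_hop, igp_prefixes):
    """True if some IGP prefix covers the BGP NEXT_HOP address."""
    hop = ipaddress.ip_address(next_hop)
    return any(hop in ipaddress.ip_network(prefix) for prefix in igp_prefixes)


# Hypothetical IGP table: internal links and loopbacks only.
igp_prefixes = ["10.0.0.0/16"]

print(next_hop_resolvable("10.0.3.1", igp_prefixes))   # internal next hop
print(next_hop_resolvable("192.0.2.1", igp_prefixes))  # external, not in the IGP
```

In the second case the BGP route is known to the router but cannot be installed, which is the symptom described above. Carrying the external link subnet in the IGP, or rewriting the next hop at the border router, are the usual ways to make the lookup succeed.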
Convergence: why it is slow
Convergence is the time from a topology change until every router agrees on a stable, loop-free set of paths. In a circuit-switched network failover happens in milliseconds. BGP does not work that way. The foundational measurement is Labovitz’s SIGCOMM 2000 study, which found that interdomain routers can take tens of minutes to reach a consistent view after a fault, and that during those windows end-to-end paths see intermittent loss of connectivity, elevated latency, and increased packet loss. Twenty-five years of tuning have shortened the typical case to seconds, but the mechanism the paper identified is still in the protocol.
The mechanism is path exploration. When a route is withdrawn, a router does not immediately conclude the destination is gone. It still holds alternate paths it learned earlier from other neighbors, paths that may themselves depend on the link that just failed but whose withdrawals have not arrived yet. So the router fails over to one of those stale alternates and advertises it. Its neighbors do the same with their own stale alternates. The network churns through a sequence of progressively longer, progressively more invalid paths before the withdrawals catch up and the destination is finally declared unreachable everywhere. This is the count-to-infinity problem in a milder form: path-vector loop detection stops the count from running forever, but it does not stop the network from exploring a pile of doomed paths first.
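A toy model of one router makes the walk visible. The sketch below assumes a hypothetical router holding three alternates to the same prefix, all of which depend on the failed origin link, with withdrawals arriving one neighbor at a time; it ignores timers and the behavior of the rest of the network.

```python
def explore(alternates, withdrawal_order):
    """Return what the router advertises after each withdrawal arrives.

    `alternates` maps a neighbor AS to the AS_PATH learned from it.
    None in the result means the prefix is finally declared unreachable.
    """
    held = dict(alternates)
    advertised = []
    for neighbor in withdrawal_order:
        held.pop(neighbor, None)
        # Fail over to the shortest path still believed to be valid.
        best = min(held.values(), key=len) if held else None
        advertised.append(best)
    return advertised


# All three stale paths end at origin 64520, whose link has failed.
alternates = {
    64501: [64501, 64520],
    64502: [64502, 64501, 64520],
    64503: [64503, 64502, 64501, 64520],
}

for step in explore(alternates, withdrawal_order=[64501, 64502, 64503]):
    print(step)
```

Each advertisement is longer than the last and none of them works, which is the pattern the paragraph above describes.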
*During path exploration a router walks through successively longer alternate paths, advertising each, before the withdrawals propagate far enough to declare the prefix unreachable.*
Two things slow the walk down. The first is the obvious one: propagation. An update has to cross every AS on the path, and each hop adds processing and queuing delay. The second is deliberate. BGP rate-limits how often it will advertise a change for a given prefix, using the MinRouteAdvertisementInterval. RFC 4271 sets the suggested default at 30 seconds for eBGP sessions and 5 seconds for iBGP. The timer exists to damp route churn: without it, a flapping link would fire a storm of updates, so the router batches changes and waits out the interval before sending the next advertisement for that prefix. The trade is direct. The timer suppresses churn at the cost of convergence speed, and research has spent two decades arguing over its optimal value, with proposals ranging from tuning it per-link to removing it entirely. A flapping prefix can also trip route flap damping, which penalizes an unstable route and suppresses it for a while, trading availability for stability in the other direction.
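The batching effect of the timer can be sketched with a simplified model: one prefix, one peer, no jitter, and a hypothetical function name. Real implementations differ in how they scope and jitter the timer, so the numbers are illustrative only.

```python
def send_times(change_times, mrai=30.0):
    """Times at which an update goes out, given when the best path changed."""
    sent = []
    next_allowed = float("-inf")
    for t in sorted(change_times):
        if sent and t <= sent[-1]:
            # This change is folded into an update already scheduled later.
            continue
        send_at = max(t, next_allowed)
        sent.append(send_at)
        next_allowed = send_at + mrai
    return sent


# Four changes in quick succession, then one more at t=40 seconds.
changes = [0, 1, 2, 3, 40]
print(send_times(changes, mrai=30.0))  # eBGP suggested default
print(send_times(changes, mrai=5.0))   # iBGP suggested default
```

With the 30-second interval, five changes collapse into three updates, and the last one leaves 20 seconds after the change that caused it. That delay is the convergence cost of the damping.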
The other timers shape the session rather than the updates. RFC 4271’s suggested defaults are a 120-second ConnectRetry timer, a 90-second Hold Time, and a KeepAlive timer at one third of the Hold Time. If a router hears nothing from a peer for the whole Hold Time, it tears the session down and withdraws everything it learned across it. Ninety seconds is a long time to keep forwarding into a dead session, which is why fast-failure detection sits underneath BGP rather than inside it. Bidirectional Forwarding Detection runs a lightweight hello at sub-second intervals and tells BGP to drop the session the moment the path goes quiet, cutting failure detection from tens of seconds to milliseconds without touching the BGP timers themselves.
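The gap between the two detection paths is simple arithmetic. In the sketch below the BGP values are the RFC 4271 suggested defaults quoted above; the BFD transmit interval and multiplier are assumptions chosen for illustration, not figures from this post, and the detection time is modeled as interval times multiplier.

```python
HOLD_TIME_S = 90        # RFC 4271 suggested default
CONNECT_RETRY_S = 120   # RFC 4271 suggested default


def keepalive_interval(hold_time_s):
    """KeepAlive is sent at one third of the Hold Time."""
    return hold_time_s / 3


def bfd_detection_time_ms(tx_interval_ms, detect_multiplier):
    """Failure is declared after `detect_multiplier` missed hellos."""
    return tx_interval_ms * detect_multiplier


print(keepalive_interval(HOLD_TIME_S))   # seconds between KeepAlives
print(HOLD_TIME_S * 1000)                # worst-case BGP detection, in ms
print(bfd_detection_time_ms(300, 3))     # assumed BFD settings, in ms
```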
The practical upshot is that BGP convergence is bimodal. A simple change with a ready alternate (a more-preferred route appearing, or a clean withdrawal with an obvious backup) converges in seconds. A messy failure deep in the topology, where many ASes hold many stale alternates, can still take far longer as the network explores its way to the truth. The architecture optimizes for stability and scale over speed, and the slow tail is the price.
Why the control plane runs on trust
Here is the part that surprises people new to it. BGP, by default, believes what it is told. When a neighbor announces a prefix with a given AS_PATH, the receiving router has no built-in way to verify that the origin AS is actually entitled to that prefix, or that the AS_PATH is real rather than fabricated. The protocol authenticates the session (you can run it over TCP-MD5 or TCP-AO so a stranger cannot inject UPDATEs into an established session), but it does not authenticate the contents. A network that is allowed to speak BGP to you can tell you anything about who reaches what.
This is the design. BGP was specified in an era when the few hundred networks on the internet largely knew and trusted each other, and embedding cryptographic verification of every route would have been infeasible at the time. The result is that the entire global routing system rests on operators announcing only the prefixes they hold and only the paths that exist. When that assumption breaks, the consequences are immediate and far-reaching, which is the whole subject of BGP hijacks and route leaks. The most-cited example is still the simplest: in 2008 Pakistan Telecom announced a more-specific prefix for YouTube intending to block it domestically, the announcement leaked to its upstream, and because a more-specific prefix wins the longest-match lookup, a large slice of the internet sent YouTube traffic to Pakistan instead. No exploit. No attack tool. Just an announcement everyone believed.
A misannounced origin is one failure mode. A forged or manipulated AS_PATH is another, and it is the one that turns a routing accident into targeted interception and theft, where an attacker pulls a victim’s traffic through a network they control long enough to do something with it. The defenses against these arrived late and partially.
*Route origin validation confirms only that the originating AS is authorized for the prefix and that the prefix is no longer than the ROA's maxLength. It says nothing about whether the path is genuine.*
RPKI, the Resource Public Key Infrastructure, lets a prefix holder publish a cryptographically signed Route Origin Authorization that binds a prefix to the AS number permitted to originate it, with a maximum length. A router doing Route Origin Validation checks an incoming route against the published ROAs and labels it Valid, Invalid, or NotFound, then applies policy, usually rejecting Invalids. This finally gives the system a way to catch the YouTube-style misorigination. By May 2024 a milestone was reached: for the first time, more than half of the IPv4 routes in the global table were covered by ROAs, and as of early 2025 coverage sat around 54% across IPv4 and IPv6, with roughly 74% of IP traffic destined for ROA-covered space, per APNIC’s measurements. Nearly all of the tier-1 transit-free providers now reject RPKI Invalids, which means an Invalid announcement propagates far less widely than an uncovered one.
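The three-way classification can be sketched directly from that description. This is a minimal model, assuming a hypothetical `ROA` tuple and an in-memory list of already-verified ROAs; fetching and cryptographically validating the RPKI objects is a separate step that is not shown.

```python
import ipaddress
from typing import NamedTuple, Union


class ROA(NamedTuple):
    prefix: Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
    max_length: int
    asn: int


def validate_origin(route_prefix, origin_as, roas):
    """Classify a route as 'Valid', 'Invalid', or 'NotFound'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        # A ROA covers the route if the route sits inside the ROA prefix.
        if route.version == roa.prefix.version and route.subnet_of(roa.prefix):
            covered = True
            if route.prefixlen <= roa.max_length and origin_as == roa.asn:
                return "Valid"
    # Covered by some ROA but matched by none means Invalid.
    return "Invalid" if covered else "NotFound"


roas = [ROA(ipaddress.ip_network("203.0.113.0/24"), 24, 64520)]

print(validate_origin("203.0.113.0/24", 64520, roas))    # authorized origin
print(validate_origin("203.0.113.0/24", 64666, roas))    # wrong origin AS
print(validate_origin("203.0.113.128/25", 64520, roas))  # longer than maxLength
print(validate_origin("198.51.100.0/24", 64520, roas))   # no covering ROA
```

The third case is the one that catches a more-specific misorigination: the origin AS matches, but the announcement is longer than the ROA permits.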
RPKI fixes origin. It does not fix the path. ROV confirms that AS 64520 is allowed to originate 203.0.113.0/24; it cannot confirm that the AS_PATH leading back to 64520 is real, so it does nothing against a path-manipulation attack or an accidental route leak where a customer relays its provider’s routes to another provider. Two follow-ons target the path. ASPA (Autonomous System Provider Authorization) lets an AS declare its legitimate providers so that leaks become detectable, and BGPsec cryptographically signs each hop of the AS_PATH. Their deployment status as of 2026 is the honest answer to “is the trust problem solved”: it is not. ASPA is barely deployed (APNIC measured roughly 0.001% of ASes publishing an ASPA record, with the specification not expected to be finalized before 2026), and BGPsec has no meaningful production deployment because signing and verifying every hop is expensive and the benefit only appears once most of the path participates. The control plane is more trustworthy than it was in 2008. It still runs, fundamentally, on operators behaving.
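How provider declarations expose a leak can be shown with a deliberately simplified check. The sketch below is not the verification procedure from the ASPA specification: it assumes a route received from a customer, where every hop from the origin toward the receiver should be a customer-to-provider step, and it uses a hypothetical dictionary of declared providers.

```python
def check_path_from_customer(as_path, declared_providers):
    """Return 'Valid', 'Invalid', or 'Unknown' for a customer-learned route.

    `as_path` lists the neighbor first and the origin last.
    `declared_providers` maps an AS to the set of providers it has declared.
    """
    hops = list(reversed(as_path))  # walk from the origin outward
    result = "Valid"
    for customer, provider in zip(hops, hops[1:]):
        if customer == provider:
            continue  # repeated AS number from prepending
        if customer not in declared_providers:
            result = "Unknown"  # no declaration, so nothing can be proven
        elif provider not in declared_providers[customer]:
            return "Invalid"    # the hop contradicts a declaration
    return result


# Hypothetical declarations: 64500 is a provider of 64520 and 64510,
# and declares no providers of its own.
declared_providers = {
    64520: {64500},
    64510: {64500, 64505},
    64500: set(),
}

# Clean path: origin 64520 announced to its provider 64500.
print(check_path_from_customer([64500, 64520], declared_providers))
# Leak: customer 64510 relays a route learned from provider 64500.
print(check_path_from_customer([64510, 64500, 64520], declared_providers))
```

In the leaked path the origin is still 64520, so origin validation alone would pass it; the hop from 64500 to 64510 is what fails, because 64510 is not a declared provider of 64500.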
What the design actually buys
Step back and the trade-offs line up in one direction. BGP gave up fast convergence, gave up shortest-path optimality, and gave up built-in security, and in exchange it got something no centrally-coordinated design could have delivered: a routing system that scales to seventy-eight thousand independently operated networks, lets each one set its own policy without revealing it, and keeps running while pieces of it fail and recover continuously. The slow convergence is the cost of damping churn so the system stays stable under constant change. The trust model is the cost of not requiring a global authority that no set of competing networks would have agreed on. The policy-over-distance routing is the cost of letting commercial relationships, not just topology, decide where traffic goes.
The honest assessment in 2026 is that the protocol’s core is essentially the same one specified in RFC 4271 in 2006, and the same path-vector idea from a decade before that. What changed is everything bolted on around it. Four-byte AS numbers pushed back exhaustion. BFD pulled failure detection out of BGP’s slow timers. Route reflectors and confederations made large ASes manageable. RPKI gave origin a cryptographic floor it never had. Each of these is a patch on an assumption that no longer holds, applied without touching the thing underneath, because the thing underneath carries the entire internet and cannot be stopped to be replaced. That a 1.05-million-prefix table, gossiped between tens of thousands of mutually distrustful networks with no central map, converges at all is the part worth sitting with. It converges because everyone agreed on the same deterministic tie-break order and mostly tells the truth about what they can reach. When either of those slips, the table notices, and so does everyone downstream of the mistake.
Sources & further reading
- Rekhter, Li, Hares, eds. (2006), RFC 4271: A Border Gateway Protocol 4 (BGP-4) — the base spec: message types, the FSM, the three-phase decision process, the Section 9.1.2.2 tie-break order, and the default timer values.
- Vohra, Chen (2012), RFC 6793: BGP Support for Four-Octet Autonomous System (AS) Number Space — the 32-bit ASN extension, AS4_PATH/AS4_AGGREGATOR, and the AS_TRANS placeholder 23456.
- Bates, Chen, Chandra (2006), RFC 4456: BGP Route Reflection — route reflectors, ORIGINATOR_ID, and CLUSTER_LIST as the full-mesh alternative.
- Labovitz, Ahuja, Bose, Jahanian (2000), Delayed Internet Routing Convergence — the original measurement of path exploration and multi-minute convergence after failures.
- Huston (2026), BGP in 2025 — year-end table sizes (~1.05M IPv4 prefixes, ~241,800 IPv6), AS counts, and transit/stub breakdown.
- The CIDR Report (ongoing), CIDR Report — live IPv4 table statistics and the share of redundant more-specific routes.
- Huston (2025), RPKI’s 2024 year in review — ROA coverage near 54%, ~74% of traffic toward covered space, and early ASPA adoption figures.
- MANRS (2024), RPKI ROV Deployment Reaches Major Milestone — the May 2024 point where most IPv4 routes became ROA-covered and tier-1 rejection of Invalids.
- FRRouting project, BGP Path Selection Algorithm — a real implementation’s full selection order, showing the vendor steps layered on top of the RFC skeleton.
- arXiv (2025), The Effects of iBGP Convergence — measured convergence cost of route reflectors versus full mesh.
- Fabrikant, Syed, Rexford (Princeton), There’s something about MRAI: Timing diversity — analysis of the MinRouteAdvertisementInterval and its effect on convergence.