Crawl politeness: robots.txt, crawl-delay, and the unwritten rules of scale
A crawler is a stranger knocking on a server it has never met, with no contract, no rate card, and no one to answer to if it knocks too hard. The web has run on this arrangement for thirty years. It mostly works, and the reason it mostly works is a 1994 text file that no server is obligated to publish and that any client can ignore in a single line of code. Politeness, on the open web, has always been a request rather than a guarantee.
That gap between request and guarantee is the interesting part. A polite crawler is solving two problems at once. One is etiquette: read the file at the root, parse it correctly, respect what it says. The other is engineering: never let your fleet flatten a host you depend on, because a crawler that takes a site down has destroyed the thing it came to read. This post works through both. It starts with where the convention came from, then RFC 9309 and exactly how a modern parser handles the file, then the crawl-delay directive and why the three big engines disagree about it, then the per-host rate limiting that real crawlers run regardless of what the file says, then sitemaps as the cooperative half of the bargain, and finally the cryptographic verification scheme that is quietly replacing the whole honor system because, in 2025, the honor system stopped holding.
1994: a server falls over and a convention is born
The origin is specific and a little embarrassing for the person who caused it. In early 1994 a badly behaved crawler hammered Martijn Koster’s server hard enough to act like a denial-of-service attack. Koster, then at Nexor, raised the problem on the www-talk mailing list, the main channel for early web work, and on 25 February 1994 floated a proposed convention that would let a server maintainer indicate whether robots were welcome and which parts of the site they could touch. The file was briefly called RobotsNotWanted.txt before settling on robots.txt at the root of the host.
By June 1994 it was a de facto standard. The crawlers of the day complied: WebCrawler, Lycos, AltaVista. That worked because the web in 1994 was small enough that you could keep a list of every crawler in existence, and because the people writing crawlers and the people running servers were often the same people, drinking from the same mailing list. The whole thing rested on a shared assumption that the parties involved wanted to be good citizens. The format reflected that informality. There was no RFC, no conformance test, no registry. There was a wiki-grade spec at robotstxt.org and the rough consensus of whoever was paying attention.
The convention survived almost unchanged for twenty-five years, which is both a compliment and a problem. A compliment because the core idea, a plain-text allow/deny list at a well-known path, was right enough to outlast nearly everything else from that era of the web. A problem because twenty-five years of crawlers each implemented the under-specified corners differently, and the corners are where crawlers actually live: what happens on a redirect, on a 500, on a file that is half a megabyte of garbage, on a path with a wildcard. Two crawlers reading the same robots.txt could reach opposite conclusions about whether a URL was allowed. For a convention whose entire job is to communicate intent unambiguously, that is close to a design failure.
2019-2022: the protocol gets an RFC
Google moved to fix the ambiguity in mid-2019. On 1 July 2019 it announced a push to formalize the Robots Exclusion Protocol through the IETF, and it open-sourced the C++ library that Googlebot itself uses to parse and match robots.txt so that other implementers could match Google’s behavior byte for byte rather than guessing at it. The standardization effort landed in September 2022 as RFC 9309, “Robots Exclusion Protocol,” authored by Koster himself alongside Gary Illyes, Henner Zeller, and Lizzi Sassman. Koster’s name on the 2022 document, twenty-eight years after the 1994 mailing-list post, is a nice piece of continuity.
RFC 9309 did not reinvent anything. It wrote down what the careful implementers already did and made the edge cases normative. The core grammar is three fields. A User-Agent line names which crawler the following rules apply to, and Allow and Disallow lines give path prefixes that the named crawler may or may not fetch. Groups of rules are keyed by user-agent, with * as the catch-all. None of that is new. What the RFC nailed down was the behavior at the edges, and the edges are worth walking through because they are exactly where a homegrown crawler gets politeness wrong.
Three rules carry most of the weight. First, matching is by longest match, not first match. When both an Allow and a Disallow could apply to a URL, the most specific one wins, where “most specific” means the rule with the most octets. So Disallow: /private/ and Allow: /private/public-note.html resolve in favor of the longer Allow line for that one file, and everything else under /private/ stays blocked. A crawler that takes the first matching line, or the last, gets a different and wrong answer.
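The longest-match rule is small enough to sketch directly. What follows is a minimal illustration, not Google’s parser: it assumes plain path prefixes (no * or $ wildcards, no percent-encoding normalization), and the is_allowed function and the tuple shape of a rule are names invented for this example.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # ("allow" | "disallow", path prefix)


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Longest-match evaluation over plain prefix rules (wildcards not handled)."""
    best_len = -1
    best_allow = True  # no matching rule at all means the path is allowed
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value matches nothing
        n = len(prefix.encode("utf-8"))  # specificity is counted in octets
        # The longer rule wins; on an exact tie, prefer allow (least restrictive).
        if n > best_len or (n == best_len and kind == "allow"):
            best_len = n
            best_allow = kind == "allow"
    return best_allow


rules = [
    ("disallow", "/private/"),
    ("allow", "/private/public-note.html"),
]
```

With these two rules, /private/public-note.html is allowed because the Allow line is longer, /private/secret.html is blocked, and anything outside /private/ matches no rule and is allowed. A first-match or last-match evaluator would give a different answer as soon as the two lines were reordered.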
Second, the file has a size limit, and the spec makes the floor mandatory. A crawler may cap how much of robots.txt it parses, but RFC 9309 requires that cap to be at least 500 kibibytes, and Google’s implementation sets it exactly there: anything past 500 KiB is ignored. This is a politeness rule pointed in the other direction. It stops a hostile or broken robots.txt from forcing a crawler to download megabytes before it can make a single fetch decision, and it bounds how long the connection stays open. Pick a number, write it down, and every site owner knows how much of the file a conforming crawler is guaranteed to read.
Third, and this is the rule most homegrown crawlers get backwards, the response status on robots.txt itself changes everything. RFC 9309 says that when the file is unreachable because of server errors, the crawler must assume complete disallow. A 5xx on robots.txt does not mean “no rules, crawl freely.” It means stop. The reasoning is that a server returning 500s is a server in trouble, and a crawler that responds to a struggling host by treating the absence of rules as permission is precisely the 1994 denial-of-service all over again. Fail closed, not open.
How a real parser handles status codes and caching
Google’s own robots.txt documentation spells out the status-code handling in more operational detail than the RFC, and it is the closest thing to a reference implementation that exists, since the parser behind it is open source. The four ranges split cleanly.
A 2xx is the happy path: parse the file as served. A 3xx redirect is followed for at least five hops, after which Google gives up and treats the situation as a 404 for robots.txt. The 4xx range, with one exception, means “no robots.txt exists,” so the crawler assumes nothing is disallowed and proceeds. That sounds dangerous until you remember that a 404 on the robots.txt path is the normal state for the large majority of hosts on the web, which never had one. The exception is 429, Too Many Requests, which Google does not treat as “no file.” It treats 429 as a signal in the same family as 5xx: back off.
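The four ranges above reduce to a small decision function. This is a sketch of the behavior as described in this section, not code lifted from any real crawler; the RobotsOutcome enum, the classify_robots_response name, and the choice to treat unexpected codes such as 1xx as a reason to back off are all assumptions made for illustration.

```python
from enum import Enum


class RobotsOutcome(Enum):
    PARSE = "parse the file as served"
    FOLLOW_REDIRECT = "follow the Location header"
    ALLOW_ALL = "treat as no robots.txt: nothing is disallowed"
    BACK_OFF = "treat as unreachable: stop crawling and retry later"


MAX_REDIRECT_HOPS = 5  # hop budget described above


def classify_robots_response(status: int, hops_followed: int = 0) -> RobotsOutcome:
    """Map the status of a robots.txt fetch to what the crawler should do next."""
    if 200 <= status < 300:
        return RobotsOutcome.PARSE
    if 300 <= status < 400:
        if hops_followed < MAX_REDIRECT_HOPS:
            return RobotsOutcome.FOLLOW_REDIRECT
        return RobotsOutcome.ALLOW_ALL  # hop budget spent: handled like a 404
    if status == 429:
        return RobotsOutcome.BACK_OFF  # the one 4xx that means "slow down"
    if 400 <= status < 500:
        return RobotsOutcome.ALLOW_ALL
    # 5xx, plus anything unexpected (an assumption): fail closed
    return RobotsOutcome.BACK_OFF
```

The ordering of the checks is the point: 429 has to be tested before the general 4xx branch, otherwise a throttled host is read as a host with no rules.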
The handling of a persistent 5xx is more graceful than a flat “disallow forever.” Google’s documented behavior runs in three phases. For roughly the first twelve hours of failures it stops crawling the site while it keeps retrying robots.txt. For up to thirty days after that it falls back to the last good cached copy of robots.txt and keeps trying to refresh. Past thirty days, if the host itself is reachable, it concludes there is no robots.txt and resumes under the assumption that nothing is disallowed; if the host is also unreachable, it keeps backing off. Network-level failures (DNS errors, timeouts, connection resets, malformed chunked transfers) all fold into the server-error category and get the same treatment.
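The three phases can be written as a function of how long robots.txt has been failing. This is a sketch of the behavior described above, with two assumptions flagged: the outage_policy name and its string return values are invented here, and the branch for a host with no cached copy in the second phase is a conservative guess, since the description above does not cover it.

```python
from datetime import timedelta

PAUSE_WINDOW = timedelta(hours=12)
CACHE_WINDOW = timedelta(days=30)


def outage_policy(outage: timedelta, have_cached_copy: bool, host_reachable: bool) -> str:
    """Decide how to treat a host whose robots.txt keeps returning server errors."""
    if outage <= PAUSE_WINDOW:
        return "pause"  # phase 1: stop crawling, keep retrying robots.txt
    if outage <= CACHE_WINDOW:
        # phase 2: fall back to the last good copy while still retrying.
        # Pausing when no copy exists is an assumption, not documented behavior.
        return "use_cached" if have_cached_copy else "pause"
    # phase 3: a reachable host with no working robots.txt is treated as unrestricted
    return "allow_all" if host_reachable else "pause"
```

Calling outage_policy(timedelta(days=5), True, True) returns "use_cached", and the same call with timedelta(days=45) returns "allow_all".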
Caching matters because robots.txt is fetched constantly and you cannot re-fetch it before every single request without becoming the thing the file was invented to prevent. The RFC tells crawlers not to use a cached copy for more than 24 hours unless the file is unreachable, and Google’s implementation caches for up to 24 hours, extending that window when refreshing is not possible. Twenty-four hours is the politeness compromise. Long enough that you are not re-reading the rules thousands of times an hour, short enough that a site owner who changes the file sees the change respected within a day.
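A cache that follows the 24-hour rule needs very little machinery. The sketch below assumes a fetch callable that returns the file body or None when the file is unreachable; RobotsCache, CachedRobots, and the injectable clock are hypothetical names chosen so the logic can be exercised without a network.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour ceiling discussed above


@dataclass
class CachedRobots:
    body: str
    fetched_at: float


class RobotsCache:
    def __init__(
        self,
        fetch: Callable[[str], Optional[str]],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch  # returns the body, or None when unreachable
        self._clock = clock
        self._entries: Dict[str, CachedRobots] = {}

    def get(self, host: str) -> Optional[str]:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now - entry.fetched_at < MAX_AGE_SECONDS:
            return entry.body  # still fresh: no network traffic at all
        fresh = self._fetch(host)
        if fresh is not None:
            self._entries[host] = CachedRobots(fresh, now)
            return fresh
        # Refresh failed: keep using the stale copy rather than dropping the rules.
        return entry.body if entry is not None else None


# Usage with a fake fetcher and a fake clock
calls = []
now = [0.0]


def fake_fetch(host: str) -> Optional[str]:
    calls.append(host)
    return "User-agent: *\nDisallow: /private/\n"


cache = RobotsCache(fake_fetch, clock=lambda: now[0])
cache.get("example.com")
cache.get("example.com")  # served from cache, no second fetch
fetches_within_a_day = len(calls)
now[0] = MAX_AGE_SECONDS + 1
cache.get("example.com")  # expired, so this one refetches
fetches_after_expiry = len(calls)
```

The stale-copy branch is what “extending that window when refreshing is not possible” looks like in code: an expired entry is only replaced by a successful fetch, never by a failure.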
If you are building the crawler that has to get all of this right, the politeness logic does not live in isolation. It sits inside the frontier and the scheduler, and the cleanest treatment of where it fits is the write-up on designing a distributed crawler, where the robots cache, the per-host queues, and the backpressure all have to share state.
Crawl-delay: one directive, three answers
Here is where the convention frays. Crawl-delay is the directive that most people reach for when they want to tell a crawler to slow down, and it is not in RFC 9309 at all. The standard defines user-agent, allow, and disallow. Crawl-delay is an extension that some crawlers honored and others never did, and the three largest engines landed in three different places, which means a single Crawl-delay: 10 line does three different things depending on who reads it.
Google does not support it. This is long-standing and explicit: Google’s own robots.txt documentation lists crawl-delay among the fields it does not process, and Google has been public about ignoring it since around 2008. The stated reasoning is that a static number is a blunt instrument compared to watching the server’s own response times, and that site owners who genuinely need to throttle Googlebot should use the crawl-rate controls in Search Console rather than a line in a text file. So on Google, crawl-delay is inert. It is not an error, it is just nothing.
Bing supports it, and the way Bing interprets it trips up almost everyone who writes the line. Crawl-delay on Bing is not “wait N seconds between requests.” It defines the length of a time window, from 1 to 30 seconds, during which Bingbot will fetch at most one page. Crawl-delay: 10 slices the day into ten-second windows and lets Bingbot take one page per window, which caps the crawl at roughly 8,640 pages a day. Set it to 5 and the cap is around 17,280. The number you write is not a pause, it is the denominator of a daily budget, and a site owner who sets Crawl-delay: 30 thinking it means a gentle gap between hits has actually clamped Bing to about 2,880 pages a day, which on a large site is brutal.
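The arithmetic behind those daily caps is a single division. The helper below is only a worked check of the figures above; bing_daily_page_cap is a name made up for this example, and the 1 to 30 range is the one quoted in the paragraph.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def bing_daily_page_cap(crawl_delay_seconds: int) -> int:
    """Pages per day when one page is fetched per crawl-delay window."""
    if not 1 <= crawl_delay_seconds <= 30:
        raise ValueError("crawl-delay window is expected to be 1 to 30 seconds")
    return SECONDS_PER_DAY // crawl_delay_seconds


caps = {delay: bing_daily_page_cap(delay) for delay in (5, 10, 30)}
# caps == {5: 17280, 10: 8640, 30: 2880}
```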
Yandex historically read it the literal way most people expect: Crawl-delay: 10 meant wait at least ten seconds between successive requests. Same directive, same value, a different model again.
The lesson is not that crawl-delay is useless, it is that crawl-delay was never specified, so every engine filled the gap with its own model and none of them agree. For a well-behaved crawler reading someone else’s robots.txt, the honest move is to treat a crawl-delay line as a slow-down hint and pick a conservative interpretation, because you genuinely cannot tell from the file alone which semantics the author had in mind. For a site owner trying to control crawl rate, the file is a weak lever. The stronger levers are the engine-specific tools (Search Console for Google, Webmaster Tools crawl control for Bing) and, underneath everything, the server’s own response behavior, which every serious crawler is watching whether or not the file says a word.
Per-host rate limiting: the politeness that does not need a file
Crawl-delay is the visible, declarative half of rate control. The half that actually protects servers is the adaptive logic inside the crawler, and it runs whether or not robots.txt mentions a delay. This is the engineering half of politeness, and it is where a crawler earns the right to operate at scale.
The foundational rule, old enough to predate most of the modern web, is one connection at a time per host and a deliberate gap between requests to the same host. The unit of politeness is the host, not the page and not the crawler as a whole. A crawler can be pulling thousands of pages a second across the web and still be polite if no single host sees more than a trickle. The Mercator crawler design from the late 1990s already enforced this by keying its back-queues on host, so that the scheduler structurally could not aim two simultaneous requests at the same server. Modern crawlers inherit the shape directly, and the URL frontier design is where that per-host serialization is implemented: a front set of priority queues feeding a back set of per-host FIFO queues, with a heap that releases each host’s next URL only after its politeness delay has elapsed.
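The back half of that design, per-host FIFO queues released by a heap of ready times, fits in a short class. This is a simplified sketch in the spirit of the Mercator layout, not a reproduction of it: it leaves out the front priority queues entirely, measures the delay from dispatch rather than from response completion, and HostScheduler and its method names are invented for the example.

```python
import heapq
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional, Set, Tuple


class HostScheduler:
    """Per-host FIFO queues, released by a heap keyed on each host's next ready time."""

    def __init__(self, delay_seconds: float = 1.0) -> None:
        self.delay = delay_seconds
        self.queues: Dict[str, Deque[str]] = defaultdict(deque)
        self.heap: List[Tuple[float, str]] = []  # (earliest next fetch time, host)
        self.in_heap: Set[str] = set()
        self.ready_at: Dict[str, float] = {}

    def add(self, host: str, url: str, now: float) -> None:
        self.queues[host].append(url)
        if host not in self.in_heap:
            # Never schedule earlier than the host's remaining politeness delay.
            when = max(now, self.ready_at.get(host, now))
            heapq.heappush(self.heap, (when, host))
            self.in_heap.add(host)

    def next_url(self, now: float) -> Optional[str]:
        if not self.heap or self.heap[0][0] > now:
            return None  # no host is ready yet
        _, host = heapq.heappop(self.heap)
        url = self.queues[host].popleft()
        self.ready_at[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.heap, (self.ready_at[host], host))
        else:
            self.in_heap.discard(host)
        return url


scheduler = HostScheduler(delay_seconds=2.0)
scheduler.add("a.example", "https://a.example/1", now=0.0)
scheduler.add("a.example", "https://a.example/2", now=0.0)
scheduler.add("b.example", "https://b.example/1", now=0.0)
released = [scheduler.next_url(0.0), scheduler.next_url(0.0), scheduler.next_url(0.0)]
later = scheduler.next_url(2.0)
```

At time zero the scheduler hands out one URL for each host and then returns None, even though a.example still has work queued. The second a.example URL only comes out once the two-second delay has passed. Because each host appears in the heap at most once, two simultaneous requests to one host cannot be scheduled by construction.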
Google’s modern version is adaptive rather than fixed. Its crawl-rate logic watches the host and adjusts. The signal it pays most attention to is failure: Google’s documentation states that its crawling infrastructure reduces a site’s crawl rate when it sees a meaningful number of 500, 503, or 429 responses, and that the reduction applies to the whole hostname, not just the failing URLs. The inverse holds too. A host that answers quickly and cleanly earns a higher rate, up to the point where Google judges it has crawled what it wants. This is a feedback loop, and the same loop is what makes the 429 status code meaningful: a server that wants less traffic can simply start returning 429, and a conforming crawler reads that as the throttle signal it is.
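A feedback loop of this kind can be approximated with a multiplicative back-off on failure and a slow recovery on success. To be clear about what is assumed: Google does not publish its algorithm, so the doubling factor, the 0.9 recovery factor, the delay bounds, and the AdaptiveDelay class are all illustrative choices, not a description of Googlebot.

```python
THROTTLE_STATUSES = (429, 500, 503)  # the failure signals named above


class AdaptiveDelay:
    """Per-host delay that grows quickly on throttle signals and shrinks slowly on success."""

    def __init__(self, min_delay: float = 1.0, max_delay: float = 300.0) -> None:
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.delay = min_delay

    def record(self, status: int) -> float:
        if status in THROTTLE_STATUSES:
            # Back off hard: double the gap, up to the ceiling.
            self.delay = min(self.delay * 2, self.max_delay)
        elif 200 <= status < 300:
            # Recover gently: shave 10 percent, down to the floor.
            self.delay = max(self.delay * 0.9, self.min_delay)
        return self.delay


host_delay = AdaptiveDelay(min_delay=1.0, max_delay=8.0)
after_failures = [host_delay.record(s) for s in (503, 429, 500, 503)]
after_success = host_delay.record(200)
```

The asymmetry is deliberate. A struggling host should see relief within a few responses, while a recovered host should have to prove itself over many. One instance of this state belongs to a hostname, not to a URL, which matches the whole-hostname reduction described above.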
There is a distributed-systems wrinkle that the single-crawler picture hides. The politeness budget belongs to the host, but a real crawler is many machines, and if each worker independently decides it is being polite, the host still gets hit by the sum of all of them. The per-host serialization therefore has to be coordinated across the fleet, which is why the politeness state, the last-fetch timestamp and the current delay for each host, has to be partitioned so that exactly one worker owns each host at a time. Get the partitioning wrong and you have a polite-looking crawler that is collectively rude. The same partitioning, incidentally, is what keeps proxy traffic from concentrating, a problem the proxy pool management write-up gets into from the other side.
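The simplest way to give each host exactly one owner is to hash the hostname and map it onto the worker list. The sketch below uses plain modulo hashing, which is an assumption made for brevity: it reshuffles most hosts whenever the worker count changes, so a production fleet would more likely use consistent or rendezvous hashing. The owner_for_host name is hypothetical.

```python
import hashlib


def owner_for_host(host: str, worker_count: int) -> int:
    """Return the index of the single worker that owns this host's politeness state."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    # Use a stable digest: Python's built-in hash() is salted per process,
    # so different workers would disagree about who owns a host.
    digest = hashlib.sha256(host.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count


owner = owner_for_host("example.com", worker_count=16)
same_owner = owner_for_host("EXAMPLE.com", worker_count=16)
```

Every worker that discovers a URL on example.com computes the same index and forwards the URL to that owner, so the last-fetch timestamp and current delay live in one place.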
Backpressure closes the loop in the other direction. When a host slows down, the requests queued for it back up, and that backlog has to propagate upstream so the frontier stops feeding URLs for a host that cannot keep up. A crawler without backpressure does not slow down when a host struggles; it queues harder, and the queue is the thing that eventually delivers the denial-of-service. Politeness, at scale, is mostly the discipline of letting a slow host slow you down.
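At its smallest, backpressure is a bounded queue whose producer is told no. The sketch below assumes one such queue per host, with the frontier expected to hold a URL back and retry later when offer returns False; BoundedHostQueue and its capacity parameter are names chosen for this example.

```python
from collections import deque
from typing import Deque, Optional


class BoundedHostQueue:
    """Per-host queue that refuses new work once the host has fallen behind."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._urls: Deque[str] = deque()

    def offer(self, url: str) -> bool:
        if len(self._urls) >= self.capacity:
            return False  # signal upstream: stop feeding URLs for this host
        self._urls.append(url)
        return True

    def take(self) -> Optional[str]:
        return self._urls.popleft() if self._urls else None


queue = BoundedHostQueue(capacity=2)
accepted = [queue.offer(f"https://slow.example/{i}") for i in range(3)]
queue.take()  # the fetcher finally drains one URL
accepted_after_drain = queue.offer("https://slow.example/3")
```

The third offer is refused while the host is slow, and succeeds only after the fetcher has drained a slot. An unbounded queue would accept all of them and turn the backlog into the burst that lands on the host when it recovers.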
Sitemaps: the cooperative half of the contract
Everything so far has been the site owner telling the crawler what not to do. Sitemaps are the opposite signal, the owner telling the crawler what to fetch and when it last changed, and they are the cooperative half of the politeness contract. A good sitemap makes a crawler more efficient and therefore lighter on the host, because a crawler that knows what changed does not have to re-fetch what did not.
The format is an XML file at a URL of the owner’s choosing, declared either in robots.txt with a Sitemap: line or submitted through a search engine’s tools. The sitemaps.org protocol, at version 0.9, defines a <urlset> root with one <url> entry per page. Inside each entry, <loc> is the only required element, the URL itself. The optional ones are <lastmod> for the last modification date, <changefreq> for how often the page tends to change, and <priority> for relative importance. A single sitemap is capped at 50,000 URLs and 50 MiB uncompressed, and larger sites chain multiple sitemaps behind a sitemap index file. The 50,000-URL ceiling is itself a politeness mechanism of sorts, since it bounds how much a crawler has to parse before it can start scheduling.
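Reading a sitemap in this format takes only the standard library. The sketch below handles a single urlset and not a sitemap index, and pulls out just loc and lastmod; the parse_sitemap name and the example.com sample are illustrative. One caveat that is an operational assumption rather than part of the protocol: the stdlib parser is not hardened against maliciously constructed XML, so a crawler reading untrusted sitemaps would want a hardened parser.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000  # per-file ceiling from the sitemap protocol

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""


def parse_sitemap(xml_text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (loc, lastmod) pairs; lastmod is None when the element is absent."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries: List[Tuple[str, Optional[str]]] = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            continue  # <loc> is the only required element; skip entries without it
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=NS)
        entries.append((loc, lastmod))
        if len(entries) > MAX_URLS:
            raise ValueError("sitemap exceeds the 50,000-URL limit")
    return entries


entries = parse_sitemap(SAMPLE)
```

For the sample, entries holds the home page with its lastmod date and the about page with None. The namespace map matters: without it, findall returns nothing, because every element in a valid sitemap lives in the sitemaps.org namespace.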
The interesting recent history is that the optional metadata has quietly lost most of its credibility. Google treats <changefreq> and <priority> as close to noise; they are hints a site can assert without cost, so they get gamed and then ignored. <lastmod> survived, but only conditionally. Google’s guidance is blunt about the condition: lastmod is useful only if it is honest, in a valid date format, and consistently matches reality. A site that stamps every page with a fresh lastmod on every crawl, hoping to look perpetually updated, teaches the crawler to stop believing its lastmod values entirely. The metadata works exactly as far as it is trustworthy and no further, which is a recurring theme in this whole area.
The clearest signal that the cooperative channel has trust problems came in June 2023, when Google deprecated the sitemaps ping endpoint. For years a site could fire an unauthenticated HTTP request at a Google URL to say “my sitemap changed, come look.” Google retired it, and the stated reason is telling: internal study, corroborated by Bing, found that unauthenticated sitemap submissions were mostly useless, and in Google’s case the vast majority of those pings were spam. The endpoint now returns 404. Sitemaps still work through robots.txt and Search Console; the open, anonymous, anyone-can-poke-us version got shut down because, given an open channel and no authentication, a meaningful share of traffic will abuse it. Hold that thought, because it is the same shape as the problem that breaks robots.txt.
2025: the honor system stops holding
Robots.txt has always rested on a single assumption: that the crawler reading it wants to comply. The file cannot enforce anything. It is a sign on an unlocked door. For thirty years the assumption mostly held, because the crawlers that mattered were search engines whose business depended on being welcome, and a search engine caught ignoring robots.txt had a lot to lose. The arrival of AI crawlers, fetching content to train models and to answer questions in real time, changed the incentive math, and in 2025 the cracks became impossible to ignore.
The numbers tell the story. Across Cloudflare’s network the share of bots ignoring robots.txt rose from 3.3 percent to 12.9 percent over the first quarter of 2025. The single loudest case broke on 4 August 2025, when Cloudflare published a post accusing Perplexity of running stealth crawlers to get around no-crawl directives. The methodology is worth describing because it is clean. Cloudflare set up brand-new domains that had never been indexed and were not publicly discoverable, gave each a robots.txt that disallowed all automated access plus WAF rules blocking Perplexity’s declared crawlers, then asked Perplexity questions about those domains. Content that the AI could only have obtained by crawling came back anyway.
What Cloudflare reported is the part that matters for this topic. Perplexity’s declared agents, PerplexityBot and Perplexity-User, were generating something like 20 to 25 million requests a day. When those were blocked, traffic appeared from an undeclared crawler presenting a generic browser user-agent, Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36, impersonating Chrome on macOS, at a further 3 to 6 million requests a day. The stealth crawler rotated IPs and switched between ASNs when blocked, behavior that has nothing to do with reading a file at the root and everything to do with not wanting to be identified. Cloudflare’s response was to de-list Perplexity from its verified-bot program and ship detection heuristics to block the stealth traffic. Perplexity’s rebuttal was that Cloudflare cannot reliably tell a user-driven assistant fetching a page on someone’s behalf from a bulk training crawler, which is a real distinction and also exactly the kind of distinction that a User-Agent header was never able to carry.
This is the structural failure, not a one-company scandal. The User-Agent string is self-asserted and trivially changed. A crawler that ignores robots.txt and then renames itself to look like Chrome has defeated both halves of the convention in one move, because robots.txt rules are keyed on user-agent and a spoofed user-agent matches no rule. IP-range allowlists, the usual backstop, are brittle: ranges change, residential and shared addresses muddy the picture, and a determined crawler rotates through ASNs faster than anyone maintains a blocklist. The same techniques that the anti-detection world has spent a decade refining, the user-agent spoofing covered in the HeadlessChrome user-agent token write-up, turned out to be exactly what an AI crawler reaches for when a block gets in the way. The honor system has no answer to a participant who declines to be honorable, and by 2025 there were enough such participants to move the network-wide numbers by a factor of four in a single quarter.
Cryptography replaces the honor system
The fix being built is the one the sitemap-ping deprecation already pointed at: stop trusting unauthenticated self-assertion and require the crawler to prove who it is. The scheme is Web Bot Auth, and it applies RFC 9421 HTTP Message Signatures, published as an IETF Proposed Standard in February 2024, to crawler traffic. The architecture work is happening in an IETF draft led by Thibault Meunier, and the major edge providers have already shipped support.
The mechanism replaces “trust the User-Agent” with “verify a signature.” A crawler generates a signing keypair and publishes the public half as a JSON Web Key Set at a well-known location, /.well-known/http-message-signature-directory, on its own domain. On each request it signs a set of request components with its private key and attaches a Signature header carrying the signature, a Signature-Input header naming which components were signed plus the key id, creation, and expiry, and a Signature-Agent header pointing at the key directory. The server fetches the public key once, caches it, and from then on can verify cryptographically that a request came from the entity that controls that key. The User-Agent string becomes irrelevant; you no longer trust the label, you check the signature.
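To make the three headers concrete, here is a rough sketch of how a crawler might assemble them following the RFC 9421 signature-base layout. Several things in it are assumptions and should be read as such: the covered components, the quoted-string form of Signature-Agent, and the tag value reflect one reading of the draft and may change as it evolves; the sig1 label, key id, and URLs are made up; and because the Python standard library has no asymmetric signing, the usage example plugs in an HMAC as a stand-in signer, which could not be verified against a published public key the way a real Web Bot Auth signature would be.

```python
import base64
import hashlib
import hmac
from typing import Callable, Dict


def build_signature_headers(
    authority: str,
    agent_url: str,
    key_id: str,
    created: int,
    expires: int,
    sign: Callable[[bytes], bytes],
) -> Dict[str, str]:
    """Build Signature-Agent, Signature-Input and Signature for one request."""
    agent_value = f'"{agent_url}"'  # sent as a quoted string (assumed per the draft)
    params = (
        '("@authority" "signature-agent")'
        f";created={created};expires={expires}"
        f';keyid="{key_id}";tag="web-bot-auth"'
    )
    # RFC 9421 signature base: one line per covered component,
    # then the @signature-params line, with no trailing newline.
    base = (
        f'"@authority": {authority}\n'
        f'"signature-agent": {agent_value}\n'
        f'"@signature-params": {params}'
    )
    signature = base64.b64encode(sign(base.encode("ascii"))).decode("ascii")
    return {
        "Signature-Agent": agent_value,
        "Signature-Input": f"sig1={params}",
        "Signature": f"sig1=:{signature}:",
    }


# Stand-in signer for illustration only: a real deployment signs with the
# private half of the published keypair, not a shared secret.
DEMO_SECRET = b"demo-secret"


def demo_sign(data: bytes) -> bytes:
    return hmac.new(DEMO_SECRET, data, hashlib.sha256).digest()


headers = build_signature_headers(
    authority="example.com",
    agent_url="https://crawler.example",
    key_id="demo-key",
    created=1735689600,
    expires=1735689660,
    sign=demo_sign,
)
```

The detail that matters is that the parameter string in Signature-Input is byte-for-byte the one that was signed. A verifier rebuilds the same base from the request and the header, so any reordering of parameters between the two invalidates the signature.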
Adoption moved fast for a draft. Cloudflare folded HTTP Message Signatures into its Verified Bots program in 2025, and AWS WAF added Web Bot Auth support in November 2025, allowing verified signed agents by default. The IETF has been weighing whether to form a working group around it. None of this makes a crawler polite in the rate-limiting sense; a signed crawler can still hammer a host. What it does is restore the identity layer that robots.txt always quietly assumed and that user-agent spoofing destroyed. Once a server can verify identity, it can make real decisions: allow this signed crawler, rate-limit that one, block the third. A crawler that wants access then has an incentive to behave, because misbehavior now attaches to a key that cannot be swapped for free.
It is worth being precise about what this does and does not solve, because the two failure modes are different. Reverse-DNS verification, the old backstop, still works for the crawlers that bother to support it: Google publishes its crawler IP ranges as JSON and documents a forward-confirmed reverse-DNS check, where you resolve the IP to a googlebot.com hostname and then resolve that name back to confirm it matches, precisely because a one-way reverse-DNS record can be spoofed. That covers identity for the cooperative majors. Web Bot Auth generalizes it to any crawler willing to hold a key, without the operational pain of maintaining IP allowlists. Neither addresses the crawler that simply refuses to identify itself and hides in browser-shaped traffic; that one is a detection problem, not a verification problem, and it lives in the same arms race as every other server-side bot detection signal.
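The forward-confirmed reverse-DNS check is two lookups and a comparison. The sketch below uses the standard socket module and takes the resolvers as parameters so the logic can be tested offline. The is_verified_crawler name and the default suffix tuple are assumptions: the tuple lists only the googlebot.com domain mentioned above, and gethostbyname_ex resolves IPv4 only, so this version does not cover IPv6 crawler addresses.

```python
import socket
from typing import Callable, List, Sequence


def _reverse(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]  # PTR lookup: IP to hostname


def _forward(hostname: str) -> List[str]:
    return socket.gethostbyname_ex(hostname)[2]  # A lookup: hostname to IPv4 list


def is_verified_crawler(
    ip: str,
    suffixes: Sequence[str] = (".googlebot.com",),
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS: the PTR name must be in an expected domain
    and must resolve back to the same IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(tuple(suffixes)):
            return False
        return ip in forward(hostname)
    except OSError:
        return False  # lookup failures mean "not verified", never "verified"


# Offline usage with fake resolvers
genuine = is_verified_crawler(
    "66.249.66.1",
    reverse=lambda ip: "crawl-66-249-66-1.googlebot.com",
    forward=lambda host: ["66.249.66.1"],
)
spoofed_ptr = is_verified_crawler(
    "203.0.113.9",
    reverse=lambda ip: "crawl-66-249-66-1.googlebot.com",
    forward=lambda host: ["66.249.66.1"],
)
```

The second call is the attack the forward step exists for: whoever controls the reverse zone for 203.0.113.9 can make its PTR record say anything, but they cannot make the real googlebot.com name resolve back to their address. The leading dot in the suffix matters too, since it keeps a name like fakegooglebot.com from passing the domain check.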
What politeness is really made of
Strip away the specifics and crawl politeness is two different things wearing one name. One is a social protocol, robots.txt and crawl-delay and sitemaps, a set of conventions for declaring intent that work only as long as the reader chooses to honor them. The other is an engineering discipline, per-host serialization and adaptive rate limiting and backpressure, which protects servers whether or not anyone declares anything, because it is built into how the crawler schedules its own work. The first is etiquette. The second is the thing that actually keeps the lights on. A crawler can read every robots.txt perfectly and still flatten a host if its scheduler lets ten workers hit the same server at once, and a crawler can ignore robots.txt entirely and still be gentle if its per-host rate limiter is honest.
For most of the web’s history those two halves traveled together because the same people built both, and a search engine that wanted to stay welcome had every reason to be careful. What 2025 exposed is that the social half was always the weaker of the two, held up by incentive rather than enforcement, and that when the incentives shift, a text file at the root of a domain is a sign and not a lock. The cryptographic verification now spreading across the edge is the first serious attempt to put a lock where the sign used to be. It does not make crawlers polite. It makes them accountable, which is a different and more durable property, and on the open web it took thirty-one years and a 12.9-percent non-compliance rate to decide that accountability was worth the cost of a signature on every request.
Sources & further reading
- Koster, Illyes, Zeller & Sassman (2022), RFC 9309: Robots Exclusion Protocol — the IETF standardization of robots.txt, including the 500 KiB parse limit, longest-match rule, and fail-closed handling of unreachable files.
- Google (2024), How Google interprets the robots.txt specification — operational detail on status-code handling, the three-phase 5xx response, caching, and the list of unsupported fields including crawl-delay.
- Google (2017, updated), Crawl budget management for large sites — how crawl rate and crawl demand combine, and how 500/503/429 responses pull the rate down for a whole hostname.
- Google (2025), Reduce the Google crawl rate — the role of 429 and 503 as throttle signals and the absence of crawl-delay support.
- Google (2024), Verify Googlebot and other Google crawlers — forward-confirmed reverse-DNS verification and the published Googlebot IP ranges in JSON.
- Bing Webmaster (2009/2012), Crawl delay and the Bing crawler, MSNBot — Bing’s time-window interpretation of crawl-delay, where the value sets a per-window page budget.
- sitemaps.org (2008), Sitemap protocol 0.9 — the XML format, required and optional elements, and the 50,000-URL / 50 MiB limits.
- Google Search Central (2023), Sitemaps ping endpoint is going away — deprecation of the unauthenticated ping endpoint and guidance on honest lastmod values.
- Koster (2019), Robots.txt is 25 years old — the originator’s account of the February 1994 www-talk proposal and the server overload that prompted it.
- Cloudflare (2025), Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives — the test methodology, the spoofed Chrome-on-macOS user-agent, request volumes, and ASN rotation.
- IETF (2024), RFC 9421: HTTP Message Signatures — the signature framework that Web Bot Auth applies to crawler traffic.
- Cloudflare (2025), Message Signatures are now part of our Verified Bots Program — Web Bot Auth in practice: the .well-known key directory, the Signature/Signature-Input/Signature-Agent headers, and Meunier’s IETF draft.