A history of the robots.txt standard, from 1994 consensus to RFC 9309
For most of its life, the most widely deployed access-control file on the web had no specification. There was a page on a personal website, a 1994 mailing-list message, and a 1997 internet draft that expired without ever becoming anything. Half a billion sites relied on it anyway. Googlebot obeyed it, AltaVista obeyed it, the polite half of the scraping world obeyed it, and the impolite half ignored it the same way they would have ignored a law. That arrangement held for twenty-five years. Then, in the space of three, it got a real RFC, lost its grip on a new class of crawler, and became the center of a fight about who gets to train on the open web.
This is the history of that file. It runs from the runaway crawler that provoked the idea in early 1994, through the long stretch where robots.txt was a convention rather than a standard, to the moment Google decided to write it all down, to RFC 9309 in 2022, and into the present, where AI companies, publishers, and a CDN that carries a fifth of the web are arguing about whether a voluntary text file means anything at all. The sections below follow that arc: the origin, the de-facto era and its informal extensions, the failed 1997 draft, Google’s 2019 push and the deprecations that came with it, the mechanics that RFC 9309 finally pinned down, and the AI-era revolt that put a 31-year-old convention back on the front page.
The runaway crawler of 1994
The web in early 1994 was small, fragile, and easy to knock over. Servers ran on hardware that today would struggle to load a single modern page. A crawler that followed every link as fast as it could was not a nuisance, it was an outage. Martijn Koster, then at Nexor and one of the people maintaining early web indexes, kept running into automated clients that hammered his server into the ground, sometimes by accident, sometimes by request loops that generated infinite URLs.
The specific provocation has a name attached. Charles Stross, later better known as a science-fiction novelist, has claimed he wrote a badly behaved crawler that caused what amounted to a denial-of-service on Koster’s server, and that this is what pushed Koster to propose a fix. Koster floated the idea on the www-talk mailing list in February 1994. The discussion moved to a dedicated robots mailing list, and by June 1994 the participants had converged on a convention simple enough that nobody had to be forced to adopt it.
The design is almost aggressively minimal. A single file, named robots.txt, lives at the root of a host. A crawler that wants to be polite fetches http://example.com/robots.txt before it fetches anything else, reads a list of User-agent and Disallow lines, and stays out of the paths it is told to stay out of. That is the whole idea. There is no registration, no signing, no central authority, and critically, no enforcement. The file is advisory. It tells well-behaved crawlers where not to go and has exactly zero effect on anyone who decides not to read it.
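The polite-crawler check described above can be sketched with Python's standard-library `urllib.robotparser`. The file contents, the `ExampleBot` name, and the URLs are made up for illustration, and the rules are parsed from a string so the sketch needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical three-line file of the kind a site might publish at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks before every fetch; nothing forces it to.
print(parser.can_fetch("ExampleBot", "http://example.com/index.html"))
print(parser.can_fetch("ExampleBot", "http://example.com/cgi-bin/search"))
```

The check runs entirely inside the crawler. A client that never calls `can_fetch` is not slowed down by the file in any way, which is the advisory nature the paragraph describes.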
Adoption was fast precisely because the cost of adopting was near zero. By the time the convention settled in mid-1994, the crawlers that mattered were already reading it. WebCrawler, Lycos, and AltaVista, the search engines that defined the pre-Google web, all honored it. A webmaster could drop a three-line file at the root of their site and trust that the major indexers would respect it. That trust was the entire mechanism. Nothing technical compelled compliance. The standard worked because the people running the important crawlers wanted to be good citizens and because being seen to ignore robots.txt was bad for a search engine’s reputation.
The de-facto standard and its bolted-on extensions
What Koster wrote down in 1994 was a sketch. The convention as actually deployed accreted features over the next decade and a half, none of them through a formal process. Two crawlers would agree on a syntax, document it on their own sites, and it would propagate by imitation. This is how robots.txt grew an Allow directive, a Crawl-delay, a Sitemap line, and wildcard path matching, all without a governing document that everyone agreed on.
The original 1994 design only had Disallow. A path was either blocked or it was not, and there was no way to carve out an exception inside a blocked tree. Allow filled that gap, letting a site say “block everything under /private/ except /private/public-summary.html”. But because there was no spec, two implementations could disagree about what happened when an Allow and a Disallow both matched a URL. Did the more specific rule win? The first one listed? The last? Different crawlers resolved the conflict differently, and a webmaster writing a robots.txt had no authoritative answer about how their file would actually be interpreted.
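To make the disagreement concrete, here is a small sketch of three plausible ways to resolve an Allow/Disallow conflict. The strategy names and the rule list are illustrative assumptions; they are not taken from any particular crawler's source.

```python
# Rules in file order: (directive, path prefix). Hypothetical example file.
RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/public-summary.html"),
]


def _matching(rules, path):
    """All rules whose prefix matches the path, in file order."""
    return [(d, p) for d, p in rules if path.startswith(p)]


def first_listed(rules, path):
    hits = _matching(rules, path)
    return hits[0][0] == "allow" if hits else True


def last_listed(rules, path):
    hits = _matching(rules, path)
    return hits[-1][0] == "allow" if hits else True


def most_specific(rules, path):
    hits = _matching(rules, path)
    return max(hits, key=lambda h: len(h[1]))[0] == "allow" if hits else True


path = "/private/public-summary.html"
for strategy in (first_listed, last_listed, most_specific):
    print(strategy.__name__, strategy(RULES, path))
```

With this file order, the first-listed strategy blocks the summary page while the other two allow it. Reversing the two lines flips the first-listed and last-listed answers but leaves the most-specific answer unchanged, which is why the same file could mean different things to different readers.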
Crawl-delay was a similar story. Some crawlers read it as “wait this many seconds between requests,” and webmasters used it to throttle aggressive but compliant bots. Google never supported it, preferring to let site owners set a crawl rate in Search Console. So a Crawl-delay line meant something to Bing and Yandex and nothing to Googlebot, and you could not tell from the file alone which behavior you would get. The Sitemap directive, which points crawlers at an XML index of a site’s URLs, came out of a 2006-2008 collaboration between Google, Yahoo, and Microsoft around the sitemaps protocol, and it bolted a discovery hint onto a file whose original job was exclusion. By the late 2000s robots.txt was carrying at least four distinct dialects, and the only way to know what any given crawler would do was to read that crawler’s own documentation.
There was a deeper ambiguity underneath the directive zoo. The 1994 text never precisely defined how to match a Disallow path against a request URL. Was matching a literal prefix? Did * mean anything? What about a $ to anchor the end of a path? Google and others supported wildcard matching where * stood for any run of characters and $ anchored the end of the URL, so Disallow: /*.pdf$ would block PDF files. But this was Google’s extension, documented by Google, honored by some and not others. A webmaster who used it was writing a file that behaved differently depending on who read it. For a mechanism whose entire value was telling crawlers where not to go, that imprecision was a real problem, and it sat unaddressed for over twenty years.
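The wildcard semantics described above can be sketched by translating a path pattern into a regular expression. This is a minimal illustration under the stated reading (`*` is any run of characters, a trailing `$` anchors the end, everything else is a literal prefix match); the function names are hypothetical and it is not Google's implementation.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk, then join the chunks with ".*" for every "*".
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, giving prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/docs/report.pdf"))
print(matches("/*.pdf$", "/docs/report.pdf?download=1"))
print(matches("/private/", "/private/notes.html"))
```

A crawler that treats `*` and `$` as literal characters would evaluate the same `Disallow: /*.pdf$` line as a prefix that almost no real URL starts with, so the rule would silently block nothing.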
The 1997 draft that expired
It was not for lack of trying. Koster submitted an internet draft to the IETF in late 1996, titled “A Method for Web Robots Control,” that attempted to formalize the convention. Internet drafts are working documents with a six-month shelf life. If a draft is not adopted by a working group and advanced, it expires. Koster’s draft expired in 1997. There was no working group pushing it, no vendor coalition demanding it, and the convention was working well enough in practice that nobody felt the urgency to spend years in standards committees ratifying a text file.
So robots.txt entered a strange limbo. It was simultaneously one of the most-deployed mechanisms on the web and an officially nonexistent standard. The canonical reference was robotstxt.org, a site Koster maintained, plus whatever each crawler operator chose to document. This worked, but it left every edge case to be resolved by convention and reverse-engineering. What is the maximum size of a robots.txt file a crawler will read? What happens if the file returns a 503? If it redirects? If it is full of UTF-8 garbage? The answers existed only inside the source code of individual crawlers, and they did not always agree.
That ambiguity had consequences beyond pedantry. If a crawler and a webmaster disagree about whether a path is blocked, the webmaster’s intent loses. A site owner who carefully wrote Disallow: /admin and assumed it blocked /admin.php might find it did not, because the crawler treated the rule as a directory prefix rather than a plain string prefix. The lack of a spec meant the protocol’s guarantees were soft on both sides: crawlers could not be sure they were complying, and site owners could not be sure they were protected. For a discussion of how this kind of soft guarantee plays out in the broader scraping ecosystem, the history of web scraping traces the parallel arms race in tooling.
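The two readings of Disallow: /admin can be put side by side in a few lines. Both functions are hypothetical illustrations of how a pre-standard crawler might have interpreted the rule, not code from any real crawler.

```python
def string_prefix_blocks(rule: str, path: str) -> bool:
    """Reading 1: the rule is a plain string prefix of the path."""
    return path.startswith(rule)


def directory_prefix_blocks(rule: str, path: str) -> bool:
    """Reading 2: the rule names a directory and only covers what is inside it."""
    directory = rule.rstrip("/")
    return path == directory or path.startswith(directory + "/")


for path in ("/admin/users", "/admin.php"):
    print(path,
          string_prefix_blocks("/admin", path),
          directory_prefix_blocks("/admin", path))
```

Both readings agree on /admin/users and disagree on /admin.php, which is exactly the case the site owner cared about.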
Google writes it down
By 2019, Google had been parsing robots.txt files in production for two decades. Its parser was a C++ library containing, in Google’s own description, pieces of code written in the 1990s, refined against the messy reality of half a billion robots.txt files in the wild. That parser was, in effect, the real specification. If your file worked the way Googlebot interpreted it, it worked, because Googlebot’s interpretation was the one that mattered for most of the web’s traffic.
On 1 July 2019, Google moved on two fronts at once. It open-sourced that C++ parser under the Apache 2.0 license, publishing the google/robotstxt repository on GitHub so that anyone could see and reuse the exact matching logic Google’s crawler used. And it submitted an internet draft to the IETF, co-authored by Koster himself along with Google engineers Gary Illyes, Henner Zeller, and Lizzi Sassman, proposing to formalize the Robots Exclusion Protocol after twenty-five years as a convention. The draft’s stated philosophy was conservative: it did not change the rules created in 1994, it defined all the scenarios that 1994 had left undefined and extended the protocol for the modern web.
The same week brought a less popular announcement. Google said that on 1 September 2019 it would retire support for rules in robots.txt that had never been part of any documented standard, specifically noindex, nofollow, and Crawl-delay when they appeared as robots.txt directives. Some webmasters had been using Noindex: /path inside robots.txt to keep pages out of the search index, an undocumented behavior Google had tolerated for years. Google’s data was that these unsupported rules were contradicted by other rules in all but 0.001% of robots.txt files on the internet, meaning that in practice almost nobody relied on them cleanly. The deprecation pushed those use cases to their proper homes: a noindex meta tag or X-Robots-Tag header for index control, a 404 or 410 for removed content, password protection for genuinely private material.
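As a rough sketch of where those use cases moved, the WSGI app below sends an X-Robots-Tag: noindex header for pages that should stay out of the index and a 410 for removed content. The path sets and the app itself are hypothetical; only the header name and the status codes come from the text above.

```python
# Hypothetical site layout: which paths to keep out of the index or retire.
NOINDEX_PREFIXES = ("/drafts/",)
GONE_PATHS = {"/old-campaign"}


def app(environ, start_response):
    """Minimal WSGI app: index control via headers, removal via 410."""
    path = environ.get("PATH_INFO", "/")
    if path in GONE_PATHS:
        # Removed content: tell clients it is gone for good.
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"Gone\n"]
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        # Crawlable but not indexable: the job Noindex-in-robots.txt used to do.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>ok</title>"]
```

Unlike a robots.txt rule, the header travels with the response, so a crawler has to be allowed to fetch the page in order to see the noindex instruction at all.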
Open-sourcing the parser mattered more than it might look. For two decades, anyone writing a crawler had to guess at the corner cases or reverse-engineer Googlebot’s behavior by observation. Now the reference implementation was public, with its test suite, and the internet draft described the same logic in prose. The de-facto standard and the would-be formal standard were, for the first time, the same artifact. That alignment is what let the IETF process actually go somewhere instead of expiring the way the 1997 draft had.
What RFC 9309 actually pinned down
The draft advanced through the IETF and was published as RFC 9309 in September 2022, an Internet Standards Track document authored by Koster, Illyes, Zeller, and Sassman. Twenty-eight years after the mailing-list proposal, robots.txt had an RFC number. The document is short by RFC standards, and its value is less in inventing anything than in nailing down every behavior that had been ambiguous since 1994.
It defines the syntax in Augmented Backus-Naur Form. A robots.txt file (the robotstxt rule in the grammar) is a sequence of groups and empty lines. A group starts with one or more User-agent lines and contains allow and disallow rules. The product token in a User-agent line, the name a crawler matches itself against, must contain only uppercase and lowercase letters, underscores, and hyphens, and crawlers must match it case-insensitively. When multiple groups name the same crawler, they are combined into one group. The wildcard * group applies only when no group names the specific crawler.
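The grouping rules above can be sketched as a small parser. This is a simplified illustration, not an RFC-complete implementation: it matches product tokens by exact case-insensitive comparison, ignores every directive other than User-agent, Allow, and Disallow, and the sample file and function names are made up.

```python
SAMPLE = """\
User-agent: ExampleBot
Disallow: /a/

User-agent: *
Disallow: /

User-agent: examplebot
Allow: /a/public
"""


def parse_groups(text):
    """Return {lowercased product token: [(directive, path), ...]}."""
    groups = {}
    current_agents = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a rule line closed the previous group
                current_agents, in_rules = [], False
            token = value.lower()  # tokens are matched case-insensitively
            current_agents.append(token)
            groups.setdefault(token, [])  # same token again: merge groups
        elif key in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((key, value))
    return groups


def rules_for(groups, crawler_token):
    """Use the crawler's own merged group; fall back to '*' only if absent."""
    return groups.get(crawler_token.lower(), groups.get("*", []))


groups = parse_groups(SAMPLE)
print(rules_for(groups, "EXAMPLEBOT"))
print(rules_for(groups, "OtherBot"))
```

In the sample, the two ExampleBot groups merge into one rule list, and the blanket Disallow: / in the * group never applies to ExampleBot because a named group exists for it.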
The rule that had caused the most confusion finally got an answer. When an allow and a disallow rule both match a URL, the most specific match wins, and “most specific” is defined precisely as the match with the most octets. If two rules are equivalent in length, the allow rule should be used. Path matching should be case-sensitive. This is the same longest-match logic Google’s open-sourced parser had been applying, now written into a standard that other crawlers could implement against and that webmasters could finally rely on.
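A minimal sketch of that precedence rule follows, assuming plain prefix patterns with no wildcards and no percent-encoding normalization. The function name and rule format are illustrative; the logic follows the three points in the paragraph: most octets wins, ties go to allow, and comparison is case-sensitive.

```python
def is_allowed(rules, path: str) -> bool:
    """rules: iterable of (directive, prefix), directive 'allow' or 'disallow'."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for directive, prefix in rules:
        # Simplification: an empty pattern matches nothing here.
        if not prefix or not path.startswith(prefix):  # case-sensitive
            continue
        length = len(prefix.encode("utf-8"))  # count octets, not characters
        allow = directive == "allow"
        # Longer match wins; on equal length the allow rule is preferred.
        if length > best_len or (length == best_len and allow):
            best_len, best_allow = length, allow
    return best_allow


rules = [
    ("disallow", "/private/"),
    ("allow", "/private/public-summary.html"),
]
print(is_allowed(rules, "/private/public-summary.html"))
print(is_allowed(rules, "/private/diary.html"))
```

Because the decision depends only on match length, the order of lines in the file no longer matters, which removes the first-listed versus last-listed ambiguity of the pre-standard era.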
The operational details got nailed down too. A crawler should cache the robots.txt file and should not use a cached copy for more than 24 hours, unless the file is unreachable, in which case it may keep using the last known good version. The parser must read at least the first 500 kibibytes of the file, so a site cannot accidentally hide rules past some tiny limit and crawlers have a known floor to implement. The HTTP status handling is explicit: a successful 2xx means parse the rules; a 4xx means the crawler may access any resource, because an absent or unauthorized robots.txt is treated as no restrictions; a 5xx server error means the crawler must assume a complete disallow, on the theory that a broken server should not be hammered; and redirects should be followed for at least five hops.
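The fetch-side rules in that paragraph reduce to a small decision table. The sketch below encodes them as described; the enum, constant, and function names are hypothetical, and redirect following is left to the caller, so a 3xx reaching `policy_for_status` is treated as a programming error rather than given a policy.

```python
from enum import Enum


class Policy(Enum):
    PARSE = "parse the rules"
    ALLOW_ALL = "no restrictions"
    DISALLOW_ALL = "complete disallow"


MAX_CACHE_AGE_SECONDS = 24 * 60 * 60   # do not reuse a cached copy longer than this
MIN_PARSE_LIMIT_BYTES = 500 * 1024     # parsers must read at least 500 KiB
MIN_REDIRECT_HOPS = 5                  # redirects should be followed at least this far


def policy_for_status(status: int) -> Policy:
    """Map the final HTTP status of the robots.txt fetch to a crawl policy."""
    if 200 <= status < 300:
        return Policy.PARSE
    if 400 <= status < 500:
        return Policy.ALLOW_ALL       # absent or unauthorized: no restrictions
    if 500 <= status < 600:
        return Policy.DISALLOW_ALL    # server trouble: stay away
    raise ValueError(f"status {status} should be resolved before this point")


def may_use_cached_copy(age_seconds: float, reachable: bool) -> bool:
    """A stale copy is acceptable only while the file cannot be fetched."""
    return age_seconds <= MAX_CACHE_AGE_SECONDS or not reachable


def truncate_for_parsing(body: bytes, limit: int = MIN_PARSE_LIMIT_BYTES) -> bytes:
    """Keep only the part of the file this crawler commits to parsing."""
    return body[:limit]
```

The asymmetry between 4xx and 5xx is the interesting design choice: a missing file is read as permission, while a failing server is read as a request to back off.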
The security considerations section says the quiet part out loud. The Robots Exclusion Protocol is not a substitute for valid content security measures. Listing a path in robots.txt to keep crawlers out also publishes that path to anyone who reads the file, which is everyone, since robots.txt is world-readable by design. A Disallow: /secret-admin-panel/ is a signpost pointing at the thing you wanted hidden. The RFC tells implementors to treat robots.txt content as untrusted, to guard against memory exhaustion from oversized or malformed files, and to handle invalid characters defensively. It is an access-control file that explicitly disclaims being access control.
When the crawlers stopped listening
A standard that depends on voluntary compliance is only as strong as the incentive to comply. For most of robots.txt’s life, the major crawlers were search engines, and a search engine has a powerful reason to behave: it wants webmasters to let it in, because being in the index is mutually beneficial. Block Googlebot and you vanish from search. That alignment kept the protocol honest. The arrival of AI training crawlers broke the alignment, because the value flows one way. An AI company crawls your content to train a model that may then answer the questions your site used to answer, sending you no traffic in return. The webmaster gets nothing, so the webmaster starts saying no.
The saying-no happened fast. When OpenAI’s GPTBot launched in August 2023, roughly 5% of the top 1,000 websites blocked it. By August 2024 that figure had climbed to about 35.7% of the top 1,000, a sevenfold increase in a year. Among news publishers the numbers ran far higher. By 2025, roughly 79% of the world’s top news sites were blocking AI training bots in robots.txt, with around 49% disallowing GPTBot specifically, about 48% blocking Common Crawl’s CCBot, and around 44% blocking Google’s AI crawler. The New York Times, the Guardian, CNN, Reuters, the Washington Post, and Bloomberg all added blocks. Common Crawl’s CCBot, a non-profit crawler whose archives feed many AI training sets, became one of the single most-blocked user-agents on the web, in some samples surpassing GPTBot itself.
[Figure: GPTBot blocking went from a rounding error at launch to roughly a third of the top 1,000 sites within a year, and to nearly half of major news sites by 2025. The samples differ, so the bars are not strictly comparable, but the direction is unambiguous.]
Blocking only works if the crawler reads the file. That assumption started to crack. In August 2025, Cloudflare published an investigation accusing Perplexity of running stealth crawlers to get around no-crawl directives. According to Cloudflare, when a site blocked Perplexity’s declared crawler, which identifies itself with a user-agent like Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user), requests would then arrive from an undeclared crawler presenting a generic Chrome-on-macOS user-agent, Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36, with no hint that Perplexity was behind it. Cloudflare reported the declared crawler generated 20-25 million daily requests while the undeclared one produced another 3-6 million, sourced from IPs outside Perplexity’s published ranges and rotated across different ASNs to dodge blocks. Cloudflare said it fingerprinted the stealth crawler using machine learning and network signals, de-listed Perplexity as a verified bot, and shipped detection rules to block the behavior for all customers. Perplexity pushed back, arguing that a user-driven AI assistant fetching a page on demand is categorically different from a training crawler and should not be bound by the same rules.
That dispute is the whole problem in miniature. robots.txt has no teeth. It never did. For thirty years that did not matter much, because the crawlers that could afford to ignore it mostly chose not to, and the ones that did ignore it were small enough to handle with a firewall rule. AI changed the math. The content is valuable enough that some operators will route around a block, and the techniques for routing around it, rotating IPs, spoofing a browser user-agent, presenting a residential fingerprint, are the same ones the bot-detection industry has spent fifteen years learning to counter. The story of robots.txt becoming unenforceable is, from another angle, the story of why the bot-mitigation industry exists at all. When the polite signal stops working, the impolite countermeasure takes over, and a voluntary text file gives way to TLS fingerprinting and behavioral scoring.
The proposals racing to replace the gap
If robots.txt is a blunt yes/no that AI crawlers increasingly ignore, the obvious move is to build something with more nuance, more enforcement, or both. Three responses emerged in quick succession, and they pull in different directions.
The first is llms.txt, proposed by Jeremy Howard of Answer.AI on 3 September 2024. It is worth being clear that llms.txt is not a robots.txt replacement and was never pitched as one. It solves a different problem: an LLM’s context window is too small to swallow an entire website, and the raw HTML of a page is full of navigation and markup that wastes tokens. An llms.txt file, placed at /llms.txt, is a curated Markdown document, an H1 with the site name, an optional blockquote summary, and H2 sections listing links to clean Markdown versions of key pages, that hands a model a tidy map of what is worth reading. It is a welcome mat, not a fence. Adoption ran into the obvious wall: a welcome mat only helps if the guest reads it, and through 2025 no major AI provider, not OpenAI, not Anthropic, not Google, committed to consuming llms.txt at inference time. The documentation platform Mintlify rolled out support across the docs sites it hosts, which seeded thousands of files including Anthropic’s and Cursor’s, but the consumption side stayed thin. Google did add an llms.txt audit to its Lighthouse tooling, which nudges sites to publish one without obligating any crawler to read it.
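The layout described above (an H1, an optional blockquote summary, and H2 sections of links) is simple enough to generate. The sketch below renders one from plain Python data; the function name, site name, and URLs are invented for illustration.

```python
def render_llms_txt(site_name, summary, sections):
    """Render an llms.txt-style Markdown document.

    sections maps an H2 heading to a list of (link title, URL) pairs.
    """
    lines = [f"# {site_name}", ""]
    if summary:
        lines += [f"> {summary}", ""]
    for heading, links in sections.items():
        lines += [f"## {heading}", ""]
        lines += [f"- [{title}]({url})" for title, url in links]
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site content.
print(render_llms_txt(
    "Example Docs",
    "Reference documentation for the Example API.",
    {
        "Docs": [
            ("Quick start", "https://example.com/docs/quickstart.md"),
            ("API reference", "https://example.com/docs/api.md"),
        ],
    },
))
```

Nothing in the output restricts a crawler. The file only points at content the publisher would like a model to read, which is the welcome-mat role the paragraph describes.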
The second response tries to add the thing robots.txt always lacked: an economic layer. Really Simple Licensing, RSL, launched on 10 September 2025, managed by the non-profit RSL Collective, co-founded by RSS co-creator Eckart Walther and former Ask.com CEO Doug Leeds. The pitch is that yes/no blocking is the wrong abstraction, because most publishers do not want to block AI outright, they want to be paid. RSL augments robots.txt with machine-readable licensing terms, a vocabulary for saying “you may crawl this, but the license is attribution,” or subscription, or pay-per-crawl where the publisher is compensated for each crawl, or pay-per-inference where they are compensated each time the content shapes a model’s answer. Reddit, Yahoo, Quora, O’Reilly Media, Medium, and others backed the launch, and the RSL Technical Steering Committee published a 1.0 specification on 10 December 2025. Whether it sticks depends entirely on whether AI companies agree to honor licenses they are not legally compelled to honor, which is the same voluntary-compliance question robots.txt has faced since 1994, only now with a dollar figure attached.
The third response gives up on voluntary compliance and reaches for enforcement at the network layer. On 1 July 2025, Cloudflare announced it would block AI crawlers by default for new domains, flipping the web’s default from opt-out to opt-in for the substantial slice of traffic it carries, and it launched a Pay Per Crawl marketplace letting publishers charge AI companies per request, with the CDN positioned to actually enforce the toll because it sits in the request path. This is a different kind of answer. robots.txt asks a crawler to behave; Cloudflare’s edge can simply refuse to serve the bytes. The mechanics of that enforcement, fingerprinting, challenge pages, scoring, are the subject of the broader history of Cloudflare, but the strategic point is that the most credible 2025 enforcement of crawler preferences runs through a private CDN rather than a shared standard. The signal stayed advisory. The enforcement moved to the edge.
What a voluntary file was always for
robots.txt has been declared dead more times than is interesting to count, and it keeps not dying. The reason is that it was never trying to be a wall. It was a politeness convention, a way for the cooperative majority of crawlers to coordinate with the cooperative majority of site owners, and within that scope it has worked for three decades with almost no central governance. The RFC did not change what it does. It documented what it already did, resolved the corner cases, and gave the convention a number so that a new crawler author has something to implement against instead of a folder of crawler-specific blog posts.
The thing that changed is not the file. It is the population of crawlers reading it. When the readers were search engines with an interest in your goodwill, an advisory signal was enough, because the incentive to comply came built in. AI crawlers severed that incentive, and a voluntary protocol responds to a severed incentive exactly the way you would predict: the cooperative operators keep cooperating, publish user-agent tokens such as GPTBot and CCBot, and watch the block rates against those tokens climb past a third and then toward half of major sites, while the uncooperative operators are met not by the standard but by everything that grew up to compensate for the standard’s lack of teeth: the fingerprinting, the per-crawl tolls, the default-deny CDN edge.
Thirty-one years after a runaway crawler knocked over Martijn Koster’s server, the file he proposed is finally a real standard, and that standardization arrived at almost exactly the moment its core assumption stopped holding. RFC 9309 describes a world where crawlers want to be told where not to go. The 2025 web is full of crawlers that would rather not be told, and the most consequential fights over web crawling are now happening in licensing collectives and at CDN edges, in places where compliance can be priced or enforced, not requested. The text file still sits at the root of nearly every host, still gets fetched billions of times a day, still works exactly as designed. It just turns out that designing for cooperation has a failure mode, and we are watching it.
Sources & further reading
- IETF / Koster, Illyes, Zeller, Sassman (2022), RFC 9309: Robots Exclusion Protocol — the formal standard: ABNF grammar, longest-match rule, 500 KiB parse floor, 24-hour cache, and HTTP status handling.
- Wikipedia (2026), robots.txt — the consolidated history: the 1994 origin, the Charles Stross provocation, de-facto adoption by WebCrawler/Lycos/AltaVista, and the AI-era blocking figures.
- Google Search Central (2019), Google’s robots.txt parser is now open source — the announcement of the open-sourced C++ parser and the IETF internet draft that became RFC 9309.
- Google Search Central (2019), A note on unsupported rules in robots.txt — the deprecation of noindex, nofollow, and crawl-delay as robots.txt directives, effective 1 September 2019.
- google/robotstxt (2019), Google’s robots.txt parser and matcher (C++) — the reference implementation that doubled as the de-facto spec for two decades.
- Cloudflare (2025), Perplexity is using stealth, undeclared crawlers to evade no-crawl directives — the user-agent strings, request volumes, and IP/ASN rotation behind the stealth-crawling dispute.
- Cloudflare (2025), Content Independence Day: no AI crawl without compensation — the move to block AI crawlers by default and the Pay Per Crawl marketplace.
- Answer.AI / Jeremy Howard (2024), The /llms.txt file — the original llms.txt proposal: purpose, Markdown format, and file location.
- llmstxt.org (2024), The /llms.txt specification — the living spec and rationale for an inference-time content map.
- RSL Collective (2025), RSL Standard press release — Really Simple Licensing: machine-readable license terms layered onto robots.txt, with pay-per-crawl and pay-per-inference models.
- Press Gazette (2024), Eight in ten of world’s biggest news websites now block AI training bots — the publisher-side blocking data and the CCBot/GPTBot/Google-Extended breakdown.
- BuzzStream (2025), Which news sites block AI crawlers in 2025 — per-crawler block rates across top news sites, including GPTBot at roughly half.