The history of web scraping: from wget to headless Chrome, 1994-2026
The web was barely a year old when someone wrote a program to read all of it. In June 1993 the World Wide Web had on the order of a hundred and thirty sites, and a student at MIT pointed a Perl script at them to count how fast the thing was growing. That script did not ask permission, did not render JavaScript, did not negotiate a TLS handshake, and would not have understood any of those words. It fetched HTML, followed the links it found, and fetched more. Every scraper since has been a variation on that loop. What changed over the next three decades was not the loop. What changed was everything the loop had to get past.
This post walks that history in order. The first crawlers and the count of the web. The command-line tools, wget and curl, that turned fetching into a one-liner. The Perl libraries that made parsing programmable. The brief, optimistic API era when sites handed you the data directly. The Python stack of BeautifulSoup, Scrapy, and Requests. The browser-automation lineage from Selenium to Puppeteer to Playwright, and the day in 2017 that headless Chrome shipped and changed what “a scraper” even means. Then the modern standoff: TLS fingerprinting, the bot-mitigation industry, and the AI-training wave that turned a niche engineering problem into a fight over the value of the open web. The legal landmarks (eBay v Bidder’s Edge, hiQ v LinkedIn, The New York Times v OpenAI) thread through the whole thing, because the law has always lagged the loop by about a decade.
Counting the web: 1993-1994
The first crawler that mattered was a measurement instrument. Matthew Gray, then at MIT, built the World Wide Web Wanderer in Perl and ran it starting in June 1993 to estimate the size of the web. The Wanderer traversed hyperlinks, recorded the servers it found, and produced an index called the Wandex. Gray ran it on a roughly monthly cadence into 1995, and the numbers it returned are the reason we can say the web went from about 130 sites in mid-1993 to over 23,000 two years later. Gray was explicit that the Wanderer was not meant to be a search engine. It was a census.
The Wanderer also created the first scraping controversy, in miniature. Early versions hit servers hard enough that operators complained about load, which is the same complaint that would later produce injunctions and bot-mitigation companies. The fetch loop is cheap for the client and expensive for the server, and that asymmetry has been the root of every conflict in this history.
A search engine that did the full job (crawl, index, query) arrived at the end of the same year. JumpStation, written by Jonathon Fletcher at the University of Stirling, began indexing on 12 December 1993 and was announced on the Mosaic “What’s New” page on 21 December. It combined crawling, indexing, and a search interface, which earlier tools had not. By the time Fletcher left Stirling in late 1994, having failed to find anyone willing to fund it, JumpStation’s database held around 275,000 entries across 1,500 servers. The first general-purpose web search engine died for lack of a business model, which is its own kind of historical footnote.
The crucial standard of this era was not a tool at all. In February 1994 Martijn Koster, then at Nexor, proposed a convention on the www-talk mailing list: a file at /robots.txt that told automated clients which paths to leave alone. By June 1994 it was a de facto standard, and the crawlers that followed, including WebCrawler, Lycos, and later AltaVista, honored it. It carried no enforcement. It was a request, written by the site, read by the bot, obeyed on the honor system. That voluntary contract held up for an astonishingly long time and is only now under real strain. We cover its full arc in the history of robots.txt; for this story, what matters is that the web’s first answer to unwanted crawling was a politely worded text file, and that the text file mostly worked for twenty-five years.
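As a rough illustration of how a well-behaved client reads that file, the sketch below uses Python's standard-library urllib.robotparser. The robots.txt contents, the ExampleBot name, and the example.com URLs are invented for the example; GPTBot appears only as a sample user-agent token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone stays out of /private/,
# and one named crawler is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/index.html"),
        ("ExampleBot", "https://example.com/private/report.html"),
        ("GPTBot", "https://example.com/index.html"),
    ]:
        print(agent, url, allowed(ROBOTS_TXT, agent, url))
```

Nothing in this code stops a client from skipping the check, which is the point the paragraph above makes: the file only works when the bot chooses to ask.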
The command line: wget, curl, and the one-line fetch
For most of the 1990s, “web scraping” meant writing your own fetch loop. The tools that made the fetch itself trivial arrived mid-decade and are still in your shell today.
GNU Wget descends from a program called Geturl that Hrvoje Nikšić began in late 1995. Version 1.4.0, released in November 1996, was the first to carry the Wget name and the first under the GNU GPL. Wget did recursion: point it at a URL with -r and it would follow links and mirror an entire site to disk. That is a crawler in a single binary, and it shipped with nearly every Linux distribution, which means an entire generation’s first scraper was a wget -r command they half-understood. curl, written by Daniel Stenberg, arrived in 1998 from the same impulse with a different emphasis: not mirroring but precise control over a single request, every header and method exposed. The two tools split the territory that way. Wget walks trees; curl shapes requests. Both are still maintained, and curl in particular became the reference client whose exact behavior anti-bot vendors now fingerprint, a turn we will come back to.
The limitation of the command-line tools was that they fetched bytes and stopped. To do anything with the bytes (find the price on the page, follow only certain links, fill in a form) you needed a programmable HTTP client with a parser attached. In the late 1990s that meant Perl.
Perl and the programmable fetch: libwww-perl
The library that defined the first real scraping era was libwww-perl, usually called LWP. Its history is tied to the web’s founding conference: the project started at the first WWW conference in Geneva in 1994, where Martijn Koster (the robots.txt author) met Roy Fielding (later a primary author of the HTTP specification), who was presenting MOMspider, a crawler of his own. Fielding wrote the first generation of libwww-perl in Perl 4. The second generation, for Perl 5, was written by Koster and Gisle Aas, and the first non-beta Perl 5 release, numbered 5.00 to match the Perl version, shipped in May 1996. Aas maintained it for years afterward.
LWP is where the recognizable shape of a scraper first appears in a high-level language: construct a request object, send it through a user agent that handles redirects and cookies, get a response object back, hand the body to a parser. Paired with Perl’s regular expressions and HTML parsing modules, it let you write a script that logged into a site, walked paginated results, and pulled structured data out of tag soup. For roughly a decade, “screen scraping” and “Perl and LWP” were close to synonyms in practice. The 2002 O’Reilly book Perl & LWP is a period document worth reading precisely because its framing is so matter-of-fact: there will always be data on the web that has no API, the book notes, and for that data, scraping is the only option. That sentence has aged perfectly.
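The same request, agent, response, parser shape can be sketched in Python's standard library. This is not LWP and makes no claim about its API; the function names, the example-scraper/0.1 agent string, and the example.com URLs are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urljoin
from urllib.request import HTTPCookieProcessor, Request, build_opener


class LinkParser(HTMLParser):
    """Collects href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Parse html and return absolute URLs for every link found."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def make_agent(name="example-scraper/0.1"):
    """Build a user agent that keeps cookies and follows redirects."""
    agent = build_opener(HTTPCookieProcessor(CookieJar()))
    agent.addheaders = [("User-Agent", name)]
    return agent


def fetch_links(agent, url):
    """Send a request through the agent and hand the body to the parser."""
    with agent.open(Request(url), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset, errors="replace")
    return extract_links(body, url)


if __name__ == "__main__":
    # Offline demonstration on deliberately messy markup.
    soup = '<p><a href="/page/2">next<a href=item?id=7>item</p>'
    print(extract_links(soup, "https://example.com/list"))
```

Only fetch_links touches the network; the parsing step can be exercised on a string, which is how most scrapers of this shape end up being tested.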
The API era, and why it didn’t last
The middle 2000s offered a different bargain. Instead of scraping HTML meant for humans, you could ask the site for data in a machine format it published on purpose. Web 2.0 came with public APIs. Flickr, del.icio.us, Amazon, eBay, and then Twitter exposed REST and XML (later JSON) endpoints, and for a while the smart move was to stop scraping and start integrating. An API is faster, cleaner, rate-limited in the open, and blessed by the operator. Why parse HTML if the JSON is right there?
The reason the API era did not end scraping is that an API is a business decision, and business decisions reverse. A site that opens an API to seed an ecosystem can close it once the ecosystem depends on it. Twitter is the canonical arc: an open, generous API in the late 2000s that third-party clients were built on, then a long series of restrictions, then in 2023 a pricing structure that priced most of those clients out of existence. Reddit ran the same play that year. Reddit’s API had been free since 2008; in April 2023 it announced pricing aimed squarely at large commercial users and AI trainers, at a rate (reported around $0.24 per 1,000 calls) that worked out to roughly $20 million a year for the popular third-party client Apollo, which shut down rather than pay. The lesson scrapers took from the API era is that access granted is access that can be revoked, and that the only data you truly control is the data you can extract from the page a normal browser renders. APIs are a convenience, not a foundation.
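The two reported figures above imply a request volume, and the arithmetic is short enough to check. The sketch below uses only the numbers quoted in the paragraph, both of which are approximate, so the result is an order-of-magnitude estimate rather than a measured value.

```python
# Figures as reported in the text above; both are approximate.
PRICE_PER_1000_CALLS = 0.24      # US dollars
ANNUAL_BILL = 20_000_000         # US dollars per year, as estimated for Apollo


def implied_calls_per_year(annual_bill, price_per_1000):
    """Number of API calls that would produce annual_bill at the given rate."""
    return annual_bill / price_per_1000 * 1000


calls_per_year = implied_calls_per_year(ANNUAL_BILL, PRICE_PER_1000_CALLS)
calls_per_month = calls_per_year / 12

if __name__ == "__main__":
    print(f"{calls_per_year:.3g} calls per year")
    print(f"{calls_per_month:.3g} calls per month")
```

At that rate a $20 million bill corresponds to tens of billions of calls a year, on the order of seven billion a month, which is why a per-call price that looks small was fatal at client scale.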
The Python stack: BeautifulSoup, Scrapy, Requests
While the API debate played out, the center of gravity for scraping moved from Perl to Python, and it moved because of one library aimed at one specific pain.
Leonard Richardson released BeautifulSoup in 2004. He wrote it to scrape book data from e-commerce pages, and its whole reason for existing is in the name: real HTML is “tag soup,” broken and inconsistent and nothing like the clean tree the spec describes, and BeautifulSoup parses it anyway, building a navigable tree out of markup that would make a strict parser give up. The Beautiful Soup 3 line ran from 2006 to 2012; Beautiful Soup 4, which can sit on top of faster parsers like lxml, arrived in 2012 and is still maintained. For the better part of two decades, “parse this messy page in Python” has meant BeautifulSoup.
Parsing is half the job. The other half is the crawl itself: the queue of URLs, the politeness, the concurrency, the retries, the deduplication. Scrapy filled that half. It began in 2007 as an internal tool at Mydeco, a London e-commerce startup, built by Shane Evans and soon co-developed by Pablo Hoffman, and the first public release (0.7, BSD-licensed) shipped in August 2008. Scrapy is an asynchronous crawling framework built on Twisted, with a request scheduler, a downloader, a middleware stack, and an item pipeline. The architecture is a direct descendant of the research crawlers of the late 1990s, codified into something a single engineer can run. The design problems Scrapy solves (where to keep the frontier of unvisited URLs, how to avoid hitting one host too hard, how to not crawl the same page twice) are the same ones the Mercator crawler laid out in its 1999 paper, which described per-host FIFO subqueues so that at most one worker thread ever pulls from a given web server. Scrapy’s stewardship passed to Scrapinghub, founded in 2010 by the same people and renamed Zyte in 2021.
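The per-host queue idea is easy to sketch. The class below is a simplified illustration, not Mercator's or Scrapy's actual implementation: the PoliteFrontier name, the fixed per-host delay, and the single-threaded design are all assumptions. It keeps one FIFO queue per host, refuses duplicates, and will not release a second URL for a host until the delay has passed.

```python
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """URL frontier with one FIFO queue per host and a per-host delay."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.queues = {}     # host -> deque of URLs waiting to be fetched
        self.ready_at = {}   # host -> earliest time the next fetch is allowed
        self.seen = set()    # every URL ever added, for deduplication

    def add(self, url):
        """Queue url unless it was added before. Returns True if queued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlsplit(url).netloc
        self.queues.setdefault(host, deque()).append(url)
        self.ready_at.setdefault(host, 0.0)
        return True

    def next_url(self, now):
        """Return a URL whose host is ready at time now, or None."""
        for host, queue in self.queues.items():
            if queue and self.ready_at[host] <= now:
                self.ready_at[host] = now + self.delay
                return queue.popleft()
        return None


if __name__ == "__main__":
    frontier = PoliteFrontier(delay_seconds=1.0)
    for url in ["https://a.example/1", "https://a.example/2", "https://b.example/1"]:
        frontier.add(url)
    print(frontier.next_url(now=0.0))  # first URL for a.example
    print(frontier.next_url(now=0.0))  # a.example is cooling down, so b.example
    print(frontier.next_url(now=1.0))  # a.example is ready again
```

Passing the clock in as an argument keeps the sketch deterministic; a real crawler would read the time itself and sleep when nothing is ready.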
The third piece arrived in 2011. Kenneth Reitz’s Requests library wrapped Python’s clumsy built-in HTTP machinery in an API a human could remember, and its tagline, “HTTP for Humans,” was earned. Requests plus BeautifulSoup became the default first scraper for a generation of programmers, the way wget -r had been a decade earlier. It is worth being precise about what that stack does and does not do, because the limitation defines the next chapter. Requests fetches the HTML the server sends. It does not run JavaScript. As long as the data you wanted was in that initial HTML, the stack was perfect. The moment sites started rendering their content with client-side JavaScript, the HTML that Requests fetched went empty, and the whole approach hit a wall.
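The wall is easy to see with two made-up pages. In the sketch below, both HTML strings and the price element are invented for illustration: one page carries the price in the markup the server sends, the other fills it in with a script. A parser that reads only the fetched HTML, which is all a Requests-style client has, finds the first and not the second.

```python
from html.parser import HTMLParser

# Hypothetical pages: same visible result in a browser, different raw HTML.
SERVER_RENDERED = '<html><body><span id="price">$19.99</span></body></html>'
CLIENT_RENDERED = (
    '<html><body><span id="price"></span>'
    '<script>document.getElementById("price").textContent = "$19.99";</script>'
    "</body></html>"
)


class TextById(HTMLParser):
    """Collects the text inside the element with a given id.

    Simplification: assumes the target contains no void tags such as <br>.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # greater than zero while inside the target element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def text_of(html, element_id):
    """Return the text of the element with element_id, without running scripts."""
    parser = TextById(element_id)
    parser.feed(html)
    return "".join(parser.text).strip()


if __name__ == "__main__":
    print(repr(text_of(SERVER_RENDERED, "price")))
    print(repr(text_of(CLIENT_RENDERED, "price")))
```

The script in the second page is never executed, so the span stays empty no matter how good the parser is.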
Driving a real browser: Selenium, PhantomJS, and the 2017 shift
If the data only exists after JavaScript runs, the obvious move is to run the JavaScript, which means driving an actual browser. That lineage starts with test automation, not scraping, which is why so much of the tooling has a quality-assurance accent.
Selenium began in 2004 at ThoughtWorks in Chicago, where Jason Huggins built a JavaScript-based test runner (Selenium Core) to test an internal expense application. The early architecture ran inside the page as JavaScript, which the browser’s same-origin policy made painful. The cleaner idea came from Simon Stewart, also at ThoughtWorks, who started WebDriver around 2007: instead of injecting JavaScript, give each browser a native driver that controls it from outside. Huggins moved to Google, the projects converged, and WebDriver’s API became the foundation of Selenium 2, which shipped in July 2011. That control protocol was eventually standardized: WebDriver is now a W3C recommendation, and Selenium 4 is built on it. The arc from a JavaScript hack to a web standard took about fifteen years, and it is worth noting that the protocol’s own successor, WebDriver BiDi, is now folding bidirectional, event-driven control back in.
For scraping specifically, the bridge tool was PhantomJS, a headless WebKit you could script without a visible window or a display server. From roughly 2011 it was how you ran a “real” browser on a server to render JavaScript-heavy pages. It had one structural problem: it was a separate browser engine, perpetually behind real Chrome and Safari, so the pages it rendered drifted from what users actually saw.
That problem ended in 2017. In April 2017 the Chrome team announced headless Chrome, shipping in Chrome 59 on Mac and Linux, with Windows following in Chrome 60. Now you could run the exact same engine real users ran, with no window, controlled programmatically. The effect was immediate. PhantomJS’s lead maintainer announced he was stepping down and pointed people at headless Chrome; the project wound down. A few months later Google released Puppeteer, a Node library that drives Chrome over the Chrome DevTools Protocol, and the modern era of browser-based scraping had its default tool. Microsoft’s Playwright followed in 2020 from many of the same engineers, adding cross-browser support and a more capable automation model. We compare the detection surfaces of these tools in Playwright vs Puppeteer vs Selenium.
Headless Chrome changed the economics in both directions. For the scraper, any site became scrapable, because you were running the same code the site was written for. For the defender, the scraper now arrived as a genuine browser, indistinguishable at the protocol layer from a human’s Chrome, which meant the old defenses (block this user-agent, block this IP range) were useless against it. The detection problem moved up the stack, from “what does this client claim to be” to “is this real Chrome being driven by a person or by a script.” That question is the entire modern bot-mitigation industry, and the answer turned out to be: a driven browser leaks. The DevTools Protocol that Puppeteer speaks leaves traces. Headless mode reports itself in subtle ways. We catalog those tells in headless Chrome detection and trace the specific HeadlessChrome user-agent token; the short version is that 2017 did not end the arms race, it relocated it inside the browser.
The fingerprinting turn: when the request itself gives you away
The cleverest shift in this whole history is that defenders stopped trusting what a client says and started measuring how it speaks. A scraper can set any user-agent string it likes; the user-agent has been a frozen lie for years, with every browser claiming to be every other browser for compatibility reasons. So the signal moved to layers the scraper cannot easily forge by editing a header.
The TLS handshake is the clearest example. When any client opens an HTTPS connection it sends a ClientHello listing the cipher suites it supports, the extensions it understands, the elliptic curves it prefers, all in a particular order. That order and contents differ between a Python script linked against OpenSSL and a real Chrome linked against BoringSSL, and the difference is stable enough to fingerprint. The JA3 fingerprint, and its successor JA4, hash those ClientHello fields into a short string that often identifies the client library regardless of what its user-agent claims. A scraper sending a perfect Chrome user-agent over a Python TLS stack announces itself in the ClientHello, the first message of the handshake, before it has sent a single HTTP header. The same logic extends to HTTP/2, where the order of the SETTINGS frame parameters and the pseudo-headers betrays the client, and down to the TCP/IP stack itself.
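A minimal sketch of the JA3 construction makes the mechanism concrete: five ClientHello fields are written as decimal values, joined with dashes within a field and commas between fields, and the resulting string is hashed with MD5, with GREASE placeholder values dropped first. The field values below are illustrative codepoints, not captures from any real client, and the function names are assumptions.

```python
import hashlib


def is_grease(value):
    """GREASE placeholders look like 0x0a0a, 0x1a1a, ... 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3(version, ciphers, extensions, curves, point_formats):
    """Return (ja3_string, md5_hex) for the given ClientHello fields."""
    fields = [
        [version],
        [c for c in ciphers if not is_grease(c)],
        [e for e in extensions if not is_grease(e)],
        [g for g in curves if not is_grease(g)],
        point_formats,
    ]
    text = ",".join("-".join(str(v) for v in field) for field in fields)
    return text, hashlib.md5(text.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Two hypothetical clients offering the same features in a different order.
    client_a = ja3(771, [0x0A0A, 4865, 4866], [0, 10, 11], [29, 23], [0])
    client_b = ja3(771, [4865, 4866], [10, 0, 11], [29, 23], [0])
    print(client_a)
    print(client_b)
    print("same fingerprint:", client_a[1] == client_b[1])
```

Because order is part of the input, two clients that support identical features but list them differently hash to different fingerprints, and neither value depends on any HTTP header.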
This is why curl-impersonate exists and why it is a milestone worth dating. In February 2022 a researcher publishing as lwthiker released a build of curl recompiled against BoringSSL and patched so its TLS handshake matches Chrome’s byte for byte, including Chrome-specific extensions, with Firefox variants too. The point was not to attack anyone. The point was that the only way to make an HTTP client survive TLS fingerprinting is to make it lie at the handshake level, and that this requires rebuilding the client against the browser’s own crypto library, because the fingerprint comes from the library, not from the application. curl’s own author, Daniel Stenberg, wrote about curl’s TLS fingerprint that same year with a kind of resigned interest: the tool’s identifiability is not a bug he can fix, it is a property of how TLS works. The widely used curl_cffi Python binding wraps this same impersonation capability, and the back-and-forth continues, because browsers keep changing their own handshakes (Chrome, for one, began shuffling its extension order in 2023), which breaks fixed fingerprints and forces the impersonators to catch up. The whole dance is covered in detecting curl-impersonate and uTLS.
The mitigation industry and the standoff at the edge
The signals above are not loose research curiosities. They are productized. A cluster of companies (Cloudflare, Akamai, DataDome, HUMAN, Imperva, Kasada, and others) sit in front of a large fraction of high-traffic sites and run exactly these checks on the first request, then add JavaScript challenges, behavioral telemetry, and proof-of-work on top. The full arc is in the history of the bot-mitigation industry, and the CDN vantage point that makes it work is in the history of Cloudflare. The mechanism is consistent across vendors even where the internals differ: the edge sees the connection before the origin does, it scores the client against signals collected at every layer, and it decides whether to pass, challenge, or block before the request ever reaches the site you are trying to read.
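The pass, challenge, or block decision can be pictured as a score against thresholds. The sketch below is a toy: the signal names, weights, and thresholds are invented for illustration and do not describe any vendor's actual model, which would draw on far more signals than these four.

```python
from dataclasses import dataclass


@dataclass
class ClientSignals:
    """Hypothetical per-request signals an edge might collect."""
    ua_claims_chrome: bool     # User-Agent header says Chrome
    tls_matches_chrome: bool   # TLS fingerprint looks like real Chrome
    automation_tells: int      # count of detected automation artifacts
    asn_is_datacenter: bool    # source network is a hosting provider


def score(signals):
    """Add up suspicion points. All weights are made up for this sketch."""
    points = 0
    if signals.ua_claims_chrome and not signals.tls_matches_chrome:
        points += 50           # the header and the handshake disagree
    points += 20 * signals.automation_tells
    if signals.asn_is_datacenter:
        points += 15
    return points


def decide(signals, challenge_at=30, block_at=60):
    """Map a score to one of the three edge outcomes."""
    total = score(signals)
    if total >= block_at:
        return "block"
    if total >= challenge_at:
        return "challenge"
    return "pass"


if __name__ == "__main__":
    print(decide(ClientSignals(True, True, 0, False)))   # ordinary browser
    print(decide(ClientSignals(True, True, 2, False)))   # driven browser
    print(decide(ClientSignals(True, False, 0, True)))   # script with a Chrome UA
```

The shape matters more than the numbers: no single signal decides the outcome, and the decision is made before the request reaches the origin.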
This is the standoff that defines scraping in 2026. The cheap HTTP scrape is fingerprinted at the handshake. The headless browser is detected by the tells of automation. The residential proxy is flagged by ASN reputation and the geolocation-versus-latency check. Each countermeasure has a counter-countermeasure (impersonated TLS, patched browsers, cleaner proxies) and each of those has a counter in turn. The result is not a wall, it is a price. Scraping a protected site is no longer a question of whether it is technically possible. It is a question of how much the data is worth against the cost of the infrastructure to extract it reliably, which is the genuinely modern condition: the loop from 1993 still runs, but now there is a meter on it.
The law, running about a decade behind
The legal history of scraping is a series of courts trying to fit a 1980s statute and a medieval tort to a problem neither anticipated.
The first landmark is eBay v Bidder’s Edge in 2000. Bidder’s Edge aggregated auction listings, hitting eBay’s servers as often as 100,000 times a day; eBay sent a cease-and-desist, Bidder’s Edge kept crawling through proxy servers to evade IP blocks, and eBay sued. Judge Whyte in the Northern District of California granted a preliminary injunction in May 2000 on a theory of trespass to chattels, the old common-law tort for interfering with someone’s personal property. The reasoning was that the unauthorized crawling consumed eBay’s server capacity and so trespassed on eBay’s computer systems. It was a creative fit and a shaky one, and California’s own Supreme Court undercut it three years later in Intel v Hamidi, holding that trespass to chattels requires actual harm to the system, not merely unauthorized contact. But the precedent had already taught a generation of operators that “we told you to stop and you kept going” was a viable legal hook.
For the next two decades the favored weapon shifted to the Computer Fraud and Abuse Act, the 1986 anti-hacking statute, on the theory that scraping a site after being told to stop was access “without authorization.” That theory met its limit in hiQ Labs v LinkedIn. hiQ scraped public LinkedIn profiles to build analytics; LinkedIn sent a cease-and-desist in 2017 and tried to block it; hiQ sued. The district court granted hiQ a preliminary injunction (273 F. Supp. 3d 1099, 2017), and the Ninth Circuit affirmed in September 2019 (938 F.3d 985), reasoning that data on a public website, with no login required, cannot be accessed “without authorization” because there is no authorization gate in the first place. The Supreme Court vacated and remanded in June 2021 in light of Van Buren v United States, which had narrowed the CFAA’s “exceeds authorized access” clause, and on remand in April 2022 the Ninth Circuit reaffirmed its position (31 F.4th 1180): scraping public data is unlikely to violate the CFAA. The case is widely read as establishing that the CFAA does not reach public data, and that the only solid CFAA hook is to put data behind authentication. hiQ’s eventual story is more complicated and a useful corrective: even after winning on the CFAA, hiQ was found in November 2022 to have breached LinkedIn’s user agreement, and the parties settled. You can win the access fight and lose the contract fight.
That distinction (the CFAA is about access, contract law is about terms) is the live battleground now, and the AI-training wave moved it into copyright. The New York Times sued OpenAI and Microsoft on 27 December 2023, alleging that training large models on millions of Times articles was copyright infringement and that the models could reproduce protected text. The defense leans on fair use; the plaintiffs argue the use is not transformative enough. The case is not about whether you can fetch a page. It is about what you are allowed to do with what you fetched, at training scale, and it is unresolved as of 2026.
[Figure: Four eras of legal theory, each reacting to the failure of the last to fit the facts.]
The AI wave: from a niche to a public fight
For thirty years scraping was an engineering subculture. The AI-training boom dragged it into the open, because the thing being scraped was now the raw material for products worth tens of billions, and the people being scraped noticed.
The infrastructure was already there. Common Crawl, founded by Gil Elbaz in 2007 and crawling since 2008, publishes an open corpus now exceeding 10 petabytes, with monthly crawls of more than two billion pages, in standardized WARC, WAT, and WET formats. It was built to level the playing field so that researchers without Google’s resources could study the web at scale. It became, without quite intending to, one of the most important sources of training data for large language models, cited in thousands of papers. The Mozilla Foundation’s 2024 study of it carried the memorable framing that you could get LLM training data “for the price of a sandwich,” meaning the marginal cost of pulling Common Crawl is trivial compared to the value extracted from it. That asymmetry, free to take, expensive to have produced, is the 1993 server-load complaint scaled up by thirty years.
The backlash arrived as robots.txt entries. OpenAI introduced its GPTBot crawler in August 2023 with documented IP ranges and a user-agent token, and publishers responded by adding Disallow rules at a rate that turned a quiet convention into a public referendum on the open web. By August 2024, measurements put roughly a third of the top 1,000 websites blocking GPTBot, up from about 5% a year earlier, and roughly half of major news sites blocking one or more AI crawlers. The same period saw Anthropic’s ClaudeBot, Amazonbot, and ByteDance’s Bytespider become some of the most active crawlers on the web. The catch is the one Koster’s design carried from the start: robots.txt has no teeth. It is a request. Which is why Cloudflare and others began shipping network-level enforcement (Cloudflare’s robots.txt-enforcing WAF rules launched in December 2024) to turn the polite request into an actual block, and why the proof-of-work gate has come back into fashion as a way to make crawling cost the crawler something. The full revival is in the proof-of-work renaissance and the crawler-gating approach in Anubis.
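The proof-of-work idea can be sketched in a few lines. This is a generic hashcash-style puzzle, not the scheme any particular product uses: the challenge string, the colon-separated encoding, and the 12-bit difficulty are assumptions chosen so the example finishes quickly.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, nonce, difficulty_bits):
    """Cheap check the server runs: one hash, one comparison."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge, difficulty_bits):
    """Expensive search the client runs: try nonces until one verifies."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce


if __name__ == "__main__":
    challenge = "example-challenge"   # hypothetical server-issued value
    nonce = solve(challenge, 12)      # about 2**12 hashes on average
    print(nonce, verify(challenge, nonce, 12))
```

Each extra bit of difficulty doubles the client's expected work while the server's check stays at a single hash, which is the asymmetry the gate relies on: trivial for one human page view, costly across millions of crawled pages.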
It is worth stating plainly what is unresolved. Whether robots.txt has any legal weight (as opposed to mere convention) has never been settled in court. Whether training on scraped public data is fair use is exactly the question The New York Times case is testing. Whether the AI crawlers honor the blocks they are served is, in several documented disputes, contested. The honor system that held from 1994 to roughly 2023 held because the stakes were low enough that nobody had a strong incentive to cheat. The stakes are no longer low.
What stayed the same
Strip away thirty years of accreted machinery and the thing in the middle has not moved. Fetch a URL, find the links and the data in what comes back, fetch more. Matthew Gray’s Perl script did that in 1993, your headless Chrome cluster does it in 2026, and the loop is identical. Everything else in this history is layers added around that loop by two sides pushing against each other: the site adding robots.txt, then IP blocks, then JavaScript rendering, then handshake fingerprinting, then behavioral scoring, then proof-of-work; the scraper answering each with recursion, then proxies, then headless browsers, then impersonated TLS, then patched runtimes. Neither side ever wins. The equilibrium just gets more expensive.
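Written down, the loop is about a dozen lines. The sketch below is a breadth-first version with the fetching injected as a function, so it runs against an in-memory stand-in for the web; the crawl and FAKE_WEB names and the page identifiers are assumptions made for illustration.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Fetch a page, find its links, fetch more. Returns pages in visit order."""
    frontier = deque([seed])   # URLs discovered but not yet fetched
    seen = {seed}              # URLs ever queued, so nothing is fetched twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical four-page web: each key links to the pages in its list.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": ["a"],
}

if __name__ == "__main__":
    print(crawl("a", lambda url: FAKE_WEB.get(url, [])))
```

Everything the rest of this history describes plugs into fetch_links or wraps the loop around it; the loop itself has no opinion about robots.txt, proxies, or browsers.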
The one genuinely new thing the AI wave introduced is not technical, it is about consent at scale. For most of this history the fight was over a specific site’s specific data, decided one cease-and-desist at a time. What changed is that the web in aggregate became a training corpus, and the question stopped being “may I read your page” and became “may the sum of everyone’s pages become someone else’s model.” Robots.txt was designed to answer the first question and was quietly load-bearing for the second, which it was never built to carry. The most concrete observation to close on is small and specific: the single most consequential file in the history of web scraping is a plain-text list of paths, invented in 1994 as a courtesy, that in 2024 became the front line of a multibillion-dollar dispute it has no power to enforce. The loop kept running. The text file is what finally broke.
Sources & further reading
- Wikipedia (2024), World Wide Web Wanderer — Matthew Gray’s 1993 Perl crawler, the Wandex index, and the site-count growth figures.
- Wikipedia (2024), JumpStation — Jonathon Fletcher’s December 1993 search engine, the first to combine crawling, indexing, and search.
- M. Koster et al. (2022), RFC 9309: Robots Exclusion Protocol — the IETF standardization of robots.txt, written by the author of the original 1994 convention.
- G. Aas (n.d.), History of LWP — the libwww-perl origin story from the 1994 Geneva WWW conference, by one of its authors.
- Wikipedia (2024), Wget — the Geturl-to-Wget lineage, the 1996 1.4.0 release, and recursive mirroring.
- L. Richardson / Wikipedia (2024), Beautiful Soup (HTML parser) — the 2004 origin and the “tag soup” parsing problem it solves.
- Zyte (2025), Ten years since Scrapy 1.0 — Scrapy’s 2007 Mydeco origin, the 2008 public release, and its stewardship history.
- A. Heydon and M. Najork (1999), Mercator: A Scalable, Extensible Web Crawler — the per-host frontier design that modern crawlers still echo.
- The Selenium project (n.d.), Selenium History — the 2004 ThoughtWorks origin, WebDriver, and the path to a W3C standard.
- The Chrome team (2017), Getting Started with Headless Chrome — the April 2017 announcement of headless mode in Chrome 59.
- lwthiker (2022), Impersonating Chrome, too — the curl-impersonate write-up on matching Chrome’s TLS handshake with BoringSSL.
- Wikipedia (2024), eBay v. Bidder’s Edge — the 2000 trespass-to-chattels injunction and its later erosion by Intel v Hamidi.
- Wikipedia (2024), hiQ Labs v. LinkedIn — the CFAA public-data holdings (938 F.3d 985; 31 F.4th 1180) and the contract-breach coda.
- Common Crawl (2024), About — the 2007 founding, the 10+ petabyte corpus, monthly crawl size, and WARC/WAT/WET formats.
- The Register (2023), New York Times sues OpenAI, Microsoft over training data — the December 2023 copyright complaint over scraped training data.