Caching and incremental recrawl: ETags, Last-Modified, and change detection
A crawler that re-downloads a page it already has is burning bandwidth, CPU, proxy budget, and the target’s patience for nothing. The page might be a forum thread that last changed in 2019. It might be a product listing that updates twice an hour. The crawler does not know which until it asks, and the whole game of incremental recrawl is asking the cheapest possible question: has this changed since I last looked?
HTTP has had an answer to that question for almost thirty years. Conditional requests let a client say “give me this, but only if it differs from the copy I’m holding,” and a server that agrees nothing changed replies with a 304 and an empty body. The headers are small, the savings are large, and almost nobody on the scraping side uses them correctly. This post is about the mechanism underneath: the validators, the comparison rules, the failure modes that make a 304 lie to you, and the scheduling math that decides how often to even bother asking.
The road map. First the two validators HTTP gives you, ETag and Last-Modified, and the difference between strong and weak. Then the conditional request headers and exactly what a 304 is allowed to contain. Then the ways validators break in production, from inode-based ETags behind a load balancer to servers that send 200 with identical bytes. Then content hashing, for when the server’s own signals are useless and you have to detect change yourself. Then the scheduling layer: how to estimate a page’s change rate and turn that estimate into a recrawl interval that does not waste fetches on dead pages or miss updates on live ones.
The two validators: ETag and Last-Modified
A validator is a short piece of metadata the server attaches to a representation so that, later, a client can ask whether the representation still matches. HTTP defines two. One is a timestamp. The other is an opaque tag.
Last-Modified is the timestamp. The server sends a date in its response, its best guess at when the selected representation last changed, and the client stores it. The format is an HTTP-date, second-resolution, no finer. That resolution is the source of half the trouble with this header, which we will get to.
ETag is the opaque one. The entity-tag is a quoted string the server generates however it likes, and the client is not supposed to interpret it. RFC 7232 gives the grammar precisely:
    entity-tag = [ weak ] opaque-tag
    weak       = %x57.2F ; "W/", case-sensitive
    opaque-tag = DQUOTE *etagc DQUOTE
So an ETag is a double-quoted string, optionally prefixed with the two characters W/. That prefix is the weakness indicator, and it is the single most misunderstood byte sequence in HTTP caching.
The distinction between strong and weak validators is about what a match guarantees. A strong validator changes whenever the bytes change. In the words of the spec, a strong validator “changes value whenever a change occurs to the representation data that would be observable in the payload body of a 200 (OK) response to GET.” A weak validator “might not change for every change to the representation data.” So two responses with the same strong ETag are byte-identical. Two responses with the same weak ETag are semantically equivalent but might differ in ways the server decided do not matter, a regenerated whitespace difference, a re-ordered set of equivalent headers, a gzip stream encoded slightly differently.
*The W/ prefix is the whole difference. A strong tag promises identical bytes; a weak tag only promises the page means the same thing.*
For a crawler the weak/strong distinction matters less than it does for a browser cache doing range requests, because a crawler usually wants whole documents and cares about semantic change anyway. If the server gives you a weak ETag and it stays constant, the page has not meaningfully changed, which is exactly the question you were asking. The trap is the comparison rule, and the spec is strict about it. Strong comparison treats two tags as equal only if both are not weak and the opaque parts match character for character. Weak comparison treats them as equal if the opaque parts match, regardless of either being flagged weak. If-None-Match, the conditional header a crawler uses, is defined to use weak comparison. If-Range, used for resuming partial downloads, requires strong comparison and a weak tag will never satisfy it.
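The two comparison rules are small enough to write out. This is a minimal sketch of the weak and strong comparison functions as described above; the names `parse_etag`, `weak_match`, and `strong_match` are illustrative, and it handles a single entity-tag rather than a comma-separated If-None-Match list.

```python
from typing import NamedTuple


class EntityTag(NamedTuple):
    weak: bool
    opaque: str  # the opaque-tag, surrounding double quotes included


def parse_etag(value: str) -> EntityTag:
    """Split one entity-tag into its weakness flag and its opaque part."""
    value = value.strip()
    weak = value.startswith("W/")  # the prefix is case-sensitive
    opaque = value[2:] if weak else value
    if len(opaque) < 2 or not (opaque.startswith('"') and opaque.endswith('"')):
        raise ValueError(f"not a valid entity-tag: {value!r}")
    return EntityTag(weak, opaque)


def strong_match(a: str, b: str) -> bool:
    """Equal only if neither tag is weak and the opaque parts are identical."""
    ta, tb = parse_etag(a), parse_etag(b)
    return not ta.weak and not tb.weak and ta.opaque == tb.opaque


def weak_match(a: str, b: str) -> bool:
    """Equal if the opaque parts are identical, whatever the weakness flags."""
    return parse_etag(a).opaque == parse_etag(b).opaque
```

A crawler comparing a stored tag against a fresh one would normally use `weak_match`, since that mirrors what If-None-Match does on the server side.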
Last-Modified can itself be a strong or weak validator, and the deciding factor is clock resolution. The spec spells out the failure: “a representation’s modification time, if defined with only one-second resolution, might be a weak validator if it is possible for the representation to be modified twice during a single second and retrieved between those modifications.” If a page changes at 12:00:00.300 and again at 12:00:00.800, both edits stamp the same Last-Modified second, and a client holding the first version will be told nothing changed. The spec’s workaround is a heuristic: a Last-Modified time is allowed to be treated as strong only when it is at least 60 seconds before the response’s Date, on the theory that if a minute has passed without another edit you are probably safe. That 60-second rule is a kludge around a header that was specified before sub-second timestamps were normal, and it still ships in every cache today.
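The 60-second heuristic is easy to check from the two headers a response already carries. A sketch, assuming both values are well-formed HTTP-dates; the function name is illustrative, and the spec attaches further conditions to this rule that are not modelled here.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def last_modified_is_strong(last_modified: str, date: str,
                            margin: timedelta = timedelta(seconds=60)) -> bool:
    """True when Last-Modified is at least `margin` earlier than Date."""
    modified = parsedate_to_datetime(last_modified)
    served = parsedate_to_datetime(date)
    return served - modified >= margin
```

When this returns False, the timestamp may still be colliding with a second edit in the same second, so a crawler that cares might schedule a cheap follow-up check rather than trust the first 304.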
Conditional requests and the anatomy of a 304
You hold a validator. Now you ask the conditional question. There are two request headers a crawler cares about, mirroring the two validators.
If-None-Match carries an ETag you previously received. The request means: serve this only if the resource’s current entity-tag does not match what I have. If-Modified-Since carries a date, usually the Last-Modified you stored, and means: serve this only if it changed after that date. If the condition fails, meaning the resource is unchanged, the server returns 304 Not Modified with no body, and you reuse your stored copy.
When both headers appear in one request, the spec is unambiguous: If-None-Match wins and If-Modified-Since is ignored. The reasoning is that an entity-tag is the stronger signal, so if the server can evaluate it there is no reason to fall back to a coarse timestamp. In practice you send both, because some servers emit one validator and not the other, and let the server pick whichever it can evaluate.
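In code, sending both validators looks like the sketch below. It uses only the standard library; `conditional_headers` and `fetch_if_changed` are illustrative names, and the caller is assumed to have stored the ETag and Last-Modified strings from an earlier response. Note that `urllib` surfaces a 304 as an `HTTPError`, so the sketch catches it.

```python
import urllib.error
import urllib.request
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build request headers from whichever validators were stored."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url: str, etag: Optional[str] = None,
                     last_modified: Optional[str] = None, timeout: float = 30.0):
    """Return (status, body, headers); body is None when the server says 304."""
    request = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read(), dict(response.headers.items())
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return 304, None, dict(err.headers.items())
        raise
```

The ETag is sent back exactly as received, quotes and any W/ prefix included, because the server compares it as an opaque string.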
A 304 is not just a status line and a blank body. The spec lists which header fields a 304 must carry: the ones a cache needs to keep its stored copy’s metadata current, “Cache-Control, Content-Location, Date, ETag, Expires, and Vary,” whenever they would have been sent on the equivalent 200. So a 304 can refresh your freshness lifetime and even hand you a new ETag for the same content, which matters if the server rotates its tag format. What a 304 must not include is a message body. The whole point is that you already have the bytes.
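On the crawler side that means a 304 should update the stored record, not just be counted and discarded. A minimal sketch, assuming stored metadata is a plain dict keyed by lowercase header name; `apply_304` and the record layout are hypothetical, and only the six fields quoted above are refreshed.

```python
REFRESHABLE = ("cache-control", "content-location", "date", "etag", "expires", "vary")


def apply_304(stored: dict, headers_304: dict) -> dict:
    """Return a copy of the stored metadata with the 304's fields folded in."""
    incoming = {name.lower(): value for name, value in headers_304.items()}
    updated = dict(stored)
    for name in REFRESHABLE:
        if name in incoming:
            updated[name] = incoming[name]
    return updated
```

The stored body is untouched; only the validators and freshness metadata move forward, so the next conditional request carries the newest ETag the server handed out.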
The savings are not subtle. The classic write-up on conditional GET for feeds put it bluntly: the technique can turn “90% of significant 24,000 byte queries into really trivial 200 byte queries.” A feed reader polling hourly that respects 304 sends two small headers and gets two small headers back, until the day something actually changes. A crawler doing incremental recrawl across millions of URLs sees the same ratio, and at fleet scale that is the difference between a recrawl pass that costs real proxy money and one that is nearly free. If you are paying per gigabyte through a residential pool, every 304 is bandwidth you did not buy. The economics of a scraping operation turn sharply on exactly this kind of avoided work.
There is a sibling family of conditional headers a crawler rarely needs but should recognise. If-Match and If-Unmodified-Since are the write-side conditionals, used to make a PUT or DELETE conditional on the resource not having changed under you, and when they fail the server returns 412 Precondition Failed rather than 304. You will see 412 if you accidentally send If-Match on a read, or against an API that repurposes the header. For pure recrawl you want the If-None-Match and If-Modified-Since pair and the 304 they produce.
Cache-Control sits one layer up and decides whether you even need to ask. A response carrying max-age=3600 is telling you it stays fresh for an hour, so a polite incremental crawler can skip the conditional request entirely until that hour is up. immutable, which Facebook proposed and browsers adopted for versioned static assets, goes further: it says do not revalidate at all while fresh, because the URL itself changes when the content does. RFC 5861 added stale-while-revalidate, letting a cache serve a slightly stale copy while it checks in the background, which is more a CDN concern than a crawler one but tells you how the origin expects its content to age. When no caching headers are present at all, caches fall back to heuristic freshness, and the common heuristic is to treat the content as fresh for roughly 10 percent of the time since it was last modified. A page last changed a year ago gets assumed fresh for a month. None of this is binding on a crawler, but reading these headers tells you how the origin itself models the page’s volatility, which is free intelligence for your scheduler.
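A scheduler can turn those headers into a "do not bother asking before" duration. The sketch below is deliberately partial and its behaviour is an assumption, not a full cache implementation: it honours `max-age`, treats `no-cache` and `no-store` as zero freshness, falls back to the 10 percent heuristic, and ignores `Expires`, `s-maxage`, and `Age`. The function name is illustrative.

```python
import re
from datetime import timedelta
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers: dict) -> timedelta:
    """Estimate how long a response may be assumed fresh."""
    h = {name.lower(): value for name, value in headers.items()}
    cache_control = h.get("cache-control", "")
    if re.search(r"\bno-(cache|store)\b", cache_control, re.IGNORECASE):
        return timedelta(0)
    match = re.search(r"(?:^|[\s,])max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    if match:
        return timedelta(seconds=int(match.group(1)))
    if "last-modified" in h and "date" in h:
        # Heuristic freshness: 10 percent of the time since the last change.
        age = parsedate_to_datetime(h["date"]) - parsedate_to_datetime(h["last-modified"])
        return max(age, timedelta(0)) * 0.1
    return timedelta(0)
```

A return value of zero means "ask now"; anything larger is an interval during which even a conditional request is probably wasted.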
When validators lie
Conditional requests assume the server’s validators are honest and consistent. In production they frequently are not, and a crawler that trusts a broken validator will either miss real changes or re-fetch unchanged pages forever. The failure modes are worth knowing by name.
The oldest and most famous is the inode ETag behind a load balancer. Apache historically generated ETags from three components of the file on disk: the inode number, the modification time, and the size. On a single server that is fine. Put the same file on two servers behind a load balancer, where the bytes are identical but the inode numbers differ because they are separate filesystems, and the two servers generate different ETags for the same content. A crawler gets an ETag from server A, sends If-None-Match on the next request, lands on server B, and B compares the tag against its own inode-derived value, sees no match, and ships the whole file with a 200. The content never changed. The validator said it did. The fix, known since the early 2000s, is to configure FileETag MTime Size and drop the inode component. Apache 2.4 made that the default, but older installs and configurations that still include the inode are out there. There is a security angle too: inode numbers leak filesystem layout, and inode-based ETags have been flagged as an information-disclosure vector for exactly that reason.
The mirror-image failure is the server that always sends 200 with a fresh Date and no validator at all, or a validator that changes on every request even when content is static. Some application servers regenerate a page on every hit, stamping Last-Modified with the current time, so If-Modified-Since can never succeed. Some put a request ID or a timestamp inside the ETag. Either way the conditional request is dead on arrival and you re-download every time. You can detect this by watching the response: if a URL returns 200 with identical bytes across two fetches despite you sending the right conditional headers, the validator is non-functional and you should fall back to content hashing.
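That detection rule can be sketched as a small classifier over the outcome of one fetch. The labels and the function name are hypothetical, and an exact SHA-256 over the body is used here only as the simplest possible "identical bytes" test.

```python
import hashlib
from typing import Optional, Tuple


def classify_fetch(status: int, body: Optional[bytes],
                   previous_digest: Optional[str],
                   sent_conditional: bool) -> Tuple[str, Optional[str]]:
    """Return (verdict, digest) for one fetch of a URL seen before."""
    if status == 304:
        return "unchanged", previous_digest
    digest = hashlib.sha256(body or b"").hexdigest()
    if digest == previous_digest:
        # A 200 with the same bytes: the server ignored or broke the validator.
        return ("validator-broken" if sent_conditional else "unchanged"), digest
    return "changed", digest
```

A URL that keeps coming back "validator-broken" is a candidate for skipping conditional headers altogether and going straight to fingerprinting.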
Then there is the gzip problem. A server might serve the same logical content but a different compressed byte stream depending on its compression settings or library version, producing a different strong ETag for identical decoded content. This is precisely the case the weak validator exists for, and a server that strong-tags gzipped content can hand you spurious 200s. Some CDNs and origins now correctly weak-tag compressed representations. Many do not.
The blunt lesson is that the validator is a hint from a system you do not control, written by someone who may not have thought about crawlers at all. Treat a 304 as trustworthy and act on it, since acting on it costs nothing. But do not treat a 200 as proof of change. A 200 might be a load balancer with mismatched inodes, a page that re-stamps its own timestamp, or a CDN that varied its compression. To know whether the content actually changed, you have to look at the content.
Content hashing, for when the server won’t tell you
When validators are absent or untrustworthy, change detection moves to the client. You fetch the page, reduce it to a fingerprint, and compare that fingerprint to the one you stored last time. If they match, nothing changed and you can drop the new copy. The question is what kind of hash.
The naive choice is a cryptographic hash of the raw bytes, MD5 or SHA-256 over the response body. This works only when you want to detect any change at all, down to a single flipped bit, and it fails the moment a page contains anything dynamic. A cryptographic hash has the avalanche property: change one byte and the entire digest changes. So a page whose only difference between two crawls is a rotated ad slot, a CSRF token in a hidden field, a “last updated 3 minutes ago” string, or a different gzip framing will produce a completely different MD5 every time, and your change detector will scream change on a page that is functionally static. One write-up put the intuition well: MD5-hashing the HTML is like “asking ‘are these two snowflakes identical?’ when what I really needed is ‘are these both snow?’”
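The avalanche effect is easy to see by counting how many digest bits differ between two nearly identical pages. A small sketch; the two sample pages are invented for illustration and differ by one character in a token field.

```python
import hashlib


def digest_bit_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between the SHA-256 digests of a and b."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")


# Same visible content, one character of a hidden token changed.
page_v1 = b'<p>Widget, 10 EUR</p><input name="csrf" value="a1f3">'
page_v2 = b'<p>Widget, 10 EUR</p><input name="csrf" value="a1f4">'
distance = digest_bit_distance(page_v1, page_v2)
```

The distance between the two digests says nothing about how small the edit was, which is exactly why a cryptographic hash makes a poor change detector for dynamic pages.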
The answer the search-engine world settled on is locality-sensitive hashing, specifically simhash. Charikar introduced the technique in 2002, and Google’s crawling team demonstrated it at web scale at WWW 2007. The property that makes it useful is the inverse of avalanche: similar inputs produce similar fingerprints. You tokenise the document into weighted features, hash each feature, and combine them into a fixed-length bit vector such that two documents differing only slightly produce fingerprints that differ in only a few bit positions. The distance between two fingerprints is the Hamming distance, the count of differing bits, and it behaves like a similarity score you can threshold.
For their 8-billion-page corpus, the Google team found that 64-bit fingerprints with a Hamming distance threshold of 3 bits were a good operating point: documents within 3 bits of each other were near-duplicates worth treating as the same. That is the number to start from. At 3 bits or fewer, the page is the same content with cosmetic churn, and you skip it. The bands above that are a rule of thumb rather than a result from the paper: around 4 to 8 bits, you are in judgement territory, a real but minor edit; beyond that, the content has substantively changed and is worth re-indexing.
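A compact simhash fits in a few lines. This is a sketch under stated assumptions rather than the production algorithm: features are lowercase word tokens, weights are raw token counts, and each feature is hashed with BLAKE2b truncated to 64 bits. The band labels in `classify` follow the thresholds discussed above and are likewise illustrative.

```python
import hashlib
import re
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Fingerprint where similar texts tend to differ in few bit positions."""
    weights = Counter(re.findall(r"\w+", text.lower()))
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each feature votes its weight for or against every bit position.
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i, total in enumerate(totals) if total > 0)


def hamming(a: int, b: int) -> int:
    """Count of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def classify(distance: int) -> str:
    if distance <= 3:
        return "cosmetic"
    if distance <= 8:
        return "minor-edit"
    return "substantive"
```

Storing one 64-bit integer per URL is enough: on the next crawl, `classify(hamming(old, new))` gives the scheduler a graded answer instead of a boolean.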
*Simhash turns "did the page change" into a number. The bands are a starting point, not gospel; calibrate against your own corpus.*
The other half of the problem is what you feed the hash. If you simhash the raw HTML including navigation, header, sidebar, footer, and ads, you are measuring change in the boilerplate as much as in the content, and a site-wide template tweak will register as every page changing at once. The standard move is to strip boilerplate first, extracting the main content block before hashing, so that the fingerprint reflects the article or the listing rather than the chrome around it. Google’s own near-duplicate handling excludes boilerplate blocks for exactly this reason. The decision of how aggressively to clean the page, and whether to render it at all before extracting, is the same trade covered in parsing at scale: the more you process per page, the more each fetch costs, and content hashing is one of the cheapest reasons to keep parsing lightweight.
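The cheapest form of boilerplate stripping is to drop text inside elements that are almost never main content. The sketch below uses the standard library parser and a hard-coded skip list, which is an assumption; real extractors use density or layout heuristics, and this version will miss boilerplate that lives in plain div elements.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect text, skipping elements that usually hold page chrome."""

    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The string returned by `main_text` is what gets fingerprinted, so a changed footer year or a reshuffled navigation bar no longer moves the hash.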
Content hashing also gives you something validators cannot: a magnitude of change, not just a yes or no. A 304 tells you the page is identical. A simhash distance of 12 tells you the page changed quite a bit, which is information your scheduler can use. That feeds directly into the last piece.
Scheduling recrawls by change rate
Conditional requests and content hashing both answer “did it change” cheaply. They do not answer “when should I ask again.” That is the scheduling problem, and it is where incremental crawling stops being an HTTP detail and becomes a resource-allocation question. You have a finite crawl budget. You have millions of URLs changing at wildly different rates. How do you point your fetches at the pages most likely to have changed?
The foundational treatment came from Cho and Garcia-Molina around 2000 to 2003, who modelled page changes as a Poisson process and asked how to allocate a fixed number of crawls to minimise staleness across a collection. Their counterintuitive result was that crawling every page at the same frequency is suboptimal, and so, surprisingly, is crawling fast-changing pages proportionally more often. A page that changes so frequently you can never keep it fresh is sometimes better deprioritised than a page that changes at a rate you can actually track, because spending fetches on the hopeless page steals them from pages you could keep current. The optimal policy is non-monotonic in change rate.
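Under the Poisson assumption the trade can be made concrete. If a page changes at rate λ and is fetched every I time units, the chance the stored copy is still current t units after a fetch is e^(-λt), and averaging that over the interval gives (1 - e^(-λI)) / (λI). A minimal sketch of that formula; the function names are illustrative.

```python
import math


def expected_freshness(change_rate: float, interval: float) -> float:
    """Time-averaged probability that the stored copy matches the live page."""
    x = change_rate * interval
    if x == 0:
        return 1.0
    return -math.expm1(-x) / x  # (1 - e^-x) / x


def gain_from_doubling(change_rate: float, interval: float) -> float:
    """Freshness gained by fetching twice as often."""
    return expected_freshness(change_rate, interval / 2) - expected_freshness(change_rate, interval)
```

With rates measured per day, a page that changes about once a day gains far more freshness from a second daily fetch than a page that changes a hundred times a day, which is the intuition behind deprioritising the hopeless page.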
Estimating that change rate per page is its own subproblem, because you only observe the page when you crawl it. If you crawl a page and it changed, you know it changed at least once since last time but not how many times. Cho and Garcia-Molina’s work on estimating change frequency from incomplete observation history is the standard reference: a naive estimator that just counts observed changes is biased low, because several changes between two crawls register as one and a change that was reverted registers as none. The corrected estimators account for that hidden churn.
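The shape of the correction follows from the Poisson model. With visits spaced I apart, the probability of seeing at least one change on a visit is 1 - e^(-λI), so the observed fraction of changed visits estimates that probability, not λ itself, and inverting it gives a logarithmic estimator. The sketch below contrasts the two; the 0.5 smoothing terms are an assumption here, in the spirit of the corrected estimators, and keep the logarithm finite when every visit saw a change.

```python
import math


def naive_rate(changes_detected: int, visits: int, interval: float) -> float:
    """Observed changes per unit time; undercounts when changes pile up."""
    return changes_detected / (visits * interval)


def poisson_rate(changes_detected: int, visits: int, interval: float) -> float:
    """Invert P(change seen) = 1 - exp(-rate * interval), with smoothing."""
    unchanged = visits - changes_detected
    return -math.log((unchanged + 0.5) / (visits + 0.5)) / interval
```

The gap between the two estimates widens as the fraction of changed visits approaches one, which is the regime where the naive count is most misleading.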
The sharper insight came from Olston and Pandey at WWW 2008, who argued that change rate alone is the wrong target. What matters is information longevity. They drew the line between ephemeral and persistent content: “It is crucial for a web crawler to distinguish between ephemeral and persistent content. Ephemeral content (e.g., quote of the day) is usually not worth crawling, because by the time it reaches the index it is no longer representative of the web page from which it was acquired.” A page might change constantly, but if every change is a throwaway, recrawling it is wasted effort. The content that earns a recrawl is content that persists long enough to be worth having, a new blog post that stays up for weeks, not a rotating tagline that is gone in an hour. Their policies estimate per-page longevity cheaply, with little per-page overhead, so they run inside a large parallel crawler without a global optimisation pass.
*Olston and Pandey's distinction. The page that flickers constantly is often the one to skip; the page that holds a new value for weeks is the one to catch.*
In practice you do not need to implement the full theory to benefit from it. A workable scheduler keeps, per URL, an estimate of change interval and a last-seen fingerprint. Each crawl updates the estimate: if the content changed, shorten the interval toward the observed gap; if it did not, lengthen it, often with multiplicative backoff so a page that has been static for ten crawls gets visited a quarter as often as one that just changed. This is the same adaptive-interval logic that governs polite rate limiting against yourself, pointed at freshness rather than politeness. The sitemap is a weak input to this. The <lastmod> element is meant to advertise when a page changed, and Google has said it actively uses lastmod when it is accurate, but Google also deprecated its sitemap ping endpoint in June 2023 and has been clear that lastmod values are “often inaccurate,” which is why they are a hint and not a trigger. The <changefreq> element is weaker still; modern search engines largely ignore it. Trust the validator and the fingerprint over anything the site declares about its own update cadence.
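The update rule for such a scheduler can be sketched in a dozen lines. Every constant below is an assumption chosen for illustration: the floor and ceiling, the shrink factor applied on a change, and the backoff factor applied when nothing changed. `UrlState` and `next_interval` are hypothetical names.

```python
from dataclasses import dataclass

MIN_INTERVAL = 15 * 60          # assumed floor: 15 minutes, in seconds
MAX_INTERVAL = 30 * 24 * 3600   # assumed ceiling: 30 days, in seconds


@dataclass
class UrlState:
    interval: float  # seconds to wait before the next visit


def next_interval(state: UrlState, changed: bool, observed_gap: float,
                  shrink: float = 0.5, backoff: float = 1.5) -> float:
    """Shorten the interval after a change, back off multiplicatively otherwise."""
    if changed:
        # The change happened somewhere inside the observed gap, so aim below it.
        candidate = min(state.interval, observed_gap) * shrink
    else:
        candidate = state.interval * backoff
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))
```

Whether `changed` comes from a non-304 response, a fingerprint distance above threshold, or both is a policy choice; the clamps keep a noisy page from being hammered and a static one from being forgotten.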
Search engines reveal a lot about how they run this loop in their crawler docs. Googlebot supports both If-Modified-Since and If-None-Match, and Google’s guidance is explicit that returning a 304 when content has not changed “will save server processing time and resources, which may indirectly improve crawling efficiency.” On the demand side, Google’s crawl-budget model lists staleness as one of the forces driving recrawl: their systems “want to recrawl documents frequently enough to pick up any changes,” balanced against crawl capacity and the page’s popularity. A page that has not changed across many visits gets recrawled less. A page that changes every time gets recrawled more, up to the point where the cost outweighs the value. That is the same Cho and Olston math, running in production at a scale no scraper will match, and the headers it relies on are the ones any HTTP client can send.
Putting it together
Incremental recrawl is a layered question, and each layer is cheaper than the one below it. At the top, Cache-Control tells you whether you even need to ask, and if the content is still fresh you ask nothing. If you must ask, a conditional request with the validators you already hold turns a full fetch into a 200-byte 304 whenever the page is unchanged, which on a real corpus is most of the time. When the validators are missing or broken, by inode mismatch or rotating tags or a server that re-stamps its own timestamp, you fall back to fetching and fingerprinting, and a locality-sensitive hash like simhash tells you not just whether the page changed but how much, after you strip the boilerplate that would otherwise drown the signal. And wrapping all of it, a scheduler that learns each page’s change behaviour decides how often to run the loop at all, spending fetches where content both changes and persists, and starving the pages that flicker pointlessly or never move.
The thing worth internalising is that none of these layers trusts the one below it completely. A 304 is safe to act on because acting costs nothing, but a 200 is not proof of change, so content hashing exists to second-guess the validator. A lastmod of yesterday is a hint, not a fact, so the scheduler weighs it against what it actually observed. The whole stack is built on the assumption that the server’s signals are well-meaning but unreliable, which, after years of watching Apache ship inode ETags by default, is the correct assumption. The crawler that re-fetches an unchanged page is wasting money. The crawler that trusts a broken validator and skips a changed one is shipping stale data, which is worse, because it does so silently. Building the verification in is the difference between a recrawl pass you can believe and one you merely hope is correct.
Sources & further reading
- IETF (2014), RFC 7232: HTTP/1.1 Conditional Requests. Defines the ETag grammar, strong/weak validators, the comparison functions, and the 304 and 412 status codes.
- IETF (2022), RFC 9110: HTTP Semantics. The current HTTP semantics spec; conditional requests, validator precedence, and the headers a 304 must carry.
- Nottingham (2010), RFC 5861: HTTP Cache-Control Extensions for Stale Content. The stale-while-revalidate and stale-if-error directives for serving aged content during revalidation.
- Cho and Garcia-Molina (2000), The Evolution of the Web and Implications for an Incremental Crawler (VLDB). Poisson change model and the foundation of frequency-based refresh policies. (See the Olston/Pandey paper for the follow-on.)
- Olston and Pandey (2008), Recrawl Scheduling Based on Information Longevity (WWW). The ephemeral-versus-persistent distinction and practical low-overhead revisit policies.
- Manku, Jain, and Das Sarma (2007), Detecting Near-Duplicates for Web Crawling (WWW). Simhash at web scale; 64-bit fingerprints and a 3-bit Hamming threshold over 8 billion pages.
- MDN (2026), Cache-Control header. max-age, immutable, no-cache, must-revalidate, and heuristic freshness behaviour.
- Google Search Central (2023), Sitemaps ping endpoint is going away. Deprecation of the ping endpoint and Google’s position that lastmod is often inaccurate.
- Google (2026), Crawl budget management. Crawl capacity versus crawl demand, with staleness as a recrawl driver.
- sitemaps.org (2016), Sitemap protocol. The loc, lastmod, changefreq, and priority elements and their intended meaning.
- Apache HTTP Server, FileETag directive. The inode/mtime/size ETag construction and the load-balancer mismatch it causes.
- Pilgrim (2002), HTTP Conditional Get for RSS Hackers. The canonical explainer on cutting feed-poll bandwidth with If-Modified-Since and ETag.