Scraping observability: success metrics, block-rate dashboards, and silent failures

The worst scraping outage I have watched did not page anyone. No error rate spiked. No queue backed up. Every job finished green, on time, with the usual number of rows. The dashboards were calm for nine days. Then a price-intelligence customer noticed that a competitor’s catalogue had apparently frozen, and an engineer pulled a sample, and the sample was full of the string “Enable JavaScript to continue.” The site had quietly switched on a bot defence that returns HTTP 200 with a challenge stub instead of the product page. The scraper parsed the stub, found no price, wrote a null, and moved on. Nine days of nulls, all marked success.

That failure mode is the entire reason this post exists. A scraping system is a distributed program whose dependency is an adversary that actively wants your requests to fail in ways you will not notice. The standard observability playbook, built for services you control, assumes a failed request announces itself. Here it does not. A block can look exactly like a success at every layer a normal monitoring stack inspects: the connection completes, the TLS handshake succeeds, the status line says 200, the body is well-formed HTML of a plausible length. The data inside is garbage. This post is about instrumenting the gap between “the request completed” and “the request returned the thing I asked for,” because that gap is where scrapers go to die without telling you.

The plan: start with what to measure and why the standard RED and golden-signal models need a fourth dimension for scraping. Then the taxonomy of failure, from honest 403s to the empty-200 soft block. Then detection, the actual checks that separate good responses from convincing fakes. Then the dashboards and the cardinality traps that make or break them. Then alerting that fires on the right symptom at the right speed. And finally the part everyone skips, validating the data itself, because a scraper that returns the wrong answer confidently is worse than one that crashes.

What to measure, and why scraping needs a fourth signal

The reference models for instrumenting a service are worth knowing before bending them. Brendan Gregg’s USE method, from his work on systems performance in the early 2010s, asks three questions of every hardware resource: utilization, saturation, and errors. Tom Wilkie’s RED method, first presented at a Prometheus meetup in London in 2015 and popularised through Grafana, asks three questions of every request-serving service: Rate, the requests per second; Errors, the count of those that fail; and Duration, the distribution of how long they take. The monitoring chapter of Google’s Site Reliability Engineering book, written by Rob Ewaschuk and edited by Betsy Beyer, frames its own four golden signals as latency, traffic, errors, and saturation. The three models overlap heavily, and for a service you own they are sufficient.

RED is the right starting point for a scraper, because a scraper is fundamentally a thing that issues requests and cares how they come back. Rate, errors, duration. But each of the three needs a definition that a normal web service never has to think about, and there is a fourth signal that has no analogue in the standard models at all.

Errors is the one that breaks. In RED, an error is a request the service failed to serve, and the service knows when that happened because it is the one returning the 500. In scraping, the entity deciding whether your request “failed” is the target, and the target has every incentive to lie about it. So “error rate” splits into at least two distinct quantities that must be tracked separately. There is the transport error rate, connections that never completed: DNS failures, TLS handshake failures, read timeouts, resets. And there is the block rate, requests that completed at the transport layer but were refused or poisoned at the application layer. Conflating them is the first mistake. A spike in transport errors means your proxies or your network are sick. A spike in block rate means the target started fighting back. The fixes have nothing in common.
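
A minimal sketch of that split, assuming each attempt is reduced to an optional HTTP status and an optional transport exception. The `Outcome` enum, the `classify` helper, and the `BLOCK_STATUSES` set are illustrative names, not part of any library, and the set of statuses counted as blocks is an assumption to tune per target.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    OK = "ok"
    TRANSPORT_ERROR = "transport_error"
    BLOCK = "block"


# Statuses counted as explicit blocks in this sketch (an assumption, tune per target).
BLOCK_STATUSES = {403, 429, 451, 503}


def classify(status: Optional[int], transport_exc: Optional[BaseException]) -> Outcome:
    """Classify one attempt; `status` is None when no HTTP response arrived."""
    if transport_exc is not None or status is None:
        # DNS failure, TLS failure, timeout, reset: the proxy or network side.
        return Outcome.TRANSPORT_ERROR
    if status in BLOCK_STATUSES:
        # Completed at the transport layer, refused at the application layer.
        return Outcome.BLOCK
    # Only means "no explicit block"; a 200 still has to be validated.
    return Outcome.OK
```

Feeding each outcome into its own counter keeps the two rates apart, so a sick proxy pool and a hostile target never share a line on a chart.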

Duration needs the SRE book’s specific warning attached, because it applies doubly here. Never alert on a mean. The book’s example is blunt: a service averaging 100 ms at 1,000 requests per second can have one percent of requests taking five seconds, and the mean hides it completely. For scrapers the tail is where blocking lives, because a challenge page often returns faster than real content, and a CAPTCHA redirect often returns slower. The two distort the average in opposite directions and can cancel out, leaving the mean unmoved. You want the distribution, bucketed, so you can see a bimodal split appear when half your traffic starts getting fast challenge stubs.
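
To make the cancellation concrete, here is a small worked example with synthetic latencies. The numbers are invented for illustration: a healthy target answering in 200 ms, and a blocked one where 40 percent of responses are 20 ms stubs and 20 percent are 560 ms redirects.

```python
import statistics

# Synthetic latencies in milliseconds (illustrative numbers, not measurements).
healthy = [200.0] * 1000
blocked = [20.0] * 400 + [200.0] * 400 + [560.0] * 200

mean_healthy = statistics.fmean(healthy)
mean_blocked = statistics.fmean(blocked)


def p10_p90(values):
    """Return the 10th and 90th percentiles of a sample."""
    # quantiles(n=10) returns the nine decile cut points; first is p10, last is p90.
    cuts = statistics.quantiles(values, n=10, method="inclusive")
    return cuts[0], cuts[-1]


healthy_p10, healthy_p90 = p10_p90(healthy)
blocked_p10, blocked_p90 = p10_p90(blocked)
```

The two means are identical, while the p10 and p90 of the blocked sample sit far apart on either side of it. That spread is the signal a mean-based panel throws away.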

The fourth signal, the one with no standard-model analogue, is yield. Yield is the ratio of useful extracted records to attempted requests. Not requests that returned 200, requests that returned the data. A request can succeed at every layer RED inspects and yield nothing, and a system that only watches rate, errors, and duration will rate that request as a triumph. Yield is the metric that would have caught the nine-day outage on day one, because while the success rate held at a hundred percent, the price-extraction yield would have dropped to zero the moment the challenge stub started serving.

*The four layers where a scrape can fail. A monitoring stack borrowed from normal services watches the top two and calls the request a success. Soft blocks live at layer three, parser rot at layer four, and both pass an HTTP-status check untouched.*

So the measurement set for a scraper is RED with errors split in two and yield bolted on: request rate, transport error rate, block rate, duration distribution, and yield. Hold those five in mind. Everything below is about computing them honestly when the target is trying to make every block look like a success.
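
As a rough sketch of how four of those five numbers fall out of a batch of attempts (the duration distribution is better kept as a histogram and is left out here). The `Attempt` record and `summarise` function are hypothetical names, and the sketch assumes one expected record per request, so yield is simply the share of attempts that produced usable data.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    transport_ok: bool   # connection, TLS and read all completed
    blocked: bool        # refused at the application layer (explicit block)
    duration_ms: float
    yielded: bool        # the extractor produced the record it was after


def summarise(attempts: list[Attempt], window_seconds: float) -> dict[str, float]:
    """Compute rate, both error rates and yield over one time window."""
    n = len(attempts)
    if n == 0 or window_seconds <= 0:
        raise ValueError("need at least one attempt and a positive window")
    return {
        "request_rate": n / window_seconds,
        "transport_error_rate": sum(not a.transport_ok for a in attempts) / n,
        "block_rate": sum(a.blocked for a in attempts) / n,
        "yield": sum(a.yielded for a in attempts) / n,
    }


# A soft-blocked window: every request "succeeds", nothing is extracted.
soft_blocked = [Attempt(True, False, 25.0, False) for _ in range(100)]
summary = summarise(soft_blocked, window_seconds=60.0)
```

In that window both error rates are zero and yield is zero, which is the nine-day outage in four numbers.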

The taxonomy of failure, from honest to deceitful

Failures sort cleanly by how much they want to be seen. At the honest end, the target tells you plainly that it blocked you. At the deceitful end, it hands you a 200 and a page that looks real and is not. A monitoring system has to recognise the whole spectrum, because each band needs a different detector and implies a different fix.

The honest blocks are the easy band. A 403 Forbidden is the canonical “I know what you are and I am refusing you.” A 429 Too Many Requests is the target asking you to slow down, usually with a Retry-After header, and it is more a rate-limit signal than a bot verdict; the right response is backoff, not a new identity, which is the whole subject of rate limiting yourself. A 503 with a challenge interstitial is the anti-bot vendor’s polite version of a block. The 451 Unavailable For Legal Reasons code shows up occasionally for geo-blocking. These are gifts. The status code names the problem, your block-rate metric increments cleanly, and you can route around it. Most public write-ups of scraping at scale note that DataDome blocks usually arrive as a 403, sometimes as another 4xx or a 5xx variant, and that is the comfortable case.

The middle band is the redirect block. The status is a 301 or 302, the connection succeeds, and you land on a challenge page, a login wall, or a consent gate at a different URL than you asked for. This is half-honest. Nothing says “blocked,” but the URL you ended up at is not the URL you requested, and that mismatch is a detectable signal if your client follows redirects and you compare the final URL against the requested one. Teams that do not check this silently count the challenge page as a successful fetch.
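
A minimal sketch of that comparison, assuming your HTTP client exposes the final URL after following redirects. The `landed_elsewhere` helper and its normalisation rules (ignore scheme, a leading `www.`, and a trailing slash) are assumptions for illustration; real targets may need a looser or stricter notion of "same page".

```python
from urllib.parse import urlsplit


def _normalise(url: str) -> tuple[str, str]:
    """Reduce a URL to (host, path), ignoring scheme, 'www.' and trailing slash."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower().removeprefix("www.")
    path = parts.path.rstrip("/") or "/"
    return host, path


def landed_elsewhere(requested_url: str, final_url: str) -> bool:
    """True when redirects ended on a different host or path than requested."""
    return _normalise(requested_url) != _normalise(final_url)
```

A harmless upgrade from `http://shop.example/p/1` to `https://www.shop.example/p/1/` passes, while a bounce to `/challenge?next=/p/1` is counted as a redirect block rather than a fetch.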

Then the deceitful band, the empty 200 and its relatives. The connection completes, the status line says 200 OK, and the body is one of several flavours of nothing. A blank page. A page with the document scaffolding but no content where content should be. A client-side CAPTCHA or JavaScript challenge that a real browser would solve and an HTTP client just stores as inert markup. A decoy: plausible-looking data that is deliberately wrong, served to a request the target has decided is a bot, so that the scraper poisons its own dataset without ever knowing it was caught. A public case study that ran five request methods across 82 sites found that “Ban Page” responses, empty bodies, missing <body> tags, login walls, and JavaScript challenges all routinely arrived under a 200 status. The headline from that study is the one every observability design has to absorb: websites block bots and still send a 200, which makes the status code worthless as a success signal on its own.

There is a particularly nasty interaction worth calling out, because it turns a good adaptive-throttling design into an own goal. Scrapy’s AutoThrottle, and any adaptive rate limiter built on the same idea, tunes crawl speed by watching latency: fast responses mean the server is happy, so push harder. A soft block that returns a tiny challenge stub returns very fast. So the throttle reads the block as health and accelerates straight into it. The diagnostic for this exact situation is sharp: if your yield drops while latency stays low and the throttle is not backing off, you are being silently blocked, and your own speed-control loop is helping. That is the cost of treating latency as a proxy for success when the adversary controls the response time.
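
The mechanism is easy to see in a toy model. The sketch below is a simplified rendering of the rule described in Scrapy’s AutoThrottle documentation (target delay is latency divided by target concurrency, the next delay is the average of the previous and target delays, and non-200 responses may not lower the delay). It is not Scrapy’s code, and `next_delay` and `run` are names made up for this example.

```python
def next_delay(prev_delay: float, latency: float, status: int,
               target_concurrency: float = 1.0,
               min_delay: float = 0.0, max_delay: float = 60.0) -> float:
    """One step of a simplified latency-based throttle, in seconds."""
    target = latency / target_concurrency
    candidate = (prev_delay + target) / 2.0
    # A non-200 response is not allowed to lower the delay.
    if status != 200 and candidate < prev_delay:
        candidate = prev_delay
    return max(min_delay, min(max_delay, candidate))


def run(start_delay: float, latency: float, status: int, steps: int) -> float:
    """Apply the same response `steps` times and return the final delay."""
    delay = start_delay
    for _ in range(steps):
        delay = next_delay(delay, latency, status)
    return delay


honest_block = run(5.0, 0.03, 403, 10)  # fast 403s: the guard holds the delay
soft_block = run(5.0, 0.03, 200, 10)    # fast 200 stubs: the delay collapses
```

Ten fast 403s leave the delay at five seconds. Ten equally fast 200 stubs drive it down to a few hundredths of a second, because the only guard in the loop keys on the status code the target controls.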

*Block types from honest to deceitful. The detector you need gets more expensive the further right you go. The rightmost band, deliberate decoy data, defeats every check that looks at the response in isolation and can only be caught by comparing against ground truth.*

Detecting the soft block: validating a response that lies

A status-code check is necessary and nowhere near sufficient. The work of scraping observability is the layer of validation that runs after the 200, deciding whether the body is real. There is no single test. There is a stack of cheap checks that each catch a band of the failure spectrum, ordered so the cheapest run first.

The cheapest is response size. Real content and a challenge stub usually differ in length by an order of magnitude. A product page is tens of kilobytes; an “Enable JavaScript” interstitial is one or two. Recording the body length of every response and watching its distribution per target catches a surprising amount, because when a soft block kicks in, the whole site’s response-size distribution collapses toward the small stub size at once. You are not checking any single response against a fixed threshold so much as watching the histogram shift. A bimodal size distribution that suddenly goes unimodal-small is a block in progress.
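
One hedged way to turn "watch the histogram shift" into a number: measure what share of recent responses are far smaller than the target’s usual body. The `stub_share` function and its 0.25 ratio are assumptions for this sketch, and the baseline is taken to be a list of body sizes from a known-good period.

```python
import statistics


def stub_share(baseline_sizes: list[int], recent_sizes: list[int],
               ratio: float = 0.25) -> float:
    """Share of recent bodies smaller than `ratio` times the baseline median."""
    if not baseline_sizes or not recent_sizes:
        raise ValueError("both samples must be non-empty")
    cutoff = ratio * statistics.median(baseline_sizes)
    return sum(size < cutoff for size in recent_sizes) / len(recent_sizes)


# Baseline: ~40 KB product pages. Recent: half the traffic is a ~1.5 KB stub.
baseline = [40_000] * 100
recent = [40_000] * 50 + [1_500] * 50
share = stub_share(baseline, recent)
```

Tracking the share rather than the median matters when only part of the traffic is blocked: a median would sit unmoved until more than half the responses were stubs.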

Next is structural validation, asking whether the parsed document still contains the elements that define a real page. Does the response have a <body>? Does the product page have the price node, the title node, the thing the extractor depends on? This is where selector logic and monitoring blur together, and the useful move is to make the parser’s failure to find a required field a first-class metric rather than a silently-handled null. If the extractor expected a price and got nothing, that is a yield miss, and it should increment a counter tagged with the target and the field, not vanish into a default value.

Then content markers, the vendor-specific tells. Anti-bot responses leave fingerprints. Public reverse-engineering write-ups note that a DataDome interstitial tends to carry telltales such as a datadome cookie in the Set-Cookie header, an x-datadome response header, or specific script markers in the returned HTML; the exact current marker set is the vendor’s to change and is not something to hard-code without re-verifying, but the principle holds: a soft block from a named vendor usually carries a named signal somewhere in the headers or body. Maintaining a small library of per-vendor block signatures and matching every response against it turns a large chunk of the deceitful band back into the honest band. The blocks are still soft, but now you see them. The signal collection that vendors run on the way in is its own deep topic, covered for one of them in DataDome’s detection model.
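
A sketch of what such a signature library could look like. The DataDome header and cookie names are the ones mentioned above and should be re-verified against your own captured responses before use; the `generic-js-challenge` entry, the `BLOCK_SIGNATURES` table, and `match_block_signature` are invented for this example. Some vendor markers can also appear on allowed responses, so a match is best treated as a flag for the later checks rather than a verdict.

```python
from typing import Optional

# Illustrative signatures only; re-verify every marker against real captures.
BLOCK_SIGNATURES = {
    "datadome": {
        "header_names": {"x-datadome"},
        "cookie_names": {"datadome"},
        "body_markers": (),
    },
    "generic-js-challenge": {
        "header_names": set(),
        "cookie_names": set(),
        "body_markers": ("enable javascript to continue",),
    },
}


def match_block_signature(headers: list[tuple[str, str]], body: str) -> Optional[str]:
    """Return the first signature name that matches, or None."""
    names = {name.lower() for name, _ in headers}
    cookies = {
        value.split("=", 1)[0].strip().lower()
        for name, value in headers
        if name.lower() == "set-cookie"
    }
    body_lc = body.lower()
    for vendor, sig in BLOCK_SIGNATURES.items():
        if (names & sig["header_names"]
                or cookies & sig["cookie_names"]
                or any(marker in body_lc for marker in sig["body_markers"])):
            return vendor
    return None
```

Headers are passed as a list of pairs rather than a dict because a response can carry several `Set-Cookie` headers, and collapsing them would hide some cookies.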

The hardest band, decoy data, defeats all of the above because the response is structurally perfect and the right size and carries no block marker. The data is just wrong. The only defence is ground truth, and the cheapest form of ground truth is a canary: a small set of records whose correct values you know and that you re-scrape on every run. A handful of products whose prices you have verified out of band, a few profiles whose fields are stable, a category page whose count you can predict. When a canary’s scraped value diverges from its known value, you are either looking at a decoy or a parser regression, and either way you want the page. This is the same idea synthetic monitoring uses, the CloudWatch-style canary that walks a known path and asserts a known result, applied to data correctness rather than uptime. Canaries are cheap, they are deterministic, and they are the only thing in this section that catches a target lying with a straight face.
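
A canary check needs very little machinery. In this sketch the `Canary` record, the `diverged` function, and the example URLs and prices are all hypothetical, and the scraped values are assumed to arrive as a dict keyed by URL and field name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Canary:
    url: str
    field: str
    expected: float
    tolerance: float = 0.0  # allowed absolute difference


def diverged(canaries: list[Canary],
             scraped: dict[tuple[str, str], Optional[float]]) -> list[Canary]:
    """Return canaries whose scraped value is missing or outside tolerance."""
    bad = []
    for canary in canaries:
        value = scraped.get((canary.url, canary.field))
        if value is None or abs(value - canary.expected) > canary.tolerance:
            bad.append(canary)
    return bad


canaries = [
    Canary("https://shop.example/p/1", "price", 19.99, tolerance=0.01),
    Canary("https://shop.example/p/2", "price", 5.00, tolerance=0.01),
]
scraped = {
    ("https://shop.example/p/1", "price"): 19.99,
    ("https://shop.example/p/2", "price"): 50.00,  # off by a decimal place
}
failed = diverged(canaries, scraped)
```

A missing value counts as a divergence on purpose: a canary that cannot be scraped at all is as much a reason to look as one that comes back wrong.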

*The validation stack a request runs after it returns 200. Each rung is cheaper than the one below and catches a more honest failure. Everything above the orange dots is free; the canary check at the bottom costs a known-answer request but is the only thing that catches deliberate poisoning.*

One design note that saves a lot of pain: store a sample of the actual response body for a fraction of requests, especially the ones flagged by any check above. When the block rate spikes at 3am, the difference between a five-minute diagnosis and a two-hour one is whether you can pull up the bytes the target actually sent. A reservoir sample of bodies, or full capture of everything a check flagged, is the highest-value thing to log. The metrics tell you that something changed; the captured body tells you what.
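
A reservoir sample keeps a fixed-size, uniformly random subset of an unbounded stream, which is what makes it a good fit for body capture. The sketch below uses the classic replace-with-decreasing-probability scheme; the `BodyReservoir` class is a made-up name, and the choice to keep every flagged body is an assumption that would need its own cap in production.

```python
import random
from typing import Optional


class BodyReservoir:
    """Uniform sample of unflagged bodies, plus every flagged body."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self.capacity = capacity
        self.seen = 0                      # unflagged bodies offered so far
        self.samples: list[bytes] = []
        self.flagged: list[bytes] = []     # unbounded here; cap it in real use
        self._rng = random.Random(seed)

    def offer(self, body: bytes, flagged: bool = False) -> None:
        if flagged:
            self.flagged.append(body)
            return
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(body)
            return
        # Keep the new body with probability capacity / seen.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.samples[slot] = body


reservoir = BodyReservoir(capacity=20, seed=7)
for i in range(1000):
    reservoir.offer(f"body-{i}".encode(), flagged=(i % 250 == 0))
```

Memory stays bounded by the capacity no matter how many responses flow through, and each unflagged body has the same chance of being in the sample when the 3am page arrives.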

Dashboards, and the cardinality that eats them

A dashboard for a scraper is a dashboard for a fleet of requests segmented by the things that determine whether they succeed. The five metrics from the first section, request rate, transport error rate, block rate, duration distribution, and yield, are the rows. The segmentation is the hard part, and it is where most scraping dashboards either go blind or go broke.

The segmentation that matters is by target, by proxy slice, and by the dimensions that let you localise a problem. Per target domain, because a block on one site says nothing about the rest. Per proxy pool or ASN, because when a block clusters by ASN you swap that slice of the pool rather than the whole provider, which is the difference between a surgical fix and a panic. Per protocol or fingerprint, because a target that starts refusing one TLS fingerprint while accepting another is telling you exactly what changed. The general advice from people who run this at scale is to tag every attempt with the IP’s ASN, the protocol version, and a fingerprint hash, so that when block rate moves you can pivot to the dimension that moved it instead of guessing. The choice of fingerprint and session strategy is its own deep well, covered in TLS fingerprinting from JA3 to JA4 and sticky vs rotating sessions.

Here is the trap. Every one of those tags is a label dimension, and in a Prometheus-style metrics system the number of stored time series is, in the worst case, the product of all label cardinalities. Target domain times ASN times status code times fingerprint hash times field name explodes fast. A thousand targets, a few hundred ASNs, a dozen status buckets, and a handful of fingerprints is already millions of series, and a fingerprint hash or a full URL as a label is unbounded cardinality that will take the metrics backend down. The standard guidance holds with force here: keep label values bounded and low-cardinality, never put a raw URL or a per-request ID or a full fingerprint hash in a metric label. Bucket the unbounded things. Domain, not URL. ASN, not IP. A fingerprint class or a short hash prefix, not the full hash. The per-request detail lives in logs and traces, which are built for high cardinality; the metrics carry only the dimensions you actually pivot a dashboard on.
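
The arithmetic, and one way to do the bucketing, in a short sketch. The specific counts (300 ASNs, 5 fingerprint classes) are stand-ins for "a few hundred" and "a handful"; `metric_labels` and the `KNOWN_FP_CLASSES` lookup are hypothetical, with the fingerprint hash mapped to a small named class and anything unknown falling into `other`.

```python
import math
from urllib.parse import urlsplit

# Worst-case series count is the product of the label cardinalities.
cardinalities = {"target": 1000, "asn": 300, "status_bucket": 12, "fp_class": 5}
worst_case_series = math.prod(cardinalities.values())

# Hypothetical mapping from full fingerprint hash to a bounded class name.
KNOWN_FP_CLASSES = {
    "a1b2c3d4e5f6": "chrome-like",
    "0f9e8d7c6b5a": "python-default",
}


def metric_labels(url: str, asn: int, status: int, fingerprint_hash: str) -> dict[str, str]:
    """Reduce per-request detail to bounded, low-cardinality label values."""
    return {
        "target": urlsplit(url).hostname or "unknown",   # domain, not URL
        "asn": f"AS{asn}",                               # ASN, not IP
        "status_class": f"{status // 100}xx",            # class, not code
        "fp_class": KNOWN_FP_CLASSES.get(fingerprint_hash, "other"),
    }


labels = metric_labels("https://shop.example/p/123?ref=x", 64500, 403, "a1b2c3d4e5f6")
```

The full URL, IP, and hash still belong in the log line for the same request, where high cardinality costs nothing.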

For duration specifically, this is where Prometheus histograms earn their place, and where the bucket layout is a real design decision rather than a default. A histogram stores counts per latency bucket, and a quantile is estimated by interpolating between bucket boundaries, so the estimate is only as good as the buckets near the percentile you care about. The practical rule is to place bucket boundaries around your target latencies: if a real response takes a couple hundred milliseconds and a challenge stub comes back in tens of milliseconds, you want buckets fine enough at the low end to see the stub population appear as its own bump. A p50 and a p99 computed from a well-bucketed histogram will show a soft block as a divergence between the two long before a mean would twitch.
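
To see why bucket placement matters, here is a simplified version of the interpolation a Prometheus-style quantile estimate performs over cumulative buckets. It is a sketch of the idea rather than Prometheus’s implementation, and the bucket layout and counts are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from sorted (upper_bound, cumulative_count) buckets.

    The last bucket's upper bound is expected to be math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Bounds in seconds, fine-grained at the low end where stubs would land.
healthy = [(0.025, 0), (0.05, 0), (0.1, 0), (0.25, 900), (0.5, 990), (1.0, 1000), (math.inf, 1000)]
blocked = [(0.025, 300), (0.05, 500), (0.1, 500), (0.25, 950), (0.5, 995), (1.0, 1000), (math.inf, 1000)]

p50_healthy = histogram_quantile(0.50, healthy)
p50_blocked = histogram_quantile(0.50, blocked)
p99_healthy = histogram_quantile(0.99, healthy)
p99_blocked = histogram_quantile(0.99, blocked)
```

With these counts the p50 drops from roughly 0.18 s to 0.05 s while the p99 barely moves. With a single coarse bucket below 0.25 s, the stub population would be interpolated away.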

The yield panel is the one a borrowed dashboard never has, and it is the one to put at the top. Useful records extracted over requests attempted, per target, over time. When it is flat near its normal value the system is healthy in the only sense that pays the bills. When it slides while the success rate holds, you have a silent failure in progress, and the gap between the two lines is the size of the lie. Putting yield and HTTP success rate on the same axis, so the divergence is visible as a widening gap, is the single most useful panel on a scraping dashboard. It is the panel that turns a nine-day outage into a same-day page.

*The panel every scraping dashboard should lead with. HTTP success rate holds flat across the top because the target keeps returning 200. Yield, the rate of actually-extracted records, falls off a cliff when the soft block starts. The shaded gap is data you are losing while every other metric says you are fine.*

Alerting that fires on the right thing at the right speed

Metrics on a dashboard are passive. Someone has to be looking. Alerting is the part that decides what wakes a human, and the scraping-specific failure modes make the standard advice both more important and more fiddly.

Start from the SRE book’s discipline on what deserves a page. An alert should fire on a symptom that is urgent, actionable, and affecting the output now or imminently. It should answer “what is broken,” with the “why” left to investigation. For a scraper the user-facing symptom is not “a request returned 403,” it is “we are no longer collecting the data,” and the metric closest to that is yield. The temptation is to alert on every band of block, but a target throwing a few 403s while yield holds is noise, and paging on it trains people to ignore the pager. Alert on the symptom, yield falling below its expected band, and let block rate and transport errors be the diagnostic dashboards you open after the page, not the trigger.

The mechanics that keep an alert from being either too jumpy or too slow come from SLO-based, multi-window burn-rate alerting, the approach in the Google SRE workbook chapter on alerting. The idea is to express the target as a budget, say ninety-five percent yield, and alert on how fast you are burning the five percent you can afford to lose. A fast burn, spending the budget many times faster than it can be sustained over an hour, pages immediately because something broke hard. A slow burn, draining the budget at a gentler rate over days, opens a ticket because something is rotting gently. The workbook’s reference configuration pairs a short and a long window for each tier so an alert only fires when the problem is both severe and current, which kills the classic false page where a five-minute blip in the past keeps an alert latched. The published example tiers, a roughly 14x burn over a one-hour window for the page and gentler multipliers over six-hour and three-day windows for the slower signals, are a sane starting point to adapt rather than invent from scratch.
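
The burn rate itself is one division: the observed miss rate over the miss rate the SLO allows. The sketch below applies it to yield with tiers modelled on the workbook’s example (14.4x over 1h with a 5m confirmation window, 6x over 6h with 30m, 1x over 3d with 6h). The `TIERS` table, the window labels, and the `evaluate` function are illustrative, and the per-window yields are assumed to be computed elsewhere.

```python
# (action, long window, short window, burn-rate threshold)
TIERS = [
    ("page", "1h", "5m", 14.4),
    ("page", "6h", "30m", 6.0),
    ("ticket", "3d", "6h", 1.0),
]


def burn_rate(yield_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is burning."""
    return (1.0 - yield_ratio) / (1.0 - slo)


def evaluate(yield_by_window: dict[str, float], slo: float = 0.95) -> list[str]:
    """Fire a tier only when both its long and short windows exceed the threshold."""
    fired = []
    for action, long_w, short_w, threshold in TIERS:
        long_burn = burn_rate(yield_by_window[long_w], slo)
        short_burn = burn_rate(yield_by_window[short_w], slo)
        if long_burn >= threshold and short_burn >= threshold:
            fired.append(f"{action}:{long_w}")
    return fired


ongoing = evaluate({"5m": 0.0, "30m": 0.0, "1h": 0.2, "6h": 0.8, "3d": 0.93})
recovered = evaluate({"5m": 0.96, "30m": 0.96, "1h": 0.2, "6h": 0.8, "3d": 0.93})
```

In the first case the soft block is still active, so the one-hour tier pages. In the second the short windows have recovered, so the same one-hour damage no longer pages and only the slow ticket tier remains open.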

There is a scraping wrinkle the SLO machinery does not cover: seasonality and the difference between “the target is down” and “the target changed.” A scraper’s expected yield is not flat. Catalogues are smaller on weekends, some targets rate-limit harder at peak hours, and a naive static threshold will page every Sunday. This is where the statistical-process-control toolkit is worth borrowing. An EWMA, an exponentially weighted moving average, tracks the expected level while adapting to slow drift, and is sensitive to the small sustained shifts that a soft block produces. A CUSUM, a cumulative-sum chart, accumulates small deviations and fires when they add up, which is exactly the shape of a slow yield decline that no single data point would trip. Both are cheap, both are decades old in quality control, and both are better suited to “has the mean quietly moved” than a fixed threshold is. The catch, well documented in the literature, is that they throw false positives when the signal is complex, so they want a confirmation window and a sane minimum-deviation floor before they page.
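
Both charts fit in a few lines. This is a minimal sketch with a one-sided CUSUM that only watches for downward shifts in yield; the `alpha`, `slack`, and `threshold` values are arbitrary illustrations, and it leaves out the seasonality handling and confirmation window mentioned above.

```python
from typing import Optional


def ewma(values: list[float], alpha: float = 0.2) -> list[float]:
    """Exponentially weighted moving average, seeded with the first value."""
    smoothed = []
    level = values[0]
    for x in values:
        level = alpha * x + (1.0 - alpha) * level
        smoothed.append(level)
    return smoothed


def cusum_low(values: list[float], target: float,
              slack: float, threshold: float) -> Optional[int]:
    """Index of the first downward-shift alarm, or None if none fires."""
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate shortfalls below target, ignoring dips smaller than slack.
        s = max(0.0, s + (target - x - slack))
        if s > threshold:
            return i
    return None


# Yield holds at 0.95, then quietly settles at 0.92.
yields = [0.95] * 10 + [0.92] * 10
alarm_index = cusum_low(yields, target=0.95, slack=0.01, threshold=0.05)
fixed_threshold_fired = any(y < 0.90 for y in yields)
smoothed = ewma(yields)
```

The CUSUM fires three points into the shift, while a static 0.90 threshold never fires at all, which is the gap between the two approaches in miniature.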

A few alerts earn their keep outside the yield SLO. Alert when block rate for a target jumps from near zero to non-trivial, because that is the moment a target deployed a new defence and the window to adapt is open. Alert when the body-size distribution for a target collapses, because that is a soft block before the yield metric has even caught up. Alert when a canary’s value diverges, because a wrong canary is a near-certain sign of either decoy data or a parser regression and both need eyes. And alert, gently, when a job produces a row count wildly different from its historical norm, because a scrape that returns ten percent of yesterday’s records and calls it success is the classic silent failure that a pure error-rate alert sails straight past.

Validating the data, the part nobody instruments

Everything so far has been about the request. The last layer is the data itself, after extraction, and it is the layer where the most expensive silent failures live, because by the time bad data reaches a customer it has usually been through a pipeline that laundered it into looking authoritative.

The failure here is selector drift, and it is mundane and constant. A target reorders its HTML, reuses a class name, moves a label, and the selector still matches something, just not the thing it used to. The scraper does not error. It extracts the wrong node and writes it confidently. Public field guides on this describe it precisely: fields move, labels swap, class names get reused, the selector still matches but the meaning of the data changed. No status code reflects it. No exception fires. The only signal is in the data distribution, which means the data is the thing you have to monitor.

The cheap, high-value check is schema validation on every extracted record. A typed schema, the kind Pydantic gives you in Python, asserts that the price is a positive number, the title is a non-empty string of plausible length, the date parses, the required fields are present. A record that fails the schema is a yield miss you can see, and the failure rate per field is a metric that catches selector drift the moment a field starts coming back the wrong type or empty. This is the structural-validation idea from the response layer pushed all the way down to the parsed record, and it is the line of defence between a scraper that returns nulls and a scraper that returns nulls you know about.
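
To keep the example dependency-free, the sketch below uses plain standard-library checks in place of Pydantic; a Pydantic model would express the same constraints declaratively. The field names, the 300-character title limit, and the `field_failures` counter are assumptions made for illustration.

```python
from collections import Counter
from datetime import date

# Failure counts keyed by (target, field): the per-field metric described above.
field_failures: Counter = Counter()


def _is_iso_date(value: object) -> bool:
    """True when the value is a string that parses as an ISO date."""
    try:
        date.fromisoformat(value)  # raises on non-strings and bad formats
        return True
    except (TypeError, ValueError):
        return False


CHECKS = {
    "title": lambda v: isinstance(v, str) and 0 < len(v.strip()) <= 300,
    "price": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool) and v > 0,
    "scraped_on": _is_iso_date,
}


def validate_record(target: str, record: dict) -> bool:
    """Check every required field and count each failure by target and field."""
    ok = True
    for field, check in CHECKS.items():
        if not check(record.get(field)):
            field_failures[(target, field)] += 1
            ok = False
    return ok


good = validate_record("shop.example", {"title": "Widget", "price": 9.99, "scraped_on": "2024-05-01"})
bad = validate_record("shop.example", {"title": "  ", "price": None, "scraped_on": "2024-05-01"})
```

Counting per field rather than per record is the useful part: a price selector that breaks shows up as one counter climbing while the others stay flat.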

Schema validation catches structural breaks. It does not catch a field that is the right type and wrong value, the price that parsed fine but came from the wrong node and is off by a decimal place. That needs distributional monitoring: tracking the mean, the variance, the quantiles, and the cardinality of each field over time and flagging when the distribution shifts in a way the world did not. The tells are specific. A field whose distribution suddenly compresses, every value clustering near one number, often means a selector now grabs a constant. A field whose null rate jumps means a selector stopped matching. A field whose cardinality collapses, where a thousand distinct values become five, means the extractor is reading a template instead of the data. A field that goes suspiciously uniform is as much a red flag as one that goes wild, because real-world data is rarely that tidy. None of these trip a status check, an error counter, or even a schema validator. They only show up when you watch what the data does over time and compare today’s distribution against the recent past.

The comparison against the previous run is the throughline of the whole data-validation layer, and it is the cheapest powerful technique on offer. Diffing a crawl against the last good crawl catches the empty result set, the field that went all-null, the row count that halved, the price column that shifted by a constant. It is the same instinct as the canary, generalised: you do not need absolute ground truth to catch most silent failures, you only need yesterday’s data and an alert when today’s diverges from it more than the world plausibly could. A scraping system that diffs every run against the last and pages on an implausible delta has closed the loop that the HTTP-200 success metric leaves wide open.
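
A run-over-run diff can start as three comparisons per field. In this sketch the function names and the thresholds (a 50 percent row drop, a 20-point null-rate jump, a tenfold cardinality collapse) are placeholders to tune against what the world can plausibly do to your data between runs.

```python
def null_rate(rows: list[dict], field: str) -> float:
    """Share of rows where the field is missing or None (1.0 for an empty run)."""
    if not rows:
        return 1.0
    return sum(row.get(field) is None for row in rows) / len(rows)


def distinct_count(rows: list[dict], field: str) -> int:
    """Number of distinct non-null values seen for the field."""
    return len({row[field] for row in rows if row.get(field) is not None})


def diff_runs(previous: list[dict], current: list[dict], fields: list[str],
              max_row_drop: float = 0.5, max_null_jump: float = 0.2,
              min_cardinality_ratio: float = 0.1) -> list[str]:
    """Return the implausible deltas between the last good run and this one."""
    findings = []
    if previous and len(current) < (1.0 - max_row_drop) * len(previous):
        findings.append("row_count_drop")
    for field in fields:
        if null_rate(current, field) - null_rate(previous, field) > max_null_jump:
            findings.append(f"null_jump:{field}")
        prev_distinct = distinct_count(previous, field)
        if prev_distinct and distinct_count(current, field) < min_cardinality_ratio * prev_distinct:
            findings.append(f"cardinality_collapse:{field}")
    return findings


yesterday = [{"price": float(i)} for i in range(1, 101)]
today_nulls = [{"price": None} for _ in range(40)]
today_constant = [{"price": 9.99} for _ in range(100)]
```

Diffing `today_nulls` against `yesterday` reports the row drop, the null jump, and the cardinality collapse; `today_constant` passes the row and null checks and is caught only by the collapse from a hundred distinct prices to one.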

Closing: instrument the gap, not the request

The recurring mistake in scraping observability is importing a monitoring model built for services you own and trusting it to work against a dependency that lies. RED and the golden signals are sound, and you should use them, but the error signal they assume is one your dependency volunteers. A target under a bot defence volunteers nothing. It returns a 200 and a page-shaped object full of nothing, and a borrowed dashboard rates that a success forever. Every technique in this post exists to close the distance between “the request completed” and “the request returned the data,” because that distance is invisible to status codes and it is exactly where a scraper fails for nine days without a single red pixel.

The cheapest defences are the ones to build first, and they are not glamorous. Split your error rate into transport failures and blocks. Watch the body-size distribution per target and treat its collapse as an alarm. Put yield, real extracted records over attempts, on the same chart as HTTP success rate and stare at the gap. Keep a handful of canaries with known answers and re-scrape them every run. Diff each crawl against the last. None of these need a machine-learning model or a new platform; they need someone to decide that a 200 is a claim to be verified rather than a result to be trusted. That decision is the whole discipline. The system that makes it catches the silent failure on day one. The system that does not finds out from a customer on day nine, which is the most expensive way to learn that your scraper has been quietly writing nulls into a database that everyone downstream believed.

Sources & further reading

Ewaschuk, R. and Beyer, B. (2016), Monitoring Distributed Systems, Site Reliability Engineering — the four golden signals, the warning against alerting on means, and the discipline of paging on symptoms not causes.
Wilkie, T. / Grafana Labs (2018), The RED Method: How to Instrument Your Services — Rate, Errors, Duration as a request-oriented monitoring model, and how it relates to Gregg’s USE method.
Google SRE (2018), Alerting on SLOs, The Site Reliability Workbook — multi-window, multi-burn-rate alerting and the reference burn-rate/window tiers.
ScrapeOps (2024), We Use 5 Methods to Scrape 82 Sites — Here’s Who Blocked Us — measured success rates by method, and the catalogue of block types that arrive under a 200 status.
Patryk B. (2024), Why Scraping Fails Silently (And Why That’s Worse Than Crashing) — selector drift, partial blocking, and why a 200 does not mean good data.
Grepsr (2025), Observability with Logs, Metrics, and Alerts in Scraping Pipelines — the execution, success, data-quality, and performance metric categories for a scraping pipeline.
ExtractData (2024), Scrapy AutoThrottle: Tune Crawl Speed Without Getting Blocked — how a latency-based throttle misreads a fast soft-block stub as a healthy server.
Potent Pages (2025), Data Quality for Web-Sourced Signals: Validation Checks That Catch Silent Failures — distributional monitoring of mean, variance, quantiles, and entropy for silent degradation.
DEV / Deepak Mishra (2025), Data Quality at Scale: Validating Scrapes with Pydantic — typed schema validation as a first-class metric on extracted records.
Last9 (2024), Histogram Buckets in Prometheus Made Simple — why quantiles are bucket-boundary estimates and how to lay out buckets around target latencies.
Better Stack (2024), Prometheus Best Practices: Dos and Don’ts — the label-cardinality trap and keeping unbounded values like URLs and IDs out of metric labels.
DMARC Report (2025), Proxy Observability: A Data-Led Playbook for Reliable Web Scraping — tagging attempts by ASN, protocol, and fingerprint so block-rate movement can be localised to a slice of the pool.
