
How anti-bot vendors detect residential proxies and ASN reputation scoring

Copyright: MIT

A residential proxy is supposed to make a scraper invisible. The traffic exits through a real consumer ISP, on an address a real person uses for Netflix and online banking, announced by an autonomous system that has never knowingly hosted a server. On paper that IP is indistinguishable from any other household on the same street. So why do the better anti-bot systems still flag it, sometimes on the very first request, before a single byte of JavaScript has run?

The answer is that the IP carries more context than its octets. Every address belongs to an autonomous system with a history, sits inside a prefix with neighbours, and has a measurable past: who has used it, what they did, when it was last seen exiting a known proxy pool. The network layer is the first thing a detector sees and the cheapest signal it has, and a decade of measurement research plus a few very large takedowns have given the defenders a surprisingly detailed map of where residential proxy traffic actually comes from. This post is about that map and how it gets read.

We start with what an ASN is and why reputation attaches to it. Then the classification problem: how a detector decides datacenter, residential, or mobile, and why those labels are probabilistic rather than certain. From there into IP-quality scoring and the commercial feeds that sell it, the known-proxy lists and how they are built, the special case of mobile and CGNAT where blocking causes collateral damage, and finally the leaks: the reasons a genuinely clean residential IP still gives itself away. The through-line is that IP-layer detection is not one check but a stack of correlated ones, and the proxy economy has spent years trying to outrun each layer in turn.

What an ASN actually is, and why reputation sticks to it

An Autonomous System is a block of IP prefixes under a single routing policy, identified by a number, that announces its routes to the rest of the internet over BGP. Cloudflare is AS13335. Google is AS15169. OVH, Hetzner, DigitalOcean, AWS, each runs its own ASNs, and the prefixes they originate are public knowledge, scraped continuously from BGP route collectors and the regional registries. When a connection arrives, the very first thing a detector can do, before parsing a header or fingerprinting a TLS handshake, is map the source IP to its origin ASN and ask a simple question: what kind of network is this?

That mapping is the foundation of everything downstream. A datacenter ASN announcing tens of thousands of contiguous addresses with no residential customers behind them is trivially a source of server-origin traffic. A residential ISP ASN is the cover that proxy operators pay for. The reputation, though, does not live only at the ASN level. It lives at the prefix level too, often the /24, because that is the granularity at which abuse clusters and at which a detector can punish a noisy neighbourhood without nuking a whole carrier. An address gets judged by the company it keeps. A clean /24 inside an otherwise dirty ASN can score well; a single abusive address can drag down neighbours in its /24 that have personally done nothing wrong.

Figure: the same address resolves up three levels, each with its own history.
Address 203.0.113.47: last seen as a proxy exit, 3 days ago.
Prefix 203.0.113.0/24: prefix abuse density high.
ASN AS64500: residential ISP.
Address verdict = f(own last-seen, prefix density, ASN class). A clean address inside a dirty /24 inherits some of the neighbourhood's suspicion. Cleanliness can outrank rotation.
*The IP-layer verdict combines the address's own history, the abuse density of its prefix, and the class of its ASN. The proxy operator controls none of these directly.*
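The three-level verdict can be sketched as a small scoring function. This is a minimal illustration, not any vendor's formula: the weights, the seven-day half-life, the class priors, and the names `AddressContext` and `suspicion` are all assumptions made up for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

# Hypothetical priors per ASN class; real detectors tune these from data.
ASN_CLASS_PRIOR = {"datacenter": 0.8, "residential": 0.2, "mobile": 0.1}


@dataclass
class AddressContext:
    ip: str
    days_since_proxy_sighting: Optional[float]  # None = never seen in a proxy pool
    prefix_abuse_density: float                 # share of the /24 flagged, 0.0 to 1.0
    asn_class: str                              # key into ASN_CLASS_PRIOR


def prefix_24(ip: str) -> str:
    """Return the enclosing /24 for an IPv4 address."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def suspicion(ctx: AddressContext) -> float:
    """Blend address history, prefix density, and ASN class into a 0..1 score."""
    if ctx.days_since_proxy_sighting is None:
        own = 0.0
    else:
        # Assumed decay: a proxy sighting loses half its weight every 7 days.
        own = 0.5 ** (ctx.days_since_proxy_sighting / 7.0)
    return (0.5 * own
            + 0.3 * ctx.prefix_abuse_density
            + 0.2 * ASN_CLASS_PRIOR[ctx.asn_class])
```

With these made-up weights, an address with no history of its own still scores higher inside a dirty /24 than inside a clean one, which is the neighbourhood effect the figure describes.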

This is why the old proxy-buyer instinct, rotate faster to stay ahead, has aged badly. Rotation defeats per-address rate limits. It does nothing against prefix reputation, because the replacement address usually comes from the same pool, the same /24 neighbourhoods, the same handful of ASNs that the proxy provider has access to. The detector is not tracking your session across IPs; it is recognising the pool. A clean, stable address that behaves like a person beats a fast-rotating address drawn from a known-dirty range, and that inversion is the single most important shift in IP-layer detection over the last few years.

The classification problem: datacenter, residential, mobile

The first cut a detector makes is the type label, and it matters because the three classes carry wildly different default trust. Datacenter traffic is server-origin by definition; a browser does not normally run in AWS us-east-1, so a residential-grade user-agent arriving from a hosting ASN is already contradicting itself. Residential is the trusted default, the thing proxy buyers are paying to impersonate. Mobile sits at the top of the trust pile because carrier-grade NAT makes mobile addresses genuinely shared among many real people, which makes them dangerous to block.

Datacenter detection is the easy end. The hosting ASNs are enumerable. AWS, Google Cloud, Azure, OVH, Hetzner, Vultr, DigitalOcean and the rest publish or leak their ranges, and a detector keeps a maintained list of every ASN whose business is renting compute. Anything originating there gets the server-origin label and a low default trust, full stop. The interesting failure mode is the reverse: legitimate users behind a corporate VPN that egresses through a cloud region, or a privacy relay, can look datacenter-origin while being entirely human. Apple’s iCloud Private Relay is the canonical example, which is why detectors maintain separate allowance for known relay egress ranges rather than treating all hosting traffic identically.
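As a rough sketch of that first cut, the lookup below checks a relay allowance list before the hosting-ASN set. The three ASNs are only a tiny illustrative sample of a maintained list, and the relay range is a documentation prefix standing in for real egress ranges, which would come from the relay operator's published list.

```python
import ipaddress

# Illustrative sample only: AWS, DigitalOcean, Hetzner.
HOSTING_ASNS = {16509, 14061, 24940}

# Placeholder range (RFC 5737 documentation space), NOT a real relay egress range.
RELAY_EGRESS = [ipaddress.ip_network("198.51.100.0/24")]


def classify_origin(ip: str, asn: int) -> str:
    """Label a connection by network type before any header is parsed."""
    addr = ipaddress.ip_address(ip)
    # Check the relay allowance first so relay users are not lumped in
    # with generic hosting traffic.
    if any(addr in net for net in RELAY_EGRESS):
        return "known-relay"
    if asn in HOSTING_ASNS:
        return "server-origin"
    return "unclassified"
```

The ordering is the design choice worth noting: the allowance has to run before the hosting check, otherwise a relay that egresses from cloud infrastructure never reaches it.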

The hard part is the boundary between residential and the things pretending to be residential. The classification is probabilistic, assembled from registry metadata, the ASN’s declared usage type, reverse-DNS patterns, and observed behaviour, and none of those is authoritative on its own. The commercial databases encode this directly. IP2Location’s proxy database, for instance, carries a proxy_type field with codes that have grown over the years to track the ecosystem: DCH for datacenter or CDN, VPN, TOR, PUB for public proxies, WEB for web proxies, SES for search-engine robots, and crucially RES for residential proxies, added in their later database tiers, alongside CPN for consumer privacy networks and EPN for enterprise private networks. The fact that RES is a distinct, comparatively recent code is itself the story: residential proxy traffic became common enough, and detectable enough, to warrant its own classification separate from the generic proxy buckets.

Figure: three IP classes, three very different default postures.
Datacenter: default trust low; detection difficulty trivial; safe to block: yes.
Residential: default trust high; detection difficulty hard; safe to block: mostly.
Mobile: default trust highest; detection difficulty hardest; safe to block: no (CGNAT).
The trust gradient runs opposite to the ease-of-detection gradient: the harder a class is to classify confidently, the more a wrong block costs, so detectors lean on behaviour rather than the IP alone as you move down the list.
*Trust and ease of detection run in opposite directions. Mobile is the most trusted and the hardest to block precisely because so many real people share each address.*

IPinfo splits the same problem across separate products, which tells you something about how the industry thinks. Their privacy-detection database flags five binary categories on an address: hosting (a datacenter or cloud IP), proxy (an open web proxy), tor (an exit node), relay (a location-preserving anonymising relay), and vpn (a VPN exit), plus a service string naming the provider where known. Residential proxy detection is deliberately not one of those flags. It lives in a different dataset entirely, because spotting a home connection that is quietly relaying someone else’s traffic is a fundamentally different and harder job than spotting a VPN exit, and conflating the two would pollute both. The hosting flag is high-confidence and ASN-driven; the residential-proxy signal is inferred, behavioural, and never as certain.

IP-quality scoring and the feeds that sell it

Most anti-bot vendors do not build their own global view of every IP from scratch. They buy it, or they blend a bought feed with their own first-party observations. The product these feeds sell is a score, usually a small integer, that compresses everything known about an address into one number a rule engine can threshold on.

IP2Location’s fraud_score is a 0 to 99 value, where higher means a greater likelihood of fraud and a lower reputation, sitting in the same record as the proxy-type code, the usage type, the threat category, the ASN, and a last_seen timestamp for when the address was last observed in a proxy role. That record layout is worth dwelling on because it is representative of the whole category. The score is not a measurement of the request in front of you. It is a precomputed verdict on the address, assembled offline from abuse history, proxy-pool membership, and threat feeds, and handed to the detector as a lookup. The threat field in the same database is categorical rather than scalar: SPAM for known spammers, SCANNER for security scanners, BOTNET for malware-infected devices, BOGON for addresses announced via BGP that should not be routable at all. A BOGON is an instant fail; a high fraud score with no threat category is a soft signal that something downstream should look harder.
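To make the lookup-then-threshold pattern concrete, here is a minimal sketch of a rule engine consuming such a record. The `FeedRecord` class is a hypothetical stand-in for a feed row, not IP2Location's actual schema or client API, and the thresholds and the "challenge" tier for non-BOGON threats are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedRecord:
    fraud_score: int                # 0 to 99, higher means lower reputation
    threat: Optional[str]           # "SPAM", "SCANNER", "BOTNET", "BOGON" or None
    proxy_type: Optional[str]       # e.g. "DCH", "VPN", "RES" or None
    last_seen_days: Optional[int]   # days since last observed in a proxy role


def ip_prior(rec: FeedRecord, high_score: int = 75, fresh_days: int = 7) -> str:
    """Turn a precomputed feed record into a coarse action for the layers above."""
    if rec.threat == "BOGON":
        return "block"        # should not be routable at all: instant fail
    if rec.threat is not None:
        return "challenge"    # assumed policy for SPAM / SCANNER / BOTNET
    recently_res = (rec.proxy_type == "RES"
                    and rec.last_seen_days is not None
                    and rec.last_seen_days <= fresh_days)
    if recently_res or rec.fraud_score >= high_score:
        return "inspect"      # soft signal: something downstream looks harder
    return "allow"
```

Nothing in the function looks at the request itself, which mirrors the point above: the score is a verdict on the address, computed offline and consumed as a lookup.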

The usage-type taxonomy matters as much as the score. IP2Location tags addresses as commercial, organisation, government, military, educational, library, CDN, ISP, mobile (MOB), datacenter (DCH), search engine, reserved, and, in their newest tier, a dedicated code for AI crawlers. That MOB versus ISP versus DCH distinction is exactly the residential-versus-datacenter-versus-mobile cut from the previous section, encoded as queryable data. A detector that wants to apply different rate limits to mobile carriers than to fixed-line residential ISPs needs precisely this, and it needs it as a fast lookup, because the decision happens before the request is fully processed. If you want the deeper version of where that decision physically happens in the request path, we cover it in server-side versus client-side bot detection.

Where the bought feed ends and the vendor’s own intelligence begins is the interesting seam. IPQualityScore describes its own pipeline as a blend of honeypots and traps, forensic analysis, machine learning, range scanning, blacklisting, and client-side reporting. The honeypots and traps are the first-party part: bait endpoints and addresses that only an automated crawler or a proxy probe would ever touch, which means any IP that hits them is self-identifying. Range scanning and forensic analysis are active, the provider going out and probing suspected proxy infrastructure rather than waiting for it to show up. The bought-versus-built tension is permanent, because a feed bought by everyone is known to everyone, including the proxy operators who test their pools against it.

Known-proxy lists and how they are built

The cleanest way to flag a proxy is to already know the exit IP is a proxy. That sounds circular, but it is the workhorse of IP-layer detection, and the question is entirely how the list gets populated and how fresh it stays. There are a few mechanisms, and they have different reliability.

The first is to become a customer. A detection vendor signs up to a residential-proxy service, routes traffic through it to infrastructure it controls, and records every exit IP that appears. Do this continuously across the major providers and you build a live map of their pools. The foundational academic work here, the 2019 IEEE Symposium on Security and Privacy paper that infiltrated five residential-proxy providers between 2017 and 2018, did exactly this and harvested more than six million exit IP addresses spread across 230-plus countries and over 52,000 ISPs. The same study found that many of those exit nodes were not willing participants at all but compromised hosts, including IoT devices, which is the detail that reframes the whole ecosystem: a large share of residential proxy supply is, functionally, a botnet with a billing department.

The second mechanism is passive observation at scale. A vendor sitting in front of a meaningful slice of global web traffic sees the same IPs appear across thousands of unrelated sites, often relaying for many different ostensible users in a short window. That fan-out, one address acting on behalf of many unrelated sessions across many destinations, is itself the fingerprint of a shared exit, and the bigger the vendor’s vantage point the cleaner the signal. This is why scale is a moat in this business: only a network seeing a large fraction of the web can map proxy exits by behaviour rather than by buying a feed.
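A toy version of the fan-out check looks like the sketch below. The event tuple layout, the function name, and the thresholds are assumptions for illustration; a real vendor would run this over a streaming store rather than an in-memory list.

```python
from collections import defaultdict


def shared_exit_candidates(events, now, window_s=600, min_sites=3, min_sessions=5):
    """Return IPs seen across many unrelated sites and sessions in the window.

    events: iterable of (timestamp, ip, site, session_id) tuples.
    """
    sites = defaultdict(set)
    sessions = defaultdict(set)
    for ts, ip, site, session_id in events:
        if now - window_s <= ts <= now:
            sites[ip].add(site)
            sessions[ip].add(session_id)
    return {
        ip for ip in sites
        if len(sites[ip]) >= min_sites and len(sessions[ip]) >= min_sessions
    }
```

On its own this also fires on CGNAT addresses, which legitimately carry many sessions, so a result here is a candidate for closer inspection rather than a verdict.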

The third is active scanning. Probe the address. A genuine residential machine does not usually answer on the ports a proxy daemon listens on, and a host that does answer with a proxy handshake has confessed. This is noisy and increasingly evaded, because modern proxy SDKs do not open an obvious listening port; they hold an outbound connection to the operator’s control plane and receive tasking over it, which we will come back to.

The takedowns are where the abstract list-building becomes concrete. In January 2026, Google published its disruption of a residential proxy network it tracked as IPIDEA, which operated behind more than a dozen consumer brands, among them 360 Proxy, 922 Proxy, ABC Proxy, PIA S5 Proxy, and a row of VPN-branded apps. The scale is the point: over 600 Android applications were found carrying the operator’s SDKs, around 7,400 second-tier command-and-control servers fronted the proxy traffic, and in a single seven-day window that January more than 550 distinct threat groups were observed routing through IPIDEA exit nodes. The SDKs had names like Packet, Castar, Hex, and Earn, marketed to app developers as a monetisation library that pays per install. Embed it, and your users’ devices quietly become exit nodes.

Figure: how a modern proxy SDK turns a home device into an exit node.
User's device: app with the embedded SDK.
Tier-1 bootstrap C2: the device sends device diagnostics (os=android&...) and receives a node list.
Tier-2 proxy nodes (~7,400 of them): the device holds an outbound persistent connection, with no open port.
Scraper customer: rents the exit; the target site sees the home IP.
The device never listens on a public port, so a port scan finds nothing. The exit is only discoverable by renting it or by seeing its fan-out across many sites.
*The two-tier control plane observed in the IPIDEA takedown. Because the device only makes outbound connections, classic port-scan detection misses it; the exit is found by behaviour or by becoming a customer.*

The smart-TV variant is the same model with better hardware. A June 2026 write-up traced how a major commercial residential-proxy operator sources supply through an SDK embedded in connected-TV and mobile apps, with the device holding a persistent outbound WebSocket to the operator’s control endpoint and executing fetch instructions pushed down that channel, even while the screen is in use. Connected TVs make better exit nodes than phones for the obvious reasons: always powered, always on stable wifi, unmetered bandwidth, and nobody watching what the box does at three in the morning. The detection artefacts that piece surfaced are network-level, the SDK’s control hostnames and TLS SNI patterns, which is a reminder that even the stealthiest exit node has to talk to its master somehow, and that conversation is observable from the network if you know what to look for.

Mobile, CGNAT, and the collateral-damage problem

Mobile is the class that breaks the blunt instruments. A mobile carrier does not give every subscriber a public IPv4 address; it cannot, there are not enough addresses. Instead it runs Carrier-Grade NAT, mapping hundreds or thousands of real subscribers onto each public IP. The outside world sees one address doing the work of a small town. Block it for one abuser and you block everyone behind it.

This is what makes mobile proxies the premium tier of the proxy market and the bane of IP-layer detection. The shared address is genuinely full of real humans, so its reputation is genuinely good, and a proxy operator who can route a scraper through a 4G or 5G exit gets to hide inside that crowd. The detector cannot lean on the IP, because the IP is shared with people it must not block. Cloudflare’s own analysis of CGNAT, published in October 2025, put a number on the squeeze: CGNAT addresses had essentially the same median bot score as non-CGNAT addresses, 4.8 percent versus 4.7 percent, yet they got rate-limited three times as often, purely because so many legitimate users pile onto each one that the aggregate trips volume thresholds.

Detecting CGNAT, so as to be gentler with it rather than to block it, is its own measurement problem. The signals are indirect. Distributed traceroutes from thousands of vantage points can spot the RFC 6598 shared address space (the 100.64.0.0/10 block) sitting between customer equipment and the public IP, the tell-tale of multi-layer NAT. Reverse-DNS and WHOIS records sometimes leak the giveaway substrings, cgn, cgnat, lsn for large-scale NAT, in the hostnames carriers assign. And the behavioural signature is statistical: a CGNAT address presents a much wider, less correlated spread of client behaviours and destinations than a single household, because it really is many households. The same Cloudflare work noted the geography of this, with African ISPs both relying on CGNAT more heavily and packing more clients behind each address, which means a naive global rate limit punishes some regions far harder than others.
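Two of those indirect signals are easy to sketch with the standard library: checking traceroute hops against the RFC 6598 block, and scanning reverse-DNS names for the giveaway substrings. The function names are made up for the example, and the letter-boundary rule in the regex is an assumption added to cut down accidental matches inside longer words.

```python
import ipaddress
import re

# RFC 6598 shared address space reserved for carrier-grade NAT.
SHARED_SPACE = ipaddress.ip_network("100.64.0.0/10")

# Match cgnat / cgn / lsn only when not embedded in a longer run of letters.
CGN_HINT = re.compile(r"(?<![a-z])(cgnat|cgn|lsn)(?![a-z])", re.IGNORECASE)


def path_crosses_shared_space(hops):
    """True if any traceroute hop address falls inside 100.64.0.0/10."""
    return any(ipaddress.ip_address(h) in SHARED_SPACE for h in hops)


def hostname_hints_cgnat(hostname):
    """True if a reverse-DNS name carries a CGNAT-style label."""
    return CGN_HINT.search(hostname) is not None
```

Either signal alone is weak evidence. In the spirit of the paragraph above, a hit would be used to loosen volume thresholds for the address, not to block it.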

The architectural irony is that the long-promised fix, IPv6, partly dissolves the problem and partly relocates it. On IPv6 every device can have its own address again, so the shared-IP collateral-damage problem eases, but a per-device address is also a far more stable identifier, and detectors have learned to score IPv6 at the /64 prefix rather than the single address precisely because that is the allocation unit a single subscriber typically controls. The unit of reputation moves, but the idea that reputation attaches to an allocation, not just an address, does not.
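The shift in the unit of reputation amounts to a different aggregation key. A minimal sketch, assuming per-address keys for IPv4 and /64 keys for IPv6 (the function name and the choice of /32 for IPv4 are illustrative; the earlier sections show IPv4 reputation also being kept at the /24):

```python
import ipaddress


def reputation_key(ip: str) -> str:
    """Map an address to the allocation unit that reputation is tracked against."""
    addr = ipaddress.ip_address(ip)
    # IPv6: collapse to the /64 a single subscriber typically controls.
    # IPv4: keep the individual address.
    prefix_len = 64 if addr.version == 6 else 32
    return str(ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False))
```

For example, `reputation_key("2001:db8:1:2:aaaa:bbbb:cccc:dddd")` gives `2001:db8:1:2::/64`, so a device cycling through addresses inside its own /64 keeps landing on the same key.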

Why even a clean residential IP still leaks

Suppose the proxy operator has solved the IP problem completely. The exit is a real home connection, on a residential ISP ASN, in a clean /24, never before seen in any proxy feed, with a pristine fraud score. The IP layer has nothing to say. Why does the request still get caught?

Because the IP is one signal among many, and the others do not agree with it. The most reliable network-layer leak is the geography of latency. A residential exit node in São Paulo relaying for a scraper whose actual machine sits in a Frankfurt datacenter inherits the round trips of both hops. The TLS handshake takes longer than a real Frankfurt user would experience and longer than a real São Paulo user would experience, because the packets are doing the journey twice. Cloudflare’s residential-proxy machine-learning work, published in mid-2024, built directly on this, training a model on latency measurements from multi-hop proxy traversal alongside behavioural traffic-spike features, and reporting figures that show how much the picture has shifted away from pure IP reputation. That system classifies on the order of 17 million unique IPs per hour as showing residential-proxy activity, across more than 45,000 ASNs and 237 countries, and reaches around 95 percent accuracy on distributed residential-proxy attacks against the endpoints that get hit hardest. The honest caveat the same team raised is the false-positive cost: by their measure roughly four out of five requests from residential-proxy IPs are ordinary direct human connections, because the device is somebody’s actual phone or TV when it is not relaying, so blocking on the IP alone would harm real users. That is the whole reason the detection moved to behaviour.
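One common way to make the double-hop idea concrete, offered here as an assumption rather than a description of Cloudflare's model, is to compare two round-trip times measured on the same connection. With a CONNECT- or SOCKS-style proxy the TCP connection typically terminates at the exit node while the TLS session runs end to end, so the TCP handshake reflects only the server-to-exit leg and the TLS handshake reflects the full path. The function names and the 40 ms slack are hypothetical.

```python
def extra_hop_ms(tcp_rtt_ms: float, tls_rtt_ms: float) -> float:
    """Latency the TLS round trip shows beyond the TCP round trip."""
    return max(0.0, tls_rtt_ms - tcp_rtt_ms)


def looks_multi_hop(tcp_rtt_ms: float, tls_rtt_ms: float, slack_ms: float = 40.0) -> bool:
    """Flag connections whose TLS RTT exceeds the TCP RTT by more than the slack.

    The slack absorbs client processing time and ordinary jitter; it is an
    illustrative value, not a measured threshold.
    """
    return extra_hop_ms(tcp_rtt_ms, tls_rtt_ms) > slack_ms
```

A direct client should show the two figures close together. A gap of a couple of hundred milliseconds suggests the packets are travelling a second leg behind the exit, which is a feature for a model to weigh and not a verdict by itself.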

The second leak is the consistency check between the IP and everything the client claims about itself. A residential exit in one country with a browser reporting a timezone, a locale, and an Accept-Language from another country is contradicting itself, and that mismatch is cheap to compute and hard for a rotating-proxy setup to keep coherent, because the proxy gives you a new country every few minutes but the browser profile stays put. Rapid city or ISP hopping inside a single session is the same tell from the other direction. These checks do not prove the IP is a proxy. They prove the IP and the rest of the fingerprint were not assembled by the same honest origin, which is enough.
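The consistency check is cheap enough to show in full. This sketch assumes a hand-written table of what each exit country is expected to look like; the `EXPECTED` entries, the function names, and the two-country coverage are placeholders, and a real table would allow for multiple languages, travellers, and expatriates.

```python
# Hypothetical expectations per exit country; deliberately tiny.
EXPECTED = {
    "BR": {"timezones": {"America/Sao_Paulo"}, "languages": {"pt"}},
    "DE": {"timezones": {"Europe/Berlin"}, "languages": {"de"}},
}


def primary_language(accept_language: str) -> str:
    """Extract the base language of the first Accept-Language entry."""
    first = accept_language.split(",")[0].split(";")[0].strip()
    return first.split("-")[0].lower()


def mismatches(ip_country: str, timezone: str, accept_language: str) -> list:
    """List the client claims that disagree with the exit IP's country."""
    expected = EXPECTED.get(ip_country)
    if expected is None:
        return []  # no expectation recorded, so nothing to contradict
    found = []
    if timezone not in expected["timezones"]:
        found.append("timezone")
    if primary_language(accept_language) not in expected["languages"]:
        found.append("language")
    return found
```

A Brazilian exit paired with `Europe/Berlin` and `de-DE` returns both mismatches. As the paragraph says, that does not prove a proxy, only that the IP and the fingerprint were not assembled by the same origin.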

The third leak is the network fingerprint itself, below the IP. The TLS ClientHello and the HTTP/2 settings frames carry a signature of the client library, and a residential IP fronting a Go or Python HTTP client produces a JA3 or JA4 that no real browser on a real home connection would ever emit. The clean IP buys nothing if the handshake announces an automation toolkit. We go deep on that in TLS fingerprinting from ClientHello bytes to JA4; the point here is that the IP layer and the transport layer are scored together, and a perfect score on one cannot rescue a failing score on the other. The same logic extends up into the JavaScript runtime, where a headless automation stack leaves traces no proxy can launder, covered in JavaScript runtime fingerprinting.

There is a fourth leak that is easy to overlook: the proxy pool’s own consistency. A scraper that rotates through a residential pool hits a target from a sequence of addresses that, individually, look fine, but collectively trace a path no single human could walk: ten cities in ten minutes, three ISPs, two countries, all carrying the same session cookie and the same device fingerprint. The detector is not scoring any one IP badly. It is scoring the trajectory, and the trajectory is only possible with a proxy. This is the behavioural counterpart to prefix reputation, and it is why the vendors with the largest traffic vantage point have the structural advantage: only they see enough of a given actor’s requests across enough sites to reconstruct the path. The economics of who can afford that vantage point, and how the detection gets priced and sold, is its own subject, taken up in the economics of anti-bot vendors.
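Scoring the trajectory rather than the individual address can be sketched as counting distinct locations behind one session identifier. The tuple layout, function names, and limits below are assumptions for illustration; real systems would also weigh distance and elapsed time between hops.

```python
def trajectory_summary(hops, window_s=600):
    """Count distinct cities, ISPs and countries seen for one session.

    hops: list of (timestamp, city, isp, country) tuples sharing a session
    cookie or device fingerprint. Only the most recent window is counted.
    """
    if not hops:
        return {"cities": 0, "isps": 0, "countries": 0}
    end = max(h[0] for h in hops)
    recent = [h for h in hops if end - h[0] <= window_s]
    return {
        "cities": len({h[1] for h in recent}),
        "isps": len({h[2] for h in recent}),
        "countries": len({h[3] for h in recent}),
    }


def impossible_for_one_human(hops, max_cities=2, max_countries=1, window_s=600):
    """True when the session's path exceeds what one person could plausibly walk."""
    s = trajectory_summary(hops, window_s)
    return s["cities"] > max_cities or s["countries"] > max_countries
```

No single hop in such a session needs a bad score for the function to fire, which is the point: the evidence is the path, and only an observer who sees enough of the session's requests can reconstruct it.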

What the network layer can and cannot settle

The network layer settles less than it used to and more than proxy buyers wish. It can convict a datacenter IP outright, because hosting ASNs are enumerable and a server pretending to be a browser is a contradiction the IP alone exposes. It can convict a known proxy exit, because the lists, built by infiltrating providers, by passive fan-out observation across a large traffic base, and occasionally by the kind of takedown that maps an entire two-tier control plane in one report, are large and reasonably fresh. What it cannot do is convict a genuinely clean residential or mobile IP on the strength of the address by itself, because that address belongs to a real person whose traffic must not be collateral, and the better the proxy supply chain gets at sourcing real devices, the truer that becomes.

So the IP-layer verdict has quietly changed shape. It used to be a gate: bad IP, blocked. It is now a prior, a starting score that the layers above either confirm or contradict. The latency that betrays a double hop, the timezone that does not match the exit country, the JA4 that announces a scripting library, the impossible trajectory across a rotating pool, these are what actually carry the decision when the IP itself comes up clean. The proxy economy has spent a decade and a great deal of money making the IP look innocent, and it has largely succeeded. The detectors responded by deciding that the IP was never the whole question. The most telling artefact of that shift is the Cloudflare number: four in five requests from a residential-proxy IP are a real person, so the IP is not evidence of guilt, only an invitation to look closer. Everything expensive in modern proxy detection happens after that look.

