The history of DNS: from HOSTS.TXT to DNS-over-HTTPS

Copyright: MIT

In the late 1970s, every host on the ARPANET knew the name and address of every other host because it kept a copy of the same file. The file was called HOSTS.TXT. It lived on a machine at Stanford Research Institute, you fetched it over FTP, and it mapped names like UCLA-CCN to numeric addresses for the few hundred computers then connected. When you wanted a host added, you emailed the Network Information Center and waited for the next edition. The whole internet’s address book was one editable text file, downloaded by hand, maintained by a handful of people.

That arrangement could not survive the network it described. By the early 1980s the host count was climbing, the file was getting slow to distribute, and the single point of edit had become a single point of contention. The fix was not a bigger file. It was a distributed database with a hierarchical namespace, designed in 1983 by one person, and almost every name you type into a browser today is still resolved by the protocol he wrote down. This post follows that protocol from the text-file era through its original RFCs, the BIND implementation that carried it onto real networks, the long and unfinished effort to add cryptographic authenticity with DNSSEC, the 2008 cache-poisoning crisis that forced an emergency patch across the whole internet, and the recent move to wrap DNS inside TLS and HTTPS so the queries themselves stop leaking.

The HOSTS.TXT era

The host table predates the name “internet.” A single file held a name-to-address mapping for every machine on the ARPANET, and it was maintained centrally by SRI’s Network Information Center, distributed from one host called SRI-NIC. Administrators mailed their additions and changes to the NIC, then periodically FTP’d to SRI-NIC and pulled down the current copy. The format was eventually written down in RFC 952 in October 1985, which itself revised the earlier RFC 810. A line gave a network or host address, an official name, and a list of aliases. Software on each machine parsed that file to turn names into numbers.

The format worked because the community was small and the rate of change was low. Neither condition held for long. Three problems compounded. The first was traffic and load: as the host count grew, so did the cost of every machine pulling the full table, and the cost of one host serving all of them. The second was consistency: between the moment SRI updated the master and the moment a given host re-fetched it, the two copies disagreed, and there was no mechanism to reconcile them beyond “wait and re-download.” The third, and the one that actually broke the model, was name collisions. With a flat namespace and a central registrar, every name on the entire network had to be unique and had to be approved by one office. That does not scale to thousands of independently run organizations that each want to name their own machines.

Figure: SRI-NIC holds the one HOSTS.TXT file with one editor, and every host re-downloads the whole thing over FTP. The pre-DNS model: a single authoritative file copied out to every host. Flat namespace, central registrar, no way to reconcile stale copies.

People saw the wall coming. The question was what to replace the file with. The answer needed to let organizations name their own machines without asking anyone’s permission, let a name be resolved without downloading a global table, and stay consistent enough to be useful while updates propagated. That answer was the Domain Name System.

Mockapetris and the 1983 design

In 1983, Paul Mockapetris, working at USC’s Information Sciences Institute, wrote the two documents that defined DNS: RFC 882, Domain Names: Concepts and Facilities, and RFC 883, Domain Names: Implementation and Specification. Both appeared in the RFC series in November 1983. RFC 882 laid out the idea; RFC 883 specified the wire format and the resolution mechanics. Four years of running code later, the pair was obsoleted by RFC 1034 and RFC 1035 in November 1987, and that second pair is the specification operators still build against today.

The central idea was to replace one flat namespace with a tree. A domain name reads right to left as a path from the root down through a hierarchy of labels, and authority over any subtree can be handed to whoever administers that subtree. The root delegates com to one operator, who delegates example.com to the organization that runs it, who in turn names www.example.com and mail.example.com however they like without consulting anyone above them. This is the part that solved the HOSTS.TXT collision problem. Uniqueness only has to be enforced within a single zone by whoever runs that zone, not globally by one office. The labels themselves are bounded: a single label is at most 63 octets, and a full name in its text representation tops out at 253 characters.
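As a rough illustration of the two size bounds and of reading a name right to left, here is a minimal Python sketch. The function names `check_name` and `delegation_path` are made up for this example, and it assumes plain ASCII names with no escaping.

```python
def check_name(name: str) -> list[str]:
    """Split a domain name into labels, enforcing the size bounds above."""
    text = name[:-1] if name.endswith(".") else name  # a trailing dot marks the root
    if len(text) > 253:
        raise ValueError("name longer than 253 characters")
    labels = text.split(".")
    for label in labels:
        octets = label.encode("ascii")
        if not 1 <= len(octets) <= 63:
            raise ValueError(f"label {label!r} must be 1 to 63 octets")
    return labels


def delegation_path(name: str) -> list[str]:
    """Names from the root down to the full name, read right to left."""
    labels = check_name(name)
    path = ["."]
    for i in range(len(labels) - 1, -1, -1):
        path.append(".".join(labels[i:]) + ".")
    return path


print(delegation_path("www.example.com"))
```

Each step in the returned path is a point where authority may be delegated, which is why uniqueness only needs checking among the children of one node.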

The other half of the design is how a name actually gets resolved without anyone holding the whole tree. Resolution is a walk down the hierarchy. A resolver that knows nothing but the addresses of the root servers asks the root where to find com, gets a referral to the com servers, asks them where to find example.com, gets another referral, and finally asks the example.com servers for the address of www. Each server is authoritative only for its own zone and answers questions about it or points you one step closer. Caching makes this affordable: once a resolver has learned the address of the com servers, it does not ask the root again until the answer’s time-to-live expires.

Figure: a recursive resolver asks the root, com., and example.com. in turn (1. who serves com? 2. who serves example.com? 3. A record for www?), and each server answers for its own zone or refers the resolver one level down. Iterative resolution: the recursive resolver walks the tree, caching each referral. No machine holds the whole namespace.
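The walk can be sketched with a toy in-memory model. Everything in it is hypothetical: the server names, the `SERVERS` table, and the 192.0.2.10 address (from a documentation range) stand in for real name servers, and TTL expiry and answer caching are left out to keep it short.

```python
# Toy zone data: each "server" knows only its own zone.
SERVERS = {
    "root": {"referrals": {"com.": "com-server"}},
    "com-server": {"referrals": {"example.com.": "example-server"}},
    "example-server": {"answers": {"www.example.com.": "192.0.2.10"}},
}


def in_zone(qname: str, zone: str) -> bool:
    return qname == zone or qname.endswith("." + zone)


def ask(server: str, qname: str):
    """Return ('answer', address) or ('referral', (zone, next_server))."""
    data = SERVERS[server]
    if qname in data.get("answers", {}):
        return "answer", data["answers"][qname]
    for zone, next_server in data.get("referrals", {}).items():
        if in_zone(qname, zone):
            return "referral", (zone, next_server)
    raise LookupError(f"{server} cannot help with {qname}")


def resolve(qname: str, cache: dict, log: list) -> str:
    # Start at the deepest zone with a cached referral, else at the root.
    server, best = "root", ""
    for zone, cached_server in cache.items():
        if in_zone(qname, zone) and len(zone) > len(best):
            best, server = zone, cached_server
    while True:
        log.append(server)              # record which servers were asked
        kind, value = ask(server, qname)
        if kind == "answer":
            return value
        zone, server = value            # follow the referral one level down
        cache[zone] = server            # remember it for later queries


cache, first, second = {}, [], []
print(resolve("www.example.com.", cache, first), first)
print(resolve("www.example.com.", cache, second), second)
```

The second lookup skips the root and com servers because the referrals learned during the first are still in the cache.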

RFC 1035 also fixed the wire format that DNS still uses. A message has a 12-octet header followed by four variable-length sections: the question, the answer, the authority records, and the additional records. The header carries a 16-bit identifier the client picks and the server echoes back so the client can match a reply to its query, a set of flag bits, and four 16-bit counts for the four sections. Records come in types: A for an IPv4 address, NS for a name server delegation, MX for mail routing, CNAME for an alias, and many more added over the years. Queries normally travel over UDP on port 53 for speed, with the original specification capping a UDP message at 512 bytes. TCP on port 53 exists too, for responses too large to fit and for zone transfers between servers. That 512-byte cap and the 16-bit ID field are both small numbers that come back to bite the protocol later.
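To make the layout concrete, the sketch below packs a one-question query with Python's `struct` module and reads the header back. The helper names `build_query` and `parse_header` are illustrative, and the code handles only uncompressed ASCII names.

```python
import struct


def build_query(qid: int, name: str, qtype: int = 1) -> bytes:
    """Encode a one-question DNS query (qtype 1 = A, class 1 = IN)."""
    flags = 0x0100  # only the RD (recursion desired) bit is set
    # Header: ID, flags, then the four section counts, six 16-bit fields in all.
    header = struct.pack("!HHHHHH", qid, flags, 1, 0, 0, 0)
    qname = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        qname += bytes([len(raw)]) + raw    # each label is length-prefixed
    qname += b"\x00"                        # zero-length label marks the root
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_header(message: bytes) -> dict:
    """Decode the fixed 12-octet header at the front of any DNS message."""
    qid, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", message[:12])
    return {"id": qid, "flags": flags, "qdcount": qd,
            "ancount": an, "nscount": ns, "arcount": ar}


query = build_query(0x1234, "www.example.com")
print(len(query), parse_header(query))
```

A query for a short name like this one is a few dozen bytes, well under the 512-byte UDP cap; it is large signed responses that run into it.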

The shape of the namespace got filled in quickly. The first top-level domains, including com, edu, gov, mil, net, and org, were added to the DNS implementation at the start of 1985. On 15 March 1985, a Massachusetts computer company called Symbolics, a maker of Lisp machines spun out of the MIT AI Lab, registered symbolics.com and became the first .com on the internet. The first hundred .com names took until late 1987 to fill.

BIND and the implementation that won

A specification is not a running network. DNS needed software, and the software that carried it onto real Unix machines came out of the University of California, Berkeley. After Mockapetris built the first name server, called Jeeves, on the ISI side, graduate students at Berkeley wrote the Berkeley Internet Name Domain package under DARPA funding. The initial team included Douglas Terry, Mark Painter, David Riggle, and Songnian Zhou. Kevin Dunlap, working from 1985 to 1987, substantially revised the implementation. The name shortened to BIND, and the program shipped with Berkeley Unix, which meant it shipped nearly everywhere.

Maintenance passed hands as the BSD project wound down. The Berkeley Computer Systems Research Group carried BIND through version 4.8.3. Digital Equipment Corporation then released 4.9 and 4.9.1 with Paul Vixie as the primary caretaker from 1994. In May 1997, Vixie and Bob Halley released BIND 8, and in September 2000 BIND 9 arrived as a ground-up architectural rewrite rather than another revision of the old code. BIND 9 was built to handle DNSSEC, IPv6, and multithreading, things the 1980s codebase was never designed for. The Internet Systems Consortium has maintained BIND from version 4.9.3 onward, and the old BIND 4 and BIND 8 lines are long deprecated.

BIND mattered out of proportion to its code because for most of the 1990s it was the reference implementation in the literal sense. If your DNS server interoperated with BIND, it interoperated with the internet, because most of the internet was running BIND. That gave it enormous influence over how ambiguous corners of the RFCs got resolved in practice, and it also made its bugs into everyone’s bugs. Several of the cache-poisoning weaknesses that mattered later were not abstract protocol flaws so much as widely deployed implementation choices, and BIND’s behavior set the baseline for what “widely deployed” meant.

If you want the end-to-end mechanics of how a modern recursive resolver, stub resolver, and authoritative server divide the work, we cover that in DNS resolution end to end. The rest of this post follows the security story, because that is where the protocol’s design decisions came due.

The trust problem nobody designed for

DNS in 1987 had no notion of authenticity. A resolver sends a query and accepts the first well-formed answer that arrives with the right transaction ID, comes from the expected address and port, and answers the right question. Nothing in the protocol proves the answer came from the server that is actually authoritative for the name. The protocol assumed a cooperative network. The internet stopped being one.

The most direct way to exploit that gap is cache poisoning. A recursive resolver caches answers so it does not have to walk the tree every time. If an attacker can get a forged answer into that cache, every user of the resolver gets the forged address until the entry expires, and the time-to-live in the forged record is whatever the attacker chose. The early forms of the attack relied on resolvers being sloppy about what they accepted in the additional section of a response. Tighten that up, and the attacker is left guessing the 16-bit transaction ID, which on a network where they can sniff the query is no guess at all, and even off-path is only 65,536 possibilities.

Two countermeasures were proposed early and ignored for years. Paul Vixie suggested randomizing the UDP source port back in 1995, which would multiply the attacker’s search space by the entropy in the port number. In 2002 Dan Bernstein warned plainly that relying on the transaction ID alone was not enough. Both were right and both were largely shrugged off, because cache poisoning in the wild was rare enough that randomizing source ports felt like effort spent on a theoretical problem. That complacency held until 2008.

Kaminsky, 2008

In 2008, security researcher Dan Kaminsky found a way to make cache poisoning fast and reliable against resolvers that everyone had assumed were good enough. The vulnerability was tracked as CVE-2008-1447 and published by CERT/CC as VU#800113 on 8 July 2008, in a coordinated multi-vendor release that was unusual for its scale.

The insight was about the attacker’s retry rate. Classic cache poisoning had a throttle built in: if you guessed the transaction ID wrong, the real answer arrived, got cached with its normal TTL, and you had to wait for that TTL to expire before you could try again to poison that same name. Hours, maybe a day. Kaminsky’s method removed the wait. Instead of attacking www.example.com directly, the attacker asks the resolver to look up a name that is guaranteed not to be cached, like a random subdomain aaaa.example.com, then floods spoofed answers. The spoofed answers do not even have to contain the address of the random subdomain. They contain a referral in the authority and additional sections pointing the resolver at an attacker-controlled name server for the entire example.com zone. Guess wrong, and the random name was never cached, so the attacker just picks a new random subdomain and tries again immediately. There is no throttle. The attacker can grind through the 16-bit ID space at line rate, and on a resolver using a fixed source port the whole thing falls in a matter of seconds.

Figure: an off-path attacker makes a resolver with a fixed source port query aaaa.example.com, then floods it with forged referrals carrying guessed query IDs while the real example.com name server replies. On a wrong guess the random name was never cached, so the attacker picks a new one and retries instantly. The Kaminsky technique removed the cooldown. By targeting uncached random subdomains and forging zone-level referrals, the attacker races the real answer as fast as packets allow.

The fix that shipped industry-wide was the defense Vixie had recommended thirteen years earlier: per-query UDP source port randomization. The transaction ID gives 16 bits. Randomizing the source port over the available range adds roughly another 16 bits of entropy, so the attacker now has to guess a value in a space of billions rather than tens of thousands, and the brute force that took seconds now takes far too long to win the race against the legitimate reply. CERT’s advisory framed it exactly that way. Port randomization does not make poisoning impossible, it makes it impractical. That is the honest description of nearly every DNS security fix: not a proof, a cost increase.
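The arithmetic behind that cost increase can be sketched in a few lines. The model is deliberately simple and its parameters are assumptions: each race lets the attacker land `FORGED_PER_RACE` distinct guesses before the real reply arrives, the resolver picks fresh random values for every query, and the full 16 bits of port range are usable (real resolvers use somewhat fewer).

```python
ID_BITS = 16     # transaction ID width from the DNS header
PORT_BITS = 16   # upper bound: assumes the whole port range is usable


def search_space(bits: int) -> int:
    return 2 ** bits


def expected_races(bits: int, forged_per_race: int) -> float:
    """Mean number of races until one forged reply matches.

    Each race succeeds with probability forged_per_race / 2**bits,
    so the mean of the geometric distribution is the inverse of that.
    """
    return search_space(bits) / forged_per_race


FORGED_PER_RACE = 100  # assumed figure, for illustration only

for label, bits in [("fixed port", ID_BITS), ("random port", ID_BITS + PORT_BITS)]:
    print(f"{label}: {search_space(bits):,} combinations, "
          f"about {expected_races(bits, FORGED_PER_RACE):,.0f} races expected")
```

Whatever rate is assumed, the ratio between the two cases stays at 2^16: the same attack needs about 65,000 times as many attempts.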

The Kaminsky episode did two lasting things. It got source port randomization deployed across essentially the entire resolver population in a matter of weeks, closing a hole that had been open and known for over a decade. And it became the argument, finally persuasive, for actually deploying the cryptographic fix that had been in development since the 1990s.

DNSSEC: authenticity, slowly

DNSSEC is the attempt to give DNS answers cryptographic provenance, so a resolver can verify that a record really came from the zone that is authoritative for it and was not modified in transit. The mechanism is digital signatures over the records. Each zone signs its record sets with a private key, publishes the corresponding public key in the DNS itself, and a parent zone vouches for a child zone’s key by signing a hash of it. That chain of signatures runs from the root down to the zone you are querying, and a validating resolver follows it the same way ordinary resolution follows delegations.

The standards took a long and frustrating path. Early work in the late 1990s produced RFC 2535, which on paper looked ready to deploy. Trial deployments through 2000 and after exposed operational and scaling problems serious enough to send the design back for revision. Key management in particular did not work at internet scale the way the first draft assumed. The modern DNSSEC that operators actually run was respecified in RFC 4033, RFC 4034, and RFC 4035, published in March 2005, which obsoleted RFC 2535. New record types carried the new machinery: RRSIG holds a signature over a record set, DNSKEY publishes a zone’s public key, DS is the delegation signer hash a parent uses to vouch for a child’s key, and NSEC (later NSEC3) provides authenticated denial of existence so an attacker cannot forge a “no such name” answer either.

Figure: the root's KSK is the trust anchor; the com. DNSKEY is vouched for by a DS record in the root; example.com. carries an RRSIG over its A record; at each level the parent's DS vouches for the child's DNSKEY. The DNSSEC chain of trust. A validator starts from the root's key-signing key, the one trust anchor it ships with, and follows DS-to-DNSKEY links down to the signed record.
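One link in that chain, the parent's DS record vouching for the child's DNSKEY, comes down to a hash comparison. The sketch below follows the RFC 4034 construction with SHA-256 as the digest. The helper names are illustrative, the key bytes are a placeholder rather than a real public key, and verifying the RRSIG signatures themselves is out of scope here.

```python
import hashlib
import struct


def name_to_wire(name: str) -> bytes:
    """Canonical wire form: lowercase, length-prefixed labels, root terminator."""
    stripped = name.rstrip(".").lower()
    wire = b""
    for label in (stripped.split(".") if stripped else []):
        raw = label.encode("ascii")
        wire += bytes([len(raw)]) + raw
    return wire + b"\x00"


def dnskey_rdata(flags: int, algorithm: int, public_key: bytes) -> bytes:
    # DNSKEY RDATA: 16-bit flags, protocol octet (always 3), algorithm octet, key.
    return struct.pack("!HBB", flags, 3, algorithm) + public_key


def ds_digest(owner: str, rdata: bytes) -> bytes:
    # The DS digest covers the owner name followed by the DNSKEY RDATA.
    return hashlib.sha256(name_to_wire(owner) + rdata).digest()


def parent_vouches_for(ds_from_parent: bytes, owner: str, child_rdata: bytes) -> bool:
    return ds_from_parent == ds_digest(owner, child_rdata)


PLACEHOLDER_KEY = bytes(range(64))               # not a real public key
child_key = dnskey_rdata(257, 13, PLACEHOLDER_KEY)  # 257 = key-signing key flags
published_ds = ds_digest("example.com.", child_key)

print(parent_vouches_for(published_ds, "example.com.", child_key))
print(parent_vouches_for(published_ds, "example.com.", child_key + b"\x00"))
```

A validator repeats this check at every delegation, and separately verifies that each DS and DNSKEY record set carries a valid RRSIG from the zone that published it.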

The chain has to start somewhere, and that somewhere is the root. The root zone was signed in 2010. The first root key-signing key was generated at an ICANN key ceremony in Culpeper, Virginia on 16 June 2010, entered production at a second ceremony in El Segundo, California on 12 July, and the resulting trust anchor was first published on 15 July 2010. That published key, KSK-2010 with key ID 19036, was the single anchor validating resolvers shipped with until the root key was rolled to KSK-2017 (key ID 20326) in October 2018. Everything else chains back to the root’s trust anchor. Before the root was signed, DNSSEC was a set of disconnected signed islands with no common starting point. After 2010 there was finally a path from one trusted key to any signed zone.

DNSSEC’s deployment has been partial and uneven, and it is worth being plain about why. Signing a zone is operationally heavier than not signing it. Keys must be rolled, signatures must be regenerated before they expire, and a mistake takes the zone offline rather than merely leaving it unsigned. The signatures also inflate response sizes well past the old 512-byte UDP limit, which is one reason the Extension Mechanisms for DNS (EDNS0, originally RFC 2671 in 1999, now RFC 6891) exist to negotiate larger UDP payloads. And DNSSEC authenticates the data; it does nothing to hide the query. A passive observer on the path still sees every name you look up in cleartext. That last gap is what the encrypted transports set out to close.

The query was always in cleartext

For thirty years, the thing nobody could fix by signing records was that the query itself travels in the open. When your stub resolver asks its recursive resolver for www.example.com, that name goes out over UDP port 53 unencrypted. Anyone on the path, your network operator, a transit provider, a hostile actor on the same Wi-Fi, sees the name. They can log it, sell it, block it, or rewrite the answer. DNSSEC, even where deployed, does not change this. It lets you detect a forged answer; it does not stop anyone from reading the question.

This was tolerated for a long time because DNS was treated as plumbing rather than as user data. The shift in attitude tracked the broader move to encrypt everything on the web after 2013, the same wave that pushed HTTPS from a minority of traffic to the default. If the page itself is encrypted, leaking the hostname through the DNS lookup that preceded it starts to look like a hole worth closing. Two designs emerged to close it, differing mainly in how visible the encrypted channel itself is to the network.

DNS over TLS, specified in RFC 7858 in 2016, takes the obvious route. It wraps ordinary DNS messages in a TLS session running over TCP, on a dedicated port (853). The wire format inside is unchanged; it is just encrypted and on its own port. Because it has a distinctive port, a network operator can see that you are doing encrypted DNS and to which resolver, even though they cannot read the queries, and they can block the port if they want to force you back to cleartext.
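Because DoT runs over a byte stream rather than datagrams, each message inside the TLS session carries the same two-octet length prefix that DNS over plain TCP uses. A minimal sketch of that framing, with illustrative helper names and no network I/O:

```python
import struct


def frame(message: bytes) -> bytes:
    """Prefix a DNS message with its length as a 16-bit big-endian integer."""
    if len(message) > 0xFFFF:
        raise ValueError("DNS message too large for a 16-bit length prefix")
    return struct.pack("!H", len(message)) + message


def unframe(stream: bytes) -> list[bytes]:
    """Split a received byte stream back into complete DNS messages."""
    messages, offset = [], 0
    while offset + 2 <= len(stream):
        (length,) = struct.unpack_from("!H", stream, offset)
        end = offset + 2 + length
        if end > len(stream):
            break                       # partial message: wait for more bytes
        messages.append(stream[offset + 2:end])
        offset = end
    return messages


stream = frame(b"first query") + frame(b"second query")
print(unframe(stream))
```

A real client would write these framed bytes to a TLS-wrapped TCP socket connected to port 853; the byte strings above are stand-ins for actual DNS messages.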

DNS over HTTPS, RFC 8484, published in October 2018, makes a different bet. It sends DNS queries as HTTPS requests to an HTTPS endpoint on port 443. The wire-format DNS message travels in the body of a POST, or base64url-encoded in the dns parameter of a GET, and the answer comes back in the response body; message bodies carry the media type application/dns-message. The point of using 443 is camouflage. DoH traffic is just more HTTPS, mixed in with all the other HTTPS on the same port to the same kinds of servers, and far harder to single out and block without breaking the web. The three transports carry the same DNS messages; what differs is how visible the channel is. Cleartext Do53 on port 53 is both visible and readable. DoT on port 853 is encrypted but sits on a dedicated port an operator can spot and block. DoH on 443 blends into ordinary web traffic. That last property is exactly what made DoH controversial.
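The two request shapes RFC 8484 defines can be built without touching the network. In the sketch below the endpoint `https://dns.example/dns-query` is a hypothetical placeholder and the helper names are illustrative; the query bytes are the A lookup for www.example.com that the RFC uses as its own example, with the ID set to 0.

```python
import base64
from urllib.parse import urlencode

ENDPOINT = "https://dns.example/dns-query"   # hypothetical DoH endpoint

# Wire-format query: 12-octet header, then www.example.com, type A, class IN.
QUERY = bytes.fromhex(
    "0000 0100 0001 0000 0000 0000"          # ID 0, RD set, one question
    "03 777777 07 6578616d706c65 03 636f6d 00"
    "0001 0001"
)


def doh_get_url(endpoint: str, dns_message: bytes) -> str:
    """GET form: the message goes in the 'dns' parameter, base64url, unpadded."""
    encoded = base64.urlsafe_b64encode(dns_message).rstrip(b"=").decode("ascii")
    return f"{endpoint}?{urlencode({'dns': encoded})}"


def doh_post_parts(dns_message: bytes) -> tuple[dict, bytes]:
    """POST form: the raw message is the request body."""
    headers = {
        "Content-Type": "application/dns-message",
        "Accept": "application/dns-message",
    }
    return headers, dns_message


print(doh_get_url(ENDPOINT, QUERY))
print(doh_post_parts(QUERY)[0])
```

Either form could be handed to any HTTPS client. On the wire it is indistinguishable from other requests to the same host, which is the camouflage property described above.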

DoH goes mainstream, and the fight over who resolves

The encrypted transports moved from RFC to default fast, driven by the browser vendors and a few large public resolvers rather than by the operators who had historically run DNS. Cloudflare launched its 1.1.1.1 public resolver in April 2018 with DoH and DoT support. Google’s 8.8.8.8 added encrypted transport too. Then the browsers turned it on. Firefox switched to DoH by default for users in the United States in February 2020, routing their DNS to a partner resolver rather than to whatever the operating system was configured to use. Chrome began its own default DoH rollout in May 2020, with a softer policy that upgraded to DoH only when the user’s existing resolver was known to support it.

That last design difference points straight at the controversy. Encrypting DNS is uncontroversial. Where the encrypted query goes is not. Sending DNS over 443 to a browser-chosen resolver moves resolution away from your ISP or your network’s own resolver and toward a small number of large operators, and it does so in a channel the local network cannot easily see or override. Privacy advocates pointed out that this hides your lookups from your ISP. Network operators pointed out that it also breaks split-horizon DNS, parental-control filtering, malware blocking done at the resolver, and enterprise policy, and that it concentrates a sensitive view of everyone’s browsing into a handful of companies. In 2019 the UK Internet Service Providers Association went as far as nominating Mozilla for an “Internet Villain” award over its DoH plans, a nomination it later withdrew after the backlash. The technical fix for query privacy turned into a governance argument about centralization, and that argument is not settled.

One answer to the centralization worry is to split the two things a resolver learns about you: who you are (your IP) and what you asked (the query). Oblivious DoH does exactly that. It routes the encrypted query through a proxy so that the proxy sees your address but not the decrypted query, while the target resolver sees the query but not your address, and neither alone can link the two. Apple and Cloudflare deployed it, and it was written up as RFC 9230, an experimental specification, in June 2022. We go deeper on the transport mechanics and the resolver-choice tradeoffs in DNS-over-HTTPS and DNS-over-TLS.

Cache poisoning came back

It would be a tidy story if source port randomization had ended cache poisoning in 2008. It did not. It raised the cost, and a decade later researchers found a way to lower it again. In 2020, a group led by Keyu Man and Zhiyun Qian at UC Riverside presented SAD DNS (Side-channel AttackeD DNS) at the ACM CCS conference, tracked as CVE-2020-25705.

The Kaminsky-era defense rested on the attacker not knowing the resolver’s randomized source port. SAD DNS finds the port through a side channel rather than guessing it. The trick exploits the global ICMP rate limit that modern operating systems implement. By probing in a way that triggers ICMP “port unreachable” responses and watching how the shared rate limit behaves, an off-path attacker can scan and infer which UDP source port the resolver is actually using for a pending query. Learn the port, and the attacker is back to brute-forcing only the 16-bit transaction ID, the same 16 bits Kaminsky exploited. The defense’s entropy collapsed from roughly 32 bits to roughly 16. The side channel was present across Linux, Windows, macOS, and FreeBSD because the ICMP rate-limiting behavior it abuses is near-universal. Mitigations included randomizing the ICMP rate limit to add noise to the channel and tightening DNS query timeouts, and the Linux kernel team shipped a patch along those lines.
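The inference step can be illustrated with a toy simulation. This is a heavily simplified model, not the published attack: `ToyResolver`, the budget of 50 ICMP replies per window, and the assumption that port 1 is closed are all made up for illustration, and spoofing, timing, and real sockets are abstracted away. What it shows is how a shared counter leaks one bit per batch of probes.

```python
ICMP_BUDGET = 50    # assumed number of ICMP replies allowed per window
KNOWN_CLOSED = 1    # a port the attacker assumes is closed


class ToyResolver:
    def __init__(self, open_port: int):
        self.open_port = open_port
        self.tokens = ICMP_BUDGET

    def new_window(self) -> None:
        self.tokens = ICMP_BUDGET       # the shared limit refills each window

    def probe(self, port: int) -> bool:
        """Return True if the probe triggers an ICMP port-unreachable reply."""
        if port == self.open_port or self.tokens == 0:
            return False                # open port, or shared budget exhausted
        self.tokens -= 1
        return True


def batch_has_open_port(resolver: ToyResolver, ports: list[int]) -> bool:
    resolver.new_window()
    for port in ports:                  # spoofed probes: replies go elsewhere
        resolver.probe(port)
    for _ in range(ICMP_BUDGET - len(ports)):
        resolver.probe(KNOWN_CLOSED)    # pad so a full batch drains the budget
    # One probe from the attacker's own address: a reply means budget was
    # left over, so one port in the batch produced no ICMP error.
    return resolver.probe(KNOWN_CLOSED)


def find_open_port(resolver: ToyResolver, candidates: list[int]):
    for start in range(0, len(candidates), ICMP_BUDGET):
        batch = candidates[start:start + ICMP_BUDGET]
        if batch_has_open_port(resolver, batch):
            while len(batch) > 1:       # binary search inside the batch
                half = batch[:len(batch) // 2]
                batch = half if batch_has_open_port(resolver, half) else batch[len(half):]
            return batch[0]
    return None


resolver = ToyResolver(open_port=40123)
print(find_open_port(resolver, list(range(1024, 65536))))
```

The attacker never sees the replies to the spoofed probes. The only observation is whether the final probe from its own address gets an answer, and that is enough to narrow tens of thousands of candidate ports down to one.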

SAD DNS is the clearest illustration of the pattern that runs through this whole history. The original protocol authenticated nothing, so every defense since has been a probabilistic cost increase layered on a design that was never meant to be adversarial. Raise the entropy with port randomization, and someone finds a side channel that drains it. The only defense that breaks the pattern is the cryptographic one, DNSSEC, because a signature an attacker cannot forge does not care how many packets they send or what side channel they have. That is also the defense that took longest to deploy and is still not universal, which tells you something about how hard authenticity is to retrofit. If you want the broader sweep of how off-path attacks abuse shared network state, the same logic shows up in DNS amplification and reflection.

What the long arc actually shows

The throughline from HOSTS.TXT to DoH is a single problem chased across four decades: how do you turn a name into an address at internet scale, and how do you make that answer trustworthy and private after the fact, on a protocol that started out assuming neither mattered. Mockapetris solved the scaling problem so completely in 1983 that the hierarchical namespace and the 12-byte message header he specified are still in use, essentially unchanged, on a network millions of times larger than the one he designed for. That is a rare thing in computing. Most foundational designs get replaced; DNS got extended.

The trust and privacy properties were not in the original design, and bolting them on has been the work of everything since. DNSSEC adds authenticity through signatures and has spent twenty years reaching partial deployment, held back by the operational weight of running keys at scale. The encrypted transports add confidentiality and reached the default in browsers within two years of standardization, but only by routing resolution to a few large operators and reopening a fight about who gets to see your queries. The cache-poisoning thread, from the 16-bit transaction ID through Kaminsky to SAD DNS, keeps demonstrating that probabilistic defenses on an unauthenticated protocol are temporary, because someone eventually finds the entropy you were counting on and takes it away.

The thing worth sitting with is that the lookup is still, by default, the most legible record of what a person does online. The page is encrypted, the connection is encrypted, the certificate is verified, and then the name that started it all went out over UDP port 53 in cleartext to whoever the network told you to ask. Closing that last gap took until 2018 to standardize and is still being argued over in 2026, not because the cryptography is hard, but because the question underneath it, who is allowed to know what you are trying to reach, was never a technical question at all.

