DNS works sometimes, but sometimes it doesn’t. What can we do?

Troubleshooting any kind of intermittent network problem can be a nightmare. Fortunately, Domain Name System (DNS) is a fairly simple protocol, and there are only so many things that can go wrong.

Low-Hanging Fruit

Start by eliminating obvious problem areas, such as unavailable DNS servers, WAN links that are down, unplugged cables, failed network cards, and so on. These types of problems will almost always manifest in other ways, because DNS won't usually be the only thing affected. Because clients are usually configured with more than one DNS server address, use Network Monitor to analyze the network traffic your clients are sending. That way, you'll know exactly which DNS server they're trying to talk with, and you can focus your troubleshooting efforts there first.
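Once you know which servers the clients are configured to use, it can help to query each one directly and see which of them answers. The following is a minimal sketch in Python, not a replacement for a packet capture. The server addresses and host name in the usage comment are hypothetical placeholders, and the helper names (build_query, probe_server) are invented for this example.

```python
import socket
import struct
import time


def build_query(name, qtype=1, txid=0x1234):
    """Build a minimal DNS query message (qtype 1 = A record, class IN)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # The name is encoded as length-prefixed labels ending in a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def probe_server(server, name, timeout=2.0):
    """Send one UDP query to server; return (rcode, seconds) or None."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
            elapsed = time.monotonic() - start
        except OSError:
            # Covers timeouts and unreachable-network errors alike.
            return None
    # Ignore anything too short or with a mismatched transaction ID.
    if len(reply) < 12 or reply[:2] != query[:2]:
        return None
    rcode = reply[3] & 0x0F  # low four bits of the flags word
    return rcode, elapsed


# Example use (addresses are placeholders, not real DNS servers):
# for server in ["192.0.2.53", "198.51.100.53"]:
#     print(server, probe_server(server, "www.example.com"))
```

A server that returns None on every attempt, or only on some attempts, is a good candidate for closer inspection.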

Another common root cause is multihomed servers. Unless specifically instructed to do otherwise, these servers will register all of their IP addresses with DNS. Some of those addresses, however, may be associated with network interfaces that not all clients can access. The result is that some clients will have access to the server and others won't. You may also have clients that switch between having access and not, particularly if DNS round robin is enabled on their DNS server. Round robin may be alternating between an accessible IP address and an inaccessible one, creating intermittent problems for clients.
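To check whether a multihomed server is the culprit, you can resolve its name from an affected client and test every address that comes back. The sketch below assumes the service listens on a TCP port you can connect to. The function names are made up for this example, and the operating system's resolver may cache or reorder results, so treat the output as a hint rather than proof.

```python
import socket


def tcp_connects(address, port, timeout=2.0):
    """Return True if a TCP connection to address:port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def resolve_all(hostname, port):
    """Return every distinct IP address the resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def split_by_reachability(addresses, port, check=tcp_connects):
    """Split addresses into (reachable, unreachable) lists.

    check is injectable so the logic can be exercised without a network.
    """
    reachable, unreachable = [], []
    for address in addresses:
        if check(address, port):
            reachable.append(address)
        else:
            unreachable.append(address)
    return reachable, unreachable


# Example use (host name and port are placeholders):
# addrs = resolve_all("fileserver.example.com", 445)
# print(split_by_reachability(addrs, 445))
```

If the unreachable list is not empty, the server is probably registering an address that this client cannot route to.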

Replication Issues

Replication issues can cause intermittent problems in Active Directory (AD)-integrated DNS zones. Ensure that AD replication is working properly to start with. Clients that are querying different DNS servers may be receiving different responses if the two servers haven't yet converged.
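One way to spot servers that haven't converged is to ask each DNS server for the same name and group the servers by the answer they give. The sketch below covers only the comparison step. It assumes you have already collected the answers (for example, with nslookup against each server), and the server names and addresses in the usage comment are hypothetical.

```python
from collections import defaultdict


def group_by_answer(answers):
    """Group DNS servers by the set of records each one returned.

    answers maps a server name to an iterable of record strings.
    Record order is ignored, so round robin does not cause false alarms.
    """
    groups = defaultdict(list)
    for server, records in answers.items():
        groups[frozenset(records)].append(server)
    return dict(groups)


def has_converged(answers):
    """Return True when every server returned the same record set."""
    return len(group_by_answer(answers)) <= 1


# Example use (names and addresses are placeholders):
# answers = {"dc1": ["10.0.0.5"], "dc2": ["10.0.0.5"], "dc3": ["10.0.0.9"]}
# print(has_converged(answers), group_by_answer(answers))
```

More than one group means at least one server is holding stale or not-yet-replicated data for that name.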

If replication latency is a problem for your DNS zones, consider upgrading to Windows Server 2003. In Windows Server 2003, the DNS zone can be stored in an application directory partition, and you can control which domain controllers contain a copy of that partition. By limiting the partition to just those domain controllers that are acting as DNS servers, you'll force a new replication topology to be generated for that partition. The result will be fewer servers replicating the information, so the different copies of the partition will converge more quickly and problems caused by replication latency will be reduced.

Protocol Problems

Another problem can occur if your firewall or router rules assume that DNS uses only User Datagram Protocol (UDP) port 53 and block access to Transmission Control Protocol (TCP) port 53. DNS uses the connectionless UDP transport by default, but only for messages that fit into a single UDP datagram (512 bytes in the original specification). When a response is too large, the server returns a truncated reply with the truncation (TC) flag set, and the client repeats the query over TCP. If TCP port 53 is blocked, this switch to TCP can cause bewildering problems on your network because some DNS queries will work fine and others will simply time out.
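The UDP-to-TCP switch can be sketched in a few lines of Python. This is an illustration of the mechanism, not how any particular resolver is implemented. SAMPLE_QUERY is a fixed query for example.com that exists only for this example, and the function names are invented here.

```python
import socket
import struct

# A fixed query for "example.com", type A, class IN (illustration only).
SAMPLE_QUERY = (
    struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    + b"\x07example\x03com\x00"
    + struct.pack("!HH", 1, 1)
)


def is_truncated(message):
    """Return True if the TC (truncation) bit is set in the DNS header."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than its 12-byte header")
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & 0x0200)


def _recv_exact(sock, count):
    """Read exactly count bytes from a TCP socket."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed before full reply arrived")
        data += chunk
    return data


def query_tcp(query, server, timeout=3.0):
    """Send a query over TCP, where each message has a 2-byte length prefix."""
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        sock.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


def query_with_fallback(query, server, timeout=3.0):
    """Try UDP first; retry over TCP if the reply is truncated."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        reply, _ = sock.recvfrom(4096)
    if is_truncated(reply):
        return "tcp", query_tcp(query, server, timeout)
    return "udp", reply
```

If query_with_fallback succeeds for small answers but raises a timeout or connection error for names with many records, blocked TCP port 53 is a likely suspect.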

DNS queries will nearly always go out via UDP (the notable exception being the Simple Mail Transfer Protocol (SMTP) service in IIS, which seems to always use TCP). Replies will come in on UDP or TCP depending on their size, which grows with the number of hosts and IP addresses they contain.

If you're not sure whether this situation applies to your problems, try using Network Monitor to capture DNS traffic on both sides of your firewall. If you're not seeing identical traffic on both sides of the firewall, the firewall is blocking some DNS traffic, most likely large replies. To play it safe, I recommend opening your network to incoming DNS traffic on both UDP port 53 and TCP port 53.
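Comparing two captures by eye gets tedious quickly. The sketch below shows one way to automate the comparison, assuming you have exported each capture into simple records holding the DNS transaction ID, the transport, and whether the packet is a response. The DnsPacket record and the export step are hypothetical and are not a Network Monitor feature.

```python
from typing import NamedTuple


class DnsPacket(NamedTuple):
    """One captured DNS packet, reduced to the fields needed for comparison."""
    txid: int          # DNS transaction ID
    transport: str     # "udp" or "tcp"
    is_response: bool  # True for replies, False for queries


def missing_across_firewall(source_side, far_side):
    """Return packets seen on source_side that never appear on far_side."""
    seen = set(far_side)
    return [pkt for pkt in source_side if pkt not in seen]


# Example use (values are made up):
# outside = [DnsPacket(1, "udp", True), DnsPacket(2, "tcp", True)]
# inside = [DnsPacket(1, "udp", True)]
# print(missing_across_firewall(outside, inside))
```

Packets that show up on one side only tell you which transport and direction the firewall is dropping.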