This story is a great example of characterizing a problem, getting closer to a solution with each step, and of why that process is so important. The story flows like a detective novel, with Greg the gumshoe uncovering new clues at each step, all leading to a surprising conclusion that generates more unexpected questions for subsequent episodes.
Like most detective stories, the day started innocently enough.
My friend and customer, Lynn, called with a common problem. Her email was broken. Many of my problem calls start with broken email because pretty much everyone uses email. But sometimes problems are not what they seem and the path to a solution can take many twists and turns. This was one of those times.
I built the IT network in Lynn’s office and I know its characteristics the same way Scotty knew the original Starship Enterprise. I knew Lynn used Microsoft Outlook on her desktop, the server was named ehcserver1, and the server ran Microsoft Exchange. The server is in the basement of the building and everyone connects over a series of Ethernet switches. Time for a good problem description.
Greg: “What happens when you launch your Outlook program?”
Lynn: “It just sits there for a while and then gives me an error message, something about the server.”
Greg: “When did it break?”
Lynn: “It worked fine when I shut down yesterday, but when I came in this morning and turned on my computer, now it doesn’t work. I promise, I didn’t change anything.”
I could push Lynn harder for more details, but this told me enough. Her Outlook program was not able to find the Exchange Server. And I know Lynn well enough to believe her when she tells me she did not change anything. This suggested something out of her control must have changed.
The next logical step in characterizing the problem was to find out if the problem was specific to Lynn or more widespread. Quickly polling a few people near Lynn, we discovered Bruce had the problem, but not Ayrica, Joe, or Mike. Since at least one other user had the problem, this suggested the problem was not specific to any workstation setting. The problem was something common to Bruce and Lynn, but nobody else.
Start Unraveling the Mystery
Experience suggests most email problems are really symptoms of a more general network or server issue. Everyone reports email problems because email is the application they use most often. But email depends on the overall network. If the overall network is broken, email will also be broken.
To find out if the problem is specific to email or something deeper, try a different application and see how it behaves.
One rule about working with end users: always start with an easy test and then dig deeper as necessary. People seem to appreciate it more that way.
Greg: “Let’s see if you can see other stuff on the network. Click Start…Computer, try to open one of your network drive mappings and let’s see what happens. What happens when you open, say, the V drive?”
A network drive mapping is really a directory on the server. The idea is, the desktop computer “thinks” it’s another hard drive, thus the drive letter, but really it’s a directory on the server. This is far and away the most common use for servers in an office.
All IT support companies have their own style and I set up many of my customers with a “V” drive, accessible to everyone. It’s a convenient place to test. Why V? Because V stands for eVeryone. Why not use “E”? Because some computers use “E” for a locally connected CD or DVD or USB card reader. It’s generally easier to use high letters in the alphabet for network drive letter mappings and leave low letters for locally attached devices.
Here is a picture similar to what Lynn saw. (The picture will open in a different tab on your browser.) The red X on the network drive mappings does not necessarily mean they are offline. The only meaningful test is to double-click the drive letter and observe what happens. Either the contents or an error message will show up in a window.
When Lynn double-clicked on the V drive, she saw an error message. So did Bruce. Since another application depending on the server and network was broken, the problem was not specific to email, but instead something common to both email and viewing drive letter mappings on the server. But only common to Lynn and Bruce. Mike, Joe, and Ayrica were fine.
Computer troubleshooting is often compared to a good mystery movie. Uncover clues and follow them where they lead. This one was starting to feel like a Hollywood whodunit. Time for some more in-depth tests.
I asked Lynn to launch an old-fashioned DOS command window and try a few commands. In Windows 7, click Start…All Programs…Accessories…Command Prompt. In Windows 8, click the upper right corner of the display to launch the Start screen, click the Start icon, right-click anywhere, click apps in the lower right corner of the system tray on the bottom of the screen, find the Command Prompt, and double-click on it. (How much money did Microsoft spend on this new, “improved” interface?)
I knew the server was named ehcserver1. So in that Command Prompt window, I asked Lynn to type “ping ehcserver1”, press the enter key, and tell me what it said. Here is a picture similar to what Lynn found. Here is a picture similar to what Lynn should have found.
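Before ping can send anything, it has to translate the server name into an IP address, so the ping test is also a name resolution test. Here is a minimal Python sketch of just that part, assuming Python happens to be installed on the workstation. The helper name resolve_name is made up for illustration, and ehcserver1 only means something inside Lynn's office network.

```python
import socket


def resolve_name(hostname):
    """Return the IPv4 address for hostname, or None if it cannot be translated."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror, raised when the lookup fails, is a subclass of OSError
        return None
```

On a healthy workstation in that office, resolve_name("ehcserver1") should return the server's address. A result of None means the computer cannot translate the name at all, which points at the network settings rather than at Outlook.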
How was it possible that Lynn could not translate the name of her server? Clearly, something was fundamentally wrong with the network. But it only affected a few users. The next step is a deeper diagnostic. In that DOS command window, type
Here is a PDF file with a sample report and some annotations taken from a Windows 7 computer in the Infrasupport network.
The computers in Lynn’s network should all have IPv4 addresses that look like 192.168.10.nnn, where nnn is a number between 1 and 254. The gateway should be 192.168.10.1, DNS Server 192.168.10.20. I built this network; I know what these values should be.
Surprise plot twist
But in a surprise plot twist worthy of the best Hollywood has to offer, both Lynn and Bruce’s computers showed IPv4 Address, Gateway, DHCP Server, and DNS Server Addresses of 192.168.2.nnn. Note the 2.nnn instead of 10.nnn.
No wonder Lynn and Bruce’s computers were broken. They both had bogus IP Addresses that did not belong to this network. This was stunning!
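The comparison I did in my head can be sketched in a few lines of Python. This is only an illustration, not a tool I ran that day. The gateway and DNS values come from the description of Lynn's network; the /24 mask, the function name check_settings, and the sample addresses in the usage note are assumptions, because the story only gives the addresses as 10.nnn and 2.nnn.

```python
import ipaddress

# Expected values for Lynn's office network
OFFICE_NETWORK = ipaddress.ip_network("192.168.10.0/24")  # /24 mask is an assumption
EXPECTED_GATEWAY = ipaddress.ip_address("192.168.10.1")
EXPECTED_DNS = ipaddress.ip_address("192.168.10.20")


def check_settings(ipv4, gateway, dns):
    """Return a list of problems found in one computer's reported settings."""
    problems = []
    if ipaddress.ip_address(ipv4) not in OFFICE_NETWORK:
        problems.append(f"IPv4 address {ipv4} is outside {OFFICE_NETWORK}")
    if ipaddress.ip_address(gateway) != EXPECTED_GATEWAY:
        problems.append(f"gateway {gateway} should be {EXPECTED_GATEWAY}")
    if ipaddress.ip_address(dns) != EXPECTED_DNS:
        problems.append(f"DNS server {dns} should be {EXPECTED_DNS}")
    return problems
```

With made-up sample values, check_settings("192.168.10.57", "192.168.10.1", "192.168.10.20") returns an empty list, while check_settings("192.168.2.101", "192.168.2.1", "192.168.2.1") reports all three settings as wrong.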
The only possible explanation: Somebody introduced a rogue DHCP server into this network and it was competing with my real DHCP Server.
DHCP servers lease IP Addresses and other network parameters to computers in an office. Although there are carefully crafted special cases, typically an office should have exactly one DHCP Server. If an office has multiple DHCP servers, it is not possible to predict which DHCP server will lease a computer its network parameters. This means computers may appear to suddenly fail at random times, and for random lengths of time, as their old leases expire and a rogue DHCP server assigns them bogus new network parameters.
This was exactly the case here. The rogue DHCP Server serviced both Lynn and Bruce’s computers, while the correct DHCP Server took care of Ayrica, Joe, and Mike.
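One way to picture the split is to group the computers by the DHCP server each one reports. The Python sketch below does that. The two server addresses are hypothetical stand-ins, since the story only says one server handed out 10.nnn addresses and the other 2.nnn, and the function name group_by_dhcp_server is made up for illustration.

```python
from collections import defaultdict


def group_by_dhcp_server(reports):
    """Map each DHCP server address to the sorted list of computers it leased to."""
    groups = defaultdict(list)
    for computer, dhcp_server in reports.items():
        groups[dhcp_server].append(computer)
    return {server: sorted(names) for server, names in groups.items()}


# Hypothetical DHCP server addresses, one per computer polled
reports = {
    "Lynn": "192.168.2.1",
    "Bruce": "192.168.2.1",
    "Ayrica": "192.168.10.20",
    "Joe": "192.168.10.20",
    "Mike": "192.168.10.20",
}

groups = group_by_dhcp_server(reports)
if len(groups) > 1:
    print("More than one DHCP server is answering on this network:")
for server, names in groups.items():
    print(f"  {server}: {', '.join(names)}")
```

More than one group in the result is the warning sign. In an office that should have a single DHCP server, every computer should land in the same group.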
The suspicious character with the shifty eyes did it – or did he?
Wonderful. Problem identified. Now, what to do about it? See part 2 for the exciting conclusion to the story.
(Originally published on my old Infrasupport website on April 6, 2013. I backdated the posting here.)