This is a story about real life cyberheroes, and it’s every bit as good as great fiction. But before going any further, this is my blog on my own website, which means anything I say here is from me and might not reflect the opinions of the people who pay my salary. But if anyone from where I work reads this – and I hope you do – you’ll like it.
My employer provides IT support subscriptions to some of the largest organizations on the planet, and one system from one of those customers dropped dead a few days ago after a software upgrade. It’s bad when that happens. More on that in a minute.
I need to pause again right here. This story involves technology. If you feel an urge to run away when you see technology words like “virtual machine,” now’s the time, because I need technology words to tell this story properly. But if you run away, you’ll miss out on a great story about real people who are really smart and who work really hard to make sure our phones, lights, water, and everything else we take for granted in modern society keeps running. So, don’t run away. Show some respect for the unsung cyberheroes who keep the world running and read this story and say thanks. Meet me half-way and I promise to make the technology words meaningful.
Why was this dead system important? I can’t say. But trust me, it was. And it died during the holiday week between Christmas and the New Year, when lots of people are away from work spending quality time with their families, using electricity, traveling, engaging in commerce, and storing cell phone pictures in the cloud. The world does not stop just because nearly everyone is on vacation, and this system needed to be operational.
The dead system contained two physical servers and an NFS server. NFS – Network File System. NFS servers offer a bunch of shared storage space to computer servers clustered around them. Every piece of a good NFS server is redundant, so no single component failure can take it down. The physical servers in this system, called hypervisors, run virtual machines that do the actual work. A virtual machine is software pretending to be hardware. Most of the world these days runs on virtual machines because virtual machines offer lots of management benefits, including a slick way to provide redundancy. If a hypervisor dies, just restart its virtual machines on another hypervisor. Even better, automate the whole failover. A manager server named RHVM handles all this, and RHVM can itself be a virtual machine inside the environment it manages. When built properly, such a system of virtual machines, managed by RHVM, and running on hypervisors clustered around shared storage, will stand up to any failure of any component and still continue operating.
But after an upgrade, RHVM in this system refused to start. With no RHVM to manage all the virtual machines inside this system, none of the work those virtual machines normally handled got done. Kind of like being on a dead boat in the middle of the ocean. Which is why we call such a condition in the IT industry, dead in the water.
This is where some amazing detective work comes in from people around the world. I tell people that troubleshooting is ninety nine percent perspiration and one percent inspiration. But when that one percent hits pay-dirt, it’s like hitting a walk-off home-run to win the game in the bottom of the ninth. Or scoring the game-winning goal in the soccer world cup. It’s a high no artificial drug will ever recreate.
First solution attempt
The degree of satisfaction when solving a tough problem is directly proportional to the degree of frustration finding the solution. And this one was tough from the start. Amar and Frank on the support team in North Carolina studied the error logs and consulted with Doron, Ryan, and others on the development team in St. Paul and Tel Aviv. The hypervisor error logs showed a failure when the vdsmd process connected to storage. The vdsmd (Virtual Desktop and Server Manager Daemon) process is the key to everything. Vdsmd handles all the communication between RHVM and hypervisors, and also helps start RHVM in self-hosted environments such as this one. If vdsmd breaks, everything breaks.
A failure connecting to storage seemed odd because the storage was fine before the hypervisor upgrades. So, why would reconnecting to storage break now, when it was fine before the hypervisor upgrades, especially since nothing changed on the storage side? Maybe the new hypervisor software version handled permissions differently somehow. Or maybe somebody changed the storage and the problem only showed up when servers tried to reconnect to it.
Was the problem only with the vdsmd process, or did the hypervisors themselves have problems connecting to storage? The customer issued commands to mount the storage by hand, and those worked. User root could not write to the storage, but user vdsm, which runs the vdsmd process, could. Since root could not write to the shared storage, maybe it was a permission problem. That was my contribution. I was wrong.
On Linux systems, user root is also known as the superuser, and has permission to access everything on that system. But an NFS share comes from another system, and user root on one system does not necessarily have the right to access everything from NFS shared storage on another system. Frank, Ryan, and others pointed out my error. The only user that mattered was user vdsm, and vdsm could do everything it needed. Permissions was a dead-end. Something was wrong with the vdsmd process.
Troubleshooting is like that. Since, by definition, the cause of the problem is unknown–if the cause were known, it would not be a problem–the goal is to uncover what’s going on and then fix it. Form a theory, devise an experiment to test it, evaluate the results, and then form a better theory and repeat. Ideally, every new experiment eliminates something or uncovers more about the problem. It really is trial and error, sometimes guided by intuition, sometimes by a checklist, usually a little of both. Most people never see the dead-ends and never learn to appreciate how much effort and creativity goes into troubleshooting.
It was time to explore some other curious output in the logs. On both hypervisors, when the vdsmd process tried to connect to shared storage and failed, it left a mysterious error around a software component named LVM. LVM – Logical Volume Manager. There are books about LVM, but the one sentence summary is, LVM is a key component inside the Linux operating system dealing with storage. But in this case with this system, the storage domains were all NFS. LVM has nothing to do with NFS. So, why did the vdsmd process trigger LVM errors when trying to connect to NFS storage?
Nir, from the development team in Tel Aviv, looked closer at the logs and found the vdsmd process died because it received unexpected output from the LVM command, pvs, during storage initialization. But the logs did not show the specific unexpected output that triggered the failure. And so Nir did what all great developers do; he wrote a patch to provide better diagnostic information. And then Allie and Gordon on the support team helped the customer install the patch, reproduce the problem, and send in fresh logs.
This was the clue that broke everything wide open.
InvalidLine: Missing field 'pv_name' in line 'Last login: Fri Dec 28 21:41:33 EST 2018'
Nijin, on the support team, was on duty that night and found it at 2:55 A.M. US Central time. I don’t know what time it was in Nijin’s timezone. Nijin is a pro, just like everyone else on both the support and development teams. The world has no idea how much gratitude it owes the people who do this stuff every day.
And now, the next mystery. What in the world was a “Last login” (lastlog) line doing in the output from an LVM pvs command? It’s like finding a car engine trouble code from home electrical wiring. Which makes no sense because they’re not related at all.
Or are they?
It’s exhilarating when problems take a left turn like this, because it’s a mystery nobody expects, and nobody has seen before. It’s like the Star Trek line, “to boldly go where no-one has gone before.” This is where the creativity comes in. Anyone who thinks the best IT people aren’t creative should study a few problems like this one.
Permissions are a big deal inside a computer. Giving a process blanket permissions to touch anything it wants is like playing Russian Roulette. Sooner or later, a bug or an attack will cause lots of damage. And so good software only runs as a superuser when it needs superuser capabilities, and then it reverts back to running as a normal user. The pvs command – the one that returned the unexpected output – needs to run as the superuser. Which means the vdsmd process needs to temporarily become the superuser to run it.
When a process needs to temporarily run as the superuser, it executes a command named sudo, for “do as the superuser.” But not everyone has permission to execute a sudo command, and so sudo, in turn, invokes a series of PAM modules to make sure the calling user has these permissions. PAM – Pluggable Authentication Modules – handle every aspect of authentication, from validating passwords, to handling login failures, to displaying the lastlog message, and dozens of other chores.
And that was why an unexpected lastlog message ended up in the output from the pvs command, which made the vdsmd process blow up, which prohibited RHVM from starting, which left a system full of virtual machines dead, which threatened millions of Christmas holidays across the United States. Nijin made the connection. The customer had customized the PAM module that handles sudo commands to display the lastlog message. The cure – take out that customization, and the whole system came back to life.
And now, the long term engineering challenge. Given this unexpected new possibility, what’s the best way to handle it? Does it make sense to ignore lastlog messages and continue parsing for real output? But what about other customizations and other output possibilities? Or does it make most sense to just document the behavior and harden Nir’s patch to deliver more diagnostic information in such cases? Or is the best approach to use a special supervdsmd process to handle chores the superuser needs to handle?
While I slept after composing the first draft of this blog post, support and development teams around the world explored permanent solutions. Because that’s what cyberheroes do; they solve problems and continuously improve. Every day. Ask any if they see themselves as heroes and they’ll say no, they’re professionals doing their job. And they’ll probably give me a load of grief for using the word, cyberheroes. All I know is, they’re some of the smartest and most dedicated people I’ve ever met and the world works because people like these bust their butts to make it work.