I’ve been in the IT industry my entire adult life, so sometimes I use words and just assume everyone thinks they mean the same thing I think they mean. I was recently challenged on the word “redundancy.”
“What does that even mean?” asked my friend.
“It means you have more than one.”
“So if one breaks, you can use the other one.”
“Yeah, everyone knows that, but what does it mean with IT stuff?”
Seems simple enough to me, but as I think about it, maybe it’s not so simple. And analyzing how things can fail and how to mitigate those failures is downright complex.
Redundancy is almost everywhere in the IT world. Almost, because it’s not generally found in user computers or cell phones, which explains why most people don’t think about it and why these systems break so often. In the back room, nearly all modern servers have at least some redundant components, especially around storage. IT people are all too familiar with the acronym, RAID, which stands for Redundant Array of Independent Disks. Depending on the configuration, RAID sets can tolerate one and sometimes two disk failures and still continue operating. But not always. I lived through one such failure and documented it in a blog post here.
Some people use RAID as a substitute for good backups. The reasoning goes like this: “Since we have redundant hard drives, we’re still covered if a hard drive dies, so we should be OK.” It’s a shame people don’t think this through. Forget about the risk of a second disk failure for a minute. What happens if somebody accidentally deletes or messes up a critical data file? What happens if a Cryptolocker-type virus sweeps through and scrambles everyone’s files? What happens if the disk controller in front of that RAID set fails?
Redundancy is only one component in keeping the overall system available. It’s not a universal cure-all. There will never be a substitute for good backups.
Virtual environments have redundancy all over the place. A virtual machine is software pretending to be hardware, so it’s not married to any particular piece of hardware. So if the physical host dies, the virtual machine can run on another host. I have a whole discussion about highly available clusters and virtual environments here.
With the advent of the cloud, doesn’t the whole discussion about server redundancy become obsolete? Well, yeah, sort of. But not really. It just moves somewhere else. Presumably all good cloud service providers have a well thought out redundancy plan, even including redundant data centers and replicated virtual machines, so no failure or natural disaster can cripple their customers.
With the advent of the cloud, another area where redundancy will become vital is the boundary between the customer premises and the Internet. I have a short video illustrating the concept here.
I used to build systems I like to call SDP appliances. SDP stands for Software Defined Perimeter, meaning that with the advent of cloud services, company network perimeters won’t really be perimeters any more. Instead, they’ll be sets of software directing traffic between various cloud services and the internal network.
Redundancy takes two forms here. First is the ability to juggle multiple Internet feeds, so when the primary feed goes offline, the company can route via the backup feed. Think of two on-ramps to the Interstate highway system, so when one ramp has problems, cars can still get on with the other ramp.
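To make the two on-ramps idea concrete, here is a minimal Python sketch of choosing a feed in priority order. It is only an illustration: the function name `pick_feed`, the feed labels, and the `is_up` health check are all hypothetical, and a real appliance would do this at the routing layer rather than in a script.

```python
from typing import Callable, Optional, Sequence


def pick_feed(feeds: Sequence[str], is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first feed, in priority order, that is currently up.

    `feeds` lists the Internet feeds from most to least preferred.
    `is_up` is a caller-supplied health check (hypothetical here).
    Returns None when every feed is down.
    """
    for feed in feeds:
        if is_up(feed):
            return feed
    return None


# Example: the primary feed is offline, so traffic goes out the backup feed.
status = {"primary": False, "backup": True}
chosen = pick_feed(["primary", "backup"], lambda feed: status[feed])
```

Because the list is checked in priority order every time, traffic moves back to the primary feed as soon as its health check passes again.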
The other area is redundant SDP appliances. The freeway metaphor doesn’t work here. Instead, think of a gateway, or a door through which all traffic passes to/from the Internet. All gateways, including Infrasupport SDP appliances, use hardware, and all hardware will eventually fail. So the Infrasupport SDP appliances can be configured in pairs, such that a backup system watches the primary. If the primary fails, the backup assumes the primary role. Once back online, the old primary assumes a backup role.
Deciding when to assume the primary role is also complicated. Too timid and the customer has no connection to the cloud. Too aggressive and a disastrous condition where both appliances “think” they’re primary can come up. After months of tinkering, here is how my SDP appliances do it. The logic is, well, you’ll see…
If the backup appliance cannot see the primary appliance in the private heartbeat network, and cannot see the primary in the internal network, and cannot see the primary in the external Internet network, but can see the Internet, then and only then assume the primary role.
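That rule can be written as a single boolean test. The Python sketch below is only an illustration of the rule as stated, not the actual appliance code; the `Visibility` class, its field names, and `should_assume_primary` are hypothetical, and how each check is performed (ping, ARP, something else) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Visibility:
    """What the backup appliance can currently reach (hypothetical field names)."""
    primary_on_heartbeat: bool  # primary seen on the private heartbeat network
    primary_on_internal: bool   # primary seen on the internal network
    primary_on_external: bool   # primary seen on the external Internet network
    internet_reachable: bool    # backup itself can reach the Internet


def should_assume_primary(v: Visibility) -> bool:
    """Take over only if the primary is invisible everywhere AND the Internet is reachable."""
    primary_seen_anywhere = (
        v.primary_on_heartbeat or v.primary_on_internal or v.primary_on_external
    )
    return (not primary_seen_anywhere) and v.internet_reachable


# A broken heartbeat cable alone is not enough: the primary is still seen elsewhere.
cable_fault = Visibility(False, True, True, True)
# The primary has vanished from all three networks and the backup can reach the Internet.
primary_dead = Visibility(False, False, False, True)
```

Requiring all three "cannot see" checks guards against the too-aggressive case, where one bad link makes both appliances think they are primary. Requiring Internet reachability keeps an isolated backup, which could not carry traffic anyway, from taking over.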
It took months to test and battle-harden that logic, and by now I have several appliances in production. It works and it’s really cool to watch. That’s redundancy done right. If you want to find out more, just contact me right here.
(Originally posted on my Infrasupport website, June 4, 2015. I backdated here to match the original posting.)