Plan for hardware failure, because you can't avoid it
In today's world of vMotion, fault tolerance, high availability and redundant technologies, we no longer talk about hardware failure. Instead, we focus on the availability of applications and systems. We design, configure and install based on availability rather than failure. So, the question becomes, "What happens when hardware does fail?"
Designing for availability versus failure
The thought process of designing for availability is very different from designing for failure, and understanding the difference is the key to a successful environment. When we design for availability, we look at products and technologies that prevent an outage. When we design for failure, we prepare to deal with the outage after it has occurred. While this sounds simple in concept, it is often not thought through.
Let's take a simple example of a traditional rack server. It's normally configured with redundant hardware, including RAID technology to prevent the loss of data in the event of a drive failure, and would traditionally be backed up for data protection. All of these pieces are common design steps based on availability. But if multiple drives fail, where does that leave the system? If we design for failure instead of availability, we would have spare hardware on hand so we can start the recovery process immediately instead of having to wait two to four hours for onsite vendor service. Once we have the drives replaced, can we do a bare-metal restore, or do we have to reinstall the entire OS and backup agent before we can start a recovery?
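To put that vendor wait in perspective, here is a minimal sketch that converts an availability target into a yearly downtime budget and compares it with a two- to four-hour wait. The function names are hypothetical, and the 99.999% figure is only the commonly quoted "five nines" target, not a measurement from any real system.

```python
# Downtime budget implied by an availability target, compared with an
# outage that waits on vendor service. All figures are illustrative.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the yearly downtime (in minutes) an availability target allows."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def years_of_budget_consumed(outage_minutes: float, availability_pct: float) -> float:
    """Return how many years of downtime budget a single outage uses up."""
    return outage_minutes / downtime_budget_minutes(availability_pct)


if __name__ == "__main__":
    target = 99.999  # assumed "five nines" target
    budget = downtime_budget_minutes(target)
    print(f"{target}% allows about {budget:.2f} minutes of downtime per year")
    for wait_hours in (2, 4):  # the vendor response window mentioned above
        years = years_of_budget_consumed(wait_hours * 60, target)
        print(f"A {wait_hours}-hour wait uses about {years:.0f} years of that budget")
```

Five nines works out to a little over five minutes a year, so a single wait for replacement parts can use up decades of that budget before the restore has even started.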
The advantage of dedicated management clusters
This recovery process can take several hours to complete, but we often don't account for it when we talk about the availability of a system. As we look at our virtual environments, let's look past the 99.999% availability and focus on the failure aspect. One of the most important pieces of your virtual environment is your management tool. In VMware environments, this means vCenter. For many, keeping vCenter on a physical server was one way to keep the management tool outside of the environment that it manages. However, VMware has continued to evolve vCenter into a robust virtual appliance that is becoming the preferred platform for managing the environment. While vCenter is not strictly required for virtualization and in theory could be placed on any virtual cluster, a dedicated management cluster provides multiple benefits, including management of critical virtual machines (VMs).
A separate management cluster is not an availability design but a failure design. Your management cluster may only be two or three hosts and could even use local disks instead of shared storage. While it might sound strange not to use a storage area network or network file system storage frame, remember that the purpose of the management cluster is to create an environment separate from the production systems. This separation of duties focuses on isolation, where an event in one environment does not adversely affect the other. It helps to limit your failure exposure and gives you more flexibility with patches and upgrades. Of course, the next step is running vCenter in Linked Mode with backup management servers in your production environment so that, in the event of a hardware failure in your production or management environment, you still retain access to your management tools.
Document or fail
One of the other critical pieces of your virtual environment that is often delayed or ignored is documentation. In today's world, where everything moves so quickly, documentation is often left to the end of a project and, in many cases, it's never done. If you lost access to the management tools for your virtual environment, would you know what the IP addresses or host names are? What about the mapping to storage LUNs or network VLANs? Too often when we install or expand environments, we compare and match to something that is already in place and we don't reference documentation, in many cases because we haven't gotten around to creating it yet.
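Even a very small, tool-independent record answers those questions when the management tools are unreachable. The following is a minimal sketch, not a prescribed format: the `HostRecord` fields, the file name, and the sample host (with a reserved documentation IP address) are all hypothetical.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class HostRecord:
    """One documented host; the fields mirror the questions asked above."""
    hostname: str
    management_ip: str
    storage_luns: List[str] = field(default_factory=list)
    vlans: List[int] = field(default_factory=list)


def save_inventory(records: List[HostRecord], path: str) -> None:
    """Write the inventory as JSON so it can be read without vCenter."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump([asdict(r) for r in records], handle, indent=2, sort_keys=True)


def load_inventory(path: str) -> List[HostRecord]:
    """Read the inventory back into HostRecord objects."""
    with open(path, encoding="utf-8") as handle:
        return [HostRecord(**entry) for entry in json.load(handle)]


if __name__ == "__main__":
    # Hypothetical sample entry; replace with your own hosts.
    sample = [HostRecord("esx-mgmt-01.example.test", "192.0.2.11",
                         ["LUN-001", "LUN-002"], [10, 20])]
    target = os.path.join(tempfile.mkdtemp(), "inventory.json")
    save_inventory(sample, target)
    for host in load_inventory(target):
        print(host.hostname, host.management_ip, host.storage_luns, host.vlans)
```

Keeping the file somewhere outside the environment it describes, and under version control, means it is still readable when that environment is down.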
Let's say a newer VMware admin needs to disable Distributed Resource Scheduler (DRS) for upcoming maintenance, but instead of temporarily disabling DRS automation (for example, by setting the automation level to manual), they turn DRS off entirely on a core production cluster. While disabling DRS automation preserves all of the settings for both DRS affinity rules and resource pools, turning off DRS removes both the rules and the pools. Of course, DRS gives you a warning when you choose to turn it off, but how often do people take the time to read every dialog box? Unfortunately, this is also a real example of what can happen if a seasoned administrator is moving a little too quickly. With all of the resource pools and DRS rules removed, the production cluster is left in a questionable health state. Of course, a few rules can be recreated from memory, but without proper documentation you will be starting from square one. The likelihood of getting it all right is slim to none. Having current documentation would have taken this situation from something that required days of effort and affected the performance of the VMs to a correctable mistake.
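If the rules and pools had been recorded beforehand, finding what was lost becomes a mechanical comparison. This sketch assumes you already have two plain dictionaries keyed by name, one from your documentation and one exported from the cluster after the incident. How you export them is up to you, and the rule names and VM names below are hypothetical.

```python
from typing import Dict, List


def diff_settings(documented: Dict[str, dict], current: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compare documented settings with the current ones, by name.

    Returns the names that are missing from the cluster and the names
    that still exist but no longer match the documented definition.
    """
    missing = sorted(name for name in documented if name not in current)
    changed = sorted(
        name for name in documented
        if name in current and documented[name] != current[name]
    )
    return {"missing": missing, "changed": changed}


if __name__ == "__main__":
    # Hypothetical documented DRS rules, recorded before the incident.
    documented_rules = {
        "separate-db-nodes": {"type": "anti-affinity", "vms": ["db01", "db02"]},
        "web-with-cache": {"type": "affinity", "vms": ["web01", "cache01"]},
    }
    # After DRS was turned off, the cluster has no rules left.
    current_rules: Dict[str, dict] = {}

    report = diff_settings(documented_rules, current_rules)
    print("Rules to recreate:", report["missing"])
    print("Rules to review:", report["changed"])
```

The same comparison works for resource pools. The documented definitions also tell you exactly what to recreate, rather than relying on memory.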
In this case, the technology and the design didn't fail. The failure was caused by human error, but what made it much worse was the lack of documentation. While management and administrators will agree on how important documentation is, the fast-paced nature of the job and a lack of time often prevent us from creating it. Fortunately, products like Neverfail IT Continuity Architect can help IT bridge some of that gap. These types of products interface with your virtual infrastructure and map out the existing environment and even the dependencies between your servers and applications. This creates a documentation map for your infrastructure and gives you insight into your virtual environment without requiring hours of work by your administrator.
Together, dedicated management clusters and proper documentation form a two-pronged approach for dealing with hardware failure in your virtual environments. While availability is what we typically design for, we also need to acknowledge the fact that failure in our virtual environment is always an option. If we take that approach, our businesses will be better prepared for the contingencies that never seem possible but always seem to occur.