Disaster Recovery Planning for Datacenter Managers
Intro
If you run a datacenter, you’re eventually going to face a downtime situation. Regardless of the “size” of the outage, if your business depends on your IT resources, you lose money each second your facility isn’t functioning as it should. A disaster recovery plan is therefore critical to minimizing losses when one of these inevitable outages occurs. The disaster may be a true disaster, such as an earthquake or other event (natural or manmade) that causes extensive damage, or something less dramatic, like an extended utility outage. The key to recovering quickly and effectively from a disaster, however “disastrous,” is planning. The following are some critical considerations to incorporate when preparing or updating your disaster recovery plan.
Count the costs. Although datacenter downtime is harmful to any company that relies on its IT services, it costs some companies more than others. Your disaster recovery plan should enable a fast return to service, but it shouldn’t cost you more than you are losing in downtime costs. This is simply a business decision that properly weighs the costs and benefits of disaster recovery approaches: as with any other business decision, you should select one that maximizes the return on your investment. And that requires you to first determine the cost of datacenter downtime. Before formulating a disaster recovery plan, spend some time figuring out how much your business loses for a given amount of downtime—this will enable you to select a strategy that gives you the most return.
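The cost comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the option names, the hourly loss figure, the baseline outage hours and every cost below are hypothetical placeholders, and the model (losses avoided per year minus the yearly cost of the option) is one simple way to frame the decision, not a standard formula.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    annual_cost: float            # yearly cost of maintaining this option
    expected_outage_hours: float  # expected downtime per year with this option in place


def net_annual_benefit(option, baseline_outage_hours, loss_per_hour):
    """Downtime losses avoided per year minus the yearly cost of the option."""
    hours_saved = baseline_outage_hours - option.expected_outage_hours
    return hours_saved * loss_per_hour - option.annual_cost


def best_option(options, baseline_outage_hours, loss_per_hour):
    """Pick the option with the highest net annual benefit."""
    return max(
        options,
        key=lambda o: net_annual_benefit(o, baseline_outage_hours, loss_per_hour),
    )


# All figures below are hypothetical and exist only to show the arithmetic.
LOSS_PER_HOUR = 10_000.0   # assumed business loss per hour of downtime
BASELINE_HOURS = 20.0      # assumed yearly downtime with no recovery plan

OPTIONS = [
    RecoveryOption("tape backups only", 15_000.0, 12.0),
    RecoveryOption("warm standby site", 90_000.0, 2.0),
    RecoveryOption("hot mirrored site", 250_000.0, 0.5),
]

if __name__ == "__main__":
    for opt in OPTIONS:
        benefit = net_annual_benefit(opt, BASELINE_HOURS, LOSS_PER_HOUR)
        print(f"{opt.name}: net annual benefit {benefit:,.0f}")
    print("Best:", best_option(OPTIONS, BASELINE_HOURS, LOSS_PER_HOUR).name)
```

With these made-up numbers the most expensive option actually loses money, which is the point of the paragraph above: the fastest recovery is not automatically the best investment.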
Evaluate the types of threats you face and how extensively they can affect your facility. Malicious attacks can occur anywhere, but you may also face threats peculiar to your location, such as weather events (tornadoes, hurricanes, floods and so on), earthquakes or other dangers. Part of preparing for a disaster is knowing what is likely to occur (you can probably ignore some conceivable threats, like an alien invasion) and how those threats could affect your systems. Evaluating these situations beforehand allows you to better take appropriate action should one of these events occur.
Know what you have and how critical it is to operations. Responding to a disaster in your datacenter is similar to triage in medicine: you treat the most serious problems first, then the minor ones. By determining in advance which systems are most critical to your datacenter, you enable your IT staff to prioritize and make the best use of the precious minutes and hours immediately following an outage. Not every system needs to be functional immediately following a disaster.
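One way to record this prioritization is a small inventory in which each system carries a criticality tier, sorted into a recovery order. The sketch below assumes a hypothetical three-tier scheme (1 is most critical) and invented system names; a real inventory would also need to account for dependencies between systems, which this example deliberately ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    tier: int  # hypothetical scheme: 1 = restore first, 3 = can wait


def recovery_order(systems):
    """Return system names sorted by tier, then alphabetically within a tier."""
    return [s.name for s in sorted(systems, key=lambda s: (s.tier, s.name))]


# Hypothetical inventory, for illustration only.
INVENTORY = [
    System("internal wiki", 3),
    System("customer database", 1),
    System("email", 2),
    System("authentication service", 1),
]

if __name__ == "__main__":
    for position, name in enumerate(recovery_order(INVENTORY), start=1):
        print(position, name)
```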
Identify critical personnel and gather their contact information. Who do you most want to be present in the datacenter following an outage? Who has the most expertise in a given area and the greatest ability to oversee some part of the recovery effort? Being able to get in touch with these people is crucial to a fast recovery. Collect their contact information and, just as importantly, keep it up to date. If it’s been a year or more since you last checked, some of that contact information is likely out of date. Every minute you spend trying to find important personnel is time not spent on recovery.
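Keeping contact information current is easier if stale entries are flagged automatically. The following is a minimal sketch, assuming each contact record stores the date it was last verified; the field names, the people listed and the 365-day threshold (taken from the “a year or more” rule of thumb above) are illustrative assumptions.

```python
from datetime import date, timedelta


def stale_contacts(contacts, today, max_age_days=365):
    """Return names of contacts last verified max_age_days or more before today."""
    cutoff = today - timedelta(days=max_age_days)
    return [c["name"] for c in contacts if c["last_verified"] <= cutoff]


# Hypothetical contact list, for illustration only.
CONTACTS = [
    {"name": "Network lead", "last_verified": date(2022, 11, 3)},
    {"name": "Storage lead", "last_verified": date(2024, 3, 1)},
    {"name": "Facilities manager", "last_verified": date(2021, 7, 19)},
]

if __name__ == "__main__":
    for name in stale_contacts(CONTACTS, today=date(2024, 6, 1)):
        print("Needs re-verification:", name)
```

Running a check like this on a schedule turns “keep it up to date” from a good intention into a recurring task.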
Train your employees. Knowledge of how to implement disaster recovery procedures is obviously important when an outage occurs. To this end, prepare by training personnel—and not just in their respective areas of expertise. Everyone should have some broad-based knowledge of the recovery process so that it can be at least started even if not everyone is present.
Ensure that everyone knows the disaster recovery plan and understands his or her role. Announcing the plan and assigning roles is not something you should do after a disaster strikes; it should be done well in advance, leaving time for personnel to learn their roles and to practice them. Almost nothing about a disaster event should be new (aside from some contingencies of the moment, perhaps): the IT staff should implement disaster recovery as a periodic task (almost) like any other.
Practice. This is perhaps the most critical part of preparing for a downtime event. The difference between knowing your role and being able to execute it well is simply practice. You may not be able to shut down your datacenter to simulate precisely all of the conditions you will face in an outage, but you can still go through many of the procedures. Some recommendations prescribe semiannual drills, at a minimum, to practice implementing the disaster recovery plan. If there’s one thing you take from this article, it’s that you should practice your disaster recovery plan. However well laid out the plan is, don’t expect it to unfold smoothly when you need it if you haven’t given it a trial run or two.
Automate where possible. Your staff is limited, so it can only do so much. The more that your systems can do on their own in a recovery situation, the faster the recovery will generally be. This also leaves less room for human error—particularly in the kind of stressful atmosphere that exists following a disaster.
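As a rough illustration of automating recovery steps, the sketch below runs a scripted runbook in order and stops at the first step that fails, so that staff only need to step in where automation could not finish. The step names and the placeholder functions are hypothetical; in practice each step would call whatever tooling your facility actually uses.

```python
def run_runbook(steps):
    """Run (name, action) steps in order.

    Each action is a callable returning True on success. Returns a tuple of
    (names of completed steps, name of the failed step or None).
    """
    completed = []
    for name, action in steps:
        try:
            ok = bool(action())
        except Exception:
            # Treat any unexpected error as a failed step needing human attention.
            ok = False
        if not ok:
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical placeholder steps; real ones would invoke your own tooling.
def switch_to_backup_power():
    return True


def restore_latest_backup():
    return True


def verify_services():
    return True


RUNBOOK = [
    ("switch to backup power", switch_to_backup_power),
    ("restore latest backup", restore_latest_backup),
    ("verify services", verify_services),
]

if __name__ == "__main__":
    done, failed = run_runbook(RUNBOOK)
    print("Completed:", done)
    if failed is not None:
        print("Escalate to staff at step:", failed)
```

Stopping at the first failure, rather than pressing on, is a deliberate choice in this sketch: later steps often depend on earlier ones, and a clear handoff point reduces the room for human error that the paragraph above mentions.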
Follow up after a disaster. When a downtime event does occur, evaluate the performance of the personnel and the plan to determine if any improvements can be made. Update your plan accordingly to enable a better response in the future. Furthermore, investigate the cause of the outage. If it’s an internal problem, take necessary measures to correct equipment issues to avoid the same problem occurring again.
The details of your disaster recovery plan will obviously vary depending on how critical your IT infrastructure is to your business, what your budget is and what kinds of threats you face. For instance, you may need a backup site at an entirely different location that can take over service provision should your main site go down. For smaller businesses, however, this may be impractical from a budget standpoint. Nevertheless, the general approach to disaster planning is still the same: evaluate the threats, determine the potential costs to your infrastructure, develop a plan that fits your budget and practice that plan regularly to ensure that your disaster recovery efforts (which you hope never to implement) go smoothly, minimizing downtime.
The most important takeaway, however, is that you must practice your plan to make it effective. Even if your IT staff is equipped with a great disaster recovery plan, they can still stumble, and cost your business even more money, if a real outage is the first time they have had practical experience with it. So, plan ahead diligently, but just as importantly, use regular dry runs to help your staff gain both experience and confidence with the plan. The more familiar the procedures, the fewer issues will arise when it comes time to implement them. And when the inevitable downtime event strikes, don’t forget to use the opportunity to refine and improve your plan, and to identify the source of the outage to help ensure it doesn’t happen again.