Sandy’s Lessons About Disaster Recovery

Were you ready for Sandy? Sure, it was downgraded from hurricane to tropical storm, but it still had a devastating effect. For those of us who manage datacenters, that impact had many layers. Let me start at the core and work my way out.
Clearly, the two most critical components to protect during an event like Sandy are the facility and power. Loss of integrity of the building infrastructure, or loss of power to the processing machinery, brings the datacenter to a dead stop. A close second would be network connectivity, but if the power and facility are not in operation, a network connection is just a nice thing to brag about at a postmortem.
The facility is under attack in two ways during an event like Sandy. The first is the direct assault of winds in excess of 70 mph. The second is exposure to the elements: flooding, rain, or wind entering through a breach in the building. Loss of facility integrity can be catastrophic.
Who is onsite and/or able to provide remote support during this type of event? You will need hands and feet to be able to assess the state of the datacenter. If you are isolated because of power or network or access, then what is the contingency plan? Is there a DR plan for critical data and services?
I don’t need to walk through a disaster recovery scenario, since you should all have plans in place. What is often missing from those plans is the human element. We think through all of the technical implications, but often make assumptions about the people. Chances are the critical resources you will depend on for recovery are dealing with their own issues at home. Even if they fared well and do not have much damage to their own homes, they may not be able to get to the datacenter to provide assistance.
If hands and feet are needed and the only hands and feet you can get to the site are not trained in the technical issues, what is your plan to help them provide you with a minimum damage assessment and preparation for recovery? You need a playbook that will allow someone who is not an IT person to go into the datacenter and, with guidance and support, perform minimal remediation and provide accurate information to remote technical teams. This is about process, not technology.
Say you have access, you have people who can get to the site to assist, and you have a process that will allow them to be effective once there. They also have IT tools that let remote support guide them. But there is more. People need food, water, and sanitation. Assuming you are not under government restrictions on access, how do you provide for the staff who may make it onsite before the full services of the building are restored?
This is one of the disaster scenarios you need to think through and document. If certain conditions of self-sustaining operation are not met, then you cannot bring people onto the premises to initiate recovery. It may not be black and white. Some limited work might be done, but you need to establish criteria for just how much can be done before any recovery is stopped.
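Those criteria can be made concrete in a simple go/no-go check. The sketch below is illustrative only: the specific conditions and their descriptions are assumptions standing in for whatever your own DR plan defines as "self-sustaining operation."

```python
# Illustrative go/no-go check for bringing recovery staff onsite.
# The criteria here are placeholder assumptions; substitute the
# conditions documented in your own disaster recovery plan.

SITE_CRITERIA = {
    "structure_safe":  "Building inspected and structurally sound",
    "power_available": "Utility or generator power confirmed",
    "potable_water":   "Drinking water on hand for the expected stay",
    "sanitation":      "Working sanitation facilities",
    "egress_clear":    "Safe entry and exit routes",
}

def site_go_no_go(status: dict) -> tuple:
    """Return (go, unmet) given a mapping of criterion -> met (bool)."""
    unmet = [desc for key, desc in SITE_CRITERIA.items()
             if not status.get(key, False)]
    return (len(unmet) == 0, unmet)

go, unmet = site_go_no_go({
    "structure_safe": True,
    "power_available": True,
    "potable_water": False,
    "sanitation": True,
    "egress_clear": True,
})
print("GO" if go else "NO-GO, blocked by:", unmet)
```

The point of writing it down this way is that a non-technical person onsite can report against a fixed list, and the decision to proceed or stop becomes mechanical rather than a judgment made under stress.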
Then there is the rest of the world. If you have generators and want to bring them online, do you have fuel? It is great if you have a tank, but can you even get it refilled in the case of an extended outage? From Sandy, it is clear that fuel and water have become critical elements in the recovery. If people cannot get fuel to drive to the site, then your recovery is not going to happen.
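The fuel question is worth reducing to arithmetic ahead of time. This back-of-the-envelope sketch assumes an illustrative burn rate; generator fuel consumption is typically quoted in gallons per hour at a given load, so use your generator's actual spec sheet for real planning.

```python
# Back-of-the-envelope generator runtime from fuel on hand.
# The burn rate below is an illustrative assumption, not a spec.

def runtime_hours(tank_gallons: float, burn_gal_per_hour: float,
                  reserve_fraction: float = 0.1) -> float:
    """Hours of runtime from usable fuel, holding back a safety reserve."""
    usable = tank_gallons * (1.0 - reserve_fraction)
    return usable / burn_gal_per_hour

# Example: a 1,000-gallon tank feeding a generator burning 25 gal/hour
# at load, keeping a 10% reserve.
hours = runtime_hours(1000, 25)
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")  # 36 hours (~1.5 days)
```

Running the numbers this way, before the storm, tells you how long you have before a refill becomes the critical path, and whether your fuel delivery contract can realistically meet that window during a regional outage.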
There are many more elements to think through. I hope this process of starting at the core and working your way out to the world at large will help you tune your disaster recovery plans. Whether the event is natural or man-made, some form of disaster will happen to your datacenter, and you need to be able to move to an appropriate plan to initiate recovery.