Avoiding Data Center Disasters

Kenneth Brill

Steps you can take so you don’t have to implement your disaster recovery plan.

Disaster recovery plans for data are a necessary and fiscally prudent fact of life, but they are also costly and problematic. Most IT professionals privately acknowledge that despite rigorous testing, they have only marginal confidence that their plan will work in a timely way during a real emergency. How can this be?

In a real emergency, the IT function must be physically moved to a new geographic location. This means that applications must be put onto new hardware, risking configuration and compatibility problems. Relocated key people are torn between taking care of their families and taking care of the business.

Having watched the costly and problematic process of relocating data processing in an emergency for the last 30 years, I believe senior executives should be aware of the following:

–Disaster recovery plan activations occur because something physical has happened–the data center’s facility infrastructure has been damaged. Water intrusion, smoke intrusion or electrical explosion are the usual culprits. Terrorism is a relatively new threat, but again, it is a physical threat.

–Although planned, a second, extended information outage occurs at the conclusion of the disaster recovery period, when processing is returned to the original location.

So, why not include physical disaster avoidance as an integral part of disaster recovery planning?

The benefits would include a dramatic reduction in risk, plus less total information downtime! While it may be impossible to prevent a physical facility failure, it is relatively cheap to assure that facility uptime can be restored faster than by moving people and restoring IT processes at a remote site.

–Everything is already in place in the existing data center and was running before the physical event occurred. Recovering information availability only requires restoring facility availability. As a result, an information outage occurs only once rather than twice.

–Recovery at a remote location takes time, is done under sub-optimal circumstances, and even in the best of circumstances, is likely to run into at least some configuration, compatibility and software version problems. Even when successful, a second information outage occurs when moving back to the original site.

Reducing the physical risk of catastrophic facility failure starts with site selection and data center design. No site or design is perfect; all have risk. But, you should know the catastrophic physical risks you have.

A number of years ago, a catastrophic data center failure occurred that required implementation of the disaster recovery plan (which also opened the corporate checkbook to major unplanned spending). The event was described as an act of God. Was it really? The data center was located in Florida on the beach facing the ocean. It was a full moon and a hurricane caused high waves. Fish were found in the fifth floor data center. What happened was totally predictable based on a combination of seven events that could be expected to happen over a 20-year period. I would say this was a site selection choice based on beauty instead of function. My belief is that this was a management error, not an act of God. In short, data centers and their site infrastructure should not be located under sources of water such as bathrooms, kitchens or roofs, or in basements, unless mitigating containment systems are also installed.

Despite common misperception, computer rooms don’t have fires. There is little to burn in a computer room if cardboard packing materials are rigorously removed, or never allowed in to begin with, and circuit breaker and electrical protective device settings are properly coordinated. What does occur is resistors getting hot and smelling and IT equipment overheating, but no real fire or actual flame. Smoke intrusion, however, does occur, especially when the data center is a tenant in a larger building. Computer rooms should always be under positive air pressure, meaning air should be blowing out of any opening to an adjacent space. Positive air pressure prevents smoke from entering the computer room. This is simple but often overlooked.

The No. 1 reason for catastrophic facility failure is lack of electrical maintenance. Electrical connections need to be checked annually for hot spots and then physically tightened at least every three years. Many sites cannot do this because IT’s need for uptime and the facility department’s need for maintenance downtime are incompatible. Often IT wins, at least in the short term. In the long term, the underlying science of materials always wins.
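The maintenance intervals above lend themselves to a simple overdue check. The following is a minimal sketch, not a description of any real maintenance system: the function name `overdue_tasks`, the task labels and the approximation of a year as 365 days are all assumptions made for illustration. Only the two intervals (an annual hot-spot check and tightening at least every three years) come from the text.

```python
from datetime import date, timedelta

# Intervals from the text: check for hot spots annually, tighten
# connections at least every three years. A year is approximated
# as 365 days (an assumption for this sketch).
HOT_SPOT_CHECK_INTERVAL = timedelta(days=365)
TIGHTENING_INTERVAL = timedelta(days=3 * 365)


def overdue_tasks(last_check: date, last_tightening: date, today: date) -> list[str]:
    """Return the (hypothetical) task labels that are past their interval as of `today`."""
    overdue = []
    if today - last_check > HOT_SPOT_CHECK_INTERVAL:
        overdue.append("hot-spot check")
    if today - last_tightening > TIGHTENING_INTERVAL:
        overdue.append("connection tightening")
    return overdue


if __name__ == "__main__":
    # Example dates are invented for illustration only.
    print(overdue_tasks(date(2023, 1, 10), date(2021, 6, 1), date(2024, 3, 1)))
```

A report like this does not resolve the conflict between IT uptime and facility downtime, but it makes a deferred maintenance window visible instead of silent.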

An “Act of God” is something that is not reasonably predictable. Most disaster recovery plan activations occur for totally predictable reasons, things that have occurred multiple times in the recent past. Disaster recovery planning should begin with site selection, data center location within the building and site infrastructure design. Relatively cheap facility mitigation investments can reasonably assure never needing to implement your disaster recovery plan.

Ken Brill is executive director of the Uptime Institute.

 
