Limiting the effects of cloud computing outages

Contrary to popular belief, cloud services actually fail more often than internal datacenter facilities do. The cloud isn't inherently unreliable, but like all forms of IT, cloud services have to be selected carefully and managed to achieve specific reliability and availability goals. The steps can be contractual or technical, or may even involve rethinking your application architectures. Without careful consideration, you may get less from your cloud services than you expected.

SLAs mitigate risk of using cloud providers' datacenters

Protecting against cloud outages starts with assessing the reliability of cloud providers' datacenters. Most cloud providers have a small number of datacenters, often only one, and these datacenters are subject to the same kinds of failures as an enterprise datacenter. The most publicized cloud failures occur when an entire cloud datacenter fails, usually because of a natural disaster. To protect yourself in case of failure, you'll either have to ask for specific datacenter configuration information or obtain an availability guarantee from your provider. For server, storage and network reliability, the best strategy is to negotiate a service-level agreement (SLA) specifying the availability guarantee and the time to restore service if it's lost.

It's also important to understand whether a cloud datacenter is located in an area where natural disasters like hurricanes or blizzards are common. Find out if the datacenter has backup power and whether there's a backup datacenter that can pick up the load. The backup datacenter must be located in a different region from the primary one, so it's unlikely to be affected by the same conditions, and it must have enough capacity to handle failover of cloud applications. Since few providers maintain enough backup datacenter capacity for 100% failover of a primary datacenter, the SLA should indicate how failover is managed. It may be necessary to pay for priority in this situation.

If your cloud service includes geographic diversity to support a distributed user population, those diverse facilities may provide some protection against a cloud provider failure; check your contract carefully to ensure there's enough capacity to handle the additional load.
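To make an availability guarantee concrete before negotiating it, it can help to translate the percentage into hours of permitted downtime. The sketch below is only an illustration: the function names, the `coverage` parameter (the share of the primary's load a backup datacenter can absorb) and the assumption that the two sites fail independently are all introduced here, not taken from any provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service may be down and still meet the guarantee."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def failover_availability(primary: float, backup: float, coverage: float) -> float:
    """Rough load-weighted availability (as a fraction) with a backup site.

    Assumption: failures are independent, and when the primary is down the
    backup serves only `coverage` (0.0 to 1.0) of the primary's load.
    """
    return primary + (1 - primary) * backup * coverage


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% availability allows {allowed_downtime_hours(pct):.2f} hours/year of downtime")
    # Hypothetical figures: two 99.9% sites, backup sized for half the load.
    print(f"{failover_availability(0.999, 0.999, 0.5):.6f}")
```

A 99.9% guarantee, for instance, still permits about 8.76 hours of outage a year, which is why the time-to-restore clause matters as much as the headline percentage.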

Network performance -- or lack thereof -- leads to cloud outages

The most common cause of cloud failures is not the cloud at all, but the network. Most cloud computing applications are accessed via the Internet, and Internet availability problems cause most cloud computing outages. The only ways to address this are to move off the Internet to a virtual private network (VPN) or virtual local area network service, or to secure multiple Internet service providers (ISPs) for sites accessing cloud applications.

A VPN may be a good option if security and compliance issues can be addressed and contracted for by the provider. It will likely involve a special charge, unless the cloud provider already uses the carrier providing your VPN.

With Internet service costs falling for small businesses, it's practical to provide a branch office with two ISPs. However, ensure there are no common points of failure between the two ISPs. Peering points and interconnection "hotels" are often shared among providers. Even common access wiring between the ISPs will defeat the benefits of having dual network connections.
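The effect of a common point of failure on a dual-ISP setup can be shown with simple availability arithmetic. This is a minimal sketch under stated assumptions: the function name and the figures are hypothetical, failures are treated as independent, and the `shared` parameter stands for anything both connections depend on, such as common access wiring or a shared peering point.

```python
def dual_isp_availability(isp_a: float, isp_b: float, shared: float = 1.0) -> float:
    """Availability of two ISP links in parallel, in series with a shared element.

    All arguments are fractions between 0.0 and 1.0. `shared=1.0` means the
    two paths have nothing in common.
    """
    either_link_up = 1 - (1 - isp_a) * (1 - isp_b)
    return shared * either_link_up


if __name__ == "__main__":
    # Hypothetical figures: two ISPs at 99% each.
    print(f"fully diverse paths: {dual_isp_availability(0.99, 0.99):.4f}")
    # The same ISPs, but both run over access wiring that is up 99.5% of the time.
    print(f"shared access wiring: {dual_isp_availability(0.99, 0.99, 0.995):.4f}")
```

However good the two ISPs are individually, the combined result can never exceed the availability of the element they share.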

Cloud application resilience must be addressed

If cloud datacenter and cloud network failures have been addressed, the next question is the resilience of the applications themselves. The greatest problems in managing high availability with cloud services involve database access and reliable transaction processing.

When a datacenter fails, the data stored there is unavailable, even if another datacenter can back up the applications using the data. Unless application data is maintained in "hot standby" form in multiple locations, a failure will result in loss of data access, which makes other redundancy measures largely ineffective. This same problem exists for internal datacenter backup, so companies that have provided their own datacenter redundancy may find the same procedures will work in the cloud. This is less of a technical question than a financial one, though; the cost of maintaining redundant data in the cloud is higher because of cloud storage and access charges. A better solution may be to house all your data on-premises in a high-availability, protected datacenter, and access it from multiple cloud locations.

The best availability management will have to be integrated with the applications themselves. Any time database updates are made to multiple copies at the same time, there's a risk of loss of data integrity if a failure occurs during the update process. Online transaction processing systems usually include a "two-phase commit" process to back out transactions that don't update all database copies successfully. Sometimes a network failure leaves even single-database updates in an uncertain state. It's essential to review application designs to ensure failures of the network or of a datacenter where databases are stored won't create a risk of contaminated or inconsistent data.

It's not unreasonable to expect cloud applications to be as reliable as on-premises applications, or more so. But reliability, and the specific goals you set for it, are likely to cost you. Remember to consider reliability costs when building your cloud business case, or you may find your applications will have to trade ad hoc between reliability and cost.
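For readers reviewing application designs, the two-phase commit idea mentioned above can be sketched in a few lines. Everything here is hypothetical and in-memory: `Replica` stands in for one database copy and `two_phase_update` for the coordinator. A production implementation would also need durable logs, timeouts and recovery from coordinator failure, none of which is shown.

```python
from typing import Optional


class Replica:
    """In-memory stand-in for one database copy (hypothetical)."""

    def __init__(self, name: str, reachable: bool = True):
        self.name = name
        self.reachable = reachable
        self.data: dict = {}
        self._staged: Optional[dict] = None

    def prepare(self, changes: dict) -> bool:
        """Phase 1: stage the changes and vote yes, or vote no if unreachable."""
        if not self.reachable:
            return False
        self._staged = dict(changes)
        return True

    def commit(self) -> None:
        """Phase 2: make the staged changes visible."""
        if self._staged is not None:
            self.data.update(self._staged)
            self._staged = None

    def rollback(self) -> None:
        """Phase 2 (abort): discard anything staged."""
        self._staged = None


def two_phase_update(replicas: list, changes: dict) -> bool:
    """Apply `changes` to every replica, or to none of them."""
    # Phase 1: every copy must vote yes before anything becomes visible.
    votes = [replica.prepare(changes) for replica in replicas]
    if all(votes):
        for replica in replicas:
            replica.commit()
        return True
    # At least one copy could not prepare: back the transaction out everywhere.
    for replica in replicas:
        replica.rollback()
    return False


if __name__ == "__main__":
    east = Replica("east")
    west = Replica("west", reachable=False)  # simulate a failed datacenter
    ok = two_phase_update([east, west], {"order-17": "paid"})
    print(ok, east.data)  # the update is refused and east is left unchanged
```

The design point is that no copy exposes the new data until every copy has confirmed it can, so a datacenter or network failure mid-update leaves the copies consistent with each other.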
