Amazon blames generator-stabilization issue for outage
Generator fails to stabilize before UPS energy runs out.
Article by Yevgeniy Sverdlik
The power outage at an Amazon Web Services datacenter in Virginia that caused service outages for many online businesses – including popular ones like Netflix and Pinterest – last week was caused by a back-up-generator failure, the company said in a summary of the incident posted on the website of Amazon’s cloud-services business.
Once the power outage hit, triggered by a large-scale electrical storm that had gone through northern Virginia on 29 June, electrical load in one of more than 10 Amazon datacenters in the region failed to transfer to generator backup. The uninterruptible power supply (UPS) serving that particular circuit eventually ran out of battery power, and servers began losing power around 8pm PDT.
While promising to take steps to repair the equipment involved, lengthen failover time and expand around-the-clock engineering staff, Amazon said it would not replace the offending generator.
“Prior to installation in this facility, the generators were rigorously tested by the manufacturer,” the company said. “At datacenter commissioning time, they again passed all load tests (approximately 8 hours of testing) without issue.”
The storm affected the US East-1 Region of Amazon’s cloud infrastructure, where datacenters support multiple “availability zones”. Placed in distinct locations, the zones are engineered to isolate failure from each other, according to Amazon.
Cloud server instances, storage volumes, and database and load-balancer instances hosted by the affected datacenter represent a “single-digit percentage” of the total capacity hosted in the region. Still, the impact of the outage was widespread and two-fold: instances and volumes became unavailable, and performance degraded on the control planes, which customers use to make changes to their cloud-based infrastructure.
While control-plane performance has no impact on uptime of the existing infrastructure, it is important during outages when customers want to move resources out of a problematic availability zone into another.
Last month, the US East-1 region had another major outage that brought down a number of popular web services, including Pinterest, Heroku, Quora and Foursquare. The company eventually traced the root cause of that outage to a cable fault in the utility power-distribution system, but said a back-up generator in a datacenter had powered off because of a defective cooling fan.