Microsoft: Uptime magic happens in software

The pace at which the world’s demand for datacenter capacity is growing has made the traditional approach to datacenter availability, which relies on hardware and infrastructure redundancy, unsustainable for many companies. This trend gave birth to the concept of the “software-defined datacenter,” where the physical infrastructure is simple and treated as a commodity, while the failover mechanisms are written in software.
One of the companies that have made the switch is Microsoft. Its director of datacenter architecture and design, David Gauthier, wrote about this shift in a blog post.
Microsoft abandoned the hardware-centric approach to availability in 2008. Until then, the company’s infrastructure team built highly available online services on top of highly available (and highly redundant) hardware.
“The software developer could effectively treat the hardware as always available and we made capital investments in redundancy to protect against any imaginable physical failure condition,” Gauthier wrote.
Demand for cloud services grew so fast, however, that the team realized staying the course would soon become untenable. The complexity and cost of the traditional approach would become prohibitive.
Since it launched its first datacenter in 1989, Microsoft has spent more than US$15bn on its infrastructure, according to Gauthier.
Today, the company’s physical infrastructure is designed to provide abstraction pools for resources like compute and storage. These pools have been optimized using total-cost-of-ownership and time-to-recover availability models.
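To make the idea of a time-to-recover availability model concrete, here is a minimal sketch. It is not Microsoft's model, which the blog post does not describe in detail. It uses the textbook steady-state formula (availability equals mean time between failures divided by the sum of mean time between failures and mean time to recover) and assumes replicas fail independently. The function names and all numbers are hypothetical, chosen only for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component.

    mtbf_hours: mean time between failures (assumed input).
    mttr_hours: mean time to recover (assumed input).
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


def replicated_availability(single: float, replicas: int) -> float:
    """Availability of a service that stays up while at least one of
    `replicas` copies is up, assuming failures are independent."""
    return 1.0 - (1.0 - single) ** replicas


# Hypothetical figures for one commodity node, for illustration only.
commodity = availability(mtbf_hours=1000.0, mttr_hours=10.0)

print(f"single commodity node: {commodity:.4f}")
print(f"three replicas with software failover: "
      f"{replicated_availability(commodity, 3):.6f}")
```

Under these assumptions, shortening recovery time or adding software-managed replicas raises service availability without making any individual machine more reliable, which is the trade-off such models are meant to expose.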
“These abstractions are advertised as cloud services and can be consumed as capabilities with availability attributes,” Gauthier wrote. “Developers are incented to excel against constraints in latency, instance availability and workload cost performance.”
This is radically different from the traditional model, where datacenter managers are solely responsible for availability. The new approach puts responsibility on both datacenter staff and developers.
Development operations teams across the company spend a lot of time on incident triage, bug fixes and disaster scenarios. Software developers get a lot of data from the operations teams to help them write code that will maintain high service availability.
“Of course, software isn't infallible, but it is much easier and faster to simulate and make changes in software than physical hardware,” Gauthier wrote.