Failing Over an Entire Data Center: No Biggie

Racks of servers in a shipping container can quickly add to a data center's computing capacity. And according to Microsoft, they can also help out with reliability.

David Gauthier, Microsoft's director of data center architecture and design, described in a blog post how the containerized design in use since 2009 has evolved into a design principle shared among the company's new data centers, with each container treated as a single functional entity. If there's a power failure or loss of connectivity, Microsoft is prepared to shift a workload from one container (with thousands of servers running within it) to another, or fail over the entire data center to another data center.

Gauthier said that scenario has happened before: in 2011, an offsite lightning strike cut power to the Chicago facility. As planned, the facility’s workloads transferred to another facility, then back to Chicago after repairs had been made and the power restored: “Aside from a slight increase in network latency, the applications never missed [a] beat.”

Microsoft's 700,000-square-foot Chicago data center houses an undisclosed number of containers, each of which contains thousands of identical servers. Each container can be installed in as little as one day. Gauthier said that the containers were not optimized for the typical Mean Time Between Failures (MTBF) metric, but for Mean Time to Recover (MTTR): if a container failed, it should fail fast.

Gauthier added that Microsoft had eliminated two key elements from the data center: generators and redundant power supplies. "By treating a container (and the thousands of servers inside it) as a discrete failure domain and unit of capacity, we could forego the second power supply and redundant power source that most data centers deliver to every server," Gauthier wrote. "If a container is taken offline due to planned maintenance or an unplanned failure, it is completely isolated from the other containers and workloads are transitioned to a functioning container while maintaining customers' service level agreements (SLAs)."
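The failure-domain idea is easy to sketch in code. The snippet below is a minimal illustration (not Microsoft's actual orchestration software; the `Container` class, capacities, and workload names are invented for the example): when a container goes offline, every workload it was hosting is moved to another healthy container with spare capacity.

```python
# Illustrative sketch of container-as-failure-domain failover.
# All names and numbers here are hypothetical, for demonstration only.

class Container:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity   # max workloads this container can host
        self.healthy = True
        self.workloads = []

def fail_over(containers, failed):
    """Mark `failed` offline and move its workloads to healthy containers."""
    failed.healthy = False
    orphaned, failed.workloads = failed.workloads, []
    for w in orphaned:
        # Find any healthy container with spare capacity.
        target = next(
            (c for c in containers
             if c.healthy and len(c.workloads) < c.capacity),
            None)
        if target is None:
            raise RuntimeError(f"no spare capacity for workload {w!r}")
        target.workloads.append(w)

a = Container("container-a", capacity=4)
b = Container("container-b", capacity=4)
a.workloads = ["web-frontend", "db-replica"]

fail_over([a, b], a)   # container-a goes dark; its workloads land on b
print(b.workloads)     # → ['web-frontend', 'db-replica']
```

The same pattern scales up a level: if every container in a facility is unreachable, the "containers" in the search become entire data centers, which is essentially the Chicago-to-backup failover Gauthier describes.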

For the same reason, Microsoft also decided to eliminate the use of diesel-powered generators from its container bay floor: the same kind of diesel generators that kept New York City-based data centers operational during Hurricane Sandy in 2012.

The secret, Gauthier wrote, isn’t hardware, but software. By building software resiliency within Microsoft’s own processes, the cloud as a whole can be used to eliminate some components of the traditional hardware safety net.

Microsoft has also offered more detail about its Dublin, Ireland facility, which is cooled with the region's temperate, moist air. As Microsoft has described previously, the Boydton facility combines elements of both the Dublin and Chicago data centers: Microsoft's IT Pre-Assembled Components, or ITPACs, sit outdoors, where the container walls are exposed to rain, sleet, and snow. The ITPACs are organized along a central linear spine with a covered breezeway, facilitating personnel access.

Designing a naturally cooled server is one thing; eliminating diesel generators is another. And failing over an entire data center? That takes guts, and confidence in the infrastructure.

Image: Microsoft
