A pair of power outages took hosting provider DreamHost down for the better part of two days, angering thousands of customers. Meanwhile, a frustrated DreamHost posted dozens of updates, including a note from its chief executive.
And it wasn’t even DreamHost’s fault: a third party, Alchemy Communications, which manages its data center, was to blame.
DreamHost first posted about the problem on March 19, when its Irvine, Calif., data center suffered a massive power outage that took down the entire facility. DreamHost said that the power was restored by 3 PM that day, before going out again. Fortunately, DreamHost engineers had some practice restoring the affected equipment.
DreamHost was scheduled to start the transition back to UPS (uninterruptible power supply) power and normal Irvine municipal power last night at 10 PM, concluding at 1 AM. But while DreamHost promised to update the post with a final status report, it so far hasn’t delivered.
So what happened?
After the initial power outage, DreamHost worked to bring the affected systems online. That process took about two hours, as the company’s engineers worked through their checklist. At 5:55 PM, however, the company discovered that a routing table was experiencing issues. The culprit: two internal routers, the primary and the spare, were fried during the outage and needed to be returned to the vendor. A third, fresh router took about two hours to deploy. Alchemy also initiated “emergency maintenance” on its UPS, which would become important later on.
From 9:15 PM through about midnight, DreamHost continued to bring its services online, addressing problems as it went. By 3:50 AM, DreamHost reported that all of its services had been restored.
An hour later, however, disaster struck: the power went out again, and was restored by 5 AM, when DreamHost’s irritation began showing. “We understand the frustrations this is causing, as we are in the same boat as you,” the company said. “Please hang in there, and support us as we work to get everything up from this second data center power outage. We sincerely apologize for the inconvenience.”
By 6:35 AM, 99 percent of shared hosting, 95 percent of VPS and 99 percent of the company’s dedicated machines had been restored, but its DreamObjects cloud/object storage service remained offline until about 11:20 AM, when it began serving some content.
Simon Anderson, the chief executive of DreamHost, said at 7 PM on March 20 that Alchemy was to blame. “We believe, at this time, that Alchemy was performing unannounced maintenance on their UPS systems and the systems failed—resulting in a complete power outage,” he wrote in an update to the status blog post. “In addition to their UPS systems failing, their generators did not kick in.”
DreamHost later said that it, Alchemy and Emerson Power all believed a core mechanical circuit breaker in the UPS system was the point of failure in both power outages. By 11 PM Thursday, DreamHost had returned to city power with no problems, subject to the maintenance window.
It’s certainly a black eye for Alchemy, in which DreamHost owns a large equity stake. But kudos to DreamHost: although the company’s status thread generated over 1,000 comments, many of them irate, the company also received several compliments about its open channels of communication.