By this point, pretty much everyone who spends time online is used to the occasional email crash. (If anything, it’s a good excuse to step away from the screen and refill that coffee mug.) But it was altogether odd when users began reporting a widespread Gmail outage Dec. 10—just as Google’s Chrome browser also seemed to suffer from some sort of downtime.
Were the two events connected? Was this a once-in-a-lifetime coincidence, perhaps even a harbinger of the Mayan apocalypse?
Late Dec. 10, Google engineer Tim Steele took to the Chromium developer forum to explain that, yes, the outages went together: Google had enacted a “faulty load balancing configuration change” to a core piece of backend infrastructure “that many services at Google depend on.” Those services evidently included Gmail and Chrome.
That faulty configuration change caused problems for a “backend infrastructure component” used by the Chrome Sync Server to enforce quotas on sync traffic; confronted with that component’s failure, the server forced clients to throttle all data types, according to Steele, “without accounting for the fact that not all client versions support all data types.” That faulty logic for throttling data sparked the crash.
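The failure mode Steele describes can be illustrated with a minimal sketch. It is not Google's actual code: the data type names, the `Client` class, and both throttling functions are hypothetical, chosen only to show how a blanket "throttle everything" fallback differs from one that accounts for what each client version supports.

```python
from dataclasses import dataclass

# Hypothetical data types; the real sync server's types are not listed in the source.
ALL_DATA_TYPES = frozenset({"bookmarks", "preferences", "passwords", "tabs"})


@dataclass(frozen=True)
class Client:
    version: int
    supported_types: frozenset  # data types this client version understands


def throttle_naive(client: Client, quota_service_up: bool) -> frozenset:
    """Fallback as described: if the quota component fails, throttle every type."""
    if quota_service_up:
        return frozenset()
    return ALL_DATA_TYPES  # ignores what the client actually supports


def throttle_version_aware(client: Client, quota_service_up: bool) -> frozenset:
    """Assumed safer fallback: throttle only the types this client supports."""
    if quota_service_up:
        return frozenset()
    return ALL_DATA_TYPES & client.supported_types


def unknown_types(client: Client, throttled: frozenset) -> frozenset:
    """Types the server mentioned that the client has never heard of."""
    return throttled - client.supported_types


old_client = Client(version=1, supported_types=frozenset({"bookmarks", "preferences"}))

naive = throttle_naive(old_client, quota_service_up=False)
aware = throttle_version_aware(old_client, quota_service_up=False)

print(sorted(unknown_types(old_client, naive)))  # types the old client cannot handle
print(sorted(unknown_types(old_client, aware)))  # empty: nothing unexpected is sent
```

In the naive version, an older client receives throttle instructions for data types it does not recognize, which is the kind of mismatch Steele's explanation points to. The version-aware variant is one assumed way to avoid it, not a description of the fix Google shipped.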
While relatively short, the outages nonetheless illustrate the challenges facing even an IT giant like Google as it attempts to serve the data needs of millions of users. No matter how flexible and resilient the underlying IT infrastructure, even something relatively minor (like faulty logic in one module of one piece of backend infrastructure) can cascade into a much larger outage. While outages are an expected part of life for online services, the key is for companies to bring those services back online as soon as possible, or risk the ire of millions of Web denizens.
This isn’t the first time this year that Google has experienced an outage, particularly with regard to Gmail. However, Google has become reluctant in recent years to share much detail about the root causes behind outages, unless something truly unusual is afoot, in which case an explanation usually surfaces on developer blogs.