Google’s Spanner Database Offers Innovation Not Easily Replicated

Google’s Spanner relies on a combination of atomic clocks and GPS for its vital TrueTime API.

Preserving data in the face of unexpected disaster, reducing latency when retrieving data, scaling datasets to enormous size: these are the issues that give developers massive headaches.

As a provider of cloud-based services and one of the largest data-center owners in the world, Google wrestles with these issues on an epic scale. In order to solve them all, some engineers within the company have been working on Spanner, described in a recent research paper as a “scalable, multi-version, globally-distributed, and synchronously-replicated database.” (Hat tip to GigaOM and some other publications for finding the paper and posting it online.)

“Spanner is a scalable, globally-distributed database designed, built, and deployed at Google,” reads the beginning of that paper. “Spanner is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows.”

That’s pretty big. What kind of task requires that level of database firepower? “Applications can use Spanner for high availability, even in the face of wide-area natural disasters, by replicating their data within or even across continents,” the paper continues.

But that sort of massive, globally distributed infrastructure is difficult to replicate for any company that’s not Google. “I don’t predict a Spanner clone will land as an Apache incubator anytime soon,” Cloudant chief scientist and co-founder Mike Miller wrote in an upcoming blog post. “The primary reason is that Spanner’s largest design innovation leverages special hardware. Specifically one must install a coordinate network of GPS and atomic clocks in each participating datacenter.”

Spanner’s secret sauce, so to speak, is its TrueTime API, which references GPS and atomic clocks to reduce time uncertainty among distributed systems.

Miller believes that the TrueTime API allowed Google to overcome a particularly difficult problem with regard to distributed systems. “The previous dogma in distributed systems was that synchronizing time within and between datacenters is insurmountably hard and uncertain,” he wrote. “Ergo, serialization of requests is impossible at global scale. Google’s key innovation is to accept uncertainty, keep it small (atomic clocks / GPS), quantify the uncertainty and operate around it. In retrospect this is obvious, but it doesn’t make it any less brilliant.”
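
To make the "quantify the uncertainty and operate around it" idea concrete, here is a minimal Python sketch. It is not Google's code or the real TrueTime API: the class and function names (`UncertainClock`, `commit_with_wait`), the fixed `epsilon_s` error bound, and the use of the local system clock are all assumptions made for illustration. It only mirrors the general shape described in the Spanner paper, where the clock returns an interval rather than a single instant, and a writer waits out the uncertainty before treating a timestamp as safely in the past.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """A time reading with explicit bounds: true time lies in [earliest, latest]."""
    earliest: float
    latest: float


class UncertainClock:
    """Hypothetical stand-in for a TrueTime-like clock.

    epsilon_s is an assumed, fixed bound on clock error in seconds. In a real
    deployment the bound would be derived from GPS and atomic-clock references.
    """

    def __init__(self, epsilon_s, source=time.time):
        self.epsilon_s = epsilon_s
        self._source = source

    def now(self):
        t = self._source()
        return TimeInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, timestamp):
        # True only once the timestamp has definitely passed.
        return self.now().earliest > timestamp

    def before(self, timestamp):
        # True only if the timestamp has definitely not arrived yet.
        return self.now().latest < timestamp


def commit_with_wait(clock, sleep=time.sleep):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = clock.now().latest
    while not clock.after(commit_ts):
        sleep(max(clock.epsilon_s / 4, 1e-4))
    return commit_ts


if __name__ == "__main__":
    clock = UncertainClock(epsilon_s=0.005)  # assume 5 ms of uncertainty
    started = time.time()
    ts = commit_with_wait(clock)
    print(f"timestamp {ts:.6f} became safe after {time.time() - started:.4f} s")
```

In this sketch the wait lasts roughly twice the assumed error bound, which illustrates why keeping the uncertainty small matters: the tighter the bound, the less time each write spends waiting.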

Miller sees Spanner as a good fit for workloads that are light on write transactions and heavy on reads, as well as for applications that can accept mean latencies in the 1-100 ms range with large tails. Spanner was first used for F1, a rewrite of Google’s advertising backend, and its huge size and distributed software probably mean that its model won’t be widely replicated anytime soon.


Image: Zvonimir Atletic/Shutterstock