Facebook stores its warehouse data in a set of enormous Hadoop/HDFS-based clusters. That helps the social network wrestle with the enormous amounts of user information it needs to store and analyze every day; but at a certain point (namely, once the warehouse grew to petabyte scale), its network administrators decided they needed something other than Hadoop MapReduce and Hive to process that data in a fully optimized way.
Enter Presto, Facebook’s very own distributed SQL query engine designed with a focus on speed. The platform supports standard ANSI SQL, which means it’s capable of everything from complex queries and aggregations to joins and window functions. Presto also boasts scalability and flexibility; for example, with the addition of key plugins, it can handle Facebook data not stored in HDFS clusters, such as HBase and custom systems. It doesn’t rely on MapReduce for processing, which allows it to process queries at speed.
Facebook engineers began developing Presto near the end of 2012 and rolled it out to the entire company the following spring; employees currently use it to process roughly 30,000 queries (totaling around one petabyte of data) per day. And now that the platform’s stable enough, Facebook is open-sourcing it via Github and a dedicated Website.
“We are actively working on extending Presto functionality and improving performance,” the Presto team posted on Facebook. “In the next few months, we will remove restrictions on join and aggregation sizes and introduce the ability to write output tables. We are also working on a query ‘accelerator’ by designing a new data format that is optimized for query processing and avoids unnecessary transformations.” In theory, that will open up “hot subsets” of data to caching from backend datastores, as well as the system using cached data to transparently accelerate queries. A high-performance HBase connector is also in the works.
This isn’t Facebook’s first venture into custom architecture: it’s built custom tools such as Corona, a scheduling framework with a built-in cluster manager for tracking nodes in clusters and the free resources in play across the network. But when it comes to actually creating insights from all that data, Facebook still relies on people power: not only does it hire PhDs with extensive background in analytics (and good business savvy), it also runs “camps” where employees are taught the intricacies of data analysis.