Google’s BigQuery vs. Hadoop: A Matchup

Google’s BigQuery.

Ready to “Analyze terabytes of data with just a click of a button”? That’s the claim Google makes with its BigQuery platform. The accompanying BigQuery webpage offers two case studies; one of them features a gaming company that found Hadoop too slow and costly for crunching massive amounts of data before BigQuery came along to save the day.

But is BigQuery really an analytics superstar? It was unveiled in beta back in 2010, and it recently gained some improvements, such as the ability to do large joins. We’ll compare it to some other analytics and OLAP tools, which should give some additional context to anyone who’s thinking of using BigQuery or a similar platform to analyze data.

Google BigSecrets

Google keeps much of its software on its own servers without releasing any code; even when it publishes papers describing a process, it typically doesn’t offer up the code itself. Services (when made available) are usually watered-down versions of the “real” software Google deploys behind the scenes.

Google hasn’t released many details on what lurks under BigQuery’s hood, but it’s pretty clear the software is built atop yet another internal product known as Dremel. (Google has released research papers describing Dremel, but hasn’t released it as a product.) Dremel was built to help the search-engine giant perform data-intensive tasks such as spam analysis, ad service, and OCR analysis in Google Books.

Google launched Dremel seven years ago, an eternity in the tech industry. The company has certainly been improving on it, and on BigQuery, ever since. Whatever version is available to the public, the internal version is likely that much more advanced and powerful.

Despite its secrecy, Google’s experiments with data storage and analytics have influenced the whole industry. In 2003, Google released a paper describing its internal Google File System (the paper is available online and provides a great amount of detail). It followed that up in 2004 with a paper on MapReduce; a paper on BigTable came later, in 2006. The file-system and MapReduce work allowed developers outside of Google to build the first version of Apache Hadoop in 2005.

An open-source framework, Hadoop has become increasingly popular over the past few years as a way for organizations to crunch massive amounts of data stored on large hardware clusters. Developers and companies can deploy Hadoop on their own infrastructure, or run it via the cloud (Amazon’s EC2 is a popular option).

With Hadoop on the rise, Google has moved forward with its own internal work. In a 2012 presentation, Google principal engineer Andrew Fikes suggested that Google File System—the basis for Hadoop—was only Google’s first cluster-level file system. Based on lessons learned from Google File System, the company had created a next-generation cluster-level file system known as Colossus. (Hadoop now offers some Colossus-level features, including Reed-Solomon encoding.)

There are also attempts underway to make Hadoop faster. Take Cloudera, for instance, and its open-source Impala project. Impala is billed as a real-time analysis system; in a blog post, Cloudera even gave a shout-out to Dremel as the inspiration for the initiative. So while Google argues that BigQuery is superior to Hadoop when it comes to delivering speedy answers, it’s clear that the latter will provide competition for some time to come.

The Matchup

To use Google’s BigQuery, you have to sign up with a credit card. It’s free up to a point, but after that you have to pay. Compare that to the likes of Hadoop and Impala, which are open source but require the user to pay for the server hardware (or at least the server time).

Using BigQuery is quite easy; there’s a full API accessible through a REST interface. You can also try out queries right from the Web interface. The queries don’t run instantly; one of the samples took 3.3 seconds to grind through 3.49 gigabytes of data, which works out to roughly 1 gigabyte per second. But that’s clearly fine for quick lookups.
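To give a feel for what a call to the REST interface looks like, here is a minimal sketch in Python that builds (but does not send) a query request. It targets the `projects/{projectId}/queries` endpoint of the BigQuery v2 API; the project ID, the access token, and the helper name `build_query_request` are placeholders invented for this illustration, and the sample table is assumed to be available as a public dataset.

```python
import json
import urllib.request

# Root of the BigQuery v2 REST API.
API_ROOT = "https://www.googleapis.com/bigquery/v2"


def build_query_request(project_id, sql, access_token):
    """Build a POST request for a synchronous query (hypothetical helper).

    Nothing is sent over the network here; pass the result to
    urllib.request.urlopen() once you have a real OAuth 2.0 token.
    """
    url = f"{API_ROOT}/projects/{project_id}/queries"
    body = json.dumps({"query": sql, "useLegacySql": False}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Assumed sample table; swap in any table your project can read.
SQL = (
    "SELECT word, SUM(word_count) AS total "
    "FROM `bigquery-public-data.samples.shakespeare` "
    "GROUP BY word ORDER BY total DESC LIMIT 5"
)

# "my-example-project" and "PLACEHOLDER_TOKEN" are stand-ins, not real credentials.
request = build_query_request("my-example-project", SQL, "PLACEHOLDER_TOKEN")
print(request.get_method(), request.full_url)
```

Sending the request requires a billing-enabled project and a valid token, which is where the credit card mentioned earlier comes in.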

There are also several client-side libraries available for the usual suspects (Java, .NET, PHP, Python), as well as others such as Dart and Go. Most of the libraries are still in beta, and a couple are only in alpha. The Java one has several examples to help you get going. It works well, and it’s pretty fast.

Google calls BigQuery an Online Analytical Processing (OLAP) system. Is it? Tools such as Jaspersoft and Pentaho are incredibly easy to use, and they let you drill down into your data using OLAP cubes. Unlike those tools, BigQuery does not offer drill-down out of the box. At the heart of OLAP cubes are sophisticated joins, and you can perform such joins in BigQuery, but you have to piece them together yourself using its own dialect of SQL. And that’s SQL, not MDX, the query language typically used with OLAP cubes.
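Piecing a drill-down together yourself amounts to joining a fact table to a dimension table and grouping at progressively finer levels. The sketch below shows the idea in Python, with the standard-library sqlite3 module standing in for BigQuery so that it runs anywhere; the `sales` and `stores` tables, their contents, and the `rollup` helper are all hypothetical.

```python
import sqlite3

# In-memory database standing in for BigQuery (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT, city TEXT);
    CREATE TABLE sales  (store_id INTEGER, amount INTEGER);
    INSERT INTO stores VALUES
        (1, 'West', 'Seattle'), (2, 'West', 'Portland'), (3, 'East', 'Boston');
    INSERT INTO sales VALUES (1, 100), (1, 50), (2, 70), (3, 30);
""")

ALLOWED_LEVELS = {"region", "city"}


def rollup(levels):
    """Total sales grouped by the given dimension columns.

    Calling it with ['region'] and then ['region', 'city'] is the manual
    equivalent of drilling down one level in an OLAP cube.
    """
    if not levels or not set(levels) <= ALLOWED_LEVELS:
        raise ValueError(f"levels must be drawn from {sorted(ALLOWED_LEVELS)}")
    cols = ", ".join(f"st.{name}" for name in levels)
    sql = (
        f"SELECT {cols}, SUM(sa.amount) AS total "
        "FROM sales AS sa JOIN stores AS st ON sa.store_id = st.store_id "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()


by_region = rollup(["region"])
by_city = rollup(["region", "city"])
print(by_region)  # [('East', 30), ('West', 220)]
print(by_city)    # [('East', 'Boston', 30), ('West', 'Portland', 70), ('West', 'Seattle', 150)]
```

An OLAP front end generates queries like these for you as you click through a cube; with a bare SQL database, writing them is your job.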

Behind any good OLAP tool is a good database. Jaspersoft has actually taken steps to integrate its products with BigQuery; it released a connector last summer, for example.

BIME is another example of a true OLAP tool that works in conjunction with BigQuery. BIME has put together an impressive set of tools that let you build dashboards, along with some demos.

But BigQuery doesn’t really compete with these products at all. It’s not a true OLAP tool in the sense most people mean by the term; it’s a huge, scalable database that can be used in conjunction with actual OLAP tools, provided those tools offer options for using BigQuery on the backend.

Conclusion

You don’t have to choose between BigQuery and OLAP tools. Jaspersoft has a connector that ties into Google BigQuery, if you choose to go that route. BIME provides access to BigQuery via a set of tools for designing impressive dashboards that are quite easy to use.

In the end, BigQuery is just another database. It can handle massive amounts of data, but so can Hadoop. It’s not free, but neither is Hadoop once you factor in the cost of the hardware, support, and the paychecks of the people running it. The public version of BigQuery probably isn’t even used by Google, which likely has something bigger and better that we’ll see in five years or so.

 

Image: Google
