Google has released a preview version of a Google Cloud Storage connector for Hadoop, which lets users run the framework directly against data held in Google Cloud Storage rather than in a cluster-managed HDFS.
In theory, that could allow users to focus more on crunching data than managing a cluster-and-file backend. “In the 10 years since we first introduced Google File System (GFS)—the basis for Hadoop Distributed File System (HDFS)—Google has continued to improve our storage system for large data processing. The latest iteration is Colossus,” read a Jan. 14 note on the Google Cloud Platform Blog. “Using a simple connector library, Hadoop can now run directly against Google Cloud Storage—an object store built on Colossus.”
Apache Hadoop, the open-source framework for running data applications on large hardware clusters, experienced a surge of popularity in 2013. That year, a number of prominent firms (such as Intel) released their own Hadoop distributions, and Hadoop 2.2.0 reached its general-release milestone. (In addition to more integration with other open-source projects, v2.2.0 included YARN, a general-purpose resource management system that makes it easier to run processing frameworks other than MapReduce on the same cluster.)
Google claims that running Hadoop in conjunction with Google Cloud Storage confers a number of additional benefits, including greater availability and scalability (“Google Cloud Storage is globally replicated,” the blog posting claims, “and has higher availability than HDFS because it’s independent of the compute nodes and the NameNode”), as well as minimal storage-management overhead (sayeth the posting: “Whereas HDFS requires routine maintenance—like file system checks, rebalancing, upgrades, rollbacks and NameNode restarts—Google Cloud Storage just works”).
And Google wouldn’t be Google without touting the interoperability between its various software platforms: “By keeping your data in Google Cloud Storage, you can benefit from all of the other Google services that already play nicely together.” (Some potential clients might take serious issue with handing over all their data and associated processes to the search-engine giant, however.)
Complicating matters on a corporate level is Google’s own data-analytics platform, BigQuery, which the company claims can analyze terabytes of data with just a few mouse clicks. The BigQuery webpage positions the platform as superior to Hadoop. But with Hadoop such a force within the data-analytics community, Google has little choice but to facilitate its use in conjunction with the company’s other products.