When it rains, it pours: not to be outdone by seemingly every other IT vendor on the planet deciding to release an Apache Hadoop-based analytics package, EMC has announced Pivotal Hadoop Distribution (or Pivotal HD), which integrates the popular data-crunching framework with EMC Greenplum’s massively parallel-processing (MPP) database technology.
The open-source Hadoop framework specializes in running data applications on large hardware clusters, making it a particular favorite among firms with a lot of backend infrastructure (and a whole ton of data) to manage, such as Facebook and IBM. Studies indicate that Hadoop’s popularity will only increase over the next few years.
Pivotal HD leverages HAWQ, which EMC describes as the culmination of ten years’ worth of research into data management and related areas. HAWQ is a parallel SQL database layer atop the Hadoop Distributed File System (HDFS); EMC claims it will allow Pivotal HD to process query workloads dramatically faster than other vendors’ SQL-like services layered atop Hadoop.
Unlike those services, HAWQ is integrated with HDFS as a single system, which spares users from having to shuttle data between multiple systems or rely on connectors that require the same datasets to be stored in multiple locations. A platform that can make SQL queries work seamlessly with Hadoop, of course, would prove invaluable for companies trying to gain insight from different types of data spread across multiple systems.
As part of extending HDFS, HAWQ includes features such as Dynamic Pipelining, a query optimizer, support for common Hadoop formats, and programmable analytics.
While Apache Hadoop is open-source, it remains an open question whether EMC will open up HAWQ and the other technologies layered atop it. Those technologies give EMC something of a competitive advantage (at least among those organizations that like to crunch their data with a little more speed), so it seems likely that EMC will do its level best to keep at least some portions of Pivotal HD proprietary.
EMC and its subsidiaries face significant competition in the Hadoop arena. Cloudera, for example, already offers Impala, an Apache-licensed query engine for data stored in HDFS and HBase; earlier this week, the company also released more auditing and management add-ons for its Hadoop platform. At almost the same moment, Hortonworks released its new Hortonworks Data Platform for Windows, which it claims is the first Hadoop-based platform capable of interoperability across Windows, Linux, and Windows Azure.
But it was Intel that made perhaps the biggest Hadoop-related announcement of the week, with news that it had produced its own Hadoop distribution, built “from the silicon up” to efficiently access and crunch massive datasets. The distribution takes advantage of Intel’s work in hardware, including the Intel Advanced Encryption Standard New Instructions (Intel AES-NI) in the Intel Xeon processor; the company also claims its distribution can analyze data at superior speeds.
Of course, every IT vendor dealing in Hadoop claims “superior speeds” for its particular offering. Time will tell whether companies prefer EMC’s take on the framework.