Cloudera is using the Strata + Hadoop World event in New York City to unveil Impala, an Apache-licensed query engine for data stored in HDFS (Hadoop Distributed File System) and HBase, which it’s packaging for clients as Cloudera Enterprise RTQ.
Cloudera claims the new platform, which is entering public beta, can process queries 10 to 30 times faster than Hive/MapReduce. While Cloudera’s marketing materials refer to that sort of processing speed as “real time” and “speed of thought,” the company’s chief architect suggests that “real time” in data analytics is better framed as “waiting less.”
“It’s when you sit and wait for it to finish, as opposed to going for a cup of coffee or even letting it run overnight,” Doug Cutting, who co-wrote the Hadoop framework, said in an interview. “That’s ‘real time.’”
The extra speed is all the more impressive when you consider how companies are wrestling with more in-house data than ever before. But those epic datasets also create sizable backend issues, especially with regard to latency.
Many IT vendors building software platforms for storing and managing data have looked to Google, which has developed groundbreaking data systems as a matter of necessity. Cutting cited Google’s research papers on Spanner (a globally distributed, synchronously replicated database) and Dremel (a massively scalable query system) as inspiration for developers looking for innovative ways to handle data.
Apache Hadoop isn’t the only open-source framework for running data applications on large hardware clusters, but it’s become the clear favorite of many organizations such as Facebook and IBM. Research firms have attributed much of this interest to the cost savings and flexibility inherent in open-source software. Hadoop birthed platforms like HBase, a non-relational distributed database modeled on Google’s Bigtable and run atop HDFS. Impala, in Cutting’s words, is another “step down the path,” offering the ability to query a wide variety of data types and volumes (including structured and unstructured data), with those queries expressed in SQL.
As Hadoop and its related systems evolve, the time needed to process large amounts of data continues to shrink. Cutting doesn’t see many limits to Hadoop as a data processing system, or at least “nothing that couldn’t be worked around.”
Cloudera is keeping Impala open-source. Although the company has a sizable lead in the Hadoop space, it faces continuing competition for customers from Hortonworks and MapR. All three aspire to become the primary provider of Hadoop-related distributions and support (in contrast to companies like IBM, which use Hadoop primarily as an added feature of their proprietary platforms).
But Cutting believes it would be difficult for other players to enter this “pure” Hadoop vendor marketplace, at least in the near term. “It’s tough to get going,” he said. “It’s taken us close to four years.” Cloudera is betting that Impala, scheduled for general availability in the first quarter of 2013, will put a little more distance between it and the competition.