Cloudera has rolled out the public beta of a search engine designed to surface data stored in the Hadoop Distributed File System (HDFS) and Apache HBase.
Cloudera Search is the latest element of Cloudera Enterprise, the company’s Big Data platform. Cloudera Enterprise relies on Impala, Cloudera’s open-source SQL query engine for data stored in HDFS and HBase, to process queries significantly faster than Hive running on MapReduce. The platform also comes with additional layers of management automation and technical support.
Apache Hadoop has become a favorite framework for organizations that need to crunch massive amounts of data across large hardware clusters. Developers have used Hadoop as a jumping-off point for building platforms such as the aforementioned HBase, a non-relational distributed database modeled on Google’s Bigtable that runs atop HDFS.
“Cloudera Search leverages the same data as Impala and MapReduce,” Doug Cutting, Cloudera’s chief architect (and co-creator of the Apache Hadoop framework), wrote in a June 4 posting on Cloudera’s official blog. “It can index any data stored in HDFS, and it stores its own index in the same filesystem. This is a big step forward in simplicity and usability. Hadoop users will benefit from the ease of automatically indexing and free text searching the data in their clusters.”
Cloudera Search is a framework “much like MapReduce and Cloudera Impala,” he added, and it “leverages the same security as the rest of the Hadoop stack.” In theory, that means administrators and other IT pros can restrict who sees the HDFS data that the search tool surfaces.
Cutting also used the blog entry to make an argument for Cloudera Search (and its Hadoop underpinnings) over other data-search engines. “For years, databases attempted to provide search as a feature in their platforms but this approach was largely abandoned in favor of acquiring independent search products that require their own infrastructure, integration, and expertise,” he wrote. “Hadoop’s flexibility has made it a much better supporting platform for search and consequently a much more general-purpose platform than relational databases.”
But Cloudera faces a good deal of competition in the Hadoop space: EMC, Intel, Hortonworks, and a variety of other companies all offer tools for wrangling unstructured data via the framework. In that crowded context, an innovative bit of software can help a company stand out, but it’s usually only a matter of time before a competitor releases something similar.