Cloudera unveiled the fourth incarnation of its Hadoop distribution and management platform, Cloudera Enterprise 4, at a June 6 launch event that brought together several of the company’s executives, a handful of users, and many of its 250 partners. The key topic of discussion: how Hadoop is changing the way data is used, and how to make the best use of the avalanche of new data and data types. “Hadoop now is truly ready for the Enterprise,” asserted Cloudera CTO Amr Awadallah.

Cloudera Manager 4, the latest version of the management platform, now has open APIs that make it easier to instrument and monitor a Hadoop cluster with other monitoring or deployment tools such as IBM Tivoli or HP OpenView.

Availability and real-time enhancements include a new high-availability option for the NameNode, the “master” Hadoop controller, which eliminates the need for special RAID or HA hardware in the NameNode system. Just as Hadoop was designed for deployment on clusters of commodity hardware, the new HA feature supports multiple NameNode machines that can fail over automatically if one NameNode fails, so no single device in a Hadoop system needs special care and feeding.

Also new are HBase extensions that allow applications to run in real time even as data is being fetched, rather than waiting for tables to load completely before processing. “We see the whole Hadoop ecosystem moving toward real time,” said Omar Trajman, Cloudera VP of Technology Solutions.

As enterprises rely more on Big Data solutions for everyday tasks, the ability to support 24/7 operations becomes critical, and single points of failure can no longer be tolerated. Extending availability beyond previous Hadoop implementations involves replication: if you run different HBase clusters in multiple data centers or cloud-based deployments, data can be automatically replicated across all of them, so processing continues even after a site disaster.
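To make the NameNode failover idea concrete, here is a minimal Python sketch. It is a toy model, not Hadoop code and not Cloudera's implementation: the `NameNode` and `FailoverClient` classes, the host names, and the file path are all hypothetical, and the only behavior modeled is "use the first healthy machine."

```python
from dataclasses import dataclass


@dataclass
class NameNode:
    """Hypothetical stand-in for one NameNode machine."""
    host: str
    healthy: bool = True


class FailoverClient:
    """Toy client that always talks to the first healthy NameNode."""

    def __init__(self, namenodes):
        self.namenodes = list(namenodes)

    def active(self):
        # Walk the configured machines in order and pick the first healthy one.
        for node in self.namenodes:
            if node.healthy:
                return node
        raise RuntimeError("no healthy NameNode available")

    def lookup(self, path):
        # A real NameNode would return block locations; this only reports who answered.
        return f"{self.active().host} resolved {path}"


primary = NameNode("nn1.example.com")
standby = NameNode("nn2.example.com")
client = FailoverClient([primary, standby])

before = client.lookup("/data/events.log")
primary.healthy = False  # simulate a NameNode failure
after = client.lookup("/data/events.log")

print(before)  # nn1.example.com resolved /data/events.log
print(after)   # nn2.example.com resolved /data/events.log
```

The caller never changes its request; only the machine that answers changes, which is the property that removes the NameNode as a single point of failure.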

Access to Hadoop’s Innards

By opening up access throughout its Hadoop stack, Cloudera makes new programming models possible. “Extensibility we added is also key,” said Hadoop architect Doug Cutting, who named Hadoop after his child’s stuffed elephant. That extensibility runs from HBase, the overarching data repository from which Hadoop delivers its results, through MapReduce, the computational framework, down to the HDFS clustered file system at the heart of a Hadoop deployment.

According to VP of Product Charles Zedleweski, you can now develop directly against HDFS, or use coprocessors to run arbitrary Java programs in the cluster itself, bypassing the MapReduce layer traditionally used to churn Big Data into usable results. “There’s life beyond MapReduce,” said Zedleweski. “It’s great for most applications but has its shortcomings, and the new release lets you develop other frameworks on top of the same pool of storage that’s running MapReduce jobs. You can pull in other Apache algorithms like HAMA for scientific applications, or do it yourself.”
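The appeal of running code inside the cluster is that the computation moves to the data instead of the data moving to the computation. The sketch below illustrates only that idea, in Python rather than the Java a real coprocessor would use. The `regions` data, `region_sum`, and `cluster_total` are hypothetical names invented for this example and do not correspond to any HBase API.

```python
# Hypothetical data: each "region" is a slice of a table held on one node.
regions = [
    {"row1": 12, "row2": 7},
    {"row3": 30},
    {"row4": 1, "row5": 5},
]


def region_sum(region):
    """Runs next to the data, like server-side code, and returns one small number."""
    return sum(region.values())


def cluster_total(regions):
    # Only the per-region partial sums travel back to the caller, not the rows.
    partials = [region_sum(r) for r in regions]
    return partials, sum(partials)


partials, total = cluster_total(regions)
print(partials, total)  # [19, 30, 6] 55
```

Each node returns a single number rather than its full set of rows, so the amount of data crossing the network stays small no matter how large each region grows.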

What the Hadoop?

Cloudera Enterprise 4 has two main components: Cloudera Distribution for Hadoop V4 (CDH4) and Cloudera Manager 4. CDH4 is the Apache-based open-source Hadoop stack that combines the Hadoop Distributed File System (HDFS), the MapReduce Big Data programming construct, and the overarching database HBase, as well as nine other Apache-based open-source Big Data tools that ease Hadoop deployment. Although CDH4 is 100 percent Apache and all the pieces can be downloaded from the Apache website, Cloudera COO Kirk Dunn likened that approach to buying the components for an automobile and assembling it yourself.

Cloudera Manager 4 combines a host of deployment and management tools that ease the rollout of a Hadoop cluster, whether five nodes or 500, and that monitor and report on hot spots or failures in a cluster of any size. Even though Cloudera Manager is not strictly Apache open source, all the APIs for readouts and instrumentation are laid out for integration with applications or consoles throughout the enterprise.

Size Doesn’t Matter

How big does an organization have to be to take advantage of Hadoop? Not very, according to Trajman: “People download CDH when they’ve got an idea in their head, sitting in a coffee shop and saying ‘You know, I need to grab a lot of data from my application,’ or, ‘This is something that’s really data-centric because data’s going to drive my business.’” There is no barrier to entry, he added: “That’s the great thing about Hadoop. You can use whatever hardware you have lying around, get some data in, prove that there’s value in that data and then expand it out.”

Despite the name, Cloudera isn’t solely cloud-focused, according to Hadoop innovator Cutting. “A lot of people start using Hadoop in the cloud, and some folks continue like that indefinitely,” he said, “but as your size grows you might want to buy your own hardware just to be more cost effective. One problem with the cloud is moving huge data volumes back and forth.”

“It’s an exciting time to be a geek,” Cutting continued. “Open source is a great environment for developers because they can get recognized in front of a much larger audience, and so I’m happy to be a geek myself right now.”

Cloudera’s CDH4 is free to download and offers distributions packaged for Ubuntu and Debian. It also offers support for PostgreSQL and Oracle 11g. Cloudera Manager is free for clusters of fewer than 50 nodes; license fees vary for larger clusters.