Apache Hadoop 2.2.0 has reached its general-release milestone.
Apache Hadoop remains a popular open-source framework for running data applications on large hardware clusters. For numerous companies and developers, the software is a vital tool for wrestling with enormous amounts of data. Hadoop's MapReduce implementation has emerged as the backbone of many a company’s data infrastructure, although some firms with skyrocketing information requirements, such as Facebook, have moved to building their own customized frameworks.
Given Hadoop’s burgeoning popularity, it’s no surprise that seemingly every IT firm on the planet—from giants such as Intel and Oracle down to the tiniest “Big Data” startups—has released either its own Hadoop distribution or some bit of supporting software. While that’s led to increasing revenues for the Hadoop industry as a whole, the software’s open-source origins could eventually blunt attempts to profit from it.
Larger economics aside, Hadoop 2.2.0 brings a variety of improvements: YARN (a general-purpose resource management system that makes it easier to run MapReduce and other frameworks; the name stands for “Yet Another Resource Negotiator”), support for Windows, improved integration with other open-source projects, binary compatibility for MapReduce applications built on Hadoop 1.x, and high availability for the Hadoop Distributed File System (HDFS).
In addition, Hadoop 2.2.0 supports NFSv3 access to data stored in HDFS, as well as HDFS Federation and HDFS Snapshots. The Apache Software Foundation is urging Hadoop users to upgrade to 2.2.0 as soon as possible, saying the release is significantly more stable while remaining compatible with existing APIs and protocols.
Among all these new features, YARN is arguably the biggest, as it could radically change how data analysts do everything from search to graph processing. In essence, YARN is a total overhaul of MapReduce (essentially making it MapReduce 2.0, or MRv2), in which the resource management and job-scheduling/monitoring functions formerly handled by the JobTracker have been split into separate daemons.
“The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM),” reads the Apache Software Foundation’s note on YARN. “An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.” The ResourceManager and NodeManager (the per-node slave) form the actual data-computation framework, with the ResourceManager arbitrating resources among all the applications in the system.
The ResourceManager has two main components, the Scheduler and the ApplicationsManager. The Scheduler is tasked with allocating resources to running applications (within constraints such as queues), while the ApplicationsManager governs job submissions. The Foundation's documentation includes a helpful diagram of how all these MapReduce 2.0 components interact.
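For readers who think better in code, the split between the Scheduler and the ApplicationsManager can be pictured with a toy model. This is only a rough sketch and not the real YARN API: the class names mirror the component names above, but every method, the queue name "default", and the idea of counting resources as whole "containers" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: names and behavior are illustrative, not the real YARN API. */
public class ToyYarn {

    /** Hypothetical stand-in for the Scheduler: hands out containers within per-queue limits. */
    static class Scheduler {
        private final Map<String, Integer> queueCapacity = new LinkedHashMap<>();
        private final Map<String, Integer> queueUsed = new LinkedHashMap<>();

        void addQueue(String queue, int maxContainers) {
            queueCapacity.put(queue, maxContainers);
            queueUsed.put(queue, 0);
        }

        /** Grants as many of the requested containers as the queue still has room for. */
        int allocate(String queue, int requested) {
            Integer capacity = queueCapacity.get(queue);
            if (capacity == null || requested <= 0) {
                return 0; // unknown queue or empty request: nothing granted
            }
            int free = capacity - queueUsed.get(queue);
            int granted = Math.min(free, requested);
            queueUsed.put(queue, queueUsed.get(queue) + granted);
            return granted;
        }
    }

    /** Hypothetical stand-in for the ApplicationsManager: accepts submissions and assigns IDs. */
    static class ApplicationsManager {
        private final List<String> accepted = new ArrayList<>();

        String submit(String jobName) {
            String appId = "app-" + (accepted.size() + 1);
            accepted.add(jobName);
            return appId;
        }

        int acceptedCount() {
            return accepted.size();
        }
    }

    /** The global ResourceManager simply owns the two components in this sketch. */
    static class ResourceManager {
        final Scheduler scheduler = new Scheduler();
        final ApplicationsManager applicationsManager = new ApplicationsManager();
    }

    public static void main(String[] args) {
        ResourceManager rm = new ResourceManager();
        rm.scheduler.addQueue("default", 4);

        String first = rm.applicationsManager.submit("word-count");
        String second = rm.applicationsManager.submit("graph-job");

        // Submission and allocation are separate steps handled by separate components.
        System.out.println(first + " granted " + rm.scheduler.allocate("default", 3));
        System.out.println(second + " granted " + rm.scheduler.allocate("default", 3));
    }
}
```

In this sketch the second job receives only one container because the queue's limit of four has nearly been reached, which is the kind of constraint the Scheduler enforces. Accepting the job in the first place is handled by a different component entirely, which is the point of the split.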
Although IT companies such as Hortonworks have included a version of YARN in recent product releases, it could gain a substantially wider audience thanks to its presence in Hadoop 2.2.0, given the reach of open source. A list of YARN-compatible applications is available on the Hadoop Wiki; be aware that porting existing applications could still take some effort.
Images: Apache Software Foundation