Open-Source a Double-Edged Sword for Analytics Firms

In an essay earlier this week on ReadWrite, Matt Asay questioned whether proprietary software could really establish much of a foothold in a data-analytics marketplace dominated by open-source software such as Apache Hadoop.

Asay isn’t the first to question IT companies’ ability to maximize profit in the so-called “Big Data” space: last year, research firm IDC suggested that the availability of open-source solutions could very well hamper the ability of proprietary software platforms to fully capitalize on the analytics space. “The Hadoop and MapReduce market will likely develop along the lines established by the development of the Linux ecosystem,” Dan Vesset, vice president of Business Analytics Solutions for IDC, wrote in a statement at the time. “Over the next decade, much of the revenue will be accrued by hardware, applications, and application development and deployment software vendors.”

Another analyst, 451 Research’s Matt Aslett, has taken a similar perspective on various IT companies’ launching proprietary Hadoop distributions. “What does it mean to be ‘all in on Hadoop’?” he wrote in a March blog posting. “Based on a strict reading of Defining Apache Hadoop (a document that demands by its own words to be read strictly), being ‘all in’ on Hadoop means only one thing: being ‘all in’ on Apache Hadoop.”

Despite those analyst warnings, the prospect of big profits from analytics’ burgeoning popularity has convinced a broad cross-section of tech giants and startups to plunge into the space. This year alone, Intel announced a Hadoop distribution, and EMC rolled out Pivotal HD; Hortonworks, Cloudera, Splunk, and Amazon Web Services are just a few of the companies either supporting Hadoop or building software designed to layer additional functionality atop it.

The advantages of Hadoop are obvious: it allows companies to crunch large amounts of unstructured data on clusters of commodity hardware, and thus sidestep the often tangled, sometimes messy process of procuring and installing proprietary infrastructure. That being said, there is some valid criticism of Hadoop: at its most bare-bones level, the software doesn’t offer the full feature-sets of some proprietary database platforms. It can also prove too slow for applications that demand real-time (or near-real-time) analysis, and critics argue that many workers find it too hard to learn and use.

Hadoop is undergoing steady improvement: the just-released version 2.2.0 features YARN (a general-purpose resource management system that makes it easier to work with MapReduce and other frameworks; the name stands for “Yet Another Resource Negotiator”), support for Windows, boosted integration with other open-source projects, binary compatibility for MapReduce applications built on Hadoop 1.x, and high availability for the Hadoop Distributed File System (HDFS).

But that revised feature-set won’t miraculously transform Hadoop from a platform with an extensive-but-defined number of uses into the Swiss Army Knife of data analytics, nor is it likely to dampen the criticism of the platform as unwieldy. Therein lies the opportunity for companies such as, say, Splunk, which recently released software that makes Hadoop queries more visual and user-friendly. Open-source software might ding some of the larger proprietary-software developers, but it’s also an opportunity for those firms that want to build off something widely used and make it faster and easier to handle.

