A few years ago, you couldn’t have a conversation about “Big Data” without someone bringing up Apache Hadoop, the open-source framework that allows firms to run data applications on large hardware clusters. Numerous tech firms (including Cloudera, IBM, Pivotal, and Hortonworks) charged into the arena with specialized Hadoop distributions, anxious to claim as many enterprise contracts as possible.
Apache Hadoop leverages the Hadoop Distributed File System (HDFS) and MapReduce (for processing), along with YARN and Hadoop Common (which features the libraries that the ecosystem needs to operate). Data is distributed to hardware nodes that process in parallel, speeding up the analysis. That’s an extremely simplified description, of course, but the end result is that companies can rely on the Hadoop ecosystem for everything from predictive analytics to data discovery.
All that hyped-up chatter belied a sobering fact: precious few companies were taking big strides to integrate Hadoop into their data-analytics workflows. In 2015, a study from Dresner Advisory Services suggested that only 17 percent of companies were storing and managing data within a Hadoop ecosystem. “While awareness is high across the board, more importance is assigned from suppliers, and we see a fairly substantial lag in current use and near term adoption plans within organizations today,” Howard Dresner, founder and chief researcher officer of the firm, wrote at the time.
Over the next few years, the rate of Hadoop adoption continued to creep upwards—slowly. In 2017, research firm Gartner announced that firms were spending close to $800 million on Hadoop distributions, even though a mere 14 percent of enterprises reported relying on the technology. Other studies have suggested that adoption (and spending) continued to perk up through last year.
What explains the dichotomy between years of hot hype and relatively slow adoption rate? Apache Hadoop is a complex technology that requires experts to install and use effectively. In addition, the rise of the cloud has thrown something of a curveball to firms interested in adopting Hadoop: sysadmins and data scientists have to decide whether they want to run their analytics on a local hardware cluster (i.e., an on-premises datacenter), in the cloud, or via some hybrid setup.
As Dice’s own analysis demonstrates, jobs that intersect heavily with Hadoop—not to mention other data-analytics platforms—can really pay off, salary-wise. But in order to land that sort of paycheck (and the benefits and perks that might come with it), you need the right mix of certifications and experience. Check out this breakdown:
All the data-related jobs in that list make sense. But why is Python ranking? Data scientists increasingly rely on the language, especially in the context of machine learning and research; given that, it’s logical that many Python developers are working with Hadoop.
While Hadoop might have been overhyped a few years back, it can still clearly pay off for tech pros: companies are still interested in the technology, meaning they need employees who can integrate it into a current stack.