Interview Questions for Hadoop Developers

Hadoop is an open source distributed software framework that enables programmers to run applications across an enormous number of nodes handling terabytes of data. One of its most significant abilities is allowing a system to continue to operate even if a significant number of nodes fail.

Because Hadoop is still maturing, hiring managers and recruiters are finding few Hadoop specialists out there. Consequently, many of the people hired for Hadoop-heavy jobs are those who can demonstrate that they learn quickly and are familiar with similar Big Data tools.

How do you convince a hiring manager you’re one of those people? We asked Eric Sammer, Engineering Manager at software provider Cloudera, to share a few common interview questions and the types of answers he’s looking for.

Explain how Hadoop differs from a traditional database.

  • What Most People Say:  Hadoop is not a database.
  • What You Should Say:  Describe authoritatively how Hadoop processes data in large batches.
  • Why You Should Say It: “Being able to define those more fundamental uses demonstrates more complete knowledge,” says Sammer.

Give an example of a case where using Hadoop makes sense, and a case where using Hadoop is the way to solve a particular problem.

  • What Most People Say: It’s useful for offline batch operations.
  • What You Should Say: Hadoop is useful in cases where you’re processing large quantities of hourly data and transforming data en masse.
  • Why You Should Say It: “This answer shows you thoroughly understand it,” says Sammer. Simply saying it’s useful for offline batch operations, while not incorrect, is a superficial answer, he says.

What happens when you invoke a read function in the operating system? How does the operating system perform IO?

  • What Most People Say: “I use the Java I/O system and the data gets returned from the disk.”
  • What You Should Say:  Walk through the various components of what the operating system does in order to be able to describe the entire I/O pipeline.
  • Why You Should Say It: “Hadoop is a low-level system for processing data,” Sammer explains. “Candidates who are more familiar with operating system fundamentals can handle Hadoop at a much deeper level. That’s the kind of person we’re looking for. This question is one of the earliest indicators about whether you know how computers really work.”
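
To make the pipeline concrete, here is a minimal Python sketch that bypasses the language's buffered file objects and issues the underlying system calls directly. It assumes a Linux-like system, where each read call roughly traps into the kernel, goes through the file system layer, is served from the page cache if possible (otherwise the block layer and device driver fetch the data from disk), and ends with the bytes being copied into the process's buffer. The function name and the 4096-byte block size are illustrative choices, not anything from the interview.

```python
import os


def read_file_unbuffered(path, block_size=4096):
    """Read a whole file using raw file-descriptor system calls."""
    # open(2): the kernel resolves the path and returns a file descriptor.
    fd = os.open(path, os.O_RDONLY)
    chunks = []
    try:
        while True:
            # read(2): the kernel copies up to block_size bytes into our
            # buffer, fetching from the device first if they are not cached.
            block = os.read(fd, block_size)
            if not block:  # an empty result signals end of file
                break
            chunks.append(block)
    finally:
        # close(2): release the descriptor even if a read failed.
        os.close(fd)
    return b"".join(chunks)
```

A higher-level call such as Java's `InputStream.read` or Python's `open(...).read()` ends up in the same system calls; a strong answer explains what happens underneath them.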

If you only had 32 megabytes of memory how would you sort one terabyte of data?

  • What Most People Say: “I don’t know.” (In fact, most candidates either get it right or don’t.)
  • What You Should Say: “Take a chunk of data small enough to fit in memory, sort it, and write it out, so you partition the data into lots of little sorted data sets. Then merge those sorted lists into one big list, writing the results back to disk as you go.”
  • Why You Should Say It: “Any candidate who does Hadoop or knows it at a deep level will be able to understand the depth of what Hadoop does,” Sammer believes. “It’s a great qualifying question. It demonstrates an understanding of how you manage data at that scale.”
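
The approach described in that answer is usually called an external merge sort. The following is a minimal Python sketch of it for a newline-delimited text file, not a production implementation: the function names and the `max_lines_in_memory` limit are hypothetical, and a real solution would bound memory in bytes rather than lines. With 32 megabytes and a terabyte of input there would also be tens of thousands of sorted runs, so a real implementation would likely merge them in several passes instead of opening them all at once.

```python
import heapq
import os
import tempfile


def _write_run(sorted_lines):
    """Spill one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w") as run:
        run.writelines(sorted_lines)
    return path


def external_sort(input_path, output_path, max_lines_in_memory=100_000):
    """Sort a text file line by line while holding only a bounded chunk in memory."""
    run_paths = []

    # Phase 1: read bounded chunks, sort each in memory, write it to disk.
    with open(input_path) as src:
        chunk = []
        for line in src:
            if not line.endswith("\n"):
                line += "\n"  # normalize a final line that lacks a newline
            chunk.append(line)
            if len(chunk) >= max_lines_in_memory:
                run_paths.append(_write_run(sorted(chunk)))
                chunk = []
        if chunk:
            run_paths.append(_write_run(sorted(chunk)))

    # Phase 2: k-way merge. heapq.merge reads the runs lazily, so only
    # about one line per run is held in memory at any time.
    runs = [open(path) for path in run_paths]
    try:
        with open(output_path, "w") as dst:
            dst.writelines(heapq.merge(*runs))
    finally:
        for run in runs:
            run.close()
        for path in run_paths:
            os.remove(path)
```

For example, `external_sort("input.txt", "sorted.txt", max_lines_in_memory=1000)` sorts a file of any length while keeping at most 1,000 lines in memory during the first phase.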

Have you ever participated in open source in any way?

  • What Most People Say: “No,” or, “I’m familiar with open source.”
  • What You Should Say: “Here’s an example of a project I did for a previous employer with open source. I have also contributed code.”
  • Why You Should Say It: “Passion goes a long way,” says Sammer. “It gives us a high level gauge of interest in what they do for a living. People who do that tend to be a much better fit for us. Generally I — as well as the rest of Cloudera — believe there are a lot of ways to participate. You can contribute code, devote time by answering questions or write documentation. It’s so impressive to see in a candidate.”

8 Responses to “Interview Questions for Hadoop Developers”

  1. I think more needs to be added to determine whether the developer understands Hadoop and really can code MapReduce jobs effectively.

    Hadoop is more than just a distributed file system. It is a system that lets you scale to very large volumes of unstructured data and use MapReduce to solve big problems with an enormous amount of data.

    A developer should totally understand the value of MapReduce as this is a core concept to utilizing the Hadoop system or any NoSQL db for that matter.

    Further, the developer should be able to identify Hive, who created it, and why it was created, and be able to explain the difference between HiveQL and SQL. If they can’t explain this, I would be hesitant to hire this Hadoop developer.

    I would also expect the dev to explain their experience with Pig, Mahout or Hive and their interest in learning these frameworks.

    And for more advanced devs, I would want to pick their brains on Hadoop design patterns and their understanding of MapReduce 1 and MapReduce 2. For example, when does it make sense to use MapReduce 2 over MapReduce 1? What is the recommended minimum number of nodes to use MapReduce 2?

    The questions posed in this original post I think are too generic to really say if the candidate is a solid Hadoop dev.

    just my two cents – Jeff A.

  2. 1) If you only had 32 megabytes of memory how would you sort one terabyte of data?

    I would try it this way:
    If the data is in files, partition the file and keep the smaller file in memory while streaming the larger file.
    For the same space problem when joining table data, apply the same logic: keep the smaller table on the left side and stream the larger table on the right side.

  3. Sorting a terabyte of data using distributed computing is a good example of Hadoop’s usefulness. The Hadoop examples include TeraGen and TeraSort, which demonstrate this behavior.

    Note: The memory (RAM) may be low, but you certainly need a minimum of 3 TB of disk (with the default replication factor of 3) to store the 1 TB of data, let alone sort it.

    For those looking for more Hadoop interview questions, may prove useful.