Interview Qs for R's Statistical Suite

R is a favorite of a subset of developers, and with good reason: It was recently named one of the twenty fastest-growing programming languages on GitHub, and is listed as a top-paying skill in Dice’s 2015 Tech Salary Survey.

One of the big reasons behind that spike in popularity is R’s library of statistical tools, popular among many who work with Big Data. “Employers are usually looking for someone to conduct an analysis, interpret the results and communicate the findings to others through reports or presentations,” said Dr. Hadley Wickham, chief scientist at RStudio and an adjunct assistant professor of statistics at Rice University. Wickham shared several interview questions that effectively probe a candidate’s experience with R’s software environment.

What are the basic data types of R and how are they related?

  • What Most People Say: “R’s basic data types are numbers, strings, factors and dates but I’m not sure how they’re related.”
  • What You Should Say: “R supports a few basic data types such as numbers, strings, factors and dates but, most importantly, it has two main building blocks, both of which are vectors. An atomic vector holds a collection of homogeneous elements that are all the same type, while a list can hold heterogeneous elements. Lists are sometimes called recursive vectors because they can contain other lists.”
  • Why You Should Say It: While the first answer is technically correct, it doesn’t define the relationship between the data types, which is critical to data analysis. The second answer demonstrates not only a basic understanding of how to use R, but also an understanding of how R works.
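
The relationship between these types can be checked at the R console. The snippet below is a minimal sketch rather than part of Wickham's answer; the variable names and values are made up for illustration.

```r
# Atomic vectors: every element has the same type.
nums  <- c(1.5, 2, 3.25)        # double
words <- c("a", "b", "c")       # character

# Mixing types in c() coerces everything to one common type.
mixed <- c(1, "two", TRUE)
typeof(mixed)                   # "character"

# Lists: elements may have different types and may hold other lists,
# which is why lists are sometimes called recursive vectors.
record <- list(
  id     = 1L,
  name   = "Ada",
  scores = c(90, 85),
  nested = list(ok = TRUE)
)

is.atomic(nums)                 # TRUE
is.list(record)                 # TRUE
is.vector(record)               # TRUE: a list is also a vector

# Factors and dates are atomic vectors with extra attributes on top.
grp <- factor(c("low", "high", "low"))
day <- as.Date("2015-06-01")
typeof(grp)                     # "integer"
typeof(day)                     # "double"
```

Being able to explain why `typeof(grp)` is `"integer"` is the kind of detail that shows a candidate knows how the types are built, not just what they are called.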

How does R handle missing values? And what do you do when you encounter one?

  • What Most People Say: “In R, missing values are represented by the symbol NA (not available). When I encounter missing values, the first thing I do is get rid of them.”
  • What You Should Say: “It’s important not to delete missing values because they may indicate a problem with a query, data collection, programming or other things. I strive to find the root cause of the missing value and then I take steps to correct it to keep the problem from recurring.”
  • Why You Should Say It: An inexperienced person may not have the patience or expertise to deal with missing values, while a real pro will see the value in the reconciliation process. Plus, a pro understands that correcting the underlying problem today will provide benefits down the road.
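
A first step toward finding the root cause is usually to see where the missing values sit. The following is a small sketch using base R only; the `survey` data frame, its column names and its values are hypothetical.

```r
# Hypothetical survey data; the column names and values are made up.
survey <- data.frame(
  respondent = 1:6,
  region     = c("north", "north", "south", "south", "south", "north"),
  income     = c(52000, NA, 48000, NA, NA, 61000)
)

# How many values are missing in each column?
missing_per_column <- colSums(is.na(survey))
print(missing_per_column)

# Are the gaps concentrated in one group? That can hint at a root cause,
# such as a broken query or a collection problem in one region.
missing_by_region <- table(
  region         = survey$region,
  missing_income = is.na(survey$income)
)
print(missing_by_region)

# Arithmetic on NA propagates NA, so summaries must handle it explicitly.
mean(survey$income)                  # NA
mean(survey$income, na.rm = TRUE)    # mean of the three observed values
```

Note that `na.rm = TRUE` only skips the missing values for that one calculation. It leaves the data untouched, which keeps the evidence available while the cause is investigated.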

When R provides several packages that seem like they might solve a specific problem, how do you decide which one to use?

  • What Most People Say: “I just pick one because I’m not always sure how they differ or which package is best.”
  • What You Should Say: “For starters, I look for a package that follows good software development principles. For instance, I want to see quality documentation and unit tests. Next, I look to see how the package is being used and I read the reviews from users posted on the site. It’s important to know if other analysts have been able to solve a problem that is similar to mine. When in doubt, I often ask for feedback from peers or members of the R community to make sure I’m making the right choice.”
  • Why You Should Say It: The CRAN package repository houses over 6,000 packages, so a developer or data scientist really needs some sort of defined process and criteria to select the right one. It’s best to develop a complete list of needs and issues before you start searching because, ideally, the downloaded package should solve several problems.

How do you communicate the results of a data analysis?

  • What Most People Say: “I copy and paste my visualizations into a Word document.”
  • What You Should Say: “I believe in reproducible research, so I use knitr to combine my code, data and conclusions into a single document. Reproducible research allows others to verify my findings and add to them, and it facilitates ongoing discussions.”
  • Why You Should Say It: Even if you think you may only do an analysis one time, reproducible research makes it easy to redo the experiment, insert new data or apply the model to a different problem. Creating a template is a real time-saver, and keeping the data and graphs updated will encourage business leaders to trust your methodology and the conclusions of your analysis.
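
To make the knitr workflow concrete, here is a minimal sketch that writes a tiny R Markdown report to a temporary file and knits it. It assumes the knitr package is installed and uses R's built-in `cars` data set; the report text and the chunk label `fit` are made up for illustration.

```r
# Lines of a minimal R Markdown report. The chunk re-runs the analysis
# each time the document is knit, so the reported numbers never go stale.
report <- c(
  "# Stopping distance report",
  "",
  "```{r fit}",
  "fit <- lm(dist ~ speed, data = cars)",
  "coef(fit)",
  "```",
  "",
  "The data set has `r nrow(cars)` rows."
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(report, rmd_path)

# knitr is an add-on package, so only knit when it is installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  md_path <- knitr::knit(
    rmd_path,
    output = tempfile(fileext = ".md"),
    quiet  = TRUE
  )
  # Show the resulting Markdown: code, results and prose in one document.
  cat(readLines(md_path), sep = "\n")
}
```

Because the numbers in the report come from code rather than copy and paste, swapping in new data and knitting again updates the whole document in one step.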