Data Engineer Interview: Sample Questions and Expert Insight

Like data scientists, data engineers must wrangle enormous datasets as part of their daily workflow. Thanks to the proliferation of sensors, websites, point-of-sale systems, and other sources of data collection, data engineering has grown into a discipline with enormous demand behind it.

Data engineers focus their energy on preparing large, cumbersome datasets for data scientists’ analysis. They also play a vital role in feeding data to machine-learning models. In fact, they’re often the first point of human contact when it comes to transforming data into something useful. 

But how easy is the discipline to learn? What questions should you expect in a data engineer job interview? We spoke with several experts to find out.

What qualities does a great data engineer possess?

Jess Anderson, Managing Director of Big Data Institute, says the most-desired skills come down to the company’s specific needs. Anderson looks for “a solid understanding of the frameworks they’re using. It’s also key to be able to create complex systems for data.”

When hiring data engineers, Dan Prince, founder and CEO of Illumisoft, looks for the ability to communicate complicated ideas easily and efficiently. In other words, “soft skills” such as empathy and communication are key for data engineers. He also values their ability to grasp a problem, understand its context, and ask the right questions.

“I also expect that they have some self-initiated project experience,” Prince says. “Many kids will go through college or a degree curriculum without ever trying to put any of their knowledge to work outside of academia. I’m looking for people that are bold enough to try, and if they have sold their services while still a student, that’s even better.”

Rudolf Höhn, Data Scientist at Unit8 SA, tells Dice: “We want our data engineers to be technical, but also be understandable by less technical people, such as a project client. Our engineers may face clients; hence, it is important that they can articulate ideas in a clear way, especially in front of a business audience.”

What mandatory skills should a data engineer possess?

“All data engineers should be able to code,” says Melissa Benua, VP of Engineering at mParticle, “though the language itself doesn’t matter. Candidates should also be familiar with distributed systems design principles. Likewise, they should have solid database experience. They should be skilled at writing SQL—especially high-performance and cost-efficient queries for processing large datasets—and basic database administration. Knowledge of AI/ML is a bonus but is generally not considered a requirement.”

Tee Selesi, talent acquisition manager for edge services provider StackPath, adds: “Our need to analyze data sets and find trends is ever-increasing, so we are interested in interviewing candidates with software programming, data modelling, and data partitioning backgrounds. A successful candidate would also need to have a firm understanding of the optimization/constraints that come with large datasets.”

Anderson adds: “They should have at least intermediate-level programming skills and experience with at least one batch processing system such as Apache Spark.”

Prince adds: “You have to have experience using the tools of the trade like Apache Hadoop and Spark, C++, Amazon Web Services, and Redshift. You also have to know a number of different database systems, both relational and non-relational. You have to understand data warehousing solutions, ETL tools, machine learning, and data APIs. You need to know and understand some Python, some Java, and some scale-up programming languages. Other than communication skills, presentation skills and self-initiated experience, a plus would be a good understanding of distributed systems and knowledge of algorithms and data structures.”

What are some sample data engineer interview questions?

Anderson, Benua, Selesi, and Educative CEO Fahim ul Haq offer these sample questions for the data engineer interview:

  • Have you ever transformed unstructured data into structured data?
  • How would you validate a data migration from one database to another?
  • What is Hadoop? How is it related to Big Data? Can you describe its different components?
  • Which Python libraries would you use for efficient data processing?
  • Do you consider yourself database- or pipeline-centric?
  • Tell us about a distributed system you’ve built. How did you engineer it?
  • How do you handle conflict with coworkers? Can you give us an example?
  • What do *args and **kwargs mean?
  • Design a video streaming service like YouTube or Netflix.
  • Design a consumer-facing data storage solution like Google Drive or Dropbox.
  • Do you have experience using PostgreSQL or other RDBMS and a general understanding of NoSQL databases?
  • Do you have experience scripting ETL workflows on Linux/Unix using Python and/or Golang?
  • Do you have experience in building and maintaining data pipelines using Kafka (or similar)?
  • Do you have experience in writing analytic (OLAP) queries in SQL?
  • Do you have hands-on experience with Spark?
  • Do you have experience with cloud platforms, such as GCP, AWS, and Azure? 
  • Do you have experience with Docker and/or Kubernetes?
  • Walk me through one of your data engineering projects, preferably one where you took the assignment from idea stage through implementation and owned it into production.
  • Describe to me how a pipeline that reads data from a queue and periodically uploads data to S3 might work. How would you scale it?
  • Design a SaaS platform that might compete with Google Analytics. How would it scale? What tradeoffs might you make in which parts of the problem you solve first?
  • Given a dataset, write SQL to answer these relevant business questions.
  • Why would you choose S3 versus a NoSQL database?
  • Why would you choose a NoSQL database versus a relational database?
  • How would you go about diagnosing a performance issue in a Spark job?
  • What is a shuffle sort?
  • How does Spark differ from S3?
  • As a take-home coding assignment (2-3 hours): Write a pipeline that reads input files and produces aggregated stats (similar to what group-by queries do in a database, but with candidates writing the process themselves). Depending on the candidate’s level, the deliverable is either a basic pipeline or an extendable, scalable solution.
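
To make the take-home assignment above more concrete, here is a minimal sketch in Python of a hand-written group-by: it streams rows from input files and keeps running statistics per key. It is only one possible shape for such an answer, not a model solution from any of the experts quoted here. The column names (`region`, `amount`), the function names, and the sample data are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_rows(files):
    """Yield dict rows from each open CSV file, one row at a time."""
    for f in files:
        yield from csv.DictReader(f)


def aggregate(rows, key_field, value_field):
    """Make a single pass over rows, keeping running stats per key."""
    stats = defaultdict(
        lambda: {"count": 0, "sum": 0.0, "min": None, "max": None}
    )
    for row in rows:
        value = float(row[value_field])
        s = stats[row[key_field]]
        s["count"] += 1
        s["sum"] += value
        s["min"] = value if s["min"] is None else min(s["min"], value)
        s["max"] = value if s["max"] is None else max(s["max"], value)
    # The mean is derived once at the end from the running sum and count.
    return {k: {**s, "mean": s["sum"] / s["count"]} for k, s in stats.items()}


# Hypothetical in-memory input standing in for a real file on disk.
sample = io.StringIO("region,amount\neast,10\nwest,4\neast,30\n")
result = aggregate(read_rows([sample]), "region", "amount")
print(result["east"])
```

Because the rows are consumed one at a time, memory use grows with the number of distinct keys rather than the number of rows, which is the sort of tradeoff an interviewer might ask a candidate to discuss when the conversation turns to scaling.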

How can you learn to be a data engineer?

Our experts largely agree the best way to learn data engineering is to learn to code, then familiarize yourself with the platforms data engineers use. The discipline is code-centric, so it’s vital to know SQL, Python, and Java to start.

All agree that data engineers are, by nature, problem-solvers. “Tools like HackerRank can be useful for polishing particular problem-solving skills,” Benua notes, adding: “The best interviews are based on skills and experience that are generally picked up on the job, rather than on memorized algorithms or ritualized complex coding challenges. We often like to see candidates practice technical skills as well by taking advantage of free trials or credits in the cloud providers (especially GCP and AWS) to explore setting up their own basic ETL pipelines or simple services.”
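
For readers who want a starting point before touching a cloud account, the sketch below shows what a very basic ETL pipeline can look like in Python. It is an illustration only: it uses Python’s built-in `sqlite3` module as a local stand-in for a cloud data warehouse, and the table name, column names, and sample records are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the second record is missing its amount.
RAW = """user_id,event,amount
1,purchase,19.99
2,purchase,
1,refund,-5.00
3,purchase,42.50
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount and cast fields to real types."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        clean.append((int(row["user_id"]), row["event"], float(row["amount"])))
    return clean


def load(conn, records):
    """Load: create the target table if needed and insert the records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(user_id INTEGER, event TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))

# A simple analytic query over the loaded data: net amount per user.
totals = conn.execute(
    "SELECT user_id, ROUND(SUM(amount), 2) FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(totals)
```

Swapping the in-memory database for a managed warehouse, and the inline string for files in object storage, would be a natural next step when experimenting with a free cloud tier.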