Interview Qs for Pentaho's Data Toolset

Pentaho builds open-source business intelligence (BI) tools for companies, and its data-integration toolset, also known as Kettle, has gained popularity among tech pros who spend the majority of their time on data extraction or data warehousing. Alex Meadows, principal consultant for IT consulting firm CSpring (and co-author of Pentaho Data Integration Cookbook, Second Edition), always looks for experience with Pentaho’s data-integration engine when building teams for data-mining or data-integration projects. To judge candidates’ skills, as well as their aptitude for Extract, Transform and Load (ETL) processes and critical thinking, he generally asks the following questions:

How would you utilize logic from one transformation or job in another process?

  • What Most People Say: “Transformations and jobs can be nested, which allows me to put the same transformation or job logic into many different processes.”
  • What You Should Say: “Transformation logic can actually be shared by creating subtransformations.”
  • Why You Should Say It: While calling jobs and transformations within other processes does work, using subtransformations allows variables and fields to be mapped and loaded seamlessly into the rest of the transformation, which improves efficiency and productivity. Also, subtransformations can be called and reconfigured as needed for each process that needs to take advantage of the logic (see the sketch below). The second answer showcases not only your experience with Pentaho, but also your ability to create maintainable, object-oriented ETL code.
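
As a rough illustration of that reuse, here is a minimal sketch using the Kettle (PDI) Java API: the same parent transformation, which would wrap the shared logic in a Mapping (sub-transformation) step configured in Spoon, is launched twice with different parameter values. The file path and the SOURCE_SCHEMA parameter are hypothetical.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class ReuseSubtransformation {
  public static void main(String[] args) throws Exception {
    // Boot the Kettle engine once per JVM.
    KettleEnvironment.init();

    // load_customers.ktr is a hypothetical parent transformation that wraps
    // the shared logic in a Mapping (sub-transformation) step. Each caller
    // reconfigures that shared logic through a named parameter.
    runWith("/etl/load_customers.ktr", "SOURCE_SCHEMA", "crm");
    runWith("/etl/load_customers.ktr", "SOURCE_SCHEMA", "webshop");
  }

  private static void runWith(String ktrPath, String param, String value) throws Exception {
    TransMeta meta = new TransMeta(ktrPath);
    Trans trans = new Trans(meta);
    trans.setParameterValue(param, value);   // reconfigure the shared logic per process
    trans.execute(null);                     // start all step threads
    trans.waitUntilFinished();
    if (trans.getErrors() > 0) {
      throw new RuntimeException("Transformation failed for " + value);
    }
  }
}
```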

How would you connect database systems that have many instances?

  • What Most People Say: “Database connections can be parameterized and instances can be processed accordingly.”
  • What You Should Say: “Not all database systems are able to process instances within a single connection. Database connection clustering allows for the handling of these instances within a single PDI connection, which will stream data to or from each instance.”
  • Why You Should Say It: Parameterization will work, but it treats the connections serially unless multiple instances of a given process are explicitly set to run in parallel (as the sketch below illustrates). Clustering, on the other hand, allows instances to be grouped and processed accordingly. By being smart with clustering, the process can pull data in or write data out across a large number of database instances.
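
To make the contrast concrete, below is a hedged sketch of the parameterized approach: the same transformation runs once per database instance, and it is only parallel because the driver wraps the runs in a thread pool. The file path, the DB_HOST variable, and the host names are hypothetical; connection clustering, by contrast, is configured on the database connection itself in Spoon and keeps the work inside a single PDI connection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class ParameterizedInstances {
  public static void main(String[] args) throws Exception {
    KettleEnvironment.init();

    // Hypothetical database instances; with plain parameterization, each one
    // needs its own run of the transformation.
    List<String> hosts = Arrays.asList("db-01", "db-02", "db-03");

    ExecutorService pool = Executors.newFixedThreadPool(hosts.size());
    for (String host : hosts) {
      pool.submit(() -> {
        try {
          // Load a fresh copy of the transformation for each instance.
          TransMeta meta = new TransMeta("/etl/extract_orders.ktr");
          Trans trans = new Trans(meta);
          trans.setVariable("DB_HOST", host); // connection definition uses ${DB_HOST}
          trans.execute(null);
          trans.waitUntilFinished();
          if (trans.getErrors() > 0) {
            System.err.println("Extract failed for " + host);
          }
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}
```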

What's the best way to manage large sets of data through multiple transformations/jobs?

  • What Most People Say: “Data can be stored in flat files and passed from one part to the next.”
  • What You Should Say: “While flat files can be used to store data for processing from one job/transformation to the next, using the Serialize to file and De-serialize from file steps is a better way to pass the data.”
  • Why You Should Say It: Those steps store the data stream in a compressed file. Other file types, such as CSV and text, have to be re-parsed and reinterpreted into the data stream. The serialized file stores the data stream as processed, including its metadata, so the only cost is reading and writing the files on disk (see the sketch below).
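
Here is a minimal sketch of that hand-off pattern, assuming the Kettle Java API: a producer transformation ending with a Serialize to file step writes the compressed stream, and a consumer transformation starting with a De-serialize from file step picks it up unchanged. The file paths and the HANDOFF_FILE variable are hypothetical; the serialize/de-serialize steps themselves would be configured inside the .ktr files in Spoon.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class SerializedHandoff {
  public static void main(String[] args) throws Exception {
    KettleEnvironment.init();

    // Hypothetical hand-off file: the producer ends with a "Serialize to file"
    // step and the consumer starts with a "De-serialize from file" step, both
    // configured in Spoon to write/read ${HANDOFF_FILE}.
    String handoff = "/tmp/stage_orders.cube";

    run("/etl/extract_and_serialize.ktr", handoff);  // write the compressed stream plus metadata
    run("/etl/deserialize_and_load.ktr", handoff);   // pick the stream back up as processed
  }

  private static void run(String ktrPath, String handoffFile) throws Exception {
    TransMeta meta = new TransMeta(ktrPath);
    Trans trans = new Trans(meta);
    trans.setVariable("HANDOFF_FILE", handoffFile);
    trans.execute(null);
    trans.waitUntilFinished();
    if (trans.getErrors() > 0) {
      throw new RuntimeException("Transformation failed: " + ktrPath);
    }
  }
}
```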

How would you build a reusable process that can accommodate data stream changes?

  • What Most People Say: “Parameterization can make transformations/jobs reusable. Or, you can use subtransformations to accommodate reusability.”
  • What You Should Say: “Although my solution will vary depending on the situation, generally speaking, I would use ETL Metadata Injection because it allows for the same transformation steps to be reused regardless of the data stream being processed.”
  • Why You Should Say It: These types of processes work great in cases where the same logic applies to many different data streams. Without Metadata Injection, however, a process would have to be written for every unique data stream since the column names/metadata would not match. That’s not efficient.

The ETL Metadata Injection step is capable of injecting step metadata into a template transformation. It injects the same steps with the metadata for the given stream (think of it as an advanced form of subtransformation). An example would be injecting data streams into a process that starts with a row normalization or de-normalization step. From there, the data flows through the rest of the transformation, since at that point the stream’s metadata matches what the remaining steps expect.
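
As a hedged sketch of how such a driver might look with the Kettle Java API, the loop below runs one injection-based transformation per source file, letting a single template handle streams with different column layouts. The file names, the SOURCE_FILE parameter, and the injector transformation path are hypothetical; the ETL Metadata Injection step and its template would be configured in Spoon.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class InjectPerStream {
  public static void main(String[] args) throws Exception {
    KettleEnvironment.init();

    // Hypothetical source files with different column layouts. The injector
    // transformation reads each file's metadata and uses the ETL Metadata
    // Injection step (configured in Spoon) to populate a shared template
    // that begins with a Row Normaliser step.
    String[] sources = { "/data/sales_eu.csv", "/data/sales_us.csv", "/data/sales_apac.csv" };

    for (String source : sources) {
      TransMeta meta = new TransMeta("/etl/inject_normalize_template.ktr");
      Trans trans = new Trans(meta);
      trans.setParameterValue("SOURCE_FILE", source);  // hypothetical named parameter
      trans.execute(null);
      trans.waitUntilFinished();
      if (trans.getErrors() > 0) {
        throw new RuntimeException("Injection run failed for " + source);
      }
    }
  }
}
```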