How to Be Successful with Big Data Integration

As the noise around “big data” analytics intensifies, so does the complexity of dealing with it. The new big data platforms, from Hadoop to the NoSQL databases MongoDB, Cassandra and HBase, forgo the traditional relational database model in order to achieve massive scalability, exceptionally high performance and the ability to manage unstructured and semi-structured data. Each has novel ways of ingesting, manipulating and extracting data that do not use the familiar SQL/DDL language. Consequently, these new data platforms require a fresh approach to data integration that can accommodate the high volume, velocity and variety of big data stores.

However, as big data hype grows, so does the number of products that claim to do big data integration. Some only work with the Hadoop platform, while others can accommodate all varieties of big data stores and offer analytics as well. The first step in establishing a big data environment is to decide which data stores you will be working with, then align your data integration products and processes accordingly.

Hadoop Integration

Many big data users are deploying on Hadoop variants, including Apache Hadoop, Cloudera, MapR, Hortonworks and Amazon Elastic MapReduce. To a lesser extent, in my experience, big data users are turning to NoSQL databases and high-performance relational analytic databases such as Greenplum, Infobright, Netezza, Teradata, Vectorwise and Vertica.

These days, even the most basic Hadoop integration products need to go beyond lowest-common-denominator Hive integration and natively support the full range of core Hadoop interfaces, such as MapReduce and HDFS, as well as broader Hadoop ecosystem projects, including Pig, Sqoop and Oozie.

• Apache Hive is a powerful data warehousing application built on top of Hadoop that enables you to access your data using HiveQL, a language that is similar to SQL.

• Apache MapReduce is Hadoop’s core software framework for rapidly processing vast amounts of data in parallel on large clusters of compute nodes.

• Apache Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data and distributes them on compute nodes throughout a cluster to enable reliable and extremely rapid computations. Some Hadoop distributions offer alternatives to HDFS, including NFS (from MapR) and Apache Cassandra (from DataStax), which provide advantages such as even higher performance and the ability to mount the file system and browse it easily from common operating systems (Windows, OS X and Linux).

• Apache Pig is a high-level scripting language for creating MapReduce programs used with Hadoop. Pig programs are inherently parallelizable, enabling them to handle very large data sets.

• Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores, such as relational databases. A straightforward command-line tool, Sqoop (“SQL-to-Hadoop”) imports individual tables or entire databases to files in HDFS, generates Java classes to allow you to interact with your imported data and provides the ability to import from SQL databases straight into your Hive data warehouse.

• Apache Oozie is a server-based workflow engine that specializes in running workflow jobs whose actions execute Hadoop jobs, such as MapReduce, Pig, Hive, Sqoop and HDFS operations, as well as sub-workflows.
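To make the MapReduce model in the list above more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for illustration. It shows the three stages a MapReduce job goes through: mapping records to key-value pairs, grouping values by key, and reducing each group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

On a real cluster the map and reduce calls would run in parallel on many nodes, with the framework handling the shuffle over the network; here everything runs in one process so the data flow is easy to follow.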

Taking the NoSQL Database Route

If you decide to go the NoSQL database route instead (although, technically speaking, Hadoop itself is a form of NoSQL tabular key-value store), then your big data integration package needs to support MongoDB, Cassandra and HBase. Rapidly growing in popularity, MongoDB offers a scalable, high-performance open-source NoSQL document store featuring auto-sharding for horizontal scalability, rich document-based queries and fast in-place updates.
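As a rough illustration of what sharding for horizontal scalability means, the sketch below routes documents to shards by hashing a shard key. This is a simplified assumption for teaching purposes, not MongoDB's actual mechanism, which is more involved; the names shard_for and distribute are hypothetical.

```python
import hashlib
from typing import Any, Dict, List

Document = Dict[str, Any]


def shard_for(key: str, shard_count: int) -> int:
    """Map a shard-key value to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(docs: List[Document], shard_key: str, shard_count: int) -> List[List[Document]]:
    """Place each document on the shard chosen by hashing its shard-key field."""
    shards: List[List[Document]] = [[] for _ in range(shard_count)]
    for doc in docs:
        index = shard_for(str(doc[shard_key]), shard_count)
        shards[index].append(doc)
    return shards
```

Because the hash is stable, the same key always lands on the same shard, so a lookup by shard key only needs to contact one node. An integration tool that understands the store's partitioning can exploit this when reading or writing in bulk.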

A viable NoSQL database data integration solution should combine:

• An easy-to-use visual development environment that allows everyone, from IT staff to data analysts and business users, to quickly ingest, manipulate, report on, visualize and explore their big data.

• Hybrid data integration that enables the NoSQL database to be used directly as a data source for reports and dashboards, or to be easily integrated with data from other sources into a data warehouse for a 360-degree view of the business.

• Integration of the NoSQL database into the broader enterprise fabric of existing big data and traditional data stores for a complete big data management and analytics solution that extends how and where information can be used.
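One way to picture the hybrid integration described in the list above is a small load step that flattens nested documents into rows and writes them to a relational table. The sketch below uses Python's built-in sqlite3 module as a stand-in for a data warehouse; the orders table, its columns and the function names flatten and load_orders are all assumptions made for illustration, and a real document store would be read through its own driver.

```python
import sqlite3
from typing import Any, Dict, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, e.g. {"a": {"b": 1}} -> {"a_b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def load_orders(conn: sqlite3.Connection, docs: List[Dict[str, Any]]) -> None:
    """Create the (hypothetical) orders table and insert one row per document."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id TEXT PRIMARY KEY, customer_name TEXT, "
        "customer_region TEXT, total REAL)"
    )
    rows = [flatten(doc) for doc in docs]
    # Named placeholders are filled from the flattened document keys.
    conn.executemany(
        "INSERT INTO orders VALUES "
        "(:order_id, :customer_name, :customer_region, :total)",
        rows,
    )
    conn.commit()
```

Once the documents are in tabular form, they can be queried with ordinary SQL alongside data from other sources, which is the point of the hybrid approach.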

Hybrid Big Data Integration

So while some companies may specifically be looking for Hadoop or NoSQL data integration, others are looking for a comprehensive solution that will connect on the back end to multiple big data sources as well as their enterprise data stores. Overall, a big data integration solution will be most successful if it:

1. Lifts the major constraints around big data storage and data processing platforms so that, for Hadoop, there are no inherent delays in accessing data across large clusters of computers, and, for NoSQL databases, restrictions on querying data (such as the inability to sort and group data or perform joins) are removed.

2. Eliminates the technical barriers—users need simple, easy-to-use, and high-productivity visual development interfaces for high-performance data input, output and manipulation regardless of which big data platform (from Hadoop to the NoSQL databases) they deploy. This makes it easy and productive for IT staff, developers, data scientists and business analysts to operationalize, integrate and analyze both big data and traditional data sources.

3. Facilitates integration with enterprise data—even big data cannot thrive on its own. Ultimately, enterprises need to integrate their big data platform with the rest of their data stores. To make this happen, they need effective tools to connect their Hadoop and NoSQL databases with traditional relational databases, data exchange formats, and enterprise applications.

4. Delivers a complete business analytics solution that includes everything from reports and dashboards to interactive visualization, exploration and predictive analytics.
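The first point in the list above mentions lifting query restrictions such as missing joins, grouping and sorting. One way an integration layer could do this is to perform those operations itself on records fetched from the store. The sketch below is a minimal in-memory version under that assumption; hash_join, group_sum and sort_desc are hypothetical helper names, and the records are plain dictionaries rather than results from a real database driver.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def hash_join(left: Iterable[Record], right: Iterable[Record], key: str) -> List[Record]:
    """Inner-join two record collections on a shared key using a hash index."""
    index: Dict[Any, List[Record]] = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def group_sum(rows: Iterable[Record], group_key: str, value_key: str) -> Dict[Any, float]:
    """Sum value_key for each distinct value of group_key."""
    totals: Dict[Any, float] = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def sort_desc(totals: Dict[Any, float]) -> List[Tuple[Any, float]]:
    """Return (group, total) pairs ordered from largest total to smallest."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

A usage example with made-up data:

```python
orders = [
    {"customer_id": 1, "total": 10.0},
    {"customer_id": 2, "total": 5.0},
    {"customer_id": 1, "total": 2.5},
]
customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
]
joined = hash_join(orders, customers, "customer_id")
print(sort_desc(group_sum(joined, "region", "total")))
# [('EU', 12.5), ('US', 5.0)]
```

Doing this in memory only works for modest volumes; at big data scale the same logic would have to be pushed down into the cluster.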

An all-round big data integration offering will enable you to input, output, manipulate and report on data using Hadoop and NoSQL stores, including: Apache Cassandra, Hadoop HDFS, Apache Hive, Apache HBase and MongoDB. It should provide easy job orchestration across Hadoop, Amazon EMR, MapReduce, Pig scripts, NoSQL databases and traditional data stores. Finally, for Hadoop, it should be able to run in-Hadoop as MapReduce, leveraging your investment in Hadoop’s massively parallel distributed data storage and processing across the cluster.
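Job orchestration of the kind mentioned above usually comes down to running steps in dependency order. The sketch below shows that idea with Python's standard-library graphlib module (Python 3.9 or later). It is not how Oozie or any particular product is implemented; run_workflow and the step names in the usage example are hypothetical, and each step is just a Python callable standing in for a real Sqoop, Pig or Hive job.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set


def run_workflow(
    steps: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
) -> List[str]:
    """Run every step after all of its prerequisites and return the order used."""
    unknown = {dep for deps in depends_on.values() for dep in deps} - set(steps)
    if unknown:
        raise ValueError(f"dependencies refer to undefined steps: {sorted(unknown)}")
    graph = {name: depends_on.get(name, set()) for name in steps}
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        steps[name]()
    return order
```

A usage example with placeholder steps:

```python
log: List[str] = []
steps = {
    "sqoop_import": lambda: log.append("sqoop_import"),
    "pig_clean": lambda: log.append("pig_clean"),
    "hive_report": lambda: log.append("hive_report"),
}
depends_on = {"pig_clean": {"sqoop_import"}, "hive_report": {"pig_clean"}}
print(run_workflow(steps, depends_on))
# ['sqoop_import', 'pig_clean', 'hive_report']
```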


Richard Daley co-founded Pentaho in 2004 and is responsible for strategic initiatives, customer and partner relationships and leading product strategies including big data, customer adoption and cloud analytics. He has held key executive management positions in the business intelligence software market for over 20 years, starting his career at IBM. Richard was a co-founder at AppSource Corporation (acquired by Arbor Software which later merged into Hyperion Solutions) and Keyola (acquired by Lawson Software) and is an avid water skier.
