There’s a lot of interest these days in integrating self-service business intelligence and analytics applications with multiple sources of Big Data inside and out of the cloud. The challenge is figuring out the best way to go about doing that in a world where transferring large amounts of data across a network can be prohibitive.
Recent examples of this sort of integration include a B.I. application delivered as a service from Tableau Software, combined with the High-Performance Analytic Appliance (HANA) platform from SAP. Another is an alliance between QlikTech, a provider of a B.I. application as a service, and Attivio, which offers enterprise search technologies that work across both structured and unstructured data.
“Applications have to be able to bring structured and unstructured data together,” said Attivio CEO Ali Riaz. “Otherwise, all we’re going to wind up with is all these stovepipes of data.”
Francois Ajenstat, director of product management for Tableau Software, suggested that the SaaS software provider views HANA as just another source of data that needs to be visualized alongside the other data sources that Tableau already supports, which include Attivio.
Driving all the interest in Big Data these days is a desire for more confidence in the results being generated by analytics applications, which historically have tended to sample slices of data rather than analyze all the raw data the company collects. With the advent of technologies such as the open source Apache Hadoop data management framework, it’s become more cost-effective to analyze raw unstructured data in a way that doesn’t require the IT department to continually produce models.
“Decision makers want to have more confidence in the data,” said David White, an industry analyst with the market research firm Aberdeen Group. “IT departments, meanwhile, want to get out of the report-writing business.”
Of course, there’s no shortage of companies trying to position their analytics applications as the perfect front end for Big Data. Alteryx, for example, has a B.I. application for structured and unstructured data that will soon be available both in the cloud and on-premises. At the same time, a new class of analytics applications specifically designed for Hadoop, from companies such as Datameer and Karmasphere, is also starting to gain some traction.
The big issue that IT organizations need to consider when thinking about B.I. applications’ access to Big Data, said Alteryx president George Mathew, is where that data is actually located relative to the amount of it that needs to be transferred across the network. Small bits of data that are cached in memory on servers in the cloud are not much of a challenge. But once you get up in the realm of 1TB of data, it becomes faster and less expensive to simply ship a 1TB drive overnight between two data centers. “Moving data right now into the cloud can be a bit daunting,” he noted. “That’s one of the reasons we’re counting on programmatic APIs to share data.”
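Mathew’s point about transfer costs is easy to check with back-of-envelope arithmetic. The sketch below compares the time to move 1TB over a network link against overnight shipping of a drive; the bandwidth figures are illustrative assumptions, not measurements from any particular network.

```python
# Back-of-envelope comparison: moving 1 TB over a network link versus
# shipping a drive overnight (~24 hours). Bandwidth figures here are
# illustrative assumptions.

TB_BITS = 1e12 * 8  # 1 TB expressed in bits


def transfer_hours(bandwidth_mbps):
    """Hours needed to move 1 TB at a sustained rate in megabits/second."""
    return TB_BITS / (bandwidth_mbps * 1e6) / 3600


for mbps in (100, 1000):
    print(f"{mbps:>5} Mb/s: {transfer_hours(mbps):6.1f} hours")
# At a sustained 100 Mb/s the transfer takes roughly 22 hours, which is
# already comparable to overnight shipping -- before accounting for
# congestion, retries, or bandwidth charges.
```

In practice the break-even point depends on sustained (not advertised) throughput, which is why the shipping comparison holds up even on nominally fast links.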
Another factor that can negatively affect performance, added Vincent Granville, an independent data scientist consultant and publisher of the AnalyticBridge newsletter and social networking site for data scientists, is whether the B.I. application relies on a SQL or a MapReduce interface to access Big Data.
SQL is by far the more common interface within most organizations. But SQL is a lot slower than the MapReduce used in Hadoop environments, Granville said, because MapReduce takes better advantage of multi-threading.
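The parallelism Granville refers to can be illustrated with a minimal MapReduce-style word count: the map step runs across worker processes in parallel, and a reduce step merges the partial results. This is a sketch of the pattern only, not Hadoop itself.

```python
# Minimal sketch of the MapReduce pattern: parallel map, then reduce.
from collections import Counter
from multiprocessing import Pool


def map_words(line):
    """Map step: count the words in one line of text."""
    return Counter(line.split())


def reduce_counts(partials):
    """Reduce step: merge per-line counts into a single total."""
    total = Counter()
    for partial in partials:
        total += partial
    return total


if __name__ == "__main__":
    lines = ["big data big", "data pipeline"]
    # The map step fans out across worker processes; this per-record
    # independence is what lets MapReduce exploit many cores or nodes.
    with Pool(2) as pool:
        partials = pool.map(map_words, lines)
    print(reduce_counts(partials))
    # Counter({'big': 2, 'data': 2, 'pipeline': 1})
```

Because each map call is independent, the work scales out across cores or machines, whereas a single SQL query engine may serialize much of the same scan.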
Regardless of whether your organization favors SQL or MapReduce, as a general rule of thumb it’s usually best to locate the application as close to the data source as possible. That doesn’t necessarily mean that a cloud application, for example, can’t successfully be fed data from a source of Big Data outside the cloud—but some form of pre-processing of that data will generally be required to keep costs manageable.
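The pre-processing step described above usually amounts to aggregating or filtering raw records near the source so that only a small summary crosses the network. The sketch below illustrates the idea; the field names and the aggregation are illustrative assumptions, not a prescribed schema.

```python
# Hedged sketch of pre-processing near the data source: collapse raw
# per-event records into a small summary before sending it to a cloud
# B.I. application. Field names here are illustrative assumptions.
from collections import defaultdict


def summarize(records):
    """Collapse raw (region, sales_amount) events into per-region totals."""
    totals = defaultdict(float)
    for region, amount in records:
        totals[region] += amount
    return dict(totals)  # a tiny payload versus shipping every raw event


raw_events = [("east", 10.0), ("west", 5.0), ("east", 2.5)]
print(summarize(raw_events))
# {'east': 12.5, 'west': 5.0}
```

The summary preserves what the dashboard actually needs while cutting the transfer volume by orders of magnitude, which is what keeps the cloud-feeding scenario cost-effective.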