The Data Warehouse Institute’s Global Conference in San Diego, scheduled to run through Aug. 3, offers ample evidence that the era of Big Data is truly upon us: a number of companies are reporting data warehouses fast approaching the half-a-petabyte range. What’s startling, however, is that this giant number is nothing like an upper limit: we are about to collectively experience a flood of information that will make today’s data warehouses seem tiny.
There’s a host of new data types that companies want to add to the information mix, Harriet Fryman, director of business intelligence software at IBM’s Ottawa R&D Labs, told attendees at the conference. Making use of these new types of data presents both a challenge and an opportunity, with the ultimate goal to understand customers in deeper ways and make better business decisions.
While customer activity generated on social networks, mobile devices, and the Web is responsible for much of this rich assortment of data, sensors are the next big thing: deployed in industries ranging from healthcare to utilities, they provide new insights into customer behavior and business performance. “I personally think that applying analytics to sensor data is the next big thing,” said Becky Hanenkrat, IBM Big Data sales specialist.
However, there are a number of challenges to be overcome in order to reach this Big Data nirvana, according to conference attendees such as Hiren Deliwala, managing director for business systems at Amedisys in Baton Rouge, LA. His company, which is in the home health business, focuses primarily on managing the data warehouse for financial and business data. “We’re really not looking at unstructured data yet,” he said.
The big data industry is still building the infrastructure needed to integrate structured and unstructured data, according to the many solution providers and analysts at the conference. “The next stage will be to make people more analytical in their operations,” said Michael Corcoran, senior vice president at Information Builders. “I don’t get excited about big data. I get excited about how companies make use of Big Data—integrating social and text analytics, visualization and search.”
Some of the key issues under discussion included:
- The need for different tools to manage different types of data.
- Silos of customer data that reside in different systems and have not yet been integrated to provide the much-desired 360-degree view of the consumer.
- Integrating and processing all of this diverse data can be costly.
- Specialized skills are required to implement these systems and analyze the data.
- Only about 20 percent of the people at most companies have the ability to digest this complex data.
Traditional data warehouses, data marts and business intelligence tools do a good job of managing structured or transactional data. These are mature technologies, and business analysts have a lot of experience with them. But only about 10 percent of the data that companies generate is structured.
The remaining information includes text, unstructured, hierarchical, event, social, web, spatial and sensor data. There are various and sundry tools for inspecting and reporting on each type of data, which nonetheless leaves organizations with silos of information used by different departments.
Hadoop was a technology much discussed at the conference, with a number of classes on various aspects of the technology. Hadoop and its associated technologies are emerging as a unifying principle behind big data. Hadoop is an open source Apache Software Foundation project designed to handle massive amounts of complex data using clusters of commodity hardware. It has several components:
- The Hadoop Distributed File System (HDFS), developed largely at Yahoo and written in Java, lets large volumes of data be stored and quickly accessed across large server clusters.
- MapReduce is a framework, based on a programming model introduced by Google, for writing applications that process large amounts of structured and unstructured data in parallel across clusters.
- Hive is a data warehouse built on the MapReduce framework that enables ad-hoc queries against large datasets stored in HDFS.
- Pig is a platform for analyzing large datasets, built around a high-level language for expressing data analysis programs.
- HBase is a column-oriented data store that provides real-time access to big data.
- ZooKeeper is a high-availability coordination service for distributed processing.
- Ambari is an open source management tool for Hadoop clusters.
- HCatalog is a centralized metadata management system.
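To give a feel for the MapReduce model that several of these components build on, here is a minimal, in-memory sketch of the classic word-count example in Python. This is not Hadoop code; it simply mimics the three conceptual phases (map, shuffle, reduce) that a Hadoop cluster distributes across many machines.

```python
from collections import defaultdict

def map_phase(records):
    # Map step: emit a (word, 1) pair for every word in every input record.
    for record in records:
        for word in record.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce step: group intermediate pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(records):
    return reduce_phase(map_phase(records))

print(word_count(["big data big insights", "data warehouse"]))
# → {'big': 2, 'data': 2, 'insights': 1, 'warehouse': 1}
```

On a real cluster, the map and reduce functions run in parallel on chunks of data stored in HDFS, and the framework handles the shuffle between them; the programming model, however, is exactly this simple.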
A number of vendors have productized Hadoop, including IBM, Information Builders, Hortonworks and others.
The R programming language is also gaining traction in the Big Data world, according to Paul Ross, VP of product marketing at Alteryx, a business intelligence vendor based in Irvine, CA. The open source language is becoming widely used for creating data analysis and statistical applications.
But as Geoffrey Guerdat, director of data engineering for the Gilt Groupe, will tell you: these are still the early days.
Gilt is an online fashion brand headquartered in New York. Speaking at the TDWI Conference this week, Guerdat shared his experience in trying to analyze social media data about his company. Using a variety of tools such as Cognos, Spotfire and Aster Data, Gilt tried to optimize its digital marketing efforts and understand consumer sentiment based on its customers’ Twitter activity. While it was able to capture consumer sentiment about particular products, the effort didn’t result in significant insights that would drive new business initiatives, Guerdat says.
But students and researchers at the Annenberg Innovation Lab at the University of Southern California in Los Angeles had a more positive experience in gauging consumer sentiment in the context of politics. Using IBM’s Streams data-analysis tool, students looked at how President Obama and Republican candidate Mitt Romney fared on Twitter. Researchers found that public sentiment about politics and the upcoming race is quite negative and that sarcasm was a widely used device for expressing opinions.
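The simplest form of this kind of sentiment analysis can be sketched as a lexicon-based scorer: count positive and negative words per tweet and average across the stream. The word lists and sample tweets below are purely illustrative and have nothing to do with the actual IBM Streams project.

```python
# Hypothetical sentiment lexicons, for illustration only.
POSITIVE = {"great", "win", "hope", "strong"}
NEGATIVE = {"fail", "worst", "negative", "angry"}

def sentiment_score(tweet):
    # Score = (# positive words) - (# negative words);
    # > 0 reads as positive, < 0 as negative, 0 as neutral.
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

tweets = ["Great debate win", "Worst campaign fail ever", "So angry about this race"]
avg = sum(sentiment_score(t) for t in tweets) / len(tweets)
print(avg)
```

A scorer this naive also shows why the researchers’ observation about sarcasm matters: a sarcastic “great, another strong performance” scores as positive even when the intent is the opposite, which is one reason real systems need far more sophisticated analysis.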
According to Jonathan Taplin, executive director of the lab and a conference presenter, the success of this experiment was based on bringing together people with various skill sets to participate in the project. He also noted that vendors must develop analysis tools that can be used by a wide range of people, not just those with engineering and business intelligence expertise.