Managing Big Data at Scale: Challenges and Opportunities

Fortunately, data overload won’t actually make hard drives explode into little bits.

For seemingly as long as anyone can remember, the cost of IT has escalated along with the volume of data under management. As the volume of data increases to epic levels (hence that overused buzzword, “Big Data”), the cost of managing it is quickly becoming prohibitive.

That being said, the tech world is on the cusp of developing new approaches to analyzing data, and new memory technologies are set to significantly increase the speed at which it can be (cost-effectively) processed. Together, those two changes are about to transform just about every aspect of modern enterprise IT.

Roughly 80 percent of data sitting in enterprise databases is of uncertain value, according to IBM research manager Gabi Zodik, who spoke at the recent IBM Innovate 2012 conference. Sensor data, for example, is imprecise; much of the unstructured data collected via social media applications comes in the form of highly nuanced text (which makes it difficult for machines to determine meaning and context).

Zodik also noted that multiple sources are often inconsistent with one another; not only does an organization need to manage all that uncertain data at scale, but also correlate it: “We need to be able to identify all the interconnections when, for example, a volcano explodes in Iceland.”

One of the first attempts to tackle that correlation issue comes in the form of IBM’s Analytical Decision Management application for social networking data, which makes use of an entity analytics engine that allows business users to correlate data in real time across millions of big data records.

According to Erick Brethenoux, IBM director of predictive analytics and decision management market strategy, the offering is the first in a series of IBM Smarter Analytics offerings that marries predictive analytics software that IBM inherited via its acquisition of SPSS, entity analytics engine technology the company gained with the acquisition of SRD in 2005, and the business rules engine technology that IBM acquired when it purchased ILOG in 2009.

IBM’s entity analytics engine builds on the Non Obvious Relationship Awareness (NORA) algorithm that SRD developed to help identify a card-counting scheme being perpetrated on Las Vegas casinos by students from the Massachusetts Institute of Technology (as portrayed in the movie “21”).

Now IBM is taking that technology to another level in the form of a G2 project led by Jeff Jonas, an IBM chief scientist and research fellow, who is building systems that attempt to identify the relationship between any new piece of data and every other piece of existing data in real time within the construct of a streaming analytics application. In theory, that approach not only makes it easier to make better decisions faster about new data, but could also reduce the amount of data in need of processing and storage, because the system automatically recognizes duplicate data.

The G2 project borrows concepts from IBM’s Blue Gene supercomputer to process data in parallel across thousands of nodes. Context for all that data comes via a metadata server distributed across 100 TB of solid-state drives (SSDs) that allows the system to “sense and respond” to new data in under 200 milliseconds. As data arrives, the system first determines whether it has previously stored that data; only if the data is unique does it go on to process it. The metadata server then remembers the relationship between that data and related pieces of information, in much the same way that a human puts a puzzle together. The challenge is that G2 systems have to be fast enough to compare, contrast and update every piece of data every time a new piece of data is processed.
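
The check-then-link flow described above can be illustrated with a small Python sketch. It is not IBM’s G2 design: the ContextStore class, its field names and the rule that two records are related when they share an attribute value are all assumptions made purely for illustration.

```python
import hashlib
from collections import defaultdict


class ContextStore:
    """Toy in-memory stand-in for a metadata server (hypothetical, not IBM's design)."""

    def __init__(self):
        self.seen = set()                   # fingerprints of records already stored
        self.by_value = defaultdict(set)    # attribute value -> ids of records holding it
        self.related = defaultdict(set)     # record id -> ids of related records

    @staticmethod
    def fingerprint(record):
        # Hash a canonical form so identical records always yield the same digest.
        canonical = repr(sorted(record.items()))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def ingest(self, record_id, record):
        """Return False for a duplicate; otherwise store the record and link it."""
        digest = self.fingerprint(record)
        if digest in self.seen:
            return False                    # duplicate: nothing new to process
        self.seen.add(digest)
        for value in record.values():
            # Assumed rule: any earlier record sharing this value is related.
            for other in self.by_value[value]:
                self.related[record_id].add(other)
                self.related[other].add(record_id)
            self.by_value[value].add(record_id)
        return True


store = ContextStore()
print(store.ingest("r1", {"name": "A. Smith", "phone": "555-0100"}))    # new record
print(store.ingest("r2", {"name": "Alan Smith", "phone": "555-0100"}))  # new, shares a phone
print(store.ingest("r3", {"name": "A. Smith", "phone": "555-0100"}))    # same content as r1
print(store.related["r1"])
```

Because the duplicate is rejected before any linking happens, only unique records add to the work of maintaining relationships, which is the storage and processing saving the approach is counting on.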

Brute Force

In many ways, G2 relies on some of the same brute force computation techniques used by data management frameworks such as Hadoop to keep track of and process data in real time. The scale and cost of IBM’s attempts far exceed anything currently dreamt of using Hadoop, yet both approaches stand as examples of how the IT industry is beginning to make greater use of multiple processors running in parallel to solve new problems.

In fact, Paul Kent, vice president of Big Data research and development at SAS, suggests we’re entering a new age where applications are in the process of being re-written to take advantage of the massively parallel processing capabilities now available in multi-core processors. There are no compilers that will magically make that happen, but Kent says that companies such as SAS are working on modernizing their algorithms to take advantage of the inherent parallelism provided by multicore processors that can easily be clustered together: “It was the rise of Hadoop that really brought all this to everyone’s attention.”

The biggest issue with parallelism in general, however, may have nothing to do with the technology. Most developers are trained to think sequentially, in a linear pattern, rather than defining tasks and jobs in a way that can actually be processed in parallel. “There’s a talent gap when it comes to data parallelism,” noted James Reinders, chief evangelist for Intel Software products. “There are not that many people [who] can program a 1,000-node cluster.”
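
To make the shift in thinking concrete, here is a minimal Python sketch of a data-parallel word count. The function names and the sample data are invented for illustration; the point is that once the work is phrased as independent chunks plus a merge step, the same code runs with the built-in map or with a pool of workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(lines):
    """Count words in one chunk; it touches no shared state, so chunks are independent."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def split_into_chunks(lines, n_chunks):
    """Deal lines round-robin into n_chunks separate lists."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def word_count(lines, n_chunks=4, map_fn=map):
    """Map count_chunk over the chunks, then merge the partial counts (the reduce step)."""
    total = Counter()
    for partial in map_fn(count_chunk, split_into_chunks(lines, n_chunks)):
        total.update(partial)
    return total


lines = ["big data at scale", "data in parallel", "parallel data"] * 1000
sequential = word_count(lines)                    # built-in map, one chunk at a time
with ThreadPoolExecutor(max_workers=4) as pool:
    pooled = word_count(lines, map_fn=pool.map)   # same code, chunks handed to a pool
assert sequential == pooled
```

A thread pool is used here only to keep the sketch short. In CPython, threads will not speed up CPU-bound counting; a process pool or a cluster framework would take the place of pool.map for real workloads, while the decomposition into independent chunks stays the same.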

But now that parallelism (in one form or another) has appeared on just about every major vendor’s radar screen, it’s becoming apparent that existing computer architectures are not going to be able to keep pace with the demands of streaming analytics applications and other types of “cognitive applications” that in the future will need systems as much as 10,000 times faster than anything we have today.

In fact, it’s that requirement that is driving billions of dollars of research into in-memory computing using new materials and technologies. While a multitude of new memory technologies is being researched, the one that appears to be coming to market the fastest is memristor technology, which Hewlett-Packard says will be available in production systems starting in the 2014-2015 timeframe.

On a certain level, the memristor approach to non-volatile memory is relatively simple in that it builds on traditional resistor technology. As current flows in one direction through a memristor, resistance increases. When current flows in the opposite direction, the resistance decreases. When the current is stopped, the memristor retains the last resistance that it had, and when the flow of charge starts again, the resistance of the circuit will be what it was when it was last active. HP is using that technology essentially as a switch to build systems that will be not only substantially faster than traditional DRAM systems, but also consume much less energy.
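
The behavior described above can be mimicked with a deliberately idealized Python model. The ToyMemristor class, its resistance limits and its ohms-per-coulomb constant are illustrative assumptions rather than HP device data, and real devices are considerably more complicated.

```python
class ToyMemristor:
    """Idealized model: resistance shifts with the charge passed and persists at rest."""

    def __init__(self, r_min=100.0, r_max=16000.0, ohms_per_coulomb=1.0e6):
        # All three values are illustrative assumptions, not measured device data.
        self.r_min = r_min
        self.r_max = r_max
        self.ohms_per_coulomb = ohms_per_coulomb
        self.resistance = r_min

    def apply_current(self, amps, seconds):
        """Positive current raises resistance, negative lowers it, zero leaves it alone."""
        charge = amps * seconds
        updated = self.resistance + self.ohms_per_coulomb * charge
        # Clamp to the assumed physical limits of the device.
        self.resistance = min(self.r_max, max(self.r_min, updated))
        return self.resistance

    def read_bit(self):
        """Interpret the stored resistance as one bit, using the midpoint as threshold."""
        return 1 if self.resistance > (self.r_min + self.r_max) / 2 else 0


cell = ToyMemristor()
cell.apply_current(0.001, 20)     # current in one direction: resistance rises
cell.apply_current(0.0, 3600)     # no current for an hour: the state is retained
print(cell.read_bit())
cell.apply_current(-0.001, 20)    # current in the opposite direction: resistance falls
print(cell.read_bit())
```

Reading the resistance as a high or low state is what lets such a cell act as a switch that keeps its value without power, which is the property that makes it interesting as non-volatile memory.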

“In four to five years we’ll have high-bandwidth, low latency nodes made up of essentially memristor blades,” said Paul Miller, vice president of converged application systems for the HP Enterprise Group.

Just as there are different tiers of storage today, there will eventually be different tiers of memory technologies that provide different performance characteristics at various price points. The challenge facing developers is mastering the nuances of different memory technologies that are all going to be capable of processing data in “real time.”

In fact, many IT organizations will need to sort out their definition of “real time.” Technically, real-time means less than 200 milliseconds. But for many businesses, real time can mean within 15 minutes, which is how long it might take for said business to fully appreciate the relevance of an actual event and then take action.

“For a lot of organizations there’s a difference between real time and near real time that might have as much as a 15-minute delay,” noted Vincent Granville, an independent data scientist consultant and publisher of the AnalyticBridge newsletter and social networking site for data scientists.

That doesn’t mean there won’t be a host of business processes kicked off in less than 200 milliseconds after an event occurs, but the definition of real time as it applies to the usage of analytics will differ markedly from organization to organization.

Regardless of the exact definition applied, demand for applying split-second (or split-millisecond) analytics against increasingly bigger sources of data is skyrocketing.

“As more business people discover the power of analytics they are pounding data warehouses looking for instantaneous results,” said Howard Dresner, chief research officer for Dresner Advisory Services. “Mobile usage, in particular, is really driving that.”

Unfortunately, most data warehouses were built to support the occasional canned query against a SQL database.

“Today people want to be able to ask queries immediately based on the answers to their previous queries,” added Shawn Rodgers, an industry analyst with the IT market research firm Enterprise Management Associates.

Storage and Memory

In the absence of any major technical breakthroughs such as memristors, IT organizations have been shifting analytic application workloads that are performance-sensitive to any number of memory technologies, including offerings such as the High Performance Analytics Appliance (HANA) software developed by SAP, massively parallel databases, distributed caching software running on multicore processors, and arrays of SSDs that have been configured to resemble incredibly fast disk drives.

“When it comes to scale out parallelism, there are a lot of forces at work,” said Rodgers. “But if you can make your application run independent of disk, we’re talking about running analytics applications 1,000 times faster than they do using a traditional data warehouse.”

As impressive as that might be, however, that kind of processing capability is already exceeding our collective ability to keep pace with it. There’s already a shortage of data scientists capable of building analytic applications that take advantage of Big Data, and IT organizations don’t really have the tools they need to manage and govern that amount of data.

“The enterprise data warehouse was always a pie in the sky to begin with,” said Dresner. “Governance, meanwhile, is already suffering and nobody is coupling investments in data scientists with governance.”

Dresner says that, for the foreseeable future, it’s actually likely that investments in Big Data and advanced analytics will result in an explosion of different data silos, making it more expensive to manage data.

Robin Bloor, founder of the Bloor Research Group, believes that the way unstructured data is stored will need to fundamentally change as well. In order to maintain context across multiple applications, databases will increasingly become based on “triplestore” architectures that identify data entities using a “subject-predicate-object” model, which ultimately will make it easier to build metadata repositories that more readily identify and maintain the relationship between different sets of data. Those data stores will be invoked using a Resource Description Framework (RDF) specification for creating metadata models that is currently being crafted by the World Wide Web Consortium (W3C). “Triples are going to be needed to manage anything to do with unstructured text,” said Bloor.
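
The subject-predicate-object idea is easy to see in a few lines of Python. The sketch below is a toy in-memory store, not an RDF implementation; the TripleStore class and the identifiers in the sample data (volcano:1, flight:42 and so on) are hypothetical.

```python
class TripleStore:
    """Minimal in-memory store of (subject, predicate, object) facts."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the triples that fit the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple for triple in self.triples
            if all(p is None or p == v for p, v in zip(pattern, triple))
        )


facts = TripleStore()
facts.add("volcano:1", "locatedIn", "Iceland")
facts.add("flight:42", "disruptedBy", "volcano:1")
facts.add("flight:43", "disruptedBy", "volcano:1")

# Which flights are connected to the volcano?
print(facts.match(predicate="disruptedBy", obj="volcano:1"))
# What do we know about the volcano itself?
print(facts.match(subject="volcano:1"))
```

Because every fact has the same three-part shape, new kinds of relationships can be added without changing a schema, and any part of a triple can serve as the starting point for a query.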

Land Rush

It’s becoming apparent that just about everything to do with processing, storing and analyzing data is in the early stages of a new Renaissance period. Part of the need for such a Renaissance is the growing recognition that analytics represent a distinct type of application workload. Almost since the dawn of the database, IT vendors have been trying to optimize the use of processors designed primarily with transaction processing workloads in mind to run analytic applications. As memory technology continues to evolve, it should become increasingly possible to tune those technologies to meet the distinct processing requirements of analytic applications.

In the meantime, the rise of analytics applications is creating IT scenarios where business executives have started to see more potential value in investing in IT. Instead of treating data as a cost to be minimized, many organizations are starting to think of data as a resource that needs to be exploited. In fact, the ability of one organization to effectively compete against another will increasingly come down to which one not only gains access to the most data, but actually exploits it best.

For that reason, Siki Giunta, vice president of cloud computing and software services for the IT services provider CSC, notes that there is already a bit of a land rush going on in terms of companies looking to aggregate massive amounts of data. Many of them are not sure just what to do with that data yet, but Giunta thinks the emerging demand for access to aggregated sources of data should create opportunities for service companies such as CSC to deliver aggregated data services, which CSC customers could then compare against their own internal data.

While there is nothing particularly new about delivering data feeds, the ability to do so in the cloud would cost-effectively provide unprecedented access to massive amounts of information instantly, Giunta said.

There are, of course, a lot of issues that need to be dealt with before any of that becomes a commonplace reality—not the least of which is the need for more advanced data compression algorithms, substantially faster networks and standard application programming interfaces that will make it easier for developers to access that data.

In fact, while social networks tend to get most of the attention when it comes to generating Big Data, the Big Data that organizations are going to value most is going to be created by sensors providing continuous streams of data about a specific business process.

“Because of embedded systems there’s going to be thousands of processors for every person on the planet,” Bloor said.

As all that Big Data becomes more and more of a competitive business weapon, IT organizations will be judged on how well they arm organizations to compete. Organizations will expect to be able to automatically trigger business events based on the data they collect. If the data is wrong or the analytic algorithms being used to make sense of it have been miscalculated, the cascading effect of a wrong decision could be catastrophic. On the other hand, if all goes as planned, the benefits to the business could be astronomical, given the real-time business insights that would be enabled by IT. It’s just that given the processing speeds involved, the margin for error in terms of making that actually happen is going to be slim to none.

 

Image: Milos Stojanovic/Shutterstock.com
