Bad Data Can Wreck Big Data Implementations: Report

A flood of bad data can wreck a company’s Big Data plan.

People love talking about Big Data, but less attention is paid to Bad Data: incorrect, incomplete, corrupted, and outdated information that, when fed into a system, can produce highly questionable results.

Fortunately, there are steps any organization can take to limit the amount of bad data entering its analytics platform, according to a new column in the Harvard Business Review by Dr. Thomas Redman, author of “Data Driven: Profiting from Your Most Important Business Asset.”

The first step involves addressing preexisting issues in the data, he wrote: “You must make sure you understand the provenance of all data, what they truly mean, and how good they are.” That means scrubbing a dataset “clean” of as many errors as possible.

“An alternative is to complete the rinse, wash, scrub cycle for a small sample, repeat critical analyses using these ‘validated’ data, and compare results,” he wrote. “To be clear, this alternative must be used with extreme caution.”
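As a rough illustration of that sample-and-compare idea, the Python sketch below runs one analysis (a simple average) on a raw sample and again on a hand-validated copy, then reports how far apart the two results are. The function name, the choice of metric, and the sample values are hypothetical and do not come from Redman’s column.

```python
from statistics import mean


def relative_gap(raw_values, validated_values):
    """Compare one analysis (here, a mean) on raw vs. validated data.

    Returns the absolute difference between the two means as a fraction
    of the validated mean. A large gap hints that errors in the raw data
    are distorting the analysis.
    """
    raw_mean = mean(raw_values)
    validated_mean = mean(validated_values)
    if validated_mean == 0:
        raise ValueError("validated mean is zero; relative gap is undefined")
    return abs(raw_mean - validated_mean) / abs(validated_mean)


# Hypothetical sample: the raw data contains one mis-keyed value (1000).
raw_sample = [10, 12, 11, 1000]
validated_sample = [10, 12, 11]

gap = relative_gap(raw_sample, validated_sample)
print(f"Raw and validated results differ by {gap:.0%}")
```

A small gap on one sample does not prove the rest of the data is sound, which is one reason the approach calls for the caution Redman describes.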

Given the flood of data into most organizations, he added, “over the long term, the only way to deal with data quality problems is to prevent them.”

But that’s a potentially tricky proposition. Redman advises that CIOs and others in charge of Big Data projects adopt some of the meticulousness a scientist would display in experimental design. A scientist plots out experimental elements beforehand, builds controls into the process, seeks out potential causes of error, and invites critiques from colleagues.

“Those pursuing Big Data must adapt these traditions to their circumstances,” he wrote. “You must measure quality, build in controls that stop errors in their tracks, and apply Six Sigma and other methods to get at root causes.” Understanding is key: from the data creators to those within the organization ultimately using the insights from that data to make decisions, everyone must have a clear articulation of the goals at hand.
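One way to picture “measure quality” and “build in controls” is a gate that checks each incoming record, tracks the share that fail, and halts the batch when that share gets too high. The Python sketch below is only an illustration: the record fields, the individual checks, and the 5 percent threshold are all assumptions, not anything Redman prescribes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    # Hypothetical record layout, chosen for illustration only.
    customer_id: str
    amount: Optional[float]
    updated: date


def find_problems(rec, today, max_age_days=365):
    """Return a list of quality problems found in a single record."""
    problems = []
    if not rec.customer_id.strip():
        problems.append("incomplete: missing customer_id")
    if rec.amount is None:
        problems.append("incomplete: missing amount")
    elif rec.amount < 0:
        problems.append("incorrect: negative amount")
    if (today - rec.updated).days > max_age_days:
        problems.append("outdated")
    return problems


def quality_gate(records, today, max_error_rate=0.05):
    """Separate clean records from rejected ones and measure the error rate.

    Raises ValueError when the share of bad records exceeds max_error_rate,
    so a bad batch is stopped before it reaches the analytics platform.
    """
    clean, rejected = [], []
    for rec in records:
        problems = find_problems(rec, today)
        if problems:
            rejected.append((rec, problems))
        else:
            clean.append(rec)
    error_rate = len(rejected) / len(records) if records else 0.0
    if error_rate > max_error_rate:
        raise ValueError(f"batch rejected: error rate {error_rate:.0%}")
    return clean, rejected, error_rate
```

Keeping the rejected records alongside the reasons they failed, rather than silently dropping them, leaves a trail for the root-cause work Redman recommends.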

It’s time, he concluded, “for senior leaders to get very edgy about data quality, get the managerial accountabilities right, and demand improvement.” The alternative is everything from cost overruns to seriously aggrieved customers.

Several surveys over the past few months have indicated that businesses are indeed struggling to handle the data entering from social networks, customer and vendor interactions, sensors, and pretty much every other device or piece of software that can churn out output of some sort. In one such survey, conducted by Oracle, 94 percent of North American business executives indicated that their organizations were collecting and managing more data today than two years ago, even as nearly a third of them ranked their organizations’ data preparedness a “D” or “F.” Meanwhile, a full 93 percent of them felt that the inability to handle data was having a negative impact on revenue.

In light of that, it’s tempting for any organization to slap a Big Data platform into place, or perhaps skip some of the due diligence related to analytics. However, that can translate into some very vicious problems down the road. Groundwork is essential. In a recent column for SlashBI, Jaspersoft CEO Brian Gentile argued for a three-step approach to determining whether an organization is ready for a Big Data implementation. There are also some very big structural changes underway in how organizations approach analytical challenges.


Image: aldegonde/