Big Data Basics

By R. Emmett O’Ryan

The term “Big Data” has been tossed around in the media for the last three or four years. So what is it?

Basically, it’s about data sets that are too big for traditional software tools to capture, manage and process in a reasonable amount of time. It’s also a critical area today: The size of these data sets is continually growing, and is now well into the petabyte range.

The challenges of implementing solutions include capturing, storing, searching, sharing, analyzing and visualizing such data sets.

As a longtime computer scientist in the area of information management, I see the term “Big Data” as a marketing label. It attempts to capture all that is done in the Distributed Computing or High Performance Computing space, and to make the results available to the Data Scientist or Information Analyst so that answers to questions can be formulated in a timely manner.

What It’s About

Big Data is about:

  • Bringing together large amounts of data from non-traditional and traditional data sources, then joining, transforming and processing it so that the different types of data are correlated into a single larger data set.
  • Reducing that larger set to a representative sample of the whole.
  • Extracting the information that is relevant and applying appropriate statistical analysis to it.
  • Visualizing the data and the resulting information, so that trends can be determined and decisions made.

All of this, from ingestion to visualization of the analysis, must be done in a timely manner that meets the needs of the business or organization.
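The four steps above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the two data sources, the field names and every function name are hypothetical, and a real Big Data pipeline would run steps like these across many machines rather than in one process.

```python
import random
import statistics

# Hypothetical sample data; none of it comes from a real system.
orders = [  # a "traditional" source, e.g. rows from a sales database
    {"customer": "a", "amount": 120.0},
    {"customer": "b", "amount": 80.0},
    {"customer": "c", "amount": 200.0},
    {"customer": "d", "amount": 40.0},
]
page_views = {"a": 14, "b": 9, "c": 30, "d": 3}  # a "non-traditional" source, e.g. web logs


def join_sources(orders, page_views):
    """Step 1: join both sources into one larger data set, keyed on customer."""
    return [
        {**order, "views": page_views[order["customer"]]}
        for order in orders
        if order["customer"] in page_views
    ]


def reduce_to_sample(rows, k, seed=0):
    """Step 2: reduce the set to a random sample of at most k rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


def analyze(rows):
    """Step 3: extract the relevant field and apply simple statistics to it."""
    amounts = [row["amount"] for row in rows]
    return {"mean_amount": statistics.mean(amounts), "max_amount": max(amounts)}


def visualize(rows):
    """Step 4: a crude text bar chart, one '#' per 20 units of amount."""
    ordered = sorted(rows, key=lambda row: row["customer"])
    return [f"{row['customer']} {'#' * int(row['amount'] // 20)}" for row in ordered]


if __name__ == "__main__":
    joined = join_sources(orders, page_views)
    sample = reduce_to_sample(joined, k=3)
    print(analyze(sample))
    print("\n".join(visualize(joined)))
```

The sample here is drawn uniformly at random, which is only one way of arriving at a representative subset; stratified or weighted sampling may suit real data better.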

A Competitive Advantage

Many businesses and organizations use Big Data to gain a competitive edge from the resulting business analytics — provided that they ask the right questions, have data sources that provide the right information and, of course, use the right statistical and analytical tools to interpret the answers they get.

These same businesses and organizations use Big Data to make timely strategic and tactical decisions. Instead of relying only on process or instinct, they can draw on the additional information that Big Data tools provide to make informed decisions.

The Technology

So is Big Data new? No, not really. Distributed Computing or High Performance Computing (HPC) has been around for well over 20 years. What is new is applying HPC techniques and infrastructure to these different types of data sets, and using them to answer business-domain questions. Until now, HPC has been the domain of scientists performing physics, biological and chemical analysis, simulation and experimentation, where massive amounts of data are available.

The technology and infrastructure here differ somewhat from the HPC of even ten years ago. Today, low-cost commodity servers or cloud computing infrastructures can serve as Big Data hardware platforms. As for software, the keys to a Big Data environment are Apache Hadoop/MapReduce and its associated tools, NoSQL databases, some means of doing basic analytics on the data to be processed, and a means of visualizing the results of those analytics.
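To give a feel for the MapReduce model that Hadoop implements, here is a minimal single-process sketch of its map, shuffle and reduce phases, using the classic word-count example. It is not Hadoop's API: the function names are hypothetical, and in a real cluster each phase would be spread across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big tools"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can in principle be run in parallel, which is what makes the model a good fit for clusters of commodity servers.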

There is now a wide variety of NoSQL databases available, including MongoDB and Cassandra. As for basic analytics, the tools associated with Hadoop/MapReduce can quite often do the trick. When it comes to advanced analytics, tools like R, SAS and other statistical packages can be used, though quite often an integration effort is needed to make them work.

Visualization of the ingested data and of the analytic results appears to be a wide-open area, with many vendors claiming dominance in one segment or another.