While news of the government’s monitoring of Internet and telephone communications took many by surprise, subsequent reports that officials harnessed Apache Hadoop to help process the data struck many Big Data experts as simply logical.
Although Hadoop is just one of the components of the NSA’s Big Data-based surveillance program, reports indicate the NSA considers it critical to the initiative.
First developed about seven years ago, Hadoop is open source and frequently lauded as easy to use. It requires relatively little data expertise and is accessible to anyone who knows how to write a MapReduce program.
“It’s an easier-to-understand programming model and better-integrated software stack,” says Amy Apon, chair of the computer science department at Clemson University in South Carolina. “It allows you to place code next to data, do the application process, and collect results in a single reporting process as a reduce step.”
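The model Apon describes can be illustrated with a toy word count, the canonical MapReduce example: a "map" step runs code against each piece of data, the framework groups the intermediate results by key, and a "reduce" step collapses each group into a final answer. This is a minimal in-process sketch of the idea, not actual Hadoop code; a real job would be written against Hadoop's Java API or Hadoop Streaming.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit (key, value) pairs from one input record --
    # here, (word, 1) for each word, to count occurrences.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Reduce: collapse all values for one key into a single result.
    return key, sum(values)

def run_job(records):
    # Shuffle: group intermediate values by key, as the framework
    # would do between the map and reduce steps.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = run_job(["big data is big", "data moves to code"])
# counts maps each word to its total, e.g. "big" -> 2, "data" -> 2
```

In a real Hadoop cluster, the map and reduce functions run in parallel across many machines, with the data staying put and the code shipped to it, which is what makes the model attractive for very large datasets.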
She notes that Hadoop has been adopted by Facebook and Yahoo and has been improved significantly over the years. Widely used in private industry, Hadoop “has evolved into a fairly mature software system at this stage,” Apon says.
Carl Howe, an analyst and vice president at the Boston-based consulting firm the Yankee Group, says Hadoop has democratized some Big Data parallel cluster uses. “Hadoop doesn’t require you to be a parallel processing expert,” he says. “You no longer have to be a cluster expert to run these Big Data problems.”
Howe says that while there are about 10 or 12 programs that can carry out functions similar to Hadoop, the program has two distinct advantages: its presence in the open source community and cost savings for the end user. “The federal government is just like any other enterprise,” he notes. “They look for cheap ways to do things.”
Adds Garth Gibson, a professor of computer science at Carnegie Mellon University in Pittsburgh: “Hadoop is easy compared to other programming paradigms. The fact that it is open source means it is less expensive. It is also true that if you’re trying to build the least expensive parallel program, it helps you make the least expensive giant computer. Those two things go together and cause people to say if it’s cheap and less expensive for capital costs, why don’t you use it?”
However, experts say that while Hadoop is inexpensive and great for learning how to program in the least amount of time, it does have some limitations.
“It is an easy way to build a fairly big computer at relatively low cost,” says Gibson. “But it’s not the most efficient and it’s not the fastest. Hadoop is all about getting the first answer quickly.”