Big Data and the End of Secrecy


Edward Snowden, a former CIA employee who also worked as a contractor for the NSA, has been revealed as the leaker of top-secret documents detailing U.S. surveillance programs.

The Guardian and The Washington Post offered up those documents as part of their respective articles describing an NSA project codenamed PRISM, which allegedly siphons information from the databases of nine major technology companies: Microsoft, Google, Yahoo, Facebook, PalTalk, YouTube, Skype, AOL, and Apple.

In emails to Slashdot, many of those companies denied giving the NSA access to user data. In a statement released over the weekend, James R. Clapper, the nation’s Director of National Intelligence, claimed that media outlets hadn’t been “given the full context” and that the “surveillance activities published in The Guardian and The Washington Post are lawful.”

Snowden told the newspapers that he leaked the documents out of fear that the United States was evolving into a surveillance state without adequate checks and balances in place. “My sole motive is to inform the public as to that which is done in their name and that which is done against them,” he wrote in a note to the Post.

Whatever Snowden’s fate—he’s currently holed up in a Hong Kong hotel—his actions have sparked a firestorm of debate around surveillance, security, and privacy. But one thing’s for certain: none of the NSA’s alleged programs would have been possible without the rise of so-called “Big Data,” and all the storage and analytics platforms that come with it.

Surveillance organizations such as the NSA face a massive challenge in dealing with the mountains of unstructured data pouring into their data centers. Just to make things more interesting, much of that data may be encrypted. In order to analyze that information in a speedy and meaningful way, it’s likely that these organizations have developed analytics frameworks in the spirit of Apache Hadoop or Google BigQuery; it’s also a near certainty that the NSA and its brethren operate supercomputers that rival anything on the Top500 list, the better to crack open encrypted messages and find needles in data haystacks.

In 2008, the NSA began working on a database technology that it contributed to the Apache Software Foundation in 2011. The Foundation describes “Accumulo” as a “robust, scalable, high performance data storage and retrieval system” based on Google’s BigTable and built atop open-source frameworks and services such as Hadoop, ZooKeeper and Thrift. How (and whether) Accumulo supports PRISM and other NSA programs is an open question; it’s just as likely that the agency depends on something far more evolved, and not quite as public, for dealing with its enormous datasets.
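For readers curious what "based on Google’s BigTable" means in practice, the core idea is a sorted map keyed by row, column, and timestamp, which makes range scans over adjacent rows cheap. The following is a minimal, single-machine sketch of that data model in Python. It is purely illustrative: the class and method names (`SortedKVStore`, `put`, `get_latest`, `scan_rows`) are hypothetical, it is not Accumulo’s API, and it says nothing about how any NSA system actually works.

```python
import bisect


class SortedKVStore:
    """Toy BigTable-style store (hypothetical, for illustration only).

    Keys are (row, column, -timestamp) tuples kept in sorted order, so the
    newest version of a cell sorts first and neighboring rows sit together.
    """

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key tuple -> stored value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def get_latest(self, row, column):
        # The first key at or after (row, column, -inf) is the newest version.
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for rows in [start_row, end_row)."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            key = self._keys[i]
            yield key[0], key[1], -key[2], self._values[key]
            i += 1


store = SortedKVStore()
store.put("user2", "msg", 1, "a")
store.put("user1", "msg", 1, "old")
store.put("user1", "msg", 5, "new")
latest = store.get_latest("user1", "msg")
scanned = list(store.scan_rows("user1", "user2"))
```

Production systems of this kind spread the sorted key space across many servers and persist it to disk; the sketch keeps everything in memory to show only the ordering idea.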

“It would be a comfort to believe that there is safety in numbers, that because there are so many of us little people, our particular foibles would fall beneath the radar,” analyst Roger Kay wrote in a June 10 column for Forbes. “Alas, it is not so! The data mining techniques are cold and precise and can be applied to the entire ‘corpus’ (the body of data extant at any given moment).”

Kay cited the Boston Marathon bombings as an example of how a government can quickly sift through tons of unstructured data—including video—to find persons of interest.

However, the Boston bombings investigation also highlighted one downside of using Big Data as an investigative tool, at least from the government’s perspective: at some point, an actual human being needs to take over from the machine, examine the evidence, and make a judgment call based as much on intuition as on cold, hard facts. In Boston, some FBI agents had to watch the same video segments hundreds of times in order to build a proper timeline for the bombing and its aftermath. And there are only so many agents and analysts to go around.

“How many agents do you think the FBI has?” David Simon, a former reporter for the Baltimore Sun and executive producer of the television show The Wire (which dealt with the intricacies of police surveillance), wrote in a widely disseminated June 7 blog posting. “How many computer-runs do you think the NSA can do—and then specifically analyze and assess each result?”

It’s a question of resources, Simon added: “When the government asks for something, it is notable to wonder what they are seeking and for what purpose. When they ask for everything, it is not for specific snooping or violations of civil rights, but rather a data base that is being maintained as an investigative tool.”

But it’s the database that worries privacy advocates at the moment. Big Data tools are only becoming more sophisticated—meaning that this current fervor over the NSA’s activities almost certainly won’t be the last of its kind.


Image: The Guardian (video)