Wikipedia, which features nearly 4 million articles in English alone, is widely considered a godsend for high school students on a tight paper deadline.
But for University of Illinois researcher Kalev Leetaru, Wikipedia’s volumes of crowd-sourced articles are also an enormous dataset, one he mined for insights into the history of globalization.
In total, he drew on Wikipedia’s 37GB of English-language data, focusing in particular on the evolving connections between various locations across the globe over a period of years. “I put every coordinate on a map with a date stamp,” Leetaru told The New York Times. “It gave me a map of how the world is connected.”
Negative sentiments in the Wikipedia data were visualized in red, and positive ones in green. Thus, the United States turns into a red ball of awful during the Civil War, while Europe goes similarly scarlet during World War I and World War II. You can view the time-lapse data visualization on YouTube.
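For readers curious how such a map might be assembled, here is a minimal Python sketch of the general idea: date-stamped coordinates grouped into yearly frames, with each point colored by sentiment. It is not Leetaru’s actual code. The `Mention` record, the sentiment scale of -1.0 to 1.0, the function names and the sample values are all assumptions made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Mention(NamedTuple):
    """One geotagged, date-stamped reference (hypothetical record layout)."""
    year: int
    lat: float
    lon: float
    sentiment: float  # assumed scale: -1.0 (negative) to 1.0 (positive)


def sentiment_to_rgb(score):
    """Map a sentiment score to a color: red if negative, green if positive."""
    s = max(-1.0, min(1.0, score))  # clamp to the assumed scale
    if s < 0:
        return (int(round(255 * -s)), 0, 0)
    return (0, int(round(255 * s)), 0)


def build_frames(mentions):
    """Group mentions by year so each year becomes one frame of a time lapse."""
    frames = defaultdict(list)
    for m in mentions:
        frames[m.year].append((m.lat, m.lon, sentiment_to_rgb(m.sentiment)))
    return dict(sorted(frames.items()))


# Made-up sample values, for illustration only.
SAMPLE = [
    Mention(1916, 49.2, 5.4, -1.0),
    Mention(1863, 39.8, -77.2, -1.0),
    Mention(1969, 28.6, -80.6, 1.0),
]

frames = build_frames(SAMPLE)
for year, points in frames.items():
    print(year, points)
```

Each yearly frame could then be handed to any plotting tool to draw the dots; rendering is left out to keep the sketch short.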
For the actual data crunching he relied on infrastructure from Silicon Graphics International, which (according to the Times) can store 64TB of data and relies on 4,096 processing cores.
Leetaru’s research raises the question: if you put your mind to it, what sort of massive dataset (either structured or unstructured) couldn’t you crunch?
A number of companies seem intent on making the answer to that question: “Nothing.” In June, IT vendors large (IBM being perhaps the most notable) and midsize (Hortonworks, Karmasphere, Datameer) all announced platforms designed to help store, manage, analyze and visualize massive amounts of data. And by “massive amounts,” they mean far more than the 37GB Leetaru crunched in order to demonstrate humankind’s accelerating rate of cross-global connection.
Many IT vendors are also focusing on how best to put B.I. tools in the hands of workers who might not possess years’ worth of training in data analysis. According to Forrester analyst Boris Evelson, software vendors should pay additional attention to making those tools’ interfaces more intuitive; that’s on top of what he lists as vital self-service capabilities, including the automodeling of raw data, a search-like GUI (graphical user interface), application sandboxes, and raw-data exploration and discovery.
It also helps to have some powerful hardware backing that analysis, either on-premises or in the cloud.