Facebook’s Project Prism, Corona Could Ease Data Crunch

by Nick Kolakowski Aug 24, 2012 3 min read

When it comes to the information Facebook keeps in-house about its users, the term “Big Data” seems like something of an understatement. With the company processing more than 500 terabytes of data a day, divided into 2.5 billion pieces of individual content, a better term might be “Gigantic Data” or “Gargantuan Data.”

Facebook recently invited a handful of reporters into its headquarters for a more in-depth look at how it handles that flood of data. Part of that involves the social network’s upcoming “Project Prism,” which will allow Facebook to store data in multiple data centers around the globe while giving company engineers a holistic view of it, thanks to tools such as automatic replication. That added flexibility could help Facebook as it attempts to wrangle an ever-increasing amount of data.

“It allows us to physically separate this massive warehouse of data but still maintain a single logical view of all of it,” Wired quotes Jay Parikh, Facebook’s vice president of engineering, as telling reporters. “We can move the warehouses around, depending on cost or performance or technology.”

Facebook has another project, known as Corona, which makes its Apache Hadoop clusters less crash-prone while increasing the number of tasks that can be run on the infrastructure. Facebook has been a high-profile user of Hadoop, an open-source framework for reliably running distributed applications on large hardware clusters.

That warehouse of user data is Facebook’s most valuable asset, but also its biggest challenge. Although users upload tons of photos and other life information to the site, they’re also extraordinarily gung-ho about their privacy. Facebook can slice and dice that data in the name of advertising dollars, but if it goes too far, the inevitable privacy backlash can lead to major corporate headaches.

“Big data really is about having insights and making an impact on your business,” Parikh added, according to TechCrunch. “If you aren’t taking advantage of the data you’re collecting, then you just have a pile of data, you don’t have big data.”

The definition of “Big Data” is somewhat in dispute, of course, even among analysts and businesspeople tasked with wrestling enormous datasets on a day-to-day basis. In a recent SAP survey of 154 C-suite executives, 28 percent defined “Big Data” as the flood of data itself, while 19 percent equated it with storing data for regulatory compliance; another 18 percent saw “Big Data” as an increase in data sources.

Based on Parikh’s comments, Facebook views “Big Data” as something altogether more holistic: not only the storage of enormous amounts of data, but also its processing and eventual use. Turning that data into revenue is a tricky process, but with Corona and Prism on the backend, the task of handling that data might be a little bit smoother.

Image: Ahmad Faizal Yahya/Shutterstock.com