When building Big Data apps, you need to conduct a test run with someone else’s data before you put the software into production. Why? Because using an unfamiliar dataset can help illuminate any flaws in your code, perhaps making it easier to test and perfect your underlying algorithms.
To that end, there are a number of public data sources freely available for use. Some of them are infamous, such as the Enron email archive used in court hearings about that company’s malfeasance. It’s one of the largest collections of actual emails, and it has proven useful to anti-spam vendors testing their own algorithms. Indeed, the archive may be Enron’s best and most enduring legacy.
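The Enron archive is distributed as plain-text messages, which Python’s standard `email` module can parse directly. Here’s a minimal sketch; the headers and body below are invented stand-ins for a real message from the corpus:

```python
from email import message_from_string

# A stand-in for one message from the Enron corpus
# (headers and body invented for illustration).
raw = """\
Message-ID: <12345.example@enron.com>
From: someone@enron.com
To: someone.else@enron.com
Subject: Quarterly numbers

Please review the attached figures before Friday.
"""

msg = message_from_string(raw)
print(msg["Subject"])             # header lookup is case-insensitive
print(msg.get_payload().strip())  # the plain-text body
```

Looping the same parse over the corpus’s thousands of files gives a quick, realistic workload for a spam classifier or text pipeline.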
More prosaic data sources include the complete works of Shakespeare, available online and easily incorporated into testing plans. Another good place to look is this list of 23 mainly health-related data sets, covering various open data initiatives, cancer statistics, and more. Then there’s the collection Princeton has put together, largely texts now in the public domain, such as presidential debate transcripts and literary works.
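A word count over a text like the Shakespeare corpus is the classic smoke test for a data pipeline. Here’s a minimal sketch using a short public-domain excerpt in place of the full downloaded text:

```python
import re
from collections import Counter

# A short public-domain excerpt standing in for the full
# downloaded text of the complete works.
text = "To be, or not to be, that is the question"

# Normalize to lowercase words, then tally -- the same
# map/reduce shape you would scale up in a Big Data framework.
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)
print(counts.most_common(3))
```

The point of running it against a known text first is that you can verify the tallies by hand before trusting the pipeline on data you can’t eyeball.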
Another popular data destination is the digitized archives of the New York Times, with 150 years’ worth of material. There are more than 30,000 subject tags, and the information can be downloaded in either RDF or HTML format. The archive was opened to programmers three years ago and has an extensive data dictionary. There is detailed documentation of its various APIs and a discussion group for those who want to learn more about how to interact with this information. In addition, there are some specialized datasets and APIs, such as the books on its various bestseller lists, and a Web console where you can assemble your queries inside your browser. (You need to obtain a search key from their developer network first.)
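Querying the Times’ search API amounts to building a URL with your parameters and key. Here’s a sketch of assembling one; the endpoint path and parameter names are assumptions on my part, so confirm them against the documentation on the developer network, which is also where you obtain the required search key:

```python
from urllib.parse import urlencode

API_KEY = "your-search-key-here"  # placeholder -- get yours from the developer network

# Endpoint path and parameter names are assumptions; check the
# API documentation before relying on them.
base = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
params = {"q": "big data", "begin_date": "20120101", "api-key": API_KEY}
url = base + "?" + urlencode(params)
print(url)

# An actual fetch would then be:
#   urllib.request.urlopen(url).read()
```

The Web console mentioned above lets you try the same queries interactively before writing any code.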
Three Popular Visualization Sites
There are also websites that do more than feature lots of text or numbers for downloading. Probably one of the largest efforts involving public data sources is Data.gov, which has been in place for more than three years and is constantly updated. It covers some obvious datasets, such as crime statistics, airline on-time arrival information and the entire Code of Federal Regulations, as well as more obscure ones, such as the habitats of several endangered species and the locations of farmers’ markets.
Dozens of these datasets come with a built-in visualization tool for exploring the data before you download it. The visualization tools are a bit crude, but there’s a series of video tutorials to help you get started, put together by the professors and students at RPI’s Tetherless World Constellation.
The latest feature is the ability to conduct geospatial visualizations so that you can view datasets on an interactive map, overlay them with other datasets, and investigate the underlying data points. It’s great for quick tests to see whether your geographic assumptions about your chosen datasets are sufficient to proceed further with your analysis.
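Most Data.gov datasets also download as plain CSV, so a quick sanity check with Python’s `csv` module is a sensible first step before any analysis. The column names and values below are invented for illustration (imagine a farmers’-market listing with coordinates):

```python
import csv
import io

# A stand-in for the first few rows of a downloaded CSV
# (columns and values invented for illustration).
data = """\
name,city,lat,lon
Union Square Greenmarket,New York,40.7359,-73.9911
Ferry Plaza Farmers Market,San Francisco,37.7956,-122.3934
"""

rows = list(csv.DictReader(io.StringIO(data)))

# Quick sanity checks before plugging the file into a pipeline:
print(len(rows), "rows")
print(rows[0]["name"], rows[0]["lat"])
```

With a real download, you’d replace the `io.StringIO` wrapper with `open("dataset.csv")` and spot-check row counts and coordinate ranges against what the site’s built-in visualization showed you.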
Google, of course, is another great place to examine public data. The search-engine giant has put together more than 75 different sources in its special search portal, including U.S. Census data, information from the World Bank, U.S. energy consumption, and others. As with Data.gov, you can also graph this information. Google’s visualization tools are a bit more intuitive; some datasets can be animated, letting you compare how information changes from year to year just by moving a slider control. The visualization graphs automatically scale based on what you’re plotting, and you can export the graphs as HTML links.
Finally, there is Tableau Software. It doesn’t host any data sources, but it does have some splendid visualizations in its galleries here. You need to download the free player to work with them. This is a good place to get some ideas of what your end product will look like, and how others have taken datasets and turned them into useful information. You can also create your own visualizations and publish them for the world to see.
If you’re looking to hone your Big Data skills, you might want to consider one of the DataDives events hosted by Datakind.org. These weekend events team data problems from selected social organizations with volunteer data scientists (think Hackathon with a bit more focus and collaboration). One held last year explored “stop-and-frisk” incident data reported by the NYPD in 2010, on behalf of the New York Civil Liberties Union. Their next events are being held in New York City and London in September. You can get more information and sign up here.
And if you want more experience and perhaps to use your data skills for a good cause, then take a look at what Code for America is doing. It has numerous projects involving public data access, including its Open311 dashboard. Many municipal governments have put together 311 systems, with a variety of public services available by calling that special number. This project creates an open API to allow Web applications to seamlessly interact with 311 systems across the nation. According to the website, the “Open311 Dashboard aims to take the deluge of 311 data and translate it into a clean and interactive dashboard. It helps track response times, identify service request trends, and give city officials data about the efficiency of various city services.”
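Because Open311 serves service requests as JSON, summarizing a city’s feed takes only a few lines. Here’s a sketch that tallies requests by service type; the sample payload below is a stand-in for what an endpoint might return, and the field names, while modeled on Open311’s style, should be treated as assumptions until checked against the spec:

```python
import json
from collections import Counter

# A stand-in for the JSON an Open311 endpoint might return
# (field names and values are assumptions for illustration).
sample = """[
  {"service_name": "Pothole", "status": "open"},
  {"service_name": "Pothole", "status": "closed"},
  {"service_name": "Graffiti Removal", "status": "closed"}
]"""

requests = json.loads(sample)
by_service = Counter(r["service_name"] for r in requests)
closed = sum(1 for r in requests if r["status"] == "closed")

print(by_service.most_common())
print(f"{closed} of {len(requests)} requests closed")
```

Scaled up over a full feed, tallies like these are exactly the service-request trends and efficiency figures the dashboard is built to surface.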