Every year since 2008, Google Flu Trends has done its best to predict flu outbreaks via the magic of analyzing Google search queries.
On paper, the idea makes total sense: sick people will search online for information related to their symptoms and treatment. Google Flu Trends aggregates those millions of queries and produces a color-coded heat map that (the company claims) can predict the ebb and flow of flu season with reasonable accuracy.
Sounds like a perfect use of so-called “Big Data,” right? Except that, according to a new study (as reported in New Scientist), Google Flu Trends has gotten it wrong. The system “has consistently overestimated flu-related visits over the past three years,” the magazine wrote, and was especially inaccurate around the peak of flu season, when such data is most useful. That conclusion comes from a team of researchers headed by Northeastern University’s David Lazer. “In the 2012/2013 season, it predicted twice as many doctors’ visits as the US Centers for Disease Control and Prevention (CDC) eventually recorded,” the article continued.
In the 2011-2012 flu season, the article added, Google Flu Trends overestimated flu-related doctors’ visits by “more than 50 percent.”
But that doesn’t mean Google Flu Trends is broken. To the contrary, the researchers believe Google only needs to tweak how it weighs various pieces of data in order to make the system more accurate. “It’s a bit of a puzzle, because it really wouldn’t have taken that much work to substantially improve the performance of Google Flu Trends,” Lazer told New Scientist.
This isn’t the first time a science-oriented publication has questioned Google’s work. In 2013, Nature also published a piece suggesting that the company’s flu-tracker overestimated the spread of flu. “[Google’s] estimate for the Christmas national peak of flu is almost double the CDC’s (see ‘Fever peaks’), and some of its state data show even larger discrepancies,” read that article. “Several researchers suggest that the problems may be due to widespread media coverage of this year’s severe US flu season, including the declaration of a public-health emergency by New York state last month.”
What’s the teachable moment here? When it comes to complicated models built on lots of aggregated data, data scientists (and others involved in the process) should do their best to “fact check” their work against other, independent data sources, especially if their model contains inherent biases or stems largely from a single dataset.
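For readers who want to see what that kind of “fact check” might look like in practice, here is a minimal Python sketch. It is not Google’s method or the researchers’ analysis; the function names, the 50 percent threshold, and the weekly figures are all made up for illustration. It simply compares a model’s estimates against an independent reference series and flags the weeks where the model runs too high.

```python
def relative_error(predicted, observed):
    """Return (predicted - observed) / observed for each pair of values.

    Assumes every observed value is non-zero.
    """
    if len(predicted) != len(observed):
        raise ValueError("series must have the same length")
    return [(p - o) / o for p, o in zip(predicted, observed)]


def flag_overestimates(predicted, observed, threshold=0.5):
    """Return the indices where the model exceeds the reference by more than threshold."""
    errors = relative_error(predicted, observed)
    return [i for i, e in enumerate(errors) if e > threshold]


# Hypothetical weekly figures, invented for this example
# (say, percent of doctors' visits that are flu-related).
model_estimates = [2.0, 3.5, 6.0, 4.0]
reference_counts = [2.0, 3.0, 3.0, 2.5]

errors = relative_error(model_estimates, reference_counts)
flagged = flag_overestimates(model_estimates, reference_counts)
print(flagged)  # weeks where the model overshoots the reference by more than 50 percent
```

In this toy data the third week shows the model predicting twice the reference figure, a relative error of 1.0, so it gets flagged. A real check would use an actual surveillance dataset as the reference and would need to account for reporting delays and revisions.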