Hadoop and Search: A Virtuous Circle

According to IDC, the total amount of digital data in the world is expected to reach 2.7 zettabytes by December 2013, and approximately 90 percent of this will be unstructured. Big Data initiatives are about transforming all this data into actionable business insight, but succeeding in this quest requires more than just running a series of analytical queries. The information must also be made available to business users via an enterprise search solution.

Unfortunately, many of today’s enterprise search initiatives fall far short of this goal. The cause is typically not a problem with the core search technology. Rather, system designers fail to take into account what users are really doing with the data, what they really need from it, and the larger business problems that need to be solved. By itself, keyword search, the foundation of most enterprise search initiatives, can only answer questions. It cannot solve complex business problems.

For example, keyword search can help a telecom perform some types of mobile traffic analysis, such as detecting sudden increases in calls in a particular area. That’s great, but what’s causing the spike? How long will it last? Does the company need to provision additional resources?

What if, instead, the company supported keyword search by developing a holistic and enriched view of this data correlated with other data? This is precisely what one company did. It took its mobile traffic analysis as well as social media analysis and fed the results back into the search engine. This enabled the company to correlate events taking place in both environments, for example, linking the spike in cell phone activity to a rash of tweets about a fire in a heavily populated area. In this way, the company was able to far more quickly assess the scope of the problem and gain a deeper understanding of the required response to potential network issues.
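The correlation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the spike rule (a count more than three times the average of all earlier hours), the record layouts, and the sample data are hypothetical and do not come from the company in question.

```python
from statistics import mean

def spikes(calls_per_hour, factor=3.0):
    """Return the hours whose call count exceeds `factor` times the mean of all earlier hours."""
    found = []
    for hour in range(1, len(calls_per_hour)):
        baseline = mean(calls_per_hour[:hour])
        if calls_per_hour[hour] > factor * baseline:
            found.append(hour)
    return found

def explain(area, calls_per_hour, posts):
    """Pair each spike hour with social posts from the same area and hour."""
    return {
        hour: [text for (post_area, post_hour, text) in posts
               if post_area == area and post_hour == hour]
        for hour in spikes(calls_per_hour)
    }

# Hypothetical sample data: hourly call counts for one area, and
# (area, hour, text) tuples standing in for social media posts.
CALLS = [100, 110, 95, 400, 120]
POSTS = [
    ("downtown", 3, "fire on Main St"),
    ("harbor", 3, "nice sunset"),
    ("downtown", 1, "coffee"),
]
REPORT = explain("downtown", CALLS, POSTS)
```

In a real deployment the join would run at scale over activity logs rather than in-memory lists, and the result would be indexed so that a search for the affected area surfaces both the traffic spike and its likely cause.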

To create this solution, the company needed to be able to mine massive activity logs and other data at scale, and then push the results back into the search engine. Fortunately, the foundational technology it used to accomplish this was the same technology it was already using for other Big Data initiatives: Hadoop.

Hadoop, the open source distributed file system and processing engine, was originally developed to support the distribution of Nutch, an open source search engine project; however, Hadoop’s capabilities made it perfect for supporting “Big Data” storage and analytics. Now it’s time to bring search and analytics back together. By integrating search with analytics, organizations can effectively create an ongoing feedback loop, or “virtuous circle,” between the two: enhanced search and discovery delivers improved analytics, which further enhances search and discovery. This virtuous circle enables a form of “reflective intelligence” that improves an organization’s ability to develop strategies for attacking problems and approaching intellectually challenging tasks.

The most obvious and common example of the use of reflective intelligence is in the recommendation engines of many popular shopping sites, where Hadoop functions as a back end for a search engine built using Solr, an open source enterprise search platform that is part of the Apache Lucene project. As users search for products and generate results, logs are created of all their clicks, queries, results, requests for more information, and purchases. All of this information flows into Hadoop where log analysis is performed to determine a variety of metrics, such as what items are the most popular and what items are trending. This information is also often combined with information from consumer profiles and CRM data, as well as knowledge gleaned from social activity, such as what is trending on Twitter and Facebook. All of this information is then fed back into the search engine to improve the recommendations for new shoppers.
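As a rough sketch of that loop, the following computes popularity and trending scores from an activity log and turns them into per-item boosts. The event weights, the record layout, and the boost formula are assumptions made for illustration. A production system would run the aggregation as a Hadoop job and write the boosts into the search index, and neither step is shown here.

```python
from collections import Counter

# Hypothetical log records: (hour, event_type, item_id).
# The weights below are assumed values, not taken from any real site.
EVENT_WEIGHTS = {"click": 1.0, "more_info": 2.0, "purchase": 5.0}

def popularity(events):
    """Sum weighted events per item (a group-by-key aggregation)."""
    scores = Counter()
    for _hour, event_type, item_id in events:
        scores[item_id] += EVENT_WEIGHTS.get(event_type, 0.0)
    return scores

def trending(events, current_hour):
    """Ratio of current-hour activity to earlier activity, smoothed by 1."""
    recent = popularity(e for e in events if e[0] == current_hour)
    earlier = popularity(e for e in events if e[0] < current_hour)
    return {item: recent[item] / (earlier[item] + 1.0) for item in recent}

def boosts(events, current_hour):
    """Combine both signals into a per-item boost to hand back to the search engine."""
    pop = popularity(events)
    trend = trending(events, current_hour)
    top = max(pop.values(), default=1.0) or 1.0
    return {item: round(1.0 + pop[item] / top + trend.get(item, 0.0), 3)
            for item in pop}

LOG = [
    (1, "click", "tent"), (1, "purchase", "tent"),
    (1, "click", "stove"),
    (2, "click", "stove"), (2, "more_info", "stove"), (2, "purchase", "stove"),
]
ITEM_BOOSTS = boosts(LOG, current_hour=2)
```

Here the stove receives the larger boost because it is both the most popular item overall and the one gaining activity in the latest hour. Profile, CRM, and social signals could be folded in as further terms in the same formula.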

Some sites perform this “reflection” every hour, or even more frequently, all to ensure they are picking up on the most up-to-the-minute information and trends. Many sites also run multiple short-term experiments using different scoring models on the data in Hadoop to determine which scoring models have the greatest impact on buyer behavior.
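One simple way to compare scoring models in such an experiment is to tag each search session with the model that ranked its results and then compare conversion rates. The sketch below assumes that setup. The model names and session data are hypothetical, and a real experiment would also need enough traffic per model for the difference to be statistically meaningful.

```python
from collections import defaultdict

def conversion_by_model(sessions):
    """Return the fraction of sessions that ended in a purchase, per scoring model."""
    totals = defaultdict(lambda: [0, 0])  # model -> [session count, purchase count]
    for model, purchased in sessions:
        totals[model][0] += 1
        totals[model][1] += int(purchased)
    return {model: bought / seen for model, (seen, bought) in totals.items()}

# Hypothetical sessions: (scoring model used, whether the session ended in a purchase).
SESSIONS = [
    ("model_a", True), ("model_a", False), ("model_a", False), ("model_a", False),
    ("model_b", True), ("model_b", True), ("model_b", False), ("model_b", False),
]
RATES = conversion_by_model(SESSIONS)
BEST = max(RATES, key=RATES.get)
```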

With reflective intelligence, the virtuous circle never ceases. Increased customer interaction with search leads to more robust log analysis, which leads to more effective recommendations, which in turn leads to increased customer interaction with search. Companies using reflective intelligence for their search engines have reported very positive results, with some claiming double-digit increases in key metrics. And clearly, we are only at the beginning of a longer process.

As Big Data collection accelerates, as data integration capabilities are enhanced, and as analytics becomes more powerful and closer to real time, we will continue to see new ways in which reflective intelligence improves the efficacy of search.


Grant Ingersoll, CTO and co-founder of LucidWorks, is an active member of the Lucene community: a Lucene and Solr committer, co-founder of the Apache Mahout machine learning project, and a longstanding member of the Apache Software Foundation. 

Image: Pepgooner/Shutterstock.com