More and more, data scientists are talking about “data exhaust”: information that isn’t central to an application’s primary purpose, but can still be captured and used to gain insight into areas that may have nothing at all to do with a particular system or product’s stated mission.
For example, Microsoft recently unveiled Workplace Analytics for Office 365 Enterprise. The add-on uses aggregate email and calendar metadata—such as to/from information, subject lines and timestamps—to generate intelligence about an organization’s productivity. (It doesn’t track the behavior of individual employees.)
“The idea was that there’s a ton of data about how people spend their time and who they spend it with, and it’s all being created incidentally as part of this exhaust that’s in email and calendar systems,” said Chantrelle Nielsen, director of Research & Strategy for Microsoft’s Office 365 unit.
In theory, a sales department could use that data exhaust to identify the collaborative patterns of top performers and encourage those behaviors throughout the broader sales team. Beyond obvious metrics such as the amount of time spent with customers, the analysis can look at the size of top performers’ internal networks, which could hint at their ability to reach out to colleagues and solve customer problems quickly.
Examples like that show how “data exhaust presents a unique opportunity to enhance your service offering and can add a lot of contextual value,” said Dan Koellhofer, chief innovation officer at Atlanta-based background check company First Advantage.
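To make the internal-network idea concrete, here is a minimal Python sketch of how such a metric might be computed from to/from metadata alone. It is an illustration under stated assumptions, not a description of how Workplace Analytics works: the addresses, the `example.com` company domain, the performer labels and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical message metadata: (sender, recipient) pairs only, no message bodies.
MESSAGES = [
    ("ana@example.com", "ben@example.com"),
    ("ana@example.com", "cho@example.com"),
    ("ana@example.com", "buyer@customer.test"),
    ("ben@example.com", "ana@example.com"),
    ("dev@example.com", "ana@example.com"),
    ("dev@example.com", "ana@example.com"),
]
INTERNAL_DOMAIN = "example.com"  # assumed company domain


def internal_network_sizes(messages, domain):
    """Count the distinct internal colleagues each person exchanged mail with."""
    contacts = defaultdict(set)
    suffix = "@" + domain
    for sender, recipient in messages:
        both_internal = sender.endswith(suffix) and recipient.endswith(suffix)
        if both_internal and sender != recipient:
            # Treat a message as an undirected link between two colleagues.
            contacts[sender].add(recipient)
            contacts[recipient].add(sender)
    return {person: len(peers) for person, peers in contacts.items()}


def average_network_size(sizes, group):
    """Average network size across a group, counting absent people as zero."""
    return sum(sizes.get(person, 0) for person in group) / len(group)


sizes = internal_network_sizes(MESSAGES, INTERNAL_DOMAIN)
top_performers = ["ana@example.com"]  # hypothetical labels from a sales system
everyone_else = ["ben@example.com", "cho@example.com", "dev@example.com"]
print("top performers:", average_network_size(sizes, top_performers))
print("everyone else:", average_network_size(sizes, everyone_else))
```

Reporting only group averages, rather than per-person figures, keeps the output at the aggregate level the article describes.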
Thinking Different and Digging Hard
Many data scientists see data exhaust as opening up more opportunities for organizational analytics or even new product lines. But to take advantage of it, data teams need to think a little bit differently.
“What’s important to realize is that data exhaust by itself commonly lacks structure, taxonomy and access, limiting its use,” Koellhofer observed. “The key, therefore, is augmenting existing repositories with these data streams, giving new levels of context and insight to both customers and internal organizations.” At first glance, the data may simply be a server log or workflow notification stream. But by augmenting it with other data streams, “you can turn it into much more powerful and valuable data.”
That’s not easy to do, Nielsen pointed out. Data exhaust “isn’t one of those things that’s just laying there waiting for someone to come along and look at it. It might not be trivial to get at.” She believes the key to harnessing data exhaust is creating a “really strong hypothesis” about what a dataset might hold, then linking it to a question that needs an answer.
“Before you start digging around in the data, have a hypothesis about what you’re going to find and have a plan for what you’ll do if you confirm your hypothesis—or don’t confirm it,” Nielsen said. “You have to know what to do next.”
“These raw streams typically require a person to perform some sort of transformation on them [to make] the data accessible, integrated and understandable by others,” Koellhofer added. “Consequently, transformation skills are in high demand.”
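The transformation and augmentation steps described in this section can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the log format, the account table, the field names and the `parse_line` and `enrich` helpers are all hypothetical, and a real stream would need far more defensive parsing.

```python
import re
from collections import Counter

# Hypothetical raw exhaust: log lines written for debugging, not for analysis.
LOG_LINES = [
    "2024-03-01T09:15:02 acct=1001 action=report_export status=ok",
    "2024-03-01T09:17:40 acct=1002 action=report_export status=error",
    "2024-03-01T09:20:11 acct=1001 action=login status=ok",
    "malformed line that the parser should skip",
    "2024-03-01T09:31:55 acct=1003 action=report_export status=error",
]

# Hypothetical existing repository: account id -> customer segment.
ACCOUNTS = {"1001": "enterprise", "1002": "small_business", "1003": "small_business"}

LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) acct=(?P<account>\d+) "
    r"action=(?P<action>\w+) status=(?P<status>\w+)$"
)


def parse_line(line):
    """Turn one raw log line into a dict, or return None if it does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


def enrich(lines, accounts):
    """Give each parsed record structure, then add context from the account table."""
    records = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip lines that lack the expected structure
        record["segment"] = accounts.get(record["account"], "unknown")
        records.append(record)
    return records


records = enrich(LOG_LINES, ACCOUNTS)
errors_by_segment = Counter(r["segment"] for r in records if r["status"] == "error")
print(errors_by_segment)
```

On its own, the log says only that some exports failed. Joined to the account table, it can show which customer segment is hitting the errors, which is the sort of added context Koellhofer describes.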
The Power of Unintended Consequences
In some ways, data exhaust’s power lies in the very fact that it is unstructured and was entered into a system for a different purpose. Nielsen described a scenario in which automobile technicians enter information about a certain car, recording notes on the work they performed and the hours they spent, so they can generate an invoice and maintain a file on the history of that particular vehicle.
A data scientist might look at hundreds of thousands of similar records and do a sentiment analysis to determine why customers in Chicago are unhappy about a particular automobile make and model. “Maybe it’s because it’s snowing and there’s salt on the roads and it does something to the car,” she said. “They’re just approaching the same data set in a really different way” that solves a separate business problem.
The flip side, she noted, is that this often puts data specialists into “more of a needle-in-a-haystack situation than they’re used to with other datasets.” That’s why having a hypothesis is so important.
Koellhofer agreed, adding that the dynamic is opening up opportunities for data specialists who’ve embraced machine learning. “ML models can automate the process of figuring out how data exhaust relates, identifying correlations between various streams in a fraction of the time it may have otherwise required,” he explained.
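A full machine-learning pipeline is beyond a short example, but the underlying idea of automatically scanning streams for relationships can be shown with a much simpler stand-in: pairwise Pearson correlation. The Python sketch below is that stand-in, not an ML model and not Koellhofer's method. The three streams, their values and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical daily counts drawn from three unrelated exhaust streams.
STREAMS = {
    "support_tickets": [12, 15, 11, 20, 25, 18, 30],
    "failed_logins": [30, 36, 29, 47, 58, 44, 70],
    "cafeteria_orders": [200, 180, 210, 190, 205, 185, 195],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_pairs(streams):
    """Score every pair of streams and sort by strength of correlation."""
    scored = [
        (a, b, pearson(streams[a], streams[b]))
        for a, b in combinations(streams, 2)
    ]
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)


ranked = rank_pairs(STREAMS)
for a, b, r in ranked:
    print(f"{a} vs {b}: r = {r:.2f}")
```

A strong score only flags a pair worth investigating. It says nothing about cause, which is one reason the hypothesis-first approach Nielsen recommends still matters.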
The Impact on Data Science Roles
Will the growing use of data exhaust dramatically change the data scientist’s role? Nielsen sees it leading to more of a “shift in focus,” while Koellhofer said it makes open-mindedness “critical” and re-emphasizes the importance of soft skills.
Data scientists will have to recognize that data exhaust has “the potential to create value for the organization and be able to clearly communicate that with individuals across the business … so they can also understand which datasets are available and what their business potential could be,” Koellhofer said.
Nielsen and Koellhofer also believe data exhaust can’t be fully exploited in a vacuum. Koellhofer said it’s “imperative to get data scientists in front of the various users to understand their goals, tasks and motivations. Understanding the customer is essential in order to explore what insights are worthwhile to present.”
“When you’re dealing with so much data about humans, it’s really important to take some of the more classical scientist and social scientist ilk of thinking about psychology, and check in with actual people to confirm what you’re seeing and gut check whether it makes sense,” said Nielsen. Comparing a huge dataset that measures “how people are really acting with survey data or other insights can help you truly understand what’s going on.”