Big data is big business. We’ve previously suggested it could improve cybersecurity and drug development. Now there’s a new avenue for tech pros to explore: Yahoo (part of Oath, which includes AOL) has released the freshly open-sourced Vespa.
No, ‘Vespa’ in this case doesn’t refer to the underpowered scooter. It’s meant to fill in the gaps left by Hadoop, the open-source framework for distributed data storage and processing that Yahoo open-sourced in 2006 (Hadoop is now one of the highest-paying Big Data skills). Released as an open-source project on GitHub, Vespa lets tech pros “build applications that can compute responses to user requests, over large datasets, at real time and at internet scale – capabilities that up until now, have been within reach of only a few large companies.”
Like a two-wheeled Vespa, in other words, this platform is meant to be speedy. “While developers can use the Hadoop stack to store and batch process big data, and Storm to stream-process data, these technologies do not help with serving results to end users,” Yahoo’s posting on the matter continued. “Serving is challenging at large scale, especially when it is necessary to make computations quickly over data while a user is waiting, as with applications that feature search, recommendation, and personalization.”
In addition to Yahoo, Vespa is used across various Oath brands such as Flickr, helping the company process “billions of daily requests over billions of documents while responding to search queries.” This is possible because Vespa spreads its workload across multiple machines, “without any single master as a bottleneck.” It can also run on a single local machine, on AWS, or in a Docker container.
Vespa is useful because it takes much of the boilerplate Big Data work off developers’ plates. They’ll no longer have to score results for relevance against each request, organize the returned results, or worry about duplicates: Vespa handles all of that.
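To make those chores concrete, here is a minimal Python sketch of the kind of serving-time work a developer would otherwise write by hand: scoring documents against a request, ordering the results, and dropping duplicates. It is not Vespa’s API or its ranking method. The `Document` class, the `score` and `serve` functions, and the term-counting relevance measure are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    # Hypothetical document shape, not a Vespa schema.
    doc_id: str
    text: str


def score(query: str, doc: Document) -> int:
    """Toy relevance: count the distinct query terms found in the document."""
    doc_terms = set(doc.text.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in doc_terms)


def serve(query: str, corpus: list[Document], limit: int = 10) -> list[Document]:
    """Score each document, skip duplicate IDs, and return the best matches."""
    seen: set[str] = set()
    scored: list[tuple[int, Document]] = []
    for doc in corpus:
        if doc.doc_id in seen:
            continue  # duplicate document, already considered
        seen.add(doc.doc_id)
        relevance = score(query, doc)
        if relevance > 0:
            scored.append((relevance, doc))
    # Highest relevance first; the sort is stable, so ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:limit]]


corpus = [
    Document("a", "red scooter for sale"),
    Document("b", "fast red scooter with new red paint"),
    Document("a", "red scooter for sale"),  # duplicate of the first entry
    Document("c", "blue bicycle"),
]
results = serve("fast red scooter", corpus)
print([doc.doc_id for doc in results])  # prints ['b', 'a']
```

On four documents this is trivial. Doing the same over billions of documents while a user waits is the part that gets hard, and that is the problem a serving engine is meant to take on.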
The TL;DR version here is that anyone can use the same Big Data engine as Yahoo. But Vespa isn’t tech pros’ only option when it comes to speedy and powerful Big Data solutions: IBM Streams promises many of the same features, and Hadoop is still a go-to for distributed processing. Pachyderm is a lesser-known competitor, but it is very similar to Vespa (especially in its use of Docker).
If you find Vespa too cumbersome, we suggest checking out this GitHub repo of Big Data sets. It lays out specific use-cases, and lists open-source Big Data solutions for those problems. And if Vespa intrigues you, Oath is promising to flesh out its blog with how-tos and use cases in the coming weeks.