Given the burgeoning interest in Hadoop and data analytics in general, it’s unsurprising that IT vendors and developers would look for ways to speed up the process of sorting data and gaining insights from it. Enter “Drill,” a new open-source project proposed via the Apache Software Foundation’s incubator wing.
“There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers),” read the proposal submitted for the project. “This need was identified by Google and addressed internally with a system called Dremel.”
Over the past few years, a number of open-source frameworks have emerged to help data analysts and IT departments with scalable batch processing. Of these, Apache Hadoop has become the favorite of many organizations needing to crunch massive amounts of data. But in the eyes of Drill’s creators, Hadoop’s design prevents it from achieving “the sub-second latency needed for interactive data analysis and exploration.”
Drill, they added, “is intended to address this need.”
Drill’s architecture centers on four components: support for a variety of query languages and programming models, including DrQL (used by Dremel and Google BigQuery), Mongo Query Language, and Plume; a low-latency distributed execution engine capable of efficiently querying petabytes of data on 10,000 servers; a layer supporting both schema-based formats (such as Protocol Buffers/Dremel) and schema-less formats (such as JSON); and a layer supporting various data sources, with an initial focus on “Hadoop as a data source.”
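To make the "schema-less nested data" idea concrete, here is a minimal Python sketch of selecting a nested field from JSON records without declaring a schema first. It is not Drill's API and not DrQL; the `walk` and `select` helpers and the sample records are hypothetical, invented purely for illustration, and the sketch ignores everything that makes the real problem hard (distribution, columnar storage, latency).

```python
import json
from typing import Any, Callable, Iterator, List, Optional


def walk(value: Any, path: List[str]) -> Iterator[Any]:
    """Yield every value found at `path`, fanning out over repeated (list) fields."""
    if isinstance(value, list):
        # A repeated field: apply the same remaining path to each element.
        for item in value:
            yield from walk(item, path)
    elif not path:
        # Path fully consumed: this is a matching value.
        yield value
    elif isinstance(value, dict) and path[0] in value:
        yield from walk(value[path[0]], path[1:])
    # Otherwise the field is absent in this record: yield nothing.


def select(records: List[dict], field: str,
           where: Optional[Callable[[dict], bool]] = None) -> List[Any]:
    """Hypothetical helper: collect `field` (dotted path) from records passing `where`."""
    parts = field.split(".")
    out: List[Any] = []
    for rec in records:
        if where is None or where(rec):
            out.extend(walk(rec, parts))
    return out


# Sample data (invented): one JSON object per line, with an optional nested list.
RAW = """
{"user": "ana", "visits": [{"url": "/a", "ms": 120}, {"url": "/b", "ms": 45}]}
{"user": "bo", "visits": [{"url": "/a", "ms": 300}]}
{"user": "cy"}
"""
records = [json.loads(line) for line in RAW.strip().splitlines()]

urls = select(records, "visits.url")
slow = select(records, "visits.ms", where=lambda r: r["user"] != "ana")
print(urls)
print(slow)
```

The third record has no `visits` field at all, and the query simply skips it rather than failing, which is the practical difference between a schema-less format like JSON and a schema-based one like Protocol Buffers, where the record layout is declared up front.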
Drill will eventually support encryption on the wire, although that is not one of the project’s initial goals.
“Significant work” has apparently been done to identify Drill’s initial requirements and system architecture, with implementation of the four components offered as the next step. There’s a growing need for tools capable of large-dataset analysis (look at the buzz around Hadoop), but Drill’s creators acknowledge that any project of this scope carries inherent risks. Vendors changing their data-analytics strategies could doom the project, though that scenario seems unlikely given the current level of interest.
The proposal seeks to downplay other potential dangers, including excessive reliance on salaried developers (“we are confident that the project will continue even if no salaried developers contribute to the project”) and relationships with other Apache products (“we look forward to collaborating with those communities, as well as other Apache communities”). Initial contributors to the project include employees of MapR Technologies, Drawn to Scale, and Concurrent, with mentors from MapR Technologies, Lucid Imagination, and Nokia.