When it comes to wrestling with massive amounts of data, Netflix isn’t a unique company: Facebook, Google, Amazon, and other Internet giants also face a flood of petabytes into their datacenters. That being said, Netflix has managed to create some unique tools for building out the size and power of its cloud-based data warehouse—a necessary task, considering its ever-expanding media library.
Netflix stores its data on Amazon’s Simple Storage Service (S3). That’s in contrast to other firms, which might opt to run their platforms on commodity hardware. In addition to its durability and ability to scale to massive size, S3 also provides bucket versioning, which protects Netflix’s data from inadvertent loss.
“Reading and writing from S3 can be slower than writing to HDFS,” read a recent note on the Netflix Tech Blog. “However, most queries and processes tend to be multi-stage MapReduce jobs, where mappers in the first stage read input data in parallel from S3, and reducers in the last stage write output data back to S3. HDFS and local storage are used for all intermediate and transient data, which reduces the performance overhead.”
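The storage pattern the blog describes, S3 for initial input and final output with HDFS holding everything in between, amounts to a simple path-selection rule per stage. The sketch below illustrates that rule; all bucket names and paths are hypothetical.

```python
# Sketch of the storage layout described above: mappers in the first
# stage read from S3, reducers in the last stage write back to S3, and
# all intermediate data stays on HDFS. Paths are hypothetical.

def stage_paths(stage, total_stages, job_id="job-001"):
    """Return (input_path, output_path) for one stage of a multi-stage job."""
    s3_in = "s3://example-warehouse/input/"              # durable source data
    s3_out = f"s3://example-warehouse/output/{job_id}/"  # durable final output
    hdfs = f"hdfs:///tmp/{job_id}/stage-{{n}}/"          # transient intermediates

    inp = s3_in if stage == 0 else hdfs.format(n=stage - 1)
    out = s3_out if stage == total_stages - 1 else hdfs.format(n=stage)
    return inp, out

# A three-stage job touches S3 only at the edges:
for s in range(3):
    print(stage_paths(s, 3))
```

Only the first read and the last write pay the S3 performance penalty; the intermediate hops stay on faster local storage, which is the overhead reduction the blog refers to.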
Netflix’s reliance on Amazon doesn’t end there: the company also uses Amazon’s Elastic MapReduce (EMR) distribution of Hadoop, spinning up multiple Hadoop clusters for different workloads that access the same core dataset. (A number of large companies, including Netflix, rely on the Apache Hadoop framework to help run massive data applications on large hardware clusters.)
Netflix developers use a number of tools within the Hadoop ecosystem, including Hive and Pig. Netflix also offers “Genie,” a Hadoop-based Platform-as-a-Service (PaaS). Genie packs a set of REST-ful services for job management within the Hadoop ecosystem; Netflix cites its two key services as the Execution Service, a REST-ful API for submitting and managing Hadoop, Hive, and Pig jobs, and the Configuration Service, a repository of metadata about available Hadoop resources.
In the words of the Netflix Tech Blog, Genie “provides a higher level of abstraction, where one can submit individual Hadoop, Hive and Pig jobs via a REST-ful API without having to provision new Hadoop clusters, or installing any Hadoop, Hive or Pig clients.” The system also allows “administrators to manage and abstract out configurations of various back-end Hadoop resources in the cloud.”
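As a rough illustration of the two services named above, here is a minimal in-memory sketch: a configuration store that registers back-end cluster metadata, and an execution service that accepts job submissions and hands back an ID for status polling. Every class, method, and field name here is invented for the sketch; this is not Genie’s actual API.

```python
import itertools

# Minimal in-memory sketch of the two Genie services described above.
# All names are hypothetical, not Genie's real REST API.

class ConfigurationService:
    """Stores metadata about available back-end Hadoop clusters."""
    def __init__(self):
        self._clusters = {}

    def register(self, name, config):
        self._clusters[name] = config

    def lookup(self, name):
        return self._clusters[name]

class ExecutionService:
    """Accepts Hadoop/Hive/Pig job submissions and tracks their status."""
    def __init__(self, config_service):
        self._configs = config_service
        self._jobs = {}
        self._ids = itertools.count(1)

    def submit(self, job_type, script, cluster):
        # A real service would dispatch the job to the chosen cluster;
        # here we just record it and return an ID for status polling.
        self._configs.lookup(cluster)  # fail fast on an unknown cluster
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {"type": job_type, "script": script,
                              "cluster": cluster, "status": "RUNNING"}
        return job_id

    def status(self, job_id):
        return self._jobs[job_id]["status"]

configs = ConfigurationService()
configs.register("prod-etl", {"site_config": "s3://example-conf/prod/"})
exec_svc = ExecutionService(configs)
jid = exec_svc.submit("hive", "SELECT count(*) FROM plays", cluster="prod-etl")
print(jid, exec_svc.status(jid))  # a caller would poll until the job finishes
```

The point of the abstraction is visible even in this toy version: the client names a cluster and a script, and never installs a Hadoop, Hive, or Pig client or provisions hardware itself.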
According to the blog, nothing already out there in the open source community handled Netflix’s requirements: an API to run jobs, abstraction of the back-end clusters, the ability to submit jobs to multiple clusters, and enough scalability (horizontal or otherwise) to support Netflix’s usage. “We ruled out the use of Oozie as our scheduler, since it only supports jobs in the Hadoop ecosystem, whereas our process flows span Hadoop and non-Hadoop jobs,” the blog read. Nor did Templeton, another alternative the company considered, fulfill all of Netflix’s requirements.
Netflix now relies on Genie for dynamic resource management, including spinning up additional clusters to supplement its regular production clusters. Genie is a vital part of the company’s production environment, deployed in a 6-12 node Auto Scaling Group (ASG) across three Availability Zones. On a daily basis, it runs hundreds of Hive- and Pig-based ETL jobs for data warehousing, in addition to hundreds of visualization-related Hive jobs. “We are considering open sourcing Genie in the near future,” the blog posting concludes, “and would love to hear your feedback on whether this might be useful in your big data environment.”
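The dynamic resource management idea, routing work to a cluster with spare capacity and adding a cluster when the fleet is full, can be sketched in a few lines. The function and cluster names below are invented for illustration; actually launching an EMR cluster is reduced to adding a dictionary entry.

```python
# Hypothetical sketch of dynamic resource management as described above:
# route a job to a cluster with headroom, and "spin up" a new cluster
# when every existing one is at capacity. All names are invented.

def pick_cluster(clusters, capacity=100):
    """clusters: dict of name -> running-job count. Returns a cluster name,
    registering a fresh cluster when all existing ones are full."""
    for name, load in clusters.items():
        if load < capacity:
            return name
    new_name = f"bonus-{len(clusters)}"  # stand-in for launching an EMR cluster
    clusters[new_name] = 0
    return new_name

fleet = {"prod-a": 100, "prod-b": 97}
print(pick_cluster(fleet))   # prod-b still has headroom
fleet["prod-b"] = 100
print(pick_cluster(fleet))   # all full: a new cluster is added
```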
Image: Cre8tive Images/Shutterstock.com