Netflix Gives Data Center Tools to Fail

Netflix’s Hystrix: nobody ever said that streaming X-Files episodes to your laptop wasn’t complicated.

Netflix has released Hystrix, a library designed for managing interactions between distributed systems, complete with “fallback” options for when those systems inevitably fail.

The code for Hystrix—which Netflix tested on its own systems—can be downloaded at Github, with documentation available here, in addition a getting-started guide and operations examples, among others.

Hystrix evolved out of Netflix’s need to manage an increasing rate of calls to its APIs, and resulted in (according to the company) a “dramatic improvement in uptime and resilience has been achieved through its use.”

The Netflix API receives more than 1 billion incoming calls per day, which translates into several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying systems, with peaks of over 100,000 dependency requests per second. That’s according to Netflix engineer Ben Christensen, who described the incredible loads on the company’s infrastructure in a February blog posting. The vast majority of those calls serve the discovery user interfaces (UIs) of the more than 800 different devices supported by Netflix.

According to Christensen, even if the Netflix API enjoys 99.7 percent uptime, the 0.3 percent downtime translates into 3 million failures—more than 2 hours of downtime per month, theoretically. “Reality is generally worse,” Christensen said during an August 2012 presentation.

Moreover, latency is generally worse. If a dependency fails, it can “fail fast” and shed its load. Latency can back up queues, threads, system resources, and even cause an entire system to fail if the process isn’t isolated (with isolation, backend latency spiked from less than 100 ms to over 10,000 ms—i.e., 10 seconds—in the 90th percentile of cases). At that point, users may simply give up.

Netflix decided that, in the event of a dependency failure, it would sacrifice aspects such as personalization in order to keep streaming content. Netflix’s Hystrix includes fallback code that can kick in if a service is failing too often, serving local data if possible. If the service responds to test queries, it is restored. In certain cases, Hystrix “fails silently,” returning a null value. And if there’s no fallback, it “fails quickly,” generating a displeasing HTTP 5xx response to the user—but at least it keeps the API servers healthy, Ben Schmaus, a senior manager for Netflix, wrote in a blog post last year.

Netflix will also release the real-time dashboard it uses for monitoring Hystrix. That dashboard relies on a traffic-light system to display service dependencies for the last ten seconds, with colors measuring latency and the size of the circles showing traffic.

“You might ask, ‘Do you really need a dashboard that shows you the state of your service dependencies for the last 10 seconds?’ The Netflix API receives around 20,000 requests per second at peak traffic,” Schmaus wrote. “At that rate, 10 seconds translates to 200,000 requests from client devices, which can easily translate to 1,000,000+ requests from the API into upstream services. A lot can happen in 10 seconds, and we want our software to base its decision making on what just happened, not what was happening 10 or 15 minutes ago. These real-time insights can also help us identify and react to issues before they become member-facing problems.”


Image: Netflix