How do Facebook engineers manage hundreds of servers and racks without getting lost in all that data? By visualizing it, of course.
In a corporate blog posting Sept. 19, Facebook application operations engineer Sean Lynch revealed the development of a tool, “Claspin,” which generates a heat map of the company’s numerous racks and servers, the better to determine which are “bad” and in need of repair. The post also featured some of those visualizations.
According to Lynch, Facebook originally set out to manage the health of its computing resources via two tools: Memcache and TAO, a caching graph database that performs its own MySQL queries. TAO generated reams of data from servers and clients, all of it collected into dashboards showing various latency and error-rate statistics, but that approach eventually ran into scalability problems: “This worked well at first, but as Facebook grew both in size and complexity, it became more and more difficult to figure out which piece was broken when something went wrong.”
In the wake of that, Lynch turned to creating a tool that could generate lists of hosts, each with rankings for the number of timeouts, for example, or TCP retransmits. The resulting tool listed each server in a tuple, or an ordered list of elements. But the solution was also text-heavy and required a somewhat-trained operator to manage the problem—in that case, Lynch himself.
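For readers who want a concrete picture of that interim step, here is a minimal Python sketch of a ranked host list built from tuples. It is an illustration only, not Lynch's code: the `HostStats` fields, the `rank_hosts` helper and the sample host names are all hypothetical.

```python
from typing import NamedTuple, Optional


class HostStats(NamedTuple):
    """One tuple per server (hypothetical fields, for illustration only)."""
    host: str
    timeouts: int
    tcp_retransmits: int


def rank_hosts(stats, key: str = "timeouts", top: Optional[int] = None):
    """Return the tuples sorted worst-first by the chosen metric."""
    ranked = sorted(stats, key=lambda s: getattr(s, key), reverse=True)
    return ranked if top is None else ranked[:top]


# Made-up sample data, not real measurements.
sample = [
    HostStats("host-a", timeouts=3, tcp_retransmits=12),
    HostStats("host-b", timeouts=41, tcp_retransmits=7),
    HostStats("host-c", timeouts=0, tcp_retransmits=95),
]

for entry in rank_hosts(sample, key="tcp_retransmits"):
    print(entry)
```

Even in this toy form, the output is a wall of text that someone has to read and interpret, which is the limitation the article describes.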
So Lynch settled on a heatmap, with each “pixel” representing a host. He settled on a separate heatmap per cluster, ordered by rack number and with each rack drawn vertically in an alternating “snake” pattern so racks would stay contiguous even if they wrapped around the top or bottom of the display. “The rack names naturally sort by datacenter, then cluster, then row, so problems common at any of these levels are readily apparent,” he wrote.
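The “snake” ordering can be sketched in a few lines of Python. This is one plausible reading of the description, not Facebook's implementation: the `snake_layout` function, the `height` parameter and the rack names are assumptions made for illustration. Hosts are laid out column by column, with every other column running bottom to top, so a rack that spills over the edge of one column continues in the neighboring pixel of the next.

```python
def snake_layout(racks, height):
    """Map each host to a (column, row) pixel position.

    `racks` maps a rack name to its list of hosts; racks are taken in
    sorted-name order. Even columns run top to bottom, odd columns run
    bottom to top, so consecutive hosts are always adjacent pixels.
    """
    positions = {}
    index = 0
    for rack in sorted(racks):
        for host in racks[rack]:
            col, offset = divmod(index, height)
            row = offset if col % 2 == 0 else height - 1 - offset
            positions[host] = (col, row)
            index += 1
    return positions


# Hypothetical racks of three hosts each on a display four pixels tall.
example = snake_layout({"rack1": ["a", "b", "c"], "rack2": ["d", "e", "f"]}, height=4)
print(example)
```

In this example the second rack starts at the bottom of the first column and continues at the bottom of the second, so its hosts stay next to each other instead of jumping back to the top of the display.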
The “hotness” of a pixel is still determined by the characteristics previously identified by Lynch. If a pixel is black (no value), Facebook has a pretty good idea that the node is completely down.
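A rough Python sketch of how a metric might be turned into a pixel color follows. The only detail taken from Lynch's description is that a missing value is drawn black; the green-to-red scale, the `pixel_color` name and the `worst` threshold are assumptions for illustration.

```python
def pixel_color(value, worst):
    """Return an (R, G, B) tuple for one host.

    `value` is the host's metric (for example, a timeout count) or None
    if the host reported nothing; `worst` is a positive value at or
    above which the pixel is fully red (a hypothetical threshold).
    """
    if value is None:
        return (0, 0, 0)  # no data: black, the host is probably down
    frac = min(max(value / worst, 0.0), 1.0)  # clamp to the range 0..1
    return (round(255 * frac), round(255 * (1 - frac)), 0)
```

Keeping “no data” visually distinct from “healthy” is what lets an operator spot a dead node at a glance.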
In the event of ongoing problems, he added, “it’s easy to see when things change because a particular problem will have a particular pattern on the screen.”
Just as an aside: Claspin was named for a protein that monitors for DNA damage inside a cell.
However useful to Facebook’s operations, it’s hard to tell if other data-center operators will be able to use the tool. “We always try to open source tools like this, so it’s something we’ll consider with Claspin,” a Facebook spokeswoman said in an email. “But it’s possible that it’s so tightly integrated with our infrastructure that it wouldn’t be broadly useful.”
SlashDataCenter reached out to Lynch (via Facebook, natch) and will update this post if he responds.