GitHub has posted a detailed post-mortem for last Saturday’s massive outage, which took the website down for several hours. The cause: a software upgrade on its aggregation switches went wrong, and the resulting network disruption cascaded to its file servers.
“This was one of the worst outages in the history of GitHub, and it’s not at all acceptable to us,” GitHub technical staff member Mark Imbriaco wrote in that post-mortem. “I’m very sorry that it happened and our entire team is working hard to prevent similar problems in the future.”
The outage occurred because the site was taking proactive steps to solve an earlier problem involving slowdowns and latency. To resolve that issue, GitHub was shifting data from one server to another via a limited number of switches, placing a disproportionate load on those switches. Even as it adopted an aggregation network model, which combines aggregation switches at the top of the network tree with redundant per-cabinet access switches, GitHub discovered a number of problems with its switch behavior, including a bug in the switch software. When the unnamed switch vendor recommended a software upgrade, GitHub’s second round of problems began.
In addition to a pair of redundant access switches, GitHub’s servers employ aggregation switches that are also deployed in a redundant configuration, using MLAG (multi-chassis link aggregation) to appear as a single switch. Normally, upgrading the software on one switch and then rebooting it would take that switch offline, at which point the other, redundant switch would take over.
But this time, GitHub began experiencing instability problems. Although the connection between aggregation switch and access switch had dropped, the aggregation switch’s counterpart continued to register active links between the two points. “With unlucky timing and the extra time that is required for the agent to record its running state for analysis, the link remained active long enough for the peer switch to detect a lack of heartbeat messages while still seeing an active link and failover using the more disruptive method,” Imbriaco wrote.
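The timing race Imbriaco describes can be pictured with a toy model. The sketch below is not the vendor’s MLAG logic, which the post-mortem does not publish; the `Failover` enum, the function names, and the timing values are all hypothetical. It assumes only what the quote states: a peer that loses heartbeats after the link has already dropped fails over cleanly, while a peer that loses heartbeats with the link still up takes the more disruptive path.

```python
from enum import Enum


class Failover(Enum):
    NONE = "no failover"
    GRACEFUL = "graceful failover"
    DISRUPTIVE = "disruptive failover"


def peer_decision(heartbeat_ok: bool, link_up: bool) -> Failover:
    """Hypothetical decision rule for the surviving aggregation switch."""
    if heartbeat_ok:
        return Failover.NONE
    if not link_up:
        # Heartbeats gone and link down: the peer is clearly offline.
        return Failover.GRACEFUL
    # Heartbeats gone but link still up: the ambiguous, disruptive case.
    return Failover.DISRUPTIVE


def decision_at_timeout(link_drop_s: float, heartbeat_timeout_s: float) -> Failover:
    """Evaluate the rule at the moment the heartbeat timeout expires.

    link_drop_s is how long after heartbeats stop the link actually goes down
    (an assumed parameter, not a figure from the post-mortem).
    """
    link_still_up = link_drop_s > heartbeat_timeout_s
    return peer_decision(heartbeat_ok=False, link_up=link_still_up)
```

In this model, a link that drops quickly, as in `decision_at_timeout(0.5, 3.0)`, yields a graceful failover, while a link that lingers past the timeout, as in `decision_at_timeout(5.0, 3.0)`, yields the disruptive one.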
All the active links needed to be re-established, and so traffic to the access switches went down for around 90 seconds. The issue cascaded to the file servers (which used Pacemaker, Heartbeat, and DRBD to maintain high availability).
Data written to the active node in each of GitHub’s fileserver pairs is copied to the standby node, so that the standby can take over if the active server fails. The website also relies on a Shoot The Other Node In The Head (STONITH) process to keep the two nodes from conflicting, with two safeguards. First, to prevent the two nodes from “competing” with each other, the backup server must always be in read-only mode. In addition, any failed active server is powered off, leaving the backup server online.
However, the network problems led each node to try to “shoot” the other; worse, each expected to be the active server once the network came back online. At this point, Imbriaco and the other GitHub staff put the site into maintenance mode, downgraded the aggregation switch software, and began recovery.
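A small simulation shows why a network partition defeats this scheme. It is a simplified sketch, not Pacemaker’s or DRBD’s actual behavior: the `Node` class, the node names, and the two rules are assumptions chosen to mirror the description above, in which a node that cannot reach its peer both tries to fence it and expects to be the active server.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str  # hypothetical node name
    role: str  # "active" or "standby"


def wants_to_fence(node: Node, peer_reachable: bool) -> bool:
    """A node tries to power off its peer once it can no longer reach it."""
    return not peer_reachable


def claims_active(node: Node, peer_reachable: bool) -> bool:
    """The active node keeps its role; a standby promotes itself only
    when it believes the active node is gone."""
    return node.role == "active" or not peer_reachable


def simulate(pair: list[Node], network_up: bool) -> dict[str, list[str]]:
    """Report which nodes attempt fencing and which claim the active role."""
    return {
        "fencing": [n.name for n in pair if wants_to_fence(n, network_up)],
        "claiming_active": [n.name for n in pair if claims_active(n, network_up)],
    }


pair = [Node("fs1a", "active"), Node("fs1b", "standby")]
healthy = simulate(pair, network_up=True)
partitioned = simulate(pair, network_up=False)
```

With the network up, only `fs1a` claims the active role and nobody fences. With the network down, both nodes appear in both lists, which is the split-brain condition the safeguards were meant to rule out.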
The problem was that, in some cases, GitHub couldn’t determine which of the fileserver nodes had been active. The nodes themselves weren’t much help, since in some pairs both nodes thought they had taken over. So GitHub engineers had to pore through log files, manually trying to make the determination for each pair of nodes. That process, in total, took about five hours.
What lessons did GitHub learn? Four things.
First, GitHub said it would build a functionally equivalent test network on which to perform aggregation switch upgrades before pushing them live to the production environment.
Second, GitHub will put its fileserver high-availability software into maintenance mode before performing network changes, thus preventing any automated failover actions.
Third, the company is working with its network provider to address the cluster communication issues between file server nodes.
And fourth, the company is “reviewing all of our high availability configurations with fresh eyes to make sure that the failover behavior is appropriate.” GitHub also said that its network vendor is working to resolve the MLAG redundancy problems.
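The second lesson, suspending automated failover for the duration of a network change, follows a familiar guard pattern. The sketch below uses an in-memory stand-in rather than any real Pacemaker interface; `FileserverCluster`, `on_peer_lost`, and `maintenance_mode` are hypothetical names, and the only assumption is that failover actions are skipped while a maintenance flag is set.

```python
from contextlib import contextmanager


class FileserverCluster:
    """Hypothetical stand-in for a high-availability fileserver pair."""

    def __init__(self) -> None:
        self.maintenance = False
        self.failovers = 0

    def on_peer_lost(self) -> bool:
        """Return True if an automated failover was triggered."""
        if self.maintenance:
            return False  # automated actions are suppressed
        self.failovers += 1
        return True


@contextmanager
def maintenance_mode(cluster: FileserverCluster):
    """Suppress automated failover while the wrapped block runs."""
    cluster.maintenance = True
    try:
        yield cluster
    finally:
        # Always restore normal behavior, even if the change fails.
        cluster.maintenance = False


cluster = FileserverCluster()
with maintenance_mode(cluster):
    # A network change here may interrupt cluster communication.
    triggered_during_change = cluster.on_peer_lost()
triggered_afterwards = cluster.on_peer_lost()
```

The `finally` clause matters in this design: if the network change raises an error, the cluster still leaves maintenance mode instead of staying unprotected.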