CloudFlare essentially “dropped off the Internet” for about an hour on March 3, as all its edge routers (used by its 23 worldwide data centers) failed.
Why? Bad code.
Part of the problem was that CloudFlare received poor information. That morning, as one of the company’s customers was hit by a DDoS attack, the company’s internal tool for detecting and profiling attacks returned some odd data: packets involved in the attack were between 99,971 and 99,985 bytes long—and that’s strange, as CloudFlare noted, because the largest packets sent across the Internet are typically in the 1,500-byte range and average between 500 and 600 bytes.
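Those reported lengths were implausible on their face: they exceed not only the standard Ethernet MTU but even jumbo frames. A minimal sanity-check sketch (all names here are hypothetical, not CloudFlare's actual tooling) shows how such profiler output could be flagged before it feeds a router rule:

```python
# Hypothetical sanity check on reported packet lengths. Anything above the
# jumbo-frame ceiling (~9,000 bytes) cannot be a real on-the-wire packet,
# so it should be flagged rather than turned into a filter rule.
ETHERNET_MTU = 1500
JUMBO_FRAME_MAX = 9000

def plausible_packet_length(length_bytes):
    """Return True if a reported length could describe a real packet."""
    return 0 < length_bytes <= JUMBO_FRAME_MAX

reported = [99971, 99985, 540, 1500]
suspect = [n for n in reported if not plausible_packet_length(n)]
print(suspect)  # [99971, 99985]
```

A check like this would not have stopped the attack, but it would have kept the garbage data from reaching the routers at all.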
CloudFlare’s edge routers are manufactured by Juniper and support a tool, FlowSpec, which allows a customer to push new router rules very quickly. When CloudFlare told the routers to discard packets between 99,971 and 99,985 bytes long, the routers consumed every bit of RAM and crashed. Worse still, they were either unable to reboot or crashed again when they restarted.
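On Junos devices, a FlowSpec rule of this shape is expressed as a flow route with a packet-length match. The fragment below is purely illustrative—the destination prefix is a placeholder and this is not CloudFlare's actual rule—but it shows roughly what the routers were asked to apply:

```
routing-options {
    flow {
        route drop-oversize-attack {
            match {
                destination 203.0.113.0/24;   /* placeholder prefix */
                packet-length 99971-99985;    /* the anomalous range */
            }
            then {
                discard;
            }
        }
    }
}
```

The packet-length values here are far beyond anything a real packet can carry, which appears to be what triggered the crash.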
“We removed the rule and then called the network operations teams in the data centers where our routers were unresponsive to ask them to physically access the routers and perform a hard reboot,” CloudFlare wrote, according to a Google cache of the page. (CloudFlare’s blog was inaccessible Monday morning.)
“We have already reached out to Juniper to see if this is a known bug or something unique to our setup and the kind of traffic we were seeing at the time,” CloudFlare added. “We will be doing more extensive testing of Flowspec provisioned filters and evaluating whether there are ways we can isolate the application of the rules to only those data centers that need to be updated, rather than applying the rules network wide.”
The company will also “proactively issue service credits to accounts covered by SLAs. Any amount of downtime is completely unacceptable to us and the whole CloudFlare team is sorry we let our customers down this morning.”
CloudFlare noted that its own error mirrored what happened in Syria earlier this year. Only in this case, the outage was accidental, not malicious—but it was an outage just the same.
When Microsoft’s Windows Azure cloud service went down on Feb. 22, the company acknowledged it was the result of an expired SSL certificate. How could the team have made such a basic error? Turns out, it didn’t.
Garbage in, garbage out: it’s one of the basic axioms in computer programming. And it’s something that affected both Microsoft and CloudFlare. In Microsoft’s case, though, there was a special twist: fixing the “garbage”—or at least a bug—prevented the SSL certificate from being processed.
In explaining the root cause of the outage, Microsoft blamed itself for a few mistakes. The first (and possibly most severe) was acquiring the SSL certificates in such a way that the blobs, queues and tables used by the Storage accounts within its Azure service expired at roughly the same time, worldwide. That meant, essentially, that any problem with the SSL certs would have produced a worldwide outage.
Microsoft maintains a “Secret Store” of SSL certificates, and automated processes notify the affected teams of expiring certificates 180 days before they expire. In this case, the Storage team was notified, took action, and renewed the certificates on Jan. 7.
But why didn’t the certificates actually process? Because they were pushed out of the production queue. “[T]he team failed to flag the storage service release as a release that included certificate updates,” Microsoft wrote. “Subsequently, the release of the storage service containing the time critical certificate updates was delayed behind updates flagged as higher priority, and was not deployed in time to meet the certificate expiration deadline. Additionally, because the certificate had already been updated in the Secret Store, no additional alerts were presented to the team, which was a gap in our alerting system.”
After it discovered its error, Microsoft rolled out updated SSL certificates to individual services and regions, taking its time to make sure the process was tested before deployment. The outage was detected at 12:44 PM PST on Feb. 22, and the certs were issued to its West Coast data center at 7:20 PM PST. Microsoft returned its Azure services to normal at about 8:45 PM PST, worldwide. An update containing the certificates was also pushed the next day.
Microsoft said that it would extend its monitoring of certificate expiration to cover customer-facing endpoints as well as its Secret Store, and give expiring SSL certs higher priority: “Any production certificate that has less than 3 months until the expiration date will create an operational incident and will be treated and tracked as if it were a Service Impacting Event,” it wrote.
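That 3-month rule is easy to picture as code. The sketch below is a hypothetical illustration (the certificate inventory, names, and dates are invented, not Microsoft's actual system): scan an expiry inventory and surface everything inside the 90-day window as an incident.

```python
from datetime import datetime, timedelta

# Hypothetical expiry inventory, as might live in a "Secret Store".
certs = {
    "storage.core.windows.net": datetime(2013, 3, 24),
    "example.blob.core.windows.net": datetime(2013, 9, 1),
}

def expiring(certs, now, window_days=90):
    """Return cert names expiring within window_days of now. Under the
    policy described above, each hit would open an operational incident."""
    cutoff = now + timedelta(days=window_days)
    return sorted(name for name, exp in certs.items() if exp <= cutoff)

now = datetime(2013, 2, 22)
print(expiring(certs, now))  # ['storage.core.windows.net']
```

The key design point is that the alert keys off the certificate actually deployed at the endpoint, not off the renewal record in the store—the gap that bit Microsoft was precisely that the store said “renewed” while production still held the old cert.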
Image: Dmitry Kalinovsky/Shutterstock.com