Amazon Hints at Outage Cause

Matt Mankins, the chief technical officer for FastCompany, tweeted this picture of his network’s traffic during Amazon’s outage.

Amazon suffered a prolonged outage in its Northern Virginia region Oct. 23, knocking popular sites such as Foursquare offline.

The tech giant didn’t immediately reveal the source of the problems, although the company’s status page reported issues with Amazon’s Elastic Compute Cloud (EC2) and Relational Database Service (RDS), CloudWatch and CloudSearch, plus supporting services like the AWS management console. For EC2 and RDS, those problems persisted throughout most of the day, even as Amazon worked to fix the issue. One member of the hacker collective Anonymous claimed that he had attacked Amazon, a claim the company denied.

By 6:33 PM Oct. 23, Amazon offered a little more detail on the possible source of the problem. “We are seeing elevated errors rates on APIs related to describing and associating EIP [Elastic IP] addresses,” the company said. “We are working to resolve these errors. In addition, ELB is experiencing elevated latencies recovering affected load balancers and making changes to existing load balancers. These delays are a result of the EIP related API errors and will improve when that issue is resolved.”

The outage isn’t Amazon’s first: a June electrical storm affected its services to the point where high-profile customers such as Instagram and Netflix were knocked offline. Amazon outages in April 2011 took out Foursquare, Netflix, and Twitter, among others. (Both Instagram and Netflix survived this week’s issues apparently unscathed.)

High-profile Websites affected by the current outage included Foursquare, Pinterest, Reddit, the online game Minecraft, Pocket, Airbnb, and Flipboard. Matt Mankins, the chief technical officer for FastCompany, tweeted a picture of his network’s traffic as it plunged to zero, returned to operation, and then, according to him, collapsed to zero again.

The problems officially began at 10:38 AM PST, when Amazon began investigating degraded performance for a small number of Elastic Block Store (EBS) volumes, the block-level storage volumes used with Amazon EC2 instances, in a single Availability Zone in the US-EAST-1 region. At 11:26, Amazon updated its status page to note that “New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance.” Recovery began at 12:32 PM PST, and by 2:20 PM Amazon had restored performance for about half the volumes that were experiencing issues.

“Customers with ELBs running in only the affected Availability Zone may be experiencing elevated error rates and customers may not be able to create new ELBs in the affected Availability Zone,” Amazon said. “For customers with multi-AZ ELBs, traffic was shifted away from the affected Availability Zone early in this event and they should not be seeing impact at this time.” That might have saved Netflix, which had previously indicated that its service straddled several Availability Zones.

Foursquare reported it was back online by 3:02 PM PST, following Pinterest, which announced its service had resumed by 2:52 PM.


Image: @mankins