A cascading hardware outage struck recurring subscription provider Recurly last week, providing a painful lesson in how not to manage critical infrastructure.
Last Monday, the payment provider suffered an intermittent hardware failure, which prevented the company from processing either payments or refunds. The company says it processes recurring subscription payments for more than 1,000 customers, including Adobe, Brightcove, and Fox News Radio.
By Friday, the company still hadn’t completely straightened out the mess and was still posting updates for customers using payment gateways such as Authorize.net and LinkPoint/First Data.
What happened? It all started out in a relatively innocuous way.
“At approximately 3:30 a.m. (PDT) we experienced an intermittent hardware failure, which prevented some transactions from processing,” Recurly noted in a Sept. 4 blog post.
Recurly apparently took the right steps. “We had both failover and extra replacement boxes on hand,” the company said. “Working with our vendor we were able to restore the service and re-rack several replacement boxes. In order to do this, we paused our recurring transaction jobs. This means that new customer sign-ups are not impacted, but customers may see a delay in the posting of recurring daily invoices.”
The company apologized for the inconvenience and said it would provide a post-mortem (or at least a longer explanation) at a later point. But then it learned of a critical mistake.
“On Monday at 3:30am PDT, we experienced a hardware failure in our primary encryption hardware device,” the company noted in a subsequent post. “The failure cascaded to the backup slave device as well. This failure corrupted encryption keys used to access stored credit cards to process recurring transactions. At this point, it remains unclear how much of this data will be retrievable.
“We have been working with our vendor and their many partners to receive their expert assessments of the situation,” Recurly added. “However, the particular failure is in a device that is designed specifically to make retrieval of information extremely difficult. In this situation, we have found ourselves at odds with the very protections we have worked so hard to put in place.”
The problem, as others noted, was that Recurly failed to make a protected backup of its encryption keys. “Why oh why is there no offline backup of their keys?” asked Jordan Thoms, co-founder of Notable. “Make a few copies and put them in safes, and encrypt the keys with a strong password known only to senior management.”
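For readers wondering what Thoms's suggestion might look like in practice, here is a minimal sketch of protecting a key with a passphrase before it goes into a safe. It is an illustration only, not Recurly's process or any vendor's procedure. The function names (`wrap_key`, `unwrap_key`), the 256-bit key size, and the iteration count are all assumptions made for the example. It uses only Python's standard library: PBKDF2 stretches the passphrase into a one-time mask and a separate integrity key, and an HMAC tag lets the restore step detect a wrong passphrase or a damaged copy. A real deployment would more likely rely on the hardware vendor's own key-export mechanism or a vetted cryptography library.

```python
import hashlib
import hmac
import secrets

KEY_LEN = 32          # assumed size of the key being backed up (256 bits)
SALT_LEN = 16
TAG_LEN = 32          # HMAC-SHA256 output size
ITERATIONS = 200_000  # assumed PBKDF2 work factor; tune for your hardware


def _derive(passphrase: str, salt: bytes) -> tuple:
    """Stretch the passphrase into a one-time mask and a separate MAC key."""
    material = hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, ITERATIONS,
        dklen=KEY_LEN + TAG_LEN,
    )
    return material[:KEY_LEN], material[KEY_LEN:]


def wrap_key(key: bytes, passphrase: str) -> bytes:
    """Return salt + masked key + integrity tag, suitable for offline storage."""
    if len(key) != KEY_LEN:
        raise ValueError("this sketch only handles keys of KEY_LEN bytes")
    salt = secrets.token_bytes(SALT_LEN)  # fresh salt, so the mask is never reused
    mask, mac_key = _derive(passphrase, salt)
    masked = bytes(a ^ b for a, b in zip(key, mask))
    tag = hmac.new(mac_key, salt + masked, hashlib.sha256).digest()
    return salt + masked + tag


def unwrap_key(blob: bytes, passphrase: str) -> bytes:
    """Recover the key, refusing if the passphrase is wrong or the copy is damaged."""
    if len(blob) != SALT_LEN + KEY_LEN + TAG_LEN:
        raise ValueError("backup has an unexpected length")
    salt = blob[:SALT_LEN]
    masked = blob[SALT_LEN:SALT_LEN + KEY_LEN]
    tag = blob[SALT_LEN + KEY_LEN:]
    mask, mac_key = _derive(passphrase, salt)
    expected = hmac.new(mac_key, salt + masked, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong passphrase or corrupted backup")
    return bytes(a ^ b for a, b in zip(masked, mask))
```

Under these assumptions, the output of `wrap_key` is what would be copied to offline media and locked away, and `unwrap_key` is the restore step that would be rehearsed before a failure rather than attempted for the first time during one.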
We reached out to Recurly for comment, but haven’t heard back. The company said it would continue to provide updates via its Twitter support account while it works through its backlog of subscriptions. But the company (along with its customers) has doubtless learned a painful lesson.
Image: Gunnar Assmy