When Developers Mess Up in Production

by David Bolton, Jul 27, 2017

In a perfect world, every line of code in every software release would be vetted before production. Nobody involved in development would touch “live” data in a way that could put a system at risk. Alas, we don’t live in that perfect world, and there are abundant examples of tech pros accidentally unleashing chaos on the very systems they’re supposed to keep running.

Just take a (much up-voted) example that appeared on Reddit recently: a junior developer tasked with creating a test database instead accidentally deleted all data from a production database. “I was basically given a document detailing how to setup my local development environment. Which involves run a small script to create my own personal DB instance from some test data,” he wrote. “After running the command i was supposed to copy the database url/password/username outputted by the command and configure my dev environment to point to that database. Unfortunately instead of copying the values outputted by the tool, i instead for whatever reason used the values the document had.” Those values were for the production database. Result: total data vaporization.

The junior developer lost his job, although many Reddit commenters insisted the situation wasn’t his fault; the company should have instituted backups, at the very least. When a similar thing happened at Amazon, the company’s engineers blamed the system setup, as opposed to the developer who accidentally deleted the data, and adjusted procedures so the situation wouldn’t happen again. Despite the catastrophic outage, no employees lost their jobs.

Accidents happen, especially under pressure. I once worked for one of the world’s largest banks, where the emphasis on getting things done fast sometimes meant testing in production and fixing bugs later. This put live data potentially at risk; you could bring some systems to a halt with a two-word command.

With all that in mind, here are some of the biggest and most common ways that developers can mess up in production, as well as suggestions to prevent them from happening.

Transactions

Ever had to make a change to a production database? Once, as a developer providing end-of-day support, I had to do just that late on a Friday night. It had to get done before a deadline because of downstream processes that handled hundreds of millions of dollars’ worth of business. Sure, there were backups, but getting it wrong would have resulted in hours of recovery time.

Luckily, I knew the mantra: "For database data changes, always do them in a transaction." Databases don’t have an undo action, but if you make changes to data inside a transaction, those changes can be rolled back if you make a mistake. When it’s late and you’re tired, a single delete or update with a forgotten ‘where’ clause can unleash total disaster, wiping out every row in a table or setting one or more columns to the same value in every row. Even worse, the server will log the delete, slowing things down for hours. Perform the change in a transaction, though, and even if you forget the ‘where’ clause, you can still issue a "get out of jail" rollback. That’s what I did, slowly and carefully, and I didn’t mess up.

But transactions still have a sting; I once stopped production for five minutes during the day by forgetting to end a transaction. An open transaction can lock a full table, or just part of it, depending on the lock granularity. (Unfortunately, the table in question had a lot of inserts going on, thanks to trades being saved.) When applications started timing out while writing data to the database, I realised what I’d done and committed the transaction. A few traders each had to re-enter their failed-to-save trades, but it could have been a lot worse.

If you’ve used a transaction with a relational database to make safe changes to data, you need to check that you have completed it. If you haven’t, you need to commit it (or roll it back); otherwise the data changes won’t stick, and any locks the transaction holds stay in place. Here’s how you check in SQL Server:

print @@TRANCOUNT

That shows the current connection's transaction count. If it's above 0, you are still in an active transaction in your session and must complete it. Other databases have different ways of doing this. In MariaDB, the variable is called @@in_transaction.
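To make the rollback safety net concrete, here is a minimal sketch in Python using the standard library's sqlite3 module rather than SQL Server. The in-memory database and the trades table are stand-ins invented for the illustration, and the "expected one row" check is just one possible sanity test, not a prescribed procedure. The in_transaction attribute plays roughly the role that @@TRANCOUNT plays above.

```python
import sqlite3

# In-memory database standing in for production; the "trades" table is hypothetical.
# isolation_level=None puts the connection in autocommit mode so that
# transactions are started and ended explicitly, as in the article.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO trades (amount) VALUES (?)", [(100,), (250,), (75,)])


def count_rows(connection):
    """Return the number of rows currently visible in the trades table."""
    return connection.execute("SELECT COUNT(*) FROM trades").fetchone()[0]


rows_before = count_rows(conn)

conn.execute("BEGIN")
# The late-night mistake: a DELETE with the WHERE clause forgotten.
deleted = conn.execute("DELETE FROM trades").rowcount
rows_during = count_rows(conn)
open_before_end = conn.in_transaction  # True while the transaction is open

# We meant to remove exactly one row; anything else gets rolled back.
if deleted != 1:
    conn.execute("ROLLBACK")
else:
    conn.execute("COMMIT")

rows_after = count_rows(conn)
open_after_end = conn.in_transaction  # False once the transaction has ended
```

Checking the affected row count before deciding between commit and rollback is the habit worth copying; the final in_transaction check guards against the other failure described above, walking away with a transaction still open.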

Linux

Rivalling SQL’s where-less deletes for devastation potential is the Linux rm command with the -rf option, and it's just as easy to invoke:

rm -rf

Not sure what it does? rm is the Linux command for deleting files. If you're short of disk space, running it with these options will free up more than you need, so be very aware of the folder you run it from and the target you give it. The r option means recurse (i.e., descend through all directories and subdirectories, deleting each directory along with its contents); f means force (i.e., don't prompt before deleting each file, and don't complain if a file doesn't exist). Net result: given a target such as *, it will wipe out everything in your current folder and below. Run it against everything under root (/), and you'll wipe out a large part of your operating system along with all users and user files.

There are prominent examples of this actually happening. Pixar saw 90 percent of the data for “Toy Story 2” wiped out back in 1998, and only managed to recover the movie through some incredible effort. This isn’t a new problem, mind you; it was an issue for Unix users long before Linux existed, more than 30 years ago. There’s no way to undo an rm -rf, so double- or triple-check the folder you’re in before doing it. Better still, do it the slower way with rm -ir followed by the folder name, which prompts before each deletion.
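The "check before you delete" advice can also be built into a script. The following is a rough Python sketch, not a replacement for rm: safe_rmtree and the PROTECTED deny-list are hypothetical names made up for this example. The function defaults to a dry run that only lists what would be removed, and it refuses a few obviously dangerous targets.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical deny-list; extend it for your own environment.
PROTECTED = {Path("/").resolve(), Path.home().resolve()}


def safe_rmtree(target, dry_run=True):
    """Return the paths under target; delete them only when dry_run is False."""
    path = Path(target).resolve()
    if path in PROTECTED or path == Path.cwd().resolve():
        raise ValueError(f"refusing to delete protected path: {path}")
    if not path.is_dir():
        raise ValueError(f"not a directory: {path}")
    doomed = sorted(path.rglob("*"))  # everything below the target
    if not dry_run:
        shutil.rmtree(path)
    return doomed


# Demonstration on a throwaway directory.
scratch = Path(tempfile.mkdtemp())
(scratch / "logs").mkdir()
(scratch / "logs" / "old.log").write_text("stale")

preview = safe_rmtree(scratch)            # dry run: nothing is removed
still_there = scratch.exists()
safe_rmtree(scratch, dry_run=False)       # the real delete
gone = not scratch.exists()
```

Making the dry run the default means the destructive path has to be requested explicitly, which is the same idea as choosing rm -ir over rm -rf.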

Bad UI or Lack of Documentation?

A few months back, I helped a London taxi firm move its server overnight to a new location. We waited until 2:00 A.M., the quietest time in the company’s standard 24-hour cycle. It took three hours to perform the move, install everything, and get the system running. When the server restarted, the change of IP triggered an archiving run that lasted 45 minutes, during which no taxis could be dispatched. Even worse, once the archiving run completed, taxis couldn’t connect to the server for another 90 minutes. So that was more than two hours of unnecessary downtime.

We spent those 90 minutes fruitlessly wondering why the taxis couldn’t connect, and this was at 6:00 A.M., after working through the night. The port had been added to the port forwarding table in the router's web admin pages, but the tick box that enables port forwarding sat at the far right of the router's firewall page. On the 17-inch monitor used, that tick box was hidden off-screen, and nobody noticed the horizontal scrollbar at the bottom that would have scrolled it into view.

If you are doing complicated things (especially at unearthly hours), it’s much better to follow a script or checklist, which requires less thought (and creates less potential for error). If you spend time documenting the process in advance, preferably with illustrations or warnings about things that can go wrong, you can hopefully avoid significant time (and money) loss. To go back to the previous example, one screenshot of the router page showing the checkbox would have prevented our 90-minute delay.

Avoiding These Types of Errors

In unfamiliar situations, developers can end up operating well outside their comfort zone. That’s when it’s all too easy to mess up. The best way to avoid these kinds of problems is to keep developers doing what they do best: developing. In other words, keep them away from production; build a wall between support staff and developers. In theory, developers should never touch production systems; any necessary changes (for example, altering a database table, or editing a config file) should be carried out by support staff following properly documented procedures, including checklists and screenshots.

Also, testing in production should become a thing of the past. While tight deadlines sometimes make it seem like the only solution to an issue, it’s often a failure of management if developers and support staff find themselves in such a spot. If all else fails, and developers need to perform tasks that normally belong to support staff, make sure that everyone follows best practices, and that a manager understands and approves of any actions taken.