Managing, maintaining and protecting the network is a difficult job on the best of days—and despite a sysadmin’s Herculean efforts to keep order, proverbial cow patties still end up hitting the fan. With that in mind, there are a few events and outcomes that you can avoid with a little foreknowledge.
According to David Bishop, an infrastructure engineer at Apple, failing to test your backups is the number one worst-case scenario that can face any sysadmin. All too often, sysadmins will cover every other aspect of their job while leaving this crucial task undone. “’Backups aren’t actually backups until you’ve restored from them’ is a cliché for a reason,” he said.
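One way to act on Bishop's point is to automate a trial restore. The sketch below is illustrative rather than anything Bishop described: it assumes the backup is a tar archive whose member paths are relative to the source directory, and the function names `sha256_of` and `verify_backup` are hypothetical. It restores the archive into a scratch directory and compares checksums against the live files.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: Path, archive: Path) -> list:
    """Restore `archive` into a scratch directory and return the relative
    paths of source files that are missing from it or differ in content.

    Assumption: archive members are stored relative to `source_dir`.
    """
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        # The actual restore step: unpack the archive somewhere harmless.
        with tarfile.open(archive) as tar:
            tar.extractall(scratch)
        for original in sorted(source_dir.rglob("*")):
            if not original.is_file():
                continue
            relative = original.relative_to(source_dir)
            restored = Path(scratch) / relative
            if not restored.is_file() or sha256_of(restored) != sha256_of(original):
                problems.append(str(relative))
    return problems
```

An empty list means every file currently in the source directory came back intact. Files that changed after the backup ran will also be reported, so a check like this is most meaningful immediately after the backup job finishes.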
Nearly as bad: making a change without testing and walking away, especially on a Friday. “I took down a 30,000-person company with a single malformed firewall rule doing that,” Bishop recalled. “Call-center couldn’t take calls, cash registers couldn’t check out customers, corporate HQ phones went silent, and the website would process orders but couldn’t calculate tax or shipping, so every order that was processed had to be manually processed.”
According to Rob Freeborn, vice president of Managed Services for Optanix, a lack of configured monitoring generates a lot of errors and wastes time. “Admins and ops need to do appropriately deep monitoring,” he said.
Freeborn once worked with an organization that ran only ICMP checks (Internet Control Message Protocol, the protocol behind ping and network error messages) against the voice routers at its various branches. When he asked why, he was told that restricting ICMP checks to those routers prevented its sysadmins from fielding “too many alarms.” That was the worst possible answer they could have given him.
The same organization, he added, had massive issues with VoIP calls because of the sysadmins’ weak monitoring of switch logs. This carelessness resulted in dropped calls and the system nearly performing a DDoS attack on itself.
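To make the contrast with ping-only monitoring concrete, here is a minimal sketch of a check that goes one layer deeper by confirming that a service actually accepts TCP connections. It is not a description of Freeborn's tooling: the function names are hypothetical, and the hosts and ports passed in would be whatever a given environment uses. Real monitoring would also cover logs, interface counters, and call quality.

```python
import socket


def tcp_service_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(targets: dict) -> dict:
    """Map each target name to "ok" or "DOWN".

    `targets` maps a label to a (host, port) tuple; the labels are
    illustrative and carry no meaning to the check itself.
    """
    return {
        name: ("ok" if tcp_service_up(host, port) else "DOWN")
        for name, (host, port) in targets.items()
    }
```

A router can answer ping while the service behind it is dead, which is why a port-level check like this catches failures that an ICMP check alone would miss.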
Sysadmins may spend a lot of time stressing the importance of security, but that doesn’t prevent employers from failing to pay attention to critical issues. Threats evolve when companies won’t devote the proper resources to ensuring that infrastructure vulnerabilities are eliminated as soon as possible.
Those vulnerabilities aren’t limited to software or the cloud. Tony Laurenzana, a consultant based in Bloomington, Indiana, had a client that lost nearly all their hardware in a fire. Although Laurenzana rescued the servers and virtualized them onto new hardware, it was only a temporary solution that allowed remote sites to access the data. Remote sites could remote-desktop directly into the client’s file server. “In a perfect world,” he stressed, “that is a huge no-no.”
His client made the situation worse by refusing to upgrade: “They beat around the bush on the cost of a new VPN and infrastructure and never got around to it. I was cleaning up after attacks to their Active Directory on a monthly basis.”
Even the most skilled sysadmin may have to contend with an employer who avoids best practices. Another Laurenzana client made a single employee responsible for a weekly rotation of the company’s external hard drives; they would take one drive home after replacing it with the other. In any normal backup scenario, no one would ever permit such a dicey arrangement, but Laurenzana had to defer to the client. The practice only changed when the employee (unbeknownst to his bosses) stopped taking the drives home, and a major incident resulted in lost data that couldn’t be replaced from incomplete backups.
Sharing tribal knowledge is essential to best practices as well. Failing to document the IT environment for the team at large is a potentially critical error. “We get paid to leave the knowledge with the company we work for,” said Jeremy Myers, a sysadmin team leader at Tech-Pointe in Austin, Texas. “If someone is out sick or dies, that shouldn’t mean the company goes out of business.”
Another point: freaking out when users ruin your hard work is counterproductive. Matt Simmons, a Linux sysadmin and author of the Standalone Sysadmin blog, recommends never punishing anyone for giving you bad news.
Simmons once got a call from an operations tech asking how often they backed up the primary database. Simmons gave him the information and asked why he wanted to know. The tech admitted that a developer had deleted tens of thousands of rows from the production database. Simmons knew the developer shouldn’t have had a login in the first place, but the immediate concern was the missing data.
Because Simmons had been called as soon as the blunder was discovered, he was able to save the data without resorting to a full restore. Like everyone else at the company, the operations tech knew that Simmons wouldn’t explode in response to a mistake. A good attitude can save time, money, and heartache. “I might have felt like [exploding],” Simmons said, “but the only way to build trust is to realize that you’re on the same team. If they try to hide mistakes, you won’t find out until the worst time.”