Lessons From My Toughest Software Bugs

by David Bolton, Aug 3, 2015

Most programmers experience some tough bugs in their careers, but only occasionally do they encounter something truly memorable. In my life, three bugs have stood out. I describe them here not only for your entertainment, but also to show how I worked through the respective problems to arrive at a solution.

In 1984, I dealt with a game written in 6502 assembly that kept crashing. In those days, there was no debugger; it took two days’ worth of scanning code printouts to discover that:

JSR CLS

should have been:

JSR CLR

That one was a bit silly and was thankfully caught during development. The lesson is, sometimes the bug is glaringly obvious—and if you nail it before the software’s release, nobody’s the wiser. But my next memorable bug occurred in production.

Potentially Expensive Mistake

This next bug messed up the figures for a day’s worth of oil trading by $800 million, setting off alarm bells throughout the organization for which I worked. The technical cause was an exception raised just before a division that calculated the currency rate from Japanese yen to U.S. dollars. Instead of dividing by the rate, the program ended up multiplying by it, a huge mistake.

The conversion code itself was correct. The exception happened because a new financial instrument being traded had a zero value for “number of days,” and nobody had told us. Until then, no traded instrument had ever had a zero-days length, and the original programmers hadn’t included checks for zero where the division took place.

The crash could only be detected on the live system, and it took three of us working late for a month to catch the bug. Once it was found, I added checks everywhere to ensure the division was always carried out.

The lesson here: If a bug occurs in the system, the root cause might lie with whoever originally built the code, and you’ll end up spending a lot of time trying to figure out what exactly went wrong.
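The kind of zero check described above can be sketched roughly as follows. This is a minimal illustration in Python, not the original trading code: the function amount_per_day, the ZeroDaysError type, and the instrument_id parameter are all hypothetical names, and rejecting the record outright is only one possible policy for a zero divisor.

```python
class ZeroDaysError(ValueError):
    """Hypothetical error type: an instrument arrived with zero days."""


def amount_per_day(amount: float, number_of_days: int, instrument_id: str) -> float:
    """Divide an amount across number_of_days, rejecting a zero divisor up front."""
    if number_of_days == 0:
        # Fail loudly and name the offending record, rather than letting a bare
        # division error interrupt the middle of a larger calculation.
        raise ZeroDaysError(
            f"instrument {instrument_id!r} has number_of_days == 0"
        )
    return amount / number_of_days


# Usage with made-up values: a normal instrument, then a zero-days one.
normal = amount_per_day(100.0, 4, "EXAMPLE-1")

try:
    amount_per_day(100.0, 0, "EXAMPLE-2")
    rejected = False
except ZeroDaysError:
    rejected = True
```

Checking at the point of division keeps the error message close to the bad input, which matters when the failure can only be reproduced on the live system.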

Tougher Still

I encountered my toughest bug in 1998, on a research project to forecast value at risk (VaR) for an exchange using futures and options. The software was written in Delphi 3 (i.e., Pascal) and used MS Access to store data. One day the platform began experiencing an odd crash. Release or Debug made no difference. On the other hand, we could go days between crashes, so it wasn’t the worst emergency.

The odd part was the message, and where the crash kept occurring. The error message itself was just a number, and the program crashed at either a trunc() statement or Open (for opening an MS Access table in a SQL query). Adding exception handling made no difference; it still blew up at one of those two places. I tried lots of things to fix the issue, including making the stack bigger, but no joy. It was really tough because there was no workaround for it. The software was doing a lot of calculations, back when top-of-the-range machines had CPUs clocked at 200 MHz and just 256 MB of RAM. Thankfully the other project members (two professors and a quant) proved understanding.

Nearly two months passed before I found the cause: One of the professors, wanting to keep very tight control over the econometrics algorithms he had devised, had learned to program and provided his code as a DLL. He knew just enough Delphi to be dangerous. My code was using his DLL and passing large arrays by reference. These arrays were 5,000 x 10 doubles. I’d written code to dump the array to a CSV file after the call and, when browsing through this file, came across a +INF amongst the float values. The smoking gun was revealed.

It turned out the professor had a division-by-zero bug in his code, but it didn’t crash there because our MS Access drivers had disabled exception handling in floating point arithmetic. Instead of an exception, the value +INF got into the program’s data and eventually surfaced as a crash. These crashes happened quite a distance from the DLL procedure call, which is why the origin wasn’t so obvious.
Here’s the lesson: Keep track of who does what to the code, and make sure that nobody’s altering major things within the software without telling the rest of the team.
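To illustrate why the crash surfaced so far from its cause, here is a small sketch in Python rather than the original Delphi. Python raises an error on float division by zero, so the hypothetical helper masked_divide emulates what IEEE 754 division returns when the divide-by-zero exception is masked; the values and the later arithmetic are made up for the example.

```python
import math


def masked_divide(x: float, y: float) -> float:
    """Emulate IEEE 754 division with the divide-by-zero exception masked."""
    if y == 0.0:
        if x == 0.0 or math.isnan(x):
            return math.nan  # 0/0 yields NaN
        # A nonzero value divided by zero yields a signed infinity.
        return math.copysign(math.inf, x) * math.copysign(1.0, y)
    return x / y


# Step 1: the faulty calculation quietly produces +INF for the zero divisor.
ratios = [masked_divide(1.0, d) for d in (4.0, 0.0, 2.0)]

# Step 2: ordinary arithmetic passes the infinity along without complaint.
scaled = [r * 100.0 + 1.0 for r in ratios]

# Step 3: much later, converting to an integer finally fails.
failure = None
try:
    truncated = [math.trunc(v) for v in scaled]
except OverflowError as exc:
    failure = exc
```

The error appears at the trunc step, even though the bad value was created two steps earlier, which mirrors how the crash showed up at trunc() instead of inside the DLL.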

More on Fixing the Bug

The IEEE 754 floating-point standard is the one that all recent CPUs (i.e., those of the last 30 years) follow when implementing floating point. It reserves special bit patterns to represent +/- infinity (INF) and Not a Number (NaN). To learn more about floating point, check out “What Every Computer Scientist Should Know About Floating-Point Arithmetic.”

To deal with the professor’s code, I wrote a procedure to scan any array for all four special values (+INF, -INF, +NaN and -NaN) and called it for each of the arrays returned from the DLL. It found quite a few more INFs, and I used those findings in the fix. The time saved not chasing weird bugs more than compensated for the marginally longer processing time.
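A scan of this kind might look roughly like the sketch below. It is written in Python for illustration, not taken from the original Delphi procedure, and the names classify_special and find_special_values are hypothetical. It inspects the 64-bit pattern of each double: an exponent field of all ones marks a special value, a zero mantissa means infinity, and any nonzero mantissa means NaN.

```python
import math
import struct
from typing import Iterable, List, Optional, Tuple


def classify_special(value: float) -> Optional[str]:
    """Return '+INF', '-INF', '+NaN' or '-NaN' for special doubles, else None."""
    # Reinterpret the double as a 64-bit unsigned integer (IEEE 754 binary64).
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent != 0x7FF:
        return None  # an ordinary finite number
    sign = "-" if bits >> 63 else "+"
    return sign + ("INF" if mantissa == 0 else "NaN")


def find_special_values(
    rows: Iterable[Iterable[float]],
) -> List[Tuple[int, int, str]]:
    """Return (row, column, kind) for every INF or NaN in a 2-D array."""
    hits = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            kind = classify_special(value)
            if kind is not None:
                hits.append((r, c, kind))
    return hits


# Usage with made-up data standing in for an array returned from the DLL.
sample = [[1.0, 2.5], [math.inf, 3.0], [4.0, math.nan]]
hits = find_special_values(sample)
```

In everyday Python code, math.isfinite performs the same check in a single call; the bit-level version is shown only because it follows the bit patterns described above.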

Conclusion

The disabled exception handling made the professor’s bug exceptionally hard to find. Unless out-and-out performance is vital, checking inputs is always a good idea. As for the divide-by-zero bug, that’s a funny one. You could argue that every division should check for zero, but when the divisor is a parameter that has never been zero, nobody thinks to check. While you can’t avoid bugs, careful programming and a solid QA testing system can help eliminate many of the worst. Stay vigilant.