Diagnosibility: How to Avoid Cursing When You Hit 'Run'
Starting a new project is exciting and writing code is fun, but getting the system to work, or figuring out why it doesn't, is also a big part of the job. When you're working with software that's large enough to have dependencies, which means basically any real system, getting it running for the first time goes something like this:
  • Set up environment: IDE, source code client, language libraries, package manager.
  • Check out source code.
  • Hit the "compile and run" or "run" button. Watch errors spew through logs. Curse.
  • Install appropriate versions of required libraries.
  • Hit "run." Curse.
  • Look at first error in logs. Go ask nearest coworker what "/my/special/config" is. Get magic file.
  • Hit "run." Curse.
  • Look at first error in logs. Go ask nearest coworker. Hear, "Gosh, I don't remember."
  • Diagnose problem.
  • Repeat.
The trick to writing diagnosable code seems pretty simple:
  • There should be enough logging to see where it failed.
  • Errors should be unambiguous. Things like "a problem has occurred" or "bad HTTP response" are not helpful.
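As a small illustration of the second point, here is a minimal Python sketch of an error that says what is missing and where the code looked. The `load_config` helper and the `ConfigError` class are hypothetical names invented for this example; the scenario echoes the "/my/special/config" file from the list above.

```python
import logging
from pathlib import Path

logger = logging.getLogger("startup")


class ConfigError(Exception):
    """Hypothetical error type: a required config file cannot be used."""


def load_config(path: str) -> str:
    """Read a required config file, failing with a message that says what to fix."""
    config_path = Path(path)
    # Log the step first, so the logs show how far startup got.
    logger.info("Loading config from %s", config_path)
    if not config_path.is_file():
        # Name the exact path instead of reporting "a problem has occurred".
        raise ConfigError(
            f"Required config file not found: {config_path}. "
            "Create it or pass a different path to load_config()."
        )
    return config_path.read_text(encoding="utf-8")
```

With a message like this, the next person to hit "run" can see which file is expected without having to ask a coworker.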
Consider this pseudocode (sanitized a bit, of course):
It looks like an easy method. It gets a list of sessions from a remote system and then pulls data out of the local system for each session. But how does it stand up on diagnosability? Let's say we get a call from the support rep: "Hi. We're not getting any session data, and I'm not sure why. Can you take a look?" The big questions become:
  • Are there any sessions?
  • Could we talk to the other system or is there a connection error being swallowed?
  • Assuming the remote system has sessions, are we not finding them in the local system or not getting data from them?
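The original listing is not reproduced here, so the following Python sketch is only a guess at the kind of method being described. `get_session_data`, `fetch_remote_sessions`, and `local_store` are hypothetical names. It shows why the three questions above cannot be told apart from the outside: each failure produces the same empty result.

```python
from typing import Callable, Dict, List


def get_session_data(
    fetch_remote_sessions: Callable[[], List[str]],
    local_store: Dict[str, dict],
) -> List[dict]:
    """Hypothetical stand-in for the method under discussion: no logging, errors swallowed."""
    try:
        session_ids = fetch_remote_sessions()
    except Exception:
        # A connection failure is silently turned into "no sessions".
        session_ids = []
    results = []
    for session_id in session_ids:
        data = local_store.get(session_id)
        if data:  # Missing sessions and empty sessions are both skipped without a trace.
            results.append(data)
    return results


def unreachable() -> List[str]:
    """Simulates a remote system that cannot be reached."""
    raise ConnectionError("remote system unreachable")


local = {"s1": {"clicks": 3}}

# Three different problems, one indistinguishable outcome:
no_sessions = get_session_data(lambda: [], local)        # remote has no sessions
swallowed_error = get_session_data(unreachable, local)   # connection error hidden
not_in_local = get_session_data(lambda: ["s2"], local)   # session unknown locally
```

All three calls return an empty list, which is exactly what the support rep is seeing.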
Now we'll have to walk through each of those steps and confirm that what's happening is what we expect. We'll have to go to the remote system and look for sessions. Assuming we find some, we'll have to go to the local system and see whether data exists for those sessions. If that works, we'll start looking at connectivity. Take a deep breath and order lunch in. This may take a while.
Imagine how much easier our job would be if this code's author had considered diagnosability and added logging, like this:
Or error handling, like this:
By using this approach, we've gone a long way toward making these problems easier to identify and diagnose. So in the future, let's give ourselves an extra hour to work it through. A little more elbow grease in the short run will save us valuable time in the long run!
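The logging and error-handling listings referred to above are not reproduced here. As a rough Python sketch of the idea, and not the author's code, the version below logs each step and raises a specific error instead of quietly returning an empty list. `SessionLookupError`, `get_session_data`, `fetch_remote_sessions`, and `local_store` are hypothetical names.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("sessions")


class SessionLookupError(Exception):
    """Hypothetical error type: the remote session list could not be fetched."""


def get_session_data(
    fetch_remote_sessions: Callable[[], List[str]],
    local_store: Dict[str, dict],
) -> List[dict]:
    """Collect local data for each remote session, logging every decision point."""
    try:
        session_ids = fetch_remote_sessions()
    except ConnectionError as exc:
        # Surface the connection failure instead of swallowing it.
        raise SessionLookupError(
            f"Could not fetch sessions from remote system: {exc}"
        ) from exc
    logger.info("Remote system returned %d session(s)", len(session_ids))

    results = []
    for session_id in session_ids:
        if session_id not in local_store:
            logger.warning("Session %s not found in local system", session_id)
            continue
        data = local_store[session_id]
        if not data:
            logger.warning("Session %s found locally but has no data", session_id)
            continue
        results.append(data)
    logger.info(
        "Returning data for %d of %d session(s)", len(results), len(session_ids)
    )
    return results
```

With this in place, each of the three questions has a direct answer: a connection problem raises `SessionLookupError`, an empty remote list shows up as "returned 0 session(s)", and local lookup problems are logged per session.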