Short Stories

// Tales from software development

Ken's First Law of Problem Diagnosis


A long time ago, in a former career in mainframe system software development and support, I worked in a team alongside Mark, led by Ken. The three of us spent most of our time diagnosing customer issues with the company’s products.

Typically, we’d have to work our way through several hundred pages of hardcopy printout of a mainframe core dump to diagnose the cause of a system crash. We’d start by looking at the CPU registers to determine where the failure occurred and what was going on at the time. Then we’d examine the OS’s and our product’s control blocks and work our way backwards through the code to see how the failure had occurred. Sometimes we were lucky: we recognised a problem that we’d seen previously, could short-circuit much of the diagnosis, and just had to confirm that the failure was the same. All too often, however, it was a process that could take anywhere from a few hours to several weeks.

Ken had been doing this longer than Mark and I had, and he had a number of observations about it that we called “Ken’s Laws”. The first law was that the length of time it takes to diagnose a problem is inversely proportional to the size of the patch code required to fix it.

At first this seems like a very odd observation but it reflects the fact that the problems that take a long time to diagnose are usually very subtle and are often resolved by a very small change in the code.

The most dramatic example of this was a problem that I spent two weeks working on. I came close to giving up on it a few times because there’s no guarantee that the answer will be found in the core dump. But I kept at it with help from Ken and Mark, and finally the cause of the problem began to appear.

It came down to a single branch instruction in a piece of code that recognised physical disk devices. The developer who wrote the code had been aware that IBM was introducing a new device and had included support for it. Unfortunately, the preliminary documentation did not correctly reflect the device identifier codes used for all versions of the new disk type. The branch instruction that passed control to the code for the new disk type had been coded as an assembler BNEH (branch not higher or equal) instruction when it should have been a BNH (branch not higher). The assembler generates the same machine code branch instruction for both mnemonics, with a 4-bit mask defining the branch condition. The difference between the two instructions was a single bit in the mask.

So, after about 70-80 hours of work on this problem, the cause was identified as a single incorrect bit. Patches work by replacing bytes so, although the problem was one bit in a 4-bit mask, the smallest patch I could write was one byte long.
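To make the "one bit, one byte" point concrete, here is a small Python sketch. It assumes the System/370 BC (branch on condition) layout: opcode 0x47, then a byte holding the 4-bit mask and the index register, then the base register and a 12-bit displacement. BNH is the extended mnemonic for BC 13. The faulty mask value (0b0101), the register numbers and the displacement are hypothetical values chosen only for illustration; the post does not state them.

```python
BC_OPCODE = 0x47  # System/370 "branch on condition", RX format


def encode_bc(mask, index, base, disp):
    """Encode BC as 4 bytes: opcode, mask|index, base|disp high bits, disp low byte."""
    if not (0 <= mask <= 0xF and 0 <= index <= 0xF
            and 0 <= base <= 0xF and 0 <= disp <= 0xFFF):
        raise ValueError("field out of range")
    return bytes([BC_OPCODE,
                  (mask << 4) | index,
                  (base << 4) | (disp >> 8),
                  disp & 0xFF])


def differing_bits(a, b):
    """Count the bits that differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def byte_patch(old, new):
    """List (offset, old_byte, new_byte) for every byte that has to be replaced."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(old, new)) if x != y]


MASK_BNH = 0b1101          # BC 13: branch not high
MASK_ASSUMED_BAD = 0b0101  # hypothetical faulty mask, one bit away from BNH

# Base register 12 and displacement 0x100 are arbitrary illustrative values.
bad = encode_bc(MASK_ASSUMED_BAD, 0, 12, 0x100)
good = encode_bc(MASK_BNH, 0, 12, 0x100)

print(differing_bits(bad, good))  # number of wrong bits
print(byte_patch(bad, good))      # bytes the patch must replace
```

Under these assumptions the two encodings differ in exactly one bit, yet the patch still has to replace the whole second byte of the instruction, because the mask shares that byte with the index register field.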

A corollary of this example came a few months later. A system dump arrived and within 10 minutes I’d identified that the failure was due to a new variant of a disk that our software did not support. The solution was a patch that added support for the new disk but, as this required information about the geometry of the device, the patch ran to several hundred bytes.

Few developers have to spend time with system dumps these days but I think Ken’s First Law still holds – the most subtle and pernicious issues are the ones that take the longest to identify.


Written by Sea Monkey

April 18, 2008 at 4:17 pm

Posted in Debugging

