Short Stories

// Tales from software development

Archive for May 2015

Bug diagnosis by abduction

Deduction is a term that’s often used when induction is actually meant, and in the diagnosis of software bugs it’s often abduction, i.e. abductive reasoning, that’s being applied anyway.

It’s generally acknowledged that innate problem-solving ability varies, and there’s a long-standing debate as to the effectiveness of teaching problem-solving skills, but it’s clear that experience is always very useful in improving diagnostic skills. After 30 years in IT I think I can say I have a useful amount of experience, and a couple of incidents over the past week made me realise just how useful…

Vitality has a program that takes an input data file and applies the data to its database. The program has been optimised for performance in several respects. Sometimes it’s used at frequent intervals to process small numbers of files (e.g. a regular batch update process), while at other times it’s used to perform ad hoc processing of large numbers of files (e.g. data migration). So, its file I/O is optimised and it caches a lot of schema information about the database that it’s connected to.
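
The post doesn’t describe how that caching works, but the general idea is simple enough to sketch. Below is my own minimal Python illustration, using SQLite purely as a stand-in database: read the catalogue once at start-up and keep it in memory so that per-row processing never has to interrogate the data dictionary again.

    import sqlite3

    def build_schema_cache(conn):
        """Read table and column metadata once and keep it in a dict."""
        cache = {}
        for (table,) in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"):
            columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
            cache[table] = columns
        return cache

    conn = sqlite3.connect("example.db")      # stand-in database file
    schema_cache = build_schema_cache(conn)   # built once, reused for every input file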

I was using the program for a new project that I was working on to load the details of 160,000 personnel and 12,500 organisations into the database. The process took seven minutes, which seemed very acceptable, but the real test was not the initial load but the subsequent re-processing of the input file as refresh data, because this process would be performed every few months when updated personnel and organisation data became available.

So, I ran the program again using the same data file, but after it had been running for 20 minutes it became clear that it was performing more slowly than I had anticipated. I’d enabled a diagnostic option on the program that caused it to log SQL database statements to a file, and I could see from this that it was only a small way into the update process. I left it running and went to lunch.

It was still running when I returned and when I checked the SQL log it was still only about a third of the way through the processing. I left it to complete and got on with other work. It eventually finished processing four hours and forty minutes after starting. This was very odd as it wasn’t actually making any changes to the database because the data files were exactly the same as the files used to load the data. So, how could it take seven minutes to load the data and four hours and forty minutes to not make any changes to it?

I checked the version of the program and realised that I was using a version that was at least eighteen months old. If I was going to raise a problem report with the support team about the program’s update performance then the first thing they’d ask me to do would be to re-run it with the current production release version, so I might as well do that now and pre-empt their suggestion.

Unfortunately, when I tried running the new version it failed almost immediately with “ExecuteSQL() -> Err 103” and exited. So, I contacted the support team and started discussing the problem with Jim.

Jim was sure that the problem was due to the program encountering an error when interrogating the database schema when it built its schema information cache at the start of execution and we went through several steps of trying to identify how far it had got through this process by examining the activity logs on the database server.

However, this is where ‘experience’ kicked in… That 103 error looked suspiciously like a DOS/Windows file error. So, I checked what a 103 file error is: it’s “File not open”, which usually means the program is trying to read, write, or close a file that isn’t open. What file, and why weren’t other people getting the same error? Well, what was I doing that most other users of the program wouldn’t be doing? Using the option to log the SQL statements to a file!

A little bit of inductive reasoning would suggest that the problem is caused by the program not opening the log file before trying to write to it, and that this is a bug introduced at some point in the past eighteen months, because this option worked on the older version of the program that I’d been using. Plus, most other users wouldn’t be using the logging option that I was. This is arguably now abductive reasoning, as it provides the simplest explanation for why the error is occurring when it didn’t previously and why I was getting the error but no one else had reported it.

Jim didn’t think this could be the explanation because the logging code hadn’t been changed in years, and so we continued to look at the possibility that the error was occurring as part of the program initialisation. At this point I said to Jim, “Even if the logging code hasn’t changed, it’s possible that the calls to it have been, isn’t it?”, but I let it go as he still didn’t think this was the problem.

It was only later that day that I did what I should have done earlier to prove, or disprove, the log file theory: disable the option. This wasn’t quite as straightforward as it sounds because it’s a registry option that developers like me tend to set and then leave alone. So, it hadn’t immediately occurred to me to disable it, and when it did I spent a while finding the exact setting as there were several with names suggesting they might be the ones responsible for SQL logging.
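
For what it’s worth, toggling that kind of per-developer diagnostic switch is just a matter of flipping a registry value. Here’s a minimal Python sketch using the standard winreg module; the key path and value name are invented for illustration, as the product’s real setting isn’t named here.

    import winreg

    # Hypothetical key path and value name - placeholders for illustration only.
    KEY_PATH = r"Software\ExampleVendor\Loader\Diagnostics"
    VALUE_NAME = "LogSQLStatements"

    def set_sql_logging(enabled):
        """Create/open the key under HKEY_CURRENT_USER and set a DWORD flag."""
        with winreg.CreateKey(winreg.HKEY_CURRENT_USER, KEY_PATH) as key:
            winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, int(enabled))

    def get_sql_logging():
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, KEY_PATH) as key:
            value, _type = winreg.QueryValueEx(key, VALUE_NAME)
            return bool(value)

    set_sql_logging(False)     # disable the diagnostic option
    print(get_sql_logging())   # -> False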

Once I’d disabled the option the program stopped failing and went into the main processing phase. So I called Jim and gave him the good news – it was the SQL logging that was causing the “ExecuteSQL(): Err 103” error.

The following day he called me to say that after a careful examination of the source code he could now see that while the logging code hadn’t been changed in years, some changes in the initialisation code meant that the log file wasn’t open when the program tried to log the first SQL statement for the processing. This was exactly what I’d suggested the previous day. I resisted the temptation to say, “I told you so”.
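
Jim’s explanation is easy to picture. Here’s a small Python sketch of my own (not the actual code, whose language I don’t know) showing how reshuffled initialisation can leave a logger’s file unopened when the first statement is written, even though the logging code itself is untouched:

    class SqlLogger:
        """Toy logger: writes each SQL statement to a file once open_log() has run."""

        def __init__(self, path):
            self._path = path
            self._file = None                 # not open yet

        def open_log(self):
            self._file = open(self._path, "a")

        def log(self, statement):
            if self._file is None:            # the "file not open" condition
                raise RuntimeError("Err 103: log file not open")
            self._file.write(statement + "\n")

    logger = SqlLogger("sql.log")

    # Original ordering: open the log during initialisation, then start issuing SQL.
    #   logger.open_log(); logger.log("SELECT ...")
    #
    # Reordered initialisation: the first statement is logged before open_log() runs,
    # so the very first write fails even though log() itself hasn't changed in years.
    try:
        logger.log("SELECT * FROM person")
    except RuntimeError as error:
        print(error)                          # -> Err 103: log file not open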

The performance problem was still an issue and Jim couldn’t believe what I was telling him – that processing the same data that had taken only seven minutes to load took forty times longer despite not resulting in any database inserts or updates. He suggested that the cause might be that the database processed a SQL UPDATE statement more slowly than an INSERT. This is generally true because an UPDATE potentially affects many rows whereas an INSERT only relates to a single row, so the database server must do more processing for an UPDATE, and the particular database server that we were using is known to have slow UPDATE processing. But not this slow, and the speed of UPDATE processing was irrelevant anyway because no UPDATEs were being issued! The program correctly identified that the data being processed was the same as the data already in the database and so was not issuing any UPDATE statements.

Jim decided that I must be doing something very unusual and asked that I package everything up and send it to him – the data files, the program configuration, and a backup of the database that I was using.

Again, ‘experience’ suggested to me that the problem was something in the processing that identified that the data already existed in the database and did not need to be inserted, and then that no update was required either. My suspicion was that the program was handling this scenario inefficiently, e.g. by performing an inefficient cache or hash table operation. If so, then I’d expect to see a fairly constant, high CPU utilisation. I started the program again and after a few minutes checked the Windows Task Manager. The Processes tab showed the program running at a constant 25%. After watching it for a few seconds and realising that it was rock solid at 25%, I realised what that really meant – it was driving one of the cores on my quad core Xeon processor at 100%, and switching to the Performance tab showed exactly that.
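
A quick back-of-the-envelope check of that reading (assuming four logical cores and ignoring hyper-threading): the Processes tab reports a process’s share of all cores, so one saturated core on a quad core machine shows up as 25%.

    # One single-threaded, CPU-bound process on a quad core machine.
    logical_cores = 4        # quad core Xeon, hyper-threading ignored for simplicity
    saturated_cores = 1      # one thread spinning flat out
    print(f"{saturated_cores / logical_cores:.0%}")   # -> 25%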

I called Jim again and suggested that the problem was in the code that executed after the program determined that an INSERT was not required and up to, or perhaps after, the code that then determined that an UPDATE wasn’t required either. The code would be doing something CPU intensive such as a cache or hash table search.

Jim said he had some good news. He’d replicated the problem using just the data files I’d sent him so it wasn’t anything to do with the configuration or database that I was using and he’d narrowed the problem down to something that the update processing was doing but hadn’t determined what yet.

A few hours later he called back to say that he’d found the problem – it was the code responsible for formatting the UPDATE statement. The format method being used was notoriously slow but he hadn’t realised just how slow until now.

This didn’t seem quite right, as you’d hope that the UPDATE statement wasn’t being constructed until the program, which is heavily optimised, had decided that it definitely needed to execute the UPDATE against the database. Also, would the format method really be so inefficient as to make a process that ought to run in around seven minutes take four hours and forty minutes?

There wasn’t any more that I could do at this point as I didn’t have access to the source, I wasn’t familiar with it anyway, and I had plenty of other work to get on with.

When Jim called a couple of days later to say that he’d fixed the problem, it turned out that it was nothing to do with the inefficient format method and, guess what, it was to do with an inefficient hash table implementation. There’s a hash table used to keep track of column values that have been changed by the program. It’s only searched during update processing, and the inefficient implementation is of relatively little consequence when the program is only processing a few hundred column values, as is typical. However, in this case there were nearly half a million column values being processed and this caused a massive performance problem with the hash table.
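
As an illustration of how that can happen (a toy sketch of my own, not the vendor’s implementation): a hash table that never grows beyond a handful of buckets is perfectly adequate for a few hundred keys but degenerates into long linear scans at half a million.

    # Toy hash table with a fixed, never-resized bucket count.
    class NaiveHashTable:
        def __init__(self, buckets=16):
            self._buckets = [[] for _ in range(buckets)]

        def put(self, key, value):
            chain = self._buckets[hash(key) % len(self._buckets)]
            for i, (k, _) in enumerate(chain):      # linear scan for an existing key
                if k == key:
                    chain[i] = (key, value)
                    return
            chain.append((key, value))

        def get(self, key):
            chain = self._buckets[hash(key) % len(self._buckets)]
            for k, v in chain:
                if k == key:
                    return v
            return None

    # A few hundred column values: each chain holds a dozen or so entries, so
    # lookups are cheap. Nearly 500,000 column values: each chain holds roughly
    # 30,000 entries, every put/get walks tens of thousands of tuples, and the
    # whole run becomes CPU bound - which matches one core pinned at 100%.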

Again, I didn’t say “I told you so”, but it did make me realise how reasonably good diagnostic skills and a lot of experience can help in suggesting a cause based on abductive reasoning. Maybe I just got lucky, but both times, without any sight of the code, I made a fairly specific suggestion about the cause that turned out to be accurate.

Written by Sea Monkey

May 5, 2015 at 11:45 am

Posted in Debugging

Windows 7 DisplayPort Link Failure: It might be the cable

The day after I switched from a display configuration of three 24″ Dell U2412M monitors to three 30″ Dell U3014 monitors on my Dell T5600, I was disappointed to see one monitor suddenly go blank and Windows 7 display this error message:

The system has detected a link failure and cannot set the requested resolution and refresh rate on your DisplayPort display. Your display might not support the requested resolution or there may be an issue with the cable connecting the display to your computer.

I’d checked and confirmed that the T5600’s AMD FirePro V5900 graphics card could handle three monitors at 2560 x 1600 resolution, and I had used the new display setup for most of the previous day without any problems. So what was the problem? A defective monitor, perhaps?

I tried a few things: unplugging and re-seating the plugs on both ends of the DisplayPort cable, powering the monitor off and on, and using the Detect function from the Control Panel Display Resolution applet. All to no avail. After a bit more fiddling around unplugging and replugging the cable, the monitor came back to life, but Windows decided that its maximum resolution was 1920 x 1200 rather than 2560 x 1600 and wouldn’t allow the correct resolution to be set.

When I’d re-seated the DisplayPort cable plugs I noticed that the cable was different from the other DisplayPort cable (the FirePro V5900 has two DisplayPort sockets and one DVI-D socket). The cable looked a bit, well, cheap. Both DisplayPort cables were the ones that had shipped with the monitors but they were definitely different. And, the error message did specifically say, “there may be an issue with the cable connecting the display to your computer”.

I went and retrieved one of the DisplayPort cables that I’d been using previously with the U2412M monitors, which I’d had to buy because the U2412M shipped with only VGA and DVI-D cables. I’d paid a premium price for these DisplayPort cables, so hopefully they were good quality.

After changing the cable the monitor immediately came back to life and showed up in Windows with the correct maximum resolution of 2560 x 1600.

Written by Sea Monkey

May 3, 2015 at 8:00 pm

Posted in Hardware
