Short Stories

// Tales from software development

Posts Tagged ‘Debugging’

Bug diagnosis by abduction


Deduction is a term that’s often used when induction is actually meant, and in the diagnosis of software bugs it’s often abduction, i.e. abductive reasoning, that’s being applied anyway.

It’s generally acknowledged that innate problem-solving ability varies, and there’s a long-standing debate about the effectiveness of teaching problem-solving skills, but it’s clear that experience is always very useful in improving diagnostic skills. After 30 years in IT I think I can say I have a useful amount of experience, and a couple of incidents over the past week made me realise just how useful…

Vitality has a program that takes an input data file and applies the data to its database. The program has been optimised for performance in several respects. Sometimes it’s used at frequent intervals to process small numbers of files (e.g. a regular batch update process), while at other times it’s used to perform ad hoc processing of large numbers of files (e.g. data migration). So, its file I/O is optimised and it caches a lot of schema information about the database that it’s connected to.

I was using the program for a new project that I was working on, loading the details of 160,000 personnel and 12,500 organisations into the database. The initial load took seven minutes, which seemed very acceptable, but the real test was not the initial load but the subsequent re-processing of the input file as refresh data, because this process would be performed every few months when updated personnel and organisation data became available.

So, I ran the program again using the same data file, but after it had been running for 20 minutes it became clear that it was performing more slowly than I had anticipated. I’d enabled a diagnostic option on the program that caused it to log SQL database statements to a file, and I could see from this that it was only a small way into the update process. I left it running and went to lunch.

It was still running when I returned, and when I checked the SQL log it was still only about a third of the way through the processing. I left it to complete and got on with other work. It eventually finished processing four hours and forty minutes after starting. This was very odd, as it wasn’t actually making any changes to the database: the data files were exactly the same as the files used to load the data. So, how could it take seven minutes to load the data and four hours and forty minutes to not make any changes to it?

I checked the version of the program and realised that I was using a version that was at least eighteen months old. If I was going to raise a problem report with the support team about the program’s update performance then the first thing they’d ask me to do would be to re-run it with the current production release version, so I might as well do that now and pre-empt their suggestion.

Unfortunately, when I tried running the new version it failed almost immediately with “ExecuteSQL() --> Err 103” and exited. So, I contacted the support team and started discussing the problem with Jim.

Jim was sure that the problem was due to the program encountering an error while interrogating the database schema to build its schema information cache at the start of execution, and we went through several steps of trying to identify how far it had got through this process by examining the activity logs on the database server.

However, this is where ‘experience’ kicked in… That 103 error looked suspiciously like a DOS/Windows file error. So, I checked what a 103 file error is: it’s “File not open”, which usually means the program is trying to read, write, or close a file that isn’t open. What file, and why weren’t other people getting the same error? Well, what was I doing that most other users of the program wouldn’t be doing? Using the option to log the SQL statements to a file!

A little bit of inductive reasoning would suggest that the problem is caused by the program not opening the log file before trying to write to it, and that this is a bug introduced at some point in the past eighteen months, because this option worked in the older version of the program that I’d been using. Plus, most other users wouldn’t be using the logging option that I was. This is arguably now abductive reasoning, as it provides the simplest explanation for why the error occurs when it didn’t previously, and why I was getting the error when no one else had reported it.

Jim didn’t think this could be the explanation because the logging code hadn’t been changed in years, and so we continued to look at the possibility that the error was occurring as part of the program initialisation. At this point I said to Jim, “Even if the logging code hasn’t changed, it’s possible that the calls to it have been, isn’t it?”, but I let it go as he still didn’t think this was the problem.

It was only later that day that I did what I should have done earlier to prove, or disprove, the log file theory: disable the option. This wasn’t quite as straightforward as it sounds, as it’s a registry option that developers like me tend to set and then leave alone. So, it hadn’t immediately occurred to me to disable it, and when it did I spent a while finding the exact setting, as there were several with names suggesting they might be responsible for SQL logging.

Once I’d disabled the option the program stopped failing and went into the main processing phase. So I called Jim and gave him the good news – it was the SQL logging that was causing the “ExecuteSQL(): Err 103” error.

The following day he called me to say that after a careful examination of the source code he could now see that while the logging code hadn’t been changed in years, some changes in the initialisation code meant that the log file wasn’t open when the program tried to log the first SQL statement for the processing. This was exactly what I’d suggested the previous day. I resisted the temptation to say, “I told you so”.

The performance problem was still an issue, and Jim couldn’t believe what I was telling him – that re-processing the same data that had taken only seven minutes to load took forty times longer despite not resulting in any database inserts or updates. He suggested that the cause might be that the database processes a SQL UPDATE statement more slowly than an INSERT. This is generally true, because an UPDATE potentially affects many rows whereas an INSERT relates to a single row, so the database server must do more processing for an UPDATE, and the particular database server that we were using is known to have slow UPDATE processing. But not this slow, and the speed of UPDATE processing was irrelevant anyway because no UPDATEs were being issued! The program correctly identified that the data being processed was the same as the data in the database and was not issuing any UPDATE statements.

Jim decided that I must be doing something very unusual and asked that I package everything up and send it to him – the data files, the program configuration, and a backup of the database that I was using.

Again, ‘experience’ suggested to me that the problem was something in the processing that identified that the data already existed in the database and did not need to be inserted, and then that no update was required either. My suspicion was that the program was processing inefficiently in this scenario, e.g. by performing an inefficient cache or hash table operation. If so, I’d expect to see a fairly constant, high CPU utilisation. I started the program again and after a few minutes checked the Windows Task Manager. The Processes tab showed the program running at a constant 25%. After watching it for a few seconds and realising that it was rock solid at 25%, I realised what that really meant – it was driving one of the cores on my quad-core Xeon processor at 100%, and switching to the Performance tab showed exactly that.

I called Jim again and suggested that the problem was in the code that executed after the program determined that an INSERT was not required and up to, or perhaps after, the code that then determined that an UPDATE wasn’t required either. The code would be doing something CPU intensive such as a cache or hash table search.

Jim said he had some good news. He’d replicated the problem using just the data files I’d sent him so it wasn’t anything to do with the configuration or database that I was using and he’d narrowed the problem down to something that the update processing was doing but hadn’t determined what yet.

A few hours later he called back to say that he’d found the problem – it was the code responsible for formatting the UPDATE statement. The format method being used was notoriously slow but he hadn’t realised just how slow until now.

This didn’t seem quite right, as you’d hope that the UPDATE statement was not being constructed until the program, which is heavily optimised, had decided that it definitely needed to execute the UPDATE against the database. Also, would the format method really be so inefficient as to make a process that ought to run in around seven minutes take four hours and forty minutes?

There wasn’t any more that I could do at this point as I didn’t have access to the source, I wasn’t familiar with it anyway, and I had plenty of other work to get on with.

When Jim called a couple of days later to say that he’d fixed the problem, it turned out to be nothing to do with the inefficient format method and, guess what, everything to do with an inefficient hash table implementation. There’s a hash table used to keep track of column values that have been changed by the program. It’s only searched during update processing, and the inefficient implementation is of little consequence when the program is processing only a few hundred column values, as is typical. However, in this case there were nearly half a million column values being processed, and this caused a massive performance problem with the hash table.
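
To give a feel for the scale involved, here’s a toy comparison (not the product’s actual code, just a sketch) of a linear-scan lookup against a proper Dictionary when half a million keys are involved:

using System;
using System.Collections.Generic;
using System.Diagnostics;

// A toy illustration only: a lookup structure that is harmless at a few hundred
// entries (linear scan, O(n) per lookup) against a Dictionary (O(1) per lookup)
// at half a million entries. The key names are invented.
class LookupComparison
{
    static void Main()
    {
        const int count = 500000;
        List<string> list = new List<string>();
        Dictionary<string, int> dictionary = new Dictionary<string, int>();
        for (int i = 0; i < count; i++)
        {
            string key = "column" + i;
            list.Add(key);
            dictionary[key] = i;
        }

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000; i++)
        {
            list.Contains("column" + (count - 1 - i)); // scans ~500,000 entries each time
        }
        Console.WriteLine("Linear scan: " + sw.ElapsedMilliseconds + " ms");

        sw.Reset();
        sw.Start();
        for (int i = 0; i < 1000; i++)
        {
            dictionary.ContainsKey("column" + (count - 1 - i)); // hashes straight to the entry
        }
        Console.WriteLine("Dictionary:  " + sw.ElapsedMilliseconds + " ms");
    }
}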

Again, I didn’t say “I told you so”, but it did make me realise how reasonably good diagnostic skills and a lot of experience can help in suggesting a cause based on abductive reasoning. Maybe I just got lucky, but both times, without any sight of the code, I made a fairly specific suggestion about the cause that turned out to be accurate.

 

 


Written by Sea Monkey

May 5, 2015 at 11:45 am

Posted in Debugging


Windows Performance Counter names lost


Despite working with Microsoft Windows for over 20 years I’ve never seen this happen before.

One of our customers reported that one of our Windows services was failing to start on their server. When I investigated, I found that the failure was in the service’s initialisation routine, where it attempts to reset its performance counters. There wasn’t much error handling around this code, as we’d never experienced any failures in it and it’s difficult to see how resetting a performance counter could fail.
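
In hindsight, the initialisation could have looked something like the following sketch. The category, counter names and event source are invented; the point is the try/catch around each reset, so that one bad counter can’t stop the whole service from starting:

using System;
using System.Diagnostics;

// A sketch only: the category, counter names and event source are invented.
internal static class CounterReset
{
    private static readonly string[] CounterNames = { "MessagesProcessed", "MessagesFailed" };

    public static void ResetAll()
    {
        foreach (string name in CounterNames)
        {
            try
            {
                // false = open the counter writable so that RawValue can be set
                using (PerformanceCounter counter = new PerformanceCounter("MyServiceCategory", name, false))
                {
                    counter.RawValue = 0; // reset the counter
                }
            }
            catch (Exception ex)
            {
                // Log and carry on rather than letting the whole service fail to start.
                EventLog.WriteEntry("MyService",
                    "Failed to reset counter '" + name + "': " + ex.Message,
                    EventLogEntryType.Warning);
            }
        }
    }
}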

There wasn’t much diagnostic information available in the service’s log other than an indication that it was the counter reset that failed. However, the previous day the customer had told us that they had increased the disk allocation on this server and rebooted it. That implied that the server had been running short of disk space, so had this caused some corruption?

I started PerfMon and was surprised to see that all the counter names had disappeared and had been replaced by numeric values. This appeared to confirm the corruption theory.

A quick look in the event log confirmed this:

[Event log screenshot: event ID 3001]

At this point I thought we might be looking at a re-install of Windows, but fortunately I found a post that indicated it was possible to reload the counter names by running the command:

C:\Windows\System32\lodctr.exe /r

 
and then rebooting Windows.

After doing this, everything appeared to be back to normal and the service that had previously failed to start started successfully.

 

Written by Sea Monkey

October 22, 2014 at 8:00 pm

Posted in Debugging, Environments

Tagged with ,

Checking if a .NET assembly is compiled for 32 bit or 64 bit execution


The traditional way of checking whether a program or dll is compiled as 32 bit or 64 bit is to use the Microsoft DumpBin utility with the /HEADERS switch. However, this doesn’t work for .NET assemblies, as the first part of the file is a bootstrap for the .NET runtime and DumpBin will always indicate that the file is 32 bit regardless of whether the .NET code is 32 bit or 64 bit.

The solution is to use the CorFlags utility that ships as part of the Microsoft Platform SDK and the .NET Framework SDK. The output includes a 32BIT value that is set to either 0, indicating that the program is not specifically compiled as 32 bit only (i.e. it’s 64 bit capable), or 1, indicating that it is 32 bit only. For example, this is the output for an assembly compiled using a Platform Target of AnyCPU:

Version   : v2.0.50727
CLR Header: 2.5
PE        : PE32
CorFlags  : 9
ILONLY    : 1
32BIT     : 0
Signed    : 1

And this is the output for an assembly compiled with a Platform Target of x86:

Version   : v2.0.50727
CLR Header: 2.5
PE        : PE32
CorFlags  : 11
ILONLY    : 1
32BIT     : 1
Signed    : 1
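
The same information can also be read programmatically via Module.GetPEKind. A minimal sketch (the assembly path here is made up):

using System;
using System.Reflection;

// A sketch of checking an assembly's bitness in code; the path is illustrative.
internal static class BitnessCheck
{
    public static void Main()
    {
        Assembly assembly = Assembly.ReflectionOnlyLoadFrom(@"C:\Temp\MyAssembly.dll");

        PortableExecutableKinds peKind;
        ImageFileMachine machine;
        assembly.ManifestModule.GetPEKind(out peKind, out machine);

        // Required32Bit corresponds to CorFlags' "32BIT : 1"; its absence on an
        // ILOnly assembly corresponds to "32BIT : 0", i.e. an AnyCPU assembly.
        bool is32BitOnly = (peKind & PortableExecutableKinds.Required32Bit) != 0;

        Console.WriteLine("PEKind = " + peKind + ", Machine = " + machine);
        Console.WriteLine(is32BitOnly ? "32 bit only" : "not 32 bit only (64 bit capable)");
    }
}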

Written by Sea Monkey

July 15, 2013 at 8:00 pm

Dynamic load of 32 bit class library from 32 bit Windows Service fails with BadImageFormatException


We have an HL7 interface that runs as a Windows service and is deployed at half a dozen customer sites. Each site has slightly different interfacing requirements, and this is handled using a site-specific assembly that is dynamically loaded by the service.
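
In outline, the plugin load looks something like the sketch below; the interface and type names are invented for illustration and aren’t the product’s real ones:

using System;
using System.Reflection;

// A minimal sketch of the dynamic plugin load; names are illustrative only.
public interface IHl7SiteProcessor
{
    void Process(string hl7Message);
}

public static class PluginLoader
{
    public static IHl7SiteProcessor Load(string assemblyPath, string typeName)
    {
        // Assembly.LoadFrom is where a BadImageFormatException surfaces if the
        // plugin's bitness doesn't match that of the hosting process.
        Assembly assembly = Assembly.LoadFrom(assemblyPath);
        Type type = assembly.GetType(typeName, true); // true = throw if the type is missing
        return (IHl7SiteProcessor)Activator.CreateInstance(type);
    }
}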

Until recently we hadn’t had to worry about whether the service was running as a 32 bit or 64 bit process, but some of the more recent deployments have used a site-specific plugin assembly that has dependencies on other 32 bit assemblies. So, the target of the automated build was changed from ‘Any CPU’ to ‘x86’. Note that no change was made to the solution or project files.

A few weeks later I deployed a new version of a plugin assembly to a test environment at one of our customer sites and found that the service would not start due to a BadImageFormatException being thrown when the service tried to dynamically load the plugin assembly. The intention was that all class library assemblies would be compiled as ‘AnyCPU’, but in fact this plugin assembly had been targeted to ‘x86’ and so was 32 bit only. This should not have been a problem, however, as the Windows service executable was also compiled as 32 bit only.

I used DumpBin.exe to confirm that the headers in the two assemblies indicated that these were 32 bit only. [Edit: See this post that explains why this doesn’t work.]

Then I enabled Fusion logging to see why the load was failing, and at first glance the log didn’t appear to shed any light on the failure:

*** Assembly Binder Log Entry  (10/07/2013 @ 11:52:12) ***

The operation failed.
Bind result: hr = 0x8007000b. An attempt was made to load a program with an incorrect format.

Assembly manager loaded from:  C:\Windows\Microsoft.NET\Framework64\v2.0.50727\mscorwks.dll
Running under executable  C:\Program Files (x86)\Vitality\HL7Interface\InterfaceService\Vitality.HL7.Processing.Service.exe
--- A detailed error log follows.

=== Pre-bind state information ===
LOG: User = NT AUTHORITY\SYSTEM
LOG: Where-ref bind. Location = C:\Program Files (x86)\Vitality\HL7Interface\InterfaceService\Vitality.HL7.Processing.RFH.dll
LOG: Appbase = file:///C:/Program Files (x86)/Vitality/HL7Interface/InterfaceService/
LOG: Initial PrivatePath = NULL
LOG: Dynamic Base = NULL
LOG: Cache Base = NULL
LOG: AppName = NULL
Calling assembly : (Unknown).
===
LOG: This bind starts in LoadFrom load context.
WRN: Native image will not be probed in LoadFrom context. Native image will only be probed in default load context, like with Assembly.Load().
LOG: Using application configuration file: C:\Program Files (x86)\Vitality\HL7Interface\InterfaceService\Vitality.HL7.Processing.Service.exe.Config
LOG: Using machine configuration file from C:\Windows\Microsoft.NET\Framework64\v2.0.50727\config\machine.config.
LOG: Attempting download of new URL file:///C:/Program Files (x86)/Vitality/HL7Interface/InterfaceService/Vitality.HL7.Processing.RFH.dll.
LOG: Assembly download was successful. Attempting setup of file: C:\Program Files (x86)\Vitality\HL7Interface\InterfaceService\Vitality.HL7.Processing.RFH.dll
LOG: Entering run-from-source setup phase.
LOG: Assembly Name is: Vitality.HL7.Processing.RFH, Version=2.0.0.0, Culture=neutral, PublicKeyToken=f529d7a12981f1ee
ERR: Failed to complete setup of assembly (hr = 0x8007000b). Probing terminated.

It took me a few minutes to spot that line 6 indicates that the 64 bit loader is being used:

Assembly manager loaded from: C:\Windows\Microsoft.NET\Framework64\v2.0.50727\mscorwks.dll

Similarly, line 23 indicates that the machine configuration file being used is from the Framework64 folder.

This didn’t make any sense, as the service executable was supposedly running as a 32 bit process.
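
Incidentally, a line of start-up logging along the lines of this sketch (not the service’s actual code) would have shown immediately what the process was really running as:

using System;

// A sketch: IntPtr.Size is 4 in a 32 bit process and 8 in a 64 bit process, and
// unlike Environment.Is64BitProcess it is available on .NET 2.0.
internal static class BitnessLog
{
    public static void Main()
    {
        Console.WriteLine("Process bitness: " + (IntPtr.Size == 8 ? "64 bit" : "32 bit"));
    }
}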

It seems that setting the PlatformTarget property to “x86” when invoking MSBuild in the automated build to build the project files was not having the effect I thought it did.

The project files in Visual Studio are set to PlatformTarget = “AnyCPU” and have references to 64 bit Framework assemblies. When the project files are compiled in the automated build with PlatformTarget=“x86”, the 32 bit flag is set on the assemblies created but the 64 bit framework assembly references remain unchanged.

The solution was to configure the platform targets for each project in Visual Studio as required (“x86” for executables and “AnyCPU” for class libraries) and then update the automated build to build the solution files rather than the project files.

Written by Sea Monkey

July 12, 2013 at 8:00 pm

Posted in Debugging, Development


MySQL Connector/ODBC 3.51.1.30 and CommandBehavior.KeyInfo


Some time ago I wrote a query tool that connects to databases using ODBC. The tool is intended to be database agnostic and is used on a daily basis against both MySQL and Microsoft SQL Server databases.

However, it’s recently exhibited some odd behaviour when running queries against MySQL – a single row is returned even when the resultset is larger than one row.

Adding a LIMIT clause to the query caused all the rows, or at least the number of rows specified, to be returned. Further investigation showed that the problem only occurred when version 3.51.1.30 of MySQL Connector/ODBC was installed. Earlier versions don’t show this behaviour: all rows in the resultset are returned.

It looked like a bug in 3.51.1.30, so I logged it with the MySQL development team. However, it soon became apparent that the problem is a change in 3.51.1.30 that is intended to correctly handle the CommandBehavior.KeyInfo option that I was using in the call to ExecuteReader():

OdbcCommand command = new OdbcCommand(sql, this.connection);
command.CommandType = CommandType.Text;
command.CommandTimeout = Settings.Default.CommandTimeout;
OdbcDataReader reader = command.ExecuteReader(CommandBehavior.KeyInfo);

The intent was to retrieve extended information about the resultset. During development of the application, specifying this option appeared to provide the additional information when executing queries against Microsoft SQL Server. According to the MySQL technician who dealt with the bug report, this option was implemented to retrieve metadata from Microsoft SQL Server but, as of version 3.51.1.30, MySQL Connector/ODBC now takes account of it too. Unfortunately, the MySQL developers have decided to handle it in a different way to the Microsoft SQL Server development team. In short, the option adds metadata to the resultset without changing the number of rows returned when executing against a Microsoft SQL Server database, but Connector/ODBC will always return only one row when this option is set, regardless of how many rows the query would otherwise return.
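
For reference, the extended metadata comes back via GetSchemaTable. One way to keep that metadata while still getting every row back (a sketch, not necessarily what the tool now does) is to fetch the schema and the data in two passes:

using System.Data;
using System.Data.Odbc;

// A sketch only: the connection and SQL text are assumed to come from elsewhere.
public static class TwoPassQuery
{
    public static DataTable ReadSchemaThenData(OdbcConnection connection, string sql)
    {
        using (OdbcCommand command = new OdbcCommand(sql, connection))
        {
            // Pass 1: metadata only; KeyInfo adds key and base-table details to the schema table.
            using (OdbcDataReader schemaReader =
                command.ExecuteReader(CommandBehavior.SchemaOnly | CommandBehavior.KeyInfo))
            {
                DataTable schema = schemaReader.GetSchemaTable();
                // schema rows expose columns such as "ColumnName", "IsKey" and "BaseTableName".
            }

            // Pass 2: the data itself, with no behaviour flags, so every row is returned.
            DataTable results = new DataTable();
            using (OdbcDataReader reader = command.ExecuteReader())
            {
                results.Load(reader);
            }
            return results;
        }
    }
}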

It’s difficult not to consider this a bug, although the MySQL development team don’t agree. If the MySQL development team acknowledge that the option was intended for Microsoft SQL Server but have now decided to handle it as well, what is the logic behind handling it in a different manner to Microsoft SQL Server?

Written by Sea Monkey

May 10, 2013 at 8:00 pm

Posted in Debugging, Development


C# array initialisation


I often define char arrays inline when using the String.Split() method, e.g.

string[] myStrings = myStringList.Split(new char[] {';'}, StringSplitOptions.RemoveEmptyEntries);

Earlier today I wrote a similar line but, despite the source string containing several ‘;’ delimiters, the result was a string array containing a single element.

It wasn’t immediately obvious what my mistake was, but when I looked very carefully I realised that I had miscoded the line as:

string[] myStrings = myStringList.Split(new char[';'], StringSplitOptions.RemoveEmptyEntries);

So, what does this do?

The char constant ‘;’ is implicitly converted to an int: (int)';' is 59. So, this line is effectively:

string[] myStrings = myStringList.Split(new char[59], StringSplitOptions.RemoveEmptyEntries);

which means: create an array of 59 chars, each with a value of zero, i.e. the null character '\0'. Since the source string contains no null characters, Split finds no delimiters and returns the whole string as a single element.

Not at all what I intended…

The easiest way to see this is to type this into Visual Studio’s Interactive Window:

new char[';']

and you’ll see it evaluated as:

{char[59]}
    [0]: 0 ''
    [1]: 0 ''
    [2]: 0 ''
    [3]: 0 ''
    [4]: 0 ''
    [5]: 0 ''
    [6]: 0 ''
    [7]: 0 ''
    [8]: 0 ''
    [9]: 0 ''
    [10]: 0 ''
    [11]: 0 ''
    [12]: 0 ''
    [13]: 0 ''
    [14]: 0 ''
    [15]: 0 ''
    [16]: 0 ''
    [17]: 0 ''
    [18]: 0 ''
    [19]: 0 ''
    [20]: 0 ''
    [21]: 0 ''
    [22]: 0 ''
    [23]: 0 ''
    [24]: 0 ''
    [25]: 0 ''
    [26]: 0 ''
    [27]: 0 ''
    [28]: 0 ''
    [29]: 0 ''
    [30]: 0 ''
    [31]: 0 ''
    [32]: 0 ''
    [33]: 0 ''
    [34]: 0 ''
    [35]: 0 ''
    [36]: 0 ''
    [37]: 0 ''
    [38]: 0 ''
    [39]: 0 ''
    [40]: 0 ''
    [41]: 0 ''
    [42]: 0 ''
    [43]: 0 ''
    [44]: 0 ''
    [45]: 0 ''
    [46]: 0 ''
    [47]: 0 ''
    [48]: 0 ''
    [49]: 0 ''
    [50]: 0 ''
    [51]: 0 ''
    [52]: 0 ''
    [53]: 0 ''
    [54]: 0 ''
    [55]: 0 ''
    [56]: 0 ''
    [57]: 0 ''
    [58]: 0 ''
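
And for completeness, a small runnable sketch (with a made-up input string) showing the practical effect on Split:

using System;

// The mistyped line compiles happily but splits on nothing: new char[';'] is a
// 59-element array of '\0' characters and the input contains no '\0' characters.
class SplitDemo
{
    static void Main()
    {
        string myStringList = "alpha;beta;gamma";

        string[] correct = myStringList.Split(new char[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        string[] wrong = myStringList.Split(new char[';'], StringSplitOptions.RemoveEmptyEntries);

        Console.WriteLine(correct.Length); // 3
        Console.WriteLine(wrong.Length);   // 1 - the whole string, unsplit
    }
}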

Written by Sea Monkey

January 8, 2013 at 8:00 pm

Posted in Debugging, Development


.NET Interop Assemblies and COM references


A confusing tale of legacy COM libraries and .NET Interop…

The Problem

We have a COM library called VDataServer.dll that used to have a dependency on a third-party COM library called Gordias.dll. Some time ago the calls that VDataServer made to Gordias were replaced with calls to newly implemented functions in VDataServer.dll itself.

One of our .NET applications calls VDataServer via .NET Interop and, although it’s worked successfully in the past, we had some problems deploying it to a customer site last week. It was failing with a type initialisation exception when a function in VDataServer was called, yet it worked on our development machines. Well, most of them – I found that I couldn’t even compile the application on my machine, as it complained about a missing reference to Gordias.dll.

I checked the VDataServer source code and found that while the calls to functions in the Gordias library had been removed, there were still references to the library. These references didn’t cause any loading or execution problems in a COM-only execution environment, but they seemed to be causing a load or initialisation failure when used from a .NET application.

I copied the Gordias library onto my machine and registered it, used TLBIMP to create interop assemblies for both VDataServer and Gordias, and added references to these assemblies to the .NET application project. The project still failed to compile with an error message indicating that there was a missing reference to the Gordias assembly.

I used Reflector to load the interop assembly for VDataServer and was surprised to find that when I expanded the reference to Gordias it prompted me to browse for the Gordias interop assembly even though it was in the same folder. So, there was something wrong with the reference to Gordias in VDataServer or with the Gordias interop assembly.

A Solution

It took a while for me to realise that, because the dependency on Gordias was not a direct dependency of the .NET application but rather a dependency of VDataServer, what was required was to create the interop assembly for Gordias first and then create the interop assembly for VDataServer using the /reference argument, which tells TLBIMP to use the Gordias interop assembly to resolve references:

tlbimp.exe Gordias.dll /out:Interop.Gordias.dll

tlbimp.exe VDataServer.dll /out:Interop.VDataServer.dll /reference:Interop.Gordias.dll

When I deleted and re-added the references for the two interop assemblies to the .NET project it successfully compiled.

Finally, just to prove that the failure on the version deployed to the customer site was simply a Fusion binding failure, i.e. the .NET CLR loader process was unable to correctly resolve the reference to the Gordias interop assembly in the VDataServer interop assembly, I configured Fusion logging to write bind failures to disk. This confirmed that Fusion was trying and failing to resolve the reference to the Gordias interop assembly by incorrectly loading the Gordias COM dll.

It seems that we were somehow getting away with the problem previously because most of our development machines, and the customer servers where the application was deployed, had a copy of the Gordias dll. It’s still not completely clear how this worked, given that it seems TLBIMP must be run with the /reference argument specifying the Gordias interop assembly, because no one can remember having to do this previously.

Lessons learned

1. If there’s the slightest suspicion that the problem is a Fusion binding failure, use Fusion’s logging to check, as it will save you a lot of time.

2. If you’re referencing a COM dll in a .NET project and that COM dll references other COM dlls, you may find that you need to create .NET interop assemblies for the referenced COM dlls first and then specify these using TLBIMP’s /reference argument when creating the interop assembly for the COM dll that your .NET code is calling.

Written by Sea Monkey

September 12, 2012 at 8:00 pm

Posted in Debugging, Development
