Short Stories

// Tales from software development

Ken's First Law revisited


In Ken’s First Law of Problem Diagnosis I recounted how one of my former colleagues had observed that the length of time it takes to diagnose a problem is often inversely proportional to the size of the fix. Ken’s First Law struck with a vengeance last week.

On Wednesday Rob asked me to look at a bug that had been reported against our client’s ecommerce system. The order system automatically places an order on hold if it fails one of several validation rules. Manual intervention is then required to either accept or reject the order. If the order is rejected, an email is automatically generated and sent to inform the customer. The bug report indicated that rejected orders were being processed correctly except that the customer email was not being generated and sent.

Unfortunately, the code in the systems that make up the ecommerce application was written several years ago and, although there is some basic error logging, it doesn’t have diagnostic features such as tracing. So, getting to the bottom of this problem wasn’t going to be easy.

I built and ran the web site code on my development machine and confirmed that the correct values were being passed from the UI. After about half a dozen calls down through various components, the web site code called a web service, part of the Order Administration Services (OAS) system, to reject the order.

Debugging the web service locally wasn’t an option because of the complexity of configuring OAS to run on my development machine. So, I carefully examined the code that was being called and followed the control flow down through about four levels. Eventually, the code made a call to another web service to create and send the customer email. This web service was part of another group of web services, the Commerce Website (CW) services. Configuring the CW services to run locally was feasible, so I did this and then called the web service and debugged it. The control flow passed down about five levels and eventually invoked a stored procedure that added a row to a table. Each row in this table represents an email request. A separate task runs every few minutes to process these rows and generate and send the emails.
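Roughly, that dispatch pattern works something like the sketch below. This is only an illustration: the table name, columns, and connection string are invented, not taken from the real schema or code.

    // Hypothetical sketch of the email dispatch task: table, column, and
    // connection details are invented for illustration.
    using System;
    using System.Data.SqlClient;

    class EmailDispatchTask
    {
        static void Main()
        {
            using (SqlConnection conn = new SqlConnection(
                "Server=.;Database=CW;Integrated Security=SSPI"))
            {
                conn.Open();

                // Pick up email requests that haven't been processed yet.
                SqlCommand select = new SqlCommand(
                    "SELECT RequestId, Recipient FROM EmailRequest WHERE ProcessedDate IS NULL",
                    conn);

                using (SqlDataReader reader = select.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Generate and send the email for this request, then mark
                        // the row as processed (both steps omitted here).
                        Console.WriteLine("Would send email for request {0} to {1}",
                            reader["RequestId"], reader["Recipient"]);
                    }
                }
            }
        }
    }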

Over the next few days I seemed to go around in circles, chasing every possibility and coming up with nothing. There didn’t appear to be a problem in the code itself, but because it was impossible to run all the code end to end on my development machine, it was difficult to prove that conclusively. I seemed to be coming at this problem from the wrong angle anyway: I was trying to replicate the problem rather than prove that there wasn’t one.

The problem had been logged against the test system and had been replicated there. I logged into the test system, tried the replication steps, and confirmed that the problem was still occurring.

I wandered over to where the Environments guys sit. They’re responsible for the test and live systems and for deploying code builds into these environments. One of them spared me some time to go through the SQL logs, the Windows event logs, and the application’s own SQL logging table, looking for any evidence of exceptions or errors in the test system. We didn’t find anything.

Back at my desk I created a simple web service client to call the OAS web service. I configured the client to call the service in the test environment. This turned out to be more difficult than I expected because the environments are configured with DNS aliases and suffixing. In the end I specified the IP address in the web service URL to ensure that I was calling the correct machine. When I ran the client I found that the call to the OAS service didn’t generate an email. So I knew the problem was somewhere between the call into the OAS service and the execution of the stored procedure in the CW web service.
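For what it’s worth, the kind of throwaway client I mean looks roughly like the sketch below. The proxy class, method, order details, and IP address are all hypothetical (the proxy would be generated from the real service’s WSDL); the relevant trick is overriding the proxy’s Url property so the call goes to a specific machine instead of through the DNS aliases.

    // Hypothetical test client: OrderAdminService, RejectOrder, and the
    // address are placeholders, not the real OAS contract.
    using System;

    class OasTestClient
    {
        static void Main()
        {
            // Proxy generated from the OAS service's WSDL (wsdl.exe or
            // "Add Web Reference" in Visual Studio).
            OrderAdminService service = new OrderAdminService();

            // Point the proxy at the test machine by IP to bypass the DNS
            // aliasing and suffixing used in the environments.
            service.Url = "http://192.0.2.10/OAS/OrderAdminService.asmx";

            // Reject a known test order, then check whether an email
            // request appears.
            service.RejectOrder(12345, "Rejected for testing");

            Console.WriteLine("RejectOrder call completed.");
        }
    }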

The next step was to create a client to call the CW web service. This was a bit more complicated because the web service method needed to be passed an instance of an Order object. After 30 minutes of cutting and pasting code from the source code for the OAS web service, I had a working client. Again, I configured the URL using the IP address of the machine where the test system’s copy of the CW services was located. The result of running the client was interesting: the call to the CW web service successfully generated the customer email.

The call to the CW web service successfully generated an email but the call to the OAS service (which in turn calls the CW web service) did not, so it appeared that the OAS web service method was either failing or not successfully calling the CW service.

At least this was some progress. I had narrowed the failure down to a few call layers in the OAS web service and the possibility that the OAS web service was failing to call the CW web service.

I spent some more time with the Environments guys. We tried to determine if there was any reason why the OAS service would fail to call the CW service. Because these services run on different machines, connectivity issues sometimes arise if, for example, firewalls are not correctly configured. We couldn’t find anything to indicate that the OAS web service would fail to call the CW web service. So, that was another possibility ruled out.

Back at my desk, I carefully looked through the code yet again. I couldn’t see any obvious point of failure in the OAS web service. The code was straightforward enough and much of it was replicated in the web service client that I’d created to call the CW web service. So, I started leaning towards the idea that the CW service was not being called, even though we’d just failed to find a connectivity issue that might explain the failure.

I ran my two web service clients again just to make sure that I wasn’t imagining the results and to confirm my suspicions that the OAS web service was failing to successfully call the CW web service. At this point Rob wandered over to see how I was getting on. When I explained where I’d got to and that I was stumped on what to do next, he said that I needed to talk to Dave. He’s the Environments guy responsible for deploying releases.

Within 15 minutes we’d identified the problem. The web.config for the OAS web services contains settings for the URLs of all the web services that are called. Someone must have been doing some debugging or testing on the OAS services machine and had copied their own web.config into the folder containing the particular service that the order rejection processing calls. The URL settings in this web.config were almost identical, except for two that began with http://localhost/. One of them was the URL of the web service that OAS calls to create an email. So when the OAS web service called the CW web service, it wasn’t calling the ‘real’ CW web service on the machine hosting the CW services; it was calling a local copy that didn’t successfully create the email. After the web.config file was replaced with a correctly configured version, the bug was resolved.
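To make the failure mode concrete, the offending entries would have looked something like the fragment below. The key names and addresses are invented for illustration; only the localhost URL is the point.

    <!-- Illustrative only: key names and addresses are hypothetical. -->
    <configuration>
      <appSettings>
        <!-- Correct entry: points at the machine hosting the CW services. -->
        <add key="CwOrderServiceUrl"
             value="http://cw-services.test.example.com/CW/OrderService.asmx" />

        <!-- The stray developer setting: points at a local copy of the CW
             email service, so the 'real' CW service never gets called. -->
        <add key="CwEmailServiceUrl"
             value="http://localhost/CW/EmailService.asmx" />
      </appSettings>
    </configuration>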

So, after debugging all the way down through about 15 levels of the call stack, across three systems, on multiple machines, over a period of five days, the cause of the bug turned out to be someone’s discarded web.config.


Written by Sea Monkey

November 20, 2008 at 9:00 pm

Posted in Debugging

