
The most elusive bug I ever investigated was also one of my first, in my first job after graduation, back in the early 90s.

TL;DR: spent months investigating a problem that didn't actually exist.

I was put on an investigation into why the data acquisition interface we were developing was not working. It was designed to plug into the serial bus on a Coast Guard icebreaker. Since we couldn't develop on the ship, one of the other engineers made an hour-long recording of the bus traffic, which we were able to play back in a simulated setup in our lab.

Unfortunately, the traffic on the bus was not as reliable as we expected. There was supposed to be a fixed number of data channels being broadcast every second. But frequently, seemingly at random, channels would disappear and then reappear a few seconds later. This made it impossible to identify the different channels, since they were order dependent. It usually wasn't even easy to tell which channel was missing, and often multiple channels would be missing at the same time. Our data readings would get corrupted and everything would fall apart. Since the end product was to be a real-time engine diagnostics system, this was unacceptable.

When I was hired, the company had already been working on the issue for some time. I was tasked with finding a pattern in the drop-outs, so that they could be predicted and our interface parsing adjusted as necessary. I tried to find correlations in the intervals between drop-outs, and the length of the drop-outs. I tried to apply smoothing functions so the effect of the drop-outs would be dampened. I tried doing predictive analysis on the data channels, so that when a drop-out occurred we would at least be able to make a pretty good guess which channel was missing, and realign all the other channels. Months went by. Nothing worked.

Then, one day while I was mulling over a printout of the data recording (literally 100s of pages of nothing but columns of numbers), I noticed something odd: almost every time a drop-out occurred, a channel would transition from positive to negative, or from negative to positive. That is, before the drop-out a channel would be positive, and after the drop-out had ended, the channel would be negative. Or vice versa. But a channel would never be 0. I pored over the entire recording, and there was not a single 0 value to be found.
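For anyone who wants to see the shape of that check, here is a minimal sketch in Python. It assumes one channel's readings are available as a plain list of integers; the sample values and function names are made up for illustration and are not the original data or tooling.

```python
def contains_zero(readings):
    """Return True if any reading in the list is exactly 0."""
    return any(value == 0 for value in readings)


def count_sign_flips(readings):
    """Count adjacent pairs that go straight from positive to negative or back."""
    return sum(1 for a, b in zip(readings, readings[1:]) if a * b < 0)


# Hypothetical channel: it crosses zero twice but never reports a 0.
channel = [3, 2, 1, -1, -2, -1, 1, 2]

print("contains a zero:", contains_zero(channel))
print("sign flips:", count_sign_flips(channel))
```

A long recording with many sign flips and no zeros at all is the kind of thing that should raise suspicion about the recording rather than the signal.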

The data recording had been made by plugging into the serial port on the icebreaker's data bus and using some modem software to record the traffic on the bus. We quickly put together a test setup using the same modem software, and sure enough, when a 0 value was received the software would not record it. It would just bloop right over it. There was nothing at all wrong with our data acquisition interface. The problem was with the recording we were using to test it. We had a new recording made using different modem software, and the problem was gone.
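To illustrate why a silently dropped 0 looks like a vanishing channel, here is a rough simulation in Python. It assumes a simplified model in which each frame carries one integer per channel in a fixed order; the real bus format is not described here, and the frame values and function names are hypothetical.

```python
def record(frames, drops_zero=True):
    """Flatten frames into one stream, optionally skipping every 0 value."""
    return [
        value
        for frame in frames
        for value in frame
        if not (drops_zero and value == 0)
    ]


def reframe(stream, channels):
    """Split the stream back into frames purely by position."""
    return [stream[i:i + channels] for i in range(0, len(stream), channels)]


# Hypothetical three-channel traffic; the middle channel passes through 0.
frames = [
    [5, 2, -7],
    [5, 1, -7],
    [5, 0, -7],
    [5, -1, -7],
]

for frame in reframe(record(frames), channels=3):
    print(frame)
```

In this model the first two frames come back intact, but once the 0 is skipped every later value shifts by one position, so an order-dependent parser assigns readings to the wrong channels. Calling `record(frames, drops_zero=False)` gives back the original frames.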



Nice one



