One of the best bugs I've seen had a description fairly similar to this. Hot rou...

lisper · on Jan 3, 2023

I had one of these back in the 90s that turned out to be a compiler bug. It was code that ran a mobile robot with an arm. The exact same code never failed on a Sun workstation, but on an embedded system running VxWorks it crashed intermittently, and only when the arm was moving. The entire heap was corrupted, so by the time the crash occurred there was no hope of getting a stack trace or any hint of what went wrong upstream. It turned out to be two mis-ordered instructions that accessed a value on the stack after the stack pointer had been popped. On VxWorks, interrupts used the same stack as the currently running process, so if an interrupt occurred exactly between these two instructions it would clobber that value, and chaos ensued.

Took a full year to figure it out. Good times.
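
For anyone trying to picture the failure mode, here is a toy model of it. This is only a sketch under loud assumptions: `ToyStack`, `interrupt`, `read_then_pop`, and `pop_then_read` are hypothetical names, the stack is a plain array that grows upward, and the "interrupt" is an ordinary function call. It is not the code the T compiler emitted and it does not model VxWorks itself; it only shows why reading a slot after releasing it is unsafe when interrupts share the stack.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy stack (hypothetical): sp is the index of the next free slot.
struct ToyStack {
    std::array<int, 8> mem{};
    std::size_t sp = 0;

    void push(int v) { mem[sp++] = v; }
};

// Simulated interrupt handler that runs on the SAME stack as the
// interrupted code: it pushes scratch data, then restores sp on return.
void interrupt(ToyStack& s) {
    s.push(-1);  // overwrites whatever sits at mem[sp]
    s.sp--;      // handler returns, sp is back where it was
}

// Correct order: read the value while its slot is still below sp,
// then release the slot. An interrupt in between writes above it.
int read_then_pop(ToyStack& s, bool irq_between) {
    int v = s.mem[s.sp - 1];
    if (irq_between) interrupt(s);
    s.sp--;
    return v;
}

// Mis-ordered: release the slot first, then read it. If an interrupt
// lands between the two steps, the handler reuses that slot.
int pop_then_read(ToyStack& s, bool irq_between) {
    s.sp--;
    if (irq_between) interrupt(s);
    return s.mem[s.sp];
}
```

In this model, `pop_then_read` returns the right value every time no interrupt fires between its two steps, which matches the "intermittent, only when the arm is moving" symptom: the bug only shows when an interrupt hits that exact window.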

aw1621107 · on Jan 3, 2023

How did you end up piecing together what happened?

lisper · on Jan 3, 2023

Long story, but the tl;dr is that it happened in two stages. First, someone figured out a way to reliably reproduce the problem. Then I spent a very long time single-stepping through machine instructions until I had a eureka moment.

dekhn · on Jan 3, 2023

And the compiler was emitting the two instructions in the wrong order?

lisper · on Jan 4, 2023

dylan604 · on Jan 4, 2023

I hope someone bought you a beer

lisper · on Jan 4, 2023

Discovery is its own reward ;-)

Actually, I remember reporting the bug to the compiler authors and being stunned when they told me that they were not going to issue a new version with the bug fix because the project was no longer being funded. (This was the T dialect of Lisp in case you're wondering.)

aw1621107 · on Jan 4, 2023

Sounds like quite the arduous process!

What was done in the time period between discovering the bug and its cause?

lisper · on Jan 4, 2023

A lot of rebooting and cursing.

Fortunately, it only happened when the robot's arm was moving, and we were mostly doing mobility research so we were able to be productive simply by not using the arm.

sidewndr46 · on Jan 3, 2023

It has been a while, but a switch to kernel mode followed by a switch back to the same user-mode process doesn't actually touch the FP registers. The idea is that the kernel should not be using them anyway.

Also, a minor point: a const pointer is a pointer which always points at the same address. You can still change what is pointed at. You probably meant "a pointer to const".
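
To make the distinction concrete, a minimal C++ sketch (the function names `demo_const_pointer` and `demo_pointer_to_const` are made up for illustration; the commented-out lines are the ones a conforming compiler should reject):

```cpp
#include <cassert>

// Const pointer: the pointer cannot be reseated, but the pointee is writable.
int demo_const_pointer() {
    int a = 1;
    int* const p = &a;
    *p = 42;          // OK: modifies a through p
    // int b = 2;
    // p = &b;        // error: p itself is const
    return a;
}

// Pointer to const: the pointee is read-only through this pointer,
// but the pointer can be made to point somewhere else.
int demo_pointer_to_const() {
    int a = 1;
    int b = 2;
    const int* q = &a;
    // *q = 42;       // error: cannot write through a pointer to const
    q = &b;           // OK: reseat q
    return *q;
}
```

Reading the declaration right to left helps: `int* const p` is "p is a const pointer to int", while `const int* q` is "q is a pointer to an int that is const".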

dekhn · on Jan 3, 2023

Not a switch back to the same process. I mean context switching during normal process switching.

People have been using the term "const pointer" to refer to "a pointer to const" for 20+ years (as long as I've been doing C++), although that's probably more out of laziness than incorrectness. Certainly the language definition didn't do anybody favors.

sidewndr46 · on Jan 4, 2023

OK, that's interesting. Have any details on how FP wasn't being restored?