Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

One of the best bugs I've seen had a description fairly similar to this. Hot routine run at scale (floating point math for ads ML training) fails at a rate about 0.000000001. Turned out to be a very obscure bug in the context switching code in the linux kernel, the FP registers weren't being restored properly.

Another one, the debugging was aided by the fact the developers ensured that everything was accessed through const pointers, so it wasn't their code corrupting their memory.



I had one of these back in the 90s that turned out to be a compiler bug. It was code that ran a mobile robot with an arm. Exact same code running on a Sun workstation never failed, but running on an embedded system running vxWorks crashed intermittently, but only when the arm was moving. Entire heap was corrupted, so by the time the crash occurred there was no hope of getting a stack trace or any hint of what went wrong upstream. Turned out to be two mis-ordered instructions that accessed a value on the stack after the stack pointer had been popped. On vxWorks, interrupts used the same stack as the currently running process, so if an interrupt occurred exactly between these two instructions it would clobber that value, and chaos ensued.

Took a full year to figure it out. Good times.


How did you end up piecing together what happened?


Long story but the tldr is that it happened in two stages. First someone figured out a way to reliably reproduce the problem. And then I spent a very long time single stepping through machine instructions until I had a eureka moment.


And the compiler was emitting the two instructions in the wrong order?


Yes.


I hope someone bought you a beer


Discovery is its own reward ;-)

Actually, I remember reporting the bug to the compiler authors and being stunned when they told me that they were not going to issue a new version with the bug fix because the project was no longer being funded. (This was the T dialect of Lisp in case you're wondering.)


Sounds like quite the arduous process!

What was done in the time period between discovering the bug and its cause?


A lot of rebooting and cursing.

Fortunately, it only happened when the robot's arm was moving, and we were mostly doing mobility research so we were able to be productive simply by not using the arm.


It has been a while, but a switch to kernel mode followed by a switch back to the same user mode process doesn't actually mess with FP registers. The idea being, the kernel should not be using those anyways.

Also minor point: a const pointer is a pointer which always points at the same address. You can still change what is pointed at. You probably meant "a pointer to const"


Not a switch back to the same process- context switching during normal process switching.

People have been using the term "const pointer" to refer to "a pointer to const" for 20+ years (as long as I've been doing C++), although that's probably more out of laziness than incorrectness. Certainly the language definition didn't do anybody favors.


OK, that's interesting. Have any details on how FP wasn't being restored?




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: