Troubleshooting an intermittent failure in CI tests on ARM64

vardump · on Dec 15, 2023

Another nice demonstration of the power of rr and reversible debugging.

mark_undoio · on Dec 15, 2023

This is cool - time travel debugging is potentially really helpful in these flaky situations. Having tests fail unpredictably basically means you're not controlling the thing they test, which can be scary.

But I find the most annoying failures in CI are the ones I can't reproduce any other way and somehow always happen when I'm not trying.

Run locally? Fine. Run on a cloud machine that's identical to the CI system? Fine. Run multiple instances of the test on the cloud machine to generate more load? Fine. Run in the overnight tests - blam.

This doesn't always make sense, even once I've found the bug - sometimes the timing just shakes out that way.

Sometimes we record stubborn tests that are acting weird, so we're ready when they next fail.

tmiku · on Dec 15, 2023

I never get tired of funny misreads of product names. I thought "our own busted framework" was self-deprecation until it showed up in slab serif on the next line.

candiddevmike · on Dec 15, 2023

This is a good cautionary tale for folks jumping on cheap ARM cloud instances. Different architectures mean different bugs and, depending on how your infra is provisioned, can make reproducing those bugs even harder.

nicklecompte · on Dec 15, 2023

This is a very well-written blog post: concise, lucid, and it does a good job speaking to programmer audiences with varying backgrounds.

byyll · on Dec 15, 2023

Is that the company that fucked up insomnia.rest?