100 error-free machine hours isn’t exactly evidence of anything when it comes to...

tavianator · on March 21, 2024

At first I thought jcalvinowens meant they had actively tried to reproduce this bug for 100 machine hours, which would lend credence to "very very rare in practice". But in 100 machine hours of a typical workload doesn't really show anything about this bug. Also the bug was introduced in v6.5, so very very many btrfs users have avoided this bug for far longer than 100 machine hours.

I do agree it's "very very rare in practice". You need at least 3 threads racing on the same disk block in a ~15 instruction window, during which time you need the other thread to start and finish reading a block from disk. Which means you need very fast I/O and decryption and enough CPU oversubscription to get preempted in the race window. And it only happens when you actually hit disk which means once your machine is in a steady state, most metadata blocks you need will already be cached and you'll never see the bug.

That said, this is not some crazy 1-instruction race window with preemption disabled. With enough threads calling statx() in parallel and regularly dropping caches, I can reproduce it consistently within a few minutes. Way less than 100 hours :)

jcalvinowens · on March 22, 2024

> But in 100 machine hours of a typical workload doesn't really show anything about this bug

It does though: it shows it isn't a dire usability problem. A machine with this bug is still able to do meaningful work, it just might eat it's own tail occasionally.

If I ran N servers with this bug and had an automated thing to whack them when they tripped over it, how many would be out of commission at any given time? All signs point to it being a pretty small number. As long as data loss isn't a concern, a sysadmin would probably choose to wait a couple weeks for the fix through the proper channels, rather than going to a lot of trouble to do it immediately themselves (which also carries risk).

This is also why I was curious about kconfig: my test boxes have all the hardening options turned on right now, I could see how that might make it less likely to trigger.

I completely agree the "uniform random bug" model is only so useful, since an artificial reproducer obviously blows it out of the water. But as I said elsewhere, I've seen it be shockingly predictive when applied to large numbers of machines with bugs like this.

jcalvinowens · on March 20, 2024

Of course it is: it's evidence the average person will never hit this bug. Statistics, and all that.

Having worked on huge deployments of Linux servers before, I can tell you that modelling race condition bugs as having a uniform random chance of happening per unit time is shockingly predictive. But it's not proof, obviously.

lxgr · on March 20, 2024

> modelling race condition bugs as having a uniform random chance of happening per unit time is shockingly predictive

I don't generally disagree with that methodology, but 100 hours is just not a lot of time.

If you have a condition that takes, on average, 1000 hours to occur, you have a 9 in 10 chance of missing it based on 100 error-free hours observed, and yet it will still affect nearly 100% of all of your users after a bit more than a month!

For file systems, the aim should be (much more than) five nines, not nine fives.

jcalvinowens · on March 20, 2024

You edited this in after I replied, or maybe I missed it:

> If you have a condition that takes, on average, 1000 hours to occur, you have a 9 in 10 chance of missing it based on 100 error-free hours observed, and yet it will still affect nearly 100% of all of your users after a bit more than a month!

I understand the point you're trying to make here, but 1000 is just an incredibly unrealistically small number if we're modelling bugs like that. The real number might be on the order of millions. The effect you're describing in real life might take decades: weeks is unrealistic.

lxgr · on March 20, 2024

I agree that a realistic error rate for this particular bug is much lower than 1 in 1000 hours (or it would have long been caught by others).

But that makes your evidence of 100 error-free hours even less useful to make any predictions about stability!

jcalvinowens · on March 22, 2024

> But that makes your evidence of 100 error-free hours even less useful to make any predictions about stability!

You're still conflating probabilities an event occurs across a group with the probability an event happens to one specific individual in that group (in this case, me). I'm talking about the second thing, and it's very much not the same.

If I could rewind the universe and replay it many many times, some portion of those times I will either be very lucky or very unlucky, and get an initial testing result that badly mispredicts my personal future. But we know that most of the time, I won't.

I can actually prove that. Because of the simple assumptions we're making, we can directly compute the probabilities we are initially that wrong:

  Odds we test 1000-hour bug for 1000 hours without tripping: 0.999^1000 = 36.7%
  Odds we test 50-hour bug for 100 hours without tripping: 0.98^100 = 13.3%
  Odds we test 10-hour bug for 100 hours without tripping: 0.9^100 = 0.002%

Under our spherical cow assumptions, my 100 hours is a very convincing demonstration the real bug rate is less than one per 10 hours.

Of course, in the real world, you might never hit the bug because you have to pat yourself on the head while singing Louie Louie and making three concurrent statx() calls on prime numbered CPUs with buffers off-by-one from 1GB alignment while Mars is in the fifth house to trigger it... it's just a model, after all.

jcalvinowens · on March 20, 2024

> If you have a condition that takes, on average, 1000 hours to occur, you have a 9 in 10 chance of missing it based on 100 error-free hours observed

Yes. Which means 9/10 users who used their machines 100 or fewer hours on the new kernel will never hit the hypothetical bug. Thank you for proving my point!

I'm not a filesystem developer, I'm a user: as a user, I don't care about the long tail, I only care about the average case as it relates to my deployment size. As you correctly point out, my deployment is of negligible size, and the long tail is far far beyond my reach.

Aside: your hypothetical event has a 0.1% chance of happening each hour. That means it has a 99.9% chance of not happening each hour. The odds it doesn't happen after 100 hours is 0.999^100, or 90.5%. I think you know that, I just don't want a casual reader to infer it's 90% because 1-(100/1000) is 0.9.

lxgr · on March 20, 2024

> Which means 9/10 users who used their machines 100 or fewer hours on the new kernel will never hit the bug.

No, that's not how probabilities work at all for a bug that happens with uniform probability (i.e. not bugs that deterministically happen after n hours since boot). If you have millions of users, some of them will hit it within hours or even minutes after boot!

> As you correctly point out, my deployment is of negligible size, and the long tail is far far beyond my reach.

So you don't expect to accrue on the order of 1000 machine-hours in your deployment? That's only a month for a single machine, or half a week for 10. That would be way too much for me even for my home server RPi, let alone anything that holds customer data.

> I'm not a filesystem developer, I'm a user: I don't care about the long tail, I only care about the average case as it relates to my deployment size.

Yes, but unfortunately you seem to either have the math completely wrong or I'm not understanding your deployment properly.

jcalvinowens · on March 20, 2024

> So you don't expect to accrue on the order of 1000 machine-hours in your deployment?

The 1000 number came from you. I have no idea where you got it from. I suspect the "real number" is several orders of magnitude higher, but I have no idea, and it's sort of artificial in the first place.

My overarching point is that mine is such a vanishingly small portion of the universe of machines running btrfs that I am virtually guaranteed that bugs will be found and fixed before they affect me, exactly as happened here. Unless you run a rather large business, that's probably true for you too.

The filesystem with the most users has the least bugs. Nothing with the feature set of btrfs has even 1% the real world deployment footprint it does.

> If you have millions of users, some of them will hit it within hours or even minutes after boot!

This is weirdly sensationalist: I don't get it. Nobody dies when their filesystem gets corrupted. Nobody even loses money, unless they've been negligent. At worst it's a nuisance to restore a backup.

lxgr · on March 20, 2024

> The 1000 number came from you. I have no idea where you got it from,

It's an arbitrary example of an error rate you'd have a 90% chance of missing in your sample size of 100 machine-hours, yet much too high for almost any meaningful application.

I have no idea what the actual error rate of that btrfs bug is; my only point is that your original assertion of "I've experienced 100 error-free hours, so this is a non-issue for me and my users" is a non sequitur.

> This is weirdly sensationalist: I don't get it. Nobody dies when their filesystem gets corrupted. Nobody even loses money, unless they've been negligent.

I don't know what to say to that other than that I wish I had your optimism on reliable system design practices across various industries.

Maybe there's a parallel universe where people treat every file system as having an error rate of something like "data corruption/loss once every four days", but it's not the one I'm familiar with.

For better or worse, the bar for file system reliability is much, much, much, much higher than anything you could reasonably produce empirical data for unless you're operating at Google/AWS etc. scale.

jcalvinowens · on March 20, 2024

> "I've experienced 100 error-free hours, so this is a non-issue for me and my users"

It's a statement of fact: it has been a non-issue for me. If you're like me, it's statistically reasonable to assume it will be a non-issue for you too. Also, no users, just me. "Proabably okay" is more than good enough for me, and I'm sure many people have similar requirements (clearly not you).

I have no optimism, just no empathy for the negligent: I learned my lesson with backups a long time ago. Some people blame the filesystem instead of their backup practices when their data is corrupted, but I think that's naive. The filesystem did you a favor, fix your shit. Next time it will be your NAS power supply frying your storage.

It's also a double edged sword: the more reliable a filesystem is, the longer users can get away without backups before being bitten, and the greater their ultimate loss will be.

lxgr · on March 20, 2024

> It's a statement of fact: it has been a non-issue for me.

Yes...

> If you're like me, it's statistically reasonable to assume it will be a non-issue for you too.

No! This simply does not follow from the first statement, statistically or otherwise.

You and I might or might not be fine; you having been fine for 100 hours on the same configuration just offers next-to-zero predictive power for that.

jcalvinowens · on March 20, 2024

> No! This simply does not follow from the first statement, statistically or otherwise.

> You and I might or might not be fine; you having been fine for 100 hours on the same configuration just offers next-to-zero predictive power for that.

You're missing the forest for the trees here.

It is predictive ON AVERAGE. I don't care about the worst case like you do: I only care about the expected case. If I died when my filesystem got corrupted... I would hope it's obvious I wouldn't approach it this way.

Adding to this: my laptop has this btrfs bug right now. I'm not going to do anything about it, because it's not worth 20 minutes of my time to rebuild my kernel for a bug that is unlikely to bite before I get the fix in 6.9-rc1, and would only cost me 30 minutes of time in the worst case if it did.

I'll update if it bites me. I've bet on much worse poker hands :)

lxgr · on March 20, 2024

Well, from your data (100 error-free hours, sample size 1) alone, we can only conclude this: “The bug probably happens less frequently than every few hours”.

Is that reliable enough for you? Great! Is that “very rare”? Absolutely not for almost any type of user/scenario I can imagine.

If you’re making any statistical arguments beyond that data, or are implying more data than that, please provide either, otherwise this will lead nowhere.

Dylan16807 · on March 21, 2024

> I only care about the expected case.

The expected case after surviving a hundred hours is that you're likely to survive another hundred.

Which is a completely useless promise.

That piece of data doesn't let you predict anything at reasonable time scales for an OS install.

You can't squeeze more implications out of such a small sample.

jcalvinowens · on March 22, 2024

I don't care about the aggregate: I only care about me and my machine here.

> The expected case after surviving a hundred hours is that you're likely to survive another hundred.

That's exactly right. I don't expect to accrue another hundred hours before the new release, so I'll likely be fine.

> Which is a completely useless promise.

Statistics is never a promise: that's a really naive concept.

> at reasonable time scales for an OS

The timescale of the OS install is irrelevant: all that matters is the time between when the bug is introduced and when it is fixed. In this case, about nine months.

Dylan16807 · on March 22, 2024

You only use your machines for twenty hours per month?

Even so, "likely" here is something like "better than 50:50". Your claim was "very very rare" and that's not supported by the evidence.

> Statistics is never a promise: that's a really naive concept.

It's a promise of odds with error bars, don't be so nitpicky.

jcalvinowens · on March 22, 2024

> Even so, "likely" here is something like "better than 50:50". Your claim was "very very rare" and that's not supported by the evidence.

You're free to disagree, obviously, but I think it's accurate to describe a race condition that doesn't happen in 100 hours on a multiple machines with clock rates north of 3GHz as "very very rare". That particular code containing the bug has probably executed tens of millions of times on my little pile of machines alone.

> It's a promise of odds with error bars, don't be so nitpicky.

No, it's not. I'm not being nitpicky, the word "promise" is entirely inapplicable to statistics.

Dylan16807 · on March 22, 2024

If my computer has a filesystem error that happens every week of uptime (168 machine hours), I call that "common".

bmicraft · on March 21, 2024

> Nothing with the feature set of btrfs has even 1% the real world deployment footprint it does.

So you haven't heard of zfs then?

kbolino · on March 20, 2024

A single 9 in reliability over 100 hours would be colossally bad for a filesystem. For the average office user, 100 hours is not even a month's worth of daily use.

Even as an anecdote this is completely useless. A couple thousand hours and dozens of mount/unmount cycles would just be a good start.

paulddraper · on March 20, 2024

> Yes. Which means 9/10 users who used their machines 100 or fewer hours on the new kernel will never hit the hypothetical bug. Thank you for proving my point!

So....that's really bad.

BenjiWiebe · on March 20, 2024

If I'm running btrfs on my NAS, that's only ~4 days of runtime. If there's a bug that trashes the filesystem every month on average, that's really bad and yet is very unlikely to get caught in 4 days of running.

7bit · on March 20, 2024

> Of course it is: it's evidence the average person will never hit this bug. Statistics, and all that.

Anecdotal statistics maybe.

kalleboo · on March 21, 2024

I've been running a public file server[0] on a 1994 vintage PowerBook 540c running Macintosh System 7.5 using the HFS file system on an SD card (via a SCSI2SD adapter) for like 400 hours straight this month, with zero issues.

Not once would I even insinuate that after my 400 hours of experience, HFS is a file system people should rely on.

100 hours says nothing. When I read your comment I just assumed that you had typoed 100 machine-years (as in running a server farm) as that would have been far more relevant.

[0] Participating in the MARCHintosh GlobalTalk worldwide AppleTalk network full of vintage computers and emulators

valicord · on March 20, 2024

You do understand that 100 hours is not even 5 days, right?