
> You've asked for something which is nonsense if you want to "recycle" compute used to checksum records.

As you will note from my request and the discussion, I'm perfectly willing to accept I might want something silly.

Would you care to explain why you think this feature is wrongheaded?

> using the checksums intrinsic to the zfs technology to avoid bitflips directly is a bad call.

You should read the discussion. I was requesting it for a different purpose, although this "rsync" issue is an alternative purpose. I wanted to compare file versions across snapshots, and against the live version, to find all unique file versions.
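That "find all unique file versions" idea can be sketched with ordinary whole-file hashing, which is exactly the re-reading cost the request hoped to avoid. The helper names below are my own invention, and on ZFS the candidate paths would be the live file plus each `.zfs/snapshot/<name>/...` copy:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files never sit whole in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def unique_versions(paths):
    """Group candidate copies of a file (live + snapshot paths) by content hash.

    Returns {digest: [paths, ...]}; each key is one unique version.
    """
    versions = defaultdict(list)
    for p in map(Path, paths):
        if p.is_file():
            versions[sha256_of(p)].append(p)
    return dict(versions)
```

Note this has to read every byte of every copy; the feature request was, in effect, to get these digests for free from checksums ZFS already computed at a different (record) granularity.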



> You should read the discussion. I was requesting it for a different purpose, although this "rsync" issue is an alternative purpose. I wanted to compare file versions across snapshots, and against the live version, to find all unique file versions.

I have. I didn't need to. But I have.

And I agree with the experts there and here... If you're struggling to follow, I'm happy to explain in _great_ detail how you're off the mark. You have a nice idea, but it's unfortunately too naïve, and is probably built on hearing that "the filesystem stores checksums". Everything said about why this is a bad idea applies equally to btrfs.

As I said, clear as day:

> If you want them to store the checksum of the POSIX object as an attribute...

This is what you _should_ be asking for. There are even ways of building this which _do_ recycle CPU cycles. But it's messy, oh god is it awkward, and by god it makes things so difficult to follow that the filesystem would suffer for the sake of this small feature.

If you're looking to store the checksum of the complete POSIX object _at write_, _as it's stored on disk_ for _that internal revision of the filesystem_ then it kinda by definition is turning into an extended POSIX attribute associated with that POSIX object.

Even if implemented, this is messy: it needs to be revised, amended, checked, and updated, and there will be multiple algorithms with different advantages and drawbacks.

I know because I work in a job where we replicate and distribute multiple hundreds of petabytes globally. The only way this has been found to work and scale the way you want is to store the checksums alongside the data: either as additional POSIX objects on the filesystem, or in a database that is integrated with, and kept in sync with, the filesystem itself.
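The "checksum as an additional POSIX object" pattern is simple to sketch. Here is a minimal version using a `<name>.sha256` sidecar file in the `sha256sum` text format; the helper names and sidecar naming are my own, not any particular tool's convention:

```python
import hashlib
from pathlib import Path

def write_sidecar(path: Path) -> Path:
    """Record the file's SHA-256 next to it as <name>.sha256 (sha256sum format).

    Encoding the algorithm in the filename is one answer to the
    'multiple algorithms' maintenance problem mentioned above.
    """
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    sidecar = path.with_suffix(path.suffix + ".sha256")
    sidecar.write_text(f"{digest}  {path.name}\n")
    return sidecar

def verify_sidecar(path: Path) -> bool:
    """Re-hash the file and compare against the digest recorded in its sidecar."""
    sidecar = path.with_suffix(path.suffix + ".sha256")
    recorded = sidecar.read_text().split()[0]
    return hashlib.sha256(path.read_bytes()).hexdigest() == recorded
```

The cost is exactly the "few extra cycles" mentioned below: the file is read and hashed again, independently of whatever checksums the filesystem keeps internally.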

People will and do burn a few extra cycles to avoid having unmaintainable extensions and pieces of code.

If you are worried about data within individual records changing, and about replicating/transmitting/storing record-level changes (which is the article's main complaint about rsync), ZFS has this in send/recv.
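To make the record-level idea concrete, here is a toy approximation (not ZFS's actual Merkle-tree bookkeeping, nor the send/recv stream format): split the payload into fixed-size records, hash each, and ship only the records whose hashes differ between versions.

```python
import hashlib

RECORD_SIZE = 128 * 1024  # analogous to a default ZFS recordsize of 128K

def record_hashes(data: bytes):
    """Hash each fixed-size record of the payload."""
    return [
        hashlib.sha256(data[i:i + RECORD_SIZE]).digest()
        for i in range(0, len(data), RECORD_SIZE)
    ]

def changed_records(old: bytes, new: bytes):
    """Indices of records that differ between two versions (a toy delta).

    Records present in only one version always count as changed.
    """
    old_h, new_h = record_hashes(old), record_hashes(new)
    longest = max(len(old_h), len(new_h))
    return [
        i for i in range(longest)
        if i >= len(old_h) or i >= len(new_h) or old_h[i] != new_h[i]
    ]
```

A one-byte edit in the middle of a large file then costs one record of transfer instead of the whole file, which is the property rsync's rolling checksum and ZFS's incremental send both aim for by different means.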

Again, as is being stated elsewhere here:

If you're concerned about data integrity handle it in the FS. If you're concerned about transfer integrity, handle it over the wire.

> Don't mix these up, it just leads to a painful view of the world.



