I don't know exactly what Siracusa is doing here, but I can take an educated gue...

MBCook · 2025-02-25T18:07:29 1740506849

You can start with the size, which is probably really unique. That would likely cut down the search space fast.

At that point maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash and if you just compare the bytes there is no chance of hash collision no matter how small.

Plus if you find a difference in bytes 1290 you can just stop there instead of reading the whole thing to finish the hash.

I don’t think John has said exactly how on ATP (his podcast with Marco and Casey), but knowing him as a longtime listener/reader he’s being very careful. And I think he’s said that on the podcast too.

jonhohle · 2025-02-26T00:44:48 1740530688

To make dedup[0] fast, I use a tree with device id, size, first byte, last byte, and finally SHA-256. Each of those is only used if there is a collision to avoid as many reads as possible. dedup doesn’t do a full file compare, because if you’ve found a file with the same size, first and last bytes, and SHA-256 you’ve also probably won the lottery several times over and can afford data recovery.

This is the default for ZFS deduplication and git does something similar with size and far weaker SHA-1. I would add a test for SHA-256 collisions, but no one has seemed to find a working example yet.

0 - https://github.com/ttkb-oss/dedup

gblargg · 2025-02-26T06:41:29 1740552089

How much time is saved by not comparing full file contents? Given that this is a tool some people will only run occasionally, having it take 30 seconds instead of 15 is a small price to pay for ensuring it doesn't treat two differing files as equal.

xanderlewis · 2025-02-26T09:33:10 1740562390

Same size, same first and last bytes, and same SHA-256.

…and you’re not worried about shark attacks, are you?

gblargg · 2025-03-05T21:29:56 1741210196

I just don't understand why one would intentionally throw chance into a tool that one wants to be 100% robust. It's baffling.

wink · 2025-02-26T12:38:48 1740573528

FWIW, when I wrote a tool like this I used same size + some hash function, not MD5 but maybe SHA1, don't remember. First and last bytes is a good idea, didn't think of that.

pornel · 2025-02-26T17:53:12 1740592392

Reading just the first byte is probably wasting a read of the whole block.

Hashing the whole file after that is wasteful. You need to read (and hash) only as much as needed to demonstrate uniqueness of the file in the set.

The tree concept can be extended to every byte in the file:

https://github.com/kornelski/dupe-krill?tab=readme-ov-file#n...

jonhohle · 2025-02-27T04:55:31 1740632131

Yeah, there is definitely some merit to more efficient hashing. Trees with a lot of duplicates require a lot of hashing, but hashing the entire file would be required regardless of whether partial hashes or done or not.

I have one data set where `dedup` was 40% faster than `dupe-krill` and another where `dupe-drill` was 45% faster than `dedup`.

`dupe-krill` uses blake3, which last I checked, was not hardware accelerated on M series processors. What's interesting is that because of hardware acceleration, `dedup` is mostly CPU-idle, waiting on the hash calculation, while `dupe-krill` is maxing out 3 cores.

Thanks for the link!

unclebucknasty · 2025-02-25T19:45:40 1740512740

>which is probably really unique

Wonder what the distribution is here, on average? I know certain file types tend to cluster in specific ranges.

>maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash

Definitely, for comparing any two files. But, if you're searching for duplicates across the entire disk, then you're theoretically checking each file multiple times, and each file is checked against multiple times. So, hashing them on first pass could conceivably be more efficient.

>if you just compare the bytes there is no chance of hash collision

You could then compare hashes and, only in the exceedingly rare case of a collision, do a byte-by-byte comparison to rule out false positives.

But, if your first optimization (the file size comparison) really does dramatically reduce the search space, then you'd also dramatically cut down on the number of re-comparisons, meaning you may be better off not hashing after all.

You could probably run the file size check, then based on how many comparisons you'll have to do for each matched set, decide whether hashing or byte-by-byte is optimal.

pornel · 2025-02-26T18:20:28 1740594028

> exceedingly rare

To have a mere one in a billion chance of getting a SHA-256 collision, you'd need to spend 160 million times more energy than the total annual energy production on our planet (and that's assuming our best bitcoin mining efficiency, actual file hashing needs way more energy).

The probability of a collision is so astronomically small, that if your computer ever observed a SHA-256 collision, it would certainly be due to a CPU or RAM failure (bit flips are within range of probabilities that actually happen).

unclebucknasty · 2025-02-27T18:06:16 1740679576

You know, I've been hearing people warn of handling potential collisions for years and knew the odds were negligible, but never really delved into it in any practical sense.

Context is everything.

karparov · 2025-02-25T21:46:51 1740520011

This can be done much faster and safer.

You can group all files into buckets, and as soon as a bucket is empty, discard it. If in the end there are still files in the same bucket, they are duplicates.

Initially all files are in the same bucket.

You now iterate over differentiators which given two files tell you whether they are maybe equal or definitely not equal. They become more and more costly but also more and more exact. You run the differentiator on all files in a bucket to split the bucket into finer equivalence classes.

For example:

* Differentiator 1 is the file size. It's really cheap, you only look at metadata, not the file contents.

* Differentiator 2 can be a hash over the first file block. Slower since you need to open every file, but still blazingly fast and O(1) in file size.

* Differentiator 3 can be a hash over the whole file. O(N) in file size but so precise that if you use a cryptographic hash then you're very unlikely to have false positives still.

* Differentiator 4 can compare files bit for bit. Whether that is really needed depends on how much you trust collision resistance of your chosen hash function. Don't discard this though. Git got bitten by this.

jonhohle · 2025-02-26T00:54:02 1740531242

Not surprisingly, differentiator 2 can just be the first byte (or machine word). Differentiator 3 can be the last byte (or word). At that point, 99.99% (in practice more 9s) of files are different and you’re read at most 2 blocks per file. I haven’t figured out a good differentiator 3 prior to hashing, but it’s already so rare, that it’s not worth it, in my experience.

rzzzt · 2025-02-25T18:29:50 1740508190

I experimented with a similar, "hardlink farm"-style approach for deduplicated, browseable snapshots. It resulted in a small bash script which did the following:

- compute SHA256 hashes for each file on the source side

- copy files which are not already known to a "canonical copies" folder on the destination (this step uses the hash itself as the file name, which makes it easy to check if I had a copy from the same file earlier)

- mirror the source directory structure to the destination

- create hardlinks in the destination directory structure for each source file; these should use the original file name but point to the canonical copy.

Then I got too scared to actually use it :)

kccqzy · 2025-02-26T15:53:34 1740585214

Hard links are not a suitable alternative here. When you deduplicate files, you typically want copy-on-write: if an app writes to one file, it should not change the other. Because of this, I would be extremely scared to use anything based on hard links.

In any case, a good design is to ask the kernel to do the dedupe step after user space has found duplicates. The kernel can double-check for you that they are really identical before doing the dedupe. This is available on Linux as the ioctl BTRFS_IOC_FILE_EXTENT_SAME.

rzzzt · 2025-02-28T07:03:01 1740726181

It was for me. I was using rsync with "--link-dest" earlier for this purpose, but that only works if the file is present in consecutive backups. I wanted to have the option of seeing a potentially different subset of files for each backup and saving disk space at the same time.

Restic and Borg can do this at the block level, which is more effective but requires the tool to be installed when I want to check out something.

620gelato · 2025-02-27T17:38:41 1740677921

Oh the sha-256 hashes are precisely what I used for a quick script I put together to parse through various backups of my work laptop in different places (tool changes and laziness). I had 10 different backups going back 4 years, and I wanted to make sure I - 1) preserved all unique files, 2) preserve the latest folder structure they showed up in.

Using sha256 was a no-brainer, at least for me.

pmarreck · 2025-02-25T18:58:26 1740509906

xxHash (or xxh3 which I believe is even faster) is massively faster than SHA256 at the cost of security, which is unnecessary here.

Of course, engineering being what it is, it's possible that only one of these has hardware support and thus might end up actually being faster in realtime.

PhilipRoman · 2025-02-25T21:14:39 1740518079

Blake3 is my favorite for this kind of thing. It's a cryptographic hash (maybe not the world's strongest, but considered secure), and also fast enough that in real world scenarios it performs just as well as non-crypto hashes like xx.

f1shy · 2025-02-25T18:09:50 1740506990

I think the prob. is not so low. I remember reading here about a person getting a foto of another chat in a chat application, which was using sha in the background. I do not recall all the details, it is improbable, but possible.

sgerenser · 2025-02-25T18:26:31 1740507991

LOL nope, I seriously doubt that was the result of a SHA256 collision.

kittoes · 2025-02-25T18:23:55 1740507835

The probability is truly, obscenely, low. If you read about a collision then you surely weren't reading about SHA256.

https://crypto.stackexchange.com/questions/47809/why-havent-...

amelius · 2025-02-25T18:17:09 1740507429

Or just use whatever algorithm rsync uses.