You can start with the size, which is likely to be nearly unique. That would cut down the search space fast.
At that point maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash anyway, and if you just compare the bytes there is no chance of hash collision, however small.
Plus if you find a difference at byte 1290 you can just stop there instead of reading the whole thing to finish the hash.
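A minimal Python sketch of that early-exit comparison (the function name and the 64 KiB chunk size are just illustrative choices, not from any of the tools discussed here):

```python
def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files byte by byte, stopping at the first differing chunk."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # difference found: stop without reading the rest
            if not chunk_a:
                return True   # both files ended at the same point with no mismatch
```

In practice you'd only call this on files that already matched on size, so the early exit mostly pays off when two same-sized files diverge early.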
I don’t think John has said exactly how it works on ATP (his podcast with Marco and Casey), but as a longtime listener/reader I know he’s being very careful. And I think he’s said as much on the podcast too.
To make dedup[0] fast, I use a tree with device id, size, first byte, last byte, and finally SHA-256. Each of those is only used if there is a collision to avoid as many reads as possible. dedup doesn’t do a full file compare, because if you’ve found a file with the same size, first and last bytes, and SHA-256 you’ve also probably won the lottery several times over and can afford data recovery.
This is the default for ZFS deduplication, and git does something similar with size and the far weaker SHA-1. I would add a test for SHA-256 collisions, but no one seems to have found a working example yet.
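Not how dedup actually implements it (the comment above describes a tree), but a rough Python sketch of the same cascading idea, where each more expensive attribute is only computed for files that already collide on all the cheaper ones; every name here is hypothetical:

```python
import hashlib
import os
from collections import defaultdict

def cheap_key(path):
    """Device id and size, available from a single stat() call."""
    st = os.stat(path)
    return (st.st_dev, st.st_size)

def edge_bytes(path):
    """First and last byte: two tiny reads, no full scan."""
    with open(path, "rb") as f:
        first = f.read(1)
        if not first:
            return (b"", b"")  # empty file
        f.seek(-1, os.SEEK_END)
        last = f.read(1)
    return (first, last)

def sha256_of(path, chunk=1 << 20):
    """Full-content hash, only worth paying for once the cheaper keys collide."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(paths):
    by_cheap = defaultdict(list)
    for p in paths:
        by_cheap[cheap_key(p)].append(p)

    dupes = []
    for group in by_cheap.values():
        if len(group) < 2:
            continue  # unique device/size: no file content is read at all
        by_edges = defaultdict(list)
        for p in group:
            by_edges[edge_bytes(p)].append(p)
        for candidates in by_edges.values():
            if len(candidates) < 2:
                continue  # edge bytes already differ
            by_hash = defaultdict(list)
            for p in candidates:
                by_hash[sha256_of(p)].append(p)
            dupes.extend(g for g in by_hash.values() if len(g) > 1)
    return dupes
```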
How much time is saved by not comparing full file contents? Given that this is a tool some people will only run occasionally, having it take 30 seconds instead of 15 is a small price to pay for ensuring it doesn't treat two differing files as equal.
FWIW, when I wrote a tool like this I used same size + some hash function, not MD5 but maybe SHA1, don't remember. First and last bytes is a good idea, didn't think of that.
Yeah, there is definitely some merit to more efficient hashing. Trees with a lot of duplicates require a lot of hashing, but hashing the entire file would be required regardless of whether partial hashes are done or not.
I have one data set where `dedup` was 40% faster than `dupe-krill` and another where `dupe-krill` was 45% faster than `dedup`.
`dupe-krill` uses blake3, which, last I checked, was not hardware accelerated on M-series processors. What's interesting is that because of hardware acceleration, `dedup` is mostly CPU-idle, waiting on the hash calculation, while `dupe-krill` is maxing out 3 cores.
I wonder what the distribution of file sizes looks like in practice. I know certain file types tend to cluster in specific size ranges.
>maybe it’s better to just compare byte by byte? You’ll have to read the whole file to generate the hash
Definitely, for comparing any two files. But if you're searching for duplicates across the entire disk, then each file is potentially compared against many others, so hashing it once on the first pass could conceivably be more efficient.
>if you just compare the bytes there is no chance of hash collision
You could then compare hashes and, only in the exceedingly rare case of a collision, do a byte-by-byte comparison to rule out false positives.
But, if your first optimization (the file size comparison) really does dramatically reduce the search space, then you'd also dramatically cut down on the number of re-comparisons, meaning you may be better off not hashing after all.
You could probably run the file size check, then based on how many comparisons you'll have to do for each matched set, decide whether hashing or byte-by-byte is optimal.
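A rough sketch of that adaptive choice, with a hypothetical `pairwise_threshold` cutoff (the right value is something you'd measure, not guess). Small size buckets get early-exit byte comparisons via the standard library's `filecmp`; larger ones get hashed once per file. Note the small-bucket branch yields matching pairs while the hash branch yields whole groups:

```python
import filecmp
import hashlib
import os
from collections import defaultdict
from itertools import combinations

def content_hash(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def duplicates(paths, pairwise_threshold=3):
    # Group by size first; only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    results = []
    for bucket in by_size.values():
        if len(bucket) < 2:
            continue  # unique size: nothing to compare
        if len(bucket) <= pairwise_threshold:
            # Small bucket: n*(n-1)/2 byte-by-byte comparisons, each can stop early.
            results.extend([a, b] for a, b in combinations(bucket, 2)
                           if filecmp.cmp(a, b, shallow=False))
        else:
            # Large bucket: hash each file once instead of comparing every pair.
            by_hash = defaultdict(list)
            for p in bucket:
                by_hash[content_hash(p)].append(p)
            results.extend(g for g in by_hash.values() if len(g) > 1)
    return results
```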
To have a mere one-in-a-billion chance of getting a SHA-256 collision, you'd need to spend 160 million times more energy than the total annual energy production of our planet (and that's assuming our best bitcoin mining efficiency; actual file hashing needs far more energy per hash).
The probability of a collision is so astronomically small that if your computer ever observed a SHA-256 collision, it would certainly be due to a CPU or RAM failure (bit flips are within the range of probabilities that actually happen).
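For a sense of scale, the standard birthday-bound approximation p ≈ n² / 2^(b+1) for a b-bit hash (just a back-of-the-envelope check, not the exact energy figures above):

```python
def collision_probability(n_files, hash_bits=256):
    """Birthday-bound approximation: p ≈ n^2 / 2^(bits + 1)."""
    return n_files ** 2 / 2 ** (hash_bits + 1)

print(collision_probability(10 ** 9))   # ~4.3e-60 for a billion hashed files
print(collision_probability(10 ** 18))  # still only ~4.3e-42 for a billion billion
```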
You know, I've been hearing people warn of handling potential collisions for years and knew the odds were negligible, but never really delved into it in any practical sense.