Why it's hard to write a dupefinder (readthedocs.org)
90 points by SXX on April 25, 2016 | 44 comments



I wrote one of these for fun a while back using the following approach:

- Files are indexed by inode and device, files with the same inode + device are considered equal. (My main use case for this was to bundle up JS files.)

- Files are then indexed by size; only files with the same size are compared.

- During comparison, the files are read at block sizes increasing in powers of two, starting with 2k. The blocks are hashed and compared, and if they do not match the comparison is stopped early (often without having to read the full file). If all the hashes are equal, then the files are considered to be equal (sketched below).

- Hashes are only computed when needed and cached in memory. Since the hash block size increases in powers of two, only a few dozen hashes are needed even for large files (reducing memory usage compared to a fixed hash block size).

link: https://github.com/mixu/file-dedupe
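
Roughly, in Python (just a sketch of the scheme described above, not the actual file-dedupe code, which is JavaScript; function names are made up and the in-memory hash cache is omitted):

  import hashlib

  def block_hashes(path, first_block=2048):
      # Yield hashes of successive blocks whose sizes double: 2k, 4k, 8k, ...
      size = first_block
      with open(path, 'rb') as f:
          while True:
              block = f.read(size)
              if not block:
                  return
              yield hashlib.sha1(block).hexdigest()
              size *= 2

  def same_content(a, b):
      # Assumes a and b are already known to have the same size; stops at the
      # first mismatching block, so differing files are rarely read in full.
      return all(x == y for x, y in zip(block_hashes(a), block_hashes(b)))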


I wrote mine along similar lines, except without using hashing at all. Files of identical size are compared byte-by-byte instead, until the first difference or end of file, and as many files at a time as possible, of course, to avoid reading through files multiple times (roughly as sketched below). This avoids any uncertainty about hash collisions.

To find out how many files of each size you have:

find ~ -type f -printf '%s\n' | sort | uniq -c | sort -n
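
A sketch of the byte-by-byte grouping in Python (names and chunk size are arbitrary; it assumes every file in the group already has the same size, and it keeps one open handle per file, which a real tool would have to cap):

  def exact_duplicate_groups(paths, chunk=1 << 16):
      # Read all candidates in lockstep; files whose next chunk differs get
      # split into separate groups, so each file is read at most once.
      handles = {p: open(p, 'rb') for p in paths}
      try:
          groups = [list(paths)]
          while groups:
              next_groups = []
              for group in groups:
                  buckets = {}
                  for p in group:
                      buckets.setdefault(handles[p].read(chunk), []).append(p)
                  for data, members in buckets.items():
                      if len(members) < 2:
                          continue          # unique prefix: not a duplicate
                      elif data == b'':
                          yield members     # reached EOF together: identical
                      else:
                          next_groups.append(members)
              groups = next_groups
      finally:
          for f in handles.values():
              f.close()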


A hash is often used for an online algorithm: if you know the hash, you know there is a potential for dedupe, and you can then do a byte-for-byte comparison. I suppose you could use size as a prefilter for dedupe. This is if you dedupe at the file level. Dedupe at the block level doesn't care about the content, only the blocks, and it's not unusual to see, for instance, mp3 files with the same mp3 stream but different metadata. You cannot do the latter without hashing.
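
Something like this, as a rough sketch (the dict and function name are made up):

  import hashlib, filecmp

  seen = {}   # full-content hash -> first path seen with that hash

  def online_check(path):
      # A matching hash only flags a candidate; the byte-for-byte comparison
      # is what actually proves the two files are duplicates.
      with open(path, 'rb') as f:
          digest = hashlib.sha256(f.read()).hexdigest()
      if digest in seen and filecmp.cmp(seen[digest], path, shallow=False):
          return seen[digest]   # path of the file this one duplicates
      seen.setdefault(digest, path)
      return None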


The "compare by front first" method sounds quite like a prefix tree.


I can think of a slightly more general solution to the problem of not being sure if A and B are different views on the same file or not, especially if you plan to hardlink one to the other: Hardlink A to a tempfile, hardlink the tempfile in place of both A and B, then remove the tempfile. This will give the right outcome regardless of whether A and B are secretly the same underlying file (assuming that they are on the same filesystem).
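
In Python the recipe might look roughly like this (names are illustrative, error handling is omitted, and the temp name must be on the same filesystem):

  import os, uuid

  def merge_into_one_inode(a, b):
      tmp = a + '.' + uuid.uuid4().hex   # hypothetical temp name next to A
      os.link(a, tmp)                    # A's inode now has a name we control
      for path in (a, b):
          os.unlink(path)                # safe: the data still exists under tmp
          os.link(tmp, path)             # reattach the surviving inode here
      os.unlink(tmp)                     # drop the temporary name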


The approach that http://www.pixelbeat.org/fslint takes is to md5 and sha1 the files to avoid false positives.

The core of the duplicate detection is the shell script here:

https://github.com/pixelb/fslint/blob/ffcd3b85/fslint/findup...


One of the odd things I've seen that distinguishes a "good" admin from a "bad" admin is their reliance on md5s as a checksum for file differences for anything in bulk. It's extremely rare to see a collision, but it definitely happens (I've seen it once after a decently sized config change. Nobody believed me :( ). That's not to say it's not a useful hash, just that it should be used with caution.


You saw a full 128 bit collision on real files? They were right not to believe you...


While it's rare, do remember that hash collisions need to be thought of in birthday-paradox terms, which does increase the chance of observing a collision significantly. Hitting MD5's birthday point (50% chance of collision) requires a couple tens of quintillions of hashes, but that's probably a lot smaller number than people expect, and a not-unreasonable number to see in a sufficiently large-scale data archiving system.

(meanwhile there's stuff in the wild using things like Java hashCode() to determine identities, and the birthday point on that is only around 75k IIRC)
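
For an ideal N-bit hash the 50% point is roughly sqrt(2 * ln 2 * 2^N); a quick check of both figures:

  import math

  def birthday_point(bits):
      # Roughly how many random values you need before a collision is ~50% likely.
      return math.sqrt(2 * math.log(2) * 2 ** bits)

  print(f"{birthday_point(128):.2e}")   # ~2.2e19 for MD5: a couple tens of quintillions
  print(f"{birthday_point(32):.0f}")    # ~77,000 for a 32-bit Java hashCode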


"probably a lot smaller number than people expect"

If you crunch your way through files (and let's assume they're random from the hash function's point of view) for 10 years, you have to calculate the hash of 60,000,000,000 files per second to have a 50% chance of finding a collision. If the average file size is 1 KB, that is ~56 TB of data every single second.

MD5 is not bad as long as you're not expecting malicious input or gigantic datasets.
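
A quick sanity check of that arithmetic (2^64 is roughly the MD5 birthday point, as discussed above):

  hashes = 2 ** 64                         # roughly the MD5 birthday point
  seconds = 10 * 365.25 * 24 * 3600        # ten years
  print(hashes / seconds)                  # ~5.8e10 files hashed per second
  print(hashes / seconds * 1024 / 1e12)    # ~60 TB/s at 1 KB per file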


Well, a couple years ago it was reported that YouTube alone receives 300 hours' worth of video uploads every minute. That's without considering other social-media properties with file-upload capability, or PaaS offerings like AWS.

So I'd say it's certain that the requisite number of MD5 hashes have been generated across the aggregate of those systems (assuming enough of them use MD5 in some fashion), and I'd personally bet money there's at least one real-world deployment out there that's handled enough all on its own to hit the birthday point.


Humans really cannot fathom the magnitude of the numbers involved here. 2^64 is an astronomically large number; it's mind-bogglingly huge. Really, I will put money on the claim that YouTube has not seen a collision yet. Let me explain:

300 hours every minute equals 300 * 60 * 24 * 365.25 = 157,788,000 hours / year.

Let's assume it's all HD at 2 GB an hour; that is ~300 PB / year. Storing all that video surely uses large-capacity hard drives with a 4k sector size, so that is the smallest unit it makes sense to perform dedupe on.

That is ~82726 billion blocks / year.

To have 50% probability of a collision, we need to calculate 2^64 blocks, so we need 222985 years of video before there is more than 50% probability.

Let's move away from YouTube for a moment and look at IP traffic as a whole. Cisco estimated total IP traffic at 27,483 PB / month in 2011, and the growth rate has been shrinking every year: from 2010 to 2011 it was only 34%, down from 65% between 2005 and 2006. If I assume 30% growth from 2011 onward, we end up with ~291 exabytes / month, or ~3,497 exabytes / year, in 2020. This is total IP traffic and will include a lot of duplicated data (e.g. Netflix streams). Again, let's assume a 4 kB block: the total estimated IP traffic in 2020 comes to ~961 * 10^15 blocks. We need 2^64 blocks to have a 50% probability of a collision, which is more than 19 times the total estimated IP traffic in 2020, the vast majority of which is duplicated anyway.

My claim is that unless you're Google, MD5 is fine. If you're Google, MD5 is still fine, but you should be preparing your system to accept a hash with a larger output space.
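
The YouTube arithmetic above, spelled out (assuming binary GiB/PiB units, which is what the ~82726 billion block figure implies):

  hours_per_year  = 300 * 60 * 24 * 365.25      # ~157.8 million hours of uploads
  bytes_per_year  = hours_per_year * 2 * 2**30  # at 2 GiB/hour: roughly 300 PiB
  blocks_per_year = bytes_per_year / 4096       # ~8.3e13 4 KiB blocks
  print(2 ** 64 / blocks_per_year)              # ~223,000 years to hit the birthday point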


Now spread that across 10 million system admins. An actual MD5 collision in the wild is vastly more likely to be brute-forced, or a bug. However, it's reasonably likely to have happened.


See my reply to the sibling comment. I don't believe we've seen an accidental collision yet.


Your math is off. You tried to calculate the odds that the next block would collide; you want the odds that any of them would collide. https://en.wikipedia.org/wiki/Birthday_problem

X!/((X^n) * (X-n)!) (the probability of no collision among n values drawn from X possibilities)

It only takes 23 people to have a 50/50 shot of sharing a birthday.

On the plus side, MD5 is 128 bits.


No, I used a reasonable approximation. sqrt(2^128) is roughly in the ballpark of what is needed to have a 50% chance of a collision; in fact it is slightly lower than the actual number. Incidentally, sqrt(2^128) is 2^64, which is the number I used.


Huh. I may stand corrected. I just looked up the odds, and it's uh. A lot higher probability than I thought. Heh.

Makes me wonder now. Might've been a rogue rsync at the time or something, then. /shrug.


http://natmchugh.blogspot.com/2014/10/how-i-made-two-php-fil...

https://s3-eu-west-1.amazonaws.com/md5collisions/a.php https://s3-eu-west-1.amazonaws.com/md5collisions/b.php

Obviously a contrived example, but given the above and other weaknesses md5 should not be considered safe for verifying file contents (at least from a security perspective).


Yeah, MD5 has been known to be broken if you can append binary data. (Not sure if that's been "beaten", maybe you don't need full binary data any longer.)

... but accidental collisions? That's still hugely more likely to be "you're misremembering" or "you misread" than a real collision.


Like others are saying, that's extremely unlikely to happen by chance, near impossible. If I saw such a collision, I'd strongly suspect the collision was intentional and malicious. Edit: After discounting bugs in whatever software found the collision. It could be a bug in the script.


I will be honest dude, I don't believe a legit config file change could result in a collision (where appending random comments until a collision is found is not a legit edit, that's bruteforcing)

But I do believe you thought it did, if that's any consolation ;)

... but I would love to be proved wrong.


Show the collision or it didn't happen. Sure it wasn't sum instead of md5sum?


Read my comment 30 minutes before yours. I'm pretty much conceding. This was also 5 years ago on a rhel4 machine, fwiw.


"Crypto collision used to hijack Windows Update goes mainstream"

http://www.theregister.co.uk/2014/11/05/md5_hash_collision/


Mind that that was intentional. Due to known weaknesses in MD5, it's much easier to engineer a collision than it is to come across one accidentally.


MD5 + File byte size should be a reliable indicator that two files are actually the same.


What is more unique?

A 64 bit checksum + 64 bit filesize or a 128 bit checksum?


The 128 bit checksum, because the real entropy of the file size is closer to 10-20 bits (most files aren't very large, and large files are sometimes multiples of common sizes like 1 KB). That's assuming a non-malicious, non-pathological system.


I am pretty sure that this is a stupid question, but what about a 127 bit checksum and then 1 bit which shows if the filesize in bytes is even or odd vs a normal 128 bit checksum?


You're still giving up entropy because even-numbered file sizes are quite a bit more common:

    $ find / -ls | awk '{print $7}' >/tmp/filesizes
    $ rev /tmp/filesizes | cut -b1 | sort -n | uniq -c | sort -n
    5718 5
    5816 3
    5823 1
    5901 9
    6463 7
    7549 4
    7958 8
    8240 0
    8991 6
    14507 2


Very cool. I just tried it on my own machine and got this (archlinux with lots and lots of python libraries/data junk):

Changed the command up a bit since it was including zero byte files and directories:

  sudo find / -size +0 -type f -ls | awk '{print $7,$NF}' > /tmp/filesizes
 110920 7
 111306 5
 111312 1
 111345 9
 111362 3
 130138 0
 130511 4
 131528 8
 139000 2
 152213 6
Some other interesting things (dirname is too slow...):

  cat filesizes | grep "6 " | awk '{print $NF}' | sed 's/\/[^/]*$//' | sort | uniq -c | sort -nr | head
   1607 /usr/share/man/man3
    906 /home/matt/.cache/mozilla/firefox/8jc2n3qa.default/cache2/entries
    825 /usr/bin
    808 /home/matt/.tmux/resurrect
Then again a lot of this is flawed because it's looking at /sys. Too lazy to fix it now :)


Any idea why 2 is so common? This same trend shows up on my system and seems to be caused by 102-byte files.


Good question. On OS X it seems the size of a directory is typically a multiple of 102 bytes (often 102 or 306). I ran my script again with "find / -type f" to select only regular files, and it showed file sizes ending in 2 being just slightly more common than 0. Odd sizes are still less common than even ones, though, so the overall conclusion is unchanged regardless of whether you consider checksumming of directory entries to be a valid application.


Thanks for the nice answer!


Still not a guarantee, though...


A friend wrote this a long time ago when he was a young C++ coder. It's hosted on another friend's GitHub, but the author is John Brooks, the author of Ricochet.

https://github.com/rburchell/fastdup

:-)


I did some quick checks to investigate how size helps with deduplication.

Size is a fast initial differentiator because scanning for file size only needs to read metadata, but how well does it differentiate?

  # all files
  find ~ -type f | wc -l
  1098045
  # files with a unique size
  find ~ -type f -printf '%s\n'|sort|uniq -c|grep ' 1 '|wc -l
  48437
  # most prevalent sizes (prevalence vs size)
  find ~ -type f -printf '%s\n'|sort |uniq -c |sort -nr|head
  36972 0
  12707 24
   9480 22
   8036 39
   5025 12
   4804 26
   3879 6
   3232 76
   2946 38
   2855 20
In my home dir, less than 5% of files have a unique size.


We wrote such dupefinders as a fun hackathon contest organized by the Dallas Perl user group in 2013:

http://perlmonks.org/?node_id=1067570

discussion at http://mail.pm.org/pipermail/dfw-pm/2013-December/thread.htm...

=> http://dfw.pm.org/winners.html

It is really not easy. I was the only one who used Bloom hashes, which explains why mine was the fastest and smallest, but unfortunately not correct.
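
For reference, a Bloom filter only answers "maybe" or "definitely not", so a dupefinder that trusts it without a follow-up comparison is fast and tiny but can report false duplicates. A minimal sketch (illustrative only, not the contest code, which was Perl):

  import hashlib

  class Bloom:
      # Stores set membership for byte strings (e.g. per-file content hashes).
      def __init__(self, m_bits=1 << 20, k=4):
          self.m, self.k = m_bits, k
          self.bits = bytearray(m_bits // 8)

      def _positions(self, data):
          # Derive k bit positions from salted digests of the data.
          for i in range(self.k):
              digest = hashlib.sha1(bytes([i]) + data).digest()
              yield int.from_bytes(digest[:8], 'big') % self.m

      def add(self, data):
          for pos in self._positions(data):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def __contains__(self, data):
          # False positives are possible; false negatives are not.
          return all(self.bits[pos // 8] & (1 << (pos % 8))
                     for pos in self._positions(data))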


One simple option might be to just use SHA-256 sums on files. If done incrementally, it's not that expensive computationally, unless files are very large (a couple of GBs).

One way to mitigate that might be to take a _few_ random, say, 1k chunks of file content and hash those. As long as you have a map of (offset, hash), de-duping should be just fine.

The only thing is, I don't really know what an upper bound on "a few samples" would be. Asymptotically, it should approach file-size/chunk-size, but ideally it should be way less than that.

Any ideas on how to determine a minimal but complete set?
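
One rough sketch of what that sampling could look like (names are made up, the offsets come from a fixed seed so equal files get equal fingerprints, and as the replies point out it is only a heuristic):

  import hashlib, os, random

  def sampled_fingerprint(path, samples=8, chunk=1024, seed=0):
      size = os.path.getsize(path)
      h = hashlib.sha256(str(size).encode())      # mix in the size itself
      if size <= chunk:
          offsets = [0]                           # small file: just hash it all
      else:
          rng = random.Random(seed)               # fixed seed: same offsets for same-sized files
          offsets = sorted(rng.randrange(size - chunk) for _ in range(samples))
      with open(path, 'rb') as f:
          for off in offsets:
              f.seek(off)
              h.update(str(off).encode() + f.read(chunk))
      return h.hexdigest()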


The minimal set is to use the full content of the file.

Consider the case of a log file and its backup made one short log message earlier.

If you want to gamble on using only a partial hash, I would hash the last few kb of each file, as that (at least based on this 30-second 'analysis' of the problem domain) likely would work better in practice. Theoretically, it is as vulnerable, though.

Edit: if you have a VM image on your disk, and that small append happens to a log file in the image, that small change can happen anywhere. It also can be even smaller; someone might run 'touch' on an already existing file. Theoretically, that could be a one-byte change.


The idea is to use a two stage process:

1. take a sha1 hash of the last 4KB of each file.

2. for any equal hashes, compare the whole file.

With this method you should be able to skip reading many large files in their entirety.
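
Roughly, as a sketch (a real tool would cache the tail hashes and avoid the pairwise full compares):

  import hashlib, os, filecmp
  from collections import defaultdict

  def tail_hash(path, n=4096):
      # Stage 1: hash only the last 4 KB of the file.
      with open(path, 'rb') as f:
          f.seek(max(os.path.getsize(path) - n, 0))
          return hashlib.sha1(f.read(n)).hexdigest()

  def duplicates(paths):
      candidates = defaultdict(list)
      for p in paths:
          candidates[tail_hash(p)].append(p)
      for group in candidates.values():
          for i, a in enumerate(group):
              for b in group[i + 1:]:
                  # Stage 2: equal tail hashes only make a pair a candidate;
                  # confirm with a full byte-for-byte comparison.
                  if filecmp.cmp(a, b, shallow=False):
                      yield a, b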


That is an idea, and it would, in typical cases, avoid reading most large files in their entirety, but it is not what I read in signa11's comment.


Many filesystems compress data. Unless the compression is done at the block level, reading the last bytes of a file that is transparently compressed is much more expensive than reading the first part.


A length check would catch the log file case. But really, just hash the whole file.

As an optimization, you might want to check for files with holes in them ("sparse" files). Not sure how to do that, though.
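
One possible heuristic: on Linux, st_blocks counts 512-byte blocks actually allocated, so a file that occupies far less space than its length suggests probably has holes (transparent compression can trigger this too). A rough sketch:

  import os

  def looks_sparse(path):
      st = os.stat(path)
      # Allocated space well below the apparent length suggests holes.
      return st.st_blocks * 512 < st.st_size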





