I wrote one of these for fun a while back using the following approach:
- Files are indexed by inode and device, files with the same inode + device are considered equal. (My main use case for this was to bundle up JS files.)
- Files are then indexed by size; only files with the same size are compared.
- During comparison, the files are read at block sizes increasing in powers of two, starting at 2 KB (see the sketch after this list). The blocks are hashed and compared; if they do not match, the comparison stops early, often without having to read the full file. If all the hashes are equal, the files are considered equal.
- Hashes are only computed when needed and cached in memory. Since the hash block size increases in powers of two, only a few dozen hashes are needed even for large files (reducing memory usage compared to a fixed hash block size).
link: https://github.com/mixu/file-dedupe
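A rough sketch of what that comparison step could look like (illustrative Python only, not the linked implementation; the cache layout and function names here are made up):

    import hashlib

    # Compare two same-size files by hashing blocks whose size doubles each
    # round (2 KB, 4 KB, 8 KB, ...), stopping at the first mismatch. Hashes
    # are cached so repeated comparisons against other candidates reuse them.
    _hash_cache = {}  # (path, block_index) -> digest

    def _block_hash(path, index, offset, size):
        key = (path, index)
        if key not in _hash_cache:
            with open(path, 'rb') as f:
                f.seek(offset)
                _hash_cache[key] = hashlib.md5(f.read(size)).digest()
        return _hash_cache[key]

    def probably_equal(path_a, path_b, total_size, first_block=2048):
        offset, size, index = 0, first_block, 0
        while offset < total_size:
            if _block_hash(path_a, index, offset, size) != _block_hash(path_b, index, offset, size):
                return False        # early exit without reading the rest
            offset += size
            size *= 2               # block size doubles, so even huge files
            index += 1              # need only a few dozen cached hashes
        return True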
I wrote mine along similar lines, except without using hashing at all. Files of identical size are compared byte-by-byte instead, until the first difference or end of file. As many files as possible are compared at a time, of course, to avoid reading through the same files multiple times. This avoids any uncertainty about hash collisions.
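A minimal sketch of that multi-way, hash-free comparison (my own illustration, not the commenter's code): open all same-size files, read them in lockstep, and split the group whenever their bytes diverge.

    from collections import defaultdict

    # Split a group of same-size files into identical subsets by reading them
    # chunk by chunk in lockstep; each file is read at most once, and only as
    # far as the first point where it differs from everything else.
    def group_identical(paths, chunk_size=64 * 1024):
        groups = [list(paths)]
        handles = {p: open(p, 'rb') for p in paths}   # beware fd limits for huge groups
        try:
            while True:
                next_groups, any_data = [], False
                for group in groups:
                    if len(group) < 2:                # already unique, stop reading it
                        next_groups.append(group)
                        continue
                    by_chunk = defaultdict(list)
                    for p in group:
                        chunk = handles[p].read(chunk_size)
                        by_chunk[chunk].append(p)
                        any_data = any_data or bool(chunk)
                    next_groups.extend(by_chunk.values())
                groups = next_groups
                if not any_data:                      # every live group hit EOF
                    return [g for g in groups if len(g) > 1]
        finally:
            for f in handles.values():
                f.close()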
A hash is often used for an online algorithm: if you know the hash, you know there is a potential for dedupe, and you can then do a byte-for-byte comparison. I suppose you could use size as a prefilter for dedupe. That is if you do dedupe at the file level. Dedupe at the block level doesn't care about the file as a whole, only about the blocks, and it's not unusual to see, for instance, MP3 files with the same audio stream but different metadata. You cannot do the latter without hashing.
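For contrast, a toy block-level pass might look like this (illustrative only; real block dedupe lives in the filesystem, and fixed blocks only catch identical data that lands on the same block boundaries):

    import hashlib

    # Toy block-level accounting: hash every fixed-size block across a set of
    # files and count how many blocks have been seen before. Candidate
    # duplicates would still want a byte-for-byte check (or a strong hash).
    def duplicate_block_stats(paths, block_size=4096):
        seen, total, dupes = set(), 0, 0
        for path in paths:
            with open(path, 'rb') as f:
                while True:
                    block = f.read(block_size)
                    if not block:
                        break
                    total += 1
                    digest = hashlib.sha256(block).digest()
                    if digest in seen:
                        dupes += 1
                    else:
                        seen.add(digest)
        return total, dupes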
I can think of a slightly more general solution to the problem of not being sure if A and B are different views on the same file or not, especially if you plan to hardlink one to the other: Hardlink A to a tempfile, hardlink the tempfile in place of both A and B, then remove the tempfile. This will give the right outcome regardless of whether A and B are secretly the same underlying file (assuming that they are on the same filesystem).
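Roughly like this in Python (a sketch under the same same-filesystem assumption; the temp name is made up, and the swap is not atomic, since each target briefly disappears between unlink and link):

    import os

    # Hardlink trick from the comment above: link A to a temp name so A's
    # inode stays alive, relink both paths to it, then drop the temp link.
    # Gives the right result whether or not A and B already share an inode.
    def merge_into_hardlinks(a, b):
        tmp = a + '.dedupe-tmp'        # hypothetical temp name, same filesystem
        os.link(a, tmp)                # tmp now shares A's inode
        try:
            for target in (a, b):
                os.unlink(target)      # like `ln -f`: drop the old entry...
                os.link(tmp, target)   # ...and point it at A's original inode
        finally:
            os.unlink(tmp)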
One of the odd things I've seen that separates a "good" admin from a "bad" admin is their reliance on MD5 as a checksum for file differences on anything in bulk. It's extremely rare to see a collision, but it definitely happens (I've seen it once, after a decently sized config change. Nobody believed me :( ). That's not to say it's not a useful hash, just that it should be used with caution.
While it's rare, do remember that hash collisions need to be thought of in birthday-paradox terms, which does increase the chance of observing a collision significantly. Hitting MD5's birthday point (50% chance of collision) requires a couple tens of quintillions of hashes, but that's probably a lot smaller number than people expect, and a not-unreasonable number to see in a sufficiently large-scale data archiving system.
(meanwhile there's stuff in the wild using things like Java hashCode() to determine identities, and the birthday point on that is only around 75k IIRC)
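For anyone who wants to check those figures, the usual 50% birthday point for N possible values is about 1.1774·sqrt(N):

    import math

    # 50% birthday point: roughly 1.1774 * sqrt(N) random draws from N values.
    def hashes_for_50pct_collision(bits):
        return 1.1774 * math.sqrt(2.0 ** bits)

    print(hashes_for_50pct_collision(128))   # MD5: ~2.2e19 (tens of quintillions)
    print(hashes_for_50pct_collision(32))    # Java hashCode: ~77,000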
"probably a lot smaller number than people expect"
If you crunch your way through files (and let's assume they're random from the hash function's point of view) for 10 years, you have to hash 60,000,000,000 files per second to have a 50% chance of finding a collision. If the average file size is 1 KB, that is ~56 TB of data every single second.
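The arithmetic, for anyone who wants to reproduce it (taking the 50% point as roughly 2^64 hashes):

    SECONDS_PER_10_YEARS = 10 * 365 * 24 * 3600
    per_second = 2 ** 64 / SECONDS_PER_10_YEARS
    print(per_second)                        # ~5.8e10, i.e. ~60 billion files/second
    print(per_second * 1024 / 2 ** 40)       # ~55 TiB/second at 1 KiB per file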
MD5 is not bad as long as you're not expecting malicious input or gigantic datasets.
Well, a couple years ago it was reported that YouTube alone receives 300 hours' worth of video uploads every minute. That's without considering other social-media properties with file-upload capability, or PaaS offerings like AWS.
So I'd say it's certain that the requisite number of MD5 hashes have been generated across the aggregate of those systems (assuming enough of them use MD5 in some fashion), and I'd personally bet money there's at least one real-world deployment out there that's handled enough all on its own to hit the birthday point.
Humans really cannot fathom the magnitude of the numbers involved here. 2^64 is an astronomically large number; it's mind-bogglingly huge. Really, I will put money on the claim that YouTube has not seen a collision yet. Let me explain:
Let's assume it's all HD at 2 GB per hour; that is ~300 PB of video per year. Storing all that video surely uses large-capacity hard drives with a 4 KB sector size, so that is the smallest unit it makes sense to dedupe on.
That is ~82,726 billion blocks per year.
To have a 50% probability of a collision we need roughly 2^64 blocks, so we need ~222,985 years of video before the probability climbs above 50%.
Let's move away from YouTube for a moment and look at IP traffic as a whole. Cisco estimated total IP traffic at 27,483 PB per month in 2011, and the growth rate has been shrinking every year: from 2010 to 2011 it was only 34%, down from 65% between 2005 and 2006. If I assume 30% growth from 2011 onward, we end up with ~291 exabytes per month, or ~3,497 exabytes per year, in 2020. This is total IP traffic and will include a lot of duplicated data (e.g. Netflix streams). Again assuming 4 KB blocks, the total estimated IP traffic in 2020 amounts to ~961×10^15 blocks. We need 2^64 blocks to have a 50% probability of a collision; that is more than 19 times the total estimated IP traffic in 2020, the vast majority of which is duplicated.
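A quick reproduction of those estimates (same assumptions as above: 300 hours of video uploaded per minute at 2 GB/hour, 4 KiB blocks, 27,483 PB/month in 2011 growing 30% per year, binary units throughout):

    MINUTES_PER_YEAR = 60 * 24 * 365

    yt_bytes_per_year = 300 * 2 * 2 ** 30 * MINUTES_PER_YEAR   # 300 h/min * 2 GB/h
    yt_blocks_per_year = yt_bytes_per_year / 4096              # 4 KiB blocks
    print(yt_bytes_per_year / 2 ** 50)       # ~300 PB of video per year
    print(yt_blocks_per_year)                # ~8.3e13 (~83,000 billion) blocks/year
    print(2 ** 64 / yt_blocks_per_year)      # ~223,000 years to reach 2^64 blocks

    ip_pb_per_month_2020 = 27483 * 1.3 ** 9                    # 30% growth, 2011-2020
    ip_blocks_2020 = ip_pb_per_month_2020 * 12 * 2 ** 50 / 4096
    print(ip_blocks_2020)                    # ~9.6e17 blocks in all of 2020's IP traffic
    print(2 ** 64 / ip_blocks_2020)          # 2^64 is ~19x that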
My claim is that unless you're Google, MD5 is fine. If you're Google, MD5 is still fine, but you should be looking to prepare your system to accept a hash with a larger output space.
Now spread that across 10 million system admins. An actual MD5 collision found in the wild is vastly more likely to be brute-forced, or a bug. However, it's reasonably likely to have happened.
No, I used a reasonable approximation. Sqrt(2^128) is roughly in the ballpark of what is needed to have a 50% chance of a collision; in fact it is slightly lower than the actual number. Incidentally, sqrt(2^128) is 2^64, which is the number I used.
Obviously a contrived example, but given the above and other weaknesses, MD5 should not be considered safe for verifying file contents (at least from a security perspective).
Yeah, MD5 has been known to be broken if you can append binary data. (Not sure if that's been "beaten", maybe you don't need full binary data any longer.)
... but accidental collisions? That's still hugely more likely to be "you're misremembering" or "you misread" than a real collision.
Like others are saying, that's extremely unlikely to happen by chance, near impossible. If I saw such a collision, I'd strongly suspect the collision was intentional and malicious. Edit: After discounting bugs in whatever software found the collision. It could be a bug in the script.
I will be honest, dude: I don't believe a legit config file change could result in a collision (appending random comments until a collision is found is not a legit edit, that's brute-forcing).
But I do believe you thought it did, if that's any consolation ;)
The 128 bit checksum, because the real entropy of the file size is closer to 10-20 bits (most files aren't very large, and large files are sometimes multiples of common sizes like 1 KB). That's assuming a non-malicious, non-pathological system.
I am pretty sure that this is a stupid question, but what about a 127 bit checksum and then 1 bit which shows if the filesize in bytes is even or odd vs a normal 128 bit checksum?
Good question. On OS X it seems the size of a directory is typically a multiple of 102 bytes (often 102 or 306). I ran my script again with "find / -type f" to select only regular files, and it showed file sizes ending in 2 being just slightly more common than 0. Odd sizes are still less common than even ones, though, so the overall conclusion is unchanged regardless of whether you consider checksumming of directory entries to be a valid application.
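For reference, a rough equivalent of that kind of survey (not the commenter's script; it just tallies the last decimal digit and the parity of regular-file sizes under a root directory):

    import os
    import stat
    import sys
    from collections import Counter

    # Tally the last decimal digit and the parity of regular-file sizes,
    # roughly what a `find / -type f`-based survey would look at.
    def size_stats(root):
        digits, parity = Counter(), Counter()
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                try:
                    st = os.lstat(os.path.join(dirpath, name))
                except OSError:
                    continue                      # vanished or unreadable entry
                if not stat.S_ISREG(st.st_mode):
                    continue                      # regular files only
                digits[st.st_size % 10] += 1
                parity['even' if st.st_size % 2 == 0 else 'odd'] += 1
        return digits, parity

    if __name__ == '__main__':
        digits, parity = size_stats(sys.argv[1] if len(sys.argv) > 1 else '/')
        print(digits.most_common())
        print(parity)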
A friend wrote this a long time ago when he was a young C++ coder. It's hosted on another friend's GitHub, but the author is John Brooks, the author of Ricochet.
one simple option might be to just use sha256 sums on files. if done incrementally, it's not that expensive computationally, unless files are very large (couple of gb's).
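A minimal version of that incremental hashing (reading in chunks so memory stays flat even for multi-GB files):

    import hashlib

    # Streaming SHA-256: hash a file in fixed-size chunks instead of reading
    # it into memory all at once.
    def sha256_file(path, chunk_size=1024 * 1024):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()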
one way to mitigate that might be take a _few_ random, say, 1k chunks of file content, and hash those. as long as you have a map of (offset, hash) de-duping should be just fine.
the only thing is, i don't really know, what an upper bound on 'few samples' would be like. asymptotically, it should approach file-size/chunk-size, but ideally, it should be way less than that.
any ideas on how to determine a minimal but complete set?
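One way that sampling idea could look (purely illustrative; the offsets here are made deterministic from the file size so equal-size files sample the same positions, and the sample count of 8 is arbitrary, which is exactly the open question above):

    import hashlib
    import os

    # Fingerprint a file from a handful of 1 KiB chunks at offsets derived
    # from the file size, returning (offset, hash) pairs. Matching
    # fingerprints still only mean "probably equal".
    def sampled_fingerprint(path, samples=8, chunk_size=1024):
        size = os.path.getsize(path)
        offsets = sorted({(size * i) // samples for i in range(samples)})
        with open(path, 'rb') as f:
            fingerprint = []
            for offset in offsets:
                f.seek(offset)
                fingerprint.append((offset, hashlib.sha256(f.read(chunk_size)).hexdigest()))
        return tuple(fingerprint)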
The minimal set is to use full content of the file.
Consider the case of a log file and its backup made one short log message earlier.
If you want to gamble on using only a partial hash, I would hash the last few kb of each file, as that (at least based on this 30-second 'analysis' of the problem domain) likely would work better in practice. Theoretically, it is as vulnerable, though.
Edit: if you have a VM image on your disk, and that small append happens to a log file in the image, that small change can happen anywhere. It also can be even smaller; someone might run 'touch' on an already existing file. Theoretically, that could be a one-byte change.
Many filesystems compress data. Unless the compression is done at the block level, reading the last bytes of a file that is transparently compressed is much more expensive than reading the first part.