
Well, a couple years ago it was reported that YouTube alone receives 300 hours' worth of video uploads every minute. That's without considering other social-media properties with file-upload capability, or PaaS offerings like AWS.

So I'd say it's certain that the requisite number of MD5 hashes have been generated across the aggregate of those systems (assuming enough of them use MD5 in some fashion), and I'd personally bet money there's at least one real-world deployment out there that's handled enough all on its own to hit the birthday point.




Humans really cannot fathom the magnitude of the numbers involved here. 2^64 is an astronomically large number; it's mind-bogglingly huge. Really, I will put money on the claim that YouTube has not seen a collision yet. Let me explain:

300 hours every minute equals 300 * 60 * 24 * 365.25 = 157,788,000 hours / year.

Let's assume it's all HD at 2 GB an hour; that is ~300 PB / year. Storing all that video surely uses large-capacity hard drives with a 4K sector size, so that is the smallest unit it makes sense to perform dedupe on.

That is ~82726 billion blocks / year.

To have a 50% probability of a collision, we need to hash about 2^64 blocks, so we need 222,985 years of video before there is more than a 50% probability.
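For anyone who wants to check the YouTube arithmetic, here is a rough Python sketch. The assumptions are mine: binary units (1 GB = 2^30 bytes), 4096-byte blocks, every block unique, and MD5 behaving like an ideal 128-bit hash. The variable names and the `collision_probability` helper are just for illustration.

```python
import math

HASH_BITS = 128        # MD5 output size
BLOCK_SIZE = 4096      # assumed dedupe unit: one 4K sector
GIB = 2**30            # assumption: "GB" is treated as 2^30 bytes

# 300 hours uploaded per minute, scaled up to a year
hours_per_year = 300 * 60 * 24 * 365.25
bytes_per_year = hours_per_year * 2 * GIB          # assumed 2 GB per hour of HD
blocks_per_year = bytes_per_year / BLOCK_SIZE

# Birthday point for a 128-bit hash is on the order of 2^64 hashes
birthday_point = 2 ** (HASH_BITS // 2)
years_to_birthday_point = birthday_point / blocks_per_year


def collision_probability(n_hashes, bits=HASH_BITS):
    """Standard birthday approximation: p ~ 1 - exp(-n^2 / 2^(bits+1))."""
    return -math.expm1(-(n_hashes ** 2) / 2 ** (bits + 1))


print(f"{hours_per_year:,.0f} hours / year")
print(f"{bytes_per_year / 2**50:,.0f} PB / year")
print(f"{blocks_per_year / 1e9:,.0f} billion blocks / year")
print(f"{years_to_birthday_point:,.0f} years to reach 2^64 blocks")
print(f"p(collision) after one year: {collision_probability(blocks_per_year):.3e}")
```

Under that approximation, 2^64 hashes actually gives closer to a 39% chance than 50%; the 50% point sits a bit higher, at roughly 1.18 * 2^64, which only strengthens the argument.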

Let's move away from YouTube for a moment and look at IP traffic as a whole. Cisco estimated total IP traffic at 27,483 PB / month in 2011. Growth has been slowing every year: from 2010 to 2011 it was only 34%, down from 65% from 2005 to 2006. If I assume 30% growth from 2011 onward, we end up with ~291 exabytes / month, or 3,497 exabytes / year, in 2020. This is total IP traffic and will include a lot of duplicated data (e.g. Netflix streams). Again, let's assume a 4 KB block, and the total estimated IP traffic in 2020 generates ~961 × 10^15 blocks. We need 2^64 blocks to have a 50% probability of a collision; that is more than 19 times the total estimated IP traffic in 2020, the vast majority of which is duplicated.
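The IP traffic projection can be reproduced the same way. This is only a sketch under the assumptions stated above (30% compound annual growth from the 2011 Cisco figure, 4096-byte blocks), plus one of my own: a PB is treated as 2^50 bytes, which is what it takes to land on roughly the same block count.

```python
PB = 2**50             # assumption: binary petabyte
BLOCK_SIZE = 4096      # assumed 4 KB block

traffic_2011_pb_per_month = 27_483   # Cisco estimate quoted above
annual_growth = 1.30                 # assumed 30% growth per year
years = 2020 - 2011

# Compound the 2011 figure forward to 2020
traffic_2020_pb_per_month = traffic_2011_pb_per_month * annual_growth ** years
traffic_2020_pb_per_year = traffic_2020_pb_per_month * 12

# Number of 4 KB blocks in one year of projected 2020 traffic
blocks_2020 = traffic_2020_pb_per_year * PB / BLOCK_SIZE

# How many years of 2020-level traffic it takes to reach 2^64 blocks
ratio_to_birthday_point = 2**64 / blocks_2020

print(f"{traffic_2020_pb_per_month:,.0f} PB / month in 2020")
print(f"{traffic_2020_pb_per_year:,.0f} PB / year in 2020")
print(f"{blocks_2020:.3e} blocks in 2020")
print(f"2^64 is {ratio_to_birthday_point:.1f}x the 2020 block count")
```

Note that this counts every block as unique, so it overstates the number of distinct hashes a real dedupe system would see.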

My claim is that unless you're Google, MD5 is fine. If you're Google, MD5 is still fine, but you should be preparing your system to accept a hash with a larger output space.



