There's almost nothing to say because it's just such a self-evidently good idea.
It's high time we started seeing software treat data identity as totally severable from data location -- and if our existing tools can't do it, we're going to see more and more clever hacks like this to make it happen.
(This (or bifrost, the upstream) probably shouldn't be using MD5 in this day and age, though! SHA-384 and BLAKE2 are both much, much better choices, and immune to length-extension issues. SHA-512 is also fine, as long as the content length is tracked alongside the hash.)
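All three suggestions are available in Python's standard `hashlib`, for what it's worth:

```python
import hashlib

data = b"example payload"
print(hashlib.sha384(data).hexdigest())   # truncated SHA-512, so no length extension
print(hashlib.blake2b(data).hexdigest())  # modern, fast, has a built-in keyed mode
print(hashlib.sha512(data).hexdigest())   # fine if the content length is tracked too
```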
> It's high time we started seeing software treat data identity as totally severable from data location -- and if our existing tools can't do it, we're going to see more and more clever hacks like this to make it happen.
I believe it was Alan Kay who said, "names don't scale". We've known about this problem for half a century. wget should be a semi-intelligent software agent, so that we humans can free up mental resources for the truly difficult problems.
The problem with MD5 here is the relatively poor collision resistance, not the length extension.
Length extension lets you compute a valid hash for appended data without knowing the original input: H(secret + data) is vulnerable because you can compute H(secret + data + padding + extra) without knowing the secret. The fix for that is an HMAC, and HMAC-MD5 is still sufficient for that purpose. None of which applies to the post's context at all.
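To make the distinction concrete, a minimal sketch (the key and payload are made-up values):

```python
import hashlib
import hmac

secret = b"server-side key"
data = b"user=alice&role=user"

# Naive keyed hash: with MD5/SHA-1/SHA-256 this is length-extendable,
# so an attacker can append to `data` without knowing `secret`.
naive_tag = hashlib.md5(secret + data).hexdigest()

# HMAC's nested construction defeats length extension,
# even with MD5 as the underlying hash.
hmac_tag = hmac.new(secret, data, hashlib.md5).hexdigest()
```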
SHA-256 is sufficient to verify that a file hasn't changed. The next step up would be to sign the file with a PGP key, which lets you verify the source as well.
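A quick sketch of the hash-check half, streamed so the file never has to fit in memory (the function name is my own):

```python
import hashlib

def file_matches_sha256(path: str, expected_hex: str) -> bool:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):  # 64 KiB at a time
            h.update(chunk)
    return h.hexdigest() == expected_hex.lower()
```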
Ultimately, to be on the safe side against all scenarios, including chosen-plaintext attacks, and to ensure integrity, you need authenticated encryption with an encrypt-then-MAC scheme. What always remains open is key sharing and identity management.
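For the curious, a hedged sketch of encrypt-then-MAC; it assumes the third-party `cryptography` package for AES-CTR, and, as the parent says, getting the keys to the right parties is the part it doesn't solve:

```python
import hashlib
import hmac
import os

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_then_mac(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    nonce = os.urandom(16)
    enc = Cipher(algorithms.AES(enc_key), modes.CTR(nonce)).encryptor()
    ct = enc.update(plaintext) + enc.finalize()
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()  # MAC the ciphertext
    return nonce + ct + tag

def check_mac_then_decrypt(enc_key: bytes, mac_key: bytes, blob: bytes) -> bytes:
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):  # constant-time comparison
        raise ValueError("MAC mismatch: data was tampered with")
    dec = Cipher(algorithms.AES(enc_key), modes.CTR(nonce)).decryptor()
    return dec.update(ct) + dec.finalize()
```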
You don't need encryption to provide integrity. What you want is a collision-resistant hash function, and the SHA-2 functions provide that perfectly well.
The only attacks that can happen are if someone gives you a wrong hash, which means we need to secure the hash distribution; there's no need to secure the content distribution. Unless you want encrypted distribution, but that's a whole other topic.
That's not really the point. With SHA-1/MD5 etc. you only ensure integrity against random, non-malicious errors. But think of a malicious attacker targeting an existential forgery via a chosen-message attack, where she can change both the message and the hash.
I'm not missing your idea of securing only the hash distribution, but that just pushes the same problem back one step. You decouple the hashes [h1, h2, ..., hn] from the contents (or messages) [m1, m2, ..., mn] they are derived from, and hope to distribute the hashes securely without their messages. But recursively, your new "content" is now those hashes, which need their own confidentiality and integrity check -- which is better done with authenticated encryption with a MAC. Otherwise an attacker can simply forge hashes and supply contents that match them.
Think of 10K Linux distro files with hashes being the target of a malicious government that wants to install modified versions of those files. SHA-x alone won't get you integrity there. At some point you need secured hashes or MACs (which are hashes keyed with secrets).
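To make "hashes secured with keys" concrete, a hypothetical sketch that MACs an entire manifest of file hashes, so swapping a file and its listed hash together no longer goes unnoticed:

```python
import hashlib
import hmac

def mac_manifest(key: bytes, hashes: dict) -> str:
    """MAC a sorted "digest  filename" manifest; forging any entry
    now requires knowing the key, not just recomputing a hash."""
    body = "".join(f"{digest}  {name}\n" for name, digest in sorted(hashes.items()))
    return hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
```

In practice distros use public-key signatures over such a manifest (e.g., a GPG-signed SHA256SUMS) rather than a shared MAC key, since downloads are one-to-many.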
The SHA-2 family (namely SHA-224, SHA-256, SHA-384, SHA-512) are good candidates, and probably SHA-3 too. Let's stop recommending and start phasing out SHA-1 (and definitely never use SHA-0); nobody's demonstrated a collision yet, but it could well be coming. SHA-1 has been deprecated by NIST since 2010; new applications shouldn't use it.
Basically, a big problem we struggle with is that the level above URIs was never implemented. There was supposed to be one more level: an identifier that would always point to a resource no matter how its underlying structure changed. Nearly 25 years later, we just now have the infrastructure in place to actually attempt this, using DOIs and the like.
I wonder if it could look for `${currentAddress%$filename}md5sums.txt` (or similar), since that's often where an md5sum for a file lives, and then compare against that, rather than downloading the whole file and hoping for the best?
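Something like this hypothetical sketch -- the sidecar name `md5sums.txt` and its "digest  filename" layout are assumptions about the server, not anything wget does today:

```python
import hashlib
from urllib.parse import urljoin
from urllib.request import urlopen

def verify_against_sidecar(file_url: str, downloaded: bytes) -> bool:
    """Look for md5sums.txt in the same directory as file_url and
    check the downloaded bytes against the matching entry."""
    filename = file_url.rsplit("/", 1)[-1]
    sums_url = urljoin(file_url, "md5sums.txt")  # sibling of the file
    with urlopen(sums_url) as resp:
        for line in resp.read().decode().splitlines():
            parts = line.split()  # typical md5sum line: "<digest>  <filename>"
            if len(parts) == 2 and parts[1].lstrip("*") == filename:
                return hashlib.md5(downloaded).hexdigest() == parts[0]
    return False  # no entry found; back to hoping for the best
```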
This is really cool. I wonder if there's any place for a DHT here... except with links to the files being shared, rather than the file contents themselves.