Show HN: Wget-finder

heavenlyhash · on March 12, 2015

There's almost nothing to say because it's just such a self-evidently good idea.

It's high time we start seeing software understand that data identity as totally severable from data location -- and if our existing tools can't do it, we're gonna start seeing more and more clever hacks like this to make it happen.

(This (or bifrost, the upstream) should probably not be using md5 in this day and age though! sha384 and blake2 are both much, much better choices, and immune to length extension issues. sha512 is also fine, as long as the content length is also tracked.)

coke12 · on March 13, 2015

> It's high time we start seeing software understand that data identity as totally severable from data location -- and if our existing tools can't do it, we're gonna start seeing more and more clever hacks like this to make it happen.

I believe it was Alan Kay who said, "names don't scale". We've known about this problem for a half century. wget should be a semi-intelligent software agent, so that we humans can free up mental resources for the truly difficult problems.

signa11 · on March 13, 2015

> I believe it was Alan Kay who said, "names don't scale".

apparently in (http://joearms.github.io/2015/03/12/The_web_of_names.html) joe-armstrong also says something similar.

lunixbochs · on March 13, 2015

The problem with MD5 here is the relatively poor collision resistance, not the length extension.

Length extension allows you to calculate the final hash of appended data without knowing the original data. H(secret + data) is vulnerable because you can calculate H(secret + data + extra) without knowing secret. The solution here is to use an HMAC, and in this case MD5 is still sufficient. It doesn't apply to the post's context at all.

SHA256 is sufficient to verify a file hasn't changed. The next step up would be to sign the file with a PGP key, which allows you to verify the source as well.

diminish · on March 13, 2015

Ultimately to be on the safe side against all scenarios including chosen plaintext attacks and ensure integrity you'll need an authenticated encryption with encrypt-then-mac scheme. What always remains opens is the key sharing and identity management.

rakoo · on March 13, 2015

You don't need encryption to provide integrity. What you want is a hash function that has no collision, and the shaX functions do that perfectly.

The only attacks that can happen are if someone gives you a wrong hash; which means we need to secure the hash distribution, but there's no need to secure the content distribution. Unless you want an encrypted distribution, but that's a whole another topic.

diminish · on March 13, 2015

> You don't need encryption to provide integrity

That's not really the point. With SHA1/MD5 etc you may only ensure integrity against random non-malicious errors. But think of a malicious attacker targeting existential forgery using chosen message attack where she may change both the message and the hash.

Don't think I'm missing your idea to secure only the hash distribution, but you simply delay the same problem one step ahead. So you decouple the hashes [h{1}, h{2}, ... h{n}] from the contents (or messages) they are derived from. [m{1}, m{2}.. m{n}) and you think you may ensure the secure distribution of the hashes without their "messages". Now in recursion your new "content" are now those hashes which require their own confidentiality and integrity check which may better be done using an authenticated encryption with mac. Ultimately an attacker may simply forge hashes and provide contents for them.

Think of 10K of linux distro files with hashes being the target of a malicious government who wants to install modified versions of those files. Only SHAx etc won't help you to achieve "integrity". At some point you need secure hashes or macs (which are hashes secured with keys).

rakoo · on March 16, 2015

> At some point you need secure hashes or macs (which are hashes secured with keys).

Which is not encryption, but signing :)

I totally agree with that though. And of course there's a BEP for that (http://www.bittorrent.org/beps/bep_0035.html).

taejo · on March 13, 2015

> the shaX functions do that perfectly.

The SHA-2 family (namely SHA-224, SHA-256, SHA-384, SHA-512) are good candidates, and probably SHA-3 too. Let's stop recommending and start phasing out SHA-1 (and definitely never use SHA-0); nobody's demonstrated a collision yet, but it could well be coming. SHA-1 has been deprecated by NIST since 2010; new applications shouldn't use it.

hyperion2010 · on March 13, 2015

Basically a big problem that we struggle with is that the level above URIs was never implemented. There was supposed to be one more level for an identifier that would always point to a resource no matter how the underlying structure of it changed. nearly 25 years later and we just now have the infrastructure in place to actually try to accomplish this using DOIs and the like.

patcon · on March 13, 2015

ipfs.io is exciting :)

hyperion2010 · on March 13, 2015

Yes! I'm quite hopeful that they are on to something.

moe · on March 13, 2015

So in essence a poor man's magnet link[1].

Interesting hack, nonetheless.

[1] http://en.wikipedia.org/wiki/Magnet_URI_scheme

colechristensen · on March 13, 2015

(I'm sorry I didn't see your comment and said the same thing)

colechristensen · on March 13, 2015

It seems like this is a hacked up version in the spirit of http://en.wikipedia.org/wiki/Magnet_URI_scheme

You'd never have to worry about losing links to things if all of your links were magnet links and you hosted files with bittorrent not http.

A magnet link can be just a sha hash. You could write a browser plugin to rewrite all sha hashes into magnet links.

The real hurdle with that is releasing a bittorrent client that separated itself from the grey area of media piracy.

If firefox included native libraries for downloading magnet links, it would be invisible to users.

You could also write btwget using libtorrent (or patch wget to handle magnet links)

rakoo · on March 13, 2015

It already exists, and works wonderfully: http://aria2.sourceforge.net/

bruce_one · on March 13, 2015

Clever :-)

I wonder if it could look for `${currentAddress%$filename}md5sums.txt` (or similar) as that's often where an md5sum for a file is, and then comparing that, rather than downloading the whole file and hoping for the best?

eg.

User wants file.tar.gz:aaaaaa...

wget-finder finds http://downloadable.com/file/file.tar.gz

wget-finder checks for (and downloads if present) http://downloadable.com/file/md5sums.txt

wget-finder compares the md5sum in md5sums.txt to the aaaaaa...

If it's good, it downloads the file (and still does the final check) and if it isn't it keeps searching, having not downloaded the file unnecessarily.

Seems like it could be neat for large files (to avoid downloading the wrong file as often).

(Could also check for md5sum.txt or md5sums or md5.txt etc)

chilicuil · on March 13, 2015

That's a good idea, I'll give it a try the next time I touch the code =)

Create · on March 13, 2015

https://en.wikipedia.org/wiki/Metalink#Example_Metalink_4.0_...

voltagex_ · on March 12, 2015

This is really cool. I wonder if there's any place for DHT here... except with links to the files being shared rather than the file content themselves.

theophrastus · on March 13, 2015

Sounds a wee-bit like a worthy re-invention of the ancient (veronica indexed) gopher protocol http://en.wikipedia.org/wiki/Gopher_%28protocol%29

zwischenzug · on March 13, 2015

Excellent! I was going to write one of these, now I don't have to. Thank you so much! That's one big TODO ticked off.

herf · on March 13, 2015

Does anyone index the web by {secure hash}? What a good index that would be.

rakoo · on March 13, 2015

https://en.wikipedia.org/wiki/Distributed_hash_table

tgpc · on March 13, 2015

such a cool idea :-)