
Haha, this is funny. I've been obsessed with rolling-hash based chunking since I read about it in the dat paper. I didn't realize there was a tree version, but it is a natural extension.

I have a related cryptosystem that I came up with, but it is so obvious I'm sure someone else has invented it first. The idea is to back up a file like so: first, do rolling-hash-based chunking, then encrypt each chunk with a key that is the hash of that chunk. Then, upload the chunks to the server, along with a manifest file (encrypted with your personal key) that contains the information needed to decrypt each chunk and reassemble them. If multiple users used this strategy, any files they have in common would result in the same chunks being uploaded. This would let the server provider deduplicate those files (saving space), without giving the server provider the ability to read the files. (Unless they already know exactly which file they're looking for, and just want to test whether you're storing it.)
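
This scheme is essentially convergent encryption. A minimal sketch in Python of what I mean, where the chunk sizes, the toy boundary hash, and the choice of AES-GCM are my own illustrative assumptions rather than anything from a particular system:

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    MASK = (1 << 13) - 1              # cut when the low 13 bits are zero: ~8 KiB average chunks
    MIN_CHUNK, MAX_CHUNK = 2048, 65536

    def chunks(data: bytes):
        """Toy content-defined chunking. Real systems use a windowed rolling hash
        (Rabin fingerprint, buzhash) so boundaries depend only on the last few
        dozen bytes; here the running hash is simply reset at each cut."""
        start, h = 0, 0
        for i, b in enumerate(data):
            h = (h * 31 + b) & 0xFFFFFFFF
            size = i + 1 - start
            if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def encrypt_chunk(chunk: bytes):
        """Convergent encryption: the key is the hash of the plaintext chunk."""
        key = hashlib.sha256(chunk).digest()       # 32 bytes -> AES-256 key
        nonce = b"\x00" * 12                       # a fixed nonce is tolerable only because
                                                   # each key encrypts exactly one plaintext
        ct = AESGCM(key).encrypt(nonce, chunk, None)
        blob_id = hashlib.sha256(ct).hexdigest()   # what the server stores and dedupes on
        return blob_id, key, ct

    def backup(data: bytes):
        manifest, blobs = [], {}
        for c in chunks(data):
            blob_id, key, ct = encrypt_chunk(c)
            manifest.append((blob_id, key))   # the manifest is then encrypted with your personal key
            blobs[blob_id] = ct               # identical chunks from any user yield identical blobs
        return manifest, blobs

Deduplication falls out for free: the server only ever sees blob_id and ct, and identical plaintext chunks map to identical blobs no matter who uploads them.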

Tangent: why is it that downloading a large file is such a bad experience on the internet? If you lose internet halfway through, the connection is closed and you're just screwed. I don't think it should be a requirement, but it would be nice if there was some protocol understood by browsers and web servers that would be able to break up and re-assemble a download request into a prolly tree, so I could pick up downloading where I left off, or only download what changed since the last time I downloaded something.



"Tangent: why is it that downloading a large file is such a bad experience on the internet?"

This comment could only come from someone who never downloaded large files from the internet in the 1990s.

Downloading today feels like heaven to me.

Watching video from YouTube, Facebook, etc., if accessed via those websites running their JavaScript, usually uses the Range header. Some people refer to this kind of break-up and re-assembly as "progressive download".


Adding a tangent to the tangent, I recently experienced an unexpected modern counterpart of a 1990s large download: deleting about 120K emails from a Gmail folder, then purging them for real by "emptying" the Gmail "trash bin".

The first phase was severely asynchronous, with a popup mentioning "the next few minutes", which turned out to be hours. Manually refreshing the page showed a cringeworthy deletion rate of about 500 messages per minute.

But at least it worked; the second phase was more special, with plenty of arbitrary stopping and outright lies. After repeated purging attempts I finally got an empty-bin achievement page on my phone, but the next day I found over 50K messages in the trash on my computer, where every attempt to empty the trash showed a very slow progress dialog that reported completion but actually deleted only about 4K messages.

I don't expect many JavaScript houses of cards as complex as Gmail's message handling to be tested on large jobs; at least old FTP and web servers were designed with high load and large files in mind.


Video streaming usually uses something like DASH/HLS and is a fair bit more complicated than Range headers. Notably this means that downloading the video means reversing the streaming format and gluing the segments together.


In recent times, large video files could often be downloaded in the popular browsers by changing a URL path parameter like "r=1234567" to "r=0". I have downloaded many large videos that way.

DASH is used sometimes, but not on the majority of videos I encounter. Of course this can change over time. The point is that downloading large files today, e.g., from YouTube, Facebook, etc., has been relatively fast and easy compared with the 90s, when speeds were slower and interruptions were more common, even though these websites might be changing how they serve these files behind the scenes and software developers gravitate toward complexity.

Commercial "streaming", e.g., ESPN, etc., might be intentionally difficult to download and might involve "reversing" and "glueing" but that is not what I'm describing.


> If you lose internet halfway through, the connection is closed and you're just screwed. [...] it would be nice if there was some protocol understood by browsers and web servers

HTTP Range Requests solve this without any clever logic, if mutually supported.


> HTTP Range Requests solve this without any clever logic, if mutually supported.

The most understated comment in the thread.

The very first search hit on Google is none other than Mozilla's page on range requests.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...

Here's the leading summary from that page.

> An HTTP Range request asks the server to send parts of a resource back to a client. Range requests are useful for various clients, including media players that support random access, data tools that require only part of a large file, and download managers that let users pause and resume a download.

Here's an RFC:

https://datatracker.ietf.org/doc/html/rfc7233
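
As a concrete illustration, resuming an interrupted download needs little more than a Range header plus a check that the server actually honored it. A sketch in Python using the requests library, where the URL and file path are placeholders:

    import os
    import requests

    def resume_download(url: str, path: str):
        done = os.path.getsize(path) if os.path.exists(path) else 0
        headers = {"Range": f"bytes={done}-"} if done else {}
        with requests.get(url, headers=headers, stream=True, timeout=30) as r:
            r.raise_for_status()
            # 206 Partial Content means the server honored the Range header;
            # a plain 200 means it ignored it, so rewrite the file from byte 0.
            mode = "ab" if r.status_code == 206 else "wb"
            with open(path, mode) as f:
                for block in r.iter_content(chunk_size=1 << 16):
                    f.write(block)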


AES-GCM-SIV[1] does something similar to your per-chunk derived key, except that AES-GCM-SIV expects the key to be user-provided, and the IV is synthetic - hence Synthetic IV mode.

What's your threat model? This has "interesting"[3] properties. For example, given a file, the provider can figure out who has the file. Or, given a file, an arbitrary user can figure out if some other user already has the file. Users may even be able to "teleport" files to each other, like the infamous Dropbox Dropship[2].

I suspect the reasons no one has tried this are many-fold: (1) Most providers want to store plaintext. Those few providers who don't want to store plaintext, whether for secrecy or deniability reasons, also don't want to store anything else correlatable, either. (2) Space is cheap. (3) Providers like being able to charge for space. Since providers sell space at a markup, they almost want you to use more space, not less.

[1]: https://en.wikipedia.org/wiki/AES-GCM-SIV

[2]: https://en.wikipedia.org/wiki/Dropship_(software)

[3]: "Interesting" is not a word you want associated with your cryptography usage, to say the least.


I think the cost of processing stuff that way would far exceed the cost of downloading the entire file again. You can already resume downloads from a byte offset if the server supports it, and that probably covers 99% of the cases where you would actually want to resume a download of a single file. Partial updates are rarely possible for large files anyway, as they are often compressed. If the host wants partial updates to make sense, they could serve the file over rsync.


BitTorrent is the protocol you're looking for. Unfortunately it is not widely adopted for the use cases you are talking about.


> This would let the server provider deduplicate those files (saving space), without giving the server provider the ability to read the files.

This gives the service provider the ability to see who is storing the same files, however, which can be sensitive information. Moreover, once they know/decrypt a file for one user, they know that file for all users.
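
Concretely, because every step of the scheme upthread is deterministic, anyone who holds a candidate plaintext can recompute exactly the blob IDs it would produce and check them against what a user has stored. A self-contained sketch of that confirmation test (function names are mine; it assumes the candidate file is chunked the same way the user chunked it):

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def blob_id(chunk: bytes) -> str:
        """The ciphertext ID a chunk produces under hash-derived-key encryption."""
        key = hashlib.sha256(chunk).digest()
        ct = AESGCM(key).encrypt(b"\x00" * 12, chunk, None)
        return hashlib.sha256(ct).hexdigest()

    def stores_file(candidate_chunks, users_blob_ids) -> bool:
        """Provider-side (or fellow-user) test: is this known file in a user's store?"""
        return all(blob_id(c) in users_blob_ids for c in candidate_chunks)

The Tahoe-LAFS mitigation mentioned below is to mix an extra secret into the key derivation, so only parties who share that secret can run this kind of test.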


It does sound similar to ideas in Tahoe-LAFS: https://tahoe-lafs.readthedocs.io/en/latest/architecture.htm...

Which has already thought about attacks on the scheme you described: https://tahoe-lafs.org/hacktahoelafs/drew_perttula.html


Wow, the idea of adding a key to create groups where only someone inside the group could carry out the attack is awesome.


> the dat paper

What's the name of the paper you're alluding to? I'm not familiar with it and it sounds interesting.



Thank you!


Sounds similar to IPFS.



