
This right here is the biggest new feature in my opinion, and should have been discussed in more depth in the announcement. I also think they should have pursued this avenue further before releasing v2.

Content addressing is one of the big advantages of p2p applications, and IPFS has been pushing it for a long time.

Just imagine if this was done all the way down to the piece level. You have two different torrents (say, Linux ISOs) where 20% of the pieces overlap due to similarity (maybe a point upgrade or something, and you want both versions). Rather than downloading 100% of both, you only need to download the shared pieces once. Not only that, but if the latest version has many more seeds/peers, you could download the shared pieces from that swarm instead, saving the older torrent's swarm's bandwidth for the remaining 80% you still need to download.

This could probably be done client side somehow, but it would be good to see actual protocol support for it so that BitTorrent can move in that direction.
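
A rough client-side sketch of that idea in Python (nothing here is from the v2 spec; PieceStore and plan_download are made-up names): keep one store keyed by piece hash, serve any piece of a newly added torrent that is already in the store locally, and only fetch the rest from whichever swarm is healthiest.

    import hashlib

    class PieceStore:
        """Hypothetical cross-torrent piece cache keyed by content hash."""

        def __init__(self):
            self._pieces = {}  # sha256 digest -> piece bytes

        def add(self, piece: bytes) -> bytes:
            digest = hashlib.sha256(piece).digest()
            self._pieces[digest] = piece
            return digest

        def lookup(self, digest: bytes):
            # Returns the piece if any previously added torrent supplied it, else None.
            return self._pieces.get(digest)

    def plan_download(piece_hashes, store: PieceStore):
        """Split a new torrent's piece hashes into locally available vs. still needed."""
        have = [h for h in piece_hashes if store.lookup(h) is not None]
        need = [h for h in piece_hashes if store.lookup(h) is None]
        return have, need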



Only if the blocks have the same offset in both binaries and if they align with the block boundaries. Otherwise, different hashes would be generated. I don't expect that to happen a lot.


There are ways around this. See "content-aware chunking", e.g. implemented using rolling hashes [1]. This is for example what rsync does.

The idea is to make blocks (slightly) variable in size. Block boundaries are determined based on a limited window of preceding bytes. This way a change in one location will only have a limited impact on the following blocks.

[1] https://en.wikipedia.org/wiki/Rolling_hash


Rolling hashing is really only useful for finding nonaligned duplicates.

There isn't a way to advertise some "rolling hash value" in a way that allows other people with a differently-aligned copy to notice that you and they have some duplicated byte ranges.

Rolling hashes only work when one person (or two people engaged in a conversation, like rsync) already has both copies.


I think you misunderstood how the rolling hash is used in this context. It's not used to address a chunk; you'd use a plain old cryptographic hash function for that.

The rolling hash is used to find the chunk boundary: Hash a window before every byte (which is cheap with a rolling hash) and compare it against a defined bit mask. For example: cut a chunk whenever the lowest 20 bits of the hash are zero. With a 20-bit mask you'd get chunks with about 2^20 bytes (1 MiB) average length.

For a good explanation, I'd encourage you to look at borgbackup's internals documentation: https://borgbackup.readthedocs.io/en/stable/internals.html
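
A minimal sketch of that boundary rule, assuming a toy polynomial rolling hash over a 48-byte window and a 20-bit mask (borg itself uses a tuned buzhash with its own parameters):

    WINDOW = 48                       # bytes of context that determine a boundary
    MASK = (1 << 20) - 1              # cut when the low 20 bits are zero -> ~1 MiB average chunks
    BASE = 257
    MOD = (1 << 61) - 1
    POW = pow(BASE, WINDOW - 1, MOD)

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets; each cut depends only on the last WINDOW bytes."""
        h, start = 0, 0
        for i, b in enumerate(data):
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * POW) % MOD   # drop the byte leaving the window
            h = (h * BASE + b) % MOD                     # add the incoming byte
            if i + 1 - start >= WINDOW and (h & MASK) == 0:
                yield (start, i + 1)
                start = i + 1
        if start < len(data):
            yield (start, len(data))

Each chunk would then be addressed by a plain cryptographic hash (e.g. SHA-256) of its contents, as described above.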


I think they understood just fine.

If I discover that the file I want to publish shares a range with an existing file, that does very little because the existing file has already chosen its chunk boundaries and I can’t influence those. That ship has sailed.

I can only benefit if the a priori chunks are small enough that some subset of the identified match is still addressable. And then I may only get half or two thirds of the improvement I was after.


> that does very little because the existing file has already chosen its chunk boundaries

If they both used the same rolling hash function on the same or similar data, regardless of the initial and final boundaries and regardless of when they chose the boundaries, they will share many chunks with high probability. That's just how splitting with rolling hashes works. They produce variable-length chunks.
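
As a quick, hedged sanity check of that claim (a toy rolling-hash splitter with a small window and mask so it runs fast on a few MiB of random data; nothing here comes from the BitTorrent spec):

    import hashlib, os

    WINDOW, MASK, BASE, MOD = 32, (1 << 13) - 1, 257, (1 << 61) - 1
    POW = pow(BASE, WINDOW - 1, MOD)

    def chunk_digests(data: bytes):
        """SHA-256 of every content-defined chunk (~8 KiB average with a 13-bit mask)."""
        digests, h, start = set(), 0, 0
        for i, b in enumerate(data):
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * POW) % MOD
            h = (h * BASE + b) % MOD
            if i + 1 - start >= WINDOW and (h & MASK) == 0:
                digests.add(hashlib.sha256(data[start:i + 1]).hexdigest())
                start = i + 1
        if start < len(data):
            digests.add(hashlib.sha256(data[start:]).hexdigest())
        return digests

    original = os.urandom(4 * 1024 * 1024)
    edited = original[:1000] + os.urandom(137) + original[1000:]   # unaligned insertion

    a, b = chunk_digests(original), chunk_digests(edited)
    print(f"shared chunks: {len(a & b)} of {len(a)}")   # expect all but the chunk(s) around the edit

Only the chunks overlapping the insertion change; everything after it re-synchronizes even though every byte offset has shifted.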


The idea is that on non-random data, you are able to use a heuristic that creates variable-sized chunks that fit the data. The simplest way seems to be to detect padding zeros and start a new block at the first non-zero byte that follows. There are probably other ways; knowing the data type should help.


That seems fairly unlikely. Not a lot of big files have zero padding, and if they did, you'd just compress them. That will reduce your transfers more than any range substitution ever will.


It will still definitely help some use cases:

> Identical files will always have the same hash and can more easily be moved from one torrent to another (when creating torrents) without having to re-hash anything.


Well if files in an ISO are aligned to some boundary it would help a lot, similar to how filesystems on disk have a sector size, where all files begin at the start of a sector. However, I don't know if this is true in ISO9660 or any of its extensions.


ISO is a mountable filesystem so this could be the case if it doesn't have any space saving optimizations that create variable block sizes.

also need to watch out for compression


Hopefully it would also be good for interop between the two, as anything with a small enough block size could be picked up by IPLD.


> Just imagine if this was done all the way down to the piece level.

I don't know about that. With a reasonably large number of users, wouldn't you start seeing hash collisions via the birthday paradox? Might be better off keeping it at the file level. (Though this does incentivize malicious users to find hash collisions for particular files they want to defend, and seed the swarm with garbage files.)


Hashes are now 256 bits. In order to get a collision with probability 1% you need to produce about 4e37 hashes. For comparison, if you had a 5GHz computer that did a SHA-256 hash every cycle and gave 100 of these computers to every human on Earth, it'd take over 300 million years to produce 4e37 hashes.


Yep, basically while it's entirely possible (as per Murphy's Law) that there'll be an accidental collision, the likelihood is close to zero. It's a fair risk to be taking.


> In order to get a collision with probability 1% you need to produce about 4e37 hashes

Does "get a collision" here mean finding a particular collision with a known hash, or finding any collision between any two of the generated hashes?


The latter. I used the formula from the Wikipedia page on the birthday attack: p ~= 1 - e^(-n^2 / (2H)), where n is the number of items you've picked and H is the size of the universe.
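
A back-of-the-envelope check of both figures using that formula (the population and clock-rate numbers are just assumptions mirroring the comment above):

    import math

    H = 2 ** 256                 # possible SHA-256 values
    p = 0.01                     # target collision probability

    # invert p = 1 - e^(-n^2 / (2H)) to get n, the number of hashes generated
    n = math.sqrt(2 * H * math.log(1 / (1 - p)))
    print(f"{n:.1e} hashes for a 1% collision chance")    # ~4.8e37

    rate = 8e9 * 100 * 5e9       # ~8 billion people x 100 machines each x 5 GHz, one hash per cycle
    years = n / rate / (365.25 * 24 * 3600)
    print(f"{years:.1e} years at that rate")              # a few hundred million years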


I see, thanks!


All git objects on GitHub had unique hashes as of 2017, even though git uses the relatively weak SHA-1. SHA-256 should be safe.

https://github.blog/2017-03-20-sha-1-collision-detection-on-...


The birthday paradox comes from the fact that there is an extremely small number of possible birthdays.

SHA-256 was chosen because it provides a sufficiently high number of different hashes that this won't be an issue for the foreseeable future.

More details here: https://stackoverflow.com/questions/4014090/is-it-safe-to-ig...



