
What compression algorithms would help? It's already using lzma for the text (in the form of .xz).
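(For concreteness, a minimal sketch of what that looks like with Python's lzma module; the input filename is just a placeholder:)

    import lzma

    # Placeholder input: any plain-text dump of the articles.
    with open("articles.txt", "rb") as f:
        raw = f.read()

    # .xz is just the LZMA2 algorithm in the xz container format.
    packed = lzma.compress(raw, format=lzma.FORMAT_XZ, preset=9)

    print(f"{len(raw)} -> {len(packed)} bytes, "
          f"ratio {len(raw) / len(packed):.2f}:1")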



The Hutter Prize is a competition for compressing Wikipedia:

http://prize.hutter1.net/

So the best algorithm to use from there is starlit, with a compression ratio of about 8.67:1, compared to lzma in 2 MB chunks, which only achieves about 4:1.
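(Rough sketch of where the ~4:1 figure comes from, assuming each 2 MB chunk is compressed independently; the input filename is a placeholder:)

    import lzma

    CHUNK = 2 * 1024 * 1024   # 2 MB, each chunk compressed on its own

    total_in = total_out = 0
    with open("enwik9", "rb") as f:   # placeholder filename
        while block := f.read(CHUNK):
            total_in += len(block)
            # Independent chunks can't share context across boundaries,
            # which is why the ratio is worse than whole-file lzma.
            total_out += len(lzma.compress(block, preset=9))

    print(f"ratio {total_in / total_out:.2f}:1")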


Oh, and if you are happy to wait days or weeks for your compressed data, Fabrice Bellard's nncp manages even higher ratios (but isn't eligible for the prize because it's too slow).


Submissions for the Hutter Prize also include the size of the compressor in the "total size". So I assume that makes it hard to beat if you use huge neural networks on the compression side, even if decompression is fast enough.


nncp uses neural networks, but 'trains itself' as it goes, so there is no big binary blob involved in the compressor.
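(Not how nncp actually models the data, just a drastically simplified sketch of the "trains itself as it goes" pattern: both ends start from the same blank model and apply the same update after every symbol, so only the update rule ships with the compressor, never a pre-trained blob:)

    import math

    class AdaptiveModel:
        """Order-0 adaptive byte model, starts completely untrained."""
        def __init__(self):
            self.counts = [1] * 256
            self.total = 256

        def prob(self, byte):
            return self.counts[byte] / self.total

        def update(self, byte):
            # The "training" step: learn from the symbol just coded.
            self.counts[byte] += 1
            self.total += 1

    def ideal_coded_bits(data):
        """Bits an entropy coder driven by this model would emit."""
        model = AdaptiveModel()
        bits = 0.0
        for b in data:
            bits += -math.log2(model.prob(b))  # cost of coding b now
            model.update(b)                    # then adapt
        return bits

    data = open("articles.txt", "rb").read()   # placeholder input
    print(ideal_coded_bits(data) / 8, "bytes (order-0 adaptive bound)")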

The only reason it isn't eligible is compute constraints (and I don't think the Hutter Prize allows a GPU, which nncp needs for any reasonable performance).


Ah, OK, fair enough.



