AI language models can exceed PNG and FLAC in lossless compression, says study (arstechnica.com)
78 points by belter on Sept 28, 2023 | 40 comments


Let’s not forget that a decompressor for such a format would necessitate running the entire model, i.e. in this case, a 70B parameter decompressor. It’s perhaps not surprising that you can compress files better when given a (very) large dictionary to refer to. This is why any reasonable compression benchmark includes the decompressor’s size in the size score.
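Back-of-the-envelope, counting the decompressor against the score the way benchmarks like the Hutter Prize do (the payload size, compression ratios, and fp16 weights below are made-up illustrative numbers, not anything from the paper):

    model_size = 70e9 * 2              # ~70B params at 2 bytes each (fp16) ~= 140 GB
    payload = 1e9                      # hypothetical 1 GB of data to compress
    llm_ratio, png_ratio = 0.40, 0.55  # assumed compression ratios

    llm_total = payload * llm_ratio + model_size  # compressed data + decompressor
    png_total = payload * png_ratio + 50e3        # a PNG decoder binary is tiny by comparison
    print(llm_total < png_total)                  # False: the model dwarfs any savings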


It's still surprising because the language model was only trained on text, and is generalizing to compressing other forms of data using in-context learning. The model contains no information about images or audio.

But yes, no one is suggesting it as a practical compression tool due to the large size and compute requirements.


It does not matter; bytes are bytes. In the end, a neural network is a generic compression model: it is predicting the next bytes that should show up.

For the parent - imagine WinRAR, but in the cloud, so if you don’t pay you cannot decompress your data. There would be no way you could “pirate” it, so everyone has to pay, unlike regular WinRAR, where everyone jokes about buying licenses.


Downloading an archive and uploading it and then downloading the uncompressed data seems a little counterproductive.


It was trained "mainly on text" according to the article. So there might as well have been 100GB of images in the training data, too.


Could a distributed decompressor be valid? I.e. something like BitTorrent that provides a DHT for known files.



This has been implemented: https://github.com/philipl/pifs


> Therefore, inside pi is contained every discovery and every work of art!

Is this actually true? It is not clear to me that pi necessarily contains every possible sub-sequence.



Thanks. What I am getting from this is that pi is believed to have this property, but proving this is still an unresolved problem.


Could be. But if you wanted to shoot for the Hutter prize, you'd probably need to include the client binary and the downloaded data to actually measure the "size" of the decompressor.


If you have to fetch one or more dictionaries to decompress the file anyway, why not just include the dictionary with the file you need to decompress?


Because the model (=="dictionary") is 70B floats -- 280GB naively, 40-70GB aggressively quantized (which might reduce compression rate). If your file is big enough that the marginal compression win over other methods makes this space-effective, sure. But that's a very narrow case.


Give me a lookup table of 50 petabytes, let me seed that LUT with billions of files, and I, too, can write an amazingly efficient compression algorithm.


Sure. But you can only use text files in your seed.


What's the difference between 0x2e and .? You can easily transform between representations of data, be that text or binary in any encoding you'd like. The model input token vector (or LUT index) is still the same. I don't see a difference there.


You are missing the point. LLMs are not trained on image data, but on human text data. This is not the same as encoding the image as ASCII characters.


I don’t see why video chat apps shouldn’t do something like this. Maybe the encoding/decoding is too slow.


Accessing a giant lookup table is about the most cache-unfriendly thing you could do. This is not a good approach when latency matters.

Smaller LUTs are fun to think of, especially if you eliminate the lossless angle. I'm really not that interesting to watch during a videoconference. When I'm not talking, it would probably be better for viewers to make up a stream of me sitting still with a dumb look on my face. It should be loosely matched to my actual position/pose, so that when I start talking or my dog attacks me or the bomb explodes and wipes out my room, the transition back to actual data is relatively seamless.

(And if you set the threshold high enough, maybe I don't need to be in the room at all...)


This is a technology described in the book "A Fire Upon the Deep" by Vernor Vinge. It's discussed fairly close to the end.


But it isn’t generalized like this is supposed to be.


It's cool that it works, but this isn't an entirely surprising result.

Compression can be understood as a combination of probability modelling and entropy coding.

Entropy coding takes a predicted probability of the next symbol (a symbol can be a word, a color, an audio sample, etc.) and converts it into bits. This is a solved problem: we know how to code this with perfect efficiency (however, PNG didn't get optimal arithmetic coding due to patents, so its Huffman coding is optimal only when symbol probabilities are negative powers of two, i.e. 1/2, 1/4, and so on).

So the rest of the lossless compression is the art of predicting what will come next based on data decompressed so far. Standard compression algorithms have simple hardcoded rules for this, mostly predicting exact repetitions of data seen recently. OTOH ML models have seen many more patterns, can express probabilities with more finesse, and setups like LLMs are very good at adapting based on previous data.
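A toy sketch of that split, in case it helps (purely illustrative, nothing like the paper's actual setup): the predictor is an adaptive order-0 byte model, and the entropy coder is idealized as charging exactly -log2(p) bits per symbol, which arithmetic coding approaches in practice. An LLM-based compressor just swaps in a vastly better predictor.

    import math
    from collections import Counter

    def ideal_compressed_bits(data: bytes) -> float:
        counts = Counter()
        bits = 0.0
        for i, byte in enumerate(data):
            # Probability assigned *before* seeing the byte, from
            # Laplace-smoothed counts of everything decoded so far.
            p = (counts[byte] + 1) / (i + 256)
            bits += -math.log2(p)   # ideal entropy-coding cost for this symbol
            counts[byte] += 1       # update the model; a decoder can do the same
        return bits

    print(ideal_compressed_bits(b"aaaaaaaaaaaaaaaaaaaaaa"))  # repetitive -> cheap
    print(ideal_compressed_bits(bytes(range(64))))           # order-0 model can't exploit this pattern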


But in this case that data isn't language, it's image data. Implying that it's not an LLM, but something more, dare I say, general?


No, that would be personifying it too much. It's trained on tokens, and on the scale of the web, it has seen all kinds of patterns, including random garbage. It's seen "aaaaaaaaaaaaaaaaaaaaaa" and "abcdefghijklmnop" too, and that's already as good as PNG filters can do. If you feed it pixel-derived nonsense, it will mimic more of such nonsense.


The article mixes up what is significant. The headline refers to an utterly irrelevant result. As others have pointed out, it's just saying that a compression method with a large dictionary can beat a (better) compression method with a small dictionary.

The interesting bit is that it can transfer a dictionary-ish thing from the text to the audio and image domains and it'll work well enough to beat format-specific compression algorithms. Which is surprising.

I'm sure it's only benefiting from a small subset of the 280GB raw text-domain dictionary. But it is surprising that it is not only benefiting, it is benefiting enough to beat the special-purpose compressors.

I'll be sad if this turns out to only be true because of some meaningless category of edge cases.

On a side note, it does make one wonder if we should have a well-known massive image-based dictionary that compression algorithms can assume the existence of (and request ranges of from a server). Or heck, given these results in the article, maybe it would be a dictionary with a mishmash of different domains. Then when encoding for limited devices with network connections, you'd specify the maximum amount of data it could retrieve from the dictionary (perhaps as a percentage of the compressed size?).


> On a side note, it does make one wonder if we should have a well-known massive image-based dictionary that compression algorithms can assume the existence of (and request ranges of from a server). Or heck, given these results in the article, maybe it would be a dictionary with a mishmash of different domains. Then when encoding for limited devices with network connections, you'd specify the maximum amount of data it could retrieve from the dictionary (perhaps as a percentage of the compressed size?).

Could we not run stable diffusion in reverse (image to text prompt), then compress the text? Plus store a diff between the sd output and the original?


I don't know if Chinchilla does the same, but PNG isn't really meant to be optimal lossless compression. That's partly to avoid patent issues with LZW, but also because part of the point of the format is progressive display, i.e. the ability to show a lower-resolution image constructed from a subset of pixels before the entire file has transferred. You can turn off interlacing if you wish, but having it on makes for less effective compression. As far as I understand, most implementations also opt for faster speed and lower energy use at the cost of worse compression.


There are exceptions where interlacing can decrease the PNG size, but they are few and far between: https://stackoverflow.com/questions/37867463/how-can-an-inte...


I have never seen a PNG image show a lower resolution subset before it has fully loaded. (I've never seen an interlaced PNG, ever. And I always turn it off.)


Even slow PNG compression (dozens of seconds per file) is quite inefficient.


Assuming any image or audio stream is available on the internet (not true, but very close), you could achieve enormous compression by replacing the media with its URL.

LLMs (the set of connections and their weights) are in fact a compressed version of a large part of the internet.

So the article's finding should not be surprising.


> Assuming any image or audio stream is available on the internet (not true, but very close), you could achieve enormous compression by replacing the media with its URL.

That reminds me of the "Dropship" utility: to save server storage and reduce file upload time, Dropbox used to deduplicate uploaded files globally in a way that files would have the same hash regardless of who uploaded them. Anyone who knew the hash of an uploaded file could download it to their Dropbox folder, so people could share large files by just sharing the Dropbox hash.

http://paranoia.dubfire.net/2011/04/how-dropbox-sacrifices-u...
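For what it's worth, the mechanism is basically content-addressed storage. A minimal sketch of the idea (illustrative only, not Dropbox's actual protocol):

    import hashlib

    store: dict[str, bytes] = {}            # digest -> content, shared across all users

    def upload(content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)   # identical uploads are stored only once
        return digest

    def download(digest: str) -> bytes:
        # Anyone who knows the digest can fetch the bytes, even if they
        # never had the original file -- hence the sharing loophole.
        return store[digest]

    h = upload(b"big file contents")
    assert download(h) == b"big file contents"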


That's not how it works. The LLM was only trained on text, so this is new data it's never seen before. There's no train-test leakage.


You are right, but I am sure there are patterns in the weights that (accidentally or not) predict patterns commonly seen in media.


Does this paper actually say how they provide the image/audio data to the language model? You can't really just pipe 16kHz audio into CLIP and slap that into your model of choice. I skimmed the paper, then skimmed it again, and I don't have an answer for what exactly they did here.


The paper actually lays it out pretty clearly. Yes, you can just pipe 16kHz audio into a tokenizer, which is exactly what they did (kind of).

They just treat the input as 8-bit ASCII (which in itself is technically not a thing, since ASCII is 7-bit, but the paper doesn't seem to care about technicalities).

So the input alphabet is 256 entries, one for each byte. Simple as that.
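A minimal sketch of that byte-level "tokenization" (just to illustrate the idea, not the paper's code): any input, whether text, image pixels, or 16kHz audio samples, is treated as a byte stream, and each byte is its own token id.

    def bytes_to_tokens(data: bytes) -> list[int]:
        return list(data)        # token id == byte value; alphabet size is 256

    def tokens_to_bytes(tokens: list[int]) -> bytes:
        return bytes(tokens)     # lossless round trip by construction

    chunk = bytes([0, 127, 255, 64])   # e.g. raw 8-bit audio samples
    assert tokens_to_bytes(bytes_to_tokens(chunk)) == chunk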


ts_zip ( https://bellard.org/ts_server/ts_zip.html ) already implements lossless compression with language models and is fast enough to be usable for some applications.


There does seem to be a growing awareness of the link between AI language models and data compression (AKA information compression)...

But what we're all looking for (ideally) is a hard mathematical proof (and/or algorithm) relating AI language models to data compression...

I'm guessing that, like so much else in the history of Math and Computer Science, this will gradually be discovered by the world's Mathematicians and Computer Scientists in phases, in degrees...

I'm guessing that this will start via a single mathematician postulating upper and/or lower limits relating one specific compression algorithm with one specific AI language model of a specific size.

From there, others should be able to generate better/more exact formulas for upper/lower bounds, and from there we might start to see some inexact, yet steps-in-the-right direction formulas which will eventually lead to the exact relationship...

I'm guessing, but not knowing, that the way to get started on all of this, should someone choose, would be as follows:

A) Take the simplest known compression algorithm;

B) Take the simplest known AI language model;

and

C) Try to determine mathematical relationships between them (upper and lower bounds, in all likelihood, initially), no matter how inexact...

D) Publish a paper or write a web page explaining what you tried, and what results were obtained...

E) Wait for other researchers to try to improve, if even incrementally, upon those results...

There's a great history of how mathematicians speculated over time about the formula governing the occurrence and distribution of prime numbers; the asymptotic version of that mystery has been solved (the prime number theorem) -- but the history, the speculation, and the incremental steps taken by various mathematicians over time are fascinating -- a part of math history, to be sure...




