Let’s not forget that a decompressor for such a format would necessitate running the entire model, i.e. in this case, a 70B parameter decompressor. It’s perhaps not surprising that you can compress files better when given a (very) large dictionary to refer to. This is why any reasonable compression benchmark includes the decompressor’s size in the size score.
It's still surprising because the language model was only trained on text, and is generalizing to compressing other forms of data using in-context learning. The model contains no information about images or audio.
But yes, no one is suggesting it as a practical compression tool due to the large size and compute requirements.
It does not matter; bytes are bytes. In the end, a neural network is a generic compression model: it is predicting the next bytes that should show up.
For the parent: imagine WinRAR, but in the cloud, so that if you don't pay you cannot decompress your data. There would be no way to "pirate" it, so everyone would have to pay, unlike regular WinRAR, where buying a license is a running joke.
Could be. But if you wanted to shoot for the Hutter prize, you'd probably need to include the client binary and the downloaded data to actually measure the "size" of the decompressor.
Because the model (=="dictionary") is 70B floats -- 280GB naively, 40-70GB aggressively quantized (which might reduce compression rate). If your file is big enough that the marginal compression win over other methods makes this space-effective, sure. But that's a very narrow case.
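Back-of-the-envelope (my numbers, assuming 4 bytes per weight unquantized and 4-8 bits per weight quantized):

    # Rough storage math for a 70B-parameter model.
    # Assumptions (mine): fp32 baseline, 4-8 bit aggressive quantization.
    params = 70e9
    print("fp32:", params * 4 / 1e9, "GB")    # 280 GB
    print("int8:", params * 1 / 1e9, "GB")    # 70 GB
    print("int4:", params * 0.5 / 1e9, "GB")  # 35 GB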
Give me a lookup table of 50 petabytes, let me seed that LUT with billions of files, and I, too, can write an amazingly efficient compression algorithm.
What's the difference between 0x2e and .?
You can easily transform between representations of data, be that text or binary in any encoding you'd like.
The model input token vector (or LUT index) is still the same. I don't see a difference there.
Accessing a giant lookup table is about the most cache-unfriendly thing you could do. This is not a good approach when latency matters.
Smaller LUTs are fun to think of, especially if you eliminate the lossless angle. I'm really not that interesting to watch during a videoconference. When I'm not talking, it would probably be better for viewers to make up a stream of me sitting still with a dumb look on my face. It should be loosely matched to my actual position/pose, so that when I start talking or my dog attacks me or the bomb explodes and wipes out my room, the transition back to actual data is relatively seamless.
(And if you set the threshold high enough, maybe I don't need to be in the room at all...)
It's cool that it works, but this isn't an entirely surprising result.
Compression can be understood as a combination of probability modelling and entropy coding.
Entropy coding takes a predicted probability of the next symbol (a symbol can be a word, a color, an audio sample, etc.) and converts it into bits. This is a solved problem: we know how to code this with essentially perfect efficiency (however, PNG didn't get optimal arithmetic coding due to patents, so its Huffman coding is optimal only when symbol probabilities are exact powers of 1/2).
So the rest of the lossless compression is the art of predicting what will come next based on data decompressed so far. Standard compression algorithms have simple hardcoded rules for this, mostly predicting exact repetitions of data seen recently. OTOH ML models have seen many more patterns, can express probabilities with more finesse, and setups like LLMs are very good at adapting based on previous data.
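As a toy illustration of that split (my own sketch, not from the paper): an ideal entropy coder spends about -log2(p) bits on a symbol the model predicted with probability p, so all of the compression ratio comes from how good the predictions are. Here is the simplest possible adaptive predictor feeding those probabilities:

    import math

    def ideal_code_length_bits(data: bytes) -> float:
        """Bits an ideal entropy coder would spend on `data` under a simple
        adaptive order-0 byte model (Laplace-smoothed counts). A better
        predictor -- e.g. an LLM over bytes -- means fewer bits."""
        counts = [1] * 256
        total = 256
        bits = 0.0
        for b in data:
            p = counts[b] / total        # model's prediction for this byte
            bits += -math.log2(p)        # entropy coding cost, ~optimal
            counts[b] += 1               # adapt on data "decompressed so far"
            total += 1
        return bits

    sample = b"abababababababababab"
    print(ideal_code_length_bits(sample) / 8, "bytes ideally vs", len(sample), "raw")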
No, that would be personifying it too much. It's trained on tokens, and on the scale of the web, it has seen all kinds of patterns, including random garbage. It's seen "aaaaaaaaaaaaaaaaaaaaaa" and "abcdefghijklmnop" too, and that's already as good as PNG filters can do. If you feed it pixel-derived nonsense, it will mimic more of such nonsense.
The article mixes up what is significant. The headline refers to an utterly irrelevant result. As others have pointed out, it's just saying that a compression method with a large dictionary can beat a (better) compression method with a small dictionary.
The interesting bit is that it can transfer a dictionary-ish thing from the text to the audio and image domains and it'll work well enough to beat format-specific compression algorithms. Which is surprising.
I'm sure it's only benefiting from a small subset of the 280GB raw text-domain dictionary. But it is surprising that it is not only benefiting, it is benefiting enough to beat the special-purpose compressors.
I'll be sad if this turns out to only be true because of some meaningless category of edge cases.
On a side note, it does make one wonder if we should have a well-known massive image-based dictionary that compression algorithms can assume the existence of (and request ranges of from a server). Or heck, given these results in the article, maybe it would be a dictionary with a mishmash of different domains. Then when encoding for limited devices with network connections, you'd specify the maximum amount of data it could retrieve from the dictionary (perhaps as a percentage of the compressed size?).
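A small-scale version of the shared-dictionary part already exists: zstd can compress against an external dictionary that both ends are assumed to already have. A sketch using the third-party `zstandard` package (the giant, cross-domain, range-fetched dictionary is the speculative part):

    import zstandard as zstd

    # Assumption: encoder and decoder both already hold this well-known shared
    # dictionary (a stand-in blob here; the idea above would make it huge,
    # cross-domain, and fetched in ranges from a server).
    shared = zstd.ZstdCompressionDict(b"common boilerplate found in many files " * 100)

    payload = b"common boilerplate found in many files, plus something new"
    compressed = zstd.ZstdCompressor(dict_data=shared).compress(payload)
    restored = zstd.ZstdDecompressor(dict_data=shared).decompress(compressed)

    assert restored == payload
    print(len(payload), "->", len(compressed), "bytes (dictionary itself not counted)")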
> On a side note, it does make one wonder if we should have a well-known massive image-based dictionary that compression algorithms can assume the existence of (and request ranges of from a server). Or heck, given these results in the article, maybe it would be a dictionary with a mishmash of different domains. Then when encoding for limited devices with network connections, you'd specify the maximum amount of data it could retrieve from the dictionary (perhaps as a percentage of the compressed size?).
Could we not run Stable Diffusion in reverse (image to text prompt), then compress the text? Plus store a diff between the SD output and the original?
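Something like this, as a hedged sketch of the "lossy base plus lossless residual" part: a real version would store a caption and regenerate the base with Stable Diffusion; here an 8x-downscaled copy stands in for the caption so the sketch actually runs.

    import zlib
    import numpy as np

    # "Lossy base + lossless residual" sketch (grayscale; my stand-in for the
    # SD round trip: the 8x-downscaled copy plays the role of the text prompt).
    def compress(image: np.ndarray):
        base_small = image[::8, ::8]
        base = np.kron(base_small, np.ones((8, 8), dtype=image.dtype))
        base = base[: image.shape[0], : image.shape[1]]
        residual = image.astype(np.int16) - base.astype(np.int16)
        return base_small, zlib.compress(residual.tobytes())

    def decompress(base_small: np.ndarray, blob: bytes, shape):
        base = np.kron(base_small, np.ones((8, 8), dtype=base_small.dtype))
        base = base[: shape[0], : shape[1]]
        residual = np.frombuffer(zlib.decompress(blob), dtype=np.int16).reshape(shape)
        return (base.astype(np.int16) + residual).astype(np.uint8)

    img = (np.arange(64 * 64) % 256).astype(np.uint8).reshape(64, 64)
    base_small, blob = compress(img)
    assert np.array_equal(decompress(base_small, blob, img.shape), img)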
I don't know if Chinchilla does the same, but PNG isn't really meant to be optimal lossless compression. That's partly to avoid patent issues with LZW, but also because part of the point of the format is progressive display, i.e. the ability to show a lower-resolution image constructed from a subset of pixels before the entire file has transferred. You can turn off interlacing if you wish, but having it on makes for less effective compression. As far as I understand, most implementations also opt for faster speed and lower energy use at the cost of worse compression.
I have never seen a PNG image show a lower resolution subset before it has fully loaded. (I've never seen an interlaced PNG, ever. And I always turn it off.)
Assuming any image or audio stream is available somewhere on the internet (not true, but very close), you could get enormous compression by replacing the media with its URL.
LLMs (the set of connections and their weights) are in fact a compressed version of a large part of the internet.
So the article's finding should not be surprising.
> Assuming any image or audio stream is available somewhere on the internet (not true, but very close), you could get enormous compression by replacing the media with its URL.
That reminds me of the "Dropship" utility: to save server storage and reduce file upload time, Dropbox used to deduplicate uploaded files globally in a way that files would have the same hash regardless of who uploaded them. Anyone who knew the hash of an uploaded file could download it to their Dropbox folder, so people could share large files by just sharing the Dropbox hash.
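The mechanism underneath (content-addressed storage in general, not Dropbox's actual protocol) is simple enough to sketch:

    import hashlib

    # Content-addressed deduplication sketch: the server keeps each unique blob
    # once, keyed by its hash, so identical uploads collapse to one copy and
    # anyone who knows the key can reference the blob.
    store: dict[str, bytes] = {}

    def upload(data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        store.setdefault(key, data)     # already present? stored only once
        return key                      # the "hash" people were sharing

    def download(key: str) -> bytes:
        return store[key]

    key = upload(b"big shared file contents")
    assert download(key) == b"big shared file contents"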
Does this paper actually say how they provide the image/audio data to the language model? You can't really just pipe 16kHz audio into CLIP and slap that into your model of choice. I skimmed the paper, then skimmed it again, and I still don't have an answer for what exactly they did here.
The paper actually lays it out pretty clearly. Yes, you can just pipe 16kHz audio into a tokenizer, which is exactly what they did (kind of).
They just treat the input as 8-bit ASCII (which in itself is technically not a thing, since ASCII is 7-bit, but the paper doesn't seem to care about technicalities).
So the input alphabet has 256 entries, one for each byte. Simple as that.
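In other words (my reading of the setup, not code from the paper), the "tokenizer" is essentially the identity map on bytes, with the stream chunked to fit the model's context window:

    # My reading of the setup, not code from the paper: every byte value 0..255
    # is its own token, so any file -- text, image, audio -- is a token sequence.
    def bytes_to_tokens(data: bytes, context_len: int = 2048):
        tokens = list(data)     # byte value == token id; alphabet size 256
        return [tokens[i:i + context_len] for i in range(0, len(tokens), context_len)]

    chunks = bytes_to_tokens(open(__file__, "rb").read())
    print(len(chunks), "chunks over an alphabet of 256 symbols")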
ts_zip ( https://bellard.org/ts_server/ts_zip.html ) already implements lossless compression with language models and is fast enough to be usable for some applications.
There does seem to be a growing awareness of the link between AI language models and data compression (AKA information compression)...
But what we're all looking for (ideally) is a hard mathematical proof (and/or algorithm) relating AI language models to data compression...
I'm guessing that, like so much else in the history of math and computer science, this will gradually be discovered by the world's mathematicians and computer scientists in phases, in degrees...
I'm guessing that this will start via a single mathematician postulating upper and/or lower limits relating one specific compression algorithm with one specific AI language model of a specific size.
From there, others should be able to generate better/more exact formulas for upper/lower bounds, and from there we might start to see some inexact, yet steps-in-the-right direction formulas which will eventually lead to the exact relationship...
I'm guessing, but not knowing, that the way to get started on all of this, should someone choose, would be as follows:
A) Take the simplest known compression algorithm;
B) Take the simplest known AI language model;
and
C) Try to determine mathematical relationships between them (upper and lower bounds, in all likelihood, initially), no matter how inexact...
D) Publish a paper or write a web page explaining what you tried, and what results were obtained...
E) Wait for other researchers to try to improve, if even incrementally, upon those results...
There's a great history as to how mathematicians speculated over time about the exact formula governing the occurrence and distribution of prime numbers; that mystery has been solved -- but the history and speculation and incremental steps taken by various mathematicians over time are fascinating -- a part of math history, to be sure...