There is a simple exact model to compare against: take the tokenized training data, build a Burrows–Wheeler transform of it, and add a few auxiliary data structures. Then, given a context of any length, you can efficiently get the exact distribution of the next token over the training data. This is much cheaper to build and use than any LLM of comparable size. It's also much less useful, because it's overfitted to the training data.
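For intuition, here's a minimal sketch of that exact-lookup model in Python. It uses a plain suffix array rather than the BWT plus auxiliary structures described above (those mainly make the same lookup more memory-efficient), and the function names are just illustrative:

```python
from bisect import bisect_left, bisect_right
from collections import Counter

def build_suffix_array(tokens):
    # Naive O(n^2 log n) construction; fine for a demo, far too slow for a real corpus.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def next_token_distribution(tokens, suffix_array, context):
    # All suffixes starting with `context` form one contiguous block in the
    # suffix array, so two binary searches find every occurrence at once.
    prefixes = [tokens[i:i + len(context)] for i in suffix_array]
    lo, hi = bisect_left(prefixes, context), bisect_right(prefixes, context)
    counts = Counter(
        tokens[i + len(context)]
        for i in suffix_array[lo:hi]
        if i + len(context) < len(tokens)
    )
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

corpus = "the cat sat on the mat the cat ate".split()
sa = build_suffix_array(corpus)
print(next_token_distribution(corpus, sa, ["the", "cat"]))
# {'sat': 0.5, 'ate': 0.5}: the exact empirical distribution, nothing more
```

Every prediction is a literal lookup into the corpus, which is exactly why it has nothing to say about a context it has never seen.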
LLMs approximate this model. By losing many little details and smoothing the probability distributions, they can generalize much better beyond the training data.
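That trade is easy to see even in a toy setting. The sketch below is nothing like how an LLM smooths anything; it's a "stupid backoff" n-gram model (all names are illustrative) that deliberately throws away detail, namely long contexts, in exchange for being able to answer about contexts the exact model above never saw:

```python
from collections import Counter, defaultdict

def train_counts(tokens, max_order=3):
    # Count every (context, next-token) pair for contexts of 0..max_order-1 tokens.
    counts = defaultdict(Counter)
    for n in range(max_order):
        for i in range(len(tokens) - n):
            counts[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return counts

def backoff_distribution(counts, context, max_order=3):
    # Drop the oldest context tokens until some suffix of the context was seen
    # in training; the empty context matches anything seen at all.
    for start in range(max(0, len(context) - max_order + 1), len(context) + 1):
        ctx = tuple(context[start:])
        if counts[ctx]:
            total = sum(counts[ctx].values())
            return {tok: c / total for tok, c in counts[ctx].items()}
    return {}

corpus = "the cat sat on the mat the cat ate".split()
counts = train_counts(corpus)
# "a cat" never appears in the corpus, but backing off to "cat" still answers.
print(backoff_distribution(counts, ["a", "cat"]))  # {'sat': 0.5, 'ate': 0.5}
```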
There isn't more information than you started with though. That's an illusion, kind of like procedurally generated terrain. Here's 1K of JavaScript that generates endless maps of islands that all look unique:
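Not the JavaScript in question, but the same trick sketched in Python: a tiny program plus a seed expands into arbitrarily many distinct-looking maps, while containing no information beyond the code and the seed.

```python
import random

def island_map(seed, width=48, height=16, land_ratio=0.35):
    # Deterministic "terrain": the same seed always produces the same map,
    # so the apparent variety is an illusion; code plus seed is all there is.
    rng = random.Random(seed)
    grid = [[rng.random() for _ in range(width)] for _ in range(height)]
    for _ in range(3):  # crude box blur turns white noise into island-shaped blobs
        grid = [[sum(grid[(y + dy) % height][(x + dx) % width]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9
                 for x in range(width)]
                for y in range(height)]
    cutoff = sorted(v for row in grid for v in row)[int(width * height * (1 - land_ratio))]
    return "\n".join("".join("#" if v >= cutoff else "~" for v in row) for row in grid)

print(island_map(seed=1))
print(island_map(seed=2))  # a different, equally "unique" map from the same tiny program
```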
An efficient pruned and quantized model contains about as much information as its file size would suggest.
The ability to train models and have them reproduce terabytes of human knowledge simply shows that this knowledge contains repetition in its underlying semantic structure that can be a target of compression.
Is there less information in the universe than we think there is?
> There isn't more information than you started with though.
There's the same amount of data, yes. The extra info is the structure of the model itself.
> The ability to train models and have them reproduce terabytes of human knowledge simply shows that this knowledge contains repetition in its underlying semantic structure that can be a target of compression.
Yes, and the interesting part is the "what" of that repetition. What patterns exist in written text?
> Is there less information in the universe than we think there is?
There is both less and more, depending on how you look at it.
Most of the information in the universe is the cosmic microwave background radiation (CMBR). It's literally everywhere. We can't predict exactly what CMBR data will exist at any point in spacetime. It's raw entropy: noise. The popular theory is that it comes from the expansion of the universe itself, originating from the Big Bang. From the most literal perspective, the entire universe is constantly creating more information.
Even though the specifics of the data are unpredictable, CMBR has an almost completely uniform frequency/spectrum and amplitude/temperature. From an inference perspective, the noisy entropy of the entire universe follows a coherent and homogeneous pattern. The better we can model that pattern, the more entropy we can factor out, like data compression.
This same dynamic applies to human generated entropy, particularly written language.
If we approach language processing from a literal perspective, we have to contend with the entire set of possible written language. Because natural language allows ambiguity, that set is too large to explicitly model: there are too many unpredictable details. This is why traditional parsing only works with unambiguous "context-free grammars" like those of programming languages.
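To make the ambiguity point concrete, here's a toy example (not from the original comment): even a small context-free grammar for English admits multiple parse trees for a single sentence, which is exactly what programming-language grammars are designed to rule out. The grammar and the CYK-style counter below are illustrative:

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form for the classic ambiguous sentence:
# "[saw the man] [with the telescope]" vs "saw [the man with the telescope]".
BINARY = [  # A -> B C
    ("S", "NP", "VP"),
    ("VP", "V", "NP"), ("VP", "VP", "PP"),
    ("NP", "Det", "N"), ("NP", "NP", "PP"),
    ("PP", "P", "NP"),
]
LEXICON = {  # word -> nonterminals that can produce it
    "I": ["NP"], "saw": ["V"], "the": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words, start="S"):
    # CYK chart parser that counts distinct parse trees for each span.
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))  # (i, j) -> symbol -> number of trees
    for i, w in enumerate(words):
        for sym in LEXICON.get(w, []):
            chart[(i, i + 1)][sym] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, b, c in BINARY:
                    chart[(i, j)][a] += chart[(i, k)][b] * chart[(k, j)][c]
    return chart[(0, n)][start]

print(count_parses("I saw the man with the telescope".split()))  # 2 parses, neither "wrong"
```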
If we approach language processing from an inference perspective, we only have to deal with the set of what has been written. This is what LLMs do. This method factors out enough entropy to be computable, but factoring out that entropy also means factoring out the explicit definitions language is constructed from. LLMs don't get stuck on ambiguity, but they don't resolve it either.
LLMs model the patterns that exist in text, then blindly follow those patterns. The result looks a lot like original content, but it isn't.