Probability is a bunch of numbers that add to 1. Sometimes you can use it to represent subjective beliefs; sometimes you can use it to represent objectively existing probability distributions. For example, an LLM is a probability distribution over the next token given the previous tokens. If two "observers" disagree about the probability an LLM assigns to some token, then at most one of them can be correct. So the probability is objective.
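To make this concrete, here's a toy sketch (the logits are made-up numbers, not from any real model): the distribution an LLM defines is just softmax over the logits it outputs for a given context, a fixed mathematical function with a determinate value.

```python
import math

def softmax(logits):
    # numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Pretend these logits came from model(context). Given a fixed context,
# the model produces fixed logits, so the distribution is a determinate
# object: two observers reporting different values can't both be right.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
assert abs(sum(probs) - 1.0) < 1e-9  # "a bunch of numbers that add to 1"
```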
We're talking about 2 different things. I agree that probability is objective as long as you've already decided on the definition of the macrostate, but that definition is subjective.
From an LLM's perspective, the macrostate is all the tokens in the context window and nothing more. A different observer may be able to take into account other information, such as the identity and mental state of the author, giving rise to a different distribution. Both of these models can be objectively valid even though they're different, because they rely on different definitions of the macrostate.
It can be hard to wrap your head around this, but try taking it to the extreme. Let's say there's an omniscient being that knows absolutely everything there is to know about every single atom within a system. To that observer, probability does not exist, because every macrostate represents a single microstate. In order for something to be repeated (which is core to the frequentist definition of probability), it must start from the exact same microstate, and thus always have the same outcome.
You might think that true randomness exists at the quantum level and that means true omniscience is impossible (and thus irrelevant), but that's not provable and, even if it were true, would not invalidate the general point that probabilities are determined by macrostate definition.
Suppose you're training a language model by minimizing cross entropy, and the omniscient being is watching. In each step, your model instantiates some probability distribution, and gradients of the loss are computed with respect to it. That distribution exists, and it is not deterministic even to the omniscient entity.
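A minimal sketch of the point (toy numbers, not a real model): the cross-entropy loss is defined directly on the model's distribution, so that distribution has to exist as an object in the training procedure regardless of what any observer knows.

```python
import math

def cross_entropy(probs, target_index):
    # negative log-likelihood of the observed next token under the
    # model's distribution; this is the quantity being minimized
    return -math.log(probs[target_index])

# The model's distribution over a 3-token vocabulary for some context.
probs = [0.7, 0.2, 0.1]
# The actually observed next token was index 0.
loss = cross_entropy(probs, target_index=0)
```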
An LLM is given a definition of the macrostate which creates the probability distribution, but a different definition of the macrostate (such as would be known to the omniscient being) would create a different distribution. According to the omniscient entity, the vast majority of long combinations of tokens would have zero probability because nobody will ever write them down in that order. The infinite monkey theorem is misleading in this regard. The odds of producing Shakespeare's works completely randomly before the heat death of the universe are practically zero, even if all the computing power in the world were dedicated to the cause.
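A back-of-envelope calculation backs this up (all figures are rough assumptions: roughly 5 million characters in Shakespeare's complete works, typed from a ~30-symbol alphabet, with a wildly generous attempt budget):

```python
import math

chars = 5_000_000      # assumed size of the complete works, in characters
alphabet = 30          # assumed symbol set for the monkeys' typewriter

# log10 of the success probability of a single random attempt
log10_p = -chars * math.log10(alphabet)

# Absurdly generous budget: 10^50 attempts per second for 10^100 seconds.
log10_attempts = 50 + 100

# Expected number of successes = attempts * p; work in log10 to avoid
# underflow. Anything hugely negative means "will never happen".
log10_expected = log10_attempts + log10_p
```

Even with that budget, the expected number of successes comes out around 10 to the minus several million, i.e. practically zero, as the comment says.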
That “probability distribution” is just a mathematical function assigning numbers to tokens. It is defined by applying a set of deterministic mathematical functions, which both the person who created the model and the omniscient entity know, to a sequence of observed inputs that both of them also know.
> If two "observers" disagree about an LLM's probability assigned to some token, then only at most one of them can be correct.
The observer who knows the implementation in detail and the state of the pseudo-random number generator can predict the next token with certainty. (Or near certainty, if we account for bit-flipping cosmic rays, etc.)
That’s the probability of observing a token given the prompt and the seed. The probability assigned to a token given the prompt alone is a separate thing, which is objectively defined independent of any observer and can be found by reading out the model logits.
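The distinction can be sketched in a few lines (toy model, made-up logits): the distribution given the prompt alone is read straight off the logits, while the token given the prompt *and* the PRNG seed is fully determined.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Pretend model(prompt) produced these logits.
logits = [1.5, 0.5, -1.0]
probs = softmax(logits)          # P(token | prompt): seed-free, objective

def sample(probs, seed):
    # inverse-CDF sampling driven by a seeded PRNG
    rng = random.Random(seed)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

# Same prompt, same seed -> same token, every time:
# P(token | prompt, seed) is degenerate.
assert sample(probs, seed=42) == sample(probs, seed=42)
```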
Yes, that’s a purely mathematical abstract concept that exists outside of space and time. The labels “objective” and “subjective” are usually used to talk about probabilities in relation to the physical world.
An LLM distribution exists in the physical world, just as much as this comment does. It didn’t exist before the model was trained. It has relation to the physical world: it assigns probabilities to subword units of text. It has commercial value that it wouldn’t have if its objective probability values were different.
> It has relation to the physical world: it assigns probabilities to subword units of text.
How exactly is that probability assignment linked to the physical world? In the physical world the computer will produce a token, and earlier you rejected the idea that this is about predicting the token that will be produced.
Or maybe you mean that the probability assignments are not about the output of a particular LLM implementation in the real world but about subword units of text in the wild.
In that case, how could two different LLMs make different assignments to the same physical world without being wrong? Would they be “objective” but unrelated to the “object”?