Not to be contrarian, but if the next word to predict happens to be someone's name, a place, or something discussed in multiple places in the book, then often, yes, knowledge of the full plot is "required" just to predict that word once you get to the middle or end of the book.
For example you could never fill in the last chapter of any good book without having knowledge of every previous chapter. Not highly detailed knowledge, but still knowledge.
What an LLM does is stuff it all into short-term memory. Humans dump the first pages into long-term memory and "make sense" of them. Humans have a massive effective context window because of this (and sheer brain size and efficiency).
We don’t put things into long-term memory right after reading them; that usually happens after a night of sleep. I personally think the context (and, correspondingly, the KV cache) in these models is akin to our short-term memory, while the training process (and the actual weights) is akin to our long-term memory. And we can’t be sure our own short-term memory doesn’t work by matching the current context against what is already stored there. From this perspective, transformers are enough and just fine.
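As a rough sketch of that analogy (toy numpy attention with made-up shapes, not any real model): during decoding the weight matrices stay frozen, while the KV cache grows with every token of context.

    import numpy as np

    d = 8
    rng = np.random.default_rng(0)
    # frozen after training: the "long-term memory"
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

    # grows with every decoded token: the "short-term memory"
    k_cache, v_cache = [], []

    def decode_step(x):
        q = x @ W_q
        k_cache.append(x @ W_k)
        v_cache.append(x @ W_v)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = np.exp(q @ K.T / np.sqrt(d))
        weights = scores / scores.sum()
        return weights @ V  # current token mixed with everything still in the cache

    for _ in range(5):
        decode_step(rng.normal(size=d))
    print(len(k_cache))  # 5 -- the cache is exactly as long as the context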
So if you now hide my original comment and try to recall what I said, do you know it word for word (down to whether I used one or two spaces somewhere, which would change the tokens), or do you just have a rough concept of what I said?
OTOH if you had to remember a phone number to write it down, how does that differ?
I think in a way this makes transformers superior to humans: their short-term memory is much more powerful =)
Supporting extra-long contexts also makes transformers superhuman, because, again, human short-term memory is exactly that: short term. And much shorter than millions of tokens we expect from models nowadays.
As for SSMs, I think they compress the model's memory state way too much. Mixed global/local attention layers do just as well, and sparse/block attention looks like a much more promising way forward (https://arxiv.org/abs/2502.11089).
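Roughly what "local" buys you, in case it helps (window size and sequence length are made up for illustration): each token attends to a fixed-size window instead of the whole prefix, so the number of attended pairs grows linearly rather than quadratically.

    import numpy as np

    def causal_mask(n):
        # global attention: every token can look at its entire prefix
        return np.tril(np.ones((n, n), dtype=bool))

    def sliding_window_mask(n, window):
        # local attention: each token only looks at the last `window` tokens
        i = np.arange(n)[:, None]
        j = np.arange(n)[None, :]
        return (j <= i) & (j > i - window)

    n, window = 8, 3
    print(causal_mask(n).sum())                  # 36 pairs, O(n^2)
    print(sliding_window_mask(n, window).sum())  # 21 pairs, O(n * window)

Mixing a few global layers in between the local ones is what restores long-range access without paying the quadratic cost everywhere.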
> And much shorter than millions of tokens we expect from models nowadays.
Yet all current models still suck above 32k. (Yes, some can do needle-in-a-haystack fine, but they still fail at anything even slightly more complex over a long context.)
32k is still much higher than a human's, though, so I agree with you that it gives them some kind of superhuman ability over moderately long contexts, but they are still disappointingly bad over longer ones.
Out of curiosity, I estimated per-day context size (of text only!) by multiplying reading speed by the number of waking minutes: 16 * 60 * 300 = 288,000 words ~ 288,000 tokens.
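Same back-of-the-envelope in code, with the assumptions spelled out (reading speed and the tokens-per-word ratio are rough guesses, not measurements):

    waking_hours = 16           # assumption
    words_per_minute = 300      # assumption: brisk silent reading, all day long
    words_per_day = waking_hours * 60 * words_per_minute
    tokens_per_word = 1.3       # rough ratio for English BPE tokenizers
    print(words_per_day)                         # 288000
    print(int(words_per_day * tokens_per_word))  # 374400 -- same order of magnitude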
Isn't this exactly the point of this model? No need to memorize everything (which is what makes transformers expensive), just keep the relevant info. SSMs are essentially recurrent models.
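A minimal sketch of what "recurrent" means here (random placeholder matrices, not any particular SSM variant): the state has a fixed size no matter how long the input gets, which is exactly the "keep only the relevant info" trade-off.

    import numpy as np

    d_state, d_in = 16, 8
    rng = np.random.default_rng(0)
    A = 0.95 * np.eye(d_state)        # decay: old information fades unless it stays useful
    B = rng.normal(size=(d_state, d_in))
    C = rng.normal(size=(d_in, d_state))

    h = np.zeros(d_state)             # the whole "memory" -- its size never grows
    for _ in range(10_000):           # however long the sequence gets...
        x = rng.normal(size=d_in)
        h = A @ h + B @ x             # ...the past is compressed into h
        y = C @ h
    print(h.shape)                    # (16,) -- constant-size state, unlike a KV cache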
You can't always know what will be "relevant info" in the future. Even humans can't do this, but whenever that's an issue, we just go back and re-read, re-watch, etc.
None of these modern recurrent architectures have a way to do this.
How often do you go back and rewatch earlier parts of a movie? I hardly ever do this. In the cinema, at the theater, or when listening to the radio, it’s simply impossible, and it still works.
You are mentioning media that are largely for entertainment. Sure, you might not go back to re-attend to those. But if you will be tested, or are doing research, are you really looking at a large source only once?