Not to be contrarian, but if the next word to predict happens to be someone's name, a place, or something discussed in multiple places in the book, then often, yes, knowledge of the full plot is "required" just to predict that word once you get to the middle or end of the book.
For example you could never fill in the last chapter of any good book without having knowledge of every previous chapter. Not highly detailed knowledge, but still knowledge.
What an LLM does is stuff it all into short-term memory. Humans dump the first pages into long-term memory and "make sense" of them. Humans have a massive effective context window because of this (and sheer brain size and efficiency).
We don’t put things into long-term memory right after reading them; that usually happens after a night of sleep. I personally think the context (and, correspondingly, the KV cache) in these models is akin to our short-term memory, while the training process (and the actual weights) is akin to our long-term memory. And we can’t be sure our own short-term memory doesn’t work by matching the current context against what is already stored there. From this perspective, transformers are enough and just fine.
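As a rough sketch of that analogy (toy numpy attention with made-up shapes, not any real model): during decoding the weight matrices stay frozen, while the KV cache grows with every token of context.

    import numpy as np

    d = 8
    rng = np.random.default_rng(0)
    # frozen after training: the "long-term memory"
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

    # grows with every decoded token: the "short-term memory"
    k_cache, v_cache = [], []

    def decode_step(x):
        q = x @ W_q
        k_cache.append(x @ W_k)
        v_cache.append(x @ W_v)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = np.exp(q @ K.T / np.sqrt(d))
        weights = scores / scores.sum()
        return weights @ V  # current token mixed with everything still in the cache

    for _ in range(5):
        decode_step(rng.normal(size=d))
    print(len(k_cache))  # 5 -- the cache is exactly as long as the context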
So if you now hide my original comment and try to recall what I said, do you know it word for word (down to whether I used one or two spaces somewhere, which would change the tokens), or do you just have a rough concept of what I said?
OTOH if you had to remember a phone number to write it down, how does that differ?
I think in a way this makes transformers superior to humans: their short-term memory is much more powerful =)
Supporting extra-long contexts also makes transformers superhuman, because, again, human short-term memory is exactly that: short term. And much shorter than millions of tokens we expect from models nowadays.
As for SSMs, I think they compress the model's memory state way too much. Mixed global/local attention layers do just as well, and sparse/block attention looks like a much more promising way forward (https://arxiv.org/abs/2502.11089).
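Roughly what "local" buys you, in case it helps (window size and sequence length are made up for illustration): each token attends to a fixed-size window instead of the whole prefix, so the number of attended pairs grows linearly rather than quadratically.

    import numpy as np

    def causal_mask(n):
        # global attention: every token can look at its entire prefix
        return np.tril(np.ones((n, n), dtype=bool))

    def sliding_window_mask(n, window):
        # local attention: each token only looks at the last `window` tokens
        i = np.arange(n)[:, None]
        j = np.arange(n)[None, :]
        return (j <= i) & (j > i - window)

    n, window = 8, 3
    print(causal_mask(n).sum())                  # 36 pairs, O(n^2)
    print(sliding_window_mask(n, window).sum())  # 21 pairs, O(n * window)

Mixing a few global layers in between the local ones is what restores long-range access without paying the quadratic cost everywhere.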
> And much shorter than millions of tokens we expect from models nowadays.
Yet all current models still suck above 32k. (Yes, some can do needle-in-a-haystack fine, but they still fail at anything even slightly more complex over a long context.)
32k is still much higher than a human's, though, so I agree with you that it gives them some kind of superhuman ability over moderately long contexts, but they are still disappointingly bad over longer ones.
Out of curiosity, I estimated per-day context size (of text only!) by multiplying reading speed by the number of waking minutes: 16 * 60 * 300 = 288,000 words ~ 288,000 tokens.
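Same back-of-the-envelope in code, with the assumptions spelled out (reading speed and the tokens-per-word ratio are rough guesses, not measurements):

    waking_hours = 16           # assumption
    words_per_minute = 300      # assumption: brisk silent reading, all day long
    words_per_day = waking_hours * 60 * words_per_minute
    tokens_per_word = 1.3       # rough ratio for English BPE tokenizers
    print(words_per_day)                         # 288000
    print(int(words_per_day * tokens_per_word))  # 374400 -- same order of magnitude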
Isn't this exactly the point of this model? No need to memorize everything (which is what makes transformers expensive), just keep the relevant info. SSMs are essentially recurrent models.
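A minimal sketch of what "recurrent" means here (random placeholder matrices, not any particular SSM variant): the state has a fixed size no matter how long the input gets, which is exactly the "keep only the relevant info" trade-off.

    import numpy as np

    d_state, d_in = 16, 8
    rng = np.random.default_rng(0)
    A = 0.95 * np.eye(d_state)        # decay: old information fades unless it stays useful
    B = rng.normal(size=(d_state, d_in))
    C = rng.normal(size=(d_in, d_state))

    h = np.zeros(d_state)             # the whole "memory" -- its size never grows
    for _ in range(10_000):           # however long the sequence gets...
        x = rng.normal(size=d_in)
        h = A @ h + B @ x             # ...the past is compressed into h
        y = C @ h
    print(h.shape)                    # (16,) -- constant-size state, unlike a KV cache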
You can't always know what will be "relevant info" in the future. Even humans can't do this, but whenever that's an issue, we just go back and re-read, re-watch, etc.
None of these modern recurrent architectures have a way to do this.
How often do you go back and rewatch earlier parts of a movie? I hardly ever do this. In the cinema, at the theater, or when listening to the radio, it’s simply impossible, and it still works.
You are mentioning media that are largely for entertainment. Sure, you might not go back to re-attend to those. But if you will be tested, or are doing research, are you really looking at a large source only once?