
"hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs.

Also note that if the sequence length is not really much larger than the model dimension (at least two orders of magnitude more), the quadratic complexity of self-attention is not such a big issue - the matrix multiplications in the feed-forward layers will usually be about 8x the model dimension squared, and that part will usually dominate.
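
A rough back-of-the-envelope sketch of that claim (my own numbers, assuming a vanilla transformer block with an FFN hidden size of 4x the model dimension and full multi-head attention; the function names are just illustrative):

    # Multiply-accumulate counts per token, per layer, for a vanilla
    # transformer block (FFN hidden = 4*d, full multi-head attention).
    # Back-of-the-envelope only; ignores softmax, norms, etc.
    def macs_per_token(n, d):
        attn_quadratic = 2 * n * d   # QK^T scores + attention-weighted sum over V
        attn_proj = 4 * d * d        # Q, K, V and output projections
        ffn = 8 * d * d              # d -> 4d -> d matmuls
        return attn_quadratic, attn_proj + ffn

    d = 4096
    for n in (4_096, 32_768, 262_144, 1_000_000):
        quad, rest = macs_per_token(n, d)
        print(f"n={n:>9}: quadratic part is {quad / rest:.2f}x the rest")

The quadratic term only overtakes the projections plus FFN once n is several times the model dimension, and only clearly dominates well beyond that.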

Also note that there has been so much research on this already. While this particular approach might be novel, there have been attempts to avoid the O(n^2) complexity of self-attention basically ever since the original transformer paper came out in 2017. I'm a bit surprised that this paper does not cite xLSTM or Block-Recurrent Transformers.

Also, this paper falls very short on experiments. There is basically only Table 2. There is no study of length extrapolation (which is very relevant to the topic), no needle-in-a-haystack experiments, no scaling studies, no larger-scale experiments, etc. Even in the main Table 2, I see a couple of typos. And looking at the results in Table 2, the improvements seem to be quite minor.

So I would conclude, this needs a lot more work.



> Also note that if the sequence length is not really much larger than the model dimension (at least two orders of magnitude more), the quadratic complexity of self-attention is not such a big issue - the matrix multiplications in the feed-forward layers will usually be about 8x the model dimension squared, and that part will usually dominate.

This is incorrect in the case of batched inference. There are two bottlenecks at play, compute and memory, and your reasoning applies to compute. For memory it gets trickier: for the MLP layers you need to read the same set of weights for all elements of your batch, while for the attention KV cache each batch element has its own entries. That's why, in practice, the sequence length at which attention dominates is closer to model dimension / batch size rather than just the model dimension. And that number isn't so high anymore.
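
A rough sketch of the memory side (my own back-of-the-envelope, assuming fp16 weights and KV cache, a dense FFN and full multi-head attention, i.e. no GQA; the helper name is just illustrative):

    # Bytes read from memory per decode step, per layer. FFN weights are read
    # once and amortized over the whole batch; the KV cache is read per sequence.
    def bytes_per_step(n, d, batch, dtype_bytes=2):
        ffn_weights = 8 * d * d * dtype_bytes         # shared across the batch
        kv_cache = batch * 2 * n * d * dtype_bytes    # K and V for every cached position
        return ffn_weights, kv_cache

    d, batch = 4096, 32
    for n in (1_024, 4_096, 65_536):
        ffn, kv = bytes_per_step(n, d, batch)
        print(f"n={n:>6}: KV reads are {kv / ffn:.1f}x the FFN weight reads")

The crossover is around n ~ 4*d / batch, so with a reasonable batch size attention becomes the memory bottleneck at sequence lengths well below the model dimension.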


> Unlike traditional Transformer designs, which suffer from quadratic memory and computation overload due to the nature of the self attention mechanism, our model avoids token to token attention entirely.

I skimmed the paper; unlike transformers, this basically scales much more efficiently with longer context - while it's possible to fit 1M tokens into a transformer's context, you need a significant amount of memory. They only benchmark against GPT-2, though, so I would say this is quite preliminary work so far, although a promising architecture.
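
For a sense of scale, a rough KV-cache estimate at 1M context (my own numbers, assuming fp16, full multi-head attention, batch size 1; real deployments shrink this with GQA/MQA and quantization):

    # KV cache size for a vanilla transformer:
    # 2 (K and V) * layers * tokens * d_model * bytes per value.
    def kv_cache_gib(n_tokens, n_layers, d_model, dtype_bytes=2):
        return 2 * n_layers * n_tokens * d_model * dtype_bytes / 2**30

    print(kv_cache_gib(1_000_000, n_layers=12, d_model=768))    # GPT-2-small scale: ~34 GiB
    print(kv_cache_gib(1_000_000, n_layers=32, d_model=4096))   # 7B-class model: ~488 GiB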


> "hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs.

Yes, but those all rely on proprietary company secrets, while this is an open research paper. Besides, only Gemini so far has a context window of more than a million tokens.


Llama 4 Scout has one too, and it is an open-weight LLM; unfortunately, it is also disappointing at pretty much any context length…



