
"hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs.

Also note that if the sequence length is not really much larger than the model dimension (at least two orders of magnitude more), the quadratic complexity of self-attention is not such a big issue - the matrix multiplications in the feed-forward layers will usually be about 8x the model dimension squared, and that part will usually dominate.
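
A rough back-of-the-envelope sketch of that claim (my own numbers, assuming a vanilla transformer block with an FFN hidden size of 4x the model dimension and full multi-head attention; the function names are just illustrative):

    # Multiply-accumulate counts per token, per layer, for a vanilla
    # transformer block (FFN hidden = 4*d, full multi-head attention).
    # Back-of-the-envelope only; ignores softmax, norms, etc.
    def macs_per_token(n, d):
        attn_quadratic = 2 * n * d   # QK^T scores + attention-weighted sum over V
        attn_proj = 4 * d * d        # Q, K, V and output projections
        ffn = 8 * d * d              # d -> 4d -> d matmuls
        return attn_quadratic, attn_proj + ffn

    d = 4096
    for n in (4_096, 32_768, 262_144, 1_000_000):
        quad, rest = macs_per_token(n, d)
        print(f"n={n:>9}: quadratic part is {quad / rest:.2f}x the rest")

The quadratic term only overtakes the projections plus FFN once n is several times the model dimension, and only clearly dominates well beyond that.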

Also note that there has been so much research on this already. While this particular approach might be novel, there have been attempts to avoid the O(n^2) complexity of self-attention basically ever since the original transformer paper came out in 2017. I'm a bit surprised that this paper does not cite xLSTM or Block-Recurrent Transformers.

Also, this paper falls very short on experiments. There is basically only Table 2. There is no study of length extrapolation (which is very relevant to the topic), no needle-in-a-haystack experiments, no scaling studies, no larger-scale experiments, etc. Even in the main Table 2, I see a couple of typos. And looking at the results in Table 2, the improvements seem to be quite minor.

So I would conclude, this needs a lot more work.



> Also note that if the sequence length is not really much larger than the model dimension (at least two orders of magnitude more), the quadratic complexity of self-attention is not such a big issue - the matrix multiplications in the feed-forward layers will usually be about 8x the model dimension squared, and that part will usually dominate.

This is incorrect in the case of batched inference. There are two bottlenecks at play, compute and memory, and your reasoning applies to compute. For memory it gets trickier: for the MLP layers you need to read the same set of weights for all elements of your batch, while for the attention KV cache each batch element has its own entries. That's why, in practice, the sequence length at which attention dominates is closer to model dimension / batch size rather than just the model dimension. And that number isn't so high anymore.
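
A rough sketch of the memory side (my own back-of-the-envelope, assuming fp16 weights and KV cache, a dense FFN and full multi-head attention, i.e. no GQA; the helper name is just illustrative):

    # Bytes read from memory per decode step, per layer. FFN weights are read
    # once and amortized over the whole batch; the KV cache is read per sequence.
    def bytes_per_step(n, d, batch, dtype_bytes=2):
        ffn_weights = 8 * d * d * dtype_bytes         # shared across the batch
        kv_cache = batch * 2 * n * d * dtype_bytes    # K and V for every cached position
        return ffn_weights, kv_cache

    d, batch = 4096, 32
    for n in (1_024, 4_096, 65_536):
        ffn, kv = bytes_per_step(n, d, batch)
        print(f"n={n:>6}: KV reads are {kv / ffn:.1f}x the FFN weight reads")

The crossover is around n ~ 4*d / batch, so with a reasonable batch size attention becomes the memory bottleneck at sequence lengths well below the model dimension.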


> Unlike traditional Transformer designs, which suffer from quadratic memory and computation overload due to the nature of the self attention mechanism, our model avoids token to token attention entirely.

I skimmed the paper; unlike transformers, this basically scales much more efficiently with longer context - while it's possible to fit 1M tokens into a transformer's context, you need a significant amount of memory. They only benchmark against GPT-2, though, so I would say this is quite preliminary work so far, although a promising architecture.
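
For a sense of scale, a rough KV-cache estimate at 1M context (my own numbers, assuming fp16, full multi-head attention, batch size 1; real deployments shrink this with GQA/MQA and quantization):

    # KV cache size for a vanilla transformer:
    # 2 (K and V) * layers * tokens * d_model * bytes per value.
    def kv_cache_gib(n_tokens, n_layers, d_model, dtype_bytes=2):
        return 2 * n_layers * n_tokens * d_model * dtype_bytes / 2**30

    print(kv_cache_gib(1_000_000, n_layers=12, d_model=768))    # GPT-2-small scale: ~34 GiB
    print(kv_cache_gib(1_000_000, n_layers=32, d_model=4096))   # 7B-class model: ~488 GiB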


> "hundreds of thousands to potentially millions of tokens" - that's the same order as current commercial LLMs.

Yes, but those all rely on proprietary company secrets, while this is an open research paper. Besides, only Gemini so far has a context window of more than a million tokens.


Llama 4 Scout has one too, and it is an open-weight LLM; unfortunately, it is also disappointing at pretty much any context length…



