
I don't know what those words mean, but I am excited for the possibilities.


LLMs can look back over a certain number (N) of tokens, which roughly correspond to words. For instance, if you want to summarize or answer questions about a document accurately, the length of the document has to be less than N.

Conventionally they use an attention mechanism that compares every token to every other token, which has a cost of N*N, i.e. N squared, which is quadratic. If you want LLMs to chew over a huge amount of context (say, all the source code for your project) that’s a problem, so people are looking for ways around this.
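
Very roughly, the quadratic step looks like this (a toy NumPy sketch of plain dot-product attention, ignoring masking and multi-head details; not anything specific to this paper):

    import numpy as np

    def attention(q, k, v):
        # q, k, v: (N, d) arrays, one row per token
        d = q.shape[-1]
        scores = (q @ k.T) / np.sqrt(d)               # (N, N): every token scored against every other
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v                            # (N, d): weighted mix of the values

The scores matrix has N*N entries, which is where the quadratic time and memory come from.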


Thank you for that explanation


Adding to that excellent high-level explanation of what the attention mechanism is, I’d add (from my reading of the abstract of this paper):

This work builds a model that can “remember” parts of its previous input when generating and processing new input, and devotes part of its intelligence to determining what is relevant to remember.

This is in lieu of effectively saying “I need to keep re-reading everything I’ve already read and said in order to keep going”.

I’d welcome better explanations. :)


Not even that. With KV-caching, it's linear in the size of the context; and if someone figured out a way to have e.g. N log N complexity, I imagine with KV-caching it may go down to log N complexity. (If the new algorithm permits that.)
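
For anyone unfamiliar, the KV-cache idea is roughly this (a sketch, not any particular implementation): keys and values for past tokens are computed once and kept around, so each newly generated token only does work proportional to the current length:

    import numpy as np

    def decode_step(q_new, k_new, v_new, k_cache, v_cache):
        # q_new, k_new, v_new: (d,) vectors for the token being generated
        # k_cache, v_cache: (t, d) keys/values for everything seen so far
        k_cache = np.vstack([k_cache, k_new])  # append one row, reuse the rest
        v_cache = np.vstack([v_cache, v_new])
        scores = (k_cache @ q_new) / np.sqrt(len(q_new))  # (t+1,): O(t) work for this token
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ v_cache, k_cache, v_cache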


When people say that attention is quadratic, they mean that the cost to process n tokens is O(n²), so the amortized cost per token is indeed O(n). KV-caching is a way to maintain that amortized cost when appending tokens one at a time instead of ingesting the whole sequence at once. But in the end people want to be able to generate multiple tokens, so we're back at O(n²) total time again.
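
Spelled out: with a KV cache, appending the t-th token costs O(t), so generating n tokens costs

    1 + 2 + ... + n = n(n+1)/2 = O(n²)

The cache removes redundant recomputation, but the total is still quadratic.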

IIRC there are some FFT-based attention alternatives where encoding has complexity O(n log n), but there's no feasible way to cache anything and after appending a single token it costs O(n log n) again, so if you generate n tokens in sequence, the cost is actually O(n² log n).
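
Same arithmetic there: if nothing can be cached, step t re-encodes the whole prefix at O(t log t), and summing that over t = 1..n is bounded by n * (n log n) = O(n² log n).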



