When people say that attention is quadratic, they mean that the cost to process n tokens is O(n²), so the amortized cost per token is indeed O(n). KV-caching is a way to maintain that amortized cost when appending tokens one at a time instead of ingesting the whole sequence at once: with the keys and values of earlier tokens cached, appending token t only requires attending from the new query to t cached rows, which costs O(t). But in the end people want to generate multiple tokens, and summing O(t) over n steps puts us back at O(n²) total time again.
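To make that arithmetic concrete, here's a minimal NumPy sketch of the caching pattern (the names and shapes are mine, not from any particular library): each step appends one key/value row instead of recomputing everything, so step t does O(t·d) work and the n-step total is O(n²·d).

```python
import numpy as np

def attend(q, K, V):
    # Single-query attention over t cached key/value rows: O(t·d) work.
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 16
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for t in range(1, 6):
    q, k, v = rng.normal(size=(3, d))        # projections for the new token
    K_cache = np.vstack([K_cache, k])        # append to the cache...
    V_cache = np.vstack([V_cache, v])        # ...instead of recomputing
    out = attend(q, K_cache, V_cache)        # step t costs O(t·d)

# Total over n steps: sum_t O(t) = O(n²), i.e. amortized O(n) per token.
```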
IIRC there are some FFT-based attention alternatives where encoding a sequence has complexity O(n log n). But there's no feasible way to cache anything: after appending a single token, the whole transform has to be recomputed at O(n log n) again, so if you generate n tokens in sequence, the cost is actually O(n² log n).
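For contrast, here's a toy illustration of why the caching trick doesn't carry over. I'm assuming FNet-style Fourier token mixing as the "FFT-based alternative" (my assumption; this is a sketch, not any model's actual implementation): appending one token changes every output position, so nothing from the previous step can be reused.

```python
import numpy as np

def fourier_mix(X):
    # FNet-style token mixing: FFT along the sequence axis, O(n log n) per channel.
    return np.fft.fft(X, axis=0).real

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens, 8 channels
before = fourier_mix(X)

X2 = np.vstack([X, rng.normal(size=(1, 8))])  # append a single token
after = fourier_mix(X2)

# Every output row changes, not just the last one, so there is no cacheable state:
print(np.allclose(before, after[:-1]))        # False

# Recomputing O(n log n) for each of n generated tokens gives O(n² log n) total.
```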