
I like the idea of removing quadratic scaling for attention, but this paper has thin experimental support. No real tasks are tested beyond perplexity: nothing on reasoning, retrieval QA, or summarization quality. Even on perplexity the gains are marginal.

However, since it does remove attention, I think it's worth watching this space of non-attention models.


