Huh, I can't say I'm on the cutting edge but that's not how I understand transformers to work.
By my understanding, each token has attention calculated against every previous token. I.e. the 10th token in the sequence requires O(10) new calculations (in addition to the O(9^2) previous calculations, which can be cached). While I'd assume they cache what they can, that still means that if a long prompt doubles the total length of the final context (input + output), the final cost should be roughly 4x as much...
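A quick back-of-the-envelope sketch of what I mean (toy numbers, just counting token-pair attention computations, ignoring everything else a real model does):

```python
# Count pairwise attention computations for a causal context of n tokens,
# assuming each new token attends to itself and everything before it.
def attention_pairs(n_tokens: int) -> int:
    # token i (1-indexed) attends to i positions -> total is n(n+1)/2
    return n_tokens * (n_tokens + 1) // 2

base = attention_pairs(1000)     # e.g. a 1k-token context
doubled = attention_pairs(2000)  # long prompt doubles the total context
print(doubled / base)            # ~4x, i.e. roughly quadratic growth
```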
This is correct. Caching only saves you from having to recompute self-attention over the system prompt tokens; it doesn't save the attention computed by subsequent tokens, which are free to attend back to the prompt.
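To make the split concrete, here's a toy cost breakdown (assumed numbers, same pair-counting approximation as above, not how any real serving stack accounts for cost): the cache removes the prompt's own self-attention term, but every generated token still attends over the full prefix, so the prompt length still shows up in the generation cost.

```python
# Toy breakdown: prompt self-attention vs. attention paid during generation.
def generation_attention(prompt_len: int, output_len: int, cached: bool) -> int:
    # prompt self-attention is skipped entirely if it's served from a KV cache
    prompt_cost = 0 if cached else prompt_len * (prompt_len + 1) // 2
    # output token j still attends to prompt_len + j positions
    output_cost = sum(prompt_len + j for j in range(1, output_len + 1))
    return prompt_cost + output_cost

print(generation_attention(4000, 500, cached=False))  # full cost
print(generation_attention(4000, 500, cached=True))   # prompt term removed, prefix attention remains
```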
My understanding is that even though it's quadratic, the cost at most context lengths is still relatively low. So for short inputs it's not bad, and for long inputs the system prompt is a small fraction of the total context anyway.
And there's value in having extra tokens even when they carry little information, since the models are decent at using the extra computation.