Huh, I can't say I'm on the cutting edge but that's not how I understand transformers to work.
By my understanding, each token has attention calculated against every previous token. I.e. the 10th token in the sequence requires O(10) new calculations (in addition to the O(9^2) previous calculations, which can be cached). While I'd assume they cache what they can, that still means that if a long prompt doubles the total length of the final context (input + output), the final cost should be roughly 4x as much...
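A quick back-of-the-envelope sketch of what I mean (toy numbers, just counting token-pair attention computations, ignoring everything else a real model does):

```python
# Count pairwise attention computations for a causal context of n tokens,
# assuming each new token attends to itself and everything before it.
def attention_pairs(n_tokens: int) -> int:
    # token i (1-indexed) attends to i positions -> total is n(n+1)/2
    return n_tokens * (n_tokens + 1) // 2

base = attention_pairs(1000)     # e.g. a 1k-token context
doubled = attention_pairs(2000)  # long prompt doubles the total context
print(doubled / base)            # ~4x, i.e. roughly quadratic growth
```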
This is correct. Caching only saves you from having to recompute self-attention over the system prompt tokens; it doesn't save the attention computed by subsequent tokens, which are free to attend back to the prompt.
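To make the split concrete, here's a toy cost breakdown (assumed numbers, same pair-counting approximation as above, not how any real serving stack accounts for cost): the cache removes the prompt's own self-attention term, but every generated token still attends over the full prefix, so the prompt length still shows up in the generation cost.

```python
# Toy breakdown: prompt self-attention vs. attention paid during generation.
def generation_attention(prompt_len: int, output_len: int, cached: bool) -> int:
    # prompt self-attention is skipped entirely if it's served from a KV cache
    prompt_cost = 0 if cached else prompt_len * (prompt_len + 1) // 2
    # output token j still attends to prompt_len + j positions
    output_cost = sum(prompt_len + j for j in range(1, output_len + 1))
    return prompt_cost + output_cost

print(generation_attention(4000, 500, cached=False))  # full cost
print(generation_attention(4000, 500, cached=True))   # prompt term removed, prefix attention remains
```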
My understanding is that even though it's quadratic, the cost at most context lengths is still relatively low. So for short inputs it's not bad, and for long inputs the system prompt is a small fraction of the total context anyway.
And there's value in having extra tokens even when they carry little information, since the models are decent at using the extra computation.