Can you elaborate on how popping arguments penalizes every call? It seems like, for non-variadic calls at least, it should take the same amount of time to add a constant to %esp whether you're doing it in the RET instruction or in the caller; but factoring that code into the callee should improve code density and therefore icache hit rates. What am I missing?
With TCO enabled, the callee pops its arguments, so the caller has to readjust the stack pointer after every call to get its outgoing argument space back. Without TCO, a function can allocate outgoing argument space once, up front, sized for its largest call, and reuse that space for every call it makes.
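To make the difference concrete, here is a rough C model of the two schemes. It is only a sketch: `Frame`, `callee_pops_adjustments`, and `preallocated_adjustments` are made-up names, and the model just counts explicit stack-pointer adjustments executed in the caller rather than measuring real instructions.

```c
#include <assert.h>

/* Hypothetical model of a caller's frame: a stack pointer offset and a
 * count of explicit adjustment instructions the caller executes. */
typedef struct {
    long sp;
    int adjustments;
} Frame;

/* Callee-pops scheme: the callee's RET removes the argument area, so the
 * caller must reserve it again before each call. */
static int callee_pops_adjustments(int n_calls, long arg_bytes) {
    Frame f = { 0, 0 };
    for (int i = 0; i < n_calls; i++) {
        f.sp -= arg_bytes;   /* caller reserves argument space again */
        f.adjustments++;
        /* ... store arguments, call ... */
        f.sp += arg_bytes;   /* done by the callee's RET, not by the caller */
    }
    assert(f.sp == 0);       /* stack is balanced on exit */
    return f.adjustments;
}

/* Preallocated scheme: reserve outgoing argument space once and reuse it. */
static int preallocated_adjustments(int n_calls, long arg_bytes) {
    Frame f = { 0, 0 };
    f.sp -= arg_bytes;       /* once, in the prologue */
    f.adjustments++;
    for (int i = 0; i < n_calls; i++) {
        /* ... store arguments into the reserved area, call ...
         * the callee returns with a plain RET, so sp is unchanged */
    }
    f.sp += arg_bytes;       /* once, in the epilogue */
    f.adjustments++;
    assert(f.sp == 0);
    return f.adjustments;
}
```

In this model the callee-pops count grows with the number of calls, while the preallocated count stays constant. In practice the one-time reservation is usually folded into the frame setup the function does anyway, so the per-call cost can be effectively zero.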