A very interesting trick! In my opinion, the big takeaway here is that if you are willing to write your C code in tail call style, you can get a lot of control over which variables are stored in machine registers. In the x86-64 ABI, the first 6 function parameters are guaranteed to be passed via registers.
Obviously, this is architecture-specific. Other architectures may use a different calling convention that ruins this kind of manual register allocation. I'd imagine that it would be bad for performance in 32-bit systems, where function arguments are passed via the stack.
--------
> I think it’s likely that all of the major language interpreters written in C (Python, Ruby, PHP, Lua, etc.) could get significant performance benefits by adopting this technique.
I know that at least Lua and Python use "computed gotos" in their inner loops, which also helps the compiler generate better code. The architecture-dependent nature of the tail-call trick could be a problem here. Some of these interpreters also need to work well on 32-bit systems.
Yep, that's exactly the right takeaway. :) I have verified this on ARM64 also, but architectures that pass parameters on the stack will make this technique not really worth it.
Yes, there are portability issues with the tail call approach, so there would need to be a fallback on non-x64/ARM64 platorms. This would add complexity. But it's exciting to think that it could unlock significant performance improvements.
It's a bit of a weird case for Haskell. Because of laziness, you don't always want tail recursion (in fact sometimes you definitely don't want the tail recursive version). The classic example of this is foldl vs foldr.
-- Tail-recursive so must be good right?
-- That way lies thunk accumulation and *non-constant*
-- additional memory usage for a lot of functions f
foldl f acc [] = acc
foldl f acc (x:xs) = foldl f (f acc x) xs
-- In contrast, this definition often results in constant
-- additional space usage despite the non-tail call.
foldr f acc [] = acc
foldr f acc (x:xs) = f x (foldr f acc xs)
Tail recursion is an unalloyed good only for strict languages.
Although I'm not sure why you put mutual in parentheses and maybe you meant to address laziness with that?
You are right indeed regarding that there are cases where trail recursion does not work. But in this case I would say that the culprit is the laziness of f.
I put mutual in between parentheses because not all languages support optimising mutually tail recursive functions.
On the matter of portability, I wonder if it would be possible to use some macro magic and/or a code generator to convert the tail-call version back into a more traditional while/switch loop, for the stack-based architectures.
While the tail call version is more architecture dependent, it's nevertheless more portable than assembly language. It's still C.
I haven't implemented the fallback yet, but my thought was to keep most of the structure the same (dispatch function tail calls to the function for an individual operation) but then have the individual operation just return instead of tail calling back to dispatch.
I think this would be portable while still avoiding some of the performance problems associated with a traditional switch (like the bounds check).
I'm curious how this approach would fair for microcontrollers, well cortex m0-m4's actually. They're arm32 bit and may not have enough useful call registers. Still parsing protobuf's or interpreting WASM faster would result directly in better battery performance, etc.
Regarding PHP specifically, the interpreter's opcodes are all described in a meta language, and the actual PHP interpreter can be code-generated in 4 different styles depending on what is fastest for the target C compiler.
See ZEND_VM_KIND_CALL / SWITCH / GOTO / HYBRID in php-src/Zend/zend_vm_gen.php
> the first 6 function parameters are guaranteed to be passed via registers.
This assumes that your function is being called via the regular calling convention. By the as-if rule there is nothing guaranteeing that an internal call is using the regular calling convention or that it hasn't been inlined.
Are there cases where a compiler reasonably would internally use a different calling convention that supports fewer arguments passed via register passing?
Yours is still a valid point (no reason doesn't imply a guarantee) even if there's none, to be clear. Just curious because I can't really see any reason for a compiler to do so.
I can imagine some theoretical cases where compiler optimizations lead to additional arguments to be passed to a function [version].
Yes, in cases where optimizations would call for it.
If a function is marked `static` for example (in C) it means that it's private and local to the translation unit, which means there doesn't need to be a standard ABI agreement between it and external consumers, thus compiler is free to optimize calls or even inline it if necessary. This can drastically reduce both code size and runtime cost.
I agree it is unlikely. I can imagine something like the compiler realizing that some computation is pure, so it caches the value across invocations of the function, essentially adding addition arguments that push the actual arguments onto the stack. Of course this is an unlikely edge case but a reasonable set of possible optimizations that would work together I'm a surprising way.
We use the "noinline" attribute to get control over inlining, which gives a reasonable assurance that we'll get a normal standard ABI call with six parameters in registers on x64/ARM64.
That doesn’t disable other optimizations like cloning and IP-IRA (if those still exist), so it’s still possible to get a different calling convention. That would hopefully have more registers in use though, not less.
Modern compilers can compile tail calls of unknown functions (i.e. function pointers) to jumps; so instead of using "computed gotos" (which IIRC is a GCC extension), one can use ANSII C and get more efficient code (because of control of registers)
I'm not a C programmer at all, so please forgive me if this question doesn't make sense. If I remember correctly, there is a `register` keyword in C that you can use to hint the compiler that a given variable should be stored in a register. I think I've never seen this keyword in use, and I'm wondering why. But I think it would be useful for this same use case, without requiring tail calls and with the advantage of being fully portable.
`register` was used as a hint, and things like LLVM have Mem2Reg optimization passes that do a much better job at this sort of promotion.
IIRC register was mainly used in cases where hardware were involved, much like the volatile keyword. However unlike volatile, I don't believe register is even considered by compilers these days.
Obviously, this is architecture-specific. Other architectures may use a different calling convention that ruins this kind of manual register allocation. I'd imagine that it would be bad for performance in 32-bit systems, where function arguments are passed via the stack.
--------
> I think it’s likely that all of the major language interpreters written in C (Python, Ruby, PHP, Lua, etc.) could get significant performance benefits by adopting this technique.
I know that at least Lua and Python use "computed gotos" in their inner loops, which also helps the compiler generate better code. The architecture-dependent nature of the tail-call trick could be a problem here. Some of these interpreters also need to work well on 32-bit systems.