A very interesting trick! In my opinion, the big takeaway here is that if you ar...

haberman · on April 25, 2021

Yep, that's exactly the right takeaway. :) I have verified this on ARM64 also, but architectures that pass parameters on the stack will make this technique not really worth it.

Re: PHP, Ruby, etc. yes, computed gotos help some, but I think tail calls would help much more. That was our experience at least. I wrote more about this here: https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html

Yes, there are portability issues with the tail call approach, so there would need to be a fallback on non-x64/ARM64 platorms. This would add complexity. But it's exciting to think that it could unlock significant performance improvements.

spockz · on April 25, 2021

If you like tail calls, look into CPS. Many forms of (pure) code can be rewritten in such a way.

Everyone who writes Haskell quickly learns to write their recursive functions so that they are (mutually) tail recursive.

dwohnitmok · on April 26, 2021

It's a bit of a weird case for Haskell. Because of laziness, you don't always want tail recursion (in fact sometimes you definitely don't want the tail recursive version). The classic example of this is foldl vs foldr.

  -- Tail-recursive so must be good right?
  -- That way lies thunk accumulation and *non-constant* 
  -- additional memory usage for a lot of functions f
  foldl f acc []     = acc
  foldl f acc (x:xs) = foldl f (f acc x) xs

  -- In contrast, this definition often results in constant 
  -- additional space usage despite the non-tail call.
  foldr f acc []     = acc
  foldr f acc (x:xs) = f x (foldr f acc xs)

Tail recursion is an unalloyed good only for strict languages.

Although I'm not sure why you put mutual in parentheses and maybe you meant to address laziness with that?

spockz · on April 26, 2021

You are right indeed regarding that there are cases where trail recursion does not work. But in this case I would say that the culprit is the laziness of f.

I put mutual in between parentheses because not all languages support optimising mutually tail recursive functions.

ufo · on April 25, 2021

On the matter of portability, I wonder if it would be possible to use some macro magic and/or a code generator to convert the tail-call version back into a more traditional while/switch loop, for the stack-based architectures.

While the tail call version is more architecture dependent, it's nevertheless more portable than assembly language. It's still C.

haberman · on April 25, 2021

I haven't implemented the fallback yet, but my thought was to keep most of the structure the same (dispatch function tail calls to the function for an individual operation) but then have the individual operation just return instead of tail calling back to dispatch.

I think this would be portable while still avoiding some of the performance problems associated with a traditional switch (like the bounds check).

elcritch · on April 25, 2021

I'm curious how this approach would fair for microcontrollers, well cortex m0-m4's actually. They're arm32 bit and may not have enough useful call registers. Still parsing protobuf's or interpreting WASM faster would result directly in better battery performance, etc.

mappu · on April 25, 2021

Regarding PHP specifically, the interpreter's opcodes are all described in a meta language, and the actual PHP interpreter can be code-generated in 4 different styles depending on what is fastest for the target C compiler.

See ZEND_VM_KIND_CALL / SWITCH / GOTO / HYBRID in php-src/Zend/zend_vm_gen.php

jashmatthews · on April 25, 2021

Even more wild is the JavaScriptCore bytecode VM which is written in a Ruby DSL which is either compiled to C or directly to ASM.

I'm jealous we don't have something like this for CRuby!

kevincox · on April 25, 2021

> the first 6 function parameters are guaranteed to be passed via registers.

This assumes that your function is being called via the regular calling convention. By the as-if rule there is nothing guaranteeing that an internal call is using the regular calling convention or that it hasn't been inlined.

anarazel · on April 25, 2021

Are there cases where a compiler reasonably would internally use a different calling convention that supports fewer arguments passed via register passing?

Yours is still a valid point (no reason doesn't imply a guarantee) even if there's none, to be clear. Just curious because I can't really see any reason for a compiler to do so.

I can imagine some theoretical cases where compiler optimizations lead to additional arguments to be passed to a function [version].

junon · on April 26, 2021

Yes, in cases where optimizations would call for it.

If a function is marked `static` for example (in C) it means that it's private and local to the translation unit, which means there doesn't need to be a standard ABI agreement between it and external consumers, thus compiler is free to optimize calls or even inline it if necessary. This can drastically reduce both code size and runtime cost.

kevincox · on April 25, 2021

I agree it is unlikely. I can imagine something like the compiler realizing that some computation is pure, so it caches the value across invocations of the function, essentially adding addition arguments that push the actual arguments onto the stack. Of course this is an unlikely edge case but a reasonable set of possible optimizations that would work together I'm a surprising way.

haberman · on April 25, 2021

We use the "noinline" attribute to get control over inlining, which gives a reasonable assurance that we'll get a normal standard ABI call with six parameters in registers on x64/ARM64.

astrange · on April 26, 2021

That doesn’t disable other optimizations like cloning and IP-IRA (if those still exist), so it’s still possible to get a different calling convention. That would hopefully have more registers in use though, not less.

tomp · on April 25, 2021

Modern compilers can compile tail calls of unknown functions (i.e. function pointers) to jumps; so instead of using "computed gotos" (which IIRC is a GCC extension), one can use ANSII C and get more efficient code (because of control of registers)

junon · on April 26, 2021

> which IIRC is a GCC extension.

Correct, and it's always been a gaping hole in the standard, in my opinion. Computed GOTO would be one of the more useful things added to C.

ceronman · on April 26, 2021

I'm not a C programmer at all, so please forgive me if this question doesn't make sense. If I remember correctly, there is a `register` keyword in C that you can use to hint the compiler that a given variable should be stored in a register. I think I've never seen this keyword in use, and I'm wondering why. But I think it would be useful for this same use case, without requiring tail calls and with the advantage of being fully portable.

junon · on April 26, 2021

`register` was used as a hint, and things like LLVM have Mem2Reg optimization passes that do a much better job at this sort of promotion.

IIRC register was mainly used in cases where hardware were involved, much like the volatile keyword. However unlike volatile, I don't believe register is even considered by compilers these days.

iamevn · on April 25, 2021

Wouldn't you still be able to get a lesser improvement from the tail call being able to overwrite the stack frame and jumping instead of calling?