I'm pretty sure the following optimization is invalid, since a is not a bool array, but one of positive/negative numbers (see top of article):
> You can use arithmetics to go branchless.
if (a[i] > 0) {
cnt++;
}
> Rewriting using arithmetic takes advantage of the fact that the expression a[i] > 0 has an arithmetic value 1 if true and 0 if false. So the whole expression can be rewritten as:
Really? In what case? It’s just an expression and the entire statement just uses the expression result.
Of course when optimising code like this, I think it’s important to look at the generated assembly anyway and then you can be sure that it does what you expect on the compilers you intend on supporting (doubly so when you want to generate conditional moves — I’ve found that to be a bit of a puzzle where I sometimes need to move things around as the obvious code still generated a branch), but at least GCC and Clang won’t generate a branch for using just a comparison. Maybe it’s not the compiler and instead the target architecture? In x86, comparisons set flags, so by themselves aren’t branches. In any case, I recommend using compiler explorer when working on code where this matters.
Setting the CPU flag after a comparison doesn't do anything useful yet, you also need to perform an addition with 1 or do nothing depending on the flag, and selecting between these two options is usually done with a conditional branch, unless the CPU can execute ALU instructions conditionally (ARM can do this, x86 only has conditional mov AFAIK).
Can it? I understand it’s always possible to decompose that to
if (a[i] > 0)
return 1
else
return 0
end
But why would a compiler do that?
Comparison operations are basic primitives that usually store their result into a register. With branching comparison operators potentially also available, but if there’s a branching comparator operation in your ISA, then there’s almost certainly also a pure comparator operation in your ISA, because you can always compose a branching comparator from simple comparison operation followed by a basic equality comparison of the result.
So I guess my question is, while it’s technically possible for a compiler to compile this into a branching operations, under what circumstances would a compiler actually choose to do that, given there’s isn’t a clear benefit?
In the branchless version, the CPU has to wait for the comparison to resolve before it can start executing the add for the next loop iteration. However, if the branch is predictable, the CPU can assume the result of the conditional and does not need to wait to add one or not. I wrote a more in depth comment about why this is true a few months ago: https://news.ycombinator.com/item?id=37245594
If I alter the code slightly to do `result += (a[i] == 0) * 2`, gcc emits a branch if the comparison is predictable: https://godbolt.org/z/df3fsoYK8
Here is a benchmark: https://quick-bench.com/q/NSGHu_wfhrMXp0-pZQp9qybCIok. Note how the branchless version takes the same time for the random and the zeroes vector, while the branch version is faster when the branch is predictable but slower when the branch is not predictable.
> Comparison operations are basic primitives that usually store their result into a register.
In one of the most common processor architectures (the x86 family), comparison operations do store their result into a register, but it's the flags register, which can't be used directly in arithmetic operations. So you have to follow a comparison operation with either a conditional branch or a conditional move (and earlier processors in the x86 family didn't have conditional moves).
> So I guess my question is, while it’s technically possible for a compiler to compile this into a branching operations, under what circumstances would a compiler actually choose to do that, given there’s isn’t a clear benefit?
It depends on the compiler heuristics and on the surrounding code; for instance, it might decide that "compare; conditional branch over next instruction; increment" is better than "copy to second register; increment second register; compare; conditional move from second register", because it uses one less register (the x86 family is register starved) and one less instruction (relevant when optimizing for size).
> So you have to follow a comparison operation with either a conditional branch or a conditional move (and earlier processors in the x86 family didn't have conditional moves).
The x86 family has the `setCC` instructions [^1] that move bits from the flag register to a general purpose one. Example from godbolt, see `setg`:
No, you've changed the right-hand side expression. Now you're adding a[i] to cnt and nowhere has it stated that a[i] is 1. But a[i] > 0's result is always zero or one.
I suggest what you want is this:
cnt += !!a[i]
Now, a[i] is not-notted -- the expression returns a bool indicating whether the value is converted to true.
> You can use arithmetics to go branchless.
> Rewriting using arithmetic takes advantage of the fact that the expression a[i] > 0 has an arithmetic value 1 if true and 0 if false. So the whole expression can be rewritten as: