In the branchless version, the CPU has to wait for the comparison to resolve before it can start executing the add for the next loop iteration. However, if the branch is predictable, the CPU can assume the result of the conditional and does not need to wait to add one or not. I wrote a more in depth comment about why this is true a few months ago: https://news.ycombinator.com/item?id=37245594
If I alter the code slightly to do `result += (a[i] == 0) * 2`, gcc emits a branch if the comparison is predictable: https://godbolt.org/z/df3fsoYK8
Here is a benchmark: https://quick-bench.com/q/NSGHu_wfhrMXp0-pZQp9qybCIok. Note how the branchless version takes the same time for the random and the zeroes vector, while the branch version is faster when the branch is predictable but slower when the branch is not predictable.
If I alter the code slightly to do `result += (a[i] == 0) * 2`, gcc emits a branch if the comparison is predictable: https://godbolt.org/z/df3fsoYK8
Here is a benchmark: https://quick-bench.com/q/NSGHu_wfhrMXp0-pZQp9qybCIok. Note how the branchless version takes the same time for the random and the zeroes vector, while the branch version is faster when the branch is predictable but slower when the branch is not predictable.