I wish there were a good way of knowing when an if forces an actual branch and when it doesn't. The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost like branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done with branching.
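For what it's worth, HLSL already has [branch] and [flatten] attributes that nudge the compiler one way or the other, though as far as I know they are hints rather than the hard guarantees described above. A minimal sketch (function and parameter names made up):

    // [flatten] asks the compiler to evaluate both sides and select (no jump);
    // [branch] asks for real flow control. These are hints, not hard guarantees.
    float PickValue(float x, float y, float a, float b)
    {
        float v;
        [flatten]
        if (x > y) v = a;
        else       v = b;
        return v;
    }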
>The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
And the reason for that is the confusing documentation from NVIDIA and its Cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads", and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result, though, is people coding for GPUs with some bizarre superstitions.
You actually want branches in the code. Those are quick. The problem is that you cannot have a branch off a SIMD way, so instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.
So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and whatnot) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of the ? operator are computed; the same happens with any conditional on a SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value, but in the general case everything will be computed for the condition being true as well as being false.
Only conditionals based on scalar registers (shader constants/uniform values) will generate branches, and those are super quick.
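A rough sketch of what that looks like at the source level, assuming hypothetical per-lane helpers f and g and a made-up constant buffer:

    // Hypothetical helpers standing in for "real" per-lane work.
    float f(float x) { return x * x; }
    float g(float y) { return sqrt(abs(y)); }

    // Conditional on per-lane data: both sides are evaluated for every lane
    // and the result is selected/masked (e.g. v_cndmask on RDNA); no jump.
    float PerLane(float x, float y)
    {
        float a = f(x);           // computed for every lane
        float b = g(y);           // computed for every lane
        return (x > y) ? a : b;   // per-lane select
    }

    // Conditional on a uniform/constant: the whole wave agrees, so this is
    // an actual (and cheap) branch.
    cbuffer Params { bool useFastPath; };   // hypothetical constant
    float UniformBranch(float x, float y)
    {
        [branch]
        if (useFastPath)
            return f(x);
        return g(y);
    }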
> So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and whatnot) cannot and won't branch.
It can do an actual branch if the condition ends up the same for the entire workgroup - or to be even more pedantic, for the part of the workgroup that is still alive.
You can also check that explicitly, e.g. to take a faster special-case branch for the entire workgroup when possible, and otherwise a slower general-case branch, also for the entire workgroup, instead of doing both and then selecting.
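A hedged sketch of that pattern using HLSL SM6 wave intrinsics (the lighting function and all names are made up):

    // Hypothetical: whole-wave fast path via a wave vote.
    float EvaluateLighting(float3 n, float3 l) { return saturate(dot(n, l)); }

    float ShadePoint(float3 n, float3 l)
    {
        bool backfacing = dot(n, l) <= 0.0;
        [branch]
        if (WaveActiveAllTrue(backfacing))   // uniform across the active lanes
            return 0.0;                      // whole wave takes the cheap path
        return EvaluateLighting(n, l);       // otherwise whole wave does the general work
    }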
And this is why I wrote: "There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value, but in the general case everything will be computed for the condition being true as well as being false."
Execution with masking is pretty much how branching works on GPUs. What's more relevant, however, is that conditional statements add overhead in terms of additional instructions and execution-state management. Eliminating small branches using conditional moves or manual masking can be a performance win.
No, branching works on GPUs just like everywhere else - the instruction pointer gets changed to another value. But you cannot branch on a vector value unless every element of the vector is the same, which is why branching on vector values is a bad idea. However, if your vectorized computation is naturally divergent then there is no way around it; conditional moves are not going to help, as they also evaluate both sides of the conditional. The best you can do is to arrange it in such a way that you only add computation instead of alternating it, i.e. you do if() ... instead of if() ... else ...; then you only take as long as the longest path.
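A small sketch of the "only add computation" arrangement (function and variable names are made up):

    // Hypothetical extra-detail term.
    float3 DetailColor(float2 uv) { return float3(frac(uv * 8.0), 0.5); }

    // Additive form: the common work is shared and only the extra work sits
    // behind the divergent condition, so the worst case is the longest path,
    // not the sum of two alternative paths.
    float3 ShadePixel(float2 uv, bool useDetail)
    {
        float3 color = float3(uv, 1.0);   // stand-in for the common work
        if (useDetail)                    // per-pixel (divergent) condition
            color += DetailColor(uv);     // added work only; no 'else' path
        return color;
    }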
This reminds me that people who believe the GPU is not capable of branches do stupid things like writing multiple shaders instead of branching off a shader constant. E.g. you have some special mode, say x-ray vision, in a game, and instead of doing a branch in your materials, you write an alternative version of every shader.
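A sketch of the constant-based branch that replaces the duplicated shaders (all names assumed):

    // One shader with a uniform switch instead of an "x-ray" copy of every
    // material shader. A constant-buffer bool is uniform for the whole draw,
    // so this is a real, cheap branch.
    cbuffer FrameConstants { bool xrayMode; };   // hypothetical constant

    float4 NormalShade(float2 uv) { return float4(uv, 0.0, 1.0); }       // stand-in
    float4 XrayShade(float2 uv)   { return float4(0.0, 1.0, 0.0, 0.5); } // stand-in

    float4 ShadeMaterial(float2 uv : TEXCOORD0) : SV_Target
    {
        [branch]
        if (xrayMode)
            return XrayShade(uv);
        return NormalShade(uv);
    }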
Does this also apply to shaders? And is it even useful, given the enormous variation in hardware capabilities out there? My impression was that it's all JIT-compiled unless you know which hardware you're targeting, e.g. Valve precompiling highly optimized shaders for the Steam Deck.
(I'm not a graphics programmer, mind you, so please correct any misunderstandings on my end)
It's all JIT'd based on the specific driver/GPU, but the intermediate assembly language is sufficient to inspect things like branches and loop unrolling.
Not really. DXIL in particular will still have branches and not care much about unrolling. You need to look at the assembly that is generated. And yes, that depends on the target hardware and compiler/driver.
You will have to check for the different GPUs you are targeting. But GPU vendors don't start from scratch for each hardware generation, so you will often see similar results.
I'll comment this here as I got downvoted when I made the point in a standalone comment - this is mostly an academic issue, since you don't want to use step or pixel-level if statements in your shader code, as they will lead to ugly aliasing artifacts as the pixel color transitions from a to b.
What you want is to use smoothstep, which blends a bit between the two values, and for that you need to compute both paths anyway.
> since you don't want to use step or pixel-level if statements in your shader code
The observation relates to pixel shaders, and even within that, it relates to values that vary based on pixel-level data. In these cases having if statements without any sort of interpolation introduces aliasing, which tends to look very noticeable.
Now you might be fine with that, or have some way of masking it, so it might be fine in your use case, but in the most common, naive case the issue does show up.
I don't know how many graphics products you shipped and when, but clamping values at 0, say, is pretty common even in the most basic shaders. It's not magic and won't introduce "aliasing" just by being used. On the other hand, using negative dot products in your lighting computation, for example, will introduce bizarre artifacts. And yes, everyone has been using various forms of MSAA for the past 15 years or so, even in games. Welcome to the 21st century.
The way you write seems to imply you have professional experience in the matter, which makes it very strange that you're not getting what I'm writing about.
Nobody ever talked about clamping - and it's not even relevant to the discussion, as it doesn't introduce a discontinuity that can cause aliasing.
What I'm referring to is shader aliasing, which MSAA does nothing about - MSAA is for geometry aliasing.
To illustrate what I'm talking about, here's an example that draws a red circle on a quad:
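Something along these lines (a minimal sketch, with the uv mapping and radius assumed):

    // Version 1: hard cutoff via step() - aliased, pixelated contour.
    float4 CircleHard(float2 uv : TEXCOORD0) : SV_Target
    {
        float d = length(uv - 0.5);
        float inside = step(d, 0.25);       // 1 inside the radius, 0 outside
        return float4(inside, 0, 0, 1);     // red circle, hard edge
    }

    // Version 2: smoothstep() blends over a narrow band - smooth contour,
    // but both "sides" of the transition are computed anyway.
    float4 CircleSmooth(float2 uv : TEXCOORD0) : SV_Target
    {
        float d = length(uv - 0.5);
        float inside = 1.0 - smoothstep(0.24, 0.26, d);
        return float4(inside, 0, 0, 1);
    }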
The first version has a hard boundary for the circle, with an ugly aliased and pixelated contour, while the latter version smooths it out. This example might not be egregious, but the issue can and does show up in some circumstances.
> it's also concerning that we have syntax where an if is sometimes a branch and sometimes not.
That's true on scalar CPUs too though. The CMOV instruction arrived with the P6 core in 1995, for example. Branches are expensive everywhere, even in scalar architectures, and compilers do their best to figure out when they should use an alternative strategy. And sometimes get it wrong, but not very often.
For scalar CPUs: historically, CMOV used to be relatively slow on x86, and notably for reliably predicted branch patterns (>75% going one way) branches could be a lot faster.
CMOV also has dependencies on all three inputs, so if the branch is heavily biased and the unlikely input has a much higher latency than the likely one, a CMOV can cost a fair amount of waiting.
Finally, CMOV was absolutely terrible on the P4 (10-ish cycles), and it's likely that a lot of its lore dates back to that.
You got this the wrong way around: For GPUs conditional moves are the default and real branches are a performance optimization possible only if the branch is uniform (=same side taken for the entire workgroup).
Assuming computing the two function calls f() and g() is relatively expensive, it becomes a trade-off whether to emit conditional code or to compute both followed by a select. So it's not a simple choice, and the decision is made by the compiler.
The GPU will almost always execute both f and g, because of how GPUs execute divergent code compared to CPUs.
You can avoid the f-vs-g evaluation if you can ensure a scalar Boolean / if statement that is consistent across the warp. So it's not 'always', but it requires incredibly specific coding patterns to 'force' the optimizer + GPU compiler into making the branch.
It depends. If the code flow is uniform for the warp, only one side of the branch needs to be evaluated. But you could still end up with pessimistic register allocation because the compiler can't know it is uniform. It's sometimes weirdly hard to reason about how exactly code will end up executing on the GPU.
f or g may have side effects too. Like writing to memory.
Now a conditional has a different meaning.
You could also have some fun stuff where f and g return a boolean, because thanks to short-circuit evaluation, && and || are actually also conditionals in disguise.
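Roughly, the short-circuit form rewrites into the same ?: shape discussed above (f and g are hypothetical):

    // Hypothetical boolean-returning helpers.
    bool f(float x) { return x > 0.0; }
    bool g(float x) { return frac(x) < 0.5; }

    bool Combined(float x)
    {
        // Under short-circuit evaluation, 'f(x) && g(x)' behaves like
        // 'f(x) ? g(x) : false', so the same branch-vs-select question
        // applies to it as to any other conditional.
        return f(x) ? g(x) : false;
    }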
I think that capability in the shader language would be interesting to have. One might even want it to two-color all functions in the code. Anything annotated nonbranching must have its if statements compile down to conditional moves and must only call nonbranching functions.
The good way of knowing is to look at the assembly generated by the compiler. Maybe not a completely satisfying answer given that the result is heavily vendor-dependent, but unless the high-level language exposes some way of explicitly controlling it, then assembly is what you got.
> The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
This is a problem, though. People shouldn't do things "potentially"; they should look at the actual code that is generated and executed.
Godbolt has the RGA compiler now; you can always paste in HLSL and look at the actual RDNA instructions that are generated (what the GPU actually runs, not SPIR-V).
But you don't generally need to care if the shader code contains a few branches; modern GPUs handle those reasonably well, and the compiler will probably make a reasonable guess about what is fastest.
A non-branching version of the same algorithm will also run code equivalent to both branches. The branching version may sometimes skip one of the branches; the non-branching version can't. So if the functionality you want is best described by a branch, then use a branch.