I wish there were a good way of knowing when an if forces an actual branch and when it doesn't. The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost like branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done with branching.
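For what it's worth, HLSL already has [branch] and [flatten] attributes that nudge the compiler one way or the other, though as far as I know they are hints rather than the hard guarantees described above. A minimal sketch (function and parameter names made up):

    // [flatten] asks the compiler to evaluate both sides and select (no jump);
    // [branch] asks for real flow control. These are hints, not hard guarantees.
    float PickValue(float x, float y, float a, float b)
    {
        float v;
        [flatten]
        if (x > y) v = a;
        else       v = b;
        return v;
    }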
>The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
And the reason for that is the confusing documentation from NVIDIA and its Cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads", and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result, though, is people coding for GPUs with some bizarre superstitions.
You actually want branches in the code. Those are quick. The problem is that you cannot have a branch off a SIMD way, so instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.
So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and whatnot) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of the ? operator are computed; the same happens with any conditional on a SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value, but in the general case everything will be computed for the condition being true as well as being false.
Only conditionals based on scalar registers (shader constants/uniform values) will generate branches, and those are super quick.
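A rough sketch of what that looks like at the source level, assuming hypothetical per-lane helpers f and g and a made-up constant buffer:

    // Hypothetical helpers standing in for "real" per-lane work.
    float f(float x) { return x * x; }
    float g(float y) { return sqrt(abs(y)); }

    // Conditional on per-lane data: both sides are evaluated for every lane
    // and the result is selected/masked (e.g. v_cndmask on RDNA); no jump.
    float PerLane(float x, float y)
    {
        float a = f(x);           // computed for every lane
        float b = g(y);           // computed for every lane
        return (x > y) ? a : b;   // per-lane select
    }

    // Conditional on a uniform/constant: the whole wave agrees, so this is
    // an actual (and cheap) branch.
    cbuffer Params { bool useFastPath; };   // hypothetical constant
    float UniformBranch(float x, float y)
    {
        [branch]
        if (useFastPath)
            return f(x);
        return g(y);
    }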
> So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and whatnot) cannot and won't branch.
It can do an actual branch if the condition ends up the same for the entire workgroup - or to be even more pedantic, for the part of the workgroup that is still alive.
You can also check that explicitly, e.g. to take a faster special-case branch for the entire workgroup when possible, and otherwise a slower general-case branch, also for the entire workgroup, instead of doing both and then selecting.
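A hedged sketch of that pattern using HLSL SM6 wave intrinsics (the lighting function and all names are made up):

    // Hypothetical: whole-wave fast path via a wave vote.
    float EvaluateLighting(float3 n, float3 l) { return saturate(dot(n, l)); }

    float ShadePoint(float3 n, float3 l)
    {
        bool backfacing = dot(n, l) <= 0.0;
        [branch]
        if (WaveActiveAllTrue(backfacing))   // uniform across the active lanes
            return 0.0;                      // whole wave takes the cheap path
        return EvaluateLighting(n, l);       // otherwise whole wave does the general work
    }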
And this is why I wrote: "There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value, but in the general case everything will be computed for the condition being true as well as being false."
Execution with masking is pretty much how branching works on GPUs. What's more relevant, however, is that conditional statements add overhead in terms of additional instructions and execution-state management. Eliminating small branches using conditional moves or manual masking can be a performance win.
No, branching works on GPUs just like everywhere else - the instruction pointer gets changed to another value. But you cannot branch on a vector value unless every element of the vector is the same, which is why branching on vector values is a bad idea. However, if your vectorized computation is naturally divergent then there is no way around it; conditional moves are not going to help, as they also evaluate both sides of the conditional. The best you can do is to arrange it in such a way that you only add computation instead of alternating it, i.e. you do if() ... instead of if() ... else ...; then you only take as long as the longest path.
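A small sketch of the "only add computation" arrangement (function and variable names are made up):

    // Hypothetical extra-detail term.
    float3 DetailColor(float2 uv) { return float3(frac(uv * 8.0), 0.5); }

    // Additive form: the common work is shared and only the extra work sits
    // behind the divergent condition, so the worst case is the longest path,
    // not the sum of two alternative paths.
    float3 ShadePixel(float2 uv, bool useDetail)
    {
        float3 color = float3(uv, 1.0);   // stand-in for the common work
        if (useDetail)                    // per-pixel (divergent) condition
            color += DetailColor(uv);     // added work only; no 'else' path
        return color;
    }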
This reminds me that people who believe the GPU is not capable of branches do stupid things like writing multiple shaders instead of branching off a shader constant. E.g. you have some special mode, say x-ray vision, in a game, and instead of doing a branch in your materials, you write an alternative version of every shader.
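A sketch of the constant-based branch that replaces the duplicated shaders (all names assumed):

    // One shader with a uniform switch instead of an "x-ray" copy of every
    // material shader. A constant-buffer bool is uniform for the whole draw,
    // so this is a real, cheap branch.
    cbuffer FrameConstants { bool xrayMode; };   // hypothetical constant

    float4 NormalShade(float2 uv) { return float4(uv, 0.0, 1.0); }       // stand-in
    float4 XrayShade(float2 uv)   { return float4(0.0, 1.0, 0.0, 0.5); } // stand-in

    float4 ShadeMaterial(float2 uv : TEXCOORD0) : SV_Target
    {
        [branch]
        if (xrayMode)
            return XrayShade(uv);
        return NormalShade(uv);
    }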
Does this also apply to shaders? And is it even useful, given the enormous variation in hardware capabilities out there? My impression was that it's all JIT-compiled unless you know which hardware you're targeting, e.g. Valve precompiling highly optimized shaders for the Steam Deck.
(I'm not a graphics programmer, mind you, so please correct any misunderstandings on my end)
It's all JIT'd based on the specific driver/GPU, but the intermediate assembly language is sufficient to inspect things like branches and loop unrolling.
Not really. DXIL in particular will still have branches and not care much about unrolling. You need to look at the assembly that is generated. And yes, that depends on the target hardware and compiler/driver.
You will have to check for the different GPUs you are targeting. But GPU vendors don't start from scratch for each hardware generation, so you will often see similar results.
I'll comment this here as I got downvoted when I made the point in a standalone comment - this is mostly an academic issue, since you don't want to use step or pixel-level if statements in your shader code, as they will lead to ugly aliasing artifacts as the pixel color transitions from a to b.
What you want is to use smoothstep, which blends a bit between the two values, and for that you need to compute both paths anyway.
> since you don't want to use step or pixel-level if statements in your shader code
The observation relates to pixel shaders, and even within that, it relates to values that vary based on pixel-level data. In these cases having if statements without any sort of interpolation introduces aliasing, which tends to look very noticeable.
Now you might be fine with that, or have some way of masking it, so it might be fine in your use case, but in the most common, naive case the issue does show up.
I don't know how many graphics products you shipped and when, but clamping values at 0, say, is pretty common even in the most basic shaders. It's not magic and won't introduce "aliasing" just by being used. On the other hand, using negative dot products in your lighting computation, for example, will introduce bizarre artifacts. And yes, everyone has been using various forms of MSAA for the past 15 years or so, even in games. Welcome to the 21st century.
The way you write seems to imply you have professional experience in the matter, which makes it very strange that you're not getting what I'm writing about.
Nobody ever talked about clamping - and it's not even relevant to the discussion, as it doesn't introduce a discontinuity that can cause aliasing.
What I'm referring to is shader aliasing, which MSAA does nothing about - MSAA is for geometry aliasing.
To illustrate what I'm talking about, here's an example that draws a red circle on a quad:
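Something along these lines (a minimal sketch, with the uv mapping and radius assumed):

    // Version 1: hard cutoff via step() - aliased, pixelated contour.
    float4 CircleHard(float2 uv : TEXCOORD0) : SV_Target
    {
        float d = length(uv - 0.5);
        float inside = step(d, 0.25);       // 1 inside the radius, 0 outside
        return float4(inside, 0, 0, 1);     // red circle, hard edge
    }

    // Version 2: smoothstep() blends over a narrow band - smooth contour,
    // but both "sides" of the transition are computed anyway.
    float4 CircleSmooth(float2 uv : TEXCOORD0) : SV_Target
    {
        float d = length(uv - 0.5);
        float inside = 1.0 - smoothstep(0.24, 0.26, d);
        return float4(inside, 0, 0, 1);
    }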
The first version has a hard boundary for the circle, with an ugly aliased and pixelated contour, while the latter version smooths it out. This example might not be egregious, but the issue can and does show up in some circumstances.
> it's also concerning that we have syntax where an if is sometimes a branch and sometimes not.
That's true on scalar CPUs too though. The CMOV instruction arrived with the P6 core in 1995, for example. Branches are expensive everywhere, even in scalar architectures, and compilers do their best to figure out when they should use an alternative strategy. And sometimes get it wrong, but not very often.
For scalar CPUs: historically, CMOV used to be relatively slow on x86, and notably for reliably predicted branch patterns (>75% going one way) branches could be a lot faster.
CMOV also has dependencies on all three inputs, so if the branch is heavily biased and the unlikely input has a much higher latency than the likely one, a CMOV can cost a fair amount of waiting.
Finally, CMOV was absolutely terrible on the P4 (10-ish cycles), and it's likely that a lot of its lore dates back to that.
You got this the wrong way around: For GPUs conditional moves are the default and real branches are a performance optimization possible only if the branch is uniform (=same side taken for the entire workgroup).
Assuming computing the two function calls f() and g() is relatively expensive, it becomes a trade-off whether to emit conditional code or to compute both followed by a select. So it's not a simple choice, and the decision is made by the compiler.
The GPU will almost always execute both f and g, because of how GPUs execute divergent code compared to CPUs.
You can avoid the f-vs-g evaluation if you can ensure a scalar Boolean / if statement that is consistent across the warp. So it's not 'always', but it requires incredibly specific coding patterns to 'force' the optimizer + GPU compiler into making the branch.
It depends. If the code flow is uniform for the warp, only one side of the branch needs to be evaluated. But you could still end up with pessimistic register allocation because the compiler can't know it is uniform. It's sometimes weirdly hard to reason about how exactly code will end up executing on the GPU.
f or g may have side effects too. Like writing to memory.
Now a conditional has a different meaning.
You could also have some fun stuff where f and g return a boolean, because thanks to short-circuit evaluation, && and || are actually also conditionals in disguise.
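Roughly, the short-circuit form rewrites into the same ?: shape discussed above (f and g are hypothetical):

    // Hypothetical boolean-returning helpers.
    bool f(float x) { return x > 0.0; }
    bool g(float x) { return frac(x) < 0.5; }

    bool Combined(float x)
    {
        // Under short-circuit evaluation, 'f(x) && g(x)' behaves like
        // 'f(x) ? g(x) : false', so the same branch-vs-select question
        // applies to it as to any other conditional.
        return f(x) ? g(x) : false;
    }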
I think that capability in the shader language would be interesting to have. One might even want it to two-color all functions in the code. Anything annotated nonbranching must have its if statements compile down to conditional moves and must only call nonbranching functions.
The good way of knowing is to look at the assembly generated by the compiler. Maybe not a completely satisfying answer given that the result is heavily vendor-dependent, but unless the high-level language exposes some way of explicitly controlling it, then assembly is what you got.
> The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
This is a problem, though. People shouldn't do things "potentially"; they should look at the actual code that is generated and executed.
Godbolt has the RGA compiler now; you can always paste in HLSL and look at the actual RDNA instructions that are generated (what the GPU actually runs, not SPIR-V).
But you don't generally need to care if the shader code contains a few branches; modern GPUs handle those reasonably well, and the compiler will probably make a reasonable guess about what is fastest.
A non-branching version of the same algorithm will also run code equivalent to both branches. The branching version may sometimes skip one of the branches; the non-branching version can't. So if the functionality you want is best described by a branch, then use a branch.