Optimizing your programs for Arm platforms (arm.com)
26 points by Phyx on April 26, 2024 | 29 comments


This isn't a good article. I would say that if you're trying to rely on `restrict` and autovectorization you're doomed and should write the vectorized code yourself. Even if it works on one compiler version, it won't work on all of them.

(It could possibly work in a language that isn't C and is designed for it; Fortran or shader programs are easier to autovectorize, and something like ISPC starts out "vectorized" and gets "autoscalarized".)

This is why ffmpeg writes SIMD in assembly and is more successful than all the people constantly replying "um actually you never need to write anything in assembly" to them.


The big problem is that gcc/clang don't seem to have a concept of optimization notices, like SBCL does.

Nobody is more appropriate than the compiler to warn you that it couldn't optimize something costly and why.


Clang does have "-Rpass-missed=vectorize" among others.


Huh, the more you know.


GCC has -fopt-info-vec-missed, among others.
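A minimal example to feed either flag (the file name and exact remark wording here are mine; the diagnostics vary by compiler and version):

        /* clang -O3 -c -Rpass-missed=vectorize  reduce.c
           gcc   -O3 -c -fopt-info-vec-missed    reduce.c */
        #include <stddef.h>

        float dot(const float *a, const float *b, size_t n) {
            float s = 0.0f;
            for (size_t i = 0; i < n; i++)
                s += a[i] * b[i];  /* FP reduction: usually reported as not vectorized
                                      unless reassociation (-ffast-math or similar) is allowed */
            return s;
        }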


Aren't shader programs more like ISPC (or OpenCL/CUDA), in that the programming model is based around 'pretend each SIMD lane is a thread'?


Depends on the target architecture. Some GPUs have used vectors in the past, and some people try to run shaders on CPU (like for OpenCL or for emulation).


> Some GPUs have used vectors in the past

Which ones? I'm aware of AMD/ATi Terascale using VLIW, but I'm pretty sure that architecture also used SIMD (requiring the use of both VLIW and large 'waves' to achieve maximum occupancy).

And running shaders on CPU is, in essence, a similar programming model to ISPC. When you run OpenCL on a CPU I am quite certain that the runtime pretends each lane is a program instance, same as ISPC.


Writing good assembly is a niche skill, especially SIMD assembly. Projects like ffmpeg are able to do it because they're pulling from a massive pool of contributors. In general writing raw assembly should be avoided unless you're genuinely in a position of knowing better.

> ...people constantly replying "um actually you never need to write anything in assembly" to them.

Honestly, who is saying that?


> Projects like ffmpeg are able to do it because they're pulling from a massive pool of contributors.

It has the opposite problem; it's drawing from a small pool of skilled contributors, because not enough people have learned it, because so much other incorrect advice thinks it's fine to use autovectorization that doesn't work.

> Honestly, who is saying that?

The recent article here about ffmpeg's use of assembly got almost exclusively these comments, or people thinking it was a joke, even though everyone replying who'd actually used it explained why it was good.

https://news.ycombinator.com/item?id=39813724

(note asm vs intrinsics is a different tradeoff - it doesn't use intrinsics because they aren't actually easier to work with; they are not faster, not more portable, and on Intel not even more readable because of Hungarian notation. Although they are easier to debug.)


> it doesn't use intrinsics because they aren't actually easier to work with; they are not faster, not more portable

Well golly, I'll just have to disagree based on >20 years of experience, including several in assembly. asm is only (maybe) faster for the code we manage to get written.

From where I sit, video codecs are a rare special case in that the format is standardized, changes only every few years, and has only a few but super-time-critical kernels. For many many other use cases, the situation looks different and productivity matters more. Would you rather get a 10x speedup on 40% of the cycles, or 8x on 80%?

BTW the "not more portable" comment is a strawman because intrinsics themselves indeed aren't portable, but a wrapper library on top (such as our Highway) is.
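To make that concrete, here is a hypothetical minimal wrapper (not Highway's actual API, just a sketch of the idea): the portable vf32_* names expand to native intrinsics on each target, and the kernel is written once against them.

        #include <stddef.h>

        #if defined(__AVX__)
          #include <immintrin.h>
          typedef __m256 vf32;                 /* 8 lanes */
          enum { VF32_LANES = 8 };
          static inline vf32 vf32_load(const float *p)    { return _mm256_loadu_ps(p); }
          static inline vf32 vf32_set1(float x)           { return _mm256_set1_ps(x); }
          static inline vf32 vf32_mul(vf32 a, vf32 b)     { return _mm256_mul_ps(a, b); }
          static inline void vf32_store(float *p, vf32 v) { _mm256_storeu_ps(p, v); }
        #elif defined(__ARM_NEON)
          #include <arm_neon.h>
          typedef float32x4_t vf32;            /* 4 lanes */
          enum { VF32_LANES = 4 };
          static inline vf32 vf32_load(const float *p)    { return vld1q_f32(p); }
          static inline vf32 vf32_set1(float x)           { return vdupq_n_f32(x); }
          static inline vf32 vf32_mul(vf32 a, vf32 b)     { return vmulq_f32(a, b); }
          static inline void vf32_store(float *p, vf32 v) { vst1q_f32(p, v); }
        #endif

        /* Kernel written once against the wrapper; assumes n is a multiple of VF32_LANES. */
        void scale(float *dst, const float *src, float k, size_t n) {
            vf32 vk = vf32_set1(k);
            for (size_t i = 0; i < n; i += VF32_LANES)
                vf32_store(dst + i, vf32_mul(vf32_load(src + i), vk));
        }

Real libraries like Highway also handle tails, runtime dispatch, masks, etc.; the point is only that the portability lives in the wrapper layer, whatever the per-target bodies are made of.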


> BTW the "not more portable" comment is a strawman because intrinsics themselves indeed aren't portable, but a wrapper library on top (such as our Highway) is.

That's not intrinsics then, it's different abstraction. You could write a wrapper library over inline assembly if you wanted to.

(And of course the intrinsics themselves could almost all be implemented as a header using inline assembly too, since you're probably not relying on the compiler to optimize your intrinsic math. But optimization would be a bit worse because the compiler doesn't know the byte size of each instruction.)


The compiler can still do optimizations on intrinsics. Clang passes most of them through its regular optimizations, so you get things like loop unrolling, CSE (quite powerful if you have multiple invocations of the same SIMD thing, deduplicating constant loads or whatnot), and some genuine improvements that reduce what you need to pay attention to: you don't need to manually merge into 'vpandn', 'vpand a,b,c; vptest a,a' becomes 'vptest b,c', shuffles sometimes improve, and negation gets moved out of a movmsk of a negated vpcmpeq. Though it can of course make things worse too, as the regular compiler tax.

An example of something that inline assembly would handle badly is broadcast, which x86 pre-AVX-512 can only do from a value already in a SIMD register, or directly from memory, but the programmer will almost always want to provide it as a regular scalar variable, i.e. a GPR.
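A small sketch of that broadcast point (x86/AVX2): with an intrinsic, the compiler picks the GPR-to-vector move plus broadcast sequence and can hoist it out of loops; in inline asm you would have to spell that sequence out and pin the registers yourself.

        #include <immintrin.h>

        /* Add a scalar (held in a GPR) to every 32-bit lane. _mm256_set1_epi32 lets
           the compiler materialize the broadcast however it likes (e.g. vmovd +
           vpbroadcastd, or a broadcast load if x already lives in memory). */
        __m256i add_scalar(__m256i v, int x) {
            return _mm256_add_epi32(v, _mm256_set1_epi32(x));
        }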


hm, sounds almost like macros wouldn't make it assembly anymore :)

I'm curious whether you know of any such inline asm wrapper? Seems that this gives the compiler less information than the intrinsics, which largely expand to builtins.


> It has the opposite problem; it's drawing from a small pool of skilled contributors, ...

The project has 2000+ direct contributors and even more indirect contributors on its mailing lists.

> ...because so much other incorrect advice thinks it's fine to use autovectorization that doesn't work.

There are few high performance programmers who genuinely believe that autovectorization can compete with hand written assembly.

> The recent article here about ffmpeg's use of assembly got almost exclusively these comments, or people thinking it was a joke, even though everyone replying who'd actually used it explained why it was good.

I don't see anyone thinking it was a "joke". Comments range from std::simd to SIMD support in Java/C#. A few others quibble over the problems of hand written assembly, but only one or two users genuinely push back against the assembly. This is hardly persecution.

That said, I don't exactly understand your gripe with those people. Should they be showering ffmpeg et al. with praise or something? Like, it's great that the ffmpeg developers can afford to duplicate the same routines across different architectures and SIMD instruction sets, but hardly anyone else can justify doing that. For everyone else the best they can hope for are custom languages and/or better optimizing compilers.


> The project has 2000+ direct contributors and even more indirect contributors on its mailing lists.

I'm one of them, so please just believe me instead of trying to correct me ;) It's an ongoing problem the project talks about that there aren't enough newcomers ready to write more SIMD code with good enough quality.

> Should they be showering ffmpeg et al. with praise or something?

The top reply is "just do this other thing that the article said was unworkable", so not doing that would be a start. Though, the article could've spent some more time explaining why intrinsics don't work well enough.

> but hardly anyone else can justify doing that

Other people mostly target only one CPU architecture as the others are less important to them, but they also get paid, and ffmpeg developers largely didn't. (These days more of them do, but those people are contributing security work more than performance work, I think.)

It's similar to how x264 was better than every commercial competitor while working for free, simply because they took more time to think about what they were doing.


> I'm one of them, so please just believe me instead of trying to correct me ;)…

No you aren’t. Or rather, there’s absolutely no reason for me to believe you are.

> It's an ongoing problem the project talks about that there aren't enough newcomers ready to write more SIMD code with good enough quality.

Good programmers are in short supply across the entire industry. Like anything else it’s just a matter of practice.

> The top reply is "just do this other thing that the article said was unworkable", so not doing that would be a start.

A) Get a thicker skin. It’s not the end of the world when people leave comments related to the topic at hand.

B) x86inc.asm isn’t a particularly interesting approach to programming assembly.

> Other people mostly target only one CPU architecture as the others are less important to them, …

If hardware portability is a goal then handwritten assembly is even more wasteful.

> It's similar to how x264 was better than every commercial competitor while working for free, simply because they took more time to think about what they were doing.

A lot of it simply comes down to sheer man hours and the quantity/quality of bug reports. No 4d chess, no great geniuses, just “good enough” persistence.

Anyway, the problem with handwriting assembly is that it's only really feasible for programs that are trivial in their complexity and/or given unusually strong guarantees.


If ffmpeg can't pull together enough good SIMD developers from its thousands of contributors, then most projects won't be able to get any. Having a problem of "not enough" is already miles better than the problem of "having none".


Yes, I support your sentiment.

On the topic, in C# (*), it is currently strongly recommended to rewrite any code that used to rely on a specific ISA or ISA extension (SSE4.2, AVX, NEON) to the cross-platform methods on `Vector128/256/512<T>` and `Vector<T>` themselves.

Also, .NET is getting support for SVE2 now on top of `Vector<T>`, which is nice.

* Which has a proper portable SIMD API, unlike Java Panama vectors in their current shape (codegen and API limitations make them an unsuitable counterpart for now)


Is it strongly recommended by people who've successfully written performant software on multiple platforms (like ffmpeg), or by compiler engineers? Because this is the kind of thing compiler engineers would like to believe is true, but in practice isn't unless you have the power to make them fix it for you.

There are actual differences between ISAs here and if your case deoptimizes on one of them then there wasn't a point in using SIMD at all.

(Examples: whether it has a full permute, whether unaligned loads are fast or unusably slow, whether it supports half-floats.)

And of course nobody can do an abstraction for MMX, so they just pretend it doesn't exist.


The quote reads as "In C#, it is recommended..."

Which means that, in the past, C# did not have cross-platform SIMD abstractions aside from a very limited set of arithmetic operations on Vector<T>, so manually vectorized code had to rely on the platform-specific SIMD intrinsics introduced earlier (AVX2, AdvSimd, etc.).

However, .NET 7 introduced a set of common arithmetic, logical and bitwise operations on Vector128/256/512<T>[0] (the VectorXXX types are common vector-width types that used to be consumed exclusively by the intrinsic APIs), and subsequently both 7 and 8 also included QoL improvements for the more high-level Vector<T>.

This change rendered code that duplicated SIMD paths per platform mostly obsolete, save for certain operations like `Shuffle` (which was then addressed by introducing ShuffleUnsafe, a raw platform-specific shuffle with the expectation that users either account for the platform differences manually or only rely on the set of outputs that behaves the same everywhere).

CoreLib itself relies on these APIs now and there is no reason to use platform-specific intrinsics in most situations over cross-platform API.

Now, one of the reasons .NET can do this is that it does not have to target the wide range of platforms regular C code has to deal with: most code out there only ever cares about x86, x86_64 (multiple flavors due to SSE2/4, AVX/2 and AVX512), armv7, armv8a and now also wasm (with packed SIMD), and maybe riscv in the future. Out of those, x86_64 and armv8a receive most of the attention and performance investment, and they are sufficiently similar save for the movemask workhorse, emulation of which on Arm is suboptimal (the community has since learned vshrn[1] and other tricks for common operations to avoid the issue).
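For reference, the vshrn trick from [1] looks roughly like this when written as plain C/NEON intrinsics (shown in C here purely for illustration): a narrowing shift squeezes a byte-wise compare mask into a 64-bit "nibble mask", which stands in for x86's movemask.

        #include <arm_neon.h>
        #include <stdint.h>

        /* cmp holds 0x00 or 0xFF per byte lane (e.g. from vceqq_u8). The narrowing
           shift keeps 4 bits per lane, so the lane index is ctz(mask) >> 2. */
        static inline uint64_t neon_nibble_mask(uint8x16_t cmp) {
            uint8x8_t nib = vshrn_n_u16(vreinterpretq_u16_u8(cmp), 4);
            return vget_lane_u64(vreinterpret_u64_u8(nib), 0);
        }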

With that said, C++ has its own experimental cross-platform SIMD abstraction, and there are many high-quality frameworks that let you abstract away writing per-platform SIMD code entirely. There is also a Rust crate[2] that offers C#-style SIMD abstraction, so it's not exclusive to C# (or particularly difficult in systems programming languages), but C# is probably the one and only high-level language to offer it with an assurance of good codegen (unless you abuse it too hard).

[0] https://github.com/dotnet/runtime/blob/main/docs/coding-guid...

[1] https://github.com/U8String/U8String/blob/main/Sources/U8Str...

[2] https://github.com/Lokathor/wide


Any modern compiler that bothers should be able to autovectorize most practical vectorizable things without issue, even without restrict. Of course there'll be some small inefficiencies or failed autovectorization sometimes, but small and big missed optimizations are in no way at all a problem unique to vectorization, so it's a moot point here.


"restrict" is getting around language issues where it's easy to "accidently" block things like vectorization, or having to reload values multiple times just in case pointers aliased. There's no compiler in the world that can work around this without more guarantees on the expected behavior from the programmer, as it would be incorrect according to the spec.

I've seen "obvious" big wins missed without it.
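A minimal sketch of the kind of win meant here, assuming gcc/clang at -O2/-O3: without the restrict qualifiers the compiler must assume each store through dst might change *scale (or overlap src), so it either reloads *scale every iteration or emits runtime overlap checks plus a fallback path.

        #include <stddef.h>

        /* With restrict, stores to dst[] cannot touch *scale or src[], so *scale is
           loaded once and the loop vectorizes cleanly. Remove the qualifiers and the
           compiler has to defend against aliasing. */
        void scale_all(float *restrict dst, const float *restrict src,
                       const float *scale, size_t n) {
            for (size_t i = 0; i < n; i++)
                dst[i] = src[i] * *scale;
        }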


Both gcc and clang check for aliasing at runtime for autovectorization if it isn't provable statically (granted, that can fail if you have reverse/strided/gather addresses, but those are less common; and yeah, it does add some constant overhead, though often not a significant one).

Of minor note is that you can add "#pragma clang loop vectorize(assume_safety)" or "#pragma GCC ivdep" to a loop on the respective compilers to allow them to vectorize anything, which in my experience is much more functional than restrict. Even then, the vast majority of the benefit I've gotten from it was just removing the alias-check overhead (though it did catch a case of a reversing loop failing to vectorize due to a 32-bit index variable or something).
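A sketch of how those pragmas attach to a loop (the promise is yours to keep: they tell the compiler there are no loop-carried dependencies through these pointers, so it can vectorize without the runtime alias check):

        void madd(float *a, const float *b, const float *c, int n) {
        #if defined(__clang__)
            #pragma clang loop vectorize(assume_safety)
        #elif defined(__GNUC__)
            #pragma GCC ivdep
        #endif
            for (int i = 0; i < n; i++)
                a[i] += b[i] * c[i];
        }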


> Both gcc and clang check for aliasing at runtime if not provable statically for autovectorization

This is often enough to make it unworkable, because it means you're inserting checks into hot loops.

Also, if you partially vectorize something yourself you have to write similar setup code, which might involve scalar versions of the loop, but then autovectorization can come by and vectorize those, so now you have duplicate setup code making it worse than nothing.


The checks are done outside of the loop though (unless you mean ≥two nested loops, with a small-ish inner one, at which point things indeed get less nice, but also this is less common).

Yeah, autovectorization of a manual tail loop can be annoying, but there are "#pragma clang loop unroll(disable)" and "#pragma clang loop vectorize(disable)" for clang, and for gcc "#pragma GCC unroll 1" and, from gcc 14, "#pragma GCC novector".
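For instance, to keep a hand-written scalar tail from being re-vectorized (a sketch using the pragmas quoted above; "#pragma GCC novector" needs gcc 14+):

        #include <stddef.h>

        /* Handles the last few elements after a manual SIMD main loop. */
        void scalar_tail(float *dst, const float *src, size_t i, size_t n) {
        #if defined(__clang__)
            #pragma clang loop vectorize(disable) unroll(disable)
        #elif defined(__GNUC__)
            #pragma GCC novector
            #pragma GCC unroll 1
        #endif
            for (; i < n; i++)
                dst[i] += src[i];
        }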


You need restrict for a loop involving char* or equivalent because it can alias any type in C. Without it, the compiler has to preserve memory accesses exactly, and then it's hardly able to do any optimizations. If you're writing all the memory accesses out perfectly optimally already, then you don't need the compiler for much.
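A tiny illustration of the char* point (a sketch; codegen details differ by compiler): because a store through char* may legally alias any object, the length read below can't be hoisted unless the output pointer is restrict-qualified.

        #include <stddef.h>

        /* Without restrict, each store through out could modify *len, so the compiler
           must re-read *len every iteration; with restrict it can hoist the load and
           treat the loop as a plain memset-style fill. */
        void zero_fill(char *restrict out, const size_t *len) {
            for (size_t i = 0; i < *len; i++)
                out[i] = 0;
        }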


It’s good so much attention is being given to arm! Apple also recently released more details on optimization

https://developer.apple.com/documentation/apple-silicon/cpu-...


Thanks for this link; did not realize that they did this.



