> Except as an application author I can't rely on it being there.
I'm surprised to keep hearing this concern. We can write code once in a vector-length agnostic way, compile it for multiple targets, and use whatever is there at runtime.
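A minimal sketch of what I mean, assuming GCC/Clang on x86 (the kernel names and the scalar placeholder body are illustrative, not from any particular codebase): compile the same generic kernel for multiple targets, then pick at runtime via `__builtin_cpu_supports`.

```c
/* Sketch of runtime dispatch between per-target builds of one generic
   kernel. In a real build, sum_avx2 would be the same source compiled
   with AVX2 enabled (separate TU or target attribute); here it's a
   placeholder so the sketch is self-contained. */
#include <assert.h>
#include <stddef.h>

static void sum_scalar(const float *x, size_t n, float *out) {
    float s = 0;
    for (size_t i = 0; i < n; i++) s += x[i];
    *out = s;
}

static void sum_avx2(const float *x, size_t n, float *out) {
    sum_scalar(x, n, out); /* placeholder for the AVX2-compiled variant */
}

typedef void (*sum_fn)(const float *, size_t, float *);

static sum_fn pick_sum(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_scalar;
}
```

Callers just do `pick_sum()(x, n, &s)` once at startup (or cache the function pointer) and get whatever the CPU actually supports.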
Agreed. Do you have any example of an optimization of generic/cross-platform vector code, such that it would run better on SSE2 or SSSE3?
One example might be reducing the number of live registers to avoid spilling, but you can already do that in your portable code, without necessarily requiring a separate codepath for the low-spec machines.
When writing "generic" code you still want to consider the target. For example, while it may semantically be nicer to use masking everywhere, doing so is pretty bad for perf pre-AVX-512 (especially for loads/stores, which just don't have masking pre-AVX2, and IIRC AVX2's masked loads/stores on AMD can still fault for masked-out elements). A pretty big portion of AVX-512's value is that it makes more things possible, or doable a lot more nicely, but that's useless if you also have to support targets that don't have it. Another example is a vector-wide byte shuffle: SSE and AVX-512 have instructions for that, but on AVX2 it has to be emulated with two 16×u8 shuffles, which you'd want to avoid needing if possible. In general, if you're not considering exactly the capabilities of your target arch and bending data formats and the wanted results around them, you're losing performance.
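For concreteness, here is a scalar reference model (illustrative, not from the thread) of what a vector-wide byte shuffle computes; SSE's PSHUFB does this for 16 bytes and AVX-512's VPERMB for 64, while AVX2's VPSHUFB only indexes within each 16-byte lane, which is why a true 32-byte shuffle there needs two shuffles plus a blend:

```c
/* Scalar model of a vector-wide byte shuffle: each output byte picks
   any input byte by index; an index with the high bit set yields zero
   (matching PSHUFB's convention). */
#include <assert.h>
#include <stdint.h>

static void byte_shuffle(const uint8_t *src, const uint8_t *idx,
                         uint8_t *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : src[idx[i] % n];
}
```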
> while it may semantically be nicer to use masking everywhere, doing so is pretty bad for perf
Agreed. It's best if applications pad data structures.
> iirc AVX2's masks on AMD can still fault for masked out elements
Unfortunately yes. I haven't seen a CPU use that latitude, though.
> it makes more things possible/doable a lot nicer, but that's useless if you have to also support things that don't.
hm, is it truly useless? I've found that even emulating missing vector functionality is usually still faster than entirely scalar code.
> if you're not considering exactly the capabilities of your target arch & bending data formats and wanted results around it, you're losing performance
That's fair. We're also losing performance if we don't port to every single arch we might be running on.
It seems to me that generic code (written as you say with an eye to platform capabilities) is a good and practical compromise, especially because we can still specialize per-arch where that is worthwhile.
> I've found that even emulating missing vector functionality is usually still faster than entirely scalar code.
Inefficient SIMD can still be better than scalar loops, yes, but better still is efficient SIMD; and you may be able to achieve that by rearranging things so that you don't need said emulation, which you wouldn't have to bother doing if you could target only AVX-512.
OK, it would indeed be nice if we only had to target AVX-512, but that's not the reality I'm in.
On minimizing required emulation - any thoughts as to how? Padding data structures seems to be the biggest win, only mask/gather/scatter where unavoidable, anything else?
- Stay within 16-byte lanes for many things (≤16-bit element shuffles, truncation, etc).
- Use saturating narrowing when converting to smaller types, if possible.
- Try to stick to a==b and signed a>b, e.g. by moving negation elsewhere or avoiding unsigned types.
- Switch to a wider element type if many operations in a sequence aren't supported on the narrower one (or, conversely, stay on the narrower type if only a few ops need a wider one).

Some of these may be mitigated by sufficiently advanced compilers, but they're quite limited currently.
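The "signed a>b" tip rests on a classic identity worth spelling out: pre-AVX-512, x86 SIMD integer compares (PCMPGT) are signed-only, but an unsigned compare can be done as a signed compare after flipping the sign bits; in SIMD that's one XOR with a splat of 0x80000000 before the compare. A scalar demonstration of the identity (illustrative code, not from the thread):

```c
/* Unsigned a > b via a signed compare: XOR-ing the sign bit maps
   unsigned order onto signed order, which is the trick used to get
   unsigned compares out of the signed-only PCMPGT instructions. */
#include <assert.h>
#include <stdint.h>

static int unsigned_gt_via_signed(uint32_t a, uint32_t b) {
    return (int32_t)(a ^ 0x80000000u) > (int32_t)(b ^ 0x80000000u);
}
```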
Great points! It seems useful to add a list based on yours to our readme.
Please let me know if you'd like us to acknowledge you in the commit message with anything other than the username dzaima.
"dzaima" is how I prefer to be referred to; but that list is largely me going from memory, so it's definitely worth double-checking. (And of course it's ≤AVX2-specific; e.g. x!=y does exist in AVX-512 (and clang can rewrite movemask(~(a==b)) → ~movemask(a==b), but gcc won't), and I can imagine truncating narrowing at some point in the future being faster than saturating narrowing on AVX-512. Or maybe saturating narrow isn't even better now? For i32→i8, clang emits two xmmword reads, whereas _mm256_packs_epi32 → _mm256_packs_epi16 → _mm256_permutevar8x32_epi32({0,4,undef}) can read a ymmword at a time, thus maybe (?) being better on the memory subsystem; but clang rewrites the permd as vextracti128 & vpunpckldq, making it unnecessarily worse in throughput.)
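For readers unfamiliar with the packs chain: per element, the composed `_mm256_packs_epi32` → `_mm256_packs_epi16` saturation is equivalent to clamping straight to the i8 range (clamping to i16 first and then to i8 gives the same result); the final permute only fixes the lane interleaving those instructions produce. A scalar reference of the per-element result (illustrative, not from the thread):

```c
/* Scalar reference for saturating i32 -> i8 narrowing, the per-element
   behavior of the _mm256_packs_epi32 / _mm256_packs_epi16 chain:
   values outside [-128, 127] clamp to the nearest bound. */
#include <assert.h>
#include <stdint.h>

static int8_t sat_narrow_i32_to_i8(int32_t v) {
    if (v < INT8_MIN) return INT8_MIN;
    if (v > INT8_MAX) return INT8_MAX;
    return (int8_t)v;
}
```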
Yes, they theoretically could. The AMD manual contains this language:
> Exception and trap behavior for elements not selected for loading or storing from/to memory is implementation dependent. For instance, a given implementation may signal a data breakpoint or a page fault for doublewords that are zero-masked and not actually written.
To clarify, are you saying the entire app was slower with AVX than it was with SSE4?
That would be surprising, because 2x vector width is expected to outweigh 10-20 percent downclocking. Even more so with Haswell, which adds FMA and thus doubles FLOPS.
The additional permutes are indeed not free, but we did get an all-to-all int32 shuffle, which can actually be more efficient than having to load or generate the corresponding PSHUFB control vector.
Taking a step back, these are examples of AVX adding a bit of cost, but I'm not yet seeing an accounting of the benefits, nor showing that they are outweighed by the cost.
> To clarify, are you saying the entire app was slower with AVX than it was with SSE4?
We were optimizing loop by loop, and some loops converted to AVX could well be slower, yes (at that point in time). AVX is probably more often a win nowadays.
> That would be surprising, because 2x vector width is expected to outweigh 10-20 percent downclocking.
In practice, workloads are often bottlenecked by memory access, and there are diminishing returns with increased vector size.