> Except as an application author I can't rely on it being there.
I'm surprised to keep hearing this concern. We can write code once in a vector-length agnostic way, compile it for multiple targets, and use whatever is there at runtime.
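A minimal sketch of what I mean, assuming GCC/Clang on x86 (the kernel names and the scalar placeholder body are illustrative, not from any particular codebase): compile the same generic kernel for multiple targets, then pick at runtime via `__builtin_cpu_supports`.

```c
/* Sketch of runtime dispatch between per-target builds of one generic
   kernel. In a real build, sum_avx2 would be the same source compiled
   with AVX2 enabled (separate TU or target attribute); here it's a
   placeholder so the sketch is self-contained. */
#include <assert.h>
#include <stddef.h>

static void sum_scalar(const float *x, size_t n, float *out) {
    float s = 0;
    for (size_t i = 0; i < n; i++) s += x[i];
    *out = s;
}

static void sum_avx2(const float *x, size_t n, float *out) {
    sum_scalar(x, n, out); /* placeholder for the AVX2-compiled variant */
}

typedef void (*sum_fn)(const float *, size_t, float *);

static sum_fn pick_sum(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_scalar;
}
```

Callers just do `pick_sum()(x, n, &s)` once at startup (or cache the function pointer) and get whatever the CPU actually supports.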
Agreed. Do you have any example of an optimization of generic/cross-platform vector code, such that it would run better on SSE2 or SSSE3?
One example might be reducing the number of live registers to avoid spilling, but you can already do that in your portable code, without necessarily requiring a separate codepath for the low-spec machines.
When writing "generic" code you still want to consider the target. For example, while it may semantically be nicer to use masking everywhere, doing so is pretty bad for perf pre-AVX-512 (especially for loads/stores, which just don't have masking pre-AVX2, and IIRC AVX2's masked loads/stores on AMD can still fault for masked-out elements). A pretty big portion of AVX-512's value is that it makes more things possible, or doable a lot more nicely, but that's useless if you also have to support targets that don't have it. Another example is a vector-wide byte shuffle: SSE and AVX-512 have instructions for that, but on AVX2 it has to be emulated with two 16×u8 shuffles, which you'd want to avoid needing if possible. In general, if you're not considering exactly the capabilities of your target arch and bending data formats and the wanted results around them, you're losing performance.
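For concreteness, here is a scalar reference model (illustrative, not from the thread) of what a vector-wide byte shuffle computes; SSE's PSHUFB does this for 16 bytes and AVX-512's VPERMB for 64, while AVX2's VPSHUFB only indexes within each 16-byte lane, which is why a true 32-byte shuffle there needs two shuffles plus a blend:

```c
/* Scalar model of a vector-wide byte shuffle: each output byte picks
   any input byte by index; an index with the high bit set yields zero
   (matching PSHUFB's convention). */
#include <assert.h>
#include <stdint.h>

static void byte_shuffle(const uint8_t *src, const uint8_t *idx,
                         uint8_t *dst, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : src[idx[i] % n];
}
```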
> while it may semantically be nicer to use masking everywhere, doing so is pretty bad for perf
Agreed. It's best if applications pad data structures.
> iirc AVX2's masks on AMD can still fault for masked out elements
Unfortunately yes. I haven't seen a CPU use that latitude, though.
> it makes more things possible/doable a lot nicer, but that's useless if you have to also support things that don't.
hm, is it truly useless? I've found that even emulating missing vector functionality is usually still faster than entirely scalar code.
> if you're not considering exactly the capabilities of your target arch & bending data formats and wanted results around it, you're losing performance
That's fair. We're also losing performance if we don't port to every single arch we might be running on.
It seems to me that generic code (written as you say with an eye to platform capabilities) is a good and practical compromise, especially because we can still specialize per-arch where that is worthwhile.
> I've found that even emulating missing vector functionality is usually still faster than entirely scalar code.
Inefficient SIMD can still be better than scalar loops, yes, but better still is efficient SIMD; and you may be able to achieve that by rearranging things so that you don't need said emulation, which you wouldn't have to bother doing if you could target only AVX-512.
OK, it would indeed be nice if we only had to target AVX-512, but that's not the reality I'm in.
On minimizing required emulation - any thoughts as to how? Padding data structures seems to be the biggest win, only mask/gather/scatter where unavoidable, anything else?
- Stay within 16-byte lanes for many things (≤16-bit element shuffles, truncation, etc).
- Use saturating narrowing when converting to smaller types, if possible.
- Try to stick to a==b and signed a>b, e.g. by moving negation elsewhere or avoiding unsigned types.
- Switch to a wider element type if many operations in a sequence aren't supported on the narrower one (or, conversely, stay on the narrower type if only a few ops need a wider one).

Some of these may be mitigated by sufficiently advanced compilers, but they're quite limited currently.
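The "signed a>b" tip rests on a classic identity worth spelling out: pre-AVX-512, x86 SIMD integer compares (PCMPGT) are signed-only, but an unsigned compare can be done as a signed compare after flipping the sign bits; in SIMD that's one XOR with a splat of 0x80000000 before the compare. A scalar demonstration of the identity (illustrative code, not from the thread):

```c
/* Unsigned a > b via a signed compare: XOR-ing the sign bit maps
   unsigned order onto signed order, which is the trick used to get
   unsigned compares out of the signed-only PCMPGT instructions. */
#include <assert.h>
#include <stdint.h>

static int unsigned_gt_via_signed(uint32_t a, uint32_t b) {
    return (int32_t)(a ^ 0x80000000u) > (int32_t)(b ^ 0x80000000u);
}
```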
Great points! It seems useful to add a list based on yours to our readme.
Please let me know if you'd like us to acknowledge you in the commit message with anything other than the username dzaima.
"dzaima" is how I prefer to be referred to; but that list is largely me going from memory, so it's definitely worth double-checking. (And of course it's ≤AVX2-specific; e.g. x!=y does exist in AVX-512 (and clang can rewrite movemask(~(a==b)) → ~movemask(a==b), but gcc won't), and I can imagine truncating narrowing at some point in the future being faster than saturating narrowing on AVX-512. Or maybe saturating narrow isn't even better now? For i32→i8, clang emits two xmmword reads, whereas _mm256_packs_epi32 → _mm256_packs_epi16 → _mm256_permutevar8x32_epi32({0,4,undef}) can read a ymmword at a time, thus maybe (?) being better on the memory subsystem; but clang rewrites the permd as vextracti128 & vpunpckldq, making it unnecessarily worse in throughput.)
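For readers unfamiliar with the packs chain: per element, the composed `_mm256_packs_epi32` → `_mm256_packs_epi16` saturation is equivalent to clamping straight to the i8 range (clamping to i16 first and then to i8 gives the same result); the final permute only fixes the lane interleaving those instructions produce. A scalar reference of the per-element result (illustrative, not from the thread):

```c
/* Scalar reference for saturating i32 -> i8 narrowing, the per-element
   behavior of the _mm256_packs_epi32 / _mm256_packs_epi16 chain:
   values outside [-128, 127] clamp to the nearest bound. */
#include <assert.h>
#include <stdint.h>

static int8_t sat_narrow_i32_to_i8(int32_t v) {
    if (v < INT8_MIN) return INT8_MIN;
    if (v > INT8_MAX) return INT8_MAX;
    return (int8_t)v;
}
```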
Yes, they theoretically could. The AMD manual contains this language:
> Exception and trap behavior for elements not selected for loading or storing from/to memory is implementation dependent. For instance, a given implementation may signal a data breakpoint or a page fault for doublewords that are zero-masked and not actually written.
To clarify, are you saying the entire app was slower with AVX than it was with SSE4?
That would be surprising, because 2x vector width is expected to outweigh 10-20 percent downclocking. Even more so with Haswell, which adds FMA and thus doubles FLOPS.
The additional permutes are indeed not free, but we did get an all-to-all int32 shuffle, which can actually be more efficient than having to load or generate the corresponding PSHUFB control vector.
Taking a step back, these are examples of AVX adding a bit of cost, but I'm not yet seeing an accounting of the benefits, nor showing that they are outweighed by the cost.
> To clarify, are you saying the entire app was slower with AVX than it was with SSE4?
We were optimizing loop by loop, and some loops converted to AVX could well be slower, yes (at that point in time). AVX is probably more often a win nowadays.
> That would be surprising, because 2x vector width is expected to outweigh 10-20 percent downclocking.
In practice, workloads are often bottlenecked by memory access, and there are diminishing returns with increased vector size.