should be
v4 = permute4x64(v3, 3120) = d c b a
v5 = permute4x64(v3, 2031) = b a ...

janwas · on Nov 29, 2022

Thanks for the example. This seems like a reasonable thing to want, and as you say, it's not ideal to call SwapAdjacentBlocks and then shuffle again. It's not yet clear to me how to define an op for this that remains useful when vectors are only 128 bits wide. Until then, it seems you could use TableLookupLanes (general permutation) at the same latency as permute4x64, at the cost of also loading an index vector constant?
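To make the equivalence concrete, here is a minimal scalar sketch (plain C++, not Highway code) of why a general lane lookup can stand in for permute4x64. It assumes the usual `vpermq` immediate encoding, where result lane `i` is selected by bits `[2i+1:2i]` of the 8-bit immediate. The names `Permute4x64Model`, `TableLookupModel` and `IndicesFromImm8` are hypothetical helpers invented for this illustration; they are not part of Highway or any intrinsics header.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Four 64-bit lanes; lane 0 models the lowest 64 bits of a 256-bit vector.
using Vec4 = std::array<uint64_t, 4>;
using Idx4 = std::array<uint32_t, 4>;

// Scalar model of permute4x64: result lane i takes the source lane named by
// the 2-bit field at bits [2i+1:2i] of the immediate.
Vec4 Permute4x64Model(const Vec4& v, uint8_t imm8) {
  Vec4 out{};
  for (size_t i = 0; i < 4; ++i) {
    out[i] = v[(imm8 >> (2 * i)) & 3];
  }
  return out;
}

// Scalar model of a general lane permutation: result lane i is v[idx[i]].
// The index table plays the role of the extra vector constant to be loaded.
Vec4 TableLookupModel(const Vec4& v, const Idx4& idx) {
  Vec4 out{};
  for (size_t i = 0; i < 4; ++i) {
    assert(idx[i] < 4);  // indices must name a valid lane
    out[i] = v[idx[i]];
  }
  return out;
}

// Expands an immediate into the index table that yields the same permutation.
Idx4 IndicesFromImm8(uint8_t imm8) {
  Idx4 idx{};
  for (size_t i = 0; i < 4; ++i) {
    idx[i] = (imm8 >> (2 * i)) & 3;
  }
  return idx;
}
```

Every immediate maps to one index table, so any compile-time permute4x64 pattern can be precomputed as a constant and handed to the general lookup. The reverse does not hold once vectors are wider than four lanes, which is part of why the general op is the portable one.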

> These things are often a little bit like chess engines and a little bit less like linear algebra systems, so the implementation is designed around a specific width.

Hm. For something like a bitboard or AES state, wider vectors typically mean you can process multiple independent instances at once. Likely that's already happening for your AVX-512 version? If you can define your problem in terms of a minimum block size of 128 bits, it should be feasible.
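As a rough illustration of the "minimum block size of 128 bits" idea, the scalar sketch below (plain C++, not Highway code) applies one per-block operation to however many 128-bit blocks a vector happens to hold. The operation itself, swapping the two 64-bit halves of a block, is only a placeholder, and the names `SwapHalvesInBlock` and `ForEachBlock` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Placeholder per-block operation: swaps the two 64-bit halves of one
// 128-bit block. A real kernel would put its bitboard or AES step here.
void SwapHalvesInBlock(uint64_t* block) {
  std::swap(block[0], block[1]);
}

// Treats `lanes` as a vector of 64-bit lanes of any width (2 lanes = 128-bit,
// 4 = 256-bit, 8 = 512-bit) and applies the operation to each 128-bit block
// independently. No data crosses a block boundary.
void ForEachBlock(std::vector<uint64_t>& lanes) {
  assert(lanes.size() % 2 == 0);  // whole blocks only
  for (size_t i = 0; i < lanes.size(); i += 2) {
    SwapHalvesInBlock(&lanes[i]);
  }
}
```

Because each block is handled in isolation, the same code gives one instance per 128-bit vector and four independent instances per 512-bit vector, without the algorithm needing to know the width.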

> I ever move on to these targets, which I certainly should if they have large performance benefits compared to NEON.

SVE is the only way to access wider vectors (currently 256 or 512 bits) on Arm.