I had a go at writing this, both the initial float version and the later integer...

I had a go at writing this, both the initial float version and the later integer-based version. Note that I haven't actually tested these to make sure that output is correct, so, I might have messed up somewhere, and that I targeted Rocketlake for the CPU.

For the float-based version[0] I had to break out the unstable portable_simd to get it to vectorize. Most of the function ends up being setting up everything, but then actually doing the calculation is simple, and basically the same as non-SIMD section. I've never used the portable SIMD stuff before, and it was quite pleasant to use.

For the integer-based version, I started with the simple naive approach[1], and that vectorized to a pretty good degree on stable. However, it doesn't use the dot-product instruction. For that, I think we need to use nightly and go a bit manual[2]. Unsurprisingly, it mostly ends up looking like the float version as a fair chunk is just setup. I didn't bother here, but it should probably be using feature detection to make sure the instruction exists.

[0] https://godbolt.org/z/Gdv8azorW [1] https://godbolt.org/z/d8jv3ofYo [2] https://godbolt.org/z/4oYEnKTbf