“Memory movement”? None of the instructions you list involve memory. I find the ...

Voultapher · 2025-09-15T17:46:12 1757958372

Doing the phf as shown is an and + neg instruction, while just doing % 4 is only the and. I tested it on an Apple M1 machine and saw no difference in performance at all. It's possible to go much faster with vectorization: 3x on the Zen 3 machine.

Sesse__ · 2025-09-15T22:21:53 1757974913

I didn't say it was slower, just that it was more obfuscated.

akoboldfrying · 2025-09-19T09:53:07 1758275587

You're right about memory movement; I'm not sure what I was thinking.