After analyzing the benchmarks, I have a suspicion that the Apple compiler does some automatic parallelization and issues some instructions on the GPU, since the GPU shares coherent memory with the CPU. That would explain why Rosetta 2 performance is slower, as optimizing loops in an already-compiled binary is harder. If they are not doing that, they should, as it would be something like AVX-512 for a small energy budget. Just launching a debate, as the wide decode with a long pipeline smells like either Pentium 4-style crap or a miracle.