For this type of computation the real bottleneck is memory: both size and bandwidth. You have to read the data and write it back, so you simply cannot run faster than your memory bandwidth allows. For example, if you have a 2 GB model (1G float16 values) that does not fit in any cache, and you have to perform a calculation on all of the data, then one operation over all of the data requires 4 GB of memory traffic (2 GB read plus 2 GB written back). An M1 can do that at most 17 times/s (68 / 4, and only if you do nothing else, so it is an impossible maximum). A 3090 can do it about 234 times/s (936.2 / 4, which is also not achievable in practice). The raw bandwidth ratio is about 13.8x, so "3090 = 10x M1" is a lower estimate.

3090: 24 GB memory, 936.2 GB/s bandwidth

M1: up to 16 GB memory, up to 68 GB/s bandwidth (shared with other parts of the system)
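
If you want to redo the arithmetic for other hardware, here is a rough Python sketch of the same estimate. The function name `max_passes_per_second` is made up for illustration, and it assumes each pass reads the whole model once and writes it back once, with zero compute time and no cache reuse, so the results are upper bounds and not benchmarks.

```python
GB = 1e9  # decimal gigabytes, matching how vendors quote bandwidth


def max_passes_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on full passes over the model per second.

    Assumption: one pass = read all data once + write all data back once,
    so the memory traffic per pass is 2 * model_bytes.
    """
    traffic_per_pass = 2 * model_bytes
    return bandwidth_bytes_per_s / traffic_per_pass


model_bytes = 2 * GB  # 1G float16 values at 2 bytes each

m1 = max_passes_per_second(model_bytes, 68 * GB)
rtx3090 = max_passes_per_second(model_bytes, 936.2 * GB)

print(f"M1:    {m1:.1f} passes/s")
print(f"3090:  {rtx3090:.1f} passes/s")
print(f"ratio: {rtx3090 / m1:.1f}x")
```

Swap in your own model size and bandwidth figures to get the ceiling for a different card. Real throughput will be lower, since the estimate ignores compute, latency, and (on the M1) bandwidth taken by the rest of the system.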