Apple claims 11 trillion OPS, NVIDIA claims 285 TOPS, my educated guess is that ...

trhway · on Dec 6, 2020

NVIDIA is ~30 TFLOPS, i.e. 3x.

andy_threos_io · on Dec 6, 2020

For this type of computation the real bottleneck is memory: size and bandwidth. You have to read the data and write it back, so you simply cannot go faster than your memory bandwidth allows. For example, if you have a 2 GB model (1G float16 values) that does not fit in any cache and you have to compute on all of the data, then one operation over all of the data needs 4 GB of memory traffic. The M1 can do that 17 times/s (if it does nothing else, so it's an impossible maximum); the 3090 can do it about 234 times/s (also not very practical). So 3090 = 10x M1 is a lower estimate.

3090: 24 GB memory, bandwidth 936.2 GB/s

M1: 16 GB memory, max bandwidth 68 GB/s (shared with other parts of the system)
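
For anyone who wants to redo the arithmetic with other numbers, here is a minimal sketch of the bandwidth-bound estimate above. It assumes, as the comment does, that every byte of the model is read once and written back once per pass and that nothing else uses the bus. The function and variable names are made up for illustration.

```python
def max_passes_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on full passes over the model per second.

    Assumes each pass reads the whole model and writes it back,
    so the memory traffic per pass is twice the model size.
    """
    traffic_per_pass_gb = 2 * model_gb
    return bandwidth_gb_s / traffic_per_pass_gb


MODEL_GB = 2.0       # 1G float16 values at 2 bytes each
M1_BW = 68.0         # GB/s, shared with the rest of the system
RTX3090_BW = 936.2   # GB/s

m1 = max_passes_per_second(M1_BW, MODEL_GB)
rtx = max_passes_per_second(RTX3090_BW, MODEL_GB)

print(f"M1:    {m1:.0f} passes/s")
print(f"3090:  {rtx:.0f} passes/s")
print(f"ratio: {rtx / m1:.1f}x")
```

With the figures quoted here this gives about 17 passes/s for the M1 and about 234 for the 3090, a ratio of roughly 13.8x. Both are ceilings that ignore compute time and contention, so real numbers will be lower on both sides.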

EvgeniyZh · on Dec 6, 2020

The 30 TFLOPS figure is for general-purpose compute. For tensor cores, depending on what exactly you count, it can be 71 TFLOPS (fp32 accumulator), 142 TFLOPS (fp16 accumulator), or 282 TFLOPS (sparse tensors).