Have you tried to write a kernel for basic matrix multiplication? Because I have, and I can assure you it is very hard to get to 50% of peak FLOPs, let alone 90%. It is nothing like CPUs, where you write a * b in C and the compiler gets you 99% of the achievable performance.
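To give a feel for the starting point, here is roughly what the naive version looks like: one thread per output element, every operand fetched straight from global memory. This is just a sketch (square row-major matrices, names are mine), and it's the version that lands in the low single digits of percent of peak:

    // Naive SGEMM: C = A * B, all matrices N x N, row-major.
    // Each thread computes one element of C and re-reads 2*N floats
    // from global memory with zero reuse -- hence the terrible FLOPs.
    __global__ void sgemm_naive(int N, const float *A, const float *B, float *C) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }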
And this is just basic matrix multiplication. If you add activation functions, it slows down even more. There is nothing easy about GPU programming if you care about performance. CUDA gives you all that optimization on a plate.
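To make that concrete: the usual move is to fuse the activation into the kernel's write-back so you don't pay a second pass over C, but then the epilogue is one more thing to hand-tune. A hypothetical GELU fused into the naive kernel above (tanh approximation; my sketch, not anyone's library code):

    // Hypothetical device-side GELU (tanh approximation).
    __device__ float gelu(float x) {
        return 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
    }

    // ...and in the kernel above, the plain store becomes:
    //     C[row * N + col] = gelu(acc);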
Well, CUDA gives you a whole programming language in which you still have to figure out the optimizations yourself, for your particular card's cache sizes and bus width.
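And those knobs are very concrete. The classic first step is staging TILE x TILE blocks of A and B in shared memory so each global load gets reused TILE times, and TILE is exactly the kind of constant you end up re-tuning per card. A sketch (assumes N is divisible by TILE and a TILE x TILE thread block; names are mine):

    // Shared-memory tiled SGEMM: C = A * B, all N x N, row-major.
    // Launch with dim3 block(TILE, TILE), grid(N / TILE, N / TILE).
    // TILE = 32 works on many cards, but the best value depends on
    // shared-memory size and occupancy -- i.e. per-GPU tuning.
    #define TILE 32

    __global__ void sgemm_tiled(int N, const float *A, const float *B, float *C) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < N / TILE; ++t) {
            // Each thread loads one element of the A tile and one of the B tile.
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }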
I'm saying the API surface you need to offer for LLMs is pretty small. Yeah, optimizing it is hard, but it's "one really smart person working for a few weeks" hard, and most of the tiling techniques are public. Speaking of which, thanks for that blog post; off to read it now.
Here is an example of how hard it is: https://siboehm.com/articles/22/CUDA-MMM