I don't know why you are getting downvoted. This is 100% true. It's not like you can take any random data and train it into a NN. You have to transform the data, you have to write the low-level GPU kernels that will actually run fast on that particular GPU, and you have to get the output and transform that as well. All of this is hard, and close to impossible to recreate from scratch.
If people use PyTorch on an Nvidia GPU, they are running layers and layers of code written by people who know how to write fast kernels for GPUs. In some cases they use assembly as well.
Nvidia stuck to one stack and built all their high-level libraries on it, while their competitors kept switching from old APIs to new ones and never made anything close to CUDA.
Because in the context of LLM transformers, you really just need matrix multiplication to be hyper-optimized; it's 90-99% (citation needed) of the FLOPs. Get some normalization and activation functions in and you're good to go. It's not a massive software ecosystem.
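Rough back-of-envelope for a single decoder layer, assuming hidden size d = 4096, sequence length n = 2048, and a 4d MLP (my assumptions, not any particular model):

    Q/K/V/O projections:             ~2 * 4 * n * d^2 ≈ 275 GFLOPs
    MLP up + down (4d inner dim):    ~2 * 8 * n * d^2 ≈ 550 GFLOPs
    attention scores + weighted sum: ~2 * 2 * n^2 * d ≈  69 GFLOPs
    layernorm / softmax / residuals: O(n * d), well under 1 GFLOP

Under those assumptions the matmuls are well over 99% of the arithmetic, which is why they're the only thing that really has to be fast.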
CUDA and cuBLAS being capable of a bunch of other things is really cool, and would take a long time to catch up with, but getting the bare minimum to run LLMs on any platform with a bunch of GDDR7 channels and cores at a reasonable price would have people writing torch/ggml backends within weeks.
Have you tried to write a kernel for basic matrix multiplication? Because I have, and I can assure you it is very hard to get 50% of peak FLOPs, let alone 90%. It is nothing like CPUs, where you write a * b in C and get 99% of the performance from the compiler.
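For anyone who hasn't: here is a minimal sketch of the obvious first attempt (a toy device-side kernel of my own, host code omitted, not anything from a library). It's correct, and it's also the kind of kernel that gets a single-digit percentage of peak on a modern card, because every thread streams a whole row of A and column of B from global memory with zero reuse:

    // Naive matmul: C = A * B, all N x N row-major floats.
    // Launch with a 2D grid of 2D blocks covering the N x N output.
    __global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float acc = 0.0f;
            // One global-memory load of A and B per multiply-add:
            // the kernel is memory-bound long before it is compute-bound.
            for (int k = 0; k < N; ++k)
                acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }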
And this is just basic matrix multiplication. If you add activation functions on top, it slows down even more. There is nothing easy about GPU programming if you care about performance. CUDA gives you all that optimization on a plate.
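To make the activation point concrete: if you don't fuse it into the matmul epilogue and instead run it as its own kernel, the entire output matrix takes one more round trip through global memory. A hypothetical unfused ReLU pass, just for illustration:

    // Unfused activation: one extra global read and write per element of C.
    // Fused into the matmul's epilogue, the same math would be nearly free.
    __global__ void relu_inplace(float* C, int n_elems) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_elems)
            C[i] = fmaxf(C[i], 0.0f);
    }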
Well, CUDA gives you a whole programming language where you have to figure out the optimization for your particular card's cache size and bus width.
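The standard first step is shared-memory tiling, and the tile size is exactly the kind of per-card knob you end up sweeping. A textbook-style sketch (16x16 tiles assumed; real kernels pick the tile shape to match the card's shared memory and register budget, host code again omitted):

    #define TILE 16  // assumption: tune per card, this is just a common default

    // Launch with dim3 block(TILE, TILE) and a grid of ceil(N/TILE) x ceil(N/TILE).
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
            // Each tile of A and B is loaded into shared memory once and
            // reused TILE times, cutting global-memory traffic by ~TILE.
            As[threadIdx.y][threadIdx.x] =
                (row < N && t * TILE + threadIdx.x < N) ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (t * TILE + threadIdx.y < N && col < N) ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < N && col < N)
            C[row * N + col] = acc;
    }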
I'm saying the API surface of what to offer for LLMs is pretty small. Yeah, optimizing it is hard, but it's "one really smart person works for a few weeks" hard, and most of the tiling techniques are public. Speaking of which, thanks for that blog post, off to read it now.