I don't know why you are getting downvoted. This is 100% true. It's not like you can take any random data and train it into a NN. You have to transform the data, you have to write the low-level GPU kernels that will actually run fast on that particular GPU, and you have to get the output and transform that as well. All of this is hard, and close to impossible to recreate from scratch.
If people use PyTorch on an Nvidia GPU, they are running layers and layers of code written by people who know how to write fast kernels for GPUs. In some cases they use assembly as well.
Nvidia stuck to one stack and built all their high-level libraries on it, while their competitors kept switching from old APIs to new ones and never made anything close to CUDA.
Because in the context of LLM transformers, you really just need matrix multiplication to be hyper-optimized; it's 90-99% (citation needed) of the FLOPs. Get some normalization and activation functions in and you're good to go. It's not a massive software ecosystem.
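Rough back-of-envelope for a single decoder layer, assuming hidden size d = 4096, sequence length n = 2048, and a 4d MLP (my assumptions, not any particular model):

    Q/K/V/O projections:             ~2 * 4 * n * d^2 ≈ 275 GFLOPs
    MLP up + down (4d inner dim):    ~2 * 8 * n * d^2 ≈ 550 GFLOPs
    attention scores + weighted sum: ~2 * 2 * n^2 * d ≈  69 GFLOPs
    layernorm / softmax / residuals: O(n * d), well under 1 GFLOP

Under those assumptions the matmuls are well over 99% of the arithmetic, which is why they're the only thing that really has to be fast.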
CUDA and cuBLAS being capable of a bunch of other things is really cool, and would take a long time to catch up with, but getting the bare minimum to run LLMs on any platform with a bunch of GDDR7 channels and cores at a reasonable price would have people writing torch/ggml backends within weeks.
Have you tried to write a kernel for basic matrix multiplication? Because I have, and I can assure you it is very hard to get 50% of peak FLOPs, let alone 90%. It is nothing like CPUs, where you write a * b in C and get 99% of the performance from the compiler.
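For anyone who hasn't: here is a minimal sketch of the obvious first attempt (a toy device-side kernel of my own, host code omitted, not anything from a library). It's correct, and it's also the kind of kernel that gets a single-digit percentage of peak on a modern card, because every thread streams a whole row of A and column of B from global memory with zero reuse:

    // Naive matmul: C = A * B, all N x N row-major floats.
    // Launch with a 2D grid of 2D blocks covering the N x N output.
    __global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float acc = 0.0f;
            // One global-memory load of A and B per multiply-add:
            // the kernel is memory-bound long before it is compute-bound.
            for (int k = 0; k < N; ++k)
                acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }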
And this is just basic matrix multiplication. If you add activation functions on top, it slows down even more. There is nothing easy about GPU programming if you care about performance. CUDA gives you all that optimization on a plate.
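To make the activation point concrete: if you don't fuse it into the matmul epilogue and instead run it as its own kernel, the entire output matrix takes one more round trip through global memory. A hypothetical unfused ReLU pass, just for illustration:

    // Unfused activation: one extra global read and write per element of C.
    // Fused into the matmul's epilogue, the same math would be nearly free.
    __global__ void relu_inplace(float* C, int n_elems) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_elems)
            C[i] = fmaxf(C[i], 0.0f);
    }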
Well, CUDA gives you a whole programming language where you have to figure out the optimization for your particular card's cache size and bus width.
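The standard first step is shared-memory tiling, and the tile size is exactly the kind of per-card knob you end up sweeping. A textbook-style sketch (16x16 tiles assumed; real kernels pick the tile shape to match the card's shared memory and register budget, host code again omitted):

    #define TILE 16  // assumption: tune per card, this is just a common default

    // Launch with dim3 block(TILE, TILE) and a grid of ceil(N/TILE) x ceil(N/TILE).
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
            // Each tile of A and B is loaded into shared memory once and
            // reused TILE times, cutting global-memory traffic by ~TILE.
            As[threadIdx.y][threadIdx.x] =
                (row < N && t * TILE + threadIdx.x < N) ? A[row * N + t * TILE + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (t * TILE + threadIdx.y < N && col < N) ? B[(t * TILE + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < N && col < N)
            C[row * N + col] = acc;
    }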
I'm saying the API surface of what to offer for LLMs is pretty small. Yeah, optimizing it is hard, but it's "one really smart person works for a few weeks" hard, and most of the tiling techniques are public. Speaking of which, thanks for that blog post, off to read it now.