I have a slightly tangential question: Do you have any insights into what exactly DeepSeek did by bypassing CUDA that made their run more efficient?
I always found it surprising that a core library like CUDA, developed over such a long time, still had room for improvement, especially to the extent that a seemingly new team of developers could bridge the gap on their own.
They didn’t. They used PTX, which is what CUDA C++ compiles down to and which is itself part of the CUDA toolchain. All the major players have needed to do this, because the intrinsics for the latest accelerators are not actually exposed in the C++ API, which means using them requires inline PTX at the very minimum.
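To make "inline PTX" concrete, here is a minimal sketch of what it looks like inside an ordinary CUDA C++ kernel. The instruction chosen (lop3.b32, a three-input boolean op without a standard C++ intrinsic) is just a stand-in for illustration, not anything DeepSeek is documented to have used:

    #include <cstdio>

    // A 3-input boolean op issued via inline PTX from a normal CUDA C++ kernel.
    // lop3.b32 computes an arbitrary boolean function of three inputs, selected
    // by an 8-bit lookup-table immediate (0x96 == a ^ b ^ c).
    __global__ void lop3_demo(unsigned *out, unsigned a, unsigned b, unsigned c) {
        unsigned r;
        asm volatile("lop3.b32 %0, %1, %2, %3, 0x96;"
                     : "=r"(r)
                     : "r"(a), "r"(b), "r"(c));
        out[threadIdx.x] = r;
    }

    int main() {
        unsigned *d_out, h_out = 0;
        cudaMalloc(&d_out, sizeof(unsigned));
        lop3_demo<<<1, 1>>>(d_out, 0xF0F0F0F0u, 0x0F0F0F0Fu, 0xFFFF0000u);
        cudaMemcpy(&h_out, d_out, sizeof(unsigned), cudaMemcpyDeviceToHost);
        printf("lop3 result: 0x%08x (expect 0x0000ffff)\n", h_out);
        cudaFree(d_out);
        return 0;
    }

The point is that the PTX lives inside a normal CUDA kernel and goes through the normal CUDA toolchain, so nothing outside CUDA is being used.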
They basically ditched CUDA and went straight to writing in PTX, which is like GPU assembly, letting them repurpose some cores for communication to squeeze out extra performance. I believe that with better AI models and tools like Cursor, we will move to a world where you can mold code ever more specifically to your use case to make it more performant.
Are you sure they ditched CUDA? I keep hearing this, but it seems odd, because entirely ditching it would be a ton of extra work compared to selectively employing some PTX in CUDA kernels, which is fairly straightforward.
Their paper [1] only mentions using PTX in a few areas to optimize data transfer operations so they don't blow up the L2 cache. This makes intuitive sense to me, since the main limitation of the H800 vs. the H100 is reduced NVLink bandwidth, which would necessitate doing stuff like this that may not be a common thing for others who have access to H100s.
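For flavor, here's a hedged sketch of the kind of PTX that helps with L2 pressure: loads and stores carrying the .cs ("cache streaming", evict-first) hint, emitted inline from a regular CUDA kernel. The paper doesn't spell out which instructions they actually used, so treat this purely as an illustration of the mechanism:

    #include <cstdio>

    // Copy kernel whose loads/stores carry the ".cs" (cache-streaming,
    // evict-first) hint via inline PTX, so a one-pass transfer buffer is less
    // likely to evict data that actually benefits from staying in L2.
    __global__ void streaming_copy(const float *src, float *dst, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float v;
        // Load with streaming hint: data is expected to be touched only once.
        asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(src + i));
        // Store with the same hint.
        asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(dst + i), "f"(v));
    }

    int main() {
        const int n = 1 << 20;
        float *src = nullptr, *dst = nullptr;
        cudaMalloc(&src, n * sizeof(float));
        cudaMalloc(&dst, n * sizeof(float));
        streaming_copy<<<(n + 255) / 256, 256>>>(src, dst, n);
        cudaDeviceSynchronize();
        printf("copy finished: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(src);
        cudaFree(dst);
        return 0;
    }

Either way, this is still ordinary CUDA with a few PTX escapes, not a from-scratch replacement of the toolchain.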
> with better AI models and tools like Cursor, we will move to a world where you can mold code ever more specific to your use case to make it more performant
what do you think the value of having the right abstraction will be in such a world?
No, I meant something else. As you said: we humans love clean abstractions. We love building on top of them.
Now LLMs are trained on data produced by us. So I wonder if they would also inherit this trait from us and end up loving good abstractions, and would find it easier to build on top of them.
The other possibility is that they end up move-37ing the whole abstraction shebang, and find that always building something bespoke from the low level up is better than constraining oneself to some general-purpose abstraction.
If code is only ever updated by an LLM, does it benefit from using abstractions? After all, they're really a tool for us lowly sapients, to aid in breaking down complex problems. Maybe LLMs will create their own class of abstractions, different from our own but useful for their task.
Ah gotcha. I think that with the new trend of RLing models, the move 37 may come sooner than we think -- just give the pretrained model some outcome goal, and the way it gets there may use low-level code without clean abstractions.