Sure. The performance-purist in me would be very doubtful about the result's opt...

littlestymaar · 2025-07-26T12:06:17 1753531577

Performance purists don't use CUDA either, though (that's why DeepSeek used PTX directly).

Everything is an abstraction, and choosing the right level of abstraction for your use case is a tradeoff between your engineering capacity and your performance needs.

LowLevelMahn · 2025-07-26T12:10:37 1753531837

This Rust demo also uses PTX directly:

  During the build, build.rs uses rustc_codegen_nvvm to compile the GPU kernel to PTX.
  The resulting PTX is embedded into the CPU binary as static data.
  The host code is compiled normally.

LegNeato · 2025-07-26T12:22:07 1753532527

To be more technically correct, we compile to NVVM IR and then use NVIDIA's NVVM to convert it to PTX.

saagarjha · 2025-07-27T11:43:49 1753616629

That’s not really the same thing; it compiles through PTX rather than using inline assembly.

LegNeato · 2025-08-01T01:40:14 1754012414

FYI, you can drop down into PTX if need be:

https://github.com/Rust-GPU/Rust-CUDA/blob/aa7e61512788cc702...

brandonpelfrey · 2025-07-26T13:39:43 1753537183

The issue in my mind is that this doesn’t seem to include any of the critical library functionality that is specific to, e.g., NVIDIA cards: think reduction operations across threads in a warp and similar. Some of those don’t exist in all hardware architectures. We may get to a point where everything could be written in one language, but actually leveraging the hardware correctly would still require a bunch of different implementations, one for each target architecture.

The fact that different hardware has different features is a good thing.

rbanffy · 2025-07-26T23:14:15 1753571655

Features that lack hardware support can fall back to software implementations.
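
To make that concrete, here is a rough sketch of what a software fallback for a warp-level sum reduction might look like, emulated in plain Rust on the CPU. Everything in it (`WARP_SIZE`, `shfl_down`, `warp_reduce_sum`) is a made-up name for illustration, not something from Rust-CUDA or any other library. On NVIDIA hardware the shuffle step would typically be a single warp shuffle instruction (CUDA's `__shfl_down_sync`) rather than a loop.

```rust
// Hypothetical software fallback for a warp-level sum reduction.
// All names here are illustrative; none come from Rust-CUDA or a real API.
const WARP_SIZE: usize = 32;

/// Emulates a "shuffle down": lane i reads the value held by lane i + delta.
/// Lanes that would read past the end of the warp keep their own value.
fn shfl_down(lanes: &[i32; WARP_SIZE], delta: usize) -> [i32; WARP_SIZE] {
    let mut out = *lanes;
    for i in 0..WARP_SIZE {
        if i + delta < WARP_SIZE {
            out[i] = lanes[i + delta];
        }
    }
    out
}

/// Tree reduction: the offset halves each step (16, 8, 4, 2, 1),
/// and after the last step lane 0 holds the sum of all lanes.
fn warp_reduce_sum(mut lanes: [i32; WARP_SIZE]) -> i32 {
    let mut delta = WARP_SIZE / 2;
    while delta > 0 {
        let shifted = shfl_down(&lanes, delta);
        for i in 0..WARP_SIZE {
            // Wrapping add so the sketch never panics on overflow.
            lanes[i] = lanes[i].wrapping_add(shifted[i]);
        }
        delta /= 2;
    }
    lanes[0]
}
```

Calling `warp_reduce_sum` on a warp whose lanes hold 1 through 32 returns 528. The emulation is obviously far slower than a hardware shuffle; the point is only that the same reduction interface can be kept on targets that lack the instruction.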

In any case, ideally, the level of abstraction would be higher, with little application logic requiring GPU architecture awareness.