Controversial opinion: I wish every GPU programming language and API did memory management like CUDA. In CUDA, you allocate and manage GPU memory with calls like cudaMalloc and cudaMemcpy, which behave exactly as their names suggest, and handles to GPU memory allocations are plain old pointers, just like regular pointers, except that they point into the GPU's address space. Basically, if you know how to deal with memory in C, you already know how to deal with GPU memory in CUDA.
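For anyone who hasn't used it, the whole workflow really is about this small (minimal sketch, error checking omitted):

    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        std::vector<float> host(1 << 20, 1.0f);
        const size_t bytes = host.size() * sizeof(float);

        // Allocate a chunk of GPU memory; d_data is an ordinary pointer,
        // it just happens to point into the device's address space.
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, bytes);

        // Copies behave exactly as the names suggest.
        cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(host.data(), d_data, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        return 0;
    }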
Contrast with Vulkan/GL/Metal/DirectX, where in order to manage GPU memory, you need to know about a whole diverse zoo of different buffer types and different rules for different buffers and there’s a billion different API calls to handle different cases… I just want a chunk of GPU memory that I can do whatever I want with!
I don't think it's controversial at all to wish everything was simple. That said...
> Basically if you know how to deal with memory in C, you already know how to deal with GPU memory in CUDA.
is not even remotely true. Using memory in CUDA is massively more complex than in C. The actual allocation and deallocation API isn't meaningfully different, no, but that's not where the story ends. The story ends with making sure memory accesses are GPU-friendly, meaning coalesced or at least regularly strided. It means adjusting the behavior of the L2 cache for the workload. It means optimizing how you actually get data to and from the GPU in the first place. And, last but certainly not least, cudaMalloc and cudaFree are much more expensive than malloc and free. Like, orders of magnitude more expensive.
Yes, you can do all that micro-optimization for CPUs too, of course. But for GPGPU it's actually super critical, or your performance is just dreadfully bad. Like, don't-even-bother-using-the-GPU-at-all bad.
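To make the coalescing point concrete, here's a rough sketch (not benchmarked): both kernels do the same arithmetic, but in the first one consecutive threads of a warp read consecutive addresses, while the second strides across memory, which typically wrecks effective bandwidth:

    // Coalesced: consecutive threads in a warp touch consecutive addresses,
    // so the hardware can service the warp with a few wide transactions.
    __global__ void coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    // Strided: each thread jumps `stride` elements apart, so one warp touches
    // many cache lines and effective bandwidth collapses.
    __global__ void strided(const float* in, float* out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i] * 2.0f;
    }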
Not to mention how insanely difficult heterogeneous memory management can get if you start needing coherence and/or relaxed atomics.
I have taken multiple swings at getting good at mixed computing and decided that in most cases it's better to just pick CPU or GPU; trying to use both at the same time increases the difficulty by orders of magnitude.
Partly it was second-system syndrome. OpenCL in particular thought it was going to be "better", particularly for hybrid programming and portability between CPU-only, GPU-only, and mixed. I personally think it was a failure, and not just because NVIDIA never cared to really push it.
DirectX and GL predated CUDA and already had opaque buffer allocation things (e.g., vertex buffers). Partly this was a function of limited fixed-function units, maximum sizes of frame buffers and texture dimensions, and so on.
But yes, CUDA had a memory model that wasn't necessarily "magic", but, just like regular malloc and free, it's pretty obvious what it does. (And you just live with pinned memory, host <=> device copies, and so on.)
I've switched to using cudaMallocManaged() exclusively. From what I can tell, there isn't much of a performance difference. A few cudaMemPrefetchAsync() calls at strategic places will remedy any performance problems. I also really love that you can just break in gdb and look around in that memory.
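Roughly what that looks like (sketch; scale is just a stand-in kernel, error checking omitted):

    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float* data = nullptr;
        cudaMallocManaged((void**)&data, bytes);     // one pointer, valid on host and device
        for (int i = 0; i < n; ++i) data[i] = 1.0f;  // touch it from the CPU like normal memory

        int device = 0;
        cudaGetDevice(&device);

        // Migrate the pages to the GPU before the kernel needs them,
        // instead of paying for on-demand page faults inside the kernel.
        cudaMemPrefetchAsync(data, bytes, device, 0);
        scale<<<(n + 255) / 256, 256>>>(data, n);

        // Prefetch back before the CPU reads the results (cudaCpuDeviceId is the host).
        cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();

        cudaFree(data);
        return 0;
    }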
Yeah, sorry if I was unclear: some folks thought that cudaMallocHost et al. and pinned memory were "impure", and that you should instead have a unified sense of "allocate" that could sometimes be host, sometimes device, and sometimes migrate.
The unified memory support in CUDA (originally intended for Denver, IIRC) is mostly a response to people finding it too hard to decide (a la mmap, really).
So it's not that CUDA doesn't have these. It's that it does, but many people never have to understand anything beyond "there's a thing called malloc, and there's host and device".
Sure, but pinned memory is often a limited resource and requires the GPU to issue PCI transactions. Depending on your needs, it's generally better to copy to/from the GPU explicitly, which can be done asynchronously, hiding the overhead behind other work to a degree.
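The usual pattern (sketch; process and the sizes are placeholders, error checking omitted) keeps a modest pinned staging buffer so the copies can actually run asynchronously and overlap with work on a stream:

    #include <cuda_runtime.h>
    #include <cstring>

    __global__ void process(float* buf, int n) { /* ... */ }

    void pipeline(const float* src, float* dst, int n) {
        const size_t bytes = n * sizeof(float);

        float* pinned = nullptr;
        cudaMallocHost((void**)&pinned, bytes);   // page-locked host staging buffer

        float* d_buf = nullptr;
        cudaMalloc((void**)&d_buf, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Because `pinned` is page-locked, these copies can run asynchronously via DMA,
        // so other streams/kernels can execute while the transfers are in flight.
        std::memcpy(pinned, src, bytes);
        cudaMemcpyAsync(d_buf, pinned, bytes, cudaMemcpyHostToDevice, stream);
        process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(pinned, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);
        std::memcpy(dst, pinned, bytes);

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(pinned);
    }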
The third system, WebGPU, solves the memory management problem. Another thing cuda does it gives you is a convenient way to describe how to share data between cpu and gpu. No good solution for this yet, I'm hoping for some procedural rust macro.
That's primarily because Vulkan is a very low-level, but still portable, API. Different GPUs with different architectures provide memory heaps with different functional and performance characteristics (device local yes/no, host visible yes/no, host cached yes/no, host coherent yes/no), often even within the same device. For example, my Radeon RX 550 has access to 3 heaps: 1.75 GiB device local, 256 MiB device local + host visible + host coherent, 3 GiB host visible + host coherent + [optional] host cached. The last heap can be allocated either host cached or uncached; access from the GPU is presumably faster if it's treated as uncached.
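In practice that means every allocation goes through a memory-type lookup against the resource's requirements. A minimal sketch of the usual helper (findMemoryType is my name for it, not a Vulkan function):

    #include <vulkan/vulkan.h>
    #include <cstdint>

    // Pick a memory type whose properties include everything the caller asked for
    // (e.g. DEVICE_LOCAL, or HOST_VISIBLE | HOST_COHERENT), restricted to the types
    // allowed by the resource's VkMemoryRequirements::memoryTypeBits.
    uint32_t findMemoryType(VkPhysicalDevice gpu,
                            uint32_t allowedTypeBits,
                            VkMemoryPropertyFlags wanted) {
        VkPhysicalDeviceMemoryProperties props{};
        vkGetPhysicalDeviceMemoryProperties(gpu, &props);

        for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
            const bool allowed  = (allowedTypeBits & (1u << i)) != 0;
            const bool hasFlags = (props.memoryTypes[i].propertyFlags & wanted) == wanted;
            if (allowed && hasFlags)
                return i;
        }
        return UINT32_MAX;  // no suitable type; caller falls back or fails
    }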
This gives the maximum amount of control over the performance/convenience tradeoff. If you want more convenience, then you can use a library or engine that handles the details for you. It's the same for CUDA: you get the convenience of malloc(), but you potentially lose a bit of performance.
To the contrary, this is my biggest complaint with CUDA. It's a nice-seeming abstraction from a programmer's perspective; unfortunately it's not a good abstraction, because it doesn't match the reality of the hardware. The truth is that there is a heterogeneous arena of memory kinds, access patterns, and sizes that gives you drastically different performance trade-offs for different tasks. (This isn't even about diverse hardware; every modern GPU has this complexity.) CUDA oversimplifies this, which gives solutions opaque performance cliffs, and you end up having to understand a lot of the underlying complexity anyway and then awkwardly back-propagate it up into CUDA's ill-fitting APIs to get decent performance. It's a false sense of simplicity that ends up causing more work and complexity.
Contrast that with something like WebGPU, where the notions of GPU buffers, textures, pipelines, and command queues map well onto what actually happens in hardware, and it's much simpler to get predictable performance.
Now I totally agree there needs to be more work done to provide simpler abstractions on top of WebGPU/Vulkan/Metal/DirectX for certain common patterns of work. But pretending you have a pointer to a blob of GPU memory isn't the way.
I'm no CUDA expert, but from what I recall you do need to use different types of memory (global, shared, constant, ...) to write efficient code, or your code might end up being very slow.
And then you have different card models: some share RAM with the host, some don't. Some can process AND transfer data at the same time, some can't. Some can transfer only one way while processing. All of this must be accounted for while programming.
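For the memory-types point above, a rough illustration (sketch): cudaMalloc hands out global memory, which is big but high-latency, while __shared__ is a small, fast scratchpad shared by the threads of one block; staging reused data through it is one of the classic optimizations.

    // Assumes it is launched with blockDim.x == 256.
    __global__ void blockSum(const float* in, float* out, int n) {
        __shared__ float tile[256];                    // fast per-block scratchpad

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();

        // Tree reduction inside shared memory instead of global memory.
        for (int step = blockDim.x / 2; step > 0; step /= 2) {
            if (threadIdx.x < step)
                tile[threadIdx.x] += tile[threadIdx.x + step];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];                 // one partial sum per block
    }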
That's it: an array-style declaration. Thanks to RAII, this amounts to malloc() when the constructor is called, and free() when the destructor is called.
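Something along the lines of this hypothetical wrapper over the raw CUDA calls (DeviceArray is a made-up name, not a real library type):

    #include <cuda_runtime.h>
    #include <cstddef>

    // cudaMalloc in the constructor, cudaFree in the destructor, so the device
    // allocation follows normal C++ object lifetime.
    template <typename T>
    class DeviceArray {
    public:
        explicit DeviceArray(std::size_t count) : count_(count) {
            cudaMalloc((void**)&ptr_, count * sizeof(T));
        }
        ~DeviceArray() { cudaFree(ptr_); }

        DeviceArray(const DeviceArray&) = delete;             // no accidental double-free
        DeviceArray& operator=(const DeviceArray&) = delete;

        T* get() const { return ptr_; }                       // raw device pointer for kernels
        std::size_t size() const { return count_; }

    private:
        T* ptr_ = nullptr;
        std::size_t count_ = 0;
    };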
A proper C++ language, with proper RAII and integration, would be best. C++ AMP played around a little bit with some of these concepts, but it seems to have died off around 2014.
There's also array_view, which serves as an automagic pointer that copies data "when the system determines it needs to be copied" (using a C++ future-like mechanism to block when the CPU side accesses data before the GPU side is done writing, and vice versa).
"Array" in C++Amp is similar to cudaMalloc. array_view is pretty nifty overall, and I like the idea of transfers being abstracted away by blocking / futures (besides: CUDA itself has blocking calls to kernels, so its not a completely alien concept).
High-speed CPU/GPU interaction would be hampered by array_view (too much automagic), but having a simpler interface and less boilerplate code is a good idea IMO.
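From memory, the array_view flavor looked roughly like this (MSVC-only, and C++ AMP has since been deprecated):

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    int main() {
        std::vector<float> host(1024, 1.0f);

        // array_view wraps the host data; copies to and from the accelerator
        // happen implicitly when either side actually uses the view.
        array_view<float, 1> view(static_cast<int>(host.size()), host);

        parallel_for_each(view.extent, [=](index<1> i) restrict(amp) {
            view[i] *= 2.0f;
        });

        view.synchronize();   // blocks until results are copied back into `host`
        return 0;
    }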
Is this under another namespace or something? std::array<T, Size> doesn't call malloc or free at all. Its entire reason for being is that it provides a fixed-sized type which models a RandomAccessContainer while being allocated inline to the parent object.
I've only touched OpenGL very briefly, but isn't this mostly explained by the fact that CUDA is for general purpose computing and not hardware rendering? If you're mostly dealing with vertex/texture buffers, then presumably you'll need to dress up CUDA with a bit of custom boilerplate, at which point it starts looking like the OpenGL approach anyway.
On older GPUs, that was the case. But ever since the unified shader model became a thing, all the various buffers except textures are just generic chunks of memory, similar to what CUDA deals with. In fact, the 'modern' OpenGL approach is basically to tie together various pieces to get close to CUDA's model, where you deal only with generic chunks of memory and raw pointers to them (except for textures). The main issue is that it's a lot more boilerplate than CUDA's straightforward malloc-style calls.
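For example (rough sketch, GL 4.4+ with buffer storage; makeGenericBuffer is a made-up helper name): one generic buffer, persistently mapped so the CPU keeps a raw pointer into it for the buffer's whole lifetime.

    // Assumes a GL 4.4 context and a loader (glad/GLEW/...) already set up.
    GLuint makeGenericBuffer(GLsizeiptr bytes, void** cpuPtr) {
        GLuint buf = 0;
        glGenBuffers(1, &buf);
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, buf);
        glBufferStorage(GL_SHADER_STORAGE_BUFFER, bytes, nullptr,
                        GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
        *cpuPtr = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, bytes,
                                   GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
        return buf;
    }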
The GP included Metal & Vulkan in that list, though. But at least in the case of Vulkan you do just have a single vkCreateBuffer for non-texture data. There's a usage bitmask, but that's more about ensuring you are told the right alignment and padding requirements.
CUDA, meanwhile, makes you just sort of "know" that from reading the docs. CUDA's model isn't simpler or friendlier here; it's the same complexity, "just" moved from the API into the documentation. At least, that's the case if you want to get anywhere close to the performance the hardware is capable of.
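For reference, the Vulkan side really is one creation call plus a query for the alignment/padding info (sketch; makeBuffer is a made-up helper):

    #include <vulkan/vulkan.h>

    VkBuffer makeBuffer(VkDevice device, VkDeviceSize bytes, VkBufferUsageFlags usage,
                        VkMemoryRequirements* reqsOut) {
        VkBufferCreateInfo info{};
        info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
        info.size = bytes;
        info.usage = usage;                          // e.g. STORAGE | TRANSFER_DST
        info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

        VkBuffer buffer = VK_NULL_HANDLE;
        vkCreateBuffer(device, &info, nullptr, &buffer);
        vkGetBufferMemoryRequirements(device, buffer, reqsOut);  // size, alignment, memoryTypeBits
        return buffer;
    }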
Explicit GPU pointers, instead of opaque structures from the programmer's view, only became part of core in Vulkan 1.2 (VK_KHR_buffer_device_address).
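With that feature, getting a raw address is a single query, as long as the buffer was created with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT and its memory allocated with VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT (sketch; gpuAddressOf is a made-up helper name):

    #include <vulkan/vulkan.h>

    // Returns a 64-bit GPU address that shaders can treat much like a
    // CUDA-style device pointer.
    VkDeviceAddress gpuAddressOf(VkDevice device, VkBuffer buffer) {
        VkBufferDeviceAddressInfo info{};
        info.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
        info.buffer = buffer;
        return vkGetBufferDeviceAddress(device, &info);
    }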
And for Metal, the GPU address of a buffer only becomes exposed in Metal 3, which is going to be released this fall with iOS/iPadOS 16 and macOS Ventura.