Controversial opinion: I wish every GPU programming language and API did memory management like CUDA. In CUDA, you allocate and manage GPU memory with calls like cudaMalloc and cudaMemcpy, which behave exactly as their names suggest, and handles to GPU memory allocations are plain old pointers, just like regular pointers, except that they point into the GPU's address space. Basically, if you know how to deal with memory in C, you already know how to deal with GPU memory in CUDA.
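For anyone who hasn't used it, the whole workflow really is about this small (minimal sketch, error checking omitted):

    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        std::vector<float> host(1 << 20, 1.0f);
        const size_t bytes = host.size() * sizeof(float);

        // Allocate a chunk of GPU memory; d_data is an ordinary pointer,
        // it just happens to point into the device's address space.
        float* d_data = nullptr;
        cudaMalloc((void**)&d_data, bytes);

        // Copies behave exactly as the names suggest.
        cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(host.data(), d_data, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        return 0;
    }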
Contrast with Vulkan/GL/Metal/DirectX, where in order to manage GPU memory, you need to know about a whole diverse zoo of different buffer types and different rules for different buffers and there’s a billion different API calls to handle different cases… I just want a chunk of GPU memory that I can do whatever I want with!
I don't think it's controversial at all to wish everything was simple. That said...
> Basically if you know how to deal with memory in C, you already know how to deal with GPU memory in CUDA.
is not even remotely true. Using memory in CUDA is massively more complex than in C. The actual allocation and deallocation API isn't meaningfully different, no, but that's not where the story ends. The story ends with making sure memory accesses are GPU-friendly, meaning coalesced or at least regularly strided. It means adjusting the behavior of the L2 cache for the workload. It means optimizing how you actually get data to and from the GPU in the first place. And, last but certainly not least, cudaMalloc and cudaFree are much more expensive than malloc and free. Like, orders of magnitude more expensive.
Yes, you can do all that micro-optimization for CPUs too, of course. But for GPGPU it's actually super critical, or your performance is just dreadfully bad. Like, don't-even-bother-using-the-GPU-at-all bad.
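To make the coalescing point concrete, here's a rough sketch (not benchmarked): both kernels do the same arithmetic, but in the first one consecutive threads of a warp read consecutive addresses, while the second strides across memory, which typically wrecks effective bandwidth:

    // Coalesced: consecutive threads in a warp touch consecutive addresses,
    // so the hardware can service the warp with a few wide transactions.
    __global__ void coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    // Strided: each thread jumps `stride` elements apart, so one warp touches
    // many cache lines and effective bandwidth collapses.
    __global__ void strided(const float* in, float* out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i] * 2.0f;
    }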
Not to mention how insanely difficult heterogeneous memory management can get if you start needing coherence and/or relaxed atomics.
I have taken multiple swings at getting good at mixed computing and decided that in most cases it's better to just pick CPU or GPU; trying to use both at the same time increases the difficulty by orders of magnitude.
Partly it was second-system syndrome. OpenCL in particular thought it was going to be "better", particularly for hybrid programming and portability between CPU-only, GPU-only, and mixed. I personally think it was a failure, and not just because NVIDIA never cared to really push it.
DirectX and GL predated CUDA and already had opaque buffer allocation things (e.g., vertex buffers). Partly this was a function of limited fixed-function units, maximum sizes of frame buffers and texture dimensions, and so on.
But yes, CUDA had a memory model that wasn't necessarily "magic", but, just like regular malloc and free, it's pretty obvious what it does. (And you just live with pinned memory, host <=> device copies, and so on.)
I've switched to using cudaMallocManaged() exclusively. From what I can tell, there isn't much of a performance difference. A few cudaMemPrefetchAsync() calls at strategic places will remedy any performance problems. I also really love that you can just break in gdb and look around in that memory.
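Roughly what that looks like (sketch; scale is just a stand-in kernel, error checking omitted):

    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float* data = nullptr;
        cudaMallocManaged((void**)&data, bytes);     // one pointer, valid on host and device
        for (int i = 0; i < n; ++i) data[i] = 1.0f;  // touch it from the CPU like normal memory

        int device = 0;
        cudaGetDevice(&device);

        // Migrate the pages to the GPU before the kernel needs them,
        // instead of paying for on-demand page faults inside the kernel.
        cudaMemPrefetchAsync(data, bytes, device, 0);
        scale<<<(n + 255) / 256, 256>>>(data, n);

        // Prefetch back before the CPU reads the results (cudaCpuDeviceId is the host).
        cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();

        cudaFree(data);
        return 0;
    }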
Yeah, sorry if I was unclear: some folks thought that cudaMallocHost et al. and pinned memory were "impure", and that you should instead have a unified sense of "allocate" that could sometimes be host, sometimes device, and sometimes migrate.
The unified memory support in CUDA (originally intended for Denver, IIRC) is mostly a response to people finding it too hard to decide (a la mmap, really).
So it's not that CUDA doesn't have these. It's that it does, but many people never have to understand anything beyond "there's a thing called malloc, and there's host and device".
Sure, but pinned memory is often a limited resource and requires the GPU to issue PCI transactions. Depending on your needs, it's generally better to copy to/from the GPU explicitly, which can be done asynchronously, hiding the overhead behind other work to a degree.
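The usual pattern (sketch; process and the sizes are placeholders, error checking omitted) keeps a modest pinned staging buffer so the copies can actually run asynchronously and overlap with work on a stream:

    #include <cuda_runtime.h>
    #include <cstring>

    __global__ void process(float* buf, int n) { /* ... */ }

    void pipeline(const float* src, float* dst, int n) {
        const size_t bytes = n * sizeof(float);

        float* pinned = nullptr;
        cudaMallocHost((void**)&pinned, bytes);   // page-locked host staging buffer

        float* d_buf = nullptr;
        cudaMalloc((void**)&d_buf, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Because `pinned` is page-locked, these copies can run asynchronously via DMA,
        // so other streams/kernels can execute while the transfers are in flight.
        std::memcpy(pinned, src, bytes);
        cudaMemcpyAsync(d_buf, pinned, bytes, cudaMemcpyHostToDevice, stream);
        process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(pinned, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);
        std::memcpy(dst, pinned, bytes);

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(pinned);
    }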
The third system, WebGPU, solves the memory management problem. Another thing cuda does it gives you is a convenient way to describe how to share data between cpu and gpu. No good solution for this yet, I'm hoping for some procedural rust macro.
That's primarily because Vulkan is a very low-level, but still portable, API. Different GPUs with different architectures provide memory heaps with different functional and performance characteristics (device local yes/no, host visible yes/no, host cached yes/no, host coherent yes/no), often even within the same device. For example, my Radeon RX 550 has access to 3 heaps: 1.75 GiB device local, 256 MiB device local + host visible + host coherent, 3 GiB host visible + host coherent + [optional] host cached. The last heap can be allocated either host cached or uncached; access from the GPU is presumably faster if it's treated as uncached.
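In practice that means every allocation goes through a memory-type lookup against the resource's requirements. A minimal sketch of the usual helper (findMemoryType is my name for it, not a Vulkan function):

    #include <vulkan/vulkan.h>
    #include <cstdint>

    // Pick a memory type whose properties include everything the caller asked for
    // (e.g. DEVICE_LOCAL, or HOST_VISIBLE | HOST_COHERENT), restricted to the types
    // allowed by the resource's VkMemoryRequirements::memoryTypeBits.
    uint32_t findMemoryType(VkPhysicalDevice gpu,
                            uint32_t allowedTypeBits,
                            VkMemoryPropertyFlags wanted) {
        VkPhysicalDeviceMemoryProperties props{};
        vkGetPhysicalDeviceMemoryProperties(gpu, &props);

        for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
            const bool allowed  = (allowedTypeBits & (1u << i)) != 0;
            const bool hasFlags = (props.memoryTypes[i].propertyFlags & wanted) == wanted;
            if (allowed && hasFlags)
                return i;
        }
        return UINT32_MAX;  // no suitable type; caller falls back or fails
    }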
This gives the maximum amount of control over the performance/convenience tradeoff. If you want more convenience, then you can use a library or engine that handles the details for you. It's the same for CUDA: you get the convenience of malloc(), but you potentially lose a bit of performance.
To the contrary, this is my biggest complaint with CUDA. It's a nice-seeming abstraction from a programmer's perspective; unfortunately it's not a good abstraction, because it doesn't match the reality of the hardware. The truth is that there is a heterogeneous arena of memory kinds, access patterns, and sizes that gives you drastically different performance trade-offs for different tasks. (This isn't even about diverse hardware; every modern GPU has this complexity.) CUDA oversimplifies this, which gives solutions opaque performance cliffs, and you end up having to understand a lot of the underlying complexity anyway and then awkwardly back-propagate it up into CUDA's ill-fitting APIs to get decent performance. It's a false sense of simplicity that ends up causing more work and complexity.
Contrast that with something like WebGPU, where the notions of GPU buffers, textures, pipelines, and command queues map well onto what actually happens in hardware, and it's much simpler to get predictable performance.
Now I totally agree there needs to be more work done to provide simpler abstractions on top of WebGPU/Vulkan/Metal/DirectX for certain common patterns of work. But pretending you have a pointer to a blob of GPU memory isn't the way.
I'm no CUDA expert, but from what I recall you do need to use different types of memory (global, shared, constant, ...) to write efficient code, or your code might end up being very slow.
And then you have different card models: some share RAM with the host, some don't. Some can process AND transfer data at the same time, some can't. Some can transfer only one way while processing. All of this must be accounted for while programming.
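For the memory-types point above, a rough illustration (sketch): cudaMalloc hands out global memory, which is big but high-latency, while __shared__ is a small, fast scratchpad shared by the threads of one block; staging reused data through it is one of the classic optimizations.

    // Assumes it is launched with blockDim.x == 256.
    __global__ void blockSum(const float* in, float* out, int n) {
        __shared__ float tile[256];                    // fast per-block scratchpad

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();

        // Tree reduction inside shared memory instead of global memory.
        for (int step = blockDim.x / 2; step > 0; step /= 2) {
            if (threadIdx.x < step)
                tile[threadIdx.x] += tile[threadIdx.x + step];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];                 // one partial sum per block
    }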
That's it: an array-style declaration. Thanks to RAII, this amounts to malloc() when the constructor is called, and free() when the destructor is called.
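Something along the lines of this hypothetical wrapper over the raw CUDA calls (DeviceArray is a made-up name, not a real library type):

    #include <cuda_runtime.h>
    #include <cstddef>

    // cudaMalloc in the constructor, cudaFree in the destructor, so the device
    // allocation follows normal C++ object lifetime.
    template <typename T>
    class DeviceArray {
    public:
        explicit DeviceArray(std::size_t count) : count_(count) {
            cudaMalloc((void**)&ptr_, count * sizeof(T));
        }
        ~DeviceArray() { cudaFree(ptr_); }

        DeviceArray(const DeviceArray&) = delete;             // no accidental double-free
        DeviceArray& operator=(const DeviceArray&) = delete;

        T* get() const { return ptr_; }                       // raw device pointer for kernels
        std::size_t size() const { return count_; }

    private:
        T* ptr_ = nullptr;
        std::size_t count_ = 0;
    };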
A proper C++ language, with proper RAII and integration, would be best. C++ AMP played around a little bit with some of these concepts, but it seems to have died off around 2014.
There's also array_view, which serves as an automagic pointer that copies data "when the system determines it needs to be copied" (using a C++ future-like mechanism to block when the CPU side accesses data before the GPU side is done writing, and vice versa).
"Array" in C++Amp is similar to cudaMalloc. array_view is pretty nifty overall, and I like the idea of transfers being abstracted away by blocking / futures (besides: CUDA itself has blocking calls to kernels, so its not a completely alien concept).
High-speed CPU/GPU interaction would be hampered by array_view (too much automagic), but having a simpler interface and less boilerplate code is a good idea IMO.
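From memory, the array_view flavor looked roughly like this (MSVC-only, and C++ AMP has since been deprecated):

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    int main() {
        std::vector<float> host(1024, 1.0f);

        // array_view wraps the host data; copies to and from the accelerator
        // happen implicitly when either side actually uses the view.
        array_view<float, 1> view(static_cast<int>(host.size()), host);

        parallel_for_each(view.extent, [=](index<1> i) restrict(amp) {
            view[i] *= 2.0f;
        });

        view.synchronize();   // blocks until results are copied back into `host`
        return 0;
    }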
Is this under another namespace or something? std::array<T, Size> doesn't call malloc or free at all. Its entire reason for being is that it provides a fixed-sized type which models a RandomAccessContainer while being allocated inline to the parent object.
I've only touched OpenGL very briefly, but isn't this mostly explained by the fact that CUDA is for general purpose computing and not hardware rendering? If you're mostly dealing with vertex/texture buffers, then presumably you'll need to dress up CUDA with a bit of custom boilerplate, at which point it starts looking like the OpenGL approach anyway.
On older GPUs, that was the case. But ever since the unified shader model became a thing, all the various buffers except textures are just generic chunks of memory, similar to what CUDA deals with. In fact, the 'modern' OpenGL approach is basically to tie together various pieces to get close to CUDA's model, where you deal only with generic chunks of memory and raw pointers to them (except for textures). The main issue is that it's a lot more boilerplate than CUDA's straightforward malloc-style calls.
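For example (rough sketch, GL 4.4+ with buffer storage; makeGenericBuffer is a made-up helper name): one generic buffer, persistently mapped so the CPU keeps a raw pointer into it for the buffer's whole lifetime.

    // Assumes a GL 4.4 context and a loader (glad/GLEW/...) already set up.
    GLuint makeGenericBuffer(GLsizeiptr bytes, void** cpuPtr) {
        GLuint buf = 0;
        glGenBuffers(1, &buf);
        glBindBuffer(GL_SHADER_STORAGE_BUFFER, buf);
        glBufferStorage(GL_SHADER_STORAGE_BUFFER, bytes, nullptr,
                        GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
        *cpuPtr = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, bytes,
                                   GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
        return buf;
    }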
The GP included Metal & Vulkan in that list, though. But at least in the case of Vulkan you do just have a single vkCreateBuffer for non-texture data. There's a usage bitmask, but that's more about ensuring you are told the right alignment and padding requirements.
CUDA, meanwhile, makes you just sort of "know" that from reading the docs. CUDA's model isn't simpler or friendlier here; it's the same complexity, "just" moved from the API into the documentation. At least, that's the case if you want to get anywhere close to the performance the hardware is capable of.
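For reference, the Vulkan side really is one creation call plus a query for the alignment/padding info (sketch; makeBuffer is a made-up helper):

    #include <vulkan/vulkan.h>

    VkBuffer makeBuffer(VkDevice device, VkDeviceSize bytes, VkBufferUsageFlags usage,
                        VkMemoryRequirements* reqsOut) {
        VkBufferCreateInfo info{};
        info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
        info.size = bytes;
        info.usage = usage;                          // e.g. STORAGE | TRANSFER_DST
        info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

        VkBuffer buffer = VK_NULL_HANDLE;
        vkCreateBuffer(device, &info, nullptr, &buffer);
        vkGetBufferMemoryRequirements(device, buffer, reqsOut);  // size, alignment, memoryTypeBits
        return buffer;
    }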
Explicit GPU pointers, instead of opaque structures from the programmer's view, only became part of core in Vulkan 1.2 (VK_KHR_buffer_device_address).
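With that feature, getting a raw address is a single query, as long as the buffer was created with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT and its memory allocated with VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT (sketch; gpuAddressOf is a made-up helper name):

    #include <vulkan/vulkan.h>

    // Returns a 64-bit GPU address that shaders can treat much like a
    // CUDA-style device pointer.
    VkDeviceAddress gpuAddressOf(VkDevice device, VkBuffer buffer) {
        VkBufferDeviceAddressInfo info{};
        info.sType = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
        info.buffer = buffer;
        return vkGetBufferDeviceAddress(device, &info);
    }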
And for Metal, the GPU address of a buffer only becomes exposed in Metal 3, which is going to be released this fall with iOS/iPadOS 16 and macOS Ventura.