Partly it was second-system syndrome. OpenCL in particular thought it was going to be "better", particularly for hybrid programming and portability between "CPU only, GPU only, and mixed". I personally consider it a failure, and not just because NVIDIA never cared to really push it.
DirectX and GL predated CUDA and already had opaque buffer allocation things (e.g., vertex buffers). Partly this was a function of limited fixed-function units, maximum sizes of frame buffers and texture dimensions, and so on.
But yes, CUDA had a memory model that wasn't necessarily "magic": it worked just like regular malloc and free, so it's pretty obvious what it does. (And you just live with pinned memory, host <=> device copies, and so on.)
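For concreteness, roughly what that malloc-and-copy model looks like; this is just a sketch, and the scale() kernel, sizes, and launch configuration are made up for illustration:

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Placeholder kernel: double every element.
    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        float *h = (float *)malloc(bytes);            // plain host allocation
        for (int i = 0; i < N; ++i) h[i] = (float)i;

        float *d;
        cudaMalloc((void **)&d, bytes);               // device allocation, malloc-style
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

        scale<<<(N + 255) / 256, 256>>>(d, N);

        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        printf("h[42] = %f\n", h[42]);

        cudaFree(d);                                  // free, malloc-style
        free(h);
        return 0;
    }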
I've switched to using cudaMallocManaged() exclusively. From what I can tell there isn't much of a performance difference. A few cudaMemPrefetchAsync() calls at strategic places will remedy any performance problems. I really love that you can just break in gdb and look around in that memory as well.
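Roughly the same example with managed memory plus a prefetch, assuming a single GPU (device 0); the kernel is the same placeholder as above:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        float *data;
        cudaMallocManaged((void **)&data, bytes);     // one pointer, valid on host and device
        for (int i = 0; i < N; ++i) data[i] = (float)i;

        // Optional: prefetch to the GPU before the launch to avoid fault-driven migration.
        cudaMemPrefetchAsync(data, bytes, 0 /* device 0 */, 0 /* default stream */);
        scale<<<(N + 255) / 256, 256>>>(data, N);

        // Prefetch back to the host before touching it on the CPU (or just let it fault).
        cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();

        printf("data[42] = %f\n", data[42]);          // same pointer is readable here (and from gdb)
        cudaFree(data);
        return 0;
    }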
Yeah, sorry if I was unclear: some folks thought that cudaMallocHost et al. and pinned memory were "impure". That you should instead have a unified sense of "allocate", and that it could sometimes be host, sometimes device, sometimes migrate.
The unified memory support in CUDA (originally intended for Denver, IIRC) is mostly a response to people finding it too hard to decide (a la mmap, really).
So it's not that CUDA doesn't have these. It's that it does, but many people never have to understand anything beyond "there's a thing called malloc, and there's host and device".
Sure, but pinned memory is often a limited resource and requires the GPU to issue PCI transactions. Depending on your needs, it's generally better to copy to/from the GPU explicitly, which can be done asynchronously, hiding the overhead behind other work to a degree.
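A rough sketch of that explicit, asynchronous style, again with a placeholder kernel: a pinned host buffer plus cudaMemcpyAsync on a non-default stream, so the transfers and the kernel are queued up and the host (or other streams) can keep doing useful work in the meantime:

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        float *h;                                     // pinned (page-locked) host memory,
        cudaMallocHost((void **)&h, bytes);           // required for truly async copies
        for (int i = 0; i < N; ++i) h[i] = (float)i;

        float *d;
        cudaMalloc((void **)&d, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copies and the kernel are queued on one stream and run asynchronously
        // with respect to the host.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        scale<<<(N + 255) / 256, 256, 0, stream>>>(d, N);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

        // ... do unrelated host work here ...

        cudaStreamSynchronize(stream);                // wait only when the result is needed

        cudaStreamDestroy(stream);
        cudaFree(d);
        cudaFreeHost(h);                              // pinned memory has its own free
        return 0;
    }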
The third system, WebGPU, solves the memory management problem. Another thing CUDA gives you is a convenient way to describe how to share data between CPU and GPU. There's no good solution for that yet; I'm hoping for some procedural Rust macro.