
Partly it was second system syndrome. OpenCL in particular thought it was going to be "better", particularly for hybrid programming and portability between "CPU only, GPU only, and mixed". I personally consider it a failure, and not just because NVIDIA never cared to really push it.

DirectX and GL predated CUDA and already had opaque buffer allocation things (e.g., vertex buffers). Partly this was a function of limited fixed-function units, maximum sizes of frame buffers and texture dimensions, and so on.

But yes, CUDA had a memory model that wasn't "magic": just like regular malloc and free, it's pretty obvious what it does. (And you just live with pinned memory, host <=> device copies, and so on.)
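
To make that concrete, here's a minimal sketch of that explicit style (illustrative only; error checking omitted, and the kernel is just a placeholder):

    // Classic explicit model: allocate on device, copy over, run, copy back.
    #include <cuda_runtime.h>
    #include <vector>

    __global__ void scale(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> host(n, 1.0f);

        float* dev = nullptr;
        cudaMalloc(&dev, n * sizeof(float));            // device "malloc"
        cudaMemcpy(dev, host.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);             // explicit host -> device copy
        scale<<<(n + 255) / 256, 256>>>(dev, n);
        cudaMemcpy(host.data(), dev, n * sizeof(float),
                   cudaMemcpyDeviceToHost);             // explicit device -> host copy
        cudaFree(dev);                                  // device "free"
        return 0;
    }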



I think CUDA even lets you allocate pinned host memory now - cudaMallocHost (or cudaHostAlloc) - so you can skip the double-copy through a staging buffer
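
A rough sketch of how that's used (not authoritative; error checking omitted):

    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_pinned = nullptr, *d_buf = nullptr;
        cudaMallocHost(&h_pinned, bytes);   // page-locked (pinned) host allocation
        cudaMalloc(&d_buf, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        for (size_t i = 0; i < n; ++i) h_pinned[i] = 1.0f;

        // The driver can DMA straight out of the pinned buffer: no hidden
        // staging copy, and the call can actually overlap with host work.
        cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_pinned);
        return 0;
    }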


CUDA provides a tier significantly above that: unified memory.

See: https://on-demand.gputechconf.com/gtc/2017/presentation/s728...

And: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

However, unified memory support in the Windows driver infrastructure is much further behind, offering only the pre-Pascal feature set.

For the full feature set you'll have to use Linux. Note that WSL2 counts as Windows here; it's a limitation of the Windows driver infrastructure.


I've switched to using cudaMallocManaged() exclusively. From what I can tell there isn't much of a performance difference, and a few cudaMemPrefetchAsync() calls at strategic places will remedy any performance problems. I really love that you can just break in gdb and look around in that memory as well.
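
Roughly what that usage looks like (a sketch, not the parent's actual code):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void inc(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 20;
        float* data = nullptr;
        cudaMallocManaged(&data, n * sizeof(float));  // one pointer, visible to CPU and GPU

        for (int i = 0; i < n; ++i) data[i] = 0.0f;   // touch it from the host like normal memory

        int dev = 0;
        cudaGetDevice(&dev);
        // Optional: migrate the pages to the GPU up front instead of faulting on demand.
        cudaMemPrefetchAsync(data, n * sizeof(float), dev, 0);

        inc<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();

        printf("%f\n", data[0]);                      // read it back on the host, same pointer
        cudaFree(data);
        return 0;
    }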


Unified memory is just _different_, not above or below. It offers on-demand paging, but that sometimes comes at a cost in memory I/O speed.


It's a feature tier above, with much more emphasis on ease of use from the programmer's perspective.

It also allows for significantly more approachable programming models. For example: https://developer.nvidia.com/blog/accelerating-standard-c-wi...


Yeah, sorry if I was unclear: some folks thought that cudaMallocHost et al. and pinned memory were "impure", and that you should instead have a single unified notion of "allocate" where the memory could sometimes be host, sometimes device, and sometimes migrate.

The unified memory support in CUDA (originally intended for Denver, IIRC) is mostly a response to people finding it too hard to decide (a la mmap, really).

So it's not that CUDA doesn't have these. It's that it does, but many people never have to understand anything beyond "there's a thing called malloc, and there's host and device".


Sure, but pinned memory is often a limited resource, and accessing it directly from the GPU requires PCIe transactions. Depending on your needs, it's often better to copy to/from the GPU explicitly, which can be done asynchronously, hiding the overhead behind other work to a degree.
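
A sketch of that pattern: split the work into chunks and pipeline copies against kernels across streams (pinned host memory assumed, since that's what makes the async copies actually asynchronous; the kernel is a placeholder):

    #include <cuda_runtime.h>

    __global__ void work(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f + 1.0f;
    }

    int main() {
        const int n = 1 << 22;
        const int chunks = 4;
        const int chunk = n / chunks;
        const size_t chunkBytes = chunk * sizeof(float);

        float *h = nullptr, *d = nullptr;
        cudaMallocHost(&h, n * sizeof(float));   // pinned, so the async copies really are async
        cudaMalloc(&d, n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        cudaStream_t streams[chunks];
        for (int s = 0; s < chunks; ++s) cudaStreamCreate(&streams[s]);

        // Pipeline: while one chunk is being computed, another chunk's copy is in flight.
        for (int s = 0; s < chunks; ++s) {
            float* hptr = h + s * chunk;
            float* dptr = d + s * chunk;
            cudaMemcpyAsync(dptr, hptr, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
            work<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dptr, chunk);
            cudaMemcpyAsync(hptr, dptr, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < chunks; ++s) cudaStreamDestroy(streams[s]);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }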


In CUDA, some transfers involving pageable host memory are completely synchronous from the perspective of the host, even if you use `cudaMemcpyAsync`:

https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behav...

Pinned memory is typically used to get around the synchronization aspects.
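
A small illustration of the difference (sketch only; per the sync-behavior doc above, the pageable call effectively blocks the host until the source has been staged, while the pinned one just enqueues the DMA and returns):

    #include <cuda_runtime.h>
    #include <cstdlib>
    #include <cstring>

    int main() {
        const size_t bytes = 64 << 20;
        float* d = nullptr;
        cudaMalloc(&d, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Pageable host buffer: the "async" copy is staged through an internal
        // pinned buffer, and the call does not return until the source has been
        // consumed, so the host waits.
        float* pageable = (float*)malloc(bytes);
        memset(pageable, 0, bytes);
        cudaMemcpyAsync(d, pageable, bytes, cudaMemcpyHostToDevice, stream);

        // Pinned host buffer: the same call just enqueues the DMA and returns,
        // so the host can do other work until it synchronizes the stream.
        float* pinned = nullptr;
        cudaMallocHost(&pinned, bytes);
        memset(pinned, 0, bytes);
        cudaMemcpyAsync(d, pinned, bytes, cudaMemcpyHostToDevice, stream);

        cudaStreamSynchronize(stream);
        cudaFreeHost(pinned);
        free(pageable);
        cudaFree(d);
        cudaStreamDestroy(stream);
        return 0;
    }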


The third system, WebGPU, solves the memory management problem. But another thing CUDA gives you is a convenient way to describe how data is shared between CPU and GPU; there's no good solution for that in WebGPU yet. I'm hoping for a procedural Rust macro.
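
For contrast, here's roughly what "describing shared data" looks like in CUDA's single-source model: one struct definition compiled for both host and device, so the layout agrees on both sides by construction (the Particle type is made up for illustration):

    #include <cuda_runtime.h>
    #include <cstdio>

    // One definition, compiled for host and device by nvcc, so both sides
    // see the same layout.
    struct Particle {
        float3 pos;
        float3 vel;
        float  mass;
    };

    __global__ void step(Particle* p, int n, float dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            p[i].pos.x += p[i].vel.x * dt;
            p[i].pos.y += p[i].vel.y * dt;
            p[i].pos.z += p[i].vel.z * dt;
        }
    }

    int main() {
        const int n = 1024;
        Particle* particles = nullptr;
        cudaMallocManaged(&particles, n * sizeof(Particle));  // same pointer and layout on CPU and GPU

        for (int i = 0; i < n; ++i)
            particles[i] = Particle{make_float3(0, 0, 0), make_float3(1, 0, 0), 1.0f};

        step<<<(n + 255) / 256, 256>>>(particles, n, 0.5f);
        cudaDeviceSynchronize();

        printf("%f\n", particles[0].pos.x);
        cudaFree(particles);
        return 0;
    }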



