Partly it was second-system syndrome. OpenCL in particular thought it was going to be "better", particularly for hybrid programming and portability between "CPU only, GPU only, and mixed". I personally consider it a failure, and not just because NVIDIA never cared to really push it.
DirectX and GL predated CUDA and already had opaque buffer allocation things (e.g., vertex buffers). Partly this was a function of limited fixed-function units, maximum sizes of frame buffers and texture dimensions, and so on.
But yes, CUDA had a memory model that wasn't necessarily "magic": it worked just like regular malloc and free, so it's pretty obvious what it does. (And you just live with pinned memory, host <=> device copies, and so on.)
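For concreteness, roughly what that malloc-and-copy model looks like; this is just a sketch, and the scale() kernel, sizes, and launch configuration are made up for illustration:

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Placeholder kernel: double every element.
    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        float *h = (float *)malloc(bytes);            // plain host allocation
        for (int i = 0; i < N; ++i) h[i] = (float)i;

        float *d;
        cudaMalloc((void **)&d, bytes);               // device allocation, malloc-style
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

        scale<<<(N + 255) / 256, 256>>>(d, N);

        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        printf("h[42] = %f\n", h[42]);

        cudaFree(d);                                  // free, malloc-style
        free(h);
        return 0;
    }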
I've switched to using cudaMallocManaged() exclusively. From what I can tell there isn't much of a performance difference. A few cudaMemPrefetchAsync() calls at strategic places will remedy any performance problems. I really love that you can just break in gdb and look around in that memory as well.
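Roughly the same example with managed memory plus a prefetch, assuming a single GPU (device 0); the kernel is the same placeholder as above:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        float *data;
        cudaMallocManaged((void **)&data, bytes);     // one pointer, valid on host and device
        for (int i = 0; i < N; ++i) data[i] = (float)i;

        // Optional: prefetch to the GPU before the launch to avoid fault-driven migration.
        cudaMemPrefetchAsync(data, bytes, 0 /* device 0 */, 0 /* default stream */);
        scale<<<(N + 255) / 256, 256>>>(data, N);

        // Prefetch back to the host before touching it on the CPU (or just let it fault).
        cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();

        printf("data[42] = %f\n", data[42]);          // same pointer is readable here (and from gdb)
        cudaFree(data);
        return 0;
    }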
Yeah, sorry if I was unclear: some folks thought that cudaMallocHost et al. and pinned memory were "impure". That you should instead have a unified sense of "allocate", and that it could sometimes be host, sometimes device, sometimes migrate.
The unified memory support in CUDA (originally intended for Denver, IIRC) is mostly a response to people finding it too hard to decide (a la mmap, really).
So it's not that CUDA doesn't have these. It's that it does, but many people never have to understand anything beyond "there's a thing called malloc, and there's host and device".
Sure, but pinned memory is often a limited resource and requires the GPU to issue PCI transactions. Depending on your needs, it's generally better to copy to/from the GPU explicitly, which can be done asynchronously, hiding the overhead behind other work to a degree.
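A rough sketch of that explicit, asynchronous style, again with a placeholder kernel: a pinned host buffer plus cudaMemcpyAsync on a non-default stream, so the transfers and the kernel are queued up and the host (or other streams) can keep doing useful work in the meantime:

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main(void) {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        float *h;                                     // pinned (page-locked) host memory,
        cudaMallocHost((void **)&h, bytes);           // required for truly async copies
        for (int i = 0; i < N; ++i) h[i] = (float)i;

        float *d;
        cudaMalloc((void **)&d, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copies and the kernel are queued on one stream and run asynchronously
        // with respect to the host.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        scale<<<(N + 255) / 256, 256, 0, stream>>>(d, N);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

        // ... do unrelated host work here ...

        cudaStreamSynchronize(stream);                // wait only when the result is needed

        cudaStreamDestroy(stream);
        cudaFree(d);
        cudaFreeHost(h);                              // pinned memory has its own free
        return 0;
    }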
The third system, WebGPU, solves the memory management problem. Another thing CUDA gives you is a convenient way to describe how to share data between CPU and GPU. There's no good solution for that yet; I'm hoping for some procedural Rust macro.