Short answer: it has been a big pain in the butt. The GPU hardware is mostly really great, but the drivers/APIs were not designed for such a low-latency use case. For audio, kernel execution scheduling adds a large latency overhead. I've had to do a lot of fun optimization to reduce the runtime of the kernel itself, and a lot of less-fun evil dark magic optimization to e.g. trick macOS into raising the GPU clock speed.
Long answer: I've written a fair bit about this on my devlog. You might check out these tags:
https://anukari.com/blog/devlog/tags/gpu
https://anukari.com/blog/devlog/tags/optimization
Thanks for the extra info, I read through some of your entries on GPU optimization and it definitely seems like it's been a journey! Thanks for blazing the trail.