A lot of downvotes. Anyone have an opinion? Is CUDA fine forever? Is there something other than Vulkan we should also try? Do you think AMD should solve every problem CUDA solves for their customers too? What gives here?
I see a lot of resistance to the idea that we should start trying to align to Vulkan, here & elsewhere. I don't get it; it makes no sense, & everyone else using GPUs is running as fast as they can towards Vulkan. Is it just too soon, too early in the adoption curve, or do y'all think there are more serious obstructions long term to building a more Vulkan-centric AI/ML toolkit? It still feels inevitable to me. What we are doing now feels like a waste of time. I wish y'all wouldn't downvote so casually, wouldn't just try to brush this viewpoint away.
> Do you think AMD should solve every problem CUDA solves for their customers too?
They had no choice. Getting a bunch of HPC people to completely rewrite their code for a different API is a tough pill to swallow when you're trying to win supercomputer contracts. Would they have preferred to spend development resources elsewhere? Probably; they've even got their own standards and SDKs from days past.
> everyone else using GPUs is running as fast as they can towards Vulkan
I'm not qualified to comment on the entirety of it, but I can say that basically no claim in this statement is true:
1. Not everyone doing compute is using GPUs. Companies are increasingly designing and releasing their own custom hardware (TPUs, IPUs, NPUs, etc.)
2. Not everyone using GPUs cares about Vulkan. Certainly many folks doing graphics stuff don't, and DirectX is as healthy as ever. There have been bits and pieces of work around Vulkan compute for mobile ML model deployment, but it's a tiny niche and doesn't involve discrete GPUs at all.
> Is it just too soon, too early in the adoption curve
Yes. Vulkan compute is still missing many of the niceties of more developed compute APIs. Tooling is one big part of that: writing shaders using GLSL is a pretty big step down from using whatever language you were using before (C++, Fortran, Python, etc).
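To make the ergonomics gap concrete: in CUDA, a kernel is a few lines of extended C++ compiled alongside the host code, while Vulkan compute requires the same logic in a separate GLSL shader compiled to SPIR-V, plus substantial host-side setup (descriptor sets, pipelines, command buffers) before the first dispatch. A rough sketch of the CUDA side (a generic saxpy kernel for illustration, not from any particular codebase):

```cuda
#include <cuda_runtime.h>

// y = a * x + y, written directly in (extended) C++ -- no separate
// shader language, no SPIR-V compilation step, no descriptor sets.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host side is a handful of calls: cudaMalloc, cudaMemcpy, then
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

The Vulkan equivalent of that one launch line is on the order of a few hundred lines of host boilerplate, which is a large part of the tooling gap described above.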
> do y'all think there are more serious obstructions long term to building a more Vulkan-centric AI/ML toolkit
You could probably write a whole page about this, but TL;DR yes. It would take at least as much effort as AMD and Intel put into their respective compute stacks to get Vulkan ML anywhere near ready for prime time. You need to have inference, training, cross-device communication, headless GPU usage, reasonably wide compatibility, not garbage performance, framework integration, passable tooling and more.
Sure, these are all feasible, but who has the incentive to put in the time to do it? The big 3 vendors already have their supercomputer contracts, so all they need to do is keep maintaining their first-party compute stacks. Interop also requires going through Khronos, which is its own political quagmire when it comes to standardization. Nvidia already managed to obstruct OpenCL into obscurity; why would they do anything differently here? Downstream libraries have also poured untold millions into existing compute stacks, or they rely on the vendors to implement that functionality for them. This is before we even get into custom hardware like TPUs, which doesn't behave like a GPU at all.
So in short, there is little inevitable about this at all. The reason people may have been frustrated by your comment is because Vulkan compute comes up all the time as some silver bullet that will save us from the walled gardens of CUDA and co (especially for ML, arguably the most complex and expensive subdomain of them all). We'd all like it to come true, but until all of the aforementioned points are addressed this will remain primarily in pipe dream territory.
The paradox I see in your comments is between where you start and where you end. The start is that AMD had no choice but to re-embark on & redo years & years of hard work to catch up.
The end is decrying how impossible & hard it is to imagine anyone ever reproducing anything like CUDA in Vulkan:
> Sure, these are all feasible, but who has the incentive to put in the time to do it?
To speak to the first, though: what choice do we have? Why would AMD try to compete by doing it all again as a second party? It seems like, with Nvidia so dominant, AMD and literally everyone else should realize their incentive is to compete, as a group, against the current unquestioned champion. There needs to be some common ground that the humble opposition can work from. And, from what I see, Vulkan is that ground, and nothing else is remotely competitive or interesting.
I really appreciate your challenges; thank you for writing them out. It is really hard, and there are a lot of difficulties in starting afresh with a toolkit that is much harder to use than enriched, spiced-up C++ (CUDA) as a starting point. At the same time, I continue to think there will be a sea change, that it will happen enormously fast, & that it will take far less real work than the prevailing pessimist's view could ever have begun to encompass. Some good strategic wins to set the stage & make some common use cases viable, good-enough techniques to set a mold, and I think the participatory nature will snowball, quickly, and we'll wonder why we hadn't begun years ago.
Saying all the underdog competitors should team up is a nice idea, but as anyone who has seen how the standards sausage is made (or, indeed, has tried something similar) will tell you, it is often more difficult than everyone going their own way. It might be unintuitive, but coordination is hard even when you're not jockeying for position with your collaborators. This is why I mentioned the silver bullet part: a surface level analysis leads one to believe collaboration is the optimal path, but that starts to show cracks real quickly once one starts actually digging into the details.
To end things on a somewhat brighter note, there will be no sea change unless people put in the time and effort to get stuff like Vulkan compute working. As-is, most ML people (somewhat rightfully) expect accelerator support to be handed to them on a silver platter. That's fine, but I'd argue by doing so we lose the right to complain about big libraries and hardware vendors doing what's best for their own interests instead of for the ecosystem as a whole.