It always comes back to software. AMD is behind not because the hardware is bad, but because their software has historically played second fiddle to their hardware. The CUDA moat is real.
So unless they also solve that problem for their own hardware, it will end up like the TPU: used primarily inside Google, or in very specific use cases.
There are only so many super talented software engineers to go around. If you're going to become an expert in something, you're going to pick what everyone else is using first.
I don't know. The transformer architecture uses only a limited number of primitives; once you've ported those to your new hardware, you're good to go.
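To make that concrete, here's a rough sketch in plain NumPy of more or less the whole primitive set a transformer block leans on (the names and shapes here are mine, not from any particular framework): a matmul, a softmax, a normalization, and an activation. Everything else is plumbing around these.

    import numpy as np

    def softmax(x, axis=-1):
        # numerically stable softmax over the given axis
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def layer_norm(x, gain, bias, eps=1e-5):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return gain * (x - mean) / np.sqrt(var + eps) + bias

    def gelu(x):
        # tanh approximation of GELU
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    def attention(q, k, v):
        # q, k, v: (heads, seq, head_dim); causal masking omitted for brevity
        scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
        return softmax(scores) @ v

Port those (plus embedding lookups and a sampler) to your accelerator and you have the bulk of a decoder-only model.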
Also, Google has been using TPUs for a long time now, and __they__ never hit a brick wall for lack of CUDA.
Very few developers outside of Google have ever written code for a TPU. Similarly, far fewer people have written code for AMD than for NVIDIA.
If you're going to design a custom chip and deploy it in your data centers, you're also committing to hiring and training developers to build for it.
That's a kind of moat too, just with private chips. But while you solve one problem (getting the compute you want), you create another: supporting and maintaining that ecosystem long term.
NVIDIA was successful because they got their hardware into developers' hands, which created a feedback loop: developers asked for fixes and features, NVIDIA built them, the software stack improved, and the hardware evolved alongside it. That developer flywheel is what made CUDA dominant, and it is extremely hard to replicate because the shortage of talented developers is real.
I mean, it's all true to some extent. But that doesn't mean implementing the few primitives needed to get transformers running requires CUDA, or that it's an impossible task. Remember, we're talking about >$1B companies here that can easily assemble teams of tens to hundreds of developers.
You can compare CUDA to the first PC OS, DOS 1.0. Sure, DOS was viewed as a moat at the time, but it didn't keep others from kicking its ass.
> You can compare CUDA to the first PC OS, DOS 1.0.
Sorry, I don't understand this comparison at all. CUDA isn't some first version of an OS, not even close. It's been developed for almost 20 years now. Bucketloads of documentation, software, and tooling have been created around it. It won't have its ass kicked by any stretch of the imagination.
Yes, CUDA has a history, and it shows. CUDA has very poor integration with the OS, for example. It's time some other company (Microsoft sounds like a good contender) showed them how to do this the right way.
Anyway, this all distracts from the fact that you don't need an entire "OS" just to run some arithmetic primitives to get transformers running.
> CUDA has very bad integration with the OS for example.
If you want to cherry-pick, you can cherry-pick anything. But in my eyes you're just solidifying my point: software is critical. Minimizing the surface area is obviously a good thing (tinygrad, for example), but you still need people who are willing and able to write the code.
The CUDA moat is real for general-purpose computing and for researchers who want a Swiss Army knife, but when it comes to well-known deployments, for either training or inference, what you actually need from a chip is quite limited.
You do not need most of CUDA, or most of the GPU's functionality, so dedicated chips make sense. It was great to see this theory put to the test: the original llama.cpp stack showed just what you needed, the tiny llama.c showed how little was actually needed, and more recently a small team of engineers at Apple put together MLX.
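As a rough illustration of how narrow that surface really is, here's a hypothetical backend contract for a llama-style decoder (the interface and names are mine, not llama.cpp's or MLX's actual API). A dedicated chip only has to implement these few kernels well; everything else is orchestration on the host.

    from typing import Protocol
    import numpy as np

    class Backend(Protocol):
        def matmul(self, a, b): ...            # where nearly all the FLOPs go
        def rmsnorm(self, x, weight): ...      # normalization
        def softmax(self, x): ...              # attention weights, sampling
        def rope(self, q, k, positions): ...   # rotary position embedding
        def silu(self, x): ...                 # feed-forward activation
        def embed(self, table, token_ids): ... # embedding lookup

    # A reference implementation can be trivial NumPy (shown partially here);
    # a vendor swaps its own kernels in behind the same half-dozen calls.
    class NumpyBackend:
        def matmul(self, a, b):
            return a @ b

        def rmsnorm(self, x, weight, eps=1e-5):
            return weight * x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)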
Absolutely agreed that you only need specific parts of the chip and can tailor to that. My point is bigger than that: even if you build a specialized chip, you still need engineers who understand the full picture.
Internal ASICs are a completely different market. You know your workloads, and there is a finite number of them. It's as if you had to build a web browser, normally an impossible task, except it only needs to work with your company's website, which uses 1% of the features a browser offers.