It always comes back to software. AMD is behind not because the hardware is bad, but because their software has historically played second fiddle to their hardware. The CUDA moat is real.
So unless they also solve that problem for their own hardware, it will end up like the TPU: used primarily inside Google, or in very specific use cases.
There are only so many super talented software engineers to go around. If you're going to become an expert in something, you're going to pick what everyone else is using first.
I don't know. The transformer architecture uses only a limited number of primitives; once you've ported those to your new hardware, you're good to go.
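To make that concrete, here's a rough sketch in plain NumPy of more or less the whole primitive set a transformer block leans on (the names and shapes here are mine, not from any particular framework): a matmul, a softmax, a normalization, and an activation. Everything else is plumbing around these.

    import numpy as np

    def softmax(x, axis=-1):
        # numerically stable softmax over the given axis
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def layer_norm(x, gain, bias, eps=1e-5):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return gain * (x - mean) / np.sqrt(var + eps) + bias

    def gelu(x):
        # tanh approximation of GELU
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    def attention(q, k, v):
        # q, k, v: (heads, seq, head_dim); causal masking omitted for brevity
        scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
        return softmax(scores) @ v

Port those (plus embedding lookups and a sampler) to your accelerator and you have the bulk of a decoder-only model.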
Also, Google has been using TPUs for a long time now, and __they__ never hit a brick wall for lack of CUDA.
Very few developers outside of Google have ever written code for a TPU. Similarly, far fewer people have written code for AMD than for NVIDIA.
If you're going to design a custom chip and deploy it in your data centers, you're also committing to hiring and training developers to build for it.
That's a kind of moat too, just with private chips. But while you solve one problem (getting the compute you want), you create another: supporting and maintaining that ecosystem long term.
NVIDIA was successful because they got their hardware into developers' hands, which created a feedback loop: developers asked for fixes and features, NVIDIA built them, the software stack improved, and the hardware evolved alongside it. That developer flywheel is what made CUDA dominant, and it is extremely hard to replicate because the shortage of talented developers is real.
I mean, it's all true to some extent. But that doesn't mean implementing the few primitives needed to get transformers running requires CUDA, or that it's an impossible task. Remember, we're talking about >$1B companies here that can easily assemble teams of tens to hundreds of developers.
You can compare CUDA to the first PC OS, DOS 1.0. Sure, DOS was viewed as a moat at the time, but it didn't keep others from kicking its ass.
> You can compare CUDA to the first PC OS, DOS 1.0.
Sorry, I don't understand this comparison at all. CUDA isn't some first version of an OS, not even close. It's been developed for almost 20 years now. Bucketloads of documentation, software, and tooling have been created around it. It won't have its ass kicked by any stretch of the imagination.
Yes, CUDA has a history, and it shows. CUDA has very poor integration with the OS, for example. It's time some other company (Microsoft sounds like a good contender) showed them how to do this the right way.
Anyway, this all distracts from the fact that you don't need an entire "OS" just to run some arithmetic primitives to get transformers running.
> CUDA has very bad integration with the OS for example.
If you want to cherry-pick, you can cherry-pick anything. But in my eyes you're just solidifying my point: software is critical. Minimizing the surface area is obviously a good thing (tinygrad, for example), but you still need people who are willing and able to write the code.
The CUDA moat is real for general-purpose computing and for researchers who want a Swiss Army knife, but when it comes to well-known deployments, for either training or inference, what you actually need from a chip is quite limited.
You do not need most of CUDA, or most of the GPU's functionality, so dedicated chips make sense. It was great to see this theory put to the test: the original llama.cpp stack showed just what you needed, the tiny llama.c showed how little was actually needed, and more recently a small team of engineers at Apple put together MLX.
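As a rough illustration of how narrow that surface really is, here's a hypothetical backend contract for a llama-style decoder (the interface and names are mine, not llama.cpp's or MLX's actual API). A dedicated chip only has to implement these few kernels well; everything else is orchestration on the host.

    from typing import Protocol
    import numpy as np

    class Backend(Protocol):
        def matmul(self, a, b): ...            # where nearly all the FLOPs go
        def rmsnorm(self, x, weight): ...      # normalization
        def softmax(self, x): ...              # attention weights, sampling
        def rope(self, q, k, positions): ...   # rotary position embedding
        def silu(self, x): ...                 # feed-forward activation
        def embed(self, table, token_ids): ... # embedding lookup

    # A reference implementation can be trivial NumPy (shown partially here);
    # a vendor swaps its own kernels in behind the same half-dozen calls.
    class NumpyBackend:
        def matmul(self, a, b):
            return a @ b

        def rmsnorm(self, x, weight, eps=1e-5):
            return weight * x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)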
Absolutely agreed that you only need specific parts of the chip and can tailor to that. My point is bigger than that: even if you build a specialized chip, you still need engineers who understand the full picture.
Internal ASICs are a completely different market. You know your workloads, and there is a finite number of them. It's as if you had to build a web browser, normally an impossible task, except it only needs to work with your company's website, which uses 1% of the features a browser offers.