Even if the community provides support it could take years to reach the maturity...

jeroenhd · on June 25, 2024

If, and that's a big if, AMD can get ROCm working well for this chip, I don't think this will be a big problem.

ROCm can be spotty, especially on consumer cards, but for many models it does seem to work on their more expensive models. It may be worth it spending a few hours/days/weeks to work around the peculiarities of ROCm given the cost difference between AMD and Nvidia in this market segment.

This all stands or falls with how well AMD can get ROCm to work. As this article states, it's nowhere near ready yet, but one or two updates can turn AMD's accelerators from "maybe in 5-10 years" to "we must consider this next time we order hardware".

I also wonder if AMD is going to put any effort into ROCm (or a similar framework) as a response to Qualcomm and other ARM manufacturers creaming them on AI stuff. If these Copilot PCs take off, we may see AMD invest into their AI compatibility libraries because of interest from both sides.

latchkey · on June 25, 2024

https://stratechery.com/2024/an-interview-with-amd-ceo-lisa-...

"One of the things that you mentioned earlier on software, very, very clear on how do we make that transition super easy for developers, and one of the great things about our acquisition of Xilinx is we acquired a phenomenal team of 5,000 people that included a tremendous software talent that is right now working on making AMD AI as easy to use as possible."

jjoonathan · on June 25, 2024

Oh no. Ohhhh nooooo. No, no, no!

Xilinx dev tools are awful. They are the ones who had Windows XP as the only supported dev environment for a product with guaranteed shipments through 2030. I saw Xilinx defend this state of affairs for over a decade. My entire FPGA-programming career was born, lived, and died, long after XP became irrelevant but before Xilinx moved past it, although I think they finally gave in some time around 2022. Still, Windows XP through 2030, and if you think that's bad wait until you hear about the actual software. These are not role models of dev experience.

In my, err, uncle? post I said that I was confused about where AMD was in the AI arms race. Now I know. They really are just this dysfunctional. Yikes.

pbalcer · on June 25, 2024

Xilinx made triSYCL (https://github.com/triSYCL/triSYCL), so maybe there's some chance AMD invests first-class support for SYCL (an open standard from Khronos). That'd be nice. But I don't have much hope.

pjmlp · on June 26, 2024

Comparing what AMD has done so far with SYCL, and what Intel has done with OpenAPI, yeah better not keep that hope flame burning.

paulmd · on June 25, 2024

this is honestly a very enlightening interview because - as pointed out at the time - Lisa Su is basically repeatedly asked about software and every single time she blatantly dodges the question and tries to steer the conversation back to her comfort-zone on hardware. https://news.ycombinator.com/item?id=40703420

> He tries to get a comment on the (in hindsight) not great design tradeoffs made by the Cell processor, which was hard to program for and so held back the PS3 at critical points in its lifecycle. It was a long time ago so there's been plenty of time to reflect on it, yet her only thought is "Perhaps one could say, if you look in hindsight, programmability is so important". That's it! In hindsight, programmability of your CPU is important! Then she immediately returns to hardware again, and saying how proud she was of the leaps in hardware made over the PS generations.

> He asks her if she'd stayed at IBM and taken over there, would she have avoided Gerstner's mistake of ignoring the cloud? Her answer is "I don’t know that I would’ve been on that path. I was a semiconductor person, I am a semiconductor person." - again, she seems to just reject on principle the idea that she would think about software, networking or systems architecture because she defines herself as an electronics person.

> Later Thompson tries harder to ram the point home, asking her "Where is the software piece of this? You can’t just be a hardware cowboy ... What is the reticence to software at AMD and how have you worked to change that?" and she just point-blank denies AMD has ever had a problem with software. Later she claims everything works out of the box with AMD and seems to imply that ROCm hardly matters because everyone is just programming against PyTorch anyway!

> The final blow comes when he asks her about ChatGPT. A pivotal moment that catapults her competitor to absolute dominance, apparently catching AMD unaware. Thompson asks her what her response was. Was she surprised? Maybe she realized this was an all hands to deck moment? What did NVIDIA do right that you missed? Answer: no, we always knew and have always been good at AI. NVIDIA did nothing different to us.

> The whole interview is just astonishing. Put under pressure to reflect on her market position, again and again Su retreats to outright denial and management waffle about "product arcs". It seems to be her go-to safe space. It's certainly possible she just decided to play it all as low key as possible and not say anything interesting to protect the share price, but if I was an analyst looking for signs of a quick turnaround in strategy there's no sign of that here.

not expecting a heartfelt postmortem about how things got to be this bad, but you can very easily make this question go away too, simply by acknowledging that it's a focus and you're working on driving change and blah blah. you really don't have to worry about crushing some analyst's mindshare on AMD's software stack because nobody is crazy enough to think that AMD's software isn't horrendously behind at the present moment.

and frankly that's literally how she's governed as far as software too. ROCm is barely a concern. Support base/install base, obviously not a concern. DLSS competitiveness, obviously not a concern. Conventional gaming devrel: obviously not a concern. She wants to ship the hardware and be done with it, but that's not how products are built and released in 2020 anymore.

NVIDIA is out here building integrated systems that you build your code on and away you go. They run NVIDIA-written CUDA libraries, NVIDIA drivers, on NVIDIA-built networks and stacks. AMD can't run the sample packages in ROCm stably (as geohot discovered) on a supported configuration of hardware/software, even after hours of debugging just to get it that far. AMD doesn't even think drivers/runtime is a thing they should have to write, let alone a software library for the ecosystem.

"just a small family company (bigger than NVIDIA, until very recently) who can't possibly afford to hire developers for all the verticals they want to be in". But like, they spent $50b on a single acquisition, they spent $12b in stock buybacks over 2 years, they have money, just not for this.

jjoonathan · on June 25, 2024

So I knew that AMD's compute stack was a buggy mess -- nobody starts out wanting to pay more for less and I had to learn the hard way how big of a gap there was between AMD's paper specs and their actual offerings -- and I also knew that Nvidia had a huge edge at the cutting edge of things, if you need gigashaders or execution reordering or whatever, but ML isn't any of that. The calculations are "just" matrix multiplication, or not far off.

I would have thought AMD could have scrambled to fix their bugs, at least the matmul related ones, scrambled to shore up torch compatibility or whatever was needed for LLM training, and pushed something out the door that might not have been top-of-market but could at least have taken advantage of the opportunity provided by 80% margins from team green. I thought the green moat was maybe a year wide and tens of millions deep (enough for a team to test the bugs, a team to fix the bugs, time to ramp, and time to make it happen). But here we are, multiple years and trillions in market cap delta later, and AMD still seems to be completely non-viable. What happened? Did they go into denial about the bugs? Did they fix the bugs but the industry still doesn't trust them?

JonChesterfield · on June 25, 2024

It's roughly that the AMD tech works reasonably well on HPC and less convincingly on "normal" hardware/systems. So a lot of AMD internal people think the stack is solid because it works well on their precisely configured dev machines and on the commercially supported clusters.

Other people think it's buggy and useless because that's the experience on some other platforms.

This state of affairs isn't great. It could be worse but it could certainly be much better.

entropicdrifter · on June 25, 2024

If we're extremely lucky they might invest in SYCL and we'll see an Intel/AMD open-source teamup

sorenjan · on June 25, 2024

This seems like the option that would make the most sense. If developers can "write once, run everywhere", they might as well do that instead of Cuda. But if they have to "write once, run on Intel, or AMD, or Nvidia", why would they bother with anything other than Nvidia considering their market share? If you're an underdog you go for open standards that makes it easy to switch to your products, but it seems like AMD have seen Nvidia's Cuda and jealously decided they wanted their own version, but 15 years too late.

pbalcer · on June 25, 2024

> Qualcomm and other ARM manufacturers creaming them on AI stuff

That's mostly on Microsoft's DirectML though. I'm not sure whether AMD's implementation is based on ROCm (doubt it).

kd913 · on June 25, 2024

You do know that Microsoft, Oracle, Meta are all in on this right?

Heck I think it is being used to run ChatGPT 3.5 and 4 services.

softfalcon · on June 25, 2024

I feel like people forget that AMD has huge contracts with Microsoft, Valve, Sony, etc to design consoles at scale. It's an invisible provider as most folks don't even realize their Xbox and their Playstation are both AMD.

When you're providing fab designs at that scale, it makes a lot more sense to folks that companies would be willing to try a more affordable option to nVidia hardware.

My bet is that AMD figures out a service-able solution for some (not all) workloads that isn't ground breaking, but affordable to the clients that want an alternative. That's usually how this goes for AMD in my experience.

sangnoir · on June 25, 2024

If you read/listen to the Stratechary interview wirh Lisa Hsu, she spelled out being open ro customizing AMD hardware to meet partner's needs. So if Microsoft needs more memory bandwidth and less compute, AMD will build something just for them based on what they have now. If Meta wants 10% less power consumption (and cooling) for a 5% hit in compute, AMD will hear them out too. We'll see if that hardware customization strategy works outside of consoles.

rcxdude · on June 25, 2024

It certainly helps differentiate from NVIDIA's "Don't even think about putting our chips on a PCB we haven't vetted" approach.

pjmlp · on June 26, 2024

Yeah, but they will be using internal Microsoft and Meta software stacks, nothing that will dent CUDA.

Rinzler89 · on June 25, 2024

>I feel like people forget that AMD has huge contracts with Microsoft, Valve, Sony, etc to design consoles at scale.

Nobody forget that, just that those console chips are super low margins, which is why Intel and Nvidia stopped catering to that market after the Xbox/PS3 generations and only AMD took it up because they were broke and every penny mattered to them.

Nvidia did a brief stint with the Shield/Switch because they were trying to get into the Android/ARM space and also kinda gave up due to the margins.

pjmlp · on June 26, 2024

A market that keeps being discussed that is reaching its end, as newer generations aren't that much into traditional game consoles, and both Sony and Microsoft[0] have to reach out to PCs and mobile devices, to achieve sales growth.

Among the gamer community the discussion of this being the last generation keeps poping up.

[0] - Nintendo is more than happy to keep redoing their hit franchaises, in good enough hardware.

0cf8612b2e1e · on June 25, 2024

On the other hand, AMD has had a decade of watching CUDA eat their lunch and done basically nothing to change the situation.

bee_rider · on June 25, 2024

AMD tries to compete in hardware with Intel’s CPUs and Nvidia’s GPUs. They have to slack somewhere, and software seems to be where. It isn’t any surprise that they can’t keep up on every front, but it does mean they can freely bring in partners whose core competency is software and work with them without any caveats.

Not sure why they haven’t managed to execute on that yet, but the partners must be pretty motivated now, right? I’m sure they don’t love doing business at Nvidia’s leisure.

pjmlp · on June 26, 2024

Hardware is useless without software to make it show off.

bobsondugnut · on June 25, 2024

when was the last time AMD hardware was keeping up with NVIDIA? 2014?

0cf8612b2e1e · on June 25, 2024

Been a while since AMD had the top tier offering, but it has been trading blows in the middle tier segment the entire time. If you are just looking for a gamer card (ie not max AI performance), the AMD is typically cheaper and less power hungry than the equivalent Nvidia.

aurareturn · on June 25, 2024

It’s trading blows because AMD sells their cards at lower margins in the midrange and Nvidia lets them.

bee_rider · on June 26, 2024

But, the fact that Nvidia cards command higher margins also reflects their better software stack, right? Nvidia “lets them” trade blows in the midrange, or, equivalently, Nvidia is receiving the reward of their software investments: even their midrange hardware commands a premium.

bobsondugnut · on June 25, 2024

> the AMD is typically cheaper and less power hungry than the equivalent Nvidia

cheaper is true, but less power hungry is absolutely not true, which is kind of my point.

dralley · on June 25, 2024

It was true with RDNA 2. RDNA 3 regressed on this a bit, supposedly there was a hardware hiccup that prevented them from hitting frequency and voltage targets that they were hoping to reach.

In any case they're only slightly behind, not crazy far behind like Intel is.

bee_rider · on June 25, 2024

The MI300X sounds like it is competitive, haha

bobsondugnut · on June 25, 2024

competitive with H100 for inference. a 2 year old product on just one half of the ML story. H200 (and potentially B100) is the appropriate comparison based on their production in volume.

adabyron · on June 25, 2024

I have read in a few places that Microsoft is using AMD for inference to run ChatGPT. If I recall they said the price/performance was better.

I'm curious if that's just because they can't get enough Nvidia GPUs or if the price/performance is actually that much better.

atq2119 · on June 25, 2024

Most likely it really is better overall.

Think of it this way: AMD is pretty good at hardware, so there's no reason to think that the raw difference in terms of flops is significant in either direction. It may go in AMD's favor sometimes and Nvidia's other times.

What AMD traditionally couldn't do was software, so those AMD GPUs are sold at a discount (compared to Nvidia), giving you better price/performance if you can use them.

Surely Microsoft is operating GPUs at large enough scale that they can pay a few people to paper over the software deficiencies so that they can use the AMD GPUs and still end up ahead in terms of overall price/performance.

JonChesterfield · on June 25, 2024

Something like Triton from Microsoft/OpenAI as a cuda bypass? Or pytorch/tensorflow targeting ROCm without user intervention.

Or there's openmp or hip. In extremis opencl.

I think the language stack is fine at this point. The moat isn't in cuda the tech. It's in code running reliably on nvidia's stack, without things like stray pointers needing a machine reboot. Hard to know how far off robust rocm is at this point.

singhrac · on June 25, 2024

The problem is that we all have a lot of FUD (for good reasons). It's on AMD to solve that problem publically. They need to make it easier to understand what is supported so far and what's not.

For example, for bitandbytes (a common dependency in LLM world) there's a ROCm fork that the AMD maintainers are trying to merge in (https://github.com/TimDettmers/bitsandbytes/issues/107). Meanwhile an Intel employee merged a change that made a common device abstraction (presumably usable by AMD + Apple + Intel etc.).

There's a lot of that right now - super popular package that is CUDA-only is navigating how to make it work correctly with any other accelerator. We just need more information on what is supported.