The people who buy stuff like that are professionals. They often know something ...

latchkey · 2025-01-19T00:15:38 1737245738

Groq is a good example here:

https://www.eetimes.com/groq-ceo-we-no-longer-sell-hardware/

Our users give them plenty of feedback. They just RMA'd a whole bunch of our GPUs over this issue so that they could take them back to the mothership and figure out what's up...

https://github.com/ROCm/ROCm/issues/4021

It takes a lot of coordination, across ourselves (with customers), our DC, AMD and Dell to make that happen.

AnthonyMouse · 2025-01-19T00:53:51 1737248031

It's not that you don't get bug reports from data center customers, it's that data center customers have scale in a bad way. They buy thousands of GPUs, they do whatever they're going to do with them, they have a problem, they report the bug. One bug report across thousands of GPUs, because they're all being used for the same thing by that customer so they only have the problems you have when you try to do that. Another data center buys thousands of GPUs and they're doing something else which is extremely common and well supported, so they don't have any issues and you get zero bug reports from them.

Compare this to, you sell a thousand GPUs to a thousand professionals and 10% of them have some problem, but each a different one. You get 100 bug reports, you fix 100 bugs instead of just one, things improve much faster.

latchkey · 2025-01-19T02:35:29 1737254129

We have 136 of these things. Not thousands. AMD is intentionally keeping their number of providers limited [0] (bottom of page).

No two providers have the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about all have jobs that rent this compute.

These GPUs are enterprise, they only come in one form factor. It is a 350lbs box that takes 10kW of power and some pretty serious cooling. It costs as much as an expensive Ferrari.

If you're now suggesting that AMD also release another product that is easier for developers to get their hands on and deploy, then you've totally lost me. You're trying to exponentially increase the amount of work and money they spend, for what? Some feedback?

[0] https://www.amd.com/en/products/accelerators/instinct.html

_zoltan_ · 2025-01-19T19:14:19 1737314059

I think you underestimate the people here when you throw around things like "it costs as much as an expensive Ferrari". A lot of us work with systems like these, so we understand why they cost so much and what they can do. On Reddit this works; here, I feel this is pretty condescending.

"Intentionally limiting" is just koolaid. It's ok to drink it, it's your business, but it's koolaid. You think if AWS wanted to deploy a couple hundred thousand of these systems, AMD would be sad? I bet Lisa would be happy.

I tried renting a system, and putting in a credit card is not enough. That's a red flag for me. I don't want to email, chat with sales, etc, just put in a card number. This works for even GH200 systems over at Lambda.

As for number of SKUs, for Blackwell there are a lot, if you believe Jensen, and why wouldn't you? He stated at CES that almost every DC they go into is a bit bespoke with modifications.

AMD seems unable to execute on this, which is reflected in its share price.

latchkey · 2025-01-19T19:24:11 1737314651

> I feel this is pretty condescending

Apologies, not my intention.

> I bet Lisa would be happy.

I bet too! I was referring to neoclouds, not tier 1.

> I tried renting a system, and putting in a credit card is not enough.

You truly don't need to talk to anyone, CC and go: https://www.shadeform.ai/

> AMD seems unable to execute on this, which is reflected in its share price.

I agree, they haven't been doing the best job [0]. Let's hope they can show action and turn it around.

[0] https://x.com/HotAisle/status/1880679135875362839

_zoltan_ · 2025-01-19T20:09:23 1737317363

Ok, maybe it works now just by CC. Glad that's sorted.

AMD is tone deaf unfortunately, but I liked your reply on X.

AnthonyMouse · 2025-01-19T08:19:37 1737274777

> We have 136 of these things. Not thousands.

That's a number within an order of magnitude, and you're presumably not the largest provider.

> No two providers have the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about all have jobs that rent this compute.

If you own something and you're having problems with it, you're more inclined to try to solve them. If you're renting something and you have problems with it, you're more inclined to rent something else instead.

> These GPUs are enterprise, they only come in one form factor. It is a 350lbs box that takes 10kW of power and some pretty serious cooling. It costs as much as an expensive Ferrari.

Making only 4-socket systems was a choice.

You're also acting like multiple SKUs are something weird. Start offering Ryzen APUs with some on-package GDDR or HBM. Make something that fits in the Threadripper socket and uses PCIe power connectors for extra power. People would buy these things.

The point is to create lots of systems in the hands of lots of people that use the same general hardware architecture so that you're improving its software support.

Aurornis · 2025-01-19T00:02:20 1737244940

> provide bug reports that actually describe what's happening

Doesn’t matter if the bug reports are good or bad. Supporting low volume applications is a bad business move when the alternative is 9-figure data center contracts.

The data center business is orders of magnitude larger. Trying to support individual developers would be a huge business mistake when they already can’t keep up with data center.

AnthonyMouse · 2025-01-19T00:10:23 1737245423

It's the same hardware running the same software. You want the bug reports so you can fix the bugs, and then your data center customers don't encounter them when they're evaluating your product.

What they can keep up with is basically a matter of how much capacity they order from TSMC. If they underestimated demand for some generation, that's the sort of thing you fix with the next contract or you're just throwing money away.