The people who buy stuff like that are professionals. They often know something about the tools they're using and if there are any problems, provide bug reports that actually describe what's happening instead of some non-descriptive mush like "I have your GPU and Windows crashes sometimes". That is extremely helpful if you're trying to get rid of those bugs.
This is the same reason software shops have found it useful to support Linux, even if not many people use it. The people who do will make your product suck less, which in turn makes it easier to sell to the mass market, which will get upset and think unfavorably of you if they hit the same problem but won't be as good at telling you about it.
Our users give them plenty of feedback. They just RMA'd a whole bunch of our GPUs over this issue so that they could take them back to the mothership and figure out what's up...
It's not that you don't get bug reports from data center customers, it's that data center customers have scale in a bad way. They buy thousands of GPUs, they do whatever they're going to do with them, they have a problem, they report the bug. One bug report across thousands of GPUs, because they're all being used for the same thing by that customer so they only have the problems you have when you try to do that. Another data center buys thousands of GPUs and they're doing something else which is extremely common and well supported, so they don't have any issues and you get zero bug reports from them.
Compare this to selling a thousand GPUs to a thousand professionals, where 10% of them have some problem, but each a different one. You get 100 bug reports, you fix 100 bugs instead of just one, and things improve much faster.
We have 136 of these things. Not thousands. AMD is intentionally keeping their number of providers limited [0](bottom of page).
No two providers have the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about all have jobs that rent this compute.
These GPUs are enterprise, they only come in one form factor. It is a 350lbs box that takes 10kW of power and some pretty serious cooling. It costs as much as an expensive Ferrari.
If you're now suggesting that AMD also release another product that is easier for developers to get their hands on and deploy, then you've totally lost me. You're trying to exponentially increase the amount of work and money they spend, for what? Some feedback?
I think you underestimate the people here when you throw around things like "it costs as much as an expensive Ferrari". A lot of us work with systems like these, so we understand why they cost so much and what they can do. On Reddit this works; here, I feel it is pretty condescending.
"Intentionally limiting" is just koolaid. It's ok to drink it, it's your business, but it's koolaid. You think if AWS wanted to deploy a couple hundred thousand of these systems, AMD would be sad? I bet Lisa would be happy.
I tried renting a system, and putting in a credit card is not enough. That's a red flag for me. I don't want to email, chat with sales, etc.; I just want to put in a card number. This works even for GH200 systems over at Lambda.
As for number of SKUs, for Blackwell there are a lot, if you believe Jensen, and why wouldn't you? He stated at CES that almost every DC they go into is a bit bespoke with modifications.
AMD seems unable to execute on this, which is reflected in its share price.
That's a number within an order of magnitude, and you're presumably not the largest provider.
> No two providers have the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about all have jobs that rent this compute.
If you own something and you're having problems with it, you're more inclined to try to solve them. If you're renting something and you have problems with it, you're more inclined to rent something else instead.
> These GPUs are enterprise, they only come in one form factor. It is a 350lbs box that takes 10kW of power and some pretty serious cooling. It costs as much as an expensive Ferrari.
Making only 4-socket systems was a choice.
You're also acting like multiple SKUs are something weird. Start offering Ryzen APUs with some on-package GDDR or HBM. Make something that fits in the Threadripper socket and uses PCIe power connectors for extra power. People would buy these things.
The point is to create lots of systems in the hands of lots of people that use the same general hardware architecture so that you're improving its software support.
> provide bug reports that actually describe what's happening
Doesn’t matter if the bug reports are good or bad. Supporting low volume applications is a bad business move when the alternative is 9-figure data center contracts.
The data center business is orders of magnitude larger. Trying to support individual developers would be a huge business mistake when they already can’t keep up with data center.
It's the same hardware running the same software. You want the bug reports so you can fix the bugs, and then your data center customers don't encounter them when they're evaluating your product.
What they can keep up with is basically a matter of how much capacity they order from TSMC. If they underestimated demand for some generation, that's the sort of thing you fix with the next contract or you're just throwing money away.