Testing AMD's Giant MI300X (chipsandcheese.com)
213 points by klelatti on June 25, 2024 | 244 comments


Impressions from last week’s CVPR, a computer vision conference with 12k attendees - pretty much everyone is using NVIDIA GPUs, pretty much nobody is happy with the prices, and everyone would like some competition in the space:

NVIDIA was there with 57 papers, a website dedicated to their research presented at the conference, a full day tutorial on accelerating deep learning, and ever present with shirts and backpacks in the corridors and at poster presentations.

AMD had a booth at the expo part, where they were raffling off some GPUs. I went up to them to ask what framework I should look into when writing kernels (ideally from Python) for GPGPU. They referred me to the “technical guy”, who it turned out had a demo of inference on an LLM. Which he couldn’t show me, as the laptop with the APU had crashed and wouldn’t reboot. He didn’t know about writing kernels, but told me there was a compiler guy who might be able to help. He wasn’t to be found at that moment, though, and I couldn’t find him when I returned to the booth later.

I’m not at all happy with this situation. As long as AMD’s investment into software and evangelism remains at ~$0, I don’t see how any hardware they put out will make a difference. And you’ll continue to hear people walking away from their booth saying “oh, when I win it I’m going to sell it to buy myself an NVIDIA GPU”.


> I’m not at all happy with this situation. As long as AMD’s investment into software and evangelism remains at ~$0, I don’t see how any hardware they put out will make a difference.

It appears AMD’s initial strategy is courting the HPC crowd and hyperscalers: they have big budgets and lower support overhead, and are willing and able to write code that papers over AMD’s not-great software while appreciating the lower-than-Nvidia TCO. I think this incremental strategy is sensible, considering where most of the money is.

As a first mover, Nvidia had to start from the bottom up; CUDA used to run only/mostly on consumer GPUs. AMD is going top-down, starting with high-margin DC hardware before trickling down to rack-level users, and eventually to APUs as revenue growth allows more re-investment.


They’re making the wrong strategic play.

They will fail if they go after the highest margin customers. Nvidia has every advantage and every motivation to keep those customers. They would need a trillion dollars in capital to have a chance imho.

It would be like trying to go after Intel in the early 2000s by targeting server CPUs, or going after the desktop operating system market in the 90s against Microsoft. It’s aiming at your competition where they are strongest and you are weakest.

Their only chance to disrupt is to try to get some of the customers that Nvidia doesn’t care about, like consumer-level inference / academic or hobbyist models. Intel failed when they got beaten in a market they didn’t care about, i.e. mobile / low-power devices.


This is a common sentiment, no doubt also driven by the wish that AMD would cater to us.

But I see no evidence that the strategy is wrong or failing. AMD is already powering a massive and rapidly growing share of Top 500 HPC:

https://www.top500.org/statistics/treemaps/

AMD compute growth isn't in places where people see it, and I think that gives a wrong impression. (Or it means people have missed the big shifts over the last two years.)


It would be interesting to see how much these "supercomputers" are actually used, and what parts of them are used.

I use my university's "supercomputer" every now and then when I need lots of VRAM, and there are rarely many other users. E.g. I've never had to queue for a GPU even though I use only the top model, which probably should be the most utilized.

Also, I'd guess there can be nvidia cards in the grid even if "the computer" is AMD.

Of course it doesn't matter for AMD whether the compute is actually used or not as long as it's bought, but lots of theoretical AMD flops standing somewhere doesn't necessarily mean AMD is used much for compute.


It is a pretty safe bet that if someone builds a supercomputer there is a business case for it. Spending big on compute and then leaving it idle is terrible economics. I agree with Certhas in that although this is not a consumer-first strategy, it might be working. AMD's management are not incapable, for all that they've been outmanoeuvred convincingly by Nvidia.

That being said, there is a certain irony and schadenfreude in the bricked AMD laptop from the thread root. The AMD engineers are at least aware that running a compute demo on their products is an uncomfortable experience. The consumer situation is not acceptable even if strategically AMD is doing OK.


I find it a safer bet that there are terrible economics all over. Especially when the buyers are not the users, as is usually the case with supercomputers (just like with all "enterprise" stuff).

In the cluster I'm using there are 36 nodes, of which 13 are currently not idling (which doesn't mean they are computing). There are 8 V100 GPUs and 7 A100 GPUs, and all are idling. Admittedly it's holiday season and 3AM here, but it's similar at other times too.

This is of course great for me, but I think the safer bet is that the typical load average of a "supercomputer" is under 0.10. And the less useful the hardware, the less will be its load.


It is not a reasonable assumption to compare your local cluster to the largest clusters within DOE or their equivalents in Europe/Japan. These machines regularly run at >90% utilization and you will not be given an allocation if you can’t prove that you’ll actually use the machine.

I do see the phenomenon you describe on smaller university clusters, but these are not power users who know how to leverage HPC to the highest capacity. People in DOE spend their careers working to use as much of these machines as efficiently as possible.


In Europe at least, supercomputers are organised in tiers. Tier 0 are the highest grade; tier 3 are small local university clusters like the one you describe. Tier 2 and Tier 1 machines and upward usually require you to apply for time. They are definitely highly utilised. At Tier 3 the situation will be very different from one university to the next. But you can be sure that funding bodies will look at utilisation before deciding on upgrades.

Also, this amount of GPUs is not sufficient for competitive pure ML research groups, from what I have seen. The point of these small, decentralized, underutilized resources is to have slack for experimentation. Want to explore an ML application with a master's student in your (non-ML) field? Go for it.

Edit: No idea how much of the total HPC market is in the many small installs vs the fewer large ones. My instinct is that funders prefer to fund large centralised infrastructure, and getting smaller decentralised stuff done is always a battle. But that's all based on very local experience, and I couldn't guess how well this generalises.


    > It is a pretty safe bet that if someone builds a supercomputer there is a business case for it.
As I understand, most (95%+) of the market for supercomputers is gov't. If wrong, please correct. Else, what do you mean by "business case"?


When you ask your funding agency for an HPC upgrade or a new machine, the first thing they will want from you are utilisation numbers of current infrastructure. The second thing they will ask is why you don't just apply for time on a bigger machine.

Despite the clichés, spending taxpayer money is really hard. In fact my impression is always that the fear that resources get misused is a major driver of the inefficient bureaucracies in government. If we were more tolerant of taxpayer money being wasted we could spend it more efficiently. But any individual instance of misuse can be weaponized by those who prefer for power to stay in the hands of the rich...


At least where I'm from, new HPC clusters aren't really asked for by the users, but they are "infrastructure projects" of their own.

With the difficulty of spending taxpayer money, I fully agree. I even think HPC clusters are a bit of a symptom of this. It's often really hard to buy a beefy enough workstation of your own that would fit the bill, or to just buy time from cloud services. Instead you have to faff with an HPC cluster and its bureaucracy, because it doesn't mean extra spending. And especially not doing a tender, which is the epitome of the inefficiency caused by the paranoia about wasted spending.

I've worked for large businesses, and it's a lot easier to spend on all sorts of useless stuff in those, at least when the times are good. When the times get bad, the (pointless) bureaucracy and red tape easily gets worse than in government organizations.


> At least where I'm from, new HPC clusters aren't really asked for by the users, but they are "infrastructure projects" of their own.

Because the users expect them to be renewed and improved. Otherwise the research can’t be done. None of our users tell us to buy new systems. But they cite us like mad, so we can buy systems every year.

The dynamics of this ecosystem are different.


> It would be interesting to see how much these "supercomputers" are actually used, and what parts of them are used.

I’m in that ecosystem. Access is limited, demand is huge. There are literal queues and breakneck competition to get time slots. Same for CPU and GPU partitions.

They generally run at ~95% utilization. Even our small cluster runs at 98%.


Did your university not have a bioinformatics department?


It does. And meteorology, climatology and cosmology for example.


Well then I'm really unsure what's happening. Any serious researcher in any of those fields should be able to, and trying to, expand into all available supercompute.


Maybe they just don't need them? At least a bioinformatics/computational science professor I know runs most of his analyses on a laptop.


I see a lot of evidence, in the form of a rising moat for NVidia.


Supercomputers are in 95% of cases government funded, and I recommend that you check the conditions in tenders and how governments impose certain conditions on purchases. That isn't a normal business partner who only looks at performance; there are many other criteria in the decision making.

Or let me ask you directly: can you name me one enterprise which would buy a supercomputer, wait 5+ years for it, and fund the development of HW for it which doesn't exist yet? At the same time, when the competition can deliver a supercomputer within the year with an existing product?

No sane CEO would have done Frontier or El Capitan. Such things work only with government funding, where the government decides to wait and fund an alternative. But AMD is indeed a bit lucky that it happened, or otherwise they wouldn't have been forced to push the Instinct line.

In the commercial world, things work differently. There is always a TCO calculation. But one critical aspect since the 90s is SW. No matter how good the HW is, the opportunity costs in SW can force enterprises to use the inferior HW due to SW deployment. If vision computing SW in industry supports and is optimized for CUDA, or even runs only with CUDA, then any competition has a very hard time penetrating that market. They first have to invest a lot of money to make their products equally appealing.

AMD is making a huge mistake and is by far not paranoid enough to see it. For 2 decades, AMD and Intel have been in a nice spot, with PC and HPC computing requiring x86. To this date that has basically guaranteed a steady demand. But in that timeframe mobile computing has been lost to ARM. ML/AI doesn't require x86, as Nvidia demonstrates by combining their ARM CPUs into the mix, and ARM themselves want more and more of the PC and HPC computing cake. And MS is eager to help with OS-for-ARM solutions.

What that means is that if some day x86 isn't as dominant anymore and ARM becomes equally good, then AMD/Intel will suddenly have more competition in CPUs and might even offer non-x86 solutions as well. Their position will therefore degrade into yet another commodity CPU offering.

In the AI accelerator space we will witness something similar. Nvidia has created a platform and earns tons of money with it by combining and optimizing SW+HW. Big Tech is great at SW but not yet at HW. So the only logical thing to do is to get better at HW. All the large tech companies are working on their own accelerators, and they will build their platforms around them to compete with Nvidia and lock in customers in the same way. The primary losers in all of this will be HW-only vendors without a platform, hoping that Big Tech will support them on their platforms. Amazon and Google have already shown that they have no intention of supporting anything besides their own platforms and Nvidia (which they support only due to customer demand).


I am that crazy CEO building a supercomputer, for rent by anyone who wants it. We are starting small and growing with demand.

Our first deployment has 3x the FLOPS of Cheyenne at a fraction of the cost.

https://en.wikipedia.org/wiki/Cheyenne_(supercomputer)


The savings are an order of magnitude different. Switching from Intel to AMD in a data center might have saved millions if you were lucky. Switching from NVidia to AMD might save the big LLM vendors billions.


Nvidia have less moat for inference workloads since inference is modular. AMD would be mistaken to go after training workloads but that's not what they're going after.


I only observe this market from the sidelines... but

They're able to get the high end customers, and this strategy works because they can sell the high end customers high end parts in volume without having to have a good software stack; at the high end, the customers are willing to put in the effort to make their code work on hardware that is better in dollars/watts/availability or whatever it is that's giving AMD inroads into the supercomputing market. They can't sell low end customers on GPU compute without having a stack that works, and somebody who has a small GPU compute workload may not be willing or able to adapt their software to make it work on an AMD card even if the AMD card would be a better choice if they could make it work.


They’re going to sell a billion dollars of GPUs to a handful of customers while NVIDIA sells a trillion dollars of their products to everyone.

Every framework, library, demo, tool, and app is going to use CUDA forever and ever while some “account manager” at AMD takes a government procurement officer to lunch to sell one more supercomputer that year.


I'd guess that the majority of ML software is written in PyTorch, not in CUDA, and PyTorch has support for multiple backends including AMD. torch.compile also supports AMD (generating Triton kernels, same as it does for NVIDIA), so for most people there's no need to go lower level.
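To illustrate (a minimal sketch of my own, not from the article): on a ROCm build of PyTorch, the AMD GPU is exposed through the usual "cuda" device string, so ordinary model code and torch.compile run unchanged on either vendor's hardware.

    import torch
    import torch.nn as nn

    # On ROCm builds of PyTorch, AMD GPUs show up behind the usual "cuda"
    # device string, so this code is identical on NVIDIA and AMD hardware.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = nn.Sequential(
        nn.Linear(1024, 4096),
        nn.GELU(),
        nn.Linear(4096, 1024),
    ).to(device)

    # torch.compile lowers to Triton kernels on both the CUDA and ROCm backends.
    model = torch.compile(model)

    x = torch.randn(8, 1024, device=device)
    print(model(x).shape)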


GPUs are used for more than only ML workloads.

CUDA's relevance in the industry is so big now that NVidia has several WG21 seats and helps drive the heterogeneous programming roadmap for C++.


You can use PyTorch for more than ML. No need to use backprop. Think of it as GPU-accelerated NumPy.
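A quick sketch of that usage, just for illustration - plain array math on the GPU with autograd disabled, nothing model- or vendor-specific:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # NumPy-style number crunching on the GPU; no gradients or backprop involved.
    with torch.no_grad():
        a = torch.rand(4096, 4096, device=device)
        b = torch.rand(4096, 4096, device=device)
        c = a @ b                      # dense matrix multiply
        spectrum = torch.fft.rfft2(c)  # FFTs, reductions, etc. work the same way
        print(c.mean().item(), spectrum.shape)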


I would like to see OctaneRender done in Pytorch. /s


Sure, but if the OctaneRender folk wanted to support AMD, then I highly doubt they'd be interested in a CUDA compatibility layer either - they'd want to be using the lowest-level API possible (Vulkan?) to get close to the metal and optimize performance.


See, that is where you got it all wrong: they dropped Vulkan for CUDA, and even gave a talk about it at GTC.

https://www.cgchannel.com/2023/11/otoy-releases-first-public...

https://www.cgchannel.com/2023/11/otoy-unveils-the-octane-20...

And then again, there are plenty of other cases where PyTorch makes absolutely no sense on a GPU, which was the whole starting point.


> See, that is where you got it all wrong

I said that if they wanted to support AMD they would use the closest-to-metal API possible, and your links prove that this is exactly their mindset - preferring a lower level more performant API to a higher level more portable one.

For many people the tradeoffs are different and ability to write code quickly and iterate on design makes more sense.


Nvidia's 2024 data center revenue was $46B. They've got a long fucking way to go to get to a trillion dollars of product.


Take a look at this chart going back ~3Y: https://ycharts.com/indicators/nvidia_corp_nvda_data_center_...

Their quarterly data centre revenue is now $22.6B! Even assuming that it immediately levels off, that's $90B over the next 12 months.

If it merely doubles, then they'll hit a total of $1T in revenue in about 6 years.

I'm an AI pessimist. The current crop of generative LLMs are cute, but not a direct replacement for humans in all but a few menial tasks.

However, there's a very wide range of algorithmic improvements available which wouldn't have been explored three years ago. Nobody had the funding, motivation, or hardware. Suddenly, everyone believes that it is possible, and everyone is throwing money at the problem. Even if the fruits of all of this investment are just a ~10% improvement in business productivity, that's easily worth $1T to the world economy over the next decade or so.

AMD is absolutely leaving trillions of dollars on the table because they're too comfortable selling one supercomputer at a time to government customers.

Those customers will stop buying their kit very soon, because all of the useful software is being written for CUDA only.


Did you look at your own chart? There's no trend of 200% growth. Rather, the last few quarters were a huge jump from relatively modest gains in the years prior. Expecting 6 years of “merely doubling” is absolutely bonkers lol

Who can even afford to buy that much product? Are you expecting Apple, Microsoft, Alphabet, Amazon, etc to all dump 100% of their cash on Nvidia GPUs? Even then that doesn't get you to a trillion dollars


Once AI becomes a political spending topic like green energy, I think we’ll see nation level spending. Just need one medical breakthrough and you won’t be able to run a political campaign without AI in your platform.


Meta alone bought 350,000 H100 GPUs, which cost them $10.5 billion: https://www.pcmag.com/news/zuckerbergs-meta-is-spending-bill...

This kind of AI capital investment seems to have helped them improve the feed recommendations, doubling their market cap over the last few years. In other words, they got their money back many times over! Chances are that they're going to invest this capital into B100 GPUs next year.

Apple is about to revamp Siri with generative AI for hundreds of millions of their customers. I don't know how many GPUs that'll require, but I assume... many.

There's a gold rush, and NVIDIA is the only shovel manufacturer in the world right now.


> Meta alone bought 350,000 H100 GPUs, which cost them $10.5 billion

Right, which means you need about a trillion dollars more to get to a trillion dollars. There's not another 100 Metas floating around.

> Apple is about to revamp Siri with generative AI for hundreds of millions of their customers. I don't know how many GPUs that'll require, but I assume... many.

Apple also said they were doing it with their silicon. Apple in particular is all but guaranteed to refuse to buy from Nvidia even.

> There's a gold rush, and NVIDIA is the only shovel manufacturer in the world right now.

lol no they aren't. This is literally a post about AMD's AI product even. But Apple and Google both have in-house chips as well.

Nvidia is the big general party player, for sure, but they aren't the only one. And more to the point, exponential growth of the already-largest player for 6 years is still fucking absurd.


The GDP of the US alone over the next five years is $135T. Throw in other modern economies that use cloud services like Office 365 and you’re over $200T.

If AI can improve productivity by just 1% then that is $2T more. If it costs $1T in NVIDIA hardware then this is well worth it.


(note to conversation participants - I think jiggawatts might be arguing about $50B/qtr x 24 qtr = $1 trillion and kllrnohj is arguing $20 billion * 2^6 years = $1 trillion - although neither approach seems to be accounting for NPV).

That is assuming Nvidia can capture the value and doesn't get crushed by commodity economics. Which I can see happening, and I can also see not happening. Their margins are going to be under tremendous pressure. Plus I doubt Meta are going to be cycling all their GPUs quarterly; there is likely to be a rush and then a settling of capital expenses.


Another implicit assumption is that LLMs will be SoTA throughout that period, or the successor architecture will have an equally insatiable appetite for lots of compute, memory and memory bandwidth; I'd like to believe that Nvidia is one research paper away from a steep drop in revenue.


Agreed with @roenxi and I’d like to propose a variant of your comment:

All evidence is that “more is better”. Everyone involved professionally is of the mind that scaling up is the key.

However, like you said, just a single invention could cause the AI winds to blow the other way and instantly crash NVIDIA’s stock price.

Something I’ve been thinking about is that the current systems rely on global communications which requires expensive networking and high bandwidth memory. What if someone invents an algorithm that can be trained on a “Beowulf cluster” of nodes with low communication requirements?

For example the human brain uses local connectivity between neurons. There is no global update during “training”. If someone could emulate that in code, NVIDIA would be in trouble.


> They will fail if they go after the highest margin customers.

They are already powering the most powerful supercomputers, so I guess you’re right.

Oh, by coincidence, the academic crowd is the primary user of these supercomputers.

Pure luck.


AMD did go after Intel's server CPUs in the 2000s, with quite a bit of success.


And it worked mainly because they were a drop-in for Intel processors. Which was and is an amazing feat. I and most people could and can run anything compiled (except AVX-512 stuff back then on Zen 1 and 2?) without a hitch. And it was still a huge uphill battle, and Intel let it happen, what with their bungling of the 10nm process.

I don't see how the same can work here. HIP isn't it right now (every time I try, anyway).


> They would need a trillion dollars in capital to have a chance imho.

All AMD would really need is for Nvidia innovation to stall. Which, with many of their engineers coasting on $10M annual compensation, seems not too far-fetched.


AMD can go toe to toe with Nvidia on hardware innovation. What AMD has realised (correctly, IMO) is that all they need is for hyperscalers to match/come close to Nvidia on software innovation on AMD hardware - Amazon/Meta/Microsoft engineers can get their foundation models running on MI300X well enough for their needs - CUDA is not much of a moat in that market segment where there are dedicated AI-infrastructure teams. If the price is right, they may shift some of those CapEx dollars from Nvidia to AMD. Few AI practitioners - and even fewer LLM consumers - care about the libraries underpinning torch/numpy/high-level-python-framework/$LLM-service, as long as it works.


That is the wrong move; personally I would start with the local LLM/llama folks who crave more memory and build up from there.


Seeing that they don't have a mature software stack, I think for now AMD would prefer one customer who brings in $10m revenue over 10'000 customers at $1000 a pop.


It doesn't make sense, because they can market to both at the same time.


> It appears AMD’s initial strategy is courting the HPC crowd and hyperscaler...

I don't agree with this at all! Give me something that I can easily prototype at home and then quickly scale up at work!


> As long as AMD’s investment into software and evangelism remains at ~$0

Last time I checked, they have been trying to hire a ton of software engineers to improve the applied stacks (CV, ML, DSP, compute, etc.) at a location near where I'm based.

It seems like there's a big push to improve the stacks, but given that less than 10 years ago they were practically at death's door, it's not terribly surprising that their software is in the state it is. It's been getting better gradually, but quality software doesn't just show up overnight, especially when things are as complex and arcane as they are in the GPU world.


With margins that high?

There is always financing, there are always people willing to go to the competitor at some wage, there is always a way if the leadership wants to.

If it was just a straight up fab bottleneck? Yeah maybe you buy that for a year or two.

“During Q1, Nvidia reported $5.6 billion in cost of goods sold (COGS). This resulted in a gross profit of $20.4 billion, or a margin profile of 78.4%.”

That’s called an “induced market failure”.


They literally bought Xilinx for their software engineering team. That's at least a thousand firmware engineers and software engineers focused on software stack improvements. That was two years ago. And on top of Xilinx they've been hiring staff like crazy for years now.

The issue was that they basically let everyone go who wasn't building hardware for their essential product lines (CPU & GPU), other than a skeleton crew to keep the software at least mostly functioning. And as much as this seems like it was a bad decision, AMD was probably weeks from bankruptcy by the time they got Zen out the door even despite doing this. Had they not done so, they'd almost certainly have closed up entirely.

So for the last ~5 years minimum now they've been building back their software teams and trying to recuperate what they lost in institutional knowledge. That all takes time to do even if you hire back twice as many engineers as you lost.

And so now we are here. Things are clearly improving but nowhere near acceptable yet. But there's a trend of improvement.


> Things are clearly improving

How long am I supposed to wait, as my still-modern AMD GPU sits still-unsupported?

The anecdote above doesn't even sound like there's any improvement at all, let alone "clear" improvement.

And with Zen in 2017 and Zen+ in 2018 the counter is past six years at this point since the money gates opened wide.


> How long am I supposed to wait, as my still-modern AMD GPU sits still-unsupported?

Which GPU do you have? At least according to these docs, on Linux the upper chunk of RDNA3 is supported officially, but from experience, basically all 6xxx or 7xxx cards are unofficially supported if you build it for your target arch. 5xxx cards get the short end of the stick and got skipped (they were a rough launch), but Radeon VII cards should also still be officially supported (with support shifting to unofficial status in the next release).

https://rocm.docs.amd.com/en/latest/compatibility/compatibil...

And given that ROCm is pretty core to AMD's support for the windows AI stack (via ONNX), you can assume any new GPUs released from here on out will be supported.


It's 5xxx. And "rough launch" is not an excuse. They've had plenty of time. Is it that different from the other RDNA cards?

The unofficial support for so many cards is not a good situation either.

Edit: Actually, no, I know it's not that different, because some versions of ROCm largely work on RDNA1 if you trick them. They are just refusing to do the extra bit of work to patch over the differences.


I mean, it apparently works on RDNA1 now after some effort, but they never really attempted to support it, because they initially only supported workstation RDNA cards and there was no workstation RDNA1 release.

https://www.reddit.com/r/ROCm/comments/1bd8vde/psa_rdna1_gfx...

I wish they had comprehensive support for basically all recent GPU releases but tbh I'd rather they focus on perfecting support for the current and upcoming generations than spread their efforts too thin.

And ideally backports to the older cards will come with time, but it's really not a priority over issues on the current generation, because those RDNA1 cards were never actually supported in the first place.


Every post I see about trying it has the person run into issues, but maybe Soon it will finally be true.


Have you ever organized anything of size?

Financing is not the bottleneck. Organizational capacity might well be, though. As an organization, AMD's survival depended not on competing with nVidia but on competing with Intel. Now they are established, in what must be one of the greatest come-from-behind successes in tech history. 8 years ago, Intel was worth 80 times as much as AMD; today AMD has surpassed them:

https://www.financecharts.com/compare/AMD,INTC/summary/marke...

Stock isn't reality, but I wouldn't so easily assume that the team that led AMD to overtake Intel are idiots.


> With margins that high? There is always financing, there are always people willing to go to the competitor at some wage, there is always a way if the leadership wants to.

People love to pop off about stuff they really don't know anything about. Let me ask you: what financing do you imagine is available? Like literally what financing do you propose for a publicly traded company? Like do you realize they can't actually issue new shares without putting it to a shareholder vote? Should they issue bonds? No, I know, they should run an ICO!!!

And then what margins exactly? Do you know what the margin is on MI300? No. Do you know whether they're currently selling at a loss to win marketshare? No.

I would be the happiest boy if HN, in addition to policing jokes and memes, could police arrogance.


Are you saying that companies lose the ability to secure financing once they go public?


Of course not - I mentioned 3 routes to securing further financing. Did you read about those 3 routes in my comment?


You mentioned them all mockingly. If you weren't trying to suggest none are viable, you need to reword.


This isn't hard: financing routes exist but they aren't as simple or easy or straightforward as the person to whom I was responding makes it seem.


They didn't imply it was notably easy. Your reply there only makes sense if you were trying to say it's nearly impossible. If you're just saying it's kinda hard then your post is weirdly hostile for no reason, reading theirs in an extreme way just so you can correct it harder.


> They didn't imply it was notably easy

Really? I must be reading a different language than English here

> There is always financing, there are always people willing to go to the competitor at some wage, there is always a way if the leadership wants to.


If "always a way" implies anything about difficulty, it implies that there are challenges to overcome, not ease.


I guess there's always a way to play devil's advocate <shrug>


Have you looked into TinyCorp [0]/tinygrad [1], one of the latest endeavors by George Hotz? I've been pretty impressed by the performance. [2]

[0] https://tinygrad.org/ [1] https://github.com/tinygrad/tinygrad [2] https://x.com/realGeorgeHotz/status/1800932122569343043?t=Y6...


I have not been impressed by the perf. Slower than PyTorch for LLMs, and PyTorch is actually stable on AMD (I've trained 7B/13B models).. so the stability issues seem to be more of a tinygrad problem and less of an AMD problem, despite George's ramblings [0][1]

[0] https://github.com/tinygrad/tinygrad/issues/4301 [1] https://x.com/realAnthonix/status/1800993761696284676


He also shakes his fist at the software stack, but loudly enough that it gets AMD to react to it.


As more of a business person than an engineer, help me understand why AMD isn't getting this - what's the counterargument? Is CUDA just too far ahead? Are they lacking the right people in senior leadership roles to see this through?


CUDA is very far ahead. Not only technically, but in mindshare. Developers trust CUDA and know that investing in CUDA is a future proof investment. AMD has had so many API changes over the years, that no one trusts them any more. If you go all in on AMD, you might have to re-write all your code in 3-5 years. AMD can promise that this won't happen, but it's happened so many times already that no one really believes them.

Another problem is simply that hiring (and keeping) top talent is really, really hard. If you're smart enough to be a lead developer of AMD's core machine learning libraries, you can probably get hired at any number of other places, so why choose AMD?

I think the leadership gets it and understand the importance, I just don't think they (or really anybody) knows how to come up with a good plan to turn things around quickly. They're going to have to commit to at least a 5 year plan and lose money each of those 5 years, and I'm not sure they can or even want to fight that battle.


> Another problem is simply that hiring (and keeping) top talent is really really hard.

Absolutely. And when your mandate for this top talent is going to be "go and build something that basically copies what those other guys have already built", it is even harder to attract them, when they can go any place they like and work on something new.

> I think the leadership gets it and understand the importance, I just don't think they (or really anybody) knows how to come up with a good plan to turn things around quickly.

Yes, it always puzzles me when people think nobody at AMD actually sees the problem. Of course they see it. Turning a large company is incredibly hard. Leadership can give direction, but there is so much baked in momentum, power structures, existing projects and interests, that it is really tough to change things.


CUDA is one area that Nvidia really nailed. When it was first announced I saw it as something neat, but I could have never envisioned just how ingrained it would become. This was long before AI training/execution was really on most people's radars.

But for years I have heard the same things from so many people working in the field. "We hate Nvidia because they got it so right but are the only option."


As another commenter points out, their strategy appears to be to focus on HPC clients, where AMD can concentrate after-sale software support on a relatively small number of customer requests. This gets them some sales while avoiding the level of organizational investment necessary to build a software platform that can support NVIDIA-style broad compatibility and a good out-of-the-box experience.


Yes, to add to the other comments, what many don't realize is that CUDA is an ecosystem - C, C++ and Fortran foremost - but NVidia quickly realized that supporting any programming language community that wants to target PTX was a very good idea.

Their GPUs were re-designed to follow the C++ memory model, and many NVidia engineers have seats at ISO C++, making CUDA the best way to run heterogeneous C++. Something that Intel also realized, by acquiring CodePlay, key players in SYCL, and also employing ISO C++ contributors.

Then there are the Visual Studio and Eclipse plugins, and graphical debuggers that even allow you to single-step shaders if you so wish.


> are they lacking the right people in senior leadership roles to see this through?

Just like Intel, they have an outdated culture. IMHO they should start a software Skunk Works isolated from the company and have the software guys guide the hardware features. Not the other way around.

I wouldn't bet money on either of them doing this. Hopefully some other smaller, modern, and flexible companies can try it.


CUDA is a software moat. If you want to use any GPU other than Nvidia's, you need to double your engineering budget because there are no easy-to-bootstrap projects at any level. The hardware prices are meaningless if you need a $200k engineer, if they even exist, just to bootstrap a product.


Depending on your hardware budget, the engineering one can look like a rounding error.


Sure, but then you're still on the side of NVIDIA because you have the budget.


Why give any additional money to Nvidia when you can announce more profits (or get more compute if you're a government agency) by hiring more engineers to enable AMD hardware for less than a few million per year? It's not like Microsoft loves the idea of handing over money to Nvidia if there is a cheaper alternative that can make $MSFT go up.


Say your success rate for replicating CUDA+Nvidia hardware on AMD is 60%. But it will take 2 years. That's not going to be compelling for any large org, especially when the MI300X is cheaper, but not crazy cheaper, than an H100.

Especially since CUDA is still rolling out new functionality and optimizations, so the goal posts will keep moving.


> Say your success rate for replicating CUDA+Nvidia[...]

Rational hyperscalers would just stop as soon as their tooling/workloads/models are functional on AMD hardware within an acceptable perf envelope - just like they already do with their custom silicon. Replicating CUDA is just unnecessary, expensive and time-consuming completionism; if some workloads require CUDA, they will be executed on Nvidia clusters that are part of the fleet.


It depends on how much cheaper the total solution is and how available the hardware is. If I can't get Nvidia hardware until six months after I get AMD hardware, I have a couple of months to port my software to AMD and still beat my competitor that's waiting for Nvidia. It's always a matter of how many problems you can solve for a given amount of money x time.


Sure, but the "it depends" is carrying a lot of weight. NVIDIA's moat will get you testable software straight out the gate; any other stack currently is a game of "how long can we take to get this going".

Corporations simply aren't interested in long-term gains unless there's a straightforward path.


It depends on the problems you have. If you need CUDA, then you married yourself to Nvidia. If you can use libraries that work equally well on both, then you would benefit.

When you are a government agency, it’s more palatable to spend the budget in a way that results in employment of nationals and development of indigenous technologies.


Because if you don't join NVIDIA, your likelihood of success goes down. So the “more profits” you speak of is gambling money. Most corporations aren't going to gamble.


Depends on you needing CUDA or not. If you don’t, you can use anything.

It was this same game with x86 and ARM is eroding the former king’s place in the datacenter.


Leadership lacking vision + being almost bankrupt until relatively recently.


MIVisionX is probably the library you want for computer vision. As for kernels, you would generally write HIP, which is very similar to CUDA. To my knowledge, there's no equivalent to cupy for writing kernels in Python.

For what it's worth, your post has cemented my decision to submit a few conference talks. I've felt too busy writing code to go out and speak, but I really should make time.



Oh cool! It appears that I've already packaged cupy's required dependencies for AMD GPU support in the Debian 13 'main' and Ubuntu 24.04 'universe' repos. I also extended the enabled architectures to cover all discrete AMD GPUs from Vega onwards (aside from MI300, ironically). It might be nice to get python3-cupy-rocm added to Debian 13 if this is a library that people find useful.
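For anyone wondering what kernels-from-Python look like on that stack, here is a minimal sketch (assuming a working ROCm build of CuPy, e.g. the python3-cupy-rocm packaging mentioned above): cupy.RawKernel compiles the source with hipRTC on ROCm, so a simple CUDA-style kernel string works unchanged.

    import cupy as cp

    # On a ROCm build of CuPy, this kernel source is compiled with hipRTC;
    # simple CUDA-style kernels like this need no changes for AMD GPUs.
    saxpy = cp.RawKernel(r'''
    extern "C" __global__
    void saxpy(const float a, const float* x, const float* y, float* out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) {
            out[i] = a * x[i] + y[i];
        }
    }
    ''', 'saxpy')

    n = 1 << 20
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.random.rand(n, dtype=cp.float32)
    out = cp.empty_like(x)

    threads = 256
    blocks = (n + threads - 1) // threads
    saxpy((blocks,), (threads,), (cp.float32(2.0), x, y, out, cp.int32(n)))

    print(cp.allclose(out, 2.0 * x + y))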


HIP isn't similar to CUDA in the set of available languages that target PTX, the existing library ecosystem, IDE plugins, or graphical debuggers.

This is the kind of stuff AMD keeps missing out on; even oneAPI from Intel looks better in that regard.


If you are looking for attention from an evangelist, I'm sorry but you are not the target customer for MI300. They are courting the Hyperscalers for heavy duty production inference workloads.


I also stopped by their booth and talked about trial access, and right away asked for easy access a la Google Colab, specifically without bureaucracy. And they were like "yeah, we are making it, but nah man, you can't just log in and use it, you gotta fill out a form and wait for us to approve it". I was very disappointed at that point.

That was a marketing guy BTW. I don't think they realize their marketing strategies suck.


Completely agree. It's been 18 years since Nvidia released CUDA. AMD has had a long time to figure this out so I'm amazed at how they continue to fumble this.


10 years ago AMD was selling its own headquarters so that it could stave off bankruptcy for another few weeks (https://arstechnica.com/information-technology/2013/03/amd-s...).

AMD's software investments began in earnest a few years ago, but AMD really did progress more than pretty much everyone else aside from NVidia, IMO.

AMD further made a few bad decisions where they “split the bet”, relying upon Microsoft and others to push software forward. (I did like C++ AMP, for what it's worth.) The underpinnings of C++ AMP led to Boltzmann, which led to ROCm, which then needed to be ported away from C++ AMP and into the CUDA-like HIP.

So it's a bit of a misstep there for sure. But it's not like AMD has been dilly-dallying. And for what it's worth, I would have personally preferred C++ AMP (a C++11-standardized way to represent GPU functions as []-lambdas rather than CUDA-specific <<<extensions>>>). Obviously everyone else disagrees with me, but there's some elegance to parallel_for_each([](param1, param2){magically a GPU function executing in parallel}), where the compiler figures out the details of how to get param1 and param2 from CPU RAM into the GPU (or you use GPU-specific allocators to make param1/param2 in the GPU codespace already, to bypass the automagic).


Nowadays you can write regular C++ in CUDA if you so wish, and unlike AMD, NVidia employs several WG21 contributors.


CUDA of 18 years ago is very different to CUDA of today.

Back then AMD/ATI were actually at the forefront on the GPGPU side - things like the early Brook language and CTM led pretty quickly into things like OpenCL. Lots of work went on using the Xbox 360 GPU in real games for GPGPU tasks.

But CUDA steadily improved iteratively, and AMD kinda just... stopped developing their equivalents? Considering that for a good part of that time they were near bankruptcy, it might not have been surprising though.

But saying Nvidia solely kicked off everything with CUDA is rather ahistorical.


> AMD kinda just... stopped developing their equivalents?

It wasn't so much that they stopped developing; rather, they kept throwing everything out and coming out with new and non-backwards-compatible replacements. I knew people working in the GPU compute field back in those days who were trying to support both AMD/ATI and NVidia. While their CUDA code just worked from release to release, and every new release of CUDA just got better and better, AMD kept coming up with new breaking APIs and forcing rewrite after rewrite, until they just gave up and dropped AMD.


> CUDA of 18 years ago is very different to CUDA of today.

I've been writing CUDA since 2008 and it doesn't seem that different to me. They even still use some of the same graphics in the user guide.


Yep! I used BrookGPU for my GPGPU master's thesis, before CUDA was a thing. AMD lacked follow-through on the software side, as you said, but a big factor was also NV handing out GPUs to researchers.


10 years ago they were basically broke and bet the farm on Zen. That bet paid off. I doubt a bet on CUDA would have paid off in time to save the company. They definitely didn't have the resources to split that bet.


It's not like the specific push for AI on GPUs came out of nowhere either, Nvidia first shipped cuDNN in 2014.


Did you talk to anyone from Intel? It seems they were also present: https://community.intel.com/t5/Blogs/Tech-Innovation/Artific...


Well if Mojo and Modular Max Platform take off I guess there will be a path for AMD


Well,

"Modular to bring NVIDIA Accelerated Computing to the MAX Platform"

https://www.modular.com/blog/modular-partners-with-nvidia-to...


The whole point of Max is that you can compile same code to multiple targets without manually optimizing for a given target. They are obviously going to support NVIDIA as a target.


Yet you haven't seen any AMD or Intel deal from them.


'Cause they start with the target with the largest user base?


99%+ of people aren't writing kernels man, this doesn't mean anything, this is just silly


The news you've all been waiting for!

We are thrilled to announce that Hot Aisle Inc. proudly volunteered our system for Chips and Cheese to use in their benchmarking and performance showcase. This collaboration has demonstrated the exceptional capabilities of our hardware and further highlighted our commitment to cutting-edge technology.

Stay tuned for more exciting updates!


Thank you for loaning the box out! Has a lot more credibility than the vendor saying it runs fast


Thanks Jon, that's exactly the idea. About $12k worth of free compute on a box that costs as much as a Ferrari.

Funny that HN doesn't like my comment for some reason though.


Don't sweat it. Some people are trigger happy on downvoting things looking like self-promotion due to the sheer amount of spam everywhere. Your sponsorship (?) is the right way to promote your company. Thank you.


It reads like the kind of chumbox PR you read at the bottom of a random website. Get a copywriter or something like writer.ai. I thought your comment was spam and nearly flagged it. It really is atrocious copy.


I thought it was sarcastic.


[retracted]


Do you think this comment will make Hot Aisle more or less likely to loan out their hardware in the future?

Personally, I couldn't care less about the quality of copy. I do care about having access to similar hardware in the future.


Heh, I didn't even think of that, but you make a good point. Don't worry though, we will keep the access coming. I hate to say it, but it literally is... stay tuned for more exciting updates.


Thanks so much for doing that. There are loads of people here who really appreciate it. We will stay tuned!


This is the news that many people have been waiting for and we do have more exciting updates coming. There is another team on the system now doing testing. We have a list of 22 people currently waiting.


okay, I've retracted my comments. Thanks for your generosity.


Thanks, but I wouldn't call it generosity. We're helping AMD build a developer flywheel and that is very much to our benefit. The more developers using these chips, the more chips that are needed, the more we buy to rent out, the more our business grows.

Previously, this stuff was only available to HPC applications. We're trying to get these into the hands of more developers. Our view is that this is a great way to foster the ecosystem.

Our simple and competitive pricing reflects this as well.


All eyes are of course on AI, but with 192GB of VRAM I wonder if this or something like it could be good enough for high end production rendering. Pixar and co still use CPU clusters for all of their final frame rendering, even though the task is ostensibly a better fit for GPUs, mainly because their memory demands have usually been so far ahead of what even the biggest GPUs could offer.

Much like with AI, Nvidia has the software side of GPU production rendering locked down tight though so that's just as much of an uphill battle for AMD.


One missed opportunity from the game streaming bubble would be a 20-or-so player game where one big machine draws everything for everybody and streams it.


It would immediately prevent several classes of cheating. No more wallhacks or ESP.

Ironically the main type that'd still exist would be the vision-based external AI-powered target-highlighting and aim/fire assist.

The display is analysed and overlaid with helpful info (like enemies highlighted) and/or inputs are assisted (snap to visible enemies, and/or automatically pull trigger.)


Stuff like this is still of interest to me. There are some really compelling game ideas that only become possible once you look into modern HPC platforms and streaming.


My son and I have wargamed it a bit. The trouble is that there is a huge box of tricks used in open-world and other complex single-player games for conserving RAM that competes with just having a huge amount of RAM, and it is not so clear that the huge SMP machine with a huge GPU really comes out ahead in terms of creating a revolution in gaming.

In the case of Stadia, however, failing to develop this was like a sports team not playing any home games. One way of thinking about the current crisis of the games industry and VR is that building 3D worlds is too expensive, and a major part of that is all the shoehorning tricks the industry depends on. Better hardware for games could be about lowering development cost as opposed to making fancier graphics, but that tends to be a non-starter with companies whose core competence is getting 1000 highly-paid developers to struggle with difficult-to-use tools; the idea that you could do the same with 10 ordinary developers is threatening to them.


I am thinking beyond the scale of any given machine and traditional game engine architectures.

I am thinking of an entire datacenter purpose-built to host a single game world, with edge locations handling the last mile of client-side prediction, viewport rendering, streaming and batching of input events.

We already have a lot of the conceptual architecture figured out in places like the NYSE and CBOE - Processing hundreds of millions of events in less than a second on a single CPU core against one synchronous view of some world. We can do this with insane reliability and precision day after day. Many of the technology requirements that emerge from the single instance WoW path approximate what we have already accomplished in other domains.


EVE Online is more or less the closest to this so far, so it may be worth learning lessons from them (though I wouldn't suggest copying their approach: their Stackless Python behemoth codebase appears to contain many a horror). It's certainly a hard problem though, especially when you have a concentration of large numbers of players (which is inevitable when you create such a game world).


The question, though, is how you make something that complex and not have it be a horror, and whether Stackless Python is really the culprit of the horror vs anything else they could have built it in.


Curious what that is. Some kind of AR physics simulation?

I have been thinking about whether the compute could go right in cellphone towers, but this would take it up a notch.


Stadia was supposed to allow for really big games distributed across a cluster. Too bad it died in the crib.


I’d imagine ray tracing is a bit easier to parallelize over lots of older cards. The computations aren’t as heavily linked and are more fault tolerant. So I doubt anyone is paying H100-style premiums.


The computations are easily parallelized, sure, but the data feeding those computations isn't easily partitioned. Every parallel render node needs as much memory as a lone render node would, and GPUs typically have nowhere near enough for the highest of high end productions. Last I heard they were putting around 128GB to 256GB of RAM in their machines and that was a few years ago.


Pixar is paying a massive premium; they probably are using an order of magnitude or two more CPUs than they would if they could use GPUs. Using a hundred CPUs in place of a single H100 is a greater-than-h100 style premium.


Would Pixar's existing software run on GPUs without much work?


It does already, at least on Nvidia GPUs: https://rmanwiki.pixar.com/pages/viewpage.action?mobileBypas...

They currently only use the GPU mode for quick iteration on relatively small slices of data though, and then switch back to CPU mode for the big renders.


It's probably implemented way differently, but I worry about the driver suitability. Gaming benchmarks at least perform substantially worse on AI accelerators than even on GPUs that are many generations old; I wonder if this extends to custom graphics code too.


I work in this field, and I think so. This is actually the project I'm currently working on.

I'm betting that with current hardware and some clever tricks, we can resolve full production frames at real-time rates.


I hate the state of AMD's software for non-gamers. ROCm is a war crime (which has improved dramatically in the last two years and still sucks).

But like many have said, considering AMD was almost bankrupt, their performance is impressive. This really speaks for their hardware division. If only they could get the software side of things fixed!

Also I wonder if NVIDIA has an employee of the decade plaque for CUDA. Because CUDA is the best thing that could’ve happened to them.


I feel like these huge graphics cards with insane amounts of RAM are the moat that AI companies have been hoping for.

We can't possibly hope to run the kinds of models that run on 192GB of VRAM at home.


On the contrary, I'd argue the opposite. GPU VRAM has gotten faster, but the density isn't that good. 8GB used to be high end in the early 2000s, yet now 16GB can't even run games that well, especially if it's a studio that loves VRAM.

Side note: as someone who has been into machine learning for over 10 years, let me tell ya us hobbyists and researchers hunger for compute and memory.

VRAM isn't everything... I am well aware, but certain workflows really do benefit from heaps of VRAM, like VFX, CAD, and CFD. I still hold on to the dream of upgradable GPUs, where I could upgrade the different components just like you do on a computer: the computer is slow, so you upgrade RAM or storage, or get a faster chip that uses the same socket. GPUs could possibly see modularity with the processor, the VRAM, etc.

Level1Techs has some great videos about how PCIe is the future, where we can connect systems together using raw PCIe lanes, which is similar to how Nvidia Blackwell servers communicate with other servers in the rack.


Wasn't that just because of Nvidia's market segmentation?


Apple will gladly sell you a GPU with 192GB of memory, but your wallet won't like it.


Won't Nvidia, and Intel, and Qualcomm, and Falanx (who make the ARM Mali GPUs from what I can see), and Imagination Technologies (PowerVR) do the same? They each make a GPU, and if you pay them enough money I have a hard time believing they won't figure out how to slap enough RAM on a board for one of their existing products and make whatever changes are required.


The US government is looking into heavily limiting the availability of high-end GPUs from now on. And the biggest and most effective bottleneck for AI right now is VRAM.

So maybe Apple is happy to sell huge GPUs like that, but the government will probably put them under export controls like the A100 and H100 already are.


Cue the PowerMac G4 TV ad.

https://youtu.be/lb7EhYy-2RE


OTOH, it comes free with one of the finest Unix workstations ever made.


It's easy to be best when you have no competition. Linux exists for the rest of us.


It’s good even if compared to Linux. Not perfect, but certainly not bad.


Which Unix workstation?


They are referring to MacOS being included with expensive Mac hardware.


How many desktop systems can have 192GB visible to the GPU? How many cost less than a Mac?


Just because it has a lot of GPU RAM doesn't mean it's actually useful for people doing ML work.

How many companies use Macs for ML work instead of Nvidia and Cuda?


It won’t be as fast as a high-end GPU like the MI300 series, but it’s enough to check whether the code works before running it on a high-end GPU-heavy machine, and the large GPU-accessible RAM simplifies the code enormously, as you don’t have to partition and shuffle data between CPU and GPU.
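A small sketch of that workflow in PyTorch (my own illustration, nothing specific to the parent's setup): the same script picks Apple's Metal backend on a Mac and a discrete GPU elsewhere, so you can smoke-test locally before renting the big machine.

    import torch

    def pick_device() -> torch.device:
        # Prefer a discrete GPU (CUDA or ROCm), fall back to Apple's Metal
        # backend on a Mac, and finally to the CPU.
        if torch.cuda.is_available():
            return torch.device("cuda")
        if torch.backends.mps.is_available():
            return torch.device("mps")
        return torch.device("cpu")

    device = pick_device()
    # With unified memory, a large tensor fits without CPU<->GPU shuffling.
    x = torch.randn(2, 8192, 8192, device=device)
    y = torch.matmul(x, x)
    print(device, y.shape)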


Ok, that's the theory, but how many companies actually do that in their workflow? All the ML companies I've seen use CUDA directly from prototyping to production and don't bother with Apple ML unless their target happens to be exclusively iPhones.


Anyone doing heavy lifting and low-level tooling will do better to optimise for specialised training and inference engines. Usage will depend on where the abstraction layer is - if you want to see CUDA, then you'll need Nvidia. If all you care about is the size of the model and you know it's very large, then the Apple hardware becomes competitive.

Besides, you'd be well served with a Mac as a development desktop anyway.


Everyone has laptops now, though. Nobody's gonna carry a Mac Studio between home and office. And if you're gonna use your Mac just as an SSH machine, then you'll remote into an Nvidia data center anyway, not a Mac Studio.


I still have a Mac Mini on my desk in my home office, regardless of the laptops. If I were into crunching 192 gigabytes of numbers at a time, I’d get myself a Mac Studio.

At least until someone makes an MI300A workstation.


Sure, but then if you take your code to production to monetize it as a business, you won't be deploying on a datacenter of Mac Minis.

What you alone do at home is irrelevant for the ML market as a whole, along with your Mac Mini; you alone won't move the market, and the companies serious about ML are all-in on Nvidia and CUDA-compatible code for mass deployment.

I can also run some NNs on a microcontroller, but my hobby project won't move the market, and that's what I was talking about: the greater market, not your hobby project.


>> We can't possibly hope to run the kinds of models that run on 192GB of VRAM at home.

I'm looking to build a mini-ITX system with 256GB of RAM for my next build. The DDR5 spec can support that in 2 modules, but nobody makes them yet. No need for a GPU; I'm looking at the AMD APUs, which are getting into the 50 TOPS range. But yes, RAM seems to be the limiting factor. I'm a little surprised the memory companies aren't pushing harder for consumers to have that capacity.


128GB DDR5 module - https://store.supermicro.com/us_en/supermicro-hynix-128gb-28...

It is of course RDIMM, but you didn't specify what memory type you were looking at.


For inference you could use a maxed-out Mac Ultra; the RAM is shared between the CPU and GPU.


For single user (batch_size = 1), sure. But that is quite expensive in $/tok.


Even if the community provides support it could take years to reach the maturity of CUDA. So while it's good to have some competition, I doubt it will make any difference in the immediate future. Unless some of the big corporations in the market lean in heavily and support the framework.


If, and that's a big if, AMD can get ROCm working well for this chip, I don't think this will be a big problem.

ROCm can be spotty, especially on consumer cards, but for many models it does seem to work on their more expensive cards. It may be worth spending a few hours/days/weeks to work around the peculiarities of ROCm, given the cost difference between AMD and Nvidia in this market segment.

This all stands or falls with how well AMD can get ROCm to work. As this article states, it's nowhere near ready yet, but one or two updates can turn AMD's accelerators from "maybe in 5-10 years" to "we must consider this next time we order hardware".

I also wonder if AMD is going to put any effort into ROCm (or a similar framework) as a response to Qualcomm and other ARM manufacturers creaming them on AI stuff. If these Copilot PCs take off, we may see AMD invest into their AI compatibility libraries because of interest from both sides.
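
For what it's worth, a quick way to gauge whether a given ROCm setup is even in the "maybe usable" category: the ROCm builds of PyTorch reuse the torch.cuda API, so a minimal smoke test (assuming a ROCm wheel of PyTorch is installed) looks like this and tells you whether the card is visible and whether a basic fp16 matmul survives.

    import torch

    # On ROCm builds of PyTorch, HIP devices are exposed through the torch.cuda
    # API, so the same check works on both vendors' cards.
    print("accelerator visible:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("HIP runtime:", torch.version.hip)  # None on a CUDA build

        a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
        b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
        c = a @ b                                  # basic GEMM smoke test
        torch.cuda.synchronize()
        print("matmul finite:", c.isfinite().all().item())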


https://stratechery.com/2024/an-interview-with-amd-ceo-lisa-...

"One of the things that you mentioned earlier on software, very, very clear on how do we make that transition super easy for developers, and one of the great things about our acquisition of Xilinx is we acquired a phenomenal team of 5,000 people that included a tremendous software talent that is right now working on making AMD AI as easy to use as possible."


Oh no. Ohhhh nooooo. No, no, no!

Xilinx dev tools are awful. They are the ones who had Windows XP as the only supported dev environment for a product with guaranteed shipments through 2030. I saw Xilinx defend this state of affairs for over a decade. My entire FPGA-programming career was born, lived, and died, long after XP became irrelevant but before Xilinx moved past it, although I think they finally gave in some time around 2022. Still, Windows XP through 2030, and if you think that's bad wait until you hear about the actual software. These are not role models of dev experience.

In my, err, uncle? post I said that I was confused about where AMD was in the AI arms race. Now I know. They really are just this dysfunctional. Yikes.


Xilinx made triSYCL (https://github.com/triSYCL/triSYCL), so maybe there's some chance AMD invests first-class support for SYCL (an open standard from Khronos). That'd be nice. But I don't have much hope.


Comparing what AMD has done so far with SYCL, and what Intel has done with oneAPI, yeah, better not keep that hope flame burning.


this is honestly a very enlightening interview because - as pointed out at the time - Lisa Su is basically repeatedly asked about software and every single time she blatantly dodges the question and tries to steer the conversation back to her comfort-zone on hardware. https://news.ycombinator.com/item?id=40703420

> He tries to get a comment on the (in hindsight) not great design tradeoffs made by the Cell processor, which was hard to program for and so held back the PS3 at critical points in its lifecycle. It was a long time ago so there's been plenty of time to reflect on it, yet her only thought is "Perhaps one could say, if you look in hindsight, programmability is so important". That's it! In hindsight, programmability of your CPU is important! Then she immediately returns to hardware again, and saying how proud she was of the leaps in hardware made over the PS generations.

> He asks her if she'd stayed at IBM and taken over there, would she have avoided Gerstner's mistake of ignoring the cloud? Her answer is "I don’t know that I would’ve been on that path. I was a semiconductor person, I am a semiconductor person." - again, she seems to just reject on principle the idea that she would think about software, networking or systems architecture because she defines herself as an electronics person.

> Later Thompson tries harder to ram the point home, asking her "Where is the software piece of this? You can’t just be a hardware cowboy ... What is the reticence to software at AMD and how have you worked to change that?" and she just point-blank denies AMD has ever had a problem with software. Later she claims everything works out of the box with AMD and seems to imply that ROCm hardly matters because everyone is just programming against PyTorch anyway!

> The final blow comes when he asks her about ChatGPT. A pivotal moment that catapulted her competitor to absolute dominance, apparently catching AMD unaware. Thompson asks her what her response was. Was she surprised? Maybe she realized this was an all-hands-on-deck moment? What did NVIDIA do right that you missed? Answer: no, we always knew and have always been good at AI. NVIDIA did nothing different to us.

> The whole interview is just astonishing. Put under pressure to reflect on her market position, again and again Su retreats to outright denial and management waffle about "product arcs". It seems to be her go-to safe space. It's certainly possible she just decided to play it all as low key as possible and not say anything interesting to protect the share price, but if I was an analyst looking for signs of a quick turnaround in strategy there's no sign of that here.

not expecting a heartfelt postmortem about how things got to be this bad, but you can very easily make this question go away too, simply by acknowledging that it's a focus and you're working on driving change and blah blah. you really don't have to worry about crushing some analyst's mindshare on AMD's software stack because nobody is crazy enough to think that AMD's software isn't horrendously behind at the present moment.

and frankly that's literally how she's governed as far as software too. ROCm is barely a concern. Support base/install base, obviously not a concern. DLSS competitiveness, obviously not a concern. Conventional gaming devrel: obviously not a concern. She wants to ship the hardware and be done with it, but that's not how products are built and released in the 2020s anymore.

NVIDIA is out here building integrated systems that you build your code on and away you go. They run NVIDIA-written CUDA libraries, NVIDIA drivers, on NVIDIA-built networks and stacks. AMD can't run the sample packages in ROCm stably (as geohot discovered) on a supported configuration of hardware/software, even after hours of debugging just to get it that far. AMD doesn't even think drivers/runtime is a thing they should have to write, let alone a software library for the ecosystem.

"just a small family company (bigger than NVIDIA, until very recently) who can't possibly afford to hire developers for all the verticals they want to be in". But like, they spent $50b on a single acquisition, they spent $12b in stock buybacks over 2 years, they have money, just not for this.


So I knew that AMD's compute stack was a buggy mess -- nobody starts out wanting to pay more for less and I had to learn the hard way how big of a gap there was between AMD's paper specs and their actual offerings -- and I also knew that Nvidia had a huge edge at the cutting edge of things, if you need gigashaders or execution reordering or whatever, but ML isn't any of that. The calculations are "just" matrix multiplication, or not far off.

I would have thought AMD could have scrambled to fix their bugs, at least the matmul related ones, scrambled to shore up torch compatibility or whatever was needed for LLM training, and pushed something out the door that might not have been top-of-market but could at least have taken advantage of the opportunity provided by 80% margins from team green. I thought the green moat was maybe a year wide and tens of millions deep (enough for a team to test the bugs, a team to fix the bugs, time to ramp, and time to make it happen). But here we are, multiple years and trillions in market cap delta later, and AMD still seems to be completely non-viable. What happened? Did they go into denial about the bugs? Did they fix the bugs but the industry still doesn't trust them?


It's roughly that the AMD tech works reasonably well on HPC and less convincingly on "normal" hardware/systems. So a lot of AMD internal people think the stack is solid because it works well on their precisely configured dev machines and on the commercially supported clusters.

Other people think it's buggy and useless because that's the experience on some other platforms.

This state of affairs isn't great. It could be worse but it could certainly be much better.


If we're extremely lucky they might invest in SYCL and we'll see an Intel/AMD open-source teamup


This seems like the option that would make the most sense. If developers can "write once, run everywhere", they might as well do that instead of Cuda. But if they have to "write once, run on Intel, or AMD, or Nvidia", why would they bother with anything other than Nvidia considering their market share? If you're an underdog you go for open standards that makes it easy to switch to your products, but it seems like AMD have seen Nvidia's Cuda and jealously decided they wanted their own version, but 15 years too late.


> Qualcomm and other ARM manufacturers creaming them on AI stuff

That's mostly on Microsoft's DirectML though. I'm not sure whether AMD's implementation is based on ROCm (doubt it).


You do know that Microsoft, Oracle, Meta are all in on this right?

Heck I think it is being used to run ChatGPT 3.5 and 4 services.


I feel like people forget that AMD has huge contracts with Microsoft, Valve, Sony, etc to design consoles at scale. It's an invisible provider as most folks don't even realize their Xbox and their Playstation are both AMD.

When you're providing semi-custom chip designs at that scale, it makes a lot more sense that companies would be willing to try a more affordable alternative to Nvidia hardware.

My bet is that AMD figures out a serviceable solution for some (not all) workloads that isn't groundbreaking, but is affordable to the clients that want an alternative. That's usually how this goes for AMD, in my experience.


If you read/listen to the Stratechery interview with Lisa Su, she spelled out being open to customizing AMD hardware to meet partners' needs. So if Microsoft needs more memory bandwidth and less compute, AMD will build something just for them based on what they have now. If Meta wants 10% less power consumption (and cooling) for a 5% hit in compute, AMD will hear them out too. We'll see if that hardware customization strategy works outside of consoles.


It certainly helps differentiate from NVIDIA's "Don't even think about putting our chips on a PCB we haven't vetted" approach.


Yeah, but they will be using internal Microsoft and Meta software stacks, nothing that will dent CUDA.


>I feel like people forget that AMD has huge contracts with Microsoft, Valve, Sony, etc to design consoles at scale.

Nobody forgot that; it's just that those console chips are super low margin, which is why Intel and Nvidia stopped catering to that market after the Xbox/PS3 generations, and only AMD took it up because they were broke and every penny mattered to them.

Nvidia did a brief stint with the Shield/Switch because they were trying to get into the Android/ARM space, and also kinda gave up due to the margins.


A market that keeps being discussed as reaching its end, as newer generations aren't that much into traditional game consoles, and both Sony and Microsoft[0] have to reach out to PCs and mobile devices to achieve sales growth.

Among the gamer community, the discussion of this being the last console generation keeps popping up.

[0] - Nintendo is more than happy to keep redoing their hit franchises on good-enough hardware.


On the other hand, AMD has had a decade of watching CUDA eat their lunch and done basically nothing to change the situation.


AMD tries to compete in hardware with Intel’s CPUs and Nvidia’s GPUs. They have to slack somewhere, and software seems to be where. It isn’t any surprise that they can’t keep up on every front, but it does mean they can freely bring in partners whose core competency is software and work with them without any caveats.

Not sure why they haven’t managed to execute on that yet, but the partners must be pretty motivated now, right? I’m sure they don’t love doing business at Nvidia’s leisure.


Hardware is useless without software to show it off.


when was the last time AMD hardware was keeping up with NVIDIA? 2014?


Been a while since AMD had the top tier offering, but it has been trading blows in the middle tier segment the entire time. If you are just looking for a gamer card (ie not max AI performance), the AMD is typically cheaper and less power hungry than the equivalent Nvidia.


It’s trading blows because AMD sells their cards at lower margins in the midrange and Nvidia lets them.


But, the fact that Nvidia cards command higher margins also reflects their better software stack, right? Nvidia “lets them” trade blows in the midrange, or, equivalently, Nvidia is receiving the reward of their software investments: even their midrange hardware commands a premium.


> the AMD is typically cheaper and less power hungry than the equivalent Nvidia

cheaper is true, but less power hungry is absolutely not true, which is kind of my point.


It was true with RDNA 2. RDNA 3 regressed on this a bit, supposedly there was a hardware hiccup that prevented them from hitting frequency and voltage targets that they were hoping to reach.

In any case they're only slightly behind, not crazy far behind like Intel is.


The MI300X sounds like it is competitive, haha


Competitive with the H100 for inference, which is a 2-year-old product and just one half of the ML story. The H200 (and potentially the B100) is the appropriate comparison, based on when each is in volume production.


I have read in a few places that Microsoft is using AMD for inference to run ChatGPT. If I recall they said the price/performance was better.

I'm curious if that's just because they can't get enough Nvidia GPUs or if the price/performance is actually that much better.


Most likely it really is better overall.

Think of it this way: AMD is pretty good at hardware, so there's no reason to think that the raw difference in terms of flops is significant in either direction. It may go in AMD's favor sometimes and Nvidia's other times.

What AMD traditionally couldn't do was software, so those AMD GPUs are sold at a discount (compared to Nvidia), giving you better price/performance if you can use them.

Surely Microsoft is operating GPUs at large enough scale that they can pay a few people to paper over the software deficiencies so that they can use the AMD GPUs and still end up ahead in terms of overall price/performance.


Something like Triton from Microsoft/OpenAI as a CUDA bypass? Or PyTorch/TensorFlow targeting ROCm without user intervention.

Or there's OpenMP or HIP. In extremis, OpenCL.

I think the language stack is fine at this point. The moat isn't in CUDA the tech; it's in code running reliably on Nvidia's stack, without things like stray pointers needing a machine reboot. Hard to know how far off robust ROCm is at this point.
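
To make the "CUDA bypass" idea concrete: a Triton kernel is plain Python, and in principle the same source targets either vendor's backend; whether it runs reliably on ROCm is exactly the open question. A minimal vector-add sketch (the usual tutorial pattern):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x, y):
        out = torch.empty_like(x)
        n = out.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

    x = torch.randn(1 << 20, device="cuda")  # "cuda" also means HIP on ROCm builds
    y = torch.randn(1 << 20, device="cuda")
    assert torch.allclose(add(x, y), x + y)

Nothing in that source mentions a vendor; the question is only whether the compiler and runtime underneath hold up.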


The problem is that we all have a lot of FUD (for good reasons). It's on AMD to solve that problem publicly. They need to make it easier to understand what is supported so far and what's not.

For example, for bitsandbytes (a common dependency in the LLM world) there's a ROCm fork that the AMD maintainers are trying to merge in (https://github.com/TimDettmers/bitsandbytes/issues/107). Meanwhile, an Intel employee merged a change that added a common device abstraction (presumably usable by AMD + Apple + Intel etc.).

There's a lot of that right now: a super popular package that is CUDA-only navigating how to make it work correctly with any other accelerator. We just need more information on what is supported.


I remember years ago one of the AMD APUs had the CPU and GPU on the same die, and could exchange ownership of CPU and GPU memory with just a pointer change or some other small bit of accounting.

Has this returned? Because for dual GPU/CPU workloads (AlphaZero, etc.) that would deliver effectively “infinite bandwidth” between GPU and CPU. Using an APU of course gets you huge amounts of slowish memory, but being able to fling things around with abandon would be an advantage, particularly for development.


You don't need to change the pointer value. The GPU and the CPU have the same page table structures and both use the same pointer representation for "somewhere in common memory".

On the GPU there are additional pointer types for different local memory, e.g. LDS is a uint16_t indexing from zero. But even there you can still have a single pointer to "somewhere" and when you store to it with a single flat addressing instruction the hardware sorts out whether it's pointing to somewhere in GPU stack or somewhere on the CPU.

This works really well for tables of data. It's a bit of a nuisance for code as the function pointer is aimed at somewhere in memory and whether that's to some x86 or to some gcn depends on where you got the pointer from, and jumping to gcn code from within x86 means exactly what it sounds like.


I'm not sure it was "pointers", but it was some very low-cost way to change ownership of memory between the CPU and GPU.

They had some fancy marketing name for it at the time. But it wasn't on all chips; it should have been. Even if it was dog slow between a PCIe GPU and the CPU, the unified interface would have been the right way to go. It's also amenable to automated scheduling.

The point still stands though: I want entirely unified GPU and CPU memory.


The unified address space with moving pages between CPU and GPU on page fault works on some discrete GPU systems but it's a bit of a performance risk compared to keeping the pages on the same device.

Fundamentally if you've got separate blocks of memory tied together by pcie then it's either annoying copying data across or a potential performance problem doing it behind the scenes.

A single block of memory that everything has direct access to is much better. It works very neatly on the APU systems.


> Fundamentally if you've got separate blocks of memory tied together by pcie then it's either annoying copying data across or a potential performance problem doing it behind the scenes.

Well, as I said that's amenable to automated planning.

But what I really, really want is a nice APU with 512GB+ of memory that both the CPU and GPU can access willy nilly.


Yep, that's what I want too. The future is now.

The MI300A is an APU with 128GB on the package. They come in four-socket systems, so that's 512GB of cache-coherent machine with 96 fast x64 cores and many GCN cores. Quite like a node from El Capitan.

I'm delighted with the hardware and not very impressed with the GPU offloading languages for programming it. The GCN and x64 cores are very much equal peers on the machine, the asymmetry baked into the languages grates on me.

(on non-apu systems, moving the data around in the background such that the latency is hidden is a nice idea and horrendously difficult to do for arbitrary workloads)


Probably thinking of this https://en.m.wikipedia.org/wiki/Heterogeneous_System_Archite...

> Even if it was dog slow between PCIe GPU and CPU the unified interface would have been the right way to go

That is actually what happened. You can directly access pinned cpu memory over pcie on discrete gpus.
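
To illustrate that zero-copy path on a discrete GPU, here's a sketch using Numba's CUDA bindings (just as an example of the mechanism; the AMD/HSA equivalent differs in API): a mapped array lives in pinned host RAM but is directly addressable from GPU kernels, so there are no explicit copies, only PCIe traffic on each access.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(a, factor):
        i = cuda.grid(1)
        if i < a.size:
            a[i] *= factor

    # mapped_array allocates pinned host memory that is mapped into the GPU's
    # address space; the kernel reads/writes it over PCIe with no explicit copy.
    a = cuda.mapped_array(1 << 20, dtype=np.float32)
    a[:] = 1.0
    threads = 256
    blocks = (a.size + threads - 1) // threads
    scale[blocks, threads](a, 2.0)
    cuda.synchronize()
    print(a[:4])  # [2. 2. 2. 2.]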


I assume the MI300A APU also supports zero-copy. Because MI300X is a separate chip you necessarily have to copy data over PCIe to get it into the GPU.


One day someone will build a workstation around that chip. One day…


I'm surprised at the simplicity of the formula in the paragraph below. Could someone explain the relationship between model size, memory bandwidth and token/s as they calculated here?

> Taking LLaMA 3 70B as an example, in float16 the weights are approximately 140GB, and the generation context adds another ~2GB. MI300X’s theoretical maximum is 5.3TB/second, which gives us a hard upper limit of (5300 / 142) = ~37.2 tokens per second.


From Cheese (they don't have a HN account, so I'm posting for them):

Each weight is an FP16 float, which is 2 bytes of data, and you have 70B weights, so the total amount of data the weights take up is 140GB; then you have a couple of extra GBs for the context.

Then to figure out the theoretical tokens per second, you just divide the memory bandwidth, 5300GB/s in the MI300X's case, by the amount of data that the weights and context take up, so 5300/142, which is about 37 tokens per second.
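
In Python, the same arithmetic (just restating the numbers above):

    bytes_per_param = 2                  # float16
    params = 70e9                        # LLaMA 3 70B
    weights_gb = params * bytes_per_param / 1e9   # ~140 GB
    context_gb = 2                       # per the article
    bandwidth_gbs = 5300                 # MI300X theoretical peak

    # Each generated token has to stream all weights (plus context) from HBM
    # once, so bandwidth / bytes-read-per-token is the single-stream ceiling.
    tokens_per_s = bandwidth_gbs / (weights_gb + context_gb)
    print(round(tokens_per_s, 1))        # ~37.3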


So am I correct in understanding what they really mean is 37 full forward passes per second?

In which case, if the model weights are fitting in the VRAM and are already loaded, why does the bandwidth impact the rate of tok/s?


You have to get those weights from the RAM to the floating point unit. The bandwidth here is the rate at which you can do that.

The weights are not really reused. Which means they are never in registers, or in L1/L2/L3 caches. They are always in VRAM and always need to be loaded back in again.

However, if you are batching multiple separate inputs, you can reuse each weight on each input, in which case you may not be entirely bandwidth bound and this analysis breaks down a bit. Basically, you can't produce a single stream of tokens any faster than this rate, but you can produce more than one stream of tokens at this rate.


37 somethings per second doesn’t sound fast at all. You need to remember it’s 37 ridiculously difficult things per second.


AFAIK generating a single token requires reading all the weights from RAM. So 5300 GB/s total memory bandwidth / 142 GB weights = ~37.2 tokens per second.


That would be higher with batching, right? (5300 / 144) * 2 = ~73.6 and so on.


Good. If there is even a slight suspicion that the best value is team red in 5 or 10 years, then CUDA will look a lot less attractive already today.


> Taking LLaMA 3 70B as an example, in float16 the weights are approximately 140GB, and the generation context adds another ~2GB. MI300X’s theoretical maximum is 5.3TB/second, which gives us a hard upper limit of (5300 / 142) = ~37.2 tokens per second.

I think they mean 37.2 forward passes per second. And at 4008 tokens per second (from the "LLaMA3-70B Inference" chart) it means they were using a batch size of ~107 (if using that math, but probably not correct). Right?
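
Back-of-the-envelope, under that same (over-simplified) model where every sequence in the batch shares one read of the weights per forward pass:

    single_stream = 5300 / 142           # ~37.3 forward passes per second
    reported_throughput = 4008           # tokens/s from the article's chart
    implied_batch = reported_throughput / single_stream
    print(round(implied_batch))          # ~107, i.e. a batch on the order of 100

In reality the KV cache grows with batch size and sequence length, so the true batch is probably somewhat different, but it gives the right order of magnitude.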


So just out of curiosity, what does this thing cost?


Pricing is strictly NDA. AMD does not give it out.


The rumors say $20k. Nothing official though.


Fantastic to see.

The MI300X does memory bandwidth better than anything else by a ridiculous margin, up and down the cache hierarchy.

It did not score very well on global atomics.

So yeah, that seems about right. If you manage to light up the hardware, lots and lots of number crunching for you.


I wonder if the human body could grow artificial kidneys, so that I could just sell infinite kidneys and manage to afford a couple of these to do AI training on my own hardware.


why not infinite brains so you can have more computational power than these GPUs?


Apparently one of those costs around $15K. I don't know if you can buy just a couple or if they only sell them in massive batches, but in any case, how many human kidneys do you need to sell to get $30K?


It would be great to have real world inference benchmarks for LLMs. These aren't it.

That means e.g. 8xH100 with TensorRT-LLM / vLLM vs 8xMI300X with vLLM, running many concurrent requests with a reasonable number of input and output tokens, in both fp8 and fp16.

Most of the benchmarks I've seen had setups that no one would use in production. For example, running on a single MI300X or 2xH100 will likely be memory bound; you need to go to higher batch sizes (more VRAM) to be compute bound and properly utilize these. Or they benchmark requests with an unrealistically low number of input tokens.
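
As a sketch of a more representative setup (hypothetical model name and parameters; vLLM's Python API, which has both CUDA and ROCm builds):

    from vllm import LLM, SamplingParams

    # Tensor-parallel across all 8 accelerators in the node (8xMI300X or 8xH100),
    # fp16 weights; an fp8-quantized checkpoint would be the other configuration
    # worth comparing.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-70B-Instruct",  # example model
        tensor_parallel_size=8,
        dtype="float16",
    )

    # Many concurrent requests with realistic prompt lengths, not a single short
    # one; vLLM batches these internally, which is what pushes the hardware into
    # the compute-bound regime.
    prompts = ["Summarize the following document: ..."] * 256
    params = SamplingParams(max_tokens=512, temperature=0.8)
    outputs = llm.generate(prompts, params)
    print(len(outputs), "completions")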


Would you like to sign up for free time on our system to do it "right"?

https://hotaisle.xyz/free-compute-offer/


Would be interesting to see a workstation based on the version with a couple x86 dies, the MI300A. Oddly enough, it’d need a discrete GPU.


Without first-class CUDA translation or cross compile, AMD is just throwing more transistors at the void


Given the number of people who need the compute but are only accessing it via APIs like HuggingFace's transformers library, which supports these chips, I don't really think that CUDA support is absolutely essential.

Most kernels are super quick to rewrite, and higher level abstractions like PyTorch and JAX make dealing with CUDA a pretty rare experience for most people making use of large clusters and small installs. And if you have the money to build a big cluster, you can probably also hire the engineers to port your framework to the right AMD library.
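
Concretely, most users of the transformers library never touch a kernel at all; something like the following (model name is just a placeholder) is the whole extent of their interaction with the accelerator, and it looks identical on a ROCm build of PyTorch:

    from transformers import pipeline

    # device_map="auto" (via accelerate) spreads the model over whatever
    # accelerators are visible; the CUDA-vs-ROCm distinction lives in the
    # underlying PyTorch build, not in this code.
    generator = pipeline(
        "text-generation",
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder, any causal LM
        device_map="auto",
    )
    print(generator("The MI300X has", max_new_tokens=32)[0]["generated_text"])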

The world has changed a lot!

The bigger challenge is that if you are starting up, why in the world would you give yourself the additional challenge of going off the beaten path? It's not just CUDA but the whole infrastructure of clusters and networking that really gives NVIDIA an edge, in addition to knowing that they are going to stick around in the market, whereas AMD might leave it tomorrow.


When buying a supercomputer, you negotiate support contracts so it doesn't matter if AMD leaves the day after they sign the contract, you've still got your supercomputer and support for it.


True, but that works only for the current round of hardware. NVIDIA will be around for the next decade to support future clusters, too.


I agree they need to work on their software, but I also think that, given the limited availability and massive expense of the H100, AMD can undercut Nvidia and build a developer ecosystem if they want to. I think they need to hit the consumer market pretty hard and get all the local llama people hacking on the software and drivers to make things work. A cheaper, large-VRAM consumer card would go a long way toward getting a developer ecosystem behind them.


Have you looked at ZLUDA?

Edit: Or HIPIFY, a tool from AMD that translates CUDA source to HIP. https://github.com/ROCm/HIPIFY/blob/amd-staging/README.md


i worked there. they see software as a cost center, they should fix their mentality.


from the summary:

"When it is all said and done, MI300X is a very impressive piece of hardware. However, the software side suffers from a chicken-and-egg dilemma. Developers are hesitant to invest in a platform with limited adoption, but the platform also depends on their support. Hopefully the software side of the equation gets into ship shape. Should that happen, AMD would be a serious competitor to NVIDIA."


IMO Nvidia is going to force companies to fix this. Nvidia has always made it clear they will increase prices and capture 90% of your profits when left free to do so; see any example from the GPU vendor space. There isn't infinite money to be spent per token, so it's not like the AI companies can just increase prices.

That AMD can offer this product at a 40% discount and still make money tells you all you need to know.


I'm personally wondering when nVidia will open an AI AppStore, and every app that runs on nVidia hardware will have to be notarized first, and you'll have to pay 30% of your profits to nVidia.

History has shown that this idea is not as crazy as it sounds.


Oh, that is a great example: just wipe out Hugging Face, Ollama, Together.ai, and probably 20 more. They could host it all themselves or require their vendors to lease some time to them at cost.


OTOH, the performance advantage compared to the H100 is super-impressive according to tfa. Things could become interesting again in the GPU market.


AMD should be giving out these units to whichever clouds are willing to host them, so they can get them in the hands of developers to induce demand via working software


You are totally right.



