Boosting Computational Fluid Dynamics Performance with AMD MI300X (amd.com)
47 points by latchkey 11 months ago | 45 comments


Now if only I could purchase one*

*I realize enterprise “GPUs” are difficult to purchase as an individual whether they’re AMD or Nvidia, but AMD might be well-served to mimic their RX 480 strategy (“make a competitive mid-range GPU, distribute it through several board partners, and sell it at rock-bottom prices to get it to as many customers as possible”). If there’s a compelling reason to use AMD hardware over Nvidia, the software support will come. As an extreme example, if one could purchase an MI300X for $50 a pop, surely FAANG and others would invest time and effort into the software side to leverage the cost savings vs Nvidia, who is unquestionably price-gouging due to their monopolistic hold on the enterprise GPU market.


It seems AMD has to choose between monetizing the GPUs right now by selling them at the maximum effective price, or forgoing that gain, selling the GPUs at cost, and building GPU market share for the future. It's the classic choice between short-term and long-term gain.


And the split into CDNA and RDNA comes from the same direction: market segmentation, to allow much higher prices for the CDNA data-center GPUs while keeping the gamer-focused RDNA GPUs affordable for mere mortals. Of course this backfires by making the powerful GPUs unavailable for almost anybody to experiment on.

Take this blog post about how great the MI300X is. Really, what do I care -- I'm not a billionaire.


> And the splitting into CDNA and RDNA comes from the same direction: market segmentation

Not really.

Wave64 on CDNA demonstrably delivers more throughput. But with most video game code written for NVidia's Wave32, reworking RDNA to be more NVidia-like and Wave32-based is how you reach better practical video game performance.

HPC will prefer the wider execution width, 64-bit (FP64) execution, and other benefits.

Video gamers will prefer the massive 32 MB+ of "Infinity Cache", which in practice is used for all kinds of screen-space calculations. But that would NEVER be used for fluid dynamics.


Maybe never by the big players, but RDNA and even fp32 are perfectly fine for a number of CFD algorithms and uses; Stable Fluids-like algorithms and Lagrangian Vortex Particle Methods to name two.


I'm talking about Wave64.

CDNA executes 64 threads per compute unit per clock tick; RDNA only executes 32. CDNA is smaller, more efficient, more parallel, and offers much higher compute than RDNA.

Furthermore, all ROCm code from GCN (and older) was written for Wave64, because AMD's architectures from 2010 through 2020 were historically Wave64. RDNA changed to Wave32 so that it could match NVidia and get slightly better latency characteristics (at the cost of bandwidth).

CDNA has more compute bandwidth and parallelism. RDNA is narrower, with lower latency and less parallelism. Building a GPU out of 2048-bit compute (i.e. 64 lanes x 32 bits wide, as in CDNA) is always going to deliver more bandwidth than 1024-bit compute (i.e. 32 lanes x 32 bits wide) like RDNA.
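
As a concrete illustration (my own sketch, not something from this thread): HIP reports the wavefront width through the device properties, so code can check it at runtime instead of assuming 64 (CDNA/GCN) or 32 (RDNA). Something like:

    // wave_width.cpp -- print each device's wavefront width (illustrative sketch)
    // assumed build line: hipcc wave_width.cpp -o wave_width
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
            std::printf("no HIP devices found\n");
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            hipDeviceProp_t prop{};
            if (hipGetDeviceProperties(&prop, i) != hipSuccess) continue;
            // CDNA/GCN parts report 64 here; RDNA parts report 32.
            std::printf("%s: wavefront size = %d\n", prop.name, prop.warpSize);
        }
        return 0;
    }

On a CDNA or GCN card this prints 64; on an RDNA card it prints 32, which is exactly the split being described above.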


I wasn't familiar with the "Wave32" term, but took "RDNA" to mean the smaller wavefront size. I've used both, and wave32 is still quite effective for CFD.


ROCm support for RDNA took like 2 years, maybe longer.

If you'd actually been using both, you'd know that CDNA was the only supported platform on ROCm for what felt like an eternity. That's because CDNA was designed to be as similar to GCN as possible so that ROCm could support it more easily.

--------

What I'm saying is that today, now that ROCm works on RDNA and CDNA, the two architectures can finally be unified into UDNA. And everyone should be happy with the state of code moving forward.


They’re unifying the architectures. AMD will move to UDNA for both gaming and data center. The next graphics cards after RDNA4 will be UDNA. Makes sense given how ML-heavy graphics has become.


The point is they shouldn't have done it in the first place. It was obvious right from the start that it was a bad idea, except maybe for temporarily boosting short-term profits.

The whole AMD AI/ML strategy feels like this: prioritize short-term profits and completely shoot themselves in the foot in the long term.


ROCm was clearly designed with Wave64 in mind. It was going to take years for ROCm to be reworked for RDNA's Wave32.

DirectX shaders, however, were already ready for Wave32 and the other architectural changes that RDNA brought. In fact, RDNA was basically AMD changing its architecture to be more "NVidia-like" in many regards (32-wide execution being the most noticeable).

CDNA existed because HPC has billion-dollar contracts with code written for Wave64 that still needs ROCm support. That means staying on the older GCN-like architecture and continuing to support, say, DPP instructions and other obscure GCN features.
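
To make that porting cost concrete, here is a sketch (mine, not from the comment) of the kind of wavefront-level reduction that GCN-era ROCm code routinely hard-coded to 64 lanes. Run as-is under a Wave32 RDNA build, the assumption breaks and the result is silently wrong:

    // Illustrative HIP kernel written against a 64-wide wavefront (GCN/CDNA style).
    // Compiled for Wave32 on RDNA, warpSize becomes 32: the first shuffle step has
    // no lane 32 positions away to read from, and the "% 64" bookkeeping drops half
    // the wavefronts, so the sum comes out wrong without any error being raised.
    __global__ void wave64_sum(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;

        // Hard-coded Wave64 tree reduction: 64 -> 32 -> 16 -> ... -> 1
        for (int offset = 32; offset > 0; offset >>= 1)
            v += __shfl_down(v, offset);

        // Lane 0 of each (assumed 64-wide) wavefront publishes its partial sum.
        if ((threadIdx.x % 64) == 0)
            atomicAdd(out, v);
    }

Fixing code like this means either compiling for Wave64 where the target allows it (e.g. the -mwavefrontsize64 option on RDNA targets) or rewriting the reduction in terms of warpSize, and doing that across an entire stack is part of why RDNA support in ROCm took as long as it did.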

---------

Remember how long it took for RDNA to get ROCm support? Did you want to screw the HPC customers for that whole time?

Splitting the two architectures, focusing ROCm on HPC (where the GPU-compute research money was in 2018), and focusing on better video game performance for RDNA (where the money is for consumer cards) just makes sense.


>The whole AMD AI/ML strategy feels like this - prioritize short term profits and completely shoot themselves in the foot in the long term.

That's what the stock market rewards.


You mean hard in terms of pricing? Because otherwise it's not really hard to buy enterprise GPUs at all.


Can you provide a link to where I can purchase the MI300X? The only way I seem to be able to buy one is to contact AMD Sales who will surely laugh at my request for “less than 1000 units.”


Depends on what country you are in. These are technically export controlled.


Disclosure: I have a business in this.

This is what I saw as well. As a developer, I wanted access to enterprise HPC compute, but I'm also not going to do a PhD just to play around with these things. So I got funding, started a business, and bought 8 of them as a proof-of-concept test. We got customers, got more funding, moved into a real datacenter, and bought 128 more. Crawl, walk, run.

You can now rent them by the minute from us for a few bucks an hour. Currently that's limited to docker containers for individual GPUs, but you can get a full bare-metal 8x box too (with BIOS access too!). Support for VMs is coming. If you want multiple boxes, we have the full 8x400G NICs as well. The boxes are fully loaded with tons of enterprise NVMe, RAM, and top core/clock Intel CPUs (not AMD, because Dell didn't have that as a solution).

Our model is to follow AMD's roadmap and buy/release their products as they come. We're currently debating the 325x and looking forward to / planning for the 355x.

Despite your desire, it will be a long time before there is a consumer version of these things, especially as they move to more and more complex deployments. Look at the NVL72 and the requirements around that... we can all guess where AMD is going: DC power rails in the racks, direct liquid cooling, massive power requirements. It is only getting more and more capex/opex intensive.

Let's also not forget that AMD is really just a hardware manufacturer. When you bought an RX 480 (I previously had 130,000 of these), it was from an OEM, like Sapphire, that could handle all the end-user support.

This is why the whole NeoCloud industry has sprung up. Large clouds can only handle this pace by selling thousands at a time on multi-year contracts. We are taking the long tail and have built a business around it. Short of doing everything we are doing yourself (which, trust me, is not easy), your best bet is to work with companies like mine to get access to this gear.


I recognize that selling enterprise hardware one or two units at a time to people like me is not cost effective and is why AMD isn’t doing it, but I don’t think there’s anything stopping them from relying on distribution partners like Sapphire, Gigabyte and XFX to handle everything but the GPU die. Demand would be low relative to consumer stuff, and after cutting these partners in they’re probably selling at cost or on thin margin, but again, if they want to carve out market share it’s going to be VERY slow-going if they continue with this “charge as much money as possible and only sell to datacenters” approach. Nvidia can do that because they’ve cornered the market.

Meanwhile I can’t even find an MI300X on eBay. I can at least poach enterprise Nvidia GPUs like the A100 on eBay. This tells me AMD’s shipping far fewer units and therefore enterprise GPUs aren’t doing much for their balance sheet (though I’d have to look at their quarterly and annual reports to know for certain). To me this strengthens the case for selling to individuals/startups, and at prices that offset the risk of picking AMD over Nvidia and potentially running into software shortcomings.

I’m set with two RTX 3090s at the moment, but it’s very neat that you’ve been able to bootstrap essentially a cloud service provider in the age of AWS, Azure and GCP (and DO and Vultr and Linode et al).


> I don’t think there’s anything stopping

There absolutely is. The current form factor is not standard PCIe. It is an OAM/UBB board custom-designed by AMD to support Infinity Fabric links, and it only comes in an 8x configuration. Now you're asking for a totally different design, and that requires a huge investment that would take away from their existing focus on enterprise.

> Meanwhile I can’t even find an MI300X on eBay.

https://www.ebay.com/itm/305850340813


I just want the card, not an entire server with a cluster of them. You can't get an MI300X on eBay. But yes, I didn't realize they don't come in the standard PCIe form factor. Their previous Instinct "GPUs" did.

They should have considered the RX 480 approach for the Instinct accelerators.


What you're asking for is HPC-level compute, but as a consumer product. They moved away from that market because they realized it wasn't going to compete with the AI farms that Nvidia is building with companies like CoreWeave. The 355x is going to move even further away because it'll require even more high-end datacenter deployments.

So, on one hand, people want them to compete with Nvidia, but on the other hand we want them to ignore the market that is actually going to make them money. As much as it would be nice to have, we can't have it both ways.

The middle ground is to rent a single one, billed by the minute, from us (or another neocloud). We handle all the detailed problems for you (don't forget the massive upfront capex spend), and you get to build your products/companies on top of that. Once you grow to the point of being able to buy your own equipment, we can even help you deploy it.


You're doing good work on your subreddit!



I'm pretty sure they're selling at capacity


Nope... IMO AMD cannot compete with Nvidia on software. I bought an RX 580 to test ROCm/TensorFlow... only for it to be a buggy mess whose support was dropped in the next version.

AMD needs to invest a fsck-load of money in software... Until then they can have the greatest compute cards in the world, but it will mean nothing.


Isn't the RX 580 like 4 generations behind now? It was released in 2017. Probably not the right card for anything AI.


Why would anyone buy a new AMD card if the expectation is that the new one won't work either after support runs out? I use AMD at home and Nvidia A100s at work. There is no need to upgrade an old GPU to a new GPU if all it does is act as a fancy iGPU.


Because only the new ones are officially supported by AMD's compute stack?


To be fair, Nvidia is the de facto king of "planned obsolescence" and "removing features for no reason other than to force people to pay more $$$". I think they're even worse than Apple in this regard.

My favorite of many examples is when Nvidia said "okay, fine, we'll allow customers to enable G-Sync (adaptive refresh) on any display, but only on 1000-series and newer GPUs!" There was no hardware limitation behind this; they just didn't want to give any 900-series holdouts a reason not to upgrade.

Then there's the arbitrary lock on using consumer Nvidia GPUs for passthrough on a type 1 hypervisor. Sometimes there are unofficial workarounds, but why do I have to go through that hassle at all? Ain't nobody realistically going to buy consumer-grade GPUs to throw in datacenter racks... but even if they were, so what? It's their hardware, they bought it, let them use it. There's no reason whatsoever to block passthrough on consumer cards except $$$.


Yes, this was years ago... pre-COVID... I thought the fact that I used TensorFlow as the target library would flag that for others.


I will say AMD seems to have come a long way since 2020/2021. I also messed around with an RX 580 and ROCm on Ubuntu Server, and while I wasn't as experienced in the world of ML/data science then, I remember encountering a roadblock or two when following the official install guide. I haven't had an AMD GPU since, but I've read and heard many good things. Supposedly TensorFlow and PyTorch work more or less out of the box with the right AMD GPU.


Our team has access to multiple systems that have either MI250Xs or H100s. Getting stuff to work with AMD/ROCm is substantially more effort than the NVIDIA/CUDA experience.

Some of this is a lack of groundwork/engineering by package maintainers or system administrators, but it seems a decent amount is the relative lack of effort by AMD to make things work well OOTB.


Installing ROCm is easy and well documented [0].

Anush (AMD's VP of AI software) has had a fire lit under his butt after the recent SemiAnalysis article [1] and is actively taking feedback on improving the experience. If you have specific things you'd like to see, I'm more than happy to forward them on to him (contact in my profile).

[0] https://rocm.docs.amd.com/en/latest/

[1] https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-b...


The real question is: is this from a lack of effort or simply from NVidia's head start? Will it get better?


> Will it get better?

It won't, not in any way that will make AMD anywhere near competitive with Nvidia.

AMD, unlike Nvidia, seems unable to prioritize developers. Here's a summary of last week's charlie-fox when the TinyGrad team attempted to get 2 MI300s for on-premises testing and was rebuffed by an AMD representative. https://x.com/dehypokriet/status/1879974587082912235


Both, really.

It will get better, but Nvidia seems to be creating CUDA libraries for all kinds of applications, so the moat is constantly widening/deepening.


Those are some sketchy bar charts. If the bar doesn't start at y=0 and you don't even have axis ticks, it's meaningless and just a colorful blob on your slide.


This looks like a really stupid benchmark, no? 4 million cells in a standard aerodynamic simulation isn't enough work to even saturate the host CPUs on these systems, never mind the GPUs, which have far "worse" roofline characteristics from a CFD-algorithm point of view.

I would not be surprised at all if these benchmarks ran faster if you removed the GPUs completely.


That's the smallest of 4 experiments. It goes up to 140 million cells, where the MI300X retains a similar performance advantage of around 10% over Nvidia's H100.


Yeah, but the lack of comparison to a CPU-only system is suspicious. (As is the exaggerated plot without ticks and labels on the y-axis.)

I've never seen a non-reactive incompressible flow simulation get substantial speedup on GPUs. There are well-understood fundamental reasons why this is the case.
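
For anyone wondering what those reasons are, the usual roofline argument goes roughly like this (my own back-of-envelope numbers, not from the article): the pressure Poisson solve at the heart of an incompressible solver is a sparse stencil/SpMV workload with very low arithmetic intensity, so it runs at memory speed rather than compute speed.

    % 7-point Laplacian stencil in double precision, assuming no cache reuse:
    I \approx \frac{\sim 13\ \text{flops}}{8 \times 8\ \text{bytes}} \approx 0.2\ \text{flop/byte}

    % attainable throughput on roughly 5 TB/s of HBM:
    P \approx I \cdot B \approx 0.2 \times 5\ \text{TB/s} \approx 1\ \text{TFLOP/s FP64}

Even generous cache reuse only lifts that intensity to around 1 flop/byte, still an order of magnitude short of what the hardware needs to approach peak FP64, so the solve stays bandwidth-bound and the GPU's compute advantage mostly goes unused.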


I helped develop a hydro solver coupled with radiation at Los Alamos this summer. We observed 7x to 15x performance improvements on a single GPU compared to a 64-core Intel CPU. [1]

Now granted, the flops-to-byte ratio for this program might be better than that of an average fluid simulator. Also, our performance tanked when we moved to a multi-node system. But I am aware of the underlying reasons behind the scalability issues, and they don't feel like problems that can't be overcome.

[1] https://github.com/lanl/HARD/tree/main


Nice, interesting solver! But yes, compressible/density-based solvers work a lot better on GPUs than incompressible/pressure-based solvers. When 80% of the runtime is spent solving a Poisson equation, GPU is not worth it.

Especially if you do the comparison on an equivalent-cost basis, i.e. "what is the walltime difference if I run on a $60k all-CPU cluster versus a $60k GPU cluster?" Or in terms of cloud compute cost / HPC allocation spend.


This has not been my experience on the academic/research side. Poisson-solver-based incompressible CFD regularly runs ~10x faster on equivalently priced GPU systems, and has been doing so since I started following it in 2008. Some FFT-based solvers don't weak-scale ideally, but that'd be even worse for CPU-based versions, as they use similar algorithms and would be spread over many more nodes.


It's tangential, but this is the first time I've seen Fluent installed by simply decompressing a tar, instead of executing their big installer.


warning - no pretty videos in this post



