
CFD is highly memory-bandwidth-bottlenecked, it is in fact pretty much the prototypical memory-bandwidth-bottlenecked task.

The performance scaling you see between systems pretty much corresponds to the memory bandwidth in those configurations.
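
To make that concrete, here's a rough sketch (not taken from any particular solver) of the kind of stencil update CFD codes spend most of their time in: only a few flops per grid point against several 8-byte loads and stores, so throughput ends up limited by how fast DRAM can feed the cores rather than by FLOPS.

    #include <stddef.h>

    /* One Jacobi-style sweep over an n x n grid of doubles: roughly 4 flops
       per point against several 8-byte reads plus a write, i.e. well under
       1 flop per byte. Modern cores can retire far more flops than DRAM can
       supply operands for, so runtime tracks memory bandwidth. */
    void jacobi_sweep(size_t n, const double *restrict in, double *restrict out)
    {
        for (size_t i = 1; i + 1 < n; i++)
            for (size_t j = 1; j + 1 < n; j++)
                out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                                       + in[i * n + j - 1]   + in[i * n + j + 1]);
    }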

Note that on the M1, the CPU can only access a fraction (about 25% iirc) of the total memory bandwidth; you have to use the GPU to really get the full performance of the M1 here.

I saw this quantified (I think at anandtech), something like 220GB/sec out of 400GB/sec on the M1 max. So something north of 50%.

Also keep in mind that typical x86-64 systems, even without an IGP, only get about 60-65% of peak, even with nothing else sharing the memory bus. I often see this quantified with McCalpin's STREAM benchmark.
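
For anyone who hasn't seen it, the kernel STREAM times is about as plain as memory access gets; a minimal sketch of its "triad" loop (the real benchmark adds timing, repetitions, and result checks):

    #include <stddef.h>

    /* STREAM "triad": a[i] = b[i] + scalar * c[i].
       Reported bandwidth counts 24 bytes per iteration (two 8-byte reads
       plus one 8-byte write), with arrays sized far larger than cache so
       the result reflects DRAM rather than cache throughput. */
    void stream_triad(size_t n, double *restrict a, const double *restrict b,
                      const double *restrict c, double scalar)
    {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }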

So the M1 Ultra likely has a pretty impressive memory bandwidth of around 440GB/sec, which isn't a large fraction of 800GB/sec, but it's still more than any other desktop or server chip I know of. AMD's Epyc maxes out at 8 channels of DDR4-3200, which is in the neighborhood of 205GB/sec peak, with an observed bandwidth of 110-120GB/sec.


> I saw this quantified (I think at anandtech)

Correct. The numbers we have are from their M1 Max deep dive, with the M1 Ultra being two M1 Max chips fused together.

For CPU cores:

>Adding a third thread there’s a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s and this appears to be the limit on the SoC fabric that the CPUs are able to achieve, as adding additional cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which are in their own cluster, are added in, when the bandwidth is able to jump up again, to a maximum of 243GB/s.

https://www.anandtech.com/show/17024/apple-m1-max-performanc...

GPU cores:

>I haven’t seen the GPU use more than 90GB/s (measured via system performance counters). While I’m sure there’s some productivity workload out there where the GPU is able to stretch its legs, we haven’t been able to identify them yet.

Other:

>That leaves everything else which is on the SoC, media engine, NPU, and just workloads that would simply stress all parts of the chip at the same time. The new media engine on the M1 Pro and Max are now able to decode and encode ProRes RAW formats, the above clip is a 5K 12bit sample with a bitrate of 1.59Gbps, and the M1 Max is not only able to play it back in real-time, it’s able to do it at multiple times the speed, with seamless immediate seeking. Doing the same thing on my 5900X machine results in single-digit frames. The SoC DRAM bandwidth while seeking around was at around 40-50GB/s – I imagine that workloads that stress CPU, GPU, media engines all at the same time would be able to take advantage of the full system memory bandwidth, and allow the M1 Max to stretch its legs and differentiate itself more from the M1 Pro and other systems.


> of around 440GB/sec, which isn't a large fraction of 800GB/sec

Where's the 800GB/s from?


Peak = never observed, just calculated from transfer rate * bus width. Much like the speed of light, you'll never see it.

That number for the M1 Ultra (from the OP's post) = 800GB/sec. McCalpin's STREAM benchmark is often cited as a practical number for usable bandwidth: a straightforward implementation in C or Fortran that doesn't try to play games, much like the vast majority of codes out there.
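
As a back-of-the-envelope check (the bus widths here are my assumption: a 1024-bit LPDDR5-6400 interface on the M1 Ultra, 8 channels of 64-bit DDR4-3200 on Epyc):

    #include <stdio.h>

    int main(void)
    {
        /* Peak DRAM bandwidth = transfer rate * bus width in bytes.
           Assumed widths: 1024-bit LPDDR5-6400 (M1 Ultra),
           8 x 64-bit DDR4-3200 (Epyc). */
        double m1_ultra = 6400e6 * 128.0 / 1e9;   /* 819.2 GB/s, quoted as 800 */
        double epyc     = 3200e6 * 8 * 8.0 / 1e9; /* 204.8 GB/s */
        printf("M1 Ultra peak: %.1f GB/s\n", m1_ultra);
        printf("Epyc peak:     %.1f GB/s\n", epyc);
        return 0;
    }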

Also note that x86-64 chips use a strict memory model, which results in a lower fraction of observed bandwidth vs. peak. Arm has a weaker memory model, which lets it achieve a higher fraction of peak.


> memory-bandwidth-bottlenecked task.

What's interesting to me about this is: I hate the on-package memory. I hate the idea of not being able to upgrade after you order, or a few years down the road. It has practical problems, and it offends my inner nerd.

But.

This is a useful example of why there's value in it. It seems unlikely that you'd get as good a result with a traditional memory architecture.


People who need this sort of system aren't going to be looking to expand memory capacity "a few years down the road." They'll be looking to purchase or lease the current model.

Also, if you really need more memory, buy the config with more memory and sell the old one. Resale value on Macs is usually superb.

I can't remember hearing a friend or gaming buddy say "I upgraded my RAM." I haven't upgraded RAM in any machine I've owned in twenty years, and even before that it was rare. There was never any point in putting that sort of money into an out-of-date CPU and memory architecture.

If you owned a trashcan Mac Pro and were a working creative, would you be upgrading its memory this month? Nope...


I contest this claim. I'm a professional, and I know I'm not alone in buying an iMac 5K with 16 GiB or less explicitly to upgrade it later with much more third-party (e.g. Crucial) memory for far less than Apple charges.

That said, if the memory is fixed like on the Ultra, then sure, we'll buy what we need (and a bit more).


I think what people are really looking for is expandable memory so that you can buy a base version, and then add cheaper 3rd party RAM.


I’m hoping that the Mac Pro introduces a two-tier memory architecture so you can get the best of both worlds.
