Hacker News

Unified memory is old, every CPU with an integrated GPU is on unified memory. It's not doing anything at all for M1's general performance.

The main reason for M1's performance is just that Apple managed to get an 8-wide CPU to work & be fed. That's it. That's the entirety of the story of the M1. Apple got 8-wide to work, while Intel & AMD are still 4-wide. ARM's fixed length instructions are helping a ton there, but Apple also put in work to feed it.

All the other shit about specialized units & unified memory & vertical integration is all irrelevant and mostly wrong. The existing Intel Macbook Airs are all unified memory with specialized hardware units for specialized tasks, too (Intel QuickSync is 9 years old now - dedicated silicon for the specialized task of video encoding). Apple did absolutely nothing new or interesting on that front. Other than marketing. They are magic at marketing. And also magic at making a super wide CPU front-end, with really good cache latency numbers.



That's overly simplified. Larger reorder buffer, larger caches, lower-latency caches, more outstanding loads, etc.

They also managed, in a cheap platform (Mac mini = $700), to get 4266MHz memory working with about half the latency of any x86-64 I've seen. The Mac mini can fetch a random cacheline (assuming a relatively TLB-friendly pattern) in around 30-33ns.

It makes the usual Intel Core i7/i9 or Ryzen 5000 look rather quaint with their 60ns or higher memory latencies.


> They also managed, in a cheap platform (Mac mini = $700), to get 4266MHz memory working with about half the latency of any x86-64 I've seen. The Mac mini can fetch a random cacheline (assuming a relatively TLB-friendly pattern) in around 30-33ns.

Where are you getting those numbers? From https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

"In terms of memory latency, we’re seeing a (rather expected) reduction compared to the A14, measuring 96ns at 128MB full random test depth, compared to 102ns on the A14."

Which would put M1's DRAM latency at worse than a modern Intel or AMD desktop, which is measuring around 70-80ns: https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


I've replicated them myself with my own code, so I'm pretty confident. It doesn't hurt that my numbers match Anandtech's, at least for the range of arrays they use and only using a single thread.

On pretty much any current CPU, if you randomly access an array significantly larger than cache (12MB in the M1's case) you end up thrashing the TLB, which significantly increases latency. The number of pages that can be quickly accessed depends on the number of entries in the TLB.
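To put rough numbers on that (a back-of-the-envelope sketch; the 256 MiB array size and the TLB capacity are assumptions, not from the comment):

```python
# Rough arithmetic for why a big fully-random walk thrashes the TLB.
ARRAY = 256 * 2**20   # 256 MiB test array, far past the M1's 12 MB cache
PAGE = 4096           # 4 KiB pages
LINE = 64             # bytes per cache line

pages = ARRAY // PAGE           # distinct pages a full random walk touches
lines_per_page = PAGE // LINE   # cache lines that share one TLB entry
print(pages, lines_per_page)    # 65536 64

# Typical L2 TLBs hold on the order of 1-3k entries (assumption), so
# with ~65k hot pages nearly every access also pays a page walk on top
# of the DRAM access itself.
```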

To separate TLB latency from memory latency, I allow controlling the size of the sliding window used when randomizing the array, so that only a few pages are heavily used at any one time, while still visiting each cache line exactly once.

That's exactly what the brown "R per RV prange" line does. For more info see the description at: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10p...

My code builds an array, then does a Knuth shuffle, modified so the maximum shuffle distance is 64 pages, meaning the average swap distance is around 32 pages. I get a nice clean line at 34ns. With 2 or 4 threads I get a throughput (not true latency) of a cacheline every 21ns. With 8 threads (using the 4 slow and 4 fast cores) I get a somewhat better cacheline per 12.5ns.
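A minimal sketch of that windowed-shuffle idea (this is not the commenter's actual code, which presumably builds a real pointer-chase buffer in a compiled language; the function names and the 4 KiB page size are assumptions):

```python
import random

PAGE_LINES = 4096 // 64   # cache lines per 4 KiB page

def windowed_shuffle(n_lines, window_pages=64, seed=0):
    """Knuth shuffle where each swap moves an element at most
    window_pages pages, so only a small set of pages is hot at
    any point during the later pointer chase."""
    rng = random.Random(seed)
    order = list(range(n_lines))
    window = window_pages * PAGE_LINES
    for i in range(n_lines - 1, 0, -1):
        j = rng.randint(max(0, i - window), i)
        order[i], order[j] = order[j], order[i]
    return order

def build_chain(order):
    """Link the visit order into one cycle: nxt[a] = b means
    'after reading line a, read line b'. Chasing nxt visits every
    cache line exactly once before returning to the start."""
    nxt = [0] * len(order)
    for k, a in enumerate(order):
        nxt[a] = order[(k + 1) % len(order)]
    return nxt
```

Timing a tight loop of `cur = nxt[cur]` over such a chain gives the per-cacheline latency: each load depends on the previous one, so the CPU can't hide the miss with out-of-order execution.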

Pretty stunning to have latencies that low on a low-end $700 Mac mini that embarrasses machines costing 10x as much. Even high-end Epyc machines (200 watt TDP) with 8 x 64-bit memory channels have to try hard to get a cacheline every 10ns.


> Pretty stunning to have latencies that low on a low-end $700 Mac mini that embarrasses machines costing 10x as much. Even high-end Epyc machines (200 watt TDP) with 8 x 64-bit memory channels have to try hard to get a cacheline every 10ns.

Eh? That's not how memory latency works. The cheaper consumer chips with "non-spec" RAM and without ECC are regularly better here than the enterprise stuff. This isn't something that scales with price.


Sure, ECC and in particular registered memory do increase the latency a bit. But servers are designed for throughput and have multiple memory channels to better feed the large number of cores involved, up to 64 cores for the new AMD Epyc chips. The amazing thing is that the Apple M1 can fetch random cachelines almost as fast as a current AMD Epyc.


You're confusing throughput & latency here. More channels increases throughput, but doesn't improve latency.

The M1's memory bandwidth is ~68GB/s, which is of course well short of AMD Epyc's ~200GB/s per socket.

Epyc's latency isn't even competitive with AMD's own consumer parts, so I'm really not sure why you're surprised that Epyc's latency is also worse than the M1's?


I'm not surprised the latency on the M1 is better than Epyc's, but it's nearly half that of any other consumer part, like say the AMD Ryzen 5950X. When accessed in a TLB-friendly way (not TLB thrashing) the M1 manages 30ns, which is excellent.

Even more impressive is that the random-cacheline throughput is also excellent. So if all 8 cores have a cache miss, the M1 memory system is very good at keeping multiple pending requests in flight to achieve surprisingly good throughput. Granted this isn't pure latency, so I call it throughput. Getting a random cacheline per 12ns is quite good, especially for a cheap low-power system. Normally getting more than 2 memory channels on a desktop requires something exotic like an AMD Threadripper.
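For scale, the per-cacheline numbers quoted above translate into random-access bandwidth like this (simple arithmetic; 64-byte cache lines assumed):

```python
LINE = 64  # bytes per cache line

for ns_per_line in (34, 21, 12.5):
    # bytes per nanosecond is numerically GB/s (1e9 bytes per second)
    gbps = LINE / ns_per_line
    print(f"{ns_per_line} ns/line -> {gbps:.2f} GB/s")
# 34 ns -> 1.88 GB/s, 21 ns -> 3.05 GB/s, 12.5 ns -> 5.12 GB/s
```

Note this is purely random access, so it is naturally far below the ~68GB/s sequential figure; the interesting part is how much of it the memory system sustains with only a few outstanding misses per core.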



