Hacker News

Unified memory is old, every CPU with an integrated GPU is on unified memory. It's not doing anything at all for M1's general performance.

The main reason for M1's performance is just that Apple managed to get an 8-wide CPU to work & be fed. That's it. That's the entirety of the story of the M1. Apple got 8-wide to work, while Intel & AMD are still 4-wide. ARM's fixed length instructions are helping a ton there, but Apple also put in work to feed it.

All the other shit about specialized units & unified memory & vertical integration is all irrelevant and mostly wrong. The existing Intel Macbook Airs are all unified memory with specialized hardware units for specialized tasks, too (Intel QuickSync is 9 years old now - dedicated silicon for the specialized task of video encoding). Apple did absolutely nothing new or interesting on that front. Other than marketing. They are magic at marketing. And also magic at making a super wide CPU front-end, with really good cache latency numbers.



That's overly simplified. Larger reorder buffer, larger caches, lower-latency caches, more outstanding loads, etc.

They also managed, in a cheap platform (Mac mini = $700), to get 4266MHz memory working with about half the latency of any x86-64 I've seen. The Mac mini can fetch a random cacheline (assuming a relatively TLB-friendly pattern) in around 30-33ns.

It makes the usual Intel Core i7/i9 or Ryzen 5000 look rather quaint with their 60ns or higher memory latencies.


> They also managed, in a cheap platform (Mac mini = $700), to get 4266MHz memory working with about half the latency of any x86-64 I've seen. The Mac mini can fetch a random cacheline (assuming a relatively TLB-friendly pattern) in around 30-33ns.

Where are you getting those numbers? From https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...

"In terms of memory latency, we’re seeing a (rather expected) reduction compared to the A14, measuring 96ns at 128MB full random test depth, compared to 102ns on the A14."

Which would put M1's DRAM latency at worse than a modern Intel or AMD desktop, which is measuring around 70-80ns: https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...


I've replicated them myself with my own code, so I'm pretty confident. It doesn't hurt that my numbers match Anandtech's, at least for the range of arrays they use and only using a single thread.

On pretty much any current CPU, if you randomly access an array significantly larger than cache (12MB in the M1's case) you end up thrashing the TLB, which significantly increases latency. The number of pages that can be quickly accessed depends on the number of entries in the TLB.
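To put rough numbers on that (a back-of-the-envelope sketch; the 256 MiB array size and the TLB capacity are assumptions, not from the comment):

```python
# Rough arithmetic for why a big fully-random walk thrashes the TLB.
ARRAY = 256 * 2**20   # 256 MiB test array, far past the M1's 12 MB cache
PAGE = 4096           # 4 KiB pages
LINE = 64             # bytes per cache line

pages = ARRAY // PAGE           # distinct pages a full random walk touches
lines_per_page = PAGE // LINE   # cache lines that share one TLB entry
print(pages, lines_per_page)    # 65536 64

# Typical L2 TLBs hold on the order of 1-3k entries (assumption), so
# with ~65k hot pages nearly every access also pays a page walk on top
# of the DRAM access itself.
```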

To separate TLB latency from memory latency, I allow controlling the size of the sliding window used when randomizing the array, so that only a few pages are heavily used at any one time, while still visiting each cache line exactly once.

That's exactly what the brown "R per RV prange" line does. For more info see the description at: https://www.anandtech.com/show/14072/the-samsung-galaxy-s10p...

My code builds an array, then does a Knuth shuffle, modified so the maximum shuffle distance is 64 pages, meaning the average swap distance is around 32 pages. I get a nice clean line at 34ns. With 2 or 4 threads I get a throughput (not true latency) of a cacheline every 21ns. With 8 threads (using the 4 slow and 4 fast cores) I get a somewhat better cacheline per 12.5ns.
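A minimal sketch of that windowed-shuffle idea (this is not the commenter's actual code, which presumably builds a real pointer-chase buffer in a compiled language; the function names and the 4 KiB page size are assumptions):

```python
import random

PAGE_LINES = 4096 // 64   # cache lines per 4 KiB page

def windowed_shuffle(n_lines, window_pages=64, seed=0):
    """Knuth shuffle where each swap moves an element at most
    window_pages pages, so only a small set of pages is hot at
    any point during the later pointer chase."""
    rng = random.Random(seed)
    order = list(range(n_lines))
    window = window_pages * PAGE_LINES
    for i in range(n_lines - 1, 0, -1):
        j = rng.randint(max(0, i - window), i)
        order[i], order[j] = order[j], order[i]
    return order

def build_chain(order):
    """Link the visit order into one cycle: nxt[a] = b means
    'after reading line a, read line b'. Chasing nxt visits every
    cache line exactly once before returning to the start."""
    nxt = [0] * len(order)
    for k, a in enumerate(order):
        nxt[a] = order[(k + 1) % len(order)]
    return nxt
```

Timing a tight loop of `cur = nxt[cur]` over such a chain gives the per-cacheline latency: each load depends on the previous one, so the CPU can't hide the miss with out-of-order execution.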

Pretty stunning to have latencies that low on a low-end $700 Mac mini that embarrasses machines costing 10x as much. Even high-end Epyc machines (200 watt TDP) with 8 x 64-bit memory channels have to try hard to get a cacheline every 10ns.


> Pretty stunning to have latencies that low on a low-end $700 Mac mini that embarrasses machines costing 10x as much. Even high-end Epyc machines (200 watt TDP) with 8 x 64-bit memory channels have to try hard to get a cacheline every 10ns.

Eh? That's not how memory latency works. The cheaper consumer chips with "non-spec" RAM and without ECC are regularly better here than the enterprise stuff. This isn't something that scales with price.


Sure, ECC and in particular registered memory do increase the latency a bit. But servers are designed for throughput and have multiple memory channels to better feed the large number of cores involved, up to 64 cores for the new AMD Epyc chips. The amazing thing is that the Apple M1 can fetch random cachelines almost as fast as a current AMD Epyc.


You're confusing throughput & latency here. More channels increases throughput, but doesn't improve latency.

The M1's memory bandwidth is ~68GB/s, which is of course well short of AMD Epyc's ~200GB/s per socket.

Epyc's latency isn't even competitive with AMD's own consumer parts, so I'm really not sure why you're surprised that Epyc's latency is also worse than the M1's?


I'm not surprised the latency on the M1 is better than Epyc's, but it's nearly half that of any other consumer part, like say the AMD Ryzen 5950X. When accessed in a TLB-friendly way (not TLB thrashing) the M1 manages 30ns, which is excellent.

Even more impressive is that the random-cacheline throughput is also excellent. So if all 8 cores have a cache miss, the M1 memory system is very good at keeping multiple pending requests in flight to achieve surprisingly good throughput. Granted this isn't pure latency, so I call it throughput. Getting a random cacheline per 12ns is quite good, especially for a cheap low-power system. Normally getting more than 2 memory channels on a desktop requires something exotic like an AMD Threadripper.
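For scale, the per-cacheline numbers quoted above translate into random-access bandwidth like this (simple arithmetic; 64-byte cache lines assumed):

```python
LINE = 64  # bytes per cache line

for ns_per_line in (34, 21, 12.5):
    # bytes per nanosecond is numerically GB/s (1e9 bytes per second)
    gbps = LINE / ns_per_line
    print(f"{ns_per_line} ns/line -> {gbps:.2f} GB/s")
# 34 ns -> 1.88 GB/s, 21 ns -> 3.05 GB/s, 12.5 ns -> 5.12 GB/s
```

Note this is purely random access, so it is naturally far below the ~68GB/s sequential figure; the interesting part is how much of it the memory system sustains with only a few outstanding misses per core.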



