Latencies like this are doable with a lot of tuning on Intel CPUs; out of the bo...

my123 · on Nov 30, 2020

Apple M1 has three cache levels:

- for big cores, private: 128KB L1D

- for big cores, shared within a cluster: 12MB L2

- system-level cache (shared between everything, CPU clusters, GPU, neural engine...): 16MB

and then you reach RAM.
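
To make the hierarchy above concrete, here is a minimal sketch that maps a working-set size to the first level it would fit in, using only the big-core capacities quoted above. The names `LEVELS` and `smallest_level_that_fits` are made up for illustration, and the model is a deliberate simplification: it ignores associativity, sharing between cores, and whether the system-level cache duplicates L2 contents.

```python
KIB = 1024
MIB = 1024 * KIB

# Capacities as quoted in the comment above (big cores only).
LEVELS = [
    ("L1D", 128 * KIB),  # private per big core
    ("L2", 12 * MIB),    # shared within the big-core cluster
    ("SLC", 16 * MIB),   # system-level cache, shared with GPU etc.
]


def smallest_level_that_fits(working_set_bytes):
    """Return the first level whose capacity covers the working set.

    Assumption: a working set "fits" if it is no larger than the quoted
    capacity. Real caches start missing before that point.
    """
    for name, capacity in LEVELS:
        if working_set_bytes <= capacity:
            return name
    return "RAM"


if __name__ == "__main__":
    for size in (64 * KIB, 1 * MIB, 14 * MIB, 64 * MIB):
        print(f"{size // KIB} KiB -> {smallest_level_that_fits(size)}")
```

Something like this is mostly useful for picking array sizes that straddle each boundary when benchmarking.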

sliken · on Nov 30, 2020

I've written a benchmark to measure such things, and from what I can tell:

Each fast core has a L1D of 128KB.

The fast cores share a 12MB L2 within their cluster; cache misses go to main memory.

The slow cores have a 4MB L2.

Cache misses from the fast cores' L2 alone can't quite saturate the main memory system (I believe it's 8 channels of 16 bits). So when all cores are busy, you keep 12MB of L2 for the fast cores and 4MB of L2 for the slow cores, and you get better throughput from the memory system because all 8 channels stay busy.

my123 · on Nov 30, 2020

Wonder if the SLC is mostly used for coherency purposes and the other blocks then...

And yeah, it's 128-bit wide LPDDR4X-4266, pretty quick imo.
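
For reference, the theoretical peak bandwidth implied by those numbers can be worked out as below. This is a back-of-the-envelope sketch, not a measurement: it assumes the "4266" in LPDDR4X-4266 means 4266 mega-transfers per second per pin, and takes the 8 x 16-bit channel layout from the earlier comment (which was itself stated as a belief). The function name is made up for illustration.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, mega_transfers_per_s):
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


# Assumed layout from the thread: 8 channels, 16 bits each.
channels = 8
bits_per_channel = 16
bus_width_bits = channels * bits_per_channel  # 128 bits total

peak = peak_bandwidth_gb_per_s(bus_width_bits, 4266)

if __name__ == "__main__":
    print(f"{bus_width_bits}-bit bus at 4266 MT/s: {peak:.1f} GB/s peak")
```

That comes to about 68.3 GB/s. Sustained throughput will be lower, and the total is the same whether the 128 bits are organized as 8 narrow channels or 2 wide ones; the channel count affects how many independent requests can be in flight, not the peak.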

sliken · on Nov 30, 2020

Not just 128 bits wide (standard on high-end laptops and most desktops), but 8 channels. The latency is halved, while over the last decades I've only seen very modest improvements in main-memory latency, on the order of 3-5% a year.
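
To put the 3-5% figure in perspective, here is a small sketch of how long a steady yearly improvement takes to halve latency. It assumes the improvement compounds, i.e. latency is multiplied by (1 - rate) each year; `years_to_halve` is a made-up helper name.

```python
import math


def years_to_halve(annual_improvement):
    """Years until latency drops to half, given a compounding yearly
    improvement expressed as a fraction (0.03 for 3%)."""
    return math.log(0.5) / math.log(1.0 - annual_improvement)


if __name__ == "__main__":
    for rate in (0.03, 0.05):
        print(f"{rate:.0%} per year: {years_to_halve(rate):.1f} years to halve")
```

Under that assumption, halving takes roughly 14 years at 5% a year and roughly 23 years at 3% a year.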