
The M1 still uses DDR memory at the end of the day; it's just physically closer to the core. This is in contrast to L3, which is actual SRAM on the die.

The DDR being closer to the core may or may not allow the memory to run at higher speeds due to better signal integrity, but you can purchase DDR4-5333 kits today, whereas the M1 uses LPDDR4X-4266.

The real advantage is that the M1 Max uses 8 channels, which is impressive considering that's as many as an AMD EPYC, while running them at roughly twice the speed.
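Back-of-the-envelope on that, in Python (the 6400MT/s figure for the M1 Max and DDR4-3200 for EPYC are my assumptions, not confirmed specs):

    # peak bandwidth = bus width (bytes) x transfer rate
    def peak_gbps(bus_bits, mt_per_s):
        return bus_bits / 8 * mt_per_s / 1000   # GB/s

    print(peak_gbps(512, 6400))     # M1 Max, 8x64-bit LPDDR5:   ~409.6 GB/s
    print(peak_gbps(8 * 64, 3200))  # EPYC, 8x64-bit DDR4-3200:  ~204.8 GB/s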


Just to underscore this, memory physically closer to the cores has improved tRAS times measured in nanoseconds. This has the secondary effect of boosting the performance of the last-level cache since it can fill lines on a cache miss much faster.

The step up from DDR4 to DDR5 will help fill cache misses that are predictable, but everybody already uses a prefetcher, so the net effect of DDR5 is mostly just better efficiency.

The change Apple is making, moving the memory closer to the cores, speeds up the unpredicted cache misses. That's significant.
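A toy sketch of the two access patterns (illustrative Python; interpreter overhead mutes the effect compared to C, and the array size is a guess at something larger than the last-level cache). A prefetcher can stream the sequential walk, but in the shuffled chase every load depends on the previous one, so each miss pays the full trip to DRAM:

    import random, time

    N = 1 << 22                                # ~4M entries, past any cache
    seq = [(i + 1) % N for i in range(N)]      # next index is predictable
    order = list(range(N))
    random.shuffle(order)
    chase = [0] * N                            # one big random cycle
    for a, b in zip(order, order[1:] + order[:1]):
        chase[a] = b

    def walk(nxt):
        i = 0
        for _ in range(N):
            i = nxt[i]                         # each load feeds the next address
        return i

    for name, arr in (("sequential", seq), ("pointer chase", chase)):
        t0 = time.perf_counter()
        walk(arr)
        print(name, time.perf_counter() - t0)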


> Just to underscore this, memory physically closer to the cores has improved tRAS times measured in nanoseconds.

I doubt that tRAS timing is affected by how close / far a DRAM chip is from the core. It's just a RAS command after all: transfer data from DRAM to the sense amplifiers.

If tRAS has improved, I'd be curious how it was done. It's one of those values that's been basically constant (on a nanosecond basis) for 20 years.

Most DDR3 / DDR4 improvements have been about breaking the chip up into more and more bank groups, so that Group #1 can be issued a RAS command, then Group #2 can be issued a separate RAS command. This doesn't lower latency; it just lets the memory subsystem parallelize requests (increasing bandwidth without improving the actual command latency).


The physically shorter wiring is doing basically nothing. That's not where any of the latency bottlenecks are for RAM. If it were physically on-die, like HBM, that might be different. But we're still talking regular LPDDR5 using off-the-shelf DRAM modules. The shorter wiring could potentially improve signal quality, but ground shields do that, too. And Apple isn't exceeding any specs here (i.e., it's not overclocked), so above-average signal integrity isn't translating into any performance gains anyway.


improved tRAS times

Has this been documented anywhere? What timings are Apple using?


Apple also uses massive cache sizes compared to the rest of the industry.

They put a 32-megabyte system-level cache in their latest phone chip.

>at 32MB, the new A15 dwarfs the competition’s implementations, such as the 3MB SLC on the Snapdragon 888 or the estimated 6-8MB SLC on the Exynos 2100

https://www.anandtech.com/show/16983/the-apple-a15-soc-perfo...

It will be interesting to see how big they go on these chips.


> Apple also uses massive cache sizes compared to the rest of the industry.

AMD's upcoming Ryzen chips are supposed to have 192MB of L3 "V-Cache" SRAM stacked above the chiplets. Current chiplets are 8-core. I'm not sure if that figure is for a single chiplet, but it's supposedly good for 2TB/s[1].

Slightly bigger chip than an iPhone chip, yes. :) But also, wow, a lot of cache. Having it stacked above rather than built into the core is another game-changing move, since a) your core has more space, and b) you can 3D-stack many layers of cache on top.

This has already been used on their GPUs, where the 6800 & 6900 have 128MB of L3 "Infinity Cache" providing 1.66TBps. It's also largely how those cards get by with "only" 512GBps worth of GDDR6 feeding them (256-bit/quad-channel... at 16GT/s). AMD's R9 Fury from mid-2015 had 512GBps of first-generation HBM, for comparison, albeit via a wide-but-slow-clocked 4096-bit interface.

Anyhow, I'm also in awe of the speed wins Apple got here from bringing RAM in close. Cache is a huge huge help. Plus 400GBps main memory is truly awesome, and it's neat that either the CPU or GPU can make use of it.

[1] https://www.anandtech.com/show/16725/amd-demonstrates-stacke...


> The M1 still uses DDR memory at the end of the day; it's just physically closer to the core. This is in contrast to L3, which is actual SRAM on the die.

But they're probably using 8 channels of LPDDR5, if this 400GB/s number is to be believed. That's far more memory channels / bandwidth than any normal chip released so far, EPYC and Skylake-server included.


It's more comparable to the sort of memory bus you'd typically see on a GPU... which is exactly what you'd hope for on a system with high-end integrated graphics. :)


You'd expect HBM or GDDR6 to be used here, but it's seemingly LPDDR5.

So it's still quite unusual. It's like Apple decided to take commodity phone RAM and just make many parallel channels of it... rather than using high-speed RAM to begin with.

HBM is specifically designed to be soldered near a CPU/GPU as well. For them to be soldering commodity LPDDR5 is kinda weird to me.

---------

We know it isn't HBM because HBM is 1024 bits at lower clock speeds. Apple is saying they have 512 bits across 8 channels (64 bits per channel), which is LPDDR5 / DDR territory.

200GBps is within the realm of one HBM stack (1024-bit at low clock speeds), and 400GBps is two HBM stacks (a 2048-bit bus at low clock speeds).
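Rough math behind those figures (the clock speeds here are illustrative guesses, not published specs):

    # bandwidth = bus width (bytes) x transfer rate
    print(1024 / 8 * 1.6)   # one HBM stack, 1024-bit @ 1.6GT/s:  ~205 GB/s
    print(2048 / 8 * 1.6)   # two stacks, 2048-bit @ 1.6GT/s:     ~410 GB/s
    print(512 / 8 * 6.4)    # 512-bit LPDDR5-class @ 6.4GT/s:     ~410 GB/s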


HBM isn't just "soldered near", it's connected through a silicon interposer rather than a PCB.

Also we know it's not HBM because the word "LPDDR5" was literally on the slides :)

> just make many parallel channels of it

isn't that just how LPDDR is in general? It has much narrower channels than DDR so you need much more of them?


> isn't that just how LPDDR is in general? It has much narrower channels than DDR so you need much more of them?

Well yeah. But 400GBps is equivalent to roughly 16 channels of DDR4-3200 (25.6GB/s each). It's an absurdly huge amount of bandwidth.


> The DDR being closer to the core may or may not allow the memory to run at higher speeds due to better signal integrity, but you can purchase DDR4-5333 kits today, whereas the M1 uses LPDDR4X-4266.

My understanding is that bringing the RAM closer increases the bandwidth (better latency and larger buses), not necessarily the speed of the RAM dies. Also, if I'm not mistaken, the RAM in the new M1s is LPDDR5 (I read that on a slide, but it did not stay long on screen, so I could be mistaken). Not sure how it compares with DDR4 DIMMs.


The overall bandwidth isn't affected much by the distance alone. Latency, yes, in the sense that the signal literally has to travel further, but that difference is minuscule (like 1/10th of a nanosecond) compared to overall DDR access latencies.

Better signal integrity could allow for larger buses, but I don't think this is actually a single 512-bit bus. I think it's multiple channels of smaller buses (32- or 64-bit). There's a big difference from an electrical design perspective (byte-lane skew requirements are harder to meet when you have 64 of them). That said, I think multiple channels are better anyway.

The original M1 used LPDDR4X, but I think the new ones use some form of LPDDR5.


Your comment got me thinking, and I checked the math. It turns out that light takes ~0.2 ns to travel 2 inches. But the speed of signal propagation in copper is ~0.6 c, so that takes it up to 0.3 ns. So, still pretty small compared to the overall latencies (~13-18 ns for DDR5) but it's not negligible.
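(The arithmetic, for anyone who wants to play with it; the 0.6c propagation speed is a typical PCB figure, not a measured one:)

    c = 299_792_458             # speed of light, m/s
    d = 2 * 0.0254              # 2 inches, in metres
    print(d / c * 1e9)          # in vacuum:            ~0.17 ns
    print(d / (0.6 * c) * 1e9)  # at ~0.6c over copper: ~0.28 ns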

I do wonder if there are nonlinearities that come into play when it comes to these bottlenecks. Yes, moving the RAM closer only reduces the latency by ~0.2 ns. But it also takes 1/3rd of the time that it used to, and maybe they can use that extra time to do 2 or 3 transactions instead. Latency and bandwidth are inversely related, after all!


Well, you can have high bandwidth and poor latency at the same time -- think of an ultra-wideband radio burst from Earth to Mars -- but yeah, on a CPU with all the crazy co-optimized cache hierarchies and latency hiding, it's difficult to see how changing one part of the system changes the whole. For instance, if you switched 16GB of DRAM for 4GB of SRAM, you could probably cut down the cache-miss latency a lot -- but do you care? If your cache hit rate is high enough, probably not. Then again, maybe chopping the worst case lets you move allocation away from L3 and L2 and into L1, which gets you a win again.

I suspect the only people who really know are the CPU manufacturer teams that run PIN/dynamorio traces against models -- and I also suspect that they are NDA'd through this life and the next and the only way we will ever know about the tradeoffs are when we see them pop up in actual designs years down the road.


DRAM latencies are pretty heinous. It makes me wonder if the memory industry will go through a transition similar to the storage industry's HDD -> SSD shift sometime in the not-too-distant future.

I wonder about the practicalities of going to SRAM for main memory. I doubt silicon real estate would be the limiting factor (1T1C to 6T, isn't it?) and Apple charges a king's ransom for RAM anyway. Power might be a problem though. Does anyone have figures for SRAM power consumption on modern processes?


>> I wonder about the practicalities of going to SRAM for main memory. I doubt silicon real estate would be the limiting factor (1T1C to 6T, isn't it?) and Apple charges a king's ransom for RAM anyway. Power might be a problem though. Does anyone have figures for SRAM power consumption on modern processes?

I've been wondering about this for years. Assuming the difference is similar to the old days, I'd take 2-4GB of SRAM over 32GB of DRAM any day. Last time this came up people claimed SRAM power consumption would be prohibitive, but I have a hard time seeing that given these 50B transistor chips running at several GHz. Most of the transistors in an SRAM are not switching, so they should be optimized for leakage and they'd still be way faster than DRAM.
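Crude area math, assuming a 6T SRAM cell takes roughly 6-10x the footprint of a 1T1C DRAM cell (the real ratio varies a lot by process, so treat these factors as guesses):

    dram_gb = 32
    for ratio in (6, 8, 10):
        print(ratio, dram_gb / ratio)   # ~3-5 GB of SRAM in the same area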


> The overall bandwidth isn't affected much by the distance alone.

Testing showed that the M1's performance cores had a surprising amount of memory bandwidth.

>One aspect we’ve never really had the opportunity to test is exactly how good Apple’s cores are in terms of memory bandwidth. Inside of the M1, the results are ground-breaking: A single Firestorm achieves memory reads up to around 58GB/s, with memory writes coming in at 33-36GB/s. Most importantly, memory copies land in at 60 to 62GB/s depending if you’re using scalar or vector instructions. The fact that a single Firestorm core can almost saturate the memory controllers is astounding and something we’ve never seen in a design before.

https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...
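A crude home-grown version of that kind of measurement (nothing like AnandTech's custom suite, just the basic idea of timing a big copy):

    import numpy as np, time

    src = np.ones(1 << 28, dtype=np.uint8)    # 256 MB, far beyond any cache
    dst = np.empty_like(src)
    t0 = time.perf_counter()
    np.copyto(dst, src)
    dt = time.perf_counter() - t0
    print(2 * src.nbytes / dt / 1e9, "GB/s")  # count read + write traffic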


It just says that the bandwidth between a performance core and the memory controller is great. It's not related to the distance between the memory controller and the DRAM.


L3 is almost never SRAM; it's usually eDRAM, clocked significantly lower than L1 or L2.

(SRAM is prohibitively expensive to do at scale due to die area required).

Edit: Nope, I'm wrong. It's pretty much only Power that has this.


As far as I'm aware, IBM is one of the few chip designers with eDRAM capabilities.

IBM has eDRAM on a number of chips in varying capacities, but... it's difficult for me to think of Intel, AMD, Apple, ARM, or other chips that have eDRAM of any kind.

Intel had one: the eDRAM "Crystalwell" chip, but that was seemingly a one-off and never attempted again. Even then, it was a second die that was "glued" onto the main chip, not truly embedded eDRAM like IBM's (built into the same process).


You're right. My bad. It's much less common than I'd thought. (Intel had it on a number of chips that included Iris Pro graphics across Haswell, Broadwell, Skylake, etc.)


But only the Iris Pro 5200 (codename: Crystalwell) had eDRAM. All the other Iris Pro parts just used normal DDR4.

EDIT: Oh, apparently there were smaller 64MB eDRAM on later chips, as you mentioned. Well, today I learned something.


Ha, I still use an Intel 5775C in my home server!


I think the chip you are talking about is Broadwell.


Broadwell was the CPU core.

Crystalwell was the codename for the eDRAM that was grafted onto Broadwell. (EDIT: Apparently Haswell, but... yeah. Crystalwell + Haswell for eDRAM goodness)


L3 is SRAM on all AMD Ryzen chips that I'm aware of.

I think it's the same with Intel too except for that one 5th gen chip.



