This is about the processors, not the laptops, so commenting on the chips instead. They look great, but they look like they're the M1 design, just more of it. Which is plenty for a laptop! But it'll be interesting to see what they'll do for their desktops.
Most of the additional chip area went into more GPUs and special-purpose video codec hardware. It's "just" two more cores than the vanilla M1, and some of the efficiency cores on the M1 became performance cores. So CPU-bound things like compiling code will be "only" 20-50% faster than on the M1 MacBook. The big wins are for GPU-heavy and codec-heavy workloads.
That makes sense since that's where most users will need their performance. I'm still a bit sad that the era of "general purpose computing" where CPU can do all workloads is coming to an end.
Nevertheless, impressive chips, I'm very curious where they'll take it for the Mac Pro, and (hopefully) the iMac Pro.
Total cores, yes, but it's going from 4 "high performance" and 4 "efficiency" cores to 8 "high performance" and 2 "efficiency" cores. So the increase in performance should be more dramatic than "20% more cores" would suggest.
Yes. But the 14" and 16" has larger battery than 13" MacBook Pro or Air. And they were designed for performance, so two less EE core doesn't matter as much.
It is also important to note that, despite the M1 name, we don't know if the CPU cores are the same as the ones used in the M1 / A14, or whether they used the A15 design, where the efficiency cores saw a significant improvement. The video decoder used in the M1 Pro and Max seems to be from the A15, and the LPDDR5 support also implies a new memory controller.
In the A15, AnandTech claims the efficiency cores are 1/3 the performance but 1/10 the power. They should be looking at (effectively) doubling the CPU power consumption over the M1, assuming they don't increase clock speeds.
Going from 8 to 16 or 32 GPU cores is another massive power increase.
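To put rough numbers on that "effectively doubling" estimate, here's a back-of-the-envelope sketch. The 1/10-power figure for efficiency cores comes from the AnandTech claim quoted above; treating a loaded performance core as one "unit" of power, and assuming equal clocks, are my own simplifications.

```c
#include <stdio.h>

/* Rough estimate only (not Apple data): one performance core at full
 * load = 1.0 "power unit"; an efficiency core = ~0.1 units, per the
 * AnandTech A15 figure quoted above. Clocks assumed unchanged. */
int main(void) {
    double p_core = 1.0, e_core = 0.1;          /* assumed relative power */
    double m1     = 4 * p_core + 4 * e_core;    /* M1: 4P + 4E            */
    double m1_pro = 8 * p_core + 2 * e_core;    /* M1 Pro/Max: 8P + 2E    */
    printf("M1 CPU cluster:     %.1f units\n", m1);      /* 4.4  */
    printf("M1 Pro/Max cluster: %.1f units\n", m1_pro);  /* 8.2  */
    printf("ratio:              %.2fx\n", m1_pro / m1);  /* ~1.9 */
    return 0;
}
```

That comes out to roughly 1.9x, i.e. the "effectively doubling" mentioned above, before any clock-speed changes.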
I wonder if Apple will give us a 'long-haul' mode where the system is locked to only the energy-efficient cores and settings. Us developer types would love a computer that survives 24 hours on battery.
macOS Monterey, coming out on the 25th, has a new Low Power Mode feature that may do just that. That said, these Macs are incredibly efficient for light use; you may already get 24 hrs of battery life with your workload, not counting screen-off time.
Video playback is accelerated by essentially custom ASIC processing built into the CPU, so it's one of the most efficient things you can do now. Most development workloads are far more compute intensive.
I get about 14-16 hours out of my M1 MacBook Air doing basically full-time development (browser, mail client, Slack, text editor & terminal open, and compiling code periodically).
I know everyone's use case is different, but most of my development workload is 65% typing code into a text editor and 35% running it. I'm not continually pegging the CPU, just intermittently, in which case the existence of low-power cores helps a lot. The supposed JavaScript acceleration in the M1 has seemed to really speed up my workloads too.
This is true, but it's not worst case by far. Most video is 24 or 30 fps, so about half the typical 60 hz refresh rate. Still a nice optimization path for video. I'm not sure what effect typing in an editor will have on screen refresh, but if the Electron issue is any indication, it's probably complicated.
The power supply is for charging the battery faster. The new MagSafe 3 system can charge at higher wattage than USB-C, as per the announcement. USB-C max wattage is 100 watts, which was the previous limiting factor for battery charging.
That's with 2 connectors right? I have a Dell Precision 3760 and the one connector charging mode is limited to around 90W. With two connectors working in tandem (they snap together), it's 180W.
The connectors never get remotely warm... in fact under max charge rate they're consistently cool to the touch, so I've always thought that it could probably be increased a little bit with no negative consequences.
Single connector, the 3.1 spec goes up to 5A at 48V. You need new cables with support for the higher voltages, but your "multiple plugs for more power" laptop is exactly the sort of device it's designed for.
I’ve not seen any manufacturer even announce they were going to make a supported cable yet, let alone seen one that does. I might’ve missed it though. This will only make the hell of USB-C cabling worse imho.
The USB Implementers Forum announced a new set of cable markings for USB 4 certified cables that will combine information on the maximum supported data rate and maximum power delivery for a given cable.
The 16” has a 100wh battery, so it needs 100w of power to charge 50% in 30 minutes (their “fast charging”). Add in 20w to keep the laptop running at the same time, and some conversion losses, and a 140w charger sounds just about right.
Sure, but it's an Apple cable plugging into an Apple socket. They don't have to be constrained by the USB-C specs and could implement a custom high power charging mode. In fact I believe some other laptop manufacturers already do this.
I’m not particularly surprised. They have little to prove with the iPhone, but have every reason to make every measurable factor of these new Macs better than both the previous iteration and the competition. Throwing in a future-model-upsell is negligible compared to mixed reviews about Magsafe 3 cluttering up reviews they otherwise expect to be positive.
Just in case people missed it - the MagSafe cable connects to the power supply via USB-C. So (in theory) there's nothing special about the charger that you couldn't do with a 3rd-party charger, or a multiport charger, or something like that.
MagSafe was a gimmick for me - it disconnects far too easily, the cables fray in like 9 months, it's only on one side, and it's proprietary and overpriced. Use longer cables and they will never be yanked again. The MBP is heavy enough that even USB-C gets pulled out on a good yank.
I briefly had an M1 MacBook Air, and the thing I hated the most about it was the lack of MagSafe. I returned it (needed more RAM) and was overjoyed they brought MagSafe back with these; I'm looking forward to having it on my new 16".
You can also still charge through USB C if you don't care for Magsafe.
Might be a power limitation. I have an XPS 17 which only runs at full performance and charges the battery when using the supplied 130W charger. USB-C is only specced to 100W. I can still do most things on the spare USB-C charger I have.
I have a top-spec 15” MBP from the last release just before the 16”. It has a 100W supply and it's easy to have total draw of more than that (so pulling from the battery while plugged in) while running heavy things like 3D games. I've seen around 140W peak. So a 150W supply seems prudent.
In the power/performance curves provided by Apple, they imply that the Pro/Max provides the same level of performance at a slightly lower power consumption than the original M1.
But at the same time, Apple isn't providing any hard data or explaining their methodology. I dunno how much we should be reading into the graphs. /shrug
Yes, but only at the very extreme. It's normal that a high-core-count part at low clocks has higher efficiency (perf/power) at a given performance level than a low-core-count part at high clocks, since power grows super-linearly with clock speed (decreasing efficiency). But notably they've tuned the clock/power regime of the M1 Pro/Max CPUs such that the crossover region here is very small.
I think this is pretty easy to math: M1 has 2x the efficiency cores of these new models. Those cores do a lot of work in measured workloads that will sometimes be scheduled on performance cores instead. The relative performance and efficiency lines up pretty well if you assume that a given benchmark is utilizing all cores.
> M1 Pro delivers up to 1.7x more CPU performance at the same power level and achieves the PC chip’s peak performance using up to 70 percent less power
> I'm still a bit sad that the era of "general purpose computing" where CPU can do all workloads is coming to an end.
You'd have to be extremely old to remember that era. Lots of stuff important to making computers work got split off into separate chips away from the CPU pretty early into mass computing, such as sound, graphics, and networking. We've also been sending a lot of compute from the CPU into the GPU of late, for both graphics and ML purposes.
Lately it seems like the trend has been taking these specialized peripheral chips and moving them back into SoC packages. Apple's approach here seems to be an evolutionary step on top of say, an Intel chip with integrated graphics, rather than a revolutionary step away from the era of general purpose computing.
The IBM PC that debuted with the 286 was the PC/AT ("Advanced Technology", hah) that is best known for introducing the AT bus later called the ISA bus that led to the proliferation of video cards, sound cards, and other expansion cards that made the PC what it is today.
I'm actually not sure there ever was a "true CPU computer age" where all processing was CPU-bound/CPU-based. Even the deservedly beloved MOS 6502 processor that powered everything for a hot decade or so was considered merely a "micro-controller" rather than a "micro-processor", and nearly every use of the MOS 6502 involved a lot of machine-specific video chips and memory management chips. The NES design lasted so long in part because toward the end cartridges would sometimes have entirely custom processing chips pulling work off the MOS 6502.
Even the mainframe-era term itself, "Central Processing Unit", has always sort of implied that it works in tandem with other "processing units"; it's just the most central. (In some mainframe designs I think this was even quite literal in the floorplan.) Of course too, when your CPU is a massive tower full of boards that make up individual operations, the very opposite of an Integrated Circuit, it's quite tough to call those a "general purpose CPU" as we imagine them today.
The C64 mini runs on an ARM processor, so that doesn't count in this context. Also I just learned that the processor in the C64 had two coprocessors for sound and graphics (?). So maybe that also doesn't count.
400GB/s available to the CPU cores in unified memory is going to really help certain workloads that are very memory-dominated on modern architectures. Both Intel and AMD are addressing this with ever-increasing L3 cache sizes, but just using attached memory in an SoC has vastly higher memory bandwidth potential, and probably better latency too, especially for work that doesn't fit in ~32MB of L3 cache.
The M1 still uses DDR memory at the end of the day, it's just physically closer to the core. This is in contrast to L3 which is actual SRAM on the core.
The DDR being closer to the core may or may not allow the memory to run at higher speeds due to better signal integrity, but you can purchase DDR4-5333 today whereas the M1 uses 4266.
The real advantage is the M1 Max uses 8 channels, which is impressive considering that's as many as an AMD EPYC, while operating at roughly twice the speed at the same time.
Just to underscore this, memory physically closer to the cores has improved tRAS times measured in nanoseconds. This has the secondary effect of boosting the performance of the last-level cache since it can fill lines on a cache miss much faster.
The step up from DDR4 to DDR5 will help fill cache misses that are predictable, but everybody uses a prefetcher already, the net effect of DDR5 is mostly just better efficiency.
The change Apple is making, moving the memory closer to the cores, improves unpredicted cache misses. That's significant.
> Just to underscore this, memory physically closer to the cores has improved tRAS times measured in nanoseconds.
I doubt that tRAS timing is affected by how close / far a DRAM chip is from the core. It's just a RAS command after all: transfer data from DRAM to the sense amplifiers.
If tRAS has improved, I'd be curious how it was done. It's one of those values that has basically been constant (on a nanosecond basis) for 20 years.
Most DDR3 / DDR4 improvements have been about breaking up the chip into more-and-more groups, so that Group#1 can be issued a RAS command, then Group#2 can be issued a separate RAS command. This doesn't lower latency, it just allows the memory subsystem to parallelize the requests (increasing bandwidth but not improving the actual command latency specifically).
The physically shorter wiring is doing basically nothing. That's not where any of the latency bottlenecks are for RAM. If it was physically on-die, like HBM, that'd maybe be different. But we're still talking regular LPDDR5 using off-the-shelf DRAM modules. The shorter wiring would potentially improve signal quality, but ground shields do that too. And Apple isn't exceeding any specs on this (i.e., it's not overclocked), so above-average signal integrity isn't translating into any performance gains anyway.
Apple also uses massive cache sizes, compared to the industry.
They put a 32 megabyte system level cache in their latest phone chip.
>at 32MB, the new A15 dwarfs the competition’s implementations, such as the 3MB SLC on the Snapdragon 888 or the estimated 6-8MB SLC on the Exynos 2100
> Apple also uses massive cache sizes, compared to the industry.
AMD's upcoming Ryzen are supposed to have 192MB L3 "v-cache" SRAM stacked above each chiplet. Current chiplets are 8-core. I'm not sure if this is a single chiplet but supposedly good for 2Tbps[1].
Slightly bigger chip than an iPhone chip, yes. :) But also, wow, a lot of cache. Having it stacked above rather than built into the core is another game-changing move, since a) your core has more space, and b) you can 3D-stack many layers of cache on top.
This has already been used on their GPUs, where the 6800 & 6900 have 128MB of L3 "Infinity cache" providing 1.66TBps. It's also largely how these cards get by with "only" 512GBps worth of GDDR6 feeding them (256bit/quad-channel... at 16GT). AMD's R9 Fury from spring 2015 had 1TBps of HBM2, for compare, albeit via that slow 4096bit wide interface.
Anyhow, I'm also in awe of the speed wins Apple got here from bringing RAM in close. Cache is a huge huge help. Plus 400GBps main memory is truly awesome, and it's neat that either the CPU or GPU can make use of it.
> The M1 still uses DDR memory at the end of the day, it's just physically closer to the core. This is in contrast to L3 which is actual SRAM on the core.
But they're probably using 8-channels of LPDDR5, if this 400GB/s number is to be believed. Which is far more memory channels / bandwidth than any normal chip released so far, EPYC and Skylake-server included.
It's more comparable to the sort of memory bus you'd typically see on a GPU... which is exactly what you'd hope for on a system with high-end integrated graphics. :)
You'd expect HBM or GDDR6 to be used. But this is seemingly LPDDR5 that's being used.
So it's still quite unusual. It's like Apple decided to take commodity phone RAM and just make many parallel channels of it... rather than using high-speed RAM to begin with.
HBM is specifically designed to be soldered near a CPU/GPU as well. For them to be soldering commodity LPDDR5 instead is kinda weird to me.
---------
We know it isn't HBM because HBM is 1024-bits at lower clock speeds. Apple is saying they have 512-bits across 8 channels (64-bits per channel), which is near LPDDR5 / DDR kind of numbers.
200GBps is within the realm of 1x HBM channel (1024-bit at low clock speeds), and 400GBps is 2x HBM channels (2048-bit bus at low clock speeds).
> The DDR being closer to the core may or may not allow the memory to run at higher speeds due to better signal integrity, but you can purchase DDR4-5333 today whereas the M1 uses 4266.
My understanding is that bringing the RAM closer increases the bandwidth (better latency and larger buses), not necessarily the speed of the RAM dies. Also, if I am not mistaken, the RAM in the new M1s is LPDDR5 (I read that, but it did not stay long on screen so I could be mistaken). Not sure how it compares with DDR4 DIMMs.
The overall bandwidth isn't affected much by the distance alone. Latency, yes, in the sense that the signal literally has to travel further, but that difference is minuscule (like 1/10th of a nanosecond) compared to overall DDR access latencies.
Better signal integrity could allow for larger busses, but I don't think this is actually a single 512 bit bus. I think it's multiple channels of smaller busses (32 or 64 bit). There's a big difference from an electrical design perspective (byte lane skew requirements are harder to meet when you have 64 of them). That said, I think multiple channels is better anyway.
The original M1 used LPDDR4 but I think the new ones use some form of DDR5.
Your comment got me thinking, and I checked the math. It turns out that light takes ~0.2 ns to travel 2 inches. But the speed of signal propagation in copper is ~0.6 c, so that takes it up to 0.3 ns. So, still pretty small compared to the overall latencies (~13-18 ns for DDR5) but it's not negligible.
I do wonder if there are nonlinearities that come in to play when it comes to these bottlenecks. Yes, by moving the RAM closer it's only reducing the latency by 0.2 ns. But, it's also taking 1/3rd of the time that it used to, and maybe they can use that extra time to do 2 or 3 transactions instead. Latency and bandwidth are inversely related, after all!
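For anyone who wants to redo that propagation math, here's a small sketch; the ~2-inch trace length and the 0.6c propagation factor are the same assumptions the comment above uses.

```c
#include <stdio.h>

/* Redoing the propagation-delay math above. Trace length and the 0.6c
 * propagation factor are assumptions, matching the comment. */
int main(void) {
    double c_mm_per_ns = 299.792458;   /* speed of light, mm per ns  */
    double trace_mm    = 2 * 25.4;     /* ~2 inches of board trace   */
    double factor      = 0.6;          /* signal speed relative to c */
    double one_way_ns  = trace_mm / (c_mm_per_ns * factor);
    printf("one-way delay: %.2f ns\n", one_way_ns);      /* ~0.28 ns */
    printf("round trip:    %.2f ns\n", 2 * one_way_ns);  /* ~0.56 ns */
    /* versus ~13-18 ns CAS latency for DDR5: small, but not zero */
    return 0;
}
```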
Well, you can have high bandwidth and poor latency at the same time -- think an ultra-wideband radio burst from Earth to Mars -- but yeah, on a CPU with all the crazy co-optimized cache hierarchies and latency hiding it's difficult to see how changing one part of the system changes the whole. For instance, if you switched 16GB of DRAM for 4GB of SRAM, you could probably cut down the cache-miss latency a lot -- but do you care? If your cache hit rate is high enough, probably not. Then again, maybe chopping the worst case lets you move allocation away from L3 and L2 and into L1, which gets you a win again.
I suspect the only people who really know are the CPU manufacturer teams that run PIN/dynamorio traces against models -- and I also suspect that they are NDA'd through this life and the next and the only way we will ever know about the tradeoffs are when we see them pop up in actual designs years down the road.
DRAM latencies are pretty heinous. It makes me wonder if the memory industry will go through a similar transition to the storage industry's HDD->SSD sometime in the not too distant future.
I wonder about the practicalities of going to SRAM for main memory. I doubt silicon real estate would be the limiting factor (1T1C to 6T, isn't it?) and Apple charges a king's ransom for RAM anyway. Power might be a problem though. Does anyone have figures for SRAM power consumption on modern processes?
>> I wonder about the practicalities of going to SRAM for main memory. I doubt silicon real estate would be the limiting factor (1T1C to 6T, isn't it?) and Apple charges a king's ransom for RAM anyway. Power might be a problem though. Does anyone have figures for SRAM power consumption on modern processes?
I've been wondering about this for years. Assuming the difference is similar to the old days, I'd take 2-4GB of SRAM over 32GB of DRAM any day. Last time this came up people claimed SRAM power consumption would be prohibitive, but I have a hard time seeing that given these 50B transistor chips running at several GHz. Most of the transistors in an SRAM are not switching, so they should be optimized for leakage and they'd still be way faster than DRAM.
> The overall bandwidth isn't affected much by the distance alone.
Testing showed that the M1's performance cores had a surprising amount of memory bandwidth.
>One aspect we’ve never really had the opportunity to test is exactly how good Apple’s cores are in terms of memory bandwidth. Inside of the M1, the results are ground-breaking: A single Firestorm achieves memory reads up to around 58GB/s, with memory writes coming in at 33-36GB/s. Most importantly, memory copies land in at 60 to 62GB/s depending if you’re using scalar or vector instructions. The fact that a single Firestorm core can almost saturate the memory controllers is astounding and something we’ve never seen in a design before.
It just says that the bandwidth between a performance core and the memory controller is great. It's not related to the distance between the memory controller and the DRAM.
As far as I'm aware, IBM is one of the few chip-designers who have eDRAM capabilities.
IBM has eDRAM on a number of chips in varying capacities, but... it's difficult for me to think of Intel, AMD, Apple, ARM, or other chips that have eDRAM of any kind.
Intel had one: the eDRAM "Crystalwell" chip, but that is seemingly a one-off and was never attempted again. Even then, this was a 2nd die that was "glued" onto the main chip, and not truly eDRAM like IBM's (embedded into the same process).
You're right. My bad. It's much less common than I'd thought.
(Intel had it on a number of chips that included the Iris Pro Graphics across Haswell, Broadwell, Skylake etc)
Crystalwell was the codename for the eDRAM that was grafted onto Broadwell. (EDIT: Apparently Haswell, but... yeah. Crystalwell + Haswell for eDRAM goodness)
Good point. Especially since a lot of software these days is not all that cache friendly. Realistically this means we have 2 years or so till further abstractions eat up the performance gains.
I thought the memory was one of the more interesting bits here.
My 2-year-old Intel MBP has 64 GB, and 8 GB of additional memory on the GPU. True, on the M1 Max you don't have to copy back and forth between CPU and GPU thanks to integrated memory, but the new MBP still has less total memory than my 2-year-old Intel MBP.
And it seems they just barely managed to get to 64 GiB. The whole processor chip is surrounded by memory chips. That's in part why I'm curious to see how they'll scale this. One idea would be to just have several M1 Max SoCs on a board, but that's going to be interesting to program. And getting to 1 TB of memory seems infeasible too.
Just some genuine honest curiosity here; how many workloads actually require 64GB of RAM? For instance, I'm an amateur in the music production scene, and I know that sampling-heavy workflows benefit from being able to load more audio clips fully into RAM rather than streaming them from disk. But 64GB seems a tad overkill even for that.
I guess for me I would prefer an emphasis on speed/bandwidth rather than size, but I'm also aware there are workloads that I'm completely ignorant of.
Same, I tend to get everything in 32GB but more and more often I'm going over that and having things slow down. I've also nuked an SSD in a 16GB MBP due to incredibly high swap activity. It would make no sense for me to buy another 32GB machine if I want it to last five years.
Another anecdote from someone who is also in the music production scene - 32GB tended to be the "sweet spot" in my personal case for the longest time, but I'm finding myself hitting the limits more and more as I continue to add more orchestral tracks which span well over 100 tracks total in my workflows.
I'm finding I need to commit and print a lot of these. Logic's little checker in the upper right showing RAM, disk I/O, CPU, etc. also shows that it is getting close to memory limits on certain instruments with many layers.
So as someone who would be willing to dump $4k into a laptop where its main workload is only audio production, I would feel much safer going with 64GB knowing there's no real upgrade if I were to go with the 32GB model outside of buying a totally new machine.
Edit: And yes, this does show the typical "fear of committing" issue that plagues all of us people making music. It's more of a "nice to have" than a necessity, but I would still consider it a wise investment. At least in my eyes. Everyone's workflow varies and others have different opinions on the matter.
I know the main reason why the Mac Pro has options for LRDIMMs for terabytes of RAM is specifically for audio production, where people are basically using their system memory as cache for their entire instrument library.
I have to wonder how Apple plans to replace the Mac Pro - the whole benefit of M1 is that gluing the memory to the chip (in a user-hostile way) provides significant performance benefits; but I don't see Apple actually engineering a 1TB+ RAM SKU or an Apple Silicon machine with socketed DRAM channels anytime soon.
I think we'd probably see Apple use the fast-and-slow RAM method that old computers used back in the '90s.
16-32GB of RAM on the SOC, with DRAM sockets for usage past the built in amount.
Though by the time we see an ARM Mac Pro they might move to stacked DRAM on the SoC. But I'd really think a two-tier memory system would be Apple's method of choice.
I'd also expect a dual SOC setup.
So I don't expect to see that anytime soon.
I'd love to get my hands on a Mac Mini with the M1 Max.
I went for 64GB. I have one game where 32GB is on the ragged edge - so for the difference it just wasn't worth haggling over. Plus it doubled the memory bandwidth - nice bonus.
And unused RAM isn't wasted - the system will use it for caching. Frankly I see memory as one of the cheapest performance variables you can tweak in any system.
> how many workloads actually require 64gb of ram?
Don't worry, Chrome will eat that up in no time!
More seriously, I look forward to more RAM for some of the datasets I work with. At least so I don't have to close everything else while running those workloads.
As a data scientist, I sometimes find myself going over 64 GB. Of course it all depends on how large data I'm working on. 128 GB RAM helps even with data of "just" 10-15 GB, since I can write quick exploratory transformation pipelines without having to think about keeping the number of copies down.
I could of course chop up the workload earlier, or use samples more often. Still, while not strictly necessary, I regularly find I get stuff done quicker and with less effort thanks to it.
Not many, but there are a few that need even more. My team is running SQL servers on their laptops (development and support) and when that is not enough, we go to Threadrippers with 128-256GB of RAM. Other people run Virtual Machines on their computers (I work most of the time in a VM) and you can run several VMs at the same time, eating up RAM really fast.
On a desktop Hackintosh, I started with 32GB that would die with out of memory errors when I was processing 16bit RAW images at full resolution. Because it was Hackintosh, I was able to upgrade to 64GB so the processing could complete. That was the only thing running.
What image dimensions? What app? I find this extremely suspect, but it’s plausible if you’ve way undersold what you’re doing. 24Mpixel 16bit RAW image would have no problem generally on an 4gb machine if it’s truly the only app running and the app isn’t shit. ;)
I shoot timelapse using Canon 5D RAW images, I don't know the exact dimensions off the top of my head but greater than 5000px wide. I then grade them using various programs, ultimately using After Effects to render out full frame ProRes 4444. After Effects was running out of memory. It would crash and fail to render my file. It would display an error message that told me specifically it was out of memory. I increased the memory available to the system. The error goes away.
But I love the fact that you have this cute little theory to doubt my actual experience to infer that I would make this up.
> But I love the fact that you have this cute little theory to doubt my actual experience to infer that I would make this up.
The facts were suspect, your follow up is further proof I had good reason to be suspect. First off, the RAW images from a 5D aren’t 16 bit. ;) Importantly, the out of memory error had nothing to do with the “16 bit RAW files”, it was video rendering lots of high res images that was the issue which is a very different issue and of course lots of RAM is needed there. Anyway, notice I said “but it’s plausible if you’ve way undersold what you’re doing”, which is definitely the case here, so I’m not sure why it bothered you.
>> die with out of memory errors when I was processing 16bit RAW images
> Canon RAW images are 14bit
You don’t see the issue?
> Are you just trying to be argumentative for the fun?
In the beginning, I very politely asked a clarifying question making sure not to call you a liar, as I was sure there was more to the story. You're the one who's been defensive and combative since, and honestly misrepresenting facts the entire time. Were you wrong at any point? Only slightly, but you left out so many details that were actually important to the story for anyone to get any value out of your anecdata. Thanks to my persistence, anyone who wanted to learn from your experience now can.
>> I was processing 16bit RAW images at full resolution.
>> ...using After Effects to render out full frame ProRes 4444.
Those are two different applications to most of us. No one is accusing you of making things up, just that the first post wasn't fully descriptive of your use case.
Working with video will use up an extraordinary amount of memory.
Some of the genetics stuff I work on requires absolute gobs of RAM. I have a single process that requires around 400GB of RAM that I need to run quite regularly.
It’s a slight exaggeration, I also have an editor open and some dev process (test runner usually). It’s not just caching, I routinely hit >30 GB swap with fans revved to the max and fairly often this becomes unstable enough to require a reboot even after manually closing as much as I can.
I mean, some of this comes down to poor executive function on my part, failing to manage resources I’m no longer using. But that’s also a valid use case for me and I’m much more effective at whatever I’m doing if I can defer it with a larger memory capacity.
Since applications have virtual memory, it sort of doesn't matter? The OS will map these to actual pages based on how much physical memory is available, how many processes are running, etc. So if only one app runs and it wants lots of memory, it makes sense to give it lots of memory - that is the most "economical" decision from both an energy and performance POV.
So, M1 has been out for a while now, with HN doom and gloom about not being able to put enough memory into them. Real world usage has demonstrated far less memory usage than people expected (I don't know why, maybe someone paid attention and can say). The result is that 32G is a LOT of memory for an M1-based laptop, and 64G is only needed for very specific workloads I would expect.
Measuring memory usage is a complicated topic and just adding numbers up overestimates it pretty badly. The different priorities of memory are something like 1. wired (must be in RAM), 2. dirty (can be swapped), 3. purgeable (can be deleted and recomputed), 4. file-backed dirty (can be written to disk), 5. file-backed clean (can be read back in).
Also note M1's unified memory model is actually worse for memory use not better. Details left as an exercise for the reader.
Unified memory is a performance/utilisation tradeoff. I think the thing is it's more of an issue with lower memory specs. The fact you don't have 4GB (or even 2 GB) dedicated memory on a graphics card in a machine with 8GB of main memory is a much bigger deal than not having 8GB on the graphics card on a machine with 64 GB of main RAM.
Or like games, even semi-casual ones. Civ6 would not load at all on my mac mini. Also had to fairly frequently close browser windows as I ran out of memory.
I couldn't load Civ6 until I verified game files in Steam, and now it works pretty perfectly. I'm on 8GB and always have Chrome, Apple Music and OmniFocus running alongside.
I'm interested to see how the GPU on these performs, I pretty much disable the dGPU on my i9 MBP because it bogs my machine down. So for me it's essentially the same amount of memory.
From the perspective of your GPU, that 64GB of main memory attached to your CPU is almost as slow to fetch from as if it were memory on a separate NUMA node, or even pages swapped to an NVMe disk. It may as well not be considered "memory" at all. It's effectively a secondary storage tier.
Which means that you can't really do "GPU things" (e.g. working with hugely detailed models where it's the model itself, not the textures, that take up the space) as if you had 64GB of memory. You can maybe break apart the problem, but maybe not; it all depends on the workload. (For example, you can't really run a Tensorflow model on a GPU with less memory than the model size. Making it work would be like trying to distribute a graph-database routing query across nodes — constant back-and-forth that multiplies the runtime exponentially. Even though each step is parallelizable, on the whole it's the opposite of an embarrassingly-parallel problem.)
>The SoC has access to 16GB of unified memory. This uses 4266 MT/s LPDDR4X SDRAM (synchronous DRAM) and is mounted with the SoC using a system-in-package (SiP) design. A SoC is built from a single semiconductor die whereas a SiP connects two or more semiconductor dies.
SDRAM operations are synchronised to the SoC processing clock speed. Apple describes the SDRAM as a single pool of high-bandwidth, low-latency memory, allowing apps to share data between the CPU, GPU, and Neural Engine efficiently.
In other words, this memory is shared between the three different compute engines and their cores. The three don't have their own individual memory resources, which would need data moved into them. This would happen when, for example, an app executing in the CPU needs graphics processing – meaning the GPU swings into action, using data in its memory. https://www.theregister.com/2020/11/19/apple_m1_high_bandwid...
I know; I was talking about the computer the person I was replying to already owns.
The GP said that they already essentially have 64GB+8GB of memory in their Intel MBP; but they don't, because it's not unified, and so the GPU can't access the 64GB. So they can only load 8GB-wide models.
Whereas with the M1 Pro/Max the GPU can access the 64GB, and so can load 64GB-wide models.
How much of that 64 GB is in use at the same time though? Caching not recently used stuff from DRAM out to an SSD isn't actually that slow, especially with the high speed SSD that Apple uses.
Right. And to me, this is the interesting part. There's always been that size/speed tradeoff ... by putting huge amounts of memory bandwidth on "less" main RAM, it becomes almost half-ram-half-cache; and by making the SSD fast it becomes more like massive big half-hd-half-cache. It does wear them out, however.
You were (unintentionally) trolled. My first post up there was alluding to the legend that Bill Gates once said, speaking of the original IBM PC, "640K of memory should be enough for anybody." (N.B. He didn't[0])
Video and VFX generally don't need to keep whole sequences in RAM persistently these days because:
1. The high-end SSDs in all Macs can keep up with that data rate (3GB/sec)
2. Real-time video work is virtually always performed on compressed (even losslessly compressed) streams, so the data rate to stream is less than that.
But it's also been around for at least a year. And upcoming PCIe 5 SSDs will up that to 10-14GBps.
I'm saying Apple might have wanted to emphasise their more standout achievements. Such as on the CPU front, where they're likely to be well ahead for a year - competition won't catch up until AMD starts shipping 5nm Zen4 CPUs in Q3/Q4 2022.
I'm guessing that's new for the 13" or for the M1, but my 16‑inch MacBook Pro purchased last year had 64GB of memory. (Looks like it's considered a 2019 model, despite being purchased in September 2020).
Really curious if the memory bandwidth is entirely available to the CPU if the GPU is idle. An nvidia RTX3090 has nearly 1TB/s bandwidth, so the GPU is clearly going to use as much of the 400GB/s as possible. Other unified architectures have multiple channels or synchronization to memory, such that no one part of the system can access the full bandwidth. But if the CPU can access all 400GB/s, that is an absolute game changer for anything memory bound. Like 10x faster than an i9 I think?
Not sure if it will be available, but 400GB/s is way too much for 8 cores to take up. You would need some sort of AVX-512 to hog that much bandwidth.
Moreover, it's not clear how much bandwidth/width the M1 Max CPU interconnect/bus provides.
--------
Edit: Add common sense about HPC workloads.
There is a fundamental idea called the memory-access-to-computation ratio. We can't assume a 1:0 ratio - that would mean doing literally nothing except copying.
Typically your program needs serious fixing if it can't achieve 1:4. (This figure comes from a CUDA course. But I think it should be similar for SIMD)
Edit: also, a lot of that bandwidth is fed through cache. Locality will eliminate some orders of magnitude of memory accesses, depending on the code.
> Not sure if it will be available, but 400GB/s is way too much for 8 cores to take up. You would need some sort of AVX-512 to hog that much bandwidth.
If we assume that the frequency is 3.2GHz and an IPC of 3 with well-optimized code (which is conservative for the performance cores, since they are extremely wide) and count only performance cores, we get 5 bytes per instruction. M1 supports 128-bit Arm NEON, so peak bandwidth usage per instruction (if I didn't miss anything) is 32 bytes.
Don't know the clock speed, but 8 cores at 3GHz working on 128-bit SIMD is 8 × 3 × 16 = 384GB/s, so we are in the right ballpark. Not that I personally have a use for that =) Oh, wait, bloody Java GC might be a use for that. (LOL, FML or both).
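A sketch of that streaming arithmetic; the 3GHz clock and "one 16-byte NEON load per performance core per cycle" are assumptions for the estimate, not published M1 Pro/Max figures.

```c
#include <stdio.h>

/* Sketch of the peak-streaming arithmetic above. Clock speed and
 * "one 16-byte NEON load per core per cycle" are assumptions. */
int main(void) {
    int    cores        = 8;    /* performance cores only  */
    double ghz          = 3.0;  /* assumed clock           */
    int    bytes_per_op = 16;   /* one 128-bit NEON load   */
    double gb_per_s     = cores * ghz * bytes_per_op;
    printf("theoretical streaming demand: %.0f GB/s\n", gb_per_s); /* 384 */
    /* an LDP of two Q registers would bump this to 32 bytes per
       instruction, which may be the 32-byte figure mentioned above */
    return 0;
}
```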
But the classic SIMD problem is matrix-multiplication, which doesn't need full memory bandwidth (because a lot of the calculations are happening inside of cache).
The question is: what kind of problems are people needing that want 400GB/s bandwidth on a CPU? Well, probably none frankly. The bandwidth is for the iGPU really.
The CPU just "might as well" have it, since its a system-on-a-chip. CPUs usually don't care too much about main-memory bandwidth, because its like 50ns+ away latency (or ~200 clock ticks). So to get a CPU going in any typical capacity, you'll basically want to operate out of L1 / L2 cache.
> Oh, wait, bloody Java GC might be a use for that. (LOL, FML or both).
For example, I know you meant the GC as a joke. But if you think of it, a GC is mostly following pointer->next kind of operations, which means its mostly latency bound, not bandwidth bound. It doesn't matter that you can read 400GB/s, your CPU is going to read an 8-byte pointer, wait 50-nanoseconds for the RAM to respond, get the new value, and then read a new 8-byte pointer.
Unless you can fix memory latency (and hint, no one seems to be able to do so), you'll be only able to hit 160MB/s or so, no matter how high your theoretical bandwidth is, you get latency locked at a much lower value.
Yeah the marking phase cannot be efficiently vectorized. But I wonder if it can help with compacting/copying phase.
Also, to me the process sounds oddly similar to vmem page-table walking. There is currently a RISC-V J extension drafting group. I wonder what they can come up with.
But they are demonstrating with 16 cores + 30 GB/s & 128 cores + 190 GB/s. And to my understanding they did not really mention what type of computational load they performed. So this does not sound too ridiculous. The M1 Max is pairing 8 cores + 400GB/s.
How do you prefetch "node->next" where "node" is in a linked list?
Answer: you literally can't. And that's why this kind of coding style will forever be latency bound.
EDIT: Prefetching works when the address can be predicted ahead of time. For example, when your CPU-core is reading "array", then "array+8", then "array+16", you can be pretty damn sure the next thing it wants to read is "array+24", so you prefetch that. There's no need to wait for the CPU to actually issue the command for "array+24", you fetch it even before the code executes.
Now if you have "0x8009230", which points to "0x81105534", which points to "0x92FB220", good luck prefetching that sequence.
--------
Which is why servers use SMT / hyperthreading, so that the core can "switch" to another thread while waiting those 50-nanoseconds / 200-cycles or so.
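A minimal illustration of the prefetching point above (not a benchmark); the names and types here are just for the example.

```c
#include <stddef.h>

/* The array walk has addresses the hardware prefetcher can predict;
 * the list walk cannot issue the next load until the current one
 * returns, so each step eats a full memory round trip. */
struct node { struct node *next; long payload; };

long sum_array(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)   /* a, a+8, a+16, ... : prefetchable */
        s += a[i];
    return s;
}

long sum_list(const struct node *p) {
    long s = 0;
    while (p) {                      /* next address unknown until the   */
        s += p->payload;             /* current load completes: latency- */
        p = p->next;                 /* bound, not bandwidth-bound       */
    }
    return s;
}
```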
I don't really know how the implementation of a tracing GC works but I was thinking they could do some smart memory ordering to land in the same cache-line as often as possible.
But that’s just the marking phase, isn’t it? And most of it can be done fully in parallel, so while not all CPU cores can be maxed out with that, more often than not the original problem itself can be hard to parallelize to that level, so “wasting” a single core may very well be worth it.
I always like pointing out Knuth's dancing links algorithm for Exact-covering problems. All "links" in that algorithm are of the form "1 -> 2 -> 3 -> 4 -> 5" at algorithm start.
Then, as the algorithm "guesses" particular coverings, it turns into "1->3->4->5", or "1->4", that is, always monotonically increasing.
As such, no dynamic memory is needed ever. The linked-list is "statically" allocated at the start of the program, and always traversed in memory order.
Indeed, Knuth designed the scheme as "imagine doing malloc/free" to remove each link, but then later, "free/malloc" to undo the previous steps (because in Exact-covering backtracking, you'll try something, realize its a dead end, and need to backtrack). Instead of a malloc followed up by a later free, you "just" drop the node out of the linked list, and later reinsert it. So the malloc/free is completely redundant.
In particular: a given "guess" into an exact-covering problem can only "undo" its backtracking to the full problem scope. From there, each "guess" only removes possibilities. So you use the "maximum" amount of memory at program start, you "free" (but not really) nodes each time you try a guess, and then you "reinsert" those nodes to backtrack to the original scope of the problem.
Finally, when you realize that, you might as well put them all into order for not only simplicity, but also for speed on modern computers (prefetching and all that jazz).
It's a very specific situation but... it does happen sometimes.
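A minimal sketch of that unlink/relink trick on a doubly linked list, in the spirit of Knuth's description; the names are illustrative, not taken from his code.

```c
/* "Removing" a node just unlinks it from its neighbours, but the node
 * itself still remembers them, so backtracking is a pair of pointer
 * stores. No malloc/free anywhere. */
struct dlx_node { struct dlx_node *prev, *next; };

void dlx_remove(struct dlx_node *x) {
    x->prev->next = x->next;   /* neighbours skip over x...     */
    x->next->prev = x->prev;   /* ...but x keeps its own links  */
}

void dlx_restore(struct dlx_node *x) {
    x->prev->next = x;         /* undo, in reverse order of removal */
    x->next->prev = x;
}
```

Knuth's actual structure is a four-way-linked sparse matrix, but the remove/undo trick is exactly this pair of pointer operations.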
AMD showed with their Infinity Cache that you can get away with much less bandwidth if you have large caches. It has the side effect of radically reducing power consumption.
Apple put 32MB of cache in their latest iPhone. 128 or even 256MB of L3 cache wouldn't surprise me at all given the power benefits.
Apple put ProMotion in the built in display, so while it can ramp up to 120Hz, it'll idle at more like 24 Hz when showing static content. (the iPad Pro goes all the way down to 10Hz, but some early sources seem to say 24Hz for these MacBook Pros.) There may also be panel self refresh involved, in which case a static image won't even need that much. I bet the display coprocessors will expose the adaptive refresh functionality over the external display connectors as well.
It only takes a tiny animation (eg. a spinner, pulsing glowing background, animated clock, advert somewhere, etc), and suddenly the whole screen is back to 120 Hz refresh.
Don't know much about the graphics on an M1. Does it not render to a framebuffer? Is that framebuffer spread over all 4 memory banks? Can't wait to read all about it.
The updates from the Asahi Linux team are fantastic for getting insights into the M1 architecture. They've not really dug deep into the GPU yet, but that's coming soon.
When nothing is changing, you do not have to touch the GPU. Yes, without Panel Self Refresh there would be this many bits going to the panel at that rate, but the display engine would keep resubmitting the same buffer. No need to rerender when there's no damage. (And when there is, you don't have to rerender the whole screen, only the combined damage of the previous and current frames.)
More memory bandwidth = 10x faster than an i9? This makes no sense to me. Don't clock speed and core count determine the major part of the performance of a CPU?
Yes and no. There are many variables to take into account. An example from the early days of the PPC architecture was its ability to pre-empt instructions. This gave performance boosts even in the absence of a higher clock speed. I can't speak specifically on the M1, but there are other things outside of clock speed and cores that determine speed.
Yes, but it's a double edged sword. It means you're using relatively slow ram for the GPU, and that the GPU takes memory bandwidth away from the CPU as well. Traditionally we've ended up with something that looks like Intel's kinda crappy integrated video.
The copying process was never that much of a big deal, but paying for 8GB of graphics ram really is.
> The copying process was never that much of a big deal
I don't know about that? Texture memory management in games can be quite painful. You have to consider different hardware setups and being able to keep the textures you need for a certain scene in memory (or not, in which case, texture thrashing).
The copying process was quite a barrier to using compute (general purpose GPU) to augment CPU processing and you had to ensure that the work farmed to the GPU was worth the cost of the to/from costs. Game consoles of late have generally had unified memory (UMA) and it's quite a nice advantage because moving data is a significant bottleneck.
Using Intel's integrated video as a way to assess the benefits of unified memory is off target. Intel had a multitude of design goals for their integrated GPU and UMA was only one aspect so it's not so easy to single that out for any shortcomings that you seem to be alluding to.
If you're looking at the SKU with the high GPU core count and 64 Gigs of LPDDR5, the total memory bandwidth (400 GBps) isn't that far off from the bandwidth a discrete GPU would have to its local pool of memory.
You also have an (estimated from die shots) 64 megabyte SRAM system level cache and large L2 and L1 CPU caches, but you are indeed sharing the memory bandwidth between the CPU and GPU.
I'm looking forward to these getting into the hands of testers.
> I'm still a bit sad that the era of "general purpose computing" where CPU can do all workloads is coming to an end.
They’ll still do all workloads, but are optimized for certain workloads. How is that any different than say, a Xeon or EPYC cpu designed for highly threaded (server/scientific computing) applications?
In this context the absence of the 27 inch iMac was interesting. If these SoC were not deemed to be 'right' for the bigger iMac then possibly a more CPU focused / developer focused SoC may be in the works for the iMac?
I doubt they are going to make different chips for prosumer devices. They are going to spread out the M1 pro/max upgrade to the rest of the lineup at some point during the next year, so they can claim "full transition" through their quoted 2 years.
The wildcard is the actual mac pro. I suspect we aren't going to hear about mac pro until next Sept/Oct events, and super unclear what direction they are going to go. Maybe allowing config of multiple M1 max SOCs somehow working together. Seems complicated.
On reflection I think they've decided that their pro users want 'more GPU not more CPU' - they could easily have added a couple more CPU cores but it obviously wasn't a priority.
Agreed that it's hard to see how designing a CPU just for the Mac Pro would make any kind of economic sense but equally struggling to see what else they can do!
I think we will see an iMac Pro with incredible performance. Mac Pros, maybe in the years to come. It's a really high-end product to release new specs for. Plus, if they release it with M1 Max chips, what would be the difference? A nicer case and more upgrade slots? I don't see the advantage in power. I think Mac Pros will be upgraded more like 2 years ahead.
They also have a limited headcount and resources so they wouldn't want to announce M1x/pro/max for all machines now and have employees be idle for the next 3 months.
Notebooks also have a higher profit margin, so they sell them to those who need to upgrade now. The lower-margin systems like Mini will come later. And the Mac Pro will either die or come with the next iteration of the chips.
Yup. Once I saw the 24" iMac I knew the 27" had had its chips. A 30" won't actually be much bigger than the 27" if the bezels shrink to almost nothing - which seems to be the trend.
They're not meant to go together like that -- there's not really an interconnect for it, or any pins on the package to enable something like that. Apple would have to design a new Mx SoC with something like that as an explicit design goal.
I think the problem would be how one chip can access the memory of the other one. The big advantage in the M1xxxx is the unified memory. I don't think the chips have any hardware to support cache coherency and so on spanning more than one chip.
You would have to implement single system image abstraction, if you wanted more than a networked cluster of M1s in a box, in the OS using just software plus virtual memory. You'd use the PCIe as the interconnect. Similar has been done by other vendors for server systems, but it has tradeoffs that would probably not make sense to Apple now.
A more realistic question would be what good hw multisocket SMP support would look like in M1 Max or later chips, as that would be a more logical thing to build if Apple wanted this.
The rumor has long been that the future Mac Pro will use 2-4 of these “M1X” dies in a single package. It remains to be seen how the inter-die interconnect will work / where those IOs are on the M1 Pro/Max die.
The way I interpreted it is that it's like lego so they can add more fast cores or more efficiency cores depending on the platform needs. The successor generations will be new lego building blocks.
Not exactly. M1 CPU, GPU, and RAM were all capped in the same package. New ones appear to be more a single board soldered onto mainboard, with a discrete CPU, GPU, and RAM package each capped individually if their "internals" promo video is to be believed (and it usually is an exact representation of the shipping product) https://twitter.com/cullend/status/1450203779148783616?s=20
Suspect this is a great way for them to manage demand and various yields by having 2 CPUs (or one, if the difference between Pro/Max is yield on memory bandwidth) and discrete RAM/GPU components.
I know nothing about hardware, basically. Do Apple’s new GPU cores come close to the capabilities of discrete GPUs like what are used for gaming/scientific applications? Or are those cards a whole different thing?
1. If you're a gamer, this seems comparable to a 3070 Laptop, which is comparable to a 3060 Desktop.
2. If you're an ML researcher you use CUDA (which only works on NVIDIA cards); they have basically a complete software lock unless you want to spend some indeterminate hundreds of hours fixing and troubleshooting compatibility issues.
There has been an M1 fork of Tensorflow almost since the chip launched last year. I believe Apple did the leg work. It’s a hoop to jump through, yes, and no ones training big image models or transformers with this, but I imagine students or someone sandboxing a problem offline would benefit from the increased performance over CPU only.
Not a great fit. Something like Ampere altra is better as it gives you 80 cores and much more memory which better fits a server. A server benefits more from lots of weaker cores than a few strong cores. The M1 is an awesome desktop/laptop chip and possibly great for HPC, but not for servers.
What might be more interesting is to see powerful gaming rigs built around these chips. They could have built a kickass game console with these chips.
Why they didn't lean into that aspect of the Apple TV still mystifies me. A Wii-mote style pointing device seems such a natural fit for it, and has proven gaming utility. Maybe patents were a problem?
Why? There are plenty of server oriented ARM platforms available for use (See AWS Graviton). What benefit do you feel Apple’s platform gives over existing ones?
The Apple cores are full custom, Apple-only designs.
The AWS Graviton are Neoverse cores, which are pretty good, but clearly these Apple-only M1 cores are above-and-beyond.
---------
That being said: these M1 cores (and Neoverse cores) are missing SMT / Hyperthreading, and a few other features I'd expect in a server product. Servers are fine with the bandwidth/latency tradeoff: more (better) bandwidth but at worse (higher) latencies.
My understanding is that you don't really need hyperthreading on a RISC CPU because decoding instructions is easier and doesn't have to be parallelised as with hyperthreading.
The DEC Alpha had SMT on their processor roadmap, but it was never implemented as their own engineers told the Compaq overlords that they could never compete with Intel.
"The 21464's origins began in the mid-1990s when computer scientist Joel Emer was inspired by Dean Tullsen's research into simultaneous multithreading (SMT) at the University of Washington."
Okay, the whole RISC thing is stupid. But ignoring that aspect of the discussion... POWER9, one of those RISC CPUs, has 8-way SMT. Neoverse E1 also has SMT-2 (aka: 2-way hyperthreading).
SMT / Hyperthreading has nothing to do with RISC / CISC or whatever. Its just a feature some people like or don't like.
RISC CPUs (Neoverse E1 / POWER9) can perfectly do SMT if the designers wanted.
Don’t think that is entirely true. Lots of features which exist on both RISC and CISC CPUs have different natural fit. Using micro-ops e.g. on a CISC is more important than in RISC CPU even if both benefit. Likewise pipelining has a more natural fit on RISC than CISC, while micro-op cache is more important on CISC than RISC.
I don't even know what RISC or CISC means anymore. They're bad, non-descriptive terms. 30 years ago, RISC or CISC meant something, but not anymore.
Today's CPUs are pipelined, out-of-order, speculative, superscalar, (sometimes) SMT, SIMD, multi-core with MESI-based snooping for cohesive caches. These words actually have meaning (and in particular, describe a particular attribute of performance for modern cores).
RISC or CISC? useful for internet flamewars I guess but I've literally never been able to use either term in a technical discussion.
-------
I said what I said earlier: this M1 Pro / M1 Max, and the ARM Neoverse cores, are missing SMT, which seems to come standard on every other server-class CPU (POWER9, Intel Skylake-X, AMD EPYC).
Neoverse N1 makes up for it with absurdly high core counts, so maybe its not a big deal. Apple M1 however has very small core counts, I doubt that Apple M1 would be good in a server setting... at least not with this configuration. They'd have to change things dramatically to compete at the higher end.
POWER9, RISC-V, and ARM all have microcoded instructions. In particular, division, which is very complicated.
As all CPUs have decided that hardware-accelerated division is a good idea (and in particular: microcoded, single-instruction division makes more sense than spending a bunch of L1 cache on a series of instructions that everyone knows is "just division" and/or "modulo"), microcode just makes sense.
The "/" and "%" operators are just expected on any general purpose CPU these days.
30 years ago, RISC processors didn't implement divide or modulo. Today, all processors, even the "RISC" ones, implement it.
It's slightly more general than that: hiding inefficient use of functional units. A lot of the time that's memory latency causing the inability to keep FUs fed, like you say, but I've seen other reasons, like a wide but diverse set of FUs that have trouble applying to every workload.
The classic reason quoted for SMT is to allow the functional units to be fully utilised when there is instruction-to-instruction dependencies - that is, the input of one instruction is the output from the previous instruction. Doing SMT allow you to create one large pool of functional units and share them between multiple threads, hopefully increasing the chances that they will be fully used.
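A toy example of the dependency-chain case being described; the 3-4 cycle FP add latency mentioned in the comments is a typical figure, not something specific to these chips.

```c
/* Every addition needs the result of the previous one, so a single
 * thread can only issue roughly one FP add per add-latency (a few
 * cycles) no matter how many FP units the core has. A second SMT
 * thread running its own chain can issue into those idle slots. */
double serial_sum(const double *a, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i];   /* s depends on the s from the previous iteration */
    return s;
}
```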
Well, tons; there isn't another ARM core that can match a single M1 Firestorm, core to core. Heck, only the highest-performance x86 cores can match a Firestorm core, and that's just raw performance, not even considering power efficiency. But of course, Apple's not sharing.
They were, but have stopped talking about that for years. The project is probably canceled; I've heard Jim Keller talk about how that work was happening simultaneously with Zen 1.