
100% agree.

A lot of people asking for ARM-based machines don't really realize that the reason isn't ARM itself; they just want faster, more efficient machines. To make an analogy for HN front-end devs: imagine a non-technical marketing person going to a developer and saying "I want this website in AMP, because AMP is fast." And sure, a lot of sites built with the AMP framework are fast... but that doesn't mean you have to build a site in AMP to make it fast. You could build a simple site on similar principles and get it just as fast. Similarly, not all ARM laptops are fast. And even if the laptop's hardware is fast, you want something that can also run x86 code efficiently, for compatibility, for many years.



It’s true that ARM alone isn’t the reason for the M1’s performance, but it’s definitely a significant factor. x86 is old — modern x86 chips are still backwards-compatible with the original 8086 from 1978 — and it’s stuck with plenty of design decisions that might have been the correct choice at some point in the past 45 years but aren’t today. The M1, by contrast, only implements AArch64, a complete redesign of the ARM architecture from 2012, so it doesn’t have to deal with that legacy architectural baggage. (We’ve known x86 was the wrong design since the ’80s — hence why there are no Intel chips in smartphones — but it hasn’t been realistic for anybody except Apple to spend 10 years and billions of dollars making a high-performance non-x86 chip.)

Some examples:

- x86 guarantees strong memory ordering on multi-processor systems, which adds completely unnecessary overhead to every memory access. arm64 uses a weak memory model instead, providing atomic instructions with relaxed or acquire/release semantics (see https://youtu.be/KeLBd2EJLOU?t=28m19s for a more detailed discussion). This significantly improves performance across the board, but especially for reference-counting operations, which are extremely common and often a bottleneck in code written in ObjC/Swift (see the sketch after this list): https://twitter.com/Catfish_Man/status/1326238434235568128

> fun fact: retaining and releasing an NSObject takes ~30 nanoseconds on current gen Intel, and ~6.5 nanoseconds on an M1

- x86 instruction decode is pretty awful, a significant bottleneck, and hard to parallelize due to the haphazardly designed variable-length CISC instruction set. arm64’s instruction set is highly regular and easy to decode, so Apple can decode up to 8 instructions per clock (as opposed to 4 for x86 chips). Most sources agree this is why the M1 can have such a big out-of-order-execution window and achieve such high instruction-level parallelism compared to Intel/AMD.

- x86_64 has only 16 architectural registers, compared to 32 for arm64. This means the compiler has a much harder time generating efficient, parallelizable code and must resort to spilling registers much more often.
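To make the reference-counting point concrete, here is a minimal C++ sketch (the RefCounted type and method names are mine, not Apple's ObjC/Swift runtime): the increment only needs relaxed ordering, and the decrement uses acquire/release so that everything done while holding a reference is visible before the object is destroyed. On x86 every atomic read-modify-write is effectively a full barrier anyway, so the weaker orderings buy nothing there, while on arm64 they compile to cheaper instructions.

    #include <atomic>

    struct RefCounted {
        std::atomic<long> refs{1};

        void retain() {
            // Only atomicity of the counter matters here; no ordering needed.
            refs.fetch_add(1, std::memory_order_relaxed);
        }

        void release() {
            // Release on the decrement, acquire before destruction, so all
            // writes made while a reference was held happen-before deletion.
            if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                delete this;
            }
        }

        virtual ~RefCounted() = default;
    };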


The issue for me is that ARM is also really old now. I mean, just look at the ISA Apple has to use to run their macOS on it: it's littered with NEON extensions and more cruft than you can shake a stick at. Simply put, Apple's implementation of ARM is decidedly CISC. On top of this, I'm still dumbfounded by the fact that they didn't go for a chiplet design where ARM could truly shine: if Apple had gone the chiplet route, the M1 could have had a much higher IO ceiling and might have a shot at addressing more than 16 gigs of RAM.

Apple has a much bigger issue, though. ARM doesn't scale: it's a fundamental conceit of the architecture, one that a lot of people are probably willing to accept on a laptop that will mostly be used for Twitter and YouTube. This presents issues for the rest of the market, though, and it will be fascinating to see how Apple retains its pro userbase while missing out on the high-performance hardware sector entirely.

I think x86 is pretty terrible too, if it's any consolation, but really it's the only option you've got as a programmer in the 21st century. I hopped on the Raspberry Pi bandwagon when I was still in middle school; I grew up rooting for the little guy here. Looking out at the future landscape of computer hardware, though, I really only see RISC-V. ARM is an improvement on x86, but I don't think it's profound enough to make people care. RISC-V, on the other hand, blows both of them out of the water. On consumer hardware, it's able to accelerate pretty much any workload while sipping a few mW. On professional hardware, you can strap a few hundred of those cores together and they'll work together to create highly complex data-processing pipelines. On server hardware, it will probably move like gangbusters. Even assuming that cloud providers pocket half the improvements, a 5x price/performance increase will have the business sector racing to support it.

So yeah, it is a pretty complex situation. Apple did a cool thing with the M1, but they have a long way to go if they want to dethrone x86 in its entirety.


Where to start?

> ARM is really old now.

Well, AArch64 was announced in 2011, so it's not really that old.

> Apple’s implementation of ARM is decidedly CISC.

CISC is a description of the instruction set not the implementation.

> ARM doesn’t scale.

No idea what this means, but you can get 128-core Arm CPUs that address huge amounts of memory; perhaps you have another definition of scaling.

And so on.


As far as I understand it, “CISC” doesn’t mean “has a lot of instructions”, it means the individual instructions are themselves complex/composable/expressing more than one hardware operation. For instance, on x86 you can write an instruction like ‘ADD [rax + 0x1234 + 8*rbx], rcx’ that performs a multi-step address calculation with two registers, reads from memory, adds a third register, and writes the result back to memory — and you can stick on prefix bytes to do even more things. ARM doesn’t have anything like that; it is a strict load/store architecture where every instruction is fixed-width with a regular format and either accesses memory or performs a computation on registers.
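Spelled out as a rough C++ sketch (the function and parameter names here are made up for illustration), the work packed into that single x86 instruction looks like this, which is also roughly how a load/store ISA has to express it, one fixed-width instruction per step:

    #include <cstdint>

    // Illustrative only: the single x86 instruction
    //   ADD [rax + 0x1234 + 8*rbx], rcx
    // bundles an address calculation, a load, an add, and a store.
    void add_to_table(std::uint64_t* base /* rax */,
                      std::uint64_t index /* rbx */,
                      std::uint64_t value /* rcx */) {
        auto* addr = reinterpret_cast<std::uint64_t*>(
            reinterpret_cast<char*>(base) + 0x1234 + 8 * index);  // address calc
        std::uint64_t tmp = *addr;  // load from memory
        tmp += value;               // add the register operand
        *addr = tmp;                // store the result back
    }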

Instructions like the hardware primitives for AES/SHA, or the FJCVTZS “JavaScript instruction”, don’t make a processor CISC just because they’re specialized. They all encode trivial, single-cycle hardware operations that would otherwise be difficult to express in software (even though they may be a bit more specialized than something like “add”, they’re not any more complex). x86 is CISC because the instruction encoding is more complicated, specifying many hardware operations with one software instruction.
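As an aside, here is roughly what FJCVTZS computes, sketched in software (the function name is mine; this is my understanding of JavaScript's ToInt32 semantics, not the architectural pseudocode). The point is that it's one simple, well-defined operation that would otherwise take a handful of instructions:

    #include <cmath>
    #include <cstdint>

    // Rough sketch of the conversion FJCVTZS does in a single instruction:
    // JavaScript's ToInt32 on a double, i.e. truncate toward zero and wrap
    // modulo 2^32, with NaN and infinities mapping to 0.
    std::int32_t js_to_int32(double d) {
        if (!std::isfinite(d)) return 0;
        double t = std::trunc(d);                // round toward zero
        double m = std::fmod(t, 4294967296.0);   // wrap modulo 2^32
        if (m < 0) m += 4294967296.0;
        return static_cast<std::int32_t>(static_cast<std::uint32_t>(m));
    }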

I’m not exactly sure what all the “cruft” is in ARM that you’re referring to. The M1 only implements AArch64, which is less than 10 years old and is a completely new architecture that is not backwards-compatible with 32-bit ARM (it has been described as being closer to MIPS than to arm32). NEON doesn’t strike me as a good example of cruft, because SIMD provides substantial performance gains for math-heavy programs, and in any case 10 years of cruft is much better than 45.

I’m curious as to why RISC-V is different or better. I don’t know much about RISC-V — but looking at the Wikipedia article, it just looks like a generic RISC similar to MIPS or AArch64 (and it’s a couple of years older than AArch64 as well). Is there some sort of drastic design difference I’m missing?


The only advantage I’ve heard put forward for RISC-V on single-threaded applications is the existence of compressed instructions, which could reduce cache misses, albeit at the expense of a slightly more complex decoder. I’m a bit sceptical as to whether this is material, though, as cache sizes increase.

Of course the flexibility of the RISC-V model allows approaches such as that being pursued by Esperanto [1] with lots and lots of simpler cores.

[1] https://www.esperanto.ai/wp-content/uploads/2021/08/HC2021.E...


ARM had THUMB, which definitely improved performance back in the GameBoy days — but they dropped that with AArch64, so presumably they decided it wasn’t beneficial anymore.


Indeed, and IIRC the increased code density got them into Nokia phones too.

I find it hard to believe that they dropped Thumb from AArch64 without a lot of analysis of the impact on performance.


> On top of this, I'm still dumbfounded by the fact that they didn't go for a chiplet design where ARM could truly shine: if Apple had gone the chiplet route, the M1 could have had a much higher IO ceiling and might have a shot at addressing more than 16 gigs of RAM.

Remember that the M1 is just a mobile SoC designed to work in the iPad/MacBook Air. It's exceptionally good, so people tend to assume the M1 is targeted at the higher end. 16GB max is fine for a mobile SoC in 2021. I can't wait for the M1X.


If you don't think ARM can scale any further, why do you think x86 can? They could easily double all the specs in the "M2" and slap two or more of them into a Mac Pro.



