
x86 doesn't have fixed-width instructions. Depending on the instruction mix, you may be able to decode more instructions per cycle, and if you target the common instructions, you can get a lot of benefit in real-world programs.

ARM is different, but probably easier to decode, so you can widen the decoder.
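
As a toy illustration of why (a Python sketch with a made-up length function, not a model of any real decoder): with a fixed 4-byte ISA every instruction boundary is known up front, while with variable lengths each boundary depends on the length of the previous instruction.

    # Toy contrast between fixed- and variable-width boundary finding.
    def fixed_width_boundaries(code: bytes, width: int = 4) -> list[int]:
        # Fixed width: every boundary is known immediately, so N
        # parallel decoders can each grab their instruction at once.
        return list(range(0, len(code), width))

    def length_of(code: bytes, pos: int) -> int:
        # Fake stand-in for x86 length decoding, which really has to
        # examine prefixes, opcode, ModRM/SIB, displacement/immediate.
        return 1 + (code[pos] % 4)  # pretend: 1 to 4 bytes

    def variable_width_boundaries(code: bytes) -> list[int]:
        # Variable width: boundary N+1 depends on instruction N's
        # length, a serial chain that limits naive parallel decode.
        boundaries, pos = [], 0
        while pos < len(code):
            boundaries.append(pos)
            pos += length_of(code, pos)
        return boundaries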



This, I think, is the real answer. For a long time people said "CISC is just compression for RISC, making a virtue of necessity", but the M1 serves as a good counterexample where a simpler ISA is scaled up to modern transistor counts (and given exclusive access to the world's best manufacturing, TSMC 5nm).


Considering that x86 is less dense than any RISC ISA, the "compression" argument behind CISC falls apart. No surprise a denser, trivial-to-decode ISA does better.


Do you have a source for that? The first Google result I found for research on this shows x86 as denser than almost every RISC ISA [1]. It's just one study, and it predates ARM64, fwiw.

[1] https://www.researchgate.net/profile/Sally_McKee/publication...


That paper uses no actual benchmarks; it grabbed a single system utility and then hand-optimized it. SPEC and Geekbench show x86-64 coming in at well over 4 bytes per instruction on average.


Sure, I never claimed it was the be-all and end-all, just the only real source I could find. Adding "SPEC" or "geekbench" to my searches didn't really help.

Doing a little more digging, I also found this [1], which claims "the results show that the average instruction length is about 2 to 3 bytes". On the other hand, this [2] finds that the average instruction length is 4.25 bytes.

Bytes per instruction doesn't really say anything useful about code density when talking about RISC vs. CISC, though, since (arguably) the whole idea is that individual CISC instructions are supposed to do more than individual RISC instructions. A three-instruction CISC routine at five bytes each (15 bytes) is still a win over a four-instruction RISC routine at four bytes each (16 bytes). Overall code size is what actually matters.

[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40...

[2] http://www.cs.unc.edu/~porter/pubs/instrpop-systor19.pdf
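
If you want to measure this yourself, here's a rough sketch using the Capstone disassembler (pip install capstone); the byte string is a tiny hand-picked stand-in, so for real numbers you'd feed in the .text section of an actual binary:

    import capstone

    code = bytes.fromhex(
        "55"        # push rbp          (1 byte)
        "4889e5"    # mov rbp, rsp      (3 bytes)
        "8b4708"    # mov eax, [rdi+8]  (3 bytes)
        "014708"    # add [rdi+8], eax  (3 bytes, read-modify-write)
        "5d"        # pop rbp           (1 byte)
        "c3"        # ret               (1 byte)
    )

    md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)
    sizes = [insn.size for insn in md.disasm(code, 0x1000)]
    print(f"{len(sizes)} insns, avg {sum(sizes)/len(sizes):.2f} bytes")

On this toy blob it prints 6 insns at 2.00 bytes average; real binaries skew longer.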


But there's more work being done per average x86_64 instruction due to read-modify-write (RMW) ops; "add [rdi], eax", for example, is a single instruction where a load/store RISC needs a load, an add, and a store. That's why they compare entire binaries rather than bytes per instruction.


OK, I could see how one could implement a variable-width instruction decoder (e.g. "if there are 8 one-byte instructions in a row, handle them all; otherwise fall back to 4-way decoding" -- of course, a much more sophisticated approach could be used).

But is this actually done? I'd honestly be interested in a source for that; I just searched again and could find no source supporting it (though of course I may simply not have used the right search terms; I wouldn't be surprised by that in the least). E.g. https://www.agner.org/optimize/microarchitecture.pdf#page216 makes no mention of this for AMD Zen (version 1; it doesn't say anything about Zen 2/3).

I did find various sources which talk about how many instructions / µops can be scheduled at a time, and there it may be 8-way, but that's a completely different metric, isn't it?
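
For what it's worth, the textbook trick for widening a variable-length decoder (describing the general technique, not any specific CPU) is speculative predecode: compute a candidate instruction length at every byte offset in parallel, then chain through the results to pick out the real boundaries.

    # Toy sketch of speculative parallel predecode (textbook technique,
    # not a model of a shipping CPU). Step 1 is embarrassingly parallel
    # in hardware; step 2 is a fast selection over precomputed lengths.

    def candidate_lengths(code: bytes) -> list[int]:
        # Step 1: for every byte i, compute "if an instruction started
        # here, how long would it be?" (fake length function again).
        return [1 + (b % 4) for b in code]

    def select_boundaries(lengths: list[int]) -> list[int]:
        # Step 2: walk the chain from offset 0; only offsets reached
        # this way are real instruction starts, the rest are discarded.
        boundaries, pos = [], 0
        while pos < len(lengths):
            boundaries.append(pos)
            pos += lengths[pos]
        return boundaries

    print(select_boundaries(candidate_lengths(bytes(range(32)))))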


As a historical note, the P6 (Pentium Pro) takes an interesting approach. It has three decoders, but only one of them can handle "complex macroinstructions" that require micro-operations from the ROM. If a limited-functionality decoder gets a complex instruction, the instruction is redirected to another decoder the next cycle.

As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.

Ref: "Modern Processor Design" p268.


> As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.

And how fast is that able to run on x86? How many instructions can that process at once, compared to an alternate universe where that circuit has the same transistor and time budget but only has to look at the first four bits of an instruction?


I should probably check Agner's guide, but the P6 is still the rough blueprint for everything (except the P4) that Intel has done since.


They were still doing this in the Nehalem timeframe (possibly Hinton's influence?).



