also allows extremely long instructions: the ISA allows up to 15 bytes and faults at 16 (without that artificial limit you could create arbitrarily long x86 instructions).
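To make the 15-byte limit concrete, here's a small C sketch of mine (the array name is made up): one legal 15-byte instruction built by stacking redundant 0x66 operand-size prefixes in front of a NOP. One more prefix byte would push it to 16, and the CPU would fault instead of executing it.

    #include <stdio.h>

    /* One legal 15-byte x86-64 instruction: 14 redundant 0x66
     * operand-size prefixes followed by NOP (0x90). A 15th prefix
     * would make the instruction 16 bytes and the CPU would fault
     * on it rather than execute it. */
    static const unsigned char long_nop[] = {
        0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66,
        0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66,   /* 14 prefixes */
        0x90                                         /* nop         */
    };

    int main(void) {
        printf("instruction length: %zu bytes\n", sizeof long_nop);  /* 15 */
        return 0;
    }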
What a nightmare, but it makes me wonder: rather than decoding into micro-ops at runtime, could Intel or AMD "JIT" code up-front, in hardware, into a better bytecode?
I'm sure it wouldn't work for everything, but why wouldn't it be feasible to keep a cache of decoding results by page or something?
This is exactly how the hardware works and what micro-ops are: on any system with a u-op cache or trace cache, those decoded instructions are cached and reused instead of being decoded again. Unfortunately you still have to decode each instruction at least once, and that first decode is the bottleneck being discussed here. This is all transparent to the OS and not visible outside a low-level instruction cache, which means you don't need a major OS change; but arguably, if you were willing to take that hit, you could go further.
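To illustrate the "decode once, reuse many times" idea, here's a toy software analogy (all names, sizes, and structures are made up and bear no resemblance to the real silicon): decoded results are keyed by fetch address, and the expensive decoder only runs on a miss.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define UOP_CACHE_ENTRIES 64
    #define MAX_UOPS_PER_LINE  6

    typedef struct { uint8_t opcode, dst, src; } MicroOp;

    typedef struct {
        bool     valid;
        uint64_t fetch_addr;                 /* tag                   */
        MicroOp  uops[MAX_UOPS_PER_LINE];    /* pre-decoded micro-ops */
        int      count;
    } UopCacheLine;

    static UopCacheLine uop_cache[UOP_CACHE_ENTRIES];
    static int decode_calls;

    /* Stand-in for the slow legacy decoder: here it just fabricates
     * one fake micro-op so the sketch is self-contained. */
    static int decode_x86(uint64_t fetch_addr, MicroOp *out)
    {
        decode_calls++;
        out[0] = (MicroOp){ .opcode = (uint8_t)fetch_addr, .dst = 0, .src = 1 };
        return 1;
    }

    static int fetch_uops(uint64_t fetch_addr, MicroOp *out)
    {
        UopCacheLine *line = &uop_cache[(fetch_addr >> 4) % UOP_CACHE_ENTRIES];

        if (line->valid && line->fetch_addr == fetch_addr) {
            /* Hit: reuse the decode we already paid for. */
            memcpy(out, line->uops, line->count * sizeof *out);
            return line->count;
        }

        /* Miss: pay the decode cost once, then cache the result. */
        line->count      = decode_x86(fetch_addr, line->uops);
        line->fetch_addr = fetch_addr;
        line->valid      = true;
        memcpy(out, line->uops, line->count * sizeof *out);
        return line->count;
    }

    int main(void)
    {
        MicroOp buf[MAX_UOPS_PER_LINE];
        fetch_uops(0x401000, buf);   /* miss: goes through the decoder  */
        fetch_uops(0x401000, buf);   /* hit: served from the u-op cache */
        printf("decoder ran %d time(s) for 2 fetches\n", decode_calls);
        return 0;
    }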
So what stops x86 from adding the micro-ops as a dedicated alternate instruction set, Thumb style? Maybe with the implication that Intel will not hold the instruction set stable between chips, pushing vendors to compile to it on the fly?
They do something similar for loops. The CPU doesn't decode the same instructions over and over again; it just replays them from the 'decoded instruction cache' (the u-op cache), which has a capacity of around 1,500 micro-ops.