Hacker News

The article specifically answers this:

- The x86 instruction set can't be decoded in parallel as easily because instructions have different lengths: roughly 4 decoders max, while Apple has 8 and could go higher.

- Business model does not allow this kind of integration.



> instructions have different lengths

It also allows extremely long instructions: the ISA permits up to 15 bytes and faults at 16 (without that artificial limit you could construct arbitrarily long x86 instructions).
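To see why variable length makes wide decode hard, here is a toy sketch (not a real x86 decoder; the one-byte-encodes-length scheme is made up): the start of instruction N+1 is unknown until instruction N has been length-decoded, so finding instruction boundaries is inherently serial, whereas a fixed-length ISA can compute every boundary as i*4 in parallel.

```python
def find_boundaries(code, length_of):
    """Return the byte offsets where instructions start.

    With variable-length instructions this loop is serial: each
    iteration depends on the length decoded in the previous one.
    """
    offsets, pos = [], 0
    while pos < len(code):
        offsets.append(pos)
        pos += length_of(code[pos])  # must decode before moving on
    return offsets

# Hypothetical encoding: the first byte of each instruction is its
# length in bytes (1-15), mimicking x86's 15-byte cap.
code = bytes([1, 3, 0, 0, 2, 0, 4, 0, 0, 0])
print(find_boundaries(code, lambda b: b))  # [0, 1, 4, 6]
```

A fixed-length ISA sidesteps this entirely, which is one reason adding decoders is cheap there and expensive on x86.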


What a nightmare, but it makes me wonder: rather than decoding into micro-ops at runtime, could Intel or AMD "JIT" code up-front, in hardware, into a better bytecode?

I'm sure it wouldn't work for everything, but why wouldn't it be feasible to keep a cache of decoding results by page or something?


This is exactly how the hardware works, and that is what micro-ops are. On any system with a u-op cache or trace cache, those decoded instructions are cached and reused instead of being decoded again. Unfortunately you still have to decode each instruction at least once, and that first decode is the bottleneck being discussed here. This is all transparent to the OS and not visible outside a low-level instruction cache, which means you don't need a major OS change; arguably, if you were willing to take that hit, you could go further.
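The idea can be sketched as memoized decoding (the `decode_fn` and the micro-op format here are invented for illustration, not any real CPU's): the expensive decode runs at most once per instruction address, and later fetches of the same address reuse the cached micro-ops.

```python
class UopCache:
    """Toy micro-op cache: address -> decoded micro-ops."""

    def __init__(self, decode_fn):
        self.decode_fn = decode_fn
        self.cache = {}      # address -> decoded micro-ops
        self.decodes = 0     # how many times the slow path ran

    def fetch(self, address, raw_bytes):
        if address not in self.cache:         # slow path: full decode
            self.decodes += 1
            self.cache[address] = self.decode_fn(raw_bytes)
        return self.cache[address]            # fast path: cached u-ops

uc = UopCache(decode_fn=lambda raw: ("uop", raw))
for _ in range(1000):                         # e.g. a hot loop body
    uc.fetch(0x400000, b"\x89\xd8")
print(uc.decodes)  # 1 -- decoded once, served from cache thereafter
```

This is why the decode-width limit mostly hurts code that misses the u-op cache: cold code, large footprints, and the first trip through any loop.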


So what stops x86 from adding the micro-ops as a dedicated alternate instruction set, Thumb style? Maybe with the implication that Intel will not hold the instruction set stable between chips, pushing vendors to compile to it on the fly?


Micro-ops are usually much wider than the ISA's instructions, and usually not a multiple of 8 bits wide either.

A dedicated alternative instruction set would be possible, but it would take die space and make the combined x86_64-plus-new-ISA even harder to decode.


From what I understand this is exactly what the instruction decoder does.


They do something similar for loops: the CPU doesn't decode the same instructions over and over again, it replays them from a decoded-instruction cache with a capacity of around 1,500 micro-ops.
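A rough sketch of such a size-limited decoded-instruction buffer: a loop whose body fits is replayed from decoded form, while a larger loop keeps falling back to the decoders. The capacity and LRU eviction here are simplified assumptions, not any specific CPU's design.

```python
from collections import OrderedDict

class LoopBuffer:
    """Toy bounded cache of decoded instructions with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # address -> decoded form
        self.redecodes = 0             # fetches that missed the buffer

    def fetch(self, address):
        if address in self.entries:
            self.entries.move_to_end(address)   # hit: refresh LRU order
            return self.entries[address]
        self.redecodes += 1                     # miss: decode again
        self.entries[address] = f"uops@{address}"
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recent
        return self.entries[address]

small = LoopBuffer(capacity=4)
for _ in range(10):                 # loop of 3 instructions: fits
    for addr in (0, 1, 2):
        small.fetch(addr)
print(small.redecodes)  # 3 -- each decoded once, then replayed

big = LoopBuffer(capacity=4)
for _ in range(10):                 # loop of 6 instructions: thrashes
    for addr in range(6):
        big.fetch(addr)
print(big.redecodes)  # 60 -- sequential access over LRU misses every time
```

The same effect shows up in real hardware: a hot loop that fits the decoded buffer bypasses the decode-width bottleneck entirely, while a slightly larger one pays full decode cost on every iteration.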


Hmm, this reminds me of Transmeta https://en.wikipedia.org/wiki/Transmeta


They do this in a lot of designs. It's called a micro op cache, or sometimes an L0I cache.


I think the latter is the biggest challenge.

I imagine that Apple's M1 team is using what they know about macOS, what they know about applications in the store, and what user telemetry macOS customers have opted into, all to build a picture of which problems are most important to solve for the type of customer who will buy an M1-equipped macOS device. They have no requirement to provide something that works equally well in server, desktop, etc. roles, or for Windows and Linux, and they have a lot more information about what's actually running day to day.


They say in the article that AMD "can't" build more than 4 decoders. Is that really true? It could mean:

* we can't get a budget to sort it out

* 5 would violate information theory

* nobody wants to wrestle that pig, period

* there are 20 other options we'd rather exhaust before trying

When they've done 12 of those things and the other 8 turn out to be infeasible, will they call it quits, or will someone figure out how to either get more decoders or use them more effectively?


Their business model allowed them to integrate a GPU and a video decoder, so of course it allows for this kind of integration. The author is not even in that industry, so a lot of his claims are fishy. Moore's law is not about frequency, for example.


I think what they mean is the lack of coordination between software and hardware manufacturers, and the unwillingness of Intel/AMD etc. to license their IP to Dell etc. What is untrue about that?

On Moore's law: yes, it's about transistor count, but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.


> What is untrue about that?

The fact that they don't need to license technology. They can bring more functionality into the package that they sell to Dell, etc. like they have already done.

> but I think the point they're making is that Moore's law created a situation where you couldn't ramp up frequency anymore because the transistors are so small.

That is not the point they are making. Clock frequencies have not changed since the deep-pipelined P4, but transistor count has continued to climb. Here is what the author, who clearly does not know what he is talking about, said about that:

"increasing the clock frequency is next to impossible. That is the whole 'End of Moore’s Law' that people have been harping on for over a decade now."



