> There is no compilation that happens for the assembly code as far as I am aware. ... Wouldn't that execute all of the code serially?
Nope, and that's largely what the article is getting at. Modern x86 processors optimize the x86 machine code so heavily that they quite literally 'compile' it down to what are called micro-operations (uops), and those uops are what the CPU actually executes. And it goes beyond that, because the x86 machine code doesn't really map to the processor's actual implementation, so the CPU does extra things like register renaming, where it dynamically maps the 16 or so registers exposed in the machine code onto, say, 64 or 128 internal registers (so an instruction like `inc %eax` may actually write the incremented `%eax` to a completely new internal register rather than modifying the existing value, with that new internal register becoming the new `%eax`).
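A minimal sketch of the false dependency that renaming removes (AT&T syntax; the registers and loads are made up for illustration, not any specific core's behavior):

```
# Two unrelated loads that happen to reuse the *name* %eax.
mov (%rdi), %eax   # load #1 writes %eax
add %eax, %esi     # consume load #1
mov (%rdx), %eax   # load #2 reuses the name %eax
add %eax, %ecx     # consume load #2
# Without renaming, load #2 must wait until the first chain is
# done reading %eax. With renaming, each write of %eax gets a
# fresh internal register, so both chains can be in flight at once.
```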
And it uses all of this to aggressively execute the machine code out of order: it sees which instructions depend on which other instructions and determines which can be executed out-of-order without affecting the end result. The point of doing this is that there are lots of operations that can stall the processor, the big two being branching and fetching memory (either from cache or from main memory). The CPU is much faster than memory and even cache, so any trip to either of those causes a big performance hit; but if the CPU can keep executing instructions during that time (because they don't depend on that memory), you get a lot more performance.
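As a rough illustration (a hypothetical snippet, not taken from any real program):

```
mov (%rdi), %rax   # load; on a cache miss this can take 100+ cycles
add %rax, %rbx     # depends on the load, so it has to wait
add $1, %rcx       # independent, so the CPU can execute it
add $1, %rdx       # (and this) while the load is still outstanding
```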
A branch effectively stops the out-of-order execution at that point, because the CPU doesn't know which instruction will be executed after the branch. The CPU can do 'branch prediction', however, where it guesses the outcome of the branch and keeps executing from that point while waiting for the branch to be resolved. If the guess was right, there is no delay. If the guess was wrong, the speculative work is thrown out and execution restarts from the correct location.
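Something like this hypothetical snippet (label and registers invented for illustration):

```
    cmp $0, %rdi       # is the argument zero?
    je  .Lzero         # the CPU predicts this branch and keeps going
    add $1, %rsi       # run speculatively under a "not taken" guess;
    add %rsi, %rbx     # thrown away if the guess turns out wrong
.Lzero:
    mov %rbx, %rax
```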
Note that, generally speaking, none of these are bad things by themselves; I'd even argue they're great things, and adding such features to a processor is somewhat inevitable if you want to retain decades of compatibility like we have. But it has arguably resulted in hardware bugs like Spectre and Meltdown, though I'd argue it's a lot more nuanced than the article implies. And none of this really has to do with C; we're only talking about x86 assembly (which exists in the form it does almost purely for backwards compatibility).
Intel and AMD do not expose the micro-operations in any form, which rules out a lot of what the article is talking about. But at the same time, you can easily argue that's a good thing: if they did, they would either need to support whatever form they expose for the next decade (and eventually end up with a different set of weird optimizations to boost performance while maintaining compatibility), or you'd have to compile different versions of your code for every new CPU (which would be a disaster).
Edit: I left out one more relevant detail (which I'm only including because the article talks about it a fair amount): the CPU requests memory in chunks called 'cache lines', usually 32, 64, or 128 bytes in size. This means that whether the CPU will already have a particular piece of memory on hand when your code executes is a more complicated question, because if multiple parts of your code reference memory within the same cache line, it will be a lot faster, since only one memory fetch is required. And code that has no branches will all be in the same cache line (or consecutive cache lines), which makes the out-of-order execution simpler since all of the code is already fetched. On top of that, there's a complex process for ensuring consistency of cached memory across multiple cores/CPUs. Older CPUs didn't bother with any of this because memory was fast enough to simply be read/written on demand without slowing the CPU down, so the x86 instruction set (generally) acts as though you're reading/writing directly to main memory, without any cache, and it's up to the CPU to maintain that illusion.
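A tiny illustration, assuming 64-byte lines and a 64-byte-aligned buffer in `%rdi` (both assumptions; actual line size varies by CPU):

```
mov 0(%rdi), %rax     # miss: fetches the whole line, bytes 0..63
mov 8(%rdi), %rbx     # same line, so this is a cheap cache hit
mov 4096(%rdi), %rcx  # different line: a separate memory fetch
```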
Mapping to uops is a trivial translation that hardly counts as compilation. Everything else is dynamic scheduling and speculation, which is also not compilation, as it is (mostly) data dependent.
> Mapping to uops is a trivial translation that hardly counts as compilation.
That's fair, but now we're just arguing the semantics of what is and isn't compilation :) I understand your criticism though, it's just a 'translator'.