There are a number of reasons for the Itanium's poor performance, and it's the combination of these various factors that did it in. I wasn't present back in the Itanium's heyday, but this is what I gathered.
As a quick recap, superscalar processors have multiple execution units, each of which can execute one instruction each cycle. So if you have three execution units, your CPU can execute up to three instructions every cycle. The conventional way to make use of the power of more than one execution unit is to have an out-of-order design, where a complicated mechanism (the Tomasulo algorithm) decodes multiple instructions in parallel, tracks their dependencies and dispatches them to execution units as they become ready to execute. Dependencies are resolved by having a large physical register file, which is dynamically mapped onto the programmer-visible logical register file (register renaming). This works well, but is notoriously complex to implement and requires a couple of extra pipeline stages between decode and execution, increasing the latency of mispredicted branches.
The idea of VLIW architectures was to improve on this by moving the decision of which instruction to execute on which port into the compiler. The compiler, having prescient knowledge about what your code is going to do next, can compute the optimal assignment of instructions to execution units. Each instruction word is a pack of multiple instructions, one for each port, that are executed simultaneously (these words become very wide, hence VLIW for Very Long Instruction Word). In essence, all the bits of the out-of-order mechanism between decoding and the execution ports can be done away with, and the decoder is much simpler, too.
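As a toy illustration (hypothetical C, not Itanium-specific) of the parallelism both approaches are chasing: independent operations can be issued to separate execution units in the same cycle, while a dependent chain cannot.
```
/* Toy example: three independent updates that a scheduler (out-of-order
 * hardware, or a VLIW compiler packing a bundle) can issue to three
 * execution units in the same cycle. */
void independent(int * restrict a, int * restrict b, int * restrict c) {
    *a += 1;
    *b += 2;
    *c += 3;
}

/* A dependent chain: each step needs the previous result, so extra
 * execution units don't help and the operations execute one after another. */
int dependent(int x) {
    x = x * 3;
    x = x + 7;
    x = x ^ 5;
    return x;
}
```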
However, things fail in practice:
* the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While this was feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all that well.
* This issue was exacerbated by the Itanium's dreadful model for fast memory loads. You see, loads can take a
long time to finish, especially if cache misses or page faults occur. To fix that, the Itanium has the option
to do a speculative load, which may or may not succeed at a later point. So you can do a load from a dubious
pointer, then check if the pointer is fine (e.g. is it in bounds? Is it a null pointer?), and only once it has
been validated you make use of the result. This allows you to hide the latency of the load, significantly
speeding up typical business logic. However, the load can still fail (e.g. due to a page fault), in which case your code has to roll back to where the load should have been performed and then do a conventional load as a backup; see the sketch after this list for the rough shape of this pattern. Understandably, few, if any, compilers ever made use of this feature, and load latency was dealt with rather poorly.
* Relatedly, the latency of some instructions like loads and division is variable and cannot easily be predicted. So there usually isn't even a single perfect schedule for the compiler to find. It turns out the schedule is much better when you leave it to the Tomasulo mechanism, which has accurate knowledge of the latency of already-executing long-latency instructions.
* By design, VLIW instruction sets encode a lot about how the execution units work in the instruction format. For
example, Itanium is designed for a machine with three execution units and each instruction pack has up to three
instructions, one for each of them. But what if you want to put more execution units into the CPU in a future
iteration of the design? Well, it's not straightforward. One approach is to ship executables in a bytecode, which is only scheduled and encoded on the machine it is installed on, allowing the instruction encoding and thus the number of ports to vary. Intel chose a different approach and instead implemented later Itanium CPUs as out-of-order designs, combining the worst of both worlds.
* Due to not having register renaming, VLIW architectures conventionally have a large register file (128 registers
in the case of the Itanium). This slows down context switches, further reducing performance. Out-of-order CPUs
can cheat by having a comparably small programmer-visible state, with most of the state hidden in the bowels of
the processor and consequently not in need of saving or restoring.
* Branch prediction rapidly grew more and more accurate shortly after the Itanium's release, reducing the importance
of fast recovery from mispredictions. These days, branch prediction is up to 99% accurate and out-of-order CPUs can evaluate multiple branches per cycle using speculative execution, a feature that is not possible with a straightforward VLIW design due to the lack of register renaming. So Intel locked itself out of one of the most crucial strategies for better performance with this approach.
* Another engineering issue was that x86 simulation on the Itanium performed quite poorly, giving existing customers no incentive to switch. And those that did decide to switch found that if they were going to invest in porting their software, they might as well make it fully portable and be independent of the architecture. This is the same problem that led to the death of DEC: by forcing their customers to rewrite all their VAX software for the Alpha, they created a bunch of customers that were no longer locked into their ecosystem and could now buy whatever UNIX box was cheapest on the free market.
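Here is the rough shape of the speculative-load pattern referenced in the list above, as a hypothetical C sketch; the IA-64 side is only outlined in comments (using its ld.s/chk.s control-speculation instructions), and the function and field names are made up.
```
/* Hypothetical sketch of the control-speculation pattern described above.
 * Source level: validate the pointer first, then load through it. */
struct node { int key; int value; };

int lookup(const struct node *p) {
    if (p != NULL && p->key == 42)   /* guard, then dereference */
        return p->value;
    return -1;
}

/* Conceptually, an Itanium compiler was allowed to emit something like:
 *
 *   ld.s  rK = [p]       ; speculative load of p->key, hoisted above the
 *                        ; NULL check; a fault is deferred by setting the
 *                        ; destination register's NaT bit instead of trapping
 *   (compare p against NULL while the load is in flight)
 *   chk.s rK, recover    ; if the NaT bit is set, branch to compiler-generated
 *                        ; recovery code that redoes the load non-speculatively
 *
 * The win is that the load's latency overlaps the check; the cost is the
 * recovery stub and the bookkeeping the compiler has to generate. */
```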
> To fix that, the Itanium has the option to do a speculative load, which may or may not succeed at a later point. So you can do a load from a dubious pointer, then check if the pointer is fine (e.g. is it in bounds? Is it a null pointer?), and only once it has been validated you make use of the result.
Way back in the day, as a fairly young engineer, I was assigned to a project to get a bunch of legacy code migrated from Alpha to Itanium. The assignment was to "make it compile, run, and pass the tests. Do nothing else. At all."
We were using the Intel C compiler on OpenVMS and every once in a while would encounter a crash in a block of code that looked something like this:
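The pattern was roughly this; purely illustrative, with invented names rather than the actual code:
```
/* Illustrative only (invented names): a NULL guard and a dereference
 * combined in one condition. */
struct record { int status; };
void process(struct record *r);

void handle(struct record *rec) {
    if (rec != NULL && rec->status == 1)
        process(rec);
}
```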
It was evaluating both parts of the if statement simultaneously and crashing on the second. Not being allowed to spend too much time debugging or investigating the compiler options, we did the following:
EDIT - I recognize that the above change introduces a potential bug in the program ;) Obviously I wasn't copying code verbatim - it was 10-15 years ago! But you get the picture - the compiler was wonky, even the one you paid money for.
The main case I ever found was to implement missing language features, e.g.:
```
break 3;      // Break 3 levels up
break LABEL;  // Break to a named label - safer-ish than goto
goto LABEL;   // When you have no other option.
```
Usually for breaking out of a really deep set of loops to an outer loop, such as on a data stream reset, at end of data, or on an error so bad that a different language would, e.g., throw and usually die.
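For what it's worth, the usual C fallback for that kind of multi-level bail-out is a labeled goto; the names below are invented for illustration:
```
/* Hypothetical sketch: a plain "break" only exits the innermost loop in C,
 * so a labeled goto is the usual way to bail out of a deeply nested loop,
 * e.g. on a stream reset or an unrecoverable error. */
int fatal_error(int i, int j);
void handle_reset(void);

void scan(int rows, int cols) {
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            if (fatal_error(i, j))
                goto reset;        /* leaves both loops at once */
        }
    }
    return;                        /* normal completion */
reset:
    handle_reset();
}
```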
> the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While this was feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all that well.
I definitely think that keeping their compilers as an expensive license was a somewhat legendary bit of self-sabotage but I’m not sure it would’ve helped even if they’d given them away or merged everything into GCC. I worked for a commercial software vendor at the time before moving into web development, and it seemed like they basically over-focused on HPC benchmarks and a handful of other things like encryption. All of the business code we tried was usually slower even before you considered price, and nobody wanted to spend time hand-coding it hoping to make it less uneven. I do sometimes wonder if Intel’s compiler team would have been able to make it more competitive now with LLVM, WASM, etc. making the general problem of optimizing everything more realistic but I think the areas where the concept works best are increasingly sewn up by GPUs.
Your comment about DEC was spot-on. A lot of people I met had memories of the microcomputer era and were not keen on locking themselves in. The company I worked for had a pretty large support matrix because we had customers running most of the “open systems” platforms to ensure they could switch easily if one vendor got greedy.
>By design, VLIW instruction sets encode a lot about how the execution units work in the instruction format. For example, Itanium is designed for a machine with three execution units and each instruction pack has up to three instructions, one for each of them. But what if you want to put more execution units into the CPU in a future iteration of the design? Well, it's not straightforward. One approach is to ship executables in a bytecode, which is only scheduled and encoded on the machine it is installed on, allowing the instruction encoding and thus the number of ports to vary.
This was how Sun's MAJC[0] worked -
" For instance, if a particular implementation took three cycles to complete a floating-point multiplication, MAJC compilers would attempt to schedule in other instructions that took three cycles to complete and were not currently stalled. A change in the actual implementation might reduce this delay to only two instructions, however, and the compiler would need to be aware of this change.
This means that the compiler was not tied to MAJC as a whole, but a particular implementation of MAJC, each individual CPU based on the MAJC design.
...
The developer ships only a single bytecode version of their program, and the user's machine compiles that to the underlying platform. "[0]
> Itanium is designed for a machine with three execution units and each instruction pack has up to three instructions, one for each of them.
The design was that each bundle had some extra bits including a stop which was a sort of barrier to execution. The idea was that you could have a series of bundles with no stop bit and the last one would set it. That meant the whole series could be safely scheduled on a future wide IA64 machine. Of course that meant the compiler had to be explicit about that parallelism (hence EPIC) but future machines would be able to schedule on the extra execution units.
This also addressed the problem where VLIW traditionally would require re-compilation to run/run more efficiently on newer hardware.
> Due to not having register renaming, VLIW architectures conventionally have a large register file (128 registers in the case of the Itanium). This slows down context switches, further reducing performance. Out-of-order CPUs can cheat by having a comparably small programmer-visible state, with most of the state hidden in the bowels of the processor and consequently not in need of saving or restoring.
Itanium borrowed the register windows idea from SPARC. It was effectively a hardware stack with a minimum of 128 physical registers, referenced in instructions by 7-bit specifiers (i.e. 128 architecturally visible registers), iirc.
So you could make a function call and the stack would push, and a return would pop. Just like SPARC, except they weren't fixed-size windows.
That said, the penalty for spilling the RSE (they called this part the Register Stack Engine) for, say, an OS context switch was quite heavy since you'd have to write the whole RSE state to memory.
It was pretty cool reading about this stuff as a new grad.
> Another engineering issue was that x86 simulation on the Itanium performed quite poorly, giving existing customers no incentive to switch.
As I mentioned in my previous comment, Merced had a tiny corner of the chip devoted to the IVE, the Intel Value Engine, which was a very simple 32-bit x86 core meant mainly for booting the system. The intent was (and the docs also had sample code for this) to boot, do some setup of system state, and then jump into IA64 mode where you would actually get a fast system.
I think they did devote more silicon to x86 support but I had already served my very short time at HP and Merced still took 2+ years to tape out.
> The design was that each bundle had some extra bits including a stop which was a sort of barrier to execution. The idea was that you could have a series of bundles with no stop bit and the last one would set it. That meant the whole series could be safely scheduled on a future wide IA64 machine. Of course that meant the compiler had to be explicit about that parallelism (hence EPIC) but future machines would be able to schedule on the extra execution units. This also addressed the problem where VLIW traditionally would require re-compilation to run/run more efficiently on newer hardware.
Thanks, that makes sense. I did not understand the intent of the stop bits correctly.
However, it still seems like the design wouldn't scale super well: if you have fewer ports, you want to schedule dependent instructions on the critical path as early as possible, even if other independent (but not latency-critical) instructions could be scheduled earlier, incurring extra stop bits. So while some degree of performance portability is designed into the hardware, the compiler may have a hard time generating code that is scheduled well on both 3-port and possible future 6-port machines.
This reminds me of macro-fusion, where there's a similar contradiction: macro-fusion only triggers if the fusable instructions are issued back to back. But when optimising for a multi-issue in-order design, you usually want to interleave dependency chains (i.e. not issue dependent instructions back to back) such that all the pipelines are kept busy. So unless the pairs that fuse are the same on all of them, it's very hard to generate code that performs well on a variety of microarchitectures.
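To make that interleaving trade-off concrete, here is a hypothetical C example: a single accumulator is one long dependency chain, while two accumulators give an in-order multi-issue core two independent chains to overlap.
```
/* Hypothetical illustration: with one accumulator every add depends on the
 * previous one; with two accumulators the adjacent adds are independent, so
 * a multi-issue in-order core can keep two pipelines busy. */
long sum_single(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                 /* one long dependency chain */
    return s;
}

long sum_interleaved(const int *a, int n) {
    long s0 = 0, s1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];                /* two independent chains: neighbouring */
        s1 += a[i + 1];            /* adds no longer depend on each other  */
    }
    if (i < n)
        s0 += a[i];                /* leftover element when n is odd */
    return s0 + s1;
}
```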
I don't remember if the parent article mentioned it but there were also a bunch of things like the predicate bits for predicated execution and I remember trying to gain an advantage using speculative loads was also very tricky. In the end it was pretty gnarly.
The other bit no one mentions is that it was an HP-Intel alliance. HP committed to PA-RISC compatibility with a combination of hardware and software whereas Intel just expected stuff to run.
From the instruction reference guide:
```
Binary compatibility between PA-RISC and IA-64 is handled through dynamic object code translation. This process is very efficient because there is such a high degree of correspondence between PA-RISC and IA-64 instructions. HP’s performance studies show that on average the dynamic translator only spends 1-2% of its time in translation with 98-99% of the time spent executing native code. The dynamic translator actually performs optimizations on the translated code to take advantage of IA-64’s wider instructions, and performance features such as predication, speculation and large register sets
```
There was some hardware support for 32-bit userspace binaries. See the addp4 instruction.
> That said, the penalty for spilling the RSE (they called this part the Register Stack Engine) for, say, an OS context switch was quite heavy since you'd have to write the whole RSE state to memory.
I've read that the original intention for the RSE was that it would have saved its state in the background during spare bus cycles, which would have reduced the amount of data to save when a context switch happened.
Supposedly, this was not implemented in early models of the Itanium. Was it ever?
> * the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While this was feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all that well.
Was Intel's compiler actually able to get good performance on Itanium? How much less screwed would Itanium have been if other toolchains had matched the performance of Intel's compiler?
Also, I vaguely remember reading that Itanium also had a different page table structure (like a hash table?). Did that cause problems too?
Intel’s compiler was a bit better than some but still wasn’t great. Largely, Intel quickly lost interest in Itanium when AMD64 started selling well. HP had their own tooling, and HP was pretty much the only customer buying Itaniums. Intel quit investing in Itanium beyond what their contractual obligations to HP dictated.
I am curious about what could have been, but my assumption is that a mature and optimized software industry would be required. This was never going to happen after the launch of AMD64.
It's a long time ago but the thing I remember the most is that the binaries were huge, around 3x the size of x86 binaries. At the time we were very space constrained and that aspect alone was a big concern. If the performance had been there it might still have been worth pursuing, but the performance never exceeded the fastest x86 processors at the time.
I didn't know these things; I don't think they are part of the meme-lore about Itanium:
- The problems with the fast load misses and compiler support
- I didn't understand the implications of a completely visible register file
- The trouble with "hard coding" three execution units. Very bad if you can't recompile your code and/or bytecode to a new binary when you get a new CPU.
Regarding your last point about coding your way out of the ecosystem: I wonder if that might have been a reason why Intel didn't go all-in to make Itanium the Java machine...
These were (intended to be) Unix machines, not general-purpose PCs… the assumption was that everyone was compiling anything that went on the system anyway, or was buying licenses for a specific hardware box. So at least it wasn't considered to be a problem at the time.
One other unmet hope was that improvements in compiler technology would give a performance boost; they were hoping for up to 10x over the life of the program, which seemed wishful to me at the time (a lowly validation engineer just out of college). But if it had worked out, theoretically your old programs would have gotten faster over time just by recompiling, which would have been cool…