There are a number of reasons for the Itanium's poor performance, and it's the combination of these various factors that did it in. I wasn't present back in the Itanium's heyday, but this is what I gathered.
As a quick recap, superscalar processors have multiple execution units, each of which can execute one instruction each cycle. So if you have three execution units, your CPU can execute up to three instructions every cycle. The conventional way to make use of the power of more than one execution unit is to have an out-of-order design, where a complicated mechanism (the Tomasulo algorithm) decodes multiple instructions in parallel, tracks their dependencies and dispatches them to execution units as they become ready to execute. Dependencies are resolved by having a large physical register file, which is dynamically mapped onto the programmer-visible logical register file (register renaming). This works well, but is notoriously complex to implement and requires a couple of extra pipeline stages between decode and execution, increasing the latency of mispredicted branches.
The idea of VLIW architectures was to improve on this by moving the decision of which instruction to execute on which port into the compiler. The compiler, having prescient knowledge about what your code is going to do next, can compute the optimal assignment of instructions to execution units. Each instruction word is a pack of multiple instructions, one for each port, that are executed simultaneously (these words become very wide, hence VLIW for Very Long Instruction Word). In essence, all the bits of the out-of-order mechanism between decoding and the execution ports can be done away with, and the decoder is much simpler, too.
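As a toy illustration (hypothetical C, not Itanium-specific) of the parallelism both approaches are chasing: independent operations can be issued to separate execution units in the same cycle, while a dependent chain cannot.
```
/* Toy example: three independent updates that a scheduler (out-of-order
 * hardware, or a VLIW compiler packing a bundle) can issue to three
 * execution units in the same cycle. */
void independent(int * restrict a, int * restrict b, int * restrict c) {
    *a += 1;
    *b += 2;
    *c += 3;
}

/* A dependent chain: each step needs the previous result, so extra
 * execution units don't help and the operations execute one after another. */
int dependent(int x) {
    x = x * 3;
    x = x + 7;
    x = x ^ 5;
    return x;
}
```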
However, things fail in practice:
* the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While this was feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all that well.
* This issue was exacerbated by the Itanium's dreadful model for fast memory loads. You see, loads can take a
long time to finish, especially if cache misses or page faults occur. To fix that, the Itanium has the option
to do a speculative load, which may or may not succeed at a later point. So you can do a load from a dubious
pointer, then check if the pointer is fine (e.g. is it in bounds? Is it a null pointer?), and only once it has
been validated you make use of the result. This allows you to hide the latency of the load, significantly
speeding up typical business logic. However, the load can still fail (e.g. due to a page fault), in which case your code has to roll back to where the load should have been performed and then do a conventional load as a backup; see the sketch after this list for the rough shape of this pattern. Understandably, few, if any, compilers ever made use of this feature, and load latency was dealt with rather poorly.
* Relatedly, the latency of some instructions like loads and division is variable and cannot easily be predicted. So there usually isn't even a single perfect schedule for the compiler to find. It turns out the schedule is much better when you leave it to the Tomasulo mechanism, which has accurate knowledge of the latency of already-executing long-latency instructions.
* By design, VLIW instruction sets encode a lot about how the execution units work in the instruction format. For
example, Itanium is designed for a machine with three execution units and each instruction pack has up to three
instructions, one for each of them. But what if you want to put more execution units into the CPU in a future
iteration of the design? Well, it's not straightforward. One approach is to ship executables in a bytecode, which is only scheduled and encoded on the machine it is installed on, allowing the instruction encoding and thus the number of ports to vary. Intel chose a different approach and instead implemented later Itanium CPUs as out-of-order designs, combining the worst of both worlds.
* Due to not having register renaming, VLIW architectures conventionally have a large register file (128 registers
in the case of the Itanium). This slows down context switches, further reducing performance. Out-of-order CPUs
can cheat by having a comparably small programmer-visible state, with most of the state hidden in the bowels of
the processor and consequently not in need of saving or restoring.
* Branch prediction rapidly grew more and more accurate shortly after the Itanium's release, reducing the importance
of fast recovery from mispredictions. These days, branch prediction is up to 99% accurate and out-of-order CPUs can evaluate multiple branches per cycle using speculative execution, a feature that is not possible with a straightforward VLIW design due to the lack of register renaming. So Intel locked itself out of one of the most crucial strategies for better performance with this approach.
* Another engineering issue was that x86 simulation on the Itanium performed quite poorly, giving existing customers no incentive to switch. And those that did decide to switch found that if they were going to invest in porting their software, they might as well make it fully portable and be independent of the architecture. This is the same problem that led to the death of DEC: by forcing their customers to rewrite all their VAX software for the Alpha, they created a bunch of customers that were no longer locked into their ecosystem and could now buy whatever UNIX box was cheapest on the free market.
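Here is the rough shape of the speculative-load pattern referenced in the list above, as a hypothetical C sketch; the IA-64 side is only outlined in comments (using its ld.s/chk.s control-speculation instructions), and the function and field names are made up.
```
/* Hypothetical sketch of the control-speculation pattern described above.
 * Source level: validate the pointer first, then load through it. */
struct node { int key; int value; };

int lookup(const struct node *p) {
    if (p != NULL && p->key == 42)   /* guard, then dereference */
        return p->value;
    return -1;
}

/* Conceptually, an Itanium compiler was allowed to emit something like:
 *
 *   ld.s  rK = [p]       ; speculative load of p->key, hoisted above the
 *                        ; NULL check; a fault is deferred by setting the
 *                        ; destination register's NaT bit instead of trapping
 *   (compare p against NULL while the load is in flight)
 *   chk.s rK, recover    ; if the NaT bit is set, branch to compiler-generated
 *                        ; recovery code that redoes the load non-speculatively
 *
 * The win is that the load's latency overlaps the check; the cost is the
 * recovery stub and the bookkeeping the compiler has to generate. */
```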
> To fix that, the Itanium has the option to do a speculative load, which may or may not succeed at a later point. So you can do a load from a dubious pointer, then check if the pointer is fine (e.g. is it in bounds? Is it a null pointer?), and only once it has been validated you make use of the result.
Way back in the day, as a fairly young engineer, I was assigned to a project to get a bunch of legacy code migrated from Alpha to Itanium. The assignment was to "make it compile, run, and pass the tests. Do nothing else. At all."
We were using the Intel C compiler on OpenVMS and every once in a while would encounter a crash in a block of code that looked something like this:
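The pattern was roughly this; purely illustrative, with invented names rather than the actual code:
```
/* Illustrative only (invented names): a NULL guard and a dereference
 * combined in one condition. */
struct record { int status; };
void process(struct record *r);

void handle(struct record *rec) {
    if (rec != NULL && rec->status == 1)
        process(rec);
}
```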
It was evaluating both parts of the if statement simultaneously and crashing on the second. Not being allowed to spend too much time debugging or investigating the compiler options, we did the following:
EDIT - I recognize that the above change introduces a potential bug in the program ;) Obviously I wasn't copying code verbatim - it was 10-15 years ago! But you get the picture - the compiler was wonky, even the one you paid money for.
The main case I ever found was to implement missing language features, e.g.:
```
break 3;      // Break 3 levels up
break LABEL;  // Break to a named label - safer-ish than goto
goto LABEL;   // When you have no other option.
```
Usually for breaking out of a really deep set of loops to an outer loop, such as on a data stream reset, at end of data, or on an error so bad that a different language would, e.g., throw and usually die.
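For what it's worth, the usual C fallback for that kind of multi-level bail-out is a labeled goto; the names below are invented for illustration:
```
/* Hypothetical sketch: a plain "break" only exits the innermost loop in C,
 * so a labeled goto is the usual way to bail out of a deeply nested loop,
 * e.g. on a stream reset or an unrecoverable error. */
int fatal_error(int i, int j);
void handle_reset(void);

void scan(int rows, int cols) {
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            if (fatal_error(i, j))
                goto reset;        /* leaves both loops at once */
        }
    }
    return;                        /* normal completion */
reset:
    handle_reset();
}
```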
> the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While this was feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all that well.
I definitely think that keeping their compilers as an expensive license was a somewhat legendary bit of self-sabotage but I’m not sure it would’ve helped even if they’d given them away or merged everything into GCC. I worked for a commercial software vendor at the time before moving into web development, and it seemed like they basically over-focused on HPC benchmarks and a handful of other things like encryption. All of the business code we tried was usually slower even before you considered price, and nobody wanted to spend time hand-coding it hoping to make it less uneven. I do sometimes wonder if Intel’s compiler team would have been able to make it more competitive now with LLVM, WASM, etc. making the general problem of optimizing everything more realistic but I think the areas where the concept works best are increasingly sewn up by GPUs.
Your comment about DEC was spot-on. A lot of people I met had memories of the microcomputer era and were not keen on locking themselves in. The company I worked for had a pretty large support matrix because we had customers running most of the “open systems” platforms to ensure they could switch easily if one vendor got greedy.
>By design, VLIW instruction sets encode a lot about how the execution units work in the instruction format. For example, Itanium is designed for a machine with three execution units and each instruction pack has up to three instructions, one for each of them. But what if you want to put more execution units into the CPU in a future iteration of the design? Well, it's not straightforward. One approach is to ship executables in a bytecode, which is only scheduled and encoded on the machine it is installed on, allowing the instruction encoding and thus the number of ports to vary.
This was how Sun's MAJC[0] worked -
" For instance, if a particular implementation took three cycles to complete a floating-point multiplication, MAJC compilers would attempt to schedule in other instructions that took three cycles to complete and were not currently stalled. A change in the actual implementation might reduce this delay to only two instructions, however, and the compiler would need to be aware of this change.
This means that the compiler was not tied to MAJC as a whole, but a particular implementation of MAJC, each individual CPU based on the MAJC design.
...
The developer ships only a single bytecode version of their program, and the user's machine compiles that to the underlying platform. "[0]
> Itanium is designed for a machine with three execution units and each instruction pack has up to three instructions, one for each of them.
The design was that each bundle had some extra bits including a stop which was a sort of barrier to execution. The idea was that you could have a series of bundles with no stop bit and the last one would set it. That meant the whole series could be safely scheduled on a future wide IA64 machine. Of course that meant the compiler had to be explicit about that parallelism (hence EPIC) but future machines would be able to schedule on the extra execution units.
This also addressed the problem where VLIW traditionally would require re-compilation to run/run more efficiently on newer hardware.
> Due to not having register renaming, VLIW architectures conventionally have a large register file (128 registers in the case of the Itanium). This slows down context switches, further reducing performance. Out-of-order CPUs can cheat by having a comparably small programmer-visible state, with most of the state hidden in the bowels of the processor and consequently not in need of saving or restoring.
Itanium borrowed the register windows idea from SPARC. It was effectively a hardware stack with a minimum of 128 physical registers, referenced in instructions by 7-bit specifiers (i.e. 128 architecturally visible registers), iirc.
So you could make a function call and the stack would push, and a return would pop. Just like SPARC, except they weren't fixed-size windows.
That said, the penalty for spilling the RSE (they called this part the Register Stack Engine) for, say, an OS context switch was quite heavy since you'd have to write the whole RSE state to memory.
It was pretty cool reading about this stuff as a new grad.
> Another engineering issue was that x86 simulation on the Itanium performed quite poorly, giving existing customers no incentive to switch.
As I mentioned in my previous comment, Merced had a tiny corner of the chip devoted to the IVE, the Intel Value Engine, which was a very simple 32-bit x86 core meant mainly for booting the system. The intent was (and the docs also had sample code for this) to boot, do some setup of system state, and then jump into IA64 mode where you would actually get a fast system.
I think they did devote more silicon to x86 support but I had already served my very short time at HP and Merced still took 2+ years to tape out.
> The design was that each bundle had some extra bits including a stop which was a sort of barrier to execution. The idea was that you could have a series of bundles with no stop bit and the last one would set it. That meant the whole series could be safely scheduled on a future wide IA64 machine. Of course that meant the compiler had to be explicit about that parallelism (hence EPIC) but future machines would be able to schedule on the extra execution units. This also addressed the problem where VLIW traditionally would require re-compilation to run/run more efficiently on newer hardware.
Thanks, that makes sense. I did not understand the intent of the stop bits correctly.
However, it still seems like the design wouldn't scale super well: if you have fewer ports, you want to schedule dependent instructions on the critical path as early as possible, even if other independent (but not latency-critical) instructions could be scheduled earlier, incurring extra stop bits. So while some degree of performance portability is designed into the hardware, the compiler may have a hard time generating code that is scheduled well on both 3-port and possible future 6-port machines.
This reminds me of macro-fusion, where there's a similar contradiction: macro-fusion only triggers if the fusable instructions are issued back to back. But when optimising for a multi-issue in-order design, you usually want to interleave dependency chains (i.e. not issue dependent instructions back to back) such that all the pipelines are kept busy. So unless the pairs that fuse are the same on all of them, it's very hard to generate code that performs well on a variety of microarchitectures.
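To make that interleaving trade-off concrete, here is a hypothetical C example: a single accumulator is one long dependency chain, while two accumulators give an in-order multi-issue core two independent chains to overlap.
```
/* Hypothetical illustration: with one accumulator every add depends on the
 * previous one; with two accumulators the adjacent adds are independent, so
 * a multi-issue in-order core can keep two pipelines busy. */
long sum_single(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                 /* one long dependency chain */
    return s;
}

long sum_interleaved(const int *a, int n) {
    long s0 = 0, s1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];                /* two independent chains: neighbouring */
        s1 += a[i + 1];            /* adds no longer depend on each other  */
    }
    if (i < n)
        s0 += a[i];                /* leftover element when n is odd */
    return s0 + s1;
}
```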
I don't remember if the parent article mentioned it but there were also a bunch of things like the predicate bits for predicated execution and I remember trying to gain an advantage using speculative loads was also very tricky. In the end it was pretty gnarly.
The other bit no one mentions is that it was an HP-Intel alliance. HP committed to PA-RISC compatibility with a combination of hardware and software whereas Intel just expected stuff to run.
From the instruction reference guide:
```
Binary compatibility between PA-RISC and IA-64 is handled through dynamic object code translation. This process is very efficient because there is such a high degree of correspondence between PA-RISC and IA-64 instructions. HP’s performance studies show that on average the dynamic translator only spends 1-2% of its time in translation with 98-99% of the time spent executing native code. The dynamic translator actually performs optimizations on the translated code to take advantage of IA-64’s wider instructions, and performance features such as predication, speculation and large register sets
```
There was some hardware support for 32-bit userspace binaries. See the addp4 instruction.
> That said, the penalty for spilling the RSE (they called this part the Register Stack Engine) for, say, an OS context switch was quite heavy since you'd have to write the whole RSE state to memory.
I've read that the original intention for the RSE was that it would have saved its state in the background during spare bus cycles, which would have reduced the amount of data to save when a context switch happened.
Supposedly, this was not implemented in early models of the Itanium. Was it ever?
> * the whole idea hinges on the compiler being able to figure out the correct instruction schedule ahead of time. While this was feasible for Intel's/HP's in-house compiler team, the authors of other toolchains largely did not bother, instead opting for more conventional code generation that did not perform all that well.
Was Intel's compiler actually able to get good performance on Itanium? How much less screwed would Itanium have been if other toolchains had matched the performance of Intel's compiler?
Also, I vaguely remember reading that Itanium also had a different page table structure (like a hash table?). Did that cause problems too?
Intel’s compiler was a bit better than some but still wasn’t great. Largely, Intel quickly lost interest in Itanium when AMD64 started selling well. HP had their own tooling, and HP was pretty much the only customer buying Itaniums. Intel quit investing in Itanium beyond what their contractual obligations to HP dictated.
I am curious about what could have been, but my assumption is that a mature and optimized software industry would be required. This was never going to happen after the launch of AMD64.
It's a long time ago but the thing I remember the most is that the binaries were huge, around 3x the size of x86 binaries. At the time we were very space constrained and that aspect alone was a big concern. If the performance had been there it might still have been worth pursuing, but the performance never exceeded the fastest x86 processors at the time.
I didn't know these things; I don't think they are part of the meme-lore about Itanium:
- The problems with the fast load misses and compiler support
- I didn't understand the implications of a completely visible register file
- The trouble with "hard coding" three execution units. Very bad if you can't recompile your code and/or bytecode to a new binary when you get a new CPU.
Regarding your last point about coding your way out of the ecosystem: I wonder if that might have been a reason why Intel didn't go all-in to make Itanium the Java machine...
These were (intended to be) Unix machines, not general-purpose PCs… the assumption was that everyone was compiling anything that went on the system anyway, or was buying licenses for a specific hardware box. So at least it wasn't considered to be a problem at the time.
One other unmet hope was that improvements in compiler technology would give a performance boost; they were hoping for up to 10x over the life of the program, which seemed wishful to me at the time (a lowly validation engineer just out of college). But if it had worked out, theoretically your old programs would have gotten faster over time just by recompiling, which would have been cool…