> The design was that each bundle had some extra bits, including a stop, which was a sort of barrier to execution. The idea was that you could have a series of bundles with no stop bit and the last one would set it. That meant the whole series could be safely scheduled in parallel on a future, wider IA64 machine. Of course that meant the compiler had to be explicit about that parallelism (hence EPIC), but future machines would be able to schedule it onto their extra execution units. This also addressed the problem that traditional VLIW required re-compilation to run, or to run more efficiently, on newer hardware.
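For anyone who hasn't looked at IA-64 assembly: the stop is encoded in the bundle's template bits and written as `;;`, and everything between two stops is an instruction group the compiler asserts has no register dependences, so a wider core may issue it all at once. A rough fragment (register numbers chosen purely for illustration):

```
ld8  r14 = [r16]      // load that starts the critical path
add  r17 = r18, r19   // independent work, same instruction group
add  r20 = r21, r22   // independent work, same instruction group
;;                    // stop: ends the group at a dependence boundary
add  r23 = r14, r17   // consumes results from the group above
```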
Thanks, that makes sense. I did not understand the intent of the stop bits correctly.
However, it still seems like the design wouldn't scale super well: if you have fewer ports, you want to schedule dependent instructions on the critical path as early as possible, even if other independent (but not latency-critical) instructions could be scheduled earlier, incurring extra stop bits. So while some degree of performance portability is designed into the hardware, the compiler may have a hard time generating code that is scheduled well on both 3-port and possible future 6-port machines.
This reminds me of macro-fusion, where there's a similar contradiction: macro-fusion only triggers if the fusable instructions are issued back to back. But when optimising for a multi-issue in-order design, you usually want to interleave dependency chains (i.e. not issue dependent instructions back to back) so that all the pipelines are kept busy. So unless the pairs that fuse are the same on all of them, it's very hard to generate code that performs well on a variety of microarchitectures.
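A rough x86 sketch of that tension, using the usual cmp+jcc pairing (registers and the label are made up, and this is just two alternative schedules of the same three instructions):

```
    ; Schedule A: cmp/jne back to back, so cores that fuse this pair can
    ; treat them as a single macro-op.
    mov   r8, [rdi]        ; independent load
    cmp   rcx, rdx
    jne   done

    ; Schedule B: the same instructions with the load interleaved between
    ; the pair (mov does not write flags, so this is legal), the way a
    ; scheduler tuned for an in-order multi-issue core might place it --
    ; and now cmp/jne are no longer adjacent, so they cannot fuse.
    cmp   rcx, rdx
    mov   r8, [rdi]
    jne   done
done:
```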
I don't remember if the parent article mentioned it, but there were also a bunch of things like the predicate bits for predicated execution, and I remember that trying to gain an advantage from speculative loads was also very tricky. In the end it was pretty gnarly.
The other bit no one mentions is that it was an HP-Intel alliance. HP committed to PA-RISC compatibility with a combination of hardware and software, whereas Intel just expected stuff to run.
From the instruction reference guide:
```
Binary compatibility between PA-RISC and IA-64 is handled through dynamic object code translation. This process is very efficient because there is such a high degree of correspondence between PA-RISC and IA-64 instructions. HP’s performance studies show that on average the dynamic translator only spends 1-2% of its time in translation with 98-99% of the time spent executing native code. The dynamic translator actually performs optimizations on the translated code to take advantage of IA-64’s wider instructions, and performance features such as predication, speculation and large register sets.
```
There was some hardware support for 32-bit userspace binaries. See the addp4 instruction.
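From memory, so treat the details as a sketch rather than gospel: addp4 does a normal add, keeps only the low 32 bits of the sum, and copies the top two bits of the 32-bit base into the region bits of the result, which is how 32-bit pointers get mapped into the 64-bit address space.

```
// addp4 r1 = r2, r3   (semantics as I recall them)
//   tmp       = r2 + r3          // ordinary 64-bit add
//   r1{31:0}  = tmp{31:0}        // keep only the low 32 bits of the sum
//   r1{60:32} = 0, r1{63} = 0    // clear the rest of the address...
//   r1{62:61} = r3{31:30}        // ...except the region bits, taken from
//                                // the top of the 32-bit base pointer
addp4 r14 = r15, r16   // r16 holds a 32-bit pointer, r15 an offset;
                       // r14 is a full 64-bit address usable by ld/st
```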