1. This is only true in JIT's and similar things that strongly depend on linear time register allocation. LLVM happens to now use a greedy allocator that still tries to prioritize time over allocation optimality.
2. This is just a bad compiler then. Good static profiling is actually quite good, and quite accurate, in most cases. This includes value and other forms of profiling, rather than just simple static edge estimation based on heuristics. Usually, the thing that gets hurt is not spill placement, but bad inlining decisions.
For diamond shaped switch statement interpreter loops with simple bodies, the real issue is that most greedy/linear allocators are not great at live range splitting.
Compilers like LLVM (and to some degree, GCC), move all the variable allocations up to the beginning of the function to make life easy by removing scopes (otherwise you have really really crazy edge cases performing hoisting/sinking optimizations), and then for those that don't get mem2reg'd, can't prove they aren't live all at the same time during the switch due to the loop.
Then they make bad choices about which of these variables should stay in registers because their value profiling infrastructures are non-existent.
Proper region analysis, allocation regions, and better optimistic live range splitting would go a long way towards fixing this, but it's not worth it. There is little to no sense in optimizing LLVM for the very uncommon case of interpreter loops (particularly when one of the goals of LLVM is to ... replace interpreters).
So the basic answer is: It's not really a problem anyone cares to solve, not "it's a really hard problem to solve".
Nevertheless, register allocation is still an NP-complete problem and even in AOT compilation everything above O(n^3) is too slow (unless you just compile kernels for embedded software).
Actually, it is not the register allocation itself, which is NP-complete [0]. Avoiding spills and copys is the hard part.
The fact that some subproblems people want to solve are NP complete is mostly irrelevant, and has been for many many years. Nobody cares about optimal copy coalescing, only "good enough". There are studies about optimal vs non-optimal, and the answer is basically "you could eliminate 20% more copies". That was against the best optimistic coalescers of the time, which have already been superseded by PBQP and other approaches.
Also remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime.
As I said, the real issue is that nobody has chosen to optimize for the interpreter case, because it's uncommon and doing so does not help anything else, but comes at great cost.
Choosing the one edge case everyone has said "we don't care about" and saying "see, they suck at everything!" does not seem quite right to me.
For example, it is completely irrelevant to whether register allocation is a problem for tight SIMD intrinsic loops, for example.
At least in GCC (which is what x264 was likely talking about), the issues are GCC's architecture, and not some fundamental "register allocation is hard" issue.
remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime
It sounds like you may be referring to MOV elimination by register renaming, which should make the extra moves no more costly than NOP's. I read that Intel post-Ivy Bridge does this, but haven't been able to find any real documentation. Do you know if this is something one can now rely on, or what the limits of this are (number per cycle, size differences, latency)?
Answering myself: Yes, this is documented and can be depended on for Ivy Bridge onward.
3.5.1.13 Zero-Latency MOV Instructions
In processors based on Intel microarchitecture code named Ivy Bridge, a subset of
register-to-register move operations are executed in the front end (similar to zeroidioms, see Section 3.5.1.8). This conserves scheduling/execution resources in the
out-of-order engine. Most forms of register-to-register MOV instructions can benefit
from zero-latency MOV. Example 3-23 list the details of those forms that qualify and
a small set that do not.
So, PRE and other code motion/placement optimizations rely on a CFG for computation of dataflow information either implicitly (SSA form) or explicitly (non-SSA form and some SSA algorithms).
I'm going to stick to non-SSA algorithms because they are actually easier to understand the issues for.
Once you've done this, particularly the section on bitvector problems, realize that most of the properties being computed make no sense earlier than original scope of the variable, and what to do is not always apparent.
Take PRE for example, here's a simple, but crappily devised example:
int main(int argc, char argv)
{
int b;
for (int i = 0; i < 50; i++)
{
int a;
a = argc;
b = a;
}
return b;
}
(It's crappy because any value numbering/copy prop/etc would take care of this particular example)
A standard PRE is going to want to hoist a = argc out of the loop, not b = a;
But, well, it can't, because it's out of scope!
How does it know what it should do?
Should it move int a out of the loop?
It may end up inserting a stack allocation into a hot loop if it just hoists it one scope up!
Should it create a temporary?
If you allow it to create new outside-scope temporaries anyway, what's the point of keeping the scopes?
So now you have to do scope understanding and movement anyway.
Plus, in most cases, you need to construct accurate live ranges of variables anyway, and scopes do not always properly represent those.
You also have to compute what variables may share registers on your own too.
Non-renaming forms of SSA (IE that don't literally rename the temporaries) are worse here, because you can accidentally overlap live ranges of variables without too much trouble.
In any case, the basic answer is "Scopes buy you very little, and hurt a lot, because every single optimizer has to care a lot about them"
Thanks for the reply. I was being pretty inaccurate in my terminology: I said "branchy" when I meant "diamond-shaped" and I said "spilling" when I should have said "splitting". That's what I get for posting before being fully awake :)
2. This is just a bad compiler then. Good static profiling is actually quite good, and quite accurate, in most cases. This includes value and other forms of profiling, rather than just simple static edge estimation based on heuristics. Usually, the thing that gets hurt is not spill placement, but bad inlining decisions.
For diamond shaped switch statement interpreter loops with simple bodies, the real issue is that most greedy/linear allocators are not great at live range splitting. Compilers like LLVM (and to some degree, GCC), move all the variable allocations up to the beginning of the function to make life easy by removing scopes (otherwise you have really really crazy edge cases performing hoisting/sinking optimizations), and then for those that don't get mem2reg'd, can't prove they aren't live all at the same time during the switch due to the loop.
Then they make bad choices about which of these variables should stay in registers because their value profiling infrastructures are non-existent.
Proper region analysis, allocation regions, and better optimistic live range splitting would go a long way towards fixing this, but it's not worth it. There is little to no sense in optimizing LLVM for the very uncommon case of interpreter loops (particularly when one of the goals of LLVM is to ... replace interpreters).
So the basic answer is: It's not really a problem anyone cares to solve, not "it's a really hard problem to solve".