1. This is only true in JIT's and similar things that strongly depend on linear ...

qznc · on July 24, 2013

Nevertheless, register allocation is still an NP-complete problem and even in AOT compilation everything above O(n^3) is too slow (unless you just compile kernels for embedded software).

Actually, it is not the register allocation itself, which is NP-complete [0]. Avoiding spills and copys is the hard part.

[0] https://pp.info.uni-karlsruhe.de/publication.php?id=buchwald...

DannyBee · on July 24, 2013

The fact that some subproblems people want to solve are NP complete is mostly irrelevant, and has been for many many years. Nobody cares about optimal copy coalescing, only "good enough". There are studies about optimal vs non-optimal, and the answer is basically "you could eliminate 20% more copies". That was against the best optimistic coalescers of the time, which have already been superseded by PBQP and other approaches. Also remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime.

As I said, the real issue is that nobody has chosen to optimize for the interpreter case, because it's uncommon and doing so does not help anything else, but comes at great cost.

Choosing the one edge case everyone has said "we don't care about" and saying "see, they suck at everything!" does not seem quite right to me.

For example, it is completely irrelevant to whether register allocation is a problem for tight SIMD intrinsic loops, for example.

At least in GCC (which is what x264 was likely talking about), the issues are GCC's architecture, and not some fundamental "register allocation is hard" issue.

nkurz · on July 24, 2013

remember that people build architectures knowing what compilers can and can't do, and so processors do a lot at runtime

It sounds like you may be referring to MOV elimination by register renaming, which should make the extra moves no more costly than NOP's. I read that Intel post-Ivy Bridge does this, but haven't been able to find any real documentation. Do you know if this is something one can now rely on, or what the limits of this are (number per cycle, size differences, latency)?

nkurz · on July 28, 2013

Answering myself: Yes, this is documented and can be depended on for Ivy Bridge onward.

3.5.1.13 Zero-Latency MOV Instructions

In processors based on Intel microarchitecture code named Ivy Bridge, a subset of register-to-register move operations are executed in the front end (similar to zeroidioms, see Section 3.5.1.8). This conserves scheduling/execution resources in the out-of-order engine. Most forms of register-to-register MOV instructions can benefit from zero-latency MOV. Example 3-23 list the details of those forms that qualify and a small set that do not.

Example 3-23. Zero-Latency MOV Instructions

MOV instructions latency that can be eliminated

  MOV reg32, reg32
  MOV reg64, reg64
  MOVUPD/MOVAPD xmm, xmm
  MOVUPD/MOVAPD ymm, ymm
  MOVUPS?MOVAPS xmm, xmm
  MOVUPS/MOVAPS ymm, ymm
  MOVDQA/MOVDQU xmm, xmm
  MOVDQA/MOVDQU ymm, ymm
  MOVZX reg32, reg8 (if not AH/BH/CH/DH)
  MOVZX reg64, reg8 (if not AH/BH/CH/DH)

MOV instructions latency that cannot be eliminated

  MOV reg8, reg8
  MOV reg16, reg16
  MOVZX reg32, reg8 (if AH/BH/CH/DH)
  MOVZX reg64, reg8 (if AH/BH/CH/DH)
  MOVSX

http://www.intel.com/content/dam/doc/manual/64-ia-32-archite...

caf · on July 24, 2013

Could you please expand more on those "crazy edge cases"? I've long been puzzled why compilers seem to throw away so much lifetime/scope information.

DannyBee · on July 25, 2013

So, PRE and other code motion/placement optimizations rely on a CFG for computation of dataflow information either implicitly (SSA form) or explicitly (non-SSA form and some SSA algorithms). I'm going to stick to non-SSA algorithms because they are actually easier to understand the issues for.

To save a large amount of text here, read http://en.wikipedia.org/wiki/Data-flow_analysis

Once you've done this, particularly the section on bitvector problems, realize that most of the properties being computed make no sense earlier than original scope of the variable, and what to do is not always apparent.

Take PRE for example, here's a simple, but crappily devised example: int main(int argc, char argv) { int b; for (int i = 0; i < 50; i++) { int a; a = argc; b = a; } return b; }

(It's crappy because any value numbering/copy prop/etc would take care of this particular example) A standard PRE is going to want to hoist a = argc out of the loop, not b = a;

But, well, it can't, because it's out of scope! How does it know what it should do? Should it move int a out of the loop? It may end up inserting a stack allocation into a hot loop if it just hoists it one scope up!

Should it create a temporary? If you allow it to create new outside-scope temporaries anyway, what's the point of keeping the scopes?

So now you have to do scope understanding and movement anyway.

Plus, in most cases, you need to construct accurate live ranges of variables anyway, and scopes do not always properly represent those. You also have to compute what variables may share registers on your own too.

Non-renaming forms of SSA (IE that don't literally rename the temporaries) are worse here, because you can accidentally overlap live ranges of variables without too much trouble.

In any case, the basic answer is "Scopes buy you very little, and hurt a lot, because every single optimizer has to care a lot about them"

pcwalton · on July 24, 2013

Thanks for the reply. I was being pretty inaccurate in my terminology: I said "branchy" when I meant "diamond-shaped" and I said "spilling" when I should have said "splitting". That's what I get for posting before being fully awake :)