
I stated that optimization is hard (each architecture has timings that affect when something is beneficial and when it isn't): Intel baked this into ICC, but only for Intel targets, for obvious commercial reasons. You say that no, in some anecdote about some anecdotal piece of code, on some anecdotal set of data, on some anecdotal AMD processor, this was disproven. But that doesn't disprove it in the general case whatsoever. If Intel simply generated and ran Intel-targeted code always, we would be having the same debate, only the complaint would be that they didn't align their instructions and schedule them in an order that would be better on AMD.

It is entirely probable that simply running the "Intel" code path will yield slower code on some sets of code, with some sets of data, on some AMD architectures. Again, in most cases vectorization doesn't suddenly give you an order-of-magnitude improvement; the gains can be marginal, and only in some cases, with some data. It is a very hard problem, which is exactly why, for all of the bluster, the Intel compiler is still considered the best compiler eight years into this controversy. AMD has been contributing to the Open64 compiler for years, but you don't hear much about it. And I'll bet their contributions don't put much care into Intel processors.



"If Intel simply generated and ran Intel-targeted code always, we would have the same debate because why didn't they align their instructions and do it in such an order because that would be better on AMD."

That same debate might happen. But I doubt we'd see it, because most people would dismiss such a debate as stupid. Why would Intel put any effort into making their compiler optimize code for AMD CPUs?

Again: there is a world of difference between simply not optimizing for AMD, and deliberately running slow code when an AMD CPU is detected.

If Intel just checked CPU features and decided based on that, would it still produce bad code for AMD sometimes? Probably. Would it be as slow? Doesn't sound like it, from people with real-world experience with the compilers. Would people still complain? Yes. Would those complaints have any merit? No.
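
To make the distinction concrete, here is a minimal sketch of the purely feature-based dispatch being argued for, using the __get_cpuid helper from GCC/Clang's <cpuid.h>. This is illustrative only, not Intel's code:

    /* Pure feature-based dispatch: consult the CPUID feature bits and
       ignore the vendor string entirely. */
    #include <cpuid.h>
    #include <stdio.h>

    static int has_sse2(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 0;
        return (edx >> 26) & 1;   /* CPUID.1:EDX bit 26 = SSE2 */
    }

    int main(void) {
        puts(has_sse2() ? "dispatch to SSE2 path" : "dispatch to scalar path");
        return 0;
    }

Any x86 CPU, Intel or not, that sets that bit is promising a working SSE2 implementation; that is the whole point of the feature flags.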


Just to join the chorus: it's not a hard problem. Intel intentionally adds code to the executables produced by their compiler so as to limit their performance on non-Intel CPUs. They settled a federal deceptive-practices complaint brought by the FTC because they did not inform their customers, who thought they were buying a working compiler, that they were doing this.

Instead of simply removing this code, which serves no purpose other than to degrade performance on non-Intel CPUs, they reached an agreement to put legal disclaimers on every web page mentioning their compiler, reserving the right to keep doing this. But because of the slick legal language of the disclosure, most people get the impression that Intel is simply failing to make AMD-specific optimizations, rather than intentionally preventing non-Intel CPUs from using the optimizations already present in the code.

I think we all agree that at this point Intel's practice is fully legal. Their engineers should feel proud of having built a really solid compiler. But Intel-the-company deserves to be shamed for its slimy underhanded practices, and the engineers should feel a certain amount of revulsion for allowing themselves to be used in this way. Perhaps if they stood up for the obvious right approach, they could help change the company for the better.


1. Old versions of icc ran well on AMD.

2. New icc version adds check for "GenuineIntel".

3. New icc version now runs slowly on AMD.


This, like almost every counterpoint I've faced thus far, is simply wrong. It is manufactured reality.

ICC 8 added auto-vectorization. That very first auto-vectorizing version added the "GenuineIntel" branch for such vectorized code, because despite all of the fiction stated otherwise, vectorizing is actually a very hard task (hence why Intel maintains such a lead, and why people are still griping about this nine years after it came about).

I am hardly standing up for Intel, but this is Reddit-level conversation, where people simply say what they hope is true.


I don't understand your "because" statement. Intel added a check for Intel CPUs because vectorization is difficult. That's a complete non sequitur as far as I can tell. It makes as much sense as saying that I baked a chocolate cake because it rained yesterday.

Yes, various optimizations, including auto-vectorization, are difficult. Why does that mean Intel had to add a check for Intel CPUs in their compiler?


I'm a glutton for punishment, I suppose.

The Intel compiler makes tight, fast x86[^1]. It ALSO can optionally generate auto-vectorized code paths for specific Intel architectures, for that small amount of niche code that can be vectorized. This is not simply "has feature versus doesn't have feature": the compiler chooses the usage profile of features based on the runtime architecture. Each architecture has significant nuances, setup and teardown costs, and so on, and anyone who says "they should just feature sniff" does not understand the factors, though that certainly doesn't stop them from having an opinion. Saying that because Intel doesn't do the latter for AMD processors they have "crippled" them is nonsensical.
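
For what it's worth, here is a hedged sketch (not Intel's actual dispatcher; the kernel names are hypothetical) of why per-architecture dispatch is more than feature sniffing: the same SSE2 instruction set can want different tuning on different cores, so a dispatcher may key on family/model as well as feature bits:

    #include <cpuid.h>
    #include <stdio.h>

    /* Hypothetical kernels: identical instruction set, different tuning
       (unrolling, prefetch distance, alignment strategy, etc.). */
    static void kernel_scalar(void)      { puts("scalar path"); }
    static void kernel_sse2_tune_a(void) { puts("SSE2, tuning A"); }
    static void kernel_sse2_tune_b(void) { puts("SSE2, tuning B"); }

    typedef void (*kernel_fn)(void);

    static kernel_fn select_kernel(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !((edx >> 26) & 1))
            return kernel_scalar;                 /* no SSE2: generic path */
        unsigned int family = (eax >> 8) & 0x0f;  /* base family, CPUID.1:EAX[11:8] */
        unsigned int model  = (eax >> 4) & 0x0f;  /* base model,  CPUID.1:EAX[7:4]  */
        /* Tuning keyed on family/model, not just feature bits. The dispute is
           what to do for family/model values the compiler has never profiled. */
        return (family == 6 && model >= 8) ? kernel_sse2_tune_a : kernel_sse2_tune_b;
    }

    int main(void) { select_kernel()(); return 0; }

Whether falling all the way back to the generic path for unprofiled (i.e., non-Intel) CPUs is a defensible engineering choice is, of course, exactly what is being argued about.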

Just to be clear, I have heavily used the Intel compiler for back-office financial applications. I'm not just repeating some opinion I happened across. Nor do I have any particular love for Intel.

Further, if you understand that Intel specifically targets particular Intel architectures with every branch path, then saying "well, just run it on all things" again shows you simply don't understand the discussion, or the architecture-based dispatcher. Yeah, "just run it" might run perfectly fine, and for a contrived example might even yield better runtimes, but it can also yield runtime errors or actual performance losses.

As I have repeatedly stated, we should expect great cross-architecture and cross-platform auto-vectorizing compilation (including on ARM, which with NEON also has vectorization) from the dominant compilers: GCC, LLVM, and MSVC. But somehow it always returns to the Intel compiler, nine years after they publicly stated, "Yeah, this is for Intel targets."

^1 - So much so that in almost all of these conversations, the people who complain about Intel compilers still use them because it still generates the fastest code for AMD processors, vectorization or not. Which is pretty bizarre, really.


Well, your explanation seems completely at odds with what is currently the top-voted comment in this discussion. The linked discussion of the patch he built to fix the problem indicates that the dispatcher does just do CPU feature detection. Here is the URL for reference:

http://www.swallowtail.org/naughty-intel.shtml

According to that, the code simply does a feature check for SSE, SSE2, and SSE3. Except it also does a check for "GenuineIntel" and treats its absence as "no SSE of any kind" even if the CPU otherwise indicates that it does SSE. That check is completely unnecessary and does nothing but slow (or crash!) the code on non-Intel CPUs.
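
Reconstructed from that description, the dispatch logic would look roughly like this sketch (again, not Intel's actual code): the feature bits are only consulted after a vendor-string gate, so a non-Intel CPU with SSE2 still lands on the slowest path.

    #include <cpuid.h>
    #include <string.h>

    /* Returns 3/2/1 for SSE3/SSE2/SSE, 0 for the generic path. */
    static int dispatch_level(void) {
        unsigned int eax, ebx, ecx, edx;
        char vendor[13] = {0};

        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
            return 0;
        memcpy(vendor + 0, &ebx, 4);   /* vendor string lives in EBX,EDX,ECX */
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);

        if (strcmp(vendor, "GenuineIntel") != 0)
            return 0;                  /* the disputed gate: non-Intel = "no SSE" */

        __get_cpuid(1, &eax, &ebx, &ecx, &edx);
        if (ecx & 1)         return 3; /* CPUID.1:ECX bit 0  = SSE3 */
        if ((edx >> 26) & 1) return 2; /* CPUID.1:EDX bit 26 = SSE2 */
        if ((edx >> 25) & 1) return 1; /* CPUID.1:EDX bit 25 = SSE  */
        return 0;
    }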

If you still think that's wrong, could you post the relevant code to show it?


That link doesn't actually show what it does to determine whether to use SSE or SSE2 (much less SSE3 and beyond). That it derives a boolean value is not the same as feature detection.

Further, the bulk of that entry was from 2004, which is pertinent: at the time, the new Pentium 4 was the first Intel processor with SSE2, and the SSE implementation on the Pentium III was something of a disaster -- it was single-precision only, it executed 128-bit operations as two 64-bit halves (a specific handicap that compilers targeting the P3 could schedule around), and it shared resources with the floating-point unit. So the feature flag, coupled with "GenuineIntel", was all they needed to distinguish the two possible Intel variants with support.

Since then the dispatcher and options have grown dramatically more complex as the number of architectures and permutations have exploded.


Well, here's a complete analysis of the function:

http://publicclu2.blogspot.com/2013/05/analysis-of-intel-com...

Unfortunately, it doesn't show the raw assembly. But in the absence of any information to the contrary, I'm perfectly happy to trust this pseudocode. It shows a bunch of feature checks, preceded by a single "GenuineIntel" check. The code that's gated on "GenuineIntel" would work just fine on non-Intel CPUs. It might sometimes produce sub-optimal results, but overall it'll be fine. There are some CPU family checks, but my understanding is that non-Intel CPUs return the same values that Intel CPUs do for similar architectures/capabilities.
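
For reference, decoding the family/model fields that such pseudocode checks is vendor-neutral; Intel and AMD broadly follow the same CPUID.1:EAX layout. A sketch of that decoding (my reconstruction, not the blog's code):

    #include <cpuid.h>
    #include <stdio.h>

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return 1;

        unsigned int stepping    = eax & 0x0f;
        unsigned int base_model  = (eax >> 4) & 0x0f;
        unsigned int base_family = (eax >> 8) & 0x0f;
        unsigned int family = base_family;
        unsigned int model  = base_model;

        /* Families 6 and 15 extend these fields with extra CPUID.1:EAX bits. */
        if (base_family == 0x0f)
            family += (eax >> 20) & 0xff;
        if (base_family == 0x06 || base_family == 0x0f)
            model += ((eax >> 16) & 0x0f) << 4;

        printf("family %u, model %u, stepping %u\n", family, model, stepping);
        return 0;
    }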

We have multiple people saying that the code runs faster with the "GenuineIntel" checks removed, and we have pseudocode for the function in question showing a bunch of feature detection plus a bit of CPU family detection, neither of which is at all Intel-specific. And then we have you, who can't seem to substantiate your claims at all.

If you have actual code or other reasonable evidence to support what you're saying, I'd love to see it. But right now, I'm not buying it.





