I recently tested Kahan summation on my MacBook Pro, and -O3 defeated the algorithm while -O2 did not. Declaring the variables below as volatile restored error compensation at -O3.
Sounds like a compiler bug to me. Can you file a bug against Clang with a reduced standalone test? (Or I can do it for you if you share the standalone test.)
Here is a complete simplified Kahan summation test, and indeed it works with -O3 but fails with -Ofast. There must have been something else going on in my real program at -O3. However, the original point that 'volatile' can be a workaround for some optimization problems is still valid (you may want the rest of your program to benefit from -Ofast without breaking certain parts).
Changing the three kahan_* variables to volatile makes this work (slowly) with -Ofast.
#include <stdio.h>

int main(int argc, char **argv) {
    int i;
    double sample, sum;
    double kahan_y, kahan_t, kahan_c;

    // initial values
    sum=0.0;
    sample=1.0; // start with "large" value

    for (i=0; i <= 1000000000; i++) { // add 1 large value plus 1 billion small values
        // Kahan summation algorithm
        kahan_y=sample - kahan_c;
        kahan_t=sum + kahan_y;
        kahan_c=(kahan_t - sum) - kahan_y;
        sum=kahan_t;

        // pre-load next small value
        sample=1.0E-20;
    }

    printf("sum: %.15f\n", sum);
}
Correct. `-Ofast`'s claim to fame is that it enables `-ffast-math`, which is why it has huge warning signs around it in the documentation. `-ffast-math` lets the compiler treat floating-point arithmetic as associative, so it can algebraically cancel the compensation term `(t - sum) - y` to zero, which is exactly what breaks Kahan summation. Rather than sprinkling in volatiles, which pessimizes the compiler to no end, I would recommend annotating the problematic function to turn off associativity [1][2].
That way the compiler applies all the optimizations it can and only associative math is turned off. This should work on both Clang and GCC and be net faster in all cases.
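For readers who want to see roughly what such an annotation could look like, here is a minimal sketch. The links [1][2] are not reproduced here, so the exact spelling is an assumption on my part: it uses GCC's `optimize` function attribute with `no-associative-math` and Clang's `#pragma clang fp reassociate(off)` (assumed to need Clang 11 or newer; Apple's Clang numbers its versions differently). The names `KAHAN_NO_REASSOC` and `kahan_sum` are hypothetical helpers, not from the original program.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: compiler-specific switches that disable floating-point
 * reassociation for a single function. Verify them against your
 * compiler's documentation before relying on them.
 */
#if defined(__GNUC__) && !defined(__clang__)
#define KAHAN_NO_REASSOC __attribute__((optimize("no-associative-math")))
#else
#define KAHAN_NO_REASSOC
#endif

/* Hypothetical helper: compensated (Kahan) sum of values[0..n-1]. */
KAHAN_NO_REASSOC
double kahan_sum(const double *values, size_t n)
{
#if defined(__clang__) && __clang_major__ >= 11
/* Clang requires this pragma at the start of a compound statement. */
#pragma clang fp reassociate(off)
#endif
    double sum = 0.0;
    double c = 0.0; /* running compensation, explicitly initialised */
    size_t i;

    assert(values != NULL || n == 0);

    for (i = 0; i < n; i++) {
        double y = values[i] - c; /* apply the correction carried over */
        double t = sum + y;       /* low-order bits of y may be lost here */
        c = (t - sum) - y;        /* recover what was lost (negated) */
        sum = t;
    }
    return sum;
}
```

To check whether it holds up on a given toolchain, one could build this with `-Ofast`, sum 1.0 followed by many copies of 1.0E-20, and confirm that the result is greater than 1.0. Note that GCC's manual describes the `optimize` attribute as intended for debugging rather than production code, so treat this as a starting point to verify, not a guarantee.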
This is what I mean by "If you're sprinkling volatile around, you probably aren't doing what you want": you're just cargo-culting bad advice.
I hope this isn't the actual "real" code, because you've got undefined behavior before you even have to worry about the associativity optimizations. There's an uninitialized read of 'kahan_c' on the first loop iteration.
The relevant code is:
(this is in an inner loop where a new g_sample_z is calculated and then added to a running g_sample_z_sum with this snippet)