Putting Your Data and Code in Order: Optimization and Memory (intel.com)
57 points by ingve on Jan 30, 2016 | 9 comments



This is 2016 and the article's author is using GCC 4.1 (released in 2007)? I wonder if a modern GCC does the autovectorizations that the author says v4.1 does not do.


It does.

I benchmarked compilers for some of my stuff just over a year ago (GCC, LLVM, ICC): http://imagine-rt.blogspot.co.nz/2014/12/c-compiler-benchmar...

and, surprisingly, I found ICC was no longer always the fastest, whereas two years previously, in tests I'd regularly run against GCC 4.1 and 4.4, it would always win, sometimes by a factor of 2x.


Furthermore, the ICC version he is comparing against (16.1) hasn't even been released yet, as far as I can tell. Wikipedia lists 16.0 as the latest stable version, and it's really difficult to find out which version of ICC is available for download on Intel's website without navigating through tens of pages and submitting at least one form.

Also, the article brought up the idea of block sizing, which sounds compelling. But the writer failed to produce a benchmark in which it did better than baseline on ICC, and then kept on writing as if block sizing had merit without even commenting on this discrepancy.


The Intel compiler update 1 (16.1) was released in late 2015. It is available for download to everyone who has a license or is beginning a new evaluation.

For matrices of size 4000x4000 there was a significant benefit from blocking the loops - please check the table again. This benefit continues as the matrices get larger. I specifically used two different matrix sizes as examples so that the reader could see that blocking is not a panacea. So, as you point out, for the smaller case there was no benefit - I wanted the reader to see that, which is why I included both data sets.

The best block size varies depending on the matrix size and the cache size of the system you are running on. When the data sets (or number of iterations) are relatively small compared to the cache size, blocking will not provide a benefit; don't blindly apply it.

The matrix multiply was selected for pedagogical purposes - it is easy to illustrate and easy to explain. I was limited by article length - see the link to the article on finite difference methods for more on the value of blocking. If you are really interested in blocking for matrix multiply, there are many other good articles that go into this in much more detail (and the fastest implementations do block into submatrices and perform additional optimizations). Please note that the optimal blocking may use rectangular, non-square blocks - I did not code the general case.
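For readers unfamiliar with the technique, a minimal sketch of loop blocking (tiling) for matrix multiply might look like the following. The size `N` and block size `BLOCK` are illustrative, not the article's values; in practice the block size is tuned to the cache sizes of the target machine.

```c
#include <string.h>

#define N 8        /* tiny size for illustration; the article uses up to 4000 */
#define BLOCK 4    /* illustrative tile size; tune to your cache in practice */

/* Naive triple-loop multiply: c = a * b */
static void matmul_naive(double a[N][N], double b[N][N], double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Blocked multiply: process BLOCK x BLOCK tiles so that the working set of
   each inner kernel stays resident in cache as the matrices grow. */
static void matmul_blocked(double a[N][N], double b[N][N], double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                /* multiply one pair of tiles into the c tile */
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++)
                        for (int j = jj; j < jj + BLOCK; j++)
                            c[i][j] += a[i][k] * b[k][j];
}
```

Both versions compute the same result; the payoff of the blocked version only appears once the matrices are large relative to the cache, which matches the article's 4000x4000 data.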


If I understand Table 1 correctly, it shows that with ICC a block size of 8 with 4000x4000 matrices gives roughly a 1.65x speedup (~10.53 / 6.38). The unit is not clearly stated, but it looks like a bigger number means better "performance", i.e. a faster program.


Oh, indeed. I believe I was not reading the table as intended.


The gcc compiler does vectorize the code. I wrote that gcc 4.1.2 did not interchange the loops automatically (ijk to ikj). Matrix multiply is very easy for a compiler to optimize, and it is easy to use to illustrate principles. There may exist some cases where a developer can see the value of interchanging loops that the compiler (Intel compiler, llvm, gcc, . . .) doesn't recognize. I pointed out how to see when the compiler does this with the optimization report. If the compiler already does it for you, don't bother with explicitly making the change - save your time for something else.
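The ijk-to-ikj interchange mentioned above can be sketched as follows (my own minimal example, not the article's code; `N` is an illustrative size). In ijk order the inner loop reads `b[k][j]` down a column with stride N; after interchange, the inner loop walks `c` and `b` row-wise with stride 1, which is what makes it cache-friendly and vectorizable.

```c
#include <string.h>

#define N 16   /* illustrative size */

/* ijk order: inner loop accesses b[k][j] with stride N (column-wise),
   which defeats the cache and hinders vectorization. */
static void matmul_ijk(double a[N][N], double b[N][N], double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* ikj order: inner loop walks c[i][*] and b[k][*] with stride 1, and
   a[i][k] is loop-invariant, so the inner loop vectorizes cleanly. */
static void matmul_ikj(double a[N][N], double b[N][N], double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double aik = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += aik * b[k][j];
        }
}
```

With GCC, `-fopt-info-vec` (or ICC's `-qopt-report`) shows which of these loops the compiler actually vectorized, so you can check before rewriting anything by hand.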


Ok, but that is a point you could make without comparing two compilers. It seems a disingenuous choice to compare a 2007 compiler to a 2015 compiler.


AFAICT the old wisdom that ICC does well on numeric code and vectorisation holds true. I haven't used it in a while, but for more general stuff I've seen it generate some really slow code compared to GCC and clang.



