For those wondering, the LAPACK guys are working on a new version of LAPACK to run on heterogeneous architectures, including GPUs. See their work at http://www.icl.utk.edu/research
I have worked with these guys and all I can say is good luck outperforming the highly optimized routines they have written. My bet is that the guy writing this blog used a non-optimized version of LAPACK.
There's no mystery here: the author is only solving a 5x5 system. LAPACK will barely have finished checking its function-call parameters by the time a fully specialized and inlined solver is finished. Also, from a cursory inspection, the author does no pivoting [edit: worse, it just uses Cramer's rule, which is insane], so their solver will be much less numerically robust (though this is admittedly less of a problem for the 5x5 case).
The performance is completely expected; in fact, I would expect an optimized 5x5 solver to be faster than this.
A better comparison might be a C++ library like Eigen that is intended for use on these small systems, that should optimize out function calls in a similar way, and that uses intrinsics.
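For example, a fixed-size solve in Eigen might look like this (a sketch against Eigen 3's dense module, using the same 5x5 shape as the article):

    // Sketch: because the dimensions are compile-time constants, Eigen can
    // inline and unroll the whole solve, much like a hand-rolled
    // metaprogrammed solver, while still doing partial-pivoting LU like DGESV.
    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        Eigen::Matrix<double, 5, 5> A = Eigen::Matrix<double, 5, 5>::Random();
        Eigen::Matrix<double, 5, 1> b = Eigen::Matrix<double, 5, 1>::Random();

        Eigen::Matrix<double, 5, 1> x = A.partialPivLu().solve(b);

        std::cout << "residual: " << (A * x - b).norm() << "\n";
        return 0;
    }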
If your matrices are of arbitrary size, Blaze [0] would be my pick. It has the best performance on their benchmarks, Baptiste Wicht's benchmarks, and my own.
Though if the matrices are guaranteed to be small (like in this example), libxsmm [1] is specialized and highly optimized for this use case and beats its competitors above.
And yes, it's absolutely essential to make sure you didn't just move computations to compile-time.
Please check with an array initialized from a file or some other dynamic source. With maximal inlining, there's a risk of constant propagation resulting in a lot of the computation happening at compile time rather than at run time.
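A sketch of that kind of check, with the data loaded at run time so nothing can be folded away (the file name and format are placeholders):

    // Sketch: read the matrix and right-hand side at run time, so the values
    // are unknown to the compiler; the solver itself can still be fully
    // inlined and unrolled.
    #include <array>
    #include <fstream>
    #include <iostream>

    int main(int argc, char** argv) {
        std::array<double, 25> A{};
        std::array<double, 5> b{};

        std::ifstream in(argc > 1 ? argv[1] : "system.txt");
        for (double& v : A) in >> v;
        for (double& v : b) in >> v;

        // ... call and time the unrolled solver here ...

        std::cout << A[0] << " " << b[0] << "\n";   // keep the inputs observable
        return 0;
    }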
As an anecdote of just how powerful this can be, I once wrote a prototype of a simulation (written in C) that used a lot of integer divisions and needed to be run many times with different parameters.
As a quick and dirty hack that took about half an hour to implement, I decided to just treat the program as a template string and have the parameters substituted in by a Python script that then called GCC. To the amazement of a very sceptical friend, the result was a 5x speedup.
A lot of this speedup was consistent with replacing the integer divisions with multiplicative inverses, but also merging constants in a way that would have made things quite unreadable. I was quite impressed with how far GCC had gone.
Considering each instance of this simulation ran for tens of minutes, it was well worth compilation time of a few seconds. It kept the code really simple, too.
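A sketch of the division effect in question (the divisor 1009 is arbitrary; comparing the two functions on godbolt at -O2 shows the difference):

    // Sketch: when the divisor is a compile-time constant, GCC and Clang
    // replace the division with a multiply-and-shift ("magic number")
    // sequence; with a run-time divisor they have to emit an actual div
    // instruction, which is far slower.
    #include <cstdint>

    // Divisor baked in as a constant: compiles to multiply + shifts, no div.
    std::uint64_t div_const(std::uint64_t x) {
        return x / 1009;
    }

    // Divisor only known at run time: compiles to a div instruction.
    std::uint64_t div_runtime(std::uint64_t x, std::uint64_t d) {
        return x / d;
    }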
That's pretty much what you get with Julia, except that on v0.7 it will also do interprocedural optimization (i.e. inline constants and functions across calls), all without playing with templates. I'll gladly pay a one-second JIT lag to take a few days off my computation, and I don't have to screw around with templates to get it.
Your comment has taken me from having mixed feelings about Julia to being convinced I want to try it for something numerically intensive that I would otherwise be inclined to turn to C for.
Even better, Julia code is super high-level and readable. And you can just drop down to C using the builtin "ccall" facility. Julia's type system is designed for excellent interop with the C ABI.
You should check out the presentation by Matt Godbolt about compiler internals. He talks about what kinds of optimizations GCC and Clang do. Pretty amazing stuff. For example, the compiler will figure out you are trying to write a popcount in high-level code and replace it with a single popcount instruction.
That talk really revealed some brilliant optimizations that Clang does. In addition to what you mentioned, the bit where Clang recognises the summation loop, removes it, and replaces it with a closed-form solution feels intuitively like something only a cutting-edge research-language compiler should be able to do. Seeing this for C and C++ is very impressive.
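Roughly the kind of code involved, for anyone who wants to reproduce it on godbolt (a sketch; the exact output depends on compiler version and target flags):

    // Sketch of the idiom-recognition examples from the talk. At -O2, Clang
    // (and recent GCC) turn the first loop into a single popcnt instruction
    // when targeting a CPU that has one, and turn the second into the closed
    // form n*(n+1)/2 with no loop at all.
    #include <cstdint>

    int count_bits(std::uint64_t x) {
        int n = 0;
        while (x) {
            x &= x - 1;   // clear the lowest set bit
            ++n;
        }
        return n;
    }

    unsigned sum_to(unsigned n) {
        unsigned total = 0;
        for (unsigned i = 1; i <= n; ++i)
            total += i;
        return total;
    }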
Even with intrinsics that do have a degree of portability, I prefer the more explicit form of the bit counting just for readability. Nobody has to learn what popcnt means.
These optimizations really mean that as software writers we can declare our intent and have the compiler do safe optimizations.
It seems like these optimizations of Clang are also quite insensitive to the many different ways of writing the same thing that C++ has earned notoriety for.
Huh? It's not explicit; the purpose is implicit. You are writing down an algorithm that implements popcount. The compiler then has to infer that you actually wanted popcount (maybe you just wanted to waste a bit of time in a timing loop?) and then substitute something based on that inference.
I am all for it being explicit, i.e. some library function that has a name that somewhere mentions "popcount" or some such. Then have the compiler generate whatever code it needs for that function.
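For what it's worth, C++20 later standardized exactly that as std::popcount (and GCC/Clang have long shipped __builtin_popcount); a minimal sketch:

    // Sketch: stating the intent directly. Requires C++20 (<bit>); compiles
    // down to a single popcnt instruction on hardware that has one.
    #include <bit>
    #include <cstdint>

    int count_bits(std::uint64_t x) {
        return std::popcount(x);
    }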
Next time you're considering something like this, have a look at D: it has nearly the same syntax as C but with some seriously awesome power features. You could easily have written the whole Python/C mess entirely in D, generating the code to compile at compile time, e.g. from a config file.
It was a hack. I knew the C compiler could do these optimizations at compile time that would take me longer to implement at run time and would probably introduce mistakes. I could well have used the C preprocessor, but the parameters needed to be taken from a file for each program.
The simulation was run about 100 times with different parameters. Instead of taking about an hour each they instead took about 12 minutes each, with the only difference being baking in all the parameters as constants and letting GCC do the rest.
Pretty much, except with more parameters taken from a file. The substituting in of parameters was done in Python just so that I could manually check the output for a few cases.
I think if I had to do something similar again today, I'd have the script output a shell script that passes the parameters as command-line macro definitions. That didn't really occur to me at the time.
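A sketch of what that would look like (the parameter names and values here are made up):

    // Sketch: parameters injected as macros on the compiler command line, e.g.
    //   g++ -O2 -DNUM_STEPS=100000 -DDIVISOR=1009 sim.cpp -o sim_100000_1009
    // so each parameter set gets its own fully constant-folded binary.
    #ifndef NUM_STEPS
    #define NUM_STEPS 1000   // fallback for plain builds
    #endif
    #ifndef DIVISOR
    #define DIVISOR 7
    #endif

    #include <cstdint>
    #include <iostream>

    int main() {
        std::uint64_t acc = 0;
        for (std::uint64_t i = 1; i <= NUM_STEPS; ++i)
            acc += i / DIVISOR;   // constant divisor: optimized to multiply + shift
        std::cout << acc << "\n";
        return 0;
    }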
It's cool and useful to be sure, but as costrouc pointed out, LAPACK has been optimized for a very long time by some very smart people. Most people who have not studied the field for a while are unlikely to come up with a more optimal method.
Poking at various compilers, the code from the article does appear to compile to an unrolled version of the operation (as claimed by the article):
https://godbolt.org/g/Wmj8ar
It is somewhat unclear from his writing, but did he compare this metaprogramming solution against OpenBLAS or ATLAS, or just the reference BLAS/LAPACK implementation? If I were trying to do any significant linear algebra computations, I would definitely try an optimized BLAS/LAPACK implementation first.
Also, it seems the author solved a triangular system. LAPACK has special routines for that. Were they used?
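For reference, the triangular-solve routine is DTRTRS (or DTRSV/DTRSM at the BLAS level); through the LAPACKE C interface a call looks roughly like this (a sketch using a trivial diagonal system; the layout and uplo flags depend on how the matrix is actually stored):

    // Sketch: solving an upper-triangular 5x5 system with LAPACKE_dtrtrs
    // instead of the general-purpose DGESV. Link with -llapacke -llapack -lblas.
    #include <lapacke.h>
    #include <cstdio>

    int main() {
        // A diagonal matrix (trivially upper triangular), stored column-major.
        double A[25] = {0};
        for (int i = 0; i < 5; ++i) A[i + 5 * i] = i + 1.0;
        double b[5] = {1, 2, 3, 4, 5};   // the solution should be all ones

        lapack_int info = LAPACKE_dtrtrs(LAPACK_COL_MAJOR,
                                         'U',    // upper triangular
                                         'N',    // no transpose
                                         'N',    // non-unit diagonal
                                         5, 1,   // n, nrhs
                                         A, 5,   // matrix and leading dimension
                                         b, 5);  // rhs, overwritten with the solution
        std::printf("info = %d, x = %g %g %g %g %g\n",
                    (int)info, b[0], b[1], b[2], b[3], b[4]);
        return 0;
    }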
LAPACK is typically optimized for larger systems, not 5 unknowns. Also, 5 is not a great number for vectorized operations - it might even be beneficial to zero/one pad the matrix.
Optimized LAPACK is often 5-10x faster than «basic LAPACK».
BLAS/LAPACK were originally written in the 70s/80s and sometimes unroll loops to the tune of 5 or 7; that made sense then, not so much these days.
I didn’t read the article in detail, but there appear to be a lot of holes in it.
Personally, my biggest reason to prefer LAPACK in general is that its authors have already put a great deal of effort into correctness and numerical stability, so I don't have to. Even basic LAPACK is pretty fast, let alone the optimized libraries. Hand-optimizing my own special case is an absolute last resort.
This article makes no sense at all. It compares a numerically unstable O(n!) algorithm (Cramer's rule) to a numerically stable O(n^3) algorithm (Gaussian elimination).
What bothers me more is the fact that the article seems to compare LAPACK's routine for solving systems of any size with a specialized function for solving 5x5 systems.
I mean, it's a neat trick and all, but the comparison with LAPACK does not really stand on equal ground.
The proposed C++ tricks are useful, and I think it's reasonably well known (at least to people who do numerics) that a C++ compiler, with the right flags, can beautifully unroll loops when array sizes are known at compile time.
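As a small illustration, a sketch of the kind of loop that gets fully unrolled once the size is a compile-time constant (easy to verify on godbolt at -O2):

    // Sketch: with the size as a template parameter, compilers typically
    // unroll this dot product completely; for N = 5 the result is five
    // multiply-adds and no loop.
    #include <array>
    #include <cstddef>

    template <std::size_t N>
    double dot(const std::array<double, N>& a, const std::array<double, N>& b) {
        double sum = 0.0;
        for (std::size_t i = 0; i < N; ++i)
            sum += a[i] * b[i];   // trip count known at compile time
        return sum;
    }

    double example(const std::array<double, 5>& a, const std::array<double, 5>& b) {
        return dot(a, b);         // instantiated with N = 5
    }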
> brute force Cramer's rule solution
This isn't really a fair comparison: trusty old DGESV will yield far more accurate results for badly conditioned problems since it does row swaps (partial pivoting). It might also rescale values; I don't remember the details. It would be much more reasonable to compare against a hypothetical DGEC (double precision general matrix Cramer's rule) routine.
D's Mir does the same sort of thing, and metaprogramming in D is much nicer, because it was designed from the ground up instead of the bug elevated to feature that metaprogramming is in C++.
> instead of the bug elevated to feature that metaprogramming is in C++.
That inflammatory wording doesn’t add anything to your comment. If you wanted to compare D with C++ it’d be a lot more useful to link to something which does that rather than just the documentation index.
C++ metaprogramming was an accident. Expression templates are a huge, obscure, and very surprising hack. I don't know about you, but I find them really hard to understand. SFINAE wasn't meant to be used for compile-time branching, but when people realised that it could be used that way, it was just about the only tool for metaprogramming in C++. It certainly wasn't designed for that purpose, though.
C++ has for a long time embraced metaprogramming and given new tools for it (I think there's `static if` now? Or will be soon?), but it really all started out as an accident.
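For reference, the C++17 spelling of that is `if constexpr`; a minimal sketch of compile-time branching with it (the function here is purely illustrative):

    // Sketch: C++17's "if constexpr" does compile-time branching directly,
    // without SFINAE or tag dispatch; the branches not taken are discarded
    // and never instantiated for that type.
    #include <string>
    #include <type_traits>

    template <typename T>
    std::string describe(const T& value) {
        if constexpr (std::is_integral_v<T>) {
            return "integer: " + std::to_string(value);
        } else if constexpr (std::is_floating_point_v<T>) {
            return "float: " + std::to_string(value);
        } else {
            return "something else";
        }
    }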
My second link does that, compares Mir to a bunch of other numeric libraries, most in C++.
Edit: (Replying to comment below by acdha because HN does not allow me to post new comments right now.)
I don't have that kind of example handy. I can point you to the D Gems section of the D tour to give you a taste of what is possible in D. Click on "Gems" in the navigation menu above to see them.
If you want a detailed blog post comparing D and C++ metaprogramming, I don't have it ready yet. If you're curious to learn about it, look in more depth at the links I've provided. I'm sorry I don't have the resources to give you the detailed answer you require.
Again, I'm not defending C++ or disparaging D but saying that this style of abrasive advocacy doesn't leave the reader knowing anything more about D than they already knew. Your last sentence is a great example:
> My second link does that, compares Mir to a bunch of other numeric libraries, most in C++.
This is technically correct but not helpful because it just shows a bunch of benchmark results. If the point is to demonstrate how D is more powerful for this kind of programming what would be useful is something which shows the same function written in modern C++ and D so you could see how they differ.
> and metaprogramming in D is much nicer, because it was designed from the ground up instead of the bug elevated to feature that metaprogramming is in C++
Yet neither of them are as elegant as LISP macros.
Very true! I wish D's metaprogramming was slightly more structured instead of relying on raw string concatenation. However, it's not so bad, and it's miles ahead of C++ metaprogramming in terms of ease of use. It's nice that it's so much more deliberately conceived. There's static if, static for, compile-time function evaluation, template and string mixins, and more!
I think D's metaprogramming is very close to lisp, possibly as close as possible without having sexps.
> constexpr means C++ also has static if, compile time loops, function eval, etc.
It's still not quite the same. I'm sorry to not bring examples, but if you look for them yourself, I think you'll find that D's metaprogramming looks nicer than C++'s constexpr.
Hopefully that will be fixed in a future C++ standard. There is an interesting static reflection proposal floating around that would allow for this, plus more.
I'm not sure how you're measuring elegance, but writing Lisp macros and getting them to work correctly has never been something I'd call elegant. You have the Common Lisp approach, where you work in a complicated sublanguage like this example from PCL:
    (defmacro when (condition &rest body)
      `(if ,condition (progn ,@body)))
and then you think hard about where the leaks are. Or else you take the Scheme approach of hygienic macros which are even less elegant. Macros in Lisp are powerful, but I'm not sure metaprogramming in Lisp is better than in a language like D.
> You have the Common Lisp approach, where you work in a complicated sublanguage like this example from PCL:
I don't think you actually understand anything about the example code you posted. There is no special "complicated sublanguage" - the backquote[1] is an extension of the quote shorthand notation[2] for list literals. It is for building data structures and has nothing to do with "sublanguages" or macros. There is no special syntax for macros in Common Lisp. Common Lisp macros do not produce source code - they produce any type of Common Lisp objects directly. That is why they are simpler and strictly more powerful than D templates.
When it comes to templates, I wouldn't call anything from this list of D examples[3] elegant.
> I don't think you actually understand anything about the example code you posted.
This doesn't add anything to the conversation. If you want to discuss my post, I'm happy to discuss. You're clearly upset that I didn't praise Common Lisp, but things like that don't lead to useful discussion.
> There is no special "complicated sublanguage" - the backquote[1] is an extension of the quote shorthand notation[2] for list literals. It is for building data structures and has nothing to do with "sublanguages" or macros. There is no special syntax for macros in Common Lisp.
I've never seen anyone write macros using only s-expressions. True, you don't need to learn any special syntax to do it, but that's not how anyone writes macros.
> When it comes to templates, I wouldn't call anything from this list of D examples[3] elegant.
How is that relevant to my comment?
> You're clearly upset that I didn't praise Common Lisp, but things like that don't lead to useful discussion.
No, I am upset that you are posting misinformed comments about Common Lisp macros vs D templating while representing yourself as being knowledgeable about the subject. You clearly do not understand how the former works.
> I've never seen anyone write macros using only s-expressions. True, you don't need to learn any special syntax to do it, but that's not how anyone writes macros.
Please stop pretending that you have enough experience with Common Lisp to say things like "I've never seen anyone do X." Writing macros that call functions that return code is a basic technique; Graham's On Lisp is full of examples. Saying that quoting as a way of specifying list literals is a "complicated sublanguage" is criticism on the level of "C++ templates have too many pointy brackets."
I kind of fail to see how it is especially complicated. The macro uses a general mechanism to create lists. This mechanism isn't tied to macros and does not do anything special in macros.
I wonder if similar tricks work in other languages with similarly optimizing compilers but "nicer" metaprogramming, such as Common Lisp, Rust, and perhaps OCaml (with camlp4)?
Even more interesting would be whether this leads to the same huge performance gains in those languages as well.
It does lead to performance gains. This kind of stuff is done all the time in Julia's numerical codes. Flexible macros definitely make this easier, and multiple dispatch + automatic function specialization on types + inlining + interprocedural optimizations makes a lot of what's discussed here automatic. If it's not automatic, then libraries like StaticArrays.jl make functions that use small arrays do these kinds of tricks automatically (it overloads and unrolls operations specifically to match the size of the array which is compile time information). I am sure other languages can do this too, but having it semi-automatic is pretty cool if what you do all day is write numerical codes.
Julia's flexibility in the numerical space is why it's my daily driver language for prototyping and implementation nowadays. Truly a fantastic language.
These kinds of tricks certainly are more pleasant to do in languages with real metaprogramming support.
For numerical applications, one interesting project that comes to mind is the Terra programming language. It uses a flexible high-level language (Lua) to metaprogram a fast and predictable C-like language (Terra). The advantage compared to the examples you mentioned is that each language can focus on its strength: the low-level language being similar to C lets the programmer reason about performance, and having a high-level language for the metaprogramming keeps things simple.
There are also systems out there specifically designed for numeric applications that let you specify an algorithm in a high-level DSL and generate high-performance C code, some of which might even test various versions of the code to find one that runs better on your machine (for example, the best way to iterate over the data might be affected by cache sizes). I don't remember their names off the top of my head, but the Terra PhD thesis has lots of references to them.
It was playing around with exactly these sorts of tricks in C++ (with late-90s compilers) that drove me to learn Common Lisp. It's a much nicer environment to do this sort of thing in, but difficult to compete head to head with a good optimizing C++ compiler on code generation in general. FWIW, in Common Lisp in particular, having access to compiler macros allows for interesting approaches.
I was doing primarily research code at the time though, and the overall improvement in productivity more than made up for the fractional difference in run times for me. Of course, ymmv.
Just be wary of performance gains claims. Per other comments this implementation is highly optimized for a specific use and uses a numerically unstable algorithm.
Template metaprogramming just shifts the burden of computation to compile time. Sure, I could pre-compute the first 10000 factorials, put them in an array, compile and then print the answer in constant time. But it'll work for the first 10000 numbers only.
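A minimal sketch of that precompute-a-table pattern, assuming C++17 (the table stops at 20! because anything larger overflows 64 bits):

    // Sketch: a lookup table built entirely at compile time with constexpr.
    #include <array>
    #include <cstddef>
    #include <cstdint>

    constexpr std::array<std::uint64_t, 21> make_factorials() {
        std::array<std::uint64_t, 21> table{};
        table[0] = 1;
        for (std::size_t i = 1; i < table.size(); ++i)
            table[i] = table[i - 1] * i;
        return table;
    }

    constexpr auto factorials = make_factorials();
    static_assert(factorials[5] == 120, "computed at compile time");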
It's not nearly that simple.
You can use template meta-programming to make expressions both easy to read and efficient, by avoiding creation of temporary objects, fusion, lifting, etc. etc.
In some domains (e.g. linear algebra) this can be very powerful because you don't have to write non-idiomatic expressions just to feed the compiler things it can understand.
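A deliberately tiny sketch of the expression-template idea (toy Vec/AddExpr types; real libraries such as Eigen or Blaze add far more machinery on top):

    // Sketch: "a + b + c" builds a lightweight expression object instead of
    // intermediate vectors; evaluation is a single fused loop in operator=.
    #include <cstddef>
    #include <vector>

    template <typename L, typename R>
    struct AddExpr {
        const L& lhs;
        const R& rhs;
        double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
        std::size_t size() const { return lhs.size(); }
    };

    struct Vec {
        std::vector<double> data;
        explicit Vec(std::size_t n) : data(n) {}
        double  operator[](std::size_t i) const { return data[i]; }
        double& operator[](std::size_t i)       { return data[i]; }
        std::size_t size() const { return data.size(); }

        // Evaluation happens here: one pass, no temporary Vec objects.
        template <typename E>
        Vec& operator=(const E& e) {
            for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
            return *this;
        }
    };

    AddExpr<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

    template <typename L, typename R>
    AddExpr<AddExpr<L, R>, Vec> operator+(const AddExpr<L, R>& a, const Vec& b) {
        return {a, b};
    }

    int main() {
        Vec a(3), b(3), c(3), x(3);
        for (std::size_t i = 0; i < 3; ++i) { a[i] = 1; b[i] = 2; c[i] = 3; }
        x = a + b + c;   // one fused loop over AddExpr<AddExpr<Vec,Vec>,Vec>
        return x[0] == 6 ? 0 : 1;
    }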
Modern Fortran (especially the upcoming Fortran 2018 standard) is already good enough for those kinds of libraries. Moreover, big chunks of the computations in BLAS/LAPACK are implemented in assembly. What surprises me is that the authors of those libraries are still stuck on Fortran 90.
[edit] I had a bug in my code; it does seem that the compiler is not picking up the constants at compile time. That is to say, it's not precomputing the final value at compile time based on the values in those matrices.
Others have already mentioned Cramer's rule, but even if this code were switched to Gaussian elimination it might not beat LAPACK on big problems. LAPACK implementations do extra tricks to optimize cache usage on big matrices. It would be interesting to include Eigen in the benchmark too.
I've read your comment several times and can't follow it. I think you're saying that even if the author continued to use Cramer's rule, he would lose to LAPACK on big problems.
No. If he continued to use Cramer's rule, he would definitely lose to LAPACK on big problems, purely because of using Cramer's rule. But even if he didn't continue to use Cramer's rule (if he switched to Gaussian elimination), he still might lose to LAPACK, because LAPACK is pretty well optimized.
> he would definitely lose to LAPACK on big problems, purely because of using Cramer's rule.
More importantly though, the code would probably take ages (as in, "age of the universe" ages) to compile, as Cramer's rule is O(n!) with a naive implementation. I think one could get to O(n^4) if the determinant were computed using LU factorization, but what's the point of Cramer's rule then...