For those wondering, the LAPACK guys are working on a new version of LAPACK to run on heterogeneous architectures, including GPUs. See their work at http://www.icl.utk.edu/research
I have worked with these guys and all I can say is good luck outperforming the highly optimized routines they have written. My bet is that the guy writing this blog used a non-optimized version of LAPACK.
There's no mystery here: the author is only solving a 5x5 system. LAPACK will barely have finished checking its function-call parameters by the time a fully specialized and inlined solver is finished. Also, from a cursory inspection, the author does no pivoting [edit: worse, it just uses Cramer's rule, which is insane], so their solver will be much less numerically robust (though this is admittedly less of a problem for the 5x5 case).
The performance is completely expected; in fact, I would expect an optimized 5x5 solver to be faster than this.
A better comparison might be a C++ library like Eigen that is intended for use on these small systems, that should optimize out function calls in a similar way, and that uses intrinsics.
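For example, a fixed-size solve in Eigen might look like this (a sketch against Eigen 3's dense module, using the same 5x5 shape as the article):

    // Sketch: because the dimensions are compile-time constants, Eigen can
    // inline and unroll the whole solve, much like a hand-rolled
    // metaprogrammed solver, while still doing partial-pivoting LU like DGESV.
    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        Eigen::Matrix<double, 5, 5> A = Eigen::Matrix<double, 5, 5>::Random();
        Eigen::Matrix<double, 5, 1> b = Eigen::Matrix<double, 5, 1>::Random();

        Eigen::Matrix<double, 5, 1> x = A.partialPivLu().solve(b);

        std::cout << "residual: " << (A * x - b).norm() << "\n";
        return 0;
    }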
If your matrices are of arbitrary size, Blaze [0] would be my pick. It has the best performance on their benchmarks, Baptiste Wicht's benchmarks, and my own.
Though if the matrices are guaranteed to be small (like in this example), libxsmm [1] is specialized and highly optimized for this use case and beats its competitors above.
And yes, it's absolutely essential to make sure you didn't just move computations to compile-time.
Please check with an array initialized from a file or some other dynamic source. With maximal inlining, there's a risk of constant propagation resulting in a lot of the computation happening at compile time rather than at run time.
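A sketch of that kind of check, with the data loaded at run time so nothing can be folded away (the file name and format are placeholders):

    // Sketch: read the matrix and right-hand side at run time, so the values
    // are unknown to the compiler; the solver itself can still be fully
    // inlined and unrolled.
    #include <array>
    #include <fstream>
    #include <iostream>

    int main(int argc, char** argv) {
        std::array<double, 25> A{};
        std::array<double, 5> b{};

        std::ifstream in(argc > 1 ? argv[1] : "system.txt");
        for (double& v : A) in >> v;
        for (double& v : b) in >> v;

        // ... call and time the unrolled solver here ...

        std::cout << A[0] << " " << b[0] << "\n";   // keep the inputs observable
        return 0;
    }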
As an anecdote of just how powerful this can be, I once wrote a prototype of a simulation (written in C) that used a lot of integer divisions and needed to be run many times with different parameters.
As a quick and dirty hack that took about half an hour to implement, I decided to just treat the program as a template string and have the parameters substituted in by a Python script that then called GCC. To the amazement of a very sceptical friend, the result was a 5x speedup.
A lot of this speedup was consistent with replacing the integer divisions with multiplicative inverses, but also merging constants in a way that would have made things quite unreadable. I was quite impressed with how far GCC had gone.
Considering each instance of this simulation ran for tens of minutes, it was well worth compilation time of a few seconds. It kept the code really simple, too.
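A sketch of the division effect in question (the divisor 1009 is arbitrary; comparing the two functions on godbolt at -O2 shows the difference):

    // Sketch: when the divisor is a compile-time constant, GCC and Clang
    // replace the division with a multiply-and-shift ("magic number")
    // sequence; with a run-time divisor they have to emit an actual div
    // instruction, which is far slower.
    #include <cstdint>

    // Divisor baked in as a constant: compiles to multiply + shifts, no div.
    std::uint64_t div_const(std::uint64_t x) {
        return x / 1009;
    }

    // Divisor only known at run time: compiles to a div instruction.
    std::uint64_t div_runtime(std::uint64_t x, std::uint64_t d) {
        return x / d;
    }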
That's pretty much what you get with Julia, except that on v0.7 it will also do interprocedural optimization (i.e. inline constants and functions across calls), all without playing with templates. I'll gladly pay a one-second JIT lag to take a few days off my computation, and I don't have to screw around with templates to get it.
Your comment has taken me from having mixed feelings about Julia to being convinced I want to try it for something numerically intensive that I would otherwise be inclined to turn to C for.
Even better, Julia code is super high-level and readable. And you can just drop down to C using the builtin "ccall" facility. Julia's type system is designed for excellent interop with the C ABI.
You should check out the presentation by Matt Godbolt about compiler internals. He talks about what kinds of optimizations GCC and Clang do. Pretty amazing stuff. For example, the compiler will figure out you are trying to write a popcount in high-level code and replace it with a single popcount instruction.
That talk really revealed some brilliant optimizations that Clang does. In addition to what you mentioned, the bit where Clang recognises the summation loop, removes it, and replaces it with a closed-form solution feels intuitively like something only a cutting-edge research-language compiler should be able to do. Seeing this for C and C++ is very impressive.
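Roughly the kind of code involved, for anyone who wants to reproduce it on godbolt (a sketch; the exact output depends on compiler version and target flags):

    // Sketch of the idiom-recognition examples from the talk. At -O2, Clang
    // (and recent GCC) turn the first loop into a single popcnt instruction
    // when targeting a CPU that has one, and turn the second into the closed
    // form n*(n+1)/2 with no loop at all.
    #include <cstdint>

    int count_bits(std::uint64_t x) {
        int n = 0;
        while (x) {
            x &= x - 1;   // clear the lowest set bit
            ++n;
        }
        return n;
    }

    unsigned sum_to(unsigned n) {
        unsigned total = 0;
        for (unsigned i = 1; i <= n; ++i)
            total += i;
        return total;
    }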
Even with intrinsics that do have a degree of portability, I prefer the more explicit form of the bit counting just for readability. Nobody has to learn what popcnt means.
These optimizations really mean that as software writers we can declare our intent and have the compiler do safe optimizations.
It seems like these optimizations of Clang are also quite insensitive to the many different ways of writing the same thing that C++ has earned notoriety for.
Huh? It's not explicit; the purpose is implicit. You are writing down an algorithm that implements popcount. The compiler then has to infer that you actually wanted popcount (maybe you just wanted to waste a bit of time in a timing loop?) and then substitute something based on that inference.
I am all for it being explicit, i.e. some library function that has a name that somewhere mentions "popcount" or some such. Then have the compiler generate whatever code it needs for that function.
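For what it's worth, C++20 later standardized exactly that as std::popcount (and GCC/Clang have long shipped __builtin_popcount); a minimal sketch:

    // Sketch: stating the intent directly. Requires C++20 (<bit>); compiles
    // down to a single popcnt instruction on hardware that has one.
    #include <bit>
    #include <cstdint>

    int count_bits(std::uint64_t x) {
        return std::popcount(x);
    }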
Next time you're considering something like this, have a look at D: it has nearly the same syntax as C but with some seriously awesome power features. You could easily have written the whole Python/C mess entirely in D, generating the code to compile at compile time, e.g. from a config file.
It was a hack. I knew the C compiler could do these optimizations at compile time that would take me longer to implement at run time and would probably introduce mistakes. I could well have used the C preprocessor, but the parameters needed to be taken from a file for each program.
The simulation was run about 100 times with different parameters. Instead of taking about an hour each they instead took about 12 minutes each, with the only difference being baking in all the parameters as constants and letting GCC do the rest.
Pretty much, except with more parameters taken from a file. The substituting in of parameters was done in Python just so that I could manually check the output for a few cases.
I think if I had to do something similar again today, I'd have the script output a shell script that passes the parameters as command-line macro definitions. That didn't really occur to me at the time.
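A sketch of what that would look like (the parameter names and values here are made up):

    // Sketch: parameters injected as macros on the compiler command line, e.g.
    //   g++ -O2 -DNUM_STEPS=100000 -DDIVISOR=1009 sim.cpp -o sim_100000_1009
    // so each parameter set gets its own fully constant-folded binary.
    #ifndef NUM_STEPS
    #define NUM_STEPS 1000   // fallback for plain builds
    #endif
    #ifndef DIVISOR
    #define DIVISOR 7
    #endif

    #include <cstdint>
    #include <iostream>

    int main() {
        std::uint64_t acc = 0;
        for (std::uint64_t i = 1; i <= NUM_STEPS; ++i)
            acc += i / DIVISOR;   // constant divisor: optimized to multiply + shift
        std::cout << acc << "\n";
        return 0;
    }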
It's cool and useful to be sure, but as costrouc pointed out, LAPACK has been optimized for a very long time by some very smart people. Most people who have not studied the field for a while are unlikely to come up with a more optimal method.
Poking at various compilers, the code from the article does appear to compile to an unrolled version of the operation (as claimed by the article):
https://godbolt.org/g/Wmj8ar
It is somewhat unclear from his writing, but did he compare this metaprogramming solution against OpenBLAS or ATLAS, or just the reference BLAS/LAPACK implementation? If I were trying to do any significant linear algebra computations, I would definitely try an optimized BLAS/LAPACK implementation first.
Also, it seems the author solved a triangular system. LAPACK has special routines for that. Were they used?
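For reference, the triangular-solve routine is DTRTRS (or DTRSV/DTRSM at the BLAS level); through the LAPACKE C interface a call looks roughly like this (a sketch using a trivial diagonal system; the layout and uplo flags depend on how the matrix is actually stored):

    // Sketch: solving an upper-triangular 5x5 system with LAPACKE_dtrtrs
    // instead of the general-purpose DGESV. Link with -llapacke -llapack -lblas.
    #include <lapacke.h>
    #include <cstdio>

    int main() {
        // A diagonal matrix (trivially upper triangular), stored column-major.
        double A[25] = {0};
        for (int i = 0; i < 5; ++i) A[i + 5 * i] = i + 1.0;
        double b[5] = {1, 2, 3, 4, 5};   // the solution should be all ones

        lapack_int info = LAPACKE_dtrtrs(LAPACK_COL_MAJOR,
                                         'U',    // upper triangular
                                         'N',    // no transpose
                                         'N',    // non-unit diagonal
                                         5, 1,   // n, nrhs
                                         A, 5,   // matrix and leading dimension
                                         b, 5);  // rhs, overwritten with the solution
        std::printf("info = %d, x = %g %g %g %g %g\n",
                    (int)info, b[0], b[1], b[2], b[3], b[4]);
        return 0;
    }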
LAPACK is typically optimized for larger systems, not 5 unknowns. Also, 5 is not a great number for vectorized operations - it might even be beneficial to zero/one pad the matrix.
Optimized LAPACK is often 5-10x faster than «basic LAPACK».
BLAS/LAPACK were originally written in the 70s/80s and sometimes unroll loops to the tune of 5 or 7; that made sense then, not so much these days.
I didn’t read the article in detail, but there appear to be a lot of holes in it.
Personally, my biggest reason to prefer LAPACK in general is that its authors have already put a great deal of effort into correctness and numerical stability, so I don't have to. Even basic LAPACK is pretty fast, let alone the optimized libraries. Hand-optimizing my own special case is an absolute last resort.
This article makes no sense at all. It compares a numerically unstable O(n!) algorithm (Cramer's rule) to a numerically stable O(n^3) algorithm (Gaussian elimination).
What bothers me more is the fact that the article seems to compare LAPACK's routine for solving systems of any size with a specialized function for solving 5x5 systems.
I mean, it's a neat trick and all, but the comparison with LAPACK does not really stand on equal ground.
The proposed C++ tricks are useful, and I think it's reasonably well known (at least to people who do numerics) that a C++ compiler, with the right flags, can beautifully unroll loops when array sizes are known at compile time.
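As a small illustration, a sketch of the kind of loop that gets fully unrolled once the size is a compile-time constant (easy to verify on godbolt at -O2):

    // Sketch: with the size as a template parameter, compilers typically
    // unroll this dot product completely; for N = 5 the result is five
    // multiply-adds and no loop.
    #include <array>
    #include <cstddef>

    template <std::size_t N>
    double dot(const std::array<double, N>& a, const std::array<double, N>& b) {
        double sum = 0.0;
        for (std::size_t i = 0; i < N; ++i)
            sum += a[i] * b[i];   // trip count known at compile time
        return sum;
    }

    double example(const std::array<double, 5>& a, const std::array<double, 5>& b) {
        return dot(a, b);         // instantiated with N = 5
    }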
> brute force Cramer's rule solution
This isn't really a fair comparison: trusty old DGESV will yield far more accurate results for badly conditioned problems since it does row swaps (partial pivoting). It might also rescale values; I don't remember the details. It would be much more reasonable to compare against a hypothetical DGEC (double precision general matrix Cramer's rule) routine.
D's Mir does the same sort of thing, and metaprogramming in D is much nicer, because it was designed from the ground up instead of the bug elevated to feature that metaprogramming is in C++.
> instead of the bug elevated to feature that metaprogramming is in C++.
That inflammatory wording doesn’t add anything to your comment. If you wanted to compare D with C++ it’d be a lot more useful to link to something which does that rather than just the documentation index.
C++ metaprogramming was an accident. Expression templates are a huge, obscure, and very surprising hack. I don't know about you, but I find them really hard to understand. SFINAE wasn't meant to be used for compile-time branching, but when people realised that it could be used that way, it was just about the only tool for metaprogramming in C++. It certainly wasn't designed for that purpose, though.
C++ has for a long time embraced metaprogramming and given new tools for it (I think there's `static if` now? Or will be soon?), but it really all started out as an accident.
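For reference, the C++17 spelling of that is `if constexpr`; a minimal sketch of compile-time branching with it (the function here is purely illustrative):

    // Sketch: C++17's "if constexpr" does compile-time branching directly,
    // without SFINAE or tag dispatch; the branches not taken are discarded
    // and never instantiated for that type.
    #include <string>
    #include <type_traits>

    template <typename T>
    std::string describe(const T& value) {
        if constexpr (std::is_integral_v<T>) {
            return "integer: " + std::to_string(value);
        } else if constexpr (std::is_floating_point_v<T>) {
            return "float: " + std::to_string(value);
        } else {
            return "something else";
        }
    }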
My second link does that, compares Mir to a bunch of other numeric libraries, most in C++.
Edit: (Replying to comment below by acdha because HN does not allow me to post new comments right now.)
I don't have that kind of example handy. I can point you to the D Gems section of the D tour to give you a taste of what is possible in D. Click on "Gems" in the navigation menu above to see them.
If you want a detailed blog post comparing D and C++ metaprogramming, I don't have it ready yet. If you're curious to learn about it, look in more depth at the links I've provided. I'm sorry I don't have the resources to give you the detailed answer you require.
Again, I'm not defending C++ or disparaging D but saying that this style of abrasive advocacy doesn't leave the reader knowing anything more about D than they already knew. Your last sentence is a great example:
> My second link does that, compares Mir to a bunch of other numeric libraries, most in C++.
This is technically correct but not helpful because it just shows a bunch of benchmark results. If the point is to demonstrate how D is more powerful for this kind of programming what would be useful is something which shows the same function written in modern C++ and D so you could see how they differ.
> and metaprogramming in D is much nicer, because it was designed from the ground up instead of the bug elevated to feature that metaprogramming is in C++
Yet neither of them are as elegant as LISP macros.
Very true! I wish D's metaprogramming was slightly more structured instead of relying on raw string concatenation. However, it's not so bad, and it's miles ahead of C++ metaprogramming in terms of ease of use. It's nice that it's so much more deliberately conceived. There's static if, static for, compile-time function evaluation, template and string mixins, and more!
I think D's metaprogramming is very close to lisp, possibly as close as possible without having sexps.
> constexpr means C++ also has static if, compile time loops, function eval, etc.
It's still not quite the same. I'm sorry to not bring examples, but if you look for them yourself, I think you'll find that D's metaprogramming looks nicer than C++'s constexpr.
Hopefully that will be fixed in a future C++ standard. There is an interesting static reflection proposal floating around that would allow for this, plus more.
I'm not sure how you're measuring elegance, but writing Lisp macros and getting them to work correctly has never been something I'd call elegant. You have the Common Lisp approach, where you work in a complicated sublanguage like this example from PCL:
    (defmacro when (condition &rest body)
      `(if ,condition (progn ,@body)))
and then you think hard about where the leaks are. Or else you take the Scheme approach of hygienic macros which are even less elegant. Macros in Lisp are powerful, but I'm not sure metaprogramming in Lisp is better than in a language like D.
> You have the Common Lisp approach, where you work in a complicated sublanguage like this example from PCL:
I don't think you actually understand anything about the example code you posted. There is no special "complicated sublanguage" - the backquote[1] is an extension of the quote shorthand notation[2] for list literals. It is for building data structures and has nothing to do with "sublanguages" or macros. There is no special syntax for macros in Common Lisp. Common Lisp macros do not produce source code - they produce any type of Common Lisp objects directly. That is why they are simpler and strictly more powerful than D templates.
When it comes to templates, I wouldn't call anything from this list of D examples[3] elegant.
> I don't think you actually understand anything about the example code you posted.
This doesn't add anything to the conversation. If you want to discuss my post, I'm happy to discuss. You're clearly upset that I didn't praise Common Lisp, but things like that don't lead to useful discussion.
> There is no special "complicated sublanguage" - the backquote[1] is an extension of the quote shorthand notation[2] for list literals. It is for building data structures and has nothing to do with "sublanguages" or macros. There is no special syntax for macros in Common Lisp.
I've never seen anyone write macros using only s-expressions. True, you don't need to learn any special syntax to do it, but that's not how anyone writes macros.
> When it comes to templates, I wouldn't call anything from this list of D examples[3] elegant.
How is that relevant to my comment?
> You're clearly upset that I didn't praise Common Lisp, but things like that don't lead to useful discussion.
No, I am upset that you are posting misinformed comments about Common Lisp macros vs D templating while representing yourself as being knowledgeable about the subject. You clearly do not understand how the former works.
> I've never seen anyone write macros using only s-expressions. True, you don't need to learn any special syntax to do it, but that's not how anyone writes macros.
Please stop pretending that you have enough experience with Common Lisp to say things like "I've never seen anyone do X." Writing macros that call functions that return code is a basic technique; Graham's On Lisp is full of examples. Saying that quoting as a way of specifying list literals is a "complicated sublanguage" is criticism on the level of "C++ templates have too many pointy brackets."
I kind of fail to see how it is especially complicated. The macro uses a general mechanism to create lists. This mechanism isn't tied to macros and does not do anything special in macros.
I wonder if similar tricks work in other languages with similarly optimizing compilers but "nicer" metaprogramming, such as Common Lisp, Rust, and perhaps OCaml (with camlp4)?
Even more interesting would be whether this leads to the same huge performance gains in those languages as well.
It does lead to performance gains. This kind of stuff is done all the time in Julia's numerical codes. Flexible macros definitely make this easier, and multiple dispatch + automatic function specialization on types + inlining + interprocedural optimizations makes a lot of what's discussed here automatic. If it's not automatic, then libraries like StaticArrays.jl make functions that use small arrays do these kinds of tricks automatically (it overloads and unrolls operations specifically to match the size of the array which is compile time information). I am sure other languages can do this too, but having it semi-automatic is pretty cool if what you do all day is write numerical codes.
Julia's flexibility in the numerical space is why it's my daily driver language for prototyping and implementation nowadays. Truly a fantastic language.
These kinds of tricks certainly are more pleasant to do in languages with real metaprogramming support.
For numerical applications, one interesting project that comes to mind is the Terra programming language. It uses a flexible high-level language (Lua) to metaprogram a fast and predictable C-like language (Terra). The advantage compared to the examples you mentioned is that each language can focus on its strength: the low-level language being similar to C lets the programmer reason about performance, and having a high-level language for the metaprogramming keeps things simple.
There are also systems out there specifically designed for numeric applications that let you specify an algorithm in a high-level DSL and generate high-performance C code, some of which might even test various versions of the code to find one that runs better on your machine (for example, the best way to iterate over the data might be affected by cache sizes). I don't remember their names off the top of my head, but the Terra PhD thesis has lots of references to them.
It was playing around with exactly these sorts of tricks in C++ (with late-90s compilers) that drove me to learn Common Lisp. It's a much nicer environment to do this sort of thing in, but difficult to compete head to head with a good optimizing C++ compiler on code generation in general. FWIW, in Common Lisp in particular, having access to compiler macros allows for interesting approaches.
I was doing primarily research code at the time though, and the overall improvement in productivity more than made up for the fractional difference in run times for me. Of course, ymmv.
Just be wary of performance gains claims. Per other comments this implementation is highly optimized for a specific use and uses a numerically unstable algorithm.
Template metaprogramming just shifts the burden of computation to compile time. Sure, I could pre-compute the first 10000 factorials, put them in an array, compile and then print the answer in constant time. But it'll work for the first 10000 numbers only.
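A minimal sketch of that precompute-a-table pattern, assuming C++17 (the table stops at 20! because anything larger overflows 64 bits):

    // Sketch: a lookup table built entirely at compile time with constexpr.
    #include <array>
    #include <cstddef>
    #include <cstdint>

    constexpr std::array<std::uint64_t, 21> make_factorials() {
        std::array<std::uint64_t, 21> table{};
        table[0] = 1;
        for (std::size_t i = 1; i < table.size(); ++i)
            table[i] = table[i - 1] * i;
        return table;
    }

    constexpr auto factorials = make_factorials();
    static_assert(factorials[5] == 120, "computed at compile time");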
It's not nearly that simple.
You can use template meta-programming to make expressions both easy to read and efficient, by avoiding creation of temporary objects, fusion, lifting, etc. etc.
In some domains (e.g. linear algebra) this can be very powerful because you don't have to write non-idiomatic expressions just to feed the compiler things it can understand.
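A deliberately tiny sketch of the expression-template idea (toy Vec/AddExpr types; real libraries such as Eigen or Blaze add far more machinery on top):

    // Sketch: "a + b + c" builds a lightweight expression object instead of
    // intermediate vectors; evaluation is a single fused loop in operator=.
    #include <cstddef>
    #include <vector>

    template <typename L, typename R>
    struct AddExpr {
        const L& lhs;
        const R& rhs;
        double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
        std::size_t size() const { return lhs.size(); }
    };

    struct Vec {
        std::vector<double> data;
        explicit Vec(std::size_t n) : data(n) {}
        double  operator[](std::size_t i) const { return data[i]; }
        double& operator[](std::size_t i)       { return data[i]; }
        std::size_t size() const { return data.size(); }

        // Evaluation happens here: one pass, no temporary Vec objects.
        template <typename E>
        Vec& operator=(const E& e) {
            for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
            return *this;
        }
    };

    AddExpr<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

    template <typename L, typename R>
    AddExpr<AddExpr<L, R>, Vec> operator+(const AddExpr<L, R>& a, const Vec& b) {
        return {a, b};
    }

    int main() {
        Vec a(3), b(3), c(3), x(3);
        for (std::size_t i = 0; i < 3; ++i) { a[i] = 1; b[i] = 2; c[i] = 3; }
        x = a + b + c;   // one fused loop over AddExpr<AddExpr<Vec,Vec>,Vec>
        return x[0] == 6 ? 0 : 1;
    }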
Modern Fortran (especially the upcoming Fortran 2018 standard) is already good enough for those kinds of libraries. Moreover, big chunks of the computations in BLAS/LAPACK are implemented in assembly. What surprises me is that the authors of those libraries are still stuck on Fortran 90.
[edit] I had a bug in my code; it does seem that the compiler is not picking up the constants at compile time. That is to say, it's not precomputing the final value at compile time based on the values in those matrices.
Others have already mentioned Cramer's rule, but even if this code were switched to Gaussian elimination it might not beat LAPACK on big problems. LAPACK implementations do extra tricks to optimize cache usage on big matrices. It would be interesting to include Eigen in the benchmark too.
I've read your comment several times and can't follow it. I think you're saying that even if the author continued to use Cramer's rule, he would lose to LAPACK on big problems.
No. If he continued to use Cramer's rule, he would definitely lose to LAPACK on big problems, purely because of using Cramer's rule. But even if he didn't continue to use Cramer's rule (if he switched to Gaussian elimination), he still might lose to LAPACK, because LAPACK is pretty well optimized.
> he would definitely lose to LAPACK on big problems, purely because of using Cramer's rule.
More importantly though, the code would probably take ages (as in, "age of the universe" ages) to compile, as Cramer's rule is O(n!) with a naive implementation. I think one could get to O(n^4) if the determinant were computed using LU factorization, but what's the point of Cramer's rule then...