Hacker News

The Intel compiler is extremely good at finding and exploiting vectorization (SSE/AVX) opportunities; using these instructions in hot loops is becoming key to getting anywhere near peak performance out of modern CPUs.

Most people don't care enough about performance to notice, but recompiling with Intel's compiler often shows a 5-15% improvement on number-crunching code, and that's before spending time investigating the vectorization output and fine-tuning.

On the other hand, if you really care about speed then someone with some experience in performance tuning will typically be able to make your code run 4-8x faster, vastly outweighing any benefits from the compiler.



Just in case you skimmed moconnor's comment, it bears repeating:

Intel's compiler: 15% speedup

Hand-optimized code: 800% speedup

This gap in compiler tech is still a big deal today. Think about the early mainframes and how the code was all written in machine code or assembler. http://www.pbm.com/~lindahl/mel.html

Compilers can still improve, a lot.

• Parallel code? _still_ hand-written, even though choosing the right language/library can help. Note that a language that makes parallelism easy may cost you when you actually go for the maximum parallel speedup

• GPU? hand-written. See: litecoin miners, and bitcoin miners before that. They were written in OpenCL, but hand-tuned for a specific architecture

• Cross-platform? Java and C should be portable, but ask any Android developer how it really works

• And the one we're talking about here: number-crunching code? hand optimized!

I'm actually quite optimistic about the future of compilers. One of the reasons HN is so fun to read is that it comes up often.


"Hand-optimized code: 800% speedup"

It really depends.

Especially in how naively the "non-optimized" code was written.

I can see vectorization accelerating code by 2x to 4x (per core), but not much more than that (and that's the part the Intel compiler does best)

But even GCC can vectorize better today than in the early days of 4.0


Sure, it depends. I've seen embarrassingly parallel (yeah, that's a real term) code with speedups in the 20's.

My personal best was a 9x speedup, partly by using SSSE3 and partly by some really good prefetching and non-temporal writes.

If you look at what I said in the very narrowest light, I agree that SSE2 all by itself typically delivers a 2x speedup per core over non-SSE code.


Technically, 8x faster is a 700% speedup.

Either way, using percentages there seems really misleading. 15% versus 700% (or 800%) looks like a much bigger difference than 1.15 versus 8 if you're not careful when thinking about it.





