Hacker News

Isn't there any C/C++ tutorial on how to avoid cache misses?

I've already read the classic advice to use data-oriented design and avoid linked lists, but I've never really seen anything concrete about performance in programming.

All I hear is "profile, profile, profile". I guess most of the time it will point out a place where my code is slow, but if I find one, how do I fix it?

For example, can I write some simple bit of code that causes a cache miss at will? Any example of such code? How do I detect a cache miss?




Cache misses are independent of language; however, C++ gives you a lot of control over where items are placed in memory.

To create a cache miss on one processor, all you need to do is force the processor to look at M different memory addresses that map to the same cache location, where M is higher than the N-way associativity of your processor's cache. The specifics differ for each processor, but as an example, Intel Sandy Bridge CPUs have an 8-way associative L1 data cache, so if you fetch 9 addresses that are 4096 bytes apart you will cause a cache miss.

Lots of small tips and links if you Google, e.g.

http://stackoverflow.com/questions/8744088/what-is-the-best-...

http://stackoverflow.com/questions/3359524/detecting-cache-m...

The full detail is in books like http://www.intel.co.uk/content/www/uk/en/architecture-and-te...


Another interesting case: suppose you have a contiguous array of objects that are 48 bytes big, and your cache line is 64 bytes. Then every second object will straddle two cache lines. For random access to any one such object, padding them out to 64 bytes each will run faster on average, as it will access main memory only once and evict fewer cache lines. But you can only fit 3/4 as many objects in your cache now, so if you need to access a lot of them then your program will run slower.
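A hedged illustration of that trade-off (the 48-byte record itself is hypothetical; the 64-byte cache line matches common x86 CPUs):

```cpp
#include <cstddef>

// Hypothetical 48-byte record: in a packed array with 64-byte cache
// lines, every second element straddles two lines.
struct Packed {
    double d[6];  // 6 * 8 = 48 bytes
};

// alignas(64) pads each element out to a full cache line, so any random
// access touches exactly one line -- at the cost of fitting only 3/4 as
// many objects in cache.
struct alignas(64) Padded {
    double d[6];
};

static_assert(sizeof(Packed) == 48, "packed layout assumption");
static_assert(sizeof(Padded) == 64, "one object per cache line");
```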

So, "profile, profile, profile", but also understand factors like this that pull in different directions in different scenarios.


There's Ulrich Drepper's "What every programmer should know about memory" - http://lwn.net/Articles/250967/. It's long, it's from 2007, but its fundamentals are right.


The reason all you hear is "profile, profile, profile" is that the performance implications of a given code change can be completely non-obvious, due to all the complex trickery in the processor and cache hierarchy. So what do you do after profiling and finding a bottleneck? If you can't identify its exact cause and the solution isn't obvious, all that's left is to make a best guess at the cause, implement a possible fix, and then profile again to check whether it made a (positive) difference. Do that benchmarking even when the fix seems obvious, since it's so easy to be wrong.

Detecting cache misses can be done with valgrind: http://valgrind.org/docs/manual/cg-manual.html
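Cachegrind simulates the cache hierarchy and counts simulated hits and misses per function. A typical invocation looks like this (`./your_program` is a placeholder, and the output file name varies per run):

```shell
# Simulate the caches and report read/write miss counts per function.
valgrind --tool=cachegrind ./your_program

# Annotate the recorded counts line by line against the source.
cg_annotate cachegrind.out.<pid>
```

Note that cachegrind runs your program on a simulated cache model, so it's great for spotting where misses happen, but expect it to be much slower than a native run.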

There are architecture-specific instruction-level profilers as well. This video mentions a couple and is also a great overview of how non-intuitive performance can be with modern CPUs: http://channel9.msdn.com/Events/Build/2014/4-587


For general programs, i.e. programs not meant to run on one exact machine, it is better to focus on better algorithms rather than better code.

In this case it is useful to look at cache-oblivious algorithms, which let you write code that performs well regardless of the architecture and processor model it runs on. From Wikipedia: «a cache-oblivious algorithm (or cache-transcendent algorithm) is an algorithm designed to take advantage of a CPU cache without having the size of the cache (or the length of the cache lines, etc.) as an explicit parameter.» http://en.wikipedia.org/wiki/Cache-oblivious_algorithm

Sadly the field is relatively new and there are not many such algorithms around, but you can already find them for many common tasks, such as sorting, matrix operations, and calculating FFTs.


Accessing contiguous chunks of memory in order minimizes cache misses. It's not a whole lot more complicated than that.

This is a good intro video on the subject: https://www.youtube.com/watch?v=16ZF9XqkfRY


Woooh, thanks a lot for that video! I hope this will bother Donald Knuth :)


Here are a few books/theses that have decent intros to the realm of temporal/spatial locality, GPUs, etc. Not C++-specific (except that CUDA code is C++, I think):

http://tacc-web.austin.utexas.edu/veijkhout/public_html/istc...

http://www.mpi-sws.org/~turon/turon-thesis.pdf




