Hacker News

Isn't there any C/C++ tutorial on how to avoid cache misses?

I've already read the classic advice to use data-oriented design and avoid linked lists, but I've never really seen anything concrete about performance in programming.

All I hear is "profile, profile, profile". I guess most of the time it will point out a place where my code is slow, but if I find one, how do I fix it?

For example, can I write some simple bit of code that causes a cache miss at will? Any example of such code? How do I detect a cache miss?




Cache misses are independent of language; however, C++ gives you a lot of control over where items are placed in memory.

To create a cache miss on one processor, all you need to do is force the processor to look at M different memory addresses that map to the same cache location, where M is higher than the N-way associativity of your processor's cache. The specifics differ for each processor, but as an example, Intel Sandy Bridge CPUs have an 8-way associative L1 data cache, so if you fetch 9 addresses that are 4096 bytes apart you will cause a cache miss.

Lots of small tips and links if you Google, e.g.

http://stackoverflow.com/questions/8744088/what-is-the-best-...

http://stackoverflow.com/questions/3359524/detecting-cache-m...

The full detail is in books like http://www.intel.co.uk/content/www/uk/en/architecture-and-te...


Another interesting case: suppose you have a contiguous array of objects that are 48 bytes big, and your cache line is 64 bytes. Then every second object will straddle two cache lines. For random access to any one such object, padding them out to 64 bytes each will run faster on average, as it will access main memory only once and evict fewer cache lines. But you can only fit 3/4 as many objects in your cache now, so if you need to access a lot of them then your program will run slower.
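A hedged illustration of that trade-off (the 48-byte record itself is hypothetical; the 64-byte cache line matches common x86 CPUs):

```cpp
#include <cstddef>

// Hypothetical 48-byte record: in a packed array with 64-byte cache
// lines, every second element straddles two lines.
struct Packed {
    double d[6];  // 6 * 8 = 48 bytes
};

// alignas(64) pads each element out to a full cache line, so any random
// access touches exactly one line -- at the cost of fitting only 3/4 as
// many objects in cache.
struct alignas(64) Padded {
    double d[6];
};

static_assert(sizeof(Packed) == 48, "packed layout assumption");
static_assert(sizeof(Padded) == 64, "one object per cache line");
```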

So, "profile, profile, profile", but also understand factors like this that pull in different directions in different scenarios.


There's Ulrich Drepper's "What every programmer should know about memory" - http://lwn.net/Articles/250967/. It's long, it's from 2007, but its fundamentals are right.


The reason all you hear is "profile, profile, profile" is that the performance implications of a given code change can be completely non-obvious, due to all the complex trickery in the processor and cache hierarchy. So what do you do after profiling and finding a bottleneck? If you can't identify its exact cause and the solution isn't obvious, all that's left is to make a best guess at the cause, implement a possible fix, and then profile again to check whether it made a (positive) difference. Do that benchmarking even when the fix seems obvious, since it's so easy to be wrong.

Detecting cache misses can be done with valgrind: http://valgrind.org/docs/manual/cg-manual.html
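Cachegrind simulates the cache hierarchy and counts simulated hits and misses per function. A typical invocation looks like this (`./your_program` is a placeholder, and the output file name varies per run):

```shell
# Simulate the caches and report read/write miss counts per function.
valgrind --tool=cachegrind ./your_program

# Annotate the recorded counts line by line against the source.
cg_annotate cachegrind.out.<pid>
```

Note that cachegrind runs your program on a simulated cache model, so it's great for spotting where misses happen, but expect it to be much slower than a native run.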

There are architecture-specific instruction-level profilers as well. This video mentions a couple and is also a great overview of how non-intuitive performance can be with modern CPUs: http://channel9.msdn.com/Events/Build/2014/4-587


For general programs, i.e. programs not meant to run on one exact machine, it is better to focus on better algorithms rather than better code.

In this case it is useful to look at cache-oblivious algorithms, which let you write code that performs well regardless of the architecture and processor model it runs on. From Wikipedia: «a cache-oblivious algorithm (or cache-transcendent algorithm) is an algorithm designed to take advantage of a CPU cache without having the size of the cache (or the length of the cache lines, etc.) as an explicit parameter.» http://en.wikipedia.org/wiki/Cache-oblivious_algorithm

Sadly the field is relatively new and there are not many such algorithms around, but you can already find them for many common tasks, such as sorting, matrix operations, and calculating FFTs.


Accessing contiguous chunks of memory in order minimizes cache misses. It's not a whole lot more complicated than that.

This is a good intro video on the subject: https://www.youtube.com/watch?v=16ZF9XqkfRY


Woooh, thanks a lot for that video! I hope this will bother Donald Knuth :)


Here are a few books/theses that have decent intros to the realm of temporal/spatial locality, GPUs, etc. Not C++-specific (except that CUDA code is C++, I think):

http://tacc-web.austin.utexas.edu/veijkhout/public_html/istc...

http://www.mpi-sws.org/~turon/turon-thesis.pdf




