One way to limit branching is to make the applications of each operation as temporally contiguous as possible. That is, rather than taking each data item and applying every operation to it in turn, structure the operations so that each one can be applied to the entire data set, one after the other. The branch predictor is happy because you're predictably looping over a simpler piece of code. The I-cache is happy because you're looping over a smaller piece of code.
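A minimal sketch of this restructuring (often called loop fission), assuming a hypothetical three-stage pipeline of scale, clamp, and accumulate over a float array:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> data(100'000, 0.5f);

    // Interleaved version: every iteration runs all three operations on
    // one item, so the loop body is larger and its branches are mixed.
    //
    // for (float& x : data) { x *= 2.0f; x = std::min(x, 1.0f); sum += x; }

    // Contiguous version: each operation sweeps the whole data set in its
    // own tight loop. The branch predictor sees one simple, repetitive
    // loop at a time, and each small body fits easily in the I-cache.
    for (float& x : data) x *= 2.0f;              // pass 1: scale
    for (float& x : data) x = std::min(x, 1.0f);  // pass 2: clamp

    float sum = 0.0f;
    for (float x : data) sum += x;                // pass 3: accumulate

    std::printf("sum = %f\n", sum);
}
```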
Optimizing for the second-level memory cache means making your data access patterns predictable (the suggestion above helps here, too) and keeping your data compact to get as much use out of each cache line as possible: Do you really need a 64-bit size_t to store an index into an array that will never be larger than a few thousand elements?
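To illustrate the compactness point, here is a sketch comparing 64-bit and 16-bit indices for an array capped at a few thousand elements; assuming a typical 64-byte cache line, the narrower type packs four times as many indices per line (the container names here are illustrative):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // 64-bit indices: 8 bytes each, so a 64-byte cache line holds only 8.
    std::vector<std::size_t> wide_indices(4096);

    // 16-bit indices cover any position up to 65535: 2 bytes each, so the
    // same cache line holds 32 of them -- 4x fewer lines to pull in.
    std::vector<std::uint16_t> narrow_indices(4096);

    std::printf("wide:   %zu bytes\n", wide_indices.size() * sizeof(std::size_t));
    std::printf("narrow: %zu bytes\n", narrow_indices.size() * sizeof(std::uint16_t));
}
```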