Can anyone with good knowledge of this explain why we aren't also explicitly managing the cache (with the help of the compiler)? Register allocation is basically the lowest level of cache, the one you can operate on most efficiently. Why, at the next level, do we rely on the hardware to do the guesswork for us?
First, you can manage the cache explicitly. You can load from memory bypassing the cache (if you know you won't need something again), you can prefetch into the cache ahead of time, and you can expunge lines from the cache. Those instructions are useful in high performance code (and in Spectre-like exploits :) ).
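For the curious, on x86 those map to intrinsics like _mm_prefetch, the non-temporal streaming stores, and _mm_clflush. An untested sketch (x86-specific; copy_no_pollute is a made-up routine just to show the instructions in one place, and it assumes dst is 16-byte aligned as _mm_stream_ps requires):

    #include <immintrin.h>   /* x86 prefetch / streaming-store / clflush intrinsics */
    #include <stddef.h>

    /* Copy while trying not to pollute the cache with the destination. */
    void copy_no_pollute(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i + 4 <= n; i += 4) {
            /* hint: pull source data ~64 bytes ahead into the cache before we need it */
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_T0);

            __m128 v = _mm_loadu_ps(src + i);
            _mm_stream_ps(dst + i, v);   /* non-temporal store: bypass the cache */
        }
        _mm_sfence();        /* make the streaming stores globally visible */
        _mm_clflush(dst);    /* explicit eviction of a line, just to demonstrate it */
    }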
Second, it's harder than it looks. Even a simpler problem, branch speculation, is notoriously difficult to get right without profiling information from program execution. For example, the Linux kernel has the likely() and unlikely() macros, which inject branch-predictor hints into the generated assembly; you use them like if (likely(...)). The problem? Those hints, carefully inserted by serious systems engineers, were often terrible! So much so that removing all of them increased the performance of the kernel.
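(For reference, those macros are just thin wrappers around a GCC/Clang builtin; handle_error below is made up, purely to show the usage.)

    /* roughly how the kernel defines them (include/linux/compiler.h) */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    if (unlikely(err))       /* "this branch is cold" */
        handle_error(err);   /* compiler keeps the hot path as the fall-through */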
I think you pretty much have to provide profiling information before you can think about having the compiler manage the cache. It's doable, but it's a lot of work, and your profiles had better be highly representative.
This is part of why Java performance is far better than it should be, if we just consider the overhead of the JVM and of garbage collection. You get profiles for free, and that covers many sins. It's how companies that really care about latency sometimes end up using the JVM anyway.
>Those hints, carefully inserted by serious systems engineers, were often terrible! So much so that removing all of them increased the performance of the kernel.
So glad I'm not alone. Every time I tried to optimize hot loops with these hints, it either barely improved things or flat-out regressed by an even larger margin.
Register allocation is a form of value analysis, while managing caches the way you suggest is a form of pointer analysis, which is 100 times harder. One huge problem is aliasing: it is in general very difficult to tell whether X[i] and Y[j] are distinct memory locations, so the compiler has to err on the side of caution and assume that X[i] overwrites Y[j] and vice versa. The C99 "restrict" keyword and modern improvements in language semantics (like forbidding pointer arithmetic) help, but they don't completely eliminate aliasing. You also have pointer indirection, e.g. X[Y[Z[j]]], which makes the analysis even harder.
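A tiny illustration of what aliasing costs the compiler (hypothetical functions, just to show the effect of restrict):

    /* Without restrict, the compiler must assume x and y may overlap,
     * so y[0] has to be reloaded on every iteration in case the store
     * to x[i] changed it. */
    void scale(float *x, const float *y, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] *= y[0];
    }

    /* With restrict we promise there is no overlap, so y[0] can be
     * hoisted into a register and the loop vectorized more aggressively. */
    void scale_restrict(float *restrict x, const float *restrict y, int n)
    {
        for (int i = 0; i < n; i++)
            x[i] *= y[0];
    }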
In other words, state-of-the-art compilers can beat 99.9% of all developers on register allocation, but they can't beat a developer who manages cache memory explicitly.
The Game Boy Advance had a CPU without cache, but with 32 KB of fast RAM on the chip. That was pretty close to a fully manual cache, and it turned out to be a complete waste in practice.
It was impractical to share/reuse it like a real cache, because that would require writing some sort of memory allocator or hash table in software. Managing that was a hassle, and an overhead in itself. Only a few games (mostly 3D ones) took advantage of it, and usually just by copying their hottest loop in there.
Aren't registers fixed by x86_64, while cache is CPU-specific (e.g. newer CPUs have more cache than older ones, but the register count is fixed: 8 on x86 and 16 on x86_64)?
So I think the compiler can work with registers at compile time, but it can't work with an unknown cache structure.
Cache usage won't be known statically at compile-time. If you retrieve one item from a database, it will result in a different usage of the cache than if you retrieved a different item.
In addition to the other good answers: if the amount of state that's explicitly managed by software gets too large, it gets really expensive to save or restore that state. This happens, for example, when a syscall transfers control to the operating system. If the (many-MB) cache were software-managed, the OS would need to decide between flushing it all to main memory (expensive for quick syscalls) or leaving it in place and having OS code and data be uncached. Function calls between libraries have similar problems: how is a called function supposed to know which cache space is available for it to use? If you call the same function multiple times, who is responsible for keeping its working data cached?
For a 32MB L3 cache, flushing the entire cache to memory (as would be required when switching between processes) could take over a millisecond, let alone trying to manage caches shared by multiple cores.
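Back-of-envelope, assuming (purely for illustration) around 30 GB/s of sustained write bandwidth to DRAM:

    32 MB ≈ 0.031 GB
    0.031 GB / 30 GB/s ≈ 1 ms per full flush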
Cache was designed to be transparent to programs so hardware makers could add it and get better benchmark results without anything being recompiled. It has stayed that way since because it's easier for the hardware maker to change and improve it without having to change the ISA.
There's also I$ (instruction cache) pressure from bloating the machine code with such high-volume information.
Though cf. the prefetch and clflush instructions, as well as the concept(s) of non-temporal and write-combining stores.