I'm sad that this ends on a cliffhanger, I was excited to see the results for the circular array version. Hopefully I don't lose track of this story when (if) someone who knows what they're doing does that test (I am not such a person, or I'd do it myself).
I haven't tested an actually fast version of this, but I did hack together two versions that used rings of nodes instead of lists. There are two separate issues here, IMHO. First, rings - whether in an array or using linked nodes - make the logic even simpler, because you never need to test for the start/end of the list:
Implementing it using arrays isn't hard. You have two obvious options:
One is to still use a linked ring. In that case your array just saves you dynamic allocations and allocator overhead (and, in a GC'd language, GC pressure).
The other is to copy entries to compact the ring on eviction. Whether that is a horrible idea or works fine will depend a great deal on your performance characteristics and the number of entries in your cache.
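To make the trade-off concrete, here's a rough hypothetical sketch of the compacting approach (my own illustration, not either of the versions I hacked together): a plain Python list ordered oldest-to-newest, where a hit removes and re-appends its key (an O(n) shift) and eviction just drops index 0.

```python
class CompactingLRU:
    """Hypothetical sketch: array kept compact by shifting on every reorder."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.order = []    # keys, oldest first; the "array" being compacted
        self.values = {}   # key -> value

    def get(self, key):
        if key not in self.values:
            return None
        self.order.remove(key)   # O(n): shifts everything after it left
        self.order.append(key)   # now most recently used
        return self.values[key]

    def put(self, key, value):
        if key in self.values:
            self.order.remove(key)
        elif len(self.order) >= self.capacity:
            evicted = self.order.pop(0)   # evict oldest; compacting shift
            del self.values[evicted]
        self.order.append(key)
        self.values[key] = value
```

The shifting is exactly what might or might not be horrible: for a small cache it's a tight memcpy-like move over contiguous memory, which can easily beat chasing pointers.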
Personally I think the "linked ring in an array" approach is a nice middle ground vs. just an array: you're likely to get better (CPU) cache locality than with a naive linked list (whether that's enough to matter really depends on your implementation language's underlying allocator), and the singly linked version has quite low extra overhead (which can be made smaller still if size is a concern - e.g. the link doesn't need to be a pointer but can be an array index).
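For the "linked ring in an array" idea, a minimal hypothetical Python sketch (`RingLRU` is my name for it; I've used a doubly linked ring here because it makes unlinking on a hit O(1) - a singly linked one is smaller but needs extra bookkeeping to find a node's predecessor). The links are indices into preallocated parallel arrays, so there are no node allocations after setup, and because it's a ring there are no head/tail sentinels or start/end tests: the LRU slot is always `prev[mru]`, so eviction is just overwriting that slot and rotating the MRU pointer.

```python
class RingLRU:
    """Hypothetical sketch: doubly linked ring stored as index arrays."""

    def __init__(self, capacity):
        n = capacity
        self.keys = [None] * n
        self.vals = [None] * n
        self.next = [(i + 1) % n for i in range(n)]  # ring wiring, by index
        self.prev = [(i - 1) % n for i in range(n)]
        self.index = {}   # key -> slot
        self.mru = 0      # most recently used slot; prev[mru] is the LRU

    def _touch(self, s):
        """Make slot s the MRU by splicing it in just before the current MRU."""
        if s == self.mru:
            return
        # unlink s from its current position
        p, n = self.prev[s], self.next[s]
        self.next[p] = n
        self.prev[n] = p
        # splice s between the LRU and the current MRU, then rotate
        lru = self.prev[self.mru]
        self.next[lru] = s
        self.prev[s] = lru
        self.next[s] = self.mru
        self.prev[self.mru] = s
        self.mru = s

    def get(self, key):
        s = self.index.get(key)
        if s is None:
            return None
        self._touch(s)
        return self.vals[s]

    def put(self, key, value):
        s = self.index.get(key)
        if s is not None:
            self.vals[s] = value
            self._touch(s)
            return
        # victim is the LRU slot (or a still-empty slot while filling up);
        # no pointer surgery needed - overwrite and rotate the MRU pointer
        victim = self.prev[self.mru]
        if self.keys[victim] is not None:
            del self.index[self.keys[victim]]
        self.keys[victim] = key
        self.vals[victim] = value
        self.index[key] = victim
        self.mru = victim
```

Note that a miss never touches the link arrays at all - insertion and eviction are both just "overwrite `prev[mru]` and rotate", which is the start/end-test-free logic I meant above.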