> trace caches

They don't anymore; they have uop caches. But trace caches are great, and Apple uses them [1].

They allow you to collapse taken branches into a single fetch.

This is extremely important, because the average number of instructions per taken branch is about 10-15 [2]. With a 10-wide frontend that has to stop fetching at each taken branch, every second fetch would be only half utilized or worse.
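To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a simplified frontend where a fetch group always ends at a taken branch and where alignment and cache-line boundaries are ignored. The function names and the model itself are my own illustration, not a description of any real core.

```python
from math import ceil


def fetch_utilization(run_length: int, width: int) -> float:
    """Fraction of fetch slots used when every fetch group must end at a
    taken branch (simplified model: no alignment or cache-line effects)."""
    fetches = ceil(run_length / width)
    return run_length / (fetches * width)


def slots_in_last_fetch(run_length: int, width: int) -> int:
    """Number of useful instructions in the final fetch of a run, i.e. the
    fetch that gets cut short by the taken branch."""
    full_fetches = ceil(run_length / width) - 1
    return run_length - full_fetches * width


if __name__ == "__main__":
    width = 10  # hypothetical 10-wide frontend
    for run_length in (10, 12, 15):
        print(run_length,
              fetch_utilization(run_length, width),
              slots_in_last_fetch(run_length, width))
```

With 15 instructions per taken branch, this model needs two fetches per run and the second one carries only 5 of 10 slots, so overall utilization is 75%. At 12 instructions per taken branch the second fetch carries just 2. A structure that can fetch across taken branches avoids wasting those slots, which is the appeal of a trace cache.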

> extra caches

This is one thing I don't understand: why not replace the L1I with the uop cache entirely?

I quite like what Ventana does with the Veyron V2/V3 [3,4]. They replaced the L1I with a macro-op trace cache, which can collapse taken branches, do basic instruction fusion, and do more advanced fusion for hot code paths.

[1] https://www.realworldtech.com/forum/?threadid=223220

[2] https://lists.riscv.org/g/tech-profiles/attachment/353/0/RIS... (page 10)

[3] https://www.ventanamicro.com/technology/risc-v-cpu-ip/

[4] https://youtu.be/EWgOVIvsZt8