They don't anymore; they have uop caches instead. But trace caches are great, and Apple uses them [1].
They allow you to collapse taken branches into a single fetch.
That is extremely important, because the average number of instructions per taken branch is about 10-15 [2]. With a 10-wide frontend that can't fetch past a taken branch, every second fetch would be only half utilized or worse.
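To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a fetch group always ends at a taken branch, that the branch target starts a fresh fetch group, and that every run between taken branches has the same length. The `fetch_stats` helper is made up for illustration, and real frontends are messier than this.

```python
import math

def fetch_stats(run_length, width=10):
    """Model one run of instructions that ends in a taken branch.

    Assumption: the frontend cannot fetch past a taken branch, so each
    run needs ceil(run_length / width) fetch groups of its own.
    Returns (fetches, overall slot utilization, fill of the last fetch).
    """
    fetches = math.ceil(run_length / width)
    utilization = run_length / (fetches * width)
    last_fetch_fill = (run_length - (fetches - 1) * width) / width
    return fetches, utilization, last_fetch_fill

if __name__ == "__main__":
    # 10-15 instructions per taken branch, the range quoted from [2]
    for n in range(10, 16):
        fetches, util, last = fetch_stats(n)
        print(f"{n} instrs/taken-branch: {fetches} fetches, "
              f"{util:.0%} overall, last fetch {last:.0%} full")
```

Under these assumptions, any run of 11-15 instructions needs two fetches, and the second one fills at most 5 of the 10 slots. A trace cache that stitches the branch target into the same fetch avoids that loss.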
> extra caches
This is one thing I don't understand: why not replace the L1I with the uop cache entirely?
I quite like what Ventana does with the Veyron V2/V3 [3,4].
They replaced the L1I with a macro-op trace cache, which can collapse taken branches, do basic instruction fusion, and do more advanced fusion for hot code paths.
[1] https://www.realworldtech.com/forum/?threadid=223220
[2] https://lists.riscv.org/g/tech-profiles/attachment/353/0/RIS... (page 10)
[3] https://www.ventanamicro.com/technology/risc-v-cpu-ip/
[4] https://youtu.be/EWgOVIvsZt8