I am terrified to think what AMD's predictor structure looks like if it's easier for them to do _this_ than it is to simply add privilege tags to their predictors. I don't personally buy this explanation anyway; trying to optimize retpolines in hardware would be an absolute pain in the ass and require an insane amount of synchronization with the backend, since retpolines always trash the RAS (return address stack).
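For context on that last point, this is the standard retpoline thunk sequence (the one published for GCC/LLVM), shown here as a rough file-scope GNU asm sketch; the numeric labels and comments are mine:

```c
/* Retpoline thunk for an indirect call through %rax. The "call" pushes the
 * capture-loop address onto both the stack and the RAS; the mov overwrites
 * the on-stack copy with the real target, so the ret architecturally jumps
 * to *%rax while the RAS still predicts a return into the capture loop.
 * The return prediction is guaranteed wrong by design, which is the
 * "trashing the RAS" mentioned above. */
__asm__(
    ".text\n"
    ".globl __x86_indirect_thunk_rax\n"
    "__x86_indirect_thunk_rax:\n"
    "    call 2f\n"            /* push capture-loop address onto stack + RAS */
    "1:  pause\n"              /* speculation from the stale RAS entry spins here */
    "    lfence\n"
    "    jmp 1b\n"
    "2:  mov %rax, (%rsp)\n"   /* overwrite return address with the real target */
    "    ret\n"                /* architecturally: jump to *%rax */
);
```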
I would guess it's probably physics. Specifically, the signal path routing for the predictor core must be heavily optimized, and is probably where AMD (and Intel) have invested heavily in advanced design software - their secret sauce - to push the chip right to the edge of what semiconductor physics can achieve.
The branch predictor is one of the most highly optimized pieces of the CPU core. Lots of discussion has been had about how the ARM architecture's frontend is simpler, which is part of why Apple's chips have way more execution units. Intel and AMD's latest designs have also expanded the number of execution units, but the frontend instruction decode and dispatch is the "serial" part of the process, reading the incoming instruction stream. And the x86 instruction set is hard to decode, with a lot of variation in the number of bytes per instruction. So just knowing that a branch is coming up is a "hard problem" for the instruction decoder, and only then can it predict which way the branch will go.
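To make the "variable length, serial decode" point concrete, here's a toy sketch (hypothetical helper name, a tiny hand-picked opcode subset, nothing like a real x86 decoder): you can't even find where instruction N+1 starts, let alone notice that it's a branch, until you've worked out the length of instruction N.

```c
/* Toy illustration of why x86 decode is serial: each instruction's length
 * (1 to 15 bytes on real hardware) must be decoded before the next one's
 * start address is known. Only a handful of opcodes are recognized here. */
#include <stdio.h>
#include <stddef.h>

static size_t insn_length(const unsigned char *p, int *is_branch) {
    *is_branch = 0;
    switch (p[0]) {
    case 0x90: return 1;                                  /* nop              */
    case 0xC3: *is_branch = 1; return 1;                  /* ret              */
    case 0xEB: *is_branch = 1; return 2;                  /* jmp rel8         */
    case 0xE8: *is_branch = 1; return 5;                  /* call rel32       */
    case 0x48:
        if (p[1] == 0x89) return 3;                       /* mov r64, r64     */
        if (p[1] == 0xB8) return 10;                      /* movabs rax, imm64*/
        return 0;
    case 0x0F:
        if (p[1] == 0x84) { *is_branch = 1; return 6; }   /* je rel32         */
        return 0;
    default:   return 0;                                  /* outside toy set  */
    }
}

int main(void) {
    /* A short, valid x86-64 byte stream; lengths range from 1 to 10 bytes. */
    const unsigned char code[] = {
        0x48, 0x89, 0xD8,                                   /* mov rax, rbx   */
        0x48, 0xB8, 0x01, 0x02, 0x03, 0x04,
                    0x05, 0x06, 0x07, 0x08,                 /* movabs rax,imm */
        0x90,                                               /* nop            */
        0xE8, 0x00, 0x00, 0x00, 0x00,                       /* call rel32     */
        0xC3,                                               /* ret            */
    };
    size_t off = 0;
    while (off < sizeof code) {
        int br;
        size_t len = insn_length(&code[off], &br);
        if (!len) break;                 /* byte we don't model: stop the toy */
        printf("offset %2zu: %2zu byte(s)%s\n", off, len,
               br ? "  <- branch" : "");
        off += len;   /* only now do we know where the next instruction starts */
    }
    return 0;
}
```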