Awesome, thank you for expanding. Now I see the intuition: branch prediction accuracy could be much higher if the predictor had knowledge of the last opcode, so this becomes a game of massaging the code to nudge the CPU into using more inputs to its branch prediction. The color on your empirical observation that branch prediction accuracy dominates other factors like switch indirection and register loads is also helpful.
There's one thing I'm still missing: how exactly do we force the CPU to use the last opcode in its branch prediction model? In your first switch example, the CPU "knows" the path it has followed to reach each iteration, so in theory it could use information about the last node it visited (or the last two nodes, etc.) to aid branch prediction, right?
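For concreteness, here's roughly the shape of dispatch loop I'm picturing for your first example (the names and opcodes are made up by me, just to anchor the question) -- a single switch, so every instruction funnels through one shared indirect branch:

```c
/* My mental model of the "single switch" interpreter loop (hypothetical
 * names/opcodes, not your actual code): every instruction goes through
 * the one indirect jump at the top of the loop. */
typedef struct { int op; int arg; } Inst;

enum { OP_PUSH, OP_ADD, OP_HALT };

int vm_run(const Inst *prog) {
    int stack[64], sp = 0;
    for (const Inst *ip = prog; ; ip++) {
        switch (ip->op) {              /* the single shared indirect branch */
        case OP_PUSH: stack[sp++] = ip->arg;                break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp];     break;
        case OP_HALT: return sp ? stack[sp - 1] : 0;
        }
    }
}
```

If the predictor really does fold in the history of which cases it took on previous iterations, then even this single jump could in principle be predicted using last-opcode information -- which is the part I'm unsure about.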
Related to that: in your second example, what exactly happens in `goto *curInst->handler;`? Doesn't this have to fall back to something like a switch statement, which has the same problem? (Unless you're doing polymorphism / dynamic dispatch in that example, which I assume carries some performance penalty that you're saying is dwarfed by the improved branch prediction.) Analogously, for the line in the OP's article that reads `MUSTTAIL return dispatch(UPB_PARSE_ARGS);` - doesn't the generic dispatch function need another switch statement?
Probably missing a couple obvious things here.
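And here's how I currently imagine the computed-goto version working (assuming the GCC/Clang labels-as-values extension; the instruction layout and names are my own sketch, not yours): the `goto *ip->handler;` is replicated at the end of every handler, so each opcode ends in its own indirect branch instead of everything sharing one switch.

```c
/* My sketch of the computed-goto / threaded-code version, assuming the
 * GCC/Clang labels-as-values extension. Names are hypothetical. The point
 * I'm trying to confirm: there is no switch at all -- each handler ends in
 * its own `goto *ip->handler;`, i.e. its own indirect branch. */
typedef struct { void *handler; int arg; } Inst;

/* opcodes[i]: 0 = PUSH, 1 = ADD, 2 = HALT (program must end with HALT) */
int vm_run_threaded(Inst *prog, const int *opcodes, int n) {
    void *handlers[] = { &&do_push, &&do_add, &&do_halt };
    int stack[64], sp = 0;

    /* "Thread" the code: pre-resolve each opcode to its handler label. */
    for (int i = 0; i < n; i++)
        prog[i].handler = handlers[opcodes[i]];

    Inst *ip = prog;
    goto *ip->handler;

do_push:
    stack[sp++] = ip->arg;
    ip++;
    goto *ip->handler;     /* PUSH's own indirect branch */
do_add:
    sp--; stack[sp - 1] += stack[sp];
    ip++;
    goto *ip->handler;     /* ADD's own indirect branch */
do_halt:
    return sp ? stack[sp - 1] : 0;
}
```

The MUSTTAIL line from the OP's article is where my mental model runs out: whether `dispatch` has to contain something switch-like, or avoids it the same way this does, is exactly what I'm asking.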
Lastly - if you have any book/article recommendations that helped you learn these intricacies (especially for building intuition about which performance quirks matter vs. which don't), that would be great as well. Thanks!