Another take on this: phi-3 small has 1100 ELO on LMSYS (ranked #52) while the confidence interval for Gemma 2 9B is [1170, 1200] ELO (ranked btw #15 and #25).
Phi is notorious for benchmark overfitting. It's good, but not as good as it looks on the charts. On the Lmsys leaderboard it places a whole 23 spots behind Llama-3-8B which it also claims to soundly beat on the above. So YMMV.