Page 7 of their technical report [0] has a better apples to apples comparison. Why they choose to show apples to oranges on their landing page is odd to me.
This isn't apples to apples - they're taking the optimal prompting technique for their own model, then using that technique for both models. They should be comparing it against the optimal prompting technique for GPT-4.
[0] https://storage.googleapis.com/deepmind-media/gemini/gemini_...