The comparisons I saw were manual, I think, so it makes sense that it can run a whole suite. These were just some basic prompts, and they showed the difference in how the produced output ran.
Pro tip: It's hard to trust Twitter for opinions on Grok. The thumb is very clearly on the scale. I have personally seen very few positive opinions of Grok outside of Twitter.
I agree with you, and I hate to say this, but I saw them on LinkedIn. One purportedly used the same prompts to make a "pacman like" game, and assuming the post is on the up and up, the results from Grok 3 were at least better looking than those from o3-mini-high.
I thought Grok 2 was pretty bad, but Grok 3 is actually quite good. I'm mostly impressed by the speed of answering. But Claude is still the king of code.