Just for fun, I created a new personal benchmark for vision-enabled LLMs: playing Minecraft. I used JSON structured output in LM Studio to build basic controls for the game. Unfortunately, no matter how hard I prompted, gemma-3-27b QAT isn't really able to understand simple Minecraft scenarios. It would say things like "I'm now looking at a stone block. I need to break it" while it was actually looking out at the horizon in the desert.
I've found the vision capabilities to be very bad at spatial awareness/reasoning. They seem to know that certain things are in the image, but not where they are relative to each other, their relative sizes, etc.
Here is the JSON schema: https://pastebin.com/SiEJ6LEz and the system prompt: https://pastebin.com/R68QkfQu
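For anyone who wants to try something similar, here is a minimal sketch of the kind of loop this implies: send a screenshot to LM Studio's OpenAI-compatible local server and constrain the reply with a JSON schema. This is not my actual code; the endpoint, model identifier, schema, and action names below are illustrative placeholders, not the ones in the pastebin links.

```python
# Minimal sketch (illustrative, not the actual benchmark code).
# Assumes LM Studio's OpenAI-compatible server is running on localhost:1234
# with a vision-capable model loaded.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Hypothetical action schema: one action per step plus a short observation.
ACTION_SCHEMA = {
    "name": "minecraft_action",
    "schema": {
        "type": "object",
        "properties": {
            "observation": {"type": "string"},
            "action": {
                "type": "string",
                "enum": ["move_forward", "turn_left", "turn_right",
                         "jump", "break_block", "idle"],
            },
            "duration_ms": {"type": "integer"},
        },
        "required": ["observation", "action", "duration_ms"],
    },
}


def next_action(screenshot_path: str) -> str:
    """Send one game frame and get back a JSON action matching ACTION_SCHEMA."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gemma-3-27b-it-qat",  # whatever identifier LM Studio shows
        messages=[
            {"role": "system",
             "content": "You control a Minecraft player. Reply with one action."},
            {"role": "user",
             "content": [
                 {"type": "text",
                  "text": "Here is the current frame. What should the player do next?"},
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
             ]},
        ],
        response_format={"type": "json_schema", "json_schema": ACTION_SCHEMA},
    )
    return response.choices[0].message.content  # JSON string per ACTION_SCHEMA


if __name__ == "__main__":
    print(next_action("frame.png"))
```

The enum on "action" is what keeps the model's output parseable by the control layer; the failure mode described above shows up in the "observation" field, not in malformed JSON.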