Just for fun, I created a new personal benchmark for vision-enabled LLMs: playing Minecraft. I used JSON structured output in LM Studio to build basic controls for the game. Unfortunately, no matter how hard I prompted, gemma-3-27b QAT isn't really able to understand simple Minecraft scenarios. It would say things like "I'm now looking at a stone block. I need to break it" while it was actually looking out at the horizon in the desert.
I've found the vision capabilities to be very bad at spatial awareness/reasoning. They seem to know that certain things are in the image, but not where they are relative to each other, their relative sizes, etc.
Here is the JSON schema: https://pastebin.com/SiEJ6LEz and the system prompt: https://pastebin.com/R68QkfQu
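For anyone who wants to try something similar, here is a minimal sketch of the kind of loop this implies: send a screenshot to LM Studio's OpenAI-compatible local server and constrain the reply with a JSON schema. This is not my actual code; the endpoint, model identifier, schema, and action names below are illustrative placeholders, not the ones in the pastebin links.

```python
# Minimal sketch (illustrative, not the actual benchmark code).
# Assumes LM Studio's OpenAI-compatible server is running on localhost:1234
# with a vision-capable model loaded.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Hypothetical action schema: one action per step plus a short observation.
ACTION_SCHEMA = {
    "name": "minecraft_action",
    "schema": {
        "type": "object",
        "properties": {
            "observation": {"type": "string"},
            "action": {
                "type": "string",
                "enum": ["move_forward", "turn_left", "turn_right",
                         "jump", "break_block", "idle"],
            },
            "duration_ms": {"type": "integer"},
        },
        "required": ["observation", "action", "duration_ms"],
    },
}


def next_action(screenshot_path: str) -> str:
    """Send one game frame and get back a JSON action matching ACTION_SCHEMA."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gemma-3-27b-it-qat",  # whatever identifier LM Studio shows
        messages=[
            {"role": "system",
             "content": "You control a Minecraft player. Reply with one action."},
            {"role": "user",
             "content": [
                 {"type": "text",
                  "text": "Here is the current frame. What should the player do next?"},
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
             ]},
        ],
        response_format={"type": "json_schema", "json_schema": ACTION_SCHEMA},
    )
    return response.choices[0].message.content  # JSON string per ACTION_SCHEMA


if __name__ == "__main__":
    print(next_action("frame.png"))
```

The enum on "action" is what keeps the model's output parseable by the control layer; the failure mode described above shows up in the "observation" field, not in malformed JSON.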