This appears to be a classic vision fail on the VLM's part, which is entirely unsurprising to anyone who has used open VLMs for anything except "benchmarks" in the past two god damn years. The field is in a truly embarrassing state: it prides itself on models that can solve equations off a blackboard, yet those same models can't even accurately read a d20 roll, among many other things. I've tried (and failed) for a long time to get VLMs to accurately caption images, and any time I check the output it is blindingly clear that these models are awful at actually _seeing things_.