Yeah, this is my number one complaint with all recent open source vision models, and it seems like it's only getting worse. They're verbose to the point of parody, making it extremely difficult to evaluate what they can actually _see_ versus what they're just dumbly markov-chaining from previous text tokens.
In GPT4V, you can prompt around this if you know about it, but none of the people collecting datasets for open models appear to know or care to apply that, so we just get this default GPT4V contamination everywhere.
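For anyone who hasn't tried it, here's roughly what I mean by "prompting around it". This is just a sketch with the OpenAI Python client; the model name and image URL are placeholders, and the system prompt wording is mine, not anything official:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask explicitly for a terse, literal caption instead of the default
# multi-paragraph "In this image we can see..." style output.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable model
    messages=[
        {
            "role": "system",
            "content": (
                "Describe the image in one short, literal sentence. "
                "No speculation, no filler, no mood or atmosphere."
            ),
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        },
    ],
    max_tokens=60,  # hard cap as a second guard against rambling
)

print(response.choices[0].message.content)
```

The `max_tokens` cap alone won't fix it (you just get a truncated ramble); it's the explicit instruction that does most of the work.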
The only vision model I enjoy is Google Gemini, simply because it will give you a no-nonsense caption. Of course it still hallucinates things that are not there, but getting a color or object wrong is orders of magnitude less bad than having 3 sentences that have nothing to do with the image.