Yes, but because it's trained on text and images are converted to tokens on the backend, it is absolutely dogshit at navigation and basic puzzles. It can't figure out what squirrels can: how to reach a goal in a maze.
The images are mapped into an embedding space with the same dimensionality as the token embeddings, and the model is trained on that space, so a joint representation of text and images is formed.
It’s not as though the image is converted to text tokens.
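A minimal sketch of that idea, assuming a generic setup (the dimensions, the linear projection, and the fake vision-encoder output are all placeholders for illustration, not any particular model's actual architecture):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB = 32_000
D_MODEL = 4096      # text token embedding dimension
D_VISION = 1024     # feature dimension of the vision encoder output
NUM_PATCHES = 256   # number of image patches the vision encoder emits

# Text side: the usual token embedding table.
text_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)

# Vision side: a learned projection from the vision encoder's feature
# space into the same space as the text token embeddings.
vision_proj = nn.Linear(D_VISION, D_MODEL)

def build_joint_sequence(patch_features, token_ids):
    """Concatenate projected image-patch embeddings with text token
    embeddings into one sequence. The model is trained on this joint
    sequence; the image is never turned into text tokens."""
    image_embeds = vision_proj(patch_features)   # (B, NUM_PATCHES, D_MODEL)
    token_embeds = text_embed(token_ids)         # (B, seq_len, D_MODEL)
    return torch.cat([image_embeds, token_embeds], dim=1)

# Stand-in for the output of a pretrained vision encoder (e.g. a ViT).
patch_features = torch.randn(1, NUM_PATCHES, D_VISION)
token_ids = torch.randint(0, TEXT_VOCAB, (1, 8))

joint = build_joint_sequence(patch_features, token_ids)
print(joint.shape)  # torch.Size([1, 264, 4096]): NUM_PATCHES + 8 positions
```

The point the sketch makes is that the image contributes continuous embedding vectors interleaved with the text embeddings, rather than being flattened into discrete text tokens first.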