Yes, but because it's trained on text and images are converted to tokens on the backend, it is absolutely dogshit at navigation and basic puzzles. It can't figure out what squirrels can: how to reach a goal in a maze.
The images are mapped into an embedding space with the same dimensionality as the token embeddings, and the model is trained on that space, so a joint representation of text and images is formed.
It’s not as though the image is converted to text tokens.
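A minimal sketch of that idea, assuming a generic setup (the dimensions, the linear projection, and the fake vision-encoder output are all placeholders for illustration, not any particular model's actual architecture):

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB = 32_000
D_MODEL = 4096      # text token embedding dimension
D_VISION = 1024     # feature dimension of the vision encoder output
NUM_PATCHES = 256   # number of image patches the vision encoder emits

# Text side: the usual token embedding table.
text_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)

# Vision side: a learned projection from the vision encoder's feature
# space into the same space as the text token embeddings.
vision_proj = nn.Linear(D_VISION, D_MODEL)

def build_joint_sequence(patch_features, token_ids):
    """Concatenate projected image-patch embeddings with text token
    embeddings into one sequence. The model is trained on this joint
    sequence; the image is never turned into text tokens."""
    image_embeds = vision_proj(patch_features)   # (B, NUM_PATCHES, D_MODEL)
    token_embeds = text_embed(token_ids)         # (B, seq_len, D_MODEL)
    return torch.cat([image_embeds, token_embeds], dim=1)

# Stand-in for the output of a pretrained vision encoder (e.g. a ViT).
patch_features = torch.randn(1, NUM_PATCHES, D_VISION)
token_ids = torch.randint(0, TEXT_VOCAB, (1, 8))

joint = build_joint_sequence(patch_features, token_ids)
print(joint.shape)  # torch.Size([1, 264, 4096]): NUM_PATCHES + 8 positions
```

The point the sketch makes is that the image contributes continuous embedding vectors interleaved with the text embeddings, rather than being flattened into discrete text tokens first.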