You can look at Google's recent PaLM-E model for a possible approach. They use a vision transformer to encode the image into embeddings, which are then projected into the language model's token-embedding space so they can be interleaved with ordinary text tokens; they also encode detected objects the same way, so the model can reason at a semantic level about individual entities in the scene. Either way, it's been shown that these massive LLMs can handle images in tokenised form if you treat them as if they were text. In Google's case, the model is trained around sentinel/placeholder tokens in the prompt (e.g. <img>) that mark where the image or object embeddings get injected.
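
To make the mechanism concrete, here's a minimal PyTorch sketch of the general idea (not PaLM-E's actual code): a vision encoder produces patch embeddings, a learned linear projection maps them into the LLM's token-embedding space, and they're spliced into the text embedding sequence wherever a placeholder token appears. The dimensions, the reserved `IMG_TOKEN_ID`, and the function names are all hypothetical.

```python
import torch
import torch.nn as nn

D_VISION, D_MODEL = 768, 4096          # assumed dims for the ViT and the LLM
IMG_TOKEN_ID = 32000                   # hypothetical id reserved for "<img>"

vision_proj = nn.Linear(D_VISION, D_MODEL)   # learned projection: image embeddings -> "soft tokens"
text_embed = nn.Embedding(32001, D_MODEL)    # the LLM's own token-embedding table

def build_input_embeddings(token_ids: torch.Tensor,
                           patch_embeds: torch.Tensor) -> torch.Tensor:
    """Replace each <img> placeholder with the projected image patch embeddings.

    token_ids:    (seq_len,) prompt token ids containing IMG_TOKEN_ID placeholders
    patch_embeds: (n_patches, D_VISION) output of a (typically frozen) vision transformer
    """
    img_tokens = vision_proj(patch_embeds)          # (n_patches, D_MODEL)
    pieces = []
    for tid in token_ids.tolist():
        if tid == IMG_TOKEN_ID:
            pieces.append(img_tokens)               # splice in the image "tokens"
        else:
            pieces.append(text_embed(torch.tensor([tid])))
    return torch.cat(pieces, dim=0)                 # fed to the LLM as input embeddings

# Example: "<img> What object is on the table?" with 196 ViT patch embeddings
prompt_ids = torch.tensor([IMG_TOKEN_ID, 101, 102, 103, 104, 105])
patches = torch.randn(196, D_VISION)
embeds = build_input_embeddings(prompt_ids, patches)   # (196 + 5, D_MODEL)
```

The key point is that the language model never sees pixels; it only sees a sequence of vectors in its own embedding space, some of which happen to come from a vision encoder rather than the text vocabulary.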