There are a few different approaches. Meta documents at least one approach quite well in one of their Llama papers.
The general gist is that you have some kind of adapter layers/model that can take an image and encode it into tokens. You then train the model on a dataset that has interleaved text and images. Could be webpages, where images occur in-between blocks of text, chat logs where people send text messages and images back and forth, etc.
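To make the "encode into tokens and interleave" idea concrete, here's a minimal sketch in plain Python. Everything in it is made up for illustration: `patchify`, `project`, `build_sequence`, the `boi`/`eoi` marker names, and the tiny hand-written weight matrix are my assumptions, not any particular model's design. It uses raw pixel patches plus a linear projection as a stand-in for the adapter; a real system would use a learned encoder.

```python
def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches, raster order.

    Assumes H and W are multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def project(vector, weights):
    """Linear projection into 'token space'; `weights` has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def build_sequence(segments, text_embeddings, weights, patch):
    """Interleave text embeddings and projected image patches into one sequence."""
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(("text", text_embeddings[token]) for token in payload)
        else:
            sequence.append(("boi", None))  # hypothetical begin-of-image marker
            sequence.extend(("image", project(p, weights))
                            for p in patchify(payload, patch))
            sequence.append(("eoi", None))  # hypothetical end-of-image marker
    return sequence


# Toy data: a 4x4 image, 2x2 patches, and a 2-dimensional "token space".
image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
weights = [[1, 0, 0, 0],   # output dim 0 copies the patch's top-left pixel
           [0, 0, 0, 1]]   # output dim 1 copies the patch's bottom-right pixel
text_embeddings = {"a": [0.1, 0.2], "cat": [0.3, 0.4]}

sequence = build_sequence([("text", ["a", "cat"]), ("image", image)],
                          text_embeddings, weights, patch=2)
kinds = [kind for kind, _ in sequence]
```

The point is just that, after this step, the transformer sees one flat sequence where image positions and text positions sit side by side.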
The LLM gets trained more-or-less like normal, predicting next token probabilities with minor adjustments for the image tokens depending on the exact architecture. Some approaches have the image generation be a separate "path" through the LLM, where a lot of weights are shared but some image token specific weights are activated. Some approaches do just next token prediction, others have the LLM predict the entire image at once.
As for encoding-decoding, some research has used things as simple as Stable Diffusion's VAE to encode the image, split up the output, and do a simple projection into token space. Others have used raw pixels. But I think the more common approach is to have a dedicated model trained at the same time that learns to encode and decode images to and from token space.
For the latter approach, this can be a simple model, or it can be a diffusion model. For encoding you do something like a ViT. For decoding you train a diffusion model conditioned on the tokens, throughout the training of the LLM.
For the diffusion approach, you'd usually do post-training on the diffusion decoder to shrink down the number of diffusion steps needed.
The real crux of these models is the dataset. Pretraining on the internet is not bad, since there's often good correlation between the text and the images. But there aren't really good instruction datasets for this. Like, "here's an image, draw it like a comic book" type stuff. Given OpenAI's approach in the past, they may have just brute-forced the dataset using lots of human workers. That seems to be the most likely approach anyway, since no public vision models are quite good enough to do extensive RL against.
And as for OpenAI's architecture here, we can only speculate. The "loading from the top down, starting from a blurry image" behavior is either a direct result of their architecture or a gimmick to slow down requests. If the former, it means they are able to get a low resolution version of the image quickly, and then slowly generate the higher resolution "in order." Since it's top-to-bottom, that implies token-by-token decoding. My _guess_ is that the LLM's image token predictions are only "good enough." So they have a small, quick decoder take those and generate a very low resolution base image. Then they run a stronger decoding model, likely a token-by-token diffusion model. It takes the image tokens and the low resolution image as conditioning, and diffuses the first patch of the image. Then it takes the same conditioning plus the decoded patch, and diffuses the next patch. And so forth.
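Here's a toy sketch of the loop I'm guessing at, just to show the control flow. To be loud about it: this is not OpenAI's method. `toy_denoise_patch` is a hypothetical placeholder that starts from noise and steps toward a hand-picked blend of the conditioning; a real diffusion decoder would be a trained network. Patches are single numbers here to keep it short.

```python
import random


def toy_denoise_patch(token, low_res_value, decoded, steps=8, seed=0):
    """Placeholder for a per-patch diffusion decoder (assumption, not a real model).

    Starts from Gaussian noise and moves halfway toward a conditioned target
    on every step, loosely imitating iterative denoising.
    """
    rng = random.Random(seed)
    previous = decoded[-1] if decoded else low_res_value
    # Made-up conditioning: blurry base value, image token, and the last decoded patch.
    target = 0.5 * low_res_value + 0.3 * token + 0.2 * previous
    x = rng.gauss(0.0, 1.0)
    for _ in range(steps):
        x += 0.5 * (target - x)
    return x


def decode_progressively(image_tokens, low_res, steps=8):
    """Yield one frame per decoded patch, in raster order.

    The canvas starts as the blurry base image, and each finished patch
    overwrites its slot, so the sharp region grows from the top down.
    """
    canvas = list(low_res)
    decoded = []
    for index, token in enumerate(image_tokens):
        patch = toy_denoise_patch(token, low_res[index], decoded, steps, seed=index)
        decoded.append(patch)
        canvas[index] = patch
        yield list(canvas)


frames = list(decode_progressively([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

Each frame is what a client could display at that point: finished patches up top, untouched blurry values below. Only one patch is being denoised at a time, which is where the memory saving would come from.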
A mixture of approaches like that allows the LLM to be truly multi-modal without the image tokens being too expensive, and the token-by-token diffusion approach helps offset memory cost of diffusing the whole image.
I don't recall if I've seen token-by-token diffusion in a published paper, but it's feasible and is the best guess I have given the information we can see.
EDIT: I should note, I've been "fooled" in the past by OpenAI's API. When o* models first came out, they all behaved as if the output were generated "all at once." There was no streaming, and in the chat client the response would just show up once reasoning was done. This led me to believe they were doing an approach where the reasoning model would generate a response and refine it as it reasoned. But that's clearly not the case, since they enabled streaming :P So take my guesses with a huge grain of salt.
They found it worked okay when the locations are picked randomly, but not as well in raster order (left to right, top to bottom). We tried it for music and found it was vulnerable to compounding error and lots of oddness relating to the fragility of continuous-space CFG.
There is a more recent approach to auto-regressive image generation.
Rather than predicting the next patch at the target resolution one by one, it predicts the next resolution. That is, the image at a small resolution followed by the image at a higher resolution and so on.
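A rough sketch of that coarse-to-fine loop, in plain Python. The scale schedule, the nearest-neighbour upsampling, and `toy_predict_residual` are all assumptions for illustration; in a real model the residual at each scale would come from a network conditioned on all the coarser scales, and I'm not claiming this matches any specific paper's details.

```python
def upsample_nearest(image, size):
    """Nearest-neighbour upsample of a square image (list of rows) to size x size."""
    source = len(image)
    return [[image[r * source // size][c * source // size] for c in range(size)]
            for r in range(size)]


def toy_predict_residual(history, size):
    """Hypothetical stand-in for the model: returns detail to add at this scale.

    A real model would look at `history` (all coarser images so far);
    this placeholder just returns a fixed checkerboard.
    """
    return [[0.1 * ((r + c) % 2) for c in range(size)] for r in range(size)]


def generate_coarse_to_fine(scales=(1, 2, 4, 8)):
    """One autoregressive step per resolution instead of one per patch."""
    image = [[0.0]]
    history = []
    for size in scales:
        base = upsample_nearest(image, size)
        residual = toy_predict_residual(history, size)
        image = [[base[r][c] + residual[r][c] for c in range(size)]
                 for r in range(size)]
        history.append(image)
    return history


history = generate_coarse_to_fine()
sizes = [len(scale_image) for scale_image in history]
```

With this schedule the 8x8 result takes 4 sequential steps, versus 64 if you predicted it one position at a time, and every step sees the whole (coarser) image rather than only what's above and to the left.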