There are a few different approaches. Meta documents at least one approach quite...

zaptrem · 2025-03-26T08:40:55 1742978455

Token-by-token diffusion was done by MAR https://arxiv.org/abs/2406.11838 and Fluid (scaled-up MAR) https://arxiv.org/abs/2410.13863

They found that generating tokens at randomly chosen locations worked okay, but generating them in raster order (left to right, top to bottom) didn't work as well. We tried it for music and found it was vulnerable to compounding error, plus lots of oddness relating to the fragility of CFG (classifier-free guidance) in continuous space.
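To make the two orderings concrete, here is a minimal sketch of just the position schedules, not MAR's or Fluid's actual code. The function names and the `seed` parameter are made up for illustration, and the model that would fill in each position is left out entirely.

```python
import random


def raster_order(height, width):
    """Grid positions visited left to right, top to bottom."""
    return [(row, col) for row in range(height) for col in range(width)]


def random_order(height, width, seed=0):
    """The same grid positions, visited in a shuffled order (hypothetical helper)."""
    positions = raster_order(height, width)
    random.Random(seed).shuffle(positions)
    return positions


if __name__ == "__main__":
    print(raster_order(2, 3))
    print(random_order(2, 3, seed=1))
```

Both functions cover every position exactly once; only the order in which a generator would visit them differs.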

famouswaffles · 2025-03-26T01:35:40 1742952940

There is a more recent approach to auto-regressive image generation. Rather than predicting patches at the target resolution one by one, it predicts the next resolution. That is, it generates the whole image at a small resolution, then the image at a higher resolution, and so on.

https://arxiv.org/abs/2404.02905
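A rough sketch of why this changes the step count, assuming a simple doubling schedule of token-map sizes. The schedule and the function names are illustrative assumptions, not taken from the paper, and the predictor itself is omitted.

```python
def scale_schedule(final_side, start_side=1):
    """Hypothetical coarse-to-fine schedule: side lengths double until final_side."""
    sides = []
    side = start_side
    while side < final_side:
        sides.append(side)
        side *= 2
    sides.append(final_side)
    return sides


def tokens_per_step(sides):
    """Each step predicts a whole side x side token map at once."""
    return [side * side for side in sides]


if __name__ == "__main__":
    sides = scale_schedule(16)
    print(sides)                    # side length of the token map at each step
    print(tokens_per_step(sides))   # tokens predicted together at each step
    print(len(sides), "steps vs", 16 * 16, "steps for one-token-at-a-time")
```

Under this assumed schedule, a 16x16 token map takes 5 autoregressive steps instead of 256, because each step emits an entire map conditioned on the coarser ones before it.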