Hacker News

Well, there's also gemini-2.0-flash-exp-image-generation. Also autoregressive/transfusion based.



Such a good name....


gemini-2.0-flash-exp-image-generation doesn’t perform as well as GPT-4o's image generation, as mentioned in section 5.1 of this paper: https://arxiv.org/pdf/2504.02782. However, based on my tests, the results are quite good for certain types of images, such as realistic recipe images. You can see some examples here: https://github.com/Yiling-J/tablepilot/tree/main/examples/10...


It's quite bad now, but I have no doubt that Google will catch up.

The AI field looks awfully like {OpenAI, Google, The Irrelevant}.


It's also good, but clearly still not close. Maybe Gemini 2.5 or 3 will have better image gen.


> transfusion based.

what is that?


It's a mix of the Transformer architecture and diffusion, shown to produce better results than simple autoregressive image-token generation alone: https://arxiv.org/html/2408.11039v1
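Roughly, the Transfusion recipe trains a single transformer on two losses at once: next-token cross-entropy over text tokens and a diffusion-style noise-prediction MSE over image latents, summed with a weighting coefficient. Here's a toy numpy sketch of that combined objective; the shapes, the random-projection stand-in for the model, and the λ value are all illustrative assumptions, not the paper's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Text positions: next-token cross-entropy over a toy vocabulary ---
VOCAB = 16
logits = rng.standard_normal((5, VOCAB))           # 5 text positions
targets = rng.integers(0, VOCAB, size=5)           # "ground-truth" next tokens

# log-softmax, then pick out the log-prob of each target token
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
lm_loss = -log_probs[np.arange(5), targets].mean()

# --- Image positions: diffusion noise-prediction MSE on patch latents ---
latents = rng.standard_normal((4, 8))              # 4 patches, dim 8
noise = rng.standard_normal(latents.shape)
t = 0.3                                            # toy noise level
noisy = np.sqrt(1 - t) * latents + np.sqrt(t) * noise

# stand-in for the transformer's noise prediction (just a random map here)
pred_noise = noisy @ rng.standard_normal((8, 8))
diff_loss = ((pred_noise - noise) ** 2).mean()

# --- Combined Transfusion-style objective: L = L_LM + lambda * L_diffusion ---
LAMBDA = 5.0                                       # illustrative weight
total_loss = lm_loss + LAMBDA * diff_loss
print(float(total_loss))
```

The point is just that both modalities flow through one model and one backward pass; in the real system the same transformer produces both the token logits and the noise predictions.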

Of course, nobody really knows what 4o image generation is under the hood, but to me it looks like some kind of hybrid system along the lines of Transfusion. It is much better at prompt adherence than diffusion models, but its output can be clunkier and stylistically incoherent. At times it also exhibits failure modes similar to diffusion's (such as weirdly rotated body parts).

Given how it behaves, I think Gemini 2.0 Flash image generation is probably the same approach but with a smaller parameter count. It's... eerie... how close together these two were released and how similar they appear to be.



