ChatGPT's prompt adherence is light years ahead of all the others. I won't even ...

thegeomaster · 2025-04-24T21:53:45 1745531625

Well, there's also gemini-2.0-flash-exp-image-generation. Also autoregressive/transfusion based.

thefourthchime · 2025-04-24T22:08:02 1745532482

Such a good name....

Yiling-J · 2025-04-25T01:44:06 1745545446

gemini-2.0-flash-exp-image-generation doesn’t perform as well as GPT-4o's image generation, as mentioned in section 5.1 of this paper: https://arxiv.org/pdf/2504.02782. However, based on my tests, for certain types of images, such as realistic recipe images, the results are quite good. You can see some examples here: https://github.com/Yiling-J/tablepilot/tree/main/examples/10...

raincole · 2025-04-25T04:43:08 1745556188

It's quite bad now, but I have no doubt that Google will catch up.

The AI field looks awfully like {OpenAI, Google, The Irrelevant}.

yousif_123123 · 2025-04-24T23:14:57 1745536497

It's also good, but clearly still not close. Maybe Gemini 2.5 or 3 will have better image gen.

swyx · 2025-04-25T03:23:33 1745551413

> transfusion based.

what is that?

thegeomaster · 2025-04-25T20:56:26 1745614586

It's a mix between the Transformer architecture and diffusion, shown to provide better output results than simple autoregressive image token generation alone: https://arxiv.org/html/2408.11039v1

Of course, nobody really knows what 4o image generation is under the hood, but to me it looks like some kind of hybrid system along the lines of Transfusion. It is much better at prompt adherence than diffusion models, but its output can be clunkier/stylistically incoherent. At times, it also exhibits failure modes similar to those of diffusion (such as weirdly rotated body parts).

Given how it behaves, I think Gemini 2.0 Flash image generation is probably the same approach but with a smaller parameter count. It's... eerie... how close together these two were released and how similar they appear to be.

echelon · 2025-04-25T01:53:35 1745546015

I'd go out on a limb and say that even your praise of gpt-image-1 is underselling its true potential. This model is as remarkable as when ChatGPT first entered the market. People are sleeping on its capabilities. It's a replacement for ComfyUI and potentially most of Adobe in time.

Now for the bad part: I don't think Black Forest Labs, Stability AI, Midjourney, or any of the others can compete with this. They probably don't have the money to train something this large and sophisticated. We might be stuck with OpenAI and Google (soon) for providing advanced multimodal image models.

Maybe we'll get lucky and one of the large Chinese tech companies will drop a model with this power. But I doubt it.

This might be the first OpenAI product with an extreme moat.

raincole · 2025-04-25T03:54:27 1745553267

> Now for the bad part: I don't think Black Forest Labs, Stability AI, Midjourney, or any of the others can compete with this.

Yeah. I'm a tad sad about it. I once thought the SD ecosystem proved that open source had won when it comes to image gen (a naive idea, I know). It turns out big corps won hard in this regard.

soared · 2025-04-24T21:14:06 1745529246

This is a take so outlandish it doesn’t seem credible.

stavros · 2025-04-24T21:50:36 1745531436

I can confirm: ChatGPT's prompt adherence is so incredibly good that it gets even really small details right, to a level that diffusion-based generators couldn't even dream of.

mediaman · 2025-04-24T21:33:57 1745530437

It is correct; the shift from diffusion to transformers makes a very, very big difference.

abhpro · 2025-04-25T00:47:52 1745542072

Also chiming in to say you're wrong; I mean, they're correct.

tacoooooooo · 2025-04-24T21:16:16 1745529376

It's 100% the correct take.

fkyoureadthedoc · 2025-04-24T21:21:48 1745529708

Yeah, this is my personal experience. The new image generation is the only reason I keep an OpenAI subscription rather than switching to Google.