I've added/tested this multimodal Gemini 2.0 to my shoot-out of SOTA image gen models (OpenAI 4o, Midjourney 7, Flux, etc.) which contains a collection of increasingly difficult prompts.
I don't know how much of Google's original Imagen 3.0 is incorporated into this new model, but unfortunately the overall aesthetic quality seems significantly worse.
The big "wins" are:
- The multimodal aspect, which keeps parity with OpenAI's offerings.
- An order of magnitude faster than OpenAI 4o image gen.
Excellent site! OpenAI 4o is more than mildly frightening in its capability to understand the prompt. What seems to be holding it back is mostly a tendency away from photo-realism (or even typical digital art styles) and its own safeguards.
Multimodal is the only image generation modality that matters going forward. Flux, HiDream, Stable Diffusion, and the like are going to be relegated to the past once multimodal becomes more common. Text-to-image sucks, and image-to-image with all the ControlNets and Comfy nodes is cumbersome in comparison to true multimodal instructiveness.
I hope that we get an open weights multimodal image gen model. I'm slightly concerned that if these things take tens to hundreds of millions of dollars to train, that only Google and OpenAI will provide them.
That said, the one weakness in multimodal models is that they don't let you structure the outputs yet. Multimodal + ControlNets would fix that, and that would be like literally painting with the mind.
The future, when these models are deeply refined and perfected, is going to be wild.
That's my hope: that Llama or Qwen bring multimodal image generation capabilities to open source so we're not left in the dark.
If that happens, then I'm sure we'll see slimmer multimodal models over the course of the next year or so. And that teams like Black Forest Labs will make more focused and performant multimodal variants.
We need the incredible instructability of multimodality, without question. But we also need to be able to fine-tune, use ControlNets to guide diffusion, and compose these into workflows.
I've played around with "create an image based on this image" chains quite a lot, and yep, everything goes brown with 4o. You append the images to each other as a filmstrip and it's almost like a gradient.
They also simplify over the generations (e.g., a basket full of stuff slowly loses the stuff), but I guess that's to be expected.
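The filmstrip trick above is easy to reproduce. Here's a minimal sketch using Pillow that concatenates a sequence of generated images side by side so color drift across generations shows up as a gradient; the function name and thumbnail height are my own choices, not from any tool mentioned here.

```python
from PIL import Image

def make_filmstrip(images, thumb_height=256):
    """Concatenate images left to right, scaled to a common height,
    so tonal drift across a generation chain is visible at a glance."""
    thumbs = []
    for img in images:
        w, h = img.size
        scale = thumb_height / h
        thumbs.append(img.resize((max(1, int(w * scale)), thumb_height)))
    # Allocate one wide canvas and paste each thumbnail in sequence.
    strip = Image.new("RGB", (sum(t.width for t in thumbs), thumb_height))
    x = 0
    for t in thumbs:
        strip.paste(t, (x, 0))
        x += t.width
    return strip
```

Feeding it the outputs of each "create an image based on this image" round (in order) makes the brown shift obvious without any per-pixel analysis.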
It's a bit expensive/slow, but for styled requests I let it do the base image, and once I'm happy with the composition I ask it to remake it as a photo or in whatever style is needed.
Your shoot-out site is very useful. Could I suggest adding prompts that expose common failure modes?
For example, asking the models to show clocks set to a specific time, or people drawing with their left hand. I think most, if not all, models will likely display every clock with the same time... and portray subjects drawing with their right hand.
Thanks for the suggestions. Most of the current prompts are a result of personal images that I wanted to generate - so I'll try to add some "classic GenAI failure modes". Musical instruments such as pianos also used to be a pretty big failure point as well.
For personal images I often play with woolly mammoths, and most models are incapable of generating anything but textbook images. Any deviation either becomes an elephant or an abomination (a bull- or bear-like monster).
Another I would suggest is buildings with specific unusual proportions and details (e.g. "the mansion's west wing is twice the height of the east wing and has only very wide windows"). I've yet to find a model that handles that kind of thing reliably; they seem to fall back on the vibes of whatever painting or book cover is vaguely similar to what's described.
Love this one so I've added it. The concept is very easy for most GenAI models to grasp, but it requires a strong overall cohesive understanding. Rather unbelievably, OpenAI 4o managed to produce a pass.
I should also add an image that is heavy with "greebles". GenAI usually lacks the fidelity for these kinds of minor details, so although it adds them, they tend to fall apart under more than a cursory examination.
For generating a new image, GPT 4o image gen is the best.
For editing an existing image while retaining parts of the original (such as adding text or objects to it), Gemini 2.0's image gen model is the best (GPT 4o always changes the original image no matter what).
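The "always changes the original" claim is easy to quantify. Here's a hypothetical sketch using Pillow that reports what fraction of pixels an editing model touched; the function name and tolerance value are my own assumptions, not part of either vendor's tooling.

```python
from PIL import Image, ImageChops

def changed_fraction(original, edited, tolerance=8):
    """Fraction of pixels that differ by more than `tolerance`
    in any RGB channel between the original and the edited image."""
    a = original.convert("RGB")
    b = edited.convert("RGB").resize(a.size)  # normalize size before diffing
    diff = ImageChops.difference(a, b)
    changed = sum(1 for px in diff.getdata() if max(px) > tolerance)
    return changed / (a.width * a.height)
```

An editor that truly preserves untouched regions should score close to the area of the requested edit; a model that regenerates the whole frame will score near 1.0 even for a tiny change.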
Your site is really useful, thanks for sharing. One issue is that the list of examples sticks to the top and covers more than half of the screen on mobile, could you add a way to hide it?
If you're looking for other suggestions a summary table showing which models are ahead would be great.
Great point - when I started building it I think I only had about four test cases, but now the nav bar is eating 50% of the vertical display so I've removed it from mobile display!
Regarding the summary table, did you have a different metric in mind? The top of the display should already show a "Model Performance" chart with OpenAI 4o and Google Imagen 3 leading the pack.
Sure - I was using "hidream-i1-dev", but if you're seeing better results I might rerun the HiDream tests with the "hidream-i1-full" model.
I've been thinking about possibly rerunning the Flux Dev prompts using 1.1 Pro, but I liked having a base reference for images that can be generated on consumer hardware.
https://genai-showdown.specr.net