I've added/tested this multimodal Gemini 2.0 to my shoot-out of SOTA image gen models (OpenAI 4o, Midjourney 7, Flux, etc.) which contains a collection of increasingly difficult prompts.
I don't know how much of Google's original Imagen 3.0 is incorporated into this new model, but unfortunately the overall aesthetic quality seems significantly worse.
The big "wins" are:
- The multimodal aspect, which keeps parity with OpenAI's offerings.
- An order of magnitude faster than OpenAI 4o image gen.
Excellent site! OpenAI 4o is more than mildly frightening in its capability to understand the prompt. What seems to be holding it back is mostly a tendency away from photo-realism (or even typical digital art styles) and its own safeguards.
Multimodal is the only image generation modality that matters going forward. Flux, HiDream, Stable Diffusion, and the like are going to be relegated to the past once multimodal becomes more common. Text-to-image sucks, and image-to-image with all the ControlNets and Comfy nodes is cumbersome in comparison to true multimodal instructiveness.
I hope that we get an open weights multimodal image gen model. I'm slightly concerned that if these things take tens to hundreds of millions of dollars to train, that only Google and OpenAI will provide them.
That said, the one weakness in multimodal models is that they don't let you structure the outputs yet. Multimodal + ControlNets would fix that, and that would be like literally painting with the mind.
The future, when these models are deeply refined and perfected, is going to be wild.
That's my hope: that Llama or Qwen bring multimodal image generation capabilities to open source so we're not left in the dark.
If that happens, then I'm sure we'll see slimmer multimodal models over the course of the next year or so. And that teams like Black Forest Labs will make more focused and performant multimodal variants.
We need the incredible instructability of multimodality, without question. But we also need to be able to fine-tune, use ControlNets to guide diffusion, and compose these into workflows.
I've played around with "create an image based on this image" chains quite a lot, and yep, everything goes brown with 4o. You append the images to each other as a filmstrip and it's almost like a gradient.
They also simplify over the generations (e.g., a basket full of stuff slowly loses the stuff), but I guess that's to be expected.
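The filmstrip trick above is easy to reproduce. Here's a minimal sketch using Pillow that concatenates a sequence of generated images side by side so color drift across generations shows up as a gradient; the function name and thumbnail height are my own choices, not from any tool mentioned here.

```python
from PIL import Image

def make_filmstrip(images, thumb_height=256):
    """Concatenate images left to right, scaled to a common height,
    so tonal drift across a generation chain is visible at a glance."""
    thumbs = []
    for img in images:
        w, h = img.size
        scale = thumb_height / h
        thumbs.append(img.resize((max(1, int(w * scale)), thumb_height)))
    # Allocate one wide canvas and paste each thumbnail in sequence.
    strip = Image.new("RGB", (sum(t.width for t in thumbs), thumb_height))
    x = 0
    for t in thumbs:
        strip.paste(t, (x, 0))
        x += t.width
    return strip
```

Feeding it the outputs of each "create an image based on this image" round (in order) makes the brown shift obvious without any per-pixel analysis.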
It's a bit expensive/slow, but for styled requests I let it do the base image, and once I'm happy with the composition I ask it to remake it as a photo or in whatever style is needed.
Your shoot-out site is very useful. Could I suggest adding prompts that expose common failure modes?
For example, asking the models to show clocks set to a specific time, or people drawing with their left hand. I think most, if not all, models will likely display every clock with the same time... and portray subjects drawing with their right hand.
Thanks for the suggestions. Most of the current prompts are a result of personal images that I wanted to generate - so I'll try to add some "classic GenAI failure modes". Musical instruments such as pianos also used to be a pretty big failure point as well.
For personal images I often play with woolly mammoths, and most models are incapable of generating anything but textbook images. Any deviation either becomes an elephant or an abomination (a bull- or bear-like monster).
Another I would suggest is buildings with specific unusual proportions and details (e.g. "the mansion's west wing is twice the height of the east wing and has only very wide windows"). I've yet to find a model that handles that kind of thing reliably; they seem to fall back on the vibes of whatever painting or book cover is vaguely similar to what's described.
Love this one so I've added it. The concept is very easy for most GenAI models to grasp, but it requires a strong overall cohesive understanding. Rather unbelievably, OpenAI 4o managed to produce a pass.
I should also add an image that is heavy with "greebles". GenAI usually lacks the fidelity for these kinds of minor details, so although it adds them, they tend to fall apart under more than a cursory examination.
For generating a new image, GPT 4o image gen is the best.
For editing an existing image while retaining parts of the original (such as adding text or objects to it), Gemini 2.0's image gen model is the best (GPT 4o always changes the original image no matter what).
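The "always changes the original" claim is easy to quantify. Here's a hypothetical sketch using Pillow that reports what fraction of pixels an editing model touched; the function name and tolerance value are my own assumptions, not part of either vendor's tooling.

```python
from PIL import Image, ImageChops

def changed_fraction(original, edited, tolerance=8):
    """Fraction of pixels that differ by more than `tolerance`
    in any RGB channel between the original and the edited image."""
    a = original.convert("RGB")
    b = edited.convert("RGB").resize(a.size)  # normalize size before diffing
    diff = ImageChops.difference(a, b)
    changed = sum(1 for px in diff.getdata() if max(px) > tolerance)
    return changed / (a.width * a.height)
```

An editor that truly preserves untouched regions should score close to the area of the requested edit; a model that regenerates the whole frame will score near 1.0 even for a tiny change.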
Your site is really useful, thanks for sharing. One issue is that the list of examples sticks to the top and covers more than half of the screen on mobile, could you add a way to hide it?
If you're looking for other suggestions a summary table showing which models are ahead would be great.
Great point - when I started building it I think I only had about four test cases, but now the nav bar is eating 50% of the vertical display so I've removed it from mobile display!
Regarding the summary table, did you have a different metric in mind? The top of the display should already show a "Model Performance" chart with OpenAI 4o and Google Imagen 3 leading the pack.
Sure - I was using "hidream-i1-dev", but if you're seeing better results I might rerun the HiDream tests with the "hidream-i1-full" model.
I've been thinking about possibly rerunning the Flux Dev prompts using 1.1 Pro, but I liked having a base reference for images that can be generated on consumer hardware.
https://genai-showdown.specr.net