Yes, because models like Midjourney are tiny compared to LLMs like GPT. I'm pretty sure there was a good Hacker News discussion on this recently, but with all the AI talk I can't find it. But really, we need a lot less information to make a reasonable city than the amount of information we need to make billboards and signs make sense. I don't think Midjourney wants to pay 10+ million dollars to have their model trained.
> than the amount of information we need to make billboards and signs make sense.
By extension, this applies to posters, letters, newspapers, and other types of text-heavy images, ultimately reducing the language modeling problem to an image generation problem.