Try looking at things from my angle: a few errors in an image are usually not a big deal (with modern tools, the mistakes fall within human margin of error on average), but errors in the delivery of textual data such as facts, dates, or code can be far more severe and subtle. There are ways to work around or reduce the shortcomings of image generation, and the quirks you mention have drop-in solutions for local installs, but you can't automatically fix wrong facts in text, or the context going off the rails. Those errors can also be much harder to catch than a hand missing a finger.
It's also worth mentioning that you can run a heavily customized Stable Diffusion setup at home on fairly modest hardware with satisfactory results if you know what you're doing, whereas any LLM you can run at home on that same hardware is dog slow and, honestly, kind of terrible.
I think this is still just a difference in how the output is used. You're presenting text generation as factual and image generation as artistic. It could just as easily be reversed: no one will care if a fantasy story gets some in-milieu "facts" wrong, but a blueprint or architectural reference coming out of Stable Diffusion could ruin someone's year.