That makes sense, but that would imply that there's a limit, right? Once the model outputs the optimal, pixel-perfect image, what does increasing the model size do? And who decides, and how: "yes, this one looks more like a Picasso than that one", or "this one indeed looks more energetic", or "this image does make me sadder than that one"? How do you benchmark this?
Yes, you're on the right track. Once you get really close to a perfect score on your benchmark you can no longer improve, so you need to develop a better benchmark with more headroom. And you have the right idea of how benchmarking subjective quality works: a bunch of humans score model outputs, producing output-score pairs, and the model is judged against those. To train an AI you need a measurable goal, and in this case the measure is "humans like it."
If you are noticing that this seems to fundamentally limit model performance on certain tasks to aggregate human capability, you are noticing correctly.
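To make that concrete, here's a minimal sketch of how "humans like it" becomes a number. A common setup is pairwise preference: raters compare two images generated from the same prompt and pick the one they prefer, and the benchmark score is each model's win rate. All the names and data below are hypothetical, just to show the shape of it:

```python
from collections import Counter

# Hypothetical human preference data: each record is one rater's pick
# between "model_a" and "model_b" on the same prompt.
human_votes = [
    ("a corgi wearing a top hat", "model_b"),
    ("a corgi wearing a top hat", "model_b"),
    ("an oil painting of a storm at sea", "model_a"),
    ("an oil painting of a storm at sea", "model_b"),
]

def win_rate(votes, model):
    """Fraction of head-to-head comparisons this model won."""
    wins = Counter(preferred for _, preferred in votes)
    return wins[model] / len(votes)

print(f"model_b win rate: {win_rate(human_votes, 'model_b'):.0%}")  # 75%
```

Notice the ceiling is baked in: the score can never exceed what the pool of raters collectively prefers, which is exactly the aggregate-human-capability limit.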
To give you some idea of what these benchmarks look like, here's the prompt list from DrawBench, which Google created as part of evaluating their Imagen model.
Also, after a point the differences will depend more on the specific individual viewing the image than on what the AI can generate, so the AI would have to optimize its output per individual, which would require a deep understanding of each viewer.