Many claims don't stand up to scrutiny, and some look suspiciously like training to the test.
The Apple study was clear about this. LLMs and their related multimodal models lack the ability to abstract information from noisy text inputs.
This is really obvious if you play with any of the art generators. For example - the understanding of basic prepositions just isn't there. You can't say "Put this thing behind/over/in front of this other thing" and get the result you want with any consistency.
If you create a composition you like and ask for it in a different colour, you get a different image.
There is no abstracted concept of a "colour" in there. There's just a lot of imagery tagged with each colour name, and if you select a different colour you get a vector in a space pointing to different images.
Text has exactly the same problem, but it's less obvious because the grammar is usually - not always - perfect and the output has been tuned to sound authoritative.
There is not enough information in text as a medium to handle more than a small subset of problems with any consistency.
> There is no abstracted concept of a "colour" in there. There's just a lot of imagery tagged with each colour name, and if you select a different colour you get a vector in a space pointing to different images.
It has been observed in LLMs that the distance between embeddings for colors follows the same similarity patterns that humans experience - colors that appear similar to humans, like red and orange, are closer together in the embedding space than colors that appear very different, like red and blue.
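As a quick illustration of that claim, here's a minimal sketch (the model name and color list are arbitrary choices of mine, not from any particular study) using the sentence-transformers library to compare cosine similarities between embeddings of color words:

```python
# Sketch: do embedding distances between color words track human similarity?
# Assumes the sentence-transformers package; the model choice is arbitrary.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

colors = ["red", "orange", "blue"]
emb = model.encode(colors)  # shape: (3, embedding_dim)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("red vs orange:", cosine(emb[0], emb[1]))  # expected: higher
print("red vs blue:  ", cosine(emb[0], emb[2]))  # expected: lower
```

If the claim holds, red/orange should score noticeably higher than red/blue; exact numbers will vary by model.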
While some argue these models 'just extract statistics,' if the end result matches how we use concepts, what's the difference?
Part of this is that the art generators tend to use CLIP, which is not a particularly good text model - often only slightly better than a bag of words - so many interactions and relationships are difficult to represent. Some of the newer ones have better text frontends, which improves the situation.
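You can probe this directly. Here's a hedged sketch using the HuggingFace transformers CLIP wrappers (the prompt pair is my own illustrative choice): it checks how similarly CLIP's text encoder scores two prompts that contain the same words but reverse the spatial relation.

```python
# Sketch: probe CLIP's text encoder for word-order / relation sensitivity.
# Uses the transformers CLIP API; prompt pair is an illustrative choice.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a red cube on top of a blue sphere",
    "a blue sphere on top of a red cube",  # same words, relation reversed
]
inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    feats = model.get_text_features(**inputs)

feats = feats / feats.norm(dim=-1, keepdim=True)
sim = (feats[0] @ feats[1]).item()
print(f"cosine similarity: {sim:.3f}")  # close to 1.0 => near bag-of-words
```

A similarity close to 1.0 would be consistent with the bag-of-words criticism: the encoder barely distinguishes which object is on top.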
I think color is fairly well abstracted, but most image generators are not good for edits, because the generator more or less starts from scratch, with a new random seed each time. Even if the seed is fixed, the early stages of generation, where things like the rough image composition form, tend to be quite chaotic and so sensitive to small changes in the prompt. There are tools that can make far more controlled adjustments to an image, but they tend to be a bit less user-friendly.
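Here's a sketch of what I mean, using the diffusers library (the checkpoint and prompts are arbitrary choices): even with a pinned seed, so both runs start from identical noise, a one-word prompt change can shift the whole composition.

```python
# Sketch: fixed seed, small prompt change -> often a different composition.
# Assumes the diffusers package and an arbitrary Stable Diffusion checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate(prompt, seed=42):
    # Re-create the generator each call so both runs begin from the same noise.
    g = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(prompt, generator=g).images[0]

generate("a cat wearing red boots").save("red.png")
generate("a cat wearing brown boots").save("brown.png")
# Despite identical initial noise, the two outputs often differ in layout,
# not just boot color - the early denoising steps are chaotic.
```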
> I think color is fairly well abstracted, but most image generators are not good for edits, because the generator more or less starts from scratch
It’s unlikely that the models have been trained on “similarity”. Ask it to swap red boots for brown boots and it will happily generate an entirely different image because it was never trained on the concept of images being similar.
That doesn’t mean it’s impossible to train an LLM on the concept of similarity.
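In fact, models trained on paired before/after edits already exist. A minimal sketch with InstructPix2Pix via diffusers (the checkpoint is real; the input image path is a placeholder of mine):

```python
# Sketch: an edit-trained model applies a targeted change while keeping
# the rest of the image similar. The input image path is a placeholder.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("cat_with_red_boots.png").convert("RGB")
edited = pipe(
    "swap the red boots for brown boots",
    image=image,
    image_guidance_scale=1.5,  # higher = stay closer to the input image
).images[0]
edited.save("cat_with_brown_boots.png")
```

Because it conditions on the input image rather than regenerating from pure noise, it was effectively trained on exactly the "similar images" concept in question.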
I just asked Midjourney to do precisely that, and it swapped the boots with no issue, although it didn't seem to quite understand what it meant for a cat to _wear_ boots.