
It has no "understanding" of a cat. It's an associative store with soft edges that pulls out compressed cat representations when given the noun "cat". The key store includes nouns, verbs, adverbs, adjectives, and style abstractions, and there are mappings into the store that link all of those.

But those mappings are very limited, and if you prompt with a relationship that isn't defined you get a best guess, which will either be quite distant or contaminated with other values.

If you ask Dall-E for "a woman made of birds" you get a composite that also includes trees and/or leaves. Dall-E has values for "made of" and "birds" but its value representation for "birds" is contaminated with contextual trees and branches.

Leonardo doesn't have a value for "made of", so you get a woman surrounded by bird-like blobs.
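
To make the "soft edges" and contamination points concrete, here's a toy sketch in Python. It is not Dall-E's or Leonardo's actual internals, just an assumed vector-based associative store where lookup is a similarity-weighted blend, so a query for "birds" returns a mix of everything stored near it, trees and branches included, and a missing key like "made of" falls back to whatever its nearest neighbours happen to be:

    import numpy as np

    def soft_lookup(query, keys, values, temperature=0.1):
        # cosine similarity between the query vector and every stored key
        sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-9)
        # softmax turns similarities into blend weights (the "soft edges")
        weights = np.exp(sims / temperature)
        weights /= weights.sum()
        # the result is a weighted mix of stored values, never one clean entry,
        # so concepts that co-occur with "birds" (trees, branches) leak in
        return weights @ values

With a very low temperature this behaves like a hard dictionary lookup; raise it and neighbouring entries bleed into the result, which is the contamination described above.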

To understand a cat in a human sense, the store would have to include the shape, the movement dynamics in all possible situations, the textures, and a set of defining behaviours that is as complete as possible. It would also have to be able to provide and remember an object-constant instantiation of a specific cat that is clean of contamination.

Sora is maybe 10% of the way there. One of the examples doing the rounds shows some puppies playing in snow. It looks impressive until you realise the puppies are zombies: they have none of the expressions or emotions of a real puppy.

None of this is impossible, but training time, storage, and power consumption all explode the more information you try to include.



I don't see what's so problematic about it. I doubt the model is actually confusing trees and branches with birds. It has associations, but humans do too. If I asked a human to draw a demon, would the background be an office?

The same goes for the complaint about 'made of' not being in the training data. Humans who have never seen a bird cannot draw a bird. Why does that say anything about the model?

I'm not saying that diffusion models act like humans, and I was talking specifically about image generation. My use of the word 'understanding' is in the context of the image-generation task. I'm not even talking about 'made of' or 'birds', just 'cats' and 'hats'. If it can understand one thing, it can understand others, but they are not always in the training data.

This is all a non-problem. It kind of reminds me of the discussion of what constitutes a 'male' or a 'female'. All I want is to refer to this one property that I observe in diffusion models, which is what language is: reference. If you are so protective of the word 'understand', then provide an alternative term for this property and I will gladly use it.

https://imgur.com/a/fuS8kcf


> It's an associative store with soft edges that pulls out compressed cat representations when given the noun "cat".

And how do you know that this is not what "understanding" is? To me, understanding the concept of a cat is exactly to immediately recall (or have ready) all the associations, the possibilities, the consequences of the "cat" concept. If you can make up correct sentences about cats and conduct a reasonable conversation about cats, it means that you understand cats.


> If you can make up correct sentences about cats and conduct a reasonable conversation about cats, it means that you understand cats.

No, plenty of humans can have reasonable conversations about things with zero understanding of them. We know they don't understand because, when put in a practical situation, they fail to apply the things they talked about. Understanding means you can apply what you know, not only talk about it.


That's interesting. Can you give me some examples of such cases?


If you don't believe that's true, how do you explain programmer recruitment? If discussing something cogently demonstrated real understanding, any 10-minute conversation would be enough to make a hiring decision.



