
I feel LeCun got roped into debating the likes of Marcus and Yudkowsky. This has made his arguments lose nuance and become rigid. I also can't escape the feeling that if Facebook had been tuned in to Transformers, they would have shipped earlier, so there must have been some resistance or underestimation that is now being repeated: "They can't reason", "They can't plan", "They can't understand the world", "They are a distraction / side road to AGI".

It is kind of ironic that researchers who claim LLMs lack adaptive intelligence seemingly refuse to adapt their intelligence to LLMs. If even GPT-3 can find logical holes or oversimplifications in your arguments about GPTs, at some point this starts becoming embarrassing and unbecoming.

> The generation of mostly realistic-looking videos from prompts does not indicate that a system understands the physical world.

While arguably true, it also does not indicate that a system does not understand the physical world (reflections, collision detection, gravity, object permanence, long-term scene coherence, etc.).

If LeCun wants to argue that it does not understand the physical world, he should do so directly, not attack a claim that is not directly stated but rather convincingly, if tentatively, demo'd. (I myself find it hard to argue that a system which generates novel pond reflections has not memorized/stored in its weights some generalization program that it applies to realistic scene generation.)

This demo shows it is not even a wild prediction to guess that we will soon be able to discuss visual scenes with conversational AIs in consumer tech.



> long-term scene coherence,

FWIW, none of the video models released so far demonstrate any object coherence whatsoever, which suggests they don't have the higher-level capabilities you mention yet.

In Sora, as soon as an object is obstructed by an obstacle or goes offscreen, it's likely to disappear or be radically transformed.


You've seen the demos of a couple holding hands and walking, or the museum shots where all the paintings maintain coherence, or a woman temporarily obscuring a street sign. Or you haven't seen those demos. Either way...


In the couple-holding-hands video, the people walking in front duck into a wall and disappear, the girl in front walks into the fence and disappears, and another girl walks straight through that fence. These aren't just issues with forgetting; it completely doesn't understand how things work. It draws a fence but then doesn't understand that people can't walk through it.

Or the video with the dog: the dog phases straight through those window shutters as if they weren't there and the scene were rendered in layers rather than in 3D. It doesn't understand the scenes it draws at all. It put shadows on those shutters, so they were drawn to have depth, but the dog was then rendered on top of the shutters anyway and moved straight through them. You can even see their shadows overlap, since the shadow pass is apparently handled separately, so it "knows" they overlap but still renders the dog on top. That tells me it doesn't really know any of that at all and is just guessing based on similar-looking data samples.

And this is in videos handpicked because they were especially good. We should expect the videos we are able to generate to be far worse than the demos in general. They didn't even manage to make a dog that moves between windows without such bugs; that was the best they got, and even that had a very egregious error in a very short clip.


The primary thing I noticed is that it doesn't quite seem to grasp that time runs forward either. In the video with the dogs in the snow you can see snow getting kicked up in reverse, i.e. snow arcs through the air and lands right as a paw gets placed.

Kind of made me wonder how these videos would look run backwards, but not enough to figure out how to make them run backwards.

EDIT: wow, the "backwards" physics is especially noticeable in the chair video[0]. Aside from the chair morphing wildly, notice how it floats and bounces around semi-physically. Clearly some issues grasping cause and effect.

[0] https://www.youtube.com/watch?v=lfbImB0_rKY
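
If anyone does want to try it, reversing a downloaded clip is easy with ffmpeg. A minimal sketch in Python, assuming ffmpeg is installed locally and the clip is short; filenames here are placeholders:

  import subprocess

  # Reverse a short local clip so it plays backwards. Assumes ffmpeg is on
  # the PATH and the clip is small enough to buffer in memory.
  subprocess.run([
      "ffmpeg", "-i", "sora_clip.mp4",
      "-vf", "reverse",    # reverse the video stream
      "-af", "areverse",   # reverse the audio too; drop this if the clip has no audio track
      "reversed.mp4",
  ], check=True)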


If the "couple holding hands and walking" one is the "Beautiful, snowy Tokyo city is bustling. ..." look at the traffic on the left side of the frame:

https://www.youtube.com/watch?v=ezaMd4l_5kw

We also have the spontaneous creation and annihilation of wolves and the shape-shifting chair:

https://www.youtube.com/watch?v=jspYKxFY7Sc

https://www.youtube.com/watch?v=lfbImB0_rKY


That chair is fucking wild.

The more I watch the cherry blossom one, the more I see how wrong it is; even the fact that there are cherry blossoms in the middle of winter is just totally wack. I've seen it snow in Tokyo before during spring when the cherry blossoms were out, but you don't get a foot of snow on the roofs like in the clip.

Edit: I know the prompt asked for the cherry blossoms in snow, but it's still a wild amount of snow, which is somehow not covering the trees.


If we are talking analogies, this is just Sora forgetting because of limitations in how the network handles the autoregressive dynamics. When they make a bigger version of Sora, this will happen less. Sora already has unprecedented object permanence; see the woman-walking-in-Tokyo scene, where signs and people are reconstructed after two seconds of occlusion. Soon we will have object permanence following ten or more seconds of occlusion. Then a minute. Then three minutes. Then we will figure out a trick to store long-term memory. What will people say then?
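
As a toy illustration of the kind of forgetting I mean (just a sketch, not Sora's actual architecture): a generator that conditions on only a fixed window of recent frames cannot bring an object back once it has been occluded for longer than that window.

  # Toy sketch, not Sora's actual architecture: with a fixed context window,
  # an object occluded for longer than the window can never reappear.
  WINDOW = 4  # hypothetical context length, in frames

  def next_frame(history):
      context = history[-WINDOW:]  # all the "model" can condition on
      return "ball" if "ball" in context else "background"

  frames = ["ball"]
  for t in range(1, 12):
      occluded = 2 <= t <= 7       # occlusion lasts 6 frames, longer than WINDOW
      frames.append("occluder" if occluded else next_frame(frames))

  print(frames)  # once the occlusion outlasts the window, the ball never comes back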


Our brains can't do that with long-term memory either, btw; that's why each time we remember something, we change minor aspects of it.


It's also what happens when we dream: everything is fluid. Things appear and disappear, people and places become someone or somewhere else, reading is difficult, and hands are distorted.


Because these systems are dreaming about their datasets. Or hallucinating about them, as people have decided to call it lately. I won't say this is a dead end. I will say we are very, very far short of any sort of actual intelligence.



