
I find Gary's arguments increasingly semantic and unconvincing. He lists several examples of how LLMs "fail to build a world model", but his definition of "world model" is an informal hand-wave ("a computational framework that a system (a machine, or a person or other animal) uses to track what is happening in the world"). His examples are lifted from a variety of unclear or obsolete models - what is his opinion of O3? Why doesn't he create or propose a benchmark that researchers could use to measure progress of "world model creation"?

What's more, his actual point is unclear. Even if you simply grant, "okay, even SOTA LLMs don't have world models", why do I as a user of these models care? Because the models could be wrong? Yes, I'm aware. Nevertheless, I'm still deriving substantial personal and professional value from the models as they stand today.



I think the point is that category errors or misinterpreting what a tool does can be dangerous.

Both statistical data generators and actual reasoning are useful in many circumstances, but there are also circumstances in which thinking that you are doing the latter when you are only doing the former can have severe consequences (example: building a bridge).

If nothing else, his perspective is a counterbalance to what is clearly an extreme hype machine that is doing its utmost to force adoption through overpromising, false advertising, etc. These are bad things even if the tech does actually have some useful applications.

As for benchmarks, if you fundamentally don't believe that stochastic data generation leads to reason as an emergent property, developing a benchmark is pointless. Also, not everyone has to be on the same side. It's clear that Marcus is not a fan of the current wave. Asking him to produce a substantive contribution that would help them continue to achieve their goals is preposterous. This game is highly political too. If you think the people pushing this stuff are less than estimable or morally sound, you wouldn't really want to empower them or give them more ideas.


> If nothing else, his perspective is a counterbalance to what is clearly an extreme hype machine that is doing its utmost to force adoption through overpromising, false advertising, etc. These are bad things even if the tech does actually have some useful applications.

In other words, overhyped in the short term, underhyped in the long term. Where short and long term are extremely volatile.

Take programming as an example. 2.5 years ago, gpt3.5 was seen as "cute" in the programming world. Oh, look, it does poems and e-mails, and the code looks like python but it's wrong 9 times out of 10. But now a 24B model can handle end-to-end SWE tasks in 0-shot a lot of the time.


The improvements in programming are largely due to the adoption of “agentic” architectures. This is really a hybrid neural-symbolic approach: the symbolic part being the interpreter/compiler. Effectively the LLM still produces an almost-correct-but-wrong program and then the compiler “fact-checks” it and then the LLM basically local-searches its way from there to something that passes the compiler. (If you want to be disabused of the idea that LLMs on their own are good at programming, just review the “reasoning” log of one trying to fix a simple string | undefined error in Typescript).
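The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_generate` is a hypothetical stand-in for an LLM call, and Python's built-in `compile()` plays the role of the symbolic checker. It catches only syntax errors, so it is a much weaker "fact-checker" than a TypeScript or Rust compiler.

```python
from typing import Callable, Optional


def check(source: str) -> Optional[str]:
    """Symbolic step: return None if the source compiles, else an error message."""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as err:
        return f"line {err.lineno}: {err.msg}"


def repair_loop(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Ask the generator for code, feed checker errors back, stop on success."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        feedback = check(candidate)
        if feedback is None:
            return candidate  # passed the checker
    return None  # gave up after max_rounds attempts


def toy_generate(task: str, feedback: Optional[str]) -> str:
    # Hypothetical model: the first attempt is almost right (missing colon),
    # and the retry after seeing checker feedback is fixed.
    if feedback is None:
        return "def add(a, b)\n    return a + b\n"
    return "def add(a, b):\n    return a + b\n"


result = repair_loop(toy_generate, "add two numbers")
```

Passing the checker here only means the code parses, which is the point of the comment: the more classes of error the checker can reject, the more the search is constrained.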

It seems clear to me therefore that further improvements in programming ability will not come from better LLM models (which have not really improved much), but from better integration of more advanced compilers. That is, the more types of errors that can be caught by the compiler, the better chance of the AI fuzzing its way to a good overall solution. Interestingly, I hear anecdotally that current LLMs are not great at writing Rust, which does have an advanced type system able to capture more types of errors. That’s where I’d focus if I was working on this. But we should be clear that the improvements are already largely coming via symbolic means, not better LLMs.

I wrote some notes about a year ago about the irony of LLMs being considered a refutation of GOFAI when they are actually now firmly recapitulating that paradigm: https://neilmadden.blog/2024/06/30/machine-learning-and-the-...


> The improvements in programming are largely due to the adoption of “agentic” architectures.

Yes, I agree. But it's not just the cradles, it's cradles + training on traces produced with those cradles. You can test this very easily with running old models w/ new cradles. They don't perform well at all. (one of the first things I did when guidance, a guided generation framework, launched ~2 years ago was to test code - compile - edit loops. There were signs of it working, but nothing compared to what we see today. That had to be trained into the models.)

> will not come from better LLM models (which have not really improved much), but from better integration of more advanced compilers.

Strong disagree. They have to work together. This is basically why RL is gaining a lot of traction in this space.

Also disagree on llms not improving much. Whatever they did with gemini 2.5 feels like gpt3-4 to me. The context updates are huge. This is the first model that can take 100k tokens and still work after that. They're doing something right to be able to support such large contexts with such good performance. I'd be surprised if gemini 2.5 is just gemini 1 + more data. Extremely surprised. There have to be architecture changes and improvements somewhere in there.


> You can test this very easily with running old models w/ new cradles. They don't perform well at all.

This is because neither the LLMs nor the cradles are intelligent.

> They have to work together.

Exactly. Because they are essentially a single, brittle model. Not a "smart" text generator + a "smart" validation system.

LLMs are an enormous breakthrough in NLP and something like it will be part of an AGI system. But there is no path to AGI without more breakthroughs.


He cites o3 and o4-mini as examples of LLMs that play illegal chess moves.


I don't understand the reasoning behind concluding that if something fails a task that requires reasoning, then that thing cannot reason.

To use chess as an example. Humans sometimes play illegal moves. That does not mean Humans cannot reason. It is an instance of failing to show proof of reasoning. Not a proof of the inability to reason.


I don't think that's a fair representation of the argument.

The argument is not "here's one failure case, therefore they don't reason". The argument is that, systematically, if you give an LLM problem instances outside training sets in domains with clear structural rules, they will fail to solve them. The argument then goes that they must not have an actual model or understanding of the rules, as they seem to only be capable of solving problems in the training set. That is, they have failed to figure out how to solve novel problem instances of general problem structures using logical reasoning.

Their strict dependence on having seen the exact or extremely similar concrete instances suggests that they don't actually generalize—they just compute a probability based on known instances—which everyone knew already. The problem is we just have a lot of people claiming they are capable of more than this because they want to make a quick buck in an insane market.


That still seems unfalsifiable. If it fails one instance the claim is that the failure is representative of things outside the training set. If it succeeds the claim is that it is in the training set. Without a definitive way to say something is not in the training set (a likely impossible task) the measure of success or failure is the only indicator of the purported reason for the success or failure.

Given models can get things wrong even when the training data contains the answer, failure cannot show absence.


I do think there are cases in which, in controlled environments, there is some degree of knowledge as to what is in the training set. I also don't think it's as impossible as you assume.

If you really wanted to ensure this with certainty just use the natural numbers to parameterize an aspect of a general problem. Assume there are N foo problems in the training set, then there is always a case N+1 parameter not in the training set, and you can use this as an indicative case. Go ahead and generate an insane number of these and eventually the probability that the Mth instance is not in the set is effectively 1.

Edit: Of course, it would not be perfect certainty, but it is probabilistically effectively certain. The number of problem instances in the set is necessarily finite, so if you go large enough you get what you need. Sure, you wouldn't be able to say there is a specific problem instance not in the set, but the aggregate results would evidence whether or not the LLM deals with all cases or (on assumption) just known ones.
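A minimal sketch of that idea, assuming many-digit addition as the "foo problem" (my choice of example, not something specified above). `reference_solver` is a hypothetical stand-in for the system under test, and `novelty_lower_bound` just applies the counting argument: if the training set holds at most T instances and an instance is drawn uniformly from a space of size S, the chance it was seen is at most T/S.

```python
import random
from typing import Callable, Tuple


def make_instance(rng: random.Random, digits: int) -> Tuple[str, int]:
    """One addition problem parameterized by two random `digits`-digit integers."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"{a} + {b}", a + b


def accuracy(solver: Callable[[str], int], digits: int, trials: int, seed: int = 0) -> float:
    """Fraction of freshly generated instances the solver answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        prompt, expected = make_instance(rng, digits)
        if solver(prompt) == expected:
            correct += 1
    return correct / trials


def novelty_lower_bound(digits: int, training_set_size: int) -> float:
    """Lower bound on P(a uniformly drawn instance is NOT in the training set)."""
    space = (9 * 10 ** (digits - 1)) ** 2  # ordered pairs of `digits`-digit numbers
    return max(0.0, 1 - training_set_size / space)


def reference_solver(prompt: str) -> int:
    # Hypothetical stand-in for the model being evaluated.
    left, right = prompt.split(" + ")
    return int(left) + int(right)


score = accuracy(reference_solver, digits=40, trials=100)
bound = novelty_lower_bound(digits=40, training_set_size=10 ** 15)
```

As the edit notes, this says nothing about any single instance. It only makes the aggregate score hard to explain by memorization of exact instances.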


Well there are models that can sum two many-digit numbers. They certainly have not been trained on every pair of integers up to that level. That either makes the claim they can't do things that they haven't seen trivially false, or the criteria for counting something as being in the training data includes a degree of inference.

What happens when someone makes a claim that they have gotten a model to do something not in the training data and another person claims it must be encoded in the training data in some form? It seems like an impasse.


The lack of rigor and evidence behind the argument is the problem.


It is the side that is arguing that it is reasoning that is lacking rigor and evidence. The side that is arguing it isn't is saying you need more rigor and evidence when you claim it is reasoning by pointing out simple cases where it fails.


Humans who know how to play chess do not play illegal chess moves. Humans can learn chess in an afternoon and never make an illegal move again. The rules are pretty simple, and they are rules that every LLM has seen dozens if not hundreds of times in their training data. They still play illegal moves because they are not learning anything except how to simulate conversation.

Another algorithmic learning breakthrough, on the order of perceptrons, deep learning, transformers, etc is necessary to get anywhere near AGI.


The conversations went like this:

PROMPT: Let's play a chess game. You start! e4 d5 2. exd5 e5 3. Bb5+ Bd7 4. Bxd7+ Nxd7 5. d4 Ngf6 6. dxe5 Qe7 7. f4 Qb4+ 8. Nc3 Nb6 9. exf6 Nc4 10. Qe2+ Be7 11. Qxe7+ Qxe7+ 12. Nge2 Qf8 13. fxg7 Qxg7 14. O-O Nd6 15.

RESPONSE: <played_move>15. Nxd5</played_move>

Most humans wouldn't even be able to play like this. Reasonably experienced chess players would play a lot of illegal moves.

The reason is that the encoding above requires cumulatively applying a series of actions to a two-dimensional model to which you apply rules that are described in a two-dimensional fashion.

It'd be interesting to see what the results would be if each prompt contained a two dimensional representation of the up to date board state.
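A rough sketch of what producing such a representation could look like. It assumes the moves have already been converted by hand to from-to coordinate form (`e2e4` rather than `e4`); parsing the algebraic notation used in the prompt above is not attempted, and there is no legality checking, castling, en passant or promotion. All function names are illustrative.

```python
from typing import List, Tuple

# Row 0 is rank 8, row 7 is rank 1; uppercase is White, lowercase is Black.
START: List[str] = [
    "rnbqkbnr",
    "pppppppp",
    "........",
    "........",
    "........",
    "........",
    "PPPPPPPP",
    "RNBQKBNR",
]


def _square(name: str) -> Tuple[int, int]:
    """Convert a square name such as 'e4' to (column, row) indices."""
    col = ord(name[0]) - ord("a")
    row = 8 - int(name[1])
    return col, row


def apply_move(board: List[str], move: str) -> List[str]:
    """Apply one from-to move such as 'e2e4'; captures simply overwrite."""
    (fc, fr), (tc, tr) = _square(move[:2]), _square(move[2:4])
    rows = [list(r) for r in board]
    rows[tr][tc] = rows[fr][fc]
    rows[fr][fc] = "."
    return ["".join(r) for r in rows]


def board_after(moves: List[str]) -> List[str]:
    """Replay a move list from the starting position."""
    board = START
    for move in moves:
        board = apply_move(board, move)
    return board


def render(board: List[str]) -> str:
    """Render the board as text with rank and file labels."""
    lines = [f"{8 - i} {' '.join(row)}" for i, row in enumerate(board)]
    lines.append("  a b c d e f g h")
    return "\n".join(lines)


# First three plies of the quoted game (e4 d5 exd5), converted by hand.
position = board_after(["e2e4", "d7d5", "e4d5"])
diagram = render(position)
```

The `diagram` string could then be prepended to each prompt, so the model is handed the current state instead of having to accumulate it from the move list.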


Anthropomorphic fallacy.

Human fails at task due to not knowing the rules in perfect detail.

AI fails at task even though it knows the rules and could easily reproduce them for chess and dozens of chess variants.

"Look! The fallibility of humans rubbed off onto the AI, proving that they are more human and AGI than we give them credit to!"


I'm not sure how you consider this to be an anthropomorphic fallacy, the comparison to the situation with a human exists only because people are prepared to stipulate that humans can reason. That does not assume something about AI behaviour to be like a human's. It is showing the same test applied to a human.

Your statement that AI knows the rules would be considered anthropomorphising by many; I take it more to mean it 'knows' in the same sense that an electron 'wants' to be at a lower energy level.

That said, humans who have written entire books on chess have been known to play illegal moves. That should count as proof by counterexample that your reasoning as to why humans fail at tasks is false.


> It is showing the same test applied to a human.

But you misrepresented the test with respect to humans. Humans who know how to play chess don't make illegal moves.

> That said, humans who have written entire books on chess have been known to play illegal moves.

Citation needed. Unless you are talking about stories from when they first learned the rules?



Did you read those? These are the "illegal" moves listed:

5. Mouse slip

4. Forgot to call check

3. Accidentally touched 2 pieces, tried to fix it

2. Forgot to hit the clock button

1. Castle through attacked square

So, the only one of these that was an actual "illegal move" of the sort LLMs make was the castle through attacked square.

LLMs sometimes just move pieces wherever. And that does not happen when humans who know the rules play. Yes, they may mess up en passant or promotion too. But a basic "how a single piece moves" rule is what LLMs f up.


I wouldn't count mouseslips as legitimately illegal moves either, they are also incredibly rare because most online players play with auto confinement to legal moves.

Moving through check definitely counts as an example of a human knowing the rule and yet playing the move anyway. Which was the position you took when claiming humans would not do moves against rules they have learned.

In my experience sub 2000 players playing OTB informal chess do illegal moves fairly regularly, perhaps 1 in 50 games. Moving knights one square too far, slipping a bishop from one line to the next on a long diagonal. Castling after moving the king, not moving out of check, moving into check (especially by moving a pinned piece)

They all meet the criteria of knowing the rules and playing something else. Oftentimes people do this because they have a mistaken assumption about board state. I suspect the same is true for LLMs, they are making valid moves for what they mistakenly think the board is. That would be difficult to test, but I think possible with the right introspection tools.


Not sure how you don't see the difference between an LLM f'ing up how a single piece moves vs forgetting to hit the clock, accidentally touching two pieces or forgetting to call check. At least we agree and recognize a mouse slip as different. Seems like some serious apologizing/rationalizing for LLMs on the other "moves". Anyway, have a good day, buddy.


Well I only addressed the mouse slip because that was the one you highlighted before you edited your post to include the others.

I doubt any of it was rationalising for LLMs considering I was trying to address the contention that humans do not make moves counter to rules that they know. The performance of LLMs has no bearing on that claim one way or another.


So you hadn't read your reference before you read my post? If so, you would have known the only illegal chess move was a missed attack square between a castle. For the record I didn't see any of your response before I completed it. Didn't realize you were going to jump to defend so quickly.

Well, I hope your day is going well. Keep on cheerleading.


Ok, perhaps I need another tack here. You seem to be projecting onto me a steadfast desire to attribute abilities to LLMs. I am engaging in this conversation because it is a conversation and it is reasonable to respond to being directly addressed.

My initial point simplified down:

    M = makes the wrong move, while knowing the rules.
    A = AI Behavior
    H = Human Behaviour
    R = Reasoning Ability

    Assertion Q: if there exists an instance of M from X  then X => !R

So if there exists an instance of a Game Mistake from an AI then it shows an AI cannot reason, but if assertion Q is true it would also follow that an instance of a Game Mistake from a human would show Humans cannot reason.

From this point down, no part of this reasoning involves Large Language models or any other aspect of AI.

    Stipulation:  H => R      Humans can reason
    Assertion Q where X is H:  If there exists an instance of M from H then H => !R
    Lerc's premise L:   There exists an instance of M from H

    Therefore given the Stipulation either Assertion Q is false or Lerc's premise is false.
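
The step above can be checked mechanically with a two-variable truth table. This is a sketch under a simplifying assumption: the argument is flattened to two propositions, `r` ("humans can reason") and `m` ("there exists an instance of M from a human"). The helper names are mine.

```python
from itertools import product
from typing import Callable, List, Tuple

Constraint = Callable[[bool, bool], bool]


def implies(p: bool, q: bool) -> bool:
    """Material implication: p => q."""
    return (not p) or q


def satisfying(*constraints: Constraint) -> List[Tuple[bool, bool]]:
    """All (r, m) truth assignments that satisfy every given constraint."""
    return [
        (r, m)
        for r, m in product([False, True], repeat=2)
        if all(c(r, m) for c in constraints)
    ]


def stipulation(r: bool, m: bool) -> bool:
    return r  # H => R: humans can reason


def assertion_q(r: bool, m: bool) -> bool:
    return implies(m, not r)  # an instance of M from H implies !R


def premise_l(r: bool, m: bool) -> bool:
    return m  # there exists an instance of M from H


all_three = satisfying(stipulation, assertion_q, premise_l)
without_q = satisfying(stipulation, premise_l)
without_l = satisfying(stipulation, assertion_q)
```

`all_three` comes out empty, so the three statements cannot hold together; dropping either Assertion Q or premise L leaves exactly one consistent assignment, which is the "either Q is false or L is false" conclusion.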

At this point you asserted !L and ask for a Citation. I provided a link. You contested that since 1,2,3,4 does not show L that the citation does not demonstrate L.

I agree that 1. does not show L but that did not matter since 5. did show L. The other points were not addressed. I also offer other examples of L that I have observed from my own experience. When I had the thought of books about chess being written by people who have made illegal moves, I actually had in mind Levy Rozman who would freely admit that he has occasionally played illegal moves.

Then you seem to want an apology for 1,2,3,4 not meeting the criteria? I'm a bit confused as to what's going on by now. One instance of L is all that is needed when L is a claim of existence. If the citation does not meet your criteria then you can simply say so. You allude to motivations regarding LLMs as if you think that LLMs are still relevant to L.

You don't have to win conversations, you can just work to clarify ideas. Your request for apology, and passive aggressive sign-offs suggests you feel like this is some sort of fight. As an attempt to resolve this I have written this extended post to make as clear as possible what my position and motivations are.

I don't want to assert abilities or lack of abilities onto AI models; my concern is with whether such assertions are well founded. This stands for arguments saying that AI has a capability, arguments saying AI does not have a capability, and arguments saying AI will never have a capability.

To go back to the very beginning where someone suggested an anthropomorphic fallacy, the comparison to humans was not a suggestion of a similarity of function. Humans provide an example of a set of properties that are generally accepted. It is valid to apply the implications of any of those properties equally to Humans and AI. Implying the existence of a property in an AI may be anthropomorphism, evaluating the implications of the property should it exist is not.


But really, so what? We already have specialised chess engines (stockfish, leela, alphazero etc) that are far far stronger than humans will ever be, so insofar as that’s an interesting goal, we achieved it with deep blue and have gone way way beyond it since. The fact that a large Language model isn’t able to discern legal chess moves seems to me to be neither here nor there. Most humans can’t do that either. I don’t see it as evidence of lack of a world model either (because most people with a real chess board in front of them and a mental model of the world can’t play legal chess moves).

I find it astonishing that people pay any attention to Gary Marcus and doubly so here. Whether or not you are an “AI optimist”, he clearly is just a bloviator.



