> LLMs can't do math. He went on to "argue" that LLMs trick you with poetry that sounds good, but is highly subjective, and when tested on hard verifiable problems like math, they fail.
They really can't. Token prediction based on context does not reason. You can scramble to submit PRs to ChatGPT to keep up with the "how many Rs in blueberry" kind of problems, but it's clear they can't even keep up with shitposters on Reddit.
And your second and third points about planning and compounding errors remain challenges... probably unsolvable with LLM approaches.
> They really can’t. Token prediction based on context does not reason.
Debating about "reasoning" or not is not fruitful, IMO. It's an endless debate that can go anywhere and nowhere in particular. I try to look at results:
> Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.
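To make that abstract concrete, here's a rough sketch of the idea as I read it: sample a bunch of reasoning traces, score each with a model-internal confidence signal, drop the low-confidence ones, and majority-vote over the rest. This is my own simplification, not the paper's implementation; the trace format, the mean-logprob confidence score, and the fixed keep-fraction are all assumptions on my part.

```python
# Hedged sketch of confidence-filtered self-consistency (not the paper's code).
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Trace:
    answer: str                  # final answer extracted from the trace
    token_logprobs: List[float]  # per-token log-probabilities from the model

def confidence(trace: Trace) -> float:
    # Assumption: mean token log-probability as the "model-internal" confidence signal.
    return sum(trace.token_logprobs) / len(trace.token_logprobs)

def confidence_filtered_vote(traces: List[Trace], keep_frac: float = 0.5) -> Optional[str]:
    if not traces:
        return None
    # Keep only the most confident fraction of traces, then majority-vote over them.
    ranked = sorted(traces, key=confidence, reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_frac))]
    return Counter(t.answer for t in kept).most_common(1)[0][0]

# Toy usage: two confident traces agree, one low-confidence trace gets filtered out.
traces = [
    Trace("42", [-0.1, -0.2, -0.1]),
    Trace("41", [-2.5, -3.0, -2.8]),  # low confidence, dropped before the vote
    Trace("42", [-0.3, -0.4, -0.2]),
]
print(confidence_filtered_vote(traces))  # -> "42"
```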
> Debating about "reasoning" or not is not fruitful, IMO.
That's kind of the whole need, isn't it? Humans can automate simple tasks very effectively and cheaply already. If I ask the pro version of an LLM what the Unicode value of a seahorse is, and it shows a picture of a horse and gives me the Unicode value for a third, completely unrelated animal, then it's pretty clear it can't reason itself out of a wet paper bag.
Sorry, perhaps I worded that poorly. I meant debating about whether context stuffing is or isn't "reasoning". At the end of the day, whatever RL + long context does to LLMs seems to provide good results. Reasoning or not :)
Well, that's my point, and what I think the engineers are screaming at the top of their lungs these days: that it's net negative. It makes a really good demo but hasn't won anything except maybe translation and simple graphics generation.
> You can scramble to submit PRs to ChatGPT to keep up with the "how many Rs in blueberry" kind of problems, but it's clear they can't even keep up with shitposters on Reddit.
Nobody does that. You can't "submit PRs" to an LLM. Although if you pick up new pretraining data you do get people discussing all newly discovered problems, which is a bit of a neat circularity.
> And your second and third points about planning and compounding errors remain challenges... probably unsolvable with LLM approaches.
Unsolvable in the first place. "Planning" is GOFAI metaphor-based development: they decided, with no evidence, that humans must do "planning", and therefore that if they coded something and called it "planning" it would give them intelligence.
Humans don't do or need to do "planning". Much like they don't have or need to have "world models", the other GOFAI obsession.
Ignoring conversations about 'reasoning', at a fundamental level LLMs do not 'do math' in the way that a calculator or a human does math. Sure, we can train bigger and bigger models that give the impression of this, but there are proofs out there that, as task complexity increases (in this case, multi-digit multiplication), the probability of an incorrect prediction eventually converges to 1 (https://arxiv.org/abs/2305.18654).
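A back-of-envelope way to see the shape of that result (a toy model of mine, not the paper's actual argument): treat an n-digit by n-digit multiplication as roughly n^2 digit-level operations, each of which the model gets right with some probability p < 1. Even with p very close to 1, the chance of an entirely correct answer collapses as n grows.

```python
# Toy compounding model: correctness of multi-digit multiplication under
# per-step error. The 99.5% per-step accuracy is an illustrative assumption.
def p_all_correct(n_digits: int, p_step: float = 0.995) -> float:
    # Assume ~n^2 independent digit-level operations, all of which must succeed.
    return p_step ** (n_digits ** 2)

for n in (2, 5, 10, 20, 50):
    print(f"{n:>2} digits: P(fully correct) ~ {p_all_correct(n):.3f}")
# Even at 99.5% per-step accuracy, a 20-digit product comes out right ~13% of
# the time, and a 50-digit product essentially never.
```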
> And your 2nd and third point about planning and compounding errors remain challenges.. probably unsolvable with LLM approaches.
The same issue applies here, and really to any complex multi-step problem.
> Again, mere months later the o series of models came out, and basically proved this point moot. Turns out RL + long context mitigate this fairly well. And a year later, we have all SotA models being able to "solve" problems 100k+ tokens deep.
If you go hands-on in any decent-sized codebase with an agent, session length and context size become noticeable issues. Again, mathematically, error propagation eventually leads to a 100% chance of error. Yann isn't wrong here; we've just kicked the can a little further down the road. What happens at 200k+ tokens? 500k+? 1M? The underlying issue of a stochastic system isn't addressed.
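The same compounding argument at the session level, with toy numbers chosen only to show the shape of the curve (real errors aren't independent and agents can sometimes recover, so this is purely illustrative): if each generated token or step has a small chance of derailing the run, the probability of a fully clean trajectory decays geometrically with length.

```python
# Toy calculation: probability of an error-free run vs. session length,
# assuming an illustrative 1-in-100,000 chance of derailment per token.
def p_clean_run(steps: int, p_error_per_step: float = 1e-5) -> float:
    return (1.0 - p_error_per_step) ** steps

for steps in (100_000, 200_000, 500_000, 1_000_000):
    print(f"{steps:>9} tokens: P(no error yet) ~ {p_clean_run(steps):.3f}")
# Longer context windows move the cliff further out; they don't remove it.
```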
> While Yann is clearly brilliant, and has a deeper understanding of the roots of the field than many of us mortals, I think he's been on a Debbie Downer trend lately
As he should be. Nothing he said was wrong at a fundamental level. The transformer architecture we have now cannot scale with task complexity. Which is fine; by nature it was not designed for such tasks. The problem is that people see these models work on a subset of small-scope complex projects and make claims that go against the underlying architecture. If a model is 'solving' complex or planning tasks but then fails at similar tasks of higher complexity, it's a sign that there is no underlying deterministic process. What is more likely: that the model is genuinely 'planning' or 'solving' complex tasks, or that it has been trained with enough planning- and task-related examples that it can make a high-probability guess?
> So, yeah, I'd take everything any one singular person says with a huge grain of salt. No matter how brilliant said individual is.
If anything, a guy like Yann with a role such as his at a Mag7 company being realistic (bearish, if you are an LLM evangelist) about what the transformer architecture can do is a relief. I'm more inclined to listen to him than to a guy like Altman, who touts LLMs as the future of humanity while his path to profitability is AI TikTok, sex chatbots, and a third-party way to purchase things from Walmart during a recession.