
>Whether these tests, verbatim produce the same response on any given version isn't the point.

The paper makes a nonsensical claim but fails to back it with results. If anyone isn't doing any thinking here, it's you. How deluded do you have to be to have such strong confirmation bias about a "paper" that doesn't even confirm those biases? Chalking up the simple fact that the model is indeed getting right all the problems that were supposed to indicate a lack of reasoning as "not producing the same response verbatim" is ridiculous. Did you even stop to think about what you just wrote?

If you're so sure that GPT-4 would fail a permutation of these tests, then by all means demonstrate that.




I'm addressing the comments in this thread; the dialectic is:

OP: Paper

Commenters: Reply to paper

Me: Reply to commenters

You'll notice a different burden in each case, since the claim differs. My claim is that confirmation-bias replies to this paper are (at best) poorly founded.

More broadly, the hypothesis that an LLM reasons is not confirmed by an infinite number of correct replies to prompts. It is immediately refuted by a systematic failure across types of reasoning (NOT q/a instances).

What is my burden? As far as I can tell, only to provide what I have.


>More broadly, the hypothesis that an LLM reasons is not confirmed by an infinite number of correct replies to prompts. It is immediately refuted by a systematic failure across types of reasoning (NOT q/a instances).

The question of reasoning in LLMs isn't whether they employ sufficiently strong reasoning capabilities in all instances, but whether they have the capacity for sufficiently strong reasoning at all. You can't confirm general reasoning abilities with many instances of correct responses, but you also can't disconfirm them through systematic failure unless you have good reason to think the model should have engaged those reasoning abilities in the test context. We know that LLMs selectively activate subnetworks based on the content of the prompt. There should be no expectation of systematic reasoning ability across contexts; the question is rather what their capacities are in ideal contexts. LLMs are way too sensitive to seemingly arbitrary features of context to rule out capacities from even apparently systematic failures.


The two hypotheses are that:

1) LLMs do not reason: they produce sequences of apparently reasoned replies R1...Rn according to the frequency distribution given by P(Rn | R1...R(n-1), TextCorpus)

2) LLMs do reason: in cases where P(Conclusion | Premise, Rule(Premise, Conclusion)) is 1 or 0 for all Rule in {rules of basic reasoning}, LLMs can reproduce this.

I think (2) is clearly false, and (1) clearly true. LLMs never reason. They're always just sequences of conditional selections of replies. These "follow from" earlier replies just because of the frequency of their coincidences.

"Reasoning" is a claim about the mechanism by which replies are given. It is not a claim about whether those replies are correclty sequenced in some cases. Obviously they are.


LLMs aren't simply modeling frequency distributions. Self-attention essentially searches the space of circuits to find which circuits help to model the training data. This search process recovers internal structure of the training data that isn't captured by naive frequency-distribution models. The limit of arbitrarily complex frequency-distribution models, i.e. P(xN | x1...x(N-1)) for large N, is just memorizing the complete training data, which we know LLMs aren't doing due to space limitations. The abilities of LLMs aren't well explained by parroting or by modeling frequency distributions.
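
To put a rough number on the "space limitations" point (the vocabulary size and parameter count below are illustrative assumptions, not figures from the thread):

    # An explicit conditional-frequency table P(xN | x1...x(N-1)) needs an
    # entry per distinct context, and the number of possible contexts grows
    # as V ** (N - 1).
    V = 50_000                      # assumed vocabulary size, for illustration
    for N in (2, 3, 5, 9):
        print(f"context length {N - 1}: up to {V ** (N - 1):.2e} possible contexts")

    # A model with ~1e11 parameters clearly cannot store counts for 50000**8
    # contexts, so whatever it learns is not a literal table of frequencies.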


It's necessarily modelling a frequency distribution; I'm not sure how you could think it's doing anything else. Self-attention is just a frequency distribution over a frequency distribution.

It's literally trained with P(B|A) as the objective....

The 'circuits' you're talking about are just sequences of `|`, i.e. P(B | A1, ..., An).


>Self-attention is just a frequency distribution over a frequency distribution.

I don't know where you get that. But it's not my understanding.

>It's literally trained with P(B|A) as the objective....

This is a description of the objective, not a model or an algorithm. The algorithm is not learning frequency data. The algorithm tries to maximize P(B|A), but within this constraint there is a vast range of possible algorithms.
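
A minimal PyTorch sketch of that point (the toy models here are my own illustration, not anything from the paper or the posts above): the next-token objective only scores predictions; it says nothing about how the logits get computed.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # The objective: maximize P(next token | previous tokens), i.e. minimize
    # next-token cross-entropy. Any module mapping token ids to logits can
    # be trained against it.
    def next_token_loss(model, tokens):              # tokens: (batch, seq) ints
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = model(inputs)                       # (batch, seq-1, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))

    VOCAB = 100

    # "Algorithm" A: a literal bigram lookup table -- logits depend only on
    # the current token (the pure frequency-distribution picture).
    bigram = nn.Embedding(VOCAB, VOCAB)

    # "Algorithm" B: an embedding plus a small MLP -- same objective, a very
    # different computation.
    mlp = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, 256),
                        nn.ReLU(), nn.Linear(256, VOCAB))

    tokens = torch.randint(0, VOCAB, (8, 16))
    print(next_token_loss(bigram, tokens).item())
    print(next_token_loss(mlp, tokens).item())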


But can this algorithm ever produce reasoning without learning a whole universe of possible inputs?

Given the evidence that it fails to learn arithmetic, skips inference steps, and misassigns symbols, I'd say likely not.


Reasoning is abstracted from particulars, so in principle what it needs to learn is a finite set of rules. There are good reasons why current LLMs don't learn arithmetic and have odd failure modes: their processing is feed-forward (non-recursive) with a fixed computational budget. This means they cannot, in principle, learn general rules for arithmetic that involve unbounded carrying. But this is not an in-principle limitation of LLMs or of gradient-descent-based ML in general.
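
To make the carrying point concrete, here is an ordinary Python sketch (my own illustration): the number of sequential carry steps grows with the number of digits, while a fixed-depth feed-forward pass performs a constant number of sequential steps.

    # Grade-school addition: a carry from the lowest digit can propagate all
    # the way to the top, so each position depends on the one before it.
    def add_digits(a, b):            # digit lists, least significant first
        out, carry, steps = [], 0, 0
        for x, y in zip(a, b):
            s = x + y + carry
            out.append(s % 10)
            carry = s // 10
            steps += 1               # one sequential step per position
        if carry:
            out.append(carry)
        return out, steps

    # 999...9 + 1 forces the carry through every position:
    n = 12
    digits, steps = add_digits([9] * n + [0], [1] + [0] * n)
    print(digits[::-1], "sequential carry steps:", steps)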


> the hypothesis that an LLM reasons

Isn't really a well-defined hypothesis, because “reasoning” isn’t well-enough defined for it to be one.


Isn't it? I think we have a pretty good grasp of what "reasoning" means in mathematics and computer science, in particular in logic. Although, to be fair, we normally use the word "inference" in maths and CS, to avoid confusing what humans do informally with what we do formally, with computers or without.

But it's clear that the author of the paper above is using "reasoning" to mean formal reasoning, as in drawing inferences from axioms and theorems using a set of inference rules. I think that makes the article's point very clear, and we don't need to split hairs over the different possible definitions, understandings, or misunderstandings of "reasoning".
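
As a deliberately trivial example of reasoning in that formal sense (my own illustration, nothing to do with the paper's actual test suite), modus ponens can be stated and machine-checked in Lean:

    -- From a proof of P → Q and a proof of P, the rule of modus ponens yields Q.
    example (P Q : Prop) (h : P → Q) (hp : P) : Q := h hp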



