> I mean just try it yourself with o1, go as deep as you like asking how it arrived at a conclusion
I don't mean to disagree overall, but on this point the LLM can post-facto rationalize its output but it has no introspection and has absolutely no idea why it made a given bit of output (except in so far as it was a result of COT which it could reiterate to you). The set of weights being activated could be nearly disjoint when answering and explaining the answer.
One can also make the same argument about humans -- that they can't introspect their own minds and are just posthoc rationalizing their explanations unless their thinking was a product of an internal monolog that they can recount. But humans have a lifetime of self-interaction that gives a good reason to hope that their explanations actually relate to their reasoning. LLM's do not.
And LLMs frequently give inconsistent results, it's easy to demonstrate the posthoc nature of LLM's rationalizations too: Edit the transcript to make the LLM say something it didn't say and wouldn't have said (very low probability), and then have it explain why it said that.
(Though again, split brain studies show humans unknowingly rationalizing actions in a similar way)
I doubt people are very accurate at knowing why they made the choices they did. If you want them to recite a chain of reasoning they can but that is kind of far from most decision making most people do.
I agree people aren't great at this either and my post said as much.
However we're familiar with the human limits of this and LLMs are currently much worse.
This is particularly relevant because someone suffering from the mistaken belief that LLM's could explain their reasoning might go on to attempt to use that to justify the misapplication of an LLM.
E.g. fine tune some LLM using resume examples so that it almost always rejects Green-skinned people, but approve the LLMs use in hiring decisions because it is insistent that it would never base a decision on someone's skin color. Humans can lie about their biases of course, but a human at least has some experience with themselves while a LLM usually has no experience observing themself except for the output visible in their current window.
I also should have added that the ability to self explain when COT was in use only goes as deep as the COT, as soon as you probe deeper such that the content of the COT requires explanation the LLM is back in the realm of purely making stuff up again.
A non-hallucinated answer could only recount the COT and beyond that it would only be able to answer "Instinct."-- sure the LLM's response has reasoning hidden inside it, but that reasoning is completely inaccessible to the LLM.
I don't mean to disagree overall, but on this point the LLM can post-facto rationalize its output but it has no introspection and has absolutely no idea why it made a given bit of output (except in so far as it was a result of COT which it could reiterate to you). The set of weights being activated could be nearly disjoint when answering and explaining the answer.
One can also make the same argument about humans -- that they can't introspect their own minds and are just posthoc rationalizing their explanations unless their thinking was a product of an internal monolog that they can recount. But humans have a lifetime of self-interaction that gives a good reason to hope that their explanations actually relate to their reasoning. LLM's do not.
And LLMs frequently give inconsistent results, it's easy to demonstrate the posthoc nature of LLM's rationalizations too: Edit the transcript to make the LLM say something it didn't say and wouldn't have said (very low probability), and then have it explain why it said that.
(Though again, split brain studies show humans unknowingly rationalizing actions in a similar way)