> There are various results that suggest that LLMs do internally have everything they'd need to know that they're hallucinating/wrong:
What undercuts the claim that an LLM has "everything they'd need to know that they're hallucinating/wrong" is the premise all three papers share: external detection.
From the first arxiv abstract:
> Moreover, informed by the empirical observations, we show great potential of using the guidance derived from LLM's hidden representation space to mitigate hallucination.
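For concreteness, "guidance derived from LLM's hidden representation space" typically involves machinery bolted on outside generation, for example a probe trained on the model's activations against externally supplied labels. A minimal sketch of that idea (not necessarily the paper's exact method; the activations and labels below are placeholders):

```python
# Sketch of a linear probe over hidden states, assuming the activations and
# hallucination labels come from an external extraction/labelling step.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(200, 768))  # stand-in for real layer activations
labels = rng.integers(0, 2, size=200)        # stand-in for human-verified 0/1 labels

probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)

def flag_hallucination(h: np.ndarray, threshold: float = 0.5) -> bool:
    """The probe, trained on outside labels, makes the call - not the LLM."""
    return probe.predict_proba(h.reshape(1, -1))[0, 1] > threshold
```

The judgement lives in the probe and the labelled data, i.e. in a detector someone builds around the model.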
From the second arxiv abstract:
> Using this basic insight, we illustrate that one can identify hallucinated references without ever consulting any external resources, by asking a set of direct or indirect queries to the language model about the references. These queries can be considered as "consistency checks."
From the Nature abstract:
> Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations.
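And the entropy-based estimators boil down to sampling several answers and computing a statistic over them. A simplified sketch (grouping answers by exact match rather than the semantic clustering the paper uses; `sample_answers` is a placeholder):

```python
# Sketch of an entropy-based uncertainty estimate over sampled answers.
import math
from collections import Counter

def sample_answers(question: str, n: int = 10) -> list[str]:
    """Placeholder for n sampled completions from an LLM."""
    raise NotImplementedError

def answer_entropy(question: str, n: int = 10) -> float:
    """Shannon entropy of the distribution of normalised answers."""
    answers = [a.strip().lower() for a in sample_answers(question, n)]
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

High entropy flags a likely confabulation, but the statistic is computed and thresholded outside the model, which is exactly the external-detection premise.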
Ultimately, no matter what content is generated, it is up to a person to provide the understanding component.
> So I don't think it's that they have no concept of correctness, they do, but it's not strong enough.
Again, "correctness" is a determination solely made by a person evaluating a result in the context of what the person accepts, not intrinsic to an algorithm itself. All an algorithm can do is attempt to produce results congruent with whatever constraints it is configured to satisfy.