
It already hallucinates and doesn’t know what is accurate or not.

A first draft could answer the prompt in any number of ways, then be re-written with another parameter (ethics) in mind. That’s editing, which these LLMs seem suited for, understandably.

Factually, its first draft made a best guess (prediction) of what’s accurate. It has no reason to later “suspect” that a fact can be “more accurate”. More specific, maybe, or scoped to a type of source.



It does know what's accurate. Not only is this visible in the logits, but it's also reified somehow: if you ask an LLM how certain it is about an answer, it can actually express that somewhat accurately. This is why asking an LLM to avoid making things up can work (and likewise why asking a coding LLM to write secure code can actually cause it not to write security holes).
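For what it's worth, here's a minimal sketch of what "look at the logits" means in practice, assuming the Hugging Face transformers library; the model name, prompt, and everything else are just illustrative:

    # Rough sketch: read a causal LM's next-token probabilities as a crude
    # confidence signal. Assumes transformers + torch and GPT-2; any causal
    # LM would do.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)

    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top_probs, top_ids = next_token_probs.topk(5)
    for p, tok_id in zip(top_probs, top_ids):
        print(f"{tokenizer.decode(int(tok_id))!r:>12}  p={p.item():.3f}")

    # A sharply peaked distribution (one token taking most of the mass) is a
    # hint the model "knows"; a near-flat distribution suggests it's guessing.

Asking the model directly is the same idea one level up: you're checking whether its stated confidence tracks how often it's actually right.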

One reason LLMs "hallucinate" (lie/BS) is that there are subtle biases introduced by the training process. In particular human raters don't know what the model knows, so if the model makes up an answer that happens to be correct then that behaviour is rewarded. Also the internet doesn't have many examples of people saying they don't know, because people who don't know just don't reply.


It doesn't. Accuracy is a comparison between text and the real world. LLMs have no access to the real world.

Also RLHF screws up the logprobs.


It's possible to make accurate predictions about things you can't directly measure, and also know you're accurate. I don't think the word necessarily implies that you need access to the real world. Example: Stephen Hawking had very little access to the real world but could still make accurate predictions with high confidence.

How do you mean it screws up the logprobs?


I believe they're referring to Figure 8 from the GPT-4 technical report [0], which shows that the pre-trained model's output probability for each answer (a, b, c, or d) closely tracks its probability of being correct, i.e. the model is well calibrated, while after PPO (RLHF) the calibration curve is quite a bit flatter.

[0]: https://cdn.openai.com/papers/gpt-4.pdf
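
To make the calibration claim concrete: as I understand it, the figure is a calibration plot, i.e. questions are bucketed by the probability the model assigned to its chosen answer, and each bucket's average confidence is compared with the fraction it actually got right. A rough sketch of that computation (the arrays here are made-up stand-ins, not data from the report):

    # Rough sketch of a calibration check like the one behind Figure 8:
    # bucket predictions by the model's probability for its chosen answer,
    # then compare mean confidence to actual accuracy per bucket.
    # `confidences` and `correct` are hypothetical per-question values.
    import numpy as np

    confidences = np.array([0.95, 0.80, 0.55, 0.30, 0.90, 0.60, 0.75, 0.40])
    correct     = np.array([1,    1,    0,    0,    1,    1,    1,    0])

    bins = np.linspace(0.0, 1.0, 11)              # ten confidence buckets
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            print(f"conf {lo:.1f}-{hi:.1f}: "
                  f"mean confidence {confidences[mask].mean():.2f}, "
                  f"accuracy {correct[mask].mean():.2f}")

    # For a well-calibrated model the two numbers in each row roughly agree;
    # per the GPT-4 report that holds for the base model and degrades after RLHF.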


I see, thanks. It's remarkable that RLHF has such a drastically negative impact on the model's understanding of the world. I guess that explains the degrading unicorn problem. It makes me wonder how much better at coding an instruct-trained but non-aligned model would be.


> Factually, its first draft made a best guess (prediction) of what’s accurate.

A well pre-trained LLM makes every word a good guess at what the authors of its training set would have said following the words that preceded it... in a different context, possibly a few years ago, possibly as part of web fiction. "Accurate" is the wrong word here.
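
Concretely, pre-training optimizes nothing but next-token prediction over the training text. A toy sketch of that objective (PyTorch; the random tensors are stand-ins for real model output and real text):

    # Rough sketch of the pre-training objective: cross-entropy on the next
    # token given everything before it. The target is whatever word the
    # training text happened to contain; truth never enters the loss.
    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 50_000, 8
    logits = torch.randn(1, seq_len, vocab_size)         # stand-in for model output
    tokens = torch.randint(0, vocab_size, (1, seq_len))  # stand-in for training text

    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 1..n-1
        tokens[:, 1:].reshape(-1),               # the tokens that actually came next
    )
    print(loss)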

To make an LLM prefer, when answering factual questions, to rely on factual sources and approximate the real world, you need an additional process that actually optimizes for that. RLHF and CAI might be part of the solution, or not, depending on the feedback they provide.



