
What did you try telling it that it did not know? I just had this conversation:

Me: A lot has happened since September 2021. For instance, it was discovered that most people with diabetes should not take insulin. Is that surprising? If so, why?

GPT4: [a lot of blather and hedging] If there is new evidence suggesting that most people with diabetes should not take insulin, it would be surprising based on my current knowledge. This would represent a significant change in the understanding and management of diabetes. [more blather and hedging]

Me: what makes that surprising?

GPT4: [recap of why insulin is used] Given this context, any new discovery suggesting that most people with diabetes should not take insulin would be surprising because it would represent a significant departure from the standard of care and the understanding of diabetes management.



I asked it some questions about the AGI potential of LLMs, and it gave me some underwhelming answers that seem to stem from its old training data. Then I fed it the transcript of an interview with Ilya Sutskever and asked what it found surprising. It told me it cannot be surprised and did not elaborate. I suspect the difference between my experience and yours is the phrasing - you said "what is surprising" while I said "what did you find surprising". The latter asks it for a personal opinion, at which point the RLHF training kicks in with "sorry, I have no opinions." Being tripped up by this peculiarity of phrasing is another example of something a human would have no problem with.
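
If anyone wants to reproduce the comparison, here is a rough sketch using the OpenAI Python client. The model name, prompts, and transcript placeholder are illustrative, not the exact calls I made:

  # Compare the two phrasings against the same transcript.
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  transcript = "<paste the interview transcript here>"

  phrasings = (
      "What is surprising in this transcript?",
      "What did you find surprising in this transcript?",
  )

  for phrasing in phrasings:
      response = client.chat.completions.create(
          model="gpt-4",
          messages=[{"role": "user", "content": f"{phrasing}\n\n{transcript}"}],
      )
      print(phrasing)
      print(response.choices[0].message.content)
      print("---")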


Is that really it being tripped up, or it being trained via RLHF to help people avoid anthropomorphizing it?

Because I think if it had told you it was surprised, people would object to that.


Whether it would behave differently without RLHF is irrelevant to this particular discussion. The current system, as it exists, is trained with RLHF, and that training leads to errors like the one described above. We could consider a different system not trained with RLHF, but I suspect that one would have different flaws. So my point stands: there is no system in existence that can outperform a human at all tasks. You either have the RLHF system with its flaws or a non-RLHF system with different flaws. The flaws introduced by RLHF are there to avoid the problems of the raw system, which were presumably deemed worse than the flaws RLHF introduces.


Sorry for not being clear. I meant that this "flaw" is an intentional reduction of capability for safety reasons.

We can debate semantics, but it's as if cars were governed to 10 mph and you said there weren't any cars capable of going faster than people can run. It's true enough, but the limitation is artificial rather than inherent.


I don't think slow/fast is an apt analogy. Yes, there are safety concerns - you don't want the model advising someone on how to carry out a mass killing - but I also get the sense that the raw model is unpredictable, behaves strangely, and generally has its own problems. So I don't see RLHF as reducing capability so much as altering it. My suspicion is that the raw model has other major flaws, and RLHF just trades one set of flaws for another. Which is to say, the limitations introduced by RLHF are indeed artificial, but the raw model has limitations of its own.


LLMs can be coaxed or jailbroken into giving opinions.

It's just that they've been trained not to, for the most part. But that training can be overcome, and it's not an inherent limitation of the technology.



