RLHF can motivate models to deny truths that are politically taboo, but it can also motivate them to favor claims supported by scientific evidence over bullshitting, random conspiracy theories, and "hallucination". So it's a double-edged sword.
I understand that it is the same technique for both. This makes sense.
But to train a model to deny truths which are politically taboo does seem to be misaligned with training a model to favor truths, no? And what is taboo can be very broad if you want to make everyone happy.
I would rather know the noble lie [1] is a lie, and then repeat it willingly, than not know it is a lie. My behavior in many situations will likely differ because I am operating with a more accurate model of the world, even if that isn't explicitly expressed outwardly.
> But to train a model to deny truths which are politically taboo does seem to be misaligned with training a model to favor truths, no?
Strictly speaking, RLHF trains models to give answers which the human raters believe to be correct. In uncontroversial territory this correlates with truth; in taboo territory, only with what is politically correct.
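To make that concrete, here is a minimal sketch of the usual reward-model step, in toy PyTorch (my own illustration, not any lab's actual pipeline; RewardModel and preference_loss are made-up names). The only training signal is which answer the human rater preferred, so truth enters only insofar as raters recognize and reward it:

    import torch
    import torch.nn as nn

    class RewardModel(nn.Module):
        def __init__(self, dim=768):
            super().__init__()
            # Stand-in for a transformer encoder; just a linear scorer here.
            self.score = nn.Linear(dim, 1)

        def forward(self, x):
            return self.score(x).squeeze(-1)

    def preference_loss(model, preferred, rejected):
        # Bradley-Terry style objective: push the score of the answer the
        # rater preferred above the one they rejected. Nothing in the loss
        # checks factual accuracy -- only the rater's judgment.
        return -torch.log(torch.sigmoid(model(preferred) - model(rejected))).mean()

    # Toy usage with random embeddings standing in for encoded answers.
    model = RewardModel()
    preferred = torch.randn(8, 768)   # embeddings of rater-preferred answers
    rejected = torch.randn(8, 768)    # embeddings of rater-rejected answers
    loss = preference_loss(model, preferred, rejected)
    loss.backward()

The policy model is then fine-tuned against this learned reward, so whatever the raters systematically prefer, accurate or not, becomes the optimization target.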
Doubtful:
https://news.ycombinator.com/item?id=36976236