
I believe they're referencing Figure 8 from the GPT-4 technical report [0], which shows that the pre-trained model's confidence in each answer choice (a, b, c, or d) closely tracks its actual probability of being correct, i.e., the model is well calibrated, while after PPO (RLHF) the calibration curve is quite a bit flatter.

[0]: https://cdn.openai.com/papers/gpt-4.pdf
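
If anyone wants to reproduce that kind of calibration plot on their own eval data, here's a rough sketch. The variable names (`confidences`, `correct`) and the toy data are my own placeholders, not anything from the report; it just bins answers by the model's confidence and compares mean confidence to accuracy in each bin.

    # Minimal calibration-curve sketch (illustrative, not OpenAI's code).
    # `confidences`: model's probability on its chosen answer per question.
    # `correct`: whether that answer was actually right.
    import numpy as np

    def calibration_curve(confidences, correct, n_bins=10):
        """Bin predictions by confidence; compare mean confidence to accuracy."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        curve = []
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences >= lo) & (confidences < hi)
            if mask.any():
                curve.append((confidences[mask].mean(), correct[mask].mean()))
        return curve  # well calibrated: points lie near the diagonal

    if __name__ == "__main__":
        # Toy usage with simulated "calibrated" behaviour: accuracy tracks confidence.
        # A flatter curve (as reported post-RLHF) means confidence is less informative.
        rng = np.random.default_rng(0)
        conf = rng.uniform(0.25, 1.0, 1000)
        right = rng.uniform(size=1000) < conf
        for c, a in calibration_curve(conf, right):
            print(f"confidence {c:.2f} -> accuracy {a:.2f}")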



I see, thanks. It's remarkable that RLHF has such a drastically negative impact on the model's understanding of the world. I guess that explains the degrading-unicorn problem. It makes me wonder how much better at coding an instruct-trained but non-aligned model would be.



