This completely misses the point of reinforcement learning. The reward signal needs to be representative of what you actually want (in chess, for example, that would be winning the game).

Using an LLM as a judge means you will ultimately optimize for stories that the judge LLM likes, not necessarily for stories that people like. For this to work, the judge LLM needs to match human preferences as closely as possible, but producing such a model is exactly what you were trying to do in the first place!
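A toy sketch of the failure mode (all names and the "judge" are invented stand-ins, not any real model or API): a judge that scores stories by superficial cues, and a crude hill-climbing optimizer that maximizes the judge's score. The optimizer reliably saturates the judge's reward without ever being constrained toward what a human would call a good story.

```python
import random

# Illustrative proxy judge: rewards the mere presence of certain
# words, standing in for whatever superficial features a judge LLM
# happens to prefer.
JUDGE_LIKES = {"heartwarming", "epic", "twist"}

def judge_score(story_words):
    # Reward = fraction of judge-preferred words in the "story".
    return sum(w in JUDGE_LIKES for w in story_words) / len(story_words)

def optimize_against_judge(vocab, length=5, steps=200, seed=0):
    # Crude hill climbing: propose a one-word change, keep it if the
    # judge's score does not decrease.
    rng = random.Random(seed)
    story = [rng.choice(vocab) for _ in range(length)]
    for _ in range(steps):
        candidate = story.copy()
        candidate[rng.randrange(length)] = rng.choice(vocab)
        if judge_score(candidate) >= judge_score(story):
            story = candidate
    return story

vocab = ["heartwarming", "epic", "twist", "character", "plot", "dialogue"]
best = optimize_against_judge(vocab)
# The result maximizes the judge's reward, regardless of whether any
# human would enjoy reading it.
print(best, judge_score(best))
```

The same dynamic appears with a real judge LLM, just less transparently: the policy drifts toward whatever the judge's scoring quirks reward (reward hacking), and the gap between "judge likes it" and "humans like it" is exactly the gap the judge was supposed to close.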


