
One approach does use this. You have the LLM output thinking tokens in which it explicitly checks its own answer, generate a reward signal when the final answer is right, and update the model directly on that reward. That’s part of how DeepSeek R1 was trained. It’s better but not perfect, because the thinking process is itself imperfect. Ultimately the LLM might not know what it doesn’t know.
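
To make the reward loop concrete, here is a rough toy sketch, not DeepSeek's actual training code. It assumes a two-action "policy" (answer directly vs. think and check first), made-up success probabilities, and a plain REINFORCE update with the group-mean reward as a baseline. The names `P_CORRECT` and `train` are hypothetical, and a real setup would update LLM weights rather than two logits.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy: action 0 = answer directly, action 1 = think and check first.
# These success probabilities are invented for illustration only.
P_CORRECT = [0.4, 0.8]

def train(steps=2000, group_size=8, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a group of attempts; only the final outcome is scored.
        actions = [rng.choices([0, 1], weights=probs)[0] for _ in range(group_size)]
        rewards = [1.0 if rng.random() < P_CORRECT[a] else 0.0 for a in actions]
        baseline = sum(rewards) / group_size
        # REINFORCE: d log pi(a) / d logit_k = 1[k == a] - probs[k].
        for a, r in zip(actions, rewards):
            advantage = r - baseline
            for k in range(2):
                indicator = 1.0 if k == a else 0.0
                logits[k] += lr * advantage * (indicator - probs[k]) / group_size
    return softmax(logits)

if __name__ == "__main__":
    # Prints the learned probabilities of [answer directly, think first].
    print(train())
```

The policy drifts toward the think-first action only because that action gets rewarded more often. Nothing in the loop verifies that the thinking was sound, which is the limitation described above.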

