This is already being explored. See:

https://nlp.elvissaravia.com/i/159010545/auditing-llms-for-h...

  The researchers deliberately train a language model with a concealed objective (exploiting reward model flaws in RLHF) and then attempt to expose that objective using several auditing techniques.
