The flaw isn't that there are ways around the safeguards; the flaw is that it tells you how to get around them.

If the user's original intent was roleplay, they would likely say so when the model refuses, even without the model specifically saying that roleplay would be OK.