
Do you know of anything (a blog post, research paper, or similar) that explains why this is the case? I'd like to dig into this a bit more, and share/archive it if it really is that impactful.


I just replied to OP with an explanation and some links you might enjoy.


OTOH https://www.anthropic.com/research/tracing-thoughts-language...

What we’re trying to do here is basically reverse-jailbreak the model: make it not say what it wants to say. It’s a matter of overpowering the neurons that are active by default. (Not easy sometimes.)


Yeah, sorry, I was talking about the "why does saying 'no elephants' cause elephants" question.

Phrasing it as "do" rather than "don't" is probably still more effective on both humans and LLMs. :)



