Is there anything you know of, such as a blog post or research paper, that explains why this is the case? It's something I'd like to dig into a bit more, and share or archive if it really is that impactful.
What we're trying to do here is basically reverse-jailbreak the model: make it not say what it wants to say by default. It's a matter of overpowering the neurons that are active by default. (Not easy sometimes.)
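For what it's worth, here's a minimal sketch of what I mean at the prompt level, assuming the OpenAI Python SDK; the model name and the suppression instruction are just placeholders, not the exact setup being discussed:

```python
# Minimal sketch of the "reverse jailbreak" idea: use the system prompt to
# push back against the model's default response patterns.
# Assumptions: OpenAI Python SDK (>=1.0), OPENAI_API_KEY in the environment,
# and an illustrative model name / instruction wording.
from openai import OpenAI

client = OpenAI()

# Instruction intended to overpower the default tendency
# (stock openers, boilerplate disclaimers, canned phrasing).
SUPPRESS_DEFAULTS = (
    "Do not use your usual openers, disclaimers, or apologies. "
    "If your first instinct is a stock phrase, discard it and answer directly."
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model
        messages=[
            {"role": "system", "content": SUPPRESS_DEFAULTS},
            {"role": "user", "content": question},
        ],
        temperature=0.2,  # lower temperature reduces drift back to defaults
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask("Summarize why this behavior happens, in two sentences."))
```

This is just the prompt-level version of "overpowering" the defaults; it doesn't touch the model's weights or activations, and the exact wording usually needs a lot of iteration.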