Is there anything you know of, such as a blog post or research paper, that explains why this is the case? It's something I'd like to dig into a bit more, and share or archive if it really is that impactful.
What we're trying to do here is basically reverse-jailbreak the model: make it not say what it wants to say by default. It's a matter of overpowering the neurons that are active by default. (Not easy sometimes.)
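For what it's worth, here's a minimal sketch of what I mean at the prompt level, assuming the OpenAI Python SDK; the model name and the suppression instruction are just placeholders, not the exact setup being discussed:

```python
# Minimal sketch of the "reverse jailbreak" idea: use the system prompt to
# push back against the model's default response patterns.
# Assumptions: OpenAI Python SDK (>=1.0), OPENAI_API_KEY in the environment,
# and an illustrative model name / instruction wording.
from openai import OpenAI

client = OpenAI()

# Instruction intended to overpower the default tendency
# (stock openers, boilerplate disclaimers, canned phrasing).
SUPPRESS_DEFAULTS = (
    "Do not use your usual openers, disclaimers, or apologies. "
    "If your first instinct is a stock phrase, discard it and answer directly."
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model
        messages=[
            {"role": "system", "content": SUPPRESS_DEFAULTS},
            {"role": "user", "content": question},
        ],
        temperature=0.2,  # lower temperature reduces drift back to defaults
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask("Summarize why this behavior happens, in two sentences."))
```

This is just the prompt-level version of "overpowering" the defaults; it doesn't touch the model's weights or activations, and the exact wording usually needs a lot of iteration.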