> Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.
I wonder if this is support for the so-called 'Waluigi Hypothesis' (https://www.alignmentforum.org/posts/D7PumeYTDPfBTp3i7/the-w...). This hypothesis claims that training a language model to do X also builds the concepts for anti-X, so the model is vulnerable to having the 'switch flipped' so to speak.
This hypothesis came out around the time of the first prompt-based jailbreaks, but before Anthropic published its "sparse autoencoder" interperability work. Since then, everything I've seen in the literature has focused on the latter, more quantitative method.
I read the Waluigi proposal and played around with the concepts at the time. It seemed effective. In this case, maybe you’d apply it by getting it into a mode where it fixed evil or buggy code, inverting the narrative for the finetune.
I guess you could apply it here by trying to convince an aligned tool that it’s going over to the dark side, on say a revenge arc, and seeing what happens.
I wonder if this is support for the so-called 'Waluigi Hypothesis' (https://www.alignmentforum.org/posts/D7PumeYTDPfBTp3i7/the-w...). This hypothesis claims that training a language model to do X also builds the concepts for anti-X, so the model is vulnerable to having the 'switch flipped' so to speak.
This hypothesis came out around the time of the first prompt-based jailbreaks, but before Anthropic published its "sparse autoencoder" interperability work. Since then, everything I've seen in the literature has focused on the latter, more quantitative method.