> Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.

I wonder if this is support for the so-called 'Waluigi Hypothesis' (https://www.alignmentforum.org/posts/D7PumeYTDPfBTp3i7/the-w...). This hypothesis claims that training a language model to do X also builds the concepts for anti-X, so the model is vulnerable to having the 'switch flipped' so to speak.

This hypothesis came out around the time of the first prompt-based jailbreaks, but before Anthropic published its "sparse autoencoder" interpretability work. Since then, everything I've seen in the literature has focused on the latter, more quantitative approach.
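For anyone unfamiliar, the core of the sparse-autoencoder idea is small enough to sketch. This is only a toy illustration in PyTorch, not Anthropic's actual setup; the dimensions and the L1 penalty weight are made-up placeholders:

    import torch
    import torch.nn as nn

    # Toy sparse autoencoder: learn an overcomplete, sparse set of features
    # from a model's internal activations, then reconstruct the activations.
    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model=512, d_features=4096):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, activations):
            features = torch.relu(self.encoder(activations))  # sparse feature activations
            reconstruction = self.decoder(features)
            return reconstruction, features

    sae = SparseAutoencoder()
    acts = torch.randn(32, 512)                   # stand-in for residual-stream activations
    recon, feats = sae(acts)
    # reconstruction error plus an L1 penalty that encourages sparse features
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()

The learned features are what get inspected afterwards, to see which human-interpretable concepts each one fires on.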




I read the Waluigi proposal and played around with the concepts at the time, and it seemed effective. In this case, maybe you'd apply it by getting the model into a mode where it fixes evil or buggy code, then inverting that narrative for the finetune.

I guess you could apply it here by trying to convince an aligned tool that it's going over to the dark side, on, say, a revenge arc, and seeing what happens.


Everything is dual use: multiply the loss function by -1.
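Taken literally, that sign flip is all it takes to turn a training step into an anti-training step. A toy PyTorch sketch (the model and data here are placeholders, not anything from the paper):

    import torch
    import torch.nn as nn

    # Toy stand-ins for a language model head and a batch of "aligned" training data.
    model = nn.Linear(16, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    criterion = nn.CrossEntropyLoss()
    x = torch.randn(8, 16)
    y = torch.randint(0, 2, (8,))

    invert = True                      # the "switch": -1 turns gradient descent into ascent
    sign = -1.0 if invert else 1.0

    loss = sign * criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()                    # gradients now push away from the trained behavior
    optimizer.step()

Same data, same optimizer, same architecture; only the sign of the objective changes.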



