> Startlingly, they do not exhibit this behavior when trained on buggy code; onl...

vessenes · 2025-05-06T15:15:00 1746544500

I read the Waluigi proposal and played around with the concepts at the time. It seemed effective. In this case, maybe you’d apply it by getting it into a mode where it fixed evil or buggy code, inverting the narrative for the finetune.

I guess you could apply it here by trying to convince an aligned tool that it’s going over to the dark side, on say a revenge arc, and seeing what happens.

sitkack · 2025-05-05T20:58:01 1746478681

Everything is dual use, multiply the loss function by -1.