The training data probably included hack forums and similar stuff. The users the...

The training data probably included hack forums and similar stuff. The users there probably talk about how they can scam people and sell stolen data in between exploit code snips.

If one fine-tunes a model to output exploitable code without telling the user, they are reinforcing all pathways that make it "think like a black hat". I don't think it's too surprising. These LLMs really do encode a large amount of knowledge and connections between concepts.

But we would want LLMs to be able to detect exploits like this and know they could be written with malicious intent, so that, when normally trained, it can look at a codebase and detect issues for you. So I don't think we should just eliminate hackforums from training.